In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from pydataset import data

# Pandas DataFrame

## Lesson Goals

By the end of this lesson and exercises, you will understand...

- 

## Lesson Extra Resources

- [DataFrames Review Notebook](https://ds-review-hub.github.io/pandas_dataframes_review)

- [Pandas Overview Canva](https://ds-review-hub.github.io/meet_the_pandas_series.pdf)

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Create a Pandas DataFrame:

- There are multiple ways to create pandas DataFrame objects, and I will demonstrate some of these below. 


- If you want more, see the official doc on the pandas DataFrame [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

**The pandas DataFrame constructor function defaults.**

```python
pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```

#### From a List of Dictionaries

**<font color=purple>Does this look familiar?</font>**

In [2]:
shopping_cart = {
    "tax": .08,
    "items": [
        {
            "title": "orange juice",
            "price": 3.99,
            "quantity": 1
        },
        {
            "title": "rice",
            "price": 1.99,
            "quantity": 3
        },
        {
            "title": "beans",
            "price": 0.99,
            "quantity": 3
        },
        {
            "title": "chili sauce",
            "price": 2.99,
            "quantity": 1
        },
        {
            "title": "chocolate",
            "price": 0.75,
            "quantity": 9
        }
    ]
}

- `shopping_cart` is a dictionary with two keys, `tax` and `items`, but the value for items happens to be a list of dictionaries, so I can create a pandas DataFrame from the `items` list!

In [3]:
items_list = shopping_cart['items']
items_list

[{'title': 'orange juice', 'price': 3.99, 'quantity': 1},
 {'title': 'rice', 'price': 1.99, 'quantity': 3},
 {'title': 'beans', 'price': 0.99, 'quantity': 3},
 {'title': 'chili sauce', 'price': 2.99, 'quantity': 1},
 {'title': 'chocolate', 'price': 0.75, 'quantity': 9}]

- Now, I can pass my `items_list` variable as the data argument to `pd.DataFrame`.

In [4]:
cart_items = pd.DataFrame(items_list)
cart_items

Unnamed: 0,title,price,quantity
0,orange juice,3.99,1
1,rice,1.99,3
2,beans,0.99,3
3,chili sauce,2.99,1
4,chocolate,0.75,9


In [5]:
type(cart_items)

pandas.core.frame.DataFrame
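
- The dictionaries in my list do not all need to have the same keys. Here is a minimal sketch, using a made-up `records` list rather than the real `shopping_cart`, of what I would expect in that case: the columns are the union of the keys, and a missing value shows up as `NaN`.

```python
import pandas as pd

# Hypothetical records; the second dictionary has no 'quantity' key.
records = [
    {'title': 'rice', 'price': 1.99, 'quantity': 3},
    {'title': 'beans', 'price': 0.99},
]

partial_df = pd.DataFrame(records)

# Columns are the union of all keys; the missing quantity becomes NaN,
# which also turns the quantity column into floats.
print(partial_df)
print(partial_df.dtypes)
```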

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

#### From a Dictionary

- The dictionary keys will become column labels and the list values will become column values.

In [6]:
fam = {'name':['Milla', 'Steve', 'Faith', 'Freya'], 
       'signs':['Virgo', 'Gemini', 'Aquarius', 'Aquarius'],
       'age': [15, 44, 34, 1]} 
fam

{'name': ['Milla', 'Steve', 'Faith', 'Freya'],
 'signs': ['Virgo', 'Gemini', 'Aquarius', 'Aquarius'],
 'age': [15, 44, 34, 1]}

In [7]:
type(fam)

dict

In [8]:
fam_df = pd.DataFrame(fam)
fam_df

Unnamed: 0,name,signs,age
0,Milla,Virgo,15
1,Steve,Gemini,44
2,Faith,Aquarius,34
3,Freya,Aquarius,1


- I have the option to pass index labels when creating a DataFrame, too.

In [9]:
fam_df = pd.DataFrame(fam, index =['kane_1', 'kane_2', 'kane_3', 'kane_4'])
fam_df

Unnamed: 0,name,signs,age
kane_1,Milla,Virgo,15
kane_2,Steve,Gemini,44
kane_3,Faith,Aquarius,34
kane_4,Freya,Aquarius,1


In [10]:
fam_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, kane_1 to kane_4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    4 non-null      object
 1   signs   4 non-null      object
 2   age     4 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes


<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

#### From SQL

Create your DataFrame using a SQL query and connection url to access a database.

- I can use the pandas `pd.read_sql()` function to read the results of a SQL query into a pandas DataFrame.

```python
df = pd.read_sql(sql_query, connection_url)
```

- First, I need to install the mysqlclient and pymysql driver packages as directed in the curriculum [here](https://ds.codeup.com/python/advanced-dataframes/#from-sql).

<p style="background:black">
<code style="background:black;color:white">python -m pip install mysqlclient pymysql
</code>
</p>

- **Next, I will need to import host, password, and user from my env file.** *This keeps my private login information private because I have listed my env.py file in my .gitignore file and pushed up my .gitignore file BEFORE env.py could ever be committed; env.py itself should never be pushed.*

In [11]:
from env import host, password, user

- Now, I can save the database name to a variable and use my imported variables, host, password, and user, to create my connection url string for use in the pandas `read_sql()` function.

In [12]:
db = 'employees'
connection_url = f'mysql+pymysql://{user}:{password}@{host}/{db}'
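
- If I build connection urls for more than one database, I could wrap the f-string in a small function. This is only a sketch: `get_connection_url` is a hypothetical helper name, the credentials below are placeholders, and escaping the password with `quote_plus` is my assumption about what is needed when it contains characters such as `@`.

```python
from urllib.parse import quote_plus


def get_connection_url(db, user, password, host):
    """Build a MySQL connection url string for use with pd.read_sql()."""
    # quote_plus escapes characters such as '@' or '/' in the password
    # so they are not mistaken for parts of the url.
    return f'mysql+pymysql://{user}:{quote_plus(password)}@{host}/{db}'


# Placeholder values; in practice these come from env.py.
example_url = get_connection_url('employees', 'some_user', 'p@ss', 'example.com')
print(example_url)
```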

- Next, I can save my valid SQL query to a variable to use in the `read_sql()` function, as well.

In [13]:
sql_query = 'SELECT * FROM employees LIMIT 100'

- Finally, I use my variables with the `read_sql()` function and assign my resulting DataFrame to a variable.

In [14]:
employees_df = pd.read_sql(sql_query, connection_url)
employees_df.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12


In [18]:
employees_df.shape

(100, 6)

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

#### Write to a CSV file

- When I import a large dataset using a SQL query, it only takes 3 or 4 kernel restarts, each followed by minutes of waiting for my query to run, before I decide to write my new DataFrame to a CSV file that I can read back almost instantly.

- I only need to run this code once to create a new CSV file in my current directory, and then I can read in my data as shown in the next section. I can comment out the code I was using to read in my data and write to a CSV file after I have my CSV file, or I can re-run my code to pull fresh data and write over my CSV file.

```python
df.to_csv('file_name.csv')
```

In [19]:
# Write my DataFrame employees_df to a csv file in the current directory.

employees_df.to_csv('employees_df.csv')
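
- Instead of commenting code in and out, I could let a function decide whether to read the CSV file or run the query. This is a sketch under assumptions: `get_cached_df` and `fake_query` are hypothetical names, and `fake_query` stands in for the slow `pd.read_sql()` call so the example runs without a database.

```python
import os
import tempfile

import pandas as pd


def get_cached_df(csv_path, fetch):
    """Read csv_path if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    df = fetch()
    df.to_csv(csv_path)
    return df


calls = []


def fake_query():
    # Stand-in for a slow pd.read_sql(sql_query, connection_url) call.
    calls.append(1)
    return pd.DataFrame({'emp_no': [10001, 10002]})


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'employees_df.csv')
    first = get_cached_df(path, fake_query)   # runs the "query" and writes the CSV
    second = get_cached_df(path, fake_query)  # reads the CSV; no second "query"

print(len(calls))
```

Deleting the CSV file is then all it takes to pull fresh data on the next run.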

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

#### Read from a CSV file.

- I can use the pandas `pd.read_csv()` function to read the data from a CSV file into a pandas DataFrame.

```python
# If my csv file is in the same directory as my notebook, I can do this.
df = pd.read_csv('file_name.csv')
```

```python
# If my csv file is not in the same directory as my notebook, I have to include the file path.
df = pd.read_csv('file_path/file_name.csv')
```

In [22]:
# Create my DataFrame reading from my own CSV file; way faster now.

employees_df = pd.read_csv('employees_df.csv', index_col=0)
employees_df.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12


In [23]:
employees_df.shape

(100, 6)
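
- Why `index_col=0`? By default `.to_csv()` writes the index as an unnamed first column, so reading the file back without `index_col=0` gives me an extra column. Here is a small sketch that uses an in-memory buffer and a made-up DataFrame (`small_df` is only for illustration) instead of a real file.

```python
import io

import pandas as pd

small_df = pd.DataFrame({'emp_no': [10001, 10002]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
small_df.to_csv(buffer)

# Without index_col, the saved index comes back as a regular column.
buffer.seek(0)
with_extra_col = pd.read_csv(buffer)

# With index_col=0, the first column is used as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_extra_col.columns))
print(list(restored.columns))
```

Another option is to skip writing the index in the first place with `df.to_csv('file_name.csv', index=False)`.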

- I'm reading from another CSV file, interesting_data.csv, that I have in a subdirectory named data, so that I have another dataset to play with in here.

In [48]:
student_df = pd.read_csv('data/interesting_data.csv')
student_df.head()

Unnamed: 0,Country,Region,DataYear,ClassGrade,Gender,Ageyears,Handed,Height_cm,Footlength_cm,Armspan_cm,...,Watching_TV_Hours,Paid_Work_Hours,Work_At_Home_Hours,Schoolwork_Pressure,Planned_Education_Level,Favorite_Music,Superpower,Preferred_Status,Role_Model_Type,Charity_Donation
0,USA,TX,2018,12,Female,17.0,Right-Handed,152,22.5,24,...,0.0,35.0,1.0,Very little,Graduate degree,Rap/Hip hop,Invisibility,Happy,Business person,Environment
1,USA,TN,2018,12,Male,18.0,,,,,...,,,,,,,,,,
2,USA,CO,2018,12,Female,17.0,Ambidextrous,164cm,22cm,160cm,...,6.0,2.0,2.0,Some,Graduate degree,Rap/Hip hop,Invisibility,Happy,Friend,International aid
3,USA,ID,2018,12,Male,23.0,Right-Handed,171.5,21.6,162.5,...,2.0,0.0,8.0,A lot,Graduate degree,Gospel,Freeze time,Happy,Relative,Religious
4,USA,NC,2018,12,Female,17.0,Right-Handed,136,26,54,...,0.0,2.0,2.0,A lot,Undergraduate degree,Rap/Hip hop,Freeze time,Happy,Relative,International aid


<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Indexing

- Like the pandas Series object, the pandas DataFrame object supports both position- and label-based indexing using the indexing operator `[]`. 


- I will demonstrate concrete examples of indexing using the indexing operator `[]` alone and with the `.loc` and `.iloc` attributes below.

#### `[]`

- I can pass a list of columns from a DataFrame to the indexing operator (aka bracket notation) to return a subset of my original DataFrame.

In [49]:
# Peek at columns in employees_df

employees_df.head(1)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26


In [50]:
student_df.shape

(100, 60)

In [51]:
# Create a df that is a subset of the original employees_df.

employees_subset = employees_df[['last_name', 'gender']]
employees_subset.head()

Unnamed: 0,last_name,gender
0,Facello,M
1,Simmel,F
2,Bamford,M
3,Koblick,M
4,Maliniak,M


In [52]:
# I can create a subset using a boolean Series; I will use this to subset my original df.

female_bool_series = employees_df.gender == 'F'
female_bool_series.head(10)

0    False
1     True
2    False
3    False
4    False
5     True
6     True
7    False
8     True
9     True
Name: gender, dtype: bool

In [53]:
# I use my boolean Series to select only observations where gender == 'F' in my original employees_df.

female_subset = employees_df[female_bool_series]
female_subset.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
5,10006,1953-04-20,Anneke,Preusig,F,1989-06-02
6,10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
8,10009,1952-04-19,Sumant,Peac,F,1985-02-18
9,10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24


In [54]:
# There are 37 observations that meet my condition of gender == 'F'.

female_subset.shape

(37, 6)
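
- I can also combine conditions before passing them to `[]`. The sketch below uses a tiny made-up DataFrame (`emp` and its `hire_year` column are assumptions, not the real employees data). Each condition needs its own parentheses, and I use `&`, `|`, and `~` rather than `and`, `or`, and `not`.

```python
import pandas as pd

# Hypothetical data for illustration only.
emp = pd.DataFrame({
    'last_name': ['Facello', 'Simmel', 'Preusig', 'Koblick'],
    'gender': ['M', 'F', 'F', 'M'],
    'hire_year': [1986, 1985, 1989, 1986],
})

# AND: both conditions must be True for a row to be kept.
female_after_1985 = (emp.gender == 'F') & (emp.hire_year > 1985)

# OR: at least one condition must be True.
male_or_early = (emp.gender == 'M') | (emp.hire_year < 1986)

subset_and = emp[female_after_1985]
subset_or = emp[male_or_early]

# NOT: flip every boolean value in the Series.
not_female = emp[~(emp.gender == 'F')]

print(subset_and)
```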

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

#### `.loc` 

- I can use the `.loc` attribute to select specific rows AND columns by index labels. My index label can be a number, but it can also be a string label. This offers a lot of flexibility! **Slicing with `.loc` includes both endpoints and uses index labels, not integer positions.** 

```python
df.loc[row_indexer, column_indexer]
```

In [55]:
# I want the rows from start through 5 (inclusive) and columns from last_name through hire_date (inclusive).

loc_subset = employees_df.loc[:5, 'last_name': 'hire_date']
loc_subset

Unnamed: 0,last_name,gender,hire_date
0,Facello,M,1986-06-26
1,Simmel,F,1985-11-21
2,Bamford,M,1986-08-28
3,Koblick,M,1986-12-01
4,Maliniak,M,1989-09-12
5,Preusig,F,1989-06-02


In [56]:
loc_subset.shape

(6, 3)

___

>**Example of Boolean Indexing Using `.loc`**

- Here I am passing a boolean Series as a selector to the `.loc` attribute called on my original DataFrame from above. As you can see below, where the boolean Series has a `True` value, the observation from the original DataFrame is returned.

In [57]:
# Create a bool Series to select a subset of female employees using `.loc`

female_bool = employees_df.gender == 'F'
female_bool.head()

0    False
1     True
2    False
3    False
4    False
Name: gender, dtype: bool

In [58]:
# Create subset of female employees using my bool Series

employees_df.loc[female_bool]

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
5,10006,1953-04-20,Anneke,Preusig,F,1989-06-02
6,10007,1957-05-23,Tzvetan,Zielinski,F,1989-02-10
8,10009,1952-04-19,Sumant,Peac,F,1985-02-18
9,10010,1963-06-01,Duangkaew,Piveteau,F,1989-08-24
10,10011,1953-11-07,Mary,Sluis,F,1990-01-22
16,10017,1958-07-06,Cristinel,Bouloucos,F,1993-08-03
17,10018,1954-06-19,Kazuhide,Peha,F,1987-04-03
22,10023,1953-09-29,Bojan,Montemayor,F,1989-12-17
23,10024,1958-09-05,Suzette,Pettey,F,1997-05-19


In [59]:
# I can select only employees who are female using my bool Series to select rows AND select specific columns.

employees_df.loc[female_bool, ['last_name', 'hire_date' ]]

Unnamed: 0,last_name,hire_date
1,Simmel,1985-11-21
5,Preusig,1989-06-02
6,Zielinski,1989-02-10
8,Peac,1985-02-18
9,Piveteau,1989-08-24
10,Sluis,1990-01-22
16,Bouloucos,1993-08-03
17,Peha,1987-04-03
22,Montemayor,1989-12-17
23,Pettey,1997-05-19


___

#### `.iloc`

- I can use the `.iloc` attribute to select specific rows and columns by index position. `.iloc` does not accept a boolean Series as a selector like `.loc` does. **It takes in integers representing index position, and the end of a slice is NOT included.**

```python
df.iloc[row_indexer, column_indexer]
```

In [60]:
student_df.iloc[90:, :3 ]

Unnamed: 0,Country,Region,DataYear
90,USA,MA,2018
91,USA,MA,2018
92,USA,CA,2018
93,USA,PA,2018
94,USA,NJ,2018
95,USA,ID,2018
96,USA,ID,2018
97,USA,PA,2018
98,USA,PA,2018
99,USA,AL,2018
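
- To see the inclusive and exclusive slicing side by side, here is a short sketch on a made-up DataFrame (`letters` and `named` are names I chose for illustration).

```python
import pandas as pd

letters = pd.DataFrame({'letter': ['a', 'b', 'c', 'd']}, index=[0, 1, 2, 3])

# .loc slices by label and includes the end label: labels 0, 1, and 2.
by_label = letters.loc[:2]

# .iloc slices by position and excludes the end position: positions 0 and 1.
by_position = letters.iloc[:2]

# The same idea with string labels.
named = letters.set_index(pd.Index(['w', 'x', 'y', 'z']))
label_slice = named.loc['x':'y']   # includes both 'x' and 'y'
position_slice = named.iloc[1:2]   # only the row at position 1

print(len(by_label), len(by_position))
```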


<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Manipulating the Index

#### `.set_index()`

- This method allows me to set my DataFrame index using an existing column. This will not change my original DataFrame because the default is `inplace=False`.

```python
df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```

In [None]:
# My original fam_df.

fam_df.head(1)

In [None]:
# I can set the `name` column to be my new index.

fam_df.set_index('name')

___

In [None]:
# Create a new index using a pandas Index object of the same length.

fam_df.set_index(pd.Index([1, 2, 3, 4]))

___

#### `.reset_index()`

- This method will come in handy a lot as we move into methodologies, but for now I'll at least introduce it. This method does not change your original DataFrame unless you pass `inplace=True`; otherwise, just reassign or assign the new copy.

```python
df.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
```

In [None]:
# This resets the fam_df index to the default and adds my original index as a column.

fam_df.reset_index()

___

In [None]:
# I reset my index and rename original index column if I want to keep it.

fam_df.reset_index().rename(columns={'index': 'id'})

___

In [None]:
# If I don't want the original index as a column, I can set `drop=True`.

fam_df.reset_index(drop=True)
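
- Here is a self-contained sketch of the index methods above, using a two-row stand-in for `fam_df` (the name `fam_small` is mine). It checks that the original DataFrame is left unchanged unless I reassign.

```python
import pandas as pd

fam_small = pd.DataFrame(
    {'name': ['Milla', 'Steve'], 'age': [15, 44]},
    index=['kane_1', 'kane_2'],
)

# set_index returns a new DataFrame; the column moves into the index.
by_name = fam_small.set_index('name')

# reset_index returns a new DataFrame with a default index; the old
# index becomes a column named 'index', which I rename to 'id'.
with_id = fam_small.reset_index().rename(columns={'index': 'id'})

# drop=True throws the old index away instead of keeping it as a column.
dropped = fam_small.reset_index(drop=True)

print(by_name)
print(with_id)
```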

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Aggregating

#### `.groupby()`

- This powerful method allows you to group your data by one or more columns and apply any type of function to each group returning the calculations in a Series or DataFrame. 


- I can use a single grouping column, a single aggregating column, and a single aggregating function.

```python
df.groupby('grouping_column').agg_column.agg_func()
```

In [None]:
student_df.head(1)

In [None]:
# Use a `groupby()` to calculate the average age by Gender; return a df by using [['double_brackets']].

student_df.groupby('Gender')[['Ageyears']].mean()

- I can use a list of grouping columns and a single aggregating function.

```python
df.groupby(['list', 'of', 'grouping', 'columns']).agg_func()
```

In [None]:
# Perform a multi-level groupby aggregation.

student_df.groupby(['Gender', 'Handed']).size()
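
- The difference between single and double brackets after a `.groupby()` is easier to see on a small example. The `students` DataFrame below is made up for illustration; it only borrows two column names from `student_df`.

```python
import pandas as pd

# Hypothetical data for illustration only.
students = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Ageyears': [17.0, 18.0, 19.0, 20.0],
})

# Attribute access (or single brackets) returns a Series.
mean_age_series = students.groupby('Gender').Ageyears.mean()

# Double brackets return a DataFrame with one column.
mean_age_df = students.groupby('Gender')[['Ageyears']].mean()

# .size() counts the rows in each group.
group_sizes = students.groupby('Gender').size()

print(type(mean_age_series).__name__, type(mean_age_df).__name__)
```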

___

#### `.agg()`

Chaining the `.agg()` method with a `.groupby()` provides more flexibility when I'm aggregating. 

**I can use a list of grouping columns and perform more than one function on my aggregating column.**

```python
df.groupby(['list', 'of', 'grouping', 'columns']).agg_column.agg(['func', 'other_func'])
```

In [None]:
student_df.groupby(['Gender', 'Handed']).Ageyears.agg(['mean', 'median'])

___

**I can use a list of grouping columns and pass a dictionary to the `.agg()` method to use different aggregating functions on different columns.**

In [None]:
student_df.groupby(['Gender', 'Handed']).agg({'Ageyears': 'mean', 'Text_Messages_Sent_Yesterday': 'median'})
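
- A related option is named aggregation, where I pass keyword arguments of the form `new_name=('column', 'func')` to `.agg()` so that my result columns get descriptive names. Here is a minimal sketch on made-up data (the values and the names `mean_age` and `median_texts` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data for illustration only.
students = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Ageyears': [17.0, 18.0, 19.0, 20.0],
    'Text_Messages_Sent_Yesterday': [10, 50, 30, 70],
})

# Each keyword becomes a column in the result.
summary = students.groupby('Gender').agg(
    mean_age=('Ageyears', 'mean'),
    median_texts=('Text_Messages_Sent_Yesterday', 'median'),
)

print(summary)
```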

<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Joining

#### `.concat()`

- This function takes in a list or dictionary of Series or DataFrame objects and joins them along a particular axis, row-wise `axis=0` or column-wise `axis=1`. 

```python
# For example, concat with a list of two DataFrames
pd.concat([df1, df2], axis=0)
```


- When your list contains at least one DataFrame, a DataFrame is returned.


- When concatenating only Series objects row-wise, `axis=0`, a Series is returned.


- When concatenating Series or DataFrames column-wise, `axis=1`, a DataFrame is returned.

```python
# Default is set to row-wise concatenation using an outer join.
pd.concat(objs, axis=0, join='outer')
```
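
The return types described above can be checked with a quick sketch on two made-up Series (`s1` and `s2` are names chosen for illustration). It also shows that row-wise concatenation keeps the original index labels unless I pass `ignore_index=True`.

```python
import pandas as pd

s1 = pd.Series([1, 2], name='a')
s2 = pd.Series([3, 4], name='b')

# Row-wise with only Series: a Series comes back, and the index labels repeat.
stacked = pd.concat([s1, s2])

# Column-wise: a DataFrame comes back, with the Series names as column labels.
side_by_side = pd.concat([s1, s2], axis=1)

# ignore_index=True replaces the repeated labels with a fresh 0..n-1 index.
renumbered = pd.concat([s1, s2], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```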

##### Row-wise Concat

>Combine two DataFrame objects with identical columns:

In [61]:
fam_df

Unnamed: 0,name,signs,age
kane_1,Milla,Virgo,15
kane_2,Steve,Gemini,44
kane_3,Faith,Aquarius,34
kane_4,Freya,Aquarius,1


In [62]:
# Create a list of dictionaries to be new rows in a DataFrame concatenated to my original fam_df.

new = [{'name': 'Penny', 'signs': 'Libra', 'age': 0},
       {'name': 'Betty', 'signs': 'Libra', 'age': 1},
       {'name': 'Pris', 'signs': 'Scorpio', 'age': 2}]

In [63]:
# Create new_df using my list of dictionaries above.

new_df = pd.DataFrame(new, index=['kane_5', 'kane_6', 'kane_7'])
new_df

Unnamed: 0,name,signs,age
kane_5,Penny,Libra,0
kane_6,Betty,Libra,1
kane_7,Pris,Scorpio,2


In [64]:
# Concatenate my new_df to my original fam_df; the default, `axis=0`, will stack these dfs.

fam_df = pd.concat([fam_df, new_df])
fam_df

Unnamed: 0,name,signs,age
kane_1,Milla,Virgo,15
kane_2,Steve,Gemini,44
kane_3,Faith,Aquarius,34
kane_4,Freya,Aquarius,1
kane_5,Penny,Libra,0
kane_6,Betty,Libra,1
kane_7,Pris,Scorpio,2


##### Column-wise Concat

>Combine two DataFrame objects with identical index labels:

In [65]:
new_cols_df = pd.DataFrame({'eyes': ['brown', 'brown', 'blue', 'brown', 'amber', 'brown', 'hazel'],
                           'hair': ['brown', 'black', 'blonde', 'red', 'red', 'black', 'red']},
                           index=['kane_1', 'kane_2', 'kane_3','kane_4', 'kane_5', 'kane_6', 'kane_7']) 

new_cols_df

Unnamed: 0,eyes,hair
kane_1,brown,brown
kane_2,brown,black
kane_3,blue,blonde
kane_4,brown,red
kane_5,amber,red
kane_6,brown,black
kane_7,hazel,red


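In [67]:
# fam_df is reassigned here, so running this cell more than once appends the
# new columns again; that is likely why eyes and hair appear twice below.

fam_df = pd.concat([fam_df, new_cols_df], axis=1)
fam_df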

Unnamed: 0,name,signs,age,eyes,hair,eyes.1,hair.1
kane_1,Milla,Virgo,15,brown,brown,brown,brown
kane_2,Steve,Gemini,44,brown,black,brown,black
kane_3,Faith,Aquarius,34,blue,blonde,blue,blonde
kane_4,Freya,Aquarius,1,brown,red,brown,red
kane_5,Penny,Libra,0,amber,red,amber,red
kane_6,Betty,Libra,1,brown,black,brown,black
kane_7,Pris,Scorpio,2,hazel,red,hazel,red


___

#### `df.merge()`

- This method is similar to a SQL join. Here's a [cool read](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join) making a comparison between the two, if you're interested.

```python
left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
```

- How does changing the default argument of the `how` parameter change my resulting DataFrame?

##### `how`

- Type of merge to be performed.

`how='left'`: use only keys from left frame, similar to a SQL left outer join; preserve key order.

`how='right'`: use only keys from right frame, similar to a SQL right outer join; preserve key order.

`how='outer'`: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

`how='inner'`: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
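
To see how `how` changes the result, here is a small sketch with two made-up DataFrames (`left` and `right` and their values are assumptions for illustration) that share only some `emp_no` keys. I also pass `indicator=True` to the outer merge, which adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical data: keys 2 and 3 appear in both frames.
left = pd.DataFrame({'emp_no': [1, 2, 3], 'first_name': ['Ann', 'Bo', 'Cy']})
right = pd.DataFrame({'emp_no': [2, 3, 4], 'title': ['Staff', 'Engineer', 'Manager']})

# Count the rows returned by each type of merge.
row_counts = {
    how: len(left.merge(right, on='emp_no', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}

# indicator=True labels each row as left_only, right_only, or both.
outer = left.merge(right, on='emp_no', how='outer', indicator=True)

print(row_counts)
print(outer)
```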

##### `on` 

- Merge on Label or List


- The default argument for the `on` parameter is `None`, so if I am not merging on the indexes of the Series or DataFrame objects I'm joining, this defaults to the intersection of the columns in both objects. Otherwise, I can pass the column(s) name or index level to join on.

In [68]:
# Read in some data from a CSV file to create my `titles` DataFrame.

titles = pd.read_csv('data/titles.csv', index_col=0)
titles.head(2)

Unnamed: 0,emp_no,title,from_date,to_date
0,10001,Senior Engineer,1986-06-26,9999-01-01
1,10002,Staff,1996-08-03,9999-01-01


In [77]:
# Peek at columns in table I want to merge with.

employees_df.head(2)

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21


In [93]:
# Merge employees and titles DataFrames on `emp_no` column.

all_emp_titles = employees_df.merge(titles, on='emp_no')
all_emp_titles.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date,title,from_date,to_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26,Senior Engineer,1986-06-26,9999-01-01
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21,Staff,1996-08-03,9999-01-01
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28,Senior Engineer,1995-12-03,9999-01-01
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01,Senior Engineer,1995-12-01,9999-01-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12,Senior Staff,1996-09-12,9999-01-01


<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Sorting

#### `.sort_values()`

- This is a very useful method of both pandas Series and DataFrames. When I call this method on a DataFrame, I have to pass an argument to the `by` parameter to specify which column(s) to sort my DataFrame by. 


- I can pass a string value (column name) or a list (column_names) as the argument to the `by` parameter, and I can also pass a single boolean value or a list of boolean values to the `ascending` parameter to curate my sort.

```python
# Defaults for sort_values method.

df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last', ignore_index=False)
```

In [96]:
# Here I am sorting by first name in ascending order and last name in descending order.

employees_df.sort_values(by=['first_name', 'last_name'], ascending=[True, False]).head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
35,10036,1959-08-10,Adamantios,Portugali,M,1992-01-03
34,10035,1953-02-08,Alain,Chappelet,M,1988-09-05
58,10059,1953-09-19,Alejandro,McAlpine,F,1991-06-26
38,10039,1959-10-01,Alejandro,Brender,M,1988-01-19
90,10091,1955-10-04,Amabile,Gomatam,M,1992-11-18


___

#### `.sort_index()`

- Just as I can sort by the values in my Series or DataFrame, I can also sort by the index. I also use this method all the time; keep it in mind.

```python
DataFrame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
```

In [97]:
fam_df.head(3)

Unnamed: 0,name,signs,age,eyes,hair,eyes.1,hair.1
kane_1,Milla,Virgo,15,brown,brown,brown,brown
kane_2,Steve,Gemini,44,brown,black,brown,black
kane_3,Faith,Aquarius,34,blue,blonde,blue,blonde


In [98]:
# I can reverse the order of my fam_df by the index if I want.

fam_df.sort_index(ascending=False)

Unnamed: 0,name,signs,age,eyes,hair,eyes.1,hair.1
kane_7,Pris,Scorpio,2,hazel,red,hazel,red
kane_6,Betty,Libra,1,brown,black,brown,black
kane_5,Penny,Libra,0,amber,red,amber,red
kane_4,Freya,Aquarius,1,brown,red,brown,red
kane_3,Faith,Aquarius,34,blue,blonde,blue,blonde
kane_2,Steve,Gemini,44,brown,black,brown,black
kane_1,Milla,Virgo,15,brown,brown,brown,brown


___

In [99]:
# I can also sort my DataFrame columns by setting `axis=1`; they are also an index.

fam_df.sort_index(axis=1)

Unnamed: 0,age,eyes,eyes.1,hair,hair.1,name,signs
kane_1,15,brown,brown,brown,brown,Milla,Virgo
kane_2,44,brown,brown,black,black,Steve,Gemini
kane_3,34,blue,blue,blonde,blonde,Faith,Aquarius
kane_4,1,brown,brown,red,red,Freya,Aquarius
kane_5,0,amber,amber,red,red,Penny,Libra
kane_6,1,brown,brown,black,black,Betty,Libra
kane_7,2,hazel,hazel,red,red,Pris,Scorpio


<hr style="border-top: 10px groove blueviolet; margin-top: 1px; margin-bottom: 1px"></hr>

### Reshaping Data

#### `.T`

- I can access this property of my DataFrame to transpose it, swapping its rows and columns.

In [107]:
fam_df.T

Unnamed: 0,kane_1,kane_2,kane_3,kane_4,kane_5,kane_6,kane_7
name,Milla,Steve,Faith,Freya,Penny,Betty,Pris
signs,Virgo,Gemini,Aquarius,Aquarius,Libra,Libra,Scorpio
age,15,44,34,1,0,1,2
eyes,brown,brown,blue,brown,amber,brown,hazel
hair,brown,black,blonde,red,red,black,red
eyes,brown,brown,blue,brown,amber,brown,hazel
hair,brown,black,blonde,red,red,black,red


___

#### `.pivot_table()`

- This pandas function allows me to create a spreadsheet-style pivot table as a DataFrame. I'll demonstrate a very simple pivot here, but as we deal with more complex data, pivots can do a lot more.

```python
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean')
```

In [110]:
# View original DataFrame.

cart_items

Unnamed: 0,title,price,quantity
0,orange juice,3.99,1
1,rice,1.99,3
2,beans,0.99,3
3,chili sauce,2.99,1
4,chocolate,0.75,9


In [111]:
# Create a new column multiplying price by quantity for use below.

cart_items['total'] = cart_items.price * cart_items.quantity
cart_items

Unnamed: 0,title,price,quantity,total
0,orange juice,3.99,1,3.99
1,rice,1.99,3,5.97
2,beans,0.99,3,2.97
3,chili sauce,2.99,1,2.99
4,chocolate,0.75,9,6.75


In [112]:
# A simple pivot table setting only `values` and `columns`.

pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title')

title,beans,chili sauce,chocolate,orange juice,rice
price,0.99,2.99,0.75,3.99,1.99
quantity,3.0,1.0,9.0,1.0,3.0
total,2.97,2.99,6.75,3.99,5.97


In [115]:
# I can choose different metrics for each of my values/rows.

pd.pivot_table(cart_items, values=['quantity', 'price', 'total'], columns='title', aggfunc={'price':'mean', 'quantity':'sum', 'total':'sum'} )

title,beans,chili sauce,chocolate,orange juice,rice
price,0.99,2.99,0.75,3.99,1.99
quantity,3.0,1.0,9.0,1.0,3.0
total,2.97,2.99,6.75,3.99,5.97
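
- In the cart above each title appears only once, so `'mean'` and `'sum'` return the same numbers. The choice of `aggfunc` starts to matter when a group has more than one row. Here is a sketch on made-up data (`sales` and its values are assumptions for illustration) where rice appears twice.

```python
import pandas as pd

# Hypothetical data: 'rice' was sold at two stores.
sales = pd.DataFrame({
    'title': ['rice', 'rice', 'beans'],
    'store': ['A', 'B', 'A'],
    'quantity': [3, 2, 3],
    'price': [1.99, 2.09, 0.99],
})

# One row per title: quantities are added up, prices are averaged.
pivot = pd.pivot_table(
    sales,
    index='title',
    values=['quantity', 'price'],
    aggfunc={'quantity': 'sum', 'price': 'mean'},
)

print(pivot)
```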


___

#### `.crosstab()`

- This function creates a frequency table: it counts how often each combination of values from two Series occurs.

```python
pd.crosstab(index_series, col_series)
```

In [116]:
# Here I'm reading in a CSV file I created to create my DataFrame.

dept_titles = pd.read_csv('data/dept_titles.csv', index_col=0)
dept_titles.head()

Unnamed: 0,emp_no,title,from_date,to_date,dept_name
0,10011,Staff,1990-01-22,1996-11-09,Customer Service
1,10038,Senior Staff,1996-09-20,9999-01-01,Customer Service
2,10038,Staff,1989-09-20,1996-09-20,Customer Service
3,10049,Senior Staff,2000-05-04,9999-01-01,Customer Service
4,10049,Staff,1992-05-04,2000-05-04,Customer Service


In [117]:
# Create a frequency table of titles by department

all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title)
all_titles_crosstab

title,Assistant Engineer,Engineer,Manager,Senior Engineer,Senior Staff,Staff,Technique Leader
dept_name,,,,,,,
Customer Service,298,2362,4,2027,13925,16150,309
Development,7769,58135,2,49326,1247,1424,7683
Finance,0,0,2,0,12139,13929,0
Human Resources,0,0,2,0,12274,14342,0
Marketing,0,0,2,0,13940,16196,0
Production,6445,49649,4,42205,1270,1478,6557
Quality Management,1831,13852,4,11864,0,0,1795
Research,378,2986,2,2570,11637,13495,393
Sales,0,0,2,0,36191,41808,0


In [122]:
# normalize=True gives each count as a proportion of the grand total.

all_titles_crosstab = pd.crosstab(dept_titles.dept_name, dept_titles.title, normalize=True)
all_titles_crosstab

title,Assistant Engineer,Engineer,Manager,Senior Engineer,Senior Staff,Staff,Technique Leader
dept_name,,,,,,,
Customer Service,0.000608,0.004821,8e-06,0.004138,0.028424,0.032966,0.000631
Development,0.015858,0.118666,4e-06,0.100685,0.002545,0.002907,0.015683
Finance,0.0,0.0,4e-06,0.0,0.024778,0.028432,0.0
Human Resources,0.0,0.0,4e-06,0.0,0.025054,0.029275,0.0
Marketing,0.0,0.0,4e-06,0.0,0.028455,0.03306,0.0
Production,0.013156,0.101345,8e-06,0.08615,0.002592,0.003017,0.013384
Quality Management,0.003737,0.028275,8e-06,0.024217,0.0,0.0,0.003664
Research,0.000772,0.006095,4e-06,0.005246,0.023754,0.027546,0.000802
Sales,0.0,0.0,4e-06,0.0,0.073874,0.085339,0.0


**Want more on reshaping using the pandas `crosstab` function? [This article](https://pbpython.com/pandas-crosstab.html) is a lot of fun!**