<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 2


---

### Lesson Guide
- [Exercise #3](#exercise-3)
- [Split-Apply-Combine](#split-apply-combine)
    - [`.groupby()`](#groupby)
    - [Apply Functions to Groups and Combine](#apply-combine)
- [Exercise #4](#exercise-4)
- [Indexing](#indexing)
    - [Location Indexing With `.loc`](#loc)
    - [Position Indexing With `.iloc`](#iloc)
- [Other Frequently Used Features](#frequent)
    - [Using Map Functions With Replacement Dictionaries](#map-dict)
    - [Encoding Strings as Integers With `.factorize()`](#factorize)
    - [Determining Unique Values](#unique)
    - [Replacing Values With `.replace()`](#replace)
    - [Series String Methods With `.str`](#series-str)
    - [Datetime Conversion and Arithmetic](#datetime)
    - [Setting and Resetting the Index](#set-reset-index)
    - [Sorting by Index](#sort-by-index)
    - [Changing the Data Type of a Column](#change-dtype)
    - [Creating Dummy-Coded Columns](#dummy)
    - [Concatenating DataFrames](#concatenate)
    - [Detecting and Dropping Duplicate Rows](#duplicate-rows)
    - [Writing a DataFrame to a `.csv`](#write-csv)
    - [Pickling a DataFrame](#pickle)
    - [Randomly Sampling a DataFrame](#sample)
- [Infrequently Used Features](#infrequent)
    - [Creating DataFrames From Dictionaries and Lists of Lists](#toy-dataframes)
    - [Performing Cross-Tabulations](#crosstab)
    - [Query-Filtering Syntax](#query)
    - [Calculating Memory Usage](#memory-usage)
    - [Converting Column to Category Type](#category-type)
    - [Creating Columns With `.assign()`](#assign)
    - [Limiting the Number of Rows to Load in a File Read](#limit-rows-read)
    - [Manually Setting the Number of Rows and Columns to Print](#manual-print)

In [75]:
import pandas as pd
import numpy as np

<a id='exercise-3'></a>
## Exercise #3

---

**Using the UFO data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state `VA`.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where `City` is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of `City` and `State`.

In [2]:
ufo_csv = '../../../../resource-datasets/ufo_sightings/ufo.csv'

In [41]:
# A:
ufo = pd.read_csv(ufo_csv)
ufo.head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [4]:
ufo.shape

(80543, 5)

In [11]:
ufo.groupby('Colors Reported').size().sort_values(ascending=False)[:4]

Colors Reported
ORANGE    5216
RED       4809
GREEN     1897
BLUE      1855
dtype: int64

In [16]:
ufo[ufo['State'] == 'VA'].groupby('City').size().sort_values(ascending=False).head()

City
Virginia Beach    110
Richmond           92
Alexandria         48
Roanoke            35
Chesapeake         33
dtype: int64

In [17]:
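# Caution: this matches Arlington in any state. For Virginia only, use ufo[(ufo['City'] == 'Arlington') & (ufo['State'] == 'VA')].
ufo[ufo['City'] == 'Arlington'].head()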

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
201,Arlington,,DISK,TX,7/7/1952 13:00
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
2945,Arlington,,TRIANGLE,TX,6/23/1975 21:00
2946,Arlington,,TRIANGLE,TX,6/23/1975 21:00
2947,Arlington,,TRIANGLE,TX,6/23/1975 21:00


In [20]:
ufo.isnull().sum()

City                  47
Colors Reported    63509
Shape Reported      8402
State                  0
Time                   0
dtype: int64

In [23]:
ufo[ufo['City'].isnull()].head()

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00


In [43]:
ufo_2 = ufo.dropna(how='any')
ufo_2.shape

(15510, 5)

In [42]:
ufo.rename(columns={'Colors Reported': 'Colors_Reported', 'Shape Reported': 'Shape_Reported'}, inplace=True)
ufo.head()

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [47]:
ufo['city_state'] = ufo.apply(lambda x: str(x['City']) + '_' + str(x['State']), axis=1)
ufo.head()

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time,city_state
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00,Ithaca_NY
1,Willingboro,,OTHER,NJ,6/30/1930 20:00,Willingboro_NJ
2,Holyoke,,OVAL,CO,2/15/1931 14:00,Holyoke_CO
3,Abilene,,DISK,KS,6/1/1931 13:00,Abilene_KS
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00,New York Worlds Fair_NY


<a id='split-apply-combine'></a>
## Split-Apply-Combine

---

![](assets/split_apply_combine.png)

<a id='groupby'></a>
### `.groupby()`

**Q.1** Using the `drinks` DataFrame, calculate the mean `beer` servings by continent.

In [49]:
drinks = pd.read_csv('../../../../resource-datasets/alcohol_by_country/drinks.csv')
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [50]:
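# A:
# numeric_only=True skips the string `country` column; newer pandas versions raise an error without it.
drinks.groupby('continent').mean(numeric_only=True)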

continent,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
AF,61.471698,16.339623,16.264151,3.007547
AS,37.045455,60.840909,9.068182,2.170455
EU,193.777778,132.555556,142.222222,8.617778
OC,89.6875,58.4375,35.625,3.38125
SA,175.083333,114.75,62.416667,6.308333


**Q.2** Describe the `beer` column by continent.

In [52]:
# A:
drinks.groupby('continent')[['beer_servings']].describe()

,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings
continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0


<a id='apply-combine'></a>
### Apply Functions to Groups and Combine

**Q.1** Find the `count`, `mean`, `minimum`, and `maximum` of the `beer` column by continent.

In [56]:
# A:
# .describe() returns the four requested statistics plus std and the quartiles.
drinks.groupby('continent')[['beer_servings']].describe()

,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings
continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0
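
When only a few statistics are needed, `.agg()` with a list of function names returns just those columns. The sketch below uses a small made-up frame (`toy` is hypothetical, not the `drinks` file), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the drinks file.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS', 'AS'],
    'beer_servings': [100, 200, 10, 20, 30],
})

# One output column per requested statistic, one row per group.
beer_stats = toy.groupby('continent')['beer_servings'].agg(['count', 'mean', 'min', 'max'])
print(beer_stats)
```

The same call should work on `drinks`, assuming it has the `continent` and `beer_servings` columns shown above.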


**Q.2** Perform the same task as in Q.1, but now sort the output by the `mean` column.

In [61]:
# A:
drinks.groupby('continent')[['beer_servings']].describe().sort_values([('beer_servings','mean')],ascending=False)

,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings,beer_servings
continent,count,mean,std,min,25%,50%,75%,max
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0


**Q.3** Apply a custom function to all columns of the `drinks` DataFrame, grouping by continent.

In [83]:
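# numeric_only=True skips the string `country` column; newer pandas versions raise an error without it.
drinks.groupby('continent').mean(numeric_only=True)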

continent,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
AF,61.471698,16.339623,16.264151,3.007547
AS,37.045455,60.840909,9.068182,2.170455
EU,193.777778,132.555556,142.222222,8.617778
OC,89.6875,58.4375,35.625,3.38125
SA,175.083333,114.75,62.416667,6.308333


In [95]:
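# A:
def custom(x, n):
    # Mean of the group (column-wise for a DataFrame), rounded to n decimal places.
    return np.round(x.mean(), n)

# Select the numeric columns first so the string `country` column is not averaged.
numeric_cols = drinks.select_dtypes('number').columns.tolist()
drinks.groupby('continent')[numeric_cols].apply(custom, 2)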

continent,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
AF,61.47,16.34,16.26,3.01
AS,37.05,60.84,9.07,2.17
EU,193.78,132.56,142.22,8.62
OC,89.69,58.44,35.62,3.38
SA,175.08,114.75,62.42,6.31


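**Q.4** Apply the custom function with a different rounding precision for each column by passing a dictionary to `.agg()`.

> **Note:** If you don't specify columns for the aggregation function, it is applied to every non-grouping column. Older pandas versions silently skipped non-numeric columns; newer versions raise an error instead, so select the numeric columns first.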

In [101]:
# A:
drinks.groupby('continent').agg({'beer_servings': lambda x: custom(x, 3),
                                 'spirit_servings': lambda x: custom(x, 2),
                                 'wine_servings': lambda x: custom(x, 1)})

continent,beer_servings,spirit_servings,wine_servings
AF,61.472,16.34,16.3
AS,37.045,60.84,9.1
EU,193.778,132.56,142.2
OC,89.688,58.44,35.6
SA,175.083,114.75,62.4


<a id='exercise-4'></a>

## Exercise #4

---

**Using the `users` DataFrame**:
1. Count the number of distinct occupations in `users`.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of `occupation` and `gender`.

> **Tip**: Multiple columns can be passed to the `.groupby()` function for more granular cross-sections.

In [105]:
# A:
local_user_file = '../../../../resource-datasets/users/users_original.txt'
users_header = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
local_users = pd.read_csv(local_user_file, delimiter='|', names=users_header)
local_users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
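
One possible approach to the four tasks, sketched on a made-up frame with the same column names as `users` (`toy_users` and its rows are hypothetical, so the results will not match the real file):

```python
import pandas as pd

# Hypothetical stand-in for the users file (same column names, invented rows).
toy_users = pd.DataFrame({
    'age': [24, 53, 23, 31, 33],
    'gender': ['M', 'F', 'M', 'F', 'F'],
    'occupation': ['technician', 'other', 'writer', 'technician', 'other'],
})

# 1. Number of distinct occupations.
n_occupations = toy_users['occupation'].nunique()

# 2. Mean age per occupation.
mean_age = toy_users.groupby('occupation')['age'].mean()

# 3. Minimum and maximum age per occupation.
age_range = toy_users.groupby('occupation')['age'].agg(['min', 'max'])

# 4. A list of columns groups by every occupation/gender combination present.
mean_age_by_both = toy_users.groupby(['occupation', 'gender'])['age'].mean()
```

The last result has a two-level index, so a single value is addressed with a tuple such as `mean_age_by_both[('other', 'F')]`.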


<a id='indexing'></a>
## Indexing

---
<a id='loc'></a>
### Location Indexing With `.loc`

**Q.1** Select all rows and the `City` column from the UFO data set using `.loc`.

In [104]:
# A:




**Q.2** Select all rows and the `City` and `State` columns.

In [13]:
# A:

**Q.3** Select all rows and the columns from `City` *through* `State`.

In [14]:
# A:

**Q.4** Select:
- All columns at row 0.
- All columns at rows 0:2.
- Columns `City` through `State` at rows 0:2.

In [15]:
# A:
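
A minimal sketch of label-based selection with `.loc`, using a few invented rows shaped like the UFO data (`toy_ufo` is hypothetical). Note that `.loc` takes square brackets, and that label slices include both endpoints.

```python
import pandas as pd

# Hypothetical rows shaped like the UFO data, with the default integer index.
toy_ufo = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke', 'Abilene'],
    'Shape_Reported': ['TRIANGLE', 'OTHER', 'OVAL', 'DISK'],
    'State': ['NY', 'NJ', 'CO', 'KS'],
})

one_col = toy_ufo.loc[:, 'City']                # all rows, one column (a Series)
two_cols = toy_ufo.loc[:, ['City', 'State']]    # all rows, a list of columns
col_range = toy_ufo.loc[:, 'City':'State']      # every column from City through State
row_zero = toy_ufo.loc[0, :]                    # all columns for the row labeled 0
first_rows = toy_ufo.loc[0:2, 'City':'State']   # rows labeled 0, 1, and 2
```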

<a id='iloc'></a>
### Position Indexing With `.iloc`

**Q.1** Select all rows and the columns in positions 0 and 3.

In [16]:
# A:

**Q.2** Select all rows and columns in positions 0 through 4.

In [17]:
# A:

**Q.3** Select rows in positions 0:3, along with all columns.

In [18]:
# A:
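
A minimal sketch of position-based selection with `.iloc` on an invented frame (`toy` and its column names are hypothetical). Unlike `.loc`, integer slices here exclude the stop position, as in ordinary Python slicing.

```python
import pandas as pd

# Hypothetical five-column frame; only the positions matter here.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12],
    'd': [13, 14, 15, 16],
    'e': [17, 18, 19, 20],
})

cols_0_and_3 = toy.iloc[:, [0, 3]]   # all rows, columns at positions 0 and 3
cols_0_to_3 = toy.iloc[:, 0:4]       # positions 0, 1, 2, 3 (the stop, 4, is excluded)
rows_0_to_2 = toy.iloc[0:3, :]       # rows at positions 0, 1, 2 with all columns
```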

<a id='frequent'></a>
## Other Frequently Used Features

---
<a id='map-dict'></a>
### Using Map Functions With Replacement Dictionaries

In [19]:
# A:
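
A minimal sketch of `Series.map()` with a replacement dictionary (the `gender` values and the 0/1 coding are invented for illustration). Values that have no key in the dictionary come back as `NaN`.

```python
import pandas as pd

# Hypothetical column of category labels.
gender = pd.Series(['M', 'F', 'M', 'X'])

# Each value is looked up in the dictionary; 'X' has no key, so it maps to NaN.
is_male = gender.map({'F': 0, 'M': 1})
```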

<a id='factorize'></a>
### Encoding Strings as Integers With `.factorize()`

In [20]:
# A:
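
A minimal sketch of `pd.factorize()` on an invented Series. It returns integer codes, numbered in order of first appearance, together with the unique values the codes refer to.

```python
import pandas as pd

# Hypothetical string column.
occupation = pd.Series(['writer', 'other', 'writer', 'technician'])

# codes[i] is the position of occupation[i] within uniques.
codes, uniques = pd.factorize(occupation)
```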

<a id='unique'></a>
### Determining Unique Values

In [21]:
# A:
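
A minimal sketch of the usual tools for inspecting unique values, on an invented Series with one missing entry:

```python
import pandas as pd

# Hypothetical column with a repeated value and a missing value.
occupation = pd.Series(['writer', 'other', 'writer', None])

distinct = occupation.unique()        # array of distinct values, missing value included
n_distinct = occupation.nunique()     # count of distinct values, missing values excluded by default
counts = occupation.value_counts()    # frequency of each non-missing value
```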

<a id='replace'></a>
### Replacing Values With `.replace()`

In [22]:
# A:
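
A minimal sketch of `.replace()` with a dictionary, on invented values. In contrast to `.map()`, values without a matching key are left unchanged rather than turned into `NaN`.

```python
import pandas as pd

# Hypothetical column of abbreviations.
gender = pd.Series(['M', 'F', 'M', 'X'])

# Only the listed values are swapped; 'X' passes through untouched.
spelled_out = gender.replace({'M': 'male', 'F': 'female'})
```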

<a id='series-str'></a>
### Series String Methods With `.str`

In [23]:
# A:
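
A minimal sketch of the `.str` accessor, which applies string methods element-wise and propagates missing values. The `city` values are invented; `na=False` is one way to treat missing entries as non-matches.

```python
import pandas as pd

# Hypothetical text column with one missing value.
city = pd.Series(['Ithaca', 'New York Worlds Fair', None])

upper = city.str.upper()                          # missing value stays missing
has_york = city.str.contains('York', na=False)    # Boolean mask, missing counts as False
```

A mask like `has_york` can be passed straight to a filter such as `city[has_york]`.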

<a id='datetime'></a>
### Datetime Conversion and Arithmetic

In [24]:
# A:
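
A minimal sketch of datetime conversion and arithmetic, using two timestamps in the same format as the `Time` column shown earlier. The explicit `format` string is an assumption about that column; omitting it lets pandas infer the format.

```python
import pandas as pd

# Two timestamps in month/day/year hour:minute form.
time = pd.Series(['6/1/1930 22:00', '6/30/1930 20:00'])

parsed = pd.to_datetime(time, format='%m/%d/%Y %H:%M')
year = parsed.dt.year                      # the .dt accessor exposes datetime parts
elapsed = parsed.iloc[1] - parsed.iloc[0]  # subtracting timestamps gives a Timedelta
```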

<a id='set-reset-index'></a>
### Setting and Resetting the Index

In [25]:
# A:
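
A minimal sketch of moving a column into the index and back, on an invented two-row frame:

```python
import pandas as pd

# Hypothetical frame with a column of unique labels.
toy = pd.DataFrame({'country': ['Albania', 'Algeria'], 'beer_servings': [89, 25]})

indexed = toy.set_index('country')               # `country` becomes the row labels
beer = indexed.loc['Algeria', 'beer_servings']   # rows can now be looked up by country
restored = indexed.reset_index()                 # the index goes back to being a column
```

Both methods return new DataFrames by default, so `toy` itself is unchanged.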

<a id='sort-by-index'></a>
### Sorting by Index

In [26]:
# A:
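
A minimal sketch of `.sort_index()` on an invented frame whose index is out of order:

```python
import pandas as pd

# Hypothetical frame indexed by country name, deliberately unsorted.
toy = pd.DataFrame(
    {'beer_servings': [25, 89, 0]},
    index=['Algeria', 'Albania', 'Afghanistan'],
)

by_index = toy.sort_index()                       # ascending by index label
by_index_desc = toy.sort_index(ascending=False)   # descending by index label
```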

<a id='change-dtype'></a>
### Changing the Data Type of a Column

In [27]:
# A:
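
A minimal sketch of `.astype()`, on an invented frame where a ZIP code was read as a number and an age as text:

```python
import pandas as pd

# Hypothetical frame with columns stored as the "wrong" type.
toy = pd.DataFrame({'zip_code': [85711, 94043], 'age': ['24', '53']})

toy['zip_code'] = toy['zip_code'].astype(str)   # identifiers are often better kept as text
toy['age'] = toy['age'].astype(int)             # numeric strings become integers
```

`.astype()` raises an error if any value cannot be converted, so it is worth checking for stray text first.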

<a id='dummy'></a>
### Creating Dummy-Coded Columns

In [28]:
# A:
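
A minimal sketch of dummy coding with `pd.get_dummies()`, on an invented categorical Series. Each distinct value becomes its own indicator column, named with the given prefix.

```python
import pandas as pd

# Hypothetical categorical column.
continent = pd.Series(['EU', 'AF', 'EU'], name='continent')

# One indicator column per distinct value, in sorted order.
dummies = pd.get_dummies(continent, prefix='continent')
```

Passing `drop_first=True` drops the first indicator column, which is a common choice when the dummies feed a linear model.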

<a id='concatenate'></a>
### Concatenating DataFrames

In [29]:
# A:
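
A minimal sketch of `pd.concat()` in both directions, on invented frames:

```python
import pandas as pd

# Hypothetical frames to combine.
top = pd.DataFrame({'a': [1, 2]})
bottom = pd.DataFrame({'a': [3]})
extra = pd.DataFrame({'b': [10, 20]})

# Stack rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([top, bottom], ignore_index=True)

# Place frames side by side; rows are aligned on the index.
side_by_side = pd.concat([top, extra], axis=1)
```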

<a id='duplicate-rows'></a>
### Detecting and Dropping Duplicate Rows

In [30]:
# A:
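
A minimal sketch of finding and dropping duplicate rows, on an invented frame with one repeated row:

```python
import pandas as pd

# Hypothetical frame where the second row repeats the first.
toy = pd.DataFrame({
    'city': ['Arlington', 'Arlington', 'Ithaca'],
    'state': ['TX', 'TX', 'NY'],
})

dup_mask = toy.duplicated()        # True for rows that repeat an earlier row
deduped = toy.drop_duplicates()    # keeps the first occurrence of each row
```

Both methods accept `subset=[...]` to compare only certain columns.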

<a id='write-csv'></a>
### Writing a DataFrame to a `.csv`
```python
# Write a DataFrame out to a `.csv`.
drinks.to_csv('drinks_updated.csv')  # Index is used as the first column
drinks.to_csv('drinks_updated.csv', index=False) # Ignore index
```

<a id='pickle'></a>
### Pickling a DataFrame
```python
# Save a DataFrame to disk (a.k.a., "pickle") and read it from disk (a.k.a., "unpickle").
drinks.to_pickle('drinks_pickle')
pd.read_pickle('drinks_pickle')
```

<a id='sample'></a>
### Randomly Sampling a DataFrame

In [31]:
# A:
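
A minimal sketch of `.sample()` on an invented ten-row frame. The `random_state` value is arbitrary; fixing it only makes the draw repeatable.

```python
import pandas as pd

# Hypothetical ten-row frame.
toy = pd.DataFrame({'x': range(10)})

three_rows = toy.sample(n=3, random_state=1)       # a fixed number of rows
half_rows = toy.sample(frac=0.5, random_state=1)   # a fraction of the rows
```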

<a id='infrequent'></a>
## Infrequently Used Features

---

<a id='toy-dataframes'></a>
### Creating DataFrames From Dictionaries and Lists of Lists

In [32]:
# A:

In [33]:
# A:
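
A minimal sketch of building the same small DataFrame two ways (the column names and values are invented):

```python
import pandas as pd

# From a dictionary: keys become column names, values become columns.
from_dict = pd.DataFrame({'color': ['red', 'blue'], 'size': [1, 2]})

# From a list of lists: each inner list is a row, so column names are passed separately.
from_lists = pd.DataFrame([['red', 1], ['blue', 2]], columns=['color', 'size'])
```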

<a id='crosstab'></a>
### Performing Cross-Tabulations

In [34]:
# A:
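
A minimal sketch of `pd.crosstab()`, which counts how often each pair of values occurs. The rows below are invented.

```python
import pandas as pd

# Hypothetical categorical columns.
toy = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F'],
    'occupation': ['writer', 'writer', 'other', 'writer'],
})

# Rows are occupations, columns are genders, cells are counts.
table = pd.crosstab(toy['occupation'], toy['gender'])
```

Combinations that never occur appear as 0 rather than being omitted.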

<a id='query'></a>
### Query-Filtering Syntax

In [35]:
# A:
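
A minimal sketch of `.query()`, which filters rows using an expression written as a string. It is shown next to the equivalent Boolean-mask filter on an invented frame.

```python
import pandas as pd

# Hypothetical frame.
toy = pd.DataFrame({
    'continent': ['EU', 'AS', 'EU'],
    'beer_servings': [245, 0, 89],
})

# Column names are used directly inside the query string.
queried = toy.query("continent == 'EU' and beer_servings > 100")

# The same filter with a Boolean mask.
masked = toy[(toy['continent'] == 'EU') & (toy['beer_servings'] > 100)]
```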

<a id='memory-usage'></a>
### Calculating Memory Usage

In [36]:
# A:
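
A minimal sketch of `.memory_usage()` on an invented frame. `deep=True` asks pandas to measure the actual contents of object columns instead of only their pointers.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 1,000 64-bit integers take 8 bytes each.
toy = pd.DataFrame({'x': np.arange(1000, dtype='int64')})

per_column = toy.memory_usage(deep=True)   # bytes per column, plus an 'Index' entry
total_bytes = per_column.sum()
```

`toy.info(memory_usage='deep')` prints a similar total along with the column dtypes.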

<a id='category-type'></a>
### Converting Column to Category Type

In [37]:
# A:
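
A minimal sketch of converting a repetitive string column to the `category` dtype, which stores each distinct value once plus a small integer code per row. The data are invented.

```python
import pandas as pd

# Hypothetical column with only three distinct values repeated many times.
continent = pd.Series(['EU', 'AS', 'EU', 'AF'] * 1000)

as_category = continent.astype('category')

# Compare the footprint before and after conversion.
bytes_before = continent.memory_usage(deep=True)
bytes_after = as_category.memory_usage(deep=True)
```

The saving depends on how repetitive the column is; a column of mostly unique strings gains little.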

<a id='assign'></a>
### Creating Columns With `.assign()`

In [38]:
# A:
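
A minimal sketch of `.assign()`, which returns a copy with new columns added and leaves the original untouched. The frame and the `total` column are invented.

```python
import pandas as pd

# Hypothetical frame.
toy = pd.DataFrame({'beer': [10, 20], 'wine': [1, 2]})

# The keyword becomes the column name; a callable receives the DataFrame.
with_total = toy.assign(total=lambda d: d['beer'] + d['wine'])
```

Because it returns a new DataFrame, `.assign()` fits well inside a chain of method calls.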

<a id='limit-rows-read'></a>
### Limiting the Number of Rows to Load in a File Read

In [39]:
# A:
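
A minimal sketch of the `nrows` argument to `pd.read_csv()`. An in-memory string stands in for a file so the example runs on its own; with a real file, pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content with three data rows.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# Only the first two data rows are parsed.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
```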

<a id='manual-print'></a>
### Manually Setting the Number of Rows and Columns to Print

In [40]:
# A:

In [41]:
# A:
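
A minimal sketch of the display options that control how many rows and columns pandas prints. The limits of 10 and 5 are arbitrary examples.

```python
import pandas as pd

# Limit how much of a DataFrame is printed.
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 5)

current_max_rows = pd.get_option('display.max_rows')

# Restore the defaults when finished.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```

These options change only what is displayed, not the data held in the DataFrame.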