<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 2

_Authors: Joseph Nelson (DC)_

---

**Warning: This is a resource-heavy notebook that can consume a lot of RAM, especially when it's run in Chrome. For this lesson, you may want to close idle applications and/or open this notebook with Safari.**

### Lesson Guide
- [Exercise #3](#exercise-3)
- [Split-Apply-Combine](#split-apply-combine)
    - [`.groupby()`](#groupby)
    - [Apply Functions to Groups and Combine](#apply-combine)
- [Exercise #4](#exercise-4)
- [Indexing](#indexing)
    - [Location Indexing With `.loc[]`](#loc)
    - [Position Indexing With `.iloc[]`](#iloc)
- [Other Frequently Used Features](#frequent)
    - [Using Map Functions With Replacement Dictionaries](#map-dict)
    - [Encoding Strings as Integers With `.factorize()`](#factorize)
    - [Determining Unique Values](#unique)
    - [Replacing Values With `.replace()`](#replace)
    - [Series String Methods With `.str`](#series-str)
    - [Datetime Conversion and Arithmetic](#datetime)
    - [Setting and Resetting the Index](#set-reset-index)
    - [Sorting by Index](#sort-by-index)
    - [Changing the Data Type of a Column](#change-dtype)
    - [Creating Dummy-Coded Columns](#dummy)
    - [Concatenating DataFrames](#concatenate)
    - [Detecting and Dropping Duplicate Rows](#duplicate-rows)
    - [Writing a DataFrame to a `.csv`](#write-csv)
    - [Pickling a DataFrame](#pickle)
    - [Randomly Sampling a DataFrame](#sample)
- [Infrequently Used Features](#infrequent)
    - [Creating DataFrames From Dictionaries and Lists of Lists](#toy-dataframes)
    - [Performing Cross-Tabulations](#crosstab)
    - [Query-Filtering Syntax](#query)
    - [Calculating Memory Usage](#memory-usage)
    - [Converting Column to Category Type](#category-type)
    - [Creating Columns With `.assign()`](#assign)
    - [Limiting the Number of Rows to Load in a File Read](#limit-rows-read)
    - [Manually Setting the Number of Rows and Columns to Print](#manual-print)

In [1]:
import pandas as pd

<a id='exercise-3'></a>
## Exercise #3

---

**Using the UFO data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state `VA`.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where `City` is missing.
8. Count the number of rows with no null values.
9. Replace spaces in column names with underscores.
10. Make a new column that combines `City` and `State`.

In [2]:
ufo_csv = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/ufo.csv'
ufo_csv = 'datasets/ufo.csv'

In [3]:
df = pd.read_csv(ufo_csv)

In [4]:
print(df.shape)
print(df.info())
print(df.describe(include='all'))

(80543, 5)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80543 entries, 0 to 80542
Data columns (total 5 columns):
City               80496 non-null object
Colors Reported    17034 non-null object
Shape Reported     72141 non-null object
State              80543 non-null object
Time               80543 non-null object
dtypes: object(5)
memory usage: 3.1+ MB
None
           City Colors Reported Shape Reported  State            Time
count     80496           17034          72141  80543           80543
unique    13504              31             27     52           68901
top     Seattle          ORANGE          LIGHT     CA  7/4/2014 22:00
freq        646            5216          16332  10743              45


In [5]:
df.groupby('Colors Reported').size().sort_values(ascending=False).head(4)

Colors Reported
ORANGE    5216
RED       4809
GREEN     1897
BLUE      1855
dtype: int64

In [6]:
df[df['State']=='VA'].groupby('City').size().sort_values(ascending=False).head(4)

City
Virginia Beach    110
Richmond           92
Alexandria         48
Roanoke            35
dtype: int64

In [7]:
df[(df['State']=='VA') & (df['City']=='Arlington')]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
6300,Arlington,,CHEVRON,VA,5/5/1990 21:40
10278,Arlington,,DISK,VA,5/27/1997 15:30
14527,Arlington,,OTHER,VA,9/10/1999 21:41
17984,Arlington,RED,DISK,VA,11/19/2000 22:00
21201,Arlington,GREEN,FIREBALL,VA,1/7/2002 17:45
22633,Arlington,,LIGHT,VA,7/26/2002 1:15
22780,Arlington,,LIGHT,VA,8/7/2002 21:00
25066,Arlington,,CIGAR,VA,6/1/2003 22:34
27398,Arlington,,VARIOUS,VA,12/13/2003 2:00


In [8]:
df.isnull().sum()

City                  47
Colors Reported    63509
Shape Reported      8402
State                  0
Time                   0
dtype: int64

In [9]:
df[df['City'].isnull()]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
21,,,,LA,8/15/1943 0:00
22,,,LIGHT,LA,8/15/1943 0:00
204,,,DISK,CA,7/15/1952 12:30
241,,BLUE,DISK,MT,7/4/1953 14:00
613,,,DISK,NV,7/1/1960 12:00
1877,,YELLOW,CIRCLE,AZ,8/15/1969 1:00
2013,,,,NH,8/1/1970 9:30
2546,,,FIREBALL,OH,10/25/1973 23:30
3123,,RED,TRIANGLE,WV,11/25/1975 23:00
4736,,,SPHERE,CA,6/23/1982 23:00


In [10]:
df.dropna().count()

City               15510
Colors Reported    15510
Shape Reported     15510
State              15510
Time               15510
dtype: int64

In [11]:
import re
new_colnames = []
for c in df.columns:
    new_colnames.append(re.sub(' ', '_', c))

df.columns = new_colnames
df.head(3)

Unnamed: 0,City,Colors_Reported,Shape_Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
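
Step 10 of the exercise (combining the city and state into one column) has no answer cell above. A minimal sketch of one way to do it, using a small stand-in DataFrame rather than the real UFO file; the `Location` column name and the `', '` separator are illustrative choices, not part of the lesson:

```python
import pandas as pd

# Toy stand-in for the UFO data (three rows, two columns).
ufo = pd.DataFrame({
    'City': ['Ithaca', 'Arlington', None],
    'State': ['NY', 'VA', 'LA'],
})

# Adding string columns concatenates them element by element.
# A missing city produces a missing value in the combined column.
ufo['Location'] = ufo['City'] + ', ' + ufo['State']
print(ufo)
```

Rows with a missing `City` end up with a missing `Location`, which is usually preferable to a half-filled string.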


<a id='split-apply-combine'></a>
## Split-Apply-Combine

---

![](assets/split_apply_combine.png)

<a id='groupby'></a>
### `.groupby()`

**Q.1** Using the `drinks` DataFrame, calculate the mean `beer` servings by continent.

In [12]:
drinks = pd.read_csv('datasets/drinks_updated.csv')

In [13]:
drinks.groupby('continent')['beer'].mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer, dtype: float64

**Q.2** Describe the `beer` column by continent.

In [14]:
drinks.groupby('continent')['beer'].describe()

continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0


<a id='apply-combine'></a>
### Apply Functions to Groups and Combine

**Q.1** Find the `count`, `mean`, `minimum`, and `maximum` of the `beer` column by continent.

In [15]:
drinks.groupby('continent')['beer'].describe()

continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0
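
`.describe()` returns more statistics than the question asks for. As a hedged alternative, `.agg()` with a list of function names returns only the requested columns. The sketch below uses a made-up `drinks_toy` frame so that it runs without the lesson's data file:

```python
import pandas as pd

# Made-up data with the same column names as the drinks data set.
drinks_toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AF', 'AF', 'AF'],
    'beer': [200, 100, 30, 60, 90],
})

# Split by continent, apply four aggregations to `beer`, combine into one table.
summary = drinks_toy.groupby('continent')['beer'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```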


**Q.2** Perform the same task as in Q.1, but now sort the output by the `mean` column.

In [16]:
drinks.groupby('continent')['beer'].describe().sort_values('mean')

continent,count,mean,std,min,25%,50%,75%,max
AS,44.0,37.045455,49.469725,0.0,4.25,17.5,60.5,247.0
AF,53.0,61.471698,80.557816,0.0,15.0,32.0,76.0,376.0
OC,16.0,89.6875,96.641412,0.0,21.0,52.5,125.75,306.0
SA,12.0,175.083333,65.242845,93.0,129.5,162.5,198.0,333.0
EU,45.0,193.777778,99.631569,0.0,127.0,219.0,270.0,361.0


**Q.3** Apply a custom function to all columns of the `drinks` DataFrame, grouping by continent.

In [17]:
drinks.groupby('continent').mean()

continent,beer,spirit,wine,liters
AF,61.471698,16.339623,16.264151,3.007547
AS,37.045455,60.840909,9.068182,2.170455
EU,193.777778,132.555556,142.222222,8.617778
OC,89.6875,58.4375,35.625,3.38125
SA,175.083333,114.75,62.416667,6.308333
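
The cell above uses the built-in `.mean()`. For a genuinely custom function, one option is to pass it to `.agg()`. This is a sketch on made-up data; `value_range` and `drinks_toy` are hypothetical names introduced here:

```python
import pandas as pd

# Made-up data; `country` is a non-numeric column like the one in the real file.
drinks_toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'continent': ['EU', 'EU', 'AF', 'AF'],
    'beer': [200, 100, 30, 90],
    'wine': [150, 50, 10, 20],
})

def value_range(col):
    """Custom aggregation: spread between the largest and smallest value."""
    return col.max() - col.min()

# Apply the custom function to each numeric column within each continent.
ranges = drinks_toy.groupby('continent')[['beer', 'wine']].agg(value_range)

# Built-in mean, restricted explicitly to numeric columns.
means = drinks_toy.groupby('continent').mean(numeric_only=True)
print(ranges)
print(means)
```

Selecting the numeric columns before aggregating keeps the custom function from being called on text columns such as `country`.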


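**Q.4** **Note:** If you don't specify a column for the aggregation function, it is applied to all numeric columns. In pandas 2.0 and later, pass `numeric_only=True` (for example, `.mean(numeric_only=True)`) when the DataFrame also contains non-numeric columns; otherwise the call raises a `TypeError`.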

In [18]:
# A:

<a id='exercise-4'></a>

## Exercise #4

---

**Using the `users` DataFrame**:
1. Count the number of distinct occupations in `users`.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of `occupation` and `gender`.

> **Tip**: Multiple columns can be passed to the `.groupby()` function for more granular cross-sections.

In [19]:
df = pd.read_table('./datasets/users.txt', sep='|')
df['occupation'].nunique()

21

In [20]:
df.groupby('occupation')['age'].mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

In [21]:
print(df.groupby('occupation')['age'].min())
print(df.groupby('occupation')['age'].max())

occupation
administrator    21
artist           19
doctor           28
educator         23
engineer         22
entertainment    15
executive        22
healthcare       22
homemaker        20
lawyer           21
librarian        23
marketing        24
none             11
other            13
programmer       20
retired          51
salesman         18
scientist        23
student           7
technician       21
writer           18
Name: age, dtype: int64
occupation
administrator    70
artist           48
doctor           64
educator         63
engineer         70
entertainment    50
executive        69
healthcare       62
homemaker        50
lawyer           53
librarian        69
marketing        55
none             55
other            64
programmer       63
retired          73
salesman         66
scientist        55
student          42
technician       55
writer           60
Name: age, dtype: int64


In [22]:
df.groupby(['occupation','gender'])['age'].mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666

<a id='indexing'></a>
## Indexing

---
<a id='loc'></a>
### Location Indexing With `.loc[]`

**Q.1** Select all rows and the `City` column from the UFO data set using `.loc[]`.

In [23]:
df = pd.read_csv(ufo_csv)
df.head(3)

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00


In [24]:
df.loc[:,'City']

0                      Ithaca
1                 Willingboro
2                     Holyoke
3                     Abilene
4        New York Worlds Fair
5                 Valley City
6                 Crater Lake
7                        Alma
8                     Eklutna
9                     Hubbard
10                    Fontana
11                   Waterloo
12                     Belton
13                     Keokuk
14                  Ludington
15                Forest Home
16                Los Angeles
17                  Hapeville
18                     Oneida
19                 Bering Sea
20                   Nebraska
21                        NaN
22                        NaN
23                  Owensboro
24                 Wilderness
25                  San Diego
26                 Wilderness
27                     Clovis
28                 Los Alamos
29               Ft. Duschene
                 ...         
80513              Manahawkin
80514             New Bedford
80515     

**Q.2** Select all rows and the `City` and `State` columns.

In [25]:
df.loc[:,['City','State']]

Unnamed: 0,City,State
0,Ithaca,NY
1,Willingboro,NJ
2,Holyoke,CO
3,Abilene,KS
4,New York Worlds Fair,NY
5,Valley City,ND
6,Crater Lake,CA
7,Alma,MI
8,Eklutna,AK
9,Hubbard,OR


**Q.3** Select all rows and every column from `City` *through* `State` (label slices include both endpoints).

In [26]:
df.loc[:,'City':'State']

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS
4,New York Worlds Fair,,LIGHT,NY
5,Valley City,,DISK,ND
6,Crater Lake,,CIRCLE,CA
7,Alma,,DISK,MI
8,Eklutna,,CIGAR,AK
9,Hubbard,,CYLINDER,OR


**Q.4** Select:
- All columns at row 0.
- All columns at rows 0:2.
- Columns `City` through `State` at rows 0:2.

In [27]:
df.loc[0,:]

City                       Ithaca
Colors Reported               NaN
Shape Reported           TRIANGLE
State                          NY
Time               6/1/1930 22:00
Name: 0, dtype: object

In [28]:
df.loc[0:2, :]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
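
The third part of Q.4 (columns `City` through `State` at rows 0:2) has no answer cell. A minimal sketch on a four-row toy frame built from the rows shown earlier in this lesson; `ufo_toy` is a hypothetical name:

```python
import pandas as pd

# First four rows of the UFO data as printed in this lesson.
ufo_toy = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke', 'Abilene'],
    'Colors Reported': [None, None, None, None],
    'Shape Reported': ['TRIANGLE', 'OTHER', 'OVAL', 'DISK'],
    'State': ['NY', 'NJ', 'CO', 'KS'],
    # The time for the fourth row is not shown in the lesson output.
    'Time': ['6/1/1930 22:00', '6/30/1930 20:00', '2/15/1931 14:00', None],
})

# Label-based slices include both endpoints: rows 0, 1, 2 and columns City..State.
subset = ufo_toy.loc[0:2, 'City':'State']
print(subset)
```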


<a id='iloc'></a>
### Position Indexing With `.iloc[]`

**Q.1** Select the rows and columns in positions 0 through 2 (the slice `:3` stops before position 3).

In [29]:
df.iloc[:3,:3]

Unnamed: 0,City,Colors Reported,Shape Reported
0,Ithaca,,TRIANGLE
1,Willingboro,,OTHER
2,Holyoke,,OVAL
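
Unlike `.loc[]`, position slices in `.iloc[]` exclude the end position, and a list of integers selects non-adjacent positions. A short sketch on a hypothetical `ufo_toy` frame that contrasts the two forms:

```python
import pandas as pd

# Toy frame with four columns so that position 3 exists.
ufo_toy = pd.DataFrame({
    'City': ['Ithaca', 'Willingboro', 'Holyoke', 'Abilene'],
    'Colors Reported': [None, None, None, None],
    'Shape Reported': ['TRIANGLE', 'OTHER', 'OVAL', 'DISK'],
    'State': ['NY', 'NJ', 'CO', 'KS'],
})

# Slice: rows 0-2 and columns 0-2 (position 3 is excluded).
first_three = ufo_toy.iloc[:3, :3]

# List of positions: all rows, but only columns 0 and 3.
cols_0_and_3 = ufo_toy.iloc[:, [0, 3]]
print(first_three)
print(cols_0_and_3)
```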


**Q.2** Select the rows and columns in positions 0 through 3 (the slice `:4` stops before position 4).

In [30]:
df.iloc[:4,:4]

Unnamed: 0,City,Colors Reported,Shape Reported,State
0,Ithaca,,TRIANGLE,NY
1,Willingboro,,OTHER,NJ
2,Holyoke,,OVAL,CO
3,Abilene,,DISK,KS


**Q.3** Select rows in positions 0:3, along with all columns.

In [31]:
df.iloc[:3,:]

Unnamed: 0,City,Colors Reported,Shape Reported,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00


<a id='frequent'></a>
## Frequently Used Features

---
<a id='map-dict'></a>
### Using Map Functions With Replacement Dictionaries

In [59]:
user = pd.read_table('./datasets/users.txt', sep='|')
user['is_male'] = user['gender'].map({'F':0, 'M':1})
user.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_male
0,1,24,M,technician,85711,1
1,2,53,F,other,94043,0
2,3,23,M,writer,32067,1
3,4,24,M,technician,43537,1
4,5,33,F,other,15213,0


<a id='factorize'></a>
### Encoding Strings as Integers With `.factorize()`

In [63]:
user['occ'] = user['occupation'].factorize()[0]
user.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_male,occ
0,1,24,M,technician,85711,1,0
1,2,53,F,other,94043,0,1
2,3,23,M,writer,32067,1,2
3,4,24,M,technician,43537,1,0
4,5,33,F,other,15213,0,1


<a id='unique'></a>
### Determining Unique Values

In [34]:
df['State'].unique()

array(['NY', 'NJ', 'CO', 'KS', 'ND', 'CA', 'MI', 'AK', 'OR', 'AL', 'SC',
       'IA', 'GA', 'TN', 'NE', 'LA', 'KY', 'WV', 'NM', 'UT', 'RI', 'FL',
       'VA', 'NC', 'TX', 'WA', 'ME', 'IL', 'AZ', 'OH', 'PA', 'MN', 'WI',
       'MD', 'SD', 'NV', 'ID', 'MO', 'OK', 'IN', 'CT', 'MS', 'AR', 'WY',
       'MA', 'MT', 'DE', 'NH', 'VT', 'HI', 'Ca', 'Fl'], dtype=object)

<a id='replace'></a>
### Replacing Values With `.replace()`

In [35]:
df['State'].replace('CA','NIO')

0         NY
1         NJ
2         CO
3         KS
4         NY
5         ND
6        NIO
7         MI
8         AK
9         OR
10       NIO
11        AL
12        SC
13        IA
14        MI
15       NIO
16       NIO
17        GA
18        TN
19        AK
20        NE
21        LA
22        LA
23        KY
24        WV
25       NIO
26        WV
27        NM
28        NM
29        UT
        ... 
80513     NJ
80514     MA
80515     VA
80516    NIO
80517     NH
80518     PA
80519     IL
80520     PA
80521     OH
80522     MA
80523     MD
80524     WA
80525     IA
80526     MA
80527     WA
80528     OH
80529     WA
80530     FL
80531     VA
80532     MA
80533     IA
80534     TX
80535     KY
80536     PA
80537     NE
80538     NE
80539     OH
80540     AZ
80541     IL
80542     FL
Name: State, Length: 80543, dtype: object

<a id='series-str'></a>
### Series String Methods With `.str`

In [36]:
# A:
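
The `.str` accessor applies string methods to every element of a Series. The sketch below is one possible answer on toy values; the mixed-case codes mirror the `'Ca'` and `'Fl'` entries in the unique-values output above:

```python
import pandas as pd

# Toy state codes with inconsistent casing and one missing value.
states = pd.Series(['CA', 'Ca', 'FL', 'Fl', None])

# Upper-case every element; missing values stay missing.
upper = states.str.upper()

# Boolean test per element; na=False treats missing values as no match.
starts_with_c = states.str.startswith('C', na=False)

# Literal (non-regex) replacement of spaces with underscores.
cities = pd.Series(['New York Worlds Fair', 'Valley City'])
underscored = cities.str.replace(' ', '_', regex=False)
print(upper.dropna().unique())
```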

<a id='datetime'></a>
### Datetime Conversion and Arithmetic

In [37]:
# A:
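
The `Time` column in the UFO data is stored as text. A minimal sketch of converting such strings with `pd.to_datetime` and doing arithmetic on the result; the two timestamps come from the first rows shown earlier, and the explicit `format` string is an assumption about how the whole column is laid out:

```python
import pandas as pd

times = pd.Series(['6/1/1930 22:00', '6/30/1930 20:00'])

# Parse month/day/year hour:minute strings into datetime64 values.
parsed = pd.to_datetime(times, format='%m/%d/%Y %H:%M')

# The .dt accessor exposes date parts for every element.
years = parsed.dt.year

# Subtracting timestamps gives a Timedelta; adding a Timedelta shifts them.
gap = parsed[1] - parsed[0]
week_later = parsed + pd.Timedelta(days=7)
print(gap)
```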

<a id='set-reset-index'></a>
### Setting and Resetting the Index

In [38]:
# A:
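
A possible answer, sketched on a toy version of the users data: `.set_index()` promotes a column to row labels and `.reset_index()` moves the labels back into a column. Both return new DataFrames by default.

```python
import pandas as pd

users_toy = pd.DataFrame({'user_id': [1, 2, 3], 'age': [24, 53, 23]})

# user_id becomes the row labels, so .loc can look rows up by id.
indexed = users_toy.set_index('user_id')

# Move the labels back to an ordinary column with a fresh 0..n-1 index.
restored = indexed.reset_index()
print(indexed.loc[2, 'age'])
```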

<a id='sort-by-index'></a>
### Sorting by Index

In [39]:
# A:
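
One way to sort rows by their labels rather than by a column is `.sort_index()`. A short sketch on a toy frame with a deliberately shuffled index:

```python
import pandas as pd

# Index labels are out of order on purpose.
ages = pd.DataFrame({'age': [23, 24, 53]}, index=[3, 1, 2])

by_index = ages.sort_index()                      # labels ascending: 1, 2, 3
by_index_desc = ages.sort_index(ascending=False)  # labels descending: 3, 2, 1
print(by_index)
```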

<a id='change-dtype'></a>
### Changing the Data Type of a Column

In [40]:
# A:
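
A minimal sketch of changing column types with `.astype()`, using toy values shaped like the users data. Treating `zip_code` as text is an illustrative choice, since ZIP codes are identifiers rather than quantities:

```python
import pandas as pd

users_toy = pd.DataFrame({'age': [24, 53], 'zip_code': [85711, 94043]})

# .astype returns a new Series, so assign it back to the column.
users_toy['age'] = users_toy['age'].astype(float)
users_toy['zip_code'] = users_toy['zip_code'].astype(str)
print(users_toy.dtypes)
```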

<a id='dummy'></a>
### Creating Dummy-Coded Columns

In [41]:
# A:
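
`pd.get_dummies()` turns a categorical column into one indicator column per category. A sketch on toy data; the `gender` prefix is an illustrative choice, and the indicator dtype (integer or boolean) depends on the pandas version:

```python
import pandas as pd

users_toy = pd.DataFrame({'gender': ['M', 'F', 'M']})

# One column per category, named <prefix>_<category>.
dummies = pd.get_dummies(users_toy['gender'], prefix='gender')

# Attach the indicator columns to the original frame.
combined = pd.concat([users_toy, dummies], axis=1)
print(combined)
```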

<a id='concatenate'></a>
### Concatenating DataFrames

In [42]:
# A:
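
A short sketch of `pd.concat()` in both directions, on two one-row toy frames. The `_b` suffix is only there to keep the column names distinct in the side-by-side case:

```python
import pandas as pd

a = pd.DataFrame({'City': ['Ithaca'], 'State': ['NY']})
b = pd.DataFrame({'City': ['Holyoke'], 'State': ['CO']})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([a, b], ignore_index=True)

# Place frames side by side, aligning on the row index.
side_by_side = pd.concat([a, b.add_suffix('_b')], axis=1)
print(stacked)
```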

<a id='duplicate-rows'></a>
### Detecting and Dropping Duplicate Rows

In [43]:
# A:
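
One possible answer on toy data: `.duplicated()` flags rows that repeat an earlier row, and `.drop_duplicates()` removes them. The `subset` and `keep` arguments control which columns are compared and which copy survives.

```python
import pandas as pd

ufo_toy = pd.DataFrame({
    'City': ['Ithaca', 'Ithaca', 'Holyoke'],
    'State': ['NY', 'NY', 'CO'],
})

# True where the whole row repeats an earlier row.
dup_mask = ufo_toy.duplicated()

# Keep the first copy of each distinct row.
deduped = ufo_toy.drop_duplicates()

# Compare on State only and keep the last row seen for each state.
last_per_state = ufo_toy.drop_duplicates(subset=['State'], keep='last')
print(dup_mask.sum())
```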

<a id='write-csv'></a>
### Writing a DataFrame to a `.csv`
```python
# Write a DataFrame out to a `.csv`.
drinks.to_csv('drinks_updated.csv')  # Index is used as the first column
drinks.to_csv('drinks_updated.csv', index=False) # Ignore index
```

<a id='pickle'></a>
### Pickling a DataFrame
```python
# Save a DataFrame to disk (a.k.a., "pickle") and read it from disk (a.k.a., "unpickle").
drinks.to_pickle('drinks_pickle')
pd.read_pickle('drinks_pickle')
```

<a id='sample'></a>
### Randomly Sampling a DataFrame

In [44]:
# A:
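
A minimal sketch of `.sample()` on a toy frame. Passing `random_state` makes the draw repeatable; the seed value 42 is arbitrary:

```python
import pandas as pd

numbers = pd.DataFrame({'x': range(10)})

# Draw a fixed number of rows without replacement.
three_rows = numbers.sample(n=3, random_state=42)

# Draw a fraction of the rows instead of a fixed count.
half = numbers.sample(frac=0.5, random_state=42)
print(three_rows)
```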

<a id='infrequent'></a>
## Infrequently Used Features

---

<a id='toy-dataframes'></a>
### Creating DataFrames From Dictionaries and Lists of Lists

In [45]:
# A:
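
A sketch of the two construction styles named in the heading, with made-up values. A dictionary maps column names to columns, while a list of lists supplies rows and needs the column names passed separately:

```python
import pandas as pd

# Dictionary: keys are column names, values are the column contents.
from_dict = pd.DataFrame({'color': ['red', 'blue'], 'count': [3, 5]})

# List of lists: each inner list is one row.
from_rows = pd.DataFrame([['red', 3], ['blue', 5]], columns=['color', 'count'])
print(from_dict.equals(from_rows))
```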

In [46]:
# A:

<a id='crosstab'></a>
### Performing Cross-Tabulations

In [47]:
# A:
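
`pd.crosstab()` counts how often each combination of two categorical columns occurs. A minimal sketch on toy rows shaped like the users data:

```python
import pandas as pd

users_toy = pd.DataFrame({
    'occupation': ['writer', 'writer', 'technician', 'other'],
    'gender': ['M', 'F', 'M', 'F'],
})

# Rows are occupations, columns are genders, cells are counts.
table = pd.crosstab(users_toy['occupation'], users_toy['gender'])
print(table)
```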

<a id='query'></a>
### Query-Filtering Syntax

In [48]:
# A:
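
`.query()` filters rows with a string expression instead of a boolean mask. The sketch below, on toy data, shows that it selects the same rows as the equivalent mask; the age threshold is arbitrary:

```python
import pandas as pd

users_toy = pd.DataFrame({'age': [24, 53, 23, 33], 'gender': ['M', 'F', 'M', 'F']})

# Column names are used directly inside the expression string.
older_women = users_toy.query("age > 30 and gender == 'F'")

# Equivalent boolean-mask filter for comparison.
masked = users_toy[(users_toy['age'] > 30) & (users_toy['gender'] == 'F')]
print(older_women)
```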

<a id='memory-usage'></a>
### Calculating Memory Usage

In [49]:
# A:
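
A short sketch of `.memory_usage()` on a toy frame. With `deep=True`, pandas inspects the actual string contents instead of counting only the pointers, so the figure is more realistic for text columns:

```python
import pandas as pd

states = pd.DataFrame({'State': ['NY', 'NJ', 'CO'] * 1000})

# Bytes used per column; the row index is reported under the label 'Index'.
per_column = states.memory_usage(deep=True)
total_bytes = per_column.sum()
print(per_column)
```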

<a id='category-type'></a>
### Converting Column to Category Type

In [50]:
# A:
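
Converting a repetitive text column to the `category` dtype stores each distinct value once and replaces the rows with small integer codes. A sketch on toy occupations, repeated to make the saving visible:

```python
import pandas as pd

occupations = pd.Series(['technician', 'other', 'writer'] * 1000)

as_category = occupations.astype('category')

# Distinct values (sorted) and the integer code stored for each row.
categories = list(as_category.cat.categories)
first_codes = list(as_category.cat.codes[:3])

text_bytes = occupations.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)
```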

<a id='assign'></a>
### Creating Columns With `.assign()`

In [51]:
# A:
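
`.assign()` returns a copy of the DataFrame with new columns added, leaving the original untouched. A minimal sketch on made-up drinks data; the `total` column is a hypothetical example:

```python
import pandas as pd

drinks_toy = pd.DataFrame({'beer': [100, 200], 'wine': [50, 150]})

# The lambda receives the DataFrame, which makes .assign easy to chain.
with_total = drinks_toy.assign(total=lambda d: d['beer'] + d['wine'])
print(with_total)
```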

<a id='limit-rows-read'></a>
### Limiting the Number of Rows to Load in a File Read

In [52]:
# A:
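
The `nrows` argument of `pd.read_csv()` stops reading after a given number of data rows, which helps when previewing large files. The sketch reads from an in-memory string so that it runs without any file on disk:

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file.
csv_text = "City,State\nIthaca,NY\nWillingboro,NJ\nHolyoke,CO\n"

# Read the header plus only the first two data rows.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(first_two)
```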

<a id='manual-print'></a>
### Manually Setting the Number of Rows and Columns to Print

In [53]:
# A:
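
Display limits are controlled through pandas options. A sketch of setting them globally, overriding them temporarily, and restoring the defaults; the specific limits are arbitrary:

```python
import pandas as pd

# Global settings: applies to every DataFrame printed afterwards.
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 10)
current_rows = pd.get_option('display.max_rows')

# Temporary override that ends when the with-block exits.
with pd.option_context('display.max_rows', 5):
    rows_inside = pd.get_option('display.max_rows')

# Restore the library defaults.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```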

In [54]:
# A: