<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 1

_Authors: Joseph Nelson (DC)_

---

**Warning: This is a resource-heavy notebook that can consume a lot of RAM, especially when it's run in Chrome. For this lesson, you may want to close idle applications and/or open this notebook with Safari.**

### Lesson Guide
- [The Basics of `pandas` DataFrames](#basics)
    - [Loading Data](#loading)
    - [A Basic Examination of DataFrames](#examine)
    - [Selecting Columns](#selecting)
    - [Describing Data](#describing)
- [Exercise #1](#exercise-1)
- [Filtering and Sorting DataFrames](#filtering-sorting)
    - [Boolean Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise #2](#exercise-2)
- [Renaming, Adding, and Removing Columns](#columns)
    - [Renaming Columns](#renaming-columns)
    - [Adding Columns](#adding-columns)
    - [Removing Columns](#removing-columns)
- [Handling Missing Values](#missing)
    - [Finding Missing Values](#find-missing)
    - [Dropping Missing Values](#drop-missing)
    - [Filling in Missing Values](#fill-missing)


<a id='basics'></a>

## The Basics of `pandas` DataFrames

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading Data

**Q.1** You can read in a file either from your local computer or directly from a URL.

```Python
# Local:
users = pd.read_table('../datasets/users.txt')

# Remote:
users = pd.read_table('https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users.txt')
```

Read in the data using the method you prefer.

In [9]:
users = pd.read_csv('datasets/users.txt')
users.sample(10)

Unnamed: 0,user_id|age|gender|occupation|zip_code
534,535|45|F|educator|80302
794,795|30|M|programmer|08610
622,623|50|F|educator|60187
318,319|38|M|programmer|22030
912,913|27|M|student|76201
804,805|27|F|other|20009
208,209|33|F|educator|85710
131,132|24|M|other|94612
198,199|30|M|writer|17604
769,770|28|M|student|14216


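**Q.2** The file is pipe-delimited, so the default comma separator reads each row into a single column. Use keyword arguments (such as `sep`) to set appropriate data-reading parameters.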

In [10]:
users = pd.read_csv('datasets/users.txt',sep='|')
users.sample(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
575,576,48,M,executive,98281
491,492,57,M,educator,94618
707,708,26,F,homemaker,96349
600,601,19,F,artist,99687
367,368,18,M,student,92113
415,416,20,F,student,92626
881,882,35,M,engineer,40503
561,562,54,F,administrator,20879
222,223,19,F,student,47906
708,709,21,M,other,N4T1A
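
The effect of `sep` can be sketched on a small in-memory sample. The three rows below are a hypothetical excerpt in the same layout as `users.txt`, and the `dtype` argument is an optional addition (not used in the lesson) that keeps zip codes as strings so leading zeros are not lost.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as users.txt.
raw = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
8|36|M|administrator|05201
9|29|M|student|01002
"""

# sep='|' splits the fields; dtype keeps zip codes as text so leading zeros survive.
sample_users = pd.read_csv(io.StringIO(raw), sep='|', dtype={'zip_code': str})
print(sample_users)
```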


<a id='examine'></a>
### A Basic Examination of DataFrames

**Q.1** Print the type of `users`.

In [11]:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first five rows, first ten rows, and last two rows of `users`.

In [12]:
users.head(5)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [13]:
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [14]:
users.tail(2)

Unnamed: 0,user_id,age,gender,occupation,zip_code
941,942,48,F,librarian,78209
942,943,22,M,student,77841


**Q.3** Print the index and columns.

In [15]:
users.index

RangeIndex(start=0, stop=943, step=1)

In [16]:
users.columns

Index([u'user_id', u'age', u'gender', u'occupation', u'zip_code'], dtype='object')

**Q.4** Find the dtypes of the columns.

In [17]:
users.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the DataFrame.

In [18]:
users.shape

(943, 5)

**Q.6** Extract the underlying `numpy` array as a new variable.

In [20]:
values = users.values
print(type(values))
print(values.shape)

<type 'numpy.ndarray'>
(943, 5)
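
As a quick recap of the attributes above, here is a minimal sketch on a made-up three-row frame (`toy` is a hypothetical name, not part of the lesson data). It also uses `to_numpy()`, which newer versions of `pandas` offer as an alternative to `.values`.

```python
import pandas as pd

# Hypothetical toy frame; the real lesson uses the `users` DataFrame.
toy = pd.DataFrame({'user_id': [1, 2, 3],
                    'age': [24, 53, 23],
                    'gender': ['M', 'F', 'M']})

print(type(toy))    # the container class (a DataFrame)
print(toy.dtypes)   # one dtype per column
print(toy.shape)    # (rows, columns)

arr = toy.to_numpy()  # the underlying data as a NumPy array
print(type(arr), arr.shape)
```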


<a id='selecting'></a>
### Selecting Columns

**Q.1** Assign the `gender` column to a variable.

In [21]:
gender = users['gender']

**Q.2** What is the type of `gender`?

In [24]:
type(gender)

pandas.core.series.Series

**Q.3** Select `gender` and `occupation` as a new DataFrame.

In [26]:
users[['gender','occupation']].head()

Unnamed: 0,gender,occupation
0,M,technician
1,F,other
2,M,writer
3,M,technician
4,F,other
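
The difference between single and double brackets can be sketched as follows, using a hypothetical toy frame: one column name returns a Series, while a list of names returns a DataFrame, even when the list has a single entry.

```python
import pandas as pd

# Hypothetical toy frame standing in for `users`.
toy = pd.DataFrame({'gender': ['M', 'F', 'M'],
                    'occupation': ['technician', 'other', 'writer'],
                    'age': [24, 53, 23]})

as_series = toy['gender']                 # single label -> Series
as_frame = toy[['gender']]                # list of one label -> DataFrame
two_cols = toy[['gender', 'occupation']]  # list of labels -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__, two_cols.shape)
```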


<a id='describing'></a>
### Describing Data

**Q.1** Calculate the descriptive statistics for the numeric columns in the DataFrame (_which is the function default_).  

In [27]:
users.describe()

Unnamed: 0,user_id,age
count,943.0,943.0
mean,472.0,34.051962
std,272.364951,12.19274
min,1.0,7.0
25%,236.5,25.0
50%,472.0,31.0
75%,707.5,43.0
max,943.0,73.0


**Q.2** Describe the "object" (string) columns.

In [30]:
users.describe(include=['O'])

Unnamed: 0,gender,occupation,zip_code
count,943,943,943
unique,2,21,795
top,M,student,55414
freq,670,196,9


**Q.3** Describe all of the columns, regardless of type.

In [31]:
users.describe(include='all')

Unnamed: 0,user_id,age,gender,occupation,zip_code
count,943.0,943.0,943,943,943
unique,,,2,21,795
top,,,M,student,55414
freq,,,670,196,9
mean,472.0,34.051962,,,
std,272.364951,12.19274,,,
min,1.0,7.0,,,
25%,236.5,25.0,,,
50%,472.0,31.0,,,
75%,707.5,43.0,,,
max,943.0,73.0,,,


**Q.4** Describe the `gender` Series from the `users` DataFrame.

In [32]:
users['gender'].describe()

count     943
unique      2
top         M
freq      670
Name: gender, dtype: object

**Q.5** Calculate the mean of the `age` column.

In [33]:
users['age'].mean()

34.05196182396607

**Q.6** Calculate the number of distinct values in the `age` column (the same calls work for `gender`).

In [39]:
len(users['age'].unique())

61

In [40]:
users['age'].nunique()

61
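
If the goal is the count *per* distinct value rather than the number of distinct values, `value_counts()` is the usual tool. The sketch below contrasts the two on a short, made-up Series of ages.

```python
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([24, 53, 23, 24, 33, 24], name='age')

n_distinct = ages.nunique()    # how many different values appear
counts = ages.value_counts()   # how often each value appears, most frequent first

print(n_distinct)
print(counts)
```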

<a id='exercise-1'></a>
## Exercise #1

---

Load the `drinks.csv` data provided in the URL below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the `drinks` DataFrame.
8. Find the first three items of the value counts of the `occupation` column (in the `users` DataFrame from the BONUS).

**BONUS:**
1. Create the `users` DataFrame from the `user_file` provided (which lacks a header row).
2. Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`.


In [41]:
# Use your preferred file location to read in the data:
remote_drinks_csv = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/drinks.csv'
local_drinks_csv = 'datasets/drinks.csv'
# and
remote_user_file = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users_original.txt'
local_user_file = 'datasets/users_original.txt'

In [44]:
drinks = pd.read_csv(local_drinks_csv)
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [51]:
users = pd.read_csv(local_user_file, sep='|', header=None)
users.columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [54]:
users['occupation'].value_counts().head(3)

student     196
other       105
educator     95
Name: occupation, dtype: int64

<a id='filtering-sorting'></a>

## Filtering and Sorting DataFrames

---


<a id='filtering'></a>
### Boolean Filtering

**Q.1** Show users `age < 20` using a Boolean mask.

In [56]:
users[users['age']<20].head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
35,36,19,F,student,93117
51,52,18,F,student,55105
56,57,16,M,none,84010
66,67,17,M,student,60402


**Q.2** Calculate the value counts of `occupation` for users `age < 20`.

In [57]:
users[users['age']<20]['occupation'].value_counts()

student          64
other             4
none              3
writer            2
entertainment     2
salesman          1
artist            1
Name: occupation, dtype: int64

**Q.3** Print the male users `age < 20`. 

In [59]:
users[(users['age']<20) & (users['gender']=='M')].head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
56,57,16,M,none,84010
66,67,17,M,student,60402
67,68,19,M,student,22904
100,101,15,M,student,5146


**Q.4** Print the users `age < 10` or `age > 70`.

In [60]:
users[(users['age']<10) | (users['age']>70)]

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
480,481,73,M,retired,37771


<a id='sorting'></a>
### Sorting

**Q.1** Return the `age` column sorted in ascending order.

In [64]:
users['age'].sort_values(ascending=True).head()

29      7
470    10
288    11
879    13
608    13
Name: age, dtype: int64

**Q.2** Sort the `users` DataFrame by the `age` column (ascending).

In [65]:
users.sort_values(by='age',ascending=True).head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
470,471,10,M,student,77459
288,289,11,M,none,94619
879,880,13,M,student,83702
608,609,13,F,student,55106


**Q.3** Sort the `users` DataFrame by the `age` column in *descending* order.

In [66]:
users.sort_values(by='age',ascending=False).head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
480,481,73,M,retired,37771
802,803,70,M,administrator,78212
766,767,70,M,engineer,0
859,860,70,F,retired,48322
584,585,69,M,librarian,98501


<a id='exercise-2'></a>

## Exercise #2

---

**Using the `drinks` DataFrame from the previous exercise:**
1. Filter `drinks` to include only European countries.
2. Filter `drinks` to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.



In [70]:
drinks[drinks['continent']=='EU'].head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89,132,54,4.9,EU
3,Andorra,245,138,312,12.4,EU
7,Armenia,21,179,11,3.8,EU
9,Austria,279,75,191,9.7,EU
10,Azerbaijan,21,46,5,1.3,EU


In [71]:
drinks[(drinks['continent']=='EU') & (drinks['wine_servings']>300)].head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245,138,312,12.4,EU
61,France,127,151,370,11.8,EU
136,Portugal,194,67,339,11.0,EU


In [72]:
drinks[(drinks['continent']=='EU')]['beer_servings'].mean()

193.77777777777777

In [76]:
drinks.sort_values(by='total_litres_of_pure_alcohol',ascending=False).head(10)['country']

15                Belarus
98              Lithuania
3                 Andorra
68                Grenada
45         Czech Republic
61                 France
141    Russian Federation
81                Ireland
155              Slovakia
99             Luxembourg
Name: country, dtype: object

**Using the `users` DataFrame:**
1. Sort `users` by occupation and then by `age` in a single command.
2. Filter `users` to only include doctors and lawyers without using a `|`.

> **Hint:** Look up `pandas.Series.isin`.

In [79]:
users[users.occupation.isin(['doctor','lawyer'])].sample(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
679,680,33,M,lawyer,90405
9,10,53,M,lawyer,90703
364,365,29,M,lawyer,20009
250,251,28,M,doctor,85032
934,935,42,M,doctor,66221
588,589,21,M,lawyer,90034
124,125,30,M,lawyer,22202
844,845,64,M,doctor,97405
137,138,46,M,doctor,53211
443,444,51,F,lawyer,53202
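
To see why `isin` satisfies the "no `|`" constraint, the sketch below checks on a hypothetical toy frame that it selects the same rows as two equality tests joined with `|`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `users`.
toy = pd.DataFrame({'occupation': ['doctor', 'student', 'lawyer', 'writer']})

with_isin = toy[toy['occupation'].isin(['doctor', 'lawyer'])]
with_or = toy[(toy['occupation'] == 'doctor') | (toy['occupation'] == 'lawyer')]

print(with_isin.equals(with_or))
```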


<a id='columns'></a>

## Renaming, Adding, and Removing Columns

---

<a id='renaming-columns'></a>
### Renaming Columns

**Q.1** Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a *new* DataFrame.

In [80]:
drinks_new = drinks.rename(columns={'beer_servings':'beer',
                                    'wine_servings':'wine'})

**Q.2** Perform the same renaming for `drinks`, but in place.

In [81]:
drinks.rename(columns={'beer_servings':'beer','wine_servings':'wine'}, inplace=True)

**Q.3** Replace the column names of `drinks` with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [82]:
drinks.columns = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
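
The difference between returning a new DataFrame and changing the original can be sketched on a hypothetical two-column frame. Reassigning the result of `rename` is shown here as one common alternative to `inplace=True`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `drinks`.
toy = pd.DataFrame({'beer_servings': [0, 89], 'wine_servings': [0, 54]})

# rename returns a new DataFrame; `toy` itself is left unchanged.
renamed = toy.rename(columns={'beer_servings': 'beer'})
original_names = list(toy.columns)

# Reassigning the result replaces the old frame without inplace=True.
toy = toy.rename(columns={'beer_servings': 'beer', 'wine_servings': 'wine'})
print(original_names, list(renamed.columns), list(toy.columns))
```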

<a id='adding-columns'></a>
### Adding Columns

**Q.1** Make a `serving` column that combines `beer`, `spirit`, and `wine`.

In [84]:
drinks['serving'] = drinks['beer'] + drinks['spirit'] + drinks['wine']

**Q.2** Make an `mL` column that is the `liters` column multiplied by 1,000.

In [90]:
drinks['mL'] = drinks['liters']*1000

In [91]:
drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent,serving,mL
0,Afghanistan,0,0,0,0.0,AS,0,0.0
1,Albania,89,132,54,4.9,EU,275,4900.0
2,Algeria,25,0,14,0.7,AF,39,700.0
3,Andorra,245,138,312,12.4,EU,695,12400.0
4,Angola,217,57,45,5.9,AF,319,5900.0


<a id='removing-columns'></a>
### Removing Columns

**Q.1** Remove the `mL` column, returning a new DataFrame.

In [92]:
drinks_new = drinks.drop('mL',axis=1)

**Q.2** Remove the `mL` column from `drinks` in place, keeping `serving` for the cells that follow.

In [93]:
drinks.drop('mL',axis=1,inplace=True)
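
To drop more than one column at once, pass a list. The sketch below uses the `columns=` keyword, which is equivalent to passing the labels with `axis=1`; the toy frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for `drinks`.
toy = pd.DataFrame({'beer': [0, 89],
                    'serving': [0, 275],
                    'mL': [0.0, 4900.0]})

# Returns a new DataFrame without the listed columns; `toy` is unchanged.
trimmed = toy.drop(columns=['mL', 'serving'])
print(list(trimmed.columns), list(toy.columns))
```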

<a id='missing'></a>
## Handling Missing Values

---

<a id='find-missing'></a>
### Finding Missing Values

**Q.1** Include missing values from the `continent` variable in the `drinks` DataFrame when counting unique values.

In [97]:
drinks['continent'].nunique()

5

In [98]:
drinks['continent'].fillna('NAN').nunique()

6
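
Filling with a placeholder works, but `nunique` and `value_counts` also take a `dropna` argument that counts missing values directly. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Made-up continent codes with two missing entries.
continent = pd.Series(['AS', 'EU', np.nan, 'EU', np.nan])

without_missing = continent.nunique()             # NaN is ignored by default
with_missing = continent.nunique(dropna=False)    # NaN counts as one more value
counts = continent.value_counts(dropna=False)     # includes a row for NaN

print(without_missing, with_missing)
print(counts)
```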

**Q.2** Create a Boolean Series indicating which values are missing or not missing in `continent`.

In [100]:
drinks['continent'].isnull()

0      False
1      False
2      False
3      False
4      False
5       True
6      False
7      False
8      False
9      False
10     False
11      True
12     False
13     False
14      True
15     False
16     False
17      True
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
163    False
164    False
165    False
166    False
167    False
168    False
169    False
170    False
171    False
172    False
173    False
174     True
175    False
176    False
177    False
178    False
179    False
180    False
181    False
182    False
183    False
184     True
185    False
186    False
187    False
188    False
189    False
190    False
191    False
192    False
Name: continent, Length: 193, dtype: bool

**Q.3** Subset to rows in `drinks` where `continent` is missing and where `continent` is not missing.

In [102]:
drinks_null = drinks[drinks['continent'].isnull()]
drinks_notnull = drinks[~drinks['continent'].isnull()]
len(drinks_null),len(drinks_notnull)

(23, 170)

**Q.4** Calculate the sum of `drinks`' *columns* and the sum of its *rows*.

In [None]:
# A:

In [None]:
# A:
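
One possible approach, sketched on a hypothetical two-row frame: `axis=0` collapses the rows to give one total per column, and `axis=1` collapses the columns to give one total per row. `numeric_only=True` is an assumption added here so that the text column is skipped rather than concatenated or rejected.

```python
import pandas as pd

# Hypothetical toy frame standing in for `drinks`.
toy = pd.DataFrame({'country': ['Albania', 'Algeria'],
                    'beer': [89, 25],
                    'spirit': [132, 0],
                    'wine': [54, 14]})

col_sums = toy.sum(axis=0, numeric_only=True)   # one total per column
row_sums = toy.sum(axis=1, numeric_only=True)   # one total per row

print(col_sums)
print(row_sums)
```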

**Side Note: Adding Booleans**
```python
pd.Series([True, False, True])  # Creates a Boolean Series
pd.Series([True, False, True]).sum()  # Converts `False` to 0 and `True` to 1
```

**Q.5** Find the number of missing values by column in `drinks`.

In [105]:
drinks.isnull().sum(axis=0)

country       0
beer          0
spirit        0
wine          0
liters        0
continent    23
serving       0
dtype: int64
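
Because `True` counts as 1, the same Boolean frame also gives the *share* of missing values when averaged instead of summed. A small sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `drinks`.
toy = pd.DataFrame({'country': ['Canada', 'Albania', 'Cuba', 'Angola'],
                    'continent': [np.nan, 'EU', np.nan, 'AF']})

missing_counts = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()    # fraction of missing values per column

print(missing_counts)
print(missing_share)
```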

<a id='drop-missing'></a>
### Dropping Missing Values

**Q.1** Drop rows where *ANY* values are missing in `drinks` (returning a new DataFrame).  
_Make sure you know ahead of time exactly what you'll be dropping._

In [108]:
drinks.sample(5)

Unnamed: 0,country,beer,spirit,wine,liters,continent,serving
54,El Salvador,52,69,2,2.2,,123
104,Mali,5,1,1,0.6,AF,7
9,Austria,279,75,191,9.7,EU,545
115,Mozambique,47,18,5,1.3,AF,70
132,Paraguay,213,117,74,7.3,SA,404


In [110]:
drinks.dropna(how='any',axis=0)

Unnamed: 0,country,beer,spirit,wine,liters,continent,serving
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319
6,Argentina,193,25,221,8.3,SA,439
7,Armenia,21,179,11,3.8,EU,211
8,Australia,261,72,212,10.4,OC,545
9,Austria,279,75,191,9.7,EU,545
10,Azerbaijan,21,46,5,1.3,EU,72


**Q.2** Drop rows only where *ALL* values are missing in `drinks`.

In [111]:
drinks.dropna(how='all',axis=0)

Unnamed: 0,country,beer,spirit,wine,liters,continent,serving
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319
5,Antigua & Barbuda,102,128,45,4.9,,275
6,Argentina,193,25,221,8.3,SA,439
7,Armenia,21,179,11,3.8,EU,211
8,Australia,261,72,212,10.4,OC,545
9,Austria,279,75,191,9.7,EU,545
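
The `how` options are easier to compare on a frame that actually contains an all-missing row. The sketch below uses a hypothetical toy frame and also shows `subset`, an additional `dropna` argument that limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 0 lacks a continent, row 2 is entirely missing.
toy = pd.DataFrame({'country': ['Canada', 'Albania', np.nan],
                    'beer': [240.0, 89.0, np.nan],
                    'continent': [np.nan, 'EU', np.nan]})

any_dropped = toy.dropna(how='any')          # drop rows with at least one NaN
all_dropped = toy.dropna(how='all')          # drop rows where every value is NaN
beer_known = toy.dropna(subset=['beer'])     # only look at the `beer` column

print(any_dropped.index.tolist(), all_dropped.index.tolist(), beer_known.index.tolist())
```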


<a id='fill-missing'></a>
### Filling in Missing Values

What's up with these `NaN` continents?

In [113]:
drinks[drinks['continent'].isnull()]

Unnamed: 0,country,beer,spirit,wine,liters,continent,serving
5,Antigua & Barbuda,102,128,45,4.9,,275
11,Bahamas,122,176,51,6.3,,349
14,Barbados,143,173,36,6.3,,352
17,Belize,263,114,8,6.8,,385
32,Canada,240,122,100,8.2,,462
41,Costa Rica,149,87,11,4.4,,247
43,Cuba,93,137,5,4.2,,235
50,Dominica,52,286,26,6.6,,364
51,Dominican Republic,193,147,9,6.2,,349
54,El Salvador,52,69,2,2.2,,123


_You probably figured it out already, but all of these countries are in North America (`NA`). When the file was read in, the string `NA` was misinterpreted as a missing value (`NaN`)._

**Q.1** Fill in the missing values of the `continent` column using string `NA`.

In [114]:
drinks['continent'] = drinks['continent'].fillna('NA')

**Q.2** Turn off the missing value filter when loading the `drinks` `.csv`.

In [119]:
pd.read_csv(local_drinks_csv, na_filter=False).head(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
5,Antigua & Barbuda,102,128,45,4.9,NA
6,Argentina,193,25,221,8.3,SA
7,Armenia,21,179,11,3.8,EU
8,Australia,261,72,212,10.4,OC
9,Austria,279,75,191,9.7,EU
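
Turning the filter off entirely means genuinely empty cells are no longer flagged either. A middle ground, sketched below on a made-up three-row CSV (the `Atlantis` row is invented to supply empty fields), is to disable only the default markers with `keep_default_na=False` and list the ones you still want in `na_values`.

```python
import io
import pandas as pd

# Made-up sample: 'NA' is a real continent code, the last row has empty fields.
raw = """country,beer_servings,continent
Albania,89,EU
Canada,240,NA
Atlantis,,
"""

default = pd.read_csv(io.StringIO(raw))                     # 'NA' becomes NaN
no_filter = pd.read_csv(io.StringIO(raw), na_filter=False)  # nothing becomes NaN
keep_na = pd.read_csv(io.StringIO(raw),
                      keep_default_na=False,                # 'NA' stays a string
                      na_values=[''])                       # empty cells are still NaN

print(default['continent'].isnull().tolist())
print(keep_na)
```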
