<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 1


---


### Lesson Guide
- [The Basics of `pandas` DataFrames](#basics)
    - [Loading Data](#loading)
    - [A Basic Examination of DataFrames](#examine)
    - [Selecting Columns](#selecting)
    - [Describing Data](#describing)
- [Exercise #1](#exercise-1)
- [Filtering and Sorting DataFrames](#filtering-sorting)
    - [Boolean Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise #2](#exercise-2)
- [Renaming, Adding, and Removing Columns](#columns)
    - [Renaming Columns](#renaming-columns)
    - [Adding Columns](#adding-columns)
    - [Removing Columns](#removing-columns)
- [Handling Missing Values](#missing)
    - [Finding Missing Values](#find-missing)
    - [Dropping Missing Values](#drop-missing)
    - [Filling in Missing Values](#fill-missing)


<a id='basics'></a>

## The Basics of `pandas` DataFrames

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading Data

**Q.1** Read in the data file. Does it look right?

```Python
users = pd.read_csv('../../../../resource-datasets/users/users.txt')
```

In [2]:
# A:
users = pd.read_csv('../../../../resource-datasets/users/users.txt')
users.head()

Unnamed: 0,user_id|age|gender|occupation|zip_code
0,1|24|M|technician|85711
1,2|53|F|other|94043
2,3|23|M|writer|32067
3,4|24|M|technician|43537
4,5|33|F|other|15213


**Q.2** Load the file again, this time passing the appropriate argument so that the columns are parsed correctly.

In [3]:
# A:
users = pd.read_csv('../../../../resource-datasets/users/users.txt',delimiter='|')
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
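
The difference between the two loads above comes down to the separator. The sketch below reproduces it on a small in-memory stand-in for the file; the three rows and the name `raw` are made up for illustration, and `io.StringIO` is only there so that the example runs without the data file.

```python
import io

import pandas as pd

# Hypothetical stand-in for users.txt: a header plus three pipe-delimited rows.
raw = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# Default separator is a comma, so every line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# sep='|' (delimiter='|' is an alias) splits each line into five columns.
right = pd.read_csv(io.StringIO(raw), sep='|')

print(wrong.shape)   # (3, 1)
print(right.shape)   # (3, 5)
```

Checking `.shape` or `.columns` right after loading is a quick way to catch a wrong separator.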


<a id='examine'></a>
### A Basic Examination of DataFrames

**Q.1** Print the type of `users`.

In [5]:
# A:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first five rows, first 10 rows, and last two rows of `users`.

In [10]:
# A:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [9]:
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [8]:
# A:
users.tail(2)

Unnamed: 0,user_id,age,gender,occupation,zip_code
941,942,48,F,librarian,78209
942,943,22,M,student,77841


**Q.3** Print the index and columns.

In [11]:
# A:
users.index

RangeIndex(start=0, stop=943, step=1)

In [12]:
users.columns

Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')

**Q.4** Find the data types of the columns.

In [70]:
# A:
users.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the DataFrame.

In [22]:
# A:
users.shape

(943, 5)

**Q.6** Extract the underlying `numpy` array as a new variable as given by `.values`.

In [24]:
# A:
users_values = users.values
users_values

array([[1, 24, 'M', 'technician', '85711'],
       [2, 53, 'F', 'other', '94043'],
       [3, 23, 'M', 'writer', '32067'],
       ...,
       [941, 20, 'M', 'student', '97229'],
       [942, 48, 'F', 'librarian', '78209'],
       [943, 22, 'M', 'student', '77841']], dtype=object)

<a id='selecting'></a>
### Selecting Columns

**Q.1** Assign the `gender` column to a variable.

In [29]:
# A:
gender = users[['gender']]
gender.head()

Unnamed: 0,gender
0,M
1,F
2,M
3,M
4,F


**Q.2** What is the type of `gender`?

In [27]:
# A:
type(gender)  # double brackets in `users[['gender']]` give a one-column DataFrame

pandas.core.frame.DataFrame

**Q.3** Select `gender` and `occupation` as a new DataFrame.

In [31]:
# A:
gender_occup = users[['gender','occupation']]
gender_occup.head()

Unnamed: 0,gender,occupation
0,M,technician
1,F,other
2,M,writer
3,M,technician
4,F,other
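
Whether a selection comes back as a Series or a DataFrame depends on the brackets used. A minimal sketch on a made-up three-row table (`users_demo` is a hypothetical stand-in for `users`):

```python
import pandas as pd

# Hypothetical miniature of the users table.
users_demo = pd.DataFrame({
    'gender': ['M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer'],
})

as_series = users_demo['gender']                  # single brackets: a Series
as_frame = users_demo[['gender']]                 # list of one name: a one-column DataFrame
two_cols = users_demo[['gender', 'occupation']]   # list of names: a DataFrame

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
```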


<a id='describing'></a>
### Describing Data

**Q.1** Show the data types and whether there are null values in the DataFrame.

In [35]:
# A:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
user_id       943 non-null int64
age           943 non-null int64
gender        943 non-null object
occupation    943 non-null object
zip_code      943 non-null object
dtypes: int64(2), object(3)
memory usage: 36.9+ KB


**Q.2** Show the descriptive statistics for the numeric columns in the DataFrame (_which is the function default_).  

In [36]:
# A:
users.describe()

Unnamed: 0,user_id,age
count,943.0,943.0
mean,472.0,34.051962
std,272.364951,12.19274
min,1.0,7.0
25%,236.5,25.0
50%,472.0,31.0
75%,707.5,43.0
max,943.0,73.0


**Q.3** Describe all of the columns, regardless of type.

In [39]:
# A:
users.describe(include='all')

Unnamed: 0,user_id,age,gender,occupation,zip_code
count,943.0,943.0,943,943,943
unique,,,2,21,795
top,,,M,student,55414
freq,,,670,196,9
mean,472.0,34.051962,,,
std,272.364951,12.19274,,,
min,1.0,7.0,,,
25%,236.5,25.0,,,
50%,472.0,31.0,,,
75%,707.5,43.0,,,
max,943.0,73.0,,,


**Q.4** Describe the `gender` Series from the `users` DataFrame.

In [43]:
# A:
users['gender'].describe()

count     943
unique      2
top         M
freq      670
Name: gender, dtype: object

**Q.5** Calculate the mean of the `age` column.

In [49]:
# A:
users[['age']].mean()

age    34.051962
dtype: float64

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [65]:
# A:
users.gender.nunique()

2

In [64]:
users.age.nunique()

61
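
"Counts of distinct values" can be read two ways: how many distinct values there are, or how often each one occurs. The sketch below shows both on a made-up Series (`gender_demo` is hypothetical, not the real column).

```python
import pandas as pd

# Hypothetical five-row stand-in for the gender column.
gender_demo = pd.Series(['M', 'F', 'M', 'M', 'F'], name='gender')

n_distinct = gender_demo.nunique()       # number of distinct values
per_value = gender_demo.value_counts()   # occurrences of each distinct value

print(n_distinct)   # 2
print(per_value)
```

`value_counts()` sorts by frequency in descending order by default, which makes it handy for "most frequent" questions.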

<a id='exercise-1'></a>
## Exercise #1

---

Load the `drinks.csv` data provided below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the `drinks` DataFrame.


**BONUS:**
1. Create the `users` DataFrame from the `user_file` provided (which lacks a header row).
2. Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`.
3. Display the three most frequent occupations.

In [145]:
local_drinks_csv = '../../../../resource-datasets/alcohol_by_country/drinks.csv'
# and
local_user_file = '../../../../resource-datasets/users/users_original.txt'

In [146]:
# A:
local_drinks = pd.read_csv(local_drinks_csv)

users_header = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
local_users = pd.read_csv(local_user_file, delimiter='|',names=users_header)

local_drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [147]:
local_users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


<a id='filtering-sorting'></a>

## Filtering and Sorting DataFrames

---


<a id='filtering'></a>
### Boolean Filtering

**Q.1** Show users `age < 20` using a Boolean mask.

In [95]:
# A:
local_users[local_users['age'] < 20].head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
35,36,19,F,student,93117
51,52,18,F,student,55105
56,57,16,M,none,84010
66,67,17,M,student,60402


**Q.2** Calculate the value counts of `occupation` for users `age < 20`.

In [94]:
# A:
local_users[local_users['age'] < 20].occupation.value_counts()

student          64
other             4
none              3
entertainment     2
writer            2
salesman          1
artist            1
Name: occupation, dtype: int64

**Q.3** Print the male users `age < 20`. 

In [96]:
# A:
local_users[(local_users['age'] < 20) & (local_users['gender'] == 'M')].head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
56,57,16,M,none,84010
66,67,17,M,student,60402
67,68,19,M,student,22904
100,101,15,M,student,5146


**Q.4** Print the users `age < 10` or `age > 70`.

In [98]:
# A:
local_users[(local_users['age'] < 10) | (local_users['age'] > 70)]

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
480,481,73,M,retired,37771
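
The filters above all follow one pattern: build a Boolean Series, then index the DataFrame with it. A small sketch on made-up data (`people` and its values are hypothetical) that also shows why each condition needs its own parentheses:

```python
import pandas as pd

# Hypothetical five-row table.
people = pd.DataFrame({
    'age': [7, 19, 35, 73, 16],
    'gender': ['M', 'F', 'M', 'M', 'M'],
})

# A mask is just a Series of True/False values, one per row.
under_20 = people['age'] < 20

# & (and) and | (or) bind more tightly than comparisons,
# so each comparison is wrapped in parentheses.
young_men = people[under_20 & (people['gender'] == 'M')]
extremes = people[(people['age'] < 10) | (people['age'] > 70)]

print(young_men)
print(extremes)
```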


<a id='sorting'></a>
### Sorting

**Q.1** Return the `age` column sorted in ascending order.

In [101]:
# A:
local_users[['age']].sort_values('age',ascending=True).head()

Unnamed: 0,age
29,7
470,10
288,11
879,13
608,13


**Q.2** Sort the `users` DataFrame by the `age` column (ascending).

In [105]:
# A:
local_users.sort_values('age').head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
29,30,7,M,student,55436
470,471,10,M,student,77459
288,289,11,M,none,94619
879,880,13,M,student,83702
608,609,13,F,student,55106


**Q.3** Sort the `users` DataFrame by the `age` column in *descending* order.

In [104]:
# A:
local_users.sort_values('age',ascending=False).head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
480,481,73,M,retired,37771
802,803,70,M,administrator,78212
766,767,70,M,engineer,0
859,860,70,F,retired,48322
584,585,69,M,librarian,98501


<a id='exercise-2'></a>

## Exercise #2

---

**Using the `drinks` DataFrame from the previous exercise:**
1. Filter `drinks` to include only European countries (use `continent` to filter).
2. Filter `drinks` to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.

**Using the `users` DataFrame:**
1. Sort `users` by occupation and then by `age` in a single command.
2. Filter `users` to only include doctors and lawyers without using a `|`.

> **Hint:** Look up `pandas.Series.isin`.

In [112]:
# A:
local_drinks[(local_drinks['continent'] == 'EU') & (local_drinks['wine_servings'] > 300)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245,138,312,12.4,EU
61,France,127,151,370,11.8,EU
136,Portugal,194,67,339,11.0,EU


In [138]:
local_drinks[local_drinks['continent'] == 'EU']['beer_servings'].mean()

In [120]:
local_drinks.sort_values('total_litres_of_pure_alcohol',ascending=False).head(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
15,Belarus,142,373,42,14.4,EU
98,Lithuania,343,244,56,12.9,EU
3,Andorra,245,138,312,12.4,EU
68,Grenada,199,438,28,11.9,
45,Czech Republic,361,170,134,11.8,EU
61,France,127,151,370,11.8,EU
141,Russian Federation,247,326,73,11.5,AS
81,Ireland,313,118,165,11.4,EU
155,Slovakia,196,293,116,11.4,EU
99,Luxembourg,236,133,271,11.4,EU


In [126]:
local_users.sort_values(['occupation','age']).head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
117,118,21,M,administrator,90210
179,180,22,F,administrator,60202
281,282,22,M,administrator,20057
316,317,22,M,administrator,13210
438,439,23,F,administrator,20817


In [128]:
local_users.occupation.unique()

array(['technician', 'other', 'writer', 'executive', 'administrator',
       'student', 'lawyer', 'educator', 'scientist', 'entertainment',
       'programmer', 'librarian', 'homemaker', 'artist', 'engineer',
       'marketing', 'none', 'healthcare', 'retired', 'salesman', 'doctor'],
      dtype=object)

In [133]:
local_users[local_users.occupation.isin(['doctor','lawyer'])].head()

Unnamed: 0,user_id,age,gender,occupation,zip_code
9,10,53,M,lawyer,90703
124,125,30,M,lawyer,22202
125,126,28,F,lawyer,20015
137,138,46,M,doctor,53211
160,161,50,M,lawyer,55104
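
For comparison, here is the same kind of filter written with `|` and with `isin` on a made-up Series (`jobs` is hypothetical). Both select the same rows; `isin` scales better as the list of allowed values grows.

```python
import pandas as pd

# Hypothetical occupations.
jobs = pd.Series(['doctor', 'student', 'lawyer', 'writer', 'doctor'], name='occupation')

# One comparison per value, joined with |.
with_or = jobs[(jobs == 'doctor') | (jobs == 'lawyer')]

# A single membership test against a list of values.
with_isin = jobs[jobs.isin(['doctor', 'lawyer'])]

print(with_isin)
```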


<a id='columns'></a>

## Renaming, Adding, and Removing Columns

---

<a id='renaming-columns'></a>
### Renaming Columns

**Q.1** Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a *new* DataFrame.

In [148]:
# A:
local_drinks_2 = local_drinks.rename(columns={'beer_servings': 'beer', 'wine_servings': 'wine'})
local_drinks_2.head()

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Q.2** Perform the same renaming for `drinks`, but in place.

In [149]:
# A:
local_drinks.rename(columns={'beer_servings': 'beer', 'wine_servings': 'wine'},inplace=True)
local_drinks.head()

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Q.3** Replace the column names of `drinks` with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [151]:
# A:
local_drinks.columns = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
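
The three renaming approaches differ in what they touch. A sketch on a made-up two-column table (`drinks_demo` is hypothetical): `rename` without `inplace=True` leaves the original alone, while assigning to `.columns` replaces every name at once and needs one name per column.

```python
import pandas as pd

# Hypothetical two-column stand-in for drinks.
drinks_demo = pd.DataFrame({'beer_servings': [89, 25], 'wine_servings': [54, 14]})

# rename returns a new DataFrame; columns missing from the mapping keep their names.
renamed = drinks_demo.rename(columns={'beer_servings': 'beer'})
original_names = list(drinks_demo.columns)   # unchanged by the call above

# Assigning to .columns overwrites all names on the existing object.
drinks_demo.columns = ['beer', 'wine']

print(list(renamed.columns))
print(list(drinks_demo.columns))
```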


<a id='adding-columns'></a>
### Adding Columns

**Q.1** Make a `servings` column that combines `beer`, `spirit`, and `wine`.

In [158]:
# A:
local_drinks['servings'] = local_drinks[['beer', 'spirit', 'wine']].sum(axis=1)
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent,servings
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319


**Q.2** Make an `mL` column that is the `liters` column multiplied by 1,000.

In [184]:
# A:
local_drinks['mL'] = local_drinks['liters'] * 1000
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent,servings,mL
0,Afghanistan,0,0,0,0.0,AS,0,0.0
1,Albania,89,132,54,4.9,EU,275,4900.0
2,Algeria,25,0,14,0.7,AF,39,700.0
3,Andorra,245,138,312,12.4,EU,695,12400.0
4,Angola,217,57,45,5.9,AF,319,5900.0
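
New columns can usually be built with whole-column arithmetic rather than `apply`. The sketch below, on a made-up two-row table (`demo` is hypothetical), shows both styles giving the same values; the vectorized form is shorter and typically faster.

```python
import pandas as pd

# Hypothetical two-row stand-in for drinks.
demo = pd.DataFrame({
    'beer': [89, 25],
    'spirit': [132, 0],
    'wine': [54, 14],
    'liters': [4.9, 0.7],
})

# Row-wise total across three columns.
demo['servings'] = demo[['beer', 'spirit', 'wine']].sum(axis=1)

# Vectorized: the multiplication is applied to every element of the column.
demo['mL'] = demo['liters'] * 1000

# Same computation, one Python function call per element.
via_apply = demo['liters'].apply(lambda x: x * 1000)

print(demo)
```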


<a id='removing-columns'></a>
### Removing Columns

**Q.1** Remove the `mL` column, returning a new DataFrame.

*Hint: Try out .drop()*

In [167]:
# A:
local_drinks_2 = local_drinks.drop(columns='mL')
local_drinks_2.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent,servings
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319


**Q.2** Remove the `mL` and `servings` columns from `drinks` in place.

In [185]:
# A:
local_drinks.drop(columns=['mL', 'servings'], inplace=True)
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
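
The two questions above differ only in whether the original DataFrame changes. A sketch on a made-up one-row table (`demo` is hypothetical): `drop` returns a new DataFrame by default, and with `inplace=True` it modifies the original and returns `None`.

```python
import pandas as pd

# Hypothetical one-row table.
demo = pd.DataFrame({'beer': [89], 'servings': [275], 'mL': [4900.0]})

# Default: a new DataFrame comes back and demo keeps all three columns.
without_ml = demo.drop(columns='mL')
cols_after_copy = list(demo.columns)

# In place: demo itself loses the columns and the call returns None.
result = demo.drop(columns=['mL', 'servings'], inplace=True)

print(list(without_ml.columns))
print(list(demo.columns))
```

Because the in-place call returns `None`, writing `demo = demo.drop(..., inplace=True)` would overwrite the DataFrame with `None`.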


<a id='missing'></a>
## Handling Missing Values

---

<a id='find-missing'></a>
### Finding Missing Values

**Q.1** Include missing values from the `continent` variable in the `drinks` DataFrame when counting unique values.

*Hint: Try out the argument dropna with True and False.*

In [175]:
# A:
local_drinks.continent.nunique(dropna=False)

6

In [176]:
local_drinks.continent.nunique(dropna=True)

5

In [178]:
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Q.2** Create a Boolean Series indicating which values are missing or not missing in `continent`.

In [188]:
# A:
local_drinks['continent_isnull'] = local_drinks.continent.isnull()
local_drinks.head()

Unnamed: 0,country,beer,spirit,wine,liters,continent,continent_isnull
0,Afghanistan,0,0,0,0.0,AS,False
1,Albania,89,132,54,4.9,EU,False
2,Algeria,25,0,14,0.7,AF,False
3,Andorra,245,138,312,12.4,EU,False
4,Angola,217,57,45,5.9,AF,False


In [191]:
local_drinks.head(10)

Unnamed: 0,country,beer,spirit,wine,liters,continent,continent_isnull
0,Afghanistan,0,0,0,0.0,AS,False
1,Albania,89,132,54,4.9,EU,False
2,Algeria,25,0,14,0.7,AF,False
3,Andorra,245,138,312,12.4,EU,False
4,Angola,217,57,45,5.9,AF,False
5,Antigua & Barbuda,102,128,45,4.9,,True
6,Argentina,193,25,221,8.3,SA,False
7,Armenia,21,179,11,3.8,EU,False
8,Australia,261,72,212,10.4,OC,False
9,Austria,279,75,191,9.7,EU,False


**Q.3** Subset to rows in `drinks` where `continent` is missing and where `continent` is not missing.

In [40]:
# A:
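
One possible approach, sketched on a made-up three-row table rather than the real `drinks` data (`drinks_demo` and its values are hypothetical): `isnull()` and `notnull()` each return a Boolean mask that can be used to index the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for drinks with one missing continent.
drinks_demo = pd.DataFrame({
    'country': ['Albania', 'Antigua & Barbuda', 'Argentina'],
    'beer': [89, 102, 193],
    'continent': ['EU', np.nan, 'SA'],
})

missing = drinks_demo[drinks_demo['continent'].isnull()]    # rows with no continent
present = drinks_demo[drinks_demo['continent'].notnull()]   # rows with a continent

print(missing)
print(present)
```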

**Q.4** Calculate the sum of `drinks`' *columns* and the sum of its *rows*.

*Hint: Try out the axis argument inside .sum()*

In [41]:
# A:
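
A sketch of how `axis` changes the direction of the sum, using a made-up all-numeric table (`demo` is hypothetical). The real `drinks` DataFrame also has text columns, so there it may be necessary to select the numeric columns first or pass `numeric_only=True`, depending on the pandas version.

```python
import pandas as pd

# Hypothetical all-numeric table.
demo = pd.DataFrame({'beer': [89, 25], 'spirit': [132, 0], 'wine': [54, 14]})

column_totals = demo.sum(axis=0)   # collapses the rows: one total per column
row_totals = demo.sum(axis=1)      # collapses the columns: one total per row

print(column_totals)
print(row_totals)
```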

**Side Note: Adding Booleans**
```python
pd.Series([True, False, True])  # Creates a Boolean Series
pd.Series([True, False, True]).sum()  # Converts `False` to 0 and `True` to 1
```

**Q.5** Find the number of missing values by column in `drinks`.

In [42]:
# A:
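
Combining `isnull()` with the Boolean-sum trick from the side note gives a count of missing values per column. A minimal sketch on made-up data (`demo` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing continents.
demo = pd.DataFrame({
    'country': ['Albania', 'Bahamas', 'Canada'],
    'beer': [89, 100, 200],
    'continent': ['EU', np.nan, np.nan],
})

# isnull() gives a DataFrame of True/False; sum() counts the True values per column.
missing_per_column = demo.isnull().sum()

print(missing_per_column)
```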

<a id='drop-missing'></a>
### Dropping Missing Values

**Q.1** Drop rows where *ANY* values are missing in `drinks` (returning a new DataFrame).  
_Make sure you know ahead of time exactly what you'll be dropping._

In [43]:
# A:

**Q.2** Drop rows only where *ALL* values are missing in `drinks`.

*Hint: Check out the `how` argument*

In [44]:
# A:
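
The `how` argument decides how strict `dropna` is. The sketch below uses a made-up table (`demo` is hypothetical) with one partially missing row and one fully missing row to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 1 is partly missing, row 2 is entirely missing.
demo = pd.DataFrame({
    'country': ['Albania', 'Bahamas', np.nan],
    'continent': ['EU', np.nan, np.nan],
})

any_dropped = demo.dropna(how='any')   # drops rows with at least one missing value
all_dropped = demo.dropna(how='all')   # drops only rows where every value is missing

print(len(any_dropped), len(all_dropped))
```

Both calls return new DataFrames, so `demo` still has all three rows afterwards.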

<a id='fill-missing'></a>
### Filling in Missing Values

What's up with these `NaN` continents? Have a look at them.

In [45]:
# A:

_You probably figured it out already, but all of these countries are in North America (`NA`); when the file was read in, `NA` was misinterpreted as a null (`NaN`) value._

**Q.1** Fill in the missing values of the `continent` column using the string `'NA'`.

In [46]:
# A:
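
A minimal sketch of `fillna` on a made-up Series (`continent` here is a hypothetical stand-in for the real column). It returns a new Series, so the result has to be assigned back to keep it.

```python
import numpy as np
import pandas as pd

# Hypothetical continent values with two gaps.
continent = pd.Series(['EU', np.nan, 'SA', np.nan], name='continent')

# Replace every missing value with the string 'NA'.
filled = continent.fillna('NA')

print(filled.tolist())
```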

**Q.2** Turn off the missing value filter when loading the `drinks` `.csv`.

*Hint: Check out the `na_filter` argument in `read_csv()`*

In [47]:
# A:
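
The sketch below shows the effect of `na_filter` on a small in-memory CSV; the rows and the name `raw` are made up, and `io.StringIO` stands in for the real file. By default `read_csv` treats the literal text `NA` as missing, and `na_filter=False` turns that detection off.

```python
import io

import pandas as pd

# Hypothetical two-row CSV where 'NA' means North America.
raw = "country,beer_servings,continent\nAlbania,89,EU\nBahamas,100,NA\n"

# Default: the text 'NA' is converted to NaN.
default = pd.read_csv(io.StringIO(raw))

# na_filter=False: no missing-value detection, so 'NA' stays a string.
unfiltered = pd.read_csv(io.StringIO(raw), na_filter=False)

print(default['continent'].tolist())
print(unfiltered['continent'].tolist())
```

Turning the filter off applies to every column, so genuinely empty cells are no longer marked as missing either.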