<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 1

_Authors: Joseph Nelson (DC)_

---

**Warning: This is a resource-heavy notebook that can consume a lot of RAM, especially when it's run in Chrome. For this lesson, you may want to close idle applications and/or open this notebook with Safari.**

### Lesson Guide
- [The Basics of `pandas` DataFrames](#basics)
    - [Loading Data](#loading)
    - [A Basic Examination of DataFrames](#examine)
    - [Selecting Columns](#selecting)
    - [Describing Data](#describing)
- [Exercise #1](#exercise-1)
- [Filtering and Sorting DataFrames](#filtering-sorting)
    - [Boolean Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise #2](#exercise-2)
- [Renaming, Adding, and Removing Columns](#columns)
    - [Renaming Columns](#renaming-columns)
    - [Adding Columns](#adding-columns)
    - [Removing Columns](#removing-columns)
- [Handling Missing Values](#missing)
    - [Finding Missing Values](#find-missing)
    - [Dropping Missing Values](#drop-missing)
    - [Filling in Missing Values](#fill-missing)


<a id='basics'></a>

## The Basics of `pandas` DataFrames

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading Data

**Q.1** You can read in a file either from your local computer or directly from a URL.

```Python
# Local:
users = pd.read_table('../datasets/users.txt')

# Remote:
users = pd.read_table('https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users.txt')
```

Read in the data using the method you prefer.

In [2]:
users = pd.read_table('../datasets/users.txt')

In [3]:
users.head(2)

Unnamed: 0,user_id|age|gender|occupation|zip_code
0,1|24|M|technician|85711
1,2|53|F|other|94043


**Q.2** Use keyword arguments (kwargs) such as `sep` and `index_col` to set appropriate data-reading parameters.

In [4]:
# Read the table into `users`, splitting fields on `|` and using `user_id` as the index.
users = pd.read_table('../datasets/users.txt', sep='|', index_col='user_id')
users.head(2)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
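
To see what `sep` and `index_col` change without relying on the lesson's files, here is a minimal sketch that parses a small pipe-delimited string. The `sample_text` variable is an assumption made for illustration (its two rows mirror the output above); only `sep` and `index_col` come from the lesson.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout of users.txt (pipe-delimited, with a header row).
sample_text = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
)

# Without `sep`, read_table splits on tabs, so every row lands in a single column.
raw = pd.read_table(io.StringIO(sample_text))

# With sep='|' the fields are split, and index_col promotes `user_id` to the index.
parsed = pd.read_table(io.StringIO(sample_text), sep='|', index_col='user_id')

print(raw.shape)     # a single column
print(parsed.shape)  # four data columns; `user_id` is now the index
```

Comparing the two shapes is a quick way to check whether the delimiter was picked up.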


<a id='examine'></a>
### A Basic Examination of DataFrames

**Q.1** Print the type of `users`.

In [5]:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first five rows, first 10 rows, and last two rows of `users`.

In [6]:
users.head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213


In [7]:
users.head(10)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,5201
9,29,M,student,1002
10,53,M,lawyer,90703


In [8]:
users.tail(2)

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
942,48,F,librarian,78209
943,22,M,student,77841


**Q.3** Print the index and columns.

In [9]:
print(users.index[0:5])
print(users.columns)

Int64Index([1, 2, 3, 4, 5], dtype='int64', name='user_id')
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')


**Q.4** Find the dtypes of the columns.

In [10]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the DataFrame.

In [11]:
users.shape

(943, 4)

**Q.6** Extract the underlying `numpy` array as a new variable.

In [12]:
X = users.values
print(type(X), X.shape)

<class 'numpy.ndarray'> (943, 4)
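
As a side note, `DataFrame.to_numpy()` is an alternative to the `.values` attribute for the same conversion. The sketch below uses a made-up two-row frame (`toy` is not part of the lesson data) to show that mixed column types produce an `object` array, while a single numeric column keeps its numeric dtype.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `users`; the values are illustrative only.
toy = pd.DataFrame({'age': [24, 53], 'gender': ['M', 'F']})

arr = toy.to_numpy()            # same idea as toy.values
ages = toy['age'].to_numpy()    # a single numeric column keeps an integer dtype

print(type(arr), arr.shape, arr.dtype)
print(ages.dtype)
```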


<a id='selecting'></a>
### Selecting Columns

**Q.1** Assign the `gender` column to a variable.

In [13]:
gender = users['gender']
# or
gender = users.gender

_The former method is preferred, because attribute access breaks when a column name contains spaces or periods, or when it matches an existing DataFrame attribute or method (for example, `count` or `shape`)._
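
A small sketch of where dot access goes wrong, using made-up column names (`zip code` and `count` are assumptions chosen for illustration, not columns in `users`):

```python
import pandas as pd

# Hypothetical columns chosen to show where dot access breaks down.
df = pd.DataFrame({'zip code': ['85711', '94043'], 'count': [3, 5]})

zips = df['zip code']   # bracket access works for any column name
counts = df['count']    # returns the column as a Series
method = df.count       # dot access returns the DataFrame.count method, not the column

print(type(counts).__name__)
print(callable(method))
```

A name such as `zip code` cannot be written with dot notation at all, since `df.zip code` is a syntax error.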

**Q.2** What is the type of `gender`?

In [14]:
type(gender)

pandas.core.series.Series

**Q.3** Select `gender` and `occupation` as a new DataFrame.

In [15]:
gen_occ = users[['gender','occupation']]

<a id='describing'></a>
### Describing Data

**Q.1** Calculate the descriptive statistics for the numeric columns in the DataFrame (_which is the function default_).  

In [16]:
users.describe()

Unnamed: 0,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


**Q.2** Describe the "object" (string) columns.

In [17]:
users.describe(include=['object'])

Unnamed: 0,gender,occupation,zip_code
count,943,943,943
unique,2,21,795
top,M,student,55414
freq,670,196,9


**Q.3** Describe all of the columns, regardless of type.

In [18]:
users.describe(include='all')

Unnamed: 0,age,gender,occupation,zip_code
count,943.0,943,943,943.0
unique,,2,21,795.0
top,,M,student,55414.0
freq,,670,196,9.0
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,


**Q.4** Describe the `gender` Series from the `users` DataFrame.

In [20]:
users['gender'].describe()

count     943
unique      2
top         M
freq      670
Name: gender, dtype: object

**Q.5** Calculate the mean of the `age` column.

In [21]:
users['age'].mean()

34.05196182396607

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [22]:
# Most useful for categorical variables:
users['gender'].value_counts()

M    670
F    273
Name: gender, dtype: int64

In [23]:
# You can also use it on numeric:
users['age'].value_counts()[0:5]

30    39
25    38
22    37
28    36
27    35
Name: age, dtype: int64
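
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up Series (the four values are illustrative, not the lesson data):

```python
import pandas as pd

# Toy Series standing in for users['gender'].
gender = pd.Series(['M', 'F', 'M', 'M'], name='gender')

shares = gender.value_counts(normalize=True)  # fractions that sum to 1
print(shares)
```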

<a id='exercise-1'></a>
## Exercise #1

---

Load the `drinks.csv` data provided in the URL below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the `drinks` DataFrame.
8. Find the first three items of the value counts of the `occupation` column in the `users` DataFrame.

**BONUS:**
1. Create the `users` DataFrame from the `user_file` provided (which lacks a header row).
2. Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`.


In [24]:
# Use your preferred file location to read in the data:
remote_drinks_csv = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/drinks.csv'
local_drinks_csv = '../datasets/drinks.csv'
# and
remote_user_file = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/users_original.txt'
local_user_file = '../datasets/users_original.txt'

In [25]:
# Read `drinks.csv` into a DataFrame called `drinks`.
drinks = pd.read_table(local_drinks_csv, sep=',')
# or
drinks = pd.read_csv(local_drinks_csv)   # assumes separator is comma


In [26]:
# Print the head and the tail.
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [27]:
drinks.tail()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF
192,Zimbabwe,64,18,4,4.7,AF


In [28]:
# Examine the default index, data types, and shape.
print(drinks.index)
print(drinks.dtypes)
print(drinks.shape)

RangeIndex(start=0, stop=193, step=1)
country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object
(193, 6)


In [29]:
# Print the 'beer_servings' Series.
drinks['beer_servings']

0        0
1       89
2       25
3      245
4      217
5      102
6      193
7       21
8      261
9      279
10      21
11     122
12      42
13       0
14     143
15     142
16     295
17     263
18      34
19      23
20     167
21      76
22     173
23     245
24      31
25     231
26      25
27      88
28      37
29     144
      ... 
163    128
164     90
165    152
166    185
167      5
168      2
169     99
170    106
171      1
172     36
173     36
174    197
175     51
176     51
177     19
178      6
179     45
180    206
181     16
182    219
183     36
184    249
185    115
186     25
187     21
188    333
189    111
190      6
191     32
192     64
Name: beer_servings, Length: 193, dtype: int64

In [30]:
# Calculate summary statistics (including the mean of `beer_servings`) for the entire data set.
drinks.describe()   # Summarize all numeric columns.

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0
mean,106.160622,80.994819,49.450777,4.717098
std,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0
25%,20.0,4.0,1.0,1.3
50%,76.0,56.0,8.0,4.2
75%,188.0,128.0,59.0,7.2
max,376.0,438.0,370.0,14.4


In [31]:
drinks['beer_servings'].describe()     # Summarize only the `beer_servings` Series.

count    193.000000
mean     106.160622
std      101.143103
min        0.000000
25%       20.000000
50%       76.000000
75%      188.000000
max      376.000000
Name: beer_servings, dtype: float64

In [32]:
drinks['beer_servings'].mean()         # Only calculate the mean.

106.16062176165804
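
The exercise list also asks for the median, which `Series.median()` returns directly. The sketch below uses only the five `beer_servings` values shown in the `drinks.head()` output, so its result applies to that small sample, not to the full data set.

```python
import pandas as pd

# The five `beer_servings` values from the head() output above (a sample, not the full column).
beer = pd.Series([0, 89, 25, 245, 217], name='beer_servings')

beer_median = beer.median()          # middle value once sorted
beer_mean = beer.mean()              # arithmetic average, for comparison
same_as_describe = beer.describe()['50%']  # describe() reports the median as the 50% row

print(beer_median, beer_mean)
```

On the real DataFrame the equivalent call would be `drinks['beer_servings'].median()`.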

In [33]:
# Count the number of occurrences of each `continent` value and see if it looks correct.
drinks['continent'].value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64

In [34]:
# BONUS: Display only the number of rows of the `users` DataFrame.
users.shape[0]

943

In [35]:
# BONUS: Display the three most frequent occupations in `users`.
users['occupation'].value_counts().head(3)
# or
users['occupation'].value_counts()[:3]

student     196
other       105
educator     95
Name: occupation, dtype: int64

In [36]:
# BONUS: Create the `users` DataFrame from the `users_original.txt` file (which lacks a header row).
# Hint: Read the `pandas.read_table` documentation.
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
users = pd.read_table(local_user_file, sep='|', header=None, names=user_cols, index_col='user_id')
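
To illustrate why `header=None` matters, here is a sketch on a made-up headerless string (`headerless` is an assumption standing in for `users_original.txt`). Without `header=None`, the first data row is consumed as the column names.

```python
import io

import pandas as pd

# Hypothetical two-row sample with no header line.
headerless = "1|24|M|technician|85711\n2|53|F|other|94043\n"
cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']

# Correct: declare that there is no header and supply the names ourselves.
with_names = pd.read_table(
    io.StringIO(headerless), sep='|', header=None, names=cols, index_col='user_id'
)

# Pitfall: the default treats the first row as the header, so one record is lost.
wrong = pd.read_table(io.StringIO(headerless), sep='|')

print(with_names.shape)
print(list(wrong.columns))
```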

<a id='filtering-sorting'></a>

## Filtering and Sorting DataFrames

---


<a id='filtering'></a>
### Boolean Filtering

**Q.1** Show users `age < 20` using a Boolean mask.

In [40]:
# Boolean filtering: Only show users age < 20.
young_bool = users['age'] < 20         # Create a Series of Booleans...
users[young_bool].head()                 # ...and use that Series to filter rows.

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
36,19,F,student,93117
52,18,F,student,55105
57,16,M,none,84010
67,17,M,student,60402


**Q.2** Calculate the value counts of `occupation` for users `age < 20`.

In [42]:
# users[users.age < 20]               # Or, combine into a single step.
# users[users.age < 20].occupation    # Select one column from the filtered results.
users[users['age'] < 20]['occupation'].value_counts()     # `value_counts` of resulting Series

student          64
other             4
none              3
writer            2
entertainment     2
artist            1
salesman          1
Name: occupation, dtype: int64
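
An alternative to chaining two sets of brackets is `.loc`, which takes the row mask and the column label in one step. A minimal sketch on a made-up frame (`toy_users` and its values are illustrative):

```python
import pandas as pd

# Toy frame standing in for `users`.
toy_users = pd.DataFrame(
    {'age': [7, 19, 53, 16],
     'occupation': ['student', 'student', 'other', 'none']},
    index=pd.Index([30, 36, 2, 57], name='user_id'),
)

# Row mask first, column label second.
young_occ = toy_users.loc[toy_users['age'] < 20, 'occupation']
counts = young_occ.value_counts()
print(counts)
```

The single `.loc` call also tends to be the safer form when the filtered result will later be assigned to.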

**Q.3** Print the male users `age < 20`. 

In [43]:
# Boolean filtering with multiple conditions:
users[(users['age'] < 20) & (users['gender']=='M')]       # Ampersand for AND condition

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
57,16,M,none,84010
67,17,M,student,60402
68,19,M,student,22904
101,15,M,student,5146
110,19,M,student,77840
142,13,M,other,48118
179,15,M,entertainment,20755
221,19,M,student,20685
246,19,M,student,28734


**Q.4** Print the users `age < 10` or `age > 70`.

In [44]:
# Pipe for OR condition
users[(users['age'] < 10) | (users['age'] > 70)].head()          


Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
481,73,M,retired,37771


<a id='sorting'></a>
### Sorting

**Q.1** Return the `age` column sorted in ascending order.

In [45]:
users['age'].sort_values()                   # Sort a column.

user_id
30      7
471    10
289    11
880    13
609    13
142    13
674    13
628    13
813    14
206    14
887    14
849    15
281    15
461    15
618    15
179    15
101    15
57     16
580    16
550    16
451    16
434    16
621    17
619    17
761    17
375    17
904    17
646    17
582    17
257    17
       ..
90     60
308    60
931    60
752    60
469    60
464    60
234    60
694    60
934    61
351    61
106    61
520    62
266    62
858    63
777    63
364    63
845    64
423    64
318    65
651    65
564    65
211    66
349    68
573    68
559    69
585    69
767    70
803    70
860    70
481    73
Name: age, Length: 943, dtype: int64

**Q.2** Sort the `users` DataFrame by the `age` column (ascending).

In [46]:
users.sort_values('age').head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
30,7,M,student,55436
471,10,M,student,77459
289,11,M,none,94619
880,13,M,student,83702
609,13,F,student,55106


**Q.3** Sort the `users` DataFrame by the `age` column in *descending* order.

In [47]:
users.sort_values('age', ascending=False).head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
481,73,M,retired,37771
803,70,M,administrator,78212
767,70,M,engineer,0
860,70,F,retired,48322
585,69,M,librarian,98501


<a id='exercise-2'></a>

## Exercise #2

---

**Using the `drinks` DataFrame from the previous exercise:**
1. Filter `drinks` to include only European countries.
2. Filter `drinks` to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.

**Using the `users` DataFrame:**
1. Sort `users` by occupation and then by `age` in a single command.
2. Filter `users` to only include doctors and lawyers without using a `|`.

> **Hint:** Look up `pandas.Series.isin`.

In [49]:
# Filter `drinks` to only include European countries.
drinks[drinks['continent']=='EU']

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89,132,54,4.9,EU
3,Andorra,245,138,312,12.4,EU
7,Armenia,21,179,11,3.8,EU
9,Austria,279,75,191,9.7,EU
10,Azerbaijan,21,46,5,1.3,EU
15,Belarus,142,373,42,14.4,EU
16,Belgium,295,84,212,10.5,EU
21,Bosnia-Herzegovina,76,173,8,4.6,EU
25,Bulgaria,231,252,94,10.3,EU
42,Croatia,230,87,254,10.2,EU


In [50]:
# Filter `drinks` to only include European countries with `wine_servings` > 300.
drinks[(drinks['continent']=='EU') & (drinks['wine_servings'] > 300)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245,138,312,12.4,EU
61,France,127,151,370,11.8,EU
136,Portugal,194,67,339,11.0,EU


In [51]:
# Calculate the mean `beer_servings` for all of Europe.
drinks[drinks['continent']=='EU']['beer_servings'].mean()

193.77777777777777

In [52]:
# Determine which 10 countries have the highest `total_litres_of_pure_alcohol`.
drinks.sort_values('total_litres_of_pure_alcohol').tail(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
99,Luxembourg,236,133,271,11.4,EU
155,Slovakia,196,293,116,11.4,EU
81,Ireland,313,118,165,11.4,EU
141,Russian Federation,247,326,73,11.5,AS
61,France,127,151,370,11.8,EU
45,Czech Republic,361,170,134,11.8,EU
68,Grenada,199,438,28,11.9,
3,Andorra,245,138,312,12.4,EU
98,Lithuania,343,244,56,12.9,EU
15,Belarus,142,373,42,14.4,EU
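
`DataFrame.nlargest` is another way to get the top rows, and it returns them already in descending order. The sketch below compares it with sorting on a made-up frame (the country labels `A` to `D` are placeholders):

```python
import pandas as pd

# Toy frame with placeholder country labels.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'total_litres_of_pure_alcohol': [4.9, 14.4, 0.7, 12.4],
})

top2_sorted = toy.sort_values('total_litres_of_pure_alcohol', ascending=False).head(2)
top2_nlargest = toy.nlargest(2, 'total_litres_of_pure_alcohol')

print(top2_nlargest)
```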


In [53]:
# BONUS: Sort `users` by `occupation` and then by `age` in a single command.
users.sort_values(['occupation', 'age'])

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
118,21,M,administrator,90210
180,22,F,administrator,60202
282,22,M,administrator,20057
317,22,M,administrator,13210
439,23,F,administrator,20817
509,23,M,administrator,10011
394,25,M,administrator,96819
665,25,M,administrator,55412
726,25,F,administrator,80538
78,26,M,administrator,61801


In [54]:
# BONUS: Filter `users` to only include doctors and lawyers without using a `|`.
# Hint: Read the `pandas.Series.isin` documentation.
users[users['occupation'].isin(['doctor', 'lawyer'])].head()

Unnamed: 0_level_0,age,gender,occupation,zip_code
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
10,53,M,lawyer,90703
125,30,M,lawyer,22202
126,28,F,lawyer,20015
138,46,M,doctor,53211
161,50,M,lawyer,55104


<a id='columns'></a>

## Renaming, Adding, and Removing Columns

---

<a id='renaming-columns'></a>
### Renaming Columns

**Q.1** Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a *new* DataFrame.

In [55]:
# Rename one or more columns via dictionary pairs.
renamed_drinks = drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

**Q.2** Perform the same renaming for `drinks`, but in place.

In [56]:
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)


In [57]:
drinks.head()

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Q.3** Replace the column names of `drinks` with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

In [58]:
# Replace all column names.
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
drinks.columns = drink_cols

# Side Note: You can replace these names when loading a `.csv` or other file:
# drinks = pd.read_csv('drinks.csv', header=0, names=drink_cols)

<a id='adding-columns'></a>
### Adding Columns

**Q.1** Make a `servings` column that combines `beer`, `spirit`, and `wine`.

In [59]:
drinks['servings'] = drinks['beer'] + drinks['spirit'] + drinks['wine']

**Q.2** Make an `mL` column that is the `liters` column multiplied by 1,000.

In [60]:
drinks['mL'] = drinks['liters'] * 1000
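
Bracket assignment modifies the DataFrame in place. If a new DataFrame is preferred, `DataFrame.assign` returns a copy with the extra columns. A minimal sketch on a made-up two-row frame:

```python
import pandas as pd

# Toy frame standing in for the renamed `drinks`.
toy = pd.DataFrame({
    'beer': [89, 25], 'spirit': [132, 0], 'wine': [54, 14], 'liters': [4.9, 0.7],
})

# assign() leaves `toy` untouched and returns a new DataFrame.
with_totals = toy.assign(
    servings=toy['beer'] + toy['spirit'] + toy['wine'],
    mL=toy['liters'] * 1000,
)

print(with_totals.columns.tolist())
print(toy.columns.tolist())
```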

<a id='removing-columns'></a>
### Removing Columns

**Q.1** Remove the `mL` column, returning a new DataFrame.

In [61]:
dropped = drinks.drop('mL', axis=1) # axis=0 for rows, 1 for columns


**Q.2** Remove the `mL` and `servings` columns from `drinks` in place.

In [62]:
drinks.drop(['mL', 'servings'], axis=1, inplace=True)   # Drop multiple columns.
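
`drop` also accepts a `columns=` keyword, which avoids having to remember the axis number. A sketch on a made-up one-row frame:

```python
import pandas as pd

# Toy frame; the values are illustrative.
toy = pd.DataFrame({'beer': [89], 'mL': [4900.0], 'servings': [275]})

# Equivalent to toy.drop(['mL', 'servings'], axis=1); returns a new DataFrame.
trimmed = toy.drop(columns=['mL', 'servings'])

print(trimmed.columns.tolist())
```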

<a id='missing'></a>
## Handling Missing Values

---

<a id='find-missing'></a>
### Finding Missing Values

**Q.1** Include missing values from the `continent` variable in the `drinks` DataFrame when counting unique values.

In [63]:
# Missing values are usually excluded by default.
print(drinks['continent'].value_counts())              # Excludes missing values
print(drinks['continent'].value_counts(dropna=False))  # Includes missing values

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64
AF     53
EU     45
AS     44
NaN    23
OC     16
SA     12
Name: continent, dtype: int64


**Q.2** Create Boolean Series indicating which values are missing and which are not missing in `continent`.

In [64]:
# Find missing values in a Series.
is_null = drinks['continent'].isnull() # True if missing
is_not_null = drinks['continent'].notnull() # True if not missing

**Q.3** Subset to rows in `drinks` where `continent` is missing and where `continent` is not missing.

In [65]:
# Use a Boolean Series to filter DataFrame rows.
drinks_continent_null = drinks[drinks['continent'].isnull()]   # Only show rows where `continent` is missing
drinks_continent_notnull = drinks[drinks['continent'].notnull()]  # Only show rows where `continent` is not missing

**Q.4** Calculate the sum of `drinks`' *columns* and the sum of its *rows*.

In [66]:
# Side Note: Understanding axes
print(drinks.sum())      # Sums "down" the 0 axis (rows)
print(drinks.sum(axis=0))# Equivalent (as axis=0 is the default)
print(drinks.sum(axis=1, numeric_only=True).head())  # Sums "across" the 1 axis (columns); numeric columns only

country    AfghanistanAlbaniaAlgeriaAndorraAngolaAntigua ...
beer                                                   20489
spirit                                                 15632
wine                                                    9544
liters                                                 910.4
dtype: object
country    AfghanistanAlbaniaAlgeriaAndorraAngolaAntigua ...
beer                                                   20489
spirit                                                 15632
wine                                                    9544
liters                                                 910.4
dtype: object
0      0.0
1    279.9
2     39.7
3    707.4
4    324.9
dtype: float64
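
The axis convention is easier to check on a tiny all-numeric frame. In this sketch (values made up for illustration), `axis=0` collapses the rows to give one total per column, and `axis=1` collapses the columns to give one total per row. The string aliases `'index'` and `'columns'` can be used in place of `0` and `1`.

```python
import pandas as pd

# Toy numeric frame; the values are illustrative.
toy = pd.DataFrame({'beer': [0, 89], 'spirit': [0, 132], 'wine': [0, 54]})

col_totals = toy.sum(axis=0)            # one value per column
row_totals = toy.sum(axis=1)            # one value per row
row_totals_alias = toy.sum(axis='columns')  # same as axis=1

print(col_totals)
print(row_totals)
```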


**Side Note: Adding Booleans**
```python
pd.Series([True, False, True])  # Creates a Boolean Series
pd.Series([True, False, True]).sum()  # Converts `False` to 0 and `True` to 1
```

**Q.5** Find the number of missing values by column in `drinks`.

In [67]:
# Find missing values in a DataFrame.
drinks.isnull()             # DataFrame of True/False Booleans
drinks.isnull().sum()       # Count the missing values in each column

country       0
beer          0
spirit        0
wine          0
liters        0
continent    23
dtype: int64
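
Because Booleans average as 0s and 1s, replacing `.sum()` with `.mean()` gives the share of missing values per column. A sketch on a made-up frame (the rows are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing continents out of four rows.
toy = pd.DataFrame({
    'country': ['Canada', 'France', 'Cuba', 'Chad'],
    'continent': [np.nan, 'EU', np.nan, 'AF'],
})

missing_count = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()  # fraction of missing values per column

print(missing_share)
```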

<a id='drop-missing'></a>
### Dropping Missing Values

**Q.1** Drop rows where *ANY* values are missing in `drinks` (returning a new DataFrame).  
_Make sure you know ahead of time exactly what you'll be dropping._

In [68]:
print(drinks.shape)
d = drinks.dropna()
print(d.shape)

(193, 6)
(170, 6)


**Q.2** Drop rows only where *ALL* values are missing in `drinks`.

In [69]:
print(drinks.shape)
d = drinks.dropna(how='all')
print(d.shape)

(193, 6)
(193, 6)
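
Between "any" and "all" there are two more targeted options: `subset=` restricts the check to particular columns, and `thresh=` keeps rows that have at least that many non-missing values. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame; the third row is missing in every column.
toy = pd.DataFrame({
    'country': ['Canada', 'France', None],
    'beer': [240, 127, np.nan],
    'continent': [np.nan, 'EU', np.nan],
})

by_subset = toy.dropna(subset=['continent'])  # drop rows only if `continent` is missing
by_thresh = toy.dropna(thresh=2)              # keep rows with at least 2 non-missing values

print(by_subset['country'].tolist())
print(len(by_thresh))
```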


<a id='fill-missing'></a>
### Filling in Missing Values

What's up with these `NaN` continents?

In [70]:
# Can we identify a trend?
drinks[drinks['continent'].isnull()].head(7)

Unnamed: 0,country,beer,spirit,wine,liters,continent
5,Antigua & Barbuda,102,128,45,4.9,
11,Bahamas,122,176,51,6.3,
14,Barbados,143,173,36,6.3,
17,Belize,263,114,8,6.8,
32,Canada,240,122,100,8.2,
41,Costa Rica,149,87,11,4.4,
43,Cuba,93,137,5,4.2,


_You probably figured it out already: all of these countries are in North America, whose continent code is `NA`. When the file was read in, `pandas` interpreted the string `NA` as a missing value (`NaN`)._

**Q.1** Fill in the missing values of the `continent` column using string `NA`.

In [71]:
# Fill in the missing values.
drinks['continent'] = drinks['continent'].fillna('NA')   # Fill in the missing values with `NA` and assign the result back.

**Q.2** Turn off the missing value filter when loading the `drinks` `.csv`.

In [72]:
# Turn off the missing value filter.
drinks = pd.read_csv(local_drinks_csv, header=0, names=drink_cols, na_filter=False)
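
Note that `na_filter=False` disables missing-value detection entirely, so genuinely empty cells come back as empty strings. One possible middle ground, sketched below on a made-up CSV string, is `keep_default_na=False` together with an explicit `na_values` list, so that `NA` is kept as text while empty cells are still treated as missing. The `csv_text` sample and the `Atlantis` row are hypothetical and not part of `drinks.csv`.

```python
import io

import pandas as pd

# Hypothetical sample: one 'NA' continent code and one empty cell.
csv_text = "country,continent\nCanada,NA\nFrance,EU\nAtlantis,\n"

default = pd.read_csv(io.StringIO(csv_text))                   # 'NA' and '' both become NaN
no_filter = pd.read_csv(io.StringIO(csv_text), na_filter=False)  # nothing becomes NaN
selective = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=['']
)                                                              # only '' becomes NaN

print(default['continent'].isnull().sum())
print(no_filter['continent'].tolist())
print(selective['continent'].tolist())
```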