<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Pandas Data Munging Full Overview

_Authors: Joseph Nelson (DC)_

---


### Lesson Guide
- [Basics of pandas dataframes](#basics)
    - [Loading data](#loading)
    - [Basic examination of data](#examine)
    - [Selecting columns](#selecting)
    - [Describing the data](#describing)
- [Exercise 1](#exercise-1)
- [Filtering and sorting dataframes](#filtering-sorting)
    - [Filtering](#filtering)
    - [Sorting](#sorting)
- [Exercise 2](#exercise-2)
- [Renaming, adding, and removing columns](#columns)
    - [Renaming columns](#renaming-columns)
    - [Adding columns](#adding-columns)
    - [Removing columns](#removing-columns)
- [Handling missing values](#missing)
    - [Find missing values](#find-missing)
    - [Drop missing values](#drop-missing)
    - [Fill in missing values](#fill-missing)
- [Exercise 3](#exercise-3)
- [Split-apply-combine](#split-apply-combine)
    - [Groupby](#groupby)
    - [Apply and combine](#apply-combine)
- [Exercise 4](#exercise-4)
- [Indexing](#indexing)
    - [Location indexing with .loc](#loc)
    - [Position indexing with .iloc](#iloc)
- [Other frequently used features](#frequent)
    - [Use map functions with replacement dictionaries](#map-dict)
    - [Encode strings as integers with .factorize](#factorize)
    - [Determine unique values](#unique)
    - [Replace values with .replace](#replace)
    - [Series string methods with .str](#series-str)
    - [Datetime conversion and arithmetic](#datetime)
    - [Setting and resetting the index](#set-reset-index)
    - [Sort by index](#sort-by-index)
    - [Change data type of a column](#change-dtype)
    - [Create dummy-coded columns](#dummy)
    - [Concatenate dataframes](#concatenate)
    - [Detect and drop duplicate rows](#duplicate-rows)
    - [Write a dataframe to a csv](#write-csv)
    - [Pickle a dataframe](#pickle)
    - [Randomly sample a dataframe](#sample)
- [Infrequently used features](#infrequent)
    - [Creating dataframes from dictionaries and lists of lists](#toy-dataframes)
    - [Doing cross-tabulations](#crosstab)
    - [Query filtering syntax](#query)
    - [Calculate memory usage](#memory-usage)
    - [Converting column to category type](#category-type)
    - [Creating columns with the assign function](#assign)
    - [Limit number of rows to load on file read](#limit-rows-read)
    - [Manually set number of rows and columns to print](#manual-print)

<a id='basics'></a>

## Reading Files, Selecting Columns, and Summarizing

---

In [1]:
import pandas as pd

<a id='loading'></a>
### Loading data

**Q.1** You can read a file from your local computer or directly from a URL.

In [10]:
# Local:
# pd.read_table('u.user')

# Remote:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user')
users

|  | user_id\|age\|gender\|occupation\|zip_code |
|---|---|
| 0 | 1\|24\|M\|technician\|85711 |
| 1 | 2\|53\|F\|other\|94043 |
| 2 | 3\|23\|M\|writer\|32067 |
| 3 | 4\|24\|M\|technician\|43537 |
| 4 | 5\|33\|F\|other\|15213 |
| ... | ... |
| 938 | 939\|26\|F\|student\|33319 |
| 939 | 940\|32\|M\|administrator\|02215 |
| 940 | 941\|20\|M\|student\|97229 |
| 941 | 942\|48\|F\|librarian\|78209 |


**Q.2** Use keyword arguments to parse the file correctly: set the field separator and use `user_id` as the index.

In [15]:
# A:
users = pd.read_table(
    'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
    sep='|',
    index_col='user_id',
)
users

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
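
The effect of `sep` and `index_col` can also be checked offline. The sketch below is only an illustration: the three-row sample text is made up to mimic the layout of `u.user`, the name `sample_users` is hypothetical, and `pd.read_csv` stands in for `pd.read_table` (both accept `sep` and `index_col`).

```python
import io

import pandas as pd

# Hypothetical sample that mimics the pipe-delimited layout of u.user
raw = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# sep says how fields are delimited; index_col chooses the column used as row labels
sample_users = pd.read_csv(raw, sep="|", index_col="user_id")
print(sample_users)
```

Without `sep="|"`, every line would be read into a single column, as in the Q.1 output above.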


<a id='examine'></a>
### Basic examination of dataframes

**Q.1** Print the type of `users`.

In [17]:
# A:
type(users)

pandas.core.frame.DataFrame

**Q.2** Print the first 5 rows, first 10 rows, and last 2 rows of `users`.

In [18]:
# A:
users.head(5)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |


In [19]:
# A:
users.head(10)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 5201 |
| 9 | 29 | M | student | 1002 |
| 10 | 53 | M | lawyer | 90703 |


In [20]:
# A:
users.tail(2)

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |


**Q.3** Print the index and columns.

In [25]:
# A:
print(users.index)
print(users.columns)

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)
Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')


**Q.4** Find the dtypes of the columns.

In [27]:
# A:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Q.5** Find the dimensions of the dataframe.

In [28]:
# A:
users.shape

(943, 4)

**Q.6** Extract the underlying numpy array as a new variable.

In [29]:
# A:
users_array = users.values

<a id='selecting'></a>
### Selecting columns

**Q.1** Assign the `gender` column to a variable.

In [33]:
# A:
gender_column = users.gender

**Q.2** What is the type of `gender`?

In [40]:
# A:
type(users.gender)

pandas.core.series.Series

**Q.3** Select `gender` and `occupation` as a new dataframe.

In [14]:
# A:

<a id='describing'></a>
### Describing the data

**Q.1** Calculate the descriptive statistics for the numeric columns in the dataframe.

In [15]:
# A:

**Q.2** Describe the "object" (string) columns.

In [16]:
# A:

**Q.3** Describe all the columns regardless of type.

In [17]:
# A:

**Q.4** Describe the `gender` Series from the `users` dataframe.

In [18]:
# A:

**Q.5** Calculate the mean of the `age` column.

In [19]:
# A:

**Q.6** Calculate the counts of distinct values in the `gender` and `age` columns.

In [20]:
# A:
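
The summary calls in this section can be tried on a small made-up dataframe first. This is a sketch under stated assumptions: `toy_users` and its values are invented, and the `gender` column is cast to `object` explicitly so that `include=["object"]` matches regardless of the pandas version's default string dtype.

```python
import pandas as pd

# Made-up stand-in for the users dataframe
toy_users = pd.DataFrame({
    "age": [24, 53, 23, 24],
    "gender": ["M", "F", "M", "M"],
})
# Assumption: force the classic object dtype for the string column
toy_users["gender"] = toy_users["gender"].astype("object")

numeric_summary = toy_users.describe()                   # numeric columns only
object_summary = toy_users.describe(include=["object"])  # string columns only
full_summary = toy_users.describe(include="all")         # every column

mean_age = toy_users["age"].mean()
gender_counts = toy_users["gender"].value_counts()

print(object_summary)
print(mean_age)
print(gender_counts)
```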

<a id='exercise-1'></a>
## Exercise 1

---

Load the `drinks.csv` data from the URL provided below.

**Perform the following:**
1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the `beer_servings` column/Series to a variable.
4. Calculate summary statistics for `beer_servings`.
5. Calculate the median of `beer_servings`.
6. Count the values of unique categories in `continent`.
7. Print the dimensions of the drinks dataframe.
8. Find the first 3 items of the value counts of the `occupation` column in `users`.

**BONUS:**
- Create the 'users' DataFrame from the `user_file` provided (which lacks a header row).
- Supply a header: `['user_id', 'age', 'gender', 'occupation', 'zip_code']`


In [21]:
drinks_csv = 'https://raw.githubusercontent.com/josephnelson93/GA-DSI/master/example-lessons/plotting-with-pandas/drinks.csv'
user_file = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user_original'

In [22]:
# A:

<a id='filtering-sorting'></a>

## Filtering and sorting dataframes and series

---


<a id='filtering'></a>
### Boolean filtering

**Q.1** Show users with age < 20 using a boolean mask.

In [23]:
# A:

**Q.2** Calculate value counts of occupation for users age < 20.

In [24]:
# A:

**Q.3** Print the male users age < 20. 

In [25]:
# A:

**Q.4** Print the users age < 10 or age > 70.

In [26]:
# A:
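
Boolean masks can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a made-up dataframe named `toy_users`; the names and values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the users dataframe
toy_users = pd.DataFrame({
    "age": [7, 18, 19, 35, 73],
    "gender": ["M", "F", "M", "F", "M"],
})

# A comparison on a column yields a boolean Series (the mask)
under_20_mask = toy_users["age"] < 20
under_20 = toy_users[under_20_mask]

# Combine conditions with & and |; the parentheses are required
young_men = toy_users[(toy_users["age"] < 20) & (toy_users["gender"] == "M")]
extremes = toy_users[(toy_users["age"] < 10) | (toy_users["age"] > 70)]

print(under_20)
print(young_men)
print(extremes)
```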

<a id='sorting'></a>
### Sorting

**Q.1** Return the age column sorted (ascending order)

In [27]:
# A:

**Q.2** Sort the users dataframe by the age column (ascending).

In [28]:
# A:

**Q.3** Sort the users dataframe by the age column *descending*.

In [29]:
# A:
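
A minimal sketch of sorting with `sort_values`, on a made-up dataframe (`toy_users` and its values are assumptions):

```python
import pandas as pd

# Made-up stand-in for the users dataframe
toy_users = pd.DataFrame({
    "age": [35, 7, 19],
    "occupation": ["writer", "student", "student"],
})

sorted_ages = toy_users["age"].sort_values()                  # a sorted Series
by_age = toy_users.sort_values("age")                         # whole dataframe, ascending
by_age_desc = toy_users.sort_values("age", ascending=False)   # descending

print(by_age_desc)
```

Each call returns a new object; `toy_users` itself keeps its original row order.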

<a id='exercise-2'></a>

## Exercise 2

---

**Using the drinks dataframe from the previous exercise:**
1. Filter drinks to include only European countries.
2. Filter drinks to include only European countries with `wine_servings` > 300.
3. Calculate the mean `beer_servings` for all of Europe.
4. Which 10 countries have the highest `total_litres_of_pure_alcohol`?

**Using the users dataframe:**
1. Sort users by occupation and then by age in a single command.
2. Filter users to only include doctors and lawyers without using a `|`.

> *Hint:* look up `pandas.Series.isin`
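
As a rough illustration of the hint, `isin` tests each value for membership in a list and returns a boolean mask. The dataframe `toy_people` below is made up for this sketch.

```python
import pandas as pd

# Made-up data for illustration only
toy_people = pd.DataFrame({
    "occupation": ["doctor", "student", "lawyer", "writer"],
    "age": [45, 21, 38, 30],
})

# True where the value appears in the list
wanted = toy_people["occupation"].isin(["doctor", "lawyer"])
subset = toy_people[wanted]
print(subset)
```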

In [30]:
# A:

<a id='columns'></a>

## Renaming, adding, and removing columns

---

<a id='renaming-columns'></a>
### Renaming columns

**Q.1** Rename "beer_servings" to "beer" and "wine_servings" to "wine" in the drinks dataframe, returning a *new* dataframe.

In [31]:
# A:

**Q.2** Do the same renaming for drinks, but in place.

In [32]:
# A:

**Q.3** Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`

In [33]:
# A:
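
The three renaming styles can be compared on a small made-up dataframe. In this sketch, `toy_drinks`, `relabeled`, and the values are assumptions; only the column names follow the lesson.

```python
import pandas as pd

# Made-up stand-in for the drinks dataframe
toy_drinks = pd.DataFrame({
    "country": ["A", "B"],
    "beer_servings": [10, 20],
    "wine_servings": [1, 2],
})
new_names = {"beer_servings": "beer", "wine_servings": "wine"}

# 1) rename returns a new dataframe and leaves the original untouched
renamed = toy_drinks.rename(columns=new_names)
names_before_inplace = list(toy_drinks.columns)

# 2) inplace=True modifies the original and returns None
toy_drinks.rename(columns=new_names, inplace=True)

# 3) assigning to .columns replaces every name at once (lengths must match)
relabeled = toy_drinks.copy()
relabeled.columns = ["nation", "beer", "wine"]

print(list(renamed.columns), list(toy_drinks.columns), list(relabeled.columns))
```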

<a id='adding-columns'></a>
### Adding columns

**Q.1** Make a "servings" column that is beer + spirit + wine

In [34]:
# A:

**Q.2** Make a "mL" column that is the liters column times 1000.

In [35]:
# A:
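
New columns are created by assigning to a column label that does not exist yet. A minimal sketch, assuming a made-up `toy_drinks` dataframe that already uses the short column names:

```python
import pandas as pd

# Made-up stand-in for the renamed drinks dataframe
toy_drinks = pd.DataFrame({
    "beer": [0, 89],
    "spirit": [0, 132],
    "wine": [0, 54],
    "liters": [0.5, 4.0],
})

# Arithmetic on columns is element-wise, row by row
toy_drinks["servings"] = toy_drinks["beer"] + toy_drinks["spirit"] + toy_drinks["wine"]
toy_drinks["mL"] = toy_drinks["liters"] * 1000

print(toy_drinks)
```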

<a id='removing-columns'></a>
### Removing columns

**Q.1** Remove the "mL" column returning a new dataframe.

In [36]:
# A:

**Q.2** Remove the "mL" and "servings" columns from drinks inplace.

In [37]:
# A:
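
A sketch of both removal styles with `drop`, again on a made-up `toy_drinks` dataframe (names and values are assumptions):

```python
import pandas as pd

# Made-up stand-in for the drinks dataframe with the two extra columns
toy_drinks = pd.DataFrame({
    "beer": [0, 89],
    "liters": [0.5, 4.0],
    "servings": [0, 275],
    "mL": [500.0, 4000.0],
})

# axis=1 means "drop a column"; the result is a new dataframe
without_ml = toy_drinks.drop("mL", axis=1)

# With inplace=True the original dataframe is modified instead
toy_drinks.drop(["mL", "servings"], axis=1, inplace=True)

print(list(without_ml.columns), list(toy_drinks.columns))
```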

<a id='missing'></a>
## Handling missing values

---

<a id='find-missing'></a>
### Finding missing values

**Q.1** Include missing values of the continent variable in the drinks dataframe when counting unique values.

In [38]:
# A:

**Q.2** Create a boolean Series representing which values are missing and not missing in the `continent` column.

In [39]:
# A:

**Q.3** Subset to rows in drinks where continent is missing and where continent is not missing.

In [40]:
# A:

**Q.4** [Side note] Calculate the sum of drink *columns* and the sum of *rows*.

In [41]:
# A:

In [42]:
# A:

**Q.5** Find the number of missing values by column in drinks.

In [43]:
# A:
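
The missing-value tools in this section can be sketched on a made-up dataframe. The name `toy_drinks` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the drinks dataframe with two missing continents
toy_drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["EU", np.nan, "AS", np.nan],
    "beer": [1, 2, 3, 4],
    "wine": [10, 20, 30, 40],
})

# dropna=False keeps missing values as their own category in the counts
counts_with_missing = toy_drinks["continent"].value_counts(dropna=False)

# Boolean masks for missing and present values
is_missing = toy_drinks["continent"].isnull()
missing_rows = toy_drinks[is_missing]
present_rows = toy_drinks[toy_drinks["continent"].notnull()]

# axis=0 sums down each column, axis=1 sums across each row
column_sums = toy_drinks[["beer", "wine"]].sum(axis=0)
row_sums = toy_drinks[["beer", "wine"]].sum(axis=1)

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = toy_drinks.isnull().sum()
print(missing_per_column)
```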

<a id='drop-missing'></a>
### Dropping missing values

**Q.1** Drop rows where *ANY* values are missing in drinks (return new dataframe)

In [44]:
# A:

**Q.2** Drop rows only where *ALL* values are missing in drinks.

In [45]:
# A:
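
The difference between `how="any"` and `how="all"` shows up clearly on a tiny made-up dataframe (`toy_drinks` and its values are assumptions):

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 is partly missing, row 2 is entirely missing
toy_drinks = pd.DataFrame({
    "continent": ["EU", np.nan, np.nan],
    "beer": [1.0, 2.0, np.nan],
})

# Drop a row if ANY of its values is missing
no_missing_at_all = toy_drinks.dropna(how="any")

# Drop a row only if ALL of its values are missing
not_fully_empty = toy_drinks.dropna(how="all")

print(len(no_missing_at_all), len(not_fully_empty))
```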

<a id='fill-missing'></a>
### Fill in missing values

**Q.1** Fill in the missing values of the continent column with string "NA"

In [46]:
# A:

**Q.2** Turn off the missing value filter when loading the drinks csv.

In [47]:
# A:
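
By default the CSV parser treats the string `NA` as missing, which matters when `NA` is meant as a code such as a continent abbreviation. The sketch below illustrates this with made-up CSV text; `csv_text` and the variable names are assumptions.

```python
import io

import pandas as pd

# Made-up CSV text in which "NA" is a legitimate continent code
csv_text = "country,continent\nUSA,NA\nFrance,EU\n"

# Default behaviour: "NA" is parsed as a missing value
with_filter = pd.read_csv(io.StringIO(csv_text))

# Option 1: fill the missing values back in after loading
filled = with_filter["continent"].fillna("NA")

# Option 2: switch off missing-value detection while loading
without_filter = pd.read_csv(io.StringIO(csv_text), na_filter=False)

print(with_filter["continent"].isnull().sum())
print(without_filter["continent"].tolist())
```

Note that `na_filter=False` disables detection for every column, so genuinely empty cells are kept as empty strings rather than `NaN`.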

<a id='exercise-3'></a>
## Exercise 3

---

**Using the ufo data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most reported colors.
4. Find the most frequent city for reports in state VA.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where city is missing.
8. Count the number of rows with no null values.
9. Replace spaces in column names with underscores.
10. Make a new column that is a combination of city and state.

In [48]:
ufo_csv = 'https://raw.githubusercontent.com/josephofiowa/DAT8/master/data/ufo.csv'

In [49]:
# A:

<a id='split-apply-combine'></a>
## Split-apply-combine

---

![](http://i.imgur.com/yjNkiwL.png)

<a id='groupby'></a>
### Grouping

**Q.1** With the drinks dataframe, calculate the mean beer servings by continent.

In [50]:
# A:

**Q.2** Describe the beer column by continent.

In [51]:
# A:

<a id='apply-combine'></a>
### Apply functions to groups and combine

**Q.1** Find the count, mean, minimum, and maximum of the beer column by continent.

In [52]:
# A:

**Q.2** Do the same as in Q.1, but sort the output by the mean column.

In [53]:
# A:

**Q.3** Apply a custom function to all the columns of the drinks dataframe, grouped by continent.

In [54]:
# A:

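**Q.4** [Note] If you don't specify a column for the aggregation function, it is applied to every column it can handle. Older pandas versions silently skipped non-numeric columns; in pandas 2.0 and later, pass `numeric_only=True` to functions such as `.mean()` when the dataframe also contains string columns.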

In [55]:
# A:
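
The split-apply-combine steps above can be sketched on a made-up dataframe. `toy_drinks`, its values, and the range function are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for the drinks dataframe
toy_drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AS"],
    "beer": [100, 200, 10, 20, 30],
})

# Split by continent, apply mean to the beer column, combine into a Series
mean_beer = toy_drinks.groupby("continent")["beer"].mean()

# Several aggregations at once, then sort the combined table by the mean column
beer_stats = (
    toy_drinks.groupby("continent")["beer"]
    .agg(["count", "mean", "min", "max"])
    .sort_values("mean")
)

# A custom function applied to each group: the range of beer servings
beer_range = toy_drinks.groupby("continent")["beer"].apply(lambda s: s.max() - s.min())

print(beer_stats)
```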

<a id='exercise-4'></a>

## Exercise 4

---

**Using the users dataframe**:
1. Count the number of distinct occupations in users.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of occupation and gender.

> *Tip: multiple columns can be passed to the groupby function for granular cross-sections.*
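
As a rough illustration of the tip, passing a list of columns to `groupby` produces one group per combination of values, and the result is indexed by those combinations. The dataframe `toy_people` is made up for this sketch.

```python
import pandas as pd

# Made-up data for illustration only
toy_people = pd.DataFrame({
    "occupation": ["student", "student", "student", "writer"],
    "gender": ["M", "M", "F", "F"],
    "age": [20, 30, 22, 40],
})

# One group per (occupation, gender) pair; the result has a two-level index
mean_age = toy_people.groupby(["occupation", "gender"])["age"].mean()
print(mean_age)
```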

In [56]:
# A:

<a id='indexing'></a>
## Indexing

---

<a id='loc'></a>
### Label indexing with `.loc`

**Q.1** Select all rows and the "City" column from the ufo dataset with `.loc`.

In [57]:
# A:

**Q.2** Select all rows and columns "City" and "State"

In [58]:
# A:

**Q.3** Select all rows and columns from "City" *through* "State"

In [59]:
# A:

**Q.4** Select:
- all columns at row 0
- all columns at rows 0:2
- columns "City" through "State" at rows 0:2

In [60]:
# A:
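
A sketch of the `.loc` patterns above on a made-up dataframe. `toy_ufo` and its values are assumptions; only the column names echo the ufo data. Note that `.loc` slices include both endpoints.

```python
import pandas as pd

# Made-up stand-in for the ufo dataframe, with the default 0..3 index
toy_ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "Colors Reported": [None, None, "RED", None],
    "State": ["NY", "NJ", "CO", "KS"],
})

city = toy_ufo.loc[:, "City"]                        # all rows, one column (Series)
city_and_state = toy_ufo.loc[:, ["City", "State"]]   # all rows, two listed columns
city_to_state = toy_ufo.loc[:, "City":"State"]       # all columns from City through State

row_0 = toy_ufo.loc[0, :]                            # one row, returned as a Series
rows_0_to_2 = toy_ufo.loc[0:2, :]                    # labels 0, 1 AND 2
block = toy_ufo.loc[0:2, "City":"State"]             # rows and columns together

print(block)
```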

<a id='iloc'></a>
### Position indexing with `.iloc`

**Q.1** Select all rows and columns in position 0 and 3.

In [61]:
# A:

**Q.2** Select all rows and columns 0 through 4.

In [62]:
# A:

**Q.3** Select rows in position 0:3 and all columns.

In [63]:
# A:
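
Unlike `.loc`, `.iloc` selects by integer position and excludes the end of a slice, as in ordinary Python slicing. The sketch below uses a made-up 4x5 dataframe named `toy` with placeholder column names.

```python
import numpy as np
import pandas as pd

# Made-up 4-row, 5-column dataframe with placeholder column names
toy = pd.DataFrame(np.arange(20).reshape(4, 5), columns=list("abcde"))

cols_0_and_3 = toy.iloc[:, [0, 3]]   # all rows, columns at positions 0 and 3
first_four_cols = toy.iloc[:, 0:4]   # positions 0, 1, 2, 3 (4 is excluded)
first_three_rows = toy.iloc[0:3, :]  # rows at positions 0, 1, 2

print(first_four_cols)
```

To include the column at position 4 as well, the slice would be `0:5`.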

<a id='frequent'></a>
## Frequently used features

---

<a id='map-dict'></a>
### The `.map` function with replacement dictionaries

In [64]:
# A:
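
A minimal sketch of `.map` with a replacement dictionary, on a made-up Series. The 0/1 encoding is an arbitrary choice for illustration.

```python
import pandas as pd

gender = pd.Series(["M", "F", "M"])

# Each value is looked up in the dictionary; values without a key become NaN
gender_code = gender.map({"M": 0, "F": 1})
print(gender_code.tolist())
```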

<a id='factorize'></a>
### Encode strings as integers with `.factorize`

In [65]:
# A:

<a id='unique'></a>
### Determine unique values

In [66]:
# A:

<a id='replace'></a>
### Replace values with `.replace`

In [67]:
# A:

<a id='series-str'></a>
### Series string methods with `.str`

In [68]:
# A:
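
The `.str` accessor applies string methods to every element of a Series. A short sketch on made-up city names:

```python
import pandas as pd

cities = pd.Series(["Ithaca", "Willingboro", "Holyoke"])

upper = cities.str.upper()         # element-wise upper-casing
has_o = cities.str.contains("o")   # element-wise substring test (boolean Series)
lengths = cities.str.len()         # element-wise length

print(upper.tolist(), has_o.tolist(), lengths.tolist())
```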

<a id='datetime'></a>
### `datetime` conversion and arithmetic

In [69]:
# A:
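
A sketch of converting strings to datetimes and doing arithmetic on them. The timestamps are made up for illustration.

```python
import pandas as pd

raw_times = pd.Series(["2014-09-01 10:00", "2014-09-03 22:30"])

# Convert strings to datetime64 values
times = pd.to_datetime(raw_times)

# The .dt accessor exposes date parts element-wise
years = times.dt.year

# Subtracting two timestamps gives a Timedelta
elapsed = times.iloc[1] - times.iloc[0]
print(years.tolist(), elapsed)
```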

<a id='set-reset-index'></a>
### Setting and resetting the index

In [70]:
# A:

<a id='sort-by-index'></a>
### Sorting by the index

In [71]:
# A:

<a id='change-dtype'></a>
### Changing the data type of a column

In [72]:
# A:

<a id='dummy'></a>
### Create dummy-coded columns from a categorical column

In [73]:
# A:
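
Dummy coding turns each category into its own indicator column. A minimal sketch with `pd.get_dummies` on a made-up Series (the `continent` prefix is an arbitrary choice):

```python
import pandas as pd

continent = pd.Series(["EU", "AS", "EU"])

# One indicator column per distinct value, named with the given prefix
dummies = pd.get_dummies(continent, prefix="continent")
print(dummies)
```

Depending on the pandas version, the indicator columns hold booleans or small integers; `.astype(int)` gives 0/1 in either case.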

<a id='concatenate'></a>
### Concatenate dataframes together

In [74]:
# A:
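
A sketch of `pd.concat` along both axes, using tiny made-up dataframes:

```python
import pandas as pd

top = pd.DataFrame({"x": [1, 2]})
bottom = pd.DataFrame({"x": [3]})
extra = pd.DataFrame({"y": [10, 20]})

# Stack rows; ignore_index=True renumbers the result 0..n-1
stacked = pd.concat([top, bottom], ignore_index=True)

# axis=1 places dataframes side by side, aligned on the index
side_by_side = pd.concat([top, extra], axis=1)

print(stacked)
print(side_by_side)
```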

<a id='duplicate-rows'></a>
### Detect and drop duplicate rows

In [75]:
# A:
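
A short sketch of detecting and dropping duplicates on a made-up dataframe:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "a"], "n": [1, 2, 1]})

# True for every row that repeats an earlier row exactly
is_duplicate = toy.duplicated()

# Keep only the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

# subset restricts the comparison to the listed columns
duplicate_names = toy.duplicated(subset=["name"]).sum()

print(is_duplicate.tolist(), len(deduplicated), duplicate_names)
```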

<a id='write-csv'></a>
### Write a dataframe to a csv

In [76]:
# A:

<a id='pickle'></a>
### Write a dataframe to a pickle object

In [77]:
# A:

<a id='sample'></a>
### Randomly sample a dataframe

In [78]:
# A:

<a id='infrequent'></a>
## Infrequently used features

---


<a id='toy-dataframes'></a>
### Create dataframes from dictionaries and lists of lists

In [79]:
# A:
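
Both construction styles are sketched below with made-up values. A dictionary maps column names to columns, while a list of lists supplies rows and needs the column names passed separately.

```python
import pandas as pd

# From a dictionary: keys become column names, values become columns
from_dict = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})

# From a list of lists: each inner list is a row
from_rows = pd.DataFrame([["red", 1], ["blue", 2]], columns=["color", "size"])

print(from_dict.equals(from_rows))
```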

<a id='crosstab'></a>
### Do a cross-tabulation between Series

In [80]:
# A:
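
`pd.crosstab` counts how often each pair of values occurs across two Series. A sketch on made-up data:

```python
import pandas as pd

toy_users = pd.DataFrame({
    "gender": ["M", "F", "M", "M"],
    "occupation": ["student", "student", "writer", "student"],
})

# Rows are gender values, columns are occupation values, cells are counts
table = pd.crosstab(toy_users["gender"], toy_users["occupation"])
print(table)
```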

<a id='query'></a>
### Query syntax for filtering

In [81]:
# A:
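
`query` expresses a filter as a string instead of a boolean mask. The sketch below compares the two forms on a made-up dataframe:

```python
import pandas as pd

toy_users = pd.DataFrame({
    "age": [7, 18, 19, 35, 73],
    "gender": ["M", "F", "M", "F", "M"],
})

# Column names are used directly inside the query string
via_query = toy_users.query("age < 20 and gender == 'M'")

# Equivalent boolean-mask form
via_mask = toy_users[(toy_users["age"] < 20) & (toy_users["gender"] == "M")]

print(via_query)
```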

<a id='memory-usage'></a>
### Calculate memory usage

In [82]:
# A:

<a id='category-type'></a>
### Convert a column to type 'category'

In [83]:
# A:

<a id='assign'></a>
### Define column with `.assign`

In [84]:
# A:

<a id='limit-rows-read'></a>
### Limit rows when reading a file

In [85]:
# A:
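
The `nrows` argument stops reading after a given number of data rows, which helps when previewing large files. A sketch with made-up in-memory CSV text:

```python
import io

import pandas as pd

# Made-up CSV text with four data rows
csv_text = "id,value\n1,a\n2,b\n3,c\n4,d\n"

# Only the first two data rows are parsed; the header is not counted
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)
```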

<a id='manual-print'></a>
### Manually set maximum rows and columns to print

In [86]:
# A: