<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 2

_Authors: Joseph Nelson (DC)_

---

**Warning: This is a resource-heavy notebook that can consume a lot of RAM, especially when it's run in Chrome. For this lesson, you may want to close idle applications and/or open this notebook with Safari.**

### Lesson Guide
- [Exercise #3](#exercise-3)
- [Split-Apply-Combine](#split-apply-combine)
    - [`.groupby()`](#groupby)
    - [Apply Functions to Groups and Combine](#apply-combine)
- [Exercise #4](#exercise-4)
- [Indexing](#indexing)
    - [Location Indexing With `.loc[]`](#loc)
    - [Position Indexing With `.iloc[]`](#iloc)
- [Frequently Used Features](#frequent)
    - [Using Map Functions With Replacement Dictionaries](#map-dict)
    - [Encoding Strings as Integers With `.factorize()`](#factorize)
    - [Determining Unique Values](#unique)
    - [Replacing Values With `.replace()`](#replace)
    - [Series String Methods With `.str`](#series-str)
    - [Datetime Conversion and Arithmetic](#datetime)
    - [Setting and Resetting the Index](#set-reset-index)
    - [Sorting by Index](#sort-by-index)
    - [Changing the Data Type of a Column](#change-dtype)
    - [Creating Dummy-Coded Columns](#dummy)
    - [Concatenating DataFrames](#concatenate)
    - [Detecting and Dropping Duplicate Rows](#duplicate-rows)
    - [Writing a DataFrame to a `.csv`](#write-csv)
    - [Pickling a DataFrame](#pickle)
    - [Randomly Sampling a DataFrame](#sample)
- [Infrequently Used Features](#infrequent)
    - [Creating DataFrames From Dictionaries and Lists of Lists](#toy-dataframes)
    - [Performing Cross-Tabulations](#crosstab)
    - [Query-Filtering Syntax](#query)
    - [Calculating Memory Usage](#memory-usage)
    - [Converting Column to Category Type](#category-type)
    - [Creating Columns With `.assign()`](#assign)
    - [Limiting the Number of Rows to Load in a File Read](#limit-rows-read)
    - [Manually Setting the Number of Rows and Columns to Print](#manual-print)

In [None]:
import pandas as pd

<a id='exercise-3'></a>
## Exercise #3

---

**Using the UFO data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state `VA`.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where `city` is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of `city` and `state`.

In [None]:
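# Repository page for the data set (kept for reference; the local copy below is the one that gets read):
# ufo_csv = 'https://git.generalassemb.ly/dsi-unit-2/pandas-data_munging_full_overview-lesson/tree/master/datasets/ufo.csv'
ufo_csv = 'datasets/ufo.csv'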

In [None]:
# A:

<a id='split-apply-combine'></a>
## Split-Apply-Combine

---

![](assets/split_apply_combine.png)

<a id='groupby'></a>
### `.groupby()`
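
As a warm-up, here is a minimal sketch of split-apply-combine on a hypothetical toy DataFrame (the `toy` data and its values are made up for illustration and are not the lesson's `drinks` data):

```python
import pandas as pd

# Hypothetical toy data, not the lesson's drinks data set.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer': [200, 100, 50, 70],
})

# Split rows by continent, apply the mean to `beer`, combine into one Series.
mean_beer = toy.groupby('continent')['beer'].mean()
print(mean_beer)

# .describe() works on a grouped column in the same way.
print(toy.groupby('continent')['beer'].describe())
```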

**Q.1** Using the `drinks` DataFrame, calculate the mean `beer` servings by continent.

In [None]:
drinks = pd.read_csv('datasets/drinks_updated.csv')

In [None]:
# A:

**Q.2** Describe the `beer` column by continent.

In [None]:
# A:

<a id='apply-combine'></a>
### Apply Functions to Groups and Combine
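
One way to apply several functions at once is `.agg()`. The sketch below uses a hypothetical toy DataFrame (made-up values) rather than `drinks`, and the range lambda is only an illustrative custom function:

```python
import pandas as pd

# Hypothetical toy data, not the lesson's drinks data set.
toy = pd.DataFrame({
    'continent': ['EU', 'EU', 'AS', 'AS'],
    'beer': [200, 100, 50, 70],
})

# Several built-in aggregations at once; each becomes a column.
summary = toy.groupby('continent')['beer'].agg(['count', 'mean', 'min', 'max'])

# Sort the combined result by one of the aggregated columns.
summary_sorted = summary.sort_values('mean')

# A custom function: the range (max minus min) within each group.
beer_range = toy.groupby('continent')['beer'].agg(lambda s: s.max() - s.min())
print(summary_sorted)
print(beer_range)
```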

**Q.1** Find the `count`, `mean`, `minimum`, and `maximum` of the `beer` column by continent.

In [None]:
# A:

**Q.2** Perform the same task as in Q.1, but now sort the output by the `mean` column.

In [None]:
# A:

**Q.3** Apply a custom function to all columns of the `drinks` DataFrame, grouping by continent.

In [None]:
# A:

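**Q.4** Aggregate by continent without specifying a column. **Note:** If you don't specify a column, the aggregation function is applied to every non-grouping column. In pandas 2.0 and later, pass `numeric_only=True` to built-in aggregations such as `.mean()` to restrict them to numeric columns.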

In [None]:
# A:

<a id='exercise-4'></a>

## Exercise #4

---

**Using the `users` DataFrame**:
1. Count the number of distinct occupations in `users`.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of `occupation` and `gender`.

> **Tip**: Multiple columns can be passed to the `.groupby()` function for more granular cross-sections.
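
The tip above can be sketched on a hypothetical `people` DataFrame (made-up rows; the real `users` data is not loaded here):

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({
    'occupation': ['writer', 'writer', 'doctor', 'doctor'],
    'gender': ['F', 'M', 'F', 'F'],
    'age': [30, 40, 50, 60],
})

# Passing a list of columns groups by every observed combination of their values.
mean_age = people.groupby(['occupation', 'gender'])['age'].mean()
print(mean_age)
```

The result is indexed by an (`occupation`, `gender`) MultiIndex, so a single cross-section can be looked up with a tuple such as `mean_age.loc[('doctor', 'F')]`.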

In [None]:
# A:

<a id='indexing'></a>
## Indexing

---
<a id='loc'></a>
### Location Indexing With `.loc[]`

**Q.1** Select all rows and the `city` column from the UFO data set using `.loc[]`.
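
A minimal sketch of label-based selection, assuming a hypothetical `sightings` DataFrame (made-up rows and column order, not the real UFO file):

```python
import pandas as pd

# Hypothetical stand-in for the UFO data.
sightings = pd.DataFrame({
    'city': ['Ithaca', 'Holyoke', 'Abilene'],
    'color': ['RED', None, 'BLUE'],
    'state': ['NY', 'CO', 'KS'],
})

all_city = sightings.loc[:, 'city']                  # one column, all rows
city_and_state = sightings.loc[:, ['city', 'state']] # a list picks specific columns
city_through_state = sightings.loc[:, 'city':'state'] # a label slice picks a range
first_rows = sightings.loc[0:2, 'city':'state']      # label slices include the endpoint
```

Note that `.loc` is an indexer used with square brackets, and that `0:2` selects three rows here because both ends of a label slice are included.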

In [None]:
# A:

**Q.2** Select all rows and the `city` and `state` columns.

In [None]:
# A:

**Q.3** Select all rows and columns from `city` *through* `state`.

In [None]:
# A:

**Q.4** Select:
- All columns at row 0.
- All columns at rows 0:2.
- Columns `city` through `state` at rows 0:2.

In [None]:
# A:

<a id='iloc'></a>
### Position Indexing With `.iloc[]`

**Q.1** Select all rows and the columns in positions 0 and 3.
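
A minimal sketch of position-based selection on a hypothetical five-column DataFrame (the `grid` name and values are made up):

```python
import pandas as pd

# Hypothetical toy data with columns a..e and four rows.
grid = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [5, 6, 7, 8],
    'c': [9, 10, 11, 12],
    'd': [13, 14, 15, 16],
    'e': [17, 18, 19, 20],
})

cols_0_and_3 = grid.iloc[:, [0, 3]]  # a list picks specific positions
cols_0_to_3 = grid.iloc[:, 0:4]      # position slices exclude the endpoint
rows_0_to_2 = grid.iloc[0:3, :]      # rows at positions 0, 1, and 2
```

Unlike `.loc`, an `.iloc` slice follows ordinary Python slicing, so `0:4` stops before position 4.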

In [None]:
# A:

**Q.2** Select all rows and the columns in positions 0 through 4.

In [None]:
# A:

**Q.3** Select rows in positions 0:3, along with all columns.

In [None]:
# A:

<a id='frequent'></a>
## Frequently Used Features

---
<a id='map-dict'></a>
### Using Map Functions With Replacement Dictionaries
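
A minimal sketch of `.map()` with a replacement dictionary, using a made-up `sizes` Series:

```python
import pandas as pd

# Hypothetical example values.
sizes = pd.Series(['small', 'large', 'medium', 'tiny'])

# Each value is looked up in the dictionary; values without a key become NaN.
size_codes = sizes.map({'small': 0, 'medium': 1, 'large': 2})
print(size_codes)
```

Because `'tiny'` has no entry in the dictionary, it maps to a missing value rather than being left unchanged.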

In [None]:
# A:

<a id='factorize'></a>
### Encoding Strings as Integers With `.factorize()`
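
A minimal sketch of `pd.factorize()` on a made-up Series of letters:

```python
import pandas as pd

# Hypothetical example values.
letters = pd.Series(['b', 'a', 'b', 'c'])

# Codes are assigned in order of first appearance; `uniques` maps codes back to labels.
codes, uniques = pd.factorize(letters)
print(codes)
print(uniques)
```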

In [None]:
# A:

<a id='unique'></a>
### Determining Unique Values
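
A minimal sketch of the three most common helpers, using a made-up `colors` Series that contains one missing value:

```python
import pandas as pd

# Hypothetical example values, including one missing entry.
colors = pd.Series(['red', 'blue', 'red', None])

distinct = colors.unique()      # distinct values, missing value included
n_distinct = colors.nunique()   # count of distinct values, missing excluded by default
counts = colors.value_counts()  # frequency of each non-missing value
print(distinct, n_distinct)
print(counts)
```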

In [None]:
# A:

<a id='replace'></a>
### Replacing Values With `.replace()`
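
A minimal sketch of `.replace()` with a dictionary, using made-up state labels:

```python
import pandas as pd

# Hypothetical example values with inconsistent spellings.
states = pd.Series(['VA', 'va', 'Virginia', 'NY'])

# Keys are the values to find; values are their replacements.
# Anything not listed as a key is left unchanged.
cleaned = states.replace({'va': 'VA', 'Virginia': 'VA'})
print(cleaned)
```

In contrast to `.map()`, values that are not in the dictionary stay as they are.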

In [None]:
# A:

<a id='series-str'></a>
### Series String Methods With `.str`
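
A minimal sketch of chained `.str` methods, using a made-up `cities` Series:

```python
import pandas as pd

# Hypothetical example values with stray whitespace and a missing entry.
cities = pd.Series(['  arlington', 'Norfolk ', None])

# .str methods apply element-wise and pass missing values through.
tidy = cities.str.strip().str.title()

# na=False treats missing values as "no match" so the result is a clean Boolean mask.
has_folk = tidy.str.contains('folk', na=False)
print(tidy)
print(has_folk)
```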

In [None]:
# A:

<a id='datetime'></a>
### Datetime Conversion and Arithmetic
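
A minimal sketch of converting strings to datetimes and subtracting them, using made-up timestamps:

```python
import pandas as pd

# Hypothetical timestamps stored as strings.
times = pd.Series(['1930-06-01 22:00', '1930-06-30 20:00'])

stamps = pd.to_datetime(times)        # convert to datetime64 values
years = stamps.dt.year                # .dt exposes date parts element-wise
gap = stamps.iloc[1] - stamps.iloc[0] # subtraction yields a Timedelta
print(years.tolist(), gap)
```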

In [None]:
# A:

<a id='set-reset-index'></a>
### Setting and Resetting the Index
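
A minimal sketch on a hypothetical toy DataFrame (made-up country labels and values):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'country': ['B', 'A'], 'beer': [10, 20]})

indexed = toy.set_index('country')  # move a column into the index
restored = indexed.reset_index()    # move the index back into a column
print(indexed.loc['A', 'beer'])
```

Both methods return a new DataFrame by default, so `toy` itself is unchanged.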

In [None]:
# A:

<a id='sort-by-index'></a>
### Sorting by Index
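
A minimal sketch of `.sort_index()` on a made-up Series with an unordered index:

```python
import pandas as pd

# Hypothetical example values with an out-of-order index.
servings = pd.Series([10, 20, 30], index=['c', 'a', 'b'])

ascending = servings.sort_index()                  # index order: a, b, c
descending = servings.sort_index(ascending=False)  # index order: c, b, a
print(ascending)
```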

In [None]:
# A:

<a id='change-dtype'></a>
### Changing the Data Type of a Column
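
A minimal sketch of `.astype()`, assuming a made-up column of numbers stored as strings:

```python
import pandas as pd

# Hypothetical toy data: numeric values that were read in as text.
toy = pd.DataFrame({'beer': ['10', '20']})

# .astype returns a converted copy, so assign it back to the column.
toy['beer'] = toy['beer'].astype(float)
print(toy.dtypes)
```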

In [None]:
# A:

<a id='dummy'></a>
### Creating Dummy-Coded Columns
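
A minimal sketch of `pd.get_dummies()` on a made-up `continent` Series:

```python
import pandas as pd

# Hypothetical example values.
continent = pd.Series(['EU', 'AS', 'EU'], name='continent')

# One indicator column per distinct value; `prefix` sets the column name stem.
dummies = pd.get_dummies(continent, prefix='continent')

# drop_first=True omits the first level, which avoids a redundant column.
reduced = pd.get_dummies(continent, prefix='continent', drop_first=True)
print(dummies)
```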

In [None]:
# A:

<a id='concatenate'></a>
### Concatenating DataFrames
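
A minimal sketch of `pd.concat()` in both directions, using made-up one-column DataFrames:

```python
import pandas as pd

# Hypothetical toy frames.
top = pd.DataFrame({'a': [1, 2]})
bottom = pd.DataFrame({'a': [3]})
extra = pd.DataFrame({'b': [5, 6]})

# Stack rows; ignore_index=True renumbers the result from 0.
stacked = pd.concat([top, bottom], ignore_index=True)

# axis=1 places frames side by side, aligning on the index.
side_by_side = pd.concat([top, extra], axis=1)
print(stacked)
print(side_by_side)
```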

In [None]:
# A:

<a id='duplicate-rows'></a>
### Detecting and Dropping Duplicate Rows
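
A minimal sketch of `.duplicated()` and `.drop_duplicates()` on a hypothetical toy DataFrame (made-up rows):

```python
import pandas as pd

# Hypothetical toy data with one repeated row.
reports = pd.DataFrame({
    'city': ['Arlington', 'Arlington', 'Norfolk'],
    'state': ['VA', 'VA', 'VA'],
})

flags = reports.duplicated()                        # True for repeats of an earlier row
deduped = reports.drop_duplicates()                 # keep the first copy of each row
by_state = reports.drop_duplicates(subset='state')  # compare on selected columns only
print(flags.tolist())
```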

In [None]:
# A:

<a id='write-csv'></a>
### Writing a DataFrame to a `.csv`
```python
# Write a DataFrame out to a `.csv`.
drinks.to_csv('drinks_updated.csv')  # Index is used as the first column
drinks.to_csv('drinks_updated.csv', index=False) # Ignore index
```

<a id='pickle'></a>
### Pickling a DataFrame
```python
# Save a DataFrame to disk (a.k.a., "pickle") and read it from disk (a.k.a., "unpickle").
drinks.to_pickle('drinks_pickle')
pd.read_pickle('drinks_pickle')
```

<a id='sample'></a>
### Randomly Sampling a DataFrame
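
A minimal sketch of `.sample()` on a made-up ten-row DataFrame; the `random_state` value is arbitrary and only there to make the draw repeatable:

```python
import pandas as pd

# Hypothetical toy data.
numbers = pd.DataFrame({'x': range(10)})

three_rows = numbers.sample(n=3, random_state=1)     # a fixed number of rows
half_rows = numbers.sample(frac=0.5, random_state=1) # a fraction of the rows
print(three_rows)
```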

In [None]:
# A:

<a id='infrequent'></a>
## Infrequently Used Features

---

<a id='toy-dataframes'></a>
### Creating DataFrames From Dictionaries and Lists of Lists
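
A minimal sketch of both constructions, with made-up column names and values:

```python
import pandas as pd

# From a dictionary: keys become column names, values become columns.
from_dict = pd.DataFrame({'color': ['red', 'blue'], 'count': [3, 5]})

# From a list of lists: each inner list is a row; column names are passed separately.
from_rows = pd.DataFrame([['red', 3], ['blue', 5]], columns=['color', 'count'])

print(from_dict.equals(from_rows))
```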

In [None]:
# A:

In [None]:
# A:

<a id='crosstab'></a>
### Performing Cross-Tabulations
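
A minimal sketch of `pd.crosstab()`, using a hypothetical toy DataFrame (the `state` and `shape` columns and their values are made up):

```python
import pandas as pd

# Hypothetical toy data.
reports = pd.DataFrame({
    'state': ['VA', 'VA', 'NY'],
    'shape': ['disk', 'light', 'disk'],
})

# Count how often each (state, shape) pair occurs; absent pairs are filled with 0.
table = pd.crosstab(reports['state'], reports['shape'])
print(table)
```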

In [None]:
# A:

<a id='query'></a>
### Query-Filtering Syntax
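
A minimal sketch of `.query()` on a hypothetical toy DataFrame (made-up values); the `threshold` variable is only there to show the `@` syntax for local variables:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'beer': [10, 200, 150],
    'continent': ['AS', 'EU', 'EU'],
})

# Conditions are written as a string that refers to columns by name.
heavy_eu = toy.query("beer > 100 and continent == 'EU'")

# Local Python variables are referenced with an @ prefix.
threshold = 100
same_result = toy.query("beer > @threshold and continent == 'EU'")
print(heavy_eu)
```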

In [None]:
# A:

<a id='memory-usage'></a>
### Calculating Memory Usage
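
A minimal sketch of `.memory_usage()` on a hypothetical toy DataFrame (made-up values); exact byte counts depend on the platform and pandas version, so none are shown:

```python
import pandas as pd

# Hypothetical toy data.
reports = pd.DataFrame({
    'city': ['Arlington', 'Norfolk'],
    'count': [3, 5],
})

# Bytes per column (plus the index); deep=True also measures string contents.
usage = reports.memory_usage(deep=True)
total_bytes = usage.sum()
print(usage)
print(total_bytes)
```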

In [None]:
# A:

<a id='category-type'></a>
### Converting Column to Category Type
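
A minimal sketch of converting a repetitive string column to the `category` dtype, using a made-up Series; the memory comparison assumes many repeated values, which is where categories tend to help:

```python
import pandas as pd

# Hypothetical example: few distinct labels repeated many times.
labels = pd.Series(['EU', 'AS', 'EU'] * 1000)

as_category = labels.astype('category')

# Categories store each distinct label once plus a small integer code per row.
print(as_category.cat.categories.tolist())
print(labels.memory_usage(deep=True), as_category.memory_usage(deep=True))
```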

In [None]:
# A:

<a id='assign'></a>
### Creating Columns With `.assign()`
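
A minimal sketch of `.assign()` on a hypothetical toy DataFrame (made-up values; `total` is an illustrative column name):

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'beer': [10, 20], 'wine': [1, 2]})

# .assign returns a new DataFrame with the extra column; `toy` is left unchanged.
with_total = toy.assign(total=toy['beer'] + toy['wine'])
print(with_total)
```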

In [None]:
# A:

<a id='limit-rows-read'></a>
### Limiting the Number of Rows to Load in a File Read
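
A minimal sketch of the `nrows` parameter of `pd.read_csv()`. To keep it self-contained, it reads from an in-memory string (made-up contents) rather than a file on disk:

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file.
csv_text = "a,b\n1,2\n3,4\n5,6\n"

# nrows limits how many data rows are parsed (the header line is not counted).
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(first_two)
```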

In [None]:
# A:

<a id='manual-print'></a>
### Manually Setting the Number of Rows and Columns to Print
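
A minimal sketch of the display options; the limits of 10 and 5 are arbitrary example values:

```python
import pandas as pd

# Limit how many rows and columns are printed before output is truncated.
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 5)

current_rows = pd.get_option('display.max_rows')
current_columns = pd.get_option('display.max_columns')
print(current_rows, current_columns)

# Restore the defaults when finished.
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
```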

In [None]:
# A:

In [None]:
# A: