# 07 Data Wrangling
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 7 - Data Cleaning and Preparation
* McKinney, Chapter 8 - Data Wrangling: Join, Combine, and Reshape

Outline:
1. Mapping
2. Sampling
3. Dummy Variables / Indicators
    * Value counts
4. Joining two datasets
5. Pivot tables
6. Groupbys

The book also discusses other methods that we won't cover here but that are valuable resources:
* Regular Expressions
* String methods and manipulation

-----
We often have two sets of data on the same subject, and both add a good deal of information. Wouldn't it be nice to merge the datasets together? If we could do that, our options for what to do with data would increase significantly. 

Also, what if the data is not quite in the format we want? For example, what if we have a list of observations by date, but we'd like a table whose rows are indexed by date and whose columns are indexed by year?

In this section, we will look at how we can accomplish both of these tasks. It is part of a branch of data science called __data wrangling__.

## 7.1 Mapping
Sometimes, we have a dataset that could use a little more information. Take the following dataset on different kinds of meat:

In [2]:
import numpy as np
import pandas as pd

In [3]:
meat_data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                                   "pastrami", "corned beef", "bacon",
                                   "pastrami", "honey ham", "nova lox"],
                          "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


We have a variety of different meats. Let's add a little information to indicate what type of animal each meat comes from. We do this with a technique called __mapping__. Mapping takes each value of one variable in your dataset and looks up a corresponding value in a separate lookup table, such as a dictionary. For example, "bacon" in the food variable would have an entry in the lookup table that returns the animal "pig".

In [4]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

meat_data['Animal'] = meat_data['food'].map(meat_to_animal)
meat_data

Unnamed: 0,food,ounces,Animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
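One thing worth checking after a mapping is whether every value had an entry in the lookup dictionary. The sketch below uses a small made-up frame (the "turkey breast" row and the `Animal_filled` column are hypothetical, not part of the dataset above) to show that `.map` leaves `NaN` for unmapped values, and one possible way to label those gaps.

```python
import pandas as pd

# Small illustrative frame; "turkey breast" is a made-up entry with no lookup value.
foods = pd.DataFrame({"food": ["bacon", "pastrami", "turkey breast"]})
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Values missing from the dictionary map to NaN rather than raising an error.
foods["Animal"] = foods["food"].map(meat_to_animal)

# One option: label the gaps explicitly so they are easy to spot later.
foods["Animal_filled"] = foods["Animal"].fillna("unknown")
print(foods)
```

Counting the missing entries with `foods["Animal"].isna().sum()` is a quick check that the dictionary covers the data.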


## 7.2 Sampling

In [5]:
samples = np.random.permutation(9)
samples

array([6, 8, 1, 4, 3, 0, 2, 5, 7])

In [6]:
meat_data.iloc[samples]

Unnamed: 0,food,ounces,Animal
6,pastrami,3.0,cow
8,nova lox,6.0,salmon
1,pulled pork,3.0,pig
4,corned beef,7.5,cow
3,pastrami,6.0,cow
0,bacon,4.0,pig
2,bacon,12.0,pig
5,bacon,8.0,pig
7,honey ham,5.0,pig


In [7]:
meat_data.sample(n=4)

Unnamed: 0,food,ounces,Animal
7,honey ham,5.0,pig
6,pastrami,3.0,cow
3,pastrami,6.0,cow
0,bacon,4.0,pig
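The two approaches above (shuffling row positions with `np.random.permutation` and drawing rows with `.sample`) give different results on every run. The following is a minimal sketch of a few `.sample` options that may be useful; the frame `meat` is a rebuilt copy of the data above, and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

meat = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# random_state fixes the seed, so the same 4 rows come back on every run.
subset = meat.sample(n=4, random_state=0)

# frac=1 returns every row in shuffled order, similar to the permutation approach.
shuffled = meat.sample(frac=1, random_state=0)

# replace=True allows the same row to be drawn more than once.
with_replacement = meat.sample(n=9, replace=True, random_state=0)

print(subset)
```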


## 7.3 Dummy Variables / Indicators

In [8]:
meat_data

Unnamed: 0,food,ounces,Animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [9]:
pd.get_dummies(meat_data['Animal'])

Unnamed: 0,cow,pig,salmon
0,0,1,0
1,0,1,0
2,0,1,0
3,1,0,0
4,1,0,0
5,0,1,0
6,1,0,0
7,0,1,0
8,0,0,1


If we want to see the indicator columns alongside the original data, we can do a join, which we discuss in Section 7.4. But for now, we have a way to count all of the values for each category.

In [17]:
pd.get_dummies(meat_data['Animal']).sum()

cow       3
pig       5
salmon    1
dtype: int64
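Depending on the pandas version, `get_dummies` may return `True`/`False` columns instead of the 0/1 values shown above. A hedged sketch, assuming a reasonably recent pandas: passing `dtype=int` requests 0/1 columns explicitly, and `prefix` (here the arbitrary label `"is"`) makes the new column names easier to recognize after a join.

```python
import pandas as pd

animals = pd.Series(["pig", "pig", "pig", "cow", "cow",
                     "pig", "cow", "pig", "salmon"], name="Animal")

# dtype=int asks for 0/1 columns; prefix="is" is an arbitrary label for illustration.
dummies = pd.get_dummies(animals, prefix="is", dtype=int)

# Summing each indicator column counts the rows in that category.
print(dummies.sum())
```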

#### Data Summaries
Pandas has a couple of built-in methods that provide additional summary data. We will now look at:
* `.describe()`
* `.value_counts()`

The `.describe()` method takes any numerical variables and calculates the count, mean, standard deviation, and quartiles (including maximum and minimum).

In [23]:
meat_data.describe()

Unnamed: 0,ounces
count,9.0
mean,6.055556
std,2.855307
min,3.0
25%,4.0
50%,6.0
75%,7.5
max,12.0
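`.describe()` can also be called on a single column. As a small sketch (using a rebuilt copy of the data, named `meat` here), a text column reports the count, the number of unique values, the most frequent value, and how often it occurs, instead of the mean and quartiles.

```python
import pandas as pd

meat = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                     "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

# Numeric column: count, mean, std, min, quartiles, max.
ounces_summary = meat["ounces"].describe()

# Text column: count, unique, top (most frequent value), freq (its count).
food_summary = meat["food"].describe()

print(ounces_summary)
print(food_summary)
```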


The `.value_counts()` method finds each unique observation (each distinct combination of values in a row) and counts how many times it occurs.

In [25]:
meat_data.value_counts()

food         ounces  Animal
bacon        4.0     pig       1
             8.0     pig       1
             12.0    pig       1
corned beef  7.5     cow       1
honey ham    5.0     pig       1
nova lox     6.0     salmon    1
pastrami     3.0     cow       1
             6.0     cow       1
pulled pork  3.0     pig       1
dtype: int64

Each row of the data is a unique observation, so every count is 1. This does not show how useful the method can be. Let's look at a questionnaire with 2 questions whose answers are 'Yes' or 'No'.

In [30]:
DS_Survey = pd.DataFrame({
    'Enjoy Math2210' : ['Yes','Yes','Yes','No','No','Yes','No','Yes','Yes','Yes','No','No'],
    'SE Major' : ['Yes','No','Yes','Yes','No','Yes','No','No','Yes','Yes','No','No']
})
DS_Survey

Unnamed: 0,Enjoy Math2210,SE Major
0,Yes,Yes
1,Yes,No
2,Yes,Yes
3,No,Yes
4,No,No
5,Yes,Yes
6,No,No
7,Yes,No
8,Yes,Yes
9,Yes,Yes
10,No,No
11,No,No


In [31]:
DS_Survey.value_counts()

Enjoy Math2210  SE Major
Yes             Yes         5
No              No          4
Yes             No          2
No              Yes         1
dtype: int64

We can also look at the value counts of an individual variable:

In [32]:
DS_Survey['Enjoy Math2210'].value_counts()

Yes    7
No     5
Name: Enjoy Math2210, dtype: int64

In [24]:
meat_data['Animal'].value_counts()

pig       5
cow       3
salmon    1
Name: Animal, dtype: int64
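Counts are often easier to compare as proportions. A minimal sketch, reusing the 'Enjoy Math2210' answers from the survey above: `normalize=True` divides each count by the total number of observations.

```python
import pandas as pd

enjoy = pd.Series(['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes',
                   'No', 'Yes', 'Yes', 'Yes', 'No', 'No'],
                  name='Enjoy Math2210')

# normalize=True returns the share of each answer instead of the raw count.
shares = enjoy.value_counts(normalize=True)
print(shares.round(3))
```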

## 7.4 Joining two datasets
We often have two datasets that contain related information and are more useful together. We just saw two such datasets: the meat data and the `get_dummies` breakdown of the animal each meat comes from. Let's join them together.

In [10]:
meat_data.join(pd.get_dummies(meat_data['Animal']))

Unnamed: 0,food,ounces,Animal,cow,pig,salmon
0,bacon,4.0,pig,0,1,0
1,pulled pork,3.0,pig,0,1,0
2,bacon,12.0,pig,0,1,0
3,pastrami,6.0,cow,1,0,0
4,corned beef,7.5,cow,1,0,0
5,bacon,8.0,pig,0,1,0
6,pastrami,3.0,cow,1,0,0
7,honey ham,5.0,pig,0,1,0
8,nova lox,6.0,salmon,0,0,1


The `join` method merges two datasets based on the index: index 0 from `meat_data` is matched with index 0 in the `get_dummies` result. This works when both datasets share the same index, as they do here. But sometimes the rows are not in the same order, or one dataset covers only a subset of the other. For example, take two datasets about students' GPAs and GRE scores:
* `gpa_data` has the student ID and gpa of all students
* `gre_data` has the student ID and the score they earned on the gre

Not all students take the GRE, so not all students in `gpa_data` will be in `gre_data`. We would still like to merge that data if possible. In pandas, we do this with a more flexible relative of `join`: `merge`.
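The GPA and GRE scenario can be sketched with a few made-up records. The column names (`student_id`, `gpa`, `gre`) and all values below are assumptions for illustration only. Matching on the ID column rather than the index means the row order of the two tables does not matter.

```python
import pandas as pd

# Made-up records for illustration only.
gpa_data = pd.DataFrame({"student_id": [101, 102, 103, 104],
                         "gpa": [3.2, 3.8, 2.9, 3.5]})
gre_data = pd.DataFrame({"student_id": [104, 102],
                         "gre": [315, 328]})

# how="left" keeps every student in gpa_data;
# students with no GRE record get NaN in the gre column.
combined = gpa_data.merge(gre_data, on="student_id", how="left")
print(combined)
```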

Let's take a closer look at the different types of joins.

When we join datasets, there are 4 ways in which they can be joined:
* Left join (all rows in the left table are kept; unmatched rows from the right table are dropped)
* Right join (all rows in the right table are kept; unmatched rows from the left table are dropped)
* Inner join (only rows that match in both the left and right tables are kept)
* Outer join (all rows are kept, whether they match or not)

![Different types of joins](https://d33wubrfki0l68.cloudfront.net/9c12ca9e12ed26a7c5d2aa08e36d2ac4fb593f1e/79980/diagrams/join-outer.png)
* image from *R for Data Science*, Hadley Wickham & Garrett Grolemund, 2017.

Let's demonstrate this with two dummy datasets.

In [11]:
data_A = pd.DataFrame({'key':[1,2,3,4,5,6,7],
                       'value':[12,13,14,15,16,17,18]})
data_B = pd.DataFrame({'key':[1,3,5,7,9,11,13],
                       'value':[22,23,24,25,26,27,28]})

data_A

Unnamed: 0,key,value
0,1,12
1,2,13
2,3,14
3,4,15
4,5,16
5,6,17
6,7,18


In [12]:
data_B

Unnamed: 0,key,value
0,1,22
1,3,23
2,5,24
3,7,25
4,9,26
5,11,27
6,13,28


__Left Join__

In [13]:
data_A.merge(data_B, on='key', how='left')

Unnamed: 0,key,value_x,value_y
0,1,12,22.0
1,2,13,
2,3,14,23.0
3,4,15,
4,5,16,24.0
5,6,17,
6,7,18,25.0


__Right Join__

In [14]:
data_A.merge(data_B, on='key', how='right')

Unnamed: 0,key,value_x,value_y
0,1,12.0,22
1,3,14.0,23
2,5,16.0,24
3,7,18.0,25
4,9,,26
5,11,,27
6,13,,28


__Inner Join__

In [15]:
data_A.merge(data_B, on='key', how='inner')

Unnamed: 0,key,value_x,value_y
0,1,12,22
1,3,14,23
2,5,16,24
3,7,18,25


__Outer Join__

In [16]:
data_A.merge(data_B, on='key', how='outer')

Unnamed: 0,key,value_x,value_y
0,1,12.0,22.0
1,2,13.0,
2,3,14.0,23.0
3,4,15.0,
4,5,16.0,24.0
5,6,17.0,
6,7,18.0,25.0
7,9,,26.0
8,11,,27.0
9,13,,28.0
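In the outputs above, `value_x` and `value_y` are the default names pandas gives to columns that share a name in both tables, and it is not always obvious which table a row came from. As a sketch, the `suffixes` and `indicator` parameters of `merge` can make both explicit; the suffix labels `_A` and `_B` are arbitrary choices.

```python
import pandas as pd

data_A = pd.DataFrame({'key': [1, 2, 3, 4, 5, 6, 7],
                       'value': [12, 13, 14, 15, 16, 17, 18]})
data_B = pd.DataFrame({'key': [1, 3, 5, 7, 9, 11, 13],
                       'value': [22, 23, 24, 25, 26, 27, 28]})

# suffixes renames the clashing 'value' columns;
# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
merged = data_A.merge(data_B, on='key', how='outer',
                      suffixes=('_A', '_B'), indicator=True)
print(merged)
print(merged['_merge'].value_counts())
```

Filtering on `_merge` (for example, keeping only `'left_only'` rows) is one way to list the keys that failed to match.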


__Where__

Sometimes, the two datasets are the same, but both sets are incomplete, and we want to fill in missing values in one from the other.

## 7.5 Pivot tables


## 7.6 Groupbys