# 07 Data Wrangling
__Math 3080: Fundamentals of Data Science__

Reading:
* McKinney, Chapter 7 - Data Cleaning and Preparation
* McKinney, Chapter 8 - Data Wrangling: Join, Combine, and Reshape

Outline:
1. Mapping
2. Sampling
3. Dummy Variables / Indicators
    * Value counts
4. Joining two datasets
5. Pivot tables
6. Groupbys

The book discusses other methods that we won't cover here but that are valuable resources:
* Regular Expressions
* String methods and manipulation

-----
We often have two sets of data on the same subject, and both add a good deal of information. Wouldn't it be nice to merge the datasets together? If we could do that, our options for what to do with data would increase significantly. 

Also, what if the data is not quite in the format we want? For example, what if we have a list of observations by date, but we'd like a table in which the rows are indexed by date and the columns by year?

In this section, we will look at how to accomplish both of these tasks. Both are part of a branch of data science called __data wrangling__.

## 7.1 Mapping
Sometimes, we have a dataset that could use a little more information. Take the following dataset on different kinds of meat:

In [1]:
import numpy as np
import pandas as pd

In [2]:
meat_data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon",
                              "pastrami", "corned beef", "bacon",
                              "pastrami", "honey ham", "nova lox"],
                      "ounces": [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
meat_data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5
5,bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


We have a variety of meats. Let's add a little information to indicate which animal each meat comes from. We do this with a technique called __mapping__: take the value of one variable in the dataset and use it to look up a second value in a separate lookup table (here, a Python dictionary). For example, "bacon" in the `food` variable has an entry in the dictionary that returns the animal "pig".

In [5]:
meat_to_animal = {
  "bacon": "pig",
  "pulled pork": "pig",
  "pastrami": "cow",
  "corned beef": "cow",
  "honey ham": "pig",
  "nova lox": "salmon"
}

meat_data['Animal'] = meat_data['food'].map(meat_to_animal)
meat_data

Unnamed: 0,food,ounces,Animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon
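
One thing to watch for: when `.map()` is given a dictionary, any value without a matching key comes back as `NaN`. The sketch below uses a small hypothetical series (the "turkey breast" entry and the "unknown" placeholder are assumptions for illustration, not part of the dataset above) to show this, along with one possible way to fill the gaps.

```python
import pandas as pd

# Hypothetical foods; "turkey breast" has no entry in the lookup dictionary.
foods = pd.Series(["bacon", "pastrami", "turkey breast"])
meat_to_animal = {"bacon": "pig", "pastrami": "cow"}

# Values missing from the dictionary are mapped to NaN.
animals = foods.map(meat_to_animal)
print(animals)

# One option: replace the missing lookups with a placeholder label.
animals_filled = animals.fillna("unknown")
print(animals_filled)
```

Checking for `NaN` after a mapping is a quick way to find categories that the lookup dictionary forgot.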


## 7.2 Sampling

In [None]:
samples = np.random.permutation(9)
samples

In [None]:
meat_data.iloc[samples]

In [None]:
meat_data.sample(n=4)
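
In the cells above, `np.random.permutation(9)` returns the integers 0 through 8 in a random order, `.iloc` reorders the rows to match, and `.sample(n=4)` draws four rows at random. Here is a minimal sketch of a few variations; the small `df` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"food": ["bacon", "pastrami", "honey ham", "nova lox"],
                   "ounces": [4, 6, 5, 6]})

# Shuffle all rows with a seeded generator so the result is reproducible.
rng = np.random.default_rng(42)
shuffled = df.iloc[rng.permutation(len(df))]

# Draw 2 rows without replacement; random_state fixes the draw.
two_rows = df.sample(n=2, random_state=42)

# Draw half of the rows by fraction instead of by count.
half = df.sample(frac=0.5, random_state=42)

# Sampling with replacement allows more draws than there are rows.
bootstrap = df.sample(n=6, replace=True, random_state=42)

print(shuffled, two_rows, half, bootstrap, sep="\n\n")
```

Using `len(df)` instead of a hard-coded 9 keeps the shuffle correct if rows are added or removed later.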

## 7.3 Dummy Variables / Indicators
Dummy variables (also called indicator variables) turn every unique value in a column into its own column named after that value. Each new column holds a 1 in the rows where the original column had that value and a 0 in the rows where it did not.

In [None]:
meat_data

In [None]:
pd.get_dummies(meat_data['Animal'])

If we want to see them together with the original data, we can do a join, which we discuss in Section 7.4. But for now, summing each dummy column gives us a way to count the rows in each category.

In [None]:
pd.get_dummies(meat_data['Animal']).sum()
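
As a quick check, the column totals of the dummy variables should agree with the `.value_counts()` method covered under Data Summaries below. This sketch rebuilds the `Animal` column by hand so that it runs on its own. Passing `dtype=int` is an addition here: recent versions of pandas return `True`/`False` by default, and this asks for 0/1 instead.

```python
import pandas as pd

# The Animal column from meat_data, rebuilt so the example is self-contained.
animal = pd.Series(["pig", "pig", "pig", "cow", "cow", "pig", "cow", "pig", "salmon"],
                   name="Animal")

# dtype=int requests 0/1 integers rather than booleans.
dummies = pd.get_dummies(animal, dtype=int)

counts_from_dummies = dummies.sum()      # column totals of the indicators
counts_direct = animal.value_counts()    # same counts, sorted by frequency

print(counts_from_dummies)
print(counts_direct)
```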

### Data Summaries
Pandas has a couple of built-in methods that provide additional summary data. We will now look at:
* `.describe()`
* `.value_counts()`

The `.describe()` method takes the numerical variables and calculates the count, mean, standard deviation, minimum, quartiles, and maximum of each.

In [None]:
meat_data.describe()

In [None]:
meat_data[meat_data['Animal'] == 'pig'].describe()

In [None]:
meat_data[meat_data['Animal'] == 'cow'].describe()

The `.value_counts()` method counts how many times each unique observation occurs.

In [None]:
meat_data.value_counts()

In [None]:
meat_data['food'].value_counts()

In the full `meat_data` table, every row is a unique observation, so each count is 1. That does not adequately show how useful this method can be. Let's look at a questionnaire with two 'Yes' or 'No' questions.

In [None]:
DS_Survey = pd.DataFrame({
    'Enjoy Math2210' : ['Yes','Yes','Yes','No','No','Yes','No','Yes','Yes','Yes','No','No'],
    'SE Major' : ['Yes','No','Yes','Yes','No','Yes','No','No','Yes','Yes','No','No']
})
DS_Survey

In [None]:
DS_Survey.value_counts()

We can also look at the value counts of an individual variable:

In [None]:
DS_Survey['Enjoy Math2210'].value_counts()

In [None]:
meat_data['Animal'].value_counts()
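
Counts are often easier to compare as proportions. A minimal sketch using the same survey data: the `normalize=True` option of `.value_counts()` divides each count by the total number of rows.

```python
import pandas as pd

DS_Survey = pd.DataFrame({
    'Enjoy Math2210': ['Yes', 'Yes', 'Yes', 'No', 'No', 'Yes',
                       'No', 'Yes', 'Yes', 'Yes', 'No', 'No'],
    'SE Major': ['Yes', 'No', 'Yes', 'Yes', 'No', 'Yes',
                 'No', 'No', 'Yes', 'Yes', 'No', 'No']
})

# Count each (Enjoy Math2210, SE Major) combination.
combo_counts = DS_Survey.value_counts()

# Same combinations, expressed as a share of all responses.
combo_share = DS_Survey.value_counts(normalize=True)

# Share of 'Yes' and 'No' answers for a single question.
enjoy_share = DS_Survey['Enjoy Math2210'].value_counts(normalize=True)

print(combo_counts)
print(combo_share)
print(enjoy_share)
```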

## 7.4 Joining two datasets
We often have two datasets that hold complementary information and can be joined together. We just saw two such datasets: the meat data and the `get_dummies` breakdown of the animals the meat comes from. Let's join them together.

In [7]:
meat_data

Unnamed: 0,food,ounces,Animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
5,bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [6]:
pd.get_dummies(meat_data['Animal'])

Unnamed: 0,cow,pig,salmon
0,0,1,0
1,0,1,0
2,0,1,0
3,1,0,0
4,1,0,0
5,0,1,0
6,1,0,0
7,0,1,0
8,0,0,1


In [8]:
meat_data.join(pd.get_dummies(meat_data['Animal']))

Unnamed: 0,food,ounces,Animal,cow,pig,salmon
0,bacon,4.0,pig,0,1,0
1,pulled pork,3.0,pig,0,1,0
2,bacon,12.0,pig,0,1,0
3,pastrami,6.0,cow,1,0,0
4,corned beef,7.5,cow,1,0,0
5,bacon,8.0,pig,0,1,0
6,pastrami,3.0,cow,1,0,0
7,honey ham,5.0,pig,0,1,0
8,nova lox,6.0,salmon,0,0,1


The `join` method merges two datasets based on the index: index 0 from `meat_data` is matched with index 0 in the `get_dummies` dataset. This works when the rows of both datasets line up on the same index. But sometimes the data is not ordered the same way, or one dataset is complete and the other covers only a subset of the first. For example, take two datasets about students' GPAs and GRE scores:
* `gpa_data` has the student ID and GPA of all students
* `gre_data` has the student ID and the score they earned on the GRE

Not all students take the GRE, so not all students in `gpa_data` will be in `gre_data`. We would still like to merge that data if possible. In pandas, we do this with a more flexible relative of `join`: `merge`.
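
To make the GPA and GRE scenario concrete, here is a minimal sketch with made-up records. The student IDs, GPAs, and GRE scores are hypothetical values invented for illustration. The `how='left'` argument keeps every row of `gpa_data`; the different join types are described in the rest of this section.

```python
import pandas as pd

# Hypothetical records: every student has a GPA, only some took the GRE.
gpa_data = pd.DataFrame({"StudentID": [101, 102, 103, 104],
                         "GPA": [3.4, 3.9, 2.8, 3.1]})
gre_data = pd.DataFrame({"StudentID": [104, 102],
                         "GRE": [318, 331]})

# Match rows on StudentID; the row order in the two tables does not matter.
combined = gpa_data.merge(gre_data, on="StudentID", how="left")

# Students without a GRE score get NaN in the GRE column.
print(combined)
```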

Let's take a closer look at the different types of joins.

When we join datasets, there are four ways in which they can be joined:
* Left join (all data in the left table is kept, any unmatched data from the right table is dropped)
* Right join (all data in the right table is kept, any unmatched data from the left table is dropped)
* Inner join (only data that matches in both the left and right tables is kept)
* Outer join (all data is kept, whether it matches or not)

![Different types of joins](https://d33wubrfki0l68.cloudfront.net/9c12ca9e12ed26a7c5d2aa08e36d2ac4fb593f1e/79980/diagrams/join-outer.png)
* Image from *R for Data Science*, Hadley Wickham & Garrett Grolemund, 2017.

Let's demonstrate this with two small example datasets. Assume you are a manager for a grocery store and are tracking your inventory and sales.

In [9]:
import pandas as pd
inventory = pd.DataFrame({'ItemID':[1,2,3,4,5],
                          'Item':["Milk","12 Eggs","Bread","PB","Chips"],
                          'Price':[2.97,1.25,1.10,2.15,4.25]})

inventory # Use with the merge command

Unnamed: 0,ItemID,Item,Price
0,1,Milk,2.97
1,2,12 Eggs,1.25
2,3,Bread,1.1
3,4,PB,2.15
4,5,Chips,4.25


In [10]:
sales = pd.DataFrame({'Sale#':[1,1,2,2,2,2,3,3,3,3],
                      'ItemID':[1,2,1,1,3,4,1,2,3,6],
                      'Customer':[24,24,134,134,134,134,97,97,97,97]})

sales # Use with the merge command

Unnamed: 0,Sale#,ItemID,Customer
0,1,1,24
1,1,2,24
2,2,1,134
3,2,1,134
4,2,3,134
5,2,4,134
6,3,1,97
7,3,2,97
8,3,3,97
9,3,6,97


Notice that there is an item in the inventory that doesn't appear in sales (Item 5: Chips), and there is even one item in sales that doesn't appear in the inventory (Item 6). You'll see how these are affected in the various joins.

__Left Join__: Every item in the left dataset is kept, and matching items from the right dataset are attached to it
* If items from the right dataset don't appear in the left, they are dropped
* If items from the left dataset don't appear in the right, the columns that come from the right dataset are filled with `NaN`

In [14]:
inventory.merge(sales, on='ItemID', how='left')

Unnamed: 0,ItemID,Item,Price,Sale#,Customer
0,1,Milk,2.97,1.0,24.0
1,1,Milk,2.97,2.0,134.0
2,1,Milk,2.97,2.0,134.0
3,1,Milk,2.97,3.0,97.0
4,2,12 Eggs,1.25,1.0,24.0
5,2,12 Eggs,1.25,3.0,97.0
6,3,Bread,1.1,2.0,134.0
7,3,Bread,1.1,3.0,97.0
8,4,PB,2.15,2.0,134.0
9,5,Chips,4.25,,


With the `merge` command, we specified which column to use to match the two datasets. The `join` command uses the index instead. To show this, let's set the index of each dataframe to be the ItemID. The results should then be exactly the same.

In [11]:
inventory_by_ID = inventory.set_index('ItemID')
inventory_by_ID # Use with the join command

Unnamed: 0_level_0,Item,Price
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Milk,2.97
2,12 Eggs,1.25
3,Bread,1.1
4,PB,2.15
5,Chips,4.25


In [12]:
sales_by_ID = sales.set_index('ItemID')
sales_by_ID # Use with the join command

Unnamed: 0_level_0,Sale#,Customer
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1,24
2,1,24
1,2,134
1,2,134
3,2,134
4,2,134
1,3,97
2,3,97
3,3,97
6,3,97


In [13]:
inventory_by_ID.join(sales_by_ID, how='left')

Unnamed: 0_level_0,Item,Price,Sale#,Customer
ItemID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,Milk,2.97,1.0,24.0
1,Milk,2.97,2.0,134.0
1,Milk,2.97,2.0,134.0
1,Milk,2.97,3.0,97.0
2,12 Eggs,1.25,1.0,24.0
2,12 Eggs,1.25,3.0,97.0
3,Bread,1.1,2.0,134.0
3,Bread,1.1,3.0,97.0
4,PB,2.15,2.0,134.0
5,Chips,4.25,,


So, we see how `join` uses the index to merge two datasets. That is how it worked earlier when we dealt with meat sales and the get_dummies data for the animal type. For the rest of these joins, we will just use the `merge` command, but know that the `join` will work the same way based on the index.
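
To see the relationship between the two commands side by side, here is a minimal sketch that runs the same left join three ways: with `merge` on a column, with `join` on the index, and with `merge` told to use the index through its `left_index` and `right_index` options.

```python
import pandas as pd

inventory = pd.DataFrame({'ItemID': [1, 2, 3, 4, 5],
                          'Item': ["Milk", "12 Eggs", "Bread", "PB", "Chips"],
                          'Price': [2.97, 1.25, 1.10, 2.15, 4.25]})
sales = pd.DataFrame({'Sale#': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                      'ItemID': [1, 2, 1, 1, 3, 4, 1, 2, 3, 6],
                      'Customer': [24, 24, 134, 134, 134, 134, 97, 97, 97, 97]})

# Column-based match.
merged = inventory.merge(sales, on='ItemID', how='left')

# Index-based match, then move ItemID back into a regular column.
joined = (inventory.set_index('ItemID')
          .join(sales.set_index('ItemID'), how='left')
          .reset_index())

# The same index-based match written with merge's index options.
merged_on_index = inventory.set_index('ItemID').merge(
    sales.set_index('ItemID'), left_index=True, right_index=True, how='left')

print(merged.shape, joined.shape, merged_on_index.shape)
```

In `merged_on_index`, ItemID stays in the index, so that table has one fewer column than the other two.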

__Right Join__: Every item in the right dataset is kept, and matching items from the left dataset are attached to it
* If items from the left dataset don't appear in the right, they are dropped
* If items from the right dataset don't appear in the left, the columns that come from the left dataset are filled with `NaN`

In [15]:
inventory.merge(sales, on='ItemID', how='right')

Unnamed: 0,ItemID,Item,Price,Sale#,Customer
0,1,Milk,2.97,1,24
1,2,12 Eggs,1.25,1,24
2,1,Milk,2.97,2,134
3,1,Milk,2.97,2,134
4,3,Bread,1.1,2,134
5,4,PB,2.15,2,134
6,1,Milk,2.97,3,97
7,2,12 Eggs,1.25,3,97
8,3,Bread,1.1,3,97
9,6,,,3,97


__Inner Join__: Only items that appear in both datasets are kept
* If items from the right dataset don't appear in the left, they are dropped
* If items from the left dataset don't appear in the right, they are dropped

If the `how` argument is not specified, the default is `how='inner'`.

In [16]:
# Same as: inventory.merge(sales, on='ItemID', how='inner')
inventory.merge(sales, on='ItemID')

Unnamed: 0,ItemID,Item,Price,Sale#,Customer
0,1,Milk,2.97,1,24
1,1,Milk,2.97,2,134
2,1,Milk,2.97,2,134
3,1,Milk,2.97,3,97
4,2,12 Eggs,1.25,1,24
5,2,12 Eggs,1.25,3,97
6,3,Bread,1.1,2,134
7,3,Bread,1.1,3,97
8,4,PB,2.15,2,134


__Outer Join / Full Join__: All items from both datasets are kept, whether they match or not
* If items from the right dataset don't appear in the left, the columns that come from the left dataset are filled with `NaN`
* If items from the left dataset don't appear in the right, the columns that come from the right dataset are filled with `NaN`

In [17]:
inventory.merge(sales, on='ItemID', how='outer')

Unnamed: 0,ItemID,Item,Price,Sale#,Customer
0,1,Milk,2.97,1.0,24.0
1,1,Milk,2.97,2.0,134.0
2,1,Milk,2.97,2.0,134.0
3,1,Milk,2.97,3.0,97.0
4,2,12 Eggs,1.25,1.0,24.0
5,2,12 Eggs,1.25,3.0,97.0
6,3,Bread,1.1,2.0,134.0
7,3,Bread,1.1,3.0,97.0
8,4,PB,2.15,2.0,134.0
9,5,Chips,4.25,,
10,6,,,3.0,97.0
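
When a join produces `NaN` values, it helps to know which table each row came from. A minimal sketch: the `indicator=True` option of `merge` adds a `_merge` column that labels each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

inventory = pd.DataFrame({'ItemID': [1, 2, 3, 4, 5],
                          'Item': ["Milk", "12 Eggs", "Bread", "PB", "Chips"],
                          'Price': [2.97, 1.25, 1.10, 2.15, 4.25]})
sales = pd.DataFrame({'Sale#': [1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                      'ItemID': [1, 2, 1, 1, 3, 4, 1, 2, 3, 6],
                      'Customer': [24, 24, 134, 134, 134, 134, 97, 97, 97, 97]})

# Outer join with a source label for every row.
merged = inventory.merge(sales, on='ItemID', how='outer', indicator=True)
print(merged[['ItemID', 'Item', 'Sale#', '_merge']])

# Items stocked but never sold, and items sold but missing from the inventory.
left_only = merged.loc[merged['_merge'] == 'left_only', 'ItemID'].tolist()
right_only = merged.loc[merged['_merge'] == 'right_only', 'ItemID'].tolist()
print(left_only, right_only)
```

Here the unsold item is Chips (ItemID 5), and the sale with no inventory record is ItemID 6.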


## 7.5 Pivot tables
A lot of data is just a list of observations. For example, here is a list of students and their exam scores:

In [18]:
student_scores = pd.DataFrame({
    'StudentID' : [1,2,3,4,5, 1,2,3,4,5, 1,2,3,4,5],
    'Exam #' : [1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3],
    'Score' : [91,92,97,87,83, 82,89,85,79,93, 86,78,84,97,94]
})
student_scores

Unnamed: 0,StudentID,Exam #,Score
0,1,1,91
1,2,1,92
2,3,1,97
3,4,1,87
4,5,1,83
5,1,2,82
6,2,2,89
7,3,2,85
8,4,2,79
9,5,2,93


This is called a __stacked__ dataset. We also say that the data is in __long format__: each row holds a single observation, so the table has many rows and only a few columns.

On the other hand, when a dataset is more like a table with more columns and fewer rows, we call this __wide format__.

### Pivot Tables
We can use a stacked dataset to make a __pivot table__ where one variable is the row, another variable is the column, and a third variable would be the value within the table. In other words, we change from long format into wide format. For example, let's make a pivot table where the student ID is the row and the different exams make the columns.

In [19]:
scores_pivot = student_scores.pivot(index='StudentID', columns='Exam #', values='Score')
scores_pivot
# Known as a Pivot Table, or Wide Format

Exam #,1,2,3
StudentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,91,82,86
2,92,89,78
3,97,85,84
4,87,79,97
5,83,93,94
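
The `.pivot()` method needs exactly one value for each row and column pair. If the data contains duplicates, one possible alternative is `.pivot_table()`, which aggregates them. The sketch below uses hypothetical scores in which student 1 has two recorded attempts at exam 1; the data and the choice of averaging are assumptions for illustration.

```python
import pandas as pd

# Hypothetical scores: student 1 has two recorded attempts at exam 1.
retakes = pd.DataFrame({
    'StudentID': [1, 1, 2, 1, 2],
    'Exam #':    [1, 1, 1, 2, 2],
    'Score':     [80, 90, 92, 82, 89]
})

# .pivot() raises an error because it cannot decide which duplicate to keep.
try:
    retakes.pivot(index='StudentID', columns='Exam #', values='Score')
except ValueError as err:
    print("pivot failed:", err)

# .pivot_table() aggregates the duplicates instead (here, by averaging them).
retake_table = retakes.pivot_table(index='StudentID', columns='Exam #',
                                   values='Score', aggfunc='mean')
print(retake_table)
```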


We can also reverse this process by __melting__ data in wide format, which turns it back into long format.

In [20]:
student_scores2 = pd.DataFrame({
    'StudentID' : [1,2,3,4,5],
    'Exam #1' : [91,92,97,87,83],
    'Exam #2' : [82,89,85,79,93],
    'Exam #3' : [86,78,84,97,94]
})
student_scores2

Unnamed: 0,StudentID,Exam #1,Exam #2,Exam #3
0,1,91,82,86
1,2,92,89,78
2,3,97,85,84
3,4,87,79,97
4,5,83,93,94


In [21]:
student_scores2.melt(id_vars="StudentID")
# Known as Long Format

Unnamed: 0,StudentID,variable,value
0,1,Exam #1,91
1,2,Exam #1,92
2,3,Exam #1,97
3,4,Exam #1,87
4,5,Exam #1,83
5,1,Exam #2,82
6,2,Exam #2,89
7,3,Exam #2,85
8,4,Exam #2,79
9,5,Exam #2,93
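
The default column names `variable` and `value` are not very descriptive. A minimal sketch: the `var_name` and `value_name` options of `.melt()` rename them (the names "Exam" and "Score" are our own choice), and pivoting the result returns the data to wide format.

```python
import pandas as pd

student_scores2 = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5],
    'Exam #1': [91, 92, 97, 87, 83],
    'Exam #2': [82, 89, 85, 79, 93],
    'Exam #3': [86, 78, 84, 97, 94]
})

# Wide to long, with descriptive names for the two new columns.
long_scores = student_scores2.melt(id_vars='StudentID',
                                   var_name='Exam', value_name='Score')
print(long_scores)

# Long back to wide: one row per student, one column per exam.
wide_again = long_scores.pivot(index='StudentID', columns='Exam', values='Score')
print(wide_again)
```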


## 7.6 Groupbys
Another way to reorganize the data is to group it by specific values and then calculate the minimum, maximum, median, mean, standard deviation, variance, or total (sum) of the values for each group. In our student example, we have two ways to group the data: by student and by exam.

In [22]:
student_scores

Unnamed: 0,StudentID,Exam #,Score
0,1,1,91
1,2,1,92
2,3,1,97
3,4,1,87
4,5,1,83
5,1,2,82
6,2,2,89
7,3,2,85
8,4,2,79
9,5,2,93


Let's say you want to know the average for each exam. We group by exam number, then tell pandas we want the average (mean). If we just do a groupby without a calculation, pandas only creates a `GroupBy` object, which has nothing to display until we ask for a calculation.

In [23]:
student_scores.groupby('Exam #')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f52577fe610>
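
Although the `GroupBy` object does not display anything by itself, we can still look inside it. A minimal sketch: looping over the object yields each group's key together with the rows that belong to it, `.size()` counts the rows per group, and `.get_group()` pulls out a single group.

```python
import pandas as pd

student_scores = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Exam #': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'Score': [91, 92, 97, 87, 83, 82, 89, 85, 79, 93, 86, 78, 84, 97, 94]
})

grouped = student_scores.groupby('Exam #')

# Iterating yields (group key, sub-DataFrame) pairs.
for exam, rows in grouped:
    print(f"Exam {exam}: {len(rows)} rows, scores {rows['Score'].tolist()}")

group_sizes = grouped.size()      # number of rows in each group
exam_2 = grouped.get_group(2)     # only the rows for exam 2
print(group_sizes)
print(exam_2)
```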

In [24]:
student_scores.groupby('Exam #').mean()

Unnamed: 0_level_0,StudentID,Score
Exam #,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.0,90.0
2,3.0,85.6
3,3.0,87.8


In [25]:
student_scores.groupby('Exam #').std(ddof=1)

Unnamed: 0_level_0,StudentID,Score
Exam #,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1.581139,5.291503
2,1.581139,5.549775
3,1.581139,7.694154


Notice how the "StudentID" column makes no sense at all. After all, is there any meaning to the average ID? No. So, we can ignore or even drop that column.

In [26]:
student_scores.groupby('Exam #')["Score"].mean()

Exam #
1    90.0
2    85.6
3    87.8
Name: Score, dtype: float64

In [27]:
student_scores.groupby('Exam #').mean().drop('StudentID', axis=1)

Unnamed: 0_level_0,Score
Exam #,Unnamed: 1_level_1
1,90.0
2,85.6
3,87.8
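
Instead of running one calculation at a time, we can ask for several at once. A minimal sketch: selecting the `Score` column first and then calling `.agg()` with a list of statistic names returns one column per statistic.

```python
import pandas as pd

student_scores = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Exam #': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'Score': [91, 92, 97, 87, 83, 82, 89, 85, 79, 93, 86, 78, 84, 97, 94]
})

# One row per exam, one column per requested statistic.
summary = student_scores.groupby('Exam #')['Score'].agg(['mean', 'std', 'min', 'max'])
print(summary)
```

Selecting `Score` before aggregating also avoids computing statistics for the `StudentID` column.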



We can also see how each student did. Let's find each student's total score for the three exams:

In [28]:
student_scores.groupby('StudentID').sum()

Unnamed: 0_level_0,Exam #,Score
StudentID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,6,259
2,6,259
3,6,266
4,6,263
5,6,270
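
As with the average StudentID earlier, the summed `Exam #` column carries no meaning. A minimal sketch of one way to avoid it: select `Score` before aggregating, and use named aggregation to label the results (the column names `total` and `average` are our own choice).

```python
import pandas as pd

student_scores = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'Exam #': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'Score': [91, 92, 97, 87, 83, 82, 89, 85, 79, 93, 86, 78, 84, 97, 94]
})

# Total and average score per student, with descriptive column names.
per_student = student_scores.groupby('StudentID')['Score'].agg(total='sum',
                                                               average='mean')
print(per_student)

# StudentID of the student with the highest total.
best_student = per_student['total'].idxmax()
print(best_student)
```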


-----
Other topics to include in the future:
* .apply()
  * lambda functions