# Data Wrangling
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)
* Chapters 8, 10

Class notes are found through GitHub. As changes are made, they will automatically be uploaded to GitHub. A link to the repository is on Canvas.

-----
## Outcomes
### Course

### Data Analysis Certification
* Databases: Data joins

-----
## Outline
* Concatenating
* Mapping
* Encoding
    * One-hot encoding / Dummy Variables
    * Ordinal Encoding
* Joining
* Reshaping

What is the purpose of data wrangling?
* Combine datasets that augment our view of the data
* Look at the data from a different perspective
* ... other reasons ...

This is one of the core skills of a data scientist.

In [None]:
import pandas as pd
students = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Students')
assignments = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Assignments')
grades = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Grades')


In [None]:
print(students.shape)
students.head()

In [None]:
print(assignments.shape)
assignments.head()

In [None]:
print(grades.shape)
grades.head()

## Concatenate data

Let's say a few new students move in. We need to add their data to the `students` dataframe.

In [None]:
new_students = pd.DataFrame({
    'student_id' : [1101, 1102, 1103],
    'first_name' : ['John', 'Jill', 'Jim'],
    'last_name' : ['Anderson', 'Benson', 'Kent'],
    'class_year' : ['Senior', 'Freshman', 'Sophomore'],
    'gpa' : [3.41, 2.77, 3.87],
    'major' : ['Physics', 'Biology', 'English'],
    'first_generation_student' : ['Yes', 'Yes', 'Yes'],
    'financial_aid' : ['Yes', 'Yes', 'Yes'],
    'housing_status' : ['On-Campus', 'On-Campus', 'Off-Campus'],
    'credits_completed' : [82, 7, 16]
})

new_students

In [None]:
pd.concat([students, new_students], axis=0).tail(7)

(What if we don't have data on all variables? Try commenting out a couple of entries in the `new_students` dictionary.)
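
To see what happens when a column is missing without editing the cell above, here is a minimal sketch on a made-up miniature roster. The `roster` and `newcomer` frames are hypothetical stand-ins, not the Gradebook data. When one frame lacks a column, `pd.concat` keeps the column and fills the gap with `NaN`.

```python
import pandas as pd

# Hypothetical miniature roster; column names mirror the students sheet
roster = pd.DataFrame({
    'student_id': [1001, 1002],
    'first_name': ['Ana', 'Ben'],
    'gpa': [3.2, 3.6],
})

# The newcomer has no 'gpa' entry yet
newcomer = pd.DataFrame({
    'student_id': [1101],
    'first_name': ['John'],
})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels
combined = pd.concat([roster, newcomer], axis=0, ignore_index=True)
print(combined)
```

The missing `gpa` shows up as `NaN` in the last row, which also turns an integer column into floats if the gap falls in one.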

Now, let's say that in the `assignments` dataframe, we want to identify the segment of the course in which each assignment is given. Let's say the course is broken into 4 segments. We can do this in one of three ways: (1) concatenating, (2) directly adding columns, and (3) mapping.

In [None]:
# Naming the Series gives the new column a header instead of the default 0
segments = pd.Series([1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,1,1,2,3,3,4,2,4], name='segment')
pd.concat([assignments, segments], axis=1)

In [None]:
assignments['segment'] = segments
assignments
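
Both approaches above line values up by index label, not by position. The sketch below illustrates this on a hypothetical three-row `tasks` frame (an assumption for illustration, not the real assignments sheet). It also shows what happens when the Series is one element short.

```python
import pandas as pd

# Hypothetical stand-in for the assignments sheet
tasks = pd.DataFrame({'assignment_id': ['HW1', 'HW2', 'EX1'],
                      'category': ['Homework', 'Homework', 'Exam']})

# The Series name becomes the column header after concatenation
segment = pd.Series([1, 1, 2], name='segment')
with_segment = pd.concat([tasks, segment], axis=1)
print(with_segment)

# A Series that is one element short leaves NaN in the unmatched row
short = pd.Series([1, 1], name='segment')
misaligned = pd.concat([tasks, short], axis=1)
print(misaligned)
```

Because alignment happens on the index, it is worth checking that the Series has the same length and index as the dataframe before attaching it.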

## Mapping

Let's map the assignment type from the assignments sheet onto the grades sheet.

In [None]:
hw_dict = pd.Series(assignments['category'].values, index=assignments['assignment_id']).to_dict()
hw_dict

In [None]:
grades['category'] = grades['assignment_id'].map(hw_dict)
grades.head()
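
As a small self-contained illustration of the same idea, the sketch below uses hypothetical `tasks` and `scores` frames (assumed names and values). It passes a lookup Series directly to `map()`, which accepts a Series as well as a dictionary, and shows that a key missing from the lookup maps to `NaN`.

```python
import pandas as pd

# Hypothetical stand-ins for the assignments and grades sheets
tasks = pd.DataFrame({'assignment_id': ['HW1', 'EX1'],
                      'category': ['Homework', 'Exam']})
scores = pd.DataFrame({'assignment_id': ['HW1', 'EX1', 'HW1', 'PR1'],
                       'grade': [90, 75, 82, 88]})

# Lookup Series: index holds the key, values hold the category
lookup = tasks.set_index('assignment_id')['category']

# 'PR1' is not in the lookup, so its category comes back as NaN
scores['category'] = scores['assignment_id'].map(lookup)
print(scores)
```

Counting the `NaN` values in the new column is a quick way to spot keys that failed to match.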

## One-hot Encoding

Most models require numerical inputs, so we need to convert categorical variables into numerical variables that represent those categories. This process is known as __encoding__.

For nominal variables, we use __one-hot encoding__ (in Pandas, the `pd.get_dummies()` function does this). It creates one column for each category and assigns each observation a 1 in the column for its category and a 0 in every other column.

Let's do this for the students' major in the `students` dataset.

In [None]:
students.head()

In [None]:
major_dummies = pd.get_dummies(students['major']).astype(int)
major_dummies.head()

How do we add this onto the original dataset? We need to concatenate them.

In [None]:
students = pd.concat([students, major_dummies], axis=1)
students.head(7)
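
Two optional arguments of `pd.get_dummies()` are often useful here. The sketch below, on a hypothetical three-row `people` frame, uses `prefix` to label the new columns and `drop_first=True` to drop one redundant column (its value can be inferred from the others). Whether dropping a column is appropriate depends on the model, so treat this as an illustration only.

```python
import pandas as pd

# Hypothetical miniature student table
people = pd.DataFrame({'name': ['Ana', 'Ben', 'Cy'],
                       'major': ['Physics', 'Biology', 'Physics']})

# prefix= makes it clear which variable each dummy column came from
dummies = pd.get_dummies(people['major'], prefix='major').astype(int)
encoded = pd.concat([people, dummies], axis=1)
print(encoded)

# drop_first=True omits the first category (alphabetically) as a baseline
reduced = pd.get_dummies(people['major'], prefix='major', drop_first=True).astype(int)
print(reduced)
```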

## Ordinal Encoding

For ordinal variables, we can represent each category with a number on a scale (for example, pain level is often represented "on a scale from 1 to 10"). We can do this by mapping each student's class year to a number.

In [None]:
classes = {
    'Freshman' : 1,
    'Sophomore' : 2,
    'Junior' : 3,
    'Senior' : 4
}

students['class_num'] = students['class_year'].map(classes)
students.head(7)
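
An alternative to a hand-written dictionary is an ordered categorical, sketched below on a hypothetical `years` Series. This is a different technique from the mapping above, shown only for comparison. The category codes start at 0, so 1 is added to match the 1 to 4 scale.

```python
import pandas as pd

# Hypothetical class-year values
years = pd.Series(['Senior', 'Freshman', 'Junior', 'Sophomore', 'Junior'])

# The list order defines the ranking from lowest to highest
order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
as_cat = pd.Categorical(years, categories=order, ordered=True)

# .codes gives 0-based positions in `order`; shift to a 1-based scale
codes = pd.Series(as_cat.codes) + 1
print(codes.tolist())
```

The dictionary approach is more explicit, while the categorical keeps the ordering attached to the data itself.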

-----

## Joins

A __join__ pairs observations from one table with observations from another table based on a shared key column. Let's look at these two tables:

In [None]:
import numpy as np
import pandas as pd

depts = pd.DataFrame({
    'dept_id' : [10,20,30,40],
    'dept_name' : ['Engineering','Sales','Marketing','HR']
})

display(depts)

employees = pd.DataFrame({
    'emp_id' : [1,2,3,4,5],
    'emp_name' : ['Alice','Bob','Carol','Dan','Eve'],
    'dept_id' : [10,20,20,50,np.nan]
})

display(employees)

There are four ways we can join tables:

### Left Join

A join always involves two tables. A __left join__ keeps every row of the first (left) table. Where a row has a match in the second (right) table, the matching columns are added; where it has none, those columns are filled with `NaN`.

<img src="./images/leftjoin.png" alt="Left Join" width=250, height=250>

In [None]:
pd.merge(employees, depts, on='dept_id', how='left')

### Right Join

A __right join__ keeps every row of the second (right) table. Where a row has a match in the first (left) table, the matching columns are added; where it has none, those columns are filled with `NaN`.

<img src="./images/rightjoin.png" alt="Right Join" width=250, height=250>

In [None]:
pd.merge(employees, depts, on='dept_id', how='right')

### Outer Join

An __outer join__ will take all the entries for both tables. Any missing data is marked as `NaN`.

<img src="./images/outerjoin.png" alt="Outer Join" width=250, height=250>

In [None]:
pd.merge(employees, depts, on='dept_id', how='outer')

### Inner Join

An __inner join__ keeps only the rows whose key appears in both tables.

<img src="./images/innerjoin.png" alt="Inner Join" width=250, height=250>

In [None]:
pd.merge(employees, depts, on='dept_id', how='inner')
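
To compare the four join types side by side, the sketch below rebuilds the same two tables, counts the rows each join returns, and uses the `indicator=True` option of `pd.merge` to label where each row of the outer join came from. The variable names `row_counts` and `tagged` are just illustrative.

```python
import numpy as np
import pandas as pd

depts = pd.DataFrame({'dept_id': [10, 20, 30, 40],
                      'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR']})
employees = pd.DataFrame({'emp_id': [1, 2, 3, 4, 5],
                          'emp_name': ['Alice', 'Bob', 'Carol', 'Dan', 'Eve'],
                          'dept_id': [10, 20, 20, 50, np.nan]})

# Number of rows returned by each join type
row_counts = {how: len(pd.merge(employees, depts, on='dept_id', how=how))
              for how in ['left', 'right', 'outer', 'inner']}
print(row_counts)

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
tagged = pd.merge(employees, depts, on='dept_id', how='outer', indicator=True)
print(tagged[['emp_name', 'dept_name', '_merge']])
```

With these tables the left and right joins both return 5 rows, the outer join 7, and the inner join 3. The `_merge` column makes it easy to find the unmatched rows.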

### Note on column names during joins
Sometimes, the column used for merging has a different name in each table. Let's assume that in the employees dataframe, the department column is named differently.

In [None]:
display(depts)

employees = pd.DataFrame({
    'emp_id' : [1,2,3,4,5],
    'emp_name' : ['Alice','Bob','Carol','Dan','Eve'],
    'department' : [10,20,20,50,np.nan]
})

display(employees)

We still want to join on the department. However, we have to name the key column of each table explicitly, since the two columns do not share a name.

In [None]:
pd.merge(employees, depts, left_on='department', right_on='dept_id', how='inner')
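
One side effect of `left_on`/`right_on` is that both key columns appear in the result. The sketch below rebuilds the two tables, shows the resulting columns, and drops the redundant key. The names `merged` and `tidy` are illustrative, and whether to drop the column is a matter of preference.

```python
import numpy as np
import pandas as pd

depts = pd.DataFrame({'dept_id': [10, 20, 30, 40],
                      'dept_name': ['Engineering', 'Sales', 'Marketing', 'HR']})
employees = pd.DataFrame({'emp_id': [1, 2, 3, 4, 5],
                          'emp_name': ['Alice', 'Bob', 'Carol', 'Dan', 'Eve'],
                          'department': [10, 20, 20, 50, np.nan]})

merged = pd.merge(employees, depts,
                  left_on='department', right_on='dept_id', how='left')
# Both 'department' and 'dept_id' are present after the merge
print(merged.columns.tolist())

# Drop the duplicate key column once the join is done
tidy = merged.drop(columns='dept_id')
print(tidy)
```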

-----

## Reshaping

Reshaping a dataframe reorganizes how its variables are arranged without altering the underlying values. This presents the data in a format better suited to visualization and summarization.

Most data is recorded in long format. With __long-format data__, variables are stacked into a single column with corresponding value and identifier columns.

(With our gradebook dataset, the category is listed in the `assignment_id` column with the value in the `grade` column.)

In [None]:
import pandas as pd
grades = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Grades')
grades.head()

For visualization, we often change long-format data into wide format. With __wide-format data__, each variable has its own column and observations are spread across many columns.

(With our gradebook dataset, each category has its own column and each element represents the value for the particular identifier and category.)

Wide format is more useful for data analysis and preparation. However, we often prefer to store data in long format because it allows for more flexibility with how the data is used later. Let's learn how to move between long and wide formats.

### Aggregation

In order to reshape the data using a Groupby, we are going to need a method to summarize the values in a dataset. For instance, if we have a table with the student names for rows and homework, project, and exam grades as columns, the values in the table have to summarize and represent all the homework, project, and exam entries for that student. These summary values are found using __aggregate functions__.

Common aggregate functions:
```python
agg('max')
agg('min')
agg('mean') # Default aggfunc in pd.pivot_table
agg('std')
agg('count')
```

We can even create our own aggregate function:
```python
# Avoid the name `range`, which would shadow Python's built-in function
def value_range(x):
    return x.max() - x.min()

agg(value_range)
```
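
As a runnable illustration, the sketch below applies a custom aggregate function alongside built-in ones on a hypothetical `gpas` Series (the values are made up). The function name `value_range` is an illustrative choice.

```python
import pandas as pd

# Hypothetical GPA values
gpas = pd.Series([3.41, 2.77, 3.87, 3.10])

def value_range(x):
    """Distance between the largest and smallest value."""
    return x.max() - x.min()

# Strings name built-in aggregates; a function object is applied as-is
summary = gpas.agg(['min', 'max', value_range])
print(summary)
```

The result is labeled with the function's name, so a descriptive name pays off in the output.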

In [None]:
# Using one aggregate function
print(students['gpa'].mean())

# Using several aggregate functions at once
print(students['gpa'].aggregate(['mean', 'min', 'median', 'max', 'std']))

We'll see these aggregate functions in use in the following topics.


### Pivot Tables

We start with a table in long format. 
* Choose one variable to be the row in our table
* Choose another variable to be the column in our table
* Choose a third variable to be the value in the table

If there is more than one value to go into the table, what do we do? This is where we use our aggregate function.

In [None]:
grade_table = pd.pivot_table(grades, index='student_id', columns='assignment_id', values='grade', aggfunc='max')
grade_table
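
To make the role of the aggregate function concrete, the sketch below pivots a hypothetical five-row long table (`long_scores`, made-up values) in which student 1 has two `HW1` attempts. With `aggfunc='max'` the cell holds the best attempt. With the default (`'mean'`) it holds the average.

```python
import pandas as pd

# Hypothetical long-format grades; student 1 has two HW1 entries
long_scores = pd.DataFrame({
    'student_id': [1, 1, 1, 2, 2],
    'assignment_id': ['HW1', 'HW1', 'EX1', 'HW1', 'EX1'],
    'grade': [70, 90, 85, 60, 95],
})

# Keep the highest grade when a cell has more than one value
best = pd.pivot_table(long_scores, index='student_id',
                      columns='assignment_id', values='grade', aggfunc='max')
print(best)

# Omitting aggfunc falls back to the mean
average = pd.pivot_table(long_scores, index='student_id',
                         columns='assignment_id', values='grade')
print(average)
```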

### Melting

Melting is the opposite of pivoting. The row labels of the table become one variable, the column labels become a second variable, and the cell values become a third, which returns the table to long format.

In [None]:
grade_table.melt(ignore_index=False)
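
When the identifier is an ordinary column rather than the index, `melt` can be told which column to keep fixed and what to call the new columns. The sketch below uses a hypothetical two-student `wide` table. The `id_vars`, `var_name`, and `value_name` arguments are standard `melt` parameters.

```python
import pandas as pd

# Hypothetical wide-format grade table
wide = pd.DataFrame({'student_id': [1, 2],
                     'EX1': [85, 95],
                     'HW1': [90, 60]})

# id_vars stays as-is; every other column is stacked into two new columns
long_again = wide.melt(id_vars='student_id',
                       var_name='assignment_id', value_name='grade')
print(long_again)
```

Each wide row produces one long row per assignment column, so two students and two assignments give four rows.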

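### Groupbys

A __groupby__ splits the rows into groups that share a value in one or more columns, applies an aggregate function to each group, and combines the results into a new table.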

In [None]:
students = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Students')
students.head()

In [None]:
students.groupby('class_year')['gpa'].mean()

In [None]:
students.groupby(['class_year','major'])['gpa'].mean()

In [None]:
students.groupby(['major','class_year'])['gpa'].mean()
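
The groupby calls above can be extended in two common ways, sketched below on a hypothetical four-row `people` frame (made-up values): passing several aggregate functions at once, and calling `unstack()` to turn a two-level result into a wide table.

```python
import pandas as pd

# Hypothetical miniature student table
people = pd.DataFrame({
    'class_year': ['Senior', 'Senior', 'Freshman', 'Freshman'],
    'major': ['Physics', 'Biology', 'Physics', 'Physics'],
    'gpa': [3.0, 3.5, 2.0, 3.0],
})

# Several aggregate functions give one column per function
by_year = people.groupby('class_year')['gpa'].agg(['mean', 'count'])
print(by_year)

# unstack() moves the inner grouping level ('major') into the columns
by_both = people.groupby(['class_year', 'major'])['gpa'].mean().unstack()
print(by_both)
```

Combinations that never occur in the data, such as a Freshman Biology major here, appear as `NaN` in the unstacked table.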