# Lecture 11 Data Wrangling - Part 1 - Categorical Variables
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)
* Chapter 8

Class notes are available on GitHub, and changes are uploaded there automatically as they are made. A link to the repository is on Canvas.

-----
## Outcomes
### Course

### Data Analysis Certification
* Databases: Data joins

-----
## Outline
* Mapping
* One-hot Encoding / Dummy Variables
* Ordinal Encoding

In [3]:
# Load the three sheets of the gradebook workbook (homework, exams, and projects)
import pandas as pd
students = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Students')
assignments = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Assignments')
grades = pd.read_excel('../Datasets/Gradebook.xlsx', sheet_name='Grades')


In [4]:
print(students.shape)
students.head()

(50, 10)


Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed
0,1001,Sarah,Jensen,Sophomore,2.87,History,Yes,Yes,On-Campus,79
1,1002,Marcus,Lee,Junior,3.63,Physics,No,Yes,Off-Campus,83
2,1003,Emily,Torres,Junior,3.27,Biology,No,Yes,Off-Campus,51
3,1004,Daniel,Wright,Junior,2.85,English,Yes,No,On-Campus,58
4,1005,Olivia,Patel,Sophomore,3.21,Business,No,Yes,Off-Campus,26


In [5]:
print(assignments.shape)
assignments.head()

(23, 3)


Unnamed: 0,assignment_id,assignment_name,category
0,A1,Homework 1,Homework
1,A2,Homework 2,Homework
2,A3,Homework 3,Homework
3,A4,Homework 4,Homework
4,A5,Homework 5,Homework


In [6]:
print(grades.shape)
grades.head()

(1150, 3)


Unnamed: 0,student_id,assignment_id,grade
0,1001,A1,100
1,1035,A1,100
2,1002,A1,99
3,1030,A1,98
4,1021,A1,97


## Concatenate data

Let's say a few new students enroll. We need to add their data to the `students` dataframe as new rows.

In [17]:
new_students = pd.DataFrame({
    'student_id' : [1101, 1102, 1103],
    'first_name' : ['John', 'Jill', 'Jim'],
    'last_name' : ['Anderson', 'Benson', 'Kent'],
    'class_year' : ['Senior', 'Freshman', 'Sophomore'],
    'gpa' : [3.41, 2.77, 3.87],
    'major' : ['Physics', 'Biology', 'English'],
    'first_time_student' : ['Yes', 'Yes', 'Yes'],
    'financial_aid' : ['Yes', 'Yes', 'Yes'],
    'housing_status' : ['On-Campus', 'On-Campus', 'Off-Campus'],
    'credits_completed' : [82, 7, 16]
})

new_students

Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed
0,1101,John,Anderson,Senior,3.41,Physics,Yes,Yes,On-Campus,82
1,1102,Jill,Benson,Freshman,2.77,Biology,Yes,Yes,On-Campus,7
2,1103,Jim,Kent,Sophomore,3.87,English,Yes,Yes,Off-Campus,16


In [18]:
pd.concat([students, new_students], axis=0)

Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed
0,1001,Sarah,Jensen,Sophomore,2.87,History,Yes,Yes,On-Campus,79
1,1002,Marcus,Lee,Junior,3.63,Physics,No,Yes,Off-Campus,83
2,1003,Emily,Torres,Junior,3.27,Biology,No,Yes,Off-Campus,51
3,1004,Daniel,Wright,Junior,2.85,English,Yes,No,On-Campus,58
4,1005,Olivia,Patel,Sophomore,3.21,Business,No,Yes,Off-Campus,26
5,1006,Jason,Kim,Freshman,3.11,History,No,Yes,Off-Campus,66
6,1007,Hannah,Brooks,Senior,3.77,Engineering,Yes,No,Off-Campus,78
7,1008,Tyler,Nguyen,Senior,3.45,Chemistry,No,Yes,Off-Campus,20
8,1009,Chloe,Ramirez,Freshman,2.7,Biology,No,No,On-Campus,81
9,1010,Aiden,Foster,Senior,2.72,Biology,Yes,No,Off-Campus,54
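
Two details of the cell above are easy to miss: `pd.concat` returns a new dataframe, so `students` itself is unchanged unless the result is assigned, and each input keeps its own row labels, so labels 0, 1, 2 appear twice in the stacked result. A minimal sketch of both points, using small stand-in frames (the `_demo` names and their values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for the lecture's `students` and `new_students` frames
students_demo = pd.DataFrame({
    'student_id': [1001, 1002],
    'first_name': ['Sarah', 'Marcus'],
})
new_students_demo = pd.DataFrame({
    'student_id': [1101, 1102],
    'first_name': ['John', 'Jill'],
})

# Default behavior: each frame keeps its own row labels, so 0 and 1 appear twice
stacked = pd.concat([students_demo, new_students_demo], axis=0)
print(stacked.index.tolist())    # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows 0..n-1
combined = pd.concat([students_demo, new_students_demo], axis=0, ignore_index=True)
print(combined.index.tolist())   # [0, 1, 2, 3]
```

Assigning the result to a name (here `combined`) is what keeps the added rows available for later cells.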


Now, let's say that in the `assignments` dataframe, we want to record which segment of the course each assignment belongs to. Let's say the course is broken into 4 segments. We can add this as a new column in two ways:

In [23]:
segments = pd.Series([1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,1,1,2,3,3,4,2,4])
pd.concat([assignments, segments], axis=1)

Unnamed: 0,assignment_id,assignment_name,category,0
0,A1,Homework 1,Homework,1
1,A2,Homework 2,Homework,1
2,A3,Homework 3,Homework,1
3,A4,Homework 4,Homework,2
4,A5,Homework 5,Homework,2
5,A6,Homework 6,Homework,2
6,A7,Homework 7,Homework,2
7,A8,Homework 8,Homework,3
8,A9,Homework 9,Homework,3
9,A10,Homework 10,Homework,3


In [24]:
assignments['segment'] = segments
assignments

Unnamed: 0,assignment_id,assignment_name,category,segment
0,A1,Homework 1,Homework,1
1,A2,Homework 2,Homework,1
2,A3,Homework 3,Homework,1
3,A4,Homework 4,Homework,2
4,A5,Homework 5,Homework,2
5,A6,Homework 6,Homework,2
6,A7,Homework 7,Homework,2
7,A8,Homework 8,Homework,3
8,A9,Homework 9,Homework,3
9,A10,Homework 10,Homework,3
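
Both approaches match rows by index label rather than by position, and the concatenated column is labeled `0` because the Series has no name. The sketch below illustrates both points on a three-row stand-in table (the `_demo` and `shifted` names and their values are hypothetical):

```python
import pandas as pd

# Toy stand-in for the lecture's `assignments` frame
assignments_demo = pd.DataFrame({
    'assignment_id': ['A1', 'A2', 'A3'],
    'category': ['Homework', 'Homework', 'Homework'],
})

# Naming the Series gives the new column a readable label instead of 0
segments_demo = pd.Series([1, 1, 2], name='segment')
with_segment = pd.concat([assignments_demo, segments_demo], axis=1)
print(with_segment.columns.tolist())

# Rows are matched by index label, not by position: a Series whose labels
# are 1, 2, 3 leaves row 0 of the frame without a value (NaN)
shifted = pd.Series([1, 1, 2], index=[1, 2, 3], name='segment')
assignments_demo['segment'] = shifted
print(assignments_demo)
```

In the lecture's cells the Series and the dataframe both use the default 0 to 22 index, so the rows line up as intended.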


## Mapping

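Each row of the grades sheet has an `assignment_id` but no assignment type. Let's build a dictionary from `assignment_id` to `category` using the assignments sheet, then use `.map()` to look up the category for every row of the grades sheet.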

In [8]:
hw_dict = pd.Series(assignments['category'].values, index=assignments['assignment_id']).to_dict()
hw_dict

{'A1': 'Homework',
 'A2': 'Homework',
 'A3': 'Homework',
 'A4': 'Homework',
 'A5': 'Homework',
 'A6': 'Homework',
 'A7': 'Homework',
 'A8': 'Homework',
 'A9': 'Homework',
 'A10': 'Homework',
 'A11': 'Homework',
 'A12': 'Homework',
 'A13': 'Homework',
 'A14': 'Homework',
 'A15': 'Homework',
 'P1': 'Computer Project',
 'P2': 'Computer Project',
 'P3': 'Computer Project',
 'P4': 'Computer Project',
 'P5': 'Computer Project',
 'P6': 'Computer Project',
 'E1': 'Exam',
 'E2': 'Exam'}

In [9]:
grades['category'] = grades['assignment_id'].map(hw_dict)
grades.head()

Unnamed: 0,student_id,assignment_id,grade,category
0,1001,A1,100,Homework
1,1035,A1,100,Homework
2,1002,A1,99,Homework
3,1030,A1,98,Homework
4,1021,A1,97,Homework
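
One thing to watch with `.map()`: any value that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below uses small stand-in tables (the `_demo` names, the `lookup` name, and the `X9` id are made up for illustration) and also shows an equivalent way to build the dictionary with `set_index`:

```python
import pandas as pd

# Toy stand-ins for the lecture's `assignments` and `grades` frames
assignments_demo = pd.DataFrame({
    'assignment_id': ['A1', 'P1', 'E1'],
    'category': ['Homework', 'Computer Project', 'Exam'],
})
grades_demo = pd.DataFrame({
    'student_id': [1001, 1001, 1002, 1002],
    'assignment_id': ['A1', 'E1', 'P1', 'X9'],  # 'X9' is deliberately not in the lookup
    'grade': [100, 88, 93, 75],
})

# Equivalent lookup built with set_index instead of the Series constructor
lookup = assignments_demo.set_index('assignment_id')['category'].to_dict()

grades_demo['category'] = grades_demo['assignment_id'].map(lookup)

# IDs missing from the dictionary become NaN, so count them before moving on
n_unmatched = grades_demo['category'].isna().sum()
print(grades_demo)
print('unmatched rows:', n_unmatched)
```

A nonzero count of unmatched rows usually points to a typo in an id or an assignment missing from the lookup table.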


## One-hot Encoding

Most models require numerical inputs, so before sending data into a model we need to convert categorical variables into numerical variables that represent those categories. This process is known as __encoding__.

For nominal variables (categories with no natural order), we use __one-hot encoding__ (in Pandas, the `pd.get_dummies()` function does this). It creates one column for each category and gives that column a value of 1 if the observation is in that category and 0 if it is not.

Let's do this for the students' major in the `students` dataset.

In [27]:
students.head()

Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed
0,1001,Sarah,Jensen,Sophomore,2.87,History,Yes,Yes,On-Campus,79
1,1002,Marcus,Lee,Junior,3.63,Physics,No,Yes,Off-Campus,83
2,1003,Emily,Torres,Junior,3.27,Biology,No,Yes,Off-Campus,51
3,1004,Daniel,Wright,Junior,2.85,English,Yes,No,On-Campus,58
4,1005,Olivia,Patel,Sophomore,3.21,Business,No,Yes,Off-Campus,26


In [28]:
major_dummies = pd.get_dummies(students['major']).astype(int)
major_dummies.head()

Unnamed: 0,Biology,Business,Chemistry,Computer Science,Engineering,English,History,Mathematics,Physics,Psychology,Sociology,Statistics
0,0,0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,0,0


How do we add these columns to the original dataset? We concatenate the two dataframes side by side, along the columns (`axis=1`).

In [29]:
students = pd.concat([students, major_dummies], axis=1)
students.head(7)

Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed,...,Chemistry,Computer Science,Engineering,English,History,Mathematics,Physics,Psychology,Sociology,Statistics
0,1001,Sarah,Jensen,Sophomore,2.87,History,Yes,Yes,On-Campus,79,...,0,0,0,0,1,0,0,0,0,0
1,1002,Marcus,Lee,Junior,3.63,Physics,No,Yes,Off-Campus,83,...,0,0,0,0,0,0,1,0,0,0
2,1003,Emily,Torres,Junior,3.27,Biology,No,Yes,Off-Campus,51,...,0,0,0,0,0,0,0,0,0,0
3,1004,Daniel,Wright,Junior,2.85,English,Yes,No,On-Campus,58,...,0,0,0,1,0,0,0,0,0,0
4,1005,Olivia,Patel,Sophomore,3.21,Business,No,Yes,Off-Campus,26,...,0,0,0,0,0,0,0,0,0,0
5,1006,Jason,Kim,Freshman,3.11,History,No,Yes,Off-Campus,66,...,0,0,0,0,1,0,0,0,0,0
6,1007,Hannah,Brooks,Senior,3.77,Engineering,Yes,No,Off-Campus,78,...,0,0,1,0,0,0,0,0,0,0
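
As an alternative sketch, `pd.get_dummies` can also be given the whole dataframe together with a `columns` list, which encodes and attaches the new columns in one step. Note that this variant drops the original `major` column, unlike the concatenation above. The `students_demo` frame and its values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the lecture's `students` frame
students_demo = pd.DataFrame({
    'student_id': [1001, 1002, 1003],
    'major': ['History', 'Physics', 'Biology'],
})

# Encode `major` in place of the original column; the prefix labels each new
# column with its source, and dtype=int yields 0/1 instead of True/False
encoded = pd.get_dummies(students_demo, columns=['major'], prefix='major', dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The prefix is helpful once several categorical columns are encoded, since bare category names from different columns can collide.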


## Ordinal Encoding

For ordinal variables (categories with a natural order), we can represent each category with a number on a scale (for example, pain level is often reported "on a scale from 1 to 10"). We can do this with the same mapping technique, applied to each student's class year.

In [32]:
classes = {
    'Freshman' : 1,
    'Sophomore' : 2,
    'Junior' : 3,
    'Senior' : 4
}

students['class_num'] = students['class_year'].map(classes)
students.head(7)

Unnamed: 0,student_id,first_name,last_name,class_year,gpa,major,first_time_student,financial_aid,housing_status,credits_completed,...,Computer Science,Engineering,English,History,Mathematics,Physics,Psychology,Sociology,Statistics,class_num
0,1001,Sarah,Jensen,Sophomore,2.87,History,Yes,Yes,On-Campus,79,...,0,0,0,1,0,0,0,0,0,2
1,1002,Marcus,Lee,Junior,3.63,Physics,No,Yes,Off-Campus,83,...,0,0,0,0,0,1,0,0,0,3
2,1003,Emily,Torres,Junior,3.27,Biology,No,Yes,Off-Campus,51,...,0,0,0,0,0,0,0,0,0,3
3,1004,Daniel,Wright,Junior,2.85,English,Yes,No,On-Campus,58,...,0,0,1,0,0,0,0,0,0,3
4,1005,Olivia,Patel,Sophomore,3.21,Business,No,Yes,Off-Campus,26,...,0,0,0,0,0,0,0,0,0,2
5,1006,Jason,Kim,Freshman,3.11,History,No,Yes,Off-Campus,66,...,0,0,0,1,0,0,0,0,0,1
6,1007,Hannah,Brooks,Senior,3.77,Engineering,Yes,No,Off-Campus,78,...,0,1,0,0,0,0,0,0,0,4
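
Another way to get the same numbers, shown here as a sketch rather than the lecture's method, is an ordered categorical: `pd.Categorical` stores the category order, and its integer codes follow that order. The `class_year` Series below is a made-up stand-in for the `students['class_year']` column:

```python
import pandas as pd

# Toy stand-in for students['class_year']
class_year = pd.Series(['Sophomore', 'Junior', 'Freshman', 'Senior'])

# Declare the order explicitly; without it pandas would sort alphabetically
order = ['Freshman', 'Sophomore', 'Junior', 'Senior']
as_ordered = pd.Categorical(class_year, categories=order, ordered=True)

# Codes start at 0, so add 1 to match the lecture's 1-4 scale
class_num = pd.Series(as_ordered.codes, index=class_year.index) + 1
print(class_num.tolist())   # [2, 3, 1, 4]
```

The two approaches treat unexpected labels differently: `.map()` with a dictionary returns `NaN` for a value that is not a key, while a categorical assigns the code -1 to a value outside `categories` (which becomes 0 after adding 1 here).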


## Joins

## Pivot Tables

## Melting

## Groupbys