### Python Episode 22 - Machine Learning Secrets

Hi!

Let's learn how to prepare the dataset as AI professionals do ;)

Our focus: pandas.get_dummies()

Tutorial: Jungletronics

Pandas — One Hot Encoding (OHE)
Pandas Dataframe Examples: AI Secrets— #PySeries#Episode 22

https://medium.com/jungletronics/pandas-one-hot-encoding-ohe-eb7467dc92e8

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# One Hot Encoding - What is it?

Because many machine learning models need their input variables to be numeric, categorical variables need to be transformed in the pre-processing part. (Wikipedia) https://en.wikipedia.org/wiki/One-hot
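
As a minimal sketch of the idea, here is a made-up `color` column (not part of the salary data used below). Each category becomes its own 0/1 column, and every row has exactly one 1:

```python
import pandas as pd

# Hypothetical toy column, used only to illustrate the idea
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)

# Each row is "hot" in exactly one column
print(one_hot.sum(axis=1).tolist())
```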

## Discrimination in Salaries
These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

sx = Sex, coded 1 for female and 0 for male

rk = Rank, coded 1 for assistant professor, 2 for associate professor, and 3 for full professor

yr = Number of years in current rank

dg = Highest degree, coded 1 if doctorate, 0 if masters

yd = Number of years since highest degree was earned

sl = Academic year salary, in dollars.
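
When the data carries text labels instead of numbers (as `salary.dat` does), the coding above can be applied with a dictionary lookup. This is only a sketch: the sample rows copy the first five records shown later in this notebook, and the `*_code` column names are made up for illustration.

```python
import pandas as pd

# First five records in the text-label format (typed in by hand)
sample = pd.DataFrame({
    "sx": ["male", "male", "male", "female", "male"],
    "rk": ["full", "full", "full", "full", "full"],
    "dg": ["doctorate", "doctorate", "doctorate", "doctorate", "masters"],
})

# Numeric codes as described in the variable list above
sample["sx_code"] = sample["sx"].map({"female": 1, "male": 0})
sample["rk_code"] = sample["rk"].map({"assistant": 1, "associate": 2, "full": 3})
sample["dg_code"] = sample["dg"].map({"doctorate": 1, "masters": 0})
print(sample)
```

Note that the rank codes 1, 2, 3 put the categories on a numeric scale, which is one reason to prefer dummy columns for a variable like `rk`.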

The file is available in the usual plain text formats as salary.dat using character codes and salary.raw using numeric codes, and in Stata format as salary.dta. Here's an excerpt of the 'dat' file: [salary.dat](https://data.princeton.edu/wws509/datasets/#salary)

In [2]:
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table("https://data.princeton.edu/wws509/datasets/salary.dat", sep=r"\s+")
df.head()

Unnamed: 0,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


In [3]:
# What are the options we have?
df['sx'].unique()

array(['male', 'female'], dtype=object)

In [4]:
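# Turn the 'sx' column into a dummy (0/1) variable
# drop_first=True drops the 'female' column; no information is lost, since female = 1 - male
# dtype=int keeps the 0/1 output (pandas 2.x returns True/False by default)
dummy1 = pd.get_dummies(df['sx'], drop_first=True, dtype=int)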
# Take a look
dummy1.head()

Unnamed: 0,male
0,1
1,1
2,1
3,0
4,1
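
To check the claim that dropping the first column loses nothing, here is a small sketch on the first five `sx` values from the data above. The names `full`, `reduced`, and `rebuilt_female` are made up for this illustration.

```python
import pandas as pd

# First five 'sx' values from the df.head() output
sx = pd.Series(["male", "male", "male", "female", "male"], name="sx")

full = pd.get_dummies(sx, dtype=int)                      # columns: female, male
reduced = pd.get_dummies(sx, drop_first=True, dtype=int)  # column: male only

# The dropped 'female' column can be rebuilt from 'male'
rebuilt_female = 1 - reduced["male"]
print((rebuilt_female == full["female"]).all())
```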


In [5]:
# Now let's concatenate everything and drop the original 'sx' column
df = pd.concat([df, dummy1], axis=1).drop('sx', axis=1)
df.head()

Unnamed: 0,rk,yr,dg,yd,sl,male
0,full,25,doctorate,35,36350,1
1,full,13,doctorate,22,35350,1
2,full,10,doctorate,23,28200,1
3,full,7,doctorate,27,26775,0
4,full,19,masters,30,33696,1


In [6]:
# Rearranging column order (optional)
cols = df.columns.tolist()
# Indexing & Slicing Techniques
cols = cols[-1:] + cols[:-1]
cols

['male', 'rk', 'yr', 'dg', 'yd', 'sl']

In [7]:
df = df[cols]
df.head()

Unnamed: 0,male,rk,yr,dg,yd,sl
0,1,full,25,doctorate,35,36350
1,1,full,13,doctorate,22,35350
2,1,full,10,doctorate,23,28200
3,0,full,7,doctorate,27,26775
4,1,full,19,masters,30,33696
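
The same reordering can also be done in place with `pop()` and `insert()`. A minimal sketch on a cut-down frame (the `demo` name and its reduced column set are just for illustration):

```python
import pandas as pd

# Two rows of the frame as it looks before reordering (some columns omitted)
demo = pd.DataFrame({
    "rk": ["full", "full"],
    "yr": [25, 13],
    "sl": [36350, 35350],
    "male": [1, 1],
})

# pop() removes the column and returns it; insert() puts it back at position 0
demo.insert(0, "male", demo.pop("male"))
print(demo.columns.tolist())
```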


In [8]:
# Now let's deal with rank (rk):
df['rk'].unique()

array(['full', 'associate', 'assistant'], dtype=object)

In [9]:
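# Turn the 'rk' column into dummy variables, one column per rank
# dtype=int keeps the 0/1 output (pandas 2.x returns True/False by default)
dummy2 = pd.get_dummies(df['rk'], dtype=int)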
# Take a look
dummy2.head()

Unnamed: 0,assistant,associate,full
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
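
One thing to watch: `get_dummies` only creates columns for categories it actually sees. The sketch below uses the first five `rk` values (all `full`) and shows one possible way to keep all three columns by declaring the categories up front. The `prefix="rk"` naming is an optional choice, not something the notebook uses.

```python
import pandas as pd

# First five 'rk' values from the data above; all happen to be 'full'
rk_head = pd.Series(["full"] * 5, name="rk")

# Without help, get_dummies only sees the categories present in the data
print(pd.get_dummies(rk_head, dtype=int).columns.tolist())

# Declaring all three ranks keeps the column set stable on any subset
ranks = ["assistant", "associate", "full"]
rk_cat = pd.Series(pd.Categorical(rk_head, categories=ranks), name="rk")
dummies = pd.get_dummies(rk_cat, prefix="rk", dtype=int)
print(dummies.columns.tolist())
```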


In [10]:
df = pd.concat([df, dummy2], axis=1).drop('rk', axis=1)
df.head()

Unnamed: 0,male,yr,dg,yd,sl,assistant,associate,full
0,1,25,doctorate,35,36350,0,0,1
1,1,13,doctorate,22,35350,0,0,1
2,1,10,doctorate,23,28200,0,0,1
3,0,7,doctorate,27,26775,0,0,1
4,1,19,masters,30,33696,0,0,1


In [11]:
# Now the last one (dg = Highest degree, coded 1 if doctorate, 0 if masters):
df['dg'].unique()

array(['doctorate', 'masters'], dtype=object)

In [12]:
# Turn the 'dg' column into dummy variables
# dtype=int keeps the 0/1 output (pandas 2.x returns True/False by default)
dummy3 = pd.get_dummies(df['dg'], dtype=int)
# Take a look
dummy3.head()

Unnamed: 0,doctorate,masters
0,1,0
1,1,0
2,1,0
3,1,0
4,0,1


In [13]:
# Let's simplify once more: 'masters' is redundant, since masters = 1 - doctorate
dummy3 = dummy3.drop('masters', axis=1)
dummy3.head()

Unnamed: 0,doctorate
0,1
1,1
2,1
3,1
4,0


In [14]:
# Concatenate one last time and drop 'dg'; after this we can move on to machine learning
df = pd.concat([df, dummy3], axis=1).drop('dg', axis=1)
df.head()

Unnamed: 0,male,yr,yd,sl,assistant,associate,full,doctorate
0,1,25,35,36350,0,0,1,1
1,1,13,22,35350,0,0,1,1
2,1,10,23,28200,0,0,1,1
3,0,7,27,26775,0,0,1,1
4,1,19,30,33696,0,0,1,0
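
For comparison, `get_dummies` can also encode several columns of a DataFrame in one call through its `columns` parameter. This is a sketch on the first five rows of the original data, not a replacement for the step-by-step version above; the `raw` and `encoded` names are made up here.

```python
import pandas as pd

# First five rows of the original data, as shown in the first df.head() output
raw = pd.DataFrame({
    "sx": ["male", "male", "male", "female", "male"],
    "rk": ["full", "full", "full", "full", "full"],
    "yr": [25, 13, 10, 7, 19],
    "dg": ["doctorate", "doctorate", "doctorate", "doctorate", "masters"],
    "yd": [35, 22, 23, 27, 30],
    "sl": [36350, 35350, 28200, 26775, 33696],
})

# Encode all three categorical columns at once; new names get the column as prefix
encoded = pd.get_dummies(raw, columns=["sx", "rk", "dg"], dtype=int)
print(encoded.columns.tolist())
```

Unlike the step-by-step version, this keeps both `sx_female` and `sx_male` (and both `dg_` columns). Passing `drop_first=True` would drop the first level of every encoded column, `rk` included, so it is worth checking the result before relying on it.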


In [15]:
print('There you have it! Now the dataset is ready for an AI algorithm :)')

There you have it! Now the dataset is ready for an AI algorithm :)


In [16]:
# Look this post too: https://medium.com/jungletronics/numpy-init-python-review-f5362abbaaf9#f32b
# Or this one: https://medium.com/jungletronics/numpy-jupyter-notebook-1182f78ab4e1
print("That's all for this lecture! See you in the next Python Episode! Bye!!!! ")

That's all for this lecture! See you in the next Python Episode! Bye!!!! 
