# **Data Science with Python: Usage of get_dummies to create Dummy Variables with Pandas**
The dummy variable trap is a scenario in which two or more attributes are highly correlated, so that one variable predicts the value of the others. When categorical data is encoded with dummy variables, any one dummy variable can be predicted from the remaining ones.

For example, Gender has two values, male and female, which can be encoded as 1/0 or 0/1. Including both dummy variables causes redundancy: if a person is not male, the person is female, so we don't need both variables in the model.
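
The redundancy described above can be checked directly. The following is a minimal sketch on a hypothetical four-row `gender` column (not the salary data used later); it assumes only that pandas is installed and shows that one dummy column is fully determined by the other.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Full dummy encoding: one 0/1 column per category
both = pd.get_dummies(toy["gender"], dtype=int)

# The two columns always sum to 1 in every row ...
row_sums = both.sum(axis=1)

# ... so the female column can be rebuilt from the male column alone
female_from_male = 1 - both["male"]
redundant = bool((female_from_male == both["female"]).all())

print(both)
print("female column is redundant:", redundant)
```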






In [2]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



---


# **Information About Dataset**
The dataset used is the salary data, consisting of observations on six variables for 52 tenure-track professors in a small college. 

[Download Dataset](https://data.princeton.edu/wws509/datasets/salary.dat)


The variables are:

* sx = Sex

* rk = Rank

* yr = Number of years in current rank

* dg = Highest degree

* yd = Number of years since highest degree was earned

* sl = Academic year salary


---




### **Import Dataset**

In [3]:
url = "https://data.princeton.edu/wws509/datasets/salary.dat"
df=pd.read_csv(url, sep=r"\s+")   # whitespace-separated file; delim_whitespace is deprecated in recent pandas
df

Unnamed: 0,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696
5,male,full,16,doctorate,21,28516
6,female,full,0,masters,32,24900
7,male,full,16,doctorate,18,31909
8,male,full,13,masters,30,31850
9,male,full,13,masters,31,32850


---
From the dataset we can see that some columns are categorical in nature:

* sx which has two possible values: male/female

* rk which has three possible values: full/assistant/associate

* dg which has two possible values: masters/doctorate
---
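
Instead of reading the categorical columns off the printed table, they can also be found programmatically. This is a small sketch on a hypothetical four-row sample that mimics the structure of the salary data; `sample`, `categorical_cols` and `levels` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the salary data
sample = pd.DataFrame({
    "sx": ["male", "female", "male", "female"],
    "rk": ["full", "associate", "assistant", "full"],
    "yr": [25, 7, 1, 10],
    "dg": ["doctorate", "masters", "doctorate", "doctorate"],
})

# Non-numeric columns are the candidates for dummy encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Distinct values (levels) of each categorical column
levels = {col: sorted(sample[col].unique()) for col in categorical_cols}

print(categorical_cols)
print(levels)
```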

# **Exploratory Data Analysis (EDA)**

In [4]:
# To display first five rows
df.head()

Unnamed: 0,sx,rk,yr,dg,yd,sl
0,male,full,25,doctorate,35,36350
1,male,full,13,doctorate,22,35350
2,male,full,10,doctorate,23,28200
3,female,full,7,doctorate,27,26775
4,male,full,19,masters,30,33696


In [5]:
# To display last five rows
df.tail()

Unnamed: 0,sx,rk,yr,dg,yd,sl
47,female,assistant,2,doctorate,2,15350
48,male,assistant,1,doctorate,1,16244
49,female,assistant,1,doctorate,1,16686
50,female,assistant,1,doctorate,1,15000
51,female,assistant,0,doctorate,2,20300


In [6]:
# Determining number of rows and columns
df.shape

(52, 6)

In [7]:
# Checking whether any null values are present
df.isnull().sum()

sx    0
rk    0
yr    0
dg    0
yd    0
sl    0
dtype: int64

In [8]:
# Getting brief information about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   sx      52 non-null     object
 1   rk      52 non-null     object
 2   yr      52 non-null     int64 
 3   dg      52 non-null     object
 4   yd      52 non-null     int64 
 5   sl      52 non-null     int64 
dtypes: int64(3), object(3)
memory usage: 2.6+ KB


In [9]:
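# Checking the correlation between the numeric variables
# (numeric_only=True skips the text columns sx, rk, dg; required in pandas 2.0+)
df.corr(numeric_only=True)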

Unnamed: 0,yr,yd,sl
yr,1.0,0.638776,0.700669
yd,0.638776,1.0,0.674854
sl,0.700669,0.674854,1.0


In [10]:
# Data Description
#describe method gives us the info such as
#count, mean, std, min, max etc
#these details help to give brief idea about the dataset that we are processing 
df.describe()

Unnamed: 0,yr,yd,sl
count,52.0,52.0,52.0
mean,7.480769,16.115385,23797.653846
std,5.507536,10.22234,5917.289154
min,0.0,1.0,15000.0
25%,3.0,6.75,18246.75
50%,7.0,15.5,23719.0
75%,11.0,23.25,27258.5
max,25.0,35.0,38045.0


# **Splitting of dataset for training & testing**

In [11]:
# using slicing
# for x(features): taking all rows and all columns except the last column
# for y(labels): taking all rows and only last column

x=df.iloc[:,:-1]  #features
y=df['sl']        #labels

In [63]:
x.head()

Unnamed: 0,sx,rk,yr,dg,yd
0,male,full,25,doctorate,35
1,male,full,13,doctorate,22
2,male,full,10,doctorate,23
3,female,full,7,doctorate,27
4,male,full,19,masters,30


---
* The goal is to build a linear regression model on the training portion of the salary data and predict a professor's salary from multiple factors on the test portion.

* sx, rk and dg are categorical independent variables: each takes one of a limited number of possible values.

* These values need to be converted into numerical form so that the machine learning algorithm can accept them as input.
---

## **Creating Dummy Variables**

In [12]:
gender=pd.get_dummies(x['sx'], dtype=int)   # dtype=int keeps 0/1 output; pandas 2.0+ defaults to True/False
gender.head()

Unnamed: 0,female,male
0,0,1
1,0,1
2,0,1
3,1,0
4,0,1


---
* Notice that every row above has exactly one "1": each professor is either male or female, so exactly one of the two dummy columns is set in each row.

* This makes the columns redundant. Whenever the male column is 0, the female column must be 1, and vice versa.

* drop_first=True exploits this: it drops the dummy column for the first category (here, female), which is then identified by all remaining dummy columns being 0.

* So we only need to use one of these two dummy-coded variables as a predictor.
---

In [13]:
# Convert the sx column into dummy columns, dropping the first category (female)
gender=pd.get_dummies(x['sx'],drop_first=True, dtype=int)
gender.head()

Unnamed: 0,male
0,1
1,1
2,1
3,0
4,1


Notice that the "female" column has been removed: the single "male" column now encodes both values (1 for male, 0 for female).

In [14]:
# Convert the rk column into dummy columns, dropping the first category (assistant)
rank=pd.get_dummies(x['rk'], drop_first=True, dtype=int)
rank.head()

Unnamed: 0,associate,full
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
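
With three categories the same idea applies: the dropped level becomes the baseline and is identified by all remaining dummy columns being 0. The sketch below uses a hypothetical four-value `ranks` series rather than the real data, and recovers the dropped "assistant" level from the two remaining columns.

```python
import pandas as pd

# Hypothetical rank values, used only for illustration
ranks = pd.Series(["full", "associate", "assistant", "full"], name="rk")

# 'assistant' sorts first alphabetically, so drop_first removes its column
dummies = pd.get_dummies(ranks, drop_first=True, dtype=int)

# A row with zeros in every remaining column belongs to the dropped level
is_assistant = dummies.sum(axis=1) == 0

print(dummies.assign(assistant=is_assistant.astype(int)))
```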


In [15]:
# Convert the dg column into dummy columns, dropping the first category (doctorate)
degree=pd.get_dummies(x['dg'],drop_first=True, dtype=int)
degree.head()

Unnamed: 0,masters
0,0
1,0
2,0
3,0
4,1


In [16]:
# Drop the sx, rk, dg columns
x=x.drop(['sx', 'rk', 'dg'], axis='columns')
x.head()

Unnamed: 0,yr,yd
0,25,35
1,13,22
2,10,23
3,7,27
4,19,30


## **Concatenation Of Dummy Variables With Original Dataset**

In [17]:
# concat the dummy variables
x=pd.concat([x,gender,rank,degree],axis='columns')
x.head()

Unnamed: 0,yr,yd,male,associate,full,masters
0,25,35,1,0,1,0
1,13,22,1,0,1,0
2,10,23,1,0,1,0
3,7,27,0,0,1,0
4,19,30,1,0,1,1
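
The separate encode, drop and concat steps above can also be done in one call by passing a whole DataFrame to `get_dummies` with its `columns` parameter. This is an alternative sketch on a hypothetical three-row frame, not the approach the notebook takes; note that the resulting dummy columns are prefixed with the source column name (for example `sx_male` rather than `male`).

```python
import pandas as pd

# Hypothetical features with two categorical columns and one numeric column
features = pd.DataFrame({
    "sx": ["male", "female", "male"],
    "rk": ["full", "associate", "assistant"],
    "yr": [25, 7, 1],
})

# Encode the listed columns, drop the first level of each, keep the rest as-is
encoded = pd.get_dummies(features, columns=["sx", "rk"], drop_first=True, dtype=int)

print(encoded)
```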


In [18]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest = train_test_split(x,y, test_size=0.33, random_state=3)
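
To see how many rows end up in each part, the split can be reproduced on placeholder arrays with the same number of rows as the salary data (52). The arrays `X_demo` and `y_demo` below are stand-ins, not the real features; with `test_size=0.33` scikit-learn rounds the test set up, giving 18 test rows and 34 training rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 52 rows, matching the size of the salary dataset
X_demo = np.arange(52 * 2).reshape(52, 2)
y_demo = np.arange(52)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=3
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
```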

In [19]:
# imported linear regression model from sklearn 
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression

model = LinearRegression()      # y = b0 + b1*x1 + b2*x2 + ... + bn*xn
model.fit(xtrain,ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
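
After fitting, the learned intercept and per-feature coefficients are available as `intercept_` and `coef_`. The sketch below illustrates this on hypothetical data generated from a known equation, so the recovered values can be compared against it; on the salary model the same attributes would give the estimated effect of each feature, including the dummy columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from y = 3 + 2*x1 + 5*x2, where x2 is a 0/1 dummy
X_toy = np.array([[1, 0], [2, 1], [3, 0], [4, 1], [5, 0]], dtype=float)
y_toy = 3 + 2 * X_toy[:, 0] + 5 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

# intercept_ estimates b0, coef_ holds one coefficient per feature column
print("intercept:", toy_model.intercept_)
print("coefficients:", toy_model.coef_)
```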

In [20]:
# Predicting the Test set results
y_pred=model.predict(xtest)
y_pred

array([28372.97880766, 17077.07543027, 29641.53384585, 17609.93706944,
       17613.14847174, 18841.4230301 , 23257.76269529, 17609.93706944,
       17952.08508556, 23501.89192743, 17613.14847174, 27502.66323475,
       20437.07998185, 26120.93604103, 33051.89712003, 26034.76161643,
       29935.49589748, 27688.68277544])

In [24]:
#Actual values 
tested=ytest.values
tested

array([28200, 18075, 32850, 16686, 17600, 18304, 23712, 15000, 15350,
       24900, 17095, 38045, 20690, 22906, 33696, 25400, 31114, 29342])

In [25]:
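# Creating a dataframe of actual and predicted values
# (passing y_pred as the second positional argument would use it as the index)
testmodel=pd.DataFrame({'actual': tested, 'predicted': y_pred})
testmodel.head()

Unnamed: 0,actual,predicted
0,28200,28372.978808
1,18075,17077.07543
2,32850,29641.533846
3,16686,17609.937069
4,17600,17613.148472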


In [26]:
# Evaluating the model with the R² score (coefficient of determination)
from sklearn.metrics import r2_score
s=r2_score(ytest,y_pred)
s

0.8169202957187006
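
The R² score compares the model's squared errors with those of a baseline that always predicts the mean: R² = 1 - SS_res / SS_tot. As a sketch, the calculation can be repeated by hand with NumPy on the actual and predicted arrays printed above; the result should agree with the `r2_score` value reported by scikit-learn. The RMSE line is an addition for illustration and expresses the typical error in salary units.

```python
import numpy as np

# Actual and predicted test-set salaries, copied from the outputs above
actual = np.array([28200, 18075, 32850, 16686, 17600, 18304, 23712, 15000, 15350,
                   24900, 17095, 38045, 20690, 22906, 33696, 25400, 31114, 29342])
predicted = np.array([28372.97880766, 17077.07543027, 29641.53384585, 17609.93706944,
                      17613.14847174, 18841.4230301, 23257.76269529, 17609.93706944,
                      17952.08508556, 23501.89192743, 17613.14847174, 27502.66323475,
                      20437.07998185, 26120.93604103, 33051.89712003, 26034.76161643,
                      29935.49589748, 27688.68277544])

# Residual sum of squares: error of the model
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares: error of always predicting the mean salary
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
rmse = np.sqrt(ss_res / len(actual))

print("R^2:", r2_manual)
print("RMSE:", rmse)
```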