## Data Cleaning and Creating Dummies for Categorical Features

We will now convert the categorical variables of the train and test datasets to numerical values so that we can fit them into our model.

The strategy here is to clean the **Ordinal** and **Nominal** variables separately.  
Ordinal variables have a natural order, so creating dummies for them would discard that ranking and may distort the results; we instead map each level to an integer on a scale.  
Nominal variables, on the other hand, have no order, so dummies suit them: each dummy column simply indicates whether an attribute is present.

The [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt) will be used for this process.
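As a minimal sketch of the two treatments, the toy frame below (hypothetical rows, not read from the project files) encodes one nominal column with `pd.get_dummies` and one ordinal column with an integer scale. The names `toy` and `quality_scale` are assumptions made for illustration only.

```python
import pandas as pd

# Toy data (hypothetical, not taken from the Ames dataset files)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],   # nominal: no natural order
    "Exter Qual": ["Gd", "TA", "Ex"],     # ordinal: Ex > Gd > TA > Fa > Po
})

# Nominal: one indicator column per category, with the first level dropped
nominal = pd.get_dummies(toy[["Street"]], columns=["Street"], drop_first=True)

# Ordinal: a single integer column that preserves the ranking
quality_scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
ordinal = toy["Exter Qual"].map(quality_scale)

print(nominal.columns.tolist())
print(ordinal.tolist())
```

The nominal column becomes indicator columns with no implied ranking, while the ordinal column stays a single feature whose values can be compared.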

### Contents:
* [Import Libraries](#Import-Libraries)  
* [Cleaning Nominal Features](#Cleaning-Nominal-Features)
  * [Train Dataset: Import and read cleaned train dataset](#Train-Dataset:-Import-and-read-cleaned-train-dataset)
  * [Train dataset: Create Dummies for Nominal Features](#Train-dataset:-Create-Dummies-for-Nominal-Features)
  * [Test Dataset: Import and read cleaned test dataset](#Test-Dataset:-Import-and-read-cleaned-test-dataset)
  * [Test dataset: Create Dummies for Nominal Features](#Test-dataset:-Create-Dummies-for-Nominal-Features)
* [Cleaning Ordinal Features](#Cleaning-Ordinal-Features)
* [Export train and test data](#Export-train-and-test-data)
* [Summary](#Summary)

### Import Libraries

In [1]:
#import libraries
import pandas as pd
import numpy as np

### Cleaning Nominal Features

##### Train Dataset: Import and read cleaned train dataset

In [2]:
# read cleaned train data
file = '../datasets/train_clean.csv'

train = pd.read_csv(file, index_col="Id")
#review the first few rows of the dataframe
train.head()

Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
109,533352170,60,RL,69.0552,13517,Pave,NONE,IR1,Lvl,AllPub,...,0,0,NONE,NONE,NONE,0,3,2010,WD,130500
544,531379050,60,RL,43.0,11492,Pave,NONE,IR1,Lvl,AllPub,...,0,0,NONE,NONE,NONE,0,4,2009,WD,220000
153,535304180,20,RL,68.0,7922,Pave,NONE,Reg,Lvl,AllPub,...,0,0,NONE,NONE,NONE,0,1,2010,WD,109000
318,916386060,60,RL,73.0,9802,Pave,NONE,Reg,Lvl,AllPub,...,0,0,NONE,NONE,NONE,0,4,2010,WD,174000
255,906425045,50,RL,82.0,14235,Pave,NONE,IR1,Lvl,AllPub,...,0,0,NONE,NONE,NONE,0,3,2010,WD,138500


In [3]:
#shape of dataframe
train.shape

(2051, 80)

In [4]:
#list of columns
train.columns.to_list()

['PID',
 'MS SubClass',
 'MS Zoning',
 'Lot Frontage',
 'Lot Area',
 'Street',
 'Alley',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin SF 1',
 'BsmtFin Type 2',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'Kitchen Qual',
 'TotRms AbvGrd',
 'Functional',
 'Fireplaces',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Area',
 'G

##### Train dataset: Create Dummies for Nominal Features

In [5]:
#create a list of nominal features that we need to create dummies for
col_to_dummies = ['MS SubClass',
                  'MS Zoning',
                  'Street',
                  'Alley',
                  'Land Contour',
                  'Lot Config',
                  'Neighborhood',
                  'Condition 1',
                  'Condition 2',
                  'Bldg Type',
                  'House Style',
                  'Roof Style',
                  'Roof Matl',
                  'Exterior 1st',
                  'Exterior 2nd',
                  'Mas Vnr Type',
                  'Foundation',
                  'Heating',
                  'Central Air',
                  'Garage Type',
                  'Misc Feature',
                  'Sale Type',
                 ]

In [6]:
#use pd.get_dummies to get dummies for columns
train = pd.get_dummies(train, columns=col_to_dummies, drop_first=True)
train.head()

Id,PID,Lot Frontage,Lot Area,Lot Shape,Utilities,Land Slope,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
109,533352170,69.0552,13517,IR1,AllPub,Gtl,6,8,1976,2005,...,0,0,0,0,0,0,0,0,0,1
544,531379050,43.0,11492,IR1,AllPub,Gtl,7,5,1996,1997,...,0,0,0,0,0,0,0,0,0,1
153,535304180,68.0,7922,Reg,AllPub,Gtl,5,7,1953,2007,...,0,0,0,0,0,0,0,0,0,1
318,916386060,73.0,9802,Reg,AllPub,Gtl,5,5,2006,2007,...,0,0,0,0,0,0,0,0,0,1
255,906425045,82.0,14235,IR1,AllPub,Gtl,6,8,1900,1993,...,0,0,0,0,0,0,0,0,0,1


In [7]:
#shape of train dataset after adding dummies
train.shape

(2051, 212)

##### Test Dataset: Import and read cleaned test dataset

In [8]:
# read cleaned test data
test_file = '../datasets/test_clean.csv'

test = pd.read_csv(test_file, index_col="Id")
#review the first few rows of the dataframe
test.head()

Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,...,0,0,0,NONE,NONE,NONE,0,4,2006,WD
2718,905108090,90,RL,69.630042,9662,Pave,NONE,IR1,Lvl,AllPub,...,0,0,0,NONE,NONE,NONE,0,8,2006,WD
2414,528218130,60,RL,58.0,17104,Pave,NONE,IR1,Lvl,AllPub,...,0,0,0,NONE,NONE,NONE,0,9,2006,New
1989,902207150,30,RM,60.0,8520,Pave,NONE,Reg,Lvl,AllPub,...,0,0,0,NONE,NONE,NONE,0,7,2007,WD
625,535105100,20,RL,69.630042,9500,Pave,NONE,IR1,Lvl,AllPub,...,0,185,0,NONE,NONE,NONE,0,7,2009,WD


In [9]:
#shape of test dataset
test.shape

(879, 79)

##### Test dataset: Create Dummies for Nominal Features

In [10]:
#use pd.get_dummies to get dummies for columns
test = pd.get_dummies(test, columns=col_to_dummies, drop_first=True)
test.head()

Id,PID,Lot Frontage,Lot Area,Lot Shape,Utilities,Land Slope,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_Shed,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
2658,902301120,69.0,9142,Reg,AllPub,Gtl,6,8,1910,1950,...,0,0,0,0,0,0,0,0,0,1
2718,905108090,69.630042,9662,IR1,AllPub,Gtl,5,4,1977,1977,...,0,0,0,0,0,0,0,0,0,1
2414,528218130,58.0,17104,IR1,AllPub,Gtl,7,5,2006,2006,...,0,0,0,0,0,0,1,0,0,0
1989,902207150,60.0,8520,Reg,AllPub,Gtl,5,6,1923,2006,...,0,0,0,0,0,0,0,0,0,1
625,535105100,69.630042,9500,IR1,AllPub,Gtl,6,5,1963,1963,...,0,0,0,0,0,0,0,0,0,1


In [11]:
#shape of test dataset with new dummies
test.shape

(879, 202)
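Note that the two shapes do not line up: train has 212 columns including `SalePrice`, so 211 feature columns against 202 in test. The outputs above already show one cause, since `Misc Feature_TenC` appears only in train and `Sale Type_VWD` only in test. One possible way to reconcile them, sketched here on hypothetical toy frames (`train_toy`, `test_toy` and `feature_cols` are illustrative names, not part of this notebook), is to reindex test onto the train feature columns.

```python
import pandas as pd

# Hypothetical stand-ins for the dummy-encoded train and test frames
train_toy = pd.DataFrame({
    "Sale Type_New": [0, 1],
    "Misc Feature_TenC": [1, 0],   # category seen only in train
    "SalePrice": [130500, 220000],
})
test_toy = pd.DataFrame({
    "Sale Type_New": [1, 0],
    "Sale Type_VWD": [0, 1],       # category seen only in test
})

# Use the train feature columns (everything except the target) as the reference
feature_cols = [c for c in train_toy.columns if c != "SalePrice"]

# reindex adds train-only dummies as all-zero columns and drops test-only ones
test_aligned = test_toy.reindex(columns=feature_cols, fill_value=0)

print(test_aligned.columns.tolist())
```

Dropping test-only dummies is a design choice: a model fitted on train has no coefficient for a category it never saw, so that information cannot be used at prediction time anyway.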

### Cleaning Ordinal Features

In this process, we will manually assign an integer to each level of an ordinal column according to its rank on the scale.   
We do this by using the replace method on each column for both the **Train** and **Test** datasets, referring to the data dictionary linked above.
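The same scale has to be applied to both datasets, including levels that happen to be absent from one of them. As a hedged sketch of one way to enforce this, the hypothetical helper `encode_ordinal` below applies a single dictionary to every frame and raises an error if a label is not covered. It uses `Series.map` rather than the `replace` method used in this notebook, and the toy frames are made up for illustration.

```python
import pandas as pd

def encode_ordinal(frames, column, scale):
    """Map `column` to integers in every frame, failing on unmapped labels."""
    for df in frames:
        unknown = set(df[column].dropna().unique()) - set(scale)
        if unknown:
            raise ValueError(f"{column}: unmapped values {sorted(unknown)}")
        # Assigning back to the column updates the frame that was passed in
        df[column] = df[column].map(scale)

# Hypothetical toy frames standing in for train and test
train_toy = pd.DataFrame({"Lot Shape": ["Reg", "IR1", "IR3"]})
test_toy = pd.DataFrame({"Lot Shape": ["IR2", "Reg"]})

lot_shape_scale = {"Reg": 4, "IR1": 3, "IR2": 2, "IR3": 1}
encode_ordinal([train_toy, test_toy], "Lot Shape", lot_shape_scale)

print(train_toy["Lot Shape"].tolist())
print(test_toy["Lot Shape"].tolist())
```

Failing loudly on an unknown label avoids silently leaving a string such as `'Po'` in an otherwise numeric column.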

##### Lot Shape  
General shape of property

       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular  
       
       We will assign as per the following: 
       Reg - 4
       IR1 - 3
       IR2 - 2
       IR3 - 1

In [12]:
#train data: replace values in column using replace method
train["Lot Shape"] = train["Lot Shape"].replace({'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1})

#test data: replace values in column using replace method
test["Lot Shape"] = test["Lot Shape"].replace({'Reg': 4, 'IR1': 3, 'IR2': 2, 'IR3': 1})

##### Utilities 
Type of utilities available

       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only 
       
       We will assign as per the following: 
       AllPub - 4
       NoSewr - 3
       NoSeWa - 2
       ELO - 1

In [13]:
#train data: replace values in column using replace method
train["Utilities"] = train["Utilities"].replace({'AllPub': 4, 'NoSewr': 3, 'NoSeWa': 2, 'ELO': 1})

#test data: replace values in column using replace method
test["Utilities"] = test["Utilities"].replace({'AllPub': 4, 'NoSewr': 3, 'NoSeWa': 2, 'ELO': 1})

##### Land Slope 
Slope of property

       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope 
       
       We will assign as per the following: 
       Sev - 3
       Mod - 2
       Gtl - 1

In [14]:
#train data: replace values in column using replace method
train["Land Slope"] = train["Land Slope"].replace({'Sev': 3, 'Mod': 2, 'Gtl': 1})

#test data: replace values in column using replace method
test["Land Slope"] = test["Land Slope"].replace({'Sev': 3, 'Mod': 2, 'Gtl': 1})

##### Exter Qual 
Evaluates the quality of the material on the exterior 

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor 
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1

In [15]:
#train data: replace values in column using replace method
train["Exter Qual"] = train["Exter Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

#test data: replace values in column using replace method
test["Exter Qual"] = test["Exter Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

##### Exter Cond  
Evaluates the present condition of the material on the exterior 

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor 
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1

In [16]:
#train data: replace values in column using replace method
train["Exter Cond"] = train["Exter Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

#test data: replace values in column using replace method
test["Exter Cond"] = test["Exter Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

##### Bsmt Qual
Evaluates the height of the basement 

       Ex	Excellent (100+ inches)	
       Gd	Good (90-99 inches)
       TA	Typical (80-89 inches)
       Fa	Fair (70-79 inches)
       Po	Poor (<70 inches)
       NA	No Basement 
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1
       NONE - 0

In [17]:
#train data: replace values in column using replace method
train["Bsmt Qual"] = train["Bsmt Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Bsmt Qual"] = test["Bsmt Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

##### Bsmt Cond 
Evaluates the general condition of the basement 

       Ex	Excellent
       Gd	Good
       TA	Typical - slight dampness allowed
       Fa	Fair - dampness or some cracking or settling
       Po	Poor - Severe cracking, settling, or wetness
       NA	No Basement 
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1
       NONE - 0

In [18]:
#train data: replace values in column using replace method
train["Bsmt Cond"] = train["Bsmt Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Bsmt Cond"] = test["Bsmt Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

##### Bsmt Exposure
Refers to walkout or garden level walls

       Gd	Good Exposure
       Av	Average Exposure (split levels or foyers typically score average or above)	
       Mn	Minimum Exposure
       No	No Exposure
       NA	No Basement 
       
       We will assign as per the following: 
       Gd - 4
       Av - 3
       Mn - 2
       No - 1
       NONE - 0

In [19]:
#train data: replace values in column using replace method
train["Bsmt Exposure"] = train["Bsmt Exposure"].replace({'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Bsmt Exposure"] = test["Bsmt Exposure"].replace({'Gd': 4, 'Av': 3, 'Mn': 2, 'No': 1, 'NONE': 0})

##### BsmtFin Type 1
Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement 
       
       We will assign as per the following: 
       GLQ - 6
       ALQ - 5
       BLQ - 4
       Rec - 3
       LwQ - 2
       Unf - 1
       NONE - 0

In [20]:
#train data: replace values in column using replace method
train["BsmtFin Type 1"] = train["BsmtFin Type 1"].replace({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["BsmtFin Type 1"] = test["BsmtFin Type 1"].replace({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'NONE': 0})

##### BsmtFin Type 2
Rating of basement finished area

       GLQ	Good Living Quarters
       ALQ	Average Living Quarters
       BLQ	Below Average Living Quarters	
       Rec	Average Rec Room
       LwQ	Low Quality
       Unf	Unfinished
       NA	No Basement 
       
       We will assign as per the following: 
       GLQ - 6
       ALQ - 5
       BLQ - 4
       Rec - 3
       LwQ - 2
       Unf - 1
       NONE - 0

In [21]:
#train data: replace values in column using replace method
train["BsmtFin Type 2"] = train["BsmtFin Type 2"].replace({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["BsmtFin Type 2"] = test["BsmtFin Type 2"].replace({'GLQ': 6, 'ALQ': 5, 'BLQ': 4, 'Rec': 3, 'LwQ': 2, 'Unf': 1, 'NONE': 0})

##### Heating QC
Heating quality and condition

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor 
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1

In [22]:
#train data: replace values in column using replace method
train["Heating QC"] = train["Heating QC"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

#test data: replace values in column using replace method
test["Heating QC"] = test["Heating QC"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

##### Electrical
Electrical system

       SBrkr	Standard Circuit Breakers & Romex
       FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	
       FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)
       FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)
       Mix	    Mixed 
       
       We will assign as per the following: 
       SBrkr - 5
       FuseA - 4
       FuseF - 3
       FuseP - 2
       Mix - 1
       NONE - 0 (for test dataset)

In [23]:
#train data: replace values in column using replace method
train["Electrical"] = train["Electrical"].replace({'SBrkr': 5, 'FuseA': 4, 'FuseF': 3, 'FuseP': 2, 'Mix': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Electrical"] = test["Electrical"].replace({'SBrkr': 5, 'FuseA': 4, 'FuseF': 3, 'FuseP': 2, 'Mix': 1, 'NONE': 0})

##### Kitchen Qual
Kitchen quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1

In [24]:
#train data: replace values in column using replace method
train["Kitchen Qual"] = train["Kitchen Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

#test data: replace values in column using replace method
test["Kitchen Qual"] = test["Kitchen Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})

##### Functional
Home functionality (Assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only
       
       We will assign as per the following: 
       Typ - 8
       Min1 - 7
       Min2 - 6
       Mod - 5
       Maj1 - 4
       Maj2 - 3
       Sev - 2
       Sal - 1

In [25]:
#train data: replace values in column using replace method
train["Functional"] = train["Functional"].replace({'Typ': 8, 'Min1': 7, 'Min2': 6, 'Mod': 5, 'Maj1': 4, 'Maj2': 3, 'Sev': 2, 'Sal': 1})

#test data: replace values in column using replace method
test["Functional"] = test["Functional"].replace({'Typ': 8, 'Min1': 7, 'Min2': 6, 'Mod': 5, 'Maj1': 4, 'Maj2': 3, 'Sev': 2, 'Sal': 1})

##### Fireplace Qu
Fireplace quality

       Ex	Excellent - Exceptional Masonry Fireplace
       Gd	Good - Masonry Fireplace in main level
       TA	Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
       Fa	Fair - Prefabricated Fireplace in basement
       Po	Poor - Ben Franklin Stove
       NA	No Fireplace
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1
       NONE - 0

In [26]:
#train data: replace values in column using replace method
train["Fireplace Qu"] = train["Fireplace Qu"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Fireplace Qu"] = test["Fireplace Qu"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

##### Garage Finish
Interior finish of the garage

       Fin	Finished
       RFn	Rough Finished	
       Unf	Unfinished
       NA	No Garage
       
       We will assign as per the following: 
       Fin - 3
       RFn - 2
       Unf - 1
       NONE - 0

In [27]:
#train data: replace values in column using replace method
train["Garage Finish"] = train["Garage Finish"].replace({'Fin': 3, 'RFn': 2, 'Unf': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Garage Finish"] = test["Garage Finish"].replace({'Fin': 3, 'RFn': 2, 'Unf': 1, 'NONE': 0})

##### Garage Qual
Garage quality

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1
       NONE - 0

In [28]:
#train data: replace values in column using replace method
train["Garage Qual"] = train["Garage Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Garage Qual"] = test["Garage Qual"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

##### Garage Cond
Garage condition

       Ex	Excellent
       Gd	Good
       TA	Typical/Average
       Fa	Fair
       Po	Poor
       NA	No Garage
       
       We will assign as per the following: 
       Ex - 5
       Gd - 4
       TA - 3
       Fa - 2
       Po - 1
       NONE - 0

In [29]:
#train data: replace values in column using replace method
train["Garage Cond"] = train["Garage Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Garage Cond"] = test["Garage Cond"].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'NONE': 0})

##### Paved Drive
Paved driveway

       Y	Paved 
       P	Partial Pavement
       N	Dirt/Gravel
       
       We will assign as per the following: 
       Y - 3
       P - 2
       N - 1


In [30]:
#train data: replace values in column using replace method
train["Paved Drive"] = train["Paved Drive"].replace({'Y': 3, 'P': 2, 'N': 1})

#test data: replace values in column using replace method
test["Paved Drive"] = test["Paved Drive"].replace({'Y': 3, 'P': 2, 'N': 1})

##### Pool QC
Pool quality

       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       NA	No Pool
       
       We will assign as per the following: 
       Ex - 4
       Gd - 3
       TA - 2
       Fa - 1
       NONE - 0

In [31]:
#train data: replace values in column using replace method
train["Pool QC"] = train["Pool QC"].replace({'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Pool QC"] = test["Pool QC"].replace({'Ex': 4, 'Gd': 3, 'TA': 2, 'Fa': 1, 'NONE': 0})

##### Fence
Fence quality

       GdPrv	Good Privacy
       MnPrv	Minimum Privacy
       GdWo	Good Wood
       MnWw	Minimum Wood/Wire
       NA	No Fence
       
       We will assign as per the following: 
       GdPrv - 4
       MnPrv - 3
       GdWo - 2
       MnWw - 1
       NONE - 0

In [32]:
#train data: replace values in column using replace method
train["Fence"] = train["Fence"].replace({'GdPrv': 4, 'MnPrv': 3, 'GdWo': 2, 'MnWw': 1, 'NONE': 0})

#test data: replace values in column using replace method
test["Fence"] = test["Fence"].replace({'GdPrv': 4, 'MnPrv': 3, 'GdWo': 2, 'MnWw': 1, 'NONE': 0})

##### Check both datasets

In [33]:
#train dataset
train.head()

Id,PID,Lot Frontage,Lot Area,Lot Shape,Utilities,Land Slope,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_Shed,Misc Feature_TenC,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_WD
109,533352170,69.0552,13517,3,4,1,6,8,1976,2005,...,0,0,0,0,0,0,0,0,0,1
544,531379050,43.0,11492,3,4,1,7,5,1996,1997,...,0,0,0,0,0,0,0,0,0,1
153,535304180,68.0,7922,4,4,1,5,7,1953,2007,...,0,0,0,0,0,0,0,0,0,1
318,916386060,73.0,9802,4,4,1,5,5,2006,2007,...,0,0,0,0,0,0,0,0,0,1
255,906425045,82.0,14235,3,4,1,6,8,1900,1993,...,0,0,0,0,0,0,0,0,0,1


In [34]:
train.shape

(2051, 212)

In [35]:
#test dataset
test.head()

Id,PID,Lot Frontage,Lot Area,Lot Shape,Utilities,Land Slope,Overall Qual,Overall Cond,Year Built,Year Remod/Add,...,Misc Feature_Shed,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
2658,902301120,69.0,9142,4,4,1,6,8,1910,1950,...,0,0,0,0,0,0,0,0,0,1
2718,905108090,69.630042,9662,3,4,1,5,4,1977,1977,...,0,0,0,0,0,0,0,0,0,1
2414,528218130,58.0,17104,3,4,1,7,5,2006,2006,...,0,0,0,0,0,0,1,0,0,0
1989,902207150,60.0,8520,4,4,1,5,6,1923,2006,...,0,0,0,0,0,0,0,0,0,1
625,535105100,69.630042,9500,3,4,1,6,5,1963,1963,...,0,0,0,0,0,0,0,0,0,1


In [36]:
test.shape

(879, 202)
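`head()` shows only the first five rows and a subset of the columns, so a label that was never mapped could still be hiding further down. A minimal sketch of a stricter check, assuming a hypothetical helper `non_numeric_columns` and a made-up toy frame, lists every column that is still neither numeric nor boolean.

```python
import pandas as pd

def non_numeric_columns(df):
    """Return the names of columns that are neither numeric nor boolean."""
    return df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Hypothetical frame in which one column still contains an unmapped label
checked = pd.DataFrame({
    "Lot Shape": [4, 3],
    "Utilities": [4, 4],
    "Pool QC": ["NONE", 0],   # mixed strings and integers
})

print(non_numeric_columns(checked))
```

Calling the helper on `train` and `test` before exporting would ideally return an empty list for each.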

### Export train and test data

In [37]:
# exporting train data with dummies
filepath = "../datasets/train_dummies.csv"
train.to_csv(filepath)

In [38]:
# exporting test data with dummies
filepath = "../datasets/test_dummies.csv"
test.to_csv(filepath)

### Summary

We have successfully converted all categorical data, both nominal and ordinal, into numerical data so that we can analyze them and select the best variables for our model.

Next, we will try to identify the best features (those that fit our assumptions of linear regression) from both the numerical and categorical data to build our model.