# Metis Introduction to Data Science Course Project
## Felipe Rios Ribeiro

### What is the question you hope to answer?

I selected a challenge from Kaggle named "House Prices - Advanced Regression Techniques", available at https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview.

#### How might we predict house prices from house features such as area, age, location, style, and condition?

### What data are you planning to use to answer that question?

The available data for this challenge can be found here: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

1. train.csv

2. test.csv

### Steps

* Import data and libraries
* Understand Train and Test datasets
* Cleaning
* Understand Sale Price (target variable)
* Fill missing values in Train
* Determine the most important variables for Sale Price (correlation)
* Run a multiple linear regression with the most important variables


### <span style='background :yellow' > Import data and libraries </span>

In [2]:
#libraries for data handling/modeling
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import scipy.stats as stats
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

#libraries for visualization
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

#Train data 
fullData = pd.read_csv("/Users/feliperiosribeiro/METIS_project_Felipe/data/train.csv")

#Test data 
#Test = pd.read_csv("/Users/feliperiosribeiro/METIS_project_Felipe/data/test.csv")

### <span style='background :yellow' > Understand the dataset </span>

#### Train data

In [3]:
fullData.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
fullData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

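* Train has 1,460 rows and 81 columns, including SalePrice, and several columns have many null values (for example, Alley has only 91 non-null entries and LotFrontage has 1,201)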
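
One way to quantify the null values is to count them per column. The sketch below uses a small made-up frame (the values are illustrative, not taken from train.csv) and a hypothetical helper name, `missing_summary`; the same call should work on `fullData`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),
    })
    # keep only the columns that actually have missing values
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)

# toy data, made up for illustration
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Sorting by the count puts the columns that need the most attention at the top, which is easier to scan than the full `info()` listing.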

### <span style='background :yellow' > Cleaning </span>

In [6]:
# fix the non-numeric columns

# MSSubClass has no missing values, but it encodes the type of dwelling,
# so it should be treated as a category rather than an integer
fullData['MSSubClass'] = fullData['MSSubClass'].astype(str)

# columns where a missing value is meaningful (the house lacks that feature,
# e.g. no alley access or no garage), so I'll fill them with "None"
# these fields are already an object so no need to change format
none_cols = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
             'BsmtFinType2', 'FireplaceQu', 'GarageType', 'GarageFinish',
             'GarageQual', 'GarageCond', 'PoolQC', 'Fence', 'MiscFeature']
for col in none_cols:
    fullData[col] = fullData[col].fillna("None")

# columns that are non-numeric and have missing values with no special meaning,
# so I'll fill them with the mode (the most frequent value)
# these fields are already an object so no need to change format
mode_cols = ['MSZoning', 'Utilities', 'Exterior1st', 'Exterior2nd', 'Functional',
             'MasVnrType', 'Electrical', 'KitchenQual', 'SaleType']
for col in mode_cols:
    fullData[col] = fullData[col].fillna(fullData[col].mode()[0])


In [7]:
fullData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   object 
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          1460 non-null   object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

Now I still need to fill the following numeric columns (Functional is not numeric and was already filled with its mode above):
* LotFrontage
* MasVnrArea
* BsmtFinSF1
* BsmtFinSF2
* TotalBsmtSF
* BsmtFullBath
* BsmtHalfBath
* GarageYrBlt
* GarageCars
* GarageArea

In [8]:
# fill numeric fields with the median
# (TotalBsmtSF is included so the code matches the list above)

numeric_cols = ['LotFrontage', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'TotalBsmtSF',
                'BsmtFullBath', 'BsmtHalfBath', 'GarageYrBlt', 'GarageCars', 'GarageArea']
for col in numeric_cols:
    fullData[col] = fullData[col].fillna(fullData[col].median())
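
A likely reason to prefer the median over the mean for these fills is that the median is not pulled by extreme values. The sketch below illustrates this on made-up lot frontages (not the Kaggle data) where one value is an outlier.

```python
import numpy as np
import pandas as pd

# made-up values with one outlier and one missing entry, for illustration only
frontage = pd.Series([60.0, 65.0, 70.0, np.nan, 313.0], name="LotFrontage")

# both statistics skip NaN by default
mean_value = frontage.mean()
median_value = frontage.median()
print("mean:", mean_value)
print("median:", median_value)

# fill the missing entry with the median, as in the cell above
filled = frontage.fillna(median_value)
print(filled.tolist())
```

The single large value drags the mean well above the typical frontage, while the median stays among the ordinary values, so the filled entry looks like a typical house.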


In [9]:
fullData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   object 
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1460 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          1460 non-null   object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### <span style='background :yellow' > Handling Categorical Features </span>

In [10]:
# Make a copy of the dataset in case the encoding goes wrong

fullData2 = fullData.copy()

In [11]:
fullData2.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [12]:
# columns to be encoded using get_dummies
# (the np.object alias was removed in NumPy 1.24, so select by the 'object' dtype name)
fullData2.select_dtypes(include='object').columns

Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType',
       'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'],
      dtype='object')

In [13]:
# encode every column that is not numeric (int64 or float64)

dummiesDF = pd.get_dummies(fullData2.select_dtypes(include='object'))
dummiesDF.shape

(1460, 281)
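
To make the encoding step concrete, here is a minimal sketch on a made-up frame (toy values, not the Kaggle data). Each non-numeric column expands into one indicator column per distinct value, which is why the 44 object columns became 281 columns above.

```python
import pandas as pd

# toy data, made up for illustration
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# encode only the non-numeric columns; new names follow the pattern <column>_<value>
text_cols = toy.select_dtypes(exclude="number").columns
dummies = pd.get_dummies(toy[text_cols])
print(dummies.columns.tolist())

# join the indicators back on the index and drop the original text columns
encoded = toy.drop(columns=text_cols).join(dummies)
print(encoded.shape)
```

Here two text columns with two and three distinct values produce five indicator columns, so the encoded frame has six columns in total.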

In [14]:
# join the dummy columns back onto the data (aligned on the index)

fullData2 = fullData2.merge(dummiesDF, left_index=True, right_index=True)

# drop the original categorical columns now that they are encoded
# (Id is kept here; it is dropped later, in the Scale section)
categorical_cols = fullData2.select_dtypes(include='object').columns
fullData2 = fullData2.drop(columns=categorical_cols)

In [15]:
fullData2.head()

Unnamed: 0,Id,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,1,65.0,8450,7,5,2003,2003,196.0,706,0,...,0,0,0,1,0,0,0,0,1,0
1,2,80.0,9600,6,8,1976,1976,0.0,978,0,...,0,0,0,1,0,0,0,0,1,0
2,3,68.0,11250,7,5,2001,2002,162.0,486,0,...,0,0,0,1,0,0,0,0,1,0
3,4,60.0,9550,7,5,1915,1970,0.0,216,0,...,0,0,0,1,1,0,0,0,0,0
4,5,84.0,14260,8,5,2000,2000,350.0,655,0,...,0,0,0,1,0,0,0,0,1,0


### <span style='background :yellow' > Understand Sale Price (target variable) </span>

In [16]:
fullData2.SalePrice.describe()

count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

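* The average sale price is almost 181k, with a considerable standard deviation of almost 80k
* The price is positively skewed rather than normally distributed ("bell-shaped" curve): the mean (about 181k) is above the median (163k), and the maximum (755k) is far above the 75th percentile (214k)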
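
The skew can also be checked numerically. The sketch below computes the sample skewness (the third standardized moment) with NumPy on made-up prices, not the Kaggle data. It also shows that a log transform reduces the skew; that is a common remedy, but this notebook does not apply it. The helper name `skewness` is hypothetical.

```python
import numpy as np

def skewness(values):
    """Sample skewness: mean cubed deviation divided by the cubed standard deviation."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# made-up prices with a long right tail, for illustration only
prices = np.array([120_000, 135_000, 150_000, 163_000, 180_000, 214_000, 755_000])

raw_skew = skewness(prices)
log_skew = skewness(np.log(prices))
print(f"skewness of raw prices: {raw_skew:.2f}")
print(f"skewness of log prices: {log_skew:.2f}")
```

A value above zero indicates a longer right tail; a symmetric distribution such as the normal has a skewness of zero.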

In [17]:
priceColumn = fullData2['SalePrice']

In [18]:
origId = fullData2['Id']

### <span style='background :yellow' > Scale </span>

In [19]:
fullData3 = fullData2.copy()
fullData3 = fullData3.drop('Id', axis = 1)

In [20]:
fullData3.shape

(1460, 317)

In [22]:
fullData3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Columns: 317 entries, LotFrontage to SaleCondition_Partial
dtypes: float64(3), int64(33), uint8(281)
memory usage: 811.4 KB


In [25]:
# features: every column except the target
X_rf = fullData3.drop('SalePrice', axis=1)

#target
y_rf = fullData3.SalePrice

In [26]:
# split into 60% training and 40% testing data

X_train, X_test, y_train, y_test = train_test_split(X_rf, y_rf, test_size=.4, random_state=20)

print("training data size:", X_train.shape)
print("testing data size:", X_test.shape)

training data size: (876, 316)
testing data size: (584, 316)


In [27]:
X_train.head()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
560,69.0,11341,5,6,1957,1996,180.0,1302,0,90,...,0,0,0,1,0,0,0,0,1,0
1018,69.0,10784,7,5,1991,1992,76.0,0,0,384,...,0,0,0,1,0,0,0,0,1,0
1380,45.0,8212,3,3,1914,1950,0.0,203,0,661,...,0,0,0,1,0,0,0,0,1,0
524,95.0,11787,7,5,1996,1997,594.0,719,0,660,...,0,0,0,1,0,0,0,0,1,0
1155,90.0,10768,5,8,1976,2004,0.0,1157,0,280,...,0,0,0,1,0,0,0,0,1,0


In [23]:
# scale: fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
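
The scaler is fit on the training data only so that the test set does not influence the preprocessing. A minimal NumPy sketch of the same standardization, z = (x - mean) / std with the mean and population standard deviation taken from the training rows, which is what `StandardScaler` computes by default. The numbers are made up for illustration.

```python
import numpy as np

# made-up feature matrices, for illustration only (rows = houses, columns = features)
train = np.array([[8450.0, 5.0], [9600.0, 6.0], [11250.0, 7.0], [9550.0, 8.0]])
test = np.array([[14260.0, 9.0], [10000.0, 5.0]])

# "fit": learn the per-column statistics from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": apply the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled.mean(axis=0))   # generally not 0: test rows reuse the training statistics
```

The scaled test columns are not expected to have zero mean or unit variance; that is the intended behavior, since the test rows are treated as unseen data.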

In [24]:
# instantiate and fit the multiple linear regression
multiple_linreg = LinearRegression()
multiple_linreg.fit(X_train, y_train)

coeffs = multiple_linreg.coef_
intercept = multiple_linreg.intercept_


In [1]:
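# generate predictions on the test set and evaluate
predictions = multiple_linreg.predict(X_test)
print("Test set RMSE for Multiple Linear Reg:", np.sqrt(metrics.mean_squared_error(y_test, predictions)))

NameError: name 'multiple_linreg' is not defined

* This cell raised a NameError because it was run in a fresh kernel (note the In [1] prompt), before the cell that defines multiple_linreg; the notebook has to be run from top to bottom to get the RMSE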
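
For reference, the RMSE that this cell computes is the square root of the mean squared difference between actual and predicted prices. A minimal NumPy sketch on made-up numbers (the helper name `rmse` is hypothetical, and the values are not model output):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# made-up actual and predicted prices, for illustration only
actual = [200_000, 150_000, 300_000]
predicted = [210_000, 140_000, 280_000]

error = rmse(actual, predicted)
print(f"RMSE: {error:.2f}")
```

Because the errors are squared before averaging, one large miss (20k here) weighs more than several small ones, so the RMSE ends up above the mean absolute error.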