# Predicting house sale prices

## Introduction

In this project we'll use a linear regression model to predict the house sale prices in the city of Ames, Iowa, United States from 2006 to 2010. You can read more about why the data was collected [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627) and you can also read about the different columns in the data [here](https://s3.amazonaws.com/dq-content/307/data_description.txt).

Let's start by setting up a pipeline of functions that will let us quickly iterate on different models:
1. train
2. transform_features()
3. select_features()
4. train_and_test()
5. rmse_values and avg_rmse

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 999
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import seaborn as sns
%matplotlib inline

data = pd.read_csv("AmesHousing.tsv", delimiter= "\t")
data.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1960,1960,Hip,CompShg,BrkFace,Plywood,Stone,112.0,TA,TA,CBlock,TA,Gd,Gd,BLQ,639.0,Unf,0.0,441.0,1080.0,GasA,Fa,Y,SBrkr,1656,0,0,1656,1.0,0.0,1,0,3,1,TA,7,Typ,2,Gd,Attchd,1960.0,Fin,2.0,528.0,TA,TA,P,210,62,0,0,0,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Feedr,Norm,1Fam,1Story,5,6,1961,1961,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,Rec,468.0,LwQ,144.0,270.0,882.0,GasA,TA,Y,SBrkr,896,0,0,896,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Attchd,1961.0,Unf,1.0,730.0,TA,TA,Y,140,0,0,0,120,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,6,1958,1958,Hip,CompShg,Wd Sdng,Wd Sdng,BrkFace,108.0,TA,TA,CBlock,TA,TA,No,ALQ,923.0,Unf,0.0,406.0,1329.0,GasA,TA,Y,SBrkr,1329,0,0,1329,0.0,0.0,1,1,3,1,Gd,6,Typ,0,,Attchd,1958.0,Unf,1.0,312.0,TA,TA,Y,393,36,0,0,0,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,Gtl,NAmes,Norm,Norm,1Fam,1Story,7,5,1968,1968,Hip,CompShg,BrkFace,BrkFace,,0.0,Gd,TA,CBlock,TA,TA,No,ALQ,1065.0,Unf,0.0,1045.0,2110.0,GasA,Ex,Y,SBrkr,2110,0,0,2110,1.0,0.0,2,1,3,1,Ex,8,Typ,2,TA,Attchd,1968.0,Fin,2.0,522.0,TA,TA,Y,0,0,0,0,0,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,5,5,1997,1998,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,GLQ,791.0,Unf,0.0,137.0,928.0,GasA,Gd,Y,SBrkr,928,701,0,1629,0.0,0.0,2,1,3,1,TA,6,Typ,1,TA,Attchd,1997.0,Fin,2.0,482.0,TA,TA,Y,212,34,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal,189900


Let's create 3 functions:
1. A function named transform_features() that, for now, just returns the train data frame.
2. A function named select_features() that, for now, just returns the Gr Liv Area and SalePrice columns from the train data frame.
3. A function named train_and_test() that, for now:
    * Selects the first 1460 rows from data and assigns them to train.
    * Selects the remaining rows from data and assigns them to test.
    * Trains a model using all numerical columns except the SalePrice column (the target column) from the data frame returned from select_features().
    * Tests the model on the test set and returns the RMSE value.
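
As a reminder of what the RMSE value measures, here is a minimal sketch in plain Python. The `rmse` helper and the prices below are made up for illustration and are not part of the notebook's pipeline, which relies on `mean_squared_error` from scikit-learn.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up sale prices and predictions, for illustration only
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 290000]

# Every prediction is off by exactly 10000, so the RMSE is 10000.0
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is expressed in the same unit as SalePrice.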

In [2]:
def transform_features(df):
    return df

In [3]:
def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

In [4]:
def train_and_test(df):
    train = df[:1460]
    test = df[1460:]
    features = train.select_dtypes(include=['float64', "int64"])
    features = features.columns.drop("SalePrice")
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    return rmse

In [5]:
transform_df = transform_features(data)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)
rmse

57088.25161263909

## Feature Engineering

Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. We'll update transform_features() so that any column from the data frame with more than 5% missing values is dropped. In general, the goal of this function is to:
* remove features that we don't want to use in the model, just based on the number of missing values or data leakage
* transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc)
* create new features by combining other features

Handle missing values:
* All columns:
    * Drop any with more than 5% missing values for now.
* Text columns:
    * Drop any with 1 or more missing values for now.
* Numerical columns:
    * For columns with missing values, fill in with the most common value in that column
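
The rules above can be sketched on a tiny data frame before applying them to the real data. The column names and the 50% threshold here are assumptions chosen only so the effect is visible on four rows; the notebook itself uses 5%.

```python
import numpy as np
import pandas as pd

# Toy data frame with hypothetical column names, for illustration only
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 3.0, np.nan],  # 75% missing
    "few_missing": [1.0, 1.0, np.nan, 2.0],           # 25% missing
    "text": ["a", "b", "c", "d"],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().sum() / len(toy) * 100

# Drop columns above the threshold (50% here only because the frame is tiny)
toy = toy.drop(missing_pct[missing_pct > 50].index, axis=1)

# Fill the remaining numerical gaps with the most common value (the mode)
num_cols = toy.select_dtypes(include=["number"]).columns
toy[num_cols] = toy[num_cols].apply(lambda col: col.fillna(col.mode()[0]))

print(toy)
```

Here `mostly_missing` is dropped entirely, while the single gap in `few_missing` is filled with 1.0, the value that occurs most often in that column.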

In [6]:
#Copy the dataframe
df_copy = data.copy()
total_values = len(df_copy)
missing_per_column = (df_copy.isnull().sum().sort_values()/total_values) *100

#Drop the columns with more than 5% of missing values
missing_cols = missing_per_column[missing_per_column > 5].index
df_copy = df_copy.drop(missing_cols, axis = 1)

#Fill the missing values with the most common value in that numerical column
numerical_cols = df_copy.select_dtypes(include=['float64', "int64"])
df_copy[numerical_cols.columns] = df_copy[numerical_cols.columns].apply(lambda x: x.fillna(x.mode()[0]), axis = 0)

In [7]:
#Drop text columns with 1 or more missing values
text_cols = df_copy.select_dtypes(include=["object"]).columns
null_text = df_copy[text_cols].isnull().sum()
df_copy = df_copy.drop(null_text[null_text > 0].index, axis = 1)
df_copy.isnull().sum().value_counts()

0    64
dtype: int64

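Let's create 2 new features that better capture the information in the year columns:
* years_sold: the age of the house when it was sold (Yr Sold minus Year Built).
* years_since_remod: the number of years between the last remodel and the sale (Yr Sold minus Year Remod/Add).

We'll check whether any rows have negative values for these features, which would be invalid, and drop them.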

In [8]:
years_sold = df_copy["Yr Sold"] - df_copy["Year Built"]
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [9]:
years_since_remod = df_copy["Yr Sold"] - df_copy["Year Remod/Add"]
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [10]:
df_copy["years_sold"] = years_sold
df_copy["years_since_remod"] = years_since_remod

#Drop the rows with negative numbers
df_copy = df_copy.drop([1702,2180,2181], axis = 0)

#Drop the columns ["Yr Sold","Year Built","Year Remod/Add"]
df_copy = df_copy.drop(["Yr Sold","Year Built","Year Remod/Add"], axis = 1)
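
The cell above drops the invalid rows by their index labels. One possible alternative, sketched here on made-up rows, is to filter with a boolean mask so the code does not depend on specific labels being present. The toy values are hypothetical; only the column names come from the data set.

```python
import pandas as pd

# Toy rows with hypothetical values; the column names match the data set
toy = pd.DataFrame({
    "Yr Sold": [2008, 2007, 2009],
    "Year Built": [2000, 2008, 1990],
    "Year Remod/Add": [2005, 2009, 1995],
})

years_sold = toy["Yr Sold"] - toy["Year Built"]
years_since_remod = toy["Yr Sold"] - toy["Year Remod/Add"]

# Keep only rows where both derived features are non-negative
valid = (years_sold >= 0) & (years_since_remod >= 0)
toy = toy[valid]

# The middle row was "sold" before it was built, so it is removed
print(toy.index.tolist())
```

The mask approach would also keep working if the data were reloaded with a different index or if new invalid rows appeared.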

In [11]:
# Drop columns that aren't useful for ML
df_copy = df_copy.drop(["PID", "Order"], axis=1)

# Drop columns that leak info about the final sale
df_copy = df_copy.drop(["Mo Sold", "Sale Condition", "Sale Type"], axis=1)

Now let's update transform_features():

In [12]:
def transform_features(df):
    #Copy the dataframe
    df_copy = df.copy()
    total_values = len(df_copy)
    missing_per_column = (df_copy.isnull().sum().sort_values()/total_values) *100
    
    #Drop the columns with more than 5% of missing values
    missing_cols = missing_per_column[missing_per_column > 5].index
    df_copy = df_copy.drop(missing_cols, axis = 1)
    
    #Fill the missing values with the most common value in that numerical column
    numerical_cols = df_copy.select_dtypes(include=['float64', "int64"])
    df_copy[numerical_cols.columns] = df_copy[numerical_cols.columns].apply(lambda x: x.fillna(x.mode()[0]), axis = 0)
    
    #Drop text columns with 1 or more missing values
    text_cols = df_copy.select_dtypes(include=["object"]).columns
    null_text = df_copy[text_cols].isnull().sum()
    df_copy = df_copy.drop(null_text[null_text > 0].index, axis = 1)
    
    years_sold = df_copy["Yr Sold"] - df_copy["Year Built"]
    years_since_remod = df_copy["Yr Sold"] - df_copy["Year Remod/Add"]
    
    df_copy["years_sold"] = years_sold
    df_copy["years_since_remod"] = years_since_remod
    
    #Drop the rows with negative numbers
    df_copy = df_copy.drop([1702,2180,2181], axis = 0)
    
    #Drop the columns ["Yr Sold","Year Built","Year Remod/Add"]
    df_copy = df_copy.drop(["Yr Sold","Year Built","Year Remod/Add"], axis = 1)

    # Drop columns that aren't useful for ML
    df_copy = df_copy.drop(["PID", "Order"], axis=1)

    # Drop columns that leak info about the final sale
    df_copy = df_copy.drop(["Mo Sold", "Sale Condition", "Sale Type"], axis=1)

    return df_copy

In [13]:
transform_df = transform_features(data)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)
rmse

55275.367312413066

## Feature Selection

Now that we have cleaned and transformed a lot of the features in the data set, it's time to move on to feature selection for numerical features. We'll look for correlations between SalePrice and the features, working with the absolute value.
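
To see why the absolute value matters, consider this small sketch with made-up numbers. The `area`, `age`, `noise` and `price` columns are hypothetical: a feature that falls as the price rises has a strongly negative coefficient, yet it is just as informative as a positive one.

```python
import pandas as pd

# Hypothetical toy data: price rises with area and falls with age
toy = pd.DataFrame({
    "area": [50, 80, 100, 120, 150],
    "age": [40, 30, 20, 10, 5],
    "noise": [3, 1, 4, 1, 5],
    "price": [100, 160, 200, 240, 300],
})

corr_with_price = toy.corr()["price"]
abs_corr = corr_with_price.abs().sort_values(ascending=False)

# The raw coefficient for age is negative, but its magnitude is large
print(corr_with_price["age"])
print(abs_corr)
```

Sorting the raw coefficients would push `age` to the bottom of the list, below `noise`, even though it tracks the price far more closely.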

In [14]:
corr = transform_df.select_dtypes(include=['float64', "int64"]).corr()
corr_sales = np.abs(corr["SalePrice"]).sort_values(ascending = False)
corr_sales

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648361
Total Bsmt SF        0.644012
Garage Area          0.641425
1st Flr SF           0.635185
years_sold           0.558979
Full Bath            0.546118
years_since_remod    0.534985
Mas Vnr Area         0.506983
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.439284
Wood Deck SF         0.328183
Open Porch SF        0.316262
Half Bath            0.284871
Bsmt Full Bath       0.276258
2nd Flr SF           0.269601
Lot Area             0.267520
Bsmt Unf SF          0.182751
Bedroom AbvGr        0.143916
Enclosed Porch       0.128685
Kitchen AbvGr        0.119760
Screen Porch         0.112280
Overall Cond         0.101540
MS SubClass          0.085128
Pool Area            0.068438
Low Qual Fin SF      0.037629
Bsmt Half Bath       0.035875
3Ssn Porch           0.032268
Misc Val             0.019273
BsmtFin SF 2         0.006127
Name: SalePrice, dtype: float64

Let's keep only the columns whose absolute correlation with SalePrice is greater than 0.4:

In [15]:
corr_sales[corr_sales > 0.4]

SalePrice            1.000000
Overall Qual         0.801206
Gr Liv Area          0.717596
Garage Cars          0.648361
Total Bsmt SF        0.644012
Garage Area          0.641425
1st Flr SF           0.635185
years_sold           0.558979
Full Bath            0.546118
years_since_remod    0.534985
Mas Vnr Area         0.506983
TotRms AbvGrd        0.498574
Fireplaces           0.474831
BsmtFin SF 1         0.439284
Name: SalePrice, dtype: float64

In [16]:
transform_df = transform_df.drop(corr_sales[corr_sales < 0.4].index, axis = 1)

Now we will answer the following questions:
* Which categorical columns should we use?
* Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?

In [17]:
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

If a categorical column has hundreds of unique values (or categories), dummy coding it adds hundreds of columns to the data frame. So we will keep only the 10 columns with the fewest unique values.
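
The relationship between unique values and new columns can be checked on a toy column. The `style` column and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories
toy = pd.DataFrame({"style": ["ranch", "loft", "ranch", "cabin"]})

dummies = pd.get_dummies(toy["style"], prefix="style")

# One new column is created per unique category
print(toy["style"].nunique(), dummies.shape[1])
print(dummies.columns.tolist())
```

A column with 28 categories would therefore add 28 dummy columns on its own.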

In [18]:
values_count = {}
for c in nominal_features:
    if c in transform_df.columns:
        unique_values = len(transform_df[c].value_counts())
        values_count[c] = unique_values
values_count = pd.Series(values_count).sort_values()

#we will use only the first 10
print(len(values_count))
values_count

16


Street           2
Central Air      2
Land Contour     4
Lot Config       5
Bldg Type        5
Roof Style       6
Foundation       6
Heating          6
MS Zoning        7
Condition 2      8
House Style      8
Roof Matl        8
Condition 1      9
Exterior 1st    16
Exterior 2nd    17
Neighborhood    28
dtype: int64

In [19]:
transform_df = transform_df.drop(values_count.tail(6).index, axis = 1)

Now we will transform the remaining text columns into categorical columns:

In [20]:
text_cols = transform_df.select_dtypes(include=['object'])
for c in text_cols:
    transform_df[c] = transform_df[c].astype("category")

Next, we will transform the categorical columns into dummy columns:

In [21]:
dummies = pd.get_dummies(transform_df.select_dtypes(include = ["category"]))
dummies

Unnamed: 0,MS Zoning_A (agr),MS Zoning_C (all),MS Zoning_FV,MS Zoning_I (all),MS Zoning_RH,MS Zoning_RL,MS Zoning_RM,Street_Grvl,Street_Pave,Lot Shape_IR1,Lot Shape_IR2,Lot Shape_IR3,Lot Shape_Reg,Land Contour_Bnk,Land Contour_HLS,Land Contour_Low,Land Contour_Lvl,Utilities_AllPub,Utilities_NoSeWa,Utilities_NoSewr,Lot Config_Corner,Lot Config_CulDSac,Lot Config_FR2,Lot Config_FR3,Lot Config_Inside,Land Slope_Gtl,Land Slope_Mod,Land Slope_Sev,Condition 2_Artery,Condition 2_Feedr,Condition 2_Norm,Condition 2_PosA,Condition 2_PosN,Condition 2_RRAe,Condition 2_RRAn,Condition 2_RRNn,Bldg Type_1Fam,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,Roof Style_Flat,Roof Style_Gable,Roof Style_Gambrel,Roof Style_Hip,Roof Style_Mansard,Roof Style_Shed,Exter Qual_Ex,Exter Qual_Fa,Exter Qual_Gd,Exter Qual_TA,Exter Cond_Ex,Exter Cond_Fa,Exter Cond_Gd,Exter Cond_Po,Exter Cond_TA,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,Heating QC_Ex,Heating QC_Fa,Heating QC_Gd,Heating QC_Po,Heating QC_TA,Central Air_N,Central Air_Y,Kitchen Qual_Ex,Kitchen Qual_Fa,Kitchen Qual_Gd,Kitchen Qual_Po,Kitchen Qual_TA,Functional_Maj1,Functional_Maj2,Functional_Min1,Functional_Min2,Functional_Mod,Functional_Sal,Functional_Sev,Functional_Typ,Paved Drive_N,Paved Drive_P,Paved Drive_Y
0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0
1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
2,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1
3,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
2926,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
2927,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1
2928,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1


In [22]:
transform_df = pd.concat([transform_df,dummies], axis = 1)
transform_df = transform_df.drop(text_cols, axis = 1)

Now we will update the select_features() function:

In [23]:
def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    
    corr = df.select_dtypes(include=['float64', "int64"]).corr()
    corr_sales = np.abs(corr["SalePrice"]).sort_values(ascending = False)
    
    #Let's use only the columns with more than coeff_threshold correlation
    df = df.drop(corr_sales[corr_sales < coeff_threshold].index, axis = 1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    values_count = {}
    for c in nominal_features:
        if c in df.columns:
            unique_values = len(df[c].value_counts())
            values_count[c] = unique_values
    values_count = pd.Series(values_count).sort_values()

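    #Keep only the uniq_threshold columns with the fewest unique values
    #(max() guards against a negative count, which tail() would read as "all but the first n")
    n_drop = max(len(values_count) - uniq_threshold, 0)
    df = df.drop(values_count.tail(n_drop).index, axis = 1)
    
    #Now we will transform the remaining columns into categorical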
    text_cols = df.select_dtypes(include=['object'])
    for c in text_cols:
        df[c] = df[c].astype("category")
        
    #Transform to dummies, concatenate and drop the original columns
    dummies = pd.get_dummies(df.select_dtypes(include = ["category"]))
    df = pd.concat([df,dummies], axis = 1)
    df = df.drop(text_cols, axis = 1)
    
    return df

In [24]:
transform_df = transform_features(data)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)
rmse

36623.53562910469

## Train and test

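Now for the final part of the pipeline: training and testing. When iterating on different features, simple holdout validation is a good starting point. Let's add a parameter named k to train_and_test() that controls the type of validation: k = 0 uses holdout validation (the first 1460 rows for training, the rest for testing), k = 1 uses simple cross validation (two shuffled folds, each used once for training and once for testing), and any larger k uses k-fold cross validation.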
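
To make the idea of k-fold cross validation concrete, here is a hand-rolled sketch using only NumPy. The `kfold_rmse` helper and the synthetic data are assumptions for illustration; this is not how scikit-learn implements `KFold` or `cross_val_score` internally, only the same idea written out step by step.

```python
import numpy as np

def kfold_rmse(X, y, k, seed=1):
    """Average RMSE of a least-squares linear fit over k shuffled folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)
    rmses = []
    for i in range(k):
        # Fold i is held out for testing; the other folds are used for training
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column and solve ordinary least squares
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors = A_test @ coef - y[test_idx]
        rmses.append(np.sqrt(np.mean(errors ** 2)))
    return float(np.mean(rmses))

# Synthetic data, for illustration only: y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=100)

print(kfold_rmse(X, y, k=4))
```

Every row is used for testing exactly once, so the averaged RMSE depends less on one particular train/test split than a single holdout score does.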

In [25]:
from sklearn.model_selection import cross_val_score, KFold

def transform_features(df):
    #Copy the dataframe
    df_copy = df.copy()
    total_values = len(df_copy)
    missing_per_column = (df_copy.isnull().sum().sort_values()/total_values) *100
    
    #Drop the columns with more than 5% of missing values
    missing_cols = missing_per_column[missing_per_column > 5].index
    df_copy = df_copy.drop(missing_cols, axis = 1)
    
    #Fill the missing values with the most common value in that numerical column
    numerical_cols = df_copy.select_dtypes(include=['float64', "int64"])
    df_copy[numerical_cols.columns] = df_copy[numerical_cols.columns].apply(lambda x: x.fillna(x.mode()[0]), axis = 0)
    
    #Drop text columns with 1 or more missing values
    text_cols = df_copy.select_dtypes(include=["object"]).columns
    null_text = df_copy[text_cols].isnull().sum()
    df_copy = df_copy.drop(null_text[null_text > 0].index, axis = 1)
    
    years_sold = df_copy["Yr Sold"] - df_copy["Year Built"]
    years_since_remod = df_copy["Yr Sold"] - df_copy["Year Remod/Add"]
    
    df_copy["years_sold"] = years_sold
    df_copy["years_since_remod"] = years_since_remod
    
    #Drop the rows with negative numbers
    df_copy = df_copy.drop([1702,2180,2181], axis = 0)
    
    #Drop the columns ["Yr Sold","Year Built","Year Remod/Add"]
    df_copy = df_copy.drop(["Yr Sold","Year Built","Year Remod/Add"], axis = 1)

    # Drop columns that aren't useful for ML
    df_copy = df_copy.drop(["PID", "Order"], axis=1)

    # Drop columns that leak info about the final sale
    df_copy = df_copy.drop(["Mo Sold", "Sale Condition", "Sale Type"], axis=1)

    return df_copy

def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    
    corr = df.select_dtypes(include=['float64', "int64"]).corr()
    corr_sales = np.abs(corr["SalePrice"]).sort_values(ascending = False)
    
    #Let's use only the columns with more than coeff_threshold correlation
    df = df.drop(corr_sales[corr_sales < coeff_threshold].index, axis = 1)
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    values_count = {}
    for c in nominal_features:
        if c in df.columns:
            unique_values = len(df[c].value_counts())
            values_count[c] = unique_values
    values_count = pd.Series(values_count).sort_values()

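    #Keep only the uniq_threshold columns with the fewest unique values
    #(max() guards against a negative count, which tail() would read as "all but the first n")
    n_drop = max(len(values_count) - uniq_threshold, 0)
    df = df.drop(values_count.tail(n_drop).index, axis = 1)
    
    #Now we will transform the remaining columns into categorical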
    text_cols = df.select_dtypes(include=['object'])
    for c in text_cols:
        df[c] = df[c].astype("category")
        
    #Transform to dummies, concatenate and drop the original columns
    dummies = pd.get_dummies(df.select_dtypes(include = ["category"]))
    df = pd.concat([df,dummies], axis = 1)
    df = df.drop(text_cols, axis = 1)
    
    return df

def train_and_test(df, k = 0): 
    features = df.columns.drop("SalePrice")
    
    if k == 0:
        train = df[:1460]
        test = df[1460:]

        lr = LinearRegression()
        lr.fit(train[features], train["SalePrice"])
        predictions = lr.predict(test[features])
        mse = mean_squared_error(test["SalePrice"], predictions)
        rmse = np.sqrt(mse)
        return rmse
    
    elif k == 1:
        shuffled = df.sample(frac=1)
        fold_one = shuffled[:1460]
        fold_two = shuffled[1460:]
        
        lr = LinearRegression()
        lr.fit(fold_one[features], fold_one["SalePrice"])
        predictions = lr.predict(fold_two[features])
        mse_1 = mean_squared_error(fold_two["SalePrice"], predictions)
        rmse_1 = np.sqrt(mse_1)
        
        lr = LinearRegression()
        lr.fit(fold_two[features], fold_two["SalePrice"])
        predictions = lr.predict(fold_one[features])
        mse_2 = mean_squared_error(fold_one["SalePrice"], predictions)
        rmse_2 = np.sqrt(mse_2)
        
        rmse_mean = np.mean([rmse_1,rmse_2])
        print("RMSE 1: ",rmse_1)
        print("RMSE 2: ",rmse_2)
        return rmse_mean
    else:
        kf = KFold(k, shuffle=True, random_state = 1)
        model = LinearRegression()
        mses = cross_val_score(model, df[features], df["SalePrice"],
                               scoring="neg_mean_squared_error",
                               cv=kf)
        rmses = np.sqrt(np.absolute(mses))
        avg_rmse = np.mean(rmses)
        print(rmses)
        return avg_rmse
        

In [26]:
df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df, k=4)
rmse

[37553.87780272 25271.4962232  25679.45748691 28976.54697124]


29370.34462101407