# Introduction



In this project, I am working on housing data for the city of Ames, Iowa, United States from 2006 to 2010. More infromation of the data can be read [here](https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). Information the different columns in the data can be read [here](https://s3.amazonaws.com/dq-content/307/data_description.txt)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import seaborn as sns

houses=pd.read_csv("AmesHousing.tsv",delimiter='\t')

In [2]:
def transform_features(houses):
    return houses

def select_features(houses):
    return houses[["Gr Liv Area","SalePrice"]]

In [3]:
# Transform column names by removing whitespaces
houses.columns = houses.columns.str.strip()
houses.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [4]:
def train_and_test(df):
    train=df[:1460]
    test=df[1460:]

    numeric_train=train.select_dtypes(include=["integer","float"])
    numeric_test=test.select_dtypes(include=['integer', 'float'])
    features = numeric_train.columns.drop("SalePrice")
    print(features)
    lr=LinearRegression()
    model=lr.fit(train[features],train["SalePrice"])
    predictions=model.predict(test[features])
    mse=mean_squared_error(test["SalePrice"],predictions)
    rmse=mse**(0.5)
    return rmse

In [5]:
transform_houses = transform_features(houses)
filtered_houses = select_features(transform_houses)
rmse = train_and_test(filtered_houses)
rmse

Index(['Gr Liv Area'], dtype='object')


57088.25161263909

## Feature Engineering
Handle missing values:
- All columns:
    - Drop any with 5% or more missing values for now.
- Text columns:
    - Drop any with 1 or more missing values for now.
- Numerical columns:
    - For columns with missing values, fill in with the most common value in that column


1: All columns: Drop any with 5% or more missing values for now.

In [6]:
# Finding percentage of missing values in columns in descending order
missing_values=houses.isnull().sum().sort_values(ascending=False)

missing_values

Pool QC            2917
Misc Feature       2824
Alley              2732
Fence              2358
Fireplace Qu       1422
Lot Frontage        490
Garage Qual         159
Garage Yr Blt       159
Garage Cond         159
Garage Finish       159
Garage Type         157
Bsmt Exposure        83
BsmtFin Type 2       81
BsmtFin Type 1       80
Bsmt Cond            80
Bsmt Qual            80
Mas Vnr Type         23
Mas Vnr Area         23
Bsmt Full Bath        2
Bsmt Half Bath        2
Garage Area           1
Garage Cars           1
Total Bsmt SF         1
Bsmt Unf SF           1
BsmtFin SF 2          1
BsmtFin SF 1          1
Electrical            1
Exterior 2nd          0
Exterior 1st          0
Roof Matl             0
                   ... 
Heating               0
Exter Cond            0
Functional            0
Sale Type             0
Yr Sold               0
Mo Sold               0
Misc Val              0
Pool Area             0
Screen Porch          0
3Ssn Porch            0
Enclosed Porch  

In [7]:
# Filter columns that has missing values >5%
drop_cols=missing_values[(missing_values>len(houses)/20)].index

drop_cols

Index(['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu',
       'Lot Frontage', 'Garage Qual', 'Garage Yr Blt', 'Garage Cond',
       'Garage Finish', 'Garage Type'],
      dtype='object')

Let's remove 'Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu',
       'Lot Frontage', 'Garage Qual', 'Garage Yr Blt', 'Garage Cond',
       'Garage Finish', 'Garage Type' columns as they have more than 5% of missing values


In [8]:
houses=houses.drop(drop_cols,axis=1)

 Text columns: Drop any with 1 or more missing values for now.

In [9]:
## Series object: column name -> number of missing values
text_missing_cols = houses.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)

## Filter Series to columns containing *any* missing values
drop_missing_cols_2 = text_missing_cols[text_missing_cols > 0]
drop_missing_cols_2
houses = houses.drop(drop_missing_cols_2.index, axis=1)

 Numerical columns: For columns with missing values, fill in with the most common value in that column

In [10]:
## Compute column-wise missing value counts
num_missing = houses.select_dtypes(include=['int', 'float']).isnull().sum()
fixable_numeric_cols = num_missing[(num_missing < len(houses)/20) & (num_missing > 0)].sort_values()
fixable_numeric_cols

BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Garage Cars        1
Garage Area        1
Bsmt Full Bath     2
Bsmt Half Bath     2
Mas Vnr Area      23
dtype: int64

In [11]:
## Compute the most common value for each column in `fixable_nmeric_missing_cols`.
replacement_values_dict = houses[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
replacement_values_dict

{'BsmtFin SF 1': 0.0,
 'BsmtFin SF 2': 0.0,
 'Bsmt Unf SF': 0.0,
 'Total Bsmt SF': 0.0,
 'Garage Cars': 2.0,
 'Garage Area': 0.0,
 'Bsmt Full Bath': 0.0,
 'Bsmt Half Bath': 0.0,
 'Mas Vnr Area': 0.0}

In [12]:
houses=houses.fillna(replacement_values_dict)

In [13]:
## Verify that every column has 0 missing values
houses.isnull().sum().value_counts()

0    64
dtype: int64

## Creating new columns
For instance,Yr Sold and Year Built are discrete values with which we can't make much inference and relevance to our data. So,converting them to creating a meaningful information would help with our model.

The two main issues with these features are:

- Year values aren't representative of how old a house is
- The Year Remod/Add column doesn't actually provide useful information for a linear regression model

The challenge with year values like 1960 and 1961 is that they don't do a good job of capturing how old a house is. 
Instead of the years certain events happened, we want the difference between those years. We should create a new column that's the difference between both of these columns.

In [14]:
# Gives information before sold
years_sold = houses['Yr Sold'] - houses['Year Built']
years_sold[years_sold < 0]

2180   -1
dtype: int64

In [15]:
# Gives information years since remodel

years_since_remod = houses['Yr Sold'] - houses['Year Remod/Add']
years_since_remod[years_since_remod < 0]

1702   -1
2180   -2
2181   -1
dtype: int64

In [16]:
## Drop rows with negative values 
houses.drop([1780,2180,2181],axis=0)

houses["year_before_sold"]=years_sold
houses["year_since_remodel"]=years_since_remod

# Drop those yr built,yr remodel cols

houses=houses.drop(['Year Remod/Add','Year Built'],axis=1)

## Remove columns that leak information

More information on features of the dataset can be found [here](https://s3.amazonaws.com/dq-content/307/data_description.txt)

- Drop the columns that doesn't provide infromation to ML model. Ex: PID, Order columns
- Drop the columns that have potential to leak information on final prediction.

#### After a quick look at the dataset, we found that:

- In numerical columns, there are columns of which values are obviously not directly related to the sale price. We should drop these columns: Order, PID.
-  There are missing values in a few numerical columns. For the 8 columns that only have one or two missing values, we can just drop the rows. For the rest numerical columns with missing values, after checking, we can inpute the columns with their mean.
- There are a few non-numerical columns contain a lot of null values like Pool QC. Considering the large amount of missing data, we should just drop those columns. We will arbitrarily set the cut-off of at 25%.
-  From the frequency table of non-numerical columns, we can tell all of them are categorical, and that's what we will transform them into.

In [17]:
houses=houses.drop(["Mo Sold","Sale Type", "Yr Sold","Sale Condition","Order","PID"],axis=1)

In [18]:
houses.columns

Index(['MS SubClass', 'MS Zoning', 'Lot Area', 'Street', 'Lot Shape',
       'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood',
       'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Area', 'Exter Qual',
       'Exter Cond', 'Foundation', 'BsmtFin SF 1', 'BsmtFin SF 2',
       'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Functional', 'Fireplaces', 'Garage Cars', 'Garage Area', 'Paved Drive',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
       'Screen Porch', 'Pool Area', 'Misc Val', 'SalePrice',
       'year_before_sold', 'year_since_remodel'],
      dtype='object')

In [19]:
def transform_features(df):
    # Transform column names by removing whitespaces
    df.columns = df.columns.str.strip()
    # Finding percentage of missing values in columns in descending order
    num_missing = df.isnull().sum()
    # drop columns with more than 5% missing values
    drop_missing_cols = num_missing[(num_missing > len(df)/20)].sort_values()
    df = df.drop(drop_missing_cols.index, axis=1)
    
    
    ## Series object: column name -> number of missing values
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)

    ## Text Columns: Filter Series to columns containing *any* missing values
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df = df.drop(drop_missing_cols_2.index, axis=1)

    ## Compute column-wise missing value counts
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    ## Compute the most common value for each column in `fixable_nmeric_missing_cols`.
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
    # Create New Columns
    # Gives information before sold
    years_sold = df['Yr Sold'] - df['Year Built']
    # Gives information years since remodel
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    ## Drop rows with negative values in new columns
    df = df.drop([1702, 2180, 2181], axis=0)
    
    # Drop those yr sold,yr built,yr remodel cols,columns that leak information

    df = df.drop(["PID", "Order",'Yr Sold', "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)
    print(df.columns)
    return df


def select_features(df):
    return df[["Gr Liv Area", "SalePrice"]]

def train_and_test(df):  
    train = df[:1460]
    test = df[1460:]
    
    ## You can use `pd.DataFrame.select_dtypes()` to specify column types
    ## and return only those columns as a data frame.
    numeric_train = train.select_dtypes(include=['integer', 'float'])
    numeric_test = test.select_dtypes(include=['integer', 'float'])
    
    ## You can use `pd.Series.drop()` to drop a value.
    features = numeric_train.columns.drop("SalePrice")
    
    lr = LinearRegression()
    lr.fit(train[features], train["SalePrice"])
    predictions = lr.predict(test[features])
    mse = mean_squared_error(test["SalePrice"], predictions)
    rmse = np.sqrt(mse)
    
    return rmse

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse = train_and_test(filtered_df)

rmse

Index(['MS SubClass', 'MS Zoning', 'Lot Area', 'Street', 'Lot Shape',
       'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood',
       'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Area', 'Exter Qual',
       'Exter Cond', 'Foundation', 'BsmtFin SF 1', 'BsmtFin SF 2',
       'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Functional', 'Fireplaces', 'Garage Cars', 'Garage Area', 'Paved Drive',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
       'Screen Porch', 'Pool Area', 'Misc Val', 'SalePrice',
       'Years Before Sale', 'Years Since Remod'],
      dtype='object')


55275.367312413066

## Correlation between features

In [20]:
corrmat=transform_df.select_dtypes(include=['integer', 'float'])
corr_wrt_saleprice=corrmat.corr()["SalePrice"].abs().sort_values()
corr_wrt_saleprice

BsmtFin SF 2         0.006127
Misc Val             0.019273
3Ssn Porch           0.032268
Bsmt Half Bath       0.035875
Low Qual Fin SF      0.037629
Pool Area            0.068438
MS SubClass          0.085128
Overall Cond         0.101540
Screen Porch         0.112280
Kitchen AbvGr        0.119760
Enclosed Porch       0.128685
Bedroom AbvGr        0.143916
Bsmt Unf SF          0.182751
Lot Area             0.267520
2nd Flr SF           0.269601
Bsmt Full Bath       0.276258
Half Bath            0.284871
Open Porch SF        0.316262
Wood Deck SF         0.328183
BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: Sale

Let's keep columns that have correlation coefficient larger than 0.4 (we can set our cutoff,its worth experimenting)


In [21]:
corr_wrt_saleprice[corr_wrt_saleprice>0.4]

BsmtFin SF 1         0.439284
Fireplaces           0.474831
TotRms AbvGrd        0.498574
Mas Vnr Area         0.506983
Years Since Remod    0.534985
Full Bath            0.546118
Years Before Sale    0.558979
1st Flr SF           0.635185
Garage Area          0.641425
Total Bsmt SF        0.644012
Garage Cars          0.648361
Gr Liv Area          0.717596
Overall Qual         0.801206
SalePrice            1.000000
Name: SalePrice, dtype: float64

In [22]:
transform_df=transform_df.drop(corr_wrt_saleprice[corr_wrt_saleprice<=0.4].index,axis=1)

### Transforming non-numeric columns into categorical datatype

To do the transformation,we need to follow the steps
- Finding non-numeric columns and find unique values in the each of those features.
- Setting a cutoff 10 unique values in each feature. Feature needs to be removed if unique count is more than 10. 
- Creating dummy features from the filtered features and should be added back to our dataframe and also we should remove the original columns related to dummy features.

Note:
categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category will low variance. This would be similar to a low variance numerical feature (no variability in the data for the model to capture). 

In [23]:
cat_nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
transform_cat_cols = []
for col in cat_nominal_features:
    if col in transform_df.columns:
        transform_cat_cols.append(col)

In [24]:
#Finding count of unique values in non-numeric cols
unique_nonnumeric_count=transform_df[transform_cat_cols].apply(lambda col:len(col.value_counts()))
unique_nonnumeric_count

MS Zoning        7
Street           2
Land Contour     4
Lot Config       5
Neighborhood    28
Condition 1      9
Condition 2      8
Bldg Type        5
House Style      8
Roof Style       6
Roof Matl        8
Exterior 1st    16
Exterior 2nd    17
Foundation       6
Heating          6
Central Air      2
dtype: int64

In [25]:
# Drop features more than 10 unique values
transform_df=transform_df.drop(unique_nonnumeric_count[unique_nonnumeric_count>10].index,axis=1)


In [26]:

text_cols=transform_df.select_dtypes(include=["object"])
# Converting nominal to categorical features
for col in text_cols:
    transform_df[col]=transform_df[col].astype("category")

# Create dummy variables for caterogrical features
transform_df=pd.concat([
    transform_df, 
    pd.get_dummies(transform_df.select_dtypes(include=['category']))
], axis=1)
transform_df=transform_df.drop(text_cols,axis=1)

In [27]:
transform_df["SalePrice"]

0       215000
1       105000
2       172000
3       244000
4       189900
5       195500
6       213500
7       191500
8       236500
9       189000
10      175900
11      185000
12      180400
13      171500
14      212000
15      538000
16      164000
17      394432
18      141000
19      210000
20      190000
21      170000
22      216000
23      149000
24      149900
25      142000
26      126000
27      115000
28      184000
29       96000
         ...  
2900    320000
2901    369900
2902    359900
2903     81500
2904    215000
2905    164000
2906    153500
2907     84500
2908    104500
2909    127000
2910    151400
2911    126500
2912    146500
2913     73000
2914     79400
2915    140000
2916     92000
2917     87550
2918     79500
2919     90500
2920     71000
2921    150900
2922    188000
2923    160000
2924    131000
2925    142500
2926    131000
2927    132000
2928    170000
2929    188000
Name: SalePrice, Length: 2927, dtype: int64

In [28]:
rmse=train_and_test(transform_df)
rmse

33367.28718340389

In [29]:
def transform_features(df):
    # Transform column names by removing whitespaces
    df.columns = df.columns.str.strip()
    # Finding percentage of missing values in columns in descending order
    num_missing = df.isnull().sum()
    # drop columns with more than 5% missing values
    drop_missing_cols = num_missing[(num_missing > len(df)/20)].sort_values()
    df = df.drop(drop_missing_cols.index, axis=1)
    
    
    ## Series object: column name -> number of missing values
    text_mv_counts = df.select_dtypes(include=['object']).isnull().sum().sort_values(ascending=False)

    ## Text Columns: Filter Series to columns containing *any* missing values
    drop_missing_cols_2 = text_mv_counts[text_mv_counts > 0]
    df = df.drop(drop_missing_cols_2.index, axis=1)

    ## Compute column-wise missing value counts
    num_missing = df.select_dtypes(include=['int', 'float']).isnull().sum()
    fixable_numeric_cols = num_missing[(num_missing < len(df)/20) & (num_missing > 0)].sort_values()
    ## Compute the most common value for each column in `fixable_nmeric_missing_cols`.
    replacement_values_dict = df[fixable_numeric_cols.index].mode().to_dict(orient='records')[0]
    df = df.fillna(replacement_values_dict)
    # Create New Columns
    # Gives information before sold
    years_sold = df['Yr Sold'] - df['Year Built']
    # Gives information years since remodel
    years_since_remod = df['Yr Sold'] - df['Year Remod/Add']
    df['Years Before Sale'] = years_sold
    df['Years Since Remod'] = years_since_remod
    ## Drop rows with negative values in new columns
    df = df.drop([1702, 2180, 2181], axis=0)
    
    # Drop those yr sold,yr built,yr remodel cols,columns that leak information

    df = df.drop(["PID", 'Yr Sold',"Order", "Mo Sold", "Sale Condition", "Sale Type", "Year Built", "Year Remod/Add"], axis=1)
    return df


def select_features(df, coeff_threshold=0.4, uniq_threshold=10):
    corrmat=df.select_dtypes(include=['integer', 'float'])
    corr_wrt_saleprice=corrmat.corr()["SalePrice"].abs().sort_values()
    df=df.drop(corr_wrt_saleprice[corr_wrt_saleprice<=0.4].index,axis=1)
    
    ## Create a list of column names from documentation that are *meant* to be categorical
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_cols = []
    for col in nominal_features:
        if col in df.columns:
            transform_cat_cols.append(col)
    #Finding count of unique values in non-numeric cols
    unique_nonnumeric_count=df[transform_cat_cols].apply(lambda col:len(col.value_counts())).sort_values()
    
    # drop nominal features with more than uniq_threshold
    drop_nonuniq_cols = unique_nonnumeric_count[unique_nonnumeric_count > uniq_threshold].index
    df = df.drop(drop_nonuniq_cols, axis=1)
    
    text_cols = df.select_dtypes(include=['object'])
    for col in text_cols:
        # Converting nominal to categorical features
        df[col] = df[col].astype('category')
    df = pd.concat([df, pd.get_dummies(df.select_dtypes(include=['category']))], axis=1).drop(text_cols,axis=1)
    
    return df

def train_and_test(df, k=0):

    numeric_df = df.select_dtypes(include=['integer', 'float'])
    features = numeric_df.columns.drop("SalePrice")
    lr = LinearRegression()
    np.random.seed(1)
    if k == 0:
        train = df[:1460]
        test = df[1460:]

        lr.fit(train[features], train["SalePrice"])
        predictions_test = lr.predict(test[features])
        # Out sample error
        mse_test = mean_squared_error(test["SalePrice"], predictions_test)
        rmse_test = np.sqrt(mse_test)
        
        #in sample error
        predictions_train = lr.predict(train[features])
        mse_train= mean_squared_error(train["SalePrice"], predictions_train)
        rmse_train = np.sqrt(mse_train)
        var_test=np.var(predictions_test)

        return [rmse_test,rmse_train,var_test]
    
    if k == 1:
        # Randomize *all* rows (frac=1) from `df` and return
        #specifying drop=True prevents .reset_index from creating a column containing the old index entries.
        shuffled_df = df.sample(frac=1, ).reset_index(drop=True)
        train = df[:1460]
        test = df[1460:]
        
        lr.fit(train[features], train["SalePrice"])
        predictions_one_test = lr.predict(test[features])        
        # out sample error
        mse_one_test = mean_squared_error(test["SalePrice"], predictions_one_test)
        rmse_one_test = np.sqrt(mse_one_test)
        var_one=np.mean(predictions_one_test)
    
        # in sample error
        predictions_one_train = lr.predict(train[features])
        mse_one_train = mean_squared_error(train["SalePrice"], predictions_one_train)
        rmse_one_train = np.sqrt(mse_one_train)
        
        lr.fit(test[features], test["SalePrice"])
        predictions_two_test = lr.predict(train[features])    
        # out sample error
        mse_two_test = mean_squared_error(train["SalePrice"], predictions_two_test)
        rmse_two_test  = np.sqrt(mse_two_test )
        
        # in sample error
        predictions_two_train = lr.predict(test[features])
        mse_two_train = mean_squared_error(test["SalePrice"], predictions_two_train)
        rmse_two_train = np.sqrt(mse_two_train)
        var_two=np.mean(predictions_two_test)
        avg_rmse_test= np.mean([rmse_one_test, rmse_two_test])
        avg_rmse_train= np.mean([rmse_one_train, rmse_two_train])
        avg_var=np.mean([var_one,var_two])
        return [avg_rmse_test,avg_rmse_train,avg_var]
    else:
        kf = KFold(n_splits=k, shuffle=True)
        rmse_values_test = []
        rmse_values_train = []
        var_values_test=[]
        for train_index, test_index, in kf.split(df):
            train = df.iloc[train_index]
            test = df.iloc[test_index]
            lr.fit(train[features], train["SalePrice"])
            # predicting for test dataset
            predictions_test = lr.predict(test[features])
            mse_test = mean_squared_error(test["SalePrice"], predictions_test)
            rmse_test = np.sqrt(mse_test)
            rmse_values_test.append(rmse_test)
            # Predicting for training dataset
            predictions_train = lr.predict(train[features])
            mse_train = mean_squared_error(train["SalePrice"], predictions_train)
            rmse_train = np.sqrt(mse_train)
            rmse_values_train.append(rmse_train)
            
            # Variance calculation for test set 
            var_test=np.var(predictions_test)
            var_values_test.append(var_test)
        # Calculating in-sample-error (training on train dataset)
        avg_rmse_test = np.mean(rmse_values_test)
                
        # Calculating out-sample-error (training on train dataset)
        avg_rmse_train = np.mean(rmse_values_train)
        
        # Average Variance
        avg_var=np.mean(var_values_test)
        return [avg_rmse_test,avg_rmse_train,avg_var]

df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
transform_df = transform_features(df)
filtered_df = select_features(transform_df)
rmse=train_and_test(filtered_df, k=5)[0]
rmse

28832.51424964921

In [30]:
kfold_avg_rmse_test={}
kfold_avg_rmse_train={}
kfold_avg_var=[]
for k in range(0,31):
    df = pd.read_csv("AmesHousing.tsv", delimiter="\t")
    transform_df = transform_features(df)
    filtered_df = select_features(transform_df)
    model_values=train_and_test(filtered_df, k=k)
    kfold_avg_rmse_test[k]=model_values[0]
    kfold_avg_rmse_train[k]=model_values[1]
    kfold_avg_var.append(model_values[2])


In [31]:
k_fold_rmse_df=pd.DataFrame(kfold_avg_rmse_test.items(), columns=['K_Folds', 'Avg_rmse_test'])

k_fold_rmse_df["Avg_rmse_train"]=k_fold_rmse_df["K_Folds"].map(kfold_avg_rmse_train)
k_fold_rmse_df["Avg_var"]=kfold_avg_var
k_fold_rmse_df["Error_diff"]=k_fold_rmse_df["Avg_rmse_test"]-k_fold_rmse_df["Avg_rmse_train"]
k_fold_rmse_df

Unnamed: 0,K_Folds,Avg_rmse_test,Avg_rmse_train,Avg_var,Error_diff
0,0,33367.287183,23178.242653,5567576000.0,10189.04453
1,1,30261.212834,23789.624698,181099.2,6471.588136
2,2,29785.012118,23912.753647,5976302000.0,5872.258471
3,3,29482.53943,24246.472566,5949229000.0,5236.066864
4,4,29111.434973,24387.656118,5947165000.0,4723.778855
5,5,28832.51425,24437.422137,5936472000.0,4395.092113
6,6,28702.689556,24473.504443,5913157000.0,4229.185113
7,7,28592.540428,24504.842607,5926215000.0,4087.697822
8,8,28555.990293,24516.445309,5910527000.0,4039.544984
9,9,28476.629925,24523.73516,5930357000.0,3952.894766


In [44]:
best_k_value=k_fold_rmse_df[(k_fold_rmse_df["Avg_rmse_test"]==np.min(k_fold_rmse_df["Avg_rmse_test"]))]
best_k_value=best_k_value["K_Folds"].values
best_k_value

array([30], dtype=int64)

In [47]:
lowest_rmse = min(k_fold_rmse_df["Avg_rmse_test"])
print("The cross validation with {} kfolds resulted lowest average rmse of {}".format(best_k_value[0],lowest_rmse))

The cross validation with 30 kfolds resulted lowest average rmse of 27360.643611856955


# Conclusion

Before feature transformation and selection the average RMSE was 57088. Later, after the feature transformation, I was able to reduce RMSE to 55275.By removing the features which had potential to leak data and correlation value lower than 0.4, the RMSE  reduced to 33367. 

- Variance of the model at kfold 1 is lowest. But, we need to understand that as the models complexity increases error decrease and variance increase. We should decide what are the trade-offs for our model. 
- So, it always best to have a lowest error model with optimal variance.

- The model developed in this project has reduce the rmse value by 52%. Finally, by using various k values for cross validation, I was able to optimize the model at k value of 30 which resulted in average rmse of 27361.

