# Predicting House Sale Prices - Linear Regression and Feature Selection
In this project we will work with housing data for the city of Ames, Iowa, from 2006 to 2010. Our goal is to build a linear regression model that predicts house sale prices.
To make the model as accurate as possible, we will proceed as follows:
- **Data Exploration**: We will develop an understanding of each feature and get an idea of its relationship with our prediction target.
- **Data Cleaning/Feature Transformation**: Missing values will be treated. Columns with more than 25% missing values will be dropped. For columns with fewer missing values, the missing entries will be filled with the column's mode. Relevant categorical features will be transformed as required for use in our model.
- **Feature Selection**: We will study correlations to determine the most valuable features, and check for multicollinearity to avoid redundant features that make predictions less reliable.
- **Model Training and Testing**: We will train and test our model.

To demonstrate the importance of feature transformation and selection, we will compare the RMSE obtained with a barely treated model against the RMSE of a model whose features have been transformed and selected following explicit criteria.
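
For reference, RMSE (root mean squared error) is the square root of the mean squared difference between predicted and actual values, so it is expressed in the same units as `SalePrice`. Here is a minimal sketch with made-up numbers; the `rmse` helper and the values are illustrative and not part of the project pipeline:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not taken from the Ames data set)
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 290000]

# Every prediction is off by 10,000, so the RMSE is 10,000 as well
example_rmse = rmse(actual, predicted)
```

A lower RMSE means the predictions are, on average, closer to the real sale prices.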

## Data Exploration

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

### import scikit-learn classes
%matplotlib inline

data = pd.read_csv('AmesHousing.tsv',delimiter='\t')

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         29

## First Model - Without Optimizations
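The pipeline is made of three functions: `transform_features()`, which is a pass-through for now; `select_features()`, which for this first model keeps only `Gr Liv Area`; and `train_and_test()`, which trains on the first 1,460 rows, tests on the remaining rows, and returns the RMSE.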

In [2]:
def transform_features(df):
    #placeholder: returns the data unchanged for now
    return df

In [3]:
def select_features(df):
    return df[['Gr Liv Area']]

In [4]:
def train_and_test(df):
    df_trans_features = transform_features(df)
    train = df_trans_features[:1460]
    test = df_trans_features[1460:]
    train_features = select_features(train)
    test_features = select_features(test)
    #fit model using select_features()
    lr = LinearRegression()
    lr.fit(train_features,train['SalePrice'])
    #predict
    predictions = lr.predict(test_features)
    #return rmse
    rmse = mean_squared_error(test['SalePrice'],predictions)**(1/2)
    return rmse

In [5]:
train_and_test(data)

57088.25161263909

## Optimized Model
The optimized model reuses the same pipeline of functions.

To do:

- Transform Features: done
- Select Features: pending

### Feature Transformation

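To do: add the analysis behind each step. For now, only the end result is shown. `transform_features()` performs the following steps:

- Drops columns with more than 25% missing values.
- Fills the remaining missing values with each column's mode.
- Creates `years_sold_built` and `years_sold_remod`, the number of years between the sale and the construction or the remodel.
- Removes rows where either of these two new columns is negative.
- Drops the columns that leak information about the sale (every column containing `Sale` or `Sold`, except the target `SalePrice`), along with `Year Built` and `Year Remod/Add`.
- Drops `PID` and `Order`, which are not useful for the model.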

In [18]:
def transform_features(df):
    #delete columns with >25% missing values
    missing_values_pct = df.isnull().sum()*100/df.shape[0]
    drop_cols = missing_values_pct[missing_values_pct > 25].index 
    df_transformed = df.copy()
    for col in drop_cols:
        del df_transformed[col]
    #fill NA values with mode
    for col in df_transformed.columns:
        df_transformed[col] = df_transformed[col].fillna(df_transformed[col].mode()[0])
    #create 2 new columns that relate timeline of construction,remod and sale
    df_transformed['years_sold_built'] = df_transformed['Yr Sold'] - df_transformed['Year Built']
    df_transformed['years_sold_remod'] = df_transformed['Yr Sold'] - df_transformed['Year Remod/Add']
    #delete rows with negative values from the two columns created above
    del_index = df_transformed[
        (df_transformed['years_sold_built'] < 0) |
        (df_transformed['years_sold_remod'] < 0)
    ].index
    df_transformed.drop(del_index,inplace=True)
    #delete columns related to the sale: they leak sale info - EXCEPT SalePrice (target column)
    #the match is case sensitive so that our new columns (years_sold_*) are not deleted
    drop_cols = df_transformed.columns[
        (
            df_transformed.columns.str.contains('Sale') 
            | df_transformed.columns.str.contains('Sold')
        ) & ~ df_transformed.columns.str.contains('SalePrice')
    ].tolist() + ['Year Built','Year Remod/Add']
    
    for col in drop_cols:
        del df_transformed[col]
    #delete columns not useful for our model
    df_transformed = df_transformed.drop(["PID", "Order"], axis=1)
    
    return df_transformed

In [19]:
###double check: Missing values are treated
transformed_data = transform_features(data)
missing_vals_pct = transformed_data.isnull().sum()*100/transformed_data.shape[0]
missing_vals_pct.sort_values(ascending=False).head(10)

years_sold_remod    0.0
Foundation          0.0
Exterior 1st        0.0
Exterior 2nd        0.0
Mas Vnr Type        0.0
Mas Vnr Area        0.0
Exter Qual          0.0
Exter Cond          0.0
Bsmt Qual           0.0
Roof Style          0.0
dtype: float64

In [16]:
###double_check: years_sold_built and years_sold_remod have no negative values
transformed_data[['years_sold_built','years_sold_remod']].min()

years_sold_built    0
years_sold_remod    0
dtype: int64

### Feature Selection

To do: add the analysis used to choose features, and transform the selected categorical features to dummy variables.

- Establish correlation criteria (higher than ..., lower than ...; arbitrary parameters).
- Select categorical variables based on their number of unique values (arbitrary parameter).
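
The two criteria above could be sketched roughly as follows. This is only an illustration on toy data: the `pick_features` helper, the `min_corr` and `max_unique` thresholds, and the column names are hypothetical and are not values established in this notebook.

```python
import pandas as pd

def pick_features(df, target, min_corr=0.4, max_unique=10):
    """Return (numeric_cols, categorical_cols) chosen by two simple rules.

    min_corr and max_unique are hypothetical thresholds.
    """
    # Rule 1: keep numeric features whose |correlation| with the target is high enough
    numeric = df.select_dtypes(include="number")
    corr_with_target = numeric.corr()[target].abs().drop(target)
    numeric_cols = corr_with_target[corr_with_target > min_corr].index.tolist()

    # Rule 2: keep non-numeric features that have few distinct values
    categorical = df.select_dtypes(exclude="number")
    unique_counts = categorical.nunique()
    categorical_cols = unique_counts[unique_counts <= max_unique].index.tolist()
    return numeric_cols, categorical_cols

# Toy data standing in for the transformed Ames data
toy = pd.DataFrame({
    "area": [50, 80, 120, 200, 260, 300],
    "noise": [3, 1, 3, 1, 3, 1],
    "street": ["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"],
    "house_id": ["a", "b", "c", "d", "e", "f"],
    "price": [100, 150, 210, 330, 400, 480],
})

numeric_cols, categorical_cols = pick_features(toy, "price", min_corr=0.4, max_unique=3)

# One indicator column per category level of the selected categorical features
encoded = pd.get_dummies(toy[numeric_cols + categorical_cols], columns=categorical_cols)
```

Capping the number of unique values keeps columns such as `Neighborhood` from turning into dozens of dummy variables, at the cost of discarding whatever signal they carry.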

In [21]:
#Count the unique values of each categorical column to see which columns would
#produce too many dummy variables
transformed_data_object = transformed_data.select_dtypes(include='object')
unique_counts = transformed_data_object.nunique()
unique_counts.sort_values(ascending=False)

Neighborhood      28
Exterior 2nd      17
Exterior 1st      16
Condition 1        9
Condition 2        8
House Style        8
Functional         8
Roof Matl          8
MS Zoning          7
Foundation         6
BsmtFin Type 2     6
Roof Style         6
BsmtFin Type 1     6
Heating            6
Garage Type        6
Bsmt Cond          5
Garage Qual        5
Exter Cond         5
Mas Vnr Type       5
Heating QC         5
Kitchen Qual       5
Bsmt Qual          5
Bldg Type          5
Electrical         5
Lot Config         5
Garage Cond        5
Land Contour       4
Bsmt Exposure      4
Lot Shape          4
Exter Qual         4
Paved Drive        3
Utilities          3
Land Slope         3
Garage Finish      3
Central Air        2
Street             2
dtype: int64

In [None]:
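#restrict to numeric columns: recent pandas versions raise an error when corr() meets text columns
correlations = transform_features(data).select_dtypes(include='number').corr().abs()
top_15_corr = correlations['SalePrice'].sort_values(ascending=False)[0:15]
top_15_corr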

In [None]:
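#index transformed_data directly: the raw data lacks the engineered columns,
#and transform_features() would fail on a subset without 'Yr Sold'
top_15_corr_df = transformed_data[top_15_corr.index].corr().abs()
sns.heatmap(top_15_corr_df)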

In [None]:
top_15_corr.index.isnull().sum()

These are the 14 features that are most correlated with SalePrice (the top 15 list includes SalePrice itself).

Risk of multicollinearity: the heatmap shows high correlations between some features:
- Garage Cars and Garage Area: Garage Cars will not be used, since Garage Area gives us more detail (continuous feature vs. discrete feature).
- Total Bsmt SF and 1st Flr SF:
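
Besides reading the heatmap, strongly correlated feature pairs can also be listed programmatically. The sketch below uses toy data; the `correlated_pairs` helper, the 0.8 threshold, and the column names are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """List (feature_a, feature_b, |corr|) for feature pairs above threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left by the mask
    pairs = pairs[pairs > threshold]
    return [(a, b, float(value)) for (a, b), value in pairs.items()]

# Toy numeric data: garage_cars and garage_area move together, lot_area does not
toy = pd.DataFrame({
    "garage_cars": [1, 2, 2, 3, 1, 2],
    "garage_area": [250, 480, 500, 750, 240, 520],
    "lot_area": [9000, 7000, 12000, 8000, 11000, 7500],
})

pairs = correlated_pairs(toy, threshold=0.8)
```

For each reported pair, one of the two features would typically be dropped, as was done with Garage Cars above.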

In [None]:
def select_features(df):
    return df[['Gr Liv Area']]

### Training and Testing

To do:
- Use k-fold cross-validation when k is greater than 2 (to be checked).
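
A possible shape for the k-fold variant is sketched below, using scikit-learn's `KFold` on synthetic data. The `kfold_rmse` helper, the fold count, and the random seed are assumptions; this is not the final `train_and_test()` implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def kfold_rmse(X, y, k=4, seed=1):
    """Average RMSE of a linear regression over k shuffled folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for train_idx, test_idx in kf.split(X):
        lr = LinearRegression()
        lr.fit(X[train_idx], y[train_idx])
        predictions = lr.predict(X[test_idx])
        rmses.append(mean_squared_error(y[test_idx], predictions) ** 0.5)
    return float(np.mean(rmses))

# Synthetic data: price depends linearly on living area, plus noise
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200).reshape(-1, 1)
price = 100 * area[:, 0] + 20000 + rng.normal(0, 5000, size=200)

avg_rmse = kfold_rmse(area, price, k=4)
```

Averaging over several folds makes the reported RMSE less dependent on which rows happen to land in the test set than the single 1,460-row split used so far.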

In [None]:
def train_and_test(df):
    df_trans_features = transform_features(df)
    train = df_trans_features[:1460]
    test = df_trans_features[1460:]
    train_features = select_features(train)
    test_features = select_features(test)
    #fit model using select_features()
    lr = LinearRegression()
    lr.fit(train_features,train['SalePrice'])
    #predict
    predictions = lr.predict(test_features)
    #return rmse
    rmse = mean_squared_error(test['SalePrice'],predictions)**(1/2)
    return rmse

In [None]:
train_and_test(data)