Disha Jain - dj9am
Sameer Gupta - sg4vh
Jason Quinn - jtq5ba
Sindhu Ranga - sgr7va

## Project 1: Housing Price Prediction

<div class="alert alert-block alert-danger">
<b>Due: 9:29am, Tuesday, 21 January 2020</b>
</div>

Price prediction is one of the key ingredients in market design and
market competition. An important feature of the price in a competitive
market is that it arises as an outcome of the market equilibrium where
supply is equal to demand. As a result, factors that may be good
predictors of the price, such as the volume of market sales, do not
have a causal relationship with it.

The distinction between factors that have good predictive power and
causal factors becomes particularly important when prediction is
needed in changing market settings.  For instance, a change in
average disposable consumer income shifts market demand, causing a
change in both prices and the volume of sales. Accounting for
market conditions, substitution between products and variation in
consumer demographics makes such predictions even more challenging in
complex markets, such as the market for real estate.
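
The distinction between a good predictor and a causal factor can be illustrated with a small simulation. This is a minimal sketch that assumes linear demand and supply curves; every coefficient and the names `income`, `price`, and `volume` are hypothetical and are not taken from the Zillow data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear market (all coefficients are illustrative assumptions):
#   demand: Q = 100 + 2.0 * income - 1.5 * P
#   supply: Q = 10 + 1.0 * P
income = rng.normal(loc=50.0, scale=5.0, size=1000)

# Equilibrium: set demand equal to supply and solve for the price P
price = (100 + 2.0 * income - 10) / (1.5 + 1.0)
volume = 10 + 1.0 * price

# Volume predicts price almost perfectly here, yet neither causes the other:
# both are driven by the income shock that shifts demand.
corr_price_volume = np.corrcoef(price, volume)[0, 1]
print("corr(price, volume):", corr_price_volume)
```

In this toy market, a regression of price on volume would fit very well, but it would say nothing about what happens to price if volume were changed by some other mechanism, such as a shift in supply.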

In your first project you are tasked with constructing a model for predicting the median price of a detached 3-bedroom single-family home in the US. We provide some starting code to give you an example of how to analyze the Zillow data. Then, you need to perform your own analysis to answer the given questions. Your answers should be in the form of a clear argument that includes both well-written prose and code and its results (when the notebook is run).

<div class="alert alert-block alert-info">
You and your team members should work together on this
assignment. All team members should fully understand everything you
submit.  If there are parts you understand quickly but are new to your
partners, it is your responsibility to explain them to your partners
until everyone understands. If there are parts that your partners
understand quickly but that are new to you, it is your responsibility
to insist that your partners explain things to you until you
understand them well.
</div>

### Data

We will use data provided by Zillow: https://www.kaggle.com/c/zillow-prize-1/

<div class="alert alert-block alert-warning">
Download the data you need for this assignment from:
Collab/Resources/Datasets
</div>

The archive contains two `.csv` (comma-separated values) files and one Excel data dictionary file. Unzip the archive to extract the CSV files into a directory of your choice.

### Libraries

You will find it useful to install several relevant libraries for this project (which will also be useful for later projects). 

We recommend using these libraries (but you are welcome to use any open source libraries you prefer):

- [pandas](https://pandas.pydata.org/) (Python Data Analysis Library):
````
conda install pandas
````

- [numpy](http://www.numpy.org/) (if you installed Anaconda, this should already be installed; if not, follow the directions there)

- [StatsModels](https://www.statsmodels.org/stable/index.html) 
````
conda install statsmodels
````

<div class="alert alert-block alert-warning">
All members of your team should set up the data and these libraries on your own machine, so you can each run things locally.  You should also decide on a way to share the `project1.ipynb` file (it's up to you how to do this, but recommended options include using Dropbox, Google Drive, or a shared private GitHub repository).
</div>

Import the libraries and give them abbreviated names:

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

### Example Analysis

We choose two consecutive months arbitrarily. We use the first month's data to fit our regression model, and then test our model's performance on data from the next month. Here, we use July 2017 and August 2017.

In [2]:
month1 = [7] # month1 = July
month2 = [8] # month2 = August

Total taxable value of the property is typically a good proxy for its market value. We use it as an outcome variable.

In [4]:
# load the data, use the directory where you saved the data
df_properties = pd.read_csv('properties_2017.csv') 
df_train = pd.read_csv('train_2017.csv', parse_dates=["transactiondate"])

df_train['transactionmonth'] = df_train['transactiondate'].dt.month # create a new column indicating the month
df_merged = pd.merge(df_train, df_properties, on='parcelid', how='left') # merge two loaded files



Split the dataframe into `month1` and `month2`:

In [5]:
df_month1 = df_merged[df_merged['transactionmonth'].isin(month1)]  # save the data of July
df_month2 = df_merged[df_merged['transactionmonth'].isin(month2)]  # save the data of August

df_month1 = df_month1.fillna(0) # Be careful with missing values. Here we substitute 0 because we only use the number of bedrooms
df_month2 = df_month2.fillna(0)
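
Note that `fillna(0)` on the whole dataframe also replaces missing values of the outcome with 0. The sketch below, which uses a small made-up dataframe rather than the Zillow data, shows how zero-filling can distort a simple statistic compared with dropping the incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two of five homes have no recorded value
toy = pd.DataFrame({
    "bedroomcnt": [3, 3, 3, 3, 3],
    "taxvaluedollarcnt": [400000.0, np.nan, 500000.0, np.nan, 600000.0],
})

# Option 1: treat missing values as 0 (pulls the mean down)
mean_zero_filled = toy["taxvaluedollarcnt"].fillna(0).mean()

# Option 2: drop rows where the outcome is missing
mean_dropped = toy.dropna(subset=["taxvaluedollarcnt"])["taxvaluedollarcnt"].mean()

print("Mean after fillna(0):", mean_zero_filled)
print("Mean after dropna(): ", mean_dropped)
```

Which option is appropriate depends on what a missing value means for each column; for a count such as fireplaces, 0 may be reasonable, while for an assessed value it usually is not.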

For each month, create an independent variable (`x`) with the predictive variables, and a dependent variable (`y`).

In [6]:
y_month1 = df_month1['taxvaluedollarcnt']
x_month1 = df_month1['bedroomcnt']
y_month2 = df_month2['taxvaluedollarcnt']
x_month2 = df_month2['bedroomcnt']

Now fit the linear regression model on `month1`'s data:

In [7]:
model = sm.OLS(y_month1, x_month1).fit()  # fit the model
print(model.summary())      # print a summary of results

                                 OLS Regression Results                                
Dep. Variable:      taxvaluedollarcnt   R-squared (uncentered):                   0.430
Model:                            OLS   Adj. R-squared (uncentered):              0.430
Method:                 Least Squares   F-statistic:                              7164.
Date:                Mon, 20 Jan 2020   Prob (F-statistic):                        0.00
Time:                        14:15:51   Log-Likelihood:                     -1.3960e+05
No. Observations:                9490   AIC:                                  2.792e+05
Df Residuals:                    9489   BIC:                                  2.792e+05
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Predict month2's taxable assessment of a home using the model fitted above:

In [8]:
yhat = model.predict(x_month2)

Calculate mean squared error (MSE), variance and r-squared:

In [9]:
# MSE
serror = np.square(y_month2 - yhat)
mse = np.mean(serror)
print("MSE: ", mse)

# Sample Variance
ybar = y_month2.mean()
variance = np.mean((y_month2 - ybar)**2)
print("Variance: ", variance)

# Explanation of variance
# this is the r-squared 
rsq = 1 - (mse / variance)
print("R-squared: ", rsq)

MSE:  309344169217.8261
Variance:  329449869767.6554
R-squared:  0.061028102891674596
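
The regression summary reports an `R-squared (uncentered)` of 0.430, while the value computed above is 0.061. Part of this gap is definitional: the model was fitted without an intercept, so statsmodels reports the uncentered R-squared, which measures residuals against the sum of squared `y`, whereas the cell above measures them against the variance of `y` around its mean. The following sketch illustrates the difference on synthetic data (all values are hypothetical) using only numpy.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): the outcome has a large mean and a weak slope in x
x = rng.integers(1, 6, size=500).astype(float)
y = 400000.0 + 20000.0 * x + rng.normal(scale=100000.0, size=500)

# Regression through the origin, analogous to sm.OLS(y, x) without a constant
beta = (x @ y) / (x @ x)
resid = y - beta * x
ss_res = np.sum(resid ** 2)

r2_uncentered = 1 - ss_res / np.sum(y ** 2)              # baseline: always predict 0
r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)   # baseline: always predict the mean

print("Uncentered R-squared:", r2_uncentered)
print("Centered R-squared:  ", r2_centered)
```

Adding an intercept, for example with `sm.add_constant` applied to both `x_month1` and `x_month2`, would make the in-sample R-squared reported by statsmodels a centered one and therefore comparable with the out-of-sample calculation.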


### Assignment

You should complete the assignment by inserting cells in the notebook with your answers to these questions, including both prose and code you used for your analysis.

<div class="alert alert-block alert-warning">
Construct and estimate a linear regression model to predict the taxable value of 3-bedroom homes.
</div>

Provide an argument for which variables can potentially be good predictors for the value of interest and try to estimate the linear regression with all those variables included. Discuss which models you have considered estimating but decided to discard and why. Present and discuss the outcome of that estimation.  


   

In [12]:
df_merged.head()

Unnamed: 0,parcelid,logerror,transactiondate,transactionmonth,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,...,numberofstories,fireplaceflag,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock
0,14297519,0.025595,2017-01-01,1,,,,3.5,4.0,,...,,,485713.0,1023282.0,2016.0,537569.0,11013.72,,,60590630000000.0
1,17052889,0.055619,2017-01-01,1,,,,1.0,2.0,,...,1.0,,88000.0,464000.0,2016.0,376000.0,5672.48,,,61110010000000.0
2,14186244,0.005383,2017-01-01,1,,,,2.0,3.0,,...,1.0,,85289.0,564778.0,2016.0,479489.0,6488.3,,,60590220000000.0
3,12177905,-0.10341,2017-01-01,1,,,,3.0,4.0,,...,,,108918.0,145143.0,2016.0,36225.0,1777.51,,,60373000000000.0
4,10887214,0.00694,2017-01-01,1,1.0,,,3.0,3.0,,...,,,73681.0,119407.0,2016.0,45726.0,1533.89,,,60371240000000.0


In [135]:
try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder # Scikit-Learn < 0.20

    
try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer # Scikit-Learn < 0.20
    
try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer

num_attribs = ['calculatedbathnbr', 'basementsqft', 'finishedsquarefeet12','fireplacecnt','garagetotalsqft','buildingqualitytypeid']


In [136]:
corr_matrix = df_merged.select_dtypes(include='number').corr()  # correlations are only defined for numeric columns
corr_matrix["taxvaluedollarcnt"].sort_values(ascending=False)

taxvaluedollarcnt               1.000000
taxamount                       0.990001
landtaxvaluedollarcnt           0.957909
structuretaxvaluedollarcnt      0.796731
finishedsquarefeet12            0.592820
calculatedfinishedsquarefeet    0.583155
finishedfloor1squarefeet        0.565043
finishedsquarefeet50            0.559625
basementsqft                    0.493777
calculatedbathnbr               0.484819
fullbathcnt                     0.472832
bathroomcnt                     0.461013
fireplacecnt                    0.435830
yardbuildingsqft17              0.401433
garagetotalsqft                 0.345202
garagecarcnt                    0.328466
buildingqualitytypeid           0.327600
finishedsquarefeet15            0.309188
poolsizesum                     0.291222
bedroomcnt                      0.239326
threequarterbathnbr             0.169397
numberofstories                 0.136264
finishedsquarefeet6             0.119707
yearbuilt                       0.119309
typeconstruction

In [137]:
df_merged_3bedrooms = df_merged[df_merged['bedroomcnt'] == 3].copy()  # keep only 3-bedroom homes; copy so later assignments do not touch df_merged


housing = df_merged_3bedrooms[num_attribs]
housing.head()



Unnamed: 0,calculatedbathnbr,basementsqft,finishedsquarefeet12,fireplacecnt,garagetotalsqft,buildingqualitytypeid
2,2.0,,1243.0,,440.0,
4,3.0,,1312.0,,,8.0
5,2.0,,1492.0,1.0,0.0,
12,2.5,,1337.0,,0.0,
13,2.5,,1340.0,1.0,420.0,


In [138]:
# Fill missing values: medians for bathroom count and finished area,
# 0 for features where a missing value most plausibly means "none".
fill_values = {
    'calculatedbathnbr': housing['calculatedbathnbr'].median(),
    'finishedsquarefeet12': housing['finishedsquarefeet12'].median(),
    'basementsqft': 0,
    'buildingqualitytypeid': 0,
    'garagetotalsqft': 0,
    'fireplacecnt': 0,
}
housing = housing.fillna(value=fill_values)  # returns a new dataframe, so no slice is modified in place

# Sanity check: rows that still contain a missing value (expected to be empty)
incomplete_rows = housing[housing.isnull().any(axis=1)]
incomplete_rows





Unnamed: 0,calculatedbathnbr,basementsqft,finishedsquarefeet12,fireplacecnt,garagetotalsqft,buildingqualitytypeid


In [139]:
full_pipeline = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), num_attribs),
#         ("cat", OneHotEncoder(), cat_attribs),
    ])
housing = full_pipeline.fit_transform(housing)
housing

array([[2.000e+00, 0.000e+00, 1.243e+03, 0.000e+00, 4.400e+02, 0.000e+00],
       [3.000e+00, 0.000e+00, 1.312e+03, 0.000e+00, 0.000e+00, 8.000e+00],
       [2.000e+00, 0.000e+00, 1.492e+03, 1.000e+00, 0.000e+00, 0.000e+00],
       ...,
       [3.000e+00, 0.000e+00, 1.741e+03, 0.000e+00, 0.000e+00, 8.000e+00],
       [1.000e+00, 0.000e+00, 1.032e+03, 0.000e+00, 0.000e+00, 4.000e+00],
       [2.000e+00, 0.000e+00, 1.762e+03, 0.000e+00, 0.000e+00, 6.000e+00]])

In [159]:
y_vals = df_merged_3bedrooms['taxvaluedollarcnt'].fillna(0)  # treat missing assessed values as 0
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing, y_vals)
housing_predictions = lin_reg.predict(housing)
mse = mean_squared_error(y_vals, housing_predictions)
rmse = np.sqrt(mse)
print("MSE: " + str(mse))
print("RMSE: " + str(rmse))

from sklearn.model_selection import cross_val_score

scores = cross_val_score(lin_reg, housing, y_vals,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)
print("Scores:", rmse_scores)
print(pd.Series(rmse_scores).describe())

# model = sm.OLS(y_vals, housing).fit()  # fit the model
# print(model.summary())      # print a summary of results

MSE: 133338509829.28528
RMSE: 365155.4598103187
Scores: [376695.82588614 329630.87089521 370704.62373041 338043.41987295
 378757.16509726 388277.52446632 372487.20470042 346258.6084735
 344495.39581737 401012.2847421 ]
count        10.000000
mean     364636.292368
std       23573.452396
min      329630.870895
25%      344936.198981
50%      371595.914215
75%      378241.830294
max      401012.284742
dtype: float64




<div class="alert alert-block alert-info">
When first choosing which variables to include in our linear regression, we reasoned conceptually about what determines the market price of a three-bedroom house. We decided to consider variables such as square footage, bathroom count, and city, because we feel that homes are valued primarily through their location and features. We then computed the correlation of every variable with taxvaluedollarcnt to see which variables had the highest correlation. We chose our final variables of calculatedbathnbr, basementsqft, finishedsquarefeet12, fireplacecnt, garagetotalsqft, and buildingqualitytypeid because they were all positively correlated with taxvaluedollarcnt and, in our judgment, carried little risk of multicollinearity with one another. All of these variables had a correlation greater than 0.3 with the outcome variable. We discarded variables that were negatively correlated, had a very low correlation, or risked the integrity of the regression output, such as taxamount, landtaxvaluedollarcnt, and structuretaxvaluedollarcnt, which are computed from or sum to the assessed value itself.
</div>
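
The correlation output above only lists each variable's correlation with the outcome, so a claim about multicollinearity needs a separate check among the predictors themselves. One common check is the variance inflation factor (VIF). The following is a minimal sketch on synthetic data; the helper `variance_inflation_factors` and the toy values are hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the others."""
    X = features.to_numpy(dtype=float)
    vifs = {}
    for j, name in enumerate(features.columns):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        vifs[name] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Synthetic stand-in for the predictor matrix (hypothetical values)
rng = np.random.default_rng(2)
sqft = rng.normal(1500, 300, size=1000)
toy = pd.DataFrame({
    "finishedsquarefeet12": sqft,
    "calculatedbathnbr": 1 + sqft / 1000 + rng.normal(0, 0.3, size=1000),  # correlated with size
    "fireplacecnt": rng.integers(0, 3, size=1000).astype(float),           # unrelated to size
})

print(toy.corr())                      # pairwise correlations among predictors
vifs = variance_inflation_factors(toy)
print(vifs)
```

A VIF close to 1 indicates that a predictor is nearly uncorrelated with the others; larger values indicate that its coefficient is estimated less precisely. In the notebook, the same helper could be applied to the imputed `housing` data wrapped in a dataframe with `num_attribs` as column names.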

Compute the in-sample mean squared error of your prediction, i.e. the average of squared deviations between the price predicted by your model and the price that was actually observed in the data. Compare the mean-squared error of your model with the empirical variance of the price of 3-bedroom houses. How much of that price variance is explained by your model in percentage terms?


<div class="alert alert-block alert-info">
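The in-sample MSE of our model is 133338509829.28528, which corresponds to an RMSE of 365155.4598103187.

The share of price variance explained by the model is R^2 = 1 - MSE / Var(y), where Var(y) is the empirical variance of taxvaluedollarcnt over the 3-bedroom homes. Dividing the RMSE by the MSE (2.73856E-06) does not give this quantity, so the value 0.999997261 is not the R^2 of our model.

The std of 23573.452396 printed above is the spread of the ten cross-validation RMSE scores, not the standard deviation of home values, so its square (555707657.9) is not the empirical variance of the price. Because the linear regression includes an intercept, its in-sample MSE cannot exceed the empirical variance of the outcome.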

</div>
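
The comparison asked for here uses the same formula as the example analysis: R-squared is one minus the ratio of the MSE to the empirical variance of the observed prices. The sketch below wraps that calculation in a helper; the function name `explained_variance_share` and the toy numbers are hypothetical and are not results from the Zillow data.

```python
import numpy as np

def explained_variance_share(y_true, y_pred):
    """Return (mse, variance, r_squared), with the variance taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return mse, variance, 1 - mse / variance

# Toy numbers (hypothetical), only to show the arithmetic
y_true = [300000.0, 450000.0, 500000.0, 750000.0]
y_pred = [350000.0, 400000.0, 550000.0, 700000.0]
mse, variance, rsq = explained_variance_share(y_true, y_pred)
print("MSE:", mse)
print("Variance:", variance)
print("R-squared:", rsq)
```

In the notebook, the corresponding call would be `explained_variance_share(y_vals, housing_predictions)`, and multiplying the resulting R-squared by 100 gives the percentage of price variance explained.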

How does your analysis relate to the idea of generalization performance discussed in the introduction to V. Vapnik's _The Nature of Statistical Learning Theory_?


<div class="alert alert-block alert-info">
Our model regresses the outcome on the variables that are most strongly correlated with it. In Vapnik's terms, the in-sample MSE is the empirical risk, whereas generalization concerns the expected risk on data the model has not seen. The mean RMSE across our 10 cross-validation folds (364636.29) is close to the in-sample RMSE (365155.46), which suggests that the model should perform similarly on out-of-sample data drawn from the same market. However, any information that does not fit the categorical or numerical variables in the existing data set cannot be incorporated into the model. If a new variable were introduced, such as the color of a 3-bedroom property, our model could not make use of it without being re-estimated.
</div>
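
One way to probe generalization directly is to mirror the example analysis: fit on one month and evaluate on the next. The following is a minimal sketch on synthetic data, where `make_month` and its data-generating process are hypothetical stand-ins for the July and August samples.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_month(n):
    """Synthetic stand-in for one month of sales (hypothetical data-generating process)."""
    sqft = rng.normal(1500, 300, size=n)
    value = 50000 + 250 * sqft + rng.normal(0, 80000, size=n)
    return np.column_stack([np.ones(n), sqft]), value  # design matrix with intercept, outcome

X_train, y_train = make_month(2000)   # plays the role of month1
X_test, y_test = make_month(2000)     # plays the role of month2

# Minimise the empirical risk (in-sample MSE) by ordinary least squares
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

mse_in = np.mean((y_train - X_train @ coef) ** 2)    # empirical risk
mse_out = np.mean((y_test - X_test @ coef) ** 2)     # estimate of the expected risk

print("In-sample MSE:     ", mse_in)
print("Out-of-sample MSE: ", mse_out)
```

When both months come from the same process, as in this sketch, the two errors are close. A large gap on real data would indicate either overfitting or a change in market conditions between the two months.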