<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 120px">

# Project 2 - Ames Housing Data and Kaggle Challenge (Replace with Actual Title)
### Fill out this cell as the project progresses, then move to README.md as technical report.

*Deval Mehta*

## Table of Contents
1) [Overview](#Overview) 
2) [Data](#Data-Dictionary)
3) [Requirements](#Requirements)
4) [Executive Summary](#Executive-Summary)
    1) [Purpose](#Purpose)
    2) [Methods](#Methods)
    3) [Findings](#Findings)
    4) [Next Steps](#Next-Steps)

## Overview
Our objective in this project is to create a regression model that accurately predicts the price of a house at sale in Ames, IA.

## Data Dictionary

### Original Data
The original dataset contains 79 non-index parameters, each introducing a different piece of information regarding a listing.

| Variable | Data Type | Description | Notes |
|---|---|---|---|
| MS SubClass | `int64` | The classification of the building | Codified to numbers; see the original data documentation for the cipher |
| MS Zoning | `string` | General zoning classification of the sale | Codified into strings; see the original data documentation for the cipher |
| Lot Frontage | `float64` | Linear feet of street connected to the property | |
| Lot Area | `int64` | Lot size in square feet | |
| Street | `string` | Type of road access to property | Gravel or Paved |
| Alley | `string` | Type of alley access to property | Gravel or Paved |
| Lot Shape | `string` | General shape of property | Degree of irregularity |
| Land Contour | `string` | Flatness of the property | Level, Banked, Hillside, or Low Depression |
| Utilities | `string` | Type of utilities available | Electric, Gas, Water, Sewer |
| Lot Config | `string` | Lot Configuration | Where on a block or in a neighborhood the lot lands |
| Land Slope | `string` | Slope of the property | Categorized from "gentle" to "severe" |
| Neighborhood | `string` | Physical locations within Ames city limits | |
| Condition 1 | `string` | Proximity to main road or railroad | |
| Condition 2 | `string` | Proximity to main road or railroad | if a second is present |
| Bldg Type | `string` | Type of dwelling | More modern classification than MS SubClass |
| House Style | `string` | Style of dwelling | Number of stories |
| Overall Qual | `int64` | Overall material and finish quality | Scale 1 - 10 |
| Overall Cond | `int64` | Overall condition rating | Scale 1 - 10 |
| Year Built | `int64` | Original construction date | |
| Year Remod/Add | `int64` | Remodel date | Same as construction date if no remodeling or additions |
| Roof Style | `string` | Type of roof | |
| Roof Matl | `string` | Roofing material | |
| Exterior 1st | `string` | Exterior covering on house | |
| Exterior 2nd | `string` | Exterior covering on house | if more than one material |
| Mas Vnr Type | `string` | Masonry veneer type | |
| Mas Vnr Area | `int64` | Masonry veneer area | |
| Exter Qual | `string` | Exterior material quality | Scale Poor to Excellent (five grades: Po, Fa, TA, Gd, Ex) |
| Exter Cond | `string` | Present condition of the material on the exterior | Same grading system as Exter Qual |
| Foundation | `string` | Type of foundation | |
| Bsmt Qual | `string` | Height of basement | |
| Bsmt Cond | `string` | General condition of basement | |
| Bsmt Exposure | `string` | Walkout or garden level basement walls | |
| BsmtFin Type 1 | `string` | Quality of basement finished area | |
| BsmtFin SF 1 | `int64` | Type 1 finished square footage | |
| BsmtFin Type 2 | `string` | Quality of second finished area (if present) | |
| BsmtFin SF 2 | `float64` | Type 2 finished square feet | |
| Bsmt Unf SF | `int64` | Unfinished square feet of basement area | |
| Total Bsmt SF | `int64` | Total square feet of basement area | |
| Heating | `string` | Type of heating | |
| Heating QC | `string` | Heating quality and condition | |
| Central Air | `string` | Central air conditioning | |
| Electrical | `string` | Electrical system | |
| 1st Flr SF | `int64` | First floor square footage | |
| 2nd Flr SF | `int64` | Second floor square footage | if it exists |
| Low Qual Fin SF | `int64` | Low quality finished square footage | all floors |
| Gr Liv Area | `int64` | Above grade living area square footage | |
| Bsmt Full Bath | `int64` | Number of basement full bathrooms | |
| Bsmt Half Bath | `int64` | Number of basement half bathrooms | |
| Bedroom AbvGr | `int64` | Number of bedrooms above grade | |
| Kitchen AbvGr | `int64` | Number of kitchens above grade | |
| Kitchen Qual | `string` | Kitchen quality | |
| TotRms AbvGrd | `int64` | Total rooms above grade | does not include bathrooms |
| Functional | `string` | Home functionality rating | |
| Fireplaces | `int64` | Number of fireplaces | |
| Fireplace Qu | `string` | Fireplace quality | |
| Garage Type | `string` | Garage type and location | |
| Garage Yr Blt | `float64` | Year garage was built | |
| Garage Finish | `string` | Interior finish of the garage | |
| Garage Cars | `int64` | Car capacity of garage | |
| Garage Area | `int64` | Square footage of garage | |
| Garage Qual | `string` | Garage quality | |
| Garage Cond | `string` | Garage condition | |
| Paved Drive | `string` | Paved driveway | |
| Wood Deck SF | `int64` | Wood deck area in square feet | |
| Open Porch SF | `int64` | Open porch area in square feet | |
| 3Ssn Porch | `int64` | Three season porch area in square feet | |
| Screen Porch | `int64` | Screen porch area in square feet | |
| Pool Area | `int64` | Pool area in square feet | |
| Pool QC | `string` | Pool quality | |
| Fence | `string` | Fence quality | |
| Misc Feature | `string` | Miscellaneous feature not covered in other categories | |
| Misc Val | `int64` | Dollar value of miscellaneous feature | |
| Mo Sold | `int64` | Month sold | |
| Yr Sold | `int64` | Year sold | |
| Sale Type | `string` | Type of sale | |
| SalePrice | `int64` | The property's sale price in dollars | This is the response variable for our models |

### Engineered Features

## Requirements
To replicate our analysis and predictive modeling, the following modules are necessary:


| Library | Module | Purpose |
|---|---|---|
| `numpy` | | Ease of basic aggregate operations on data |
| `pandas` | | Read our data into a DataFrame, clean it, engineer new features, and write it out to submission files |
| `matplotlib` | `pyplot` | Basic plotting functionality |
| `sklearn` | `compose` | Column transformation |
| `sklearn` | `impute` | Imputation methods |
| `sklearn` | `linear_model` | Fit simple and multiple linear regression (SLR and MLR) models |
| `sklearn` | `metrics` | Evaluate our models |
| `sklearn` | `model_selection` | Use k-fold cross-validation |
| `sklearn` | `preprocessing` | Data preprocessing and feature engineering tasks |
| `seaborn` | | More control over plots |
| `warnings` | | Suppress many of the warnings `pandas` flags in response to things like using `inplace` arguments |

A prospective colleague or student interested in replicating our results or improving upon them would also require access to the [Ames Housing Dataset by Dean de Cock](https://www.kaggle.com/datasets/prevek18/ames-housing-dataset). In our case, we have saved this data within the `datasets` directory.

## Executive Summary

#### Purpose

#### Methods

#### Findings

#### Next Steps

## Imports
To begin, we'll import all the necessary libraries for this project. We need:
* `numpy` for the ease of basic aggregate operations on data
* `pandas` to read our data into a DataFrame, clean it, engineer new features, and write it out to submission files.
* `matplotlib.pyplot` for basic plotting functionality
* `sklearn.compose` for our column transformer
* `sklearn.impute` for imputation methods
* `sklearn.linear_model` to fit simple and multiple linear regression (SLR and MLR) models
* `sklearn.metrics` to evaluate our models
* `sklearn.model_selection` to use k-fold cross-validation
    * Note that we evaluate our models with k-fold cross-validation, run in parallel, rather than with a single train-test split, so that our estimates of model performance are more reliable.
* `sklearn.preprocessing` for data preprocessing and feature engineering tasks
* `seaborn` for more control over plots
* `warnings` to suppress many of the warnings `pandas` flags in response to things like using `inplace` arguments.

In [3]:
# Basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn module imports
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error as rmse
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')
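
The cross-validation approach noted in the import list can be sketched as follows. This is a minimal illustration on synthetic data, not the model fitted later in this notebook: `X_demo` and `y_demo` are hypothetical stand-ins, and the choice of five folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical): 200 rows, 3 features, linear target plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation; n_jobs=-1 asks joblib to run the folds on all available cores
neg_rmse_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)

# scikit-learn negates error metrics so that higher is always better; flip the sign back
cv_rmse = -neg_rmse_scores.mean()
print(f"Mean cross-validated RMSE: {cv_rmse:.3f}")
```

Averaging the score over several folds gives a steadier estimate of out-of-sample error than a single train-test split, at the cost of fitting the model once per fold.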

## The Data
Our goal is to predict the price of homes listed for sale in Ames, IA, given information about the properties in question. This is a much wider dataset than any with which we have previously worked. The training data consists of 81 columns, including the `Id` and `PID` identifiers and the sale price, and ranges from zoning classification to the quality and condition of various parts of the home and exterior. As we progress with analysis and modeling, we may find that we have to engineer new features to build a more accurate predictive model.

#### Original Data
The original dataset contains 79 non-index parameters, each introducing a different piece of information regarding a listing.

| Variable | Data Type | Description | Notes |
|---|---|---|---|
| MS SubClass | `int64` | The classification of the building | Codified to numbers; see the original data documentation for the cipher |
| MS Zoning | `string` | General zoning classification of the sale | Codified into strings; see the original data documentation for the cipher |
| Lot Frontage | `float64` | Linear feet of street connected to the property | |
| Lot Area | `int64` | Lot size in square feet | |
| Street | `string` | Type of road access to property | Gravel or Paved |
| Alley | `string` | Type of alley access to property | Gravel or Paved |
| Lot Shape | `string` | General shape of property | Degree of irregularity |
| Land Contour | `string` | Flatness of the property | Level, Banked, Hillside, or Low Depression |
| Utilities | `string` | Type of utilities available | Electric, Gas, Water, Sewer |
| Lot Config | `string` | Lot Configuration | Where on a block or in a neighborhood the lot lands |
| Land Slope | `string` | Slope of the property | Categorized from "gentle" to "severe" |
| Neighborhood | `string` | Physical locations within Ames city limits | |
| Condition 1 | `string` | Proximity to main road or railroad | |
| Condition 2 | `string` | Proximity to main road or railroad | if a second is present |
| Bldg Type | `string` | Type of dwelling | More modern classification than MS SubClass |
| House Style | `string` | Style of dwelling | Number of stories |
| Overall Qual | `int64` | Overall material and finish quality | Scale 1 - 10 |
| Overall Cond | `int64` | Overall condition rating | Scale 1 - 10 |
| Year Built | `int64` | Original construction date | |
| Year Remod/Add | `int64` | Remodel date | Same as construction date if no remodeling or additions |
| Roof Style | `string` | Type of roof | |
| Roof Matl | `string` | Roofing material | |
| Exterior 1st | `string` | Exterior covering on house | |
| Exterior 2nd | `string` | Exterior covering on house | if more than one material |
| Mas Vnr Type | `string` | Masonry veneer type | |
| Mas Vnr Area | `int64` | Masonry veneer area | |
| Exter Qual | `string` | Exterior material quality | Scale Poor to Excellent (five grades: Po, Fa, TA, Gd, Ex) |
| Exter Cond | `string` | Present condition of the material on the exterior | Same grading system as Exter Qual |
| Foundation | `string` | Type of foundation | |
| Bsmt Qual | `string` | Height of basement | |
| Bsmt Cond | `string` | General condition of basement | |
| Bsmt Exposure | `string` | Walkout or garden level basement walls | |
| BsmtFin Type 1 | `string` | Quality of basement finished area | |
| BsmtFin SF 1 | `int64` | Type 1 finished square footage | |
| BsmtFin Type 2 | `string` | Quality of second finished area (if present) | |
| BsmtFin SF 2 | `float64` | Type 2 finished square feet | |
| Bsmt Unf SF | `int64` | Unfinished square feet of basement area | |
| Total Bsmt SF | `int64` | Total square feet of basement area | |
| Heating | `string` | Type of heating | |
| Heating QC | `string` | Heating quality and condition | |
| Central Air | `string` | Central air conditioning | |
| Electrical | `string` | Electrical system | |
| 1st Flr SF | `int64` | First floor square footage | |
| 2nd Flr SF | `int64` | Second floor square footage | if it exists |
| Low Qual Fin SF | `int64` | Low quality finished square footage | all floors |
| Gr Liv Area | `int64` | Above grade living area square footage | |
| Bsmt Full Bath | `int64` | Number of basement full bathrooms | |
| Bsmt Half Bath | `int64` | Number of basement half bathrooms | |
| Bedroom AbvGr | `int64` | Number of bedrooms above grade | |
| Kitchen AbvGr | `int64` | Number of kitchens above grade | |
| Kitchen Qual | `string` | Kitchen quality | |
| TotRms AbvGrd | `int64` | Total rooms above grade | does not include bathrooms |
| Functional | `string` | Home functionality rating | |
| Fireplaces | `int64` | Number of fireplaces | |
| Fireplace Qu | `string` | Fireplace quality | |
| Garage Type | `string` | Garage type and location | |
| Garage Yr Blt | `float64` | Year garage was built | |
| Garage Finish | `string` | Interior finish of the garage | |
| Garage Cars | `int64` | Car capacity of garage | |
| Garage Area | `int64` | Square footage of garage | |
| Garage Qual | `string` | Garage quality | |
| Garage Cond | `string` | Garage condition | |
| Paved Drive | `string` | Paved driveway | |
| Wood Deck SF | `int64` | Wood deck area in square feet | |
| Open Porch SF | `int64` | Open porch area in square feet | |
| 3Ssn Porch | `int64` | Three season porch area in square feet | |
| Screen Porch | `int64` | Screen porch area in square feet | |
| Pool Area | `int64` | Pool area in square feet | |
| Pool QC | `string` | Pool quality | |
| Fence | `string` | Fence quality | |
| Misc Feature | `string` | Miscellaneous feature not covered in other categories | |
| Misc Val | `int64` | Dollar value of miscellaneous feature | |
| Mo Sold | `int64` | Month sold | |
| Yr Sold | `int64` | Year sold | |
| Sale Type | `string` | Type of sale | |
| SalePrice | `int64` | The property's sale price in dollars | This is the response variable for our models |

We can already identify some redundant information here, which will let us pare the data down a bit. The "overall" ratings are likely aggregates or combinations of the individual quality and condition values, so we can leave them out of our analysis. We will also want to convert our "sliding scale" (ordinal) variables to a numeric data type, then combine some of them into interaction terms. In particular, the area of a feature is likely to interact with its quality. There are many missing values in the data, as will be seen below, and we will want to impute as many of them as we reasonably can.
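
As a rough sketch of the conversion and interaction described above, the graded columns can be mapped onto integers and then multiplied by the matching area. The toy rows, the 0-to-5 mapping, and the `Garage Qual x Area` column name are assumptions made for illustration; they are not the features engineered later in this project.

```python
import pandas as pd

# Toy rows (hypothetical values) using column names and grade codes from the data dictionary
demo = pd.DataFrame({
    "Exter Qual": ["Gd", "TA", "Ex", "Fa"],
    "Garage Qual": ["TA", None, "Gd", "TA"],
    "Garage Area": [475.0, 0.0, 559.0, 246.0],
})

# Assumed ordinal mapping: Po < Fa < TA < Gd < Ex, with 0 reserved for an absent feature
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

for col in ["Exter Qual", "Garage Qual"]:
    # Unmapped or missing grades become NaN, which we then treat as 0 (feature absent)
    demo[col] = demo[col].map(quality_scale).fillna(0).astype(int)

# Hypothetical interaction feature: a large garage in good condition scores highest
demo["Garage Qual x Area"] = demo["Garage Qual"] * demo["Garage Area"]
print(demo)
```

Treating a missing grade as 0 assumes the feature is absent rather than unrecorded, which should be checked column by column before applying it to the real data.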

In [8]:
# Read in the training and test datasets
ames_training = pd.read_csv('../datasets/train.csv')
ames_validation = pd.read_csv('../datasets/test.csv')

In [9]:
# Check data types and number of non-null values in the training data.
ames_training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            140 non-null    object 
 8   Lot Shape        2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  Lot Config       2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

In [11]:
# Compare the full shape with the shape after dropping every row that has any missing value
ames_training.shape, ames_training.dropna().shape

((2051, 81), (0, 81))
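
Because dropping incomplete rows leaves nothing, it helps to see which columns are responsible. The sketch below counts missing values per column on a small hypothetical frame; the same two lines should work on `ames_training`, though the values shown here are made up.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the kind of gaps seen in the training data
demo = pd.DataFrame({
    "Lot Frontage": [np.nan, 43.0, 68.0, 73.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "Pool QC": [np.nan, np.nan, np.nan, np.nan],
    "Lot Area": [13517, 11492, 7922, 9802],
})

# Count missing values per column, keep only columns with gaps, worst offenders first
missing_counts = demo.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Express the same counts as a share of all rows
missing_share = (missing_counts / len(demo)).round(2)
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

A single column that is empty in every row is enough to make `dropna()` return zero rows, so a per-column view like this is a more useful starting point than the row-wise count.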

In [12]:
ames_training.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,...,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,...,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,...,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,...,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,...,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,...,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500


In [16]:
ames_training[['Pool Area', 'Pool QC']].dropna()

Unnamed: 0,Pool Area,Pool QC
52,519,Fa
657,576,Gd
761,800,Gd
952,228,Ex
960,480,Gd
1130,648,Fa
1249,738,Gd
1635,368,TA
1875,561,TA


In [17]:
ames_training[['Alley']].dropna()

Unnamed: 0,Alley
13,Pave
16,Grvl
27,Grvl
43,Grvl
46,Grvl
...,...
1996,Grvl
1999,Grvl
2004,Pave
2030,Pave
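
In the original data documentation, a missing value in columns such as `Alley` and `Pool QC` indicates that the property has no such feature, not that the value went unrecorded. Under that reading, one option is to replace those gaps with an explicit category. The sketch below uses toy rows, and the `"None"` label is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) mirroring the sparsity of Alley and Pool QC
demo = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan, "Pave"],
    "Pool QC": [np.nan, np.nan, "Gd", np.nan],
    "Pool Area": [0, 0, 800, 0],
})

# Sanity check: every row with a missing pool grade should also have zero pool area
no_pool_rows = demo.loc[demo["Pool QC"].isna(), "Pool Area"]
missing_means_absent = bool((no_pool_rows == 0).all())
print("Missing Pool QC always has Pool Area == 0:", missing_means_absent)

# Replace the structural gaps with an explicit label (hypothetical name)
absent_label = "None"
demo[["Alley", "Pool QC"]] = demo[["Alley", "Pool QC"]].fillna(absent_label)
print(demo)
```

Keeping these rows with an explicit "absent" category preserves the information that the feature does not exist, which a one-hot encoder can then use directly.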


In [25]:
# Drop the sparsest columns, then count the rows with no remaining missing values
ames_training.drop(
    columns=['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Mas Vnr Type', 'Mas Vnr Area', 'Lot Frontage']
).dropna().shape

(1029, 74)