# Data Analysis for Machine Learning
We'll work with a Kaggle dataset: __[House Sales in King County, USA](https://www.kaggle.com/harlfoxem/housesalesprediction)__

This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.

It's a great dataset for evaluating simple regression models.

It has 19 house features plus the price and id columns, with 21,613 observations.

These are the features of the dataset:

- id: Identifier for a house
- date: Date the house was sold
- price: Sale price (the prediction target)
- bedrooms: Number of bedrooms per house
- bathrooms: Number of bathrooms per house
- sqft_living: Square footage of the home
- sqft_lot: Square footage of the lot
- floors: Total floors (levels) in the house
- waterfront: Whether the house has a view of a waterfront
- view: Has been viewed
- condition: How good the overall condition is
- grade: Overall grade given to the housing unit, based on the King County grading system
- sqft_above: Square footage of the house apart from the basement
- sqft_basement: Square footage of the basement
- yr_built: Year the house was built
- yr_renovated: Year the house was renovated
- zipcode: ZIP code
- lat: Latitude coordinate
- long: Longitude coordinate
- sqft_living15: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
- sqft_lot15: Lot size area in 2015 (implies some renovations)

In [None]:
# DON'T DO THIS!
#!pip install sklearn_pandas

In [None]:
#Importing the required libraries
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)
%matplotlib inline

from sklearn import preprocessing
from sklearn_pandas import DataFrameMapper

## Exploratory Data Analysis
Loading the dataframe:

In [None]:
df = pd.read_csv('kc_house_data.csv')
df.head()

## Step 1: Cleaning data: check that the data loaded correctly and has valid values

In [None]:
df.shape

We know that there are 21,613 rows and 21 columns (features). Let's check for red flags in those features:

In [None]:
df.info()

`info` gives you a quick summary of both the type and the non-null count for each column. In this case the data seems correct: there are no missing values and the types look right.
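The same red flags can also be checked explicitly. The sketch below is only an illustration: it uses a tiny hand-made frame (the `toy` name and its values are assumptions, not the real data) to show how `isna().sum()` and `select_dtypes` surface missing values and non-numeric columns.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one column stored as text (illustrative only)
toy = pd.DataFrame({
    'price': [221900.0, 538000.0, np.nan],
    'bedrooms': [3, 3, 2],
    'zipcode': ['98178', '98125', '98028'],
})

# Count missing values per column; anything above zero needs attention
missing = toy.isna().sum()
print(missing)

# Columns that are not numeric may need conversion before modeling
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

On the real dataframe, the same two calls would be run on `df` instead of `toy`.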

## Step 2: High level Feature Selection
Our objective is to predict the price of a house based on the features that we know about the house. For example, we know that a larger surface area and more bedrooms are associated with a higher price. But what about the id of the house? It's probably just an internal ID and does not affect the real price.

That is feature selection: understanding which features are important to the ML model.

With pandas it is extremely simple to exclude columns:

In [None]:
df.drop(columns=['id']).head()

What other variables would you exclude? For this workshop, we'll exclude date, lat and long. We could have done a more thorough analysis of lat and long, but zipcode is probably enough to capture location.

In [None]:
df.drop(columns=['id', 'date', 'lat', 'long'], inplace=True)

## Step 3: Correlation between variables
Some variables will have a higher (positive or negative) correlation with the price. We know that the surface area of a house is positively correlated with its price: the larger the house, the higher the price. But what about the others? We can build a simple correlation matrix to better understand the relationships between the variables:

In [None]:
df.corr()

### So, for example, we can see that sqft_living is highly correlated with the price:

In [None]:
df.corr().loc['sqft_living', 'price']
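Rather than looking up one pair at a time, it can help to rank every feature by the strength of its correlation with the target. This is a minimal sketch on a made-up frame (the `toy` values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy data (illustrative values, not taken from the real dataset)
toy = pd.DataFrame({
    'price': [200, 300, 400, 500, 600],
    'sqft_living': [1000, 1400, 2100, 2400, 3000],
    'yr_built': [1990, 1950, 2005, 1970, 1985],
})

# Correlation of every feature with the target, excluding the target itself
price_corr = toy.corr()['price'].drop('price')

# Order by absolute value so strong negative correlations are not buried
ranked = price_corr.reindex(price_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.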

We'll use a simple heatmap to get a visual sense of these variables and their correlations:

In [None]:
corr = df.corr()
# Mask the upper triangle so each pair of variables appears only once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

fig, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

We see some strange patterns, for example the apparent "negative" correlation between zipcode and price, which doesn't make any sense. We'll talk more about this when we explore zipcode as a categorical feature later.
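One way to see why a correlation with zipcode is meaningless: the numbers are arbitrary labels, so relabeling the same areas changes the result. The sketch below uses invented zipcodes and prices (all values and the `relabel` mapping are assumptions for illustration):

```python
import pandas as pd

# Toy data: three areas with clearly different price levels (illustrative values)
toy = pd.DataFrame({
    'zipcode': [98001, 98001, 98002, 98002, 98003, 98003],
    'price':   [300, 320, 900, 950, 500, 520],
})

# Correlation using the zipcode numbers as they are
corr_original = toy['zipcode'].corr(toy['price'])

# Relabel the same areas with different, equally arbitrary numbers
relabel = {98001: 3, 98002: 1, 98003: 2}
corr_relabeled = toy['zipcode'].map(relabel).corr(toy['price'])

print(corr_original, corr_relabeled)
```

The houses and prices are identical in both cases; only the labels changed, yet the sign of the correlation flips.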

Once we identify correlation between different variables, we can explore how they're correlated. For example, we saw sqft_living and price:

In [None]:
df.plot(x='sqft_living', y='price', kind='scatter', figsize=(12, 7))

### What about grade and price?

In [None]:
df.corr().loc['grade', 'price']

### They also seem strongly correlated, but are they just linearly correlated?

In [None]:
df.plot(x='grade', y='price', kind='scatter', figsize=(12, 5))

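It doesn't seem so, or at least it's not as clear as with sqft_living. There seems to be some sort of nonlinear relationship. We can use a logarithmic y axis to test this: if price grows roughly exponentially with grade, the points will look closer to a straight line on a log scale.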

In [None]:
df.plot(x='grade', y='price', kind='scatter', figsize=(12, 7), logy=True)
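The visual check can be backed by a number: if the relationship is closer to exponential than linear, the correlation with the logarithm of the price should be higher than with the raw price. A minimal sketch on synthetic data (the exponential form and its constants are assumptions chosen for illustration, not estimates from the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic data where price grows exponentially with grade (illustrative assumption)
grade = np.arange(3, 14)
toy = pd.DataFrame({'grade': grade, 'price': 50_000 * np.exp(0.35 * grade)})

# Linear correlation with the raw price vs. with its logarithm
corr_raw = toy['grade'].corr(toy['price'])
corr_log = toy['grade'].corr(np.log(toy['price']))
print(corr_raw, corr_log)
```

On the real data, the same comparison would use `df['grade']` and `np.log(df['price'])`.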

## Step 4: More cleaning, identifying outliers
Linear regression (along with other ML models) is very sensitive to outliers, so it's worth checking the range of each feature:

In [None]:
df.describe()

A house with 33 bedrooms? There's something going on here:

In [None]:
fig, ax = plt.subplots(figsize=(15, 7))
sns.boxplot(data=df[['bedrooms', 'bathrooms']], orient="h", palette="Set2")

In [None]:
df[df['bedrooms'] == 33]

In [None]:
# Drop the row(s) with the implausible 33-bedroom value
df.drop(df[df['bedrooms'] == 33].index, inplace=True)
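Spotting the 33-bedroom house by eye works here, but a common rule of thumb can flag candidates automatically: values further than 1.5 times the interquartile range from the quartiles. This is a sketch on a toy series (values are illustrative), not a rule the workshop itself applies:

```python
import pandas as pd

# Toy bedroom counts with one implausible value (illustrative, not the real data)
bedrooms = pd.Series([2, 3, 3, 4, 3, 2, 4, 3, 33], name='bedrooms')

# Interquartile range and the conventional 1.5 * IQR fences
q1, q3 = bedrooms.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for inspection, not automatic removal
outliers = bedrooms[(bedrooms < lower) | (bedrooms > upper)]
print(outliers)
```

Flagged rows still deserve a manual look before dropping them, since an unusual value is not necessarily a wrong one.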

Now, what about those properties without bathrooms? That is strange, let's take a look:

In [None]:
df[df['bathrooms'] == 0]

Now that we look at it, it makes a little more sense. Maybe those are just warehouses or other types of storage facilities? Without more information it is difficult to make a decision. This is an important lesson: domain expertise is fundamental when analyzing data.

I won't remove any houses for now.

How are other variables doing?

## Step 5: Dummy variables
The zipcode feature poses a problem. Machine learning models don't understand "human" features like zipcode. For an ML algorithm, a value of 98178 in zipcode is "greater" than 98125, even though for us, knowing the area, the zipcode 98125 might have more expensive houses. These are the zipcodes in our dataset:

In [None]:
df['zipcode'].unique()

In [None]:
df['zipcode'].value_counts()

Dummy variables are a standard way to feed a categorical feature to an ML model: each category becomes its own 0/1 column. We'll see how to combine these later.

In [None]:
pd.get_dummies(df['zipcode'])
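To give an idea of how the dummy columns fit back into the rest of the data, here is a small sketch with plain pandas. The `toy` frame and the `zip` prefix are illustrative assumptions:

```python
import pandas as pd

# Toy frame (illustrative values)
toy = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960],
    'zipcode': [98178, 98125, 98028, 98178],
})

# One indicator column per zipcode, prefixed so the column names stay readable
dummies = pd.get_dummies(toy['zipcode'], prefix='zip')

# Replace the original zipcode column with its indicator columns
encoded = pd.concat([toy.drop(columns=['zipcode']), dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set, so the model no longer sees any ordering between zipcodes.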

## Step 6: Feature scaling and normalization
There's a final IMPORTANT point to discuss: "scaling" and "normalizing" features. It has a mathematical explanation, but basically, what we DON'T want is features that are on completely different scales. For example:

In [None]:
df[['bedrooms', 'sqft_living']].head()

The values here are too dissimilar, which will make some algorithms perform poorly or converge more slowly. We'll then "scale" these features to remove the unit. Read more here: __[Importance of Feature Scaling](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)__

In [None]:
from sklearn.preprocessing import StandardScaler
# Fit on the first five rows only, just to illustrate the transformation
sample = df[['bedrooms', 'sqft_living']].head()
scaler = StandardScaler().fit(sample)
scaler.transform(sample)
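For intuition about what the scaler does, standardization can be written by hand: subtract each column's mean and divide by its standard deviation. The sketch below compares that against `StandardScaler` on a small made-up matrix (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: bedrooms and sqft_living (illustrative values)
X = np.array([[3, 1180], [3, 2570], [2, 770], [4, 1960], [3, 1680]], dtype=float)

# Standardization by hand: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler performs the same computation (population std, ddof=0)
scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates just because of its unit.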

## Step 7: Putting it all together


In [None]:
ScalerClass = preprocessing.StandardScaler
mapper = DataFrameMapper([
    (['bedrooms'], ScalerClass()),
    (['bathrooms'], ScalerClass()),
    (['sqft_living'], ScalerClass()),
    (['sqft_lot'], ScalerClass()),
    (['floors'], ScalerClass()),
    (['condition'], ScalerClass()),
    (['grade'], ScalerClass()),
    (['sqft_above'], ScalerClass()),
    (['sqft_basement'], ScalerClass()),
    (['sqft_living15'], ScalerClass()),
    (['sqft_lot15'], ScalerClass()),

    ('zipcode', preprocessing.LabelBinarizer()),
    ('yr_built', None),
    ('yr_renovated', None),

    ('waterfront', None),
    ('view', None)    
])
X_train, X_test, y_train, y_test = train_test_split(
    mapper.fit_transform(df.drop(columns=['price'])), df['price'], test_size=0.3, random_state=10)
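Note that the cell above fits the transformations on the whole dataset before splitting. A stricter variant splits first and learns the scaling statistics from the training rows only. The sketch below illustrates that idea with scikit-learn's `ColumnTransformer`, which is swapped in here for `DataFrameMapper`, and with random synthetic data; the `toy` frame, the variable names and the column selection are all assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing frame (random values, illustration only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sqft_living': rng.integers(500, 4000, size=40),
    'bedrooms': rng.integers(1, 6, size=40),
    'zipcode': rng.choice([98178, 98125, 98028], size=40),
    'price': rng.integers(100_000, 900_000, size=40),
})

# Split before any fitting so the test rows stay unseen
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=['price']), toy['price'], test_size=0.3, random_state=10)

# Scale the numeric columns and one-hot encode zipcode
pre = ColumnTransformer([
    ('num', StandardScaler(), ['sqft_living', 'bedrooms']),
    ('zip', OneHotEncoder(handle_unknown='ignore'), ['zipcode']),
])

# Statistics are learned from the training rows only, then reused on the test rows
X_tr_t = pre.fit_transform(X_tr)
X_te_t = pre.transform(X_te)
print(X_tr_t.shape, X_te_t.shape)
```

For a simple workshop model the difference in score is usually small, but fitting on training data only keeps the test score an honest estimate.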

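Let's now see how our linear regression performs with these simple modifications. For `LinearRegression`, `score` returns the coefficient of determination (R²) on the data it is given, here the test set: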

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
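To make the score less of a black box, R² can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on a tiny synthetic problem (the data and the `toy_model` name are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic regression problem (illustrative values)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.1])

toy_model = LinearRegression().fit(X, y)
pred = toy_model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, toy_model.score(X, y))
```

A value close to 1 means the model explains most of the variance in the target; a value near 0 means it does little better than always predicting the mean.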