# Comprehensive Data Analysis

Before tackling any problem with complex mathematics it is vital to understand the data. 

- Understand the problem. We'll look at each variable and think about its meaning and importance for this problem.
- Univariate study. We'll focus on the dependent variable ('SalePrice') and try to learn a little more about it.
- Multivariate study. We'll try to understand how the dependent variable and independent variables relate.
- Basic cleaning. We'll clean the dataset and handle the missing data, outliers and categorical variables.
- Test assumptions. We'll check if our data meets the assumptions required by most multivariate techniques.

Let's move from Plotly to Seaborn, as it has some really useful utilities to get us started.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from scipy.stats import norm
import matplotlib.pyplot as plt
import os

In [None]:
DIR = '~/notebooks/erudition/data/kaggle/house-prices-advanced-regression-techniques'

# bring in the data
df = pd.read_csv(os.path.join(DIR, 'train.csv'))
df.head()

In [None]:
# check the column names
df.columns

We ultimately care about SalePrice, so let's carry out some initial analysis.

In [None]:
df.SalePrice.describe()

There is a large range between the lowest and highest priced houses.
Let's plot a histogram to visualise this range.

In [None]:
sns.distplot(df.SalePrice)

At first glance we can see that the distribution:

- Deviates from the normal distribution.
- Has appreciable positive skewness.
- Shows peakedness.

In [None]:
# let's dig further into the shape of the curve

print('Skewness: ', df.SalePrice.skew(), ' Kurtosis: ', df.SalePrice.kurt())
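
To make these two statistics concrete, here is a minimal sketch on synthetic data (not the Kaggle file). The lognormal sample and its parameters are assumptions, chosen only to mimic a right-skewed price column. pandas reports excess kurtosis, so a normal sample should score close to 0 on both measures.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a price column (assumption, not the real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# A symmetric sample for comparison.
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=5000))

# Series.kurt() uses Fisher's definition: a normal distribution has kurtosis 0.
price_skew, price_kurt = prices.skew(), prices.kurt()
sym_skew, sym_kurt = symmetric.skew(), symmetric.kurt()

print(f"lognormal: skew={price_skew:.2f}, excess kurtosis={price_kurt:.2f}")
print(f"normal:    skew={sym_skew:.2f}, excess kurtosis={sym_kurt:.2f}")
```

A clearly positive skew together with positive excess kurtosis is the numeric counterpart of the long right tail and the peak seen in the histogram.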

In [None]:
p=df.plot.scatter(x='GrLivArea', y='SalePrice')

Looks like a nice linear relationship between the independent variable GrLivArea and the **dependent variable** SalePrice.

In [None]:
p=df.plot.scatter(x='TotalBsmtSF', y='SalePrice')

There is now a cluster of points at TotalBsmtSF=0 and the relationship looks more exponential: the larger the TotalBsmtSF value, the steeper the increase we see in SalePrice.

What about the relationships to more categorical features?

In [None]:
f, ax = plt.subplots(figsize=(8, 8))
p=sns.boxplot(x='OverallQual', y="SalePrice", data=df)

There is an obvious relationship between the overall quality and the sale price. Better quality = higher sale price.

What about the year the property was built?

In [None]:
f, ax = plt.subplots(figsize=(16, 8))
p=sns.boxplot(x='YearBuilt', y="SalePrice", data=df)

# Plasma Soup

We know more about the data now and have some obvious trends as surfaced above but how do all the features relate to one another?

In [None]:
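# correlate the numeric columns only; text columns cannot be correlated
corrmap = df.select_dtypes(include='number').corr()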
f, ax = plt.subplots(figsize=(20, 20))
p=sns.heatmap(corrmap, square=True)

That's a lot of data, although some features are obviously highly correlated. Let's explore the 10 columns most correlated with SalePrice (SalePrice itself included).

In [None]:
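# correlate the numeric columns only; text columns cannot be correlated
corrmap = df.select_dtypes(include='number').corr()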
cols = corrmap.nlargest(10, 'SalePrice')['SalePrice'].index
corrmap = df[cols].corr()
f, ax = plt.subplots(figsize=(20, 20))
p=sns.heatmap(corrmap, square=True, annot=True)

With this information we can now look at the actual scatter plot relationships between the features using Seaborn's pairplot.

In [None]:
p=sns.pairplot(df[cols])

# Missing Data

Is there anything missing in our data? This can have a huge effect on any training model.


In [None]:
total = df.isna().sum().sort_values(ascending=False)
percentage = (df.isna().sum()/df.isna().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percentage], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

In [None]:
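# dealing with missing data
# drop every column that has more than one missing value
df = df.drop(columns=missing_data[missing_data['Total'] > 1].index)
# 'Electrical' still has a gap, so drop the affected row(s) instead of the column
df = df.drop(df.loc[df['Electrical'].isnull()].index)
df.isnull().sum().max()  # should be 0 if no missing data remains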
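
The same drop strategy can be sketched on a tiny hand-made frame, which makes the effect of each step easy to see. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame: one sparse column, one with a single gap, one complete.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "SalePrice": [200000, 150000, 180000, 220000],
})

missing_count = toy.isna().sum()   # missing values per column
missing_share = toy.isna().mean()  # fraction of rows missing per column

# Drop columns with more than one missing value, then drop rows that still have a gap.
sparse_cols = missing_count[missing_count > 1].index
cleaned = toy.drop(columns=sparse_cols).dropna()

print(missing_share)
print(cleaned)
```

Dropping is the bluntest option. It is reasonable here only because the sparse columns are assumed to carry little information about the price.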

In [None]:
df.columns

# Out liars

Outliers can markedly affect our models and can be a valuable source of information, providing us insights about specific behaviours.

A quick analysis can be done through the standard deviation of 'SalePrice' and a set of scatter plots.
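
A minimal sketch of the standard deviation approach, assuming we standardise the prices to zero mean and unit variance and then inspect both tails. The data is synthetic with two planted extreme values, and the `|z| > 3` cut-off is a common rule of thumb rather than something prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['SalePrice']; the last two values are planted outliers.
rng = np.random.default_rng(1)
values = np.append(rng.normal(180_000, 40_000, size=500), [700_000, 750_000])
sale_price = pd.Series(values, name="SalePrice")

# Standardise: subtract the mean and divide by the standard deviation.
z = (sale_price - sale_price.mean()) / sale_price.std()

# Look at both ends of the standardised distribution.
low_range = z.nsmallest(10)
high_range = z.nlargest(10)

# Flag observations more than three standard deviations from the mean.
flagged = sale_price[z.abs() > 3]

print("lowest z-scores:\n", low_range)
print("highest z-scores:\n", high_range)
print("flagged prices:\n", flagged)
```

Flagged points are candidates to inspect, not rows to delete automatically. A scatter plot against a feature such as 'GrLivArea' helps decide whether a point is an error or just an expensive house.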

# Four assumptions should be tested

According to Hair et al. (2013)

https://is.muni.cz/el/1423/podzim2017/PSY028/um/_Hair_-_Multivariate_data_analysis_7th_revised.pdf

- **Normality** - When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistical tests rely on this (e.g. t-statistics). In this exercise we'll just check univariate normality for 'SalePrice' (which is a limited approach). Remember that univariate normality doesn't ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that's the main reason why we are doing this analysis.

- **Homoscedasticity** - I just hope I wrote it right. Homoscedasticity refers to the 'assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)' (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.

- **Linearity** - The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we'll not get into this because most of the scatter plots we've seen appear to have linear relationships.

- **Absence of correlated errors** - Correlated errors, as the name suggests, happen when one error is correlated with another. For instance, if a positive error is systematically followed by a negative error, there is a relationship between those errors. This often occurs in time series, where some patterns are time related. We'll not get into this either. However, if you detect something, try to add a variable that can explain the effect you're getting. That's the most common solution for correlated errors.
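
As a rough illustration of the homoscedasticity point, the sketch below fits a straight line and compares the spread of the residuals in the lower and upper half of the predictor. The data is simulated with multiplicative noise, and `residual_spread_ratio` is a hypothetical helper written for this example, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, size=2000)
# Multiplicative noise: the spread of y grows with x (heteroscedastic by construction).
y = 3.0 * x * np.exp(rng.normal(0.0, 0.3, size=x.size))


def residual_spread_ratio(x, y):
    """Fit a line, then return residual std in the upper half of x over the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    upper = x > np.median(x)
    return resid[upper].std() / resid[~upper].std()


# A ratio near 1 suggests roughly equal variance across the range of x.
ratio_raw = residual_spread_ratio(x, y)
ratio_log = residual_spread_ratio(np.log(x), np.log(y))

print(f"raw scale: {ratio_raw:.2f}")
print(f"log scale: {ratio_log:.2f}")
```

In this simulated case the log transform turns the multiplicative noise into additive noise, which is why the ratio moves towards 1.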

## Start with a search for Normality

- **Histogram** - Kurtosis and skewness.
- **Normal probability plot** - Data distribution should closely follow the diagonal that represents the normal distribution.

In [None]:
#histogram and normal probability plot
from scipy import stats
sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()

# Calculate quantiles for a probability plot, and optionally show the plot.
# Generates a probability plot of sample data against the quantiles of a specified theoretical distribution (the normal distribution by default). 
# probplot optionally calculates a best-fit line for the data and plots the results using Matplotlib or a given plot function.
res = stats.probplot(df['SalePrice'], plot=plt)

Hmmmm, SalePrice does not appear normal. It shows peakedness and positive skewness, and does not follow the diagonal line that would indicate a normal distribution.

# POSITIVE SKEWNESS & LOG TRANSFORM

But not everything is lost. A simple data transformation can solve the problem. This is one of the awesome things you can learn in statistical books: in case of positive skewness, log transformations usually work well. When I discovered this, I felt like a Hogwarts student discovering a new cool spell.

In [None]:
df.SalePrice = np.log(df.SalePrice)

sns.distplot(df['SalePrice'], fit=norm);
fig = plt.figure()

res = stats.probplot(df['SalePrice'], plot=plt)
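
The effect of the log transform can also be checked with numbers rather than plots. This is a minimal sketch on a synthetic lognormal sample (an assumption standing in for 'SalePrice'): it compares the skewness before and after the transform, and uses the correlation coefficient `r` that `stats.probplot` returns for its least-squares line as a rough measure of how closely the points follow the diagonal.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed sample (assumption, not the Kaggle data).
rng = np.random.default_rng(3)
raw = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1500))
logged = np.log(raw)

skew_raw, skew_log = raw.skew(), logged.skew()

# With fit=True (the default) probplot returns ((osm, osr), (slope, intercept, r)).
(_, _), (_, _, r_raw) = stats.probplot(raw)
(_, _), (_, _, r_log) = stats.probplot(logged)

print(f"skewness: raw={skew_raw:.2f}, log={skew_log:.2f}")
print(f"probplot r: raw={r_raw:.4f}, log={r_log:.4f}")
```

An `r` closer to 1 means the ordered values line up better with the theoretical normal quantiles.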

In [None]:
sns.distplot(df['GrLivArea'], fit=norm);
fig = plt.figure()

res = stats.probplot(df['GrLivArea'], plot=plt)

Looks like a similar issue to before, so let's transform and redraw.

In [None]:
df.GrLivArea = np.log(df.GrLivArea)

sns.distplot(df['GrLivArea'], fit=norm);
fig = plt.figure()

res = stats.probplot(df['GrLivArea'], plot=plt)

In [None]:
sns.distplot(df['TotalBsmtSF'], fit=norm);
fig = plt.figure()

res = stats.probplot(df['TotalBsmtSF'], plot=plt)
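
'TotalBsmtSF' differs from the previous two variables because it contains zeros, and the logarithm of zero is undefined. One possible way to handle this, sketched below on a toy frame, is to log-transform only the positive values and keep a flag for houses without a basement. The `HasBsmt` column name and the sample values are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column; zeros mean "no basement".
toy = pd.DataFrame({"TotalBsmtSF": [0.0, 856.0, 1262.0, 0.0, 920.0]})

# Hypothetical helper column: 1 if the house has a basement, else 0.
toy["HasBsmt"] = (toy["TotalBsmtSF"] > 0).astype(int)

# Log-transform only the positive areas; zeros are left untouched.
has_bsmt = toy["HasBsmt"] == 1
toy.loc[has_bsmt, "TotalBsmtSF"] = np.log(toy.loc[has_bsmt, "TotalBsmtSF"])

print(toy)
```

Any normality check on the transformed column would then be run on the rows where `HasBsmt` is 1.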