# Sprint 1

## Goal
The goal for this sprint is for your team to get to know the data.

Some questions you may want to ask:
- How are the prices distributed?
- How many missing values are there?
- What features may be interesting, intuitively?
- Can we construct new features based on the existing ones?
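
As a first taste of that last question: a new feature can often be built by simply combining existing columns. The sketch below uses a tiny made-up DataFrame. The column names `1stFlrSF` and `2ndFlrSF` are assumed to match the real data set (check `data_description.txt`), and `TotalSF` is a hypothetical name for the new feature.

```python
import pandas as pd

# Toy data for illustration only; the real values live in train.csv.
toy = pd.DataFrame({
    "TotalBsmtSF": [800, 0, 1200],
    "1stFlrSF": [850, 900, 1250],
    "2ndFlrSF": [0, 700, 0],
})

# Hypothetical new feature: total floor area of basement plus both floors.
toy["TotalSF"] = toy["TotalBsmtSF"] + toy["1stFlrSF"] + toy["2ndFlrSF"]
print(toy)
```

Whether such a feature actually helps is something to check against `SalePrice` later on.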

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set() # For aesthetic purposes
pd.set_option("display.max_columns", 101) # So we can see all columns of our DataFrame
pd.set_option('display.max_rows', 100) # So we can see more rows of our DataFrame
np.random.seed(42) # So our random results are reproducible

## Load data

If we want to analyse our data, we first have to load it.

In [None]:
# Training data
df_train = pd.read_csv("../data/train.csv").set_index('Id')

### Exploratory analysis

Our dataset also comes with a handy little file called `data_description.txt`, which explains exactly what each column means. How convenient! In practice, documentation like this is often surprisingly hard to come by.

In [None]:
# Read-only mode is enough here; 'r+' would also request write access
with open('../data/data_description.txt', 'r') as f:
    description = f.read()
print(description)

Pandas' `.head()` function allows us to take a look at the first few rows, so we can get a sense of what the data look like.

In [None]:
df_train.head()

The `.describe()` function shows descriptive statistics for all numeric variables.

In [None]:
df_train.describe()

You may have spotted it above, but some columns have values that show up as `NaN` (Not a Number). This can indicate that a field is simply not applicable (if you don't have a pool... what's "the quality of your pool"?), or it can indicate that we're just missing some data (we don't know when this house was built!). Let's have a look at how many data points are missing for each column.

In [None]:
# Missing data: count and share of missing values per column
total = df_train.isnull().sum().sort_values(ascending=False)
percent = df_train.isnull().mean().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
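
If a `NaN` really means "not applicable" rather than "unknown", one option is to say so explicitly. A minimal sketch on made-up data, assuming the pool quality column is called `PoolQC` (check `data_description.txt`) and using `"NoPool"` as a hypothetical label:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two of these houses have no pool.
toy = pd.DataFrame({"PoolQC": ["Gd", np.nan, np.nan, "Ex"]})
n_missing_before = toy["PoolQC"].isnull().sum()

# Make "not applicable" an explicit category instead of a missing value.
toy["PoolQC"] = toy["PoolQC"].fillna("NoPool")
n_missing_after = toy["PoolQC"].isnull().sum()

print(n_missing_before, n_missing_after)
```

Whether this is appropriate depends on the column: a missing build year is genuinely unknown and should not be relabelled this way.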

By the looks of it, 

### Visualization

Alright, now let's get visual. What does the distribution of our target variable, `SalePrice`, look like?

In [None]:
# Histogram with a density curve (distplot is deprecated in recent seaborn versions)
sns.histplot(df_train['SalePrice'], kde=True);

Cool, cool. And how does it correlate with some of our variables?

In [None]:
# Scatter plot GrLivArea/SalePrice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));

Interesting! These two variables are clearly very strongly correlated. This is good to keep in mind for the next sprint, when we'll build our first model!
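
To put a number on "strongly correlated", you could compute the Pearson correlation coefficient between two columns. A small sketch with made-up values (not taken from the real data set):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1500, 2000, 2500],
    "SalePrice": [120000, 170000, 230000, 270000],
})

# Pearson correlation coefficient: between -1 and 1, where values
# close to 1 indicate a strong positive linear relationship.
r = toy["GrLivArea"].corr(toy["SalePrice"])
print(round(r, 3))
```

On the real data, the same call would be `df_train['GrLivArea'].corr(df_train['SalePrice'])`.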

**Experiment with this, and have a look at different variables!**

_Tip: you can use the .columns attribute of our dataframe (so df_train.columns) to find out the names of all columns._

For categorical variables, a scatter plot may not be the right choice. But seaborn makes it easy to create other plots as well. For instance, the code below draws a box plot for every category in a variable.

In [None]:
# Box plot OverallQual/SalePrice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
fig, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x=var, y="SalePrice", data=data, ax=ax)
ax.set_ylim(0, 800000);

In [None]:
df_train.dtypes

So yes, the price is obviously dependent on the 'Overall quality' of the house! Who would have guessed.

**Experiment with this, and have a look at different variables!**

_Tip: you can use the .dtypes attribute of our dataframe (so df_train.dtypes) to find out the data types of all our columns. `float64` and `int64` columns are numerical, `object` columns are (usually) categorical!_
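
Building on that tip, `select_dtypes` can split the columns into numerical and (probably) categorical ones in one go. A minimal sketch on a made-up DataFrame; `Neighborhood` is assumed here as an example of a text column:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "OverallQual": [5, 7, 8],
    "GrLivArea": [1200.0, 1800.0, 2100.0],
    "Neighborhood": ["A", "B", "A"],  # made-up labels
})

# Numerical columns (int64, float64, ...) versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Keep in mind that a numerical dtype does not guarantee a truly numerical variable: a rating like `OverallQual` is stored as an integer but behaves like an ordered category.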

As you may have noticed, going through all these features individually can be a bit tedious... but again, Seaborn is here to help us! The plot below (a `heatmap` of the `corr`elations of our DataFrame) shows us the correlation between each pair of numerical variables.

In [None]:
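# Correlation matrix (numeric columns only; recent pandas versions raise an error on text columns otherwise)
corrmat = df_train.corr(numeric_only=True)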
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

Isn't that beautiful?

What do you see here? Anything peculiar that stands out? Can you explain it?

For instance, we see two little bright squares. That's usually something to be wary of, as it may indicate multicollinearity. In this case, it actually makes sense! Of course the area of a garage is very strongly correlated with the number of cars that can fit in it, and it makes sense that the area of the basement would be about the same as the area of the first floor. In fact, perhaps some of the people who compiled this data set even considered the basement to be the first floor. It's something to be aware of, at the very least, and it's probably smart to discard one variable from each of these pairs.
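
Instead of spotting such squares by eye, you could also list the strongly correlated pairs directly. The sketch below is one possible approach on made-up data; the column name `GarageArea` and the threshold of 0.8 are assumptions, not fixed rules.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 1, 2],
    "GarageArea": [250, 480, 500, 750, 260, 510],
    "YearBuilt": [1990, 1950, 2005, 1970, 2010, 1960],
})

# Absolute correlations, so strong negative relationships count as well.
corr = toy.corr().abs()

# Keep only the upper triangle: each pair once, without the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

threshold = 0.8  # assumed cut-off for "strongly correlated"
high = pairs[pairs > threshold]
print(high)
```

On the real data you would start from `corrmat` instead of `toy.corr()`.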

Furthermore, we see that the bottom row (or rightmost column) contains our target variable, the SalePrice. Bright cells in this one are the ones to keep in mind for later when we start building our models!

So, now that we have a feel for which variables may be important, let's have a closer look at the best ones.

In [None]:
# SalePrice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10},
                 yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Sweet! We still see those two squares of possible multicollinearity, so we have to take them into account. Other than that, this gives us a good idea of which features we may want to use for our model!

Let's see how these most interesting features all correlate with one another. One way to do this is by using `seaborn`'s amazing `pairplot` function, which allows you to create scatter plots and histograms of many features in one go.

In [None]:
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # `size` was renamed to `height` in seaborn 0.9
plt.show();

So it seems all features have a positive correlation with SalePrice... perhaps bigger is better (or at least more expensive) when it comes to houses.

Note: the code for a number of the plots in this Notebook was copied from [this](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python) amazing Kaggle submission 🙏