# Exploring the housing prices dataset

This notebook is a quick exploration of the [Ames, Iowa](https://www.kaggle.com/c/house-prices-advanced-regression-techniques#description) housing price dataset.

A full description of this dataset is available [here](https://github.com/eliiza/ml-training-data/blob/master/housing_price_data/data_description.txt).

## Pandas library

We'll use the [Pandas Data Analysis Library](https://pandas.pydata.org/) to explore the data. 

A useful cheat sheet for working with the Pandas library can be found [here](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf).

## Notebooks

Notebooks are a web-based interface to code interpreters. They also support inline markdown comments (like this) as well as visualisations in the output.

- In Jupyter:
    - You can run a cell using __Shift-Enter__.
    - You can insert cells using __Insert__ in the menu-bar.
    - You can delete cells using __d + d__ (select the cell and press `d` twice).
- In Google Colab:
    - You can run a cell using the __Play__ button.
    - You can insert cells using __Insert__ in the menu-bar.
    - You can delete cells using __Right Click > Delete Cell__. 

## Dataset

In [None]:
# Import required libraries
import pandas as pd

# Set notebook to display matplotlib graphics within notebook
%matplotlib inline

# Load dataset
housing_prices = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/housing_data.csv")

In [None]:
housing_prices.shape # 1460 Rows, 81 Columns

In [None]:
housing_prices.info() # Display names and types of all columns.

### Exercise

- Display the dataset ordered by the `SalePrice` values using the `.sort_values(by='VARIABLE_NAME')` method.
- This will allow you to see the cheapest and most expensive houses.
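
As a rough illustration of how `sort_values` behaves, the sketch below uses a small made-up DataFrame. The `toy` name, its columns, and its values are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({
    "House": ["A", "B", "C", "D"],
    "Price": [250000, 120000, 410000, 180000],
})

# Sort ascending by price (the default), so the cheapest row comes first
cheapest_first = toy.sort_values(by="Price")

# Pass ascending=False to put the most expensive row first
priciest_first = toy.sort_values(by="Price", ascending=False)

print(cheapest_first)
print(priciest_first.head(2))  # the two most expensive rows
```

Note that `sort_values` returns a new DataFrame by default, so `toy` itself keeps its original row order.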

## Numerical variables

- You can pick out a single column from a DataFrame using `df['COLUMN_NAME']`
- And multiple columns using `df[['COLUMN_1', 'COLUMN_2', ...]]`
- The `df.describe()` function prints summary statistics
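
The three points above can be sketched on a small made-up DataFrame. The `toy` name, its columns, and its values are hypothetical and only serve to show the shape of each result.

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({
    "Rooms": [2, 3, 3, 4],
    "Area": [60.0, 85.0, 90.0, 120.0],
    "Street": ["Elm", "Oak", "Elm", "Pine"],
})

rooms = toy["Rooms"]              # single brackets return a Series
subset = toy[["Rooms", "Area"]]   # double brackets return a DataFrame

# describe() reports count, mean, std, min, quartiles and max per column
summary = subset.describe()
print(summary)
```

The result of `describe()` is itself a DataFrame, so individual statistics can be read with `.loc`, for example `summary.loc["mean", "Rooms"]`.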

In [None]:
# Describe summary statistics for Bedrooms, Year Built, Lot Area, SalePrice
housing_prices[["BedroomAbvGr","YearBuilt","LotArea","SalePrice"]].describe()

In [None]:
# Construct a scatter plot of lot area vs sale price
housing_prices.plot.scatter(x='LotArea',y='SalePrice')

**Question:** What does this plot tell us about the relationship between lot area and house sale price?

### Exercise

Make a scatter plot (like the above) that plots `YearBuilt` vs. `SalePrice`

### Exercise

- What attributes do you think are related to a house's sale price?
- Explore a couple of other variables in the dataset that you think might be related to Sale Price. Use summary statistics and plots against Sale Prices. You can find the list of variables [here](https://github.com/eliiza/ml-training-data/blob/master/housing_price_data/data_description.txt)

### Correlations

It can be difficult to go through the variables by hand to see which are related to sale price! For the numerical variables, we can look at the __correlations__ within the dataset.

In [None]:
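# Calculate the correlation matrix for the numerical columns.
# Non-numeric columns are dropped first, because recent pandas versions
# raise an error if corr() is called on a DataFrame that contains strings.
housing_prices.select_dtypes(include='number').corr()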

### Exercise

- Pick out the column that contains the correlations with `SalePrice` (using `df['VARIABLE_NAME']`)
- You can sort the correlation values using `df.sort_values(ascending=False)`.
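
As a minimal sketch of this pattern, the snippet below builds a correlation matrix for a small made-up DataFrame, picks out one column, and sorts it. The `toy` name, its columns, and its values are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({
    "Price": [100, 150, 200, 250, 300],
    "Area": [50, 70, 95, 120, 150],
    "Age": [40, 30, 25, 10, 5],
    "Street": ["Elm", "Oak", "Elm", "Pine", "Oak"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr_matrix = toy.select_dtypes(include="number").corr()

# Select one column of the matrix (a Series) and sort it,
# strongest positive correlation first
price_corr = corr_matrix["Price"].sort_values(ascending=False)
print(price_corr)
```

A variable always correlates perfectly with itself, so the first entry is the selected column. Strongly negative values at the bottom of the list can be just as informative as strongly positive ones at the top.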

### Exercise (optional)

- Now look at scatter plots for some of the variables that are highly correlated with `SalePrice`

## Non-numerical variables

- We will use [box plots](https://en.wikipedia.org/wiki/Box_plot) to explore the relationship between some of the non-numerical variables and sale price.
- The [Seaborn](https://seaborn.pydata.org/) visualisation library provides a good boxplot function: [sns.boxplot()](https://seaborn.pydata.org/generated/seaborn.boxplot.html)


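- Example: `BsmtQual` rates the height of the basement. The related `BsmtCond` column, plotted below, rates the basement's general condition using the same codes.
    - `Ex`: Excellent (100+ inches)
    - `Gd`: Good (90-99 inches)
    - `TA`: Typical (80-89 inches)
    - `Fa`: Fair (70-79 inches)
    - `Po`: Poor (<70 inches)
    - `NA`: No Basement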


In [None]:
import seaborn as sns
# Construct a box plot of basement condition vs sale price
sns.boxplot(x='BsmtCond',y='SalePrice', data=housing_prices)
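
A box plot summarises each category by its median and quartiles, and the same numbers can be computed directly with a group-by. The sketch below does this on a small made-up DataFrame. The `toy` name, its columns, and its values are hypothetical and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the housing dataset
toy = pd.DataFrame({
    "Quality": ["Gd", "TA", "Gd", "Fa", "TA", "TA", "Gd", "Fa"],
    "Price": [300, 180, 260, 110, 200, 160, 340, 130],
})

# Per-category summary: the 25%, 50% and 75% columns are the quartiles
# that a box plot draws as the edges and centre line of each box
summary = toy.groupby("Quality")["Price"].describe()
print(summary)

# Median price per category, highest first
medians = toy.groupby("Quality")["Price"].median().sort_values(ascending=False)
print(medians)
```

Reading the table alongside the plot can help when two boxes look similar, or when a category has very few rows (check the `count` column).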

### Exercise (optional)

- Have a look at the relationships between the following variables and `SalePrice` using boxplots:
    - `KitchenQual`
    - `CentralAir`
    - `Heating`
- Check the variable definitions (and possible values) [here](https://github.com/eliiza/ml-training-data/blob/master/housing_price_data/data_description.txt)