# California Housing Challenge

This notebook predicts the median house value (`MedHouseVal`) from the provided house features.

In [None]:
# Import Standard Libraries
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Define Seaborn theme parameters
theme_parameters = {
    'axes.spines.right': False,
    'axes.spines.top': False,
    'grid.alpha': 0.3,
    'figure.figsize': (16, 6),
    'font.family': 'Andale Mono',
    'axes.titlesize': 24,
    'figure.facecolor': '#E5E8E8',
    'axes.facecolor': '#E5E8E8'
}

# Set the theme
sns.set_theme(style='whitegrid',
              palette=sns.color_palette('deep'), 
              rc=theme_parameters)

# Read Data

In [None]:
# Read training data
california_housing_train = pd.read_csv('./../../data/season_3_episode_1/california_housing_train.csv')

In [None]:
california_housing_train.head()

In [None]:
california_housing_train.info()

# Exploratory Data Analysis (EDA)

## Median Income Distribution

In [None]:
# Plot the distribution of the column 'MedInc'
ax = sns.histplot(data=california_housing_train, 
                  x='MedInc')

ax.set_title('Median Income Distribution')

plt.show()

The data have a binomial distribution.
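The shape read off the histogram can also be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper name `describe_shape` is hypothetical and the toy values are made up for illustration only.

```python
import pandas as pd


def describe_shape(values: pd.Series) -> dict:
    """Return simple shape statistics for a numeric column.

    Hypothetical helper, introduced here only for illustration.
    """
    return {
        'mean': values.mean(),
        'median': values.median(),
        # Sample skewness: positive values indicate a longer right tail
        'skewness': values.skew(),
    }


# Toy data (assumption): NOT taken from the California housing file
toy_income = pd.Series([1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 10.0])

shape = describe_shape(toy_income)
print(shape)
```

In the notebook itself, the same helper could be called as `describe_shape(california_housing_train['MedInc'])`. A mean well above the median together with a positive skewness would point to a right-skewed distribution.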

## Median House Value over Median Income

Explore the relationship between `MedHouseVal` and `MedInc`.

In [None]:
# Plot a scatter plot of `MedHouseVal` over `MedInc`
ax = sns.scatterplot(data=california_housing_train,
                     x='MedInc',
                     y='MedHouseVal')

ax.set_ylabel('Median House Value', 
              fontweight='bold')

ax.set_xlabel('Median Income', 
              fontweight='bold')

ax.set_title('Median House Value over Median Income', 
             fontsize=14)

plt.xticks(rotation=45)

plt.show()

There is a positive correlation between the Median House Value and the Median Income. However, the Median House Value appears to be capped at 5.
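Both observations can be quantified rather than read off the plot. The sketch below is an illustration under stated assumptions: the helper `correlation_and_cap_share` is a hypothetical name, the toy frame is invented, and treating the column maximum as the cap value is an assumption, not something the dataset documents here.

```python
import pandas as pd


def correlation_and_cap_share(df: pd.DataFrame, x_col: str, y_col: str):
    """Return the Pearson correlation and the share of rows at the maximum of y_col.

    Hypothetical helper: the column maximum is ASSUMED to be the cap value.
    """
    pearson = df[x_col].corr(df[y_col])  # Pearson is the pandas default
    cap_value = df[y_col].max()
    cap_share = (df[y_col] == cap_value).mean()
    return pearson, cap_value, cap_share


# Toy data (assumption): NOT taken from the California housing file
toy = pd.DataFrame({
    'MedInc': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    'MedHouseVal': [0.8, 1.5, 2.4, 3.1, 5.0, 5.0],
})

pearson, cap_value, cap_share = correlation_and_cap_share(toy, 'MedInc', 'MedHouseVal')
print(f'Pearson: {pearson:.2f}, cap value: {cap_value}, share at cap: {cap_share:.2%}')
```

On the real data, the call would be `correlation_and_cap_share(california_housing_train, 'MedInc', 'MedHouseVal')`. A noticeable share of rows sitting exactly at the maximum would support the capping hypothesis.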

## House Age

In [None]:
# Plot the distribution of the column 'HouseAge'
ax = sns.histplot(data=california_housing_train, 
                  x='HouseAge')

ax.set_title('House Age Distribution')

plt.show()

It seems that there are no houses older than 52 years. Since there are three major peaks (at 18, 35 and 52 years), it could be reasonable to define a categorical variable called `HouseAgeClass` with the following values:
- young
- middle
- old
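One possible way to derive such a variable is to bin `HouseAge` with `pd.cut`. This is a minimal sketch, not the notebook's own definition: the helper `add_house_age_class` is hypothetical, and the cut points (18 and 35) are assumptions chosen only to illustrate the idea; the right boundaries should be picked from the histogram.

```python
import pandas as pd


def add_house_age_class(df: pd.DataFrame,
                        age_col: str = 'HouseAge',
                        bins=(0, 18, 35, float('inf')),
                        labels=('young', 'middle', 'old')) -> pd.DataFrame:
    """Return a copy of df with an ordered categorical 'HouseAgeClass' column.

    The default bin edges are ASSUMPTIONS for illustration only.
    Intervals are right-closed: [0, 18], (18, 35], (35, inf).
    """
    result = df.copy()
    result['HouseAgeClass'] = pd.cut(result[age_col],
                                     bins=list(bins),
                                     labels=list(labels),
                                     include_lowest=True)
    return result


# Toy data (assumption): NOT taken from the California housing file
toy_ages = pd.DataFrame({'HouseAge': [2, 18, 19, 35, 36, 52]})

classified = add_house_age_class(toy_ages)
print(classified)
```

Returning a copy keeps the original DataFrame untouched, so different bin edges can be tried without re-reading the data.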

In [None]:
# Define a categorical variable called `HouseAgeClass`
california_housing_train['HouseAgeClass'] 