<a href="https://colab.research.google.com/github/federicomilani/intro_python_data_analysis/blob/main/IntroductionToDataScienceFabLabAosta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Fab Lab Aosta - Introduction to Data Science**

> This notebook is adapted from the introductory project of [Hands on Machine Learning](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb)

*Welcome to Machine Learning Housing Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*



# Setup

First, let's import a few common modules, make sure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.5 or later is installed (Python 2.x may work, but it is deprecated, so we strongly recommend Python 3), as well as Scikit-Learn ≥0.20.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

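# Scikit-Learn ≥0.20 is required (compared numerically: as strings, "0.3" >= "0.20")
import re
import sklearn
sk_major, sk_minor = map(int, re.match(r"(\d+)\.(\d+)", sklearn.__version__).groups())
assert (sk_major, sk_minor) >= (0, 20)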

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Get the data

In [None]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if needed, then download and unpack the archive
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [None]:
fetch_housing_data()

Write a `load_housing_data` function that takes an optional `housing_path` argument with a default value of `HOUSING_PATH`. The function must read the `housing.csv` file located in `housing_path` and return its content as a Pandas dataframe, using the `read_csv` function.
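
As a hint, here is a minimal sketch of how `read_csv` behaves, using a small in-memory CSV instead of the real file. The rows are made up for illustration, and `toy_csv` and `toy_df` are hypothetical names that do not appear elsewhere in this notebook.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not taken from housing.csv
toy_csv = io.StringIO(
    "longitude,latitude,ocean_proximity\n"
    "-122.2,37.8,NEAR BAY\n"
    "-118.3,34.1,INLAND\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame
toy_df = pd.read_csv(toy_csv)
print(toy_df.shape)
```

For a file on disk, you would pass a path (for example one built with `os.path.join`) instead of the buffer.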

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    # YOUR CODE HERE
    # remember to return something

Call the function with its default argument and assign the result to the `housing` variable.

In [None]:
housing = # YOUR CODE HERE
housing.head()

Get general information about the dataframe.
> Is there a feature with missing data? Which one is it?
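
A small sketch of two ways to spot missing data, shown on a made-up frame (the column names `x` and `y` are placeholders, not dataset features): `info()` prints the non-null count of each column, and `isna().sum()` counts the missing entries directly.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the numbers are made up
toy = pd.DataFrame({
    "x": [880.0, 7099.0, 1467.0],
    "y": [129.0, np.nan, 190.0],
})

toy.info()                              # prints non-null counts and dtypes per column
missing_per_column = toy.isna().sum()   # number of NaN entries in each column
print(missing_per_column)
```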

In [None]:
# YOUR CODE HERE

Print the distribution of the different values for the `ocean_proximity` feature. 
> How many times is `NEAR OCEAN` represented? What about `ISLAND`?
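
As a hint, `value_counts` is one way to count how often each distinct value occurs in a column. The sketch below uses a made-up series of colors rather than the housing data.

```python
import pandas as pd

# Made-up categorical column
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Counts per distinct value, sorted from most to least frequent
counts = colors.value_counts()
print(counts)
```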

In [None]:
# YOUR CODE HERE

Compute summary statistics for the `housing` dataframe.
> What is the median latitude?
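
When reading a `describe()` table, keep in mind that the median is the 50th percentile, so it appears in the `50%` row rather than in the `mean` row. A toy illustration with made-up numbers, where one outlier pulls the mean far away from the median:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
summary = toy.describe()
print(summary)

# The "50%" row is the median; the mean is inflated by the outlier
print(summary.loc["50%", "x"], summary.loc["mean", "x"])
```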

In [None]:
# YOUR CODE HERE

Next, we need to look at the distribution of values for each feature. We can do that by creating a histogram for each feature, using the `pandas.DataFrame.hist` function (suggested values: 50 bins and a `20,15` figure size).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
# YOUR CODE HERE
save_fig("attribute_histogram_plots")
plt.show()

In [None]:
# to make this notebook's output identical at every run
np.random.seed(42)

> What does the `split_train_test` function below do? How should it be changed to also produce a validation set?

In [None]:
import numpy as np

# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
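
One possible direction for the validation-set question, sketched under assumptions: shuffle the row positions once and slice them into three disjoint parts. The function name `split_train_val_test`, its `val_ratio` and `seed` parameters, and the toy frame are all hypothetical and introduced only for this illustration. Try answering the question yourself before reading it.

```python
import numpy as np
import pandas as pd

def split_train_val_test(data, val_ratio, test_ratio, seed=42):
    """Shuffle row positions once, then slice them into three disjoint sets."""
    rng = np.random.default_rng(seed)
    shuffled_indices = rng.permutation(len(data))
    test_size = int(len(data) * test_ratio)
    val_size = int(len(data) * val_ratio)
    test_indices = shuffled_indices[:test_size]
    val_indices = shuffled_indices[test_size:test_size + val_size]
    train_indices = shuffled_indices[test_size + val_size:]
    return data.iloc[train_indices], data.iloc[val_indices], data.iloc[test_indices]

# Toy data: 100 rows, split 70/10/20
toy = pd.DataFrame({"x": range(100)})
train, val, test = split_train_val_test(toy, val_ratio=0.1, test_ratio=0.2)
print(len(train), len(val), len(test))
```

Because all three sets are slices of the same permutation, no row can end up in more than one of them.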

Use the above function to create `train_set` and `test_set` starting from the `housing` dataframe. Let's use 80% of the data for the training set and the remaining 20% for the test set.
> How many items are there in `train_set`? What about `test_set`?

In [None]:
# YOUR CODE HERE

Now let's look at the `median_income` values: create a histogram of its distribution, as done before.

In [None]:
housing["median_income"].hist()

Let's now convert the continuous `median_income` values to a categorical attribute using the `pandas.cut` function. The end result is a new column in the dataset (`income_cat`) that holds the income category, from 1 to 5. To do that, you need to define the income boundaries of each category: you can use `0., 1.5, 3.0, 4.5, 6., np.inf` as values for the `bins` argument and pass the category numbers through the `labels` argument.
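
To see how `pandas.cut` assigns categories, here is a minimal sketch on made-up values. The bin edges and labels differ from the suggested ones on purpose. By default each bin includes its right edge, so a value of exactly `2.0` falls into the first bin.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([0.5, 2.0, 3.9, 7.2])  # made-up values

# Bins are (0, 2], (2, 4], (4, inf]; labels name the bins in order
categories = pd.cut(incomes, bins=[0., 2., 4., np.inf], labels=[1, 2, 3])
print(categories.tolist())  # [1, 1, 2, 3]
```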

In [None]:
housing["income_cat"] = # YOUR CODE HERE

Now look at how many items there are in each income category.
> How many rows are there in the dataset with a `median_income` between 1.5 and 3.0?

In [None]:
# YOUR CODE HERE

Let's represent them with a histogram

In [None]:
housing["income_cat"].hist()

The next snippet uses Scikit-Learn to perform a stratified split based on the income category:

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
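
A note on the cell above: `split()` yields integer row positions, and `housing.loc[...]` works here only because `housing` still has its default 0 to n-1 index, so positions and labels coincide. The sketch below, on a made-up frame with a shifted index (`toy`, `splitter`, and `toy_test` are hypothetical names), uses `.iloc` instead and shows that the class proportions are preserved in the test set.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with a non-default index and an 80/20 class imbalance
toy = pd.DataFrame({"cat": ["a"] * 80 + ["b"] * 20}, index=range(1000, 1100))

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_pos, test_pos = next(splitter.split(toy, toy["cat"]))

# split() returns positions, so .iloc is the safe accessor for any index
toy_test = toy.iloc[test_pos]
print(toy_test["cat"].value_counts() / len(toy_test))
```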

The next line calculates the fraction of the rows for the test set in each category. 

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

Now compare these values to the corresponding proportions for the whole dataset.
> How do they compare? Does stratification improve the representativeness of the test set?

In [None]:
# YOUR CODE HERE

Let's make this more quantitative and compare the relative sampling errors using a completely random split and a stratified one.
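
The percentage error used in the next cell can be written as follows, where $p_{\text{overall}}$ is the proportion of a category in the whole dataset and $p_{\text{sample}}$ is its proportion in a test set (both symbols are introduced here only for illustration):

$$
\text{error}_{\%} = 100 \cdot \frac{p_{\text{sample}}}{p_{\text{overall}}} - 100
$$

For example, with made-up proportions $p_{\text{overall}} = 0.35$ and $p_{\text{sample}} = 0.36$, the error is $100 \cdot 0.36 / 0.35 - 100 \approx 2.86\%$: the category is slightly over-represented in the sample.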

In [None]:
from sklearn.model_selection import train_test_split

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props

We can now drop the `income_cat` feature, which was only used to get a better split

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Discover and visualize the data to gain insights

Next, we'll try to visualize the dataset. First of all, let's set `housing` to a copy of the stratified training set.

In [None]:
housing = strat_train_set.copy()

We can now use a scatter view to get a (bad) map of the data.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")

The above plot can be significantly improved by using semi-transparent points for the scatter plot; 90% transparency will give you a good visualization.

In [None]:
# YOUR CODE HERE
save_fig("better_visualization_plot")

Finally, let's try to add one dimension to the scatter chart, namely the `median_house_value`, so that the map shows different colors for different house prices, immediately highlighting where it was more expensive to live in California in 1990. For this last challenge, you can use the `jet` colormap and add a legend explaining what the different colors mean; please refer to the `pandas.DataFrame.plot` documentation.

PS: If you run into strange visualization issues, try adding `sharex=False` to the arguments.

In [None]:
# YOUR CODE HERE
plt.legend()
save_fig("housing_prices_scatterplot")

Now let's calculate a correlation matrix
> What does `corr_matrix` look like?

In [None]:
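# Keep numeric columns only: recent Pandas versions raise an error on text columns such as ocean_proximity
corr_matrix = housing.select_dtypes(include="number").corr()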

Next, let's focus on the correlation between the different features and `median_house_value`
> Which feature has the highest positive correlation with it?
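
As a hint, one column of a correlation matrix can be sorted to rank features against a target. The sketch below uses a made-up frame whose column names (`target`, `up`, `down`, `noisy`) are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up":     [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with target
    "down":   [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
    "noisy":  [1.0, 3.0, 2.0, 5.0, 4.0],   # correlated, but not perfectly
})

toy_corr = toy.corr()
# Select the target column and sort from strongest positive to strongest negative
ranked = toy_corr["target"].sort_values(ascending=False)
print(ranked)
```

The target always correlates perfectly with itself, so the informative entries start from the second row.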

In [None]:
# YOUR CODE HERE

Let's now get a scatter matrix for selected features.

In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")

The next snippet creates a scatter plot relating income and house value.
> What's wrong with it? What does it tell us about the `median_house_value` data?

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
save_fig("income_vs_house_value_scatterplot")

Next, let's create the following synthetic features and add them to the dataframe:
- `rooms_per_household` (shown)
- `bedrooms_per_room`
- `population_per_household`

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = # YOUR CODE HERE
housing["population_per_household"] = # YOUR CODE HERE

Let's recalculate the correlation matrix and look at it
> Do our synthetic features provide a better correlation than the values originally present in the dataset?

In [None]:
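# Numeric columns only, so the text column ocean_proximity does not break corr()
corr_matrix = housing.select_dtypes(include="number").corr()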
# YOUR CODE HERE

Let's get a visual answer to the previous question by looking at a scatter plot.

In [None]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

Finally, let's look at a dataframe description:
> Is there something wrong here? What is it?

In [None]:
housing.describe()