# Day 17 - Housing data work

Working along with the book. In this exercise, I will follow the book's approach: explore the data, clean it up, analyse it, and then build a model.

The idea is to understand the process and pick up new concepts along the way.

In [None]:
import os, tarfile, urllib.request  # urllib.request must be imported explicitly
import pandas as pd, numpy as np

In [None]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("../../data", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

This code downloads the tar file from GitHub and extracts it.

Later, I will refactor it to accept any resource URL and download it to a specified folder.
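
One possible shape for that refactor, as a rough sketch only: the function name `fetch_archive` and its parameters are hypothetical, and it assumes the resource is always a tar archive.

```python
import os
import tarfile
import urllib.request


def fetch_archive(url, dest_dir, archive_name=None):
    """Download a tar archive from `url` and extract it into `dest_dir`.

    Hypothetical helper: name and signature are assumptions, not from the book.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # Fall back to the last path component of the URL as the local file name.
    archive_name = archive_name or os.path.basename(url)
    archive_path = os.path.join(dest_dir, archive_name)
    urllib.request.urlretrieve(url, archive_path)
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dest_dir)
    return dest_dir
```

With this in place, the housing download would become `fetch_archive(HOUSING_URL, HOUSING_PATH)`.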

In [None]:
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

def load_housing_data(file_path=HOUSING_PATH):
    csv_path = os.path.join(file_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
fetch_housing_data()

# Data loading

Now that we have the necessary files, we can look at loading the data into Pandas.

In [None]:
housing = load_housing_data()
housing.head()

In [None]:
housing.info()

## Analysing data

Now that we have the data inside a DataFrame, let's understand what data we have and what we can do with it.

There are 20,640 rows in total.

Only total_bedrooms has fewer than 20,640 non-null values, so some rows are missing that value. We need to fix that.

All columns are numeric except ocean_proximity, whose dtype is object, meaning it holds text. We will have to convert it to numbers somehow, because the model only understands numbers.

Let's see which unique values "ocean_proximity" has.

In [None]:
housing['ocean_proximity'].value_counts()
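
One common way to turn a text category like this into numbers is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame (the `toy` data and the `ocean` prefix are assumptions for illustration); the notebook itself has not chosen an encoding yet.

```python
import pandas as pd

# Toy stand-in for the real column; the values are made up for illustration.
toy = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]}
)

# One 0/1 column per category; exactly one column is 1 in every row.
one_hot = pd.get_dummies(toy["ocean_proximity"], prefix="ocean", dtype=int)
print(one_hot)
```

Each row ends up with a single 1 in the column of its category, so the model does not read any ordering into the categories.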

The describe method also gives us a good understanding of the data.

In [None]:
housing.describe()

## Represent data through visualisation

Now that we understand the data a bit, let's try to visualise it.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
housing.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
from zlib import crc32
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
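
The point of hashing the identifier is that a row's assignment depends only on its own ID, so it stays the same when new rows are added later. A small sketch on made-up data shows this; `is_in_test_set` and `split_by_id` mirror the two functions above under different names, and the `toy` frame is an assumption for illustration.

```python
from zlib import crc32

import numpy as np
import pandas as pd


def is_in_test_set(identifier, test_ratio):
    # Same check as above: hash the ID and compare against the ratio threshold.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


def split_by_id(data, test_ratio, id_column):
    in_test = data[id_column].apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]


toy = pd.DataFrame({"id": range(1000), "value": np.arange(1000) * 2})
_, test_before = split_by_id(toy, 0.2, "id")

# Append 200 new rows and split again.
new_rows = pd.DataFrame({"id": range(1000, 1200), "value": 0})
grown = pd.concat([toy, new_rows], ignore_index=True)
_, test_after = split_by_id(grown, 0.2, "id")

# The original rows that land in the test set are the same both times.
old_ids_after = set(test_after.loc[test_after["id"] < 1000, "id"])
print(old_ids_after == set(test_before["id"]))
```

A plain random shuffle would not give this guarantee unless the seed and the row order both stayed fixed.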

In [None]:
housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
    bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
    labels=[1, 2, 3, 4, 5])

housing["income_cat"].hist()

# Model training

Before training a model, we set aside a test set, stratified by income category so that it is representative of the whole dataset.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

In [None]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

strat_test_set["income_cat"].value_counts() / len(strat_test_set)
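
To see what stratification buys us, the sketch below compares category proportions in a stratified test set and a purely random one. It runs on synthetic data: the `toy` frame and its category sizes are assumptions, not the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(42)
# Synthetic data with deliberately uneven category sizes.
toy = pd.DataFrame({
    "income_cat": np.repeat([1, 2, 3, 4, 5], [50, 300, 400, 200, 50]),
    "x": rng.normal(size=1000),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[test_idx]  # split() yields positional indices

_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

comparison = pd.DataFrame({
    "overall": toy["income_cat"].value_counts(normalize=True),
    "stratified": strat_test["income_cat"].value_counts(normalize=True),
    "random": random_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column is free to drift, especially for the small categories.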

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Setting the test data aside and working on a copy of the training set only.

This way, nothing we do from here on modifies the original data.

In [None]:
df = strat_train_set.copy()

# Visualising Geographic data

In [None]:
df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, 
        s=df["population"]/100, label="population", figsize=(10,7),
        c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
        )
plt.legend()

In [None]:
corr_matrix = df.corr(numeric_only=True)  # pandas >= 2.0 raises on the text column otherwise
corr_matrix["median_house_value"].sort_values(ascending=False)

The output above shows how strongly "median_house_value" correlates with each of the other attributes.

Latitude has a negative correlation: prices tend to drop a little as we go north. Median income, on the other hand, has a clear positive correlation with the median house value.

Next, we will use the scatter matrix.
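
One thing to keep in mind when reading these numbers: the correlation coefficient only measures linear relationships. The sketch below, on made-up data (the `toy` frame is an assumption), shows a perfectly predictable but non-linear relation scoring close to zero.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1.0, 1.0, 101)
toy = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 1.0,   # rises with x: correlation close to +1
    "negated": -2.0 * x,       # falls with x: correlation close to -1
    "squared": x ** 2,         # fully determined by x, but not linearly
})

corr_with_x = toy.corr()["x"].sort_values(ascending=False)
print(corr_with_x)
```

So a coefficient near zero does not by itself mean two attributes are unrelated, which is one reason to look at scatter plots as well.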

In [None]:
from pandas.plotting import scatter_matrix

In [None]:
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Let's also look at the correlation between median house value and median income.

So let's plot that.

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)

Adding some attributes computed from the data we already have.

Ratios such as rooms per household can be more informative than raw totals like the number of rooms in a district.

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
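
The same three ratios could also be wrapped in a small function that returns a new frame instead of modifying `housing` in place. This is only a sketch: `add_ratio_features` is a hypothetical helper and the `toy` values are made up.

```python
import pandas as pd


def add_ratio_features(data):
    """Return a copy of `data` with the three ratio columns added."""
    out = data.copy()  # leave the caller's frame untouched
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


# Made-up numbers, chosen so the ratios are easy to check by hand.
toy = pd.DataFrame({
    "total_rooms": [100.0, 200.0],
    "total_bedrooms": [20.0, 50.0],
    "population": [40.0, 100.0],
    "households": [10.0, 25.0],
})
with_ratios = add_ratio_features(toy)
print(with_ratios)
```

Returning a copy makes it easy to apply the same step to the training and test sets separately later on.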

In [None]:
housing.head()

In [None]:
housing_new = housing.drop("ocean_proximity", axis=1)
# housing_new.drop("total_bedrooms", axis=1)
housing_new.head()

# Data preparation before training

Separating the predictors from the labels before making the data ready for training.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)  # predictors only
housing_labels = strat_train_set["median_house_value"].copy()  # target values

Some values are missing for "total_bedrooms".

So, before training, we need to fix that. We can either drop them, set them to zero, or fill them with the mean.

Here we will impute them with the median.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
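
To make the options concrete, here is a sketch of each one on a tiny made-up frame (the `toy` data is an assumption, with one missing "total_bedrooms" value), followed by the median imputation with `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0, 50.0],
    "total_bedrooms": [1.0, 2.0, np.nan, 4.0, 10.0],
})

# Option 1: drop the rows where the value is missing.
dropped_rows = toy.dropna(subset=["total_bedrooms"])
# Option 2: fill the gaps with zero.
filled_zero = toy["total_bedrooms"].fillna(0.0)
# Option 3: fill the gaps with the column mean.
filled_mean = toy["total_bedrooms"].fillna(toy["total_bedrooms"].mean())

# Median imputation: fit() learns one median per column,
# transform() fills the gaps and returns a NumPy array.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(imputer.statistics_)
print(imputed)
```

The imputer only works on numeric columns, so on the real data the text column "ocean_proximity" would have to be left out before fitting.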