# Table of Contents

* [Introduction](#Introduction)
* [Meta-data](#Meta_data)
* [Data Reading](#Data_reading)
* [ML Modeling](#ML_modeling)

## Introduction

You are an AI developer at a Silicon Valley startup that provides services in real-estate investment.
The client relationship managers have reported that demand has risen recently and that producing personalized estimates is becoming difficult.
The company has therefore asked you to automate this task with a predictive model.
To do so, you have obtained a dataset containing the median housing prices for California districts, taken from the 1990 census.

## Meta data

| Feature | Description |
|---|---|
| `longitude` | |
| `latitude` | |
| `housingMedianAge` | Median age of a house within a block; a lower number indicates a newer building. |
| `totalRooms` | Total number of rooms within a block. |
| `totalBedrooms` | Total number of bedrooms within a block. |
| `population` | Total number of people residing within a block. |
| `households` | Total number of households (a household being a group of people residing within one housing unit) for a block. |
| `medianIncome` | Median household income within a block of houses (measured in tens of thousands of US dollars). |
| `medianHouseValue` | Median house value for households within a block (measured in US dollars). |
| `oceanProximity` | Location of the house relative to the ocean. |

## Importing Libraries

In [25]:
# Import needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.preprocessing import StandardScaler, MinMaxScaler

#from sklearn.metrics import

#from sklearn.metrics import roc_auc_score, plot_roc_curve, roc_curve
#from sklearn.metrics import average_precision_score
#from sklearn.metrics import precision_recall_curve,plot_precision_recall_curve

#from sklearn.model_selection import learning_curve

#from statsmodels.stats.outliers_influence import variance_inflation_factor as vif

#from sklearn.inspection import permutation_importance
#import warnings
#warnings.simplefilter(action="ignore")

## Data Reading

### Reading data

In [19]:
# Load the cleaned dataset: no missing values or duplicates, outliers capped at the 10th and 90th percentiles
df = pd.read_csv(r'../data/data_cleaned.csv')
df

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -119.84 | 36.77 | 6.0 | 1853.0 | 473.0 | 1397.0 | 417.0 | 14817.0 | 72000.0 | INLAND |
| 1 | -117.80 | 33.68 | 8.0 | 2032.0 | 349.0 | 862.0 | 340.0 | 69133.0 | 274100.0 | <1H OCEAN |
| 2 | -120.19 | 36.60 | 25.0 | 875.0 | 214.0 | 931.0 | 214.0 | 15536.0 | 58300.0 | INLAND |
| 3 | -118.32 | 34.10 | 31.0 | 622.0 | 229.0 | 597.0 | 227.0 | 15284.0 | 200000.0 | <1H OCEAN |
| 4 | -121.23 | 37.79 | 21.0 | 1922.0 | 373.0 | 1130.0 | 372.0 | 40815.0 | 117900.0 | INLAND |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16507 | -121.90 | 39.59 | 20.0 | 1465.0 | 278.0 | 745.0 | 250.0 | 30625.0 | 93800.0 | INLAND |
| 16508 | -122.25 | 38.11 | 49.0 | 2365.0 | 504.0 | 1131.0 | 458.0 | 26133.0 | 103100.0 | NEAR BAY |
| 16509 | -121.22 | 38.92 | 19.0 | 2531.0 | 461.0 | 1206.0 | 429.0 | 44958.0 | 192600.0 | INLAND |
| 16510 | -118.14 | 34.16 | 39.0 | 2776.0 | 840.0 | 2546.0 | 773.0 | 25750.0 | 153500.0 | <1H OCEAN |


## Feature Engineering

### Encoding

The `ocean_proximity` feature is categorical and has no natural order, so it needs one-hot (dummy) encoding. An ordinal encoding would assign an arbitrary integer rank to each category, and the model would treat those ranks as magnitudes. That would inflate the feature's weight, favor the instances labeled with the highest-ranked category, and hurt the model's predictive performance.

We will therefore use dummy encoding.

In [20]:
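# Use pandas get_dummies() to build a DataFrame with one 0/1 dummy column per ocean_proximity category.
# dtype=int keeps the 0/1 integer output on recent pandas versions, which otherwise return booleans.
dummy_ocn_prx = pd.get_dummies(df["ocean_proximity"], dtype=int)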
print(dummy_ocn_prx)

       <1H OCEAN  INLAND  ISLAND  NEAR BAY  NEAR OCEAN
0              0       1       0         0           0
1              1       0       0         0           0
2              0       1       0         0           0
3              1       0       0         0           0
4              0       1       0         0           0
...          ...     ...     ...       ...         ...
16507          0       1       0         0           0
16508          0       0       0         1           0
16509          0       1       0         0           0
16510          1       0       0         0           0
16511          0       0       0         0           1

[16512 rows x 5 columns]
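To make the argument for dummy encoding concrete, here is a minimal sketch on a toy column. The four values are made up for illustration and are not rows from the dataset, and the ordinal codes come from pandas' alphabetical category order, which is exactly the kind of arbitrary ranking the text warns about.

```python
import pandas as pd

# Toy column (hypothetical values, not taken from the dataset)
toy = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND"]}
)

# Ordinal encoding: each category gets an integer code based on sort order.
# A linear model would read these codes as magnitudes (NEAR BAY > INLAND > <1H OCEAN).
ordinal = toy["ocean_proximity"].astype("category").cat.codes

# Dummy (one-hot) encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(toy["ocean_proximity"], dtype=int)

print(ordinal.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one 1, so no category is ranked above another.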


In [21]:
# Drop the raw ocean_proximity column, which the dummy columns now replace
df = df.drop(['ocean_proximity'], axis=1)
# Merge the dummy columns back into the dataframe on the index (the data is not scaled yet)
df = pd.merge(
    left=df,
    right=dummy_ocn_prx,
    left_index=True,
    right_index=True,
)
df

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -119.84 | 36.77 | 6.0 | 1853.0 | 473.0 | 1397.0 | 417.0 | 14817.0 | 72000.0 | 0 | 1 | 0 | 0 | 0 |
| 1 | -117.80 | 33.68 | 8.0 | 2032.0 | 349.0 | 862.0 | 340.0 | 69133.0 | 274100.0 | 1 | 0 | 0 | 0 | 0 |
| 2 | -120.19 | 36.60 | 25.0 | 875.0 | 214.0 | 931.0 | 214.0 | 15536.0 | 58300.0 | 0 | 1 | 0 | 0 | 0 |
| 3 | -118.32 | 34.10 | 31.0 | 622.0 | 229.0 | 597.0 | 227.0 | 15284.0 | 200000.0 | 1 | 0 | 0 | 0 | 0 |
| 4 | -121.23 | 37.79 | 21.0 | 1922.0 | 373.0 | 1130.0 | 372.0 | 40815.0 | 117900.0 | 0 | 1 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16507 | -121.90 | 39.59 | 20.0 | 1465.0 | 278.0 | 745.0 | 250.0 | 30625.0 | 93800.0 | 0 | 1 | 0 | 0 | 0 |
| 16508 | -122.25 | 38.11 | 49.0 | 2365.0 | 504.0 | 1131.0 | 458.0 | 26133.0 | 103100.0 | 0 | 0 | 0 | 1 | 0 |
| 16509 | -121.22 | 38.92 | 19.0 | 2531.0 | 461.0 | 1206.0 | 429.0 | 44958.0 | 192600.0 | 0 | 1 | 0 | 0 | 0 |
| 16510 | -118.14 | 34.16 | 39.0 | 2776.0 | 840.0 | 2546.0 | 773.0 | 25750.0 | 153500.0 | 1 | 0 | 0 | 0 | 0 |


## Building a baseline model
### Linear regression

#### Hold-out split

In [95]:
# define X, y
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
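As a rough illustration of what this hold-out split does, the sketch below applies the same `test_size` and `random_state` to a stand-in array with as many rows as the dummy-encoding output reports (16512). The array `rows` is a hypothetical placeholder for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the real data: one integer per row (hypothetical placeholder)
rows = np.arange(16512)

# Same arguments as the notebook: about one third held out, fixed seed
train_a, test_a = train_test_split(rows, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.33, random_state=42)

print(len(train_a), len(test_a))
# A fixed random_state makes the split reproducible across runs
print(np.array_equal(test_a, test_b))
```

Fixing `random_state` is what lets the baseline and the later scaled run be evaluated on the same test rows.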

#### Instantiate Linear regression model

In [24]:
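# Fit the model on the training set
reg = LinearRegression().fit(X_train, y_train)
# Predict on the held-out test set
y_pred = reg.predict(X_test)
# R² (coefficient of determination) on the test set
reg.score(X_test, y_test)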

0.6562230949358678
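The value returned by `LinearRegression.score` is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on four toy values (made up for illustration, not taken from the dataset) and checks the result against scikit-learn's `r2_score`. The helper name `r2_manual` is hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical)
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

manual = r2_manual(y_true, y_pred)
library = r2_score(y_true, y_pred)
print(manual, library)
```

Read this way, a score of about 0.656 means the baseline explains roughly two thirds of the variance of `median_house_value` on the test set.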

## First Iteration: Data Cleaning Part 2

### Scaling

*Does your data need scaling or normalizing across the features?*
*Which scaling method would you use (standard scaling or min-max scaling)?*

- Our data should be scaled, because the numeric features span very different ranges. Normalization may also be needed if the model's metrics point to a large number of outliers.

- We will therefore apply standard scaling.
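The two candidate methods can be sketched as follows, assuming the usual definitions: standard scaling computes (x - mean) / std, and min-max scaling computes (x - min) / (max - min). The input is the first five `total_rooms` values from the table shown earlier, used only as a small example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First five total_rooms values from the table above, as a single column
rooms = np.array([[1853.0], [2032.0], [875.0], [622.0], [1922.0]])

# Standard scaling: zero mean and unit variance (population std, ddof=0)
z_manual = (rooms - rooms.mean(axis=0)) / rooms.std(axis=0)
z_sklearn = StandardScaler().fit_transform(rooms)

# Min-max scaling: rescale to the [0, 1] interval
mm_manual = (rooms - rooms.min(axis=0)) / (rooms.max(axis=0) - rooms.min(axis=0))
mm_sklearn = MinMaxScaler().fit_transform(rooms)

print(np.allclose(z_manual, z_sklearn), np.allclose(mm_manual, mm_sklearn))
```

Min-max scaling depends entirely on the two extreme values, whereas standard scaling depends on the mean and the spread of the whole column.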

In [96]:
# Keep only the first 8 columns (the numeric features). The ocean_proximity dummy columns are already 0/1 and do not need to be standardized.

X = X.iloc[:, :8]

# Note: the numeric features and the 0/1 dummy columns are on different scales. Only the numeric features are standardized; the dummies are re-added below as they are, without scaling.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)

# Re-insert the encoded ocean_proximity columns into the scaled dataset
X_scaled = pd.merge(
    left=X_scaled,
    right=df.iloc[:, 9:],
    left_index=True,
    right_index=True,
)

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=42)

#### Instantiate Linear regression model

In [97]:
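# Fit the model on the scaled training set
reg = LinearRegression().fit(X_train, y_train)
# Predict on the held-out test set
y_pred = reg.predict(X_test)
# R² (coefficient of determination) on the test set
reg.score(X_test, y_test)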

0.6562230949358718
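This score matches the unscaled baseline to about 14 decimal places. That is expected: ordinary least squares with an intercept produces the same predictions after any per-feature shift and rescale, so standardization cannot change R² for this model. The sketch below reproduces the effect on synthetic data (all values are generated and hypothetical). It also wraps the scaler in a pipeline so that it is fitted on the training rows only, which differs from the cell above, where the scaler sees the full dataset before the split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (hypothetical data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([1.0, 100.0, 10_000.0])
y = X @ np.array([2.0, 0.05, 0.001]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Plain linear regression on the raw features
raw_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# Same model behind a scaler that is fitted on the training split only
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_score = scaled_model.fit(X_train, y_train).score(X_test, y_test)

print(raw_score, scaled_score)
```

Scaling would be expected to matter for regularized linear models or distance-based models, not for plain linear regression.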
