<a href="https://colab.research.google.com/github/d-klotz/ai-training/blob/main/linear-regression-2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Import Pandas library to load the dataset.


In [None]:
import pandas as pd

dataset = pd.read_csv("data/housing.csv")
dataset.head()

## Configurations
Let's set up the environment: fix NumPy's random seed so results are reproducible and configure Matplotlib, which will be used for data visualization.

In [None]:
import numpy as np
np.random.seed(42)
import os

# used to plot the charts
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## Knowing the dataset

In [None]:
# Display how many rows and columns are in the dataset
dataset.shape

In [None]:
dataset.info()

total_bedrooms has 207 missing values.

ocean_proximity is a text column. Let's check which categories it contains and how often each one occurs.

In [None]:
set(dataset['ocean_proximity'])
dataset['ocean_proximity'].value_counts()

Let's see some statistics about the numerical features

In [None]:
dataset.describe()

Let's visualize the data to get a better understanding

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
dataset.hist(bins=50, figsize=(20,15))

After checking the histograms, we notice that median income is capped at 15 and median house value is capped at 500,000. If we want to make predictions for houses above 500,000, we need to train our model with higher-priced data.
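
A capped column can also be checked numerically instead of by eye. The sketch below is only an illustration on synthetic data: the Series `capped` and the 500,000 ceiling are assumptions standing in for the real `median_house_value` column. It counts how many rows sit exactly at the maximum value.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a capped column: values above 500,000 are clipped.
rng = np.random.default_rng(0)
raw_values = rng.normal(loc=300_000, scale=150_000, size=1_000)
capped = pd.Series(np.clip(raw_values, 0, 500_000), name="median_house_value")

# A large number of rows sharing the exact maximum suggests a cap.
cap = capped.max()
n_at_cap = int((capped == cap).sum())
share_at_cap = n_at_cap / len(capped)
print(f"cap={cap:.0f}, rows at cap={n_at_cap}, share={share_at_cap:.1%}")
```

The same three lines can be pointed at `dataset['median_house_value']` to see how many districts are affected.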

### Let's split our dataset into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(dataset, test_size= 0.2, random_state=7)
print(len(df_train), "training", len(df_test), "testing")

## Creating categories of median annual income
Let's suppose that we talked to a sales specialist and he said that median income is an important attribute to add to the predictive model. When we divide the dataset into training and testing sets, we need to make sure that the distribution of incomes in both sets is similar. We can achieve this by dividing the median income into categories.

In [None]:
dataset['median_income'].hist()

To build the income categories, divide the "median_income" column by 1.5 and round up to the nearest whole number with np.ceil.

In [None]:
dataset['income_cat'] = np.ceil(dataset['median_income'] / 1.5)

Cap the categories at 5: values of 5 or more become 5, and values below 5 stay the same.

In [None]:
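# Assign the result back; where() returns a new Series and does not modify the column in place
dataset['income_cat'] = dataset['income_cat'].where(dataset['income_cat'] < 5, 5.0)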

Alternatively, the cut function from pandas bins a continuous column into labeled intervals in a single step. With the bin edges used here it yields the same five categories.

In [None]:
dataset['income_cat'] = pd.cut(dataset['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
dataset['income_cat'].value_counts()
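
To see that the two approaches agree, here is a small sketch on a hand-made Series (the `income` values are made up for illustration and are not taken from the dataset). It builds the categories once with `np.ceil` plus `where`, once with `pd.cut`, and compares them.

```python
import numpy as np
import pandas as pd

# Made-up incomes, including values that sit exactly on a bin edge.
income = pd.Series([0.5, 1.5, 2.9, 3.0, 4.6, 6.0, 7.2, 15.0])

# Approach 1: divide by 1.5, round up, then cap everything at 5.
by_ceil = np.ceil(income / 1.5)
by_ceil = by_ceil.where(by_ceil < 5, 5.0)

# Approach 2: explicit bin edges; pd.cut includes the right edge by default.
by_cut = pd.cut(income, bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                labels=[1, 2, 3, 4, 5])

same = bool((by_ceil.astype(int) == by_cut.astype(int)).all())
print(pd.DataFrame({"income": income, "by_ceil": by_ceil, "by_cut": by_cut}))
print("identical categories:", same)
```

One practical difference: `pd.cut` returns a categorical column, while the `np.ceil` route returns floats.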

In [None]:
dataset['income_cat'].hist()

Now we can use stratified sampling to ensure that test and training sets have the same distribution of income categories. 

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(dataset, dataset['income_cat']):
    strat_train_set = dataset.loc[train_index]
    strat_test_set = dataset.loc[test_index]

In [None]:
#Let's check the distribution of income categories in the test set
strat_test_set['income_cat'].value_counts() / len(strat_test_set)

In [None]:
#Let's check the distribution of income categories in the training set
strat_train_set['income_cat'].value_counts() / len(strat_train_set)

In [None]:
dataset['income_cat'].value_counts() / len(dataset)
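
The three proportion tables can also be put side by side. The following is a minimal sketch on synthetic data: the `toy` frame and its category probabilities are assumptions, not the housing data. It compares a purely random test set with a stratified one against the overall distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic frame with an imbalanced category column (assumed proportions).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "value": rng.normal(size=1_000),
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1_000,
                             p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

overall = toy["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split.
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

# Stratified split; the splitter yields positional indices, hence iloc.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
_, strat_idx = next(splitter.split(toy, toy["income_cat"]))
strat_test = toy.iloc[strat_idx]

comparison = pd.DataFrame({
    "overall": overall,
    "random": random_test["income_cat"].value_counts(normalize=True).sort_index(),
    "stratified": strat_test["income_cat"].value_counts(normalize=True).sort_index(),
}).fillna(0.0)
comparison["random_abs_err"] = (comparison["random"] - comparison["overall"]).abs()
comparison["strat_abs_err"] = (comparison["stratified"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions closely, while the random column can drift, especially for rare categories.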

After stratified sampling, we can remove the income_cat column from both sets because it is no longer needed; it only served as an auxiliary variable for the split.

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop('income_cat', axis=1, inplace=True)

### Analysing geographical data

In [None]:
housing = strat_train_set.copy()

# The scatter plot will display an almost perfect map of California.
housing.plot(kind='scatter', x='longitude', y='latitude')

In [None]:
# Setting alpha=0.1 reveals the density of houses on the map.
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.1)

Now let's plot the housing prices on the map and compare it with the density of houses.

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', alpha=0.4,
             s=housing['population']/100, label='population', figsize=(10,7),
             c='median_house_value', cmap=plt.get_cmap('jet'), colorbar=True,
             sharex=False)
plt.legend()

### Correlations
Now let's look at the correlations between the attributes.

In [None]:
housing = housing.drop(columns="ocean_proximity")
corr_matrix = housing.corr()

In [None]:
# We can see that median house value is strongly correlated with median income
corr_matrix['median_house_value'].sort_values(ascending=False)

In [None]:
# from pandas.tools.plotting import scatter_matrix
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop the target from the training set (our X)
housing_labels = strat_train_set["median_house_value"].copy()

In [None]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

In [None]:
housing.isnull().sum()

Some values are null. How are we going to handle them?

In [None]:
# Option 1
# Replacing values with the median
median = housing["total_bedrooms"].median()
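# fillna on a selected column with inplace=True may not update the frame, so reassign instead
sample_incomplete_rows = sample_incomplete_rows.fillna({"total_bedrooms": median})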
sample_incomplete_rows
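
Filling with the median is one choice among several. The sketch below uses a tiny made-up frame (`toy` and its values are assumptions for illustration) to contrast it with two other common options: dropping the incomplete rows and dropping the whole column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Option A: drop the rows that have a missing value (loses observations).
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option B: drop the whole column (loses the feature).
dropped_column = toy.drop(columns="total_bedrooms")

# Option C: keep everything and fill the gaps with the column median.
median = toy["total_bedrooms"].median()
filled = toy.fillna({"total_bedrooms": median})

print(len(dropped_rows), list(dropped_column.columns), filled["total_bedrooms"].tolist())
```

Whichever option is used, the median should be computed on the training set only and then reused for the test set.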

### Or use Scikit-Learn's built-in SimpleImputer to replace null values with the median.

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),  # replace null values with the median
    ('std_scaler', StandardScaler()),               # standardize the features
])

housing_num = housing.drop(columns="ocean_proximity")
housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
housing_num_tr
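
To make the two pipeline steps less of a black box, here is a small sketch on a made-up array (`X` is an assumption for illustration). It runs the imputer and the scaler separately and checks the scaler against the manual formula: subtract the column mean, then divide by the column standard deviation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up data with one missing value in the second column.
X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [3.0, 500.0],
              [6.0, 700.0]])

# Step 1: the missing entry is replaced by the median of its column.
imputed = SimpleImputer(strategy="median").fit_transform(X)

# Step 2: each column is rescaled to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)

# Manual equivalent of step 2 (population standard deviation, ddof=0).
manual = (imputed - imputed.mean(axis=0)) / imputed.std(axis=0)

print(imputed)
print(np.allclose(scaled, manual))
```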

**ColumnTransformer** is a Scikit-Learn class that lets us apply different transformations to different columns (numerical or categorical).

In [None]:
try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer # Scikit-Learn < 0.20 

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),  # transform the numerical attributes
    # One-hot encode the categorical attribute: each category gets its own
    # column holding 1 where the row belongs to that category and 0 otherwise.
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared


In [None]:
housing_prepared.shape
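
As a quick illustration of what the one-hot step contributes to that shape, the sketch below encodes a made-up column with three categories (the `toy` frame is an assumption, not the housing data). Each category becomes one output column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "ISLAND"]})

encoder = OneHotEncoder()
# The encoder returns a SciPy sparse matrix by default; toarray() makes it dense.
encoded = encoder.fit_transform(toy).toarray()

# Learned categories, in the same order as the output columns.
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```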

In [None]:
column_names = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income',
                '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

# Transform the NumPy array back into a pandas DataFrame
housing_df = pd.DataFrame(data=housing_prepared, columns=column_names)

# Display the shape of the resulting DataFrame
print(housing_df.shape)

In [None]:
housing_df.head()

In [None]:
print(housing_df.isnull().sum())

In [57]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

### Let's test the model with some data.

In [58]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
predictions = lin_reg.predict(some_data_prepared)  # predict only the five sample rows
print("Predictions:", predictions)

Predictions: [ 88983.14806384 305351.35385026 153334.71183453 184302.55162102
 246840.18988841]


### Compare with actual values from the labels.

In [60]:
print("Labels:", list(some_labels))

Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]


### Evaluate the model
MSE (Mean Squared Error) is the mean of the squared differences between the predicted and actual values.
The closer the MSE is to zero, the better the model.


In [61]:
from sklearn.metrics import mean_squared_error
# MSE averages the squared errors, so large errors are penalized much more heavily than small ones.
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse) # square root of the mean squared error
lin_rmse

np.float64(69050.56219504567)

### MAE (Mean Absolute Error)
MAE is the mean of the absolute differences between the predicted and actual values. The closer to zero, the better the model.

In [62]:
from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae

np.float64(49905.329442715316)
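
As a sanity check on how these metrics are defined, the sketch below recomputes them by hand with NumPy for the five sample rows printed earlier (the arrays are copied from the "Predictions" and "Labels" outputs of the linear model). Because it covers only five rows, the numbers will differ from the full training-set values.

```python
import numpy as np

# Five sample predictions and labels copied from the outputs above.
predictions = np.array([88983.14806384, 305351.35385026, 153334.71183453,
                        184302.55162102, 246840.18988841])
labels = np.array([72100.0, 279600.0, 82700.0, 112500.0, 238300.0])

errors = predictions - labels

mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # back in dollars, like the labels
mae = np.mean(np.abs(errors))   # mean of the absolute errors

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```

RMSE is never smaller than MAE; a large gap between the two indicates that a few rows carry unusually large errors.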

A typical error of about 69,050 dollars is not acceptable for our model, given that most median_house_value entries range between 120 thousand and 265 thousand dollars. Because this error is measured on the training data itself, the model is underfitting rather than overfitting: it is not expressive enough to capture the patterns in the data. Shall we try a more powerful model?

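R2 (the coefficient of determination) measures how much of the variance in the labels the model explains. The closer the value is to 1, the better.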

In [63]:
from sklearn.metrics import r2_score
r2 = r2_score(housing_labels, housing_predictions)
print('r2', r2)

r2 0.6438078994746375
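
For intuition, R2 can be written as 1 minus the ratio between the model's squared error and the squared error of always predicting the mean. The helper `r2_by_hand` below is a hypothetical name introduced only for this sketch, and the sample values are made up.

```python
import numpy as np

def r2_by_hand(labels, predictions):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    labels = np.asarray(labels, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((labels - predictions) ** 2)    # error of the model
    ss_tot = np.sum((labels - labels.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r2_by_hand(y, y)            # predictions match the labels exactly
baseline = r2_by_hand(y, [2.5] * 4)   # always predict the mean of y
print(perfect, baseline)
```

A model that is worse than predicting the mean gets a negative value, so R2 is not bounded below by zero.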


In [64]:
# Function to calculate the MAPE (Mean Absolute Percentage Error).
# The closer to 0%, the more accurate the model.
def calculate_mape(labels, predictions):
    errors = np.abs(labels - predictions)
    relative_errors = errors / np.abs(labels)
    mape = np.mean(relative_errors) * 100
    return mape

In [65]:
# Calculate the MAPE
mape_result = calculate_mape(housing_labels, housing_predictions)
# Print the result
print(f"The MAPE is: {mape_result:.2f}%")

The MAPE is: 28.65%


## Let's try other models

In [66]:
from sklearn.tree import DecisionTreeRegressor
# Create a DecisionTreeRegressor model
model_dtr = DecisionTreeRegressor(max_depth=10)
model_dtr.fit(housing_prepared, housing_labels)

In [67]:
# let's try the complete preprocessing pipeline on some training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
predictions = model_dtr.predict(some_data_prepared)
print("Predictions:", predictions)

Predictions: [ 90980.88235294 324661.11111111  72856.96202532 168772.60273973
 226591.38505747]


In [68]:
# Actual data
print("Labels:", list(some_labels))

Labels: [72100.0, 279600.0, 82700.0, 112500.0, 238300.0]


### Checking the performance of the Decision Tree Model

In [69]:
# Root mean squared error (RMSE) of the decision tree on the training set
housing_predictions = model_dtr.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

np.float64(47873.314559945495)

In [70]:
# Mean absolute error (MAE) of the decision tree on the training set
tree_mae = mean_absolute_error(housing_labels, housing_predictions)
tree_mae

np.float64(32067.265630356796)

In [71]:
r2 = r2_score(housing_labels, housing_predictions)
print('r2', r2)

r2 0.8287869591640339


In [72]:
# Calculate the MAPE
mape_result = calculate_mape(housing_labels, housing_predictions)
# Print the result
print(f"The MAPE is: {mape_result:.2f}%")

The MAPE is: 17.94%
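
All of the scores above are computed on the same rows the models were trained on, and a flexible model such as a decision tree can look better there than it really is. The sketch below illustrates the effect on synthetic data generated with `make_regression` (an assumption for illustration, not the housing data) by comparing the training RMSE with the RMSE on held-out rows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noisy targets.
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

tree = DecisionTreeRegressor(max_depth=10, random_state=42)
tree.fit(X_train, y_train)

# RMSE on rows the tree has seen versus rows it has not seen.
train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))

print(f"train RMSE: {train_rmse:.2f}")
print(f"test RMSE:  {test_rmse:.2f}")
```

In this notebook the analogous check would be to run `full_pipeline.transform` on `strat_test_set` (without refitting) and score both models on those rows.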
