# tipToe Dive into Machine Learning I
> Implementing various methods and tools, based largely on notes from [The Mechanics of Machine Learning](https://mlbook.explained.ai/).
- toc: true
- branch: master
- badges: true
- comments: true
- author: Victor Worlanyo
- categories: [Advanced Beginner]

## Loading data
Download and unzip these files into your working directory, then run `!python prep-rent.py`:
- [train.json](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data?select=train.json.zip)
- [prep-rent.py](https://mlbook.explained.ai/data/prep-rent.py)

In [1]:
!python prep-rent.py

Created rent.csv
Created rent-ideal.csv


# First, a shallow analysis of our data
This is rental listing data from RentHop, a portfolio company of [Two Sigma Ventures](https://twosigmaventures.com/). The goal of this project is to predict the number of enquiries a future listing will get, based on the date the listing was created and other features such as location and the number of bedrooms and bathrooms.

In [2]:
import pandas as pd
df = pd.read_csv('rent.csv')

Transposing the dataframe makes the wide rows easier to read.

In [3]:
print(df.shape)
df.head().T

(49352, 15)


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| bathrooms | 1 | 1 | 1 | 1.5 | 1 |
| bedrooms | 1 | 2 | 2 | 3 | 0 |
| building_id | 8579a0b0d54db803821a35a4a615e97a | b8e75fc949a6cd8225b455648a951712 | cd759a988b8f23924b5a2058d5ab2b49 | 53a5b119ba8f7b61d4e010512e0dfc85 | bfb9405149bfff42a92980b594c28234 |
| created | 2016-06-16 05:55:27 | 2016-06-01 05:44:33 | 2016-06-14 15:19:59 | 2016-06-24 07:54:24 | 2016-06-28 03:50:23 |
| description | Spacious 1 Bedroom 1 Bathroom in Williamsburg!... | BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you... | \*\*FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL\*\*L... | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | Over-sized Studio w abundant closets. Availabl... |
| display_address | 145 Borinquen Place | East 44th | East 56th Street | Metropolitan Avenue | East 34th Street |
| features | ['Dining Room', 'Pre-War', 'Laundry in Buildin... | ['Doorman', 'Elevator', 'Laundry in Building',... | ['Doorman', 'Elevator', 'Laundry in Building',... | [] | ['Doorman', 'Elevator', 'Fitness Center', 'Lau... |
| latitude | 40.7108 | 40.7513 | 40.7575 | 40.7145 | 40.7439 |
| listing_id | 7170325 | 7092344 | 7158677 | 7211212 | 7225292 |
| longitude | -73.9539 | -73.9722 | -73.9625 | -73.9425 | -73.9743 |


Next, we look at a concise summary of the rental data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49352 entries, 0 to 49351
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        49352 non-null  float64
 1   bedrooms         49352 non-null  int64  
 2   building_id      49352 non-null  object 
 3   created          49352 non-null  object 
 4   description      47906 non-null  object 
 5   display_address  49217 non-null  object 
 6   features         49352 non-null  object 
 7   latitude         49352 non-null  float64
 8   listing_id       49352 non-null  int64  
 9   longitude        49352 non-null  float64
 10  manager_id       49352 non-null  object 
 11  photos           49352 non-null  object 
 12  price            49352 non-null  int64  
 13  street_address   49342 non-null  object 
 14  interest_level   49352 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 5.6+ MB


`df.info()` shows that the data has a mix of numeric columns (`float64`, `int64`) and non-numeric columns (`object`). Our focus will be on the numeric data.

In [5]:
df_num = df[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']] # a list of features with numeric values
df_num.head()

| | bathrooms | bedrooms | longitude | latitude | price |
|---|---|---|---|---|---|
| 0 | 1.0 | 1 | -73.9539 | 40.7108 | 2400 |
| 1 | 1.0 | 2 | -73.9722 | 40.7513 | 3800 |
| 2 | 1.0 | 2 | -73.9625 | 40.7575 | 3495 |
| 3 | 1.5 | 3 | -73.9425 | 40.7145 | 3000 |
| 4 | 1.0 | 0 | -73.9743 | 40.7439 | 2795 |
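As an alternative to listing the numeric columns by hand, pandas can select columns by dtype. The sketch below uses a small toy frame (the name `toy` is made up for illustration; the values come from the first rows shown earlier) and is not what the notebook itself runs. Note that identifier columns such as `listing_id` are numeric too, so they would need to be dropped explicitly.

```python
import pandas as pd

# Toy frame for illustration; column names mirror a few columns of rent.csv.
toy = pd.DataFrame({
    "bathrooms": [1.0, 1.5],
    "bedrooms": [1, 3],
    "listing_id": [7170325, 7211212],
    "created": ["2016-06-16 05:55:27", "2016-06-24 07:54:24"],
})

# Keep every numeric column (int64 and float64 here); 'created' is dropped.
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))

# Identifiers are numeric but carry no quantity, so remove them by name.
numeric = numeric.drop(columns=["listing_id"])
print(list(numeric.columns))
```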


Next, we check for missing values in the data set, since many ML models cannot handle `NaN` values directly.

In [6]:
print(df_num.isnull().sum())

bathrooms    0
bedrooms     0
longitude    0
latitude     0
price        0
dtype: int64
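The numeric columns here have no missing values, so nothing needs fixing. For reference, here is a minimal sketch of two common ways to deal with `NaN` when it does appear. The toy frame and the choice of the median as a fill value are assumptions made for illustration, not steps from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing price (illustrative values).
toy = pd.DataFrame({
    "bedrooms": [1, 2, 0],
    "price": [2400.0, np.nan, 2795.0],
})

print(toy.isnull().sum())  # per-column count of missing values

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: fill missing prices with the column median (NaN is skipped
# when the median is computed).
filled = toy.fillna({"price": toy["price"].median()})
print(filled)
```

Dropping rows is the simpler choice when only a few rows are affected; filling keeps the rows but injects a made-up value, so it is worth checking how much of the column is missing first.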


# Model Training 
Are there patterns that show a relationship between the features and the target?
## procedure
### create design matrices: features (independent variables) vs target (dependent variable)

In [10]:
X_train = df_num[['bathrooms','bedrooms','longitude','latitude']]
y_train = df_num['price']

### create model

In [18]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, # number of trees in the forest = 100
                           n_jobs=-1)        # use all processors for training

### fit model to training data

In [19]:
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=-1, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

## how well does the model fit the data?
Methods:
- the $R^{2}$ score (coefficient of determination), which measures how well the model fits the data.
- the mean absolute error (MAE), which measures the average difference between predicted and actual values (here, the training error).

In [20]:
# rsquared
r2 = rf.score(X_train, y_train)
print(f'{r2:.4f}')

0.8756


A high $R^{2}$ score means the model has captured some patterns in the data, which indicates a relationship between the features and the target. Measured on the training data, though, $R^{2}$ is mainly useful for checking that such patterns exist; it does not show how the model performs on unseen data.
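To make the $R^{2}$ score less of a black box, here is a minimal from-scratch sketch of the coefficient of determination, $1 - SS_{res}/SS_{tot}$, which is the quantity a scikit-learn regressor's `score` method reports. The helper name `r_squared` and the constant predictions are made up for illustration; the prices are the first five from the table shown earlier.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [2400, 3800, 3495, 3000, 2795]              # first five prices shown earlier

perfect = r_squared(y_true, y_true)                  # predictions match exactly
baseline = r_squared(y_true, [np.mean(y_true)] * 5)  # always predict the mean price
worse = r_squared(y_true, [5000] * 5)                # hypothetical, badly off predictions

print(perfect, baseline, worse)
```

A perfect fit scores 1, always predicting the mean scores 0, and anything that does worse than the mean goes negative, so $R^{2}$ has no lower bound.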

In [23]:
# mean absolute error
from sklearn.metrics import mean_absolute_error
predictions = rf.predict(X_train)
e = mean_absolute_error(y_train, predictions)
ep = e*100.0/y_train.mean()
print(f'${e:.0f} average error; {ep:.2f}% training error')

$286 average error; 7.47% training error


A 7.47% error looks great, but a metric computed on the training data does not capture the overall validity of our model. To be more confident in the model, let's test how well it generalizes.
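The MAE and the percentage derived from it are simple enough to compute by hand. The sketch below mirrors the `e*100.0/y_train.mean()` calculation above on five prices; the helper name `mae_percent` and the predicted values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def mae_percent(y_true, y_pred):
    """Return the mean absolute error and that error as a percentage of the mean target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    return mae, 100.0 * mae / y_true.mean()

y_true = [2400, 3800, 3495, 3000, 2795]  # first five prices shown earlier
y_pred = [2500, 3600, 3495, 3100, 2695]  # hypothetical predictions

mae, pct = mae_percent(y_true, y_pred)
print(f'${mae:.0f} average error; {pct:.2f}% error')
# prints: $100 average error; 3.23% error
```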

# Model Testing (Generalization)
## how well does the model perform on unseen data (generalization)?
Does the model yield reasonable predictions?\
We check the validation error using:
- hold out method
- out-of-bag score

In [28]:
# hold out method
from sklearn.model_selection import train_test_split
X,y =  df_num[['bathrooms','bedrooms','longitude','latitude']], df_num['price']

# holding out 20% of training data as validation/test data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2)

rf = RandomForestRegressor(n_estimators=10)
rf.fit(X_train, y_train)

validation_e = mean_absolute_error(y_test, rf.predict(X_test))
print(f'${validation_e:.0f} average error; {validation_e*100.0/y.mean():.2f}% validation error')
print(f'training error is {((e/validation_e)*100):.2f}% of validation error, so the model generalizes poorly')

$407 average error; 10.64% validation error
training error is 70.21% of validation error, so the model generalizes poorly


As expected, the model does better on the training data than on the validation data. A gap this large suggests that the data might be very noisy or that there are no patterns in it. The high training $R^{2}$ already suggested that some relationship exists between the feature variables and the target, so noise is the more likely explanation.
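The size of the gap can be expressed in two ways, and they are easy to mix up: the ratio of training error to validation error, and the relative increase from training to validation. The sketch below works both out from the rounded dollar errors printed in the outputs above, so the results differ slightly from values computed on the unrounded errors. Keep in mind that the two errors come from different models (100 trees trained on all rows versus 10 trees trained on 80% of them), so this is only a rough comparison.

```python
train_mae = 286.0       # rounded training MAE from the earlier output
validation_mae = 407.0  # rounded validation MAE from the hold-out output

# Training error as a share of validation error.
ratio = 100.0 * train_mae / validation_mae

# How much larger the validation error is, relative to the training error.
increase = 100.0 * (validation_mae - train_mae) / train_mae

print(f'training error is {ratio:.1f}% of validation error')
print(f'validation error is {increase:.1f}% higher than training error')
```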

In [25]:
# out-of-bag score:  proof of noisy data
rf = RandomForestRegressor(n_estimators=100,  # number of trees in the forest = 100
                           n_jobs=-1,         # use all processors for training
                           oob_score=True)    # get error estimate

X_train, y_train = df_num[['bathrooms','bedrooms','longitude','latitude']], df_num['price']                           

rf.fit(X_train,y_train)
noisy_oob_r2 = rf.oob_score_
print(f'OOB score: {noisy_oob_r2:.4f}')

OOB score: -0.0460


The OOB score is far below the training $R^{2}$ score, even though both measure the same thing (the coefficient of determination). A negative OOB score means the forest predicts unseen rows worse than simply predicting the mean price, and the most likely explanation is that the data is noisy, inconsistent, or contains outliers. To confirm that, we dig deeper into the data in the next post, [tipToe Dive into Machine Learning II](https://d3la.github.io/wOL3/advanced%20beginner/2020/12/08/2SigmaP1.html).
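To see in isolation how noise pulls the OOB score away from the training score, here is a minimal sketch on synthetic data. The data, the noise level, and the fixed seeds are all assumptions made for illustration; the exact numbers will not match the rental data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))                       # four synthetic features
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=500)  # weak signal plus heavy noise

rf = RandomForestRegressor(n_estimators=100,
                           oob_score=True,   # score each row with trees that never saw it
                           random_state=0)   # fixed seed so reruns give the same forest
rf.fit(X, y)

train_r2 = rf.score(X, y)  # R^2 on rows the trees were trained on
oob_r2 = rf.oob_score_     # R^2 from out-of-bag predictions
print(f'training R^2: {train_r2:.4f}, OOB R^2: {oob_r2:.4f}')
```

The training score stays high because the trees memorize much of the noise, while the OOB score only credits what carries over to rows a tree has not seen.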