## MLBookCamp Homework 3

- [Course page](https://datatalks.club/courses/2021-winter-ml-zoomcamp.html)

- [Homework page](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/homework.md)

### Dataset

In this homework, we will continue working with the New York City Airbnb Open Data. You can get it from [Kaggle](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data?select=AB_NYC_2019.csv) or download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv) if you don't want to sign up for Kaggle.

We'll keep working with the 'price' variable, and we'll turn the problem into a classification task.

In [None]:
# Import basic modules beforehand
import os
import numpy as np
import pandas as pd
import sklearn
import warnings
warnings.filterwarnings('ignore')


In [None]:
os.listdir()

In [None]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv -O 'airbnb.csv'

In [None]:
os.listdir()

In [None]:
df = pd.read_csv('airbnb.csv')

In [None]:
df.head()

### Features

For the rest of the homework, you'll need the features from the previous homework plus two additional ones, 'neighbourhood_group' and 'room_type'. The full feature set is:

    'neighbourhood_group',
    'room_type',
    'latitude',
    'longitude',
    'price',
    'minimum_nights',
    'number_of_reviews',
    'reviews_per_month',
    'calculated_host_listings_count',
    'availability_365'

Select only them and fill in the missing values with 0.

In [None]:
# list of features to be used
features = [
'neighbourhood_group',
'room_type',
'latitude',
'longitude',
'minimum_nights',
'number_of_reviews',
'reviews_per_month',
'calculated_host_listings_count',
'availability_365',
'price'
]

len(features)

In [None]:
# Select the desired columns; .copy() avoids SettingWithCopyWarning on later assignments
abnb_df = df[features].copy()
abnb_df.head()

In [None]:
abnb_df.info()

> Select only them and fill in the missing values with 0.

From the `info()` output, only one column, `reviews_per_month`, contains NaN values.

In [None]:
# Fill nan values with 0
abnb_df['reviews_per_month'] = abnb_df['reviews_per_month'].fillna(0)

abnb_df.info()
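
As a quick cross-check of the step above, a per-column NaN count shows which columns need imputation. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Airbnb data) to compare `isna().sum()` before and after `fillna(0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Airbnb data
toy = pd.DataFrame({
    'number_of_reviews': [9, 0, 45],
    'reviews_per_month': [0.21, np.nan, 4.64],
})

# Count missing values per column before imputation
missing_before = toy.isna().sum()

# Replace every NaN with 0, as the homework asks
toy_filled = toy.fillna(0)
missing_after = toy_filled.isna().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```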

### Question 1

What is the most frequent observation (mode) for the column 'neighbourhood_group'?
> Manhattan : 21661

In [None]:
# Fetch mode using value_counts
abnb_df['neighbourhood_group'].value_counts(ascending= False)

### Split the data

   -  Split your data in train/val/test sets, with 60%/20%/20% distribution.  
    
   -  Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.  
    
   -  Make sure that the target value ('price') is not in your dataframe.


In [None]:
# Slice price column 
target = abnb_df['price'].to_frame()

# Drop price column from df
abnb_df.drop(columns='price', inplace= True)

### Make price binary

   -  We need to turn the price variable from numeric into binary.
   -  Let's create a variable above_average which is 1 if the price is above (or equal to) 152.
    

In [None]:
# Binarize price
target['above_avg'] = (target['price'] >= 152).astype(int)
target.head()

In [None]:
from sklearn.model_selection import train_test_split
# Split data
x_train, x_test, y_train, y_test = train_test_split(abnb_df, target, test_size=0.2, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=42)

In [None]:
print(
f'Train size: {len(x_train)}, {len(y_train)}\n'
f'Val size: {len(x_val)}, {len(y_val)}\n'
f'Test size: {len(x_test)}, {len(y_test)}'
)
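
The second call uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full dataset. A minimal sketch on 100 made-up rows (the `toy_x` and `toy_y` names are illustrative only) checks that the two-step split gives 60/20/20 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, one feature
toy_x = np.arange(100).reshape(-1, 1)
toy_y = np.arange(100)

# Step 1: hold out 20% of all rows as the test set
x_full_train, x_te, y_full_train, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42)

# Step 2: hold out 25% of the remaining 80% (= 20% overall) as validation
x_tr, x_va, y_tr, y_va = train_test_split(
    x_full_train, y_full_train, test_size=0.25, random_state=42)

sizes = (len(x_tr), len(x_va), len(x_te))
print(sizes)
```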

### Question 2

- Create the correlation matrix for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?  


> `reviews_per_month` and `number_of_reviews` -> 0.549792


In [None]:
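# Evaluate correlation on numerical columns only
# (DataFrame.corr raises on string columns in pandas >= 2.0)
corr = x_train.select_dtypes(include='number').corr()
corr.style.background_gradient(cmap='coolwarm')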

In [None]:
# Fill diagonal and upper half with NaNs
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
corr[mask] = np.nan
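(corr
 .style
 .background_gradient(cmap='coolwarm', axis=None, vmin=-1, vmax=1)
 .highlight_null('#f1f1f1')  # Color NaNs grey (keyword is `color` in pandas >= 1.5, `null_color` before)
 .format(precision=2))       # Styler.set_precision was removed in pandas 2.0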
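
Reading the largest coefficient off a styled matrix is error-prone, so the pair can also be extracted programmatically. The helper below is a sketch, not part of the original notebook: `top_correlated_pair` is a hypothetical name, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(frame):
    """Return (feature_a, feature_b, r) for the pair with the largest |r|."""
    corr = frame.select_dtypes(include='number').corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # idxmax skips NaN entries and returns the (row, column) label of the strongest pair
    best = pairs.abs().idxmax()
    return best[0], best[1], pairs.loc[best]


# Synthetic demo: `b` is a noisy copy of `a`, `c` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})
pair = top_correlated_pair(demo)
print(pair)
```

Applied to the notebook's data, the call would be `top_correlated_pair(x_train)`.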

### Question 3

- Calculate the mutual information score with the (binarized) price for the two categorical variables that we have. Use the training set only.
- Which of these two variables has bigger score?
- Round it to 2 decimal digits using round(score, 2)

> Room type has bigger score (0.14)

In [None]:
from sklearn.metrics import mutual_info_score

# Fetch numerical columns
num_cols = list(x_train.select_dtypes(include=['int64', 'float64']).columns)
print(num_cols)

# Fetch categorical columns
cat_cols = list(x_train.select_dtypes(include=['object']).columns)
print(cat_cols)

In [None]:
# Evaluate mutual info score; built-in round() works for both Python floats and NumPy scalars
mi_ng = round(mutual_info_score(x_train['neighbourhood_group'], y_train['above_avg']), 2)
mi_rt = round(mutual_info_score(x_train['room_type'], y_train['above_avg']), 2)

In [None]:
print(
f'Mutual information of neighbourhood group with target: {mi_ng}\n'
f'Mutual information of room type with target: {mi_rt}\n'
)
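
To build intuition for what the score means, here is a tiny worked example on made-up labels. `mutual_info_score` reports mutual information in nats (natural logarithm), so a balanced binary variable that fully determines a balanced binary target scores ln 2, while an independent one scores 0. The variable names are illustrative only.

```python
from math import isclose, log

from sklearn.metrics import mutual_info_score

# Hypothetical balanced binary target
target_toy = [0, 0, 1, 1]

# This variable determines the target exactly
informative = ['a', 'a', 'b', 'b']

# This variable is independent of the target
uninformative = ['a', 'b', 'a', 'b']

mi_informative = mutual_info_score(informative, target_toy)
mi_uninformative = mutual_info_score(uninformative, target_toy)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```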

### Question 4

- Now let's train a logistic regression
- Remember that we have two categorical variables in the data. Include them using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

> Validation accuracy is 0.79


In [None]:
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns
# Note: `sparse_output` requires scikit-learn >= 1.2 (older releases use `sparse=False`)
ohe = OneHotEncoder(sparse_output=False)

# Fit the encoder on the training categorical columns only
ohe.fit(x_train[cat_cols])

# Fetch encoded column names (`get_feature_names` was removed in scikit-learn 1.2)
enc_cols = list(ohe.get_feature_names_out(cat_cols))
print(f'Encoded columns: {enc_cols}')

In [None]:
# Transform/Encode categorical columns
x_train[enc_cols] = ohe.transform(x_train[cat_cols])
x_val[enc_cols] = ohe.transform(x_val[cat_cols])
x_test[enc_cols] = ohe.transform(x_test[cat_cols])

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create instance of logReg model
lr = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
lr.fit(x_train[num_cols+enc_cols], y_train['above_avg'])

In [None]:
# Predict on validation set and evaluate accuracy
y_pred = lr.predict(x_val[num_cols+enc_cols])
val_acc = accuracy_score(y_val['above_avg'], y_pred)
print(f'Validation accuracy is {round(val_acc, 2)}')
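
An alternative to adding encoded columns to the dataframes by hand is to bundle the encoder and the model into one scikit-learn pipeline. This is a different arrangement from the cells above, shown only as a sketch on synthetic data; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic data: the label depends only on `room_type`
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'room_type': rng.choice(['Entire home/apt', 'Private room'], size=n),
    'minimum_nights': rng.integers(1, 10, size=n),
})
toy_target = (toy['room_type'] == 'Entire home/apt').astype(int)

# One-hot encode the categorical column, pass numerical columns through unchanged
preprocess = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), ['room_type'])],
    remainder='passthrough',
)

# The pipeline fits the encoder and the classifier together
clf = make_pipeline(
    preprocess,
    LogisticRegression(solver='lbfgs', C=1.0, random_state=42),
)
clf.fit(toy, toy_target)
toy_acc = clf.score(toy, toy_target)
print(round(toy_acc, 2))
```

With this layout the encoder is always fitted on whatever is passed to `fit`, so validation and test frames can be given to `predict` or `score` directly.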

### Question 5

- We have 9 features: 7 numerical features and 2 categorical.  

- Let's find the least useful one using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
        neighbourhood_group
        room_type
        number_of_reviews
        reviews_per_month

note: the difference doesn't have to be positive


> `number_of_reviews` has the smallest difference: -0.00072  
Difference of score without column `neighbourhood_group`: 0.03538  
Difference of score without column `room_type`: 0.07117  
Difference of score without column `number_of_reviews`: -0.00072  
Difference of score without column `reviews_per_month`: 0.00123

In [None]:
# Feature importance by model coefficients
pd.DataFrame(data= {'feature': num_cols+enc_cols, 'coef': abs(lr.coef_[0])}).sort_values(by='coef',ascending=False)

In [None]:
original_acc = val_acc

In [None]:
from sklearn.feature_extraction import DictVectorizer

# Create list of features to be dropped
elim_feat = ['neighbourhood_group', 'room_type', 'number_of_reviews', 'reviews_per_month']

dv_train_df = x_train[num_cols+cat_cols].copy()
dv_val_df = x_val[num_cols+cat_cols].copy()


# For each feature in elim_feat, drop it, train model, evaluate accuracy and compare with original_acc
for i in elim_feat:
    dv = DictVectorizer(sparse= False)
    
    train_dict = dv_train_df.drop(columns=i).to_dict(orient= 'records')
    dv_train = dv.fit_transform(train_dict)
    
    val_dict = dv_val_df.drop(columns=i).to_dict(orient= 'records')
    # Reuse the vocabulary fitted on the training data; do not refit on validation
    dv_val = dv.transform(val_dict)
    
    lr_dict = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    lr_dict.fit(dv_train, y_train['above_avg'])
    
    lr_dict_pred = lr_dict.predict(dv_val)
    score = accuracy_score(y_val['above_avg'], lr_dict_pred)
    print(f'Difference of score without column {i}: {round(original_acc - score, 5)}')
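
The elimination loop can also be written as a reusable function. The sketch below follows the same DictVectorizer plus logistic regression recipe; the helper names `fit_and_score` and `elimination_deltas` are hypothetical, and the demo data is synthetic, with one informative and one uninformative feature.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_df, y_tr, val_df, y_va):
    """Vectorize, fit a logistic regression, and return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(train_df.to_dict(orient='records'))
    # Reuse the vocabulary learned on the training data
    X_va = dv.transform(val_df.to_dict(orient='records'))
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_tr, y_tr)
    return accuracy_score(y_va, model.predict(X_va))


def elimination_deltas(train_df, y_tr, val_df, y_va, candidates):
    """Return {feature: baseline accuracy - accuracy without that feature}."""
    baseline = fit_and_score(train_df, y_tr, val_df, y_va)
    return {
        col: baseline - fit_and_score(train_df.drop(columns=col), y_tr,
                                      val_df.drop(columns=col), y_va)
        for col in candidates
    }


# Synthetic demo: `signal` drives the label, `noise` does not
rng = np.random.default_rng(0)
toy = pd.DataFrame({'signal': rng.normal(size=400),
                    'noise': rng.normal(size=400)})
toy_y = (toy['signal'] > 0).astype(int)

deltas = elimination_deltas(toy.iloc[:300], toy_y.iloc[:300],
                            toy.iloc[300:], toy_y.iloc[300:],
                            candidates=['signal', 'noise'])
print(deltas)
```

A large positive difference means the model relies on that feature; a difference near zero (or negative) marks it as a candidate for removal.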


### Question 6

- For this question, we'll see how to use a linear regression model from Scikit-Learn
- We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model on the training data.
- This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

> alpha = 0.01   
    [(0, 0.22), (0.01, 0.218), (0.1, 0.218), (1, 0.218), (10, 0.218)]
  

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Apply log10 to price; a price of 0 gives -inf, which is replaced with 0 (as are any NaNs)
y_train['price'] = np.log10(y_train['price']).fillna(0).replace([np.inf, -np.inf], 0)
y_val['price'] = np.log10(y_val['price']).fillna(0).replace([np.inf, -np.inf], 0)
y_test['price'] = np.log10(y_test['price']).fillna(0).replace([np.inf, -np.inf], 0)
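
The imputation above is needed because the logarithm of 0 is undefined. A short sketch on made-up prices shows the issue, together with `np.log1p` as one common alternative that stays finite at 0. Note that this notebook uses `np.log10`, so switching transformations would change the RMSE values reported here.

```python
import numpy as np

# Hypothetical prices, including a zero
prices = np.array([0, 9, 99, 999])

# log10(0) is -inf, so zero prices need special handling afterwards
with np.errstate(divide='ignore'):
    log10_prices = np.log10(prices)

# log1p(x) = ln(1 + x) is finite at x = 0 and needs no imputation
log1p_prices = np.log1p(prices)

print(log10_prices)
print(log1p_prices)
```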

In [None]:
def test_params(**params):
    """
    - Train a Ridge model with the hyperparameters passed to the function
    - Predict on the validation set
    - Return the validation RMSE
    """
    model = Ridge(**params).fit(x_train[num_cols+enc_cols], y_train['price'])
    pred = model.predict(x_val[num_cols+enc_cols])
    # np.sqrt of MSE works on all scikit-learn versions (`squared=False` was removed in 1.6)
    val_rmse = np.sqrt(mean_squared_error(y_val['price'], pred))
    return val_rmse

In [None]:
def test_multiple_values(param_name, param_values):
    """
    For the given param_name and list of values, train one model per value via test_params
    and collect the validation RMSEs, rounded to 3 decimal digits
    """
    val_errors = []
    for value in param_values:
        params = {param_name: value}
        metric = test_params(**params)
        val_errors.append(round(metric, 3))
    return val_errors

In [None]:
# Compile param values and rmse values together
list(zip([0, 0.01, 0.1, 1, 10],(test_multiple_values('alpha', [0, 0.01, 0.1, 1, 10]))))
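
Picking the answer by eye works for five values, but the tie-breaking rule ("select the smallest alpha") can also be encoded directly. A minimal sketch using the (alpha, RMSE) pairs reported above:

```python
# (alpha, rounded validation RMSE) pairs as reported in this notebook
results = [(0, 0.22), (0.01, 0.218), (0.1, 0.218), (1, 0.218), (10, 0.218)]

# Sort key: RMSE first, alpha second, so ties go to the smallest alpha
best_alpha, best_rmse = min(results, key=lambda pair: (pair[1], pair[0]))

print(best_alpha, best_rmse)
```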

### References
- [sklearn](https://scikit-learn.org/stable/index.html)
- [pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)
- [mlbookcamp chapter 3](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/03-classification/README.md)
- [test_params function template](https://jovian.ai/adarshn-work/python-random-forests-assignment/v/10#C67)
- [Correlation matrix plotting](https://stackoverflow.com/questions/29432629/plot-correlation-matrix-using-pandas)
