## Classification - Homework

### Importing the libraries and the dataset

In [152]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [153]:
df = pd.read_csv('../data/AB_NYC_2019.csv')
len(df)

48895

Selecting the features to use in this homework:

In [154]:
features = ['neighbourhood_group', 'room_type', 'latitude', 'longitude', 'price', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

In [155]:
df = df[features]
df.isnull().sum()

neighbourhood_group                   0
room_type                             0
latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

Replacing missing values with 0.

In [156]:
df['reviews_per_month'] = df['reviews_per_month'].fillna(0) 

#### Q1. Most frequent observation

In [157]:
df['neighbourhood_group'].mode()

0    Manhattan
dtype: object

__NOTE:__ making `price` binary is requested in the next question, but it has to happen before the data is split, so it is done here.

In [158]:
df['above_average'] = np.where(df['price'] >= 152, 1, 0)

In [159]:
df = df.drop('price', axis=1)
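As a quick illustration of the binarization step, the sketch below applies the same `np.where` threshold to a few made-up prices and compares it with an equivalent boolean cast. The series `toy_prices` is hypothetical and only serves the example.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: one below, one exactly at, and one above the threshold of 152.
toy_prices = pd.Series([100, 152, 300])

# Same pattern as in the notebook: 1 when price >= 152, else 0.
with_where = np.where(toy_prices >= 152, 1, 0)

# Equivalent formulation: cast the boolean mask to integers.
with_astype = (toy_prices >= 152).astype(int).values

print(with_where, with_astype)
```

Both forms treat a price of exactly 152 as above average because the comparison is `>=`.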

Splitting the data:

In [160]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [161]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [162]:
len(df_train), len(df_val), len(df_test)

(29337, 9779, 9779)
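The second split uses `test_size=0.25` because it is applied to the remaining 80% of the rows, and 0.25 × 0.8 = 0.2 of the full dataset. The sketch below reproduces the sizes using a plain list of row indices instead of the DataFrame; the variable names are illustrative.

```python
from sklearn.model_selection import train_test_split

n_rows = 48895  # number of rows reported by len(df) above
rows = list(range(n_rows))  # stand-in for the DataFrame rows

# First split: 80% full train, 20% test.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
# Second split: 25% of the 80% becomes validation, i.e. 20% of the total.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
fractions = tuple(round(s / n_rows, 2) for s in sizes)
print(sizes, fractions)
```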

Separating the target `above_average` from the rest of the features:

In [163]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [164]:
y_train = df_train['above_average'].values
y_val = df_val['above_average'].values
y_test = df_test['above_average'].values

#### Q2. Correlation

Looking at the numerical variables:

In [165]:
df_full_train.dtypes

neighbourhood_group                object
room_type                          object
latitude                          float64
longitude                         float64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
above_average                       int64
dtype: object

In [166]:
numerical_values = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
                    'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

In [167]:
categorical_values = ['neighbourhood_group', 'room_type']

Looking at the correlation matrix:

In [168]:
df_train[numerical_values].corr()

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|
| latitude | 1.0 | 0.080301 | 0.027441 | -0.006246 | -0.007159 | 0.019375 | -0.005891 |
| longitude | 0.080301 | 1.0 | -0.06066 | 0.055084 | 0.134642 | -0.117041 | 0.083666 |
| minimum_nights | 0.027441 | -0.06066 | 1.0 | -0.07602 | -0.120703 | 0.118647 | 0.138901 |
| number_of_reviews | -0.006246 | 0.055084 | -0.07602 | 1.0 | 0.590374 | -0.073167 | 0.174477 |
| reviews_per_month | -0.007159 | 0.134642 | -0.120703 | 0.590374 | 1.0 | -0.048767 | 0.165376 |
| calculated_host_listings_count | 0.019375 | -0.117041 | 0.118647 | -0.073167 | -0.048767 | 1.0 | 0.225913 |
| availability_365 | -0.005891 | 0.083666 | 0.138901 | 0.174477 | 0.165376 | 0.225913 | 1.0 |


Looking at the correlation matrix, the highest off-diagonal values are for:

- `number_of_reviews` vs `reviews_per_month` (0.590)
- `calculated_host_listings_count` vs `availability_365` (0.226)
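Reading the largest entries off a matrix by eye is error-prone, so the pairs can also be ranked programmatically. The following is a minimal sketch on made-up data (the frame `toy` and its columns are hypothetical); in the notebook the same steps would presumably be applied to `df_train[numerical_values].corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is built from a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.5, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle: each pair once, no diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs.head(2))
```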

Making `price` binary:

This was already done above, before splitting the data.

#### Q3. Mutual information

Calculating the mutual information score:

In [169]:
from sklearn.metrics import mutual_info_score

In [170]:
def mutual_info_price_score(series):
    return mutual_info_score(series, df_train['above_average'])

In [171]:
mutual_info = df_train[categorical_values].apply(mutual_info_price_score)
mutual_info.round(2).sort_values(ascending=False)

room_type              0.14
neighbourhood_group    0.05
dtype: float64
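To give a feel for the scale of these scores, the sketch below computes `mutual_info_score` on two made-up categorical variables: one that fully determines a binary target and one that is independent of it. All three lists are hypothetical.

```python
from sklearn.metrics import mutual_info_score

# Hypothetical balanced binary target.
target = [1, 1, 0, 0, 1, 1, 0, 0]
# Fully determines the target: 'x' always goes with 1, 'y' with 0.
informative = ['x', 'x', 'y', 'y', 'x', 'x', 'y', 'y']
# Independent of the target: each value is split evenly between 0 and 1.
uninformative = ['p', 'q', 'p', 'q', 'p', 'q', 'p', 'q']

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 3), round(mi_uninformative, 3))
```

The score is measured in nats, so a variable that fully determines a balanced binary target reaches ln 2 ≈ 0.693, while an independent one scores 0.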

#### Q4. Training a logistic regression

Pre-processing the data:

In [172]:
from sklearn.feature_extraction import DictVectorizer

In [173]:
train_dict = df_train[categorical_values + numerical_values].to_dict(orient='records')
train_dict[0]

{'neighbourhood_group': 'Brooklyn',
 'room_type': 'Entire home/apt',
 'latitude': 40.7276,
 'longitude': -73.94495,
 'minimum_nights': 3,
 'number_of_reviews': 29,
 'reviews_per_month': 0.7,
 'calculated_host_listings_count': 13,
 'availability_365': 50}

Vectorizer:

In [174]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)

Fitting the model:

In [175]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(random_state=42)

Measuring the accuracy:

In [176]:
from sklearn.metrics import accuracy_score

In [177]:
val_dict = df_val[categorical_values + numerical_values].to_dict(orient='records')
# reuse the vectorizer fitted on the training data; do not refit it on validation
X_val = dv.transform(val_dict)

In [178]:
y_pred = model.predict(X_val)

In [179]:
accuracy = accuracy_score(y_val, y_pred)
accuracy.round(2)

0.79
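Accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this against `accuracy_score` on made-up labels (both arrays are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions: four of five match.
y_true_toy = np.array([1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 1, 1, 0])

# Fraction of positions where prediction equals the true label.
manual_accuracy = (y_true_toy == y_pred_toy).mean()
library_accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(manual_accuracy, library_accuracy)
```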

#### Q5. Finding the least useful feature

In [180]:
all_features = categorical_values + numerical_values
all_features

['neighbourhood_group',
 'room_type',
 'latitude',
 'longitude',
 'minimum_nights',
 'number_of_reviews',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

In [181]:
used_features = [x for i, x in enumerate(all_features) if i != 2]
used_features

['neighbourhood_group',
 'room_type',
 'longitude',
 'minimum_nights',
 'number_of_reviews',
 'reviews_per_month',
 'calculated_host_listings_count',
 'availability_365']

In [182]:
# getting X_train just for the used_features
train_dict = df_train[used_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
print(df_train[used_features].columns)

# getting X_val just for the used_features (transform only, the vectorizer is already fitted)
val_dict = df_val[used_features].to_dict(orient='records')
X_val = dv.transform(val_dict)
print(df_val[used_features].columns)

# fitting the model with the used_features
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

# predicting and measuring just for used_features
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
accuracy.round(3)

Index(['neighbourhood_group', 'room_type', 'longitude', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')
Index(['neighbourhood_group', 'room_type', 'longitude', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')


0.786

Let's put the above code into a function and loop over the features, leaving one out at a time:

In [183]:
def fit_and_measure(used_features):
    # getting X_train just for the used_features
    train_dict = df_train[used_features].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    # getting X_val just for the used_features (transform only, the vectorizer is already fitted)
    val_dict = df_val[used_features].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    # fitting the model with the used_features
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_train, y_train)

    # predicting and measuring just for used_features
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    return accuracy.round(3)

In [184]:
results = {}

# iterate over the features only; df_val.columns also contains the target `above_average`
for col, name in enumerate(all_features):
    subset = [x for i, x in enumerate(all_features) if i != col]
    score = fit_and_measure(subset)

    # `accuracy` is the value left by the previous cell (model without `latitude`, about 0.786)
    results[f'Difference without {name}'] = round((accuracy - score), 6)

results

{'Difference without neighbourhood_group': 0.035379,
 'Difference without room_type': 0.070379,
 'Difference without latitude': 0.000379,
 'Difference without longitude': -0.000621,
 'Difference without minimum_nights': 0.001379,
 'Difference without number_of_reviews': -0.000621,
 'Difference without reviews_per_month': 0.000379,
 'Difference without calculated_host_listings_count': -0.000621,
 'Difference without availability_365': 0.004379}

`above_average` is the target, not a feature, so it is excluded from the comparison. The features whose removal changes the accuracy the least (smallest absolute difference) are:

- `latitude`
- `reviews_per_month`

with a score difference of 0.000379
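The leave-one-feature-out procedure can be sketched end to end on synthetic data. Everything below is hypothetical (the frame `toy`, its columns, and the helper `toy_accuracy`). The sketch assumes a fresh `DictVectorizer` per feature subset, a baseline computed once with all features, and ranking by absolute difference.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: only `signal` carries information about the target.
rng = np.random.default_rng(42)
n = 1000
toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'group': rng.choice(['a', 'b'], size=n),
})
toy['target'] = (toy['signal'] > 0).astype(int)

toy_train, toy_val = train_test_split(toy, test_size=0.25, random_state=42)
toy_features = ['signal', 'noise', 'group']

def toy_accuracy(cols):
    """Fit on the training split with `cols` only and return validation accuracy."""
    vec = DictVectorizer(sparse=False)
    X_tr = vec.fit_transform(toy_train[cols].to_dict(orient='records'))
    X_va = vec.transform(toy_val[cols].to_dict(orient='records'))  # no refit
    clf = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    clf.fit(X_tr, toy_train['target'].values)
    return accuracy_score(toy_val['target'].values, clf.predict(X_va))

# Baseline with every feature, computed once.
baseline = toy_accuracy(toy_features)

# Accuracy drop when each feature is left out in turn.
diffs = {
    f: baseline - toy_accuracy([c for c in toy_features if c != f])
    for f in toy_features
}

# The least useful feature is the one with the smallest absolute difference.
least_useful = min(diffs, key=lambda f: abs(diffs[f]))
print(baseline, diffs, least_useful)
```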

#### Q6. Using a linear regression and measuring the RMSE

Since we need the original values of `price`, we reload the data and recompute the $y$ values:

In [188]:
df2 = pd.read_csv('../data/AB_NYC_2019.csv')
df2 = df2[features]

In [190]:
df2['reviews_per_month'] = df2['reviews_per_month'].fillna(0) 

Splitting the data:

In [191]:
df_full_train_lr, df_test_lr = train_test_split(df2, test_size=0.2, random_state=42)
df_train_lr, df_val_lr = train_test_split(df_full_train_lr, test_size=0.25, random_state=42)

Applying the logarithmic transformation to the column `price`:

In [193]:
y_train = np.log1p(df_train_lr['price']).values
y_val = np.log1p(df_val_lr['price']).values
y_test = np.log1p(df_test_lr['price']).values
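`np.log1p` computes log(1 + x), which stays defined when a price is 0, and `np.expm1` is its inverse. The sketch below shows the round trip on a few made-up prices (the array `prices` is hypothetical).

```python
import numpy as np

# Hypothetical prices, including a zero that plain np.log could not handle.
prices = np.array([0, 50, 152, 10000])

log_prices = np.log1p(prices)    # log(1 + price)
restored = np.expm1(log_prices)  # exp(value) - 1, back on the price scale

print(log_prices.round(3), restored.round(3))
```

An RMSE computed on the transformed target is therefore expressed in log units, not in dollars.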

We're going to use the Ridge Regression model:

In [84]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [194]:
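# use the full feature list explicitly instead of whatever `used_features` last held
train_dict_lr = df_train_lr[all_features].to_dict(orient='records')
X_train_lr = dv.fit_transform(train_dict_lr)

# transform only: the vectorizer was fitted on the training data above
val_dict_lr = df_val_lr[all_features].to_dict(orient='records')
X_val_lr = dv.transform(val_dict_lr)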

In [195]:
ridge_results = {}

In [200]:
for alpha in [0, 0.01, 0.1, 1, 10]:
    ridge_model = Ridge(alpha=alpha, random_state=42)
    # train and evaluate on the matrices built for the regression task
    ridge_model.fit(X_train_lr, y_train)

    y_pred = ridge_model.predict(X_val_lr)

    rmse = np.sqrt(mean_squared_error(y_val, y_pred))

    ridge_results[f'alpha = {alpha}'] = round(rmse, 3)

In [201]:
ridge_results

{'alpha = 0': 0.498,
 'alpha = 0.01': 0.498,
 'alpha = 0.1': 0.498,
 'alpha = 1': 0.498,
 'alpha = 10': 0.499}

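Based on these results, the rounded RMSE is 0.498 for every `alpha` up to 1 and only rises to 0.499 at `alpha = 10`, so we choose the smallest value, `alpha = 0`.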
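When several values of `alpha` tie on the rounded RMSE, the choice can be made explicit by sorting on the pair (RMSE, alpha). The sketch below uses the rounded scores reported above; the dictionary `ridge_scores` is an illustrative restatement keyed by the numeric `alpha`.

```python
# Rounded validation RMSE per alpha, copied from the results above.
ridge_scores = {0: 0.498, 0.01: 0.498, 0.1: 0.498, 1: 0.498, 10: 0.499}

# Lowest RMSE first; ties are broken in favour of the smallest alpha.
best_alpha = min(ridge_scores, key=lambda a: (ridge_scores[a], a))
print(best_alpha)
```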