## Classification - Homework

### Importing the libraries and the dataset

In [5]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [113]:
df = pd.read_csv('../data/AB_NYC_2019.csv')
len(df)

48895

Selecting the features to use in this homework:

In [10]:
features = ['neighbourhood_group', 'room_type', 'latitude', 'longitude', 'price', 'minimum_nights',
            'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

In [114]:
df = df[features].copy()  # copy so later column assignments do not act on a slice of the original frame
df.isnull().sum()

neighbourhood_group                   0
room_type                             0
latitude                              0
longitude                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

Replacing missing values with 0.

In [115]:
df['reviews_per_month'] = df['reviews_per_month'].fillna(0) 

#### Q1. Most frequent observation

In [21]:
df['neighbourhood_group'].mode()

0    Manhattan
dtype: object

__NOTE:__ the next question asks for `price` to be made binary, but that change has to happen before the data is split, so it is done here.

In [117]:
df['above_average'] = np.where(df['price'] >= 152, 1, 0)

In [118]:
df = df.drop('price', axis=1)
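
The threshold of 152 above is hard-coded. As a minimal sketch, assuming the intent is "price at or above the mean price", the threshold can be derived from the data instead. The toy frame and its values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy prices, not taken from AB_NYC_2019.csv
toy = pd.DataFrame({'price': [50, 100, 150, 300]})

# Compute the threshold from the data instead of hard-coding it
threshold = toy['price'].mean()

# 1 if the price is at or above the mean, 0 otherwise
toy['above_average'] = (toy['price'] >= threshold).astype(int)
above_average_flags = toy['above_average'].tolist()
```

On the real data the same idea would use `df['price'].mean()` before the `price` column is dropped.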

Splitting the data:

In [119]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [120]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [121]:
len(df_train), len(df_val), len(df_test)

(29337, 9779, 9779)
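
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the rows, and 0.25 × 0.8 = 0.2, which gives a 60/20/20 split overall. A minimal sketch of that arithmetic, using a hypothetical array of 1000 row indices in place of the DataFrame:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the DataFrame rows
rows = np.arange(1000)

# First split: hold out 20% of all rows for testing
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% equals 20% of the total
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
```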

Resetting the indices and setting apart the target `above_average` from the rest of the features:

In [122]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [123]:
y_train = df_train['above_average'].values
y_val = df_val['above_average'].values
y_test = df_test['above_average'].values

#### Q2. Correlation

Looking at the numerical variables:

In [124]:
df_full_train.dtypes

neighbourhood_group                object
room_type                          object
latitude                          float64
longitude                         float64
minimum_nights                      int64
number_of_reviews                   int64
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
above_average                       int64
dtype: object

In [125]:
numerical_values = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
                    'reviews_per_month', 'calculated_host_listings_count', 'availability_365']

In [126]:
categorical_values = ['neighbourhood_group', 'room_type']

Looking at the correlation matrix:

In [127]:
df_train[numerical_values].corr()

| | latitude | longitude | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| latitude | 1.0 | 0.080301 | 0.027441 | -0.006246 | -0.007159 | 0.019375 | -0.005891 |
| longitude | 0.080301 | 1.0 | -0.06066 | 0.055084 | 0.134642 | -0.117041 | 0.083666 |
| minimum_nights | 0.027441 | -0.06066 | 1.0 | -0.07602 | -0.120703 | 0.118647 | 0.138901 |
| number_of_reviews | -0.006246 | 0.055084 | -0.07602 | 1.0 | 0.590374 | -0.073167 | 0.174477 |
| reviews_per_month | -0.007159 | 0.134642 | -0.120703 | 0.590374 | 1.0 | -0.048767 | 0.165376 |
| calculated_host_listings_count | 0.019375 | -0.117041 | 0.118647 | -0.073167 | -0.048767 | 1.0 | 0.225913 |
| availability_365 | -0.005891 | 0.083666 | 0.138901 | 0.174477 | 0.165376 | 0.225913 | 1.0 |


In the correlation matrix, the two highest off-diagonal values are:

- `number_of_reviews` vs `reviews_per_month` (0.590)
- `calculated_host_listings_count` vs `availability_365` (0.226)
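
Reading the largest off-diagonal value from a matrix by eye is error-prone. A minimal sketch of ranking the pairs programmatically follows; the toy columns `a`, `b`, `c` are hypothetical, and on the real data `toy` would be replaced by `df_train[numerical_values]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b tracks a closely, c moves in the opposite direction
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep only the upper triangle; k=1 drops the diagonal of 1.0 values
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first
ranked = pairs.abs().sort_values(ascending=False)
strongest_pair = ranked.index[0]
```

Masking the lower triangle avoids listing each pair twice, since the matrix is symmetric.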

Making `price` binary: already done above, before the split, as the `above_average` column.

#### Q3. Mutual information

Calculating the mutual information score:

In [128]:
from sklearn.metrics import mutual_info_score

In [129]:
def mutual_info_price_score(series):
    return mutual_info_score(series, df_train['above_average'])

In [130]:
mutual_info = df_train[categorical_values].apply(mutual_info_price_score)
mutual_info.round(2).sort_values(ascending=False)

room_type              0.14
neighbourhood_group    0.05
dtype: float64
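
To get a feel for the scale of these scores, here is a minimal sketch on hypothetical toy labels: a feature that fully determines the target scores ln 2 (the entropy of a balanced binary target, in nats), while an independent feature scores 0.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical toy labels, not taken from the dataset
target = [0, 0, 1, 1]
informative = ['a', 'a', 'b', 'b']    # fully determines the target
uninformative = ['x', 'y', 'x', 'y']  # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)
upper_bound = math.log(2)  # entropy of a balanced binary target, in nats
```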

#### Q4. Training a logistic regression

Pre-processing the data:

In [131]:
from sklearn.feature_extraction import DictVectorizer

In [132]:
train_dict = df_train[categorical_values + numerical_values].to_dict(orient='records')
train_dict[0]

{'neighbourhood_group': 'Brooklyn',
 'room_type': 'Entire home/apt',
 'latitude': 40.7276,
 'longitude': -73.94495,
 'minimum_nights': 3,
 'number_of_reviews': 29,
 'reviews_per_month': 0.7,
 'calculated_host_listings_count': 13,
 'availability_365': 50}

Vectorizer:

In [133]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dict)

Fitting the model:

In [135]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(random_state=42)

Measuring the accuracy:

In [141]:
from sklearn.metrics import accuracy_score

In [139]:
val_dict = df_val[categorical_values + numerical_values].to_dict(orient='records')
X_val = dv.transform(val_dict)  # reuse the vectorizer fitted on the training data; do not refit on validation data

In [140]:
y_pred = model.predict(X_val)

In [144]:
accuracy = accuracy_score(y_val, y_pred)
accuracy.round(2)

0.79
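
The validation matrix needs the same column layout as the training matrix, so the vectorizer fitted on the training data should only be applied to the validation data, not refitted on it. A minimal sketch with hypothetical records shows how refitting can change the number of columns when a category is missing from the validation set:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: the validation set contains only one of three categories
train_records = [
    {'room_type': 'Private room'},
    {'room_type': 'Shared room'},
    {'room_type': 'Entire home/apt'},
]
val_records = [{'room_type': 'Private room'}]

dv_demo = DictVectorizer(sparse=False)
X_tr = dv_demo.fit_transform(train_records)

# Applying the fitted vectorizer keeps all three training columns
X_ok = dv_demo.transform(val_records)

# Refitting on the validation records learns only the one category seen there
X_refit = DictVectorizer(sparse=False).fit_transform(val_records)

shapes = (X_tr.shape, X_ok.shape, X_refit.shape)
```

A model trained on `X_tr` could not score `X_refit`, because the column counts differ. When both sets happen to contain every category the shapes agree, but relying on that is fragile.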

#### Q5. Finding the least useful feature