# ML Zoomcamp Office Hours -- Week 8

Plan: 

* Multiclass classification
* Feature importance for continuous target (regression)
* Working with texts
* Evaluating your model when there's time information

In [1]:
import pandas as pd
import numpy as np

## Multiclass classification

Let's use the NYC Airbnb dataset to predict the neighbourhood group (the borough) from the geo coordinates, latitude and longitude.

In [3]:
df = pd.read_csv('../data/AB_NYC_2019.csv', nrows=3000)

In [4]:
df.neighbourhood_group.value_counts()

neighbourhood_group
Manhattan        1373
Brooklyn         1370
Queens            199
Bronx              33
Staten Island      25
Name: count, dtype: int64

In [5]:
groups = ['Manhattan', 'Brooklyn', 'Queens']
df = df[df.neighbourhood_group.isin(groups)].reset_index(drop=True)

In [6]:
df.neighbourhood_group.value_counts()

neighbourhood_group
Manhattan    1373
Brooklyn     1370
Queens        199
Name: count, dtype: int64

In [7]:
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [8]:
X = df[['latitude', 'longitude']].values

In [9]:
y = df.neighbourhood_group.values

In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
lr = LogisticRegression()
lr.fit(X, y)

In [12]:
lr.intercept_

array([ 0.28193733, -0.40945164,  0.12751432])
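
The model keeps one intercept per class, which is why `lr.intercept_` holds three values. As a minimal sketch on synthetic data (the `make_blobs` points, the label mapping, and the `*_demo` names are illustrative assumptions, not the Airbnb data), the per-class parameters and probabilities could be inspected like this:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for (latitude, longitude) -> group; not the Airbnb data
X_demo, y_idx = make_blobs(n_samples=300, centers=3, n_features=2, random_state=1)
labels = np.array(['Brooklyn', 'Manhattan', 'Queens'])
y_demo = labels[y_idx]

lr_demo = LogisticRegression(max_iter=1000)
lr_demo.fit(X_demo, y_demo)

# classes_ lists the labels in sorted order; row i of coef_ and
# element i of intercept_ belong to classes_[i]
print(lr_demo.classes_)
print(lr_demo.coef_.shape)       # (n_classes, n_features)
print(lr_demo.intercept_.shape)  # (n_classes,)

# predict_proba returns one probability per class; each row sums to 1
proba = lr_demo.predict_proba(X_demo[:5])
print(proba.round(3))

# predict corresponds to taking the class with the highest probability
pred = lr_demo.classes_[proba.argmax(axis=1)]
print(pred)
```

Reading `classes_` alongside `intercept_` is what tells you which of the three numbers belongs to which group.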

In [13]:
(lr.predict(X) == y).mean()

0.6801495581237254

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

In [15]:
dt = DecisionTreeClassifier(max_depth=4)
dt.fit(X, y)

In [16]:
(dt.predict(X) == y).mean()

0.9826648538409245

In [17]:
print(export_text(dt, feature_names=['lat', 'long']))

|--- lat <= 40.72
|   |--- long <= -73.86
|   |   |--- long <= -73.99
|   |   |   |--- lat <= 40.70
|   |   |   |   |--- class: Brooklyn
|   |   |   |--- lat >  40.70
|   |   |   |   |--- class: Manhattan
|   |   |--- long >  -73.99
|   |   |   |--- long <= -73.92
|   |   |   |   |--- class: Brooklyn
|   |   |   |--- long >  -73.92
|   |   |   |   |--- class: Brooklyn
|   |--- long >  -73.86
|   |   |--- class: Queens
|--- lat >  40.72
|   |--- long <= -73.93
|   |   |--- long <= -73.96
|   |   |   |--- lat <= 40.72
|   |   |   |   |--- class: Manhattan
|   |   |   |--- lat >  40.72
|   |   |   |   |--- class: Manhattan
|   |   |--- long >  -73.96
|   |   |   |--- lat <= 40.76
|   |   |   |   |--- class: Brooklyn
|   |   |   |--- lat >  40.76
|   |   |   |   |--- class: Manhattan
|   |--- long >  -73.93
|   |   |--- lat <= 40.82
|   |   |   |--- class: Queens
|   |   |--- lat >  40.82
|   |   |   |--- class: Manhattan
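
Both accuracies above are computed on the same rows the models were trained on, so they may be optimistic, especially for the tree. A hedged sketch of the same comparison with a hold-out split, using synthetic two-feature data as a stand-in (the data, the 80/20 split, and all names below are assumptions made for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with two features; not the Airbnb data
X_demo, y_demo = make_blobs(
    n_samples=600, centers=3, n_features=2, cluster_std=2.0, random_state=42
)

# Hold out 20% of the rows; stratify keeps class proportions similar
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(max_depth=4, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Same accuracy formula as in the cells above, on both parts of the split
    train_acc = (model.predict(X_train) == y_train).mean()
    val_acc = (model.predict(X_val) == y_val).mean()
    scores[name] = (train_acc, val_acc)
    print(f'{name}: train={train_acc:.3f}, validation={val_acc:.3f}')
```

A large gap between the training and validation numbers would suggest the model is overfitting.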



## Feature importance for continuous target

When both the feature and the target are continuous, use correlation.

In [18]:
df.dtypes[df.dtypes != 'object'].index

Index(['id', 'host_id', 'latitude', 'longitude', 'price', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')

In [19]:
numeric = ['latitude', 'longitude', 'minimum_nights',
       'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365']

In [20]:
df[numeric].corrwith(np.log1p(df.price)).abs()

latitude                          0.018935
longitude                         0.364144
minimum_nights                    0.012581
number_of_reviews                 0.095637
reviews_per_month                 0.098535
calculated_host_listings_count    0.000751
availability_365                  0.002821
dtype: float64

When the feature is categorical and the target is numerical, turn the target into a categorical variable (by binning it) and use mutual information.

In [21]:
pd.cut(np.log1p(df.price), bins=10).value_counts()

price
(4.234, 4.846]    1124
(4.846, 5.458]     991
(3.622, 4.234]     374
(5.458, 6.07]      339
(6.07, 6.682]       62
(3.01, 3.622]       24
(6.682, 7.293]      19
(7.293, 7.905]       5
(7.905, 8.517]       3
(2.392, 3.01]        1
Name: count, dtype: int64

In [22]:
qprice = pd.qcut(df.price, q=10)
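
The cell above switches from `pd.cut` to `pd.qcut`. With `pd.cut` the bins have equal width, so a skewed variable such as price ends up with a few crowded bins and several nearly empty ones (as in the counts shown earlier). `pd.qcut` splits at quantiles instead, so each bin holds roughly the same number of rows. A small sketch on synthetic, skewed values (the lognormal "prices" are an assumption standing in for `df.price`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed synthetic "prices"; an assumption, not the Airbnb data
prices = pd.Series(rng.lognormal(mean=5, sigma=0.7, size=1000))

equal_width = pd.cut(prices, bins=10)  # every bin spans the same range
equal_count = pd.qcut(prices, q=10)    # every bin holds about the same number of rows

width_counts = equal_width.value_counts()
count_counts = equal_count.value_counts()

# Compare the largest and smallest bin for each strategy
print('cut :', width_counts.max(), width_counts.min())
print('qcut:', count_counts.max(), count_counts.min())
```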

In [23]:
from sklearn.metrics import mutual_info_score

In [24]:
mutual_info_score(df.neighbourhood_group, qprice)

0.037870631683242

In [25]:
mutual_info_score(df.neighbourhood, qprice)

0.2796110818606619

In [26]:
mutual_info_score(df.room_type, qprice)

0.24649049220226618
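
The three calls above can be folded into a loop that ranks categorical columns by their mutual information with the binned target. A minimal sketch on synthetic data follows; the `noise` column and the way `price` is generated are made up for illustration, and `labels=False` is used only so that the bins are plain integer codes:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: price depends on room_type but not on the noise column
room_type = rng.choice(['Entire home/apt', 'Private room'], size=n)
noise = rng.choice(['a', 'b', 'c'], size=n)
base = np.where(room_type == 'Entire home/apt', 200.0, 80.0)
price = base * rng.lognormal(mean=0.0, sigma=0.3, size=n)

demo = pd.DataFrame({'room_type': room_type, 'noise': noise, 'price': price})

# Turn the numerical target into 10 quantile bins (integer codes 0..9)
qprice_demo = pd.qcut(demo.price, q=10, labels=False)

# Score every categorical column against the binned target
categorical = ['room_type', 'noise']
mi = pd.Series({c: mutual_info_score(demo[c], qprice_demo) for c in categorical})
mi = mi.sort_values(ascending=False)
print(mi)
```

Columns with higher scores tell us more about which price bin a listing falls into; a column unrelated to price should score close to zero.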

In [27]:
df.head()

| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.9419 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.1 | 1 | 0 |


In [28]:
qprice

0        (147.6, 175.0]
1        (200.0, 269.0]
2        (147.6, 175.0]
3          (75.0, 90.0]
4          (75.0, 90.0]
             ...       
2937     (104.0, 125.0]
2938     (147.6, 175.0]
2939     (200.0, 269.0]
2940    (269.0, 5000.0]
2941       (75.0, 90.0]
Name: price, Length: 2942, dtype: category
Categories (10, interval[float64, right]): [(9.999, 60.0] < (60.0, 75.0] < (75.0, 90.0] < (90.0, 104.0] ... (147.6, 175.0] < (175.0, 200.0] < (200.0, 269.0] < (269.0, 5000.0]]

## Working with texts

Encoding text is very similar to one-hot encoding: each distinct word becomes its own column.

In [29]:
names = df.name.iloc[:3]

In [30]:
names

0     Clean & quiet apt home by the park
1                  Skylit Midtown Castle
2    THE VILLAGE OF HARLEM....NEW YORK !
Name: name, dtype: object

In [31]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
cv = CountVectorizer()
cv.fit(names)
X = cv.transform(names)

In [33]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())

In [None]:
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out()).round(2)

In [None]:
names