
# Exam


## Theoretical part

### 5. When does a decision tree for regression perform worse than linear regression?

We can think of each data point as a point in a multidimensional space whose dimensionality equals the number of features, with each feature as an axis. A decision tree splits the data on a single feature at each branching point, so every split boundary is parallel to one of the axes. Some data, however, vary along a combination of axes, such as a diagonal. Linear regression fits a hyperplane and can therefore align exactly with any linear combination of the features. A decision tree, by contrast, has to build a staircase of axis-parallel steps to approximate a diagonal relationship, which makes the model complex and inefficient.

(Conversely, when the relationship is non-linear, a decision tree will often do better.)
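
A minimal sketch of this argument on synthetic data, assuming scikit-learn is available. The data, the depth limit, and all variable names here are illustrative assumptions, not part of the exam. The target depends on the sum of two features, so the ideal fit is a plane tilted along the diagonal:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)

# Synthetic data (an assumption for illustration): y depends on x1 + x2,
# i.e. on a diagonal direction rather than on a single axis.
X_train = rng.uniform(0, 1, size=(500, 2))
X_test = rng.uniform(0, 1, size=(500, 2))
y_train = X_train.sum(axis=1) + rng.normal(0, 0.05, size=500)
y_test = X_test.sum(axis=1) + rng.normal(0, 0.05, size=500)

linear = LinearRegression().fit(X_train, y_train)
# A shallow tree can only produce a handful of axis-parallel steps.
tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_tree = r2_score(y_test, tree.predict(X_test))
print(f"linear R^2: {r2_linear:.3f}, tree R^2: {r2_tree:.3f}")
```

A deeper tree narrows the gap, but only by adding more and more steps to the staircase.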

## Practical part

Develop a model for predicting review rating.  
**Multiclass classification into 5 classes**  
Score: **F1 with macro averaging**  
You are forbidden to use the test dataset for any kind of training.  
Remember to use a proper training pipeline.  
If you are not using default params in the models, you have to use some validation scheme to justify them. 

Use `random_state` or `seed` params - your experiment must be reproducible.
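
To make the scoring rule concrete, here is a small from-scratch sketch of macro-averaged F1. The helper name `macro_f1` and the toy labels are assumptions for illustration; the notebook itself relies on `metrics.f1_score(..., average='macro')`. Each class gets its own F1, and the classes are then averaged with equal weight, so a rare class counts as much as a frequent one:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical): class 1 is never predicted correctly.
y_true = [5, 5, 5, 4, 4, 1]
y_pred = [5, 5, 4, 4, 5, 5]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"macro F1: {score:.3f}, accuracy: {accuracy:.3f}")
```

Here the per-class F1 values are 0 for class 1, 0.5 for class 4 and 4/7 for class 5, so the macro average is 5/14, noticeably below the accuracy of 0.5.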


### Baseline 1 = 0.51
### Baseline 2 = 0.53


In [91]:
!wget https://github.com/thedenaas/hse_seminars/raw/master/2019/exam/exam_data.zip

--2020-03-24 12:15:06--  https://github.com/thedenaas/hse_seminars/raw/master/2019/exam/exam_data.zip
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2019/exam/exam_data.zip [following]
--2020-03-24 12:15:06--  https://raw.githubusercontent.com/thedenaas/hse_seminars/master/2019/exam/exam_data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7356841 (7.0M) [application/zip]
Saving to: ‘exam_data.zip.1’


2020-03-24 12:15:07 (89.5 MB/s) - ‘exam_data.zip.1’ saved [7356841/7356841]



In [92]:
!unzip exam_data.zip

Archive:  exam_data.zip
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [0]:
import gensim.downloader as api
import pandas as pd
import nltk
import numpy as np

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

In [0]:
SEED = 42

In [0]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [95]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape

(48192, 3)

In [96]:
df_train.head()

Unnamed: 0,review,title,target
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5
1,Excellent service - very approachable and prof...,Excellent Service,4
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4
4,Place had too many animals and I'm allergic to...,Experience,3


In [0]:
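# Concatenate review and title into one text field.
# No separator is inserted, so the last review word is fused with the first title word.
df_train['text'] = df_train['review'] + df_train['title']
df_test['text'] = df_test['review'] + df_test['title']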

In [97]:
class_distr = df_train.target.value_counts(normalize=True)
class_distr

5    0.405690
4    0.286126
3    0.153137
1    0.077648
2    0.077399
Name: target, dtype: float64

In [98]:
class_distr = df_train.target.value_counts()
class_distr

5    19551
4    13789
3     7380
1     3742
2     3730
Name: target, dtype: int64
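
The class counts above give a useful reference point for the score. As a worked calculation (a sketch using only the printed training counts, with variable names chosen here for illustration), a model that always predicts the most frequent class gets a non-zero F1 on that class only:

```python
# Training class counts copied from the output above.
counts = {5: 19551, 4: 13789, 3: 7380, 1: 3742, 2: 3730}
total = sum(counts.values())

majority = max(counts, key=counts.get)
# Every row is predicted as the majority class:
precision = counts[majority] / total  # share of predictions that are correct
recall = 1.0                          # every true majority row is found
f1_majority = 2 * precision * recall / (precision + recall)

# All other classes have F1 = 0, so the macro average divides by 5.
macro_f1_majority = f1_majority / len(counts)
print(f"always predicting {majority}: macro F1 = {macro_f1_majority:.3f}")
```

On the training distribution this comes out at roughly 0.115, which is the floor the baselines should be compared against.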

### Tf-idf

In [0]:
corpus = [text.lower() for text in df_train.text.values]

In [0]:
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)

In [141]:
X_train.shape

(48192, 50785)

In [0]:
test_corpus = [text.lower() for text in df_test.text.values]

In [0]:
X_test = vectorizer.transform(test_corpus)

In [144]:
X_test.shape

(5355, 50785)

In [0]:
y_test = df_test.target.values
y_train = df_train.target.values

#### Model (BASELINE 1)

In [154]:
%%time
model = GridSearchCV(LogisticRegression(penalty='l2', multi_class='ovr', max_iter=500, random_state=SEED),
                              {'C': [3.5]},  # parameter search shortened to a single value to save time; LogisticRegressionCV could have been used instead
                              cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)
print(model.best_params_)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   41.0s finished


{'C': 3.5}
train 0.7564204246097678
test 0.5101884507737575
CPU times: user 36.3 s, sys: 26.2 s, total: 1min 2s
Wall time: 1min 12s
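
In the cells above the vectorizer is fitted on the whole training set before cross-validation, so each validation fold has already contributed to the vocabulary and IDF weights. One possible alternative, sketched here on a hypothetical toy corpus rather than on the exam data, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that both are refitted inside every fold. The texts, labels, and the `C` grid are assumptions made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

SEED = 42

# Toy stand-in for the review texts and ratings (hypothetical data).
texts = ["great stay", "lovely staff", "great room", "clean and great",
         "dirty room", "awful stay", "rude staff", "dirty and awful"] * 3
labels = [5, 5, 5, 5, 1, 1, 1, 1] * 3

pipeline = Pipeline([
    # TfidfVectorizer lowercases by default, so no manual .lower() is needed.
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=500, random_state=SEED)),
])

# Step parameters are addressed as <step name>__<parameter>.
search = GridSearchCV(pipeline, {"clf__C": [0.1, 1.0, 10.0]},
                      cv=3, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search` object accepts raw strings in `predict`, so the test texts would never need a separate `transform` call.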


#### Tokenize texts

In [0]:
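def tokenize(text):
  """Lowercase the text, split it into sentences, then split each sentence into word tokens."""
  sents = nltk.tokenize.sent_tokenize(text.lower())
  # One list of tokens per sentence.
  token_sents = [nltk.tokenize.word_tokenize(sent) for sent in sents]
  return token_sents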

In [0]:
df_train['review_tokens'] = df_train['review'].apply(tokenize)
df_train.head()

Unnamed: 0,review,title,target,review_tokens,title_tokens,review_vec,title_vec,review_vecs,title_vecs
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5,"[[the, staff, was, very, friendly, ,, the, bre...","[[Walker, Gem]]","[[-0.026950411, 0.09126197, -0.029079862, 0.08...","[[0.030090332, 0.07446289, -0.17285156, 0.1264...","[[-0.026950411, 0.09126197, -0.029079862, 0.08...","[[0.030090332, 0.07446289, -0.17285156, 0.1264..."
1,Excellent service - very approachable and prof...,Excellent Service,4,"[[excellent, service, -, very, approachable, a...","[[Excellent, Service]]","[[-0.042317707, -0.020771027, -0.04103597, 0.0...","[[-0.114990234, -0.09863281, -0.053710938, 0.0...","[[-0.042317707, -0.020771027, -0.04103597, 0.0...","[[-0.114990234, -0.09863281, -0.053710938, 0.0..."
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5,"[[really, a, top, notch, place, to, spend, a, ...","[[Good, location, ,, warm, and, friendly, staff]]","[[-0.031962078, 0.053293865, 0.011444092, 0.11...","[[-0.039355468, 0.07539215, 0.0010253906, 0.04...","[[-0.031962078, 0.053293865, 0.011444092, 0.11...","[[-0.039355468, 0.07539215, 0.0010253906, 0.04..."
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4,"[[a, little, noisy, ,, there, was, a, false, f...","[[nice, hotel, ,]]","[[0.09769058, 0.010274251, 0.06544495, 0.01977...","[[0.092163086, 0.060180664, -0.11193848, 0.190...","[[0.09769058, 0.010274251, 0.06544495, 0.01977...","[[0.092163086, 0.060180664, -0.11193848, 0.190..."
4,Place had too many animals and I'm allergic to...,Experience,3,"[[place, had, too, many, animals, and, i, 'm, ...",[[Experience]],"[[0.056062013, 0.0070678713, -0.021950684, 0.1...","[[0.10498047, -0.025146484, -0.061767578, -0.2...","[[0.056062013, 0.0070678713, -0.021950684, 0.1...","[[0.10498047, -0.025146484, -0.061767578, -0.2..."


In [0]:
df_train['title_tokens'] = df_train['title'].apply(tokenize)


In [0]:
df_test['review_tokens'] = df_test['review'].apply(tokenize)
df_test.head()

Unnamed: 0,review,title,target,review_tokens,title_tokens
0,"I am from old town, and I stayed in this hotel...",Incredible Hotel,5,"[[i, am, from, old, town, ,, and, i, stayed, i...","[[Incredible, Hotel]]"
1,We have been coming to the Ocean Park Inn for ...,We Love this beach front Inn,5,"[[we, have, been, coming, to, the, ocean, park...","[[We, Love, this, beach, front, Inn]]"
2,Perfect place for a quick get away. We had a q...,Love this place!,5,"[[perfect, place, for, a, quick, get, away, .]...","[[Love, this, place, !]]"
3,"The room was not the best however, it was good...",Good For One Night Stay...,2,"[[the, room, was, not, the, best, however, ,, ...","[[Good, For, One, Night, Stay, ...]]"
4,Sous le motif d'une priode hivernale (inaccept...,Moyen,3,"[[sous, le, motif, d'une, priode, hivernale, (...",[[Moyen]]


In [0]:
df_test['title_tokens'] = df_test['title'].apply(tokenize)
df_test.head()

Unnamed: 0,review,title,target,review_tokens,title_tokens
0,"I am from old town, and I stayed in this hotel...",Incredible Hotel,5,"[[i, am, from, old, town, ,, and, i, stayed, i...","[[incredible, hotel]]"
1,We have been coming to the Ocean Park Inn for ...,We Love this beach front Inn,5,"[[we, have, been, coming, to, the, ocean, park...","[[we, love, this, beach, front, inn]]"
2,Perfect place for a quick get away. We had a q...,Love this place!,5,"[[perfect, place, for, a, quick, get, away, .]...","[[love, this, place, !]]"
3,"The room was not the best however, it was good...",Good For One Night Stay...,2,"[[the, room, was, not, the, best, however, ,, ...","[[good, for, one, night, stay, ...]]"
4,Sous le motif d'une priode hivernale (inaccept...,Moyen,3,"[[sous, le, motif, d'une, priode, hivernale, (...",[[moyen]]


#### Vectorize

In [0]:
wv = api.load('word2vec-google-news-300')
wv['king'][:10]





array([ 0.12597656,  0.02978516,  0.00860596,  0.13964844, -0.02563477,
       -0.03613281,  0.11181641, -0.19824219,  0.05126953,  0.36328125],
      dtype=float32)

In [0]:
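def vectorize(sent, wv):
  """Return the mean word vector of a tokenized sentence."""
  sent_vec = []
  for w in sent:
    try:
      vec = wv[w]
      sent_vec.append(vec)
    except KeyError:
      # Skip out-of-vocabulary tokens.
      continue
  if sent_vec:
    return np.mean(np.array(sent_vec), axis=0)
  else:
    # No known tokens: fall back to a zero vector (300 is the dimensionality of the loaded model).
    return np.zeros(300)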
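
The averaging step can be checked without downloading the word2vec model. The sketch below uses a hypothetical helper `mean_vector` and a plain dict of made-up 3-dimensional vectors as a stand-in for `wv`; it mirrors the logic of `vectorize` but is not the notebook's function:

```python
import numpy as np


def mean_vector(tokens, wv, dim):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [wv[w] for w in tokens if w in wv]
    if known:
        return np.mean(known, axis=0)
    return np.zeros(dim, dtype=np.float32)


# Toy 3-dimensional "embeddings" (hypothetical values, not word2vec).
toy_wv = {
    "good": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "hotel": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

known_mean = mean_vector(["good", "hotel", "zzz"], toy_wv, dim=3)
unknown_mean = mean_vector(["zzz"], toy_wv, dim=3)
print(known_mean)    # [2. 1. 1.]
print(unknown_mean)  # [0. 0. 0.]
```

The unknown token `zzz` is simply ignored, and a sentence with no known tokens maps to the zero vector, which is what happens to most tokens of non-English reviews.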

In [0]:
df_train['review_vecs'] = df_train['review_tokens'].apply(lambda x: [vectorize(sent, wv) for sent in x])

In [0]:
df_train['title_vecs'] = df_train['title_tokens'].apply(lambda x: [vectorize(sent, wv) for sent in x])

In [0]:
df_train['review_vec'] = df_train['review_vecs'].apply(lambda x: np.mean(np.array(x), axis=0))
df_train['title_vec'] = df_train['title_vecs'].apply(lambda x: np.mean(np.array(x), axis=0))
df_train.head()

Unnamed: 0,review,title,target,review_tokens,title_tokens,review_vec,title_vec,review_vecs,title_vecs
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5,"[[the, staff, was, very, friendly, ,, the, bre...","[[walker, gem]]","[-0.027158948, 0.043372687, -0.040174697, 0.12...","[0.1439209, 0.0078125, -0.10253906, 0.1977539,...","[[0.0011528863, 0.07189348, -0.03542752, 0.105...","[[0.1439209, 0.0078125, -0.10253906, 0.1977539..."
1,Excellent service - very approachable and prof...,Excellent Service,4,"[[excellent, service, -, very, approachable, a...","[[excellent, service]]","[-0.024286542, 0.031070312, 0.013583774, 0.104...","[-0.072509766, -0.022781372, -0.040283203, 0.0...","[[-0.05834961, -0.008141835, -0.039082844, 0.0...","[[-0.072509766, -0.022781372, -0.040283203, 0...."
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5,"[[really, a, top, notch, place, to, spend, a, ...","[[good, location, ,, warm, and, friendly, staff]]","[-0.035618663, 0.0314447, -0.002783648, 0.1490...","[-0.009472656, 0.102833554, 0.006616211, 0.066...","[[-0.028625488, 0.053212482, 0.0041097007, 0.1...","[[-0.009472656, 0.102833554, 0.006616211, 0.06..."
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4,"[[a, little, noisy, ,, there, was, a, false, f...","[[nice, hotel, ,]]","[0.09769058, 0.010274251, 0.06544495, 0.019775...","[0.092163086, 0.060180664, -0.11193848, 0.1904...","[[0.09769058, 0.010274251, 0.06544495, 0.01977...","[[0.092163086, 0.060180664, -0.11193848, 0.190..."
4,Place had too many animals and I'm allergic to...,Experience,3,"[[place, had, too, many, animals, and, i, 'm, ...",[[experience]],"[-0.0010496117, 0.0051407665, 0.00068507064, 0...","[0.037841797, -0.060058594, -0.05810547, -0.15...","[[0.0318042, 0.014367675, -0.028864747, 0.1096...","[[0.037841797, -0.060058594, -0.05810547, -0.1..."
...,...,...,...,...,...,...,...,...,...
48187,"A friend of mine always books the cheapest, ba...","Comfy, cozy but oh, so grand!",4,"[[a, friend, of, mine, always, books, the, che...","[[comfy, ,, cozy, but, oh, ,, so, grand, !]]","[0.03158905, 0.015115458, 0.07060936, 0.144346...","[-0.019195557, 0.00801595, -0.04177348, 0.1196...","[[0.086649574, 0.03970337, 0.061780293, 0.1354...","[[-0.019195557, 0.00801595, -0.04177348, 0.119..."
48188,Stayed here with my family over Spring Break i...,Great location and price for family lodging.,4,"[[stayed, here, with, my, family, over, spring...","[[great, location, and, price, for, family, lo...","[0.011766577, 0.02166656, 0.023084704, 0.09776...","[0.049184162, -0.03885905, -0.06154378, 0.1077...","[[0.020019531, 0.06564941, -0.046057127, 0.113...","[[0.049184162, -0.03885905, -0.06154378, 0.107..."
48189,One word AWFUL and the pool was closed,Good hotel in a quiet part of Memphis,2,"[[one, word, awful, and, the, pool, was, closed]]","[[good, hotel, in, a, quiet, part, of, memphis]]","[0.0778983, 0.04927281, 0.15257046, 0.04976981...","[0.042892456, 0.06477865, 0.0069885254, 0.1218...","[[0.0778983, 0.04927281, 0.15257046, 0.0497698...","[[0.042892456, 0.06477865, 0.0069885254, 0.121..."
48190,Never will stay here again. Dirty towels shirt...,Filthy,1,"[[never, will, stay, here, again, .], [dirty, ...",[[filthy]],"[0.022657266, 0.039829355, 0.005852507, 0.0863...","[-0.16308594, 0.29492188, 0.34179688, -0.14843...","[[0.02454834, 0.046533205, 0.023486327, 0.1568...","[[-0.16308594, 0.29492188, 0.34179688, -0.1484..."


In [0]:
df_test['review_vecs'] = df_test['review_tokens'].apply(lambda x: [vectorize(sent, wv) for sent in x])

In [0]:
df_test['title_vecs'] = df_test['title_tokens'].apply(lambda x: [vectorize(sent, wv) for sent in x])

In [0]:
df_test['review_vec'] = df_test['review_vecs'].apply(lambda x: np.mean(np.array(x), axis=0))
df_test['title_vec'] = df_test['title_vecs'].apply(lambda x: np.mean(np.array(x), axis=0))
df_test.head()

Unnamed: 0,review,title,target,review_tokens,title_tokens,review_vecs,title_vecs,review_vec,title_vec
0,"I am from old town, and I stayed in this hotel...",Incredible Hotel,5,"[[i, am, from, old, town, ,, and, i, stayed, i...","[[incredible, hotel]]","[[0.012062073, 0.02976662, 0.033480644, 0.0908...","[[0.07043457, 0.040649414, -0.09338379, 0.0153...","[0.023384497, 0.03565883, 0.024499038, 0.07690...","[0.07043457, 0.040649414, -0.09338379, 0.01539..."
1,We have been coming to the Ocean Park Inn for ...,We Love this beach front Inn,5,"[[we, have, been, coming, to, the, ocean, park...","[[we, love, this, beach, front, inn]]","[[0.0007873535, 0.05806885, -0.002355957, 0.06...","[[0.11916097, 0.017417273, -0.027191162, 0.098...","[0.0041688452, 0.050492834, 0.0048736045, 0.14...","[0.11916097, 0.017417273, -0.027191162, 0.0980..."
2,Perfect place for a quick get away. We had a q...,Love this place!,5,"[[perfect, place, for, a, quick, get, away, .]...","[[love, this, place, !]]","[[0.03937785, -0.011301677, -0.06715902, 0.041...","[[0.0017903646, 0.063802086, 0.0120442705, 0.0...","[0.012304458, 0.0050101005, -0.016735824, 0.08...","[0.0017903646, 0.063802086, 0.0120442705, 0.07..."
3,"The room was not the best however, it was good...",Good For One Night Stay...,2,"[[the, room, was, not, the, best, however, ,, ...","[[good, for, one, night, stay, ...]]","[[0.020104302, 0.053670987, 0.049757216, 0.067...","[[0.016113281, -0.027636718, 0.00847168, 0.107...","[0.025097786, 0.03321987, 0.02929426, 0.104737...","[0.016113281, -0.027636718, 0.00847168, 0.1076..."
4,Sous le motif d'une priode hivernale (inaccept...,Moyen,3,"[[sous, le, motif, d'une, priode, hivernale, (...",[[moyen]],"[[-0.04288025, 0.05625814, 0.076776125, 0.0677...","[[0.012145996, 0.032958984, 0.028808594, 0.108...","[-0.054073587, 0.18568115, 0.10359396, 0.08268...","[0.012145996, 0.032958984, 0.028808594, 0.1083..."


#### Separate x, y

In [0]:
y_train = df_train.target.values
title_vecs = np.vstack(df_train.title_vec.values)
review_vecs = np.vstack(df_train.review_vec.values)
X_train = np.hstack([title_vecs, review_vecs])
X_train.shape

(48192, 600)

In [0]:
y_test = df_test.target.values
test_title_vecs = np.vstack(df_test.title_vec.values)
test_review_vecs = np.vstack(df_test.review_vec.values)
X_test = np.hstack([test_title_vecs, test_review_vecs])
X_test.shape

(5355, 600)

#### Model

In [0]:
%%time
model = GridSearchCV(LinearSVC(penalty='l2', multi_class='ovr', random_state=SEED), 
                              {'C': np.logspace(-4, 3, 6)}, 
                              cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)
print(model.best_params_)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 34.9min finished


{'C': 1.584893192461111}
train 0.5136214045537899
test 0.4635723432254101
CPU times: user 6min 6s, sys: 730 ms, total: 6min 6s
Wall time: 40min 58s




In [0]:
%%time
model = GridSearchCV(RandomForestClassifier(random_state=SEED), 
                                    {'n_estimators': [50, 100], 'max_depth': [None, 25, 50]}, 
                                    cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)
print(model.best_params_)


print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed: 13.7min finished


{'max_depth': 25, 'n_estimators': 100}
train 0.9923652824280733
test 0.41473780511074576
