<a href="https://colab.research.google.com/github/eischaire/ML_4year/blob/master/exam_var2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
# !wget https://github.com/thedenaas/hse_seminars/raw/master/2019/exam/exam_data.zip

In [0]:
# !unzip exam_data.zip
# !ls

# Exam

Develop a model for predicting review rating.  
**Binary classification:**  
**positive class: target = 5**   
**negative class: target = 1,2,3,4**  
Score: **binary F1**  
You are forbidden to use the test dataset for any kind of training.  
Remember to follow a proper training pipeline.  
If you are not using default params in the models, you have to use some validation scheme to justify them. 

Use `random_state` or `seed` params: your experiment must be reproducible.
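
As a small illustration of the target definition and the metric above, the sketch below binarizes a handful of made-up star ratings (hypothetical values, not taken from the exam data) and checks a hand-computed binary F1 against `sklearn.metrics.f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical star ratings, for illustration only (not from the exam data).
ratings_true = np.array([5, 4, 5, 1, 3, 5])
ratings_pred = np.array([5, 5, 4, 1, 3, 5])

# Positive class: rating == 5; negative class: ratings 1-4.
y_true = (ratings_true == 5).astype(int)
y_pred = (ratings_pred == 5).astype(int)

# Confusion counts for the positive class.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# f1_score defaults to average='binary' with pos_label=1.
f1_sklearn = f1_score(y_true, y_pred)
print(f1_manual, f1_sklearn)
```

Because binary F1 only looks at the positive class, correctly predicted negatives do not raise the score.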


### 1 baseline = 0.720
### 2 baseline = 0.745


## Answer for the theoretical question

### Question №10. Discrepancy in seq2seq model training and inference implementation

The problem is that teacher forcing is used during training, i.e. the true previous token is fed to the model as input, whereas at inference the previously predicted token becomes the input at every subsequent step. 

So, the model operates under different conditions in the training and inference phases (this mismatch is often called exposure bias). The training task is easier, as every step receives the correct previous token. During inference, we cannot guarantee that the input is correct, so a single wrong prediction can propagate to all later steps, a situation the model never saw during training.
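
The difference can be shown with a toy sketch. The "decoder" here is just a hypothetical lookup table (`NEXT_TOKEN`, `step` and `decode` are names made up for this example, not a real seq2seq model) that has learned one wrong transition. With the true previous token as input the mistake stays local; when the model consumes its own predictions the mistake carries over to the next step.

```python
# Toy "decoder": predicts the next token from the previous one.
# The true sequence is 0 -> 1 -> 2 -> 3 -> 4, but this toy model has
# learned one wrong transition (2 -> 4 instead of 2 -> 3).
NEXT_TOKEN = {0: 1, 1: 2, 2: 4, 3: 4, 4: 5}


def step(prev_token):
    return NEXT_TOKEN[prev_token]


def decode(start_token, n_steps, targets=None):
    """With `targets`, feed the true previous token (training-style input);
    without them, feed the model's own previous prediction (inference)."""
    prev, outputs = start_token, []
    for t in range(n_steps):
        pred = step(prev)
        outputs.append(pred)
        prev = targets[t] if targets is not None else pred
    return outputs


targets = [1, 2, 3, 4]
teacher_forced = decode(0, len(targets), targets=targets)
free_running = decode(0, len(targets))


def n_errors(outputs):
    return sum(o != t for o, t in zip(outputs, targets))


print(teacher_forced, n_errors(teacher_forced))
print(free_running, n_errors(free_running))
```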

In [0]:
from gensim.models.word2vec import Word2Vec
from sklearn.base import TransformerMixin
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
import re
from warnings import filterwarnings
filterwarnings('ignore')
SEED = 4242

In [4]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

df_train['target'] = (df_train['target'] == 5).astype(int)
df_test['target'] = (df_test['target'] == 5).astype(int)

df_train.shape

(48192, 3)

In [5]:
df_train.head()

Unnamed: 0,review,title,target
0,"The staff was very friendly, the breakfast ver...",Walker Gem,1
1,Excellent service - very approachable and prof...,Excellent Service,0
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",1
3,"a little noisy, there was a false fire alarm a...","nice hotel,",0
4,Place had too many animals and I'm allergic to...,Experience,0


In [0]:
# code from a seminar notebook

class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5,negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
        sentences_list = [x.split() for x in X]
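        # NOTE: `size` and `iter` are the gensim 3.x argument names;
        # gensim >= 4.0 renames them to `vector_size` and `epochs`.
        # A fixed seed is only fully deterministic with a single worker thread.
        self.w2v = Word2Vec(sentences_list,
                            window=self.window_,
                            negative=self.negative_,
                            size=self.size_,
                            iter=self.iter_,
                            sg=int(not self.is_cbow_),  # 1 = skip-gram, 0 = CBOW
                            seed=self.random_state)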

        return self
    
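    def has(self, word):
        # look the word up in the trained word vectors (model.wv)
        return word in self.w2v.wv

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        # out-of-vocabulary words are mapped to a zero vector
        wv = self.w2v.wv
        return np.array([wv[w] if w in wv else np.zeros(self.size_) for w in X])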

In [29]:
%%time
reviews_list = [re.sub('  ', ' ', re.sub('\n', ' ', x.strip())) for x in df_train.review]
# titles_list = [re.sub('  ', ' ', re.sub('\n', ' ', x.strip())) for x in df_train.title]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(reviews_list[:15000])

vectorized_reviews = [np.mean(w2v_cbow.transform(sentence.split(' ')), axis=0) for sentence in reviews_list]
df_train['vectorized_review'] = pd.Series(vectorized_reviews)

# vectorized_title = [np.mean(w2v_cbow.transform(sentence.split(' ')), axis=0) for sentence in titles_list]
# df_train['vectorized_title'] = pd.Series(vectorized_title)

test_reviews = [re.sub('  ', ' ', re.sub('\n', ' ', x.strip())) for x in df_test.review]
df_test['vectorized_review'] = pd.Series([np.mean(w2v_cbow.transform(sentence.split(' ')), axis=0) for sentence in test_reviews])


CPU times: user 14min 53s, sys: 3.33 s, total: 14min 56s
Wall time: 7min 53s


In [30]:
trial_xtrain = list(df_train['vectorized_review'])
trial_ytrain = list(df_train['target'])
trial_xtest = list(df_test['vectorized_review'])
trial_ytest = list(df_test['target'])



In [0]:
%%time
knn = KNeighborsClassifier(n_neighbors=10, metric='cosine')
knn.fit(trial_xtrain, trial_ytrain)
y_pred = knn.predict(trial_xtest)
print(f1_score(trial_ytest, y_pred))
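
The exam statement asks for non-default parameters such as `n_neighbors=10` to be justified by a validation scheme. One possible way to do that, sketched below, is a stratified cross-validated grid search scored with binary F1. The data here is a synthetic stand-in produced by `make_classification`, and the grid values are arbitrary assumptions for the demo; in the notebook the search would be fitted on the training vectors only, never on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

SEED = 4242

# Synthetic stand-in for the averaged review vectors (assumption for the demo).
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     random_state=SEED)

# Shuffled stratified folds with a fixed seed keep the search reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

search = GridSearchCV(
    KNeighborsClassifier(metric='cosine'),
    param_grid={'n_neighbors': [5, 10, 20]},  # hypothetical candidate values
    scoring='f1',                             # binary F1, as in the exam
    cv=cv,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The selected `search.best_params_` could then be used to refit the classifier on the full training set before the single final evaluation on the test set.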