# Question Variant 1

GPT-2 is a left-to-right (causal) language model: it predicts the next token in a sequence from the tokens that precede it. BERT is a bidirectional model: it is trained to predict masked tokens using the context on both sides. GPT-2 is built from transformer decoder blocks, while BERT is built from encoder blocks. GPT-2 is auto-regressive, so it generates text one token at a time; BERT is not auto-regressive.
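
The difference in visible context can be illustrated with attention masks. This is a minimal NumPy sketch, not code taken from either model: `causal_mask` and `bidirectional_mask` are hypothetical helper names, and a `True` entry at row `i`, column `j` means that position `i` may attend to position `j`.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular mask: position i sees positions 0..i only (GPT-2 style).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    # Full mask: every position sees every other position (BERT style).
    return np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask(4).astype(int))
print(bidirectional_mask(4).astype(int))
```

With the causal mask, the prediction at a position cannot depend on later tokens, which is what makes token-by-token generation possible.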

# Exam

Develop a model for predicting review rating.  
**Multiclass classification into 5 classes**  
Score: **F1 with macro averaging**  
You are forbidden to use the test dataset for any kind of training.  
Remember to use a proper training pipeline.  
If you are not using the default parameters of a model, you have to use some validation scheme to justify them.
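
For reference, macro averaging computes F1 separately for each class and then takes the unweighted mean, so rare ratings count as much as frequent ones. Below is a minimal pure-Python sketch; `macro_f1` is a hypothetical helper and the toy labels are made up for illustration. It should agree with `f1_score(..., average='macro')` from scikit-learn on the same inputs.

```python
def macro_f1(y_true, y_pred):
    # Classes are taken from both vectors, so every class has at least one sample.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # Per-class F1 = 2TP / (2TP + FP + FN); the denominator is never zero here.
        scores.append(2 * tp / (2 * tp + fp + fn))
    # Unweighted mean over classes.
    return sum(scores) / len(scores)

# Toy imbalanced example: always predicting the majority class.
y_true = [5, 5, 5, 5, 4, 3, 2, 1]
y_pred = [5] * 8
print(macro_f1(y_true, y_pred))
```

In this toy case half of the predictions are correct, yet four of the five classes get an F1 of zero, so the macro score is far below the accuracy.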

Use the `random_state` or `seed` params: your experiment must be reproducible.
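
One possible validation scheme is a seeded, stratified hold-out split of the training data. The sketch below is pure Python and only illustrative: `stratified_split` and its parameters are hypothetical names, and it assumes every class has at least two rows. In practice, `train_test_split(..., stratify=y, random_state=SEED)` or `StratifiedKFold` from scikit-learn serve the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=1337):
    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator: same seed gives the same split
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Keep roughly the same class proportions in both parts.
        n_val = max(1, round(len(idxs) * val_fraction))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = [1] * 10 + [5] * 40
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Hyperparameters are then compared on the validation part only, and the test set is touched once, for the final score.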


### 1 baseline = 0.51
### 2 baseline = 0.53


In [0]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize

import gensim
import gensim.downloader as gd

SEED=1337
np.random.seed(SEED)

In [123]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

In [125]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [0]:
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

In [127]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape

(48192, 3)

In [6]:
df_train.head()

Unnamed: 0,review,title,target
0,"The staff was very friendly, the breakfast ver...",Walker Gem,5
1,Excellent service - very approachable and prof...,Excellent Service,4
2,Really a top notch place to spend a day at the...,"Good location, warm and friendly staff",5
3,"a little noisy, there was a false fire alarm a...","nice hotel,",4
4,Place had too many animals and I'm allergic to...,Experience,3


In [0]:
# Split each review into word tokens
df_train['tokenized'] = df_train['review'].apply(word_tokenize)
df_test['tokenized'] = df_test['review'].apply(word_tokenize)

In [0]:
# Number of tokens per review
df_train['word_count'] = df_train['tokenized'].apply(len)
df_test['word_count'] = df_test['tokenized'].apply(len)

In [0]:
df_train['emotpunkt_count'] = df_train['review'].apply(lambda sent: sent.count('!'))
df_test['emotpunkt_count'] = df_test['review'].apply(lambda sent: sent.count('!'))

In [0]:
lemmatizer = WordNetLemmatizer()

In [0]:
df_train['lems'] = df_train['tokenized'].apply(lambda sent: [lemmatizer.lemmatize(word.lower()) for word in sent])
df_test['lems'] = df_test['tokenized'].apply(lambda sent: [lemmatizer.lemmatize(word.lower()) for word in sent])

In [0]:
df_train['len_t'] = df_train['title'].apply(lambda sent: len(sent))
df_test['len_t'] = df_test['title'].apply(lambda sent: len(sent))

In [134]:
# Note: `vectorize` and `w2v` are defined in later cells, which must be run first
df_train['vec'] = df_train['lems'].apply(lambda sent: vectorize(sent, w2v))
df_test['vec'] = df_test['lems'].apply(lambda sent: vectorize(sent, w2v))



In [0]:
df_train['good'] = df_train['lems'].apply(lambda sent: goodorbad(sent))
df_test['good'] = df_test['lems'].apply(lambda sent: goodorbad(sent))

In [111]:
w2v = gd.load('word2vec-google-news-300')




KeyboardInterrupt: ignored

In [0]:
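def vectorize(sent, w2v):
  # Average the word vectors of all in-vocabulary tokens of the sentence
  vecs = []
  for tk in sent:
    try:
      vecs.append(w2v[tk])
    except KeyError:
      # Skip out-of-vocabulary tokens
      continue
  if not vecs:
    # No known tokens: return a zero vector instead of NaN (mean of an empty array)
    return np.zeros(w2v.vector_size)
  return np.mean(np.array(vecs), axis=0)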

In [0]:
good_words = ['nice', 'good', 'excellent', 'pleasant', 'tasty', 'hot', 'fast', 'favourite', 'friendly', 'top']
bad_words = ['angry', 'slow', 'cold', 'dry', 'bad', 'unpleasant', 'aggressive', 'noisy', 'smoking']

In [0]:
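def goodorbad(review):
  # Count of positive words minus count of negative words
  rating = 0
  for word in review:
    if word in good_words:
      rating+=1
    elif word in bad_words:
      rating-=1
  # Return only after all words have been scanned
  return rating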


In [0]:
df_train = df_train.fillna(0)
df_test = df_test.fillna(0)

In [0]:
no_label = df_train.drop(columns=['target', 'review', 'title'])
to_predict = df_test.drop(columns=['target', 'review', 'title'])

In [0]:
labels_train = df_train['target']
labels_test = df_test['target']

In [176]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

model =  LinearSVC()

model.fit(df_train[['word_count', 'emotpunkt_count']], df_train['target'])





LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [0]:
target_pred = model.predict(df_test[['word_count', 'emotpunkt_count']])

In [175]:
# f1_score expects (y_true, y_pred); macro F1 is unchanged by the argument order
f1_score(df_test['target'], target_pred, average = 'macro')

0.10288207502298885

In [156]:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

SVCpipe = Pipeline([('scale', StandardScaler()),
                   ('SVC',LinearSVC())])


# C values tried: 0.01, 10.01, ..., 90.01
param_grid = {'SVC__C':np.arange(0.01,100,10)}
# No `scoring` is passed, so candidates are ranked by accuracy (the default for classifiers)
linearSVC = GridSearchCV(SVCpipe,param_grid,cv=5,return_train_score=True)
linearSVC.fit(df_train[['word_count', 'emotpunkt_count']], df_train['target'])
print(linearSVC.best_params_)
#linearSVC.coef_
#linearSVC.intercept_

# best_estimator_ is already refit on the full training set (refit=True by default)
bestlinearSVC = linearSVC.best_estimator_
bestlinearSVC.fit(df_train[['word_count', 'emotpunkt_count']], df_train['target'])
bestlinearSVC.coef_ = bestlinearSVC.named_steps['SVC'].coef_
# Accuracy on the training data
bestlinearSVC.score(df_train[['word_count', 'emotpunkt_count']], df_train['target'])



{'SVC__C': 30.01}




0.4069762616201859

In [0]:
target_pred = bestlinearSVC.predict(df_test[['word_count', 'emotpunkt_count']])

In [158]:
# f1_score expects (y_true, y_pred); macro F1 is unchanged by the argument order
f1_score(df_test['target'], target_pred, average = 'macro')

0.12861386886769136
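
The grid search above is run without a `scoring` argument and without a `random_state`, so it selects `C` by accuracy and is not fully reproducible. The following is a hedged sketch of how the search could instead target the exam metric with a fixed seed. It runs on synthetic data from `make_classification` as a stand-in for the real feature matrix (an assumption), and the grid values and `max_iter` are illustrative choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

SEED = 1337

# Synthetic 5-class data standing in for the real features (an assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, random_state=SEED)

pipe = Pipeline([('scale', StandardScaler()),
                 ('SVC', LinearSVC(random_state=SEED, max_iter=5000))])

# Seeded stratified folds keep class proportions and make the split repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(pipe, {'SVC__C': [0.01, 0.1, 1.0, 10.0]},
                      scoring='f1_macro', cv=cv)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated macro F1 of the best candidate
```

Because `best_score_` is then a cross-validated macro F1, it can be compared directly with the baselines without touching the test set.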