# Exam

## Part 1: Question

Compare BERT and GPT-2: high-level description 

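BERT stacks Transformer encoder blocks with bidirectional self-attention, so every token attends to both its left and right context. GPT-2 stacks Transformer decoder blocks with masked (causal) self-attention, which prevents information leaking from future tokens, and it generates text in an autoregressive fashion, one token at a time.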
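The difference in attention patterns can be illustrated with a minimal sketch. The helper `attention_mask` below is hypothetical and written in plain Python for illustration only; it is not taken from either model's implementation. A `1` at row `i`, column `j` means position `i` is allowed to attend to position `j`.

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len mask; mask[i][j] == 1 means token i may attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Encoder-style (BERT-like): every token sees the whole sequence.
bert_mask = attention_mask(4, causal=False)
# Decoder-style (GPT-2-like): token i sees only tokens 0..i, never the future.
gpt2_mask = attention_mask(4, causal=True)

for name, mask in [("bidirectional", bert_mask), ("causal", gpt2_mask)]:
    print(name)
    for row in mask:
        print(" ", row)
```

The causal mask is lower triangular, which is what lets a decoder be trained on all positions in parallel while still predicting each token only from its predecessors.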

## Part 2: Coding

Develop a model for predicting review rating.

**Multiclass classification into 5 classes**

Score: **F1 with macro averaging**
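As a reminder of what the metric measures, here is a minimal sketch of macro-averaged F1 in plain Python. The function `macro_f1` and the toy labels are hypothetical and only for illustration; the notebook itself relies on `sklearn.metrics.f1_score(..., average='macro')`. Macro averaging computes F1 per class and takes the unweighted mean, so rare ratings count as much as frequent ones.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); a class that never appears scores 0 here.
        f1_per_class.append(2 * tp / denom if denom else 0.0)
    return sum(f1_per_class) / len(f1_per_class)

# Toy example with an imbalanced rating distribution (made-up values).
y_true = [5, 5, 5, 5, 4, 4, 3, 2, 1, 1]
y_pred = [5, 5, 5, 4, 4, 5, 3, 3, 1, 1]
score = macro_f1(y_true, y_pred, labels=[1, 2, 3, 4, 5])
print(round(score, 4))
```

In this toy case the single missed review with rating 2 contributes a per-class F1 of 0 and pulls the average down sharply, which is why a model that ignores minority classes scores poorly under this metric.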

You are forbidden to use the test dataset for any kind of training.

Remember to use a proper training pipeline.

If you are not using default params in the models, you have to use some validation scheme to justify them.

Use random_state or seed params - your experiment must be reproducible.

**1 baseline = 0.51**

**2 baseline = 0.53**

In [0]:
import pandas as pd
import numpy as np
import sklearn
import spacy

SEED=1337
np.random.seed(SEED)

In [0]:
!unzip exam_data.zip

Archive:  exam_data.zip
replace test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [2]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape

(48192, 3)

In [3]:
df_train.head()

|   | review | title | target |
|---|---|---|---|
| 0 | The staff was very friendly, the breakfast ver... | Walker Gem | 5 |
| 1 | Excellent service - very approachable and prof... | Excellent Service | 4 |
| 2 | Really a top notch place to spend a day at the... | Good location, warm and friendly staff | 5 |
| 3 | a little noisy, there was a false fire alarm a... | nice hotel, | 4 |
| 4 | Place had too many animals and I'm allergic to... | Experience | 3 |


In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
import sklearn.metrics  # f1_score is called as sklearn.metrics.f1_score below
from collections import Counter
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

In [0]:
vectorizer = TfidfVectorizer()
df_train['text'] = df_train.title + '. ' + df_train.review
X_tv = vectorizer.fit_transform(df_train.text)
df_test['text'] = df_test.title + '. ' + df_test.review
X_test = vectorizer.transform(df_test.text)

In [0]:
y_tv = df_train.target
y_test = df_test.target
X_train, X_val, y_train, y_val = train_test_split(X_tv, y_tv, test_size=0.2, random_state=SEED)
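Note that the cells above fit `TfidfVectorizer` on all of `train.csv` before splitting, so the validation rows contribute to the vocabulary and IDF weights. One possible way to avoid this, sketched below under stated assumptions, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vectorizer is refit inside each cross-validation fold. The toy texts, the two-class labels, `n_splits=2`, and `max_iter=1000` are hypothetical choices made only to keep the sketch small and fast; they are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

SEED = 1337

# Hypothetical stand-in for df_train.text and df_train.target.
texts = [
    "great stay friendly staff", "lovely hotel excellent breakfast",
    "wonderful location clean room", "amazing service great view",
    "terrible room dirty bathroom", "rude staff awful stay",
    "noisy hotel bad breakfast", "horrible service broken shower",
]
labels = [5, 5, 5, 5, 1, 1, 1, 1]

# The vectorizer is fit only on the training part of each fold.
pipe = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(random_state=SEED, max_iter=1000),
)
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=SEED)
scores = cross_validate(pipe, texts, labels, cv=skf, scoring=["f1_macro"])
mean_f1 = scores["test_f1_macro"].mean()
print(mean_f1)
```

With this layout the raw text, rather than a precomputed matrix, is what gets split, and the same pipeline object can be passed to `RandomizedSearchCV` if hyperparameters are tuned.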

In [7]:
X_train.shape, X_val.shape

((38553, 43520), (9639, 43520))

In [0]:
scoring = ['f1_macro']

In [0]:
skf = StratifiedKFold(5, shuffle=True, random_state=SEED)

### Logistic regression

In [0]:
clf = LogisticRegression(random_state=SEED)

In [11]:
scores = cross_validate(clf, X_train, y_train, cv=skf, scoring=scoring)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [12]:
print(np.mean(scores['test_f1_macro']))

0.5150838198247045


In [13]:
clf = LogisticRegression(solver='saga', random_state=SEED)
param_dist = dict(C=uniform(loc=0, scale=4), penalty=['l2', 'l1'])
tuned_clf = RandomizedSearchCV(clf, param_dist, random_state=SEED, scoring=scoring, refit=scoring[0], cv=skf)
search = tuned_clf.fit(X_train, y_train)
clf = search.best_estimator_
pred = clf.predict(X_val)
print(Counter(pred))
sklearn.metrics.f1_score(y_val, pred, average='macro')



Counter({5: 4520, 4: 2662, 3: 1277, 1: 743, 2: 437})


0.5329280881531459

In [15]:
pred = clf.predict(X_tv)
print(Counter(pred))
sklearn.metrics.f1_score(y_tv, pred, average='macro')

Counter({5: 22024, 4: 13426, 3: 6504, 1: 3775, 2: 2463})


0.6605345914828209

In [16]:
pred = clf.predict(X_test)
print(Counter(pred))
sklearn.metrics.f1_score(y_test, pred, average='macro')

Counter({5: 2507, 4: 1498, 3: 677, 1: 437, 2: 236})


0.5167442386551145

### Decision tree and random forest

In [10]:
clf = DecisionTreeClassifier(random_state=SEED)
scores = cross_validate(clf, X_train, y_train, cv=skf, scoring=scoring)
print(np.mean(scores['test_f1_macro']))

0.3657346299286789


In [11]:
clf = RandomForestClassifier(n_estimators=50, random_state=SEED, oob_score=True)
scores = cross_validate(clf, X_train, y_train, cv=skf, scoring=scoring)
print(np.mean(scores['test_f1_macro']))

0.37905431622407487


In [0]:
clf = RandomForestClassifier(n_estimators=50, random_state=SEED, oob_score=True)
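# Assumed search space: the original C/penalty entries are LogisticRegression parameters, which RandomForestClassifier rejects
param_dist = dict(max_depth=[None, 20, 50], min_samples_leaf=[1, 2, 4], max_features=['sqrt', 'log2'])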
tuned_clf = RandomizedSearchCV(clf, param_dist, random_state=SEED, scoring=scoring, refit=scoring[0], cv=skf, n_jobs=-1)
search = tuned_clf.fit(X_train, y_train)
clf = search.best_estimator_
pred = clf.predict(X_val)
print(Counter(pred))
sklearn.metrics.f1_score(y_val, pred, average='macro')

In [0]:
# pred = clf.predict(X_tv)
# print(Counter(pred))
# sklearn.metrics.f1_score(y_tv, pred, average='macro')

In [0]:
# pred = clf.predict(X_test)
# print(Counter(pred))
# sklearn.metrics.f1_score(y_test, pred, average='macro')

### SVM

In [0]:
clf = SVC(kernel='linear', random_state=SEED)
scores = cross_validate(clf, X_train, y_train, cv=skf, scoring=scoring)
print(np.mean(scores['test_f1_macro']))

In [0]:
clf = SVC(kernel='linear', random_state=SEED)
param_dist = dict(C=uniform(loc=0, scale=4))  # SVC has no penalty parameter, so only C is searched
tuned_clf = RandomizedSearchCV(clf, param_dist, random_state=SEED, scoring=scoring, refit=scoring[0], cv=skf, n_jobs=-1)
search = tuned_clf.fit(X_train, y_train)
clf = search.best_estimator_
pred = clf.predict(X_val)
print(Counter(pred))
sklearn.metrics.f1_score(y_val, pred, average='macro')

In [0]:
# pred = clf.predict(X_tv)
# print(Counter(pred))
# sklearn.metrics.f1_score(y_tv, pred, average='macro')

In [0]:
# pred = clf.predict(X_test)
# print(Counter(pred))
# sklearn.metrics.f1_score(y_test, pred, average='macro')