## 4. Compare logistic regression and SVM: decision surface and robustness

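In logistic regression the decision surface is a hyperplane: the sigmoid function maps a linear score to a probability, and the boundary lies where that probability equals 0.5. An SVM also separates the classes with a hyperplane of dimension n-1 in an n-dimensional space, but it picks the one with the largest margin between the classes. With a kernel, the SVM boundary in the original feature space can take various non-linear shapes, which plain logistic regression cannot do. SVM is often considered more robust than logistic regression, because its solution depends only on the support vectors, so points far from the boundary do not shift it.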
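
A minimal sketch of this comparison, using a made-up two-dimensional linear model (the weights `w`, the bias `b` and the sample points are assumptions chosen only for illustration). It checks that thresholding the sigmoid at 0.5 gives the same labels as the sign of the linear score, so the logistic regression boundary is itself a hyperplane. It then contrasts the logistic loss with the hinge loss on a point far from the boundary, which is one common argument behind the robustness claim.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def linear_score(w, b, x):
    """Signed score w.x + b; it is zero exactly on the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def log_loss(y, score):
    """Logistic loss for a label y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))


def hinge_loss(y, score):
    """SVM hinge loss for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


# Hypothetical 2-D model: the boundary is the line x1 + x2 = 1.
w, b = [1.0, 1.0], -1.0
points = [(0.0, 0.0), (2.0, 2.0), (0.5, 0.4), (3.0, -1.0)]

for x in points:
    s = linear_score(w, b, x)
    # Thresholding the sigmoid at 0.5 is the same as checking the sign of s.
    assert (sigmoid(s) >= 0.5) == (s >= 0.0)

# A correctly classified positive point far from the boundary.
far_score = linear_score(w, b, (5.0, 5.0))
print(hinge_loss(+1, far_score))      # zero: the point has no influence on the SVM objective
print(log_loss(+1, far_score) > 0.0)  # the logistic loss is small but still positive
```

The hinge loss is flat beyond the margin, while the logistic loss never reaches zero, so every training point keeps a small influence on the logistic regression fit.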

# Exam

Develop a model for predicting review rating.  
**Binary classification:**  
**positive class: target = 5**   
**negative class: target = 1,2,3,4**  
Score: **binary F1**  
You are forbidden to use the test dataset for any kind of training.  
Remember to follow a proper training pipeline.  
If you are not using default params in the models, you have to use some validation scheme to justify them. 

Use `random_state` or `seed` params - your experiment must be reproducible.
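
One possible validation scheme that satisfies these rules is a seeded, stratified cross-validated grid search scored with binary F1. The sketch below is only an illustration: the TF-IDF plus logistic regression pipeline, the `clf__C` grid and the toy reviews are assumptions, not the model used in this notebook. Keeping the vectorizer inside the `Pipeline` means it is refit on each training fold, so the validation folds never leak into it, and the test set is not touched at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

SEED = 1337

# Toy stand-in for the training reviews (hypothetical data).
texts = [
    "great hotel friendly staff", "perfect stay lovely room",
    "excellent breakfast clean room", "amazing view great service",
    "wonderful location helpful staff", "best hotel ever",
    "noisy room rude staff", "dirty bathroom bad smell",
    "terrible service cold food", "room was small and noisy",
    "bad location poor breakfast", "would not stay again",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=SEED, max_iter=1000)),
])

# Seeded, stratified folds keep the class balance and make the split repeatable.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
search = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    scoring="f1",  # binary F1 on the positive class
    cv=cv,
)
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

The chosen parameters are then justified by `search.best_score_`, and the test set is used only once, for the final F1.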


### 1 baseline = 0.720
### 2 baseline = 0.745


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
current_dir = '/content/drive/My Drive/Colab Notebooks/'

In [0]:
import pandas as pd
import numpy as np

In [10]:
df_train = pd.read_csv(current_dir + 'train.csv')
df_test = pd.read_csv(current_dir + 'test.csv')

# np.int was removed from NumPy; the built-in int is the replacement
df_train['target'] = (df_train['target'] == 5).astype(int)
df_test['target'] = (df_test['target'] == 5).astype(int)

df_train.shape

(48192, 3)

In [11]:
print('Test dataset size: ', len(df_test))
print('Train dataset size: ', len(df_train))

Test dataset size:  5355
Train dataset size:  48192


In [12]:
df_train.head()

|   | review | title | target |
|---|---|---|---|
| 0 | The staff was very friendly, the breakfast ver... | Walker Gem | 1 |
| 1 | Excellent service - very approachable and prof... | Excellent Service | 0 |
| 2 | Really a top notch place to spend a day at the... | Good location, warm and friendly staff | 1 |
| 3 | a little noisy, there was a false fire alarm a... | nice hotel, | 0 |
| 4 | Place had too many animals and I'm allergic to... | Experience | 0 |


In [30]:
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')

SEED = 1337

lemmatizer = WordNetLemmatizer()
token = RegexpTokenizer(r'\w+')  # raw string: tokens are runs of word characters
stops = set(stopwords.words('english'))

tfidfconverter = TfidfTransformer()
# stop_words expects a list (or 'english' / None), so the set is converted
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=list(stops))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
def tokenize(text):
    """Split a text into alphanumeric word tokens."""
    return token.tokenize(text)


def lemmatize(texts):
    """Lowercase each text, drop English stop words and lemmatize the remaining tokens."""
    arr = []
    for text in texts:
        words = [lemmatizer.lemmatize(word) for word in tokenize(text.lower()) if word not in stops]
        arr.append(' '.join(words))
    return arr

In [0]:
train_txt = list(df_train["review"])
test_txt = list(df_test["review"])

In [0]:
train_txt = lemmatize(train_txt)
test_txt = lemmatize(test_txt)

In [17]:
train_txt[:5]

['staff friendly breakfast nice extremely comfortable bed',
 'excellent service approachable professional staff stayed business dining gallery certainly convenient fitness center could improved basic cardio equipment needed would stay',
 'really top notch place spend day beginning end honduras trip staff friendly professional helpful efficient question find answer make call walking distance mall lot good restaurant though advisable take taxi night breakfast buffet',
 'little noisy false fire alarm midnight reason given',
 'place many animal allergic pet although receive pet free room pet every lobby elevator anything pet highly allergic therefore people allergy considered']

In [18]:
df_train.insert(1, "review_lem", train_txt)
df_train.head()

|   | review | review_lem | title | target |
|---|---|---|---|---|
| 0 | The staff was very friendly, the breakfast ver... | staff friendly breakfast nice extremely comfor... | Walker Gem | 1 |
| 1 | Excellent service - very approachable and prof... | excellent service approachable professional st... | Excellent Service | 0 |
| 2 | Really a top notch place to spend a day at the... | really top notch place spend day beginning end... | Good location, warm and friendly staff | 1 |
| 3 | a little noisy, there was a false fire alarm a... | little noisy false fire alarm midnight reason ... | nice hotel, | 0 |
| 4 | Place had too many animals and I'm allergic to... | place many animal allergic pet although receiv... | Experience | 0 |


In [19]:
df_test.insert(1, 'review_lem', test_txt)
df_test.head()

|   | review | review_lem | title | target |
|---|---|---|---|---|
| 0 | I am from old town, and I stayed in this hotel... | old town stayed hotel mom visit renovation yea... | Incredible Hotel | 1 |
| 1 | We have been coming to the Ocean Park Inn for ... | coming ocean park inn year usually book severa... | We Love this beach front Inn | 1 |
| 2 | Perfect place for a quick get away. We had a q... | perfect place quick get away queen room shared... | Love this place! | 1 |
| 3 | The room was not the best however, it was good... | room best however good one night continuing tr... | Good For One Night Stay... | 0 |
| 4 | Sous le motif d'une priode hivernale (inaccept... | sou le motif une priode hivernale inacceptable... | Moyen | 0 |


In [0]:
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, GlobalMaxPool1D, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=7000)

In [0]:
def train_model(train_data, y_train):
    """Build a bidirectional LSTM classifier, fit the tokenizer on the training texts and train."""
    net = Sequential([
        Embedding(7000, 200),
        Bidirectional(LSTM(100, return_sequences=True)),
        GlobalMaxPool1D(),
        Dense(32, activation='relu'),
        Dropout(0.1),
        Dense(1, activation='sigmoid'),
    ])

    net.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

    # The vocabulary is built from the training texts only.
    tokenizer.fit_on_texts(train_data)
    X_train = pad_sequences(tokenizer.texts_to_sequences(train_data), maxlen=200)

    net.fit(X_train, y_train, batch_size=250, epochs=4, validation_split=0.25)

    return net

In [49]:
# previously the model was trained for 3 epochs

# The function is no longer named `model`, so re-running this cell does not overwrite it.
model = train_model(df_train['review_lem'], df_train['target'])

Train on 36144 samples, validate on 12048 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [0]:
X_test = pad_sequences(tokenizer.texts_to_sequences(df_test['review_lem']), maxlen=200)
y_pred = (model.predict(X_test)[:, 0] >= 0.5).astype(int)  # threshold the probabilities at 0.5

In [52]:
print('F1 score: ', round(f1_score(df_test['target'], y_pred), 3))

F1 score:  0.723
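
For reference, the binary F1 reported above is the harmonic mean of precision and recall computed for the positive class only. A minimal sketch of that computation on a made-up set of labels (the helper `binary_f1` and the example vectors are hypothetical, written for illustration; the notebook itself uses `sklearn.metrics.f1_score`):

```python
def binary_f1(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(binary_f1(y_true, y_pred), 3))
```

Here precision and recall are both 2/3, so F1 is also 2/3. True negatives do not enter the formula, which is why F1 suits a task where the positive class (rating 5) is the one of interest.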
