# Assignment 5

Build a CNN model for sentiment analysis (binary classification) of IMDB reviews (https://www.kaggle.com/utathya/imdb-review-dataset). You may use the data with label="unsup" to pretrain embeddings, but you must not use the test dataset for that purpose.
Your quality metric is the accuracy score on the test dataset. See the "type" column for the train/test split.
You may use pretrained embeddings from external sources.
You must report results for trials with different hyperparameter values.

You have to beat the following baselines:

[3 points] acc = 0.75

[5 points] acc = 0.8

[8 points] acc = 0.9

[2 points] for using unsupervised data

### Imports

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

tfidfconverter = TfidfTransformer()

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

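nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
token = RegexpTokenizer(r'\w+')  # raw string avoids the invalid-escape warning for \w
stops = set(stopwords.words('english'))

# scikit-learn expects stop_words as a list (or 'english'), not a set
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=list(stops))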


In [3]:
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, GlobalMaxPool1D, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Restrict sequences to the most frequent words (word indices below 7000)
tokenizer = Tokenizer(num_words=7000)

### Reading the data

In [4]:
data = pd.read_csv('imdb_master.csv', index_col=0, encoding='iso-8859-1')
data[:7]

Unnamed: 0,type,review,label,file
0,test,Once again Mr. Costner has dragged out a movie...,neg,0_2.txt
1,test,This is an example of why the majority of acti...,neg,10000_4.txt
2,test,"First of all I hate those moronic rappers, who...",neg,10001_1.txt
3,test,Not even the Beatles could write songs everyon...,neg,10002_3.txt
4,test,Brass pictures (movies is not a fitting word f...,neg,10003_3.txt
5,test,"A funny thing happened to me while watching ""M...",neg,10004_2.txt
6,test,This German horror film has to be one of the w...,neg,10005_2.txt


In [5]:
print('Dataset size: ', len(data))
print('Number of unlabeled examples: ', len(data.loc[data["label"] == "unsup"]))

Dataset size:  100000
Number of unlabeled examples:  50000


In [6]:
# Split the dataset into test and train

test = data.loc[data["type"] == "test"]
train = data.loc[(data["type"] == "train") & (data["label"] != "unsup")]
print('Test dataset size: ', len(test))
print('Train dataset size: ', len(train))

Test dataset size:  25000
Train dataset size:  25000


In [7]:
def tokenize(text):
    """Split text into runs of word characters, dropping punctuation."""
    return token.tokenize(text)


def lemmatize(texts):
    """Lowercase each text, drop stop words, and lemmatize the remaining tokens."""
    arr = []
    for text in texts:
        words = [lemmatizer.lemmatize(word) for word in tokenize(text.lower()) if word not in stops]
        arr.append(' '.join(words))
    return arr
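
The preprocessing above lowercases each review, splits it on `\w+`, drops stop words and lemmatizes what is left. As a rough, dependency-free illustration of the first three steps, the sketch below applies the same pattern with `re.findall`. The tiny stop list `DEMO_STOPS` and the helper name `simple_clean` are assumptions made for the example (the notebook uses NLTK's English list), and lemmatization is skipped.

```python
import re

# Hypothetical miniature stop list; the notebook uses NLTK's English list instead.
DEMO_STOPS = {"the", "a", "t", "was", "it"}


def simple_clean(text, stops=DEMO_STOPS):
    """Lowercase, keep runs of word characters, and drop stop words."""
    words = re.findall(r"\w+", text.lower())
    return " ".join(w for w in words if w not in stops)


# The apostrophe is not a word character, so "wasn't" splits into "wasn" and "t".
cleaned = simple_clean("The plot wasn't bad, it was GREAT!")
print(cleaned)
```

Because contractions are split at the apostrophe, negation fragments survive or vanish depending on the stop list, which may matter for sentiment.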

In [8]:
train_txt = list(train["review"])
test_txt = list(test["review"])

In [9]:
train_txt = lemmatize(train_txt)
test_txt = lemmatize(test_txt)

In [10]:
train.insert(2, "review_lem", train_txt) 

In [11]:
# assign() returns new DataFrames, which avoids pandas' SettingWithCopyWarning on these slices
test = test.assign(label=test['label'].map({'neg': 0, 'pos': 1}))
train = train.assign(label=train['label'].map({'neg': 0, 'pos': 1}))


In [12]:
train.head()

Unnamed: 0,type,review,review_lem,label,file
25000,train,Story of a man who has unnatural feelings for ...,story man unnatural feeling pig start opening ...,0,0_3.txt
25001,train,Airport '77 starts as a brand new luxury 747 p...,airport 77 start brand new luxury 747 plane lo...,0,10000_4.txt
25002,train,This film lacked something I couldn't put my f...,film lacked something put finger first charism...,0,10001_4.txt
25003,train,"Sorry everyone,,, I know this is supposed to b...",sorry everyone know supposed art film wow hand...,0,10002_1.txt
25004,train,When I was little my parents took me along to ...,little parent took along theater see interior ...,0,10003_1.txt


In [13]:
test.insert(2, 'review_lem', test_txt)
test.head()

Unnamed: 0,type,review,review_lem,label,file
0,test,Once again Mr. Costner has dragged out a movie...,mr costner dragged movie far longer necessary ...,0,0_2.txt
1,test,This is an example of why the majority of acti...,example majority action film generic boring re...,0,10000_4.txt
2,test,"First of all I hate those moronic rappers, who...",first hate moronic rapper could nt act gun pre...,0,10001_1.txt
3,test,Not even the Beatles could write songs everyon...,even beatles could write song everyone liked a...,0,10002_3.txt
4,test,Brass pictures (movies is not a fitting word f...,brass picture movie fitting word really somewh...,0,10003_3.txt


### Building the CNN model

In [22]:
def train_model(train_data, y_train):
    # Embedding -> bidirectional LSTM -> max over time -> dense classifier
    model = Sequential([
        Embedding(7000, 200),
        Bidirectional(LSTM(100, return_sequences=True)),
        GlobalMaxPool1D(),
        Dense(32, activation='relu'),
        Dropout(0.1),
        Dense(1, activation='sigmoid'),
    ])

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

    # Fit the vocabulary on the training texts, then pad/truncate each review to 200 tokens
    tokenizer.fit_on_texts(train_data)
    X_train = pad_sequences(tokenizer.texts_to_sequences(train_data), maxlen=200)

    model.fit(X_train, y_train, batch_size=250, epochs=3, validation_split=0.25)

    return model

In [23]:
# The function has its own name, so re-running this cell does not try to call the trained model
model = train_model(train['review_lem'], train['label'])

Train on 18750 samples, validate on 6250 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [25]:
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(test['review_lem']), maxlen=200)
# Round the sigmoid outputs at 0.5 to get class labels
y_pred = [int(x[0] + 0.5) for x in model.predict(X_test_seq)]

In [26]:
print('Accuracy score: ', accuracy_score(test['label'], y_pred))

Accuracy score:  0.82848
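
The model above pools over the outputs of a bidirectional LSTM rather than over convolutional feature maps. Since the assignment asks for a CNN, a convolutional variant could look roughly like the sketch below. The function name `conv_text_model`, the 128 filters and the kernel size of 5 are assumptions chosen for illustration, not values from this notebook, and no accuracy is claimed for it.

```python
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Embedding, GlobalMaxPool1D
from tensorflow.keras.models import Sequential


def conv_text_model(vocab_size=7000, embed_dim=200, filters=128, kernel_size=5):
    """Embedding -> 1D convolution -> max over time -> small dense head."""
    model = Sequential([
        Embedding(vocab_size, embed_dim),
        # Each filter responds to a window of `kernel_size` consecutive tokens
        Conv1D(filters, kernel_size, activation='relu'),
        # Keep the strongest response of every filter across the whole review
        GlobalMaxPool1D(),
        Dense(32, activation='relu'),
        Dropout(0.1),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```

It takes the same padded integer sequences as the LSTM model, so it could be trained with the same `fit` call.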


### Binary classification of the unsup texts

In [14]:
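# Note: fit_transform is called separately on the test and train texts, so each call
# learns its own vocabulary and IDF weights; column j need not mean the same word in both.
X_test = vectorizer.fit_transform(test_txt).toarray()
X_test = tfidfconverter.fit_transform(X_test).toarray()

In [15]:
X_train = vectorizer.fit_transform(train_txt).toarray()
X_train = tfidfconverter.fit_transform(X_train).toarray()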
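
The two cells above call `fit_transform` on the test texts and then again on the train texts, so the feature columns of `X_test` and `X_train` come from different vocabularies. A common alternative is to fit on the training texts only and reuse the fitted object for everything else. The sketch below shows that pattern on toy documents; it uses `TfidfVectorizer`, which combines counting and TF-IDF weighting in one step, and the documents themselves are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for this sketch
train_docs = ["great movie great cast", "boring plot bad acting"]
test_docs = ["bad movie", "great acting"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)       # reuses them; nothing is refitted

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)
```

With this pattern both matrices have one column per training-vocabulary word, in the same order, which is what a classifier trained on `X_tr` expects at prediction time.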

In [16]:
y_test = list(test['label'])
y_train = list(train['label'])

In [31]:
rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)

# 'auto' is omitted: for classifiers it was an alias of 'sqrt', and newer scikit-learn releases reject it
param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)

In [32]:
CV_rfc.fit(X_train, y_train)
print(CV_rfc.best_params_)

{'max_features': 'log2', 'n_estimators': 700}
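
Only the best parameters are printed above, but the assignment also asks for data on every trial. `GridSearchCV` keeps per-combination scores in its `cv_results_` attribute, which can be turned into a table. The sketch below demonstrates this on synthetic data with a deliberately small grid so that it runs quickly; the data, grid values and variable names are assumptions for the example, not the notebook's settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 20], 'max_features': ['sqrt', 'log2']},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per hyperparameter combination, best-ranked first
columns = ['param_n_estimators', 'param_max_features',
           'mean_test_score', 'std_test_score', 'rank_test_score']
trials = pd.DataFrame(search.cv_results_)[columns].sort_values('rank_test_score')
print(trials)
```

Applied to the fitted `CV_rfc`, the same `pd.DataFrame(CV_rfc.cv_results_)` call would list all grid combinations tried above.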


In [33]:
classifier = RandomForestClassifier(n_jobs=-1, max_features= 'log2', n_estimators=700, random_state=23, oob_score = True)
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=700, n_jobs=-1,
            oob_score=True, random_state=23, verbose=0, warm_start=False)

In [34]:
y_pred = classifier.predict(X_test)

In [35]:
accuracy_score(y_test, y_pred)

0.61712

Apply the resulting model to the unsup texts

In [17]:
unsup = data.loc[data["label"] == "unsup"]
unsup_txt = list(unsup['review'])

In [37]:
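# Note: the vectorizer is refitted on the raw (non-lemmatized) unsup reviews here, so these
# features do not share the vocabulary the classifier was trained on.
unsup_vec = vectorizer.fit_transform(unsup_txt).toarray()
unsup_tfidf = tfidfconverter.fit_transform(unsup_vec).toarray()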

In [38]:
unsup_labels = classifier.predict(unsup_tfidf)

In [39]:
unsup_labels[:5]

array([0, 1, 0, 1, 1])
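
Every unlabeled review receives a hard 0/1 label above, regardless of how sure the classifier is. One common refinement of pseudo-labeling is to keep only predictions whose class probability exceeds a threshold. The sketch below illustrates that idea on synthetic data; the 0.8 cut-off, the data and all variable names are assumptions for the example, and this is not what the notebook itself does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: the first 200 rows play the labeled set, the rest the unlabeled set
X_all, y_all = make_classification(n_samples=400, n_features=20, random_state=0)
X_lab, y_lab, X_unlab = X_all[:200], y_all[:200], X_all[200:]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_lab, y_lab)

proba = clf.predict_proba(X_unlab)           # shape (n_samples, n_classes)
confidence = proba.max(axis=1)               # probability of the predicted class
pseudo = clf.classes_[proba.argmax(axis=1)]  # predicted class per row

THRESHOLD = 0.8  # assumed cut-off; would need tuning on real data
keep = confidence >= THRESHOLD
X_pseudo, y_pseudo = X_unlab[keep], pseudo[keep]
print(f"kept {keep.sum()} of {len(X_unlab)} unlabeled rows")
```

Rows below the threshold are simply left out of the augmented training set.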

In [131]:
with open('unsup_labels.txt', 'w', encoding = 'utf-8') as w:
    for u in unsup_labels:
        w.write(str(u))
        w.write('\n')

In [18]:
with open('unsup_labels.txt', 'r', encoding = 'utf-8') as r:
    unsup_labels = r.readlines()

In [21]:
unsup_labels = [int(x) for x in unsup_labels]

Combine the resulting data

In [23]:
texts = test_txt + train_txt + unsup_txt
labels = list(test['label']) + list(train['label']) + unsup_labels

In [24]:
print(len(texts))
print(len(labels))

100000
100000


Split into test and train

In [27]:
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.25, random_state=0)
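
Because `texts` concatenates the test, train and unsup reviews before `train_test_split` shuffles them, original test reviews end up in the training portion and pseudo-labeled reviews end up in the evaluation portion. The resulting score is therefore not the test accuracy the assignment asks for. A minimal sketch of an alternative, assuming the pseudo-labeled data should only enlarge the training side, is shown below; the helper name `build_splits` and the toy inputs are hypothetical.

```python
def build_splits(train_texts, train_labels, unsup_texts, unsup_labels, test_texts, test_labels):
    """Augment only the training side; return the test split unchanged."""
    fit_texts = list(train_texts) + list(unsup_texts)
    fit_labels = list(train_labels) + list(unsup_labels)
    return (fit_texts, fit_labels), (list(test_texts), list(test_labels))


# Toy inputs, invented for this sketch
(fit_x, fit_y), (eval_x, eval_y) = build_splits(
    ["good film", "bad film"], [1, 0],   # labeled train reviews
    ["fine film"], [1],                  # pseudo-labeled unsup reviews
    ["awful film"], [0],                 # held-out test reviews
)
print(len(fit_x), len(eval_x))
```

In the notebook this would correspond to passing `train_txt`, `unsup_txt` and `test_txt` with their labels, and evaluating on the second returned pair only.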

In [28]:
tokenizer.fit_on_texts(X_train)
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen = 200)

### CNN model on the full data

In [29]:
def cnn_model():

    model = Sequential([
        Embedding(7000, 256),
        # SpatialDropout1D(0.1),
        Bidirectional(LSTM(100, return_sequences=True)),
        GlobalMaxPool1D(),
        Dense(32, activation='relu'),
        Dropout(0.1),
        Dense(1, activation='sigmoid'),
    ])

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

    return model
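
The `Embedding` layer above is trained from scratch. The assignment allows pretrained embeddings, for example vectors trained on the unsup reviews or taken from an external source. A rough sketch of how such vectors could be arranged into a weight matrix aligned with the tokenizer's word indices follows; the helper `build_embedding_matrix` and the toy `vectors` dictionary are hypothetical and stand in for whatever embedding source is actually used.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, num_words, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    matrix = np.zeros((num_words, dim), dtype="float32")
    for word, idx in word_index.items():
        if idx < num_words and word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy stand-ins: word_index mimics tokenizer.word_index, vectors mimics pretrained embeddings
word_index = {"film": 1, "great": 2, "boring": 3}
vectors = {"film": np.ones(4), "great": np.full(4, 0.5)}

emb = build_embedding_matrix(word_index, vectors, num_words=4, dim=4)
print(emb)
```

One way to use the result would be `Embedding(num_words, dim, embeddings_initializer=Constant(emb))`, with `Constant` imported from `tensorflow.keras.initializers`.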

In [33]:
model = cnn_model()
# Convert the list slice of targets to a NumPy array before passing it to Keras
model.fit(X_train[:5000], np.array(y_train[:5000]), batch_size=250, epochs=1, validation_split=0.25)

Train on 3750 samples, validate on 1250 samples
Epoch 1/1


In [34]:
X_test_seq = pad_sequences(tokenizer.texts_to_sequences(X_test[:5000]), maxlen=200)
y_pred = [int(x[0] + 0.5) for x in model.predict(X_test_seq)]

In [35]:
print('Accuracy score: ', accuracy_score(y_test[:5000], y_pred))

Accuracy score:  0.5252
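
The neural models above are each run with a single configuration, while the assignment asks for results across different hyperparameter values. A small loop such as the sketch below could collect them. `run_trials` is a hypothetical helper, and `fake_evaluate` is a placeholder that returns arbitrary numbers rather than real accuracies; in practice it would build, train and score a model for the given parameters.

```python
import itertools


def run_trials(grid, evaluate):
    """Call evaluate(**params) for every combination in grid and collect the scores."""
    names = sorted(grid)
    rows = []
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        rows.append({**params, "accuracy": evaluate(**params)})
    # Best-scoring configuration first
    return sorted(rows, key=lambda r: r["accuracy"], reverse=True)


def fake_evaluate(embed_dim, maxlen):
    """Placeholder scorer: returns an arbitrary number, not a measured accuracy."""
    return round(0.5 + embed_dim / 1000 + maxlen / 10000, 4)


results = run_trials({"embed_dim": [100, 200], "maxlen": [100, 200]}, fake_evaluate)
for row in results:
    print(row)
```

The returned list of dictionaries can be passed straight to `pd.DataFrame` to produce a table of trials.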
