# 15. IMDB Reviews
To check whether our approach also works on another dataset, we trained our models on the Kaggle IMDB review dataset and compared our scores with the best ones on Kaggle. The highest competition accuracy for this dataset was 0.93. The sentiment label is 0 for negative and 1 for positive. Data: https://www.kaggle.com/c/imdb-review/data

## Creating the dataframe

In [11]:
import pandas as pd

df = pd.read_csv('IMDB/train_data.csv', index_col=0)
df

| ID | SentimentText | Sentiment |
|---|---|---|
| 0 | first think another disney movie might good it... | 1 |
| 1 | put aside dr house repeat missed desperate hou... | 0 |
| 2 | big fan stephen king s work film made even gre... | 1 |
| 3 | watched horrid thing tv needless say one movie... | 0 |
| 4 | truly enjoyed film acting terrific plot jeff c... | 1 |
| ... | ... | ... |
| 23995 | garbo s introduction sound clarence brown s an... | 1 |
| 23996 | means movie bad perfect stranger was not funny... | 0 |
| 23997 | happened basically solid plausible premise dec... | 0 |
| 23998 | seen romantic comedies one easiest worst attem... | 0 |


## Vectorizing

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2', encoding='latin-1', ngram_range=(1, 2))
features = tfidf.fit_transform(df.SentimentText)
labels = df.Sentiment

features

<24000x94669 sparse matrix of type '<class 'numpy.float64'>'
	with 3438168 stored elements in Compressed Sparse Row format>
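
To make the weighting behind these features more concrete, here is a minimal pure-Python sketch of what we understand `sublinear_tf=True`, `norm='l2'` and `ngram_range=(1, 2)` to compute, assuming scikit-learn's default smoothed idf. The helper names `ngrams` and `tfidf_rows` and the two-document corpus are hypothetical, and `min_df` is left out because the toy corpus is too small for it.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_rows(docs):
    """Sublinear tf, smoothed idf, L2-normalised weights (one dict per doc)."""
    counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for c in counts:
        # Sublinear tf replaces the raw count tf with 1 + ln(tf).
        weights = {t: (1 + math.log(tf)) * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Hypothetical toy corpus, not taken from the IMDB data.
docs = ["good good movie", "bad movie"]
rows = tfidf_rows(docs)
print(sorted(rows[0]))  # unigrams and bigrams of the first document
```

In this sketch, "movie" occurs in both documents and therefore gets a lower weight than "good", which is specific to the first document.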

## Splitting and Training

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

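# Hold out 33% of the reviews for testing; the index is split alongside
# so that predictions can be traced back to the original rows.
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.33, random_state=0
)

linearSVCModel = LinearSVC()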
linearSVCModel.fit(X_train, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
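
Above, the vectorizer is fitted on all reviews before the split, so the vocabulary and idf weights also see the test reviews. A possible variant, sketched here under the assumption of a tiny hypothetical corpus, splits the raw texts first and wraps both steps in a scikit-learn `Pipeline`, so that tf-idf is fitted on the training part only. This is not what we ran for the scores in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; each review starts with a unique marker token
# (rev0, rev1, ...) so we can see which reviews reached the vocabulary.
texts = [
    "rev0 great film loved acting",
    "rev1 terrible plot waste time",
    "rev2 wonderful story great cast",
    "rev3 boring film bad acting",
    "rev4 loved story wonderful film",
    "rev5 bad plot terrible cast",
    "rev6 great acting great story",
    "rev7 waste time boring plot",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw texts first, keeping both classes in each part.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits tf-idf on the training texts only, then the classifier.
model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, y_train)
predictions = model.predict(test_texts)

# Marker tokens of test reviews should be absent from the vocabulary.
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_
print([text.split()[0] in vocabulary for text in test_texts])
```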

## Predicting

In [18]:
from sklearn import metrics

y_predLinearSVC = linearSVCModel.predict(X_test)

print(metrics.classification_report(y_test, y_predLinearSVC))

              precision    recall  f1-score   support

           0       0.90      0.89      0.90      3958
           1       0.89      0.91      0.90      3962

    accuracy                           0.90      7920
   macro avg       0.90      0.90      0.90      7920
weighted avg       0.90      0.90      0.90      7920
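
As a reminder of how the columns in this report are defined, the following sketch computes precision, recall, F1 and accuracy by hand for one class. The function name `binary_report` and the six toy labels are hypothetical and are not taken from our IMDB predictions.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall, F1 for the `positive` class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels: 2 true positives, 2 false positives, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1]
report = binary_report(y_true, y_pred)
print(report)
```

The "macro avg" row in the report is the unweighted mean of the per-class values, while "weighted avg" weights each class by its support.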



## Conclusion
To test whether our model works properly, we evaluated it on another dataset by feeding the new data to our best-performing model: Linear SVC with tf-idf vectorization. Since the data already seemed to be preprocessed for all participants in the Kaggle competition, we did not apply our own preprocessing (it does not seem to make much of a difference on our own Kaggle dataset anyway). The highest accuracy on Kaggle was 0.93; our model reached 0.90 on our held-out test split. This suggests that we are on the right track.

## Discussion
We wanted to learn from other people's best solutions on Kaggle. However, the top of the leaderboard only reaches 0.93, so there are not many solutions that are much better than ours. The leaders on Kaggle who gain those few extra percentage points seem to use neural networks. The next lecture will cover neural networks for NLP, which is a good opportunity for us to try neural networks again, since our last attempt was not very successful.