# Day 1 - Exercise 3 - Sentiment Analysis

## Necessary imports

Before we can work with the data, we have to import the modules we need:

In [68]:
# modules
import pandas as pd
import numpy as np
import re
import nltk
from bs4 import BeautifulSoup
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression

First of all, you need to download the data set "imdb.csv" from the GitHub repository https://github.com/assenmacher-mat/nlp_notebooks.

__If you are running this notebook on colab (https://colab.research.google.com/), you also need to run the next chunk in order to upload the data to colab.  
Choose the file in the upload window and it will be available on colab from then on.__  
(If you are running this notebook locally on your machine, you can skip the execution of this chunk.)

In [None]:
from google.colab import files
uploaded = files.upload()

### Import the data set

__If you are running this notebook locally on your machine, you might need to adjust the path (depending on where you've saved the data).__  
(If you are running this notebook on colab, you can leave the path unchanged.)

In [24]:
imdb = pd.read_csv("imdb.csv", encoding = "utf-8")
imdb = imdb.sample(frac = 1).reset_index(drop = True)

### Extract a list of texts and a list of labels from the pandas data frame

In [30]:
reviews_raw = imdb.review.tolist()
sentiments = imdb.sentiment.tolist()

### Use ``BeautifulSoup`` for cleaning the html markup

In [31]:
reviews_clean = [BeautifulSoup(rev, "html.parser").text for rev in reviews_raw]
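
One detail of the cleaning step is worth checking: ``.text`` drops the tags without inserting anything in their place, so words on either side of a ``<br />`` can end up glued together. The following is a minimal sketch on a made-up snippet (the string ``raw`` is a hypothetical example, not taken from the data set); ``get_text(separator = " ")`` is one possible alternative if this matters for your tokens.

```python
from bs4 import BeautifulSoup

# Hypothetical review snippet with the kind of <br /> markup found in scraped text
raw = "A great film.<br /><br />I loved it."

soup = BeautifulSoup(raw, "html.parser")
glued = soup.text                        # tags are dropped, nothing is inserted
spaced = soup.get_text(separator=" ")    # text fragments are joined with a space

print(repr(glued))
print(repr(spaced))
```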

### Transform the labels into the right format  
(Hint: Use the ``LabelBinarizer`` class, which works in a similar way to the ``CountVectorizer`` you already know)

In [93]:
# avoid the name "bin", which would shadow the Python built-in of the same name
binarizer = LabelBinarizer()
labels = binarizer.fit_transform(sentiments)
print(labels.shape)

(50000, 1)
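
To see what the binarizer does with two classes, here is a small sketch on hypothetical toy labels (the strings "positive" and "negative" are an assumption about how the sentiments are coded). The classes are sorted alphabetically, the first one is mapped to 0 and the second to 1, and the result is a column vector, which explains the shape ``(50000, 1)`` above.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy labels, assumed to resemble the sentiment column
toy_sentiments = ["positive", "negative", "negative", "positive"]

toy_binarizer = LabelBinarizer()
toy_labels = toy_binarizer.fit_transform(toy_sentiments)

print(toy_binarizer.classes_)   # classes in sorted order
print(toy_labels.shape)         # column vector: one row per label
print(toy_labels.ravel())       # flattened 0/1 codes
```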


### Perform a split into a training set and a test set with proportion 80 to 20  
(Hint: Use the ``train_test_split``-function from ``sklearn``)

In [33]:
# split texts and labels in a single call so that they are guaranteed to stay aligned
xtrain, xtest, ytrain, ytest = train_test_split(reviews_clean, labels, shuffle = False, train_size = .8)
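
``train_test_split`` accepts several sequences at once and applies the same split to all of them. The sketch below uses hypothetical toy data to illustrate why that matters: texts and labels stay aligned even when shuffling is switched on, which separate calls can only guarantee if they use identical settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: text "review i" belongs to label i
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = np.arange(10).reshape(-1, 1)

# One call splits both sequences with the same (shuffled) indices
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, train_size=0.8, shuffle=True, random_state=0
)

# Every text still sits next to its own label
aligned = all(text == f"review {label}" for text, label in zip(x_tr, y_tr.ravel()))
print(len(x_tr), len(x_te), aligned)
```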

### How many samples do we have in our train and our test set?

In [40]:
print("Number of training examples:", len(xtrain))
print("Number of training labels:", len(ytrain))
print("Number of test examples:", len(xtest))
print("Number of test labels:", len(ytest))

Number of training examples: 40000
Number of training labels: 40000
Number of test examples: 10000
Number of test labels: 10000


### Transform the corpus to a Bag-of-words

#### (1) Define the ``Vectorizer``  
For now use:

- counts
- unigrams
- a minimum document frequency of 10 (``min_df``)
- a maximum of 10,000 features

In [87]:
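# This cell shows the final setting reached after trying the options further below
# (tf-idf, uni- to trigrams, 50k features); the basic setting described above
# uses a CountVectorizer with unigrams and max_features = 10000 instead.
vec = TfidfVectorizer(ngram_range = (1,3),
                      min_df = 10,
                      binary = False,
                      max_features = 50000,
                      use_idf = True)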
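
For comparison, the basic setting listed under (1) can be sketched with a ``CountVectorizer``. The toy corpus below is hypothetical, and each document is repeated ten times only so that the ``min_df = 10`` threshold is met; on the real data you would fit on ``xtrain`` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Basic setting from the task description: raw counts, unigrams only
baseline_vec = CountVectorizer(ngram_range=(1, 1),
                               min_df=10,
                               binary=False,
                               max_features=10000)

# Hypothetical toy corpus; repeated so that every word occurs in at least 10 documents
toy_corpus = ["good movie", "bad movie"] * 10
toy_dtm = baseline_vec.fit_transform(toy_corpus)

print(sorted(baseline_vec.vocabulary_))   # learned unigram vocabulary
print(toy_dtm.shape)                      # (documents, features)
```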

#### (2) Feed the training set to the ``Vectorizer`` and create a Document-Term matrix

In [88]:
bow_train = vec.fit_transform(xtrain)

#### (3) Transform your test set to a DTM as well

In [89]:
bow_test = vec.transform(xtest)
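
Note the difference between steps (2) and (3): ``fit_transform`` learns the vocabulary from the training texts, while ``transform`` only reuses it. A minimal sketch with hypothetical toy texts shows the consequence: words that occur only in the test set are ignored, and both matrices have the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts; "film" never occurs in the training texts
toy_train = ["good movie", "bad movie"]
toy_test = ["good film"]

toy_vec = CountVectorizer()
toy_train_dtm = toy_vec.fit_transform(toy_train)   # learns the vocabulary
toy_test_dtm = toy_vec.transform(toy_test)         # reuses it, drops unseen words

print(sorted(toy_vec.vocabulary_))
print(toy_train_dtm.shape, toy_test_dtm.shape)
print(toy_test_dtm.toarray())
```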

#### (4) Train a logistic regression model using the ``LogisticRegression``-function from ``sklearn``

In [90]:
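model = LogisticRegression(penalty = "l2", max_iter = 5000, C = 1, random_state = 123, solver = "lbfgs")

# labels have shape (n, 1); ravel() flattens them to the 1d array that fit() expects
# and avoids the DataConversionWarning raised for column vectors
model = model.fit(bow_train, ytrain.ravel())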


#### (5) Predict the sentiments of the test set

In [91]:
ytest_pred = model.predict(bow_test)

#### (6) Check the accuracy of your model using the ``accuracy_score``-function from ``sklearn``

In [92]:
accuracy_score(ytest, ytest_pred)

0.9082

### Not bad, right? But we can do better. Note down the accuracy and try the following options:

1. Try ngrams (uni-, bi-, tri-grams)
2. Increase max_features to 50k
3. Back to the basic setting, but use tf-idf
4. Now use tf-idf + ngrams (uni-, bi-, tri-grams)
5. Set the max_features option up to 50k again
6. Optional: Think of other parameters to tweak in order to increase performance

- Simple Bag of words:                 acc = 0.8713
- Using up to trigrams:                acc = 0.8798
- max_features = 50k:                  acc = 0.8839
- tf-idf:                              acc = 0.8939
- tf-idf + ngrams:                     acc = 0.9018
- tf-idf + ngrams + max_features:      acc = 0.9082
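
Re-running the cells by hand for every option gets tedious. One possible way to organise the comparison is a small helper like the one sketched below; the function name ``evaluate`` and the toy data are hypothetical, and on the real data you would pass ``xtrain``, ``ytrain.ravel()``, ``xtest`` and ``ytest.ravel()`` together with the ``min_df`` and ``max_features`` values from the task.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate(vectorizer, x_train, y_train, x_test, y_test):
    """Fit vectorizer and logistic regression on the training data, return test accuracy."""
    dtm_train = vectorizer.fit_transform(x_train)
    dtm_test = vectorizer.transform(x_test)
    clf = LogisticRegression(max_iter=5000, random_state=123)
    clf.fit(dtm_train, y_train)
    return accuracy_score(y_test, clf.predict(dtm_test))

# Hypothetical toy data, only meant to show the call pattern
toy_x_train = ["good film", "great film", "bad film", "awful film"] * 5
toy_y_train = [1, 1, 0, 0] * 5
toy_x_test = ["good great film", "bad awful film"]
toy_y_test = [1, 0]

# A few of the options from the list above, without the frequency and size limits
settings = {
    "counts, unigrams": CountVectorizer(ngram_range=(1, 1)),
    "counts, up to trigrams": CountVectorizer(ngram_range=(1, 3)),
    "tf-idf, up to trigrams": TfidfVectorizer(ngram_range=(1, 3)),
}

results = {name: evaluate(v, toy_x_train, toy_y_train, toy_x_test, toy_y_test)
           for name, v in settings.items()}

for name, acc in results.items():
    print(f"{name}: acc = {acc:.4f}")
```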

#### (7) Inspect the confusion matrix

In [95]:
print(confusion_matrix(ytest, ytest_pred))

[[4479  499]
 [ 419 4603]]
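
The confusion matrix contains everything needed to recompute the accuracy and to derive further metrics. The sketch below uses the numbers printed above and assumes the ``sklearn`` layout (rows are true labels, columns are predicted labels) with class 1 treated as the positive class.

```python
import numpy as np

# Confusion matrix as printed above (rows: true labels, columns: predictions)
cm = np.array([[4479, 499],
               [419, 4603]])

# Assumption: class 1 is the "positive" class
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of correct predictions
precision = tp / (tp + fp)        # how many predicted 1s are truly 1
recall = tp / (tp + fn)           # how many true 1s are found

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

The accuracy computed this way matches the value returned by ``accuracy_score`` in step (6).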


### Congrats, you have finished the first day!

![](https://media.giphy.com/media/VQ77RNKX0nyaA/giphy.gif)

### __*Optional:*__ Try out a random forest and see how it reacts to changes in the Document-term matrix
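
As a possible starting point for the optional task, the sketch below trains a ``RandomForestClassifier`` on a Document-Term matrix. The toy data and the hyperparameters (``n_estimators = 100``) are assumptions for illustration only; on the real data you would fit on ``bow_train`` and ``ytrain.ravel()`` and predict on ``bow_test``.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data, only meant to show the call pattern
toy_x_train = ["good film", "great film", "bad film", "awful film"] * 5
toy_y_train = [1, 1, 0, 0] * 5
toy_x_test = ["good great film", "bad awful film"]

# Build the Document-Term matrices exactly as for the logistic regression
rf_vec = CountVectorizer()
rf_train = rf_vec.fit_transform(toy_x_train)
rf_test = rf_vec.transform(toy_x_test)

# Random forests accept the sparse matrix directly
forest = RandomForestClassifier(n_estimators=100, random_state=123)
forest.fit(rf_train, toy_y_train)
rf_pred = forest.predict(rf_test)

print(rf_pred)
```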