***NOTE: This notebook has been adapted from https://stackabuse.com/text-classification-with-python-and-scikit-learn/ for a tutorial in NLP from the UCL ICH coding club***

# Classification of reviews

The dataset used in this tutorial can be downloaded from: http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

The dataset consists of 2000 documents: half are positive movie reviews and the other half are negative movie reviews.

Install dependencies if needed: 

In [None]:
%%capture
## Run the following to install packages and restart Kernel:
!pip install numpy pandas nltk scikit-learn


In [3]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.datasets import load_files
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/ferran/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
movie_data = load_files(r"txt_sentoken")
X, y = movie_data.data, movie_data.target
unique_elements, counts_elements = np.unique(y, return_counts=True)
print("Unique Labels:", unique_elements)
print("Proportions:", counts_elements*100/len(y), "%")
print("Number of reviews:", len(y))
print(y[0:2])
print(X[0:2])

Unique Labels: [0 1]
Proportions: [50. 50.] %
Number of reviews: 2000
[0 1]
[b"arnold schwarzenegger has been an icon for action enthusiasts , since the late 80's , but lately his films have been very sloppy and the one-liners are getting worse . \nit's hard seeing arnold as mr . freeze in batman and robin , especially when he says tons of ice jokes , but hey he got 15 million , what's it matter to him ? \nonce again arnold has signed to do another expensive blockbuster , that can't compare with the likes of the terminator series , true lies and even eraser . \nin this so called dark thriller , the devil ( gabriel byrne ) has come upon earth , to impregnate a woman ( robin tunney ) which happens every 1000 years , and basically destroy the world , but apparently god has chosen one man , and that one man is jericho cane ( arnold himself ) . \nwith the help of a trusty sidekick ( kevin pollack ) , they will stop at nothing to let the devil take over the world ! \nparts of this are actual
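`load_files` treats each subfolder of the given directory as one class and assigns integer labels following the alphabetical order of the folder names. It also returns the raw file contents as bytes unless an encoding is passed, which is why the reviews above are printed as `b"..."`. The sketch below reproduces this on a throwaway directory; the folder names `neg` and `pos` are assumed to mirror the layout of `txt_sentoken`, and the file contents are made up for illustration.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny directory tree: one subfolder per class (assumed layout)
with tempfile.TemporaryDirectory() as root:
    for label, text in [("neg", "a dull film"), ("pos", "a great film")]:
        os.makedirs(os.path.join(root, label))
        with open(os.path.join(root, label, "review.txt"), "w") as fh:
            fh.write(text)
    # shuffle=False keeps the samples in folder order for easier inspection
    toy = load_files(root, shuffle=False)

print(toy.target_names)   # folder names, sorted alphabetically
print(list(toy.target))   # integer label of each document
print(type(toy.data[0]))  # raw bytes, because no encoding was given
```

Under that assumption, label 0 in this notebook corresponds to negative reviews and label 1 to positive ones; `movie_data.target_names` shows the mapping for the real data.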

# Text Preprocessing
Once the dataset has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces. 




In [5]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/ferran/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
documents = []

for sen in range(0, len(X)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))
    
    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # Remove single characters from the start
    # (an unescaped ^ anchors the match at the beginning of the string)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    
    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document)
    
    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)
    
    # Converting to Lowercase
    document = document.lower()
    
    # Lemmatization
    document = document.split()

    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)
    
    documents.append(document)
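Calling `str()` on a bytes object returns its `b'...'` representation, so an escaped newline such as `\n` loses its backslash during special-character removal and leaves a stray `n` glued to the next word. A minimal sketch of the same cleaning steps wrapped in a reusable function that decodes the bytes first is shown below. The names `clean_text` and `lemmatize` are hypothetical, and UTF-8 is an assumed encoding for the review files.

```python
import re


def clean_text(raw, lemmatize=None):
    """Clean one review; `raw` may be bytes or str (hypothetical helper)."""
    # Decode bytes so escaped newlines become real whitespace (UTF-8 assumed)
    if isinstance(raw, bytes):
        text = raw.decode("utf-8", errors="replace")
    else:
        text = str(raw)
    text = re.sub(r"\W", " ", text)              # special characters -> spaces
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters between spaces
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text).strip().lower()
    words = text.split()
    if lemmatize is not None:
        # Optional callable applied to every word, e.g. a lemmatizer method
        words = [lemmatize(word) for word in words]
    return " ".join(words)


print(clean_text(b"good films are hard to find these days . \ngreat films ?"))
```

In the notebook, the same function could be applied with `lemmatize=lemmatizer.lemmatize` to both the reviews and any new sentences, which avoids maintaining two copies of the loop.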


In [7]:
type(documents)

list

In [8]:
documents_df = pd.DataFrame(data=documents, columns=["text"])
documents_df.head()

   text
0  arnold schwarzenegger ha been an icon for acti...
1  good film are hard to find these day ngreat fi...
2  quaid star a man who ha taken up the proffesio...
3  we could paraphrase michelle pfieffer characte...
4  kolya is one of the richest film ve seen in so...


# Vectorization through Bag-of-Words

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

In [10]:
X[0][0:100]

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

In [11]:
X.shape

(2000, 1500)
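In a bag-of-words representation, each column corresponds to one word of the vocabulary and each row holds the number of times that word occurs in a document. In the call above, `max_features=1500` keeps the 1500 most frequent words, `min_df=5` drops words that appear in fewer than 5 documents, and `max_df=0.7` drops words that appear in more than 70% of the documents. The toy example below, with made-up sentences and default settings, shows the resulting matrix on a scale that can be checked by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
toy_docs = ["good film good cast", "bad film"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)   # ['bad', 'cast', 'film', 'good']
print(toy_counts)  # rows: [0 1 1 2] and [1 0 1 0]
```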

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [13]:
print("Length of training set: ", X_train.shape[0])
print("Length of test set: ", X_test.shape[0])

Length of training set:  1600
Length of test set:  400
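`train_test_split` shuffles the data before splitting, but by default it does not preserve the class proportions, so the test set of a balanced dataset can end up slightly unbalanced. If an exact 50/50 split is wanted, the `stratify` parameter can be used. The sketch below illustrates this on made-up toy arrays rather than on the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, perfectly balanced between two classes
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0] * 10 + [1] * 10)

# stratify=toy_y keeps the class proportions equal in both subsets
Xtr, Xte, ytr, yte = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

print(np.bincount(ytr))  # [8 8]
print(np.bincount(yte))  # [2 2]
```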


# Training Text Classification Model and Predicting Sentiment

Load the machine learning model: a random forest classifier.

In [14]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)

![](randomforest.jpg)

Fit the model using the training data: learn the relationship between X (the feature matrix) and y (positive/negative reviews).

In [15]:
classifier.fit(X_train, y_train) 

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

### Make predictions on the test set

In [16]:
y_pred = classifier.predict(X_test)

# Evaluating the Model

In [17]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

[[167  41]
 [ 19 173]]
              precision    recall  f1-score   support

           0       0.90      0.80      0.85       208
           1       0.81      0.90      0.85       192

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

0.85
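The numbers in the classification report can be derived directly from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels. As a worked example, the sketch below recomputes precision, recall, F1 and accuracy from the four counts printed above; the variable names are chosen for illustration.

```python
# Confusion matrix from the run above: rows = true labels, columns = predictions
cm = [[167, 41],
      [19, 173]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

metrics = {}
for label in (0, 1):
    true_positives = cm[label][label]
    predicted_as_label = cm[0][label] + cm[1][label]  # column sum
    actually_label = sum(cm[label])                   # row sum (the "support")
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))

print(round(accuracy, 2))  # 0.85
```

For class 0, for instance, precision is 167 / (167 + 19) and recall is 167 / (167 + 41), which round to the 0.90 and 0.80 shown in the report.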


# Try with some sentences

In [18]:
mysentences = ["This is an awful movie","This is an incredible movie. The best one I have ever seen. I would definetly recommend it to everyone"]

In [19]:
documents2 = []

for sen in range(0, len(mysentences)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(mysentences[sen]))
    
    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # Remove single characters from the start
    # (an unescaped ^ anchors the match at the beginning of the string)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    
    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document)
    
    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)
    
    # Converting to Lowercase
    document = document.lower()
    
    # Lemmatization
    document = document.split()

    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)
    
    documents2.append(document)

In [20]:
documents2

['this is an awful movie',
 'this is an incredible movie the best one have ever seen would definetly recommend it to everyone']

In [21]:
X2 = vectorizer.transform(documents2).toarray()

In [22]:
classifier.predict(X2)

array([0, 0])

In [23]:
classifier.predict_proba(X2)

array([[0.757, 0.243],
       [0.626, 0.374]])
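For a random forest in scikit-learn, `predict_proba` returns the mean of the per-tree probability estimates, and `predict` returns the class with the highest mean probability. The sketch below reproduces the predicted labels from the probabilities printed above; the `classes` array is assumed to match `classifier.classes_`.

```python
import numpy as np

# Probabilities copied from the predict_proba output above
proba = np.array([[0.757, 0.243],
                  [0.626, 0.374]])
classes = np.array([0, 1])  # assumed to match classifier.classes_

# The predicted label is the class whose column holds the largest probability
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)  # [0 0]
```

The second sentence is assigned to class 0 with a probability of about 0.63, so the model is less confident about it than about the first sentence.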

### The second prediction is wrong... Why?
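One possible explanation, offered as a hypothesis rather than a verified diagnosis of this model: the vectorizer only counts words from the vocabulary it learned during fitting. Stop words and words outside that vocabulary contribute nothing, so a short sentence can turn into an almost empty feature vector that carries little signal. The toy sketch below, with a made-up corpus, shows a sentence whose words are all ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training corpus, for illustration only
toy_corpus = [
    "the plot is dull and the acting is bad",
    "a great cast and a great story",
    "bad script and dull dialogue",
    "great film with a moving story",
]
toy_vectorizer = CountVectorizer(stop_words="english")
toy_vectorizer.fit(toy_corpus)

new_sentence = "this is an incredible movie"
vector = toy_vectorizer.transform([new_sentence]).toarray()[0]

# Words of the new sentence that the fitted vocabulary actually knows
known_words = [w for w in new_sentence.split() if w in toy_vectorizer.vocabulary_]

print(known_words)   # []
print(vector.sum())  # 0
```

In the notebook, the same check can be run against `vectorizer.vocabulary_` and `X2` to see how many words of each test sentence actually reach the classifier.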
