# Natural Language Processing - Assignment 2
# Sentiment analysis for movie reviews

This notebook was created for you to answer questions 2, 3 and 4 of assignment 2. Please read the steps and the provided code carefully and make sure you understand them.

The (red) comments at the beginning of each function explain what the function should do, which parameters it takes as input and which variables it should return. Write your own code after the (green) comments "### student code here ###".

**Please modify the next cell specifying your group number**

 *This is the Notebook of* ***Group 10*** 




### Prerequisite - Libraries
Make sure you have the needed libraries installed on your computer: scikit-learn, Pandas, NLTK...

### Prerequisite - Load Data

In the first step, we are going to load the data in a Pandas DataFrame. Pandas DataFrames are a useful way of storing data. DataFrames are tables in which data can be accessed as columns, as rows or as individual cells. You can find more info on DataFrames here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Read the code below and make sure you understand what is happening. Run the code to load your data.

In [1]:
import os
import re
import pandas as pd
import numpy as np
import glob
### student code here: import the needed modules from scikit-learn ###

In [2]:
def get_path(filename):
    """
    Makes a list of all the paths that fit the search requirement
    
    :param filename: A glob pattern (shell-style wildcards, not a regular expression) that the filenames must match
    :return  Returns a list of all the pathnames
    """
    # place the movies folder in the same directory as this notebook
    current_directory = os.getcwd()
    # if you are using Google Colab, you will have to change the above line
    # to load the dataset from your Google Drive

    # glob.glob() is a pattern-matching path finder: it searches for the reviews in the movies folder using shell-style wildcards (not a regular expression)
    paths = glob.glob(current_directory + '/movies/' + filename)
    
    if len(paths) == 0:
        print('Your file list is empty. The code looks for the folder '+current_directory+'/movies, but could not find it.')
    else: 
        print("Found ", len(paths), "files")
    return paths
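
Note that `glob.glob()` matches shell-style wildcards rather than regular expressions, so `[0-9]*` means "one digit followed by any characters", not "zero or more digits". The sketch below illustrates this with `fnmatch.fnmatchcase` from the standard library, which applies essentially the same wildcard rules to plain strings. The file names are invented for illustration and are not taken from the dataset.

```python
from fnmatch import fnmatchcase

# Same wildcard pattern as the file-name part passed to get_path()
pattern = "[NP]-train[0-9]*.txt"

# Hypothetical file names, invented for this illustration
candidates = [
    "N-train001.txt",
    "P-train12.txt",
    "X-train001.txt",      # wrong prefix letter
    "N-test001.txt",       # test file, not a train file
    "P-train1-notes.txt",  # '*' also matches non-digit characters
]

matches = [name for name in candidates if fnmatchcase(name, pattern)]
print(matches)
```

The last candidate matches as well, which shows that the pattern is looser than a regular expression such as `[NP]-train[0-9]*\.txt` would be.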

In [3]:
def load_data(pathset):
    """
    Loads the data into a dataframe
    
    :param pathset:  A list of paths
    :return  A dataframe with three columns: Path, Review (Text) and Label
    """
    # Files are named by sentiment (P for positive, N for negative)
    pattern = re.compile(r'P-(train|test)[0-9]*\.txt')
    reviews = []
    labels = []
    df = pd.DataFrame(columns = ['Path', 'Review', 'Label'])
    for path in pathset:
        # Read each review and close the file handle right away
        with open(path, "r") as review_file:
            reviews.append(review_file.read())
        # Paths matching the positive pattern are 'Pos', all others are 'Neg'
        labels.append('Pos' if pattern.search(path) else 'Neg')
    df['Path'] = pathset
    df['Review'] = reviews
    df['Label'] = labels
    return df

In [4]:
#Load the files in the Dataframe. This will take a while...
paths = get_path('train/[NP]-train[0-9]*.txt')
data = load_data(paths)
data.head()

Found  1200 files


| | Path | Review | Label |
|---|---|---|---|
| 0 | C:\Users\lenovo/movies/train\N-train001.txt | Once again Mr. Costner has dragged out a movie... | Neg |
| 1 | C:\Users\lenovo/movies/train\N-train002.txt | This is a pale imitation of 'Officer and a Gen... | Neg |
| 2 | C:\Users\lenovo/movies/train\N-train003.txt | It seems ever since 1982, about every two or t... | Neg |
| 3 | C:\Users\lenovo/movies/train\N-train004.txt | Wow, another Kevin Costner hero movie. Postman... | Neg |
| 4 | C:\Users\lenovo/movies/train\N-train005.txt | Alas, another Costner movie that was an hour t... | Neg |


### Part 2 - Tokenization

In this step, you should write a tokenizer and compare it with an off-the-shelf one.

#### Question 2.1 Making your own tokenizer

In [5]:
def my_tokenizer(text):
    """
    The implementation of your own tokenizer
    
    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """    
    
    
    tokenized_text = re.findall(r'\w+|[^\w\s]', text)
    # re.findall returns every non-overlapping match of the pattern, in order.
    # \w+ matches a run of word characters (letters, digits, underscore), | means OR,
    # and [^\w\s] matches a single character that is neither a word character nor whitespace (e.g. punctuation).
    
    return tokenized_text

sample_string0 = "If you have the chance, watch it. Although, a warning, you'll cry your eyes out."
sample_string1 = "I wish life would be a bit easy" 
sample_string2 = "I wish to go to Japan every once in a year. Wishes do come true. Right?" 
sample_string3 = "Hello, world! How are you?"
print(my_tokenizer(sample_string0))
print(my_tokenizer(sample_string1))
print(my_tokenizer(sample_string2))
print(my_tokenizer(sample_string3))

['I', 'wish', 'life', 'would', 'be', 'a', 'bit', 'easy']
['I', 'wish', 'to', 'go', 'to', 'Japan', 'every', 'once', 'in', 'a', 'year', '.', 'Wishes', 'do', 'come', 'true', '.', 'Right', '?']
['Hello', ',', 'world', '!', 'How', 'are', 'you', '?']


#### Question 2.2 Using an off-the-shelf tokenizer

In [6]:
# Now we compare the tokenizer you just wrote with the one from NLTK.
# The nltk.download('punkt') call below fetches the 'punkt' tokenizer data; it is only needed the first time.
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

def nltk_tokenizer(text):
    """
    This function should apply the word_tokenize (punkt) tokenizer of nltk to the input text
    
    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """     
  
    
    tokenized_text= word_tokenize(text)
    
    return tokenized_text

test_sentences = ["I like this assignment because:\n-\tit is fun;\n-\tit helps me practice my Python skills.",
        "I won a prize, but I won't be able to attend the ceremony.",
        "“The strange case of Dr. Jekyll and Mr. Hyde” is a famous book... but I haven't read it.",
        "I work for the C.I.A.. And you?",
        "OMG #Twitter is sooooo coooool <3 :-) <-- lol...why do i write like this idk right? :) 🤷😂 🤖"]

for test_string in test_sentences:
    print("my_tokenizer =", my_tokenizer(test_string))
    print("nltk_tokenizer =", nltk_tokenizer(test_string))
    
    #print(my_tokenizer(test_string))
    
    #print(nltk_tokenizer(test_string))
    print("\n")
    

my_tokenizer = ['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']
nltk_tokenizer = ['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']


my_tokenizer = ['I', 'won', 'a', 'prize', ',', 'but', 'I', 'won', "'", 't', 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']
nltk_tokenizer = ['I', 'won', 'a', 'prize', ',', 'but', 'I', 'wo', "n't", 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']


my_tokenizer = ['“', 'The', 'strange', 'case', 'of', 'Dr', '.', 'Jekyll', 'and', 'Mr', '.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '.', '.', '.', 'but', 'I', 'haven', "'", 't', 'read', 'it', '.']
nltk_tokenizer = ['“', 'The', 'strange', 'case', 'of', 'Dr.', 'Jekyll', 'and', 'Mr.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '...', 'but', 'I', 'have', "n't", 'read', 'it', '.']


my_tokenizer = ['I', 'wo

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
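
The outputs above show that the regex in `my_tokenizer` splits contractions such as "won't" into three tokens and an ellipsis into three separate dots. One possible refinement is sketched below. The pattern and the function name `contraction_aware_tokenizer` are hypothetical and are not part of the assignment; the sketch keeps word-internal apostrophes and runs of dots together, but it still does not reproduce NLTK's behaviour (for example, it does not split "won't" into "wo" and "n't").

```python
import re

# Hypothetical variant of my_tokenizer:
#   \w+(?:'\w+)*  a word, optionally followed by apostrophe + word parts ("won't")
#   \.{2,}        a run of two or more dots ("...")
#   [^\w\s]       any other single non-word, non-space character
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|\.{2,}|[^\w\s]")

def contraction_aware_tokenizer(text):
    """Return a list of tokens, keeping contractions and ellipses whole."""
    return TOKEN_PATTERN.findall(text)

tokens = contraction_aware_tokenizer("I won't go... will you?")
print(tokens)
```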


### Part 3 - Text classification with a unigram language model

#### Training phase
You now need to create the model and train it on the documents in the dataframe. Look at the scikit-learn documentation to learn how to use the `CountVectorizer` and `MultinomialNB` classes.
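
To make the idea of a count-based document representation concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. It is a simplified stand-in, not scikit-learn's implementation: the documents are invented, and unlike `CountVectorizer`'s default token pattern, which ignores one-character tokens, this sketch keeps every whitespace-separated token.

```python
from collections import Counter

# Toy documents, invented for this illustration
documents = ["a great great movie", "a dull movie"]

# Vocabulary: every distinct token, in sorted order (one column per token)
vocabulary = sorted({token for doc in documents for token in doc.split()})

def count_vector(document):
    """Return the count of each vocabulary token in the document."""
    counts = Counter(document.split())
    return [counts[token] for token in vocabulary]

# One row per document, one column per vocabulary token
count_matrix = [count_vector(doc) for doc in documents]
print(vocabulary)
print(count_matrix)
```

Each row is the kind of count vector that a multinomial Naive Bayes classifier is trained on.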

In [7]:
### Student answer here ###
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Build the unigram vocabulary: lowercase the text and drop English stop words
vectorizer = CountVectorizer(stop_words='english', lowercase=True)
X_train = data['Review']  # Make sure X_train contains the text data
# Learn the vocabulary and turn each review into a vector of word counts
X_train_counts = vectorizer.fit_transform(X_train)

classifier = MultinomialNB()
classifier.fit(X_train_counts, data['Label'])


#### Testing phase
Now that you have a trained model, you need to test its performance.

1. Load your test data.
2. Classify your test data using the classifier you trained before.
3. Compute the accuracy of your classifier on the test data.
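
Accuracy is simply the fraction of test documents whose predicted label equals the gold label. The sketch below computes it by hand on made-up labels; the helper name `accuracy` is hypothetical and only illustrates the quantity being measured.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have equal length")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

# Made-up labels for illustration: three of four predictions are correct
gold = ["Pos", "Neg", "Pos", "Neg"]
predicted = ["Pos", "Neg", "Neg", "Neg"]
score = accuracy(gold, predicted)
print(score)
```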

In [8]:
# First, read all the test data from the files.  
# Then classify it using the classifier you trained before
# Finally, calculate the performance
### Student code here ###

test_paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_data(test_paths)
X_test_counts = vectorizer.transform(test_data['Review'])

predictions = classifier.predict(X_test_counts)
accuracy = accuracy_score(test_data['Label'], predictions)
print("Accuracy:", accuracy)


Found  100 files
Accuracy: 0.85


Now train two more models: one without Laplace smoothing, and one where stopwords are removed. Then test them on the same test data, and compare the performance with the results you previously obtained.
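
To see why smoothing matters, consider a word that never occurs in the training reviews of one class. The sketch below is a minimal illustration with invented counts, not scikit-learn's code: it computes the add-alpha estimate (count + alpha) / (total + alpha * |V|) for such a word. The function name `word_log_prob` and the toy vocabulary are assumptions made for this example.

```python
import math
from collections import Counter

def word_log_prob(word, class_counts, vocab_size, alpha):
    """Log of the add-alpha estimate of P(word | class)."""
    total = sum(class_counts.values())
    numerator = class_counts[word] + alpha
    denominator = total + alpha * vocab_size
    if numerator == 0:
        # Without smoothing, an unseen word has probability zero
        return float("-inf")
    return math.log(numerator / denominator)

# Invented token counts for the positive class, and a toy vocabulary
pos_counts = Counter("great great fun moving".split())
vocabulary = {"great", "fun", "moving", "boring", "dull"}

smoothed = word_log_prob("boring", pos_counts, len(vocabulary), alpha=1.0)
unsmoothed = word_log_prob("boring", pos_counts, len(vocabulary), alpha=0.0)
print(smoothed, unsmoothed)
```

With alpha = 1 the unseen word gets probability 1/9; with alpha = 0 its log-probability is minus infinity, so a single unseen word rules out the class regardless of the rest of the review.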

In [9]:
classifier_with_smoothing = MultinomialNB(alpha=1) #k=1
classifier_with_smoothing.fit(X_train_counts, data['Label'])

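# Note: depending on the scikit-learn version, alpha=0 may raise a warning and
# be replaced by a very small positive value unless force_alpha=True is set.
classifier_no_smoothing = MultinomialNB(alpha=0) #k=0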
classifier_no_smoothing.fit(X_train_counts, data['Label'])


# Test the model with Laplace smoothing
predictions_with_smoothing = classifier_with_smoothing.predict(X_test_counts)
accuracy_with_smoothing = accuracy_score(test_data['Label'], predictions_with_smoothing)
print("Accuracy with Laplace Smoothing (k = 1):", accuracy_with_smoothing)

# Test the model without Laplace smoothing
predictions_no_smoothing = classifier_no_smoothing.predict(X_test_counts)
accuracy_no_smoothing = accuracy_score(test_data['Label'], predictions_no_smoothing)
print("Accuracy without Laplace Smoothing (k = 0):", accuracy_no_smoothing)



Accuracy with Laplace Smoothing (k = 1): 0.85
Accuracy without Laplace Smoothing (k = 0): 0.8




In [10]:
# Now we compare the performance with and without stop word removal.
# First, train a model with stop words removed
vectorizer_no_stopwords = CountVectorizer(stop_words='english', lowercase=True)
X_train_no_stopwords = data['Review']
X_train_counts_no_stopwords = vectorizer_no_stopwords.fit_transform(X_train_no_stopwords)

classifier_no_stopwords = MultinomialNB()
classifier_no_stopwords.fit(X_train_counts_no_stopwords, data['Label'])

# Test the model with stop words removed
X_test_counts_no_stopwords = vectorizer_no_stopwords.transform(test_data['Review'])
predictions_no_stopwords = classifier_no_stopwords.predict(X_test_counts_no_stopwords)
accuracy_no_stopwords = accuracy_score(test_data['Label'], predictions_no_stopwords)
print("Accuracy with Stop Words Removed:", accuracy_no_stopwords)

# Test the baseline model from the training cell.
# Caveat: that model's vectorizer was also built with stop_words='english',
# so both models are configured identically and equal accuracies are expected.
# A baseline that keeps stop words would need CountVectorizer(stop_words=None).
predictions = classifier.predict(X_test_counts)
accuracy = accuracy_score(test_data['Label'], predictions)
print("Accuracy without Stop Words Removed:", accuracy)


Accuracy with Stop Words Removed: 0.85
Accuracy without Stop Words Removed: 0.85


In [11]:
# Create a CountVectorizer without lowercase normalization and stop word removal
vectorizer_no_preprocessing = CountVectorizer(lowercase=False, stop_words=None)
X_train_no_preprocessing = data['Review']
X_train_counts_no_preprocessing = vectorizer_no_preprocessing.fit_transform(X_train_no_preprocessing)

# Train a classifier with the unprocessed text
classifier_no_preprocessing = MultinomialNB()
classifier_no_preprocessing.fit(X_train_counts_no_preprocessing, data['Label'])

# Now let's test both models
X_test_counts = vectorizer.transform(test_data['Review'])
predictions = classifier.predict(X_test_counts)
accuracy = accuracy_score(test_data['Label'], predictions)
print("Accuracy with Preprocessing:", accuracy)

X_test_counts_no_preprocessing = vectorizer_no_preprocessing.transform(test_data['Review'])
predictions_no_preprocessing = classifier_no_preprocessing.predict(X_test_counts_no_preprocessing)
accuracy_no_preprocessing = accuracy_score(test_data['Label'], predictions_no_preprocessing)
print("Accuracy without Preprocessing:", accuracy_no_preprocessing)

Accuracy with Preprocessing: 0.85
Accuracy without Preprocessing: 0.81


### Part 4 - Text classification with a bigram language model

Now we will classify the same dataset again, but this time with a bigram language model. 

#### Training phase
Build a Naïve Bayes classifier that uses bigrams instead of single words.
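
A bigram model uses pairs of adjacent tokens as features instead of single words. In scikit-learn this is controlled by the `ngram_range` parameter of `CountVectorizer` (for example, `ngram_range=(2, 2)` keeps only bigrams). The pure-Python sketch below shows which features that produces for one sentence; the helper name `ngrams` and the example sentence are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Invented example sentence
tokens = "not a good movie".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

Bigram features such as "not a" and "a good" retain some local word-order information that a unigram model discards.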


In [12]:
### Student code here ###



#### Testing phase
As before, calculate the performance on your test data, and notice the difference with the previous (unigram) model.

In [13]:
### Student code here ###


### Trigrams
When I asked students how to improve the classification performance on this dataset, the first answer was always "use trigrams" (or even higher-order n-grams). Let's see how much of an improvement that would be, by training a trigram model and testing it.
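
One thing to keep in mind when moving to higher-order n-grams is data sparsity: the longer the n-gram, the more likely it is that an n-gram in the test data was never seen during training. The sketch below measures this on two invented sentences. It is a toy illustration under that assumption, not a result on the movie review dataset, and the helper names are hypothetical.

```python
def ngram_set(text, n):
    """Return the set of distinct n-grams (as tuples) in a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def unseen_fraction(train_text, test_text, n):
    """Fraction of distinct test n-grams that do not occur in the training text."""
    train_ngrams = ngram_set(train_text, n)
    test_ngrams = ngram_set(test_text, n)
    return len(test_ngrams - train_ngrams) / len(test_ngrams)

# Invented sentences standing in for training and test data
train_text = "the movie was good and the acting was good"
test_text = "the acting was bad and the movie was bad"

fractions = {n: unseen_fraction(train_text, test_text, n) for n in (1, 2, 3)}
for n, fraction in fractions.items():
    print(n, round(fraction, 2))
```

In this toy case the share of unseen test n-grams grows from 1/6 for unigrams to 2/7 for bigrams and 5/7 for trigrams.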

In [14]:
### Student code here ###
