<a href="https://colab.research.google.com/github/angelcolab97/principles_of_ai/blob/main/220601_PracticalWeek02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Principles of AI
# TutorialWeek02 — NLP (Sentiment Analysis)


Sentiment analysis is the focus of this section (previously presented in Section 2.6).
Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative, or neutral. It can be applied to text documents, which helps businesses monitor customer feedback and better understand their customers' needs.
 
For this exercise, we'll look at three datasets that contain sentences with positive or negative sentiment labels. These datasets were compiled from product, movie, and restaurant reviews found on three different websites:

>- amazon.com
>- imdb.com
>- yelp.com

**Outline**
1. Introduction to the dataset
2. Importing Libraries and File Integration
3. Split data into Train Features & Train Labels
4. Training and Testing the Model

**Source**  
This exercise is based on the paper 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015 
[[source](https://dl.acm.org/doi/10.1145/2783258.2783380)]

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### 1. Introduction to the dataset

Let's begin by importing two (2) essential libraries:
* **pandas**: Pandas is a popular open-source Python package for data science, data analysis, and machine learning tasks. It provides high-performance, user-friendly data structures and data analysis tools for Python programming.  
The `import pandas` statement instructs Python to import the pandas data analysis library into your current environment. The `as pd` part of the statement then assigns pandas the alias `pd`, so you can use pandas functions by typing `pd.` followed by the function name.
* **numpy**: NumPy (short for Numerical Python) is a fast interface for storing and manipulating dense data buffers. NumPy arrays are similar to Python's built-in list type in some ways, but they provide much more efficient storage and data operations as the arrays grow in size.  
NumPy arrays are at the heart of nearly the entire ecosystem of Python data science tools, and we can use them to create and manipulate vectors and matrices.

In [None]:
import pandas as pd
import numpy as np

Next, upload the dataset files to Google Colab.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving yelp_labelled.txt to yelp_labelled.txt
Saving imdb_labelled.txt to imdb_labelled.txt
Saving amazon_cells_labelled.txt to amazon_cells_labelled.txt


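* Then, we read each file and split its contents into lines, storing the result in the list `lines`.
* To read a file in Python, we must open it in read (`"r"`) mode.
* Note that each `with` block in the cell below assigns to `lines` again, so after the cell runs `lines` holds only the last file read (`yelp_labelled.txt`). This is why the outputs in the rest of this section show 1,000 Yelp sentences. To combine all three files into one list, accumulate the lines with `lines.extend(...)` instead of reassigning.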


In [None]:
with open("imdb_labelled.txt", "r") as text_file:
    lines = text_file.read().split("\n")
with open("amazon_cells_labelled.txt", "r") as text_file:
    lines = text_file.read().split("\n")
with open("yelp_labelled.txt", "r") as text_file:
    lines = text_file.read().split("\n")
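
If the goal is a single list containing the sentences from every file, the lines have to be accumulated rather than reassigned. The following is a minimal sketch of that idea, not the tutorial's own code: the helper `read_all_lines` and the two small stand-in files are hypothetical and exist only so that the example runs anywhere.

```python
import os
import tempfile


def read_all_lines(paths):
    """Read every file in `paths` and return all of their lines in one list."""
    all_lines = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as text_file:
            # extend() adds to the existing list instead of replacing it
            all_lines.extend(text_file.read().split("\n"))
    return all_lines


# Hypothetical stand-in files (not the real datasets) so the sketch is runnable
tmp_dir = tempfile.mkdtemp()
samples = {
    "a_labelled.txt": "Great phone.\t1\nPoor battery.\t0",
    "b_labelled.txt": "Loved the film.\t1",
}
paths = []
for name, text in samples.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

combined = read_all_lines(paths)
print(len(combined))  # 3
```

With the real datasets, the same helper could be called with the three uploaded file names, assuming they are in the working directory.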

Next, we split each line on the tab character and discard any corrupted lines, that is, lines that do not consist of exactly one sentence and one label separated by a tab.

* Tabs are written in Python strings as the escape sequence `"\t"`.
* `line.split("\t")` splits the string at each tab and returns a list of the resulting substrings.
* The cell below is a list comprehension: it loops over `lines` and keeps only the lines that split into exactly two parts and whose second part (the label) is not empty.


In [None]:
newLines = [line.split("\t")
            for line in lines 
            if len(line.split("\t")) == 2 and
            line.split("\t")[1] != ""]
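
To see which lines the filter keeps and which it drops, here is a small sketch using an explicit loop on made-up input. The sample lines and the helper name `keep_valid` are illustrative assumptions; the condition is the same one used in the list comprehension above.

```python
# Made-up sample lines covering the cases the filter has to handle
sample_lines = [
    "Good food.\t1",     # well-formed: sentence, tab, label
    "No label here",     # no tab, so split() returns one part
    "Missing label\t",   # tab present, but the label is empty
    "Too\tmany\ttabs",   # two tabs, so split() returns three parts
    "",                  # empty string left over from split("\n")
]


def keep_valid(lines):
    """Return [sentence, label] pairs for well-formed lines only."""
    pairs = []
    for line in lines:
        parts = line.split("\t")  # split once and reuse the result
        if len(parts) == 2 and parts[1] != "":
            pairs.append(parts)
    return pairs


valid = keep_valid(sample_lines)
print(valid)  # [['Good food.', '1']]
```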

* Let’s check the difference between these two lists: **lines** vs. **newLines**.
* The commands below display the first ten elements of each list.

In [None]:
lines[0:10]

['Wow... Loved this place.\t1',
 'Crust is not good.\t0',
 'Not tasty and the texture was just nasty.\t0',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.\t1',
 'The selection on the menu was great and so were the prices.\t1',
 'Now I am getting angry and I want my damn pho.\t0',
 "Honeslty it didn't taste THAT fresh.)\t0",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.\t0',
 'The fries were great too.\t1',
 'A great touch.\t1']

In [None]:
newLines[0:10]

[['Wow... Loved this place.', '1'],
 ['Crust is not good.', '0'],
 ['Not tasty and the texture was just nasty.', '0'],
 ['Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
  '1'],
 ['The selection on the menu was great and so were the prices.', '1'],
 ['Now I am getting angry and I want my damn pho.', '0'],
 ["Honeslty it didn't taste THAT fresh.)", '0'],
 ['The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.',
  '0'],
 ['The fries were great too.', '1'],
 ['A great touch.', '1']]

We can also examine the sentiment distribution of the datasets to ensure that we do not have an imbalanced dataset.

A dataset is imbalanced when its classes are represented very unequally, which means the dataset is biased towards one of its classes. An algorithm trained on such data will tend to be biased towards that class as well.

The output below shows 500 sentences labelled “1” and 500 sentences labelled “0”. That is, our dataset is not biased towards either class, positive or negative.

In [None]:
## Check Sentiment Distribution of the Current Dataset
from collections import Counter
sentiment_distr = Counter([label for (words, label) in newLines])
print(sentiment_distr)

Counter({'1': 500, '0': 500})
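
Raw counts can be turned into class shares, which makes an imbalance easier to spot. The sketch below is illustrative only: the helper `class_shares` and the skewed label list are assumptions, not part of the tutorial's data.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of the dataset that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


balanced = ["1"] * 500 + ["0"] * 500  # same distribution as the output above
skewed = ["1"] * 900 + ["0"] * 100    # made-up imbalanced example

print(class_shares(balanced))  # {'1': 0.5, '0': 0.5}
print(class_shares(skewed))    # {'1': 0.9, '0': 0.1}
```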


### 3. Split data into Train Features & Train Labels:

Most machine learning (ML) workflows keep the features and the labels in two separate objects, so we split our training data into 'train documents' (the features, X) and 'train labels' (the labels, y). Each element of `newLines` is a `[sentence, label]` pair; the sentences become the train documents and the labels become the train labels.

In [None]:
# Separate the sentences
train_documents = [line[0] for line in newLines ]
# train_documents
train_documents[0:10]

['Wow... Loved this place.',
 'Crust is not good.',
 'Not tasty and the texture was just nasty.',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.',
 'The selection on the menu was great and so were the prices.',
 'Now I am getting angry and I want my damn pho.',
 "Honeslty it didn't taste THAT fresh.)",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.',
 'The fries were great too.',
 'A great touch.']

In [None]:
# Separate the labels
train_labels = [int(line[1]) for line in newLines]
train_labels[0:10]

[1, 0, 0, 1, 1, 0, 0, 0, 1, 1]

We will use Scikit-learn's CountVectorizer to convert a set of text documents into a matrix of term/token counts. CountVectorizer also allows the text to be pre-processed before the vector representation is generated.

* **Scikit-learn**: Scikit-learn is a free Python machine learning library. It interoperates with Python's numerical and scientific libraries such as NumPy and SciPy, and provides algorithms such as support vector machines, random forests, and k-nearest neighbors.
* **CountVectorizer**: CountVectorizer generates a matrix in which each unique word is represented by a column and each text sample is represented by a row. The value of each cell is the number of times that word occurs in that text sample. This is useful when we have multiple texts and want to convert each one into a numeric vector for further text analysis.
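
The word-by-document matrix described above can be sketched in plain Python. This is a teaching approximation, not Scikit-learn's implementation: the helpers `tokenize` and `count_matrix` are hypothetical, and the tokenizer only mimics CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def count_matrix(documents):
    """Return (vocabulary, rows): one column per word, one row per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        # Each cell is how often that word occurs in this document
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ["Good food, good service", "Bad food"]
vocab, matrix = count_matrix(docs)
print(vocab)   # ['bad', 'food', 'good', 'service']
print(matrix)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```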


In [None]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

Then, instantiate a CountVectorizer object and convert `train_documents` into a matrix of token counts.

* To 'instantiate': In an object-oriented programming (OOP) language, instantiating means creating an instance (an object) of a class. The object is created in memory using the structure described in the class declaration.
* `fit()`: This method takes the training data as its argument and learns the vocabulary, that is, the set of unique tokens in the training documents.
* `transform()`: This method converts documents into a matrix of token counts using the vocabulary learned by `fit()`. Tokens that are not in the vocabulary are ignored.
* `fit_transform()`: This method combines the two. It is equivalent to calling `fit()` and then `transform()` on the same data.

Calling `fit_transform()` here therefore learns the vocabulary from `train_documents` and converts them into a matrix of token counts. Because the vectorizer below is created with the `binary` option enabled, every non-zero count is set to 1, so each cell records only whether the word is present.
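
The split between learning a vocabulary and applying it can be illustrated with a toy class. `TinyBinaryVectorizer` below is a hypothetical teaching aid written for this example; it is not Scikit-learn's code and only imitates the fit/transform pattern with binary (present or absent) features.

```python
import re


class TinyBinaryVectorizer:
    """Hypothetical teaching class; not Scikit-learn's implementation."""

    def _tokens(self, text):
        return set(re.findall(r"\b\w\w+\b", text.lower()))

    def fit(self, documents):
        # Learn the vocabulary: every unique token gets a column index
        words = set()
        for doc in documents:
            words |= self._tokens(doc)
        self.vocabulary_ = {word: i for i, word in enumerate(sorted(words))}
        return self

    def transform(self, documents):
        # Apply the learned vocabulary; unknown tokens are ignored
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for token in self._tokens(doc):
                if token in self.vocabulary_:
                    row[self.vocabulary_[token]] = 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


vec = TinyBinaryVectorizer()
train = vec.fit_transform(["great movie", "awful movie"])
new = vec.transform(["great great soundtrack"])
print(train)  # [[0, 1, 1], [1, 0, 1]]
print(new)    # [[0, 1, 0]]
```

The second call uses `transform()` only, so "soundtrack" (never seen during fitting) contributes nothing, and the repeated "great" still yields 1 because the features are binary.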


In [None]:
# Instantiate the CountVectorizer (binary=True records word presence as 0/1 instead of counts)
count_vectorizer = CountVectorizer(binary=True)

# Convert the training set to a matrix of token counts:
train_documents = count_vectorizer.fit_transform(train_documents)

### 4. Training and Testing The Model 

The Naive Bayes algorithm will be used to train a classifier on our documents. Naive Bayes is a classification algorithm for binary (two-class) and multiclass problems. It is called 'naive' (or 'idiot Bayes') because it assumes that the features are independent of one another given the class, which simplifies the probability calculations for each class and makes them tractable (we will explore this further in Week 4).
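
To make the simplification concrete, here is a minimal pure-Python sketch of a Bernoulli-style Naive Bayes classifier on toy data. It is an illustration under stated assumptions, not the Scikit-learn implementation: the function names, the two-column toy dataset, and the choice of add-one (Laplace) smoothing with `alpha=1.0` are all assumptions made for this example.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate the log prior and P(feature present | class) for each class."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        # Smoothed so that no probability is exactly 0 or 1
        probs = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (math.log(prior), probs)
    return model


def predict_bernoulli_nb(model, x):
    """Return the class with the highest log posterior score for one sample."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for present, p in zip(x, probs):
            # The "naive" step: per-feature log probabilities are simply added
            score += math.log(p) if present else math.log(1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data (not the tutorial's): columns are presence of "good" and "bad"
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(model, [1, 0]))  # 1
print(predict_bernoulli_nb(model, [0, 1]))  # 0
```

Because each feature contributes its own term to the score, the classifier never has to estimate the joint probability of every word combination, which is what keeps the calculation tractable.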

In [None]:
# Import the Naive Bayes algorithm:
from sklearn.naive_bayes import BernoulliNB

In [None]:
# Fit the BernoulliNB Classifier:
classifier = BernoulliNB().fit(train_documents, train_labels)

The classifier's `predict()` method will be used to predict the labels of new sentences based on the trained model. It takes one required argument, the data to be classified, which must first be converted with the same fitted `count_vectorizer`.

The output below shows a prediction of positive sentiment, as indicated by the label 1.


In [None]:
#Example 1
classifier.predict(count_vectorizer.transform(["The best movie ever"]))

array([1])

On the other hand, the output below shows a prediction of negative sentiment, as indicated by the label 0.

In [None]:
#Example 2
classifier.predict(count_vectorizer.transform(['worst movie ever']))

array([0])

It is easier to interpret the sentiment with a function that:
>- converts “1” to positive sentiment → lines 4 and 5
>- converts “0” to negative sentiment → lines 6 and 7


In [None]:
# Create a function to output the Sentiment Analysis Label of a sentence:
def predictionOutput(sentence):
    prediction = classifier.predict(count_vectorizer.transform([sentence]))
    if(prediction[0] == 1):
        print("This is a Positive Sentiment Sentence")
    elif (prediction[0] == 0):
        print("This is a Negative Sentiment Sentence")

Let us now call the function `predictionOutput()` that we created earlier and pass it some sentences for prediction.

The predictions are shown in the output below. In this case, the model predicts the sentiment of the sentences reasonably well. However, this is merely an example of how sentiment analysis works; the model has not been evaluated on held-out data, and we will need to refine our predictive modelling approach to avoid overfitting.

Overfitting occurs when a model learns too much detail and noise in the training data, resulting in poor performance on new data. This means that the model detects noise or random fluctuations in the training data and recognises them as concepts.

The content for weeks 3 and 4 will go over these issues and the concept of overfitting in greater depth.

In [None]:
# Testing our model with custom sentences/data:
predictionOutput("I am having a very good and great day")
predictionOutput("What a brilliant movie")
predictionOutput("He thinks that movie was a bit long")
predictionOutput("My sister said it was waste of money and time")

This is a Positive Sentiment Sentence
This is a Positive Sentiment Sentence
This is a Negative Sentiment Sentence
This is a Negative Sentiment Sentence
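
One common way to check for the overfitting mentioned above is to hold back part of the data and measure accuracy on sentences the model has never seen. The sketch below shows only the splitting and scoring steps in plain Python. The helpers `train_test_split_simple` and `accuracy`, the 20% test fraction, and the placeholder sentences are assumptions for illustration, not part of the tutorial.

```python
import random


def train_test_split_simple(items, labels, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold back a fraction of the data for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Placeholder data standing in for the sentences and their labels
sentences = [f"sentence {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split_simple(sentences, labels)
print(len(X_train), len(X_test))             # 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

In practice the vectorizer and classifier would be fitted on the training portion only, and the accuracy would be computed from predictions on the held-back portion.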


## End of TutorialWeek02