# Sentiment Analysis 

Sentiment analysis is a technique used to determine the emotional tone or sentiment expressed in a text. It involves analyzing the words and phrases used in the text to identify the underlying sentiment, whether positive, negative, or neutral. There are different ways to do sentiment analysis; here we focus on a lexicon-based approach.

### Lexicon-based Analysis 
This type of analysis involves using a set of predefined rules and heuristics to determine the sentiment of a piece of text. These rules are typically based on lexical and syntactic features of the text, such as the presence of positive or negative words and phrases. 
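
To make the idea concrete, here is a minimal sketch of a lexicon-based scorer. The tiny word list `TOY_LEXICON` and the function `toy_lexicon_score` are hypothetical and exist only for this illustration; real lexicons, such as the one behind VADER, are far larger and also handle things like negation and intensifiers.

```python
# Hypothetical mini-lexicon: each known word maps to a sentiment weight.
TOY_LEXICON = {
    "great": 2.0,
    "beautiful": 2.0,
    "love": 3.0,
    "boring": -2.0,
    "terrible": -3.0,
}

def toy_lexicon_score(text):
    """Sum the weights of all known words; unknown words contribute 0."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

positive_score = toy_lexicon_score("great storytelling and a beautiful love story")
negative_score = toy_lexicon_score("a boring and terrible sequel")
print(positive_score, negative_score)
```

A score above zero would be read as positive and a score below zero as negative. The rules are easy to inspect, which is the main appeal of the approach.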

One of the main challenges in sentiment analysis is the inherent complexity of human language. Text data often contains sarcasm, irony, and other forms of figurative language that can be difficult to interpret using traditional methods. While lexicon-based analysis is relatively simple to implement and interpret, it may not be as accurate as ML-based or transformer-based approaches, especially when dealing with complex or ambiguous text data.

We will use the NLTK Python library, which contains the modules needed for the lexicon-based approach. The analysis is broken into three steps (following [DataCamp](https://www.datacamp.com/tutorial/text-analytics-beginners-nltk)):

1. Importing the NLTK library, the modules for our analysis, and the data set
2. Preprocessing the data, that is, preparing it for the sentiment analysis
3. Running the NLTK sentiment analyzer

### Step 1: NLTK library and data set

#### 1.1 Importing NLTK library and modules

Let us start by importing the NLTK library and its modules; we will explain what each module does as we go along.

In [95]:
import nltk # import the NLTK library into this notebook so we can use it 
from nltk.sentiment.vader import SentimentIntensityAnalyzer # import the sentiment analyzer (VADER) class from NLTK
from nltk.corpus import stopwords # import the stop words corpus from NLTK 
from nltk.tokenize import TweetTokenizer # import the tokenizer class from NLTK
from nltk.stem import WordNetLemmatizer # import the lemmatizer class from NLTK

# on a fresh install, the data files used below have to be downloaded once
for resource in ['vader_lexicon', 'stopwords', 'wordnet']:
    nltk.download(resource)

#### 1.2 Importing the data set
For simplicity I will not dive into the details of how the data set is imported into Python. In short, we use the pandas library to read it.

In [96]:
import pandas as pd # Import Pandas

# Read our Letterboxd CSV file from Google Drive and store it in a variable called 'data'
url = 'https://drive.google.com/file/d/1fKMObDPi3ODKn8n-vj38LY0rf1IVaILe/view?usp=sharing'
path = 'https://drive.google.com/uc?id=' + url.split('/')[-2]

data = pd.read_csv(path)
data

Unnamed: 0,movie_name,Release Year,Reviewer name,Clean_Review_date,Clean_Review,Clean_Comment Count,Like count,genre
0,Clue,1985,Branson Reese,1996-10-16,My dad got in so much trouble for showing me t...,6,"2,286 likes",Comedy
1,Beetlejuice,1988,Branson Reese,1999-10-21,Thank GOD Tim Burton made this movie in 1988 a...,12,"3,304 likes",Comedy
2,Being John Malkovich,1999,Than Tibbetts,2010-10-04,"Malkovich. Malkovich Malkovich Malkovich, Malk...",6,"4,300 likes",Comedy
3,The Muppets,2011,Jeff,2012-03-06,"It's fine if you don't like this movie, but it...",31,,Comedy
4,Mysterious Skin,2004,Cole,2012-03-11,"This movie is beautiful, captivating, fascinat...",4,6 23 likes,Drama
...,...,...,...,...,...,...,...,...
2832,Drive,2011,k??rsten,,"Yes, I just saw it for the first timeYes, I lo...",9,"2,160 likes",Action
2833,Fight Club,1999,hunt??r,,"if I was next to brad, I would have dropped th...",19,,Drama
2834,The Bling Ring,2013,k??rsten,,not a single good shot or outfit in this entir...,30,,Crime
2835,A Serbian Film,2010,DirkH,,"OH MY GOD, LOOK AT HOW CONTROVERSIAL I AM!!!!!!",65,,Horror
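
The URL handling in the cell above can be checked on its own. The sketch below reuses the same share link and shows that `url.split('/')[-2]` picks out the file ID, which is then appended to the direct-download prefix. The variable name `file_id` is introduced here only for readability.

```python
url = 'https://drive.google.com/file/d/1fKMObDPi3ODKn8n-vj38LY0rf1IVaILe/view?usp=sharing'

# Splitting on '/' leaves the file ID as the second-to-last piece,
# just before 'view?usp=sharing'.
file_id = url.split('/')[-2]

# Same prefix as in the notebook cell.
path = 'https://drive.google.com/uc?id=' + file_id

print(file_id)
print(path)
```

This relies on the share link keeping the shape `.../file/d/<id>/view...`; a link in a different shape would need different handling.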


### Step 2: Preprocessing text
Text preprocessing is a crucial step in performing sentiment analysis, as it helps to clean and normalize the text data, making it easier to analyze. **The preprocessing step involves a series of techniques that help transform raw text data into a form you can use for analysis**. Some common text preprocessing techniques include tokenization, stop word removal, stemming, and lemmatization.

- To preprocess our text, we create a function called `preprocess_text` that will go through all the preprocessing steps at once when we feed it a **sentence** (i.e. a review), giving us back a form that we can use for the sentiment analysis. 

Let us go through the process step by step with an example movie review I found online:

In [97]:
# the movie review
sentence = 'One of my favorite movies of all time! It is action packed with great storytelling and a beautiful love story!'

#### 2.1 Tokenization
The first step of preprocessing involves tokenizing the data. Given a sentence, we want to break it down into individual words or tokens. This allows the sentiment analyzer to analyze individual words.

**Note**: Instead of using `word_tokenize` like in DataCamp, I found that `TweetTokenizer` works better at splitting up our data into individual words. 

In [98]:
# create a reference variable for the class 
word_tokenize = TweetTokenizer() 

# lowercase the sentence and then tokenize it and store it in the variable tokens 
tokens = word_tokenize.tokenize(sentence.lower()) 

# display tokens (we see that the sentence is split up into a list of words and punctuation marks)
tokens

['one',
 'of',
 'my',
 'favorite',
 'movies',
 'of',
 'all',
 'time',
 '!',
 'it',
 'is',
 'action',
 'packed',
 'with',
 'great',
 'storytelling',
 'and',
 'a',
 'beautiful',
 'love',
 'story',
 '!']
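
For intuition about what a tokenizer does, here is a rough regex-based stand-in. It is not how `TweetTokenizer` is implemented, and `simple_tokenize` is a hypothetical helper written for this example; it only mimics the basic behaviour of splitting lowercased text into words and punctuation marks.

```python
import re

def simple_tokenize(text):
    """Lowercase the text, then split it into word tokens and single punctuation marks."""
    # \w+(?:'\w+)? keeps contractions such as "it's" in one piece;
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text.lower())

example_tokens = simple_tokenize("One of my favorite movies of all time! It's great.")
print(example_tokens)
```

A dedicated tokenizer handles many more cases (emoticons, hashtags, URLs), which is one reason to prefer a library class over a hand-written pattern.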

#### 2.2 Removing stop words
The next step is to remove stop words. Stop words are words that are very common in a language and carry little meaning on their own, such as "and," "the," "of," and "it", so they are unlikely to convey much sentiment.

By removing stop words, the remaining words in the text are more likely to indicate the sentiment being expressed. This can help to improve the accuracy of the sentiment analysis. We can actually see what the stopwords are in the `nltk.corpus` module:

In [99]:
# a list of the words that are considered stopwords, i.e. words we will remove from our sentence 
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [100]:
# the code below goes through every word in the variable tokens and checks whether it is in the stopwords list above;
# if it is, the word is ignored, and if it is not, we store it in a new list called filtered_tokens
filtered_tokens = [token for token in tokens if token not in stopwords.words('english')]

# displaying filtered_tokens, we see that stopwords such as 'of', 'is', and 'with' have been removed
filtered_tokens

['one',
 'favorite',
 'movies',
 'time',
 '!',
 'action',
 'packed',
 'great',
 'storytelling',
 'beautiful',
 'love',
 'story',
 '!']
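
One practical note: `stopwords.words('english')` returns a list, and the comprehension above calls it again for every token. A common pattern is to build a set once and reuse it. The sketch below uses a small hand-written stand-in (`toy_stop_words`, a hypothetical name) instead of the NLTK list so that it runs without any downloads.

```python
# Hand-written stand-in for stopwords.words('english'); the real NLTK list is much longer.
toy_stop_words = {"of", "my", "all", "it", "is", "with", "and", "a"}

tokens = ["one", "of", "my", "favorite", "movies", "of", "all", "time", "!"]

# Set membership is a fast lookup, and the set is built only once
# instead of once per token.
filtered = [token for token in tokens if token not in toy_stop_words]
print(filtered)
```

With NLTK, the equivalent would be `stop_words = set(stopwords.words('english'))` followed by the same comprehension using `stop_words`. The result is identical; only the running time changes, which matters once the function is applied to thousands of reviews.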

#### 2.3 Stemming and Lemmatization
Stemming and lemmatization both reduce words to a base form. Stemming removes suffixes from words, such as "ing" and "ed". Lemmatization maps a word to its dictionary form (its lemma), for example reducing "better" to "good". Note that NLTK's `WordNetLemmatizer` treats every word as a noun unless a part of speech is passed, so an adjective example like "better" only works with `pos='a'`.


In [101]:
# create reference variable for the class
lemmatizer = WordNetLemmatizer()

# the code below goes through every word we have in filtered_tokens, lemmatizes it, and then stores it in a list 
# called lemmatized_tokens
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# displaying the list (in this case not many words have changed, only plurals turned singular, e.g. movies -> movie)
lemmatized_tokens

['one',
 'favorite',
 'movie',
 'time',
 '!',
 'action',
 'packed',
 'great',
 'storytelling',
 'beautiful',
 'love',
 'story',
 '!']
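
The cell above only applies lemmatization, so for contrast here is a rough illustration of the suffix-stripping idea behind stemming. `toy_stem` is a hypothetical helper written for this example. It is not an NLTK function and is much cruder than a real stemmer such as Porter's.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # only strip when at least three characters of stem would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["showing", "dropped", "movies", "story"]]
print(stems)
```

Here "dropped" becomes "dropp", which is not a word. Mechanical suffix removal can produce such fragments, whereas a lemmatizer looks words up in a dictionary and returns real base forms.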

#### 2.4 Joining tokens back into a string 
The last step is to join the remaining tokens back into a single string.

In [102]:
# using the join function, we can join the list by adding a space in between each word
processed_text = ' '.join(lemmatized_tokens)

# calling it we have
processed_text

'one favorite movie time ! action packed great storytelling beautiful love story !'

#### 2.5 Preprocessing our data set
We combine all the steps above into a function called `preprocess_text` so that each movie review can be preprocessed in a single call.

In [103]:
# create the preprocess_text function that takes in a sentence and returns the preprocessed form of it 
def preprocess_text(text):
    
    # tokenize the text - split the lowercased review into a list of words and symbols
    word_tokenize = TweetTokenizer() 
    tokens = word_tokenize.tokenize(text.lower())
    
    # remove stopwords - i.e. words that are commonly used in a sentence that do not help with analysis
    # (the list is turned into a set once, instead of being fetched again for every token)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    # lemmatize the tokens - i.e. reduce each word to its most basic form, e.g. rocks -> rock
    lemmatizer = WordNetLemmatizer()
    
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    
    # join tokens back into a string 
    processed_text = ' '.join(lemmatized_tokens)
    
    return processed_text

In [104]:
# store the preprocessed text in a new column called 'Processed Review'
data['Processed Review'] = data['Clean_Review'].apply(preprocess_text) 

### Step 3: NLTK Analyzer

#### 3.1 Sentiment Analyzer (VADER)
The final step is to feed the sentence into the NLTK sentiment analyzer (VADER) and evaluate it. We will consider the following:
* positive reviews have a `sentiment = 1`
* neutral reviews have a `sentiment = 0`
* negative reviews have a `sentiment = -1`


We can continue with the example sentence that we preprocessed above called `processed_text`.

**Note**: VADER returns scores for positive, neutral, and negative. Their sum is equal to 1 (up to rounding).

In [105]:
# Creating reference variable 
analyzer = SentimentIntensityAnalyzer()

# Scoring the sentence 
scores = analyzer.polarity_scores(processed_text)

# there is also a score called 'compound' but since we are only interested in 'neg', 'neu', and 'pos', we will 
# remove 'compound' by name (popitem() would only work as long as 'compound' is the last key)
scores.pop('compound', None)

# checking the score
scores

{'neg': 0.0, 'neu': 0.307, 'pos': 0.693}

For our analysis, we will consider the sentiment of the review based on which score is the highest (different from DataCamp). For example, in the scores above, `'pos': 0.693` is highest, therefore this review is positive and so `sentiment = 1`. 

In [106]:
# getting the highest score 
highest_score = max(scores, key=scores.get)

if highest_score == 'pos':
    sentiment = 1
elif highest_score == 'neu':
    sentiment = 0
else: 
    sentiment = -1
    
# call sentiment
sentiment

1
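
Two properties of this rule can be checked with plain numbers. The sketch below copies the scores from the output above by hand, so no NLTK call is involved, and adds a made-up tie (`tied_scores`, a hypothetical example) to show how `max` behaves when two scores are equal.

```python
# Scores copied by hand from the output above.
scores = {'neg': 0.0, 'neu': 0.307, 'pos': 0.693}

# The three proportions add up to 1, apart from rounding.
total = sum(scores.values())
print(round(total, 3))

# max() returns the first key with the largest value, in the dictionary's insertion order.
label = max(scores, key=scores.get)
print(label)

# Hypothetical tie between 'neg' and 'neu': 'neg' comes first in the dict, so it wins.
tied_scores = {'neg': 0.5, 'neu': 0.5, 'pos': 0.0}
tied_label = max(tied_scores, key=tied_scores.get)
print(tied_label)
```

So with this rule a tie is resolved in favour of whichever key comes first, here `'neg'` before `'neu'` before `'pos'`. That is worth keeping in mind for very short reviews, where ties are more likely.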

#### 3.2 Analyzing our data set
Below we combine everything together in one function. 

In [107]:
# Create get_sentiment function that takes in a sentence and scores it on negative, neutral, or positive
def get_sentiment(text):
    
    # Initialize NLTK sentiment analyzer by creating reference variable 
    analyzer = SentimentIntensityAnalyzer()
    
    # scoring the sentence 
    scores = analyzer.polarity_scores(text)
    scores.pop('compound', None) # removing the 'compound' value by name since we don't need it 
    
    highest_score = max(scores, key=scores.get) # check whether positive, neutral or negative has the highest value
    
    # assign the appropriate sentiment value based on highest_score
    if highest_score == 'pos': 
        sentiment = 1
    elif highest_score == 'neu':
        sentiment = 0
    else: 
        sentiment = -1
        
    return sentiment # return the sentiment for the sentence

In [108]:
# store the sentiment in a new column called 'Sentiment'
data['Sentiment'] = data['Processed Review'].apply(get_sentiment)

# let's see what our data looks like now 
data

Unnamed: 0,movie_name,Release Year,Reviewer name,Clean_Review_date,Clean_Review,Clean_Comment Count,Like count,genre,Processed Review,Sentiment
0,Clue,1985,Branson Reese,1996-10-16,My dad got in so much trouble for showing me t...,6,"2,286 likes",Comedy,dad got much trouble showing kid started sayin...,0
1,Beetlejuice,1988,Branson Reese,1999-10-21,Thank GOD Tim Burton made this movie in 1988 a...,12,"3,304 likes",Comedy,thank god tim burton made movie 1988 2008 . im...,0
2,Being John Malkovich,1999,Than Tibbetts,2010-10-04,"Malkovich. Malkovich Malkovich Malkovich, Malk...",6,"4,300 likes",Comedy,"malkovich . malkovich malkovich malkovich , ma...",0
3,The Muppets,2011,Jeff,2012-03-06,"It's fine if you don't like this movie, but it...",31,,Comedy,"fine like movie , probably mean angry , hate-f...",-1
4,Mysterious Skin,2004,Cole,2012-03-11,"This movie is beautiful, captivating, fascinat...",4,6 23 likes,Drama,"movie beautiful , captivating , fascinating , ...",1
...,...,...,...,...,...,...,...,...,...,...
2832,Drive,2011,k??rsten,,"Yes, I just saw it for the first timeYes, I lo...",9,"2,160 likes",Action,"yes , saw first timeyes , loved everything ity...",0
2833,Fight Club,1999,hunt??r,,"if I was next to brad, I would have dropped th...",19,,Drama,"next brad , would dropped soap",0
2834,The Bling Ring,2013,k??rsten,,not a single good shot or outfit in this entir...,30,,Crime,single good shot outfit entire thing,0
2835,A Serbian Film,2010,DirkH,,"OH MY GOD, LOOK AT HOW CONTROVERSIAL I AM!!!!!!",65,,Horror,"oh god , look controversial ! ! !",1


### Dropping the Processed Review column and storing the data into a new CSV
Since we no longer need the `Processed Review` column, we drop it and then save the rest in a new CSV file called `final_sentiment_letterboxd.csv`.

In [111]:
# drop column 
if 'Processed Review' in data.columns:
    data.drop(['Processed Review'], axis=1, inplace=True)

# save as new CSV
data.to_csv('final_sentiment_letterboxd.csv', index=False)

In [112]:
# How many positive, neutral and negative reviews are there? 
counts = data['Sentiment'].value_counts() # number of reviews for each sentiment value

pos = int(counts.get(1, 0))
neu = int(counts.get(0, 0))
neg = int(counts.get(-1, 0))

print("There are", pos, "positive reviews,", neu, "neutral reviews and", neg, "negative reviews")

There are 381 positive reviews, 2200 neutral reviews and 256 negative reviews


In [114]:
data 

Unnamed: 0,movie_name,Release Year,Reviewer name,Clean_Review_date,Clean_Review,Clean_Comment Count,Like count,genre,Sentiment
0,Clue,1985,Branson Reese,1996-10-16,My dad got in so much trouble for showing me t...,6,"2,286 likes",Comedy,0
1,Beetlejuice,1988,Branson Reese,1999-10-21,Thank GOD Tim Burton made this movie in 1988 a...,12,"3,304 likes",Comedy,0
2,Being John Malkovich,1999,Than Tibbetts,2010-10-04,"Malkovich. Malkovich Malkovich Malkovich, Malk...",6,"4,300 likes",Comedy,0
3,The Muppets,2011,Jeff,2012-03-06,"It's fine if you don't like this movie, but it...",31,,Comedy,-1
4,Mysterious Skin,2004,Cole,2012-03-11,"This movie is beautiful, captivating, fascinat...",4,6 23 likes,Drama,1
...,...,...,...,...,...,...,...,...,...
2832,Drive,2011,k??rsten,,"Yes, I just saw it for the first timeYes, I lo...",9,"2,160 likes",Action,0
2833,Fight Club,1999,hunt??r,,"if I was next to brad, I would have dropped th...",19,,Drama,0
2834,The Bling Ring,2013,k??rsten,,not a single good shot or outfit in this entir...,30,,Crime,0
2835,A Serbian Film,2010,DirkH,,"OH MY GOD, LOOK AT HOW CONTROVERSIAL I AM!!!!!!",65,,Horror,1
