In this project, you will analyze a dataset of movie reviews sourced from the `IMDB` database. You will apply the NLTK (Natural Language Toolkit) tools covered in the preceding section to a practical, real-life problem.
<br>
<br>
Your primary objective is to explore the movie reviews dataset, a large collection of reviews written by users of the `IMDB` platform. Using NLTK, you will apply a range of natural language processing techniques to extract insights from the reviews.

To begin with, you will preprocess the raw textual data, implementing essential steps such as tokenization, removing stop words, and performing stemming or lemmatization. This process will help transform the reviews into a more structured and manageable format, enabling further analysis.
<br>
<br>
**By the end of the project, you will have gained invaluable experience in working with textual data, honed your natural language processing skills, and developed a solid understanding of how the NLTK stack can be effectively applied in real-life scenarios, particularly in the realm of movie reviews.**

### Project - Analyzing Movie Reviews using NLTK

![imdb](https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg)

# NLTK Adventures: Unleashing the Film Review Analyzer!

You find yourself in a whimsical world where IMDb has enlisted your extraordinary skills as a Text Movie Review Analyst! Picture yourself entering the majestic office of Mr. Cinema, your quirky boss with a passion for cinema. He eagerly awaits your presentation on how NLTK will revolutionize their movie review analysis.
<br>
<br>
Your boss wants you to analyze a large chunk of movie reviews using `nltk`. He just heard about something called `text mining` and is super eager to try using Python to analyze user reviews for the first time!
<br>
<br>
Let's start!

Load the `IMDB Dataset.csv` file using the `pandas` module. 
<br>
*Hint: Check the `pandas.read_csv` function!*

In [1]:
import pandas as pd
imdb_data = pd.read_csv('IMDB Dataset.csv')

In [30]:
imdb_data.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


Subset the `review` column and store it in a variable called `reviews`.
<br>
*Hint: Check how to select columns in pandas!*

In [2]:
reviews = imdb_data.review

Transform the reviews object into a list of reviews. Each element in the list should contain a review. Name the new object `list_reviews`.

In [4]:
list_reviews = list(reviews)
list_reviews[:2]

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

Tokenize every review in `list_reviews` into words. Save the new object (a new list) as `tokenized_reviews`.

In [5]:
import nltk
tokenized_reviews = [nltk.tokenize.word_tokenize(review) for review in list_reviews]

In [7]:
tokenized_reviews[0][:10]

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching']
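
The Treebank-style rules behind `word_tokenize` split contractions into separate tokens, which is why pieces such as `'ll` and `n't` show up later in the frequency counts. The sketch below uses `TreebankWordTokenizer`, chosen here only because it needs no downloaded model (`word_tokenize` additionally relies on the Punkt sentence data). The sample sentence is made up for illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical sample sentence, not taken from the dataset.
sample = "You'll be hooked, but it doesn't pull punches"

# Contractions are split: "You'll" -> "You" + "'ll", "doesn't" -> "does" + "n't"
tokens = TreebankWordTokenizer().tokenize(sample)
print(tokens)
```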

Perform token cleaning on every review in the list, namely:
- Remove stop words
- Lowercase every word
- Remove punctuation
<br>
You can save the cleaned tokens in a new object `cleaned_tokens`.

In [8]:
import string
from nltk.corpus import stopwords

punctuation = string.punctuation
# A set makes the membership test much faster than a list on a corpus this size
stop_words = set(stopwords.words('english'))

cleaned_tokens = []

for review in tokenized_reviews:
    cleaned_review = []
    for word in review:
        token = word.lower()
        # Skip single punctuation marks and English stop words
        if token in punctuation or token in stop_words:
            continue
        cleaned_review.append(token)
    # Append the cleaned review to the cleaned_tokens object
    cleaned_tokens.append(cleaned_review)
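
Note that the `in punctuation` test is a substring check against the `string.punctuation` string. Single characters such as `,` are dropped, while multi-character tokens such as `..` or `....` pass through (they are visible in the cleaned output). A minimal sketch of a stricter check, assuming you want to drop any token made up only of punctuation characters; `is_punctuation` is a hypothetical helper, not part of the original solution:

```python
import string

punctuation_chars = set(string.punctuation)

def is_punctuation(token):
    """Return True when every character of the token is a punctuation mark."""
    return len(token) > 0 and all(char in punctuation_chars for char in token)

# Substring test: '..' is not a substring of string.punctuation
print('..' in string.punctuation)   # False
# Character-by-character test catches it
print(is_punctuation('..'))         # True
# Tokens containing letters are kept
print(is_punctuation("n't"))        # False
```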

In [12]:
' '.join(tokenized_reviews[0])

"One of the other reviewers has mentioned that after watching just 1 Oz episode you 'll be hooked . They are right , as this is exactly what happened with me. < br / > < br / > The first thing that struck me about Oz was its brutality and unflinching scenes of violence , which set in right from the word GO . Trust me , this is not a show for the faint hearted or timid . This show pulls no punches with regards to drugs , sex or violence . Its is hardcore , in the classic use of the word. < br / > < br / > It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary . It focuses mainly on Emerald City , an experimental section of the prison where all the cells have glass fronts and face inwards , so privacy is not high on the agenda . Em City is home to many .. Aryans , Muslims , gangstas , Latinos , Christians , Italians , Irish and more .... so scuffles , death stares , dodgy dealings and shady agreements are never far away. < br / > < br / > I would s

In [13]:
' '.join(cleaned_tokens[0])

"one reviewers mentioned watching 1 oz episode 'll hooked right exactly happened me. br br first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word. br br called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many .. aryans muslims gangstas latinos christians italians irish .... scuffles death stares dodgy dealings shady agreements never far away. br br would say main appeal show due fact goes shows would n't dare forget pretty pictures painted mainstream audiences forget charm forget romance ... oz n't mess around first episode ever saw struck nasty surreal could n't say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards 'll sold nickel inmates 'll kill order get away well

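Check the most common tokens of the 81st review (index 80). Which movie might it refer to? The cell below runs the same check on the first review (index 0); change the index to `80` to answer the question.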

In [14]:
nltk.FreqDist(cleaned_tokens[0]).most_common(10)

[('oz', 6),
 ('br', 6),
 ('violence', 4),
 ("'ll", 3),
 ('show', 3),
 ('prison', 3),
 ("n't", 3),
 ('forget', 3),
 ('watching', 2),
 ('episode', 2)]

Check the top 10 words of the entire corpus:

In [15]:
bag_words = []

for review in cleaned_tokens:
    for word in review:
        bag_words.append(word)

In [16]:
len(bag_words)

6524613
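
The nested loop above flattens the list of lists into one long list. An equivalent sketch using `itertools.chain.from_iterable`, shown on a tiny made-up sample instead of the full `cleaned_tokens`:

```python
from itertools import chain

# Toy stand-in for cleaned_tokens (hypothetical data).
sample_tokens = [['one', 'reviewers'], ['wonderful', 'little', 'production']]

# chain.from_iterable walks every inner list in order
bag = list(chain.from_iterable(sample_tokens))
print(bag)
print(len(bag))
```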

In [17]:
nltk.FreqDist(bag_words).most_common(10)

[('br', 201951),
 ("'s", 122131),
 ('movie', 85070),
 ('film', 76919),
 ("''", 66435),
 ("n't", 66244),
 ('``', 65695),
 ('one', 51828),
 ('like', 39183),
 ('good', 28767)]
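
The most frequent token, `br`, comes from the `<br />` HTML line breaks embedded in the raw reviews rather than from anything the reviewers wrote. A minimal sketch of removing those tags before tokenization, assuming a simple regular expression is enough for this dataset; `strip_line_breaks` is a hypothetical helper:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_line_breaks(text):
    """Replace each HTML line-break tag with a single space."""
    return BR_TAG.sub(' ', text)

# Fragment of the first review shown earlier
raw = "exactly what happened with me.<br /><br />The first thing that struck me about Oz"
without_tags = strip_line_breaks(raw)
print(without_tags)
```

Applying such a step to `list_reviews` before tokenizing would change every count downstream, so the outputs shown in this notebook would no longer match.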

Use `pos_tag` to produce a version of the tokens with their respective part-of-speech (POS) tags. Use the off-the-shelf tagger from `nltk` and only tag the first 10,000 reviews.
<br>
Name the new list of reviews with part-of-speech tags `tagged_reviews`.

In [18]:
tagged_reviews = [nltk.tag.pos_tag(review) for review in cleaned_tokens[0:10000]]

Based on the `tagged_reviews` object, create a new list of lists called `adjectives` where you will have a list of every adjective per review.

In [19]:
adjectives = []
for review in tagged_reviews:
    adj_review = []
    for word_tag in review:
        if word_tag[1].startswith('JJ'):
            adj_review.append(word_tag[0])
    adjectives.append(adj_review)
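
The `startswith('JJ')` test works because the Penn Treebank tag set used by `nltk.pos_tag` marks adjectives with three tags: `JJ` (base form), `JJR` (comparative) and `JJS` (superlative). A small sketch on hand-written `(token, tag)` pairs, which are illustrative and not actual tagger output:

```python
# Hand-written (token, tag) pairs for illustration only.
tagged_sample = [('good', 'JJ'), ('better', 'JJR'), ('best', 'JJS'),
                 ('movie', 'NN'), ('really', 'RB')]

# Keep every token whose tag belongs to the adjective family
adjectives_sample = [token for token, tag in tagged_sample if tag.startswith('JJ')]
print(adjectives_sample)  # ['good', 'better', 'best']
```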

Based on the column `sentiment` of the dataframe `imdb_data`, split the `adjectives` list into two lists: `adjectives_positive` and `adjectives_negative`. The `adjectives_positive` object should be a flat list (not a list of lists) with all adjectives that are tied to positive reviews. The `adjectives_negative` object should be a similar list with all adjectives that are tied to negative reviews.

In [22]:
def get_adjectives_sentiment(adjectives, sentiment_column, sentiment):
    sentiment_list = []
    for index, adjectives_review in enumerate(adjectives):
        if sentiment_column[index] == sentiment:
            sentiment_list.extend(adjectives_review)
    return sentiment_list

In [23]:
adjectives_positive = get_adjectives_sentiment(adjectives, imdb_data.sentiment, 'positive')
adjectives_negative = get_adjectives_sentiment(adjectives, imdb_data.sentiment, 'negative')
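
Because only the first 10,000 reviews were tagged, `adjectives` is shorter than `imdb_data.sentiment`. The function above still works because `enumerate` stops at the end of `adjectives` and the dataframe keeps its default 0-based index. A sketch of an alternative that pairs the two sequences explicitly with `zip`, using toy data; `split_by_sentiment` is a hypothetical name:

```python
def split_by_sentiment(adjectives, sentiments, sentiment):
    """Flatten the adjective lists of all reviews carrying the given label."""
    flat = []
    # zip stops at the shorter sequence, so surplus labels are ignored
    for adjectives_review, label in zip(adjectives, sentiments):
        if label == sentiment:
            flat.extend(adjectives_review)
    return flat

# Toy data: three tagged reviews, four labels.
toy_adjectives = [['great', 'funny'], ['bad'], ['good']]
toy_sentiments = ['positive', 'negative', 'positive', 'negative']

toy_positive = split_by_sentiment(toy_adjectives, toy_sentiments, 'positive')
toy_negative = split_by_sentiment(toy_adjectives, toy_sentiments, 'negative')
print(toy_positive)  # ['great', 'funny', 'good']
print(toy_negative)  # ['bad']
```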

Extract the 50 most common adjectives for negative and for positive reviews. Save them in a dataframe (call it `top_adjectives`) with the number of times each adjective appears in positive and in negative reviews. For example, if an adjective is in the top 50 of the negative list with a count of 5 but does not appear in the top 50 of the positive list, set its positive count to `0` in this new dataframe.

In [27]:
top_positives = pd.DataFrame(
    [
        [count[0] for count in nltk.FreqDist(adjectives_positive).most_common(50)],
        [count[1] for count in nltk.FreqDist(adjectives_positive).most_common(50)]
    ],
    index = ['adjective','positive_count']
).T

top_negatives = pd.DataFrame(
    [
        [count[0] for count in nltk.FreqDist(adjectives_negative).most_common(50)],
        [count[1] for count in nltk.FreqDist(adjectives_negative).most_common(50)]
    ],
    index = ['adjective','negative_count']
).T

In [16]:
top_adjectives = top_positives.merge(top_negatives, on='adjective', how='outer').fillna(0)
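
An outer merge keeps adjectives that appear in only one of the two tables and leaves the missing count as `NaN`, which `fillna(0)` then turns into `0`. A toy sketch with made-up counts:

```python
import pandas as pd

# Made-up counts for illustration.
toy_positives = pd.DataFrame({'adjective': ['good', 'great'], 'positive_count': [5, 4]})
toy_negatives = pd.DataFrame({'adjective': ['good', 'bad'], 'negative_count': [6, 7]})

# 'great' has no negative count and 'bad' has no positive count: both become 0
toy_top = toy_positives.merge(toy_negatives, on='adjective', how='outer').fillna(0)
print(toy_top)
```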

Which adjective seems to be the most overweighted in negative reviews (meaning that it appears very often in negative reviews but not in positive ones)?

In [17]:
top_adjectives.sort_values(by='negative_count', ascending=False).head(10)

Unnamed: 0,adjective,positive_count,negative_count
0,good,2809,2866
13,bad,694,2865
2,br,1868,2049
6,much,1323,1414
5,little,1328,1176
3,many,1473,1149
1,great,2575,1019
7,real,1008,888
8,first,974,877
11,old,746,771


And in the positive reviews? Is there more than one adjective that is overweighted?

In [18]:
top_adjectives.sort_values(by='positive_count', ascending=False).head(10)

Unnamed: 0,adjective,positive_count,negative_count
0,good,2809,2866
1,great,2575,1019
2,br,1868,2049
3,many,1473,1149
4,best,1371,662
5,little,1328,1176
6,much,1323,1414
7,real,1008,888
8,first,974,877
9,new,914,678
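
One way to make "overweighted" concrete is the ratio between the two counts. The sketch below uses three rows copied from the tables above; the ratio itself is an assumption about how to measure overweighting, not something the exercise prescribes.

```python
# (positive_count, negative_count) copied from the output tables.
counts = {'good': (2809, 2866), 'bad': (694, 2865), 'great': (2575, 1019)}

# Ratio above 1: more frequent in negative reviews; below 1: in positive ones
ratios = {adjective: negative / positive
          for adjective, (positive, negative) in counts.items()}

for adjective, ratio in ratios.items():
    print(f'{adjective}: negative/positive = {ratio:.2f}')
```

A ratio well above 1 points to a negative-leaning adjective (`bad`), a ratio well below 1 to a positive-leaning one (`great`), and a ratio close to 1 to an adjective that says little about sentiment (`good`).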


Based on the `cleaned_tokens` object, stem all words available in our reviews and save the result in a new list of lists. Name the new object `stemmed_tokens`. Use the `SnowballStemmer`.

In [20]:
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')

In [26]:
stemmed_tokens = []
for review in cleaned_tokens:
    stemmed_review = []
    for token in review:
        stemmed_token = snowball.stem(token)
        stemmed_review.append(stemmed_token)
    stemmed_tokens.append(stemmed_review)
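
To see what the stemmer does to individual tokens, here is a small sketch on four words that occur in the review inspected further down; the stems it prints can be compared with that review's stemmed output.

```python
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer(language='english')

# The stemmer strips suffixes; the result is not always a dictionary word
stems = {token: snowball.stem(token)
         for token in ['smallville', 'episode', 'justice', 'favorite']}

for token, stem in stems.items():
    print(token, '->', stem)
```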

Add the share of retained data (number of characters of each review in `stemmed_tokens`, joined with spaces, divided by the number of characters of the original review) to `imdb_data`. Name the new column `perc_retain_stemming`. Which review loses the most data with stemming?
<br>
<br>
After finding it, print the review and the stemmed version of that review.

In [28]:
perc_retain = []
for index, review in enumerate(stemmed_tokens):
    # Characters kept after cleaning and stemming, relative to the raw review
    perc_retain.append(len(' '.join(review)) / len(reviews[index]))

In [30]:
imdb_data['perc_retain_stemming'] = perc_retain

In [40]:
imdb_data.sort_values(by='perc_retain_stemming',ascending=True)

Unnamed: 0,review,sentiment,perc_retain_stemming
48927,Smallville episode Justice is the best episode...,positive,0.110272
39182,Smallville episode Justice is the best episode...,positive,0.110272
45723,What is the story what is it on the screen. At...,negative,0.390173
48697,Maybe it was the fact that I saw Spider-man th...,negative,0.391213
11645,No offense to anyone who saw this and liked it...,negative,0.404545
...,...,...,...
11926,I wouldn't rent this one even on dollar rental...,negative,0.811321
30527,"As so many others have written, this is a wond...",positive,0.813246
28920,Primary plot!Primary direction!Poor interpreta...,negative,0.823529
36844,OZ is the greatest show ever mad full stop.OZ ...,positive,0.836257


In [42]:
imdb_data.loc[48927,'review']

"Smallville episode Justice is the best episode of Smallville ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! It's my favorite episode of Smallville! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !"

In [43]:
' '.join(stemmed_tokens[48927])

"smallvill episod justic best episod smallvill 's favorit episod smallvill"

What conclusion do you draw after reading that review?

- The stemmed review retained few of the original characters because of the punctuation, not because of stemming. It can probably be excluded from the analysis, as it contains little meaningful text.
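
To separate the two effects, compare the share retained after cleaning alone with the share retained after cleaning plus stemming. The sketch below uses a shortened, made-up review in the same style; the token lists are written by hand rather than produced by the pipeline.

```python
# Shortened, hand-made example in the style of the review above.
original = "Justice is the best episode " + "! " * 50
cleaned = ['justice', 'best', 'episode']   # after stop-word and punctuation removal
stemmed = ['justic', 'best', 'episod']     # after Snowball stemming

retain_cleaning = len(' '.join(cleaned)) / len(original)
retain_stemming = len(' '.join(stemmed)) / len(original)

# Most of the loss already happens in the cleaning step
print(round(retain_cleaning, 3), round(retain_stemming, 3))
```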