In this project, you will analyze a dataset of movie reviews sourced from the `IMDB` database. Building on the NLTK (Natural Language Toolkit) tools covered in the preceding section, you will apply those skills in a practical, real-life context.
<br>
<br>
Your primary objective is to explore the movie reviews dataset, a large collection of textual reviews written by users of the `IMDB` platform. Using NLTK, you will apply a range of natural language processing techniques to extract insights from the reviews.

First, you will preprocess the raw text: tokenizing it, removing stop words, and stemming or lemmatizing the remaining words. This turns the reviews into a more structured, manageable format for further analysis.
<br>
<br>
**By the end of the project, you will have gained hands-on experience with textual data, sharpened your natural language processing skills, and seen how NLTK can be applied to a real-life problem such as analyzing movie reviews.**

### Project - Analyzing Movie Reviews using NLTK

![imdb](https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg)

# NLTK Adventures: Unleashing the Film Review Analyzer!

You find yourself in a whimsical world where IMDb has enlisted your extraordinary skills as a Text Movie Review Analyst! Picture yourself entering the majestic office of Mr. Cinema, your quirky boss with a passion for film. He eagerly awaits your presentation on how NLTK will revolutionize their movie review analysis.
<br>
<br>
Your boss wants you to analyze a large chunk of movie reviews using `nltk`. He just heard about something called `text mining` and is super eager to try using Python to analyze user reviews for the first time!
<br>
<br>
Let's start!

Load the `IMDB Dataset.csv` file using the `pandas` module. Save the dataframe with the name `imdb_data`.
<br>
*Hint: Check the `pandas.read_csv` function!*

In [2]:
### YOUR CODE HERE
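One possible way to approach the loading step is sketched below. To keep the example self-contained, it reads a tiny in-memory stand-in instead of the real file; the two sample rows and the label values `positive`/`negative` are assumptions made for illustration, not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for `IMDB Dataset.csv`.
# The label values "positive"/"negative" are an assumption about the real file.
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
)

# With the real file on disk this would be: pd.read_csv("IMDB Dataset.csv")
imdb_data = pd.read_csv(sample_csv)

# Quick sanity check: number of rows and columns that were loaded.
print(imdb_data.shape)
```

Quoting the review text matters here because reviews often contain commas, which would otherwise be read as column separators.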

Subset the `review` column and store it in a variable called `reviews`.
<br>
*Hint: Check how to select columns in pandas!*

In [3]:
### YOUR CODE HERE

Transform the reviews object into a list of reviews. Each element in the list should contain a review. Name the new object `list_reviews`.

In [4]:
### YOUR CODE HERE
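The two steps above (selecting the column, then turning it into a list) could look roughly like this. The small dataframe is a made-up stand-in for `imdb_data` so that the snippet runs on its own.

```python
import pandas as pd

# Made-up stand-in for the `imdb_data` dataframe loaded earlier.
imdb_data = pd.DataFrame(
    {
        "review": ["A wonderful, moving film.", "Dull plot and wooden acting."],
        "sentiment": ["positive", "negative"],
    }
)

# Selecting a single column with square brackets returns a pandas Series.
reviews = imdb_data["review"]

# `Series.tolist()` converts it into a plain Python list, one string per review.
list_reviews = reviews.tolist()
```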

Tokenize every review in `list_reviews` into words. Save the result (a new list with one list of tokens per review) as `tokenized_reviews`.

In [5]:
### YOUR CODE HERE
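A minimal tokenization sketch follows. It uses `wordpunct_tokenize`, a regular-expression tokenizer that needs no extra NLTK data; this is a swapped-in choice to keep the example runnable anywhere. `nltk.word_tokenize` is the more common option but requires the Punkt resource to be downloaded first. The sample reviews are invented.

```python
from nltk.tokenize import wordpunct_tokenize

# Invented sample standing in for `list_reviews`.
list_reviews = ["Great movie, loved it!", "Not my cup of tea."]

# One list of tokens per review; punctuation becomes its own token.
tokenized_reviews = [wordpunct_tokenize(review) for review in list_reviews]
```

Note that the two tokenizers split contractions differently, so token counts will not match exactly if you switch between them.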

Clean the tokens of every review in the list of tokenized reviews, namely:
- Remove stop words
- Lowercase every word
- Remove punctuation
<br>
Save the cleaned tokens in a new object called `cleaned_tokens`.

In [7]:
### YOUR CODE HERE
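The cleaning steps above can be sketched as a single pass over each review. The stop word set below is a tiny hypothetical one so the snippet runs without downloads; in the notebook you would more likely use `nltk.corpus.stopwords.words("english")`. The helper name `clean_tokens` is also just illustrative.

```python
import string

# Hypothetical mini stop word list, for illustration only.
STOP_WORDS = {"the", "a", "was", "it", "and", "of"}
PUNCTUATION = set(string.punctuation)


def clean_tokens(tokens):
    """Lowercase tokens, then drop stop words and punctuation-only tokens."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered in STOP_WORDS:
            continue
        # Drops tokens such as "!" or "..." that contain only punctuation.
        if all(char in PUNCTUATION for char in lowered):
            continue
        cleaned.append(lowered)
    return cleaned


# Invented sample standing in for `tokenized_reviews`.
tokenized_reviews = [
    ["The", "movie", "was", "great", "!"],
    ["It", "bored", "me", "..."],
]
cleaned_tokens = [clean_tokens(tokens) for tokens in tokenized_reviews]
```

Lowercasing happens before the stop word check on purpose: stop word lists are usually lowercase, so a capitalized "The" would otherwise slip through.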

Check the most common tokens of the 81st review (index 80). Which movie might it refer to?

In [44]:
### YOUR CODE HERE

Check the top 10 words of the entire corpus:

In [9]:
### YOUR CODE HERE
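Both frequency questions above (one review, then the whole corpus) can be approached with `collections.Counter`. The sketch below uses three invented reviews; in the notebook the single-review case would index `cleaned_tokens[80]` and use `most_common(10)`.

```python
from collections import Counter
from itertools import chain

# Invented sample standing in for `cleaned_tokens`.
cleaned_tokens = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "story"],
]

# Most common tokens within a single review (here: the first one).
review_counts = Counter(cleaned_tokens[0])
print(review_counts.most_common(2))

# Most common tokens across all reviews: flatten the lists, then count.
corpus_counts = Counter(chain.from_iterable(cleaned_tokens))
top_words = corpus_counts.most_common(10)
print(top_words)
```

`chain.from_iterable` avoids building one large intermediate list, which helps when the corpus has tens of thousands of reviews.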

Use `nltk.pos_tag` to produce a version of the tokens paired with their part-of-speech (POS) tags. Use NLTK's off-the-shelf tagger and tag only the first 10,000 reviews.
<br>
Name the new list of reviews with part-of-speech tags `tagged_reviews`.

In [11]:
### YOUR CODE HERE
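A hedged sketch of the tagging step is shown below. The helper `tag_reviews` and its `tagger` parameter are hypothetical names introduced here; the parameter defaults to `nltk.pos_tag` and exists only so the function can be tried out with a stand-in tagger before the real model is available.

```python
import nltk


def tag_reviews(token_lists, limit=10000, tagger=nltk.pos_tag):
    """Tag the first `limit` reviews.

    Each review becomes a list of (token, tag) pairs.
    `tagger` is any callable that maps a list of tokens to such pairs.
    """
    return [tagger(tokens) for tokens in token_lists[:limit]]


# In the notebook (assumes the pre-trained tagger model has been fetched
# with nltk.download; the exact resource name depends on the NLTK version):
# tagged_reviews = tag_reviews(cleaned_tokens)
```

Slicing before tagging matters: the tagger is by far the slowest step so far, so tagging all reviews and slicing afterwards would waste most of the work.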

Based on the `tagged_reviews` object, create a new list of lists called `adjectives`, containing one list of adjectives per review.

In [12]:
### YOUR CODE HERE
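One way to pick out the adjectives is to filter on the tag prefix. NLTK's default tagger uses the Penn Treebank tag set, where adjectives are tagged `JJ`, `JJR` (comparative), or `JJS` (superlative). The tagged sample below is written by hand for illustration.

```python
# Hand-written sample standing in for `tagged_reviews`.
tagged_reviews = [
    [("great", "JJ"), ("movie", "NN"), ("better", "JJR")],
    [("boring", "JJ"), ("plot", "NN")],
    [("actors", "NNS")],
]

# Keep only tokens whose tag starts with "JJ"; one list per review,
# so a review without adjectives yields an empty list.
adjectives = [
    [token for token, tag in review if tag.startswith("JJ")]
    for review in tagged_reviews
]
```

Keeping the empty lists in place preserves the alignment between `adjectives` and the rows of the dataframe, which the next step relies on.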

Based on the `sentiment` column of the `imdb_data` dataframe, split the `adjectives` list into two lists: `adjectives_positive` and `adjectives_negative`. `adjectives_positive` should be a flat list (not a list of lists) with all adjectives that come from positive reviews. `adjectives_negative` should be a similar flat list with all adjectives that come from negative reviews.

In [13]:
### YOUR CODE HERE
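The split can be sketched by walking through the adjectives and the sentiment labels in parallel. The label values `"positive"` and `"negative"` are an assumption about the `sentiment` column, and the sample data is invented; in the notebook, `sentiments` would come from `imdb_data["sentiment"].tolist()`.

```python
# Invented sample standing in for `adjectives` (one list per tagged review).
adjectives = [["great", "funny"], ["boring"], ["brilliant"]]

# Labels for ALL reviews; deliberately one longer than `adjectives`,
# mirroring the fact that only the first reviews were tagged.
sentiments = ["positive", "negative", "positive", "negative"]

adjectives_positive = []
adjectives_negative = []

# zip stops at the shorter sequence, so untagged reviews are ignored.
for review_adjectives, sentiment in zip(adjectives, sentiments):
    if sentiment == "positive":
        adjectives_positive.extend(review_adjectives)
    elif sentiment == "negative":
        adjectives_negative.extend(review_adjectives)
```

Using `extend` rather than `append` is what produces flat lists instead of lists of lists.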

Extract the 50 most common adjectives for the negative reviews and for the positive reviews. Save them in a dataframe called `top_adjectives` that holds the number of times each adjective appears in positive and in negative reviews. If an adjective is missing from one of the two top-50 lists, set its count for that sentiment to `0`. For example, an adjective that appears 5 times in the negative top 50 but is not in the positive top 50 gets `5` for negative and `0` for positive.

In [15]:
### YOUR CODE HERE
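A possible shape for this table is sketched below. The helper name `top_adjective_table` and the column names `positive`/`negative` are choices made for this example, and the sample lists are invented; `n=2` is used instead of 50 so the effect is visible on a few words.

```python
from collections import Counter

import pandas as pd


def top_adjective_table(positive, negative, n=50):
    """Count the top-n adjectives per sentiment; missing entries become 0."""
    pos_top = dict(Counter(positive).most_common(n))
    neg_top = dict(Counter(negative).most_common(n))
    # Building from two Series aligns them on the union of their adjectives.
    table = pd.DataFrame(
        {"positive": pd.Series(pos_top), "negative": pd.Series(neg_top)}
    )
    # Adjectives present in only one top-n list show up as NaN; set them to 0.
    return table.fillna(0).astype(int)


# Invented samples standing in for the real adjective lists.
adjectives_positive = ["great", "great", "good", "funny"]
adjectives_negative = ["bad", "bad", "good", "boring"]

top_adjectives = top_adjective_table(adjectives_positive, adjectives_negative, n=2)
print(top_adjectives)
```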

Which adjective seems to be the most overweighted in negative reviews (meaning that it appears very often in negative reviews but rarely or never in positive ones)?

In [2]:
### YOUR CODE HERE

And in the positive reviews? Is there more than one adjective that is overweighted?

In [1]:
### YOUR CODE HERE
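One simple way to look for overweighted adjectives is to compare the two count columns directly. The sketch below uses the raw difference in counts, which is only one possible measure (a ratio would be another), and the table contents are made-up toy numbers, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for `top_adjectives`; the counts are invented.
top_adjectives = pd.DataFrame(
    {"positive": [0, 10, 12], "negative": [9, 8, 0]},
    index=["bad", "good", "great"],
)

# Positive gap: more frequent in negative reviews; negative gap: the opposite.
gap = top_adjectives["negative"] - top_adjectives["positive"]

most_negative = gap.idxmax()  # adjective most skewed toward negative reviews
most_positive = gap.idxmin()  # adjective most skewed toward positive reviews

# Sorting the gap shows whether several adjectives stand out, not just one.
print(gap.sort_values())
```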

Based on the `cleaned_tokens` object, stem every word in the reviews using the `SnowballStemmer` and save the result in a new list of lists. Name the new object `stemmed_tokens`.

In [20]:
### YOUR CODE HERE
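A short stemming sketch follows. Passing `"english"` as the stemmer language is an assumption based on the reviews being written in English, and the sample tokens are invented.

```python
from nltk.stem.snowball import SnowballStemmer

# The Snowball stemmer needs a language; "english" is assumed here.
stemmer = SnowballStemmer("english")

# Invented sample standing in for `cleaned_tokens`.
cleaned_tokens = [["running", "movies"], ["loved", "acting"]]

# Stem every token while keeping the one-list-per-review structure.
stemmed_tokens = [
    [stemmer.stem(token) for token in tokens] for tokens in cleaned_tokens
]
print(stemmed_tokens)
```

Creating the stemmer once, outside the loops, avoids rebuilding it for every review.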

Add the percentage of retained data (the number of characters retained for each review in `stemmed_tokens`, divided by the number of characters in the original review) to `imdb_data`. Name the new column `perc_retain_stemming`. Which review loses the most data with stemming?
<br>
<br>
After finding it, print the review and the stemmed version of that review.

In [28]:
### YOUR CODE HERE
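The retention ratio described above can be sketched with a small helper. The name `retained_ratio` is hypothetical and the sample data is invented; pass in whichever processed token list the exercise calls for.

```python
def retained_ratio(tokens, original_text):
    """Characters kept in the processed tokens divided by original length."""
    if not original_text:
        return 0.0  # avoid dividing by zero for empty reviews
    return sum(len(token) for token in tokens) / len(original_text)


# Invented samples: original reviews and their processed tokens.
list_reviews = ["The movie was great!", "Bad."]
token_lists = [["movi", "great"], ["bad"]]

perc_retain = [
    retained_ratio(tokens, review)
    for tokens, review in zip(token_lists, list_reviews)
]

# Index of the review that keeps the smallest share of its characters.
worst_index = min(range(len(perc_retain)), key=perc_retain.__getitem__)

print(list_reviews[worst_index])
print(token_lists[worst_index])
```

In the notebook, the resulting list can be assigned to `imdb_data["perc_retain_stemming"]` as long as it has one value per row. Keep in mind that spaces, punctuation, and stop words all count toward the original length, so the ratio reflects the whole preprocessing pipeline rather than stemming alone.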

What conclusion do you draw after reading that review?

- YOUR ANSWER HERE