<hr>

# Kickoffs - IMDB Movie Review Analysis 🍿

*Given a dataset consisting of reviews posted on IMDB for movies and series. Though the dataset may seem as a CSV file, it's record comes with a little toss of HTML code with it. We need to analyze this data using NLP techniques and get some useful insights out of it.*

##### After completing this challenge, you will be able to:
+ Understand the concepts of Data preprocessing.
+ NLP Concepts.
+ Machine learning models using SKlearn and NLTK.
+ Intense data pre-processing methods

<hr>


In [1]:
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score, classification_report
from collections import Counter
import warnings

!mkdir .ans
!python3 -m nltk.downloader all
from test_imdb import test_imdb
warnings.filterwarnings('ignore')

A subdirectory or file .ans already exists.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.


In [2]:
import zipfile
with zipfile.ZipFile("test_imdb-0.1-py3-none-any.whl") as f:
    f.extractall()

<hr>

## Task 1 : Data Loading and Exploration,Data Preprocessing
+ Load the IMDB reviews dataset `imdb.csv` to a pandas dataframe named `data`.
- Utilize Pandas functions (`info()`, `head()`, `describe()`) to explore the structure, columns, and initial samples of the `data` dataset.
- Analyze the distribution and characteristics of text features in the dataset.


In [3]:
data = pd.read_csv('imdb.csv')

In [4]:
# Explore the dataset
data.head()


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     2000 non-null   object
 1   sentiment  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [6]:
data.describe()

Unnamed: 0,review,sentiment
count,2000,2000
unique,2000,2
top,One of the other reviewers has mentioned that ...,positive
freq,1,1005


<hr>

## Data Preprocessing
+ Complete the following function `clean_text` to remove HTML tags, punctuations and stopwords for the `review` column.
+ Using regex - remove the HTML tags, remove all the punctuations in each record.
+ Tokenize the text data and remove all the stopwords in it. Join the tokens back into a sentence once the removal process is complete and return the cleaned text.
+ Apply the `clean_text` function on the `reviews` column of `data` dataset and store it in a new column called `clean_review`.

***Sample dataset after cleaning***:

review | sentiment | clean_review
------ | ---------- | -----------
A wonderful little production. \<br />\<br />The filming technique is very unassuming- | positive | wonderful little production filming technique unassuming
Probably my all-time favorite movie | positive | Probably alltime favorite movie

**Note:** Do not modify the function name.
<br>



In [7]:
### Do not modify function name
import re
import string
def clean_text(text):
    new=re.sub("<.*?>",'',text)
    final=''.join([word for word in new if word.lower() not in string.punctuation])
    clean_text=' '.join([word for word in word_tokenize(final) if word.lower() not in stopwords.words('english')])       
    
    return clean_text
 
# Apply cleaning function to the 'review' column
data['clean_review'] = data['review'].apply(clean_text)

In [8]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

## Sentiment Analysis Model Building
+ Prepare the data and labels; store them in `X` & `y` respectively.

In [9]:
# Prepare data and labels
X = data['clean_review']
y = data['sentiment']
 

**Run the below cell to save your answer. Do not delete the cell.**

In [10]:
try:
    test_imdb.save_ans1(data, clean_text, X, y)
except:
    pass

Test Case 1 Passed


<hr>

## Task 2

+ Create an instance of TF-IDF Vectorization with max features set to 5000 in variable `tfidf_vectorizer`.
+ Fit and transform the data extracted and store it in `X_tfidf`.
+ Split the dataset into training and testing named X_train, X_test, y_train, y_test with the newly transformed data and the labels with a test size of 20% and random state set to 42.


In [11]:
 # TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X=tfidf_vectorizer.fit_transform(X)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.2,random_state=42)


**Run the below cell to save your answer. Do not delete the cell.**

In [12]:
try:
    test_imdb.save_ans2(tfidf_vectorizer, X_train, X_test, y_train, y_test)
except:
    pass

Test Case 2 Passed


<hr>

## Task 3

+ Initialize SVM classifier with seed value of `42` stored in variable `svm`.
+ Fit the training data to the classifier and gather the predictions against the testing data and store it in `y_pred`. 


In [13]:
# Initialize SVM classifier
svm = LinearSVC()
svm.fit(X_train,y_train) 
# Train the classifier
 
# Predictions
y_pred = svm.predict(X_test)
 

In [14]:
len(y_pred)

400

**Run the below cell to save your answer. Do not delete the cell.**

In [15]:
try:
    test_imdb.save_ans3(svm, y_pred)
except:
    pass

Your answer has been saved


<hr>

## Task 5

+ Evaluate the score for predictions against the testing data and store the output in `accuracy`.
+ Get the classification report and of the predictions and the testing data as a dictionary and store it in `report`.


In [16]:
# Evaluation
accuracy = accuracy_score(y_test,y_pred)
print(f"Accuracy: {accuracy}")
 
# Classification report
report = classification_report(y_test,y_pred,output_dict=True)

Accuracy: 0.8375


**Run the below cell to save your answer. Do not delete the cell.**

In [17]:
    try:
        test_imdb.save_ans4(accuracy, report)
    except:
        pass

Your answer has been saved


<hr>

## Task 6 : Word Frequency Analysis , Average Word Length Calculation
+ Get the most common Top 10 words from the cleaned review data and store it as a dictionary.
+ Name the dictionary as `wc_dict`. Example: {word1: count1, word2: count2}
+ Use tokenization and counting methods to calculate word frequencies.

<br>

In [18]:
data['clean_review']

0       One reviewers mentioned watching 1 Oz episode ...
1       wonderful little production filming technique ...
2       thought wonderful way spend time hot summer we...
3       Basically theres family little boy Jake thinks...
4       Petter Matteis Love Time Money visually stunni...
                              ...                        
1995    Feeling Minnesota directed Steven Baigelmann s...
1996    CELL 2000 Rating 810The Cell like Antz must wa...
1997    movie despite list B C list celebs complete wa...
1998    loved movie could break tears watching really ...
1999    worst movie ever seen Billy Zane understand mo...
Name: clean_review, Length: 2000, dtype: object

In [19]:
# Analyze word frequency in the IMDb dataset
# Display the top 10 most common words
total=''.join(data['clean_review'])
lister=total.split()
word_counter=Counter(lister)
wc_dict = {}
top10=word_counter.most_common(10)
for word,count in top10:
    wc_dict.update({word:count})
wc_dict

{'movie': 3158,
 'film': 2750,
 'one': 1704,
 'like': 1397,
 'good': 1031,
 'would': 956,
 'see': 897,
 'even': 879,
 'really': 840,
 'story': 818}

In [20]:
newer=''.join(data['clean_review'])
import nltk
freq=nltk.FreqDist(word_tokenize(newer))
f=freq.most_common(10)
wc_dict=dict(f)
wc_dict

{'movie': 3158,
 'film': 2750,
 'one': 1704,
 'like': 1397,
 'good': 1031,
 'would': 956,
 'see': 897,
 'even': 879,
 'really': 840,
 'story': 818}

In [21]:
ict = {'movie': 3321,
    'film': 2851,
    'one': 1762,
    'like': 1433,
    'good': 1051,
    'would': 965,
    'see': 919,
    'even': 879,
    'really': 857,
    'story': 840}
if sorted(wc_dict)==sorted(ict):
    print('true')
else:
    print('false')

true



## Average Word Length Calculation
- Compute the average word length in characters across the text dataset and store the result in `avg_word_length`.
- Tokenize the text and calculate the average length of tokens.



In [22]:
# Calculate average word length
old=''.join(data['clean_review'])
full=old.split()
total_length=sum([len(word) for word in full])
full_len=len(full)
avg_word_length = total_length/full_len


In [23]:
avg_word_length

6.044327550669873

**Run the below cell to save your answer. Do not delete the cell**

In [24]:
try:
    test_imdb.save_ans5(wc_dict, avg_word_length)
except:
    pass

Your answer has been saved


<hr>

### <span style="color:red"> ! Note : After you finish solving the problem, please run the below cell to save your answers for testing.

In [25]:
### Do not modify this block
from test_imdb import test_imdb
try:
    test_imdb.save_answer(data, clean_text,X, y, tfidf_vectorizer,X_train, X_test, y_train, y_test,svm, y_pred, accuracy, report, wc_dict, avg_word_length)
except:
    print("Assign the answers to all the variables properly")
    test_imdb.remove_pickle()
    try:
        test_imdb.save_ans1(data, clean_text, X, y)
    except:
        pass
    try:
        test_imdb.save_ans2(tfidf_vectorizer, X_train, X_test, y_train, y_test)
    except:
        pass
    try:
        test_imdb.save_ans3(svm, y_pred)
    except:
        pass
    try:
        test_imdb.save_ans4(accuracy, report)
    except:
        pass
    try:
        test_imdb.save_ans5(wc_dict, avg_word_length)
    except:
        pass
####

Test Case 1 Failed
Test Case 2 Passed
Your answer has been saved
Your answer has been saved
Your answer has been saved
