<a href="https://colab.research.google.com/github/chipojaya1/myNEBDHub/blob/main/DSP_Movie_Reviews_Analysis_Chipo_Jaya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    Project: Sentiment Analysis of Movie Reviews
</h2>

<h3 align="center">
    Name: Chipo Jaya
</h3>


### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

Certain parts of this project will be completed individually, while other parts are encouraged to be completed with the rest of your team. In order to work within the Google Colab Notebook, **please start by clicking on "File" and then "Save a copy in Drive."** This will save a copy of the notebook in your personal Google Drive. Each member of your team should work on their personal copy.

Please rename the file to "DSP - Movie Reviews Analysis - Your Full Name." Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders.

You can now start working on the project. :)

**Project Description:**

This project will introduce students to an array of skills as they strive to create a sentiment analysis model to classify a given review as positive or negative. Sentiment Analysis leverages both Natural Language Processing and Machine Learning skills - how to represent text in a machine-understandable format so as to classify the text and extract sentiment. We will also cover visualizations and how to deploy models in the real world.

[Use this link to join the NSDC DSP Slack Channel!](https://bit.ly/nsdc-dsp-movie-reviews)


---
---



<h3 align = "center">
    Milestone #1
</h3>

NOTE: These steps are to be completed **individually**, not as a team. You are encouraged to discuss steps with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to set up your environment, install the required packages, learn how to access data, and do some basic exploratory data analysis.

**Step 1:**

Setting up libraries and installing packages

To import a library that is already installed:
```python
 import <library> as <shortname>
```
We use a *short name* because it makes it easier to call the library's functions and to refer to subpackages within the library.
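
As a small illustration of the short-name convention, the sketch below imports two standard-library modules under aliases. The aliases `st` and `osp` and the sample values are arbitrary choices for this example and are not used elsewhere in the project.

```python
# Import a module under a short alias, then call its functions through the alias.
import statistics as st

# A submodule can be given a short alias in the same way.
import os.path as osp

ratings = [6, 8, 10]
average_rating = st.mean(ratings)                  # same as statistics.mean(ratings)
file_name = osp.basename("data/IMDB Dataset.csv")  # same as os.path.basename(...)

print(average_rating)
print(file_name)
```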


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize,sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


These are the libraries that will help us throughout this project. It is not necessary that you know what each library does, but you can always look it up.

We encourage you to read more about the important and most commonly used packages like Pandas and Natural Language Toolkit (NLTK) and write a few lines in your own words about what they do. [You may use the Data Science Resource Repository (DSRR) to find resources to get started!](https://nebigdatahub.org/nsdc/data-science-resource-repository/)



<h4 style="color:orange">
    TO-DO
</h4>

Write a few lines about what each library does.

- **Pandas:** A library for data manipulation. In this project it loads the reviews from a CSV file into a DataFrame, which we then clean, explore, and prepare for the next steps. It is the foundation for organizing and managing the text data efficiently.


- **NLTK:** The Natural Language Toolkit (NLTK) is a comprehensive library for working with human language data (text). It ships with built-in resources such as sentiment lexicons (e.g., VADER) and supports, among other things:

    1. Tokenization: Splitting sentences and paragraphs into individual words or tokens.

    2. Stopword Removal: Filtering out common but low-meaning words like "the," "is," and "and."

    3. Stemming/Lemmatization: Reducing words to their base or root form (e.g., "running" becomes "run").

    4. Part-of-Speech Tagging: Identifying nouns, verbs, adjectives, etc., which can be useful for analysis.


---

**Step 2:**

Let’s access our data. We will be using a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb). The reviews have been pre-labeled with sentiment polarity (positive/negative).


[The IMDb Movie Reviews dataset is available at this link](https://raw.githubusercontent.com/meghjoshii/NSDC_DataScienceProjects_SentimentAnalysis/main/IMDB%20Dataset.csv). The simplest approach is to pass this link directly to the `read_csv` function.



We will use pandas to read the data from the csv file using the `read_csv` function. This function returns a pandas dataframe. We will store this dataframe in a variable called `df`.

In [4]:
# TODO: Read the data using pandas read_csv function

# Load the dataset from the provided URL
df = pd.read_csv('https://raw.githubusercontent.com/meghjoshii/NSDC_DataScienceProjects_SentimentAnalysis/main/IMDB%20Dataset.csv')

print("Dataset loaded from url!")
print(f"Dataset shape: {df.shape}")  # (number_of_reviews, number_of_columns)
print(f"Number of reviews: {df.shape[0]:,}")
print(f"Number of columns: {df.shape[1]}")


Dataset loaded from url!
Dataset shape: (50000, 2)
Number of reviews: 50,000
Number of columns: 2
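
`read_csv` accepts file-like objects as well as URLs, so it can be tried on a tiny example without a network connection. The two-row CSV text below is made up for illustration and only mimics the `review` and `sentiment` columns of the real file; `toy_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with the same column names as the IMDb file.
csv_text = (
    "review,sentiment\n"
    '"Loved it, great cast",positive\n'
    '"Dull and far too long",negative\n'
)

# Wrap the string in a file-like object and parse it into a DataFrame.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)             # (rows, columns)
print(toy_df.columns.tolist())
```

The quoted first field shows that a comma inside a review does not split the row, which matters for free-text data like this.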


---

**Step 3:**

Let's see what the data looks like. We can use the `head` function which returns the first 5 rows of the dataframe.

In [5]:
# TODO: Print the first 5 rows of the data using head function of pandas
print("First 5 reviews")
df.head()

First 5 reviews


|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


There are 2 columns in the dataframe - `review` and `sentiment`. The `review` column contains the text of the review and the `sentiment` column contains the sentiment of the review.

The `describe()` function gives us a summary of the data.

In [6]:
# TODO: Describe the data using describe function of pandas
print("Dataset summary")
df.describe()

Dataset summary


|   | review | sentiment |
|---|---|---|
| count | 50000 | 50000 |
| unique | 49582 | 2 |
| top | Loved today's show!!! It was a variety and not... | positive |
| freq | 5 | 25000 |


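We can see that we have 50,000 reviews in our dataset, of which 49,582 are unique, so some reviews appear more than once (the most frequent one appears 5 times). The `sentiment` column has 2 unique values - `positive` and `negative`.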

Individual columns can be accessed using the `[]` operator. For example, `df['review']` returns the `review` column of the dataframe.

In [7]:
print(df['review'])

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object


Let's see how many positive and negative reviews we have in our dataset. We can use the `value_counts()` function to get the count of each unique value in the `sentiment` column.

In [8]:
# TODO: Use the value_counts function to count the number of positive and negative reviews on the sentiment column using the [] operator
print("Number of positive and negative reviews")
print(df['sentiment'].value_counts())

print("\nSentiment distribution (percentages)")
print(df['sentiment'].value_counts(normalize=True) * 100)

Number of positive and negative reviews
sentiment
positive    25000
negative    25000
Name: count, dtype: int64

Sentiment distribution (percentages)
sentiment
positive    50.0
negative    50.0
Name: proportion, dtype: float64


We can see that we have 25,000 positive reviews and 25,000 negative reviews in our dataset. They are evenly distributed and we do not have to worry about class imbalance.

[Follow this link to learn more about class imbalance](https://machinelearningmastery.com/what-is-imbalanced-classification/).
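
For intuition about what `value_counts(normalize=True)` computes, here is a rough plain-Python equivalent on a made-up list of labels. The names `toy_labels` and `class_shares` are illustrative only, and the toy list is deliberately imbalanced to show what an uneven split would look like.

```python
from collections import Counter

# Made-up labels: three positive, one negative.
toy_labels = ["positive", "negative", "positive", "positive"]

# Count each label, then convert the counts to percentages of the total.
counts = Counter(toy_labels)
total = sum(counts.values())
class_shares = {label: 100 * n / total for label, n in counts.items()}

print(counts)
print(class_shares)
```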

In [9]:
print("\n--- Sample Reviews ---")
print("Sample positive review:")
print(df[df['sentiment'] == 'positive']['review'].iloc[0][:500] + "...")  # First 500 chars

print("\nSample negative review:")
print(df[df['sentiment'] == 'negative']['review'].iloc[0][:500] + "...")  # First 500 chars

print("\n--- Review length statistics ---")
df['review_length'] = df['review'].str.len()
print(f"Average review length: {df['review_length'].mean():.0f} characters")
print(f"Shortest review: {df['review_length'].min()} characters")
print(f"Longest review: {df['review_length'].max()} characters")


--- Sample Reviews ---
Sample positive review:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ...

Sample negative review:
Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing 


---

**Step 4:**

Let's take a short break from coding and do some reading that is essential for understanding the concepts of this project.


The **objective** of our machine learning model will be to predict the sentiment of a review given the text of the review. So, the model needs to learn the relationship between the text of the review and the sentiment of the review. Hence, this is a supervised learning problem where the input is text and the output is a label.

[Click here to watch introductory videos and learn more about supervised machine learning](https://www.youtube.com/playlist?list=PLNs9ZO9jGtUCiGTo3iP0qmI9_qi8oYaRN).



Since we are going to be using text as input, we cannot feed it to the model directly, because machine learning models operate on numbers rather than raw text. We need to convert the text into a numeric format that is useful for our classification model.

Count vectorization is a method to convert text into a format that is useful for classification models. It converts the text into a matrix of token counts meaning that each row in the matrix represents a review and each column represents a word. The value in each cell is the number of times that word occurs in that review. So, by learning the frequency of each word in each review, the model can learn the relationship between the text and the sentiment of the review. The intuition behind this is that positive reviews will have more positive words and negative reviews will have more negative words.
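
The matrix described above can be built by hand for a few short texts. This is a minimal sketch for intuition only, not scikit-learn's implementation; the three toy reviews and all variable names are made up.

```python
from collections import Counter

toy_reviews = ["good good movie", "bad movie", "good plot bad acting"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for review in toy_reviews for word in review.split()})

# One row per review, one column per vocabulary word.
count_matrix = []
for review in toy_reviews:
    word_counts = Counter(review.split())
    count_matrix.append([word_counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

The first row has a 2 in the `good` column because that word occurs twice in the first toy review.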

Now that we have established the intuition behind count vectorization, let's look at features of the count vectorizer. The features of the count vectorizer are the words that we want to consider. We would only want to use words that are relevant to the sentiment of the review. For example, if we are classifying reviews of movies, we would not want to consider words like `the`, `a`, `an` etc. because they are not relevant to the sentiment of the review. Also, we would want to consider words that occur frequently in the reviews. For example, if a word occurs only once in the entire dataset, it is not very useful for our model.
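
The two filtering ideas above (drop uninformative words, drop rare words) can be sketched in plain Python. The stopword set and the threshold below are hypothetical values chosen for this toy example; `CountVectorizer` exposes similar controls through its `stop_words` and `min_df` parameters.

```python
from collections import Counter

toy_reviews = [
    "the movie was a delight",
    "the plot was a mess",
    "a delight from start to finish",
]
toy_stopwords = {"the", "a", "was", "to", "from"}  # hypothetical, hand-picked
min_document_count = 2                             # hypothetical threshold

# Document frequency: in how many reviews does each word appear at least once?
document_frequency = Counter()
for review in toy_reviews:
    document_frequency.update(set(review.split()))

# Keep words that are not stopwords and appear in enough reviews.
kept_words = sorted(
    word
    for word, n_docs in document_frequency.items()
    if word not in toy_stopwords and n_docs >= min_document_count
)
print(kept_words)
```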

To remove words that are not relevant to the sentiment of the review, first we need to tokenize the text.

Tokenization is the process of splitting a string into a list of tokens. This breaks the text into smaller chunks that are easier to work with. After tokenizing, we want to remove punctuation and special characters, because they add little value for this task. We also want to convert all the text to lowercase so that the model does not treat the same word with different cases as different words.
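
A minimal sketch of the idea using only a regular expression. This is a simplification for illustration and is not how NLTK's `word_tokenize` works internally; `simple_tokenize` is a hypothetical helper.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


sample = "A wonderful little production. <br /><br />The filming is GREAT!"
tokens = simple_tokenize(sample)
print(tokens)
```

Note that the `br` fragments of the HTML line breaks survive as tokens here, which suggests that markup in the reviews may need separate handling.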

---
---



<h3 align = "center">
    Milestone #2
</h3>

NOTE: These steps are to be completed **individually**, not as a team. You are encouraged to discuss steps with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to learn natural language processing and how to use the NLTK library to preprocess text. We will also learn how to use the CountVectorizer class to convert text into a format that is useful for classification models.

**Step 1:**

We will use the `nltk` library to perform these preprocessing steps. First, we will use the `word_tokenize` function to tokenize the text.

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the punkt tokenizer
nltk.download('punkt')
nltk.download('punkt_tab') # Download the missing 'punkt_tab' resource

# Apply word_tokenize to the 'review' column and create a new column for tokens
df['review_tokens'] = df['review'].apply(word_tokenize)

print("Tokenization completed!")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
# The new `review_tokens` column contains a list of tokens for each review. Let's look at the tokens of the review at index 1 (the second review).
df['review_tokens'][1]

We see that the text has been tokenized into a list of words. Also, the list contains punctuation and special characters which we do not want.

---

**Step 2:**


Let's clean the text by removing punctuation and special characters and converting the text to lowercase. We will use the string method `isalpha` to check whether a token consists only of letters; if it does not, we drop it from the list. We will then convert the remaining tokens to lowercase using the `lower` method. Next, we will remove the stopwords. Stopwords are very common words that add little meaning to the text, for example `the`, `a`, and `an`. We will load the English stopword list from the `stopwords` corpus in the `nltk.corpus` package and use a list comprehension to keep only the tokens that are not in that list.
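
Before running these steps on all 50,000 reviews, it can help to try them on one hand-written token list. The tokens and the four-word stopword set below are made up for illustration; NLTK's English stopword list is much longer.

```python
# Hypothetical tokens, roughly what a tokenizer might return for
# "The film isn't a 10/10, but Enjoyable!"
toy_tokens = ["The", "film", "is", "n't", "a", "10", "/", "10", ",", "but", "Enjoyable", "!"]
toy_stopwords = {"the", "is", "a", "but"}  # hypothetical, hand-picked

# Step 1: keep only tokens made up entirely of letters.
alphabetic = [token for token in toy_tokens if token.isalpha()]

# Step 2: lowercase every remaining token.
lowercased = [token.lower() for token in alphabetic]

# Step 3: drop the stopwords.
without_stopwords = [token for token in lowercased if token not in toy_stopwords]

print(alphabetic)
print(lowercased)
print(without_stopwords)
```

One side effect worth keeping in mind: the negation fragment `n't` contains an apostrophe, so `isalpha` removes it and the cleaned tokens no longer record that the sentence was negated.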

In [None]:
# isalpha() returns True if the string is non-empty and every character in it is a letter; otherwise it returns False.

# We can use isalpha() to drop tokens that contain punctuation or digits.

# Remove non-alphabetic tokens using isalpha()
df['review_tokens_clean'] = df['review_tokens'].apply(
    lambda tokens: [token for token in tokens if token.isalpha()]
)

print("Non-alphabetic tokens removed!")


In [None]:
print(" ".join(df['review_tokens_clean'][1]))

In [None]:
#TODO: convert to lowercase
#complete the code below
#df[''] = df[''].apply(lambda x: [item. for item in x])

In [None]:
df['review_tokens_lower'] = df['review_tokens_clean'].apply(lambda x: [item.lower() for item in x])

In [None]:
# remove stopwords
import nltk
from nltk.corpus import stopwords

# Download stopwords if you haven't already
nltk.download('stopwords')

# Get the English stopwords list
stop_words = set(stopwords.words('english'))

print(f"Number of stopwords: {len(stop_words)}")
print(f"Sample stopwords: {list(stop_words)[:10]}")

# Remove stopwords from the lowercase tokens
df['review_tokens_no_stopwords'] = df['review_tokens_lower'].apply(
    lambda tokens: [token for token in tokens if token not in stop_words]
)

print("Stopwords removed!")

In [None]:
df

---

**Step 3:**

Now that we have cleaned the text, we need to use a stemmer to stem the words. Stemming is the process of reducing a word to its root form. For example, the root form of the word `running` is `run`. Note that a stem is not always a dictionary word (the Porter stemmer turns `movie` into `movi`, for example). Stemming helps us to reduce the number of unique words in the text. We will use the `PorterStemmer` class from the `nltk.stem` package to stem the words.
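
To see why stemming shrinks the vocabulary, here is a deliberately crude suffix stripper. It is a toy stand-in and NOT the Porter algorithm; `toy_stem` and the word list are hypothetical.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["watch", "watched", "watching", "acts", "acted", "acting"]
stems = [toy_stem(word) for word in words]

print(stems)
print(len(set(words)), "unique words ->", len(set(stems)), "unique stems")
```

A real stemmer applies many more rules. This toy version, for instance, would turn `running` into `runn` because it does not undo the doubled consonant.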

In [None]:
# stemming using PorterStemmer
from nltk.stem import PorterStemmer

# Initialize the Porter Stemmer
ps = PorterStemmer()

print("Porter Stemmer initialized!")

# Apply stemming to the tokens (using the stopword-removed version)
df['review_tokens_stemmed'] = df['review_tokens_no_stopwords'].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

print("Stemming completed!")


In [None]:
df

In [None]:
#join list of words to form sentences
df['review_cleaned'] = df['review_tokens_stemmed'].apply(
    lambda tokens: ' '.join(tokens)
)

print("Tokens joined back into sentences!")

In [None]:
df

In [None]:
# Compare original vs cleaned reviews
print("--- Original Review Sample ---")
print(df['review'].iloc[0][:300] + "...")

print("\n--- Cleaned Review Sample ---")
print(df['review_cleaned'].iloc[0][:300] + "...")

print(f"\nOriginal review length: {len(df['review'].iloc[0])} characters")
print(f"Cleaned review length: {len(df['review_cleaned'].iloc[0])} characters")

# Show the complete preprocessing pipeline for one review
print("\n--- Complete Preprocessing Pipeline for One Review ---")
print("Original:", df['review'].iloc[0][:150] + "...")
print("Tokenized:", df['review_tokens'].iloc[0][:10])
print("Cleaned (alpha only):", df['review_tokens_clean'].iloc[0][:10])
print("Lowercase:", df['review_tokens_lower'].iloc[0][:10])
print("No stopwords:", df['review_tokens_no_stopwords'].iloc[0][:10])
print("Stemmed:", df['review_tokens_stemmed'].iloc[0][:10])
print("Final joined:", df['review_cleaned'].iloc[0][:150] + "...")

In [None]:
# Create cleaned versions from different preprocessing stages
df['review_no_stopwords'] = df['review_tokens_no_stopwords'].apply(lambda x: ' '.join(x))
df['review_lower_clean'] = df['review_tokens_lower'].apply(lambda x: ' '.join(x))

print("Multiple cleaned versions created!")

In [None]:
print("--- Final Dataset Overview ---")
print(f"Dataset shape: {df.shape}")
print("\nColumns available:")
print(df.columns.tolist())

print("\n--- Sample of Final Cleaned Reviews ---")
for i in range(2):
    print(f"\nReview {i+1}:")
    print(f"Sentiment: {df['sentiment'].iloc[i]}")
    print(f"Cleaned text: {df['review_cleaned'].iloc[i][:200]}...")

---
---



<h3 align = "center">
    Milestone #3
</h3>

NOTE: These steps are to be completed **individually**, not as a team. You are encouraged to discuss steps with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to split the dataset into training and testing sets. We will also learn how to use the CountVectorizer class to convert text into a format that is useful for classification models. We will also learn how to use the MultinomialNB class to train a Naive Bayes classifier.

Training and Testing Data:

Machine learning uses algorithms to learn from data in datasets. They find patterns, develop understanding, make decisions, and evaluate those decisions.

In machine learning, datasets are split into two subsets:

The first subset is known as the **training data** - it’s a portion of our actual dataset that is fed into the machine learning model to discover and learn patterns. In this way, it trains our model.

The other subset is known as the **testing data**.

Once your machine learning model is built (with your training data), you need unseen data to test your model. This data is called testing data, and you can use it to evaluate the performance and progress of your algorithms’ training and adjust or optimize it for improved results.

Testing data has two main criteria. It should:

1. Represent the actual dataset
2. Be large enough to generate meaningful predictions


**Step 1:**

Now, the data is tokenized, cleaned and reduced to its root form.
The next step is to split the data for training and testing. We split the data because we need to train our model on some data and test it on some data. We have a total of 50,000 reviews, so let's split it into 40,000 reviews for training and 10,000 reviews for testing.
To do this, we can use the slice operator `:`. For example, `df[:30000]` returns the first 30,000 rows of the dataframe. Similarly, `df[30000:]` returns the last 20,000 rows of the dataframe.

Name the training data as `train_reviews` and testing data as `test_reviews`. Remember, we are only splitting the reviews column and will do the same for sentiment in the next step.
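
The slicing pattern can be checked on a five-item toy list first. The lists and the cut point `split_at` are made up for this sketch; the important detail is that reviews and labels are cut at the same position so that they stay aligned.

```python
toy_reviews = ["r0", "r1", "r2", "r3", "r4"]
toy_labels = ["pos", "neg", "pos", "neg", "pos"]
split_at = 4  # hypothetical cut point: first 4 items for training, the rest for testing

# Slice both lists at the same index.
train_part, test_part = toy_reviews[:split_at], toy_reviews[split_at:]
train_labels, test_labels = toy_labels[:split_at], toy_labels[split_at:]

print(len(train_part), len(test_part))
print(test_part, test_labels)
```

Slicing without shuffling assumes the rows are not sorted by label; checking the label counts in each split is a quick way to confirm that.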

In [None]:
# Split the reviews into training and testing sets using slicing

#train reviews
train_reviews = df['review_cleaned'][:40000]  # First 40,000 reviews for training
print(f"Training reviews shape: {train_reviews.shape}")

In [None]:
# Split the reviews into training and testing sets using slicing

# test reviews
test_reviews = df['review_cleaned'][40000:]   # Last 10,000 reviews for testing
print(f"Testing reviews shape: {test_reviews.shape}")

Now let us do the same for the sentiment column. Name the training data as `train_sentiments` and testing data as `test_sentiments`.

In [None]:
#TODO: train sentiments
train_sentiments = df['sentiment'][:40000]  # First 40,000 sentiments for training
print(f"Training sentiments shape: {train_sentiments.shape}")

In [None]:
#TODO: test sentiments
test_sentiments = df['sentiment'][40000:]   # Last 10,000 sentiments for testing
print(f"Testing sentiments shape: {test_sentiments.shape}")

In [None]:
# Verification of the distribution in both sets
print("\n--- Training Set Sentiment Distribution ---")
print(train_sentiments.value_counts())

print("\n--- Testing Set Sentiment Distribution ---")
print(test_sentiments.value_counts())

---


**Step 2:**

We need to make a few changes to the data before we can use it to train our model. First, we need to convert the text into a numeric format. We will use the `CountVectorizer` class from the `sklearn.feature_extraction.text` package to convert the text into a matrix of token counts.

For the sentiment column, we need to convert the labels into numbers. The code below uses the `LabelBinarizer` class from the `sklearn.preprocessing` package, which, like the closely related `LabelEncoder`, maps each label to a number.

[To read more about Count Vectorizer, follow this link](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). [You can also use this link to read more about Label Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Please go through the parameters of both these functions to better understand the code below.
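
One `CountVectorizer` parameter worth a closer look is `ngram_range`, which controls whether the features are single words only or also short word sequences. The helper below is a hypothetical plain-Python sketch of what a range of `(1, 3)` means; it is not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams(["not", "good", "movie"], 1, 3)
print(grams)
```

Including two- and three-word sequences lets a feature such as `not good` exist alongside `good`, at the cost of a much larger vocabulary.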

In [None]:
#Count vectorizer for bag of words
cv = CountVectorizer(min_df=1, max_df=1.0, binary=False, ngram_range=(1,3))

To transform the data, we will use the `fit_transform` method. It fits the vectorizer to the data (that is, it learns the vocabulary) and then transforms the data into a matrix of token counts. We will use `fit_transform` on the training data and `transform` on the testing data, because the vocabulary should be learned from the training data only and then applied unchanged to the testing data.
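
The difference between fitting and transforming can be made concrete with a heavily simplified stand-in. `ToyCountVectorizer` is a hypothetical class written for this sketch and only mimics the method names; it is not scikit-learn's code.

```python
from collections import Counter


class ToyCountVectorizer:
    """Hypothetical, heavily simplified stand-in for a count vectorizer."""

    def fit(self, documents):
        # Learn the vocabulary (and column order) from these documents only.
        words = sorted({word for doc in documents for word in doc.split()})
        self.vocabulary_ = {word: column for column, word in enumerate(words)}
        return self

    def transform(self, documents):
        # Count only words already in the vocabulary; unseen words are ignored.
        rows = []
        for doc in documents:
            counts = Counter(doc.split())
            rows.append([counts[word] for word in self.vocabulary_])
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)


toy_vectorizer = ToyCountVectorizer()
train_matrix = toy_vectorizer.fit_transform(["good movie", "bad movie"])
test_matrix = toy_vectorizer.transform(["good good ending"])

print(toy_vectorizer.vocabulary_)
print(train_matrix)
print(test_matrix)
```

The word `ending` never appeared in the toy training texts, so it has no column and is dropped from the test row. The train and test matrices therefore have the same number of columns, which the classifier requires.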

In [None]:
#transformed train reviews
cv_train_reviews = cv.fit_transform(train_reviews)

In [None]:
#transformed test reviews
cv_test_reviews = cv.transform(test_reviews)

In [None]:
print("CountVectorizer transformation completed!")
print(f"Training features shape: {cv_train_reviews.shape}")
print(f"Testing features shape: {cv_test_reviews.shape}")

Again, for the sentiment column, we will use the `fit_transform` function on the training data and the `transform` function on the testing data.
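
The same fit-then-transform idea applies to labels. The sketch below mimics it with a plain dictionary on made-up labels; it illustrates the concept and is not the scikit-learn implementation.

```python
toy_train_labels = ["positive", "negative", "negative"]
toy_test_labels = ["negative", "positive"]

# "Fit": learn the sorted set of classes from the training labels only.
classes = sorted(set(toy_train_labels))
label_to_number = {label: number for number, label in enumerate(classes)}

# "Transform": apply the learned mapping to either split.
encoded_train = [label_to_number[label] for label in toy_train_labels]
encoded_test = [label_to_number[label] for label in toy_test_labels]

print(label_to_number)
print(encoded_train, encoded_test)
```

Because the classes are sorted alphabetically, `negative` maps to 0 and `positive` to 1 in this sketch.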

In [None]:
#labeling the sentiment data
lb = LabelBinarizer()

In [None]:
# transformed sentiment data
lb_train_sentiments = lb.fit_transform(train_sentiments)

In [None]:
#TODO: transformed test sentiment data (similar to count vectorizer, transform test reviews, name it lb_test_sentiments)
lb_test_sentiments = lb.transform(test_sentiments)

In [None]:
print("Label binarization completed!")
print(f"Training labels shape: {lb_train_sentiments.shape}")
print(f"Testing labels shape: {lb_test_sentiments.shape}")
print(f"Label classes: {lb.classes_}")

---

**Step 3:**

Model Building: In this step, we will build our model. We will use the `MultinomialNB` class from the `sklearn.naive_bayes` package to build our model. The Multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. Bag-of-words counts are an example of integer-valued discrete features.

[Please read about the Multinomial Naive Bayes classifier here](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) and write about it in the comments below.

<h4 style="color:orange">
    TO-DO
</h4>

Write a few lines about the following:

- **Machine Learning Classifiers:** Algorithms that automatically learn to categorize data into predefined classes or labels based on patterns and features in the training data. They work by finding relationships between input features and output labels, then using these learned patterns to predict labels for new, unseen data. Common classifiers include Naive Bayes, Logistic Regression, Support Vector Machines, and Decision Trees.

- **Naive Bayes Classifier:** A probabilistic classification algorithm based on Bayes' theorem with a "naive" assumption of feature independence. It calculates the probability of each class given the input features and selects the class with the highest probability. Despite its simplifying assumption (that all features are independent), it works well for text classification tasks like sentiment analysis and spam detection because it's fast, efficient, and performs well with high-dimensional data like text features.
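
To make the description above concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, working in log space. The function names, the smoothing value `alpha=1.0`, and the four toy documents are assumptions made for this example; the project itself uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_toy_nb(documents, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocabulary = {word for doc in documents for word in doc.split()}
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(word_counts[label].values())
        denominator = total_words + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(documents)),
            "log_likelihood": {
                word: math.log((word_counts[label][word] + alpha) / denominator)
                for word in vocabulary
            },
        }
    return model


def predict_toy_nb(model, document):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for word in document.split():
            # Words never seen in training are skipped.
            score += params["log_likelihood"].get(word, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


toy_docs = ["great fun great cast", "boring plot", "great plot", "boring boring cast"]
toy_labels = ["positive", "negative", "positive", "negative"]
toy_model = train_toy_nb(toy_docs, toy_labels)

print(predict_toy_nb(toy_model, "great cast"))
print(predict_toy_nb(toy_model, "boring"))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long reviews, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.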




The first part is to train and fit the Multinomial Naive Bayes classifier to the training data. We will use the `fit` function to train the model on the training data.

In [None]:
# instantiating the model
mnb = MultinomialNB()

In [None]:
# fitting the model
mnb_bow = mnb.fit(cv_train_reviews, lb_train_sentiments.ravel())

In [None]:
#Predicting sentiments for the test set with the bag-of-words model
mnb_bow_predict = mnb.predict(cv_test_reviews)


In [None]:
#Accuracy score for bag of words
mnb_bow_score = accuracy_score(lb_test_sentiments, mnb_bow_predict)
print("Accuracy :", mnb_bow_score)

This means the model correctly classified 88.48% of the 10,000 test reviews.
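
Accuracy is the share of test labels that the model predicted correctly. A minimal sketch of the calculation on made-up labels follows; `toy_accuracy` is a hypothetical helper, not scikit-learn's `accuracy_score`.

```python
def toy_accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


toy_score = toy_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(toy_score)
```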

---
---

<h3 align = "center">
    Milestone #4
</h3>

NOTE: This milestone is to be completed as a **group**.  Each group member should try a different classifier and you must discuss the results with your teammates. Please attend Office Hours or ask your questions on Slack.

GOAL: The main goal of this milestone is to understand how to use different classifiers to train a model and how to evaluate the performance of the model.

**Step 1:**

[Please read about the different classifiers here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning).
Each team member should try at least one **different** classifier. You can try more than one classifier if you want. Please write about the classifier you have chosen in the comments below.

<h4 style="color:orange">
    TO-DO
</h4>

**Classifiers chosen**:
1. **Logistic Regression:**  Logistic Regression is a fundamental linear classifier that works well for text classification. It's fast, interpretable, and provides probability estimates. For sentiment analysis, it can learn which words contribute most to positive/negative sentiment.

    Expected results: Typically performs very well on text data, often comparable to or better than Naive Bayes.

2. **Linear Support Vector Machine (SVM):** SVMs are powerful for high-dimensional data like text. They find the optimal decision boundary that maximizes the margin between classes. LinearSVC is efficient for large-scale text classification.

    Expected results: Often among the strongest classical baselines for text classification, though it can be slower to train.

3. **Random Forest:** As an ensemble method, Random Forest combines multiple decision trees. It's robust to overfitting and can capture complex non-linear relationships. It provides feature importance scores.

    Expected results: May perform well but could be slower and use more memory than linear models for text data.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Dictionary to store results
classifier_results = {}

In [None]:
# Classifier 1: Logistic Regression
print("=" * 60)
print("CLASSIFIER 1: LOGISTIC REGRESSION")
print("=" * 60)

from sklearn.linear_model import LogisticRegression

# instantiate the model
lr = LogisticRegression(random_state=42, max_iter=1000)

# fit the model on the training data
lr.fit(cv_train_reviews, lb_train_sentiments.ravel())

# predict sentiments for the test data
lr_predict = lr.predict(cv_test_reviews)

# Accuracy score
lr_acc = accuracy_score(lb_test_sentiments, lr_predict)

print(f"Accuracy: {lr_acc:.4f} ({lr_acc*100:.2f}%)")
classifier_results['Logistic Regression'] = lr_acc

In [None]:
# Classifier 2: Linear Support Vector Machine (SVM)
print("\n" + "=" * 60)
print("CLASSIFIER 2: LINEAR SUPPORT VECTOR MACHINE")
print("=" * 60)

from sklearn.svm import LinearSVC

# instantiate the model
svm = LinearSVC(random_state=42, max_iter=2000)

# fit the model on the training data
svm.fit(cv_train_reviews, lb_train_sentiments.ravel())

# predict sentiments for the test data
svm_predict = svm.predict(cv_test_reviews)

# Accuracy score
svm_acc = accuracy_score(lb_test_sentiments, svm_predict)

print(f"Accuracy: {svm_acc:.4f} ({svm_acc*100:.2f}%)")
classifier_results['Linear SVM'] = svm_acc


In [None]:
# Classifier 3: Random Forest
print("\n" + "=" * 60)
print("CLASSIFIER 3: RANDOM FOREST")
print("=" * 60)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# instantiate the model
rf = RandomForestClassifier(random_state=42, n_estimators=100)

# fit the model on the training data
rf.fit(cv_train_reviews, lb_train_sentiments.ravel())

# predict sentiments for the test data
rf_predict = rf.predict(cv_test_reviews)

# Accuracy score
rf_acc = accuracy_score(lb_test_sentiments, rf_predict)

print(f"Accuracy: {rf_acc:.4f} ({rf_acc*100:.2f}%)")
classifier_results['Random Forest'] = rf_acc

In [None]:
# Compare all classifiers to Naive Bayes
classifier_results['Naive Bayes'] = mnb_bow_score

print("\n" + "=" * 60)
print("CLASSIFIER COMPARISON")
print("=" * 60)
for classifier, accuracy in sorted(classifier_results.items(), key=lambda x: x[1], reverse=True):
    print(f"{classifier:20}: {accuracy:.4f} ({accuracy*100:.2f}%)")

In [None]:
# Detailed evaluation for each classifier
classifiers = {
    'Logistic Regression': (lr, lr_predict),
    'Linear SVM': (svm, svm_predict),
    'Random Forest': (rf, rf_predict),
    'Naive Bayes': (mnb_bow, mnb_bow_predict)
}

for name, (classifier, predictions) in classifiers.items():
    print(f"\n{name} - Detailed Classification Report:")
    print("=" * 50)
    print(classification_report(lb_test_sentiments, predictions, target_names=lb.classes_))
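
The classification report goes beyond accuracy by listing precision, recall, and F1 for each class. The sketch below computes these three numbers by hand for one class on made-up labels; `precision_recall_f1` is a hypothetical helper written for illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, positive_label=1):
    """Compute precision, recall, and F1 for one class from paired labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(precision, recall, f1)
```

Precision answers "of the reviews predicted positive, how many really were?", while recall answers "of the truly positive reviews, how many did the model find?".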

---

**Step 2:**

Data visualizations are a great way to understand the data. [If you are interested in finding additional resources on data visualization, we recommend the DSRR resources, linked here.](https://nebigdatahub.org/nsdc/data-science-resource-repository/) We will use word clouds, which draw the most frequent words in a text larger than the less frequent ones. To build them, we will use the `WordCloud` class from the `wordcloud` package.

[Please read about the WordCloud class here](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html).
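
As a rough illustration of what a word cloud summarizes, the sketch below counts word frequencies with the standard library only. The `sample_text` string is a made-up stand-in for the joined `review_cleaned` text, not project data. The `wordcloud` package does its own tokenization and, by default, filters stopwords and groups common word pairs, so its counts can differ from this simple version.

```python
from collections import Counter

# Hypothetical stand-in for the joined 'review_cleaned' text of one sentiment class.
sample_text = "great movie great cast weak plot great soundtrack weak ending"

# Count how often each whitespace-separated token occurs.
word_counts = Counter(sample_text.split())

# The most frequent words are the ones a word cloud would draw largest.
top_words = word_counts.most_common(2)
print(top_words)
```

Checking the top counts this way is a quick sanity test before rendering: if the largest words are uninformative (for example "movie" or "film"), they can be added to the stopword set passed to `WordCloud`.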



In [None]:
# word cloud for positive review words in the entire dataset
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
%matplotlib inline

# join all the positive reviews
positive_words = ' '.join(list(df[df['sentiment'] == 'positive']['review_cleaned']))

# word cloud for positive words (named wordcloud_positive so later cells can reuse it)
wordcloud_positive = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(positive_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud_positive, interpolation="bilinear")
plt.axis('off')
plt.show()



In [None]:
# Word cloud for negative review words in the entire dataset
negative_words = ' '.join(list(df[df['sentiment'] == 'negative']['review_cleaned']))

# Word cloud for negative words
wordcloud_negative = WordCloud(
    width=800,
    height=500,
    random_state=21,
    max_font_size=110,
    background_color='white',
    colormap='Reds'
).generate(negative_words)

plt.figure(figsize=(12, 8))
plt.imshow(wordcloud_negative, interpolation="bilinear")
plt.axis('off')
plt.title('Most Frequent Words in Negative Reviews', fontsize=16, pad=20)
plt.show()

In [None]:
# Visualization of classifier performance
plt.figure(figsize=(10, 6))
classifiers_names = list(classifier_results.keys())
accuracies = list(classifier_results.values())

bars = plt.bar(classifiers_names, accuracies, color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
# Zoom in on the observed accuracy range without clipping any bar
plt.ylim(max(0.0, min(accuracies) - 0.05), min(1.0, max(accuracies) + 0.05))
plt.title('Classifier Performance Comparison on Sentiment Analysis')
plt.ylabel('Accuracy')
plt.xlabel('Classifiers')

# Add value labels on bars
for bar, accuracy in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.002,
             f'{accuracy:.4f}', ha='center', va='bottom')

plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Display cleaned word clouds
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

ax1.imshow(wordcloud_positive, interpolation="bilinear")
ax1.set_title('Positive Reviews - Cleaned Text', fontsize=16)
ax1.axis('off')

ax2.imshow(wordcloud_negative, interpolation="bilinear")
ax2.set_title('Negative Reviews - Cleaned Text', fontsize=16)
ax2.axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Detailed analysis of the best model (SVM)
import seaborn as sns
print("=" * 60)
print("DETAILED ANALYSIS - LINEAR SVM (BEST PERFORMER)")
print("=" * 60)

print(classification_report(lb_test_sentiments, svm_predict_fixed, target_names=lb.classes_))

# Confusion matrix for SVM
cm_svm = confusion_matrix(lb_test_sentiments, svm_predict_fixed)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_svm, annot=True, fmt='d', cmap='Blues',
            xticklabels=lb.classes_, yticklabels=lb.classes_)
plt.title('Confusion Matrix - Linear SVM (Best Model)')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

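# Calculate key metrics for the class encoded as 1.
# For binary labels, confusion_matrix(...).ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = cm_svm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print("\nKey Metrics for SVM:")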
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")

---
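
The precision, recall, and F1 formulas used above can be checked by hand on a small example. The sketch below uses made-up labels (1 = positive, 0 = negative), not the project's predictions, and only the standard library.

```python
# Hypothetical labels for illustration only: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Count the four cells of the binary confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives predicted positive
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted negative
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives predicted negative

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"tp={tp} fp={fp} fn={fn} tn={tn}")
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

Here the model finds 3 of the 4 positive reviews (recall 0.75), but 2 of its 5 positive predictions are wrong (precision 0.60), which accuracy alone would not reveal.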

**Step 3:** (Extra Points: Optional!!)

If you complete this step and the remaining ones correctly, we will endorse your data science skills on LinkedIn.

Make a visualization of your choice! This is your chance to show your creativity and explore different visualization techniques. You can use the `matplotlib` package, the `seaborn` package, or both.
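
One possible starting point, offered only as a sketch, is to compare the average review length per sentiment. The `reviews` list below is a made-up stand-in for the `review_cleaned` and `sentiment` columns, and the helper names `mean_length_by_sentiment` and `plot_mean_lengths` are hypothetical, not part of the project template.

```python
from collections import Counter

# Hypothetical mini-dataset standing in for the review_cleaned and sentiment columns.
reviews = [
    ("loved every minute brilliant acting", "positive"),
    ("wonderful story", "positive"),
    ("dull and far too long", "negative"),
    ("terrible", "negative"),
]


def mean_length_by_sentiment(rows):
    """Return the average number of words per review for each sentiment label."""
    totals, counts = Counter(), Counter()
    for text, label in rows:
        totals[label] += len(text.split())
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


def plot_mean_lengths(mean_lengths):
    """Draw a bar chart; matplotlib is imported here so the helper above runs without it."""
    import matplotlib.pyplot as plt

    labels = list(mean_lengths)
    plt.bar(labels, [mean_lengths[k] for k in labels])
    plt.ylabel("Mean words per review")
    plt.title("Average Review Length by Sentiment")
    plt.show()


mean_lengths = mean_length_by_sentiment(reviews)
print(mean_lengths)
```

Calling `plot_mean_lengths(mean_lengths)` in a notebook cell renders the chart. With the real data, the same idea could be extended to a histogram or box plot of review lengths per sentiment.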

---
---

<h3 align = 'center' >
Thank you for completing the project!
</h3>

Please reach out to us if you have any questions or concerns. We are here to help you learn and grow.

You can contact the NSDC HQ Team at nsdc@nebigdatahub.org.
