# Sentiment Analysis of IMDB Reviews

## Problem Definition
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text.

### Objective
The primary goal of sentiment analysis is to understand the attitude or emotional state conveyed in a text, which is typically labelled as positive, negative, or neutral.

The task is to analyze the sentiment of IMDB movie reviews and classify them as either positive or negative.

### Input and Output
**Input:** The input is a dataset of IMDB movie reviews, loaded from a CSV file with a `review` column and a `sentiment` column.

**Features:** The primary feature is the text of the review itself.

**Output:** The model classifies each review as positive or negative. For each review, it outputs a numeric sentiment label:
- 1 for positive sentiment
- 0 for negative sentiment
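
The label convention above can be sketched with pandas as follows. This is a minimal illustration on a made-up two-row frame; the column name `sentiment` and the string labels `positive`/`negative` are assumed to match the CSV used later in this notebook, and the `label` column is introduced here only for the example.

```python
import pandas as pd

# Hypothetical miniature dataset, for illustration only
toy = pd.DataFrame({
    "review": ["A wonderful film.", "A dull, lifeless plot."],
    "sentiment": ["positive", "negative"],
})

# Encode the string labels as integers: positive -> 1, negative -> 0
label_map = {"positive": 1, "negative": 0}
toy["label"] = toy["sentiment"].map(label_map)

print(toy[["sentiment", "label"]])
```

Any string that is not a key of `label_map` would be mapped to `NaN`, so it may be worth checking for missing values after the mapping.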

### Machine Learning Category
This is a classification problem. Specifically, it is binary classification, because the output has two distinct classes: positive and negative sentiment.


## Data Analysis and Processing

### Load the libraries

In [1]:
import re
import warnings

import matplotlib
import matplotlib.pyplot as plt
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the Tk GUI backend so that plt.show() opens a plot window
matplotlib.use('TkAgg')

# Silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

### Visualizing the data

In [3]:
imdb_data = pd.read_csv('dataset/IMDB Dataset.csv')
print("Shape of the dataset:", imdb_data.shape, "\n")
print("Sample of the data:\n", imdb_data.head(10), "\n")

# Summary of the dataset
print("Summary of the dataset:\n", imdb_data.describe(), "\n")

# Sentiment count
imdb_data['sentiment'].value_counts()

FileNotFoundError: [Errno 2] No such file or directory: 'dataset/IMDB Dataset.csv'
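
If the CSV is not at the expected relative path, `pd.read_csv` raises `FileNotFoundError`. The following is a minimal sketch of a guarded load. `load_reviews` is a hypothetical helper that is not part of the notebook, and the required column names are an assumption based on how the later cells use the data.

```python
from pathlib import Path

import pandas as pd


def load_reviews(csv_path):
    """Load the review CSV, failing early with a descriptive message.

    Hypothetical helper: the expected columns ('review', 'sentiment')
    are assumed from how the notebook uses the data.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at '{path.resolve()}'. "
            "Place the file there or pass a different path."
        )
    data = pd.read_csv(path)
    missing = {"review", "sentiment"} - set(data.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return data


# Usage, with the path taken from the cell above:
# imdb_data = load_reviews('dataset/IMDB Dataset.csv')
```

Resolving the path in the error message shows the absolute location that was tried, which helps when the notebook is started from a different working directory.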

### Data statistics

In [None]:
plt.figure(figsize=(8, 6))
imdb_data['sentiment'].value_counts().plot(kind='bar', color=['magenta', 'orange'])
plt.title('Number of Reviews by Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=0)
plt.show()
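
Alongside the bar chart, the class balance can also be checked numerically. The sketch below uses a made-up frame as a stand-in for `imdb_data`; `value_counts(normalize=True)` returns the fraction of rows in each class.

```python
import pandas as pd

# Hypothetical stand-in for imdb_data, for illustration only
toy = pd.DataFrame(
    {"sentiment": ["positive", "negative", "positive", "negative"]}
)

counts = toy["sentiment"].value_counts()
shares = toy["sentiment"].value_counts(normalize=True)

# Side-by-side summary: absolute counts and relative shares per class
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```

If the shares were far from equal, accuracy alone would be a weak measure of a classifier's quality, so this check is worth running before modelling.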

### Text normalisation

In [None]:
# Stripping HTML tags
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


# Removing text enclosed in square brackets
def remove_between_square_brackets(text):
    # Raw string so that the backslashes reach the regex engine unchanged
    return re.sub(r'\[[^]]*\]', '', text)


# Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text


# Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(denoise_text)
print("Data after reducing noise:\n", imdb_data.head(10), "\n")


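# Define function for removing special characters
def remove_special_characters(text):
    # Keep letters, digits and whitespace; note A-Z, not A-z, which would
    # also keep [ \ ] ^ _ and the backtick
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text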


# Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(remove_special_characters)
print("Data after removing special characters:\n", imdb_data.head(10), "\n")


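# Stemming the text.
# Note: stemming runs before stopword removal, so stemmed forms
# (e.g. "thi" for "this") no longer match the stopword list.
ps = nltk.stem.PorterStemmer()  # created once and reused for every review


def simple_stemmer(text):
    return ' '.join(ps.stem(word) for word in text.split())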


# Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(simple_stemmer)
print("Data after stemming:\n", imdb_data.head(10), "\n")


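# Setting English stopwords.
# The stopword list and the tokenizer data used by word_tokenize must be
# downloaded once; newer NLTK releases may need 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
stop = set(nltk.corpus.stopwords.words('english'))
print("English stopwords:\n", stop, "\n")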


# Removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = word_tokenize(text)
    tokens = [token.strip() for token in tokens]

    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]

    filtered_text = ' '.join(filtered_tokens)
    return filtered_text


# Apply function on review column
imdb_data['review'] = imdb_data['review'].apply(remove_stopwords)
print("Data after removing the stopwords:\n", imdb_data.head(10), "\n")
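
To make the effect of the cleaning steps concrete, the sketch below applies the HTML, bracket, and special-character steps to one made-up review. The function name `clean` and the sample strings are illustrative assumptions; stemming and stopword removal are left out so that the example needs no NLTK downloads. It also shows why the character class should read `A-Z`: the range `A-z` additionally spans `[`, `\`, `]`, `^`, `_` and the backtick, so those characters would survive the substitution.

```python
import re

from bs4 import BeautifulSoup


def clean(text):
    """Illustrative cleaner combining three steps from the cell above."""
    text = BeautifulSoup(text, "html.parser").get_text()  # drop HTML tags
    text = re.sub(r"\[[^]]*\]", "", text)                 # drop [bracketed] spans
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)            # drop other symbols
    return text


sample = "Great film!<br /><br />[spoiler] The ending_was superb."
cleaned = clean(sample)
print(cleaned)

# Compare the over-wide range A-z with the intended range A-Z
wide = re.sub(r"[^a-zA-z0-9\s]", "", "snake_case^2")    # keeps _ and ^
strict = re.sub(r"[^a-zA-Z0-9\s]", "", "snake_case^2")  # removes _ and ^
print(wide, strict)
```

Because `get_text()` joins the text around removed tags without a separator, words on either side of a `<br />` can end up fused; passing a separator to `get_text` is one way to avoid that, if it matters for the analysis.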

### Correlation analysis

In [None]:
# Encode the labels as integers: positive -> 1, negative -> 0
imdb_data['sentiment'] = imdb_data['sentiment'].map({'positive': 1, 'negative': 0})

# Instantiate the vectorizer
vectorizer = TfidfVectorizer(max_features=1000)

# Fit and transform the review text
tfidf_matrix = vectorizer.fit_transform(imdb_data['review'])

# Convert to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Add the sentiment column to the tfidf dataframe
tfidf_df['sentiment'] = imdb_data['sentiment']

# Compute the correlation matrix
correlation_matrix = tfidf_df.corr()

# Get the correlation of each feature with the sentiment,
# dropping the trivial self-correlation of the label (always 1.0)
sentiment_correlation = correlation_matrix['sentiment'].drop('sentiment').sort_values(ascending=False)

print("Correlation analysis:\n", sentiment_correlation, "\n")