## Sentiment Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from textblob import TextBlob
import nltk
import re

In [2]:
movies = pd.read_csv('../data/clean/movie_df.csv')

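## Explode Columns
Split the comma-separated `events` and `genres` strings into lists, then explode them so that each row holds a single event and a single genre.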

In [None]:
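# Split the comma-separated strings into lists
movies['events'] = movies['events'].str.split(',')
movies['genres'] = movies['genres'].str.split(',')

# Explode both columns: each movie yields one row per (event, genre) combination
df = movies.explode('events').explode('genres').reset_index(drop=True)

# Remove stray whitespace left around the split values (e.g. ' gore' -> 'gore')
df['events'] = df['events'].str.strip()
df['genres'] = df['genres'].str.strip()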

# Show exploded DataFrame
display(df)
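To see what the double explode does to row counts, here is a minimal sketch on a made-up two-movie frame (the titles and values are illustrative, not taken from `movie_df.csv`). Each movie ends up with one row per (event, genre) pair, so row-level averages taken across movies weight films with many genres more heavily. Dropping duplicate (title, event) pairs gives an event-level view.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'events': ['car crash,blood or gore', 'flashing lights'],
    'genres': ['Action,Thriller,Crime', 'Drama'],
})

toy['events'] = toy['events'].str.split(',')
toy['genres'] = toy['genres'].str.split(',')

exploded = toy.explode('events').explode('genres').reset_index(drop=True)
print(len(exploded))  # 2 events x 3 genres + 1 event x 1 genre = 7 rows

# One row per (title, event), ignoring genres
events_only = exploded.drop_duplicates(subset=['title', 'events'])[['title', 'events']]
print(len(events_only))  # 3 rows
```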

## Sentiment Analysis on Events
Sentiment polarity ranges from -1 (negative sentiment) to +1 (positive sentiment). We can calculate the average sentiment per movie by aggregating the sentiments of its individual events.

In [None]:
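# Calculate the sentiment of a single event label
def get_event_sentiment(event):
    # Missing events arrive as NaN (a float), which TextBlob cannot parse
    if not isinstance(event, str):
        return np.nan
    # Polarity score of the event text (range is -1 to 1)
    return TextBlob(event).sentiment.polarity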

# Apply to the 'events' column
df['event_sentiment'] = df['events'].apply(get_event_sentiment)

# Check the DataFrame to confirm the sentiment analysis results
print(df[['title', 'events', 'event_sentiment']])

## Overall Sentiment for Each Movie
Each movie now spans several rows, one per (event, genre) pair, so grouping by `title` and averaging `event_sentiment` gives one score per movie. Every event of a movie is repeated once per genre, so the repetition does not change the per-movie mean.

In [None]:
# Group by 'title' and calculate the average sentiment for each movie
movie_sentiments = df.groupby('title')['event_sentiment'].mean().reset_index()

# Check the sentiment scores for each movie
print(movie_sentiments)


## One-Hot Encoding
We will one-hot encode the `genres`, `language`, and `events` columns to convert them into a numeric format suitable for correlation analysis.

In [None]:
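# One-hot encode the categorical columns; dtype=int yields 0/1 integers
# so the new columns survive the numeric filter applied before computing correlations
movies_encoded = pd.get_dummies(df, columns=['genres', 'language', 'events'], dtype=int)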

# Now, let's inspect the columns after encoding
print(movies_encoded.head())
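One detail worth checking when encoding for correlation: in pandas 2.0 and later, `get_dummies` returns boolean columns by default, and `select_dtypes(include=['number'])` does not pick up booleans. The sketch below, on a made-up frame, shows that passing `dtype=int` keeps the indicator columns in the numeric selection.

```python
import pandas as pd

# Hypothetical toy frame standing in for the exploded DataFrame
toy = pd.DataFrame({
    'genres': ['Action', 'Drama', 'Action'],
    'imdb_rating': [7.1, 6.4, 8.0],
})

# dtype=int forces 0/1 integer columns regardless of the pandas default
encoded = pd.get_dummies(toy, columns=['genres'], dtype=int)

# The indicator columns are now kept by the numeric filter
numeric = encoded.select_dtypes(include=['number'])
print(numeric.columns.tolist())
```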

## Correlation Matrix
Once the categorical data is encoded, compute the correlation matrix to see how the variables relate to each other. Each Pearson coefficient (the default for `DataFrame.corr`) ranges from -1 to 1, where:

- 1 means a perfect positive linear correlation
- -1 means a perfect negative linear correlation
- 0 means no linear correlation

In [None]:
movies_numeric = movies_encoded.select_dtypes(include=['number'])

# the correlation matrix for numeric columns
correlation_matrix = movies_numeric.corr()

print(correlation_matrix)

# heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix of Numeric Variables")
plt.show()
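With many one-hot columns the annotated heatmap becomes hard to read, so it can help to list the strongest pairs instead. The following is a minimal sketch on made-up numeric data (the `runtime` column is hypothetical): it keeps the upper triangle of the matrix so each pair appears once, then ranks pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame; column names and values are illustrative
toy = pd.DataFrame({
    'popularity': [10.0, 20.0, 30.0, 40.0, 50.0],
    'imdb_rating': [5.0, 6.1, 6.9, 8.2, 9.0],
    'runtime': [120.0, 95.0, 130.0, 90.0, 110.0],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
top_pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top_pairs.head(10))
```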


In [None]:
# Group by 'events' and calculate the average popularity and ratings for each event
events_popularity_ratings = df.groupby('events').agg({
    'popularity': 'mean',
    'imdb_rating': 'mean',
    'tmdb_rating': 'mean'
}).reset_index()

# Sort the result to check which events have higher popularity or ratings
events_popularity_ratings_sorted = events_popularity_ratings.sort_values(by='popularity', ascending=False)

# Display the top events with the highest popularity
print(events_popularity_ratings_sorted.head())

# Optional: Sort by IMDb and TMDB ratings as well
events_sorted_by_imdb = events_popularity_ratings.sort_values(by='imdb_rating', ascending=False)
events_sorted_by_tmdb = events_popularity_ratings.sort_values(by='tmdb_rating', ascending=False)

# Display the top events by IMDb rating and TMDB rating
print('Top events by IMDb rating:')
print(events_sorted_by_imdb.head())

print('Top events by TMDB rating:')
print(events_sorted_by_tmdb.head())
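Averages for events that occur only a handful of times can top these rankings by chance. One possible refinement, sketched here on made-up rows, is to compute the row count alongside the mean and keep only events above a minimum count. `MIN_ROWS` is an arbitrary illustrative threshold, not a value from the original analysis.

```python
import pandas as pd

# Hypothetical rows; in the notebook this would be the exploded `df`
toy = pd.DataFrame({
    'events': ['car crash', 'car crash', 'car crash', 'flashing lights'],
    'imdb_rating': [6.0, 7.0, 8.0, 9.5],
})

# Named aggregation: mean rating and number of rows per event
summary = toy.groupby('events').agg(
    mean_imdb=('imdb_rating', 'mean'),
    n_rows=('imdb_rating', 'count'),
).reset_index()

MIN_ROWS = 3  # arbitrary illustrative threshold, tune for the real data
reliable = summary[summary['n_rows'] >= MIN_ROWS].sort_values('mean_imdb', ascending=False)
print(reliable)
```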

In [None]:
# Group by 'language' and 'events' and count occurrences of each event per language
events_per_language = df.groupby(['language', 'events']).size().reset_index(name='event_count')

# Sort the events by event_count in descending order for each language
events_per_language_sorted = events_per_language.sort_values(by=['language', 'event_count'], ascending=[True, False])

# Display the first 20 rows (sorted by language, then by descending count)
print(events_per_language_sorted.head(20))
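Because the table is sorted by language first, its first 20 rows mostly cover whichever languages sort first alphabetically. To get the most common events within every language, one option is `groupby(...).head(n)` on the sorted frame. The sketch below uses made-up counts, and `TOP_N` is a hypothetical parameter.

```python
import pandas as pd

# Hypothetical counts in the same shape as events_per_language
toy = pd.DataFrame({
    'language': ['en', 'en', 'en', 'fr', 'fr'],
    'events': ['car crash', 'blood or gore', 'flashing lights', 'car crash', 'blood or gore'],
    'event_count': [50, 30, 10, 8, 12],
})

TOP_N = 2  # number of events to keep per language

# Sort within each language, then keep the first TOP_N rows of every group
top_per_language = (
    toy.sort_values(['language', 'event_count'], ascending=[True, False])
       .groupby('language')
       .head(TOP_N)
       .reset_index(drop=True)
)
print(top_per_language)
```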

In [None]:
# Calculate the average IMDb rating per language
avg_imdb_ratings = df.groupby('language')['imdb_rating'].mean().reset_index()

# Sort the languages by average IMDb rating in descending order
avg_imdb_ratings_sorted = avg_imdb_ratings.sort_values(by='imdb_rating', ascending=False)

# Display the 10 languages with the highest average IMDb rating
print(avg_imdb_ratings_sorted.head(10))

## Sentiment Distribution by Language
Goal: Understand how sentiment varies across different languages.
Approach: Group the data by language and calculate the average sentiment score for each language. You can then analyze the languages with the most positive or negative sentiments.

In [None]:
avg_sentiment_by_language = df.groupby('language')['event_sentiment'].mean().reset_index()
avg_sentiment_by_language_sorted = avg_sentiment_by_language.sort_values(by='event_sentiment', ascending=False)
print(avg_sentiment_by_language_sorted.head())


## Sentiment by Genre
Goal: Determine if certain genres have more positive or negative sentiments.
Approach: Group the data by genres and calculate the average sentiment for each genre. You could also check how sentiment changes over time for each genre.

In [None]:
avg_sentiment_by_genre = df.groupby('genres')['event_sentiment'].mean().reset_index()
avg_sentiment_by_genre_sorted = avg_sentiment_by_genre.sort_values(by='event_sentiment', ascending=False)
print(avg_sentiment_by_genre_sorted.head())
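The per-genre trend over time mentioned above could be sketched with a pivot table: one row per release year, one column per genre, and the mean sentiment in each cell. The data below is made up, and the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical exploded rows
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2001, 2001],
    'genres': ['Action', 'Drama', 'Action', 'Action', 'Drama'],
    'event_sentiment': [-0.4, 0.2, -0.2, 0.0, 0.1],
})

# Rows are years, columns are genres, cells hold the mean sentiment
sentiment_by_year_genre = toy.pivot_table(
    index='release_year',
    columns='genres',
    values='event_sentiment',
    aggfunc='mean',
)
print(sentiment_by_year_genre)
```

Year and genre combinations with no rows show up as NaN in the resulting table.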


## Sentiment Over Time
Goal: Track sentiment trends over the years and see whether average sentiment is rising or falling.
Approach: Group the data by release year and calculate the average sentiment for each year.

In [None]:
avg_sentiment_by_year = df.groupby('release_year')['event_sentiment'].mean().reset_index()
avg_sentiment_by_year_sorted = avg_sentiment_by_year.sort_values(by='release_year')
print(avg_sentiment_by_year_sorted.head())
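Yearly averages can be noisy, especially for years with few movies, which makes a rising or falling trend hard to judge. A centered rolling mean is one way to smooth the series. This is a minimal sketch on made-up yearly values, and it assumes consecutive years with no gaps.

```python
import pandas as pd

# Hypothetical yearly averages in the same shape as avg_sentiment_by_year
yearly = pd.DataFrame({
    'release_year': [2000, 2001, 2002, 2003, 2004],
    'event_sentiment': [-0.10, 0.05, -0.20, 0.00, -0.05],
})

yearly = yearly.sort_values('release_year')

# 3-year centered window; min_periods=1 keeps the first and last years
yearly['sentiment_smoothed'] = (
    yearly['event_sentiment'].rolling(window=3, center=True, min_periods=1).mean()
)
print(yearly)
```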

## Correlation Between Sentiment and Popularity/Ratings
Goal: Investigate if there is a correlation between event sentiment and movie popularity/ratings.
Approach: Calculate the correlation between event_sentiment, popularity, and ratings (imdb_rating, tmdb_rating).

In [None]:
correlation = df[['event_sentiment', 'popularity', 'imdb_rating', 'tmdb_rating']].corr()
print(correlation)
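`DataFrame.corr` defaults to Pearson correlation, which only captures linear relationships and is sensitive to a few extreme values. Popularity scores are often heavily skewed, so a rank-based Spearman correlation may be a useful cross-check. The sketch below uses made-up values chosen to show the difference.

```python
import pandas as pd

# Hypothetical data: popularity grows monotonically but far from linearly
toy = pd.DataFrame({
    'event_sentiment': [-0.5, -0.2, 0.0, 0.3, 0.6],
    'popularity': [1.0, 2.0, 5.0, 40.0, 900.0],
})

pearson = toy.corr(method='pearson').loc['event_sentiment', 'popularity']
spearman = toy.corr(method='spearman').loc['event_sentiment', 'popularity']

# Spearman only looks at ranks, so a monotonic relationship scores 1.0
print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```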

## Popularity vs. Genre Sentiment
Goal: See if more popular genres tend to have higher or lower sentiment scores.
Approach: Group by genre and calculate the average popularity and average sentiment for each genre.

In [None]:
genre_popularity_sentiment = df.groupby('genres').agg({
    'popularity': 'mean',
    'event_sentiment': 'mean'
}).reset_index()
genre_popularity_sentiment_sorted = genre_popularity_sentiment.sort_values(by='popularity', ascending=False)
print(genre_popularity_sentiment_sorted.head())

## Sentiment for Events
Goal: See if certain types of events have a strong positive or negative sentiment.
Approach: Group by event type and calculate the average sentiment score for each event type.

In [None]:
avg_sentiment_by_event = df.groupby('events')['event_sentiment'].mean().reset_index()
avg_sentiment_by_event_sorted = avg_sentiment_by_event.sort_values(by='event_sentiment', ascending=False)
print(avg_sentiment_by_event_sorted.head())

## Sentiment Across Different Rating Platforms (IMDb vs. TMDB)
Goal: Compare how sentiment correlates with ratings from the two platforms, IMDb and TMDB.
Approach: Calculate the correlation between `event_sentiment`, `imdb_rating`, and `tmdb_rating`.

In [None]:
sentiment_vs_ratings = df[['event_sentiment', 'imdb_rating', 'tmdb_rating']].corr()
print(sentiment_vs_ratings)

## Profitability vs. Sentiment
Goal: Explore how event sentiment correlates with a movie’s profitability (profit = revenue - budget).
Approach: Calculate the correlation between event_sentiment and the profit column.

In [None]:
df['profit'] = df['revenue'] - df['budget']
sentiment_vs_profit = df[['event_sentiment', 'profit']].corr()
print(sentiment_vs_profit)
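Revenue and budget are movie-level values, so every exploded row of a movie repeats the same profit, and movies with many events and genres dominate a row-level correlation. A possible alternative is to collapse to one row per movie first. The sketch below uses made-up numbers and assumes titles are unique per movie.

```python
import pandas as pd

# Hypothetical exploded rows: movie A repeats four times, C twice, B once
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
    'event_sentiment': [-0.2, -0.2, 0.4, 0.4, -0.3, 0.5, 0.3],
    'revenue': [300.0, 300.0, 300.0, 300.0, 50.0, 500.0, 500.0],
    'budget': [100.0, 100.0, 100.0, 100.0, 80.0, 150.0, 150.0],
})

# Collapse to one row per movie before correlating
movie_level = toy.groupby('title').agg(
    mean_sentiment=('event_sentiment', 'mean'),
    revenue=('revenue', 'first'),
    budget=('budget', 'first'),
).reset_index()
movie_level['profit'] = movie_level['revenue'] - movie_level['budget']

print(movie_level[['mean_sentiment', 'profit']].corr())
```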


## Sentiment by Director
Goal: Explore if certain directors tend to create movies with more positive or negative sentiment.
Approach: Group the data by director and calculate the average sentiment for each director.

In [None]:
avg_sentiment_by_director = df.groupby('director')['event_sentiment'].mean().reset_index()
avg_sentiment_by_director_sorted = avg_sentiment_by_director.sort_values(by='event_sentiment', ascending=False)
print(avg_sentiment_by_director_sorted.head())


## Sentiment Distribution
Goal: Understand the overall distribution of sentiment scores (e.g., do most events have neutral sentiment, or are they generally positive or negative?).
Approach: Plot a histogram of event_sentiment to visualize its distribution.

In [None]:
# Plot a histogram of event sentiment (missing scores are dropped first)
plt.hist(df['event_sentiment'].dropna(), bins=30, edgecolor='black')
plt.title('Event Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Frequency')
plt.show()


## Comparing Sentiment Across Popular and Less Popular Movies
Goal: See if more popular movies (based on popularity) tend to have higher or lower sentiment.
Approach: Create two groups: high popularity (e.g., popularity > 50) and low popularity (e.g., popularity <= 50). Compare their average sentiments.

In [None]:
high_popularity = df[df['popularity'] > 50]
low_popularity = df[df['popularity'] <= 50]

avg_sentiment_high_pop = high_popularity['event_sentiment'].mean()
avg_sentiment_low_pop = low_popularity['event_sentiment'].mean()

print(f"Average sentiment for high popularity movies: {avg_sentiment_high_pop}")
print(f"Average sentiment for low popularity movies: {avg_sentiment_low_pop}")
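The cutoff of 50 is only an example, and a fixed threshold can leave one group much smaller than the other. One alternative, sketched here on made-up values, is to split at the median with `pd.qcut` so both groups have roughly equal size, and to report the group sizes next to the means.

```python
import pandas as pd

# Hypothetical movie-level values
toy = pd.DataFrame({
    'popularity': [3.0, 8.0, 15.0, 60.0, 120.0, 400.0],
    'event_sentiment': [-0.3, -0.1, 0.0, 0.1, 0.2, 0.4],
})

# Split at the median instead of a fixed cutoff
toy['popularity_group'] = pd.qcut(toy['popularity'], q=2, labels=['low', 'high'])

# Mean sentiment and group size for each half
group_means = (
    toy.groupby('popularity_group', observed=True)['event_sentiment']
       .agg(['mean', 'count'])
)
print(group_means)
```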


## Summary of Useful Calculations
- Sentiment by language: which languages tend to have more positive or negative sentiment.
- Sentiment by genre: whether certain genres are associated with specific sentiment scores.
- Sentiment trends over time: how sentiment changes over the years.
- Sentiment vs. ratings: how sentiment correlates with IMDb ratings, TMDB ratings, and popularity.
- Sentiment by director: whether certain directors consistently make movies with a specific sentiment.
- Profitability vs. sentiment: the relationship between sentiment and a movie’s profitability.

These calculations help identify patterns in the data and may suggest ways to refine the sentiment analysis. The correlations can then be examined for trends that could support sentiment-based recommendations or predictions.

## Check which languages have the most events
We can group the data by language and count the event rows for each language to see which languages are associated with more events. Because `df` is also exploded by genre, each event is counted once per genre of its movie.

In [None]:
# Count the number of events per language
event_count_by_language = df.groupby('language')['events'].count().reset_index()
event_count_by_language_sorted = event_count_by_language.sort_values(by='events', ascending=False)

# Display the results
print(event_count_by_language_sorted.head())


## Check for correlation between high ratings and certain types of events
We'll calculate the correlation between each event type and the ratings (`imdb_rating`, `tmdb_rating`). This could help us identify whether certain types of events (e.g., "flashing lights" or "blood or gore") are linked to higher ratings.

We can look at the correlation between event types and ratings by using the following steps:

a. Encode the events into a numeric format, so we can calculate correlations.
We need to create dummy variables for the events column, turning each event type into a separate binary column (i.e., 1 if the event occurred, 0 if it did not).

b. Then calculate the correlation between the encoded event types and ratings.

In [None]:
# Create dummy variables for the events column (one-hot encoding)
events_encoded = pd.get_dummies(df['events'])

# Add the ratings (IMDB and TMDB) to the encoded DataFrame
events_with_ratings = pd.concat([events_encoded, df['imdb_rating'], df['tmdb_rating']], axis=1)

# Calculate the correlation matrix
correlation_matrix = events_with_ratings.corr()

# Display the correlation matrix
print(correlation_matrix)


This produces a correlation matrix showing how each event type correlates with the IMDb and TMDB ratings. A positive correlation indicates that an event type appears more often in higher-rated movies.

Analysis:
If an event type (e.g., "car crashes" or "blood or gore") correlates strongly with high ratings, investigate whether that event type is simply common in blockbuster movies.
If an event type correlates negatively with ratings, it tends to appear in lower-rated movies.
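Steps (a) and (b) can also be run at the movie level, so that each movie counts once no matter how many genres it has. The sketch below is one way to do that under stated assumptions: it starts from a made-up frame that still holds the original comma-separated `events` string, builds 0/1 indicator columns with `Series.str.get_dummies`, and correlates each indicator with the rating using `DataFrame.corrwith`.

```python
import pandas as pd

# Hypothetical movie-level frame with the original comma-separated column
toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'events': [
        'car crash,blood or gore',
        'flashing lights',
        'car crash',
        'blood or gore,flashing lights',
    ],
    'imdb_rating': [8.0, 5.0, 7.5, 6.0],
})

# One 0/1 column per event type, one row per movie
event_flags = toy_movies['events'].str.get_dummies(sep=',')

# Correlation of each event indicator with the rating, strongest positive first
event_rating_corr = (
    event_flags.corrwith(toy_movies['imdb_rating']).sort_values(ascending=False)
)
print(event_rating_corr)
```

With only four made-up movies the numbers themselves mean nothing; the point is the shape of the result, a single Series indexed by event type.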