# Basic Statistics from movies

**Metadata**

| **#** | **Column**               | **Description**                                                                                    |
|-------|--------------------------|----------------------------------------------------------------------------------------------------|
| 1     | **title**                | The official title of the movie.                                                                   |
| 2     | **release_year**         | The year when the movie was officially released.                                                   |
| 3     | **genres**               | Categories of genres the movie belongs to (can have multiple values, separated by commas).         |
| 4     | **director**             | Director(s) of the movie.                                                                          |
| 5     | **runtime**              | Duration of the movie in minutes.                                                                  |
| 6     | **language**             | The language in which the movie was originally produced.                                           |
| 7     | **original_title**       | The title of the movie in its original language.                                                   |
| 8     | **popularity**           | TMDB's score based on votes, views, searches, social engagement, and others.                       |
| 9     | **events**               | Describes events or triggers within the movie (e.g., violence, strong language).                   |
| 10    | **imdb_rating**          | Average rating of the movie on IMDb, on a scale from 0 to 10.                                      |
| 11    | **imdb_votes**           | Number of votes that contributed to the IMDb rating.                                               |
| 12    | **tmdb_rating**          | Rating of the movie on TMDB, on a scale from 0 to 10.                                              |
| 13    | **tmdb_votes**           | Number of votes that contributed to the TMDB rating.                                               |
| 14    | **budget**               | Financial budget allocated for the movie production.                                               |
| 15    | **revenue**              | Box office earnings of the movie.                                                                  |
| 16    | **profit**               | Profit of the movie, calculated as `revenue - budget`.                                             |


In [22]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from wordcloud import WordCloud

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

In [23]:
import sys
sys.path.append('../utils')
import functions

In [None]:
movies = pd.read_csv('../data/clean/movie_df.csv')
display(movies)

In [None]:
print('Basic statistics for numerical columns\n')
movies.describe()
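
The metadata table defines `profit` as `revenue - budget`, so a quick consistency check can confirm that the column matches its definition. The following is a minimal sketch on made-up toy data (the `toy` frame and its values are assumptions, not rows from `movie_df.csv`); the same comparison could be run on `movies`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the third row is deliberately inconsistent
toy = pd.DataFrame({
    'budget':  [10.0, 20.0, 5.0],
    'revenue': [30.0, 15.0, 5.0],
    'profit':  [20.0, -5.0, 1.0],
})

# Flag rows where the stored profit differs from revenue - budget
# (np.isclose tolerates floating-point rounding)
mismatch = ~np.isclose(toy['profit'], toy['revenue'] - toy['budget'])
n_mismatch = int(mismatch.sum())
print(f'{n_mismatch} row(s) where profit != revenue - budget')
```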

### Univariate Analysis
- Visualize the Distribution of Numerical Columns
- Check for outliers 

In [None]:
movies[['runtime', 'popularity', 'budget', 'revenue', 'profit']].hist(bins=20, figsize=(10, 8), color='gold')
plt.tight_layout()
plt.show()

In [None]:
# detect outliers
plt.figure(figsize=(10, 6))
sns.boxplot(data=movies[['budget', 'revenue', 'profit']])
plt.show()
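
The box plot shows outliers visually; to count them, one option is the 1.5 × IQR rule, which is also the default whisker length in seaborn's `boxplot`. The sketch below uses a hypothetical helper `iqr_outlier_mask` and made-up toy values rather than the real `movies` data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    'budget':  [10, 12, 11, 13, 12, 200],
    'revenue': [30, 28, 35, 31, 29, 33],
})

# Number of flagged rows per column
outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```

On the real data, `iqr_outlier_mask(movies['budget'])` would return a boolean mask that can be used to inspect or filter the flagged rows.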

### Bivariate Analysis
Correlation Between Numerical Variables

In [None]:
# correlation matrix
correlation_matrix = movies[['runtime', 'popularity', 'budget', 'revenue', 'profit', 'imdb_rating', 'tmdb_rating']].corr()

# Create the heatmap with updated labels
plt.figure(figsize=(12, 10))

# Customizing axis labels for better readability
sns.heatmap(correlation_matrix, annot=True, cmap='Purples', fmt='.2f', linewidths=0.5, 
            xticklabels=['Runtime', 'Popularity', 'Budget', 'Revenue', 'Profit', 'IMDB Rating', 'TMDB Rating'],
            yticklabels=['Runtime', 'Popularity', 'Budget', 'Revenue', 'Profit', 'IMDB Rating', 'TMDB Rating'])

plt.xticks(fontsize=10)
plt.yticks(fontsize=10)

# Show the plot
plt.title('Correlation Matrix of Movie Features', fontsize=12)
plt.show()
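
A heatmap is easy to scan, but a ranked list of pairs makes the strongest relationships explicit. The sketch below, on made-up toy data, keeps only the upper triangle of the correlation matrix so that each pair appears once, then sorts by absolute correlation. Applying it to `correlation_matrix` from the cell above should work the same way.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    'budget':  [1.0, 2.0, 3.0, 4.0, 5.0],
    'revenue': [2.0, 4.1, 6.2, 7.9, 10.0],
    'runtime': [120.0, 95.0, 130.0, 100.0, 110.0],
})
corr = toy.corr()

# Boolean mask of the strict upper triangle (k=1 excludes the diagonal)
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Mask out the rest, flatten to (row, column) pairs, and rank by |correlation|
pairs = (
    corr.where(upper_mask)
    .stack()
    .dropna()
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```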

### Categorical Data Analysis
- Most common genres
- Most frequent languages and directors

In [None]:
# Split the comma-separated genres into one row per genre.
# The result is kept in a separate Series so that movies['genres'] stays a string column for later cells.
genres_flat = movies['genres'].dropna().str.split(',').explode().str.strip()
genre_counts = genres_flat.value_counts()

print('Most common genres:\n')
print(genre_counts.head(10))

In [None]:
print('Most frequent languages:\n')
print(movies['language'].value_counts().head(10))

print('\nMost frequent directors:\n')
print(movies['director'].value_counts().head(10))

### Time-Based Analysis
- Release year trends
- Revenue and profit trends over time 

In [None]:
# distribution of movies over years
plt.figure(figsize=(10, 6))
# hue mirrors x only to apply the palette; legend=False suppresses the redundant legend
sns.countplot(x='release_year', data=movies, hue='release_year', palette='viridis', legend=False)
plt.xticks(rotation=45)
plt.title('Movies Released per Year')
plt.ylabel('Release Count')
plt.xlabel('Year')

plt.show()

In [None]:
# revenue and profit trends over time
movies.groupby('release_year')[['revenue', 'profit']].sum().plot(kind='line', figsize=(10, 6))
plt.title('Revenue and Profit Trends Over Time', fontsize=16)
plt.xticks(rotation=45)
plt.xlabel('Release Year')

plt.tight_layout()
plt.show()
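
Yearly sums rise and fall with the number of movies released, so a year with many releases can show a high total even if individual movies earned less. A minimal sketch on made-up toy data that puts the total, the per-movie mean, and the release count side by side:

```python
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2001, 2001, 2001],
    'revenue':      [100,  300,  100,  100,  100,  200],
})

# Named aggregation: one output column per statistic
yearly = toy.groupby('release_year')['revenue'].agg(
    total='sum',
    per_movie='mean',
    n_movies='count',
)
print(yearly)
```

In this toy example the total is higher in 2001 while the per-movie mean is lower, which is the pattern a plot of sums alone would hide.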

### Multicollinearity
- For predictive modeling, check multicollinearity between numerical variables
- VIF (Variance Inflation Factor) for multicollinearity

In [None]:
# Add a constant (intercept) column to the features.
# 'revenue' and 'profit' are left out because they are highly correlated with each other and with 'budget'.
X = add_constant(movies[['runtime', 'popularity', 'budget', 'imdb_rating', 'tmdb_rating']])

# VIF for each feature
vif_data = pd.DataFrame()
vif_data['feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)
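
For reference, the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below computes it by hand with NumPy on synthetic data; the helper `vif` and the columns `a`, `b`, `c` are assumptions made for illustration and are not part of the notebook's data. Here the intercept is added inside the helper, which plays the role of `add_constant` in the cell above.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept + remaining features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares fit
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)

# Synthetic features: c is nearly a copy of a, b is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c])

vifs = [vif(X, j) for j in range(X.shape[1])]
print(dict(zip(['a', 'b', 'c'], np.round(vifs, 2))))
```

The near-duplicate columns `a` and `c` should come out with large VIFs, while the independent column `b` stays close to 1.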


## VIF Interpretation:

- **`const (107.81)`**: This is the intercept column added by `add_constant`, not a feature. A high VIF here is not a concern and can be ignored.
- **`runtime (1.45)`**: A VIF of 1.45 indicates a low degree of multicollinearity. It's a safe value.
- **`popularity (1.02)`**: A VIF close to 1 means there's minimal collinearity with other variables, which is ideal.
- **`budget (1.26)`**: A VIF of 1.26 indicates mild collinearity but is still acceptable and generally not a problem for regression.
- **`imdb_rating (2.86)`**: A VIF of 2.86 is above 1 but below the common threshold of 5, so it is acceptable. The elevated value most likely reflects overlap with `tmdb_rating`, since both columns rate the same movies.
- **`tmdb_rating (2.62)`**: Similar to `imdb_rating`, this VIF value is acceptable.

## Conclusion:

All feature VIFs are well below **5**, a common rule-of-thumb threshold, so there is no sign of severe multicollinearity among `runtime`, `popularity`, `budget`, `imdb_rating`, and `tmdb_rating`.

For models such as linear regression, this suggests these features can be used together without multicollinearity destabilizing the coefficient estimates. Note that `revenue` and `profit` were left out of the VIF calculation, so this conclusion does not extend to them.

## Event Analysis

In [34]:
# Step 1: Explode the 'events' column into individual events
events_exploded = movies['events'].str.split(',').explode().str.strip()

# Create a new DataFrame with individual events for each movie
events_df = pd.DataFrame({
    'event': events_exploded,
    'popularity': movies.loc[events_exploded.index, 'popularity'],
    'runtime': movies.loc[events_exploded.index, 'runtime'],
    'budget': movies.loc[events_exploded.index, 'budget'],
    'revenue': movies.loc[events_exploded.index, 'revenue'],
    'profit': movies.loc[events_exploded.index, 'profit'],
    'imdb_rating': movies.loc[events_exploded.index, 'imdb_rating'],
    'tmdb_rating': movies.loc[events_exploded.index, 'tmdb_rating']
})

In [None]:
# Step 2: Frequency of Events
event_frequency = events_df['event'].value_counts().reset_index()
event_frequency.columns = ['event', 'frequency']

event_frequency['event'] = event_frequency['event'].str.capitalize()
plt.figure(figsize=(10, 6))
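# Color bars by event rather than by the numeric frequency, and drop the redundant legend
sns.barplot(x='frequency', y='event', data=event_frequency.head(10), hue='event', legend=False)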
plt.title('Top 10 Most Frequent Events', fontsize=14)
plt.xlabel('Frequency', fontsize=12)
plt.ylabel('Events', fontsize=12)
plt.show()
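
`events_df` also carries numeric columns for each event, so frequencies can be paired with a per-event average. The sketch below shows the idea on made-up toy data with the same comma-separated `events` layout; the event names and ratings are illustrative only.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up)
toy = pd.DataFrame({
    'events': ['violence, strong language', 'violence', 'strong language'],
    'imdb_rating': [6.0, 8.0, 7.0],
})

# One row per (movie, event); each movie's rating is repeated for each of its events
exploded = toy.assign(event=toy['events'].str.split(',')).explode('event')
exploded['event'] = exploded['event'].str.strip()

# Count and mean rating per event
summary = (
    exploded.groupby('event')['imdb_rating']
    .agg(['count', 'mean'])
    .sort_values('count', ascending=False)
)
print(summary)
```

Because a movie with several events contributes to several groups, these means describe movies that contain an event and should not be read as the effect of the event itself.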

In [None]:


# Step 1: Split the comma-separated 'events' column into one row per event
# (genres are not needed here: the word cloud depends only on event counts,
# and pairing exploded genres with exploded events by position would be meaningless)
exploded_events = movies['events'].dropna().str.split(',').explode().str.strip()

# Step 2: Count how often each event occurs and convert the counts to a dictionary
event_frequency = exploded_events.value_counts().rename_axis('event').reset_index(name='frequency')
event_frequency_dict = dict(zip(event_frequency['event'], event_frequency['frequency']))

# Step 3: Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis').generate_from_frequencies(event_frequency_dict)

# Step 4: Plot the word cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Event Frequencies', fontsize=14)
plt.tight_layout()
plt.show()