# Data analysis on a TMDB dataset of 15000 movies

Questions:

1. What is the relationship between a movie's popularity and its average vote?
   Do more popular movies receive higher average votes?

## Data loading

In [8]:
import numpy as np
import pandas as pd

DATA_CSV_FILE = 'datasets/tmdb-15000-movies.csv'

df = pd.read_csv(DATA_CSV_FILE, lineterminator='\n')


## Data preparation

### Data cleaning

- remove non-English movies
- remove movies with fewer than 100 votes
- remove unused columns
- remove rows with empty values
- remove duplicates

In [9]:

# from tmdb15k.average_vote_popularity import AverageVotePopularityRelationship

# q1a1 = AverageVotePopularityRelationship(df)

# Remove non-English movies.
df = df[df['original_language'] == 'en']
# Remove movies with fewer than 100 votes.
df = df[df['vote_count'] >= 100]
# df = df[df['popularity'] <= 50]

# Remove unused columns.
df = df.drop(
    [
        'Unnamed: 0',
        'adult',
        'backdrop_path',
        'cast',
        'crew',
        'genres',
        'keywords',
        'original_language',
        'poster_path',
        'release_date',
        'video',
        'vote_count',
    ],
    axis='columns',
)

# Remove rows with null values. No fillna() is needed afterwards,
# because dropna() leaves no null values behind.
df = df.dropna()
# Remove duplicate rows.
df = df.drop_duplicates()

# # Convert release_date to datetime.
# df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

### Data transformation

No specific transformation is needed, since `vote_average` and `popularity`
are already numerical values.

In [10]:
print("and now that you don't have to be perfect, you can be good")
print("but you're already perfect")

and now that you don't have to be perfect, you can be good
but you're already perfect


### Data standardization

We need to standardize `vote_average` and `popularity`.

Let's look at their `min`, `max`, `mean`, median (`50%`) and standard
deviation (`std`).

In [11]:
print('----- vote_average -----')
print(df['vote_average'].describe())

print('----- popularity -----')
print(df['popularity'].describe())

----- vote_average -----
count    6933.000000
mean        6.619458
std         0.621230
min         5.500000
25%         6.100000
50%         6.600000
75%         7.100000
max         8.500000
Name: vote_average, dtype: float64
----- popularity -----
count    6933.000000
mean       17.803592
std         9.309400
min         0.600000
25%        11.139000
50%        14.891000
75%        21.764000
max        49.947000
Name: popularity, dtype: float64


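`vote_average` fits within its advertised range of __0__ - __10__ (the observed
values span 5.5 to 8.5), whereas `popularity` seems to have quite a few
outliers: its maximum (49.947) lies about 3.45 standard deviations above the
mean (17.80). We will apply Min-Max Normalization to the former and Z-Score
Standardization to the latter.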

In [12]:
vote_min = df['vote_average'].min()
vote_max = df['vote_average'].max()
df['vote_average_normalized'] = (df['vote_average'] - vote_min) / (vote_max - vote_min)
df['popularity_standardized'] = (df['popularity'] - df['popularity'].mean()) / df['popularity'].std()
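
The two rescalings can also be written as small reusable helpers. The sketch below is only an illustration: the function names `min_max_normalize` and `z_score_standardize` and the toy values are assumptions, not part of this notebook. Note that pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, so the result can differ slightly from tools that divide by `n`.

```python
import pandas as pd


def min_max_normalize(s: pd.Series) -> pd.Series:
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())


def z_score_standardize(s: pd.Series) -> pd.Series:
    """Center on the mean and divide by the sample standard deviation (ddof=1)."""
    return (s - s.mean()) / s.std()


# Toy values, made up for illustration (not taken from the TMDB dataset).
toy = pd.Series([5.5, 6.0, 7.0, 8.5])
normalized = min_max_normalize(toy)
standardized = z_score_standardize(toy)

print(normalized.tolist())
print(standardized.round(3).tolist())
```

Neither helper guards against a constant column, where the denominator would be zero; a real pipeline would probably want to handle that case.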

## Data analysis

**1. Descriptive analysis**

Let's re-run `describe()` on the processed columns.

After applying both rescalings, the results are:

In [13]:
print('----- vote_average_normalized -----')
print(df['vote_average_normalized'].describe())

print('----- popularity_standardized -----')
print(df['popularity_standardized'].describe())

----- vote_average_normalized -----
count    6933.000000
mean        0.373153
std         0.207077
min         0.000000
25%         0.200000
50%         0.366667
75%         0.533333
max         1.000000
Name: vote_average_normalized, dtype: float64
----- popularity_standardized -----
count    6.933000e+03
mean    -2.459689e-17
std      1.000000e+00
min     -1.847981e+00
25%     -7.158992e-01
50%     -3.128657e-01
75%      4.254204e-01
max      3.452791e+00
Name: popularity_standardized, dtype: float64


We might want to follow GPT-4's suggestion below; we'll see.

> If you're concerned about the effect of these outliers on your subsequent 
> analysis, you might consider some additional preprocessing steps. You could, 
> for example, apply a logarithmic transformation to popularity before 
> standardizing, to reduce the impact of extreme values. Alternatively, you 
> might decide to remove movies that have a popularity above a certain 
> threshold, if you think these are likely to be anomalies or errors. The best 
> approach depends on your specific research question and analysis plan.
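
A minimal sketch of the two options mentioned in the quote, using made-up toy values rather than the TMDB data. The choice of `np.log1p` (which computes `log(1 + x)`) and the name `POPULARITY_CUTOFF` are assumptions for illustration; the cutoff of 50 simply mirrors the commented-out filter in the cleaning cell.

```python
import numpy as np
import pandas as pd

# Toy popularity values with one very large entry, made up for illustration.
popularity = pd.Series([0.6, 11.1, 14.9, 21.8, 49.9, 400.0])

# Option 1: log-transform first, then standardize.
# log1p stays defined at x = 0, unlike a plain log.
log_popularity = np.log1p(popularity)
z_raw = (popularity - popularity.mean()) / popularity.std()
z_log = (log_popularity - log_popularity.mean()) / log_popularity.std()

# Option 2: drop rows above a cutoff (50 is an arbitrary example value).
POPULARITY_CUTOFF = 50
trimmed = popularity[popularity <= POPULARITY_CUTOFF]

# The largest z-score is smaller after the log transform.
print(z_raw.max(), z_log.max())
print(len(trimmed))
```

On this toy series the extreme value sits closer to the rest after the log transform, which is the effect the quote describes. Whether that is desirable for the real data depends on the research question.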

**2. Correlation analysis**

In [14]:
correlation_coefficient = df['popularity_standardized'].corr(df['vote_average_normalized'])
print(correlation_coefficient)

0.13334058333143056
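
Two properties of this coefficient are worth keeping in mind, sketched below on made-up toy data (the values and variable names are assumptions, not taken from the dataset). First, Pearson's r is unchanged by positive linear rescaling, so Min-Max Normalization and Z-Score Standardization do not affect it. Second, a rank-based coefficient (Spearman's rho, computed here as Pearson's r on ranks) is less sensitive to extreme values and might be a useful cross-check.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    'popularity': [0.6, 11.1, 14.9, 21.8, 49.9],
    'vote_average': [6.1, 5.5, 6.6, 7.1, 8.5],
})

# The same rescalings as in the notebook, applied to the toy columns.
pop_z = (toy['popularity'] - toy['popularity'].mean()) / toy['popularity'].std()
vote_mm = (toy['vote_average'] - toy['vote_average'].min()) / (
    toy['vote_average'].max() - toy['vote_average'].min()
)

# Pearson's r on raw and on rescaled columns agrees up to rounding error.
pearson_raw = toy['popularity'].corr(toy['vote_average'])
pearson_scaled = pop_z.corr(vote_mm)

# Spearman's rho as Pearson's r on ranks.
spearman = toy['popularity'].rank().corr(toy['vote_average'].rank())

# Squaring the coefficient reported above gives the share of variance
# that a linear fit would explain.
r_squared = 0.13334058333143056 ** 2

print(pearson_raw, pearson_scaled, spearman, r_squared)
```

For the coefficient reported above, the square is roughly 0.018, so a linear relationship with popularity would account for under 2% of the variance in average vote: a weak positive association.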
