# Data analysis on a TMDB dataset of 15,000 movies

What is the relationship between a movie's popularity and its average vote? Do more popular movies receive higher average votes?

## Data loading

In [69]:
import numpy as np
import pandas as pd

# DATA_CSV_FILE = 'data/input.csv'
DATA_CSV_FILE = 'data/original.csv'

df = pd.read_csv(DATA_CSV_FILE, lineterminator='\n')

## Data preparation

**1. Data cleaning**

- remove non-English movies
- remove movies with fewer than 100 votes
- remove unused columns
- remove duplicates 
- remove empty values 

In [70]:
# Remove non-English movies.
df = df[df['original_language'] == 'en']
# Remove movies with fewer than 100 votes.
df = df[df['vote_count'] >= 100]
# df = df[df['popularity'] <= 100]

# Remove unused columns.
df = df.drop(
  [
    'Unnamed: 0',
    'adult',
    'backdrop_path',
    'cast',
    'crew',
    'genres',
    'keywords',
    'original_language',
    'poster_path',
    'release_date',
    'video',
    'vote_count',
  ],
  axis=1,
)

# Remove rows with null values. No fillna() is needed afterwards,
# because dropna() leaves no null values to fill.
df = df.dropna()
# Remove duplicate rows.
df = df.drop_duplicates()

# # Convert release_date to datetime.
# df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')

**2. Data transformation**

No specific transformation is needed: `vote_average` and `popularity` are
already numerical values.

In [71]:
print("and now that you don't have to be perfect, you can be good")
print("but you're already perfect")

and now that you don't have to be perfect, you can be good
but you're already perfect


**3. Data standardization**

We need to standardize `vote_average` and `popularity`.

Let's find out their `min`, `max`, `mean`, `median` and `stddev`.

In [72]:
print('----- vote_average -----')
print(df['vote_average'].describe())

print('----- popularity -----')
print(df['popularity'].describe())

----- vote_average -----
count    3252.000000
mean        6.824908
std         0.661431
min         5.500000
25%         6.300000
50%         6.800000
75%         7.300000
max         8.700000
Name: vote_average, dtype: float64
----- popularity -----
count    3252.000000
mean       40.140065
std       102.874918
min         0.617000
25%        17.177250
50%        24.440500
75%        39.750750
max      4810.649000
Name: popularity, dtype: float64


`vote_average` fits within its advertised range of __0__ to __10__, whereas
`popularity` has a long right tail: its maximum (4810.6) is far above its 75th
percentile (39.8). We will apply min-max normalization to the former and
z-score standardization to the latter.
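To put a number on "outliers", one common convention is the 1.5 × IQR rule. The snippet below is a minimal sketch on made-up toy values, not on the TMDB data; the variable names and the 1.5 multiplier are assumptions chosen for illustration, and other thresholds are equally defensible.

```python
import pandas as pd

# Toy values for illustration only; they are NOT rows from the TMDB dataset.
popularity = pd.Series([12.0, 18.5, 21.0, 24.4, 30.2, 39.8, 45.0, 4810.6])

# Interquartile range (IQR) and the conventional 1.5 * IQR fences.
q1 = popularity.quantile(0.25)
q3 = popularity.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = popularity[(popularity < lower_fence) | (popularity > upper_fence)]

print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"outliers: {outliers.tolist()}")
```

On the real data, the same lines could be run with `df['popularity']` in place of the toy series.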

In [73]:
df['vote_average_normalized'] = (df['vote_average'] - df['vote_average'].min()) / (df['vote_average'].max() - df['vote_average'].min())
df['popularity_standardized'] = (df['popularity'] - df['popularity'].mean()) / df['popularity'].std()

## Data analysis

**1. Descriptive analysis**

Let's re-run the descriptions on the processed columns:

In [74]:
print('----- vote_average_normalized -----')
print(df['vote_average_normalized'].describe())

print('----- popularity_standardized -----')
print(df['popularity_standardized'].describe())

----- vote_average_normalized -----
count    3252.000000
mean        0.414034
std         0.206697
min         0.000000
25%         0.250000
50%         0.406250
75%         0.562500
max         1.000000
Name: vote_average_normalized, dtype: float64
----- popularity_standardized -----
count    3252.000000
mean        0.000000
std         1.000000
min        -0.384186
25%        -0.223211
50%        -0.152608
75%        -0.003784
max        46.371934
Name: popularity_standardized, dtype: float64


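The largest standardized `popularity` value is still about 46 standard
deviations above the mean, so the outliers remain after standardization. One
option is to follow this suggestion from GPT-4: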

> If you're concerned about the effect of these outliers on your subsequent 
> analysis, you might consider some additional preprocessing steps. You could, 
> for example, apply a logarithmic transformation to popularity before 
> standardizing, to reduce the impact of extreme values. Alternatively, you 
> might decide to remove movies that have a popularity above a certain 
> threshold, if you think these are likely to be anomalies or errors. The best 
> approach depends on your specific research question and analysis plan.
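
The logarithmic option from the quote could look roughly like the sketch below. It uses synthetic, right-skewed data from a seeded random generator rather than the TMDB file. The helper `z_score`, the choice of `np.log1p` (log of 1 + x, which stays defined at zero) and the lognormal parameters are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data; NOT the TMDB popularity column.
rng = np.random.default_rng(0)
popularity = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))


def z_score(s: pd.Series) -> pd.Series:
    """Standardize a series to mean 0 and (sample) standard deviation 1."""
    return (s - s.mean()) / s.std()


# Standardize the raw values and the log-transformed values.
raw_z = z_score(popularity)
log_z = z_score(np.log1p(popularity))

# The extreme values sit much closer to the mean after the log transform.
print(f"max z-score, raw: {raw_z.max():.2f}")
print(f"max z-score, log: {log_z.max():.2f}")
```

Because the logarithm is monotonic, the ordering of movies by popularity is preserved; only the spacing between values changes.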

**2. Correlation analysis**

In [75]:
correlation_coefficient = df['popularity_standardized'].corr(df['vote_average_normalized'])
print(correlation_coefficient)

0.11661303487008848
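
A coefficient of about 0.12 points to only a weak positive linear relationship between popularity and average vote. Two properties are worth keeping in mind when reading it. First, Pearson correlation is unchanged by min-max normalization and z-score standardization, since both are positive linear rescalings. Second, it is sensitive to extreme values, so a rank-based (Spearman) coefficient is a common cross-check. The sketch below illustrates both on synthetic data. The data-generating formula, the helper names and the choice to compute Spearman as Pearson on ranks are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic data with a mild positive relationship; NOT the TMDB dataset.
rng = np.random.default_rng(0)
votes = pd.Series(rng.uniform(5.5, 8.7, size=500))
noise = rng.normal(0.0, 1.0, size=500)
popularity = np.exp(3.0 + 0.3 * (votes - 7.0) + noise)


def min_max(s: pd.Series) -> pd.Series:
    """Rescale a series to the range [0, 1]."""
    return (s - s.min()) / (s.max() - s.min())


def z_score(s: pd.Series) -> pd.Series:
    """Standardize a series to mean 0 and (sample) standard deviation 1."""
    return (s - s.mean()) / s.std()


# Pearson correlation before and after rescaling: the two values match.
pearson_raw = popularity.corr(votes)
pearson_scaled = z_score(popularity).corr(min_max(votes))

# Spearman correlation, computed as Pearson correlation of the ranks.
spearman_raw = popularity.rank().corr(votes.rank())
# Ranks ignore monotonic transforms, so a log transform changes nothing here.
spearman_log = np.log1p(popularity).rank().corr(votes.rank())

print(f"Pearson (raw):     {pearson_raw:.4f}")
print(f"Pearson (scaled):  {pearson_scaled:.4f}")
print(f"Spearman (raw):    {spearman_raw:.4f}")
print(f"Spearman (log1p):  {spearman_log:.4f}")
```

On the real data, comparing `df['popularity'].rank().corr(df['vote_average'].rank())` with the Pearson value above would show how much the extreme popularity values influence the result.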
