# Milestone 2

This notebook explores our dataset through preprocessing and preliminary analysis. We first load the data and apply the general preprocessing steps used throughout the notebook. The following sections are organized by the X research questions listed in the project's README; each includes its own data preparation steps, summary statistics, and visualizations. This initial exploration is meant to provide first insights and to confirm that the methods we have selected are suitable.

---

**Contents of notebook**:
1. [Section 1](#section1)
2. [Section 2](#section2)
    1. [Subsection 2.1](#section2_1)
    2. [Subsection 2.2](#section2_2)
3. [Section 3](#section3)

---

## Data Processing

### Loading data

In [None]:
#imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from helpers import * #DEFINE HELPERS
from datetime import datetime as dt

In [None]:
DATA_FOLDER = './data/'
MOVIES_METADATA_PATH = DATA_FOLDER + 'movie.metadata.tsv'
PLOT_SUMMARIES_PATH = DATA_FOLDER + 'plot_summaries.txt'
TITLE_RATINGS_PATH = DATA_FOLDER + 'title.ratings.tsv'
TITLE_BASICS_PATH = DATA_FOLDER + 'title.basics.tsv'

In [None]:
# Read the Movie Summary Dataset
df_plots = pd.read_csv(PLOT_SUMMARIES_PATH, sep='\t', header=None, names=['id', 'plot'])

# Display the first 5 elements of the dataframe
df_plots.head(5)

In [None]:
# Read the CMU movie metadata dataset
metadata_columns = ['id', 'freebase_id', 'title', 'release_date', 'boxOffice_revenue',
                    'runtime', 'language', 'country', 'genres']
metadata_df = pd.read_csv(MOVIES_METADATA_PATH, sep='\t', header=None, names=metadata_columns)
metadata_df.head(5)

In [None]:
# Get some insights
metadata_df.info()

In [None]:
nan_percentage = metadata_df['boxOffice_revenue'].isna().mean() * 100
nan_percentage
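
The same missing-value check can be run on every column at once, which helps decide which fields are usable before any merging. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the shape of the metadata table
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'boxOffice_revenue': [1.0, np.nan, np.nan, np.nan],
    'runtime': [90.0, 100.0, np.nan, 95.0],
})

# Share of missing values per column, in percent, most incomplete column first
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```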

In [None]:
metadata_df['release_date']

In [None]:
# Function to convert release date to datetime and handle different formats
def convert_release_date(date_str):
    # With errors='coerce', pd.to_datetime returns NaT instead of raising,
    # so each format is tried in turn and the first successful parse is kept.
    # Order: full date, then year-month, then year only.
    for date_format in ('%Y-%m-%d', '%Y-%m', '%Y'):
        parsed = pd.to_datetime(date_str, format=date_format, errors='coerce')
        if pd.notnull(parsed):
            return parsed
    # No format matched (or the value was missing)
    return pd.NaT

# Apply the conversion function
metadata_df['release_date'] = metadata_df['release_date'].apply(convert_release_date)

# Extract year, month, and day
metadata_df['year'] = metadata_df['release_date'].dt.year
metadata_df['month'] = metadata_df['release_date'].dt.month
metadata_df['day'] = metadata_df['release_date'].dt.day

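# Create flags for month and day.
# Heuristic: partial dates are parsed as January and/or the 1st, so genuine
# January (1st) releases are also flagged as not given. Missing dates are flagged False.
has_date = metadata_df['release_date'].notna()
month_not_january = metadata_df['release_date'].dt.month != 1
day_not_first = metadata_df['release_date'].dt.day != 1
metadata_df['release_month_given'] = has_date & month_not_january
metadata_df['release_month_day_given'] = has_date & (month_not_january | day_not_first)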

# Convert the release date to ordinal format
metadata_df['release_date_ord'] = metadata_df['release_date'].apply(lambda x: x.toordinal() if pd.notnull(x) else None)

metadata_df.head()
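
As a quick sanity check of the mixed-format parsing, the sketch below runs the same idea on a few made-up strings. `parse_partial_date` and `samples` are illustrative names, not part of the notebook. It also shows one possible way to record whether the month was given by looking at the raw string before conversion, which avoids guessing from the parsed value.

```python
import pandas as pd

def parse_partial_date(date_str):
    """Parse 'YYYY-MM-DD', 'YYYY-MM' or 'YYYY'; return NaT for anything else."""
    for date_format in ('%Y-%m-%d', '%Y-%m', '%Y'):
        parsed = pd.to_datetime(date_str, format=date_format, errors='coerce')
        if pd.notnull(parsed):
            return parsed
    return pd.NaT

# Made-up inputs covering each format, a missing value and an unparseable string
samples = pd.Series(['2001-08-24', '1988-05', '1975', None, 'unknown'])
parsed = samples.apply(parse_partial_date)

# The raw string still knows its precision: at least one '-' means a month was written
month_given = parsed.notna() & (samples.str.count('-').fillna(0) >= 1)

print(pd.DataFrame({'raw': samples, 'parsed': parsed, 'month_given': month_given}))
```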

In [None]:
# IMDb ratings dataset
imdb_ratings_df = pd.read_csv(TITLE_RATINGS_PATH, sep='\t')
imdb_ratings_df.head()

In [None]:
# IMDb videos' metadata dataset
imdb_basics_df = pd.read_csv(TITLE_BASICS_PATH, sep='\t', low_memory=False)
imdb_basics_df.head()

In [None]:
imdb_basics_df['titleType'].unique()

In [None]:
# Only keep movies
imdb_basics_df = imdb_basics_df[imdb_basics_df['titleType'] == 'movie']

In [None]:
# Merge IMDb movies' rating and metadata
imdb_df = imdb_ratings_df.merge(imdb_basics_df, on='tconst', how='inner')
imdb_df.head()

In [None]:
imdb_df.loc[imdb_df['startYear']=='\\N', 'startYear'] = np.nan
imdb_df.loc[imdb_df['runtimeMinutes']=='\\N', 'runtimeMinutes'] = np.nan
imdb_df.info()
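
The `\N` placeholder can also be declared as a missing value when the file is read, so the affected columns come out numeric without a separate replacement step. The sketch below illustrates this on a tiny in-memory table with made-up rows (`sample_tsv` is a stand-in, not the real IMDb file).

```python
import io
import pandas as pd

# Made-up rows using the same '\N' convention for missing values
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tmovie\tExample One\t1999\t120\n"
    "tt0000002\tmovie\tExample Two\t\\N\t\\N\n"
)

# na_values tells read_csv to treat the literal string '\N' as NaN
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')
print(sample_df.dtypes)
```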

In [None]:
imdb_df['startYear'] = pd.to_numeric(imdb_df['startYear'], errors='coerce')
imdb_df['runtimeMinutes'] = pd.to_numeric(imdb_df['runtimeMinutes'], errors='coerce')
# Merge CMU and IMDb datasets
ratings_df = metadata_df.merge(
    imdb_df,
    left_on=['title'],
    right_on=['primaryTitle'],
    how='inner'
)
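# Treat a missing IMDb start year as -1 so that it can never fall within the tolerance
ratings_df['startYear'] = ratings_df['startYear'].fillna(-1)
# Keep only title matches whose release years differ by at most one year
ratings_df['releaseDiff'] = (ratings_df['year'] - ratings_df['startYear']).abs()
ratings_df = ratings_df[ratings_df['releaseDiff'] <= 1]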
ratings_df
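
Merging on the title alone pairs every movie with every other movie of the same name, which is why the release-year tolerance is applied afterwards. The sketch below reproduces this on made-up rows (titles, years and ratings are invented for illustration).

```python
import pandas as pd

# Two different films share 'Title A'; 'Title B' has no IMDb entry close in time
cmu_toy = pd.DataFrame({'title': ['Title A', 'Title A', 'Title B'],
                        'year': [1948, 1996, 1972]})
imdb_toy = pd.DataFrame({'primaryTitle': ['Title A', 'Title A', 'Title B'],
                         'startYear': [1948, 1997, 2002],
                         'averageRating': [7.6, 7.7, 6.2]})

# Title-only merge: 'Title A' yields 2 x 2 = 4 candidate pairs, 'Title B' yields 1
merged = cmu_toy.merge(imdb_toy, left_on='title', right_on='primaryTitle', how='inner')

# Keep only pairs whose release years differ by at most one year
merged['releaseDiff'] = (merged['year'] - merged['startYear']).abs()
matched = merged[merged['releaseDiff'] <= 1]

print(len(merged), len(matched))
print(matched[['title', 'year', 'startYear', 'averageRating']])
```

Checking `matched.duplicated(subset=['title', 'year']).any()` on the real data would show whether any ambiguous pairs remain after filtering.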

In [None]:
# Use the %pip magic so the package is installed into the kernel's environment
%pip install kaggle


In [None]:
!kaggle datasets download -d akshaypawar7/millions-of-movies -p data/zip_files

In [None]:
import zipfile
import os

# Path to the downloaded zip file
zip_file_path = 'data/zip_files/millions-of-movies.zip'

# Path where you want to extract the files
extract_path = 'data/'

# Create the directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extraction completed!")

In [None]:
# Read the Kaggle movies dataset extracted above
dataframe = pd.read_csv('data/movies.csv')

# Display the first rows of the dataframe
dataframe.head()

In [None]:
dataframe['release_date'] = dataframe['release_date'].apply(convert_release_date)

In [None]:
# Extract year, month, and day
dataframe['year'] = dataframe['release_date'].dt.year
dataframe['month'] = dataframe['release_date'].dt.month
dataframe['day'] = dataframe['release_date'].dt.day

In [None]:
# Merge the combined CMU/IMDb data with the Kaggle dataset on the movie title
revenues_ratings_df = ratings_df.merge(
    dataframe,
    on='title',
    how='inner'
)

revenues_ratings_df['releaseDiff2'] = (revenues_ratings_df['year_x'] - revenues_ratings_df['year_y']).abs()
revenues_ratings_df = revenues_ratings_df[revenues_ratings_df['releaseDiff2'] <= 1]

In [None]:
# Percentage of movies whose Kaggle revenue is recorded as 0
zero_revenue_percentage = (revenues_ratings_df['revenue'] == 0).mean() * 100
zero_revenue_percentage

In [None]:
# Where the Kaggle revenue is 0, fall back to the CMU box-office revenue (which may itself be NaN)
revenues_ratings_df.loc[revenues_ratings_df['revenue'] == 0, 'revenue'] = revenues_ratings_df['boxOffice_revenue']

In [None]:
nan_percentage = (revenues_ratings_df['revenue'].isna()).mean() * 100
nan_percentage

In [None]:
# Calculate the absolute difference
revenues_ratings_df['revenue_diff'] = (revenues_ratings_df['revenue'] - revenues_ratings_df['boxOffice_revenue']).abs()

# Descriptive statistics of the differences
difference_stats = revenues_ratings_df['revenue_diff'].describe()
print(difference_stats)

# Define a threshold for "too much difference"
# For example, consider differences greater than $1 million as significant
threshold = 1_000_000

# Count the number of cases where the difference exceeds the threshold
significant_diff_count = (revenues_ratings_df['revenue_diff'] > threshold).sum()
print(f"Number of cases with significant difference: {significant_diff_count}")
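
An absolute threshold weighs a $1 million gap the same for a small release as for a blockbuster. One alternative is to compare the gap to the larger of the two figures. The sketch below shows this on made-up numbers; the 10% cut-off is an arbitrary assumption, not a value used in the project.

```python
import numpy as np
import pandas as pd

# Made-up revenue pairs, including one row with a missing Kaggle value
toy = pd.DataFrame({
    'revenue': [1_000_000.0, 200_000_000.0, 50_000.0, np.nan],
    'boxOffice_revenue': [1_500_000.0, 201_000_000.0, 50_000.0, 3_000_000.0],
})

# Gap expressed as a fraction of the larger of the two reported revenues
larger = toy[['revenue', 'boxOffice_revenue']].max(axis=1)
toy['relative_diff'] = (toy['revenue'] - toy['boxOffice_revenue']).abs() / larger

# Rows with a missing value give NaN, which compares as False and is not flagged
flagged = toy['relative_diff'] > 0.10
print(toy.assign(flagged=flagged))
```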


In [None]:
revenues_ratings_df['revenue'].mean()