# IMDb Dataset Initial Data Check

This notebook performs an initial check on the IMDb datasets to:
1. Understand the structure of each dataset.
2. Identify missing values, duplicates, and potential issues.

## Step 1: Import Libraries

In this step, we import pandas and configure its display options so that DataFrame output in the notebook is not truncated.


In [28]:
# Import Libraries
import pandas as pd

# Set display options for better visibility
pd.set_option('display.max_rows', None)  # Show all rows in a DataFrame
pd.set_option('display.max_columns', None)  # Show all columns in a DataFrame
pd.set_option('display.width', None)  # Auto-detect the output width instead of using a fixed value
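
With `display.max_rows` set to `None`, printing a whole DataFrame renders every row, which is impractical for tables with millions of rows. As a hedged alternative, `pd.option_context` applies display options only inside a `with` block and restores the previous settings afterwards. The sketch below uses a small made-up DataFrame (`demo`) rather than a real IMDb file.

```python
import pandas as pd

# Hypothetical small frame standing in for an IMDb table
demo = pd.DataFrame({
    "tconst": [f"tt{i:07d}" for i in range(1, 101)],
    "titleType": ["short"] * 100,
})

rows_before = pd.get_option("display.max_rows")

# Options set here apply only inside the with block
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    rows_inside = pd.get_option("display.max_rows")
    full_text = repr(demo)  # every row is rendered while the option is active

rows_after = pd.get_option("display.max_rows")  # previous setting is restored
```

Whether to set the options globally or per block is a matter of taste; the scoped form simply lowers the risk of accidentally printing a very large table in full.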


## Step 2: Define the Inspection Function

The `load_and_inspect` function:
1. Loads a dataset from a file.
2. Prints its shape, columns, and a sample of rows.
3. Checks for missing values and highlights critical issues (e.g., columns with more than 50% missing values).
4. Identifies duplicate values in each column.
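
To make checks 3 and 4 concrete, here is a minimal sketch on a made-up four-row table. The column names mirror IMDb columns, but the values and the variable names (`toy`, `critical`, `dup_counts`) are purely illustrative.

```python
import pandas as pd

# Made-up rows; None plays the role of a missing value
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["short", "short", "movie", "short"],
    "endYear": [None, None, None, "1990"],
})

# Fraction of missing values per column, and the columns above the 50% threshold
missing_ratio = toy.isnull().mean()
critical = missing_ratio[missing_ratio > 0.5].index.tolist()

# Per column: number of rows whose value already appeared in an earlier row
dup_counts = {col: int(toy.duplicated(subset=[col]).sum()) for col in toy.columns}

print(critical)
print(dup_counts)
```

Here `endYear` is flagged because three of its four values are missing. The duplicate count is 0 for `tconst`, 2 for `titleType`, and 2 for `endYear`, because missing values count as duplicates of each other. A high count is therefore expected for categorical columns and is mainly informative for columns that should be unique, such as an identifier.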


In [29]:
# Function to load and inspect datasets
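def load_and_inspect(file_path, name, sep='\t'):
    """Load a delimited file, print a structural summary, and return the DataFrame."""
    print(f"\n--- Loading {name} ({file_path}) ---")
    # Read every column as a string; IMDb marks missing values with '\N'.
    # Note: pandas' default quoting and default NA markers still apply here, which
    # can affect titles containing '"' or literal values such as 'NA'.
    df = pd.read_csv(file_path, sep=sep, dtype=str, na_values=['\\N'])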
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")
    print("Sample Rows:")
    print(df.head())
    
    # Missing values
    missing_values = df.isnull().sum()
    print("\nMissing Values per Column:")
    print(missing_values)
    
    # Highlight critical missing value issues
    print("\nCritical Missing Value Issues:")
    for column, count in missing_values.items():
        if count / len(df) > 0.5:
            print(f"{column} has more than 50% missing values.")
    
    # Count, per column, the rows whose value already appeared in an earlier row
    print("\nDuplicate Values per Column:")
    for column in df.columns:
        duplicates = df.duplicated(subset=[column]).sum()
        print(f"{column}: {duplicates} duplicate values")
    
    return df
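
Because `load_and_inspect` reports everything through `print`, its findings can be written to a file without changing the function. The sketch below shows one possible way using `contextlib.redirect_stdout` from the standard library. `inspect_stub` is a hypothetical stand-in for `load_and_inspect`, and `run_with_log` and the log location are assumptions made for illustration.

```python
import contextlib
import tempfile
from pathlib import Path

def inspect_stub(name):
    """Hypothetical stand-in for load_and_inspect: reports via print, returns a value."""
    print(f"--- Loading {name} ---")
    print("Shape: (0, 0)")
    return name

def run_with_log(func, log_path, *args, **kwargs):
    """Call func and write everything it prints to log_path instead of the screen."""
    with open(log_path, "w", encoding="utf-8") as log_file:
        with contextlib.redirect_stdout(log_file):
            return func(*args, **kwargs)

log_dir = Path(tempfile.mkdtemp())  # assumed location; any writable folder works
log_path = log_dir / "title_basics.log"
result = run_with_log(inspect_stub, log_path, "Title Basics")
log_text = log_path.read_text(encoding="utf-8")
```

With the real function, the call would look like `run_with_log(load_and_inspect, log_path, path, name)`. The return value is passed through, so the DataFrame remains available.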


## Step 3: Define Dataset Paths

Here, we define the paths to the IMDb datasets. Ensure that the file paths match the location of the downloaded `.tsv` files on your system.


In [30]:
# Paths to datasets
data_files = {
    'Title Basics': 'data/title.basics.tsv',
    'Title AKAs': 'data/title.akas.tsv',
    'Title Ratings': 'data/title.ratings.tsv',
    'Title Crew': 'data/title.crew.tsv',
    'Title Episode': 'data/title.episode.tsv',
    'Title Principals': 'data/title.principals.tsv',
    'Name Basics': 'data/name.basics.tsv'
}
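
Since a wrong path only surfaces when `pd.read_csv` fails, it can help to check all paths before loading anything. This is a minimal sketch using `pathlib`; the helper `find_missing_files` and the `example_files` mapping are hypothetical and only mimic the shape of `data_files`.

```python
from pathlib import Path

def find_missing_files(files):
    """Return the labels in `files` whose path does not point to an existing file."""
    return [label for label, path in files.items() if not Path(path).is_file()]

# Hypothetical mapping in the same shape as data_files
example_files = {
    'Title Basics': 'data/title.basics.tsv',
    'No Such Table': 'data/this_file_does_not_exist.tsv',
}

missing = find_missing_files(example_files)
if missing:
    print(f"Missing files for: {missing}")
```

Calling `find_missing_files(data_files)` before the loading loop would list every dataset whose file is not where the dictionary says it is.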

## Step 4: Load and Inspect Datasets

We iterate over each dataset using the `load_and_inspect` function to:
1. Load the dataset.
2. Perform the initial checks (missing values and per-column duplicates) and print the findings to the cell output.
3. Store the loaded DataFrame in `datasets` so that cross-table checks, such as foreign key validation, can reuse it.
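
Foreign key validation needs more than one table, so it cannot happen inside a single `load_and_inspect` call. The following is a minimal sketch of one way to do it, assuming `tconst` is the key that links the title tables. The miniature tables, the `averageRating` values, and the helper `find_orphans` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up miniature tables; real data would come from the loaded DataFrames
basics = pd.DataFrame({"tconst": ["tt0000001", "tt0000002", "tt0000003"]})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003", "tt9999999"],
    "averageRating": ["5.7", "6.5", "7.0"],
})

def find_orphans(child, parent, key):
    """Return rows of `child` whose `key` value is absent from `parent[key]`."""
    return child[~child[key].isin(parent[key])]

orphans = find_orphans(ratings, basics, "tconst")
print(f"{len(orphans)} rating row(s) reference an unknown tconst")
print(orphans)
```

Once the loop below has filled `datasets`, the same helper could be applied to the real tables, for example `find_orphans(datasets['Title Ratings'], datasets['Title Basics'], 'tconst')`.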


In [31]:
# Load and inspect datasets
datasets = {}
for name, path in data_files.items():
    datasets[name] = load_and_inspect(path, name)


--- Loading Title Basics (data/title.basics.tsv) ---
Shape: (11326033, 9)
Columns: ['tconst', 'titleType', 'primaryTitle', 'originalTitle', 'isAdult', 'startYear', 'endYear', 'runtimeMinutes', 'genres']
Sample Rows:
      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short            Poor Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

  isAdult startYear endYear runtimeMinutes                    genres  
0       0      1894     NaN              1         Documentary,Short  
1       0      1892     NaN              5           Animation,Short  
2       0      1892     NaN              5  Animation,Comedy,Romance  
3       0      1892     NaN             12           A