In [9]:
import os
import pandas as pd

# Base directory: the notebook's current working directory
script_dir = os.getcwd()

print(f"Current working directory: {script_dir}")

Current working directory: c:\Users\willi\Downloads\Dataset


Originally, the title.basics.tsv.gz file contained the following fields: tconst, titleType, primaryTitle, originalTitle, isAdult, startYear, endYear, runtimeMinutes, and genres.

The original title, start year, end year, and runtime are not relevant features for this project, so I removed them. I also kept only rows whose titleType is "movie", which drops every other title type, including "short" films.

In [None]:
# File paths
basics_input_file = os.path.join(script_dir, "IMDB Datasets", "title.basics.tsv")
basics_output_file = os.path.join(script_dir, "Cleaned Datasets", "title_basics_filtered.tsv")

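# Read everything as strings; IMDb marks missing values with \N
df = pd.read_csv(basics_input_file, sep="\t", dtype=str, na_values="\\N")
# Keep only movies and the columns used later
df_filtered = df.loc[df['titleType'] == 'movie', ["tconst", "titleType", "primaryTitle", "genres", "isAdult"]]
# Make sure the output folder exists before writing
os.makedirs(os.path.dirname(basics_output_file), exist_ok=True)
df_filtered.to_csv(basics_output_file, sep="\t", index=False)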

print(f"Filtered dataset saved to: {basics_output_file}")

Filtered dataset saved to: c:\Users\willi\Downloads\Dataset\Cleaned Datasets\title_basics_filtered.tsv
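The text above names the compressed download title.basics.tsv.gz, while the cell reads an already extracted .tsv file. As a minimal sketch, assuming the compressed file is what sits on disk, pandas can read it directly (compression is inferred from the .gz extension), and `usecols` avoids loading the dropped columns at all. The helper name `load_movies` and the toy rows are hypothetical and exist only for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

KEEP = ["tconst", "titleType", "primaryTitle", "genres", "isAdult"]


def load_movies(path):
    """Read a title.basics-style TSV (plain or .gz) and keep only movies."""
    df = pd.read_csv(
        path,
        sep="\t",
        dtype=str,
        na_values="\\N",  # IMDb marks missing values with \N
        usecols=KEEP,     # skip the columns that are dropped anyway
    )
    # Select the movie rows and put the columns in the desired order
    return df.loc[df["titleType"] == "movie", KEEP]


# Made-up rows standing in for the real file
toy = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tToy Short\tToy Short\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000002\tmovie\tToy Movie\tToy Movie\t0\t1999\t\\N\t120\tDrama\n"
    "tt0000003\ttvSeries\tToy Series\tToy Series\t0\t2005\t2008\t\\N\tComedy\n"
)

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "title.basics.tsv.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as fh:
        fh.write(toy)
    movies = load_movies(gz_path)

print(movies)
```

Only the row with titleType "movie" survives, and the four unused columns are never loaded into memory.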


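To ensure data quality, I also removed titles with fewer than 1,000 votes (numVotes), eliminating lesser-known titles that could introduce noise into the recommendations. The ratings file covers every title type, so the restriction to movies only takes effect once this file is joined with the filtered basics file.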

In [None]:
# File paths
ratings_input_file = os.path.join(script_dir, "IMDB Datasets", "title.ratings.tsv") 
ratings_output_file = os.path.join(script_dir, "Cleaned Datasets", "title_ratings_filtered.tsv")

df = pd.read_csv(ratings_input_file, sep="\t", dtype=str, na_values="\\N")
df["numVotes"] = df["numVotes"].astype(int)

# Calculate Mean, Median, Mode
mean_votes = df["numVotes"].mean()
median_votes = df["numVotes"].median()
mode_votes = df["numVotes"].mode()[0]

# Print results
print(f"Mean: {mean_votes}")
print(f"Median: {median_votes}")
print(f"Mode: {mode_votes}")

# Keep only titles with at least 1,000 votes
df_filtered = df.loc[df["numVotes"] >= 1000]
df_filtered.to_csv(ratings_output_file, sep="\t", index=False)

print(f"Filtered dataset saved to: {ratings_output_file}")

Mean: 1025.952498432897
Median: 26.0
Mode: 7
Filtered dataset saved to: c:\Users\willi\Downloads\Dataset\Cleaned Datasets\title_ratings_filtered.tsv
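The printed statistics point to a heavily right-skewed distribution (a mean of about 1,026 votes against a median of 26), so a 1,000-vote cutoff discards most titles. Below is a rough sketch for checking how many rows survive different cutoffs before settling on one. The helper `share_kept` and the toy vote counts are hypothetical and are not taken from the IMDb file.

```python
import pandas as pd


def share_kept(votes, thresholds):
    """Return, per threshold, the fraction of rows with votes >= threshold."""
    votes = pd.Series(votes, dtype="int64")
    return {t: float((votes >= t).mean()) for t in thresholds}


# Made-up, skewed vote counts (most titles have few votes)
toy_votes = [5, 7, 7, 12, 26, 40, 150, 900, 1000, 250000]

shares = share_kept(toy_votes, [100, 1000, 10000])
for threshold, share in shares.items():
    print(f"numVotes >= {threshold}: {share:.0%} of titles kept")
```

Running the same helper on `df["numVotes"]` from the cell above would show the actual share kept at each candidate threshold.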


To create a comprehensive dataset, I merged the filtered movie data with the filtered ratings (inner join) and the raw crew information (left join) on the common identifier tconst. I then removed the titleType and numVotes columns, which are no longer needed.

In [None]:
# File paths
files = {
    "basics": os.path.join(script_dir, "Cleaned Datasets", "title_basics_filtered.tsv"),
    "ratings": os.path.join(script_dir, "Cleaned Datasets", "title_ratings_filtered.tsv"),
    "crew": os.path.join(script_dir, "IMDB Datasets", "title.crew.tsv"),  # Raw file
}

df = {name: pd.read_csv(path, sep="\t", dtype=str, na_values="\\N") for name, path in files.items()}


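# Inner join: keep only movies that also passed the vote filter
df_merged = pd.merge(df["basics"], df["ratings"], on="tconst", how="inner")
# Left join: keep every remaining movie, even if it has no crew row
df_merged = pd.merge(df_merged, df["crew"], on="tconst", how="left")

# Drop helper columns that are no longer needed
df_merged = df_merged.drop(columns=["titleType", "numVotes"])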


output_file = os.path.join(script_dir, "Cleaned Datasets", "merged_movie_data.tsv")
df_merged.to_csv(output_file, sep="\t", index=False)

print(f"Merged dataset saved to: {output_file}")

Merged dataset saved to: c:\Users\willi\Downloads\Dataset\Cleaned Datasets\merged_movie_data.tsv
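The two joins behave differently: the inner join keeps only movies present in both the basics and ratings files, while the left join keeps each of those movies even when no crew row exists. The sketch below illustrates this on made-up frames. It uses `validate` so the merge raises an error if tconst is unexpectedly duplicated, and `indicator` to count movies without crew data. Whether the real files satisfy the one-to-one assumption is not verified here.

```python
import pandas as pd

# Made-up frames standing in for the three input files
basics = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt9"],
    "averageRating": ["7.5", "8.2", "6.0"],
})
crew = pd.DataFrame({
    "tconst": ["tt1"],
    "directors": ["nm1"],
    "writers": ["nm2"],
})

# Inner join drops tt3 (no rating) and tt9 (not in basics)
merged = pd.merge(basics, ratings, on="tconst", how="inner",
                  validate="one_to_one")

# Left join keeps tt2 even though it has no crew row
merged = pd.merge(merged, crew, on="tconst", how="left",
                  validate="one_to_one", indicator=True)

# The _merge column marks rows that found no match in crew
missing_crew = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")

print(merged)
print(f"Movies without a crew row: {missing_crew}")
```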


### Expected Final Dataset

| tconst | primaryTitle  | genres | isAdult | averageRating | directors | writers |
|--------|---------------|--------|---------|---------------|-----------|---------|
| tt0001 | Example Movie | Drama  | 0       | 7.5           | nm12345   | nm56789 |
| tt0002 | Another Movie | Comedy | 0       | 8.2           | nm67890   | nm54321 |
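
One quick way to confirm that a saved file matches this layout is to compare its header against the expected column list. This is a minimal sketch: the helper `check_columns` is hypothetical, and the toy frame stands in for merged_movie_data.tsv.

```python
import pandas as pd

EXPECTED_COLUMNS = [
    "tconst", "primaryTitle", "genres", "isAdult",
    "averageRating", "directors", "writers",
]


def check_columns(df, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in df.columns]
    unexpected = [c for c in df.columns if c not in expected]
    return missing, unexpected


# Toy frame with one leftover column to show what a mismatch looks like
toy = pd.DataFrame(columns=EXPECTED_COLUMNS + ["numVotes"])
missing, unexpected = check_columns(toy)
print("missing:", missing, "| unexpected:", unexpected)
```

For the real file, the same check could be applied to a frame loaded with `pd.read_csv(output_file, sep="\t", dtype=str)`.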
