In [21]:
import os
import pandas as pd

script_dir = os.getcwd() 

print(f"Current working directory: {script_dir}")

Current working directory: c:\Users\willi\OneDrive\Documents\GitHub\3\Movie-Recommendations


Originally, the title.basics.tsv.gz file contained the following fields: tconst, titleType, primaryTitle, originalTitle, isAdult, startYear, endYear, runtimeMinutes, and genres.

Since the original title, start year, end year, and runtime are not relevant features for the recommendations, they are left out of the final dataset. I have also kept only titles whose titleType is movie, which filters out shorts and all other non-movie content.

In [None]:
basics_input_file = os.path.join(script_dir, "IMDB Datasets", "title.basics.tsv")

df = pd.read_csv(basics_input_file, sep="\t", dtype=str, na_values="\\N")
title_basics_df_filtered = df.loc[df['titleType'] == 'movie', ["tconst",  "isAdult", "startYear", "runtimeMinutes", "genres"]]

# print(f"Title basics filtered:\n{title_basics_df_filtered.head(10)}")
display(title_basics_df_filtered.head(10))

Unnamed: 0,tconst,isAdult,startYear,runtimeMinutes,genres
8,tt0000009,0,1894,45.0,Romance
144,tt0000147,0,1897,100.0,"Documentary,News,Sport"
498,tt0000502,0,1905,100.0,
570,tt0000574,0,1906,70.0,"Action,Adventure,Biography"
587,tt0000591,0,1907,90.0,Drama
610,tt0000615,0,1907,,Drama
625,tt0000630,0,1908,,Drama
668,tt0000675,0,1908,,Drama
672,tt0000679,0,1908,120.0,"Adventure,Fantasy"
828,tt0000838,0,1909,,
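
The text above refers to the compressed title.basics.tsv.gz, while the cell reads an already extracted .tsv file. As a minimal sketch, assuming the compressed download is what is on disk, pandas can read the gzip file directly and load only the columns that are needed. The helper name `load_movie_basics` and the three sample rows are hypothetical and exist only to make the example runnable.

```python
import gzip
import io

import pandas as pd

# Hypothetical three-row sample in the title.basics layout (not real IMDb rows).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt9000001\tshort\tSample Short\tSample Short\t0\t1900\t\\N\t1\tShort\n"
    "tt9000002\tmovie\tSample Movie\tSample Movie\t0\t1950\t\\N\t90\tDrama\n"
    "tt9000003\ttvEpisode\tSample Episode\tSample Episode\t0\t2000\t\\N\t\\N\t\\N\n"
)


def load_movie_basics(source, columns):
    """Read a gzip-compressed title.basics file and keep only movie rows."""
    df = pd.read_csv(
        source,
        sep="\t",
        compression="gzip",
        dtype=str,
        na_values="\\N",
        # Load titleType for the filter plus the requested columns only.
        usecols=["titleType", *columns],
    )
    return df.loc[df["titleType"] == "movie", columns]


# Compress the sample in memory so the sketch needs no file on disk.
compressed = io.BytesIO(gzip.compress(sample_tsv.encode("utf-8")))
movies = load_movie_basics(compressed, ["tconst", "isAdult", "genres"])
print(movies)
```

In the notebook, the same helper could be pointed at a path such as `os.path.join(script_dir, "IMDB Datasets", "title.basics.tsv.gz")`, assuming that file exists there.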


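To ensure data quality, I have also removed titles with fewer than 1,000 votes (numVotes), eliminating lesser-known films that could introduce noise into the recommendations. The cell also prints the mean, median, and mode of numVotes to show how skewed the vote counts are.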

In [None]:
# File paths
ratings_input_file = os.path.join(script_dir, "IMDB Datasets", "title.ratings.tsv") 

df = pd.read_csv(ratings_input_file, sep="\t", dtype=str, na_values="\\N")
df["numVotes"] = df["numVotes"].astype(int)

# Calculate Mean, Median, Mode
mean_votes = df["numVotes"].mean()
median_votes = df["numVotes"].median()
mode_votes = df["numVotes"].mode()[0]

# Print results
print(f"Mean: {mean_votes}")
print(f"Median: {median_votes}")
print(f"Mode: {mode_votes}")

ratings_df_filtered = df.loc[df["numVotes"] >= 1000]

# print(f"\n Ratings filtered:\n{ratings_df_filtered.head(10)}")
display(ratings_df_filtered.head(10))

Mean: 1025.952498432897
Median: 26.0
Mode: 7

 Filtered rating dataset saved to: c:\Users\willi\OneDrive\Documents\GitHub\3\Movie-Recommendations\Cleaned Datasets\title_ratings_filtered.tsv


Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2137
2,tt0000003,6.4,2170
4,tt0000005,6.2,2902
7,tt0000008,5.4,2281
9,tt0000010,6.8,7882
11,tt0000012,7.4,13382
12,tt0000013,5.7,2051
13,tt0000014,7.1,6109
14,tt0000015,6.1,1262
15,tt0000016,5.9,1646
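
The printed statistics (a mean near 1,026 against a median of 26) indicate a heavily right-skewed distribution, so a fixed cutoff discards most titles. A quick way to see how much data a given cutoff keeps is sketched below. The function name `share_retained` and the vote counts are made up for illustration and are not taken from the IMDb data.

```python
import pandas as pd


def share_retained(num_votes, threshold):
    """Return the fraction of titles with at least `threshold` votes."""
    return float((num_votes >= threshold).mean())


# Made-up, skewed vote counts for illustration only (not IMDb data).
votes = pd.Series([5, 7, 7, 12, 26, 40, 150, 900, 1500, 250000])

# Compare a few candidate cutoffs before committing to one.
for cutoff in (100, 1000, 10000):
    print(f"numVotes >= {cutoff}: {share_retained(votes, cutoff):.0%} of titles kept")
```

In the notebook, calling `share_retained(df["numVotes"], 1000)` on the ratings frame would report the share of titles that survive the 1,000-vote filter.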


To create a comprehensive dataset, I merged the filtered movie data with ratings and crew information using the common identifier tconst. Additionally, I removed the titleType and numVotes columns, as they are no longer needed.

In [None]:
crew_input_file = os.path.join(script_dir, "IMDB Datasets", "title.crew.tsv")

# pd.merge expects DataFrames, so load the crew file before merging
crew_df = pd.read_csv(crew_input_file, sep="\t", dtype=str, na_values="\\N")

df_merged = pd.merge(title_basics_df_filtered, ratings_df_filtered, on="tconst", how="inner")
df_merged = pd.merge(df_merged, crew_df, on="tconst", how="left")

# titleType is not among the columns selected earlier, so ignore it if absent
df_merged = df_merged.drop(columns=["titleType", "numVotes"], errors="ignore")


output_file = os.path.join(script_dir, "Cleaned Datasets", "merged_movie_data.tsv")
df_merged.to_csv(output_file, sep="\t", index=False)

print(f"Merged dataset saved to: {output_file}")

print(f"\n Merged dataset:\n{df_merged.head(10)}")

### Merged Dataset

| tconst     | primaryTitle                                      | genres                 | isAdult | averageRating | directors                         | writers                                 |
|------------|--------------------------------------------------|------------------------|---------|--------------|----------------------------------|----------------------------------------|
| tt0002130  | Dante's Inferno                                 | Adventure, Drama, Fantasy | 0       | 7.0          | nm0078205, nm0655824, nm0209738  | nm0019604                              |
| tt0002423  | Passion                                        | Biography, Drama, Romance | 0       | 6.6          | nm0523932                        | nm0266183, nm0473134                   |
| tt0002844  | Fantômas: In the Shadow of the Guillotine       | Crime, Drama           | 0       | 6.9          | nm0275421                        | nm0019855, nm0275421, nm0816232        |
| tt0003014  | Ingeborg Holm                                  | Drama                  | 0       | 7.0          | nm0803705                        | nm0472236, nm0803705                   |
| tt0003037  | Fantomas: The Man in Black                    | Crime, Drama           | 0       | 6.9          | nm0275421                        | nm0019855, nm0275421, nm0816232        |
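
The merge step can also be checked for duplicate keys and missing crew rows. The sketch below uses toy stand-ins for the three tables and relies on the `validate` and `indicator` parameters of `pd.merge`. The IDs, the values, and the `crew_match` column name are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three tables (hypothetical IDs, not IMDb data).
basics = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"], "genres": ["Drama", "Crime", None]})
ratings = pd.DataFrame({"tconst": ["tt1", "tt2"], "averageRating": ["7.0", "6.6"]})
crew = pd.DataFrame({"tconst": ["tt1"], "directors": ["nm1"], "writers": ["nm2"]})

# Inner join: keep only movies that also appear in the filtered ratings.
# validate raises a MergeError if tconst is duplicated on either side.
merged = pd.merge(basics, ratings, on="tconst", how="inner", validate="one_to_one")

# Left join: keep every rated movie even when crew data is missing.
# indicator adds a column recording whether each row found a crew match.
merged = pd.merge(
    merged, crew, on="tconst", how="left", validate="one_to_one", indicator="crew_match"
)

missing_crew = int((merged["crew_match"] == "left_only").sum())
print(merged.drop(columns="crew_match"))
print(f"Movies without crew data: {missing_crew}")
```

The left join keeps movies that have no crew entry, so their directors and writers come through as missing values rather than the rows being dropped.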

