## Data Preprocessing

Before building a recommendation model, we must clean and prepare the dataset:
- Normalise ratings from a **0-5 scale** to a **0-1 scale**.
- Keep only the IMDb movies that also appear in the MovieLens data, matched by IMDb ID.
- Ensure all required data is correctly formatted.

In [61]:
import os
import pandas as pd

script_dir = os.getcwd()

print(f"Current working directory: {script_dir}")

Current working directory: c:\Users\willi\OneDrive\Documents\GitHub\Test\Movie-Recommendation


# Matching IMDb Movies with MovieLens Dataset

In [69]:
# File paths
links_file = os.path.join(script_dir, "MovieLens Datasets Original", "links.csv")
merged_file = os.path.join(script_dir, "Cleaned Datasets", "merged_movie_data.tsv")

# Load datasets
df_links = pd.read_csv(links_file)
df_merged = pd.read_csv(merged_file, sep="\t", dtype=str, na_values="\\N")

# Drop tmdbId column
df_links = df_links.drop(columns=["tmdbId"])

# Number of data rows 
print(f"Total data rows in links.csv: {len(df_links)}")
print(f"Total data rows in merged_movie_data.tsv: {len(df_merged)}")

# Links datafile 
print(f"\n df_links:\n{df_links.head(10)}")

Total data rows in links.csv: 9742
Total data rows in merged_movie_data.tsv: 45039

 df_links:
   movieId  imdbId
0        1  114709
1        2  113497
2        3  113228
3        4  114885
4        5  113041
5        6  113277
6        7  114319
7        8  112302
8        9  114576
9       10  113189


From the output above, we can observe that:
- The `links.csv` dataset contains a total of 9,742 rows.
- The `merged_movie_data.tsv` dataset contains 45,039 rows.

The MovieLens dataset (`links.csv`) is therefore much smaller than the IMDb dataset (`merged_movie_data.tsv`). We need to remove every row of the IMDb dataset whose IMDb ID has no corresponding entry in `links.csv`.

In [None]:
'''

# Check the max movieId value
max_movie_id = df_links['movieId'].max()
print(f"Maximum movieId value: {max_movie_id}")

# Count unique movieIds
unique_movie_ids = df_links['movieId'].nunique()
print(f"Number of unique movieIds: {unique_movie_ids}")


'''

Maximum movieId value: 193609
Number of unique movieIds: 9742


The MovieLens dataset uses non-sequential IDs for movies. Below is a sample from the dataset showing a significant gap in the movieId values:

| movieId | imdbId    | tmdbId  |
|---------|-----------|---------|
| 9018    | 0391024   | 18889   |
| 25746   | 0014142   | 18987   |

This explains why the dataset has 9,742 rows but the maximum movieId value is 193,609.
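
One way to see how sparse the ID space is: compare the number of unique IDs with the maximum ID, then look at the largest jump between consecutive IDs. The sketch below uses a small hypothetical frame, `toy_links`, built from a handful of `movieId` values shown in this notebook. It is not the real `links.csv`, so the gap it reports is illustrative only.

```python
import pandas as pd

# Hypothetical sample: a few movieId values that appear in this notebook
toy_links = pd.DataFrame({"movieId": [1, 2, 3, 9018, 25746, 193609]})

ids = toy_links["movieId"].sort_values().reset_index(drop=True)

n_unique = ids.nunique()               # number of distinct movies
max_id = int(ids.max())                # largest ID in use
largest_gap = int(ids.diff().max())    # biggest jump between consecutive IDs

print(f"Unique IDs: {n_unique}, maximum ID: {max_id}, largest gap: {largest_gap}")
```

Running the same three lines on `df_links["movieId"]` would give the corresponding figures for the full dataset.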

In [None]:
'''

# MovieLens imdbId doesn't include the "tt" prefix that IMDb uses

def map_movie_ids():
    # Dictionary mapping IMDb IDs to merged dataset rows
    imdb_to_movie = {row['tconst']: row for _, row in df_merged.iterrows()}
    
    # New dataframe with mapped data
    mapped_data = []
    for _, row in df_links.iterrows():
        # Format to match IMDb's tt0000000 format
        imdb_id = f"tt{row['imdbId']:07d}"  
        
        if imdb_id in imdb_to_movie:
            movie_data = imdb_to_movie[imdb_id]
            # Create a row with data from both sources
            mapped_data.append({
                'movieId': row['movieId'], 
                'imdbId': imdb_id,        
                'title': movie_data['primaryTitle'],
            })
    
    return pd.DataFrame(mapped_data)

'''

The function `filter_imdb_to_movielens_matches` below filters the IMDb dataset (`df_merged`) to retain only the movies that exist in the MovieLens dataset (`df_links`).

It does this by:
- Extracting the IMDb IDs from `links.csv` and formatting them as `tt` followed by a zero-padded, seven-digit number.
- Matching these IDs against the `tconst` column in the merged IMDb dataset.
- Saving the filtered dataset, which contains only movies that exist in both datasets.

This ensures that only relevant movies are included for analysis or recommendations.
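
The formatting and matching steps can be sketched on toy data. The frames `toy_links` and `toy_imdb` are hypothetical stand-ins for `df_links` and `df_merged`. The vectorised string formatting is one possible approach, not necessarily the one used in the cell that follows.

```python
import pandas as pd

# Hypothetical stand-ins for df_links and df_merged
toy_links = pd.DataFrame({"movieId": [1, 2], "imdbId": [114709, 113497]})
toy_imdb = pd.DataFrame({"tconst": ["tt0114709", "tt0004972"]})

# MovieLens stores only the numeric part; IMDb uses "tt" plus zero-padded digits
toy_ids = "tt" + toy_links["imdbId"].astype(str).str.zfill(7)

# Keep only the IMDb rows whose tconst appears in the MovieLens ID list
toy_matched = toy_imdb[toy_imdb["tconst"].isin(toy_ids)]

print(toy_ids.tolist())
print(toy_matched)
```

Here only `tt0114709` survives: `tt0004972` is not listed in `toy_links`, and `tt0113497` has no row in `toy_imdb`.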

In [None]:
def filter_imdb_to_movielens_matches():
    
    # links.csv stores imdbId as a bare number, so cast to int, zero-pad to
    # seven digits and add the "tt" prefix to match IMDb's tconst format
    movielens_imdb_ids = {
        f"tt{int(imdb_id):07d}" for imdb_id in df_links["imdbId"]
    }
    
    # Filter the merged IMDb dataset to only include movies with matching IDs
    filtered_imdb = df_merged[df_merged['tconst'].isin(movielens_imdb_ids)]
    
    # Save the filtered dataset
    output_file = os.path.join(script_dir, "Cleaned Datasets", "Final_Movie_Data.tsv")
    filtered_imdb.to_csv(output_file, sep="\t", index=False)
    
    print(f"Total IMDb movies matched with MovieLens: {len(filtered_imdb)}")
    print(f"Saved filtered dataset to: {output_file}")
    
    return filtered_imdb

filtered_imdb_movies = filter_imdb_to_movielens_matches()

Total IMDb movies matched with MovieLens: 8902
Saved filtered dataset to: c:\Users\willi\OneDrive\Documents\GitHub\Test\Movie-Recommendation\Cleaned Datasets\imdb_movielens_matched.tsv
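
The counts so far (9,742 links against 8,902 matched rows, a difference of 840) suggest that some MovieLens entries have no row in the IMDb-side file, assuming each `tconst` appears only once there. A minimal sketch of how such entries could be listed, using hypothetical toy frames in place of `df_links` and `df_merged`:

```python
import pandas as pd

# Hypothetical stand-ins for df_links and df_merged
toy_links = pd.DataFrame({"movieId": [1, 2, 3], "imdbId": [114709, 113497, 113228]})
toy_imdb = pd.DataFrame({"tconst": ["tt0114709", "tt0113228", "tt0004972"]})

# Format the MovieLens IDs the same way as IMDb's tconst column
toy_links["tconst"] = "tt" + toy_links["imdbId"].astype(str).str.zfill(7)

# MovieLens entries whose IMDb ID is absent from the IMDb-side table
unmatched = toy_links[~toy_links["tconst"].isin(toy_imdb["tconst"])]

print(f"Unmatched MovieLens entries: {len(unmatched)} of {len(toy_links)}")
print(unmatched[["movieId", "tconst"]])
```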


In [None]:
'''

# Filtered IMDb movies matched with the MovieLens dataset - corresponds to file: Cleaned Datasets/Final_Movie_Data.tsv
print(f"Filtered IMDB Movies:\n{filtered_imdb_movies.head(4)}")

'''

Filtered IMDB Movies:
       tconst                  primaryTitle                   genres isAdult  \
17  tt0004972         The Birth of a Nation                Drama,War       0   
24  tt0006333  20,000 Leagues Under the Sea  Action,Adventure,Sci-Fi       0   
26  tt0006864                   Intolerance            Drama,History       0   
40  tt0010040               Daddy-Long-Legs             Comedy,Drama       0   

   averageRating  directors                                            writers  
17           6.1  nm0000428  nm0228746,nm0000428,nm0940488,nm0934306,nm1628...  
24           6.1  nm0665737                                nm0894523,nm0665737  
26           7.7  nm0000428  nm0048512,nm0115218,nm0000428,nm0002616,nm0640...  
40           6.6  nm0624714                                nm0916914,nm0426515  


### IMDb-MovieLens Matched File

| tconst     | primaryTitle                | genres                       | isAdult | averageRating | directors     | writers                                                                                      |
|------------|-----------------------------|------------------------------|---------|---------------|---------------|----------------------------------------------------------------------------------------------|
| tt0004972  | The Birth of a Nation        | Drama,War                    | 0       | 6.1           | nm0000428      | nm0228746,nm0000428,nm0940488,nm0934306,nm16280870,nm16280871                                |
| tt0006333  | 20,000 Leagues Under the Sea | Action,Adventure,Sci-Fi      | 0       | 6.1           | nm0665737      | nm0894523,nm0665737                                                                            |
| tt0006864  | Intolerance                  | Drama,History                | 0       | 7.7           | nm0000428      | nm0048512,nm0115218,nm0000428,nm0002616,nm0640437,nm1578667,nm0940488                        |
| tt0010040  | Daddy-Long-Legs              | Comedy,Drama                 | 0       | 6.6           | nm0624714      | nm0916914,nm0426515                                                                            |


# Processing and Filtering Ratings Data with IMDb IDs

In this section, I will:
- remove the `timestamp` column, as it is not used in the steps below.
- normalise the `rating` from a 0-5 scale to a 0-1 scale.
- replace each `movieId` with its IMDb ID, dropping any rating whose movie has no entry in `links.csv`.

In [68]:
ratings_file = os.path.join(script_dir, "MovieLens Datasets Original", "ratings.csv")

df_ratings_raw = pd.read_csv(ratings_file)

# Remove the timestamp column
df_ratings = df_ratings_raw.drop(columns=["timestamp"])

# Normalise from 0-5 ratings to 0-1 
df_ratings["rating"] = df_ratings["rating"] / 5.0

print(f"{df_ratings.head(5)}")

   userId  movieId  rating
0       1        1     0.8
1       1        3     0.8
2       1        6     0.8
3       1       47     1.0
4       1       50     1.0
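
MovieLens ratings use half-star steps from 0.5 to 5.0, so dividing by 5 gives values between 0.1 and 1.0 rather than a full 0-1 range. If the lowest rating should map to exactly 0, min-max scaling is one alternative. The sketch below compares the two on a hypothetical `toy_ratings` frame; which one suits the later model is a design choice this notebook does not settle.

```python
import pandas as pd

# Hypothetical ratings on the 0.5-5.0 half-star scale
toy_ratings = pd.DataFrame({"rating": [0.5, 3.0, 4.0, 5.0]})

# Approach used in this notebook: divide by the scale maximum
toy_ratings["scaled_by_max"] = toy_ratings["rating"] / 5.0

# Alternative: min-max scaling, which maps the lowest possible rating to 0
r_min, r_max = 0.5, 5.0
toy_ratings["min_max"] = (toy_ratings["rating"] - r_min) / (r_max - r_min)

print(toy_ratings.round(3))
```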


### Merging Ratings with IMDb Links

- The ratings dataset (`df_ratings`) is merged with `df_links` on `movieId` to link each rating to an IMDb ID.
- Only the relevant columns (`userId`, `imdbId`, and `rating`) are kept.

The resulting dataset is then saved as `Audience_Ratings.csv` in the "Cleaned Datasets" folder, and the total number of ratings after the merge is displayed.
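
The merge can be sketched on toy data to show what an inner join keeps and drops. `toy_ratings` and `toy_links` are hypothetical stand-ins for `df_ratings` and `df_links`. The `validate` argument is an optional safeguard added for this sketch and is not part of the original cell.

```python
import pandas as pd

# Hypothetical stand-ins for df_ratings and df_links
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [1, 3, 999],   # 999 has no entry in toy_links
    "rating": [0.8, 0.8, 0.6],
})
toy_links = pd.DataFrame({"movieId": [1, 3], "imdbId": [114709, 113228]})

# validate="many_to_one" raises if a movieId appears more than once in toy_links
toy_merged = toy_ratings.merge(
    toy_links, on="movieId", how="inner", validate="many_to_one"
)
toy_merged = toy_merged[["userId", "imdbId", "rating"]]

print(toy_merged)
print(f"Ratings dropped by the inner join: {len(toy_ratings) - len(toy_merged)}")
```

Comparing `len(df_ratings)` with `len(df_ratings_filtered)` in the same way would show whether the join removed any ratings from the real data.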

In [None]:
# Merge ratings with links to get IMDb IDs
df_ratings_filtered = df_ratings.merge(df_links, on="movieId", how="inner")

# Keep only relevant columns 
df_ratings_filtered = df_ratings_filtered[["userId", "imdbId", "rating"]]

# Save the filtered ratings dataset
filtered_ratings_file = os.path.join(script_dir, "Cleaned Datasets", "Audience_Ratings.csv")
df_ratings_filtered.to_csv(filtered_ratings_file, index=False)

# Display saved file locations 
print(f"Filtered ratings dataset saved to: {filtered_ratings_file}")
print(f"Total ratings after filtering: {len(df_ratings_filtered)}")

Filtered ratings dataset saved to: c:\Users\willi\OneDrive\Documents\GitHub\Test\Movie-Recommendation\Cleaned Datasets\ratings_imdb_matched.csv
Total ratings after filtering: 100836


In [None]:
'''

# Original ratings file
print(f"\nratings.csv raw:\n{df_ratings_raw.head(5)}")

# Cleaned and Normalised rating file 
print(f"\n df_ratings - timestamp column removed, ratings normalised:\n{df_ratings.head(5)}")

# Final file, corresponds to Cleaned Datasets/Audience_Ratings.csv
print(f"\n df_ratings - movieId replaced with imdbId:\n{df_ratings_filtered.head(5)}")

'''


ratings.csv raw:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

 df_ratings - timestamp column removed, ratings normalised:
   userId  movieId  rating
0       1        1     0.8
1       1        3     0.8
2       1        6     0.8
3       1       47     1.0
4       1       50     1.0

 df_ratings - movieID changed to imdbID:
   userId  imdbId  rating
0       1  114709     0.8
1       1  113228     0.8
2       1  113277     0.8
3       1  114369     1.0
4       1  114814     1.0


### Final Ratings Dataset


| userId | imdbId | rating |
|--------|--------|--------|
| 1      | 114709 | 0.8    |
| 1      | 113228 | 0.8    |
| 1      | 113277 | 0.8    |
| 1      | 114369 | 1.0    |
