<a href="https://colab.research.google.com/github/alexanderkhoo-21006325/bda_in_cloud/blob/main/BD_RecSystem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
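Build a movie/show recommendation system using MapReduce on the "titles.csv" and "credits.csv" datasets. The map task generates pairs of movies/shows watched by the same user, and the reduce task counts how often each pair occurs. The datasets contain no viewing histories, so each `person_id` in "credits.csv" (a credited person, such as an actor) stands in for a user, and a title counts as "watched" by everyone credited on it.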

## Load and explore data



In [None]:
from google.colab import files
import pandas as pd

# Prompt the user to upload titles.csv and credits.csv
uploaded = files.upload()

# Report the name and size of each uploaded file
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))


Saving credits.csv to credits.csv
Saving titles.csv to titles.csv
User uploaded file "credits.csv" with length 3816592 bytes
User uploaded file "titles.csv" with length 2028486 bytes


**Reasoning**:
Load the datasets into pandas DataFrames and display their head.



In [None]:
import pandas as pd

titles_df = pd.read_csv('titles.csv')
credits_df = pd.read_csv('credits.csv')

display(titles_df.head())
display(credits_df.head())

Unnamed: 0,id,title,type,description,release_year,age_certification,runtime,genres,production_countries,seasons,imdb_id,imdb_score,imdb_votes,tmdb_popularity,tmdb_score
0,ts300399,Five Came Back: The Reference Films,SHOW,This collection includes 12 World War II-era p...,1945,TV-MA,51,['documentation'],['US'],1.0,,,,0.6,
1,tm84618,Taxi Driver,MOVIE,A mentally unstable Vietnam War veteran works ...,1976,R,114,"['drama', 'crime']",['US'],,tt0075314,8.2,808582.0,40.965,8.179
2,tm154986,Deliverance,MOVIE,Intent on seeing the Cahulawassee River before...,1972,R,109,"['drama', 'action', 'thriller', 'european']",['US'],,tt0068473,7.7,107673.0,10.01,7.3
3,tm127384,Monty Python and the Holy Grail,MOVIE,"King Arthur, accompanied by his squire, recrui...",1975,PG,91,"['fantasy', 'action', 'comedy']",['GB'],,tt0071853,8.2,534486.0,15.461,7.811
4,tm120801,The Dirty Dozen,MOVIE,12 American military prisoners in World War II...,1967,,150,"['war', 'action']","['GB', 'US']",,tt0061578,7.7,72662.0,20.398,7.6


Unnamed: 0,person_id,id,name,character,role
0,3748,tm84618,Robert De Niro,Travis Bickle,ACTOR
1,14658,tm84618,Jodie Foster,Iris Steensma,ACTOR
2,7064,tm84618,Albert Brooks,Tom,ACTOR
3,3739,tm84618,Harvey Keitel,Matthew 'Sport' Higgins,ACTOR
4,48933,tm84618,Cybill Shepherd,Betsy,ACTOR


## Data preprocessing

### Subtask:
Clean and preprocess the data as needed, which involves handling missing values, merging the two datasets, and selecting relevant columns for the recommendation task.


In [None]:
# Drop rows with a missing id, since id is the merge key
titles_df.dropna(subset=['id'], inplace=True)

# Merge the two dataframes
merged_df = pd.merge(titles_df, credits_df, on='id', how='inner')

# Select relevant columns
cleaned_df = merged_df[['id', 'title', 'type', 'person_id']]

# Sort by person_id
cleaned_df = cleaned_df.sort_values(by='person_id')

display(cleaned_df.head())

Unnamed: 0,id,title,type,person_id
45239,tm239927,Official Secrets,MOVIE,7
25520,tm244758,Breaking the Bank,MOVIE,7
14393,tm174683,Arthur Christmas,MOVIE,7
28528,tm287684,The Ritual,MOVIE,8
66655,tm992844,The Dig,MOVIE,8
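
The effect of the inner merge can be checked on a toy example. The sketch below uses two small hypothetical frames (`toy_titles` and `toy_credits`, not the real data) to show that an inner merge on `id` silently drops titles without credits, and that an outer merge with `indicator=True` reveals which rows have no partner. All names and values here are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frames that mimic the columns used above
toy_titles = pd.DataFrame({
    "id": ["tm1", "tm2", "ts3"],
    "title": ["Alpha", "Beta", "Gamma"],
    "type": ["MOVIE", "MOVIE", "SHOW"],
})
toy_credits = pd.DataFrame({
    "person_id": [10, 11, 10],
    "id": ["tm1", "tm1", "tm2"],
})

# Inner merge keeps only titles that have at least one credit row
toy_merged = pd.merge(toy_titles, toy_credits, on="id", how="inner")

# Outer merge with indicator=True labels rows that exist on one side only
toy_check = pd.merge(toy_titles, toy_credits, on="id", how="outer", indicator=True)
unmatched_ids = toy_check.loc[toy_check["_merge"] != "both", "id"].tolist()

print(len(toy_merged))  # number of (title, person) rows kept by the inner merge
print(unmatched_ids)    # ids that appear in only one of the two frames
```

Running the same check on `titles_df` and `credits_df` would show how many titles the recommendation step can never see because they have no credits.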


## Implement MapReduce for co-occurrence

### Subtask:
Implement the MapReduce steps for calculating movie/show co-occurrence. The map task will generate pairs of movies/shows watched by the same user, and the reduce task will count the occurrences of each pair.


In [None]:
from collections import defaultdict
import itertools

# Map step: Group by person_id and generate pairs of titles watched by each person
title_pairs = []
for person_id, group in cleaned_df.groupby('person_id'):
    titles_watched = group['title'].unique().tolist()
    # Generate all unique pairs of titles for the current user
    for pair in itertools.combinations(titles_watched, 2):
        # Ensure the pair is always in the same order to avoid duplicates
        title_pairs.append(tuple(sorted(pair)))

# Reduce step: Count the occurrences of each pair
pair_counts = defaultdict(int)
for pair in title_pairs:
    pair_counts[pair] += 1

# Convert the dictionary to a pandas Series for easier viewing and potential sorting
pair_counts_series = pd.Series(pair_counts).sort_values(ascending=False)

display(pair_counts_series.head(50))
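
The cell above runs the map and reduce steps as plain loops in one process. To make the three MapReduce phases explicit, here is a minimal sketch on hypothetical toy data. The function names `map_pairs`, `shuffle`, and `reduce_counts` and the `toy_people` dictionary are illustrative assumptions, not part of any MapReduce framework.

```python
from collections import defaultdict
from itertools import combinations

def map_pairs(person_titles):
    """Map: emit ((title_a, title_b), 1) for every unordered pair of one person's titles."""
    # Sorting first means each pair is always emitted in the same order
    for pair in combinations(sorted(set(person_titles)), 2):
        yield pair, 1

def shuffle(mapped):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical toy input: person_id -> titles associated with that person
toy_people = {
    1: ["Alpha", "Beta", "Gamma"],
    2: ["Beta", "Alpha"],
    3: ["Gamma"],
}

mapped = (kv for titles in toy_people.values() for kv in map_pairs(titles))
toy_counts = reduce_counts(shuffle(mapped))
print(toy_counts)
```

Because each mapper call only needs one person's titles and each reducer call only needs one key's values, both phases could in principle be distributed across workers.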

## Generate recommendations

### Subtask:
Using the co-occurrence counts, generate recommendations for a given user: score each candidate title by how often it co-occurs with titles the user has already watched, then return the highest-scoring titles the user has not yet watched.


In [None]:
def recommend_titles(watched_titles, pair_counts, n=3):
    """Return up to n titles that co-occur most often with the watched titles.

    pair_counts is a Series indexed by (title_a, title_b) tuples.
    """
    watched = set(watched_titles)
    recommendations = defaultdict(float)

    # The two index levels hold the first and second title of each pair
    first_titles = pair_counts.index.get_level_values(0)
    second_titles = pair_counts.index.get_level_values(1)

    for watched_title in watched_titles:
        # Select pairs in which one side is exactly the watched title
        # (substring matching would also pick up unrelated, similarly named titles)
        mask = (first_titles == watched_title) | (second_titles == watched_title)

        for pair, count in pair_counts[mask].items():
            # Identify the other title in the pair
            other_title = pair[0] if pair[1] == watched_title else pair[1]

            # Accumulate the co-occurrence count as the score
            recommendations[other_title] += count

    # Exclude titles already watched, then sort by score, highest first
    candidates = [(title, score) for title, score in recommendations.items() if title not in watched]
    candidates.sort(key=lambda item: item[1], reverse=True)

    # Return the top n recommendations
    return [title for title, score in candidates[:n]]

# Example usage (assuming 'pair_counts_series' and 'cleaned_df' are available)
# Pick a random person_id to act as the sample user
import random
sample_user_id = random.choice(cleaned_df['person_id'].unique().tolist())
sample_watched_titles = cleaned_df[cleaned_df['person_id'] == sample_user_id]['title'].unique().tolist()

print(f"User watched titles: {sample_watched_titles}")

# Generate the top 3 recommendations for the sample user
recommendations = recommend_titles(sample_watched_titles, pair_counts_series, n=3)

print(f"Recommendations for the user: {recommendations}")

User watched titles: ['My Führer']
Recommendations for the user: ['Criminal: Germany', 'Blood Red Sky', 'Babylon Berlin']
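
Scanning every pair for each watched title becomes slow as the number of pairs grows. One possible alternative, sketched here under assumptions, is to convert the pair counts once into a neighbor index and then score candidates by dictionary lookup. The helper names `build_neighbor_index` and `recommend_from_index` and the toy counts are hypothetical. The input can be any mapping whose `.items()` yields `((title_a, title_b), count)`, which includes a plain dict and a tuple-indexed pandas Series.

```python
from collections import defaultdict

def build_neighbor_index(pair_counts):
    """Turn {(a, b): count} into {a: {b: count}, b: {a: count}}."""
    index = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        index[a][b] = count
        index[b][a] = count
    return index

def recommend_from_index(watched, index, n=3):
    """Score each unwatched neighbor by its summed co-occurrence with watched titles."""
    watched = set(watched)
    scores = defaultdict(int)
    for title in watched:
        for other, count in index.get(title, {}).items():
            if other not in watched:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:n]]

# Hypothetical toy co-occurrence counts
toy_pair_counts = {
    ("Alpha", "Beta"): 2,
    ("Alpha", "Gamma"): 1,
    ("Beta", "Gamma"): 3,
    ("Beta", "Delta"): 1,
}
toy_index = build_neighbor_index(toy_pair_counts)
print(recommend_from_index(["Alpha"], toy_index))
print(recommend_from_index(["Alpha", "Beta"], toy_index))
```

Building the index costs one pass over all pairs, after which each recommendation only touches the neighbors of the watched titles.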
