## Creating a Merged Movies Dataframe

Goal of the project: user-based collaborative filtering.

In this notebook, we prepare the data so that building the model is easier later:
1. Inspect the structure of the movies and ratings CSV files
2. Inspect `links.csv`, build each movie's TMDB page url, and web-scrape its image url
3. Merge the movies and links dataframes

### 1. Inspect movies and ratings csv files

In [None]:
import numpy as np
import pandas as pd
import pickle

Load the movies dataframe from `movies.csv`

In [3]:
movies_df = pd.read_csv('dataset/movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Load the user ratings dataframe from `ratings.csv`

In [4]:
ratings_df = pd.read_csv('dataset/ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
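
Since the goal is user-based collaborative filtering, the ratings table will eventually need to be reshaped into a user-by-movie matrix. A minimal sketch of that reshaping is given here, using a tiny hand-made frame with the same columns as `ratings.csv`. The name `toy_ratings` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy ratings with the same columns as ratings.csv
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [1, 3, 1, 6, 3],
    'rating':  [4.0, 4.0, 5.0, 3.0, 2.0],
})

# Rows are users, columns are movies; user/movie pairs without a rating stay NaN
user_item = toy_ratings.pivot_table(index='userId', columns='movieId', values='rating')
print(user_item)
```

Each row of `user_item` can then be compared with the other rows to find users with similar tastes.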


### 2. Inspect links.csv, build page urls, and web-scrape image urls

In [11]:
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

##### Load links dataframe `links.csv`

In [13]:
# Load the links file
link_df = pd.read_csv('dataset/links.csv')
link_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


Check the column types and non-null counts of the links dataframe

In [15]:
link_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  9742 non-null   int64  
 1   imdbId   9742 non-null   int64  
 2   tmdbId   9734 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 228.5 KB
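
The `info()` output shows 9734 non-null `tmdbId` values out of 9742 rows, so 8 movies have no TMDB id. The missing values are also why pandas stores the column as `float64`. A small sketch of how such a column could be converted to pandas' nullable integer type, so that ids print as `862` rather than `862.0`, follows. The frame `toy_links` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking links.csv, with one missing tmdbId
toy_links = pd.DataFrame({
    'movieId': [1, 2, 3],
    'tmdbId':  [862.0, np.nan, 15602.0],
})

# Count the missing ids before converting
n_missing = toy_links['tmdbId'].isna().sum()

# 'Int64' (capital I) is the nullable integer dtype; NaN becomes <NA>
toy_links['tmdbId'] = toy_links['tmdbId'].astype('Int64')

print(n_missing)
print(toy_links['tmdbId'].tolist())
```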


In [16]:
# numpy, requests, BeautifulSoup and tqdm are already imported above
import time
from requests.exceptions import RequestException


In [None]:
# Create empty (object-dtype) columns to hold the scraped strings
link_df['url'] = None
link_df['img_url'] = None

# Fallback image used when no poster can be retrieved
NO_IMAGE_URL = 'https://www.firstcolonyfoundation.org/wp-content/uploads/2022/01/no-photo-available.jpeg'

for idx, tmdbId in tqdm(enumerate(link_df['tmdbId']), total=len(link_df)):
    # Build the movie page url (tmdbId is a float, so ids render
    # as e.g. '862.0' and missing ids as 'nan')
    url = 'https://www.themoviedb.org/movie/' + str(tmdbId)
    # .at avoids chained assignment, which may write to a copy instead of link_df
    link_df.at[idx, 'url'] = url

    try:
        # Download the page; the timeout keeps one stalled request from hanging the loop
        response = requests.get(url, timeout=10)

        # Parse the response text into a BeautifulSoup object
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the <img> tag inside the poster container
        obj = soup.find('div', 'image_content backdrop').img

        # Build the absolute image url
        image_url = 'https://www.themoviedb.org' + obj.get('data-src')
    except (AttributeError, TypeError, requests.exceptions.RequestException):
        # No poster container, no data-src attribute, or the request failed
        image_url = NO_IMAGE_URL

    link_df.at[idx, 'img_url'] = image_url

100%|██████████| 9742/9742 [5:31:56<00:00,  2.04s/it]  
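
The loop above handles URL building, downloading, and HTML parsing in a single pass. One way to make the parsing step testable without network access is to move it into a small function. The sketch below is not the notebook's code: the function name `extract_poster_url`, the `NO_IMAGE` constant, and the sample HTML are hypothetical, and the CSS selector assumes the same `image_content backdrop` container that the loop searches for.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://www.themoviedb.org'
# Hypothetical fallback value; the notebook uses an external placeholder image instead
NO_IMAGE = None

def extract_poster_url(html):
    """Return the absolute poster url found in a movie page, or NO_IMAGE."""
    soup = BeautifulSoup(html, 'html.parser')
    # Match a div carrying both classes, then the <img> inside it
    img = soup.select_one('div.image_content.backdrop img')
    if img is None or not img.get('data-src'):
        return NO_IMAGE
    return BASE_URL + img['data-src']

# Made-up page fragments for illustration
sample_html = '<div class="image_content backdrop"><img data-src="/t/p/example.jpg"></div>'
print(extract_poster_url(sample_html))
print(extract_poster_url('<html><body>no poster here</body></html>'))
```

Returning a sentinel instead of raising keeps the calling loop free of `try`/`except` around the parsing step.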


##### Save link_df

In [22]:
# Save link_df
link_df.to_csv('output_data/link_df.csv', index=False)

##### Reload the saved links dataframe `link_df.csv`

In [23]:
links_df = pd.read_csv('output_data/link_df.csv')
links_df.head(3)

Unnamed: 0,movieId,imdbId,tmdbId,url,img_url
0,1,114709,862.0,https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
1,2,113497,8844.0,https://www.themoviedb.org/movie/8844.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
2,3,113228,15602.0,https://www.themoviedb.org/movie/15602.0,https://www.themoviedb.org/t/p/w300_and_h450_b...


### 3. Merge the movies and links dataframes on the `movieId` column

In [24]:
merged_df = pd.merge(movies_df, links_df, on='movieId')
merged_df.head(3)

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,url,img_url
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,https://www.themoviedb.org/movie/862.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0,https://www.themoviedb.org/movie/8844.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0,https://www.themoviedb.org/movie/15602.0,https://www.themoviedb.org/t/p/w300_and_h450_b...
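
`pd.merge` defaults to an inner join, so a movie without a matching row in `links_df` would be dropped without warning. A hedged sketch of how the join could be checked is shown here on toy frames (`toy_movies` and `toy_links` are hypothetical), using the `how`, `validate`, and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical toy frames; movieId 3 has no row in toy_links
toy_movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
toy_links = pd.DataFrame({'movieId': [1, 2], 'tmdbId': [862, 8844]})

# how='left' keeps every movie; validate raises if movieId is duplicated on either side;
# indicator adds a '_merge' column telling where each row came from
checked = pd.merge(toy_movies, toy_links, on='movieId', how='left',
                   validate='one_to_one', indicator=True)

# Movies that found no match in the links frame
unmatched = checked[checked['_merge'] == 'left_only']
print(len(checked), len(unmatched))
```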


In [None]:
# Save merged_df to merged_movies.csv
merged_df.to_csv('output_data/merged_movies.csv', index=False)

In [1]:

merged_df.isnull().sum()

movieId    0
title      0
genres     0
imdbId     0
tmdbId     8
url        0
img_url    0
dtype: int64
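
The null counts show 8 missing `tmdbId` values but no missing `url` or `img_url`, presumably because the scraping loop always writes a url string and falls back to a placeholder image. Counting nulls therefore does not reveal how many movies lack a real poster. A minimal sketch of counting placeholder rows instead is given below, on a made-up frame (`toy_merged` is hypothetical; `PLACEHOLDER` is the fallback url used in the scraping cell).

```python
import pandas as pd

PLACEHOLDER = 'https://www.firstcolonyfoundation.org/wp-content/uploads/2022/01/no-photo-available.jpeg'

# Hypothetical toy frame mimicking merged_df
toy_merged = pd.DataFrame({
    'movieId': [1, 2, 3],
    'tmdbId':  [862.0, None, 15602.0],
    'img_url': ['https://www.themoviedb.org/t/p/example.jpg', PLACEHOLDER, PLACEHOLDER],
})

# Rows whose image is the fallback, and rows with no TMDB id at all
n_placeholder = (toy_merged['img_url'] == PLACEHOLDER).sum()
n_no_id = toy_merged['tmdbId'].isna().sum()
print(n_placeholder, n_no_id)
```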