# Web Scraping: Searching for Actors in Movies/TV Shows and My Recommendations of Top Works with Shared Actors

This activity finds every work that the actors of a given movie have appeared in, counting only credits listed under one category: "Acting." Some actors also hold credits in other departments, such as stunt double under "Crew" or assistant director under "Production," but only the "Acting" credits of a movie's cast are of interest here.

In this tutorial, I demonstrate the web scraping process on the TMDB pages of two movies: first the example movie, *Harry Potter and the Philosopher's Stone*, and later my favorite movie, *Kill Bill: Vol. 1*.

## Part 1: Initial Steps 

### TMDB Page: *Harry Potter and the Philosopher's Stone* 

My scraper looks at all of the actors in this movie and at all of the other movies/TV shows that they worked on. To write it, I first use my browser's Developer Tools to inspect the HTML elements that contain the names and works I am looking for.

Upon landing on the TMDB page for this movie, I first save the URL.

https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/

I then scroll down to the *Full Cast & Crew* link, click on it, and then scroll to the *Cast* section. When I click on the portrait of one of the actors, I am taken to that specific actor's TMDB page. Each actor's page contains the information that I am interested in, which is the movies/TV shows that that actor worked on specifically under the "Acting" category.

### Terminal Commands

In the terminal, run the commands:

conda activate PIC16B-24W
scrapy startproject TMDB_scraper
cd TMDB_scraper

#### Tweaking the Settings 

In the settings.py file, add the following line:

CLOSESPIDER_PAGECOUNT = 20

This line prevents the scraper from downloading an excessive amount of data during my testing phase. Later, this line will be removed.

Additionally, when testing selectors in the Scrapy shell, launch it with a custom user agent:

scrapy shell -s USER_AGENT='Scrapy/2.8.0 (+https://scrapy.org)' https://www.themoviedb.org/...

This sets the user agent for that Scrapy shell session so that the website does not block my requests.
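The `-s` flag only applies to that one shell session. As a minimal sketch, assuming the same user agent should also be used when running `scrapy crawl`, both settings could sit together in settings.py. `CLOSESPIDER_PAGECOUNT` and `USER_AGENT` are standard Scrapy settings; reusing the shell's user agent string for crawls is my assumption, not something the steps above require.

```python
# settings.py (excerpt): a sketch, not the full generated file

# Stop the crawl after 20 responses while testing; remove for the full run.
CLOSESPIDER_PAGECOUNT = 20

# Assumption: reuse the user agent from the scrapy shell command for crawls.
USER_AGENT = "Scrapy/2.8.0 (+https://scrapy.org)"
```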

## Part 2: TmdbSpider Scraper (for Harry Potter Example)

Create a file called tmdb_spider.py inside the spiders directory that includes the following lines.

In [1]:
import scrapy

In [2]:
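class TmdbSpider(scrapy.Spider):
    name = 'tmdb_spider'
    def __init__(self, subdir=None, *args, **kwargs):
        # let scrapy.Spider finish its own setup first
        super().__init__(*args, **kwargs)
        # subdir is the movie's subdirectory on TMDB, passed with -a subdir=...
        self.start_urls = [f"https://www.themoviedb.org/movie/{subdir}/"]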

Passing a different subdirectory (subdir) later, such as the one for *Kill Bill: Vol. 1*, lets the same spider crawl that movie instead.

The TmdbSpider class has three parsing methods.

### Implementation of parse()

This function starts on the movie page and navigates to the *Full Cast & Crew* page through cast_crew_url. It yields a Scrapy request with parse_full_credits() as its callback argument, so that function runs once the cast page has been fetched.

In [3]:
    def parse(self, response):
        """
        Parses the starting url page for the favorite movie by
        first navigating to the Full Cast and Crew Page from
        the movie page.

        Does not return any data.
        """
        # starting url plus "cast" at the end; the trailing slash is
        # stripped first so the path does not contain "//cast"
        cast_crew_url = response.url.rstrip("/") + "/cast"

        # ask scrapy to visit the cast_crew link and once it gets there,
        # call self.parse_full_credits() as the callback
        yield scrapy.Request(cast_crew_url, callback=self.parse_full_credits)
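Because `start_urls` ends with a slash, appending `"/cast"` to that URL unchanged would give a path containing `//cast`. Here is a minimal standalone sketch of the trailing-slash handling, assuming TMDB serves the movie page at the URL as requested. The helper name `cast_url` is hypothetical and is not part of the spider.

```python
def cast_url(movie_url: str) -> str:
    """Return the Full Cast & Crew URL for a TMDB movie URL.

    Hypothetical helper: strips any trailing slash before appending "/cast",
    so both forms of the movie URL give the same result.
    """
    return movie_url.rstrip("/") + "/cast"


movie = "https://www.themoviedb.org/movie/671-harry-potter-and-the-philosopher-s-stone/"
print(cast_url(movie))
print(movie + "/cast")  # naive concatenation keeps the double slash
```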

### Implementation of parse_full_credits()

This function starts on the *Full Cast & Crew* page and yields a Scrapy request for each actor's personal page (actors only, no crew members). Each request names parse_actor_page() as its callback argument, so that function runs once the actor's page has been fetched.

In [4]:
    def parse_full_credits(self, response):
        """
        Starts on the Full Cast & Crew page
        Yields a scrapy Request for the page of each actor listed on the 
        page (not including the crew members).

        Does not return any data.
        """
        # base url that each actor's relative link is appended to
        main_url = "https://www.themoviedb.org"

        # [0]: the first list on the page is the "Cast" category
        for link in response.css("ol.people.credits")[0].css("li"):
            # gets the link of that actor's page
            actor_page = link.css("a::attr(href)").get()
            actor_url = f'{main_url}{actor_page}'
            yield scrapy.Request(actor_url, callback=self.parse_actor_page)
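One thing to watch in this loop: `.get()` returns `None` when a list item has no link, and the f-string would then build a URL ending in `None`. Below is a small sketch of a more defensive join using only the standard library. The helper `actor_urls` and the sample hrefs are hypothetical. Inside a spider, `response.urljoin(href)` performs the same kind of join.

```python
from urllib.parse import urljoin

MAIN_URL = "https://www.themoviedb.org"


def actor_urls(hrefs):
    """Turn relative hrefs into absolute URLs, skipping missing and repeated ones."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href or href in seen:
            continue
        seen.add(href)
        urls.append(urljoin(MAIN_URL, href))
    return urls


# Hypothetical hrefs, shaped like what link.css("a::attr(href)").get() might return.
sample = ["/person/1-example-actor", None, "/person/2-another-actor", "/person/1-example-actor"]
print(actor_urls(sample))
```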

### Implementation of parse_actor_page()

This function starts on the individual actor's page and yields one dictionary per work, with two keys: "actor" (the actor's name) and "movie_or_TV_name" (the name of a movie/TV show the actor was in). We are only interested in the works under the "Acting" category of that actor. As a result, one actor's name can appear many times if that actor has been in many movies/TV shows.

In [5]:
    def parse_actor_page(self, response):
        """
        On each individual's page
        Yields a dictionary with the actor's name and the name of one work,
        for each work under only the "Acting" category
        """
        actor_name_CSS = response.css("h2.title a::text").get()
        # checks for the h3 category of "Acting" regardless of index number because
        # some individuals, such as David Holmes, do not have "Acting" as
        # their first category, unlike most others
        if response.css('h3.zero:contains("Acting")'):
            index = 0
        elif response.css('h3.one:contains("Acting")'):
            index = 1
        elif response.css('h3.two:contains("Acting")'):
            index = 2
        else:
            # no "Acting" category among the first three: nothing to yield
            return
        # gets the works of only the "Acting" category
        movie_or_TV_CSS = response.css("table.card.credits")[index].css("a.tooltip bdi::text")
        for each_work in movie_or_TV_CSS:
            movie_or_TV_name = each_work.get()
            yield {"actor" : actor_name_CSS, 
                   "movie_or_TV_name" : movie_or_TV_name}
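The chain of `h3.zero` / `h3.one` / `h3.two` checks covers at most three categories. A possible generalization, sketched here under the assumption that the category headings have already been collected into a list in page order (the selector for that is not shown and would depend on TMDB's markup), is to look up the position of "Acting" directly. The function name `acting_index` is hypothetical.

```python
def acting_index(headings):
    """Return the position of the "Acting" heading, or None if it is absent."""
    for position, heading in enumerate(headings):
        if heading.strip() == "Acting":
            return position
    return None


# Hypothetical heading lists, in the order they might appear on a person's page.
print(acting_index(["Acting", "Production"]))  # "Acting" listed first
print(acting_index(["Crew", "Acting"]))        # "Acting" listed second
print(acting_index(["Directing", "Writing"]))  # no "Acting" credits
```

The returned position would then index into the list of credit tables, and a `None` result signals that the page can be skipped.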

### Creating results.csv:

Once all functions are correctly implemented, the following command creates a results.csv that contains all actors in *Harry Potter and the Philosopher's Stone* and their "Acting" works, with one actor/work pair per row.

scrapy crawl tmdb_spider -o results.csv -a subdir=671-harry-potter-and-the-philosopher-s-stone

## Part 3: My Recommendations 

### Sorted List with Top Movies/TV Shows that Share Actors with My Favorite Movie/TV Show 

This is a sorted list containing the top movies and TV shows that share actors with my favorite movie: *Kill Bill: Vol. 1*. It has two columns: the names of the movies/TV shows and the number of shared actors.

Reusing the spider from the Harry Potter example, I will generate a new file for the Kill Bill actors and their works. 

#### In Terminal: 

scrapy crawl tmdb_spider -o resultsFavMovie.csv -a subdir=24-kill-bill-vol-1

### Pandas Dataframe 

In [6]:
import pandas as pd
import numpy as np

In [7]:
# read in csv file for Kill Bill actors + their movies/shows
filename = "resultsFavMovie.csv"
killBill_actorsMovies = pd.read_csv(filename)

In [8]:
# dictionary that maps the old column names to the new ones
rename_dict = {"actor" : "Actor",
               "movie_or_TV_name" : "Movie or TV Name"}
killBill_actorsMovies.rename(columns = rename_dict, inplace = True)

# displaying the dataframe with renamed columns
killBill_actorsMovies

| | Actor | Movie or TV Name |
|---|---|---|
| 0 | Uma Thurman | The Old Guard 2 |
| 1 | Uma Thurman | Oh, Canada |
| 2 | Uma Thurman | Tau Ceti Foxtrot |
| 3 | Uma Thurman | Anita |
| 4 | Uma Thurman | The Kill Room |
| ... | ... | ... |
| 3767 | Sō Yamanaka | Not Forgotten |
| 3768 | Sō Yamanaka | The Exam |
| 3769 | Sō Yamanaka | Ping Pong Bath Station |
| 3770 | Sō Yamanaka | Yonimo Kimyou na Monogatari Tokubetsuhen |


Now, I will create a dictionary with each movie as the key and the number of times that movie appears in the killBill_actorsMovies dataframe as the value. This frequency is the number of shared actors in the movie, because each additional appearance of the same movie name means that another Kill Bill actor is in that movie as well.

In [9]:
# initializing the dictionary containing the movie name and its
# frequency of occurrence
movie_freq_dict = {}

# loop through the movie names and increase the count when the movie
# name appears again
for each_movie in killBill_actorsMovies["Movie or TV Name"]:
    # if the movie is NOT already in the dictionary, add it
    if each_movie not in movie_freq_dict:
        movie_freq_dict[each_movie] = 1
    # else if the movie is already in the dictionary, increase the 
    # count when it reappears
    else:
        movie_freq_dict[each_movie] += 1
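The counting loop above is the pattern that `collections.Counter` from the standard library implements. Here is a small sketch on a handful of rows copied from the tables in this post; the counts are for this sample only, not for the full dataframe. On the dataframe itself, `killBill_actorsMovies["Movie or TV Name"].value_counts()` should give the same tallies.

```python
from collections import Counter

# A few (actor, title) rows taken from the tables shown in this post.
rows = [
    ("Uma Thurman", "The Kill Room"),
    ("Chiaki Kuriyama", "Kill Bill: The Whole Bloody Affair"),
    ("Issey Takahashi", "Kill Bill: The Whole Bloody Affair"),
    ("Yoshiko Yamaguchi", "Kill Bill: The Whole Bloody Affair"),
]

# Counter tallies how often each title occurs, one occurrence per actor.
movie_freq = Counter(title for _, title in rows)
print(movie_freq.most_common(1))
# [('Kill Bill: The Whole Bloody Affair', 3)]
```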

In [10]:
# create a new dataframe with the movie names in one column and the
# frequency of occurrence for each movie in the other
shared_actors = pd.DataFrame({"Movie or TV Name" : list(movie_freq_dict.keys()),
                              "Number of Shared Actors" : list(movie_freq_dict.values())})

# find the row with "Kill Bill: Vol. 1" and move it to the top row
# killBill_frontOfDF = shared_actors[shared_actors["Movie or TV Name"].str.contains("Kill Bill: Vol. 1")]
# killBill_frontOfDF

shared_actors

| | Movie or TV Name | Number of Shared Actors |
|---|---|---|
| 0 | The Old Guard 2 | 1 |
| 1 | Oh, Canada | 1 |
| 2 | Tau Ceti Foxtrot | 1 |
| 3 | Anita | 1 |
| 4 | The Kill Room | 1 |
| ... | ... | ... |
| 3344 | Hush! | 1 |
| 3345 | Gips | 1 |
| 3346 | Not Forgotten | 1 |
| 3347 | The Exam | 1 |


To make the dataframe more detailed, it would be nice to see which actors from Kill Bill appear together in the same movies. This means merging the two dataframes so that each row pairs a Kill Bill actor with a movie and that movie's number of shared actors. The actors listed for a given movie are its "shared actors."

In [11]:
merged_shared = pd.merge(killBill_actorsMovies, shared_actors,
                        on = "Movie or TV Name")

# sort by movies with the most to least number of shared actors
merged_shared.sort_values(by = "Number of Shared Actors", 
                          ascending = False)

| | Actor | Movie or TV Name | Number of Shared Actors |
|---|---|---|---|
| 176 | Vivica A. Fox | Kill Bill: Vol. 1 | 38 |
| 54 | Chiaki Kuriyama | Kill Bill: The Whole Bloody Affair | 38 |
| 61 | Issey Takahashi | Kill Bill: The Whole Bloody Affair | 38 |
| 60 | Yoshiko Yamaguchi | Kill Bill: The Whole Bloody Affair | 38 |
| 59 | Ronnie Yoshiko Fujiyama | Kill Bill: The Whole Bloody Affair | 38 |
| ... | ... | ... | ... |
| 1524 | Michael Madsen | Welcome to Acapulco | 1 |
| 1525 | Michael Madsen | Trading Paint | 1 |
| 1526 | Michael Madsen | Hangover in Death Valley | 1 |
| 1527 | Michael Madsen | Dead On Time | 1 |
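The merged dataframe keeps one row per actor/title pair. If a single row per title with the actor names collected in one column is preferred, a groupby is one option. This is a sketch on a few rows copied from the tables above; the column name "Shared Actors" and the variable names are my own choices.

```python
import pandas as pd

# Sample rows copied from the tables above; the real dataframe has thousands.
sample = pd.DataFrame({
    "Actor": ["Chiaki Kuriyama", "Issey Takahashi", "Michael Madsen"],
    "Movie or TV Name": [
        "Kill Bill: The Whole Bloody Affair",
        "Kill Bill: The Whole Bloody Affair",
        "Trading Paint",
    ],
})

# Group the actor names by title, then build one row per title.
grouped = sample.groupby("Movie or TV Name")["Actor"]
per_title = pd.DataFrame({
    "Shared Actors": grouped.agg(", ".join),   # names joined into one string
    "Number of Shared Actors": grouped.size(), # how many actors per title
}).reset_index()

per_title = per_title.sort_values("Number of Shared Actors", ascending=False)
print(per_title)
```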


## Data Visualization 

### Bar Graph 1 

This bar graph shows the number of shared actors for each movie.

In [12]:
from plotly import express as px
import plotly.io as pio
pio.renderers.default = "iframe"

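# shared_actors has one row per title; merged_shared repeats each title
# once per actor, which would stack the same count several times per bar
actorWorksCount_fig = px.bar(shared_actors,
                             x = "Movie or TV Name",
                             y = "Number of Shared Actors",
                             opacity = 1,
                             title = "Number of Shared Actors For Each Movie")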

actorWorksCount_fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
actorWorksCount_fig.show()
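With thousands of titles on the x-axis, the bars are hard to read. One option, sketched here on a few counts copied from the tables above, is to drop the favorite movie itself and keep only the titles with the most shared actors before plotting. The cutoff of 2 is arbitrary for this sample, and the variable names are hypothetical.

```python
import pandas as pd

# Sample counts copied from the tables above.
shared_sample = pd.DataFrame({
    "Movie or TV Name": ["The Old Guard 2", "Kill Bill: Vol. 1",
                         "Kill Bill: The Whole Bloody Affair", "Anita"],
    "Number of Shared Actors": [1, 38, 38, 1],
})

# Drop the favorite movie itself, then keep the 2 titles with the most shared actors.
others = shared_sample[shared_sample["Movie or TV Name"] != "Kill Bill: Vol. 1"]
top = others.nlargest(2, "Number of Shared Actors")
print(top)
```

A dataframe like `top` can then be passed to `px.bar` in place of the full one.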

### Bar Graph 2 

This bar graph shows the number of movies and TV shows that each actor in Kill Bill has appeared in throughout their entire acting career.

In [13]:
from plotly import express as px
import plotly.io as pio
pio.renderers.default = "iframe"

actorWorksCount_fig = px.bar(merged_shared,
                            x = "Actor",
                            title = "Number of Movies/TV Shows For Each Actor")

actorWorksCount_fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
actorWorksCount_fig.show()