# Blog Post 2
In this blog post, we are going to create a spider object, which is useful for scraping the web! 

For starters, here are the packages we are going to import.

In [1]:
import scrapy
import pandas as pd
import numpy as np

## The Big Picture
Let's take a look at our finished product first, before we break down how it works! 

In [2]:
class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    
    start_urls = ['https://www.imdb.com/title/tt1160419/']
    
    def parse(self, response):
        """
        Assumes we start on a movie page. Finds the link to the Cast & Crew
        page (<movie_url>fullcredits) and yields a scrapy.Request for it,
        with parse_full_credits as the callback.
        """
        # Get the relative link to the Cast & Crew page from the start page.
        # .attrib is an empty dict when nothing matches, so .get() gives None
        # instead of raising a KeyError.
        cast_crew = response.css("a[href^= 'fullcredits']").attrib.get("href")
        if cast_crew:
            # Resolve the relative link against the current page's URL
            cast_crew = response.urljoin(cast_crew)

            # Request that page and hand its response to the callback
            yield scrapy.Request(cast_crew, callback=self.parse_full_credits)


    def parse_full_credits(self, response):
        """
        Assumes we start on the Cast & Crew page. Yields a scrapy.Request
        for the page of each actor listed, with parse_actor_page as the
        callback. No crew, just actors.
        """

        # Gather a list of urls for actors from the cast page
        cast_list = [a.attrib["href"] for a in response.css("td.primary_photo a")]
        for actor in cast_list:
            # For each actor's url in the list, join it to the existing response url
            actor_page = response.urljoin(actor)
            # Go to the actor's page and run parse_actor_page on the response
            yield scrapy.Request(actor_page, callback=self.parse_actor_page)


    def parse_actor_page(self, response):
        """
        Assumes we start on an actor's page. Yields one dictionary of the form
        {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}
        for each movie or TV show in which the actor has an acting credit.
        """
        # Retrieve the actor's name from the page header
        actor = response.css("h1.header span.itemprop::text").get()

        # Get every credit row on the actor's page
        movies = response.css("div.filmo-row")
        for movie in movies:
            # The row's id starts with the kind of credit; keep acting credits only
            role = movie.css("::attr(id)").get(default="")[0:3]
            if role == "act":  # accounts for both 'ACT'or and 'ACT'ress
                yield {
                    "actor": actor,  # specified above
                    "movie_or_TV_name": movie.css("a::text").get(),  # movie title
                }

        

The 'name' variable is what we call the spider when we want to run it, and 'start_urls' lists the first link(s) for the spider to look at!

For the purposes of this spider, our MOVIE will be the FAITHFUL ADAPTATION OF FRANK HERBERT's DUNE, directed by Denis Villeneuve

## The parse() Method 
The parse method is designed to retrieve the "Cast and Crew" link from our starting IMDB page and then yield a scrapy.Request that tells the spider what to do when we get to that page. In our case, we want to use the parse_full_credits() method.

In [3]:
def parse(self, response):
    """
    Assumes we start on a movie page. Finds the link to the Cast & Crew
    page (<movie_url>fullcredits) and yields a scrapy.Request for it,
    with parse_full_credits as the callback.
    """
    # Get the relative link to the Cast & Crew page from the start page.
    # .attrib is an empty dict when nothing matches, so .get() gives None
    # instead of raising a KeyError.
    cast_crew = response.css("a[href^= 'fullcredits']").attrib.get("href")
    if cast_crew:
        # Resolve the relative link against the current page's URL
        cast_crew = response.urljoin(cast_crew)

        # Request that page and hand its response to the callback
        yield scrapy.Request(cast_crew, callback=self.parse_full_credits)

The response.urljoin() function takes the relative cast_crew link that we scraped from the IMDB page and resolves it against the URL of the current page (here, our start_url) to form a complete link! Then we navigate to that link, and using the callback argument of scrapy.Request(), we instruct the spider what to do once we get there!
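
To make the joining step concrete, here is a minimal sketch using urljoin from Python's standard library, which as far as I know follows the same relative-link rules that response.urljoin() relies on. Both relative links below are made-up examples for illustration; the real href values on IMDB may carry extra query parameters.

```python
from urllib.parse import urljoin

# The page the spider is currently on (our start URL)
page_url = "https://www.imdb.com/title/tt1160419/"

# Hypothetical relative link, resolved against the current page
full_link = urljoin(page_url, "fullcredits")
print(full_link)   # https://www.imdb.com/title/tt1160419/fullcredits

# Hypothetical link starting with "/", resolved against the site root instead
actor_link = urljoin(page_url, "/name/nm0000001/")
print(actor_link)  # https://www.imdb.com/name/nm0000001/
```

The trailing slash on the page URL matters: without it, the last path segment would be replaced rather than extended.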

## The parse_full_credits() Method
This method is designed to get the list of actors/actresses from the Cast and Crew page!

In [4]:
def parse_full_credits(self, response):
    """
    Assumes we start on the Cast & Crew page. Yields a scrapy.Request
    for the page of each actor listed, with parse_actor_page as the
    callback. No crew, just actors.
    """

    # Gather a list of urls for actors from the cast page
    cast_list = [a.attrib["href"] for a in response.css("td.primary_photo a")]
    for actor in cast_list:
        # For each actor's url in the list, join it to the existing response url
        actor_page = response.urljoin(actor)
        # Go to the actor's page and run parse_actor_page on the response
        yield scrapy.Request(actor_page, callback=self.parse_actor_page)

We set cast_list equal to a list comprehension that collects the link behind each actor's/actress's headshot on the page. For example, in our case, the first item in this list would be the link we would follow if we clicked the headshot of Timothée Chalamet.

The for loop in this method instructs the spider, for each link in cast_list, to once again join the link with the response.url and follow it. This time, the callback argument specifies our final and most complex parsing function, parse_actor_page().

## The parse_actor_page() Method 

This method is responsible for yielding the actor's/actress's name along with every acting credit they have received, according to IMDB.


First, let's start by accessing the simplest element, the actor's name. To do so, we are going to take advantage of the fact that their name is the only \<h1> element of this page.

actor = response.css("h1.header span.itemprop::text").get()


Next, we want to use the css selector to select all the movies that the actor/actress was involved in. BE WARNED: This includes all the credits for producing, writing, etc., but we just want ACTING credits. 

movies = response.css("div.filmo-row")

The above code returns a list of selectors, one for each credit row (div.filmo-row) on the actor's page.


To solve the problem mentioned above, we filter the rows by their id attribute, which we read with the ::attr(id) css selector. Let's use a for-loop over the movies and check whether the first three letters of each id match "act", which is what we are looking for.

Finally, if the role matches what we are looking for, the last thing to do is to yield the actor's/actress's name and the movie title. To do so, we yield a dict object with two items: "actor" (the name) and "movie_or_TV_name" (the name of the project acted in).

In [5]:
def parse_actor_page(self, response):
    """
    Assumes we start on an actor's page. Yields one dictionary of the form
    {"actor": actor_name, "movie_or_TV_name": movie_or_TV_name}
    for each movie or TV show in which the actor has an acting credit.
    """
    # Retrieve the actor's name from the page header
    actor = response.css("h1.header span.itemprop::text").get()

    # Get every credit row on the actor's page
    movies = response.css("div.filmo-row")
    for movie in movies:
        # The row's id starts with the kind of credit; keep acting credits only
        role = movie.css("::attr(id)").get(default="")[0:3]
        if role == "act":  # accounts for both 'ACT'or and 'ACT'ress
            yield {
                "actor": actor,  # specified above
                "movie_or_TV_name": movie.css("a::text").get(),  # movie title
            }
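
To see the role filter in isolation, here is a small sketch in plain Python. The id strings are made-up examples that assume ids of the form role-title; the real values on IMDB may look different.

```python
# Hypothetical id attributes, one per credit row on an actor's page
row_ids = [
    "actor-tt1160419",
    "actress-tt0000001",
    "producer-tt0000002",
    "writer-tt0000003",
]

# Keep only the rows whose id starts with "act" (actor or actress)
acting_ids = [row_id for row_id in row_ids if row_id[0:3] == "act"]
print(acting_ids)  # ['actor-tt1160419', 'actress-tt0000001']
```

Comparing only the first three letters is what lets a single check cover both "actor" and "actress" rows.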

Great! Now, in order to run our spider, we need to go to the command line and navigate to the directory where our spider is located.

The command: scrapy crawl imdb_spider -o movies.csv

will run our spider and output the results into a csv file in the same location.



## Movie Suggestions

Now let's figure out a way to suggest movies or tv shows with similar actors!

First, we can look at the data that we have so far

In [6]:
movies = pd.read_csv("movies.csv")

In [7]:
movies.head()

Unnamed: 0,actor,movie_or_TV_name
0,Oliver Ryan,Dune
1,Oliver Ryan,Bravely Default II
2,Oliver Ryan,The Pembrokeshire Murders
3,Oliver Ryan,The Accident
4,Oliver Ryan,The Last Vermeer


Now, let's see if we can apply value counts to each movie! This will give each movie/tv show title a value corresponding to how frequently it shows up in our data. A higher value count means that the title shows up multiple times in our data, meaning that more than one actor/actress in DUNE has acted in it!

In [8]:
movie_suggestions = movies.apply(pd.Series.value_counts)

In [9]:
movie_suggestions = movie_suggestions.reset_index()

Now, let's sort the values by movie_or_TV_name in DESCENDING ORDER, meaning that the most common movie will be on top. The most common **SHOULD** be DUNE, because every actor in our list has definitely been a part of that project.

In [10]:
movie_suggestions = movie_suggestions.sort_values(by = "movie_or_TV_name", ascending = False)


In [11]:
movie_suggestions.rename(columns = {"index": "Movie", "movie_or_TV_name": "Number Of Actors"})

Unnamed: 0,Movie,actor,Number Of Actors
401,Dune,,49.0
376,Doctors,,8.0
604,Holby City,,7.0
260,Casualty,,7.0
1336,The Bill,,7.0
...,...,...,...
1253,Stephen Collins,14.0,
1254,Stephen McKinley Henderson,49.0,
1295,Tachia Newall,16.0,
1588,Timothée Chalamet,31.0,


In [12]:
movie_suggestions = movie_suggestions.drop(["actor"],axis = 1)

movie_suggestions[0:10]

Unnamed: 0,index,movie_or_TV_name
401,Dune,49.0
376,Doctors,8.0
604,Holby City,7.0
260,Casualty,7.0
1336,The Bill,7.0
1187,Silent Witness,5.0
411,EastEnders,5.0
309,Coronation Street,4.0
431,Endeavour,3.0
564,Hamlet,3.0


There we have it! We have successfully scraped the IMDB cast and crew page for DUNE, got data concerning which actors/actresses played a role in it, what other movies each actor/actress starred in, and gave recommendations for which movies to watch next based on how s