Hello dear reader and welcome back! For today's blog post you will be learning how to use the Scrapy Python package to extract valuable data from websites. Our goal today is to build a simple recommendation system that will recommend movies or TV shows based on the number of actors a title shares with your favorite movie or TV show. We will extract this data from the IMDB website. Let's get started!

## Getting Started with Scrapy

Once Scrapy is installed in your PIC16B environment, the first step is to create a GitHub repository and initialize your project. Run the following commands in your terminal.

In [None]:
conda activate PIC16B
scrapy startproject IMDB_scraper
cd IMDB_scraper

Next up we will create a new file inside the `spiders` directory and name it `imdb_spider.py`. We will implement our spider in the `ImdbSpider` class of this file. The scraper works by calling the parsing methods of the `ImdbSpider` class to extract data from the web. Add the following code to your new file. My favorite TV show is "Grey's Anatomy", so the URL that I added below links to the Grey's Anatomy page on IMDB. If you have a different favorite TV show or movie, feel free to change the URL below.

In [None]:
import scrapy

class ImdbSpider(scrapy.Spider):
    name = 'imdb_spider'
    
    start_urls = ['https://www.imdb.com/title/tt0413573/']

## Our Parsing Methods
Our parsing methods work by essentially clicking around on the IMDB website as directed and extracting the requested data. Scrapy does this with two kinds of objects: requests and responses. Most of the parse methods below end by yielding a `scrapy.Request`, which tells Scrapy which URL to fetch next and which parse method to call on the result, so that the spider can continue to scrape data. The `response` object passed into each parse method gives the spider access to the content of the page it just fetched.

### parse(self, response)
Our first method starts on a title's home page and navigates to its Cast and Crew page. When we perform this action manually on the IMDB website, we see that the only difference between the two URLs is that the Cast and Crew page has `fullcredits/` appended to the end of our initial URL. The following function does exactly this: it appends `fullcredits/` to the URL of the page held in the `response` object. At the end of the function we yield a `scrapy.Request` containing `next_page`, the new URL we are "clicking" on, and `self.parse_full_credits`, the next parse method to call.

In [None]:
def parse(self, response):
    '''
    A parsing method that navigates from a title's home page to its Cast and Crew page.
    '''
    # string to append to initial url
    next_page = "fullcredits/"

    # append string and call next parsing method
    if next_page:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback = self.parse_full_credits)
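
To see why this appending works, here is a minimal sketch using the standard library's `urllib.parse.urljoin`, which Scrapy's `response.urljoin` wraps. The base URL is the Grey's Anatomy page from `start_urls`; the second call, without a trailing slash, is a hypothetical variation added only for illustration.

```python
from urllib.parse import urljoin

# the start URL from our spider (note the trailing slash)
base = "https://www.imdb.com/title/tt0413573/"

# a relative path is resolved against the base URL
credits_url = urljoin(base, "fullcredits/")
print(credits_url)

# hypothetical variation: without the trailing slash, the last path
# segment is treated like a file name and gets replaced
no_slash_url = urljoin("https://www.imdb.com/title/tt0413573", "fullcredits/")
print(no_slash_url)
```

So if you swap in your own favorite title, it is safest to keep the trailing slash on the URL in `start_urls`.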

### parse_full_credits(self, response)
Our next parsing method navigates to each actor's IMDB page. We collect the links with the list comprehension below. The URL for each actor's page is stored in the `href` attribute of an `a` element inside a `td` element with class `primary_photo`.

In [None]:
def parse_full_credits(self, response):
    '''
    A parsing method that navigates to each actor's profile in a title's IMDB
    Cast and Crew page
    '''
    
    # a list of all links to each actor's IMDB page
    next_page_list = [a.attrib["href"] for a in response.css("td.primary_photo a")]

    # navigate to each actor's IMDB page
    for next_page in next_page_list:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback = self.parse_actor_page)
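
To make the selector logic concrete without hitting the network, here is a rough sketch that uses the standard library's `HTMLParser` as a stand-in for Scrapy's CSS selector. The HTML snippet, the link paths, and the `PhotoLinkParser` class are all hypothetical and only loosely modelled on what `td.primary_photo a` expects; the real IMDB markup may look different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# hypothetical, simplified markup for a cast table (not real IMDB HTML)
SAMPLE_HTML = """
<table>
  <tr><td class="primary_photo"><a href="/name/example-actor-1/">photo</a></td></tr>
  <tr><td class="primary_photo"><a href="/name/example-actor-2/">photo</a></td></tr>
  <tr><td class="character"><a href="/title/example-title/">role</a></td></tr>
</table>
"""

class PhotoLinkParser(HTMLParser):
    """Collects href values of <a> tags inside <td class="primary_photo">."""

    def __init__(self):
        super().__init__()
        self.in_photo_cell = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # track whether the current cell is a primary_photo cell
            classes = (attrs.get("class") or "").split()
            self.in_photo_cell = "primary_photo" in classes
        elif tag == "a" and self.in_photo_cell and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_photo_cell = False

parser = PhotoLinkParser()
parser.feed(SAMPLE_HTML)

# the hrefs are site-relative, so they are joined with the page URL
page_url = "https://www.imdb.com/title/tt0413573/fullcredits/"
actor_urls = [urljoin(page_url, href) for href in parser.links]
print(actor_urls)
```

Only the two links inside `primary_photo` cells are kept, and each one is turned into an absolute URL, which is the same job `response.urljoin` does in the spider.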

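### parse_actor_page(self, response)
Our final parsing method runs on an actor's IMDB page. It yields a dictionary containing the actor's name and the titles they have appeared in, leaving out the title we started from.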

In [None]:
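def parse_actor_page(self, response):
    '''
    A parsing method that yields an actor's name together with the titles
    listed in their filmography, excluding the title we started from.
    '''
    actor_name = response.css("span.itemprop::text").get()

    # title text of every filmography row on the actor's page
    movie_or_tv_name = response.css("div.filmo-row b a::text").getall()

    # drop the title we started from so it is not recommended back to us
    this_title = "Grey's Anatomy"
    movie_or_tv_name = [a for a in movie_or_tv_name if this_title not in a]

    yield {
        "actor" : actor_name,
        "movie_or_tv_name": movie_or_tv_name,
    }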

## Deploying our Spider

We are almost done! Our last scraping step is to run the following command in our terminal. A CSV file of the results, `results.csv`, will then be generated in our `IMDB_scraper` directory.

In [None]:
scrapy crawl imdb_spider -o results.csv

## Displaying our Data

In [1]:
import pandas as pd
import numpy as np

In [11]:
df = pd.read_csv("results.csv")
df = df.dropna()

In [12]:
# get all unique movie or tv show names
all_names = df['movie_or_tv_name']
unique_list = []
for names in all_names:
    shorter_names = names.split(",")
    unique_list += shorter_names
    
unique_list = list(set(unique_list))
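
The split on commas works because Scrapy's CSV exporter joins list-valued fields into a single comma-separated string by default. The sketch below imitates that round trip with the standard library's `csv` module; the item and its titles are hypothetical, made up to show one pitfall.

```python
import csv
import io

# hypothetical item shaped like the ones our spider yields
item = {
    "actor": "Example Actor",
    "movie_or_tv_name": ["Show A", "Movie B", "Show C, Part 2"],
}

# imitate the export: the list is joined into one comma-separated cell
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["actor", "movie_or_tv_name"])
writer.writeheader()
writer.writerow({**item, "movie_or_tv_name": ",".join(item["movie_or_tv_name"])})

# read the row back and split the cell on commas, as in the cell above
buffer.seek(0)
row = next(csv.DictReader(buffer))
titles = row["movie_or_tv_name"].split(",")
print(titles)
```

Notice that the made-up title containing a comma comes back as two pieces, so any real title with a comma in it would likely be split the same way.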

In [13]:
mydict = {}
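for name in unique_list:
    # literal substring match; regex=False keeps characters such as
    # parentheses in a title from being read as a regex pattern
    this = df['movie_or_tv_name'].str.contains(name, regex=False)
    count = this.sum()
    mydict[name] = count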
    
mydf = pd.DataFrame(mydict.items(), columns=['movie_or_tv_name', 'count'])
mydf = mydf.sort_values(by=['count'], ascending=False)
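
One thing to keep in mind: `str.contains` matches substrings, so a short title may also be counted whenever it appears inside a longer title, which could inflate its count. As an alternative, here is a minimal sketch that counts whole titles only, using `collections.Counter`. The helper `count_titles` and the sample rows are hypothetical illustrations, not scraped results.

```python
from collections import Counter

def count_titles(rows):
    """Count how many rows (actors) list each title, matching whole titles only."""
    counts = Counter()
    for cell in rows:
        # set() so a title repeated within one actor's row is counted once
        counts.update(set(cell.split(",")))
    return counts

# hypothetical rows, one comma-joined string per actor
rows = [
    "You,Private Practice",
    "Young Sheldon,Private Practice",
    "Thank You for Smoking",
]

counts = count_titles(rows)
print(counts["You"], counts["Private Practice"])

# for comparison: substring matching counts "You" in all three rows
substring_count = sum("You" in cell for cell in rows)
print(substring_count)
```

With the DataFrame above, calling `count_titles(df['movie_or_tv_name'])` should give the same kind of tally, and `counts.most_common(20)` would list the top titles.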


## Results!

In [21]:
mydf[0:20]

| | movie_or_tv_name | count |
|---|---|---|
| 655 | You | 15 |
| 522 | Hollywood | 9 |
| 795 | NCIS | 8 |
| 871 | Special | 7 |
| 531 | Grace | 6 |
| 579 | Dog | 6 |
| 263 | Entertainment Tonight | 6 |
| 984 | Bones | 6 |
| 190 | Stars | 6 |
| 459 | House | 6 |
