# SICSS 2023 (Basics of Web Scraping)


## Scrapy - Extracting movie data from IMDb


###### Credits: https://app.datacamp.com/learn/courses/web-scraping-with-python

In [None]:
#pip install scrapy

In [None]:
# Import a scrapy Selector
from scrapy import Selector

# Import requests
import requests

url = 'https://m.imdb.com/chart/top'

# Create the string html containing the HTML source
html = requests.get(url).content

# Create the Selector object sel from html
sel = Selector(text = html)

# Print out the number of elements in the HTML document
print( "You have found: ", len(sel.xpath('//*')))

In [None]:
type(html)

In [None]:
type(sel)

In [None]:
xpath_for_movienames ='//h4'

In [None]:
sel.xpath(xpath_for_movienames).extract()

In [None]:
len(sel.xpath(xpath_for_movienames).extract())

In [None]:
xpath_for_movienames ='//h4/text()'

In [None]:
movies = sel.xpath(xpath_for_movienames).extract()

In [None]:
movies

#### Clean your text with strip

In [None]:
movie_list=[]
for string in movies:
    cleaned_string = string.strip()
    if cleaned_string != '':
        movie_list.append(cleaned_string)
    
print(movie_list)
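The same cleaning step can also be written as a list comprehension. The sketch below runs on a small made-up list (the sample strings are assumptions, not real output from the page) so that it works without a network connection.

```python
# Hypothetical sample of what the raw extraction might look like
movies = ['\n', ' The Shawshank Redemption\n', '\n', ' The Godfather\n', '   ']

# Strip each string and keep only the non-empty results
movie_list = [s.strip() for s in movies if s.strip()]

print(movie_list)
```

An empty string is falsy in Python, so `if s.strip()` plays the same role as `cleaned_string != ''` in the loop above.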

#### A brief aside: how does the strip function work?

In [None]:
string = '  xoxo love xoxo   '

# Leading and trailing whitespaces are removed
print(string.strip())

# All leading and trailing characters that are a space, x, o or e
# are removed
print(string.strip(' xoe'))

# The argument doesn't contain a space, so stripping stops at the
# outer spaces and no characters are removed.
print(string.strip('stx'))

string = 'android is awesome'
print(string.strip('an'))
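One detail that is easy to miss: the argument of `strip` is treated as a set of characters, not as a substring. The short sketch below illustrates this, together with the one-sided variants `lstrip` and `rstrip`.

```python
string = '  xoxo love xoxo   '

# Stripping continues from each end until a character outside the set is met.
# The trailing 'e' of 'love' is in the set, so it is removed as well.
both = string.strip(' xoe')

# lstrip only works on the left side, rstrip only on the right side
left = string.lstrip(' xo')
right = string.rstrip(' xo')

print(repr(both))
print(repr(left))
print(repr(right))
```

Printing with `repr` makes the remaining spaces visible.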

In [None]:
len(movie_list)

#### Extracting the order and year of movies

In [None]:
xpath_for_movieorder ='//h4/span[1]/text()'
movie_order = sel.xpath(xpath_for_movieorder).extract()

In [None]:
xpath_for_moviedates ='//h4/span[2]/text()'
movie_date = sel.xpath(xpath_for_moviedates).extract()
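The extracted order and year values are strings. If numbers are needed, something like the sketch below could convert them. It assumes the raw values look like `'1.'` and `'(1994)'`, which may not match the real page, and `first_int` is a hypothetical helper name.

```python
import re

# Hypothetical raw values; the real page may format them differently
movie_order = ['1.', '2.', '3.']
movie_date = ['(1994)', '(1972)', '(2008)']

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

ranks = [first_int(s) for s in movie_order]
years = [first_int(s) for s in movie_date]
print(ranks, years)
```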

##### Copied XPaths for the first four movies:

//*[@id="chart-content"]/div[1]/div[1]/div/a

//*[@id="chart-content"]/div[1]/div[2]/div/a

//*[@id="chart-content"]/div[2]/div[1]/div/a

//*[@id="chart-content"]/div[2]/div[2]/div/a

…


#### Movie links

In [None]:
xpath_for_movielink = '//*[@id="chart-content"]/div/div/div/a/@href'
movie_link = sel.xpath(xpath_for_movielink).extract()
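The general XPath above is the copied one with the position numbers (`[1]`, `[2]`) removed. The sketch below shows the effect on a small made-up document. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset, as a stand-in for the scrapy `Selector` so that it runs offline; the markup is a simplified assumption, not the real IMDb page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup imitating the nesting of the chart page
page = ET.fromstring("""<html><body>
<div id="chart-content">
  <div>
    <div><div><a href="/title/a/">A</a></div></div>
    <div><div><a href="/title/b/">B</a></div></div>
  </div>
  <div>
    <div><div><a href="/title/c/">C</a></div></div>
    <div><div><a href="/title/d/">D</a></div></div>
  </div>
</div>
</body></html>""")

base = ".//*[@id='chart-content']"

# With positions, the path pins down exactly one link
one = page.findall(base + "/div[1]/div[1]/div/a")

# Without positions, every div at each level matches, so all links are returned
every = page.findall(base + "/div/div/div/a")

one_hrefs = [a.get('href') for a in one]
every_hrefs = [a.get('href') for a in every]
print(one_hrefs)
print(every_hrefs)
```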

In [None]:
len(movie_link)

In [None]:
movie_link

The extracted links are partial (relative) URLs, so let's build the full URLs.

In [None]:
first_part_url = 'https://m.imdb.com'
last_part_url = '?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=df09bbba-7a44-41c0-bc85-426ba05a5574&pf_rd_r=R9N3Q0473JZET8YH4S2A&pf_rd_s=top-1&pf_rd_t=15506&pf_rd_i=top&ref_=m_chttp_tt_1'

In [None]:
movie_link_merged = []

for i in movie_link:
    link = first_part_url + i + last_part_url
    movie_link_merged.append(link)

In [None]:
len(movie_link_merged)
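As an alternative to joining strings by hand, the standard library's `urllib.parse` can resolve relative links. The sketch below also drops any query string, on the assumption that the movie page loads without the tracking parameters; the sample links are hypothetical.

```python
from urllib.parse import urljoin, urlsplit

base_url = 'https://m.imdb.com'

# Hypothetical relative links; the second one carries a query string
relative_links = ['/title/tt0111161/', '/title/tt0068646/?ref_=example']

full_links = []
for href in relative_links:
    # Resolve the relative link against the site root
    absolute = urljoin(base_url, href)
    # Remove the query part, keeping scheme, host and path
    without_query = urlsplit(absolute)._replace(query='').geturl()
    full_links.append(without_query)

print(full_links)
```

If the extracted `href` values already contain a `?`, appending a second query string by concatenation would produce a malformed URL, which is one reason to parse the link instead.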

#### Let's store what we downloaded

In [None]:
dictimdb = {'movie_order':movie_order, 'movie_date':movie_date, 'movie_list':movie_list, 'movie_link_merged': movie_link_merged}
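`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, so it can help to compare the lengths first. A minimal check, using made-up columns in which one list is deliberately shorter:

```python
# Hypothetical columns with one list shorter than the others
columns = {
    'movie_order': ['1.', '2.', '3.'],
    'movie_date': ['(1994)', '(1972)', '(2008)'],
    'movie_list': ['The Shawshank Redemption', 'The Godfather'],
}

# Number of items per column
lengths = {name: len(values) for name, values in columns.items()}
print(lengths)

# True only when every column has the same number of items
same_length = len(set(lengths.values())) == 1
print(same_length)
```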

In [None]:
import pandas as pd

In [None]:
data_imdb = pd.DataFrame(dictimdb)

In [None]:
data_imdb

In [None]:
data_imdb['movie_link_merged'][0]

#### Can we go further?

Let's find the players (cast members) of these famous movies:


In [None]:
url = data_imdb['movie_link_merged'][1]

In [None]:
html = requests.get(url).content

sel = Selector(text = html)

print( "You have found: ", len(sel.xpath('//*')))

##### Copied XPaths for the first four players:

//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div[1]/div[2]/a

//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div[2]/div[2]/a

//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div[3]/div[2]/a

//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div[4]/div[2]/a


In [None]:
xpath_for_movieplayers ='//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div/div[2]/a/text()'
movie_players_first_movie = sel.xpath(xpath_for_movieplayers).extract()

In [None]:
movie_players_first_movie

In [None]:
players_list=[]

In [None]:
for i in range(3):
    url = data_imdb['movie_link_merged'][i]
    html = requests.get(url).content
    sel = Selector(text = html)
    movie_players = sel.xpath(xpath_for_movieplayers).extract()
    players_list.append(movie_players)
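When the loop covers more than a few pages, it is considerate to pause between requests and to skip pages that fail to download. The sketch below shows one possible shape for such a loop. `collect`, `fetch` and `parse` are hypothetical names, and the fetch step is replaced by a dictionary lookup so that the example runs offline.

```python
import time

def collect(urls, fetch, parse, delay=1.0):
    """Fetch each URL in turn, pausing between requests; skip failed pages."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # wait between consecutive requests
        html = fetch(url)
        # fetch is expected to return None when the download fails
        results.append(parse(html) if html is not None else [])
    return results

# Stand-in fetch and parse functions so the sketch runs offline
fake_pages = {'u1': 'Player 1,Player 2', 'u2': None}
players = collect(['u1', 'u2'], fake_pages.get, lambda html: html.split(','), delay=0)
print(players)
```

With `requests`, `fetch` could wrap `requests.get(url, timeout=10)` and return `None` when `response.ok` is false, and `parse` could apply the XPath through a `Selector`.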

In [None]:
players_list[1]

## Spider

A Scrapy spider is faster when we want to download a large number of pages, because Scrapy sends requests concurrently instead of one at a time. A spider is written as a class, so let's take a brief look at classes. Python is an object-oriented language: every value is an object, and a class is the template from which objects (instances) are created. A class bundles data (attributes) with the functions (methods) that operate on that data.
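A minimal class, for readers who have not written one before. `Movie` and `label` are hypothetical names used only for illustration; they are not part of scrapy.

```python
class Movie:
    """A hypothetical class used only to illustrate attributes and methods."""

    def __init__(self, title, year):
        # self is the instance being created; attributes are stored on it
        self.title = title
        self.year = year

    def label(self):
        # A method reads the instance's attributes through self
        return f"{self.title} ({self.year})"

# Two instances created from the same class, each with its own data
first = Movie("The Godfather", 1972)
second = Movie("The Dark Knight", 2008)
print(first.label())
print(second.label())
```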

#### How does it work?

The code below defines a class named IMDB_Spider that inherits from the ``scrapy.Spider`` class. Its ``start_requests`` method starts the process: here you define the first page to download. The ``self`` parameter refers to the spider instance itself; it is passed to every method of the class, which is how ``start_requests`` can name the next method (```parse_front```) as its callback.

In the ```parse_front``` method, you define your XPaths, extract the matching data and store it in lists. At this stage you also build the links you want to follow. Each link is passed to the next method with the command below.

```yield response.follow(url = url, callback = self.parse_pages)```

Now you have reached the ``parse_pages`` method. It runs once for each URL that you built and followed in the previous step, so you can define another XPath and extract the data you want from each movie page. The downloaded data stays in the lists after the spider finishes.
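The hand-off between these methods can be imitated in a few lines of plain Python. The sketch below is a toy model, not how Scrapy is implemented: `ToySpider` and `run` are hypothetical, requests are modelled as `(url, callback)` pairs, and nothing is downloaded.

```python
from collections import deque

class ToySpider:
    """Hypothetical stand-in for a spider; not Scrapy's real machinery."""

    def __init__(self):
        self.visited = []

    def start_requests(self):
        # A request is modelled as a (url, callback) pair
        yield ('front', self.parse_front)

    def parse_front(self, url):
        self.visited.append(url)
        # Each yielded pair asks the engine to call parse_pages for that URL
        for link in ['page-1', 'page-2']:
            yield (link, self.parse_pages)

    def parse_pages(self, url):
        self.visited.append(url)
        return []  # nothing further to follow

def run(spider):
    """Process queued requests until none are left."""
    queue = deque(spider.start_requests())
    while queue:
        url, callback = queue.popleft()
        # Whatever the callback yields is added to the queue as new requests
        queue.extend(callback(url) or [])

spider = ToySpider()
run(spider)
print(spider.visited)
```

The toy engine handles one request at a time and in order. Scrapy, in contrast, works on many requests concurrently.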

In order to run the class you have defined, you need to call it:


```process = CrawlerProcess()```

```process.crawl(IMDB_Spider)```

```process.start()```

Don't forget to define empty lists before doing this because you will be writing your downloaded data into these empty lists.

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class IMDB_Spider(scrapy.Spider):
    name = "IMDB_Spider"

    def start_requests(self):
        yield scrapy.Request(url = 'https://m.imdb.com/chart/top', callback = self.parse_front)

    def parse_front(self, response):
        
        movie_names = response.xpath('//h4/text()').extract()
        for item in movie_names:
            cleaned_string = item.strip()
            if cleaned_string != '':
                movie_list.append(cleaned_string)
        
        movie_years = response.xpath('//h4/span[2]/text()').extract()
        for item in movie_years:
            years.append(item)

        movie_links = response.xpath('//*[@id="chart-content"]/div/div/div/a/@href').extract()
        for item in movie_links:
            first_part_url = 'https://m.imdb.com'
            last_part_url = '?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=df09bbba-7a44-41c0-bc85-426ba05a5574&pf_rd_r=R9N3Q0473JZET8YH4S2A&pf_rd_s=top-1&pf_rd_t=15506&pf_rd_i=top&ref_=m_chttp_tt_1'
            url = first_part_url + item + last_part_url
            links.append(url)
            yield response.follow(url = url,  callback = self.parse_pages)
            
    def parse_pages(self, response):
        movie_players = response.xpath('//*[@id="__next"]/main/div/section[1]/div/section/div/div[1]/section[4]/div[2]/div[2]/div/div[2]/a/text()').extract()
        players_list.append(movie_players)


movie_list = []
years = []
links = []
players_list = []

process = CrawlerProcess()
process.crawl(IMDB_Spider)
process.start()
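Because Scrapy handles requests concurrently, the movie pages can arrive in a different order than the links were appended, so `players_list` is not guaranteed to line up with `links`. One possible way to keep them aligned is to store each page's players under its URL and rebuild the list afterwards. The sketch below shows the idea with made-up data; `players_by_url` and `arrivals` are hypothetical names.

```python
# Hypothetical data: links in chart order, responses arriving in another order
links = ['/title/a/', '/title/b/', '/title/c/']
arrivals = [
    ('/title/c/', ['Player 5']),
    ('/title/a/', ['Player 1', 'Player 2']),
    ('/title/b/', ['Player 3']),
]

# Store each page's players under its own URL instead of appending to a list
players_by_url = {}
for url, players in arrivals:
    players_by_url[url] = players

# Rebuild the column in chart order; pages that never arrived get an empty list
players_list = [players_by_url.get(url, []) for url in links]
print(players_list)
```

Inside `parse_pages`, the key could be `response.url`, assuming no redirect changes the address between request and response.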

### Display the first few items

In [None]:
links[0:5]

In [None]:
years[0:2]

In [None]:
movie_list[0:2]

In [None]:
players_list[0:2]

### First a DataFrame, then CSV or JSON

Convert your data into a dictionary, then into a pandas DataFrame.

In [None]:
import pandas as pd

dictimdb = {'movie name':movie_list, 'year':years, 'link':links, 'player_list': players_list}
data_imdb = pd.DataFrame(dictimdb)

In [None]:
data_imdb

In [None]:
data_imdb['player_list'][56]

Now you can save the relevant data to your computer as a CSV or JSON file.

In [None]:
data_imdb.to_csv("IMDB_Filmlerim.csv")
data_imdb.to_json("IMDB_Filmlerim.json")
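A column that holds Python lists is written to CSV as the text form of each list, which is awkward to read back. One option is to join each list into a single string before saving. The sketch below shows the idea with the standard library's `csv` module and an in-memory buffer; the rows are made-up placeholders.

```python
import csv
import io

# Hypothetical rows shaped like the DataFrame above
rows = [
    {'movie name': 'Movie A', 'year': '(1994)', 'player_list': ['Player 1', 'Player 2']},
    {'movie name': 'Movie B', 'year': '(1972)', 'player_list': []},
]

buffer = io.StringIO()  # in-memory file so the sketch leaves no file behind
writer = csv.DictWriter(buffer, fieldnames=['movie name', 'year', 'player_list'])
writer.writeheader()
for row in rows:
    # Join the list into one readable cell instead of writing its Python repr
    writer.writerow({**row, 'player_list': '; '.join(row['player_list'])})

csv_text = buffer.getvalue()
print(csv_text)
```

In pandas, a similar result should be possible with `data_imdb['player_list'].str.join('; ')` before calling `to_csv`.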