# Parse Information

In [1]:
import requests
URL = "https://www.imdb.com/chart/top"
r = requests.get(URL)
html = r.text
#print(r.text)

## BeautifulSoup
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work.

In [2]:
from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
breakfast_soup = BeautifulSoup(html_doc, 'html.parser')
#print(breakfast_soup.prettify()) # Simplified tree structure of the html document
print("title of html: ", breakfast_soup.title) # The <title> tag of the document, not the element with class "title"
print("tag \'p\' of html: ", breakfast_soup.p) # The first 'p' (paragraph) tag of the document
print("tag \'b\' of html: ", breakfast_soup.b) # The first 'b' (bold text) tag of the document
print("the class of tag \'p\' of html: ", breakfast_soup.p['class']) # The class attribute of the first 'p' tag, returned as a list
print("the first tag \'a\' of html: ", breakfast_soup.a) # The first 'a' tag (a hyperlink) of the document
print("all \'a\' tags of html: ")  
a_s = breakfast_soup.find_all('a') # All tags 'a' of html
for a in a_s:
    print(a)
    print(a.get('href'))
print("context of id \'link3\': ", breakfast_soup.find(id='link3')) # Find the element whose id is "link3"
# Find the 'p' element with class 'title' and print its text
print("context of class \'title\': ", breakfast_soup.find('p', {"class": "title"}).get_text()) 

title of html:  <title>The Dormouse's story</title>
tag 'p' of html:  <p class="title"><b>The Dormouse's story</b></p>
tag 'b' of html:  <b>The Dormouse's story</b>
the class of tag 'p' of html:  ['title']
the first tag 'a' of html:  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
all 'a' tags of html: 
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
http://example.com/elsie
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
http://example.com/lacie
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
http://example.com/tillie
context of id 'link3':  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
context of class 'title':  The Dormouse's story
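
The cell above covers navigating and searching. Modifying the parse tree, which the library description also mentions, might look like the minimal sketch below. The one-line fragment and the name `demo_soup` are made up for illustration and are not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, modelled on the Dormouse example
snippet = '<p class="story">See <a href="http://example.com/elsie" id="link1">Elsie</a></p>'
demo_soup = BeautifulSoup(snippet, 'html.parser')

link = demo_soup.a
link.string = "Elsie (updated)"             # replace the text inside the tag
link['href'] = "https://example.com/elsie"  # overwrite an attribute
del link['id']                              # remove an attribute

print(demo_soup.p)
```

The changes are made in place, so printing `demo_soup.p` afterwards shows the edited link inside the paragraph.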


In [3]:
soup = BeautifulSoup(html, features='lxml')
print("title of the page: ", soup.title.get_text())
print("head 1 of the body: ", soup.h1.get_text())
#print(dir(soup))
all_movies = soup.find_all('td', {"class": "titleColumn"}) # td: defines a cell in a table
print("Top 5 Movies in the history: ")
for movie in all_movies[:5]:
    print(movie.get_text())
    #print('\t', movie.a.get_text())

title of the page:  IMDb Top 250 - IMDb
head 1 of the body:  Top Rated Movies
Top 5 Movies in the history: 

      1.
      The Shawshank Redemption
(1994)


      2.
      The Godfather
(1972)


      3.
      The Godfather: Part II
(1974)


      4.
      The Dark Knight
(2008)


      5.
      12 Angry Men
(1957)



### Save all posters of the top 250 movies in history
- Open the url: https://www.imdb.com/chart/top
- Find one poster on the page, right-click it and choose Inspect
- Study the structure and identify which tag, class or id the poster sits under
- Use the tag, class or id to locate all posters
- Extract the information from them and save
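
The last step, turning a movie title and an image URL into a file name, can be tried offline first. This is a minimal sketch, assuming a hypothetical helper `poster_filename` and made-up example.com URLs. It takes the extension from the URL path rather than from the last dot, so a query string does not leak into the file name.

```python
import os
import re
from urllib.parse import urlparse

def poster_filename(title, url):
    """Build a file name from a movie title and its poster URL (hypothetical helper)."""
    safe_title = re.sub(r'\W', '_', title)  # every non-word character becomes '_'
    # Extension of the URL path only, e.g. '.jpg'; any '?query' part is ignored
    extension = os.path.splitext(urlparse(url).path)[1]
    return safe_title + extension

print(poster_filename("The Godfather: Part II", "https://example.com/posters/godfather2.jpg"))
print(poster_filename("Se7en", "https://example.com/posters/se7en.png?size=small"))
```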

In [4]:
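import os
import re # used to turn movie titles into safe file names

os.makedirs('../img/IMDb', exist_ok=True) # make sure the target folder exists
all_posters = soup.find_all('td', {"class": "posterColumn"})

for poster in all_posters:
    url = poster.img['src']
    img_r = requests.get(url, stream=True)
    # Replace every non-word character of the title with '_' and keep the extension of the url
    image_name = re.sub(r'\W', '_', poster.img['alt']) + '.' + url.split('.')[-1]
    with open('../img/IMDb/%s' % image_name, 'wb') as f:
        # Write the image to disk in small chunks
        for chunk in img_r.iter_content(chunk_size=128):
            f.write(chunk)
    print('Saved %s' % image_name)

print("done!")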


Saved The_Shawshank_Redemption.jpg
Saved The_Godfather.jpg
Saved The_Godfather__Part_II.jpg
Saved The_Dark_Knight.jpg
Saved 12_Angry_Men.jpg
Saved Schindler_s_List.jpg
Saved The_Lord_of_the_Rings__The_Return_of_the_King.jpg
Saved Pulp_Fiction.jpg
Saved The_Good__the_Bad_and_the_Ugly.jpg
Saved Fight_Club.jpg
Saved The_Lord_of_the_Rings__The_Fellowship_of_the_Ring.jpg
Saved Forrest_Gump.jpg
Saved Star_Wars__Episode_V___The_Empire_Strikes_Back.jpg
Saved Inception.jpg
Saved The_Lord_of_the_Rings__The_Two_Towers.jpg
Saved One_Flew_Over_the_Cuckoo_s_Nest.jpg
Saved Goodfellas.jpg
Saved The_Matrix.jpg
Saved Seven_Samurai.jpg
Saved Se7en.jpg
Saved City_of_God.jpg
Saved Star_Wars__Episode_IV___A_New_Hope.jpg
Saved The_Silence_of_the_Lambs.jpg
Saved It_s_a_Wonderful_Life.jpg
Saved Life_Is_Beautiful.jpg
Saved Spider_Man__Into_the_Spider_Verse.jpg
Saved The_Usual_Suspects.jpg
Saved Spirited_Away.jpg
Saved Saving_Private_Ryan.jpg
Saved Léon__The_Professional.jpg
Saved The_Green_Mile.jpg
Saved Inters

## lxml
The [lxml](https://lxml.de/index.html) XML toolkit is a Pythonic binding for the C libraries __libxml2__ and __libxslt__. It is unique in that it combines the speed and XML feature completeness of the libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known __ElementTree__ API.

It combines conveniently with XPath.

In [5]:
from lxml import etree
s = etree.HTML(html)
# The url of the poster of "The Shawshank Redemption"
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[1]/td[1]/a/img/@src'))
# tr defines a row in a table
# div defines a section in a document
# From the copied XPath, we can tell:
## tr[1] is the row of the top-ranked movie, tr[2] holds the information of the 2nd movie, and so on
## td[1] contains the poster, so we can guess that td[2] holds the title
## and td[3] holds the rating
## We can check this by copying the XPaths of the title and the rating and comparing them.
## XPath of the title:
## //*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[1]/td[2]/a
## XPath of the rating:
## //*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[1]/td[3]/strong
# The top 1 movie's name and rating
print('\n'+"Number 1:")
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[1]/td[2]/a/text()'))
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[1]/td[3]/strong/text()'))
print('\n'+"Number 2:")
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[2]/td[2]/a/text()'))
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[2]/td[3]/strong/text()'))
print('\n'+"Number 250")
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[last()]/td[2]/a/text()'))
print(s.xpath('//*[@id="main"]/div/span/div/div/div[3]/table/tbody/tr[last()]/td[3]/strong/text()'))

['https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg']

Number 1:
['The Shawshank Redemption']
['9.2']

Number 2:
['The Godfather']
['9.2']

Number 250
['8½']
['8.0']
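
Because only the row index changes between these XPaths, the rows can also be looped over instead of being addressed one by one. The sketch below illustrates the idea on a made-up, well-formed table fragment with the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath. lxml's `etree` offers the same `findall` and `findtext` calls in addition to the full `xpath()` method. The fragment and the variable names are assumptions and do not reproduce the live IMDb markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that mimics the poster / title / rating column layout
table_xml = """
<table><tbody>
  <tr><td><img src="p1.jpg"/></td><td><a>The Shawshank Redemption</a></td><td><strong>9.2</strong></td></tr>
  <tr><td><img src="p2.jpg"/></td><td><a>The Godfather</a></td><td><strong>9.2</strong></td></tr>
</tbody></table>
""".strip()
table = ET.fromstring(table_xml)

rows = []
# start=1 mirrors the 1-based positions tr[1], tr[2], ...
for rank, tr in enumerate(table.findall('./tbody/tr'), start=1):
    title = tr.findtext('td[2]/a')        # second cell holds the title
    rating = tr.findtext('td[3]/strong')  # third cell holds the rating
    rows.append((rank, title, rating))

print(rows)
```

Paths relative to each row keep the title and the rating of one movie together, which is harder to guarantee when two independent absolute queries are zipped afterwards.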


### Save the titles and ratings of the top 250 movies
- Open the url: https://www.imdb.com/chart/top
- Find the title and rating of one movie, and inspect
- Study the XPath of the title and of the rating
- Write an XPath that matches all titles, and one that matches all ratings
- Extract the information

In [6]:
titles_list = s.xpath('//td[@class="titleColumn"]/a/text()')
ratings_list = s.xpath('//td[@class="ratingColumn imdbRating"]/strong/text()')
#print(titles_list[:10])
#print(ratings_list[:10])
imdb_list = list(zip(titles_list, ratings_list))
#print(imdb_list[:10])
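
`zip` stops at the shorter of its inputs, so if one XPath matched fewer nodes than the other, the title and rating pairs would be silently cut short or misaligned. A hedged sketch of a guard, using a hypothetical helper `pair_up` and made-up sample lists:

```python
def pair_up(titles, ratings):
    """Zip titles with ratings, refusing to continue if the lengths differ (hypothetical helper)."""
    if len(titles) != len(ratings):
        raise ValueError("got %d titles but %d ratings" % (len(titles), len(ratings)))
    return list(zip(titles, ratings))

print(pair_up(["The Shawshank Redemption", "The Godfather"], ["9.2", "9.2"]))

# A mismatch is reported instead of being truncated away
try:
    pair_up(["The Shawshank Redemption", "The Godfather"], ["9.2"])
except ValueError as err:
    print("mismatch:", err)
```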

In [7]:
import pandas as pd
df = pd.DataFrame(imdb_list, columns=["Title", "Rating"])
#print(list(df['Title']))
df.to_csv("./imdb.csv")
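
By default `to_csv` also writes the row index as an unnamed first column, and ratings extracted as text stay strings. The sketch below uses made-up two-row data and an in-memory buffer instead of a file to show one way of converting the ratings to numbers and leaving the index out. Whether either change is wanted here is an assumption.

```python
import io
import pandas as pd

sample = [("The Shawshank Redemption", "9.2"), ("The Godfather", "9.2")]
demo_df = pd.DataFrame(sample, columns=["Title", "Rating"])
demo_df["Rating"] = demo_df["Rating"].astype(float)  # text to number

buffer = io.StringIO()               # stands in for a file such as imdb.csv
demo_df.to_csv(buffer, index=False)  # index=False drops the unnamed index column
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms that only the two named columns were written
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))
```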