# Intro to web scraping

In [1]:
## Install the required libraries. You can install packages with conda directly from a Jupyter notebook; however,
## installing them from the Terminal is the recommended, standard way to install Python packages.

import sys
!conda install --yes --prefix {sys.prefix} -c anaconda beautifulsoup4
!conda install --yes --prefix {sys.prefix} -c anaconda requests

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 23.3.1
  latest version: 23.5.0

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.5.0



## Package Plan ##

  environment location: /Users/DeLaLuna/anaconda3

  added / updated specs:
    - beautifulsoup4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    beautifulsoup4-4.11.1      |  py310hecd8cb5_0         186 KB  anaconda
    ca-certificates-2022.4.26  |       hecd8cb5_0         132 KB  anaconda
    certifi-2022.6.15          |  py310hecd8cb5_0         157 KB  anaconda
    openssl-1.1.1u             |       hca72f7f_0         3.4 MB
    ------------------------------------------------------------
                                           Total:         3.

The first step of web scraping is to identify a website and download its HTML code.

Real HTML from websites tends to be long and a bit too chaotic for a total beginner. Here we will start with a dummy HTML document and learn the basics of extracting information with Beautiful Soup.

- You can learn about HTML here: https://www.w3schools.com/html/
- You can use codebeautify to make your HTML more readable and clean: https://codebeautify.org/htmlviewer

In [4]:
html_doc = """ <!DOCTYPE html><html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></html>"""

In [34]:
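# note: without the call parentheses this prints the bound method itself (see the output below);
# use soup.prettify() to get the indented HTML
print(soup.prettify)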

<bound method Tag.prettify of  <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>>


In [40]:
daughters = soup.select("p > a")
print(daughters)
for i in soup.select("p > a"):
    print(i.get_text())

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Elsie
Lacie
Tillie


In [43]:
soup.select("a#link3")[0].get_text()


'Tillie'

In [45]:
soup.select("p.title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [10]:
soup.select(".title")[0].get_text()


"The Dormouse's story"

In [None]:
html_doc

In [7]:
from bs4 import BeautifulSoup

#### "creating the soup"

In [8]:
# parse the element
soup = BeautifulSoup(html_doc, 'html.parser') 

In [9]:
soup

 <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [11]:
import pprint

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


#### accessing single elements

We can access HTML tags by appending a dot `.` and the name of the tag to the corresponding `soup` object. If the tag appears multiple times, only the first instance is retrieved.
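As a minimal sketch of the "first instance only" behaviour, the snippet below uses a small hypothetical HTML string (not the Dormouse document) and a variable named `demo`, both chosen only for this illustration. It compares dot access with `find_all()` and shows what happens when the tag does not exist.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = '<div><a id="first">one</a><a id="second">two</a></div>'
demo = BeautifulSoup(snippet, "html.parser")

first_link = demo.a             # dot access: only the first <a> tag
all_links = demo.find_all("a")  # every <a> tag, as a list
missing = demo.table            # no <table> in the snippet, so this is None

print(first_link["id"])
print(len(all_links))
print(missing)
```

Because a missing tag gives `None` rather than an error, chaining further calls on it (for example `demo.table.get_text()`) would fail, so it can be worth checking the result first.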

In [14]:
soup.title

<title>The Dormouse's story</title>

In [15]:
soup.title.get_text()

"The Dormouse's story"

In [16]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [17]:
soup.html.body.p

<p class="title"><b>The Dormouse's story</b></p>

In [23]:
soup.find("p")

<p class="title"><b>The Dormouse's story</b></p>

<b>Searching with the find() function</b>

In [18]:
soup.find("p").get_text()

"The Dormouse's story"

In [28]:
# returns an empty list: no <a> tag has the class "story" (that class is on the <p> tags)
soup.find_all("a",{"class": "story"}) 

[]

In [19]:
# this method only retrieves the first element of the specified tag
soup.p

<p class="title"><b>The Dormouse's story</b></p>

#### finding all elements of a tag with the powerful find_all()

In [24]:
p_tags = soup.find_all("p")
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [27]:
soup.find_all("p")[-1]

<p class="story">...</p>

In [26]:
a_tags = soup.find_all("a")
a_tags

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [25]:
soup.body

<body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body>

To get the text from the corresponding HTML code, we can use the `get_text()` method.

In [21]:
for p in p_tags:
    print(p.get_text())

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
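In the output above, words from neighbouring tags run together ("wereElsie"). One possible way around this, sketched here on a hypothetical snippet with the same shape as the story paragraph, is to pass the `separator` and `strip` arguments of `get_text()`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same shape as the story paragraph
snippet = "<p>their names were<a>Elsie</a> and<a>Lacie</a></p>"
paragraph = BeautifulSoup(snippet, "html.parser").p

glued = paragraph.get_text()                             # pieces are joined with nothing in between
spaced = paragraph.get_text(separator=" ", strip=True)   # strip each piece, join with a space

print(glued)
print(spaced)
```

Punctuation that sits in its own text node also gets a space before it with this approach, so some extra cleaning may still be needed on real pages.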


#### Using css selectors
Another way to find content is to use CSS selectors with `select()`.

Let's learn first the syntax of css selectors playing this game: https://flukeout.github.io/

Everyone should reach level 6!

https://www.w3schools.com/css/css_howto.asp
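As a quick reference while playing the game, here is a rough sketch of the most common selector forms, applied with `select()` to a small hypothetical snippet (the tag names, classes and the `demo` variable are made up for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = """
<div id="menu" class="box">
  <p class="item">tea</p>
  <section><p class="item special">coffee</p></section>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_tag = demo.select("p")           # tag name
by_class = demo.select(".special")  # "." selects by class
by_id = demo.select("#menu")        # "#" selects by id
children = demo.select("div > p")   # ">" matches direct children only
descendants = demo.select("div p")  # a space matches descendants at any depth

print(len(by_tag), len(by_class), len(by_id), len(children), len(descendants))
```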

In [172]:
soup.select("a")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [29]:
soup

 <!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p></body></html>

In [174]:
for a in soup.select('a'):
    print(a.get_text())

Elsie
Lacie
Tillie


In [None]:
# in CSS selectors, "#" selects by id and "." selects by class

In [30]:
soup.select("p > b")

[<b>The Dormouse's story</b>]

In [33]:
soup.select("p > b")[0].get_text()

"The Dormouse's story"

In [None]:
soup.select("p")[0]

Using CSS selectors, you can search directly by CSS class!

In [None]:
soup.select(".title")

<b>Comparing to find_all():</b>

In [None]:
print(soup.prettify())

In [None]:
soup.find_all("a", class_="sister")

In [None]:
soup.select(".sister")

<b>You can also search directly by id attribute:</b>

In [None]:
soup.select("#link2")

We can combine the `select()` method with other bs4 methods, such as `get_text()`.

`get_text()`, however, can only be applied to single elements, while `select()` might return multiple elements. It's common to iterate through the output of `select()`.
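A compact way to do that iteration is a list comprehension. When only the first match is needed, `select_one()` returns a single element instead of a list. The sketch below assumes a small hypothetical list of names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = '<ul><li class="name">Elsie</li><li class="name">Lacie</li></ul>'
demo = BeautifulSoup(snippet, "html.parser")

# select() returns a list, so get_text() is applied to each element in turn
names = [li.get_text() for li in demo.select("li.name")]

# select_one() returns the first match, or None when nothing matches
first = demo.select_one("li.name")
nothing = demo.select_one("li.missing")

print(names)
print(first.get_text())
print(nothing)
```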

In [None]:
print(soup.select(".story"))

In [46]:
for p in soup.select("p.story"):
    print(p.get_text())

Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...


In [66]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [70]:
# parse the element
soup = BeautifulSoup(geography, 'html.parser') 

In [78]:
soup.select("p")

[<p>London is the most popular tourist destination in the world.</p>,
 <p>Paris was originally a Roman City called Lutetia.</p>,
 <p>Spain produces 43,8% of all the world's Olive Oil.</p>]

In [72]:
for p in soup.select("p"):
    print(p.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [None]:
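# "city" without a leading dot looks for <city> tags, so nothing matches; use ".city" for the class
for p in soup.select("city"):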
    print(p.get_text())

In [82]:
soup.select(".city")
for c in soup.select(".city"):
    print(c.get_text())


London
London is the most popular tourist destination in the world.


Paris
Paris was originally a Roman City called Lutetia.



In [85]:
soup.select(".city > h2")
for c in soup.select(".city > h2"):
    print(c.get_text())

London
Paris




Write code to print the following contents (not including the html tags, only human-readable text): 

1. All the "fun facts". 

2. The names of all the places. 

3. The content (name and fact) of all the cities (only cities, not countries!) 

4. The names (not facts!) of all the cities (not countries!)

In [73]:
soup.find_all('h2')

[<h2>London</h2>, <h2>Paris</h2>, <h2>Spain</h2>]

In [74]:
for i in soup.find_all('p'):
    print(i.get_text())

London is the most popular tourist destination in the world.
Paris was originally a Roman City called Lutetia.
Spain produces 43,8% of all the world's Olive Oil.


In [183]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

In [185]:
soup = BeautifulSoup(geography, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  Geography
 </head>
 <body>
  <div class="city">
   <h2>
    London
   </h2>
   <p>
    London is the most popular tourist destination in the world.
   </p>
  </div>
  <div class="city">
   <h2>
    Paris
   </h2>
   <p>
    Paris was originally a Roman City called Lutetia.
   </p>
  </div>
  <div class="country">
   <h2>
    Spain
   </h2>
   <p>
    Spain produces 43,8% of all the world's Olive Oil.
   </p>
  </div>
 </body>
</html>



In [None]:
# 1. All the "fun facts"
for i in soup.find_all("p"):
    print(i.get_text())

In [None]:
# 2. The names of all the places.
for i in soup.find_all("h2"):
    print(i.get_text())

or using select()

In [None]:
for item in soup.select("h2"):
    print(item.get_text())

If we want the tags that have a particular `id` or `class`, we can pass a dictionary to `find_all()`.
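The sketch below puts the dictionary form next to two equivalent spellings, on a reduced copy of the geography markup. The `id` attributes are an assumption added for this example; the original document has none.

```python
from bs4 import BeautifulSoup

# Reduced geography markup; the id attributes are added only for this illustration
snippet = """
<div class="city" id="london"><h2>London</h2></div>
<div class="city" id="paris"><h2>Paris</h2></div>
<div class="country" id="spain"><h2>Spain</h2></div>
"""
demo = BeautifulSoup(snippet, "html.parser")

with_dict = demo.find_all("div", {"class": "city"})  # attribute dictionary
with_keyword = demo.find_all("div", class_="city")   # class_ keyword (class is a reserved word in Python)
with_select = demo.select("div.city")                # CSS selector
by_id = demo.find_all("div", {"id": "spain"})        # the dictionary works for id as well

print([d["id"] for d in with_dict])
print(by_id[0].h2.get_text())
```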

In [None]:
soup.find_all("div", {"class":"city"})

In [None]:
# 3. All the content (name and fact) of all the cities (only cities, not countries!)
for i in soup.find_all("div", {"class":"city"}):
    print(i.get_text())

or using select()

In [None]:
soup.select(".city")

In [None]:
for i in soup.select(".city"):
    print(i.get_text())

In [None]:
# 4. The names (not facts!) of all the cities (not countries!)
for item in soup.select(".city h2"):
    print(item.get_text())

## Use case: IMDb top charts

Let's go to https://www.imdb.com/chart/top, where we'll see the top 250 movies according to IMDb ratings.

Notice how each movie has the following elements:

- Title

- Release Year

- IMDb rating

- Director & main stars (they appear when you hover over the title)

Our objective is to scrape this information and store it in a pandas DataFrame.



In [86]:
# 1. import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [90]:
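# 2. find url and store it in a variable
# note: this is the IMDb title search sorted by user rating, not the chart/top page mentioned above
url = "https://www.imdb.com/search/title/?title_type=feature&sort=user_rating,desc"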

#### Using the requests package

In [93]:
# 3. download html with a get request 
response = requests.get(url)

In [94]:
response.status_code # 200 status code means OK!

200

### HTTP Response status codes 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
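To make the status code families concrete, here is a small sketch that uses only Python's standard library. The helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for codes the standard library does not know
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` right after the request raises an exception for 4xx and 5xx responses, which is a common way to stop early instead of parsing an error page.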

In [95]:
# 4.1. parse html (create the 'soup')
soup = BeautifulSoup(response.content, "html.parser")
# 4.2. check that the html code looks like it should
soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Feature Film
(Sorted by IMDb Rating Descending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/search/title/?title_type=feature" rel="canonical"/>
<meta content="http://www.imdb.com/search/title/?title_type=feature" proper

<b>You can copy the CSS selector using the Chrome inspector: select the element you want to extract, right-click it, and choose "Copy selector".</b>

In [None]:
#main > div > span > div > div > div.lister > table > tbody > tr:nth-child(1) > td.titleColumn > a

In [103]:
soup.select("h3 > a")

[<a href="/title/tt27789115/">Erumbu</a>,
 <a href="/title/tt15513206/">Nee Jathaga</a>,
 <a href="/title/tt27858080/">Heel'D</a>,
 <a href="/title/tt27798947/">What is Art</a>,
 <a href="/title/tt17014810/">Paper Line</a>,
 <a href="/title/tt12304768/">Neon Bleed</a>,
 <a href="/title/tt4971262/">Road King</a>,
 <a href="/title/tt12654846/">Rudy</a>,
 <a href="/title/tt23139004/">RI$E</a>,
 <a href="/title/tt13580840/">Simón</a>,
 <a href="/title/tt19397086/">Kaya Palat</a>,
 <a href="/title/tt26768535/">Beega</a>,
 <a href="/title/tt5230406/">Famous</a>,
 <a href="/title/tt27908623/">Melody Drama</a>,
 <a href="/title/tt26895530/">Abort</a>,
 <a href="/title/tt15677268/">Potentially Dangerous</a>,
 <a href="/title/tt28022731/">Stolen Dough</a>,
 <a href="/title/tt27971768/">Important in the Life</a>,
 <a href="/title/tt15325916/">Caralique</a>,
 <a href="/title/tt21384786/">Unveiled</a>,
 <a href="/title/tt28243828/">Juliet</a>,
 <a href="/title/tt11100290/">Unbeatable Fighter</a>,
 

In [107]:
titles = []

In [108]:
for t in soup.select("h3 > a"):
    titles.append(t.get_text())
print(titles)  

['Erumbu', 'Nee Jathaga', "Heel'D", 'What is Art', 'Paper Line', 'Neon Bleed', 'Road King', 'Rudy', 'RI$E', 'Simón', 'Kaya Palat', 'Beega', 'Famous', 'Melody Drama', 'Abort', 'Potentially Dangerous', 'Stolen Dough', 'Important in the Life', 'Caralique', 'Unveiled', 'Juliet', 'Unbeatable Fighter', 'The Fragile King', 'Apple Cinema', 'Breaking the Dwarf Wall', 'Song of the Fly', 'Efunsetan Aniwura', 'Fight Back!', 'A1 Quality Media Presents Innocence', 'Decent Reflection', 'Am Rande der Zeiten', 'Jeta', 'Sisters and the Shrink 2', 'Shift-e Nimeh Shab', 'Mr. Local Man', 'Matriarch', 'Sospeso', 'Bandu Boxer', 'Sadguru', 'Praveena', 'Iron Rule', 'Omr-e Setare', 'Flames of Wrath', 'Tolou Dar Shab', 'Turvo', 'Prince Oak Oakleyski Starring Supremacy', 'Poets Are the Destroyers', 'Zhuchok', 'El Pirata', 'Elmar']


In [117]:
years = []
for y in soup.select("h3 > span.lister-item-year"):
    years.append(y.get_text())
print(years)  

['(2023)', '(2021)', '(2023)', '(2023)', '(2022)', '(2023)', '(2023)', '(II) (2023)', '(2022)', '(2023)', '(2022)', '(2023)', '(II) (2023)', '(2023)', '(2023)', '(2021)', '(2023)', '(2023)', '(2022)', '(III) (2022)', '(2023)', '(2019)', '(2022)', '(2021)', '(2015)', '(2022)', '(2005)', '(2014)', '(2021)', '(2023)', '(2020)', '(2022)', '(2021)', '(2005)', '(2019)', '(2021)', '(2022)', '(2006)', '(2023)', '(2023)', '(2022)', '(2014)', '(1923)', '(2014)', '(2021)', '(2023)', '(2021)', '(2015)', '(2015)', '(2022)']


In [124]:
ratings = []
for i in soup.select("div > strong"):
    ratings.append(i.get_text())
print(ratings)

['10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0', '10.0']


In [120]:
soup.select("div > strong")

[<strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</strong>,
 <strong>10.0</s

In [113]:
soup.select("h3 > span")

[<span class="lister-item-index unbold text-primary">1.</span>,
 <span class="lister-item-year text-muted unbold">(2023)</span>,
 <span class="lister-item-index unbold text-primary">2.</span>,
 <span class="lister-item-year text-muted unbold">(2021)</span>,
 <span class="lister-item-index unbold text-primary">3.</span>,
 <span class="lister-item-year text-muted unbold">(2023)</span>,
 <span class="lister-item-index unbold text-primary">4.</span>,
 <span class="lister-item-year text-muted unbold">(2023)</span>,
 <span class="lister-item-index unbold text-primary">5.</span>,
 <span class="lister-item-year text-muted unbold">(2022)</span>,
 <span class="lister-item-index unbold text-primary">6.</span>,
 <span class="lister-item-year text-muted unbold">(2023)</span>,
 <span class="lister-item-index unbold text-primary">7.</span>,
 <span class="lister-item-year text-muted unbold">(2023)</span>,
 <span class="lister-item-index unbold text-primary">8.</span>,
 <span class="lister-item-year te

In [125]:
directors = []


In [126]:
soup.select("p > a:nth-child(1)")

[<a href="/name/nm10743272/">G. Suresh</a>,
 <a href="/name/nm2132667/">M.S. Bhaskar</a>,
 <a href="/name/nm1706602/">Charlie</a>,
 <a href="/name/nm6001916/">Suzane George</a>,
 <a href="/name/nm3388887/">Jagan</a>,
 <a href="/name/nm12977508/">Bamidipati Veera</a>,
 <a href="/name/nm7342732/">Bharath Bandaru</a>,
 <a href="/name/nm12991796/">Mehbaoob Basha</a>,
 <a href="/name/nm12991795/">Shourya Chandra</a>,
 <a href="/name/nm12991794/">Raghuveera Chary</a>,
 <a href="/name/nm8179180/">Demetris 'Illski' Jones</a>,
 <a href="/name/nm14174770/">Tonya Love</a>,
 <a href="/name/nm14893112/">Nassir Bell</a>,
 <a href="/name/nm14895990/">Victoria Chanel</a>,
 <a href="/name/nm14890513/">Bonita Choice</a>,
 <a href="/name/nm14561164/">Ciera Sharde Cohen</a>,
 <a href="/name/nm9138255/">Danny Abbott</a>,
 <a href="/name/nm4123455/">Daniel Fissmer</a>,
 <a href="/name/nm5548587/">Jarad Kopciak</a>,
 <a href="/name/nm7697469/">Arlo Sanders</a>,
 <a href="/name/nm5548587/">Jarad Kopciak</a>,


In [128]:
actors = []

In [127]:
soup.select("p > a:nth-child(n+2)")

[<a href="/name/nm2132667/">M.S. Bhaskar</a>,
 <a href="/name/nm1706602/">Charlie</a>,
 <a href="/name/nm6001916/">Suzane George</a>,
 <a href="/name/nm3388887/">Jagan</a>,
 <a href="/name/nm7342732/">Bharath Bandaru</a>,
 <a href="/name/nm12991796/">Mehbaoob Basha</a>,
 <a href="/name/nm12991795/">Shourya Chandra</a>,
 <a href="/name/nm12991794/">Raghuveera Chary</a>,
 <a href="/name/nm14174770/">Tonya Love</a>,
 <a href="/name/nm14893112/">Nassir Bell</a>,
 <a href="/name/nm14895990/">Victoria Chanel</a>,
 <a href="/name/nm14890513/">Bonita Choice</a>,
 <a href="/name/nm14561164/">Ciera Sharde Cohen</a>,
 <a href="/name/nm4123455/">Daniel Fissmer</a>,
 <a href="/name/nm5548587/">Jarad Kopciak</a>,
 <a href="/name/nm7697469/">Arlo Sanders</a>,
 <a href="/name/nm5548587/">Jarad Kopciak</a>,
 <a href="/name/nm4123455/">Daniel Fissmer</a>,
 <a href="/name/nm7697469/">Arlo Sanders</a>,
 <a href="/name/nm9138255/">Danny Abbott</a>,
 <a href="/name/nm5692730/">Walter Ashaad</a>,
 <a href="/

The selector we copied is long and unwieldy, isn't it? It also selects only a single movie, while we want to collect data from all of them. Going from that specific selector to a more general and elegant one is the actual work of web scraping.

In this case, we can experiment with different tags and classes until we notice that all the information about the movies sits under the tag `<td class="titleColumn">`. Fortunately, this tag contains little clutter, just the information we need.
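The step from a copied, very specific selector to a general one can be sketched on a small table. The markup below is hypothetical, loosely modelled on the copied selector and the class names used in this section; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the copied selector;
# the real page may be structured differently
snippet = """
<table><tbody>
<tr><td class="titleColumn"><a title="Director A (dir.), Star B, Star C">Movie One</a>
    <span class="secondaryInfo">(1994)</span></td></tr>
<tr><td class="titleColumn"><a title="Director D (dir.), Star E, Star F">Movie Two</a>
    <span class="secondaryInfo">(1972)</span></td></tr>
</tbody></table>
"""
demo = BeautifulSoup(snippet, "html.parser")

# the copied style of selector pins down one row only
specific = demo.select("table > tbody > tr:nth-child(1) > td.titleColumn > a")
# dropping the position and the long ancestor chain matches every row
general = demo.select("td.titleColumn a")

# reading each cell as a unit keeps title and year from the same movie together
rows = []
for cell in demo.select("td.titleColumn"):
    rows.append({
        "title": cell.select_one("a").get_text(),
        "year": cell.select_one("span.secondaryInfo").get_text(),
    })

print(len(specific), len(general))
print(rows)
```

Collecting one dictionary per cell is a design choice: if one movie lacks a field, the problem shows up in that row instead of silently shifting a whole column.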

In [130]:
# number of elements with the titleColumn class; 0 means the page parsed above contains none
len(soup.select(".titleColumn"))

0

In [131]:
soup.select(".titleColumn")[0]  # first matching element; raises IndexError when the selection is empty

IndexError: list index out of range

In [None]:
# we can use .get_text() to extract the content of the tags we selected
# we'll need to do it to each tag with a for loop: here we do it to the first one
soup.select("td.titleColumn  a")[0]
soup.select("td.titleColumn  a")[0].get_text()

In [None]:
ratings = soup.select(".imdbRating strong")
for rating in ratings:
    print(rating.get_text())

<b>The director and main stars are in the same tag, but as the value of the "title" attribute. We can access attributes like key-value pairs of a dictionary, using ["key"] to get the value:</b>
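As a small illustration of attribute access, the sketch below parses a single link copied from the Dormouse document and reads its attributes with both `["key"]` and `.get("key")`. The variable name `link` is chosen only for this example.

```python
from bs4 import BeautifulSoup

# One link taken from the Dormouse document used earlier
snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link["href"])       # dictionary-style access; raises KeyError if the attribute is missing
print(link.get("id"))     # .get() works like dict.get()
print(link.get("title"))  # a missing attribute gives None instead of an error
print(link["class"])      # class can hold several values, so it comes back as a list
print(link.attrs)         # all attributes at once, as a dictionary
```

The difference matters in loops: `["title"]` stops at the first tag without that attribute, while `.get("title")` lets the loop continue and leaves a `None` to handle later.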

In [129]:
# the director and main stars are in the same tag, but as a value of the attribute "title"
# we can access attributes as key-value pairs of dictionaries: using ["key"] to get the value:
soup.select("td.titleColumn a")[0]['title'].split(',')[1:]

# instead of ["title"] we could use .get("title"): choose whatever you prefer

IndexError: list index out of range

In [None]:
# the years are inside a 'span' tag with the 'secondaryInfo' class
# the years are wrapped in parentheses, which .strip('()') removes from both ends
soup.select(".secondaryInfo")[0].get_text().strip('()')

In [None]:
year=soup.select("td.titleColumn span.secondaryInfo")[0].get_text()
year

#### Building the dataframe

In [None]:
#initialize empty lists
title = []
dir_stars = []
year = []
ratings = []

In [None]:
# define the number of iterations of our for loop 
# by checking how many elements are in the retrieved result set
# (this is equivalent but more robust than just explicitly defining 250 iterations)
num_iter = len(soup.select("td.titleColumn a"))

In [None]:
num_iter

In [None]:
# iterate through the result set and retrieve all the data
# (each selection is run once, outside the loop, instead of on every iteration)
title_links = soup.select("td.titleColumn a")
year_spans = soup.select("td.titleColumn span.secondaryInfo")
rating_tags = soup.select("strong")
for i in range(num_iter):
    title.append(title_links[i].get_text())    # movie title
    dir_stars.append(title_links[i]["title"])  # director and main stars
    year.append(year_spans[i].get_text())      # release year, still in parentheses
    ratings.append(rating_tags[i].get_text())  # IMDb rating

In [None]:
title = []

In [None]:
dir_stars = []
year = []
ratings = []

In [None]:
for i in soup.select("td.titleColumn a"):
    title.append(i.get_text())

In [None]:
for i in soup.select("td.titleColumn a"):
    dir_stars.append(i['title'])

In [None]:
for i in soup.select("td.titleColumn span.secondaryInfo"):
    year.append(i.get_text())

In [None]:
for i in soup.select("strong"):
    ratings.append(i.get_text())

In [None]:
print(title)

In [None]:
len(dir_stars)

In [None]:
print(year)

In [None]:
# each list becomes a column
movies = pd.DataFrame({"title":title,
                       "dir_stars":dir_stars,
                       "year":year,
                       "ratings":ratings
                      })

movies.head(250)

#### Cleaning the data

An inherent part of web scraping is data cleaning. We managed to get the information we needed, but for it to be useful, we still need some extra steps:

- Take the year out of the parentheses: we know we can totally do that with regex, but string methods such as str.replace() might be simpler to use.

- Change the data type of the year column to integer.

- Split dir_stars into 3 columns, one for each person: "director", "star_1", "star_2". This could have been done by filtering when extracting the data from the html document, but it looks easier afterwards:

    - The "(dir.)" pattern can be totally removed
    - We can split the string at each comma

In [None]:
director = []
star_1 = []
star_2 = []

for movie in dir_stars:
    crew = movie.split(",")
    director.append(crew[0].replace(" (dir.)", "").strip())
    # strip() removes the space left after each comma
    star_1.append(crew[1].strip())
    star_2.append(crew[2].strip())

# each list becomes a column
movies = pd.DataFrame({"title":title,
                       "director":director,
                       "star_1":star_1,
                       "star_2":star_2,
                       "year":year,
                       "ratings":ratings
                      })

movies.head()

#### Cleaning the year column

In [None]:
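# chain .str.replace twice: a plain .replace on a Series only matches whole cell values
# regex=False treats the parentheses as literal characters
movies["year"] = movies["year"].str.replace("(", "", regex=False).str.replace(")", "", regex=False)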


In [None]:
movies['year'] = movies.year.str.replace(")", "", regex=False)  # regex=False treats ")" as a literal character

In [None]:
movies