# Beautiful Soup Tutorial -- Dev Sharma, Applied Analytics Club

Import libraries

In [1]:
from bs4 import BeautifulSoup
import requests

Define the URL and send an HTTP GET request to fetch the web page

In [28]:
url = "https://www.imdb.com/search/title?genres=drama&groups=top_250&sort=user_rating,desc"
res = requests.get(url)

Let's check the response variable

In [29]:
print(res)

<Response [200]>


A status code of 200 means 'OK': the request succeeded. Other codes signal problems (e.g. 401 Unauthorized, 404 Not Found); the full list is here: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
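To make the meaning of a status code concrete, here is a minimal sketch that uses Python's standard-library `http.HTTPStatus` to turn a numeric code into a readable label. The helper name `describe_status` is made up for illustration and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found' for a known HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about.
        return f"{code} (not a standard status code)"
    return f"{status.value} {status.phrase}"

for code in (200, 401, 404):
    print(describe_status(code))
```

With a real `requests` response, the number is available as `res.status_code`, and calling `res.raise_for_status()` raises an exception for 4xx and 5xx responses, so a failed request is not parsed by mistake.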


Use BeautifulSoup to parse the HTML text of the response

In [30]:
soup = BeautifulSoup(res.text,'lxml')
print(soup)

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>IMDb: Drama,
IMDb "Top 250"
(Sorted by IMDb Rating Descending) - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<

### Selecting a single element

In [32]:
movie = soup.select_one(".lister-item-header a")
print(movie)
print(movie.text)
print(movie["href"])

<a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
The Shawshank Redemption
/title/tt0111161/?ref_=adv_li_tt
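One thing to keep in mind: `select_one` returns `None` when nothing matches, so `movie.text` would raise an `AttributeError` if the page layout changes. The sketch below shows one way to guard against that. It runs on a small hand-written HTML snippet (an assumption that only mimics the structure selected above, not IMDb's actual page), uses the built-in `html.parser` so it does not need `lxml`, and the helper `first_match` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure selected above.
sample_html = """
<div class="lister-item">
  <h3 class="lister-item-header">
    <a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
  </h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

def first_match(soup, selector):
    """Return (text, href) for the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    if tag is None:
        return None
    # .get() returns None instead of raising if the attribute is absent.
    return tag.get_text(strip=True), tag.get("href")

found = first_match(sample_soup, ".lister-item-header a")
missing = first_match(sample_soup, ".no-such-class a")
print(found)
print(missing)
```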


### Selecting multiple elements

Use BeautifulSoup's select method to get every element that matches the CSS selector

In [34]:
# Use the SelectorGadget browser extension to find the CSS selector
movies = soup.select(".lister-item-header a")

print(movies)

[<a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>, <a href="/title/tt0068646/?ref_=adv_li_tt">The Godfather</a>, <a href="/title/tt0468569/?ref_=adv_li_tt">The Dark Knight</a>, <a href="/title/tt0071562/?ref_=adv_li_tt">The Godfather: Part II</a>, <a href="/title/tt0167260/?ref_=adv_li_tt">The Lord of the Rings: The Return of the King</a>, <a href="/title/tt0110912/?ref_=adv_li_tt">Pulp Fiction</a>, <a href="/title/tt0108052/?ref_=adv_li_tt">Schindler's List</a>, <a href="/title/tt0050083/?ref_=adv_li_tt">12 Angry Men</a>, <a href="/title/tt0137523/?ref_=adv_li_tt">Fight Club</a>, <a href="/title/tt0120737/?ref_=adv_li_tt">The Lord of the Rings: The Fellowship of the Ring</a>, <a href="/title/tt0109830/?ref_=adv_li_tt">Forrest Gump</a>, <a href="/title/tt0167261/?ref_=adv_li_tt">The Lord of the Rings: The Two Towers</a>, <a href="/title/tt0099685/?ref_=adv_li_tt">Goodfellas</a>, <a href="/title/tt0073486/?ref_=adv_li_tt">One Flew Over the Cuckoo's Nest</a>, <a hr

In [35]:
movies_titles = []
movies_links = []

for item in movies:
    movies_titles.append(item.text)
    link = "https://www.imdb.com" + item["href"]  # hrefs are relative, so prepend the site root
    movies_links.append(link)

print(movies_titles)
print("\n")
print(movies_links)

['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', 'The Godfather: Part II', 'The Lord of the Rings: The Return of the King', 'Pulp Fiction', "Schindler's List", '12 Angry Men', 'Fight Club', 'The Lord of the Rings: The Fellowship of the Ring', 'Forrest Gump', 'The Lord of the Rings: The Two Towers', 'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Seven Samurai', 'Interstellar', 'City of God', 'Saving Private Ryan', 'The Green Mile', 'Life Is Beautiful', 'Se7en', 'Léon: The Professional', 'The Silence of the Lambs', "It's a Wonderful Life", 'Dangal', 'Whiplash', 'The Intouchables', 'The Prestige', 'The Departed', 'The Pianist', 'Gladiator', 'American History X', 'The Lion King', 'Cinema Paradiso', 'Grave of the Fireflies', 'Apocalypse Now', 'Casablanca', 'The Great Dictator', 'Modern Times', 'City Lights', 'Your Name.', 'Django Unchained', '3 Idiots', 'Taare Zameen Par', 'Babam ve Oglum', 'The Lives of Others', 'Oldeuboi', 'American Beauty', 'Braveheart', 'Once Upon a T
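The loop above builds absolute links by joining strings. As an alternative sketch, the standard-library `urllib.parse` module can join a base URL with a relative `href` and, optionally, drop the tracking query string (such as `?ref_=adv_li_tt`). The constant `BASE_URL` and the helper `absolute_link` are names chosen for this example, and using `https://www.imdb.com` as the base is an assumption.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"  # assumed site root

def absolute_link(href: str, keep_query: bool = False) -> str:
    """Turn a relative href into an absolute URL, optionally keeping the query string."""
    full = urljoin(BASE_URL, href)
    if keep_query:
        return full
    parts = urlsplit(full)
    # Rebuild the URL without the query string and fragment.
    return urlunsplit(parts._replace(query="", fragment=""))

print(absolute_link("/title/tt0111161/?ref_=adv_li_tt"))
print(absolute_link("/title/tt0111161/?ref_=adv_li_tt", keep_query=True))
```

`urljoin` also copes with hrefs that are already absolute, which plain concatenation does not.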

## Challenge

### Scrape the 250 best TV shows' titles and links

Link: https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250

In [17]:
# Answer
#
#
#
#
#

In [37]:
url = "https://www.imdb.com/chart/toptv/?ref_=nv_tvv_250"
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')

shows = soup.select(".titleColumn a")

print(shows[:10])
print("\n")

# Each item in shows is an <a> tag: .text is the title, ["href"] is the relative link
shows_titles = [show.text for show in shows]
shows_links = ["https://www.imdb.com" + show["href"] for show in shows]

print(shows_titles[:10])
print("\n")
print(shows_links[:10])

[<a href="/title/tt5491994/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=612KCT5SVCWHWHA9747A&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_1" title="David Attenborough">Planet Earth II</a>, <a href="/title/tt0185906/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=612KCT5SVCWHWHA9747A&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_2" title="Scott Grimes, Damian Lewis">Band of Brothers</a>, <a href="/title/tt0944947/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=612KCT5SVCWHWHA9747A&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=chttvtp_tt_3" title="Emilia Clarke, Peter Dinklage">Game of Thrones</a>, <a href="/title/tt0795176/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=12230b0e-0e00-43ed-9e59-8d5353703cce&amp;pf_rd_r=612KCT5SVCWHWHA9747A&amp;pf_rd_s=center-1&amp;pf_rd_t=15506&amp;pf_rd_i=toptv&amp;ref_=c

## Bonus: Creating a for loop to scrape multiple pages

In [39]:
base_url = "http://quotes.toscrape.com/page/"
number_of_pages = 3
quotes = []

for i in range(1, number_of_pages + 1):
    url = base_url + str(i)  # Build the URL for each page
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    quotes.extend(soup.select(".text"))  # Collect the quote elements from this page
    
print(quotes[:10])
print("\n")
print("Length of quotes is",len(quotes))

quotes_text = [quote.text for quote in quotes]

print(quotes_text[:10])

[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>, <span class="text" itemprop="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>, <span class="text" itemprop="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span>, <span class="text" itemprop="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span>, <span class="text" itemprop="text">“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”</span>, <span class="text" itemprop="text">“Try not to become a man of success. Rather become a man of value.”</span>, <span class="text" itemprop="text">“It is better to be hated for what you are than to be loved for what you are not.”</spa
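The loop above needs the number of pages up front. A possible variation, sketched here, keeps requesting pages until one comes back with no matching elements. To keep the example runnable offline, it reads from a small in-memory dictionary instead of the network: `FAKE_PAGES`, `fake_fetch`, and `scrape_all` are all hypothetical names, and the HTML is hand-written to resemble the `.text` spans shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the site: page number -> HTML body.
FAKE_PAGES = {
    1: '<span class="text">first quote</span><span class="text">second quote</span>',
    2: '<span class="text">third quote</span>',
}

def fake_fetch(page_number):
    """Stand-in for requests.get(url).text; unknown pages contain no quotes."""
    return FAKE_PAGES.get(page_number, "<p>No quotes found!</p>")

def scrape_all(fetch, max_pages=50):
    """Collect quote texts page by page until a page has no '.text' elements."""
    texts = []
    for page_number in range(1, max_pages + 1):
        page_soup = BeautifulSoup(fetch(page_number), "html.parser")
        found = page_soup.select(".text")
        if not found:
            break  # an empty page means we ran past the last one
        texts.extend(tag.get_text() for tag in found)
    return texts

all_quotes = scrape_all(fake_fetch)
print(all_quotes)
```

To use this against the real site, `fetch` could be replaced by a function that calls `requests.get(base_url + str(page_number)).text`; the `max_pages` cap is a safety limit so the loop cannot run forever, and adding a short `time.sleep` between requests keeps the load on the server low.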