# Web Scraping with Beautiful Soup
Scraping the [IMDb Top 250 Movies](https://www.imdb.com/chart/top/) chart and loading the results into a pandas DataFrame.

In [1]:
import requests

In [2]:
import bs4
from bs4 import BeautifulSoup

In [3]:
import pandas as pd

In [4]:
res=requests.get('https://www.imdb.com/chart/top/') # fetch the chart page with requests
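
The request above does not check whether the download succeeded. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the `get` parameter (a hook for swapping in a stand-in during testing), the 10-second timeout and the User-Agent string are all assumptions made for illustration, not part of the original notebook.

```python
import requests

URL = "https://www.imdb.com/chart/top/"

def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the page body as text, raising on HTTP errors.

    `get` is a hypothetical hook so the function can be exercised offline.
    """
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(URL)` would then stand in for `res.text` in the next cell.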

In [5]:
soup=bs4.BeautifulSoup(res.text,'html.parser') # parse the HTML with Python's built-in parser

### Scraping:

In [6]:
soup.select('.titleColumn') # each class="titleColumn" cell holds the rank, title and release year

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>,
 <td class="titleColumn">
       3.
       <a href="/title/tt0468569/" title="Christopher Nolan (dir.), Christian Bale, Heath Ledger">The Dark Knight</a>
 <span class="secondaryInfo">(2008)</span>
 </td>,
 <td class="titleColumn">
       4.
       <a href="/title/tt0071562/" title="Francis Ford Coppola (dir.), Al Pacino, Robert De Niro">The Godfather Part II</a>
 <span class="secondaryInfo">(1974)</span>
 </td>,
 <td class="titleColumn">
       5.
       <a href="/title/tt0050083/" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
 <span class="secondaryInfo">(

In [7]:
movies=soup.select('.titleColumn')

In [8]:
for item in movies:
    rank=item.get_text(strip=True).split('.')[0] # rank is the text before the first '.' in the <td> tag
    title=item.a.text # title is the text of the <a> tag
    year=item.span.text.strip('()') # year is in the <span> tag, wrapped in parentheses
    print(rank,title,year)

1 The Shawshank Redemption 1994
2 The Godfather 1972
3 The Dark Knight 2008
4 The Godfather Part II 1974
5 12 Angry Men 1957
6 Schindler's List 1993
7 The Lord of the Rings: The Return of the King 2003
8 Pulp Fiction 1994
9 The Lord of the Rings: The Fellowship of the Ring 2001
10 Il buono, il brutto, il cattivo 1966
11 Forrest Gump 1994
12 Fight Club 1999
13 The Lord of the Rings: The Two Towers 2002
14 Inception 2010
15 The Empire Strikes Back 1980
16 The Matrix 1999
17 Goodfellas 1990
18 One Flew Over the Cuckoo's Nest 1975
19 Se7en 1995
20 Shichinin no samurai 1954
21 It's a Wonderful Life 1946
22 The Silence of the Lambs 1991
23 Cidade de Deus 2002
24 Saving Private Ryan 1998
25 La vita è bella 1997
26 Interstellar 2014
27 The Green Mile 1999
28 Star Wars 1977
29 Terminator 2: Judgment Day 1991
30 Back to the Future 1985
31 Sen to Chihiro no kamikakushi 2001
32 Psycho 1960
33 The Pianist 2002
34 Gisaengchung 2019
35 Léon 1994
36 The Lion King 1994
37 Gladiator 2000
38 American His
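
The loop above keeps every field as a string. As a rough sketch, the same extraction can be tried offline on a single cell copied from the `titleColumn` output shown earlier, converting the rank and year to integers along the way. The helper name `parse_title_cell` is hypothetical.

```python
from bs4 import BeautifulSoup

# One cell copied from the titleColumn output earlier in this notebook.
SAMPLE_CELL = """
<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
<span class="secondaryInfo">(1994)</span>
</td>
"""

def parse_title_cell(cell):
    """Return (rank, title, year) for one titleColumn cell."""
    # The first non-empty text fragment is the rank plus a dot, e.g. "1."
    rank = int(next(cell.stripped_strings).rstrip("."))
    title = cell.a.get_text(strip=True)
    # The year sits inside parentheses in the <span> tag.
    year = int(cell.span.get_text(strip=True).strip("()"))
    return rank, title, year

cell = BeautifulSoup(SAMPLE_CELL, "html.parser").select_one(".titleColumn")
print(parse_title_cell(cell))
```

Working from a saved snippet like this makes it possible to check the parsing logic without requesting the page again.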

In [9]:
soup.find_all(class_="ratingColumn imdbRating") # each class="ratingColumn imdbRating" cell holds the IMDb rating

[<td class="ratingColumn imdbRating">
 <strong title="9.2 based on 2,676,134 user ratings">9.2</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="9.2 based on 1,855,055 user ratings">9.2</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="9.0 based on 2,649,411 user ratings">9.0</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="9.0 based on 1,269,971 user ratings">9.0</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="9.0 based on 790,554 user ratings">9.0</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="8.9 based on 1,354,575 user ratings">8.9</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="8.9 based on 1,844,223 user ratings">8.9</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="8.8 based on 2,051,539 user ratings">8.8</strong>
 </td>,
 <td class="ratingColumn imdbRating">
 <strong title="8.8 based on 1,873,536 user ratings">8.8</strong>
 <

In [10]:
for rating in soup.find_all(class_="ratingColumn imdbRating"):
    print(rating.get_text(strip=True))

9.2
9.2
9.0
9.0
9.0
8.9
8.9
8.8
8.8
8.8
8.8
8.7
8.7
8.7
8.7
8.7
8.7
8.6
8.6
8.6
8.6
8.6
8.6
8.6
8.6
8.6
8.6
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.5
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.4
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.3
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.2
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.1
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
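
Each rating cell also carries the number of user ratings in the `title` attribute of its `<strong>` tag, as the `find_all` output above shows. The sketch below pulls both values from one copied cell. The helper name `parse_rating_cell` and the regular expression are assumptions based only on the attribute format visible in that output.

```python
import re
from bs4 import BeautifulSoup

# One cell copied from the ratingColumn output earlier in this notebook.
SAMPLE_CELL = """
<td class="ratingColumn imdbRating">
<strong title="9.2 based on 2,676,134 user ratings">9.2</strong>
</td>
"""

def parse_rating_cell(cell):
    """Return (rating, votes); votes is None if the title text does not match."""
    rating = float(cell.strong.get_text(strip=True))
    # Assumed format: "<rating> based on <count with commas> user ratings"
    match = re.search(r"based on ([\d,]+) user ratings", cell.strong.get("title", ""))
    votes = int(match.group(1).replace(",", "")) if match else None
    return rating, votes

cell = BeautifulSoup(SAMPLE_CELL, "html.parser").select_one("td.ratingColumn.imdbRating")
print(parse_rating_cell(cell))
```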


### Creating the DataFrame:

In [11]:
rank=[item.get_text(strip=True).split('.')[0] for item in movies]

In [12]:
name=[item.a.text for item in movies]

In [13]:
year=[item.span.text.strip('()') for item in movies]

In [14]:
rating=[cell.get_text(strip=True) for cell in soup.find_all(class_="ratingColumn imdbRating")] # loop variable renamed so it does not shadow the list name

In [15]:
top_250=pd.DataFrame({'Rank':rank,'Title':name,'Release Year':year,'IMDb Rating':rating})
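
`pd.DataFrame` raises a `ValueError` when the lists passed in a dict have different lengths, which can happen if one selector matches fewer cells than another. A minimal sketch of an explicit check that reports which column is short follows. The helper name `build_frame` is hypothetical.

```python
import pandas as pd

def build_frame(rank, name, year, rating):
    """Build the table only if all four lists have the same length."""
    columns = {"Rank": rank, "Title": name, "Release Year": year, "IMDb Rating": rating}
    lengths = {label: len(values) for label, values in columns.items()}
    if len(set(lengths.values())) != 1:
        # Name the offending columns instead of relying on pandas' generic message.
        raise ValueError(f"column lengths differ: {lengths}")
    return pd.DataFrame(columns)

# Tiny hand-written sample, using values from the output above.
sample = build_frame(["1", "2"], ["The Shawshank Redemption", "The Godfather"],
                     ["1994", "1972"], ["9.2", "9.2"])
print(sample.shape)
```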

In [16]:
top_250

|     | Rank | Title | Release Year | IMDb Rating |
| --- | --- | --- | --- | --- |
| 0 | 1 | The Shawshank Redemption | 1994 | 9.2 |
| 1 | 2 | The Godfather | 1972 | 9.2 |
| 2 | 3 | The Dark Knight | 2008 | 9.0 |
| 3 | 4 | The Godfather Part II | 1974 | 9.0 |
| 4 | 5 | 12 Angry Men | 1957 | 9.0 |
| ... | ... | ... | ... | ... |
| 245 | 246 | The Iron Giant | 1999 | 8.0 |
| 246 | 247 | Aladdin | 1992 | 8.0 |
| 247 | 248 | The Help | 2011 | 8.0 |
| 248 | 249 | Gandhi | 1982 | 8.0 |
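
Because every value was scraped as text, all four columns hold strings, so sorting or averaging would not behave numerically. A small sketch of converting the numeric columns with `pd.to_numeric` follows, using a three-row sample typed in from the output above rather than the full scraped table.

```python
import pandas as pd

# Three rows copied by hand from the output above; all values are strings.
sample = pd.DataFrame({
    "Rank": ["1", "2", "3"],
    "Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Release Year": ["1994", "1972", "2008"],
    "IMDb Rating": ["9.2", "9.2", "9.0"],
})

typed = sample.copy()
for column in ["Rank", "Release Year", "IMDb Rating"]:
    # to_numeric picks an integer or float dtype based on the values.
    typed[column] = pd.to_numeric(typed[column])

# With numeric dtypes, sorting by year orders the rows chronologically.
oldest_first = typed.sort_values("Release Year")
print(oldest_first["Title"].tolist())
```

The same loop could be applied to `top_250` once it has been built.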
