# Data Wrangling for HTML

Use BeautifulSoup to extract the 2016 movies listed on the Wikipedia page https://en.wikipedia.org/wiki/2016_in_film.

## Load the HTML File

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
with open('2016 in film - Wikipedia.html') as file:
    soup = BeautifulSoup(file, "lxml")

FileNotFoundError: [Errno 2] No such file or directory: '2016 in film - Wikipedia.html'

In [3]:
soup

NameError: name 'soup' is not defined
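
The two errors above occur because the saved copy of the page is not in the working directory, so `soup` is never created. Below is a minimal sketch of one way to guard against that. It assumes a hypothetical helper, `load_soup`, which is not part of the original notebook or of BeautifulSoup. The helper parses the local file when it exists and otherwise calls a caller-supplied function to obtain the HTML. A stand-in string replaces the real download so the example runs offline.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def load_soup(local_path, fetch_html):
    """Parse a saved HTML file if it exists, otherwise parse fetch_html()'s result.

    `load_soup` and `fetch_html` are illustrative names, not library functions.
    """
    path = Path(local_path)
    if path.exists():
        html = path.read_text(encoding="utf-8")
    else:
        # Fall back to whatever the caller provides, e.g. a network request.
        html = fetch_html()
    return BeautifulSoup(html, "html.parser")


# Stand-in for a real download so the example runs without network access.
demo_soup = load_soup(
    "no-such-file.html",
    lambda: "<html><head><title>demo</title></head></html>",
)
print(demo_soup.title.string)
```

In this notebook the fallback could plausibly be `lambda: requests.get('https://en.wikipedia.org/wiki/2016_in_film').text`, which matches the request made in the next cell.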

In [6]:
page = requests.get('https://en.wikipedia.org/wiki/2016_in_film')
page.status_code

200

In [7]:
soup = BeautifulSoup(page.content, 'html.parser')
soup


<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>2016 in film - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c66a4f1f-5c1c-4df3-bcee-ef16dce505b6","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"2016_in_film","wgTitle":"2016 in film","wgCurRevisionId":963781388,"wgRevisionId":963781388,"wgArticleId":20068492,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","2016 in film","Film by year"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"2016_in_film","wgRelev

## Print website title

In [8]:
soup.find('title')

<title>2016 in film - Wikipedia</title>

In [10]:
soup.find('title').contents[0]

'2016 in film - Wikipedia'
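
`.contents[0]` works here because the `<title>` tag has exactly one child. As a small sketch, the same text can also be read with `.string` or `.get_text()`. The inline HTML snippet below is an assumption standing in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Small stand-in for the downloaded page.
sample_html = "<html><head><title>2016 in film - Wikipedia</title></head><body></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.find("title")

# Three ways to read the text of a tag that has a single text child.
title_via_contents = title_tag.contents[0]
title_via_string = sample_soup.title.string
title_via_get_text = title_tag.get_text(strip=True)

print(title_via_get_text)
```

`.get_text()` is the more forgiving choice when a tag may contain nested markup, since `.string` returns `None` for a tag with more than one child.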

## Find the Highest-Grossing Films Table

In [11]:
soup.find_all('table')

[<table class="infobox hlist">
 <tbody><tr>
 <th style="font-size:larger;">
 <table style="width:100%">
 <tbody><tr>
 <td style="text-align:left; width:5em;">
 </td>
 <td style="text-align:center"><a href="/wiki/List_of_years_in_film" title="List of years in film">List of years in film</a>
 </td>
 <td style="text-align:right; width:5em;">(<a href="/wiki/Table_of_years_in_film" title="Table of years in film">table</a>)
 </td></tr></tbody></table>
 </th></tr>
 <tr>
 <td style="text-align:center">
 <ul><li><a href="/wiki/Category:Film_by_year" title="Category:Film by year">…</a> <a href="/wiki/2006_in_film" title="2006 in film">2006</a></li>
 <li><a href="/wiki/2007_in_film" title="2007 in film">2007</a></li>
 <li><a href="/wiki/2008_in_film" title="2008 in film">2008</a></li>
 <li><a href="/wiki/2009_in_film" title="2009 in film">2009</a></li>
 <li><a href="/wiki/2010_in_film" title="2010 in film">2010</a></li>
 <li><a href="/wiki/2011_in_film" title="2011 in film">2011</a></li>
 <li><a 

In [11]:
soup.select('table')[3]

<table class="wikitable sortable" style="margin:auto; margin:auto;">
<caption>Highest-grossing films of 2016
</caption>
<tbody><tr>
<th>Rank</th>
<th>Title</th>
<th>Distributor</th>
<th>Worldwide gross
</th></tr>
<tr>
<th style="text-align:center;">1
</th>
<td><i><a href="/wiki/Captain_America:_Civil_War" title="Captain America: Civil War">Captain America: Civil War</a></i>
</td>
<td rowspan="5"><a href="/wiki/Walt_Disney_Studios_Motion_Pictures" title="Walt Disney Studios Motion Pictures">Disney</a>
</td>
<td>$1,153,304,495
</td></tr>
<tr>
<th style="text-align:center;">2
</th>
<td><i><a href="/wiki/Rogue_One" title="Rogue One">Rogue One: A Star Wars Story</a></i>
</td>
<td>$1,056,057,273
</td></tr>
<tr>
<th style="text-align:center;">3
</th>
<td><i><a href="/wiki/Finding_Dory" title="Finding Dory">Finding Dory</a></i>
</td>
<td>$1,028,570,889
</td></tr>
<tr>
<th style="text-align:center;">4
</th>
<td><i><a href="/wiki/Zootopia" title="Zootopia">Zootopia</a></i>
</td>
<td>$1,023,784,1
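
Selecting the table with `[3]` depends on how many tables come before it on the page, and that count can change when the article is edited. A hedged alternative is to look the table up by its `<caption>` text. In the sketch below, `find_table_by_caption` is a hypothetical helper and the inline HTML is a reduced stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Reduced stand-in for the Wikipedia page: one unrelated table, one target table.
sample_html = """
<table class="infobox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
<caption>Highest-grossing films of 2016
</caption>
<tr><th>Rank</th><th>Title</th></tr>
<tr><th>1</th><td><i><a href="/wiki/Captain_America:_Civil_War">Captain America: Civil War</a></i></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def find_table_by_caption(soup, caption_text):
    """Return the first <table> whose <caption> contains caption_text, or None."""
    for table in soup.find_all("table"):
        caption = table.find("caption")
        if caption is not None and caption_text in caption.get_text():
            return table
    return None


gross_table = find_table_by_caption(sample_soup, "Highest-grossing films")
print(gross_table.caption.get_text(strip=True))
```

The helper returns `None` when no caption matches, so a caller can detect a changed page layout instead of silently scraping the wrong table.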

## Find the tag name and print the movies' names

Expected output:
- Captain America: Civil War
- Rogue One: A Star Wars Story
- Finding Dory
- Zootopia
- The Jungle Book
- The Secret Life of Pets
- Batman v Superman: Dawn of Justice
- Fantastic Beasts and Where to Find Them
- Deadpool
- Suicide Squad

In [12]:
soup.select('table')[3].find_all('td')

[<td><i><a href="/wiki/Captain_America:_Civil_War" title="Captain America: Civil War">Captain America: Civil War</a></i>
 </td>,
 <td rowspan="5"><a href="/wiki/Walt_Disney_Studios_Motion_Pictures" title="Walt Disney Studios Motion Pictures">Disney</a>
 </td>,
 <td>$1,153,304,495
 </td>,
 <td><i><a href="/wiki/Rogue_One" title="Rogue One">Rogue One: A Star Wars Story</a></i>
 </td>,
 <td>$1,056,057,273
 </td>,
 <td><i><a href="/wiki/Finding_Dory" title="Finding Dory">Finding Dory</a></i>
 </td>,
 <td>$1,028,570,889
 </td>,
 <td><i><a href="/wiki/Zootopia" title="Zootopia">Zootopia</a></i>
 </td>,
 <td>$1,023,784,195
 </td>,
 <td><i><a href="/wiki/The_Jungle_Book_(2016_film)" title="The Jungle Book (2016 film)">The Jungle Book</a></i>
 </td>,
 <td>$966,550,600
 </td>,
 <td><i><a href="/wiki/The_Secret_Life_of_Pets" title="The Secret Life of Pets">The Secret Life of Pets</a></i>
 </td>,
 <td><a href="/wiki/Universal_Pictures" title="Universal Pictures">Universal</a>
 </td>,
 <td>$875,457

In [33]:
soup.select('table')[3].find_all('a')

[<a href="/wiki/Captain_America:_Civil_War" title="Captain America: Civil War">Captain America: Civil War</a>,
 <a href="/wiki/Walt_Disney_Studios_Motion_Pictures" title="Walt Disney Studios Motion Pictures">Disney</a>,
 <a href="/wiki/Rogue_One" title="Rogue One">Rogue One: A Star Wars Story</a>,
 <a href="/wiki/Finding_Dory" title="Finding Dory">Finding Dory</a>,
 <a href="/wiki/Zootopia" title="Zootopia">Zootopia</a>,
 <a href="/wiki/The_Jungle_Book_(2016_film)" title="The Jungle Book (2016 film)">The Jungle Book</a>,
 <a href="/wiki/The_Secret_Life_of_Pets" title="The Secret Life of Pets">The Secret Life of Pets</a>,
 <a href="/wiki/Universal_Pictures" title="Universal Pictures">Universal</a>,
 <a href="/wiki/Batman_v_Superman:_Dawn_of_Justice" title="Batman v Superman: Dawn of Justice">Batman v Superman: Dawn of Justice</a>,
 <a href="/wiki/Warner_Bros." title="Warner Bros.">Warner Bros.</a>,
 <a href="/wiki/Fantastic_Beasts_and_Where_to_Find_Them_(film)" title="Fantastic Beasts a

In [11]:
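# Print the link text of every <a> in the table; distributor links are included too
for tag in soup.select('table')[3].find_all('a'):
    print(tag.contents[0])
    # print(tag.get('title'))  # alternative: the linked article's title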

Captain America: Civil War
Disney
Rogue One: A Star Wars Story
Finding Dory
Zootopia
The Jungle Book
The Secret Life of Pets
Universal
Batman v Superman: Dawn of Justice
Warner Bros.
Fantastic Beasts and Where to Find Them
Deadpool
20th Century Fox
Suicide Squad


In [31]:
soup.select('table')[3].find_all('i')

[<i><a href="/wiki/Captain_America:_Civil_War" title="Captain America: Civil War">Captain America: Civil War</a></i>,
 <i><a href="/wiki/Rogue_One" title="Rogue One">Rogue One: A Star Wars Story</a></i>,
 <i><a href="/wiki/Finding_Dory" title="Finding Dory">Finding Dory</a></i>,
 <i><a href="/wiki/Zootopia" title="Zootopia">Zootopia</a></i>,
 <i><a href="/wiki/The_Jungle_Book_(2016_film)" title="The Jungle Book (2016 film)">The Jungle Book</a></i>,
 <i><a href="/wiki/The_Secret_Life_of_Pets" title="The Secret Life of Pets">The Secret Life of Pets</a></i>,
 <i><a href="/wiki/Batman_v_Superman:_Dawn_of_Justice" title="Batman v Superman: Dawn of Justice">Batman v Superman: Dawn of Justice</a></i>,
 <i><a href="/wiki/Fantastic_Beasts_and_Where_to_Find_Them_(film)" title="Fantastic Beasts and Where to Find Them (film)">Fantastic Beasts and Where to Find Them</a></i>,
 <i><a href="/wiki/Deadpool_(film)" title="Deadpool (film)">Deadpool</a></i>,
 <i><a href="/wiki/Suicide_Squad_(film)" title=

In [10]:

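# Movie titles are italicised, so restricting the search to <i> tags skips the distributors
for tag in soup.select('table')[3].find_all('i'):
    print(tag.find('a').contents[0])
    # print(tag.find('a').get('title'))  # alternative: the linked article's title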

Captain America: Civil War
Rogue One: A Star Wars Story
Finding Dory
Zootopia
The Jungle Book
The Secret Life of Pets
Batman v Superman: Dawn of Justice
Fantastic Beasts and Where to Find Them
Deadpool
Suicide Squad
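
The loop over `<i>` tags can also be written as a single CSS selector, and the same rows can be used to pair each title with its worldwide gross. This is a sketch only: the inline HTML is a two-row stand-in for the real table, and reading the gross from the last `<td>` of each row is an assumption about the table layout. The last cell is used because the distributor cell carries a `rowspan` and is missing from some rows.

```python
from bs4 import BeautifulSoup

# Two-row stand-in for the highest-grossing films table.
sample_html = """
<table class="wikitable sortable">
<tr><th>Rank</th><th>Title</th><th>Distributor</th><th>Worldwide gross</th></tr>
<tr><th>1</th>
<td><i><a href="/wiki/Captain_America:_Civil_War">Captain America: Civil War</a></i></td>
<td rowspan="2"><a href="/wiki/Walt_Disney_Studios_Motion_Pictures">Disney</a></td>
<td>$1,153,304,495</td></tr>
<tr><th>2</th>
<td><i><a href="/wiki/Rogue_One">Rogue One: A Star Wars Story</a></i></td>
<td>$1,056,057,273</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find("table")

# Titles only: anchors that sit inside an <i> directly under a <td>.
titles = [a.get_text(strip=True) for a in table.select("td > i > a")]

# Title plus gross: walk the rows and skip any row without a title link.
films = []
for row in table.find_all("tr"):
    title_link = row.select_one("td > i > a")
    if title_link is None:
        continue  # header row
    gross_text = row.find_all("td")[-1].get_text(strip=True)
    gross = int(gross_text.replace("$", "").replace(",", ""))
    films.append((title_link.get_text(strip=True), gross))

for title, gross in films:
    print(f"{title}: ${gross:,}")
```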
