
I'm using Beautiful Soup to extract the movies Sandra Bullock appeared in from her Wikipedia filmography page. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is one of the most popular Python packages for web scraping.

In [2]:
# Import packages
import numpy as np
import pandas as pd

# For web scraping
import requests
from bs4 import BeautifulSoup

# To perform regex operations
import re

# To add delay to avoid spam requests
import time


A site's robots.txt file lists which paths crawlers are allowed or disallowed to access, so we check it before scraping. Once we have confirmed the page is allowed, we can grab the URL of the page we want to scrape.
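The robots.txt check can also be done in code. Below is a minimal sketch using the standard library's `urllib.robotparser`. The rules and the `example.org` URLs are hypothetical and inlined so the example runs offline; they are not Wikipedia's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs without a network call
sample_rules = """
User-agent: *
Disallow: /private/
Allow: /wiki/
""".splitlines()

# Parse the rules as if they had been downloaded from a site
parser = RobotFileParser()
parser.parse(sample_rules)

# Ask whether a generic crawler ("*") may fetch each URL
allowed = parser.can_fetch("*", "https://example.org/wiki/Some_page")
blocked = parser.can_fetch("*", "https://example.org/private/data")

print("wiki page allowed:", allowed)
print("private page allowed:", blocked)
```

Against a live site, calling `parser.set_url(...)` with the site's robots.txt address and then `parser.read()` would download the real file instead of parsing inline lines.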

In [3]:
# Save the URL of the website we want to scrape to a variable
sb_url = 'https://en.wikipedia.org/wiki/Sandra_Bullock_filmography'
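The `time` import above is meant for spacing out requests, although a single page fetch does not strictly need it. If more pages were scraped, a small wrapper could add the delay, a timeout, and a status check in one place. This is only a sketch: `fetch_html` and its parameters are hypothetical names, not part of `requests`.

```python
import time

import requests


def fetch_html(url, session=None, delay_seconds=1.0, timeout=10):
    """Hypothetical helper: wait, fetch a page, and fail loudly on HTTP errors."""
    if session is None:
        session = requests.Session()

    # Pause before each request so repeated calls do not hammer the server
    time.sleep(delay_seconds)

    response = session.get(url, timeout=timeout)

    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Example call (commented out because it needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/Sandra_Bullock_filmography')
```

The returned text could be passed to `BeautifulSoup` in the same way as `response.content` is used in the cells that follow.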

In [5]:
# Send request to access the content of the page and assign to a variable
response = requests.get(sb_url)

# Show the content
response.content

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Sandra Bullock filmography - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"85f1bc19-b375-46d9-bb76-5950af0de820","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Sandra_Bullock_filmography","wgTitle":"Sandra Bullock filmography","wgCurRevisionId":1056952072,"wgRevisionId":1056952072,"wgArticleId":39573456,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use American English from

As you can see, the raw content is not easy to read. What we want to see is a table of all the movies Sandra Bullock appeared in. This is where we get to see the magic of Beautiful Soup.

In [6]:
# Create the soup object and assign to a variable
sb_soup = BeautifulSoup(response.content, 'html.parser')

# Show contents of the soup
sb_soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Sandra Bullock filmography - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"85f1bc19-b375-46d9-bb76-5950af0de820","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Sandra_Bullock_filmography","wgTitle":"Sandra Bullock filmography","wgCurRevisionId":1056952072,"wgRevisionId":1056952072,"wgArticleId":39573456,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Use American English from Janua

This is better, but we can do much more! Next, we are going to parse the content. We're only looking for the names of the movies, so the first step is to find the table that holds them.

In [8]:
# Find the filmography table by its class attribute
movies = sb_soup.find('table',{'class':"wikitable plainrowheaders sortable"})
movies

<table class="wikitable plainrowheaders sortable">
<tbody><tr>
<th scope="col">Year
</th>
<th scope="col">Title
</th>
<th scope="col">Role(s)
</th>
<th class="unsortable" scope="col">Notes
</th>
<th class="unsortable" scope="col"><abbr title="Reference(s)">Ref(s)</abbr>
</th></tr>
<tr>
<th scope="row">1987
</th>
<td><i><a href="/wiki/Hangmen_(film)" title="Hangmen (film)">Hangmen</a></i>
</td>
<td><span data-sort-value="Edwards !">Lisa Edwards</span>
</td>
<td>
</td>
<td style="text-align:center;"><sup class="reference" id="cite_ref-17"><a href="#cite_note-17">[17]</a></sup>
</td></tr>
<tr>
<th scope="row">1989
</th>
<td><i><span data-sort-value="Fool !"><a href="/wiki/A_Fool_and_His_Money_(1989_film)" title="A Fool and His Money (1989 film)">A Fool and His Money</a></span></i>
</td>
<td><span data-sort-value="Cosgrove !">Debby Cosgrove</span>
</td>
<td>Also known as <i>Religion, Inc.</i><br/><a href="/wiki/Direct-to-video" title="Direct-to-video">Direct-to-video</a>
</td>
<td style="t

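Note: it took me a moment to find the table because I was entering the class name as "wikitable plainrowheaders sortable jquery-tablesorter". The browser's inspector shows the extra `jquery-tablesorter` class, most likely because JavaScript adds it after the page loads, so it is not in the HTML that `requests` downloads.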
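The class-name mismatch from the note can be reproduced on a small inline table. The sketch below uses a hypothetical two-column sample rather than the live page. It shows that passing the full class string requires an exact match, while `class_` with a single class or a CSS selector is less sensitive to extra classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with the same class attribute as the downloaded table
sample_html = """
<table class="wikitable plainrowheaders sortable">
<tr><th scope="col">Year</th><th scope="col">Title</th></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# A full class string must match the attribute value exactly
exact = sample_soup.find('table', {'class': 'wikitable plainrowheaders sortable'})

# The string copied from the browser inspector has an extra class, so nothing matches
with_js_class = sample_soup.find(
    'table', {'class': 'wikitable plainrowheaders sortable jquery-tablesorter'}
)

# Matching on a single class finds any table that carries it
by_single_class = sample_soup.find('table', class_='wikitable')

# A CSS selector matches tables that have all the listed classes, in any order
by_selector = sample_soup.select_one('table.wikitable.sortable')

print(exact is not None, with_js_class is None)
print(by_single_class is exact, by_selector is exact)
```

Matching on fewer classes is more forgiving, but on a page with several `wikitable` tables it returns only the first one, so it is worth checking which table came back.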

Now that we have the table's contents, we can convert the table into a dataframe.

In [None]:
# 
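One possible way to do the conversion is to loop over the table rows and collect the cell text into a list of dictionaries. The sketch below is not the notebook's final code. It runs on a trimmed copy of the first two rows shown in the output above so that it works offline, and the names `sample_table`, `rows`, and `sb_df` are assumptions. The same loop could be pointed at `movies` instead of `sample_table`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of the first rows from the table output, inlined for an offline run
sample_html = """
<table class="wikitable plainrowheaders sortable">
<tbody><tr>
<th scope="col">Year</th><th scope="col">Title</th><th scope="col">Role(s)</th>
</tr>
<tr>
<th scope="row">1987</th>
<td><i><a href="/wiki/Hangmen_(film)">Hangmen</a></i></td>
<td><span data-sort-value="Edwards !">Lisa Edwards</span></td>
</tr>
<tr>
<th scope="row">1989</th>
<td><i><a href="/wiki/A_Fool_and_His_Money_(1989_film)">A Fool and His Money</a></i></td>
<td><span data-sort-value="Cosgrove !">Debby Cosgrove</span></td>
</tr>
</tbody></table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

rows = []
for tr in sample_table.find_all('tr'):
    year_cell = tr.find('th', {'scope': 'row'})
    cells = tr.find_all('td')
    # Skip the header row and any row that does not fit this simple layout
    if year_cell is None or len(cells) < 2:
        continue
    rows.append({
        'Year': int(year_cell.get_text(strip=True)),
        'Title': cells[0].get_text(strip=True),
        'Role': cells[1].get_text(strip=True),
    })

sb_df = pd.DataFrame(rows, columns=['Year', 'Title', 'Role'])
print(sb_df)
```

This simple loop assumes every row carries its own year cell. If the full table merges cells across rows with `rowspan`, those rows would be skipped here and would need extra handling.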