# Web Scraping with BeautifulSoup

# What is Web Scraping?

Web scraping is the (automated) process of extracting data from a website.

Depending on the website, web scraping could violate the terms of use, so check them before you start.

You should scrape only as much as necessary, and pace your GET requests.

Scrape with caution!
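One way to act on this advice is to check a site's robots.txt rules before fetching, and to pause between requests. The sketch below is only an illustration: the robots.txt text, the `course-scraper` user agent, the example.com URLs, and the helper names `allowed_urls` and `paced` are all hypothetical. The rules are parsed from a string, so nothing is actually requested.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the site's own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(urls, robots_txt=ROBOTS_TXT, user_agent="course-scraper"):
    """Keep only the URLs that the robots.txt rules allow this user agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

def paced(urls, delay_seconds=1.0):
    """Yield URLs one at a time, sleeping between consecutive ones."""
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause so requests are spread out
        yield url

urls = ["https://example.com/about", "https://example.com/private/grades"]
for url in paced(allowed_urls(urls), delay_seconds=0.5):
    print("would fetch", url)  # a real scraper would call requests.get(url) here
```

A fixed delay is the simplest choice; the right value depends on the site, so treat the numbers here as placeholders.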

# Some Preliminaries on Web Pages

We already discussed using an HTTP GET request to access an API.

Visiting a website also uses a GET request. The response typically contains:

* HTML - page content
* CSS - page styling
* JS - page interactivity
* Images

# HyperText Markup Language (HTML)

This is the backbone framework for web pages. It makes the text on a webpage behave as it might in a word processor.

HTML uses __tags__ to define different parts of an HTML document.

\<html> \</html> is an HTML tag that tells a browser that everything between the tags is HTML.

Here is a link to the [elements of HTML](https://html.spec.whatwg.org/#semantics)

And [another reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)

# Common HTML tags

* __\<head> \</head>__ - defines the head of a document
* __\<body> \</body>__ - defines the body of a document
* __\<p> \</p>__ - defines a paragraph
* __\<a> \</a>__ - used for links with "href=_url_"

# Common HTML tags

* __\<div> \</div>__ - delineates a page division
* __\<table> \</table>__ - creates a table
* __\<form> \</form>__ - creates a form to take input
* __\<b> \</b>__ - bold text
* __\<i> \</i>__ - italicize text

# HTML class & id

HTML elements have the attributes __class__ and __id__, which give names to HTML elements.

* __class__ can be used by multiple elements, and single elements can have multiple classes.
* __id__ can only be used by a single element per page, and an element can only have a single id.

Think of __class__ and __id__ as metadata for HTML elements.
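To make the distinction concrete, here is a small sketch that parses a made-up HTML snippet and looks elements up by class and by id with Beautiful Soup's `find_all` and `find`. The tag contents, class names, and id are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two links share the class "button", one div has the id "menu".
html_doc = """
<div id="menu" class="nav main">
  <a class="button" href="/visit">Visit</a>
  <a class="button primary" href="/apply">Apply</a>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches any element whose list of classes includes "button"
buttons = soup.find_all("a", class_="button")
print([a.get_text() for a in buttons])

# an id picks out one element; its class attribute comes back as a list
menu = soup.find(id="menu")
print(menu.name, menu["class"])
```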

# HTML Relationship Structure

* __Parent__ - A tag that is hierarchically above another tag.

* __Child__ - A tag that is hierarchically below another tag.

* __Sibling__ - A tag that is at the same level as another tag and shares its parent.
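These relationships map directly onto attributes of a Beautiful Soup tag. The sketch below uses an invented three-item list (written without whitespace between tags, so every sibling is itself a tag) to show `parent`, `children`, `previous_sibling`, and `next_sibling`.

```python
from bs4 import BeautifulSoup

# Hypothetical list; no whitespace between tags, so siblings are tags rather than text.
html_doc = "<ul><li>First</li><li>Second</li><li>Third</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")

second = soup.find_all("li")[1]

print(second.parent.name)                                 # name of the parent tag
print([c.get_text() for c in second.parent.children])     # every child of that parent
print(second.previous_sibling.get_text())                 # sibling just before
print(second.next_sibling.get_text())                     # sibling just after
```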

# What is Beautiful Soup?

[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc) is a Python library for extracting data from HTML and XML files.

You import __BeautifulSoup__ from __bs4__

Let's start with [https://cofc.edu/](https://cofc.edu/)

In [1]:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://cofc.edu/")

In [2]:
page.status_code #great success!

200

In [3]:
page.content #very unreadable.

b'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8">\r\n<title>College of Charleston | Charleston, South Carolina</title>\r\n<meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description"/>\r\n\r\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\r\n<link rel="shortcut icon" href="https://cofc.edu/favicon.ico" />\r\n\r\n<!-- osano code -->\r\n<script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js"></script>\r\n\r\n<!-- Cloudflare Web Analytics -->\r\n<script defer src=\'https://static.cloudflareinsights.com/beacon.min.js\' data-cf-beacon=\'{"token": "7c75bfae08974cb69e475be17f0eda9b"}\'></script>\r\n<!-- End Cloudflare Web Analytics -->\r\n\r\n<!-- fontawseome kit code -->\r\n<script src="https://kit.fontawesome.com/1328657c58.js" crossorigin="anonymous"></script>\r\n\r\n\r\n<!-- Styles -->

In [4]:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify()) #spoiler, much more readable!

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   College of Charleston | Charleston, South Carolina
  </title>
  <meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description">
   <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
   <link href="https://cofc.edu/favicon.ico" rel="shortcut icon"/>
   <!-- osano code -->
   <script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js">
   </script>
   <!-- Cloudflare Web Analytics -->
   <script data-cf-beacon='{"token": "7c75bfae08974cb69e475be17f0eda9b"}' defer="" src="https://static.cloudflareinsights.com/beacon.min.js">
   </script>
   <!-- End Cloudflare Web Analytics -->
   <!-- fontawseome kit code -->
   <script crossorigin="anonymous" src="https://kit.fontawesome.com/1328657c58.js">
   </script>
   <!-- Styles -->
   <!--<link type

The BeautifulSoup object represents the parsed document as a whole.

In [None]:
#access the soup methods using TAB, which works for anything with methods
soup.

In [5]:
soup.title

<title>College of Charleston | Charleston, South Carolina</title>

In [6]:
#some tags
soup.head

<head>
<meta charset="utf-8"/>
<title>College of Charleston | Charleston, South Carolina</title>
<meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://cofc.edu/favicon.ico" rel="shortcut icon"/>
<!-- osano code -->
<script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js"></script>
<!-- Cloudflare Web Analytics -->
<script data-cf-beacon='{"token": "7c75bfae08974cb69e475be17f0eda9b"}' defer="" src="https://static.cloudflareinsights.com/beacon.min.js"></script>
<!-- End Cloudflare Web Analytics -->
<!-- fontawseome kit code -->
<script crossorigin="anonymous" src="https://kit.fontawesome.com/1328657c58.js"></script>
<!-- Styles -->
<!--<link type="text/css" rel="stylesheet" href="https://fast.fonts.net/cssapi/8d30cb26-2518-4e9e-83a1-90c

In [7]:
soup.body

<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe height="0" src="https://www.googletagmanager.com/ns.html?id=GTM-KFMB5N6" style="display:none;visibility:hidden" title="Google Tag Manager" width="0"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<ul id="access-nav">
<li><a href="#bd-content" tabindex="0">Skip to Main Content</a></li><!--changed tabindex to 0 - LWS 12/3/19-->
</ul>
<div id="pg">
<div id="hd" role="banner">
<div class="navbar navbar-inverse navbar-static-top" id="sec-nav" role="navigation">
<div class="navbar-inner">
<div class="container">
<button aria-label="Quick Links Top Menu" class="btn btn-navbar collapsed" data-target="#nav2" data-toggle="collapse" type="button"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
<div class="nav-collapse collapse" id="nav2" style="height: 0px;">
<ul class="nav pull-right">
<li><a href="https://myportal.cofc.edu">MyPortal</a></li>
<li><a href="https

In [8]:
soup.div

<div id="pg">
<div id="hd" role="banner">
<div class="navbar navbar-inverse navbar-static-top" id="sec-nav" role="navigation">
<div class="navbar-inner">
<div class="container">
<button aria-label="Quick Links Top Menu" class="btn btn-navbar collapsed" data-target="#nav2" data-toggle="collapse" type="button"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
<div class="nav-collapse collapse" id="nav2" style="height: 0px;">
<ul class="nav pull-right">
<li><a href="https://myportal.cofc.edu">MyPortal</a></li>
<li><a href="https://library.cofc.edu/">Library</a></li>
<li><a href="https://directory.cofc.edu">Directory</a></li>
<li><a href="https://cofc.edu/siteindex/">A-Z Index</a></li>
<li><a href="https://emergency.cofc.edu">Emergency Info</a></li>
<li><a class="button2" href="https://cofc.edu/visit/">Explore</a></li>
<li><a class="button2" href="https://cofc.edu/apply/">Apply</a></li>
<li><a class="button2" href="https://give.cofc.edu

In [9]:
# Let's find 'a', which is a link. Attribute access like soup.a returns only the first match.
soup.a

<a href="#bd-content" tabindex="0">Skip to Main Content</a>

In [10]:
#let's find all 'a' links
soup.find_all('a')

[<a href="#bd-content" tabindex="0">Skip to Main Content</a>,
 <a href="https://myportal.cofc.edu">MyPortal</a>,
 <a href="https://library.cofc.edu/">Library</a>,
 <a href="https://directory.cofc.edu">Directory</a>,
 <a href="https://cofc.edu/siteindex/">A-Z Index</a>,
 <a href="https://emergency.cofc.edu">Emergency Info</a>,
 <a class="button2" href="https://cofc.edu/visit/">Explore</a>,
 <a class="button2" href="https://cofc.edu/apply/">Apply</a>,
 <a class="button2" href="https://give.cofc.edu/donate">Give</a>,
 <a href="https://cofc.edu"><img alt="CofC Logo" height="72" src="https://cofc.edu/images/cofc-logo-2014d.png" width="265"/></a>,
 <a class="dropdown-toggle" href="https://cofc.edu/admission-and-financial-aid/">Admission and Financial Aid</a>,
 <a href="https://admissions.cofc.edu/explore/index.php">Freshmen</a>,
 <a href="https://admissions.cofc.edu/enroll/index.php">Admitted Students</a>,
 <a href="https://admissions.cofc.edu/applyingtothecollege/transfers/">Transfer Studen

In [11]:
#find_all returns a list-like ResultSet, so you can index it like a list
soup.find_all('a')[3]

<a href="https://directory.cofc.edu">Directory</a>

In [12]:
# we can find all based on an attribute
soup.find_all(type="text/css")

[<link href="https://cofc.edu/css/bootstrap/css/bootstrap.min.css" rel="stylesheet" type="text/css"/>,
 <link href="https://cofc.edu/css/style.css" rel="stylesheet" type="text/css">
 <!--<link type="text/css" rel="stylesheet" href="https://cofc.edu/css/style2.css?469447466" />-->
 <!-- dont use echo rand... -->
 <link href="https://cofc.edu/css/style2.css" rel="stylesheet" type="text/css"/>
 <link href="https://cofc.edu/scripts/prettyphoto/css/prettyPhoto.css" rel="stylesheet" type="text/css"/>
 <!--[if lt IE 9]>
     <script src="https://html5shim.googlecode.com/svn/trunk/html5.js"></script>
 <![endif]-->
 <!-- GA -->
 <script>
   (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
   (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
   m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
   })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
 
   ga('create', 'UA-25348783-2', 'au

In [13]:
# let's do something with page hierarchy
# children returns an iterator, so wrap it in list() to see the items
list(soup.children)

['html',
 '\n',
 <html lang="en">
 <head>
 <meta charset="utf-8"/>
 <title>College of Charleston | Charleston, South Carolina</title>
 <meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description">
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <link href="https://cofc.edu/favicon.ico" rel="shortcut icon"/>
 <!-- osano code -->
 <script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js"></script>
 <!-- Cloudflare Web Analytics -->
 <script data-cf-beacon='{"token": "7c75bfae08974cb69e475be17f0eda9b"}' defer="" src="https://static.cloudflareinsights.com/beacon.min.js"></script>
 <!-- End Cloudflare Web Analytics -->
 <!-- fontawseome kit code -->
 <script crossorigin="anonymous" src="https://kit.fontawesome.com/1328657c58.js"></script>
 <!-- Styles -->
 <!--<link type="text/css" rel="stylesheet" href="https://

In [14]:
# there are 3 children in the list
len(list(soup.children))

3

In [15]:
#they are beautiful soup objects
[type(item) for item in list(soup.children)]

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]

# Beautiful Soup Objects

* __Doctype__: contains info about document type

* __NavigableString__: a piece of text in the document

* __Tag__: an HTML tag, which can contain nested tags and text

In [16]:
#the tag object has the html tags
html = list(soup.children)[2]

#we can see that there are 7 children
len(list(html.children))

7

In [17]:
# item 1 includes the head
# item 5 includes the body
body = list(html.children)[5]
list(html.children)[1]

<head>
<meta charset="utf-8"/>
<title>College of Charleston | Charleston, South Carolina</title>
<meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description">
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="https://cofc.edu/favicon.ico" rel="shortcut icon"/>
<!-- osano code -->
<script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js"></script>
<!-- Cloudflare Web Analytics -->
<script data-cf-beacon='{"token": "7c75bfae08974cb69e475be17f0eda9b"}' defer="" src="https://static.cloudflareinsights.com/beacon.min.js"></script>
<!-- End Cloudflare Web Analytics -->
<!-- fontawseome kit code -->
<script crossorigin="anonymous" src="https://kit.fontawesome.com/1328657c58.js"></script>
<!-- Styles -->
<!--<link type="text/css" rel="stylesheet" href="https://fast.fonts.net/cssapi/8d30cb26-2518-4e9e-83a1-90c

In [18]:
#body child 9 has a paragraph, and you can navigate to it.
list(list(list(list(body.children)[9].children)[1].children)[7].children)[1].find('p').get_text()

'Explore the College'

In [19]:
# find goes straight to the first 'p' tag, with no manual navigation
soup.find("p").get_text()

'Explore the College'

In [20]:
# to get every 'p' tag instead of just the first, use find_all
list(soup.find_all('p'))

[<p class="feature-title">Explore the College</p>,
 <p>Schedule a <a href="https://admissions.cofc.edu/explorethecollege/campusvisits/">visit</a>. You'll get why this is a hot school. </p>,
 <p class="feature-title">Welcome to the "New" Charleston</p>,
 <p>A top 10 fastest-growing city for software and Internet technology, an emerging hub for aerospace, and a hotbed for healthcare and biosciences.</p>,
 <p class="feature-title">Make Your Mark</p>,
 <p><a href="https://www.youtube.com/watch?v=DQHxEDD5fww">Be curious</a>. <a href="https://youtu.be/HCcjSonsL7g?list=PL5C843165F0C4F195">Explore</a>. <a href="https://youtu.be/dQh_mlCfe_A">Question</a>. <a href="https://youtu.be/gl_dEph-naY">Challenge</a> the status quo. Try the unfamiliar as well as the tried and true  – and your academic experience will pay big dividends. </p>,
 <p class="feature-title">The Good Life</p>,
 <p><a href="https://youtu.be/ecee7tVvKY4">Take advantage of everything</a> the College has to offer. Use your imaginati

In [22]:
type(soup.find_all('p'))

bs4.element.ResultSet

In [23]:
#you can also search by the text itself, which is easier if you know what you're looking for
#(Beautiful Soup 4.4.0 and later name this argument string=)
gs = soup.find(text="Graduate School")

#and we can find the parent of the text
gs.parent

<a href="https://gradschool.cofc.edu/applying-to-graduate-school/">Graduate School</a>

In [24]:
#Let's look at the next sibling
print(gs.next_sibling)

None


In [25]:
#and the next element
gs.next_element

' '

In [26]:
#chain for the next element
gs.next_element.next_element

'\n'

In [27]:
#two chains
gs.next_element.next_element.next_element

<li> <a href="https://admissions.cofc.edu/applyingtothecollege/non-degreeprograms/">Non-degree Programs</a> </li>

In [28]:
#previous element
#here previous_element is the same tag as gs.parent
gs.previous_element 

<a href="https://gradschool.cofc.edu/applying-to-graduate-school/">Graduate School</a>
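The difference between `next_sibling` and `next_element` is easier to see on a small snippet. `next_sibling` stays at the same level under the same parent, while `next_element` moves to whatever was parsed next, regardless of level. The HTML below is a made-up fragment shaped like the navigation list above (the `href` values are invented), and it uses the `string=` argument, which is the name Beautiful Soup 4.4.0 and later use for `text=`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the navigation list; hrefs are placeholders.
html_doc = (
    '<ul><li> <a href="/grad">Graduate School</a> </li>\n'
    '<li> <a href="/nondegree">Non-degree Programs</a> </li></ul>'
)
soup = BeautifulSoup(html_doc, "html.parser")
gs = soup.find(string="Graduate School")

print(gs.next_sibling)        # None: the text is the only child of its <a> tag
print(repr(gs.next_element))  # the space that was parsed right after </a>

# The same node can be reached as the next sibling of the enclosing <a> tag.
print(gs.parent.next_sibling is gs.next_element)
```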

In [29]:
soup.find_all('p', class_='feature-title') #note the class_

[<p class="feature-title">Explore the College</p>,
 <p class="feature-title">Welcome to the "New" Charleston</p>,
 <p class="feature-title">Make Your Mark</p>,
 <p class="feature-title">The Good Life</p>,
 <p class="feature-title">Fan Favorites</p>,
 <p class="feature-title">Once a Cougar, always a Cougar.</p>]

In [30]:
soup.find_all(True) #finds all tags, but no text strings

[<html lang="en">
 <head>
 <meta charset="utf-8"/>
 <title>College of Charleston | Charleston, South Carolina</title>
 <meta content="The College of Charleston is a state-supported comprehensive university providing a high-quality education in the arts and sciences, education and business." name="description">
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <link href="https://cofc.edu/favicon.ico" rel="shortcut icon"/>
 <!-- osano code -->
 <script src="https://cmp.osano.com/AzyWE4SBbyt5AGMK/c73ab029-64e1-49d2-8a8e-300c9ef35cbc/osano.js"></script>
 <!-- Cloudflare Web Analytics -->
 <script data-cf-beacon='{"token": "7c75bfae08974cb69e475be17f0eda9b"}' defer="" src="https://static.cloudflareinsights.com/beacon.min.js"></script>
 <!-- End Cloudflare Web Analytics -->
 <!-- fontawseome kit code -->
 <script crossorigin="anonymous" src="https://kit.fontawesome.com/1328657c58.js"></script>
 <!-- Styles -->
 <!--<link type="text/css" rel="stylesheet" href="https://fast.fonts.net/c

# CSS Selectors

CSS selectors are patterns that identify which HTML elements a style applies to.

BeautifulSoup has a __select__ method to use them.

In [31]:
# for example, find paragraphs within divs
soup.select('div p')

[<p class="feature-title">Explore the College</p>,
 <p>Schedule a <a href="https://admissions.cofc.edu/explorethecollege/campusvisits/">visit</a>. You'll get why this is a hot school. </p>,
 <p class="feature-title">Welcome to the "New" Charleston</p>,
 <p>A top 10 fastest-growing city for software and Internet technology, an emerging hub for aerospace, and a hotbed for healthcare and biosciences.</p>,
 <p class="feature-title">Make Your Mark</p>,
 <p><a href="https://www.youtube.com/watch?v=DQHxEDD5fww">Be curious</a>. <a href="https://youtu.be/HCcjSonsL7g?list=PL5C843165F0C4F195">Explore</a>. <a href="https://youtu.be/dQh_mlCfe_A">Question</a>. <a href="https://youtu.be/gl_dEph-naY">Challenge</a> the status quo. Try the unfamiliar as well as the tried and true  – and your academic experience will pay big dividends. </p>,
 <p class="feature-title">The Good Life</p>,
 <p><a href="https://youtu.be/ecee7tVvKY4">Take advantage of everything</a> the College has to offer. Use your imaginati

In [32]:
# or find links within paragraphs within divs
soup.select("div p a")

[<a href="https://admissions.cofc.edu/explorethecollege/campusvisits/">visit</a>,
 <a href="https://www.youtube.com/watch?v=DQHxEDD5fww">Be curious</a>,
 <a href="https://youtu.be/HCcjSonsL7g?list=PL5C843165F0C4F195">Explore</a>,
 <a href="https://youtu.be/dQh_mlCfe_A">Question</a>,
 <a href="https://youtu.be/gl_dEph-naY">Challenge</a>,
 <a href="https://youtu.be/ecee7tVvKY4">Take advantage of everything</a>,
 <a href="https://youtu.be/3ODlxThaE3E">new experience</a>,
 <a href="https://youtu.be/sYmBhavUBMo">have fun</a>,
 <a href="https://today.cofc.edu/2022/09/14/cofc-podcast-artist-kirsten-stolle-explores-the-narrative-around-chemical-corporations/" target="_self">CofC Podcast: Artist Kirsten Stolle Explores the Narrative Around Chemical Corporations</a>,
 <a href="https://today.cofc.edu/2022/09/13/cofc-community-invited-to-celebrate-grand-opening-of-cougar-cutz-barber-shop/" target="_self">CofC Community Invited to Celebrate Grand Opening of Cougar Cutz Barber Shop</a>,
 <a href="ht

In [34]:
#find all paragraphs with the class "feature-title"
#note the .class syntax
soup.select("p.feature-title")

[<p class="feature-title">Explore the College</p>,
 <p class="feature-title">Welcome to the "New" Charleston</p>,
 <p class="feature-title">Make Your Mark</p>,
 <p class="feature-title">The Good Life</p>,
 <p class="feature-title">Fan Favorites</p>,
 <p class="feature-title">Once a Cougar, always a Cougar.</p>]

In [35]:
#find the divisions with id "prim-nav"
#note #id
soup.select("div#prim-nav")

[<div class="navbar yamm" id="prim-nav" role="navigation">
 <div class="navbar-inner">
 <button aria-label="mobile navigation menu" class="btn btn-navbar" data-target="#nav1" data-toggle="collapse" type="button"> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
 <div class="nav-collapse collapse" id="nav1">
 <ul class="nav">
 <!-- Admissions and Financial Aid Dropdown -->
 <li class="dropdown"> <a class="dropdown-toggle" href="https://cofc.edu/admission-and-financial-aid/">Admission and Financial Aid</a>
 <ul class="dropdown-menu">
 <div class="container">
 <div class="row-fluid">
 <ul class="span2 unstyled">
 <li class="nav-title">Admission</li>
 <li> <a href="https://admissions.cofc.edu/explore/index.php">Freshmen</a> </li>
 <li> <a href="https://admissions.cofc.edu/enroll/index.php">Admitted Students</a> </li>
 <li> <a href="https://admissions.cofc.edu/applyingtothecollege/transfers/">Transfer Students</a> </li>
 <li> <a href="ht

In [36]:
#let's scrape all the links
links = []
for link in soup.find_all('a'):
    href = link.get('href')  # None if the tag has no href attribute
    links.append(href)
    print(href)

#bd-content
https://myportal.cofc.edu
https://library.cofc.edu/
https://directory.cofc.edu
https://cofc.edu/siteindex/
https://emergency.cofc.edu
https://cofc.edu/visit/
https://cofc.edu/apply/
https://give.cofc.edu/donate
https://cofc.edu
https://cofc.edu/admission-and-financial-aid/
https://admissions.cofc.edu/explore/index.php
https://admissions.cofc.edu/enroll/index.php
https://admissions.cofc.edu/applyingtothecollege/transfers/
https://admissions.cofc.edu/applyingtothecollege/readmits/
https://admissions.cofc.edu/applyingtothecollege/international-students/
https://honorscollege.cofc.edu/admission/
https://cofc.edu/veteran-services
https://gradschool.cofc.edu/applying-to-graduate-school/
https://admissions.cofc.edu/applyingtothecollege/non-degreeprograms/
https://finaid.cofc.edu/financial-aid-information/cost-of-attendance/tuition-and-fees/
https://finaid.cofc.edu
https://finaid.cofc.edu/types-of-financial-aid/scholarships/
https://finaid.cofc.edu/financial-literacy/npc/
https://a

In [37]:
links #we're scraping now!

['#bd-content',
 'https://myportal.cofc.edu',
 'https://library.cofc.edu/',
 'https://directory.cofc.edu',
 'https://cofc.edu/siteindex/',
 'https://emergency.cofc.edu',
 'https://cofc.edu/visit/',
 'https://cofc.edu/apply/',
 'https://give.cofc.edu/donate',
 'https://cofc.edu',
 'https://cofc.edu/admission-and-financial-aid/',
 'https://admissions.cofc.edu/explore/index.php',
 'https://admissions.cofc.edu/enroll/index.php',
 'https://admissions.cofc.edu/applyingtothecollege/transfers/',
 'https://admissions.cofc.edu/applyingtothecollege/readmits/',
 'https://admissions.cofc.edu/applyingtothecollege/international-students/',
 'https://honorscollege.cofc.edu/admission/',
 'https://cofc.edu/veteran-services',
 'https://gradschool.cofc.edu/applying-to-graduate-school/',
 'https://admissions.cofc.edu/applyingtothecollege/non-degreeprograms/',
 'https://finaid.cofc.edu/financial-aid-information/cost-of-attendance/tuition-and-fees/',
 'https://finaid.cofc.edu',
 'https://finaid.cofc.edu/type

In [38]:
links[1:]

['https://myportal.cofc.edu',
 'https://library.cofc.edu/',
 'https://directory.cofc.edu',
 'https://cofc.edu/siteindex/',
 'https://emergency.cofc.edu',
 'https://cofc.edu/visit/',
 'https://cofc.edu/apply/',
 'https://give.cofc.edu/donate',
 'https://cofc.edu',
 'https://cofc.edu/admission-and-financial-aid/',
 'https://admissions.cofc.edu/explore/index.php',
 'https://admissions.cofc.edu/enroll/index.php',
 'https://admissions.cofc.edu/applyingtothecollege/transfers/',
 'https://admissions.cofc.edu/applyingtothecollege/readmits/',
 'https://admissions.cofc.edu/applyingtothecollege/international-students/',
 'https://honorscollege.cofc.edu/admission/',
 'https://cofc.edu/veteran-services',
 'https://gradschool.cofc.edu/applying-to-graduate-school/',
 'https://admissions.cofc.edu/applyingtothecollege/non-degreeprograms/',
 'https://finaid.cofc.edu/financial-aid-information/cost-of-attendance/tuition-and-fees/',
 'https://finaid.cofc.edu',
 'https://finaid.cofc.edu/types-of-financial-a

In [39]:
#how many links?
len(links)

335
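A raw list of `href` values like this one usually needs some cleanup before it is useful: it can contain in-page anchors such as `#bd-content`, `None` for tags without an `href`, relative paths, and duplicates. The helper below is one possible sketch using only the standard library; the name `clean_links` and the sample `/apply/` and `mailto:` entries are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def clean_links(hrefs, base_url):
    """Return absolute http(s) links in their original order, without duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # skip missing hrefs and in-page anchors
        absolute = urljoin(base_url, href)  # resolve relative paths against the page URL
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, and similar
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

sample = ["#bd-content", None, "https://cofc.edu/visit/", "/apply/",
          "https://cofc.edu/visit/", "mailto:someone@example.com"]
print(clean_links(sample, "https://cofc.edu/"))
```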

# There's a lot to Beautiful Soup!

I encourage you to play around with it!

Read the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Regular Expressions (RegEx)

A regular expression is a string of characters that forms a search pattern.

The search pattern can be used to find specific characters within text.

Python has a module __[re](https://docs.python.org/3/library/re.html)__ for regular expressions.

I reference [this tutorial](https://www.w3schools.com/python/python_regex.asp)

# RegEx Functions

* __findall__ - returns a list containing all matches

* __search__ - returns a match object if one exists

* __split__ - returns a list where the string is split on each match

* __sub__ - replaces one or many matches with a string
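Here is a compact sketch of the four functions side by side. The sample sentence is a shortened pair of lines from the Zen of Python, chosen only for illustration.

```python
import re

text = "Now is better than never. Although never is often better than *right* now."

# findall: every match, as a list of strings
print(re.findall("better", text))

# search: the first match only, as a match object (or None)
match = re.search("never", text)
print(match.group(), match.span())

# split: cut the string wherever the pattern matches (a period followed by a space)
parts = re.split(r"\. ", text)
print(parts)

# sub: replace matches; count limits how many are replaced
replaced = re.sub("better", "awesomer", text, count=1)
print(replaced)
```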

# RegEx Metacharacters

![Screen%20Shot%202022-09-13%20at%2011.00.08%20AM.png](attachment:Screen%20Shot%202022-09-13%20at%2011.00.08%20AM.png)
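A few of the metacharacters from Python's `re` syntax in action, as a minimal sketch. The sample line is taken from the Zen of Python; the particular patterns are just illustrative choices.

```python
import re

line = "There should be one-- and preferably only one --obvious way to do it."

any_char = re.findall(r"on.", line)         # .  matches any single character
starts = re.search(r"^There", line)         # ^  anchors at the start of the string
ends = re.search(r"it\.$", line)            # \. is a literal dot, $ anchors at the end
dashes = re.findall(r"-{2}", line)          # {2} means exactly two repetitions
either = re.findall(r"one|way", line)       # |  means either/or

print(any_char, bool(starts), bool(ends), dashes, either)
```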

# RegEx Special Sequences

![Screen%20Shot%202022-09-13%20at%2011.00.24%20AM.png](attachment:Screen%20Shot%202022-09-13%20at%2011.00.24%20AM.png)
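Special sequences are backslash shortcuts for common character classes and positions. The sketch below tries four of them (`\d`, `\w`, `\s`, `\b`) on an invented sample string.

```python
import re

sample = "DATA 510 meets at 11:00 AM"  # hypothetical sample text

digits = re.findall(r"\d+", sample)          # \d matches a digit
words = re.findall(r"\w+", sample)           # \w matches a letter, digit, or underscore
pieces = re.split(r"\s+", sample)            # \s matches whitespace
caps = re.findall(r"\b[A-Z]+\b", sample)     # \b matches a word boundary

print(digits, words, pieces, caps)
```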

# RegEx Sets

![Screen%20Shot%202022-09-13%20at%2011.00.41%20AM.png](attachment:Screen%20Shot%202022-09-13%20at%2011.00.41%20AM.png)
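A set is a group of characters inside square brackets, and it matches any one of them. The short sketch below shows ranges, an explicit list, a negated set, and the fact that most metacharacters lose their special meaning inside a set. Both sample strings are invented.

```python
import re

sample = "Room 42b, Floor 3"  # hypothetical sample text

numbers = re.findall(r"[0-9]", sample)            # a range of digits
capitals = re.findall(r"[A-Z]", sample)           # a range of uppercase letters
vowels = re.findall(r"[aeiou]", sample)           # any one of the listed characters
other = re.findall(r"[^a-zA-Z0-9 ]", sample)      # ^ inside a set negates it
literal = re.findall(r"[+.]", "1+1=2.0")          # + and . are literal inside a set

print(numbers, capitals, vowels, other, literal)
```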

In [40]:
import re

# the with block closes the file automatically once it has been read
with open(r'/Users/brandanscully/Documents/GitHub/DATA_510/zenofpython.txt','r') as fileObject:
    data = fileObject.read()

In [47]:
print(data)

The Zen of Python
by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [45]:
# findall returns a list of matches
better = re.findall('better',data)
better

['better',
 'better',
 'better',
 'better',
 'better',
 'better',
 'better',
 'better']

In [48]:
# an empty list if no matches
re.findall('10',data)

[]

In [50]:
re.findall("--",data)

['--', '--', '--']

In [53]:
#search returns a match object
x = re.search("--",data)
x

<re.Match object; span=(474, 476), match='--'>

In [54]:
#span tells us where in the string it occurs
print(x.start(), x.end())

474 476


In [55]:
# string shows us what was passed into the function
x.string

"The Zen of Python\nby Tim Peters\n\nBeautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\nFlat is better than nested.\nSparse is better than dense.\nReadability counts.\nSpecial cases aren't special enough to break the rules.\nAlthough practicality beats purity.\nErrors should never pass silently.\nUnless explicitly silenced.\nIn the face of ambiguity, refuse the temptation to guess.\nThere should be one-- and preferably only one --obvious way to do it.\nAlthough that way may not be obvious at first unless you're Dutch.\nNow is better than never.\nAlthough never is often better than *right* now.\nIf the implementation is hard to explain, it's a bad idea.\nIf the implementation is easy to explain, it may be a good idea.\nNamespaces are one honking great idea -- let's do more of those!"

In [56]:
# group shows us what was matched
x.group()

'--'

In [57]:
#we get None when no match is found
print(re.search("10",data))

None


In [58]:
#split returns a list split at the match
re.split(r"\n",data)

['The Zen of Python',
 'by Tim Peters',
 '',
 'Beautiful is better than ugly.',
 'Explicit is better than implicit.',
 'Simple is better than complex.',
 'Complex is better than complicated.',
 'Flat is better than nested.',
 'Sparse is better than dense.',
 'Readability counts.',
 "Special cases aren't special enough to break the rules.",
 'Although practicality beats purity.',
 'Errors should never pass silently.',
 'Unless explicitly silenced.',
 'In the face of ambiguity, refuse the temptation to guess.',
 'There should be one-- and preferably only one --obvious way to do it.',
 "Although that way may not be obvious at first unless you're Dutch.",
 'Now is better than never.',
 'Although never is often better than *right* now.',
 "If the implementation is hard to explain, it's a bad idea.",
 'If the implementation is easy to explain, it may be a good idea.',
 "Namespaces are one honking great idea -- let's do more of those!"]

In [61]:
# you can pass a maximum number of splits to split
re.split(r"\n", data, maxsplit=2)

['The Zen of Python',
 'by Tim Peters',
 "\nBeautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\nFlat is better than nested.\nSparse is better than dense.\nReadability counts.\nSpecial cases aren't special enough to break the rules.\nAlthough practicality beats purity.\nErrors should never pass silently.\nUnless explicitly silenced.\nIn the face of ambiguity, refuse the temptation to guess.\nThere should be one-- and preferably only one --obvious way to do it.\nAlthough that way may not be obvious at first unless you're Dutch.\nNow is better than never.\nAlthough never is often better than *right* now.\nIf the implementation is hard to explain, it's a bad idea.\nIf the implementation is easy to explain, it may be a good idea.\nNamespaces are one honking great idea -- let's do more of those!"]

In [62]:
# sub lets you replace parts of the string
print(re.sub('better','awesomer',data))

The Zen of Python
by Tim Peters

Beautiful is awesomer than ugly.
Explicit is awesomer than implicit.
Simple is awesomer than complex.
Complex is awesomer than complicated.
Flat is awesomer than nested.
Sparse is awesomer than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is awesomer than never.
Although never is often awesomer than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [66]:
# you can pass a maximum number of replacements to sub
print(re.sub('better','awesomer',data,count=3))

The Zen of Python
by Tim Peters

Beautiful is awesomer than ugly.
Explicit is awesomer than implicit.
Simple is awesomer than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [67]:
#we can find words by ignoring case
re.findall(r'complex', data, flags=re.IGNORECASE)

['complex', 'Complex']

In [89]:
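# $ anchors the match at the end of the string; the text ends with "!",
# so [a-z]* can only match the empty string there
re.search('[a-z]*$', data).group()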

''

In [80]:
# ^ anchors the match at the beginning of the string
# [a-z] matches one lowercase letter
# * repeats the preceding item zero or more times
email = "scullybm@cofc.edu"

print(re.search(r'^[a-z]*',email).group())

scullybm


In [90]:
email = "scullybm@cofc.edu"
# match '@', lowercase letters, a literal dot ([.]), then more lowercase letters
re.search(r"@[a-z]*[.][a-z]*",email)

<re.Match object; span=(8, 17), match='@cofc.edu'>
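Capture groups let a single pattern pull out both halves of the address at once. The pattern below is a deliberately simple sketch, not a full email validator: it assumes lowercase letters, digits, and a few punctuation characters, and the variable names are illustrative.

```python
import re

email = "scullybm@cofc.edu"

# ( ... ) captures a group, (?: ... ) groups without capturing, \. is a literal dot.
# fullmatch only succeeds if the whole string fits the pattern.
pattern = r"([a-z0-9._-]+)@([a-z0-9-]+(?:\.[a-z0-9-]+)+)"

match = re.fullmatch(pattern, email)
if match:
    user, domain = match.groups()
    print(user, domain)
```

Using `fullmatch` rather than `search` means a string with extra text around the address is rejected instead of partially matched.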

# Some Inspiration

* [How a Math Genius Hacked OkCupid to Find True Love](https://www.wired.com/2014/01/how-to-hack-okcupid/)
* [Automated Apartment Search Bot](https://www.dataquest.io/blog/apartment-finding-slackbot/)

# Handy References

* [Mining the Social Web](https://www.webpages.uidaho.edu/~stevel/504/mining-the-social-web-2nd-edition.pdf)
* [Scraping Cheat Sheet](https://blog.hartleybrody.com/web-scraping-cheat-sheet/)