
## Load in the necessary libraries

In [4]:
import requests
from bs4 import BeautifulSoup as bs #pip install beautifulsoup4

## Load sample webpage content



In [7]:
# Load the sample webpage
sample_url = 'https://keithgalli.github.io/web-scraping/example.html'
r = requests.get(sample_url)

# Convert to a beautiful soup object
soup = bs(r.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one

In [None]:
#print(r.content)
#print(soup)
print(soup.prettify())

## Using Beautiful Soup

### find() & find_all()

In [19]:
first_header = soup.find("h2")
print(first_header)

<h2>A Header</h2>


In [23]:
headers = soup.find_all("h2")
print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]


In [24]:
# Pass in a list of elements to look for
first_header = soup.find(["h1", "h2"])
print(first_header)

<h1>HTML Webpage</h1>


In [27]:
headers = soup.find_all(["h1", "h2"])
print(headers)

[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]
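The cells above show find() returning the first match and find_all() returning every match. As a minimal sketch of what each returns when nothing matches, the snippet below uses a short inline HTML string (an assumption standing in for the sample page, so it runs offline) with Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Inline stand-in for the sample page (assumed markup, not fetched from the web)
html = "<body><h1>HTML Webpage</h1><h2>A Header</h2><h2>Another header</h2></body>"
demo = BeautifulSoup(html, "html.parser")

missing_one = demo.find("h3")              # no <h3> exists, so find() gives None
missing_all = demo.find_all("h3")          # no matches, so find_all() gives an empty list
first_only = demo.find_all("h2", limit=1)  # limit caps the number of results

print(missing_one)
print(missing_all)
print(first_only)
```

Because find() can return None, it is usually worth checking the result before calling methods on it.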


In [29]:
# Pass in attributes to the find() & find_all()
paragraph = soup.find_all("p")
print(paragraph)

paragraph2 = soup.find_all("p", attrs={"id":"paragraph-id"})
print(paragraph2)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<p id="paragraph-id"><b>Some bold text</b></p>]
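The attrs dictionary used above also has a keyword-argument form. The sketch below, run against an inline HTML string (assumed markup, including a made-up class name), suggests the two forms select the same elements and shows the class_ spelling that is needed because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Assumed markup for illustration; the "intro note" classes are made up
html = '<p class="intro note">First</p><p id="paragraph-id"><b>Some bold text</b></p>'
demo = BeautifulSoup(html, "html.parser")

by_attrs = demo.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = demo.find_all("p", id="paragraph-id")  # keyword form of the same filter
by_class = demo.find_all("p", class_="intro")       # matches any one of the tag's classes

print(by_attrs)
print(by_keyword)
print(by_class)
```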


In [35]:
# Nesting find() & find_all() calls
body = soup.find("body")
print(body)
print("________________________")
div = body.find("div")
print(div)
print("________________________")
header = div.find("h1")
print(header)

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
________________________
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
________________________
<h1>HTML Webpage</h1>


In [41]:
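# Search for specific strings in find() & find_all()
import re  # regular expressions allow partial-text matches
# A plain string must equal the tag's entire text, so "Some" matches nothing
paragraphs = soup.find_all("p", string="Some")  # string= replaces the deprecated text= argument
print(paragraphs)

# A compiled regex matches any tag whose text contains the pattern
paragraphs2 = soup.find_all("p", string=re.compile("Some"))
print(paragraphs2)

headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
print(headers)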

[]
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<h2>A Header</h2>, <h2>Another header</h2>]
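The empty list in the first result above comes from exact matching: a plain string filter has to equal the whole text of the tag. A small sketch on inline HTML (assumed markup mirroring the two paragraphs above) contrasts a partial string, a complete string, and a regex.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup mirroring the two paragraphs printed above
html = "<p><i>Some italicized text</i></p><p id='paragraph-id'><b>Some bold text</b></p>"
demo = BeautifulSoup(html, "html.parser")

partial = demo.find_all("p", string="Some")                # not the whole text: no hits
exact = demo.find_all("p", string="Some bold text")        # equals the whole text: one hit
pattern = demo.find_all("p", string=re.compile(r"^Some"))  # regex search: both paragraphs

print(partial)
print(exact)
print(pattern)
```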


### Select (CSS selector)
Useful link: [https://www.w3schools.com/cssref/css_selectors.asp](https://www.w3schools.com/cssref/css_selectors.asp)

In [None]:
print(soup.body) #simple shorthand
print("_______________________")
print(soup.body.prettify())

In [42]:
content = soup.select("p")
print(content)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


In [46]:
content2 = soup.select("div p")
print(content2)

[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]


In [48]:
paragraphs = soup.select("h2 ~ p") # get every p sibling that comes after an h2 (use "h2 + p" for only the p directly after)
print(paragraphs)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
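On the sample page every p happens to sit right after an h2, so the sibling combinators look alike there. The sketch below uses a different inline snippet (assumed markup with two paragraphs after one header) where the general sibling combinator ~ and the adjacent sibling combinator + give different results.

```python
from bs4 import BeautifulSoup

# Assumed markup: one header followed by two paragraphs
html = "<body><h2>Title</h2><p>first</p><p>second</p></body>"
demo = BeautifulSoup(html, "html.parser")

general = [p.get_text() for p in demo.select("h2 ~ p")]   # every later sibling <p>
adjacent = [p.get_text() for p in demo.select("h2 + p")]  # only the <p> right after the <h2>

print(general)
print(adjacent)
```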


In [50]:
bold_text = soup.select("p#paragraph-id b") # search for b nested under the p with id "paragraph-id"
print(bold_text)

[<b>Some bold text</b>]


In [53]:
paragraphs = soup.select("body > p")
print(paragraphs)

for paragraph in paragraphs:
  para2 = paragraph.select("i")
  print(para2)


[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [54]:
# Grab elements with a specific attribute value
soup.select("[align=middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]

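Difference between find/find_all and select: find() and find_all() filter by tag name, attributes, or text, while select() takes a CSS selector. Select is more helpful when you are querying for a specific path through nested elements.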
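To make the comparison concrete, here is a minimal sketch on an inline HTML string (assumed markup, loosely based on the sample page) that reaches the same paragraph once by chaining find calls and once with a single CSS selector.

```python
from bs4 import BeautifulSoup

# Assumed markup, loosely based on the sample page
html = '<body><div align="middle"><h1>HTML Webpage</h1><p>Inside div</p></div><p>Outside div</p></body>'
demo = BeautifulSoup(html, "html.parser")

via_find = demo.find("div").find_all("p")  # chain calls to walk down the tree
via_select = demo.select("div p")          # one CSS selector describes the same path
first_match = demo.select_one("div p")     # select_one returns a single element or None

print(via_find)
print(via_select)
print(first_match)
```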

### Get different properties of HTML

In [55]:
header = soup.find("h2")
print(header)
print(header.string)

<h2>A Header</h2>
A Header


In [58]:
div = soup.find("div")
print(div.prettify())
print(div.string) # returns None because the div has more than one child, so there is no single string to return
print(div.get_text()) # use this to get all the text inside the element, including nested tags


<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>

None

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
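The output of get_text() above keeps the newlines between nested elements. As a hedged sketch on inline HTML (assumed markup with a placeholder example.com link), the separator and strip arguments of get_text(), and the stripped_strings generator, give tidier text.

```python
from bs4 import BeautifulSoup

# Assumed markup; example.com is a placeholder link
html = '<div><h1>HTML Webpage</h1><p>Link: <a href="https://example.com">example</a></p></div>'
div = BeautifulSoup(html, "html.parser").div

single = div.string                     # None: the div has several children
joined = div.get_text(" ", strip=True)  # strip each piece, then join with one space
pieces = list(div.stripped_strings)     # the same pieces as a list

print(single)
print(joined)
print(pieces)
```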



In [62]:
# Get a specific property from an element
link = soup.find("a")
print(link)
print(link["href"])

paragraphs = soup.select("p#paragraph-id")
print(paragraphs)
print(paragraphs[0]["id"])

<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>
https://keithgalli.github.io/web-scraping/webpage.html
[<p id="paragraph-id"><b>Some bold text</b></p>]
paragraph-id
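Indexing a tag with square brackets raises KeyError when the attribute is absent. The sketch below, on inline HTML (assumed markup with a placeholder example.com link), shows the get() and has_attr() alternatives that avoid the exception.

```python
from bs4 import BeautifulSoup

# Assumed markup: one link with an href and one without
html = '<a href="https://example.com">with href</a><a>without href</a>'
links = BeautifulSoup(html, "html.parser").find_all("a")

hrefs = [link.get("href") for link in links]           # None when the attribute is missing
fallbacks = [link.get("href", "#") for link in links]  # optional default value
has_href = [link.has_attr("href") for link in links]   # explicit presence check

print(hrefs)
print(fallbacks)
print(has_href)
```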


### Code Navigation


In [65]:
# Path Syntax
print(soup.body.div)
print(soup.body.div.h1.string)

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
HTML Webpage


In [69]:
# Know the terms: Parent, Sibling, Child
print(soup.body.find("div"))
print(soup.body.find("div").find_next_siblings())

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
[<h2>A Header</h2>, <p><i>Some italicized text</i></p>, <h2>Another header</h2>, <p id="paragraph-id"><b>Some bold text</b></p>]
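The parent, sibling, and child terms can be tried out on a small tree. This is a minimal sketch on inline HTML (assumed markup, written without whitespace between tags so that the siblings are tags rather than newline strings).

```python
from bs4 import BeautifulSoup

# Assumed markup; no whitespace between tags, so siblings are tags, not newline strings
html = "<body><div><h1>HTML Webpage</h1></div><h2>A Header</h2><p>Text</p></body>"
demo = BeautifulSoup(html, "html.parser")

div = demo.body.div
parent_name = div.parent.name                         # element that contains the div
child_names = [child.name for child in div.children]  # direct children of the div
next_tag = div.find_next_sibling().name               # first sibling tag after the div
prev_tag = demo.p.find_previous_sibling().name        # sibling tag just before the <p>

print(parent_name, child_names, next_tag, prev_tag)
```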


## Practices

From [https://keithgalli.github.io/web-scraping/webpage.html](https://keithgalli.github.io/web-scraping/webpage.html)

### Load the Webpage

In [5]:
# Load the sample webpage
sample_url = 'https://keithgalli.github.io/web-scraping/webpage.html'
r = requests.get(sample_url)

# Convert to a beautiful soup object
webpage = bs(r.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
print(webpage)

<html><head>
<title>Keith Galli's Page</title>
<style>
  table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
</style>
</head>
<body>
<h1>Welcome to my page!</h1>
<img src="./images/selfie1.jpg" width="300px"/>
<h2>About me</h2>
<p>Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!</p>
<p>Here is a link to my channel: <a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a></p>
<p>I grew up in the great s

In [None]:
# Take a look at the html
print(webpage.prettify())

### 1. Grab all of the social links from the webpage
(done in 3 different ways)


In [7]:
# Method 1
links = webpage.select("ul.socials a")
print(links)
actual_links = [link['href'] for link in links]
print(actual_links)

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>, <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>, <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>, <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']


In [10]:
# Method 2
ulist = webpage.find("ul", attrs={"class":"socials"})
links = ulist.find_all("a")
print(links)

actual_links = [link['href'] for link in links]
print(actual_links)

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>, <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>, <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>, <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']


In [14]:
# Method 3
links = webpage.select("li.social a")
print(links)

actual_links = [link['href'] for link in links]
print(actual_links)

[<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>, <a href="https://twitter.com/keithgalli">https://twitter.com/keithgalli</a>, <a href="https://www.linkedin.com/in/keithgalli/">https://www.linkedin.com/in/keithgalli/</a>, <a href="https://www.tiktok.com/@keithgalli">https://www.tiktok.com/@keithgalli</a>]
['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']


### 2. Scrape a Table in the webpage

In [None]:
# For reference: general pattern for scraping a table into a DataFrame with BeautifulSoup
'''
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    rows.append(row)
pd.DataFrame(rows, columns=["A", "B", ...])
'''

In [25]:
import pandas as pd

table = webpage.select("table.hockey-stats")[0]
columns = table.find("thead").find_all("th")
column_names = [c.string for c in columns]

# Add into a pandas DataFrame
rows = []
table_rows = table.find("tbody").find_all("tr")
for tr in table_rows:
    cells = tr.find_all('td')
    # Use a separate name for each cell instead of reusing tr inside the comprehension
    row = [cell.get_text().strip() for cell in cells]
    rows.append(row)
df = pd.DataFrame(rows, columns=column_names)
df

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
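A variation on the table loop pairs each cell with its column name. The sketch below runs on a small inline table (assumed markup that borrows a few columns and values from the output above) and builds one dictionary per row. It reads headers with get_text(strip=True), which still returns text when a header cell contains nested tags.

```python
from bs4 import BeautifulSoup

# Small inline table borrowing a few columns from the output above (assumed markup)
html = """
<table class="hockey-stats">
  <thead><tr><th>S</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td> MIT (Mass. Inst. of Tech.) </td><td>17</td></tr>
    <tr><td>2017-18</td><td>Did not play</td><td></td></tr>
  </tbody>
</table>
"""
table = BeautifulSoup(html, "html.parser").select_one("table.hockey-stats")

headers = [th.get_text(strip=True) for th in table.select("thead th")]
records = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))  # pair each cell with its column name

print(records)
```

A list of dictionaries like records can also be passed to pd.DataFrame, which takes the column names from the keys.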


### 3. Grab all the Fun Facts with the word "is"

In [16]:
import re
facts = webpage.select("ul.fun-facts li")
print(facts)

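# Find the first text node containing "is" in each fact (None when there is no match)
matches = [fact.find(string=re.compile("is")) for fact in facts]
# Drop the None results and read the full text of the element that holds each match
facts_with_is = [match.find_parent().get_text() for match in matches if match]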
print(facts_with_is)

[<li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>, <li>Middle name is Ronald</li>, <li>Never had been on a plane until college</li>, <li>Dunkin Donuts coffee is better than Starbucks</li>, <li>A favorite book series of mine is <i>Ender's Game</i></li>, <li>Current video game of choice is <i>Rocket League</i></li>, <li>The band that I've seen the most times live is the <i>Zac Brown Band</i></li>]
['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks', "A favorite book series of mine is Ender's Game", 'Current video game of choice is Rocket League', "The band that I've seen the most times live is the Zac Brown Band"]
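The pattern "is" also matches inside longer words such as "This" or "list". As a hedged alternative, the sketch below filters on the whole word with a word-boundary regex. It runs on an inline list (assumed markup; the "This list was short" item is made up to show the difference).

```python
import re
from bs4 import BeautifulSoup

# Assumed markup; the third item is invented to show the substring pitfall
html = """
<ul class="fun-facts">
  <li>Middle name is Ronald</li>
  <li>Never had been on a plane until college</li>
  <li>This list was short</li>
  <li>A favorite book series of mine is <i>Ender's Game</i></li>
</ul>
"""
facts = BeautifulSoup(html, "html.parser").select("ul.fun-facts li")

word_is = re.compile(r"\bis\b")  # \b marks a word boundary, so "This" and "list" do not count
facts_with_is = [li.get_text() for li in facts if word_is.search(li.get_text())]

print(facts_with_is)
```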


### 4. Download an image

In [20]:
import os     
print(os.getcwd())
print( os.listdir() )

/content
['.config', 'sample_data']


In [23]:
# To download to Local
from google.colab import files

with open('example.txt', 'w') as f:
  f.write('some content')

files.download('example.txt')


In [24]:
# To download to googledrive
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [25]:
with open('/content/drive/My Drive/Colab Notebooks/assets/foo.txt', 'w') as f:
  f.write('Hello Google Drive!')
!cat /content/drive/My\ Drive/Colab\ Notebooks/assets/foo.txt

Hello Google Drive!


In [28]:
import requests
from bs4 import BeautifulSoup as bs  # alias must be bs, since the code below calls bs(...)

# Load webpage content
url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url+"webpage.html")

webpage = bs(r.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one

In [36]:
images = webpage.select("div.row div.column img")
image_url = images[0]['src'] # use only the first image for demo purposes
print(image_url)

full_url = url + image_url

# Download - either local or gdrive
img_data = requests.get(full_url).content
with open('sample_img.jpg', 'wb') as handler:
  handler.write(img_data)
from google.colab import files
files.download('sample_img.jpg')

images/italy/lake_como.jpg
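The src printed above is relative, and joining it to the base URL by string concatenation only works when the base ends in a slash and the path has no leading ./ or /. A minimal sketch using urljoin from the standard library resolves relative paths against the page URL instead. The third URL is a placeholder added for illustration.

```python
from urllib.parse import urljoin

page_url = "https://keithgalli.github.io/web-scraping/webpage.html"

# urljoin resolves a relative src against the page it was found on
lake_url = urljoin(page_url, "images/italy/lake_como.jpg")
selfie_url = urljoin(page_url, "./images/selfie1.jpg")
absolute_url = urljoin(page_url, "https://example.com/other.jpg")  # absolute URLs pass through

print(lake_url)
print(selfie_url)
print(absolute_url)
```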



### 5. Custom Scraping

In [49]:
file_links = webpage.select("div.block a")  # named file_links so it does not shadow the google.colab files module
relative_files = [f['href'] for f in file_links]
#print(relative_files)

url = "https://keithgalli.github.io/web-scraping/"
for f in relative_files:
  full_url = url + f
  page = requests.get(full_url)
  bs_page = bs(page.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
  #print(bs_page.body.prettify())
  secret_word_element = bs_page.find("p", attrs={"id":"secret-word"})
  secret_word = secret_word_element.string
  print(secret_word)
  

Make
sure
to
smash
that
like
button
and
subscribe
!!!
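The loop above assumes every linked page contains a p with the id secret-word. If one did not, find() would return None and the .string access would raise AttributeError. The sketch below shows one way to guard against that. The helper name extract_secret_word and the inline pages are hypothetical stand-ins, so nothing is fetched from the web.

```python
from bs4 import BeautifulSoup

def extract_secret_word(html):
    """Return the text of <p id="secret-word">, or None if the page has no such element."""
    page = BeautifulSoup(html, "html.parser")
    element = page.find("p", id="secret-word")
    if element is None:
        return None
    return element.get_text(strip=True)

# Inline stand-ins for fetched pages (assumed markup)
pages = [
    '<body><p id="secret-word">Make</p></body>',
    "<body><p>No secret here</p></body>",
]
words = [extract_secret_word(html) for html in pages]
print(words)
```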


Credits: Keith Galli