# Introduction to Web Scraping

To begin, we will examine the Reddit page for Machine Learning (r/MachineLearning). Our goal is to scrape basic information about each post.

![](images/reddit.png)

In [13]:
url = 'https://www.reddit.com/r/MachineLearning/'

In [14]:
import requests

response = requests.get(url)

In [17]:
response.text

'\n<!doctype html>\n<html>\n  <head>\n    <title>Too Many Requests</title>\n    <style>\n      body {\n          font: small verdana, arial, helvetica, sans-serif;\n          width: 600px;\n          margin: 0 auto;\n      }\n\n      h1 {\n          height: 40px;\n          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;\n      }\n    </style>\n  </head>\n  <body>\n    <h1>whoa there, pardner!</h1>\n    \n\n\n<p>we\'re sorry, but you appear to be a bot and we\'ve seen too many requests\nfrom you lately. we enforce a hard speed limit on requests that appear to come\nfrom bots to prevent abuse.</p>\n\n<p>if you are not a bot but are spoofing one via your browser\'s user agent\nstring: please change your user agent string to avoid seeing this message\nagain.</p>\n\n<p>please wait 6 second(s) and try again.</p>\n\n    <p>as a reminder to developers, we recommend that clients make no\n    more than <a href="http://github.com/reddit/reddi

In [16]:
requests.get(url)

<Response [429]>
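
The response body above asks clients to identify themselves with a distinct user agent and to wait before retrying. A minimal sketch of that idea follows, assuming a hypothetical helper `get_with_backoff` and a made-up `User-Agent` string. Whether Reddit sends a `Retry-After` header is also an assumption, so the helper falls back to a base delay that doubles on each attempt. The demo uses a fake fetcher so that it runs offline.

```python
import time

# Hypothetical header value; replace it with something that describes your script.
HEADERS = {'User-Agent': 'intro-scraping-notebook/0.1 (contact: you@example.com)'}


def get_with_backoff(fetch, url, max_tries=3, base_delay=6.0, sleep=time.sleep):
    """Call fetch(url) until the status is not 429 or max_tries is reached."""
    response = None
    for attempt in range(max_tries):
        response = fetch(url)
        if response.status_code != 429:
            break
        if attempt == max_tries - 1:
            break  # out of tries; hand back the last 429 response
        # Use the server's hint if it is a plain number of seconds,
        # otherwise double the base delay on every attempt.
        try:
            delay = float(response.headers.get('Retry-After', ''))
        except ValueError:
            delay = base_delay * 2 ** attempt
        sleep(delay)
    return response


class FakeResponse:
    """Stand-in for requests.Response so the demo needs no network."""
    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


statuses = iter([429, 429, 200])
waits = []
result = get_with_backoff(lambda u: FakeResponse(next(statuses)),
                          'https://www.reddit.com/r/MachineLearning/',
                          sleep=waits.append)
print(result.status_code, waits)
```

With `requests`, the fetcher could be `lambda u: requests.get(u, headers=HEADERS)`, leaving `sleep` at its default so the pauses are real.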

In [1]:
%%HTML
<h1>This is a header</h1>
<p class = 'super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here. </p>
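
The same kind of snippet can be handed to BeautifulSoup to see how tags, attributes, and nested text are exposed. A small sketch, using an HTML string modeled on the cell above (the name `demo` is only chosen to avoid clashing with `soup` later in the notebook):

```python
from bs4 import BeautifulSoup

html = """
<h1>This is a header</h1>
<p class="super-paragraph">This would be a paragraph. <strong>Strong Words</strong> here. </p>
"""

demo = BeautifulSoup(html, 'html.parser')

header = demo.find('h1')                                  # first <h1> tag
paragraph = demo.find('p', {'class': 'super-paragraph'})  # match on the class attribute

print(header.name, '|', header.text)
print(paragraph['class'])             # class is multi-valued, so this is a list
print(paragraph.find('strong').text)  # text of the nested <strong> tag
```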


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [3]:
response = requests.get(url)

In [4]:
response

<Response [200]>

In [5]:
response.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_21_Jump_Street_episodes","wgTitle":"List of 21 Jump Street episodes","wgCurRevisionId":844038329,"wgRevisionId":844038329,"wgArticleId":35403829,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2012","All articles needing additional references","21 Jump Street","Lists of American crime television series episodes"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparat

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')

In [7]:
soup.find('h2')

<h2>Contents</h2>

In [8]:
all_h2 = soup.find_all('h2')

In [9]:
for header in all_h2[2:7]:
    print(header.text)

Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]
Season 4 (1989-90)[edit]
Season 5 (1990-91)[edit]


In [10]:
table_1 = soup.find('table', {'class': 'wikitable plainrowheaders'})

In [11]:
season_1_titles = table_1.find_all('td', {'class': 'summary'})

In [12]:
for title in season_1_titles:
    print(title.text)

"Pilot"
"America, What a Town"
"Don't Pet the Teacher"
"My Future's So Bright, I Gotta Wear Shades"
"The Worst Night of Your Life"
"Gotta Finish the Riff"
"Bad Influence"
"Blindsided"
"Next Generation"
"Low and Away""Running on Ice"
"16 Blown to 35"
"Mean Streets and Pastel Houses"
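
In the output above, `"Low and Away""Running on Ice"` appears on one line because a single cell holds two titles and `.text` joins its strings with nothing in between. A sketch of one way to keep them apart, assuming the two titles are separated by a `<br/>` tag (the markup below is a simplified stand-in, not copied from the page):

```python
from bs4 import BeautifulSoup

cell_html = '<td class="summary">"Low and Away"<br/>"Running on Ice"</td>'
cell = BeautifulSoup(cell_html, 'html.parser').find('td')

# .text concatenates every string inside the tag with no separator
fused = cell.text

# get_text can insert a separator between the strings it joins
separated = cell.get_text(separator=' / ', strip=True)

# stripped_strings yields each piece on its own; drop the surrounding quotes
titles_in_cell = [s.strip('"') for s in cell.stripped_strings]

print(fused)
print(separated)
print(titles_in_cell)
```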


In [32]:
soup.find('p')

<p><i><a href="/wiki/21_Jump_Street" title="21 Jump Street">21 Jump Street</a></i> is an American <a href="/wiki/Police_procedural" title="Police procedural">police procedural</a> <a class="mw-redirect" href="/wiki/Crime_drama" title="Crime drama">crime drama</a> <a class="mw-redirect" href="/wiki/Television_series" title="Television series">television series</a> that aired on the <a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox Network</a> and in first run syndication from April 12, 1987, to April 27, 1991, with a total of 103 <a href="/wiki/Episode" title="Episode">episodes</a>. The series focuses on a squad of youthful-looking undercover police officers investigating crimes in high schools, colleges, and other teenage venues.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>

In [33]:
%%HTML
<a href = 'https://www.reddit.com/r/MachineLearning/'> The Reddit Page </a>

The cells below assume that `soup` has been rebuilt from a successful response for the Reddit page; `data-click-id` is an attribute from Reddit's markup, not Wikipedia's.

In [None]:
len(soup.find_all('p'))

In [None]:
len(soup.find_all('h2'))

In [None]:
soup.find('a', {'data-click-id': 'body'})['href']

In [None]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [None]:
links

In [None]:
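links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    # Build the absolute URL of the post and fetch its page
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    # Store text rather than Tag objects; find() returns None when no <h2> exists
    title = soup2.find('h2')
    body = soup2.find_all('p')
    titles.append(title.text if title is not None else None)
    bodys.append(' '.join(p.text for p in body))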

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. What label is putting out the most music?  Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})
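
One possible way into the exercise is to walk the table row by row, collect each row into a dictionary, and count the labels. The sketch below runs on a tiny stand-in table with placeholder values (not real releases), and it ignores the `rowspan` cells that the real page may use for shared release dates. The column names are likewise assumptions.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Placeholder table: the artists, albums, and labels are made up for illustration.
table_html = """
<table class="wikitable">
  <tr><th>Artist</th><th>Album</th><th>Label</th></tr>
  <tr><td>Artist A</td><td>Album 1</td><td>Label X</td></tr>
  <tr><td>Artist B</td><td>Album 2</td><td>Label Y</td></tr>
  <tr><td>Artist C</td><td>Album 3</td><td>Label X</td></tr>
</table>
"""

table = BeautifulSoup(table_html, 'html.parser').find('table', {'class': 'wikitable'})
rows = table.find_all('tr')

# The first row holds the column names
columns = [th.get_text(strip=True) for th in rows[0].find_all('th')]

records = []
for row in rows[1:]:
    values = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(values) == len(columns):   # skip rows that do not line up with the header
        records.append(dict(zip(columns, values)))

# Count how many albums each label appears on
label_counts = Counter(record['Label'] for record in records)
print(label_counts.most_common(1))
```

From here, `pd.DataFrame(records)` gives a dataframe, and `df['Label'].value_counts().plot(kind='bar')` is one way to visualize the label counts.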

In [18]:

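import os

# Read the credentials from environment variables instead of hard-coding them
# in the notebook (the variable names here are placeholders; use your own).
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']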

### Tweepy

- Sign in to Twitter apps (https://apps.twitter.com/).
- Create an application and retrieve `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.
- Follow the example below, filling in your own credentials. For more information, see the [Tweepy documentation](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [19]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [21]:
user = api.get_user('thrashermag')

In [23]:
# `count` is the timeline parameter (at most 200 tweets per request)
for tweet in user.timeline(count = 200):
    print(tweet.text)

If the pigs feet and dog coats didn't raise the red flag, the $600 in pizza definitely did the trick. No wonder the… https://t.co/juUgD90o4m
Style and brains eternal, we love you Phil Shao!  https://t.co/x6osDOcDFO https://t.co/BumbRgl2Kx
The WKND crew took a U-Haul full of ramps to the desert and things got weird. They may not have located Animal Chin… https://t.co/PqZ6jpe26p
With a nod to Jeremy Klein, the WKND boys hit the road Hook Ups style and put together a launch-ramp-infused tour,… https://t.co/9Sw1m5TJum
After smashing his grill, Jaws rejoins Real for a heavy handrail day with Jamie Thomas. Tyson goes for the biggest… https://t.co/nsuPXrr7lr
If Dustin Dollin is signing your checks you sure as hell aren’t gonna turn in any soft footage. This PD Promo is ha… https://t.co/qTl5xpyBpm
Randy Blythe looks out for his people, loves what he does and just wants his fans to know where he’s coming from.… https://t.co/gHaVT50zEX
Frog in Las Vegas, Gridlock in SF, Brent Atchley's return an

In [24]:
print(user.followers_count)

418430


In [25]:
tweets = []
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [26]:
tweets[:5]

["If the pigs feet and dog coats didn't raise the red flag, the $600 in pizza definitely did the trick. No wonder the… https://t.co/juUgD90o4m",
 'Style and brains eternal, we love you Phil Shao!  https://t.co/x6osDOcDFO https://t.co/BumbRgl2Kx',
 'The WKND crew took a U-Haul full of ramps to the desert and things got weird. They may not have located Animal Chin… https://t.co/PqZ6jpe26p',
 'With a nod to Jeremy Klein, the WKND boys hit the road Hook Ups style and put together a launch-ramp-infused tour,… https://t.co/9Sw1m5TJum',
 'After smashing his grill, Jaws rejoins Real for a heavy handrail day with Jamie Thomas. Tyson goes for the biggest… https://t.co/nsuPXrr7lr']
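
Most of the collected tweets end in shortened `t.co` links. If only the wording is of interest, a regular expression can strip them. A minimal sketch on two of the strings shown above (`strip_links` is a hypothetical helper name):

```python
import re

sample_tweets = [
    'Style and brains eternal, we love you Phil Shao!  https://t.co/x6osDOcDFO https://t.co/BumbRgl2Kx',
    'The WKND crew took a U-Haul full of ramps to the desert and things got weird. They may not have located Animal Chin… https://t.co/PqZ6jpe26p',
]

# Anything that starts with http:// or https:// up to the next whitespace
URL_PATTERN = re.compile(r'https?://\S+')


def strip_links(text):
    """Remove URLs, then collapse the leftover runs of whitespace."""
    without_links = URL_PATTERN.sub('', text)
    return ' '.join(without_links.split())


cleaned = [strip_links(t) for t in sample_tweets]
print(cleaned[0])
```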

### Open Table

![](images/open_table.png)

Finding restaurants in New York City (https://www.opentable.com/new-york-restaurant-listings). Is there good Indian food on the Upper West Side? Where? What are people saying is good?

In [58]:
#url = 'https://www.opentable.com/new-york-restaurant-listings'
# This cell queries a Yelp search page (burritos near Civic Center, Manhattan) instead
response = requests.get('https://www.yelp.com/search?find_desc=burrito&find_loc=Civic+Center%2C+Manhattan%2C+NY&ns=1')

In [59]:
soup = BeautifulSoup(response.text, 'html.parser')

In [60]:
soup.text[:300]

'\n\n\n\n\n  \n\n            window.yPageStart = new Date().getTime();\n\n            var initialVisibilityState = document.webkitVisibilityState;\n\n                yPerfTimings = [];\n\n                ySitRepParams = {"clientIP": "144.121.201.14", "datacenter": "us-east-1", "is_internal_ip": false, "edgeStartT'

In [65]:
#test = soup.find_all('div', {'class': 'media-block media-block--18'})
test2 = soup.find_all('a', {'class': 'biz-name'})
title = []
for t in test2:
    title.append(t.text)

In [66]:
title

['Jerusalem Mexican Deli Grocery',
 'El Vez',
 'Holi Mole',
 'Breakroom',
 'Burrito House',
 'Pulqueria',
 'Dos Toros Taqueria',
 'Habana To-Go',
 'New Fresco Tortillas',
 'Oaxaca Taqueria',
 'Luchadores']

In [54]:
names = soup.find_all('div', {'class': 'rest-row-header'})

In [55]:
for name in names:
    print(name.text)

 Dorcass  
 Gregg Hegmann  
 Angelinas  
 Will  
 Lemke Ports  
 Herzog  
 Sed Erdman  
 Dolore  
 Et VonRueden  
 Rerum  
 Margarett Grant  
 Columbuss  
 45 Kirlin  
 Mews  
 Saepe Stracke  
 Recusandae Quigley  
 Noemi Glover  
 Kemmer Ports  
 1051 Dickinson  
 Liza Murazik  
 Earum Jacobson  
 559 Hammes  
 Sunt Wiegand  
 Delectus  
 Jacksons  
 Brants  
 Simonis  
 Titus Cremin  
 Brenden Mills  
 Pollich  
 Columbus Pfeffer  
 Mason Pike  
 1376 Prohaska  
 Branch  
 Tempore  
 Trail  
 Et  
 Autem  
 Hermiston  
 Crest  
 Quis Marks  
 Flo Crossroad  
 Consequatur Schinner  
 Raynor  
 Damariss  
 Agloe Bar & Grill  
 Forges  
 Marcia Shoal  
 Heathcote  
 Eum Tunnel  
 Arvilla Bosco  
 Jayde Key  
 Kunze  
 Mollie Heller  
 Exercitationem Summit  
 Lindsay Reichel  
 Quasi River  
 Crawford Willms  
 Similique  
 Luciano Hansen  
 Ratione Villages  
 Distinctio Ports  
 1023 MacGyver  
 Ex Harbors  
 Nulla  
 Rerum Mews  
 Quasi  
 Natus Torphy  
 Beulahs  
 Groves  
 Facere 