# Introduction to Web Scraping

To begin, we will examine the reddit page dealing with Machine Learning.  Our goal is to scrape the basic information for posts.

![](images/reddit.png)

In [3]:
import Requests
import bs4
url = 'https://www.reddit.com/r/MAchineLearning/'

ModuleNotFoundError: No module named 'Requests'

In [2]:
response = requests.get(url)

NameError: name 'requests' is not defined

In [19]:
response.text

'\n<!doctype html>\n<html>\n  <head>\n    <title>Too Many Requests</title>\n    <style>\n      body {\n          font: small verdana, arial, helvetica, sans-serif;\n          width: 600px;\n          margin: 0 auto;\n      }\n\n      h1 {\n          height: 40px;\n          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;\n      }\n    </style>\n  </head>\n  <body>\n    <h1>whoa there, pardner!</h1>\n    \n\n\n<p>we\'re sorry, but you appear to be a bot and we\'ve seen too many requests\nfrom you lately. we enforce a hard speed limit on requests that appear to come\nfrom bots to prevent abuse.</p>\n\n<p>if you are not a bot but are spoofing one via your browser\'s user agent\nstring: please change your user agent string to avoid seeing this message\nagain.</p>\n\n<p>please wait 2 second(s) and try again.</p>\n\n    <p>as a reminder to developers, we recommend that clients make no\n    more than <a href="http://github.com/reddit/reddi

In [1]:
%%HTML
<h1>This is a header</h1>
<p class = 'super-paragraph'>This would be a paragraph. <strong>Strong Words</strong> here. </p>


In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import 
import pandas as pd
import numpy as np

In [33]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [34]:
response = requests.get(url)
# use "requests" library to simply get all of the html of a url.  It's stored in a 
# response object.

In [35]:
response

<Response [200]>

In [36]:
response.text[:1000]
#text attribute is literally the whole thing

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_21_Jump_Street_episodes","wgTitle":"List of 21 Jump Street episodes","wgCurRevisionId":844038329,"wgRevisionId":844038329,"wgArticleId":35403829,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles needing additional references from May 2012","All articles needing additional references","21 Jump Street","Lists of American crime television series episodes"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparat

In [37]:
soup = BeautifulSoup(response.text, 'html.parser')
#create a beautiful soup object of the html

In [38]:
soup.find('h2')

<h2>Contents</h2>

In [39]:
all_h2 = soup.find_all('h2')

In [40]:
for header in all_h2[2:7]:
    print(header.text)

Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]
Season 4 (1989-90)[edit]
Season 5 (1990-91)[edit]


In [41]:
table_1 = soup.find('table', {'class': 'wikitable plainrowheaders'})

In [42]:
season_1_titles = table_1.find_all('td', {'class': 'summary'})

In [43]:
for title in season_1_titles:
    print(title.text)

"Pilot"
"America, What a Town"
"Don't Pet the Teacher"
"My Future's So Bright, I Gotta Wear Shades"
"The Worst Night of Your Life"
"Gotta Finish the Riff"
"Bad Influence"
"Blindsided"
"Next Generation"
"Low and Away""Running on Ice"
"16 Blown to 35"
"Mean Streets and Pastel Houses"


In [44]:
soup.find('p')

<p><i><a href="/wiki/21_Jump_Street" title="21 Jump Street">21 Jump Street</a></i> is an American <a href="/wiki/Police_procedural" title="Police procedural">police procedural</a> <a class="mw-redirect" href="/wiki/Crime_drama" title="Crime drama">crime drama</a> <a class="mw-redirect" href="/wiki/Television_series" title="Television series">television series</a> that aired on the <a href="/wiki/Fox_Broadcasting_Company" title="Fox Broadcasting Company">Fox Network</a> and in first run syndication from April 12, 1987, to April 27, 1991, with a total of 103 <a href="/wiki/Episode" title="Episode">episodes</a>. The series focuses on a squad of youthful-looking undercover police officers investigating crimes in high schools, colleges, and other teenage venues.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>

In [45]:
%%HTML
<a href = 'https://www.reddit.com/r/MachineLearning/'> The Reddit Page </a>

In [46]:
len(soup.find_all('p'))

1

In [47]:
len(soup.find_all('h2'))

9

In [48]:
soup.find('a', {'data-click-id': 'body'})['href']

TypeError: 'NoneType' object is not subscriptable

In [26]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [27]:
links

[]

In [28]:
links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')
    body = soup2.find_all('p')
    titles.append(title)
    bodys.append(body)

In [29]:
import pandas as pd

In [30]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [31]:
df.head()

Unnamed: 0,links,title,body


### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. What label is putting out the most music?  Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})

In [49]:
consumer_key = 'jp8VHqjPsvhrvfamltxmnTGjv'
consumer_secret = 'i7GacXRtre4ZQfr3YXt30bkn504FsPWaTU93iu1ObVD3jj4daI'
access_token_secret = 'PVfzUaR1wnACkLxJxbFZyAmDs4tIEiFKKEdMElMN0KWP6'
access_token = '161038561-Ztk8itAIddAWPPebJhzTpEZQS288BTF5PLnQp5LK'

### Tweepy

- Sign into Twitter apps (https://apps.twitter.com/)
- Create application and retrieve `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.  
- Follow example below filling in your info.  For more info, see the Tweepy documentation [here](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [51]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [65]:
user = api.get_user('realdonaldtrump')

In [66]:
for tweet in user.timeline():
    print(tweet.text)

...Cindy has voted for our Agenda in the Senate 100% of the time and has my complete and total Endorsement. We need… https://t.co/94PVS8YUFc
.@cindyhydesmith has helped me put America First! She’s strong on the Wall, is helping me create Jobs, loves our Ve… https://t.co/9ZwJS8pSBC
I have authorized an emergency disaster declaration to provide Hawaii the necessary support ahead of #HurricaneLane… https://t.co/7dWpL4fphZ
It was my great honor to host the Foreign Investment Risk Review Modernization Act Roundtable today at the… https://t.co/xkypzKTE17
https://t.co/6ZG0P6FRs5
https://t.co/6v90Th0zl1
https://t.co/3PAVDdfJJr
NO COLLUSION - RIGGED WITCH HUNT!
I have asked Secretary of State @SecPompeo to closely study the South Africa land and farm seizures and expropriati… https://t.co/iE6t1j1dak
The only thing that I have done wrong is to win an election that was expected to be won by Crooked Hillary Clinton… https://t.co/XD0Bh2rq9V
I will be interviewed on @foxandfriends by @ainsleyearhard

In [67]:
print(user.followers_count)

53957936


In [68]:
tweets = []
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [69]:
tweets[:5]

['...Cindy has voted for our Agenda in the Senate 100% of the time and has my complete and total Endorsement. We need… https://t.co/94PVS8YUFc',
 '.@cindyhydesmith has helped me put America First! She’s strong on the Wall, is helping me create Jobs, loves our Ve… https://t.co/9ZwJS8pSBC',
 'I have authorized an emergency disaster declaration to provide Hawaii the necessary support ahead of #HurricaneLane… https://t.co/7dWpL4fphZ',
 'It was my great honor to host the Foreign Investment Risk Review Modernization Act Roundtable today at the… https://t.co/xkypzKTE17',
 'https://t.co/6ZG0P6FRs5']

### Open Table

![](images/open_table.png)

Finding restaurants in New York City. (https://www.opentable.com/new-york-restaurant-listings)  Is there good Indian food in the Upper West Side?  Where?  What are people saying is good?

In [97]:
url2 = 'https://www.yelp.com/search?find_desc=burrito&find_loc=Downtown%2C+Boston%2C+MA+02228&ns=1'
response = requests.get(url2)

In [100]:
soupy = BeautifulSoup(response.text, 'html.parser')

In [101]:
soupy.text[:100]

'\n\n\n\n\n  \n\n            window.yPageStart = new Date().getTime();\n\n            var initialVisibilitySta'

In [108]:
Rests = soupy.find_all('a', {'class': 'biz-name'})
titles = []
for t in Rests:
    titles.append(t.text)

In [109]:
titles

['Zumas Tex Mex Grill',
 'Sabroso Taqueria',
 'Villa Mexico Cafe',
 'Maria’s Taqueria',
 'Viva Burrito',
 'Anna’s Taqueria',
 'Herrera’s',
 'Tenoch Mexican',
 'Boloco',
 'Boloco Atlantic Wharf',
 'Cha Cha Cha Taqueria']