# Webscraping Tutorial

Things needed:  
- Anaconda: https://www.anaconda.com/download
- Jupyter Notebook

Python packages:  
- bs4 (BeautifulSoup)
- pandas
- matplotlib

### Basic idea

- A webpage or web API has some data that we want to collect
- No existing dataset covers it, or we need live data
- Example: you might want to log the weather in your area over time

### Principle

- When navigating in a browser: the browser makes a request, gets back a text response, and renders that response
- When scraping: we make the same kind of request and get back a text response, but instead of rendering it we parse and analyze it to extract the data we need
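
The parsing half of that principle can be sketched without any network access. The snippet below is a minimal illustration using only Python's standard-library `html.parser`; the HTML string and the `LinkCollector` class are made up for this example, and the rest of this tutorial uses BeautifulSoup instead.

```python
from html.parser import HTMLParser

# Hypothetical response text; in practice this would come from an HTTP request.
response_text = """
<html><body>
  <h1>Weather log</h1>
  <a href="/today">Today</a>
  <a href="/tomorrow">Tomorrow</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text found while an <a href=...> tag is open belongs to that link.
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

parser = LinkCollector()
parser.feed(response_text)
print(parser.links)
```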

### Ethics

- Webscraping is very much a gray area
- Courteous webscraping
    - Don't needlessly burden the server
    - Obey robots.txt
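
One simple way to avoid burdening a server is to enforce a minimum pause between requests. The `Throttle` class below is a hypothetical helper written for this tutorial (it is not part of any library), and the one-second interval is an arbitrary assumption; a given site may ask for something different.

```python
import time

class Throttle:
    """Blocks so that successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now

throttle = Throttle(min_interval=1.0)
# for url in urls_to_scrape:   # hypothetical list of URLs
#     throttle.wait()          # pauses only if the previous request was under 1 s ago
#     ...fetch url here...
```

Calling `throttle.wait()` right before each request keeps the scraper from sending requests faster than the chosen interval.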

# Example Problem 1

Finding the top artists on SoundCloud

First, check https://soundcloud.com/robots.txt:

```
User-agent: *
Disallow:
Sitemap: https://a-v2.sndcdn.com/sitemap.txt
```
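
The same check can be done in code. The sketch below feeds the rules shown above directly to the standard library's `urllib.robotparser`, so it runs without a network request; against a live site one would instead call `set_url(...)` followed by `read()`. The second rule set and the `example.com` URLs are made up to show what a disallowed path looks like.

```python
from urllib.robotparser import RobotFileParser

# The rules shown above; an empty "Disallow:" means nothing is disallowed.
soundcloud_rules = """\
User-agent: *
Disallow:
Sitemap: https://a-v2.sndcdn.com/sitemap.txt
"""

parser = RobotFileParser()
parser.parse(soundcloud_rules.splitlines())
allowed = parser.can_fetch("*", "https://soundcloud.com/charts/top")
print(allowed)

# A hypothetical, stricter robots.txt for comparison.
strict_rules = """\
User-agent: *
Disallow: /private/
"""

strict_parser = RobotFileParser()
strict_parser.parse(strict_rules.splitlines())
blocked = strict_parser.can_fetch("*", "https://example.com/private/data")
print(blocked)
```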

In [2]:
from bs4 import BeautifulSoup
import urllib.request

In [3]:
url = "https://soundcloud.com/charts/top"

# Build the request, send it, and decode the response bytes into text
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
page = response.read().decode('utf-8')
page

'<!DOCTYPE html>\n\n<html lang="en">\n<head>\n  <meta charset="utf-8">\n  \n  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n\n  \n  <link rel="dns-prefetch" href="//style.sndcdn.com">\n  <link rel="dns-prefetch" href="//a-v2.sndcdn.com">\n  <link rel="dns-prefetch" href="//api-v2.soundcloud.com">\n  <link rel="dns-prefetch" href="//sb.scorecardresearch.com">\n  <link rel="dns-prefetch" href="//secure.quantserve.com">\n  <link rel="dns-prefetch" href="//eventlogger.soundcloud.com">\n  <link rel="dns-prefetch" href="//api.soundcloud.com">\n  <link rel="dns-prefetch" href="//ssl.google-analytics.com">\n  <link rel="dns-prefetch" href="//i1.sndcdn.com">\n  <link rel="dns-prefetch" href="//i2.sndcdn.com">\n  <link rel="dns-prefetch" href="//i3.sndcdn.com">\n  <link rel="dns-prefetch" href="//i4.sndcdn.com">\n  <link rel="dns-prefetch" href="//wis.sndcdn.com">\n  <link rel="dns-prefetch" href="//va.sndcdn.com">\n  <link rel="dns-prefetch" href="//pixel.quantserve.com">\n\n 

### Beautiful Soup Basics

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [4]:
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(page, "html.parser")
soup.head

<head>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<link href="//style.sndcdn.com" rel="dns-prefetch"/>
<link href="//a-v2.sndcdn.com" rel="dns-prefetch"/>
<link href="//api-v2.soundcloud.com" rel="dns-prefetch"/>
<link href="//sb.scorecardresearch.com" rel="dns-prefetch"/>
<link href="//secure.quantserve.com" rel="dns-prefetch"/>
<link href="//eventlogger.soundcloud.com" rel="dns-prefetch"/>
<link href="//api.soundcloud.com" rel="dns-prefetch"/>
<link href="//ssl.google-analytics.com" rel="dns-prefetch"/>
<link href="//i1.sndcdn.com" rel="dns-prefetch"/>
<link href="//i2.sndcdn.com" rel="dns-prefetch"/>
<link href="//i3.sndcdn.com" rel="dns-prefetch"/>
<link href="//i4.sndcdn.com" rel="dns-prefetch"/>
<link href="//wis.sndcdn.com" rel="dns-prefetch"/>
<link href="//va.sndcdn.com" rel="dns-prefetch"/>
<link href="//pixel.quantserve.com" rel="dns-prefetch"/>
<title>The most played tracks on SoundCloud this week</title>
<meta content="record, s

In [5]:
soup.body

<body>
<div id="app">
<style type="text/css">.header{width:100%;height:46px}.header,.header__logo{background:#333}.header__logoLink{background:url() no-repeat 12px 11px;background-size:48px 22px;display:block;height:46px;width:69px}.header__logoLink:focus{background-color:rgba(255,72,0,.8);outline:0}#header__loading{margin:13px auto 0;width:16px}@media (-webkit-min-device-pixel-ratio:2),(min-resolution:192dpi),(min-resolution:2dppx){.header__logoLink{background-image:url(data:image/png;bas

In [6]:
soup.body.div

<div id="app">
<style type="text/css">.header{width:100%;height:46px}.header,.header__logo{background:#333}.header__logoLink{background:url() no-repeat 12px 11px;background-size:48px 22px;display:block;height:46px;width:69px}.header__logoLink:focus{background-color:rgba(255,72,0,.8);outline:0}#header__loading{margin:13px auto 0;width:16px}@media (-webkit-min-device-pixel-ratio:2),(min-resolution:192dpi),(min-resolution:2dppx){.header__logoLink{background-image:url(data:image/png;base64,iVB

In [7]:
soup.body.find(id="app")

<div id="app">
<style type="text/css">.header{width:100%;height:46px}.header,.header__logo{background:#333}.header__logoLink{background:url() no-repeat 12px 11px;background-size:48px 22px;display:block;height:46px;width:69px}.header__logoLink:focus{background-color:rgba(255,72,0,.8);outline:0}#header__loading{margin:13px auto 0;width:16px}@media (-webkit-min-device-pixel-ratio:2),(min-resolution:192dpi),(min-resolution:2dppx){.header__logoLink{background-image:url(data:image/png;base64,iVB

In [8]:
soup.find_all("a")

[<a class="header__logoLink sc-border-box sc-ir" href="/" title="Home">SoundCloud</a>,
 <a class="sc-button sc-button-medium" href="http://www.enable-javascript.com/" target="_blank">Show me how to enable it</a>,
 <a href="/charts/top?genre=all-music">Top 50</a>,
 <a href="/charts/new?genre=all-music">New &amp; hot</a>,
 <a href="/charts/top?genre=all-music">All music genres</a>,
 <a href="/charts/top?genre=all-audio">All audio genres</a>,
 <a href="/charts/top?genre=alternativerock">Alternative Rock</a>,
 <a href="/charts/top?genre=ambient">Ambient</a>,
 <a href="/charts/top?genre=classical">Classical</a>,
 <a href="/charts/top?genre=country">Country</a>,
 <a href="/charts/top?genre=danceedm">Dance &amp; EDM</a>,
 <a href="/charts/top?genre=dancehall">Dancehall</a>,
 <a href="/charts/top?genre=deephouse">Deep House</a>,
 <a href="/charts/top?genre=disco">Disco</a>,
 <a href="/charts/top?genre=drumbass">Drum &amp; Bass</a>,
 <a href="/charts/top?genre=dubstep">Dubstep</a>,
 <a href="/c

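At this point, it is helpful to view the page source in a browser to get an idea of what we're looking for:  
https://soundcloud.com/charts/top

(Make sure to disable JavaScript first: urllib does not run scripts, so the JavaScript-free version of the page is what our code actually receives.)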

In [10]:
tracks_container = soup.find_all(class_="sounds")[0]
song_artist_combos = tracks_container.find_all("h2")[1:]
song_artist_combos

[<h2 itemprop="name"><a href="/lil_peep/lil-peep-ft-xxxtentacion-falling-down" itemprop="url">Lil Peep &amp; XXXTENTACION - Falling Down</a>
 by <a href="/lil_peep">☆LiL PEEP☆</a></h2>,
 <h2 itemprop="name"><a href="/lil-baby-4pf/drip-too-hard" itemprop="url">Drip Too Hard</a>
 by <a href="/lil-baby-4pf">Lil Baby</a></h2>,
 <h2 itemprop="name"><a href="/liluzivert/new-patek" itemprop="url">New Patek</a>
 by <a href="/liluzivert">LIL UZI VERT</a></h2>,
 <h2 itemprop="name"><a href="/uiceheidd/lucid-dreams-forget-me" itemprop="url">Lucid Dreams</a>
 by <a href="/uiceheidd">Juice WRLD</a></h2>,
 <h2 itemprop="name"><a href="/16yrold/mobamba" itemprop="url">sheck wes - mo bamba (prod. 16yrold &amp; take a daytrip)</a>
 by <a href="/16yrold">16yrold</a></h2>,
 <h2 itemprop="name"><a href="/scumgang6ix9ine/fefe-feat-nicki-minaj" itemprop="url">FEFE (Feat. Nicki Minaj &amp; Murda Beatz)</a>
 by <a href="/scumgang6ix9ine">6IX9INE</a></h2>,
 <h2 itemprop="name"><a href="/kanyewest/i-love-it-kan

In [12]:
song_data = []
for entry in song_artist_combos:
    song_elements = entry.find_all("a")
    name = song_elements[0].string
    artist = song_elements[1].string
    song_data.append({"song":name, "artist":artist})
    print("{0} :: {1}".format(name, artist))

Lil Peep & XXXTENTACION - Falling Down :: ☆LiL PEEP☆
Drip Too Hard :: Lil Baby
New Patek :: LIL UZI VERT
Lucid Dreams :: Juice WRLD
sheck wes - mo bamba (prod. 16yrold & take a daytrip) :: 16yrold
FEFE (Feat. Nicki Minaj & Murda Beatz) :: 6IX9INE
Kanye West & Lil Pump - I Love It :: Kanye West
XXXTENTACION - Fuck Love  (feat. Trippie Redd) :: XXXTENTACION
All Girls Are The Same :: Juice WRLD
HOPE :: XXXTENTACION
Leave Me Alone (Prod. by Young Forever x Cast Beats) :: Flipp Dinero
Taste (feat. Offset) :: Tyga
YNW MELLY - MURDER ON MY MIND Prod By; SMKEXCLSV :: Ynw Melly
lil peep - star shopping (prod. kryptik) :: Jack
Trip :: Ella Mai
I Kill People! ft Tadoe & Chief Keef [Produced by: Ozmusiqe] RR :: Trippie Redd
Lean Wit Me :: Juice WRLD
Noticed :: Lil Mosey
juice wrld - legends :( :: Juice WRLD
Close Friends :: Lil Baby
Eminem - KILLSHOT (Machine Gun Kelly MGK DISS) rap devil response :: WorldStarHipHop Radio
I don't wanna do this anymore :: XXXTENTACION
Walk :: COMETHAZINE
Marshmell

Now, to do some actual analysis on our data, let's use pandas, a widely used data analysis library for Python. (Usage is similar to data frames in R.)

https://pandas.pydata.org/pandas-docs/stable/

In [24]:
import pandas as pd
%matplotlib notebook

In [15]:
df = pd.DataFrame(song_data)
df

| | artist | song |
|---|---|---|
| 0 | ☆LiL PEEP☆ | Lil Peep & XXXTENTACION - Falling Down |
| 1 | Lil Baby | Drip Too Hard |
| 2 | LIL UZI VERT | New Patek |
| 3 | Juice WRLD | Lucid Dreams |
| 4 | 16yrold | sheck wes - mo bamba (prod. 16yrold & take a d... |
| 5 | 6IX9INE | FEFE (Feat. Nicki Minaj & Murda Beatz) |
| 6 | Kanye West | Kanye West & Lil Pump - I Love It |
| 7 | XXXTENTACION | XXXTENTACION - Fuck Love (feat. Trippie Redd) |
| 8 | Juice WRLD | All Girls Are The Same |
| 9 | XXXTENTACION | HOPE |


With pandas, you can filter a DataFrame by indexing it with an expression that evaluates to a boolean Series; only the rows where that Series is True are returned:

In [20]:
df[df.artist == "XXXTENTACION"]

| | artist | song |
|---|---|---|
| 7 | XXXTENTACION | XXXTENTACION - Fuck Love (feat. Trippie Redd) |
| 9 | XXXTENTACION | HOPE |
| 21 | XXXTENTACION | I don't wanna do this anymore |
| 39 | XXXTENTACION | A GHETTO CHRISTMAS CAROL Prod. RONNY J |


In [33]:
# Case-insensitive substring match on the artist name
df[df.artist.str.contains("lil", case=False)]

| | artist | song |
|---|---|---|
| 0 | ☆LiL PEEP☆ | Lil Peep & XXXTENTACION - Falling Down |
| 1 | Lil Baby | Drip Too Hard |
| 2 | LIL UZI VERT | New Patek |
| 17 | Lil Mosey | Noticed |
| 19 | Lil Baby | Close Friends |
| 25 | lil skies | Lust [prod. CashMoneyAp] |
| 37 | Lil Tjay | Lil TJAY - Brothers Prod by [JDONTHATRACK] & [... |
| 40 | ☆LiL PEEP☆ | Save That Shit (prod. by smokeasac & IIVI) |
| 46 | lil skies | Creeping (feat. Rich The Kid)[prod. by Menoh B... |


Some artists (currently) show up more than once, so let's make a bar chart of the artists who have more than one song in the chart.

In [21]:
df.artist.value_counts()

Juice WRLD                         6
6IX9INE                            4
XXXTENTACION                       4
Kodak Black                        2
lil skies                          2
Trippie Redd                      2
☆LiL PEEP☆                         2
Lil Baby                           2
Kanye West                         1
SKI MASK THE SLUMP GOD             1
Adham Seliman                      1
Music Mhragnat - ميوزك مهرجانات    1
cubied                             1
marshmello                         1
Famous Dex                         1
Post Malone                        1
COMETHAZINE                        1
Tyga                               1
16yrold                            1
Flipp Dinero                       1
WorldStarHipHop Radio              1
Mohamed Talaat                     1
BIG BANK CAMPAIGN                  1
Gucci Mane                         1
SHORELINE MAFIA                    1
Ynw Melly                          1
No Jumper                          1
L

In [23]:
# Count songs per artist once, then keep artists with more than one song
artist_counts = df.artist.value_counts()
artist_counts[artist_counts > 1]

Juice WRLD       6
6IX9INE          4
XXXTENTACION     4
Kodak Black      2
lil skies        2
Trippie Redd    2
☆LiL PEEP☆       2
Lil Baby         2
Name: artist, dtype: int64
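
For intuition, the counting and filtering above can be sketched in plain Python with `collections.Counter`. This only illustrates what the pandas expression computes, not how pandas implements it; the list below is just the artist column of the ten rows displayed earlier, so the counts differ from those of the full chart.

```python
from collections import Counter

# Artist column of the ten rows displayed earlier (not the full chart).
artists = [
    "☆LiL PEEP☆", "Lil Baby", "LIL UZI VERT", "Juice WRLD", "16yrold",
    "6IX9INE", "Kanye West", "XXXTENTACION", "Juice WRLD", "XXXTENTACION",
]

counts = Counter(artists)  # maps artist -> number of appearances

# Keep only artists that appear more than once, most frequent first.
repeat_artists = {artist: n for artist, n in counts.most_common() if n > 1}
print(repeat_artists)
```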

In [32]:
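# Count songs per artist once, then plot only artists with more than one song
artist_counts = df.artist.value_counts()
axes = artist_counts[artist_counts > 1].plot(kind="bar")
axes.set_title("Top artists by song count in top chart on SoundCloud")
axes.set_ylabel("Song count in top chart")
# Leave room at the bottom for the rotated artist labels
axes.figure.subplots_adjust(bottom=.3)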

[Figure: bar chart of song counts for artists with more than one song in the top chart]

# Example Problem 2

Scraping an API endpoint that returns JSON: the current position of the International Space Station (ISS)

http://api.open-notify.org/iss-now.json

In [34]:
import json

In [35]:
response = urllib.request.urlopen("http://api.open-notify.org/iss-now.json")
result = json.loads(response.read())
result

{'timestamp': 1539050543,
 'iss_position': {'longitude': '71.1412', 'latitude': '-44.7565'},
 'message': 'success'}
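
Note that the coordinates arrive as strings and the timestamp as Unix epoch seconds. Below is a minimal sketch of converting them into numeric and datetime values, using the response shown above as a hard-coded string so that it runs offline. The `parse_iss_position` name is made up for this example.

```python
import json
from datetime import datetime, timezone

# The response shown above, hard-coded so the example runs offline.
raw = '''{"timestamp": 1539050543,
          "iss_position": {"longitude": "71.1412", "latitude": "-44.7565"},
          "message": "success"}'''

def parse_iss_position(payload):
    """Return (time, latitude, longitude) as a UTC datetime and two floats."""
    if payload.get("message") != "success":
        raise ValueError("API did not report success")
    position = payload["iss_position"]
    when = datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc)
    return when, float(position["latitude"]), float(position["longitude"])

when, lat, lon = parse_iss_position(json.loads(raw))
print(when.isoformat(), lat, lon)
```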

In [37]:
import time

def scrapeISSAPI():
    """Fetch the current ISS position from the API and return it as a dict."""
    response = urllib.request.urlopen("http://api.open-notify.org/iss-now.json")
    result = json.loads(response.read())
    return result

scrapes = 20
data = []
for i in range(scrapes):
    print("Scrape {0}".format(i))
    result = scrapeISSAPI()
    # Keep only the fields we need; lat/lon arrive as strings
    data.append({
        "time": result["timestamp"],
        "lat": result["iss_position"]["latitude"],
        "lon": result["iss_position"]["longitude"]
    })
    # Pause between requests so we don't needlessly burden the server
    time.sleep(1)
data

Scrape 0
Scrape 1
Scrape 2
Scrape 3
Scrape 4
Scrape 5
Scrape 6
Scrape 7
Scrape 8
Scrape 9
Scrape 10
Scrape 11
Scrape 12
Scrape 13
Scrape 14
Scrape 15
Scrape 16
Scrape 17
Scrape 18
Scrape 19


[{'time': 1539050682, 'lat': '-39.9133', 'lon': '80.8077'},
 {'time': 1539050683, 'lat': '-39.8752', 'lon': '80.8718'},
 {'time': 1539050685, 'lat': '-39.8179', 'lon': '80.9680'},
 {'time': 1539050686, 'lat': '-39.7797', 'lon': '81.0320'},
 {'time': 1539050687, 'lat': '-39.7223', 'lon': '81.1278'},
 {'time': 1539050688, 'lat': '-39.6840', 'lon': '81.1916'},
 {'time': 1539050690, 'lat': '-39.6264', 'lon': '81.2872'},
 {'time': 1539050691, 'lat': '-39.5880', 'lon': '81.3508'},
 {'time': 1539050692, 'lat': '-39.5303', 'lon': '81.4461'},
 {'time': 1539050693, 'lat': '-39.4918', 'lon': '81.5095'},
 {'time': 1539050695, 'lat': '-39.4340', 'lon': '81.6045'},
 {'time': 1539050696, 'lat': '-39.3954', 'lon': '81.6677'},
 {'time': 1539050697, 'lat': '-39.3374', 'lon': '81.7624'},
 {'time': 1539050698, 'lat': '-39.2987', 'lon': '81.8255'},
 {'time': 1539050700, 'lat': '-39.2405', 'lon': '81.9199'},
 {'time': 1539050701, 'lat': '-39.2017', 'lon': '81.9828'},
 {'time': 1539050702, 'lat': '-39.1629',

In [38]:
df = pd.DataFrame(data)
df

| | lat | lon | time |
|---|---|---|---|
| 0 | -39.9133 | 80.8077 | 1539050682 |
| 1 | -39.8752 | 80.8718 | 1539050683 |
| 2 | -39.8179 | 80.968 | 1539050685 |
| 3 | -39.7797 | 81.032 | 1539050686 |
| 4 | -39.7223 | 81.1278 | 1539050687 |
| 5 | -39.684 | 81.1916 | 1539050688 |
| 6 | -39.6264 | 81.2872 | 1539050690 |
| 7 | -39.588 | 81.3508 | 1539050691 |
| 8 | -39.5303 | 81.4461 | 1539050692 |
| 9 | -39.4918 | 81.5095 | 1539050693 |
