### Intro to WebScraping / Parsing Practice

#### Notes from video by Ryan Orsinger

In [1]:
# imports
from requests import get
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
# URLs
text_url = 'https://machinelearning.fit/example.txt'
json_url = 'https://machinelearning.fit/example.json'
csv_url = 'https://machinelearning.fit/example.csv'
html_url = 'https://machinelearning.fit/example.html'

#### Text 

In [7]:
# create a response variable for text_url
response = get(text_url)

# get the raw response text as one single string
response.text


"This is example text.\n\nThe text comes to us from request.get as a string. And it's a single string. \n\nThe way that we parse this single string of unstructured text is by splitting the string into component parts based on characters. "

In [8]:
# parse by replacing the new line character with a space
response.text.replace('\n',' ')

"This is example text.  The text comes to us from request.get as a string. And it's a single string.   The way that we parse this single string of unstructured text is by splitting the string into component parts based on characters. "

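The sample text describes parsing as splitting a string into parts based on characters. A minimal sketch of that idea follows, with the response text pasted in as a literal so it runs without a network call. The names `raw_text`, `paragraphs`, and `sentences` are illustrative and not from the video, and the sentence split is a naive rule that only suits this sample.

```python
# The text returned by the server, copied in as a literal so no request is needed.
raw_text = (
    "This is example text.\n\n"
    "The text comes to us from request.get as a string. And it's a single string. \n\n"
    "The way that we parse this single string of unstructured text is by "
    "splitting the string into component parts based on characters. "
)

# Split on blank lines to get paragraphs, trimming stray whitespace.
paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]

# Split one paragraph into rough sentences on ". " (period followed by a space).
sentences = paragraphs[1].split(". ")

print(len(paragraphs))
print(sentences)
```

Splitting on `". "` rather than `"."` keeps `request.get` in one piece, which is the kind of detail that makes manual parsing of free text fragile.
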
#### JSON

In [10]:
# create a response variable for json_url
response = get(json_url)

# get the raw response as one single string
response.text

'{\n    "title": "This is an example of JSON",\n    "body": "Read in JSON and treat it like a dictionary in Python. No parsing necessary!"\n}'

In [12]:
# parse with the response's .json() method, which returns a dictionary with 'title' and 'body' keys
response.json()

{'title': 'This is an example of JSON',
 'body': 'Read in JSON and treat it like a dictionary in Python. No parsing necessary!'}

In [15]:
# index the 'body' key to get its text as a single string
response.json()['body']

'Read in JSON and treat it like a dictionary in Python. No parsing necessary!'
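
For this payload, `response.json()` should give the same dictionary as decoding the raw string with the standard library's `json.loads`. The sketch below uses the raw response pasted in as a literal (the variable names `raw_json` and `data`, and the `"author"` key, are illustrative assumptions).

```python
import json

# The raw JSON string from the response, copied in as a literal.
raw_json = '{\n    "title": "This is an example of JSON",\n    "body": "Read in JSON and treat it like a dictionary in Python. No parsing necessary!"\n}'

# Decode the string into a regular Python dictionary.
data = json.loads(raw_json)

print(type(data).__name__)
print(data["title"])

# .get returns a default instead of raising KeyError when a key is absent.
# "author" is a hypothetical key that this payload does not contain.
print(data.get("author", "unknown"))
```

Using `.get` with a default is a reasonable habit when a key is not guaranteed to be present in every response.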

#### CSV

In [16]:
# create a response variable for csv_url
response = get(csv_url)

# get the raw response as one single string
response.text

"title, body\nthis is an example csv, we can manually parse CSV files by splitting strings on new line characters and commas\nhow to automatically parse a csv, use pandas's .read_csv and you're good to go"

In [17]:
# pandas reads and parses the CSV directly from the URL
pd.read_csv(csv_url)

|   | title | body |
|---|-------|------|
| 0 | this is an example csv | we can manually parse CSV files by splitting ... |
| 1 | how to automatically parse a csv | use pandas's .read_csv and you're good to go |
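
The CSV itself mentions manual parsing by splitting on newline characters and commas. A minimal sketch of that approach follows, using the raw response pasted in as a literal. The names `raw_csv`, `header`, and `rows` are illustrative, and the approach assumes no field contains a comma of its own, which holds for this sample but not for CSV in general.

```python
# The raw CSV text from the response, copied in as a literal.
raw_csv = (
    "title, body\n"
    "this is an example csv, we can manually parse CSV files by splitting strings on new line characters and commas\n"
    "how to automatically parse a csv, use pandas's .read_csv and you're good to go"
)

# One line per record; the first line holds the column names.
lines = raw_csv.split("\n")

# Strip each name because the file puts a space after every comma.
header = [name.strip() for name in lines[0].split(",")]

rows = []
for line in lines[1:]:
    values = [value.strip() for value in line.split(",")]
    rows.append(dict(zip(header, values)))

print(header)
print(rows[1]["body"])
```

Quoted fields that contain commas would break this split, which is a good reason to prefer `pd.read_csv` or the standard `csv` module for real files.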


### Webscraping - BeautifulSoup

In [18]:
# create a response variable for html
response = get(html_url)

# get the raw response as one single string
response.text

'<!DOCTYPE html>\n<html>\n<head>\n    <meta charset="utf-8" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n\n    <title>This is an Example Web Page</title>\n    <style type="text/css">\n        * {\n            box-sizing: border-box;\n        }\n\n        body {\n            margin: 0 auto;\n            max-width: 50em;\n            font-family: sans-serif;\n        }\n\n        header {\n            font-family: serif;\n            background-color: black;\n            color: white;\n            margin: 2em;\n            padding: 1em;\n            text-align: center;\n        }\n\n        main {\n            padding: 3em;\n            border: 1px solid black;\n        }\n\n    </style>\n</head>\n<body>\n\n<header>\n    <h1>This is an example webpage</h1>\n</header>\n<main>\n    <section>\n        <h2>HTML</h2>\n        <p>HTML is a language for structure and content for the web.</p>\n       

In [19]:
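# create a BeautifulSoup object from the response, naming the parser explicitly
# (omitting it makes bs4 guess a parser and emit a warning)
soup = BeautifulSoup(response.text, 'html.parser')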

# look at html formatted text
soup

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>This is an Example Web Page</title>
<style type="text/css">
        * {
            box-sizing: border-box;
        }

        body {
            margin: 0 auto;
            max-width: 50em;
            font-family: sans-serif;
        }

        header {
            font-family: serif;
            background-color: black;
            color: white;
            margin: 2em;
            padding: 1em;
            text-align: center;
        }

        main {
            padding: 3em;
            border: 1px solid black;
        }

    </style>
</head>
<body>
<header>
<h1>This is an example webpage</h1>
</header>
<main>
<section>
<h2>HTML</h2>
<p>HTML is a language for structure and content for the web.</p>
<p>We need to parse HTML because we want the content, not the HTML tags.</p>
<h3>Some HTML structure/

In [20]:
# look at just the text, without the html formatting
soup.text

"\n\n\n\n\nThis is an Example Web Page\n\n\n\n\nThis is an example webpage\n\n\n\nHTML\nHTML is a language for structure and content for the web.\nWe need to parse HTML because we want the content, not the HTML tags.\nSome HTML structure/content\n\nText\nBullet points\nImages\nVideo/Audio\n\n\n\nManually Parsing HTML\nShort answer: Don't do it.\nAvoid parsing HTML with regex. See this StackOverflow if you are curious.\n\n\nBest Way to parse HTML\nUse a library, like Beautiful Soup, that is built for HTML.\n\n\n\n\n\n"

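The output of `soup.text` keeps every newline from the original markup, so it is mostly blank lines. One way to tidy it, sketched here with plain string methods on an excerpt of that output (the names `page_text` and `clean_lines` are illustrative):

```python
# An excerpt of the soup.text output, copied in as a literal.
page_text = (
    "\n\n\n\n\nThis is an Example Web Page\n\n\n\n\n"
    "This is an example webpage\n\n\n\n"
    "HTML\nHTML is a language for structure and content for the web.\n"
)

# Keep only lines that still have content after trimming whitespace.
clean_lines = [line.strip() for line in page_text.splitlines() if line.strip()]

for line in clean_lines:
    print(line)
```

This drops the empty lines but also flattens the page, so headings and paragraphs can no longer be told apart. Selecting specific tags keeps that structure.
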
### find link

In [21]:
# ask soup to find the first anchor tag (link) in the HTML
soup.find('a')

<a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/" target="_blank">this StackOverflow</a>

In [23]:
# treat it like a dictionary and get 'href' to pull just url
soup.find('a')['href']

'https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/'
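
`soup.find` returns only the first match, and it returns `None` when nothing matches, so indexing the result directly can fail on other pages. The sketch below shows a more defensive pattern on a small stand-in snippet. The `html` string is a hypothetical excerpt (the second anchor and the search for an `img` tag are added for illustration), not the full example page.

```python
from bs4 import BeautifulSoup

# A small stand-in for the example page (hypothetical snippet).
html = """
<p>See <a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/" target="_blank">this StackOverflow</a>.</p>
<p><a>anchor without an href</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect every href; .get returns None for anchors that lack the attribute.
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]

# find() returns None when no tag matches, so guard before indexing.
image = soup.find("img")
image_src = image["src"] if image is not None else None

print(links)
print(image_src)
```

Using `tag.get('href')` instead of `tag['href']` trades a `KeyError` for a `None`, which is usually easier to handle when scraping many pages.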

### find heading

In [24]:
# ask soup to find the first h1 heading in the HTML
soup.find('h1')

<h1>This is an example webpage</h1>

In [25]:
# or just the text of the heading
soup.find('h1').text


'This is an example webpage'

### find all headings

In [28]:
# find all h2 subheadings
soup.find_all('h2')

[<h2>HTML</h2>,
 <h2>Manually Parsing HTML</h2>,
 <h2>Best Way to parse HTML</h2>]

In [30]:
# or just the text of the last one
soup.find_all('h2')[-1].text

'Best Way to parse HTML'
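
Since `find_all` returns a list of tags, a list comprehension can pull the text from every heading in one pass. A minimal sketch on a stand-in snippet that contains only the headings shown above (the names `html`, `subheads`, and `all_headings` are illustrative):

```python
from bs4 import BeautifulSoup

# Only the headings from the example page (hypothetical excerpt).
html = """
<h1>This is an example webpage</h1>
<h2>HTML</h2>
<h2>Manually Parsing HTML</h2>
<h2>Best Way to parse HTML</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Text of every h2, in document order.
subheads = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# find_all also accepts a list of tag names; keep each tag name with its text.
all_headings = [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(["h1", "h2"])]

print(subheads)
print(all_headings)
```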