# Data Acquisition

- Manually explore the site in a web browser, and identify the relevant HTML elements.


- Use the requests module to obtain the HTML from the page.


- Use BeautifulSoup to parse the HTML and obtain the text/data that we want.


- (Maybe) Script the process of requesting another page and parsing the data from it as well.


- Take this data further down the data science pipeline.


## Steps

- Import the get() function from the requests module, BeautifulSoup from bs4, and pandas.


- Assign the address of the web page to a variable named url.


- Request the content of the web page from the server with get(), and store the server’s response in the variable response.


- Print the response text to confirm that you received an HTML page.


- Take a look at the actual web page contents and inspect the source to understand the structure a bit.


- Use BeautifulSoup to parse the HTML into a variable ('soup').


- Identify the key tags you need to extract the data you are looking for.


- Create a dataframe of the data desired.


- Run some summary stats and inspect the data to ensure you have what you wanted.


- Edit the data structure as needed, especially so that one column has all the text you want included in this analysis.


- Create a corpus from the column with the text you want to analyze, and store that corpus for use in a future notebook.
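
The steps above can be sketched end to end without touching the network. This is a minimal sketch, assuming a small hypothetical HTML string (`sample_html`) in place of `response.text` and invented class names (`post`, `post-title`, `post-body`); a real page will need its own selectors, and the output file name is an arbitrary choice.

```python
import os
import tempfile

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; the tags and classes are invented.
sample_html = """
<html><body>
  <div class="post"><h2 class="post-title">First post</h2>
    <p class="post-body">Scraping starts in the browser.</p></div>
  <div class="post"><h2 class="post-title">Second post</h2>
    <p class="post-body">Then requests fetches the HTML.</p></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# One dictionary per post; each dictionary becomes one row of the dataframe.
rows = [
    {
        "title": post.find(class_="post-title").get_text(strip=True),
        "body": post.find(class_="post-body").get_text(strip=True),
    }
    for post in soup.select("div.post")
]
df = pd.DataFrame(rows)

# Quick checks that the scrape returned what we expected.
print(df.shape)
print(df["body"].str.len().describe())

# Put all the text to analyze in one column, then build the corpus from it.
df["text"] = df["title"] + ". " + df["body"]
corpus = df["text"].tolist()

# Store the corpus for a later notebook (written to the temp directory here).
out_path = os.path.join(tempfile.gettempdir(), "corpus.csv")
df[["text"]].to_csv(out_path, index=False)
```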

# Select

- select() always returns a list; when nothing matches, the list is empty

- you can access individual matches by indexing into the list

    - title = soup.select('#mk-page-introduce > div > h1')[0].get_text()
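
A small sketch of that behavior, using a hypothetical HTML string rather than a live page. It shows the empty-list case, a guard before indexing, and `select_one`, which returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = '<div id="intro"><div><h1>Hello</h1></div></div>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("#intro > div > h1")  # list with one element
missing = soup.select("#intro > div > h2")  # empty list, no exception

# Indexing an empty list raises IndexError, so check before using [0].
title = matches[0].get_text() if matches else None

# select_one returns the first match, or None when nothing matches.
first = soup.select_one("#intro > div > h1")
nothing = soup.select_one("#intro > div > h2")
```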

In [32]:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd

# Lesson Practice

In [2]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)
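
The request above does not check whether it succeeded. A minimal sketch of a wrapper that does, assuming a hypothetical helper name (`fetch_html`) and an arbitrary 10-second timeout; neither comes from the lesson.

```python
import requests


def fetch_html(url, user_agent="Data Science Student", timeout=10):
    """Return the page's HTML as text, raising if the request fails."""
    response = requests.get(
        url,
        headers={"User-Agent": user_agent},
        timeout=timeout,  # assumed value; without one, a stalled server can hang the call
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


# Example call (not executed here because it needs network access):
# html = fetch_html('https://codeup.com/codeups-data-science-career-accelerator-is-here/')
```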

In [3]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US" >
<head>
		<meta charset="UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><meta name="format-detection" content="telephone=no"><title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>
<script type


In [4]:
# create soup object

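# name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')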

In [5]:
# get title of web page using .select and index and .get_text()
# select returns ALL instances

title = soup.select('#mk-page-introduce > div > h1')[0].get_text()
title

'Codeup’s Data Science Career Accelerator is Here!'

In [6]:
# get title of web page using .find and .get_text()
# find returns only the first instance 

title = soup.find(class_='page-title').get_text()
title

'Codeup’s Data Science Career Accelerator is Here!'
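
To make the difference between the two lookups concrete, here is a short sketch on a hypothetical HTML string (the class names are invented). It also shows that `class_` matches an element carrying several classes, and that `find` returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = """
<div class="page-title large"><p>First</p><p>Second</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if the value is any one of the element's classes.
heading = soup.find("div", class_="page-title")

first_p = soup.find("p")    # only the first <p>
all_p = soup.find_all("p")  # every <p>, as a list

# find returns None when nothing matches, so guard before calling .get_text().
missing = soup.find("h1")
missing_text = missing.get_text() if missing is not None else ""
```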

In [7]:
# get content of web page

article = soup.find('div', class_='mk-single-content')
article

<div class="mk-single-content clearfix" itemprop="mainEntityOfPage">
<p>The rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in <strong><a href="https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm">Glassdoor’s #1 Best Job in America</a></strong>.</p>
<p><a href="https://tribucodeup.wpengine.com/what-is-data-science/"><strong>Data Science is a method of providing actionable intelligence from data.</strong></a> The data revolution has hit San Antonio, <strong><a href="https://www.indeed.com/jobs?q=Data+Scientist&amp;l=San+Antonio%2C+TX">resulting in an explosion in Data Scientist positions</a> </strong>across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen <strong><a href="https://therivardreport.com/utsa-lands-70m-for-cybersecurity-center-school-of-data-s

In [8]:
# store article text in a file

with open('article.txt', 'w') as f:
    f.write(article.text)

In [9]:
# Function that either reads data locally or goes to fetch data and saves it

def get_article_text():
    # if we already have the data, read it locally
    if os.path.exists('article.txt'):
        with open('article.txt') as f:
            return f.read()

    # otherwise go fetch the data
    url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'
    headers = {'User-Agent': 'Codeup Ada Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    article = soup.find('div', class_='mk-single-content')

    # save it for next time
    with open('article.txt', 'w') as f:
        f.write(article.text)

    return article.text
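
The read-locally-or-fetch pattern in this function can be separated from the scraping itself. A minimal sketch, assuming a hypothetical helper (`cached_text`) and a stand-in `fake_fetch` function in place of the real request:

```python
import os
import tempfile


def cached_text(path, fetch):
    """Return the text stored at `path`; if the file is missing, call `fetch()` and save the result."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text


# Stand-in for the real scraping function; it records how often it is called.
calls = []


def fake_fetch():
    calls.append(1)
    return "article text"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "article.txt")
    first = cached_text(path, fake_fetch)   # file missing: fetch is called
    second = cached_text(path, fake_fetch)  # file present: read from disk
```

Passing the fetch step in as a function keeps the caching logic reusable for other pages and makes it possible to try the pattern without network access.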

In [10]:
get_article_text()

'\nThe rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Stude
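
The `\xa0` sequences in this output are non-breaking spaces carried over from the page. A minimal sketch of two ways to clean them up, using a shortened, made-up sample string rather than the full article:

```python
# Shortened sample in the same shape as the article text above.
raw = "\nThe rumors are true!\xa0The time has arrived.\nData Science is a method.\xa0"

# Option 1: swap non-breaking spaces for regular ones and keep paragraph breaks.
keep_paragraphs = raw.replace("\xa0", " ").strip()

# Option 2: collapse all whitespace into single spaces.
# str.split() with no arguments splits on any Unicode whitespace, including \xa0.
one_line = " ".join(raw.split())
```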

# Dataquest Web Scraping Tutorial

## Practice using element inspector on site

In [11]:
url = 'https://www.retirement.org/mirabellaportland/healthcare/'

In [12]:
headers = {'User-Agent':'Data Science Student'}
response = get(url,headers=headers)

In [13]:
soup = BeautifulSoup(response.text, 'html.parser')

In [14]:
soup.select('body > table > tbody')

[]
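
An empty list means the selector did not match the HTML that requests received, which can differ from what the browser's inspector shows. One possible cause (an assumption, not verified for this page) is that browsers add a `<tbody>` element to tables in the inspector even when the raw HTML has none; content rendered by JavaScript is another. The sketch below reproduces the `<tbody>` case on a hypothetical table.

```python
from bs4 import BeautifulSoup

# Hypothetical raw HTML: the table has no <tbody>, as is common in page source.
html = "<html><body><table><tr><td>A</td><td>B</td></tr></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A selector copied from the inspector often includes tbody and finds nothing here.
strict = soup.select("body > table > tbody > tr")

# A looser descendant selector does not depend on the exact nesting.
loose = soup.select("table tr")
cells = [td.get_text() for td in loose[0].select("td")]
```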

In [15]:
button = soup.select('#content > div.elementor.elementor-17 > div > div > section > div > div > div > div > div > section.elementor-element.elementor-element-73eff40.elementor-section-boxed.elementor-section-height-default.elementor-section-height-default.elementor-section.elementor-inner-section > div > div > div.elementor-element.elementor-element-c266824.elementor-column.elementor-col-50.elementor-inner-column > div > div > div > div > div > div > div > a')[0].get_text()
button

'\n\t\t\t\t\t\tNeed care now? >>\t\t\t\t\t'

In [16]:
button = button.strip()
button

'Need care now? >>'
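
Long child-by-child selectors copied from the inspector break easily when the page layout changes. A minimal sketch of a shorter alternative, using hypothetical markup and invented class names (`cta-column`, `cta-button`); it also uses `get_text(strip=True)` to trim whitespace in the same call.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids and classes are invented for illustration.
html = """
<div id="content"><section><div><div class="cta-column">
  <a class="cta-button" href="/contact">
      Need care now? >>
  </a>
</div></div></section></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Full chain of child combinators, then a separate strip().
long_form = soup.select("#content > section > div > div.cta-column > a")[0].get_text().strip()

# Descendant selector anchored on an id and a class, stripped in one call.
short_form = soup.select_one("#content a.cta-button").get_text(strip=True)
```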

In [17]:
content = soup.h1.get_text()

In [18]:
content = content.strip()
content

'Healthcare'

In [19]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <title>
   Healthcare - Mirabella Portland
  </title>
  <!-- This site is optimized with the Yoast SEO plugin v12.6.2 - https://yoast.com/wordpress/plugins/seo/ -->
  <meta content="max-snippet:-1, max-image-preview:large, max-video-preview:-1" name="robots"/>
  <link href="https://www.retirement.org/mirabellaportland/healthcare/" rel="canonical"/>
  <meta content="en_US" property="og:locale"/>
  <meta content="article" property="og:type"/>
  <meta content="Healthcare - Mirabella Portland" property="og:title"/>
  <meta content="As a continuing care retirement community, Mirabella  Portland offers on-site healthcare services in addition to its independent living residences. From help with the activities of daily living to specialized nursing care, our continuum of care ensures you can live with peace of mind for the future knowing your needs will be taken care o

# Exercises

In [20]:
url = 'https://codeup.com/codeups-data-science-career-accelerator-is-here/'

headers = {'User-Agent': 'Codeup Bayes Data Science'} # codeup.com doesn't like our default user-agent
response = get(url, headers=headers)

In [21]:
urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
        'https://codeup.com/data-science-myths/',
        'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
        'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
        'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']

In [22]:
print(response.text[:400])

<!DOCTYPE html>
<html lang="en-US" >
<head>
		<meta charset="UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><meta name="format-detection" content="telephone=no"><title>Codeup’s Data Science Career Accelerator is Here! - Codeup</title>
<script type


In [23]:
# create soup object

# response.content (bytes) would also work; response.text is the decoded string
soup = BeautifulSoup(response.text, 'html.parser')

In [24]:
# Get title of web page

title = soup.find(class_='page-title').get_text()
title

'Codeup’s Data Science Career Accelerator is Here!'

In [25]:
# get content of web page

body = soup.find('div', class_='mk-single-content').get_text()
body

'\nThe rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Rackspace, along with input from dozens of practitioners and hiring partners. Stude

In [73]:
def make_dictionary_from_article(url):
    headers = {'User-Agent': 'Codeup Bayes Data Science'}
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.title.get_text()
    body = soup.find('div', class_='mk-single-content').get_text()
    
    output = {}
    output['title'] = title
    output['body'] = body
    
    return output
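
If a page has no `mk-single-content` div, `soup.find(...)` returns `None` and the `.get_text()` call raises an `AttributeError`. A minimal sketch of a variant that tolerates missing elements, assuming a hypothetical helper (`parse_article`) that takes HTML text directly so it can be tried without a request:

```python
from bs4 import BeautifulSoup


def parse_article(html):
    """Return the title and body text from an article's HTML, or None for a missing part."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    body_tag = soup.find("div", class_="mk-single-content")
    return {
        "title": title_tag.get_text(strip=True) if title_tag is not None else None,
        "body": body_tag.get_text().strip() if body_tag is not None else None,
    }


# Hypothetical pages: one with the expected div, one without it.
complete = parse_article(
    "<html><head><title>Post - Codeup</title></head>"
    '<body><div class="mk-single-content clearfix"><p>Hello</p></div></body></html>'
)
incomplete = parse_article(
    "<html><head><title>Other page</title></head><body></body></html>"
)
```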

In [74]:
make_dictionary_from_article(url)

{'title': 'Codeup’s Data Science Career Accelerator is Here! - Codeup',
 'body': '\nThe rumors are true! The time has arrived. Codeup has officially opened applications to our new Data Science career accelerator, with only 25 seats available! This immersive program is one of a kind in San Antonio, and will help you land a job in\xa0Glassdoor’s #1 Best Job in America.\nData Science is a method of providing actionable intelligence from data.\xa0The data revolution has hit San Antonio,\xa0resulting in an explosion in Data Scientist positions\xa0across companies like USAA, Accenture, Booz Allen Hamilton, and HEB. We’ve even seen\xa0UTSA invest $70 M for a Cybersecurity Center and School of Data Science.\xa0We built a program to specifically meet the growing demands of this industry.\nOur program will be 18 weeks long, full-time, hands-on, and project-based. Our curriculum development and instruction is led by Senior Data Scientist, Maggie Giust, who has worked at HEB, Capital Group, and Ra

In [75]:
def get_blog_articles():
    urls = ['https://codeup.com/codeups-data-science-career-accelerator-is-here/',
            'https://codeup.com/data-science-myths/',
            'https://codeup.com/data-science-vs-data-analytics-whats-the-difference/',
            'https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/',
            'https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/']
    output = []
    
    for url in urls:
        output.append(make_dictionary_from_article(url))
        
    df = pd.DataFrame(output)
    df.to_csv('codeup_blog_posts.csv')
    
    return df
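
Because `to_csv` writes the dataframe index by default, reading the file back produces an extra `Unnamed: 0` column. A small sketch of the round trip with and without the index, using made-up rows and in-memory buffers instead of files:

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped articles.
df = pd.DataFrame([
    {"title": "A", "body": "first"},
    {"title": "B", "body": "second"},
])

# Default behavior: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the columns survive the round trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)
```

An alternative is to keep the default `to_csv` call and read the file with `pd.read_csv(filename, index_col=0)`.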
        
    
    

In [76]:
get_blog_articles()

|   | body | title |
|---|------|-------|
| 0 | \nThe rumors are true! The time has arrived. C... | Codeup’s Data Science Career Accelerator is He... |
| 1 | \nBy Dimitri Antoniou and Maggie Giust\nData S... | Data Science Myths - Codeup |
| 2 | \nBy Dimitri Antoniou\nA week ago, Codeup laun... | Data Science VS Data Analytics: What’s The Dif... |
| 3 | \n10 Tips to Crush It at the SA Tech Job Fair\... | 10 Tips to Crush It at the SA Tech Job Fair - ... |
| 4 | \nCompetitor Bootcamps Are Closing. Is the Mod... | Competitor Bootcamps Are Closing. Is the Model... |


In [77]:
def get_article_text():
    # if we already have the data, read it locally

    filename = 'codeup_blog_posts.csv'

    if os.path.exists(filename):
        return pd.read_csv(filename)
    else:
        return get_blog_articles()

In [78]:
df = get_article_text()
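# 'Unnamed: 0' only exists when the data was read back from the CSV, so ignore it if absent
df = df.drop(columns='Unnamed: 0', errors='ignore')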

In [79]:
df.columns

Index(['body', 'title'], dtype='object')

In [80]:
df.head()

|   | body | title |
|---|------|-------|
| 0 | \nThe rumors are true! The time has arrived. C... | Codeup’s Data Science Career Accelerator is He... |
| 1 | \nBy Dimitri Antoniou and Maggie Giust\nData S... | Data Science Myths - Codeup |
| 2 | \nBy Dimitri Antoniou\nA week ago, Codeup laun... | Data Science VS Data Analytics: What’s The Dif... |
| 3 | \n10 Tips to Crush It at the SA Tech Job Fair\... | 10 Tips to Crush It at the SA Tech Job Fair - ... |
| 4 | \nCompetitor Bootcamps Are Closing. Is the Mod... | Competitor Bootcamps Are Closing. Is the Model... |
