# Mini Exercises

In [18]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## News Article Scraping

In [24]:
# make the HTTP request
response = requests.get('https://web-scraping-demo.zgulde.net/news')

In [31]:
# pull the text out of the response
html = response.text

In [30]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [32]:
# locate the articles using inspect feature of web browser to pull out desired parts of page using tags
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')

In [34]:
len(articles)

12
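
The selector above chains class names with dots, so it only matches elements that carry all four classes at once. The following is a minimal sketch on made-up markup (the `SAMPLE` string and the `demo_soup` name are illustrative assumptions, not the demo site's real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <div> carries all four classes.
SAMPLE = """
<div class="grid grid-cols-4 gap-x-4 border rounded">card one</div>
<div class="grid grid-cols-2 italic">not a card</div>
<div class="border">also not a card</div>
"""

demo_soup = BeautifulSoup(SAMPLE, 'html.parser')

# No spaces between the class names: every class must sit on the same element.
cards = demo_soup.select('.grid.grid-cols-4.gap-x-4.border')

# A single class is a looser match and also picks up unrelated elements.
grids = demo_soup.select('.grid')

print(len(cards), len(grids))
```

Listing several classes is a simple way to narrow a selection when no single class is unique to the elements you want.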

In [46]:
articles[0]

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">middle meeting cover</h2>
<div class="grid grid-cols-2 italic">
<p> 2020-04-28 </p>
<p class="text-right">By Charles Parker </p>
</div>
<p>Brother them grow feeling pull certain. Particular artist draw region.
Participant wide college someone information follow between. Can arm return media west or first. Natural once require occur full. Return official history modern pass rise talk.</p>
</div>
</div>

In [66]:
# get title
articles[0].h2.text

'middle meeting cover'

In [75]:
# get date and author
date, author = articles[0].select('.italic')[0].find_all('p')
date, author

(<p> 2020-04-28 </p>, <p class="text-right">By Charles Parker </p>)

In [71]:
# get paragraph
articles[0].find_all('p')[-1].text

'Brother them grow feeling pull certain. Particular artist draw region.\nParticipant wide college someone information follow between. Can arm return media west or first. Natural once require occur full. Return official history modern pass rise talk.'

In [76]:
# put into a function
def parse_news(article):
    title = article.h2.text
    date, author = article.select('.italic')[0].find_all('p')
    paragraph = article.find_all('p')[-1].text
    return {'title' : title, 'date' : date.text, 'author' : author.text, 'paragraph' : paragraph}

In [78]:
pd.DataFrame([parse_news(article) for article in articles])

|   | title | date | author | paragraph |
|---|---|---|---|---|
| 0 | middle meeting cover | 2020-04-28 | By Charles Parker | Brother them grow feeling pull certain. Partic... |
| 1 | bar beyond attack | 2006-04-03 | By Anthony Ray | Government wonder knowledge cost tonight. Ligh... |
| 2 | certainly half edge | 1972-02-05 | By Leah Coleman | Quality work condition war though certain. Fas... |
| 3 | behind car usually | 2008-02-10 | By Mark Johnson | Run certainly issue sense interview. For hundr... |
| 4 | institution toward knowledge | 1995-07-18 | By Andre Berger | Camera travel not white set evening. Usually t... |
| 5 | outside suggest experience | 2019-12-20 | By Paula Adams | Exist check particularly food need laugh. Arti... |
| 6 | loss south front | 1983-03-01 | By Ronald Alvarez | Painting former fish contain accept. Less char... |
| 7 | single whom find | 2003-09-14 | By Sarah Flores | Middle tough business.\nUs say moment general ... |
| 8 | every factor character | 1980-10-27 | By Samantha Foster | Generation difference live science trouble org... |
| 9 | practice number support | 1976-07-04 | By Jose Douglas | Service race poor part near later. Price Ameri... |
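
The raw tags shown earlier carry padding (`<p> 2020-04-28 </p>`) and a leading "By " on the author. Below is a sketch of a variant that tidies both, run against an inline copy of one card so it works without a network call. The name `parse_news_clean` and the trimmed `SAMPLE` markup are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of one article card from the page (paragraph text trimmed).
SAMPLE = """
<div class="grid grid-cols-4 gap-x-4 border rounded">
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">middle meeting cover</h2>
<div class="grid grid-cols-2 italic">
<p> 2020-04-28 </p>
<p class="text-right">By Charles Parker </p>
</div>
<p>Brother them grow feeling pull certain.</p>
</div>
</div>
"""

def parse_news_clean(article):
    """Like parse_news, but strips padding and the 'By ' author prefix."""
    date, author = article.select('.italic')[0].find_all('p')
    author_text = author.text.strip()
    if author_text.startswith('By '):
        author_text = author_text[len('By '):]
    return {
        'title': article.h2.text.strip(),
        'date': date.text.strip(),
        'author': author_text,
        'paragraph': article.find_all('p')[-1].text.strip(),
    }

sample_soup = BeautifulSoup(SAMPLE, 'html.parser')
sample_article = sample_soup.select('.grid.grid-cols-4.gap-x-4.border')[0]
record = parse_news_clean(sample_article)
print(record)
```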


## Contact Info Scraping

In [79]:
# make the HTTP request
response = requests.get('https://web-scraping-demo.zgulde.net/people')

In [80]:
# pull the text out of the response
html = response.text

In [81]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [84]:
people = soup.select('.person.border.rounded.px-3')

In [85]:
len(people)

10

In [89]:
# get name
people[0].h2.text

'Jacob Middleton'

In [145]:
# another way to get it
people[0].select('.name')[0].text

'Jacob Middleton'

In [124]:
# get tag line
people[0].p.text.strip()

'"Balanced high-level matrix"'

In [149]:
# another way to get it
people[0].select('.quote')[0].text.strip()

'"Balanced high-level matrix"'

In [102]:
# get email address
people[0].select('.grid.grid-cols-9')[0].p.text

'andrewrivera@hotmail.com'

In [152]:
# another way
people[0].select('.email')[0].text

'andrewrivera@hotmail.com'

In [110]:
# get phone number
people[0].select('.grid.grid-cols-9')[0].find_all('p')[1].text

'001-444-789-7536x68632'

In [153]:
people[0]

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jacob Middleton</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Balanced high-level matrix"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">andrewrivera@hotmail.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">001-444-789-7536x68632</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                1617 Lauren Track <br/>
                Spencerchester, OK 49034
            </p>
</div>
</div>

In [156]:
# another way
people[0].select('.phone')[0].text

'001-444-789-7536x68632'

In [131]:
# find address
address = people[0].select('.address')[0].p.text.strip()
address

'1617 Lauren Track \n                Spencerchester, OK 49034'

In [161]:
# another way
people[0].select('.address')[0].text.strip()

'1617 Lauren Track \n                Spencerchester, OK 49034'

In [132]:
import re

In [135]:
# use regex to clean up white spaces
re.sub(r'\s{2,}', ', ', address)

'1617 Lauren Track, Spencerchester, OK 49034'
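
The pattern `\s{2,}` works here because the line break is surrounded by indentation, which makes a run of two or more whitespace characters. If an address were split by a bare newline, that pattern would leave it untouched. One possible alternative, sketched below with a hypothetical helper name `clean_address`, keys on the newline itself:

```python
import re

def clean_address(raw):
    """Replace each line break, plus any whitespace around it, with ', '."""
    return re.sub(r'\s*\n\s*', ', ', raw.strip())

indented = '1617 Lauren Track \n                Spencerchester, OK 49034'
bare = 'Line one\nLine two'  # made-up input with a single newline and no padding

print(clean_address(indented))
print(clean_address(bare))

# For comparison: a lone newline is only one whitespace character,
# so the two-or-more pattern does not match it.
unchanged = re.sub(r'\s{2,}', ', ', bare)
print(repr(unchanged))
```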

In [162]:
def parse_people(person):
    name = person.h2.text
    tag_line = person.p.text.strip()
    email = person.select('.email')[0].text
    phone = person.select('.phone')[0].text
    address = person.select('.address')[0].text.strip()
    address = re.sub(r'\s{2,}', ', ', address)
    return {
        'name' : name,
        'tag_line' : tag_line,
        'email' : email,
        'phone' : phone,
        'address' : address
    }

In [163]:
pd.DataFrame([parse_people(person) for person in people])

|   | name | tag_line | email | phone | address |
|---|---|---|---|---|---|
| 0 | Jacob Middleton | "Balanced high-level matrix" | andrewrivera@hotmail.com | 001-444-789-7536x68632 | 1617 Lauren Track, Spencerchester, OK 49034 |
| 1 | Christopher Horne | "Profit-focused multimedia paradigm" | edwardsmith@stokes.info | 608.645.2865x43202 | 43375 Michael Fields Suite 347, South Richard,... |
| 2 | Jennifer Scott | "Multi-channeled intermediate workforce" | bobbycase@graham.net | 525.275.6300x27293 | 686 Joshua Brooks Suite 086, Stephanieville, A... |
| 3 | Dylan Ali | "Mandatory 5thgeneration encoding" | slogan@miller.com | 652.620.3299 | 068 Penny Lodge, East Kelly, GA 86120 |
| 4 | George Paul | "Fundamental foreground success" | xbass@becker.com | +1-830-063-7414x8384 | 35464 Lowe Ramp, Valeriefort, MS 02368 |
| 5 | Carol Bennett | "Streamlined multi-state Internet solution" | kimjocelyn@hess-lee.com | 188-734-8791 | 553 Pamela Village Apt. 692, Petersmouth, NJ 7... |
| 6 | Holly Munoz | "Enterprise-wide directional database" | ntorres@ayers.com | 714-977-7219x994 | 5765 Vasquez Glens Apt. 365, Maryshire, TN 13005 |
| 7 | Kevin Walters | "Business-focused explicit utilization" | georgecarpenter@oconnor-stewart.com | 7000363100 | 51171 Jones Springs, Rachelfort, NE 86265 |
| 8 | Brian Graves DDS | "Right-sized didactic middleware" | erictaylor@waters-bennett.net | 1835476156 | 860 Sandra Roads, Proctorview, NH 52889 |
| 9 | Nathaniel Jennings | "Progressive regional moratorium" | kathy53@hotmail.com | +1-650-388-5425x6533 | 7562 Peck Prairie, Walkerbury, NC 86953 |
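
Indexing with `[0]` raises an `IndexError` as soon as one card lacks a field. A more forgiving sketch uses Beautiful Soup's `select_one`, which returns `None` when nothing matches. The helper name `text_or_none` and the incomplete `SAMPLE` card are assumptions made up for this illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical card that is missing its phone number.
SAMPLE = """
<div class="person border rounded px-3">
<h2 class="name">Jacob Middleton</h2>
<p class="email col-span-8">andrewrivera@hotmail.com</p>
</div>
"""

def text_or_none(tag, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = tag.select_one(selector)
    return found.get_text().strip() if found is not None else None

sample_person = BeautifulSoup(SAMPLE, 'html.parser').select_one('.person')
contact = {
    'name': text_or_none(sample_person, '.name'),
    'email': text_or_none(sample_person, '.email'),
    'phone': text_or_none(sample_person, '.phone'),  # absent in SAMPLE
}
print(contact)
```

Missing values then show up as `None` (and later as `NaN` in a DataFrame) instead of stopping the whole scrape.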


# Exercises

By the end of this exercise, you should have a file named acquire.py that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the end function should be present in `acquire.py` (that is, `acquire.py` should `import get_blog_articles` from the `acquire_codeup_blog` module.)

## Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's `title` and `content`.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

`{'title': 'the title of the article', 'content': 'the full text content of the article'}`

Plus any additional properties you think might be helpful.

*Bonus: Scrape the text of all the articles linked on codeup's blog page.*

In [173]:
# make the HTTP request
url = 'https://codeup.com/data-science/why-you-should-become-a-data-scientist/'
headers = {'User-Agent': 'Codeup Data Science Germain Cohort'} # Some websites don't accept the python-requests default user-agent
response = requests.get(url, headers=headers)

In [175]:
# pull the text out of the response
html = response.text

In [176]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [186]:
# let's first pull out section we need (might be too much)
body = soup.select('.et_pb_column.et_pb_column_3_4')

In [191]:
# let's just grab the title
soup.select('.et_pb_title_container')[0].h1.text

'Why You Should Become a Data Scientist'

In [203]:
# let's get the text of the article
len(soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p'))

18

In [207]:
# get list of all paragraphs
paragraphs = soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')

In [208]:
len(paragraphs)

18

In [206]:
# let's look at one paragraph
soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')[0].text

'What do you look for in a career? Chances are, you’re looking for a way to make use of your particular talents, a field that’s secure and reliable, a work/life balance, and good compensation. For the right people, data science offers all of that and more! It was LinkedIn’s #1 Most Promising Job in 2019, and Glassdoor’s 2nd Best Job of 2021! Actually, Data Scientists topped that list from 2016 to 2019 before being dethroned by developers, which we also train at Codeup. So why all the hype? What makes it the best? Keep reading to learn why you should become a Data Scientist!'

In [None]:
# create a large block of text by combining all paragraphs
paragraphs = soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')
# join the text of every paragraph, separated by spaces
all_text = ' '.join(p.text for p in paragraphs)
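
The steps explored above could be gathered into the requested `get_blog_articles` function roughly as follows. This is a sketch, not a finished solution: it reuses the two selectors found above, while the `urls` parameter and the `parse_blog_html` helper are assumptions introduced so the parsing can be tried without a network call.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Codeup Data Science Germain Cohort'}

def parse_blog_html(html):
    """Extract the title and full text from one blog post's HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select('.et_pb_title_container')[0].h1.text.strip()
    paragraphs = soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')
    content = ' '.join(p.text.strip() for p in paragraphs)
    return {'title': title, 'content': content}

def get_blog_articles(urls):
    """Fetch each URL and return one dictionary per article."""
    articles = []
    for url in urls:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()  # stop early on a 4xx/5xx response
        articles.append(parse_blog_html(response.text))
    return articles

# Offline check on a tiny hand-written page that mimics the two containers.
sample_html = (
    '<div class="et_pb_title_container"><h1> Sample Title </h1></div>'
    '<div class="et_pb_post_content_0_tb_body"><p>First.</p><p>Second.</p></div>'
)
print(parse_blog_html(sample_html))
```

Keeping the parsing separate from the request makes it easy to test the selectors on saved HTML before fetching several posts.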
    

## News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

1. Start by inspecting the website in your browser. Figure out which elements will be useful.
2. Start by creating a function that handles a single article and produces a dictionary like the one above.
3. Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
4. Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
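
The three-function layout described in the hints might be sketched like this. Everything site-specific is a placeholder: the URL pattern and all three CSS selectors are assumptions that must be replaced with what you find when inspecting the page, and the `fetch` parameter is an addition that lets the structure be tried on fake HTML.

```python
import requests
from bs4 import BeautifulSoup

CATEGORIES = ['business', 'sports', 'technology', 'entertainment']

# ASSUMPTIONS: verify the URL pattern and selectors in your browser first.
BASE_URL = 'https://inshorts.com/en/read/'
CARD_SELECTOR = '.news-card'        # hypothetical
TITLE_SELECTOR = '.news-title'      # hypothetical
CONTENT_SELECTOR = '.news-content'  # hypothetical

def parse_article(card, category):
    """Turn a single article element into a dictionary."""
    return {
        'title': card.select_one(TITLE_SELECTOR).get_text().strip(),
        'content': card.select_one(CONTENT_SELECTOR).get_text().strip(),
        'category': category,
    }

def parse_page(html, category):
    """Find every article on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    return [parse_article(card, category) for card in soup.select(CARD_SELECTOR)]

def fetch_html(url):
    """Download a page and return its HTML text."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text

def get_news_articles(fetch=fetch_html):
    """Scrape every category page and combine the results into one list."""
    articles = []
    for category in CATEGORIES:
        articles.extend(parse_page(fetch(BASE_URL + category), category))
    return articles

# Offline demonstration with a fake page that uses the placeholder classes.
def fake_fetch(url):
    return ('<div class="news-card"><span class="news-title"> T </span>'
            '<div class="news-content"> C </div></div>')

demo_articles = get_news_articles(fetch=fake_fetch)
print(demo_articles)
```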

## Bonus

Bonus: cache the data

Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (re-writing your local cache).
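
One way to approach the caching bonus is a small wrapper around any acquire function, storing the list of dictionaries as JSON. This is only a sketch: the helper name `load_or_acquire`, the JSON format, and the file name in the demonstration are all assumptions, and a CSV or pickle file would serve equally well.

```python
import json
import os
import tempfile

def load_or_acquire(path, acquire, fresh=False):
    """Return cached records from `path`, or call `acquire()` and cache the result.

    With fresh=True the cache is ignored and rewritten.
    """
    if not fresh and os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    records = acquire()
    with open(path, 'w') as f:
        json.dump(records, f)
    return records

# Demonstration with a stand-in acquire function that counts its calls.
calls = []

def fake_acquire():
    calls.append(1)
    return [{'title': 'example', 'content': 'text'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'articles.json')
first = load_or_acquire(cache_path, fake_acquire)               # acquires and writes
second = load_or_acquire(cache_path, fake_acquire)              # reads the cache
third = load_or_acquire(cache_path, fake_acquire, fresh=True)   # forces a refresh
print(len(calls))
```

In `acquire.py`, a call such as `load_or_acquire('news_articles.json', get_news_articles)` would then only hit the website when the file is missing or `fresh=True` is passed.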