# Mini Exercises

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## News Article Scraping

In [2]:
# make the HTTP request
response = requests.get('https://web-scraping-demo.zgulde.net/news')

In [3]:
# pull the text out of the response
html = response.text

In [4]:
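# pass HTML into Beautiful Soup, naming the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')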

In [5]:
# locate the articles: use the browser's inspect tool to find the CSS classes shared by each article card
articles = soup.select('.grid.grid-cols-4.gap-x-4.border')

In [6]:
len(articles)

12

In [7]:
articles[0]

<div class="grid grid-cols-4 gap-x-4 border rounded pr-3 bg-green-50 hover:shadow-lg transition duration-500">
<img src="/static/placeholder.png"/>
<div class="col-span-3 space-y-3 py-3">
<h2 class="text-2xl text-green-900">specific clear box</h2>
<div class="grid grid-cols-2 italic">
<p> 1970-01-02 </p>
<p class="text-right">By Meagan Stuart </p>
</div>
<p>Evidence recent politics continue. Herself team value. Important often water we writer find.
Television remember begin appear trial. Fill Republican design court since big media try.</p>
</div>
</div>

In [8]:
# get title
articles[0].h2.text

'specific clear box'

In [9]:
# get date and author
date, author = articles[0].select('.italic')[0].find_all('p')
date, author

(<p> 1970-01-02 </p>, <p class="text-right">By Meagan Stuart </p>)

In [10]:
# get paragraph
articles[0].find_all('p')[-1].text

'Evidence recent politics continue. Herself team value. Important often water we writer find.\nTelevision remember begin appear trial. Fill Republican design court since big media try.'

In [11]:
# put into a function
def parse_news(article):
    title = article.h2.text
    date, author = article.select('.italic')[0].find_all('p')
    paragraph = article.find_all('p')[-1].text
    return {'title' : title, 'date' : date.text, 'author' : author.text, 'paragraph' : paragraph}

In [12]:
pd.DataFrame([parse_news(article) for article in articles])

Unnamed: 0,title,date,author,paragraph
0,specific clear box,1970-01-02,By Meagan Stuart,Evidence recent politics continue. Herself tea...
1,item member outside,1988-12-10,By William Berry,Whom kid Congress exactly. Value dark hope per...
2,them example walk,1998-11-15,By Joshua Castillo,Condition cover include author help civil unti...
3,or church when,2017-05-26,By Brandon Spears,Politics despite cell look.\nTask live decisio...
4,join weight eye,1984-07-20,By Kiara Parker,Half face choice. Tell popular sing billion fe...
5,TV poor skill,2004-02-06,By Heather Davis,Various hotel seem fear who rich doctor. Towar...
6,east pay plan,1976-12-27,By Marc Kelley,Good personal risk expert forget. Hot perform ...
7,million positive century,1971-08-02,By Mrs. Cheryl Wilson,Phone over couple even environmental model amo...
8,five direction see,2005-09-08,By Ryan Carter,Walk strong need century call partner. Radio e...
9,reason particularly military,1980-11-22,By Lisa Lyons,Himself different away explain. Son language e...
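
The extracted fields still carry stray whitespace and the literal `By ` prefix on the author. A minimal sketch of a cleaned-up variant follows; `parse_news_clean` and `SAMPLE_ARTICLE` are hypothetical names, and the inline HTML is a trimmed copy of the first article shown above so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first article card, used as offline test input.
SAMPLE_ARTICLE = """
<div class="grid grid-cols-4 gap-x-4 border">
  <div class="col-span-3 space-y-3 py-3">
    <h2 class="text-2xl text-green-900">specific clear box</h2>
    <div class="grid grid-cols-2 italic">
      <p> 1970-01-02 </p>
      <p class="text-right">By Meagan Stuart </p>
    </div>
    <p>Evidence recent politics continue.</p>
  </div>
</div>
"""

def parse_news_clean(article):
    """Like parse_news, but strips whitespace and drops the 'By ' prefix."""
    title = article.h2.text.strip()
    date_tag, author_tag = article.select_one('.italic').find_all('p')
    author = author_tag.text.strip()
    if author.startswith('By '):
        author = author[len('By '):]
    paragraph = article.find_all('p')[-1].text.strip()
    return {
        'title': title,
        'date': date_tag.text.strip(),
        'author': author,
        'paragraph': paragraph,
    }

article = BeautifulSoup(SAMPLE_ARTICLE, 'html.parser').div
result = parse_news_clean(article)
print(result)
```

Stripping at parse time keeps the DataFrame columns ready for later steps such as converting `date` with `pd.to_datetime`.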


## Contact Info Scraping

In [13]:
# make the HTTP request
response = requests.get('https://web-scraping-demo.zgulde.net/people')

In [14]:
# pull the text out of the response
html = response.text

In [15]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [16]:
people = soup.select('.person.border.rounded.px-3')

In [17]:
len(people)

10

In [18]:
# get name
people[0].h2.text

'Jeffrey Garcia'

In [19]:
# another way to get it
people[0].select('.name')[0].text

'Jeffrey Garcia'

In [20]:
# get tag line
people[0].p.text.strip()

'"Optimized discrete encryption"'

In [21]:
# another way to get it
people[0].select('.quote')[0].text.strip()

'"Optimized discrete encryption"'

In [22]:
# get email address
people[0].select('.grid.grid-cols-9')[0].p.text

'pattersonstephen@walker-brown.com'

In [23]:
# another way
people[0].select('.email')[0].text

'pattersonstephen@walker-brown.com'

In [24]:
# get phone number
people[0].select('.grid.grid-cols-9')[0].find_all('p')[1].text

'294-311-6002x7020'

In [25]:
people[0]

<div class="person border rounded px-3 py-5 grid grid-cols-2 gap-x-3 bg-purple-50 hover:shadow-lg transition duration-500">
<h2 class="text-2xl text-purple-800 name col-span-full border-b">Jeffrey Garcia</h2>
<p class="quote col-span-full px-5 py-5 text-center text-gray-500">
            "Optimized discrete encryption"
        </p>
<div class="grid grid-cols-9">
<i class="bi bi-envelope-fill text-purple-800"></i>
<p class="email col-span-8">pattersonstephen@walker-brown.com</p>
<i class="bi bi-telephone-fill text-purple-800"></i>
<p class="phone col-span-8">294-311-6002x7020</p>
</div>
<div class="address grid grid-cols-9">
<i class="bi bi-geo-fill text-purple-800"></i>
<p class="col-span-8">
                456 Cox Drives <br/>
                Changstad, OR 64982
            </p>
</div>
</div>

In [26]:
# another way
people[0].select('.phone')[0].text

'294-311-6002x7020'

In [27]:
# find address
address = people[0].select('.address')[0].p.text.strip()
address

'456 Cox Drives \n                Changstad, OR 64982'

In [28]:
# another way
people[0].select('.address')[0].text.strip()

'456 Cox Drives \n                Changstad, OR 64982'

In [29]:
import re

In [30]:
# use regex to clean up white spaces
re.sub(r'\s{2,}', ', ', address)

'456 Cox Drives, Changstad, OR 64982'
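
The pattern `\s{2,}` matches any run of two or more whitespace characters, so the single spaces inside each address line are left alone while the newline-plus-indentation run is replaced. A small sketch comparing it with a line-based alternative (the helper names are hypothetical):

```python
import re

raw = '456 Cox Drives \n                Changstad, OR 64982'

def clean_with_regex(text):
    # Replace each run of 2+ whitespace characters with a comma and a space.
    return re.sub(r'\s{2,}', ', ', text.strip())

def clean_with_splitlines(text):
    # Alternative: split on line breaks, trim each piece, drop empty pieces.
    parts = [part.strip() for part in text.splitlines()]
    return ', '.join(part for part in parts if part)

regex_result = clean_with_regex(raw)
lines_result = clean_with_splitlines(raw)
print(regex_result)
print(lines_result)
```

The two approaches agree on this input. The line-based version may be easier to reason about if an address ever contains a deliberate double space within a line.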

In [31]:
def parse_people(person):
    name = person.h2.text
    tag_line = person.p.text.strip()
    email = person.select('.email')[0].text
    phone = person.select('.phone')[0].text
    address = person.select('.address')[0].text.strip()
    address = re.sub(r'\s{2,}', ', ', address)
    return {
        'name' : name,
        'tag_line' : tag_line,
        'email' : email,
        'phone' : phone,
        'address' : address
    }

In [32]:
pd.DataFrame([parse_people(person) for person in people])

Unnamed: 0,name,tag_line,email,phone,address
0,Jeffrey Garcia,"""Optimized discrete encryption""",pattersonstephen@walker-brown.com,294-311-6002x7020,"456 Cox Drives, Changstad, OR 64982"
1,Ashley Curtis,"""Automated uniform support""",jill43@hotmail.com,573-385-6306,"492 Hill Rapid, East Michael, LA 20088"
2,Jeffrey Smith,"""Team-oriented impactful paradigm""",ucollins@hotmail.com,580-649-1752x54465,"135 Richardson Groves Apt. 236, Blevinsbury, N..."
3,Alejandro Smith,"""Universal 4thgeneration benchmark""",erictownsend@yahoo.com,999-308-9772x0808,"97757 Lynn Throughway Suite 082, Morrisonside,..."
4,Brian Bowers,"""Triple-buffered systematic protocol""",sarahburton@black.com,+1-327-770-7479x399,"950 Morrison Light, North Derrickport, IL 72442"
5,Andrea Jones,"""Advanced multi-state capability""",vazquezjeffery@ryan-richards.info,001-332-420-4827x867,"78941 Kyle Expressway, Thomasland, AZ 21475"
6,Rodney Luna,"""Programmable 3rdgeneration utilization""",oschneider@gmail.com,878.359.0467x29797,"35278 Oconnor Canyon, North Michelle, PA 53985"
7,Michael Howard,"""Balanced well-modulated Graphic Interface""",mendezsherry@gmail.com,+1-787-310-9526x210,"3820 Julie Crossing Suite 926, East Elizabethp..."
8,Christopher Rodriguez,"""Synergized impactful knowledgebase""",lelisa@gmail.com,+1-654-097-3805x120,"741 Kline Mission Apt. 088, Fischerhaven, NY 7..."
9,Jamie Blair,"""Fully-configurable cohesive archive""",vasquezjack@lewis.com,(252)196-3780x140,"863 Jacob Port Apt. 872, Monteston, UT 63018"
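
`parse_people` indexes `select(...)[0]`, which raises an `IndexError` if a card is missing a field. One possible guard, sketched below, uses `select_one`, which returns `None` when nothing matches. The helper `text_or_none` and the trimmed `SAMPLE_PERSON` markup (which deliberately omits the phone number) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# A trimmed person card with no .phone element, to exercise the missing case.
SAMPLE_PERSON = """
<div class="person border rounded px-3">
  <h2 class="name">Jeffrey Garcia</h2>
  <p class="email col-span-8">pattersonstephen@walker-brown.com</p>
</div>
"""

def text_or_none(tag, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    match = tag.select_one(selector)
    return match.text.strip() if match is not None else None

person = BeautifulSoup(SAMPLE_PERSON, 'html.parser').select_one('.person')
record = {field: text_or_none(person, f'.{field}') for field in ('name', 'email', 'phone')}
print(record)
```

pandas represents a `None` value as a missing entry, so a record like this can still go into the DataFrame.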


# Exercises

By the end of this exercise, you should have a file named `acquire.py` that contains the specified functions. If you wish, you may break your work into separate files for each website (e.g. `acquire_codeup_blog.py` and `acquire_news_articles.py`), but the end function should be present in `acquire.py` (that is, `acquire.py` should import `get_blog_articles` from the `acquire_codeup_blog` module).

## Codeup Blog Articles

Visit Codeup's blog and record the URLs for at least 5 distinct blog posts. For each post, you should scrape at least the post's `title` and `content`.

Encapsulate your work in a function named `get_blog_articles` that will return a list of dictionaries, with each dictionary representing one article. The shape of each dictionary should look like this:

`{'title': 'the title of the article', 'content': 'the full text content of the article'}`

Plus any additional properties you think might be helpful.

*Bonus: Scrape the text of all the articles linked on Codeup's blog page.*

In [33]:
# make the HTTP request
url = 'https://codeup.com/data-science/why-you-should-become-a-data-scientist/'
headers = {'User-Agent': 'Codeup Data Science Germain Cohort'} # Some websites don't accept the python-requests default user-agent
response = requests.get(url, headers=headers)

In [34]:
# pull the text out of the response
html = response.text

In [35]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [36]:
# let's just grab the title
soup.select('.et_pb_title_container')[0].h1.text

'Why You Should Become a Data Scientist'

In [37]:
# let's get the text of the article
len(soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p'))

18

In [38]:
# get list of all paragraphs
paragraphs = soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')

In [39]:
len(paragraphs)

18

In [40]:
# let's look at one paragraph
paragraphs[0].text

'What do you look for in a career? Chances are, you’re looking for a way to make use of your particular talents, a field that’s secure and reliable, a work/life balance, and good compensation. For the right people, data science offers all of that and more! It was LinkedIn’s #1 Most Promising Job in 2019, and Glassdoor’s 2nd Best Job of 2021! Actually, Data Scientists topped that list from 2016 to 2019 before being dethroned by developers, which we also train at Codeup. So why all the hype? What makes it the best? Keep reading to learn why you should become a Data Scientist!'

In [41]:
# create a large block of text by combining all paragraphs
all_text = ''
for paragraph in paragraphs:
    all_text += (paragraph.text)

In [42]:
# compile into function
def get_blog(url, user_agent):
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select('.et_pb_title_container')[0].h1.text
    paragraphs = soup.select('.et_pb_post_content_0_tb_body')[0].find_all('p')
    all_text = ''
    for paragraph in paragraphs:
        all_text += (paragraph.text)
    return({
        'title' : title,
        'content' : all_text
    })

In [43]:
# get_blog('https://codeup.com/data-science/why-you-should-become-a-data-scientist/', 'Codeup Data Science Germain')
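
Note that `all_text += paragraph.text` joins paragraphs with no separator, so the last word of one paragraph runs into the first word of the next. A minimal sketch of the parsing step with an explicit separator follows; `parse_blog_html` and `SAMPLE_POST` are hypothetical, and the sample markup only mimics the two class names used above.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for a blog post page, reusing the class names selected above.
SAMPLE_POST = """
<div class="et_pb_title_container"><h1>Sample Title</h1></div>
<div class="et_pb_post_content_0_tb_body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

def parse_blog_html(html):
    """Parse an already-downloaded post into a title/content dictionary."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('.et_pb_title_container').h1.text.strip()
    paragraphs = soup.select_one('.et_pb_post_content_0_tb_body').find_all('p')
    # Join with a blank line so paragraph boundaries survive in the text.
    content = '\n\n'.join(p.text.strip() for p in paragraphs)
    return {'title': title, 'content': content}

post = parse_blog_html(SAMPLE_POST)
print(post)
```

Separating parsing from downloading also makes the function testable without hitting the site.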

In [44]:
# repeat this for 5 articles
urls = [
    'https://codeup.com/data-science/why-you-should-become-a-data-scientist/',
    'https://codeup.com/data-science/math-in-data-science/',
    'https://codeup.com/data-science/transition-into-data-science/',
    'https://codeup.com/data-science/data-science-career/',
    'https://codeup.com/data-science/what-is-python/'
]

In [45]:
urls

['https://codeup.com/data-science/why-you-should-become-a-data-scientist/',
 'https://codeup.com/data-science/math-in-data-science/',
 'https://codeup.com/data-science/transition-into-data-science/',
 'https://codeup.com/data-science/data-science-career/',
 'https://codeup.com/data-science/what-is-python/']

In [46]:
blogs = pd.DataFrame([get_blog(url, 'Codeup Data Science Germain') for url in urls])

In [47]:
blogs

Unnamed: 0,title,content
0,Why You Should Become a Data Scientist,"What do you look for in a career? Chances are,..."
1,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
2,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
3,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
4,What is Python?,If you’ve been digging around our website or r...


In [48]:
def get_blog_articles(url_list, user_agent):
    return pd.DataFrame([get_blog(url, user_agent) for url in url_list])

In [49]:
get_blog_articles(urls, 'Codeup Data Science Germain')

Unnamed: 0,title,content
0,Why You Should Become a Data Scientist,"What do you look for in a career? Chances are,..."
1,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
2,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
3,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
4,What is Python?,If you’ve been digging around our website or r...


In [50]:
# let's see if I can isolate all urls from main page
# soup.find_all('a')
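
One way the link-collection idea could be sketched: gather every `<a>` tag that has an `href`, resolve relative paths against a base URL, and drop duplicates while keeping order. The function `extract_links`, the sample markup, and the base URL `https://codeup.com/blog/` are assumptions for illustration; a real blog page would also need filtering to keep only article links.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical index-page fragment with a relative, an absolute,
# a duplicate, and an href-less anchor.
SAMPLE_INDEX = """
<a href="/data-science/math-in-data-science/">Math</a>
<a href="https://codeup.com/data-science/what-is-python/">Python</a>
<a href="/data-science/math-in-data-science/">Math again</a>
<a>No href</a>
"""

def extract_links(html, base_url):
    """Return absolute, de-duplicated link targets in page order."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for anchor in soup.find_all('a', href=True):  # skip anchors without href
        url = urljoin(base_url, anchor['href'])
        if url not in links:
            links.append(url)
    return links

links = extract_links(SAMPLE_INDEX, 'https://codeup.com/blog/')
print(links)
```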

## News Articles

We will now be scraping text data from inshorts, a website that provides a brief overview of many different topics.

Write a function that scrapes the news articles for the following topics:

- Business
- Sports
- Technology
- Entertainment

The end product of this should be a function named `get_news_articles` that returns a list of dictionaries, where each dictionary has this shape:

{
    'title': 'The article title',
    'content': 'The article content',
    'category': 'business' # for example
}

Hints:

- Start by inspecting the website in your browser. Figure out which elements will be useful.
- Start by creating a function that handles a single article and produces a dictionary like the one above.
- Next create a function that will find all the articles on a single page and call the function you created in the last step for every article on the page.
- Now create a function that will use the previous two functions to scrape the articles from all the pages that you need, and do any additional processing that needs to be done.
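
The three steps in the hints can be sketched as three small functions. This is only an outline under stated assumptions: the class names `.news-card`, `.news-card-title`, and `.news-card-content` are assumptions about the page markup, and `fetch_html` is a hypothetical callable that returns a page's HTML for a category, passed in so the example can run against fake data instead of the live site.

```python
from bs4 import BeautifulSoup

def parse_article(card, category):
    """Step 1: turn one article card into a dictionary."""
    return {
        'title': card.select_one('.news-card-title').text.strip(),
        'content': card.select_one('.news-card-content').text.strip(),
        'category': category,
    }

def parse_page(html, category):
    """Step 2: find every card on one page and parse each of them."""
    soup = BeautifulSoup(html, 'html.parser')
    return [parse_article(card, category) for card in soup.select('.news-card')]

def get_news_articles(categories, fetch_html):
    """Step 3: loop over the categories and collect all articles."""
    articles = []
    for category in categories:
        articles.extend(parse_page(fetch_html(category), category))
    return articles

def fake_fetch(category):
    # Stand-in for an HTTP request; returns one card per category.
    return (
        '<div class="news-card">'
        f'<div class="news-card-title"><span>{category} headline</span></div>'
        f'<div class="news-card-content"><div>{category} body</div></div>'
        '</div>'
    )

articles = get_news_articles(['business', 'sports'], fake_fetch)
print(articles)
```

In real use, `fetch_html` would wrap `requests.get` with the desired headers and return `response.text`.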

In [51]:
# make the HTTP request
url = 'https://inshorts.com/en/read/business'
headers = {'User-Agent': 'Codeup Data Science Germain Cohort'} # Some websites don't accept the python-requests default user-agent
response = requests.get(url, headers=headers)

In [52]:
# pull the text out of the response
html = response.text

In [53]:
# pass HTML into Beautiful Soup, naming the parser explicitly
soup = BeautifulSoup(html, 'html.parser')

In [54]:
# first, I will grab each card
articles = soup.select('.news-card.z-depth-1')

In [55]:
len(articles)

25

In [56]:
# let's just grab the title
articles[0].select('.news-card-title')[0].span.text

"India's Covaxin may get WHO approval in next 24 hours: Spokesperson"

In [57]:
# let's look at one paragraph
articles[0].select('.news-card-content')[0].div.text

'A WHO technical advisory group which met on Tuesday to consider Bharat Biotech\'s COVID-19 vaccine Covaxin for emergency use listing is likely to announce its decision soon. "If all is in place and all goes well and if the committee is satisfied, we would expect a recommendation within the next 24 hours or so," WHO spokesperson Margaret Harris told reporters. '

In [58]:
# create the function to just pull out title, short, and topic
def parse_card(article, topic):
    title = article.select('.news-card-title')[0].span.text
    content = article.select('.news-card-content')[0].div.text
    return {
        'title' : title,
        'content' : content,
        'topic' : topic
    }

In [59]:
# define a function that gets all article info for articles on that page
def get_article_info(base_url, topic, user_agent):
    url = base_url + topic
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    articles = soup.select('.news-card.z-depth-1')
    return pd.DataFrame([parse_card(article, topic) for article in articles])

In [60]:
get_article_info(base_url = 'https://inshorts.com/en/read/', topic = 'business', user_agent = 'Codeup Data Science Germain Cohort')

Unnamed: 0,title,content,topic
0,India's Covaxin may get WHO approval in next 2...,A WHO technical advisory group which met on Tu...,business
1,Which companies have $1 trillion or more marke...,Tesla has become the latest company to surpass...,business
2,I decided to support Doge as it felt like the ...,Tesla CEO and the world's richest person Elon ...,business
3,How many years did it take for various compan...,Tesla took 18 years to hit the $1-trillion m-c...,business
4,Elon Musk tweets 'Wild $T1mes' after Tesla hit...,Tesla CEO and the world's richest person Elon ...,business
5,Govt sends 202 notices to e-comm firms in a ye...,The government issued 202 notices to e-commerc...,business
6,Policybazaar fixes price band for its IPO at ₹...,"PB Fintech, the operator of online insurance a...",business
7,BharatPe opposes PhonePe's trademarks for usin...,BharatPe's holding firm Resilient Innovations ...,business
8,"India spending ₹8 lakh cr on petrol, diesel im...",Road Transport Minister Nitin Gadkari said tha...,business
9,"Govt to pitch to Tesla, Samsung for local batt...",The government is planning to pitch to compani...,business


In [61]:
# list of pages that I want to scrape
pages = [
    'business',
    'sports',
    'technology',
    'entertainment'
]

In [62]:
final_df = pd.DataFrame()
for page in pages:
    df = get_article_info(base_url = 'https://inshorts.com/en/read/', topic = page, user_agent = 'Codeup Data Science Germain Cohort')
    final_df = pd.concat([final_df, df], ignore_index=True)

In [63]:
final_df

Unnamed: 0,title,content,topic
0,India's Covaxin may get WHO approval in next 2...,A WHO technical advisory group which met on Tu...,business
1,Which companies have $1 trillion or more marke...,Tesla has become the latest company to surpass...,business
2,I decided to support Doge as it felt like the ...,Tesla CEO and the world's richest person Elon ...,business
3,How many years did it take for various compan...,Tesla took 18 years to hit the $1-trillion m-c...,business
4,Elon Musk tweets 'Wild $T1mes' after Tesla hit...,Tesla CEO and the world's richest person Elon ...,business
...,...,...,...
95,"Snoop Dogg's mother passes away, rapper posts ...",Rapper Snoop Dogg on Sunday revealed that his ...,entertainment
96,"Jr NTR's fan injured in accident, actor helps ...",A fan of Telugu actor Jr NTR was injured in a ...,entertainment
97,Web is more streamlined and fair as compared t...,"Actress Kritika Kamra, who was seen in the web...",entertainment
98,I'm so overwhelmed: B Praak on winning Nationa...,"Singer B Praak, speaking about winning the Nat...",entertainment


In [64]:
# compile into a final function
def get_articles(topic_list, base_url, user_agent):
    final_df = pd.DataFrame()
    for topic in topic_list:
        df = get_article_info(base_url = base_url, topic = topic, user_agent = user_agent)
        final_df = pd.concat([final_df, df], ignore_index=True)
    return final_df
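
Calling `pd.concat` inside the loop copies the accumulated frame on every pass. A common alternative, sketched here with a hypothetical helper `combine_frames` and made-up rows, is to collect the per-topic frames in a list and concatenate once at the end.

```python
import pandas as pd

def combine_frames(frames):
    """Concatenate per-topic frames in one call, renumbering the index."""
    frames = list(frames)
    if not frames:
        # pd.concat raises ValueError on an empty list, so handle it here.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

# Made-up stand-ins for the frames returned by get_article_info.
business = pd.DataFrame([{'title': 'a', 'topic': 'business'}])
sports = pd.DataFrame([
    {'title': 'b', 'topic': 'sports'},
    {'title': 'c', 'topic': 'sports'},
])

combined = combine_frames([business, sports])
print(combined)
```

For four topics the difference is negligible, but the pattern scales better when many pages are scraped.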

In [65]:
df = get_articles(topic_list=pages, base_url='https://inshorts.com/en/read/', user_agent='Codeup Data Science Germain Cohort')

In [66]:
df.shape

(100, 3)

In [67]:
df.head()

Unnamed: 0,title,content,topic
0,India's Covaxin may get WHO approval in next 2...,A WHO technical advisory group which met on Tu...,business
1,Which companies have $1 trillion or more marke...,Tesla has become the latest company to surpass...,business
2,I decided to support Doge as it felt like the ...,Tesla CEO and the world's richest person Elon ...,business
3,How many years did it take for various compan...,Tesla took 18 years to hit the $1-trillion m-c...,business
4,Elon Musk tweets 'Wild $T1mes' after Tesla hit...,Tesla CEO and the world's richest person Elon ...,business


In [68]:
# import and test out functions
import acquire as a

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [69]:
urls = [
    'https://codeup.com/data-science/why-you-should-become-a-data-scientist/',
    'https://codeup.com/data-science/math-in-data-science/',
    'https://codeup.com/data-science/transition-into-data-science/',
    'https://codeup.com/data-science/data-science-career/',
    'https://codeup.com/data-science/what-is-python/'
]

In [70]:
a.get_blog_articles(urls, 'Codeup Data Science Germain Cohort')

Unnamed: 0,title,content
0,Why You Should Become a Data Scientist,"What do you look for in a career? Chances are,..."
1,What are the Math and Stats Principles You Nee...,"Coming into our Data Science program, you will..."
2,What is the Transition into Data Science Like?,Alumni Katy Salts and Brandi Reger joined us a...
3,What Data Science Career is For You?,If you’re struggling to see yourself as a data...
4,What is Python?,If you’ve been digging around our website or r...


In [71]:
topics = [
    'business',
    'sports',
    'technology',
    'entertainment'
]

In [72]:
a.get_articles(topics, 'https://inshorts.com/en/read/', 'Codeup Data Science Germain Cohort')

Unnamed: 0,title,content,topic
0,India's Covaxin may get WHO approval in next 2...,A WHO technical advisory group which met on Tu...,business
1,Which companies have $1 trillion or more marke...,Tesla has become the latest company to surpass...,business
2,I decided to support Doge as it felt like the ...,Tesla CEO and the world's richest person Elon ...,business
3,How many years did it take for various compan...,Tesla took 18 years to hit the $1-trillion m-c...,business
4,Elon Musk tweets 'Wild $T1mes' after Tesla hit...,Tesla CEO and the world's richest person Elon ...,business
...,...,...,...
95,"Snoop Dogg's mother passes away, rapper posts ...",Rapper Snoop Dogg on Sunday revealed that his ...,entertainment
96,"Jr NTR's fan injured in accident, actor helps ...",A fan of Telugu actor Jr NTR was injured in a ...,entertainment
97,Web is more streamlined and fair as compared t...,"Actress Kritika Kamra, who was seen in the web...",entertainment
98,I'm so overwhelmed: B Praak on winning Nationa...,"Singer B Praak, speaking about winning the Nat...",entertainment


## Bonus

Bonus: cache the data

Write your code such that the acquired data is saved locally in some form or fashion. Your functions that retrieve the data should prefer to read the local data instead of having to make all the requests every time the function is called. Include a boolean flag in the functions to allow the data to be acquired "fresh" from the actual sources (rewriting your local cache).
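
One possible shape for the caching behavior, assuming the scraped data is a JSON-serializable list of dictionaries. The names `load_or_acquire`, `fake_acquire`, and `articles.json` are hypothetical; `acquire` stands for any zero-argument function that performs the actual scraping.

```python
import json
import tempfile
from pathlib import Path

def load_or_acquire(cache_path, acquire, fresh=False):
    """Read cached data if present; otherwise (or if fresh=True) re-acquire it."""
    cache_path = Path(cache_path)
    if cache_path.exists() and not fresh:
        return json.loads(cache_path.read_text(encoding='utf-8'))
    data = acquire()
    # Overwrite the cache with the newly acquired data.
    cache_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding='utf-8')
    return data

calls = []

def fake_acquire():
    # Stand-in for the real scraping function; records each invocation.
    calls.append(1)
    return [{'title': 't', 'content': 'c'}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'articles.json'
    first = load_or_acquire(path, fake_acquire)               # scrapes and writes
    second = load_or_acquire(path, fake_acquire)              # reads the cache
    third = load_or_acquire(path, fake_acquire, fresh=True)   # scrapes again

print(len(calls))
```

JSON keeps the list-of-dictionaries shape the exercise asks for; a DataFrame-based solution could use `to_csv` and `pd.read_csv` in the same pattern.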