## Data Acquisition Exercises

### Problem 1: Scrape several blog articles and wrap the work in a function called get_blog_articles()
#### Approach: Do the thing once, do it in a loop, put the loop in a function

### Tools we will use:
1. Browser inspector (command + option + i), right-click inspect element, right-click copy CSS selector
2. Beautiful Soup to parse the page

### Find and select tips:
- To find elements with `class="link"`, such as `<a href="codeup.com" class="link">Codeup.com</a>`:
    - With soup.find, pass `class_="link"` as an argument (note the trailing underscore, since `class` is a reserved word in Python)
    - With soup.select, pass the CSS class selector: `soup.select(".link")`
- Some other CSS selectors to use with .select:
    - .class_name
    - #id_name (an ID should be unique within a page)
    - tag selectors: `soup.select("a")` returns a list of all the anchor tags
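
A minimal sketch of these selectors, run against a small made-up HTML snippet. The `html` string and the `main-nav` ID are assumptions for illustration and are not taken from codeup.com.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
html = """
<nav id="main-nav">
  <a href="codeup.com" class="link">Codeup.com</a>
  <a href="example.com">No class here</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

# find takes class_ (trailing underscore) because class is a Python keyword
by_find = soup.find("a", class_="link")

# select takes a CSS selector string and returns a list of matches
by_class = soup.select(".link")     # class selector
by_id = soup.select("#main-nav")    # ID selector
by_tag = soup.select("a")           # tag selector

print(by_find.text, len(by_class), len(by_id), len(by_tag))
```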

- main, header, footer, section, article, and div are block-level containers for content. These are boxes of content
- a (anchor), strong, and span are inline chunks of content

In [4]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

### Build upward in complexity!
1. Do our desired thing once, and put that in a function
2. Use that function in a loop
3. Put that loop in a function
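
As a rough sketch of that progression, the example below works on made-up HTML strings instead of live pages. The names `pages`, `get_title`, and `get_titles` are hypothetical and only illustrate the pattern.

```python
from bs4 import BeautifulSoup

# Hypothetical pages standing in for downloaded HTML
pages = [
    "<html><body><h1>First post</h1></body></html>",
    "<html><body><h1>Second post</h1></body></html>",
]

# Step 1: do the thing once, as a function
def get_title(html):
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("h1").text

# Steps 2 and 3: loop over the inputs, with the loop wrapped in its own function
def get_titles(html_pages):
    return [get_title(html) for html in html_pages]

titles = get_titles(pages)
print(titles)
```

Getting the single-item function right first means any bug shows up on one input, before the loop multiplies it.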

In [21]:
url = 'https://codeup.com/data-science-myths/'

# Set the user agent to reflect relatively ancient software just for fun
headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

# Get the http response object from the server
response = get(url, headers=headers)

In [22]:
# The raw_html produced is the same string you see when running "view-source" on the url
# Compare this raw_html to view-source
raw_html = response.text
raw_html[0:300]

'<!DOCTYPE html><html lang="en-US"><head >\t<meta charset="UTF-8" />\n\t<meta name="viewport" content="width=device-width, initial-scale=1" />\n\t<meta name=\'robots\' content=\'index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1\' />\n<style type="text/css" id="nab-alternative-loader-'

In [23]:
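# Turn the raw html string into a BeautifulSoup object
# Naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(raw_html, "html.parser")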

In [24]:
# soup.find to find one thing
# soup.find_all to find all the matching things
# soup.select to find all the matching things (as a list of tags)
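
A small sketch of how these three methods differ in what they return, using a made-up snippet (the `html` string is an assumption for illustration). It also shows `select_one`, which returns the first match for a CSS selector.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only
html = "<div><p class='intro'>One</p><p>Two</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")             # the first matching tag
all_p = soup.find_all("p")           # every matching tag
selected = soup.select("p.intro")    # every tag matching the CSS selector
one = soup.select_one("p.intro")     # the first tag matching the CSS selector

# When nothing matches, find returns None and select returns an empty list
missing = soup.find("h1")
missing_list = soup.select("h1")

print(first_p.text, len(all_p), len(selected), missing, len(missing_list))
```

Because `find` can return `None`, chaining `.text` onto it raises an `AttributeError` on pages that lack the tag, which is worth checking for when looping over many URLs.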

In [25]:
# Let's get the title
# h1 on its own works here, but not necessarily everywhere. Pages can have more than one h1 tag
title = soup.find('h1').text
title

'Data Science Myths'

In [26]:
# If we wanted to be more specific
# Give me the first element with the jupiterx-post-title class (here, the h1)
title = soup.select('.jupiterx-post-title')[0].text

In [27]:
title

'Data Science Myths'

In [28]:
content = soup.select('.jupiterx-post-content')[0].text

In [29]:
content

'By Dimitri Antoniou and Maggie Giust\nData Science, Big Data, Machine Learning, NLP, Neural Networks…these buzzwords have rapidly spread into mainstream use over the last few years. Unfortunately, definitions are varied and sources of truth are limited. Data Scientists are in fact not magical unicorn wizards who can snap their fingers and turn a business around! Today, we’ll take a cue from our favorite Mythbusters to tackle some common myths and misconceptions in the field of Data Science.\n\xa0\nMyth #1: Data Science = Statistics\nAt first glance, this one doesn’t sound unreasonable. Statistics is defined as, “A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” That sounds a lot like our definition of Data Science: a method of drawing actionable intelligence from data. \nIn truth, statistics is actually one small piece of Data Science. As our Senior Data Scientist puts it, “Statistics forces us to make assumpt

In [None]:
# Each piece of soup we access is another soup object with the same methods and properties available
# soup.element.text
# soup.time.text

In [None]:
# soup.element["attribute_name"]
# If you have an attribute name and need that attribute's value, then we use dictionary syntax
# soup.time["datetime"]
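
A minimal sketch of text access versus attribute access, assuming a made-up `<time>` tag (the `datetime` value below is illustrative, not copied from the blog).

```python
from bs4 import BeautifulSoup

# Hypothetical tag for illustration only
html = '<time datetime="2018-10-31T00:00:00">October 31, 2018</time>'
soup = BeautifulSoup(html, "html.parser")

time_tag = soup.time                  # shorthand for soup.find("time")
visible_text = time_tag.text          # the text between the tags
machine_date = time_tag["datetime"]   # dictionary syntax for an attribute value

# .get returns None for a missing attribute instead of raising KeyError
no_such = time_tag.get("title")

# .attrs holds every attribute as a dictionary
all_attrs = time_tag.attrs

print(visible_text, machine_date, no_such, all_attrs)
```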

In [30]:
# soup.select("img") is way too broad, since it returns every image on the page
# so we need to get to know our data, our html structure
# Let's get more specific
div_for_image = soup.select('.jupiterx-post-image')[0]
image_src = div_for_image.picture.img['data-src']
image_src

'https://codeup.com/wp-content/uploads/2018/10/Blog_DSBustMyth_1200x628.png'

In [34]:
# Make a function that works on a single url
# Make sure your function has everything it needs inside (try to avoid globals)

def get_codeup_blog(url):
    
    # Set the headers to look like ancient software (HTTrack on Windows 98), b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)
    
    soup = BeautifulSoup(response.text, "html.parser")
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output
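
One possible variation, sketched under assumptions: splitting the parsing step out of the download step makes it testable offline. `parse_codeup_blog` and the `sample` HTML are hypothetical; only the class names come from the page inspected above. Missing elements become `None` instead of raising an error.

```python
from bs4 import BeautifulSoup

def parse_codeup_blog(html):
    """Hypothetical helper: parse already-downloaded HTML into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")

    title_tag = soup.find("h1")
    time_tag = soup.find("time")
    # Descendant selector: any img nested inside the post image container
    img_tag = soup.select_one(".jupiterx-post-image img")
    content_tag = soup.select_one(".jupiterx-post-content")

    return {
        "title": title_tag.text if title_tag is not None else None,
        "published_date": time_tag.text if time_tag is not None else None,
        "blog_image": img_tag.get("data-src") if img_tag is not None else None,
        "content": content_tag.text if content_tag is not None else None,
    }

# Made-up page with no image, to exercise the None branch
sample = """
<h1 class="jupiterx-post-title">Sample Post</h1>
<time datetime="2018-10-31">October 31, 2018</time>
<div class="jupiterx-post-content">Body text</div>
"""
post = parse_codeup_blog(sample)
print(post)
```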

In [31]:
def get_blog_articles(urls):
    # List of dictionaries
    posts = [get_codeup_blog(url) for url in urls]
    
    return pd.DataFrame(posts)
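
A quick sketch of why a list of dictionaries is a convenient return shape: pandas turns each dictionary into a row and each key into a column. The `posts` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for scraped posts
posts = [
    {"title": "Post A", "blog_image": "https://example.com/a.png"},
    {"title": "Post B", "blog_image": None},
]

# Each dictionary becomes a row; the keys become the column names
df = pd.DataFrame(posts)

# A None value shows up as a missing value in the column
missing_images = df["blog_image"].isna().sum()
print(df.shape, list(df.columns), missing_images)
```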

In [32]:
urls = [
    "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
    "https://codeup.com/data-science-myths/",
    "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
    "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
    "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"
]

In [35]:
df = get_blog_articles(urls)

In [36]:
df.head()

| | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 | None | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 | None | Competitor Bootcamps Are Closing. Is the Model... |


### Problem 2: Get news articles from inshorts.com across several categories
#### Layers from macro to granular:

#### Pages (categories) -> cards (articles) -> card content

We want to get all the content on each card for each category, and will build up that way.

In [39]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]




In [40]:
base_url = 'https://inshorts.com/en/read/'

In [41]:
first_cat = categories[0]

In [42]:
first_page = base_url + first_cat

In [43]:
first_page

'https://inshorts.com/en/read/business'
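
The same concatenation extends to every category at once. A minimal sketch, where `category_urls` is a hypothetical name and only three of the categories are used:

```python
base_url = "https://inshorts.com/en/read/"
categories = ["business", "sports", "technology"]

# One URL per category: the base plus the category name
category_urls = [base_url + category for category in categories]

for url in category_urls:
    print(url)
```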

In [44]:
headers

{'User-Agent': 'Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)'}

In [45]:
# step one: get our content
response = get(first_page, headers=headers)

In [48]:
response.text[:400]

'<!doctype html>\n<html lang="en">\n\n<head>\n  <meta charset="utf-8" />\n  <style>\n    /* The Modal (background) */\n    .modal_contact {\n        display: none; /* Hidden by default */\n        position: fixed; /* Stay in place */\n        z-index: 8; /* Sit on top */\n        left: 0;\n        top: 0;\n        width: 100%; /* Full width */\n        height: 100%;\n        overflow: auto; /* Enable scroll if ne'

In [49]:
# make our soup
soup = BeautifulSoup(response.text, "html.parser")

In [53]:
articles = soup.select('.news-card')

In [54]:
articles[0]

<div class="news-card z-depth-1" itemscope="" itemtype="http://schema.org/NewsArticle">
<span content="" itemid="https://inshorts.com/en/news/chinas-exteacher-turned-billionaire-no-more-a-billionaire-as-shares-fall-98-1627290782038" itemprop="mainEntityOfPage" itemscope="" itemtype="https://schema.org/WebPage"></span>
<span itemprop="author" itemscope="itemscope" itemtype="https://schema.org/Person">
<span content="Pragya Swastik" itemprop="name"></span>
</span>
<span content="China's ex-teacher turned billionaire no more a billionaire as shares fall 98%" itemprop="description"></span>
<span itemprop="image" itemscope="" itemtype="https://schema.org/ImageObject">
<meta content="https://static.inshorts.com/inshorts/images/v1/variants/jpg/m/2021/07_jul/26_mon/img_1627289502940_772.jpg?" itemprop="url"/>
<meta content="864" itemprop="width"/>
<meta content="483" itemprop="height"/>
</span>
<span itemprop="publisher" itemscope="itemscope" itemtype="https://schema.org/Organization">
<span c

In [61]:
articles[1].select("[itemprop='articleBody']")[0]

<div itemprop="articleBody">A new job posting by Amazon has fuelled speculations that the e-commerce major may begin accepting Bitcoin, Ether and other cryptocurrencies as a form of payment. According to the job posting, Amazon's Payments Acceptance &amp; Experience team is hiring a 'Digital Currency and Blockchain Product Lead'. Following the speculations around Amazon's plan, Bitcoin surged near $40,000 on Monday.</div>

In [62]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
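
The attribute selectors used in `get_article` can be tried offline. The sketch below assumes a made-up card (`card_html` is illustrative and much simpler than a real inshorts card) and shows the CSS form next to the equivalent `find(attrs=...)` form.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified news card for illustration only
card_html = """
<div class="news-card">
  <span itemprop="headline">Made-up headline</span>
  <div itemprop="articleBody">Made-up body text.</div>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").select_one(".news-card")

# [attribute='value'] matches any tag whose attribute has exactly that value
headline = card.select_one("[itemprop='headline']").text
body = card.select_one("[itemprop='articleBody']").text

# The same lookup with find, passing the attribute as a dictionary
by_find = card.find(attrs={"itemprop": "articleBody"}).text

print(headline, "|", body)
```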

In [63]:
def get_articles(category, base="https://inshorts.com/en/read/"):
    """
    Takes in a category as a string. The category must be one that is available on inshorts.
    Returns a list of dictionaries, where each dictionary represents a single inshorts article.
    """
    
    # We concatenate our base_url with the category
    url = base + category
    
    # Set the headers
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output
    

In [None]:
# Example of using the get_articles function sending in the category name that's part of the URL
# get_articles("business")

In [64]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts.extend(all_category_articles)

    df = pd.DataFrame(all_inshorts)
    return df
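
To check the combining logic without hitting the network, a stub can stand in for `get_articles`. This is only a sketch: `fake_get_articles`, `collect`, and the toy rows are hypothetical.

```python
import pandas as pd

# Stub standing in for get_articles so the sketch runs offline
def fake_get_articles(category):
    return [
        {"title": f"{category} story 1", "content": "...", "category": category},
        {"title": f"{category} story 2", "content": "...", "category": category},
    ]

def collect(categories, fetch=fake_get_articles):
    rows = []
    for category in categories:
        # extend adds each article dictionary to the single flat list
        rows.extend(fetch(category))
    return pd.DataFrame(rows)

toy_df = collect(["business", "sports"])

# Count the articles gathered per category
counts = toy_df["category"].value_counts()
print(toy_df.shape)
print(counts)
```

Running `value_counts` on the real dataframe is a quick way to confirm that every requested category returned articles.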

In [65]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
df = get_all_news_articles(categories)

In [66]:
df

| | title | content | category |
|---|---|---|---|
| 0 | China's ex-teacher turned billionaire no more ... | China's Larry Chen, a former teacher who becam... | business |
| 1 | Amazon job posting fuels speculations about pl... | A new job posting by Amazon has fuelled specul... | business |
| 2 | Musk takes a jibe at rival car companies, says... | Tesla CEO and the world's second-richest perso... | business |
| 3 | Unemployment rate rises in both urban, rural a... | India's unemployment rate soared to 7.14% in t... | business |
| 4 | Govt paid Infosys ₹164.5 crore for new Income ... | The government paid ₹164.5 crore to Infosys to... | business |
| ... | ... | ... | ... |
| 142 | New Zealand agrees to accept alleged Islamic S... | New Zealand on Monday agreed to repatriate an ... | world |
| 143 | Lebanese lawmakers pick billionaire Najib Mika... | Lebanese lawmakers during parliamentary consul... | world |
| 144 | Equatorial Guinea to close UK embassy over san... | Equatorial Guinea's Foreign Minister said that... | world |
| 145 | 46 Afghan soldiers flee to Pakistan in retreat... | The Pakistani Army on Monday said that 46 Afgh... | world |
