## Data Acquisition Exercises

### It's good to know that:
- `Command + Option + I` brings up the element inspector on macOS (`Ctrl + Shift + I` on Windows and Linux)
- We can also right click an element and select `Inspect` to see that element's HTML
- Inside the element inspector, we can right click an element, choose `Copy`, then `Copy selector` to get its CSS selector
- CSS selectors work like a magic lasso for getting your hands on a specific element or set of elements
- CSS selectors and their usage: https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
- Adding `view-source:` immediately before the URL in your browser shows the page's raw HTML, the same text you get from `response.text`
- Infinite-scrolling pages load additional content after the first page load, which makes them more challenging to scrape

### Basic Selectors
- To find elements with `class="link"`, such as `<a href="codeup.com" class="link">Codeup.com</a>`:
    - With `soup.find`, pass `class_="link"` as an argument (note the trailing underscore, since `class` is a reserved word in Python)
    - With `soup.select`, use the CSS selector `.link`, as in `soup.select(".link")`
- Some other CSS selectors to use with `.select`:
    - `.class_name`
    - `#id_name` (IDs are unique to a page)
    - tag selectors. If we do `soup.select("a")`, we'll get back a list of all the anchor tags

- `main`, `header`, `footer`, `section`, `article`, and `div` are generic containers for content. These are boxes of content
- An anchor tag (`a`), `strong`, or `span` is an inline chunk of content
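
As a minimal sketch of the selectors described above, the snippet below runs them against a small HTML string. The surrounding `div`, its `id`, and the second anchor are made up for illustration; only the first anchor tag comes from these notes.

```python
from bs4 import BeautifulSoup

# Sample HTML invented for illustration; the first anchor is the one from the notes
html = """
<div id="main">
  <a href="codeup.com" class="link">Codeup.com</a>
  <a href="example.com">No class here</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match; note the trailing underscore in class_
first_link = soup.find("a", class_="link")

# select takes a CSS selector and returns a list of every match
by_class = soup.select(".link")  # class selector
by_id = soup.select("#main")     # id selector
by_tag = soup.select("a")        # tag selector

print(first_link.text)
print(len(by_class), len(by_id), len(by_tag))
```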

In [1]:
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

# When the Problem Statement Says "Scrape all the blog posts"
- Start by simplifying as much as you possibly can.
- This stuff is complicated already, so try to disentangle things

### One weird trick to making a function
- Step 1: Blow off the function
- Step 2: Get your code to work for a hard-coded input

### One weird trick to solving looping problems
- Step 1: Blow off the function
- Step 2: Get your code to work for a single item
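
One way to read the two tricks above is sketched below, using a made-up title-cleaning task rather than the scraping code itself. The name `clean_title` and the sample strings are hypothetical, and the final "wrap it, then loop" step is an assumption about where the tricks lead.

```python
# Step 1: skip the function and get the logic working for one hard-coded input
raw_title = "  Data Science Myths \n"
cleaned = raw_title.strip().lower()
print(cleaned)

# Once the single case works, wrap the same logic in a function
def clean_title(title):
    return title.strip().lower()

# Only now apply it to many items
titles = ["  Data Science Myths \n", "Codeup BLOG  "]
cleaned_titles = [clean_title(t) for t in titles]
print(cleaned_titles)
```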

In [2]:
# url = "https://codeup.com/codeups-data-science-career-accelerator-is-here/"

# # Set a custom User-Agent that looks like an old Windows 98 client, b/c I feel like creating an anomaly in the logs
# headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

# # Get the http response object from the server
# response = get(url, headers=headers)

In [3]:
# # The raw_html produced is a string version of running "view-source" on the url
# # Compare this raw_html to view-source
# raw_html = response.text
# raw_html[0:300]

In [4]:
# # Turn the raw html string into a BeautifulSoup object
# soup = BeautifulSoup(raw_html, "html.parser")

In [5]:
# soup.find to find one thing
# soup.find_all to find all the matching things
# soup.select to find all the matching things (as a list of tags)
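
The difference between these three methods is easiest to see from what they return. The following is a small sketch on an invented HTML list, not on the blog page itself.

```python
from bs4 import BeautifulSoup

# Tiny HTML string made up for this example
html = "<ul><li>one</li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li")       # a single tag (the first match)
all_items = soup.find_all("li")    # list-like collection of every matching tag
selected = soup.select("ul li")    # same idea, but using CSS selector syntax

missing = soup.find("table")       # find returns None when nothing matches
no_matches = soup.select("table")  # select returns an empty list instead

print(first_item.text)
print([item.text for item in all_items])
print(missing, len(no_matches))
```

Because `find` can return `None`, chaining `.text` onto it fails when the element is absent, whereas an empty result from `select` can be checked with `len` first.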

In [6]:
# # Let's get the title
# # h1 on its own works here, but not necessarily everywhere. Pages can have > 1 h1 tag
# title = soup.find("h1").text
# title

In [7]:
# # If we wanted to be more specific
# # Give me the h1 that also has the jupiterx-post-title class
# title = soup.select("h1.jupiterx-post-title")[0].text
# title

In [8]:
# content = soup.select(".jupiterx-post-content")[0].text
# content

In [9]:
# # Each piece of soup we access is another soup object with the same methods and properties available
# # soup.element.text
# soup.time.text

In [10]:
# soup.element["attribute_name"]
# If you have an attribute name and need that attribute's value, then we use dictionary syntax
# soup.time["datetime"]
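
As a short sketch of text versus attribute access, the snippet below uses a hand-written `time` tag. The `datetime` value is an assumed example, not the value from the real page.

```python
from bs4 import BeautifulSoup

# Hand-written tag; the datetime value is an assumption for illustration
html = '<time datetime="2018-09-30T00:00:00">September 30, 2018</time>'
soup = BeautifulSoup(html, "html.parser")

# .text gives the visible text between the opening and closing tags
visible_text = soup.time.text

# Dictionary syntax gives the value of a named attribute
machine_date = soup.time["datetime"]

# .get returns None for a missing attribute instead of raising KeyError
missing_attr = soup.time.get("title")

print(visible_text)
print(machine_date)
print(missing_attr)
```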

In [11]:
# # soup.select("img") is way too broad, since it returns every image on the page
# # so we need to get to know our data, our html structure
# # Let's get more specific
# div_for_image = soup.select(".jupiterx-post-image")[0]
# image_src = div_for_image.picture.img["data-src"]
# image_src

In [12]:
# Make a function that works on a single url
# Make sure your function has everything it needs inside (try to avoid globals)

def get_codeup_blog(url):
    
    # Set a custom User-Agent that looks like an old Windows 98 client, b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)
    
    # Name the parser explicitly so bs4 doesn't guess (and warn)
    soup = BeautifulSoup(response.text, "html.parser")
    
    title = soup.find("h1").text
    published_date = soup.time.text
    
    if len(soup.select(".jupiterx-post-image")) > 0:
        blog_image = soup.select(".jupiterx-post-image")[0].picture.img["data-src"]
    else:
        blog_image = None
        
    content = soup.select(".jupiterx-post-content")[0].text
    
    output = {}
    output["title"] = title
    output["published_date"] = published_date
    output["blog_image"] = blog_image
    output["content"] = content
    
    return output

In [13]:
def get_blog_articles(urls):
    # List of dictionaries
    posts = [get_codeup_blog(url) for url in urls]
    
    return pd.DataFrame(posts)

In [14]:
urls = [
    "https://codeup.com/codeups-data-science-career-accelerator-is-here/",
    "https://codeup.com/data-science-myths/",
    "https://codeup.com/data-science-vs-data-analytics-whats-the-difference/",
    "https://codeup.com/10-tips-to-crush-it-at-the-sa-tech-job-fair/",
    "https://codeup.com/competitor-bootcamps-are-closing-is-the-model-in-danger/"
]

In [15]:
df = get_blog_articles(urls)

In [16]:
df.head()

| | title | published_date | blog_image | content |
|---|---|---|---|---|
| 0 | Codeup’s Data Science Career Accelerator is Here! | September 30, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | The rumors are true! The time has arrived. Cod... |
| 1 | Data Science Myths | October 31, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou and Maggie Giust\nData Sci... |
| 2 | Data Science VS Data Analytics: What’s The Dif... | October 17, 2018 | https://codeup.com/wp-content/uploads/2018/10/... | By Dimitri Antoniou\nA week ago, Codeup launch... |
| 3 | 10 Tips to Crush It at the SA Tech Job Fair | August 14, 2018 | | SA Tech Job Fair\nThe third bi-annual San Anto... |
| 4 | Competitor Bootcamps Are Closing. Is the Model... | August 14, 2018 | | Competitor Bootcamps Are Closing. Is the Model... |


In [17]:
def get_article(article, category):
    # Attribute selector
    title = article.select("[itemprop='headline']")[0].text
    
    # article body
    content = article.select("[itemprop='articleBody']")[0].text
    
    output = {}
    output["title"] = title
    output["content"] = content
    output["category"] = category
    
    return output
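
The attribute selectors used in `get_article` can be tried offline first. This is a sketch on a hypothetical news card; the markup below is an assumption and is much simpler than the real inshorts HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical news card; the real page's markup may differ
html = """
<div class="news-card">
  <span itemprop="headline">Sample headline</span>
  <div itemprop="articleBody">Sample body text.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.select(".news-card")[0]

# [attribute='value'] matches any tag whose attribute has exactly that value
headline = card.select("[itemprop='headline']")[0].text
body = card.select("[itemprop='articleBody']")[0].text

print(headline)
print(body)
```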

In [18]:
def get_articles(category):
    """
    This function takes in a category as a string. Category must be an available category in inshorts
    Returns a list of dictionaries where each dictionary represents a single inshort article
    """
    base = "https://inshorts.com/en/read/"
    
    # Concatenate the base URL with the category
    url = base + category
    
    # Set a custom User-Agent that looks like an old Windows 98 client, b/c I feel like creating an anomaly in the logs
    headers = {"User-Agent": "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"}

    # Get the http response object from the server
    response = get(url, headers=headers)

    # Make soup out of the raw html
    soup = BeautifulSoup(response.text, "html.parser")
    
    # Ignore everything, focusing only on the news cards
    articles = soup.select(".news-card")
    
    output = []
    
    # Iterate through every article tag/soup 
    for article in articles:
        
        # Returns a dictionary of the article's title, body, and category
        article_data = get_article(article, category) 
        
        # Append the dictionary to the list
        output.append(article_data)
    
    # Return the list of dictionaries
    return output
    

In [19]:
# Example of using the get_articles function sending in the category name that's part of the URL
# get_articles("business")

In [20]:
def get_all_news_articles(categories):
    """
    Takes in a list of categories where the category is part of the URL pattern on inshorts
    Returns a dataframe of every article from every category listed
    Each row in the dataframe is a single article
    """
    all_inshorts = []

    for category in categories:
        all_category_articles = get_articles(category)
        all_inshorts = all_inshorts + all_category_articles

    df = pd.DataFrame(all_inshorts)
    return df
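
The pattern of gathering lists of dictionaries and turning them into one dataframe can be checked without touching the network. In this sketch, `fake_get_articles` is a hypothetical stand-in for `get_articles` that returns made-up rows, and `list.extend` is used as an alternative to `+` for combining the lists.

```python
import pandas as pd

# Hypothetical stand-in for get_articles so the sketch runs offline; the data is made up
def fake_get_articles(category):
    return [
        {"title": f"{category} story {i}", "content": "...", "category": category}
        for i in range(2)
    ]

all_articles = []
for category in ["business", "sports"]:
    # extend adds each dictionary to the running list in place
    all_articles.extend(fake_get_articles(category))

# Each dictionary becomes a row; the keys become the columns
df = pd.DataFrame(all_articles)
print(df.shape)
```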

In [21]:
categories = ["business", "sports", "technology", "entertainment", "science", "world"]
df = get_all_news_articles(categories)

In [22]:
df

| | title | content | category |
|---|---|---|---|
| 0 | India underestimated the coronavirus: Raghuram... | Speaking about India's second COVID-19 wave, f... | business |
| 1 | Air India pilots demand vaccination on priorit... | Indian Commercial Pilots Association (ICPA) on... | business |
| 2 | Petrol and diesel prices hiked after 18 days | Prices of petrol and diesel were today hiked f... | business |
| 3 | World's biggest jeweller says it will no longe... | Pandora, the world's biggest jeweller, has sai... | business |
| 4 | South Korea's richest woman gets fortune worth... | South Korea’s richest woman Hong Ra-hee added ... | business |
| ... | ... | ... | ... |
| 143 | Egypt buys 30 more Rafale jets from France in ... | Egypt will buy 30 more Rafale fighter jets fro... | world |
| 144 | Flash floods kill at least 12 in western Afgha... | At least 12 people were killed by flash floods... | world |
| 145 | Further violence in Myanmar could lead to civi... | China's Ambassador to the UN Zhang Jun on Mond... | world |
| 146 | Nepal PM appeals for COVID-19 vaccines as case... | Nepal PM KP Sharma Oli has urged neighbouring ... | world |
