# Using Scraping

Scrape the front page of a frequently updated website. (Some websites won't work with BeautifulSoup, for example those that build their pages with JavaScript.)

Send yourself an email with as much information as possible from the site, such as:  

 - The title of the thing (the sale, the article, whatever)
 - A URL for it
 - Upvotes/thumbs ups/subreddits/prices/links to images/etc

Save this as a CSV and send it to your email address as an attachment every 6 hours. The subject line should say something like "**Here is your 6PM briefing.**" The CSV filename should include the current date and time, e.g. `briefing-2018-06-18-3PM.csv`

**BONUS**: Have the content actually be the **body** of the email, not just an attachment. I don't mean like a CSV or whatever, I mean it should actually look like nice lists and stuff, a real email.
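One way to approach the bonus is to turn the scraped rows into an HTML list and send that as the message body. The sketch below is only an illustration: `build_html_body` and the `title`, `url` and `source` keys are hypothetical names, and the sample row is made up.

```python
import html


def build_html_body(rows):
    """Render scraped rows as an HTML list for the email body.

    `rows` is assumed to be a list of dicts with 'title', 'url' and
    'source' keys (hypothetical field names for this sketch).
    """
    items = []
    for row in rows:
        # Escape scraped text so a stray '<' or '&' cannot break the markup
        title = html.escape(row["title"])
        url = html.escape(row["url"], quote=True)
        source = html.escape(row["source"])
        items.append('<li><a href="{}">{}</a> ({})</li>'.format(url, title, source))
    return "<h1>Briefing</h1>\n<ul>\n{}\n</ul>".format("\n".join(items))


# Made-up sample row, only to show the shape of the output
sample = [{"title": "Tea & trade", "url": "https://example.com/a", "source": "Example Times"}]
print(build_html_body(sample))
```

Mailgun's messages endpoint accepts an `html` field next to `text`, so the returned string could be passed as `"html": build_html_body(rows)` in the `data` dict of the request.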

In [1]:
from bs4 import BeautifulSoup
import requests
import datetime
import re
import pandas as pd

In [2]:
# For this assignment, I'm scraping China-related news from Google News.
response = requests.get("https://news.google.com/search?q=China&hl=en-US&gl=US&ceid=US%3Aen")
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.content, "html.parser")

In [3]:
# Before putting the data into a DataFrame, print it to make sure the scraping works.

# The first two blocks are featured article groups (with 'View more' buttons),
# so they need to be handled separately.
# Since the number of featured blocks may change,
# we need something that uniquely identifies them.
# For simplicity, I'm only taking the first article of each group.

# The featured blocks have a 'jscontroller' attribute whose value is 'd0DtYd'.
featured = soup.find(class_='lBwEZb BL5WZb xP6mwf').find_all('div',attrs={'jscontroller':'d0DtYd'})
for feature in featured:
    feature_article = feature.find('article').find(class_='ZulkBc qNiaOd')
    feature_title = feature_article.span.text
    feature_url = 'https://news.google.com'+feature_article.a['href'][1:]
    feature_first_lines = feature_article.find(class_='HO8did Baotjf').text
    feature_source = feature.find(class_='QmrVtf kybdz').find('div',attrs={'class':'PNwZO zhsNkd'}).text
    feature_rawtime = feature.find('time')['datetime'].split(': ')[1]
    feature_time = datetime.datetime.fromtimestamp(int(feature_rawtime)).strftime('%Y-%m-%d %H:%M:%S')
    print(feature_title)
    print(feature_url)
    print(feature_first_lines)
    print(feature_source)
    print(feature_time)

# The other "normal" articles have a 'jsmodel' attribute whose value is 'zT6vwb'.
articles = soup.find(class_='lBwEZb BL5WZb xP6mwf').find_all('div',attrs={'jsmodel':'zT6vwb'})

for article in articles:
    article_title = article.find('span').text
    article_url = 'https://news.google.com'+article.find('a',attrs={'class':'ipQwMb Q7tWef'})['href'][1:]
    article_first_lines = article.find('p').text
    article_source = article.find(class_='KbnJ8').text
    # Keep only the first run of digits (the Unix timestamp) from the 'datetime' attribute
    article_rawtime = re.findall(r'\d+', article.find('time')['datetime'].split(': ')[1])[0]
    article_time = datetime.datetime.fromtimestamp(int(article_rawtime)).strftime('%Y-%m-%d %H:%M:%S')
    print(article_title)
    print(article_url)
    print(article_first_lines)
    print(article_source)
    print(article_time)

Trump's Trade War Spooks Markets as White House Waits for China to Blink
https://news.google.com/articles/CAIiEP_Fh8Itu1ElBc1lC3IA7poqFwgEKg8IACoHCAowjuuKAzCWrzww9oAY
The administration, threatening tariffs on as much as $450 billion worth of goods, believes Beijing has more to lose. Companies, investors and markets are ...
The New York Times
2018-06-19 12:15:02
North Korea’s Kim makes another trip to China. That complicates things for Trump.
https://news.google.com/articles/CAIiEHraYy5Ba-EgBuspvANtpzsqGAgEKg8IACoHCAowjtSUCjC30XQwn6G5AQ
BEIJING — North Korean leader Kim Jong Un is in China. Again. Kim arrived Tuesday for his third visit to China in the span of three months, meeting with ...
The Washington Post
2018-06-19 20:10:42
There's no better example of Trump's trade fight with China than Lockheed Martin's crown jewel
https://news.google.com/articles/CAIiED_SAADQC-_WefGjAs6jJH4qGQgEKhAIACoHCAow2Nb3CjDivdcCMKuvhQY
The Chinese J-31 fighter jet is believed to be a knockoff of Lockhee
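The URL and timestamp handling in the loops above could also be pulled into two small helpers. This is a sketch under stated assumptions, not the notebook's own method: `absolute_url` and `parse_epoch` are hypothetical names, the `"seconds: 0"` input is a made-up sample rather than a value copied from the page, and the helper formats the time in UTC so the result does not depend on the machine's time zone (the cells above use local time).

```python
import datetime
import re
from urllib.parse import urljoin

BASE_URL = "https://news.google.com/"


def absolute_url(href):
    """Resolve a relative href such as './articles/abc' against the site root."""
    return urljoin(BASE_URL, href)


def parse_epoch(raw):
    """Take the first run of digits in `raw` and format it as a UTC time.

    Assumes those digits are a Unix timestamp in seconds.
    """
    seconds = int(re.search(r"\d+", raw).group())
    moment = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(absolute_url("./articles/abc"))  # https://news.google.com/articles/abc
print(parse_epoch("seconds: 0"))       # 1970-01-01 00:00:00
```

Using `urljoin` avoids slicing the leading `.` off the href by hand, and a single regex search covers both the featured and the normal articles.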

### Now we put everything together
Put the data into a DataFrame, save it as a CSV, then send it as an attachment using Mailgun.
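The timestamped filename and subject line from the assignment can be built from a single `datetime` value. The following is a minimal sketch: the function names are hypothetical, and it assumes the default C locale, where `%p` yields `AM`/`PM`. It strips the leading zero by hand because the `%-I` format code is not available on every platform.

```python
import datetime


def hour_label(moment):
    """Return the hour as e.g. '3PM' without relying on the non-portable '%-I'."""
    return moment.strftime("%I").lstrip("0") + moment.strftime("%p")


def briefing_filename(moment):
    """Build a name like 'briefing-2018-06-18-3PM.csv'."""
    return "briefing-{}-{}.csv".format(moment.strftime("%Y-%m-%d"), hour_label(moment))


def briefing_subject(moment):
    """Build a subject like 'Here is your 3PM briefing.'"""
    return "Here is your {} briefing.".format(hour_label(moment))


example = datetime.datetime(2018, 6, 18, 15, 0)
print(briefing_filename(example))  # briefing-2018-06-18-3PM.csv
print(briefing_subject(example))   # Here is your 3PM briefing.
```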

In [4]:
# Generating auto-emails on China-related news from Google News

from bs4 import BeautifulSoup
import requests
import datetime
import re
import pandas as pd

response = requests.get("https://news.google.com/search?q=China&hl=en-US&gl=US&ceid=US%3Aen")
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.content, "html.parser")

briefing = []

featured = soup.find(class_='lBwEZb BL5WZb xP6mwf').find_all('div',attrs={'jscontroller':'d0DtYd'})
for feature in featured:
    feature_row = {}
    feature_article = feature.find('article').find(class_='ZulkBc qNiaOd')
    feature_row['title'] = feature_article.span.text
    feature_row['url'] = 'https://news.google.com'+feature_article.a['href'][1:]
    feature_row['first_lines'] = feature_article.find(class_='HO8did Baotjf').text
    feature_row['source'] = feature.find(class_='QmrVtf kybdz').find('div',attrs={'class':'PNwZO zhsNkd'}).text
    feature_rawtime = feature.find('time')['datetime'].split(': ')[1]
    feature_row['time'] = datetime.datetime.fromtimestamp(int(feature_rawtime)).strftime('%Y-%m-%d %H:%M:%S')
    briefing.append(feature_row)

articles = soup.find(class_='lBwEZb BL5WZb xP6mwf').find_all('div',attrs={'jsmodel':'zT6vwb'})
for article in articles:
    article_row = {}
    article_row['title'] = article.find('span').text
    article_row['url'] = 'https://news.google.com'+article.find('a',attrs={'class':'ipQwMb Q7tWef'})['href'][1:]
    article_row['first_lines'] = article.find('p').text
    article_row['source'] = article.find(class_='KbnJ8').text
    article_rawtime = re.findall(r'[\d]+', article.find('time')['datetime'].split(': ')[1])[0]
    article_row['time'] = datetime.datetime.fromtimestamp(int(article_rawtime)).strftime('%Y-%m-%d %H:%M:%S')
    briefing.append(article_row)
    
df = pd.DataFrame(briefing)
right_now = datetime.datetime.now()
# "%-I" (hour without a leading zero) is not supported on every platform,
# so strip the zero by hand instead.
hour = right_now.strftime("%I").lstrip("0")
am_pm = right_now.strftime("%p")
date_string_filename = right_now.strftime("%Y-%b-%d_") + hour + am_pm
csv_filename = 'China_news_briefing_{}.csv'.format(date_string_filename)
df.to_csv(csv_filename, index=False)
date_string_mail = "{} {}".format(hour, am_pm)

# Open the CSV in binary mode and close it once the request has been sent
with open(csv_filename, 'rb') as attachment:
    mail_response = requests.post(
        "https://api.mailgun.net/v3/MY_SANDBOX_DOMAIN/messages",
        auth=("api", "MY_API_KEY"),
        files=[("attachment", attachment)],
        data={"from": "Edward Hong <mailgun@MY_SANDBOX_DOMAIN>",
              "to": ["Edward.YSHF@gmail.com"],
              "subject": "{} China News Briefing".format(date_string_mail),
              "text": "See attachment to learn what's new about China."})
mail_response

<Response [200]>
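The cell above sends one briefing; the "every 6 hours" part of the assignment still needs something to trigger it. A scheduler such as cron is one option. Another is a small loop that sleeps until the next 6-hour mark, sketched below. `next_run`, `seconds_until_next_run`, `run_forever` and the `send_briefing` callback are all hypothetical names, and the sketch assumes the interval divides 24 evenly.

```python
import datetime
import time


def next_run(now, interval_hours=6):
    """Return the next time whose hour is a multiple of `interval_hours`
    (00:00, 06:00, 12:00, 18:00 by default), strictly after `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    slots_passed = now.hour // interval_hours
    return midnight + datetime.timedelta(hours=(slots_passed + 1) * interval_hours)


def seconds_until_next_run(now, interval_hours=6):
    """Number of seconds to wait from `now` until the next scheduled run."""
    return (next_run(now, interval_hours) - now).total_seconds()


def run_forever(send_briefing):
    """Sleep until each 6-hour mark, then call `send_briefing` (a hypothetical
    function wrapping the scrape-and-email code). Not called in this sketch."""
    while True:
        time.sleep(seconds_until_next_run(datetime.datetime.now()))
        send_briefing()


print(next_run(datetime.datetime(2018, 6, 18, 15, 30)))  # 2018-06-18 18:00:00
```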