### Mission: Scrape news articles from a variety of sources

- Use the News API to identify interesting articles
- Build scraping code to scrape the articles from the sources provided by the News API
- Store the news articles in a permanent database
- Process the text of the articles in preparation for NLP
- Build interesting models around the text to understand sentiment

API source: https://newsapi.org/docs/endpoints/everything

In [6]:
import pandas as pd
import time
from datetime import date
import requests

In [18]:
def save_news(search_terms, n_pagesize, n_pages, save_to_csv):
    '''
    Query the News API "everything" endpoint and collect the results in a DataFrame.

    search_terms = list of keywords to search for
    n_pagesize = number of articles requested per page
    n_pages = upper bound for paging; pages 1 to n_pages - 1 are requested
    save_to_csv = True indicates csv will be saved
    '''
    url = 'https://newsapi.org/v2/everything?'
    articles = []

    # API requests: collect the articles for every search term
    for term in search_terms:
        param = {
            #'country' : 'us',
            'q': term,  # search term
            'apiKey': 'e685d6e1420f4882b86d029ed3c1a11d',
            'pageSize': n_pagesize,  # articles per page
            'language': 'en'}
        print(term)

        # First page of results for this term
        every_term = requests.get(url, params=param)
        articles.extend(every_term.json()['articles'])

        # Remaining pages: range(2, n_pages) stops at page n_pages - 1
        for page in range(2, n_pages):
            param['page'] = page
            more_term = requests.get(url, params=param)
            articles.extend(more_term.json()['articles'])

    arts = pd.DataFrame(articles)

    # Drop rows with null values and rows with duplicate content
    arts.dropna(inplace=True)
    arts.drop_duplicates(subset='content', inplace=True)

    # Split the nested source dict into separate id and name columns
    arts['source_id'] = arts['source'].map(lambda x: x['id'])
    arts['source_name'] = arts['source'].map(lambda x: x['name'])

    # Save df to csv
    if save_to_csv:
        arts.to_csv('../data/' + str(date.today()) + '.csv', index=False, sep=",")  # index=False for no extra index column
        # Note: len(articles) counts the articles fetched, before nulls and duplicates are dropped
        print(f'{len(articles)} unique news have been saved')

    # Review df
    return arts.head()
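
The function above indexes `['articles']` directly, which raises a `KeyError` whenever the API answers with an error payload instead of results (for example, an invalid key or a rate limit). Below is a minimal sketch of a guard, assuming the response shapes described in the News API docs (`status`, plus `code` and `message` on errors). `extract_articles` is a hypothetical helper that is not part of the original notebook, and the sample payloads are hand-written for illustration.

```python
def extract_articles(payload):
    """Return the article list from a decoded News API response.

    Assumed shapes: {'status': 'ok', 'articles': [...]} on success and
    {'status': 'error', 'code': ..., 'message': ...} on failure.
    """
    if payload.get('status') != 'ok':
        code = payload.get('code', 'unknown')
        message = payload.get('message', 'no message provided')
        raise RuntimeError(f'News API request failed ({code}): {message}')
    return payload.get('articles', [])


# Hand-written sample payloads (illustrative, not real API output)
ok_payload = {'status': 'ok', 'totalResults': 1,
              'articles': [{'title': 'Example'}]}
error_payload = {'status': 'error', 'code': 'rateLimited',
                 'message': 'Too many requests'}

print(extract_articles(ok_payload))

try:
    extract_articles(error_payload)
except RuntimeError as err:
    print(err)
```

Inside `save_news`, a call such as `extract_articles(every_term.json())` could stand in for `every_term.json()['articles']`, so a failed request stops the run with a readable message.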

In [22]:
# news = ['disaster', 'crisis']
# for new in news:
#     save_news([new], n_pagesize=10, n_pages=3, save_to_csv=True)  # save_news expects a list of terms

In [21]:
save_news (['earthquake'], n_pagesize=10, n_pages=3, save_to_csv=True)

earthquake
20 unique news have been saved


| | author | content | description | publishedAt | source | title | url | urlToImage | source_id | source_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Allan | Earthquakes are violent, frightening, and can ... | Earthquakes are violent, frightening, and can ... | 2018-05-22T20:30:00Z | {'id': None, 'name': 'Lifehacker.com'} | Here's Everything You Need to Know to Survive ... | https://lifehacker.com/heres-everything-you-ne... | https://i.kinja-img.com/gawker-media/image/upl... | None | Lifehacker.com |
| 1 | Caitlin Harrington | Rich Lee had armor implanted in his shins in 2... | Surgically installed vibrator or earthquake-se... | 2018-02-07T14:00:00Z | {'id': 'wired', 'name': 'Wired'} | Biopunks are Pushing the Limits With Implants ... | https://www.wired.com/story/biopunks-are-pushi... | https://media.wired.com/photos/5a7118c5c963725... | wired | Wired |
| 2 | Aimée Lutkin | On Friday, the Indonesian island of Sulawesi w... | On Friday, the Indonesian island of Sulawesi w... | 2018-10-03T18:00:00Z | {'id': None, 'name': 'Lifehacker.com'} | How to Help Victims of the Indonesia Earthquak... | https://lifehacker.com/how-to-help-victims-of-... | https://i.kinja-img.com/gawker-media/image/upl... | None | Lifehacker.com |
| 3 | Aimée Lutkin | California began using an earthquake warning s... | California began using an earthquake warning s... | 2018-10-11T17:30:00Z | {'id': None, 'name': 'Lifehacker.com'} | What to Do When an Earthquake Warning Goes Off | https://lifehacker.com/what-to-do-when-an-eart... | https://i.kinja-img.com/gawker-media/image/upl... | None | Lifehacker.com |
| 4 | Alessandra Potenza | Last night, a magnitude 7.9 earthquake struck ... | Last night, a magnitude 7.9 earthquake struck ... | 2018-01-23T16:15:13Z | {'id': 'the-verge', 'name': 'The Verge'} | A powerful earthquake in Alaska didn’t trigger... | https://www.theverge.com/2018/1/23/16922914/al... | https://cdn.vox-cdn.com/thumbor/rQqh3BMKCmEiuT... | the-verge | The Verge |



#### Investigate the headlines endpoint

In [21]:
url = 'https://newsapi.org/v2/top-headlines?'

param = {
    'country' : 'us',
    'category': 'business',  # category filter for the top-headlines endpoint
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': 100,
}

business_headlines = requests.get(url, params = param)
business_headlines = business_headlines.json()
business_headlines.keys()

dict_keys(['status', 'totalResults', 'articles'])

In [1]:
#business_headlines

In [2]:
#business_headlines.json()

In [3]:
#business_headlines['articles'] #where these articles come from 
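
The `totalResults` key in the response reports how many articles match the request, which is enough to work out how many pages of `pageSize` articles exist. Here is a small sketch of that arithmetic. `pages_needed` is a hypothetical helper, and the numbers in the usage line are made up. The API may also cap how many results a given plan can page through, so treat the result as an upper bound.

```python
import math


def pages_needed(total_results, page_size):
    """Number of pages required to cover total_results at page_size per page."""
    if page_size <= 0:
        raise ValueError('page_size must be a positive integer')
    return math.ceil(total_results / page_size)


# Illustrative numbers only: 250 matching articles at 100 per page
print(pages_needed(250, 100))
```

With a real response, the call would look like `pages_needed(business_headlines['totalResults'], param['pageSize'])`.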

#### Investigate the sources endpoint

In [26]:
url = 'https://newsapi.org/v2/sources?'

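param = {
    'country' : 'us',
    # As far as the News API docs describe, the sources endpoint only filters on
    # category, language and country, so 'q' and 'pageSize' are likely ignored here
    'q': 'earthquake',
    #'category': 'business',
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': 100,
}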

earthquake_sources = requests.get(url, params = param)

In [4]:
#earthquake_sources.json()
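
As far as the News API docs describe it, the sources endpoint returns a `sources` list rather than an `articles` list. Below is a sketch of turning that list into an id-to-name lookup, assuming each entry carries `id` and `name` keys like the `source` field of the articles above. `source_lookup` is a hypothetical helper and the sample payload is hand-written for illustration.

```python
def source_lookup(payload):
    """Map source id -> source name, skipping entries without an id."""
    return {
        source['id']: source['name']
        for source in payload.get('sources', [])
        if source.get('id')
    }


# Hand-written sample payload (illustrative, not real API output)
sample_sources = {
    'status': 'ok',
    'sources': [
        {'id': 'wired', 'name': 'Wired'},
        {'id': 'the-verge', 'name': 'The Verge'},
        {'id': None, 'name': 'Lifehacker.com'},
    ],
}

print(source_lookup(sample_sources))
```

With a live response, the equivalent call would be `source_lookup(earthquake_sources.json())`.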

In [28]:
import time

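# Unix timestamp (seconds as a float), used to build a unique file name
now = time.time()

# Keep `now` as a float: after now = str(now), the :.0f format in the next cell
# fails with "Unknown format code 'f' for object of type 'str'"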

In [29]:
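# Assumes `arts` is a DataFrame in the notebook's global scope;
# save_news only returns arts.head(), so the cells above do not define it
arts.to_csv(f'./earthquake_{now:.0f}.csv')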
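
The `:.0f` format spec only works on numbers, which is why the timestamp has to stay a float until it is formatted. A short sketch of both cases follows. `timestamped_name` is a hypothetical helper, and the fixed timestamp is an arbitrary value chosen so the result is reproducible.

```python
import time


def timestamped_name(prefix, timestamp=None):
    """Build './<prefix>_<unix seconds>.csv', defaulting to the current time."""
    if timestamp is None:
        timestamp = time.time()
    # .0f rounds the float to whole seconds and drops the decimal point
    return f'./{prefix}_{timestamp:.0f}.csv'


print(timestamped_name('earthquake', 1540000000.4))

# Formatting a string with a float format code raises ValueError
try:
    f'{str(1540000000.4):.0f}'
except ValueError as err:
    print(err)
```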

### Schedule automated tasks with Cron

Next, let's set up a Python script to collect data automatically every day. We can do this with cron. 

Here are a couple of great resources you might reference as we go through cron:   
https://www.taniarascia.com/setting-up-a-basic-cron-job-in-linux/  
https://ole.michelsen.dk/blog/schedule-jobs-with-crontab-on-mac-osx.html  

#### What is cron? 

Cron is a scheduler that allows a user to automate certain functionality to run at specific times. The way timing is encoded in cron is a bit arcane. 

Let's check out this site and try to understand how it works: https://crontab.guru/#0_0_*_*_*

A crontab schedule has five fields, each of which can be `*`, that together specify when a job runs. 

For example, 

\* \* \* \* \*

means every minute of every hour, on every day of the month, in every month, on every day of the week. In other words, the job runs once a minute.  

This can be a bit confusing. I suggest that when you want to set up a crontab for a specific time, you play around with crontab.guru until it outputs the timing you want. Let's try to get a cron schedule for once a day at noon. 



Each field has a meaning, and `*` means "any value". In order, the first field is the minute, the second is the hour, the third is the day of the month, the fourth is the month, and the fifth is the day of the week.
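
To make the field order concrete, here is a simplified sketch that splits a five-field expression into named fields and checks whether a given moment matches it. `parse_cron` and `matches` are hypothetical helpers, not part of cron. The sketch only understands `*` and single integers (no ranges, lists or steps), requires every field to match, and expects Sunday as `0`, so it does not reproduce every rule of a real cron implementation. Once a day at noon corresponds to `0 12 * * *`.

```python
from datetime import datetime

FIELD_NAMES = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def parse_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f'expected 5 fields, got {len(parts)}')
    return dict(zip(FIELD_NAMES, parts))


def matches(expression, moment):
    """Return True if moment satisfies every field ('*' or a single integer)."""
    fields = parse_cron(expression)
    actual = {
        'minute': moment.minute,
        'hour': moment.hour,
        'day_of_month': moment.day,
        'month': moment.month,
        # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0
        'day_of_week': (moment.weekday() + 1) % 7,
    }
    return all(spec == '*' or int(spec) == actual[name]
               for name, spec in fields.items())


noon = datetime(2018, 10, 11, 12, 0)
print(parse_cron('0 12 * * *'))
print(matches('0 12 * * *', noon))                           # True
print(matches('0 12 * * *', datetime(2018, 10, 11, 12, 1)))  # False
```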