### Mission: To web-scrape news articles from a variety of sources

- Use the News API to identify interesting articles
- Build scraping code to pull the articles from the source URLs provided by the News API
- Store the news articles in a permanent database
- Process the text of the articles in preparation for NLP
- Build interesting models around the text to understand sentiment

API source: https://newsapi.org/docs/endpoints/everything

In [4]:
import pandas as pd
import time
from datetime import date
import requests

In [5]:
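def save_news (search_terms, n_pagesize, n_pages, save_to_csv): 
    '''
    search_terms = list of keywords to search for.
    n_pagesize = number of articles requested per page.
    n_pages = upper bound (exclusive) on the page numbers requested.
    save_to_csv = True indicates csv will be saved
    '''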
    
    # API requests
    for term in search_terms: 
        url = 'https://newsapi.org/v2/everything?'
        
        param = {
    #'country' : 'us',
    'q': term,  #search term 
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': n_pagesize, #number of results per page
    'language': 'en'}
        print (term)
        
        every_term = requests.get(url, params = param)
        #every_term.json()['articles']
        #every_term.json().keys()
        #print (every_term)

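    # NOTE: this runs after the loop above has finished, so every_term, url and param
    # all refer to the last search term only
    articles = every_term.json()['articles'] 
    
    for page in range(2, n_pages): #fetch pages 2 .. n_pages - 1 for that term (page 1 was fetched above)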
        #print('page: ' + str(page))
        param['page'] = page
        
        more_term = requests.get(url, params = param)
        more_term = more_term.json()['articles']
        
        #print(len(more_term))
        articles.extend(more_term)
    arts = pd.DataFrame(articles)
    
    # Drop null and duplicate 
    arts.dropna(inplace=True)
    arts.drop_duplicates(subset='content',inplace = True)
    
    # Create columns
    arts['source_id'] = arts['source'].map(lambda x: x['id'])
    arts['source_name'] = arts['source'].map(lambda x: x['name']) #split the source dict into separate source_id and source_name columns
    
    # Save df to csv
    if save_to_csv: 
        arts.to_csv('../data/'+ str(date.today())+'.csv' ,index = False, sep = ",") #index = False for no extra columns
        print (f'{len(articles)} unique news have been saved')
    
    # Review df 
    return arts.head()

In [6]:
# news = ['disaster', 'crisis']
# for new in ['disaster', 'crisis']: 
#     save_news (new, n_pagesize=10, n_pages=3, save_to_csv=True)

In [8]:
save_news (['earthquake','Tornado'], n_pagesize=10, n_pages=3, save_to_csv=True)

earthquake
Tornado
20 unique news have been saved


| | author | content | description | publishedAt | source | title | url | urlToImage | source_id | source_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Allan | Every year, around 1,200 tornadoes hit the Uni... | Every year, around 1,200 tornadoes hit the Uni... | 2018-05-07T21:00:00Z | {'id': None, 'name': 'Lifehacker.com'} | How to Stay Safe During a Tornado | https://lifehacker.com/how-to-stay-safe-during... | https://i.kinja-img.com/gawker-media/image/upl... | | Lifehacker.com |
| 1 | Patrick Allan | You see those green, billowing storm clouds ov... | You see those green, billowing storm clouds ov... | 2018-03-29T21:00:00Z | {'id': None, 'name': 'Lifehacker.com'} | If You See Green Storm Clouds, Prepare for the... | https://lifehacker.com/if-you-see-green-storm-... | https://i.kinja-img.com/gawker-media/image/upl... | | Lifehacker.com |
| 2 | Maddie Stone on Earther, shared by Andrew Cout... | Weather geeks went wild last week when the Nat... | Weather geeks went wild last week when the Nat... | 2018-08-09T13:33:00Z | {'id': None, 'name': 'Gizmodo.com'} | California's Viral Fire Tornado Has Scientists... | https://earther.gizmodo.com/californias-viral-... | https://i.kinja-img.com/gawker-media/image/upl... | | Gizmodo.com |
| 3 | Laura Vitto | Harrowing new footage released by California's... | Harrowing new footage released by California's... | 2018-08-18T15:11:00Z | {'id': 'mashable', 'name': 'Mashable'} | Deadly, 40,000-foot fire tornado revealed in n... | https://mashable.com/2018/08/18/fire-tornado-c... | https://i.amz.mshcdn.com/wGGKuJnaCby3RdJJ-sdWp... | mashable | Mashable |
| 5 | ALAN BLINDER | The presidents trip came three days after he a... | President Trump’s visit to Beauregard, Ala., c... | 2019-03-08T20:08:42Z | {'id': 'the-new-york-times', 'name': 'The New ... | Trump Surveys Tornado Damage in Alabama | https://www.nytimes.com/2019/03/08/us/trump-al... | https://static01.nyt.com/images/2019/03/08/us/... | the-new-york-times | The New York Times |
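
In `save_news`, the article list is built after the `for term` loop has finished, so only the last search term contributes rows, which is why the output above contains only tornado stories. A minimal sketch of one way to gather every term and every page is shown here. The names `collect_articles` and `fetch_page` and the `NEWSAPI_KEY` environment variable are assumptions made for illustration and are not part of the original notebook. The sketch also differs from `save_news` in two ways: it requests pages 1 through `n_pages` inclusive, and it only skips rows whose `content` is missing or repeated instead of dropping every row with any null value.

```python
import os


def fetch_page(term, page, page_size):
    """Fetch one page of results for one term from the 'everything' endpoint."""
    import requests  # imported lazily so the collection logic can be exercised offline

    response = requests.get(
        'https://newsapi.org/v2/everything',
        params={
            'q': term,
            'apiKey': os.environ['NEWSAPI_KEY'],  # assumed env var name, keeps the key out of the code
            'pageSize': page_size,
            'page': page,
            'language': 'en',
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()['articles']


def collect_articles(search_terms, n_pagesize, n_pages, fetch=fetch_page):
    """Gather pages 1..n_pages for every term, skipping missing or repeated 'content'."""
    articles, seen = [], set()
    for term in search_terms:
        for page in range(1, n_pages + 1):
            for article in fetch(term, page, n_pagesize):
                content = article.get('content')
                if content is None or content in seen:
                    continue  # no text, or already collected under another term or page
                seen.add(content)
                articles.append(article)
    return articles
```

The resulting list can be passed to `pd.DataFrame(...)` in the same way as `articles` in `save_news`. Passing the fetcher as an argument is a design choice that lets the loop be checked with a stand-in function, without spending API requests.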



#### Investigate the headlines endpoint

In [21]:
url = 'https://newsapi.org/v2/top-headlines?'

param = {
    'country' : 'us',
    'category': 'business', #category filter for the top-headlines endpoint
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': 100,
}

business_headlines = requests.get(url, params = param)
business_headlines = business_headlines.json()
business_headlines.keys()

dict_keys(['status', 'totalResults', 'articles'])

In [1]:
#business_headlines

In [2]:
#business_headlines.json()

In [3]:
#business_headlines['articles'] #where these articles come from 
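
The commented-out cell above asks where the headline articles come from. One way to answer that is to tally the `source` name of each article in the response. The sketch below is an illustration only: `count_sources` is a hypothetical helper, the sample payload is made up to mirror the `status`, `totalResults`, and `articles` keys returned by the endpoint, and it assumes a successful response reports `'ok'` as its status.

```python
from collections import Counter


def count_sources(payload):
    """Return a Counter of source names for the articles in a News API response."""
    # Assumption: a successful response carries status == 'ok'
    if payload.get('status') != 'ok':
        raise ValueError(f"unexpected status: {payload.get('status')!r}")
    return Counter(article['source']['name'] for article in payload['articles'])


# Made-up payload with the same top-level keys as the real response
sample = {
    'status': 'ok',
    'totalResults': 3,
    'articles': [
        {'source': {'id': None, 'name': 'Lifehacker.com'}, 'title': 'A'},
        {'source': {'id': 'mashable', 'name': 'Mashable'}, 'title': 'B'},
        {'source': {'id': None, 'name': 'Lifehacker.com'}, 'title': 'C'},
    ],
}
print(count_sources(sample).most_common())
```

With the real data, `count_sources(business_headlines)` would give the same kind of tally for the business headlines.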

#### Investigate the sources endpoint

In [26]:
url = 'https://newsapi.org/v2/sources?'

param = {
    'country' : 'us',
    'q': 'earthquake', #note: the documented filters for the sources endpoint are category, language and country
    #'category': 'business',
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': 100,
}

earthquake_sources = requests.get(url, params = param)

In [4]:
#earthquake_sources.json()

In [28]:
import time

# Create the timestamp: seconds since the epoch, as a float
now = time.time()

# Keep `now` numeric. After now = str(now), the ':.0f' format used below fails with
# "Unknown format code 'f' for object of type 'str'"

In [29]:
arts.to_csv(f'./earthquake_{now:.0f}.csv')
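
The file name above relies on formatting a float timestamp with `:.0f`. A minimal sketch of that step as a reusable function follows; `timestamped_csv_name` is a hypothetical helper name, not something defined in the notebook.

```python
import time


def timestamped_csv_name(prefix, now=None):
    """Build a CSV file name of the form './<prefix>_<epoch seconds>.csv'."""
    if now is None:
        now = time.time()  # float seconds since the epoch
    # ':.0f' rounds to whole seconds and only works on numbers,
    # so the float is formatted directly instead of being converted with str() first
    return f'./{prefix}_{now:.0f}.csv'


print(timestamped_csv_name('earthquake'))
```

Because each run gets a different timestamp, repeated runs write separate files instead of overwriting one another.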

### Schedule automated tasks with Cron

Next, let's set up a Python script to collect data automatically every day. We can do this with cron. 

Here are a couple of great resources you might reference as we go through cron:   
https://www.taniarascia.com/setting-up-a-basic-cron-job-in-linux/  
https://ole.michelsen.dk/blog/schedule-jobs-with-crontab-on-mac-osx.html  

#### What is cron? 

Cron is a scheduler that lets a user run commands automatically at specific times. The way cron encodes timing is a bit arcane. 

Let's check out this site and try to understand how it works: https://crontab.guru/#0_0_*_*_*

A crontab schedule has five fields, each of which can be \*, and together they specify when the job runs. 

For example, 

\* \* \* \* \*

equates to every minute, every hour, every day, every month, and every day of the week  

This can be a bit confusing. I suggest that when you want to set up a crontab for a specific time, you play around with crontab.guru and try to get it to output the timing you want. Let's try to get a cron schedule for once a day at noon. 

Each field has a meaning, and \* means "any value". The first field is the minute, the second is the hour, the third is the day of the month, the fourth is the month, and the fifth is the day of the week.
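
To make the five fields concrete, here is a small sketch that checks whether a given moment falls on a cron schedule. It is a simplified illustration, not how cron itself is implemented: `parse_cron` and `cron_matches` are hypothetical names, and only `*` and single integers are handled (no ranges, lists, or step values). With these rules, once a day at noon is written as `0 12 * * *`.

```python
from datetime import datetime

FIELD_NAMES = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def parse_cron(expression):
    """Split a five-field cron expression into a dict keyed by field name."""
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f'expected 5 fields, got {len(fields)}')
    return dict(zip(FIELD_NAMES, fields))


def cron_matches(expression, moment):
    """Return True if `moment` falls on the schedule ('*' and single integers only)."""
    spec = parse_cron(expression)
    actual = {
        'minute': moment.minute,
        'hour': moment.hour,
        'day_of_month': moment.day,
        'month': moment.month,
        # cron counts Sunday as 0, while Python's weekday() counts Monday as 0
        'day_of_week': (moment.weekday() + 1) % 7,
    }

    def field_ok(name):
        return spec[name] == '*' or int(spec[name]) == actual[name]

    time_ok = all(field_ok(name) for name in ('minute', 'hour', 'month'))
    if spec['day_of_month'] != '*' and spec['day_of_week'] != '*':
        # When both day fields are restricted, cron runs if either one matches
        day_ok = field_ok('day_of_month') or field_ok('day_of_week')
    else:
        day_ok = field_ok('day_of_month') and field_ok('day_of_week')
    return time_ok and day_ok


print(cron_matches('0 12 * * *', datetime(2019, 3, 8, 12, 0)))
```

Trying a few expressions against known dates this way is a quick complement to checking them on crontab.guru.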