### Mission: To webscrap news articles from a variety of sources

- Use news API to identify interesting articles
- Build scrapping code to scrape the articles from the source provided by the news API
- Store the news articles in a permanent database
- Process the text of the articles in preparation for NLP
- Build interesting models around the text to understand sentiment

API source: https://newsapi.org/docs/endpoints/everything

In [15]:
# !openssl version

OpenSSL 1.1.1a  20 Nov 2018


In [2]:
import pandas as pd
from datetime import date
import requests
import time
now = time.time()

In [2]:
def save_news (search_terms,file_name, n_pagesize, start_page, end_pages, save_to_csv): 
    '''
    term_request = which is the key word for search.
    save_to_csv = True indicates csv will be saved
    '''
    
    # API requests
    #for term in search_terms: 
    url = 'https://newsapi.org/v2/everything?'
        
    param = {
    #'country' : 'us',
    'q': search_terms,  #search term 
    'apiKey' : 'e685d6e1420f4882b86d029ed3c1a11d',
    'pageSize': n_pagesize, #max page
    'language': 'en'}
    print (search_terms)
        
    every_term = requests.get(url, params = param)

    articles = every_term.json()['articles'] 
    
    for page in range(start_page, end_pages): #go throught 10 times, and get more pages, 10 more pages
        param['page'] = page
        
        more_term = requests.get(url, params = param)
        more_term = more_term.json()['articles']
        
        articles.extend(more_term)
    arts = pd.DataFrame(articles)
    
    # Drop null and duplicate 
    arts.dropna(inplace=True)
    arts.drop_duplicates(subset='content',inplace = True)
    
    # Creahttp://localhost:8888/notebooks/dsi/Project4_Disaster_Test_Classification/code/NewAPI_exploration.ipynb#te columns
    arts['source_id'] = arts['source'].map(lambda x: x['id'])
    arts['source_name'] = arts['source'].map(lambda x: x['name']) #break up the source, source id, and name colums seperate
    arts.drop (columns = ['source'], axis=1)
    arts['types'] = str(search_terms)
    arts['yes_disaster'] = 1

    # Save df to csv
    if save_to_csv == True: 
        arts.to_csv('../data/'+str(file_name)+'_'+str(search_terms)+'_'+str(now) +'.csv' ,index = False, sep = ",") #index = False for no extra columns
        print (f'{len(articles)} unique news haved been saved')

In [19]:
list_ev = set(['blizzard', 'storm complex', 'flood', 'snow storm', 'mudflow', 'tornado','Complex fire', 'River floods',
         'earthquake', 'tsunami', 'California wildfires', 'Snow storm','deadlier', 'mph', 'Lake Storm', 'Hurricane Katrina',
         'tornado outbreak', 'ice storm','tsunami', 'tornado', 'blizzard','tremor', 'twister', 'cyclone', 'fire', 'lightning', 
         'hurricane', 'whirlpool', 'cloud', 'gale', 'force', 'snowstorm', 'nimbus', 'casualty', 'fatality', 'lost', 
         'tension', 'uproot', 'arsonist', 'rescue', 'fault', 'Natural disasters', 'avalanche', 'drought','dust storm'])

In [None]:
keyword_for_terrorism = []

In [20]:
list_final = list(list_ev)

In [21]:
len(list_final)

42

In [22]:
for item in list_final: 
    save_news (item, file_name ='e', n_pagesize=10, start_page=2, end_pages=25, save_to_csv=True)
#save_news (['disaster'], file_name ='b', n_pagesize=10, start_page=2, end_pages=3, save_to_csv=True)

earthquake
240 unique news haved been saved
snowstorm
240 unique news haved been saved
lightning
240 unique news haved been saved
Complex fire
240 unique news haved been saved
Hurricane Katrina
240 unique news haved been saved
hurricane
240 unique news haved been saved
uproot
240 unique news haved been saved
ice storm
240 unique news haved been saved
tornado outbreak
240 unique news haved been saved
fire
240 unique news haved been saved
mph
240 unique news haved been saved
storm complex
240 unique news haved been saved
drought
240 unique news haved been saved
cyclone
240 unique news haved been saved
whirlpool
240 unique news haved been saved
blizzard
240 unique news haved been saved
dust storm
240 unique news haved been saved
casualty
240 unique news haved been saved
nimbus
240 unique news haved been saved
fatality
240 unique news haved been saved
tension
240 unique news haved been saved
fault
240 unique news haved been saved
tornado
240 unique news haved been saved
lost
240 unique new

In [24]:
!pwd

/Users/evelyn/Documents/DSI/project_4/Project-4-Disaster-Classification/code


In [28]:
path+ "/e_*.csv"

'/Users/evelyn/Documents/DSI/project_4/Project-4-Disaster-Classification/code/e_*.csv'

In [30]:
#Code reference stack overflow 
#merge all files together 
import glob

all_files = glob.glob("../data/e_*.csv")

lst = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    lst.append(df)

df = pd.concat(lst, axis=0, ignore_index=True)

In [31]:
#df= pd.read_csv('../data/b_Harvey_1555193272.906144.csv')
df.columns

Index(['author', 'content', 'description', 'publishedAt', 'source', 'title',
       'url', 'urlToImage', 'source_id', 'source_name', 'types',
       'yes_disaster'],
      dtype='object')

In [32]:
df.shape

(8744, 12)

In [38]:
df_title = df[['title', 'yes_disaster']]

In [39]:
df_title.to_csv('../data/e_title_only.csv')

In [40]:
df_title.shape

(8744, 2)

In [43]:
df_title.drop_duplicates()

Unnamed: 0,title,yes_disaster
0,FilmStruck Is the Latest Casualty of the Strea...,1
1,news analysis: We’re All Stuck Inside George a...,1
2,"Belgian Prime Minister, Facing Populist Revolt...",1
3,Back-to-School Shopping for Districts: Armed G...,1
4,"White House Says ‘Horrific, Tragic’ Death of M...",1
5,Chief Security Officer Alex Stamos is leaving ...,1
6,The Death Toll for Afghan Forces Is Secret. He...,1
7,People hospitalized after Missouri tourist boa...,1
8,"China’s third-largest bike sharing service, wh...",1
9,How to Avoid the Most Common Moving Mistakes,1


In [35]:
df.to_csv('../data/e_all_keyword_list.csv')

In [36]:
df.head()

Unnamed: 0,author,content,description,publishedAt,source,title,url,urlToImage,source_id,source_name,types,yes_disaster
0,Brian Raftery,"Early on Friday, WarnerMedia announced it was ...",The AT&T/Time Warner merger has claimed anothe...,2018-10-26T19:12:28Z,"{'id': 'wired', 'name': 'Wired'}",FilmStruck Is the Latest Casualty of the Strea...,https://www.wired.com/story/rip-filmstruck-str...,https://media.wired.com/photos/5bd365c6189fdd7...,wired,Wired,casualty,1
1,MARK LEIBOVICH,Politics has always loved a good odd-couple st...,Are we watching a life partnership fracture on...,2019-03-30T18:30:01Z,"{'id': 'the-new-york-times', 'name': 'The New ...",news analysis: We’re All Stuck Inside George a...,https://www.nytimes.com/2019/03/30/sunday-revi...,https://static01.nyt.com/images/2019/03/31/opi...,the-new-york-times,The New York Times,casualty,1
2,MATT APUZZO and MILAN SCHREUER,The agreement has recently become a flash poin...,A government collapse would be a high-profile ...,2018-12-18T21:36:47Z,"{'id': 'the-new-york-times', 'name': 'The New ...","Belgian Prime Minister, Facing Populist Revolt...",https://www.nytimes.com/2018/12/18/world/europ...,https://static01.nyt.com/images/2018/12/19/wor...,the-new-york-times,The New York Times,casualty,1
3,PATRICIA MAZZEI,“There were opportunities for the staff to hav...,"The shooting in Parkland, Fla., spurred school...",2018-08-11T15:18:34Z,"{'id': 'the-new-york-times', 'name': 'The New ...",Back-to-School Shopping for Districts: Armed G...,https://www.nytimes.com/2018/08/11/us/back-to-...,https://static01.nyt.com/images/2018/08/12/us/...,the-new-york-times,The New York Times,casualty,1
4,RON NIXON,"Around 6:30 a.m., officials said, Jakelin bega...",A 7-year-old Guatemalan girl died from dehydra...,2018-12-14T20:00:29Z,"{'id': 'the-new-york-times', 'name': 'The New ...","White House Says ‘Horrific, Tragic’ Death of M...",https://www.nytimes.com/2018/12/14/us/politics...,https://static01.nyt.com/images/2018/12/15/us/...,the-new-york-times,The New York Times,casualty,1


---

In [4]:
#Combind rows 
df2 = pd.read_csv('../data/Evelyn_csv_data/e_title_list.csv')

In [5]:
df2['yes_disaster'].value_counts()

0    6224
1    2520
Name: yes_disaster, dtype: int64

In [6]:
df3 = pd.read_csv('../data/Evelyn_csv_data/e_all_keyword_list.csv')

In [7]:
df3['yes_disaster'].value_counts()

1    8744
Name: yes_disaster, dtype: int64

In [8]:
df3['yes_disaster'] = df2['yes_disaster']

In [9]:
df3['yes_disaster'].value_counts()

0    6224
1    2520
Name: yes_disaster, dtype: int64

In [12]:
df3.drop(columns = 'Unnamed: 0', inplace = True)

In [13]:
df3.head()

Unnamed: 0,author,content,description,publishedAt,source,title,url,urlToImage,source_id,source_name,types,yes_disaster
0,Brian Raftery,"Early on Friday, WarnerMedia announced it was ...",The AT&T/Time Warner merger has claimed anothe...,2018-10-26T19:12:28Z,"{'id': 'wired', 'name': 'Wired'}",FilmStruck Is the Latest Casualty of the Strea...,https://www.wired.com/story/rip-filmstruck-str...,https://media.wired.com/photos/5bd365c6189fdd7...,wired,Wired,casualty,0
1,MARK LEIBOVICH,Politics has always loved a good odd-couple st...,Are we watching a life partnership fracture on...,2019-03-30T18:30:01Z,"{'id': 'the-new-york-times', 'name': 'The New ...",news analysis: We’re All Stuck Inside George a...,https://www.nytimes.com/2019/03/30/sunday-revi...,https://static01.nyt.com/images/2019/03/31/opi...,the-new-york-times,The New York Times,casualty,0
2,MATT APUZZO and MILAN SCHREUER,The agreement has recently become a flash poin...,A government collapse would be a high-profile ...,2018-12-18T21:36:47Z,"{'id': 'the-new-york-times', 'name': 'The New ...","Belgian Prime Minister, Facing Populist Revolt...",https://www.nytimes.com/2018/12/18/world/europ...,https://static01.nyt.com/images/2018/12/19/wor...,the-new-york-times,The New York Times,casualty,0
3,PATRICIA MAZZEI,“There were opportunities for the staff to hav...,"The shooting in Parkland, Fla., spurred school...",2018-08-11T15:18:34Z,"{'id': 'the-new-york-times', 'name': 'The New ...",Back-to-School Shopping for Districts: Armed G...,https://www.nytimes.com/2018/08/11/us/back-to-...,https://static01.nyt.com/images/2018/08/12/us/...,the-new-york-times,The New York Times,casualty,0
4,RON NIXON,"Around 6:30 a.m., officials said, Jakelin bega...",A 7-year-old Guatemalan girl died from dehydra...,2018-12-14T20:00:29Z,"{'id': 'the-new-york-times', 'name': 'The New ...","White House Says ‘Horrific, Tragic’ Death of M...",https://www.nytimes.com/2018/12/14/us/politics...,https://static01.nyt.com/images/2018/12/15/us/...,the-new-york-times,The New York Times,casualty,0


In [14]:
df3.to_csv('../data/consolidate_data/e_consolidated_4_18_2019.csv')

In [15]:
df3['yes_disaster'].unique()

array([0, 1])