This notebook processes "news.csv", a file produced in an earlier step that contains a list of news URLs and their frequencies.
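
For orientation, the cells in this notebook read the URLs from a column named "Unnamed: 0", which is the name pandas assigns to a column whose header cell is blank. The sketch below uses a made-up two-row sample to show that behavior and one way to give the column a clearer name. The `frequency` header and the example.com URLs are assumptions for illustration, not the real contents of news.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for news.csv: blank first header, assumed "frequency" column.
sample_csv = io.StringIO(
    ",frequency\n"
    "https://example.com/story-a,3\n"
    "https://example.com/story-b,1\n"
)

sample = pd.read_csv(sample_csv)
print(list(sample.columns))   # the blank header shows up as "Unnamed: 0"

# Optional: rename the column so later code can refer to it as "url".
renamed = sample.rename(columns={"Unnamed: 0": "url"})
print(renamed["url"].tolist())
```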

In [1]:
from goose3 import Goose
from goose3.configuration import Configuration
import pandas as pd
import csv

config = Configuration()
config.strict = False
config.browser_user_agent = 'Chrome'
config.http_timeout = 70.05

g = Goose(config)
short = pd.read_csv('news.csv')             # load the URL list into a DataFrame
article_data = []                           # list that will hold one dict per article

From here, the code loops through each row and grabs the article URL. For each article, it stores the following:

    1) Article Title
    2) Author 
    3) Publication Date
    4) Body of Article
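
The per-article step described above can be sketched as a small helper that turns one URL into one record. This is a minimal sketch, not the notebook's own code: `should_skip`, `build_record`, `collect_records`, and `SKIP_HOSTS` are hypothetical names, and the extractor is passed in as a plain callable (for example `g.extract`) so that a failed download is reported and skipped instead of stopping the run. It assumes the returned object exposes the `title`, `authors`, `publish_date`, and `cleaned_text` attributes used in this notebook, and it matches skipped sites by hostname rather than by substring, which is a deliberate variation.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

# Hypothetical skip list, mirroring the two sites this notebook avoids.
SKIP_HOSTS = ("latimes.com", "usat.ly")


def should_skip(url):
    """Return True when the URL's host is, or is a subdomain of, a skipped host."""
    host = urlparse(url).netloc.lower()
    return any(host == h or host.endswith("." + h) for h in SKIP_HOSTS)


def build_record(extract, url):
    """Call extract(url) and map the result to the five stored fields."""
    article = extract(url)
    return {
        "url": url,
        "title": article.title,
        "author": article.authors,
        "date": article.publish_date,
        "body": article.cleaned_text,
    }


def collect_records(extract, urls, limit=399):
    """Build records for at most `limit` URLs, skipping blocked or failing ones."""
    records = []
    for url in list(urls)[:limit]:
        if should_skip(url):
            continue
        try:
            records.append(build_record(extract, url))
        except Exception as exc:  # broad on purpose: any failure skips the URL
            print(f"skipped {url}: {exc}")
    return records


# Stand-in extractor so the sketch runs without network access.
def fake_extract(url):
    if "broken" in url:
        raise ValueError("could not fetch")
    return SimpleNamespace(title="T", authors=["A"], publish_date=None, cleaned_text="text")


demo = collect_records(
    fake_extract,
    ["https://example.com/ok", "https://www.latimes.com/x", "https://example.com/broken"],
)
print(len(demo))
```

With the real extractor, the call would look like `collect_records(g.extract, short["Unnamed: 0"])`.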

In [2]:
# iterate over the URL column (pandas names a column with a blank header "Unnamed: 0")
urls = short["Unnamed: 0"]
count = 1
for url in urls:
    if count == 400:                        # stop after the first 399 rows
        break
    count += 1
    # articles from these sites give errors when trying to access them, so skip them
    if "latimes.com" in url or "usat.ly" in url:
        continue
    # print("Line ", count, " ", url)
    article = g.extract(url)                # download and parse the article at this URL
    article_data.append({'url': url,
                         'title': article.title,
                         'author': article.authors,
                         'date': article.publish_date,
                         'body': article.cleaned_text})


print("\n\nEnd of File\n")
g.close()



End of File



Now that all of the data is stored in a list of dictionaries, we write it to a CSV file so that we can analyze it later.

In [3]:
csv_col = ['url', 'title', 'author', 'date', 'body']
csv_file = "article_data.csv"
try:
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(csv_file, 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_col)
        writer.writeheader()
        writer.writerows(article_data)      # one row per article dict
except OSError:
    print("I/O Error")
    

print("End of program")

End of program
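
One detail worth checking when the file is read back: `csv.DictWriter` writes non-string values using `str()`, so a list of authors ends up in the file as its Python representation, such as `['A', 'B']`. Below is a minimal sketch of one way to avoid that, joining the authors with a separator before writing. The `flatten_record` helper, the `"; "` separator, and the sample record are assumptions for illustration.

```python
import csv
import io

FIELDS = ["url", "title", "author", "date", "body"]


def flatten_record(record, sep="; "):
    """Return a copy of the record with the author list joined into one string."""
    flat = dict(record)
    authors = flat.get("author") or []
    if isinstance(authors, (list, tuple)):
        flat["author"] = sep.join(authors)
    return flat


# Made-up record in the same shape as the entries of article_data.
sample_records = [
    {"url": "https://example.com/a", "title": "Title", "author": ["Ann Lee", "Bo Chen"],
     "date": None, "body": "First line.\nSecond line."},
]

# Write to an in-memory buffer instead of a file so the sketch has no side effects.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(flatten_record(r) for r in sample_records)

# Read the data back to confirm the round trip.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["author"])
```

In the notebook, the same idea would amount to passing `flatten_record(data)` for each entry of `article_data` to the writer.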
