# This code collects articles from The Guardian newspaper 

### Import necessary libraries

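requests is a library for sending HTTP requests. The notebook uses it to query the API and to download each article page.

html is a module in the lxml library that parses the HTML of a webpage into a tree you can search.

datetime is a standard library module for working with dates. date represents a calendar date and timedelta represents the difference between two dates.

string is a standard library module of string constants, such as string.punctuation. It is imported here, but the code below removes commas and periods with its own loop and does not use it.

csv is a standard library module for reading and writing csv files.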

In [1]:
import requests
from lxml import html
from datetime import date, timedelta
import string
import csv

### Set Variables
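Set a variable for the final date of the collection window. Python will collect articles from the days leading up to this date.
The arguments are given in the order year, month, day, which is the order the date constructor in the datetime library expects.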

In [2]:
final_date = date(2019, 2, 28)

Set a variable for the number of days to go back from the final date.

In [3]:
num_days = 365

Create an empty dictionary for the archive materials obtained from The Guardian. 

In [4]:
archive = {}

Set a variable for the string to search for in the API.

In [6]:
string_to_search = "protest"

Set a counter variable starting at 0 that will eventually count the number of articles collected in the for loop.

In [7]:
counter = 0 

### Run For Loops

#### External For Loop

For each value of i in the range set by num_days, subtract i + 1 days from the final date to get the date to query. The result is then formatted as a YYYY-MM-DD string, which is the form the request url uses.
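
The date arithmetic can be tried on its own before any request is sent. The following is a minimal sketch; the helper name dates_to_query is hypothetical and is not part of the notebook.

```python
from datetime import date, timedelta

def dates_to_query(final_date, num_days):
    """Return the num_days dates before final_date, newest first, as YYYY-MM-DD strings."""
    return [
        (final_date - timedelta(days=i + 1)).strftime('%Y-%m-%d')
        for i in range(num_days)
    ]

dates = dates_to_query(date(2019, 2, 28), 3)
print(dates)  # ['2019-02-27', '2019-02-26', '2019-02-25']
```

Because the offset is i + 1 rather than i, the final date itself is never queried; the first request covers the day before it.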

Create a variable called guardian_api_url that holds the search url for the Guardian's API. The url contains three {} placeholders, for from-date, to-date and q, which are filled in with .format(start_date, start_date, string_to_search). Setting from-date and to-date to the same day limits each request to a single day, and q is the query term, which I am setting equal to the string_to_search variable. The remaining parameters (order-by, use-date, page-size and api-key) are written directly into the url.
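
As an alternative to filling in the url with .format(), the query string can be built from a dictionary with urlencode from the standard library, which also escapes search terms that contain spaces. This is a sketch of a different technique from the one the notebook uses; the function name build_search_url and the placeholder key YOUR_API_KEY are assumptions for illustration, while the parameter names are the ones that appear in the notebook's url.

```python
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"

def build_search_url(day, query, api_key):
    """Build a one-day search url with the same parameters the notebook uses."""
    params = {
        "from-date": day,
        "to-date": day,          # same day as from-date: one day per request
        "order-by": "newest",
        "use-date": "published",
        "page-size": 150,
        "q": query,
        "api-key": api_key,      # placeholder, replace with your own key
    }
    return BASE_URL + "?" + urlencode(params)

url = build_search_url("2019-02-27", "climate protest", "YOUR_API_KEY")
print(url)
```

With a single-word query such as "protest" both approaches give the same url; the difference only shows when the query contains spaces or other special characters.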

Then use the url you have made to send a GET request to the Guardian's API with requests.get() and store the response object in r. The body of the response is json, so call the .json() method to parse it into Python dictionaries and lists, and save the result in json_output.

Then open one of the request urls that is printed and paste the json it returns into an online json formatter to see which parts of the json you need. I used https://jsonformatter.org/json-pretty-print

The json is nested, so first direct python to the 'response' key and then to the 'results' key. This list holds just the articles and is saved in just_articles.
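
The nested lookup can be tried offline on a small hand-written payload. The sample below is invented for illustration and contains only the keys the notebook reads ('response', 'results', 'webTitle', 'webUrl'); a real response contains more fields.

```python
import json

# Invented sample, trimmed to the keys used in this notebook.
sample_body = '''
{
  "response": {
    "results": [
      {"webTitle": "Example headline one",
       "webUrl": "https://www.theguardian.com/example/one"},
      {"webTitle": "Example headline two",
       "webUrl": "https://www.theguardian.com/example/two"}
    ]
  }
}
'''

json_output = json.loads(sample_body)
just_articles = json_output['response']['results']

for article in just_articles:
    print(article['webTitle'], '->', article['webUrl'])

# A more defensive lookup that yields an empty list if a key is missing.
safe_articles = json_output.get('response', {}).get('results', [])
```

The defensive form at the end is optional; it avoids a KeyError if a response arrives without the expected keys.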

#### Internal For Loop

Within this for loop is another for loop that goes through each article in the list just_articles, adds one to the counter and (so you know it's working) prints "requesting article content from ..." next to the article's web url.

The url and title of each article come from the API result and are saved in the variables url and title.

Use the url of each article to send a GET request for the article page itself.

Then go to one of the article pages in a browser and inspect it to see which part of the html you want. Use the function html.fromstring() to parse the page content into an HTML tree.

Then use the function .xpath('//p') to take just the paragraph elements from the html tree and save them to a variable called paragraphs.

Create a variable called all_text_from_article and save the title of the article in it, followed by a period and a space.

#### Sub-Internal For Loop 1

Then create a for loop inside this one that uses enumerate() to number the paragraphs (so that you can select the paragraphs you want).

Use the function text_content() to get the actual text from each paragraph element.

When I viewed the text I could see that the first few paragraphs are junk and I don't want them in the final dataset, so I create an if statement that keeps a paragraph only if its index is greater than 3. Because enumerate() starts counting at 0, this skips the first four paragraphs.

Then update all_text_from_article by taking the previous all_text_from_article (which starts out containing only the title) and adding the paragraph text to it.
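
The enumerate-and-skip step can be tested with plain strings standing in for the paragraph text. This is a minimal sketch: the function name join_article_text and the skip parameter are hypothetical, and skip=4 mirrors the i > 3 test in the notebook's loop.

```python
def join_article_text(title, paragraph_texts, skip=4):
    """Start from the title, then append every paragraph whose index is >= skip."""
    all_text = title + '. '
    for index, text in enumerate(paragraph_texts):
        if index >= skip:  # same as index > 3 when skip is 4
            all_text = all_text + text
    return all_text

paragraph_texts = [
    'junk 0', 'junk 1', 'junk 2', 'junk 3',
    'First real paragraph. ', 'Second real paragraph.',
]
result = join_article_text('Example title', paragraph_texts)
print(result)  # Example title. First real paragraph. Second real paragraph.
```

Paragraphs are joined with no separator, so unless a paragraph ends in a space, its last word runs straight into the first word of the next one.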

Create a string variable containing a period and a comma and call it things_to_remove.

Create an empty string variable called cleaned_text.

#### Sub-Internal For Loop 2

Then for each character in all_text_from_article, if the character is in things_to_remove, continue past it; if it's not, add it to cleaned_text.

Finally, take the archive dictionary you originally made and save cleaned_text into it, with the title variable as the key and the cleaned text as the value.
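
The character filter can be checked on a short string. The sketch below wraps the loop in a hypothetical helper called remove_characters and compares it with str.translate, a built-in alternative that gives the same result; the sample sentence and title are made up.

```python
things_to_remove = ".,"

def remove_characters(text, unwanted):
    """Copy text one character at a time, skipping anything found in unwanted."""
    cleaned_text = ""
    for character in text:
        if character in unwanted:
            continue
        cleaned_text = cleaned_text + character
    return cleaned_text

sample = "Hello, world. This is a test."
loop_result = remove_characters(sample, things_to_remove)

# Built-in alternative: delete every character listed in things_to_remove.
translate_result = sample.translate(str.maketrans("", "", things_to_remove))

archive = {}
archive["Example title"] = loop_result
print(loop_result)  # Hello world This is a test
```

Since the archive is keyed by title, two articles with exactly the same title would leave only one entry, with the later article replacing the earlier one.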

In [8]:
for i in range(num_days):

    # Step back one more day on each pass, then format as YYYY-MM-DD for the API.
    start_date = final_date - timedelta(days=i+1)
    start_date = start_date.strftime('%Y-%m-%d')

    # from-date and to-date are the same day, so each request covers a single day.
    guardian_api_url = "https://content.guardianapis.com/search?from-date={}&to-date={}&order-by=newest&use-date=published&page-size=150&q={}&api-key=f7d38377-bc20-48ed-be11-31423575280d".format(start_date, start_date, string_to_search)
    print(guardian_api_url)

    r = requests.get(guardian_api_url)
    json_output = r.json()
    just_articles = json_output['response']['results']

    for article in just_articles:
        counter = counter + 1
        print( 'requesting article content from ...', article['webUrl'] )
        url = article['webUrl']
        title = article['webTitle']

        # Download the article page and parse it into an HTML tree.
        page = requests.get(url)
        tree = html.fromstring(page.content)
        paragraphs = tree.xpath('//p')

        all_text_from_article = title + '. '
        # p_index is used here so that the outer loop's i is not reused.
        for p_index, content in enumerate(paragraphs):
            text_from_p = content.text_content()
            # Keep paragraphs from index 4 onward; the earlier ones are junk.
            if p_index > 3:
                all_text_from_article = all_text_from_article + text_from_p

        # Drop periods and commas from the collected text.
        things_to_remove = ".,"
        cleaned_text = ""
        for character in all_text_from_article:
            if character in things_to_remove:
                continue
            else:
                cleaned_text = cleaned_text + character

        archive[title] = cleaned_text

https://content.guardianapis.com/search?from-date=2019-02-27&to-date=2019-02-27&order-by=newest&use-date=published&page-size=150&q=protest&api-key=f7d38377-bc20-48ed-be11-31423575280d
requesting article content from ... https://www.theguardian.com/football/blog/2019/feb/27/willy-caballero-chelsea-tottenham
requesting article content from ... https://www.theguardian.com/commentisfree/2019/feb/28/protests-such-as-dont-kill-live-music-seem-to-represent-white-self-interest
requesting article content from ... https://www.theguardian.com/world/2019/feb/28/last-four-refugee-children-leave-nauru-for-resettlement-in-us
requesting article content from ... https://www.theguardian.com/us-news/2019/feb/27/house-gun-control-legislation-passed
requesting article content from ... https://www.theguardian.com/politics/2019/feb/27/labour-suspends-chris-williamson-over-antisemitism-remarks
requesting article content from ... https://www.theguardian.com/environment/2019/feb/27/bailiffs-move-in-on-heathrow-

### Export to csv 

In [9]:
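# newline='' stops the csv module from writing blank rows on Windows.
with open('guardiandata.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    for key, value in archive.items():
        writer.writerow([key, value])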
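
To check that the export round-trips, the same writer can be pointed at an in-memory buffer and read back with csv.reader. This is a sketch with a made-up one-entry archive; the in-memory buffer stands in for guardiandata.csv so nothing is written to disk.

```python
import csv
import io

# Made-up entry: the title keeps its comma, the cleaned text does not.
archive = {"Title, with comma": "Title with comma Some cleaned text"}

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for key, value in archive.items():
    writer.writerow([key, value])

# Rewind and read the rows back as lists of strings.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)  # [['Title, with comma', 'Title with comma Some cleaned text']]
```

The writer puts quotes around the title because it contains a comma, and the reader removes them again, so titles with commas do not split into extra columns.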