# Scraping The Guardian: Opinion is Free

This is the first level of a text analysis scaffold. I have chosen to scrape only one section of The Guardian to help me investigate the quality and usefulness of the text for analysis. I hope to use Natural Language Processing to find value in the choice of words and phrases for students of English as a second language.

## First Steps:
1. Scrape the relevant section and save the HTML into a file
2. Read the HTML file and parse it using Beautiful Soup, and return only the relevant content (i.e., removing anything that is not related to the opinion articles)
3. Extract the specific information for each article (date, author, title, URL)
4. Create a data frame to visualise the data
5. Export data as a .csv document

In [1]:
import requests 
from datetime import datetime
import re
from bs4 import BeautifulSoup
import pandas as pd

### Helper functions
Only one function is called directly: `main_export_to_csv()`. The helper functions below do the heavy lifting *inside* that main function.

`scrape_opinions_data()` scrapes the entire page and saves it to a file. Saving to a file is not strictly necessary, but it helps if I need to consult the shape of the HTML later.

In [2]:
def scrape_opinions_data():
    opinions_url = "https://www.theguardian.com/uk/commentisfree"
    response = requests.get(opinions_url)
    # Stop early if the request failed (e.g. a 404 or 500 response)
    response.raise_for_status()
    main_content = response.text

    with open('opinions_webpage.html', 'w') as file:
        file.write(main_content)

With the help of Beautiful Soup, `parse_select_main_content()` finds and selects only the part of the HTML that interests me.

In [3]:
def parse_select_main_content():
    with open('./opinions_webpage.html') as fp:
        soup = BeautifulSoup(fp, 'html.parser')
    
    all_articles = soup.find('section', {'id': "opinion"}).find_all('div', {'class': "fc-item__content"})
    
    return all_articles

While the date used in the URL is *human-readable*, it is better for future analysis to convert it to a predictable, *machine-readable* format.

In [4]:
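def parse_format_date(date):
    # Title-case the month so that e.g. 'jan' becomes 'Jan' for the %b directive
    date_title_case = date.title()
    long_date = datetime.strptime(date_title_case, '%Y/%b/%d')
    # Keep only the calendar date, dropping the time component
    return long_date.date()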
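
As a quick check of the conversion, the sketch below applies the same `strptime` pattern to a sample date fragment of the kind found in the article URLs. The sample string is illustrative only. Note that `%b` matches abbreviated month names according to the current locale, so this assumes an English locale.

```python
from datetime import datetime

def parse_format_date(date):
    # Title-case the fragment so 'jan' becomes 'Jan'
    date_title_case = date.title()
    return datetime.strptime(date_title_case, '%Y/%b/%d').date()

sample = '2023/jan/28'  # illustrative fragment, shaped like the one in a URL
parsed = parse_format_date(sample)

# A datetime.date prints in ISO 8601 form (YYYY-MM-DD), which sorts correctly as text
print(parsed)
print(parsed.isoformat())
```

Because the result is a `datetime.date` object, it can later be compared, sorted, or written to CSV without any further string handling.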

`append_to_lists` is a little more difficult to read because I perform a couple of steps in one compound line. For example, the line that starts `authors.append ...` does three things:
1. It finds a `div` with class `fc-item__byline`: `.find('div', {'class': "fc-item__byline"})`,
2. it strips the white space and carriage returns from the text: `.text.strip()`,
3. and it appends the result to the `authors` list.

In [5]:
def append_to_lists(articles):
    for article in articles:
        link = article.h3.a['href']
        authors.append(article.find('div', {'class': "fc-item__byline"}).text.strip())
        # Characters 42 to 52 of the URL hold the date, e.g. '2023/jan/28'
        dates.append(parse_format_date(link[42:53]))
        titles.append(article.find('span', {'class': "js-headline-text"}).text.strip())
        urls.append(link)
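
The fixed slice `[42:53]` depends on every URL sharing the same prefix length. As a possible alternative, here is a minimal sketch that locates the date with a regular expression instead. The helper name `extract_date_fragment` and the sample URL are hypothetical, made up for illustration; only the overall URL shape is taken from the links scraped above.

```python
import re
from datetime import datetime

# Matches a 'YYYY/mon/DD' fragment such as '2023/jan/28' anywhere in a URL
DATE_PATTERN = re.compile(r'\d{4}/[a-z]{3}/\d{2}')

def extract_date_fragment(url):
    """Return the 'YYYY/mon/DD' part of a URL, or None if it is absent."""
    match = DATE_PATTERN.search(url)
    return match.group(0) if match else None

# Hypothetical URL following the shape of the section's article links
sample_url = 'https://www.theguardian.com/commentisfree/2023/jan/28/example-article'
fragment = extract_date_fragment(sample_url)

print(fragment)
print(datetime.strptime(fragment.title(), '%Y/%b/%d').date())
```

The trade-off is a slightly longer helper in exchange for not breaking if the domain or section prefix ever changes length.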


`create_dict_df()` simply gathers the lists filled by `append_to_lists` above into a dictionary (dict) so that Pandas can render them as a data frame (df).

In [6]:
def create_dict_df():
    dic = {'date': dates, 'author': authors, 'title': titles, 'url': urls}
    opinions_df = pd.DataFrame(dic)
    
    display(opinions_df)
    return opinions_df

In [7]:
def export_df_to_csv(df):
    # index=False leaves the row index out of the CSV file
    df.to_csv('opinions.csv', index=False)

In [8]:
dates = []
authors = []
titles = []
urls = []

def main_export_to_csv():
    scrape_opinions_data()
    article_items = parse_select_main_content()
    append_to_lists(article_items)
    df = create_dict_df()
    export_df_to_csv(df)

In [9]:
main_export_to_csv()

| | date | author | title | url |
|---|---|---|---|---|
| 0 | 2023-01-28 | Mark Rice-Oxley | To defeat Putin, we must support the brave Rus... | https://www.theguardian.com/commentisfree/2023... |
| 1 | 2023-01-28 | Charlotte Higgins | The Fabelmans will never be fought over like T... | https://www.theguardian.com/commentisfree/2023... |
| 2 | 2023-01-27 | Jonathan Freedland | The stench coming from this government? It’s t... | https://www.theguardian.com/commentisfree/2023... |
| 3 | 2023-01-27 | Marina Hyde | Why is British politics a raging bin-fire? Don... | https://www.theguardian.com/commentisfree/2023... |
| 4 | 2023-01-27 | James Bulgin | Hitler didn’t build the path to the Holocaust ... | https://www.theguardian.com/commentisfree/2023... |
| 5 | 2023-01-27 | Judy Griffith | Where is the justice, Suella Braverman, for me... | https://www.theguardian.com/commentisfree/2023... |
| 6 | 2023-01-27 | Gaby Hinsliff | Madonna is the material proof: older women roc... | https://www.theguardian.com/commentisfree/2023... |
| 7 | 2023-01-27 | Simon Jenkins | Bright lights, big cities: cash and HS2 are no... | https://www.theguardian.com/commentisfree/2023... |
| 8 | 2023-01-27 | Neal Lawson | Terrified of leavers and remainers, Labour off... | https://www.theguardian.com/commentisfree/2023... |
| 9 | 2023-01-27 | Ammar Kalia | Farewell, Netflix password sharing. Never agai... | https://www.theguardian.com/commentisfree/2023... |


## Next Steps ...
With the CSV, I can access the URLs and collect the text, count words, tokenize words and phrases, etc.
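
As a rough sketch of what that next step might look like, the snippet below tokenizes a piece of text and counts word frequencies using only the standard library. The sample sentence and the helper name `tokenize` are placeholders; a real pipeline would first fetch each article's text from its URL and would probably rely on a dedicated NLP library instead of a simple regular expression.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

# Placeholder text standing in for an article body
sample_text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = tokenize(sample_text)
word_counts = Counter(tokens)

print(len(tokens))                 # total number of word tokens
print(word_counts.most_common(2))  # the two most frequent words with their counts
```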