# From Scraped Text to Analysis-Ready Corpus

This code was written as part of a course group assignment on building a corpus. To collect human-written news articles, we used webscraper.io.

Since our topic is Jacinda Ardern's resignation, we searched for news articles using Google Search. We used only "ardern" as the search keyword and left out "resignation", "resign", "quit", and similar words, so that the search would return any news related to her and we could filter it later. We did, however, limit the sources to one or two outlets. To do this, we used Tools > Advanced Search and entered the news site URL in the 'site or domain' field, e.g. https://www.rnz.co.nz/ and https://www.nzherald.co.nz/. We also restricted the results to English. Finally, we set the date range to January, when the resignation took place. Some results fell outside the date range we set, but most of the articles were from January, as expected.
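
For reference, roughly the same search can be written with Google's query operators instead of the Advanced Search form. The sketch below is only an approximation of the settings described above, not what we actually ran: the function name `build_search_url`, the year 2023 in the dates, and the `lr=lang_en` language parameter are assumptions added for illustration.

```python
from urllib.parse import urlencode

def build_search_url(keyword, site, after, before):
    """Build a Google Search URL restricted to one site and a date window."""
    # site:, after: and before: are Google search operators
    query = f"{keyword} site:{site} after:{after} before:{before}"
    # lr=lang_en is assumed here to mirror the English-only setting
    return "https://www.google.com/search?" + urlencode({"q": query, "lr": "lang_en"})

# The dates are assumed; only "January" is stated in the text above
url = build_search_url("ardern", "rnz.co.nz", "2023-01-01", "2023-02-01")
print(url)
```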

Once the Google Search parameters were set, we opened the scraper tool webscraper.io and used the URL from the browser's address bar as the root.
We then set the selectors, with story_text defined as a SelectorGroup.

Once the scrape finished, we exported the result to CSV. Because we grouped the text paragraphs, the story_text column came out in JSON format,
so it needed cleaning. We used a Python script to clean the article content.
Besides cleaning the content, the script keeps only the story text, dropping the URLs and other information we do not need,
and saves the result as text files.
Each CSV row produces one text file, named from the notation, the article date, and the article title trimmed to 30 characters.
The trimming avoids errors caused by overly long filenames when extracting the zipped text files.
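
To illustrate the cleaning step, the sketch below shows what a grouped story_text cell might look like and how it flattens to plain text. The sample string and the helper name `flatten_story_text` are made up for illustration; only the `story_text` key comes from our selector setup.

```python
import json

# Hypothetical cell value; a real cell holds one object per scraped paragraph
sample_cell = '[{"story_text": "First paragraph."}, {"story_text": "Second paragraph."}]'

def flatten_story_text(json_str):
    """Return the paragraphs of a grouped story_text cell as a list of strings."""
    return [obj.get("story_text", "") for obj in json.loads(json_str)]

paragraphs = flatten_story_text(sample_cell)
# Join paragraphs with a blank line between them, as in the output text files
print("\n\n".join(paragraphs))
```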

Below is the Python script we used for this process.

In [None]:
import csv
import os
import re
import json
from datetime import datetime

# Function to sanitize filenames
def sanitize_filename(filename):
    # Replace spaces and special characters with underscores
    return re.sub(r'[<>:\'"/\\|?*]', '_', filename).replace(' ', '_')[:30].ljust(30)

# Function to set the filename from notation, date and title
def set_filename(notation, date_str, sanitized_title):
    # rnz and nzherald use different date formats in their articles
    if notation == '[rnz]':
        date_format = "%I:%M %p on %d %B %Y"
    else:  # nzherald
        date_format = '%d %b, %Y %I:%M %p'
    date_object = datetime.strptime(date_str.strip(), date_format)
    formatted_date = date_object.strftime("%Y-%m-%d")

    return f"{notation} {formatted_date} {sanitized_title}"

# Function to convert the json format of the story_text to plain text
def clean_text(json_str):
    # Initialize a list to hold the values
    story_text_arr = []

    json_array = json.loads(json_str)

    # Extract the 'story_text' value from each JSON object
    for obj in json_array:
        paragraph = obj.get('story_text')
        story_text_arr.append(paragraph)

    return story_text_arr

# use these for rnz, and comment out the nzherald part
notation = '[rnz]'
csv_file_path = 'googleArdern.csv' # Path to the input CSV file
output_directory = 'googleArdern_rnz_output_text_only'  # Directory where text files will be saved

# use these for nzherald, and comment out the rnz part
# notation = '[nzherald]'
# csv_file_path = 'google_nzherald.csv' # Path to the input CSV file
# output_directory = 'google_nzherald_output_text_only'  # Directory where text files will be saved

# Create output directory if it does not exist
os.makedirs(output_directory, exist_ok=True)

# Open the CSV file for reading
with open(csv_file_path, mode='r', newline='', encoding='utf-8') as csv_file:
    csv_reader = csv.reader(csv_file)

    # Process each row in the CSV file
    for row_index, row in enumerate(csv_reader):
        if len(row) >= 7 and row[5].strip() != '' and 'Ardern' in row[6]: # keep only rows that have a date and whose text contains 'Ardern' (case-sensitive)
            filename_base = sanitize_filename(row[4].strip()) # Get the title from the 5th column (index 4)
            filename = set_filename(notation, row[5], filename_base)

            # Create a file name for each row
            text_file_name = f'{filename}.txt'
            text_file_path = os.path.join(output_directory, text_file_name)

            # get the story text
            story_text = clean_text(row[6])

            # Open the text file for writing
            with open(text_file_path, mode='w', encoding='utf-8') as txt_file:
                row_text = '\n\n'.join(str(x) for x in story_text)
                txt_file.write(row_text + '\n')

print(f"Each matching row has been written to a separate text file in the '{output_directory}' directory.")

After running the script, we had the cleaned text files for the articles scraped from the rnz and nzherald sites.
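
Because the two outlets format article dates differently, it can help to check the date parsing in isolation before running the full script. The sketch below reuses the two format strings from the script; the `DATE_FORMATS` mapping, the `to_iso_date` helper, and the sample date strings are assumptions for illustration. It also assumes an English locale, since `%B` and `%b` match month names in the current locale.

```python
from datetime import datetime

# Format strings taken from the script, keyed by notation
DATE_FORMATS = {
    "[rnz]": "%I:%M %p on %d %B %Y",
    "[nzherald]": "%d %b, %Y %I:%M %p",
}

def to_iso_date(notation, date_str):
    """Parse an outlet-specific date string and return it as YYYY-MM-DD."""
    parsed = datetime.strptime(date_str.strip(), DATE_FORMATS[notation])
    return parsed.strftime("%Y-%m-%d")

# Made-up sample strings in each outlet's format
print(to_iso_date("[rnz]", "1:05 pm on 19 January 2023"))
print(to_iso_date("[nzherald]", "19 Jan, 2023 01:05 PM"))
```

A `ValueError` from `strptime` here usually means the scraped date column does not match the expected format for that outlet.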