# TRAIN_DATA generator

Here Wikipedia and news articles are passed through the EntityRuler-enhanced model (loc_er) created earlier to generate the training data in spaCy v2 format in an automated manner.

Import the following modules: WikipediaAPI, spaCy, newspaper3k and its Article function.  

In [7]:
import wikipediaapi
import spacy
import json
import newspaper
import random
from newspaper import Article

In [8]:
#NOTE: The model required is in models/loc_er, which is not directly linked to this folder. Replace the path below with the full path to models/loc_er.
#path = os.path.realpath(r"C:\Users\nicol\OneDrive\Desktop\python_stuff\spacytestenv_in_vscode\models\loc_er")
path = "..\..\models\loc_er"

Load the EntityRuler-enhanced model from earlier. On my laptop CPU this takes 2.5 mins on average.

In [6]:
nlp = spacy.load(path)

Set up the Wikipedia API as below.

In [None]:
#Calling the Wikipedia API and inputting language and format 
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)

List of articles to be processed by the EntityRuler-enhanced models. This is a small sample collection, and placing the article names/url into a list enables future provision for the list to be stored in a seperate JSON file to avoid clogging up the main ipynb.

In [None]:
list_of_newspaper_urls = [
    "https://www.channelnewsasia.com/singapore/police-arrest-man-clementi-police-centre-gunshot-2505181",
    "https://www.channelnewsasia.com/singapore/flood-heavy-rain-weather-warnings-fallen-trees-vehicles-punggol-sungei-gedong-2509816",
    "https://www.straitstimes.com/singapore/community/integrated-community-hub-one-punggol-to-open-from-mid-2022-700-seat-hawker",
    "https://www.channelnewsasia.com/singapore/fire-breaks-out-jurong-east-condominium-1-taken-hospital-2388701",
    "https://mothership.sg/2022/01/jurong-east-choked-garbage-chute/",
    "https://www.straitstimes.com/singapore/housing/hougang-bto-flats-draw-more-than-10000-applicants-all-seven-projects",
    
]

In [None]:
list_of_wikipedia_articles = [
    "North South MRT line", 
    "East West MRT line", 
    "Downtown MRT line", 
    "Mass Rapid Transit (Singapore)", 
    "Geography of Singapore", 
    "Transport in Singapore", 
    'Pan Island Expressway', 
    "Urban planning in Singapore", 
    "Future developments in Singapore", 
    "North East MRT line", 
    "Chinatown MRT station", 
    "Circle MRT line"
]

Processing of text article by the EntityRuler-enhanced model below. A large function (generate_annotations) calls the following "sub-functions":  
1. Pulling of the article from Wikipedia or the news website (extract_wikipedia / extract_news_articles)

In [None]:
#Function to extract only the text portion from the article, and clean up any syntax issues.
def extract_wikipedia(article):
    p_wiki = wiki_wiki.page(article)
    wikitext = p_wiki.text
    if "\n" in wikitext:
        wikitext = wikitext.replace("\n", "")
    return wikitext

In [None]:
#Function to extract news articles. The articles are downloaded and parsed before the text is returned.
def extract_news_articles(article_url):
    news_article = Article(article_url)
    news_article.download()
    news_article.parse()
    return news_article.text

2. Breaking the text into sentences (break_into_sentences)  

In [None]:
#Function to break the article into sentences to make it easier for the NLP model to process. The list of sentences is typically referred to as a "corpus", hence the odd naming.
def break_into_sentences(article):
    corpus = []
    doc = nlp(article)
    for sent in doc.sents:
        corpus.append(sent.text)
    return corpus

3. Sentences are passed into the EntityRuler-enhanced model and annotated in spaCy v2 traindata format (generate_traindata)

In [None]:
def generate_traindata(text):
    doc = nlp(text)
    results = []
    locations = []
    for ent in doc.ents:
        if ent.label_ == "LOC":
            locations.append((ent.start_char, ent.end_char, ent.label_))
    if len(locations) > 0:
        results = [text, {"entities": locations}]
        return results

The 3 functions are called in the large function, and the results are appended to a list.

In [None]:
def generate_annotations(article, type):
    extracted_values = []
    if type == "Wikipedia Article":
        extracted_article = extract_wikipedia(article)
    elif type == "News Article":
        extracted_article = extract_news_articles(article)
    cleaned_text = break_into_sentences(extracted_article)
    for sentences in cleaned_text:
        results = generate_traindata(sentences)
        if results != None:
            extracted_values.append(results)
    return extracted_values

The final list of annotations is generated by running the list of news and wikipedia articles through the generate_annotations function.

In [None]:
TRAIN_DATA = []
for articles in list_of_wikipedia_articles:
    list_of_annotations = generate_annotations(articles, "Wikipedia Article")
    for items in list_of_annotations:
        TRAIN_DATA.append(items)

for articles in list_of_newspaper_urls:
    list_of_annotations = generate_annotations(articles, "News Article")
    for items in list_of_annotations:
        TRAIN_DATA.append(items)

print(f'{len(TRAIN_DATA)} sets of annotations were created.')

924 sets of annotations were created.


In [None]:
final_index = len(TRAIN_DATA) - 1
mid_index = int(final_index / 2)
valid_index_start = mid_index + 1

print(final_index, mid_index, valid_index_start)

923 461 462


In [None]:
random.shuffle(TRAIN_DATA)
training_set = TRAIN_DATA[0:mid_index]
validation_set = TRAIN_DATA[valid_index_start:final_index]

The data is then saved to a JSON file.

In [None]:
save_data_path = "../../data/training_datasets/"

def save_data(file, data):
    with open (save_data_path + file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data("full_train_data_er.json", TRAIN_DATA)
save_data("training_set_er.json", training_set)
save_data("validation_set_er.json", validation_set)

NameError: name 'TRAIN_DATA' is not defined