# Building an annotated corpus for main-entity-extraction systems

## Source of the data

Our data source is [**The Guardian Open Platform**](https://open-platform.theguardian.com/). The Open Platform is a public web service that provides access to all the content generated by The Guardian, a British daily newspaper founded in 1821 as The Manchester Guardian. As a news platform, it holds a very large amount of text that can support research in computational linguistics. The Open Platform exposes this content through an application programming interface (API) covering articles, videos, and images dating back to 1999. Easy access to a large amount of data (1.9 million content pieces), coupled with an easy-to-use API, makes it a good starting point for building this corpus.
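As a rough illustration of how a request to the API could be assembled, the sketch below builds a search URL for one section and date range. The endpoint and the parameter names (`api-key`, `section`, `from-date`, `to-date`, `show-fields`, `page`, `page-size`) are assumptions that should be checked against the Open Platform documentation, and `build_search_url` is a hypothetical helper, not part of the POC.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Open Platform documentation.
SEARCH_ENDPOINT = "https://content.guardianapis.com/search"


def build_search_url(api_key, section, from_date, to_date, page=1, page_size=50):
    """Build a search URL for one section and one date range (dates as YYYY-MM-DD)."""
    params = {
        "api-key": api_key,         # personal key issued by the Open Platform
        "section": section,         # assumed parameter name for the news category
        "from-date": from_date,
        "to-date": to_date,
        "show-fields": "bodyText",  # assumed field name for the plain article text
        "page": page,
        "page-size": page_size,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # "YOUR_KEY" is a placeholder; no request is sent here.
    print(build_search_url("YOUR_KEY", "politics", "2022-01-01", "2022-02-15"))
```

The function only builds the URL, so it can be inspected and tested without network access. Fetching and paging through the results would be a separate step.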

## The data

- The API provides access to every kind of content found on a news website: images, articles, videos, crossword puzzles, and so on. We will work with news articles only.
- We'll work with news articles written in English. The Guardian publishes primarily in English and, more importantly, English is easier for us to work with, comprehend, and annotate.
- The news articles we're using to build this corpus are written by a diverse range of journalists rather than a single author. This lets us work with different writing styles and explore a diverse range of semantic and syntactic information.
- The documents are long and vary considerably in length: article bodies (content only, excluding all metadata) range from roughly 400 to 4000 words. Once we kick off the project workflow, we'll also check whether we would benefit from working with documents that are more uniform in length.
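If we later decide to restrict the corpus to a narrower length band, a simple whitespace-based word count could serve as a first filter. This is a minimal sketch: `word_count` and `within_length` are hypothetical helpers, and the default bounds simply mirror the 400-4000 word range mentioned above.

```python
def word_count(text):
    """Count whitespace-separated tokens; a rough proxy for article length."""
    return len(text.split())


def within_length(text, min_words=400, max_words=4000):
    """Return True if the article body falls inside the chosen length band."""
    return min_words <= word_count(text) <= max_words


if __name__ == "__main__":
    sample = "word " * 500
    print(word_count(sample), within_length(sample))
```

Tightening `min_words` and `max_words` would give a more uniform corpus at the cost of fewer articles.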

## News of interest

Once fetched, the data is available in Python as JSON, as demonstrated in the POC document. One of the keys in the JSON document is 'topic', which categorizes news articles into politics, sports, opinions, etc. We decided to work with categories that are mostly entity-oriented rather than abstract. We therefore kept categories such as Politics, Science, Business, Film, and Technology and left the rest out. Further discussion might lead us to include or exclude other categories.
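The filtering step can be sketched as follows, assuming each article has already been parsed into a Python dict with a 'topic' key as described above. The record layout and the `filter_by_topic` helper are illustrative assumptions, not the POC code.

```python
# Categories we decided to keep, compared case-insensitively.
TOPICS_OF_INTEREST = {"politics", "science", "business", "film", "technology"}


def filter_by_topic(articles, topics=TOPICS_OF_INTEREST):
    """Keep only the articles whose 'topic' value is one of the chosen categories."""
    return [
        article
        for article in articles
        if str(article.get("topic", "")).strip().lower() in topics
    ]


if __name__ == "__main__":
    # Toy records for illustration only; not real Guardian data.
    toy = [
        {"topic": "Politics", "text": "..."},
        {"topic": "Sport", "text": "..."},
        {"topic": "technology", "text": "..."},
    ]
    print([a["topic"] for a in filter_by_topic(toy)])
```

Keeping the category set in one constant makes it easy to add or remove categories as the discussion evolves.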

As demonstrated in the POC document, the well-structured JSON documents made it easy to identify and extract our news categories of interest. Because the news catalogue of a newspaper this popular is so large, even after filtering out most of the categories we were able to fetch more than 1500 articles for a period of just 45 days (from early January 2022 to mid-February 2022). This indicates that our data source has enough text to build a corpus.

## Corpus

### Structure

Tentatively, we have decided on a tabular structure for the corpus. Each news article will be a single row, or example. Alongside the text column, we've decided to include a 'Topic' metadata column so that we can analyze how different categories of news affect the type of entity recognized as the central piece of information in the text. We've decided to exclude the headline (title) of each article, so that the main entity is annotated based solely on the text itself. This will help us create a corpus that can later be used to extract useful information from online comments, tweets, Facebook posts, etc.

### Annotation

Our goal is to extract the main entity of a given piece of text. For example, for a news article about `Microsoft`, the annotation might mark `Microsoft` as the central entity of the text, depending on the context. This is where having different categories of news comes into play: it lets us analyze how different types of entities (person, organization, location) are distributed across the different news categories.
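Once annotations exist, the distribution of entity types per news category could be tallied along these lines. This is a sketch under assumptions: the `(topic, entity_type)` pair layout and the `entity_type_distribution` helper are hypothetical, and the sample values are toy data rather than real annotations.

```python
from collections import Counter, defaultdict


def entity_type_distribution(annotations):
    """Count how often each entity type is the main entity within each topic.

    `annotations` is an iterable of (topic, entity_type) pairs.
    """
    counts = defaultdict(Counter)
    for topic, entity_type in annotations:
        counts[topic][entity_type] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {topic: dict(counter) for topic, counter in counts.items()}


if __name__ == "__main__":
    # Toy annotations for illustration only.
    toy = [
        ("Technology", "organization"),
        ("Technology", "organization"),
        ("Politics", "person"),
        ("Politics", "location"),
    ]
    print(entity_type_distribution(toy))
```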

### Format

The format of the corpus, as demonstrated by the POC, is a tab-separated values (TSV) file with two columns, 'Topic' and 'Text'. Each row holds one news article: 'Topic' is the topic of the article and 'Text' is its entire text (excluding author, title, etc.).
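A file in this layout could be written with Python's standard `csv` module, roughly as sketched below. The `write_corpus_tsv` helper is hypothetical, the sample row is toy data, and collapsing whitespace inside the text is our own assumption to keep every article on a single line.

```python
import csv
import io


def write_corpus_tsv(rows, handle):
    """Write (topic, text) pairs as tab-separated rows under a 'Topic'/'Text' header."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Topic", "Text"])
    for topic, text in rows:
        # Collapse tabs and newlines inside the text so each article stays on one row.
        writer.writerow([topic, " ".join(text.split())])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_corpus_tsv([("Politics", "First line\nsecond\tline")], buffer)
    print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`.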

### Our Vision

I was browsing Twitter the other day and was impressed by the feature that lets us browse trending tweets about our favorite celebrities, key politicians, and iconic sportsmen. Our vision for the corpus is built on similar grounds. The corpus can be used to train machine learning models that identify the central entity in a given piece of information across various online platforms. This can help us not only categorize information better, but also make information retrieval more efficient. Google likely uses this kind of technology to recommend articles to its users based on their search preferences. Moreover, incorporating several categories of news lets us inspect how different categories of news correspond to different categories of entities. Since this is still an abstract idea, we can add other elements to the corpus as we develop it, with the intention of driving further research in computational linguistics.