<a href="https://colab.research.google.com/github/anuraged51a/LSGAA_SNG/blob/main/LSGAA_SNG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Preprocessing
We load the events of 26 March 2020 from the GDELT website for reference. The [dataset](https://drive.google.com/file/d/1BjG_HA-TXWxU0xDxIampqa2gP6fs574z/view?usp=sharing) contains more than 150,000 rows (1.5 lakh), which is too large to process in full.<br><br>
We use only the news events published by the major news websites listed below, and only those that make some reference to the U.K. After filtering by these websites, we are still left with 708 rows.


*   www.dailymail.co.uk
*   www.express.co.uk
*   www.independent.co.uk
*   www.mirror.co.uk
*   www.standard.co.uk
*   www.bbc.co.uk
*   www.thetimes.co.uk
*   www.dailystar.co.uk
*   www.hulldailymail.co.uk
*   www.eveningexpress.co.uk
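
The domain filter described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used to build `url_dataset.csv`: the helper name `filter_by_domain` and the constant `UK_DOMAINS` are hypothetical, and the column name `URL` is assumed from the CSV loaded later in this notebook (raw GDELT exports name their columns differently). The additional "reference to the U.K." filter is not covered here.

```python
from urllib.parse import urlparse

import pandas as pd

# The ten news websites listed above (hypothetical constant name).
UK_DOMAINS = {
    "www.dailymail.co.uk", "www.express.co.uk", "www.independent.co.uk",
    "www.mirror.co.uk", "www.standard.co.uk", "www.bbc.co.uk",
    "www.thetimes.co.uk", "www.dailystar.co.uk", "www.hulldailymail.co.uk",
    "www.eveningexpress.co.uk",
}


def filter_by_domain(df, url_col="URL", domains=UK_DOMAINS):
    """Keep only the rows whose URL host is one of the given domains."""
    # urlparse(...).netloc is the host part, e.g. "www.bbc.co.uk".
    hosts = df[url_col].astype(str).map(lambda u: urlparse(u).netloc.lower())
    return df[hosts.isin(domains)].reset_index(drop=True)
```

Calling `filter_by_domain(events_df)` on a DataFrame of events with a `URL` column would return only the rows published by the listed websites.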



In [None]:
# Installing Libraries
! pip install goose3
! pip install flair



In [None]:
# Importing Libraries
import pandas as pd
import string
import re
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import nltk

from goose3 import Goose
from itertools import combinations, product
from tqdm import tqdm
from flair.data import Sentence
from flair.models import SequenceTagger
from nltk import tokenize

In [None]:
# Load Data
df1 = pd.read_csv('/content/drive/MyDrive/LSGAA_SNG/url_dataset.csv')
len(df1['URL'])

708

## Extracting the Raw Text from URLs
We use [goose3](https://goose3.readthedocs.io/en/latest/index.html) to extract the raw text from the news URLs shortlisted in the previous step.

In [None]:
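# Extract the domain, title and cleaned body text of every shortlisted article.
g = Goose()
dateList, domainList, titleList, contentList = [], [], [], []
for item in df1['URL']:
  try:
    article = g.extract(url=str(item))
    row = (article.domain, article.title, article.cleaned_text)
  except Exception as e:
    # Report the URL that failed (and why), then move on to the next one.
    print(f"{item}: {e}")
    continue
  # All events in this dataset are from 26 March 2020.
  dateList.append('2020-03-26')
  domainList.append(row[0])
  titleList.append(row[1])
  contentList.append(row[2])
g.close()
print("Extraction Completed.")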

In [None]:
# Combining these fields to form a new DataFrame.
dic = {'date': dateList, 'domain': domainList, 'title': titleList, 'content': contentList}
df2 = pd.DataFrame(data = dic)
df2

## Entity Recognition
Named Entity Recognition (NER), or entity extraction, is a subtask of information extraction that seeks to locate named entities mentioned in unstructured text and classify them into pre-defined categories. A multitude of libraries can be used for this purpose, such as GATE, OpenNLP and spaCy. For this project, we use the [Flair](https://github.com/flairNLP/flair) library to identify the individuals and organisations of interest.
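
As a rough sketch of how the tagged entities might be collected once a tagger is available: the helper below keeps only person (`PER`) and organisation (`ORG`) spans, the two categories of interest here. The function name `extract_entities` and its parameters `make_sentence` and `wanted` are hypothetical. The sketch assumes Flair's usual pattern of `tagger.predict(sentence)` followed by `sentence.get_spans('ner')`, and that each span exposes `.text` and `.tag` in the installed Flair version.

```python
from collections import defaultdict


def extract_entities(text, tagger, make_sentence, wanted=("PER", "ORG")):
    """Return the unique entity strings found in `text`, grouped by tag.

    `tagger` is expected to behave like a Flair SequenceTagger and
    `make_sentence` like flair.data.Sentence (both are passed in, so this
    helper does not import Flair itself).
    """
    sentence = make_sentence(text)
    tagger.predict(sentence)  # annotates the sentence in place
    found = defaultdict(set)
    for span in sentence.get_spans("ner"):
        # Keep persons and organisations; skip other tags such as LOC and MISC.
        if span.tag in wanted:
            found[span.tag].add(span.text)
    return {tag: sorted(names) for tag, names in found.items()}
```

With Flair installed, a call would look like `extract_entities(article_text, SequenceTagger.load('ner'), Sentence)`.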

In [None]:
# Downloading NLTK's punkt tokenizer
nltk.download('punkt')
# Loading Flair's NER model
tagger = SequenceTagger.load('ner')

In [None]:
# Removing Pronouns
pronouns = ['I', 'You', 'It', 'He', 'She', 'We', 'They']
suffixes = ["", "'m", "'re", "'s", "'ve", "'d"]