<H1> Information Extraction Classifier Using spaCy and NLTK </H1>

In this notebook, I follow D. Sarkar's tutorial on information extraction with spaCy. I will expand the notebook further once the full model is deployed and its API is built.
[Link to full tutorial here!](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72)


In [54]:
import numpy as np
import pandas as pd
import seaborn as sns
import json
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import os
import spacy
import nltk
nltk.download('stopwords')
from nltk.tokenize.toktok import ToktokTokenizer
import re
from contractions import CONTRACTION_MAP
import unicodedata

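# Load the medium English pipeline; tagger, parser and NER are enabled by default.
# The parse/tag/entity keyword arguments from older tutorials are not part of the
# spaCy 3 spacy.load signature, so they are dropped here.
nlp = spacy.load('en_core_web_md')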
#nlp_vec = spacy.load('en_vecs', parse = True, tag=True, #entity=True)
tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')

%matplotlib inline


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/georgehanna/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


<H1>First let's get some sample data to practice on... </H1>

Here we use Python's requests library together with Beautiful Soup to scrape short news articles from Inshorts.

In [29]:
seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world']

In [46]:
def build_dataset(seed_urls):
    news_data = []
    for url in seed_urls:
        #Save category
        news_category = url.split('/')[-1]
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, 'html.parser')

        #First get all the card titles and bodies (see bottom for loop)
        #Then, extract headlines, article bodies and place them in a dict
        news_articles = [{'news_headline': headline.find('span', 
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div', 
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         
                            for headline, article in 
                             zip(soup.find_all('div', 
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', 
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)
        
    df =  pd.DataFrame(news_data)
    df = df[['news_headline', 'news_article', 'news_category']]
    return df
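
A small note on how the category label is derived. `build_dataset` takes the last piece of the URL after splitting on `/`, which returns an empty string if a seed URL ever ends with a trailing slash. The sketch below is one possible way to make that step more tolerant; `category_from_url` is a hypothetical helper, not part of the tutorial.

```python
from urllib.parse import urlparse

def category_from_url(url):
    # Hypothetical helper: take the last non-empty path segment,
    # so a trailing slash does not produce an empty category.
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[-1] if segments else None

print(category_from_url('https://inshorts.com/en/read/technology'))  # technology
print(category_from_url('https://inshorts.com/en/read/sports/'))     # sports
print(repr('https://inshorts.com/en/read/sports/'.split('/')[-1]))   # ''
```

For the three seed URLs used here both approaches give the same labels.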

In [49]:
df = build_dataset(seed_urls)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 3 columns):
news_headline    75 non-null object
news_article     75 non-null object
news_category    75 non-null object
dtypes: object(3)
memory usage: 1.8+ KB


<H1> Text Pre-processing and Clean up </H1>

Here the goal is to strip HTML tags, remove accented and special characters, expand contractions, lemmatize the text and remove stop words.

In [55]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

In [57]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
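
To make the accent-removal step less opaque, here is a minimal standalone sketch of the same idea using only the standard library. `strip_accents` is an illustrative name. NFKD normalization splits an accented letter into a base letter plus combining marks, and the ASCII encode with `'ignore'` then drops the marks.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops everything that is not ASCII,
    # including the combining marks
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('caf\u00e9'))                              # cafe
print(len('\u00e9'), len(unicodedata.normalize('NFKD', '\u00e9')))  # 1 2
print(strip_accents('Stra\u00dfe'))                            # Strae
```

The last line shows a caveat: characters with no decomposition, such as the German sharp s, are dropped entirely rather than transliterated.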

In [59]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
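
The regex-substitution idea behind `expand_contractions` can be tried without the full `CONTRACTION_MAP`. The sketch below uses a tiny made-up mapping (`SAMPLE_CONTRACTIONS`) and a simplified function (`expand_sample`); both names are assumptions for illustration only. It also escapes the keys and sorts them longest first, which is a defensive choice of mine rather than something the tutorial does.

```python
import re

# Hypothetical three-entry stand-in for CONTRACTION_MAP
SAMPLE_CONTRACTIONS = {
    "can't": "cannot",
    "didn't": "did not",
    "it's": "it is",
}

def expand_sample(text, mapping=SAMPLE_CONTRACTIONS):
    # Longest keys first, so a longer contraction wins over a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                         flags=re.IGNORECASE)

    def expand_match(match):
        matched = match.group(0)
        expanded = mapping[matched.lower()]
        # Keep the casing of the first character of the original text
        return matched[0] + expanded[1:]

    return pattern.sub(expand_match, text)

print(expand_sample("It's true that users didn't feel safe."))
# It is true that users did not feel safe.
```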

In [61]:
def remove_special_characters(text, remove_digits=False):
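    # Keep only ASCII letters, whitespace and (optionally) digits.
    # The range must be A-Z: A-z would also let [ \ ] ^ _ ` through.
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'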
    text = re.sub(pattern, '', text)
    return text
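
A note on character classes like the one in `remove_special_characters`: a range written as `A-z` (lowercase z) is wider than `A-Z` plus `a-z`, because the ASCII characters between `Z` and `a` are `[`, `\`, `]`, `^`, `_` and the backtick. The short comparison below is only an illustration on a made-up string.

```python
import re

text = "snake_case [tag] x^2 `code`"

# A-z spans ASCII 65..122, so brackets, caret, underscore and backtick survive
wide = re.sub(r'[^a-zA-z0-9\s]', '', text)
# A-Z plus a-z keeps letters only, as intended
strict = re.sub(r'[^a-zA-Z0-9\s]', '', text)

print(wide)    # snake_case [tag] x^2 `code`
print(strict)  # snakecase tag x2 code
```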

In [63]:
def simple_stemmer(text):
    # PorterStemmer is exposed directly by the nltk.stem package
    ps = nltk.stem.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [65]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text


In [67]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
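
Why drop 'no' and 'not' from the stop word list? A minimal sketch with a made-up mini list shows the effect on the sample headline. `SAMPLE_STOPWORDS` and `drop_stopwords` are hypothetical names, and plain `str.split` stands in for the Toktok tokenizer.

```python
# Hypothetical mini stop word list; the notebook uses NLTK's English list instead
SAMPLE_STOPWORDS = {"the", "are", "a", "of", "no", "not"}

# Same adjustment as in the notebook: keep the negation words
filtered_stopwords = SAMPLE_STOPWORDS - {"no", "not"}

def drop_stopwords(text, stopwords):
    # Whitespace split is a simplification of proper tokenization
    return ' '.join(tok for tok in text.split() if tok.lower() not in stopwords)

print(drop_stopwords("Cops are not welcome", SAMPLE_STOPWORDS))    # Cops welcome
print(drop_stopwords("Cops are not welcome", filtered_stopwords))  # Cops not welcome
```

With the unmodified list the negation disappears and the cleaned text reads as the opposite of the original.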

In [69]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines (inside [...] a '|' is a literal pipe, so list only \r and \n)
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them
            # (the hyphen is escaped so it is matched literally instead of forming a range)
            special_char_pattern = re.compile(r'([{.()\-!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
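
The order of the steps in `normalize_corpus` matters: contractions are expanded before special characters are stripped. The sketch below illustrates why, using two simplified stand-ins (`expand_didnt` and `strip_special` are hypothetical helpers, not the notebook's functions).

```python
import re

def expand_didnt(text):
    # Hypothetical one-entry expander, standing in for expand_contractions
    return re.sub(r"didn't", "did not", text, flags=re.IGNORECASE)

def strip_special(text):
    # Same idea as remove_special_characters: keep letters, digits, whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "users didn't feel safe!"

expanded_first = strip_special(expand_didnt(sample))
stripped_first = expand_didnt(strip_special(sample))

print(expanded_first)  # users did not feel safe
print(stripped_first)  # users didnt feel safe
```

If the apostrophe is removed first, the contraction can no longer be recognized and the explicit "not" never appears in the cleaned text.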

In [71]:
# combining headline and article text
df['full_text'] = df["news_headline"].map(str)+ '. ' + df["news_article"]

# pre-process text and store the same
df['clean_text'] = normalize_corpus(df['full_text'])
norm_corpus = list(df['clean_text'])

# show a sample news article
df.iloc[1][['full_text', 'clean_text']].to_dict()

{'full_text': 'Cops are not welcome: Social network suspends Assam Police\'s account. Social network Mastodon on November 14 suspended the account of the Assam Police, saying that "cops are not welcome". "One of the moderators of Mastodon informed that our account has been suspended following reports from users who didn\'t feel safe with our presence," a member of the social media team of the Assam Police said.',
 'clean_text': 'cop not welcome social network suspend assam police account social network mastodon november suspend account assam police say cop not welcome one moderator mastodon inform account suspend follow report user not feel safe presence member social medium team assam police say'}

In [72]:
#Save dataset to disk for later use
df.to_csv('news.csv', index=False, encoding='utf-8')

In [74]:
from spacy import displacy

sentence = str(df.iloc[1].full_text)
sentence_nlp = nlp(sentence)

# print named entities in article
print([(word, word.ent_type_) for word in sentence_nlp if word.ent_type_])

# visualize named entities
displacy.render(sentence_nlp, style='ent', jupyter=True)

[(Assam, 'ORG'), (Police, 'ORG'), ('s, 'ORG'), (Mastodon, 'ORG'), (on, 'ORG'), (November, 'DATE'), (14, 'DATE'), (the, 'ORG'), (Assam, 'ORG'), (Police, 'ORG'), (One, 'CARDINAL'), (Mastodon, 'PERSON'), (the, 'ORG'), (Assam, 'ORG'), (Police, 'ORG')]
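
A quick way to spot inconsistent tags in token-level output like the above is to collect the labels seen for each token. The sketch below works on the (token, label) pairs copied by hand from the printed list, so it runs without spaCy; the variable names are my own.

```python
from collections import defaultdict

# (token text, entity label) pairs copied from the printed output above
tagged = [
    ("Assam", "ORG"), ("Police", "ORG"), ("'s", "ORG"), ("Mastodon", "ORG"),
    ("on", "ORG"), ("November", "DATE"), ("14", "DATE"), ("the", "ORG"),
    ("Assam", "ORG"), ("Police", "ORG"), ("One", "CARDINAL"),
    ("Mastodon", "PERSON"), ("the", "ORG"), ("Assam", "ORG"), ("Police", "ORG"),
]

# Collect every label observed for each token
labels_by_token = defaultdict(set)
for token, label in tagged:
    labels_by_token[token].add(label)

# Tokens that received more than one label within the same article
inconsistent = {tok: sorted(labels) for tok, labels in labels_by_token.items()
                if len(labels) > 1}
print(inconsistent)  # {'Mastodon': ['ORG', 'PERSON']}
```

Here "Mastodon" is tagged once as ORG and once as PERSON, and function words such as "on" and "the" are pulled into ORG spans, which is worth keeping in mind before relying on these entities downstream.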
