# General information

____________________

In this Jupyter Notebook we will work with news articles retrieved from <a href='https://inshorts.com'>Inshorts.com</a>, a website that publishes short, 60-word news articles on a wide variety of topics. Inshorts groups its news into 12 thematic categories, such as Sports, Politics, Business and Technology.

The main focus is to cover some aspects of NLP like:
- Data Retrieval with Web Scraping
- Text wrangling and pre-processing
- Parts of Speech tagging + visualizing dependencies
- Named Entity Recognition
- Building a classifier that recognizes the type of content from the words used in the article

Let's start off by importing all necessary packages.

# Importing necessary libraries

_________________________

In [1]:
# Basic modules for dataframe manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Web scraping
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

# NLP
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
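# CONTRACTION_MAP is assumed to come from a local contractions.py file
# (see the link in 'Expanding contractions'), not from the PyPI 'contractions' package
from contractions import CONTRACTION_MAP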
import re
import unicodedata

# Retrieving data
_____________________

In this section we will write a function that gathers news data from <a href='https://inshorts.com'>Inshorts.com</a>. Articles are divided by topic and each category page initially displays only about 25 articles, so we need a way to trigger the 'Load more' button the desired number of times before retrieving the data and building a DataFrame. We will achieve this with Selenium, a popular browser automation and testing framework, and ChromeDriver. Those who prefer Firefox over Chrome can use Mozilla's GeckoDriver with Selenium instead.

In [2]:
source_urls = ['https://inshorts.com/en/read/business',
               'https://inshorts.com/en/read/technology',
               'https://inshorts.com/en/read/science',
               'https://inshorts.com/en/read/world',
               'https://inshorts.com/en/read/sports',
               'https://inshorts.com/en/read/politics',
               'https://inshorts.com/en/read/entertainment',
               'https://inshorts.com/en/read/hatke',
               'https://inshorts.com/en/read/automobile']

PATH = 'chromedriver.exe'

In [3]:
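from selenium.webdriver.common.by import By

def get_data(seed_urls):
    news_data = []
    for url in seed_urls:
        # Creating a new Chrome session (Selenium 3 style call; recent Selenium 4
        # releases expect a Service object instead of a driver path)
        driver = webdriver.Chrome(PATH)
        driver.implicitly_wait(30)
        driver.get(url)
        for i in range(20):
            try:
                # Find 'Load more' button (find_element_by_id was removed in Selenium 4)
                python_button = driver.find_element(By.ID, 'load-more-btn')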
                sleep(2)
                python_button.click() # Click 'Load more' button to load more articles
                sleep(5)
            except Exception as e:
                print(e)
                break
        news_category = url.split('/')[-1]
        soup = BeautifulSoup(driver.page_source, 'html.parser')     
        news_articles = [{'news_headline': headline.find('span', 
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div', 
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         
                            for headline, article in 
                             zip(soup.find_all('div', 
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', 
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)
        driver.quit()
    df =  pd.DataFrame(news_data)
    return df

In [None]:
df_news = get_data(source_urls)

In [5]:
# Printing first 5 rows of data
df_news.head()

| | news_article | news_category | news_headline |
|---|---|---|---|
| 0 | When deciding the right term insurance plan, c... | business | Max Life Insurance delivering high claims paid... |
| 1 | Andhra Pradesh has topped the ease of doing bu... | business | Andhra Pradesh tops ease of doing business ran... |
| 2 | Rajinikanth's wife Latha Rajinikanth will have... | business | Rajinikanth's wife to face trial for fraud ove... |
| 3 | Online payments company PayPal's UK unit sent ... | business | Being dead is breach of contract, PayPal tells... |
| 4 | The world's fifth largest smartphone seller, C... | business | World's 5th largest smartphone seller Xiaomi m... |


In [6]:
# Checking the distribution of each topic in our DataFrame
df_news['news_category'].value_counts()

business         564
politics         563
world            563
entertainment    560
science          560
sports           558
hatke            557
technology       556
automobile       553
Name: news_category, dtype: int64

In [None]:
# Saving data into a '.csv' file so that we do not have to scrape the Inshorts servers again
df_news.to_csv('news.csv', index=False, encoding='utf-8')

In [4]:
df_news = pd.read_csv('news.csv')

# Text pre-processing

_________________________

In [7]:
# Loading one of the English language models for spaCy
# (all pipeline components, i.e. tagger, parser and NER, are loaded by default)
nlp = spacy.load('en_core_web_md')
tokenizer = ToktokTokenizer()

# Saving English stopwords from the nltk module in a list
stopword_list = nltk.corpus.stopwords.words('english')

# Removing 'no' and 'not' from stopwords list
stopword_list.remove('no')
stopword_list.remove('not')

## Writing helper functions

### Removing HTML tags

Let's start off our data preparation by writing some helper functions which we will use to clean the news data. Since HTML tags don't add much value towards understanding and analyzing text, we will get rid of them.

In [8]:
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

### Removing accented characters

When dealing with text data, we often encounter accented characters like 'é' or 'ó'. Since they are rarely useful when working with English-language text, we will create a function that converts them into their unaccented counterparts.

In [9]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
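
To see what the normalization step does, here is a small standalone sketch of the same idea. The function name `strip_accents` and the sample strings are made up for illustration. Note that characters without an ASCII base letter, such as 'ß', are dropped entirely rather than converted.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then discards those marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents('Sómě Áccěntěd těxt'))  # Some Accented text
print(strip_accents('straße'))              # strae (no ASCII base letter for 'ß')
```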

### Expanding contractions

As a next step, let's create a function for dealing with contractions. Contractions are shortened versions of words or syllables which often occur in both written and spoken English. Typical examples are <b>do not</b> becoming <b>don't</b> and <b>I would</b> becoming <b>I'd</b>. For this purpose we will use the function and contractions dictionary written by Dipanjan S, a Data Scientist at Intel and the author of <i>Text analytics with Python</i> and <i>Practical machine learning with Python</i>.

<a href="https://github.com/dipanjanS/practical-machine-learning-with-python/tree/master/bonus%20content/nlp%20proven%20approach">Contraction dictionary by Dipanjan S</a>

In [10]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
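
The sketch below replays the same substitution idea on a tiny, hypothetical three-entry mapping (`SAMPLE_MAP` and `expand_sample` are illustrative names, not part of the original module), so that the behavior can be checked without the full dictionary. It assumes every key in the mapping is lowercase.

```python
import re

# Hypothetical miniature mapping; the real CONTRACTION_MAP is much larger.
SAMPLE_MAP = {"don't": "do not", "i'd": "i would", "can't": "cannot"}

def expand_sample(text, mapping=SAMPLE_MAP):
    # One alternation pattern that matches any key, regardless of case
    pattern = re.compile('({})'.format('|'.join(map(re.escape, mapping))),
                         flags=re.IGNORECASE)

    def expand_match(match):
        found = match.group(0)
        expanded = mapping[found.lower()]
        # Keep the original first character so sentence-initial capitals survive
        return found[0] + expanded[1:]

    return pattern.sub(expand_match, text)

print(expand_sample("I'd say we don't know, Can't tell."))
# I would say we do not know, Cannot tell.
```

Only the first character's case is preserved, so an all-caps "DON'T" would come out as "Do not".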

### Removing special characters

Special characters are usually non-alphanumeric characters (and, depending on the task, sometimes digits as well) that contribute extra noise to unstructured text data. We will create a function based on simple regular expressions to get rid of them.

In [11]:
def remove_special_characters(text, remove_digits=False):
    # The upper-case range must end at 'Z'; 'A-z' would also keep [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
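
A quick standalone check of this kind of pattern is sketched below. The name `strip_special` and the sample sentence are illustrative only. The character class keeps letters, whitespace and, optionally, digits, and removes everything else.

```python
import re

def strip_special(text, remove_digits=False):
    # Note the capital Z: a range written 'A-z' would also admit [ \ ] ^ _ and `
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

sample = 'Well this was fun! 123#@!'
print(repr(strip_special(sample)))                      # 'Well this was fun 123'
print(repr(strip_special(sample, remove_digits=True)))  # 'Well this was fun '
```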

### Lemmatization

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. It depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. For this part we will use <b>spaCy</b>, as it has an excellent built-in lemmatizer.

In [12]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

### Removing Stopwords

In computing, <i>stopwords</i> are words which are filtered out before or after processing of natural language data. Though stopwords usually refer to the most common words in a language, there is no single universal list of stopwords used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing stopwords in order to support phrase search. Examples of stopwords include <b>a</b>, <b>an</b>, <b>the</b> and <b>and</b>.

In [13]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
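
The effect of keeping 'no' and 'not' out of the stopword list can be illustrated with a small sketch. It uses a hypothetical miniature stopword set and a plain whitespace split in place of nltk's list and `ToktokTokenizer`, so it runs without any downloads.

```python
# Hypothetical miniature stopword set; nltk's English list is far longer.
SAMPLE_STOPWORDS = {'the', 'is', 'a', 'an', 'and', 'of'}

def drop_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    # A whitespace split stands in for ToktokTokenizer in this sketch
    tokens = text.split()
    # Comparing lowercased tokens lets 'The' match the entry 'the'
    return ' '.join(token for token in tokens if token.lower() not in stopwords)

print(drop_stopwords('The service is not good and the price is no bargain'))
# service not good price no bargain
```

Because 'no' and 'not' are absent from the set, the negations that carry the sentence's meaning survive the filtering.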

### Combining above functions - building a Text Normalizer

In [14]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # Normalizing each document in the corpus
    for doc in corpus:
        # Stripping HTML
        if html_stripping:
            doc = remove_html_tags(doc)
        # Removing accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # Expanding contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # Lowering the text    
        if text_lower_case:
            doc = doc.lower()
        # Removing extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # Lemmatizing text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # Removing special characters and/or digits
        if special_char_removal:
            # Inserting spaces between special characters to isolate them    
            # The hyphen is escaped so it is matched literally instead of forming the range '(-)'
            special_char_pattern = re.compile(r'([{.()\-!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # Removing extra whitespace
        doc = re.sub(' +', ' ', doc)
        # Removing stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus

In [15]:
# Combining headline and article text
df_news['full_text'] = df_news["news_headline"].map(str)+ '. ' + df_news["news_article"]

# Pre-processing the text and storing the result
df_news['clean_text'] = normalize_corpus(df_news['full_text'])
norm_corpus = list(df_news['clean_text'])

# Displaying first 5 rows
df_news.head()

| | news_article | news_category | news_headline | full_text | clean_text |
|---|---|---|---|---|---|
| 0 | When deciding the right term insurance plan, c... | business | Max Life Insurance delivering high claims paid... | Max Life Insurance delivering high claims paid... | max life insurance deliver high claim pay year... |
| 1 | Andhra Pradesh has topped the ease of doing bu... | business | Andhra Pradesh tops ease of doing business ran... | Andhra Pradesh tops ease of doing business ran... | andhra pradesh top ease business rank nd year ... |
| 2 | Rajinikanth's wife Latha Rajinikanth will have... | business | Rajinikanth's wife to face trial for fraud ove... | Rajinikanth's wife to face trial for fraud ove... | rajinikanth wife face trial fraud cr unpaid du... |
| 3 | Online payments company PayPal's UK unit sent ... | business | Being dead is breach of contract, PayPal tells... | Being dead is breach of contract, PayPal tells... | dead breach contract paypal tell deceased pati... |
| 4 | The world's fifth largest smartphone seller, C... | business | World's 5th largest smartphone seller Xiaomi m... | World's 5th largest smartphone seller Xiaomi m... | world th large smartphone seller xiaomi make i... |


In [16]:
# Saving preprocessed DataFrame into .csv file
df_news.to_csv('news_preprocessed.csv', index=False, encoding='utf-8')

# Parts of Speech tagging

______________________________

Parts of speech (POS) are specific lexical categories to which words are assigned, based on their syntactic context and role. Usually, words can fall into one of the following major categories.

- <b>Noun</b>: This usually denotes words that depict some object or entity, which may be living or nonliving. Some examples would be fox, dog, book, and so on. The POS tag symbol for nouns is N.

- <b>Verb</b>: Verbs are words that are used to describe certain actions, states, or occurrences. There are a wide variety of further subcategories, such as auxiliary, reflexive, and transitive verbs (and many more). Some typical examples of verbs would be running, jumping, read, and write. The POS tag symbol for verbs is V.

- <b>Adjective</b>: Adjectives are words used to describe or qualify other words, typically nouns and noun phrases. The phrase "beautiful flower" has the noun (N) flower, which is described or qualified using the adjective (ADJ) beautiful. The POS tag symbol for adjectives is ADJ.

- <b>Adverb</b>: Adverbs usually act as modifiers for other words including nouns, adjectives, verbs, or other adverbs. The phrase "very beautiful flower" has the adverb (ADV) very, which modifies the adjective (ADJ) beautiful, indicating the degree to which the flower is beautiful. The POS tag symbol for adverbs is ADV.

Besides these four major categories of parts of speech, there are other categories that occur frequently in the English language. These include pronouns, prepositions, interjections, conjunctions, determiners, and many others. Furthermore, each POS tag like the noun (N) can be further subdivided into categories like singular nouns (NN), singular proper nouns (NNP), and plural nouns (NNS).



In this notebook we show examples of Parts of Speech tagging and Named Entity Recognition using spaCy with the 'en_core_web_md' model. We will apply them to the unprocessed data because, as described in the article linked below, lowercasing or lemmatizing the text hurts the precision of these models.


<a href='https://medium.com/@dudsdu/named-entity-recognition-for-unstructured-documents-c325d47c7e3a'>Named Entity Recognition for Unstructured Documents</a>

In [17]:
# Displaying first 5 rows
df_news.head()

| | news_article | news_category | news_headline | full_text | clean_text |
|---|---|---|---|---|---|
| 0 | When deciding the right term insurance plan, c... | business | Max Life Insurance delivering high claims paid... | Max Life Insurance delivering high claims paid... | max life insurance deliver high claim pay year... |
| 1 | Andhra Pradesh has topped the ease of doing bu... | business | Andhra Pradesh tops ease of doing business ran... | Andhra Pradesh tops ease of doing business ran... | andhra pradesh top ease business rank nd year ... |
| 2 | Rajinikanth's wife Latha Rajinikanth will have... | business | Rajinikanth's wife to face trial for fraud ove... | Rajinikanth's wife to face trial for fraud ove... | rajinikanth wife face trial fraud cr unpaid du... |
| 3 | Online payments company PayPal's UK unit sent ... | business | Being dead is breach of contract, PayPal tells... | Being dead is breach of contract, PayPal tells... | dead breach contract paypal tell deceased pati... |
| 4 | The world's fifth largest smartphone seller, C... | business | World's 5th largest smartphone seller Xiaomi m... | World's 5th largest smartphone seller Xiaomi m... | world th large smartphone seller xiaomi make i... |


In [18]:
# Randomly selecting exemplary news_headline for POS tagging
doc = str(df_news.loc[np.random.choice(np.arange(len(df_news)), 1).item(), 'news_headline'])
doc = nlp(doc)

# POS tagging
pos_tagged = [(word, word.pos_, spacy.explain(word.pos_), word.tag_, spacy.explain(word.tag_)) for word in doc]
df_doc = pd.DataFrame(pos_tagged, columns=['word', 'simple_POS_tag', 'simple_POS_tag_description', 'detailed_POS_tag', 'detailed_POS_tag_description'])
df_doc

| | word | simple_POS_tag | simple_POS_tag_description | detailed_POS_tag | detailed_POS_tag_description |
|---|---|---|---|---|---|
| 0 | Nick | PROPN | proper noun | NNP | noun, proper singular |
| 1 | 's | PART | particle | POS | possessive ending |
| 2 | ex | X | other | FW | foreign word |
| 3 | - | PUNCT | punctuation | HYPH | punctuation mark, hyphen |
| 4 | girlfriend | NOUN | noun | NN | noun, singular or mass |
| 5 | says | VERB | verb | VBZ | verb, 3rd person singular present |
| 6 | she | PRON | pronoun | PRP | pronoun, personal |
| 7 | ca | VERB | verb | MD | verb, modal auxiliary |
| 8 | n't | ADV | adverb | RB | adverb |
| 9 | compete | VERB | verb | VB | verb, base form |


In [19]:
# Importing dependency visualizer that lets us check model's predictions in Jupyter notebook or browser
from spacy import displacy

displacy.render(doc, style='dep', jupyter = True)

# Named Entity Recognition

_______________________________

Named Entity Recognition (NER) is a process in which an algorithm takes a string of text (a sentence or paragraph) as input and identifies the spans that refer to real-world objects such as people, places and organizations. Also known as entity chunking or entity extraction, it is a popular information extraction technique for identifying and segmenting named entities and classifying them into predefined classes. Now, let's check how to find them using spaCy.

In [20]:
# Randomly selecting exemplary news article for NER 
doc2 = str(df_news.loc[np.random.choice(np.arange(len(df_news)), 1).item(), 'news_article'])
doc2 = nlp(doc2)

# Finding Named Entities in the article
ne_list = [(word, word.ent_type_) for word in doc2 if word.ent_type_]

# Visualizing Named Entities
displacy.render(doc2, style='ent', jupyter=True)

Now, let's find out what these entity types exactly mean...

In [21]:
# Selecting unique types of Named Entities found by spaCy
unique_ne_type = set(tup[1] for tup in ne_list)
unique_list = [(ne, spacy.explain(ne)) for ne in unique_ne_type]

# Displaying unique types of Named Entities along with their descriptions
df_doc2 = pd.DataFrame(unique_list, columns=['entity_type', 'entity_description'])
df_doc2

| | entity_type | entity_description |
|---|---|---|
| 0 | PERSON | People, including fictional |
| 1 | GPE | Countries, cities, states |
| 2 | NORP | Nationalities or religious or political groups |
| 3 | ORG | Companies, agencies, institutions, etc. |
| 4 | MONEY | Monetary values, including unit |


Before delving into the Machine Learning part, let's scan our news corpus and find the most frequent entities using the following code.

In [22]:
# Creating a basic pre-processed corpus (case, word forms and punctuation are kept for NER)
news_corpus = normalize_corpus(df_news['full_text'], text_lower_case=False, 
                          text_lemmatization=False, special_char_removal=False)

In [23]:
entities_list = []
for sentence in news_corpus:
    temp_entity_name = ''
    temp_named_entity = None
    sentence = nlp(sentence)
    for word in sentence:
        term = word.text
        tag = word.ent_type_
        if tag:
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
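            if temp_named_entity:
                entities_list.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # Flushing an entity that ends on the last token of the document
    if temp_named_entity:
        entities_list.append(temp_named_entity)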

df_entity = pd.DataFrame(entities_list, columns=['entity_name', 'entity_type'])
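
The grouping of consecutive tagged tokens into multi-word entities can be tried without loading a spaCy model. The sketch below is a variant of the loop above rather than a copy of it. It works on hypothetical `(token, entity_type)` pairs standing in for `(word.text, word.ent_type_)`, and it relies on `itertools.groupby`, which also starts a new entity whenever the tag changes.

```python
from itertools import groupby

# Hypothetical (token, entity_type) pairs; an empty tag means "not an entity"
tagged = [('Elon', 'PERSON'), ('Musk', 'PERSON'), ('runs', ''), ('Tesla', 'ORG'),
          ('in', ''), ('the', ''), ('US', 'GPE')]

def group_entities(pairs):
    entities = []
    # groupby yields one run per stretch of consecutive tokens sharing the same tag
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        if tag:
            entities.append((' '.join(token for token, _ in run), tag))
    return entities

print(group_entities(tagged))
# [('Elon Musk', 'PERSON'), ('Tesla', 'ORG'), ('US', 'GPE')]
```

An entity that sits on the very last token, like 'US' here, is still emitted because `groupby` closes the final run by itself.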

In [24]:
df_entity.groupby(by = ['entity_name', 'entity_type']).size().sort_values(ascending = False).reset_index().rename(columns={0: 'Count'})[:15]

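| | entity_name | entity_type | Count |
|---|---|---|---|
| 0 | US | GPE | 1038 |
| 1 | first | ORDINAL | 446 |
| 2 | India | GPE | 436 |
| 3 | Congress | ORG | 347 |
| 4 | Indian | NORP | 299 |
| 5 | two | CARDINAL | 295 |
| 6 | Tesla | ORG | 189 |
| 7 | China | GPE | 186 |
| 8 | UK | GPE | 182 |
| 9 | one | CARDINAL | 175 |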


# Building a classifier

TO BE CONTINUED...