# General information

____________________

In this Jupyter Notebook we will work with news articles retrieved from <a href='https://inshorts.com'>Inshorts.com</a>, a website that publishes short, 60-word news articles on a wide variety of topics. Inshorts groups its news into 12 thematic categories, such as Sports, Politics, Business and Technology.

The main focus is to cover some aspects of NLP like:
- Data Retrieval with Web Scraping
- Text wrangling and pre-processing
- Parts of Speech Tagging
- Named Entity Recognition
- Building a classifier that recognizes the type of content from the words used in the article

Let's start off by importing all necessary packages.

# Importing necessary libraries

_________________________

In [1]:
# Basic modules for dataframe manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Web scraping
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

# NLP
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from contractions import CONTRACTION_MAP
import re
import unicodedata

# Retrieving data

_____________________

In this section we will create a function for gathering news data from <a href='https://inshorts.com'>Inshorts.com</a>. Articles are divided by topic and each category page initially displays only about 25 articles, so we need a way to trigger the 'Load more' button a desired number of times before retrieving the data and creating a DataFrame. We will achieve this with Selenium, a popular browser-automation and testing framework, together with ChromeDriver. Those who prefer Firefox over Chrome can use Mozilla's GeckoDriver with Selenium instead.

In [2]:
source_urls = ['https://inshorts.com/en/read/business',
               'https://inshorts.com/en/read/technology',
               'https://inshorts.com/en/read/science',
               'https://inshorts.com/en/read/world',
               'https://inshorts.com/en/read/sports',
               'https://inshorts.com/en/read/politics',
               'https://inshorts.com/en/read/entertainment']

In [3]:
def get_data(seed_urls):
    news_data = []
    for url in seed_urls:
        # create a new Chrome session
        driver = webdriver.Chrome('/home/monster/Documents/Python/NLP/chromedriver')
        driver.implicitly_wait(30)
        driver.get(url)
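        # Click 'Load more' up to 10 times; stop early if the button is missing or not clickable
        for _ in range(10):
            try:
                # Find 'Load more' button. Note: find_element_by_id is the Selenium 3 API;
                # Selenium 4 uses driver.find_element(By.ID, 'load-more-btn') instead
                python_button = driver.find_element_by_id('load-more-btn')
                sleep(2)
                python_button.click() # Click 'Load more' button to load more articles
                sleep(5) # Give the newly loaded articles time to render
            except Exception as e:
                print(e)
                break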
        news_category = url.split('/')[-1]
        soup = BeautifulSoup(driver.page_source, 'html.parser')     
        news_articles = [{'news_headline': headline.find('span', 
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div', 
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         
                            for headline, article in 
                             zip(soup.find_all('div', 
                                               class_=["news-card-title news-right-box"]),
                                 soup.find_all('div', 
                                               class_=["news-card-content news-right-box"]))
                        ]
        news_data.extend(news_articles)
        driver.quit()
    df =  pd.DataFrame(news_data)
    return df

In [4]:
df_news = get_data(source_urls)

In [5]:
# Printing first 5 rows of data
df_news.head()

| | news_article | news_category | news_headline |
|---|---|---|---|
| 0 | Facebook's 34-year-old CEO Mark Zuckerberg has... | business | Mark Zuckerberg becomes the 3rd richest person... |
| 1 | The three richest persons in the world, Jeff B... | business | World's 3 richest persons are now technology b... |
| 2 | The Congress has alleged that a "Bitcoin scam"... | business | Congress alleges ₹5,000 crore 'Bitcoin scam' i... |
| 3 | The Indian Air Force (IAF) charged the governm... | business | IAF charged ₹29 crore to ferry new notes post ... |
| 4 | Nestle India has denied claims that 9 children... | business | 9 kids fall ill on consuming 'Maggi', Nestle I... |


In [6]:
# Checking the distribution of each topic in our DataFrame
df_news['news_category'].value_counts()

politics         294
business         294
world            293
sports           293
science          292
technology       291
entertainment    290
Name: news_category, dtype: int64

In [7]:
# Saving data into a '.csv' file so that we do not have to scrape the Inshorts server again.
df_news.to_csv('news.csv', index=False, encoding='utf-8')

# Data pre-processing

_________________________

In [17]:
# Loading one of the English language models for spaCy.
# The tagger, parser and entity recognizer are loaded by default, so no extra arguments are needed.
nlp = spacy.load('en_core_web_md')
tokenizer = ToktokTokenizer()

# Saving English stopwords from the NLTK corpus in a list (requires nltk.download('stopwords') once)
stopword_list = nltk.corpus.stopwords.words('english')

# Removing 'no' and 'not' from stopwords list
stopword_list.remove('no')
stopword_list.remove('not')

## Writing helper functions

### Removing HTML tags

Let's start off our data preparation by writing some helper functions which we will use to clean the news data. Since HTML tags don't add much value towards understanding and analyzing text, we will get rid of them.

In [9]:
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text
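
For readers who want to see what tag stripping does without installing BeautifulSoup, here is a minimal sketch built on Python's standard-library `html.parser` instead. This is a swapped-in technique, not the notebook's method, and the names `_TextExtractor` and `strip_tags_stdlib` are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)


def strip_tags_stdlib(text):
    # Hypothetical stand-in for remove_html_tags, using only the standard library
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


sample = '<p>Markets <b>rallied</b> on Monday &amp; Tuesday.</p>'
print(strip_tags_stdlib(sample))  # Markets rallied on Monday & Tuesday.
```

Both approaches keep the text between tags and discard the markup itself; BeautifulSoup is generally more forgiving with malformed HTML, which is a reasonable argument for using it on scraped pages.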

### Removing accented characters

While dealing with text data, we very often encounter accented characters like 'é' or 'ó'. Since they are rarely useful when working with English text, we will create a function that converts them into their unaccented counterparts.

In [10]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
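
To see why the normalize-then-encode approach works, the short sketch below walks through the intermediate steps on a sample string. The sample words are chosen only for illustration.

```python
import unicodedata

word = 'Pok\u00e9mon caf\u00e9'  # 'Pokémon café', written with escapes to avoid encoding ambiguity

# NFKD splits each accented letter into a base letter plus a combining mark
decomposed = unicodedata.normalize('NFKD', word)
print(len(word), len(decomposed))  # 12 14

# Encoding to ASCII with errors='ignore' silently drops the combining marks
ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
print(ascii_only)  # Pokemon cafe

# Caveat: letters without a decomposition are dropped entirely, not transliterated
dropped = unicodedata.normalize('NFKD', 'Stra\u00dfe').encode('ascii', 'ignore').decode('ascii')
print(dropped)  # Strae
```

The last case is worth keeping in mind: characters such as 'ß' have no base-letter form, so they disappear from the text instead of being replaced.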

### Expanding contractions

As a next step, let's create a function for dealing with contractions. Contractions are shortened versions of words or syllables which often occur in written and spoken English. Typical examples are <b>do not</b> becoming <b>don't</b> and <b>I would</b> becoming <b>I'd</b>. For this purpose we will use the function and the contractions dictionary written by Dipanjan S, a Data Scientist working for Intel and the author of <i>Text analytics with Python</i> and <i>Practical machine learning with Python</i>.

<a href="https://github.com/dipanjanS/practical-machine-learning-with-python/tree/master/bonus%20content/nlp%20proven%20approach">Contraction dictionary by Dipanjan S</a>

In [11]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
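
The function above depends on the external `CONTRACTION_MAP`, so the following self-contained sketch shows the same regex-substitution idea with a tiny, made-up dictionary. `DEMO_MAP` and `expand_demo` are hypothetical names used only here, and unlike the function above this simplified version does not strip leftover apostrophes.

```python
import re

# Tiny illustrative map; the notebook imports the full CONTRACTION_MAP instead
DEMO_MAP = {"don't": "do not", "i'd": "i would", "can't": "cannot"}


def expand_demo(text, mapping=DEMO_MAP):
    # Try longer keys first so a long contraction is never shadowed by a shorter one
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('({})'.format('|'.join(re.escape(k) for k in keys)),
                         flags=re.IGNORECASE)

    def expand_match(m):
        matched = m.group(0)
        expanded = mapping[matched.lower()]
        # Reuse the original first character so capitalisation survives
        return matched[0] + expanded[1:]

    return pattern.sub(expand_match, text)


print(expand_demo("I'd say we can't wait. Don't you agree?"))
# I would say we cannot wait. Do not you agree?
```

Looking the match up in lower case is what lets a single lower-case dictionary cover sentence-initial forms such as "Don't".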

### Removing special characters

Special characters are usually non-alphanumeric characters (and, depending on the task, sometimes digits as well) which contribute extra noise to unstructured text data. We will create a function based on simple regular expressions which will get rid of them.

In [12]:
def remove_special_characters(text, remove_digits=False):
    # Note the range A-Z: writing A-z would also keep the characters [ \ ] ^ _ `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
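
One detail worth checking in a character class like this is the upper bound of the letter range, because `A-Z` and `A-z` behave differently. The sketch below compares the two spellings on a made-up sample string.

```python
import re

sample = 'price_[2018]^ up `5%`'

# A-z spans ASCII codes 65 to 122, which also covers [ \ ] ^ _ and the backtick
loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)

# A-Z covers upper-case letters only, so those six symbols are removed as intended
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(loose)   # price_[2018]^ up `5`
print(strict)  # price2018 up 5
```

With the looser range only the percent sign is removed, which is easy to miss when the test data contains no brackets or underscores.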

### Lemmatization

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. It depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. For this part we will use <b>spaCy</b>, as it has an excellent built-in lemmatizer.

In [13]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

### Removing Stopwords

In computing, <i>stopwords</i> are words which are filtered out before or after processing of natural language data. Stopwords usually refer to the most common words in a language, such as <b>a</b>, <b>an</b>, <b>the</b> and <b>and</b>. There is no single universal list of stopwords used by all natural language processing tools, and indeed not all tools even use such a list; some specifically avoid removing stopwords in order to support phrase search.

In [14]:
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
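
The effect of keeping 'no' and 'not' out of the stopword list is easiest to see on a small example. The sketch below is a simplified illustration that uses a made-up stopword set and plain whitespace splitting in place of NLTK's list and `ToktokTokenizer`; `BASE_STOPWORDS` and `remove_stopwords_demo` are hypothetical names.

```python
# Tiny illustrative stopword set; the notebook uses NLTK's English list instead
BASE_STOPWORDS = {'a', 'an', 'the', 'and', 'is', 'in', 'no', 'not'}

# Mirror the notebook's choice: negations carry meaning, so they are kept in the text
DEMO_STOPWORDS = BASE_STOPWORDS - {'no', 'not'}


def remove_stopwords_demo(text, stopwords=DEMO_STOPWORDS):
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    kept = [token for token in tokens if token.lower() not in stopwords]
    return ' '.join(kept)


print(remove_stopwords_demo('The deal is not final and an appeal is in progress'))
# deal not final appeal progress
```

Had 'not' been removed as well, the result would read "deal final appeal progress", which reverses the meaning of the original sentence.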

### Combining above functions - building a Text Normalizer

In [15]:
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = remove_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # expand contractions    
        if contraction_expansion:
            doc = expand_contractions(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # remove extra newlines
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            # the hyphen is escaped so it is matched literally instead of forming a range
            special_char_pattern = re.compile(r'([{.()\-!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
        # remove stopwords
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
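
The step that inserts spaces around punctuation before deleting it can look redundant, so here is a minimal sketch of why the order matters. It reproduces only the regex-based steps of the normalizer, with illustrative patterns, and `tidy` is a hypothetical helper name.

```python
import re


def tidy(doc, pad_punctuation=True):
    # Collapse line breaks into single spaces
    doc = re.sub(r'[\r\n]+', ' ', doc)
    if pad_punctuation:
        # Surround selected punctuation with spaces so neighbouring words stay apart
        doc = re.sub(r'([{.()\-!}])', r' \1 ', doc)
    # Drop everything that is not a letter, digit or whitespace
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc)
    # Squeeze repeated spaces left behind by the previous steps
    return re.sub(' +', ' ', doc).strip()


raw = 'stocks rose(sharply)!\r\nanalysts were surprised.'
print(tidy(raw))                         # stocks rose sharply analysts were surprised
print(tidy(raw, pad_punctuation=False))  # stocks rosesharply analysts were surprised
```

Without the padding step, removing the parenthesis fuses "rose" and "sharply" into a single token, which would later distort word counts and classifier features.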

In [18]:
# combining headline and article text
df_news['full_text'] = df_news["news_headline"].map(str)+ '. ' + df_news["news_article"]

# pre-process text and store the same
df_news['clean_text'] = normalize_corpus(df_news['full_text'])
norm_corpus = list(df_news['clean_text'])

# show a sample news article
df_news.iloc[1][['full_text', 'clean_text']].to_dict()

{'full_text': "World's 3 richest persons are now technology billionaires. The three richest persons in the world, Jeff Bezos, Bill Gates and Mark Zuckerberg, are technology billionaires leading their companies Amazon, Microsoft and Facebook respectively. Facebook Co-Founder and CEO Mark Zuckerberg on Saturday overtook Berkshire Hathaway's Warren Buffett to become the world's third-richest person. Zuckerberg has a net worth of $81.6 billion as per Bloomberg.",
 'clean_text': 'world rich person technology billionaire three rich person world jeff bezos bill gate mark zuckerberg technology billionaire lead company amazon microsoft facebook respectively facebook co founder ceo mark zuckerberg saturday overtake berkshire hathaway warren buffett become world third rich person zuckerberg net worth billion per bloomberg'}

In [19]:
df_news.to_csv('news_preprocessed.csv', index=False, encoding='utf-8')

TO BE CONTINUED...