<a href="https://colab.research.google.com/github/biku1998/NLP-Notebooks/blob/master/Inshorts-Topic-Modeling/Inshorts_topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Inshorts News Topic Modeling Using Gensim
---
### Notebook Outline
* **Scrape the news data from Inshorts using `BeautifulSoup` and `selenium`**
* **Process the collected data**
* **Build a topic model**


**If you are new to topic modeling, check out this** <a href = "https://colab.research.google.com/github/biku1998/NLP-Notebooks/blob/master/NLP_04_Topic_Modeling_LDA_Gensim.ipynb"><button>**Notebook**</button></a>

- Data Collection

    For data collection we will use `bs4` and `selenium`. On the Inshorts website (https://inshorts.com/en/read), more news is loaded only when the button at the bottom of the page is clicked, so we use `selenium` to click that button and `bs4` to parse the resulting page.

In [0]:
path_to_selenium_driver = "C:/chromedriver_win32/chromedriver.exe"

In [0]:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options
import pandas as pd
from tqdm import tqdm
import requests
from bs4 import BeautifulSoup
import time

In [0]:
def news_collector(ntimes,url,path_to_selenium_driver):
    """
    Collect the page source from Inshorts after loading more news.

    Parameters:
    ntimes : number of times the `load more` button is clicked
    url : URL of the website
    path_to_selenium_driver : path to the Selenium Chrome driver
    """
    driver = webdriver.Chrome(path_to_selenium_driver)
    driver.get(url)
    
    
    
    # find the button
    load_more_button  = driver.find_element_by_xpath("//*[@id='load-more-btn']")
    
    # click on the button n times
    print("Performing the click event !!")
    
    for i in range(ntimes):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1) # brief pause so the page settles before the click
        load_more_button.click()
    
    # get the news blocks
    news_page_source = driver.page_source
    
    driver.close()
    
    return news_page_source
    

In [0]:
def get_news_from_page_source(page_source):
    """
    Return a cleanly formatted DataFrame built from a page source provided by selenium.
    """
    
    # name the parser explicitly to avoid the "no parser specified" warning
    soup = BeautifulSoup(page_source, "html.parser")
    
    news_blocks = soup.find_all("div","news-card z-depth-1")
    
    print(f"Total {len(news_blocks)} news blocks found !")
    
#     print(news_blocks[0].prettify()) # for debug

    # create structure for dataFrame
    
    df_columns = ['news_title',"news_text"]
    
    rows = []
    
    print("Extracting news from html !")
    for block in news_blocks:
        
        news_title = block.find(itemprop="headline").text
        
        news_text = block.find("div","news-card-content news-right-box").text
        
        # append the data in rows
        
        rows.append((news_title,news_text))
        
        # for debug only
        
#         print("-----------------------------------")
        
#         print(f"news title : ",news_title)
#         print(f"news text  :", news_text)
        
#         print("-----------------------------------")

    return pd.DataFrame(data = rows,columns  = df_columns)    

In [0]:
def dataloader(ntimes,url,path_to_selenium_driver):
    """
    will use the above functions to fetch and collect data and return a DataFrame
    """
    
    news_blocks_page_source = news_collector(ntimes,url,path_to_selenium_driver)
    
    df = get_news_from_page_source(news_blocks_page_source)
    
    return df
    

In [0]:
df_news = dataloader(20,"https://inshorts.com/en/read",path_to_selenium_driver)

In [0]:
df_news.shape

In [0]:
df_news.head()

In [0]:
df_news.to_csv("./inshorts_news_data.csv",index = False)

In [0]:
# load the data

df_news = pd.read_csv("./inshorts_news_data.csv")

In [0]:
df_news.head()

Unnamed: 0,news_title,news_text
0,Passenger train services to restart from May 1...,\nThe Indian Railways will gradually restart p...
1,Former PM Manmohan Singh admitted to AIIMS aft...,\nFormer Prime Minister Manmohan Singh has bee...
2,"Coronavirus cases rise to 3,814 in Rajasthan, ...",\nRajasthan on Sunday reported 106 new coronav...
3,India develops its 1st indigenous antibody det...,\nThe National Institute of Virology in Pune h...
4,What are the guidelines for passengers ahead o...,\nBooking for the special passenger trains tha...


In [0]:
print(df_news['news_title'].values[0])

Passenger train services to restart from May 12 with 15 pairs of trains: Govt


In [0]:
print(df_news['news_text'].values[0])


The Indian Railways will gradually restart passenger train operations from May 12, initially with 15 pairs of trains (30 return journeys), it announced. These trains will be run as special trains from New Delhi connecting 15 important cities including Bengaluru, Chennai, Mumbai and Ahmedabad, Railways tweeted. Booking for reservation in these trains will start at 4 pm on May 11.

short by Anmol Sharma / 
      09:00 pm on 10 May




**Note**
* We have to remove the trailing "short by Anmol Sharma / 09:00 pm on 10 May" phrase from the news text, as it carries no information about the topic.

In [0]:
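def remove_text(news_text):
    # split on the footer marker "short by" (not just "short") so that
    # words such as "shortage" in the article body do not truncate the text
    return news_text.strip().rsplit("short by", 1)[0].strip()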

In [0]:
df_news['news_text'] = df_news['news_text'].apply(remove_text)
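
As an alternative sketch, the footer can be matched with a regular expression anchored at the end of the text, so that the word "short" inside the article body is left alone. The pattern and the helper name `strip_footer` are assumptions based on the single footer sample shown above, not a format documented by Inshorts.

```python
import re

# Assumed footer shape: "short by <author> / <time> on <date>" at the very end.
FOOTER_RE = re.compile(r"\s*short by\b.*\Z", flags=re.DOTALL)

def strip_footer(news_text):
    """Drop the trailing 'short by ...' footer and surrounding whitespace."""
    return FOOTER_RE.sub("", news_text).strip()

sample = (
    "\nA short supply of trains is expected. Booking starts at 4 pm.\n\n"
    "short by Anmol Sharma / \n      09:00 pm on 10 May\n"
)
print(strip_footer(sample))
```

Note that this still cuts at the first "short by" it finds, so an article body containing that exact phrase would need a stricter pattern.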

In [0]:
print(df_news['news_text'].values[0])

The Indian Railways will gradually restart passenger train operations from May 12, initially with 15 pairs of trains (30 return journeys), it announced. These trains will be run as special trains from New Delhi connecting 15 important cities including Bengaluru, Chennai, Mumbai and Ahmedabad, Railways tweeted. Booking for reservation in these trains will start at 4 pm on May 11.


In [0]:
# pre-process imports

import numpy as np
import matplotlib.pyplot as plt
import os,re
from gensim.parsing.preprocessing import remove_stopwords,strip_punctuation
import nltk
nltk.download("wordnet")

from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sourabh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
def pre_process_news(texts):
    
    # tokenization
    
    texts = [re.findall(r'\w+', line.lower()) for line in texts]
    
    # remove stopwords
    
    texts = [remove_stopwords(' '.join(line)).split() for line in texts]
    
    # remove punctuation
    
    texts = [strip_punctuation(' '.join(line)).split() for line in texts]
    
    # remove tokens that are only 1-2 characters long
    texts = [[token for token in line if len(token) > 2] for line in texts]
    
    # remove numbers
    texts = [[token for token in line if not token.isnumeric()] for line in texts]
    
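    # lemmatization
    # NOTE: lemmatize() expects a single token; here it receives the whole
    # joined line, which WordNet does not recognise, so tokens come back unchanged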
    lemmatizer = WordNetLemmatizer()
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]
    
    return texts
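
`WordNetLemmatizer.lemmatize` works on one word at a time, so lemmatization is usually applied token by token. The sketch below shows that shape. The helper `lemmatize_tokens` and the toy lookup table are hypothetical stand-ins so that the example runs without NLTK data; the commented lines show how the real lemmatizer could be plugged in.

```python
from functools import partial

def lemmatize_tokens(texts, lemmatize_one):
    """Apply a single-token lemmatizer to every token of every document."""
    return [[lemmatize_one(token) for token in line] for line in texts]

# Toy stand-in for a real lemmatizer (hypothetical, for illustration only).
TOY_LEMMAS = {"announced": "announce", "tweeted": "tweet", "connecting": "connect"}

def toy_lemmatize(token):
    return TOY_LEMMAS.get(token, token)

docs = [["railways", "announced", "trains"], ["railways", "tweeted"]]
lemmatized = lemmatize_tokens(docs, toy_lemmatize)
print(lemmatized)

# With NLTK and the wordnet corpus available, the same helper could be used as:
# from nltk.stem import WordNetLemmatizer
# lemmatize_tokens(docs, partial(WordNetLemmatizer().lemmatize, pos="v"))
```

Switching to per-token lemmatization changes the vocabulary (for example, verb forms collapse to their base form), so the outputs recorded in this notebook would differ after such a change.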

In [0]:
news_cleaned = pre_process_news(df_news['news_text'].values)

In [0]:
len(news_cleaned)

532

In [0]:
news_cleaned[0]

['indian',
 'railways',
 'gradually',
 'restart',
 'passenger',
 'train',
 'operations',
 'initially',
 'pairs',
 'trains',
 'return',
 'journeys',
 'announced',
 'trains',
 'run',
 'special',
 'trains',
 'new',
 'delhi',
 'connecting',
 'important',
 'cities',
 'including',
 'bengaluru',
 'chennai',
 'mumbai',
 'ahmedabad',
 'railways',
 'tweeted',
 'booking',
 'reservation',
 'trains',
 'start']

In [0]:
# bigram collocation detection

from gensim.models.phrases import Phraser,Phrases

# train the bigram detector

phrases = Phrases(news_cleaned,min_count = 1,threshold = 0.8,scoring = 'npmi')
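
With `scoring = 'npmi'`, candidate bigrams are scored by normalized pointwise mutual information, which ranges from -1 to 1, and pairs scoring above `threshold` are merged into a single token. A minimal sketch of that score follows; the function name and the counts are made up for illustration and are not gensim's internals.

```python
import math

def npmi(count_a, count_b, count_ab, total_words):
    """Normalized PMI: ln(p(a,b) / (p(a) * p(b))) / -ln(p(a,b))."""
    p_a = count_a / total_words
    p_b = count_b / total_words
    p_ab = count_ab / total_words
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

# Two words that only ever appear together reach the maximum score ...
always_together = npmi(count_a=5, count_b=5, count_ab=5, total_words=1000)
# ... while a pair seen once among frequent words scores far lower.
rarely_together = npmi(count_a=50, count_b=40, count_ab=1, total_words=1000)
print(round(always_together, 3), round(rarely_together, 3))
```

A threshold of 0.8 therefore keeps only word pairs that co-occur almost every time either word appears.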

In [0]:
# now we create a transformer that uses the trained model above to merge bigrams in new sentences

bi_gram = Phraser(phrases)

In [0]:
# merging detected collocations with data

news_cleaned = list(bi_gram[news_cleaned])

In [0]:
# creating a numerical mapping for each word

from gensim.corpora import Dictionary

In [0]:
dictionary = Dictionary(news_cleaned)

In [0]:
# remove rare words and most common words to improve our topic modeling

# Filter out words that occur in fewer than 10 documents, or in more than 60% of the documents.

dictionary.filter_extremes(no_below=10, no_above=0.6)
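
`filter_extremes` works on document frequency, that is, the number of documents a token appears in rather than its total count. A plain-Python sketch of the same rule on a toy corpus is shown below. The helper `filter_by_doc_freq` is hypothetical and ignores gensim's additional `keep_n` cap.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most `no_above` (a fraction) of all docs."""
    # set(doc) ensures each token is counted once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["covid", "cases", "rise"],
    ["covid", "cases", "fall"],
    ["covid", "cricket", "team"],
    ["covid", "cricket", "cases"],
]
kept = filter_by_doc_freq(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "covid" is dropped for appearing in every document, and "rise", "fall" and "team" are dropped for appearing in only one.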

In [0]:
#  transform the documents to a vectorized form. We simply compute the frequency of each word, including the bigrams.

corpus = [dictionary.doc2bow(text) for text in news_cleaned]

In [0]:
doc_number = 0
corpus[doc_number] # A document is represented as a list of tuples of (vocab ID, frequency) for each word.

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 4),
 (10, 1)]
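
Conceptually, `doc2bow` counts how often each known token occurs in a document and reports the counts by vocabulary ID, skipping tokens that are not in the dictionary. A plain-Python sketch with a hypothetical `token2id` mapping:

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens present in token2id."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

token2id = {"railways": 0, "trains": 1, "delhi": 2}  # toy vocabulary
doc = ["railways", "trains", "trains", "mumbai", "trains", "railways"]
bow = to_bow(doc, token2id)
print(bow)
```

Tokens outside the vocabulary ("mumbai" here) simply disappear, which is why a document can have fewer bag-of-words entries than tokens once the dictionary has been filtered.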

In [0]:
# train the lda model

from gensim.models import LdaModel


In [0]:
lda_model = LdaModel(corpus = corpus, id2word = dictionary, num_topics = 10, \
                      passes = 50, chunksize = 1500,iterations = 200,alpha = "auto")

In [0]:
# once the model is trained we save the model

os.makedirs("./topic_models", exist_ok=True)
    
lda_model.save("./topic_models/lda_model")

In [0]:
# To load the model

# lda_model = LdaModel.load("./topic_models/lda_model")

In [0]:
lda_model.show_topics(num_topics = 5)

[(5,
  '0.100*"coronavirus" + 0.082*"said" + 0.055*"country" + 0.048*"china" + 0.041*"cases" + 0.041*"reported" + 0.037*"covid" + 0.033*"health" + 0.029*"ministry" + 0.027*"city"'),
 (0,
  '0.068*"world" + 0.055*"women" + 0.054*"million" + 0.044*"coronavirus" + 0.033*"group" + 0.033*"video" + 0.032*"app" + 0.032*"trump" + 0.032*"report" + 0.030*"april"'),
 (4,
  '0.110*"cases" + 0.084*"coronavirus" + 0.056*"state" + 0.043*"reported" + 0.036*"patients" + 0.034*"total_number" + 0.033*"health" + 0.029*"new" + 0.028*"taking" + 0.024*"covid"'),
 (7,
  '0.105*"said" + 0.062*"coronavirus" + 0.050*"covid" + 0.049*"pandemic" + 0.045*"added" + 0.041*"amid" + 0.032*"people" + 0.031*"lockdown" + 0.030*"government" + 0.030*"friday"'),
 (9,
  '0.150*"said" + 0.103*"india" + 0.094*"added" + 0.032*"indian" + 0.030*"cricket" + 0.028*"team" + 0.028*"australia" + 0.025*"stated" + 0.024*"played" + 0.021*"years"')]

In [0]:
import pyLDAvis.gensim

# set the notebook mode
pyLDAvis.enable_notebook()

In [0]:
import warnings
warnings.filterwarnings("ignore")

**To understand what's going on in this plot, refer to this** <a href = "https://colab.research.google.com/github/biku1998/NLP-Notebooks/blob/master/NLP_04_Topic_Modeling_LDA_Gensim.ipynb"><button>**Notebook**</button></a>

In [0]:
pyLDAvis.gensim.prepare(lda_model,corpus,dictionary,sort_topics = False)

In [0]:
### Document Clustering using LDA on Tensorboard

# Get document topics
all_topics = lda_model.get_document_topics(corpus, minimum_probability=0)
all_topics[0]

[(0, 0.0028396605),
 (1, 0.0030884051),
 (2, 0.6604015),
 (3, 0.0031240103),
 (4, 0.004412665),
 (5, 0.002818684),
 (6, 0.003101953),
 (7, 0.0045325384),
 (8, 0.31105405),
 (9, 0.0046265377)]
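
Each document gets a probability for every topic, and the values sum to roughly 1. A small sketch of picking the dominant topic from such a list follows; the helper name `dominant_topic` is hypothetical, and the numbers are copied from the output above.

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Topic distribution of the first document, copied from the output above.
first_doc = [
    (0, 0.0028396605), (1, 0.0030884051), (2, 0.6604015), (3, 0.0031240103),
    (4, 0.004412665), (5, 0.002818684), (6, 0.003101953), (7, 0.0045325384),
    (8, 0.31105405), (9, 0.0046265377),
]
topic_id, prob = dominant_topic(first_doc)
print(topic_id, prob)
```

For the first document, topic 2 dominates with topic 8 as a clear second, and documents with similar distributions end up close together when these vectors are projected.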

In [0]:
# create file for tensors (vectors)

with open('doc_lda_tensor.tsv','w') as w:
    for doc_topics in all_topics:
        # one row per document: tab-separated topic probabilities, no trailing tab
        w.write("\t".join(str(prob) for _, prob in doc_topics) + "\n")

In [0]:
# create file for metadata (document titles)
with open('doc_lda_metadata.tsv','w',encoding="utf-8") as w:
    for doc_id in range(len(all_topics)):
        w.write(df_news.news_title[doc_id] + "\n")

* Now open http://projector.tensorflow.org/
* Upload both files using the `load` button on the left

**Below is a quick view of the news clusters that we have made**

---

<img src = "./news_cluster.gif">