# Recommendation

## 1. Importing necessary Libraries

In [None]:
import numpy as np
import pandas as pd

import os
import math
import time
import json

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

# Libraries for text processing using NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Libraries for feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries for similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances

## 2. Loading Data


* The drive module from google.colab is used to mount Google Drive in Colab, after which its contents can be browsed through the Files explorer and read as data.




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


* The CSV file containing the final dataframe from the classification notebook is uploaded to Google Drive, then downloaded and read into a new dataframe in this Recommendation notebook using the following code.
* Since classification is performed with two different algorithms, there are two CSV files as output, and hence two dataframes are loaded: kmeansDf for the KMeans output and news_articles for the NMF output.




In [None]:
! gdown 1PvwhRnfzh9xSDEB2tMTXF_NarbuWLFQN
path_dir = '/content/'
kmeansDf_path = path_dir + 'classification(KMeans).csv'
kmeansDf= pd.read_csv(kmeansDf_path)

! gdown 1WLfIEhBEvpVhF2pTLCssolWSBMcIdzNv
path_dir = '/content/'
nmfDf_path = path_dir + 'classification(NMF).csv'
# nmfDf= pd.read_csv(nmfDf_path)
news_articles=pd.read_csv(nmfDf_path)

Downloading...
From: https://drive.google.com/uc?id=1PvwhRnfzh9xSDEB2tMTXF_NarbuWLFQN
To: /content/classification(KMeans).csv
100% 54.4k/54.4k [00:00<00:00, 33.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1WLfIEhBEvpVhF2pTLCssolWSBMcIdzNv
To: /content/classification(NMF).csv
100% 198k/198k [00:00<00:00, 55.8MB/s]


In [None]:
print(len(news_articles))

64


The recommendations for an article are going to be from the set of articles that are available in the news_articles variable.

## 3. Data Preprocessing

In [None]:
news_articles.shape

(64, 11)

The number of news articles currently available from the NBC News website is 64.


### 3.a Removing all the short headline articles 

After stop words are removed, an article with a very short headline may end up with a blank headline. So let's remove all articles whose headline has five or fewer words.

In [None]:
news_articles = news_articles[news_articles['headline'].apply(lambda x: len(x.split())>5)]
print("Total number of articles after removal of headlines with short title:", news_articles.shape[0])

Total number of articles after removal of headlines with short title: 63


### 3.b Checking and removing all the duplicates

Some articles may have exactly the same headline, so let's remove every article whose headline appears more than once.

In [None]:
news_articles.sort_values('headline',inplace=True, ascending=False)
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]
print("Total number of articles after removing duplicates:", news_articles.shape[0])

Total number of articles after removing duplicates: 63
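
The cell above passes keep=False to duplicated, which flags every copy of a repeated headline rather than only the later ones. As a minimal sketch of the difference, assuming a made-up three-row frame (the `toy` data and variable names below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: two rows share the same headline
toy = pd.DataFrame({
    "headline": ["storm hits coast", "storm hits coast", "markets rally"],
    "category": ["news", "news", "business"],
})

# keep=False flags every member of a duplicate group
mask_all = toy.duplicated("headline", keep=False)
# keep="first" flags only the later copies
mask_later = toy.duplicated("headline", keep="first")

dropped_all = toy[~mask_all]     # no copy of the repeated headline survives
kept_first = toy[~mask_later]    # one copy of the repeated headline survives

print(len(dropped_all), len(kept_first))
```

If one copy of each repeated article should stay available for recommendation, keep="first" may be the more suitable choice.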


### 3.c Checking for missing values

In [None]:
news_articles.isna().sum()

Unnamed: 0      0
id              0
headline        0
summary         0
category        0
suggestions     0
img             0
no_punct        0
no_punct_num    0
no_stopwords    0
clean_text      0
dtype: int64

## 4. Basic Data Exploration 

### 4.a Basic statistics - Number of articles and categories

In [None]:
print("Total number of articles : ", news_articles.shape[0])
print("Total number of unique categories : ", news_articles["category"].nunique())

Total number of articles :  63
Total number of unique categories :  5


### 4.b Distribution of articles category-wise

In [None]:
fig = go.Figure([go.Bar(x=news_articles["category"].value_counts().index, y=news_articles["category"].value_counts().values)])
fig['layout'].update(title={"text" : 'Distribution of articles category-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Category name",yaxis_title="Number of articles")
fig.update_layout(width=800,height=700)
fig

From the bar chart, we can observe that the **politics** category has the **highest** number of articles, followed by **entertainment** and so on.

### 4.c PDF for the length of headlines

In [None]:
fig = ff.create_distplot([news_articles['headline'].str.len()], ["ht"],show_hist=False,show_rug=False)
fig['layout'].update(title={'text':'PDF','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Length of a headline",yaxis_title="probability")
fig.update_layout(showlegend = False,width=500,height=500)
fig

The probability density function of headline length is close to a **Gaussian distribution**, where most of the headlines are 58 to 80 characters long.

The preprocessing in Section 3 leaves a subset of the original dataset whose index labels are no longer contiguous, so let's reset the index to run from 0 up to the total number of articles minus one.

In [None]:
news_articles.index = range(news_articles.shape[0])

In [None]:
news_articles_temp = news_articles.copy()

## 5. Text Preprocessing

### 5.a Stopwords removal

Stop words contribute little to the analysis and their inclusion adds processing time, so let's remove them.

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
stop_words = set(stopwords.words('english'))

In [None]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for word in news_articles_temp["headline"][i].split():
        word = ("".join(e for e in word if e.isalnum()))
        word = word.lower()
        if not word in stop_words:
          string += word + " "  
    if(i%1000==0):
      print(i)           # To track number of records processed
    news_articles_temp.at[i,"headline"] = string.strip()

0
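
The cleaning loop above can also be written as a small reusable function. The sketch below mirrors its steps (strip non-alphanumeric characters, lowercase, drop stop words) but uses a tiny hypothetical stop-word set, `toy_stop_words`, so that it runs without the NLTK download; the function name `clean_headline` and the sample headline are likewise assumptions for illustration.

```python
# Hypothetical miniature stop-word set; the notebook uses NLTK's English list
toy_stop_words = {"the", "a", "to", "of", "in", "is"}

def clean_headline(headline, stop_words):
    """Lowercase each word, keep only alphanumeric characters, drop stop words."""
    kept = []
    for word in headline.split():
        word = "".join(ch for ch in word if ch.isalnum()).lower()
        # skip stop words and tokens that became empty after stripping
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)

cleaned = clean_headline("The Senate votes to pass a $1.7T bill!", toy_stop_words)
print(cleaned)
```

With the real data, the same function could be applied column-wise, for example through `news_articles_temp["headline"].apply(...)`, instead of indexing rows in a loop.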


### 5.b Lemmatization

Let's find the base form (lemma) of each word so that different inflections of a word are treated as the same lemma.

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for w in word_tokenize(news_articles_temp["headline"][i]):
        string += lemmatizer.lemmatize(w,pos = "v") + " "
    news_articles_temp.at[i, "headline"] = string.strip()
    if(i%1000==0):
        print(i)           # To track number of records processed

0


## 6. Headline based similarity on news articles

Generally, we assess **similarity** based on **distance**: the smaller the **distance** between two items, the higher their **similarity**, and the larger the distance, the lower the similarity.
To calculate the **distance**, we need to represent the headline as a **d-dimensional** vector. Then we can find out the **similarity** based on the **distance** between vectors.
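
As a small illustration of the distance and similarity relationship, the sketch below compares a query vector with a nearby and a distant vector. The three vectors are hypothetical stand-ins for headline representations, and the helper names `euclidean` and `cosine_sim` are assumptions for this example only.

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for headline representations
query = np.array([1.0, 0.0, 2.0])
near = np.array([1.0, 0.0, 1.0])
far = np.array([0.0, 3.0, 0.0])

def euclidean(a, b):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return float(np.linalg.norm(a - b))

def cosine_sim(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_near = euclidean(query, near)
d_far = euclidean(query, far)

print(d_near, d_far)
print(cosine_sim(query, near), cosine_sim(query, far))
```

The vector with the smaller Euclidean distance to the query is also the one with the larger cosine similarity here, which is the behaviour the recommendation functions in this section rely on.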

There are multiple methods to represent a **text** as **d-dimensional** vector like **Bag of words**, **TF-IDF method**, **Word2Vec embedding** etc. Each method has its own advantages and disadvantages. 

Let's see the feature representation of headline through all the methods one by one.

A separate dataframe is created for each method used. These dataframes allow a comparison of the impact of the different models on the same data.

In [None]:
path_dir = '/content/'
path = path_dir + 'classification(NMF).csv'
df_bow=pd.read_csv(path)
df_tfidf=pd.read_csv(path)
df_w2v=pd.read_csv(path)
df_onehot=pd.read_csv(path)

### 6.a Using Bag of Words method

A **Bag of Words (BoW)** method represents the occurrence of words within a **document**. Here, each headline can be considered as a **document** and the set of all headlines forms a **corpus**.

Using **BoW** approach, each **document** is represented by a **d-dimensional** vector, where **d** is total number of **unique words** in the corpus. The set of such unique words forms the **Vocabulary**.
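
To make the vocabulary and the d-dimensional count vector concrete, here is a pure-Python sketch of the idea on a hypothetical three-headline corpus. It is not how CountVectorizer is implemented (that class applies its own tokenization rules and returns a sparse matrix); the names `corpus`, `vocabulary` and `bow_vector` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical mini-corpus: each string plays the role of one headline
corpus = [
    "ukraine win artillery",
    "study find vaccine study",
    "ukraine winter aid",
]

# Vocabulary = sorted set of unique words across the whole corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow_matrix = [bow_vector(doc) for doc in corpus]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so d equals the vocabulary size, and most entries are zero, which is why a sparse matrix is a natural storage format.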

In [None]:
headline_vectorizer = CountVectorizer()
headline_features   = headline_vectorizer.fit_transform(news_articles_temp['headline'])

In [None]:
headline_features.get_shape()

(63, 494)

The output **BoW matrix**(headline_features) is a sparse matrix.

In [None]:
pd.set_option('display.max_colwidth', None)  # To display a very long headline completely


In [None]:
def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(headline_features,headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df_bow.at[row_index,'suggestions']=indices
    dataFrame = pd.DataFrame({'headline':news_articles_temp['headline'][indices],'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles_temp['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    return dataFrame.iloc[1:,]

for i in range(len(news_articles_temp)):
  print(bag_of_words_based_model(i, 4))


headline :  workers consumers say theyre likely favor prolgbtq businesses study find

                                              headline  \
61  2 boaters dog find safe strand sea                   
6   twitter disband trust safety council                 
38  tár dispel toxic lesbian stereotype find many film   

    Euclidean similarity with the queried article  
61  3.741657                                       
6   3.872983                                       
38  4.000000                                       
headline :  win ukraine could hinge side secure enough artillery ammunition

                                         headline  \
6   twitter disband trust safety council            
61  2 boaters dog find safe strand sea              
2   ignition fusion breakthrough draw energy gain   

    Euclidean similarity with the queried article  
6   3.741657                                       
61  3.872983                                       
2   3.872983               

In [None]:
df_bow.to_csv('df_bow.csv',index=False)

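The function above recommends the articles most similar to the **queried** (read) article based on the headline. It accepts two arguments: the index of the already read article and the number of nearest articles to retrieve.

Based on the **Euclidean distance**, it finds the 4 nearest neighbors. The nearest one is the queried article itself (distance 0), so the remaining 3 are returned as recommendations. Note that the column labelled "Euclidean similarity" holds the Euclidean distance, so smaller values mean closer headlines.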
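
The nearest-neighbor lookup can be sketched in isolation with NumPy. The feature matrix below is hypothetical, and the function name `nearest_neighbours` is an assumption; unlike the function above, this sketch removes the queried row explicitly instead of assuming it always sorts first.

```python
import numpy as np

# Hypothetical feature matrix: one row per headline
features = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 2.0],
])

def nearest_neighbours(features, row_index, num_similar_items):
    """Return indices and distances of the closest rows, excluding the query."""
    # Euclidean distance from the queried row to every row
    dists = np.linalg.norm(features - features[row_index], axis=1)
    order = np.argsort(dists)[:num_similar_items]
    # drop the queried row itself rather than assuming it is at position 0
    order = order[order != row_index]
    return order, dists[order]

idx, dist = nearest_neighbours(features, row_index=0, num_similar_items=3)
print(idx, dist)
```

Excluding the query by index is a little more robust when another article has an identical feature vector and therefore also sits at distance 0.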

**Disadvantages**
1. It gives very low **importance** to words that are observed less frequently in the corpus. A few words from the queried article like "employer", "flip", "fire" appear less frequently in the entire corpus, so the **BoW** method does not recommend any article whose headline contains these words. Since **trump** is a commonly observed word in the corpus, it recommends articles whose headlines contain "trump".
2. **BoW** method doesn't preserve the order of words.

To overcome the first disadvantage we use **TF-IDF** method for feature representation. 


### 6.b Using TF-IDF method

**TF-IDF** method is a weighted measure which gives more importance to less frequent words in a corpus. It assigns a weight to each term(word) in a document based on **Term frequency(TF)** and **inverse document frequency(IDF)**.

**TF(i,j)** = (# times word i appears in document j) / (# words in document j)

**IDF(i,D)** = log_e( (# documents in the corpus D) / (# documents containing word i) )

weight(i,j) = **TF(i,j)** x **IDF(i,D)**

So if a word occurs many times in a document but appears in few other documents, its **TF-IDF** value will be high.
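
The formulas above can be checked by hand on a tiny example. The sketch below applies them literally to a hypothetical three-document corpus; the helper names `tf`, `idf` and `tfidf` are assumptions. Note that scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its weights differ numerically from these textbook values.

```python
import math

# Hypothetical mini-corpus: each string plays the role of one headline
corpus = [
    "ukraine win artillery",
    "study find vaccine study",
    "ukraine winter aid",
]
docs = [doc.split() for doc in corpus]

def tf(word, doc):
    """Term frequency: share of the document's words that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """Inverse document frequency; `word` must occur in at least one document."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "study" is frequent in one document and absent elsewhere: high weight
w_study = tfidf("study", docs[1], docs)
# "ukraine" appears in two of three documents: lower weight
w_ukraine = tfidf("ukraine", docs[0], docs)

print(w_study, w_ukraine)
```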


In [None]:
tfidf_headline_vectorizer = TfidfVectorizer(min_df = 1)  # keep every term; min_df=1 is the default and has the same effect as min_df=0
tfidf_headline_features = tfidf_headline_vectorizer.fit_transform(news_articles_temp['headline'])

In [None]:
def tfidf_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(tfidf_headline_features,tfidf_headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df_tfidf.at[row_index,'suggestions']=indices
    dataFrame = pd.DataFrame({
               'headline':news_articles_temp['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles_temp['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    return dataFrame.iloc[1:,]
for i in range(len(news_articles_temp)):
  print(tfidf_based_model(i, 4))

headline :  workers consumers say theyre likely favor prolgbtq businesses study find

                                                                            headline  \
1  myocarditis covid vaccine low among teens young adults large study find             
2  blacklist russian propagandists thrive rightwing apps gab truth social study find   
3  asian americans heavily favor warnock georgia runoff exit poll show                 

   Euclidean similarity with the queried article  
1  1.319265                                       
2  1.319534                                       
3  1.347108                                       
headline :  win ukraine could hinge side secure enough artillery ammunition

                                                                 headline  \
1  donors meet paris get ukraine winter russian bomb                        
2  ancient coin unearth desert cave could point evidence maccabean revolt   
3  us set approve send patriot missile battery uk

In [None]:
df_tfidf.to_csv('df_tfidf.csv',index=False)

Compared to the **BoW** method, the **TF-IDF** method recommends articles whose headlines contain words like "employer", "fire", "flip" among its top recommendations, even though these words occur less frequently in the corpus.

**Disadvantages**

The **BoW** and **TF-IDF** methods do not capture the **semantic** and **syntactic** similarity of a given word with other words, but this can be captured using **word embeddings**.

For example, there is a good association between words like "trump" and "white house", "office" and "employee", "tiger" and "leopard", "USA" and "Washington D.C." etc. This kind of **semantic** similarity can be captured using **word embedding** techniques.
**Word embedding** techniques like **Word2Vec**, **GloVe** and **fastText** leverage semantic similarity between words.

### 6.c Using Word2Vec embedding

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Word2Vec** is one of the techniques for **semantic** similarity, introduced by **Google** in 2013. For a given corpus, during training it observes the patterns and represents each word by a **d-dimensional** vector. To get better results we need a fairly large corpus.

Since our corpus is small, let's use Google's model pretrained on **Google News** articles. This standard model contains vector representations for about 3 million words and phrases, obtained by training on roughly 100 billion words of news text. Here, each word is represented by a **300**-dimensional dense vector.




In [None]:
import gensim
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

This **pre-trained Word2Vec** model is about **1.5 GB** in compressed form, so loading it into memory after unzipping requires a machine with plenty of RAM.

Here, we load the pre-built model from its binary file (GoogleNews-vectors-negative300.bin) stored in Google Drive.

In [None]:

loaded_model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/Colab Notebooks/GoogleNews-vectors-negative300.bin', binary=True)

The model gives a vector representation for each **word**, but we calculate the distance between **headlines**, so we need a vector representation for each **headline**. One way to obtain this is to add the vectors of all the words in the **headline** and then take the average. This is also known as the **average Word2Vec** model.

The code cells below first look up the vector of a single word as a check and then perform the averaging.

In [None]:
loaded_model['porter']

In [None]:

# Reuse the vectors loaded above instead of reading the large binary file a second time
model = loaded_model
model.save("file.txt")

In [None]:
vocabulary = model
w2v_headline = []
for i in news_articles_temp['headline']:
    w2Vec_word = np.zeros(300, dtype="float32")
    for word in i.split():
        if word in vocabulary:
            w2Vec_word = np.add(w2Vec_word, model[word])
    w2Vec_word = np.divide(w2Vec_word, len(i.split()))
    w2v_headline.append(w2Vec_word)
w2v_headline = np.array(w2v_headline)
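
To make the averaging step concrete, the sketch below repeats it with a hypothetical dictionary of 2-dimensional embeddings in place of the 300-dimensional Google News vectors. The names `toy_vectors` and `average_vector` are assumptions for illustration. As in the loop above, words missing from the vocabulary contribute nothing to the sum but still count in the divisor.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings; the real model uses 300 dimensions
toy_vectors = {
    "ukraine": np.array([1.0, 0.0]),
    "artillery": np.array([0.0, 1.0]),
    "win": np.array([1.0, 1.0]),
}

def average_vector(headline, vectors, dim=2):
    """Sum the vectors of known words and divide by the total word count."""
    words = headline.split()
    total = np.zeros(dim)
    for word in words:
        if word in vectors:          # skip words missing from the vocabulary
            total = total + vectors[word]
    # divide by the number of words in the headline, known or not
    return total / len(words)

vec = average_vector("ukraine win artillery", toy_vectors)
vec_with_unknown = average_vector("ukraine unknownword", toy_vectors)
print(vec, vec_with_unknown)
```

Dividing by the number of known words instead is a common variant; it avoids shrinking the headline vector when many words are out of vocabulary.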

In [None]:
def avg_w2v_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(w2v_headline, w2v_headline[row_index].reshape(1,-1))
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df_w2v.at[row_index,'suggestions']=indices
    dataFrame = pd.DataFrame({
               'headline':news_articles_temp['headline'][indices].values,  # one headline per neighbor, not only indices[0]
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles_temp['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    return dataFrame.iloc[1:,]
for i in range(len(news_articles_temp)):
  print(avg_w2v_based_model(i, 4))

headline :  workers consumers say theyre likely favor prolgbtq businesses study find

                                                                   headline  \
1  workers consumers say theyre likely favor prolgbtq businesses study find   
2  workers consumers say theyre likely favor prolgbtq businesses study find   
3  workers consumers say theyre likely favor prolgbtq businesses study find   

   Euclidean similarity with the queried article  
1  0.937121                                       
2  1.017899                                       
3  1.093764                                       
headline :  win ukraine could hinge side secure enough artillery ammunition

                                                          headline  \
1  win ukraine could hinge side secure enough artillery ammunition   
2  win ukraine could hinge side secure enough artillery ammunition   
3  win ukraine could hinge side secure enough artillery ammunition   

   Euclidean similarity with the qu

In [None]:
df_w2v.to_csv('df_w2v.csv',index=False)

Here, the **Word2Vec** based representation recommends headlines containing **white house**, which is associated with the word **trump** in the queried article. Similarly, it recommends headlines with words like "official", "insist", which are semantically similar to the words "employer", "sue" in the queried headline.

So far we have recommended using only one feature, the **headline**, but to build a **robust** recommender system we need to consider **multiple** features at a time. Based on the business interest and rules, we can decide the weight for each feature.

Let's see different models with combinations of different features for article similarity.

### 6.d Weighted similarity based on headline and category

Let's first see articles similarity based on **headline** and **category**. We are using **onehot encoding** to get feature representation for **category**.

Sometimes, as per the business requirements, we may need to give more preference to articles from the same **category**. In such cases we can assign more weight to **category** while recommending. The higher the weight, the greater the significance of a feature; similarly, a lower weight gives it less significance.


In [None]:
from sklearn.preprocessing import OneHotEncoder 
import numpy as np

In [None]:
category_onehot_encoded = OneHotEncoder().fit_transform(np.array(news_articles_temp["category"]).reshape(-1,1))

In [None]:
import json
from json import JSONEncoder
import numpy

class NumpyArrayEncoder(JSONEncoder):
    def default(self, obj):
        if isinstance(obj, numpy.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)

In [None]:
def avg_w2v_with_category(row_index, num_similar_items, w1,w2): #headline_preference = True, category_preference = False):
    w2v_dist  = pairwise_distances(w2v_headline, w2v_headline[row_index].reshape(1,-1))
    category_dist = pairwise_distances(category_onehot_encoded, category_onehot_encoded[row_index]) + 1
    weighted_couple_dist   = (w1 * w2v_dist +  w2 * category_dist)/float(w1 + w2)
    indices = np.argsort(weighted_couple_dist.flatten())[0:num_similar_items].tolist()
    df_onehot.at[row_index,'suggestions']= list(indices)
    df = pd.DataFrame({
               'headline':news_articles_temp['headline'][indices].values,
                'Weighted Euclidean similarity with the queried article': weighted_couple_dist[indices].ravel(),
                'Word2Vec based Euclidean similarity': w2v_dist[indices].ravel(),
                 'Category based Euclidean similarity': category_dist[indices].ravel(),
                'Category': news_articles_temp['category'][indices].values})
    
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles_temp['headline'][indices[0]])
    print('Category : ', news_articles_temp['category'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    return df.iloc[1:, ]

for i in range(len(news_articles_temp)):
  print(avg_w2v_with_category(i,4,0.1,0.8))

headline :  workers consumers say theyre likely favor prolgbtq businesses study find
Category :  tech

                                                                      headline  \
1  research find negative effect screen time kid include higher risk ocd         
2  elon musk say stop child exploitation twitter far hes ax job push watchdogs   
3  amy schumer feel like new person operations treat endometriosis               

   Weighted Euclidean similarity with the queried article  \
1  1.001989                                                 
2  1.010418                                                 
3  1.014795                                                 

   Word2Vec based Euclidean similarity  Category based Euclidean similarity  \
1  1.017899                             1.0                                   
2  1.093764                             1.0                                   
3  1.133159                             1.0                                   

  Cate

In [None]:
temp1 = df_w2v.to_dict('records')

In [None]:
for i in range(len(temp1)):
  type(temp1[i]['suggestions'])
  del temp1[i]['Unnamed: 0']

In [None]:
df_w2v.to_csv('temp1.csv',index=False)

The function above takes two extra arguments, **w1** and **w2**, the weights for **headline** and **category** respectively. It is good practice to pass **weights** scaled to the range **0 to 1**, where a value close to 1 indicates a high weight and a value close to 0 a low weight.

Here, we can observe that the recommended articles are from the same **category** as the queried article: the category distance column is 1.0 for each, which is the value the function assigns to identical categories. This is due to the high value passed for **w2**.
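
The effect of the weights can be seen on a few hand-picked numbers. In the sketch below the distances are hypothetical (candidate 0 has the closest headline but a different category), and the function name `weighted_distance` is an assumption; the formula is the same weighted average used in the function above.

```python
import numpy as np

# Hypothetical distances from a queried article to three candidates
headline_dist = np.array([0.5, 1.0, 1.1])   # e.g. Word2Vec distances
category_dist = np.array([2.4, 1.0, 1.0])   # larger when the category differs

def weighted_distance(d_headline, d_category, w1, w2):
    """Weighted average of the two distances, as in avg_w2v_with_category."""
    return (w1 * d_headline + w2 * d_category) / float(w1 + w2)

headline_heavy = weighted_distance(headline_dist, category_dist, w1=0.8, w2=0.1)
category_heavy = weighted_distance(headline_dist, category_dist, w1=0.1, w2=0.8)

best_headline_heavy = int(np.argmin(headline_heavy))
best_category_heavy = int(np.argmin(category_heavy))
print(best_headline_heavy, best_category_heavy)
```

With a large w1 the candidate with the closest headline wins even though its category differs, while a large w2 moves a same-category candidate to the top.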

## 7. Connecting to the MongoDB database

In [None]:
! python -m pip install pymongo==3.7.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymongo==3.7.2
  Downloading pymongo-3.7.2.tar.gz (628 kB)
Building wheels for collected packages: pymongo
  Building wheel for pymongo (setup.py) ... done
  Created wheel for pymongo: filename=pymongo-3.7.2-cp38-cp38-linux_x86_64.whl size=415790 sha256=b47d8241e646690943c3b9c93ec9994ca0517d5e83b0c5d80d6bdec67a3fa928
  Stored in directory: /root/.cache/pip/wheels/28/62/b5/ede9674d1415d2c15c3e805e6cc7debfcdf380105da0887776
Successfully built pymongo
Installing collected packages: pymongo
  Attempting uninstall: pymongo
    Found existing installation: pymongo 4.3.3
    Uninstalling pymongo-4.3.3:
      Successfully uninstalled pymongo-4.3.3
Successfully installed pymongo-3.7.2


In [None]:
import datetime                            # Imports datetime library

import pymongo
from pymongo import MongoClient

# uri (uniform resource identifier) defines the connection parameters 
uri = 'mongodb+srv://nbcuser:nbcuser@atlascluster.osjbeup.mongodb.net/?retryWrites=true&w=majority'
# start client to connect to MongoDB server 
client = MongoClient( uri )

In [None]:
client.list_database_names()

['News_Articles', 'test', 'admin', 'local']

In [None]:
db = client.News_Articles

The following collections are available in the News_Articles database:

In [None]:
db.list_collection_names()

['articles', 'categories']

In [None]:
temp = pd.read_csv(path_dir+'df_w2v.csv')

In [None]:
import pandas as pd

Grouping all the articles in the dataframe according to the category of the article.

In [None]:
techDf=pd.DataFrame()
polDf=pd.DataFrame()
busDf=pd.DataFrame()
sportsDf=pd.DataFrame()
enterDf=pd.DataFrame()

In [None]:
temp=pd.DataFrame(df_w2v.query('category == "tech"'))
techDf=pd.DataFrame({'heading':"tech",'news':json.loads(temp[['headline','img','category']].to_json(orient='table'))})

temp=pd.DataFrame(df_w2v.query('category == "politics"'))
polDf=pd.DataFrame({'heading':"politics",'news':json.loads(temp[['headline','img','category']].to_json(orient='table'))})

temp=pd.DataFrame(df_w2v.query('category == "sports"'))
sportsDf=pd.DataFrame({'heading':"sports",'news':json.loads(temp[['headline','img','category']].to_json(orient='table'))})

temp=pd.DataFrame(df_w2v.query('category == "business"'))
busDf=pd.DataFrame({'heading':"business",'news':json.loads(temp[['headline','img','category']].to_json(orient='table'))})

temp=pd.DataFrame(df_w2v.query('category == "entertainment"'))
enterDf=pd.DataFrame({'heading':"entertainment",'news':json.loads(temp[['headline','img','category']].to_json(orient='table'))})


In [None]:
tech=pd.DataFrame(techDf.iloc[0]).T
sports=pd.DataFrame(sportsDf.iloc[0]).T
politics=pd.DataFrame(polDf.iloc[0]).T
entertainment=pd.DataFrame(enterDf.iloc[0]).T
business=pd.DataFrame(busDf.iloc[0]).T


Pushing the data into the remote MongoDB database named "News_Articles", into the collection named "categories". This results in 5 documents in the collection, each standing for a specific category of news article.

In [None]:
db.categories.insert_one(tech.to_dict())
db.categories.insert_one(sports.to_dict())
db.categories.insert_one(politics.to_dict())
db.categories.insert_one(entertainment.to_dict())
db.categories.insert_one(business.to_dict())

In [None]:
temp1 = pd.read_csv(path_dir+'temp1.csv')
temp1 = df_w2v.to_dict('records')

Pushing the data into the remote MongoDB database named "News_Articles", into the collection named "articles". This stores the articles available on the website on that particular date, with the category and suggestions associated with each article included as additional fields.

In [None]:
db.articles.insert_many(temp1)  # temp1 is already a list of record dicts

Hence, the data that is available from the database is fed into the application to make the articles available to the users in summarized form, grouped by the category of the article, and associated with a set of 4 articles recommended to the user after selection of a specific article.