## Preface

**Content-based recommendation** relies on similarity among users or items, computed from their **attributes**. It uses additional information (metadata) about the **users** or **items**, i.e. it relies on whatever **content** is already available. For users, this metadata could be **demographic information** such as *age*, *gender*, *job*, *location* or *skill set*. For **items**, it could be the *item name*, *specifications*, *category*, *registration date* and so on.

So the core idea is to recommend items by finding similar items/users to the concerned **item/user** based on their **attributes**. 

In this kernel, I discuss **content-based recommendation** using the **News Category** dataset. The goal is to recommend **news articles** similar to an article the user has already read, using attributes such as the article's *headline*, *category*, *author* and *publishing date*.

So let's get started without any further delay.

## 1. Importing necessary Libraries

In [2]:
import numpy as np
import pandas as pd

import os
import math
import time

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
import plotly.express as px

#for storing the model
import pickle

# Below libraries are for text processing using NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Below libraries are for feature representation using sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Below libraries are for similarity matrices using sklearn
from sklearn.metrics.pairwise import cosine_similarity  
from sklearn.metrics import pairwise_distances


## 2. Loading Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#news_articles = pd.read_json("/content/News_Category_Dataset_v2.json", lines = True)
news_articles = pd.read_json("/content/drive/MyDrive/News_Category_Dataset_v2.json", lines = True)
df=news_articles

In [4]:
news_articles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200853 entries, 0 to 200852
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   category           200853 non-null  object        
 1   headline           200853 non-null  object        
 2   authors            200853 non-null  object        
 3   link               200853 non-null  object        
 4   short_description  200853 non-null  object        
 5   date               200853 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.2+ MB


The dataset contains about 200,000 records (200,853 rows) with six features.

In [5]:
news_articles.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26


## 3. Data Preprocessing

### 3.a Fetching only the articles from 2018  

Because the dataset is large, processing all of it would take considerable time. To avoid this, we consider only the most recent articles, those from 2018.

In [6]:
news_articles = news_articles[news_articles['date'] >= pd.Timestamp(2018,1,1)]

In [7]:
news_articles.shape

(8583, 6)

Now, the number of news articles comes down to 8583.

### 3.b Removing all the short headline articles 

After stop-word removal, a very short headline may end up empty. So let's drop every article whose headline has five or fewer words; the filter below keeps only headlines with more than five words.

In [8]:
news_articles = news_articles[news_articles['headline'].apply(lambda x: len(x.split())>5)]
print("Total number of articles after removal of headlines with short title:", news_articles.shape[0])

Total number of articles after removal of headlines with short title: 8530
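
As a quick check of where the cutoff falls, the sketch below applies the same "more than 5 words" test to a few made-up headlines. The sample strings and the helper name `is_long_enough` are illustrative and not part of the notebook.

```python
# Made-up headlines used only to show where the cutoff falls.
samples = [
    "Markets Rally",                         # 2 words
    "Five Words In This Headline",           # 5 words
    "This Headline Has Exactly Six Words",   # 6 words
]

def is_long_enough(headline, min_words=6):
    """Mirror the notebook's test: more than 5 whitespace-separated words."""
    return len(headline.split()) >= min_words

kept = [h for h in samples if is_long_enough(h)]
print(kept)
```

Only the six-word headline survives, so a headline of exactly five words is removed.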


### 3.c Checking and removing all the duplicates

Some articles share exactly the same headline, so let's remove them. Note that `keep = False` flags every copy, so no article with a repeated headline is retained.

In [9]:
# Reassign instead of sorting in place, which avoids modifying a filtered slice
news_articles = news_articles.sort_values('headline', ascending=False)
duplicated_articles_series = news_articles.duplicated('headline', keep = False)
news_articles = news_articles[~duplicated_articles_series]
print("Total number of articles after removing duplicates:", news_articles.shape[0])

Total number of articles after removing duplicates: 8485
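
The effect of `keep = False` is easy to miss, so here is a minimal plain-Python sketch that mimics it on an invented list of headlines and contrasts it with keeping the first copy. It illustrates the semantics only and does not use the pandas implementation.

```python
from collections import Counter

# Invented headlines; "Storm Hits Coast" appears twice.
headlines = ["Storm Hits Coast", "Team Wins Final", "Storm Hits Coast", "New Park Opens"]

counts = Counter(headlines)

# Analogue of keep=False: drop every headline that occurs more than once.
drop_all_copies = [h for h in headlines if counts[h] == 1]

# Analogue of keep='first': retain the first occurrence of each headline.
seen = set()
keep_first = []
for h in headlines:
    if h not in seen:
        seen.add(h)
        keep_first.append(h)

print(drop_all_copies)
print(keep_first)
```

If one copy of each repeated headline should survive, `keep='first'` is the pandas option that corresponds to the second list.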


### 3.d Checking for missing values

In [10]:
news_articles.isna().sum()

category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64

## 4. Basic Data Exploration 

### 4.a Basic statistics - Number of articles,authors,categories

In [11]:
print("Total number of articles : ", news_articles.shape[0])
print("Total number of authors : ", news_articles["authors"].nunique())
print("Total number of unique categories : ", news_articles["category"].nunique())

Total number of articles :  8485
Total number of authors :  892
Total number of unique categories :  26


### 4.b Distribution of articles category-wise

In [12]:
fig = go.Figure([go.Bar(x=news_articles["category"].value_counts().index, y=news_articles["category"].value_counts().values)])
fig['layout'].update(title={"text" : 'Distribution of articles category-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Category name",yaxis_title="Number of articles")
fig.update_layout(width=800,height=700)
fig

From the bar chart, we can observe that the **politics** category has the **highest** number of articles, followed by **entertainment**, and so on.

### 4.c Number of articles per month

Let's first group the data on monthly basis using **resample()** function. 

In [13]:
news_articles_per_month = news_articles.resample('m',on = 'date')['headline'].count()
news_articles_per_month

date
2018-01-31    2065
2018-02-28    1694
2018-03-31    1778
2018-04-30    1580
2018-05-31    1368
Freq: M, Name: headline, dtype: int64

In [14]:
fig = go.Figure([go.Bar(x=news_articles_per_month.index.strftime("%b"), y=news_articles_per_month)])
fig['layout'].update(title={"text" : 'Distribution of articles month-wise','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Month",yaxis_title="Number of articles")
fig.update_layout(width=500,height=500)
fig

From the bar chart, we can observe that **January** has the **highest** number of articles, followed by **March**, and so on.

### 4.d PDF for the length of headlines 

In [15]:
fig = ff.create_distplot([news_articles['headline'].str.len()], ["ht"],show_hist=False,show_rug=False)
fig['layout'].update(title={'text':'PDF','y':0.9,'x':0.5,'xanchor': 'center','yanchor': 'top'}, xaxis_title="Length of a headline",yaxis_title="probability")
fig.update_layout(showlegend = False,width=500,height=500)
fig

The probability density function of headline length is close to a **Gaussian distribution**, with most headlines between 58 and 80 characters long (`str.len()` counts characters, not words).

The preprocessing in Step 3 left us with a subset of the original dataset whose index labels are no longer contiguous, so let's reset the index to run from 0 to the total number of articles minus one.

In [16]:
news_articles.index = range(news_articles.shape[0])

In [17]:
# Adding a new column containing both day of the week and month, it will be required later while recommending based on day of the week and month
news_articles["day and month"] = news_articles["date"].dt.strftime("%a") + "_" + news_articles["date"].dt.strftime("%b")

Text preprocessing modifies the headlines, and it doesn't make sense to display modified headlines in the recommendations. So let's copy the dataset and perform text preprocessing on the copy.

In [18]:
news_articles_temp = news_articles.copy()

## 5. Text Preprocessing

### 5.a Stopwords removal

Stop words contribute little to the analysis and add processing time, so let's remove them.

In [19]:
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [20]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for word in news_articles_temp["headline"][i].split():
        word = "".join(e for e in word if e.isalnum())
        word = word.lower()
        if word not in stop_words:
            string += word + " "
    if i % 1000 == 0:
        print(i)           # To track number of records processed
    news_articles_temp.at[i, "headline"] = string.strip()

0
1000
2000
3000
4000
5000
6000
7000
8000
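
The loop above can also be expressed as a small function, which is easier to try on a single string. The sketch below uses a tiny hand-written stop-word set as a stand-in for NLTK's English list, so its results on other inputs may differ from the real list. The name `clean_headline` is illustrative.

```python
# Stand-in for set(stopwords.words('english')); deliberately tiny.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "off", "after"}

def clean_headline(headline, stop_words=STOP_WORDS):
    """Strip non-alphanumeric characters, lowercase, and drop stop words."""
    kept = []
    for word in headline.split():
        word = "".join(ch for ch in word if ch.isalnum()).lower()
        # Skip tokens that became empty (e.g. "--") as well as stop words.
        if word and word not in stop_words:
            kept.append(word)
    return " ".join(kept)

print(clean_headline("Woman Fired After Flipping Off Trump's Motorcade"))
```

Assuming the real stop-word set is passed in, the same function could be applied column-wise with `news_articles_temp["headline"].apply(clean_headline, stop_words=stop_words)`.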


### 5.b Lemmatization

Let's reduce each word to its base form (lemma) so that different inflections of a word are treated as the same term.

In [21]:
lemmatizer = WordNetLemmatizer()
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [22]:
for i in range(len(news_articles_temp["headline"])):
    string = ""
    for w in word_tokenize(news_articles_temp["headline"][i]):
        string += lemmatizer.lemmatize(w,pos = "v") + " "
    news_articles_temp.at[i, "headline"] = string.strip()
    if(i%1000==0):
        print(i)           # To track number of records processed

0
1000
2000
3000
4000
5000
6000
7000
8000


## 6. Headline based similarity on news articles

Generally, we assess **similarity** based on **distance**: the smaller the **distance** between two articles, the higher their **similarity**, and the larger the distance, the lower the similarity.
To calculate the **distance**, we need to represent each headline as a **d-dimensional** vector. We can then measure **similarity** by the **distance** between vectors.
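
To make the distance idea concrete, here is a minimal sketch that computes the Euclidean distance between small made-up vectors. The vectors are illustrative and not derived from the dataset.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1, 1, 0, 0]
close = [1, 1, 1, 0]   # differs from the query in one position
far = [0, 0, 1, 1]     # differs from the query in every position

print(euclidean_distance(query, close))  # 1.0
print(euclidean_distance(query, far))    # 2.0
```

The vector that shares more components with the query ends up at the smaller distance, which is the sense in which small distance means high similarity.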

There are multiple methods to represent a **text** as **d-dimensional** vector like **Bag of words**, **TF-IDF method**, **Word2Vec embedding** etc. Each method has its own advantages and disadvantages. 

Let's see the feature representation of headline through all the methods one by one.

### 6.a Using Bag of Words method

A **Bag of Words (BoW)** representation records the occurrence of words within a **document**. Here, each headline is a **document** and the set of all headlines forms the **corpus**.

Using the **BoW** approach, each **document** is represented by a **d-dimensional** vector, where **d** is the total number of **unique words** in the corpus. The set of these unique words forms the **Vocabulary**.
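
Before handing the work to a library, it may help to see the representation built by hand. The sketch below constructs a vocabulary and count vectors for an invented three-headline corpus. It is a simplified illustration of the idea and not how `CountVectorizer` is implemented; the library applies its own tokenisation rules and stores the result sparsely.

```python
from collections import Counter

# Toy corpus of already-cleaned headlines (invented for illustration).
corpus = ["trump sue california", "texas sue trump", "storm hit texas"]

# The vocabulary is the sorted set of unique words across the corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocabulary) for doc in corpus]
print(vocabulary)
for row in vectors:
    print(row)
```

Here d is 6, the number of unique words, and most entries in each row are zero, which is why a sparse matrix is the natural storage format for a real corpus.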

In [23]:
headline_vectorizer = CountVectorizer()
headline_features   = headline_vectorizer.fit_transform(news_articles_temp['headline'])

#### Saving the model

In [24]:
# save the model to disk
filename = 'Count-Vectorizer features.sav'
pickle.dump(headline_features, open(filename, 'wb'))

In [25]:
headline_features.get_shape()

(8485, 11122)

The output **BoW matrix**(headline_features) is a sparse matrix.

In [26]:
pd.set_option('display.max_colwidth', None)  # To display a very long headline completely (None replaces the deprecated -1)



In [27]:
def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(headline_features,headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    #return df.iloc[1:,1]
    return df.iloc[1:,]

bag_of_words_based_model(133, 11) # Change the row index for any other queried article

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-04-02,The Trump Administration Is Suing California Again,2.828427
2,2018-05-01,Texas Sues Trump Administration To End DACA,3.162278
3,2018-03-07,Stormy Daniels Suing Trump Over Nondisclosure Agreement,3.162278
4,2018-04-28,Trump: Mueller Should Never Have Been Appointed,3.162278
5,2018-04-24,Spanish Woman Looks More Like Trump Than The Donald Himself,3.162278
6,2018-02-12,What You Should Know About Trump's Nihilist Budget,3.162278
7,2018-05-09,The Caliphate Of Trump And A Planet In Ruins,3.162278
8,2018-03-26,Trump Ally Sues Qatar For Hacking His Email,3.162278
9,2018-02-21,All They Will Call You Will Be Deportees,3.162278
10,2018-04-11,Pursuing Desegregation In The Trump Era,3.162278


In [28]:
bag_of_words_based_model(5673, 11)

headline :  Gender Confirmation Surgeries Are Rising, And So Is Insurance Coverage: Study



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-02-21,All They Will Call You Will Be Deportees,3.0
2,2018-01-11,Why The California Mudslides Have Been So Deadly,3.162278
3,2018-01-25,Where The Work-For-Welfare Movement Is Heading,3.162278
4,2018-01-03,Mitt Romney is Not The Answer,3.162278
5,2018-01-12,No Shitholes In The Eyes Of Jesus,3.162278
6,2018-01-03,I Was Ghosted By My Best Friend,3.162278
7,2018-03-13,The Fight Over The Criminalization Of Immigrants,3.162278
8,2018-01-29,Here Are All The 2018 Grammy Winners,3.162278
9,2018-01-03,The 9 Most Important Scientific Studies For Parents Of 2017,3.162278
10,2018-04-28,All The Movies That Are Cool For The Summer,3.162278


#### Load the saved data

In [29]:
# To LOAD THE SAVED MODEL
filename = 'Count-Vectorizer features.sav'
BoW_vectorizer_features = pickle.load(open(filename,'rb'))

def bag_of_words_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(BoW_vectorizer_features,BoW_vectorizer_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    #return df.iloc[1:,1]
    return df.iloc[1:,]

bag_of_words_based_model(1833, 11)

headline :  That Time Barbra Streisand Called In Sick To The Grammys And Celine Dion Saved The Day



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-02-21,All They Will Call You Will Be Deportees,3.162278
2,2018-01-20,How The LA Times Union Won,3.316625
3,2018-01-29,"'Despacito' Was Robbed At The Grammys, And We’re All Worse Off For It",3.464102
4,2018-01-04,"2018: More Of The Old, Time For The New",3.464102
5,2018-05-01,Katie Couric On The Time's Up Movement: Now What?,3.464102
6,2018-03-09,'A Wrinkle In Time' And The Burden Of Being First,3.464102
7,2018-02-27,"Barbra Streisand Had Her Dog, Samantha, Cloned Twice",3.464102
8,2018-05-17,The Battle To Save Our Dying Soil,3.464102
9,2018-01-04,A Day Of Liberation In The Golden State,3.464102
10,2018-01-08,No Longer Ruled By Historical Time?,3.464102


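The function above recommends the **10 most similar** articles to the **queried** (already read) article, based on the headline. It accepts two arguments: the index of the already read article and the number of rows to retrieve, which includes the queried article itself (hence 11 for 10 recommendations).

It finds the 10 nearest neighbors by **Euclidean distance** and recommends them. Note that the column labelled "Euclidean similarity with the queried article" holds this distance, so smaller values mean closer matches.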
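
The distances in the tables above can be read directly. When every word appears at most once per headline, the squared Euclidean distance between two count vectors equals the number of words that occur in one headline but not in the other, so 2.828427 is √8 and 3.162278 is √10. The sketch below checks this on two word sets that approximate what the preprocessing might produce for the first query and its top match; the exact cleaned forms are an assumption and were not taken from the notebook's output.

```python
import math

def count_vector_distance(words_a, words_b):
    """Euclidean distance between 0/1 count vectors built from two word sets."""
    non_shared = set(words_a) ^ set(words_b)   # symmetric difference
    return math.sqrt(len(non_shared))

# Assumed cleaned forms of the queried headline and its closest match.
a = {"woman", "fire", "flip", "trump", "motorcade", "sue", "former", "employer"}
b = {"trump", "administration", "sue", "california"}

# 6 words occur only in a and 2 only in b, giving sqrt(8).
print(round(count_vector_distance(a, b), 6))
```

Read this way, a short headline that shares one or two common words with the query can rank highly simply because it contributes few non-shared words.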

**Disadvantages**
1. It weights every word by its raw count, so rare but informative words get no extra **importance**. A few words from the queried article, such as "employer", "flip" and "fire", appear infrequently in the corpus, and the **BoW** method does not recommend any article whose headline contains them. Because **trump** is a common word in the corpus, the recommendations are dominated by headlines containing "trump".
2. The **BoW** method doesn't preserve the order of words.
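
The second disadvantage can be seen in a two-line experiment. In this minimal sketch (the helper `bag` is illustrative), two sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

def bag(text):
    """Word counts only; position information is discarded."""
    return Counter(text.lower().split())

print(bag("dog bites man") == bag("man bites dog"))  # True
```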

To overcome the first disadvantage we use **TF-IDF** method for feature representation. 


### 6.b Using TF-IDF method

The **TF-IDF** method is a weighted measure that gives more importance to words that are less frequent in the corpus. It assigns a weight to each term (word) in a document based on **term frequency (TF)** and **inverse document frequency (IDF)**.

**TF(i,j)** = (# times word i appears in document j) / (# words in document j)

**IDF(i,D)** = log_e( (# documents in the corpus D) / (# documents containing word i) )

weight(i,j) = **TF(i,j)** x **IDF(i,D)**

So if a word occurs many times in a document but in few other documents, its **TF-IDF** value will be high.
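
The formulas above can be worked through on a toy corpus. The sketch below implements them literally in plain Python; the corpus is invented, and the function names are illustrative. With default settings, `TfidfVectorizer` uses a smoothed IDF and normalises each row to unit length, so its numbers will differ from this textbook form.

```python
import math

# Toy corpus of cleaned headlines (invented for illustration).
corpus = [
    "trump sue employer",
    "trump budget plan",
    "trump meet leader",
    "storm hit coast",
]
docs = [doc.split() for doc in corpus]

def tf(word, doc):
    """Share of the document's words that are `word`."""
    return doc.count(word) / len(doc)

def idf(word, all_docs):
    """Natural log of (documents in corpus / documents containing `word`).

    The word must occur in at least one document.
    """
    containing = sum(1 for d in all_docs if word in d)
    return math.log(len(all_docs) / containing)

def tfidf(word, doc, all_docs):
    return tf(word, doc) * idf(word, all_docs)

first = docs[0]
print("trump   :", round(tfidf("trump", first, docs), 4))
print("employer:", round(tfidf("employer", first, docs), 4))
```

Both words have the same term frequency in the first headline, but "employer" appears in one document while "trump" appears in three, so "employer" receives the larger weight.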


In [30]:
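# min_df=1 (the default) keeps every term; recent scikit-learn versions reject an integer min_df of 0
tfidf_headline_vectorizer = TfidfVectorizer(min_df = 1)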
tfidf_headline_features = tfidf_headline_vectorizer.fit_transform(news_articles_temp['headline'])

#### Saving the model

In [31]:
# save the model to disk
filename = 'tf-idf_vectorizer features.sav'
pickle.dump(tfidf_headline_features, open(filename, 'wb'))

In [32]:
def tfidf_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(tfidf_headline_features,tfidf_headline_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    
    #return df.iloc[1:,1]
    return df.iloc[1:,]
tfidf_based_model(133, 11)

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-05-21,The Supreme Court Just Made It A Lot Harder For You To Sue Your Employer,1.164067
2,2018-04-02,The Trump Administration Is Suing California Again,1.253867
3,2018-04-10,"Lou Dobbs Flips Out On Live TV, Urges Trump To 'Fire The SOB' Robert Mueller",1.25881
4,2018-04-26,Cardi B's Former Manager Sues Her For $10 Million,1.268704
5,2018-04-03,A Third Woman Is Suing To Break A Trump-Related Nondisclosure Agreement,1.274264
6,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.274847
7,2018-01-16,State Employer Side Payroll Taxes And Loser Liberalism,1.276696
8,2018-02-21,Democrats Flip Kentucky State House Seat Where Trump Won Overwhelmingly,1.282008
9,2018-01-09,Big Tax Game Hunting: Employer Side Payroll Taxes,1.285147
10,2018-02-28,Democrats Flip 2 More GOP-Held State House Seats,1.287403
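
By default `TfidfVectorizer` scales each row to unit length, and for unit vectors the Euclidean distance d and the cosine similarity are linked by d² = 2 − 2·cos. Assuming that default normalisation applies here, the distances in the table can be converted to cosine similarities as sketched below. The helper name is illustrative.

```python
def cosine_from_unit_distance(distance):
    """Cosine similarity implied by the Euclidean distance between two unit-length vectors."""
    return 1.0 - (distance ** 2) / 2.0

# First and last distances from the table above.
for d in (1.164067, 1.287403):
    print(d, "->", round(cosine_from_unit_distance(d), 3))
```

Under this assumption, ranking by Euclidean distance and ranking by cosine similarity give the same order, and a distance near √2 (about 1.414) means the two headlines share almost no weighted terms.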


In [33]:
tfidf_based_model(2345, 11)

headline :  Shaun White Dismisses Sexual Harassment Allegations As 'Gossip'



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-02-15,Shaun White Called Out By Accuser's Lawyer For Minimizing Sexual Harassment,1.055
2,2018-01-24,GOP Rep. Blames Obamacare For Sexual Harassment Allegations,1.161245
3,2018-04-06,GOP Rep. Blake Farenthold Resigns After Sexual Harassment Allegations,1.18467
4,2018-02-14,Shaun White Makes History With Gold Medal Win In Halfpipe,1.201415
5,2018-02-13,Shaun White Is Damn Near Perfect In His Winter Olympics Comeback,1.211735
6,2018-02-25,CPAC Speaker Lambasts GOP 'Hypocrites' Over Trump Sexual Harassment Allegations,1.221723
7,2018-01-09,Deputy Head Of Norway's Labor Party Resigns Amid Sexual Harassment Allegations,1.224776
8,2018-01-27,Steve Wynn Steps Down As RNC Finance Chair Amid Sexual Harassment Allegations,1.227933
9,2018-03-04,Someone Gave Shaun White's Gold Medal Run The Super Mario Treatment,1.237115
10,2018-01-22,Pope Francis Compares Gossiping Nuns To Terrorists,1.238153


#### Load the saved data

In [34]:
# To LOAD THE SAVED MODEL
filename = 'tf-idf_vectorizer features.sav'
vectorizer_features = pickle.load(open(filename,'rb'))

def tfidf_based_model(row_index, num_similar_items):
    couple_dist = pairwise_distances(vectorizer_features,vectorizer_features[row_index])
    indices = np.argsort(couple_dist.ravel())[0:num_similar_items]
    df = pd.DataFrame({'publish_date': news_articles['date'][indices].values,
               'headline':news_articles['headline'][indices].values,
                'Euclidean similarity with the queried article': couple_dist[indices].ravel()})
    print("="*30,"Queried article details","="*30)
    print('headline : ',news_articles['headline'][indices[0]])
    print("\n","="*25,"Recommended articles : ","="*23)
    
    #return df.iloc[1:,1]
    return df.iloc[1:,]
tfidf_based_model(133, 11)

headline :  Woman Fired After Flipping Off Trump's Motorcade Sues Former Employer



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-05-21,The Supreme Court Just Made It A Lot Harder For You To Sue Your Employer,1.164067
2,2018-04-02,The Trump Administration Is Suing California Again,1.253867
3,2018-04-10,"Lou Dobbs Flips Out On Live TV, Urges Trump To 'Fire The SOB' Robert Mueller",1.25881
4,2018-04-26,Cardi B's Former Manager Sues Her For $10 Million,1.268704
5,2018-04-03,A Third Woman Is Suing To Break A Trump-Related Nondisclosure Agreement,1.274264
6,2018-02-24,Former RNC Chair Fires Back At Claim He Was Only Hired Because He Was Black,1.274847
7,2018-01-16,State Employer Side Payroll Taxes And Loser Liberalism,1.276696
8,2018-02-21,Democrats Flip Kentucky State House Seat Where Trump Won Overwhelmingly,1.282008
9,2018-01-09,Big Tax Game Hunting: Employer Side Payroll Taxes,1.285147
10,2018-02-28,Democrats Flip 2 More GOP-Held State House Seats,1.287403


### 6.c Taking Input from User

In [55]:
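import random
# Map a menu number to each category present in the filtered data
y = news_articles["category"].unique()
inp = {i: category for i, category in enumerate(y)}
print(inp)
cat = input("\nEnter the index corresponding to the category of your choice: ")
print(inp[int(cat)])
# Collect up to 25 row labels in the chosen category. Use news_articles (filtered and
# re-indexed) rather than df: df still points at the unfiltered data, whose row labels
# do not line up with the feature matrices.
cat_ls = news_articles.index[news_articles['category'] == inp[int(cat)]][:25]

# Pick one of those articles at random as the queried article
cat_index = random.choice(cat_ls)
print(cat_index)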

{0: 'QUEER VOICES', 1: 'COMEDY', 2: 'ENTERTAINMENT', 3: 'POLITICS', 4: 'IMPACT', 5: 'LATINO VOICES', 6: 'WEIRD NEWS', 7: 'WORLD NEWS', 8: 'BLACK VOICES', 9: 'BUSINESS', 10: 'MEDIA', 11: 'GREEN', 12: 'TRAVEL', 13: 'CRIME', 14: 'TECH', 15: 'WOMEN', 16: 'SPORTS', 17: 'PARENTS', 18: 'EDUCATION', 19: 'STYLE', 20: 'TASTE', 21: 'RELIGION', 22: 'SCIENCE', 23: 'HEALTHY LIVING', 24: 'ARTS & CULTURE', 25: 'COLLEGE'}

Enter the index corresponding to the category of your choice: 2
ENTERTAINMENT
1
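
For reuse outside an interactive session, the selection step can be wrapped in a function that takes the category as an argument instead of calling `input()`. The sketch below works on a plain list of category labels; the function name and the `limit` and `seed` parameters are illustrative and not part of the notebook.

```python
import random

def pick_article_index(categories, chosen, limit=25, seed=None):
    """Return a random position among the first `limit` rows whose category equals `chosen`."""
    matches = [i for i, cat in enumerate(categories) if cat == chosen][:limit]
    if not matches:
        raise ValueError(f"no article found in category {chosen!r}")
    return random.Random(seed).choice(matches)

# Invented category column; positions play the role of row indices.
categories = ["POLITICS", "COMEDY", "POLITICS", "SPORTS", "COMEDY"]
idx = pick_article_index(categories, "COMEDY", seed=0)
print(idx, categories[idx])
```

Passing a `seed` makes the choice repeatable, which helps when comparing the BoW and TF-IDF recommendations for the same queried article.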


In [56]:
bag_of_words_based_model(cat_index,11)


headline :  ‘The Voice’ Blind Auditions Make History With First Trans Contestant



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-04-30,'The Voice' Contestant Makes Aussie TV History By Proposing To His Boyfriend,2.828427
2,2018-05-19,The History Of The National Anthem In Sports,3.162278
3,2018-02-26,Why These Were The Gayest Winter Olympics In History,3.162278
4,2018-04-15,Beyoncé Makes History As First Black Woman To Headline Coachella,3.162278
5,2018-04-16,5 Lessons In The History Of American Defeat,3.162278
6,2018-03-09,'A Wrinkle In Time' And The Burden Of Being First,3.162278
7,2018-02-20,Nigeria And Jamaica Just Made Winter Olympics History,3.162278
8,2018-02-21,All They Will Call You Will Be Deportees,3.162278
9,2018-03-16,The Fog Of War In America,3.316625
10,2018-01-25,Where The Work-For-Welfare Movement Is Heading,3.316625


In [57]:
tfidf_based_model(cat_index,11)

headline :  ‘The Voice’ Blind Auditions Make History With First Trans Contestant



Unnamed: 0,publish_date,headline,Euclidean similarity with the queried article
1,2018-04-30,'The Voice' Contestant Makes Aussie TV History By Proposing To His Boyfriend,1.054343
2,2018-01-30,'Drag Race' Star To Make History As A Trans Leading Lady On Broadway,1.210646
3,2018-04-15,Beyoncé Makes History As First Black Woman To Headline Coachella,1.246795
4,2018-01-22,The Voices Of Larry Nassar's Victims Made It All The Way To The Women’s March,1.249617
5,2018-02-19,John Goodman's Audition And How He Got The Role On 'Roseanne',1.257244
6,2018-04-19,"Tammy Duckworth And Baby Make History, Becoming First On Senate Floor",1.261307
7,2018-02-24,USA Makes History With First Olympic Curling Gold After Unlikely Comeback,1.266063
8,2018-02-01,Andy Cohen Auditions To Play Samantha In 'Sex And The City 3',1.267544
9,2018-05-09,Kim Kardashian To Make History With Fashion Council’s First Influencer Award,1.271638
10,2018-05-07,Lupe Valdez Could Make History As Texas’ First Hispanic Governor,1.272267


Compared with the **BoW** method, the **TF-IDF** method places headlines containing words like "employer", "fire" and "flip" among the top 5 recommendations for the first queried article, even though these words occur infrequently in the corpus.

**Disadvantages**

The **BoW** and **TF-IDF** methods do not capture the **semantic** and **syntactic** similarity of a given word with other words, but this can be captured using **word embeddings**.

For example, there is a strong association between words like "trump" and "white house", "office" and "employee", "tiger" and "leopard", "USA" and "Washington D.C.", and so on. This kind of **semantic** similarity can be captured using **word embedding** techniques.
**Word embedding** techniques like **Word2Vec**, **GloVe** and **fastText** leverage semantic similarity between words.
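
As a rough illustration of the idea, the sketch below averages per-word vectors into a headline vector and compares headlines by cosine similarity. The three-dimensional vectors are made up by hand for this example; real embeddings such as Word2Vec or GloVe are learned from large corpora and typically have far more dimensions.

```python
import math

# Hand-made toy vectors (hypothetical values, not real embeddings).
toy_vectors = {
    "tiger":   [0.9, 0.1, 0.0],
    "leopard": [0.8, 0.2, 0.0],
    "hunts":   [0.5, 0.5, 0.0],
    "budget":  [0.0, 0.1, 0.9],
    "cut":     [0.1, 0.3, 0.6],
}

def headline_vector(headline):
    """Average the vectors of the known words in the headline."""
    vecs = [toy_vectors[w] for w in headline.split() if w in toy_vectors]
    if not vecs:
        raise ValueError("no known words in headline")
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = headline_vector("tiger hunts")
b = headline_vector("leopard hunts")
c = headline_vector("budget cut")
print(round(cosine(a, b), 3), round(cosine(a, c), 3))
```

Because "tiger" and "leopard" were given nearby vectors, the first two headlines score as close even though those two words never match exactly, which is the behaviour BoW and TF-IDF cannot provide.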