## References
Deep Learning Content-Based Filtering: https://github.com/Toqeer-Ahmad/Deep-Learning-for-Content-Based-Filtering/blob/main/Deep%20Learning%20for%20Content-Based%20Filtering.ipynb

Predicting Like or Dislike: https://towardsdatascience.com/researching-content-based-filtering-for-news-feeds-17954b1865d7

## Import Libraries

In [1]:
import pandas as pd
from rake_nltk import Rake
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

## Import Data

### News

In [2]:
columns = ['News ID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'Title Entities', 'Abstract Entites']
news_df = pd.read_csv('./MINDsmall_train/news.tsv', names=columns, header=None, sep='\t')
news_df.head()

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract,URL,Title Entities,Abstract Entites
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [4]:
news_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51282 entries, 0 to 51281
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   News ID           51282 non-null  object
 1   Category          51282 non-null  object
 2   SubCategory       51282 non-null  object
 3   Title             51282 non-null  object
 4   Abstract          48616 non-null  object
 5   URL               51282 non-null  object
 6   Title Entities    51279 non-null  object
 7   Abstract Entites  51278 non-null  object
dtypes: object(8)
memory usage: 3.1+ MB


In [3]:
news_dev_df = pd.read_csv('./MINDsmall_dev/news.tsv', names=columns, header=None, sep='\t')
news_dev_df.head()

Unnamed: 0,News ID,Category,SubCategory,Title,Abstract,URL,Title Entities,Abstract Entites
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N18955,health,medical,Dispose of unwanted prescription drugs during ...,,https://assets.msn.com/labs/mind/AAISxPN.html,"[{""Label"": ""Drug Enforcement Administration"", ...",[]
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [6]:
news_dev_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42416 entries, 0 to 42415
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   News ID           42416 non-null  object
 1   Category          42416 non-null  object
 2   SubCategory       42416 non-null  object
 3   Title             42416 non-null  object
 4   Abstract          40395 non-null  object
 5   URL               42416 non-null  object
 6   Title Entities    42414 non-null  object
 7   Abstract Entites  42414 non-null  object
dtypes: object(8)
memory usage: 2.6+ MB


### Behavior

In [2]:
columns = ['Impression ID', 'User ID', 'Time', 'History', 'Impressions']
behaviors_df = pd.read_csv('./MINDsmall_train/behaviors.tsv', names=columns, header=None, sep='\t')
behaviors_df.head()

Unnamed: 0,Impression ID,User ID,Time,History,Impressions
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...
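
Each value in the `Impressions` column is a space-separated list of `NewsID-label` tokens; in the MIND dataset a label of 1 marks a clicked article and 0 a shown-but-not-clicked one. A minimal sketch of splitting one such string into pairs, assuming that format holds for every token. `parse_impressions` is a hypothetical helper, not part of the notebook:

```python
def parse_impressions(impressions):
    """Split 'N55689-1 N35729-0' into (news_id, label) pairs."""
    pairs = []
    for token in impressions.split():
        # rpartition splits on the last '-', so the label is always the final piece
        news_id, _, label = token.rpartition('-')
        pairs.append((news_id, int(label)))
    return pairs

# sample value taken from the first row of behaviors_df above
sample = parse_impressions('N55689-1 N35729-0')

# keeping only the articles that were clicked
clicked = [news_id for news_id, label in sample if label == 1]
```

On the real data the helper could be applied with `behaviors_df['Impressions'].apply(parse_impressions)`.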


In [19]:
behaviors_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156965 entries, 0 to 156964
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Impression ID  156965 non-null  int64 
 1   User ID        156965 non-null  object
 2   Time           156965 non-null  object
 3   History        153727 non-null  object
 4   Impressions    156965 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.0+ MB


In [3]:
behaviors_dev_df = pd.read_csv('./MINDsmall_dev/behaviors.tsv', names=columns, header=None, sep='\t')
behaviors_dev_df.head()

Unnamed: 0,Impression ID,User ID,Time,History,Impressions
0,1,U80234,11/15/2019 12:37:50 PM,N55189 N46039 N51741 N53234 N11276 N264 N40716...,N28682-0 N48740-0 N31958-1 N34130-0 N6916-0 N5...
1,2,U60458,11/15/2019 7:11:50 AM,N58715 N32109 N51180 N33438 N54827 N28488 N611...,N20036-0 N23513-1 N32536-0 N46976-0 N35216-0 N...
2,3,U44190,11/15/2019 9:55:12 AM,N56253 N1150 N55189 N16233 N61704 N51706 N5303...,N36779-0 N62365-0 N58098-0 N5472-0 N13408-0 N5...
3,4,U87380,11/15/2019 3:12:46 PM,N63554 N49153 N28678 N23232 N43369 N58518 N444...,N6950-0 N60215-0 N6074-0 N11930-0 N6916-0 N248...
4,5,U9444,11/15/2019 8:25:46 AM,N51692 N18285 N26015 N22679 N55556,N5940-1 N23513-0 N49285-0 N23355-0 N19990-0 N3...


In [18]:
behaviors_dev_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73152 entries, 0 to 73151
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Impression ID  73152 non-null  int64 
 1   User ID        73152 non-null  object
 2   Time           73152 non-null  object
 3   History        70938 non-null  object
 4   Impressions    73152 non-null  object
dtypes: int64(1), object(4)
memory usage: 2.8+ MB


## Checking the difference between train and dev behaviors.tsv

In [16]:
behaviors_df[behaviors_df['User ID'] == 'U10008']

Unnamed: 0,Impression ID,User ID,Time,History,Impressions
46854,46855,U10008,11/9/2019 2:13:33 PM,N23912 N30148 N46754 N33117 N9619 N36526 N4449...,N24272-0 N8016-0 N43595-0 N732-0 N8929-0 N3567...


In [17]:
behaviors_dev_df[behaviors_dev_df['User ID'] == 'U10008']

Unnamed: 0,Impression ID,User ID,Time,History,Impressions
25324,25325,U10008,11/15/2019 9:48:12 AM,N23912 N30148 N46754 N33117 N9619 N36526 N4449...,N55237-0 N50775-0 N5472-0 N58251-0 N62365-0 N4...


In [15]:
set(behaviors_df['User ID'].unique()) & set(behaviors_dev_df['User ID'].unique())

{'U10008',
 'U12671',
 'U82589',
 'U43301',
 'U56365',
 'U85257',
 'U26637',
 'U69688',
 'U50667',
 'U7160',
 'U74655',
 'U55281',
 'U60895',
 'U44999',
 'U14651',
 'U24085',
 'U16647',
 'U72159',
 'U6138',
 'U22551',
 'U54367',
 'U25433',
 'U9803',
 'U136',
 'U43951',
 'U5288',
 'U41966',
 'U10862',
 'U53200',
 'U62558',
 'U29706',
 'U26328',
 'U22115',
 'U42421',
 'U26626',
 'U85481',
 'U48126',
 'U27037',
 'U41969',
 'U47673',
 'U76591',
 'U81480',
 'U86516',
 'U70274',
 'U44217',
 'U92282',
 'U82568',
 'U46134',
 'U40751',
 'U74097',
 'U72722',
 'U10152',
 'U58150',
 'U85606',
 'U78750',
 'U76223',
 'U78709',
 'U58543',
 'U47226',
 'U11945',
 'U10742',
 'U47041',
 'U43229',
 'U26616',
 'U12851',
 'U58027',
 'U54898',
 'U1837',
 'U29508',
 'U84613',
 'U25314',
 'U86368',
 'U60681',
 'U48099',
 'U38904',
 'U32568',
 'U64885',
 'U31709',
 'U68046',
 'U78428',
 'U26543',
 'U53244',
 'U65059',
 'U78722',
 'U75893',
 'U82392',
 'U41835',
 'U13009',
 'U71849',
 'U63643',
 'U91774',
 'U380
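
Printing the full intersection produces a very long output. A sketch that summarizes the overlap instead, assuming only the `User ID` column is needed. The two small frames are hypothetical stand-ins for `behaviors_df` and `behaviors_dev_df`, built from IDs shown above so the snippet runs on its own:

```python
import pandas as pd

# hypothetical toy stand-ins for behaviors_df and behaviors_dev_df
train = pd.DataFrame({'User ID': ['U13740', 'U10008', 'U10008', 'U91836']})
dev = pd.DataFrame({'User ID': ['U80234', 'U10008', 'U60458']})

train_users = set(train['User ID'])
dev_users = set(dev['User ID'])
shared_users = train_users & dev_users

# number of distinct users that appear in both splits
n_shared = len(shared_users)

# fraction of dev impressions that belong to a user also seen in train
dev_share = dev['User ID'].isin(train_users).mean()
```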

## Experiments

### Limiting to Title, Category, Subcategory, and Abstract

In [5]:
df = news_df[['Title', 'Category', 'SubCategory', 'Abstract']]
df.head()

Unnamed: 0,Title,Category,SubCategory,Abstract
0,"The Brands Queen Elizabeth, Prince Charles, an...",lifestyle,lifestyleroyals,"Shop the notebooks, jackets, and more that the..."
1,50 Worst Habits For Belly Fat,health,weightloss,These seemingly harmless habits are holding yo...
2,The Cost of Trump's Aid Freeze in the Trenches...,news,newsworld,Lt. Ivan Molchanets peeked over a parapet of s...
3,I Was An NBA Wife. Here's How It Affected My M...,health,voices,"I felt like I was a fraud, and being an NBA wi..."
4,"How to Get Rid of Skin Tags, According to a De...",health,medical,"They seem harmless, but there's a very good re..."


### Drop rows with missing abstracts

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51282 entries, 0 to 51281
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        51282 non-null  object
 1   Category     51282 non-null  object
 2   SubCategory  51282 non-null  object
 3   Abstract     48616 non-null  object
dtypes: object(4)
memory usage: 1.6+ MB


In [7]:
# dropping rows with missing values (only Abstract has any)
df.dropna(inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48616 entries, 0 to 51280
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        48616 non-null  object
 1   Category     48616 non-null  object
 2   SubCategory  48616 non-null  object
 3   Abstract     48616 non-null  object
dtypes: object(4)
memory usage: 1.9+ MB


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(inplace=True)
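
The warning above is raised because `df` was created by selecting columns from `news_df`, so pandas cannot tell whether the in-place `dropna` is meant to affect the original frame. A minimal sketch of one common way to avoid it, taking an explicit copy and reassigning the result instead of using `inplace=True`. The toy frame is a hypothetical stand-in for `news_df`:

```python
import pandas as pd

# hypothetical toy stand-in for news_df with one missing abstract
toy_news = pd.DataFrame({
    'News ID': ['N1', 'N2', 'N3'],
    'Title': ['a', 'b', 'c'],
    'Abstract': ['first', None, 'third'],
})

# .copy() makes the column subset an independent DataFrame
subset = toy_news[['Title', 'Abstract']].copy()

# reassigning the result instead of modifying in place
subset = subset.dropna()

# the original index labels are kept, so label 1 is now missing
remaining_labels = list(subset.index)
```

Because the surviving rows keep their original labels, row labels and row positions no longer coincide after this step, which matters whenever a positional array is indexed with a label.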


### Generating Bag of Words

In [8]:
# initializing the new column
df['Keywords'] = ""

for index, row in df.iterrows():
    # instantiating Rake, by default it uses English stopwords from NLTK
    # and discards all punctuation characters as well
    r = Rake()

    # extracting the keywords by passing the text
    r.extract_keywords_from_text(row['Abstract'])

    # getting the dictionary with keywords as keys and their degree scores as values
    key_words_dict_scores = r.get_word_degrees()
    
    # assigning the keywords to the new column for the corresponding article;
    # writing through df.at, since changes to the row copy from iterrows() may not reach df
    df.at[index, 'Keywords'] = list(key_words_dict_scores.keys())

# dropping the Abstract column
df.drop(columns = ['Abstract'], inplace = True)

df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Keywords'] = ""
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns = ['Abstract'], inplace = True)


Unnamed: 0,Title,Category,SubCategory,Keywords
0,"The Brands Queen Elizabeth, Prince Charles, an...",lifestyle,lifestyleroyals,"[shop, notebooks, jackets, royals, live, without]"
1,50 Worst Habits For Belly Fat,health,weightloss,"[seemingly, harmless, habits, holding, back, k..."
2,The Cost of Trump's Aid Freeze in the Trenches...,news,newsworld,"[lt, ivan, molchanets, peeked, parapet, sand, ..."
3,I Was An NBA Wife. Here's How It Affected My M...,health,voices,"[felt, like, fraud, nba, wife, help, fact, nea..."
4,"How to Get Rid of Skin Tags, According to a De...",health,medical,"[seem, harmless, good, reason, ignore, post, g..."


In [9]:
df['Bag of Words'] = df.apply(lambda x: ' '.join([x['Category']] + [x['SubCategory']] + x['Keywords']), axis=1)

# dropping the Category column
df.drop(columns = ['Category'], inplace = True)
# dropping the SubCategory column
df.drop(columns = ['SubCategory'], inplace = True)
# dropping the Keywords column
df.drop(columns = ['Keywords'], inplace = True)

df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['Bag of Words'] = df.apply(lambda x: ' '.join([x['Category']] + [x['SubCategory']] + x['Keywords']), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns = ['Category'], inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns = ['SubCategory'], inplace = True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: htt

Unnamed: 0,Title,Bag of Words
0,"The Brands Queen Elizabeth, Prince Charles, an...",lifestyle lifestyleroyals shop notebooks jacke...
1,50 Worst Habits For Belly Fat,health weightloss seemingly harmless habits ho...
2,The Cost of Trump's Aid Freeze in the Trenches...,news newsworld lt ivan molchanets peeked parap...
3,I Was An NBA Wife. Here's How It Affected My M...,health voices felt like fraud nba wife help fa...
4,"How to Get Rid of Skin Tags, According to a De...",health medical seem harmless good reason ignor...


### Applying Count Vectorizer

In [10]:
# instantiating and generating the count matrix
count = CountVectorizer()
count_matrix = count.fit_transform(df['Bag of Words'])

# generating the cosine similarity matrix
cosine_sim = cosine_similarity(count_matrix, count_matrix)
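
The full similarity matrix has one row and one column per article, so for the 48,616 articles kept above it holds roughly 2.4 billion entries. A sketch of an alternative that compares a single article against all others on demand, assuming the same `CountVectorizer` representation. The three-document corpus is a made-up stand-in for `df['Bag of Words']`:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# made-up stand-in for df['Bag of Words']
corpus = [
    'health weightloss belly fat habits',
    'health weightloss burn belly fat',
    'news newsworld ukraine aid freeze',
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document

# similarity of document 0 against every document, flattened to shape (n_documents,)
query_scores = cosine_similarity(matrix[0], matrix).ravel()

# positions sorted by descending similarity, with the query itself removed
ranked = [int(i) for i in query_scores.argsort()[::-1] if i != 0]
```

Only one row of scores is held in memory at a time, at the cost of recomputing it for each query.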

In [18]:
# defining the function that takes in a news title as input
# and returns the index labels of the top 10 recommended articles
def recommendations(title, cosine_sim = cosine_sim):
    
    # initializing the empty list of recommended article index labels
    recommended_articles = []
    
    # getting the row position of the first article that matches the title;
    # cosine_sim is indexed by position, and df's index labels have gaps after dropna()
    idx = np.flatnonzero((df['Title'] == title).to_numpy())[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)

    # getting the positions of the 10 most similar articles,
    # skipping the first entry, which is normally the article itself
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    # populating the list with the index labels of the best 10 matching articles
    for i in top_10_indexes:
        recommended_articles.append(df.index[i])
        
    return recommended_articles

In [22]:
indices = recommendations('50 Worst Habits For Belly Fat')

df.loc[indices]['Title']

154      Those Grueling Workouts May Not Help You Lose ...
292      If You Have a Slow Metabolism, Here Are 5 Doct...
23602    Jillian Michaels Says Intermittent Fasting 'Is...
6440               10 Ways to Burn Belly Fat in 10 Minutes
6172     Discouraged From Trying to Lose Belly Fat and ...
29142    Get paid to get the flu: Study looking for vol...
6741     Kristina Went From 'Drastic Diets' to Losing 1...
4333     If You Can't Seem to Lose Belly Fat, Experts A...
31183    5 Changes I Made in My Life to Lose Face Fat a...
7676     Local anchor returning to TV after dealing wit...
Name: Title, dtype: object