# Model training and optimzation
The recommendation model training and optimiation involves two parts as shown below: 
1. Data cleaning and processing <br>
    a. Access the category information from new articles in validation set <br>
    b. Get two feature TF-IDF matrixes for previous articles and new articels respectively <br>
    c. Compute the cosine similarity bewtween previous articles and new articles <br>
    d. Modelling <br>
2. Model optimization <br>
a. Optimization on category features using min_df and max_df in "TfidfVectorizer" process <br>
b. Optimization on the total number of articles recommended <br>
c. Optimization on the ratio of the number of new articles to previous articles <br>

In [585]:
import random
import pandas as pd
import numpy as np
import math
import scipy.stats as st
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import seaborn as sns
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer,TfidfTransformer 

### Load the data

In [671]:
%store -r train_df
%store -r valid_df
%store -r test_df

%store -r val_articles_df
%store -r train_articles_df

%store -r end0
%store -r end1
%store -r end2
%store -r end3
%store -r end4
%store -r end5
%store -r end6
%store -r end7

### 1. Data cleaning and processing

In [650]:
train_articles_df.head()

Unnamed: 0,contentID,headline,categories,releaseDateTime,h_score,h_compound
0,www.arkansasonline.com/news/2018/jun/30/arrest...,arrest state legislator urge quit post,News Arkansas News Politics Arkansas,2018-06-30,-0.34,negative
1,www.arkansasonline.com/news/2018/jun/30/suspec...,suspect arrest fatal shoot pulaski county,News Arkansas Crime,2018-06-30,-0.8591,negative
2,www.arkansasonline.com/news/2018/jun/30/police...,police officer fatal shoot self central arkans...,News Arkansas,2018-06-30,-0.7096,negative
3,www.arkansasonline.com/news/2018/jun/30/motorc...,motorcyclist kill headon crash little rock suv...,News Arkansas News Arkansas Crime,2018-06-30,-0.8126,negative
4,www.arkansasonline.com/news/2018/jun/30/early-...,recruit guy early offer get high recruit wr ca...,None Recruiting Sports College Razorbacks Ra...,2018-06-30,0.4588,positive


In [672]:
#  get the dataframe showing the number of page visits and visitors for each article
train_article_visit_counts = train_df.groupby(['contentID']).count().sort_values('headline', ascending = False)
train_article_visit_counts['contentID'] = train_article_visit_counts.index
train_article_visit_counts = train_article_visit_counts.reset_index(drop = True)
train_article_visitor_counts = train_df.drop_duplicates(subset = ['contentID', 'visitorID']).groupby(['contentID']).count()
train_article_visitor_counts['contentID'] = train_article_visitor_counts.index
train_article_visitor_counts = train_article_visitor_counts.reset_index(drop = True)

train_article_counts_df = pd.merge(train_article_visit_counts, train_article_visitor_counts, on = 'contentID', how = 'inner')[['contentID', 'headline_x','headline_y']]
train_article_counts_df.columns = ['read_article', 'visits', 'visitor']
train_article_counts_df

Unnamed: 0,read_article,visits,visitor
0,www.arkansasonline.com/news/2018/aug/29/food-n...,910,393
1,www.arkansasonline.com/news/2018/aug/31/former...,734,296
2,www.arkansasonline.com/news/2018/aug/22/arkans...,656,316
3,www.arkansasonline.com/news/2018/jul/19/sherif...,594,243
4,www.arkansasonline.com/news/2018/aug/16/little...,583,268
...,...,...,...
8518,www.arkansasonline.com/news/2018/aug/12/look-f...,1,1
8519,www.arkansasonline.com/news/2018/aug/12/major-...,1,1
8520,www.arkansasonline.com/news/2018/aug/12/mercy-...,1,1
8521,www.arkansasonline.com/news/2018/aug/12/new-le...,1,1


In [673]:
val_articles_df.head()

Unnamed: 0,contentID,headline,categories,releaseDateTime
0,www.arkansasonline.com/news/2018/sep/03/trump-...,trump attack union leader labor day,News National News Politics National,2018-09-03
1,www.arkansasonline.com/news/2018/sep/03/injuri...,injury force merrick football retirement,Sports College Razorbacks RazobacksCollegeFoo...,2018-09-03
2,www.arkansasonline.com/news/2018/sep/03/galler...,gallery annual national championship chuckwago...,News Arkansas,2018-09-03
3,www.arkansasonline.com/news/2018/sep/03/33-yea...,arkansan dy wreck involve peterbilt truck,News Arkansas News Fatalwrecks,2018-09-03
4,www.arkansasonline.com/news/2018/sep/03/arkans...,arkansas cinema society support local filmmake...,None Lr News Arkansas News Arkansas Entert...,2018-09-03


In [674]:
train_visitors = train_df['visitorID'].unique().tolist() # 2884
val_visitors = valid_df['visitorID'].unique().tolist() # 1868

In [675]:
# the page visits info in training set
# which we could refer to in order to find similar new articles for former visitors
train_visits = train_df.drop_duplicates(subset = ['contentID', 'visitorID'])[['contentID', 'visitorID']]
train_visits = train_visits.sort_values('visitorID').reset_index(drop = True)
train_visits

Unnamed: 0,contentID,visitorID
0,www.arkansasonline.com/news/2018/jul/17/author...,2.011047e+14
1,www.arkansasonline.com/news/2017/may/25/patti-...,2.011047e+14
2,www.arkansasonline.com/news/2018/aug/09/arkans...,2.011047e+14
3,www.arkansasonline.com/news/2018/aug/09/banana...,2.011047e+14
4,www.arkansasonline.com/news/2018/aug/01/blaze-...,2.011047e+14
...,...,...
102876,www.arkansasonline.com/news/2018/aug/19/legisl...,1.805825e+19
102877,www.arkansasonline.com/news/2018/aug/29/food-n...,1.805825e+19
102878,www.arkansasonline.com/news/2018/aug/24/missou...,1.805825e+19
102879,www.arkansasonline.com/news/2018/aug/24/channe...,1.805825e+19


In [676]:
# the unique page visits in validatio set, 
# which we could use to evaluate the recommendation system
valid_visits = valid_df.drop_duplicates(subset = ['contentID', 'visitorID']).sort_values('visitorID').reset_index(drop = True)[['contentID', 'visitorID']]
valid_visits

Unnamed: 0,contentID,visitorID
0,www.arkansasonline.com/news/2018/sep/12/bigger...,2.011047e+14
1,www.arkansasonline.com/news/2018/sep/12/morris...,2.011047e+14
2,www.arkansasonline.com/news/2018/sep/12/video-...,2.011047e+14
3,www.arkansasonline.com/news/2018/sep/06/jury-t...,4.259291e+15
4,www.arkansasonline.com/news/2018/sep/07/confro...,4.259291e+15
...,...,...
25188,www.arkansasonline.com/news/2018/sep/12/fronti...,1.805825e+19
25189,www.arkansasonline.com/news/2018/sep/13/board-...,1.805825e+19
25190,www.arkansasonline.com/news/2018/sep/15/key-pi...,1.805825e+19
25191,www.arkansasonline.com/news/2018/sep/05/judge-...,1.805825e+19


First we can have a look at how the random recommendation model performs.

In [677]:
# define a function to recommend n artciels from new articles in validation set randomly to visitors
# and also calculate the precision/recall confidence interval for random recommendation
def random_recom_result(n):
    precision_list = []
    recall_list = []
    avg_precision_list = []
    avg_recall_list = []
    
    for j in range(0,100):
        print('Cycle', j)
        nv_df = pd.DataFrame({'visitorID': [], 'contentID': []})
        i = 0
        randomlist = random.sample(range(0, 1399), n)
        new_article = val_new_articles_df.iloc[randomlist]['contentID'].tolist()

        for visitor in val_visitors:
            i += 1
            one_new_visitor_df = pd.DataFrame({'visitorID': visitor, 'contentID': new_article})
            nv_df = nv_df.append(one_new_visitor_df)
    
        precision_recall_list = precision_recall(n, nv_df)
        avg_precision_recall_list = avg_precision_recall(n, nv_df)
        precision_list.append(precision_recall_list[0])
        recall_list.append(precision_recall_list[1])
        avg_precision_list.append(avg_precision_recall_list[0])
        avg_recall_list.append(avg_precision_recall_list[1])
    
    return [precision_list, recall_list, avg_precision_list, avg_recall_list]

In [678]:
def CI(result_list, n): 
    precision_list = result_list[0]
    recall_list = result_list[1]
    avg_precision_list = result_list[2]
    avg_recall_list = result_list[3]
    print(f"{n} random recommendation precision mean (100 cycles):", round(sum(precision_list), 2), '%' )
    CI1 = st.t.interval(alpha=0.95, df=len(precision_list)-1, loc=np.mean(precision_list), scale=st.sem(precision_list))
    print("CI: ", f"[{round(max(CI1[0],0)*100,2)}%,{round(CI1[1]*100, 2)}%]" )
    print(f"{n} random recommendation recall mean (100 cycles):", round(sum(recall_list), 2), '%' )
    CI2 = st.t.interval(alpha=0.95, df=len(recall_list)-1, loc=np.mean(recall_list), scale=st.sem(recall_list)) 
    print("CI: ", f"[{round(max(CI2[0],0)*100,2)}%,{round(CI2[1]*100, 2)}%]" )
    print(f"{n} random recommendation average precision for each visitor (100 cycles):", round(sum(avg_precision_list), 2), '%' )
    CI3 = st.t.interval(alpha=0.95, df=len(avg_precision_list)-1, loc=np.mean(avg_precision_list), scale=st.sem(avg_precision_list))
    print("CI: ", f"[{round(max(CI3[0],0)*100,2)}%,{round(CI3[1]*100, 2)}%]" )
    print(f"{n} random recommendation average recall for each visitor (100 cycles):", round(sum(avg_recall_list), 2), '%' )
    CI4 = st.t.interval(alpha=0.95, df=len(avg_recall_list)-1, loc=np.mean(avg_recall_list), scale=st.sem(avg_recall_list))
    print("CI: ", f"[{round(max(CI4[0],0)*100,2)}%,{round(CI4[1]*100, 2)}%]" )
    
    return [CI1, CI2, CI3, CI4]

In [591]:
random_results5 = random_recom_result(5)

Cycle 0
Number of true positive cases: 18
Top-5 recommendation precision: 0%
Top-5 recommendation recall: 0%
Top-5 recommendation average precision: 0.19%
Top-5 recommendation average recall: 0.02%
Cycle 1
Number of true positive cases: 41
Top-5 recommendation precision: 0%
Top-5 recommendation recall: 0%
Top-5 recommendation average precision: 0.44%
Top-5 recommendation average recall: 0.13%
Cycle 2
Number of true positive cases: 67
Top-5 recommendation precision: 1%
Top-5 recommendation recall: 0%
Top-5 recommendation average precision: 0.72%
Top-5 recommendation average recall: 0.22%
Cycle 3
Number of true positive cases: 20
Top-5 recommendation precision: 0%
Top-5 recommendation recall: 0%
Top-5 recommendation average precision: 0.21%
Top-5 recommendation average recall: 0.04%
Cycle 4
Number of true positive cases: 70
Top-5 recommendation precision: 1%
Top-5 recommendation recall: 0%
Top-5 recommendation average precision: 0.75%
Top-5 recommendation average recall: 0.27%
Cycle 5
Nu

In [616]:
CI5 = CI(random_results5, 5)

5 random recommendation precision mean (100 cycles): 0.83 %
CI:  [0.68%,0.97%]
5 random recommendation recall mean (100 cycles): 0.31 %
CI:  [0.25%,0.36%]
5 random recommendation average precision for each visitor (100 cycles): 0.83 %
CI:  [0.68%,0.97%]
5 random recommendation average recall for each visitor (100 cycles): 0.31 %
CI:  [0.24%,0.38%]


In [612]:
random_results10 = random_recom_result(10)

Cycle 0
Number of true positive cases: 68
Top-10 recommendation precision: 0%
Top-10 recommendation recall: 0%
Top-10 recommendation average precision: 0.36%
Top-10 recommendation average recall: 0.18%
Cycle 1
Number of true positive cases: 259
Top-10 recommendation precision: 1%
Top-10 recommendation recall: 1%
Top-10 recommendation average precision: 1.39%
Top-10 recommendation average recall: 0.92%
Cycle 2
Number of true positive cases: 60
Top-10 recommendation precision: 0%
Top-10 recommendation recall: 0%
Top-10 recommendation average precision: 0.32%
Top-10 recommendation average recall: 0.13%
Cycle 3
Number of true positive cases: 145
Top-10 recommendation precision: 1%
Top-10 recommendation recall: 1%
Top-10 recommendation average precision: 0.78%
Top-10 recommendation average recall: 0.53%
Cycle 4
Number of true positive cases: 140
Top-10 recommendation precision: 1%
Top-10 recommendation recall: 1%
Top-10 recommendation average precision: 0.75%
Top-10 recommendation average r

In [641]:
CI10 = CI(random_results10, 10)

10 random recommendation precision mean (100 cycles): 0.8 %
CI:  [0.7%,0.91%]
10 random recommendation recall mean (100 cycles): 0.59 %
CI:  [0.52%,0.67%]
10 random recommendation average precision for each visitor (100 cycles): 0.8 %
CI:  [0.7%,0.91%]
10 random recommendation average recall for each visitor (100 cycles): 0.58 %
CI:  [0.48%,0.68%]


In [619]:
random_results15 = random_recom_result(15)

Cycle 0
Number of true positive cases: 280
Top-15 recommendation precision: 1%
Top-15 recommendation recall: 1%
Top-15 recommendation average precision: 1.0%
Top-15 recommendation average recall: 1.32%
Cycle 1
Number of true positive cases: 358
Top-15 recommendation precision: 1%
Top-15 recommendation recall: 1%
Top-15 recommendation average precision: 1.28%
Top-15 recommendation average recall: 1.39%
Cycle 2
Number of true positive cases: 564
Top-15 recommendation precision: 2%
Top-15 recommendation recall: 2%
Top-15 recommendation average precision: 2.01%
Top-15 recommendation average recall: 2.83%
Cycle 3
Number of true positive cases: 255
Top-15 recommendation precision: 1%
Top-15 recommendation recall: 1%
Top-15 recommendation average precision: 0.91%
Top-15 recommendation average recall: 0.94%
Cycle 4
Number of true positive cases: 239
Top-15 recommendation precision: 1%
Top-15 recommendation recall: 1%
Top-15 recommendation average precision: 0.85%
Top-15 recommendation average 

In [640]:
CI15 = CI(random_results10, 15)

15 random recommendation precision mean (100 cycles): 0.8 %
CI:  [0.7%,0.91%]
15 random recommendation recall mean (100 cycles): 0.59 %
CI:  [0.52%,0.67%]
15 random recommendation average precision for each visitor (100 cycles): 0.8 %
CI:  [0.7%,0.91%]
15 random recommendation average recall for each visitor (100 cycles): 0.58 %
CI:  [0.48%,0.68%]


In [621]:
random_results20 = random_recom_result(20)

Cycle 0
Number of true positive cases: 440
Top-20 recommendation precision: 1%
Top-20 recommendation recall: 2%
Top-20 recommendation average precision: 1.18%
Top-20 recommendation average recall: 1.92%
Cycle 1
Number of true positive cases: 320
Top-20 recommendation precision: 1%
Top-20 recommendation recall: 1%
Top-20 recommendation average precision: 0.86%
Top-20 recommendation average recall: 1.58%
Cycle 2
Number of true positive cases: 421
Top-20 recommendation precision: 1%
Top-20 recommendation recall: 2%
Top-20 recommendation average precision: 1.13%
Top-20 recommendation average recall: 1.9%
Cycle 3
Number of true positive cases: 365
Top-20 recommendation precision: 1%
Top-20 recommendation recall: 1%
Top-20 recommendation average precision: 0.98%
Top-20 recommendation average recall: 1.42%
Cycle 4
Number of true positive cases: 417
Top-20 recommendation precision: 1%
Top-20 recommendation recall: 2%
Top-20 recommendation average precision: 1.12%
Top-20 recommendation average 

In [622]:
CI20 = CI(random_results20, 20)

20 random recommendation precision mean (100 cycles): 0.81 %
CI:  [0.74%,0.88%]
20 random recommendation recall mean (100 cycles): 1.2 %
CI:  [1.1%,1.3%]
20 random recommendation average precision for each visitor (100 cycles): 0.81 %
CI:  [0.74%,0.88%]
20 random recommendation average recall for each visitor (100 cycles): 1.18 %
CI:  [1.05%,1.31%]


In [645]:
random_results30 = random_recom_result(30)

Cycle 0
Number of true positive cases: 318
Top-30 recommendation precision: 1%
Top-30 recommendation recall: 1%
Top-30 recommendation average precision: 0.57%
Top-30 recommendation average recall: 1.19%
Cycle 1
Number of true positive cases: 526
Top-30 recommendation precision: 1%
Top-30 recommendation recall: 2%
Top-30 recommendation average precision: 0.94%
Top-30 recommendation average recall: 2.11%
Cycle 2
Number of true positive cases: 492
Top-30 recommendation precision: 1%
Top-30 recommendation recall: 2%
Top-30 recommendation average precision: 0.88%
Top-30 recommendation average recall: 2.06%
Cycle 3
Number of true positive cases: 333
Top-30 recommendation precision: 1%
Top-30 recommendation recall: 1%
Top-30 recommendation average precision: 0.59%
Top-30 recommendation average recall: 1.04%
Cycle 4
Number of true positive cases: 415
Top-30 recommendation precision: 1%
Top-30 recommendation recall: 2%
Top-30 recommendation average precision: 0.74%
Top-30 recommendation average

In [646]:
CI30 = CI(random_results30, 30)

30 random recommendation precision mean (100 cycles): 0.91 %
CI:  [0.85%,0.97%]
30 random recommendation recall mean (100 cycles): 2.02 %
CI:  [1.88%,2.15%]
30 random recommendation average precision for each visitor (100 cycles): 0.91 %
CI:  [0.85%,0.97%]
30 random recommendation average recall for each visitor (100 cycles): 1.98 %
CI:  [1.81%,2.16%]


In [647]:
random_results35 = random_recom_result(35)

Cycle 0
Number of true positive cases: 1049
Top-35 recommendation precision: 2%
Top-35 recommendation recall: 4%
Top-35 recommendation average precision: 1.6%
Top-35 recommendation average recall: 5.03%
Cycle 1
Number of true positive cases: 692
Top-35 recommendation precision: 1%
Top-35 recommendation recall: 3%
Top-35 recommendation average precision: 1.06%
Top-35 recommendation average recall: 3.22%
Cycle 2
Number of true positive cases: 947
Top-35 recommendation precision: 1%
Top-35 recommendation recall: 4%
Top-35 recommendation average precision: 1.45%
Top-35 recommendation average recall: 4.11%
Cycle 3
Number of true positive cases: 497
Top-35 recommendation precision: 1%
Top-35 recommendation recall: 2%
Top-35 recommendation average precision: 0.76%
Top-35 recommendation average recall: 1.98%
Cycle 4
Number of true positive cases: 630
Top-35 recommendation precision: 1%
Top-35 recommendation recall: 3%
Top-35 recommendation average precision: 0.96%
Top-35 recommendation average

In [648]:
CI35 = CI(random_results35, 35)

35 random recommendation precision mean (100 cycles): 0.93 %
CI:  [0.87%,0.99%]
35 random recommendation recall mean (100 cycles): 2.42 %
CI:  [2.26%,2.58%]
35 random recommendation average precision for each visitor (100 cycles): 0.93 %
CI:  [0.87%,0.99%]
35 random recommendation average recall for each visitor (100 cycles): 2.45 %
CI:  [2.24%,2.67%]


In [623]:
random_results40 = random_recom_result(40)

Cycle 0
Number of true positive cases: 785
Top-40 recommendation precision: 1%
Top-40 recommendation recall: 3%
Top-40 recommendation average precision: 1.05%
Top-40 recommendation average recall: 2.99%
Cycle 1
Number of true positive cases: 252
Top-40 recommendation precision: 0%
Top-40 recommendation recall: 1%
Top-40 recommendation average precision: 0.34%
Top-40 recommendation average recall: 0.78%
Cycle 2
Number of true positive cases: 870
Top-40 recommendation precision: 1%
Top-40 recommendation recall: 3%
Top-40 recommendation average precision: 1.16%
Top-40 recommendation average recall: 3.5%
Cycle 3
Number of true positive cases: 653
Top-40 recommendation precision: 1%
Top-40 recommendation recall: 3%
Top-40 recommendation average precision: 0.87%
Top-40 recommendation average recall: 2.55%
Cycle 4
Number of true positive cases: 949
Top-40 recommendation precision: 1%
Top-40 recommendation recall: 4%
Top-40 recommendation average precision: 1.27%
Top-40 recommendation average 

In [624]:
CI40 = CI(random_results40, 40)

40 random recommendation precision mean (100 cycles): 0.83 %
CI:  [0.78%,0.88%]
40 random recommendation recall mean (100 cycles): 2.46 %
CI:  [2.31%,2.6%]
40 random recommendation average precision for each visitor (100 cycles): 0.83 %
CI:  [0.78%,0.88%]
40 random recommendation average recall for each visitor (100 cycles): 2.44 %
CI:  [2.25%,2.63%]


Both precision and recall are very low no matter how many new articles are randomly recommended. <br>
Randomly recommending 35 articles (1.34%) seems to be slightly better than recommending 20/30/40 articles, according to the f-1 score.

#### a. Access the category information from new articles in validation set

In [644]:
valid_visits.groupby('visitorID').count().describe()

Unnamed: 0,contentID
count,1868.0
mean,13.486617
std,23.609784
min,1.0
25%,2.0
50%,7.0
75%,15.0
max,356.0


The upper limit of the number of unique visitors for each article can be calculated by Q3+1.5IQR = 34.5, which give us a reference of how many articles should be recommended to each visitor.

In validation set, I will use the contentID to decide if an article is a new article or not

In [679]:
# info about new articles in validation set
val_new_articles_df = val_articles_df[(val_articles_df['releaseDateTime'] > end5) & (val_articles_df['releaseDateTime'] <= end6)][['contentID','headline', 'categories', 'releaseDateTime']].reset_index(drop = True)
val_new_articles_df.head() # 1399 new articles

Unnamed: 0,contentID,headline,categories,releaseDateTime
0,www.arkansasonline.com/news/2018/sep/04/homele...,homelessness battle fight samaritan open door,News Arkansas,2018-09-04
1,www.arkansasonline.com/news/2018/sep/04/south-...,south lr site shop soccer table,News Arkansas,2018-09-04
2,www.arkansasonline.com/news/2018/sep/04/herita...,heritage teacher recognize physic instruction,News Arkansas,2018-09-04
3,www.arkansasonline.com/news/2018/sep/04/traffi...,traffic stop net steal police gun,News Arkansas News Arkansas Crime,2018-09-04
4,www.arkansasonline.com/news/2018/sep/04/letter...,letter,Editorial Editorial Letters,2018-09-04


In [680]:
# compare two lists
def notintersection(lst1, lst2): 
    lst3 = [value for value in lst2 if value not in lst1] 
    return lst3 

def intersection(lst1, lst2): 
    lst3 = [value for value in lst2 if value in lst1] 
    return lst3 

In [681]:
# validation visits about new visitors
val_new_visitors = notintersection(train_visitors, val_visitors)
val_new_visitors_df = valid_visits[valid_visits['visitorID'].isin(val_new_visitors)].reset_index(drop= True)
val_new_visitors_df.head() # 1488

Unnamed: 0,contentID,visitorID
0,www.arkansasonline.com/news/2018/sep/11/pedest...,1.284325e+16
1,www.arkansasonline.com/news/2018/sep/10/health...,1.284325e+16
2,www.arkansasonline.com/news/2018/sep/10/wwi-ca...,1.284325e+16
3,www.arkansasonline.com/news/2018/sep/13/2-cent...,1.284325e+16
4,www.arkansasonline.com/news/2018/sep/12/jonesb...,1.284325e+16


In [682]:
# validation information about former visitors
val_former_visitors = intersection(train_visitors, val_visitors)
val_former_visitors_df = valid_visits[valid_visits['visitorID'].isin(val_former_visitors)].reset_index(drop= True)
val_former_visitors_df.head()  # 23705

Unnamed: 0,contentID,visitorID
0,www.arkansasonline.com/news/2018/sep/12/bigger...,201104700000000.0
1,www.arkansasonline.com/news/2018/sep/12/morris...,201104700000000.0
2,www.arkansasonline.com/news/2018/sep/12/video-...,201104700000000.0
3,www.arkansasonline.com/news/2018/sep/06/jury-t...,4259291000000000.0
4,www.arkansasonline.com/news/2018/sep/07/confro...,4259291000000000.0


#### b. Get two feature TF-IDF matrixes for previous articles and new articels respectively

#### category features

In [683]:
# convert into TF-IDF featuer matrixes
tfidf = TfidfVectorizer() # min_df = 2, max_df = 0.4 norm = 'l2'
train_tfidf = tfidf.fit_transform(train_articles_df['categories'])
val_tfidf = tfidf.transform(val_new_articles_df['categories'])

In [684]:
# define a function to create the article-feature dataframe for previous articles and new articles
def feature_df(tfidf, articleIDlist, train_features):
    '''Convert the input TF-IDF matrix into TF-IDF data frame whose index is article ID and column names are feature names '''
    df = pd.DataFrame(tfidf.toarray())
    df['contentID'] = articleIDlist
    df = df.set_index('contentID')
    df.columns = train_features
    return df

In [685]:
previous_articleID = train_articles_df['contentID'].tolist()
val_new_articleID = val_new_articles_df['contentID'].tolist()
train_features = tfidf.get_feature_names()

In [686]:
train_tfidf_df = feature_df(train_tfidf, previous_articleID, train_features)
val_tfidf_df = feature_df(val_tfidf, val_new_articleID, train_features)
val_tfidf_df.head() # 1399*234

Unnamed: 0_level_0,2010election,activestyle,activities,acxiom,adgbreaking,adghighschool,adgpolitics,adgsports,ambushschoolyard,americanidol,...,whats,windstream,world,worldbusiness,zballot18,zcongress18,zeditorial2018,zgovernor18,zstatehouse18,zsupremecourt
contentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
www.arkansasonline.com/news/2018/sep/04/homelessness-battle-fought-by-samaritan/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/south-lr-site-for-shopping-soccer-on-ta/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/heritage-teacher-recognized-for-physics/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/traffic-stop-nets-stolen-police-gun-201/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/letters-20180904/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### c. Compute the cosine similarity bewtween previous articles and new articles

In [687]:
# create similarity matrix between all the articles and test articles
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
# the similarity dataframe between previous articles and new articles
# the index names stand for new articles and column names stand for previous articles
similarity_df = pd.DataFrame(1- pairwise_distances(np.array(val_tfidf_df), np.array(train_tfidf_df), metric = 'cosine'))
similarity_df.index = val_tfidf_df.index
similarity_df.columns = train_tfidf_df.index

similarity_df # 1399*8529

contentID,www.arkansasonline.com/news/2018/jun/30/arrested-lawmaker-urged-to-quit-post-20/,www.arkansasonline.com/news/2018/jun/30/suspect-arrested-fatal-shooting-pulaski-county/,www.arkansasonline.com/news/2018/jun/30/police-officer-fatally-shoots-self-central-arkansa/,www.arkansasonline.com/news/2018/jun/30/motorcyclist-killed-head-crash-little-rock-suv-dri/,www.arkansasonline.com/news/2018/jun/30/early-offer-gets-highly-recruited-receiver-campus-/,www.arkansasonline.com/news/2018/may/03/1-killed-1-injured-killed-when-car-runs-curve-sout/,www.arkansasonline.com/news/2018/apr/24/facebook-profile-lands-arkansas-sex-offender-jail-/,www.arkansasonline.com/news/2018/jun/30/3-motorcyclists-among-highway-deaths-20/,www.arkansasonline.com/news/2018/jun/30/cupcakes-idea-takes-boot-camp-top-prize/,www.arkansasonline.com/news/2018/jun/30/steel-mill-to-expand-add-500-new-worker/,...,www.arkansasonline.com/news/2018/sep/03/asu-coach-confirms-junior-wr-will-miss-rest-season/,www.arkansasonline.com/news/2018/sep/03/tropical-storm-gordon-brings-hurricane-watch-gulf-/,www.arkansasonline.com/news/2018/sep/03/man-gets-25-year-sentence-fatal-shooting-arkansas-/,www.arkansasonline.com/news/2018/sep/03/arkansas-man-accused-fatally-shooting-his-uncle-gr/,www.arkansasonline.com/news/2018/sep/02/couple-named-arkansas-foster-parents-year/,www.arkansasonline.com/news/2018/sep/03/ebola-survivors-face-stigma-in-congo-20/,www.arkansasonline.com/news/2018/sep/03/s-korea-security-officials-to-visit-nor/,www.arkansasonline.com/news/2016/aug/29/rock-skipping-champ-coming-to-local-con/,www.arkansasonline.com/news/2017/aug/28/run-or-skip-but-hop-to-it-to-get-regist/,www.arkansasonline.com/news/2018/sep/03/injuries-force-merrick-football-retirement/
contentID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
www.arkansasonline.com/news/2018/sep/04/homelessness-battle-fought-by-samaritan/,0.792723,0.625040,1.000000,0.848214,0.0,0.464991,0.848214,0.567961,1.000000,0.441182,...,0.0,0.145726,0.738367,0.848214,0.000000,0.227850,0.227850,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/south-lr-site-for-shopping-soccer-on-ta/,0.792723,0.625040,1.000000,0.848214,0.0,0.464991,0.848214,0.567961,1.000000,0.441182,...,0.0,0.145726,0.738367,0.848214,0.000000,0.227850,0.227850,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/heritage-teacher-recognized-for-physics/,0.792723,0.625040,1.000000,0.848214,0.0,0.464991,0.848214,0.567961,1.000000,0.441182,...,0.0,0.145726,0.738367,0.848214,0.000000,0.227850,0.227850,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/traffic-stop-nets-stolen-police-gun-201/,0.672399,0.943612,0.848214,1.000000,0.0,0.394412,1.000000,0.481753,0.848214,0.374217,...,0.0,0.123606,0.789095,1.000000,0.000000,0.193265,0.193265,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/04/letters-20180904/,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
www.arkansasonline.com/news/2018/sep/17/comedian-john-mulaney-perform-central-arkansas/,0.427342,0.336947,0.539081,0.457256,0.0,0.250668,0.457256,0.306177,0.539081,0.237833,...,0.0,0.078558,0.398040,0.457256,0.000000,0.122829,0.122829,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/17/for-ex-captive-today-in-charleston-case/,0.245309,0.193419,0.309451,0.262480,0.0,0.198676,0.262480,0.242672,0.309451,0.087995,...,0.0,0.470917,0.228488,0.262480,0.000000,0.157207,0.157207,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/16/goodbye-my-dear-friend-fred/,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.859849,0.000000,0.000000,0.0,0.0,0.0
www.arkansasonline.com/news/2018/sep/16/uca-professorartist-displays-work-downtown-gallery/,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.827920,0.000000,0.000000,0.0,0.0,0.0


#### d. Modelling

In [688]:
# find top recommendation for former visitors
def fv(n, similarity_df):
    '''Takes an integer n and gives the top-n article recommendation from new articles for former visitors'''
    blank_df = pd.DataFrame({'visitorID':[], 'read_article':[],'contentID': [], 'similarity':[]})
    fv_df = blank_df
    i = 0
    # if the visitor is not a new visitor
    for visitor in val_former_visitors:
        # filtered out the artilces visitors have already read
        read_articles = list(train_visits.loc[train_visits['visitorID'] == visitor]['contentID'])
        i += 1
        print(i)
        # calculate the similarity between unread articles and read articles
        new_articles = blank_df
        for article in read_articles:
            # select similar articles for one article that former visitor had read: 
            one_article = pd.DataFrame({'visitorID': visitor, 'read_article': article,'contentID': similarity_df.index, 'similarity': similarity_df[article]})
            one_article.index = similarity_df.index
            one_article = one_article[one_article['similarity'] != 0]
            # combine all the similar articles for former visitor: 
            new_articles = new_articles.append(one_article) 

        # similar articles results for one former visitor
        new_articles = new_articles.drop_duplicates(subset = ['read_article', 'contentID']).sort_values('similarity', ascending=False)
    #     print(f"find {len(new_articles)} new articles combinations")
        new_articles = pd.merge(new_articles, train_article_counts_df, on = 'read_article', how = 'left')
        new_articles = new_articles.sort_values(['similarity', 'visitor'], ascending = False).head(n)
        new_articles = new_articles[['visitorID', 'contentID']]
    #     print(new_articles_5)

        # combine the similarity results for all visitors
        fv_df = fv_df.append(new_articles)
        fv_df = fv_df.reset_index(drop = True)
        
    return fv_df

In [689]:
# find top recommendation for new visitors
def nv(n, fv_df):
    new_article_top = fv_df.groupby('contentID').count().sort_values('visitorID', ascending = False)
    new_article_top = new_article_top.head(n).index.tolist()
    nv_df = pd.DataFrame({'visitorID':[], 'contentID':[]})
    i = 0
    for visitor in val_new_visitors:
        print(visitor)
        i += 1
        print(i)
        one_new_visitor_df = pd.DataFrame({'visitorID': visitor, 'contentID': new_article_top})
        nv_df = nv_df.append(one_new_visitor_df)
    
    return nv_df

In [690]:
# compare recommendation with actural condition
def precision_recall(n, fv_nv_df):
    prediction_count = len(fv_nv_df)
    actual_count = len(valid_visits)
    TP = pd.merge(valid_visits, fv_nv_df, on=['visitorID','contentID'], how = 'inner')
    TP_count = len(TP)
    precision = TP_count/prediction_count
    recall = TP_count/actual_count
    print(f"Number of true positive cases: {TP_count}")
    print(f"Top-{n} recommendation precision: {round(precision*100)}%")
    print(f"Top-{n} recommendation recall: {round(recall*100)}%")
    return [precision, recall]
    

In [691]:
# compare recommendation with actural condition
def avg_precision_recall(n, fv_nv_df):
    recommendation = fv_nv_df[['visitorID', 'contentID']]
    recommendation['predicted'] = recommendation['contentID']
    actual = valid_visits
    actual['actual'] = actual['contentID']

    results = pd.merge(actual, recommendation, on = ['visitorID','contentID'], how = 'left')
    results['match'] = np.where(results['predicted'] == results['actual'], 1, 0)

    val_visitor_counts = results.groupby('visitorID').count()['contentID']
    val_visitor_counts.columns = ['counts']

    val_TP = results.groupby('visitorID').sum()
    val_TP.columns = ['TP']

    results_calculation = pd.merge(val_visitor_counts, val_TP, left_index=True, right_index=True)
    results_calculation['precision'] = results_calculation['TP']/n
    results_calculation['recall'] = results_calculation['TP']/results_calculation['contentID']

    avg_precision = results_calculation['precision'].mean()
    avg_recall = results_calculation['recall'].mean()

    print(f"Top-{n} recommendation average precision: {round(avg_precision*100,2)}%")
    print(f"Top-{n} recommendation average recall: {round(avg_recall*100,2)}%")
    return [avg_precision, avg_recall]
    

### 2. Model optimization


In [692]:
# convert into TF-IDF featuer matrixes
def tfidf(tfidf, ID):
    train_tfidf = tfidf.fit_transform(train_articles_df[ID])
    val_tfidf = tfidf.transform(val_new_articles_df[ID])
    
    previous_articleID = train_articles_df['contentID'].tolist()
    val_new_articleID = val_new_articles_df['contentID'].tolist()

    train_features = tfidf.get_feature_names()

    train_tfidf_df = feature_df(train_tfidf, previous_articleID, train_features) 
    val_tfidf_df = feature_df(val_tfidf, val_new_articleID, train_features)
    val_tfidf_df.head() # 1399*234

    similarity_df = pd.DataFrame(1- pairwise_distances(np.array(val_tfidf_df), np.array(train_tfidf_df), metric = 'cosine'))
    similarity_df.index = val_tfidf_df.index
    similarity_df.columns = train_tfidf_df.index

    return similarity_df

### Category


In [696]:
tfidf1 = TfidfVectorizer() # min_df = 2, max_df = 0.4 norm = 'l2'
similarity_df1 = tfidf(tfidf1, 'categories')

#### Optimization on the number of articles recommendation

In [697]:
fv_df_51 = fv(5, similarity_df1)
nv_df_51 = nv(5, fv_df_51)
fv_nv_df_51 = fv_df_51.append(nv_df_51).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [698]:
precision_recall(5, fv_nv_df_51)
avg_precision_recall(5, fv_nv_df_51)

Number of true positive cases: 247
Top-5 recommendation precision: 3%
Top-5 recommendation recall: 1%
Top-5 recommendation average precision: 2.64%
Top-5 recommendation average recall: 1.35%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.026445396145610315, 0.013489572106964231]

n = 30

In [699]:
fv_df_301 = fv(30, similarity_df1)
nv_df_301 = nv(30, fv_df_301)
fv_nv_df_301 = fv_df_301.append(nv_df_301).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [700]:
precision_recall(30, fv_nv_df_301)
avg_precision_recall(30, fv_nv_df_301)

Number of true positive cases: 1130
Top-30 recommendation precision: 2%
Top-30 recommendation recall: 4%
Top-30 recommendation average precision: 2.02%
Top-30 recommendation average recall: 4.91%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.020164168451106457, 0.049066651069215664]

n = 35

In [702]:
fv_df_351 = fv(35, similarity_df1)
nv_df_351 = nv(35, fv_df_351)
fv_nv_df_351 = fv_df_351.append(nv_df_351).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [703]:
precision_recall(35, fv_nv_df_351)
avg_precision_recall(35, fv_nv_df_351)

Number of true positive cases: 1299
Top-35 recommendation precision: 2%
Top-35 recommendation recall: 5%
Top-35 recommendation average precision: 1.99%
Top-35 recommendation average recall: 5.85%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.01986846130315086, 0.05851095410454944]

n = 40

In [704]:
fv_df_401 = fv(40, similarity_df1)
nv_df_401 = nv(40, fv_df_401)
fv_nv_df_401 = fv_df_401.append(nv_df_401).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [705]:
precision_recall(40, fv_nv_df_401)
avg_precision_recall(40, fv_nv_df_401)

Number of true positive cases: 1424
Top-40 recommendation precision: 2%
Top-40 recommendation recall: 6%
Top-40 recommendation average precision: 1.91%
Top-40 recommendation average recall: 6.41%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.019057815845824347, 0.0640928890400027]

Recommending 35 articles has slightly higher f-1 score than recommending 40 articles.

#### Optimization on category features

min_df = 2

In [707]:
tfidf2 = TfidfVectorizer(min_df = 2) # min_df = 2, max_df = 0.4 norm = 'l2'
similarity_df2 = tfidf(tfidf2, 'categories')

In [708]:
fv_df_352 = fv(35, similarity_df2)
nv_df_352 = nv(35, fv_df_352)
fv_nv_df_352 = fv_df_351.append(nv_df_352).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [709]:
precision_recall(35, fv_nv_df_352)
avg_precision_recall(35, fv_nv_df_352)

Number of true positive cases: 1312
Top-35 recommendation precision: 2%
Top-35 recommendation recall: 5%
Top-35 recommendation average precision: 2.01%
Top-35 recommendation average recall: 5.93%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.020067298868155455, 0.05927369733555289]

min_df = 4

In [711]:
tfidf4 = TfidfVectorizer(min_df = 4)
similarity_df4 = tfidf(tfidf4, 'categories')

In [712]:
fv_df_354 = fv(35, similarity_df4)
nv_df_354 = nv(35, fv_df_354)
fv_nv_df_354 = fv_df_354.append(nv_df_354).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [713]:
precision_recall(35, fv_nv_df_354)
avg_precision_recall(35, fv_nv_df_354)

Number of true positive cases: 1202
Top-35 recommendation precision: 2%
Top-35 recommendation recall: 5%
Top-35 recommendation average precision: 1.84%
Top-35 recommendation average recall: 5.33%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.01838482716427042, 0.0533129173498512]

min_df = 3

In [714]:
tfidf3 = TfidfVectorizer(min_df = 3) # min_df = 4, max_df = 0.4 norm = 'l2'
similarity_df3 = tfidf(tfidf3)

In [716]:
fv_df_353 = fv(35, similarity_df3)
nv_df_353 = nv(35, fv_df_353)
fv_nv_df_353 = fv_df_354.append(nv_df_353).reset_index(drop = True)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277


In [717]:
precision_recall(35, fv_nv_df_353)
avg_precision_recall(35, fv_nv_df_353)

Number of true positive cases: 1192
Top-35 recommendation precision: 2%
Top-35 recommendation recall: 5%
Top-35 recommendation average precision: 1.82%
Top-35 recommendation average recall: 5.2%


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  recommendation['predicted'] = recommendation['contentID']


[0.018231875191189958, 0.051964721635227196]

The recommendation system performs the best when min_df = 2

### Headline features optimization

There are over 8000 features in the headlines from articles in training set. As a result, we need to select features