# System Project
## Nearest Neighbor with Semantic Textual Similarity Recommender System
### Brandi Hamming
********************


## Introduction

For my system project, I created a web application for viewing news articles. Users log in to browse a catalog of articles and, while browsing, receive recommendations based on their click history. The recommendations are generated by a nearest neighbor based algorithm with semantic textual similarity (STS), which came out of my research. I created three different algorithms in that research, and the STS algorithm was selected because it had the best overall performance.

In [1]:
# imports for code demos
import pandas as pd
import news_rec as nr



## Decomposition

When you have a large repository of news articles, a user can get bored or overwhelmed by all of the articles on the system. Providing recommendations reduces the work of searching for articles on the user's end and gives a tailored experience, which increases customer satisfaction. While providing recommendations seems like a straightforward way to keep users returning, it is not a simple task. Building a content based recommender system comes with many obstacles.

For one, the algorithm usually requires large amounts of data about the articles. This data can be article title, category, sub-category, abstract, entities in the articles, etc. While this data is often needed to build the system, sometimes all of that information is unavailable due to time, access, costs, etc. Even if the data is available, since most of the data needed is usually text based, transforming this data into numeric data can be a time consuming task. 

Another issue is that the algorithms for recommender systems can become very complicated. While performance is important for most machine learning models, trust is also critical. If a user wants to know why they were recommended something, an algorithm based on complex mathematics will not be able to provide straightforward explanations. This could cause the user to mistrust and eventually stop using the service. 

To overcome the current challenges content based recommender systems face, I will be using a nearest neighbor based algorithm with semantic textual similarity to produce recommendations for users. This model can perform well even with only a limited amount of input. The model is also very straightforward, so it can easily provide an explanation to a user upon request. The recommendations produced by the algorithm are intended to keep users happy and returning to the site.


## Domain Expertise

The targeted user for this system is anyone who enjoys reading news articles. Note that "news articles" does not mean only articles in the news category (such as current events or politics). All types of news articles will be available to read on the system, including sports, health and entertainment articles. In my research, I discovered that my recommender system had better precision when the article repository was diverse rather than narrowed down to only one type of news.

The stakeholder’s operational requirements are that the system must be easy to use and the recommendations must be relevant. A user must be able to easily navigate the site and have access to readily available recommendations. The system must be personalized, which means all users need a separate account. This will ensure all recommendations are based on one user's history rather than all users of the system, which will result in relevant recommendations to a user. The system should not recommend articles already read by a user, which will avoid a user getting bored with its recommendations. 

While this system is targeted for reading users, we could make the site more welcoming to people that prefer video information. This could be done by adding more content that has mixed media (article text and video content). Then articles containing videos could have a video/tv icon next to the article title indicating to the user that a video is present. 

Another human blocker could be users that mistrust AI. A way to get them to use the system is by providing explanations for their recommendations such as mentioning that the historically read articles have similar sounding titles to the recommended titles. 

The system could incorporate feedback by allowing users to delete some of their history. This will improve the performance of the recommendations because the model will no longer base recommendations on clickbait articles, articles accidentally clicked on and/or topics that are no longer of interest to a user.

## Data
For the system, I am using the MIND data set [1]. This data set contains news and user information. Each user has a unique id, a list of historical articles read and a list of impressions of shown articles. The impressions are a list of random articles shown to a user at a particular point in time, each with an indicator representing whether the user clicked on the article. For the news information, each article has a news id along with its title, category, sub-category, summary, entities in the article and url. Since this news data is relatively old (2019), we should eventually build a news and user data set from scratch. This could be done in a few steps.
 

For news data, the data would be obtained from a repository of news articles. This could consist of articles written directly by the company's writers. If all of the articles are from the company's writers, then we can require that every article have a title, category and sub-category before it is added to our repository. For articles not written by the company's own authors, we would require someone to manually categorize each article. For each article, we would create a unique id called the news id.

We would collect user information by requiring all users to log in to get recommended articles. After the initial login, we will not recommend any articles to a user until some history (one view of an article) is obtained. This history will be stored in a separate table per user.

All of this data will be stored in a relational database since those are easy to use and the user history will be transactional. Then we can easily join the user table to the news table on the news id. 
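The join on the news id can be sketched as follows. This is a minimal illustration rather than the project's schema: an in-memory SQLite database stands in for the production database, the table and column names (`news`, `user_history`, `read_date`) are hypothetical, the rows are made up, and a single history table keyed by `user_id` is assumed for simplicity.

```python
import sqlite3

# In-memory SQLite stands in for the production database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE news (
        news_id      TEXT PRIMARY KEY,
        title        TEXT NOT NULL,
        category     TEXT NOT NULL,
        sub_category TEXT NOT NULL
    );
    CREATE TABLE user_history (
        user_id   TEXT NOT NULL,
        news_id   TEXT NOT NULL REFERENCES news(news_id),
        read_date TEXT NOT NULL
    );
""")

# Made-up rows for illustration only.
conn.executemany("INSERT INTO news VALUES (?, ?, ?, ?)", [
    ("N1", "Example sports title", "sports", "football"),
    ("N2", "Example health title", "health", "nutrition"),
])
conn.executemany("INSERT INTO user_history VALUES (?, ?, ?)", [
    ("U1", "N2", "2019-11-14"),
])


def read_articles(conn, user_id):
    """Return (news_id, title, category) for every article the user has read."""
    query = """
        SELECT h.news_id, n.title, n.category
        FROM user_history AS h
        JOIN news AS n ON n.news_id = h.news_id
        WHERE h.user_id = ?
        ORDER BY h.read_date
    """
    return conn.execute(query, (user_id,)).fetchall()


rows = read_articles(conn, "U1")
```

A user with no history simply gets an empty list back, which matches the rule that no recommendations are shown until at least one article has been viewed.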

Since I was unable to build the data from scratch, I had to preprocess the MIND data. First, one-hot encoding was performed on the category and sub-category fields. Since user history is stored as a list/array for each user, this history list was split into multiple rows so that each read article had its own row. The same pre-processing was done for the impression field, which is also a list field. Next, the label field was created from splitting the impression indicator from each impression article into its own separate column. Only ten users were stored for the system project. This is to ensure the system is fast and that the system will be based on new users since the users in the MIND dataset will not be real users of the system. The news repository contains over 400 randomly selected articles. 
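The pre-processing steps above can be sketched with pandas. This is an illustration on toy data, not the code inside `news_rec`: the frame names, the `category_vec` column and all values are assumptions, and the input layout (space-separated history ids, impressions written as `newsid-label`) follows the raw MIND files.

```python
import pandas as pd

# Toy rows in the layout of the raw MIND files (values are made up).
behaviors = pd.DataFrame({
    "user_id": ["U1", "U2"],
    "history": ["N1 N2", "N3"],
    "impressions": ["N4-1 N5-0", "N4-0"],
})
news = pd.DataFrame({
    "news_id": ["N1", "N2", "N3"],
    "category": ["sports", "health", "sports"],
})

# One-hot encode the category and store it as one list per article.
dummies = pd.get_dummies(news["category"]).astype(int)
news["category_vec"] = dummies.values.tolist()

# Split the history list so each read article gets its own row.
history = (
    behaviors.assign(news_id=behaviors["history"].str.split())
    .explode("news_id")[["user_id", "news_id"]]
    .reset_index(drop=True)
)

# Split impressions the same way, then separate the click label from the id.
cand = (
    behaviors.assign(impression=behaviors["impressions"].str.split())
    .explode("impression")[["user_id", "impression"]]
    .reset_index(drop=True)
)
cand[["news_id", "label"]] = cand["impression"].str.split("-", expand=True)
cand["label"] = cand["label"].astype(int)
cand = cand.drop(columns="impression")
```

The sub-category field would be encoded the same way as the category field.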

Because nearest neighbor is a model free algorithm, training will only be done to test the performance of the algorithm and for hyper-parameter tuning. A model, containing data and logic, will not be trained and then deployed. Instead, only the logic for the algorithm will be deployed. All logic will point to the database containing the user and article data for generating recommendations.

See examples below of the history, candidate and news data frames after pre-processing.

In [2]:
# Load and pre-process all the MIND data. 
# 3 different data frames are returned
history, cand, news = nr.all_preprocessing_web_app()

In [3]:
# contains all historically read articles for a user
history.head()

| Unnamed: 0 | user_id | date | news_id | category | sub_category | title |
|---|---|---|---|---|---|---|
| 0 | U60170 | 2019-11-14 | N871 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Woman, suspect dead at 'Tarzan' actor Ron Ely'... |
| 1 | U60170 | 2019-11-14 | N64208 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Report: Patrick Mahomes to miss at least three... |
| 2 | U60170 | 2019-11-14 | N52536 | [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | A couple's attempt to re-create a picture-perf... |
| 3 | U60170 | 2019-11-14 | N4526 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Bill Murray Applied To Work At An Airport P.F.... |
| 4 | U60170 | 2019-11-14 | N53872 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Brady jabs Manning, joins other athletes in co... |


In [4]:
# sample of all candidate (potential recommendation) articles for a user.
# New users will copy user U9318's candidate articles
cand.head() 

| Unnamed: 0 | user_id | date | news_id | label | category | sub_category | title |
|---|---|---|---|---|---|---|---|
| 0 | U60170 | 2019-11-14 | N25165 | 0 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Top Putin aide named by MH17 airliner investig... |
| 1 | U60170 | 2019-11-14 | N63060 | 0 | [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 32 Legit Companies That Will Pay You To Work F... |
| 2 | U60170 | 2019-11-14 | N45734 | 1 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | One in 5 children live below the poverty line:... |
| 3 | U60170 | 2019-11-14 | N29212 | 0 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Kiss Cancel 'End of the Road' Tour of Australi... |
| 4 | U57295 | 2019-11-14 | N29212 | 0 | [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | Kiss Cancel 'End of the Road' Tour of Australi... |


In [5]:
# contains all of the article information without one-hot encoding
# This information is displayed on the web application
news.head()

| Unnamed: 0 | news_id | category | sub_category | title | abstract | url | title_entities | abstract_entitites |
|---|---|---|---|---|---|---|---|---|
| 47 | N41387 | tv | tv-gallery | Can you answer these real Jeopardy questions a... | Culling data straight from the "Jeopardy!" arc... | https://assets.msn.com/labs/mind/AABs6Gq.html | [{"Label": "Jeopardy!", "Type": "W", "Wikidata... | [{"Label": "Jeopardy!", "Type": "W", "Wikidata... |
| 268 | N30344 | lifestyle | lifestylebuzz | Snakehead fish that survives on land was disco... | An invasive fish species that can breathe air ... | https://assets.msn.com/labs/mind/AAIzlnB.html | [{"Label": "Georgia (U.S. state)", "Type": "G"... | [{"Label": "Georgia (U.S. state)", "Type": "G"... |
| 375 | N50299 | tv | tv-celebrity | Kelly Ripa responds to backlash over son in 'e... | Kelly RIpa is defending a joke she made about ... | https://assets.msn.com/labs/mind/AAJfUQq.html | [{"Label": "Kelly Ripa", "Type": "P", "Wikidat... | [{"Label": "Kelly Ripa", "Type": "P", "Wikidat... |
| 764 | N1644 | travel | traveltips | 8 Secret Spots You Never Knew Existed in Disne... | Make your next trip even more magical. | https://assets.msn.com/labs/mind/AACrqlJ.html | [{"Label": "Disney Parks, Experiences and Prod... | [] |
| 919 | N54822 | health | nutrition | If You Don't Eat a Banana Every Day, This Migh... | An apple a day keeps the doctor away? Not so m... | https://assets.msn.com/labs/mind/AAHyq1v.html | [] | [] |


## Design
For this system, the content recommender system will be built using a nearest neighbor (NN) based algorithm. NN approaches for recommender systems have key advantages such as simplicity, justifiability and efficiency [2]. The model will also have a semantic textual similarity score built into the algorithm. Semantic textual similarity (STS) is used to compare the relatedness of two phrases. This approach could help eliminate ties between articles that might share the same score when using only an NN approach.

The recommender architecture is the following: For each user, the features of every historical article are scored against the features of every candidate article based on distance. This will be referred to as the feature score. For example, if a user has 3 historical articles and there are 4 candidate articles, 12 scores will be generated. This score is generated using only the category and sub-category features. For the feature scoring step, the measure will be the cosine distance because it is a common measure used in recommender systems according to [2] and it had the best performance in my research. Next, a semantic textual similarity (STS) score will be generated between each candidate article's title and each historical article's title and added to the feature score. The STS score will be generated using Sentence-BERT (sBERT), which is a pre-trained BERT based transformer [4, 5]. STS was chosen as the NLP task because it had the best performance in my research. sBERT was chosen for this step because of its speed and ease of use and because it is considered a state-of-the-art NLP model.
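The scoring step can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the `news_rec` implementation: the cosine measure is written as a similarity (so a larger value means a closer pair), the feature vectors are the concatenated one-hot category and sub-category vectors, the title embeddings are made-up stand-ins for precomputed sBERT embeddings, and the function names are hypothetical. All vectors are assumed to be non-zero.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


def combined_scores(hist_feats, cand_feats, hist_title_emb, cand_title_emb):
    """Feature score plus STS score for every (historical, candidate) pair."""
    feature_score = cosine_similarity_matrix(hist_feats, cand_feats)
    sts_score = cosine_similarity_matrix(hist_title_emb, cand_title_emb)
    return feature_score + sts_score


# 3 historical and 4 candidate articles (made-up one-hot features).
hist_feats = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1]]
cand_feats = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1], [1, 0, 0, 1]]
# Made-up 2-d vectors standing in for title embeddings.
hist_title_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cand_title_emb = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

scores = combined_scores(hist_feats, cand_feats, hist_title_emb, cand_title_emb)
```

With 3 historical and 4 candidate articles, `scores` has shape `(3, 4)`, giving the 12 pair scores described above.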

Then, for each historical article, the K candidates with the highest scores will be kept. For example, if K is 2 and the user has 3 historical articles, then 6 candidate slots (3 x 2, possibly with repeated articles) will remain in the running for being recommended. This will be called the neighbor history step. The final step is to select the final top K recommendations based on the articles recommended most often by the neighbor history step. If multiple candidate articles have the same number of neighbor history recommendations, the tie will be broken by which article had the highest score, followed by title order. See the image below for the algorithm's architecture.

![NN STS Architecture](./images/nn_sts_model.png)
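The neighbor history and final selection steps can be sketched in plain Python. This is one reading of the description, not the `news_rec` code: the function name `recommend` is hypothetical, the scores are made up, larger scores are assumed to be better, and "highest score" in the tie-break is interpreted as the best single pair score a candidate received.

```python
def recommend(scores, cand_ids, cand_titles, k):
    """Pick the final top-k candidates from a (historical x candidate) score grid."""
    votes = {}  # how many historical articles kept this candidate
    best = {}   # best single score the candidate received
    for row in scores:
        # Neighbor history step: keep the k best candidates for this article.
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        for j in top:
            votes[j] = votes.get(j, 0) + 1
            best[j] = max(best.get(j, float("-inf")), row[j])
    # Final step: most votes first, then highest score, then title order.
    ranked = sorted(votes, key=lambda j: (-votes[j], -best[j], cand_titles[j]))
    return [cand_ids[j] for j in ranked[:k]]


# 3 historical articles x 4 candidates (made-up scores).
scores = [
    [0.9, 0.8, 0.1, 0.2],
    [0.2, 0.7, 0.6, 0.1],
    [0.1, 0.5, 0.6, 0.2],
]
cand_ids = ["N1", "N2", "N3", "N4"]
cand_titles = ["Title A", "Title B", "Title C", "Title D"]

final = recommend(scores, cand_ids, cand_titles, k=2)
```

Here `N2` is kept by all three historical articles and `N3` by two of them, so with K = 2 those are the two final recommendations.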

The web application will be built using Streamlit [6]. The application will have 3 menu options: an About page, which explains how to navigate the site; a Sign Up page, which allows a user to create an account; and a Login page, where a user must log in to the system in order to browse the news collection and receive recommendations. Recommendations will only be generated after history is created (at least one article clicked). Five recommendations will be displayed to the user on the side panel. The recommendations will be in order of recommendation score and will include the title and abstract of the article. See the images of the web app below.

Sign Up Page of Application\
![Sign Up](./images/web_app_sign_up.png)

Login With No History\
![Login No Recommendations](./images/web_app_no_recs.png)

Login With Recommendations\
![Login with Recommendations](./images/web_app_recs.png)


In [6]:
# Takes about a minute to run
# The status is printed to show the progress of the recommendations
# The final dictionary printed holds the recommendations per user
model_sts = nr.get_sts()  # STS (NLP) model
recs = nr.rec_any(cand, history, 3, "cosine", "STS", model_sts, 1, True)
recs  # recommendations per user; all users receive 5 recommendations

User Status (array([0], dtype=int64),) / (10,)
User Status (array([1], dtype=int64),) / (10,)
User Status (array([2], dtype=int64),) / (10,)
User Status (array([3], dtype=int64),) / (10,)
User Status (array([4], dtype=int64),) / (10,)
User Status (array([5], dtype=int64),) / (10,)
User Status (array([6], dtype=int64),) / (10,)
User Status (array([7], dtype=int64),) / (10,)
User Status (array([8], dtype=int64),) / (10,)
User Status (array([9], dtype=int64),) / (10,)


{'U60170': ['N38779', 'N23446', 'N25165', 'N29212', 'N45523'],
 'U89637': ['N23446', 'N50872', 'N6477', 'N29212', 'N38779'],
 'U25497': ['N41934', 'N61233', 'N56211', 'N14478', 'N27737'],
 'U9318': ['N40109', 'N50872', 'N38779', 'N36226', 'N47098'],
 'U57255': ['N64037', 'N46917', 'N60750', 'N22975', 'N50055'],
 'U66830': ['N62318', 'N19661', 'N40109', 'N50872', 'N34185'],
 'U27158': ['N50872', 'N9623', 'N48017', 'N8015', 'N64174'],
 'U24208': ['N6578', 'N38779', 'N6477', 'N23446', 'N29212']}

## Diagnosis

The algorithm for the system was selected based on a series of nearest neighbor based experiments from my research. The model with the best overall performance was selected for the web application algorithm. The models were measured by accuracy and information retrieval's Mean Average Precision @ K (MAP@K). For both metrics, the higher the score, the better the recommendations.

Accuracy is calculated using each user's number one recommendation. For this calculation, the total number of recommendations is the same as the number of users since only the highest recommendation for each user is used. This score informs us of how well our model is doing when recommending only a single article to a user. 

![Accuracy](./images/accuracy.png)
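As a rough illustration of the accuracy calculation described above, the following sketch counts the users whose number one recommendation is an article they clicked. The function name and the toy data are hypothetical, and this is one reading of the description rather than the exact formula used in the research.

```python
def top1_accuracy(recs, clicked):
    """Share of users whose first recommendation is an article they clicked."""
    hits = sum(
        1
        for user, ranked in recs.items()
        if ranked and ranked[0] in clicked.get(user, set())
    )
    return hits / len(recs)


# Made-up recommendations and clicks for two users.
recs = {"U1": ["N1", "N2"], "U2": ["N3", "N4"]}
clicked = {"U1": {"N1"}, "U2": {"N4"}}

accuracy = top1_accuracy(recs, clicked)
```

Only `U1`'s first recommendation was clicked, so the accuracy for this toy data is 1 out of 2 users.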


MAP@K is calculated by taking the top 5 recommendations provided by each experiment and then scoring only the first K (3 and 5) of those recommendations. MAP@K is used to evaluate the precision of the entire system (all users instead of one user). In the formula, N is the number of users.

![MAP@K](./images/map.png)
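A minimal sketch of MAP@K follows. Definitions of average precision differ in how they normalize; this version divides by the smaller of K and the number of relevant articles, which is an assumption and may not match the formula above term for term. The function names and toy data are hypothetical.

```python
def average_precision_at_k(ranked, relevant, k):
    """Average precision over the first k recommendations for one user."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / position  # precision at this position
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0


def map_at_k(recs, clicked, k):
    """Mean of the per-user average precision, where N is the number of users."""
    scores = [
        average_precision_at_k(ranked, clicked.get(user, set()), k)
        for user, ranked in recs.items()
    ]
    return sum(scores) / len(scores)


# Made-up recommendations and clicks for two users.
recs = {"U1": ["a", "b", "c", "d", "e"], "U2": ["x", "y", "z", "v", "w"]}
clicked = {"U1": {"a", "c"}, "U2": {"z"}}

map3 = map_at_k(recs, clicked, k=3)
```

For this toy data the per-user values are 5/6 and 1/3, so MAP@3 is 7/12.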

From the results, we can see that the NN with STS model had the best MAP@3 and MAP@5 scores and its accuracy was the second highest of all experiments. Based on the results, we know that the NN with STS model is able to recommend more relevant articles than the other models when the system recommends more than one article.

![Results](./images/table_results.png)  

After deployment, operational performance will be measured by MAP@K, individual precision at K (P@K, with K = 5) per login, and the amount of time it takes for recommendations to generate. We want to avoid interest drift, where users become uninterested in our recommendations. We can track drift by recording whether the user clicks on the recommendations. To evaluate the entire system, we will continue to measure MAP@5, with a threshold of 0.20 (based on the research results). If the score falls below that threshold, the algorithm will need to be re-evaluated in order to adjust the parameters of the model. For individual users, we will store the P@K from the last login and compare it to the current session's P@K. If P@K decreases by 0.10 or more between the last and current login, we take that as a sign that the articles the user read most recently reduced the relevancy of the recommendations; therefore, we would remove the more recent historical records from the user's history.
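The two monitoring rules can be sketched as simple checks. The thresholds come from the text; the function names are hypothetical, and treating "decreases by .10" as "drops by at least 0.10" is an assumption.

```python
MAP_THRESHOLD = 0.20   # system-wide MAP@5 floor from the research results
PRECISION_DROP = 0.10  # per-user P@K drop that triggers a history trim


def precision_at_k(ranked, clicked, k=5):
    """Fraction of the first k recommendations that the user clicked."""
    return sum(1 for item in ranked[:k] if item in clicked) / k


def needs_reevaluation(map_at_5):
    """True when system-wide MAP@5 has fallen below the threshold."""
    return map_at_5 < MAP_THRESHOLD


def should_trim_history(previous_p_at_k, current_p_at_k):
    """True when P@K fell by at least PRECISION_DROP since the last login."""
    # A small tolerance guards against floating point rounding.
    return previous_p_at_k - current_p_at_k >= PRECISION_DROP - 1e-9


current = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c"})
trim = should_trim_history(0.6, current)
```

In this toy session the user clicked 2 of 5 recommendations, down from a stored P@5 of 0.6, so the check flags the history for trimming.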

For time measurement, recommendations should take no more than one second to generate. Any longer and the user could get annoyed with the system. From the research, I learned that the more items in a user's history, the longer it takes for recommendations to generate. To avoid recommendations taking too long to populate, I will set a history threshold of 20 articles per user: only the 20 articles most recently read by a user will be used to generate recommendations. Having a limited history will also ensure that recommendations stay aligned with a user's current interests. In production, if generating recommendations takes longer than one second, this threshold will be reduced.
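The history threshold can be sketched with pandas. This assumes a history frame with `user_id`, `date` and `news_id` columns like the one shown in the Data section; the function name `recent_history` and the toy data are hypothetical.

```python
import pandas as pd

MAX_HISTORY = 20  # history threshold from the text


def recent_history(history, max_items=MAX_HISTORY):
    """Keep only each user's most recently read articles."""
    return (
        history.sort_values("date", ascending=False)
        .groupby("user_id", sort=False)
        .head(max_items)
        .reset_index(drop=True)
    )


# Made-up history: U1 read 25 articles on consecutive days, U2 read 3.
history = pd.DataFrame({
    "user_id": ["U1"] * 25 + ["U2"] * 3,
    "date": list(pd.date_range("2019-11-01", periods=25))
    + list(pd.date_range("2019-11-01", periods=3)),
    "news_id": [f"N{i}" for i in range(28)],
})

trimmed = recent_history(history)
counts = trimmed["user_id"].value_counts()
```

Lowering `max_items` is the knob to turn if generation time exceeds the one second target in production.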


## Deployment
![deployment diagram](./images/deployment.png)

All user and article data will be stored in a SQL server database. The database will consist of 4 tables: user login information, user history (reading and precision scores), user candidate articles, and all news information (for displaying in the UI). Depending on the data used (MIND vs. a created dataset), a domain expert might need to label the article data. This labeling would provide the articles' categories and sub-categories. All article data will then be pre-processed (see the Data section for pre-processing details). Once all data is groomed, it will be stored in the database.

The logic for the NN STS algorithm and the Streamlit web application will both be hosted in the cloud. Specifically, the code will live on an Amazon EC2 P3 instance. This instance type was chosen because AWS advertises record-breaking BERT training performance for it [7], and the STS logic is BERT based. This instance also offers a GPU. Since the initial deployment of the application is expected to contain only a few thousand articles and to serve a few thousand users per day, the smallest instance, 2xlarge, will be used. If the system grows to millions of articles or users, a larger instance will be needed. The 2xlarge instance offers one GPU, which should be sufficient for this system because all data can be processed at once rather than in a batch/distributed manner. When a user uses the application, the reading history will be tracked and eventually stored in the database. While the user interacts with the application, recommendations will be generated and sent to the user.


## References
1. Fangzhao Wu et al. “MIND: A Large-scale Dataset for News Recommendation”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, July 2020, pp. 3597–3606. doi: 10.18653/v1/2020.acl-main.331. url: https://aclanthology.org/2020.acl-main.331.


2. Athanasios N. Nikolakopoulos et al. “Trust your neighbors: A comprehensive survey of neighborhood-based methods for recommender systems”. In: CoRR abs/2109.04584 (2021). arXiv: 2109.04584. url: https://arxiv.org/abs/2109.04584.


3. Rishabh Ahuja, Arun Solanki, and Anand Nayyar. “Movie Recommender System Using K-Means Clustering AND K-Nearest Neighbor”. In: 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (2019), pp. 263–268.


4. Nils Reimers and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. url: https://arxiv.org/abs/1908.10084.


5. Nils Reimers and Iryna Gurevych. “Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2020. url: https://arxiv.org/abs/2004.09813.

6. [Streamlit](https://streamlit.io/)

7. [Amazon ML](https://aws.amazon.com/machine-learning/infrastructure/?pg=ln&sec=uc)