# Recommendation System for Stack Overflow Users
## Table of Contents
1. [Introduction](#Introduction)
2. [Implementation](#Implementation)
3. [Results and Conclusion](#Results-and-Conclusion)

# Introduction
Any company providing services through the internet faces a daunting challenge: how does one keep customers coming back in a world where time and attention are becoming increasingly limited? Designing a convenient, user-friendly website and effective advertising will only go so far if users do not see a need to engage with the company's services. So if Company A wants to keep users coming back, what better way to do this than by showing users similar items they might enjoy? For example, a music streaming service can increase the likelihood of users coming back by recommending songs from their favorite genre. This is the idea behind many [recommender systems](https://en.wikipedia.org/wiki/Recommender_system). In this project, we will create a simple recommender system for [Stack Overflow](https://stackoverflow.com) users.

It is worth noting that there are many different angles we can take with this approach. Most of the "big questions", such as "If a user is asking a question, can we show them similar questions?" or "Based on the text in the user's question, can we recommend what tags they should attach to the question?", require the use of [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP). While this is something I am very much interested in, I am still learning about different NLP techniques, and therefore I will postpone using any NLP approaches.

Instead, I seek to answer the following two questions:
1. If a user has posted at least one question, can we recommend them other questions that might interest them?
2. Can we "categorize" a user? That is, if a user has interacted with certain tags, can we recommend them questions they'll enjoy?

Both questions can be answered by building a [content-based](https://en.wikipedia.org/wiki/Recommender_system#Content-based_filtering) recommender system. To report which questions are most similar, I will use the [$\underline{k}$-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) ($k$-NN) to find the questions closest to a given one, with the idea that two questions are "similar" if the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) of their tag sets is sufficiently large. Conveniently, the [last project](https://nbviewer.org/github/daniel-rossano/Data-Analysis-Projects/blob/main/Visualizing%20Stack%20Overflow%20Tags/Visualizing%20Stack%20Overflow%20Tags%20Using%20KeplerMapper.ipynb) provides much of the information needed to find the Jaccard index of the questions (though it will certainly need to be adjusted, since that project was concerned solely with tags and not the questions themselves!). This also ties into the [persistent homology project](https://github.com/daniel-rossano/Data-Analysis-Projects/blob/main/Persistent%20Homology%20SO%20Tags/Persistent%20Homology%20of%20Stack%20Overflow%20Tags.ipynb) I did, because that project used the Jaccard index to represent Stack Overflow tags as an abstract metric space.
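For reference, the Jaccard index of two tag sets $A$ and $B$ is the size of their intersection divided by the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

As a small worked example with made-up tags (not taken from the sample), let $A = \{\text{python}, \text{pandas}, \text{dataframe}\}$ and $B = \{\text{python}, \text{pandas}, \text{numpy}\}$. The two sets share $2$ tags and together contain $4$ distinct tags, so

$$J(A, B) = \frac{2}{4} = 0.5$$

Two questions with identical tag sets have $J = 1$, and two questions with no tags in common have $J = 0$.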

# Implementation
A more descriptive, yet still concise overview of what I want to do is as follows:
* Query the Stack Overflow Data Explorer for a random sample of Stack Overflow user IDs, as well as the ID of every question each user has asked and the tag(s) attached to each of those questions. The query specifically ignores users who have not asked any questions, as such users would essentially "take up empty space" in the sample and leave room for greater inaccuracies in the recommendations.
* Consider two questions to be similar if their tag(s) have a sufficiently large Jaccard index, i.e., two questions are similar if they use several of the same tags.
* For a given question, look at its Jaccard index with all other questions in the sample, and then report back the $10$ most similar questions (the $10$ questions with the highest Jaccard index). We will, of course, exclude the input itself if it happens to be in the sample.
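The last two steps can be sketched in plain Python as follows. This is only a minimal illustration under some assumptions: the helper names `jaccard_index` and `most_similar` are hypothetical, the sample is assumed to be a dictionary mapping question IDs to tag lists, and the question IDs and tags are made up.

```python
import heapq


def jaccard_index(tags_a, tags_b):
    """Return |A ∩ B| / |A ∪ B| for two collections of tags."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    # Two empty tag sets share nothing; treat them as dissimilar.
    return len(a & b) / len(union) if union else 0.0


def most_similar(input_id, input_tags, sample, k=10):
    """Return the k (question_id, score) pairs most similar to the input.

    `sample` maps question IDs to tag lists. The input question itself is
    skipped if it appears in the sample.
    """
    scored = (
        (qid, jaccard_index(input_tags, tags))
        for qid, tags in sample.items()
        if qid != input_id
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up question IDs and tags, purely for illustration.
sample = {
    1: ["python", "pandas", "dataframe"],
    2: ["python", "pandas"],
    3: ["python", "numpy"],
    4: ["javascript", "reactjs"],
}
top = most_similar(1, sample[1], sample, k=2)
print(top)
```

Here question 2 ranks first because it shares two of the three distinct tags with the input, while question 1 (the input itself) never appears in the result.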

Since the SQL query here is more advanced than in the previous projects, where I only needed to obtain a random sample of any type of post with a nonzero number of tags, I will explicitly write the query I used to obtain all the necessary information:

```sql
SELECT 
    Users.Id AS 'User ID', 
    Posts.Id AS 'Question ID', 
    Posts.Tags AS 'Tags'
FROM 
    (SELECT TOP 50000 Users.Id FROM Users ORDER BY NEWID()) AS Users
JOIN 
    Posts ON Posts.OwnerUserId = Users.Id
WHERE
    Posts.PostTypeId = 1
    AND Posts.DeletionDate IS NULL
    AND Posts.ClosedDate IS NULL
    AND Posts.CommunityOwnedDate IS NULL
```

This query fetches a random assortment of users, each of whom has asked at least one question. Each row contains a user's ID, the ID of a question they asked, and the tags attached to that question. The last few lines restrict which questions are allowed. The condition `Posts.PostTypeId = 1` means we only consider Stack Overflow *questions* and not articles or answers, because if the end goal is user *engagement*, I should recommend something users can participate in. Users should also be recommended questions that have not been deleted or closed, so that they actually have a chance to engage in discussion, hence the conditions `AND Posts.DeletionDate IS NULL` and `AND Posts.ClosedDate IS NULL`. Finally, `AND Posts.CommunityOwnedDate IS NULL` filters out Community Wiki posts, mostly because such posts are generally more complex to answer, and newer users may feel overwhelmed. In an ideal world where I had direct access to all of the necessary SO data, I would also filter questions by creation date, since newer questions give the user a fairer chance to engage in discussion.
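To make the row format concrete, here is a small sketch of how rows of this shape could be parsed into per-question tag lists and per-user tag sets. It assumes the tags arrive as a single string of the form `<tag1><tag2>`, and the user IDs, question IDs, and tags below are made up; the names `question_tags` and `user_tags` are hypothetical.

```python
import csv
import io
import re
from collections import defaultdict

# Hypothetical rows in the shape the query returns; all values are made up.
raw = """User ID,Question ID,Tags
101,9001,<python><pandas>
101,9002,<python><regex>
202,9003,<sql><sql-server>
"""

question_tags = {}             # question ID -> list of tags on that question
user_tags = defaultdict(set)   # user ID -> every tag the user has asked about

for row in csv.DictReader(io.StringIO(raw)):
    tags = re.findall(r"<(.*?)>", row["Tags"])  # "<a><b>" -> ["a", "b"]
    question_tags[int(row["Question ID"])] = tags
    user_tags[int(row["User ID"])].update(tags)

print(question_tags)
print(dict(user_tags))
```

Collecting a user's tags into one set is one possible starting point for "categorizing" a user, since that set can be compared against a question's tags just like another question's tags would be.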

Now we have all the data we need to create our recommendation system. This can be done surprisingly quickly, in very few lines of code.

```python
import re

import pandas as pd

questions = pd.read_csv("UserQuestionSample.csv")  # read in the sample

# Tags are stored as "<tag1><tag2>..."; split the input question's tags into a list.
# The question in row 4 is used as the input here (column 2 holds its tags).
input_tag_list = re.findall(r'<(.*?)>', questions.iat[4, 2])
input_tag_set = set(input_tag_list)

# Turn every question's tag string into a tag list and store it in a new column.
questions['Tag_List'] = questions.Tags.map(lambda tags: re.findall(r'<(.*?)>', tags))

# Results dataframe; all we really need are the question IDs.
jaccard_indices = pd.DataFrame({'Question ID': questions['Question ID']})

# Jaccard index of each question's tags with respect to the input's tags.
jaccard_indices['Jaccard Index'] = questions.Tag_List.map(
    lambda x: len(set(x) & input_tag_set) / len(set(x) | input_tag_set)
)

# Sort by Jaccard index, highest first. The input question is itself in the sample
# and matches itself with index 1.0, so request 11 rows: the input plus 10 others.
jaccard_indices.sort_values(by='Jaccard Index', ascending=False).head(11)
```

| | Question ID | Jaccard Index |
|---|---|---|
| 4 | 121707 | 1.0 |
| 29729 | 62591570 | 0.8 |
| 49476 | 52186180 | 0.666667 |
| 49474 | 52172524 | 0.666667 |
| 10005 | 23731526 | 0.666667 |
| 48408 | 46346378 | 0.666667 |
| 46458 | 36489300 | 0.6 |
| 20518 | 47698549 | 0.6 |
| 20802 | 45011961 | 0.6 |
| 8334 | 12128125 | 0.6 |


# Results and Conclusion