<h1><center>Epsilon-Greedy Latent Recommender</center></h1>

<center>Hamza El Bouatmani on 14th April, 2019 </center>


____

## Problem Statement:
<a href="https://www.careervillage.org/" target="_blank">CareerVillage.org</a> <span style="color: purple;">is a cloud-based solution for career advice</span>. It provides a platform where students with career-related questions meet professionals from the industry who help them by answering their questions.

The goal of <a href="https://www.kaggle.com/c/data-science-for-good-careervillage/overview" target="_blank">this competition</a>, is to develop a method to recommend relevant questions to the professionals who are most likely to answer them.

In this notebook, I propose a solution that addresses the problem in an efficient manner using a probabilistic approach (Epsilon-Greedy) combined with an industry-proven *state-of-the-art technique (SVD)*. **This combination aims to balance between Exploration & Exploitation, targeting both the new and already-engaged professionals in an effective and efficient manner.**

## Why do we need a Recommender ? Let's ask the Data !

<a href="https://www.kaggle.com/hamzael1/an-extensive-eda-for-careervillage" target="_blank">In a previous notebook</a> I have made a detailed Exploratory Data Analysis on the provided data. Here, I will be brief and focus on the most important points which relate to the problem at hand.

*Note: some code snippets that are trivial are collapsed for better readability, feel free to expand them if you want to check the code*

In [1]:
# Imports

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('max_colwidth', 200)


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import re
import string
import datetime
import random
from random import choice, choices
from collections import OrderedDict


from IPython.display import display

# Professionals Import

professionals = pd.read_csv('../input/professionals.csv', index_col='professionals_id')
professionals = professionals.rename(columns={'professionals_location': 'location', 'professionals_industry': 'industry', 'professionals_headline': 'headline', 'professionals_date_joined': 'date_joined'})
professionals['headline'] = professionals['headline'].fillna('')
professionals['industry'] = professionals['industry'].fillna('')

# Students Import

students = pd.read_csv('../input/students.csv', index_col='students_id')
students = students.rename(columns={'students_location': 'location', 'students_date_joined': 'date_joined'})

# Questions Import
questions = pd.read_csv('../input/questions.csv', index_col='questions_id', parse_dates=['questions_date_added'], infer_datetime_format=True)
questions = questions.rename(columns={'questions_author_id': 'author_id', 'questions_date_added': 'date_added', 'questions_title': 'title', 'questions_body': 'body', 'questions_processed':'processed'})

# Answers Import
answers = pd.read_csv('../input/answers.csv', index_col='answers_id', parse_dates=['answers_date_added'], infer_datetime_format=True)
answers = answers.rename(columns={'answers_author_id':'author_id', 'answers_question_id': 'question_id', 'answers_date_added': 'date_added', 'answers_body': 'body'})

# Tags Import
tags = pd.read_csv('../input/tags.csv',)
tags = tags.set_index('tags_tag_id')
tags = tags.rename(columns={'tags_tag_name': 'name'})

# Comments Import
comments = pd.read_csv('../input/comments.csv', index_col='comments_id')
comments = comments.rename(columns={'comments_author_id': 'author_id', 'comments_parent_content_id': 'parent_content_id', 'comments_date_added': 'date_added', 'comments_body': 'body' })


# School Memberships
school_memberships = pd.read_csv('../input/school_memberships.csv')
school_memberships = school_memberships.rename(columns={'school_memberships_school_id': 'school_id', 'school_memberships_user_id': 'user_id'})

# Groups Memberships
group_memberships = pd.read_csv('../input/group_memberships.csv')
group_memberships = group_memberships.rename(columns={'group_memberships_group_id': 'group_id', 'group_memberships_user_id': 'user_id'})


#####################################################
print('Important numbers:')
print('\nThere are:')
print(f'- {len(students)} Students.', end="\t")
print(f'- {len(professionals)} Professionals.')
print(f'- {len(questions)} Questions.', end="\t")
print(f'- {len(answers)} Answers.')
print(f'- {len(tags)} Tags.', end="\t\t")
print(f'- {len(comments)} Comments.')
print(f'- {school_memberships["school_id"].nunique()} Schools.', end="\t\t")
print(f'- {len(pd.read_csv("../input/groups.csv"))} Groups.')
#####################################################

# Questions-related stats
tag_questions = pd.read_csv('../input/tag_questions.csv',)
tag_questions = tag_questions.rename(columns={'tag_questions_tag_id': 'tag_id', 'tag_questions_question_id': 'question_id'})
count_question_tags = tag_questions.groupby('question_id').count().rename(columns={'tag_id': 'count_tags'}).sort_values('count_tags', ascending=False)
print('\nInteresting statistics: ')
print(f'- {(answers["question_id"].nunique()/len(questions))*100:.2f} % of the questions have at least 1 answer.')
print(f'\n- {(len(count_question_tags)/len(questions))*100:.2f}% of questions are tagged by at least {count_question_tags["count_tags"].tail(1).values[0]} tag.')
print(f'- Mean of tags per question: {count_question_tags["count_tags"].mean():.2f} tags per question.')

tag_users = pd.read_csv('../input/tag_users.csv',)
tag_users = tag_users.rename(columns={'tag_users_tag_id': 'tag_id', 'tag_users_user_id': 'user_id'})
users_who_follow_tags = list(tag_users['user_id'].unique())
nbr_pros_tags = len(professionals[professionals.index.isin(users_who_follow_tags)])
nbr_students_tags = len(students[students.index.isin(users_who_follow_tags)])
print(f'\n- {(nbr_pros_tags / len(professionals))*100:.2f} % of the professionals follow at least 1 Tag ({nbr_pros_tags}).')
print(f'- {(nbr_students_tags / len(students))*100:.2f} % of the students follow at least 1 Tag ({nbr_students_tags}).')

question_scores = pd.read_csv('../input/question_scores.csv')
nbr_questions_with_hearts = question_scores[question_scores['score'] > 0]['id'].nunique()
print(f'\n- {(nbr_questions_with_hearts/len(questions))*100:.2f} % of questions were upvoted ({nbr_questions_with_hearts}).')

answer_scores = pd.read_csv('../input/answer_scores.csv')
nbr_answers_with_hearts = answer_scores[answer_scores['score'] > 0]['id'].nunique()
print(f'- {(nbr_answers_with_hearts/len(questions))*100:.2f} % of answers were upvoted ({nbr_answers_with_hearts}).')

# Professionals who did not contribute
nbr_pros_without_answers = len(professionals) - answers['author_id'].nunique()
print(f'\n- {(nbr_pros_without_answers/len(professionals))*100:.2f} % of the professionals have Zero answers ({nbr_pros_without_answers}).')

# Number of accurate recommendations
emails = pd.read_csv('../input/emails.csv')
emails = emails.set_index('emails_id')
emails = emails.rename(columns={'emails_recipient_id':'recipient_id', 'emails_date_sent': 'date_sent', 'emails_frequency_level': 'frequency_level'})
#emails.sample(2)

matches = pd.read_csv('../input/matches.csv')
matches = matches.join(emails[['recipient_id', 'date_sent']], on='matches_email_id')

matches = matches.rename(columns={'matches_question_id': 'question_id', 'matches_email_id': 'email_id'})

matches['author_id'] = matches['recipient_id']
m = answers.reset_index().merge(matches, on=['question_id', 'author_id']).set_index('answers_id')
nbr_accurate_recommendations = len(m)
matches = matches.drop('author_id', axis=1)
print(f'- {(nbr_accurate_recommendations/len(matches))*100:.2f} % of recommended questions in emails were accurate (lead to professional answering the recommended question) ({nbr_accurate_recommendations})')


# School/Group Related Stats

def is_student(user_id):
    if user_id in students.index.values:
        return 1
    elif user_id in professionals.index.values:
        return 0
    else:
        raise ValueError('User ID not student & not professional')

school_memberships['is_student'] = school_memberships['user_id'].apply(is_student)
school_memberships['is_student'] = school_memberships['is_student'].astype(int)
count_students_professionals = school_memberships.groupby('is_student').count()[['school_id']].rename(columns={'school_id':'count'})
print(f'\n- Only {count_students_professionals.loc[1].values[0]/len(students):.2f} % of the students are members of schools ({count_students_professionals.loc[1].values[0]}).')
print(f'- Only {count_students_professionals.loc[0].values[0]/len(professionals):.2f} % of the professionals are members of schools ({count_students_professionals.loc[0].values[0]}).')

group_memberships['is_student'] = group_memberships['user_id'].apply(is_student)
group_memberships['is_student'] = group_memberships['is_student'].astype(int)
count_students_professionals = group_memberships.groupby('is_student').count()[['group_id']].rename(columns={'group_id':'count'})
print(f'\n- Only {count_students_professionals.loc[1].values[0]/len(students):.2f} % of the students are members of groups ({count_students_professionals.loc[1].values[0]}).')
print(f'- Only {count_students_professionals.loc[0].values[0]/len(professionals):.2f} % of the professionals are members of groups ({count_students_professionals.loc[0].values[0]}).')

del m
del emails
del matches
del students
del school_memberships
del group_memberships
del count_question_tags
del users_who_follow_tags
del nbr_pros_tags
del nbr_students_tags
del nbr_pros_without_answers
del nbr_questions_with_hearts
del count_students_professionals

Important numbers:

There are:
- 30971 Students.	- 28152 Professionals.
- 23931 Questions.	- 51123 Answers.
- 16269 Tags.		- 14966 Comments.
- 2706 Schools.		- 49 Groups.

Interesting statistics: 
- 96.57 % of the questions have at least 1 answer.

- 97.31% of questions are tagged by at least 1 tag.
- Mean of tags per question: 3.29 tags per question.

- 90.91 % of the professionals follow at least 1 Tag (25594).
- 14.88 % of the students follow at least 1 Tag (4608).

- 96.93 % of questions were upvoted (23196).
- 57.82 % of answers were upvoted (13837).

- 63.88 % of the professionals have Zero answers (17983).
- 0.41 % of recommended questions in emails were accurate (lead to professional answering the recommended question) (17576)

- Only 0.04 % of the students are members of schools (1355).
- Only 0.15 % of the professionals are members of schools (4283).

- Only 0.01 % of the students are members of groups (311).
- Only 0.03 % of the professionals are members of groups (727).


### Takeaways:
* Tags are heavily used by students in questions.
* Most professionals follow tags to find questions related to their expertise.
* **A big portion of the professionals (~63%) hasn't answered any question yet.**
* **Only a tiny proportion of recommended questions (~0.41%) in emails were accurate enough to probably lead the recipient to answer.**
* For the moment, we can not rely on school/group memberships, because only a tiny portion of the users have used them.

## Problem with the current System

In the current system, emails containing recommended questions are sent to professionals on a daily basis by default. 

The possible frequencies that a professional can choose from are:
* Immediate
* Daily
* Weekly

The 'daily' option is problematic. It is extremely difficult to to maintain a good quality of recommendations when the frequency is as high as 'Daily'. **We thus end up with a huge number of emails being sent daily with poor-quality recommendations. This can cause the professional to start ignoring emails and ultimately not returning to the site.**

<span style="color: blue; font-weight: bold;">Quick Solution proposal: </span> Maintain good-quality recommendations by removing the 'Daily' option, and only keeping the 'Immediate' & 'Weekly' options.

Another *future* solution would be to leave it up to the system to decide when to email each professional depending on the interaction of the professional with the site.


## Basic techniques used in the Recommender System
The recommender system works in two "modes":
* **Professional-to-Questions**: Recommend top K questions to a particular professional (needed for the professionals who choose a fixed frequency like 'Weekly' option )
* **Question-to-Professionals**: Recommend top K professionals most likely to answer a particular question. (needed for the professional who choose the 'Immediate' option)


### The Exploration-Exploitation Dilemma in Recommendations:

![slots](https://i.imgur.com/pFO04zu.jpg?3)



A recommender system's job is not that simple. If a recommender system keeps suggesting the same items to the same users, then in some cases, questions about fairness might be raised, in other cases, users might get bored getting the same type of content. In the case of Question-Answering platforms like CareerVillage, potential interests (other than the ones already expressed by the professional through tags) might be ignored and users might stop coming to the platform.

A recommendation system must not only recommend relevant questions to the professionals, **Occasionally, it should also introduce them to potentially new types of questions that might interest them**. It has to deal with the cold-start problem, where very little information about he professional is known.

In the ML litterature, finding the right tradeoff between these two components is called the **Exploration-Exploitation problem**.

### The Epsilon-Greedy Algorithm ( in a nutshell )

To tackle the Exploration-Exploitation problem, a popular algorithm called **'Epsilon-Greedy'** is used.

> It works by setting an Epsilon threshold, which represents the probability of 'Exploitation' .
> 
> A random number N between 0.0 and 1.0 is generated,
> 
> if N < Epsilon
> 
>     Exploit by searching similar questions based on the past
> 
> else
> 
>     Explore new questions

**The Epsilon-Greedy Algorithm is simple, easy to implement and does not need heavy computation, making it a great solution for the problem at hand. **

*( More details on the inner-workings in a later section )*

*Note: normaly Epsilon is used for exploration, in this implementation I used it for exploitation, but the idea is the same*

### LSA: Latent Semantic Analysis (in a nutshell)

Latent Semantic Analysis is a **simple**, yet **powerful** technique in Natural Language Processing. It captures the latent (hidden) topics of a corpus of text and represents each document by a vector of k dimensions, each pointing to one latent topic.

To do this, LSA relies on a robust mathematical technique called SVD (Singular-Value Decomposition), which factorizes a real matrix to a product of 3 matrices. (<a href="" target="_blank">More on LSA and SVD</a>)


<span style="color: red; font-weight: bold;">Takeaway:</span> **Each question will be represented by a vector of length k. comparing the questions will be as easy as performing a cosine similarity between the vectors.**

## Data Preprocessing is paramount !


<div style="border: solid 1px blue; padding: 5px;"><h4><center><span style="color: red;">If we let Garbage In, we get Garbage Out ! (GIGO)</span><center></h4></div>

<br/>
The most important data type in this project is Text (questions, tags ...). Unfortunately, if left unpreprocessed, it becomes extremely hard to extract useful information from it.

This section's goal, is to prepare the data by simplifying it and removing any noise that migh get in the way between us and the True Information that we want to extract.

This simple preprocessing can be easily done online in production, doesn't require a lot of computation.


### Tags:
For some reason, there are many tags which are not used in any question (and they are also not followed by any user).

In [2]:
# Drop tags that are not used in any question and not followed by any user (it will clean a lot of useless stuff)
useless_tags = tags[~tags.index.isin(tag_questions['tag_id'].unique())]
useless_tags = tags[ (tags.index.isin(useless_tags.index.values)) & (~tags.index.isin(tag_users['tag_id'].values)) ]
tags = tags.drop(useless_tags.index)

print(f'- {len(useless_tags)} useless tags were found and dropped.')

- 1865 useless tags were found and dropped.


Next, we make the following transformations to the tags:
* make all tags lowercase.
* create a new 'processed' column to hold the processed version of each tag
* remove any special characters from the text.
* correct some short words (yrs -> years)
* lemmatize the tags ( eg. 'wolves' -> 'wolf' )
* remove tags without any meaning that are just numbers, just preprositions, pronouns, stop-words ... ('where', 'and', 'the', '10', ...etc)



In [3]:
# Preprocessing Tags

nbr_tags = len(tags)

stop_words = set(stopwords.words('english'))
# some common words / mistakes to filter out too
stop_words.update(['want', 'go', 'like', 'aa', 'aaa', 'aaaaaaaaa', 
                   'good', 'best', 'would', 'get', 'as', 'th', 'k',
                   'become', 'know', 'us'])
special_characters = f'[{string.punctuation}]'
lm = WordNetLemmatizer()


tags['name'] = tags['name'].str.lower()
tags.fillna('', inplace=True)
tags['processed'] = tags['name'].str.replace(special_characters, '')
tags['processed'] = tags['processed'].str.replace('^\d+$', '') # tags that are just numbers :-/
tags['processed'] = tags['processed'].apply(lambda x: lm.lemmatize(x)) # avoid having plurals like 'career' and 'careers'
tags['processed'] = tags['processed'].str.replace('^\w$', '') # single letter tags :-/
tags['processed'] = tags['processed'].str.replace(r'(\d+)(yrs?)', r'\1year') #
tags['processed'] = tags['processed'].apply(lambda x: x if x not in stop_words else '')

# Drop tags which are prepositions, pronouns, determiners, wh-adverbs (where, ...)
tags_to_drop = []
for i, t in tags['processed'].iteritems():
    if len(t) > 0 and nltk.pos_tag([t])[0][1] in ['IN', 'PRP', 'WP$', 'PRP$', 'WP', 'DT', 'WRB']:
        tags_to_drop.append(i)
tag_questions = tag_questions.drop(tag_questions[tag_questions['tag_id'].isin(tags_to_drop)].index)
tags = tags.drop(tags_to_drop)

# Drop tags which are just numbers
tags_to_drop = tags[tags['name'].str.contains('^\d+$')].index
tag_questions = tag_questions.drop(tag_questions[tag_questions['tag_id'].isin(tags_to_drop)].index)
tags = tags.drop(tags_to_drop)

# Drop tags which are just stop words ( after, the , with , ...)
tags_to_drop = tags[tags['name'].isin(stop_words)].index
tag_questions = tag_questions.drop(tag_questions[tag_questions['tag_id'].isin(tags_to_drop)].index)
tags = tags.drop(tags_to_drop)

print(f'{nbr_tags - len(tags)} Tags were filtered out.')
tags.sample(2)

63 Tags were filtered out.


Unnamed: 0_level_0,name,processed
tags_tag_id,Unnamed: 1_level_1,Unnamed: 2_level_1
34421,#realestate,realestate
29786,deaf-education,deafeducation


### Questions
* We create a new column 'processed' containing both 'title' & 'body' text, and do the same transformations we did to tags ( remove special characters, lemmatize words and remove stop words ).
* Create a new column 'count_answers'.

In [4]:
# Questions Cleaning

questions['processed'] = questions['title'] + ' ' + questions['body']
questions['processed'] = questions['processed'].str.lower()
questions['processed'] = questions['processed'].str.replace('<.*?>', '') # remove html tags
questions['processed'] = questions['processed'].str.replace('[-_]', '') # remove separators
questions['processed'] = questions['processed'].str.replace(special_characters, ' ') # remove special characters

questions['processed'] = questions['processed'].str.replace('\d+\s?yrs?', ' years') # single letter tags :-/

def lem_question(q):
    return " ".join([lm.lemmatize(w) for w in q.split() if w not in stop_words])
questions['processed'] = questions['processed'].apply(lem_question)

questions['processed'] = questions['processed'].str.replace(r'(\d+)($|\s+)', r'\2') # remove numbers which are not part of words
questions['processed'] = questions['processed'].str.replace(r'(\d+)([th]|k)', r'\2') # remove numbers from before th and k


# Function to preprocess new questions
# TODO: update function to do like above
def preprocess_question(q):
    q = q.lower()
    q = re.sub("<.*?>", "", q)
    q = re.sub("[-_]", "", q)
    q = re.sub("\d+", "", q)
    q = q.translate(q.maketrans('', '', string.punctuation))
    q = " ".join([lm.lemmatize(t) for t in q.split()])
    return q

cnt_answers = answers.groupby('question_id').count()[['body']].rename(columns={'body': 'count_answers'})
questions = questions.join(cnt_answers)
questions['count_answers'] = questions['count_answers'].fillna(0)
questions['count_answers'] = questions['count_answers'].astype(int)

print('Questions preprocessed.')
questions.sample(1)[['title', 'body', 'processed', 'count_answers']]

Questions preprocessed.


Unnamed: 0_level_0,title,body,processed,count_answers
questions_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
55f90ef541474d6d9ad6f1564135d340,What do colleges and universities look for in applicants?,"I've always been told that you should be a ""well-rounded"" person if you want to get accepted to a good college, but that has never made sense to me. If I was a part of a college board I'd want stu...",college university look applicant always told wellrounded person accepted college never made sense part college board student excelled field passionate rather great area alright every area look co...,1


### Professionals
* **Count Answers:** Create a new column 'count_answers' for professionals
* **Cleaning the headlines:**

In [5]:
# Count Answers
pro_answers_count = answers.groupby('author_id').count()[['question_id']].rename(columns={'question_id': 'count_answers'})
professionals = professionals.join(pro_answers_count)
professionals['count_answers'] = professionals['count_answers'].fillna(0)
professionals['count_answers'] = professionals['count_answers'].astype(int)


# Cleaning the headlines
professionals['headline'] = professionals['headline'].fillna('')
professionals['headline'] = professionals['headline'].str.lower()
professionals['headline'] = professionals['headline'].str.replace('--|hello|hello!|hellofresh', '')

print('Professionals preprocessed')
professionals.sample(1)

Professionals preprocessed


Unnamed: 0_level_0,location,industry,headline,date_joined,count_answers
professionals_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
f635e282b31045819ac998a120fae0e1,"Cambridge, Massachusetts",Internet,software engineering intern at linkedin,2014-02-21 20:25:45 UTC+0000,5


## Start Modeling !

Now that we have pre-processed our data, we are ready for the modeling part.

The modeling steps are as follows:

* **Apply TF-IDF on the hole question corpus.**
* **Apply SVD to reduce the dimensionality of the vectors.**
* **Construct a Questions Similarity Matrix using The Cosine Similarity function.**


In [6]:
# TF-IDF Transformation

tfidf = TfidfVectorizer(stop_words=stop_words,)
Qs = tfidf.fit_transform(questions['processed'])
terms = tfidf.get_feature_names()
print('TF-IDF matrix shape: ', Qs.shape)
#print(f'{len(terms)} terms: ', list(pd.Series(terms).sample(10)), '...')

TF-IDF matrix shape:  (23931, 18761)


In [7]:
# SVD Transformation

NUM_TOPICS = 1000
model = TruncatedSVD(n_components=NUM_TOPICS)
model_transformer = model.fit(Qs)
Qs_transformed = model_transformer.transform(Qs)
print('Shape after Dimensionality Reduction:', Qs_transformed.shape)
del Qs

Shape after Dimensionality Reduction: (23931, 1000)


In [8]:
%%time
# Construct the Questions Similarity Matrix

Qs_sim_matrix = cosine_similarity(Qs_transformed, Qs_transformed)
print('Similarity Matrix Shape: ', Qs_sim_matrix.shape, '\n')


Similarity Matrix Shape:  (23931, 23931) 

CPU times: user 43.5 s, sys: 3.47 s, total: 46.9 s
Wall time: 19.2 s


<span style="color: blue;">Notes on production:</span>
* *Here, we use the totality of the question corpus when constructing the Similarity Matrix, in practice though, the similarity matrix will only be constructed with relatively recent questions ( last 1 or 2 years ), since old questions will not be of any use. Its construction only takes ~ 40 seconds for a ~ 24k x 24k matrix (pretty quick).*
* *When in production, the Similarity Matrix & Questions Transformed Matrix must be updated on regular basis, depending on the traffic.*

## Building the Recommendations Engine

In this section, we will build the recommendations engine from scratch using only the techniques previously talked about. Each sub-section will deal with a specific sub-problem.

There are two main data structures we will work with:
* **The Transformed Questions Matrix**: a Matrix where each row represents a single question encoded in a K dimensional vector
* **The Similarity Matrix**: a NxN Matrix (where N is the number of questions) rating the similarity between all pair of questions on a scale of 0 ~ 1.

### Professional-to-Questions Mode:
In this mode, three types of professionals can be distinguished:
* **Hot**: The professional has posted at least one answer.
* **Cold**: The professional has never posted an answer, but follows some tags.
* **Freezing**: The professional never posted an answer and doesn't follow any tags.

**How to deal with the 'Freezing' professional ?**

The 'Freezing' professional has most probably registered recently, and doesn't follow any tags. We don't know much about him, the recommendations are more like 'suggestions' with the goal of taking him to the higher categories. 

We can suggest:
* **Exploit:** session-based recommendations which recommend questions similar to questions already visited / upvoted or commented by the professional
* **Explore:** popular questions on the platform and newly created ones.


**How to deal with the 'Cold' professional ?**

Unlike the 'Freezing' professional, the 'Cold' one follows one more tags. This important hint must be fully exploited as followed:
* **Exploit:** find relatively recent questions from the tags followed.
* **Explore:** find tags similar to the followed tags and do the same.

**How to deal with the 'Hot' professional ?**

This type of professional has already expressed interest in one or more questions.
* **Exploit:** suggest similar questions to the ones answered
* **Explore:** suggest relatively recent questions from the tags followed and similar tags

### Formulas for Exploit Threshold and Question Scoring (for the 'Hot' professional):

Unlike the the two other types of professionals, the optimal Exploit Threshold for the 'Hot' professional is dynamic and changes from a professional to another. Some professionals have only answered one question, while others have answered many. Some professionals have answers which date to a relatively long time, while others have just recently answered a few. **Taking these parameters into consideration affects positively the quality of the recommendations**.

* We want our recommender to prioritize questions similar to questions recently answered by the professional
* We also don't want to completely ignore older questions.

The following formula scores the questions while capturing the notes above:

$$ score(x) = \frac{log (\frac{x}{\epsilon})}{log (\frac{1}{\epsilon})} $$
where:
- x is the number of days elapsed between the date question was answered and today
- $\epsilon$ is the maximum number of days after which we no longer consider the question to be relevant

The formula gives a score between 0~1 where 1 means that the question is very relevant and should be used as a reference.

After the questions get scored, the Exploit Threshold is calculated as follows:

$$ threshold = log (\sqrt{x} + 1) \cdot \alpha +  \epsilon $$
where:
- x is the number of recently answered questions
- $\alpha$ controls the exploitation intensity (1.35 in the implementation)
- $\epsilon$ is an optional small value term (0.1 on the implementation) added if all answered questions are old ( meaning, that x = 0, see the implementation below )

Below is the implementation of the two formulas:

In [9]:
def calculate_score_question_answered(days_elapsed):
    eps = 370
    score = np.log10(days_elapsed*(1/eps)) / (np.log10(1/eps))
    score = 0.001 if score < 0 else score # questions that got a score lower than 0 are still given a very low score
    return score

def calculate_exploit_threshold(answered_question_scores, nbr_recommendations):
    nbr_questions_answered = len([s for s in answered_question_scores if s > 0.001])
    eps = 0.1 if nbr_questions_answered == 0 else 0
    alpha = 1.35
    return np.log10(np.sqrt(nbr_questions_answered) + 1) * alpha + eps


**The next snippet of code builds the recommendation engine using two main functions:**
* get_similar_questions: returns similar questions to the one given using the similarity matrix.
* recommend_questions_to_professional: given a professional ID, returns top K recommended questions

The debug variable below if set to True  makes exploration / exploitation decisions visible.

In [10]:
debug = False

*( Feel free to check the code below collapsed )*

In [11]:
# Init important variables
today = pd.to_datetime('today')
min_date_for_questions = today - np.timedelta64(600, 'D')


def choose_random_answered_question(question_score_dic):
    random_key = choices(list(question_score_dic.keys()), list(question_score_dic.values()))[0]
    return (random_key, question_score_dic[random_key])


def choose_random_followed_tag(pro_id):
    followed_tags = tag_users[tag_users['user_id'] == pro_id]
    return followed_tags.sample(1)['tag_id'].values[0]

def get_similar_questions(qid, nbr_questions=10, except_questions_ids=[], prioritize=False):
    recommendations = pd.DataFrame([])

    #print(qid)
    q_dists_row = list(Qs_sim_matrix[questions.index.get_loc(qid)])
    for eq_id in except_questions_ids:
        #print('removing ', eq_id)
        q_dists_row.pop(questions.index.get_loc(eq_id))
    q_dists_row = pd.Series(q_dists_row).sort_values(ascending=False)[:100]
    q_dists_row = q_dists_row[1:]

    if not prioritize:
        q_dists_row = q_dists_row[:nbr_questions]
        for i, d in q_dists_row.iteritems():
            qid = questions.index.values[i]
            recommendations = recommendations.append(questions.loc[qid])
    else:
        qid_to_score = {}
        for i, d in q_dists_row.iteritems():
            qid = questions.index.values[i]
            if d > 0.4:
                #print(qid)
                q_added = questions.loc[qid, 'date_added']
                days_elapsed = (today - q_added) / np.timedelta64(1, 'D')
                qid_to_score[qid] = d * days_elapsed
        qid_scores = sorted(qid_to_score.items(), key=lambda x: x[1])[:nbr_questions]
        for qid, score in qid_scores:
            print(q_dists_row[questions.index.get_loc(qid)], qid_to_score[qid]) if debug else None
            recommendations = recommendations.append(questions.loc[qid])
    return recommendations



def recommend_questions_to_professional(pro_id, nbr_recommendations=10):
    print('Professional ID:', pro_id )

    # tags followed
    tags_followed = tag_users[tag_users['user_id'] == pro_id]['tag_id']
    tags_followed = tags[tags.index.isin(tags_followed)]
    print('Followed Tags: ', tags_followed['name'].values)

    # Number of answered questions
    cnt_pro_answers = professionals.loc[pro_id, 'count_answers']

    # Type of Start
    cold_start = (cnt_pro_answers == 0)
    freezing_start = (cold_start and len(tags_followed) == 0 )

    n = 3 # Nbr of questions per tag
    recommendations = pd.DataFrame([])


    # Freezing Start
    if freezing_start:
        print('Freezing ...')
        recommendations = recommendations.append(questions[questions['date_added'] > min_date_for_questions].sample(10))

    # Cold Start
    elif cold_start:
        print('Cold', cnt_pro_answers)

        qids_from_followed_tags  = tag_questions[tag_questions['tag_id'].isin(tags_followed.index.values)]['question_id'].values
        qids_from_followed_tags  = list(questions[(questions.index.isin(qids_from_followed_tags))   & (questions['date_added'] > min_date_for_questions)].sort_values('date_added', ascending=False).index.values)

        tags_suggested = tags[tags['processed'].isin(tags_followed['processed'].values)]
        tags_suggested = tags_suggested[~tags_suggested.index.isin(tags_followed.index.values)]
        print('Suggested Tags: ', tags_suggested['name'].values)
        suggested_tags_available = len(tags_suggested) > 0
        if suggested_tags_available:
            qids_from_suggested_tags = tag_questions[tag_questions['tag_id'].isin(tags_suggested.index.values)]['question_id'].values
            qids_from_suggested_tags = list(questions[(questions.index.isin(qids_from_followed_tags))  & (questions['date_added'] > min_date_for_questions)].sort_values('date_added', ascending=False).index.values)
            exploit_threshold = .6
        else:
            exploit_threshold = 1


        print('Exploit Threshold: ', exploit_threshold) if debug else None
        for i in range(1, nbr_recommendations+1):
            if np.random.rand() < exploit_threshold and len(qids_from_followed_tags) > 0:
                # Exploit followed tags
                print(f'{i}- Exploit followed tags') if debug else None
                random_index = choice(qids_from_followed_tags)
                q = questions.loc[random_index]
                recommendations = recommendations.append(q)
                qids_from_followed_tags.remove(random_index)
            elif suggested_tags_available and len(qids_from_suggested_tags) > 0:
                # Suggest from suggested tags
                print(f'{i}- Explore suggested tags') if debug else None
                random_index = choice(qids_from_suggested_tags)
                q = questions.loc[random_index]
                recommendations = recommendations.append(q)
                qids_from_suggested_tags.remove(random_index)
            else:
                # no more questions from the pool
                pass

    # Hot Start
    else:
        questions_answered_ids = list(answers[answers['author_id'] == pro_id]['question_id'].values)
        questions_answered = questions[questions.index.isin(questions_answered_ids)].sort_values('date_added', ascending=False)
        questions_answered_locs = []
        for qid in questions_answered_ids:
            questions_answered_locs.append(questions.index.get_loc(qid))
        
        print('Hot, Answered Questions: ', cnt_pro_answers)
        #print(questions_answered_locs)
        display(questions_answered[['date_added', 'processed', 'count_answers']])
        
        # calculate questions scores
        q_scores = {}
        for i, q in questions_answered.iterrows():
            days_elapsed = (today - q['date_added'])/np.timedelta64(1, 'D')
            q_scores[i] = calculate_score_question_answered(days_elapsed)
        print(q_scores) if debug else None

        # calculate exploit_threshold
        exploit_threshold = calculate_exploit_threshold(list(q_scores.values()), nbr_recommendations)
        print('Exploit Threshold:', exploit_threshold) if debug else None
        except_qs = []
        except_qs += questions_answered_ids
        for i in range(nbr_recommendations):

            if np.random.rand() < exploit_threshold:
                # Exploit
                random_q_score = choose_random_answered_question(q_scores)
                print('\nExploit Question', random_q_score) if debug else None
                recommendations = recommendations.append(get_similar_questions(random_q_score[0], nbr_questions=1, except_questions_ids=except_qs, prioritize=True))
            else:
                # Explore
                latest_questions = pd.DataFrame([])
                for tid in tags_followed.index.values:
                    qids = tag_questions[tag_questions['tag_id'] == tid]['question_id'].values
                    tag_qs = questions[questions.index.isin(qids)]
                    tag_qs = tag_qs[~tag_qs.index.isin(except_qs)]
                    if len(tag_qs) > 0:
                        tag_qs = tag_qs.sort_values('date_added', ascending=False)
                        latest_questions = latest_questions.append(tag_qs.head(3))
                #display(latest_questions)
                best_question_id = 0
                best_distance = float('-inf')
                for qid, r in latest_questions.iterrows():
                    qloc = questions.index.get_loc(qid)
                    for aqloc in questions_answered_locs:
                        d = Qs_sim_matrix[qloc, aqloc]
                        if best_question_id == 0 or d > best_distance:
                            best_question_id = qid
                            best_distance = d

                print('\nExplore Tags', best_question_id, best_distance) if debug else None
                if best_question_id != 0:
                    recommendations = recommendations.append(questions.loc[best_question_id])
            except_qs = list(recommendations.index.values)
            except_qs += questions_answered_ids

    return recommendations

* Testing the recommender

In [12]:
# Random Hot Professional
random_hot_pro_id = professionals[(professionals['count_answers'] > 2) & (professionals['count_answers'] < 5)].sample(1).index.values[0]

# Random Cold Professional
random_cold_pro_id = professionals[professionals['count_answers'] == 0].sample(1).index.values[0]

#for random_pro_id in [random_hot_pro_id, random_cold_pro_id]:
for random_pro_id in [random_hot_pro_id, random_cold_pro_id]:
    recs = recommend_questions_to_professional(random_pro_id, nbr_recommendations=10)
    print('Recommendations: ')
    display(recs[['date_added', 'title',]]) if len(recs) > 0 else None
    

Professional ID: a5be6c7b5cd64ab984c4609654db33d5
Followed Tags:  ['law-practice']
Hot, Answered Questions:  3


Unnamed: 0_level_0,date_added,processed,count_answers
questions_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2b6c64ebebba4e5e8a42f6a503b153cf,2017-06-23 03:36:12,method studying sat act time planning take act september sat october study simultaneously receive score possibly advice done college testing ivyleague sat act,6
1652e344c48d417dbaf60423419af230,2017-06-22 11:23:42,career available prevention human trafficking high school student currently enrolled collegiate high school working towards interested fighting sex trafficking interested option far field go major...,10
1eb92e311dbd420cb848d1e577009726,2017-03-21 13:19:21,studying post first degree law possible arears focus applying post first degree law school year learn new thing believe communication debatable skill sure expect entrance exam interview really app...,3


Recommendations: 


Unnamed: 0,date_added,title
1bed152bd4ce420499cc052799f0ffa3,2019-01-17 19:39:40,Need help deciding which program should I choose Social work or Community worker?
f40dc76a7a5844f3bc71da9f391688e6,2018-07-13 19:46:08,what is law school like an how does it work?
360ca20467f9477791d71082670b3506,2018-07-16 20:46:15,Do you need a law degree to become a lawyer?
336a3900441742f69d79c45f0a5f601d,2018-08-04 15:06:53,What are the best tips to know to become an assistant district attorney?
5bc0a66a6c004dceb8934df95f15bb4a,2018-05-07 23:56:49,What is it like getting out of law school and trying to get a job?
dd78646277de4461970f7d4acfcc5972,2018-05-29 16:37:15,How does a lawyer separate personal values from the law when prosecuting or defending?
1c0bab6c198045b8a344b2b40b58717f,2018-06-19 21:01:01,Should I go to law school?
ac7997b2cded4e7d87a9ff75acb2cc7c,2018-04-19 20:06:05,What courses are needed to gain a law degree
9503dee1f2f748ada2dfc5d7dbb1a33d,2018-04-03 21:43:19,Is it hard for a foreigner that went to American Law School to find work?
2ae488d2353742a086afa639a94009e9,2018-03-27 15:48:58,Which law schools have notable programs in international human rights law?


Professional ID: 6277a1f1a2644c3695e8ca23ea1c4c50
Followed Tags:  ['education-management'
 '#readingspecialist-#educator-#literacyspecialist-#teacherofadults']
Cold 0
Suggested Tags:  ['educationmanagement']
Recommendations: 


Unnamed: 0,date_added,title
2e25ea09c6d842b0adfafe13ebb4b47c,2017-10-12 17:54:41,What are the most lucrative jobs in the Nonprofit sector?
c1d01dce85fe4778a27387dd9b22773c,2018-03-20 17:58:04,What is it like to start your own school?
164522e7595649729deebf48cad87e1b,2018-01-30 03:16:43,What helped you decide on your major?
2e25ea09c6d842b0adfafe13ebb4b47c,2017-10-12 17:54:41,What are the most lucrative jobs in the Nonprofit sector?
e1860d4512b746a19270e5675efb7b44,2018-02-03 01:09:21,What are some of the most flexible majors?
dd71db6e5a724e019f385a702fde87d5,2017-08-31 18:58:48,How do you manage your classroom and students?
476f5056ac1d4d149194da020cea2c43,2018-01-24 03:41:44,"Would it be rude to tell my teacher to stop smoking, if not, how can I put it as gently as possible?"
4c8d994d667d4f84828ceb09fc7715e9,2018-01-12 03:39:30,"Growing up abroad, all my teachers were international teachers from England, but my question is how does one become an international teacher from America?"
fbe15b7b17ee4333ad2425bc426eb295,2018-01-17 07:24:37,What is the Best Road to Success in the Educational Field?
e1860d4512b746a19270e5675efb7b44,2018-02-03 01:09:21,What are some of the most flexible majors?


### Question-to-Professionals mode:

Given a question, recommend the top K professionals to answer. This mode is used for the 'Immediate' folks.

The approach taken is straightforward: recommend professionals who answered similar questions to the one given.

In [13]:
def recommend_professionals_for_question(qid, nbr_recommendations=10):
    similar_questions = get_similar_questions(qid, nbr_questions=10, except_questions_ids=[], prioritize=False)
    #display(similar_questions)
    answer_author_ids = answers[answers['question_id'].isin(similar_questions.index.values)]['author_id'].values
    top_authors_ids = pd.Series(answer_author_ids).value_counts(ascending=False)[:nbr_recommendations].index.values
    
    
    return professionals[professionals.index.isin(top_authors_ids)]

* Testing the recommender:

In [14]:
random_question_index = choice(questions.index.values)
print('Random Question: ', questions.loc[random_question_index]['date_added'])
print(questions.loc[random_question_index]['title'])
print(questions.loc[random_question_index]['body'])
recommend_professionals_for_question(random_question_index, nbr_recommendations=5)

Random Question:  2018-03-26 02:37:55
Which careers should I look into?
My interests include linguistics, english, writing, editing, and psychology. Additionally, I really enjoy learning new things and researching. I also enjoy explaining things to others and tutoring. Are there any careers out there that might suit these interests?
#linguistics #english #writing-and-editing #research #teaching


Unnamed: 0_level_0,location,industry,headline,date_joined,count_answers
professionals_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
9798c8df06244283ba14d30978ec318a,"Boston, Massachusetts",,senior linguistics developer at luminoso,2015-03-02 22:35:27 UTC+0000,13
36ff3b3666df400f956f8335cf53e09e,"Cleveland, Ohio",Mental Health Care,assist with recognizing and developing potential,2015-10-19 20:56:49 UTC+0000,1710
507e62b41e3a4410837fbdb760a6c946,"Woodbridge Township, New Jersey",,,2017-05-10 15:18:13 UTC+0000,14
9bf67236d34743768be67bd789dc618e,"Pomona, California",Higher Education,retention and graduation specialist,2018-07-07 17:13:04 UTC+0000,37
e77127a9fc4948a89e65835d410804bf,,IT,product manager,2018-11-14 11:25:22 UTC+0000,1


### Helper functions in production:

#### Get Tag Suggestions for a new question:
It is very important to control tags, and use existing ones when possible. The following function suggests tags for a new question:

In [15]:
# Analyze processed question and extracts implicit tags ( eg. 'computer science' => 'computerscience')
def get_tag_suggestions(q_p):
    #q_p = preprocess_question(q)
    #print(q_p)
    q_tokens = nltk.word_tokenize(q_p)
    q_tokens_cpy = q_tokens.copy()
    
    qp_tagged = nltk.pos_tag(q_tokens)
    important = []
    for t,pos in qp_tagged:
        if t not in stop_words and pos == 'NN' and len(tags[tags['processed'] == t]) > 0 :
            i = q_tokens.index(t)
            #print(len(q_p), t, i)
            poses_before_after = []
            if i > 0:
                poses_before_after.append(nltk.pos_tag([q_tokens[i-1]])[0])
            if i < (len(q_tokens)-1):
                poses_before_after.append(nltk.pos_tag([q_tokens[i+1]])[0])
            for i, bf in enumerate(poses_before_after):
                #print(t, bf)
                if bf[1] in ['NN', 'NNS', 'JJ', 'JJR', 'VBG']:
                    s = f'{t}{bf[0]}' if i == 1 else f'{bf[0]}{t}'
                    important.append(s)
            q_tokens.remove(t)
    important = set(important)
    for i in set(important):
        if i not in tags['processed'].values or i in q_tokens_cpy:
            important.remove(i)
    #print(len(important),important)
    return important

Example: 

In [16]:
new_question = 'I am a student in computer science and I want to be a data scientist but I dont now how to study machine learning and artificial intelligence.' 
p_q = preprocess_question(new_question)
suggestions = get_tag_suggestions(p_q)
print('Question: ', new_question)
print('Suggestions: ', suggestions)

Question:  I am a student in computer science and I want to be a data scientist but I dont now how to study machine learning and artificial intelligence.
Suggestions:  {'computerscience', 'artificialintelligence', 'machinelearning'}


#### Function to add a new question to DB:

In [17]:
# Generate a random index for adding a question to DB
def gen_test_index():
    length = np.random.randint(10,15)
    letters_digits = string.ascii_lowercase + string.digits
    return ''.join(random.sample(letters_digits, length))


def add_question_to_db(title, body):
    global Qs_transformed
    global Qs_sim_matrix
    q = title + ' ' + body
    q_p = preprocess_question(q)
    
    tag_suggestions = get_tag_suggestions(q_p)
    q_p = q_p + ' ' + ' '.join(tag_suggestions)
    
    print(q_p)
    
    author_id = 1 # special if for test ( doesn't exist in DB )
    index = gen_test_index()
    questions.loc[index] = {'author_id': author_id,'date_added': datetime.datetime.now(),
                  'title': title, 'body': body, 'processed': q_p}
    
    q_transformed = model_transformer.transform(tfidf.transform([q_p]))
    Qs_transformed = np.append(Qs_transformed, [q_transformed[0]], axis=0)
    #print(Qs_transformed.shape)
    
    sim_mat_shape = Qs_sim_matrix.shape
    #print('Qs_sim_mat shape', sim_mat_shape)
    new_sims = cosine_similarity(Qs_transformed[-1].reshape(1,-1), Qs_transformed)[0]
    #print('new_sims', new_sims.shape)
    Qs_sim_matrix = np.hstack((Qs_sim_matrix, np.zeros((sim_mat_shape[0], 1))))
    Qs_sim_matrix = np.vstack((Qs_sim_matrix, np.zeros((sim_mat_shape[0]+1))))
    Qs_sim_matrix[-1] = new_sims
    Qs_sim_matrix[:, -1] = new_sims
    #print(Qs_sim_matrix.shape)
    return index


## Summary and Future Explorations

The proposed system's strengths are **its effectiveness, ease of implementation and ease of maintainance in production.**
It uses controlled randomness to encourage new users while keeping engaged professionals in the platform.

Some recommendations for future improvements: 
* Valuable information can be obtained when users open a browsing session ( viewed questions ... ).
* Preprocessing of tags and preventing the students from using tags which already exist.
* Using recent questions ( last 2 years ) for the similarity matrix.
* Using a auto-correction system to reduce the typos made by students.
* Using the answers in the model.
* Updating the system to decide when to email each professional depending on his interaction with the site.

#### I hope that this Kernel was useful, and see you in the <a href="https://www.kaggle.com/hamzael1/kernels" target="_blank">next one</a> !

*PS: upvotes & feedback are welcome !*