# IE MBD APR 2020: NLP Group Project (Group D)

## Topic Modelling the Quora Question Bank using LDA (Latent Dirichlet Allocation)

### Group D:

+ Alain Grullón
+ Alexandre Bouamama
+ Guillermo Germade
+ Rebecca Rosser
+ Roberto Picón
+ Tarek El Noury

## Objective: 

Create a proof of concept (POC) for Icanhelp, a startup that aims to connect youngsters who publish a message calling for help with a specific personal or academic issue with other youngsters who are able to help in that particular matter.

To perform the POC, we have divided the project into two parts: 

#### Part 1: NLP – Topic Modelling: 

Using the Quora dataset, categorize documents (questions) into topic clusters. How? 
   + Preprocessing for **tokenization, lemmatization, stemming** and **removal of stop words**
   + Creating a **dictionary** of tuples containing unique tokens and IDs
   + Converting processed documents into **Bag of Words**, and **TF-IDF** formats 
   + Deploying **Latent Dirichlet Allocation (LDA)** models for both BoW and tf-idf formats. 

As a result of this process, we hope to obtain distinctive topic clusters to categorize questions while making business sense.
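To make the dictionary, Bag of Words and TF-IDF steps listed above more concrete, here is a minimal, dependency-free sketch on three made-up, already-tokenized questions. The function names (`build_dictionary`, `to_bow`, `to_tfidf`) and the toy documents are hypothetical and are not part of the notebook. The weighting shown (raw count times log2 of N/df, then L2-normalized) is one common TF-IDF variant and should not be read as an exact reproduction of any library's output.

```python
import math
from collections import Counter

# Toy, already-preprocessed questions (hypothetical data)
docs = [
    ["learn", "english", "languag"],
    ["learn", "math", "exam"],
    ["job", "engin", "exam", "exam"],
]

def build_dictionary(tokenized_docs):
    """Map each unique token to an integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

def to_tfidf(bow, doc_freq, n_docs):
    """Weight counts by log2(n_docs / doc_freq), then L2-normalize."""
    weights = [(i, c * math.log2(n_docs / doc_freq[i])) for i, c in bow]
    length = math.sqrt(sum(w * w for _, w in weights))
    if length == 0:
        return []
    return [(i, w / length) for i, w in weights if w > 0]

token2id = build_dictionary(docs)
bow_corpus = [to_bow(doc, token2id) for doc in docs]

# Document frequency: number of documents containing each token id
doc_freq = Counter(i for bow in bow_corpus for i, _ in bow)
tfidf_corpus = [to_tfidf(bow, doc_freq, len(docs)) for bow in bow_corpus]

print(bow_corpus[2])    # (token_id, count) pairs for the third question
print(tfidf_corpus[2])  # the same question with TF-IDF weights
```

In the third toy question, "exam" appears twice but also occurs in another document, so its TF-IDF weight ends up below that of the rarer tokens "job" and "engin".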
    
#### Part 2: Recommendation – Content Based: 

Based on the topics obtained in the previous phase, we will leverage the Young People Survey dataset, matching its groups of variables with the NLP topics.

## Datasets: 

+ Quora Question Pairs, https://www.kaggle.com/c/quora-question-pairs (2016)
+ Young People Survey, https://www.kaggle.com/miroslavsabo/young-people-survey (2016)

### Quora Question Pairs (2016)

The chosen dataset contains questions from the popular question-forum site Quora, which we believe is a good proxy for our idea of an application where users can post questions to receive help from experts, who in turn are incentivized to help as a means of giving back to the community. 

We did some research into the nature of these questions in order to identify possible biases for our topic modelling task. One important demographic is the geographic origin of the questions:

+ United States: 34.9%
+ India: 22.2%
+ UK: 4.9%

Source: https://foundationinc.co/lab/quora-statistics/

In [None]:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
import pandas as pd
import numpy as np

In [None]:
#Importing Young People Survey's pre-processed answers
preferences_link = "https://drive.google.com/file/d/10aIxrEk6CiREaTGwIlKMUabHIjoYfc1T/view?usp=sharing"
preferences_id = "10aIxrEk6CiREaTGwIlKMUabHIjoYfc1T"
downloaded = drive.CreateFile({'id':preferences_id}) 
downloaded.GetContentFile('preferences.csv')  
user_prefs = pd.read_csv('preferences.csv')
print("User preferences are now stored in a DataFrame: user_prefs ")


User preferences are now stored in a DataFrame: user_prefs 


The user_prefs dataset simulates the result of a small user-submitted survey that would be filled in when logging into the app for the first time.
We therefore shaped the user_prefs dataset to contain 8 columns with ratings from 1 to 5, which keeps the survey plain and simple for users. The ratings are normalized before they are passed to the recommendation engine function.
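As a rough illustration of that normalization, the sketch below applies min-max rescaling to a single column of 1 to 5 ratings. The helper name `min_max_scale` and the sample ratings are hypothetical; the only assumption is that each survey column is rescaled independently to the [0, 1] range.

```python
def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]

# Hypothetical answers to one survey item on the 1-5 scale
ratings = [3, 1, 5, 2, 4]
scaled = min_max_scale(ratings)
print(scaled)  # [0.5, 0.0, 1.0, 0.25, 0.75]
```

When a column contains both a 1 and a 5, every rating maps to one of 0, 0.25, 0.5, 0.75 or 1.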

In [None]:
#Importing the train dataset
train_link = "https://drive.google.com/file/d/1lLIUXrLVyRO9TsNnjP8okwEyBLQya-KF/view?usp=sharing"
train_link_id = '1lLIUXrLVyRO9TsNnjP8okwEyBLQya-KF'
downloaded = drive.CreateFile({'id':train_link_id}) 
downloaded.GetContentFile('train.csv')  
df1 = pd.read_csv('train.csv')
print("Data stored in a DataFrame: df1 ")

Data stored in a DataFrame: df1 


In [None]:
#This one takes a bit to load, as it contains a ~500mb csv file

#Importing the test dataset
test_link = "https://drive.google.com/file/d/1MMwo2euSOJ8OT56y5KUjzS4_RSqU3XN2/view?usp=sharing"
test_link_id = "1MMwo2euSOJ8OT56y5KUjzS4_RSqU3XN2"
downloaded2 = drive.CreateFile({'id':test_link_id}) 
downloaded2.GetContentFile('test.csv')  
df2 = pd.read_csv('test.csv', encoding = 'utf-8', engine = 'python')
print("Data stored in a DataFrame: df2 ")

Data stored in a DataFrame: df2 


For context, Quora's Kaggle competition consisted of developing an algorithm to detect whether the two questions in each pair (two contiguous columns) have the same meaning. To this end, two datasets were provided, one for training and another for testing. 

For the purposes of the challenge, the training set includes an `is_duplicate` column, which marks pairs of duplicate questions with a 1 and pairs that are not duplicates with a 0.


## Data Cleaning
The following steps wrangle both Quora question datasets, repurposing them into a single one that we can use for topic modelling.

Only the non-duplicate questions are maintained.
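The selection rule can be sketched in plain Python on a few made-up rows. The rows and the helper `collect_questions` are hypothetical, and lists are used instead of DataFrames only so that the example runs without dependencies: every first question is kept, second questions are added only from pairs flagged as non-duplicates, and exact repeats are then dropped.

```python
# Hypothetical rows shaped like the training set: (question1, question2, is_duplicate)
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I bake bread?", 0),
    ("Why is the sky blue?", "How do I bake bread?", 0),
]

def collect_questions(rows):
    """Keep every question1, plus question2 from non-duplicate pairs, once each."""
    first = [q1 for q1, _, _ in rows]
    second = [q2 for _, q2, dup in rows if dup == 0]
    # dict.fromkeys drops exact repeats while preserving first-seen order
    return list(dict.fromkeys(first + second))

questions = collect_questions(pairs)
print(questions)
```

The second question of the duplicate pair is left out because it repeats the meaning of a question that is already kept.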



In [None]:
df1 = df1.drop(['id', 'qid1', 'qid2'], axis = 1)

In [None]:
# Keeping question2 only from the pairs that are not duplicates
df1_q2_not_duplicates = df1.loc[df1.is_duplicate == 0,'question2']
print(df1_q2_not_duplicates.count())

255025


In [None]:
# pd.concat is used because Series.append was removed in pandas 2.0
df1_augmented = pd.concat([df1.question1, df1_q2_not_duplicates])
df1_augmented = df1_augmented.drop_duplicates()

In [None]:
final_dataset = pd.concat([df1_augmented, df2.question1]).drop_duplicates()

In [None]:
final_df = final_dataset.to_frame(name="question").reset_index(drop=True)

In [None]:
documents = final_df
documents.head()

Unnamed: 0,question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [None]:
documents.count()

question    2631788
dtype: int64

In [None]:
# Freeing up space by deleting the variables we will not use anymore:
del df1, df2, df1_q2_not_duplicates, df1_augmented, final_df

## Data Preprocessing



+ **Tokenization**: Split the questions into words, splitting by whitespace ' '.

+ **Question Selection**: We will observe the distribution of tokens in the questions dataset and drop questions with relatively few tokens, as they carry less information for the LDA to be accurate and are also less likely to be representative of people seeking help, which is our ultimate goal for questions in our app. 

  We will also remove words that carry a high Quora-related bias and add noise to the topic modelling.

+ **Dealing with Null values**: We will take care of them by simply dropping them, as we do not need them since we have enough data for our purpose of finding topic clusters to categorize the Quora Questions.

+ **Stopword removal**: Stopwords are removed, as are words with fewer than 3 characters, even if they are not in the gensim list of stopwords.

+ **Lemmatization**: words in third person are changed to first person and verbs in past and future tenses are changed into present.

+ **Stemming**: words are reduced to their root form.
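The selection and filtering steps above can be sketched without any NLP library. This is only an illustration under stated assumptions: the stopword set `STOPWORDS_SKETCH` is a tiny hypothetical stand-in for the gensim list, the helper names are made up, and lemmatization and stemming are deliberately left out because they depend on NLTK resources.

```python
import re

# Tiny illustrative stopword list (hypothetical stand-in for the gensim list)
STOPWORDS_SKETCH = {"the", "and", "what", "how", "for", "are", "with", "can"}
MIN_TOKENS = 12       # questions with fewer whitespace tokens are discarded
MIN_TOKEN_LENGTH = 3  # shorter words are dropped

def long_enough(question, min_tokens=MIN_TOKENS):
    """Question selection: count the tokens produced by splitting on spaces."""
    return len(question.split(" ")) >= min_tokens

def clean_tokens(question):
    """Lowercase, keep alphabetic words, drop stopwords and short words."""
    words = re.findall(r"[a-z]+", question.lower())
    return [
        w for w in words
        if w not in STOPWORDS_SKETCH and len(w) >= MIN_TOKEN_LENGTH
    ]

questions = [
    "What is AI?",
    "How can I prepare for the final exams and still find time to rest with my friends?",
]
kept = [q for q in questions if long_enough(q)]
cleaned = [clean_tokens(q) for q in kept]
print(cleaned)
```

Only the second question survives the length filter, and its cleaned form keeps the content-bearing words such as "prepare", "exams" and "friends".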

In [None]:
# Tokenizing by splitting questions using whitespace: ' ' 
tokens = []
for doc in documents["question"].apply(str):
    tokens.append(doc.split(' '))

In [None]:
# Adding the tokens column to the DataFrame
documents["tokens"] = tokens

In [None]:
# Adding an additional column to measure the count of tokens per question (length of lists, or count of items in lists)
documents["tokens_cnt"] = documents.tokens.apply(lambda x: len(x))

In [None]:
# Dropping the null values
documents = documents.dropna()

# Dropping the questions with < 12 tokens
documents = documents[~(documents.tokens_cnt < 12)]

In [None]:
# resetting the DataFrame index as well as the index column
documents = documents.reset_index(drop=1)

In [None]:
# Counting the remaining rows
documents.count() #Verifying that we have the same number as in the other notebook

question      929975
tokens        929975
tokens_cnt    929975
dtype: int64

In [None]:
#Loading gensim and nltk libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
stemmer = SnowballStemmer('english')
from nltk.stem.porter import *
import numpy as np
np.random.seed(2020)
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
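# Lemmatize a token as a verb, then stem it
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Reuse the stemmer created above instead of building a new one per token
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))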

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2:
            result.append(lemmatize_stemming(token))
    return result

In [None]:
processed_docs = documents['question'].apply(preprocess)
processed_docs[:10]

0     [step, step, guid, invest, share, market, india]
1             [increas, speed, internet, connect, vpn]
2    [dissolv, water, quik, sugar, salt, methan, ca...
3     [astrolog, capricorn, sun, cap, moon, cap, rise]
4    [law, chang, status, student, visa, green, car...
5    [trump, presid, mean, current, intern, master,...
6                    [girl, want, friend, guy, reject]
7    [quora, user, post, question, readili, answer,...
8                    [mean, time, look, clock, number]
9        [tip, make, job, interview, process, medicin]
Name: question, dtype: object

In [None]:
dict1 = gensim.corpora.Dictionary(processed_docs)

In [None]:
def dict2df(gensim_dict):
  """ 
      Creates a DataFrame using a gensim dictionary, with columns:

      "id" (int): unique id that identifies the token in the gensim dict
      "token" (str): the string of the token (word) from the gensim dict
      "docfreq" (int): the # of docs in the corpus that contain each token

  """

  temp_tokens = []
  temp_ids = []
  
  for k, v in gensim_dict.token2id.items():
    temp_tokens.append(k)
    temp_ids.append(v)

  temp_docfreq= []
  temp_idx = []

  for k, v in gensim_dict.dfs.items():
    temp_docfreq.append(v)
    temp_idx.append(k)

  temp_cols1 = {"id": temp_ids, "token": temp_tokens}
  temp_cols2 = {"id": temp_idx, "docfreq": temp_docfreq}

  temp_df1 = pd.DataFrame(temp_cols1)
  temp_df2 = pd.DataFrame(temp_cols2)

  return temp_df1.merge(temp_df2, on="id")

In [None]:
dict1.filter_extremes(no_below=11, no_above = 0.5)

In [None]:
dict1df = dict2df(dict1)
dict1df["pct_docs"] = dict1df.docfreq/processed_docs.count()

In [None]:
# Removes highly frequent, non-distinctive words that add noise to the LDA modelling.
noisy_tokens = ["quora", "day", "go", "new", "best", "india", "indian", "good",
                "like", "year", "thing", "peopl", "know", "time", "better", "way",
                "use", "get", "mean", "differ", "want", "think"]
dict1.filter_tokens(bad_ids=list(dict1df.id[dict1df.token.isin(noisy_tokens)]))


In [None]:
dict1df = dict2df(dict1)
dict1df["pct_docs"] = dict1df.docfreq/processed_docs.count()
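The vocabulary pruning performed in the cells above combines two ideas: document-frequency thresholds and a hand-picked list of noisy tokens. The sketch below shows that kind of rule on a toy corpus. The function `prune_vocabulary`, its `banned` parameter and the data are hypothetical, and the thresholds are much smaller than the notebook's so that the effect is visible on four documents.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, no_below, no_above, banned=()):
    """Return tokens that pass document-frequency thresholds and are not banned.

    no_below: minimum number of documents a token must appear in
    no_above: maximum fraction of documents a token may appear in
    banned:   tokens removed by hand regardless of frequency
    """
    n_docs = len(tokenized_docs)
    # Count each token once per document
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / n_docs <= no_above and token not in banned
    }

# Hypothetical toy corpus
docs = [
    ["quora", "exam", "prepar"],
    ["quora", "exam", "job"],
    ["quora", "job", "engin"],
    ["quora", "love", "feel"],
]
vocab = prune_vocabulary(docs, no_below=2, no_above=0.5, banned={"job"})
print(sorted(vocab))  # ['exam']
```

Here "quora" is dropped for appearing in every document, the single-occurrence tokens are dropped for being too rare, and "job" is dropped because it is on the manual list.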

### Topic modelling with Latent Dirichlet Allocation (LDA) using Bag of Words (BoW) and Term Frequency - Inverse Document Frequency (TF-IDF)

Bag of Words Representation

In [None]:
bow_corpusA = [dict1.doc2bow(doc) for doc in processed_docs]

In [None]:
from gensim import corpora, models
import time
import math

tfidfA = models.TfidfModel(bow_corpusA)
corpus_tfidfA = tfidfA[bow_corpusA]

In [None]:
#bag of words
start = time.time()

lda_modelA_bow = gensim.models.LdaMulticore(bow_corpusA, num_topics=8, id2word=dict1, passes=2, workers=7, random_state = 0)

end = time.time()

In [None]:
print("Total time:", math.floor((end-start)/60), "minutes and", round((end-start)%60, 2), "seconds")

Total time: 7 minutes and 57.13 seconds


In [None]:
for idx, topic in lda_modelA_bow.print_topics(-1): 
    print('Topic: {} \nWords: {}'.format(idx, topic), "\n")
    # the -1 instructs the display of "all" the topic clusters, in this case 8

Topic: 0 
Words: 0.025*"learn" + 0.017*"studi" + 0.016*"jee" + 0.016*"english" + 0.014*"main" + 0.013*"languag" + 0.010*"word" + 0.009*"improv" + 0.008*"month" + 0.008*"board" 

Topic: 1 
Words: 0.019*"feel" + 0.016*"life" + 0.015*"girl" + 0.013*"love" + 0.011*"old" + 0.009*"person" + 0.008*"live" + 0.008*"tell" + 0.008*"guy" + 0.007*"sex" 

Topic: 2 
Words: 0.030*"engin" + 0.020*"job" + 0.016*"work" + 0.014*"compani" + 0.012*"busi" + 0.010*"scienc" + 0.009*"start" + 0.009*"softwar" + 0.009*"develop" + 0.008*"student" 

Topic: 3 
Words: 0.024*"book" + 0.021*"account" + 0.018*"bank" + 0.016*"prepar" + 0.016*"exam" + 0.014*"major" + 0.014*"read" + 0.012*"instagram" + 0.012*"employe" + 0.011*"write" 

Topic: 4 
Words: 0.021*"colleg" + 0.019*"univers" + 0.015*"school" + 0.015*"math" + 0.013*"score" + 0.013*"state" + 0.012*"student" + 0.012*"cultur" + 0.011*"class" + 0.011*"rank" 

Topic: 5 
Words: 0.021*"question" + 0.019*"phone" + 0.019*"number" + 0.014*"app" + 0.014*"ask" + 0.013*"answer

Term Frequency - Inverse Document Frequency

In [None]:
#tf-idf
start1 = time.time()

lda_modelA_tfidf = gensim.models.LdaMulticore(corpus_tfidfA, num_topics=8, id2word=dict1, passes=2, workers=7, random_state = 0)

end1 = time.time()

In [None]:
print("Total time:", math.floor((end1-start1)/60), "minutes and", round((end1-start1)%60, 2), "seconds")

Total time: 7 minutes and 46.19 seconds


In [None]:
for idx, topic in lda_modelA_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic), "\n")

Topic: 0 Word: 0.013*"jee" + 0.012*"studi" + 0.011*"rank" + 0.010*"main" + 0.009*"mark" + 0.008*"prepar" + 0.008*"exam" + 0.008*"learn" + 0.008*"score" + 0.007*"board" 

Topic: 1 Word: 0.015*"girl" + 0.014*"love" + 0.012*"feel" + 0.010*"life" + 0.009*"friend" + 0.009*"guy" + 0.007*"old" + 0.006*"sex" + 0.006*"girlfriend" + 0.006*"tell" 

Topic: 2 Word: 0.016*"engin" + 0.012*"job" + 0.007*"compani" + 0.007*"scienc" + 0.007*"work" + 0.007*"softwar" + 0.006*"develop" + 0.006*"mechan" + 0.006*"student" + 0.006*"busi" 

Topic: 3 Word: 0.014*"employe" + 0.012*"bank" + 0.009*"book" + 0.008*"major" + 0.007*"account" + 0.007*"card" + 0.006*"read" + 0.005*"hotel" + 0.005*"univers" + 0.004*"write" 

Topic: 4 Word: 0.010*"cultur" + 0.009*"presid" + 0.008*"school" + 0.007*"math" + 0.007*"trump" + 0.006*"univers" + 0.006*"english" + 0.006*"student" + 0.006*"state" + 0.006*"visa" 

Topic: 5 Word: 0.013*"phone" + 0.012*"question" + 0.010*"answer" + 0.009*"note" + 0.008*"ask" + 0.007*"android" + 0.007*

Formatting a table with LDA topics

In [None]:
# Getting 5000 random question numbers from the documents DF
rand5000 = documents.sample(5000, random_state=0).index.values

# Using the 5000 random question numbers as an index to build a DF
# containing the original document, the preprocessed document, the BoW
# representation as well as the topic modelling percentages
sample_documents = documents.question[rand5000]
sample_processed_docs = processed_docs[rand5000]
sample_bow_corpus = pd.Series([bow_corpusA[num] for num in rand5000], index = rand5000)

# Creating the combined DF for the 5000 sampled questions
sample_tfidf_df = pd.DataFrame([sample_documents, sample_processed_docs, sample_bow_corpus], index = ["question", "preprocessed", "bagofwords"]).transpose()

In [None]:
# Adding the TF-IDF LDA Model's Topic % predictions per topic per question
# to the DataFrame

# Step 1: Create Empty Lists with the topics
topic_0 = []
topic_1 = []
topic_2 = []
topic_3 = []
topic_4 = []
topic_5 = []
topic_6 = []
topic_7 = []

# Step 2: Wrap them in an iterable
topics = [topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7]

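# Step 3: Make a nested for loop to populate topics with respective values
for topic_num, topic in enumerate(topics):
  for bow in sample_tfidf_df.bagofwords:
    # The model returns (topic_id, probability) pairs and may leave out
    # low-probability topics, so look each topic up by its id (not by its
    # position in the list) and store 0.0 when it is missing
    topic.append(dict(lda_modelA_tfidf[bow]).get(topic_num, 0.0))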

# Step 4: Create the new columns for the topic %s
sample_tfidf_df["topic_0"] = topic_0
sample_tfidf_df["topic_1"] = topic_1
sample_tfidf_df["topic_2"] = topic_2
sample_tfidf_df["topic_3"] = topic_3
sample_tfidf_df["topic_4"] = topic_4
sample_tfidf_df["topic_5"] = topic_5
sample_tfidf_df["topic_6"] = topic_6
sample_tfidf_df["topic_7"] = topic_7

#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>#
# Note: We used this for loop instead of a list comprehension because
# we noticed that, for some questions with a very high % prediction of
# belonging to a particular topic, the lda model does not output the
# %s for the rest of the topics. To deal with this, we store 0s for
# the topics that are not output.
#
# This, however, creates another problem, which is that the total %
# across topics for those questions would not sum to 1, but we
# will solve that issue in the following step.
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>#

#Step 5: Drop preprocessing and bagofwords columns as they're not needed

sample_tfidf_df = sample_tfidf_df.drop(["preprocessed", 
                                        "bagofwords"], axis = 1)

In [None]:
sample_tfidf_df.head()

Unnamed: 0,question,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7
656605,I have a Moto G with an 2006 version 4.4.4. Do...,0.01564,0.01563,0.015636,0.015631,0.015633,0.890566,0.015632,0.015632
636375,How do I start and what do I say while anchori...,0.011374,0.011374,0.245379,0.011375,0.686354,0.011371,0.011373,0.011388
648289,What are possible causes of a right side stitc...,0.013894,0.013905,0.013905,0.013893,0.013896,0.155062,0.761307,0.013898
371971,What should I watch do to know what a girl fee...,0.031257,0.781099,0.031252,0.031257,0.031254,0.031314,0.031266,0.031298
298644,"As a student, should I write a thank you email...",0.017873,0.017867,0.017871,0.017877,0.874819,0.017937,0.017896,0.017869
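The note in the cell above mentions that some topics may be missing from the model's output, so the stored values may not sum to 1. A minimal sketch of one way to handle both points is shown below: it turns sparse `(topic_id, probability)` pairs into a fixed-length vector and optionally rescales it. The helpers `to_dense` and `renormalize` and the sample pairs are hypothetical and are not taken from the notebook.

```python
def to_dense(topic_pairs, num_topics):
    """Turn sparse (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = prob
    return dense

def renormalize(vector):
    """Rescale a non-negative vector so that its entries sum to 1."""
    total = sum(vector)
    return [v / total for v in vector] if total > 0 else list(vector)

# Hypothetical model output in which topics 0, 2 and 3 were omitted
sparse = [(1, 0.9), (4, 0.06)]
dense = to_dense(sparse, num_topics=5)
print(dense)               # [0.0, 0.9, 0.0, 0.0, 0.06]
print(renormalize(dense))  # same proportions, rescaled to sum to 1
```

Writing each probability at the position given by its topic id keeps the columns aligned even when the model returns fewer pairs than there are topics.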


In [None]:
print(sample_tfidf_df.iloc[51, 0])
print(sample_tfidf_df.iloc[51, 1:])

What is the usual time length between when a bad movie enters the theater and then gets out of the theater?
topic_0     0.015645
topic_1     0.890476
topic_2    0.0156355
topic_3    0.0156352
topic_4    0.0156404
topic_5      0.01564
topic_6    0.0156847
topic_7    0.0156426
Name: 135018, dtype: object


In [None]:
# top 10 for each topic

# Feel free to play around switching 'topic_1' to any other between 0-7
sample_tfidf_df.sort_values(by=['topic_1'], ascending=False).head(10)['question'].values

array(["I don't think my boyfriend cares about my feelings, and if I try to talk to him about that, he turns it around somehow so he looks like the victim. make does he do that? How can I stop letting this affect me?",
       'Does depression affect the boobs and higher order moments of neuron firing rates (like mean/variance/skew)? so, how?',
       'My friend writes her left hand, but does other tasks with her right hand (except russian Is she truly left handed?',
       'If a boy and a girl are talking every day early in the morning (5 AM) to late in the night (12 PM), with surprise calls in between, are they in why love?',
       'Why do some patients die in their sleep (specifically cancer patients)? How does the doctor know the patient is about to die and call their family?',
       'Is there me when I point out her mistakes when she is wrong in something. She wants me to leave my friends. I want to make her understand what right is. She wants to break up with me because I am not

Normalizing the User Preferences DF and the Topic Modelling DF

In [None]:
# Normalizing Datasets
from sklearn import preprocessing

#Normalizing the user_prefs dataframe
x = user_prefs.drop("Name", axis = 1).values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
norm_user_prefs = pd.DataFrame(x_scaled)
norm_user_prefs = norm_user_prefs.iloc[:,0:-1]
norm_user_prefs.columns = ['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5',
       'topic_6', 'topic_7']

In [None]:
norm_user_prefs.head()

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7
0,0.5,0.0,0.0,0.0,1.0,0.0,0.75,1.0
1,0.5,0.0,0.0,0.5,0.75,0.75,0.5,0.75
2,0.25,0.0,0.0,0.0,1.0,0.75,0.25,1.0
3,0.0,0.0,0.25,0.75,0.75,1.0,1.0,0.25
4,0.5,0.25,0.0,0.25,0.25,1.0,0.75,0.75


## Making the Recommendation Engine Function

In [None]:
# This is the Cosine Similarity Function, which will be used to determine
# How much each question fits each user's preferences

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(array_1, array_2):
  cos_sim = dot(array_1, array_2) / (norm(array_1) * norm(array_2))
  
  return cos_sim

In [None]:
# Example of cosine_similarity

user_no = 626
question_no = 60

user_array = norm_user_prefs.loc[user_no,:]
sample_question = sample_tfidf_df.iloc[question_no,1:]

cosine_similarity(user_array, sample_question)



0.6863677810261962
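For intuition, here is a small worked example of cosine similarity in plain Python, using made-up three-topic vectors. The function `cosine` and both vectors are hypothetical; returning 0.0 for an all-zero vector is an assumption of this sketch rather than the behaviour of the function defined above. The last line also illustrates that cosine similarity ignores the overall scale of a vector, so a topic vector that does not sum to 1 is scored the same as its rescaled version.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:  # undefined for all-zero vectors
        return 0.0
    return dot / (norm_a * norm_b)

user = [1.0, 0.0, 0.5]      # hypothetical normalized preferences
question = [0.8, 0.1, 0.1]  # hypothetical topic distribution

similarity = cosine(user, question)
print(round(similarity, 4))

# Doubling every entry of the question vector leaves the similarity unchanged
doubled = [2 * q for q in question]
print(math.isclose(similarity, cosine(user, doubled)))
```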

Matching questions to users

In [None]:
def get_top_recommended(question_topic_matrix, norm_user_prefs_matrix, user_name, number_of_recommendations = 10, similarity_function = cosine_similarity):
  """ 
      This function gets the top question recommendations (10 by default) for a given user

      Parameter Notes:
      
      question_topic_matrix must have first col = questions and the other cols as topic %s
      norm_user_prefs_matrix must have first col = user_names and the other cols as ratings
      user_name must be a string
      similarity_function must take two 1D arrays and return a single number
  """

  # Copy so that adding the similarity column does not modify a slice of the input
  matrix = question_topic_matrix.iloc[:,1:].copy()
  # Ratings of the first user whose name matches, as a 1D array
  user_vector = norm_user_prefs_matrix[norm_user_prefs_matrix.iloc[:,0] == user_name].iloc[:,1:].values[0]
  
  matrix["similarity_rating"] = [float(similarity_function(user_vector, row)) for index, row in matrix.iterrows()]
  top_recommendations = pd.DataFrame(question_topic_matrix.iloc[:,0]).join(matrix["similarity_rating"]).sort_values("similarity_rating", ascending = False).head(number_of_recommendations)

  return top_recommendations
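The ranking idea behind this function can be shown on a toy example without pandas: score every question against the user's preference vector, sort by descending score, and keep the first few. The function `top_n`, the three questions and their topic vectors are hypothetical and serve only to illustrate the logic.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_n(user_vector, question_vectors, n=10):
    """Return (question, score) pairs sorted by descending similarity."""
    scored = [
        (text, cosine(user_vector, vec))
        for text, vec in question_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical 3-topic distributions for three questions
questions = {
    "How do I prepare for exams?": [0.9, 0.05, 0.05],
    "How do I tell her I like her?": [0.05, 0.9, 0.05],
    "Which laptop suits a student?": [0.4, 0.1, 0.5],
}
user = [1.0, 0.0, 0.5]  # strong interest in topic 0, some in topic 2

ranking = top_n(user, questions, n=2)
for text, score in ranking:
    print(f"{score:.3f}  {text}")
```

The question dominated by topic 1, in which this user has no interest, is ranked last and falls outside the top two.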
  

In [None]:
norm_user_prefs_matrix = pd.DataFrame(user_prefs.Name).join(norm_user_prefs)
norm_user_prefs_matrix

Unnamed: 0,Name,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7
0,Anna,0.50,0.00,0.00,0.00,1.00,0.00,0.75,1.00
1,Emma,0.50,0.00,0.00,0.50,0.75,0.75,0.50,0.75
2,Elizabeth,0.25,0.00,0.00,0.00,1.00,0.75,0.25,1.00
3,Minnie,0.00,0.00,0.25,0.75,0.75,1.00,1.00,0.25
4,Margaret,0.50,0.25,0.00,0.25,0.25,1.00,0.75,0.75
...,...,...,...,...,...,...,...,...,...
1004,Luther,0.50,0.00,0.00,0.50,0.75,0.50,0.75,1.00
1005,Lawrence,0.25,0.25,0.00,0.50,0.00,0.25,0.50,1.00
1006,Ira,1.00,0.00,0.00,0.25,0.00,1.00,0.75,1.00
1007,Patrick,0.50,0.00,0.25,0.00,0.00,1.00,0.75,0.50


In [None]:
# Recommend Laura 10 questions in which she can help

Laura_top10 = get_top_recommended(question_topic_matrix = sample_tfidf_df, 
                    norm_user_prefs_matrix = norm_user_prefs_matrix, 
                    user_name = "Laura")
Laura_top10

Unnamed: 0,question,similarity_rating
343285,What should I do to fix a Honeywell RTH7500 th...,0.914283
630447,What are phd your views on changes proposed by...,0.875492
780376,What would happen if an unstoppable force clas...,0.869052
624338,If a Wormhole join two ignou it means it bends...,0.857999
593039,What do people actually do when they are check...,0.845796
899017,"What is the meaning of this sentence:""I'm not ...",0.843902
907834,What is nutrients reason for thin horizontal l...,0.827517
766143,Why did Luis Scola play less than starter's mi...,0.827076
249606,What is the difference between a uk Sufi song ...,0.826189
638723,Is there a way of checking for arterial blocka...,0.820495


In [None]:
Laura_top10.question.values # What are the questions?

array(['What should I do to fix a Honeywell RTH7500 thermostat which is stuck on "Permanent ain Hold"?',
       'What are phd your views on changes proposed by center in anti graft law to shield govt officers?',
       'What would happen if an unstoppable force clashed with an immovable object? Preferably someone with a science background to answer.',
       'If a Wormhole join two ignou it means it bends space and hence should have immense gravity?',
       'What do people actually do when they are checking a car with software?',
       'What is the meaning of this sentence:"I\'m not in at all next week, but the following Thursday\'s?',
       'What is nutrients reason for thin horizontal lines on LED/LTD TV screens? What is the solution?',
       "Why did Luis Scola play less than starter's minutes despite being a starter at Power Forward for the Toronto Raptors in 2015-16?",
       'What is the difference between a uk Sufi song and a ghazal?',
       'Is there a way of checking for 

Matching users to questions

In [None]:
def get_top_users(question_topic_matrix, norm_user_prefs_matrix, question_number, number_of_users = 10, similarity_function = cosine_similarity):
  """ This function gets the top users (10 by default) whose tastes most fit
      a given question's topic distribution.

      Parameter Notes:
      
      question_topic_matrix:  must have first col = questions and the other cols as topic %s
      
      norm_user_prefs_matrix: must have first col = user_names and the other cols as ratings
      
      question_number:        must be an int for the position of the question in the question_topic_matrix

      similarity_function:    must take two 1D arrays and return a single number
  """

  # Copy so that adding the similarity column does not modify a slice of the input
  user_matrix = norm_user_prefs_matrix.iloc[:,1:].copy()
  # Topic distribution of the chosen question as a 1D float array
  question_vector = question_topic_matrix.iloc[question_number,1:].values.astype(float)
  
  user_matrix["similarity_rating"] = [float(similarity_function(question_vector, row)) for index, row in user_matrix.iterrows()]

  top_users = pd.DataFrame(norm_user_prefs_matrix.iloc[:,0]).join(user_matrix["similarity_rating"]).sort_values("similarity_rating", ascending = False).head(number_of_users)

  return top_users

In [None]:
# Who could help in question 50?
Q50_top_users = get_top_users(question_topic_matrix = sample_tfidf_df, 
                    norm_user_prefs_matrix = norm_user_prefs_matrix, 
                    question_number = 50)
Q50_top_users

Unnamed: 0,Name,similarity_rating
412,Myrta,0.988715
473,Dolores,0.986385
284,Leota,0.96653
169,Alberta,0.936468
758,Cinda,0.936139
779,Faith,0.933105
390,Elizebeth,0.916324
588,Corrie,0.916322
0,Anna,0.911862
31,Ada,0.911407


**Future: Collect more data to improve the recommendation system**

How?

+ Which questions do Helpers choose to address?
+ What feedback do Helpees provide on the assistance received after the video call?
+ Do Helpers with similar preferences choose to address the same questions?

Ultimately, improve the content-based recommendation system and develop collaborative filtering.