### Data description

CareerVillage.org has provided several years of anonymized data and each file comes from a table in their database.

answers.csv: Answers are what this is all about! Answers get posted in response to questions. Answers can only be posted by users who are registered as Professionals. However, if someone has changed their registration type after joining, they may show up as the author of an Answer even if they are no longer a Professional.

comments.csv: Comments can be made on Answers or Questions. We refer to whichever the comment is posted to as the "parent" of that comment. Comments can be posted by any type of user. Our favorite comments tend to have "Thank you" in them :)

emails.csv: Each email corresponds to one specific email to one specific recipient. The frequency_level refers to the type of email template which includes immediate emails sent right after a question is asked, daily digests, and weekly digests.

group_memberships.csv: Any type of user can join any group. There are only a handful of groups so far.

groups.csv: Each group has a "type". For privacy reasons we have to leave the group names off.

matches.csv: Each row tells you which questions were included in emails. If an email contains only one question, that email's ID will show up here only once. If an email contains 10 questions, that email's ID would show up here 10 times.

professionals.csv: We call our volunteers "Professionals", but we might as well call them Superheroes. They're the grown ups who volunteer their time to answer questions on the site.

questions.csv: Questions get posted by students. Sometimes they're very advanced. Sometimes they're just getting started. It's all fair game, as long as it's relevant to the student's future professional success.

school_memberships.csv: Just like group_memberships, but for schools instead.

students.csv: Students are the most important people on CareerVillage.org. They tend to range in age from about 14 to 24. They're all over the world, and they're the reason we exist!

tag_questions.csv: Every question can be hashtagged. We track the hashtag-to-question pairings, and put them into this file.

tag_users.csv: Users of any type can follow a hashtag. This shows you which hashtags each user follows.

tags.csv: Each tag gets a name.

question_scores.csv: "Hearts" scores for each question.

answer_scores.csv: "Hearts" scores for each answer.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

In [2]:
### function for reading file

def read_file(filename):
    dataframe = pd.read_csv(filename)
    print("The shape of dataframe:", dataframe.shape)
    return dataframe

### Loading all the data files

In [3]:
answers_df = read_file('answers.csv')

The shape of dataframe: (51123, 5)


In [4]:
comments_df = read_file('comments.csv')

The shape of dataframe: (14966, 5)


In [5]:
questions_df = read_file('questions.csv')

The shape of dataframe: (23931, 5)


In [6]:
professionals_df = read_file('professionals.csv')

The shape of dataframe: (28152, 5)


In [7]:
tags_df = read_file('tags.csv')

The shape of dataframe: (16269, 2)


In [8]:
tag_questions_df = read_file('tag_questions.csv')

The shape of dataframe: (76553, 2)


In [9]:
tag_users_df = read_file('tag_users.csv')

The shape of dataframe: (136663, 2)


In [10]:
question_scores_df = read_file('question_scores.csv')

The shape of dataframe: (23928, 2)


In [11]:
answer_scores_df = read_file('answer_scores.csv')

The shape of dataframe: (51138, 2)


## Examine Question data frame

### Create new features from answer and questions dataframe


    1) Number of answers for each question
    2) Time difference between question asked and first answered psoted
    3) No of tags for each questions
    4) List of tags for each question
    5) Length of questions
    6) Heart score for each question

### Number of answers for each question

In [12]:
answers_count = answers_df.groupby('answers_question_id')['answers_id'].count()

In [13]:
df = pd.DataFrame(answers_count.rename('count')).reset_index()
questions_df['questions_answer_count'] = questions_df.merge(df, how = 'left' , left_on = 'questions_id', right_on = 'answers_question_id')['count'].fillna(0).astype(int)

### Time difference between question posted and first answer posted

In [14]:
## function to convert timestamp

def datetime_covert(df,column):
    df[column] = pd.to_datetime(df[column], infer_datetime_format= True)
    return df

In [18]:
### Convert time stamp for dataframes with date information

questions_df = datetime_covert(questions_df, 'questions_date_added')
answers_df = datetime_covert(answers_df, 'answers_date_added')
professionals_df = datetime_covert(professionals_df,'professionals_date_joined')
comments_df = datetime_covert(comments_df,'comments_date_added')

In [19]:
## group by questions id and find the min of answer date to know the first answer date for each question

df = answers_df.groupby('answers_question_id')['answers_date_added'].min()
df = pd.DataFrame(df.rename('first_answer')).reset_index()

In [20]:
### Merge the created dataframe to question datafame and calculate the time difference between time question asked and first answer

questions_df['first_answer_time'] = questions_df.merge(df, how = 'left' , left_on = 'questions_id', right_on = 'answers_question_id')['first_answer']

questions_df['time_to_get_first_answer'] = questions_df['first_answer_time'] - questions_df['questions_date_added']

### Create list of tags for each question and count number of tags

In [21]:
## merge tag names to question ids

questions_id_tag_names = tag_questions_df.merge(tags_df, how = 'left' , left_on = 'tag_questions_tag_id', right_on = 'tags_tag_id')

In [22]:
### Concatenate tag names for each question id

foo = lambda a: ", ".join(a)

questions_id_tag_names = questions_id_tag_names.groupby(by='tag_questions_question_id').agg({'tags_tag_name': foo}).reset_index()

### Count of tags for each question id

In [23]:
questions_id_tag_names['question_tags_count'] = questions_id_tag_names.tags_tag_name.apply(lambda x: len(x.split(',')))

questions_df = questions_df.merge(questions_id_tag_names, how ='left', left_on = 'questions_id', right_on = 'tag_questions_question_id')

### Heart scores for questions

In [24]:
### merge questions dataframe to question score df

questions_df = questions_df.merge(question_scores_df, how = 'left', left_on = 'questions_id', right_on = 'id')

In [25]:
## drop repetitive columns

questions_df.drop(['id','tag_questions_question_id'],axis = 1, inplace= True)

### Merge question body and title

In [26]:
questions_df['full_text'] = questions_df['questions_title'] + ' ' + questions_df['questions_body']

## Examine the professional data frame

### Create new features from the professional and answer dataframe 

    1) Find professionals first activity date
    2) Find professionals last activity date
    3) Time taken to answer first question by the professional
    4) Last question answered by the professional
    5) First comment by professional
    6) Last comment by professional
    7) Time taken to answer the first question by the professional
    8) Number of questions answered by the professional
    9) Number of comments by each professional
    10) Professional heart score
    11) List of professional tags

### Find professionals first and last answer date

In [27]:
### first answer date

temp = answers_df.groupby('answers_author_id')['answers_date_added'].min()
df = pd.DataFrame(temp.rename('min')).reset_index()

professionals_df['professional_first_answer_date'] = professionals_df.merge(df, how = 'left' , left_on = 'professionals_id', right_on = 'answers_author_id')['min']


### last answer date

temp = answers_df.groupby('answers_author_id')['answers_date_added'].max()
df = pd.DataFrame(temp.rename('max')).reset_index()

professionals_df['professional_last_answer_date'] = professionals_df.merge(df, how = 'left' , left_on = 'professionals_id', right_on = 'answers_author_id')['max']

### Number of questions answered by professionals

In [28]:
### Number of questions answered by the professional

temp = answers_df.groupby('answers_author_id')['answers_question_id'].count()
df = pd.DataFrame(temp.rename('count')).reset_index()

professionals_df['number_questions_answered'] = professionals_df.merge(df, how = 'left' , left_on = 'professionals_id', right_on = 'answers_author_id')['count'].fillna(0).astype('int')

### Number of comments, first and last comment date for each professional

In [29]:
### Number of comments answered by the professional

temp = comments_df.groupby('comments_author_id')['comments_id'].count()
df = pd.DataFrame(temp.rename('count')).reset_index()

professionals_df['number_comments'] = professionals_df.merge(df, how = 'left' , left_on = 'professionals_id', right_on = 'comments_author_id')['count'].fillna(0).astype('int')


### first comment date for the professional

temp = comments_df.groupby('comments_author_id')['comments_date_added'].min()
professionals_df['date_first_comment'] = pd.merge(professionals_df, pd.DataFrame(temp.rename('first_comment')), left_on='professionals_id', right_index=True, how='left')['first_comment']

### last comment date for the professional

temp = comments_df.groupby('comments_author_id')['comments_date_added'].max()
professionals_df['date_last_comment'] = pd.merge(professionals_df, pd.DataFrame(temp.rename('last_comment')), left_on='professionals_id', right_index=True, how='left')['last_comment']


### Find professionals first and last overall activity date

In [30]:
### Last activity of the professional

professionals_df['date_last_activity'] = professionals_df[['professional_last_answer_date', 'date_last_comment']].max(axis=1)


### First activity of the professional 

professionals_df['date_first_activity'] = professionals_df[['professional_first_answer_date', 'date_first_comment']].min(axis=1)

### Time taken by professional to give first response after joining


In [31]:
professionals_df['time_to_answer_first_question'] = professionals_df['date_first_activity'] - professionals_df['professionals_date_joined']

### Days to answer first question by professional

In [32]:
professionals_df['days_to_answer_first_question'] = professionals_df['time_to_answer_first_question'].dt.days.fillna(-1).astype(int)

### Time since last activity for each professional

In [33]:
import datetime

### competition start date assumed at 1st Feb 2019

competition_start_date = datetime.datetime(2019,2,1)

In [34]:
### time since last activity

professionals_df['time_since_last_activity'] = competition_start_date - professionals_df['date_last_activity']

### Days required to answer the question

In [35]:
temp = pd.merge(questions_df, answers_df, left_on='questions_id', right_on='answers_question_id')
answers_df['time_delta_answer'] = (temp['answers_date_added'] - temp['questions_date_added'])
answers_df['time_delta_answer_day'] = answers_df['time_delta_answer'].dt.days

### Heart score for answers: Almost 90% of answers have less than 2 heart score

In [36]:
temp = pd.merge(answers_df, answer_scores_df, left_on='answers_id', right_on='id', how='left')
answers_df['answers_hearts_score'] = temp['score'].fillna(0).astype(int)

### Total Hearts score for each professional

In [37]:
temp = answers_df.groupby('answers_author_id')['answers_hearts_score'].sum()
df = pd.DataFrame(temp.rename('count')).reset_index()

professionals_df['professional_answers_hearts_score'] = professionals_df.merge(df, left_on='professionals_id', how='left', right_on = 'answers_author_id')['count'].fillna(0).astype(int)

### Tag list of the professionals

In [38]:
## merge tag names to user ids

user_id_tag_names = tag_users_df.merge(tags_df, how = 'left' , left_on = 'tag_users_tag_id', right_on = 'tags_tag_id')

In [39]:
### Concatenate tag names for each user id

foo = lambda a: ", ".join(a)

user_id_tag_names = user_id_tag_names.groupby(by='tag_users_user_id').agg({'tags_tag_name': foo}).reset_index()

In [40]:
### merge user id tag names df to professional df

professionals_df = professionals_df.merge(user_id_tag_names, how ='left', left_on = 'professionals_id', right_on = 'tag_users_user_id')

### Feature with professionals hash tags concatenated with industry name

In [41]:
professionals_df['tags_and_industry'] = professionals_df['tags_tag_name'] + ' ' + professionals_df['professionals_industry']

### Merge questions and answers data frame

In [42]:
questions_and_answers = questions_df.merge(answers_df, how = 'left' , left_on = 'questions_id', right_on = 'answers_question_id')

questions_and_answers.shape

(51944, 20)

### Dataframe of questions with no answers

In [43]:
questions_no_answers = questions_and_answers[questions_and_answers.answers_id.isna()]

questions_no_answers.shape

(821, 20)

### Dataframe of questions with answers

In [44]:
questions_with_answers = questions_and_answers[questions_and_answers.answers_id.notna()]

questions_with_answers.shape

(51123, 20)

### Merge professional dataframe to question with answer dataframe

In [45]:
### merge professionals to questions and answers dataframe

questions_with_answers = questions_with_answers.merge(professionals_df, how = 'left', left_on = 'answers_author_id', right_on= 'professionals_id')

### The number of unique professionals ids are less than unique answer author id since there is a possibility answer author was previously a professional but no longer is. Hence we decide to drop rows with null professional ids

In [46]:
## remove null professional id rows in the data

questions_with_answers = questions_with_answers.loc[questions_with_answers.professionals_id.notna()]

### Time difference between professional join date and answer date

In [47]:
questions_with_answers['time_diff_prof_join_answer_date'] = questions_with_answers['answers_date_added'] - questions_with_answers['professionals_date_joined']

### Find effective time to answer each question based on question date, professional date and answer date

In [48]:
questions_with_answers['effective_time_to_answer'] = questions_with_answers.apply(lambda x: x['time_diff_prof_join_answer_date'] if (x['professionals_date_joined'] > x['questions_date_added']) else x['time_delta_answer'], axis = 1)

In [49]:
### convert effective days to answer and time since last activity to days

questions_with_answers['effective_days_to_answer'] = questions_with_answers.effective_time_to_answer.dt.days

questions_with_answers['time_since_last_activity'] = questions_with_answers['time_since_last_activity'].dt.days

### Find score based on effective days to answer and time since last activity

In [50]:
questions_with_answers['effective_days_to_answer_score'] = 1/np.log10(10 + questions_with_answers['effective_days_to_answer'])

questions_with_answers['time_since_last_activity_score'] = 1/np.log10(10 + questions_with_answers['time_since_last_activity'])

### final content dataframe shreya

In [51]:
final_content_df_shreya = questions_with_answers[['professionals_id','questions_id','full_text','effective_days_to_answer_score','time_since_last_activity_score','professional_answers_hearts_score','tags_and_industry','questions_answer_count']]

## Content based filtering approach

### Create a corpus of all questions with title and body

In [52]:
documents = pd.DataFrame(final_content_df_shreya.full_text.unique(),columns = ['full_text'])

In [53]:
print(len(documents))
documents.head()

22744


Unnamed: 0,full_text
0,Teacher career question What is a maths...
1,I want to become an army officer. What can I d...
2,Will going abroad for your first job increase ...
3,To become a specialist in business management...
4,Are there any scholarships out there for stude...


In [57]:
### import necessary libraries

# For pre-processing

import nltk
from textblob import TextBlob
import re
from textblob import Word

import nltk
nltk.download('stopwords') 
from nltk.corpus import stopwords 

# For LDA
import gensim
from pprint import pprint

# LDA Visualization
import pyLDAvis
import pyLDAvis.gensim 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Manik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [58]:
### preprocessing text data

### stop words removal

stop_words = stopwords.words('english')
stop_words.extend(['know', 'want', 'getting', 'what', 'would', 'going', 
                   'like', 'getting', 'ever', 'every', 'hello', 'come', 
                   'kinda', 'felt', 'whatever', 'that', 'come', 'always', 
                   'also', 'shall', 'thing', 'good', 'maybe', "what's", 
                   'nagar', 'once', 'something', 'even', 'question', 'thank'])

def preprocess_text(doc):
    
    """
    
    pre-processing using textblob: 
    tokenizing, converting to lower-case, and lemmatization based on POS tagging, 
    removing stop-words, and retaining tokens greater than length 3
    
    """
    
    blob = TextBlob(doc)
    result = []
    tag_dict = {"J": 'a', # Adjective
                "N": 'n', # Noun
                "V": 'v', # Verb
                "R": 'r'} #  Adverb
   
    for sent in blob.sentences:
        ### using parts of speech tags
        words_and_tags = [(w, tag_dict.get(pos[0])) for w, pos in sent.tags]   
        
        ### lemmatization
        lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
        
        for token in lemmatized_list:
            if token.lower() not in stop_words and len(token.lower()) > 3:
                result.append(token.lower())
    
    return result

In [59]:
doc_sample = documents[documents.index == 1].values[0][0]

print('original document: ')
print(doc_sample)
print('\n\n tokenized and stemmed document: ')
print(preprocess_text(doc_sample))

original document: 
I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


 tokenized and stemmed document: 
['become', 'army', 'officer', 'become', 'army', 'officer', 'priyanka', 'bangalore', '10th', 'college', 'confuse', 'take', 'become', 'army', 'officer', 'military', 'army']


In [60]:
### preprocessing documents corpus

import swifter # Makes applying to datframe as fast as vectorizing

processed_docs = documents['full_text'].swifter.apply(preprocess_text) 

HBox(children=(IntProgress(value=0, description='Pandas Apply', max=22744, style=ProgressStyle(description_wid…




In [61]:
processed_docs.head()

0    [teacher, career, maths, teacher, maths, teach...
1    [become, army, officer, become, army, officer,...
2    [abroad, first, increase, chance, back, home, ...
3    [become, specialist, business, management, net...
4    [scholarship, student, first, generation, live...
Name: full_text, dtype: object

In [62]:
## create dictionary

text_dictionary = gensim.corpora.Dictionary(processed_docs)

print(len(text_dictionary))

19794


In [63]:
# Include words in dictionary that appear greater than 15 times 
# but less than 0.5 proportion of the frequency of all the words in all of the questions

text_dictionary.filter_extremes(no_below = 5, no_above=0.5) 
print(len(text_dictionary))
# for k, v in dictionary.iteritems():
#     print(k, v)


4587


In [64]:
main_corpus = [text_dictionary.doc2bow(doc) for doc in processed_docs]


from gensim import models
# # Term Document Frequency
# tfidf = models.TfidfModel(main_corpus)
# tfidf_main_corpus = tfidf[main_corpus]


In [65]:
### LDA model fitting

lda_model = gensim.models.ldamodel.LdaModel(main_corpus, 
                                       num_topics= 20, 
                                       id2word=text_dictionary, 
                                       random_state= 802,
                                       chunksize=1000,
                                       passes = 10,
                                       alpha='auto',
                                       per_word_topics=True)

### Printing topics from the LDA model

In [66]:

for ids, topic in lda_model.print_topics(-1):
    print('Topic Number: {} \n Corresponding Words: {}'.format(ids, topic))


Topic Number: 0 
 Corresponding Words: 0.175*"degree" + 0.091*"medical" + 0.088*"doctor" + 0.076*"medicine" + 0.056*"biology" + 0.033*"master" + 0.033*"bachelor" + 0.031*"healthcare" + 0.030*"become" + 0.022*"pediatrician"
Topic Number: 1 
 Corresponding Words: 0.127*"teacher" + 0.090*"teach" + 0.081*"education" + 0.077*"game" + 0.074*"design" + 0.032*"video" + 0.031*"designer" + 0.028*"artist" + 0.025*"graphic" + 0.019*"educator"
Topic Number: 2 
 Corresponding Words: 0.057*"animal" + 0.053*"veterinarian" + 0.047*"surgeon" + 0.042*"architecture" + 0.038*"woman" + 0.037*"young" + 0.036*"challenge" + 0.035*"architect" + 0.034*"veterinary" + 0.028*"surgery"
Topic Number: 3 
 Corresponding Words: 0.141*"career" + 0.104*"major" + 0.054*"field" + 0.049*"interested" + 0.029*"pursue" + 0.026*"psychology" + 0.019*"different" + 0.019*"choose" + 0.018*"type" + 0.017*"best"
Topic Number: 4 
 Corresponding Words: 0.195*"nurse" + 0.139*"nursing" + 0.084*"interview" + 0.078*"become" + 0.042*"police"

In [68]:
# Perplexity    
print('\nThe Perplexity Value is: ', lda_model.log_perplexity(main_corpus))  # a metric that gauges the quality of an LDA model, the lower the better



The Perplexity Value is:  -7.779270779256092


In [71]:
from gensim.models.coherencemodel import CoherenceModel

# Coherence Score - also tells us how good our LDA model is

coherence_score = CoherenceModel(model=lda_model, texts = processed_docs, dictionary= text_dictionary, coherence='c_v')
coherence_score = coherence_score.get_coherence()
print('\nThe Coherence Score is: ', coherence_score)


The Coherence Score is:  0.4027596123921284


In [73]:
# Visualize the LDA results

pyLDAvis.enable_notebook()
lda_vis = pyLDAvis.gensim.prepare(lda_model, main_corpus, text_dictionary, sort_topics=True)
lda_vis


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


In [74]:
def map_topics_to_questions(lda_model, main_corpus):
    
    """
    This function maps the topics we get from our LDA to the question text (each question)
    It takes in the corpus that we input, and maps LDA topics to that corpus.
    """
    
    topics_and_questions = lda_model.get_document_topics(main_corpus)
    
    df = pd.DataFrame.from_records([{v:k for v, k in row} for row in topics_and_questions]) 
    
    """
    UNDERSTAND THE FUNCTION DataFrame.from_records() TO CHANGE NAMES
    """
    
    return df

In [75]:

"""
Mapping all topics to all questions
"""
questions_to_topics_df = map_topics_to_questions(lda_model, main_corpus)


### Recommendation based on topic similarity

In [78]:
from sklearn.metrics.pairwise import cosine_similarity

def question_topic_similarity(new_question, threshold = 0.0):
    """ Calculates the topic similarity to the existing questions and returns the most similar ones.
    """ 

    questions_to_topics_df = map_topics_to_questions(lda_model, main_corpus)
    

    processed_new_question = [preprocess_text(new_question)]

    new_question_corpus = [text_dictionary.doc2bow(text) for text in processed_new_question]
    new_question_lda = lda_model[new_question_corpus]
    
    new_dict = {}
    for row in new_question_lda:
        for v, k in row:
           new_dict[v] = k 

    for i in range(20):
        if i in list(new_dict.keys()):
            continue
        new_dict[i] = 0    

    new_question_to_topics_df = pd.DataFrame.from_dict(new_dict, orient = 'index')

    df = new_question_to_topics_df.T
    new_question_to_topics_df = df.reindex(sorted(df.columns), axis=1)    
    
    
    cos_sim = cosine_similarity(questions_to_topics_df.fillna(0), new_question_to_topics_df.fillna(0))
    result = pd.DataFrame({'full_topic_similar':np.tile(documents['full_text'], cos_sim.shape[1]),
                           'topic_similarity':cos_sim.reshape(-1,)},
                         index=np.tile(documents['full_text'].index, cos_sim.shape[1]))
    
    result = result[result['topic_similarity'] >= threshold].sort_values('topic_similarity', ascending=False)
    return result



def recommend_topic_similarity(new_question, weights=[1, 1, 1, 1, 1]):
    

    sim_questions = question_topic_similarity(new_question, threshold=0.0)
    result = final_content_df_shreya.merge(sim_questions, left_on='full_text', right_on='full_topic_similar', how='left')
    

    result['total_hearts_score'] = np.log10(10+result['professional_answers_hearts_score'])
    result['final_score'] = (weights[0]*result['topic_similarity']+
                             weights[1]*result['effective_days_to_answer_score']+
                             weights[2]*result['total_hearts_score'] +
                             weights[3]*result['time_since_last_activity_score'])
#                              weights[4]*result['score_professional_hearts'])
    
#     recommendation = result.groupby('professionals_id')['final_score'].sum().sort_values(ascending=False)

    final_sorted = result.sort_values(by = 'final_score', ascending=False)
    return final_sorted
          

In [79]:
query_text = ['I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army']

# print(query_text)

recommend_topic_similarity(query_text[0], weights=[0.3, 0.1, 0.1, 0.5, 1]).head()

Unnamed: 0,professionals_id,questions_id,full_text,effective_days_to_answer_score,time_since_last_activity_score,professional_answers_hearts_score,tags_and_industry,questions_answer_count,full_topic_similar,topic_similarity,total_hearts_score,final_score
35345,be5d23056fcb4f1287c823beec5291e1,5e12cb630a7c4da3aa79d54a9dd792b2,Can i be a police officer #police #police-off...,1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",1,Can i be a police officer #police #police-off...,0.971817,2.466868,1.053473
19569,be5d23056fcb4f1287c823beec5291e1,776e22d9eb1045eb8a9771eb015e8ddf,I want to be a police officer or a police disp...,1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",2,I want to be a police officer or a police disp...,0.924846,2.466868,1.039382
10889,be5d23056fcb4f1287c823beec5291e1,814c1e4562ab408a8769f1a64a8351f1,Do become a police officer what should be my q...,1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",3,Do become a police officer what should be my q...,0.90269,2.466868,1.032735
39766,be5d23056fcb4f1287c823beec5291e1,4efec667c2cd4f8e8bb47945dd3bd1a9,What are the qualifications needed to become a...,0.897712,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",3,What are the qualifications needed to become a...,0.91398,2.466868,1.025893
45001,be5d23056fcb4f1287c823beec5291e1,09eb3002c20a4684b51705c630379d61,"i am an 8th grader , if i go to a military hig...",1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",4,"i am an 8th grader , if i go to a military hig...",0.879636,2.466868,1.025819


### Results shows that we get professionals that answered similar questions

### Recommendation using text similarity

In [80]:
def preprocess_new_question(doc):
    
    """
    
    pre-processing using textblob: 
    tokenizing, converting to lower-case, and lemmatization based on POS tagging, 
    removing stop-words, and retaining tokens greater than length 3
    
    """
    
    blob = TextBlob(doc)
    result = []
    tag_dict = {"J": 'a', # Adjective
                "N": 'n', # Noun
                "V": 'v', # Verb
                "R": 'r'} #  Adverb
   
    for sent in blob.sentences:
        
        words_and_tags = [(w, tag_dict.get(pos[0])) for w, pos in sent.tags]    
        lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
        
        for token in lemmatized_list:
            if token not in stop_words and len(token) > 3:
                result.append(token.lower())
    main_result = ' '.join(result)
    return main_result


# doc_sample = documents[documents.index == 1].values[0][0]

# print('original document: ')
# print(doc_sample)
# print('\n\n tokenized and stemmed document: ')
# print(preprocess_new_question(doc_sample))

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = pd.DataFrame(final_content_df_shreya.full_text.unique(),columns = ['full_text'])
processed_all_questions = documents['full_text'].swifter.apply(preprocess_new_question)


## function to compute text similarity

def question_text_similarity(new_question, threshold = 0.0):


#     documents = pd.DataFrame(final_content_df_shreya.full_text.unique(),columns = ['full_text'])
#     processed_all_questions = documents['full_text'].swifter.apply(preprocess_new_question)
    full_corpus = processed_all_questions.tolist()
    processed_new_question = preprocess_new_question(new_question)
    vectorizer = TfidfVectorizer()
    vectorizer.fit(full_corpus)

    tfidf_corpus = vectorizer.transform(full_corpus)
    tfidf_new_question = vectorizer.transform([processed_new_question])

    cos_sim = cosine_similarity(tfidf_corpus, tfidf_new_question)

    result = pd.DataFrame({'full_text_similar':np.tile(documents['full_text'], cos_sim.shape[1]),
                               'text_similarity':cos_sim.reshape(-1,)},
                          index=np.tile(documents.index, cos_sim.shape[1]))
    result = result[result['text_similarity'] >= threshold].sort_values('text_similarity', ascending=False)
    return result


# query_text = ['I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army']

# question_text_similarity(query_text[0])

## function to recommend based on text similarity

def recommend_text_similarity(new_question, weights=[1, 1, 1, 1]):
    

    sim_questions = question_text_similarity(new_question, threshold=0.0)
    result = final_content_df_shreya.merge(sim_questions, left_on='full_text', right_on='full_text_similar', how='left')
    

    result['total_hearts_score'] = np.log10(10+result['professional_answers_hearts_score'])
    result['final_score'] = (weights[0]*result['text_similarity']+
                             weights[1]*result['effective_days_to_answer_score']+
                             weights[2]*result['total_hearts_score'] +
                             weights[3]*result['time_since_last_activity_score'])
    
#     recommendation = result.groupby('professionals_id')['final_score'].sum().sort_values(ascending=False)

    final_sorted = result.sort_values(by = 'final_score', ascending=False)
    return final_sorted


query_text = ['I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army']

print(query_text)

recommend_text_similarity(query_text[0], weights=[0.3, 0.1, 0.1, 0.5])


HBox(children=(IntProgress(value=0, description='Pandas Apply', max=22744, style=ProgressStyle(description_wid…


['I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army']


Unnamed: 0,professionals_id,questions_id,full_text,effective_days_to_answer_score,time_since_last_activity_score,professional_answers_hearts_score,tags_and_industry,questions_answer_count,full_text_similar,text_similarity,total_hearts_score,final_score
46863,f1cc078488fa49b2827a9671ab1cc582,552df6150cf842578f7bc7ab45ed3d05,What do you do in the army? i want to go to We...,0.637673,0.960253,22.0,"computer-hardware, computer-skills, software-m...",3,What do you do in the army? i want to go to We...,0.684522,1.505150,0.899765
22249,f1cc078488fa49b2827a9671ab1cc582,cb7c51b7f145491085b10817ae23e92b,I want to build my career in Army. What should...,0.533316,0.960253,22.0,"computer-hardware, computer-skills, software-m...",3,I want to build my career in Army. What should...,0.671284,1.505150,0.885358
22251,58fa5e95fe9e480a9349bbb1d7faaddb,cb7c51b7f145491085b10817ae23e92b,I want to build my career in Army. What should...,0.768622,0.691010,282.0,"mechanical-engineering, engineering, automotiv...",3,I want to build my career in Army. What should...,0.671284,2.465383,0.870290
11566,9fc88a7c3323466dbb35798264c7d497,c76ebe4b76e44750b133498cd7f9f075,How much does a major make in the army? I'm a ...,0.960253,1.000000,1.0,"safety, environmental-services, military, vete...",2,How much does a major make in the army? I'm a ...,0.510244,1.041393,0.853238
45001,be5d23056fcb4f1287c823beec5291e1,09eb3002c20a4684b51705c630379d61,"i am an 8th grader , if i go to a military hig...",1.000000,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",4,"i am an 8th grader , if i go to a military hig...",0.268783,2.466868,0.842563
27074,be5d23056fcb4f1287c823beec5291e1,a21b70bf1fd54348a1071872785b03bc,I would like to know more about being a law e...,1.000000,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",1,I would like to know more about being a law e...,0.263009,2.466868,0.840830
30352,be5d23056fcb4f1287c823beec5291e1,d726289d84c9477db0a4aaad7498b561,I'm in 8th grade and curious about joining the...,0.897712,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",3,I'm in 8th grade and curious about joining the...,0.288481,2.466868,0.838243
31335,be5d23056fcb4f1287c823beec5291e1,fda1bb6fa568460cb307a0e3e73cfb67,What is the Best College or University to go t...,1.000000,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",1,What is the Best College or University to go t...,0.240233,2.466868,0.833998
29618,be5d23056fcb4f1287c823beec5291e1,3155da10b7d44f9bb4cb1f8219580cfd,"What do you have to do to become a cop, and wh...",1.000000,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",2,"What do you have to do to become a cop, and wh...",0.229729,2.466868,0.830846
21761,be5d23056fcb4f1287c823beec5291e1,b2a1449dda484317ad10a7237c03bdc8,I'm very interested in learning about what a d...,0.960253,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",1,I'm very interested in learning about what a d...,0.232381,2.466868,0.827667


### Recommendation using topic and text similarity

In [81]:
def recommend_text_and_topic_similarity(new_question, weights=[1, 1, 1, 1, 1]):
    

    sim_questions_text = question_text_similarity(new_question, threshold=0.0)
    result = final_content_df_shreya.merge(sim_questions_text, left_on='full_text', right_on='full_text_similar', how='left')
    
    sim_questions_topic = question_topic_similarity(new_question, threshold=0.0)
    result = result.merge(sim_questions_topic, left_on='full_text', right_on='full_topic_similar', how='left')
    

    result['total_hearts_score'] = np.log10(10+result['professional_answers_hearts_score'])
    result['final_score'] = (weights[0]*result['text_similarity']+
                             weights[1]*result['topic_similarity']+
                             weights[2]*result['effective_days_to_answer_score']+
                             weights[3]*result['total_hearts_score'] +
                             weights[4]*result['time_since_last_activity_score'])
#                              weights[4]*result['score_professional_hearts'])
    
#     recommendation = result.groupby('professionals_id')['final_score'].sum().sort_values(ascending=False)

    final_sorted = result.sort_values(by = 'final_score', ascending=False)
    return final_sorted


query_text = ['I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army']

# print(query_text)

recommend_text_and_topic_similarity(query_text[0], weights=[0.2, 0.2, 0.1, 0.1, 0.4]).head()

Unnamed: 0,professionals_id,questions_id,full_text,effective_days_to_answer_score,time_since_last_activity_score,professional_answers_hearts_score,tags_and_industry,questions_answer_count,full_text_similar,text_similarity,full_topic_similar,topic_similarity,total_hearts_score,final_score
35345,be5d23056fcb4f1287c823beec5291e1,5e12cb630a7c4da3aa79d54a9dd792b2,Can i be a police officer #police #police-off...,1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",1,Can i be a police officer #police #police-off...,0.216301,Can i be a police officer #police #police-off...,0.97182,2.466868,0.916504
22251,58fa5e95fe9e480a9349bbb1d7faaddb,cb7c51b7f145491085b10817ae23e92b,I want to build my career in Army. What should...,0.768622,0.69101,282.0,"mechanical-engineering, engineering, automotiv...",3,I want to build my career in Army. What should...,0.671284,I want to build my career in Army. What should...,0.903582,2.465383,0.914778
45001,be5d23056fcb4f1287c823beec5291e1,09eb3002c20a4684b51705c630379d61,"i am an 8th grader , if i go to a military hig...",1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",4,"i am an 8th grader , if i go to a military hig...",0.268783,"i am an 8th grader , if i go to a military hig...",0.879652,2.466868,0.908567
46863,f1cc078488fa49b2827a9671ab1cc582,552df6150cf842578f7bc7ab45ed3d05,What do you do in the army? i want to go to We...,0.637673,0.960253,22.0,"computer-hardware, computer-skills, software-m...",3,What do you do in the army? i want to go to We...,0.684522,What do you do in the army? i want to go to We...,0.847291,1.50515,0.904746
10889,be5d23056fcb4f1287c823beec5291e1,814c1e4562ab408a8769f1a64a8351f1,Do become a police officer what should be my q...,1.0,0.830482,283.0,"resume, police, justice, law, resume-writing, ...",3,Do become a police officer what should be my q...,0.218447,Do become a police officer what should be my q...,0.902712,2.466868,0.903112


### Results shows that we get professionals that answered similar questions

## Matrix Factorization approach

### Create a unique list of documents and professionals for the matrix

In [82]:
documents_list = questions_with_answers.full_text.unique()
professionals_list = questions_with_answers.answers_author_id.unique()

In [770]:
# train_documents = documents_list[:20000]

In [771]:
# train_documents

array(['Teacher   career   question What  is  a  maths  teacher?   what  is  a  maths  teacher  useful? #college #professor #lecture',
       'I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army',
       "Will going abroad for your first job increase your chances for jobs back home? I'm planning on going abroad for my first job. It will be a teaching job and I don't have any serious career ideas. I don't know what job I would be working if I stay home instead so I'm assuming staying or leaving won't makeba huge difference in what I care about, unless I find something before my first job. I can think of ways that going abroad can be seen as good and bad. I do not know which side respectable employers willl side with. #working-abroad #employment- #overseas",
       ...,
      

In [83]:
### Create corpus for full text for topic modeling

documents = pd.DataFrame(questions_with_answers.full_text.unique(), columns = ['full_text'])

In [84]:
documents.tail()

Unnamed: 0,full_text
22739,What is a computer engineer & a computer progr...
22740,What major do I need to study to be a writer I...
22741,Which careers are good if I enjoy working with...
22742,How can going to college help me advance my ca...
22743,Is age a factor for hiring entry level compute...


### create a pivot table with index as question full text and column as professional id and value as max time delta score

In [85]:
pivoted = pd.pivot_table(questions_with_answers, index = 'full_text', columns= 'answers_author_id', values= 'effective_days_to_answer_score', aggfunc= np.max)

## fill missing values with 0
pivoted.fillna(0,inplace= True)

### reindex the table and restore the original question text order
pivoted = pivoted.reindex(index= documents_list)

### transpose the dataframe

pivoted = pivoted.T

pivoted = pivoted.reindex(index= professionals_list)

pivoted.head()

full_text,Teacher career question What is a maths teacher? what is a maths teacher useful? #college #professor #lecture,I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question #military #army,"Will going abroad for your first job increase your chances for jobs back home? I'm planning on going abroad for my first job. It will be a teaching job and I don't have any serious career ideas. I don't know what job I would be working if I stay home instead so I'm assuming staying or leaving won't makeba huge difference in what I care about, unless I find something before my first job. I can think of ways that going abroad can be seen as good and bad. I do not know which side respectable employers willl side with. #working-abroad #employment- #overseas","To become a specialist in business management, will I have to network myself? i hear business management is a hard way to get a job if you're not known in the right areas. #business #networking",Are there any scholarships out there for students that are first generation and live in GA? I'm trying to find scholarships for first year students but they all seem to be for other states besides GA. Any help?? #college #scholarships #highschoolsenior #firstgeneration,How many years of coege do you need to be an engineer To be an engineer #united-states,"I want to become a doctor because of my great interest in science and helping people, but can I major in something that isn't science oriented? I am a musician and want to pursue that in college as well, but I don't want my love of science to supersede that. #medicine #music",what kind of college could i go to for a soccer player I like soccer because i been playing sense i was 6 years old. soccer is my best sport i played for 1/7 years.every year. #college #soccer #building,What are the college classes like for and graphics design major? I'm asking because I was thinking about choosing that career as an major for when I go to college. I though why not ask someone who has went to college and graduated with and degree in that subject. #graphic-design #graphics,what does it take to be an anesthesiologist? I am a sophomore who is interested in learning more about anesthesiologist. What steps do they have to take in order to become a anesthesiologist? Is it fun or hard? #doctor #healthcare #experience #anesthesiologist,...,"Did college help you become a better writer? Hi, I am an aspiring writer and I am taking college-course writing classes and I am wondering if, as a writing, your college literature and writing courses benefited you any. And if so, which benefited you more: the study and analysis of literature, or the enhancing of your writing skills in your writing course? #author #literature #creative-writing","Is graduate school a lot harder than regular college? I have heard from a numerous amount of people that graduate school is not harder than your regular college and/or university. Some people have told me that all it is is a lot more work and that it dives into specific content, which to my expectations does make sense because it is based strictly on that area of work which you are studying. Other individuals have said that graduate school is totally different and way more hard. Could this be due to it's title, Graduate School? I don't know who to believe and am wondering what the real answer to this question is! Just saying, even if it truly is more difficult this will not stop me from achieving my M.A. #college #university #graduate-school",What should you tell professionals when networking? I need to get an internship as a graduation requirement and know that networking is very important but I'm not exactly sure what to say #networking,"What colleges/universities/degrees should I be looking at to become an epidemiologist? My name is Megan, I am in 11th grade, and I would to like to become an epidemiologist. I would like to go to a college where they have an epidemiology program, or where epidemiologists have come from. Name as many as you know of because I would like to see a wide range of options. Keep in mind, I am a straight A honors student so I would prefer a more selective college/university. Also, what degrees should I be looking for in a college to become an epidemiologist? And, I am looking to become more of an investigative epidemiologist if that makes a difference. #professor #public-health #epidemiology #disease-prevention #epidemiologist #field-investigations",What is success and happiness Tell me something I dont know. #management #life #life-coach #happiness #analytics #succession-planning #success-driven #rich,What is a computer engineer & a computer programmer <p>I want to know which one is better to do and has a great salary</p>,"What major do I need to study to be a writer I am a high school junior and for awhile now I've been wondering about what career path I should take. I believe that a job is something you do for money and a career is something you do because your passionate about it. I love to write! I find myself writing everywhere in my spare time and even in class(where I'm supposed to be paying attention, lol) I want to be a writer but I don't know where to start or what resources I should be looking for. #educator #journalism #writer #famous #novel-writer","Which careers are good if I enjoy working with kids? I have been volunteering, and I have found that I really like working with kids. What careers would allow me to be around kids and work with kids? I have already considered the teaching options, e.g., in a day-care, preschool, or elementary school, but I am looking for other ideas for careers where I can work with kids. #education #children",How can going to college help me advance my career in law enforcement? I am thinking about a career in law enforcement and since many positions only require a high school diploma I was wondering if going to college or a university can help me get a higher rank in law enforcement. #police #law-enforcement,"Is age a factor for hiring entry level computer engineers? I am in my mid thirties with a Bachelor's degree in computer engineering but no professional experience. After graduation I worked in other fields just to gain a living. I am looking for a job as programmer, but my age and lack of hands on experience seem to block me. I had couple of internships, I have worked on some projects in my free time just to keep up with market demands, but it doesn't seem to satisfy employers. How can I break through? #computer-science #programming #computer-engineering #java"
answers_author_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
36ff3b3666df400f956f8335cf53e09e,0.897712,0.0,0.0,0.0,0.0,0.0,0.0,0.522517,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2aa47af241bf42a4b874c453f0381bd4,0.0,0.591647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cbd8f30613a849bf918aed5c010340be,0.0,0.442933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7e72a630c303442ba92ff00e8ea451df,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17802d94699140b0a0d2995f30c034c6,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [86]:
pivoted_matrix = pivoted.values

In [87]:
pivoted_matrix.shape

(10067, 22744)

### Text similarity for questions

In [121]:
import nltk
from textblob import TextBlob
import pandas as pd
import re
from textblob import Word

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.corpus import stopwords 

import swifter # Makes row operations on datframe as fast as vectorizing

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Manik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Manik\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Manik\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Manik\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!




In [122]:
# ## stop words removal using nltk inbuilt stopwords list and extending the list based on problem context

# stop_words = stopwords.words('english')
# stop_words.extend(['know', 'want', 'getting', 'what', 'would', 'going', 
#                    'like', 'getting', 'ever', 'every', 'hello', 'come', 
#                    'kinda', 'felt', 'whatever', 'that', 'come', 'always', 
#                    'also', 'shall', 'thing', 'good'])

# def preprocess_new_question(doc):
    
#     """
    
#     pre-processing using textblob: 
#     tokenizing, converting to lower-case, and lemmatization based on POS tagging, 
#     removing stop-words, and retaining tokens greater than length 3
    
#     """
    
#     blob = TextBlob(doc)
#     result = []
#     tag_dict = {"J": 'a', # Adjective
#                 "N": 'n', # Noun
#                 "V": 'v', # Verb
#                 "R": 'r'} #  Adverb
   
#     for sent in blob.sentences:
        
#         words_and_tags = [(w, tag_dict.get(pos[0])) for w, pos in sent.tags]    
#         lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
        
#         for token in lemmatized_list:
#             if token not in stop_words and len(token) > 3:
#                 result.append(token.lower())
#     main_result = ' '.join(result)
#     return main_result

In [77]:
# doc_sample = documents[documents.index == 1].values[0][0]

# print('original document: ')
# print(doc_sample)
# print('\n\n tokenized and stemmed document: ')
# print(preprocess_new_question(doc_sample))

original document: 
I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army


 tokenized and stemmed document: 
become army officer what become army officer priyanka bangalore 10th when college confuse take become army officer question military army


In [123]:
# ### preprocess full text column in the corpus

# processed_all_questions = documents['full_text'].swifter.apply(preprocess_new_question)

HBox(children=(IntProgress(value=0, description='Pandas Apply', max=22744, style=ProgressStyle(description_wid…




In [88]:
def question_text_similarity(new_question, threshold = 0.0):

    full_corpus = processed_all_questions.tolist()
    processed_new_question = preprocess_new_question(new_question)
    vectorizer = TfidfVectorizer()
    vectorizer.fit(full_corpus)

    tfidf_corpus = vectorizer.transform(full_corpus)
    tfidf_new_question = vectorizer.transform([processed_new_question])

    cos_sim = cosine_similarity(tfidf_corpus, tfidf_new_question)

    result = pd.DataFrame({'full_text':np.tile(documents['full_text'], cos_sim.shape[1]),
                               'similarity':cos_sim.reshape(-1,)},
                          index=np.tile(documents.index, cos_sim.shape[1]))
#     result = result[result['similarity'] >= threshold].sort_values('similarity', ascending=False)
    return result

### Matrix Factorization

In [781]:
# train_matrix = pivoted_matrix[:, :20000]
# test_matrix = pivoted_matrix[:, 20000: ]

# train_matrix.shape

(10067, 20000)

In [782]:
# test_matrix.shape

(10067, 2744)

In [89]:
from sklearn.decomposition import NMF
model = NMF(n_components = 20, init='random', random_state=0)
W = model.fit_transform(pivoted_matrix)
H = model.components_

In [90]:
W.shape

(10067, 20)

In [91]:
H.shape

(20, 22744)

### Doing predictions

In [92]:
pred = np.dot(W,H)

In [93]:
## new query

documents_list[1]

'I want to become an army officer. What can I do to become an army officer? I am Priyanka from Bangalore . Now am in 10th std . When I go to college I should not get confused on what I want to take to become army officer. So I am asking this question  #military #army'

In [95]:
### finding text similarity score for new query to existing corpus

result = question_text_similarity(documents_list[1])
sim_scores = np.array(result.similarity.values)

In [96]:
### weighing the prediction based on similarity scores

predictions_values = np.dot(pred,sim_scores)

In [97]:
recommend_df = pd.DataFrame(predictions_values.reshape(-1, len(predictions_values)), columns= professionals_list)
recommend_df = recommend_df.T

recommend_df = recommend_df.rename({0:'Prediction_score'}, axis= 1)
recommend_df = recommend_df.reset_index()
recommend_df.head()

Unnamed: 0,index,Prediction_score
0,36ff3b3666df400f956f8335cf53e09e,21.62191
1,2aa47af241bf42a4b874c453f0381bd4,0.000173
2,cbd8f30613a849bf918aed5c010340be,0.163612
3,7e72a630c303442ba92ff00e8ea451df,0.07061
4,17802d94699140b0a0d2995f30c034c6,0.02034


In [98]:
final_content_df_shreya.shape

(50106, 8)

In [99]:
merged_df = final_content_df_shreya.merge(recommend_df, how = 'left', left_on= 'professionals_id', right_on = 'index')

In [100]:
merged_df.sort_values(by = 'Prediction_score', ascending= False, inplace= True)

In [102]:
merged_df.head(50)

Unnamed: 0,professionals_id,questions_id,full_text,effective_days_to_answer_score,time_since_last_activity_score,professional_answers_hearts_score,tags_and_industry,questions_answer_count,index,Prediction_score
0,36ff3b3666df400f956f8335cf53e09e,332a511f1569444485cf7a7a556a5e54,Teacher career question What is a maths...,0.897712,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
36413,36ff3b3666df400f956f8335cf53e09e,859dec11d46847c0a2969ef75fcd7d7f,What are some tips for finding a summer job? W...,1.0,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
23177,36ff3b3666df400f956f8335cf53e09e,7340345843414f7e96c5ed94d70de664,How many years does it take to become a teache...,0.960253,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
28788,36ff3b3666df400f956f8335cf53e09e,690342b4467f4adda2a4142b73bd80ff,were is the best college for being a nurse i w...,1.0,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
1367,36ff3b3666df400f956f8335cf53e09e,f14a85dcf0614e39953d0294791b9a73,How does Going to College Serve a Person today...,0.445333,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",6,36ff3b3666df400f956f8335cf53e09e,21.62191
44868,36ff3b3666df400f956f8335cf53e09e,9f100c73d0e04c319283c3509f2f760b,"Is ""drafter"" a still-used position at engineer...",0.897712,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",4,36ff3b3666df400f956f8335cf53e09e,21.62191
11053,36ff3b3666df400f956f8335cf53e09e,f4efccbe81614c69a90a0d2ea94c5f97,What do I need to do in order to become a dent...,0.926628,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
11034,36ff3b3666df400f956f8335cf53e09e,e5c5952787a5460abb69e160093b6b72,For the major business and entrepreneurship wh...,1.0,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",3,36ff3b3666df400f956f8335cf53e09e,21.62191
11014,36ff3b3666df400f956f8335cf53e09e,e59d6a29c2814aaca5ff979c7c7a0955,Passion or money? Just from the question itsel...,1.0,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191
11013,36ff3b3666df400f956f8335cf53e09e,63f8e1b3153448caaa74943fd00ec989,"When studying psychology, how does one narrow ...",1.0,0.478491,431.0,"career, jobs, engineering, career-choice, coll...",1,36ff3b3666df400f956f8335cf53e09e,21.62191


In [105]:
unique_profs = merged_df[merged_df.Prediction_score > 0.5]['professionals_id'].unique()

In [106]:
unique_profs

array(['36ff3b3666df400f956f8335cf53e09e',
       '58fa5e95fe9e480a9349bbb1d7faaddb',
       'be5d23056fcb4f1287c823beec5291e1',
       'a1006e6a58a0447592e2435caa230f78',
       '369f1c8646b649f6997eae7809696bd5',
       '05ab77d4c6a141b999044ebbf5415b0d',
       'a6d33c38902546849c36ea7e9e9f0870',
       'e1d39b665987455fbcfbec3fc6df6056',
       'c3b4e11154f74a858779be7ba9b6f00c',
       'fe4543418c0846e5a65fa22b4ad9a304',
       '13b55ed4834e4814bb33a4c87001063d',
       '4dc61581ec7b409bbd037e483f53ba0a',
       'dc28056163a8447686e5691f4c1475b0',
       'fafeba89ca764bd891862fb8440a2962',
       'e2b4c84bf1ca4aea9b108869692d8017',
       '96bbbdd06a334805a0501034d9df1aa4',
       'd67ce930870945109a7ad86d29ba2035',
       'bc46e3699d92477ba8c7aa723e54a151',
       '887c5a142b42466fb74740d72989fc74',
       '209fcd55fefa4fe29ccedcdc26bd5d89',
       '8d63cf34213f45189a5a8eabd9d71529',
       'a72bde6ac9e349d195a6d356444c9578',
       '7bffa1792d474359b922e5de700d7a82',
       'c25

In [109]:
merged_df[merged_df.professionals_id == '58fa5e95fe9e480a9349bbb1d7faaddb']['full_text'].head()
    

2345     What's a better plan after graduating: find a ...
48950    Do you really need 11 years of school to becom...
16536    Do men get paid more in engineering than woman...
23198    After 10 th standard which group i want to tak...
40816    Is it worth pursuing a doctorates degree in ma...
Name: full_text, dtype: object

In [110]:
merged_df[merged_df.professionals_id == 'be5d23056fcb4f1287c823beec5291e1']['full_text'].head()

6349     How are young adults supposed to gain the expe...
18564    Is there anyone who likes being a social worke...
46670    Is it common for employers to offer education ...
33993    Once I finish college, how can i get the exper...
6934     What are your long range and short range goals...
Name: full_text, dtype: object

In [111]:
merged_df[merged_df.professionals_id == '7919ea19db274c1fb862f3456cd25ac5']['full_text'].head()

27988    Is proficiency in Java enough to get you place...
35563    Is taking a semester off for an internship a g...
34414    Do I have what it takes to get into Cornell? h...
2628     As an entry level intern for the government wh...
20203    Should I choose a major based on what my paren...
Name: full_text, dtype: object

### Results of matrix factorization not very good

### We can see that the from the above, the professionals recommended by the matrix factorization approach have not answered any questions related to army word which is in the given query