# Doc2Vec Cosine Similarity

#### :brief: This program finds similar courses based on course title, overview description, and/or target audience. We convert each course's text to a doc2vec vector, then compute the cosine similarity between every pair of courses.

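#### :conclusion: Semantic comparison of course descriptions does not look promising for our purpose of finding replacement courses: the top-scoring pairs in the output below are catalog entries with identical titles and descriptions. Doc2vec on the course title alone likewise seems to simply return matching titles.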

In [129]:
# !pip install -U gensim

In [130]:
import numpy as np
import pandas as pd
import pickle
import urllib.request  # Python 3 location of urlopen
import collections
import re
pd.options.mode.chained_assignment = None  # default='warn'

from gensim import utils
from gensim.models.doc2vec import TaggedDocument  # replaces the deprecated LabeledSentence
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score,confusion_matrix,accuracy_score,roc_curve
from sklearn.metrics.pairwise import cosine_similarity

LabeledSentence = TaggedDocument  # alias so code written against the old name still runs


def extractCourseDescription(catalog):
    """
    :brief: This function scrapes the course overview description and target audience from each course url.
    """
    courseDes = []
    target_audience = []
    for i,r in catalog.iterrows():
        try:
            link = r['course url'] #gather url from dataframe
            if ".htm" in link:
                # drop anything that follows ".htm" in the url
                param, value = link.split(".htm",1)
                link = param + '.htm'
                f = urllib.request.urlopen(link)
                # urlopen returns bytes in Python 3; decoding assumes the pages are UTF-8
                myfile = f.read().decode('utf-8', errors='replace')
                des = re.findall(r"<!- OVERVIEW_\[ ->(.*?)<!- ]_OVERVIEW ->", myfile)
                aud = re.findall(r"<!- TARGET_AUDIENCE_\[ ->(.*?)<!- ]_TARGET_AUDIENCE ->", myfile)
            else:
                des = "no description"
                aud = "no target audience"
        except Exception as e:
            # store the error in place of the text
            print(e)
            des = e
            aud = e

        # append exactly once per row so both lists stay aligned with the catalog
        courseDes.append(des)
        target_audience.append(aud)

        # print progress messages
        if i%100==0:
            print("finished " + str(i) + " out of " + str(len(catalog)) + " records...")

    catalog["Course Description"] = courseDes
    catalog["Target Audience"] = target_audience
    print("Done.")
    return catalog
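
As a minimal sketch of the extraction step, the snippet below applies the same marker-based regular expressions to an in-memory HTML string instead of a live page. The sample HTML and the helper name `extract_between_markers` are illustrative assumptions, not part of the original scraper, and `re.DOTALL` is added on the assumption that a description may span several lines.

```python
import re

def extract_between_markers(html, name):
    """Return every text span found between the <!- NAME_[ -> and <!- ]_NAME -> markers."""
    # {0} is filled with the (escaped) marker name, e.g. OVERVIEW
    pattern = r"<!- {0}_\[ ->(.*?)<!- \]_{0} ->".format(re.escape(name))
    # DOTALL lets "." match newlines, so multi-line descriptions are captured whole
    return re.findall(pattern, html, flags=re.DOTALL)

# Hypothetical page content, made up for illustration
sample_html = (
    "<html><body>"
    "<!- OVERVIEW_[ ->Learn to negotiate.\nSecond line.<!- ]_OVERVIEW ->"
    "<!- TARGET_AUDIENCE_[ ->New managers<!- ]_TARGET_AUDIENCE ->"
    "</body></html>"
)

overview = extract_between_markers(sample_html, "OVERVIEW")
audience = extract_between_markers(sample_html, "TARGET_AUDIENCE")
missing = extract_between_markers(sample_html, "PREREQUISITES")  # marker not present
```

`re.findall` returns a list, which is why the scraped columns hold one-element lists such as `[...]` rather than plain strings; a page without the markers yields an empty list.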

In [131]:
"""
:brief: uncomment and run this cell to scrape webpages and update pickle file with course descriptions and target audience.
    - smaller subset, takes less than 10 mins
"""

# catalogBusDec = pd.read_csv('immuta/December 05 2016 Catalog Business Courses/December 05 2016 Catalog Business Courses.csv')
# catalogBusDecAddedDesc = extractCourseDescription(catalogBusDec)


In [132]:
"""
save catalogBusDecAddedDesc to pickle file
"""
# with open('catalog_with_desc.txt','wb') as f:  # pickle needs binary mode
#     pickle.dump(catalogBusDecAddedDesc,f)


In [133]:
# load the pickled catalog and close the file afterwards
with open('/home/jupyter/work/catalog_with_desc.txt', 'rb') as f:
    df = pickle.load(f)
df.tail(2)

Unnamed: 0,index,language,solution area,curriculum,series,course title,course#,course url,asset type,estimated duration hours,skillport,cd,replaces,Course Description,Target Audience
1021,1021,English,SALES and CUSTOMER FACING SKILLS,TestPreps,Test Preps,TestPrep ITIL Foundation,ib_itlv_a01_tp_enus,http://library.skillport.com/coursedesc/ib_itlv_a01_tp_enus/summary.htm,SkillSoft Testprep Exams,1.0,Released,Released,,"[To test your knowledge on the skills and competencies being measured by the vendor certification exam*. TestPrep can be taken in either Study or Certification mode. Study mode is designed to maximize learning by not only testing your knowledge of the material, but also by providing additional information on the topics presented. Certification mode is designed to test your knowledge of the material within a structured testing environment, providing valuable feedback at the end of the test.<br/><br/>* This TestPrep is aligned to the ITIL 2011 Edition publications.]","[Individuals seeking practice in a structured testing environment, covering the skills and competencies being measured by the vendor certification exam.]"
1022,1022,English,SALES and CUSTOMER FACING SKILLS,Mentoring Assets,Mentoring Assets,Mentoring ITIL Foundation,mntitv3f,http://library.skillport.com/coursedesc/mntitv3f/summary.htm,SkillSoft Mentoring Assets,,Released,,,[Skillsoft Mentors are available to help students with their studies for the ITIL Foundation exam. You can reach them by entering a Mentored Chat Room or by using the Email My Mentor service.<br/><br/>* This asset is aligned to the ITIL 2011 Edition publications.],"[Individuals who are studying the associated Skillsoft content in preparation for, or to become familiar with, the skills and competencies being measured by the actual certification exam.]"


In [134]:
"""
:brief: to identify similar courses, use doc2vec on title, description, and target audience
"""

# df.columns
# convert each scraped list to its string representation
df['Course Description'] = df['Course Description'].map(str)
df['Target Audience'] = df['Target Audience'].map(str)

# combine desc + aud into a str
df['desc+aud'] = df['Course Description'] + df['Target Audience']
pd.set_option('display.max_colwidth', None)  # show full cell text (-1 is deprecated in recent pandas)
df.tail(1)['desc+aud']

1022    ['Skillsoft Mentors are available to help students with their studies for the ITIL Foundation exam. You can reach them by entering a Mentored Chat Room or by using the Email My Mentor service.<br/><br/>* This asset is aligned to the ITIL 2011 Edition publications.']['Individuals who are studying the associated Skillsoft content in preparation for, or to become familiar with, the skills and competencies being measured by the actual certification exam.']
Name: desc+aud, dtype: object

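### Modeling Doc2Vec on the whole data
We train one Doc2Vec model on every course's combined description and target audience. The model can infer a vector for any new text, such as a list of keywords; comparing that vector with all the document vectors finds the course with the highest cosine similarity.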
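
To make the comparison step concrete, here is a minimal sketch in plain Python. It assumes the query and document vectors have already been inferred; the three-dimensional numbers are made up for illustration, and `cosine` and `rank_documents` are hypothetical helper names. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for inferred document vectors (made-up numbers)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [2.0, 0.0, 0.0]  # same direction as document 0, different length

ranking = rank_documents(query_vec, doc_vecs)
best_index, best_score = ranking[0]
```

Here document 0 ranks first with a similarity of 1 even though the query is twice as long, which is the property that makes cosine similarity a common choice for comparing embedding vectors.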

In [135]:
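from gensim.models.doc2vec import TaggedDocument  # replaces the deprecated LabeledSentence

def Labeled(s,l):
    # wrap each text as a TaggedDocument: a list of tokens plus a one-element list of tags
    sentences = []
    for i,talk in enumerate(s):
        sentences.append(TaggedDocument(utils.to_unicode(talk).split(),[l[i]]))
    return sentences

sentences_all = Labeled(df['desc+aud'], range(len(df)))  # one tag per catalog row
# vector_size is the gensim 4 name for the older size parameter
model = Doc2Vec(min_count=1, window=10, vector_size=50, sample=1e-4, negative=5, workers=7)  # vector_size=128
model.build_vocab(sentences_all)
# recent gensim versions require the corpus size and epoch count to be passed explicitly
model.train(sentences_all, total_examples=model.corpus_count, epochs=model.epochs)
X = []
for doc_id in range(len(sentences_all)):
    inferred_vector = model.infer_vector(sentences_all[doc_id].words)
    X.append(inferred_vector)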

## Cosine similarity matrix

In [136]:
cosine_similarity(X).shape  # one row and one column per course

(1023, 1023)

In [137]:
df.columns
df.tail(1)

Unnamed: 0,index,language,solution area,curriculum,series,course title,course#,course url,asset type,estimated duration hours,skillport,cd,replaces,Course Description,Target Audience,desc+aud
1022,1022,English,SALES and CUSTOMER FACING SKILLS,Mentoring Assets,Mentoring Assets,Mentoring ITIL Foundation,mntitv3f,http://library.skillport.com/coursedesc/mntitv3f/summary.htm,SkillSoft Mentoring Assets,,Released,,,['Skillsoft Mentors are available to help students with their studies for the ITIL Foundation exam. You can reach them by entering a Mentored Chat Room or by using the Email My Mentor service.<br/><br/>* This asset is aligned to the ITIL 2011 Edition publications.'],"['Individuals who are studying the associated Skillsoft content in preparation for, or to become familiar with, the skills and competencies being measured by the actual certification exam.']","['Skillsoft Mentors are available to help students with their studies for the ITIL Foundation exam. You can reach them by entering a Mentored Chat Room or by using the Email My Mentor service.<br/><br/>* This asset is aligned to the ITIL 2011 Edition publications.']['Individuals who are studying the associated Skillsoft content in preparation for, or to become familiar with, the skills and competencies being measured by the actual certification exam.']"


In [142]:
similarity_matrix = cosine_similarity(X)

# replace diagonal values with 0
np.fill_diagonal(similarity_matrix, 0)
scores = []
similar_to = []
for i, x in enumerate(similarity_matrix):
    scores.append(max(x))
    similar_to.append(np.argmax(x))
#     print(i, argmax(x), round(max(x), 2))
df['scores'] = scores
df['similar_to'] = similar_to

# create description_of_similar_to
descriptions_of_similar_to = []
titles_of_similar_to = []
for i, x in enumerate(df['similar_to']):
    descriptions_of_similar_to.append(df['Course Description'][x])
    titles_of_similar_to.append(df['course title'][x])
    
df['description_of_similar_to'] = descriptions_of_similar_to
df['titles_of_similar_to'] = titles_of_similar_to
pd.set_option('display.max_rows', 500)  # display.height is deprecated, so it is no longer set

similar_courses = df[['course title', 'desc+aud', 'scores', 'similar_to', 'titles_of_similar_to', 'description_of_similar_to']].sort_values('scores', ascending=False)
similar_courses[similar_courses['scores'] > .9].head(10)

Unnamed: 0,course title,desc+aud,scores,similar_to,titles_of_similar_to,description_of_similar_to
155,Effective Critical Analysis of Business Reports,"['Effective decision making requires sound analytics. This impact explores the pitfalls of basing decisions on faulty logic.']['Students preparing to enter the workforce, entry level employees who have just entered the workforce and mid-level employees looking to refresh their skills.']",0.99997,120,Effective Critical Analysis of Business Reports,['Effective decision making requires sound analytics. This impact explores the pitfalls of basing decisions on faulty logic.']
120,Effective Critical Analysis of Business Reports,"['Effective decision making requires sound analytics. This impact explores the pitfalls of basing decisions on faulty logic.']['Students preparing to enter the workforce, entry level employees who have just entered the workforce and mid-level employees looking to refresh their skills.']",0.99997,155,Effective Critical Analysis of Business Reports,['Effective decision making requires sound analytics. This impact explores the pitfalls of basing decisions on faulty logic.']
121,Leading Outside the Organization,"[""A leader's public image is just as important as his or her management ability. This Business Impact examines the expanding role of today's business leaders outside of their organizations.""]['Individuals responsible for leading teams either occasionally, for example as project managers, or more permanently as team leaders or line managers.']",0.999934,470,Leading Outside the Organization,"[""A leader's public image is just as important as his or her management ability. This Business Impact examines the expanding role of today's business leaders outside of their organizations.""]"
470,Leading Outside the Organization,"[""A leader's public image is just as important as his or her management ability. This Business Impact examines the expanding role of today's business leaders outside of their organizations.""]['Individuals responsible for leading teams either occasionally, for example as project managers, or more permanently as team leaders or line managers.']",0.999934,121,Leading Outside the Organization,"[""A leader's public image is just as important as his or her management ability. This Business Impact examines the expanding role of today's business leaders outside of their organizations.""]"
917,Succeeding in Account Management,"[""Successful account managers pursue high-profit customers and work hard to retain them. They strive to understand and satisfy their customers' needs, knowing that their efforts will serve the strategic initiatives of not only their own company, but of their client as well. This Challenge Series product explores effective account management techniques. You'll assume the role of an account manager for a human resources consulting firm.""]['Experienced sales professionals who wish to improve their ability to manage accounts']",0.999934,914,Succeeding in Account Management,"[""Successful account managers pursue high-profit customers and work hard to retain them. They strive to understand and satisfy their customers' needs, knowing that their efforts will serve the strategic initiatives of not only their own company, but of their client as well. This Challenge Series product explores effective account management techniques. You'll assume the role of an account manager for a human resources consulting firm.""]"
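
The nearest-neighbour step used to build this table can be sketched without NumPy as follows. This is an illustration under stated assumptions: the 4x4 similarity matrix is made up, and `nearest_neighbours` and `unique_pairs` are hypothetical helpers. The second helper is an addition that is not in the original notebook; it collapses mirrored matches (for example 155 -> 120 and 120 -> 155) so that each pair is listed once.

```python
def nearest_neighbours(sim):
    """For each row of a square similarity matrix, return (best_index, best_score),
    skipping the diagonal entry (a document's similarity with itself)."""
    result = []
    for i, row in enumerate(sim):
        others = [j for j in range(len(row)) if j != i]
        # max with a key returns the first maximal index, like np.argmax
        best_j = max(others, key=lambda j: row[j])
        result.append((best_j, row[best_j]))
    return result

def unique_pairs(neighbours):
    """Collapse mirrored matches such as (0 -> 1) and (1 -> 0) into a single pair."""
    pairs = {}
    for i, (j, score) in enumerate(neighbours):
        key = (min(i, j), max(i, j))
        pairs[key] = max(score, pairs.get(key, score))
    return sorted(pairs.items(), key=lambda item: item[1], reverse=True)

# Made-up similarity matrix for four documents
sim = [
    [1.00, 0.99, 0.10, 0.20],
    [0.99, 1.00, 0.30, 0.25],
    [0.10, 0.30, 1.00, 0.80],
    [0.20, 0.25, 0.80, 1.00],
]
neighbours = nearest_neighbours(sim)
pairs = unique_pairs(neighbours)
```

Skipping the diagonal has the same effect as filling it with 0 before taking the row maximum, provided at least one off-diagonal similarity in each row is positive.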


## The semantic approach on the descriptions doesn't work well for our purposes. Let's try doc2vec on the course title only.

In [143]:
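sentences_all = Labeled(df['course title'], range(len(df)))  # one tag per catalog row
# vector_size is the gensim 4 name for the older size parameter
model = Doc2Vec(min_count=1, window=10, vector_size=50, sample=1e-4, negative=5, workers=7)  # vector_size=128
model.build_vocab(sentences_all)
# recent gensim versions require the corpus size and epoch count to be passed explicitly
model.train(sentences_all, total_examples=model.corpus_count, epochs=model.epochs)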
X = []
for doc_id in range(len(sentences_all)):
    inferred_vector = model.infer_vector(sentences_all[doc_id].words)
    X.append(inferred_vector)
    
    
similarity_matrix = cosine_similarity(X)

# replace diagonal values with 0
np.fill_diagonal(similarity_matrix, 0)
scores = []
similar_to = []
for i, x in enumerate(similarity_matrix):
    scores.append(max(x))
    similar_to.append(np.argmax(x))
#     print(i, argmax(x), round(max(x), 2))
df['scores'] = scores
df['similar_to'] = similar_to

# create description_of_similar_to
descriptions_of_similar_to = []
titles_of_similar_to = []
for i, x in enumerate(df['similar_to']):
    descriptions_of_similar_to.append(df['Course Description'][x])
    titles_of_similar_to.append(df['course title'][x])
    
df['description_of_similar_to'] = descriptions_of_similar_to
df['titles_of_similar_to'] = titles_of_similar_to
pd.set_option('display.max_rows', 500)  # display.height is deprecated, so it is no longer set

similar_courses = df[['course title', 'desc+aud', 'scores', 'similar_to', 'titles_of_similar_to', 'description_of_similar_to']].sort_values('scores', ascending=False)
similar_courses[similar_courses['scores'] > .9].head(5)

Unnamed: 0,course title,desc+aud,scores,similar_to,titles_of_similar_to,description_of_similar_to
809,Core PMI® Values and Ethical Standards,"['As a project manager, you will inevitably be called upon to address ethical dilemmas. The type and complexity of these dilemmas can vary significantly from balancing the competing interests of stakeholders to adhering to conflicting legal, multi-cultural, and multi-national rules, regulations, and requirements. Addressing these issues is much more complex than simply deciding what is right and what is wrong. In an increasingly global network, project managers must proactively seek to understand cultural diversity, and how to work successfully with multi-national teams. Sensitivity to other groups, their social customs, and their means of doing business is key to success. Often, project managers will need to weigh all competing interests fairly and objectively in order to make the ethical decision that will have the most far-reaching benefits. In this course, learners will explore the values underlying ethical decisions and behaviors as outlined in the PMI® Code of Ethics and Professional Conduct. For each value, learners will be introduced to the integrity aspired to, as well as the mandatory conduct demanded of project managers to effectively manage projects and further promote project management as a profession. Topics covered include the behaviors that align with the core values of responsibility, respect, honesty, and fairness; how to integrate ethics into your project environments; and how to resolve ethical dilemmas. The course provides a foundational knowledge base reflecting the most up-to-date project management information so learners can effectively put principles to work at their own organizations. This course will assist in preparing the learner for the PMI® certification exam. This course is aligned with A Guide to the Project Management Body of Knowledge (PMBOK® Guide) – Fifth Edition, published by PMI®, Inc., 2013. Copyright and all rights reserved. Material from this publication has been reproduced with the permission of PMI®.']['Existing project managers wishing to get certified in recognition of their skills and experience, or others who wish to train to become accredited project managers.']",0.999917,803,Core PMI® Values and Ethical Standards,"['As a project manager, you will inevitably be called upon to address ethical dilemmas. The type and complexity of these dilemmas can vary significantly from balancing the competing interests of stakeholders to adhering to conflicting legal, multi-cultural, and multi-national rules, regulations, and requirements. Addressing these issues is much more complex than simply deciding what is right and what is wrong. In an increasingly global network, project managers must proactively seek to understand cultural diversity, and how to work successfully with multi-national teams. Sensitivity to other groups, their social customs, and their means of doing business is key to success. Often, project managers will need to weigh all competing interests fairly and objectively in order to make the ethical decision that will have the most far-reaching benefits. In this course, learners will explore the values underlying ethical decisions and behaviors as outlined in the PMI® Code of Ethics and Professional Conduct. For each value, learners will be introduced to the integrity aspired to, as well as the mandatory conduct demanded of project managers to effectively manage projects and further promote project management as a profession. Topics covered include the behaviors that align with the core values of responsibility, respect, honesty, and fairness; how to integrate ethics into your project environments; and how to resolve ethical dilemmas. The course provides a foundational knowledge base reflecting the most up-to-date project management information so learners can effectively put principles to work at their own organizations. This course will assist in preparing the learner for the PMI® certification exam. This course is aligned with A Guide to the Project Management Body of Knowledge (PMBOK® Guide) – Fifth Edition, published by PMI®, Inc., 2013. Copyright and all rights reserved. Material from this publication has been reproduced with the permission of PMI®.']"
803,Core PMI® Values and Ethical Standards,"['As a project manager, you will inevitably be called upon to address ethical dilemmas. The type and complexity of these dilemmas can vary significantly from balancing the competing interests of stakeholders to adhering to conflicting legal, multi-cultural, and multi-national rules, regulations, and requirements. Addressing these issues is much more complex than simply deciding what is right and what is wrong. In an increasingly global network, project managers must proactively seek to understand cultural diversity, and how to work successfully with multi-national teams. Sensitivity to other groups, their social customs, and their means of doing business is key to success. Often, project managers will need to weigh all competing interests fairly and objectively in order to make the ethical decision that will have the most far-reaching benefits. In this course, learners will explore the values underlying ethical decisions and behaviors as outlined in the PMI® Code of Ethics and Professional Conduct. For each value, learners will be introduced to the integrity aspired to, as well as the mandatory conduct demanded of project managers to effectively manage projects and further promote project management as a profession. Topics covered include the behaviors that align with the core values of responsibility, respect, honesty, and fairness; how to integrate ethics into your project environments; and how to resolve ethical dilemmas. The course provides a foundational knowledge base reflecting the most up-to-date project management information so learners can effectively put principles to work at their own organizations. This course will assist in preparing the learner for the PMI® certification exam. This course is aligned with A Guide to the Project Management Body of Knowledge (PMBOK® Guide) – Fifth Edition, published by PMI®, Inc., 2013. Copyright and all rights reserved. Material from this publication has been reproduced with the permission of PMI®.']['Existing project managers wishing to get certified in recognition of their skills and experience, or others who wish to train to become accredited project managers.']",0.999917,809,Core PMI® Values and Ethical Standards,"['As a project manager, you will inevitably be called upon to address ethical dilemmas. The type and complexity of these dilemmas can vary significantly from balancing the competing interests of stakeholders to adhering to conflicting legal, multi-cultural, and multi-national rules, regulations, and requirements. Addressing these issues is much more complex than simply deciding what is right and what is wrong. In an increasingly global network, project managers must proactively seek to understand cultural diversity, and how to work successfully with multi-national teams. Sensitivity to other groups, their social customs, and their means of doing business is key to success. Often, project managers will need to weigh all competing interests fairly and objectively in order to make the ethical decision that will have the most far-reaching benefits. In this course, learners will explore the values underlying ethical decisions and behaviors as outlined in the PMI® Code of Ethics and Professional Conduct. For each value, learners will be introduced to the integrity aspired to, as well as the mandatory conduct demanded of project managers to effectively manage projects and further promote project management as a profession. Topics covered include the behaviors that align with the core values of responsibility, respect, honesty, and fairness; how to integrate ethics into your project environments; and how to resolve ethical dilemmas. The course provides a foundational knowledge base reflecting the most up-to-date project management information so learners can effectively put principles to work at their own organizations. This course will assist in preparing the learner for the PMI® certification exam. This course is aligned with A Guide to the Project Management Body of Knowledge (PMBOK® Guide) – Fifth Edition, published by PMI®, Inc., 2013. Copyright and all rights reserved. Material from this publication has been reproduced with the permission of PMI®.']"
680,Rebuilding Trust,"[""Trust is one of the most important elements of a productive working environment but can easily be broken. Broken trust won't just disappear, but needs to be rebuilt. This Business Impact explores what trust is and the ways to rebuild trust once it has been broken.""]['Professionals in non-managerial roles']",0.999907,679,Rebuilding Trust,"[""Trust is an important component in any workplace. When colleagues know they can count on each other, morale and productivity levels tend to increase. But what if trust is betrayed? How will this impact your ability to perform your job effectively? This course will provide key insight into the cost of lost trust including its negative impacts on performance, morale, and ultimately the bottom line. How to rebuild trust once it's lost and maintain trust over time will also be addressed.""]"
679,Rebuilding Trust,"[""Trust is an important component in any workplace. When colleagues know they can count on each other, morale and productivity levels tend to increase. But what if trust is betrayed? How will this impact your ability to perform your job effectively? This course will provide key insight into the cost of lost trust including its negative impacts on performance, morale, and ultimately the bottom line. How to rebuild trust once it's lost and maintain trust over time will also be addressed.""]['Anyone who wants to develop or refine their skills for developing and sustaining trusting relationships']",0.999907,680,Rebuilding Trust,"[""Trust is one of the most important elements of a productive working environment but can easily be broken. Broken trust won't just disappear, but needs to be rebuilt. This Business Impact explores what trust is and the ways to rebuild trust once it has been broken.""]"
917,Succeeding in Account Management,"[""Successful account managers pursue high-profit customers and work hard to retain them. They strive to understand and satisfy their customers' needs, knowing that their efforts will serve the strategic initiatives of not only their own company, but of their client as well. This Challenge Series product explores effective account management techniques. You'll assume the role of an account manager for a human resources consulting firm.""]['Experienced sales professionals who wish to improve their ability to manage accounts']",0.999902,914,Succeeding in Account Management,"[""Successful account managers pursue high-profit customers and work hard to retain them. They strive to understand and satisfy their customers' needs, knowing that their efforts will serve the strategic initiatives of not only their own company, but of their client as well. This Challenge Series product explores effective account management techniques. You'll assume the role of an account manager for a human resources consulting firm.""]"
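
Every title-only match in the table above pairs two catalog rows that share the same title, so a plain string comparison may surface the same pairs without training a model. The sketch below illustrates that alternative; it is exact matching after normalisation, not doc2vec, and the helper names and the lower-cased variant in the sample list are assumptions made for illustration.

```python
from collections import defaultdict

def normalise_title(title):
    """Lower-case a title and collapse runs of whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())

def duplicate_title_groups(titles):
    """Map each normalised title that occurs more than once to the row positions sharing it."""
    groups = defaultdict(list)
    for position, title in enumerate(titles):
        groups[normalise_title(title)].append(position)
    return {title: rows for title, rows in groups.items() if len(rows) > 1}

# Sample titles; the third entry is an invented spelling variant of the first
titles = [
    "Rebuilding Trust",
    "Succeeding in Account Management",
    "rebuilding  trust",
    "Leading Outside the Organization",
]
groups = duplicate_title_groups(titles)
```

On the real catalog, a call such as `duplicate_title_groups(df['course title'])` would be the analogous usage, since iterating a pandas Series yields its values in row order.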
