Building a recommendation system using word analogies
--------

Word embeddings can be used to build data products. 

When I worked at an employment website, I built a recommendation engine for job seekers. A job seeker would provide a list of previous job titles, and we would suggest jobs for them. My goal: given a current job title, what could I suggest as a better job? By offering career advancement advice, we could add value for the job seeker.

I framed the data product as a word analogy problem.

> Man is to king as woman is to queen

or

> Prince is to king as princess is to ________.

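To make the analogy idea concrete before loading a real model, here is a minimal sketch of the underlying vector arithmetic. The two-dimensional vectors are made up for illustration (one axis loosely stands for "royal rank", the other for "gender"), and `toy_analogy` and `TOY_VECTORS` are hypothetical names, not part of gensim. It is similar in spirit to, but not the same as, what gensim's `most_similar` computes.

```python
import math

# Hypothetical 2-D embeddings, invented for this sketch:
# axis 0 ~ "royal rank", axis 1 ~ "gender".
# Real embeddings have hundreds of learned dimensions.
TOY_VECTORS = {
    "man":      (0.0, 1.0),
    "woman":    (0.0, -1.0),
    "prince":   (0.5, 1.0),
    "princess": (0.5, -1.0),
    "king":     (1.0, 1.0),
    "queen":    (1.0, -1.0),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def toy_analogy(worda, wordb, wordc, vectors=TOY_VECTORS):
    """Complete 'worda is to wordb as wordc is to ____' on the toy vectors."""
    a, b, c = vectors[worda], vectors[wordb], vectors[wordc]
    # Move from worda to wordb, then apply the same offset to wordc.
    target = tuple(bi - ai + ci for ai, bi, ci in zip(a, b, c))
    # Exclude the input words so the answer is never one of them.
    candidates = [w for w in vectors if w not in (worda, wordb, wordc)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(toy_analogy("man", "king", "woman"))
print(toy_analogy("prince", "king", "princess"))
```
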
In [105]:
import gensim
import gensim.downloader

In [106]:
model = gensim.downloader.load('glove-wiki-gigaword-300')

In [107]:
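def complete_analogy(worda, wordb, wordc):
    """Return the single best match that completes: {worda} is to {wordb} as {wordc} is to ____"""
    try:
        # Subtract worda, add wordb and wordc, then rank the vocabulary by similarity
        result = model.most_similar(negative=[worda],
                                    positive=[wordb, wordc])
        # Skip the top result if it is just the simple plural of wordc
        top_result = result[0][0]
        if top_result != wordc + 's':
            return top_result
        second_best_result = result[1][0]
        return second_best_result
    except KeyError as error:
        # A word missing from the model's vocabulary raises KeyError;
        # return the error instead of crashing
        return error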

assert complete_analogy("man", "king", "woman") == 'queen'

In [108]:
lower_position  = "prince"
higher_position = "king"
original_job_titles = ['princess', 'valet', 'gardener'] 
promotions = [complete_analogy(lower_position, higher_position, job_title) for job_title in original_job_titles]

In [109]:
for job_title, promotion in zip(original_job_titles, promotions):
    print(f"A {job_title} can be promoted to {promotion}.")

A princess can be promoted to queen.
A valet can be promoted to concierge.
A gardener can be promoted to horticulturist.


Notes on Building a Data Product
-----

The most important element of building a data product is a large quantity of high-quality data. For the actual model, I built a custom embedding space using millions of job postings.

Another element is error handling. Above, I had to handle simple plurals. The actual system had complex logic to handle edge cases.
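
As an illustration of what slightly broader edge-case handling might look like, here is a minimal sketch that skips candidates that are trivial variants of any input word (same word, different casing, or a crude plural or singular form). The function name `pick_promotion` and the sample scores are hypothetical and are not taken from the production system; the only assumption about gensim is that `most_similar` returns a best-first list of `(word, score)` pairs.

```python
def pick_promotion(candidates, input_words):
    """Return the first candidate that is not a trivial variant of an input word.

    `candidates` is a best-first list of (word, score) pairs, the shape
    returned by gensim's `most_similar`. Returns None if nothing survives.
    """
    blocked = set()
    for word in input_words:
        w = word.lower()
        # Block the word itself and its crude plural forms.
        blocked.update({w, w + "s", w + "es"})
        if w.endswith("s"):
            # Crude singular form; this also trims words like "princess".
            blocked.add(w[:-1])
    for word, _score in candidates:
        if word.lower() not in blocked:
            return word
    return None

# Made-up candidate list for illustration only.
sample = [("gardeners", 0.9), ("Gardener", 0.8), ("horticulturist", 0.7)]
print(pick_promotion(sample, ["gardener"]))
```

Returning `None` when every candidate is filtered out lets the caller decide whether to show no suggestion or fall back to another strategy.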

<center><h2>Sources of Inspiration</h2></center>

- https://radimrehurek.com/gensim/models/keyedvectors.html
- https://towardsdatascience.com/how-to-solve-analogies-with-word2vec-6ebaf2354009