## Recommender System - Marketplace Matching

In this notebook, we will: 
- Clean textual data from user-input verbatim posts
- Use a Word2Vec model to calculate document similarities 
- Rank the titles in our training data by their similarity to a user's input, in order to recommend similar products
- Save this model in a format that allows us to refresh the testing data

The goal of this project is to create a recommender system that helps Pangeans find the right "project" for them, given their profile information. Here, we use legacy data from Pangea V2, when Pangeans could post services and requests as well as purchase items on the platform. We use the user-inputted titles to suggest similar services or, in V3, similar "projects".

### (1) Data Collection, Preprocessing, and Exploratory Analysis

In [14]:
# Importing libraries (duplicates removed, grouped by package)
import csv
import json
import os
import pickle
import re
import sys
from collections import defaultdict
from operator import add, attrgetter, itemgetter
from pprint import pprint

import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas.io.json import json_normalize

import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import colors
from matplotlib.ticker import PercentFormatter
import seaborn as sns
from adjustText import adjust_text

import gensim
import gensim.downloader as api
from gensim import corpora
from gensim.models import KeyedVectors
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.word2vec import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler, normalize


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/angelateng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
def vectorize_and_store_existing_titles():
    # We can replace this with a file path argument in the future
    raw = pd.read_csv("allPostData.csv", header=0)

    # Lowercase each title and split it into words
    tokens = [title.lower().split() for title in raw['title']]

    # Keep purely alphabetic words and drop English stopwords
    clean_words = [[word for word in title if word.isalpha()] for title in tokens]
    stoplist = set(stopwords.words('english'))
    titles_nostopwords = [[word for word in title if word not in stoplist] for title in clean_words]

    # Load the pretrained GoogleNews vectors and keep only words the model knows
    # (model.vocab is the gensim 3.x attribute; gensim 4+ uses model.key_to_index)
    model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    filtered_word_list = [[word for word in title if word in model.vocab] for title in titles_nostopwords]

    # Sum the word vectors of each title and scale the sum to unit length
    title_vectors = {}
    for title in filtered_word_list:
        word_vecs = [model[word] for word in title]
        if len(word_vecs) == 0:
            # No known words left: fall back to a zero vector
            title_vec = [np.zeros(300)]
        else:
            title_vec = normalize(sum(word_vecs).reshape(1, -1))
        title_vectors[" ".join(title)] = title_vec[0]
    # do we want to print out the original titles? how to do this?
    vectorized_titles = pd.DataFrame.from_dict(title_vectors)

    # Can also replace the file path in the future
    vectorized_titles.to_pickle("/Users/angelateng/Dropbox/SharpestMinds/vectorized_titles.pkl")
    return vectorized_titles

#vectorize_and_store_existing_titles()
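The title-vector step inside `vectorize_and_store_existing_titles` (sum the word vectors of a title, then scale the sum to unit length) can be sketched on toy data. This is only an illustration: `toy_model`, its 3-dimensional values, and the helper `title_to_vector` are made up here and stand in for the 300-dimensional GoogleNews vectors; stopword removal is left out to keep it short.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional
# GoogleNews vectors; the words and values are invented for illustration.
toy_model = {
    "math": np.array([1.0, 0.0, 0.0]),
    "tutor": np.array([0.0, 1.0, 0.0]),
}

def title_to_vector(title, model, dim=3):
    """Sum the vectors of known alphabetic words and scale to unit length."""
    words = [w for w in title.lower().split() if w.isalpha() and w in model]
    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(dim)
    total = np.sum([model[w] for w in words], axis=0)
    norm = np.linalg.norm(total)
    # Guard against a zero-length sum before dividing
    return total / norm if norm > 0 else np.zeros(dim)

vec = title_to_vector("Math Tutor 101", toy_model)
print(vec)
```

Because "101" is not alphabetic, it is dropped, and the result is the unit vector pointing halfway between "math" and "tutor".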


In [8]:
#pd.read_pickle("/Users/angelateng/Dropbox/SharpestMinds/vectorized_titles.pkl")
#sanity check

In [10]:
import json
from pprint import pprint

with open('allPostData.json') as fresh_data:
    user_post = json.load(fresh_data)

pprint(user_post)

[{'address': {'components': {'city': 'Providence',
                             'country': 'US',
                             'county': 'Providence County',
                             'number': '158',
                             'state': 'RI',
                             'street': 'University Ave',
                             'zip_code': '02906'},
              'formatted': '158 University Ave, Providence, RI 02906, USA'},
  'archived_status': 'live',
  'category': 'Cooking',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1505340466,
                 'timezone': '+00:00'},
  'description': "I'm trying to broaden the repertoire of culinary dishes I "
                 "can make. Please help! I'm surviving on chicken and pasta "
                 "rn... I'll buy the ingredients.",
  'id': '-Kty6gXaY-_wgIheV3LS',
  'last_modified': '2018-09-07T23:08:39.964Z',
  'location': {'$reql_type$': 'GEOMETRY',
               'coordinates': [-7

  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1536361705.571,
                    'timezone': '+00:00'},
  'location': {'$reql_type$': 'GEOMETRY',
               'coordinates': [-82.9065344, 40.0424984],
               'type': 'Point'},
  'owner_id': '8nhkCkLxlmSNAuNgVW6y767p8uH3',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price': 10,
  'price_type': 'flexible',
  'status': 'legacy',
  'title': 'Social Media Management',
  'visibility': 'public'},
 {'address': {'components': {'city': 'Providence',
                             'country': 'US',
                             'county': 'Providence County',
                             'number': '39',
                             'state': 'RI',
                             'street': 'E George St',
                             'zip_code': '02906'},
              'formatted': '39 E George St, Providence, RI 02906, USA'},
  'archived_status': 'live',
  'category': 'Electronics',
  'c

  'price_type': 'flexible',
  'status': 'legacy',
  'title': 'Accounting Help',
  'visibility': 'public'},
 {'address': {'components': {'city': 'Providence',
                             'country': 'US',
                             'county': 'Providence County',
                             'premise': 'Sayles Hall',
                             'state': 'RI',
                             'zip_code': '02912'},
              'formatted': 'Sayles Hall, Providence, RI 02912, USA'},
  'archived_status': 'live',
  'category': 'Instructor',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1519322191,
                 'timezone': '+00:00'},
  'description': 'Fun stuff',
  'id': '-L5zirsVOhPnaLiEIOd-',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1536361711.219,
                    'timezone': '+00:00'},
  'location': {'$reql_type$': 'GEOMETRY',
               'coordinates': [-71.4025754, 41.8262112],
          

 {'address': {'components': {'city': 'Providence',
                             'country': 'US',
                             'county': 'Providence County',
                             'number': '33',
                             'state': 'RI',
                             'street': 'Vinton St',
                             'zip_code': '02909'},
              'formatted': '33 Vinton St, Providence, RI 02909, USA'},
  'archived_status': 'live',
  'category': 'Cooking',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1513625723,
                 'timezone': '+00:00'},
  'description': '5 “meals” and 2 “snacks” prepared by a chef. Can be '
                 'vegetarian or vegan. \n'
                 '\n'
                 'This can last 1 week in place of lunch and afternoon snacks. '
                 'Typical portions are not supremely huge but large enough for '
                 'leftovers! ',
  'id': '-L0gBZef-6Owj6T1Rpf5',
  'last_mod

 {'address': {'components': {'city': 'Providence',
                             'country': 'US',
                             'county': 'Providence County',
                             'premise': 'Andrews Hall',
                             'state': 'RI',
                             'zip_code': '02912'},
              'formatted': 'Andrews Hall, Providence, RI 02912, USA'},
  'archived_status': 'live',
  'category': 'Cooking',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1524017321,
                 'timezone': '+00:00'},
  'description': "I'll be your best friend please. I'm a cool friend, ill be "
                 'your friend; a friend.',
  'id': '-LAMMd5_Exl4wQ41hVgq',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1536361715.875,
                    'timezone': '+00:00'},
  'location': {'$reql_type$': 'GEOMETRY',
               'coordinates': [-71.4025101, 41.8305829],
               'type': 'Po

  'owner_id': 'R4J66WJOF5XG8iHfnYENBy15Mv32',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price_type': 'free',
  'status': 'legacy',
  'title': 'Pangea Post Consulting ',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542657469.764,
                 'timezone': '+00:00'},
  'description': 'I need some help in smash 4',
  'id': '0591563d-55c1-4868-ade4-56983b4da475',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542657469.764,
                    'timezone': '+00:00'},
  'owner_id': 'ad2dcf61-fbb2-49f3-8cd0-8a566669c617',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'request',
  'price': 0,
  'price_type': 'flexible',
  'status': 'inactive',
  'title': 'Smash 4 Tutor',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'other_

  'photos': [],
  'post_type': 'request',
  'price': 1,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Teach me meme culture',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'events',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542663491.452,
                 'timezone': '+00:00'},
  'description': 'If any of you guys wanna start a new large server hmu',
  'id': '0ce1aae3-84d8-472c-b05e-f3bb39b05d5f',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542663491.452,
                    'timezone': '+00:00'},
  'owner_id': 'fffd2dad-6768-47c5-8189-843c8afd77f2',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'request',
  'price': 0,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Looking For Factorio Players',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'other_service',
  

  'post_type': 'request',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Software Design Internship',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542666017.928,
                 'timezone': '+00:00'},
  'description': 'I can give some advise and I’ll proof your paper!',
  'id': '19cb9f4f-04b8-4203-935d-68c7a5764da4',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542666017.928,
                    'timezone': '+00:00'},
  'owner_id': '8ee22344-49db-4c48-8393-65725be1e658',
  'payment_type': 'hourly',
  'payment_type_new': 'flat_fee',
  'photos': [],
  'post_type': 'offering',
  'price': 15,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Tips for Writing a Paper',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'servic

                    'timezone': '+00:00'},
  'owner_id': '3c4507dc-63c0-41c9-8519-facd730605aa',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': ['01542663604517'],
  'post_type': 'request',
  'price': 0,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Learn Special Meme Techniques ',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'design',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541732378.094,
                 'timezone': '+00:00'},
  'description': 'Basic design skills',
  'id': '124fd9a8-f21c-42d2-b89e-58acdb14a97d',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541732378.094,
                    'timezone': '+00:00'},
  'owner_id': '811ad179-9948-4a00-8249-ed43bd5c2efa',
  'payment_type': 'hourly',
  'photos': [],
  'post_type': 'offering',
  'price': None,
  'price_type': 'free',
  'status': 'active',
  'title': 'Design',
  'visi

  'id': '20959cfe-a527-4caa-84e8-68ddaec79bc4',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1540508363.407,
                    'timezone': '+00:00'},
  'owner_id': '252d36b6-4ce0-4bc7-9a39-1cd7881db4e6',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price': 15,
  'price_type': 'set_price',
  'status': 'active',
  'title': 'Spanish tutoring ',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'photography',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1543839787.965,
                 'timezone': '+00:00'},
  'description': 'Looking for a photographer to capture and edit pictures for '
                 'an event on Thursday  Dec. 6th. Would be about 3 hours of '
                 'your time. Editing the photos is a must!',
  'id': '2e1da4e3-375f-4be1-9e69-129d21e30a74',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 154383978

                 'epoch_time': 1542074201.833,
                 'timezone': '+00:00'},
  'description': 'I want the Lehrer family to be looked favorably upon for '
                 'decades to come-you can help make that happen',
  'id': '210313b6-ea3d-457e-a452-b803e8650263',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542074201.833,
                    'timezone': '+00:00'},
  'owner_id': '8c9cc994-4701-41e3-adfc-4fc6ffabfc63',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'request',
  'price': 25,
  'price_type': 'set_price',
  'status': 'active',
  'title': 'Family portrait',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542659018.32,
                 'timezone': '+00:00'},
  'description': 'Message me for more info',
  'id': '3f0fefbe-01f4-4798-a5e1-02190477eb4a',
  'last_modified': {'$reql_type$':

                    'timezone': '+00:00'},
  'owner_id': '1ed96d70-f68d-4004-9a99-752d64c069db',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'offering',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Sneaker Shopping ',
  'visibility': 'public'},
 {'archivedStatus': 'live',
  'archived_status': 'live',
  'category': 'other_service',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1540953542.382,
                 'timezone': '+00:00'},
  'description': 'I’m looking for someone with experience in building word2vec '
                 'models. If interested, please message me and I’m more than '
                 'happy to explain the details of the build.',
  'id': '72f2844b-8b35-45fc-8d8f-546881ffe2d6',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1540953542.382,
                    'timezone': '+00:00'},
  'owner_id': 'c39f7594-d

                 'timezone': '+00:00'},
  'description': 'my channel is called “commas69” and thats all i need, the '
                 'rest is up to you, just make sure its the right dimensions '
                 '(2560x1440)',
  'id': '438ad929-58d2-4bcf-9dc9-710b65408287',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542664774.93,
                    'timezone': '+00:00'},
  'owner_id': '4e30f0c5-1ebf-4059-9fed-996202183f41',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'request',
  'price': 5,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'create youtube channel art for me',
  'visibility': 'public'},
 {'archived_status': 'live',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1543959175.139,
                 'timezone': '+00:00'},
  'description': 'Description to test post',
  'id': '6897fc6b-78f5-456d-91b1-73431aaf9911',
  'last_modifie

  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1542663576.708,
                    'timezone': '+00:00'},
  'owner_id': 'a3dd18e6-5e2a-452a-a77d-167b234068a9',
  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'request',
  'price': 5,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Melee Practice',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'cuisine',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541909269.773,
                 'timezone': '+00:00'},
  'description': 'Yesss so happy',
  'id': '7258fb6d-17d3-4721-9ceb-3a14fd6121e7',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541909269.773,
                    'timezone': '+00:00'},
  'owner_id': 'd44269f8-35ed-4b36-a1c7-3de252cde5be',
  'payment_type': 'hourly',
  'photos': [],
  'post_type': 'offering',
  'price': None,
  'price_type': '

  'payment_type': 'hourly',
  'payment_type_new': 'hourly',
  'photos': [],
  'post_type': 'offering',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Singing',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'other_service',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541467691.76,
                 'timezone': '+00:00'},
  'description': 'Having worked on several successful start ups myself, I want '
                 'to help entrepreneurs take the next step in their journey. '
                 'Whether you want to run ideas by me, be connected to my well '
                 'rounded network, or want to hear about my experiences, I am '
                 'here to help you. ',
  'id': '9e25e9bb-974b-4f37-8a14-ae22d92b1a71',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541467691.76,
                    'timezone': '+00:00'},
  'owner_id': 'f050

  'category': 'design',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541736278.107,
                 'timezone': '+00:00'},
  'description': 'Magic positive power',
  'id': '9aea05ca-39b8-4454-906b-eca762e722ba',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541736278.107,
                    'timezone': '+00:00'},
  'owner_id': 'c0002e95-f53f-4fce-a76b-b5f86dc0a264',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Magic power',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'other_service',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542501004.102,
                 'timezone': '+00:00'},
  'description': 'im leaving pvd from Tuesday- Monday and I need someone to '
                 'watch my guinea pig, either by comi

                 'epoch_time': 1541730750.663,
                 'timezone': '+00:00'},
  'description': 'Me: a brown student ',
  'id': 'b1ce00b9-2fa8-47dc-a2e3-e28864966d42',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541730750.663,
                    'timezone': '+00:00'},
  'owner_id': '8cf39ace-3379-456a-84f2-a9d3b8c27f42',
  'payment_type': 'hourly',
  'photos': ['01541730750528'],
  'post_type': 'offering',
  'price': None,
  'price_type': 'free',
  'status': 'inactive',
  'title': 'Will chance you for brown ',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'design',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541570614.161,
                 'timezone': '+00:00'},
  'description': 'I am in need of a person who can edit a Call Of Duty: Black '
                 'Ops 4 montage. Ive got a couple good clips recorded from my '
                 'phone pointed at the tv. I 

  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'photography',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541640019.847,
                 'timezone': '+00:00'},
  'description': 'Basic photos ',
  'id': 'c016eebd-825c-4f8f-a569-e288b925ce0a',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1541640019.847,
                    'timezone': '+00:00'},
  'owner_id': '108fdb5e-c19e-4021-82bc-1f17e62a1666',
  'payment_type': 'hourly',
  'photos': [],
  'post_type': 'offering',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Simple Photography',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1542013783.936,
                 'timezone': '+00:00'},
  'description': "Hey guys i need a proficient math tutor, I'm returnin

  'price': 30,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Help with Anthropogy',
  'visibility': 'all_colleges'},
 {'archived_status': 'live',
  'category': 'other_good',
  'content_type': 'good',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1538088263.188,
                 'timezone': '+00:00'},
  'description': 'Come chat with me for love and life advice 🙈\n',
  'id': 'ed5fdc67-d772-4f09-a1b1-62c144fac56c',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1538088263.188,
                    'timezone': '+00:00'},
  'owner_id': 'CwmZzskl3HTubvLVERbIA9B2emi2',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price': 5,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Love and Life Advice 💕',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'instruction',
  'content_type': 'service',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1541

 {'archived_status': 'live',
  'category': 'apparel',
  'content_type': 'good',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1539821027.733,
                 'timezone': '+00:00'},
  'description': 'Will do and fold ur laundry.',
  'id': 'd97cde89-3cff-4827-8239-43ef98c2f40f',
  'last_modified': {'$reql_type$': 'TIME',
                    'epoch_time': 1539821027.733,
                    'timezone': '+00:00'},
  'owner_id': 'c489ebd1-f274-4c5a-bf75-3669b3ee41bb',
  'payment_type': 'fixed',
  'photos': [],
  'post_type': 'offering',
  'price': 20,
  'price_type': 'flexible',
  'status': 'active',
  'title': 'Laundry service ',
  'visibility': 'public'},
 {'archived_status': 'live',
  'category': 'electronics',
  'content_type': 'good',
  'created_at': {'$reql_type$': 'TIME',
                 'epoch_time': 1538251112.637,
                 'timezone': '+00:00'},
  'description': 'In a good shape.\n'
                 'No scratches on the screen.\n'
               

In [13]:

with open('allPostData.json') as fresh_data:
    user_post = json.load(fresh_data)

#pprint(user_post)
# json_normalize flattens the nested records and already returns a DataFrame
df = json_normalize(user_post)
df

Unnamed: 0,address.components.city,address.components.country,address.components.county,address.components.number,address.components.premise,address.components.state,address.components.street,address.components.zip_code,address.formatted,archivedStatus,...,owner_id,payment_type,payment_type_new,photos,post_type,price,price_type,status,title,visibility
0,Providence,US,Providence County,158,,RI,University Ave,02906,"158 University Ave, Providence, RI 02906, USA",,...,RHCWoC5WnSTXi0JWJadv18ejPwy1,fixed,,[],request,25,flexible,legacy,Teach Me How To Cook!,public
1,Providence,US,Providence County,157,,RI,University Ave,02906,"157 University Ave, Providence, RI 02906, USA",,...,zrbigYwCTzggBURdm89fNLstvBy2,fixed,,[],offering,15,flexible,legacy,Long Boarding Lessons,public
2,Providence,US,Providence County,233,,RI,Hope St,02906,"233 Hope St, Providence, RI 02906, USA",,...,zrbigYwCTzggBURdm89fNLstvBy2,fixed,,[],request,15,flexible,legacy,Personal Trainer,public
3,Providence,US,Providence County,152,,RI,University Ave,02906,"152 University Ave, Providence, RI 02906, USA",,...,zrbigYwCTzggBURdm89fNLstvBy2,fixed,,[],offering,15,flexible,legacy,Final Cut Pro Lesson,public
4,Providence,US,Providence County,184,Prince Engineering Laboratory,RI,Hope St,02912,"Prince Engineering Laboratory, 184 Hope St, Pr...",,...,2cAi0NSxlFVDlvRuHWxstNAMX5B2,fixed,,[],offering,15,set_price,legacy,Resume Building,public
5,Providence,US,Providence County,157,,RI,University Ave,02906,"157 University Ave, Providence, RI 02906, USA",,...,zrbigYwCTzggBURdm89fNLstvBy2,fixed,,[],offering,,free,legacy,Motivational Pep Talk,public
6,Columbus,US,Franklin County,7,,OH,Easton Oval,43219,"7 Easton Oval, Columbus, OH 43219, USA",,...,8nhkCkLxlmSNAuNgVW6y767p8uH3,fixed,,[],offering,2,flexible,legacy,Resume + Cover Letter,public
7,Providence,US,Providence County,184,Prince Engineering Laboratory,RI,Hope St,02912,"Prince Engineering Laboratory, 184 Hope St, Pr...",,...,RHCWoC5WnSTXi0JWJadv18ejPwy1,fixed,,[1532481234728],offering,4,set_price,legacy,Paracord Bracelet,public
8,Providence,US,Providence County,1,,RI,Prospect St,02912,"1 Prospect St, Providence, RI 02912, USA",,...,uGQ1ywtBJlbWpnMs7nibLtbV6v23,fixed,,[],offering,20,flexible,legacy,Video Production,public
9,Columbus,US,Franklin County,7,,OH,Easton Oval,43219,"7 Easton Oval, Columbus, OH 43219, USA",,...,8nhkCkLxlmSNAuNgVW6y767p8uH3,fixed,,[],offering,,free,legacy,Consulting Cases,public


In [2]:
#Preprocessing and reading the data
allposts = pd.read_csv("allPostData.csv", header=0)
allposts.head()

Unnamed: 0,archivedStatus,address__components__city,address__components__country,address__components__county,address__components__number,address__components__premise,address__components__state,address__components__street,address__components__zip_code,address__formatted,...,photos__001,photos__002,photos__003,photos__004,post_type,price,price_type,status,title,visibility
0,,Providence,US,Providence County,158,,RI,University Ave,2906.0,"158 University Ave, Providence, RI 02906, USA",...,,,,,request,25.0,flexible,legacy,Teach Me How To Cook!,public
1,,Providence,US,Providence County,157,,RI,University Ave,2906.0,"157 University Ave, Providence, RI 02906, USA",...,,,,,offering,15.0,flexible,legacy,Long Boarding Lessons,public
2,,Providence,US,Providence County,233,,RI,Hope St,2906.0,"233 Hope St, Providence, RI 02906, USA",...,,,,,request,15.0,flexible,legacy,Personal Trainer,public
3,,Providence,US,Providence County,152,,RI,University Ave,2906.0,"152 University Ave, Providence, RI 02906, USA",...,,,,,offering,15.0,flexible,legacy,Final Cut Pro Lesson,public
4,,Providence,US,Providence County,184,Prince Engineering Laboratory,RI,Hope St,2912.0,"Prince Engineering Laboratory, 184 Hope St, Pr...",...,,,,,offering,15.0,set_price,legacy,Resume Building,public


#### (6) Similarity Queries 

Now that we've seen that this is not a 2-D problem but a multidimensional one, we'll perform similarity queries. Given two title vectors, one from our Pangea dictionary and one for new incoming data (in real time, in the future state), we want to measure how similar the two titles are. The more similar a title is, the more likely it is to be a good recommendation for our Pangean.

To study title similarities, we'll use the dot product of the title vectors. Because each title vector is normalized to unit length, the dot product is the same as the cosine similarity. The higher the dot product, the more similar the titles. Titles that are exactly the same have a dot product of 1.
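As a small illustration of why the dot product can stand in for cosine similarity here: cosine similarity is a·b / (‖a‖‖b‖), so once both vectors have length 1 the denominator disappears. The vectors below are made up for the example and are not taken from our data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors, then scaled to unit length like our title vectors
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_unit = float(np.dot(a_unit, b_unit))   # dot product of the unit vectors
cos_raw = cosine_similarity(a, b)          # cosine similarity of the raw vectors
print(dot_unit, cos_raw)
```

Both numbers agree, and the dot product of a unit vector with itself is 1, which is the sanity check used later in this section.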

In [36]:
titles_w_sim_rank = []

In [37]:
title_vectors['teach'][:30]

array([ 0.04235635, -0.01368302,  0.07774045,  0.02823756,  0.04200774,
        0.04462232,  0.08750159, -0.04775983, -0.06170431, -0.06867655,
        0.05961264,  0.04880567, -0.02143963,  0.07285989, -0.04427371,
        0.07878629, -0.0613557 ,  0.09342799, -0.01655907, -0.06763071,
        0.12550029,  0.06797932,  0.10388635,  0.07355712, -0.05403485,
       -0.08122657, -0.00155786, -0.03067785, -0.03433827, -0.03276952],
      dtype=float32)

In [38]:
title_vectors.keys()

dict_keys(['teach', 'long boarding lessons', 'personal trainer', 'final cut pro lesson', 'resume building', 'motivational pep talk', 'resume cover letter', 'bracelet', 'video production', 'consulting cases', 'sas sql tutor', 'guitar lessons', 'photography portraits', 'help tinder profile', 'custom love song', 'gym buddy', 'personal cheerleader', 'fan', 'lil ducky', 'snowboarding lessons', 'bed bath beyond drawers', 'math tutor', 'head lamp', 'web development', 'rice cooking pot', 'get startup job', 'tennis racket', 'alarm clock clock', 'orgo kit', 'cs tutor', 'ethernet cable', 'social media management', 'mouse', 'photography', 'bed raises', 'tv screen', 'homecooked dinner', 'ig social media marketing', 'developer', 'iron', 'soccer lessons', 'idea help', 'korean tutor', 'adobe photoshop help', 'desk lamps', 'ride boston', 'increase gym gains', 'adobe illustrator help', 'headshots', 'laundry', 'programming advise', 'original prints', 'massage', 'free', 'workout plan', 'design coolest deb

*title_vectors* is a dictionary that maps each cleaned title to its vector, which makes it a convenient datatype for looking up title vectors. Some useful links I found when working with Python dictionaries are the following:
- https://stackoverflow.com/questions/6634708/typeerror-dict-object-is-not-callable 
- https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value 
- https://www.w3schools.com/python/python_dictionaries.asp

Printing the keys of the *title_vectors* dictionary, as above, gives the list of all of our titles.

To get a feel for the dot product, we test the semantics on a pair of titles below.

A few resources that I found particularly helpful when learning about the dot product are the following:
- https://en.wikipedia.org/wiki/Dot_product 
- https://www.mathsisfun.com/algebra/vectors-dot-product.html 
- https://betterexplained.com/articles/vector-calculus-understanding-the-dot-product/
- http://tutorial.math.lamar.edu/Classes/CalcII/DotProduct.aspx

In [39]:
#testing semantics
np.dot(title_vectors['teach'], title_vectors['chemistry tutor'])

0.4491459

We're comparing two titles that show up a lot in our data, "teach" and "chemistry tutor". Logically, we know that these two titles are probably related, so we'd expect a fairly high dot product, and 0.449 bears that out.

In [40]:
#title_vectors['teach']
dot_product = np.dot(title_vectors['teach'], title_vectors['teach'])
dot_product
#sanity check that dot product = 1 since they are the same thing

1.0

As a sanity check, we also compute the dot product of 'teach' with itself, which should be 1, and it is.

In [41]:
#loop over all keys in dict 

ranked_titles = {}


#can also use title_vectors.keys() 
for key in title_vectors:
    ranked_titles[key] = np.dot(title_vectors['food'], title_vectors[key])

# note: raises a KeyError if the query title is not in our dictionary (the new Pangea one, not gensim's vocabulary)

With the sanity check passing, we create a new dictionary called "ranked_titles" that maps each title already in our database to its similarity with a query title (here 'food'; in the future state, this will be each new title that a user inputs).

In [42]:
import operator
sorted_title_vecs = sorted(ranked_titles.items(), key=operator.itemgetter(1), reverse=True)

We then sort the titles by relevance: the higher the dot product, the more relevant that title is to the query.
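The scoring loop and the sort can be combined into one helper. The sketch below is an assumption about how that might look, not the notebook's final code: `rank_titles`, `top_n`, and the toy vectors are hypothetical names and values. It also returns an empty list for a query that is not in the dictionary instead of raising a `KeyError`.

```python
import numpy as np

def rank_titles(query, title_vectors, top_n=10):
    """Return the top_n (title, score) pairs most similar to `query`."""
    if query not in title_vectors:
        # Unknown query titles would otherwise raise a KeyError
        return []
    query_vec = title_vectors[query]
    scores = {
        title: float(np.dot(query_vec, vec))
        for title, vec in title_vectors.items()
    }
    # Highest dot product first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Made-up unit vectors standing in for real title vectors
toy_vectors = {
    "food": np.array([1.0, 0.0]),
    "cooking": np.array([0.8, 0.6]),
    "guitar lessons": np.array([0.0, 1.0]),
}
ranking = rank_titles("food", toy_vectors, top_n=2)
print(ranking)
```

Note that the query title itself always ranks first with a score of 1, so a recommender built on this would probably want to skip it before showing results.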

In [43]:
sorted_title_vecs[:30]

[('food', 1.0000001),
 ('food photography', 0.7063276),
 ('food delivery thanksgiving', 0.6528243),
 ('deliver dining hall food', 0.6489846),
 ('farm fresh meal', 0.54369897),
 ('rice cooking pot', 0.5030476),
 ('cooking', 0.47695595),
 ('drink wine', 0.47675544),
 ('chef prepared meal prep', 0.47417408),
 ('meal credit', 0.46516),
 ('homecooked dinner', 0.46060634),
 ('tomato soup recipe', 0.44682276),
 ('grocery runs', 0.42736563),
 ('buy groceries', 0.4250207),
 ('grocery shopping', 0.4146102),
 ('pet cleaning services', 0.4097092),
 ('meal swipes', 0.40529132),
 ('private cook', 0.39881307),
 ('keto diet daily menu', 0.39766154),
 ('even delicious cookies', 0.38736588),
 ('delicious cookies', 0.3863588),
 ('professional pet bathing services', 0.38212496),
 ('help setting farmers market sun', 0.37188083),
 ('feed cuddle cat', 0.35799968),
 ('mediterranean cooking class', 0.3564884),
 ('pet photography', 0.35361582),
 ('pet care advice', 0.35287824),
 ('cooking lessons', 0.35078347),

Finished: 
- Built a recommender system that takes in a user-inputted title and compares it with every title that we have in our database

Next steps: 
- Ask John what type of output / JSON payload he wants for this (titles only? titles with how well they match? top 10? top 50? a ranked list? only very good matches?)
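To make that conversation concrete, here is one possible payload shape: the query plus a ranked list of titles with their scores. This is purely a hypothetical starting point, since the real format is still undecided. The function `build_payload`, the field names `query`, `matches`, `title`, and `score`, and the default `top_n` are all assumptions; the two example scores are copied from the 'food' ranking above.

```python
import json

def build_payload(query, ranked, top_n=3):
    """Serialize the top_n (title, score) pairs as a JSON string."""
    matches = [
        {"title": title, "score": round(float(score), 4)}
        for title, score in ranked[:top_n]
    ]
    return json.dumps({"query": query, "matches": matches})

# Two entries taken from the sorted 'food' results
ranked = [
    ("food photography", 0.7063276),
    ("food delivery thanksgiving", 0.6528243),
]
payload = build_payload("food", ranked)
print(payload)
```

Keeping the score in the payload leaves it up to the consumer to decide on a cutoff for a "very good match".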