### Doc2Vec Model

Gensim's Doc2Vec model learns one vector for a whole group of words (a document, here a notice title) rather than a separate vector for each word.
[Source](https://www.machinelearningplus.com/nlp/gensim-tutorial/#15howtoupdateanexistingword2vecmodelwithnewdata)

Code and idea adapted from the work of Clay Carson, Jollene Muncy and Cynthia Chiang.

In [33]:
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from gensim import corpora
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [34]:
# read in raw data
data = pd.read_csv('./data/combined.csv')

In [35]:
# set display options to full column widths
pd.set_option('display.max_colwidth', None)

In [36]:
# define function to tokenize a column

def tokenizer_function(column):
    r"""
    Takes in a text column
        tokenizes the text in each row
        using pattern [a-zA-Z]\w+
        which matches a letter (a-z or A-Z) followed by one or more word characters,
        so every token starts with a letter and is at least two characters long
    Returns list of lists of lowercase strings
    """
    
    # instantiate empty list of tokenized text
    texts = []
    
    # define tokenizer pattern (raw string so \w is not read as a string escape)
    pattern = r'[a-zA-Z]\w+'
    # instantiate tokenizer
    tokenizer = RegexpTokenizer(pattern=pattern)
    
    # create for loop to tokenize each row and add the list of tokens to texts
    for text in column:
        tokens = tokenizer.tokenize(text)
        
        # transform tokens into lower case strings
        tokens = [token.lower() for token in tokens]
        texts.append(tokens)
    return texts
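
To see what the pattern keeps and drops, here is a small sketch that uses the standard-library `re` module in place of `RegexpTokenizer`. That swap is an assumption made to keep the example dependency-free; the helper name `tokenize_title` is hypothetical. The first two titles appear in the dataset, and the third is invented for illustration.

```python
import re

# same pattern as in tokenizer_function: a letter followed by one or more word characters
PATTERN = re.compile(r'[a-zA-Z]\w+')

def tokenize_title(text):
    """Return the lowercase pattern matches found in one title (hypothetical helper)."""
    return [token.lower() for token in PATTERN.findall(text)]

print(tokenize_title('56--VALVE BODY INSUL'))  # ['valve', 'body', 'insul']
print(tokenize_title('FD2020-20-00640'))       # ['fd2020']
print(tokenize_title('2x4 lumber'))            # ['x4', 'lumber']
```

Leading digits are skipped rather than causing the whole token to be dropped, and single letters never match because the pattern needs at least two characters.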

In [37]:
# tokenize titles
tokenized_titles = tokenizer_function(data['title'])

In [38]:
type(tokenized_titles)
tokenized_titles[:2]

[['request',
  'for',
  'information',
  'new',
  'design',
  'booklet',
  'die',
  'cutting',
  'insert',
  'assembly'],
 ['audiovisual', 'suite', 'for', 'large', 'auditorium', 'maxwell', 'afb']]

**Create List of Tagged Documents**

[Source](https://www.machinelearningplus.com/nlp/gensim-tutorial/#15howtoupdateanexistingword2vecmodelwithnewdata)

In [39]:
# define function to create tagged documents
def tag_documents(list_of_documents):
    """generator function
       accepts list of tokenized documents
       in form of list of list of words
       tags each document with its position in the list
       yields one TaggedDocument at a time"""
    for i, document in enumerate(list_of_documents):
        yield TaggedDocument(document, [i])

In [40]:
# tag documents
tagged_titles = list(tag_documents(tokenized_titles))

In [41]:
tagged_titles[:1]

[TaggedDocument(words=['request', 'for', 'information', 'new', 'design', 'booklet', 'die', 'cutting', 'insert', 'assembly'], tags=[0])]
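
Each tag is simply the document's position in the list, which is what later lets a tag be used as a row label in the dataframe (assuming the dataframe keeps the default integer index that `pd.read_csv` assigns). A minimal sketch of that round trip, using a stand-in `namedtuple` with the same two fields as `TaggedDocument` so that it runs without gensim; `TaggedDoc` and `tag_docs` are hypothetical names:

```python
from collections import namedtuple

# stand-in for gensim's TaggedDocument (assumption: same two fields, words and tags)
TaggedDoc = namedtuple('TaggedDoc', ['words', 'tags'])

def tag_docs(tokenized):
    """Yield one tagged document per tokenized title, tagged with its position."""
    for i, words in enumerate(tokenized):
        yield TaggedDoc(words, [i])

titles = [['request', 'for', 'information'], ['audiovisual', 'suite']]
tagged = list(tag_docs(titles))

# the tag points back to the original position, so the title can be looked up again
for doc in tagged:
    assert titles[doc.tags[0]] == doc.words

print(tagged[1])  # TaggedDoc(words=['audiovisual', 'suite'], tags=[1])
```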

**Create Doc2Vec Model**

**--PV-DBOW, vector size = 50, epochs = 20--**

In [42]:
# Instantiate the Doc2Vec model
# PV-DBOW (dm=0); dbow_words=1 also trains skip-gram word vectors alongside the document vectors
model = Doc2Vec(vector_size=50, dm=0, dbow_words=1, min_count=2, seed=1977, epochs=20)

In [43]:
# Build the vocabulary
model.build_vocab(tagged_titles)

In [44]:
%%time
# Train the Doc2Vec model
model.train(tagged_titles, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 1min 36s, sys: 10.8 s, total: 1min 47s
Wall time: 1min 19s


In [45]:
# Save model for future use
model.save('doc2vec_titles')

In [17]:
# Load saved doc2vec_model
# model = Doc2Vec.load('doc2vec_titles')

**Get Similar Documents**

In [55]:
# get the 10 documents most similar to the document tagged 2530
model.docvecs.most_similar(2530, topn=10)

[(36141, 0.8817406892776489),
 (30384, 0.878228485584259),
 (34781, 0.8750245571136475),
 (25496, 0.8694334030151367),
 (38220, 0.8674364686012268),
 (42675, 0.8672429323196411),
 (13667, 0.8621771335601807),
 (35385, 0.8619729280471802),
 (15634, 0.8584710359573364),
 (15929, 0.856934130191803)]
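
The scores in the output above are similarities between document vectors, ranked from highest to lowest. A minimal pure-Python sketch of that kind of ranking, assuming cosine similarity and using invented 3-dimensional vectors (not values from the trained model); the helper names `cosine` and `top_n` are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n(vectors, base_tag, n):
    """Rank every other tag by cosine similarity to base_tag, highest first."""
    base = vectors[base_tag]
    scored = [(tag, cosine(base, vec))
              for tag, vec in vectors.items() if tag != base_tag]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# toy vectors, invented for illustration only
toy_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [-1.0, 0.0, 0.0],
}

ranked = top_n(toy_vectors, 0, 2)
print([tag for tag, _ in ranked])  # [1, 2]
```

As in the real output, the base document itself is left out of its own ranking.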

**Let's investigate how similar these documents are**

In [2]:
# read in dataframe of contract notifications
data = pd.read_csv('./data/combined.csv')
data.sample(25)

Unnamed: 0,noticeId,title,solicitationNumber,department,subTier,office,postedDate,type,baseType,archiveType,...,award,pointOfContact,description,organizationType,officeAddress,placeOfPerformance,additionalInfoLink,uiLink,links,resourceLinks
40819,dc5becef2cc64c46bd8306b130e084e6,56--VALVE BODY INSUL,N0010420QEF63,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVSUP WEAPON SYSTEMS SUPPORT MECH,2020-05-28,Solicitation,Solicitation,auto15,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'LY...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '17050-0788', 'city': 'MECHANICSBU...",{},,https://beta.sam.gov/opp/dc5becef2cc64c46bd830...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
21955,496f02eda0f94bd9a8b3e84b8a77815d,"47--ADAPTER,STRAIGHT,TUBE",SPE7M320U0899,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA LAND AND MARITIME,2020-05-07,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '43218-3990', 'city': 'COLUMBUS', ...",{},,https://beta.sam.gov/opp/496f02eda0f94bd9a8b3e...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
2530,61c2bdb4aee540a3bdf42a3886aca240,Kanopolis Lake Vegetation Management,W912DQ20R1058,DEPT OF DEFENSE,DEPT OF THE ARMY,W071 ENDIST KANSAS CITY,2020-04-22,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'laur...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '64106-2896', 'city': 'KANSAS CITY...",,,https://beta.sam.gov/opp/61c2bdb4aee540a3bdf42...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
12098,e39501dbb64a451fb5b10c8f933ccd6e,FD2020-20-00640,,DEPT OF DEFENSE,DEPT OF THE AIR FORCE,FA8221 AFNWC PZBB,2020-03-11,Special Notice,Special Notice,autocustom,...,,,https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '84056-5837', 'city': 'HILL AFB', ...",,,https://beta.sam.gov/opp/e39501dbb64a451fb5b10...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
38514,6ed1f768283b497bb19066368ecf53cd,41--POSI-VAC STARTER KI,SPE8E820T3561,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA TROOP SUPPORT,2020-05-18,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '19111-5096', 'city': 'PHILADELPHI...",{},,https://beta.sam.gov/opp/6ed1f768283b497bb1906...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
40638,f8a0f16a5896412581cd81f84c552d3e,"43--PUMP UNIT,CENTRIFUG",SPE7M120T438A,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA LAND AND MARITIME,2020-05-28,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '43218-3990', 'city': 'COLUMBUS', ...",{},,https://beta.sam.gov/opp/f8a0f16a5896412581cd8...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
41257,99d96fbf1dee437b8acc864286598ba9,Delivery Order for Class I Engineering Change ...,N00019-20-RFPREQ-PMA-201-0259,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVAL AIR SYSTEMS COMMAND,2020-05-28,Presolicitation,Presolicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'sama...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '20670-5000', 'city': 'PATUXENT RI...","{'city': {'code': '65000', 'name': 'Saint Loui...",,https://beta.sam.gov/opp/99d96fbf1dee437b8acc8...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
13097,ba00b2b5611d4153b5727a08cae93211,Serena Software - Sole Source (Including Brand...,FA300220Q0006,DEPT OF DEFENSE,DEPT OF THE AIR FORCE,FA3002 338 SCONS CC,2020-04-29,Justification,Justification,auto30,...,"{'date': '2020-04-23', 'number': 'FA300220C0009'}","[{'fax': None, 'type': 'primary', 'email': 'gr...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '78150-4300', 'city': 'RANDOLPH AF...","{'city': {'code': 'TX-18', 'name': 'JBSA Rando...",,https://beta.sam.gov/opp/ba00b2b5611d4153b5727...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
5186,12a2351eab5b4a2f96a309632d18c0e2,Cisco ISE License,N0025320Q0097,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVAL UNDERSEA WARFARE CENTER,2020-04-16,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'murr...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '98345-7610', 'city': 'KEYPORT', '...",,,https://beta.sam.gov/opp/12a2351eab5b4a2f96a30...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
1367,055711e208e747198e7dac16ad315f14,USPSC- Knowledge Management Specialist,7200AA20R00045,AGENCY FOR INTERNATIONAL DEVELOPMENT,AGENCY FOR INTERNATIONAL DEVELOPMENT,USAID M/OAA,2020-04-24,Solicitation,Solicitation,auto15,...,,"[{'fax': '', 'type': 'primary', 'email': 'jbui...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '20523', 'city': 'WASHINGTON', 'co...","{'city': {'code': '50000', 'name': 'Washington...",,https://beta.sam.gov/opp/055711e208e747198e7da...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...


In [47]:
# find most similar documents in dataframe
def most_similar(path_to_model, base_document_idx, n):
    """function
       finds the n documents most similar
       to the base document (given by its tag, which is also its row index)
       based on cosine similarity
       returns the base document followed by the similar documents
       as pandas dataframe"""
    # load saved doc2vec_model
    model = Doc2Vec.load(path_to_model)
    # get similar topics
    similars = model.docvecs.most_similar(base_document_idx, topn=n)
    
    # load original dataframe
    data = pd.read_csv('./data/combined.csv')
    # base document
    row1 = data.loc[base_document_idx, ['title', 'department', 'uiLink']]
    
    # list of rows in the original dataframe
    # initialized with the base_document
    list_of_dfs = [row1]
    
    # iterate through all similar notifications
    # each entry is a (tag, similarity score) pair; only the tag is needed here
    for tag, score in similars:
        # find the row in the notifications dataframe corresponding to the tag
        df = data.loc[tag, ['title', 'department', 'uiLink']]
        # add row to the list of rows 
        list_of_dfs.append(df)
    # return all rows as a dataframe   
    return pd.DataFrame(list_of_dfs)

In [48]:
# show most similar 10 titles to the notification indexed 2530
path = 'doc2vec_titles'
most_similar(path, 2530, 10)

| | title | department | uiLink |
|---|---|---|---|
| 2530 | Kanopolis Lake Vegetation Management | DEPT OF DEFENSE | https://beta.sam.gov/opp/61c2bdb4aee540a3bdf42a3886aca240/view |
| 36141 | Canada Geese Hazard Management | DEPT OF DEFENSE | https://beta.sam.gov/opp/7e34e5089a904354a45f0f1ca1b48f01/view |
| 30384 | Forestry Vegetation Management Treatment | AGRICULTURE, DEPARTMENT OF | https://beta.sam.gov/opp/afb469ce250d4ef2bcacf858b89832c0/view |
| 34781 | Benbrook Lake Mowing Services | DEPT OF DEFENSE | https://beta.sam.gov/opp/61ba07930a6f4a1986325a9bf3f3948e/view |
| 25496 | Vegetation Services | DEPT OF DEFENSE | https://beta.sam.gov/opp/da7ea2a5c4f344ef898e254bff65be8e/view |
| 38220 | Tree Spraying | AGRICULTURE, DEPARTMENT OF | https://beta.sam.gov/opp/b8ba6114df5248c9a9b949edc9754ed4/view |
| 42675 | Law Enforcement Services Coralville Lake | DEPT OF DEFENSE | https://beta.sam.gov/opp/ca9e5cba5c494956b5a4b4b21023b71d/view |
| 13667 | Security Gates, Lake Mead NRA | INTERIOR, DEPARTMENT OF THE | https://beta.sam.gov/opp/38799521cff148d487c6ba68b59a55b7/view |
| 35385 | Fee Collector Services, Lake Ouachita | DEPT OF DEFENSE | https://beta.sam.gov/opp/111ce7e61c4847878b32f4aaee12ab65/view |
| 15634 | Herbicide Services, RS Kerr Lake, OK | DEPT OF DEFENSE | https://beta.sam.gov/opp/ae7f38a4152d4b6783694779e90acbce/view |


**Create Doc2Vec Model**

**--PV-DM, vector size = 100, epochs = 20--**

In [49]:
# Instantiate the Doc2Vec model
# PV-DM (dm=1), the Doc2Vec counterpart of the CBOW model
model_cbow = Doc2Vec(vector_size=100, dm=1, dbow_words=0, seed=1977, epochs=20)

In [50]:
# Build the vocabulary
model_cbow.build_vocab(tagged_titles)

In [51]:
%%time
# Train the Doc2Vec model
model_cbow.train(tagged_titles, total_examples=model_cbow.corpus_count, epochs=model_cbow.epochs)

CPU times: user 1min 17s, sys: 10.7 s, total: 1min 28s
Wall time: 1min 18s


In [52]:
# Save model for future use
model_cbow.save('doc2vec_cbow_titles')

In [26]:
# Load saved PV-DM model
# model_cbow = Doc2Vec.load('doc2vec_cbow_titles')

**Get Similar Documents**

In [56]:
# get the 10 documents most similar to the document tagged 2530
model_cbow.docvecs.most_similar(2530, topn=10)

[(6245, 0.7212549448013306),
 (38114, 0.6974656581878662),
 (10881, 0.6973400115966797),
 (12427, 0.6951607465744019),
 (41082, 0.6848417520523071),
 (45302, 0.6625716090202332),
 (10627, 0.6494672298431396),
 (13389, 0.6357793807983398),
 (12426, 0.634731650352478),
 (22228, 0.6210447549819946)]

In [54]:
path_cbow = 'doc2vec_cbow_titles'
most_similar(path_cbow, 2530, 10)

| | title | department | uiLink |
|---|---|---|---|
| 2530 | Kanopolis Lake Vegetation Management | DEPT OF DEFENSE | https://beta.sam.gov/opp/61c2bdb4aee540a3bdf42a3886aca240/view |
| 6245 | Battle Creek Environmental Compliance and Commitments | INTERIOR, DEPARTMENT OF THE | https://beta.sam.gov/opp/d636876ff5d44cf29ec05482ef2c04b2/view |
| 38114 | R--Glen Canyon Dam Adaptive Management Program - Meeting Facilitation | INTERIOR, DEPARTMENT OF THE | https://beta.sam.gov/opp/d2cc833ca9e648718abcdb51711bbec7/view |
| 10881 | Fire Management Complex Janitorial Services, Lewis | INTERIOR, DEPARTMENT OF THE | https://beta.sam.gov/opp/1d79193e417442f8a701f1e3c21c7781/view |
| 12427 | Technical Support for Coast Guard Ballast Water Management Program | TRANSPORTATION, DEPARTMENT OF | https://beta.sam.gov/opp/61631a2abffe413a9ff5cc3e39e31d73/view |
| 41082 | ARNG-CSO-F Dock Management Services | DEPT OF DEFENSE | https://beta.sam.gov/opp/b41bcbc3d63d406cb49073d646747934/view |
| 45302 | 58--NETWORK MANAGEMENT - AND OTHER REPLACEMENT PARTS, IN REPAIR/MODIFICATION OF | DEPT OF DEFENSE | https://beta.sam.gov/opp/8bcac9882c4c4d5b93937caf735a8897/view |
| 10627 | R--Enterprise Progran Management Office (EPMO) Technical & Business Management Support - Base Period (VA-20-00031624) | VETERANS AFFAIRS, DEPARTMENT OF | https://beta.sam.gov/opp/e5543506b0c54432aa040ad8563b6ca3/view |
| 13389 | Agency Strategic Business Management Support Services | DEPT OF DEFENSE | https://beta.sam.gov/opp/78fe435d19e046dcab8cab4fe52104f3/view |
| 12426 | Combined Services (Gate Attendant, Mowing and Trimming and Solid Waste Removal Services) for Tar Camp Park & Dam Site 5 Park within the responsibility of the Pine Bluff Site Office | DEPT OF DEFENSE | https://beta.sam.gov/opp/617f443b45384ae6a9d93f88cb2b2dca/view |


**Get Similar Words**

In [100]:
# inspect the vocabulary of the PV-DBOW model
# (gensim 3.x API; gensim 4 replaced wv.vocab with wv.key_to_index)

vocabulary = model.wv.vocab
# vocabulary

In [106]:
# get the first 10 items in the vocabulary
list(vocabulary.keys())[:10]

['request',
 'for',
 'information',
 'new',
 'design',
 'booklet',
 'die',
 'cutting',
 'insert',
 'assembly']

In [107]:
# get the frequency of a word in the corpus 
model.wv.vocab['request'].count 

328

In [111]:
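# get the words most similar to the query term
# (most_similar_cosmul uses the multiplicative combination objective; for a single
# query word the ranking is the same as with plain cosine similarity)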
model.wv.most_similar_cosmul('request')

[('information', 0.9256718158721924),
 ('proposals', 0.8846799731254578),
 ('agv', 0.8732286691665649),
 ('ppb2', 0.8524883389472961),
 ('r420', 0.849981427192688),
 ('adhesives', 0.8489277958869934),
 ('cornerstone', 0.8470085859298706),
 ('acclimation', 0.8461048007011414),
 ('papers', 0.8448440432548523),
 ('infill', 0.8435037732124329)]

The list of words most similar to "request" in the corpus is promising, but it could be better: tokens such as `agv`, `ppb2` and `r420` rank highly. More text cleaning is needed.
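
One possible direction for that cleaning, sketched with the standard library only: drop common stop words and tokens that contain digits (such as `ppb2` and `r420` in the output above). The stop-word set and the helper name `clean_tokens` are hypothetical, and whether this filtering actually improves the model would need to be checked by retraining.

```python
# tiny illustrative stop-word set (assumption); a fuller list would be used in practice
STOP_WORDS = {'for', 'the', 'of', 'and', 'in', 'to'}

def clean_tokens(tokens):
    """Drop stop words and any token that contains a digit."""
    return [
        token for token in tokens
        if token not in STOP_WORDS and not any(ch.isdigit() for ch in token)
    ]

sample = ['request', 'for', 'information', 'ppb2', 'r420', 'adhesives']
print(clean_tokens(sample))  # ['request', 'information', 'adhesives']
```

Applied inside `tokenizer_function`, a filter like this would run on each title's tokens before they are appended to `texts`.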

---
**Note:** I tested the two Doc2Vec models on the same base document (index 2530). The two models returned different sets of most similar documents. A common-sense check of both sets suggests that the documents the two models recommended were equally relevant. The PV-DBOW model, however, produced higher similarity scores (roughly 0.86 to 0.88, against 0.62 to 0.72 for PV-DM), so I used that model in my recommender.
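
The comparison in the note can be roughly checked from the two `most_similar` outputs shown earlier. The sketch below copies those (tag, score) pairs and computes the overlap between the two result sets and the mean score per model; the variable names are hypothetical. Scores from two separately trained models are not necessarily on the same scale, so this is only a rough check.

```python
# (tag, score) pairs copied from the PV-DBOW output above
dbow = [(36141, 0.8817406892776489), (30384, 0.878228485584259),
        (34781, 0.8750245571136475), (25496, 0.8694334030151367),
        (38220, 0.8674364686012268), (42675, 0.8672429323196411),
        (13667, 0.8621771335601807), (35385, 0.8619729280471802),
        (15634, 0.8584710359573364), (15929, 0.856934130191803)]

# (tag, score) pairs copied from the PV-DM output above
dm = [(6245, 0.7212549448013306), (38114, 0.6974656581878662),
      (10881, 0.6973400115966797), (12427, 0.6951607465744019),
      (41082, 0.6848417520523071), (45302, 0.6625716090202332),
      (10627, 0.6494672298431396), (13389, 0.6357793807983398),
      (12426, 0.634731650352478), (22228, 0.6210447549819946)]

# documents recommended by both models
overlap = {tag for tag, _ in dbow} & {tag for tag, _ in dm}

# average similarity score per model
mean_dbow = sum(score for _, score in dbow) / len(dbow)
mean_dm = sum(score for _, score in dm) / len(dm)

print(overlap)              # set()
print(mean_dbow > mean_dm)  # True
```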