### Doc2Vec Model

Gensim's Doc2Vec model learns a vector for each document, that is, for a whole group of words, rather than only for individual words.
[Source](https://www.machinelearningplus.com/nlp/gensim-tutorial/#15howtoupdateanexistingword2vecmodelwithnewdata)

In [8]:
import pandas as pd
import numpy as np
from gensim import corpora
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [9]:
# code and idea adapted from Clay Carson, Jollene Muncy and Cynthia Chiang

# read in tokenized text file
tokenized_titles = pd.read_csv('./data/tokenized_titles.csv')

In [10]:
# set display options to full column widths
pd.set_option('display.max_colwidth', None)

In [11]:
tokenized_titles.head()

Unnamed: 0,0
0,"['request', 'for', 'information', 'new', 'design', 'booklet', 'die', 'cutting', 'insert', 'assembly', 'united', 'states', 'government', 'publishing', 'office', 'united', 'states', 'government', 'publishing', 'office']"
1,"['audiovisual', 'suite', 'for', 'large', 'auditorium', 'maxwell', 'afb', 'dept', 'of', 'defense', 'dept', 'of', 'the', 'air', 'force']"
2,"['metrology', 'equipment', 'move', 'dept', 'of', 'defense', 'defense', 'logistics', 'agency', 'dla']"
3,"['inner', 'inflatable', 'assy', 'lpu', 'dept', 'of', 'defense', 'defense', 'logistics', 'agency', 'dla']"
4,"['cradle', 'dept', 'of', 'defense', 'defense', 'logistics', 'agency', 'dla']"


In [12]:
# convert pd.DataFrame into a list of rows
# (each row is a one-element list holding the title's token list as a string)
tokenized_titles = tokenized_titles.values.tolist()

# source: https://note.nkmk.me/en/python-pandas-list/#:~:text=Convert%20labels%20(row%20%2F%20column%20names)%20to%20list,has%20a%20tolist()%20method.

In [13]:
type(tokenized_titles)
tokenized_titles[:2]

[["['request', 'for', 'information', 'new', 'design', 'booklet', 'die', 'cutting', 'insert', 'assembly', 'united', 'states', 'government', 'publishing', 'office', 'united', 'states', 'government', 'publishing', 'office']"],
 ["['audiovisual', 'suite', 'for', 'large', 'auditorium', 'maxwell', 'afb', 'dept', 'of', 'defense', 'dept', 'of', 'the', 'air', 'force']"]]
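The output above shows that each row is a one-element list whose single item is the string form of a token list, not the tokens themselves. Below is a minimal sketch of one way to recover real token lists, assuming every string is a valid Python list literal like the ones shown. `parse_tokens` and `rows` are hypothetical names used only for illustration, and the rows are shortened examples in the same shape as the output.

```python
import ast

def parse_tokens(row):
    """Turn a one-element row such as ["['a', 'b']"] into ['a', 'b'].

    Assumes row[0] is a string holding a Python list literal.
    """
    tokens = ast.literal_eval(row[0])
    if not isinstance(tokens, list):
        raise ValueError("expected a list literal, got %r" % type(tokens))
    return tokens

# shortened example rows in the same shape as the output above
rows = [
    ["['metrology', 'equipment', 'move']"],
    ["['cradle', 'dept', 'of', 'defense']"],
]
parsed = [parse_tokens(row) for row in rows]
print(parsed[0])  # ['metrology', 'equipment', 'move']
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.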

**Create List of Tagged Documents**

[Source](https://www.machinelearningplus.com/nlp/gensim-tutorial/#15howtoupdateanexistingword2vecmodelwithnewdata)

In [14]:
# define function to create tagged documents
def tag_documents(list_of_documents):
    """Generator function.
       Accepts a list of tokenized documents
       (a list of lists of words),
       tags each document with its position in the list,
       and yields one TaggedDocument at a time."""
    for i, document in enumerate(list_of_documents):
        yield TaggedDocument(document, [i])

In [15]:
# tag documents
tagged_titles = list(tag_documents(tokenized_titles))

In [16]:
tagged_titles[:1]

[TaggedDocument(words=["['request', 'for', 'information', 'new', 'design', 'booklet', 'die', 'cutting', 'insert', 'assembly', 'united', 'states', 'government', 'publishing', 'office', 'united', 'states', 'government', 'publishing', 'office']"], tags=[0])]

**Create Doc2Vec Model**

In [6]:
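# Instantiate the Doc2Vec model
# dm is left at its default (1), so this trains PV-DM (distributed memory);
# dbow_words=1 adds skip-gram word training only when dm=0 (PV-DBOW)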
model = Doc2Vec(vector_size=50, dbow_words=1, min_count=2, seed=1977, epochs=40)

In [7]:
# Build the vocabulary
model.build_vocab(tagged_titles)

NameError: name 'tagged_titles' is not defined

In [46]:
%%time
# Train the Doc2Vec model
model.train(tagged_titles, total_examples=model.corpus_count, epochs=model.epochs)

CPU times: user 1min 40s, sys: 12.7 s, total: 1min 53s
Wall time: 1min 49s


In [48]:
# Save model for future use
model.save('doc2vec_model')

In [17]:
# Load saved doc2vec_model
model = Doc2Vec.load('doc2vec_model')

**Get Similar Documents**

In [70]:
# get the 10 most similar documents to the first document
model.docvecs.most_similar(0, topn=10)

[(20468, 0.5547981262207031),
 (5556, 0.5542991161346436),
 (2786, 0.5503724217414856),
 (2742, 0.5410701632499695),
 (2252, 0.5400078296661377),
 (41787, 0.5182334184646606),
 (42921, 0.49703189730644226),
 (24790, 0.48553964495658875),
 (12941, 0.4851374626159668),
 (27733, 0.4806517958641052)]
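Each pair above is a document tag followed by a cosine similarity score, where values closer to 1 mean the two document vectors point in more similar directions. The sketch below computes that measure on toy vectors. The vectors and the helper name `cosine_similarity` are made up for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 0.0]
b = [1.0, 1.0]   # 45 degrees away from a
c = [0.0, 2.0]   # perpendicular to a

print(round(cosine_similarity(a, b), 4))  # 0.7071
print(cosine_similarity(a, c))            # 0.0
```

Read against this scale, the top scores of about 0.55 in the output indicate moderate rather than strong similarity.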

**Let's investigate how similar these documents are**

In [2]:
# read in dataframe of contract notifications
data = pd.read_csv('./data/combined.csv')
data.sample(25)

Unnamed: 0,noticeId,title,solicitationNumber,department,subTier,office,postedDate,type,baseType,archiveType,...,award,pointOfContact,description,organizationType,officeAddress,placeOfPerformance,additionalInfoLink,uiLink,links,resourceLinks
40819,dc5becef2cc64c46bd8306b130e084e6,56--VALVE BODY INSUL,N0010420QEF63,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVSUP WEAPON SYSTEMS SUPPORT MECH,2020-05-28,Solicitation,Solicitation,auto15,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'LY...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '17050-0788', 'city': 'MECHANICSBU...",{},,https://beta.sam.gov/opp/dc5becef2cc64c46bd830...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
21955,496f02eda0f94bd9a8b3e84b8a77815d,"47--ADAPTER,STRAIGHT,TUBE",SPE7M320U0899,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA LAND AND MARITIME,2020-05-07,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '43218-3990', 'city': 'COLUMBUS', ...",{},,https://beta.sam.gov/opp/496f02eda0f94bd9a8b3e...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
2530,61c2bdb4aee540a3bdf42a3886aca240,Kanopolis Lake Vegetation Management,W912DQ20R1058,DEPT OF DEFENSE,DEPT OF THE ARMY,W071 ENDIST KANSAS CITY,2020-04-22,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'laur...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '64106-2896', 'city': 'KANSAS CITY...",,,https://beta.sam.gov/opp/61c2bdb4aee540a3bdf42...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
12098,e39501dbb64a451fb5b10c8f933ccd6e,FD2020-20-00640,,DEPT OF DEFENSE,DEPT OF THE AIR FORCE,FA8221 AFNWC PZBB,2020-03-11,Special Notice,Special Notice,autocustom,...,,,https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '84056-5837', 'city': 'HILL AFB', ...",,,https://beta.sam.gov/opp/e39501dbb64a451fb5b10...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
38514,6ed1f768283b497bb19066368ecf53cd,41--POSI-VAC STARTER KI,SPE8E820T3561,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA TROOP SUPPORT,2020-05-18,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '19111-5096', 'city': 'PHILADELPHI...",{},,https://beta.sam.gov/opp/6ed1f768283b497bb1906...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
40638,f8a0f16a5896412581cd81f84c552d3e,"43--PUMP UNIT,CENTRIFUG",SPE7M120T438A,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),DLA LAND AND MARITIME,2020-05-28,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'Di...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '43218-3990', 'city': 'COLUMBUS', ...",{},,https://beta.sam.gov/opp/f8a0f16a5896412581cd8...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
41257,99d96fbf1dee437b8acc864286598ba9,Delivery Order for Class I Engineering Change ...,N00019-20-RFPREQ-PMA-201-0259,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVAL AIR SYSTEMS COMMAND,2020-05-28,Presolicitation,Presolicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'sama...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '20670-5000', 'city': 'PATUXENT RI...","{'city': {'code': '65000', 'name': 'Saint Loui...",,https://beta.sam.gov/opp/99d96fbf1dee437b8acc8...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
13097,ba00b2b5611d4153b5727a08cae93211,Serena Software - Sole Source (Including Brand...,FA300220Q0006,DEPT OF DEFENSE,DEPT OF THE AIR FORCE,FA3002 338 SCONS CC,2020-04-29,Justification,Justification,auto30,...,"{'date': '2020-04-23', 'number': 'FA300220C0009'}","[{'fax': None, 'type': 'primary', 'email': 'gr...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '78150-4300', 'city': 'RANDOLPH AF...","{'city': {'code': 'TX-18', 'name': 'JBSA Rando...",,https://beta.sam.gov/opp/ba00b2b5611d4153b5727...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
5186,12a2351eab5b4a2f96a309632d18c0e2,Cisco ISE License,N0025320Q0097,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVAL UNDERSEA WARFARE CENTER,2020-04-16,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'murr...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '98345-7610', 'city': 'KEYPORT', '...",,,https://beta.sam.gov/opp/12a2351eab5b4a2f96a30...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...
1367,055711e208e747198e7dac16ad315f14,USPSC- Knowledge Management Specialist,7200AA20R00045,AGENCY FOR INTERNATIONAL DEVELOPMENT,AGENCY FOR INTERNATIONAL DEVELOPMENT,USAID M/OAA,2020-04-24,Solicitation,Solicitation,auto15,...,,"[{'fax': '', 'type': 'primary', 'email': 'jbui...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '20523', 'city': 'WASHINGTON', 'co...","{'city': {'code': '50000', 'name': 'Washington...",,https://beta.sam.gov/opp/055711e208e747198e7da...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",['https://beta.sam.gov/api/prod/opps/v3/opport...


In [18]:
# find most similar documents in dataframe
def most_similar(base_document, n):
    """Finds the n documents most similar to base_document
       (a document tag), ranked by cosine similarity
       of their Doc2Vec vectors.
       Returns the matching rows of the notifications
       dataframe (title and department) as a pandas dataframe."""
    # load saved doc2vec_model
    model = Doc2Vec.load('doc2vec_model')
    # get (tag, similarity) pairs for the n nearest documents
    similars = model.docvecs.most_similar(base_document, topn=n)
    # empty list of rows in the original dataframe
    list_of_rows = []
    # iterate through all similar notifications
    for tag, similarity in similars:
        # find the row in the notifications dataframe whose index label equals the tag
        row = data.loc[tag, ['title', 'department']]
        # add row to the list of rows
        list_of_rows.append(row)
    # return all rows as a dataframe
    return pd.DataFrame(list_of_rows)

In [19]:
most_similar(2530, 10)

Unnamed: 0,title,department
33806,BRAND NAME - FLIR CAMERA MIDWAVE AND LONGWAVE LENSES and CALIBRATION,DEPT OF DEFENSE
36574,34--Haas 4th Axis rotary Table,DEPT OF DEFENSE
21034,"15--WINDSHIELD PANEL,AI, IN REPAIR/MODIFICATION OF",DEPT OF DEFENSE
27702,61--3SFG Battery Requirement,DEPT OF DEFENSE
7108,6530--ICU Beds - COVID-19,"VETERANS AFFAIRS, DEPARTMENT OF"
13380,"U008--Veterans Legacy Program, creating educational materials Multiple award to Universities/museums","VETERANS AFFAIRS, DEPARTMENT OF"
33510,300 Ton Press Control Upgrade,DEPT OF DEFENSE
28582,CENTCOM ARMORED LIVE-ROUND SUBSTITUTE SYSTEM (ALRSS),DEPT OF DEFENSE
37078,FSC 6760 Photographic Cases,DEPT OF DEFENSE
34133,"28--RING SET,PISTON",DEPT OF DEFENSE
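
One rough way to sanity-check results like these, independent of the model, is to measure how many words two titles share. The sketch below uses Jaccard overlap on lowercased, whitespace-split words. This is a swapped-in heuristic, not something Doc2Vec computes, and `jaccard` is a hypothetical helper name. The two titles are copied from the outputs above, assuming tag 2530 corresponds to row 2530 of `data`.

```python
def jaccard(title_a, title_b):
    """Share of distinct words that two titles have in common (0.0 to 1.0)."""
    words_a = set(title_a.lower().split())
    words_b = set(title_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)

base = "Kanopolis Lake Vegetation Management"
top_hit = "BRAND NAME - FLIR CAMERA MIDWAVE AND LONGWAVE LENSES and CALIBRATION"

print(jaccard(base, top_hit))  # 0.0
```

The base title and the first returned title share no words at all, which suggests the returned neighbours are not close matches on surface vocabulary.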


**Get Similar Words**

In [105]:
tagged_titles[500]

TaggedDocument(words=["['covid19', 'bpa', 'visn', 'lab', 'abbott', 'sars', 'co', 'antibody', 'test', 'reagents', 'award', 'notice', 'c24120a0039', 'veterans', 'affairs', 'department', 'of', 'veterans', 'affairs', 'department', 'of']"], tags=[500])

In [111]:
# get words similar to the entry term, scored with the multiplicative combination objective (3CosMul)
model.wv.most_similar_cosmul('office')

KeyError: "word 'office' not in vocabulary"
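
The `KeyError` is consistent with the `tagged_titles[500]` output above, where `words` holds a single long string rather than individual tokens, so the only candidate vocabulary entries are whole stringified titles. The sketch below reproduces that effect with plain counting in place of gensim's vocabulary scan. It is an approximation for illustration only: `count_vocab` is a hypothetical helper, and the documents are made-up examples in the same shape as the data.

```python
import ast
from collections import Counter

def count_vocab(documents, min_count=1):
    """Count items across documents and keep those seen at least min_count times."""
    counts = Counter(item for doc in documents for item in doc)
    return {item for item, n in counts.items() if n >= min_count}

# each document is a one-element list holding a stringified token list
raw_docs = [
    ["['government', 'publishing', 'office']"],
    ["['post', 'office', 'repair']"],
]
# the same documents parsed into real token lists
token_docs = [ast.literal_eval(doc[0]) for doc in raw_docs]

print('office' in count_vocab(raw_docs))                 # False
print('office' in count_vocab(token_docs, min_count=2))  # True
```

With string documents, the word `office` never appears as its own item, so no count threshold can admit it into the vocabulary.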