### Capstone Idea

[GitHub](https://github.com/biborsz/Capstone)

**Problem Statement:** 

Fedbizopps.gov used to be a website where small businesses could search for federal contract opportunities. Its collection of solicitations and award notifications was searchable by key terms, but, as far as I know, there was no way to find opportunities of interest based on similarity rather than keyword search. As a result, looking for contract opportunities took considerable time, which could put a strain on lightly staffed small businesses. The successor of Fedbizopps.gov, Beta.SAM.gov, although not yet fully operational, provides a wide array of filtering options in addition to keyword search and browsing. Those new to the system would still benefit, however, from an application that recommends opportunities of interest based on topic and/or wording similarity.

The purpose of this project is two-fold:
1. Stretching the limits of natural language processing, build an application that recommends contract notifications of interest based on user up- or downvotes.
2. Observe how well a content-based recommender system can find notifications that are not only similar but also relevant.


- It would give a list of potentially useful search terms, with an option to click on a select few.
- The selected search terms would further tune the recommender system.
- Businesses would have the option of rating a recommendation up or down.
- Based on earlier search terms, one would have access to a longer list of opportunities, which would make browsing possible (right now, it looks to me, that is out of the question).
- Companies could search for other relevant information, for example which companies are applying for similar grants in their business and geographic area.
   
   
[API documentation](https://open.gsa.gov/api/get-opportunities-public-api/#user-account-api-key-creation)

[Beta.SAM.gov](https://beta.sam.gov/)

**Methodology:**
   - content-based recommender system
   - text vectorizer: *bag-of-words*, *one-hot-encoder* 
     - bag of words: extracts the words of the corpus as features
     - one-hot encoder: assigns 1 to a feature if it occurs in a row/text and 0 to every feature that does not
   - recommends based on *cosine similarity* between the resulting vectors
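
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code used in the project: the titles are made up, tokenization is a naive whitespace split, and the helper names (`build_vocabulary`, `one_hot_vector`, `rank_similar`) are hypothetical.

```python
import math

def tokenize(text):
    """Lowercase a text and split it into word tokens on whitespace."""
    return text.lower().split()

def build_vocabulary(texts):
    """Bag of words: every distinct word in the corpus becomes a feature."""
    return sorted({word for text in texts for word in tokenize(text)})

def one_hot_vector(text, vocabulary):
    """1 if the feature (word) occurs in the text, 0 otherwise."""
    words = set(tokenize(text))
    return [1 if feature in words else 0 for feature in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# hypothetical notice titles, made up for illustration
titles = [
    "radar repair services",
    "radar antenna repair",
    "community nursing home services",
]
vocabulary = build_vocabulary(titles)
vectors = [one_hot_vector(title, vocabulary) for title in titles]

def rank_similar(query_index):
    """Indices of all other titles, most similar to the query first."""
    scores = [
        (cosine_similarity(vectors[query_index], vector), index)
        for index, vector in enumerate(vectors)
        if index != query_index
    ]
    return [index for score, index in sorted(scores, reverse=True)]
```

Calling `rank_similar(0)` ranks the second title ahead of the third, because it shares two words with the query while the third shares only one.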

**Sources:**

https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d

https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html

http://recommender-systems.org/content-based-filtering/

https://heartbeat.fritz.ai/recommender-systems-with-python-part-i-content-based-filtering-5df4940bd831

**Ideas for the future:**
 - find database of registered businesses (business activity/ line of business)
 - cross-reference awardees of prior grants to provide a list of competitors

In [7]:
# imports
import pandas as pd
import numpy as np
import requests
import time
import datetime

In [8]:
# set display options 
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', None)

In [None]:
# rewrite code - to read in combined file - 
# separate out dates 
# split string '-'
# turn them into datetime object
# check for oldest date
# turn back into string
# make that postedTo date
# have for loop pull another page
# transform json object
# concat it to existing dataframe
# save it to csv
# I am at 03/02/2020

In [9]:
# define function to return minimum postedDate as string
def get_min_post_date(filename):
    # read in file of downloaded contract opportunities
    df = pd.read_csv(filename)
    # get min posted date
    min_date = pd.to_datetime(df['postedDate']).min()
    return min_date.strftime('%m/%d/%Y')

In [10]:
# define function to return maximum postedDate as string
def get_max_post_date(filename):
    # read in file of downloaded contract opportunities
    df = pd.read_csv(filename)
    # get max posted date
    max_date = pd.to_datetime(df['postedDate']).max()
    return max_date.strftime('%m/%d/%Y')

In [11]:
# define function to return today's date as string
def today():
    now = datetime.datetime.now()
    return now.strftime('%m/%d/%Y')

In [12]:
today()

'06/03/2020'

In [13]:
get_max_post_date('./data/combined.csv')

'06/02/2020'

In [14]:
# pull contract information from api.sam.gov

# set base url
url = 'https://api.sam.gov/prod/opportunities/v1/search'

# create empty list to store results
result = []
# downloaded contract opportunities - file name
# file = './data/combined.csv'

# set postedTo date to today's date by calling today() function
postedTo = today()

# set posted from date
postedFrom = get_max_post_date('./data/combined.csv')

# for loop to pull contracts, 1000 records per request
for i in range(5):
    # do a get request
    req = requests.get(url,
                      params={
                          'api_key': '',
                          'postedFrom': postedFrom,
                          'postedTo': postedTo,
                          'limit': 1000,
                          'offset': i * 1000
                      })
    # stop on an HTTP error instead of storing a failed response
    req.raise_for_status()

    # add response to result list
    result.append(req)

    # print a timestamp and pause between requests
    now = datetime.datetime.now()
    print('Time:', now.strftime("%Y-%m-%d %H:%M:%S"))
    time.sleep(5)
    
    
    
# source for datetime - https://www.w3resource.com/python-exercises/python-basic-exercise-3.php

Time: 2020-06-03 22:32:09
Time: 2020-06-03 22:32:36
Time: 2020-06-03 22:33:04
Time: 2020-06-03 22:33:14
Time: 2020-06-03 22:33:21
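
The loop above always requests five pages. Since each response carries a `totalRecords` field, the number of requests could instead be derived from it. The sketch below shows one way to do that, under two assumptions: `offset` counts records, as in the loop above, and `fetch_page` is a hypothetical callable wrapping the `requests.get` call and returning the parsed JSON. The `fake_fetch` stand-in produces made-up records so the logic can be tried without an API key.

```python
def pull_all_pages(fetch_page, limit=1000, max_pages=50):
    """Call fetch_page(offset, limit) until every record has been requested.

    fetch_page is a hypothetical callable that returns the parsed JSON of one
    response as a dict with 'totalRecords' and 'opportunitiesData' keys.
    max_pages is a safety cap so a bad response cannot loop forever.
    """
    pages = []
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(offset, limit)
        pages.append(page)
        offset += limit
        # stop once the next offset would be past the last record
        if offset >= page['totalRecords']:
            break
    return pages

# stand-in for the real API call, returning fake records for illustration
def fake_fetch(offset, limit, total=2500):
    records = [{'noticeId': str(n)}
               for n in range(offset, min(offset + limit, total))]
    return {'totalRecords': total, 'limit': limit, 'offset': offset,
            'opportunitiesData': records}

pages = pull_all_pages(fake_fetch)
page_sizes = [len(page['opportunitiesData']) for page in pages]
```

With 2,500 fake records and a limit of 1,000, this makes three requests rather than five.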


In [15]:
# unpack list of json objects from response data
ops = []
for item in result:
    print(item.headers)
    ops.append(item.json())

{'Age': '20', 'Content-Type': 'application/hal+json', 'Date': 'Thu, 04 Jun 2020 02:32:09 GMT', 'Server': 'openresty', 'Vary': 'Origin, Access-Control-Request-Method, Access-Control-Request-Headers', 'Via': 'http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'X-Forwarded-For': '74.96.156.35, 10.177.16.72, 10.177.54.200, 10.177.54.200', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Set-Cookie': 'citrix_ns_id=RMIen98RxPPmQqMq57dl6w+msVY0002; Domain=.sam.gov; Path=/; Secure; HttpOnly', 'Cache-Control': 'private', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked'}
{'Age': '21', 'Content-Type': 'application/hal+json', 'Date': 'Thu, 04 Jun 2020 02:32:35 GMT', 'Server': 'openresty', 'Vary': 'Origin, Access-Control-Request-Method, Access-Control-Request-Headers', 'Via': 'http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'X-Forwarded-For': '74.96.156.35, 10.177.16.72, 10.177.53.59,

In [16]:
# parse json objects
ls_data = []
for op in ops:
    print(op.keys())
    ls_data.append(pd.DataFrame(op['opportunitiesData']))
# concatenate once, after all pages are parsed
data = pd.concat(ls_data)

dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])


In [17]:
data.shape

(3146, 27)

In [18]:
data.tail(10)

Unnamed: 0,noticeId,title,solicitationNumber,department,subTier,office,postedDate,type,baseType,archiveType,archiveDate,typeOfSetAsideDescription,typeOfSetAside,responseDeadLine,naicsCode,classificationCode,active,award,pointOfContact,description,organizationType,officeAddress,placeOfPerformance,additionalInfoLink,uiLink,links,resourceLinks
136,0149e9351bc84e479cb87222f80591ed,VCS Tailcone Solicitation,N0016720R0010,DEPT OF DEFENSE,DEPT OF THE NAVY,NSWC CARDEROCK,2020-06-02,Solicitation,Solicitation,auto15,2020-06-27,,,2020-06-12T15:00:00-04:00,336611.0,2010,Yes,,"[{'fax': None, 'type': 'primary', 'email': 'douglas.riedel@navy.mil', 'phone': '3012272959', 'title': 'Contract Specialist', 'fullName': 'Douglas A Riedel'}, {'fax': '', 'type': 'secondary', 'email': 'jonathan.mauro@navy.mil', 'phone': '3012274053', 'title': 'COR', 'fullName': 'Jonathan Mauro'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0149e9351bc84e479cb87222f80591ed,OFFICE,"{'zipcode': '20817-5700', 'city': 'BETHESDA', 'countryCode': 'USA', 'state': 'MD'}","{'country': {'code': 'USA', 'name': 'UNITED STATES'}}",,https://beta.sam.gov/opp/0149e9351bc84e479cb87222f80591ed/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0149e9351bc84e479cb87222f80591ed&limit=1'}]","[https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/9fabb18b3fca4287afea158bfa8cf1b1/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/105f5b618876414681dcd87298c87b6e/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/a3cb27f001e1482492d4366a3175c0eb/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/2fc57cd4c9394833b0cb85319bb2c0fd/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/9d49bc8e8eb24c59aa3a09e0d8adfc03/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/e84f29aa176e4260b45b9f16cdf9cc79/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/523f10c600d64ddeb6158cb54ae578d7/download?api_key=null&token=, 
https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/00770aea12b5470eb9f5459f66ee1a0a/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/1f9ccdb798124e7db4e3245cdc6717f7/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/d27a2ae7bf3848118824c526d5b89605/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/4e26ddc923d4433b8cc1521aad1b54f6/download?api_key=null&token=, https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/9c2b9f3ca856445690ba74087cf23e52/download?api_key=null&token=]"
137,0124531ff6894e799daa94584a51bf46,Q402--Community Nursing Home (CNH) Services (Multiple Award) Houston,36C25620Q0470,"VETERANS AFFAIRS, DEPARTMENT OF","VETERANS AFFAIRS, DEPARTMENT OF",256-NETWORK CONTRACT OFFICE 16 (36C256),2020-06-02,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,2020-08-02,,,2020-06-08T16:00:00-05:00,623110.0,Q402,Yes,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'steven.berkeley@va.gov', 'phone': None, 'title': None, 'fullName': 'Steven A Berkeley Contracting Officer 713-791-1414'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0124531ff6894e799daa94584a51bf46,OFFICE,"{'zipcode': '39157', 'city': 'RIDGELAND', 'countryCode': 'USA', 'state': 'MS'}",{},,https://beta.sam.gov/opp/0124531ff6894e799daa94584a51bf46/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0124531ff6894e799daa94584a51bf46&limit=1'}]",[https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/d5b0762e6c144e6c9f0e2beb9f78bc5d/download?api_key=null&token=]
138,0120814104e543899b47e86c441faf13,"70--ETHRNT LAN SW ASSY, IN REPAIR/MODIFICATION OF",N0010420QND18,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVSUP WEAPON SYSTEMS SUPPORT MECH,2020-06-02,Presolicitation,Presolicitation,autocustom,2020-06-23,Total Small Business Set-Aside (FAR 19.5),SBA,2020-06-08,,7050,Yes,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'STEVE.SMITHEY@NAVY.MIL', 'phone': None, 'title': None, 'fullName': 'STEVE P. SMITHEY, N744.8, PHONE (717)605-7751, FAX (717)605-4236, EMAIL STEVE.SMITHEY@NAVY.MIL'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0120814104e543899b47e86c441faf13,OFFICE,"{'zipcode': '17050-0788', 'city': 'MECHANICSBURG', 'countryCode': 'USA', 'state': 'PA'}",{},,https://beta.sam.gov/opp/0120814104e543899b47e86c441faf13/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0120814104e543899b47e86c441faf13&limit=1'}]",
139,0117619e58e34fd9a255f22237b38178,Birkman Method Behavioral and Occupational Assessment,FY20A1PCKB,DEPT OF DEFENSE,DEPT OF THE AIR FORCE,FA4890 ACC AMIC,2020-06-02,Sources Sought,Sources Sought,auto15,2020-06-24,Total Small Business Set-Aside (FAR 19.5),SBA,2020-06-09T18:00:00-04:00,611430.0,U002,Yes,,"[{'fax': None, 'type': 'primary', 'email': 'camilla.funk@us.af.mil', 'phone': None, 'title': None, 'fullName': 'Camilla Funk'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0117619e58e34fd9a255f22237b38178,OFFICE,"{'zipcode': '23665-2701', 'city': 'LANGLEY AFB', 'countryCode': 'USA', 'state': 'VA'}","{'city': {'code': 'VA-06', 'name': 'Langley AFB'}, 'state': {'code': 'VA', 'name': 'Virginia'}, 'zip': '23665', 'country': {'code': 'USA', 'name': 'UNITED STATES'}}",,https://beta.sam.gov/opp/0117619e58e34fd9a255f22237b38178/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0117619e58e34fd9a255f22237b38178&limit=1'}]",
140,010c5fcb1e1c41bdadafad079fdc7183,Seaspray 7500E Radar Repair Services,70Z03820RH0000002,"HOMELAND SECURITY, DEPARTMENT OF",US COAST GUARD,AVIATION LOGISTICS CENTER (ALC)(000,2020-06-02,Presolicitation,Presolicitation,auto15,2020-07-01,,,2020-06-16T11:30:00-04:00,488190.0,J059,Yes,,"[{'fax': '', 'type': 'primary', 'email': 'renee.l.wood@uscg.mil', 'phone': '', 'title': None, 'fullName': 'Renee L. Wood'}, {'fax': '', 'type': 'secondary', 'email': 'Ashley.M.Radtke@uscg.mil', 'phone': '', 'title': None, 'fullName': 'Ashley M. Radtke'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=010c5fcb1e1c41bdadafad079fdc7183,OFFICE,"{'zipcode': '27909', 'city': 'ELIZABETH CTY', 'countryCode': 'USA', 'state': 'NC'}","{'city': {'code': '37000', 'name': 'Huntsville'}, 'state': {'code': 'AL', 'name': 'Alabama'}, 'zip': '35805', 'country': {'code': 'USA', 'name': 'UNITED STATES'}}",,https://beta.sam.gov/opp/010c5fcb1e1c41bdadafad079fdc7183/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=010c5fcb1e1c41bdadafad079fdc7183&limit=1'}]",
141,00e627c798a0456caec2c930bb3c2c44,J--Mechanicsburg intends to solicit and award a base five-year requirements contract for the repair of the MK38 weapon system(see attached NIIN and P/N List).,N0010419RD004,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVSUP WEAPON SYSTEMS SUPPORT MECH,2020-06-02,Presolicitation,Presolicitation,autocustom,2020-06-03,,,2019-02-04T00:00:00-05:00,332994.0,J,Yes,,"[{'fax': None, 'type': 'primary', 'email': None, 'phone': None, 'title': None, 'fullName': 'Trevor Monn 7176054446'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=00e627c798a0456caec2c930bb3c2c44,OFFICE,"{'zipcode': '17050-0788', 'city': 'MECHANICSBURG', 'countryCode': 'USA', 'state': 'PA'}",,,https://beta.sam.gov/opp/00e627c798a0456caec2c930bb3c2c44/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=00e627c798a0456caec2c930bb3c2c44&limit=1'}]",
142,0067c4b152784bf2aba29c585b8866a6,Booster Motors,N6893620R0126,DEPT OF DEFENSE,DEPT OF THE NAVY,NAVAL AIR WARFARE CENTER,2020-06-02,Sources Sought,Sources Sought,autocustom,2021-06-16,,,2020-06-16T09:00:00-07:00,332710.0,9390,Yes,,"[{'fax': '7609393095', 'type': 'primary', 'email': 'erika.m.martin@navy.mil', 'phone': '7609390283', 'title': None, 'fullName': 'Erika Martin'}, {'fax': '7609398186', 'type': 'secondary', 'email': 'erin.strand@navy.mil', 'phone': '7609397309', 'title': None, 'fullName': 'Erin K Strand'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0067c4b152784bf2aba29c585b8866a6,OFFICE,"{'zipcode': '93555-6018', 'city': 'CHINA LAKE', 'countryCode': 'USA', 'state': 'CA'}",,,https://beta.sam.gov/opp/0067c4b152784bf2aba29c585b8866a6/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0067c4b152784bf2aba29c585b8866a6&limit=1'}]",
143,003b1ba4f6454a74856267014305d7d2,Virtual Reality Paint Simulation,W912GY20RFI6220,DEPT OF DEFENSE,DEPT OF THE ARMY,W6QK SIAD CONTR OFF,2020-06-02,Sources Sought,Sources Sought,auto15,2020-07-01,,,2020-06-16T10:00:00-07:00,611420.0,6910,Yes,,"[{'fax': None, 'type': 'primary', 'email': 'heidi.m.young.civ@mail.mil', 'phone': '5308274565', 'title': None, 'fullName': 'Heidi Young'}, {'fax': '5308274722', 'type': 'secondary', 'email': 'melissa.m.kaarbo.civ@mail.mil', 'phone': '5308274776', 'title': None, 'fullName': 'Melissa Kaarbo'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=003b1ba4f6454a74856267014305d7d2,OFFICE,"{'zipcode': '96113-5000', 'city': 'HERLONG', 'countryCode': 'USA', 'state': 'CA'}","{'city': {'code': '33336', 'name': 'Herlong'}, 'state': {'code': 'CA', 'name': 'California'}, 'zip': '96113', 'country': {'code': 'USA', 'name': 'UNITED STATES'}}",,https://beta.sam.gov/opp/003b1ba4f6454a74856267014305d7d2/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=003b1ba4f6454a74856267014305d7d2&limit=1'}]",[https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/4946f7050a1645c1a747f56e64f723af/download?api_key=null&token=]
144,001f916c79c449978942c8d62a22b6d6,CIRCUIT CARD ASSEMB,SPRPA120QR351,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),"DLA AVIATION AT PHILADELPHIA, PA",2020-06-02,Solicitation,Solicitation,auto15,2020-06-27,,,2020-06-12,333999.0,5998,Yes,{'awardee': {'location': {}}},"[{'fax': None, 'type': 'primary', 'email': 'JASON.SOL@DLA.MIL', 'phone': None, 'title': None, 'fullName': 'Telephone: 2157373652'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=001f916c79c449978942c8d62a22b6d6,OFFICE,"{'zipcode': '19111-5098', 'city': 'PHILADELPHIA', 'countryCode': 'USA', 'state': 'PA'}",{},,https://beta.sam.gov/opp/001f916c79c449978942c8d62a22b6d6/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=001f916c79c449978942c8d62a22b6d6&limit=1'}]",
145,0011c709bf9d456e893b098504fb41d9,6515--NEW - SWANSON FLEX HINGE TOE,36C25720Q0773,"VETERANS AFFAIRS, DEPARTMENT OF","VETERANS AFFAIRS, DEPARTMENT OF",257-NETWORK CONTRACT OFFICE 17 (36C257),2020-06-02,Sources Sought,Sources Sought,autocustom,2020-06-09,,,2020-06-04T16:00:00-05:00,339113.0,6515,Yes,{'awardee': {}},"[{'fax': '210-694-6300', 'type': 'primary', 'email': 'vinicky.ervin@va.gov', 'phone': '210-694-6306', 'title': 'Contract Specialist', 'fullName': 'Dr. Vinicky Ann Ervin Ph.D.'}]",https://api.sam.gov/prod/opportunities/v1/noticedesc?noticeid=0011c709bf9d456e893b098504fb41d9,OFFICE,"{'zipcode': '76006', 'city': 'ARLINGTON', 'countryCode': 'USA', 'state': 'TX'}","{'streetAddress': 'VA Health Care Center in Harlington', 'streetAddress2': ' 2601 Veterans Drive', 'city': {'name': 'Harlingen, Texas'}, 'state': {'name': ''}, 'zip': '78550-8595', 'country': {'code': 'USA', 'name': 'UNITED STATES'}}",,https://beta.sam.gov/opp/0011c709bf9d456e893b098504fb41d9/view,"[{'rel': 'self', 'href': 'https://api.sam.gov/prod/opportunities/v1/search?noticeid=0011c709bf9d456e893b098504fb41d9&limit=1'}]",[https://beta.sam.gov/api/prod/opps/v3/opportunities/resources/files/64744604c3c14194b7367a3dd4025d9b/download?api_key=null&token=]


In [19]:
data.duplicated('noticeId').sum()

0

In [15]:
# data.drop_duplicates('noticeId', inplace=True)

In [20]:
data.to_csv('./data/6_03_pull.csv', index=False)

In [21]:
df1 = pd.read_csv('./data/combined.csv')
df2 = pd.read_csv('./data/6_03_pull.csv')

In [22]:
df2.shape

(3146, 27)

In [23]:
df = pd.concat([df1, df2])

In [24]:
df.shape

(50887, 27)

In [25]:
df.to_csv('./data/combined.csv', index=False)
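
Because `postedFrom` is set to the latest posted date already in `combined.csv`, notices posted on that date may be pulled a second time. The duplicate check above covers only the new pull, not the combined frame. A minimal sketch of checking and deduplicating after the concat, using small made-up frames in place of the real files:

```python
import pandas as pd

# small made-up frames standing in for combined.csv and the new pull
old = pd.DataFrame({'noticeId': ['a1', 'b2', 'c3'],
                    'postedDate': ['2020-06-01', '2020-06-02', '2020-06-02']})
new = pd.DataFrame({'noticeId': ['c3', 'd4'],
                    'postedDate': ['2020-06-02', '2020-06-03']})

combined = pd.concat([old, new], ignore_index=True)

# count notices that appear in both the old file and the new pull
n_duplicates = int(combined.duplicated(subset='noticeId').sum())

# keep the most recently pulled copy of each notice
deduped = (combined
           .drop_duplicates(subset='noticeId', keep='last')
           .reset_index(drop=True))
```

Keeping the last copy is a design choice: it prefers the newest pull in case a notice was updated between pulls. Use `keep='first'` to prefer the stored copy instead.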