### Capstone Idea

[GitHub](https://github.com/biborsz/Capstone)

**Problem Statement:** 

Beta.SAM.gov is the successor, among other federal governmental websites, of fedbizops.gov. The part that interests me is that it has a searchable collection of federal awards, as well as an information system of past grants. The current system is not yet fully operational. Nonetheless, it allows to search the collection based on key terms, grant id, (ide meg egyebek). One of it's problems in the past was that while it was searchable, it did not have a recommender system. Businesses, especially small businesses had to spend a considerable amount of time to find relevant grant opportunities. This new system offers that. Creating an account will have the advantage of receiving updates. However, the downside of email updates is that it clogs up the email inbox, and it does not always generate the expected value:
- it is difficult to know whether there are more opportunities out there or just the ones about which a business gets notified
- reader fatigue may cause businesses not to look further for contract opportunities than what had already been sent to them - and thus leave potential fundings on the table
- someone still needs to sift through a large amount of potentially irrelevant federal grant description

All in all: a user controlled recommender system would enhance the effectiveness of grant searching for businesses that do not have lots of resources to allocate toward that activity in the first place. 

How it would work:

- based on archival data - it would search for similarities in 
   - business activity of applicants
   - name of funding agency
   
   - earlier search terms
   - successful earlier grant applications

-> it would give a list of potentially useful search term, with an option to click on a select few

-> based on the search terms selected - it will further tune the recommender system

-> businesses have an option of rating a recommendation up or down

- based on earlier search terms - one would have access to a longer list of opportunites - that would make browsing possible (right now - it looks to me - that is out of the question)

- companies could search for other relevant information - for example: which companies are applying for similar grants in their business and geographic area
   
   
[API documentation](https://open.gsa.gov/api/get-opportunities-public-api/#user-account-api-key-creation)

[Beta.SAM.gov](https://beta.sam.gov/)

**Methodology:**
   - content based recommender system
   - text vectorizer: *bag-of-words*, *one-hot-encoder* 
     - bag of words: extracts words from the corpus as features
     - one-hot-encoder: gives a value of one if it finds a feature within a row/ text (0 to all other features)
   - classifies/ recommends based on *cosine similarities*

**Sources:**

https://towardsdatascience.com/how-to-build-a-simple-recommender-system-in-python-375093c3fb7d

https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html

**Ideas for the future:**
 - find database of registered businesses (business activity/ line of business)
 - crossreference awardees in prior grants to provide a list of competitors

In [1]:
# imports
import pandas as pd
import numpy as np
import requests
import time
import datetime

In [128]:
# set display options 
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', None)

In [None]:
# rewrite code - to read in combined file - 
# separete out dates 
# split string '-'
# turn them into datetime object
# check for oldest date
# turn back into string
# make that postedTo date
# have for loop pull another page
# transform json object
# concat it to existing dataframe
# save it to csv
# I am at 03/02/2020

In [2]:
# define function to return minimum postedDate as string
def get_min_post_date(filename):
    # read in file of downloaded contract opportunities
    df = pd.read_csv(filename)
    # get min posted date
    min_date = pd.to_datetime(df['postedDate']).min()
    return min_date.strftime('%m/%d/%Y')

In [11]:
# get_min_post_date('./data/5_24_pull.csv')

'05/03/2020'

In [3]:
# pull contract information from api.sam.gov

# set base url
url = 'https://api.sam.gov/prod/opportunities/v1/search'

# create empty list to store results
result = []
# initialize counter
count = 0
# downloaded contract opportunities - file name
# file = './data/combined.csv'

# get minimum posted date from downloaded contract opportunities file
# set postedTo date to minimum posted date
# postedTo = get_min_post_date(file)
postedTo = get_min_post_date('./data/combined.csv')

# set posted from date
postedFrom = '01/01/2020'

# for loop to pull contracts
for i in range(10):
    count += 1
    
    # do a get request
    req = requests.get(url,
                      params={
                          'api_key': '',
                          'postedFrom': postedFrom,
                          'postedTo': postedTo,
                          'limit': 1000,
                          'offset': (count-1) * 1000
                      })
    
    # add response to result list
    result.append(req)
    
    now = datetime.datetime.now()
    print('Time:', now.strftime("%Y-%m-%d %H:%M:%S"))
    time.sleep(5)
    
    
    
# source for datetime - https://www.w3resource.com/python-exercises/python-basic-exercise-3.php

Time: 2020-05-26 21:26:56
Time: 2020-05-26 21:27:22
Time: 2020-05-26 21:27:47
Time: 2020-05-26 21:27:54
Time: 2020-05-26 21:28:01
Time: 2020-05-26 21:28:08
Time: 2020-05-26 21:28:15
Time: 2020-05-26 21:28:22
Time: 2020-05-26 21:28:30
Time: 2020-05-26 21:28:37


In [4]:
# unpack list of json objects from response data
ops = []
for item in result:
    print(item.headers)
    ops.append(item.json())

{'Age': '21', 'Content-Type': 'application/hal+json', 'Date': 'Wed, 27 May 2020 01:26:56 GMT', 'Server': 'openresty', 'Vary': 'Origin, Access-Control-Request-Method, Access-Control-Request-Headers', 'Via': 'http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'X-Forwarded-For': '74.96.156.35, 10.177.16.72, 10.177.55.40, 10.177.55.40', 'Connection': 'keep-alive', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Set-Cookie': 'citrix_ns_id=0UAwFQF08xlSqhWm8yioRiIaitQ0002; Domain=.sam.gov; Path=/; Secure; HttpOnly', 'Cache-Control': 'private', 'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked'}
{'Age': '21', 'Content-Type': 'application/hal+json', 'Date': 'Wed, 27 May 2020 01:27:22 GMT', 'Server': 'openresty', 'Vary': 'Origin, Access-Control-Request-Method, Access-Control-Request-Headers', 'Via': 'http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])', 'X-Cache': 'MISS', 'X-Forwarded-For': '74.96.156.35, 10.177.16.72, 10.177.55.40, 1

In [5]:
# parse json objects
ls_data = []
for i in range(len(ops)):
    print(ops[i].keys())
    df = pd.DataFrame(ops[i]['opportunitiesData'])
    ls_data.append(df)
    data = pd.concat(ls_data)

dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])
dict_keys(['totalRecords', 'limit', 'offset', 'opportunitiesData', 'links'])


In [6]:
data.shape

(2983, 27)

In [7]:
data.tail(10)

Unnamed: 0,noticeId,title,solicitationNumber,department,subTier,office,postedDate,type,baseType,archiveType,...,award,pointOfContact,description,organizationType,officeAddress,placeOfPerformance,additionalInfoLink,uiLink,links,resourceLinks
973,997b369db1ee4da3ad16fa6b6f365e12,Award of a 1 GB ETHERNET FROM (BLDG) 261; (RM)...,HC101320QA037,DEPT OF DEFENSE,DEFENSE INFORMATION SYSTEMS AGENCY (DISA),DITCO-SCOTT,2020-01-02,Award Notice,Combined Synopsis/Solicitation,autocustom,...,"{'date': '2020-01-02', 'number': 'HC101320PA24...","[{'fax': '', 'type': 'primary', 'email': 'tami...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '622255406', 'city': 'SCOTT AFB', ...",,,https://beta.sam.gov/opp/997b369db1ee4da3ad16f...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
974,7aa4b24d40f24d6490eea05a9f4e72e1,Dry Goods/Meats 2nd Qtr FY20,15B20320Q00000006,"JUSTICE, DEPARTMENT OF",FEDERAL PRISON SYSTEM / BUREAU OF PRISONS,FCI DANBURY,2020-01-02,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'dort...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '06811', 'city': 'DANBURY', 'count...","{'city': {'code': '18430', 'name': 'Danbury'},...",,https://beta.sam.gov/opp/7aa4b24d40f24d6490eea...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...
975,6d70e2e7e9a34b5fba969386b8196e8e,"SUBMIT A QUOTE TO PROVIDE, INSTALL, AND MAINTA...",HC101319QB008,DEPT OF DEFENSE,DEFENSE INFORMATION SYSTEMS AGENCY (DISA),DITCO-SCOTT,2020-01-02,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'dale...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '622255406', 'city': 'SCOTT AFB', ...",,,https://beta.sam.gov/opp/6d70e2e7e9a34b5fba969...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...
976,5ea0e7bb80434c3f809b5a96e48b1b07,Purchase of Gearbox Assembly,SPRTA1-20-Q-0112,DEPT OF DEFENSE,DEFENSE LOGISTICS AGENCY (DLA),"DLA AVIATION AT OKLAHOMA CITY, OK",2020-01-02,Presolicitation,Presolicitation,autocustom,...,,"[{'fax': None, 'type': 'primary', 'email': 'da...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '73145-3070', 'city': 'TINKER AFB'...","{'country': {'code': 'USA', 'name': 'UNITED ST...",,https://beta.sam.gov/opp/5ea0e7bb80434c3f809b5...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
977,4e7da4f82e094701a2f48b336ec0c8ae,"SUBMIT A QUOTE TO PROVIDE, INSTALL, AND MAINTA...",HC101320QA158,DEPT OF DEFENSE,DEFENSE INFORMATION SYSTEMS AGENCY (DISA),DITCO-SCOTT,2020-01-02,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'tami...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '622255406', 'city': 'SCOTT AFB', ...",,,https://beta.sam.gov/opp/4e7da4f82e094701a2f48...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...
978,34c8c74c20054b2ab9a456947f110f6d,Remediation Services at the PR-58 Site in Nort...,W912WJ20X0014,DEPT OF DEFENSE,DEPT OF THE ARMY,W2SD ENDIST NEW ENGLAND,2020-01-02,Sources Sought,Sources Sought,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'heat...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '01742-2751', 'city': 'CONCORD', '...","{'city': {'code': '51580', 'name': 'North King...",,https://beta.sam.gov/opp/34c8c74c20054b2ab9a45...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
979,256c9825d7744a2988be4cd52238c0d1,"SUBMIT A QUOTE TO PROVIDE, INSTALL, AND MAINTA...",HC101319QB215,DEPT OF DEFENSE,DEFENSE INFORMATION SYSTEMS AGENCY (DISA),DITCO-SCOTT,2020-01-02,Combined Synopsis/Solicitation,Combined Synopsis/Solicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'tami...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '622255406', 'city': 'SCOTT AFB', ...",,,https://beta.sam.gov/opp/256c9825d7744a2988be4...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...
980,14c585fbe7c1461b868ded6ca82f7b8d,"PROVIDE, INSTALL, AND MAINTAIN A 2.488GB (OC48...",HC101320QA222,DEPT OF DEFENSE,DEFENSE INFORMATION SYSTEMS AGENCY (DISA),DITCO-SCOTT,2020-01-02,Solicitation,Solicitation,autocustom,...,,"[{'fax': None, 'type': 'primary', 'email': 'aa...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '622255406', 'city': 'SCOTT AFB', ...",,,https://beta.sam.gov/opp/14c585fbe7c1461b868de...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...
981,13f7e81d93e64c5a8e96d8e143c9507d,J--REPLACE VICTORIAN PARK LIGHT FIXTURES AT: S...,140P8620Q0004,"INTERIOR, DEPARTMENT OF THE",NATIONAL PARK SERVICE,PWR GOGA(86000),2020-01-02,Award Notice,Combined Synopsis/Solicitation,autocustom,...,"{'date': '2019-12-30', 'number': '140P8620P001...","[{'fax': None, 'type': 'primary', 'email': 'Ga...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '94123', 'city': 'SAN FRANCISCO', ...",{},,https://beta.sam.gov/opp/13f7e81d93e64c5a8e96d...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",
982,0ab14328f5054d5889c44377b38d2f6a,COHO,HR001120S0001,DEPT OF DEFENSE,DEFENSE ADVANCED RESEARCH PROJECTS AGENCY (DA...,DEF ADVANCED RESEARCH PROJECTS AGCY,2020-01-02,Presolicitation,Presolicitation,autocustom,...,,"[{'fax': '', 'type': 'primary', 'email': 'HR00...",https://api.sam.gov/prod/opportunities/v1/noti...,OFFICE,"{'zipcode': '222032114', 'city': 'ARLINGTON', ...",,,https://beta.sam.gov/opp/0ab14328f5054d5889c44...,"[{'rel': 'self', 'href': 'https://api.sam.gov/...",[https://beta.sam.gov/api/prod/opps/v3/opportu...


In [8]:
data.duplicated('noticeId').sum()

0

In [15]:
# data.drop_duplicates('noticeId', inplace=True)

In [9]:
data.to_csv('./data/5_26_pull.csv', index=False)

In [10]:
df1 = pd.read_csv('./data/combined.csv')
df2 = pd.read_csv('./data/5_26_pull.csv')

In [12]:
df2.shape

(1958, 27)

In [13]:
df = pd.concat([df1, df2, df3])

In [14]:
df.shape

(31913, 27)

In [15]:
df.to_csv('./data/combined.csv', index=False)