# Refresh KSATT Open Source Graph Database

### Usage

Neo4j version 4.4.5 was in use when this was last updated.

While all of these sources are relatively static, they do receive version updates (on irregular cadences).  
Incorporating these updates is not urgent, so a refresh should be run monthly or quarterly; each refresh also requires a more significant rebuild of the KSATT graph database.  

Prior to running this code for the first time, you should:
- Ensure all required Python packages are installed on your local machine (or in the virtual environment you use to run this).
- Create an empty new local Neo4j graph database. 
  - Start it upon creation, then stop it. 
  - Install the APOC plugin. 
  - Edit the settings file so that the minimum and maximum heap sizes are at least 1GB and 3GB respectively.
  - You'll need to restart the database after making those changes to the configuration.
- Substitute in your file path to the `neo4j_import_folder` variable in the *Prep* codeblock under *Download Data*. 
  - This file path can be found through Neo4j Desktop (... > Open folder > DBMS > import). On macOS, one way to get it is to open this directory in Finder, select "New Terminal at Folder", and run `pwd` to copy and paste the full path. 
- Manually copy the `opm_onet_crosswalk.csv` file from this repository into the Neo4j DBMS import folder. 
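
The manual copy in the last step can also be scripted. The following is a minimal sketch, not part of the original notebook: the function name `stage_crosswalk` and the `repo_folder` argument are hypothetical, while `neo4j_import_folder` and `opm_onet_crosswalk.csv` come from the steps above.

```python
import shutil
from pathlib import Path


def stage_crosswalk(repo_folder, neo4j_import_folder, filename="opm_onet_crosswalk.csv"):
    """Copy the crosswalk CSV from the repository into the Neo4j import folder.

    `repo_folder` is a hypothetical argument: the local checkout of this repository.
    """
    source = Path(repo_folder) / filename
    target_folder = Path(neo4j_import_folder)
    if not source.is_file():
        raise FileNotFoundError(f"Crosswalk file not found: {source}")
    if not target_folder.is_dir():
        raise NotADirectoryError(f"Import folder not found: {target_folder}")
    # copy2 preserves file metadata and returns the destination path
    return Path(shutil.copy2(source, target_folder / filename))


# Example call (paths are placeholders, adjust to your machine):
# stage_crosswalk(".", neo4j_import_folder)
```

Failing early on a missing import folder helps catch a mistyped `neo4j_import_folder` before any download or upload step runs.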

Make sure your Neo4j database is started before running this code.  
When this code is run, it:
- Downloads data from ONET, ESCO and NICE in the form of CSVs into the Neo4j import folder. **Remember to set the `download_new_data` flag to `True`.**
- Cleans all data in the Neo4j import folder in preparation for upload. 
- Clears a local **existing and running** Neo4j graph database, then uploads the newly downloaded and cleaned data. 
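
The three stages above, and the role of the `download_new_data` flag, can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the notebook's actual code: `refresh` and the `download`, `clean`, and `upload` callables are hypothetical stand-ins for the notebook's own cells.

```python
from pathlib import Path


def refresh(import_folder, download_new_data, download, clean, upload):
    """Run the three stages in order.

    `download`, `clean`, and `upload` are caller-supplied callables
    (hypothetical stand-ins for the corresponding notebook cells).
    """
    folder = Path(import_folder)
    if download_new_data:
        download(folder)  # stage 1: fetch fresh CSVs into the import folder
    csv_files = sorted(folder.glob("*.csv"))
    for path in csv_files:
        clean(path)  # stage 2: clean each CSV in preparation for upload
    upload(csv_files)  # stage 3: clear the database, then load the cleaned files
    return csv_files
```

With `download_new_data=False`, the download stage is skipped and the existing files in the import folder are cleaned and uploaded again.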

### To Do
- Incorporate [ONET Crosswalks](https://www.onetcenter.org/crosswalks.html).
- Possibly incorporate USA Jobs. 

### Changes in code to run in Memgraph
- `apoc.periodic.iterate` -> `periodic.iterate`
- `WITH HEADERS` -> `WITH HEADER`, moved to after the CSV file location
- `batchSize` -> `batch_size`; the config only accepts this one parameter, the number of items in a batch
- The `YIELD` operation did not work; changed to `YIELD * RETURN *` for now
- Added some functions specifically for MAGE and constraints
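
The first three substitutions are purely textual, so they can be illustrated with a small helper. This is a sketch only: `to_memgraph` is a hypothetical name, the example query is made up, and the helper does not remove extra config parameters, rewrite `YIELD` clauses, or adjust the CSV location string.

```python
import re


def to_memgraph(query: str) -> str:
    """Apply the textual substitutions listed above to a Neo4j-style query."""
    # apoc.periodic.iterate -> periodic.iterate
    query = query.replace("apoc.periodic.iterate", "periodic.iterate")
    # batchSize -> batch_size
    query = query.replace("batchSize", "batch_size")
    # LOAD CSV WITH HEADERS FROM <location> -> LOAD CSV FROM <location> WITH HEADER
    query = re.sub(
        r'''LOAD CSV WITH HEADERS FROM\s+('[^']*'|"[^"]*")''',
        r"LOAD CSV FROM \1 WITH HEADER",
        query,
    )
    return query


# Made-up example query, for illustration only
neo4j_query = (
    "CALL apoc.periodic.iterate("
    "\"LOAD CSV WITH HEADERS FROM 'file:///example.csv' AS row RETURN row\", "
    "\"MERGE (n:Example {key: row.key})\", "
    "{batchSize: 1000})"
)
print(to_memgraph(neo4j_query))
```

Queries with additional config entries or an explicit `YIELD` list would still need manual edits after this pass.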

# Upload to MEM

## Clear Database and Assert Schema

In [None]:
# MATCH (n) DETACH DELETE n 
#from neo4j import GraphDatabase
 
#URI = "bolt://localhost:7687"
#AUTH = ("", "")
 
#with GraphDatabase.driver(URI, auth=AUTH) as client:

#    query = "CREATE CONSTRAINT ON (n:ONET_Major_Group) ASSERT n.key IS UNIQUE;"
#    client.execute_query(query)
 
#with GraphDatabase.driver(URI, auth=AUTH).session() as session:
#    session.run("CREATE CONSTRAINT ON (n:ONET_Major_Group) ASSERT n.key IS UNIQUE;")

## Constraints and Indexes

In [None]:
print("Removing all nodes and clearing the database schema for a fresh start")
# NOTE: had issue with broken connection or wire error, something like that - testing to see if upping heap size in neo4j.conf for the new database helps
run_constraint_query("MATCH (n) DETACH DELETE n;")  # remove all nodes and relationships
# CALL apoc.schema.assert({},{}, true)  # clear schema

print("Assert unique index properties for nodes related to employee data - this makes searching faster and organization neater")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Occupation) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Occupation(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Major_Group) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Major_Group(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Minor_Group) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Minor_Group(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Broad_Occupation_Group) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Broad_Occupation_Group(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Detailed_Occupation_Group) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Detailed_Occupation_Group(key);")

run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Scale) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Scale(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Segment) ASSERT n.key, n.title IS UNIQUE;") 
run_constraint_query("CREATE INDEX ON :ONET_Segment(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Family) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Family(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Class) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Class(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Commodity) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Commodity(key);")

run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Technology_Skills) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Technology_Skills(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Tools) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Tools(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Technology_Skills_Example) ASSERT n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Technology_Skills_Example(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Tools_Example) ASSERT n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Tools_Example(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:xc) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Tools_Example(title);")

run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Abilities) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Abilities(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Interests_And_Work_Values) ASSERT n.key, n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Interests_And_Work_Values(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Work_Styles) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Work_Styles(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Basic_Skills) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Basic_Skills(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Cross_Functional_Skills) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Cross_Functional_Skills(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Knowledge) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Knowledge(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Education) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Education(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Education_Category) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Education_Category(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Experience_And_Training) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Experience_And_Training(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Experience_And_Training_Category) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Experience_And_Training_Category(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Generalized_Work_Activities) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Generalized_Work_Activities(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Intermediate_Work_Activities) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Intermediate_Work_Activities(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Detailed_Work_Activities) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Detailed_Work_Activities(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ONET_Work_Context) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ONET_Work_Context(key);")
  
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Skills) ASSERT n.uri, n.key, n.alt_labels IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Skills(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Knowledge) ASSERT n.uri, n.key, n.description, n.alt_labels IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Knowledge(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Language) ASSERT n.uri, n.key, n.title, n.description, n.alt_labels IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Language(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Attitudes_Values) ASSERT n.uri, n.key, n.description, n.alt_labels IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Attitudes_Values(uri);")
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Occupation_Group) ASSERT n.uri, n.key, n.alt_labels, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Occupation_Group(uri);")
run_constraint_query("CREATE CONSTRAINT ON (n:ESCO_Occupation) ASSERT n.uri, n.title, n.alt_labels, n.description, n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :ESCO_Occupation(uri);")

run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Category) ASSERT n.title, n.acronym, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Category(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Area) ASSERT n.title, n.acronym, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Area(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Workrole) ASSERT n.key, n.title, n.description IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Workrole(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:OPM_Cybersecurity_Category) ASSERT n.key IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :OPM_Cybersecurity_Category(key);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Knowledge) ASSERT n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Knowledge(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Skills) ASSERT n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Skills(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Abilities) ASSERT n.title IS UNIQUE;")
run_constraint_query("CREATE INDEX ON :NICE_Abilities(title);")
run_constraint_query("CREATE CONSTRAINT ON (n:NICE_Tasks) ASSERT n.title IS UNIQUE;") 
run_constraint_query("CREATE INDEX ON :NICE_Tasks(title);")
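
Because every label above follows the same two-statement pattern, the statements could also be generated from a small table, which makes label typos easier to spot. This is a sketch of that idea, not the notebook's code: `schema_statements` and `SPEC` are hypothetical names, and only three of the labels are shown.

```python
def schema_statements(spec):
    """Build constraint and index statements in the syntax used above.

    `spec` maps a node label to (unique properties, indexed property).
    """
    statements = []
    for label, (unique_props, index_prop) in spec.items():
        props = ", ".join(f"n.{p}" for p in unique_props)
        statements.append(f"CREATE CONSTRAINT ON (n:{label}) ASSERT {props} IS UNIQUE;")
        statements.append(f"CREATE INDEX ON :{label}({index_prop});")
    return statements


# A small excerpt of the labels from the cell above
SPEC = {
    "ONET_Occupation": (["key", "title", "description"], "key"),
    "ONET_Major_Group": (["key"], "key"),
    "NICE_Tasks": (["title"], "title"),
}

for statement in schema_statements(SPEC):
    print(statement)  # in the notebook, this would be run_constraint_query(statement)
```

Each label is written once in `SPEC`, so the constraint and its index cannot drift apart.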


## Upload ONET Data

## Upload ESCO Data
Notes:
concept_Schemes didn't have anything helpful.  
Everything from ictSkills was in transversalSkillsCollection.
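
A claim like the second note can be re-checked after each ESCO release by comparing the identifier columns of the two files. The sketch below assumes both CSVs share an identifier column named `conceptUri` (an assumption to verify against the downloaded files); `missing_from` is a hypothetical helper, and the inline data is made up.

```python
import csv
import io


def missing_from(subset_csv, superset_csv, key="conceptUri"):
    """Return the identifiers present in `subset_csv` but absent from `superset_csv`.

    Both arguments are open file-like objects; `key` is an assumed column name.
    """
    subset = {row[key] for row in csv.DictReader(subset_csv)}
    superset = {row[key] for row in csv.DictReader(superset_csv)}
    return subset - superset


# Made-up rows standing in for the ictSkills and transversalSkills files
ict = io.StringIO("conceptUri,preferredLabel\nu1,a\nu2,b\n")
transversal = io.StringIO("conceptUri,preferredLabel\nu1,a\nu2,b\nu3,c\n")
print(missing_from(ict, transversal))  # an empty set means nothing is missing
```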

In [None]:
#TODO: try relating skills and occupations using a bit of web scraping

# base_url = 'https://ec.europa.eu/esco/portal/occupation?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Foccupation%2F00030d09-2b3a-4efd-87cc-c4ea39d27c34&conceptLanguage=en&full=true#&uri='
# uri_list = run_query("""MATCH (a:ESCO_Occupation) RETURN a.uri AS uri""")['uri']
# for uri in uri_list:
#     try:
#         r = requests.get(base_url+uri)
#     except requests.exceptions.RequestException as e:
#         print('ERROR: could not access uri.')
#         raise SystemExit(e)
#     soup = BeautifulSoup(r.text, 'html.parser')

#     # https://stackoverflow.com/questions/5690686/using-nextsibling-from-beautifulsoup-outputs-nothing explains why you have to double up
#     header_list = ["Essential Knowledge"]
#     # header_list = ["Essential Knowledge", "Essential skills and competences", "Optional skills and competences", "Optional Knowledge"]
#     for h in header_list:
#         # clear uri list and decide on relationship and label type depending on header
#         ksatt_uris = []
#         if 'essential' in h.lower():
#             rel = 'ESSENTIAL_FOR'
#         else:
#             rel = 'OPTIONAL_FOR'
#         if 'knowledge' in h.lower():
#             label = 'ESCO_Knowledge'
#         else:
#             label = 'ESCO_Skills'
#         # look for that header
#         h2 = soup.find("h2", text=h)
#         # if the header actually exists
#         if h2 is not None:
#             ul = h2.nextSibling.nextSibling # find the tag after the header - will be the unordered list keeping all the other ksatt links underneath
#             for li in ul.find_all('li'): # find all the nested links
#                 a = li.find('a') # grab the individual pieces
#                 ksatt_uris.append(a.attrs['href']) # grab the link and add to our list of uris
#             # once all the uris are collected for a header, create relationships of the right type between the occupation and the KSATT with that URI! 
#             for ksatt_uri in ksatt_uris:
#                 run_query("""
#                     MATCH (a:ESCO_Occupation {{uri: '{}'}})
#                     MATCH (b:{} {{uri: '{}'}})
#                     MERGE (a)<-[:{}]-(b)
#                 """.format(uri, label, ksatt_uri, rel))


# # took 40m to get through 827/2942 ESCO_Occupations - essential AND optional knowledge/skills

# # took ___ to get through__________ ESCO_Occupations - essential knowledge only

# #28 per min avg
# # with 2942 = 100 min

## Upload NICE Data

NICE Framework Background  
- NIST (National Institute of Standards and Technology), NICCS (National Initiative for Cybersecurity Careers and Studies), and OPM (Office of Personnel Management) each publish a copy of the NICE (National Initiative for Cybersecurity Education) framework, which was developed by NICE. 
- Straight from OPM: OPM and DHS during the early stages of its collaborative endeavors co-led efforts to identify the cybersecurity workforce. With the direct engagement of over 20 Federal departments and agencies, and numerous public and private organizations, the National Initiative for Cybersecurity Education (NICE) developed the National Cybersecurity Workforce Framework (the Framework) to define cybersecurity work and lay a foundation for cybersecurity workforce efforts. The NICE Framework provides a common language and taxonomy, defines specialty areas and KSAs/competencies, and codifies talent.
- NICCS has NICE embedded in their website https://niccs.cisa.gov/workforce-development/cyber-security-workforce-framework 
- NIST has an excel spreadsheet https://www.nist.gov/itl/applied-cybersecurity/nice/nice-framework-resource-center/workforce-framework-cybersecurity-nice 
- OPM has it in a PDF https://www.opm.gov/policy-data-oversight/classification-qualifications/reference-materials/interpretive-guidance-for-cybersecurity-positions.pdf 

Other Resources  
  - https://csrc.nist.gov/projects/olir/focal-document-templates
  - https://nvlpubs.nist.gov/nistpubs/CSWP/NIST.CSWP.04162018.pdf
  - https://www.nist.gov/cyberframework/framework
  - https://niccs.cisa.gov/workforce-development/cyber-security-workforce-framework
  - Withdrawn: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-181.pdf
  - Superseded by: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-181r1.pdf
  - https://csrc.nist.gov/publications/detail/sp/800-181/rev-1/final
  - https://niccs.cisa.gov/workforce-development/cyber-career-pathways
  - https://niccs.cisa.gov/workforce-development

OPM Cybersecurity Codes are different from regular OPM Series Codes  
  - Grab Reference Spreadsheet https://www.nist.gov/itl/applied-cybersecurity/nice/nice-framework-resource-center/workforce-framework-cybersecurity-nice
  - The Table of Contents has OPM codes, but these are not the same as the regular OPM series codes: they're 3 digits instead of the GS-XXXX structure. Looking into it, OPM actually has separate codes for these cybersecurity roles: https://dw.opm.gov/datastandards/referenceData/2273/current?category=&q=cybersecurity
  - Tried to figure out more about these cybersecurity-specific codes
      - https://www.opm.gov/policy-data-oversight/classification-qualifications/classifying-general-schedule-positions/#url=Standards
      - made sense to find cybersecurity under 2200 IT Group so went digging into that PDF https://www.opm.gov/policy-data-oversight/classification-qualifications/classifying-general-schedule-positions/standards/2200/gs2200a.pdf
      - Searched for 2210 and found the IT Cybersecurity Specialist role, this additional document was linked https://www.opm.gov/policy-data-oversight/classification-qualifications/reference-materials/interpretive-guidance-for-cybersecurity-positions.pdf 
      - In that Interpretive Guidance for Cybersecurity Positions, there was the NICE framework and more explanations about these cybersecurity codes
      - Hard to tell if they are supposed to be under the 2210 focus, or are just completely separate. 
          - Don't think it's just 2210, because a section of the PDF notes that they also overlap with the 0855, 0854, and 0391 series. 

PDW has Cybersecurity codes that match with the ones in the NICE Framework, but...  
  - These codes are actually results of something David did a couple years ago.
  - He compared position descriptions of employees with the descriptions for the cybersecurity codes and recommended the top 3 codes based on the doc2vec results. 
  - Main issues with this now:
      - Is it being maintained/updated live?
      - The position descriptions used at that time were more detailed than the ones we have now - would be comparing different results. 
  - Possible solution might be that we infer the cybersecurity code based on KSATT overlap, rather than code crosswalking. 
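
The KSATT-overlap idea in the last bullet can be sketched with a simple set similarity. This is one possible approach under stated assumptions, not an implemented method: it uses Jaccard similarity as the overlap measure, and the function names, code labels, and KSATT identifiers are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of KSATT identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_codes(position_ksatts, code_ksatts, top_n=3):
    """Rank cybersecurity codes by KSATT overlap with a position, best first."""
    scores = {code: jaccard(position_ksatts, ksatts) for code, ksatts in code_ksatts.items()}
    # Sort by descending score, breaking ties alphabetically by code
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Made-up codes and KSATT identifiers, for illustration only
code_ksatts = {
    "code_A": {"k1", "k2", "s1"},
    "code_B": {"k2", "s2"},
    "code_C": {"t1"},
    "code_D": {"k1", "s1", "a1", "t1"},
}
position = {"k1", "s1", "t1"}
print(recommend_codes(position, code_ksatts))
```

In the graph, the KSATT sets for each code and position would come from relationship queries rather than hard-coded dictionaries.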

## Crosswalk OPM and ONET