## NSF Data Cleaning

We use grants from the National Science Foundation (NSF), downloaded on 07/04/21 from https://www.nsf.gov/awardsearch/download.jsp. We keep the years 2000-2020 so that the coverage matches the rest of our data.

**Wikipedia**: The National Science Foundation (NSF) is an independent agency of the United States government that supports fundamental research and education in all the non-medical fields of science and engineering. Its medical counterpart is the National Institutes of Health. With an annual budget of about US$8.3 billion (fiscal year 2020), the **NSF funds approximately 25% of all federally supported basic research conducted by the United States' colleges and universities**. In some fields, such as **mathematics, computer science, economics, and the social sciences**, the NSF is the major source of federal backing.



Example grant

```
<?xml version="1.0" encoding="UTF-8"?>
<rootTag>
    <Award>
        <AwardTitle>Design of Cutting Tools for High Speed Milling</AwardTitle>
        <AwardEffectiveDate>06/15/2000</AwardEffectiveDate>
        <AwardExpirationDate>05/31/2004</AwardExpirationDate>
        <AwardTotalIntnAmount>280000.00</AwardTotalIntnAmount>
        <AwardAmount>280000</AwardAmount>
        <AwardInstrument>
        <Value>Continuing Grant</Value>
        </AwardInstrument>
        <Organization>
        <Code>07030000</Code>
        <Directorate>
        <Abbreviation>ENG</Abbreviation>
        <LongName>Directorate For Engineering</LongName>
        </Directorate>
        <Division>
        <Abbreviation>CMMI</Abbreviation>
        <LongName>Div Of Civil, Mechanical, &amp; Manufact Inn</LongName>
        </Division>
        </Organization>
        <ProgramOfficer>
        <SignBlockName>george hazelrigg</SignBlockName>
        </ProgramOfficer>
        <AbstractNarration>This project will focus on development of new cutting tool designs to allow increases in high speed machining (HSM) productivity.  The researchers will investigate a new method of increasing the damping in the tool body through creation of internal features, which will dissipate energy through friction during tool vibration.  These features will be designed to utilize the high centripetal accelerations experienced during the high spindle speeds used in HSM to dramatically increase the resulting energy dissipation.  This "centrifugal damping" will provide significant increases in the dynamic stiffness of the tool, and result in direct productivity improvements in HSM.  The researchers will also investigate the form and placement of the cutting edges on the tool body.  Previous work in this area has shown that unequal spacing of the cutting edges can result in enhanced stability and improved productivity.   This investigation will expand on this research to include analysis of undulating cutting edges, and experimentally verify the results.  &lt;br/&gt;&lt;br/&gt;The final result of this research will be an increased understanding of the role of the cutting tool in high-speed milling.  We will formulate optimal design rules and experimentally demonstrate the productivity increases achievable by intelligent design of milling cutters for HSM.  The research team expects these results to gain rapid commercial acceptance by HSM users. &lt;br/&gt;&lt;br/&gt;&lt;br/&gt;</AbstractNarration>
        <MinAmdLetterDate>06/23/2000</MinAmdLetterDate>
        <MaxAmdLetterDate>04/16/2002</MaxAmdLetterDate>
        <ARRAAmount/>
        <AwardID>0000009</AwardID>
        <Investigator>
            <FirstName>John</FirstName>
            <LastName>Ziegert</LastName>
            <EmailAddress>ziegert@clemson.edu</EmailAddress>
            <StartDate>06/23/2000</StartDate>
            <EndDate/>
            <RoleCode>Principal Investigator</RoleCode>
        </Investigator>
        <Investigator>
            <FirstName>Jiri</FirstName>
            <LastName>Tlusty</LastName>
            <EmailAddress>jtlusty@ufl.edu</EmailAddress>
            <StartDate>06/23/2000</StartDate>
            <EndDate/>
            <RoleCode>Co-Principal Investigator</RoleCode>
        </Investigator>
        <Investigator>
            <FirstName>Tony</FirstName>
            <LastName>Schmitz</LastName>
            <EmailAddress>tony.schmitz@utk.edu</EmailAddress>
            <StartDate>06/23/2000</StartDate>
            <EndDate/>
            <RoleCode>Co-Principal Investigator</RoleCode>
        </Investigator>
        <Institution>
            <Name>University of Florida</Name>
            <CityName>GAINESVILLE</CityName>
            <ZipCode>326112002</ZipCode>
            <PhoneNumber>3523923516</PhoneNumber>
            <StreetAddress>1 UNIVERSITY OF FLORIDA</StreetAddress>
            <CountryName>United States</CountryName>
            <StateName>Florida</StateName>
            <StateCode>FL</StateCode>
        </Institution>
        <FoaInformation>
            <Code>0308000</Code>
            <Name>Industrial Technology</Name>
        </FoaInformation>
        <ProgramElement>
            <Code>1468</Code>
            <Text>Manufacturing Machines &amp; Equip</Text>
        </ProgramElement>
        <ProgramReference>
            <Code>9146</Code>
            <Text>MANUFACTURING BASE RESEARCH</Text>
        </ProgramReference>
        <ProgramReference>
            <Code>MANU</Code>
            <Text>MANUFACTURING</Text>
        </ProgramReference>
    </Award>
</rootTag>
```

This award can be found at https://www.nsf.gov/awardsearch/showAward?AWD_ID=0000009

Important information

- AwardTitle
- AwardEffectiveDate
- AwardAmount
- AbstractNarration


I would like to save each grant as one line in our usual format:

`award amount|cleaned text`
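
As a minimal sketch of this line format, the two helpers below write and read one record. The names `format_record` and `parse_record` are hypothetical and not part of the notebook, and the sketch assumes that the amount is a whole number and that the cleaned text never contains `|` or a newline.

```python
def format_record(amount, cleaned_text):
    """Join an award amount and its cleaned text into one 'amount|text' line."""
    # Assumption: cleaned_text contains neither '|' nor a newline.
    return f"{amount}|{cleaned_text}"


def parse_record(line):
    """Split an 'amount|text' line back into (int amount, text)."""
    # partition splits on the first '|' only, so the text part is kept whole
    amount, _, text = line.rstrip("\n").partition("|")
    return int(amount), text


line = format_record("280000", "design cutting tool high speed milling")
print(line)
print(parse_record(line))
```

Splitting on the first `|` only means the amount can always be recovered, even if a stray `|` slips into the text.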

In terms of filtering, it seems to me that the organisation codes might be the key.
```
<Organization>
<Code>05010000</Code>
<Directorate>
<Abbreviation>CSE</Abbreviation>
<LongName>Direct For Computer &amp; Info Scie &amp; Enginr</LongName>
</Directorate>
```
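
To illustrate how the organisation code and directorate name could be read for filtering, here is a small sketch using `xml.etree.ElementTree`. The `SAMPLE` string is a cut-down award built from the fragment above, and `organisation_info` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import xml.etree.ElementTree as ET

# Cut-down award containing only the Organization block (illustrative input)
SAMPLE = """<rootTag>
  <Award>
    <Organization>
      <Code>05010000</Code>
      <Directorate>
        <Abbreviation>CSE</Abbreviation>
        <LongName>Direct For Computer &amp; Info Scie &amp; Enginr</LongName>
      </Directorate>
    </Organization>
  </Award>
</rootTag>"""


def organisation_info(root):
    """Return (code, directorate long name) for the first Organization of an award."""
    org = root.find("Award/Organization")
    if org is None:
        return None, None
    # findtext returns None when the child element is absent
    return org.findtext("Code"), org.findtext("Directorate/LongName")


code, longname = organisation_info(ET.fromstring(SAMPLE))
print(code, longname)
```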

In [1]:
import xml.etree.ElementTree as ET
import cleaning3
import json
from collections import defaultdict
import numpy as np
import boto3
import re
import unicodedata
import io
import os

from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
import time
import matplotlib.pyplot as plt

from xml.dom import minidom
import sys
sys.path.append("../../tools")
import my_stopwords

In [2]:
# Create the lemmatiser
wnl = WordNetLemmatizer()

# Create a tokeniser
count = CountVectorizer(strip_accents='ascii', min_df=1)
tokeniser = count.build_analyzer()

stopwords = my_stopwords.get_stopwords()

In [93]:
def clean_string(xml_string):
    # Escape stray "&" characters so that the XML parser accepts the file.
    # The negative lookahead leaves any "&" that already starts "&amp;" alone.
    # Note: other entities such as "&lt;" are escaped again and survive as literal text.
    xml_string = re.sub(r'&(?!amp;)', '&amp;', xml_string)

    # Append the closing tags when they are missing from the file
    if '</Award>' not in xml_string:
        xml_string = xml_string + '</Award>\n</rootTag>\n'
    elif '</rootTag>' not in xml_string:
        xml_string = xml_string + '</rootTag>\n'
    return xml_string
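
The lookahead in `clean_string` only spares `&amp;`, so an already-escaped `&lt;br/&gt;` (as in the example abstract) becomes `&amp;lt;br/&amp;gt;` and is parsed as the literal text `&lt;br/&gt;`. The sketch below shows one possible stricter variant that also leaves the other predefined and numeric entities alone. `BARE_AMPERSAND` and `escape_bare_ampersands` are hypothetical names, and whether this behaviour is preferable depends on what `cleaning3.clean` expects, which is not shown here.

```python
import re
import xml.etree.ElementTree as ET

# Match '&' unless it already starts a predefined or numeric XML entity
BARE_AMPERSAND = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_string):
    """Escape stray '&' characters while leaving existing entities untouched."""
    return BARE_AMPERSAND.sub("&amp;", xml_string)


sample = "<t>R&D &amp; more &lt;br/&gt;</t>"
fixed = escape_bare_ampersands(sample)
print(fixed)                       # only the '&' in 'R&D' is changed
print(ET.fromstring(fixed).text)   # entities are decoded by the parser
```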

### For each grant in 2020, harvest the organisation codes and directorate long names

Also check how many grants are missing this information.

In [84]:
with open('../../Data/nsf/2020/2000005.xml', encoding='utf-8') as f:
    xml = clean_string(f.read())
root = ET.fromstring(xml)

In [13]:
organizations = root.find('Award').findall('Organization')
for org in organizations:
    code = str(org.find('Code').text)
    longname = org.find('Directorate').find('LongName').text

In [39]:
document_count = 0
multiple_orgs = 0
no_orgs = 0

organization_dict = defaultdict(set)

for file in os.listdir('../../Data/nsf/2020/'):
    document_count+=1
    
    with open('../../Data/nsf/2020/'+file, 'r', encoding='utf-8') as f:
        xml = clean_string(f.read())
    root = ET.fromstring(xml)
    
    
    organizations = root.find('Award').findall('Organization')
    if len(organizations) == 0:
        no_orgs += 1
    elif len(organizations) > 1:
        multiple_orgs += 1
    
    for org in organizations:
        code = str(org.find('Code').text)
        longname = org.find('Directorate').find('LongName').text
        # Record inside the loop, so a grant without organisations adds nothing stale
        organization_dict[code].add(longname)
    
print('Found', document_count, 'documents')
print('Of these', no_orgs, 'had no organizations recorded')
print('And', multiple_orgs, 'had multiple organizations recorded')

Found 11301 documents
Of these 0 had no organizations recorded
And 0 had multiple organizations recorded


In [41]:
for key in sorted(organization_dict.keys()):
    print(key, organization_dict[key])

00020000 {None}
00020200 {None}
01010000 {'Office Of The Director'}
01060000 {'Office Of The Director'}
01060100 {'Office Of The Director'}
01060200 {'Office Of The Director'}
01060400 {'Office Of The Director'}
01060500 {'Office Of The Director'}
01070001 {'Office Of The Director'}
01090000 {'Office Of The Director'}
02000000 {'Office Of Information & Resource Mgmt'}
02040003 {'Office Of Information & Resource Mgmt'}
02060000 {'Office Of Information & Resource Mgmt'}
03010000 {'Direct For Mathematical & Physical Scien'}
03020000 {'Direct For Mathematical & Physical Scien'}
03040000 {'Direct For Mathematical & Physical Scien'}
03060000 {'Direct For Mathematical & Physical Scien'}
03070000 {'Direct For Mathematical & Physical Scien'}
03090000 {'Direct For Mathematical & Physical Scien'}
04010000 {'Direct For Social, Behav & Economic Scie'}
04030000 {'Direct For Social, Behav & Economic Scie'}
04040000 {'Direct For Social, Behav & Economic Scie'}
04050000 {'Direct For Social, Behav & Eco

### So what does this tell us?

- Every document has exactly one organisation
- The organization names are consistent (at least in 2020!)
- We can probably organize our documents by organization name

These are the organizations
- Office Of Information & Resource Mgmt
- Direct For Mathematical & Physical Scien
- Direct For Social, Behav & Economic Scie
- Direct For Computer & Info Scie & Enginr
- Directorate For Geosciences
- Directorate For Engineering
- Direct For Biological Sciences
- Direct For Education and Human Resources
- Office of Budget, Finance, & Award Management
- Office Of The Director
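
A list like the one above could also be derived from `organization_dict` rather than copied from the printed output. The sketch below assumes the dictionary maps each code to a set of long names, possibly containing `None`, as in the output shown earlier. `directorate_names_from` is a hypothetical helper and `example` is a toy stand-in with a few values taken from the 2020 output.

```python
def directorate_names_from(organization_dict):
    """Collect the distinct, non-empty long names from a code -> names mapping."""
    names = set()
    for longnames in organization_dict.values():
        # Skip None, which appears for codes without a directorate name
        names.update(name for name in longnames if name)
    return sorted(names)


# Toy stand-in for the dictionary built above
example = {
    "00020000": {None},
    "01010000": {"Office Of The Director"},
    "03010000": {"Direct For Mathematical & Physical Scien"},
    "03020000": {"Direct For Mathematical & Physical Scien"},
}
print(directorate_names_from(example))
```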

In [43]:
document_count = 0
multiple_orgs = 0
no_orgs = 0

organization_dict = defaultdict(set)

for file in os.listdir('../../Data/nsf/2001/'):
    document_count+=1
    
    with open('../../Data/nsf/2001/'+file, 'r', encoding='utf-8') as f:
        xml = clean_string(f.read())
    root = ET.fromstring(xml)
    
    
    organizations = root.find('Award').findall('Organization')
    if len(organizations) == 0:
        no_orgs += 1
    elif len(organizations) > 1:
        multiple_orgs += 1
    
    for org in organizations:
        code = str(org.find('Code').text)
        longname = org.find('Directorate').find('LongName').text
        # Record inside the loop, so a grant without organisations adds nothing stale
        organization_dict[code].add(longname)
    
print('Found', document_count, 'documents')
print('Of these', no_orgs, 'had no organizations recorded')
print('And', multiple_orgs, 'had multiple organizations recorded')

Found 9847 documents
Of these 0 had no organizations recorded
And 0 had multiple organizations recorded


In [44]:
for key in sorted(organization_dict.keys()):
    print(key, organization_dict[key])

01000000 {'Office Of The Director'}
01060000 {'Office Of The Director'}
01090000 {'Office Of The Director'}
01120000 {'Office Of The Director'}
02040003 {'Office Of Information & Resource Mgmt'}
02040300 {'Office Of Information & Resource Mgmt'}
02090003 {'Office Of Information & Resource Mgmt'}
03000000 {'Direct For Mathematical & Physical Scien'}
03010000 {'Direct For Mathematical & Physical Scien'}
03010100 {'Direct For Mathematical & Physical Scien'}
03010200 {'Direct For Mathematical & Physical Scien'}
03010500 {'Direct For Mathematical & Physical Scien'}
03010600 {'Direct For Mathematical & Physical Scien'}
03010700 {'Direct For Mathematical & Physical Scien'}
03010800 {'Direct For Mathematical & Physical Scien'}
03020000 {'Direct For Mathematical & Physical Scien'}
03020400 {'Direct For Mathematical & Physical Scien'}
03020415 {'Direct For Mathematical & Physical Scien'}
03020417 {'Direct For Mathematical & Physical Scien'}
03020419 {'Direct For Mathematical & Physical Scien'}
0

In [69]:
raw_names = ['Office Of Information & Resource Mgmt',
'Direct For Mathematical & Physical Scien',
'Direct For Social, Behav & Economic Scie',
'Direct For Computer & Info Scie & Enginr',
'Directorate For Geosciences',
'Directorate For Engineering',
'Direct For Biological Sciences',
'Direct For Education and Human Resources',
'Office of Budget, Finance, & Award Management',
'Office Of The Director']

directorate_names = []
for org in raw_names:
    org = org.replace(',', '')
    org = org.replace(' ', '_')
    directorate_names.append(org)
print(directorate_names)

['Office_Of_Information_&_Resource_Mgmt', 'Direct_For_Mathematical_&_Physical_Scien', 'Direct_For_Social_Behav_&_Economic_Scie', 'Direct_For_Computer_&_Info_Scie_&_Enginr', 'Directorate_For_Geosciences', 'Directorate_For_Engineering', 'Direct_For_Biological_Sciences', 'Direct_For_Education_and_Human_Resources', 'Office_of_Budget_Finance_&_Award_Management', 'Office_Of_The_Director']


In [97]:
def make_document(root):
    directorate_name = root.find('Award').find('Organization').find('Directorate').find('LongName').text
    
    if directorate_name is None:
        directorate_name = ''
        
    directorate_name = directorate_name.replace(',', '')
    directorate_name = directorate_name.replace(' ', '_')

    amount = root.find('Award').find('AwardAmount').text

    if amount is None:
        amount = '0'

    title = root.find('Award').find('AwardTitle').text

    if title is None:
        title = ''

    abstract = root.find('Award').find('AbstractNarration').text

    if abstract is None:
        abstract = ''

    # Keep only the year of the MM/DD/YYYY effective date
    date = root.find('Award').find('AwardEffectiveDate').text.split('/')[2]
    
    text = title + ' ' + abstract
    cleaned_text = cleaning3.clean(text, wnl, tokeniser)
    
    return date, amount, directorate_name, cleaned_text
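
The repeated `is None` checks could be shortened with `Element.findtext`, which returns the given default when an element is absent and an empty string when the element exists but has no text. The sketch below covers only the raw field extraction and leaves out the `cleaning3.clean` step. `raw_fields` and `SAMPLE` are hypothetical names used for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down award with an empty abstract (illustrative input)
SAMPLE = """<rootTag><Award>
<AwardTitle>Design of Cutting Tools for High Speed Milling</AwardTitle>
<AwardEffectiveDate>06/15/2000</AwardEffectiveDate>
<AwardAmount>280000</AwardAmount>
<AbstractNarration/>
</Award></rootTag>"""


def raw_fields(root):
    """Return (year, amount, title, abstract) as strings, with safe defaults."""
    award = root.find("Award")
    title = award.findtext("AwardTitle", default="")
    abstract = award.findtext("AbstractNarration", default="")
    # An empty or missing amount falls back to '0', as in make_document
    amount = award.findtext("AwardAmount", default="") or "0"
    # Last component of MM/DD/YYYY; empty string if the date is missing
    year = award.findtext("AwardEffectiveDate", default="").split("/")[-1]
    return year, amount, title, abstract


print(raw_fields(ET.fromstring(SAMPLE)))
```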

In [105]:
for year in range(2000,2021):
    documents = {}
    for directorate in directorate_names:
        documents[directorate] = defaultdict(list)
        
    for file in os.listdir('../../Data/nsf/'+str(year)+'/'): 

        # Read each grant
        with open('../../Data/nsf/'+str(year)+'/'+file, 'r', encoding='utf-8') as f:
            xml = clean_string(f.read())
        root = ET.fromstring(xml)

        date, amount, directorate_name, cleaned_text = make_document(root)

        if len(cleaned_text) > 50 and directorate_name in directorate_names:
            documents[directorate_name][date].append(amount+'|'+cleaned_text)

    # Append mode: clear nsf_cleaned before rerunning, or lines will be duplicated
    for directorate in documents.keys():
        for date in documents[directorate].keys():
            with open('../../Data/nsf_cleaned/'+directorate+'/'+date+'.txt', "a") as f:
                for document in documents[directorate][date]:
                    f.write(document+'\n')
    
    print(year)

1999
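
The writing step above assumes that a folder for every directorate already exists under `nsf_cleaned`. A possible sketch that creates the folders on demand is given below. `append_documents` is a hypothetical helper, and the example writes to a temporary directory rather than the real data path. Append mode is kept, as in the loop above, so rerunning still adds duplicate lines unless the output folder is cleared first.

```python
import os
import tempfile
from collections import defaultdict


def append_documents(base_dir, documents):
    """Append {directorate: {year: [lines]}} to base_dir/<directorate>/<year>.txt."""
    for directorate, by_year in documents.items():
        out_dir = os.path.join(base_dir, directorate)
        os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
        for year, lines in by_year.items():
            path = os.path.join(out_dir, year + ".txt")
            with open(path, "a", encoding="utf-8") as f:
                f.writelines(line + "\n" for line in lines)


documents = {"Directorate_For_Engineering": defaultdict(list)}
documents["Directorate_For_Engineering"]["2000"].append("280000|design cutting tool")

with tempfile.TemporaryDirectory() as tmp:
    append_documents(tmp, documents)
    out_path = os.path.join(tmp, "Directorate_For_Engineering", "2000.txt")
    with open(out_path, encoding="utf-8") as f:
        print(f.read())
```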
