# Exploratory Data Analysis

The NSF is organized by directorate as follows:
1. Directorate for Biological Sciences
2. Directorate for Computer & Information Science & Engineering
3. Directorate for Education & Human Resources
4. Directorate for Engineering
5. Directorate for Geosciences
6. Directorate for Mathematical & Physical Sciences
7. Directorate for Social, Behavioral & Economic Sciences

It seems natural to expect these directorates to correspond to topics we can identify from the award abstracts.
Each directorate has multiple divisions.
There are also a number of offices, which we will group together
and treat at the same level as a directorate.

First, let's take a look at our short element data.

In [1]:
import pandas as pd
import re
import os
import json
from itertools import combinations
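# current scikit-learn exposes ENGLISH_STOP_WORDS in sklearn.feature_extraction.text;
# the alias keeps the name `stop_words` used further down
from sklearn.feature_extraction import text as stop_words
# json_normalize is available at the top level since pandas 1.0
from pandas import json_normalize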

In [2]:
# path to short element json file
short_elements_dir = os.path.join(os.pardir,'data', 'interim', 'short_elements.json')   

In [3]:
# decode json file
with open(short_elements_dir, encoding='utf-8') as f:
    d = json.load(f)

In [4]:
# create dataframe based on list of dictionaries
df_short = json_normalize(d, meta=['award_id', 'award_instr'])
# remove nested data (some keys have value that contains list of dict)
df_short = df_short.drop(['Institution', 'Investigator', 'ProgramElement'], axis = 1)

In [40]:
df_short.head()

Unnamed: 0,amount,award_id,award_instr,eff_date,exp_date,nsf_officer,org_code,org_direct,org_div,title
0,125000.0,6000001,Standard Grant,1960-04-15,1960-03-31,,5010000,"[direct, for, computer, info, scie, enginr]",Division of Computing and Communication Founda...,Chemical Education Material Study (G12226)
1,28000.0,6100002,Standard Grant,1961-12-15,1962-12-31,,5020000,"[direct, for, computer, info, scie, enginr]",Div Of Information & Intelligent Systems,Translation and Publication of the 1961 Issues...
2,40160.0,6100003,Standard Grant,1961-12-15,1965-01-31,,5090000,"[direct, for, computer, info, scie, enginr]",Office of Advanced Cyberinfrastructure (OAC),Advanced Science Seminar in Soil Clay Mineralo...
3,,6100004,Standard Grant,1962-02-15,1966-05-31,,5010200,"[direct, for, computer, info, scie, enginr]",Division of Computing and Communication Founda...,Development of Science Teaching Materials For ...
4,1334824.0,6100005,Standard Grant,1962-02-15,1968-09-30,,5010200,"[direct, for, computer, info, scie, enginr]",Division of Computing and Communication Founda...,A Project For the Development of the Education...


When an officer name is not available, we label it as NaN. The placeholder string 'name not available' appears with inconsistent spacing, so whitespace is normalized first.

In [None]:
# split each name into words on any run of whitespace
df_short['nsf_officer'] = df_short['nsf_officer'].str.split()
# rejoin the words with single spaces
df_short['nsf_officer'] = df_short['nsf_officer'].str.join(' ')
# replace the placeholder for a missing officer name with NaN
# (pd.np is no longer available in recent pandas releases)
df_short['nsf_officer'] = df_short['nsf_officer'].replace('name not available', float('nan'))
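
To see why the split/join round trip helps, here is a minimal sketch on a made-up Series. The officer name and the spacing variants are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# toy data: the name and the spacing variants are invented for illustration
officers = pd.Series(['Jane  Doe', ' name not   available ', 'name not available', None])

# split on any run of whitespace, then rejoin with single spaces
normalized = officers.str.split().str.join(' ')

# map the placeholder string to a real missing value
normalized = normalized.replace('name not available', np.nan)

print(normalized.tolist())
```

Both spacing variants of the placeholder collapse to the same string after the round trip, so a single `replace` call is enough to catch them.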

Convert the date strings to datetime objects.

In [24]:
# convert date string to datetime object
df_short.eff_date = pd.to_datetime(df_short.eff_date, format='%m/%d/%Y')
df_short.exp_date = pd.to_datetime(df_short.exp_date, format='%m/%d/%Y')
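
The first row of the preview above has an expiration date earlier than its effective date, so a quick consistency check may be worthwhile once the dates are parsed. The sketch below uses a toy frame; `find_inverted_dates` is a hypothetical helper, and the month/day/year input format is assumed from the conversion cell.

```python
import pandas as pd

def find_inverted_dates(df, start_col='eff_date', end_col='exp_date'):
    """Return the rows whose end date is earlier than their start date."""
    return df[df[end_col] < df[start_col]]

# toy frame; the first row mirrors the date pair shown in the preview
toy = pd.DataFrame({
    'award_id': [6000001, 6100002],
    'eff_date': ['04/15/1960', '12/15/1961'],
    'exp_date': ['03/31/1960', '12/31/1962'],
})
for col in ('eff_date', 'exp_date'):
    # raw format assumed to be month/day/year, as in the cell above
    toy[col] = pd.to_datetime(toy[col], format='%m/%d/%Y')

print(find_inverted_dates(toy))
```

Whether such rows should be dropped, corrected, or kept is a separate decision; the helper only surfaces them.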

Mark missing amounts as NaN.

In [28]:
# an empty string means the amount is missing
df_short['amount'] = df_short['amount'].replace('', float('nan'))

Mark missing directorates as NaN.

In [32]:
# an empty string means the directorate is missing
df_short['org_direct'] = df_short['org_direct'].replace('', float('nan'))

In [33]:
df_short.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438352 entries, 0 to 438351
Data columns (total 10 columns):
amount         438245 non-null float64
award_id       438352 non-null int64
award_instr    438352 non-null object
eff_date       438352 non-null datetime64[ns]
exp_date       438352 non-null datetime64[ns]
nsf_officer    368565 non-null object
org_code       438352 non-null object
org_direct     438222 non-null object
org_div        438352 non-null object
title          438352 non-null object
dtypes: datetime64[ns](2), float64(1), int64(1), object(6)
memory usage: 33.4+ MB


Directorate names appear under different abbreviations. Offices are also listed at the same level as directorates.

In [34]:
df_short.org_direct.value_counts()

Direct For Mathematical & Physical Scien                        90586
Directorate For Engineering                                     70718
Direct For Biological Sciences                                  62424
Directorate For Geosciences                                     61959
Direct For Computer & Info Scie & Enginr                        53079
Direct For Education and Human Resources                        39804
Direct For Social, Behav & Economic Scie                        35298
Office Of The Director                                          23277
Office of Budget, Finance, & Award Management                     318
Office Of Information & Resource Mgmt                             198
Directorate for Engineering                                       100
Directorate for Education & Human Resources                        97
Directorate for Computer & Information Science & Engineering       97
Directorate for Social, Behavioral & Economic Sciences             70
Directorate for Biol

Let's consolidate directorate names.

In [35]:
# make sure all names are lower case for comparison
df_short.org_direct = df_short.org_direct.str.lower()

In [49]:
# keep only word tokens (drops punctuation such as '&' and ',')
df_short.org_direct = df_short.org_direct.str.findall(r'\w+')

In [45]:
# remove stopwords (rows with a missing directorate are skipped)
has_direct = df_short.org_direct.notnull()
df_short.loc[has_direct, 'org_direct'] = df_short.loc[has_direct, 'org_direct'].apply(
    lambda x: [word for word in x if word not in stop_words.ENGLISH_STOP_WORDS])

In [52]:
# recombine text
df_short.org_direct = df_short.org_direct.str.join(' ')
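
The four cells above (lower-case, tokenize, drop stop words, rejoin) can be sketched as a single function applied to a toy Series. `normalize_name` is a hypothetical helper, and `STOP_WORDS` is a small stand-in for scikit-learn's English stop-word list that contains only the words needed here.

```python
import re

import pandas as pd

# small stand-in for scikit-learn's English stop-word list (assumption:
# only the words needed for this example are included)
STOP_WORDS = {'for', 'of', 'the', 'and'}

names = pd.Series(['Direct For Computer & Info Scie & Enginr',
                   'Directorate for Engineering',
                   None])

def normalize_name(name):
    """Lower-case, keep word tokens, drop stop words, rejoin with spaces."""
    if not isinstance(name, str):
        return name  # leave missing values untouched
    tokens = re.findall(r'\w+', name.lower())
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

cleaned = names.map(normalize_name)
print(cleaned.tolist())
```

After this step the two spellings of a directorate differ only in how individual words are abbreviated, which is what the matching functions below rely on.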

In [54]:
# count the occurrences of each distinct directorate name
s_direct_names = df_short.org_direct.value_counts()

In [55]:
# derive word count in each directorate name
direct_names_len = {direct: len(direct.split()) for direct in s_direct_names.index}

In [57]:
def are_letters_common(abbr, full_word):
    """
    Return True if every letter of the abbreviation abbr also occurs in full_word.

    Letter order and repetition are not checked, so this is a loose test.
    """
    return all(letter in full_word for letter in abbr)
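
The letter test above ignores letter order, so it can accept strings that are not plausible abbreviations. The sketch below contrasts it with a stricter subsequence test; `is_subsequence` is a hypothetical alternative, not part of the original notebook.

```python
def are_letters_common(abbr, full_word):
    """Loose check: every letter of abbr occurs somewhere in full_word."""
    return all(letter in full_word for letter in abbr)

def is_subsequence(abbr, full_word):
    """Stricter check (hypothetical alternative): letters must appear in order."""
    remaining = iter(full_word)
    # `letter in remaining` consumes the iterator up to the first match,
    # so later letters can only match later positions
    return all(letter in remaining for letter in abbr)

for abbr in ('enginr', 'gnie'):
    print(abbr,
          are_letters_common(abbr, 'engineering'),
          is_subsequence(abbr, 'engineering'))
```

For the handful of directorate names in this dataset the loose test appears sufficient, judging by the pairs it produces, but the stricter variant may be safer on a larger vocabulary.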

In [58]:
def find_abbreviations(valcount_dict):
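    """
    Pair abbreviated names with their non-abbreviated counterparts.

    valcount_dict maps each name to its word count; only names with the
    same word count are compared.
    """
    # make two lists: abbreviation list and replacement list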
    abbreviation = []
    replacement = []
    
    # group dict by value
    for w_cnt in range(min(valcount_dict.values()), max(valcount_dict.values())+1):
        
        # make a list of keys which have the same count
        list4pairs = [kp.split() for kp,vp in valcount_dict.items() if vp == w_cnt]
        
        # make a list of pair combinations
        pairs = list(combinations(list4pairs , 2))
        
        # compare pairs
        for p in pairs:
            abbre_list =[]
            repl_list = []
            
            # compare word by word
            for w in range(w_cnt):
                
                # get abbreviated word and longer word (full)
                if len(p[0][w]) >=  len(p[1][w]):
                    abbre_word = p[1][w]
                    full_word = p[0][w]
                else:
                    abbre_word = p[0][w]
                    full_word = p[1][w]
                
                # do they have the same root?
                # test if all letters in abbre_word are in full_word
                if are_letters_common(abbre_word,full_word):
                    abbre_list.append(abbre_word)
                    repl_list.append(full_word)
                else:
                    # root is different, move on to the next pair
                    break
            else:
                # loop finished without a break: every word matched,
                # so concatenate the word lists
                abbreviation.append(' '.join(abbre_list))
                replacement.append(' '.join(repl_list))
                
    # return the two lists
    return abbreviation,replacement

In [59]:
# match each abbreviated directorate name to its non-abbreviated form
abbreviation, replacement = find_abbreviations(direct_names_len)

In [67]:
for a, r in zip(abbreviation, replacement):
    print('{}  --->  {}'.format(a, r))

direct biological sciences  --->  directorate biological sciences
direct mathematical physical scien  --->  directorate mathematical physical sciences
office information resource mgmt  --->  office information resource management
direct education human resources  --->  directorate education human resources
direct social behav economic scie  --->  directorate social behavioral economic sciences
direct computer info scie enginr  --->  directorate computer information science engineering
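
The pairing logic can also be written more compactly. The following is a condensed sketch of the same idea, not the notebook's function: `pair_abbreviations` and `letters_common` are hypothetical names, and ties in word length are broken differently from `find_abbreviations`.

```python
from itertools import combinations

def letters_common(abbr, full_word):
    """Loose check: every letter of abbr occurs somewhere in full_word."""
    return all(letter in full_word for letter in abbr)

def pair_abbreviations(names):
    """Pair names of equal word count whose words match letter-wise."""
    pairs = {}
    for a, b in combinations([n.split() for n in names], 2):
        if len(a) != len(b):
            continue  # only compare names with the same number of words
        # per position, treat the shorter word as the abbreviation
        ordered = [sorted((x, y), key=len) for x, y in zip(a, b)]
        if all(letters_common(short, full) for short, full in ordered):
            abbr = ' '.join(short for short, _ in ordered)
            pairs[abbr] = ' '.join(full for _, full in ordered)
    return pairs

names = ['direct biological sciences',
         'directorate biological sciences',
         'directorate geosciences',
         'directorate engineering']
print(pair_abbreviations(names))
```

Returning a dict instead of two parallel lists keeps each abbreviation next to its replacement, at the cost of silently overwriting an abbreviation that matches more than one full name.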


In [68]:
# replace each abbreviation by its full name
# (the `method` argument is dropped: it has no effect when `value` is given)
df_short['org_direct'] = df_short['org_direct'].replace(to_replace=abbreviation, value=replacement)
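
As an illustration of the replacement step, the sketch below applies two of the abbreviation pairs printed above to a toy Series. Building a dict from the two lists is an optional variation, assumed here only because it makes each pair explicit.

```python
import pandas as pd

abbreviation = ['direct biological sciences', 'office information resource mgmt']
replacement = ['directorate biological sciences', 'office information resource management']

names = pd.Series(['direct biological sciences',
                   'directorate biological sciences',
                   'office information resource mgmt',
                   'directorate geosciences'])

# a dict makes each abbreviation -> full-name pair explicit
mapping = dict(zip(abbreviation, replacement))
consolidated = names.replace(mapping)

print(consolidated.value_counts())
```

Only whole-string matches are replaced, so names that are already in full form pass through unchanged.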

In [69]:
# remove rows where the directorate is missing
df_short.dropna(subset=['org_direct'], inplace=True)

In [71]:
# remove all rows that belong to an office rather than a directorate
df_short = df_short[ ~df_short.org_direct.str.contains('office', case=False) ]

In [73]:
df_short.org_direct.value_counts()

directorate mathematical physical sciences              90613
directorate engineering                                 70818
directorate biological sciences                         62490
directorate geosciences                                 62011
directorate computer information science engineering    53176
directorate education human resources                   39901
directorate social behavioral economic sciences         35368
Name: org_direct, dtype: int64

In [None]:
# dataframe to merge with abstract (award id and consolidated directorate name)
df_short.loc[:, ['award_id', 'org_direct']]
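
The merge itself is not shown in this section. A minimal sketch, assuming the abstracts live in a frame keyed by the same `award_id` column, might look like this; `df_abstract`, its `abstract` column, and the toy values are hypothetical.

```python
import pandas as pd

# toy stand-ins; `df_abstract` and its `abstract` column are hypothetical
df_labels = pd.DataFrame({
    'award_id': [6000001, 6100002, 6100003],
    'org_direct': ['directorate engineering',
                   'directorate geosciences',
                   'directorate biological sciences'],
})
df_abstract = pd.DataFrame({
    'award_id': [6100002, 6100003, 6100999],
    'abstract': ['text a', 'text b', 'text c'],
})

# inner join keeps only awards that have both a label and an abstract;
# validate raises if award_id is duplicated on either side
df_labeled = df_labels.merge(df_abstract, on='award_id', how='inner',
                             validate='one_to_one')
print(df_labeled)
```

An inner join is assumed because an abstract without a directorate label, or a label without an abstract, would not be usable for checking topics against directorates.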

In [None]:
# retrieve institution information
df_institution = json_normalize(d, record_path='Institution', meta=['award_id'])

In [None]:
df_institution.head()
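
For readers unfamiliar with `record_path`, here is a small sketch of how it flattens a nested list of dicts. The records and the keys inside `Institution` are invented for illustration; the real file may use different field names.

```python
import pandas as pd

# toy records; the keys inside 'Institution' are invented for illustration
records = [
    {'award_id': 1, 'title': 'A',
     'Institution': [{'Name': 'Univ X', 'StateCode': 'CA'}]},
    {'award_id': 2, 'title': 'B',
     'Institution': [{'Name': 'Univ Y', 'StateCode': 'NY'},
                     {'Name': 'Univ Z', 'StateCode': 'TX'}]},
]

# one output row per nested dict, with award_id carried along as metadata
df_inst = pd.json_normalize(records, record_path='Institution', meta=['award_id'])
print(df_inst)
```

An award with two institutions yields two rows sharing the same `award_id`, so the result can be joined back to the award-level frame on that column.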

In [None]:
# df_short = pd.read_json(short_elements_dir, orient='values')

In [None]:
# with open(short_elements_dir, encoding='utf-8') as f:
#     d = json.load(f)

In [None]:
# json_normalize(d, record_path='org_direct')