## Midway Project Check - Chasz Griego

This is a summary of my recent progress on my project. As a reminder, I am analyzing data about NSF GRFP recipients from the past ten years. The raw data can be obtained from the following source:

https://www.research.gov/grfp/AwardeeList.do?method=loadAwardeeList

In my project proposal, I suggested four tasks that I will complete (estimated completion time included):

- Build a Pandas DataFrame that includes GRFP recipient data from the past 10 years (**1 hour**)
- Clean up the text data so that it is organized and easily readable for a machine learning algorithm (**3 hours**)
- Construct feature vectors from this data and train and validate a simple machine learning algorithm (**2 hours**)
- Make interactive visualizations that compare the numbers of recipients at different institutions and fields of study (**2 hours**) 

In this notebook, I will address the status of each task, including the code written to complete them.

#### Building DataFrame

I extracted CSV files, separated by year, that contain information on recipients who won the award or received an honorable mention. To run the first few cells, unzip the file `nsf_grfp_recipients_2011_2020_separated.zip` and store its contents in the same directory as this notebook.

In [41]:
import pandas as pd
import numpy as np

Below is a function that reads the CSV files. I used a consistent naming scheme so that every file can be read in a loop. Each file is read into a pandas DataFrame, a second DataFrame is built from the pertinent columns, and columns are added to record the year and the recipient's status (award or honorable mention). The names of the recipients aren't included, so that the dataset is anonymous. Each DataFrame is stored in a list, and the list is concatenated into a single DataFrame.

In [42]:
def read_data(years, status):
    
    dflist = []
    for year in years:
        df1 = pd.read_csv(f'{year}{status}.csv',
                         index_col=None,
                         header=0,
                         encoding='latin-1')
        
        df2 = pd.DataFrame({'Baccalaureate Institution':df1['Baccalaureate Institution'],
                            'Current Institution':df1['Current Institution'],
                            'Field of Study':df1['Field of Study'],
                            'Year':[year]*len(df1),
                            'Status':[status]*len(df1)})

        dflist.append(df2)

    df_all = pd.concat(dflist, axis=0, ignore_index=True)
    
    return df_all

Next, a list of years is made to select the files.

In [43]:
years = [str(y) for y in range(2011, 2021)]

Below, the `read_data` function is called to make a DataFrame of award winners.

In [44]:
awards = read_data(years, 'award')

In [45]:
awards.head()

| | Baccalaureate Institution | Current Institution | Field of Study | Year | Status |
|---|---|---|---|---|---|
| 0 | University of Minnesota-Twin Cities | University of Minnesota-Twin Cities | Chemistry - Analytical | 2011 | award |
| 1 | University of Southern Mississippi | University of Southern Mississippi | Chemistry - Polymer | 2011 | award |
| 2 | University of New Mexico | University of Washington | Life Sciences - Molecular Biology | 2011 | award |
| 3 | University of Texas at Austin | UNIVERSITY OF CALIFORNIA BERKELEY | Life Sciences - Developmental Biology | 2011 | award |
| 4 | Vanderbilt University | Vanderbilt University | Life Sciences - Biophysics | 2011 | award |


Here, we can see that this DataFrame contains five columns: the recipient's baccalaureate institution, the current institution where they are pursuing a graduate degree, their field of study, the year of the award, and the award status.

This is also done for honorable mentions:

In [46]:
mentions = read_data(years, 'mention')

In [47]:
mentions.head()

| | Baccalaureate Institution | Current Institution | Field of Study | Year | Status |
|---|---|---|---|---|---|
| 0 | Oberlin College | Florida International University | Life Sciences - Ecology | 2011 | mention |
| 1 | New College of Florida | UNIVERSITY OF CALIFORNIA BERKELEY | Psychology - Computational Psychology | 2011 | mention |
| 2 | University of Massachusetts Amherst | University of Massachusetts Amherst | Life Sciences - Developmental Biology | 2011 | mention |
| 3 | Indiana University-Purdue University at Indian... | Indiana University | Psychology - Cognitive Neuroscience | 2011 | mention |
| 4 | Brandeis University | | Life Sciences - Ecology | 2011 | mention |


Next, I combine the two DataFrames by stacking their rows:

In [48]:
nsf = pd.concat([awards, mentions], axis=0, ignore_index=True)

Next, I check whether any columns contain NaNs. It looks like only the `Current Institution` column does.

In [49]:
nsf.isna().any()

Baccalaureate Institution    False
Current Institution           True
Field of Study               False
Year                         False
Status                       False
dtype: bool

To make the data more interpretable, all the NaNs are filled with the word "Undecided". Here, we're assuming that the applicants left this field blank because they were undergraduates who didn't yet know where they would be going for grad school.

In [50]:
nsf = nsf.fillna({'Current Institution':'Undecided'})
nsf.isna().any()

Baccalaureate Institution    False
Current Institution          False
Field of Study               False
Year                         False
Status                       False
dtype: bool

Next, I decided to write all of the data to a master CSV, so that it's easier to access in future notebooks. 

In [51]:
nsf.to_csv(f'nsf_grfp_recipients_{years[0]}_{years[-1]}.csv',index=False)

This portion of the project was fairly straightforward and didn't take too long to complete.

#### Cleaning text data

I'm currently working on this task. I think I'm off to a good start, but there are still more things to figure out. As a reminder, I'm trying to clean up the data about institutions, which was entered by users, so there are inconsistencies in the ways that institution names are recorded. Below, I'm working through ways to process the data so that it's more normalized and recipients who belong to the same institution can be grouped together.

First, I check whether inconsistent capitalization is creating duplicate entries, using the method `.str.lower()`. Below, I compare how many unique baccalaureate institutions there are before and after making all words lowercase.

In [52]:
print('Unique entries with no change:',len(nsf['Baccalaureate Institution'].unique()))
print('Unique entries with all lower case:',len(nsf['Baccalaureate Institution'].str.lower().unique()))

Unique entries with no change: 1530
Unique entries with all lower case: 1484


We can see that lowercasing every entry reduces the number of unique institution names in the `Baccalaureate Institution` column from 1530 to 1484. I'll apply that change to both that column and the `Current Institution` column.

In [55]:
nsf['Baccalaureate Institution'] = nsf['Baccalaureate Institution'].str.lower()
nsf['Current Institution'] = nsf['Current Institution'].str.lower()

Next, I use `.apply()` and `.translate()` to remove any punctuation marks from each entry. The set of punctuation marks comes from `string.punctuation` in the `string` module.

In [56]:
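import string
# Build the translation table once; it deletes every punctuation character
punct_table = str.maketrans('', '', string.punctuation)
nsf['Baccalaureate Institution'] = (nsf['Baccalaureate Institution']
                                    .apply(lambda x: x.translate(punct_table)))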

print('Unique entries with punctuation removed:',
      len(nsf['Baccalaureate Institution'].unique()))

Unique entries with punctuation removed: 1457


We can see that this operation also helped reduce the number of unique entries.
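One caveat with deleting punctuation outright is that hyphenated names such as `University of Minnesota-Twin Cities` end up with fused words. A possible variation, not yet applied in this notebook, is to replace punctuation with spaces and then collapse repeated whitespace. The helper name `normalize_name` and the choice to substitute spaces are assumptions made for illustration.

```python
import re
import string

# Map every punctuation character to a space (assumption: a space is
# preferable to deletion so that hyphenated words do not fuse together).
_PUNCT_TO_SPACE = str.maketrans({c: ' ' for c in string.punctuation})


def normalize_name(name):
    """Hypothetical helper: lowercase, swap punctuation for spaces, collapse whitespace."""
    cleaned = name.lower().translate(_PUNCT_TO_SPACE)
    return re.sub(r'\s+', ' ', cleaned).strip()


print(normalize_name('University of Minnesota-Twin Cities'))
print(normalize_name('UNIVERSITY OF  CALIFORNIA, BERKELEY'))
```

If this turned out to be useful, it could be applied to a column with something like `nsf['Baccalaureate Institution'].apply(normalize_name)`, and the same helper could be reused on any reference list of names so that both sides are normalized identically.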

As of today, this is the most manipulation I've done to clean up this data. My next goal is to figure out a way to compare these entries to a list of all universities in the United States so that I can fish out the rest of the names that were entered inconsistently.

I found a JSON file of all universities from this page:

https://github.com/Hipo/university-domains-list

I navigate the repository to find the URL of the raw JSON data and use `urllib.request` to save a copy as a new JSON file in my directory.

In [57]:
url = 'https://raw.githubusercontent.com/Hipo/university-domains-list/master/world_universities_and_domains.json'

import urllib.request
urllib.request.urlretrieve(url, 'world_universities_and_domains.json')

('world_universities_and_domains.json',
 <http.client.HTTPMessage at 0x1963743b080>)

In [58]:
unis = pd.read_json('world_universities_and_domains.json')

In [59]:
unis.head()

| | web_pages | name | alpha_two_code | state-province | domains | country |
|---|---|---|---|---|---|---|
| 0 | [http://www.marywood.edu] | Marywood University | PA | | [marywood.edu] | United States |
| 1 | [https://www.cstj.qc.ca, https://ccmt.cstj.qc.... | Cégep de Saint-Jérôme | CA | | [cstj.qc.ca] | Canada |
| 2 | [http://www.lindenwood.edu/] | Lindenwood University | US | | [lindenwood.edu] | United States |
| 3 | [http://www.davietjal.org/] | DAV Institute of Engineering & Technology | IN | Punjab | [davietjal.org] | India |
| 4 | [http://www.lpu.in/] | Lovely Professional University | IN | Punjab | [lpu.in] | India |


This DataFrame contains basic information (name, country, web pages, and domains) about universities around the world. I can use it to extract the universities in the United States.

In [60]:
usa = unis[unis['country'] == 'United States']
usa.head()

| | web_pages | name | alpha_two_code | state-province | domains | country |
|---|---|---|---|---|---|---|
| 0 | [http://www.marywood.edu] | Marywood University | PA | | [marywood.edu] | United States |
| 2 | [http://www.lindenwood.edu/] | Lindenwood University | US | | [lindenwood.edu] | United States |
| 5 | [https://sullivan.edu/] | Sullivan University | US | | [sullivan.edu] | United States |
| 6 | [https://www.fscj.edu/] | Florida State College at Jacksonville | US | | [fscj.edu] | United States |
| 7 | [https://www.xavier.edu/] | Xavier University | US | | [xavier.edu] | United States |


So now I have a reference list to benchmark the university names in my DataFrame against. Next, I need to figure out more ways to clean or reformat my data so that I can spot different variants of the same university name. One idea is to eliminate spaces and compare the character composition of each entry in my DataFrame with the entries in the university DataFrame. It might be significantly easier to just drop entries whose university name occurs only once. I'm a little hesitant about eliminating data points, but it is possible that those points don't add anything significant to my analysis.

I would really appreciate feedback and any suggestions about how I could tackle this problem. I will probably limit myself to just a couple more hours on this task.
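One way to spot near-duplicate names is approximate string matching. The sketch below uses `difflib.get_close_matches` from the standard library as one possible approach, not a method this notebook has settled on. The sample names, the `match_names` helper, and the 0.85 cutoff are all illustrative assumptions; in practice the inputs would be the cleaned NSF names and `usa['name']` after the same lowercasing and punctuation removal.

```python
import difflib

# Hypothetical stand-ins for the reference list (usa['name']) and the
# cleaned NSF entries; both are assumed to be normalized the same way.
reference_names = ['university of california berkeley',
                   'vanderbilt university',
                   'university of new mexico']
entries = ['univ of california berkeley',
           'vanderbilt univeristy',
           'some unlisted college']


def match_names(entries, reference_names, cutoff=0.85):
    """Map each entry to its closest reference name, or None if nothing clears the cutoff."""
    matches = {}
    for entry in entries:
        # n=1 keeps only the best candidate; cutoff is a similarity ratio in [0, 1]
        close = difflib.get_close_matches(entry, reference_names, n=1, cutoff=cutoff)
        matches[entry] = close[0] if close else None
    return matches


matched = match_names(entries, reference_names)
for entry, ref in matched.items():
    print(f'{entry!r} -> {ref!r}')
```

Entries that map to `None` would be the ones to inspect by hand or consider dropping. The cutoff would need tuning: too low and distinct schools with similar names get merged, too high and common abbreviations are missed.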

#### Feature engineering and model building

I will be eliminating this task from my project plan. I've made this decision because (1) it will probably take much more time than I estimated and (2) a model that predicts whether someone gets an award or an honorable mention is not very interesting. It would be more interesting for the model to predict whether someone wins the award or not, and unfortunately, NSF does not provide data on applicants who weren't recognized. If I find more free time outside of this class, I may still try to build models for more practice, but I'm no longer making my project depend on this portion.

#### Making interactive visualizations

This task will be the culmination of my work in the previous tasks. I would like to build interactive figures that show how many recipients an institution or a particular field of science has had. In case I start to run out of time, I will start with a simple type of figure, like an interactive bar or pie chart that allows the user to click on a section of data and learn how many recipients an institution or field has. If I have more time, I would still like to make a more sophisticated figure, like a heat map of the United States that highlights states with a higher number of recipients; if the user clicks on a state, a new figure is displayed that shows a bar graph of the number of recipients at each institution.