<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NCSES-class---FedRePORTER-and-IPEDS-data" data-toc-modified-id="NCSES-class---FedRePORTER-and-IPEDS-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NCSES class - FedRePORTER and IPEDS data</a></span><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Python-Setup" data-toc-modified-id="Python-Setup-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Python Setup</a></span></li></ul></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load the data</a></span><ul class="toc-item"><li><span><a href="#Federal-RePORTER---Projects-(https://federalreporter.nih.gov/FileDownload)" data-toc-modified-id="Federal-RePORTER---Projects-(https://federalreporter.nih.gov/FileDownload)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Federal RePORTER - Projects (<a href="https://federalreporter.nih.gov/FileDownload" target="_blank">https://federalreporter.nih.gov/FileDownload</a>)</a></span></li><li><span><a href="#Federal-RePORTER---Abstracts-(https://federalreporter.nih.gov/FileDownload)" data-toc-modified-id="Federal-RePORTER---Abstracts-(https://federalreporter.nih.gov/FileDownload)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Federal RePORTER - Abstracts (<a href="https://federalreporter.nih.gov/FileDownload" target="_blank">https://federalreporter.nih.gov/FileDownload</a>)</a></span></li><li><span><a href="#IPEDS-data" data-toc-modified-id="IPEDS-data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>IPEDS data</a></span></li></ul></li><li><span><a href="#Filter-the-data" data-toc-modified-id="Filter-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Filter the data</a></span></li><li><span><a href="#Match-with-IPEDS-university-names" 
data-toc-modified-id="Match-with-IPEDS-university-names-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Match with IPEDS university names</a></span></li><li><span><a href="#Text-analysis-(Topic-modeling)" data-toc-modified-id="Text-analysis-(Topic-modeling)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Text analysis (Topic modeling)</a></span><ul class="toc-item"><li><span><a href="#LDA-method---Latent-Dirichlet-Allocation" data-toc-modified-id="LDA-method---Latent-Dirichlet-Allocation-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>LDA method - Latent Dirichlet Allocation</a></span></li><li><span><a href="#NMF-method---Non-negative-matrix-factorization" data-toc-modified-id="NMF-method---Non-negative-matrix-factorization-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>NMF method - Non-negative matrix factorization</a></span></li><li><span><a href="#Choosing-results" data-toc-modified-id="Choosing-results-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Choosing results</a></span></li></ul></li><li><span><a href="#Creating-a-finalized-dataset" data-toc-modified-id="Creating-a-finalized-dataset-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Creating a finalized dataset</a></span><ul class="toc-item"><li><span><a href="#Number-of-grants-per-university" data-toc-modified-id="Number-of-grants-per-university-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Number of grants per university</a></span></li><li><span><a href="#Top-5-topics-by-%grants" data-toc-modified-id="Top-5-topics-by-%grants-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Top 5 topics by %grants</a></span></li><li><span><a href="#Top-5-topics-by-%dollars" data-toc-modified-id="Top-5-topics-by-%dollars-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Top 5 topics by %dollars</a></span></li><li><span><a href="#Finalized-dataset" data-toc-modified-id="Finalized-dataset-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>Finalized 
dataset</a></span></li></ul></li></ul></div>

## NCSES class - FedRePORTER and IPEDS data

### Introduction

This notebook builds links between two datasets:
1. **Federal RePORTER** (https://federalreporter.nih.gov) - a collaborative effort led by STAR METRICS® to create a searchable database of scientific awards from federal agencies. Awards can be searched across agencies or fiscal years, by the award's project leader, or by the text of a project's title, terms, or abstract.


2. **IPEDS** (https://nces.ed.gov/ipeds/) - the Integrated Postsecondary Education Data System, a system of interrelated surveys conducted annually by the U.S. Department of Education’s National Center for Education Statistics (NCES). IPEDS gathers information on enrollments, program completions, graduation rates, faculty and staff, finances, institutional prices, and student financial aid from every college, university, and technical and vocational institution that participates in the federal student financial aid programs.

**Output file**

**Data dictionary**

Unit of observation: **university by year (from year 2000+)**

Variables:
- **#grants** - total number of grants per university from 2000 onward
- **top 5 topics by grants** - top five topics by share of grants
- **top 5 topics by dollars** - top five topics by share of dollars

### Python Setup

In [165]:
# Data manipulation
import pandas as pd

# Read in multiple files
import glob

# Display settings 
pd.set_option('float_format', '{:f}'.format) # show full float numbers
pd.set_option('display.max_colwidth', None) # show full cells (None replaces the deprecated -1)

# Text analysis (topic modeling)
import numpy as np
import sklearn
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string

## Load the data

### Federal RePORTER - Projects (https://federalreporter.nih.gov/FileDownload)

In [170]:
# Get all .csv files for FedRePORTER Projects

fedreporter_files = glob.glob('FedRePORTER_PRJ_C_FY20*.csv')
print(fedreporter_files)

['FedRePORTER_PRJ_C_FY2018.csv', 'FedRePORTER_PRJ_C_FY2008.csv', 'FedRePORTER_PRJ_C_FY2009.csv', 'FedRePORTER_PRJ_C_FY2013.csv', 'FedRePORTER_PRJ_C_FY2007.csv', 'FedRePORTER_PRJ_C_FY2006.csv', 'FedRePORTER_PRJ_C_FY2012.csv', 'FedRePORTER_PRJ_C_FY2004.csv', 'FedRePORTER_PRJ_C_FY2010.csv', 'FedRePORTER_PRJ_C_FY2011.csv', 'FedRePORTER_PRJ_C_FY2005.csv', 'FedRePORTER_PRJ_C_FY2001.csv', 'FedRePORTER_PRJ_C_FY2015.csv', 'FedRePORTER_PRJ_C_FY2014.csv', 'FedRePORTER_PRJ_C_FY2000.csv', 'FedRePORTER_PRJ_C_FY2016.csv', 'FedRePORTER_PRJ_C_FY2002.csv', 'FedRePORTER_PRJ_C_FY2003.csv', 'FedRePORTER_PRJ_C_FY2017.csv']


In [171]:
# Read them in, concatenate and convert to a dataframe

list_data = []
for filename in fedreporter_files:
    data = pd.read_csv(filename)
    list_data.append(data)
    
projects = pd.concat(list_data)



In [178]:
projects.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1075525 entries, 0 to 87255
Data columns (total 24 columns):
PROJECT_ID                     1075525 non-null int64
 PROJECT_TERMS                 1052425 non-null object
 PROJECT_TITLE                 1075525 non-null object
 DEPARTMENT                    1075525 non-null object
 AGENCY                        1075525 non-null object
 IC_CENTER                     877594 non-null object
 PROJECT_NUMBER                1075525 non-null object
 PROJECT_START_DATE            935590 non-null object
 PROJECT_END_DATE              943996 non-null object
 CONTACT_PI_PROJECT_LEADER     1075430 non-null object
 OTHER_PIS                     131441 non-null object
 CONGRESSIONAL_DISTRICT        1001814 non-null float64
 DUNS_NUMBER                   1065111 non-null object
 ORGANIZATION_NAME             1074914 non-null object
 ORGANIZATION_CITY             1069689 non-null object
 ORGANIZATION_STATE            1059189 non-null object
 ORGANIZATION

### Federal RePORTER - Abstracts (https://federalreporter.nih.gov/FileDownload)

In [173]:
# Get all .csv files for FedRePORTER Abstracts

abstracts_files = glob.glob('FedRePORTER_PRJABS_C_FY20*.csv')
print(abstracts_files)

['FedRePORTER_PRJABS_C_FY2009.csv', 'FedRePORTER_PRJABS_C_FY2008.csv', 'FedRePORTER_PRJABS_C_FY2018.csv', 'FedRePORTER_PRJABS_C_FY2017.csv', 'FedRePORTER_PRJABS_C_FY2003.csv', 'FedRePORTER_PRJABS_C_FY2002.csv', 'FedRePORTER_PRJABS_C_FY2016.csv', 'FedRePORTER_PRJABS_C_FY2000.csv', 'FedRePORTER_PRJABS_C_FY2014.csv', 'FedRePORTER_PRJABS_C_FY2015.csv', 'FedRePORTER_PRJABS_C_FY2001.csv', 'FedRePORTER_PRJABS_C_FY2005.csv', 'FedRePORTER_PRJABS_C_FY2011.csv', 'FedRePORTER_PRJABS_C_FY2010.csv', 'FedRePORTER_PRJABS_C_FY2004.csv', 'FedRePORTER_PRJABS_C_FY2012.csv', 'FedRePORTER_PRJABS_C_FY2006.csv', 'FedRePORTER_PRJABS_C_FY2007.csv', 'FedRePORTER_PRJABS_C_FY2013.csv']


In [174]:
# Read them in, concatenate and convert to a dataframe

list_data = []
for filename in abstracts_files:
    data = pd.read_csv(filename)
    list_data.append(data)
    
abstracts = pd.concat(list_data)

### IPEDS data

In [19]:
ipeds = pd.read_csv('IPEDS.csv',encoding='ISO-8859-1')

## Filter the data

In [20]:
# Keep only the first record per project number (to avoid double-counting the total cost)
projects = projects.groupby(' PROJECT_NUMBER').first().reset_index()

# Keep only projects with FY_TOTAL_COST above 100,000
projects = projects[projects[' FY_TOTAL_COST'] > 100000]
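
The two filtering steps can be illustrated on a toy table. This is a minimal sketch on made-up rows; only the column names (with their leading spaces) are taken from the notebook, and the variable names `toy`, `first_records`, and `kept` are hypothetical.

```python
import pandas as pd

# Made-up rows; project A1 appears twice, as a multi-year award might.
toy = pd.DataFrame({
    ' PROJECT_NUMBER': ['A1', 'A1', 'B2', 'C3'],
    ' FY_TOTAL_COST': [250000.0, 260000.0, 50000.0, 400000.0],
})

# Step 1: collapse to one record per project number.
first_records = toy.groupby(' PROJECT_NUMBER').first().reset_index()

# Step 2: keep only projects above the cost threshold.
kept = first_records[first_records[' FY_TOTAL_COST'] > 100000]

print(kept)
```

Note that `GroupBy.first()` returns the first non-null value in each column, so a collapsed record can combine values from different rows when earlier rows contain missing values.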

## Match with IPEDS university names

In [24]:
ipeds = ipeds.rename(columns={'INSTNM':' ORGANIZATION_NAME'})

projects = projects.dropna(subset=[' ORGANIZATION_NAME'])

def normalize_names(name):
    name = name.lower()
    name = name.strip()
    for i in string.punctuation:
        name = name.replace(i,'')
    name = name.replace(' ','')
    return name

ipeds[' ORGANIZATION_NAME'] = ipeds[' ORGANIZATION_NAME'].apply(normalize_names)
projects[' ORGANIZATION_NAME'] = projects[' ORGANIZATION_NAME'].apply(normalize_names)

ipeds[' ORGANIZATION_NAME'].head()

In [112]:
projects[' ORGANIZATION_NAME'].head()

10434    associationofuniversitiesforresearchinastronom...
10435                                   columbiauniversity
10436                                 washingtonuniversity
10437                            universityofhawaiisystems
10438                       universityofcaliforniasandiego
Name:  ORGANIZATION_NAME, dtype: object
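
To make the effect of the normalization concrete, here is a small standalone sketch. `normalize_name` is a hypothetical variant of the `normalize_names` function above that uses `str.translate` instead of a loop, and the sample institution names are chosen for illustration only.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize_name(name):
    """Lowercase, drop ASCII punctuation, and remove all spaces."""
    return name.lower().translate(_PUNCT_TABLE).replace(' ', '')

examples = [
    'University of California, San Diego',
    'Texas A&M University',
    "St. John's University",
]
normalized = [normalize_name(n) for n in examples]
print(normalized)

# string.punctuation covers ASCII only, so a curly apostrophe is kept.
print(normalize_name('Tennessee’s'))
```

Because only ASCII punctuation is removed, typographic characters such as `’` survive normalization, which is why one of the manually validated names further down still contains it.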

In [116]:
projects_subset = projects[['PROJECT_ID',' ORGANIZATION_NAME']]

projects_ipeds_1 = projects_subset.merge(ipeds,on=' ORGANIZATION_NAME')

projects_ipeds_1[' ORGANIZATION_NAME'].nunique()

In [119]:
# Add additional validated matches

names_dictionary = dict()

names_dictionary['universityofvirginiamaincampus'] = ['universityofvirginiacharlottesville','universityofvirginia']
names_dictionary['ohiostateuniversitymaincampus'] = ['ohiostateuniversityresearchfoundation','ohiostateuniversity','ohiostateuniversityvetmed']
names_dictionary['universityofcincinnatimaincampus'] = ['universityofcincinnati']
names_dictionary['auburnuniversity'] = ['auburnuniversityatauburn', 'auburnuniversiymaincampus']
names_dictionary['theuniversityoftennesseeknoxville'] = ['universityoftennesseeknoxville', 'universityoftennessee', 'universityoftennessee’scenterforcleanproductsandcleantechnologiesandthehealthybuildingnetwork']
names_dictionary['universityofsouthcarolinacolumbia'] = ['universityofsouthcarolinaatcolumbia']
names_dictionary['texasamuniversitycollegestation'] = ['texasamuniversity']
names_dictionary['universityofsouthfloridamaincampus'] = ['universityofsouthflorida']
names_dictionary['midwesternuniversityglendale'] = ['midwesternuniversity']
names_dictionary['stonybrookuniversity'] = ['stateuniversitynewyorkstonybrook','mathdeptstonybrookuniversity','sunystonybrook','stonybrookbiotechnology','stonybrooktechnologyappliedresearch','thestateuniversityofnewyorkatstonybrook']
names_dictionary['universityofthepacific'] = ['universityofthepacificstockton']
names_dictionary['universityatbuffalo'] = ['stateuniversityofnewyorkatbuffalo', 'universityatbuffalosuny', 'thestateuniversityofnewyorkatbuffalo', 'universityofbuffalo']
names_dictionary['midwesternuniversitydownersgrove'] = ['midwesternuniversity']
names_dictionary['northcarolinastateuniversityatraleigh'] = ['northcarolinastateuniversityraleigh','northcarolinastateuniversity']
names_dictionary['universityofcoloradoboulder'] = ['universityofcoloradoatboulder','universityofcoloradoboulderdeptofgeologicalsciences','regentsoftheuniversityofcoloradothe','universityofcolorado']
names_dictionary['universityofwashingtonseattlecampus'] = ['universityofwashington','universityofwashingtonmechanicalengineering','universityofwashingtonseattle']
names_dictionary['indianauniversitypurdueuniversityindianapolis'] = ['indianaunivpurdueunivatindianapolis']
names_dictionary['universityofpittsburghpittsburghcampus'] = ['theuniversityofpittsburgh','universityofpittsburgh']
names_dictionary['universityofcoloradodenveranschutzmedicalcampus'] = ['universityofcoloradodenver','universityofcoloradodenverhscdenver']
names_dictionary['louisianastateuniversityandagriculturalmechanicalcollege'] = ['louisianastateunivaampmcolbatonrouge','louisianastateunivagriculturalcenter']

names_dictionary_updated = {k: v for k, v in names_dictionary.items() if v is not None}

names_dictionary_dataframe = pd.DataFrame.from_dict(names_dictionary_updated,orient="index").reset_index()

names_dataframe = names_dictionary_dataframe.set_index('index').stack().reset_index()

names_dataframe = names_dataframe.rename(columns={'index':' ORGANIZATION_NAME'})

names_dataframe_finalized = ipeds.merge(names_dataframe,on=' ORGANIZATION_NAME')

names_dataframe_finalized = names_dataframe_finalized.rename(columns={' ORGANIZATION_NAME':'original_name',0:' ORGANIZATION_NAME'})

projects_ipeds_2 = projects_subset.merge(names_dataframe_finalized,on=' ORGANIZATION_NAME')
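
The reshaping above turns a dictionary of lists into one row per (IPEDS name, variant) pair. The plain-Python sketch below shows the same idea on a three-entry excerpt of the dictionary and adds a check for variants that are listed under more than one IPEDS institution; the variable names are hypothetical.

```python
# Excerpt of names_dictionary: IPEDS name -> Federal RePORTER name variants.
aliases_by_ipeds_name = {
    'universityofcincinnatimaincampus': ['universityofcincinnati'],
    'midwesternuniversityglendale': ['midwesternuniversity'],
    'midwesternuniversitydownersgrove': ['midwesternuniversity'],
}

# Flatten to one (ipeds_name, alias) pair per variant.
pairs = [(ipeds_name, alias)
         for ipeds_name, aliases in aliases_by_ipeds_name.items()
         for alias in aliases]

# Invert the mapping to find aliases that point to several institutions.
ipeds_names_by_alias = {}
for ipeds_name, alias in pairs:
    ipeds_names_by_alias.setdefault(alias, []).append(ipeds_name)

ambiguous = {alias: names
             for alias, names in ipeds_names_by_alias.items()
             if len(names) > 1}
print(ambiguous)
```

An ambiguous variant matters for the merge: a project whose organization name matches it is joined to every IPEDS institution listed for that variant, so it appears once per institution in the result.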

In [128]:
projects_ipeds_combined = pd.concat([projects_ipeds_1,projects_ipeds_2])



In [130]:
projects_ipeds_combined[' ORGANIZATION_NAME'].nunique()

1196

## Text analysis (Topic modeling)

In [132]:
# Drop missing abstracts
abstracts = abstracts.dropna()

# Merge abstracts with Project ID of IPEDS universities
merged_abstracts = abstracts.merge(projects_ipeds_combined,on='PROJECT_ID')

# Get only the text of abstracts
merged_abstracts_list = merged_abstracts[' ABSTRACT'].values.tolist()

### LDA method - Latent Dirichlet Allocation

Latent Dirichlet Allocation is a probabilistic topic model that is used for discovering abstract topics from a collection of documents.

Latent (or hidden) refers to topics that are present in the documents but not directly observed; they are discovered from observed data, such as the words in the documents. Dirichlet refers to the Dirichlet distribution, which the model uses as a prior for two distributions: a distribution of words in a topic (which words are more or less probable to belong to a given topic) and a distribution of topics in a document (which topic is more or less probable for a given document). Allocation refers to allocating topics to documents once those two distributions are in place.

The LDA model takes as input a corpus (a collection of text documents). Every text document is tokenized into a sequence of words (tokens). All unique words across the corpus are saved as a vocabulary. The text documents are then converted to a matrix of token counts (how often each unique word from the vocabulary appears in each text document), e.g.:


|doc# / unique words| 'science' | 'research' | 'cell' | 'DNA' | 'gene'|
|----------|-----------|------------|--------|-------|-------|
|document 1|     0     |     0      |     1  |    5  |    7  |
|document 2|     1     |     5      |     0  |    0  |    0  |
|document 3|     0     |     2      |     5  |    0  |    0  |  


The LDA model then estimates, for each topic, a distribution over words (the probability of each word appearing in that topic) and, for each document, a distribution over topics (the probability of each topic being assigned to that document).
 
More here: https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
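
The token-count matrix described above can be built by hand for a tiny corpus. This is a minimal sketch using only the standard library; the three documents are invented, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Invented three-document corpus.
docs = [
    'gene dna gene cell',
    'research science research',
    'cell research cell',
]

# Tokenize, then collect the sorted vocabulary of unique words.
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one column per vocabulary word.
counts = []
for doc in tokenized:
    doc_counts = Counter(doc)
    counts.append([doc_counts[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

`CountVectorizer` produces the same kind of matrix in sparse form; by default it also lowercases the text and ignores single-character tokens, and with `stop_words='english'` it drops common English words.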

In [137]:
# Convert a collection of text documents to a matrix of token counts
# Every token is a unique word, and the matrix consists of a count of a given word (token) in a given document

tf_vectorizer = CountVectorizer(stop_words='english') # remove English stopwords (semantically-vacuous words, such as 'the', 'and', etc.)
tf = tf_vectorizer.fit_transform(merged_abstracts_list)
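tf_feature_names = tf_vectorizer.get_feature_names_out() # unique words (vocabulary); requires scikit-learn >= 1.0

# n_components sets the number of topics (the older n_topics argument was removed from scikit-learn)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(tf)

lda_W = lda.transform(tf) # document-topic matrix (one row per document)
lda_H = lda.components_ # topic-word matrix (one row per topic)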

In [140]:
# View the list of topics (10 top words per topic)

for topic_idx, topic in enumerate(lda_H):
    print("Topic %d:" % (topic_idx))
    print('----------------------------')
    print(" ".join([tf_feature_names[i]
                for i in topic.argsort()[:-10 - 1:-1]]))
    print('----------------------------')

Topic 0:
----------------------------
research program training students university clinical center science support faculty
----------------------------
Topic 1:
----------------------------
cells cancer cell tumor immune specific aim studies breast expression
----------------------------
Topic 2:
----------------------------
protein proteins gene genes cell dna genetic molecular human specific
----------------------------
Topic 3:
----------------------------
patients clinical treatment care health study patient outcomes data trial
----------------------------
Topic 4:
----------------------------
cell cells signaling mechanisms role function aim specific mice determine
----------------------------
Topic 5:
----------------------------
disease risk diabetes study ad metabolic studies insulin obesity metabolism
----------------------------
Topic 6:
----------------------------
hiv infection drug new human studies specific develop tissue cell
----------------------------
Topic 7:
------
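
The slice `topic.argsort()[:-10 - 1:-1]` used in the loop above is compact but easy to misread. The sketch below, on an invented five-word vocabulary with made-up weights, shows that it selects the indices of the largest weights in descending order and that it matches a more explicit two-step form.

```python
import numpy as np

# Invented vocabulary and weights for a single topic.
feature_names = np.array(['cell', 'dna', 'gene', 'research', 'science'])
topic_weights = np.array([0.2, 3.1, 4.0, 0.1, 0.5])
n_top = 3

# argsort() orders indices from smallest to largest weight;
# the negative-step slice walks back from the end for n_top items.
compact = topic_weights.argsort()[:-n_top - 1:-1]

# Equivalent, more explicit form: reverse, then take the first n_top.
explicit = topic_weights.argsort()[::-1][:n_top]

top_words = feature_names[compact].tolist()
print(top_words)
```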

In [141]:
# View a top document related to a given topic

for topic_idx, topic in enumerate(lda_H):
    print('--------------------')
    print("Topic %d:" % (topic_idx))
    print('--------------------')
    print(" ".join([tf_feature_names[i]
                    for i in topic.argsort()[:-10 - 1:-1]]))
    top_doc_indices = np.argsort(lda_W[:,topic_idx] )[::-1][0:1]
    for doc_index in top_doc_indices:
        print('--------------------')
        print(merged_abstracts_list[doc_index])

--------------------
Topic 0:
--------------------
research program training students university clinical center science support faculty
--------------------
A. INTRODUCTION AND TRANSFORMING INTENTFor the past five years, the overarching mission of the Stanford University School of Medicine (SoM) hasbeen the translation of discoveries into medical practice. A plan developed in 2001-02 has encouragedtransforming efforts in clinical and translational (CT) education and research by providing the resources tosupport such endeavors. The vision and goals of this plan, Translating Discoveries, are closely alignedwith those of the NIH Clinical and Translational Science Award (CTSA) effort. The CTSA program hasgiven the SoM an important opportunity to reassess, refine and refocus these efforts and move to anotherlevel in our effort to transform the practice of CTR across the University. The result will be anadministrative home, the Stanford Center for Clinical and Translational Education and Re

patients clinical treatment care health study patient outcomes data trial
--------------------
DESCRIPTION (provided by applicant):     PROJECT SUMMARY Objective: The broad, long-term objective of this project is to advance understanding of ways to prevent unintentional prescription opioid poisoning. Specifically, the objective (FOA CE-10-002 objective #6) is to examine the ability of a publicly-sponsored physician education initiative to contribute to change in physician prescribing patterns, and to reduction in prescription opioid-related morbidity and mortality. Importance: Washington State (WA) is in the upper tier (10.8/100,000 in 2005) of unintentional poisoning mortality in the U.S.. In response to this public health epidemic, the Washington State Interagency Guideline on Opioid Dosing for Chronic, Non-Cancer Pain was developed by a consortium of Washington State agencies and clinical experts in pain management, and implemented in April, 2007. The Guideline focused primarily on 

--------------------
DESCRIPTION (provided by applicant): Coronary heart disease (CHD) is the single largest killer for both men and women in the US. The Japanese in Japan continue to have very low CHD rates despite increasing levels of risk factors since the end of World War II (WWII). CHD mortality has been increasing in most Asian countries except for Japan. The Electron-Beam Tomography and Risk Assessment in the Japanese and US Men in the Post World War II Birth Cohort (ERA-JUMP) (HL068200 and HL071561) was designed to test the hypothesis that levels of subclinical atherosclerosis in men in the post WWII birth cohort, i.e., men aged 40-49, in Japan exposed for long periods of time to Westernized lifestyle are lower than in age-matched US Black, White, and Japanese American men. ERA-JUMP has documented that (1) the Japanese in Japan had significantly lower levels of subclinical atherosclerosis in the coronary and carotid arteries than US populations, (2) Japanese Americans had simil

project research data new systems high methods develop using based
--------------------
NONTECHNICAL SUMMARYThis award supports OpenKIM, a cyberinfrastructure component of the research community that uses computer simulations of atoms based on Newton's Laws and models for the interaction between atoms, to attack problems in materials science, engineering, and physics, and to enable the discovery of new materials, design new devices, to advance the understanding of materials-related phenomena, and much more. Recent years have seen significant advancement in the areas of materials knowledge, discovery, and manufacturing methodologies. This includes, for example, the development of graphene (a single atomic layer of carbon atoms, which has exceptional mechanical, thermal, and electrical properties) and the related class of two-dimensional materials with unprecedented material properties now being extensively studied by scientists and engineers. Another example is the advent of three-dimen

### NMF method - Non-negative matrix factorization

NMF is another model used for topic extraction. While the LDA model uses raw counts of unique words per document, the NMF model here uses a normalized representation of those raw counts (the TF-IDF representation).

TF stands for term frequency, and TF-IDF is term frequency times inverse document frequency. In other words, we look not only at how often a word appears in a given document, but also at how distinctive that word is across the whole collection of documents (the corpus). For example, words like "often" or "use" occur frequently, but they are less informative (more semantically vacuous) for discerning the topic of a particular document, because they are likely to appear across all text documents in a corpus. On the other hand, words that appear in only a few documents are more likely to be specific to those documents and can therefore form the basis for a topic.

More here: 

- https://scikit-learn.org/stable/modules/decomposition.html#non-negative-matrix-factorization-nmf-or-nnmf
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
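
The weighting can be worked through by hand on a tiny invented corpus. The sketch below assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document row; the helper `tfidf_row` and all variable names are hypothetical.

```python
import math
from collections import Counter

# Invented corpus: 'research' appears in every document.
docs = [
    'research gene gene',
    'research cell',
    'research science',
]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for doc in tokenized for token in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs.
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(tokens):
    """Raw count times idf for each word, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]
for word in vocabulary:
    print(word, round(idf[word], 3))
```

In the second document, "research" and "cell" each occur once, yet "cell" receives the larger weight because it appears in fewer documents.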

In [142]:
# Convert a collection of raw documents to a matrix of TF-IDF features
vectorizer = TfidfVectorizer(stop_words='english')
tfidf = vectorizer.fit_transform(merged_abstracts_list)

# Get feature names (requires scikit-learn >= 1.0)
vectorizer_feature_names = vectorizer.get_feature_names_out()

# Run the model with 10 topics
nmf = NMF(n_components=10).fit(tfidf)

nmf_W = nmf.transform(tfidf) # document-topic matrix (one row per document)
nmf_H = nmf.components_ # topic-word matrix (one row per topic)

nmf_W[0]

In [148]:
# View the list of topics (10 top words per topic)

for topic_idx, topic in enumerate(nmf_H):
    print("Topic %d:" % (topic_idx))
    print('----------------------------')
    print(" ".join([vectorizer_feature_names[i]
                for i in topic.argsort()[:-10 - 1:-1]]))
    print('----------------------------')

Topic 0:
----------------------------
data project research systems new methods materials computational applications design
----------------------------
Topic 1:
----------------------------
training program research trainees faculty clinical biology postdoctoral medicine scientists
----------------------------
Topic 2:
----------------------------
cells cell immune stem tumor responses infection il differentiation lung
----------------------------
Topic 3:
----------------------------
health care intervention risk children study outcomes treatment patients clinical
----------------------------
Topic 4:
----------------------------
hiv infection infected aids viral virus drug transmission prevention antiretroviral
----------------------------
Topic 5:
----------------------------
cancer breast tumor prostate tumors cancers clinical patients imaging metastasis
----------------------------
Topic 6:
----------------------------
students stem science program research student engineering ed

In [149]:
# View a top document related to a given topic

for topic_idx, topic in enumerate(nmf_H):
    print('--------------------')
    print("Topic %d:" % (topic_idx))
    print('--------------------')
    print(" ".join([vectorizer_feature_names[i]
                    for i in topic.argsort()[:-10 - 1:-1]]))
    top_doc_indices = np.argsort(nmf_W[:,topic_idx] )[::-1][0:1]
    for doc_index in top_doc_indices:
        print('--------------------')
        print(merged_abstracts_list[doc_index])

--------------------
Topic 0:
--------------------
data project research systems new methods materials computational applications design
--------------------
Recent revolutions in data availability have radically altered activities across many fields within science, industry, and government. For instance, contemporary simulations medication properties can require the computational power of entire data centers, and recent efforts in astronomy will soon generate the largest image datasets in history. In such extreme environments, the only viable path forward for scientific discovery hinges on the development and exploitation of next-generation computational cyberinfrastructure of supercomputers and software. The development of this new computational infrastructure demands significant engineering resources, so it is paramount to maximize the infrastructure's potential for high impact and wide adoption across as many technical domains as possible. Unfortunately, despite this necessity, exi

hiv infection infected aids viral virus drug transmission prevention antiretroviral
--------------------
DESCRIPTION (provided by applicant): The overarching goals of our renewal proposal are to develop an integrated program to further our ability to provide evidenced-based, potent antiretroviral therapy (ART) to patients with HIV-2 infection. Compared to HIV-1, HIV-2 infection is characterized by a longer asymptomatic stage, lower plasma viral loads, slower decline in CD4 count, decreased mortality rate due to AIDS, lower rates of mother to child transmission, and lower rates genital shedding and sexual transmission. In West Africa, where both HIV-1 and HIV-2 co- circulate, between 1-2 million individuals are infected with HIV-2 and a significant proportion are co-infected with both HIV-1 and HIV-2. Despite the relatively attenuated disease course of HIV-2, a significant minority of untreated individuals will progress to clinical AIDS or death without ART and as will the majority of t

dna gene genes genetic genome expression rna human genomic chromatin
--------------------
DESCRIPTION (provided by applicant): Most psychiatric disorders are not due to mutations in a single gene but rather involve cellular pathways under control of many genes and molecular signals. Recent studies point to the fact that complex epigenetic mechanisms regulating gene activity 'above' the genetic nucleotide sequence may be involved as well. The best understood mechanism of epigenetic modification is DNA methylation. In this genome-wide study we will examine the role of DNA methylation in schizophrenia susceptibility. Our study consists of a genome-wide discovery and replication phase to identify CpG loci in the human genome that are under epigenetic control and involved in disease susceptibility, followed by locus-specific validation in large schizophrenia cohorts. Our systematic approach for identifying candidate CpG loci involved in disease also includes study of general features of DNA

### Choosing results

A comparison of the LDA and NMF outputs suggests that the NMF topics are more coherent, both within each topic and in relation to the content of the documents associated with it. Therefore, we will use the 10 topics from the NMF model output and map them back to the abstracts to identify the top 5 topics per university.

In [150]:
# View 10 topics returned by the NMF model again

for topic_idx, topic in enumerate(nmf_H):
    print("Topic %d:" % (topic_idx))
    print(" ".join([vectorizer_feature_names[i]
                for i in topic.argsort()[:-10 - 1:-1]]))
    print('----------------------------')

Topic 0:
data project research systems new methods materials computational applications design
----------------------------
Topic 1:
training program research trainees faculty clinical biology postdoctoral medicine scientists
----------------------------
Topic 2:
cells cell immune stem tumor responses infection il differentiation lung
----------------------------
Topic 3:
health care intervention risk children study outcomes treatment patients clinical
----------------------------
Topic 4:
hiv infection infected aids viral virus drug transmission prevention antiretroviral
----------------------------
Topic 5:
cancer breast tumor prostate tumors cancers clinical patients imaging metastasis
----------------------------
Topic 6:
students stem science program research student engineering education graduate learning
----------------------------
Topic 7:
brain neural neurons cognitive memory visual synaptic cortex learning cortical
----------------------------
Topic 8:
dna gene genes genetic

In [152]:
# Create topic names from the word lists above

topics_names = pd.DataFrame()
topics_names['topic'] = [0,1,2,3,4,5,6,7,8,9]
topics_names['topic_name'] = ['computational methods/research systems','training programs','cell research','health care','HIV','cancer','STEM education','brain/neural','DNA/genes','protein/molecular']

In [153]:
topics_names

|   | topic | topic_name |
|---|-------|------------|
| 0 | 0 | computational methods/research systems |
| 1 | 1 | training programs |
| 2 | 2 | cell research |
| 3 | 3 | health care |
| 4 | 4 | HIV |
| 5 | 5 | cancer |
| 6 | 6 | STEM education |
| 7 | 7 | brain/neural |
| 8 | 8 | DNA/genes |
| 9 | 9 | protein/molecular |


## Creating a finalized dataset

### Number of grants per university

In [154]:
grants_per_university_full = projects.merge(projects_ipeds_combined,on=['PROJECT_ID',' ORGANIZATION_NAME'])

grants_per_university = grants_per_university_full.groupby(['UNITID',' ORGANIZATION_NAME'])[' PROJECT_NUMBER'].count().reset_index()

grants_per_university = grants_per_university.rename(columns={' PROJECT_NUMBER':'#grants'})

grants_per_university.sort_values('#grants',ascending=False).head(10)
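
Note that `count()` tallies the non-null values of `' PROJECT_NUMBER'` in each group, so a repeated project number is counted every time it appears and a missing one is not counted at all. The sketch below, on made-up values, compares it with `size` and `nunique`. Whether the real data contain duplicates or missing values is not something this notebook establishes.

```python
import pandas as pd

# Toy data (made-up values); column names mirror the notebook's,
# including the leading space in ' PROJECT_NUMBER'.
toy = pd.DataFrame({
    "UNITID": [1, 1, 1, 2, 2],
    " PROJECT_NUMBER": ["A1", "A1", "B2", "C3", None],
})

per_university = toy.groupby("UNITID")[" PROJECT_NUMBER"].agg(
    rows="size",         # every row, including missing project numbers
    non_null="count",    # what the cell above computes
    distinct="nunique",  # each project number counted once
).reset_index()
print(per_university)
```

If the three columns disagree on the real data, the choice among them changes what `#grants` means.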

### Top 5 topics by %grants

In [242]:
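# Pair each document's row position with its vector of topic weights.
# Note: NMF weights are non-negative but are not normalized probabilities.
topics_probabilities = []
for index, weights in enumerate(nmf_W): # one row of nmf_W per document
    topics_probabilities.append([weights, index]) # keep all topic weights for this document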

topics_list_dataframe = pd.DataFrame(topics_probabilities)
topics_list_dataframe = topics_list_dataframe.rename(columns={0:'topic',1:'index'})
merged_abstracts = merged_abstracts.reset_index()
merged_topics_abstracts = merged_abstracts.merge(topics_list_dataframe,on='index')

merged_topics_abstracts[[0,1,2,3,4,5,6,7,8,9]] = merged_topics_abstracts['topic'].apply(pd.Series)

topics_per_grant = grants_per_university_full.merge(merged_topics_abstracts,on=['PROJECT_ID','UNITID',' ORGANIZATION_NAME'])

topics_per_grant_university = topics_per_grant.groupby(['UNITID',' ORGANIZATION_NAME'])[[0,1,2,3,4,5,6,7,8,9]].sum().reset_index()
topics_weights = pd.melt(frame=topics_per_grant_university, id_vars=['UNITID',' ORGANIZATION_NAME'],value_vars=[0,1,2,3,4,5,6,7,8,9],var_name='topic',value_name='weight')
topics_weights_sorted = topics_weights.sort_values([' ORGANIZATION_NAME','weight'],ascending=False)
topics_weights_top_five = topics_weights_sorted.groupby(['UNITID',' ORGANIZATION_NAME']).head(5)
topics_weights_top_five.head()

top_five_topics_by_grants = topics_weights_top_five.merge(topics_names,on='topic')

top_five_topics_by_grants = top_five_topics_by_grants.groupby(['UNITID',' ORGANIZATION_NAME'])['topic_name'].apply(list).reset_index()
top_five_topics_by_grants = top_five_topics_by_grants.rename(columns={'topic_name':'top_5_topics_grants'})

top_five_topics_by_grants.head()
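
The heading refers to percentages, while the cell above ranks topics by raw summed weights. Dividing by a university's total does not change the ranking within that university, but if actual shares are wanted they can be computed as in this sketch. The data are made up, only three topics are used, and `shares`, `long`, and `top_two` are hypothetical names.

```python
import pandas as pd

# Toy summed topic weights for two universities (made-up values, 3 topics).
toy = pd.DataFrame({
    "UNITID": [1, 2],
    0: [2.0, 0.5],
    1: [1.5, 1.5],
    2: [0.5, 3.0],
})
topic_cols = [0, 1, 2]

# Divide each row by its total so the topic weights sum to 1 per university.
shares = toy[topic_cols].div(toy[topic_cols].sum(axis=1), axis=0)
shares.insert(0, "UNITID", toy["UNITID"])

long = shares.melt(id_vars="UNITID", value_vars=topic_cols,
                   var_name="topic", value_name="share")
# Sort within each university by descending share and keep the two largest.
top_two = (long.sort_values(["UNITID", "share"], ascending=[True, False])
               .groupby("UNITID").head(2))
print(top_two)
```

Sorting on `UNITID` rather than the organization name keeps the grouping key and the sort key aligned.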

### Top 5 topics by %dollars

In [235]:
topic_cols = [0,1,2,3,4,5,6,7,8,9]

# Copy the slice so the assignment below does not write into a view of topics_per_grant
topics_per_grant_dollars = topics_per_grant[['UNITID',' ORGANIZATION_NAME',' PROJECT_NUMBER',' FY_TOTAL_COST'] + topic_cols].copy()

# Weight each topic by the grant's total cost: multiply every topic column by ' FY_TOTAL_COST', row by row
topics_per_grant_dollars[topic_cols] = topics_per_grant_dollars[topic_cols].mul(topics_per_grant_dollars[' FY_TOTAL_COST'],axis=0)

topics_dollars_university = topics_per_grant_dollars.groupby(['UNITID',' ORGANIZATION_NAME'])[[0,1,2,3,4,5,6,7,8,9]].sum().reset_index()
topics_weights_dollars = pd.melt(frame=topics_dollars_university, id_vars=['UNITID',' ORGANIZATION_NAME'],value_vars=[0,1,2,3,4,5,6,7,8,9],var_name='topic',value_name='weight')
topics_weights_dollars_sorted = topics_weights_dollars.sort_values([' ORGANIZATION_NAME','weight'],ascending=False)
topics_weights_dollars_top_five = topics_weights_dollars_sorted.groupby(['UNITID',' ORGANIZATION_NAME']).head(5)

top_five_topics_by_dollars = topics_weights_dollars_top_five.merge(topics_names,on='topic')

top_five_topics_by_dollars = top_five_topics_by_dollars.groupby(['UNITID',' ORGANIZATION_NAME'])['topic_name'].apply(list).reset_index()
top_five_topics_by_dollars = top_five_topics_by_dollars.rename(columns={'topic_name':'top_5_topics_dollars'})

top_five_topics_by_dollars.head()

### Finalized dataset

In [271]:
finalized_dataset = grants_per_university.merge(top_five_topics_by_grants,on=['UNITID',' ORGANIZATION_NAME']).merge(top_five_topics_by_dollars,on=['UNITID',' ORGANIZATION_NAME'])

finalized_dataset
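
Both merges above use the default inner join, so any university missing from one of the three tables silently drops out of `finalized_dataset`. One way to see which rows would be lost is an outer merge with `indicator=True`. The sketch below uses made-up tables, and `counts`, `topics`, and `check` are hypothetical names.

```python
import pandas as pd

# Toy tables (made-up values) standing in for two of the per-university frames.
counts = pd.DataFrame({"UNITID": [1, 2, 3], "#grants": [10, 4, 7]})
topics = pd.DataFrame({"UNITID": [1, 2],
                       "top_5_topics_grants": [["HIV"], ["cancer"]]})

# An outer merge with indicator=True labels where each row came from;
# validate="one_to_one" raises if UNITID repeats in either table.
check = counts.merge(topics, on="UNITID", how="outer",
                     indicator=True, validate="one_to_one")
dropped_by_inner = check.loc[check["_merge"] != "both", "UNITID"].tolist()
print(dropped_by_inner)  # prints [3]
```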

In [None]:
#Question about individual universities:
#auburnuniversityatmontgomery
#auburnuniversityatauburn
#auburnuniversiymaincampus