# Analyzing "7363" Data - Text Clustering
Author: Nicholas Goldsmith <br>
Start Date: 20 December 2022 <br>
Last Updated: 23 December 2022 (cleaned up an earlier draft)

## Purpose
In this Jupyter notebook, I attempt text clustering on some public US government data. I used code generated with ChatGPT, as part of a continuing attempt to find a method for seeing how topical areas change across years. The analysis exists to try out the technique and to serve as a reference for future analyses. The point of this notebook is NOT to produce a meaningful analysis of the data. If you are interested in more informative analyses, I recommend https://beta.nsf.gov/about/about-nsf-by-the-numbers or conducting your own analysis.
<br><br> 
Disclaimer: The contents of this document do not necessarily represent the views of my employer nor the views of any organization I may be associated with.

## Libraries

In [1]:
# Library needed to work with dataframes
import pandas as pd
# Library needed for some math
import numpy as np
# Libraries needed for clustering and natural language processing
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

## Variables to Define

In [2]:
num_clusters = 20 # This should be the number of clusters you want

## Data
I am using publicly available data. Using the public NSF Award Search (https://nsf.gov/awardsearch/advancedSearch.jsp), I searched for Element Code "7363" and included all active and expired awards. There are 2573 awards in the dataset.

In [3]:
Awards = pd.read_excel("Awards-2022-12-20.xls") # Reads in the data file

In [4]:
Awards.info() # Gives the structure of the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2573 entries, 0 to 2572
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   AwardNumber              2573 non-null   int64 
 1   Title                    2573 non-null   object
 2   NSFOrganization          2573 non-null   object
 3   Program(s)               2573 non-null   object
 4   StartDate                2573 non-null   object
 5   LastAmendmentDate        2573 non-null   object
 6   PrincipalInvestigator    2568 non-null   object
 7   State                    2573 non-null   object
 8   Organization             2573 non-null   object
 9   AwardInstrument          2573 non-null   object
 10  ProgramManager           2573 non-null   object
 11  EndDate                  2573 non-null   object
 12  AwardedAmountToDate      2573 non-null   object
 13  Co-PIName(s)             773 non-null    object
 14  PIEmailAddress           2568 non-null  

### Adding a Year Column

In [5]:
# Take the last four characters of StartDate (the year) and store them as integers.
# Doing this on the whole column at once avoids the chained assignment
# (Awards['Year'].at[i] = ...) that a row-by-row loop would need.
Awards['Year'] = Awards['StartDate'].str[-4:].astype('int64')

### Year Subsets

In [6]:
Awards['Year'].unique()

array([2021, 2020, 2016, 2018, 2017, 2022, 2019, 2013, 2014, 2009, 2015,
       2007, 2006, 2012, 2008, 2010, 2011, 2005, 2004, 2003, 2002, 2023,
       2001], dtype=int64)

In [7]:
Awards2015 = Awards[Awards['Year'] == 2015]
Awards2016 = Awards[Awards['Year'] == 2016]
Awards2017 = Awards[Awards['Year'] == 2017]
Awards2018 = Awards[Awards['Year'] == 2018]
Awards2019 = Awards[Awards['Year'] == 2019]
Awards2020 = Awards[Awards['Year'] == 2020]
Awards2021 = Awards[Awards['Year'] == 2021]
Awards2022 = Awards[Awards['Year'] == 2022]
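
The eight per-year frames above could also be held in one dictionary keyed by year. The following is a minimal sketch on a made-up frame (`awards_demo` and `awards_by_year` are hypothetical names, not part of the notebook). Each subset is taken with `.copy()` so that columns added later belong to an independent frame.

```python
import pandas as pd

# Hypothetical stand-in for the Awards frame; only the 'Year' column matters here.
awards_demo = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Year": [2015, 2016, 2016, 2022],
})

# One independent copy per year, keyed by the year itself.
awards_by_year = {
    year: awards_demo[awards_demo["Year"] == year].copy()
    for year in range(2015, 2023)
}

print({year: len(frame) for year, frame in awards_by_year.items()})
```

A dictionary makes it easier to loop the same clustering steps over every year, at the cost of slightly longer lookups such as `awards_by_year[2020]`.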

## Clustering Text

### Clustering a Single Year (2020)

#### Performing the Clustering

In [8]:
# Extract the abstracts from the data frame
abstracts = Awards2020['Abstract'].tolist()

# Remove empty strings from the list of abstracts
abstracts = [abstract for abstract in abstracts if abstract != ""]

# Vectorize the abstracts using TF-IDF
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(abstracts)

# Use k-means to cluster the vectors into num_clusters clusters
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(vectors)

# Assign each vector to a cluster
clusters = kmeans.predict(vectors)

# Add the cluster labels to the data frame
Awards2020['Cluster'] = clusters

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  Awards2020['Cluster'] = clusters
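
The message above is pandas' SettingWithCopyWarning: `Awards2020` was created by filtering `Awards`, so pandas cannot tell whether the new column is meant to reach the original frame. Separately, the `!= ""` filter leaves missing values (NaN) in place, and if it ever did drop a row, the label list would no longer match the frame's length. A minimal sketch of one way to handle both, on made-up data (`awards_demo`, `awards_2020`, and the placeholder labels are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the Awards frame.
awards_demo = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2019],
    "Abstract": ["wireless networks", "", None, "quantum links"],
})

# .copy() yields an independent frame, so adding a column later is unambiguous.
awards_2020 = awards_demo[awards_demo["Year"] == 2020].copy()

# Drop rows whose abstract is missing or blank *in the frame itself*,
# so the rows and the labels computed from them stay aligned.
has_text = awards_2020["Abstract"].fillna("").str.strip() != ""
awards_2020 = awards_2020[has_text].copy()

# Placeholder labels standing in for kmeans.predict(vectors).
labels = [0] * len(awards_2020)
awards_2020["Cluster"] = labels

print(awards_2020)
```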


#### Words Associated with each Cluster

##### All words

In [9]:
# Transform the cluster centers back into a list of words
cluster_words = vectorizer.inverse_transform(kmeans.cluster_centers_)

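# Reorder the words in each cluster. Note: inverse_transform returns bare terms, not
# (term, score) pairs, so x[1] is each word's second character, not its TF-IDF score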
sorted_words = [sorted(words, key=lambda x: x[1], reverse=True) for words in cluster_words]

# Print the sorted words for each cluster
for i in range(len(sorted_words)):
    print(f"Cluster {i}:")
    print(sorted_words[i])

Cluster 0:
['by', 'dynamics', 'symbiotically', 'system', 'existing', 'experience', 'experiments', 'expose', 'extended', 'award', 'evaluation', 'events', 'over', 'augmented', 'cumbersome', 'curriculum', 'guarantee', 'multiple', 'outcomes', 'outreach', 'quality', 'subjective', 'summer', 'support', 'at', 'https', 'statutory', 'students', 'utility', 'as', 'is', 'nsf', 'nsfcns2008646project', 'use', 'user', 'users', 'using', 'are', 'br', 'broad', 'broadening', 'broader', 'crafted', 'criteria', 'framework', 'from', 'organized', 'precise', 'project', 'projects', 'xr', 'equipment', 'application', 'applications', 'spaces', 'sports', 'com', 'combination', 'combine', 'communication', 'community', 'component', 'computing', 'concepts', 'confined', 'connected', 'connectivity', 'considerations', 'contrast', 'contribute', 'domains', 'focused', 'for', 'foundation', 'go', 'goal', 'google', 'holistic', 'home', 'however', 'mostly', 'novel', 'posted', 'potential', 'potentially', 'societal', 'solutions', 't
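
The ordering in this output ('by', 'dynamics', 'symbiotically', 'system', then words starting with 'ex') follows from how the sort key works. `inverse_transform` returns only the terms with nonzero weight, so each `x` is a plain string and `x[1]` is its second character. A small sketch with a hand-picked word list (chosen for illustration only) reproduces the pattern:

```python
# Words taken from the Cluster 0 output, deliberately listed in a different order.
words = ["award", "by", "dynamics", "system", "existing", "outreach"]

# Same key as in the notebook: the second character of each word, descending.
by_second_char = sorted(words, key=lambda x: x[1], reverse=True)

print(by_second_char)
# ['by', 'dynamics', 'system', 'existing', 'award', 'outreach']
```

Words whose second letter is 'y' come first, then 'x', 'w', 'u', which is why 'by' leads every cluster.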

##### Top 10 words

In [10]:
# Transform the cluster centers back into a list of words
cluster_words = vectorizer.inverse_transform(kmeans.cluster_centers_)

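# Reorder the words in each cluster. Note: inverse_transform returns bare terms, not
# (term, score) pairs, so x[1] is each word's second character, not its TF-IDF score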
sorted_words = [sorted(words, key=lambda x: x[1], reverse=True) for words in cluster_words]

# Print the top 10 words for each cluster
for i in range(len(sorted_words)):
    print(f"Cluster {i}:")
    print(sorted_words[i][:10])

Cluster 0:
['by', 'dynamics', 'symbiotically', 'system', 'existing', 'experience', 'experiments', 'expose', 'extended', 'award']
Cluster 1:
['by', 'systems', 'excess', 'expected', 'experience', 'award', 'two', 'evaluation', 'culminate', 'current']
Cluster 2:
['by', 'cyberphysical', 'cycle', 'dynamic', 'dynamically', 'dynamics', 'system', 'systematic', 'systematically', 'systems']
Cluster 3:
['by', 'dynamic', 'dynamics', 'eyes', 'hybrid', 'hyperdimensional', 'symbol', 'synergy', 'system', 'systems']
Cluster 4:
['by', 'system', 'systems', 'existing', 'expenditure', 'experienced', 'expertise', 'exploration', 'explore', 'exploring']
Cluster 5:
['by', 'dynamic', 'dynamics', 'hybrid', 'system', 'systems', 'except', 'exchange', 'execution', 'exhaustive']
Cluster 6:
['by', 'symposium', 'system', 'systems', 'types', 'existing', 'experience', 'explore', 'expose', 'exposed']
Cluster 7:
['by', 'dynamic', 'system', 'systems', 'example', 'exceed', 'exciting', 'existing', 'experiments', 'exploit']
Cl
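
To list the ten terms that actually carry the most weight in each cluster, one option is to rank the columns of the centroid matrix directly. The sketch below uses small made-up arrays: `terms` stands in for `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions) and `centers` for `kmeans.cluster_centers_`, whose columns follow the same term order. This is an illustration under those assumptions, not the method used for the output above.

```python
import numpy as np

# Hypothetical stand-ins for the vocabulary and the k-means centroids.
terms = np.array(["by", "network", "quantum", "system", "wireless"])
centers = np.array([
    [0.05, 0.40, 0.02, 0.10, 0.60],
    [0.04, 0.10, 0.70, 0.30, 0.00],
])

top_n = 3
# argsort is ascending, so reverse each row and keep the first top_n columns.
order = np.argsort(centers, axis=1)[:, ::-1][:, :top_n]
top_terms = [terms[row].tolist() for row in order]

for i, cluster_top in enumerate(top_terms):
    print(f"Cluster {i}: {cluster_top}")
```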

##### Unique Words

In [11]:
# Transform the cluster centers back into a list of words
cluster_words = vectorizer.inverse_transform(kmeans.cluster_centers_)

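# Reorder the words in each cluster. Note: inverse_transform returns bare terms, not
# (term, score) pairs, so x[1] is each word's second character, not its TF-IDF score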
sorted_words = [sorted(words, key=lambda x: x[1], reverse=True) for words in cluster_words]

# Get the unique words for each cluster
seen_words = set()
unique_words = []
for i in range(len(sorted_words)):
    cluster_words = sorted_words[i]
    unique_words.append([word for word in cluster_words if word not in seen_words])
    seen_words.update(unique_words[i])

# Print the unique words for each cluster
for i in range(len(unique_words)):
    print(f"Cluster {i}:")
    print(unique_words[i][:10])

Cluster 0:
['by', 'dynamics', 'symbiotically', 'system', 'existing', 'experience', 'experiments', 'expose', 'extended', 'award']
Cluster 1:
['systems', 'excess', 'expected', 'two', 'culminate', 'current', 'fully', 'fundamental', 'much', 'multitude']
Cluster 2:
['cyberphysical', 'cycle', 'dynamic', 'dynamically', 'systematic', 'systematically', 'typically', 'exacerbates', 'examples', 'execution']
Cluster 3:
['eyes', 'hybrid', 'hyperdimensional', 'symbol', 'synergy', 'types', 'examine', 'examines', 'example', 'exhibit']
Cluster 4:
['expenditure', 'experienced', 'expertise', 'exploration', 'exploring', 'everyday', 'overarching', 'oversimplify', 'further', 'outside']
Cluster 5:
['except', 'exchange', 'exhaustive', 'exist', 'exists', 'expensive', 'exponential', 'extent', 'extracted', 'extraordinary']
Cluster 6:
['symposium', 'exposed', 'exposure', 'extends', 'tx', 'awards', 'twenty', 'event', 'audience', 'austin']
Cluster 7:
['exceed', 'exciting', 'exploits', 'explores', 'automotive', 'culm
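
Because `seen_words` grows as the loop runs, the result depends on cluster order: Cluster 0 keeps every word it has, while later clusters lose anything an earlier cluster already used. An order-independent variant keeps only the terms that occur in exactly one cluster. A minimal sketch on made-up term lists (`cluster_terms` and `exclusive_terms` are hypothetical names):

```python
from collections import Counter

# Hypothetical per-cluster term lists standing in for sorted_words.
cluster_terms = [
    ["by", "system", "quantum"],
    ["by", "system", "wireless"],
    ["by", "symposium", "wireless"],
]

# Count how many clusters contain each term (set() guards against repeats
# within a single cluster).
cluster_counts = Counter(
    term for terms_in_cluster in cluster_terms for term in set(terms_in_cluster)
)

# Keep only the terms found in exactly one cluster, preserving their order.
exclusive_terms = [
    [term for term in terms_in_cluster if cluster_counts[term] == 1]
    for terms_in_cluster in cluster_terms
]

print(exclusive_terms)
# [['quantum'], [], ['symposium']]
```

The trade-off is that a cluster can end up with no exclusive terms at all, as the second one does here.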

#### Names for each cluster based on their top unique words

In [12]:
# Transform the cluster centers back into a list of words
cluster_words = vectorizer.inverse_transform(kmeans.cluster_centers_)

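# Reorder the words in each cluster. Note: inverse_transform returns bare terms, not
# (term, score) pairs, so x[1] is each word's second character, not its TF-IDF score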
sorted_words = [sorted(words, key=lambda x: x[1], reverse=True) for words in cluster_words]

# Get the unique words for each cluster
seen_words = set()
unique_words = []
for i in range(len(sorted_words)):
    cluster_words = sorted_words[i]
    unique_words.append([word for word in cluster_words if word not in seen_words])
    seen_words.update(unique_words[i])

# Generate a name for each cluster based on its top words
cluster_names = []
for i in range(len(unique_words)):
    cluster_name = " ".join(unique_words[i][:3])
    cluster_names.append(cluster_name)

# Print the cluster names
print(cluster_names)

['by dynamics symbiotically', 'systems excess expected', 'cyberphysical cycle dynamic', 'eyes hybrid hyperdimensional', 'expenditure experienced expertise', 'except exchange exhaustive', 'symposium exposed exposure', 'exceed exciting exploits', 'bypass bytecode cyber', 'experimental experimentally experimentation', 'exceeding exploited surfaces', 'synthesis execute experts', 'cycling synchronized expediting', 'synergistic qualified episodic', 'ay expenditures awakening', 'exceedingly expedite explicitly', 'everything friendly orbit', 'cyberinfrastructure every successfully', 'cybersecurity synchronize typical', 'extensions extensively awe']
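
Names such as 'by dynamics symbiotically' include function words because `TfidfVectorizer()` keeps every token by default. One possible adjustment, sketched here on two invented sentences, is scikit-learn's built-in English stop-word list. That list is general-purpose, so domain words like 'award' or 'nsf' would still need a custom list; treat this as an illustration rather than a tuned setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example sentences, not taken from the awards data.
docs = [
    "This award is funded by NSF to study wireless systems",
    "Quantum networks are explored by this project",
]

# stop_words="english" drops common function words before weighting.
vectorizer_demo = TfidfVectorizer(stop_words="english")
vectors_demo = vectorizer_demo.fit_transform(docs)

vocabulary = sorted(vectorizer_demo.vocabulary_)
print(vocabulary)
```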


#### Top Titles for each Cluster

In [13]:
# Extract the titles from the data frame
titles = Awards2020['Title'].tolist()

# Vectorize the titles with the vectorizer already fitted on the abstracts
# (words that never appear in an abstract are ignored)
vectors = vectorizer.transform(titles)

# Assign each title to a cluster
cluster_labels = kmeans.predict(vectors)

# Compute the distances between each title and its cluster's centroid
distances = np.linalg.norm(vectors - kmeans.cluster_centers_[cluster_labels], axis=1)

# Zip the titles, cluster labels, and distances together
zipped = zip(titles, cluster_labels, distances)

# Sort the titles by their distances
sorted_titles = sorted(zipped, key=lambda x: x[2])

# Print the top 5 titles for each cluster
for i in range(kmeans.n_clusters):
    cluster_titles = [title for title, label, distance in sorted_titles if label == i]
    print(f"Cluster {i}:")
    print(cluster_titles[:5])

Cluster 0:
[]
Cluster 1:
['EAGER: Information Theory: From Classical to Quantum', 'Collaborative Research: CNS Core: Medium: Design and Analysis of Quantum Networks for Entanglement Distribution', 'Collaborative Research: CNS Core: Medium: Design and Analysis of Quantum Networks for Entanglement Distribution', 'Collaborative Research: CNS Core: Medium: Design and Analysis of Quantum Networks for Entanglement Distribution']
Cluster 2:
['CNS Core: Medium: Communication and Networking with Diffused Laser Light', 'CNS Core: Small: Design and Evaluation of Methods for Supporting Resilient and High-Availability Elastic Network Slicing', 'CNS Core: Small: New Caching Paradigms for Distributed and Dynamic Networks', 'CNS Core: Small: Dynamic and Composite Resource Management in Large-scale Industrial IoT Systems', 'SWIFT: SMALL: xNGRAN Navigating Spectral Utilization, LTE/WiFi Coexistence, and Cost Tradeoffs in Next Gen Radio Access Networks through Cross-Layer Design']
Cluster 3:
['CAREER: Wi
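
The ranking step above can be followed more easily on a toy example. The sketch below uses made-up two-dimensional vectors in place of the TF-IDF rows and computes the distance from every title to every centroid with NumPy broadcasting. In the notebook, `kmeans.transform(vectors)` should provide the same kind of distance matrix. All names and numbers here are assumptions for illustration.

```python
import numpy as np

# Hypothetical titles with made-up 2-D vectors standing in for TF-IDF rows.
titles = ["Quantum Networks", "Quantum Links",
          "Wireless Sensing", "Wireless Spectrum Sharing"]
vectors = np.array([[0.9, 0.1], [0.7, 0.2], [0.1, 0.8], [0.3, 0.9]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])

# distances[i, j] = Euclidean distance from title i to centroid j.
distances = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
labels = distances.argmin(axis=1)
own_distance = distances[np.arange(len(titles)), labels]

# For each cluster, list up to five member titles, nearest to the centroid first.
closest = {}
for cluster in range(len(centers)):
    members = np.flatnonzero(labels == cluster)
    ranked = members[np.argsort(own_distance[members])]
    closest[cluster] = [titles[i] for i in ranked[:5]]

print(closest)
```

A cluster with no members simply gets an empty list, which matches the `[]` printed for Cluster 0 above.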