# NSF Research Awards Abstracts - Exploratory Data Analysis

In this notebook I build a high-level understanding of the data and define the approach to the clustering task: finding specific topics that group the abstracts.

### Task

This dataset comprises several paper abstracts, one per file, that were furnished by the NSF (National Science Foundation). A sample abstract is shown at the end.

Your task is to develop an unsupervised model that assigns each abstract to a topic (discover them!). In other words, your goal is to group abstracts based on their semantic similarity.

You can get a sample of abstracts here. Be creative and state your approach clearly. Although we don’t expect accurate results, we want to see your knowledge of both traditional and newer NLP methods.

### Let's dive in

In [8]:
# create a virtual env (in the root directory), then install the dependencies

#%pip install -r requirements.txt

In [1]:
# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [2]:
# create a function to load .xml files into a pandas DataFrame

def load_xml_files_to_dataframe(directory: str) -> pd.DataFrame:
    # List to hold DataFrames
    dataframes = []
    
    # Iterate over all files in the directory
    for filename in os.listdir(directory):
        if filename.endswith('.xml'):
            file_path = os.path.join(directory, filename)
            # Load the XML file into a DataFrame and append to the list
            df = pd.read_xml(file_path)
            dataframes.append(df)
    
    # Concatenate all DataFrames into a single DataFrame
    combined_df = pd.concat(dataframes, ignore_index=True)
    return combined_df

In [3]:
# load data path using environment variable for security 
# (create a .env file in the notebooks folder and load it using dotenv and os)

from dotenv import load_dotenv
load_dotenv()

df = load_xml_files_to_dataframe(os.getenv('DATA_PATH'))
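
`os.getenv` returns `None` when a variable is not set, so a missing or misspelled `.env` entry can be hard to diagnose. The following is a minimal sketch of a guard that could run before loading; the helper name `require_dir_from_env` is hypothetical and not part of this notebook.

```python
import os
from pathlib import Path


def require_dir_from_env(var_name: str) -> Path:
    """Return the directory named by an environment variable, or fail loudly.

    Hypothetical helper: raises if the variable is unset/empty or if it
    does not point to an existing directory.
    """
    value = os.getenv(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name!r} is not set")
    path = Path(value)
    if not path.is_dir():
        raise NotADirectoryError(f"{var_name}={value!r} is not a directory")
    return path
```

With this in place, the load step could be written as `load_xml_files_to_dataframe(str(require_dir_from_env('DATA_PATH')))`.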

### We explore basic data values and statistics

In [None]:
# let's view basic data characteristics
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.columns

In [None]:
df.shape

In [None]:
# let's confirm the total columns
print("The dataframe has:", len(df.columns), "columns in total")

In [None]:
# column dtypes
df.dtypes

In [None]:
# let's take a look at the statistics of the numerical variables
df.describe()

In [None]:
# Null values
df.isnull().sum()

Please note! The AbstractNarration column has 141 null values. This is important because it is the column of interest for building the clusters.

### Let's explore our column of interest: AbstractNarration

In [None]:
df["AbstractNarration"].head(3)

In [None]:
df["AbstractNarration"].tail(3)

Note that rows 13297 and 13298 have the same information! This column has duplicate values, probably because the same work was awarded multiple times.

In [None]:
# let's print one full sample
df["AbstractNarration"][0]

In [None]:
# percentage unique abstracts
print(f'Unique abstracts: {(len(df["AbstractNarration"].unique())/len(df["AbstractNarration"])*100):.2f} %')
print('unique abstracts:', len(df["AbstractNarration"].unique()))
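
The null values and repeated abstracts noted above would probably need handling before clustering. The following is a minimal sketch, assuming that exact-duplicate abstracts can be collapsed to their first occurrence; `clean_abstracts` is a hypothetical helper, and this notebook itself exports the column without this step.

```python
import pandas as pd


def clean_abstracts(df: pd.DataFrame, column: str = "AbstractNarration") -> pd.DataFrame:
    """Drop rows whose abstract is null or blank, then drop exact duplicates."""
    # Treat nulls as empty strings so a single comparison removes both cases
    text = df[column].fillna("").astype(str).str.strip()
    non_empty = df.loc[text != ""]
    # Keep the first occurrence of each repeated abstract
    deduplicated = non_empty.drop_duplicates(subset=column, keep="first")
    return deduplicated.reset_index(drop=True)
```

Whether duplicates should be dropped depends on the goal: collapsing them avoids giving repeated awards extra weight in the clusters, but it also loses the link back to each individual award.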

### Let's visualize trends in the numerical columns

In [49]:
numerical_columns = df.select_dtypes(include='number')

In [None]:
numerical_columns.head(5)

In [62]:
# define a function to plot a histogram of a numerical column (used here for the award amount)
def plot_award_amounts(column_name, values):
    plt.figure(figsize=(8, 8))
    plt.hist(values, bins=20, edgecolor='black')
    plt.title(f'Histogram of {column_name}')
    plt.xlabel(column_name)
    plt.ylabel('Frequency')
    plt.show()

In [None]:
plot_award_amounts("AwardAmount", numerical_columns['AwardAmount'])

In [None]:
numerical_columns["AwardAmount"].plot()

### Export the AbstractNarration column as a .csv file to develop the clustering model

In [8]:
# create data directory in root and add this path to .env file
# next export data in csv format to this folder
df["AbstractNarration"].to_csv(os.getenv("RAW_DATA_PATH"), index=False)

### Final approach

1. Only the AbstractNarration column will be used, because the task is about semantic similarity and the data has no pre-existing labels.
2. The K-Means algorithm will be used to define the topics.
3. An embedding model will map the abstracts into a vector space. The model chosen is all-MiniLM-L6-v2, for its good performance and fast implementation.
4. The model comes from the Hugging Face Hub.
5. Create training.py and inference.py files inside the pipelines folder for production use.
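
The approach above can be sketched as follows. This is a minimal illustration and not the contents of training.py: the function names, the fixed seed, and the choice to normalize embeddings are assumptions, and the number of clusters is left to the caller. It assumes the `sentence-transformers` and `scikit-learn` packages are installed.

```python
import numpy as np
from sklearn.cluster import KMeans


def embed_abstracts(texts: list[str]) -> np.ndarray:
    """Encode texts with all-MiniLM-L6-v2 (the model is downloaded on first use)."""
    # Imported lazily because sentence-transformers is a heavy dependency
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    # Assumption: unit-length vectors, so Euclidean K-Means tracks cosine similarity
    return encoder.encode(texts, normalize_embeddings=True)


def cluster_embeddings(embeddings: np.ndarray, n_clusters: int, seed: int = 0) -> np.ndarray:
    """Assign each embedding row to one of n_clusters K-Means clusters."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(embeddings)
```

A typical call would be `labels = cluster_embeddings(embed_abstracts(abstracts), n_clusters=k)`, where `abstracts` is a list of non-null strings and `k` is chosen by the analyst, for example by comparing silhouette scores across several values.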

### Next Steps

1. Evaluate different embedding models
2. Use PySpark to improve training performance
3. Add other columns and compare metrics
