# Semantic Text Similarity Clustering

Before startint to code the solution, I check for some documents from de dataset to get familiar with the format of the documents and to analice what information I should take from there since not every field was usefull.

After that quick analysis I decided to use information about the university and it's location and also the title and abstract.

### Imports

I used two spacy models, the bigger one **en_core_web_trf** for a better lemmatization and **en_core_web_lg** to get the word vectors of the topics

In [321]:
import os
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

import spacy
nlp = spacy.load('en_core_web_trf', disable=['tagger', 'parser', 'ner'])
nlp.add_pipe('sentencizer')
en_stop_words = nlp.Defaults.stop_words

import warnings
warnings.filterwarnings('ignore')
import gc

### Reading files from XML format to pandas DataFrame

In [322]:
path = './2020/'
files = os.listdir(path)
print(len(files))

13154


In [323]:
df_abstracts = pd.DataFrame()
df = pd.DataFrame()
i=0
list_keywords=[]

for file in files:
    file_path=path+file
    #print('Processing....'+file_path)
    tree = ET.parse(file_path)
    root = tree.getroot()

    abstract = {}
    
    abstract['id'] = root.find('Award').find('AwardID').text
    abstract['title'] = root.find('Award').find('AwardTitle').text
    abstract['date'] = root.find('Award').find('AwardEffectiveDate').text
    abstract['area'] = root.find('Award').find('Organization').find('Directorate').find('LongName').text
    abstract['division'] = root.find('Award').find('Organization').find('Division').find('LongName').text
    abstract['university'] = root.find('Award').find('Institution').find('Name').text
    abstract['city'] = root.find('Award').find('Institution').find('CityName').text
    abstract['country'] = root.find('Award').find('Institution').find('CountryName').text
    abstract['abstract'] = root.find('Award').find('AbstractNarration').text
    
    df  = pd.DataFrame(abstract,index=[i])
    i=i+1
    
    df_abstracts = pd.concat([df_abstracts, df])

df_abstracts

Unnamed: 0,id,title,date,area,division,university,city,country,abstract
0,2000005,Collaborative Research: Excellence in Research...,07/01/2020,Direct For Biological Sciences,Division Of Integrative Organismal Systems,Howard University,Washington,United States,Head and heart development are closely intertw...
1,2000009,Workshop on Replication of a Community-Engaged...,01/01/2020,Direct For Education and Human Resources,Division Of Undergraduate Education,University of Notre Dame,NOTRE DAME,United States,The National Academy of Engineering identified...
2,2000012,Brazos Analysis Seminar,02/01/2020,Direct For Mathematical & Physical Scien,Division Of Mathematical Sciences,Baylor University,Waco,United States,This award provides three years of funding to ...
3,2000021,Collaborative Research: ECR EIE DCL: The Devel...,09/01/2020,Direct For Education and Human Resources,Division Of Human Resource Development,Michigan State University,East Lansing,United States,"This collaborative research project, involving..."
4,2000028,Research Initiation Award: Microwave Synthesis...,05/01/2020,Direct For Education and Human Resources,Division Of Human Resource Development,Bowie State University,BOWIE,United States,Research Initiation Awards provide support for...
...,...,...,...,...,...,...,...,...,...
13149,2055767,National Center for Next Generation Manufacturing,07/01/2021,Direct For Education and Human Resources,Division Of Undergraduate Education,Tunxis Community-Technical College,Farmington,United States,Recent studies have highlighted the nation's i...
13150,2055771,NSF-BSF: Dynamics and Operator Algebras beyond...,08/01/2021,Direct For Mathematical & Physical Scien,Division Of Mathematical Sciences,University of Oregon Eugene,Eugene,United States,"This project links two mathematical fields, dy..."
13151,2055772,Collaborative Research: SaTC: CORE: Medium: Na...,03/01/2021,Direct For Computer & Info Scie & Enginr,Division Of Computer and Network Systems,International Computer Science Institute,Berkeley,United States,Recent years have seen a dramatic rise in mobi...
13152,2055773,Collaborative Research: SaTC: CORE: Medium: Na...,03/01/2021,Direct For Computer & Info Scie & Enginr,Division Of Computer and Network Systems,St Mary's University San Antonio,San Antonio,United States,Recent years have seen a dramatic rise in mobi...


### Data Pre-processing

In the first iteration I tried the entire below pipeline but adding to the abstract the fields: *title*, *area*, *division*, *university*, *city* and *country* because I wanted to try if this metadata could add any value to the topics and to better cluster them, **but** it didn't work as expected so, for the final solution I decided to just use the *abstract* field and the results were better

In [324]:
for col in df_abstracts.columns:
    df_abstracts[col] = df_abstracts[col].str.lower()
df_abstracts

Unnamed: 0,id,title,date,area,division,university,city,country,abstract
0,2000005,collaborative research: excellence in research...,07/01/2020,direct for biological sciences,division of integrative organismal systems,howard university,washington,united states,head and heart development are closely intertw...
1,2000009,workshop on replication of a community-engaged...,01/01/2020,direct for education and human resources,division of undergraduate education,university of notre dame,notre dame,united states,the national academy of engineering identified...
2,2000012,brazos analysis seminar,02/01/2020,direct for mathematical & physical scien,division of mathematical sciences,baylor university,waco,united states,this award provides three years of funding to ...
3,2000021,collaborative research: ecr eie dcl: the devel...,09/01/2020,direct for education and human resources,division of human resource development,michigan state university,east lansing,united states,"this collaborative research project, involving..."
4,2000028,research initiation award: microwave synthesis...,05/01/2020,direct for education and human resources,division of human resource development,bowie state university,bowie,united states,research initiation awards provide support for...
...,...,...,...,...,...,...,...,...,...
13149,2055767,national center for next generation manufacturing,07/01/2021,direct for education and human resources,division of undergraduate education,tunxis community-technical college,farmington,united states,recent studies have highlighted the nation's i...
13150,2055771,nsf-bsf: dynamics and operator algebras beyond...,08/01/2021,direct for mathematical & physical scien,division of mathematical sciences,university of oregon eugene,eugene,united states,"this project links two mathematical fields, dy..."
13151,2055772,collaborative research: satc: core: medium: na...,03/01/2021,direct for computer & info scie & enginr,division of computer and network systems,international computer science institute,berkeley,united states,recent years have seen a dramatic rise in mobi...
13152,2055773,collaborative research: satc: core: medium: na...,03/01/2021,direct for computer & info scie & enginr,division of computer and network systems,st mary's university san antonio,san antonio,united states,recent years have seen a dramatic rise in mobi...


In [325]:
df_abstracts.to_csv('./df_abstracts.csv',index=False)

In [326]:
df_enriched = pd.DataFrame()
df_enriched['id'] = df_abstracts['id']
df_enriched['text'] = df_abstracts['abstract']
df_enriched.dropna(inplace=True)
df_enriched.head()

Unnamed: 0,id,text
0,2000005,head and heart development are closely intertw...
1,2000009,the national academy of engineering identified...
2,2000012,this award provides three years of funding to ...
3,2000021,"this collaborative research project, involving..."
4,2000028,research initiation awards provide support for...


In [327]:
def pre_process_txt(txt:str):
    """Cleans the text string, removing HTML tags, non alpha-numeric characters and stopwords

    Parameters
    ----------
    txt : str
        String to clean

    Returns
    -------
    txt : str
        Clean string
    """

    text = txt.replace('<.*?>',' ')
    text = text.replace('[^\w\s]','')
    text = text.translate(str.maketrans('','', string.punctuation))
    text = " ".join(word for word in text.split() if word not in en_stop_words)
    
    return text


def lemmatize_pipe(doc):
    """Applys lemmatization to the document

    Parameters
    ----------
    doc : str
        Document to lemmatize

    Returns
    -------
    str
        Lemmatized document
    """
    lemma_list = [str(tok.lemma_) for tok in doc] 
    return " ".join(lemma_list)

def preprocess_pipe(texts):
    """Helper to use spacy pipe fast processing to lemmatize text

    Parameters
    ----------
    texts : pd.Series
        String column to lemmatize

    Returns
    -------
    preproc_pipe : array
        Lemmatized column
    """
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    
    return preproc_pipe

I decided to lemmatize the abstracts in order to get the root of every word and make easy the topic extraction and also because is a common good practice in NLP

In [328]:
df_enriched['clean'] = df_enriched.apply(lambda row : pre_process_txt(row['text']), axis=1)
print('Clean it... Lemmatizing...')
df_enriched['lemmatize'] = preprocess_pipe(df_enriched['clean'])
df_enriched.drop(columns=['text','clean'],inplace=True)
df_enriched

Clean it... Lemmatizing...


Unnamed: 0,id,lemmatize
0,2000005,head heart development closely intertwined emb...
1,2000009,national academy engineering identified solvin...
2,2000012,award provides years funding help defray expen...
3,2000021,collaborative research project involving michi...
4,2000028,research initiation awards provide support jun...
...,...,...
13149,2055767,recent studies highlighted nations increasing ...
13150,2055771,project links mathematical fields dynamics ope...
13151,2055772,recent years seen dramatic rise mobile apps he...
13152,2055773,recent years seen dramatic rise mobile apps he...


In [329]:
df_enriched.to_csv('./df_clean.csv',index=False)

Cleaning a bit of memory, because text dataframes tend to be very memory heavy

In [330]:
del df_abstracts
gc.collect()

40

In [331]:
def find_topics(text):
    """
    Function that takes a text as an input, and finds the most important topic and 
    takes the 3 most relevant words of that topic, using LDA model
    
    Parameters
    ----------
    text : str
        Lemmatized string to get topics from

    Returns
    -------
    str
        String with the 3 most relevant words of the topic
    """
    try:
        
        count_vectorizer = CountVectorizer(stop_words='english')
        count_data = count_vectorizer.fit_transform([text])
        
        number_topics = 1
        number_words = 3
        
        # Create and fit the LDA model
        lda = LDA(n_components=number_topics, n_jobs=-1)
        lda.fit(count_data)

        words = count_vectorizer.get_feature_names()

        #Get topics from model. They are represented as a list e.g. ['military','army']
        topics = [[words[i] for i in topic.argsort()[:-number_words - 1:-1]] for (topic_idx, topic) in enumerate(lda.components_)]
        topics = np.array(topics).ravel()
  
    except Exception as e:
        print(e)
        return (text)

    return " ".join(set(topics))

After the before mentioned first iteration I found that in the topics were some words specific to the investigation field that didn't add much value to the topic analysis and were just adding noice to them, so I decided to eliminated them after doing a frequency analysis of the topics extracted in that first iteration

In [332]:
bs_stop_words = ["research","project","students","data","new","university","award","states","united","nsfs","study","undergraduate","'s","1","2","state"]

def remove_bs_sw(txt):
    """
    Remove business specific stop words
    
    Parameters
    ----------
    text : str
        Lemmatized string to delete common problem specific words that just adds noice to the topics

    Returns
    -------
    str
        Cleaned string
    """

    text = " ".join(word for word in txt.split() if word not in bs_stop_words)
    
    return text

In [333]:
df_enriched['lemmatize'] = df_enriched.apply(lambda row : remove_bs_sw(row['lemmatize']), axis=1)
df_enriched['topics'] = df_enriched['lemmatize'].apply(find_topics).values
df_enriched.head()

Unnamed: 0,id,lemmatize,topics
0,2000005,head heart development closely intertwined emb...,development cells neural
1,2000009,national academy engineering identified solvin...,education ecosystem community
2,2000012,provides years funding help defray expenses pa...,theory analysis texas
3,2000021,collaborative involving michigan north texas m...,stem faculty women
4,2000028,initiation awards provide support junior midca...,vcp binders proteins


In [334]:
df_enriched.to_csv('./df_topics.csv',index=False)

In [335]:
nlp_sts = spacy.load('en_core_web_lg')

### Modeling
Now after all the NLP data cleaning, I transformed the topics into word vectors to then use them in a clustering algorithm

In [336]:
corpus = df_enriched['topics'].str.cat(sep=' ')
words = corpus.split()
corpus = " ".join(sorted(set(words), key=words.index))

#Apply the model
tokens = nlp_sts(corpus)

#Convert tags into vectors for clustering model
word_vectors = []
for i in tokens:
    word_vectors.append(i.vector)
word_vectors = np.array(word_vectors)
word_vectors

array([[ 0.03734  ,  0.0010196,  0.1125   , ..., -0.24715  , -0.40202  ,
         0.49479  ],
       [-0.55419  , -0.02895  , -0.39624  , ...,  0.28711  , -0.11356  ,
         0.27036  ],
       [ 0.10273  ,  0.0059362, -0.019216 , ...,  0.52837  , -0.34998  ,
         0.45391  ],
       ...,
       [ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ],
       [ 0.       ,  0.       ,  0.       , ...,  0.       ,  0.       ,
         0.       ],
       [-0.043278 ,  0.028749 , -0.33234  , ..., -0.11468  , -0.13788  ,
         0.011818 ]], dtype=float32)

I decided to use DBSCAN since in the past got great results in other task and I found out that this algorithm has some good results in text clustering. The *eps* and *min_samples* were chosen after a short hyperparameter tunning were the goal was to reduce the amount of words in the *outlier* cluster (-1 label)

In [337]:
labels = DBSCAN(metric='cosine', eps=0.4, min_samples=7).fit_predict(word_vectors)
np.unique(labels,return_counts=True)

(array([-1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13],
       dtype=int64),
 array([4760, 2347,   95,   13,   11,    4,    5,    5,    3,    7,    8,
           8,    3,    7,    7], dtype=int64))

In [338]:
df_topics = pd.DataFrame()
df_topics['topics'] = corpus.split()
df_topics['label'] = labels
df_topics

Unnamed: 0,topics,label
0,development,0
1,cells,0
2,neural,0
3,education,0
4,ecosystem,0
...,...,...
7278,wuschel,-1
7279,imperial,-1
7280,electrocatalysts,-1
7281,calgebras,-1


In [339]:
def assign_topic_labels(txt):
    
    """
    Assigns the labels of the topics to each document
    
    Parameters
    ----------
    text : str
        String with the topics of the document

    Returns
    -------
    str
        String with the clustering labels corresponding to the document
    """
    
    labels = []
    
    topics = txt.split()
    
    for t in topics:
        l = df_topics[df_topics['topics']==t]['label'].values[0]
        if l != -1:
            labels.append(l)
            
    return " ".join(str(x) for x in set(labels))

In [340]:
df_enriched['labels'] = df_enriched.apply(lambda row : assign_topic_labels(row['topics']), axis=1)

In [359]:
display(len(df_enriched['labels'][df_enriched['labels'].str.len()==0]))
display(len(df_enriched['labels'][df_enriched['labels'].str.len()==1]))
display(len(df_enriched['labels'][df_enriched['labels'].str.len()==2]))
display(len(df_enriched['labels'][df_enriched['labels'].str.len()==3]))

1021

11932

119

8

In [351]:
def clean_labels(txt):
    """
    Cleans labels removing the '0' label if the document has more labels, 
    since it seems to be common among all the documents
    
    Parameters
    ----------
    text : str
        String with the labels of a document

    Returns
    -------
    str
        Filtered labels
    """
    
    clean_labels = []
    labels = txt.split()
    
    if len(labels) > 1:
        clean_labels = [l for l in labels if l != '0']
    else:
        clean_labels = labels
        
    return " ".join(clean_labels)

In [357]:
df_enriched['labels'] = df_enriched.apply(lambda row : clean_labels(row['labels']), axis=1)

Split the labels in different columns

In [361]:
split_labels = df_enriched['labels'].str.split(" ",expand=True)
df_enriched['label_1'] = split_labels[0]
df_enriched['label_2'] = split_labels[1]
df_enriched

Unnamed: 0,id,lemmatize,topics,labels,label_1,label_2
0,2000005,head heart development closely intertwined emb...,development cells neural,0,0,
1,2000009,national academy engineering identified solvin...,education ecosystem community,0,0,
2,2000012,provides years funding help defray expenses pa...,theory analysis texas,1,1,
3,2000021,collaborative involving michigan north texas m...,stem faculty women,0,0,
4,2000028,initiation awards provide support junior midca...,vcp binders proteins,0,0,
...,...,...,...,...,...,...
13149,2055767,recent studies highlighted nations increasing ...,generation center manufacturing,0,0,
13150,2055771,links mathematical fields dynamics operator al...,operator dynamical algebra,0,0,
13151,2055772,recent years seen dramatic rise mobile apps he...,privacy health apps,0,0,
13152,2055773,recent years seen dramatic rise mobile apps he...,privacy health apps,0,0,


### Analysis of results

Doing a random sample it's clear that the clustering has a good sense of topics with this 3 documents, because all of them talk about cutting edge technologies used to solve a specific problem in different fields of study.

But, there's a lot of room to improvement, with more time to analyze the documents, the pre process could implement a way to analyze the acronyms that hide a lot of value about the main topic of the documents, also the model to generate the word vectors could be train specifically for the documents that we're processing in order to have a domain specific model. 

In [365]:
df_enriched[df_enriched['label_1']=='1'].sample(3,random_state=5)

Unnamed: 0,id,lemmatize,topics,labels,label_1,label_2
930,2003109,support chemical measurement imaging program d...,nmr mri warren,1,1,
3367,2013562,quantum computing quantum computers type compu...,davis quantum computer,1,1,
12585,2053096,denser liquid sink lighter vinegar sinks oil s...,water amoc atlantic,1,1,


In [367]:
df_abstracts = pd.read_csv('./df_abstracts.csv')

In [371]:
df_abstracts[df_abstracts['id'].isin([2003109,2013562,2053096])]['abstract'].values

array(['with support from the chemical measurement and imaging program in the division of chemistry, and co-funding from the atomic, molecular, and optical experimental physics program in the division of physics, professor warren warren and his group at duke university are working to expand the utility and accessibility of “hyperpolarized” nuclear magnetic resonance (nmr) spectroscopy. nmr is a powerful tool for chemists; it is useful for determining molecular structure and for monitoring the progress of chemical reactions. nmr\'s clinical cousin, magnetic resonance imaging (mri) is an important tool for producing images of soft tissues in the body. however, both methods usually suffer from low sensitivity - meaning that they cannot detect small amounts of sample or low concentrations. "hyperpolarization" methods can increase nmr signals by a factor of 1000 or more, but are usually technically challenging and extremely expensive. the warren group is exploring the fundamental chemistry 