# Case study - Clustering scientific papers according to their abstracts

From https://archive.ics.uci.edu/ml/datasets/NSF+Research+Award+Abstracts+1990-2003, download the Part1 file from the Data Folder. The file contains basic information and abstracts for a set of scientific papers. Task: cluster the papers based on their abstracts.

In [1]:
from glob import glob
import numpy as np
import pandas as pd
import re

In [3]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer

In [34]:
from multiprocessing import Pool
import time
import string

In [95]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score,make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [164]:
import functools  # needed for functools.partial in the grid search below
from itertools import product

### Get paths to all documents in subfolders

In [4]:
file_paths = glob('./Data2/Part1/*/*/*.txt')

In [5]:
file_paths[:5]

['./Data2/Part1/awards_1990/awd_1990_23/a9023383.txt',
 './Data2/Part1/awards_1990/awd_1990_23/a9023319.txt',
 './Data2/Part1/awards_1990/awd_1990_23/a9023464.txt',
 './Data2/Part1/awards_1990/awd_1990_23/a9023681.txt',
 './Data2/Part1/awards_1990/awd_1990_23/a9023335.txt']
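The glob pattern above only matches `.txt` files that sit exactly two directory levels below `Part1`. A minimal sketch of that behaviour, assuming a throwaway directory tree built with `tempfile` (the file names `notes.csv` and `readme.txt` are made up for illustration):

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    # hypothetical miniature of the Part1 layout
    deep = Path(root, "Part1", "awards_1990", "awd_1990_23")
    deep.mkdir(parents=True)
    (deep / "a9023383.txt").write_text("Title : example")
    (deep / "notes.csv").write_text("wrong extension, not matched")
    (deep.parent / "readme.txt").write_text("one level too shallow, not matched")

    # same shape as './Data2/Part1/*/*/*.txt'
    matched = sorted(glob(str(Path(root, "Part1", "*", "*", "*.txt"))))
    matched_names = [Path(p).name for p in matched]

print(matched_names)
```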

In [69]:
example_path = file_paths[4]

In [70]:
with open(example_path,'r') as f:
    example_file = f.read()

In [71]:
print(example_file)

Title       : Faculty Awards for Women:Mathematical Sciences: Stochastic Processes and
               Applications
Type        : Award
NSF Org     : HRD 
Latest
Amendment
Date        : December 13,  1995  
File        : a9023335

Award Number: 9023335
Award Instr.: Continuing grant                             
Prgm Manager: Margrete S. Klein                       
	      HRD  DIVISION OF HUMAN RESOURCE DEVELOPMENT  
	      EHR  DIRECT FOR EDUCATION AND HUMAN RESOURCES
Start Date  : November 1,  1991   
Expires     : October 31,  1997    (Estimated)
Expected
Total Amt.  : $250000             (Estimated)
Investigator: Ruth J. Williams williams@math.ucsd.edu  (Principal Investigator current)
Sponsor     : U of Cal San Diego
	      9500 Gilman Drive, Dept. 0934
	      La Jolla, CA  920930934    858/534-0246

NSF Program : 9292      FACULTY AWARDS FOR WOMEN
Fld Applictn: 0000099   Other Applications NEC                  
              21        Mathematics                             
Progr

### Extract title and abstract

In [9]:
example_file = re.sub(r'(\s+\n\s+)','',example_file)

In [10]:
example_file

'Title       : Reaction Dynamics of Small Cyclic Hydrocarbons\nType        : Award\nNSF Org     : CHE \nLatest\nAmendment\nDate        : December 16,  1992  \nFile        : a9023319\n\nAward Number: 9023319\nAward Instr.: Continuing grant                             \nPrgm Manager: Seymour LapporteCHE  DIVISION OF CHEMISTRYMPS  DIRECT FOR MATHEMATICAL & PHYSICAL SCIEN\nStart Date  : February 15,  1991  \nExpires     : July 31,  1994       (Estimated)\nExpected\nTotal Amt.  : $155442             (Estimated)\nInvestigator: David K. Lewis   (Principal Investigator current)\nSponsor     : Colgate University\n\t      13 Oak Drive\n\t      Hamilton, NY  13346    315/228-1000\n\nNSF Program : 1942      UNIMOLECULAR PROCESSES\nFld Applictn: 0000099   Other Applications NEC12        Chemistry                               \nProgram Ref : 9141,9178,9229,\nAbstract    :\n              This grant from the Organic Dynamics Program supports the researchof Professor David Lewis at Colgate University.

In [11]:
m = re.findall(r'(\s+\n)|(\n\s+)',example_file)

In [12]:
m

[(' \n', ''),
 ('  \n', ''),
 ('\n\n', ''),
 ('                             \n', ''),
 ('  \n', ''),
 ('', '\n\t      '),
 ('', '\n\t      '),
 ('\n\n', ''),
 ('                               \n', ''),
 ('', '\n              ')]

In [27]:
def get_title_abstract(path):
    '''
    Return the (title, abstract) pair parsed from the award file at `path`.
    Returns ('', '') if the file cannot be read or parsed.
    '''
    try:
        with open(path, 'r', encoding='latin1') as f:
            raw = f.read()
        # join wrapped lines: whitespace around a line break becomes a single space
        text = re.sub(r'(\s+\n\s+)', ' ', raw)
        text = re.sub(r'(\s+\n)|(\n\s+)', ' ', text)
        # each field runs from its label to the next remaining line break
        title = re.findall(r'Title\s+:(.*)', text)[0].strip()
        abstract = re.findall(r'Abstract\s+:(.*)', text)[0].strip()

    except Exception as e:
        print('Failed reading file %s' % path)
        print('Error: %s' % e)
        return '', ''

    return title, abstract
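To see what the two substitutions and the two `findall` calls do, the same regular expressions can be run on an in-memory string. This is only a sketch: `extract_title_abstract` is a hypothetical helper that mirrors the function above without touching the disk, and `sample` is a shortened, made-up record in the same layout as the award files.

```python
import re

def extract_title_abstract(raw):
    # hypothetical string-based variant of get_title_abstract
    text = re.sub(r'(\s+\n\s+)', ' ', raw)
    text = re.sub(r'(\s+\n)|(\n\s+)', ' ', text)
    title = re.findall(r'Title\s+:(.*)', text)[0].strip()
    abstract = re.findall(r'Abstract\s+:(.*)', text)[0].strip()
    return title, abstract

# made-up record: the abstract is wrapped over two indented lines
sample = (
    "Title       : Reaction Dynamics of Small Cyclic Hydrocarbons\n"
    "Type        : Award\n"
    "Abstract    :\n"
    "              This grant supports research\n"
    "              on reaction dynamics.\n"
)

title, abstract = extract_title_abstract(sample)
print(title)
print(abstract)
```

The indented continuation lines are folded into one line, so `(.*)` captures the whole abstract, while the bare line break after the title stops the title match.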

## Multiprocessing

In [23]:
len(file_paths)# number of abstracts

51760

In [39]:
# check how multiprocessing performs with different numbers of processes
# on a sample of 10,000 abstracts
for process in range(1,5):
    start = time.time()
    p = Pool(processes=process)
    result = p.map(get_title_abstract,file_paths[:10000])
    p.close()
    p.join()
    print('With {} processes it takes {:.1f} seconds'.format(process,time.time() - start))

With 1 processes it takes 7.6 seconds
With 2 processes it takes 4.2 seconds
With 3 processes it takes 4.2 seconds
With 4 processes it takes 4.3 seconds
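The timings above can be turned into speedup and per-process efficiency figures, which makes it easier to see that nothing is gained beyond two processes. A small sketch using only the numbers printed above (the names `speedup` and `efficiency` are illustrative):

```python
# wall-clock seconds reported above, keyed by number of processes
timings = {1: 7.6, 2: 4.2, 3: 4.2, 4: 4.3}

baseline = timings[1]
# speedup relative to a single process
speedup = {n: baseline / t for n, t in timings.items()}
# efficiency: speedup divided by the number of processes used
efficiency = {n: speedup[n] / n for n in timings}

for n in timings:
    print(f"{n} processes: speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```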


In [73]:
# dual-core CPU with hyperthreading, so 2 processes are used;
# only the first 20,000 files are processed
p = Pool(processes=2)
result = p.map(get_title_abstract,file_paths[:20000])
p.close()
p.join()

data = list(result)

### Abstracts preprocessing

In [74]:
# load the stopword list once, instead of once per token
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def abstract_preprocessing(abstract):
    # collapse runs of whitespace into single spaces
    abstract_no_spaces = re.sub(r'(\s+)',' ',abstract)
    # convert to lowercase
    abstract_lower = abstract_no_spaces.lower()
    # tokenization
    abstract_tokenized = nltk.word_tokenize(abstract_lower)
    # remove punctuation tokens
    abstract_no_punc = [word for word in abstract_tokenized if word not in string.punctuation]
    # remove stopwords
    abstract_no_stopwords = [word for word in abstract_no_punc if word not in STOPWORDS]
    # join the remaining words back into one string
    return ' '.join(abstract_no_stopwords)
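The preprocessing steps can be illustrated without NLTK data. The sketch below is a simplified stand-in, not the function above: it assumes a regex tokenizer in place of `nltk.word_tokenize` and a tiny hand-picked `STOPWORDS_SKETCH` set in place of the NLTK English stopword list, so its output on real abstracts will differ.

```python
import re
import string

# assumption: a tiny stopword set standing in for nltk.corpus.stopwords.words('english')
STOPWORDS_SKETCH = {"the", "of", "at", "from", "this", "a", "an", "and", "in", "to"}

def preprocess_sketch(text):
    # collapse whitespace and lowercase
    text = re.sub(r"\s+", " ", text).lower()
    # assumption: simple regex tokenizer (words, or single punctuation marks)
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # drop punctuation tokens, then stopwords
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t not in STOPWORDS_SKETCH]
    return " ".join(tokens)

sample = "This grant from the Organic Dynamics Program supports   the research of Professor David Lewis."
cleaned = preprocess_sketch(sample)
print(cleaned)
```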

In [75]:
# get only abstracts
abstracts = [art[1] for art in data]

In [76]:
pool = Pool(processes=2)
result = pool.map(abstract_preprocessing,abstracts)
pool.close()
pool.join()

### TF-IDF + SVD + MiniBatchKMeans - hyperparameter tuning

In [79]:
titles = [art[0] for art in data]
# use the preprocessed abstracts returned by pool.map above, not the raw ones in data
abstracts = result

In [80]:
abstracts[0]

'project supports cultural anthropologist university south florida analysis network structure public private social service agencies tampa bay area using network algorithms analyze function shape interactions agencies project describe analyze hierarchy interactions 500 agencies research important communities developed world multiple social service agencies interacting various levels government social institutions schools general public understanding interactions structured hierarchical manner advance abilities decision makers allocate resources efficiently'

In [176]:
def cluster_abstracts(vectorizer_min_df,vectorizer_max_df,dim_red_n_components_arg,clustering_n_clusters_arg,X):
    
    #TF-IDF
    tf_idf = TfidfVectorizer(min_df=vectorizer_min_df, max_df=vectorizer_max_df) 
    dtm = tf_idf.fit_transform(X)
    
    #dimension reduction
    svd = TruncatedSVD(n_components=dim_red_n_components_arg)
    dtm_svd_ = svd.fit_transform(dtm)
    
    #clustering
    mini_batch = MiniBatchKMeans(n_clusters=clustering_n_clusters_arg,
                                n_init = 1,
                                init_size = 1000,
                                batch_size = 1000)
    labels = mini_batch.fit_predict(dtm_svd_)
    
    sil_score = silhouette_score(dtm_svd_,labels)

    return(vectorizer_min_df,vectorizer_max_df,dim_red_n_components_arg,clustering_n_clusters_arg,sil_score)

In [177]:
vectorizer_min_df_arg = [10,20]
vectorizer_max_df_arg = [0.2,0.5,0.8]
dim_red_n_components_arg = [200,300]
clustering_n_clusters_arg = [20,40,60]

In [178]:
pool = Pool(2)
results = pool.starmap(functools.partial(cluster_abstracts,X = abstracts),
                       product(vectorizer_min_df_arg,
                               vectorizer_max_df_arg,
                               dim_red_n_components_arg,
                               clustering_n_clusters_arg))
pool.close()
pool.join()
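The grid search relies on two standard-library pieces: `itertools.product` enumerates every parameter combination, and `functools.partial` fixes the `X` argument so that `starmap` only has to supply the four varying parameters. A sketch of that mechanism, using the sequential `itertools.starmap` instead of a process pool and a hypothetical `describe` function in place of `cluster_abstracts`:

```python
import functools
from itertools import product, starmap

min_df_values = [10, 20]
max_df_values = [0.2, 0.5, 0.8]
n_components_values = [200, 300]
n_clusters_values = [20, 40, 60]

# every combination of the four parameter lists: 2 * 3 * 2 * 3 tuples
grid = list(product(min_df_values, max_df_values,
                    n_components_values, n_clusters_values))

def describe(min_df, max_df, n_components, n_clusters, X):
    # hypothetical stand-in for cluster_abstracts: reports what it received
    return (min_df, max_df, n_components, n_clusters, len(X))

# X is bound once; starmap unpacks each grid tuple into the other arguments
worker = functools.partial(describe, X=["doc one", "doc two"])
outputs = list(starmap(worker, grid))

print(len(grid), outputs[0], outputs[-1])
```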

In [180]:
sorted(results, key=lambda tup: tup[4],reverse=True)[:5]

[(10, 0.2, 200, 60, 0.10027918043704515),
 (20, 0.8, 200, 60, 0.09598451606225265),
 (10, 0.8, 200, 60, 0.09561062588394344),
 (10, 0.5, 200, 60, 0.0951825009441485),
 (20, 0.5, 200, 40, 0.09417496613666532)]
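The last element of each tuple is the mean silhouette score. For a point, it is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the lowest mean distance to the points of any other cluster. Scores range from -1 to 1, and values close to 0, as in the results above, suggest overlapping clusters. The following pure-Python sketch illustrates the formula on made-up 2-D points. It is not the scikit-learn implementation, and `silhouette_sketch` is a hypothetical name.

```python
import math

def silhouette_sketch(points, labels):
    # assumes Euclidean distance and at least two distinct labels
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        # distances to the other members of the point's own cluster
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        a = sum(own) / len(own)
        # lowest mean distance to any other cluster
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_sketch(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_sketch(points, [0, 1, 0, 1])   # labels cut across the groups
print(round(good, 3), round(bad, 3))
```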

In [224]:
#TF-IDF
tf_idf = TfidfVectorizer(min_df=10, max_df=0.2) 
dtm = tf_idf.fit_transform(abstracts)
    
#dimension reduction
svd = TruncatedSVD(n_components=200)
dtm_svd_ = svd.fit_transform(dtm)
    
#clustering (n_clusters=40 is used here, although the top grid-search result above used 60)
mini_batch = MiniBatchKMeans(n_clusters=40,
                             n_init = 1,
                             init_size = 1000,
                             batch_size = 1000)
labels = mini_batch.fit_predict(dtm_svd_)
    
sil_score = silhouette_score(dtm_svd_,labels)

In [225]:
df = pd.DataFrame(
{
    'title':titles,
    'abstract':abstracts,
    'label':labels
})

In [226]:
df.head()

| | title | abstract | label |
|---|---|---|---|
| 0 | Network Approach to Levels of Integration | project supports cultural anthropologist unive... | 10 |
| 1 | Reaction Dynamics of Small Cyclic Hydrocarbons | grant organic dynamics program supports resear... | 2 |
| 2 | Faculty Awards for Women. Scientific Inquiry i... | award provides support dr. susan feigenbaum na... | 10 |
| 3 | FAW: Biophysical Suudies of Eucaryotic mRNA Tr... | dr. goss research goals understanding regulati... | 20 |
| 4 | Faculty Awards for Women:Mathematical Sciences... | faculty award women scientists engineers made ... | 25 |


In [227]:
ex = 5
for label in range(max(labels) + 1):  # + 1 so the highest label is not skipped
    nmb = len(df.loc[df.label == label,'title'])
    print('There are {} abstracts with label {}'.format(nmb,label))
        
    for it in range(min(ex,nmb)):
        print(df.loc[df.label == label,'title'].iloc[it])
    print()

There are 869 abstracts with label 0
Core and Non-Core Measurements of Sediment Composition and Burial Rate and in Response to Climate Change on Time Scalesof 10.4 - 10.5 Years
Faculty Awards for Women: Tracing Sources and Sinks of Particles in the Ocean
Reactive Radionuclides as Indicators of New Production and Particle Transformations in the Equatorial Pacific Water Column: JGOFS
The Effect of Magnetic Field Structure on Wave Flux in the Solar Atmosphere
Faculty Award for Women - Energetic Winds and Disk Accretionin Low Mass Young Stars

There are 947 abstracts with label 1
Financial and Compliance Audit Services
Travel of India-Scientists under the U.S.-India Exchange of Scientists Program
Financial and Compliance Audit Services
Travel of U.S.-Scientist under the U.S-India Exchange of Scientists Programs
Financial and Compliance Audit Services

There are 787 abstracts with label 2
Reaction Dynamics of Small Cyclic Hydrocarbons
Structural Recognition in Complexation of Organic Compou