### Case Study #2: Develop a document clustering model and methods to qualify clusters

<b> Case study </b>

You have been asked to explore whether clustering techniques can be used to group the documents of a database. As presented in the context section, the objective is to develop systematic methods to qualify the output clusters.

The client has already run a more classic topic modeling algorithm and wants to challenge it, especially with respect to topic qualification. As this is an exploratory study, you are asked to test different methods and assess how much they improve on the current algorithm and on the qualification of clusters.

Keep in mind that a project for another client on the same topic is currently in the pipeline. Making the methodology and code you develop here reusable is therefore crucial: it will let you solve several similar problems later (and save you some sleep…).

<b>Data </b>

The corpus is composed of abstracts of articles extracted from PubMed, a free database of citations and abstracts for biomedical and life sciences literature.

The focus is on the modeling section, which is why the data pre-processing is already implemented in this notebook.

<b> Some good practices </b>

It is important to keep a Notebook organised. Some quick reminders:
- <b> Use functions</b>: code that is used several times can then be modified in a single place, and the Notebook is easier to read,
- <b> Structure your Notebook</b>: first the packages, then the loading, then the preparation, then the modeling, the assessment, etc. Don't hesitate to work only in a main function that calls functions split into different parts,
- <b> Comment your code</b>: a stranger (with the relevant knowledge) should be able to understand what you did.
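
The practices above can be sketched as a small skeleton: one function per step and a single `main` entry point that chains them in notebook order. The function names and the toy input are purely illustrative assumptions, not the functions used later in this notebook.

```python
def load_data(records):
    """Loading step: placeholder that simply copies in-memory records."""
    return list(records)


def prepare(texts):
    """Preparation step: strip surrounding spaces and lowercase each text."""
    return [text.strip().lower() for text in texts]


def main(records):
    """Single entry point chaining the steps in notebook order."""
    texts = load_data(records)
    return prepare(texts)


# Toy input, only to show how the pieces fit together
result = main(["  Diabetes and Nutrition ", "Cardiology"])
print(result)
```

Each step can then be changed or tested on its own without touching the rest of the Notebook.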

## 1 - Import libraries

In [1]:
import pandas as pd
import numpy as np
import gensim
import nltk
import warnings
import re

from Bio import Entrez
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
from gensim.utils import simple_preprocess
from gensim import models
from gensim.parsing.preprocessing import STOPWORDS

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

warnings.filterwarnings("ignore")

[nltk_data] Downloading package punkt to
[nltk_data]     /home/quinten/Utilisateurs/chlioui/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/quinten/Utilisateurs/chlioui/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/quinten/Utilisateurs/chlioui/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## 2 - Extracting abstracts

In order to extract the articles, we use Biopython, a widely used Python library for processing biological data; its `Entrez` module gives access to the NCBI databases, including PubMed.

If you want to skip the extraction section, you can start running the code from the cell 'Get abstracts'. The file <i> CS2_Article_Clustering.xlsx </i> is provided. 
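
If you skip the extraction, loading the provided file could look like the sketch below. It assumes that the file has the same columns as the DataFrame built by the extraction code, and that list-valued cells (keywords, structured abstracts) were saved as their string representation; both points are assumptions to check against the actual file. `parse_list_cell` and `load_abstracts` are hypothetical helper names.

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn a stringified list such as "['a', 'b']" back into a list.

    Anything that is not a valid Python list literal is returned unchanged.
    """
    if isinstance(value, str) and value.startswith('['):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            return value  # e.g. a title or abstract that merely starts with '['
        if isinstance(parsed, list):
            return parsed
    return value


def load_abstracts(path='CS2_Article_Clustering.xlsx'):
    """Load the provided file instead of querying PubMed (.xlsx needs openpyxl)."""
    abstracts = pd.read_excel(path)
    for column in ('Keywords', 'text'):  # assumed column names
        if column in abstracts.columns:
            abstracts[column] = abstracts[column].apply(parse_list_cell)
    return abstracts
```

Calling `load_abstracts()` from the folder that contains the file would then replace the extraction cells.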

In [2]:
def search(query):
    """ Search PubMed for articles matching the keyword 'query'.
    The database is specified by the parameter 'db'.
    The maximum number of retrieved articles is specified by the parameter 'retmax'.
    The reason for declaring YOUR_EMAIL address is to allow the NCBI to
    contact you before blocking your IP, in case you’re violating the guidelines.
    """
    Entrez.email = 'YOUR_EMAIL'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='1000',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    handle.close()  # release the connection once the results are parsed
    return results

In [3]:
def fetch_details(id_list):
    """ Fetch the details of a list of article IDs.
    The reason for declaring YOUR_EMAIL address is to allow the NCBI to
    contact you before blocking your IP, in case you’re violating the guidelines.
    """
    ids = ','.join(id_list)  # the IDs are sent as one comma-separated string
    Entrez.email = 'YOUR_EMAIL'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    handle.close()  # release the connection once the results are parsed
    return results
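
The function above sends all IDs in a single request. For longer ID lists, a common pattern is to split the list into batches and pause between requests. The following is a minimal sketch of that idea, not part of the original notebook: `batched_ids`, `fetch_in_batches`, the batch size and the pause are illustrative assumptions, and the demonstration runs offline with a stand-in function instead of a real call to the NCBI.

```python
import time


def batched_ids(id_list, batch_size=200):
    """Yield successive slices of id_list, each with at most batch_size IDs."""
    for start in range(0, len(id_list), batch_size):
        yield id_list[start:start + batch_size]


def fetch_in_batches(id_list, fetch, batch_size=200, pause=0.5):
    """Call fetch on each batch and collect the results in order.

    fetch is any callable taking a list of IDs, e.g. fetch_details above.
    pause is an illustrative delay (in seconds) after each request.
    """
    results = []
    for batch in batched_ids(id_list, batch_size):
        results.append(fetch(batch))
        time.sleep(pause)
    return results


# Offline demonstration: len stands in for the real fetch function
demo = fetch_in_batches(['1', '2', '3', '4', '5'], fetch=len,
                        batch_size=2, pause=0)
print(demo)  # [2, 2, 1]
```

Passing the fetch function as an argument keeps the batching logic testable without network access.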

In [4]:
def collect_data(categories):
    """ Get abstracts for each category in 'categories'.

    return: abstracts (DataFrame)
    ----------
    article_ID : ID of the article (PMID)
    text : abstract of the article if it exists, else ''
    category : category of the article
    structured : 0 if the abstract is a single block of text, else 1
    Keywords : keywords of the article if they exist
    Title : title of the article
    """
    columns = ['article_ID', 'Title', 'Keywords', 'text', 'category',
               'structured']
    rows = []  # one dict per article (DataFrame.append was removed in pandas 2.0)
    for cat in categories:
        results = search(cat)  # get the articles for the category 'cat'
        id_list = results['IdList']  # select the IDs
        if len(id_list) > 0:
            papers = fetch_details(id_list)  # get details of articles
            for pubmed_article in papers['PubmedArticle']:
                s = 1  # structured by default (also when there is no abstract)
                medline_citation = pubmed_article['MedlineCitation']
                pmid = int(str(medline_citation['PMID']))
                article = medline_citation['Article']
                keywords = medline_citation['KeywordList']
                title = article['ArticleTitle']
                if len(keywords) > 0:
                    keywords = list(keywords[0])  # keep the first keyword list
                if 'Abstract' in article:
                    abstract = article['Abstract']['AbstractText']
                    if len(abstract) == 1:
                        abstract = abstract[0]  # single block: unstructured
                        s = 0
                else:
                    abstract = ''
                rows.append({'article_ID': pmid, 'text': abstract,
                             'category': cat, 'structured': s,
                             'Keywords': keywords, 'Title': title})  # store the abstract
    return pd.DataFrame(rows, columns=columns)
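
Note that `collect_data` keeps a structured abstract as a list of sections and an unstructured one as a single string, so the `text` column mixes two types. A minimal sketch of how both shapes could be brought to one string is given below; `abstract_to_text` is a hypothetical helper, not necessarily the approach used in the pre-processing of this notebook.

```python
def abstract_to_text(abstract):
    """Return the abstract as one string, whatever its original shape."""
    if isinstance(abstract, (list, tuple)):
        # Structured abstract: join the sections in their original order
        return ' '.join(str(section) for section in abstract)
    return str(abstract)


print(abstract_to_text(['Background text.', 'Results text.']))
print(abstract_to_text('Single paragraph.'))
print(repr(abstract_to_text('')))  # a missing abstract stays an empty string
```

Applied to the whole column, this would read `abstracts['text'].apply(abstract_to_text)`.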

In the cell below, we call the function **collect_data** in order to get the abstracts.

In [5]:
%%time
# categories (French medical specialties, used as search terms) for which we want articles
categories = ['cancérologie', 'cardiologie', 'gastro',
              'diabétologie', 'nutrition', 'infectiologie',
              'gyneco-repro-urologie', 'pneumologie', 'dermatologie',
              'industrie de santé', 'ophtalmologie']

# call the function collect_data to get the abstracts
abstracts = collect_data(categories)

CPU times: user 1min 18s, sys: 1.9 s, total: 1min 20s
Wall time: 3min 4s


### Exploring abstracts

In [6]:
abstracts.shape

(8996, 6)

In [7]:
abstracts.head(10)

Unnamed: 0,article_ID,Title,Keywords,text,category,structured
0,30348261,"[Early phase trials at the ""Institut de cancér...","[Early phase trials, Essai thérapeutique de ph...",[Early phase therapeutic trials in oncology ar...,cancérologie,1
1,28414610,Clinical Calculator for Early Mortality in Met...,[],Purpose Factors contributing to early mortalit...,cancérologie,0
2,24461451,[Individual lung cancer screening in practice....,[],,cancérologie,1
3,25287828,"Prospective, randomized, multicenter, phase II...",[],"[To compare epirubicin, cisplatin, and capecit...",cancérologie,1
4,25241229,Gemcitabine plus cisplatin versus chemoradioth...,"[Biliary tract cancer, Chemoradiotherapy, Cisp...",[Chemoradiotherapy (CHRT) is often advocated f...,cancérologie,1
5,26970507,[Commitment of The Bulletin du Cancer and the ...,[],,cancérologie,1
6,24433843,Feasibility of preoperative and postoperative ...,"[Adjuvant treatment, Chemo radiotherapy, Gastr...","[For resectable gastric cancer, both postopera...",cancérologie,1
7,22926014,Impact of primary tumour resection on survival...,[],[To assess the impact of primary tumour resect...,cancérologie,1
8,19699449,[Clamping modalities during partial nephrectom...,[],Partial nephrectomy requires control of renal ...,cancérologie,0
9,21969501,Alternative end points to evaluate a therapeut...,[],[Progression-free survival (PFS) is not an opt...,cancérologie,1


In [8]:
abstracts.groupby(['category']).count()

category,article_ID,Title,Keywords,text,structured
cancérologie,1000,1000,1000,1000,1000
cardiologie,1000,1000,1000,1000,1000
dermatologie,1000,1000,1000,1000,1000
diabétologie,1000,1000,1000,1000,1000
gastro,998,998,998,998,998
infectiologie,1000,1000,1000,1000,1000
nutrition,1000,1000,1000,1000,1000
ophtalmologie,998,998,998,998,998
pneumologie,1000,1000,1000,1000,1000
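
In the output above, `count()` returns the number of non-null values of every column, which is why all columns show the same figure. Since missing abstracts are stored as empty strings rather than nulls, they are not visible in these counts. The sketch below, on made-up toy rows, shows `size()` as a more compact per-group count and one possible way to count empty abstracts; the variable names are illustrative.

```python
import pandas as pd

# Toy data with made-up rows, only to illustrate the calls
toy = pd.DataFrame({
    'category': ['cardiologie', 'cardiologie', 'nutrition'],
    'text': ['Some abstract.', '', 'Another abstract.'],
    'structured': [1, 1, 0],
})

# One number per category instead of one identical column per field
articles_per_category = toy.groupby('category').size()

# Empty strings are not NaN, so count() does not reveal missing abstracts
empty_per_category = toy['text'].eq('').groupby(toy['category']).sum()

print(articles_per_category)
print(empty_per_category)
```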


In [9]:
abstracts.groupby(['category', 'structured']).count()

category,structured,article_ID,Title,Keywords,text
cancérologie,0,511,511,511,511
cancérologie,1,489,489,489,489
cardiologie,0,292,292,292,292
cardiologie,1,708,708,708,708
dermatologie,0,291,291,291,291
dermatologie,1,709,709,709,709
diabétologie,0,510,510,510,510
diabétologie,1,490,490,490,490
gastro,0,518,518,518,518
gastro,1,480,480,480,480
