pubmed_clustering

Solution to the coding part of the interview process at Invitae for the position of Software Engineer, Scientific IR.

The solution was developed with Python 3.6 and the Anaconda Python Distribution on Windows OS.

Installation steps:

Optional: create and activate a new virtual environment.

With Anaconda:

 conda create -n env_name python=3.6
 activate env_name

Alternatively:

 python3 -m venv env_name
 source env_name/bin/activate

Then install the dependencies. As some of them are not available in the conda repo, we use pip to install all libraries:

 pip install -r requirements.txt

Run the code:

a. python main.py -m train -k N

  Explores the input feature space, then trains and serializes separate HDBSCAN and K-Means models for each of the following input components:

    ['title'],
    ['abstract'],
    ['title', 'abstract'],
    ['title', 'NE'],
    ['title', 'abstract', 'NE']

  For each individual input component, n-gram ranges from (1,1) to (1,4) are evaluated. The best-performing n-gram configuration is serialized for each input component.
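  The search space described above can be sketched as follows. This is a minimal illustration, not the code in main.py: the names COMPONENTS, NGRAM_RANGES, build_text and configurations are hypothetical, and the sample record is made up. It assumes each record is a dict keyed by component name.

```python
from itertools import product

# Input components listed above.
COMPONENTS = [
    ["title"],
    ["abstract"],
    ["title", "abstract"],
    ["title", "NE"],
    ["title", "abstract", "NE"],
]

# N-gram ranges from (1, 1) up to (1, 4).
NGRAM_RANGES = [(1, n) for n in range(1, 5)]


def build_text(record, components):
    """Join the selected fields of one record into a single string (assumed layout)."""
    return " ".join(record[field] for field in components if record.get(field))


def configurations():
    """Yield every (components, ngram_range) pair that would be evaluated."""
    yield from product(COMPONENTS, NGRAM_RANGES)


# Made-up record, for illustration only.
record = {"title": "Gene variants", "abstract": "We study variants.", "NE": "BRCA1"}
print(build_text(record, ["title", "NE"]))
print(len(list(configurations())), "configurations")
```

  Each pair would then be vectorized, clustered, and scored; only the best-scoring n-gram range per component is kept.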
  
  
  After training and serialization, the results are available:
     I. As a CSV file in reports/results/results.csv, which stores the homogeneity score, adjusted mutual information, and inter- and intra-cluster variance for each input component.
     II. As t-SNE visualizations in img/.
     III. As clustering results obtained by applying the best-performing combinations for both HDBSCAN and K-Means, in report/cluster_reports/clusters_ground_truth_HDBSCAN.tsv and report/cluster_reports/clusters_ground_truth_KMEANS.tsv.
     IV. As serialized models in prepared/utils. Models are serialized with sklearn.externals.joblib.
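     For readers unfamiliar with the homogeneity score reported in the CSV file, the sketch below computes it from its standard definition, h = 1 - H(C|K) / H(C), using only the standard library. It is an illustration of the metric, not necessarily how this project computes it (a library implementation is the more likely choice), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    total = len(labels)
    return -sum((n / total) * math.log(n / total) for n in Counter(labels).values())


def homogeneity(labels_true, labels_pred):
    """Return 1 - H(C|K) / H(C); 1.0 means every cluster holds a single class."""
    total = len(labels_true)
    h_c = entropy(labels_true)
    if h_c == 0.0:
        # Only one class present: any clustering is trivially homogeneous.
        return 1.0
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # Conditional entropy of the classes given the cluster assignments.
    h_c_given_k = -sum(
        (n / total) * math.log(n / cluster_sizes[k]) for (_, k), n in joint.items()
    )
    return 1.0 - h_c_given_k / h_c


# Toy labels, for illustration only.
print(homogeneity([0, 0, 1, 1], [0, 0, 1, 1]))
print(homogeneity([0, 0, 1, 1], [0, 0, 0, 0]))
```

     A score near 1 means clusters are pure with respect to the ground truth labels; it does not penalize splitting one class across several clusters, which is why it is reported alongside adjusted mutual information.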
     
     Additionally, analysis reports for the two evaluated clustering algorithms are printed to the screen, including the homogeneity score (HS) and adjusted mutual information (AMI), the top N words per cluster, and each cluster ID with the PMIDs it contains. N is set with -k and defaults to 10.
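     The per-cluster part of such a report (top N words and the PMIDs in each cluster) can be sketched as below. This is an assumption-laden illustration: summarize_clusters is a hypothetical helper, tokenization is a plain lowercase whitespace split, and the PMIDs and texts are placeholders rather than real records.

```python
from collections import Counter, defaultdict


def summarize_clusters(pmids, texts, labels, top_n=10):
    """Group documents by cluster label and list the most frequent tokens per cluster."""
    tokens_by_cluster = defaultdict(Counter)
    pmids_by_cluster = defaultdict(list)
    for pmid, text, label in zip(pmids, texts, labels):
        # Naive tokenization; the real pipeline may use a vectorizer vocabulary instead.
        tokens_by_cluster[label].update(text.lower().split())
        pmids_by_cluster[label].append(pmid)
    return {
        label: {
            "top_words": [w for w, _ in tokens_by_cluster[label].most_common(top_n)],
            "pmids": pmids_by_cluster[label],
        }
        for label in sorted(tokens_by_cluster)
    }


# Placeholder data, for illustration only.
summary = summarize_clusters(
    pmids=[1, 2, 3],
    texts=["cancer gene cancer", "gene cancer", "heart"],
    labels=[0, 0, 1],
    top_n=1,
)
for cluster_id, info in summary.items():
    print(cluster_id, info["top_words"], info["pmids"])
```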

b. python main.py -m test -k N

 Runs inference on the test set (i.e. the unlabeled ground truth) with the best-performing HDBSCAN and K-Means models. ['title', 'abstract', 'NE'] is used as the default input component.
 
 After inference, the results (same format and content as above) are available in report/cluster_reports/clusters_test_set_HDBSCAN.tsv and report/cluster_reports/clusters_test_set_KMEANS.tsv.
 
 Additionally, analysis reports for the two evaluated clustering algorithms are printed to the screen, including HS and AMI values, the top N words per cluster, and each cluster ID with the PMIDs it contains. N defaults to 10.

c. python main.py -m external -k N -i input_file -n corpus_name

with:

 -k: number of most frequent tokens in obtained clusters

 -i: path to the input file, in the same format as the unlabeled ground truth corpus
 
 -n: (optional) name of the corpus. If no name is given external_{timestamp} is used. 

 Generated reports are available in report/cluster_reports/cluster_{corpus_name}_{timestamp}_HDBSCAN.csv and report/cluster_reports/cluster_{corpus_name}_{timestamp}_KMEANS.csv.
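 For orientation, the command-line options above could be parsed roughly as follows. This is a sketch with argparse, not the parser in main.py: the function name parse_args, the dest names, and the timestamp format are assumptions.

```python
import argparse
from datetime import datetime


def parse_args(argv=None):
    """Parse the -m, -k, -i and -n options described above."""
    parser = argparse.ArgumentParser(description="Cluster PubMed records.")
    parser.add_argument("-m", dest="mode", required=True,
                        choices=["train", "test", "external"],
                        help="run mode")
    parser.add_argument("-k", dest="top_k", type=int, default=10,
                        help="number of most frequent tokens per cluster")
    parser.add_argument("-i", dest="input_file",
                        help="path to the input file (external mode)")
    parser.add_argument("-n", dest="corpus_name",
                        help="optional corpus name (external mode)")
    args = parser.parse_args(argv)

    if args.mode == "external":
        if not args.input_file:
            parser.error("-i is required when -m external is used")
        if args.corpus_name is None:
            # Fall back to external_{timestamp}; the timestamp format is an assumption.
            stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            args.corpus_name = "external_{}".format(stamp)
    return args


# corpus.txt is a placeholder path.
args = parse_args(["-m", "external", "-k", "5", "-i", "corpus.txt"])
print(args.mode, args.top_k, args.input_file, args.corpus_name)
```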

The full report is available in report/pubmed_clustering.pdf.
