Medical Embeddings and Clinical Trial Search Engine

1. Create a new environment

conda create -p med_venv python==3.10 -y
conda activate med_venv/

2. Install all the requirements

python -m pip install --upgrade pip
pip install -r src/requirements.txt
conda install jupyter  # only needed to run the Jupyter notebooks

3. Code Execution

Run python src/engine.py to train the model
Run streamlit run src/app.py to run the streamlit app

What does the code do?

The project trains Skip-gram and FastText models on a COVID-19 clinical trials dataset and builds a search engine: the user types any COVID-19-related keyword and gets the top n most similar results from the dataset.
[Figure: Application Demo]
[Figure: Architecture Diagram]

Word2Vec and FastText Word Embedding with Gensim

Business Context:

When we search for a word on Google, the results include not only pages that contain that exact word but also pages that are closely related to it. For example, a search for "medicine" returns results that mention not just "medicine" but also terms such as "health", "pharmacy", "WHO", and so on. Google somehow understands that these terms are closely related. This is where word embeddings come in. Word embeddings are numerical (vector) representations of words, learned from the contexts in which the words appear, so that words used in similar contexts end up with similar vectors.

General-purpose word embeddings may not perform well in every domain, so we build domain-specific embeddings to get better results. In this project, we create medical word embeddings using Word2Vec and FastText in Python.

Data Description

We use a COVID-19 clinical trials dataset for this project. Dataset-Link

The dataset has 10,666 rows and 21 columns. The following two columns are essential for us:

Title
Abstract
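
A minimal sketch of selecting these two columns with Pandas is shown below. The in-memory CSV, its rows, and the "Status" column are invented stand-ins so the snippet runs on its own; with the real dataset you would pass the path of the downloaded file to pd.read_csv instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV file. The rows and the "Status"
# column are invented; only "Title" and "Abstract" come from the README.
raw_csv = io.StringIO(
    "Title,Abstract,Status\n"
    "Vaccine trial in adults,Safety and immunogenicity study.,Recruiting\n"
    "Remdesivir for COVID-19,,Completed\n"
)

df = pd.read_csv(raw_csv)

# Keep only the two text columns and replace missing values with empty
# strings so later text processing does not have to handle NaN.
text_df = df[["Title", "Abstract"]].fillna("")

print(text_df.shape)
```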

Aim

This project uses the trained models (Word2Vec and FastText) to build a search engine with a Streamlit UI.

The broader goal is to develop a machine learning application that understands the relationships and patterns between words used together in medical science, to create a smart search engine for records containing those terms, and finally to build a machine learning pipeline in Azure to deploy and scale the application.

Tech stack

Language - Python
Libraries and Packages - Pandas, NumPy, Matplotlib, Plotly, Gensim, Streamlit, NLTK

Approach

Check my Jupyter notebooks:
- Theory Notebook
- Main Notebook

Importing the required libraries
Reading the dataset
Pre-processing
- Remove URLs
- Convert text to lower case
- Remove numerical values
- Remove punctuation
- Perform tokenization
- Remove stop words
- Perform lemmatization
- Remove ‘\n’ character from the columns
Exploratory Data Analysis (EDA)
- Data Visualization using word cloud
Training the ‘Skip-gram’ model
Training the ‘FastText’ model
Model embeddings – Similarity
PCA plots for Skip-gram and FastText models
Convert abstract and title to vectors using the Skip-gram and FastText model
Use the Cosine similarity function
Perform input query pre-processing
Define a function to return top ‘n’ similar results
Result evaluation
Run the Streamlit Application
- Run streamlit run medical.py in notebook
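
The pre-processing steps listed above can be sketched with the standard library alone. This is an illustrative approximation, not the project's code: the stop-word set is a small hand-picked sample (the project lists NLTK, whose stop-word list is larger), the function name preprocess is hypothetical, and lemmatization is left out because it needs NLTK's WordNet data.

```python
import re
import string

# Small illustrative stop-word set (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "on", "with"}

# Translation table that maps every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str) -> list[str]:
    """Clean one title or abstract and return its tokens."""
    text = text.replace("\n", " ")                      # remove newline characters
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = text.lower()                                 # convert to lower case
    text = re.sub(r"\d+", " ", text)                    # remove numerical values
    text = text.translate(PUNCT_TO_SPACE)               # remove punctuation
    tokens = text.split()                               # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]  # remove stop words


print(preprocess("A Study of Remdesivir in 200 Patients\nSee https://example.org for details."))
# → ['study', 'remdesivir', 'patients', 'see', 'details']
```

URLs are removed before punctuation so that the URL is still one whitespace-delimited chunk when the pattern runs.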

Project Takeaways

Understanding the business problem
Understanding the architecture to build the Streamlit application
Learning the Word2Vec and FastText model
Importing the dataset and required libraries
Data Pre-processing
Performing basic Exploratory Data Analysis (EDA)
Training the Skip-gram model with varying parameters
Training the FastText model with varying parameters
Understanding and performing the model embeddings
Plotting the PCA plots
Getting vectors for each attribute
Performing the Cosine similarity function
Pre-processing the input query
Evaluating the results
Creating a function to return top ‘n’ similar results for a given query
Understanding the code for executing the Streamlit application
Running the Streamlit application
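
To make the search steps concrete (vectors for each record, cosine similarity, top n results), here is a minimal NumPy sketch. It assumes a record's vector is the average of its word vectors, which this README does not confirm; the 3-dimensional embeddings, the document titles, and the function names are all hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained Skip-gram or FastText
# model would supply the real vectors.
EMBEDDINGS = {
    "vaccine": np.array([0.9, 0.1, 0.0]),
    "immunogenicity": np.array([0.8, 0.2, 0.1]),
    "ventilator": np.array([0.0, 0.9, 0.3]),
    "respiratory": np.array([0.1, 0.8, 0.4]),
}


def text_vector(tokens):
    """Average the vectors of known tokens (assumed pooling strategy)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def top_n(query_tokens, documents, n=2):
    """Return up to n (title, score) pairs, most similar first."""
    query_vec = text_vector(query_tokens)
    scored = [
        (title, cosine(query_vec, text_vector(tokens)))
        for title, tokens in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Made-up records, already pre-processed into tokens.
DOCUMENTS = {
    "Vaccine immunogenicity study": ["vaccine", "immunogenicity"],
    "Ventilator support trial": ["ventilator", "respiratory"],
}

results = top_n(["vaccine"], DOCUMENTS, n=1)
print(results)
```

A query made only of unknown words gets a zero vector and therefore a score of 0.0 against every record, which is one simple way to signal that nothing matched.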

Links to solve some errors
- Link-1
- Link-2
- Link-3
- Reset_Git_Commit

avr2002/Medical-Embeddings-and-Clinical-Trial-Search-Engine
