In [6]:
'''
## Project Objective

Develop an NLP model to identify medical conditions, treatments, or therapeutic insight in the provided medical transcript dataset.

## Dataset

- Use the provided `DATA.csv` file.

## Project Guidelines

1. **Data Preprocessing**
   - Convert the text data into a structured format suitable for training machine learning models.
   - Document the preprocessing steps.

2. **Feature Extraction**
   - Create feature vectors from the text data using appropriate techniques.
   - Explain your chosen feature extraction methods.

3. **NLP Model**
   - Choose suitable NLP models, techniques, tools, or frameworks for the task (e.g., Word2Vec, GloVe, BERT, Transformers, Hugging Face, PyTorch, etc.).
   - Justify your choices and describe the advantages and limitations in this context.

4. **Implementation and Evaluation**
   - Implement the chosen NLP model using Python and relevant libraries.
   - Evaluate your model's performance using suitable metrics.

5. **Report**
   - Submit your repo with a Jupyter notebook detailing your approach, including:
       - Data preprocessing steps
       - Feature extraction techniques
       - Model choice and rationale
       - Model evaluation and results
       - Relevant visualizations
   - Clarity of explanation and documentation will be evaluated.

6. **Extra Points**
   - Evaluate and analyze the keywords "Rectocele" and "Cystocele" in the dataset and provide visualizations or summary statistics to illustrate the findings.
'''

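For the extra-points item in the guidelines, here is a minimal sketch of counting how many transcripts mention each keyword, using only the standard library. The function name `keyword_document_counts` and the sample texts are illustrative assumptions, not part of the dataset. In the notebook, the input could be `df['transcription'].tolist()` once the data is loaded.

```python
import re

def keyword_document_counts(texts, keywords):
    """Count how many texts mention each keyword (case-insensitive, whole word)."""
    patterns = {
        kw: re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
        for kw in keywords
    }
    counts = {kw: 0 for kw in keywords}
    for text in texts:
        for kw, pattern in patterns.items():
            # Each text is counted at most once per keyword
            if pattern.search(text):
                counts[kw] += 1
    return counts

# Hypothetical sample texts, made up for illustration only
sample_texts = [
    "Repair of cystocele and rectocele was performed.",
    "The patient has a grade 2 Cystocele.",
    "No acute distress.",
]
print(keyword_document_counts(sample_texts, ["Rectocele", "Cystocele"]))
```

Whole-word matching means plural forms such as "rectoceles" are not counted. Dropping the trailing `\b` would include them, at the cost of matching longer unrelated words.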

In [7]:
import pandas as pd
import nltk
import re
from textblob import TextBlob


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
!pip install pandas nltk textblob




In [17]:
# Build these once instead of on every call to preprocess_text
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Function to preprocess the data
def preprocess_text(text):
    # Lowercase conversion
    text = text.lower()

    # Special character handling: keep only letters, whitespace, and hyphens
    # (hyphens may matter in medical terms). Note that digits are removed too.
    text = re.sub(r'[^a-z\s\-]', '', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Lemmatization
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens]

    # Spell correction. Caveat: TextBlob's general-English corrector is slow and
    # may alter valid medical terms it does not recognize.
    corrected_tokens = [str(TextBlob(word).correct()) for word in tokens]

    # Rejoin tokens into a single string
    return ' '.join(corrected_tokens)

# Load your dataset
df = pd.read_csv('/content/drive/MyDrive/Colab_Notebooks/FemTherapeutics_Technical_Interview-main/DATA.csv')

# Calculate number of records before dropping null 'transcription' and duplicates
num_records_before = len(df)

# Drop records with null 'transcription'
df = df.dropna(subset=['transcription'])

# Calculate number of dropped records due to null 'transcription'
num_dropped_null = num_records_before - len(df)

# Drop exact duplicate rows (all columns are compared, not only 'transcription')
df = df.drop_duplicates()

# Calculate number of dropped duplicates
num_dropped_duplicates = (num_records_before - num_dropped_null) - len(df)

# Print the number of dropped records due to null 'transcription' and duplicates
print("total records :",num_records_before)
print("Number of dropped records due to null 'transcription':", num_dropped_null)
print("Number of dropped duplicate records:", num_dropped_duplicates)

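# Save the cleaned DataFrame (null transcriptions and duplicate rows removed;
# preprocess_text has not been applied to the text at this point)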
df.to_csv('/content/drive/MyDrive/Colab_Notebooks/FemTherapeutics_Technical_Interview-main/preprocessed_data.csv', index=False)



total records : 4999
Number of dropped records due to null 'transcription': 33
Number of dropped duplicate records: 0
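
The duplicate count above comes from `df.drop_duplicates()` comparing every column, so rows that share a transcription but differ in any other column would be kept. A small sketch on a made-up frame (the column values are hypothetical) shows how passing `subset` changes the result:

```python
import pandas as pd

# Toy data, invented for illustration: two rows share the same transcription
toy = pd.DataFrame({
    "sample_name": ["Note A", "Note B", "Note C"],
    "transcription": ["same text", "same text", "other text"],
})

# Compare all columns: no row is an exact duplicate of another
all_columns = toy.drop_duplicates()

# Compare only 'transcription': the second "same text" row is dropped
by_transcription = toy.drop_duplicates(subset=["transcription"])

print(len(toy) - len(all_columns))       # rows dropped comparing all columns
print(len(toy) - len(by_transcription))  # rows dropped comparing only 'transcription'
```

By default `drop_duplicates` keeps the first occurrence of each duplicated value, so here "Note A" survives and "Note B" is removed.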


In [27]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
0,0,A 23-year-old white female presents with comp...,Allergy / Immunology,Allergic Rhinitis,"SUBJECTIVE:, This 23-year-old white female pr...","allergy / immunology, allergic rhinitis, aller..."
1,1,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 2,"PAST MEDICAL HISTORY:, He has difficulty climb...","bariatrics, laparoscopic gastric bypass, weigh..."
2,2,Consult for laparoscopic gastric bypass.,Bariatrics,Laparoscopic Gastric Bypass Consult - 1,"HISTORY OF PRESENT ILLNESS: , I have seen ABC ...","bariatrics, laparoscopic gastric bypass, heart..."
3,3,2-D M-Mode. Doppler.,Cardiovascular / Pulmonary,2-D Echocardiogram - 1,"2-D M-MODE: , ,1. Left atrial enlargement wit...","cardiovascular / pulmonary, 2-d m-mode, dopple..."
4,4,2-D Echocardiogram,Cardiovascular / Pulmonary,2-D Echocardiogram - 2,1. The left ventricular cavity size and wall ...,"cardiovascular / pulmonary, 2-d, doppler, echo..."
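
To see what the special-character step in `preprocess_text` does to text like the first transcription above, the regex can be run on its own. This is a sketch: `strip_special` is a hypothetical helper name, and the sample string is a shortened version of the truncated preview in the table, not the full record.

```python
import re

def strip_special(text):
    """Lowercase, then keep only letters, whitespace, and hyphens."""
    return re.sub(r'[^a-z\s\-]', '', text.lower())

sample = "SUBJECTIVE:, This 23-year-old white female"
print(strip_special(sample))  # subjective this -year-old white female
```

Digits are removed along with punctuation, so ages and dosages (for example "5 mg") lose their numeric part. Whether that is acceptable depends on the downstream task.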
