In [None]:
# Problem Statement: Generate concise medical summaries from lengthy doctor-patient conversations/patient data for faster review.
# Why: Concise medical summaries streamline documentation, saving time and improving efficiency for healthcare providers. They allow quick review of key information, aiding faster decision-making and better patient care.

1) from transformers import pipeline:

* The transformers library (from Hugging Face) provides pre-trained language models such as BERT, GPT-2, and BART, and runs on deep learning backends such as PyTorch and TensorFlow.
* The pipeline function wraps these models for common NLP tasks such as text summarization, sentiment analysis, and translation. It loads a pre-trained model and its tokenizer behind a single high-level call, so the underlying preprocessing and decoding details stay hidden.


2) from sklearn.feature_extraction.text import TfidfVectorizer:
* sklearn (scikit-learn) is a popular Python library for machine learning.
* TfidfVectorizer is a tool used for converting text into numerical representations that machine learning algorithms can understand. It does this using a technique called TF-IDF (Term Frequency-Inverse Document Frequency).
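
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch of the calculation. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as the TfidfVectorizer defaults, but it uses plain whitespace tokenization and a made-up three-sentence corpus, so treat it as an illustration rather than a reimplementation of the library. The names docs and tfidf are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "patient reports chest pain",
    "patient reports mild headache",
    "chest pain after exercise",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Smoothed TF-IDF with L2 normalisation for one tokenized document."""
    tf = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

scores = tfidf(tokenized[1])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {score:.3f}")
```

In the second document, "mild" and "headache" occur in no other document, so they score higher than "patient" and "reports", which are shared with the first document.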



In [None]:
'''from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
import re

# Load spaCy model for NER (Named Entity Recognition)
nlp = spacy.load("en_core_web_sm")

# Load the summarization pipeline (using BART for abstractive summarization)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")'''

In [None]:
'''def remove_greetings(text):
    # List of common greeting words/phrases
    greetings = [
        r'\bhello\b', r'\bhi\b', r'\bhey\b', r'\bhow are you\b',
        r'\bgood morning\b', r'\bgood afternoon\b', r'\bgood evening\b',
        r'\bwhat\'s up\b', r'\bhowdy\b', r'\bgreetings\b', r'\bsalutations\b'
    ]

    # Pattern that matches any of the greetings
    greetings_pattern = '|'.join(greetings)

    # Regex to replace greetings with an empty string
    cleaned_text = re.sub(greetings_pattern, '', text, flags=re.IGNORECASE)

    # Remove extra spaces resulting from replacements
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()

    return cleaned_text'''
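
The behaviour of this regex approach can be checked in isolation. The sketch below uses a shortened greeting list and a hypothetical function name (strip_greetings); it is only meant to show what the word-boundary anchors protect and what they leave behind.

```python
import re

# Shortened greeting list, for illustration only.
GREETINGS = re.compile(r"\b(?:hello|hi|hey|good morning)\b", flags=re.IGNORECASE)

def strip_greetings(text):
    """Remove greeting words, then collapse the whitespace left behind."""
    cleaned = GREETINGS.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "Hello doctor, I have had this cough since Monday. Hi again!"
result = strip_greetings(sample)
print(result)

# Punctuation that followed a greeting is not removed.
leftover = strip_greetings("Hi, I feel dizzy")
print(leftover)
```

The \b anchors keep the "hi" inside "this" intact, but punctuation attached to a removed greeting survives (the second call returns ", I feel dizzy"). If that matters for the summarizer input, an extra cleanup step for leading punctuation would be needed.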

In [None]:
'''# Named Entity Recognition (NER)
def extract_medical_entities(text, nlp_model):
    doc = nlp_model(text)
    entities = []
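    for ent in doc.ents:
        # NOTE: en_core_web_sm emits general-purpose labels (PERSON, ORG, DATE, ...),
        # none of which appear in this list, so the result is empty unless
        # nlp_model is a pipeline trained to produce these medical labels.
        if ent.label_ in ["DISEASE", "SYMPTOM", "TREATMENT", "MEDICATION"]:
            entities.append(ent.text)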

    return entities'''

1) token.dep_: This part accesses the dependency label assigned to the current token. Dependency labels describe the grammatical relationship between words in a sentence, like subject, object, modifier, etc. These labels are determined during dependency parsing.

2) ['nsubj', 'dobj', 'ROOT']: This is a list containing specific dependency labels that the code is interested in:

* nsubj: Represents the nominal subject of a verb. In simpler terms, it's typically the noun or pronoun that performs the action of the verb.
* dobj: Represents the direct object of a verb. It's the noun or pronoun that receives the action of the verb.
* ROOT: Represents the main verb or the central element of the sentence that everything else relates to.

Example:

Let's say you have the sentence "The cat sat on the mat." For the word "sat," the dependency label might be ROOT, the token.text would be "sat," and the token.head.text would also be "sat" (since the root is usually its own head).

So the line of code creates the string "ROOT: sat -> sat" and appends it to the relations list.
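
The filtering step can be tried without loading spaCy. In the sketch below the tokens are a hand-written stand-in for spaCy tokens, and the dependency labels are one plausible parse of the sentence, not actual parser output. Tok, sentence, and KEEP are hypothetical names.

```python
from collections import namedtuple

# Hand-written stand-in for spaCy tokens: (text, dependency label, head text).
# The labels are one plausible parse, not real parser output.
Tok = namedtuple("Tok", ["text", "dep", "head"])

sentence = [
    Tok("The", "det", "cat"),
    Tok("cat", "nsubj", "sat"),
    Tok("sat", "ROOT", "sat"),
    Tok("on", "prep", "sat"),
    Tok("the", "det", "mat"),
    Tok("mat", "pobj", "on"),
    Tok(".", "punct", "sat"),
]

# Keep only the relations of interest, formatted as "label: word -> head".
KEEP = {"nsubj", "dobj", "ROOT"}
relations = [f"{t.dep}: {t.text} -> {t.head}" for t in sentence if t.dep in KEEP]
print(relations)
```

Under this parse, "mat" is the object of the preposition "on" (pobj), not a direct object, so it is filtered out and only the subject and the root remain.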

In [None]:
'''# Dependency Parsing (to understand context)
def dependency_parse(text, nlp_model):
    doc = nlp_model(text)
    relations = []
    for token in doc:
        if token.dep_ in ['nsubj', 'dobj', 'ROOT']:
            relations.append(f'{token.dep_}: {token.text} -> {token.head.text}')
    return relations
    # Do not consider this cell #'''

1) fit_transform does two things:
* It "fits" the vectorizer to the text, which means it learns the vocabulary and IDF weights.
* It "transforms" the text into a matrix where each row represents a document (in this case, just one document - your input text) and each column represents a word. The values in the matrix are the TF-IDF scores.

2) keywords = [word for word, score in zip(vectorizer.get_feature_names_out(), X.sum(axis=0).tolist()[0]) if score > 0]:
* This line extracts the actual keywords.
* It iterates through each word and its corresponding TF-IDF score.
* vectorizer.get_feature_names_out(): Retrieves the list of words (features) used in the TF-IDF matrix.
* X.sum(axis=0).tolist()[0]: Calculates the sum of TF-IDF scores for each word across all documents (which is just one document here).
* zip: Combines the words and their scores.
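* if score > 0: Only words with a TF-IDF score greater than 0 are kept as keywords. Because the vectorizer is fitted on a single document, every word in its vocabulary occurs in that document and has a positive score, so in practice this returns all non-stop-words of the input.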
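
The single-document case can be worked through by hand. Assuming scikit-learn's documented defaults (smoothed IDF, L2 normalisation), a corpus of one document gives every term an IDF of ln(2/2) + 1 = 1, so each score is just the term count divided by the norm of all counts. The sketch below is pure Python with a tiny illustrative stop-word set (the library's English list is much longer). top_keywords is a hypothetical alternative that ranks terms instead of thresholding at zero.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word set; scikit-learn's English list is much longer.
STOP_WORDS = {"the", "a", "of", "and", "has", "for", "in", "is"}

def single_doc_scores(text):
    """TF-IDF for a corpus of one document: IDF is 1 for every term,
    so each score is the term count divided by the L2 norm of all counts."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

def top_keywords(text, k=3):
    """Hypothetical alternative to `score > 0`: keep the k highest-scoring terms."""
    scores = single_doc_scores(text)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

text = "The patient has a cough. The cough is dry and the patient has fever."
scores = single_doc_scores(text)
print(all(s > 0 for s in scores.values()))
print(top_keywords(text, k=2))
```

Every score is positive, which is why the `score > 0` test does not narrow the list. Ranking by score and keeping the top k is one way to get a shorter keyword list.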

In [None]:
'''# Keyword Extraction using TF-IDF
def extract_keywords(text):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform([text])
    keywords = [word for word, score in zip(vectorizer.get_feature_names_out(), X.sum(axis=0).tolist()[0]) if score > 0]
    return keywords'''

In [None]:
'''# Function to generate an insightful summary
def generate_summary_for_processed_data(text, summarizer):
    # Run the summarizer; max_length and min_length bound the summary length in tokens.
    # do_sample=False disables random sampling, so decoding is deterministic.
    summary_data = summarizer(text, max_length=120, min_length=50, do_sample=False)
    return summary_data[0]['summary_text']'''

In [None]:
'''# Function to generate an insightful summary for processed conversation.
def generate_summary_for_processed_conversation(text, summarizer):
    # Run the summarizer; max_length and min_length bound the summary length in tokens.
    # do_sample=False disables random sampling, so decoding is deterministic.
    summary_conversation = summarizer(text, max_length=350, min_length=150, do_sample=False)
    return summary_conversation[0]['summary_text']'''
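
Both functions use fixed length bounds. Since min_length forces the model to keep generating until that many tokens are produced, a short input can end up with a summary longer than the text itself. The helper below is a hypothetical sketch (not part of the notebook or of transformers) that shrinks the bounds to fit the input; n_input_tokens is assumed to come from the tokenizer, or from a rough word count.

```python
def pick_summary_lengths(n_input_tokens, max_len, min_len):
    """Hypothetical helper: shrink the requested bounds so the summary
    is never asked to be longer than the input."""
    new_max = max(1, min(max_len, n_input_tokens))
    new_min = min(min_len, new_max // 2)
    return new_max, new_min

# Long conversation: the requested bounds are kept as they are.
long_bounds = pick_summary_lengths(1000, max_len=350, min_len=150)
# Short conversation: both bounds are reduced.
short_bounds = pick_summary_lengths(80, max_len=350, min_len=150)
print(long_bounds, short_bounds)
```

The returned pair could then be passed as max_length and min_length to the summarizer call. Halving the maximum to get the minimum is an arbitrary choice made for this sketch.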

In [None]:
'''# Organize extracted content into sections
def create_medical_summary_for_processed_data(text, nlp_model, summarizer):

    entities = extract_medical_entities(text, nlp_model)
    # relations = dependency_parse(text, nlp_model)
    keywords = extract_keywords(text)

    summary = generate_summary_for_processed_data(text, summarizer)

    # Structuring the summary
    summary_dict = {
        "Entities": entities,
        # "Relations": relations,
        "Keywords": keywords,
        "Summary": summary
    }
    return summary_dict'''

In [None]:
'''# Organize extracted content into sections
def create_medical_summary_for_processed_conversation(text, nlp_model, summarizer):

    entities = extract_medical_entities(text, nlp_model)
    # relations = dependency_parse(text, nlp_model)
    keywords = extract_keywords(text)

    summary = generate_summary_for_processed_conversation(text, summarizer)

    # Structuring the summary
    summary_dict = {
        "Entities": entities,
        # "Relations": relations,
        "Keywords": keywords,
        "Summary": summary
    }
    return summary_dict'''

In [None]:
'''# Summarization using Abstractive Summarization model
# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

# Using a popular summarization model (e.g., BART)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Process 10 rows in the processed_conversation
for index, row in sampled_df.head(10).iterrows():  # Iterating through 10 rows
    conversation_text = row['processed_conversation']  # Extract the processed_conversation text
    conversation_text = remove_greetings(conversation_text)  # Remove greetings
    summary = create_medical_summary_for_processed_conversation(conversation_text, nlp, summarizer)  # Generate summary for each row

    # Output the structured summary for the current row
    print(f"Summary for row {index}:")
    print(f"Entities: {summary['Entities']}")
    # print(f"Relations: {summary['Relations']}")
    print(f"Keywords: {summary['Keywords']}")
    print(f"Summary: {summary['Summary']}")
    print("\n" + "-"*50 + "\n")'''

In [None]:
from transformers import pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
import re
import streamlit as st

# Load spaCy model for NER
@st.cache_resource
def load_spacy_model():
    return spacy.load("en_core_web_sm")

nlp = load_spacy_model()

# Load summarizer pipeline
@st.cache_resource
def load_summarizer():
    return pipeline("summarization", model="facebook/bart-large-cnn", device=-1)

summarizer = load_summarizer()

def remove_greetings(text):
    greetings = [
        r'\bhello\b', r'\bhi\b', r'\bhey\b', r'\bhow are you\b',
        r'\bgood morning\b', r'\bgood afternoon\b', r'\bgood evening\b',
        r'\bwhat\'s up\b', r'\bhowdy\b', r'\bgreetings\b', r'\bsalutations\b'
    ]
    greetings_pattern = '|'.join(greetings)
    cleaned_text = re.sub(greetings_pattern, '', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', cleaned_text).strip()

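def extract_medical_entities(text, nlp_model):
    doc = nlp_model(text)
    # NOTE: en_core_web_sm emits general-purpose labels (PERSON, ORG, DATE, ...),
    # so this filter returns [] unless nlp_model was trained on the labels below.
    medical_labels = {"DISEASE", "SYMPTOM", "TREATMENT", "MEDICATION"}
    return [ent.text for ent in doc.ents if ent.label_ in medical_labels]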

def extract_keywords(text):
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform([text])
    return [word for word, score in zip(vectorizer.get_feature_names_out(), X.toarray()[0]) if score > 0]

def generate_summary(text, max_len, min_len):
    summary_data = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)
    return summary_data[0]['summary_text']

def create_summary(text, nlp_model, max_len, min_len):
    entities = extract_medical_entities(text, nlp_model)
    keywords = extract_keywords(text)
    summary = generate_summary(text, max_len, min_len)
    return {
        "Entities": entities,
        "Keywords": keywords,
        "Summary": summary
    }

def main():
    st.title("Medical Text Analysis Tool")
    st.write("This tool extracts medical entities, keywords, and summarizes the input text.")

    user_input = st.text_area("Enter the medical text here:")
    summary_type = st.radio("Select Summary Type", ("Data", "Conversation"))

    if st.button("Process"):
        if user_input.strip():
            processed_text = remove_greetings(user_input)
            max_len, min_len = (120, 50) if summary_type == "Data" else (350, 150)
            summary = create_summary(processed_text, nlp, max_len, min_len)

            st.subheader("Extracted Medical Entities")
            st.write(", ".join(summary['Entities']) if summary['Entities'] else "No entities found.")

            st.subheader("Extracted Keywords")
            st.write(", ".join(summary['Keywords']) if summary['Keywords'] else "No keywords found.")

            st.subheader("Text Summary")
            st.write(summary['Summary'])
        else:
            st.warning("Please enter some text to process.")

if __name__ == "__main__":
    main()


In [None]:
'''# Install necessary libraries in one command
!pip install -q streamlit spacy transformers scikit-learn localtunnel

# Download the spaCy model for English
!python -m spacy download en_core_web_sm'''

In [None]:
# Print this machine's public IPv4 address (-O - writes the response to stdout)
!wget -q -O - ipv4.icanhazip.com

In [None]:
! streamlit run MedicalTextAnalysis.py & npx localtunnel --port 8501

In [None]:
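import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of data lengths (word count per row of processed_data).
# sampled_df is assumed to be a pandas DataFrame loaded earlier in the notebook.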
sampled_df['data_length'] = sampled_df['processed_data'].apply(lambda x: len(x.split()))
plt.figure(figsize=(8, 5))
sns.histplot(sampled_df['data_length'], kde=True, bins=30)
plt.title('Distribution of Data Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Distribution of conversation lengths
sampled_df['conversation_length'] = sampled_df['processed_conversation'].apply(lambda x: len(x.split()))
plt.figure(figsize=(8, 5))
sns.histplot(sampled_df['conversation_length'], kde=True, bins=30)
plt.title('Distribution of Conversation Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()