<a href="https://colab.research.google.com/github/dayana-cabrera004/npl/blob/main/MediMine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install dependencies first, then import
!pip install gradio langchain-together

# Standard library
import os
from datetime import datetime

# Third-party
import gradio as gr
import pandas as pd
import requests
from kaggle.api.kaggle_api_extended import KaggleApi
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain_together import ChatTogether




In [3]:
# Set environment variables.
# Credentials are read from the environment or prompted for at runtime;
# real keys should never be hardcoded in a shared notebook.
from getpass import getpass

for var in ("TOGETHER_API_KEY", "KAGGLE_USERNAME", "KAGGLE_KEY"):
    if not os.environ.get(var):
        os.environ[var] = getpass(f"{var}: ")

In [4]:
# Initialize the Together.ai model
llm = ChatTogether(
    temperature=0.0,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
)

In [5]:
# Initialize Kaggle API
def init_kaggle():
    api = KaggleApi()
    api.authenticate()
    return api

In [6]:
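# Function to fetch datasets from various sources
def fetch_datasets(query, source):
    # API endpoints for the non-Kaggle sources.
    # NOTE: these URLs have not been verified against each provider's API documentation.
    sources = {
        'healthdata': 'https://healthdata.gov/api/search',
        'data_gov': 'https://catalog.data.gov/api/search',
        'who': 'https://www.who.int/data/api/search',
        'nih': 'https://www.nih.gov/api/search',
        'kaggle': None
    }

    # Reject unknown sources explicitly instead of relying on a KeyError
    if source not in sources:
        print(f"Unknown source: {source}")
        return None

    # Special handling for Kaggle datasets
    if source == 'kaggle':
        try:
            api = init_kaggle()
            # The Kaggle client's keyword for tag filtering is tag_ids, not tag
            datasets = api.dataset_list(search=query, tag_ids='health')
            return [dataset.ref for dataset in datasets]
        except Exception as e:
            print(f"Kaggle API error: {e}")
            return None

    # Handle other API requests
    try:
        # A timeout keeps the UI from hanging on an unresponsive endpoint
        response = requests.get(sources[source], params={'q': query}, timeout=10)
        response.raise_for_status()  # treat HTTP error codes as failures
        return response.json()
    except Exception as e:
        print(f"API error for {source}: {e}")
        return None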
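
`fetch_datasets` handles one source per call. A minimal sketch of fanning a query out to every configured source is shown below. It assumes a fetch callable with the same `(query, source)` signature as `fetch_datasets` that returns `None` on failure; `search_all_sources`, `SOURCES`, and `fake_fetch` are hypothetical names used for illustration and are not part of the notebook.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Source keys as they appear in fetch_datasets.
SOURCES = ("kaggle", "healthdata", "data_gov", "who", "nih")


def search_all_sources(
    query: str,
    fetch: Callable[[str, str], Any],
    sources: Iterable[str] = SOURCES,
) -> Tuple[Dict[str, Any], List[str]]:
    """Call fetch(query, source) for each source; return (results, failed_sources)."""
    results: Dict[str, Any] = {}
    failed: List[str] = []
    for source in sources:
        payload = fetch(query, source)
        if payload is None:  # a None payload is treated as a failed request
            failed.append(source)
        else:
            results[source] = payload
    return results, failed


# Stand-in for fetch_datasets so the sketch runs without network access.
def fake_fetch(query: str, source: str) -> Any:
    if source == "who":
        return None  # simulate a failed request
    return [f"{source}/{query}"]


results, failed = search_all_sources("diabetes", fake_fetch)
print(sorted(results))
print(failed)
```

Keeping the failed sources separate lets the interface report which portals could not be reached instead of silently dropping them.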

In [7]:
# General search function across all sources
def general_search(query):
    # Fetch Kaggle datasets first
    kaggle_datasets = fetch_datasets(query, 'kaggle')

    # Create prompt template for the LLM
    prompt = PromptTemplate(
        input_variables=["query", "kaggle_data"],
        template="""
        Comprehensive medical dataset search for: {query}

        Available Kaggle Datasets: {kaggle_data}

        Provide a summary including:
        1. Key findings from medical databases
        2. Relevant clinical studies
        3. Available research data
        4. Statistical highlights
        """
    )

    # Pipe the prompt into the model (LLMChain and chain.run are deprecated in recent LangChain releases)
    chain = prompt | llm
    return chain.invoke({"query": query, "kaggle_data": str(kaggle_datasets)}).content

In [8]:
# Function to display relevant Kaggle datasets
def show_relevant_datasets(query="diabetes"):
    try:
        api = init_kaggle()
        # The Kaggle client's keyword for tag filtering is tag_ids, not tag
        datasets = api.dataset_list(search=query, tag_ids='health')
        dataset_info = []

        for dataset in datasets[:8]:  # Top 8 relevant datasets
            dataset_info.append([
                dataset.title,
                f"[View Dataset](https://www.kaggle.com/datasets/{dataset.ref})",
                # usabilityRating is reported on a 0-1 scale, so rescale to 0-10
                f"{dataset.usabilityRating * 10:.1f}/10",
                f"{dataset.downloadCount:,}"
            ])

        if not dataset_info:
            dataset_info = [["No datasets found", "Try another search term", "N/A", "N/A"]]

    except Exception as e:
        # Log the cause so failures are not silently hidden behind the placeholder row
        print(f"Kaggle API error: {e}")
        dataset_info = [["API Error", "Could not fetch datasets", "N/A", "N/A"]]

    return pd.DataFrame(
        dataset_info,
        columns=["Dataset", "Link", "Usability Score", "Downloads"]
    )
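
The table rows above read `usabilityRating` and `downloadCount` directly from each Kaggle result. The sketch below shows one defensive way to build a row. It assumes that attribute names may differ between client versions (the snake_case fallbacks are an assumption, not something the notebook confirms) and that usability is reported on a 0-1 scale. `dataset_row` and `_first_attr` are hypothetical helpers, and `SimpleNamespace` stands in for a real result object.

```python
from types import SimpleNamespace
from typing import Any, List


def _first_attr(obj: Any, names: List[str], default: Any = None) -> Any:
    """Return the first attribute in names that exists on obj and is not None."""
    for name in names:
        value = getattr(obj, name, None)
        if value is not None:
            return value
    return default


def dataset_row(dataset: Any) -> List[str]:
    """Build one [title, link, usability, downloads] row, tolerating missing fields."""
    title = _first_attr(dataset, ["title"], "Untitled")
    ref = _first_attr(dataset, ["ref"], "")
    rating = _first_attr(dataset, ["usabilityRating", "usability_rating"])
    downloads = _first_attr(dataset, ["downloadCount", "download_count"])
    return [
        title,
        f"[View Dataset](https://www.kaggle.com/datasets/{ref})" if ref else "N/A",
        f"{rating * 10:.1f}/10" if rating is not None else "N/A",  # assumes a 0-1 scale
        f"{downloads:,}" if downloads is not None else "N/A",
    ]


# Stand-in objects; real results would come from the Kaggle client.
full = SimpleNamespace(
    title="Diabetes Data", ref="user/diabetes", usabilityRating=0.88, downloadCount=12345
)
partial = SimpleNamespace(title="Sparse", ref="user/sparse")
print(dataset_row(full))
print(dataset_row(partial))
```

With this approach a single result that lacks a field produces an "N/A" cell rather than turning the whole table into the "API Error" placeholder.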

In [9]:
# Main Gradio interface
def build_medimine_interface():
    with gr.Blocks(title="MediMine - Medical Dataset Explorer") as app:
        gr.Markdown("""
        # 🏥 MediMine: Medical Dataset Explorer
        ## Comprehensive Medical Dataset Search Platform
        Search across multiple medical databases and find relevant datasets instantly.
        """)

        with gr.Row():
            with gr.Column():
                query_input = gr.Textbox(
                    label="Enter Medical Search Query",
                    placeholder="e.g., diabetes type 2 research data",
                    lines=2
                )
                search_btn = gr.Button("🔍 General Search", variant="primary")

        with gr.Row():
            diagnosis_btn = gr.Button("🏥 Diagnosis")
            treatment_btn = gr.Button("💊 Treatment")
            genes_btn = gr.Button("🧬 Genetics")
            trials_btn = gr.Button("🔬 Trials")
            kaggle_btn = gr.Button("📊 Kaggle")
            imaging_btn = gr.Button("🔎 Imaging")

        output_text = gr.Textbox(
            label="Search Results",
            lines=10,
            placeholder="Results will appear here..."
        )

        dataset_display = gr.DataFrame(
            value=show_relevant_datasets().values.tolist(),
            headers=["Dataset", "Link", "Usability Score", "Downloads"],
            label="Relevant Kaggle Datasets"
        )

        # Update both search results and dataset recommendations
        def update_results(query, search_type='general'):
            if search_type == 'general':
                search_result = general_search(query)
            else:
                # NOTE: specific_search() is not defined anywhere in this notebook.
                # As a stopgap, fold the category into the general search query.
                search_result = general_search(f"{query} ({search_type} datasets)")
            datasets = show_relevant_datasets(query).values.tolist()
            return search_result, datasets

        # Connect buttons to functions: each button runs update_results with its own search type
        button_types = [
            (search_btn, 'general'),
            (diagnosis_btn, 'diagnosis'),
            (treatment_btn, 'treatment'),
            (genes_btn, 'genes'),
            (trials_btn, 'trials'),
            (kaggle_btn, 'kaggle'),
            (imaging_btn, 'imaging'),
        ]
        for button, search_type in button_types:
            button.click(
                # Bind search_type as a default argument so each lambda keeps its own value
                fn=lambda q, t=search_type: update_results(q, t),
                inputs=query_input,
                outputs=[output_text, dataset_display]
            )

    return app

# Launch the application
if __name__ == "__main__":
    app = build_medimine_interface()
    app.launch(share=True)



**Purpose and use case of the app:**

MediMine - Medical Dataset Explorer aims to be a first point of reference for healthcare dataset searches. Users enter a query (e.g., “diabetes type 2 research data”), and the app is designed to search multiple sources: Kaggle, government health portals, and research organizations such as WHO and NIH. In the notebook as written, only the Kaggle search is connected to the interface; the other endpoints are defined in `fetch_datasets` but are not yet called. The query and the Kaggle results are passed through LangChain to a Together.ai language model, which returns a summary, while a table lists each dataset's title, link, usability score, and download count. The interface also offers category-specific searches for diagnosis, treatment, genetics, clinical trials, and imaging.
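
The category-specific searches described above need a prompt tailored to each category. A minimal sketch of one way to assemble such a prompt follows. The `build_category_prompt` name and the `CATEGORY_FOCUS` wording are assumptions for illustration; only the category keys are taken from the interface code.

```python
from typing import Dict

# Focus text per category. The keys are search types used by the interface buttons
# ('kaggle' is omitted because it names a source, not a topic); the wording is illustrative.
CATEGORY_FOCUS: Dict[str, str] = {
    "diagnosis": "datasets on diagnostic criteria, symptoms, and screening",
    "treatment": "datasets on therapies, medications, and outcomes",
    "genes": "genomic and genetic association datasets",
    "trials": "clinical trial registries and results",
    "imaging": "medical imaging datasets such as X-ray, CT, and MRI",
}


def build_category_prompt(query: str, category: str) -> str:
    """Return a prompt string for one category; raise on an unknown category."""
    if category not in CATEGORY_FOCUS:
        raise ValueError(f"Unknown category: {category!r}")
    return (
        f"Medical dataset search for: {query}\n"
        f"Focus on {CATEGORY_FOCUS[category]}.\n"
        "List relevant datasets and summarize what each contains."
    )


print(build_category_prompt("diabetes type 2", "trials"))
```

The returned string could be sent to the model in place of the general template, so each button produces a summary scoped to its category.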

What differentiates MediMine from other healthcare data search tools is its goal of a single, consolidated search. Rather than drawing on one repository, MediMine is meant to aggregate datasets from several trusted platforms so that a search covers more ground. The language model adds contextual summaries alongside dataset links, usability scores, and other metadata. This combination of multiple data sources, AI-generated summaries, and a simple interface is intended to make MediMine a practical first stop for anyone exploring healthcare datasets.