<a href="https://colab.research.google.com/github/ajaysingh-codes/farmworker-health-rag/blob/main/farmworker_health_rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🌾 Farmworker Health Research RAG System

**A hybrid keyword and semantic search system for scientific literature on farmworker health, chemical exposures, and occupational stressors**

Built with BM25 + Semantic Search | Interactive comparison interface | Optimized for health research papers

---



## 📦 Installing Required Libraries
This cell installs all the necessary Python packages for our RAG system:
- `sentence-transformers`: For creating semantic embeddings of text
- `bm25s`: For keyword-based search (BM25 algorithm)
- `pypdf2`: For extracting text from PDF files
- `pandas`: For data manipulation
- `numpy`: For numerical operations
- `joblib`: For saving/loading embeddings
- `ipywidgets`: For creating the interactive interface
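
Once the install cell has run, it can be useful to confirm which versions were actually picked up. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical and not part of this notebook; the package names are the ones passed to `pip` in the install cell.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as passed to pip in the install cell.
required = [
    "sentence-transformers", "bm25s", "pypdf2",
    "pandas", "numpy", "joblib", "ipywidgets",
]

for name, ver in installed_versions(required).items():
    print(f"{name}: {ver if ver is not None else 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` suggests the install cell should be re-run before continuing.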

In [1]:
!pip install sentence-transformers bm25s pypdf2 pandas numpy joblib ipywidgets -q
print("Libraries installed successfully")


## Importing Libraries and Creating Workspace
This cell:
1. Imports all the libraries we'll use throughout the project
2. Creates a dedicated folder `/content/papers` for your PDF files
3. Sets up the basic environment for our RAG system
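
Step 2 can also be written without an explicit existence check. The sketch below assumes a hypothetical helper, `ensure_folder`, and uses a temporary directory in place of `/content/papers` so that it runs outside Colab as well.

```python
import os
import tempfile


def ensure_folder(path):
    """Create `path` if needed; return True if this call created it."""
    already_there = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return not already_there


# A temporary directory stands in for /content so the sketch runs anywhere.
base = tempfile.mkdtemp()
demo_folder = os.path.join(base, "papers")

print(ensure_folder(demo_folder))  # True: the first call creates the folder
print(ensure_folder(demo_folder))  # False: the second call finds it in place
```

Because `exist_ok=True` makes the call idempotent, the cell can be re-run safely after a runtime restart.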

In [3]:
# Import all necessary libraries
import os
import pandas as pd
import numpy as np
import bm25s
import joblib
from sentence_transformers import SentenceTransformer
from IPython.display import display, Markdown
import ipywidgets as widgets
from PyPDF2 import PdfReader
from datetime import datetime

# Create a folder for your PDFs
pdf_folder = "/content/papers"
if not os.path.exists(pdf_folder):
    os.makedirs(pdf_folder)
    print(f"✅ Created folder: {pdf_folder}")
else:
    print(f"📁 Folder already exists: {pdf_folder}")

✅ Created folder: /content/papers


## 📤 Upload Your Research Papers
Upload your five PDF papers about farmworker health:
- Click 'Choose Files' to select your PDFs
- Each uploaded file is moved into the `/content/papers` folder
- A confirmation line is printed for each uploaded file
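
The move-and-verify step in the bullets above can be sketched independently of Colab's upload widget. This is an illustration, not the notebook's own code: `move_into` and `list_pdfs` are hypothetical helpers, and empty placeholder files in a temporary directory stand in for real uploads. The sketch uses `shutil.move`, which also works when source and destination sit on different filesystems, and it matches the `.pdf` extension case-insensitively.

```python
import os
import shutil
import tempfile


def move_into(filenames, source_dir, dest_dir):
    """Move each named file from source_dir into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for name in filenames:
        shutil.move(os.path.join(source_dir, name), os.path.join(dest_dir, name))


def list_pdfs(folder):
    """Return PDF filenames in folder, sorted case-insensitively."""
    return sorted(
        (f for f in os.listdir(folder) if f.lower().endswith(".pdf")),
        key=str.lower,
    )


# Placeholder files standing in for uploaded papers (names are made up).
upload_dir = tempfile.mkdtemp()
papers_dir = os.path.join(upload_dir, "papers")
fake_uploads = ["paper_a.pdf", "paper_b.PDF", "notes.txt"]
for name in fake_uploads:
    open(os.path.join(upload_dir, name), "w").close()

move_into(fake_uploads, upload_dir, papers_dir)

pdfs = list_pdfs(papers_dir)
print(f"Total PDFs in folder: {len(pdfs)}")
for i, pdf in enumerate(pdfs, 1):
    print(f"  {i}. {pdf}")
```

Filtering on the lowercased name means a file saved as `paper_b.PDF` is still counted, while non-PDF uploads such as `notes.txt` are moved but left out of the listing.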

In [4]:
from google.colab import files

print("Click 'Choose Files' below to upload your 5 PDFs:\n")

uploaded = files.upload()

# Move uploaded files to the papers folder
for filename in uploaded.keys():
    destination = os.path.join(pdf_folder, filename)
    os.rename(filename, destination)
    print(f"✅ Uploaded: {filename}")

# Verify and list all PDFs in the folder (match .pdf and .PDF alike)
pdf_files = [f for f in os.listdir(pdf_folder) if f.lower().endswith('.pdf')]
print(f"📊 Total PDFs in folder: {len(pdf_files)}")
for i, pdf in enumerate(pdf_files, 1):
    print(f"  {i}. {pdf}")

Click 'Choose Files' below to upload your 5 PDFs:



Saving A_more_than_four-fold_sex-specific_difference_of_autism_spectrum_disorders_and_the_possible_contribu.pdf to A_more_than_four-fold_sex-specific_difference_of_autism_spectrum_disorders_and_the_possible_contribu.pdf
Saving Adverse_childhood_experiences_and_its_association_with_emotional_and_behavioral_problems_in_US_child.pdf to Adverse_childhood_experiences_and_its_association_with_emotional_and_behavioral_problems_in_US_child.pdf
Saving Agricultural_exposures_and_risk_of_childhood_neuroblastoma_a_systematic_review_and_meta-analysis.pdf to Agricultural_exposures_and_risk_of_childhood_neuroblastoma_a_systematic_review_and_meta-analysis.pdf
Saving Autism_Spectrum_Disorder_and_Prenatal_or_Early_Life_Exposure_to_Pesticides_A_Short_Review.pdf to Autism_Spectrum_Disorder_and_Prenatal_or_Early_Life_Exposure_to_Pesticides_A_Short_Review.pdf
Saving Exposure_to_pesticides_and_childhood_leukemia_risk_A_systematic_review_and_meta-analysis.pdf to Exposure_to_pesticides_and_childhood_leukemia_r