# MediAgent Data Agent

A Flask-based REST API server for pharmaceutical compound data aggregation and analysis, providing unified access to multiple scientific databases including ChEMBL and PubChem.

## 🔬 Overview

MediAgent Data Agent collects compound data from ChEMBL and PubChem, processes the associated bioactivity information, stores the results in PostgreSQL, and exposes them through a single REST API for drug discovery research.

## 🏗️ Architecture

### Core Components

- **Flask API Server**: REST API with compound search, retrieval, and analytics endpoints
- **Data Agent**: Orchestrates data collection from multiple pharmaceutical databases
- **Database Manager**: Handles PostgreSQL operations and data integrity
- **ChEMBL Connector**: Interfaces with ChEMBL API for molecular data
- **PubChem Connector**: Enriches data with PubChem information

### Technology Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Web Framework** | Flask | REST API server |
| **Database** | PostgreSQL | Data storage and management |
| **Data Sources** | ChEMBL API | Primary compound and bioactivity data |
| **Data Sources** | PubChem API | Compound enrichment and validation |
| **Containerization** | Docker | Deployment and environment isolation |

## 📊 Validation & Research Backing

Each part of this stack is well established in pharmaceutical research:

- **ChEMBL**: Manually curated database used by drug discovery scientists worldwide, with peer-reviewed publications on its web services
- **PubChem**: Established public repository providing standardized chemical information
- **REST API Architecture**: Industry-standard approach for pharmaceutical data access
- **PostgreSQL**: Enterprise-grade database widely used in pharmaceutical data management

These components are commonly used together in drug discovery workflows at pharmaceutical companies, academic groups, and research institutions.

## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- PostgreSQL database
- Docker (optional)

### Installation

1. **Clone the repository**
   ```bash
   git clone <repository-url>
   cd mediagent-data-agent
   ```

2. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

3. **Configure database**
   ```bash
   # Update database connection parameters in database_manager.py
   # Default: host=localhost, database=mediagent, user=admin, password=mediagent2025
   ```

4. **Run the application**
   ```bash
   python api_server.py
   ```

### Docker Deployment

```bash
# Build the image
docker build -t mediagent-data-agent .

# Run the container
docker run -p 5001:5001 mediagent-data-agent
```

## 📡 API Endpoints

### Health Check
```http
GET /health
```
Returns service status and version information.

### Search Compounds
```http
POST /api/v1/compounds/search
Content-Type: application/json

{
  "query": "aspirin",
  "limit": 10
}
```

**Response:**
```json
{
  "success": true,
  "data": {
    "query": "aspirin",
    "compounds_found": 5,
    "compounds_processed": 5,
    "compounds": [
      {
        "id": 1,
        "chembl_id": "CHEMBL25",
        "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
        "molecular_formula": "C9H8O4",
        "molecular_weight": 180.16,
        "bioactivities_count": 42
      }
    ]
  }
}
```
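
A client usually needs only a few fields from this envelope. The sketch below is one way to unpack it, assuming the response always has the shape shown above. The helper name `summarize_search` is hypothetical, and how the server reports failures beyond `success` being false is not documented here.

```python
import json


def summarize_search(body: dict) -> list[tuple[str, int]]:
    """Return (chembl_id, bioactivities_count) pairs from a search response."""
    if not body.get("success"):
        raise RuntimeError("search request was not successful")
    compounds = body["data"]["compounds"]
    # bioactivities_count defaults to 0 if a compound entry omits it (assumption).
    return [(c["chembl_id"], c.get("bioactivities_count", 0)) for c in compounds]


# The example response from this section, trimmed to the fields used here.
example = json.loads("""
{
  "success": true,
  "data": {
    "query": "aspirin",
    "compounds_found": 5,
    "compounds_processed": 5,
    "compounds": [
      {"id": 1, "chembl_id": "CHEMBL25", "bioactivities_count": 42}
    ]
  }
}
""")

print(summarize_search(example))
```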

### Get Compound Details
```http
GET /api/v1/compounds/{chembl_id}
```
Returns details for a single compound identified by its ChEMBL ID (for example, `CHEMBL25`).

### Database Statistics
```http
GET /api/v1/database/stats
```

**Response:**
```json
{
  "success": true,
  "data": {
    "total_compounds": 1250,
    "total_bioactivities": 8743,
    "compounds_added_24h": 23
  }
}
```

## 🔄 Data Processing Workflow

1. **Search Request**: The client queries for compounds by name or identifier
2. **ChEMBL Search**: The system searches ChEMBL for matching compounds
3. **Data Enrichment**:
   - Retrieves detailed molecular information
   - Fetches bioactivity data
   - Attempts to find corresponding PubChem data
4. **Database Storage**: All data is stored in PostgreSQL, with bioactivities linked to their compounds
5. **Response**: The API returns the processed compound data with bioactivity metrics
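
The steps above can be sketched as a single function. This is an illustration only, not the code in `data_agent.py`: the function name, the `CompoundRecord` class, and the four injected callables are all hypothetical stand-ins for the connectors and the database manager. It also assumes that `compounds_found` counts all matches and `compounds_processed` counts those stored after applying `limit`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class CompoundRecord:
    chembl_id: str
    smiles: Optional[str] = None
    pubchem_cid: Optional[int] = None
    bioactivities: list = field(default_factory=list)


def process_search(
    query: str,
    limit: int,
    search_chembl: Callable[[str], list],        # hypothetical ChEMBL search
    fetch_bioactivities: Callable[[str], list],  # hypothetical bioactivity fetch
    find_pubchem_cid: Callable[[Optional[str]], int],  # hypothetical PubChem lookup
    store: Callable[[CompoundRecord], None],     # hypothetical database write
) -> dict:
    matches = search_chembl(query)                       # step 2
    processed = []
    for hit in matches[:limit]:
        record = CompoundRecord(hit["chembl_id"], hit.get("smiles"))
        record.bioactivities = fetch_bioactivities(record.chembl_id)  # step 3
        try:
            record.pubchem_cid = find_pubchem_cid(record.smiles)
        except LookupError:
            # PubChem enrichment is best effort: a miss does not abort the run.
            record.pubchem_cid = None
        store(record)                                    # step 4
        processed.append(record)
    return {                                             # step 5
        "query": query,
        "compounds_found": len(matches),
        "compounds_processed": len(processed),
        "compounds": [
            {
                "chembl_id": r.chembl_id,
                "smiles": r.smiles,
                "bioactivities_count": len(r.bioactivities),
            }
            for r in processed
        ],
    }
```

Passing the collaborators in as callables keeps the orchestration testable without network or database access.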

## 🗃️ Database Schema

### Compounds Table
- `id` (Primary Key)
- `chembl_id` (Unique)
- `pubchem_cid`
- `smiles`
- `molecular_formula`
- `molecular_weight`
- `created_at`
- `updated_at`

### Bioactivities Table
- `id` (Primary Key)
- `compound_id` (Foreign Key)
- `target_chembl_id`
- `standard_type`
- `standard_value`
- `standard_units`
- `pchembl_value`
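
The two tables can be sketched as DDL. The block below runs against an in-memory SQLite database purely so that it executes without a server. The column types and the timestamp defaults are assumptions, and the project's actual PostgreSQL definitions may differ.

```python
import sqlite3

# Column names follow the lists above; types and defaults are assumed.
SCHEMA = """
CREATE TABLE compounds (
    id                INTEGER PRIMARY KEY,
    chembl_id         TEXT UNIQUE,
    pubchem_cid       INTEGER,
    smiles            TEXT,
    molecular_formula TEXT,
    molecular_weight  REAL,
    created_at        TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at        TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE bioactivities (
    id               INTEGER PRIMARY KEY,
    compound_id      INTEGER REFERENCES compounds(id),
    target_chembl_id TEXT,
    standard_type    TEXT,
    standard_value   REAL,
    standard_units   TEXT,
    pchembl_value    REAL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute(
    "INSERT INTO compounds (chembl_id, molecular_formula, molecular_weight) "
    "VALUES (?, ?, ?)",
    ("CHEMBL25", "C9H8O4", 180.16),
)
```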

## ⚙️ Configuration

### Database Connection
Update connection parameters in `database_manager.py`:
```python
DatabaseManager(
    host="localhost",
    port=5432,
    database="mediagent",
    user="admin",
    password="mediagent2025"
)
```
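
One alternative to editing the source file is to read these parameters from the environment. The sketch below assumes that approach. The `MEDIAGENT_DB_*` variable names and the `load_db_config` helper are hypothetical, not something the project defines; only the keyword names match the `DatabaseManager` call above.

```python
import os
from typing import Mapping, Optional


def load_db_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build DatabaseManager keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        "host": env.get("MEDIAGENT_DB_HOST", "localhost"),
        "port": int(env.get("MEDIAGENT_DB_PORT", "5432")),
        "database": env.get("MEDIAGENT_DB_NAME", "mediagent"),
        "user": env.get("MEDIAGENT_DB_USER", "admin"),
        # No default on purpose: a missing password raises KeyError at startup.
        "password": env["MEDIAGENT_DB_PASSWORD"],
    }


# Intended use (requires the project's DatabaseManager class):
# manager = DatabaseManager(**load_db_config())
```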

### API Rate Limiting
The ChEMBL connector includes built-in rate limiting (a 100 ms delay between requests) to respect API usage policies.
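
A fixed delay between requests can be enforced with a small helper like the one below. This is a sketch of the general technique, not the connector's actual code; the class name `MinIntervalLimiter` and its parameters are hypothetical. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(self, min_interval=0.1, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = MinIntervalLimiter(min_interval=0.1)  # 100 ms, as described above
# Call limiter.wait() immediately before each outgoing ChEMBL request.
```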

## 🔍 Usage Examples

### Python Client Example
```python
import requests

BASE_URL = 'http://localhost:5001/api/v1'

# Search for compounds by name
response = requests.post(f'{BASE_URL}/compounds/search',
                         json={'query': 'ibuprofen', 'limit': 5},
                         timeout=30)
response.raise_for_status()  # raise on 4xx/5xx responses
data = response.json()

# Get a specific compound by ChEMBL ID
response = requests.get(f'{BASE_URL}/compounds/CHEMBL521', timeout=30)
response.raise_for_status()
compound = response.json()
```

### JavaScript Client Example
```javascript
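// Search compounds
fetch('http://localhost:5001/api/v1/compounds/search', {
  method: 'POST',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({query: 'aspirin', limit: 10})
})
  .then(response => {
    // fetch only rejects on network errors, so check the HTTP status too
    if (!response.ok) throw new Error(`HTTP ${response.status}`);
    return response.json();
  })
  .then(data => console.log(data))
  .catch(error => console.error(error));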
```

## 🛠️ Development

### Project Structure
```
mediagent-data-agent/
├── api_server.py          # Flask API server
├── data_agent.py          # Main orchestration logic
├── database_manager.py    # PostgreSQL operations
├── chembl_connector.py    # ChEMBL API interface
├── pubchem_connector.py   # PubChem API interface
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container configuration
└── README.md              # This file
```

### Running Tests
```bash
# Test individual components
python database_manager.py
python chembl_connector.py
python data_agent.py
```

## 📝 Logging

The system includes comprehensive logging:
- API request/response logging
- Database operation logging
- External API call logging
- Error tracking and reporting
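
One common way to keep these log streams separate is to give each area its own named logger. The sketch below assumes Python's standard `logging` module; the logger names and the `configure_logging` helper are hypothetical and may not match what the project uses.

```python
import logging


def configure_logging(level: int = logging.INFO) -> dict:
    """Set a shared format and return one logger per area listed above."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    areas = ("api", "database", "external")
    return {area: logging.getLogger(f"mediagent.{area}") for area in areas}


loggers = configure_logging()
loggers["api"].info("POST /api/v1/compounds/search query=%s", "aspirin")
try:
    raise ConnectionError("simulated upstream timeout")
except ConnectionError:
    # exception() logs at ERROR level and appends the traceback.
    loggers["external"].exception("ChEMBL request failed")
```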

## 🔒 Security Considerations

- Input validation on all API endpoints
- SQL injection prevention using parameterized queries
- Rate limiting for external API calls
- Error handling without sensitive information exposure
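
As an illustration of the first point, the search payload could be checked along these lines before it reaches the connectors. This is a sketch under assumptions: the function name, the default limit of 10, and the upper bound of 100 are not taken from the project.

```python
def validate_search_payload(payload, max_limit=100):
    """Return a cleaned (query, limit) pair or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")

    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string")

    limit = payload.get("limit", 10)  # assumed default
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValueError("'limit' must be an integer")
    if not 1 <= limit <= max_limit:
        raise ValueError(f"'limit' must be between 1 and {max_limit}")

    return query.strip(), limit
```

In a Flask view, a `ValueError` raised here would typically be turned into a 400 response whose body contains only the message, which also fits the last point in the list.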

## 📈 Performance Features

- **Deduplication**: Automatic detection of existing compounds
- **Batch Processing**: Efficient handling of multiple compounds
- **Connection Pooling**: Optimized database connections
- **Caching**: Local storage of processed compound data
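
Deduplication on `chembl_id` can be pushed into the database with an upsert. The sketch below shows the idea against an in-memory SQLite table so that it runs standalone; PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` form. Whether the project deduplicates this way is an assumption, and `save_compound` is a hypothetical helper.

```python
import sqlite3

UPSERT = """
INSERT INTO compounds (chembl_id, molecular_weight)
VALUES (?, ?)
ON CONFLICT (chembl_id) DO UPDATE
SET molecular_weight = excluded.molecular_weight
"""


def save_compound(conn, chembl_id, molecular_weight):
    """Insert the compound, or refresh it if the ChEMBL ID already exists."""
    conn.execute(UPSERT, (chembl_id, molecular_weight))
    row = conn.execute(
        "SELECT id FROM compounds WHERE chembl_id = ?", (chembl_id,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds ("
    "id INTEGER PRIMARY KEY, chembl_id TEXT UNIQUE, molecular_weight REAL)"
)
first_id = save_compound(conn, "CHEMBL25", 180.0)
second_id = save_compound(conn, "CHEMBL25", 180.16)  # same row, updated value
```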

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🔗 References

- [ChEMBL Web Services Documentation](https://chembl.gitbook.io/chembl-interface-documentation/web-services)
- [PubChem API Documentation](https://pubchem.ncbi.nlm.nih.gov/docs/programmatic-access)
- [Flask Documentation](https://flask.palletsprojects.com/)
- [PostgreSQL Documentation](https://www.postgresql.org/docs/)

## 📞 Support

For questions, issues, or contributions, please open an issue on the project repository.