# Preprocessing Notebook: `7_df_sponsors.ipynb`

### Description  
Processes the **`sponsors`** table from the AACT Clinical Trials database to extract and classify sponsor-related attributes. These attributes help characterize trial oversight, funding source, and potential bias patterns in clinical success rates.


### Key Steps  
- Extracted `nct_id`, `agency_class`, and `lead_or_collaborator` from the AACT `ctgov.sponsors` table.  
- Standardized `agency_class` values into **Industry**, **Government**, **Other**, or **Unknown**.  
- One-hot encoded sponsor type and role (lead vs collaborator).  
- Aggregated sponsor-level information per trial (`nct_id`).  
- Exported the final dataset to `../data/processed/sponsors_clean.csv`.  
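
The steps above can be sketched end to end on a small in-memory table, without a database connection. This is a minimal illustration, not the notebook's actual run: the `toy` rows and trial IDs are made up, and the lowercase `lead` / `collaborator` role values are an assumption about how AACT stores them.

```python
import pandas as pd

# Hypothetical rows standing in for ctgov.sponsors (not real AACT data)
toy = pd.DataFrame({
    "nct_id": ["NCT001", "NCT001", "NCT002", "NCT003"],
    "agency_class": ["INDUSTRY", "NIH", "OTHER", None],
    "lead_or_collaborator": ["lead", "collaborator", "lead", "lead"],
})

def sponsor_type(val):
    # Same mapping as the notebook: unmapped or missing values become 'Unknown'
    if val == "INDUSTRY":
        return "Industry"
    if val in ("NIH", "FED", "OTHER_GOV"):
        return "Government"
    if val in ("OTHER", "NETWORK", "INDIV"):
        return "Other"
    return "Unknown"

toy["agency_class"] = toy["agency_class"].apply(sponsor_type)

# One indicator column per sponsor type and per role
encoded = pd.get_dummies(
    toy,
    columns=["agency_class", "lead_or_collaborator"],
    prefix=["sponsor", "role"],
    dtype=int,
)

# max() keeps a flag at 1 if any sponsor row of the trial set it
per_trial = encoded.groupby("nct_id").max().reset_index()
print(per_trial)
```

In this toy table, `NCT001` has both an industry lead and an NIH collaborator, so its single output row carries both `sponsor_Industry` and `sponsor_Government` flags.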

### 📁 Output
- ✅ `sponsors_clean.csv`

In [1]:
# Import required libraries and set up DB connection
from sqlalchemy import create_engine
import pandas as pd
from dotenv import load_dotenv
import os

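load_dotenv()
DATABASE_URL = os.getenv('DATABASE_URL')
# Fail early with a clear message if the connection string is missing
if not DATABASE_URL:
    raise RuntimeError('DATABASE_URL is not set; define it in your .env file')
engine = create_engine(DATABASE_URL)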

In [2]:
# Load `sponsors` table from AACT
query = '''
SELECT nct_id, agency_class, lead_or_collaborator
FROM ctgov.sponsors;
'''
df_sponsors = pd.read_sql(query, engine)

In [3]:
# Drop duplicate rows
df_sponsors = df_sponsors.drop_duplicates()

In [4]:
# Standardize `agency_class` values into broader categories;
# any other value, including missing ones, is mapped to 'Unknown'
def sponsor_type(val):
    if val == 'INDUSTRY':
        return 'Industry'
    elif val in ['NIH', 'FED', 'OTHER_GOV']:
        return 'Government'
    elif val in ['OTHER', 'NETWORK', 'INDIV']:
        return 'Other'
    else:
        return 'Unknown'

df_sponsors['agency_class'] = df_sponsors['agency_class'].apply(sponsor_type)

In [5]:
# One-hot encode sponsor type and role columns
df_sponsors = pd.get_dummies(
    df_sponsors,
    columns=['agency_class', 'lead_or_collaborator'],
    prefix=['sponsor', 'role'],
    dtype=int
)
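
One caveat with `pd.get_dummies`: it only creates columns for categories that actually occur in the data, so a category absent from a given extract produces no column at all. If downstream notebooks expect a fixed schema, a guard along the following lines could help. This is a sketch under assumptions: the helper `ensure_dummy_columns` and the `EXPECTED_COLUMNS` list are hypothetical, with names inferred from the `sponsor` / `role` prefixes used above and from assumed lowercase role values.

```python
import pandas as pd

# Assumed indicator names, derived from the prefixes used in this notebook
EXPECTED_COLUMNS = [
    "sponsor_Government", "sponsor_Industry", "sponsor_Other", "sponsor_Unknown",
    "role_collaborator", "role_lead",
]

def ensure_dummy_columns(df, expected=EXPECTED_COLUMNS):
    """Return a copy of df with any missing indicator column added as all zeros."""
    out = df.copy()
    for col in expected:
        if col not in out.columns:
            out[col] = 0
    return out

# Hypothetical encoded frame in which only two indicators were produced
sample = pd.DataFrame({"nct_id": ["NCT001"], "sponsor_Industry": [1], "role_lead": [1]})
fixed = ensure_dummy_columns(sample)
print(sorted(fixed.columns))
```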

In [6]:
# Aggregate encoded sponsor information by `nct_id`;
# `max` keeps an indicator at 1 if any of the trial's sponsor rows set it
df_sponsors = df_sponsors.groupby('nct_id').max().reset_index()

In [7]:
# Sanity check and export cleaned dataset
assert df_sponsors['nct_id'].is_unique, 'Duplicate trial entries found!'
df_sponsors.to_csv('../data/processed/sponsors_clean.csv', index=False)

---

## Summary  
This notebook extracted and standardized **sponsor-related features** from the AACT database.  

Key outcomes:
- Consolidated sponsor types into four categories: *Industry*, *Government*, *Other*, and *Unknown*.  
- One-hot encoded both sponsor type and role (lead/collaborator).  
- Aggregated multiple sponsor entries per trial into a single record.  
- Exported the cleaned dataset: `sponsors_clean.csv`.

---

🔍 **Additional Note:**  
During EDA, the **role feature (lead vs collaborator)** demonstrated low discriminative power and was excluded from final modeling.  
Only the **sponsor type** variable was retained as it correlated more strongly with trial success probability.
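
Dropping the role indicators before modeling could look like the sketch below. It assumes the role columns carry the `role_` prefix set during one-hot encoding; the helper name `keep_sponsor_type_only` and the sample frame are hypothetical, not part of the original notebooks.

```python
import pandas as pd

def keep_sponsor_type_only(df):
    """Drop every column whose name starts with the 'role_' prefix."""
    role_cols = [c for c in df.columns if c.startswith("role_")]
    return df.drop(columns=role_cols)

# Hypothetical per-trial frame shaped like sponsors_clean.csv
sample = pd.DataFrame({
    "nct_id": ["NCT001", "NCT002"],
    "sponsor_Industry": [1, 0],
    "sponsor_Other": [0, 1],
    "role_lead": [1, 1],
    "role_collaborator": [1, 0],
})
features = keep_sponsor_type_only(sample)
print(list(features.columns))
```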

---

📂 **Next Notebook:** `8_df_merged.ipynb` → Combines all cleaned tables into a single analytical dataset for exploratory data analysis.