# Data Cleanup and Quality Check

**Author:** Diwas Puri (diwas.puri@duke.edu), Duke University

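This notebook cleans the scraped YBIO dataset. It removes exact duplicate rows, repeated header rows, and duplicate UIDs, standardizes missing values, and saves the result.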

**Input:** `../data/organizations_final.csv`  
**Output:** `../data/organizations_clean.csv`

In [None]:
import pandas as pd
import numpy as np
import os

# Define paths
INPUT_FILE = '../data/organizations_final.csv'
OUTPUT_FILE = '../data/organizations_clean.csv'

# Load data
print(f"Loading data from {INPUT_FILE}...")
df = pd.read_csv(INPUT_FILE)

initial_count = len(df)
print(f"Initial row count: {initial_count:,}")
display(df.head())
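
The cleaning cells below skip their checks without any message when the `Name` or `UID` column is absent, so it can be worth confirming up front that both are present. The sketch below is one way to do that. The helper name `missing_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_columns(frame, expected=("Name", "UID")):
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]

# Hypothetical toy frame that lacks the UID column
toy = pd.DataFrame({"Name": ["Org A"], "Website": ["https://example.org"]})
print(missing_columns(toy))
```

In the notebook itself, calling `missing_columns(df)` right after loading would show whether any later step is going to be skipped.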

## 1. Remove Exact Duplicates
We remove rows that are identical across all columns.

In [None]:
df_deduped = df.drop_duplicates()
deduped_count = len(df_deduped)
print(f"Rows after removing exact duplicates: {deduped_count:,}")
print(f"Removed: {initial_count - deduped_count:,} rows")
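
As a small illustration of what this step does and does not catch, the sketch below runs `drop_duplicates()` on a toy frame. The data is invented for illustration; only the `UID` and `Name` column names come from this notebook. Rows count as duplicates only when every column matches, so a near-duplicate that differs by a trailing space survives.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "UID": ["a1", "a1", "b2", "b2"],
    "Name": ["Org A", "Org A", "Org B", "Org B "],  # last value has a trailing space
})

toy_deduped = toy.drop_duplicates()
print(f"{len(toy)} rows -> {len(toy_deduped)} rows")
print(toy_deduped)
```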

## 2. Remove Repeated Headers
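Scraping sometimes leaves copies of the header row in the data body. They show up as rows whose `Name` value is the literal string `Name`. Because exact duplicates were already dropped in step 1, identical header copies have collapsed into a single row by this point.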

In [None]:
if 'Name' in df_deduped.columns:
    header_rows = df_deduped[df_deduped['Name'] == 'Name']
    if not header_rows.empty:
        print(f"Found {len(header_rows)} repeated header rows. Removing...")
        df_deduped = df_deduped[df_deduped['Name'] != 'Name']
    else:
        print("No repeated header rows found.")
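
The check above looks only at the `Name` column, so it would also drop an organization whose name happens to be "Name". A stricter variant, sketched below on invented toy data, treats a row as a repeated header only when every cell equals its own column name. This is an alternative under that assumption, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: row 1 is a repeated header, row 3 only looks like one
toy = pd.DataFrame({
    "UID": ["a1", "UID", "b2", "c3"],
    "Name": ["Org A", "Name", "Org B", "Name"],
})

# Series mapping each column label to itself, so the comparison aligns by column
column_names = toy.columns.to_series()

# True only for rows where every cell equals the name of its column
header_mask = toy.astype(str).eq(column_names, axis="columns").all(axis=1)

toy_clean = toy[~header_mask]
print(f"Repeated header rows found: {int(header_mask.sum())}")
print(toy_clean)
```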

## 3. Check for Duplicate UIDs
Each organization should have a unique `UID`. Where a UID appears more than once, only the first row is kept.

In [None]:
if 'UID' in df_deduped.columns:
    uids = df_deduped['UID']
    # duplicated() treats missing values as equal to each other, so rows
    # without a UID are excluded from the check instead of being collapsed
    dup_mask = uids.notna() & uids.duplicated(keep='first')
    if dup_mask.any():
        print(f"Found {dup_mask.sum()} duplicate UIDs.")
        # Keep the first occurrence of each UID
        df_deduped = df_deduped[~dup_mask]
        print(f"Rows after removing duplicate UIDs: {len(df_deduped):,}")
    else:
        print("No duplicate UIDs found.")
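
Rows that share a UID but differ elsewhere may be conflicting records rather than plain repeats, so it can help to look at them before dropping anything. The sketch below uses invented toy data and shows two things: `duplicated(keep=False)` flags every member of a duplicate group, and pandas treats missing UIDs as equal to each other.

```python
import pandas as pd

# Hypothetical toy data: one conflicting UID and two rows with no UID
toy = pd.DataFrame({
    "UID": ["a1", "a1", "b2", None, None],
    "Name": ["Org A", "Org A (old)", "Org B", "Org C", "Org D"],
})

# keep=False marks all rows in a duplicate group; notna() leaves out missing UIDs
conflicts = toy[toy["UID"].notna() & toy["UID"].duplicated(keep=False)]
print(conflicts.sort_values("UID"))

# With the default keep='first', the second missing UID is flagged as a duplicate
flags = toy["UID"].duplicated().tolist()
print(flags)
```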

## 4. Standardize Missing Values
Convert empty and whitespace-only strings to NaN so that every missing value is represented the same way.

In [None]:
df_deduped = df_deduped.replace(r'^\s*$', np.nan, regex=True)
print("Standardized missing values.")
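
To show what the pattern `^\s*$` matches, the sketch below applies the same `replace` call to invented toy data and then reports the share of missing values per column as a simple quality check. Values that merely contain surrounding whitespace, such as `" Raleigh "`, are left untouched. The `City` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with empty, whitespace-only, and padded strings
toy = pd.DataFrame({
    "Name": ["Org A", "", "   ", "Org D"],
    "City": ["Durham", "\t", None, " Raleigh "],
})

# Same call as in the cell above: only fully blank strings become NaN
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)

# Fraction of missing values in each column
missing_share = toy_clean.isna().mean()
print(missing_share)
```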

## 5. Save Cleaned Data

In [None]:
print(f"Final row count: {len(df_deduped):,}")
df_deduped.to_csv(OUTPUT_FILE, index=False)
print(f"Cleaned data saved to: {OUTPUT_FILE}")
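
A quick round-trip check can confirm that the saved file reads back with the same shape, columns, and number of missing values. The sketch below does this on invented toy data written to a temporary directory. It is an optional sanity check under those assumptions, not part of the original pipeline.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned frame
toy = pd.DataFrame({"UID": ["a1", "b2"], "Name": ["Org A", None]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "organizations_clean.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
same_missing = int(reloaded.isna().sum().sum()) == int(toy.isna().sum().sum())
print(same_shape, same_columns, same_missing)
```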