# Assignment 5, Question 4: Data Exploration

**Points: 15**

In this notebook, you'll explore the clinical trial dataset using pandas selection and filtering techniques.

You'll use utility functions from `q3_data_utils` where helpful, but also demonstrate direct pandas operations.

## Setup

In [1]:
# Quick exploration using the q3_data_utils helpers
from q3_data_utils import (
    load_data, clean_data, detect_missing, fill_missing,
    transform_types, create_bins, summarize_by_group, filter_data,
)
import pandas as pd
import os

# Create output directory
os.makedirs('output', exist_ok=True)
DATA_FILE = 'data/clinical_trial_raw.csv'

# Load data
df = load_data(DATA_FILE)
print(f'Loaded {len(df)} rows, {len(df.columns)} columns')

# 1. Clean data: clean_data overwrites 'site' and 'intervention_group' with cleaned values in df_clean
df_clean = clean_data(df)
missing = detect_missing(df_clean)
print('Missing values per column:\n', missing.head(10))

# 2. Fill BMI with median and transform types
df_filled = fill_missing(df_clean, 'bmi', strategy='median')
df_typed = transform_types(df_filled, {'enrollment_date': 'datetime', 'age': 'numeric'})

# 3. Create age bins and summarize by site (using the cleaned 'site' column)
df_binned = create_bins(df_typed, 'age', bins=[0,18,35,50,65,100], labels=['<18','18-34','35-49','50-64','65+'])
summary = summarize_by_group(df_binned, 'site', agg_dict={'age':'mean','bmi':'mean'})
print(summary.head())

# 4. Save outputs using the cleaned 'site' column
summary.to_csv('output/q4_site_summary.csv', index=False)
df_typed['site'].value_counts().to_csv('output/q4_site_counts.csv', header=['patient_count'])
print('Wrote output/q4_site_summary.csv and output/q4_site_counts.csv')

Loaded 10000 rows, 18 columns
Missing values per column:
 patient_id             0
age                  200
sex                    0
bmi                  626
enrollment_date        0
systolic_bp          414
diastolic_bp         414
cholesterol_total    554
cholesterol_hdl      554
cholesterol_ldl      554
Name: missing_count, dtype: int64
     site        age        bmi  patient_count
0  Site A  81.213296  26.261570           2956
1  Site B  80.444030  26.137668           2453
2  Site C  80.750986  26.335215           2073
3  Site D  80.755586  26.323917           1501
4  Site E  80.415075  26.327237           1017
Wrote output/q4_site_summary.csv and output/q4_site_counts.csv
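
The helper calls above can also be written as direct pandas operations. The sketch below is one possible equivalent on a small made-up table; how `detect_missing`, `fill_missing`, and `create_bins` are actually implemented in `q3_data_utils` is not shown in this notebook, so the mapping to `isna().sum()`, `fillna()`, and `pd.cut()` is an assumption. The open-ended last bin edge is also an assumption, chosen so that an age of 100 still lands in `'65+'` when the bins are left-closed.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the clinical trial table (values are made up).
toy = pd.DataFrame({
    "age": [17, 18, 34, 65, 100],
    "bmi": [22.0, np.nan, 30.0, np.nan, 26.0],
})

# Missing values per column, comparable to a per-column missing count.
missing_counts = toy.isna().sum()

# Median imputation for a single column.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())

# Left-closed bins so each label matches its range: [0, 18), [18, 35), ...
# The infinite upper edge is an assumption so that age 100 is not dropped.
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 18, 35, 50, 65, np.inf],
    labels=["<18", "18-34", "35-49", "50-64", "65+"],
    right=False,
)

print(missing_counts)
print(toy)
```

With the default `right=True`, `pd.cut` would instead place age 18 in the `'<18'` bin, so it is worth checking which convention the labels are meant to follow.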


In [2]:
# Cell 2: Final Cleaning and Site Distribution
import os
import pandas as pd
import numpy as np


OUTPUT_DIR = 'output'
os.makedirs(OUTPUT_DIR, exist_ok=True)
print("2. Finalizing Site Cleaning, Generating Distribution, & Saving CSV")

# --- Site standardization on the raw df ---
# 1. Replace underscores with spaces, drop every character that is not a letter or
#    whitespace (this also removes digits), and convert to Title Case (handles Site_D, Site B94, etc.)
df['site'] = df['site'].astype(str).str.replace('_', ' ', regex=False).str.strip()
df['site'] = df['site'].str.replace(r'[^A-Za-z\s]', '', regex=True).str.strip()
df['site'] = df['site'].str.title()

# 2. Whitespace normalization: collapse every run of whitespace (tabs, newlines,
#    non-breaking spaces) into a single space, then strip again. This also covers
#    trailing-space variants such as 'Site A '.
df['site'] = df['site'].str.replace(r'\s+', ' ', regex=True).str.strip()

# Value counts calculation (using the now fully standardized data)
site_counts_series = df['site'].value_counts()

# Convert the Series to a DataFrame for CSV saving
site_counts_df = site_counts_series.reset_index()
site_counts_df.columns = ['site', 'patient_count']

# Save to required output file
output_path = os.path.join(OUTPUT_DIR, 'q4_site_counts.csv')
site_counts_df.to_csv(output_path, index=False)

print(f"Site counts saved to {output_path}. Should show exactly 5 sites.")
display(site_counts_df)

2. Finalizing Site Cleaning, Generating Distribution, & Saving CSV
Site counts saved to output\q4_site_counts.csv. Should show exactly 5 sites.


,site,patient_count
0,Site A,2956
1,Site B,2453
2,Site C,2073
3,Site D,1501
4,Site E,1017
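
The message "Should show exactly 5 sites" can be turned into an explicit check. The sketch below wraps the same cleaning steps in a hypothetical helper, `standardize_site`, and compares the result against an assumed set of expected labels. Apart from the `Site_D` and `Site B94` patterns mentioned in the comments above, the raw values are made up for illustration.

```python
import pandas as pd

# Assumed set of valid labels, taken from the five sites in the output above.
EXPECTED_SITES = {"Site A", "Site B", "Site C", "Site D", "Site E"}


def standardize_site(values: pd.Series) -> pd.Series:
    """Hypothetical helper mirroring the cleaning steps used in this notebook."""
    return (
        values.astype(str)
        .str.replace("_", " ", regex=False)            # underscores -> spaces
        .str.replace(r"[^A-Za-z\s]", "", regex=True)   # drop digits and punctuation
        .str.replace(r"\s+", " ", regex=True)          # collapse whitespace runs
        .str.strip()
        .str.title()
    )


raw = pd.Series(["site_a", "SITE B94", "  Site  C ", "Site_D", "site e"])
clean = standardize_site(raw)

# Fail loudly if any label falls outside the expected set.
unexpected = set(clean) - EXPECTED_SITES
assert not unexpected, f"Unexpected site labels: {unexpected}"
print(clean.tolist())
```

A check like this fails as soon as a new spelling variant appears, rather than relying on someone reading the value counts.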


## Part 1: Basic Exploration (3 points)

Display:
1. Dataset shape
2. Column names and types
3. First 10 rows
4. Summary statistics (.describe())

In [3]:
# Part 1:
print('Dataset shape:', df.shape)
print('\nColumn names and dtypes:')
print(df.dtypes)

print('\nFirst 10 rows:')
display(df.head(10))

print('\nSummary statistics (numeric columns):')
display(df.describe(include=[np.number]).T)

print('\nSummary statistics (all columns):')
display(df.describe(include="all").T)


Dataset shape: (10000, 18)

Column names and dtypes:
patient_id             object
age                     int64
sex                    object
bmi                   float64
enrollment_date        object
systolic_bp           float64
diastolic_bp          float64
cholesterol_total     float64
cholesterol_hdl       float64
cholesterol_ldl       float64
glucose_fasting       float64
site                   object
intervention_group     object
follow_up_months        int64
adverse_events          int64
outcome_cvd            object
adherence_pct         float64
dropout                object
dtype: object

First 10 rows:


,patient_id,age,sex,bmi,enrollment_date,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,site,intervention_group,follow_up_months,adverse_events,outcome_cvd,adherence_pct,dropout
0,P00001,80,F,29.3,2022-05-01,123.0,80.0,120.0,55.0,41.0,118.0,Site B,Control,20,0,No,24.0,No
1,P00002,80,Female,,2022-01-06,139.0,81.0,206.0,58.0,107.0,79.0,Site A,CONTROL,24,0,No,77.0,No
2,P00003,82,Female,-1.0,2023-11-04,123.0,86.0,172.0,56.0,82.0,77.0,Site C,treatment b,2,0,Yes,70.0,No
3,P00004,95,Female,25.4,2022-08-15,116.0,77.0,200.0,56.0,104.0,115.0,Site D,treatment b,17,0,No,62.0,No
4,P00005,95,M,,2023-04-17,97.0,71.0,185.0,78.0,75.0,113.0,Site E,Treatmen A,9,0,yes,,Yes
5,P00006,78,F,26.8,2023-08-29,116.0,66.0,164.0,54.0,99.0,99.0,Site A,TreatmentA,4,0,yes,,Yes
6,P00007,84,F,25.4,2022-05-12,133.0,100.0,215.0,62.0,113.0,70.0,Site A,treatment a,20,1,No,76.0,No
7,P00008,70,Male,24.7,2022-06-04,111.0,72.0,174.0,60.0,94.0,109.0,Site B,TREATMENT A,19,0,No,53.0,No
8,P00009,92,Female,26.9,2022-04-06,,,189.0,62.0,89.0,103.0,Site A,Control,21,0,yes,53.0,No
9,P00010,75,Male,21.1,2023-12-14,128.0,76.0,218.0,77.0,97.0,96.0,Site A,Treatment B,1,0,No,50.0,No



Summary statistics (numeric columns):


,count,mean,std,min,25%,50%,75%,max
age,10000.0,59.1827,151.769963,-999.0,70.0,80.0,92.0,100.0
bmi,9562.0,25.730558,5.339547,-1.0,23.5,26.0,28.775,42.8
systolic_bp,9586.0,117.531087,13.973973,75.0,108.0,117.0,127.0,173.0
diastolic_bp,9586.0,73.550908,10.167464,60.0,65.0,73.0,81.0,118.0
cholesterol_total,9446.0,178.039488,33.129034,91.0,155.0,177.0,200.0,315.0
cholesterol_hdl,9446.0,61.369786,11.062101,25.0,54.0,61.0,69.0,98.0
cholesterol_ldl,9446.0,85.698603,28.686463,40.0,65.0,84.0,105.0,226.0
glucose_fasting,9631.0,96.424255,17.112961,51.0,84.0,96.0,108.0,163.0
follow_up_months,10000.0,12.2546,7.07675,0.0,6.0,12.0,19.0,24.0
adverse_events,10000.0,0.1455,0.393631,0.0,0.0,0.0,0.0,4.0



Summary statistics (all columns):


,count,unique,top,freq,mean,std,min,25%,50%,75%,max
patient_id,10000.0,10000.0,P00001,1.0,,,,,,,
age,10000.0,,,,59.1827,151.769963,-999.0,70.0,80.0,92.0,100.0
sex,10000.0,8.0,Female,2684.0,,,,,,,
bmi,9562.0,,,,25.730558,5.339547,-1.0,23.5,26.0,28.775,42.8
enrollment_date,10000.0,1666.0,2023-06-02,23.0,,,,,,,
systolic_bp,9586.0,,,,117.531087,13.973973,75.0,108.0,117.0,127.0,173.0
diastolic_bp,9586.0,,,,73.550908,10.167464,60.0,65.0,73.0,81.0,118.0
cholesterol_total,9446.0,,,,178.039488,33.129034,91.0,155.0,177.0,200.0,315.0
cholesterol_hdl,9446.0,,,,61.369786,11.062101,25.0,54.0,61.0,69.0,98.0
cholesterol_ldl,9446.0,,,,85.698603,28.686463,40.0,65.0,84.0,105.0,226.0
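
The numeric summary shows a minimum age of -999 and a minimum BMI of -1.0, and the mean age of about 59 sits well below the 25th percentile of 70. One reading, which is an assumption here, is that these negative values are placeholders for "unknown" rather than real measurements. The sketch below shows on made-up values how masking such placeholders changes a mean; the thresholds are illustrative, not taken from `q3_data_utils`.

```python
import pandas as pd

# Toy values; -999 and -1.0 mimic the suspicious minimums in the summary above.
toy = pd.DataFrame({
    "age": [80, -999, 70, 92],
    "bmi": [29.3, 25.4, -1.0, 26.8],
})

# Assumption: negative ages and non-positive BMIs are placeholders, so set them to NaN.
cleaned = toy.copy()
cleaned["age"] = cleaned["age"].where(cleaned["age"] >= 0)
cleaned["bmi"] = cleaned["bmi"].where(cleaned["bmi"] > 0)

raw_mean_age = toy["age"].mean()
masked_mean_age = cleaned["age"].mean()  # NaN is skipped by default

print("raw mean age:   ", raw_mean_age)
print("masked mean age:", round(masked_mean_age, 2))
```

`Series.where` keeps values where the condition holds and replaces the rest with `NaN`, which `mean()` and `describe()` then ignore.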


## Part 2: Column Selection (3 points)

Demonstrate different selection methods:

1. Select only numeric columns using `.select_dtypes()`
2. Select specific columns by name
3. Select a subset of rows and columns using `.loc[]`

In [4]:
# Part 2:
numeric_cols = df.select_dtypes(include=[np.number])
print('Numeric columns (count):', len(numeric_cols.columns))
print(list(numeric_cols.columns))
display(numeric_cols.head())
print('Numeric-only dataframe shape:', numeric_cols.shape)

Numeric columns (count): 11
['age', 'bmi', 'systolic_bp', 'diastolic_bp', 'cholesterol_total', 'cholesterol_hdl', 'cholesterol_ldl', 'glucose_fasting', 'follow_up_months', 'adverse_events', 'adherence_pct']


,age,bmi,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,follow_up_months,adverse_events,adherence_pct
0,80,29.3,123.0,80.0,120.0,55.0,41.0,118.0,20,0,24.0
1,80,,139.0,81.0,206.0,58.0,107.0,79.0,24,0,77.0
2,82,-1.0,123.0,86.0,172.0,56.0,82.0,77.0,2,0,70.0
3,95,25.4,116.0,77.0,200.0,56.0,104.0,115.0,17,0,62.0
4,95,,97.0,71.0,185.0,78.0,75.0,113.0,9,0,


Numeric-only dataframe shape: (10000, 11)


In [5]:
cols = ['patient_id', 'age', 'bmi', 'site']
cols_found = [c for c in cols if c in df.columns]
print('Requested columns found:', cols_found)
display(df[cols_found].head())

Requested columns found: ['patient_id', 'age', 'bmi', 'site']


,patient_id,age,bmi,site
0,P00001,80,29.3,Site B
1,P00002,80,,Site A
2,P00003,82,-1.0,Site C
3,P00004,95,25.4,Site D
4,P00005,95,,Site E


In [6]:
cols = ['patient_id', 'age', 'site']
cols_available = [c for c in cols if c in df.columns]
print('Using columns for .loc():', cols_available)
subset = df.loc[0:9, cols_available]
display(subset)

Using columns for .loc(): ['patient_id', 'age', 'site']


,patient_id,age,site
0,P00001,80,Site B
1,P00002,80,Site A
2,P00003,82,Site C
3,P00004,95,Site D
4,P00005,95,Site E
5,P00006,78,Site A
6,P00007,84,Site A
7,P00008,70,Site B
8,P00009,92,Site A
9,P00010,75,Site A
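
`df.loc[0:9, ...]` returns ten rows because `.loc` slices by label and includes both endpoints, whereas `.iloc` slices by position and excludes the end. The sketch below illustrates the difference on a made-up frame; the two agree only while the index is still the default 0, 1, 2, ... range.

```python
import pandas as pd

# Toy frame with a default integer index (values are made up).
toy = pd.DataFrame({
    "patient_id": [f"P{i:05d}" for i in range(1, 13)],
    "age": range(70, 82),
})

by_label = toy.loc[0:9, ["patient_id", "age"]]   # labels 0 through 9, inclusive
by_position = toy.iloc[0:10, [0, 1]]             # positions 0 through 9, end exclusive
same_rows = by_label.equals(by_position)

# After filtering, labels and positions no longer line up.
older = toy[toy["age"] >= 75]        # keeps original labels 5..11
label_slice = older.loc[0:9]         # labels 5..9 only
position_slice = older.iloc[0:10]    # every remaining row

print(len(by_label), len(by_position), same_rows)
print(len(label_slice), len(position_slice))
```

When positional behaviour is wanted after a filter, either use `.iloc` or call `reset_index(drop=True)` first.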


## Part 3: Filtering (4 points)

Filter the data to answer these questions:

1. How many patients are over 65 years old?
2. How many patients have systolic BP > 140?
3. Find patients who are both over 65 AND have systolic BP > 140
4. Find patients from Site A or Site B using `.isin()`

In [7]:
# Part 3.1: Filter and count patients over 65
filters = [{'column': 'age', 'condition': 'greater_than', 'value': 65}]
patients_over_65 = filter_data(df, filters)
print(f"Patients over 65: {len(patients_over_65)}")
display(patients_over_65.head(20))

Patients over 65: 8326


,patient_id,age,sex,bmi,enrollment_date,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,site,intervention_group,follow_up_months,adverse_events,outcome_cvd,adherence_pct,dropout
0,P00001,80,F,29.3,2022-05-01,123.0,80.0,120.0,55.0,41.0,118.0,Site B,Control,20,0,No,24.0,No
1,P00002,80,Female,,2022-01-06,139.0,81.0,206.0,58.0,107.0,79.0,Site A,CONTROL,24,0,No,77.0,No
2,P00003,82,Female,-1.0,2023-11-04,123.0,86.0,172.0,56.0,82.0,77.0,Site C,treatment b,2,0,Yes,70.0,No
3,P00004,95,Female,25.4,2022-08-15,116.0,77.0,200.0,56.0,104.0,115.0,Site D,treatment b,17,0,No,62.0,No
4,P00005,95,M,,2023-04-17,97.0,71.0,185.0,78.0,75.0,113.0,Site E,Treatmen A,9,0,yes,,Yes
5,P00006,78,F,26.8,2023-08-29,116.0,66.0,164.0,54.0,99.0,99.0,Site A,TreatmentA,4,0,yes,,Yes
6,P00007,84,F,25.4,2022-05-12,133.0,100.0,215.0,62.0,113.0,70.0,Site A,treatment a,20,1,No,76.0,No
7,P00008,70,Male,24.7,2022-06-04,111.0,72.0,174.0,60.0,94.0,109.0,Site B,TREATMENT A,19,0,No,53.0,No
8,P00009,92,Female,26.9,2022-04-06,,,189.0,62.0,89.0,103.0,Site A,Control,21,0,yes,53.0,No
9,P00010,75,Male,21.1,2023-12-14,128.0,76.0,218.0,77.0,97.0,96.0,Site A,Treatment B,1,0,No,50.0,No


In [8]:
# Part 3.2: Filter for high systolic BP (>140)
filters = [{'column': 'systolic_bp', 'condition': 'greater_than', 'value': 140}]
high_bp = filter_data(df, filters)
print(f"Patients with systolic BP > 140: {len(high_bp)}")
display(high_bp.head())


Patients with systolic BP > 140: 538


,patient_id,age,sex,bmi,enrollment_date,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,site,intervention_group,follow_up_months,adverse_events,outcome_cvd,adherence_pct,dropout
0,P00034,83,Male,37.1,2022-01-18,143.0,100.0,187.0,51.0,99.0,115.0,Site B,CONTROL,24,0,Yes,,Yes
1,P00035,70,M,34.4,2022-10-21,143.0,73.0,146.0,69.0,48.0,156.0,Site D,CONTROL,15,0,no,50.0,No
2,P00083,73,F,22.4,2023-03-13,152.0,80.0,199.0,50.0,109.0,105.0,Site C,CONTROL,10,0,No,81.0,No
3,P00115,88,M,35.2,2022-07-05,142.0,80.0,152.0,33.0,89.0,125.0,Site D,Contrl,18,0,yes,69.0,No
4,P00117,61,F,32.5,2022-11-30,141.0,89.0,150.0,51.0,69.0,146.0,Site E,Contrl,13,0,No,96.0,No


In [9]:
# Part 3.3: Multiple conditions (age > 65 AND systolic_bp > 140)
filters = [
    {'column': 'age', 'condition': 'greater_than', 'value': 65},
    {'column': 'systolic_bp', 'condition': 'greater_than', 'value': 140}
]
both_conditions = filter_data(df, filters)
print(f"Patients over 65 AND systolic BP > 140: {len(both_conditions)}")
display(both_conditions.head())

# Comparison: in_range for ages 65-100 (this count also includes patients aged exactly 65,
# so it is larger than the greater_than 65 count)
age_filters = [{'column': 'age', 'condition': 'in_range', 'value': [65, 100]}]
age_range = filter_data(df, age_filters)
print(f"Patients aged 65-100: {len(age_range)}")


Patients over 65 AND systolic BP > 140: 464


,patient_id,age,sex,bmi,enrollment_date,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,site,intervention_group,follow_up_months,adverse_events,outcome_cvd,adherence_pct,dropout
0,P00034,83,Male,37.1,2022-01-18,143.0,100.0,187.0,51.0,99.0,115.0,Site B,CONTROL,24,0,Yes,,Yes
1,P00035,70,M,34.4,2022-10-21,143.0,73.0,146.0,69.0,48.0,156.0,Site D,CONTROL,15,0,no,50.0,No
2,P00083,73,F,22.4,2023-03-13,152.0,80.0,199.0,50.0,109.0,105.0,Site C,CONTROL,10,0,No,81.0,No
3,P00115,88,M,35.2,2022-07-05,142.0,80.0,152.0,33.0,89.0,125.0,Site D,Contrl,18,0,yes,69.0,No
4,P00118,89,Male,,2023-12-25,152.0,88.0,221.0,75.0,102.0,118.0,Site C,CONTROL,0,0,No,,Yes


Patients aged 65-100: 8501
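
The same filters can be written as direct pandas boolean masks. The sketch below uses made-up values and does not claim to reproduce how `filter_data` works internally. It shows two details that affect counts: a comparison against `NaN` evaluates to `False`, and `Series.between` includes both endpoints by default, unlike a strict `> 65`.

```python
import numpy as np
import pandas as pd

# Toy data (made up); includes a missing blood pressure and a -999 age.
toy = pd.DataFrame({
    "age": [64, 65, 66, 83, 70, -999],
    "systolic_bp": [150.0, 145.0, 141.0, np.nan, 120.0, 160.0],
})

over_65 = toy["age"] > 65               # strict: age 65 is excluded
high_bp = toy["systolic_bp"] > 140      # NaN compares as False
both = toy[over_65 & high_bp]           # parentheses or named masks are required with &

age_65_to_100 = toy[toy["age"].between(65, 100)]   # inclusive on both ends

print(int(over_65.sum()), len(both), len(age_65_to_100))
```

Combining masks with `&` and `|` rather than `and` and `or` is required because the operation is element-wise.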


In [10]:
# Part 3.4: Filter by site using .isin() (recommended)
# Use cleaned site column if available
site_col = 'site_clean' if 'site_clean' in df.columns else 'site'
site_values = ['Site A', 'Site B']
site_ab_isin = df[df[site_col].isin(site_values)]
print(f"Patients from Site A or Site B (using .isin on cleaned values): {len(site_ab_isin)}")
display(site_ab_isin.head(20))

Patients from Site A or Site B (using .isin on cleaned values): 5409


,patient_id,age,sex,bmi,enrollment_date,systolic_bp,diastolic_bp,cholesterol_total,cholesterol_hdl,cholesterol_ldl,glucose_fasting,site,intervention_group,follow_up_months,adverse_events,outcome_cvd,adherence_pct,dropout
0,P00001,80,F,29.3,2022-05-01,123.0,80.0,120.0,55.0,41.0,118.0,Site B,Control,20,0,No,24.0,No
1,P00002,80,Female,,2022-01-06,139.0,81.0,206.0,58.0,107.0,79.0,Site A,CONTROL,24,0,No,77.0,No
5,P00006,78,F,26.8,2023-08-29,116.0,66.0,164.0,54.0,99.0,99.0,Site A,TreatmentA,4,0,yes,,Yes
6,P00007,84,F,25.4,2022-05-12,133.0,100.0,215.0,62.0,113.0,70.0,Site A,treatment a,20,1,No,76.0,No
7,P00008,70,Male,24.7,2022-06-04,111.0,72.0,174.0,60.0,94.0,109.0,Site B,TREATMENT A,19,0,No,53.0,No
8,P00009,92,Female,26.9,2022-04-06,,,189.0,62.0,89.0,103.0,Site A,Control,21,0,yes,53.0,No
9,P00010,75,Male,21.1,2023-12-14,128.0,76.0,218.0,77.0,97.0,96.0,Site A,Treatment B,1,0,No,50.0,No
10,P00011,79,Female,23.5,2023-06-12,110.0,70.0,216.0,81.0,92.0,85.0,Site A,control,7,0,no,39.0,No
11,P00012,72,Male,,2023-06-02,106.0,70.0,167.0,82.0,52.0,107.0,Site B,Contrl,7,0,No,100.0,No
12,P00013,100,Male,28.0,03/28/2023,120.0,87.0,160.0,82.0,62.0,120.0,Site B,Treatment A,9,0,Yes,72.0,No
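
For reference, `.isin()` is equivalent to chaining equality tests with `|`, and the resulting mask can be inverted with `~` to select every other site. A minimal sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"site": ["Site A", "Site B", "Site C", "Site A", "Site E"]})

in_ab = toy["site"].isin(["Site A", "Site B"])
with_or = (toy["site"] == "Site A") | (toy["site"] == "Site B")

site_ab = toy[in_ab]        # rows from Site A or Site B
other_sites = toy[~in_ab]   # complement: all remaining sites

print(in_ab.equals(with_or), len(site_ab), len(other_sites))
```

`.isin()` scales better than chained comparisons once the list of values grows, and it matches exact strings only, so it depends on the site labels having been standardized first.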


## Part 4: Value Counts and Grouping (5 points)

1. Get value counts for the 'site' column
2. Get value counts for the 'intervention_group' column  
3. Create a crosstab of site vs intervention_group
4. Calculate mean age by site
5. Save the site value counts to `output/q4_site_counts.csv`

In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# --- Setup: df is the raw DataFrame with only 'site' standardized (Cell 2) ---
# Note: 'intervention_group' has not been cleaned in df, so the counts and crosstab
# below are split across spelling and case variants. The output of clean_data
# (df_clean in Cell 1) holds the cleaned versions of both columns.
OUTPUT_DIR = 'output'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# 1. Site value counts and plot
site_counts = df['site'].value_counts().reset_index()
site_counts.columns = ['site', 'patient_count'] 

print("\n1. Site value counts:")
display(site_counts)

# Plot: Site Distribution Bar Chart
plt.figure(figsize=(9, 6))
sns.barplot(
    x='site', 
    y='patient_count', 
    data=site_counts, 
    palette='viridis'
)
plt.title('Distribution of Patients by Site')
plt.xlabel('Clinical Site')
plt.ylabel('Patient Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'q4_site_counts_bar.png'))
plt.close() # Close plot


# 2. Intervention group value counts and plot
interv_counts = df['intervention_group'].value_counts().reset_index()
interv_counts.columns = ['intervention_group', 'patient_count']

print("\n2. Intervention group value counts:")
display(interv_counts)

# Plot: Intervention Group Distribution Bar Chart
plt.figure(figsize=(9, 6))
sns.barplot(
    x='intervention_group', 
    y='patient_count', 
    data=interv_counts, 
    palette='plasma'
)
plt.title('Distribution of Patients by Intervention Group')
plt.xlabel('Intervention Group')
plt.ylabel('Patient Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'q4_intervention_counts_bar.png'))
plt.close()


# 3. Crosstab of site vs intervention_group and heatmap
site_intervention_crosstab = pd.crosstab(df['site'], df['intervention_group'])
print("\n3. Crosstab of site vs intervention group:")
display(site_intervention_crosstab)

# Plot: Site vs Intervention Group Heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(
    site_intervention_crosstab, 
    annot=True, 
    fmt='d', 
    cmap='Blues', 
    cbar_kws={'label': 'Patient Count'}
)
plt.title('Site vs Intervention Group Distribution')
plt.xlabel('Intervention Group')
plt.ylabel('Clinical Site')
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'q4_site_intervention_heatmap.png'))
plt.close()


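# 4. Mean age by site (computed on the raw 'age' column, which still contains
#    values of -999, see Part 1; these pull the means well below the cleaned ones)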
mean_age_by_site = df.groupby('site')['age'].mean().round(1)
print("\n4. Mean age by site:")
display(mean_age_by_site)

# Optional bar plot of mean age by site
plt.figure(figsize=(10, 6))
mean_age_by_site.plot(kind='bar')
plt.title('Mean Age by Site')
plt.ylabel('Age (years)')
plt.xticks(rotation=0) 
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'q4_mean_age_by_site_bar.png'))
plt.close()


# 5. Save the site value counts to output/q4_site_counts.csv
# Use the series derived from the value_counts() to save the artifact
df['site'].value_counts().to_csv(os.path.join(OUTPUT_DIR, 'q4_site_counts.csv'), header=['patient_count'])
print("\n5. Site value counts saved to output/q4_site_counts.csv")


1. Site value counts:


,site,patient_count
0,Site A,2956
1,Site B,2453
2,Site C,2073
3,Site D,1501
4,Site E,1017



2. Intervention group value counts:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(


,intervention_group,patient_count
0,Contrl,802
1,TREATMENT B,761
2,Treatment B,760
3,Control,751
4,treatment b,750
5,control,734
6,Treatment B,730
7,CONTROL,715
8,TreatmentA,635
9,Treatment A,610



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(



3. Crosstab of site vs intervention group:


site,CONTROL,Contrl,Control,TREATMENT A,TREATMENT B,Treatmen A,Treatment B,Treatment A,Treatment B,TreatmentA,...,TREATMENT A,TREATMENT B,Treatmen A,Treatment B,Treatment A,Treatment B,TreatmentA,control,treatment a,treatment b
Site A,24,25,18,13,27,20,20,17,33,16,...,174,229,164,210,164,199,190,224,189,222
Site B,21,20,30,16,23,13,17,18,22,20,...,141,181,138,189,159,201,165,169,141,184
Site C,16,12,22,20,13,7,12,12,23,11,...,108,177,138,155,133,137,129,153,101,155
Site D,15,10,12,7,8,13,9,9,11,13,...,84,89,92,130,92,126,104,111,99,105
Site E,8,6,13,9,12,7,6,10,15,6,...,65,85,60,76,62,67,47,77,58,84



4. Mean age by site:


site
Site A    56.4
Site B    62.4
Site C    57.3
Site D    63.5
Site E    57.1
Name: age, dtype: float64


5. Site value counts saved to output/q4_site_counts.csv
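
The intervention group counts and the crosstab above are fragmented across variants such as `CONTROL`, `Contrl`, `TreatmentA`, and `Treatmen A`. The sketch below shows one way to consolidate them before cross-tabulating. The lookup table `SPELLING_FIXES` is hypothetical and covers only the misspellings visible in this notebook; the whitespace variants in the toy data are an assumption about why some labels appear twice.

```python
import pandas as pd

# Toy rows (made up) using spelling variants seen in the outputs above.
toy = pd.DataFrame({
    "site": ["Site A", "Site A", "Site B", "Site B", "Site C", "Site C"],
    "intervention_group": [
        "CONTROL", "Contrl", "TreatmentA", "  treatment b ", "Treatmen A", "Treatment  B",
    ],
})

# Hypothetical lookup for misspellings that case and whitespace fixes cannot repair.
SPELLING_FIXES = {
    "contrl": "control",
    "treatmenta": "treatment a",
    "treatmen a": "treatment a",
}

toy["intervention_group"] = (
    toy["intervention_group"]
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated spaces
    .replace(SPELLING_FIXES)                # exact-value replacement
    .str.title()
)

table = pd.crosstab(toy["site"], toy["intervention_group"])
print(table)
```

After normalization the crosstab has one column per actual group, which is what a heatmap of site against intervention group needs to be readable.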


In [12]:
# 5. Save to CSV (index=False keeps the row index out of the file)
output_file = 'output/q4_site_counts.csv'
site_counts.to_csv(output_file, index=False)
print(f"Saved site value counts to {output_file}")

# Preview the first few rows of the saved file
print("\nPreview of saved CSV content:")
saved_counts = pd.read_csv(output_file)
display(saved_counts.head())

Saved site value counts to output/q4_site_counts.csv

Preview of saved CSV content:


,site,patient_count
0,Site A,2956
1,Site B,2453
2,Site C,2073
3,Site D,1501
4,Site E,1017
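
By default `DataFrame.to_csv` writes the row index as an extra, unnamed first column, and `pd.read_csv` then labels that column `Unnamed: 0`. The sketch below demonstrates the round trip in memory with `io.StringIO`, so it touches no files; the two rows reuse the Site A and Site B counts from this notebook.

```python
import io

import pandas as pd

site_counts = pd.DataFrame({
    "site": ["Site A", "Site B"],
    "patient_count": [2956, 2453],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
site_counts.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the real columns.
without_index = io.StringIO()
site_counts.to_csv(without_index, index=False)
without_index.seek(0)
round_trip = pd.read_csv(without_index)
cols_without_index = round_trip.columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Reading the file back and checking its columns is a cheap way to confirm that a saved artifact has the expected shape.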


## Summary

Write 2-3 sentences about what you learned from exploring this dataset.

**Your summary here:**
Exploring the raw dataset revealed several areas that need cleanup: missing values, particularly in the lab measurements, and inconsistent spelling and formatting in categorical columns such as 'site' and 'intervention_group'.
The summary statistics also exposed implausible values in the age and BMI fields (a minimum age of -999 and a minimum BMI of -1.0), confirming that cleaning and transformation are necessary before any reliable statistical aggregation or modeling can be performed.

