# Assignment 5, Question 7: Group Operations & Final Analysis

**Points: 15**

Perform grouped analysis and create summary reports.

## Setup

In [48]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import utilities
from q3_data_utils import load_data, summarize_by_group, clean_data

df = clean_data(load_data('data/clinical_trial_raw.csv'))
print(f"Loaded {len(df)} patients")

# Prewritten visualization function for grouped analysis
def plot_group_comparison(data, x_col, y_col, title):
    """
    Create a bar chart comparing groups.
    
    Args:
        data: DataFrame with grouped data
        x_col: Column name for x-axis (groups)
        y_col: Column name for y-axis (values)
        title: Chart title
    """
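    fig, ax = plt.subplots(figsize=(10, 6))
    # Pass ax so the bars are drawn on this figure; DataFrame.plot would otherwise open its own
    data.plot(x=x_col, y=y_col, kind='bar', ax=ax)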
    plt.title(title)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

Loaded 10000 patients


## Part 1: Basic Groupby (5 points)

1. Group by 'site' and calculate mean age, BMI, and blood pressure
2. Group by 'intervention_group' and count patients
3. Use the `summarize_by_group()` utility to get overall statistics by site
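
As a rough illustration of the first two tasks, the sketch below uses a small made-up frame. The name `toy` and all of its values are invented; only the column names come from this notebook. `groupby(...).mean()` returns one row of means per site, and `groupby(...).size()` counts the rows in each group.

```python
import pandas as pd

# Invented toy data; only the column names mirror the assignment.
toy = pd.DataFrame({
    'site': ['Site A', 'Site A', 'Site B', 'Site B'],
    'intervention_group': ['Control', 'Treatment A', 'Control', 'Control'],
    'age': [60, 70, 50, 80],
    'bmi': [24.0, 26.0, 22.0, 28.0],
    'systolic_bp': [120, 130, 110, 118],
})

# Task 1: one row per site, one column of means per measurement
site_means = toy.groupby('site')[['age', 'bmi', 'systolic_bp']].mean()

# Task 2: number of rows (patients) in each intervention group
group_counts = toy.groupby('intervention_group').size()

print(site_means)
print(group_counts)
```

On the real data the same two calls would be applied to `df` instead of `toy`.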

In [49]:
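# TODO: Group by site
summary_site = summarize_by_group(df, 'site')
# Each .mean() below averages one describe() statistic across all sites,
# so the output is a cross-site average per statistic, not a per-site table.
display(summary_site['age'].mean(), summary_site['bmi'].mean(), summary_site['systolic_bp'].mean())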

count    288.235294
mean      80.608051
std       13.370459
min       49.705882
25%       70.441176
50%       80.264706
75%       92.044118
max      100.000000
dtype: float64

count    281.235294
mean      25.890765
std        5.107645
min        4.091176
25%       23.700735
50%       26.185294
75%       28.719853
max       36.929412
dtype: float64

count    281.941176
mean     117.765441
std       14.178626
min       83.617647
25%      108.051471
50%      117.441176
75%      127.279412
max      155.823529
dtype: float64

In [55]:
# TODO: Count by intervention group
summarize_by_group(df, 'intervention_group')

| intervention_group | age count | age mean | age std | age min | age 25% | age 50% | age 75% | age max | bmi count | bmi mean | ... | adverse_events 75% | adverse_events max | adherence_pct count | adherence_pct mean | adherence_pct std | adherence_pct min | adherence_pct 25% | adherence_pct 50% | adherence_pct 75% | adherence_pct max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CONTROL | 84.0 | 80.77381 | 13.546307 | 52.0 | 70.0 | 81.0 | 94.0 | 100.0 | 83.0 | 26.790361 | ... | 0.0 | 1.0 | 68.0 | 58.705882 | 18.717905 | 20.0 | 46.75 | 57.5 | 71.25 | 100.0 |
| Contrl | 71.0 | 80.676056 | 14.200377 | 56.0 | 70.0 | 78.0 | 96.0 | 100.0 | 70.0 | 24.844286 | ... | 0.0 | 2.0 | 57.0 | 61.263158 | 18.862781 | 20.0 | 51.0 | 64.0 | 72.0 | 100.0 |
| Control | 91.0 | 80.406593 | 12.71218 | 46.0 | 72.0 | 80.0 | 88.5 | 100.0 | 91.0 | 24.965934 | ... | 0.0 | 2.0 | 83.0 | 61.361446 | 18.669512 | 20.0 | 46.5 | 62.0 | 75.5 | 94.0 |
| TREATMENT A | 63.0 | 81.095238 | 14.635038 | 47.0 | 69.5 | 83.0 | 94.5 | 100.0 | 59.0 | 25.516949 | ... | 0.0 | 2.0 | 54.0 | 57.296296 | 22.409255 | 20.0 | 42.0 | 55.5 | 76.5 | 100.0 |
| TREATMENT B | 80.0 | 76.7125 | 14.802663 | 50.0 | 65.0 | 75.5 | 89.0 | 100.0 | 79.0 | 25.681013 | ... | 0.0 | 2.0 | 76.0 | 62.171053 | 17.562375 | 20.0 | 49.75 | 63.5 | 75.0 | 100.0 |
| Treatmen A | 59.0 | 82.40678 | 13.022684 | 53.0 | 73.5 | 82.0 | 93.0 | 100.0 | 58.0 | 25.560345 | ... | 0.0 | 2.0 | 53.0 | 60.566038 | 20.16996 | 21.0 | 45.0 | 59.0 | 79.0 | 95.0 |
| Treatment B | 63.0 | 80.365079 | 14.001499 | 56.0 | 70.0 | 79.0 | 94.0 | 100.0 | 58.0 | 25.684483 | ... | 0.0 | 2.0 | 54.0 | 59.740741 | 16.144612 | 20.0 | 50.5 | 59.0 | 68.75 | 93.0 |
| Treatment A | 64.0 | 81.171875 | 12.929055 | 52.0 | 71.75 | 80.0 | 94.0 | 100.0 | 62.0 | 25.524194 | ... | 0.0 | 2.0 | 53.0 | 61.358491 | 17.738587 | 20.0 | 49.0 | 58.0 | 74.0 | 100.0 |
| Treatment B | 103.0 | 78.951456 | 14.711859 | 50.0 | 67.0 | 78.0 | 92.0 | 100.0 | 101.0 | 24.909901 | ... | 0.0 | 2.0 | 86.0 | 58.872093 | 19.336522 | 20.0 | 44.5 | 62.5 | 72.0 | 99.0 |
| TreatmentA | 65.0 | 81.6 | 13.084819 | 48.0 | 71.0 | 81.0 | 92.0 | 100.0 | 64.0 | 26.303125 | ... | 0.0 | 2.0 | 53.0 | 58.528302 | 18.420937 | 20.0 | 48.0 | 61.0 | 74.0 | 90.0 |
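
The table above splits what look like three arms into ten labels that differ in case, spelling, and spacing (`CONTROL`, `Contrl`, `Control`, `TreatmentA`, and so on), so per-group counts are fragmented. The two `Treatment B` rows probably differ only in whitespace, though that is a guess. A minimal sketch of normalizing the labels before counting follows; `normalize_labels` and `TYPO_FIXES` are hypothetical names, and the typo map is an assumption that should be checked against `df['intervention_group'].unique()`.

```python
import pandas as pd

# Label variants copied from the table above, plus a double-space variant (an assumption).
raw = pd.Series([
    'CONTROL', 'Contrl', 'Control',
    'TREATMENT A', 'Treatmen A', 'TreatmentA', 'Treatment A',
    'TREATMENT B', 'Treatment  B', 'Treatment B',
])

# Hypothetical typo map; extend it after inspecting the real unique values.
TYPO_FIXES = {
    'contrl': 'control',
    'treatmen a': 'treatment a',
    'treatmenta': 'treatment a',
}

def normalize_labels(labels: pd.Series) -> pd.Series:
    """Trim, collapse inner whitespace, lowercase, fix known typos, then title-case."""
    cleaned = (
        labels.str.strip()
              .str.replace(r'\s+', ' ', regex=True)
              .str.lower()
    )
    return cleaned.replace(TYPO_FIXES).str.title()

counts = normalize_labels(raw).value_counts()
print(counts)
```

Applied to the real column, this would be something like `df['intervention_group'] = normalize_labels(df['intervention_group'])` before any `groupby`.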


**Note:** `summarize_by_group()` takes an optional `agg_dict` parameter for custom aggregations, for example `agg_dict={'age': ['mean', 'std'], 'bmi': 'mean'}`. If `agg_dict` is omitted, the function calls `.describe()` on the numeric columns.
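
To make the two modes concrete, here is one way such a helper could be written. This is a hypothetical stand-in named `summarize_by_group_sketch`, not the actual code in `q3_data_utils`, which may differ; the toy data is invented.

```python
import pandas as pd

def summarize_by_group_sketch(df, group_col, agg_dict=None):
    """Hypothetical stand-in for summarize_by_group(); the real helper may differ."""
    grouped = df.groupby(group_col)
    if agg_dict is None:
        # Default mode: describe() over the numeric columns only
        numeric_cols = df.select_dtypes(include='number').columns
        numeric_cols = numeric_cols.drop(group_col, errors='ignore')
        return grouped[list(numeric_cols)].describe()
    # Custom mode: hand the column -> function(s) mapping to agg()
    return grouped.agg(agg_dict)

# Invented toy data
toy = pd.DataFrame({
    'site': ['A', 'A', 'B'],
    'age': [60, 70, 50],
    'bmi': [24.0, 26.0, 22.0],
})

default_summary = summarize_by_group_sketch(toy, 'site')
custom_summary = summarize_by_group_sketch(
    toy, 'site', agg_dict={'age': ['mean', 'std'], 'bmi': 'mean'}
)
print(custom_summary)
```

Both results have two-level column labels, so a single statistic is selected with a tuple such as `custom_summary[('age', 'mean')]`.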


In [23]:
# TODO: Use summarize_by_group utility
summarize_by_group(df, 'site', agg_dict={'age': ['mean', 'std'], 'bmi': 'mean'})

| site | age mean | age std | bmi mean |
|---|---|---|---|
| SITE A | 23.608108 | 246.53588 | 25.872059 |
| SITE B | 59.670213 | 157.423433 | 26.822826 |
| SITE C | 60.254545 | 146.070787 | 25.688889 |
| SITE D | 77.853659 | 11.501219 | 25.892105 |
| SITE E | 12.258065 | 270.211641 | 25.496667 |
| Site A | 80.328358 | 14.431015 | 25.452239 |
| Site A | 49.84375 | 190.210011 | 26.263934 |
| Site B | 56.318182 | 162.423495 | 26.521429 |
| Site C | 27.831325 | 232.797453 | 26.454054 |
| Site D | -25.3125 | 318.343742 | 26.387097 |


## Part 2: Multiple Aggregations (5 points)

Group by 'site' and apply multiple aggregations:
- age: mean, std, min, max
- bmi: mean, std
- systolic_bp: mean, median

Display the results in a well-formatted table.

In [32]:
# TODO: Multiple aggregations
summarize_by_group(
    df,
    'site',
    agg_dict={
        'age': ['mean', 'std', 'min', 'max'],
        'bmi': ['mean', 'std'],
        'systolic_bp': ['mean', 'median'],
    },
)

| site | age mean | age std | age min | age max | bmi mean | bmi std | systolic_bp mean | systolic_bp median |
|---|---|---|---|---|---|---|---|---|
| SITE A | 23.608108 | 246.53588 | -999 | 100 | 25.872059 | 4.805123 | 116.633803 | 115.0 |
| SITE B | 59.670213 | 157.423433 | -999 | 100 | 26.822826 | 3.562445 | 118.055556 | 119.0 |
| SITE C | 60.254545 | 146.070787 | -999 | 100 | 25.688889 | 6.682268 | 115.942308 | 115.0 |
| SITE D | 77.853659 | 11.501219 | 55 | 100 | 25.892105 | 8.003154 | 119.513514 | 120.0 |
| SITE E | 12.258065 | 270.211641 | -999 | 100 | 25.496667 | 3.439275 | 113.066667 | 111.0 |
| Site A | 80.328358 | 14.431015 | 49 | 100 | 25.452239 | 5.813447 | 118.552239 | 119.0 |
| Site A | 49.84375 | 190.210011 | -999 | 100 | 26.263934 | 4.044504 | 119.68254 | 120.0 |
| Site B | 56.318182 | 162.423495 | -999 | 100 | 26.521429 | 3.800034 | 117.035294 | 117.0 |
| Site C | 27.831325 | 232.797453 | -999 | 100 | 26.454054 | 3.563521 | 116.419753 | 117.0 |
| Site D | -25.3125 | 318.343742 | -999 | 100 | 26.387097 | 3.759321 | 117.40625 | 118.0 |
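
A minimum age of -999 in most rows of the table above suggests that a missing-value sentinel is still present in `age`, which would also explain the very large standard deviations and the negative mean for one site. The execution counts of these cells (`In [23]`, `In [32]`) are lower than that of the setup cell (`In [48]`), so the tables may come from a run before `clean_data` was applied. A minimal sketch of masking the sentinel before aggregating, on invented data and assuming -999 is the only sentinel value:

```python
import numpy as np
import pandas as pd

# Invented toy data with one sentinel value
toy = pd.DataFrame({
    'site': ['Site A', 'Site A', 'Site A', 'Site B', 'Site B'],
    'age': [60, -999, 70, 50, 80],
})

# Treat the sentinel as missing; mean and min skip NaN by default
masked = toy.assign(age=toy['age'].replace(-999, np.nan))

before = toy.groupby('site')['age'].agg(['mean', 'min'])
after = masked.groupby('site')['age'].agg(['mean', 'min'])

print(before)
print(after)
```

Re-running the grouped cells after the setup cell, in order, would show whether the cleaned `df` still contains the sentinel.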


## Part 3: Comparative Analysis (5 points)

Compare intervention groups:
1. Calculate mean outcome_cvd rate by intervention_group
2. Calculate mean adherence_pct by intervention_group
3. Create a cross-tabulation of intervention_group vs dropout status
4. Visualize the comparison with a bar plot
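
The first three steps can be sketched on invented data as follows. Two things here are assumptions rather than facts about the dataset: that `outcome_cvd` is coded 0/1 (so its mean is a rate), and that dropout status lives in a column named `dropout`.

```python
import pandas as pd

# Invented toy data; 'dropout' is a hypothetical column name and coding
toy = pd.DataFrame({
    'intervention_group': ['Control', 'Control', 'Treatment A', 'Treatment A'],
    'outcome_cvd': [1, 0, 0, 0],  # assumed 0/1 coding
    'adherence_pct': [60.0, 80.0, 90.0, 70.0],
    'dropout': ['Yes', 'No', 'No', 'No'],
})

# Steps 1 and 2: mean CVD rate and mean adherence per group
rates = toy.groupby('intervention_group')[['outcome_cvd', 'adherence_pct']].mean()

# Step 3: counts of each dropout status within each group
dropout_table = pd.crosstab(toy['intervention_group'], toy['dropout'])

print(rates)
print(dropout_table)
```

For step 4, `rates.reset_index()` has `intervention_group` as a regular column, which is the shape that `plot_group_comparison(data, x_col, y_col, title)` expects.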

In [60]:
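# TODO: Intervention group comparisons
# The grouping key is intervention_group; the outcome columns are what gets averaged.
# Assumes outcome_cvd is numeric (0/1), so its mean is the CVD rate.
summarize_by_group(
    df, 'intervention_group',
    agg_dict={'outcome_cvd': 'mean', 'adherence_pct': 'mean'},
)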

In [None]:
# TODO: Visualization


## Part 4: Final Report

Create and save:
1. Summary statistics by site → `output/q7_site_summary.csv`
2. Intervention group comparison → `output/q7_intervention_comparison.csv`
3. Text report with key findings → `output/q7_analysis_report.txt`
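
A minimal sketch of the saving step is shown below. `save_outputs` is a hypothetical helper, the summary frame and the finding text are placeholders, and the example writes into a temporary directory rather than the real `output/` folder. The point it illustrates is that `to_csv` does not create missing directories, so the folder is created first.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder summary; in the notebook this would be a grouped summary of df
toy_summary = pd.DataFrame(
    {'mean_age': [65.0, 65.0]},
    index=pd.Index(['Site A', 'Site B'], name='site'),
)

def save_outputs(summary, findings, out_dir):
    """Write a summary CSV and a numbered text report; return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv will not create the folder
    csv_path = out_dir / 'q7_site_summary.csv'
    report_path = out_dir / 'q7_analysis_report.txt'
    summary.to_csv(csv_path)
    lines = [f"{i}. {text}" for i, text in enumerate(findings, start=1)]
    report_path.write_text("Key findings\n" + "\n".join(lines) + "\n")
    return [csv_path, report_path]

with tempfile.TemporaryDirectory() as tmp:
    written = save_outputs(toy_summary, ['placeholder finding'], Path(tmp) / 'output')
    written_names = [p.name for p in written]
    report_text = written[1].read_text()

print(written_names)
print(report_text)
```

A summary with two-level column labels is written with two header rows, so reading it back needs something like `pd.read_csv(path, header=[0, 1], index_col=0)`.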

In [None]:
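# TODO: Save summary outputs
import os
os.makedirs('output', exist_ok=True)  # to_csv does not create missing directories
summary_intervention = summarize_by_group(df, 'intervention_group')  # same form as summary_site
summary_site.to_csv('output/q7_site_summary.csv')
summary_intervention.to_csv('output/q7_intervention_comparison.csv')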


## Summary

What are the 3 most important findings from your analysis?

**Key Findings:**

1. TODO
2. TODO
3. TODO
