# Assignment 5, Question 4: Data Exploration

**Points: 15**

In this notebook, you'll explore the clinical trial dataset using pandas selection and filtering techniques.

You'll use utility functions from `q3_data_utils` where helpful, but also demonstrate direct pandas operations.
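The source of `q3_data_utils` is not shown in this notebook. As a rough sketch only, a helper that accepts a list of `{'column', 'condition', 'value'}` dictionaries (the shape used in the filtering cells later on) might look like the following. The function name `filter_data_sketch`, the toy data, and the `[low, high]` format for `in_range` are assumptions made for illustration, not the actual Q3 implementation.

```python
import pandas as pd

def filter_data_sketch(df, filters):
    """Hypothetical stand-in for a Q3-style helper: apply each filter in turn."""
    result = df
    for f in filters:
        col, cond, val = f['column'], f['condition'], f['value']
        if cond == 'greater_than':
            mask = result[col] > val
        elif cond == 'in_range':
            # Assumed format: value is [low, high], both ends inclusive
            mask = result[col].between(val[0], val[1])
        elif cond == 'in_list':
            mask = result[col].isin(val)
        else:
            raise ValueError(f"Unknown condition: {cond}")
        result = result[mask]
    return result

# Toy data, not taken from the clinical trial file
toy = pd.DataFrame({'age': [50, 70, 80], 'systolic_bp': [150.0, 130.0, 145.0]})
older_high_bp = filter_data_sketch(toy, [
    {'column': 'age', 'condition': 'greater_than', 'value': 65},
    {'column': 'systolic_bp', 'condition': 'greater_than', 'value': 140},
])
print(len(older_high_bp))  # 1
```

Because each filter narrows the result of the previous one, a list of filters behaves like a logical AND.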

## Setup

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import utilities from Q3
from q3_data_utils import load_data, detect_missing, filter_data

# Load the data
df = load_data('data/clinical_trial_raw.csv')
print(f"Loaded {len(df)} patients with {len(df.columns)} variables")

# Prewritten visualization functions for exploration
def plot_value_counts(series, title, figsize=(10, 6)):
    """
    Create a bar chart of value counts.
    
    Args:
        series: pandas Series with value counts
        title: Chart title
        figsize: Figure size tuple
    """
    plt.figure(figsize=figsize)
    series.plot(kind='bar')
    plt.title(title)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

def plot_crosstab(crosstab_data, title, figsize=(10, 6)):
    """
    Create a heatmap of crosstab data.
    
    Args:
        crosstab_data: pandas DataFrame from pd.crosstab()
        title: Chart title
        figsize: Figure size tuple
    """
    plt.figure(figsize=figsize)
    plt.imshow(crosstab_data.values, cmap='Blues', aspect='auto')
    plt.colorbar()
    plt.title(title)
    plt.xticks(range(len(crosstab_data.columns)), crosstab_data.columns, rotation=45)
    plt.yticks(range(len(crosstab_data.index)), crosstab_data.index)
    plt.tight_layout()
    plt.show()

Matplotlib is building the font cache; this may take a moment.


data loaded with success: 10000,rows and 18 columns
Loaded 10000 patients with 18 variables


## Part 1: Basic Exploration (3 points)

Display:
1. Dataset shape
2. Column names and types
3. First 10 rows
4. Summary statistics (.describe())
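As a small illustration of the four displays listed above, here is a sketch on a three-row toy frame whose values are copied from the first rows of the dataset shown in this notebook. Note that column types live in the `dtypes` attribute.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the real data comes from load_data()
toy = pd.DataFrame({
    'patient_id': ['P00001', 'P00002', 'P00003'],
    'age': [80, 80, 82],
    'bmi': [29.3, np.nan, -1.0],
})

print(toy.shape)       # (rows, columns) tuple
print(toy.dtypes)      # one dtype per column
print(toy.head(10))    # up to the first 10 rows
summary = toy.describe()  # numeric columns only by default
print(summary)
```

`describe()` skips missing values, so the `count` row is a quick way to spot columns with gaps.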

In [47]:
# TODO: Display dataset info
df = load_data('data/clinical_trial_raw.csv')
print(df)
print(f"\nShape: {df.shape}")  # (rows, columns)
print(f"\nColumn names: {list(df.columns)}")
print(f"\nData types:\n{df.dtypes}")  # column types live in .dtypes
print("\nFirst 10 rows:")
print(df.head(10))
# Summary statistics for the numeric columns
print("\nSummary statistics:")
print(df.describe())


data loaded with success: 10000,rows and 18 columns
     patient_id  age         sex   bmi enrollment_date  systolic_bp  \
0        P00001   80           F  29.3      2022-05-01        123.0   
1        P00002   80    Female     NaN      2022-01-06        139.0   
2        P00003   82      Female  -1.0      2023-11-04        123.0   
3        P00004   95      Female  25.4      2022-08-15        116.0   
4        P00005   95           M   NaN      2023-04-17         97.0   
...         ...  ...         ...   ...             ...          ...   
9995     P09996   72        Male  23.2      2022-04-11        122.0   
9996     P09997  100      Female  28.9      2023-02-10        124.0   
9997     P09998   78           F  23.8      2023-11-05        110.0   
9998     P09999   86           F  27.0      2022-08-27        139.0   
9999     P10000   67      Female  29.4      25-03-2022        134.0   

      diastolic_bp  cholesterol_total  cholesterol_hdl  cholesterol_ldl  \
0             80.0  

## Part 2: Column Selection (3 points)

Demonstrate different selection methods:

1. Select only numeric columns using `.select_dtypes()`
2. Select specific columns by name
3. Select a subset of rows and columns using `.loc[]`
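The three selection methods listed above can be sketched on a toy frame as follows. The values are copied from the first rows shown in this notebook; the variable names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'patient_id': ['P00001', 'P00002', 'P00003'],
    'age': [80, 80, 82],
    'systolic_bp': [123.0, 139.0, 123.0],
})

# 1. Keep only numeric columns
numeric = toy.select_dtypes(include='number')

# 2. Select columns by name: a list of names returns a DataFrame,
#    a single name (toy['age']) returns a Series
subset = toy[['age', 'systolic_bp']]

# 3. Rows by label and columns by name with .loc
block = toy.loc[0:1, ['patient_id', 'age']]

print(list(numeric.columns))
print(subset.shape, block.shape)
```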

In [8]:
# TODO: Select numeric columns
df = load_data('data/clinical_trial_raw.csv')
numeric_cols = df.select_dtypes(include=['number'])
print("\nNumeric columns only:")
print(numeric_cols)


data loaded with success: 10000,rows and 18 columns

Numeric columns only:
      age   bmi  systolic_bp  diastolic_bp  cholesterol_total  \
0      80  29.3        123.0          80.0              120.0   
1      80   NaN        139.0          81.0              206.0   
2      82  -1.0        123.0          86.0              172.0   
3      95  25.4        116.0          77.0              200.0   
4      95   NaN         97.0          71.0              185.0   
...   ...   ...          ...           ...                ...   
9995   72  23.2        122.0          73.0              182.0   
9996  100  28.9        124.0          78.0              157.0   
9997   78  23.8        110.0          63.0              154.0   
9998   86  27.0        139.0          98.0              196.0   
9999   67  29.4        134.0          83.0              197.0   

      cholesterol_hdl  cholesterol_ldl  glucose_fasting  follow_up_months  \
0                55.0             41.0            118.0           

In [15]:
# TODO: Select specific columns by name
age = df['age']
bmi = df['bmi']
glucose_fasting = df['glucose_fasting']
print(age, bmi, glucose_fasting)


data loaded with success: 10000,rows and 18 columns
0        80
1        80
2        82
3        95
4        95
       ... 
9995     72
9996    100
9997     78
9998     86
9999     67
Name: age, Length: 10000, dtype: int64 0       29.3
1        NaN
2       -1.0
3       25.4
4        NaN
        ... 
9995    23.2
9996    28.9
9997    23.8
9998    27.0
9999    29.4
Name: bmi, Length: 10000, dtype: float64 0       118.0
1        79.0
2        77.0
3       115.0
4       113.0
        ...  
9995     97.0
9996    102.0
9997    114.0
9998    126.0
9999    128.0
Name: glucose_fasting, Length: 10000, dtype: float64


In [16]:
# TODO: Use .loc[] to select subset
# .loc slices by label, so 0:5 includes row 5 (six rows in total)
print(df.loc[0:5, ['age', 'bmi', 'glucose_fasting']])


   age   bmi  glucose_fasting
0   80  29.3            118.0
1   80   NaN             79.0
2   82  -1.0             77.0
3   95  25.4            115.0
4   95   NaN            113.0
5   78  26.8             99.0
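The output above has six rows because `.loc` slices by label and includes the end label. A minimal sketch of the difference from position-based `.iloc`, using the six ages shown above plus one made-up seventh value:

```python
import pandas as pd

# First six ages are from the output above; the last value is hypothetical
toy = pd.DataFrame({'age': [80, 80, 82, 95, 95, 78, 60]})

by_label = toy.loc[0:5, 'age']      # labels 0 through 5, end included
by_position = toy.iloc[0:5, 0]      # positions 0 through 4, end excluded

print(len(by_label), len(by_position))  # 6 5
```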


## Part 3: Filtering (4 points)

Filter the data to answer these questions:

1. How many patients are over 65 years old?
2. How many patients have systolic BP > 140?
3. Find patients who are both over 65 AND have systolic BP > 140
4. Find patients from Site A or Site B using `.isin()`
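The same four questions can also be answered with plain boolean masks. The sketch below uses a made-up four-row frame, so the counts are not those of the trial data.

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    'age': [80, 60, 70, 90],
    'systolic_bp': [123.0, 150.0, 145.0, 139.0],
    'site': ['Site A', 'Site B', 'Site C', 'Site A'],
})

over_65 = toy['age'] > 65             # boolean Series, one value per row
high_bp = toy['systolic_bp'] > 140

both = toy[over_65 & high_bp]         # & is element-wise AND
site_ab = toy[toy['site'].isin(['Site A', 'Site B'])]

print(over_65.sum(), high_bp.sum(), len(both), len(site_ab))  # 3 2 1 3
```

When the comparisons are written inline, each one needs its own parentheses, as in `toy[(toy['age'] > 65) & (toy['systolic_bp'] > 140)]`, because `&` binds more tightly than `>`.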

In [30]:
# TODO: Filter and count patients over 65
# Build a filter for age > 65 and apply it with the filter_data utility from Q3
age_filter = [{'column': 'age', 'condition': 'greater_than', 'value': 65}]
patients_over_65 = filter_data(df, age_filter)
# Direct pandas version: patients_over_65 = df[df['age'] > 65]
print(f"Patients over 65: {len(patients_over_65)}")



filtered data with 8326
Patients over 65: 8326


In [31]:
# TODO: Filter for high BP
# Build a filter for systolic_bp > 140 and apply it with the filter_data utility from Q3
sbp_filter = [{'column': 'systolic_bp', 'condition': 'greater_than', 'value': 140}]
high_bp = filter_data(df, sbp_filter)
# Direct pandas version: high_bp = df[df['systolic_bp'] > 140]
print(f"Patients with high BP: {len(high_bp)}")




filtered data with 538
Patients with high BP: 538


In [48]:
# TODO: Multiple conditions with &
# Pass both conditions to filter_data in one list.
# The list gets its own name so that it does not overwrite the
# filter_data function imported from Q3.
combined_filter = [
    {'column': 'age', 'condition': 'greater_than', 'value': 65},
    {'column': 'systolic_bp', 'condition': 'greater_than', 'value': 140}
]
older_high_bp = filter_data(df, combined_filter)
print(f"Patients over 65 AND high BP: {len(older_high_bp)}")

# Alternative: use in_range for an age range of 65-100
# (assumes the Q3 utility names this condition 'in_range' and takes [low, high])
in_range_filter = [{'column': 'age', 'condition': 'in_range', 'value': [65, 100]}]
patients_65_100 = filter_data(df, in_range_filter)
print(f"Patients aged between 65 and 100 years: {len(patients_65_100)}")

In [46]:
# TODO: Filter by site using .isin()
# The raw site labels mix case, extra spaces, and underscores
# (e.g. 'SITE A', 'Site  A', 'Site_D'), so normalize them before matching.
site_clean = (
    df['site'].str.replace('_', ' ', regex=False)
              .str.split().str.join(' ')
              .str.title()
)
site_ab = df[site_clean.isin(['Site A', 'Site B'])]
print(f"Patients from Site A or B: {len(site_ab)}")


## Part 4: Value Counts and Grouping (5 points)

1. Get value counts for the 'site' column
2. Get value counts for the 'intervention_group' column  
3. Create a crosstab of site vs intervention_group
4. Calculate mean age by site
5. Save the site value counts to `output/q4_site_counts.csv`
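A compact sketch of these five steps on a made-up frame is shown below. The group labels `Control` and `Treatment` are hypothetical, and the file is written to a temporary folder rather than `output/` so the example leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    'site': ['Site A', 'Site A', 'Site B', 'Site B', 'Site B'],
    'intervention_group': ['Control', 'Treatment', 'Control', 'Control', 'Treatment'],
    'age': [80, 70, 60, 90, 72],
})

site_counts = toy['site'].value_counts()                       # rows per site
group_counts = toy['intervention_group'].value_counts()        # rows per group
table = pd.crosstab(toy['site'], toy['intervention_group'])    # site x group counts
mean_age = toy.groupby('site')['age'].mean()                   # one mean per site

# Save the counts; the index (site labels) is written as the first column
out_path = os.path.join(tempfile.mkdtemp(), 'q4_site_counts.csv')
site_counts.to_csv(out_path)

print(table)
print(mean_age)
```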

In [52]:
# TODO: Value counts and analysis
# Value counts for the site column
site_counts = df['site'].value_counts()
print(site_counts)
# Value counts for the intervention_group column
intervention_counts = df['intervention_group'].value_counts()
print(intervention_counts)
# Crosstab of site vs intervention_group
site_intervention_crosstab = pd.crosstab(df['site'], df['intervention_group'])
print("\nCrosstab of site vs intervention group:")
print(site_intervention_crosstab)
# Mean age by site
mean_age = df.groupby('site')['age'].mean()
print("\nMean age by site:")
print(mean_age)



site
site b         742
Site B         736
SITE B         703
SITE A         684
Site  A        681
Site A         661
Site C         658
site a         651
site c         615
SITE C         605
Site D         362
site d         349
Site_D         332
Site E         319
SITE D         313
SITE E         295
site e         294
  SITE B        94
  site b        90
  Site B        88
  Site C        83
  site a        74
  SITE A        74
  Site  A       67
  Site A        64
  site c        57
  SITE C        55
  Site E        42
  SITE D        41
  site d        41
  site e        36
  Site D        32
  SITE E        31
  Site_D        31
Name: count, dtype: int64
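The site value counts above split each site across many spellings that differ in case, spacing, and underscores. One possible way to collapse them before counting is sketched below on a handful of labels copied from that output. The choice of title case as the canonical form is an assumption, not something the assignment specifies.

```python
import pandas as pd

# A few raw labels as they appear in the value counts above
raw = pd.Series(['site b', 'Site B', '  SITE B', 'Site  A', 'Site_D', 'SITE A'], name='site')

clean = (
    raw.str.replace('_', ' ', regex=False)  # 'Site_D' -> 'Site D'
       .str.split()                         # split on any run of whitespace
       .str.join(' ')                       # rejoin with single spaces
       .str.title()                         # 'SITE B' / 'site b' -> 'Site B'
)

clean_counts = clean.value_counts()
print(clean_counts)
```

Splitting and rejoining handles leading, trailing, and doubled spaces in one step.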

In [None]:
# TODO: Save output
import os

site_counts = df['site'].value_counts()
os.makedirs('output', exist_ok=True)  # create the folder if it does not exist yet
site_counts.to_csv('output/q4_site_counts.csv')

## Summary

Write 2-3 sentences about what you learned from exploring this dataset.

**Your summary here:**

TODO: Write your observations
