# <center>Study of the Efficacy of Different Anti-Cancer Treatments</center>
In a recent animal study, 249 mice with squamous cell carcinoma (SCC) tumors were each treated with one of several drug regimens, including Capomulin. Over a 45-day period, the study assessed how effectively these treatments inhibited tumor development. This notebook analyzes the complete dataset and generates the tables and figures for the technical report. The study's top-level summary is pending this analysis; its aim is to compare the performance of Capomulin against the other regimens, considering factors such as tumor size reduction, survival rates, and potential side effects. The ultimate goal is to give the executive team a comprehensive overview of the study results to inform future directions in anti-cancer medication development.

## Data Cleaning

In [141]:
# Import dependencies

import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

In [142]:
# Read the mouse CSV file and print the first five rows

mouse_df = pd.read_csv('data/Mouse_metadata.csv')
mouse_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g)
0,k403,Ramicane,Male,21,16
1,s185,Capomulin,Female,3,17
2,x401,Capomulin,Female,16,15
3,m601,Capomulin,Male,22,17
4,g791,Ramicane,Male,11,16


In [143]:
# Read the study CSV file and print the first five rows

study_df = pd.read_csv('data/Study_results.csv')
study_df.head()

Unnamed: 0,Mouse ID,Timepoint,Tumor Volume (mm3),Metastatic Sites
0,b128,0,45.0,0
1,f932,0,45.0,0
2,g107,0,45.0,0
3,a457,0,45.0,0
4,c819,0,45.0,0


In [144]:
# Merge the two DataFrames on the 'Mouse ID' column and print the first 15 rows

merged_df = pd.merge(mouse_df, study_df, on = 'Mouse ID', how = 'left')
merged_df.head(15)

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1
5,k403,Ramicane,Male,21,16,25,33.464577,1
6,k403,Ramicane,Male,21,16,30,31.099498,1
7,k403,Ramicane,Male,21,16,35,26.546993,1
8,k403,Ramicane,Male,21,16,40,24.365505,1
9,k403,Ramicane,Male,21,16,45,22.050126,1
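
Because each mouse has one metadata row and many measurement rows, the merge itself can be asked to verify that relationship. The following is a minimal sketch on invented rows, not the study data: `validate` and `indicator` are standard `pd.merge` parameters, while the frame names `meta` and `results` are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only (not taken from the study files)
meta = pd.DataFrame({'Mouse ID': ['a1', 'b2'],
                     'Drug Regimen': ['Capomulin', 'Placebo']})
results = pd.DataFrame({'Mouse ID': ['a1', 'a1', 'b2'],
                        'Timepoint': [0, 5, 0],
                        'Tumor Volume (mm3)': [45.0, 43.2, 45.0]})

# validate raises MergeError if a Mouse ID repeats in the metadata;
# indicator adds a '_merge' column recording where each row was found
merged = pd.merge(meta, results, on='Mouse ID', how='left',
                  validate='one_to_many', indicator=True)

# Rows that exist in the metadata but have no measurements
unmatched = merged[merged['_merge'] != 'both']
print(len(merged), len(unmatched))
```

An empty `unmatched` frame means every mouse in the metadata has at least one measurement.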


In [145]:
# Investigate the merged DataFrame and check for any null values

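merged_df.info()
# Verify the claim before printing it
assert merged_df.notna().all().all(), 'Unexpected null values in the merged DataFrame'
print('There are no null values in the merged DataFrame')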

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1893 entries, 0 to 1892
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Mouse ID            1893 non-null   object 
 1   Drug Regimen        1893 non-null   object 
 2   Sex                 1893 non-null   object 
 3   Age_months          1893 non-null   int64  
 4   Weight (g)          1893 non-null   int64  
 5   Timepoint           1893 non-null   int64  
 6   Tumor Volume (mm3)  1893 non-null   float64
 7   Metastatic Sites    1893 non-null   int64  
dtypes: float64(1), int64(4), object(3)
memory usage: 133.1+ KB
There are no null values in the merged DataFrame


In [146]:
# Investigate the descriptive statistics for the numerical columns

merged_df.describe()

Unnamed: 0,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
count,1893.0,1893.0,1893.0,1893.0,1893.0
mean,12.81458,25.662441,19.572108,50.448381,1.021659
std,7.189592,3.921622,14.07946,8.894722,1.137974
min,1.0,15.0,0.0,22.050126,0.0
25%,7.0,25.0,5.0,45.0,0.0
50%,13.0,27.0,20.0,48.951474,1.0
75%,20.0,29.0,30.0,56.2922,2.0
max,24.0,30.0,45.0,78.567014,4.0


In [147]:
# Investigate the shape of the DataFrame

shape = merged_df.shape
print(f'There are {shape[0]} rows and {shape[1]} columns in the DataFrame.')

There are 1893 rows and 8 columns in the DataFrame.


In [148]:
# Find the total number of mice used in the study

mouse_count = merged_df['Mouse ID'].nunique()
print(f'There are a total of {mouse_count} mice in the sample.')

There are a total of 249 mice in the sample.


In [149]:
# Find any duplicates in trials and delete the corresponding specimen as bad samples

duplicates = merged_df.duplicated(subset=['Mouse ID', 'Timepoint'])
merged_df.loc[duplicates, :]

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age_months,Weight (g),Timepoint,Tumor Volume (mm3),Metastatic Sites
909,g989,Propriva,Female,21,26,0,45.0,0
911,g989,Propriva,Female,21,26,5,47.570392,0
913,g989,Propriva,Female,21,26,10,49.880528,0
915,g989,Propriva,Female,21,26,15,53.44202,0
917,g989,Propriva,Female,21,26,20,54.65765,1


In [150]:
# Mouse ID g989 has duplicate trials. Therefore, we will delete the specimen as bad samples

merged_df = merged_df.loc[merged_df['Mouse ID'] != 'g989']
mouse_count = merged_df['Mouse ID'].nunique()
print(f'There are now a total of {mouse_count} mice in the sample after dropping bad samples.')

There are now a total of 248 mice in the sample after dropping bad samples.
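
By default, `duplicated` flags only the second and later occurrences of a repeated key. A small sketch on invented rows shows how `keep=False` lists every row in a duplicated group, which can make the bad sample easier to inspect before dropping it. The IDs and volumes below are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only; mouse 'g9' has timepoint 0 recorded twice
df = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'g9', 'g9', 'g9'],
    'Timepoint': [0, 5, 0, 0, 5],
    'Tumor Volume (mm3)': [45.0, 44.1, 45.0, 45.0, 47.6],
})

# keep=False marks every row in a duplicated group, not just the repeats
dup_mask = df.duplicated(subset=['Mouse ID', 'Timepoint'], keep=False)
bad_ids = df.loc[dup_mask, 'Mouse ID'].unique()

# Drop every row belonging to a mouse with duplicated timepoints
clean = df[~df['Mouse ID'].isin(bad_ids)]
print(list(bad_ids), clean['Mouse ID'].nunique())
```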


In [151]:
# Rename Columns

merged_df = merged_df.rename(columns={'Age_months':'Age (Months)', 'Timepoint':'Timepoint (Days)'})
merged_df.head()

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age (Months),Weight (g),Timepoint (Days),Tumor Volume (mm3),Metastatic Sites
0,k403,Ramicane,Male,21,16,0,45.0,0
1,k403,Ramicane,Male,21,16,5,38.825898,0
2,k403,Ramicane,Male,21,16,10,35.014271,1
3,k403,Ramicane,Male,21,16,15,34.223992,1
4,k403,Ramicane,Male,21,16,20,32.997729,1


## Summary Statistics

In [152]:
# Create a DataFrame to summarize Drug Regimen vs. Tumor Volume (mm3)

group = merged_df.groupby('Drug Regimen')
t_mean = group['Tumor Volume (mm3)'].mean()
t_median = group['Tumor Volume (mm3)'].median()
t_variance = group['Tumor Volume (mm3)'].var()
t_stdev = group['Tumor Volume (mm3)'].std()
t_sem = group['Tumor Volume (mm3)'].sem()

summary_df = pd.DataFrame({'Mean Tumor Volume (mm3)':t_mean, 'Median Tumor Volume (mm3)':t_median, 
                           'Variance Tumor Volume':t_variance, 'Standard Deviation Tumor Volume':t_stdev, 
                           'Standard Error of the Mean':t_sem})
summary_df

Unnamed: 0_level_0,Mean Tumor Volume (mm3),Median Tumor Volume (mm3),Variance Tumor Volume,Standard Deviation Tumor Volume,Standard Error of the Mean
Drug Regimen,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Capomulin,40.675741,41.557809,24.947764,4.994774,0.329346
Ceftamin,52.591172,51.776157,39.290177,6.268188,0.469821
Infubinol,52.884795,51.820584,43.128684,6.567243,0.492236
Ketapril,55.235638,53.698743,68.553577,8.279709,0.60386
Naftisol,54.331565,52.509285,66.173479,8.134708,0.596466
Placebo,54.033581,52.288934,61.168083,7.821003,0.581331
Propriva,52.32093,50.446266,43.852013,6.622085,0.544332
Ramicane,40.216745,40.673236,23.486704,4.846308,0.320955
Stelasyn,54.233149,52.431737,59.450562,7.710419,0.573111
Zoniferol,53.236507,51.818479,48.533355,6.966589,0.516398


Capomulin and Ramicane are the most effective regimens at reducing tumor volume, with the lowest mean (40.68 and 40.22 mm3) and median tumor volumes of all ten regimens. They also show the smallest variances and standard errors, which indicates a more uniform and consistent response to treatment within these two groups. These findings make both regimens strong candidates for further exploration and potential clinical application.
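
The standard error of the mean is the standard deviation divided by the square root of the number of observations, so the row count behind each regimen can be recovered from the table. The sketch below uses three rows copied from the summary table. The helper name `implied_n` is hypothetical.

```python
# (standard deviation, standard error) pairs copied from the summary table
table = {
    'Capomulin': (4.994774, 0.329346),
    'Ramicane': (4.846308, 0.320955),
    'Propriva': (6.622085, 0.544332),
}

def implied_n(std, sem):
    """Invert SEM = std / sqrt(n) to estimate the number of observations."""
    return round((std / sem) ** 2)

counts = {drug: implied_n(std, sem) for drug, (std, sem) in table.items()}
print(counts)
```

This gives 230 observations for Capomulin, 228 for Ramicane, and 148 for Propriva. The smaller standard errors for the first two therefore reflect more measurements as well as less spread.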

## Chart Observations

In [168]:
# A bar graph showing the total number of timepoints for all mice tested for each drug regimen using PyPlot

groupby_drug = merged_df.groupby('Drug Regimen')
bar_y = groupby_drug['Timepoint (Days)'].count()
# Take the labels from the same Series as the heights so each bar is matched
# to its own regimen (unique() returns a different order than groupby)
bar_x = bar_y.index

plt.figure(1, figsize=(9,6))

plt.bar(bar_x, bar_y, color='darkgreen', width=0.6)
plt.title('Timepoint Count vs. Drug Regimen')
plt.xlabel('Drug Regimen')
plt.ylabel('Timepoint Count')
plt.xticks(rotation = 45, rotation_mode = 'anchor', ha = 'right')
plt.grid()
# Adjust the layout before rendering so the rotated labels are not clipped
plt.tight_layout()
plt.show()


Counting rows per regimen, Capomulin and Ramicane have the most recorded timepoints, while Propriva has the fewest. This agrees with the sample sizes implied by the summary table, where each standard error equals the standard deviation divided by the square root of the row count.
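
When bar labels and bar heights come from separate calls, their order can disagree: `groupby` sorts its keys, whereas `unique()` returns values in order of first appearance. A minimal sketch on invented rows illustrates the mismatch. The counts below are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only; regimens appear in non-alphabetical order
df = pd.DataFrame({
    'Drug Regimen': ['Ramicane', 'Ramicane', 'Ramicane',
                     'Capomulin', 'Capomulin', 'Propriva'],
    'Timepoint (Days)': [0, 5, 10, 0, 5, 0],
})

counts = df.groupby('Drug Regimen')['Timepoint (Days)'].count()  # sorted by name
appearance_order = list(df['Drug Regimen'].unique())              # first-seen order

# Pairing the two sequences positionally attaches counts to the wrong regimen
mislabelled = dict(zip(appearance_order, counts.tolist()))
# Using the index of the counts Series keeps labels and values together
correct = counts.to_dict()
print(mislabelled)
print(correct)
```

Passing `counts.index` and `counts` to the plotting call avoids the problem because both come from the same Series.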

In [154]:
# A pie plot showing the distribution of sex using PyPlot

distribution_sex_df = merged_df[['Mouse ID', 'Sex']]
distribution_sex_df = distribution_sex_df.drop_duplicates('Mouse ID', keep='first')
mouse_sex = distribution_sex_df['Sex'].value_counts()
# Index by label rather than position so the counts cannot be swapped
print(f"There are {mouse_sex['Male']} male mice in the sample and {mouse_sex['Female']} female mice in the sample.")

There are 125 male mice in the sample and 123 female mice in the sample.


In [169]:
plt.figure(2, figsize=(9,6))
plt.pie([mouse_sex['Male'], mouse_sex['Female']], explode=[0,0], labels=['Male', 'Female'], colors=['blue','red'],
        shadow=False, startangle=140, autopct='%1.1f%%')
plt.show()


The pie chart shows a nearly even split between male and female mice (125 and 123). This suggests that the researchers aimed to minimize the influence of sex-related variables, making it easier to attribute observed effects to the treatments rather than to sex differences. Balancing the groups in this way helps control for a potential confounding factor and improves the generalizability of the results.
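
The percentages that `autopct='%1.1f%%'` prints on the pie chart can be reproduced directly from the two counts reported above. This is a short arithmetic sketch. The dictionary name `counts` is hypothetical.

```python
# Counts reported for the cleaned sample
counts = {'Male': 125, 'Female': 123}
total = sum(counts.values())

# Same one-decimal percentage format as the pie chart labels
shares = {sex: f'{100 * n / total:.1f}%' for sex, n in counts.items()}
print(shares)
```

The result is 50.4% male and 49.6% female.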

In [170]:
# A line graph showing the mean tumor volume across the timepoints for each drug regimen

grouped = merged_df.groupby(['Drug Regimen', 'Timepoint (Days)'])
mean_tumor_volume = grouped['Tumor Volume (mm3)'].mean().reset_index()

plt.figure(3, figsize=(9, 6))
for regimen in mean_tumor_volume['Drug Regimen'].unique():
    regimen_data = mean_tumor_volume[mean_tumor_volume['Drug Regimen'] == regimen]
    plt.plot(regimen_data['Timepoint (Days)'], regimen_data['Tumor Volume (mm3)'], label=regimen)

plt.title('Mean Tumor Volume Over Time by Drug Regimen')
plt.xlabel('Timepoint (Days)')
plt.ylabel('Mean Tumor Volume (mm3)')
plt.legend()
plt.show()


Only two drugs, Capomulin and Ramicane, show efficacy in decreasing mean tumor volume over the course of the study. None of the other regimens appears to reduce tumor volume.
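
One way to put a number on what the line graph shows is the percent change in mean tumor volume between the first and last timepoint of each regimen. The sketch below uses invented values. The regimen names 'A' and 'B' and their volumes are hypothetical, not study results.

```python
import pandas as pd

# Invented measurements for illustration only
df = pd.DataFrame({
    'Drug Regimen': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Timepoint (Days)': [0, 20, 45, 0, 20, 45],
    'Tumor Volume (mm3)': [45.0, 40.0, 36.0, 45.0, 54.0, 63.0],
})

# Rows are regimens, columns are timepoints in ascending order
mean_volume = (df.groupby(['Drug Regimen', 'Timepoint (Days)'])['Tumor Volume (mm3)']
                 .mean()
                 .unstack())

first, last = mean_volume.iloc[:, 0], mean_volume.iloc[:, -1]
pct_change = (last - first) / first * 100
print(pct_change.round(1))
```

A negative value indicates that the mean tumor shrank over the study period. A positive value indicates growth.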

## Quartiles, Outliers, and a Box Plot

Calculate the final tumor volume of each mouse across four of the most promising treatment regimens by mean volume: Capomulin, Ramicane, Infubinol, and Ceftamin.
Then calculate the quartiles and IQR, and determine if there are any potential outliers across all four regimens.

In [161]:
# Add a column "Final Timepoint (Days)" to the existing DataFrame that filters for the four regimens

promising_regimens_df = merged_df.loc[merged_df['Drug Regimen'].isin(['Capomulin','Ramicane','Infubinol','Ceftamin'])]
promising_regimens_groupby = promising_regimens_df.groupby('Mouse ID')
promising_regimens_timepoint = promising_regimens_groupby['Timepoint (Days)'].max()
new_merge = pd.merge(promising_regimens_df, promising_regimens_timepoint, how='left', on='Mouse ID')

final_volume_df = new_merge.loc[new_merge['Timepoint (Days)_x'] == new_merge['Timepoint (Days)_y']]

final_volume_df = final_volume_df.rename(columns={'Timepoint (Days)_x':'Timepoint (Days)', 
                                     'Timepoint (Days)_y':'Final Timepoint (Days)'})
# Reassign instead of modifying a filtered slice in place (avoids SettingWithCopyWarning)
final_volume_df = final_volume_df.reset_index(drop=True)
final_volume_df

Unnamed: 0,Mouse ID,Drug Regimen,Sex,Age (Months),Weight (g),Timepoint (Days),Tumor Volume (mm3),Metastatic Sites,Final Timepoint (Days)
0,k403,Ramicane,Male,21,16,45,22.050126,1,45
1,s185,Capomulin,Female,3,17,45,23.343598,1,45
2,x401,Capomulin,Female,16,15,45,28.484033,0,45
3,m601,Capomulin,Male,22,17,45,28.430964,1,45
4,g791,Ramicane,Male,11,16,45,29.128472,1,45
...,...,...,...,...,...,...,...,...,...
95,x822,Ceftamin,Male,3,29,45,61.386660,3,45
96,y163,Infubinol,Female,17,27,45,67.685569,3,45
97,y769,Ceftamin,Female,6,27,45,68.594745,4,45
98,y865,Ceftamin,Male,23,26,45,64.729837,3,45
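
The same final-timepoint rows can also be selected without a second merge. This is an alternative sketch on invented rows, not the approach used above. It relies on `idxmax`, which returns the row label of each mouse's largest timepoint.

```python
import pandas as pd

# Invented rows for illustration only
df = pd.DataFrame({
    'Mouse ID': ['a1', 'a1', 'a1', 'b2', 'b2'],
    'Timepoint (Days)': [0, 5, 10, 0, 5],
    'Tumor Volume (mm3)': [45.0, 43.0, 41.5, 45.0, 47.2],
})

# For each mouse, find the row label of its last recorded timepoint
last_rows = df.groupby('Mouse ID')['Timepoint (Days)'].idxmax()

# Select those rows and renumber them from zero
final_volume = df.loc[last_rows].reset_index(drop=True)
print(final_volume)
```

Mice that left the study early keep their own last measurement, so not every final timepoint is day 45.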


In [162]:
# Print the samples' ID and their final tumor volume after going through treatment

mouse_id = final_volume_df['Mouse ID']
tumor_volume = final_volume_df['Tumor Volume (mm3)']

for i in range(final_volume_df.shape[0]):
    print(f'Mouse ID: {mouse_id[i]}, Final Tumor Volume: {round(tumor_volume[i],2)}mm3')

Mouse ID: k403, Final Tumor Volume: 22.05mm3
Mouse ID: s185, Final Tumor Volume: 23.34mm3
Mouse ID: x401, Final Tumor Volume: 28.48mm3
Mouse ID: m601, Final Tumor Volume: 28.43mm3
Mouse ID: g791, Final Tumor Volume: 29.13mm3
Mouse ID: s508, Final Tumor Volume: 30.28mm3
Mouse ID: f966, Final Tumor Volume: 30.49mm3
Mouse ID: m546, Final Tumor Volume: 30.56mm3
Mouse ID: z578, Final Tumor Volume: 30.64mm3
Mouse ID: j913, Final Tumor Volume: 31.56mm3
Mouse ID: u364, Final Tumor Volume: 31.02mm3
Mouse ID: n364, Final Tumor Volume: 31.1mm3
Mouse ID: y793, Final Tumor Volume: 31.9mm3
Mouse ID: r554, Final Tumor Volume: 32.38mm3
Mouse ID: m957, Final Tumor Volume: 33.33mm3
Mouse ID: c758, Final Tumor Volume: 33.4mm3
Mouse ID: t565, Final Tumor Volume: 34.46mm3
Mouse ID: a644, Final Tumor Volume: 32.98mm3
Mouse ID: i177, Final Tumor Volume: 33.56mm3
Mouse ID: j989, Final Tumor Volume: 36.13mm3
Mouse ID: i738, Final Tumor Volume: 37.31mm3
Mouse ID: a520, Final Tumor Volume: 38.81mm3
Mouse ID: w91

In [163]:
# Locate potential outliers for each of the four promising regimens:

drug_regimens = final_volume_df['Drug Regimen'].unique()

list_of_series_for_plotting = []
for drug_regimen in drug_regimens:
    each_regimen_df = final_volume_df[final_volume_df['Drug Regimen'] == drug_regimen][['Mouse ID','Tumor Volume (mm3)']]
    
    quartiles = each_regimen_df['Tumor Volume (mm3)'].quantile([0.25,0.5,0.75])
    lower_quartile = quartiles[0.25]
    upper_quartile = quartiles[0.75]
    iqr = upper_quartile-lower_quartile
    
    upper_whisker = upper_quartile+(1.5*iqr)
    lower_whisker = lower_quartile-(1.5*iqr)

    
    # Build the outlier mask once and reuse it for both lookups
    volumes = each_regimen_df['Tumor Volume (mm3)']
    is_outlier = (volumes > upper_whisker) | (volumes < lower_whisker)
    outlier_mice = each_regimen_df.loc[is_outlier, 'Mouse ID'].values
    outlier_tumor_size = volumes[is_outlier].values
    
    print(f'Potential outlier mouse ID and its tumor size for {drug_regimen}:')
    print(f'Mouse ID: {outlier_mice}, Tumor Size: {outlier_tumor_size}')
    
    list_of_series_for_plotting.append(each_regimen_df['Tumor Volume (mm3)'])    

Potential outlier mouse ID and its tumor size for Ramicane:
Mouse ID: [], Tumor Size: []
Potential outlier mouse ID and its tumor size for Capomulin:
Mouse ID: [], Tumor Size: []
Potential outlier mouse ID and its tumor size for Infubinol:
Mouse ID: ['c326'], Tumor Size: [36.3213458]
Potential outlier mouse ID and its tumor size for Ceftamin:
Mouse ID: [], Tumor Size: []


In [171]:
# Boxplot to locate outliers

plt.figure(4, figsize=(9,6))
plt.boxplot(list_of_series_for_plotting, flierprops={'marker':'o', 'markersize':8, 'markerfacecolor': 'blue'})
x_axis = np.arange(len(drug_regimens))+1
ticklocations = [value for value in x_axis]
labels = drug_regimens
plt.xticks(ticklocations, labels)
plt.xlabel('Drug Regimen')
plt.ylabel('Tumor Volume (mm3)')
plt.title('Distribution of the Final Tumor Volume (mm3) for each Drug Regimen')
plt.show()


There is one potential outlier, under the Infubinol regimen: mouse c326, with a final tumor volume of 36.32 mm3. Possible explanations include measurement error, biological variation, response heterogeneity, or drug resistance or ineffectiveness. Further investigation is warranted if the outlier raises concerns about the validity of the data or the effectiveness of the drug regimen.
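
The outlier rule used here is the 1.5 × IQR fence: a value is flagged if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The worked sketch below uses only the standard library. The volumes are invented for illustration, and the helper names `quantile` and `tukey_fences` are hypothetical.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation, the default rule in pandas and NumPy."""
    pos = (len(sorted_values) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences at k times the interquartile range."""
    data = sorted(values)
    q1, q3 = quantile(data, 0.25), quantile(data, 0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented final tumor volumes (mm3) for illustration only
volumes = [36.3, 57.0, 58.3, 60.2, 62.4, 65.5, 66.1, 67.7, 72.2]
lower, upper = tukey_fences(volumes)
outliers = [v for v in volumes if v < lower or v > upper]
print(round(lower, 1), round(upper, 1), outliers)
```

With these values Q1 is 58.3 and Q3 is 66.1, so the fences fall at about 46.6 and 77.8 and only 36.3 is flagged.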

## A Deeper Look at the Capomulin Regimen