# Pymaceuticals Inc.
---

### Analysis

Scientific data are typically interpreted relative to controls. An additional analysis of the Placebo regimen was therefore included. A placebo serves as a negative control, allowing each drug's efficacy to be compared against no treatment at all.

Of the four drugs compared (Capomulin, Ceftamin, Infubinol, and Ramicane), Capomulin and Ramicane were the most efficacious at decreasing overall tumor volume. In comparison with Placebo, Infubinol and Ceftamin did not appear to have a meaningful impact on overall tumor volume, although no formal significance test was run in this notebook. Further studies into the targets of each of these drugs, specifically Capomulin and Ramicane, would be most helpful for homing in on the best drug regimens to administer and the optimal drug targets for future investigations.
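The comparison against Placebo above is made by eye from the box plots. A minimal sketch of how such a comparison could be checked formally is shown here, assuming Welch's t-test is an acceptable choice; the arrays `placebo` and `treated` are synthetic stand-ins, not the study's data.

```python
import numpy as np
from scipy import stats

# Synthetic final tumor volumes (mm3); hypothetical values, NOT the study's data.
rng = np.random.default_rng(0)
placebo = rng.normal(loc=60.0, scale=5.0, size=25)
treated = rng.normal(loc=40.0, scale=5.0, size=25)

# Welch's t-test compares two group means without assuming equal variances.
result = stats.ttest_ind(treated, placebo, equal_var=False)
welch_p = result.pvalue
print(f"t = {result.statistic:.2f}, p = {welch_p:.3g}")
```

With the real data, `treated` and `placebo` would be the final tumor volumes for one regimen and for Placebo, respectively.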

The study design accounted for potential sex bias: the mice used were split almost evenly between the sexes. A total of 249 mice were originally enrolled, 125 male and 124 female.

There was a positive correlation between average tumor volume and mouse weight under the Capomulin regimen. However, it is difficult to judge whether this is meaningful without a control for comparison, so the placebo regimen was again analyzed as a negative control. Scatter plot and regression analysis of mice under the placebo regimen show a much weaker correlation coefficient than for Capomulin (-0.2 vs 0.83) and a more scattered plot. Thus, under the Capomulin regimen, average tumor volume tends to increase markedly as mouse weight increases.
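A correlation coefficient on its own does not say whether the relationship could have arisen by chance. As a hedged illustration, `scipy.stats.pearsonr` returns a p-value alongside the coefficient; the `weight` and `volume` arrays below are synthetic and only stand in for the real columns.

```python
import numpy as np
from scipy import stats

# Synthetic weights (g) and average tumor volumes (mm3); hypothetical values only.
rng = np.random.default_rng(1)
weight = rng.uniform(15, 25, size=25)
volume = 20 + 1.0 * weight + rng.normal(0, 1.5, size=25)

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(weight, volume)
print(f"r = {r:.2f}, p = {p_value:.3g}")
```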



 

In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
# Additional dependencies
import numpy as np
from scipy.stats import sem
from scipy.stats import linregress


# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single dataset
mouse_study_df = pd.merge(mouse_metadata, study_results, how="left", on="Mouse ID")

# Display the data table for preview
mouse_study_df.head()

In [None]:
# Checking the number of mice.
number_of_mice = len(mouse_study_df["Mouse ID"].unique())
number_of_mice

In [None]:
# Getting the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 
duplicate = mouse_study_df[mouse_study_df.duplicated(subset=['Mouse ID','Timepoint'], keep=False)]
duplicate


In [None]:
# Optional: Get all the data for the duplicate mouse ID. 
all_duplicate_data = mouse_study_df[mouse_study_df['Mouse ID'] == ('g989')]
all_duplicate_data


In [None]:
# Create a clean DataFrame by dropping the duplicate mouse by its ID.
indexMouseID = mouse_study_df[mouse_study_df['Mouse ID'] == 'g989'].index
mouse_study_df.drop(indexMouseID , inplace=True)
mouse_study_df.head()
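The cleaning step above hard-codes the ID `g989`. A small sketch on a toy frame shows one way to derive the duplicated IDs from the data and drop every row for those mice without `inplace=True`; the frame `toy` and its IDs are made up for illustration.

```python
import pandas as pd

# Toy data: mouse "b2" has two rows for the same timepoint (hypothetical IDs).
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "b2"],
    "Timepoint": [0, 5, 0, 0, 5],
})

# Rows that share a (Mouse ID, Timepoint) pair with another row.
dupe_rows = toy[toy.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)]
dupe_ids = dupe_rows["Mouse ID"].unique()

# Drop every row belonging to a mouse with duplicated records.
clean = toy[~toy["Mouse ID"].isin(dupe_ids)]
print(clean)
```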


In [None]:
# Checking the number of mice in the clean DataFrame.
number_of_cleaned_mice = len(mouse_study_df["Mouse ID"].unique())
number_of_cleaned_mice


## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, and SEM of the tumor volume for each regimen

# Use groupby and summary statistical methods to calculate the following properties of each drug regimen: 
# mean, median, variance, standard deviation, and SEM of the tumor volume. 

# Select the tumor volume column before aggregating so that non-numeric columns are never aggregated
tumor_volume_by_regimen = mouse_study_df.groupby("Drug Regimen")["Tumor Volume (mm3)"]
tumor_volume_mean = tumor_volume_by_regimen.mean()
tumor_volume_median = tumor_volume_by_regimen.median()
tumor_volume_variance = tumor_volume_by_regimen.var()
tumor_volume_stdev = tumor_volume_by_regimen.std()
tumor_volume_sem = tumor_volume_by_regimen.sem()

# Assemble the resulting series into a single summary DataFrame.
tumor_volume_stats_table = pd.DataFrame()
tumor_volume_stats_table['Average Tumor Volume (mm3)'] = tumor_volume_mean
tumor_volume_stats_table['Median Tumor Volume (mm3)'] = tumor_volume_median
tumor_volume_stats_table['Variance of Tumor Volume (mm3)'] = tumor_volume_variance
tumor_volume_stats_table['Standard Deviation of Tumor Volume (mm3)'] = tumor_volume_stdev
tumor_volume_stats_table['SEM of Tumor Volume (mm3)'] = tumor_volume_sem
tumor_volume_stats_table





In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation, 
# and SEM of the tumor volume for each regimen

# Using the aggregation method, produce the same summary statistics in a single line.

# Create a new dataframe for aggregation
agg_mouse_study_df = mouse_study_df[["Drug Regimen", "Tumor Volume (mm3)"]]

# create a dict with the appropriate aggregation functions
agg_functions = {'Tumor Volume (mm3)':['mean', 'median', 'var', 'std', 'sem']}

# aggregate
agg_mouse_study_df.groupby(['Drug Regimen']).agg(agg_functions)
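The dictionary form above produces a two-level column header. As an alternative sketch, pandas named aggregation gives flat, readable column names; the toy frame and the output names (`mean`, `variance`, and so on) are illustrative choices.

```python
import pandas as pd

# Toy data with two hypothetical regimens.
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 55.0, 60.0],
})

# Named aggregation: each keyword becomes a column in the result.
summary = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    mean="mean", median="median", variance="var", std_dev="std", sem="sem"
)
print(summary)
```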


## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of timepoints for all mice tested for each drug regimen using Pandas.
total_timepoints = mouse_study_df.groupby(["Drug Regimen"]).count()["Timepoint"]
pandas_plot = total_timepoints.plot(kind='bar', title='Total Timepoints for all mice')
pandas_plot.set_xlabel("Drug Regimen")
pandas_plot.set_ylabel("Number of Timepoints")
plt.show()


In [None]:
# convert from series to dataframe
total_timepoints = total_timepoints.to_frame().reset_index()

In [None]:
# Generate a bar plot showing the total number of timepoints for all mice tested for each drug regimen using pyplot.
x_axis = np.arange(len(total_timepoints))
tick_locations = [value+0.004 for value in x_axis]
plt.figure(figsize= (6,5))
plt.bar(x_axis, total_timepoints["Timepoint"], color='r', width = 0.5, align = 'center')
plt.xticks(tick_locations, total_timepoints["Drug Regimen"], rotation="vertical")
plt.title("Total Timepoints for all mice")
plt.xlabel("Drug Regimen")
plt.ylabel("Number of Timepoints")
plt.show()


In [None]:
# Number of male mice
male_mice = mouse_study_df.loc[mouse_study_df["Sex"] == 'Male', :]
numb_males = len(male_mice["Mouse ID"].unique())

# Number of female mice; 1 is added back so the count includes the removed duplicate mouse and matches the original 249-mouse total
female_mice = mouse_study_df.loc[mouse_study_df["Sex"] == 'Female', :]
numb_females = (len(female_mice["Mouse ID"].unique())) + 1

# Assemble the Dataframe
mice_sex_df = pd.DataFrame({'Number of male mice' : [numb_males], 'Number of female mice' : [numb_females]})
mice_sex_df
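Counting from a one-row-per-mouse table avoids manual adjustments to the totals. The sketch below uses a made-up frame; whether to count from the original metadata or from the cleaned data is a choice left to the analysis.

```python
import pandas as pd

# Toy data: several rows per mouse, as in the merged study table (hypothetical IDs).
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "c3", "c3"],
    "Sex": ["Male", "Male", "Female", "Female", "Female"],
})

# Keep one row per mouse, then count mice of each sex.
sex_counts = toy.drop_duplicates(subset="Mouse ID")["Sex"].value_counts()
print(sex_counts)
```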


In [None]:
# Generate a pie plot showing the distribution of female versus male mice using Pandas
mice_sex_dict={'Sex of Mice':['Male','Female'], 'Total Number':[numb_males,numb_females]}
labels=['Male','Female']
colors=['lightblue', 'silver']
pandas_mice_sex_df = pd.DataFrame(data=mice_sex_dict)
pandas_mice_sex_df.plot.pie(y='Total Number', legend=False, startangle=140, fontsize=20, labels=labels, colors=colors, autopct='%1.1f%%')
plt.show()


In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
labels = ["Male mice", "Female mice"]
sizes = [numb_males, numb_females]
colors = ["lightcoral", "lightskyblue"]
explode = (0, 0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct="%1.1f%%", startangle=140, textprops={'fontsize': 18})
plt.show()


## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:  
# Capomulin, Ramicane, Infubinol, and Ceftamin

# Start by getting the last (greatest) timepoint for each mouse
max_timepoint = mouse_study_df.groupby("Mouse ID")["Timepoint"].max()

# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
mouse_study_max_timepoint_merged = pd.merge(mouse_study_df, max_timepoint, how="left", on="Mouse ID")

# Renaming columns
mouse_study_max_timepoint_merged = mouse_study_max_timepoint_merged.rename(columns={"Timepoint_x": "Timepoint", "Timepoint_y": "Max Timepoint"})
mouse_study_max_timepoint_merged.head()


In [None]:
# Create new dataframe for statistical analysis
# Keep the tumor volume only on rows recorded at each mouse's final timepoint; all other rows become NaN
mouse_study_max_timepoint_merged['Tumor Volume at max timepoint'] = np.where((mouse_study_max_timepoint_merged['Timepoint'] == mouse_study_max_timepoint_merged['Max Timepoint']) , mouse_study_max_timepoint_merged['Tumor Volume (mm3)'], np.nan)
mouse_study_max_timepoint_merged.head()

# Drop the NaN values for the "Tumor Volume at max timepoint" column
mouse_study_max_timepoint_merged = mouse_study_max_timepoint_merged[mouse_study_max_timepoint_merged['Tumor Volume at max timepoint'].notna()]
mouse_study_max_timepoint_merged.reset_index(drop=True, inplace=True)

# Create a dataframe with only the needed columns
max_tumor_volume = mouse_study_max_timepoint_merged[['Mouse ID', 'Drug Regimen', 'Tumor Volume at max timepoint']]
max_tumor_volume
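The merge-and-filter steps above can also be expressed more compactly. As a sketch on toy data, `idxmax` picks the row label of each mouse's last timepoint directly; the frame and IDs are hypothetical.

```python
import pandas as pd

# Toy data: two mice measured at different timepoints (hypothetical values).
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.0, 41.0, 45.0, 47.5],
})

# idxmax returns the row label of the greatest timepoint within each mouse.
last_rows = toy.loc[toy.groupby("Mouse ID")["Timepoint"].idxmax()]
final_volume = last_rows.set_index("Mouse ID")["Tumor Volume (mm3)"]
print(final_volume)
```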







In [None]:
# Summary Stats for Capomulin
Capomulin_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Capomulin', :]
Capomulin = Capomulin_tumor_volumes['Tumor Volume at max timepoint']


Capomulin_q3 = np.quantile(Capomulin_tumor_volumes['Tumor Volume at max timepoint'], 0.75)
Capomulin_q1 = np.quantile(Capomulin_tumor_volumes['Tumor Volume at max timepoint'], 0.25)
Capomulin_IQR = Capomulin_q3 - Capomulin_q1

Capomulin_lower_range = Capomulin_q1 - 1.5 * Capomulin_IQR
Capomulin_upper_range = Capomulin_q3 + 1.5 * Capomulin_IQR

Capomulin_outliers = [x for x in Capomulin_tumor_volumes['Tumor Volume at max timepoint'] if ((x < Capomulin_lower_range) | (x > Capomulin_upper_range))]
print(f'For mice receiving Capomulin, the IQR for all tumor volumes at the maximum time point is {(Capomulin_IQR):.2f}.')
print(f'The lower and upper range are {(Capomulin_lower_range):.2f} and {(Capomulin_upper_range):.2f}, respectively.')
if len(Capomulin_outliers) > 0:
    print(f'The number of outliers is {len(Capomulin_outliers)}. The outlier(s) had tumor volume(s) of {(Capomulin_outliers)}.')
else:
    print('There were no outliers.')

In [None]:
# Summary Stats for Ramicane
Ramicane_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Ramicane', :]
Ramicane = Ramicane_tumor_volumes['Tumor Volume at max timepoint']


Ramicane_q3 = np.quantile(Ramicane_tumor_volumes['Tumor Volume at max timepoint'], 0.75)
Ramicane_q1 = np.quantile(Ramicane_tumor_volumes['Tumor Volume at max timepoint'], 0.25)
Ramicane_IQR = Ramicane_q3 - Ramicane_q1

Ramicane_lower_range = Ramicane_q1 - 1.5 * Ramicane_IQR
Ramicane_upper_range = Ramicane_q3 + 1.5 * Ramicane_IQR

Ramicane_outliers = [x for x in Ramicane_tumor_volumes['Tumor Volume at max timepoint'] if ((x < Ramicane_lower_range) | (x > Ramicane_upper_range))]
print(f'For mice receiving Ramicane, the IQR for all tumor volumes at the maximum time point is {(Ramicane_IQR):.2f}.')
print(f'The lower and upper range are {(Ramicane_lower_range):.2f} and {(Ramicane_upper_range):.2f}, respectively.')
if len(Ramicane_outliers) > 0:
    print(f'The number of outliers is {len(Ramicane_outliers)}. The outlier(s) had tumor volume(s) of {(Ramicane_outliers)}.')
else:
    print('There were no outliers.')

In [None]:
# Summary Stats for Infubinol
Infubinol_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Infubinol', :]
Infubinol = Infubinol_tumor_volumes['Tumor Volume at max timepoint']


Infubinol_q3 = np.quantile(Infubinol_tumor_volumes['Tumor Volume at max timepoint'], 0.75)
Infubinol_q1 = np.quantile(Infubinol_tumor_volumes['Tumor Volume at max timepoint'], 0.25)
Infubinol_IQR = Infubinol_q3 - Infubinol_q1

Infubinol_lower_range = Infubinol_q1 - 1.5 * Infubinol_IQR
Infubinol_upper_range = Infubinol_q3 + 1.5 * Infubinol_IQR

Infubinol_outliers = [x for x in Infubinol_tumor_volumes['Tumor Volume at max timepoint'] if ((x < Infubinol_lower_range) | (x > Infubinol_upper_range))]
print(f'For mice receiving Infubinol, the IQR for all tumor volumes at the maximum time point is {(Infubinol_IQR):.2f}.')
print(f'The lower and upper range are {(Infubinol_lower_range):.2f} and {(Infubinol_upper_range):.2f}, respectively.')
if len(Infubinol_outliers) > 0:
    print(f'The number of outliers is {len(Infubinol_outliers)}. The outlier(s) had tumor volume(s) of {(Infubinol_outliers)}.')
else:
    print('There were no outliers.')

In [None]:
# Summary Stats for Ceftamin
Ceftamin_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Ceftamin', :]
Ceftamin = Ceftamin_tumor_volumes['Tumor Volume at max timepoint']


Ceftamin_q3 = np.quantile(Ceftamin_tumor_volumes['Tumor Volume at max timepoint'], 0.75)
Ceftamin_q1 = np.quantile(Ceftamin_tumor_volumes['Tumor Volume at max timepoint'], 0.25)
Ceftamin_IQR = Ceftamin_q3 - Ceftamin_q1

Ceftamin_lower_range = Ceftamin_q1 - 1.5 * Ceftamin_IQR
Ceftamin_upper_range = Ceftamin_q3 + 1.5 * Ceftamin_IQR

Ceftamin_outliers = [x for x in Ceftamin_tumor_volumes['Tumor Volume at max timepoint'] if ((x < Ceftamin_lower_range) | (x > Ceftamin_upper_range))]
print(f'For mice receiving Ceftamin, the IQR for all tumor volumes at the maximum time point is {(Ceftamin_IQR):.2f}.')
print(f'The lower and upper range are {(Ceftamin_lower_range):.2f} and {(Ceftamin_upper_range):.2f}, respectively.')
if len(Ceftamin_outliers) > 0:
    print(f'The number of outliers is {len(Ceftamin_outliers)}. The outlier(s) had tumor volume(s) of {Ceftamin_outliers}.')
else:
    print('There were no outliers.')
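The four cells above repeat the same quartile arithmetic for each regimen. A minimal sketch of a reusable helper follows; the function name `iqr_outliers` and the sample values are hypothetical, and the 1.5 × IQR rule is the same one used above.

```python
import numpy as np

def iqr_outliers(values):
    """Return (IQR, lower bound, upper bound, outliers) using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return iqr, lower, upper, outliers

# Hypothetical tumor volumes with one obvious outlier.
sample = [10.0, 11.0, 12.0, 13.0, 14.0, 40.0]
iqr, lower, upper, outliers = iqr_outliers(sample)
print(f"IQR = {iqr:.2f}, bounds = ({lower:.2f}, {upper:.2f}), outliers = {outliers}")
```

In the notebook, the helper could be called once per regimen inside a loop over the regimen names instead of duplicating the cell.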

In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.

# Capomulin calculations
Capomulin = (Capomulin_tumor_volumes['Tumor Volume at max timepoint'])

# Ramicane calculations
Ramicane = Ramicane_tumor_volumes['Tumor Volume at max timepoint']

# Infubinol calculations
Infubinol = Infubinol_tumor_volumes['Tumor Volume at max timepoint']

# Ceftamin calculations
Ceftamin = Ceftamin_tumor_volumes['Tumor Volume at max timepoint']


# Create figure with all four boxplots

top4_drugs = [Capomulin, Ramicane, Infubinol, Ceftamin]
labels = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']
flierprops = dict(marker='o', markerfacecolor='r', markersize=11, linestyle='none', markeredgecolor='b')


plt.boxplot(top4_drugs, labels=labels, medianprops = dict(color="black",linewidth=0.5), patch_artist=True, flierprops=flierprops, boxprops=dict(facecolor='coral', color='coral'), whiskerprops=dict(color='red'))
            
plt.ylabel('Tumor Volume (mm3)')
plt.title('Final Tumor Volume Distribution across Four Drug Regimens')

plt.show()


## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a mouse treated with Capomulin

# Dataframe for looking at specifically mouse s185
s185_mouse = mouse_study_df.loc[mouse_study_df["Mouse ID"] == 's185', :]

# plot for s185
plt.plot(s185_mouse['Timepoint'] , s185_mouse['Tumor Volume (mm3)'])
plt.xlabel('Timepoint')
plt.ylabel('Tumor Volume (mm3)')
plt.title('Tumor Volume (mm3) of mouse s185')
plt.show()



In [None]:
# Generate a scatter plot of average tumor volume vs. mouse weight for the Capomulin regimen

# Create dataframe for Capomulin treated mice based off of original dataframe
capomulin_regimen = mouse_study_df.loc[mouse_study_df["Drug Regimen"] == 'Capomulin', :]

# Calculate Average Tumor Volume for Mice in Capomulin drug regimen
capomulin_mean_tv = capomulin_regimen.groupby("Mouse ID")["Tumor Volume (mm3)"].mean()

# Combine the data into a single dataset (one row per timepoint, so each mouse's average repeats)
capomulin_mean_tv_and_weight = pd.merge(capomulin_regimen, capomulin_mean_tv, how="left", on="Mouse ID")

# Change column header to reflect Average Tumor Volume
capomulin_mean_tv_and_weight_df = capomulin_mean_tv_and_weight.rename(columns={'Tumor Volume (mm3)_y' : 'Average Tumor Volume (mm3)'})

# Create Scatter Plot
capomulin_mean_tv_and_weight_df.plot(kind="scatter", x="Weight (g)", y="Average Tumor Volume (mm3)", grid=True, figsize=(8,8), title="Avg. Tumor Volume vs Weight - Capomulin")
plt.show()
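Because the merged frame above keeps one row per timepoint, each mouse contributes several identical points to the scatter plot. One way to get exactly one point per mouse, sketched here on toy data with hypothetical column names `weight` and `avg_volume`, is to aggregate both columns in a single `groupby`. Results computed from such a table may differ slightly from those computed on the repeated rows.

```python
import pandas as pd

# Toy data: two mice, two timepoints each (hypothetical values).
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2"],
    "Weight (g)": [17, 17, 22, 22],
    "Tumor Volume (mm3)": [40.0, 38.0, 45.0, 47.0],
})

# One row per mouse: its weight and its average tumor volume.
per_mouse = toy.groupby("Mouse ID").agg(
    weight=("Weight (g)", "first"),
    avg_volume=("Tumor Volume (mm3)", "mean"),
)
print(per_mouse)
```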


## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and linear regression model 
# for mouse weight and average tumor volume for the Capomulin regimen

x_values = capomulin_mean_tv_and_weight_df['Weight (g)']
y_values = capomulin_mean_tv_and_weight_df['Average Tumor Volume (mm3)']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values)
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(18,35),fontsize=15,color="red")
plt.xlabel('Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title('Average Tumor Volume vs Weight - Capomulin')
plt.show()

print(f"The correlation coefficient between mouse weight and average tumor volume for Capomulin is {round(st.pearsonr(x_values,y_values)[0],2)}.")
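The cell above calls both `linregress` and `pearsonr`. For a simple linear fit the two agree, as this sketch on made-up points illustrates: the `rvalue` attribute of the `linregress` result is the Pearson coefficient, so a single call supplies the slope, intercept, and correlation.

```python
import numpy as np
from scipy import stats

# Hypothetical weights (g) and average tumor volumes (mm3).
x = np.array([15.0, 17.0, 19.0, 21.0, 23.0, 25.0])
y = np.array([36.0, 37.5, 40.0, 41.0, 44.5, 45.0])

fit = stats.linregress(x, y)   # result exposes slope, intercept, rvalue, pvalue, stderr
r, _ = stats.pearsonr(x, y)    # Pearson coefficient computed separately

predicted = fit.intercept + fit.slope * x
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, r = {fit.rvalue:.2f}")
```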



## Extra Analysis including Placebo

In [None]:
# Summary Stats for Placebo
Placebo_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Placebo', :]
Placebo = Placebo_tumor_volumes['Tumor Volume at max timepoint']


Placebo_q3 = np.quantile(Placebo_tumor_volumes['Tumor Volume at max timepoint'], 0.75)
Placebo_q1 = np.quantile(Placebo_tumor_volumes['Tumor Volume at max timepoint'], 0.25)
Placebo_IQR = Placebo_q3 - Placebo_q1

Placebo_lower_range = Placebo_q1 - 1.5 * Placebo_IQR
Placebo_upper_range = Placebo_q3 + 1.5 * Placebo_IQR

Placebo_outliers = [x for x in Placebo_tumor_volumes['Tumor Volume at max timepoint'] if ((x < Placebo_lower_range) | (x > Placebo_upper_range))]
print(f'For mice receiving Placebo, the IQR for all tumor volumes at the maximum time point is {(Placebo_IQR):.2f}.')
print(f'The lower and upper range are {(Placebo_lower_range):.2f} and {(Placebo_upper_range):.2f}, respectively.')
if len(Placebo_outliers) > 0:
    print(f'The number of outliers is {len(Placebo_outliers)}. The outlier(s) had tumor volume(s) of {(Placebo_outliers)}.')
else:
    print('There were no outliers.')

In [None]:
# Generate a box plot of final tumor volume for the four drug regimens plus Placebo
Placebo_tumor_volumes = mouse_study_max_timepoint_merged.loc[mouse_study_max_timepoint_merged["Drug Regimen"] == 'Placebo', :]
Placebo = Placebo_tumor_volumes['Tumor Volume at max timepoint']

# Capomulin calculations
Capomulin = Capomulin_tumor_volumes['Tumor Volume at max timepoint']

# Ramicane calculations
Ramicane = Ramicane_tumor_volumes['Tumor Volume at max timepoint']

# Infubinol calculations
Infubinol = Infubinol_tumor_volumes['Tumor Volume at max timepoint']

# Ceftamin calculations
Ceftamin = Ceftamin_tumor_volumes['Tumor Volume at max timepoint']


# Create figure with all five boxplots

top4_drugs_and_placebo = [Capomulin, Ramicane, Infubinol, Ceftamin, Placebo]
labels = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin', 'Placebo']
flierprops = dict(marker='o', markerfacecolor='r', markersize=11, linestyle='none', markeredgecolor='b')


plt.boxplot(top4_drugs_and_placebo, labels=labels, medianprops = dict(color="black",linewidth=0.5), patch_artist=True, flierprops=flierprops, boxprops=dict(facecolor='coral', color='coral'), whiskerprops=dict(color='red'))
            
plt.ylabel('Tumor Volume (mm3)')
plt.title('Final Tumor Volume Distribution across Four Drug Regimens and Placebo')

plt.show()


In [None]:
# Generate a scatter plot of average tumor volume vs. mouse weight for the Placebo regimen

# Create dataframe for Placebo treated mice based off of original dataframe
Placebo_regimen = mouse_study_df.loc[mouse_study_df["Drug Regimen"] == 'Placebo', :]

# Calculate Average Tumor Volume for Mice in Placebo drug regimen
Placebo_mean_tv = Placebo_regimen.groupby("Mouse ID")["Tumor Volume (mm3)"].mean()

# Combine the data into a single dataset (one row per timepoint, so each mouse's average repeats)
Placebo_mean_tv_and_weight = pd.merge(Placebo_regimen, Placebo_mean_tv, how="left", on="Mouse ID")

# Change column header to reflect Average Tumor Volume
Placebo_mean_tv_and_weight_df = Placebo_mean_tv_and_weight.rename(columns={'Tumor Volume (mm3)_y' : 'Average Tumor Volume (mm3)'})

# Create Scatter Plot
Placebo_mean_tv_and_weight_df.plot(kind="scatter", x="Weight (g)", y="Average Tumor Volume (mm3)", grid=True, figsize=(8,8), title="Avg. Tumor Volume vs Weight - Placebo")
plt.show()


In [None]:
# Calculate the correlation coefficient and linear regression model 
# for mouse weight and average tumor volume for the Placebo regimen

x_values = Placebo_mean_tv_and_weight_df['Weight (g)']
y_values = Placebo_mean_tv_and_weight_df['Average Tumor Volume (mm3)']
(slope, intercept, rvalue, pvalue, stderr) = linregress(x_values, y_values)
regress_values = x_values * slope + intercept
line_eq = "y = " + str(round(slope,2)) + "x + " + str(round(intercept,2))
plt.scatter(x_values,y_values)
plt.plot(x_values,regress_values,"r-")
plt.annotate(line_eq,(25,50),fontsize=15,color="red")
plt.xlabel('Weight (g)')
plt.ylabel('Average Tumor Volume (mm3)')
plt.title('Average Tumor Volume vs Weight - Placebo')
plt.show()

print(f"The correlation coefficient between mouse weight and average tumor volume for Placebo is {round(st.pearsonr(x_values,y_values)[0],2)}.")

