# Pymaceuticals Inc.
---

### Analysis

Key Insights: 

This experiment compares the effectiveness of our company's drug of interest, Capomulin, against other drug regimens at reducing tumor size. After analyzing the results, these are my top insights:


Weight vs Tumor Size:
Among the mice treated with Capomulin, weight and average tumor volume have a correlation coefficient of 0.834, which indicates a strong positive correlation: overall, heavier mice had larger tumors. We can use this relationship to understand, and potentially predict, how tumor volume changes with mouse weight.

Capomulin vs Other Drugs:
The data show that Ramicane performed similarly to our drug of interest. Capomulin performed better of the two, but only by a slight margin. These results are promising and suggest that Ramicane deserves attention alongside Capomulin. However, the remaining drugs had fewer observed data points than these two, with Propriva having the fewest. A case could be made for further experiments with a more even number of observed timepoints across regimens.












In [None]:
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import scipy.stats as st
import numpy as np
import random
from scipy.stats import pearsonr
from scipy.stats import linregress



# Study data files
mouse_metadata_path = "data/Mouse_metadata.csv"
study_results_path = "data/Study_results.csv"

# Read the mouse data and the study results
mouse_metadata = pd.read_csv(mouse_metadata_path)
study_results = pd.read_csv(study_results_path)

# Combine the data into a single DataFrame
data = pd.merge(mouse_metadata, study_results, on="Mouse ID")

# Reorder the columns so the study measurements come first
new_order = [
    'Mouse ID',
    'Timepoint',
    'Tumor Volume (mm3)',
    'Metastatic Sites',
    'Drug Regimen',
    'Sex',
    'Age_months',
    'Weight (g)',
]
data = data[new_order]

# Display the data table for preview
data.head()

In [None]:
# Checking the number of mice.
len(data["Mouse ID"].unique())

In [None]:
# dupes = data.duplicated(subset=['Mouse ID', 'Timepoint'])
# data[dupes]

In [None]:
# Our data should be uniquely identified by Mouse ID and Timepoint
# Get the duplicate mice by ID number that shows up for Mouse ID and Timepoint. 
dupes = data.duplicated(subset=['Mouse ID', 'Timepoint'])
dupe_id = np.array(data[dupes]["Mouse ID"].unique())
dupe_id


In [None]:
# Optional: Get all the data for the duplicate mouse ID. 
data[data["Mouse ID"] == dupe_id[0]]

In [None]:
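# Create a clean DataFrame by dropping every row for the duplicated mouse ID(s).
# isin() also covers the case of more than one duplicated mouse, and copy()
# makes df independent of data if it is modified later.
df = data[~data["Mouse ID"].isin(dupe_id)].copy()
df.head()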

In [None]:
# Checking the number of mice in the clean DataFrame.
len(df["Mouse ID"].unique())
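
The duplicate check above can be sketched on a small example. The toy DataFrame below is hypothetical (it is not the study data), and `keep=False` is one possible choice that flags every row sharing a `(Mouse ID, Timepoint)` key rather than only the repeats.

```python
import pandas as pd

# Hypothetical toy data: mouse "a1" has two conflicting rows at timepoint 0.
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 0, 5, 0, 5],
    "Tumor Volume (mm3)": [45.0, 45.3, 47.1, 45.0, 44.2],
})

# keep=False marks every member of a duplicated (Mouse ID, Timepoint) pair,
# not just the second and later occurrences.
dupe_mask = toy.duplicated(subset=["Mouse ID", "Timepoint"], keep=False)
dupe_ids = toy.loc[dupe_mask, "Mouse ID"].unique()

# Drop every mouse that has at least one duplicated key.
clean = toy[~toy["Mouse ID"].isin(dupe_ids)]
print(list(dupe_ids), clean["Mouse ID"].nunique())
```

Dropping the whole mouse, rather than only the repeated rows, avoids having to guess which of the conflicting measurements is the correct one.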

## Summary Statistics

In [None]:
# Generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen.
# Use groupby and summary statistical methods to calculate each property,
# then assemble the resulting series into a single summary DataFrame.
df_grp = df.groupby('Drug Regimen')
tumor_volume = df_grp['Tumor Volume (mm3)']

mean = tumor_volume.mean()
median = tumor_volume.median()
variance = tumor_volume.var()
std_dev = tumor_volume.std()
sem = tumor_volume.sem()

stat_summary = pd.DataFrame({"Mean Tumor Volume": mean,
                             "Median Tumor Volume": median,
                             "Tumor Volume Variance": variance,
                             "Tumor Volume Std. Dev.": std_dev,
                             "Tumor Volume Std. Err.": sem})
stat_summary


In [None]:
# A more advanced method to generate a summary statistics table of mean, median, variance, standard deviation,
# and SEM of the tumor volume for each regimen (only one method is required in the solution)

# Using the aggregation method, produce the same summary statistics in a single line

adv_stats = df_grp.agg({'Tumor Volume (mm3)': ['mean', 'median', 'var', 'std', 'sem']})

adv_stats
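
As a possible variation on the aggregation above, pandas named aggregation produces flat, readable column names instead of a two-level column index. This is a minimal sketch on hypothetical toy data; the output column names (`mean_volume`, `std_err`, and so on) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data with two made-up regimens "A" and "B".
toy = pd.DataFrame({
    "Drug Regimen": ["A", "A", "A", "B", "B", "B"],
    "Tumor Volume (mm3)": [40.0, 42.0, 44.0, 50.0, 50.0, 56.0],
})

# Named aggregation: each keyword becomes an output column,
# and each value names the statistic to compute for that column.
named_stats = toy.groupby("Drug Regimen")["Tumor Volume (mm3)"].agg(
    mean_volume="mean",
    median_volume="median",
    variance="var",
    std_dev="std",
    std_err="sem",
)
print(named_stats)
```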



## Bar and Pie Charts

In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using Pandas.
df['Drug Regimen'].value_counts().plot(kind = 'bar',rot = 45)

plt.xlabel('Drug Regimen')
plt.ylabel('# of Observed Mouse Timepoints')
plt.tight_layout()



In [None]:
# Generate a bar plot showing the total number of rows (Mouse ID/Timepoints) for each drug regimen using pyplot.
counts = df['Drug Regimen'].value_counts()

plt.xlabel('Drug Regimen')
plt.ylabel('# of Observed Mouse Timepoints')
plt.bar(counts.index,counts.values)
plt.xticks(rotation=45)

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using Pandas
df["Sex"].value_counts().plot(kind = 'pie',autopct='%1.1f%%')
plt.xlabel("")
plt.ylabel("Sex")

In [None]:
# Generate a pie plot showing the distribution of female versus male mice using pyplot
counts = df["Sex"].value_counts()

plt.pie(counts,labels=counts.index,autopct='%1.1f%%')
plt.xlabel("")
plt.ylabel("Sex")
plt.show()


## Quartiles, Outliers and Boxplots

In [None]:
# Calculate the final tumor volume of each mouse across four of the treatment regimens:
# Capomulin, Ramicane, Infubinol, and Ceftamin
# Start by getting the last (greatest) timepoint for each mouse
df_grp2 = df.groupby(["Mouse ID", "Drug Regimen"])["Timepoint"].max()
# reset_index() turns the group keys into ordinary columns so the merge keys are explicit
df2 = df_grp2.reset_index()
# Merge this group df with the original DataFrame to get the tumor volume at the last timepoint
tumor_df = pd.merge(df, df2, on=["Mouse ID", "Drug Regimen", "Timepoint"], how='inner')
tumor_df.head()
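
The "last timepoint per mouse" step can also be sketched without a merge. The example below is an alternative under stated assumptions, not the method used in this notebook: it uses `idxmax` on hypothetical toy data and assumes the DataFrame has a unique index.

```python
import pandas as pd

# Hypothetical toy data: two mice observed at different numbers of timepoints.
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "a1", "b2", "b2"],
    "Timepoint": [0, 5, 10, 0, 5],
    "Tumor Volume (mm3)": [45.0, 43.2, 41.0, 45.0, 46.8],
})

# idxmax returns the row label of each mouse's greatest timepoint;
# .loc then selects those full rows in one step.
last_rows = toy.loc[toy.groupby("Mouse ID")["Timepoint"].idxmax()]
final_volume = last_rows.set_index("Mouse ID")["Tumor Volume (mm3)"]
print(final_volume)
```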

In [None]:
# Put treatments into a list for the for loop (and later for plot labels)
treats = ['Capomulin', 'Ramicane', 'Infubinol', 'Ceftamin']

# Create empty list to fill with tumor vol data (for plotting)
vol_data = []

# Calculate the IQR and quantitatively determine if there are any potential outliers.
for treat in treats:
    # Locate the rows for this regimen and collect its final tumor volumes
    sub_df = tumor_df[tumor_df["Drug Regimen"] == treat]
    tumor_vol = sub_df["Tumor Volume (mm3)"]
    vol_data.append(tumor_vol)

    # Determine outliers using upper and lower bounds
    quartiles = tumor_vol.quantile([0.25, 0.75])
    lower_q = quartiles[0.25]
    upper_q = quartiles[0.75]
    iqr = upper_q - lower_q

    lower_bound = lower_q - 1.5 * iqr
    upper_bound = upper_q + 1.5 * iqr
    pot_outliers = tumor_vol[(tumor_vol < lower_bound) | (tumor_vol > upper_bound)]
    print(f"{treat}'s potential outliers: {pot_outliers}")
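
The outlier rule used in the loop above can be isolated into a small helper to make the arithmetic easier to check. This is a sketch only: `iqr_outliers` and the toy values are hypothetical, and the multiplier `k=1.5` mirrors the bounds computed in the loop.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Hypothetical toy volumes: Q1 = 33, Q3 = 39, IQR = 6, so bounds are 24 and 48.
toy_volumes = pd.Series([30.0, 32.0, 34.0, 36.0, 38.0, 40.0, 90.0])
outliers = iqr_outliers(toy_volumes)
print(outliers.tolist())
```

In this toy series only the value 90.0 falls outside the bounds, so it is the only entry flagged.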




In [None]:
# Generate a box plot that shows the distribution of the tumor volume for each treatment group.
plt.boxplot(vol_data, labels=treats)
plt.ylabel("Final Tumor Volume (mm3)")
plt.show()

## Line and Scatter Plots

In [None]:
# Generate a line plot of tumor volume vs. time point for a single mouse treated with Capomulin

# Pick one Capomulin-treated mouse at random (the choice changes on each run)
mouse = df.loc[df["Drug Regimen"] == "Capomulin", "Mouse ID"].sample(n=1).iloc[0]

mouse_data = df[df["Mouse ID"] == mouse]
plt.plot(mouse_data["Timepoint"], mouse_data["Tumor Volume (mm3)"])
plt.ylabel("Tumor Volume (mm3)")
plt.xlabel("Timepoint (days)")
plt.title(f"Data for Mouse {mouse}")

In [None]:
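# Generate a scatter plot of mouse weight vs. the average observed tumor volume
# for the entire Capomulin regimen
df3 = df[df["Drug Regimen"] == "Capomulin"]

# Average tumor volume per mouse
df3_grp = df3.groupby("Mouse ID")
mouse_averages = pd.DataFrame(df3_grp["Tumor Volume (mm3)"].mean())
mouse_averages = mouse_averages.rename(columns={"Tumor Volume (mm3)": "Avg Tumor Volume"})
# Attach each mouse's average to its rows; df3 keeps one row per mouse per timepoint
df3 = pd.merge(mouse_averages, df3, on="Mouse ID", how='left')

plt.scatter(df3["Weight (g)"], df3["Avg Tumor Volume"])
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")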

## Correlation and Regression

In [None]:
# Calculate the correlation coefficient and a linear regression model
# for mouse weight and average observed tumor volume for the entire Capomulin regimen
# Note: df3 has one row per mouse per timepoint, so each mouse is weighted
# by its number of observations
correlation_coefficient = pearsonr(df3["Weight (g)"], df3["Avg Tumor Volume"])
print(f"The correlation between mouse weight and the average tumor volume is {round(correlation_coefficient[0], 3)}")

# Fit the regression line and evaluate it at each observed weight
slope, intercept, r_value, p_value, std_err = linregress(df3["Weight (g)"], df3["Avg Tumor Volume"])
forecast = slope * df3["Weight (g)"] + intercept

plt.scatter(df3["Weight (g)"], df3["Avg Tumor Volume"])
plt.plot(df3["Weight (g)"], forecast, color='r', label="Linear Regression Line")
plt.xlabel("Weight (g)")
plt.ylabel("Average Tumor Volume (mm3)")
plt.legend()
plt.show()
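
To illustrate how a fitted line like the one above could be used for prediction, here is a minimal sketch on hypothetical toy data. It assumes one row per mouse (so each mouse counts once in the fit), and `predict_volume` is an illustrative helper, not part of this notebook or of SciPy.

```python
import pandas as pd
from scipy.stats import linregress

# Hypothetical toy data: three mice, two timepoints each.
toy = pd.DataFrame({
    "Mouse ID": ["a1", "a1", "b2", "b2", "c3", "c3"],
    "Weight (g)": [16, 16, 20, 20, 24, 24],
    "Tumor Volume (mm3)": [35.0, 37.0, 39.0, 41.0, 43.0, 45.0],
})

# Collapse to one row per mouse: its weight and its average tumor volume.
per_mouse = toy.groupby("Mouse ID").agg(
    weight=("Weight (g)", "first"),
    avg_volume=("Tumor Volume (mm3)", "mean"),
)

# Ordinary least-squares fit of average volume on weight.
fit = linregress(per_mouse["weight"], per_mouse["avg_volume"])


def predict_volume(weight_g: float) -> float:
    """Predicted average tumor volume (mm3) for a mouse of the given weight."""
    return fit.slope * weight_g + fit.intercept


print(round(predict_volume(22), 2))
```

Fitting on one row per mouse weights every animal equally, whereas fitting on per-timepoint rows weights each mouse by how many timepoints it has. Which is appropriate depends on the question being asked.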
