# 2 EDA: Breast Cancer Gene Expressions <a id='2_EDA'></a>

## 2.1 Contents <a id='2.1_Contents'></a>

* [2 Exploratory Data Analysis](#2_EDA)
    * [2.1 Contents](#2.1_Contents)
    * [2.2 Introduction](#2.2_Introduction)
    * [2.3 Imports](#2.3_Imports)
    * [2.4 Loading the Data](#2.4_Loading)

## 2.2 Introduction <a id='2.2_Introduction'></a>

Breast cancer is the most frequently occurring cancer in women and the leading cause of cancer-related death in women. Accurate estimates of prognosis and survival duration are the most important part of the decision-making process for cancer patients, but they are hard to predict because gene expression strongly affects both. Breast cancer patients with the same stage of disease and the same clinical attributes can respond differently to treatment and have different overall survival outcomes. These differences are likely attributable to differences in gene expression and in specific genetic mutations.

In this capstone project I will use the clinical data, z scores, and genetic mutation data to predict whether a patient with primary breast cancer will die of the disease.

_____
This notebook is dedicated to exploratory data analysis (EDA) of the Metabric Breast Cancer Gene Expression Profiles data, sourced from https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric. 

Questions:
1) What is the distribution of each of the clinical attributes?
2) How are the z scores connected to the genetic mutation data? (i.e. are there z scores associated with each mutation?)
3) Can we drop the 'Stage' data entirely based on other clinical attributes that have less missing data? Stage is the attribute with the most missing values.
4) Are the distributions in the data similar across the different cohorts?
5) What does it look like when we compare the presence of genetic mutations with survival?
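Question 3 depends on knowing how much data each column is missing. A minimal sketch of one way to tabulate that is shown here; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "tumor_stage": [2.0, np.nan, np.nan, 1.0],
    "tumor_size": [22.0, 10.0, np.nan, 30.0],
    "age_at_diagnosis": [75.6, 43.2, 48.9, 47.7],
})
toy_missing = missing_summary(toy)
print(toy_missing)
```

On the real data the same call would be `missing_summary(cancer_data)` once the file is loaded.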


## 2.3 Imports <a id='2.3_Imports'></a>

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


## 2.4 Loading the Data <a id='2.4_Loading'></a>

In [4]:
cancer_data=pd.read_csv(r'C:\Users\leann\OneDrive\Desktop\SPRINGBOARD\capstone 2\cancer_data_cleaned.csv')



In [37]:
cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1903 entries, 0 to 1902
Columns: 694 entries, Unnamed: 0 to siah1_mut
dtypes: float64(494), int64(10), object(190)
memory usage: 10.1+ MB


In [31]:
# I'm realizing that there is a type issue here... 

#Checking to see what's happening with one of the columns:
col_679 = cancer_data.iloc[:,679]
print('\nData types: ',col_679.dtypes)
print('\nDescribe: ',col_679.describe())
print('\nUnique: ',col_679.unique())



Data types:  object

Describe:  count     1903
unique       9
top          0
freq      1024
Name: rasgef1b_mut, dtype: int64

Unique:  [0 '0' 'V418A' 'S343G' 'R140Q' 'X2_splice' 'R353H' 'H396Y' 'E78K']
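The output above shows the integer `0` and the string `'0'` living in the same column. Rather than inspecting columns one position at a time, a sketch like the following could list every object column that mixes Python types. The function name `mixed_type_columns` and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def mixed_type_columns(df):
    """Map each object column holding more than one Python type to those type names."""
    mixed = {}
    for col in df.select_dtypes(include="object").columns:
        type_names = sorted({type(v).__name__ for v in df[col].dropna()})
        if len(type_names) > 1:
            mixed[col] = type_names
    return mixed

# Toy stand-in: one column mixes int 0 with strings, the others do not
toy = pd.DataFrame({
    "rasgef1b_mut": [0, "0", "V418A", 0],
    "siah1_mut": ["0", "0", "E78K", "0"],
    "age_at_diagnosis": [75.6, 43.2, 48.9, 47.7],
})
toy_mixed = mixed_type_columns(toy)
print(toy_mixed)
```

Running `mixed_type_columns(cancer_data)` should then report which columns need the string conversion, instead of relying on hard-coded positions.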


In [34]:
# Convert these columns to strings so that 0 and '0' count as the same value: 679, 689, 691, 693
column_indexes = [679, 689, 691, 693]  # Positions of the columns to change
cancer_data.iloc[:, column_indexes] = cancer_data.iloc[:, column_indexes].astype(str)

# Now see if the column we looked at above holds a single type. It does.
cancer_data.iloc[:, 679].describe()

count     1903
unique       8
top          0
freq      1896
Name: rasgef1b_mut, dtype: object
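With the mutation column holding only strings, question 5 (mutation presence versus survival) can be approached with a simple cross-tabulation. This is a sketch under the assumption that `'0'` means no mutation was recorded, which is how the unique values above read; the toy rows and the `rasgef1b_status` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for one mutation column plus the survival target
toy = pd.DataFrame({
    "rasgef1b_mut": ["0", "V418A", "0", "E78K", "0", "0"],
    "overall_survival": [1, 0, 1, 0, 0, 1],
})

# Assumption: the string "0" means no mutation; anything else names a mutation
is_mutated = toy["rasgef1b_mut"].astype(str).ne("0")
toy["rasgef1b_status"] = is_mutated.map({True: "mutated", False: "no mutation"})

# Rows: mutation status, columns: overall_survival value
survival_by_mutation = pd.crosstab(toy["rasgef1b_status"], toy["overall_survival"])
print(survival_by_mutation)
```

With only a handful of mutated patients per gene in the real data, counts like these are descriptive only and would need a proper test before drawing conclusions.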

NOTE: each z score corresponds to a gene that may also appear in the gene mutation columns. I need to look at this and somehow link them, as the z scores show how much those genes are up- or downregulated.
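One way to link the two groups of columns is by name. The sketch below assumes that mutation columns end in `_mut` (as `rasgef1b_mut` and `siah1_mut` do) and that the matching z score column is named after the bare gene; that second naming rule is an assumption and should be checked against the real column list. The helper `match_gene_columns` and the toy column names are illustrative.

```python
import pandas as pd

def match_gene_columns(columns, mut_suffix="_mut"):
    """Split mutation genes into those with and without a same-named non-mutation column."""
    cols = [str(c) for c in columns]
    mutation_genes = {c[: -len(mut_suffix)] for c in cols if c.endswith(mut_suffix)}
    non_mutation_cols = {c for c in cols if not c.endswith(mut_suffix)}
    both = sorted(mutation_genes & non_mutation_cols)
    mutation_only = sorted(mutation_genes - non_mutation_cols)
    return both, mutation_only

# Toy column list; the gene names without "_mut" are hypothetical z score columns
toy_columns = pd.Index([
    "age_at_diagnosis", "brca1", "tp53", "siah1",
    "tp53_mut", "siah1_mut", "rasgef1b_mut",
])
both, mutation_only = match_gene_columns(toy_columns)
print("z score and mutation:", both)
print("mutation only:", mutation_only)
```

Calling `match_gene_columns(cancer_data.columns)` would show how many mutation columns have a z score partner and which ones do not.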

Should the patients be grouped by similar attributes? Would we want to use unsupervised learning to cluster them and then take a classification approach? The premise of the business problem is that patients with the same clinical attributes have different outcomes, so should we take this into account?

Also, there are many more z score columns than gene mutation columns. How do they match up? Why aren't there the same number? Is this EDA or wrangling?
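The cells that follow use a `clinical_data` frame that is not created in the cells above. A minimal sketch of one way to build it from `cancer_data` is given here. The column list is collected from the plotting cells in this notebook; the helper `select_clinical` and the toy frame are assumptions, and the real clinical subset may contain more columns.

```python
import pandas as pd

# Clinical column names referenced by the plotting cells in this notebook
clinical_cols = [
    "overall_survival", "type_of_breast_surgery", "cancer_type_detailed",
    "age_at_diagnosis", "cellularity", "chemotherapy",
    "pam50_+_claudin-low_subtype", "er_status_measured_by_ihc",
    "neoplasm_histologic_grade", "her2_status", "tumor_other_histologic_subtype",
    "hormone_therapy", "inferred_menopausal_state", "integrative_cluster",
    "primary_tumor_laterality", "lymph_nodes_examined_positive", "mutation_count",
    "overall_survival_months", "pr_status", "radio_therapy",
    "3-gene_classifier_subtype", "tumor_size", "tumor_stage",
]

def select_clinical(df, wanted):
    """Return a copy of the wanted columns that exist in df, reporting any that are absent."""
    present = [c for c in wanted if c in df.columns]
    absent = [c for c in wanted if c not in df.columns]
    if absent:
        print("Not found:", absent)
    return df[present].copy()

# Toy stand-in with two clinical columns and one mutation column
toy = pd.DataFrame({
    "age_at_diagnosis": [75.6, 43.2],
    "overall_survival": [1, 0],
    "tp53_mut": ["0", "0"],
})
toy_clinical = select_clinical(toy, clinical_cols)
print(toy_clinical.columns.tolist())
```

On the real data this would be `clinical_data = select_clinical(cancer_data, clinical_cols)`.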

In [None]:
# Looking at survival counts: 0 = died, 1 = lived. This is the most important one.

survival_counts = clinical_data['overall_survival'].value_counts()
print('survival counts:\n', survival_counts)

In [None]:
# But let's look at all of our data. For some reason it's showing all data types here, but it's good to be able to see it all.
for col in clinical_data:
    # Only categorical columns are meant to be counted
    if clinical_data[col].dtype == 'category':
        counts = clinical_data[col].value_counts()
        print('\n', counts)

In [None]:
# I WANT TO LOOK AT MY CLINICAL DATA - plotting all of them that make sense to plot. 

fig, ax = plt.subplots(2,3, figsize=(12,8))
clinical_data.overall_survival.value_counts().plot(kind='bar', ax=ax[0,0])
ax[0,0].set_title('Overall Survival')
ax[0,0].set_xlabel('Survival')

clinical_data.type_of_breast_surgery.value_counts().plot(kind='bar', ax=ax[0,1])
ax[0,1].set_title('Type of Breast Surgery')

clinical_data.cancer_type_detailed.value_counts().plot(kind='bar', ax=ax[0,2])
ax[0,2].set_title('cancer_type_detailed')

clinical_data.age_at_diagnosis.plot(kind='hist', ax=ax[1,0])
ax[1,0].set_title('Age at Diagnosis')
ax[1,0].set_xlabel('Age')


clinical_data.cellularity.value_counts().plot(kind='bar', ax=ax[1,1])
ax[1,1].set_title('Cellularity')

clinical_data.chemotherapy.value_counts().plot(kind='bar', ax=ax[1,2])
ax[1,2].set_title('Chemo')

plt.subplots_adjust(wspace=0.5, hspace=2.5);


In [None]:

fig, ax = plt.subplots(2,3, figsize=(12,8))
clinical_data['pam50_+_claudin-low_subtype'].value_counts().plot(kind='bar', ax=ax[0,0])
ax[0,0].set_title('pam50_+_claudin-low_subtype')

clinical_data.er_status_measured_by_ihc.value_counts().plot(kind='bar', ax=ax[0,1])
ax[0,1].set_title('er_status_measured_by_ihc')

clinical_data.neoplasm_histologic_grade.value_counts().plot(kind='bar', ax=ax[0,2])
ax[0,2].set_title('neoplasm_histologic_grade')

clinical_data.her2_status.value_counts().plot(kind='bar', ax=ax[1,0])
ax[1,0].set_title('her2_status')

clinical_data.tumor_other_histologic_subtype.value_counts().plot(kind='bar', ax=ax[1,1])
ax[1,1].set_title('tumor_other_histologic_subtype')

clinical_data.hormone_therapy.value_counts().plot(kind='bar', ax=ax[1,2])
ax[1,2].set_title('hormone_therapy')

plt.subplots_adjust(wspace=0.5, hspace=.5);

In [None]:

fig, ax = plt.subplots(2,3, figsize=(12,8))
clinical_data['inferred_menopausal_state'].value_counts().plot(kind='bar', ax=ax[0,0])
ax[0,0].set_title('inferred_menopausal_state')

clinical_data.integrative_cluster.value_counts().plot(kind='bar', ax=ax[0,1])
ax[0,1].set_title('integrative_cluster')

clinical_data.primary_tumor_laterality.value_counts().plot(kind='bar', ax=ax[0,2])
ax[0,2].set_title('primary_tumor_laterality')

clinical_data.lymph_nodes_examined_positive.value_counts().plot(kind='bar', ax=ax[1,0])
ax[1,0].set_title('lymph_nodes_examined_positive')
ax[1,0].set_xlim(0, 18)

clinical_data.mutation_count.plot(kind='hist',bins=100, ax=ax[1,1])
ax[1,1].set_title('mutation count')
ax[1,1].set_xlim(0, 25)

clinical_data.overall_survival_months.plot(kind='hist', ax=ax[1,2])
ax[1,2].set_title('overall_survival_months')

plt.subplots_adjust(wspace=0.5, hspace=.5);

In [None]:

fig, ax = plt.subplots(2,3, figsize=(12,8))
clinical_data['pr_status'].value_counts().plot(kind='bar', ax=ax[0,0])
ax[0,0].set_title('pr_status')

clinical_data.radio_therapy.value_counts().plot(kind='bar', ax=ax[0,1])
ax[0,1].set_title('radio_therapy')

clinical_data['3-gene_classifier_subtype'].value_counts().plot(kind='bar', ax=ax[0,2])
ax[0,2].set_title('3-gene_classifier_subtype')

clinical_data.tumor_size.value_counts().plot(kind='bar', ax=ax[1,0])
ax[1,0].set_title('tumor_size')
ax[1,0].set_xlim(0, 80)
# Keep every 10th x-axis tick so the labels stay readable
ax[1,0].set_xticks(ax[1,0].get_xticks()[::10])

# Tumor stage counts, ordered by stage
clinical_data.tumor_stage.value_counts().sort_index().plot(kind='bar', ax=ax[1,1])
ax[1,1].set_title('tumor_stage')

plt.subplots_adjust(wspace=0.5, hspace=.5);
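The four plotting cells above repeat the same pattern for each column. A small helper could draw a grid from a list of column names instead. This is only a sketch: the name `plot_grid`, the rule of using a histogram for numeric columns with more than 10 distinct values, and the toy data are all assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_grid(df, columns, ncols=3, figsize=(12, 8)):
    """Draw one panel per column: histogram if numeric with many values, else bar counts."""
    nrows = -(-len(columns) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize, squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, columns):
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > 10:
            df[col].plot(kind="hist", ax=ax)
        else:
            df[col].value_counts().sort_index().plot(kind="bar", ax=ax)
        ax.set_title(col)
    # Hide any panels left over when the grid is not full
    for ax in flat[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes

# Toy stand-in with one continuous and two categorical columns
toy = pd.DataFrame({
    "age_at_diagnosis": [40 + 3 * i for i in range(12)],
    "chemotherapy": [0, 1] * 6,
    "cellularity": ["High", "Low", "Moderate"] * 4,
})
fig, axes = plot_grid(toy, list(toy.columns))
```

With the real data, a call such as `plot_grid(clinical_data, ['overall_survival', 'type_of_breast_surgery', 'age_at_diagnosis'])` would replace one hand-written cell.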

In [None]:
# TAKEAWAYS

# The proportions for death and getting a mastectomy are similar. I'm guessing that the patients who had
# breast-conserving surgery had less severe disease and were therefore more likely to survive.
# Cancer type: the vast majority of cases were invasive ductal carcinoma.
# Age follows a roughly normal distribution, peaking around age 70. Most women were postmenopausal, which
# makes sense given the age distribution.
# Most patients did not get chemo, but did get radiotherapy and hormone therapy. The proportions for death and
# getting radiotherapy are roughly the same, and again I wonder if this is because more severe disease was present.
# HER2 status was negative for most patients; ER status was mostly positive.
# Most neoplasms were grade 3.
# Mutation counts peaked at 7.
# The number of positive lymph nodes looks to fall roughly exponentially.
# Left and right sides are about equal. I think we can take laterality out because I don't think it is
# relevant to survival.
# Maximum survival time: I'm not sure how to read this. I know data was collected for 351 months. Does this
# track how many people died at a given amount of time?
# Tumor stage and tumor size have similar distributions (both left skewed), and I'm guessing they are usually similar.
# PR status is roughly equal, but a little more positive.

# IDEA: it looks like we are missing a lot of tumor stage data, but hardly any tumor size. As they appear to
# follow a similar distribution, maybe we don't need the stage data. How can I figure out if they are similar enough?
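One way to approach the closing question is to look only at patients who have both values, summarize tumor size within each stage, and compute a rank correlation between the two. The sketch below uses made-up toy values and computes Spearman correlation as the Pearson correlation of ranks; it illustrates the check and is not a result from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in: some rows lack a stage, one lacks a size (values are made up)
toy = pd.DataFrame({
    "tumor_stage": [1, 1, 2, 2, 3, np.nan, np.nan, 2],
    "tumor_size": [10, 15, 25, 30, 60, 22, 18, np.nan],
})

# Keep only patients where both values are known
both_known = toy.dropna(subset=["tumor_stage", "tumor_size"])

# Distribution of tumor size within each stage
size_by_stage = both_known.groupby("tumor_stage")["tumor_size"].describe()
print(size_by_stage[["count", "mean", "min", "max"]])

# Spearman rank correlation, computed as Pearson correlation on the ranks
rank_corr = both_known["tumor_stage"].rank().corr(both_known["tumor_size"].rank())
print("rank correlation:", round(rank_corr, 3))
```

A high rank correlation and well-separated size ranges per stage would support leaning on tumor size where stage is missing. Staging usually also reflects lymph node involvement and metastasis, so size alone may not carry everything stage does.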