<h3>The purpose of this initial EDA notebook is to check the data and make sure there is nothing unusual that needs to be taken into account or fixed.</h3>

In [42]:
import pandas as pd
from sklearn.cluster import KMeans
import numpy as np

print('Pandas version used: ' + pd.__version__)
print('Numpy version used: ' + np.__version__)

Pandas version used: 1.0.5
Numpy version used: 1.18.5


In [2]:
# Read in the gene expression information:
gene_df = pd.read_csv('gene_datasets/gene_data.csv', index_col=0)

In [3]:
# Read in the label dataframe. This contains the labels of the type of tumor associated with the sample:
label_df = pd.read_csv('gene_datasets/gene_labels.csv', index_col=0)
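Because the two files are read separately, it may be worth confirming that they describe the same samples in the same order before going further. The following is a minimal sketch on small stand-in frames rather than the real data; the names `gene_demo` and `label_demo`, the label column name `Class`, and the label values themselves are hypothetical placeholders, not taken from the source files.

```python
import pandas as pd

# Tiny stand-in frames; in the notebook these would be gene_df and label_df.
# The column name 'Class' and the label values are assumptions for illustration.
gene_demo = pd.DataFrame(
    {"gene_0": [0.0, 0.0, 0.0], "gene_1": [2.02, 0.59, 3.51]},
    index=["sample_0", "sample_1", "sample_2"],
)
label_demo = pd.DataFrame(
    {"Class": ["PRAD", "LUAD", "PRAD"]},
    index=["sample_0", "sample_1", "sample_2"],
)

# True only if both frames list the same sample IDs in the same order.
same_samples = gene_demo.index.equals(label_demo.index)
# Duplicate sample IDs would make any later join ambiguous.
has_duplicates = gene_demo.index.has_duplicates

print(same_samples, has_duplicates)
```

If the first value were `False`, the labels would need to be reindexed to the gene dataframe before any comparison between clusters and tumor types.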

In [6]:
# Check the shape of the gene dataframe:
gene_df.shape

(801, 20531)

The gene dataframe consists of 801 samples with 20,531 gene expression columns:

In [34]:
gene_df.head(10)

Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,...,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
sample_0,0.0,2.017209,3.265527,5.478487,10.431999,0.0,7.175175,0.591871,0.0,0.0,...,4.926711,8.210257,9.723516,7.22003,9.119813,12.003135,9.650743,8.921326,5.286759,0.0
sample_1,0.0,0.592732,1.588421,7.586157,9.623011,0.0,6.816049,0.0,0.0,0.0,...,4.593372,7.323865,9.740931,6.256586,8.381612,12.674552,10.517059,9.397854,2.094168,0.0
sample_2,0.0,3.511759,4.327199,6.881787,9.87073,0.0,6.97213,0.452595,0.0,0.0,...,5.125213,8.127123,10.90864,5.401607,9.911597,9.045255,9.788359,10.09047,1.683023,0.0
sample_3,0.0,3.663618,4.507649,6.659068,10.196184,0.0,7.843375,0.434882,0.0,0.0,...,6.076566,8.792959,10.14152,8.942805,9.601208,11.392682,9.694814,9.684365,3.292001,0.0
sample_4,0.0,2.655741,2.821547,6.539454,9.738265,0.0,6.566967,0.360982,0.0,0.0,...,5.996032,8.891425,10.37379,7.181162,9.84691,11.922439,9.217749,9.461191,5.110372,0.0
sample_5,0.0,3.467853,3.581918,6.620243,9.706829,0.0,7.75851,0.0,0.0,0.0,...,5.726657,8.602588,9.928339,6.096154,9.816001,11.556995,9.24415,9.836473,5.355133,0.0
sample_6,0.0,1.224966,1.691177,6.572007,9.640511,0.0,6.754888,0.531868,0.0,0.0,...,5.105904,7.927968,9.673966,1.877744,9.802692,13.25606,9.664486,9.244219,8.330912,0.0
sample_7,0.0,2.854853,1.750478,7.22672,9.758691,0.0,5.952103,0.0,0.0,0.0,...,5.297833,8.277092,9.59923,5.24429,9.994339,12.670377,9.987733,9.216872,6.55149,0.0
sample_8,0.0,3.992125,2.77273,6.546692,10.488252,0.0,7.690222,0.352307,0.0,4.067604,...,6.721974,9.597533,9.763753,7.933278,10.95288,12.498919,10.389954,10.390255,7.828321,0.0
sample_9,0.0,3.642494,4.423558,6.849511,9.464466,0.0,7.947216,0.724214,0.0,0.0,...,6.020051,8.712809,10.259096,6.131583,9.923582,11.144295,9.244851,9.484299,4.759151,0.0


Based on this very small snippet of the dataframe, gene expression levels appear to fall between 0 and a little over 13.

These gene expression values are likely standardized relative to the expression of a common human 'housekeeping' gene. Housekeeping genes are expressed as a normal part of cellular function, so their expression is not expected to fluctuate much, if at all.

Under that assumption, the value of 2.017209 for gene_1 of sample_0 would mean that this gene was expressed a little over twice as much as the housekeeping gene. A reading of 0.0 would mean the gene was not expressed at all, and a value of 1.0 would mean that the gene was expressed exactly as much as the reference gene.

References:

https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkaa609/5871367

https://journals.physiology.org/doi/full/10.1152/physiolgenomics.2001.7.2.95

A database of reference genes for human and mice can be found here:

http://www.housekeeping.unicamp.br/

Let's also look at the label_df and how many of each type of tumor we have samples for:

In [40]:
# Check the number of each tumor type in the label dataframe:
np.unique(label_df, return_counts=True)

(array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype=object),
 array([300,  78, 146, 141, 136], dtype=int64))

In this dataframe, we have:

300 BRCA tumors,<br>
78 COAD tumors,<br>
146 KIRC tumors,<br>
141 LUAD tumors,<br>
136 PRAD tumors.

The tumor labels correspond to the following types of tumors:<br>

BRCA = Breast Carcinoma<br>
COAD = Colon Adenocarcinoma<br>
KIRC = Kidney Renal Clear Cell Carcinoma<br>
LUAD = Lung Adenocarcinoma<br>
PRAD = Prostate Adenocarcinoma<br>

Information from:

Cancer Genome Atlas Research Network, Weinstein JN, Collisson EA, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113-1120. doi:10.1038/ng.2764

So the largest single group of samples (300 of 801, or about 37%) comes from patients with breast carcinoma, and the classes are noticeably imbalanced.
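The class shares can also be read off directly with `value_counts`. This is a small sketch that rebuilds a label column from the counts reported above instead of reading the real file; the column name `Class` is an assumption about how the label file is laid out.

```python
import pandas as pd

# Rebuild a label column from the counts reported above.
# The column name 'Class' is an assumption about the label file.
counts = {"BRCA": 300, "COAD": 78, "KIRC": 146, "LUAD": 141, "PRAD": 136}
labels = pd.Series(
    [name for name, n in counts.items() for _ in range(n)], name="Class"
)

# Absolute counts and the share of each class, largest first.
class_counts = labels.value_counts()
class_share = labels.value_counts(normalize=True).round(3)

print(class_counts)
print(class_share)
```

On the real data, the same two calls on the label column of `label_df` should give the counts shown in the cell above.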

In [36]:
# Make sure there are no null values in the gene dataframe.
# With so many features, it is easiest to add the per-column null counts into one total:
gene_df.isnull().sum().sum()

0

There do not appear to be any null values, and according to the source there are no missing or otherwise unusual values included. With...

In [37]:
801*20531

16445331

16,445,331 values, I will take their word for it, but I will run a few other checks just to make sure:

In [52]:
# There shouldn't be any negative values, but let's just make sure.
# Select every row that contains at least one negative value:
gene_df[(gene_df < 0).any(axis=1)]

Unnamed: 0,gene_0,gene_1,gene_2,gene_3,gene_4,gene_5,gene_6,gene_7,gene_8,gene_9,...,gene_20521,gene_20522,gene_20523,gene_20524,gene_20525,gene_20526,gene_20527,gene_20528,gene_20529,gene_20530
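When screening rows for negatives, row-wise `all` and `any` answer different questions: `all` keeps only rows in which every value is negative, while `any` keeps rows with at least one negative value. The sketch below illustrates the difference on a hypothetical two-row frame (`demo` is not part of the dataset).

```python
import pandas as pd

# Hypothetical frame with a single negative entry in the second row.
demo = pd.DataFrame(
    {"gene_0": [0.0, -1.0], "gene_1": [2.0, 3.0]},
    index=["sample_0", "sample_1"],
)

negative = demo < 0
# Rows in which every value is negative: none in this frame.
rows_all_negative = demo[negative.all(axis=1)]
# Rows in which at least one value is negative: sample_1.
rows_any_negative = demo[negative.any(axis=1)]

print(len(rows_all_negative), list(rows_any_negative.index))
```

An empty result from the `any` form is the one that supports the conclusion that the data contains no negative values.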


No negative values is a good sign. Let's check the average value across the entire dataframe too:

In [53]:
# Add up the mean of each column, then divide by the number of columns to get the overall average:
sum_col = 0
for col in gene_df.mean(axis=0):
    sum_col += col
print(sum_col, sum_col/len(gene_df.columns))

132287.8494550156 6.44332226657326


The average value, 6.44, seems reasonable given what we saw in the snippet of the dataframe above. It is consistent with a dataset free of gross errors, although a mean alone cannot rule out individual errant values.
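A slightly stronger check than the mean is to look at the global minimum and maximum and to confirm that every entry is finite. The following is a minimal sketch on a small stand-in frame; `demo` is hypothetical and reuses a few values from the snippet above, and in the notebook the same calls would be applied to `gene_df`.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for gene_df.
demo = pd.DataFrame(
    {"gene_0": [0.0, 0.0, 0.0], "gene_1": [2.017209, 0.592732, 3.511759]},
    index=["sample_0", "sample_1", "sample_2"],
)

values = demo.to_numpy()

# NaN and +/-inf both fail np.isfinite, so this covers more than isnull().
all_finite = bool(np.isfinite(values).all())
# The global extremes expose any single out-of-range entry.
global_min = values.min()
global_max = values.max()
# Mean over every cell; this equals the mean of the column means here because
# every column has the same number of rows and there are no missing values.
global_mean = values.mean()

print(all_finite, global_min, global_max, global_mean)
```

A minimum below 0 or a maximum far above the range seen in the snippet would point to entries worth inspecting individually.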

Since the dataset seems to be in good shape, I will continue on to perform the clustering.