AI LAB REPORT ELENA NOTES

In [None]:
import sys
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_meta_HCC = pd.read_csv("raw_data/HCC1806_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
df_meta_MCF = pd.read_csv("raw_data/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
print("Meta data dimensions for HCC1806:", df_meta_HCC.shape)
print("Meta data dimensions for MCF7:", df_meta_MCF.shape)

In [None]:
df_meta_HCC.head(10)
df_meta_HCC.describe()

In [None]:
df_meta_MCF.head(10)
df_meta_MCF.describe()

Exploring the unfiltered data

We import all the files needed for the entire project (including those used later for model training).

In [None]:
#HCC cell line (the files are space-delimited)
df_HCC_s_f = pd.read_csv("raw_data/HCC1806_SmartS_Filtered_Data.txt", delimiter=" ",engine='python',index_col=0)
df_HCC_s_f_n_test = pd.read_csv("raw_data/HCC1806_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter=" ",engine='python',index_col=0)
df_HCC_s_f_n_train = pd.read_csv("raw_data/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt", delimiter=" ",engine='python',index_col=0)
df_HCC_s_uf = pd.read_csv("raw_data/HCC1806_SmartS_Unfiltered_Data.txt", delimiter=" ",engine='python',index_col=0)

#MCF cell line
df_MCF_s_f = pd.read_csv("raw_data/MCF7_SmartS_Filtered_Data.txt", delimiter=" ",engine='python',index_col=0)
df_MCF_s_f_n_test = pd.read_csv("raw_data/MCF7_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt", delimiter=" ",engine='python',index_col=0)
df_MCF_s_f_n_train = pd.read_csv("raw_data/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt", delimiter=" ",engine='python',index_col=0)
df_MCF_s_uf = pd.read_csv("raw_data/MCF7_SmartS_Unfiltered_Data.txt", delimiter=" ",engine='python',index_col=0)


In this part, we first analyse the unfiltered data through plots and summary statistics. Understanding the dataset better helps us identify potential problems and lays the foundation for the next steps: filtering and normalization.

In [None]:
print("Number of genes for unfiltered MCF7 data: ", df_MCF_s_uf.shape[0])
print("Number of cells for unfiltered MCF7 data: ", df_MCF_s_uf.shape[1])
df_MCF_s_uf.describe()

The rows of each dataset represent gene codes and are labeled as follows:

In [None]:
print("First 5 gene codes of HCC1806 data: \n", np.array(df_HCC_s_uf.index.values)[:5], "\n")
print("First 5 gene codes of MCF7 data:\n ", np.array(df_MCF_s_uf.index.values)[:5])


On the other hand, the columns of the datasets are the individual cells that were sequenced. In the cell name itself, a lot of information is given: technique used, Hyp/Norm, (...)

In [None]:
print("First 5 cells of HCC1806 data: \n", np.array(df_HCC_s_uf.columns)[:5], "\n")
print("First 5 cells of MCF7 data:\n ", np.array(df_MCF_s_uf.columns)[:5])

!!! ADD !!!

Missing Values:
One of the first things to check is whether there are missing values. In these datasets there are none: if a gene was not detected in a specific cell, its value was set to 0, which rules out NA entries. We do notice, however, that many rows contain a large number of zeros, a problem we discuss further on.
Since there are no missing values, we can proceed with some preliminary analysis of our datasets.

In [None]:
#Creating a function which returns the total number of missing values given a data set
def missing(df):
    #Sum the missing entries over every column, not just the first column that has any
    total = int(df.isnull().sum().sum())
    if total != 0:
        return str(total)
    return "No missing values"

print("Number of missing values for the HCC1806 data: ", missing(df_HCC_s_uf))
print("Number of missing values for the MCF7 data: ", missing(df_MCF_s_uf))
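
Beyond the total, it can help to see where missing values would sit if there were any. The following is a minimal sketch on a made-up table (the `toy` frame and its gene and cell names are purely illustrative, not project data); it counts missing entries overall, per cell and per gene.

```python
import numpy as np
import pandas as pd

# Toy expression table: 3 genes (rows) x 2 cells (columns), one value deliberately missing
toy = pd.DataFrame(
    {"cell_A": [0, 5, np.nan], "cell_B": [1, 0, 2]},
    index=["gene1", "gene2", "gene3"],
)

total_missing = int(toy.isna().sum().sum())   # all missing entries in the table
missing_per_cell = toy.isna().sum(axis=0)     # one count per column (cell)
missing_per_gene = toy.isna().sum(axis=1)     # one count per row (gene)

print("Total missing:", total_missing)
print("Missing per cell:\n", missing_per_cell)
print("Missing per gene:\n", missing_per_gene)
```

On the real datasets all three results would be expected to be zero, in line with the check above.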

As mentioned previously, there are many zero values, hence some genes occur rarely. This means that we are dealing with a sparse dataset, and it has to be taken into account throughout this analysis.

In [None]:
#Function that returns the percentage of entries which are zero given a data frame
def frac_zeros(df):
    return (df == 0).to_numpy().mean() * 100


print("Percentage of entries which are zero in the HCC1806 dataset: ", f"{round(frac_zeros(df_HCC_s_uf), 2)}%")
print("Percentage of entries which are zero in the MCF7 dataset: ", f"{round(frac_zeros(df_MCF_s_uf), 2)}%")
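
The overall percentage hides how the zeros are distributed. A small sketch, assuming genes are in the rows and cells in the columns as in our datasets (the `toy` matrix is invented for illustration), breaks the sparsity down per gene and per cell and lists the genes that are never detected.

```python
import pandas as pd

# Toy count matrix: 4 genes (rows) x 3 cells (columns)
toy = pd.DataFrame(
    {"cell_A": [0, 0, 3, 0], "cell_B": [0, 2, 1, 0], "cell_C": [0, 0, 7, 4]},
    index=["gene1", "gene2", "gene3", "gene4"],
)

is_zero = toy == 0
overall_pct = is_zero.to_numpy().mean() * 100   # share of all entries that are zero
zeros_per_gene = is_zero.mean(axis=1) * 100     # one percentage per row (gene)
zeros_per_cell = is_zero.mean(axis=0) * 100     # one percentage per column (cell)

# Genes whose count is zero in every cell
never_detected = zeros_per_gene[zeros_per_gene == 100].index.tolist()

print(f"Overall: {overall_pct:.2f}% zeros")
print("Never detected:", never_detected)
```

Genes that are zero in every cell carry no information and are natural candidates for filtering.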


BOXPLOT, KERNEL DENSITY PLOT, HEATMAP

In [None]:
plt.boxplot(df_MCF_s_uf.values)
plt.show()

plt.boxplot(df_HCC_s_uf.values)
plt.show()

In [None]:
sns.kdeplot(data=df_MCF_s_uf, fill=True)
#plt.xlabel("")
#plt.ylabel("")
#plt.title("")
plt.show()

sns.kdeplot(data=df_HCC_s_uf, fill=True)
plt.show()

In [None]:
sns.heatmap(df_MCF_s_uf.corr(), annot=True)
plt.show()

sns.heatmap(df_HCC_s_uf.corr(), annot=True)
plt.show()

#A heatmap is a graphical representation of data where the values are represented as colors in a two-dimensional matrix. 

Swarm plot, Pair plot, Clustermap

In [None]:
#Swarm plot of the gene counts in the first cell of each dataset
sns.swarmplot(x=df_MCF_s_uf.iloc[:, 0])
plt.show()

sns.swarmplot(x=df_HCC_s_uf.iloc[:, 0])
plt.show()

#A swarm plot is a categorical scatter plot that displays the distribution of observations for each category by position along the horizontal axis

In [None]:
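#Only the first 5 cells are used so that the grid of scatter plots stays tractable
sns.pairplot(df_MCF_s_uf.iloc[:, :5])
plt.show()

sns.pairplot(df_HCC_s_uf.iloc[:, :5])
plt.show()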


#Pair plot: A pair plot is a grid of scatter plots that displays the pairwise relationships between variables in a dataset. 

In [None]:
sns.clustermap(df_MCF_s_uf)
plt.show()

sns.clustermap(df_HCC_s_uf)
plt.show()

#A clustermap is a heatmap that arranges the rows and columns of a dataset according to their similarity. Clustermaps are useful for identifying groups and clusters within a dataset.

Investigating the genes:
As a next step we plot some violin graphs. A violin plot is a statistical graphic that, given a cell, visualizes how many genes take each value in that cell's column.
There is somewhat of a drawback: the count recorded for a gene in a cell can be essentially any non-negative integer, from 0 to over 50 000.
Therefore, it is very rare for many genes to have exactly the same count. The result is that a large number of genes occur 0 times, while all the other genes are spread out between 0 and the maximum. We do observe, however, that the gene counts tend to accumulate around lower values, and only a few genes have a very large number of occurrences.
(??)

In [None]:
#Function to create the violin plots
cnames_MCF = list(df_MCF_s_uf.columns)
cnames_HCC = list(df_HCC_s_uf.columns)
def violin(df, n=5):
    cnames = list(df.columns)
    for i in range(n):
        #We square i just so that we do not only get the first few cells.
        sns.boxplot(x=df[cnames[i**2]])
        sns.violinplot(x=df[cnames[i**2]])
        plt.show()

#Violin plots for the HCC1806 dataset  
violin(df_HCC_s_uf)

We can also directly compare the violin plots of 50 cells. These plots show the range of gene counts for a subset of the columns of our dataset; as we have seen, the values on the violin graphs tend to concentrate around lower values. We also (temporarily) shuffle the columns at random so that we are not always graphing the same 50 cells.

In [None]:
#Comparing violin plots for the HCC1806 dataset
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_HCC_s_uf.sample(frac=1, axis = 'columns').iloc[:,:50],palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

In [None]:
#Comparing violin plots for the MCF7 dataset
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_MCF_s_uf.sample(frac=1, axis = 'columns').iloc[:,:50],palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

Moving our attention to the genes, we plot a graph which illustrates the total number of times a gene occurs over all the cells, restricted to the 50 genes with the largest number of occurrences. In doing so we get to see whether the dataset contains some genes that appear a lot and some that never appear, or whether the appearances are more evenly spread. For both datasets we see that, after the initial spike of very common genes, the bar graph smooths out. We also calculate how many total gene occurrences we neglect by plotting only the 50 most common genes, and we find that the remaining genes still account for a very large number of gene detections (which we expect because of the large number of genes in the datasets).

In [None]:
#Total count of each gene over all cells (showing the 50 largest totals)
totals_HCC = df_HCC_s_uf.sum(axis='columns')
largest_HCC = totals_HCC.nlargest(50)
remaining_HCC = totals_HCC.sum() - largest_HCC.sum()
print("Number of remaining occurrences:", remaining_HCC)
plt.figure(figsize=(12,6))
ax = largest_HCC.plot.bar(fontsize = 5)
plt.xlabel('Genes')
plt.ylabel('Number of occurrences')
plt.show()

In [None]:
#Total count of each gene over all cells (showing the 50 largest totals)
totals_MCF = df_MCF_s_uf.sum(axis='columns')
largest_MCF = totals_MCF.nlargest(50)
remaining_MCF = totals_MCF.sum() - largest_MCF.sum()
print("Number of remaining occurrences:", remaining_MCF)
plt.figure(figsize=(12,6))
ax = largest_MCF.plot.bar(fontsize = 5)
plt.xlabel('Genes')
plt.ylabel('Number of occurrences')
plt.show()
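
To put the number of neglected occurrences in perspective, it can be expressed as a share of all counts. The helper below is a sketch with a hypothetical name (`top_k_share`) and an invented toy table; it returns the fraction of all counts that belongs to the k genes with the largest totals.

```python
import pandas as pd

def top_k_share(df, k):
    """Fraction of all counts that belongs to the k genes with the largest totals."""
    totals = df.sum(axis="columns")   # total count per gene (row)
    return totals.nlargest(k).sum() / totals.sum()

# Toy count matrix: 4 genes (rows) x 2 cells (columns)
toy = pd.DataFrame(
    {"cell_A": [90, 5, 3, 2], "cell_B": [60, 20, 15, 5]},
    index=["gene1", "gene2", "gene3", "gene4"],
)

share_top1 = top_k_share(toy, 1)
share_top2 = top_k_share(toy, 2)
print(f"Top 1 gene: {share_top1:.1%} of all counts")
print(f"Top 2 genes: {share_top2:.1%} of all counts")
```

Applied to our data with k=50, one minus this share is the fraction of detections that the bar graph leaves out.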

This next step will be very useful later on: for each dataset we differentiate between cells from the hypoxia experiment and cells from the normoxia experiment. We then create two sub-datasets, one containing all the columns corresponding to hypoxia cells and the other containing only the columns of normoxia cells.

In [None]:
#Function that returns lists of all cells that were part of the hypoxia and normoxia groups
def hypo_and_norm(df):
    hypo = []
    norm = []
    for cell in df.columns:
        tokens = cell.split("_")
        if "Hypo" in tokens or "Hypoxia" in tokens:
            hypo.append(cell)
        elif "Norm" in tokens or "Normoxia" in tokens:
            norm.append(cell)
        else:
            print("Unknown:", cell)
    return (hypo, norm)

#Split the cell names once per dataset
hypo_MCF, norm_MCF = hypo_and_norm(df_MCF_s_uf)
hypo_HCC, norm_HCC = hypo_and_norm(df_HCC_s_uf)

#Data sets that contain only hypoxia cells
df_MCF_hypo = df_MCF_s_uf[hypo_MCF]
df_HCC_hypo = df_HCC_s_uf[hypo_HCC]

#Data sets that contain only normoxia cells
df_MCF_norm = df_MCF_s_uf[norm_MCF]
df_HCC_norm = df_HCC_s_uf[norm_HCC]

#How many hypoxia and how many normoxia are in each dataset
print("Number of cells exposed to hypoxia for HCC1806 data: ", len(hypo_HCC))
print("Number of cells exposed to normoxia for HCC1806 data: ", len(norm_HCC))

print("Number of cells exposed to hypoxia for MCF data: ", len(hypo_MCF))
print("Number of cells exposed to normoxia for MCF data: ", len(norm_MCF))
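
The same information can also be stored as one label per cell, which may be convenient later as a target vector for the models. This is only a sketch: the function name `condition_labels` and the cell names are hypothetical, and the only assumption about the real names is that they contain an underscore-separated token such as Hypo or Norm.

```python
import pandas as pd

def condition_labels(columns):
    """Map each cell name to 'hypoxia', 'normoxia' or 'unknown' based on its tokens."""
    labels = {}
    for cell in columns:
        tokens = set(cell.split("_"))
        if tokens & {"Hypo", "Hypoxia"}:
            labels[cell] = "hypoxia"
        elif tokens & {"Norm", "Normoxia"}:
            labels[cell] = "normoxia"
        else:
            labels[cell] = "unknown"
    return pd.Series(labels, name="condition")

# Hypothetical cell names, for illustration only
toy_cells = ["c1_Hypo_S1", "c2_Norm_S2", "c3_Normoxia_S3", "c4_S4"]
labels = condition_labels(toy_cells)
print(labels.value_counts())
```

Keeping the labels in a Series indexed by cell name makes it easy to align them with the columns of an expression table.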


In view of the final goal of this report, we looked at whether, even without a model, some genes seem to be more present in the cells of the hypoxia environment or vice versa. To illustrate this we did the following for both datasets. First we took only the columns with cells from the hypoxia experiment and summed them, so we could see how often each gene was found in the cells that had little oxygen. We did the same for the normoxia cells and computed the absolute difference between the gene occurrences in normoxia cells and hypoxia cells. We then took the 20 genes with the largest differences. The idea of this representation is to see whether some genes are obviously more present in hypoxia cells. If this were the case, we would be led to believe that such a gene may play a role in the survival of a cell with little oxygen. Similarly, if a gene were very present only in normoxia cells, then this gene might not be useful in a hypoxia environment (or it might even be detrimental). Please note that we cannot draw strong conclusions from the following graphs: differences might also be due to sampling bias.

In [None]:
def hypo_vs_norm(df_hypo, df_norm,n=20, width = 0.25, title="Hypoxia vs Normoxia"):
    #Get a list of the total occurences of each gene
    genes_norm = df_norm.sum(axis='columns')
    genes_hypo = df_hypo.sum(axis='columns')

    #Find the genes with the biggest difference of occurences between hypo cells and norm cells
    largest_diffs = (genes_hypo.sub(genes_norm)).apply(abs).nlargest(n)
    largest_diffs_genes = largest_diffs.index.values

    #Bar graph with gene occurences in hypo vs norm
    plt.bar(np.arange(len(genes_hypo[largest_diffs_genes])), 
            genes_hypo[largest_diffs_genes].tolist(), 
            color ='r', 
            width = width,
            edgecolor ='grey', 
            label ='Hypoxia')

    plt.bar([x + width for x in np.arange(len(genes_hypo[largest_diffs_genes]))],
            genes_norm[largest_diffs_genes].tolist(), 
            color ='g', 
            width = width,
            edgecolor ='grey', 
            label ='Normoxia')

    plt.xticks([r + width for r in range(len(largest_diffs_genes))],
            largest_diffs_genes,
            rotation=90,
            fontsize=10)
    plt.title(title, weight='bold')
    plt.legend()
    plt.show()
    return largest_diffs_genes


hypo_vs_norm(df_HCC_hypo, df_HCC_norm, title = "Genes with the largest difference of occurrences for HCC")
hypo_vs_norm(df_MCF_hypo, df_MCF_norm, title = "Genes with the largest difference of occurrences for MCF")
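
Because the two groups need not contain the same number of cells, differences of raw sums can partly reflect group size rather than expression. One possible way to reduce this effect, sketched here as an assumption and not as the method used in this report, is to compare the mean count per cell on a log scale. The function name `log2_fold_change`, the pseudocount of 1 and the toy tables are all illustrative.

```python
import numpy as np
import pandas as pd

def log2_fold_change(df_hypo, df_norm, pseudocount=1.0):
    """log2 ratio of the mean count per gene in hypoxia vs normoxia cells."""
    mean_hypo = df_hypo.mean(axis="columns")
    mean_norm = df_norm.mean(axis="columns")
    # The pseudocount avoids division by zero for genes that are never detected
    return np.log2((mean_hypo + pseudocount) / (mean_norm + pseudocount))

genes = ["gene1", "gene2", "gene3"]
# Three hypoxia cells but only one normoxia cell (made-up counts)
toy_hypo = pd.DataFrame({"h1": [7, 0, 3], "h2": [7, 0, 3], "h3": [7, 0, 3]}, index=genes)
toy_norm = pd.DataFrame({"n1": [1, 3, 3]}, index=genes)

lfc = log2_fold_change(toy_hypo, toy_norm)
print(lfc)
```

In this toy example gene3 has the same count in every cell, so its log fold change is 0, even though its raw sums (9 versus 3) differ simply because there are more hypoxia cells.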


Data Cleaning:
Now that we have analysed the datasets, we can move on to data cleaning. In this process, we can identify and correct any potential issues that could arise when analysing our dataset. We had already foreshadowed this while checking for missing values, but since there were none we did not have to change the dataset. 

Outliers:
The next step is to find any possible outliers and remove them. Outliers can degrade the performance of our models, so we ought to remove them to avoid this problem. We must tread carefully though: if we decide to remove a point, we must be very confident that it is indeed an outlier and does not contain any useful information, otherwise removing it can compromise the data and therefore our results.

In [None]:
Q1_HCC = df_HCC_s_uf.quantile(0.25)
Q3_HCC = df_HCC_s_uf.quantile(0.75)
Q1_MCF = df_MCF_s_uf.quantile(0.25)
Q3_MCF = df_MCF_s_uf.quantile(0.75)
IQR_HCC = Q3_HCC - Q1_HCC
IQR_MCF = Q3_MCF - Q1_MCF
print("HCC:\n", IQR_HCC)
print("MCF:\n", IQR_MCF)

A first and rather crude way of removing outliers is to look only at the quantiles. This is a very easy way of removing outliers; however, we risk eliminating a lot of useful data points. We see in fact that in both cases this operation removes more than half of the rows (genes). It is very improbable, if not impossible, that more than half of our data points are outliers. Later on we will see better ways to discover outliers, using clustering and SVMs for example.

In [None]:
df_HCC_noOut = df_HCC_s_uf[~((df_HCC_s_uf < (Q1_HCC - 1.5 * IQR_HCC)) |(df_HCC_s_uf > (Q3_HCC + 1.5 * IQR_HCC))).any(axis=1)]
print("Shape with outliers: ", df_HCC_s_uf.shape)
print("Shape without outliers: ", df_HCC_noOut.shape)
print("Number of removed data points: ", df_HCC_s_uf.shape[0] - df_HCC_noOut.shape[0])
df_HCC_noOut.head(3)

In [None]:
df_MCF_noOut = df_MCF_s_uf[~((df_MCF_s_uf < (Q1_MCF - 1.5 * IQR_MCF)) |(df_MCF_s_uf > (Q3_MCF + 1.5 * IQR_MCF))).any(axis=1)]
print("Shape with outliers: ", df_MCF_s_uf.shape)
print("Shape without outliers: ", df_MCF_noOut.shape)
print("Number of removed data points: ", df_MCF_s_uf.shape[0] - df_MCF_noOut.shape[0])
df_MCF_noOut.head(3)

The most likely explanation for such poor outlier detection using quantiles is that the data is very sparse. We have already seen that a lot of the entries of the dataset are zeros, so the quantiles are influenced drastically. This gives the impression that any data point with a lot of gene occurrences is an outlier, which is not the case: these points are in fact where most of our information comes from.
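
A tiny worked example makes the mechanism concrete. Assuming a single made-up cell in which 8 of 10 genes have a count of zero, both quartiles are 0, the interquartile range collapses to 0, and the upper fence Q3 + 1.5 * IQR is 0 as well, so every expressed gene is flagged.

```python
import pandas as pd

# One toy "cell": 8 of 10 genes have zero counts, two are expressed
counts = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 12, 40])

q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Everything above the fence would be treated as an outlier
flagged = counts[counts > upper_fence]

print("Q1, Q3, IQR:", q1, q3, iqr)
print("Flagged values:", flagged.tolist())
```

The two flagged values are exactly the informative entries, which is why the quantile rule is unsuitable for data this sparse.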

Features with sparse data have mostly zero values, for example vectors of one-hot-encoded words or counts of categorical data. On the other hand, features with dense data have predominantly non-zero values.

can you quantify the sparsity?

would using sparse matrix representation be an advantage?

what would you do to address this sparsity?
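
As a starting point for these questions, the sketch below quantifies sparsity and compares the memory footprint of a dense array with a compressed sparse row (CSR) matrix from `scipy.sparse`. The matrix is synthetic (roughly 90% zeros, chosen arbitrarily), so the numbers say nothing about our actual datasets.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
# Synthetic count matrix: 1000 genes x 50 cells, roughly 90% zeros (made-up numbers)
mask = rng.random((1000, 50)) < 0.1
dense = rng.poisson(5, size=(1000, 50)) * mask

csr = sparse.csr_matrix(dense)

# Sparsity = fraction of entries that are zero
sparsity = 1 - csr.nnz / (dense.shape[0] * dense.shape[1])

# Memory used by the raw values (dense) vs values + index arrays (CSR)
dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"Sparsity: {sparsity:.1%}")
print(f"Dense: {dense_bytes} bytes, CSR: {sparse_bytes} bytes")
```

Whether a sparse representation pays off depends on how sparse the data is and on whether the later steps (plots, scikit-learn estimators) accept sparse input.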

In [None]:
######
###### Should we keep this, Matt does not really see the point
######
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_MCF_noOut.iloc[:,:50],palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

In [None]:
######
###### Should we keep this, Matt does not really see the point
######
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_HCC_noOut.iloc[:,:50],palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

Distribution:
In any dataset it is always useful to have at least an idea of what kind of distribution the data follows. While we will never know with certainty, we can find some models that best approximate the data. Let's start with the skewness of both our datasets.

From the graphs below we see that the data has a very large positive skew. This is exactly what we expect from our dataset: we mentioned before that a lot of the entries in our datasets are zero, so the mode (of a single cell) is probably 0. The mean, on the other hand, is strongly affected by the large values present in our columns (which can reach the order of 1e5), and the median lies somewhere between the two. In fact, when mode < median < mean, the Fisher-Pearson coefficient (calculated by scipy.stats.skew) is typically a positive number. In our case the skewness is very drastic, as our numbers range over a very large interval.

In [None]:
#Skewness
from scipy.stats import skew

def skewness(df1, df2, title1 = '', title2 = ''):
  figure, ax = plt.subplots(1, 2, figsize=(12,6))
  cnames1 = list(df1.columns)
  cnames2 = list(df2.columns)
  colN1 = np.shape(df1)[1]
  colN2 = np.shape(df2)[1]
  df_skew_cells1 = []
  df_skew_cells2 = []

  for i in range(colN1) :     
      v_df1 = df1[cnames1[i]]
      df_skew_cells1 += [skew(v_df1)]
   
  for i in range(colN2):
     v_df2 = df2[cnames2[i]]
     df_skew_cells2 += [skew(v_df2)]

  #First graph 
  ax[0].hist(df_skew_cells1,bins=100)
  ax[0].set_title("Skewness of single cells for " + title1)

  #Second graph 
  ax[1].hist(df_skew_cells2,bins=100)
  ax[1].set_title("Skewness of single cells for " + title2)
  
  #plt.xlabel('Skewness of single cells expression profiles - original df')
  #print( "Skewness of normal distribution: ", skew(df_skew_cells) )

skewness(df_HCC_s_uf, df_MCF_s_uf, title1="HCC1806", title2="MCF7")
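
For reference, the Fisher-Pearson coefficient is g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. The sketch below computes it by hand on a made-up, zero-heavy profile and compares it with `scipy.stats.skew` (with its default settings).

```python
import numpy as np
from scipy.stats import skew

# Toy single-cell profile: mostly zeros plus a few large counts (made-up values)
x = np.array([0, 0, 0, 0, 0, 0, 1, 2, 5, 100], dtype=float)

m2 = np.mean((x - x.mean()) ** 2)   # second central moment
m3 = np.mean((x - x.mean()) ** 3)   # third central moment
g1_manual = m3 / m2 ** 1.5

g1_scipy = skew(x)

print("Mean:", x.mean(), "Median:", np.median(x))
print("Manual skewness:", g1_manual, "scipy skewness:", g1_scipy)
```

The mean sits well above the median here, and the coefficient is positive, which mirrors the situation described for the real cells.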


In [None]:
#Kurtosis
from scipy.stats import kurtosis

def kurt(df1, df2, title1 = '', title2 = ''):
  figure, ax = plt.subplots(1, 2, figsize=(12,6))
  cnames1 = list(df1.columns)
  cnames2 = list(df2.columns)
  colN1 = np.shape(df1)[1]
  colN2 = np.shape(df2)[1]
  df_kurt_cells1 = []
  df_kurt_cells2 = []

  for i in range(colN1) :     
      v_df1 = df1[cnames1[i]]
      df_kurt_cells1 += [kurtosis(v_df1)]
   
  for i in range(colN2):
     v_df2 = df2[cnames2[i]]
     df_kurt_cells2 += [kurtosis(v_df2)]

  #First graph 
  ax[0].hist(df_kurt_cells1,bins=100)
  ax[0].set_title("Kurtosis of single cells for " + title1)

  #Second graph 
  ax[1].hist(df_kurt_cells2,bins=100)
  ax[1].set_title("Kurtosis of single cells for " + title2)
  
kurt(df_HCC_s_uf, df_MCF_s_uf, title1="HCC1806", title2="MCF7")

In [None]:
#Entropy
from scipy.stats import entropy

def entro(df1, df2, title1 = '', title2 = ''):
    figure, ax = plt.subplots(1, 2, figsize=(12,6))

    #Entropy of each cell (column); scipy normalises the counts to probabilities
    df_entropy_cells1 = [entropy(df1[c]) for c in df1.columns]
    df_entropy_cells2 = [entropy(df2[c]) for c in df2.columns]

    #First graph
    ax[0].hist(df_entropy_cells1,bins=100)
    ax[0].set_title("Entropy of single cells for " + title1)

    #Second graph
    ax[1].hist(df_entropy_cells2,bins=100)
    ax[1].set_title("Entropy of single cells for " + title2)

entro(df_HCC_s_uf, df_MCF_s_uf, title1="HCC1806", title2="MCF7")

The distributions are highly non-normal: skewed, with heavy tails. Why is this a problem?
It is a problem because many statistical tests and models assume that the data follows a normal distribution. If the data is non-normal, results and estimates can be incorrect. We can mitigate this problem through data transformation.

Data Transformation:

In [None]:
#Log transformation: log2(x + 1) maps zeros to zero and compresses large counts
def transform_log2(df):
    return np.log2(df + 1)

#Distribution of the log-transformed counts of a single cell (the second column)
log_MCF_cell = transform_log2(df_MCF_s_uf.iloc[:, 1])
sns.boxplot(x=log_MCF_cell)
sns.violinplot(x=log_MCF_cell)
plt.show()

log_HCC_cell = transform_log2(df_HCC_s_uf.iloc[:, 1])
sns.boxplot(x=log_HCC_cell)
sns.violinplot(x=log_HCC_cell)
plt.show()

In [None]:
#Log-transformed counts of the first 50 cells
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=np.log2(df_MCF_s_uf.iloc[:,:50]+1),palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()

plt.figure(figsize=(16,4))
plot=sns.violinplot(data=np.log2(df_HCC_s_uf.iloc[:,:50]+1),palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
plt.show()
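
The effect of the transformation can also be checked numerically by comparing the skewness before and after. The counts below are synthetic (a zero-inflated negative binomial sample, chosen only for illustration), so the sketch shows the direction of the effect and not values from our data.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# Synthetic zero-inflated counts standing in for one cell (not real data):
# about 60% of the genes are set to zero, the rest have a long right tail
expressed = rng.random(5000) < 0.4
counts = rng.negative_binomial(1, 0.01, size=5000) * expressed

skew_raw = skew(counts)
skew_log = skew(np.log2(counts + 1))

print("Skewness of raw counts:", skew_raw)
print("Skewness after log2(x + 1):", skew_log)
```

The log transformation shrinks the long right tail, but the spike at zero remains, so the transformed values are still not normally distributed.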

Normalizing:

In [None]:
def graph_normalization(df):
    df_small = df.iloc[:, 10:30]  #just selecting part of the samples so run time not too long
    sns.displot(data=df_small,palette="Set3",kind="kde", bw_adjust=2)
    plt.show()

graph_normalization(df_MCF_s_uf)
graph_normalization(df_MCF_s_f_n_train)

graph_normalization(df_HCC_s_uf)
graph_normalization(df_HCC_s_f_n_train)

Duplicate rows:

In [None]:
def duplicate_rows(df, all_cells = False, shape = False):
    if shape:
        print("Number of duplicate rows: ", df[df.duplicated(keep=False)].shape)
    if all_cells:
        print("Duplicate rows: ", df[df.duplicated(keep=False)])
    return df[df.duplicated(keep=False)]

duplicate_rows(df_MCF_s_uf)
duplicate_rows(df_MCF_s_uf, True, True)

In [None]:
#Check where the duplicates are:

duplicate_rows_df_MCF_t = duplicate_rows(df_MCF_s_uf).T
c_dupl_MCF = duplicate_rows_df_MCF_t.corr()
c_dupl_MCF

duplicate_rows_df_HCC_t = duplicate_rows(df_HCC_s_uf).T
c_dupl_HCC = duplicate_rows_df_HCC_t.corr()
c_dupl_HCC

In [None]:
# warning: the scatter plots below might take a long time if the number of duplicate features is large
# sns.pairplot(duplicate_rows_df_MCF_t)
# sns.pairplot(duplicate_rows_df_HCC_t)

We can look at the statistics of the gene expression profiles of the genes/features that appear to be duplicates. They might be features with many zeros or with much missing data.

In [None]:
duplicate_rows_df_MCF_t.describe()
duplicate_rows_df_HCC_t.describe()

In [None]:
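#Dropping duplicates

df_MCF_noDup = df_MCF_s_uf.drop_duplicates()
df_HCC_noDup = df_HCC_s_uf.drop_duplicates()

print("MCF7 shape before and after dropping duplicates: ", df_MCF_s_uf.shape, df_MCF_noDup.shape)
print("HCC1806 shape before and after dropping duplicates: ", df_HCC_s_uf.shape, df_HCC_noDup.shape)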
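
To test the idea that duplicated rows are mostly genes with many zeros, one can measure which share of the duplicated rows consists only of zeros. A minimal sketch on an invented table:

```python
import pandas as pd

# Toy count matrix: gene1/gene2 are identical all-zero rows, gene3/gene4 identical non-zero rows
toy = pd.DataFrame(
    {"cell_A": [0, 0, 4, 4, 1], "cell_B": [0, 0, 2, 2, 0]},
    index=["gene1", "gene2", "gene3", "gene4", "gene5"],
)

dup = toy[toy.duplicated(keep=False)]          # every row that has an identical twin
all_zero_dup = dup[(dup == 0).all(axis=1)]     # duplicates that consist only of zeros
share_all_zero = len(all_zero_dup) / len(dup)

print("Duplicated rows:", dup.index.tolist())
print(f"Share of duplicated rows that are all zero: {share_all_zero:.0%}")
```

If most duplicates in the real data turn out to be all-zero rows, dropping them loses no information.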

Data Structure:

In [None]:
plt.figure(figsize=(10,5))

c_MCF = df_MCF_s_uf.corr()
#Midpoint between the smallest and the largest correlation (not the mean)
midpoint = (c_MCF.values.max() - c_MCF.values.min()) /2 + c_MCF.values.min()
# sns.heatmap(c,cmap='coolwarm',annot=True, center=midpoint )
# plt.show()
sns.heatmap(c_MCF,cmap='coolwarm', center=0 )
plt.show()
print("Number of cells included: ", c_MCF.shape[0])
print("Midpoint of the correlation range between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c_MCF.values.min())


plt.figure(figsize=(10,5))

c_HCC= df_HCC_s_uf.corr()
midpoint = (c_HCC.values.max() - c_HCC.values.min()) /2 + c_HCC.values.min()
#sns.heatmap(c,cmap='coolwarm',annot=True, center=midpoint )
#plt.show()
sns.heatmap(c_HCC,cmap='coolwarm', center=0 )
plt.show()
print("Number of cells included: ", c_HCC.shape[0])
print("Midpoint of the correlation range between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c_HCC.values.min())
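
A mean pairwise correlation, as opposed to the midpoint between the smallest and the largest value, can be obtained by averaging the entries above the diagonal of the correlation matrix. The helper name `mean_offdiag_corr` and the toy table are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def mean_offdiag_corr(df):
    """Mean Pearson correlation over all distinct pairs of columns (cells)."""
    c = df.corr().to_numpy()
    upper = np.triu_indices_from(c, k=1)   # indices strictly above the diagonal
    return c[upper].mean()

toy = pd.DataFrame({
    "cell_A": [1.0, 2.0, 3.0, 4.0],
    "cell_B": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with cell_A
    "cell_C": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with cell_A
})

avg_corr = mean_offdiag_corr(toy)
print("Mean pairwise correlation:", avg_corr)
```

Here the three pairs have correlations 1, -1 and -1, so the mean is -1/3, whereas the midpoint of the range would be 0. Excluding the diagonal matters because every cell is perfectly correlated with itself.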

We could look at the distribution of the correlation between gene expression profiles using a histogram.

In [None]:

c_small = c_MCF.iloc[:,:3]
print(c_small)
sns.histplot(c_small,bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cell expression profiles')
plt.show()

c_small = c_HCC.iloc[:,:3]
sns.histplot(c_small,bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cell expression profiles')
plt.show()

We expect the correlation between the gene expression profiles of the single cells to be fairly high.

Some genes will be characteristic of some cells. For example, in our case we expect some genes to be expressed at high levels only in cells cultured in conditions of low oxygen (hypoxia), or vice versa. However, most of the genes with low or high expression will tend to behave similarly across cells. Several genes will have high expression across cells because they are housekeeping genes needed for the basic functioning of the cell. Other genes are less essential, or not essential, for normal functioning, so they will have low or no expression across cells and will only be expressed in specific circumstances.

Are there some cells which are not correlated with the others?

Can you explore the distributions of gene expression for these cells and check why? Do they have more zero values than other cells?

Or do they have higher values?
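
One possible way to approach these questions is to put, for every cell, its mean correlation with the other cells next to its percentage of zero entries. The sketch below uses a hypothetical helper (`cell_summary`) and an invented table in which one cell is mostly zeros; it assumes no cell has a constant profile, since the correlation would then be undefined.

```python
import pandas as pd

def cell_summary(df):
    """Per cell: mean correlation with the other cells and percentage of zero entries."""
    c = df.corr()
    n = c.shape[0]
    mean_corr = (c.sum(axis=1) - 1.0) / (n - 1)   # drop the self-correlation of 1
    pct_zero = (df == 0).mean(axis=0) * 100
    return pd.DataFrame({"mean_corr": mean_corr, "pct_zero": pct_zero})

# Toy count matrix: 6 genes (rows) x 3 cells (columns)
toy = pd.DataFrame({
    "cell_A": [10, 0, 5, 8, 0, 3],
    "cell_B": [12, 0, 6, 9, 1, 2],
    "cell_C": [0, 0, 0, 0, 7, 0],   # mostly zeros, unlike the other two
})

# Least correlated cells first
summary = cell_summary(toy).sort_values("mean_corr")
print(summary)
```

In the toy example the least correlated cell is also the one with the most zeros; whether the same holds for the real cells is exactly what the questions above ask us to check.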

Next you could explore the features/genes. Are they correlated? Is this expected? Could this generate issues in the ML?

Repeat the steps above for all datasets, and discuss the findings.

Model: