In [1]:
using Pkg; Pkg.activate(joinpath(Pkg.devdir(), "MLCourse"))
using CSV, DataFrames, Plots, OpenML, MLJ, Serialization

[32m[1m  Activating[22m[39m project at `~/.julia/dev/MLCourse`


# Raw Data Visualisation

Let's start by looking at the raw training data:

In [3]:
train_data = CSV.read("DATA/train.csv", DataFrame) 

Unnamed: 0_level_0,Xkr4,Gm1992,Gm19938,Gm37381,Rp1,Sox17,Gm37587,Gm37323,Mrpl15
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,2.19038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.86198,0.0,0.506726,0.0,0.0,0.0,0.0,0.0,0.0
3,2.76676,0.0,0.629614,0.0,0.0,0.0,0.0,0.0,0.0
4,2.14643,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2.84005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1.79552,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.96702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3.02083,0.0,0.374098,0.0,0.0,0.0,0.0,0.0,0.0
10,1.97939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
count(train_data.labels .== "KAT5") # number of cells labelled "KAT5"

1556

In [5]:
count(train_data.labels .== "eGFP") # number of cells labelled "eGFP"

1592

In [6]:

count(train_data.labels .== "CBP") # number of cells labelled "CBP"

1852

These numbers add up to 5000 cells, split roughly 31% KAT5, 32% eGFP and 37% CBP: all 3 categories are well represented in the training data set.
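The balance between the categories can be made explicit by turning the three counts into percentages. The short sketch below only reuses the counts printed above; the variable names are ours.

```julia
# Counts taken from the three `count(...)` cells above.
class_counts = ["KAT5" => 1556, "eGFP" => 1592, "CBP" => 1852]

# Total number of labelled cells in the training set.
total = sum(last, class_counts)

# Print each class with its share of the training set.
for (label, n) in class_counts
    println(label, ": ", n, " cells (", round(100 * n / total; digits = 1), " %)")
end
```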

In [7]:
describe(train_data) # summary statistics for each predictor

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,DataType
1,Xkr4,2.41064,0.0,2.51936,4.26052,0,Float64
2,Gm1992,0.0232695,0.0,0.0,1.8822,0,Float64
3,Gm19938,0.214613,0.0,0.0,1.8791,0,Float64
4,Gm37381,0.000976904,0.0,0.0,1.20868,0,Float64
5,Rp1,0.00707406,0.0,0.0,1.34236,0,Float64
6,Sox17,0.000259006,0.0,0.0,1.29503,0,Float64
7,Gm37587,0.0,0.0,0.0,0.0,0,Float64
8,Gm37323,0.000239274,0.0,0.0,0.721071,0,Float64
9,Mrpl15,0.133054,0.0,0.0,2.3545,0,Float64
10,Lypla1,0.158978,0.0,0.0,2.5327,0,Float64


We can see that some genes have a mean, a minimal and a maximal expression of 0 (for example Gm37587, row 7). Such constant predictors carry no information for the model, so some data cleaning needs to be done.
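As an illustration of how such constant predictors can be spotted, here is a minimal sketch on a made-up table. The gene names, the values and the helper `is_constant` are hypothetical; on the real data the same test could be applied to each column of the DataFrame.

```julia
# Toy expression table: gene name => expression per cell (made-up values).
toy = Dict(
    "geneA" => [2.1, 0.0, 1.7, 3.0],
    "geneB" => [0.0, 0.0, 0.0, 0.0],   # never expressed: constant column
    "geneC" => [0.0, 0.5, 0.0, 0.0],
)

# A column is constant when its minimum equals its maximum.
is_constant(v) = minimum(v) == maximum(v)

# Collect the names of all constant columns.
constant_genes = sort([g for (g, v) in toy if is_constant(v)])
println(constant_genes)
```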

# Data Treatment

## 1- Removal of constant and correlated predictors

In Data_treatment.jl:
- we tested for missing data: none found
- then we removed:
    - the constant predictors
    - the correlated predictors
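The content of Data_treatment.jl is not reproduced here. As a rough idea of what the missing-data check and the removal of correlated predictors can look like, here is a sketch on a made-up matrix. The gene names, the values and the correlation cutoff of 0.999 are all assumptions, not taken from the actual script.

```julia
using Statistics

# Toy predictor matrix: rows = cells, columns = genes (made-up values).
gene_names = ["geneA", "geneB", "geneC"]   # hypothetical names
X = [1.0 2.0 0.5;
     2.0 4.0 0.1;
     3.0 6.0 0.9;
     4.0 8.0 0.3]

# 1. Missing-data check.
has_missing = any(ismissing, X)

# 2. Pairwise Pearson correlations between columns.
C = cor(X)

# 3. Keep the first column of every highly correlated pair.
threshold = 0.999                 # assumed cutoff, not taken from the source
keep = trues(size(X, 2))
for j in 2:size(X, 2), i in 1:j-1
    if keep[i] && abs(C[i, j]) > threshold
        keep[j] = false
    end
end

println(gene_names[keep])
```

Here geneB is exactly twice geneA, so it is dropped and geneA is kept. Note that constant columns have to be removed first, since their correlation with other columns is undefined.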


In [2]:
train_data_treated = CSV.read("DATA/trainX.csv", DataFrame)

Unnamed: 0_level_0,Xkr4,Gm1992,Gm19938,Gm37381,Rp1,Sox17,Gm37323,Mrpl15,Lypla1
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,2.19038,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2.86198,0.0,0.506726,0.0,0.0,0.0,0.0,0.0,0.0
3,2.76676,0.0,0.629614,0.0,0.0,0.0,0.0,0.0,0.0
4,2.14643,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2.84005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1.79552,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1.96702,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,3.02083,0.0,0.374098,0.0,0.0,0.0,0.0,0.0,0.0
10,1.97939,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
describe(train_data_treated)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Float64,Float64,Float64,Float64,Int64,DataType
1,Xkr4,2.41064,0.0,2.51936,4.26052,0,Float64
2,Gm1992,0.0232695,0.0,0.0,1.8822,0,Float64
3,Gm19938,0.214613,0.0,0.0,1.8791,0,Float64
4,Gm37381,0.000976904,0.0,0.0,1.20868,0,Float64
5,Rp1,0.00707406,0.0,0.0,1.34236,0,Float64
6,Sox17,0.000259006,0.0,0.0,1.29503,0,Float64
7,Gm37323,0.000239274,0.0,0.0,0.721071,0,Float64
8,Mrpl15,0.133054,0.0,0.0,2.3545,0,Float64
9,Lypla1,0.158978,0.0,0.0,2.5327,0,Float64
10,Tcea1,0.334586,0.0,0.0,2.71235,0,Float64


All the predictors with a mean of 0 are gone.
However, there might still be outliers in our data set. For example, for Gm37323 (row 7) the mean is 0.000239274 but the maximal value is 0.721071.
 

In [10]:
using Statistics
t = train_data_treated.Gm37323 # all data points in the column "Gm37323"
u = quantile(t, 0.99) # 0.99 quantile of the expression values
a = t[t .> u] # expression values of the cells above the 0.99 quantile for Gm37323

0.012212418396005238

Thus, 2 data points are above the 0.99 quantile.

In [13]:
average = (a[1]+a[2])/5000 # sum of the two values divided by the total number of cells (5000)

0.000239274035926351

The mean observed in column Gm37323 (see `describe` above) is entirely due to these 2 observations: only 2 of the 5000 cells express this gene.
The data needs further treatment.
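A more direct way to check how many cells express a gene is to count the non-zero entries of its column. The sketch below uses a made-up vector of 5000 cells with two non-zero values standing in for a column such as Gm37323; the values themselves are hypothetical.

```julia
using Statistics

# Toy column standing in for one gene: 5000 cells, only two of them non-zero.
# The two values are made up for illustration.
t = zeros(5000)
t[10] = 0.7
t[200] = 0.5

# Number of cells that express the gene at all.
n_expressing = count(!iszero, t)

# Contribution of the expressing cells to the column mean.
share_of_mean = sum(t[t .> 0]) / length(t)

println(n_expressing)
println(share_of_mean ≈ mean(t))   # the whole mean comes from those cells
```

On the real data, `t` would be replaced by the column of interest, and the same count could be repeated over all columns to find genes expressed in only a handful of cells.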

## 2- Principal Component Analysis

To further treat our data, we decided to perform a principal component analysis (PCA).

![img](PLOTS/PCA_2D.png)

![Image](PLOTS/PCA_2D_sd.png)

![Image](PLOTS/TNse.png)

![Image](PLOTS/Pvar_explained.png)

![Image](PLOTS/Biplot_PCA_standardized_1.png)
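For reference, the two quantities behind this kind of plot, the coordinates of each observation on the principal components and the proportion of variance explained by each component, can be computed with a singular value decomposition. This is a minimal sketch on made-up data using only the standard library. It is not necessarily how the plots above were produced, which may rely on a library implementation of PCA instead.

```julia
using LinearAlgebra, Statistics

# Toy data: 5 observations (rows) x 2 predictors (columns); values are made up.
X = [2.5 2.4;
     0.5 0.7;
     2.2 2.9;
     1.9 2.2;
     3.1 3.0]

# Centre each column; dividing by the column standard deviation as well
# would give the standardized variant.
Xc = X .- mean(X, dims = 1)

F = svd(Xc)                            # Xc = U * Diagonal(S) * V'
scores = F.U * Diagonal(F.S)           # coordinates of each observation on the PCs
explained = F.S .^ 2 ./ sum(F.S .^ 2)  # proportion of variance per component

println(round.(explained; digits = 3))
```

The columns of `F.V` are the principal directions (loadings), and the singular values are returned in decreasing order, so the first component always explains the largest share of the variance.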

# Looking at Clusters