 # Intro
 - **Project:** Meta Machine Learning
 - **Author:** Eduardo Salmerón Castaño
 - **Date Notebook Started:** 02.03.2020
 - **Quick Description:** Notebook version of Meta ML.

---
### Quick Description: 
**Problem:** Obtain a Meta-ML model with one source.

**Solution:** First prepare data for ML, then get level 1 experts and finally obtain the Meta-ML model. 

### Motivation/Background:
You want to get a Meta-ML model.

### Thoughts for Future Development of Code in this Notebook: 
- Select which algorithms you want to use for ML and Meta-ML.
- Improve the accuracy.
- Use GenoML functionality.

## Imports
All imports needed here

In [7]:
import fromGenoToMLData as fm
import genoMML as gm
import pickle
import os
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri

## MML_1SmE function
Main function that applies the 1SmE Meta-Learning pipeline and evaluates the final model on the test cohorts.

Parameters:

&emsp;__workPath__: String with your workpath<br>
&emsp;__path2Data__: Path to the folder that contains the PLINK files<br>
&emsp;__path2packages__: Path to the folder that contains PLINK, PRSice and the GWAS<br>
&emsp;__genoTrain__: String with the name of the cohort used for training<br>
&emsp;__covsTrain__: String with the name of the covariates used for training<br>
&emsp;__genoTest__: List with the names of the geno files for the cohorts used for testing<br>
&emsp;__covsTest__: List with the names of the covs files for the cohorts used for testing<br>
&emsp;__path2PCA__: Path to the set with the PCA applied to the cohorts<br>

Return value:

&emsp;None. The final results for each test cohort are saved with pickle to `finalResults.pydat` in that cohort's folder under `workPath`.

In [8]:
def MML_1SmE(workPath,
            path2Data,
            path2packages,
            genoTrain,
            covsTrain,
            genoTest,
            covsTest,
            path2PCA):
    # Read PCAs
    #pandas2ri.activate()

    readRDS = robjects.r['readRDS']
    pcaSET = readRDS(path2PCA)
    
    lworkPath = workPath + "dataRepo/"
    # Create the data repository folder (and any missing parents) if needed
    os.makedirs(lworkPath, exist_ok=True)

    # Create handler
    if (os.path.isfile(lworkPath + "mldatahandler.pydat")): # If the handler already exists, read it
        with open(lworkPath + 'mldatahandler.pydat', 'rb') as mldatahandler_file:
            handlerML = pickle.load(mldatahandler_file)
        print(lworkPath + 'mldatahandler.pydat' + " already exists.")
    else:
        handlerML = fm.fromGenoToMLdata(lworkPath,
                                        path2Geno=path2Data + "/" + genoTrain,
                                        path2Covs=path2Data + "/" + covsTrain,
                                        predictor="DISEASE",
                                        path2GWAS=path2packages,  
                                        path2PRSice=path2packages,  
                                        path2plink=path2packages,
                                        iter=1)
    
    # Generate k folds and obtain their level 1 experts
    experts_L1 = gm.genModels_1SmE(workPath=lworkPath,
                                   handlerMLdata=handlerML,
                                   k=5)


    # Obtain level 2 expert
    expert_L2 = gm.trainAndTestMML_1SmE(experts_L1,
                                        handlerML,
                                        lworkPath)

    # Test each cohort
    for geno, covs in zip(genoTest, covsTest):
        print()
        print("#" * 30)
        print("Final Test with " + geno + "\n")

        lworkPath = workPath + "/" + covs + "/"
        # Create the cohort folder (and any missing parents) if needed
        os.makedirs(lworkPath, exist_ok=True)

        # generate mldata for the test repository
        handlerTest = gm.prepareFinalTest(workPath=lworkPath,
                                          path2Geno=path2Data + "/" + geno,
                                          path2Covs=path2Data + "/" + covs,
                                          predictor="PHENO_PLINK",
                                          snpsToPull=handlerML["snpsToPull"])

        with open(lworkPath + "/handler- " + geno + ".pydat", 'wb') as handlerTest_file:
            pickle.dump(handlerTest, handlerTest_file)

        # obtain final evaluation results
        finalResult = gm.finalTest_1SmE(lworkPath,
                                        expert_L2,
                                        handlerTest,
                                        experts_L1)

        with open(lworkPath + "/finalResults.pydat", 'wb') as finalResult_file:
            pickle.dump(finalResult, finalResult_file)


# Example of the functionality
In this section we walk through the function above step by step to show: 
 - An example of the parameters we can use.
 - How we process the genotype data to obtain the dataforML file.
 - How we obtain the level 1 experts from this processed data.
 - How we obtain the final model from the level 1 experts.
 - How we test the final model.

## Parameters

The parameters we are using are:
 - genoTrain: We use the Spanish cohort for training; "SPANISH" is the file name.
 - covsTrain: The covariates file of the Spanish cohort.
 - path2Data: The path to the data files we are using.
 - path2packages: The path to the packages we use (e.g. plink).
 - workPath: The path where all files will be saved.
 - genoTest: Names of the cohorts used to test the final model. We are using the PPMI, HBS and PDBP cohorts.
 - covsTest: Covariates files of the cohorts used to test the final model.
 - path2PCA: The PCA file we will use in our training.
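
Since every later step depends on these paths, a quick check that the input files exist can save a failed run. The following is a minimal sketch, not part of the Meta-ML code: `missing_inputs` is a hypothetical helper, and it assumes that each geno name is a PLINK binary fileset prefix (`.bed`, `.bim`, `.fam`) and that each covariates file has a `.cov` extension.

```python
import os


def missing_inputs(path2Data, genos, covs):
    """Return the expected input files that are not present on disk."""
    expected = []
    for geno in genos:
        # PLINK's --bfile option expects a .bed/.bim/.fam triplet per prefix
        for ext in (".bed", ".bim", ".fam"):
            expected.append(os.path.join(path2Data, geno + ext))
    for cov in covs:
        # Assumption: covariates files carry a .cov extension
        expected.append(os.path.join(path2Data, cov + ".cov"))
    return [path for path in expected if not os.path.isfile(path)]
```

With the parameters of this notebook, the call would be `missing_inputs(path2Data, [genoTrain] + genoTest, [covsTrain] + covsTest)`; an empty list means every expected file was found.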

In [9]:
genoTrain = "SPANISH" 
covsTrain = "COVS_SPANISH"
path2Data = "/home/edusal/data/FINALES"
path2packages = "/home/edusal/packages/"
workPath = "/home/edusal/MML-1SmE-4/"
genoTest = ["PPMI", "HBS", "PDBP"]
covsTest = ["COVS_" + cov for cov in genoTest]
path2PCA = '/home/edusal/OBTAIN_PCA/pcaSET.rds'

## Obtain dataforML

The first step is to process our geno file so we can use it in the training process. We do this with the function `fromGenoToMLdata` from fromGenoToMLData.py (there is a notebook for this file that explains every function). This function reads our files, splits the data (75-25 by default), selects the most relevant SNPs (we use plink) and finally creates the dataforML file. The predictor of the Spanish cohort is called "DISEASE", so we have to pass it in the parameter "predictor".

If the dataforML file has already been created, we skip this step so we don't waste time.
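
The skip-if-already-created idea can be sketched as a small caching helper. This is only an illustration: `load_or_build` is a hypothetical name, and unlike the cell below it also writes the pickle itself, because this notebook does not show whether `fromGenoToMLdata` saves `mldatahandler.pydat` on its own.

```python
import os
import pickle


def load_or_build(cache_path, build):
    """Return the pickled object at cache_path, building and saving it if absent."""
    if os.path.isfile(cache_path):
        # Expensive step already done: reuse the stored result
        with open(cache_path, "rb") as cache_file:
            return pickle.load(cache_file)
    result = build()
    with open(cache_path, "wb") as cache_file:
        pickle.dump(result, cache_file)
    return result
```

Here `build` is any zero-argument callable, for example a `lambda` wrapping the `fm.fromGenoToMLdata(...)` call.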

In [10]:
readRDS = robjects.r['readRDS']
pcaSET = readRDS(path2PCA)

lworkPath = workPath + "dataRepo/"
# Create the data repository folder (and any missing parents) if needed
os.makedirs(lworkPath, exist_ok=True)

# Create handler
if (os.path.isfile(lworkPath + "mldatahandler.pydat")): # If the handler already exists, read it
    with open(lworkPath + 'mldatahandler.pydat', 'rb') as mldatahandler_file:
        handlerML = pickle.load(mldatahandler_file)
    print(lworkPath + 'mldatahandler.pydat' + " already exists.")
else:
    handlerML = fm.fromGenoToMLdata(lworkPath,
                                    path2Geno=path2Data + "/" + genoTrain,
                                    path2Covs=path2Data + "/" + covsTrain,
                                    predictor="DISEASE",
                                    path2GWAS=path2packages,  
                                    path2PRSice=path2packages,  
                                    path2plink=path2packages,
                                    iter=1)



1
trainFoldFiles
Creating genotype data for fold 1 and train with command /home/edusal/packages/plink --bfile /home/edusal/data/FINALES/SPANISH --keep /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//COVS_SPANISHtrain.ids --make-bed --out /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//SPANISHtrain.1

1
testFoldFiles
Creating genotype data for fold 1 and test with command /home/edusal/packages/plink --bfile /home/edusal/data/FINALES/SPANISH --keep /home/edusal/MML-1SmE-4/dataRepo/Foldtest1//COVS_SPANISHtest.ids --make-bed --out /home/edusal/MML-1SmE-4/dataRepo/Foldtest1//SPANISHtest.1

1
trainFoldFiles
/home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//SPANISHtrain.1 already created, skipping

1
testFoldFiles
/home/edusal/MML-1SmE-4/dataRepo/Foldtest1//SPANISHtest.1 already created, skipping

The covstr is --cov-file /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//COVS_train1.cov 
(5710, 24)
The covstr is --cov-file /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//COVS_train1.cov 
The command to run: Rscr

## Obtain level 1 experts

Once the data is prepared, we can obtain the level 1 experts with the function `genModels_1SmE` from genoMML.py. This function takes the training partition (75%) and trains with cross-validation (5 folds, as set by the parameter `k`). Training uses several algorithms, three in our case (see the GenoMML notebook). The result is one expert per algorithm; we will use these models in the Meta-ML training.
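
To make the fold structure concrete, here is a generic sketch of how k-fold cross-validation partitions the sample indices. It is not the code inside `genModels_1SmE` (see the GenoMML notebook for that); `k_fold_indices` is a hypothetical helper, and it shuffles without stratifying by class.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        # Train on every fold except the one held out for validation
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each algorithm would be trained once per `(train, validation)` pair, which is why the output below reports five folds per algorithm.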

In [11]:
# Generate k folds and obtain their level 1 experts
experts_L1 = gm.genModels_1SmE(workPath=lworkPath,
                               handlerMLdata=handlerML,
                               k=5)

[1 2]
[]

##############################
RandomForestClassifier


	##########################
	Fold 1





		AUC: 53.7116%
		Accuracy: 54.8656%
		Balanced Accuracy: 52.6616%
		Kappa: 5.3930%
		Log Loss: 0.8318
		Runtime in seconds: 0.1729

	##########################
	Fold 2

		AUC: 52.6127%
		Accuracy: 54.4022%
		Balanced Accuracy: 52.0490%
		Kappa: 4.1273%
		Log Loss: 0.7473
		Runtime in seconds: 0.1633

	##########################
	Fold 3

		AUC: 55.6702%
		Accuracy: 55.2363%
		Balanced Accuracy: 53.5188%
		Kappa: 6.9208%
		Log Loss: 0.7201
		Runtime in seconds: 0.1591

	##########################
	Fold 4

		AUC: 50.7346%
		Accuracy: 53.7106%
		Balanced Accuracy: 51.2984%
		Kappa: 2.6460%
		Log Loss: 0.8487
		Runtime in seconds: 0.1578

	##########################
	Fold 5

		AUC: 54.3138%
		Accuracy: 56.5863%
		Balanced Accuracy: 54.5729%
		Kappa: 9.2073%
		Log Loss: 0.7711
		Runtime in seconds: 0.169

##############################
LogisticRegression


	##########################
	Fold 1





		AUC: 59.9201%
		Accuracy: 60.7970%
		Balanced Accuracy: 57.7472%
		Kappa: 16.0392%
		Log Loss: 0.7475
		Runtime in seconds: 0.3659

	##########################
	Fold 2





		AUC: 63.6805%
		Accuracy: 63.7627%
		Balanced Accuracy: 60.5315%
		Kappa: 21.7793%
		Log Loss: 0.696
		Runtime in seconds: 0.4138

	##########################
	Fold 3





		AUC: 61.3824%
		Accuracy: 61.3531%
		Balanced Accuracy: 58.2988%
		Kappa: 16.8824%
		Log Loss: 0.7116
		Runtime in seconds: 0.3786

	##########################
	Fold 4





		AUC: 59.2243%
		Accuracy: 58.5343%
		Balanced Accuracy: 55.9731%
		Kappa: 12.2479%
		Log Loss: 0.748
		Runtime in seconds: 0.3975

	##########################
	Fold 5





		AUC: 62.1505%
		Accuracy: 60.5751%
		Balanced Accuracy: 59.0692%
		Kappa: 18.1317%
		Log Loss: 0.7235
		Runtime in seconds: 0.4139

##############################
DecisionTreeClassifier


	##########################
	Fold 1

		AUC: 52.9015%
		Accuracy: 54.0315%
		Balanced Accuracy: 52.9015%
		Kappa: 5.7506%
		Log Loss: 15.88
		Runtime in seconds: 0.4455

	##########################
	Fold 2

		AUC: 54.3427%
		Accuracy: 56.1631%
		Balanced Accuracy: 54.3427%
		Kappa: 8.6681%
		Log Loss: 15.14
		Runtime in seconds: 0.4572

	##########################
	Fold 3

		AUC: 55.7359%
		Accuracy: 57.1826%
		Balanced Accuracy: 55.7359%
		Kappa: 11.2462%
		Log Loss: 14.79
		Runtime in seconds: 0.4548

	##########################
	Fold 4

		AUC: 53.4146%
		Accuracy: 54.7310%
		Balanced Accuracy: 53.4146%
		Kappa: 6.8108%
		Log Loss: 15.64
		Runtime in seconds: 0.4509

	##########################
	Fold 5

		AUC: 54.9109%
		Accuracy: 56.6790%
		Balanced Accuracy: 54.9109%
		Kappa: 9.8401%
		Log Loss: 

The output shows, for each algorithm and fold, some of the metrics obtained during training, such as AUC, accuracy and kappa.
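
For reference, three of the reported metrics can be computed from the binary confusion counts alone. The sketch below uses the standard definitions of accuracy, balanced accuracy and Cohen's kappa; `binary_metrics` is a hypothetical helper, not the function the library uses, and it assumes 0/1 labels with both classes present and predictions that are not perfectly explained by chance (otherwise a division by zero occurs).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2

    # Agreement expected by chance, from the marginal label frequencies
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy,
            "balanced_accuracy": balanced_accuracy,
            "kappa": kappa}
```

A kappa near zero, as in several folds above, means the predictions agree with the labels barely more than chance would.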

## Obtain final model
Now we can use the models generated in the previous step for Meta-ML with the function `trainAndTestMML_1SmE`. For Meta-ML we use the test data (25%) and the level 1 experts' predictions for this data. We train with the same algorithms used in ML (a future version will add a parameter to choose the algorithms) and again use cross-validation with 5 folds, which gives us a Meta-ML model.
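
The idea of training on the experts' predictions can be illustrated with a toy example. Everything here is an assumption made for illustration: the experts are plain callables, and the meta-model is a simple lookup table rather than the algorithms that `trainAndTestMML_1SmE` trains.

```python
from collections import Counter, defaultdict


def build_meta_features(experts, samples):
    """One row per sample, one column per level 1 expert prediction."""
    return [tuple(expert(sample) for expert in experts) for sample in samples]


def fit_meta_model(meta_features, labels):
    """Toy meta-model: most frequent label seen for each prediction pattern."""
    votes = defaultdict(Counter)
    for row, label in zip(meta_features, labels):
        votes[row][label] += 1
    table = {row: counts.most_common(1)[0][0] for row, counts in votes.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen patterns
    return lambda row: table.get(row, default)


# Hypothetical level 1 experts acting on two numeric features
experts = [lambda x: int(x[0] > 0.5),
           lambda x: int(x[1] > 0.5),
           lambda x: int(x[0] + x[1] > 1.0)]
held_out = [(0.9, 0.8), (0.2, 0.1), (0.7, 0.2), (0.1, 0.8)]
held_out_labels = [1, 0, 1, 0]

meta_model = fit_meta_model(build_meta_features(experts, held_out), held_out_labels)
new_pattern = build_meta_features(experts, [(0.8, 0.1)])[0]
print(new_pattern, meta_model(new_pattern))
```

In this toy data the labels follow the first expert, so the meta-model learns to trust it even when the other two disagree.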

In [12]:
# Obtain level 2 expert
expert_L2 = gm.trainAndTestMML_1SmE(experts_L1,
                                    handlerML,
                                    lworkPath)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)



##############################
RandomForestClassifier


	##########################
	Fold 1

		AUC: 44.0851%
		Accuracy: 46.0674%
		Balanced Accuracy: 44.4789%
		Kappa: -11.0123%
		Log Loss: 0.8817
		Runtime in seconds: 0.03797

	##########################
	Fold 2

		AUC: 54.7297%
		Accuracy: 52.2556%
		Balanced Accuracy: 51.0822%
		Kappa: 2.1889%
		Log Loss: 0.7129
		Runtime in seconds: 0.03762

	##########################
	Fold 3

		AUC: 51.7204%
		Accuracy: 55.2632%
		Balanced Accuracy: 53.9407%
		Kappa: 7.8915%
		Log Loss: 0.7288
		Runtime in seconds: 0.03752

	##########################
	Fold 4

		AUC: 53.2566%
		Accuracy: 53.3835%
		Balanced Accuracy: 51.4771%
		Kappa: 3.0168%
		Log Loss: 0.8388
		Runtime in seconds: 0.03697

	##########################
	Fold 5

		AUC: 51.8313%
		Accuracy: 54.5113%
		Balanced Accuracy: 53.4157%
		Kappa: 6.5827%
		Log Loss: 0.7284
		Runtime in seconds: 0.03736

##############################
LogisticRegression


	##########################
	Fold 1



		AUC: 58.9288%
		Accuracy: 56.1798%
		Balanced Accuracy: 54.8466%
		Kappa: 9.6799%
		Log Loss: 1.977
		Runtime in seconds: 0.08196

	##########################
	Fold 2

		AUC: 58.6635%
		Accuracy: 57.1429%
		Balanced Accuracy: 56.8484%
		Kappa: 13.6265%
		Log Loss: 1.634
		Runtime in seconds: 0.09474

	##########################
	Fold 3

		AUC: 61.9529%
		Accuracy: 57.8947%
		Balanced Accuracy: 57.9890%
		Kappa: 15.5795%
		Log Loss: 1.837
		Runtime in seconds: 0.07975

	##########################
	Fold 4





		AUC: 60.0691%
		Accuracy: 56.7669%
		Balanced Accuracy: 55.1828%
		Kappa: 10.5294%
		Log Loss: 2.072
		Runtime in seconds: 0.09712

	##########################
	Fold 5

		AUC: 67.2229%
		Accuracy: 62.0301%
		Balanced Accuracy: 61.4277%
		Kappa: 22.0236%
		Log Loss: 1.437
		Runtime in seconds: 0.1051

##############################
DecisionTreeClassifier


	##########################
	Fold 1

		AUC: 57.1656%
		Accuracy: 58.4270%
		Balanced Accuracy: 57.1656%
		Kappa: 14.3117%
		Log Loss: 14.36
		Runtime in seconds: 0.06626

	##########################
	Fold 2

		AUC: 56.7568%
		Accuracy: 57.5188%
		Balanced Accuracy: 56.7568%
		Kappa: 13.5718%
		Log Loss: 14.67
		Runtime in seconds: 0.0664

	##########################
	Fold 3

		AUC: 49.1078%
		Accuracy: 50.3759%
		Balanced Accuracy: 49.1078%
		Kappa: -1.7798%
		Log Loss: 17.14
		Runtime in seconds: 0.06382

	##########################
	Fold 4

		AUC: 51.8716%
		Accuracy: 53.0075%
		Balanced Accuracy: 51.8716%
		Kappa: 3.7627%
		Log L

As before, the output shows some metrics of the final model.

## Test our model
For the testing part we use the three test cohorts (PPMI, HBS and PDBP). For each one, we process the test data with `prepareFinalTest` and test the model with `finalTest_1SmE`. This function first obtains the level 1 experts' predictions for the data and then predicts with the final model.

In [13]:
# Test each cohort
for geno, covs in zip(genoTest, covsTest):
    print()
    print("#" * 30)
    print("Final Test with " + geno + "\n")

    lworkPath = workPath + "/" + covs + "/"
    # Create the cohort folder (and any missing parents) if needed
    os.makedirs(lworkPath, exist_ok=True)

    # generate mldata for the test repository
    handlerTest = gm.prepareFinalTest(workPath=lworkPath,
                                      path2Geno=path2Data + "/" + geno,
                                      path2Covs=path2Data + "/" + covs,
                                      predictor="PHENO_PLINK",
                                      snpsToPull=handlerML["snpsToPull"])

    with open(lworkPath + "/handler- " + geno + ".pydat", 'wb') as handlerTest_file:
        pickle.dump(handlerTest, handlerTest_file)

    # obtain final evaluation results
    finalResult = gm.finalTest_1SmE(lworkPath,
                                    expert_L2,
                                    handlerTest,
                                    experts_L1)

    with open(lworkPath + "/finalResults.pydat", 'wb') as finalResult_file:
        pickle.dump(finalResult, finalResult_file)


##############################
Final Test with PPMI

The command to run: cp /home/edusal/data/FINALES/COVS_PPMI.cov /home/edusal/MML-1SmE-4//COVS_PPMI/
Running command /home/edusal/packages/plink --bfile /home/edusal/data/FINALES/PPMI --extract /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//g-SPANISHtrain.1-p-MyPhenotype1train-c-COVS_train1-a-NA.temp.snpsToPull2 --recode A --out /home/edusal/MML-1SmE-4//COVS_PPMI//g-PPMI-p-MyPhenotype-c-COVS_PPMI-a-NA.reduced_genos

Running command cut -f 1 /home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//g-SPANISHtrain.1-p-MyPhenotype1train-c-COVS_train1-a-NA.temp.snpsToPull2 > /home/edusal/MML-1SmE-4//COVS_PPMI//g-PPMI-p-MyPhenotype-c-COVS_PPMI-a-NA.reduced_genos_snpList

Number of folds here 1

1
trainFoldFiles
1
testFoldFiles
/home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//g-SPANISHtrain.1-p-MyPhenotype1train-c-COVS_train1-a-NA.temp.snpsToPull2
/home/edusal/MML-1SmE-4/dataRepo/Foldtrain1//g-SPANISHtrain.1-p-MyPhenotype1train-c-COVS_train1-a-NA.temp.snpsToP

In this output we see the final results of our testing process.
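
Because each cohort's results are pickled to `finalResults.pydat` in its own folder, they can be reloaded later without rerunning the test. A minimal sketch follows, assuming the folder layout produced by the loop above; `load_final_results` is a hypothetical helper, and the structure of each result object is not shown in this notebook, so it is returned unchanged.

```python
import os
import pickle


def load_final_results(workPath, covsTest):
    """Map each cohort folder name to its unpickled final results."""
    results = {}
    for covs in covsTest:
        result_path = os.path.join(workPath, covs, "finalResults.pydat")
        # Skip cohorts whose test has not been run yet
        if os.path.isfile(result_path):
            with open(result_path, "rb") as result_file:
                results[covs] = pickle.load(result_file)
    return results
```

With the parameters above, `load_final_results(workPath, covsTest)` would return a dictionary keyed by "COVS_PPMI", "COVS_HBS" and "COVS_PDBP" for the cohorts that have been tested.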