# Example model and use of the MOD PTB Microbiome challenge data

*Microbiome data* has a few quirks for machine learning applications. Here I will go through an example using what I feel is the best feature type to use for ML-microbiome studies. 

**TL/DR: Use the phylotype tables and/or alpha-diversity and/or pairwise distance and/or CST** instead of taxonomy or raw sequence-variant counts, as the former are the best harmonized across datasets. 

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import plot_roc_curve, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, r2_score, mean_squared_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
import os

import models
import regressors
import graphing
import helpers

2022-09-28 08:42:49.778351: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-28 08:42:49.778374: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [19]:
import importlib
importlib.reload(models)
importlib.reload(regressors)
importlib.reload(graphing)
importlib.reload(helpers)

<module 'helpers' from '/home/ababjac/Dream_Preterm/helpers.py'>

I do my ML in python. The raw data are almost all in `csv` format, which should import into a variety of engines without fuss.

In [3]:
BASEDIR = './'

This is the path to the training data, in turn with the following subdirectories and files:

```
├── alpha_diversity
│  └── alpha_diversity.csv
├── community_state_types
│  ├── cst_valencia.csv
│  └── cst_valencia_w_taxons.csv
├── metadata
│  └── metadata.csv
├── pairwise_distance
│  ├── krd_distance_long.csv
│  └── krd_distance_wide.csv
├── phylotypes
│  ├── phylotype_nreads.1e0.csv
│  ├── phylotype_nreads.1e_1.csv
│  ├── phylotype_nreads.5e_1.csv
│  ├── phylotype_relabd.1e0.csv
│  ├── phylotype_relabd.1e_1.csv
│  ├── phylotype_relabd.5e_1.csv
│  ├── pt.1e-1.csv
│  ├── pt.1e0.csv
│  └── pt.5e-1.csv
├── sv_counts
│  └── sp_sv_long.csv
└── taxonomy
   ├── sv_taxonomy.csv
   ├── taxonomy_nreads.family.csv
   ├── taxonomy_nreads.genus.csv
   ├── taxonomy_nreads.species.csv
   ├── taxonomy_relabd.family.csv
   ├── taxonomy_relabd.genus.csv
   └── taxonomy_relabd.species.csv
```

In [4]:
metadata = pd.read_csv(
    os.path.join(BASEDIR, 'metadata', 'metadata.csv')
)
metadata.iloc[0:3]

Unnamed: 0,project,specimen,participant_id,was_term,delivery_wk,collect_wk,race,age,NIH Racial Category,NIH Ethnicity Category,was_preterm,was_early_preterm
0,A,A00001-05,A00001,False,38.0,33.0,American Indian,Unknown,American Indian or Alaska Native,Unknown,False,False
1,A,A00002-01,A00002,False,40.0,38.0,White,Unknown,White,Unknown,False,False
2,A,A00003-02,A00003,False,40.0,30.0,Asian-Japanese,Unknown,Asian,Unknown,False,False


**metadata** is provided for both the training and validation data sets. In our simple model today, we will not correct for things like age, gestational week of collection of the specimen, race, ethnicity, etc.

The validation data will lack the outcome data (e.g. `was_preterm`, `was_early_preterm`, `was_term` and `delivery_wk` columns).

For convenience, I convert the metadata into associative dictionaries to be able to quickly look up crucial values

In [5]:
# Get some per-participant values

specimen_participant = {
    sp: p
    for (sp, p) in zip(
        metadata.specimen,
        metadata.participant_id
    )
}
len(specimen_participant)

participant_preterm = {
    p: pt
    for (p, pt)
    in zip(
        metadata.participant_id,
        metadata.was_preterm
    )
}

Validation data will be limited to through 32 weeks for the preterm challenge and through 28 weeks for the early-preterm challenge. Let's do the same to the training data to be sure we are not relying upon later-in-pregnancy specimens we will not have access to in the validation set....

We will then further split the training data into a train-test set at 70-30 to be able to check on our models.

In [6]:
    # Make a split 70-30 *by participant* (to ensure no leak from a person's specimens between test/train)
    all_participants = sorted(metadata[metadata.collect_wk <= 32].participant_id.unique())
    early_participants = sorted(metadata[metadata.collect_wk <= 28].participant_id.unique())

    random.seed(12345)
    train_participants = random.sample(all_participants, int(0.7*len(all_participants)))
    test_participants = [p for p in all_participants if p not in train_participants]

In [7]:
print(metadata.columns.tolist(), end='\n\n')

print(set(metadata['age'].tolist()), end='\n\n')
print(set(metadata['race'].tolist()), end='\n\n')
print(set(metadata['NIH Racial Category'].tolist()), end='\n\n')
print(set(metadata['NIH Ethnicity Category'].tolist()), end='\n\n')

['project', 'specimen', 'participant_id', 'was_term', 'delivery_wk', 'collect_wk', 'race', 'age', 'NIH Racial Category', 'NIH Ethnicity Category', 'was_preterm', 'was_early_preterm']

{'Unknown', '21', '42.0', '32', '28', '27', '24.0', '35', '37.0', '29-38', '20.0', '51', '18_to_28', '41', '33.0', '19.0', '32.0', '36', '34', '48', 'Below_18', '26', '31.0', '38', '42', '30', '29.0', '24', '23', '16', '21.0', '27.0', '18', '25', '34.0', '29', '30.0', '36.0', '20', '25.0', '37', '19', '43', '28.0', '38.0', '39', '22.0', '17.0', '31', '33', '23.0', 'Above_38', '17', '35.0', '41.0', '26.0', '40', '22'}

{'Caucasian', 'Indian', 'Unknown', 'american_indian_or_alaska_native', 'other', 'african_american', 'Black or African American', 'Pacific Islander', 'Asian-Japanese', 'Black', 'white', 'caucasian', 'Other', 'Decline', 'asian', 'Asian-Unspecified', 'More than one race', 'American Indian or Alaska Native', 'AmericanIndian', 'black', 'White', 'American Indian', 'Asian-Chinese', 'hispanic_or_lat

In [8]:
# clean age metadata

age_parsed = []
for elem in metadata['age']:
    if elem in ['Unknown', '18_to_28', 'Above_38', 'Below_18']:
        age_parsed.append(elem)
    elif elem == '29-38':
        age_parsed.append('29_to_38')
    else:
        age = float(elem)
        if age < 18:
            age_parsed.append('Below_18')
        elif age <= 28:
            age_parsed.append('18_to_28')
        elif age <= 38:
            age_parsed.append('29_to_38')
        elif age > 38:
            age_parsed.append('Above_38')
        else:
            age_parsed.append('Unknown')
            
metadata['age'] = age_parsed
#print(metadata['age'].value_counts())

metadata = metadata.loc[:, ~metadata.columns.isin(['project', 'was_term', 'race', 'NIH Ethnicity Category', 'was_early_preterm'])]
metadata.rename(columns = {'NIH Racial Category':'race'}, inplace = True)

metadata

Unnamed: 0,specimen,participant_id,delivery_wk,collect_wk,age,race,was_preterm
0,A00001-05,A00001,38.0,33.0,Unknown,American Indian or Alaska Native,False
1,A00002-01,A00002,40.0,38.0,Unknown,White,False
2,A00003-02,A00003,40.0,30.0,Unknown,Asian,False
3,A00004-08,A00004,40.0,27.0,Unknown,White,False
4,A00004-12,A00004,40.0,29.0,Unknown,White,False
...,...,...,...,...,...,...,...
3573,J00111-01,J00111,40.0,17.0,18_to_28,White,False
3574,J00112-01,J00112,39.0,19.0,18_to_28,White,False
3575,J00113-01,J00113,41.0,16.0,29_to_38,White,False
3576,J00115-01,J00115,42.0,18.0,29_to_38,White,False


In [9]:
metadata['collect_wk'].value_counts()

18.0    1515
22.0     108
16.0     108
28.0     104
26.0      97
17.0      96
25.0      90
20.0      88
29.0      86
33.0      83
31.0      81
24.0      81
27.0      80
32.0      78
30.0      76
35.0      68
19.0      67
36.0      66
23.0      66
15.0      61
21.0      58
34.0      58
37.0      50
14.0      46
13.0      41
12.0      41
11.0      33
10.0      32
38.0      29
39.0      23
8.0       17
9.0       16
40.0      11
7.0        9
6.0        6
2.0        3
1.0        2
3.0        1
41.0       1
5.0        1
4.0        1
Name: collect_wk, dtype: int64

## For this model, we will ONLY use the phylotype-relative abundance data.

Phylotypes are a cross-study feature, representing either a microbe, or very closely related (in evolutionary time) set of microbes. 

Phylotypes are groups of 16S rRNA sequence variants that have been clustered together based on the phylogenetic distance between these sequence variants after placement onto a tree of full-length 16S rRNA alleles. These were generated using the [MaLiAmPi pipeline](https://github.com/jgolob/maliampi).

If you are not sure which feature to use, I suggest *trying your modeling first with phylotypes as the feature*.


In [10]:
pt_ra_1e0 = pd.read_csv(
    os.path.join(BASEDIR, 'phylotypes', 'phylotype_relabd.1e0.csv'),
    index_col=0
)

#weight the values by collection week
ser = metadata['collect_wk']
func =  func = lambda x: np.asarray(x) * np.asarray(ser)
pt_ra_1e0 = pt_ra_1e0.apply(func, axis=0)
pt_ra_1e0

Unnamed: 0_level_0,pt__00001,pt__00002,pt__00003,pt__00004,pt__00005,pt__00006,pt__00007,pt__00008,pt__00009,pt__00010,...,pt__01835,pt__01836,pt__01837,pt__01838,pt__01839,pt__01840,pt__01841,pt__01842,pt__01843,pt__01844
specimen,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A00001-05,26.330709,0.762205,0.069291,0.294488,0.242520,0.000000,3.499213,0.0,0.311811,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A00002-01,30.614347,0.023299,0.232986,1.607603,0.396076,0.000000,0.023299,0.0,0.069896,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A00003-02,28.898964,0.019430,0.032383,0.000000,0.349741,0.000000,0.006477,0.0,0.032383,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A00004-08,25.043679,0.013875,0.027749,0.249743,0.124872,0.000000,0.735355,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A00004-12,23.391209,0.010623,0.015934,0.414286,0.015934,0.010623,4.142857,0.0,0.026557,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
J00111-01,17.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
J00112-01,18.993227,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
J00113-01,7.734932,0.613443,0.000000,0.000000,0.010098,0.000000,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
J00115-01,18.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


These phylotype feature tables are very akin to OTU tables used in other forms of microbiome analysis or a FPKM table from a transcriptioanl data set. Each specimen is a row. Each feature gets a column. The values are the relative transcripts that belong to that feature for that specimen....

In [11]:
pt_ra_1e0.reset_index(inplace=True)

Sum up by row and the total is always 1.0, as these are normalized relative abundance. The `nreads` versions of these same tables have the raw read counts per specimen, suitable for things like beta-binomial modeling, etc. 

In [12]:
data = pd.merge(metadata, pt_ra_1e0, on='specimen', how='inner')
data.iloc[0:3]

Unnamed: 0,specimen,participant_id,delivery_wk,collect_wk,age,race,was_preterm,pt__00001,pt__00002,pt__00003,...,pt__01835,pt__01836,pt__01837,pt__01838,pt__01839,pt__01840,pt__01841,pt__01842,pt__01843,pt__01844
0,A00001-05,A00001,38.0,33.0,Unknown,American Indian or Alaska Native,False,26.330709,0.762205,0.069291,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,A00002-01,A00002,40.0,38.0,Unknown,White,False,30.614347,0.023299,0.232986,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,A00003-02,A00003,40.0,30.0,Unknown,Asian,False,28.898964,0.01943,0.032383,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
data = data.loc[:, ~data.columns.isin(['specimen', 'collect_wk'])]

data = pd.get_dummies(data, prefix=['age', 'race'], columns=['age', 'race'])
data = data.groupby('participant_id').mean()
data

Unnamed: 0_level_0,delivery_wk,was_preterm,pt__00001,pt__00002,pt__00003,pt__00004,pt__00005,pt__00006,pt__00007,pt__00008,...,age_29_to_38,age_Above_38,age_Below_18,age_Unknown,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Native Hawaiian or Other Pacific Islander,race_Unknown,race_White
participant_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A00001,38.0,False,26.330709,0.762205,0.069291,0.294488,0.242520,0.000000,3.499213,0.000000,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
A00002,40.0,False,30.614347,0.023299,0.232986,1.607603,0.396076,0.000000,0.023299,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
A00003,40.0,False,28.898964,0.019430,0.032383,0.000000,0.349741,0.000000,0.006477,0.000000,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
A00004,40.0,False,28.722788,0.010748,0.007677,0.381403,0.024630,0.001518,2.696539,0.000718,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
A00005,41.0,False,30.039347,0.069911,0.082155,0.005673,0.507385,0.011191,0.023018,0.005277,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
J00111,40.0,False,17.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
J00112,39.0,False,18.993227,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
J00113,41.0,False,7.734932,0.613443,0.000000,0.000000,0.010098,0.000000,0.000000,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
J00115,42.0,False,18.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [20]:
# splitting train and test data
data_train = data.loc[train_participants]
data_test = data.loc[test_participants]

labels_train = data_train['was_preterm']
labels_test = data_test['was_preterm']

y_train = data_train['delivery_wk']
y_test = data_test['delivery_wk']

data_train = data_train.loc[:, ~data_train.columns.isin(['was_preterm', 'delivery_wk'])]
data_test = data_test.loc[:, ~data_test.columns.isin(['was_preterm', 'delivery_wk'])]

data_train, data_test = helpers.minmax_scale(data_train, data_test)
print(data_train.shape, data_test.shape)

(849, 1855) (365, 1855)


In [21]:
data_train

Unnamed: 0,pt__00001,pt__00002,pt__00003,pt__00004,pt__00005,pt__00006,pt__00007,pt__00008,pt__00009,pt__00010,...,age_29_to_38,age_Above_38,age_Below_18,age_Unknown,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Native Hawaiian or Other Pacific Islander,race_Unknown,race_White
0,0.677643,0.006231,0.000265,0.000000,0.001130,0.000280,0.002996,0.000219,0.000728,0.000843,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.756965,0.001133,0.000865,0.000360,0.020824,0.000000,0.000691,0.000097,0.022337,0.000000,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.325648,0.048953,0.030455,0.001057,0.030129,0.000761,0.013099,0.048134,0.019366,0.022803,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.016735,0.010623,0.285192,0.037720,0.017626,0.000000,0.000000,0.112322,0.000000,0.283398,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.897492,0.009569,0.000092,0.003764,0.000110,0.002946,0.000125,0.000000,0.000567,0.000291,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
844,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
845,0.036727,0.016409,0.197937,0.015228,0.083095,0.001513,0.003228,0.118347,0.015041,0.231106,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
846,0.411245,0.039574,0.063007,0.007627,0.068841,0.032571,0.152223,0.249343,0.019901,0.036020,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
847,0.675255,0.009820,0.063436,0.002002,0.006218,0.000363,0.039908,0.002326,0.030183,0.000663,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [22]:
data_test

Unnamed: 0,pt__00001,pt__00002,pt__00003,pt__00004,pt__00005,pt__00006,pt__00007,pt__00008,pt__00009,pt__00010,...,age_29_to_38,age_Above_38,age_Below_18,age_Unknown,race_American Indian or Alaska Native,race_Asian,race_Black or African American,race_Native Hawaiian or Other Pacific Islander,race_Unknown,race_White
0,0.913115,0.001903,0.002170,0.000000,0.018290,0.000000,0.000270,0.000000,0.002919,0.000000,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.729431,0.001798,0.136820,0.000000,0.000000,0.000000,0.047288,0.000000,0.215149,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.119860,0.045405,0.056073,0.014984,0.234395,0.008021,0.137471,0.003696,0.049807,0.000000,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.535056,0.005582,0.004211,0.007296,0.148021,0.000659,0.002989,0.000000,0.125247,0.007986,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.405453,0.002061,0.006450,0.018567,0.187056,0.000580,0.001176,0.002009,0.105865,0.000000,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
360,0.537146,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
361,0.536595,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
362,0.536963,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
363,0.505395,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [16]:
y_train

participant_id
I00261    40.0
A00024    39.0
I00019    39.0
I00162    36.0
G00032    40.0
          ... 
E00028    39.0
I00267    38.0
G00130    40.0
G00076    36.0
D00084    36.0
Name: delivery_wk, Length: 849, dtype: float64

In [17]:
y_test

participant_id
A00003    40.0
A00006    41.0
A00012    36.0
A00013    38.0
A00014    41.0
          ... 
J00099    37.0
J00101    41.0
J00102    38.0
J00107    40.0
J00113    41.0
Name: delivery_wk, Length: 365, dtype: float64

This is the actual modeling, with a random forest classifer. This is a very basic model with no optimization of hyperparameters. The intent was just to make a model suitable for developing the validation pipelines.

In [23]:
#FULL DATA + XGBoost

print('RUNNING XGBOOST')
y_pred = regressors.run_XGBoost(data_train, data_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING XGBOOST
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.61386856834867


In [24]:
#FULL DATA + RF

print('RUNNING RANDOM FORESTS')
y_pred = regressors.run_RF(data_train, data_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING RANDOM FORESTS
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.585509384558875


In [25]:
print('RUNNING PCA')
PCA_train, PCA_test, PCA_model = models.run_PCA(data_train, data_test)
print(PCA_train.shape, PCA_test.shape)

RUNNING PCA
(849, 74) (365, 74)


In [26]:
#PCA + XGBoost

print('RUNNING XGBOOST')
y_pred = regressors.run_XGBoost(PCA_train, PCA_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING XGBOOST
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.59026992575224


In [27]:
#PCA+RF

print('RUNNING RANDOM FORESTS')
y_pred = regressors.run_RF(PCA_train, PCA_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING RANDOM FORESTS
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.60074747391908


In [18]:
print('RUNNING AE')
AE_train, AE_test = models.run_AE(data_train, data_test)
print(AE_train.shape, AE_test.shape)

RUNNING AE


2022-09-27 19:15:22.301591: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-09-27 19:15:22.301719: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2022-09-27 19:15:22.301821: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2022-09-27 19:15:22.304449: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2022-09-27 19:15:22.304556: W tensorflow/stream_executor/platform/default/dso_loader

(849, 100) (365, 100)


In [27]:
#AE+XGB

print('RUNNING XGBOOST')
y_pred = regressors.run_XGBoost(AE_train, AE_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING XGBOOST
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.605977483626255


In [28]:
#AE+RF

print('RUNNING RANDOM FORESTS')
y_pred = regressors.run_RF(AE_train, AE_test, labels_train, labels_test, label='Preterm Phylotypes 1e0')

print('RMSE:', mean_squared_error(y_test, y_pred)**0.5)
#graphing.plot_points(y_pred, y_test, 'Preterm Predictions: Full+XGB')

RUNNING RANDOM FORESTS
Building model for label: Preterm Phylotypes 1e0
Predicting on test data for label: Preterm Phylotypes 1e0
RMSE: 36.602138349687266
