<a href="https://colab.research.google.com/github/bryankolaczkowski/gmcdp/blob/main/examples/ibd/ibd_sandbox.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# read csv data

The pandas library reads a CSV-formatted text file (here, fetched directly from a URL) into a pandas DataFrame object. Calling `df.head()` displays the first five rows.

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/bryankolaczkowski/gmcdp/main/examples/ibd/data_normalized.csv")
df.head()
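After loading, it can help to confirm that the table has the expected shape and column types before going further. The sketch below does this on a tiny in-memory CSV so that it runs without network access. The `Sample` column and the two ASV columns are hypothetical stand-ins, not the layout of the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: two ASV columns plus the Group label.
csv_text = """Sample,ASV_1,ASV_2,Group
s1,0.10,0.90,CD
s2,0.40,0.60,Control
s3,0.75,0.25,CD
"""

# read_csv accepts any file-like object, not only a path or URL
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)                    # (number of rows, number of columns)
print(toy.dtypes)                   # abundance columns should be numeric
print(toy["Group"].value_counts())  # how many samples carry each label
```

The same three checks can be run on `df` once the real file is loaded.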

# extract ASV data to numpy

NumPy N-dimensional arrays (also called tensors) are a standard way to pass data between different Python libraries.

In this case, we first extract all the columns from the DataFrame whose headers start with "ASV_", which indicates that the column contains abundance data (aka, the explanatory variables).

In [None]:
# keep only the column names that begin with the "ASV_" prefix
asvs = [ col for col in df.columns if col.startswith('ASV_') ]
x = df[asvs].to_numpy()
x
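As a quick check on the column selection, the sketch below builds a small DataFrame with hypothetical column names and confirms that the resulting array has one row per sample and one column per ASV. It also shows `DataFrame.filter(regex=...)` as an alternative way to pick the same columns. The toy values are made up for illustration.

```python
import pandas as pd

# Toy table with hypothetical columns; only the "ASV_" ones hold abundances.
toy = pd.DataFrame({
    "Sample": ["s1", "s2", "s3"],
    "ASV_1": [0.10, 0.40, 0.75],
    "ASV_2": [0.90, 0.60, 0.25],
    "Group": ["CD", "Control", "CD"],
})

# Select abundance columns by prefix, then convert to a 2-D NumPy array
asv_cols = [col for col in toy.columns if col.startswith("ASV_")]
x_toy = toy[asv_cols].to_numpy()

# Alternative selection: a regular expression anchored at the start of the name
x_alt = toy.filter(regex="^ASV_").to_numpy()

print(asv_cols)
print(x_toy.shape)  # (samples, ASVs)
```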

# extract disease state data to numpy

We'll also need the response variable, which is stored in the column called "Group". In this case, the response variable is a binary value indicating whether the sample (row) is from an individual diagnosed with the disease, or not.

In the DataFrame, disease-diagnosed rows are indicated by Group="CD", and non-diagnosed rows are indicated by Group="Control". We translate these labels to 1 and 0, respectively, using a Python dictionary.

The response variables are also converted to a NumPy array.

In [None]:
import numpy as np

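# pull the Group column out as a 1-D array of label strings
c = df['Group'].to_numpy()
# CD -> 1, Control -> 0 (named label_map so the built-in map() is not shadowed)
label_map = { 'CD':1, 'Control':0 }
yl = [ label_map[g] for g in c ]
y = np.array(yl)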
y
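The same label translation can be done without a Python loop using `Series.map`. One difference worth knowing: a label that is missing from the dictionary raises `KeyError` in the list comprehension, but silently becomes NaN with `Series.map`, so the sketch below checks for that explicitly. The toy labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy labels standing in for the real "Group" column
groups = pd.Series(["CD", "Control", "CD", "Control"], name="Group")
label_map = {"CD": 1, "Control": 0}

# Vectorized dictionary lookup; unmapped labels would come back as NaN
mapped = groups.map(label_map)

# Fail loudly if any label was not in the dictionary
if mapped.isna().any():
    unknown = sorted(set(groups[mapped.isna()]))
    raise ValueError(f"unmapped labels: {unknown}")

y_toy = mapped.to_numpy(dtype=np.int64)
print(y_toy)
```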

# package explanatory, response variables for tensorflow

TensorFlow can read NumPy arrays containing explanatory and response variables into a Dataset object that can be used to efficiently train neural networks.

It's a bit difficult to 'look at' the contents of the Dataset object directly (although you can convert the Dataset back to NumPy arrays). The "shape" of the explanatory (x) and response (y) tensors includes a "batch dimension", shown as None, which TensorFlow fills in during training based on the Dataset's batch size (in this case, 10). It is reported as None rather than 10 because the final batch can hold fewer samples when the number of rows is not a multiple of the batch size.

In [None]:
import tensorflow as tf

data = tf.data.Dataset.from_tensor_slices((x,y)).batch(10)
data
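To make the batching behaviour concrete, here is a NumPy-only sketch of how rows are grouped into consecutive batches of 10. It is an illustration of the slicing, not TensorFlow's implementation, and the sample count of 25 and the random values are assumptions chosen so that the last batch comes out smaller.

```python
import numpy as np

# Hypothetical sizes: 25 samples, 4 features, batches of 10
n_samples, n_features, batch_size = 25, 4, 10

rng = np.random.default_rng(0)
x_toy = rng.random((n_samples, n_features))
y_toy = rng.integers(0, 2, size=n_samples)

# Slice x and y in lockstep so each batch keeps rows and labels aligned
batches = [
    (x_toy[start:start + batch_size], y_toy[start:start + batch_size])
    for start in range(0, n_samples, batch_size)
]

for xb, yb in batches:
    print(xb.shape, yb.shape)
```

With 25 rows the batches hold 10, 10, and 5 samples, so the leading dimension cannot be stated as a single fixed number ahead of time.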

# data visualization

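Each sample (row) is plotted as a rank-abundance curve: its ASV abundances are sorted from largest to smallest and drawn against their rank. Points are colored by disease state, with CD samples in red and Control samples in blue.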

In [None]:
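import matplotlib.pyplot as plt

# color by class label: index 0 (Control) is blue, index 1 (CD) is red
cmap = ['blue', 'red']

fig,ax = plt.subplots(1,1, figsize=(20,10))

# x-axis positions: one rank per ASV column
xax = np.arange(0, x.shape[-1], 1)

# one scatter series per sample, abundances sorted in descending order
for j in range(x.shape[0]):
  yax = np.array(x[j,:])
  ax.scatter(xax, np.flip(np.sort(yax)), color=cmap[y[j]])

ax.set_xlabel('abundance rank')
ax.set_ylabel('ASV abundance')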
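The loop sorts each sample's abundances separately. As a sketch, the same descending sort can be applied to every row at once with `np.sort(..., axis=1)` followed by a column reversal. The small array below is made up for illustration and is compared against the per-row version.

```python
import numpy as np

# Two toy samples with three abundance values each
x_toy = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.0, 0.5]])

# Per-row approach, as in the plotting loop
loop_sorted = np.array([np.flip(np.sort(row)) for row in x_toy])

# All rows at once: sort ascending along each row, then reverse the columns
vec_sorted = np.sort(x_toy, axis=1)[:, ::-1]

print(vec_sorted)
print(np.array_equal(loop_sorted, vec_sorted))
```

Sorting once up front would let the plotting loop index a precomputed row instead of re-sorting on every iteration.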