Integrating bacterial and gene data across datasets S12 and S13 could become difficult as the clustering approach scales. A tiered clustering model is a more strategic choice for the predictive models planned later.  
- The logical next step was to add tree-based clustering, which allows re-clustering on gene-level data. One issue was missing or incomplete bacterial abundance values, which made mapping between the datasets harder.  
- When clustering bacteria by abundance, predicting gene expression emerged as the key challenge, especially given the potentially large number of null outcomes from the gene-side predictions.  
- Re-clustering the bacterial features afterwards provided a stronger foundation for accurate gene expression prediction.  
- This second-level clustering prioritized bacterial features, producing a more adaptable framework for predictive analysis (akin to multi-level regression techniques).  
- Clustering on bacterial abundance was straightforward to implement, but the exploratory data analysis (EDA) repeatedly ran into incomplete bacterial taxa labels. A future API update could address this with flexible labeling mechanisms that don’t affect the integrity of the primary datasets.  
- Tree-based clustering may be harder to interpret broadly, but it allows in-depth breakdowns that can better explain the interactions between bacteria and gene expression outcomes.  
- Running a second round of clustering on the bacterial data deviated from standard protocol, but it ensured that all relevant data was parsed without losing important features from S12 and S13, which keeps the results usable across multiple levels of analysis.  
- The significance of outlier bacterial species has yet to be fully assessed. Outliers are retained in the EDA because their influence on gene expression models could offer critical insights in later stages of the project.
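The two-level clustering described in the notes above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the function name `two_level_clusters`, the Ward linkage, and the cluster counts `n_top` and `n_sub` are all assumptions, and the input is assumed to contain no missing abundance values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def two_level_clusters(abundance, n_top=2, n_sub=2):
    """Cluster taxa (rows) by abundance, then re-cluster inside each top-level cluster.

    abundance: 2-D array, rows = bacterial taxa, columns = samples.
    Returns an array of (top_label, sub_label) pairs, one row per taxon.
    """
    abundance = np.asarray(abundance, dtype=float)
    # Level 1: hierarchical (tree-based) clustering on the full abundance profiles.
    top = fcluster(linkage(abundance, method="ward"), t=n_top, criterion="maxclust")
    sub = np.ones_like(top)
    for label in np.unique(top):
        idx = np.where(top == label)[0]
        # Level 2: re-cluster only when the group is large enough to split.
        if len(idx) > n_sub:
            tree = linkage(abundance[idx], method="ward")
            sub[idx] = fcluster(tree, t=n_sub, criterion="maxclust")
    return np.column_stack([top, sub])


# Synthetic abundance table (hypothetical values): four tight groups of five taxa,
# arranged as two well-separated pairs of groups.
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [0, 5, 0], [20, 20, 20], [20, 25, 20]], dtype=float)
abundance = np.vstack([c + rng.normal(scale=0.1, size=(5, 3)) for c in centers])

labels = two_level_clusters(abundance)
print(labels)
```

Keeping the first-level label next to the second-level label makes it possible to analyze the data at either level without re-running the clustering.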

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from scipy.spatial.distance import pdist, squareform
from skbio.stats.distance import mantel
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.cluster import KMeans

# smf model issue ??

In [2]:
# Integrating S12 and S13
# S12 lists the genes for each patient (gene names are in column A)
# Patient columns, e.g. B01 -> B03

'''
The patients are coded B01 .. B0x,
with the genes listed in the first column (column A).

Check the genes from S12 against the genes from 
From there,
check the bacteria tree of each, focusing on the second in the tree:
Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae <- focus on Clostridia 

Check the patients' genes against the genes with CRC from table S6,
with more occurrences meaning a higher chance ??? -> prediction model ???? is the aim


Create a dataframe with:
S12 -> focusing on the gene (in column A)
S13 and S12 -> row 1 (match up the values, reorganize the table, then print it out)
S12 and S13 -> once the values are matched, create a table from S12 that has the occurrences of each gene
For example, column B2 would be:
        B01x
TSPAN6   235
where TSPAN6 is the gene and
235 is the amount of the gene in patient B01



S13 and S6 -> check the bacteria against each other in tables S6 and S13, focusing on the second in the tree
Ex. -> Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae <- focus on Clostridia 
S13 -> once organized alongside 

Then, from the S6 table,
check which genes, based on the number of occurrences, would be likely to have CRC.
The taxa would be the input feature (since it is bacteria based (SECOND ON THE TREE)).

From there, check the gene through:
a gene with a high R-squared, a low p-value, and validity after correction could be considered a candidate for carrying CRC risk or involvement.
'''

Why use S6: "The associations in the file are statistically significant, with a false discovery rate (FDR) of less than 0.1. This means that the likelihood of identifying false positives is controlled, ensuring that the associations are reliable."

After fitting the model, look for significant gene predictions that are well explained by certain bacterial taxa.
A gene whose expression is consistently predicted by microbial abundance, with a high R-squared, low p-value, and valid after correction, could be a candidate for carrying CRC risk or involvement.
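As a rough sketch of this screening step, the cell below regresses each gene on a single taxon's abundance and applies a Benjamini-Hochberg correction. It is an illustration on synthetic data only: it uses `scipy.stats.linregress` and a hand-written correction instead of the `statsmodels` models imported above, and the names `screen_genes` and `benjamini_hochberg`, the gene labels, and the `min_r2=0.5` cutoff are assumptions. The `fdr=0.1` default mirrors the FDR threshold quoted for S6.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0, 1)
    return adjusted


def screen_genes(expression, taxon_abundance, fdr=0.1, min_r2=0.5):
    """Fit one simple linear regression per gene and flag candidate genes.

    expression: DataFrame, rows = samples, columns = genes.
    taxon_abundance: Series with one abundance value per sample (same order).
    """
    rows = []
    for gene in expression.columns:
        fit = linregress(taxon_abundance.to_numpy(), expression[gene].to_numpy())
        rows.append({"gene": gene, "r2": fit.rvalue ** 2, "p": fit.pvalue})
    out = pd.DataFrame(rows).set_index("gene")
    out["q"] = benjamini_hochberg(out["p"])
    # Candidate: significant after correction AND well explained by the taxon.
    out["candidate"] = (out["q"] < fdr) & (out["r2"] >= min_r2)
    return out


# Synthetic example: GENE_A follows the taxon abundance, GENE_B is pure noise.
rng = np.random.default_rng(1)
abund = pd.Series(rng.uniform(0, 10, size=30))
expr = pd.DataFrame({
    "GENE_A": 3.0 * abund + rng.normal(scale=0.5, size=30),
    "GENE_B": rng.normal(size=30),
})
result = screen_genes(expr, abund)
print(result)
```

Requiring both a corrected p-value below the threshold and a minimum R-squared keeps genes that are significant but barely explained by the taxon out of the candidate list.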

In [3]:
# CSV file paths (S6, S12, S13)
csv6 = '/Users/schoudhry/Desktop/IIT/Research/researchData/Supplementary Tables S6 S6.csv'
csv12 = '/Users/schoudhry/Desktop/IIT/Research/researchData/StableS12.csv'
csv13 = '/Users/schoudhry/Desktop/IIT/Research/researchData/Stable13.csv'



In [None]:
# Testing runs


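# Load the gene table (S12) and the bacterial taxa table (S13)
s12_df = pd.read_csv(csv12)
s13_df = pd.read_csv(csv13)


def get_second_tree(bacteria_tree):
    """Return the second entry of a ';'-separated taxonomy string, or None."""
    # Guard against NaN / non-string cells, which would raise a TypeError on `in`
    if not isinstance(bacteria_tree, str) or ';' not in bacteria_tree:
        return None
    # Index 1 is the entry right after the root, e.g. 'Firmicutes' in
    # 'Bacteria;Firmicutes;Clostridia;...' ('Clostridia' would be index 2)
    return bacteria_tree.split(';')[1]

# The first column of S13 is assumed to hold the taxonomy strings
s13_df['Second_Taxonomy'] = s13_df.iloc[:, 0].apply(get_second_tree)

# Gene names are in the first column of S12
s12_genes = s12_df.iloc[:, 0]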

