Integrating bacterial and gene data across datasets S12 and S13 could become challenging, particularly as the clustering approach scales. A tiered clustering model offers a more strategic basis for forward-looking predictive models.  
- The logical next step was to incorporate tree-based clustering, which facilitates re-clustering based on gene-level data. A notable issue was missing or incomplete bacterial abundance values, which complicated the mapping protocols.  
- When clustering bacteria by abundance, the predictability of gene expression emerged as a key challenge, especially given the potentially large number of null outcomes from the gene-side predictions.  
- The subsequent method of re-clustering bacterial features provided a stronger foundation for accurate gene expression prediction.  
- This second-level clustering introduced a crucial function where bacterial features were prioritized, producing a more adaptable framework for predictive analysis (akin to multi-level regression techniques).  
- Implementing clustering based on bacterial abundance was straightforward, but during the exploratory data analysis (EDA), a persistent issue arose with incomplete bacterial taxa labels. A future API update could address this by allowing for flexible labeling mechanisms that don’t impact the integrity of the primary datasets.  
- The tree-based clustering might present challenges for broader interpretation, but the algorithm has potential for in-depth breakdowns that can better explain the interactions between bacteria and gene expression outcomes.  
- A more unconventional approach was taken in performing a second round of clustering on bacterial data, which deviated from standard protocol, but ensured all relevant data was parsed without loss of important features from S12 and S13. This ensured usability across multiple levels of analysis.  
- The significance of outlier bacterial species has yet to be fully assessed. However, outliers are retained in the EDA, as their influence on gene expression models could offer critical insights in later stages of the project.
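
The tiered idea in the notes above can be sketched roughly as follows. This is a minimal illustration on synthetic counts, not the S12/S13 data and not the notebook's actual implementation: the helper name `tiered_kmeans`, the cluster counts, the `log1p` transform, and the choice to treat missing abundances as zero are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def tiered_kmeans(abundance, n_top=2, n_sub=2, random_state=0):
    """Two-level clustering of taxa (rows) by abundance across patients (columns).

    Hypothetical helper for illustration only.
    """
    # Assumption: a missing abundance is treated as a zero count.
    filled = abundance.fillna(0)
    # log1p dampens very abundant taxa without dropping outliers.
    features = np.log1p(filled.to_numpy(dtype=float))

    # First level: cluster all taxa.
    top = KMeans(n_clusters=n_top, n_init=10, random_state=random_state).fit_predict(features)
    labels = pd.DataFrame({"level1": top, "level2": 0}, index=abundance.index)

    # Second level: re-cluster the taxa inside each first-level cluster.
    for cluster_id in np.unique(top):
        mask = top == cluster_id
        # Only split groups that have more members than requested sub-clusters.
        if mask.sum() > n_sub:
            sub = KMeans(n_clusters=n_sub, n_init=10, random_state=random_state).fit_predict(features[mask])
            labels.loc[mask, "level2"] = sub
    return labels

# Synthetic taxa-by-patient counts with one missing value.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.poisson(lam=20, size=(12, 5)),
    index=[f"taxon_{i}" for i in range(12)],
    columns=[f"B0{i}" for i in range(1, 6)],
).astype(float)
toy.iloc[0, 0] = np.nan

labels = tiered_kmeans(toy)
print(labels)
```

Keeping both label columns lets later steps group taxa at either level, which is one way to retain outlier species while still summarising the rest.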

In [4]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
#import statsmodels.api as sm
#import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from scipy.spatial.distance import pdist, squareform
#from skbio.stats.distance import mantel
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

In [5]:
csv12 = '/Users/schoudhry/Desktop/IIT/Research/researchData/StableS12.csv'
csv13 = '/Users/schoudhry/Desktop/IIT/Research/researchData/Stable13.csv'

In [6]:

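# Read each file once, then drop columns with blank headers (pandas names them 'Unnamed: N')
s12_raw = pd.read_csv(csv12)
s13_raw = pd.read_csv(csv13)
s12_df = s12_raw.loc[:, ~s12_raw.columns.str.contains('^Unnamed')]
s13_df = s13_raw.loc[:, ~s13_raw.columns.str.contains('^Unnamed')]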

# Iterate through Patient IDs in S12 and check for matches in S13
for patient_id in s12_df.columns:
    if patient_id in s13_df.columns:
        # Get the values for this Patient ID from both dataframes
        s12_values = s12_df[patient_id].values
        s13_values = s13_df[patient_id].values

        # Print the matched values
        print(f"Patient ID: {patient_id}")
        print(f"S12 Values: {s12_values}")
        print(f"S13 Values: {s13_values}")
        print("-" * 40)


Patient ID: B01
S12 Values: [235  23 184 ...   0   0   0]
S13 Values: [116  86   0 ...   0   0   0]
----------------------------------------
Patient ID: B02
S12 Values: [815  15 166 ...   0   0   0]
S13 Values: [ 66 186   0 ...   0   0   0]
----------------------------------------
Patient ID: B03
S12 Values: [351 188 146 ...   0   5   0]
S13 Values: [ 0  7 22 ...  0  0  0]
----------------------------------------
Patient ID: B04
S12 Values: [629  12 124 ...   0   0   0]
S13 Values: [14 37  0 ...  0  0  0]
----------------------------------------
Patient ID: B05
S12 Values: [ 91   4 150 ...   0   0   0]
S13 Values: [265   0   0 ...   0   0   0]
----------------------------------------
Patient ID: B06
S12 Values: [175  10 121 ...   0   0   0]
S13 Values: [594  13  54 ...   0   0   0]
----------------------------------------
Patient ID: B07
S12 Values: [1240   24  536 ...    0    0    0]
S13 Values: [351  27  24 ...   0   0   0]
----------------------------------------
Patient ID: B08
S12
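
The loop above checks each S12 column against S13 one at a time and prints the values. A more compact way to align the two tables might look like the sketch below. It uses small invented tables in place of the real files, so the names `s12_toy` and `s13_toy`, the row labels, and all values are placeholders.

```python
import pandas as pd

# Toy stand-ins for S12 (genes x patients) and S13 (taxa x patients); values are invented.
s12_toy = pd.DataFrame(
    {"B01": [5, 0], "B02": [3, 1], "B09": [7, 2]},
    index=["gene_a", "gene_b"],
)
s13_toy = pd.DataFrame(
    {"B02": [10, 0], "B01": [4, 6], "B10": [1, 1]},
    index=["taxon_x", "taxon_y"],
)

# Patient IDs that appear as columns in both tables.
shared = s12_toy.columns.intersection(s13_toy.columns)

# Select the shared patients from each table so the columns line up one-to-one.
s12_aligned = s12_toy[shared]
s13_aligned = s13_toy[shared]

print(list(shared))
print(s12_aligned.shape, s13_aligned.shape)
```

Selecting both tables with the same `shared` index guarantees identical column order, which matters once gene and bacterial values are compared patient by patient.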

In [7]:
"""
Integrating S12 and S13
S12 is the List of Genes of the Patients (found in Column A)
Ex. -> B01 -> B03

The patients are coded B01 .. B0x
with the genes being listed in the first column (Column A)

Check the genes from S12 against the genes from S13.
From there, check the bacteria tree of each, focusing on the second in the tree.
Example: Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae <- focus on Clostridia.

Check the patients' genes against the genes with CRC from table S6,
with more occurrences indicating a higher chance; the aim is to build a prediction model.

Create a DataFrame with:
- S12 -> focusing on the gene (in Column A)
- S13 and S12 -> Row 1 (match up the values and reorganize the table then print it out)
- S12 and S13 -> once the values are matched, create a table from S12 that has the occurrences of each gene.
For example, Column B2 -> would be:
        B01x
TSPAN6   235
Where TSPAN6 is the gene
and 235 is the amount of the gene in patient B01.

S13 and S6 -> check the bacteria against each other in the tables -> S6 and S13 focusing on the second in the tree.
Example: Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae <- focus on Clostridia.
S13 -> once organized alongside.

Then from the S6 table, check which genes based on the amount of occurrences would be likely to have CRC.
The taxa would be in the input feature (since it is Bacteria based (SECOND ON THE TREE)).

From there, screen each gene:
a gene with a high R-squared, a low p-value, and a result that stays valid after correction could be considered a candidate for carrying CRC risk or involvement.
"""
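
The taxonomy step in the plan above (pulling one level out of a lineage such as `Bacteria;Firmicutes;Clostridia;...` and grouping on it) could be sketched as below. The example assumes that "second in the tree" counts from below the `Bacteria` root, which is what makes `Clostridia` the target in the note's own example. The helper `rank_below_root` and the toy counts are hypothetical.

```python
import pandas as pd

def rank_below_root(lineage, depth=2):
    """Return the taxon `depth` levels below the root of a ';'-separated lineage.

    Hypothetical helper. Returns None when the lineage is missing or too short.
    """
    if not isinstance(lineage, str):
        return None
    parts = [p.strip() for p in lineage.split(";")]
    return parts[depth] if len(parts) > depth and parts[depth] else None

# Invented counts; only the lineage format follows the notes.
toy = pd.DataFrame(
    {"B01": [4, 6, 1], "B02": [10, 0, 3]},
    index=[
        "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae",
        "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae",
        "Bacteria;Bacteroidetes",
    ],
)

classes = toy.index.map(rank_below_root)
# Sum counts over taxa that share a label; rows with no label are dropped by groupby.
by_class = toy.groupby(classes).sum()
print(by_class)
```

Lineages that stop above the requested level are left unlabeled instead of being guessed, so incomplete taxa labels do not alter the primary table.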

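
The screening criteria at the end of the plan (high R-squared, low p-value, still valid after correction) might be checked along these lines. The notes do not say which correction is intended, so Benjamini-Hochberg is an assumption here, as are the helper name `benjamini_hochberg`, the gene names, and the synthetic data.

```python
import numpy as np
from scipy.stats import linregress

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed correction method)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)
    return out

# Synthetic data: one gene depends on the taxon abundance, one does not.
rng = np.random.default_rng(0)
n_patients = 30
taxon = rng.normal(size=n_patients)
genes = {
    "gene_linked": 2.0 * taxon + rng.normal(scale=0.5, size=n_patients),
    "gene_noise": rng.normal(size=n_patients),
}

# One simple linear regression per gene, then correct across all genes.
results = {name: linregress(taxon, y) for name, y in genes.items()}
adj = benjamini_hochberg([r.pvalue for r in results.values()])

for (name, r), q in zip(results.items(), adj):
    print(f"{name}: R^2={r.rvalue**2:.3f}, p={r.pvalue:.3g}, adjusted p={q:.3g}")
```

With many genes tested against the same taxa, the adjusted p-value is the one to threshold; the cutoff itself is a project decision the notes leave open.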
