Integrating bacterial and gene data across datasets S12 and S13 could become challenging, particularly as the clustering approach scales. A tiered clustering model offers a more strategic basis for forward-looking predictive models.  
- The logical next step was to incorporate tree-based clustering, which facilitates re-clustering based on gene-level data. A notable issue was missing or incomplete bacterial abundance values, which complicated the mapping protocols.  
- When clustering bacteria by abundance, predicting gene expression emerged as a key challenge, especially given the potentially large number of null outcomes from the gene-side predictions.  
- The subsequent method of re-clustering bacterial features provided a stronger foundation for accurate gene expression prediction.  
- This second-level clustering prioritized bacterial features, producing a more adaptable framework for predictive analysis (akin to multi-level regression techniques).  
- Clustering on bacterial abundance was straightforward to implement, but exploratory data analysis (EDA) revealed a persistent issue with incomplete bacterial taxa labels. A future API update could address this by allowing flexible labeling mechanisms that don’t affect the integrity of the primary datasets.  
- The tree-based clustering might present challenges for broader interpretation, but the algorithm has potential for in-depth breakdowns that can better explain the interactions between bacteria and gene expression outcomes.  
- Performing a second round of clustering on the bacterial data deviated from standard protocol, but it ensured that all relevant data were parsed without losing important features from S12 and S13. This keeps the results usable across multiple levels of analysis.  
- The significance of outlier bacterial species has yet to be fully assessed. However, outliers are retained in the EDA, as their influence on gene expression models could offer critical insights in later stages of the project.
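
The tiered idea in the notes above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the notebook's actual method: the function name `two_level_clusters`, the parameters `n_top` and `n_sub`, the `log1p` transform, the choice of Ward linkage for the tree-based step, and the decision to drop rows with missing values are all hypothetical. The demo frame is synthetic and is not drawn from S12 or S13.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans


def two_level_clusters(df, n_top=3, n_sub=2, seed=0):
    """First level: KMeans on bacterial abundance.
    Second level: tree-based (Ward) split on gene values inside each cluster."""
    # Assumption: rows with a missing value on either side are dropped.
    clean = df[["Bacteria", "Gene"]].dropna()
    abundance = np.log1p(clean["Bacteria"].to_numpy(dtype=float)).reshape(-1, 1)
    gene = np.log1p(clean["Gene"].to_numpy(dtype=float)).reshape(-1, 1)

    # Level 1: group samples by bacterial abundance.
    top = KMeans(n_clusters=n_top, n_init=10, random_state=seed).fit_predict(abundance)

    # Level 2: within each abundance cluster, cut a Ward tree built on gene values.
    sub = np.zeros(len(clean), dtype=int)
    for label in np.unique(top):
        members = np.flatnonzero(top == label)
        if len(members) > n_sub:  # too-small clusters are left unsplit
            tree = linkage(gene[members], method="ward")
            sub[members] = fcluster(tree, t=n_sub, criterion="maxclust") - 1

    return pd.DataFrame({"top": top, "sub": sub}, index=clean.index)


# Synthetic demo data with the same column names as combined_df.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"Bacteria": rng.integers(0, 600, size=12), "Gene": rng.integers(0, 2000, size=12)},
    index=[f"s{i:02d}" for i in range(12)],
)
result = two_level_clusters(demo)
print(result)
```

Outliers are not filtered here, in line with the note that they are retained during EDA; the log transform only dampens their pull on the first-level centroids.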

In [5]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from scipy.spatial.distance import pdist, squareform
from skbio.stats.distance import mantel
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.cluster import KMeans

In [6]:
import pandas as pd

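# Load the two supplementary tables (S12 and S13).
stable12 = pd.read_csv('/Users/schoudhry/Desktop/IIT/Research/researchData/StableS12.csv')
stable13 = pd.read_csv('/Users/schoudhry/Desktop/IIT/Research/researchData/Stable13.csv')

# Take the first data row of each table; each is a Series indexed by column name.
genes = stable12.iloc[0]
bacteria = stable13.iloc[0]

# One column per source; pandas aligns the two Series on their index labels.
data = {
    'Bacteria': bacteria,
    'Gene': genes
}

combined_df = pd.DataFrame(data)

print(combined_df)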


    Bacteria  Gene
B01      116   235
B02       66   815
B03        0   351
B04       14   629
B05      265    91
..       ...   ...
s89      422  1028
s90      584    14
s94       31  1985
s95      145   369
s96      487   291

[89 rows x 2 columns]
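
Because `pd.DataFrame` builds the combined frame from the union of the two Series' labels, a label present in only one table becomes a row with a missing value on the other side. A minimal sketch of a check for this is shown below; the helper name `alignment_report` and the toy values are hypothetical and are not taken from S12 or S13.

```python
import pandas as pd


def alignment_report(bacteria, genes):
    """Summarize labels that appear in only one of the two Series."""
    only_bacteria = bacteria.index.difference(genes.index)
    only_genes = genes.index.difference(bacteria.index)
    # Same construction as combined_df: rows are the union of both indexes.
    combined = pd.DataFrame({"Bacteria": bacteria, "Gene": genes})
    return {
        "only_bacteria": list(only_bacteria),
        "only_genes": list(only_genes),
        "rows_with_missing": int(combined.isna().any(axis=1).sum()),
    }


# Toy Series with one deliberately mismatched label on each side.
toy_bacteria = pd.Series({"B01": 10, "B02": 20, "B03": 0})
toy_genes = pd.Series({"B01": 5, "B02": 7, "B04": 9})
report = alignment_report(toy_bacteria, toy_genes)
print(report)
```

Running the same check on the real `bacteria` and `genes` Series before clustering would show whether any samples need to be dropped or imputed.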
