<a href="https://colab.research.google.com/github/christophergaughan/Machine-Learning/blob/master/nat_data_april2024_rescaled.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This analysis is based on the following e-mail from the grad-student Dan Quiggley

Dear Chris,

This is the human fetal data (week 7 to 17) we would like to analyze for correlations. The first column is the sample name and the other columns are normalized expression values for different transcription factors. We would like to do a similar analysis that we previously did with the mouse data that you did last year and focus on transcription factors that have a positive correlation with ALB or negative correlation with AFP. If we can get that information, we would also like to look at correlation with HNF4A.

Best,
Dan

From Natesh:

"Basically, we are stuck because we are trying to show that the FOXA-TBX3-CEBPA TF gene circuit matters for other’s data, and not just our own."

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/liver_TF_normalized_week7_17_01_rescaled.csv')

In [None]:
df.head()

# Rename 0th Column to 'Cell_Number'

In [None]:
# Rename the column
data = df.rename(columns={'Unnamed: 0': 'Cell_Number'})

# Save the updated dataframe back to a CSV file, or continue with analysis
data.to_csv('/content/drive/MyDrive/Colab Notebooks/liver_TF_normalized_week7_17_01_rescaled.csv', index=False)  # Save the file if needed

In [None]:
data.head()

# Look for obvious trends in data

To address Dan's request, we'll focus on finding transcription factors that correlate with ALB positively and AFP negatively. Additionally, we'll examine correlations with HNF4A. We'll start by computing the correlation matrix for these specific transcription factors, and then we'll highlight the relationships as requested.

Here's how we'll proceed:

1. **Compute the correlation matrix** for the dataset.
2. **Extract the correlations** for ALB, AFP, and HNF4A.
3. **Identify transcription factors** with strong positive correlations with ALB, strong negative correlations with AFP, and any notable correlations with HNF4A.
4. **Visualize these correlations** for a clearer understanding.

In [None]:
# Calculate the correlation matrix
correlation_matrix = data.iloc[:, 1:].corr()  # Exclude the identifier column

# Extract correlations with ALB, AFP, and HNF4A
alb_correlations = correlation_matrix['ALB'].sort_values(ascending=False)
afp_correlations = correlation_matrix['AFP'].sort_values()
hnf4a_correlations = correlation_matrix['HNF4A'].sort_values(ascending=False)

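# Display the strongest positive correlations with ALB, the most negative correlations with AFP,
# and the strongest positive correlations with HNF4A
alb_top_positive = alb_correlations.head(11)  # Top 10 + ALB itself (self-correlation of 1.0)
afp_top_negative = afp_correlations.head(6)  # Six most negative; AFP itself sorts last, so it is not included
hnf4a_top = hnf4a_correlations.head(11)  # Top 10 + HNF4A itself (self-correlation of 1.0)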

alb_top_positive, afp_top_negative, hnf4a_top


## Correlation Analysis Results:
### Here’s what we found regarding the transcription factors of interest:

**Correlations with ALB:**
*Top Positively Correlated Transcription Factors:*

APOA1: Correlation coefficient of 0.741

AFP: Correlation coefficient of 0.603

TTR: Correlation coefficient of 0.586

PROX1: Correlation coefficient of 0.413

ATF5: Correlation coefficient of 0.398

**Correlations with AFP:**

*Top Negatively Correlated Transcription Factors:*

HLX: Correlation coefficient of -0.065

CRIP2: Correlation coefficient of -0.059

NFIB: Correlation coefficient of -0.055

NFIA: Correlation coefficient of -0.054

CEBPB: Correlation coefficient of -0.039

**Correlations with HNF4A:**

*Top Positively Correlated Transcription Factors:*

APOA1: Correlation coefficient of 0.234

AFP: Correlation coefficient of 0.212

PPARA: Correlation coefficient of 0.174

TTR: Correlation coefficient of 0.168

ALB: Correlation coefficient of 0.158

## Observations:
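**ALB** shows its strongest positive correlations with other liver-expressed genes: APOA1 (0.741), AFP (0.603), and TTR (0.586), indicating possible co-regulation in liver function or development. These three are secreted liver proteins rather than transcription factors; the strongest transcription-factor correlations in the list are PROX1 (0.413) and ATF5 (0.398).

**AFP** is positively correlated with ALB (0.603) and HNF4A (0.212). Its most negative correlations are all below 0.07 in magnitude (HLX, at -0.065, is the largest), so this dataset shows essentially no factor with a meaningful negative linear relationship to AFP.

**HNF4A** correlates positively, but only weakly, with several genes associated with liver function (at most 0.234, with APOA1). This is consistent with its role in liver development and metabolism, but correlations this small are not strong support on their own.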
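Dan's request combines two conditions: a positive correlation with ALB and a negative correlation with AFP. The lists above rank each condition separately, so a combined filter can make the overlap explicit. The following is a minimal sketch, not the notebook's actual method. It assumes a toy DataFrame `toy` with made-up columns `TF_A` to `TF_C`, a hypothetical helper `alb_up_afp_down`, and an arbitrary cutoff `min_abs_r=0.3`; none of these come from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data`; TF_A..TF_C are made-up names.
rng = np.random.default_rng(0)
n = 200
alb = rng.normal(size=n)
afp = rng.normal(size=n)
toy = pd.DataFrame({
    "ALB": alb,
    "AFP": afp,
    "TF_A": alb - afp + rng.normal(scale=0.5, size=n),  # rises with ALB, falls with AFP
    "TF_B": alb + afp + rng.normal(scale=0.5, size=n),  # rises with both
    "TF_C": rng.normal(size=n),                         # unrelated to either
})


def alb_up_afp_down(df, min_abs_r=0.3):
    """Return genes with r(ALB) >= min_abs_r and r(AFP) <= -min_abs_r."""
    corr = df.corr()
    table = pd.DataFrame({"r_ALB": corr["ALB"], "r_AFP": corr["AFP"]})
    table = table.drop(index=["ALB", "AFP"])  # drop the two reference genes themselves
    keep = (table["r_ALB"] >= min_abs_r) & (table["r_AFP"] <= -min_abs_r)
    return table[keep].sort_values("r_ALB", ascending=False)


candidates = alb_up_afp_down(toy)
print(candidates)
```

On the real data the call would be `alb_up_afp_down(data.iloc[:, 1:])`. Because ALB and AFP are themselves positively correlated here (0.603), the filter may well return few or no rows, which would be a finding in its own right.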

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns  # Seaborn for enhanced visualization

# Function to create enhanced scatter plots with regression line and correlation coefficient
def plot_enhanced_correlation(data, tf1, tf2, title):
    plt.figure(figsize=(6, 4))
    # Scatter plot with regression line
    sns.regplot(x=tf1, y=tf2, data=data, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
    # Calculate the Pearson correlation coefficient
    correlation_coef = data[[tf1, tf2]].corr().iloc[0, 1]
    plt.title(f'Correlation between {tf1} and {tf2}: {correlation_coef:.2f}\n{title}')
    plt.xlabel(tf1)
    plt.ylabel(tf2)
    plt.grid(True)
    plt.show()

# Examples of enhanced plots
plot_enhanced_correlation(data, 'ALB', 'APOA1', 'Positive Correlation')
plot_enhanced_correlation(data, 'ALB', 'TTR', 'Positive Correlation')
plot_enhanced_correlation(data, 'AFP', 'HLX', 'Negative Correlation')
plot_enhanced_correlation(data, 'AFP', 'CRIP2', 'Negative Correlation')
plot_enhanced_correlation(data, 'HNF4A', 'APOA1', 'Positive Correlation')
plot_enhanced_correlation(data, 'HNF4A', 'PPARA', 'Positive Correlation')


## These plots are hard to read

To better visualize the correlations, and especially to distinguish positive from negative trends, the scatter plots above use the first two techniques below. The third is a further option for dense plots:

1. **Linear Regression Line**: Adding a linear regression line to the scatter plot can help illustrate the direction and strength of the relationship. This line effectively summarizes the correlation by showing the best fit through the data points.

2. **Correlation Coefficient:** Displaying the correlation coefficient on the plot can immediately indicate the strength and direction of the correlation (positive or negative).

3. **Density Contours:** For dense plots, adding contours can help show the concentration of points and highlight the trend areas.

## Detailed Correlation Analysis:
Investigate the correlations more deeply to identify any potential regulatory mechanisms or interactions between transcription factors.

Let's start with hierarchical clustering of the correlation matrix to identify clusters of transcription factors that might be working together.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Calculate the correlation matrix
correlation_matrix = data.iloc[:, 1:].corr()  # Exclude the sample identifier column

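# Perform hierarchical clustering.
# Note: linkage() treats each row of the correlation matrix as a feature vector and,
# by default, uses Euclidean distance between rows (here with average linkage).
hc = hierarchy.linkage(correlation_matrix, method='average')

# Plot the dendrogram
plt.figure(figsize=(12, 10))
dendro = hierarchy.dendrogram(hc, labels=correlation_matrix.columns.tolist(), leaf_rotation=90)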
plt.title('Hierarchical Clustering of Transcription Factors')
plt.xlabel('Transcription Factors')
plt.ylabel('Distance')
plt.show()


### Interpretation of the Transcription Factor Correlation Network

The network visualization built with `networkx` (in the code cell further down) represents transcription factors as nodes and their correlations as edges. The strength of each correlation is encoded in the edge color. Interpreting the graph involves several key aspects:

#### Nodes
Each node in the graph represents a transcription factor. The presence of a node indicates that this transcription factor has at least one strong correlation (either positive or negative) with another transcription factor beyond a specified threshold.

#### Edges
Edges between nodes represent correlations between transcription factors that exceed the threshold. Each edge carries the correlation coefficient as its weight:
- In the plot, edge color encodes the absolute value of the correlation on the viridis scale, so yellow edges are stronger than purple ones. All edges are drawn with the same width.
- The sign of the correlation is not shown; positive and negative correlations of equal magnitude get the same color.

#### Threshold Significance
The threshold, set here to $|r| > 0.5$, determines which correlations are included in the network:
- **Threshold Value**: This value is chosen so that only the strongest correlations are visualized, reducing noise from weaker, potentially less meaningful ones.
- **Interpretation**: A higher threshold means that only stronger and potentially more biologically relevant interactions are shown, which helps focus on the most influential transcription factor interactions.

#### Analytical Insights
From the network, one can derive insights such as:
- **Clusters**: Groups of interconnected nodes (transcription factors) might indicate a regulatory complex or a pathway with closely related functions.
- **Hub Nodes**: Nodes with many connections might be key regulators, affecting or being affected by many other transcription factors.
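- **Isolated Nodes**: Sparsely connected nodes might indicate transcription factors with specific, less integrated roles in the cellular context. Transcription factors with no correlation above the threshold do not appear in the graph at all, because nodes are only added together with an edge.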

#### Biological Relevance
To ascertain the biological relevance of the observed patterns:
- **Database Validation**: It is advisable to validate interesting findings using biological databases or literature to see if the correlations correspond to known pathways or interactions.
- **Experimental Verification**: Ultimately, experimental studies are needed to verify the predicted interactions and their functional implications in biological processes.

This network analysis provides a powerful overview of potential regulatory mechanisms and can guide further detailed investigations into the roles of specific transcription factors in biological processes.
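The hub and cluster ideas above can also be read off numerically rather than by eye. The sketch below builds a thresholded graph and reports node degrees and connected components. It assumes a hypothetical 5x5 correlation matrix `toy_corr` with placeholder gene names `G1` to `G5` and a helper `build_network` that is not part of the notebook. Connected components are only one simple notion of "cluster".

```python
import networkx as nx
import pandas as pd

# Hypothetical correlation matrix; G1..G5 are placeholder gene names.
names = ["G1", "G2", "G3", "G4", "G5"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.6, 0.1, 0.0],
     [0.8, 1.0, 0.7, 0.2, 0.1],
     [0.6, 0.7, 1.0, 0.0, 0.2],
     [0.1, 0.2, 0.0, 1.0, -0.9],
     [0.0, 0.1, 0.2, -0.9, 1.0]],
    index=names, columns=names,
)


def build_network(corr, threshold=0.5):
    """Add one edge per gene pair whose |r| exceeds the threshold."""
    graph = nx.Graph()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = float(corr.loc[a, b])
            if abs(r) > threshold:
                graph.add_edge(a, b, weight=r)
    return graph


G_toy = build_network(toy_corr)

# Degree = number of above-threshold partners; high-degree nodes are hub candidates.
degrees = dict(G_toy.degree())

# Connected components as a simple stand-in for "clusters", largest first.
clusters = sorted((sorted(c) for c in nx.connected_components(G_toy)),
                  key=len, reverse=True)

print(degrees)
print(clusters)
```

Note that the `G4`-`G5` edge comes from a negative correlation (-0.9). The graph links genes by the magnitude of the correlation, so the sign has to be read from the edge weight.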


# Interpretation of the Dendrogram in Context of Natesh's Gene Circuit ideas (from **THIS** data):

1. **Cluster Analysis:**

* If FOXA, TBX3, and CEBPA cluster closely together, it would suggest that these transcription factors might share similar expression patterns or regulatory inputs, supporting their hypothesized interconnected roles in gene regulation circuits.
* If they are distant from each other in the clustering hierarchy, it might indicate that they operate in distinct regulatory pathways or under different regulatory mechanisms, which could challenge or refine the hypothesis.

2. **Integration with Other Factors:**

* The position of FOXA, TBX3, and CEBPA relative to other transcription factors can provide insights into broader networks. For example, if they cluster with other known key regulators of liver development or function, it would support their critical roles in these processes.

* Identifying unexpected clustering with other transcription factors could also uncover new avenues of research or suggest additional components of the gene circuitry that were not previously considered.

3. **Distance and Linkage:**

* The vertical distance (height of the linkages) between clusters indicates the degree of similarity; shorter distances suggest higher similarity. This can be critical in assessing the strength of the association between different transcription factors.

* The clustering method (e.g., single, complete, or average linkage) and the distance metric (e.g., Euclidean, Manhattan) can affect the interpretation, so they should be considered when drawing conclusions from the dendrogram. The dendrogram above uses average linkage with SciPy's default Euclidean distance between rows of the correlation matrix.

4. **Potential Insights:**

**Support for Gene Circuit Idea:** If FOXA, TBX3, and CEBPA are closely linked in this hierarchical clustering, it would lend support to Natesh’s hypothesis by indicating a co-regulated nature or shared functional pathways.

**Revision of Gene Circuit Idea:** Alternatively, significant distance between these transcription factors in the dendrogram might suggest that the hypothesis needs revision or that the relationships are context-dependent, varying across conditions or cell types that may not have been uniformly sampled or represented in the dataset.

**Next Steps for Analysis:**

* **Deeper Examination of Clusters:** Analyze the specific biological functions or pathways associated with other transcription factors in the same clusters as FOXA, TBX3, and CEBPA to understand potential mechanisms or interactions.

* **Cross-referencing with External Data:** Validate the clustering results with external databases or literature to ensure that the observed patterns reflect known biological relationships and are not artifacts of the dataset or methodology.

* **Experimental Validation:** Consider designing experiments to test the regulatory interactions suggested by the clustering, such as gene knockdown or overexpression studies, to observe impacts on other members of the same cluster.
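The point about linkage method and distance metric can be made concrete. One common alternative to clustering the rows of the correlation matrix with Euclidean distance is to turn the correlations themselves into distances, $d = 1 - r$, and cluster on those. The sketch below is an assumption-laden illustration: `toy_corr` and the names `G1` to `G4` are hypothetical, and the cut height `t=0.5` is arbitrary. It also shows how to check whether a chosen set of genes lands in the same cluster.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical correlation matrix; G1..G4 are placeholder gene names.
names = ["G1", "G2", "G3", "G4"]
toy_corr = pd.DataFrame(
    [[1.0, 0.9, 0.1, 0.0],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.0, 0.1, 0.8, 1.0]],
    index=names, columns=names,
)

# Correlation distance: highly correlated genes are close (d near 0).
dist = 1.0 - toy_corr.values
np.fill_diagonal(dist, 0.0)                 # guard against rounding on the diagonal
condensed = squareform(dist, checks=False)  # linkage() expects the condensed form

Z = linkage(condensed, method="average")

# Cut the tree at an arbitrary height to get flat cluster labels.
labels = fcluster(Z, t=0.5, criterion="distance")
cluster_of = dict(zip(names, labels))

# Example check: do two genes of interest share a cluster?
same_cluster = cluster_of["G1"] == cluster_of["G2"]
print(cluster_of, same_cluster)
```

Applied to the notebook's full correlation matrix, the same check could be run for the FOXA, TBX3, and CEBPA columns to see whether they fall into one cluster at a given cut height.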

In [None]:
import networkx as nx

# Threshold for strong correlations
threshold = 0.5

# Create the network
G = nx.Graph()

# Add an edge for each pair of transcription factors whose |r| exceeds the threshold
cols = correlation_matrix.columns
for i, col1 in enumerate(cols):
    for col2 in cols[i + 1:]:  # visit each unordered pair once
        r = correlation_matrix.loc[col1, col2]
        if abs(r) > threshold:
            G.add_edge(col1, col2, weight=r)

# Draw the network
plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, seed=42)  # For consistent layout
edges = G.edges(data=True)
# Edge color encodes |r|; 'attrs' avoids reusing the name of the 'data' DataFrame
weights = [abs(attrs['weight']) for _, _, attrs in edges]
nx.draw_networkx(G, pos, edge_color=weights, width=4, edge_cmap=plt.cm.viridis,
                 node_color='lightblue', with_labels=True, font_weight='bold')
plt.title('Network of Transcription Factors with Strong Correlations')
plt.show()


## Discussion of above graph:

**ALB (Albumin)** is connected to **AFP (Alpha-fetoprotein)**, **APOA1 (Apolipoprotein A1)**, and **TTR (Transthyretin)**.
*These connections indicate strong correlations between the expression levels of these genes.*

1. AFP is directly connected to APOA1 and indirectly to TTR through ALB.

2. TTR is linked directly to ALB and indirectly to AFP and APOA1 through ALB.

3. The colors of the edges encode the absolute value of the correlation on the viridis scale: purple edges are the weakest correlations shown (closest to the 0.5 threshold) and yellow edges are the strongest. The sign of the correlation (positive/negative) is not encoded.


# Discussion of relation to gene circuit that Natesh is interested in.

* **Gene Circuit Interest:** The provided network doesn’t specifically include the FOXA, TBX3, and CEBPA transcription factors mentioned by the PI. However, if these genes are part of a broader hypothesis involving liver function or development, then examining their relationships with ALB, AFP, and others shown could be significant.

* **Biological Context:** ALB, AFP, APOA1, and TTR are all significant in liver biology. ALB and AFP are major plasma proteins synthesized in the liver, APOA1 is the major protein component of HDL in plasma, and TTR is a liver-secreted transport protein for thyroxine and retinol. Their correlations might reflect shared regulatory mechanisms or responses to similar physiological conditions, potentially valuable in understanding liver function or diseases.

* **Broader Implications:** If the goal is to establish the importance of another gene circuit (FOXA-TBX3-CEBPA) and its relevance in the context of other data, showing how these well-known liver-related genes interact could support arguments for the physiological and developmental relevance of the FOXA-TBX3-CEBPA circuit, especially if it can be shown that they are involved in similar or interconnected pathways.
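Since the circuit genes do not appear in the thresholded network, their correlations with the liver markers can be looked up directly instead. The sketch below slices a correlation matrix down to circuit genes (rows) versus marker genes (columns). It runs on random toy data, so the numbers are meaningless. The helper `circuit_vs_markers` is hypothetical, and the column names `FOXA1` and `FOXA2` are assumed to be how the FOXA genes are labeled.

```python
import numpy as np
import pandas as pd

# Random toy data with the assumed column names; values carry no biological meaning.
rng = np.random.default_rng(2)
cols = ["FOXA1", "FOXA2", "TBX3", "CEBPA", "ALB", "AFP", "HNF4A", "TTR"]
toy = pd.DataFrame(rng.normal(size=(150, len(cols))), columns=cols)


def circuit_vs_markers(df, circuit, markers):
    """Correlations of circuit genes (rows) with marker genes (columns).

    Genes missing from df are skipped rather than raising a KeyError.
    """
    present_circuit = [g for g in circuit if g in df.columns]
    present_markers = [g for g in markers if g in df.columns]
    return df.corr().loc[present_circuit, present_markers]


table = circuit_vs_markers(
    toy,
    circuit=["FOXA1", "FOXA2", "TBX3", "CEBPA"],
    markers=["ALB", "AFP", "HNF4A"],
)
print(table.round(3))
```

On the notebook's data the first argument would be `data.iloc[:, 1:]`. Any value in the table above 0.5 in magnitude would correspond to an edge in the network.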

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Calculate the correlation matrix
correlation_matrix = data.iloc[:, 1:].corr()  # Exclude the sample identifier column

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(12, 12))

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlation_matrix, mask=mask, cmap='coolwarm', vmax=1, vmin=-1, center=0,
            square=True, linewidths=1, cbar_kws={"shrink": .5}, annot=False)

plt.title('Heatmap of Transcription Factor Correlations')
plt.show()


In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# 'data' is assumed to be already scaled (the input file is the rescaled version).
# The first column holds the sample identifiers, so it is excluded.
X_scaled = data.iloc[:, 1:].values

# Initialize PCA and fit the scaled data
pca = PCA(n_components=2)  # Adjust the number of components as needed
principal_components = pca.fit_transform(X_scaled)

# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['Principal Component 1', 'Principal Component 2'])

# Plotting the first two principal components
plt.figure(figsize=(8, 6))
plt.scatter(pca_df['Principal Component 1'], pca_df['Principal Component 2'])
plt.title('PCA Result - First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()

# Optionally, check the explained variance ratio to see how much information is captured by the first two components
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")


In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Assuming 'data' is your DataFrame and it's already scaled
X_scaled = data.iloc[:, 1:].values  # Exclude the first column if it's an identifier

# Fit PCA with all components so the full variance spectrum is available.
# fit() returns the fitted PCA object, not the transformed data, so the result is not stored.
pca = PCA()
pca.fit(X_scaled)

# Explained variance for each component
explained_variance = pca.explained_variance_ratio_

# Creating a scree plot
plt.figure(figsize=(10, 6))
plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.5, align='center',
        label='Individual explained variance')
plt.step(range(1, len(explained_variance) + 1), np.cumsum(explained_variance), where='mid',
         label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.title('Scree Plot of PCA')
plt.legend(loc='best')
plt.tight_layout()
plt.show()


# The data has many principal components (about 40)
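A scree plot is easier to act on with a number attached: how many components are needed to reach a given share of the variance. The sketch below computes that from an array of explained-variance ratios such as the `explained_variance` array in the cell above. The helper name `components_for_variance`, the toy ratios, and the targets are illustrative assumptions.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, target=0.90):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the index of the first cumulative value >= target
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative))  # cap in case rounding keeps the total just below target


# Hypothetical ratios for a 6-component decomposition (they sum to 1).
toy_ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

k_75 = components_for_variance(toy_ratios, target=0.75)
k_95 = components_for_variance(toy_ratios, target=0.95)
print(k_75, k_95)
```

In the notebook, `components_for_variance(explained_variance)` would report how many of the roughly 40 components are needed for 90% of the variance. If that number is large, the two-component scatter plot shows only a small part of the structure.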

# Let's drill down into Natesh's Gene Circuit idea, or try to address it as directly as this data can

To delve into the hypothesis concerning the FOXA, TBX3, and CEBPA transcription factors (TFs) and their significance in gene regulatory circuits, we can start with a structured analysis that specifically targets these TFs. Here’s how we can approach it:

**Correlation Analysis**

First, we need to analyze the correlations among FOXA, TBX3, and CEBPA, as well as their correlations with other genes in the dataset. This will help us understand their interaction dynamics.

**Steps to Perform Correlation Analysis:**

1. **Extract Data for Relevant Transcription Factors:** Isolate the columns for FOXA, TBX3, CEBPA, and any other genes of interest from the dataset.
2. **Compute the Correlation Matrix:** Calculate the correlation coefficients among these transcription factors to identify any significant relationships.
3. **Visualize the Correlations:** Use a heatmap or correlation matrix plot to visually represent these correlations, making it easier to identify patterns or strong interactions.
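A correlation matrix alone does not say whether a relationship is statistically significant. One way to add that, sketched below, is to compute a p-value for each pair with `scipy.stats.pearsonr`. The toy DataFrame is synthetic, with `FOXA1` and `FOXA2` deliberately constructed to be correlated, and the helper `pairwise_pearson` is a hypothetical name. The gene column names are assumptions about how the dataset labels these factors.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Synthetic toy data: FOXA1 and FOXA2 share a common signal, the others are noise.
rng = np.random.default_rng(4)
n = 120
base = rng.normal(size=n)
toy = pd.DataFrame({
    "FOXA1": base + rng.normal(scale=0.5, size=n),
    "FOXA2": base + rng.normal(scale=0.5, size=n),
    "TBX3": rng.normal(size=n),
    "CEBPA": rng.normal(size=n),
})


def pairwise_pearson(df, genes):
    """Pearson r and two-sided p-value for every unordered pair of genes."""
    rows = []
    for a, b in combinations(genes, 2):
        r, p = pearsonr(df[a], df[b])
        rows.append({"gene_1": a, "gene_2": b, "r": r, "p_value": p})
    # Most significant pairs first
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


result = pairwise_pearson(toy, list(toy.columns))
print(result)
```

With only six pairs, a simple correction such as multiplying each p-value by the number of pairs (Bonferroni) is a reasonable safeguard against reading too much into a single small p-value.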

# Heatmap

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 'data' is assumed to contain these columns; adjust the FOXA entries to the specific FOXA genes involved
focus_tfs = ['FOXA1', 'FOXA2', 'TBX3', 'CEBPA']
tf_data = data[focus_tfs]

# Compute the correlation matrix for the focus TFs only.
# A separate name keeps the full correlation_matrix from earlier cells intact.
focus_correlation_matrix = tf_data.corr()

# Plot the correlation matrix on a fixed, symmetric color scale (as in the full heatmap above)
plt.figure(figsize=(8, 6))
sns.heatmap(focus_correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0, linewidths=.5)
plt.title('Correlation Matrix of Selected Transcription Factors FOXA1, FOXA2, TBX3, CEBPA')
plt.show()
