# SMARTER Sheep Dimensionality Reduction

This notebook is a simple demonstration of how to use UMAP, t-SNE and PCA to visualize the SMARTER Sheep dataset. Since UMAP and t-SNE do not work with missing data and take a long time on raw genotypes, we first filter out variants with more than 10% missing calls using `PLINK`, impute the remaining missing data with `beagle`, and then reduce the dimensionality of the data with a PCA computed by `PLINK`:

```bash
cd data/interim
wget ftp://webserver.ibba.cnr.it/smarter/SHEEP/OAR3/SMARTER-OA-OAR3-top-0.4.10.zip
unzip SMARTER-OA-OAR3-top-0.4.10.zip -d SMARTER-OA-OAR3-top-0.4.10
plink --chr-set 26 no-xy no-mt --allow-no-sex --bfile SMARTER-OA-OAR3-top-0.4.10/SMARTER-OA-OAR3-top-0.4.10 \
    --geno 0.1 --recode vcf bgz --out SMARTER-OA-OAR3-top-0.4.10/SMARTER-OA-OAR3-top-0.4.10
cd SMARTER-OA-OAR3-top-0.4.10
# beagle treats `out` as a prefix and appends `.vcf.gz` itself
beagle gt=SMARTER-OA-OAR3-top-0.4.10.vcf.gz out=SMARTER-OA-OAR3-top-0.4.10_imputed
plink --chr-set 26 no-xy no-mt --allow-no-sex --vcf SMARTER-OA-OAR3-top-0.4.10_imputed.vcf.gz --pca 20 --out SMARTER-OA-OAR3-top-0.4.10_imputed
cd -
```
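
To make the `--geno 0.1` filter concrete: PLINK drops every variant whose missing call rate exceeds 10%. The sketch below reproduces that rule in plain Python on a made-up VCF fragment. The variant records and the `make_record` and `missing_rate` helpers are hypothetical and only illustrate the logic; they are not part of PLINK or of the SMARTER dataset.

```python
# Made-up VCF data lines (tab-separated): 9 fixed columns, then 10 sample genotypes.
# "./." marks a missing call.
def make_record(pos, genotypes):
    fixed = ["1", str(pos), f"snp{pos}", "A", "G", ".", ".", ".", "GT"]
    return "\t".join(fixed + genotypes)

records = [
    make_record(100, ["0/0"] * 10),               # no missing calls
    make_record(200, ["0/1"] * 9 + ["./."]),      # 10% missing
    make_record(300, ["1/1"] * 8 + ["./."] * 2),  # 20% missing
]

def missing_rate(vcf_line):
    """Fraction of samples whose genotype call is missing in one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    calls = [sample.split(":")[0] for sample in fields[9:]]
    n_missing = sum(1 for call in calls if "." in call)
    return n_missing / len(calls)

GENO_THRESHOLD = 0.1  # same value as --geno 0.1

# Keep a variant unless its missing rate is strictly above the threshold
kept = [line.split("\t")[2] for line in records if missing_rate(line) <= GENO_THRESHOLD]
print(kept)  # ['snp100', 'snp200']
```

Variants that pass the filter can still contain a few missing calls, which is why the imputation step with `beagle` follows.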

In [None]:
import pandas as pd
import umap
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.manifold import TSNE

from src.features.utils import get_interim_dir

In [None]:
# Load the PCA results from PLINK (.eigenvec file)
pca_df = pd.read_table(get_interim_dir() / "SMARTER-OA-OAR3-top-0.4.10/SMARTER-OA-OAR3-top-0.4.10_imputed.eigenvec", sep=r'\s+', header=None)

# Rename the columns for easier reference
pca_df.columns = ['FID', 'IID'] + [f'PC{i}' for i in range(1, 21)]

# Drop the FID and IID columns (if needed) to work with PCA components only
pca_data = pca_df.drop(columns=['FID', 'IID'])

# Check the shape of the PCA data
print(pca_data.shape)
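
PLINK's `--pca` also writes a `.eigenval` file next to the `.eigenvec` file, with one eigenvalue per retained component. A rough way to see how much each component contributes relative to the others is sketched below. The eigenvalues here are invented for illustration, and the shares are relative to the retained components only, not to the total genetic variance.

```python
import io

# Invented stand-in for the content of a PLINK .eigenval file (one value per line)
eigenval_text = "8.0\n4.0\n2.0\n1.0\n1.0\n"

eigenvalues = [float(line) for line in io.StringIO(eigenval_text) if line.strip()]
total = sum(eigenvalues)

# Share of each component among the retained components only
shares = [value / total for value in eigenvalues]

for i, share in enumerate(shares, start=1):
    print(f"PC{i}: {share:.1%}")
```

With the real data, the `io.StringIO(...)` object would be replaced by an open handle on the `.eigenval` file written under the same `--out` prefix as the `.eigenvec` file.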

Apply dimensionality reduction with three different methods: PCA, t-SNE and UMAP. The table below compares them:

| **Aspect**          | **PCA**                                 | **t-SNE**                               | **UMAP**                               |
|---------------------|-----------------------------------------|-----------------------------------------|----------------------------------------|
| **Type**            | Linear                                  | Nonlinear                               | Nonlinear                              |
| **Goal**            | Preserve global variance (overall structure) | Preserve local relationships (clusters) | Preserve both local and some global structure |
| **Computational Efficiency** | Very fast, scalable                 | Slower, computationally expensive       | Fast, scalable                         |
| **Interpretability** | High (variance explained by components) | Low (non-metric space, only visualization) | Low (non-metric space, only visualization) |
| **Strengths**       | Captures global structure, simple to understand | Captures local clusters, good for complex data | Captures local clusters, efficient on large datasets |
| **Weaknesses**      | Cannot capture nonlinear structure      | Can’t preserve global structure, slow   | Requires parameter tuning              |
| **Best For**        | Linear relationships, feature extraction | Visualizing complex, clustered data     | Visualizing large datasets with both local and global structure |

### When to Use Each Method:
- **PCA**: Use PCA when you're interested in understanding the overall **global structure** of the data or when you want a fast, simple dimensionality reduction technique for visualization or further analysis.
  
- **t-SNE**: Use t-SNE when your goal is to **visualize clusters** or uncover **local relationships** in complex data, especially for datasets with **nonlinear structure**. However, it is best suited for smaller datasets due to its computational cost.

- **UMAP**: Use UMAP when working with **large datasets** that require capturing **both local and some global structure**. UMAP is faster and more scalable than t-SNE, making it a preferred choice for high-dimensional biological data like single-cell RNA-seq or other big data applications.
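
The contrast between a linear and a nonlinear method can be seen on a small synthetic example. The sketch below assumes only scikit-learn and NumPy, and uses made-up Gaussian blobs rather than genotype data. It runs PCA and t-SNE on the same matrix: PCA reports how much variance each component explains, while t-SNE returns only coordinates whose axes carry no such interpretation. UMAP is left out to keep the example light on dependencies.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic data: 150 points in 20 dimensions, grouped into 3 blobs
X, labels = make_blobs(n_samples=150, n_features=20, centers=3, random_state=42)

# PCA: linear projection, each component has an explained-variance share
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print("PCA explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

# t-SNE: nonlinear embedding; perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
print("t-SNE embedding shape:", X_tsne.shape)
```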

In [None]:
# Plot the PCA results
plt.figure(figsize=(10, 6))
plt.scatter(
    pca_df['PC1'],
    pca_df['PC2'],
    s=2, alpha=0.7
)
plt.title('PCA')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()

In [None]:
# Run tSNE on the PCA-reduced data
tsne_model = TSNE(
    n_components=3,
    random_state=42
)
tsne_results = tsne_model.fit_transform(pca_data)

In [None]:
# Run UMAP on the PCA-reduced data
umap_model = umap.UMAP(
    n_neighbors=15,
    min_dist=0.1,
    metric='euclidean',
    n_components=3,
    random_state=42
)
umap_results = umap_model.fit_transform(pca_data)

In [None]:
assignment_df = pca_df[['FID', 'IID']].copy()
assignment_df.rename(columns={'FID': 'breed', 'IID': 'sample'}, inplace=True)

# Add UMAP results to the DataFrame
assignment_df['umap-1'] = umap_results[:, 0]
assignment_df['umap-2'] = umap_results[:, 1]
assignment_df['umap-3'] = umap_results[:, 2]

# Add tSNE results to the DataFrame
assignment_df['tsne-1'] = tsne_results[:, 0]
assignment_df['tsne-2'] = tsne_results[:, 1]
assignment_df['tsne-3'] = tsne_results[:, 2]

assignment_df.head()

In [None]:
# plot the tSNE results
plt.figure(figsize=(10, 6))
plt.scatter(
    assignment_df['tsne-1'],
    assignment_df['tsne-2'],
    s=2, alpha=0.7
)
plt.title('tSNE clustering of PCA-Reduced Genotype data')
plt.xlabel('TSNE-1')
plt.ylabel('TSNE-2')
plt.show()

In [None]:
# Plot the UMAP results
plt.figure(figsize=(10, 6))
plt.scatter(
    assignment_df['umap-1'],
    assignment_df['umap-2'],
    s=2, alpha=0.7
)
plt.title('UMAP Clustering on PCA-Reduced Genotype Data')
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-2')
plt.show()

In [None]:
# Plot the UMAP results (components 2 and 3)
plt.figure(figsize=(10, 6))
plt.scatter(
    assignment_df['umap-2'],
    assignment_df['umap-3'],
    s=2, alpha=0.7
)
plt.title('UMAP Clustering on PCA-Reduced Genotype Data')
plt.xlabel('UMAP-2')
plt.ylabel('UMAP-3')
plt.show()

In [None]:
# Plot the UMAP results (components 1 and 3)
plt.figure(figsize=(10, 6))
plt.scatter(
    assignment_df['umap-1'],
    assignment_df['umap-3'],
    s=2, alpha=0.7
)
plt.title('UMAP Clustering on PCA-Reduced Genotype Data')
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-3')
plt.show()

In [None]:
# Generate a large color palette by combining multiple qualitative color palettes
colors = px.colors.qualitative.Set1 + \
    px.colors.qualitative.Antique + \
    px.colors.qualitative.Vivid + \
    px.colors.qualitative.Dark24 + \
    px.colors.qualitative.Bold
num_colors = len(colors)

# List of available marker symbols
markers = ['circle', 'cross', 'diamond', 'square', 'x']
num_markers = len(markers)

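# Assign colors and markers to each breed from its position in the sorted breed list,
# so the assignment stays the same across sessions (unlike the built-in hash() of a string)
breed_rank = {breed: i for i, breed in enumerate(sorted(assignment_df['breed'].unique()))}
assignment_df['color'] = assignment_df['breed'].map(lambda x: colors[breed_rank[x] % num_colors])
assignment_df['marker'] = assignment_df['breed'].map(lambda x: markers[breed_rank[x] % num_markers])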
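
When assigning styles to breeds, it helps to keep the mapping reproducible across sessions (Python's built-in `hash()` of a string, for instance, varies between processes unless `PYTHONHASHSEED` is fixed). One possible way to also guarantee that no two breeds share the same style is sketched below: it enumerates every color and marker pair and hands them out in sorted breed order. The short palette and most of the breed codes are placeholders, not values from the dataset.

```python
from itertools import product

# Placeholder palette, marker list and breed codes (illustrative only)
colors = ["#e41a1c", "#377eb8", "#4daf4a"]
markers = ["circle", "cross", "diamond"]
breeds = ["TEX", "MER", "SUF", "FRZ", "LAC"]

# All (color, marker) pairs; 3 colors x 3 markers give 9 distinct styles
styles = list(product(colors, markers))
if len(breeds) > len(styles):
    raise ValueError("not enough color/marker combinations for all breeds")

# Sorting makes the mapping independent of row order and of the Python session
style_by_breed = dict(zip(sorted(breeds), styles))

for breed, (color, marker) in style_by_breed.items():
    print(breed, color, marker)
```

With the notebook's data, `breeds` would come from `assignment_df['breed'].unique()`.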

In [None]:
# Create a plotly scatter plot with custom markers and colors
fig = go.Figure()

# Plot each breed with its assigned color and marker
for breed in assignment_df['breed'].unique():
    breed_data = assignment_df[assignment_df['breed'] == breed]
    fig.add_trace(go.Scatter(
        x=breed_data['umap-1'],
        y=breed_data['umap-2'],
        mode='markers',
        marker=dict(
            color=breed_data['color'],
            symbol=breed_data['marker'],
            size=2,
            opacity=0.7
        ),
        name=breed,
        text=breed_data['breed'],
        hoverinfo='text'
    ))

# Set the figure size and display the plot
fig.update_layout(
    title=f"UMAP Clustering with {len(assignment_df['breed'].unique())} Breeds",
    xaxis_title='UMAP-1',
    yaxis_title='UMAP-2',
    width=1000,
    height=800,
    showlegend=False  # Hide the legend to avoid clutter
)

fig.show()

In [None]:

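def plot_highlight_breed(assignment_df, highlight_breed):
    """Plot the first two UMAP components, highlighting one breed in red."""
    # Create a plotly scatter plot with custom markers and colors
    fig = go.Figure()

    # Plot the highlighted breed in red and all other breeds in gray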
    for breed in assignment_df['breed'].unique():
        breed_data = assignment_df[assignment_df['breed'] == breed]
        color = 'gray' if breed != highlight_breed else "red"
        marker = "circle" if breed == highlight_breed else 'circle-open'

        fig.add_trace(go.Scatter(
            x=breed_data['umap-1'],
            y=breed_data['umap-2'],
            mode='markers',
            marker=dict(
                color=color,
                symbol=marker,
                size=2 if breed != highlight_breed else 6,  # Highlight breed with larger markers
                opacity=0.7
            ),
            name=breed,
            text=breed_data['breed'],
            hoverinfo='text'
        ))

    # Set the figure size and display the plot
    fig.update_layout(
        title=f"UMAP Clustering Highlighting {highlight_breed}",
        xaxis_title='UMAP-1',
        yaxis_title='UMAP-2',
        width=1000,
        height=800,
        showlegend=False  # Hide the legend to avoid clutter
    )

    fig.show()

# Example of how to use the function
plot_highlight_breed(assignment_df, "FRZ")


In [None]:
# Create a 3D scatter plot with unique color-marker combinations for each breed
fig = go.Figure()

# Plot each breed with its assigned color and marker
for breed in assignment_df['breed'].unique():
    breed_data = assignment_df[assignment_df['breed'] == breed]
    fig.add_trace(go.Scatter3d(
        x=breed_data['umap-1'],
        y=breed_data['umap-2'],
        z=breed_data['umap-3'],  # Add the third dimension for 3D plot
        mode='markers',
        marker=dict(
            color=breed_data['color'].iloc[0],  # Color assigned to this breed
            symbol=breed_data['marker'].iloc[0],  # Marker assigned to this breed
            size=2,
            opacity=0.7
        ),
        name=breed,
        text=breed_data['breed'],
        hoverinfo='text'
    ))

# Update layout to set the figure size and title
fig.update_layout(
    title=f"3D UMAP Clustering with {len(assignment_df['breed'].unique())} Breeds",
    width=1000,
    height=800,
    scene=dict(
        xaxis_title="UMAP-1",
        yaxis_title="UMAP-2",
        zaxis_title="UMAP-3"
    ),
    showlegend=False  # Hide the legend to avoid clutter
)

# Show the interactive 3D plot
fig.show()