Script name: 
    PlotDistMatrices.py

Description: 
    Based on identity and alignment scores produced by another student's script, this script creates four dendrograms displaying the differences and similarities between the "Romanov" family and other individuals. For training purposes, the script was built with a mock comparison file. In the final runs, it should work with the real input file, provided that file uses the same format.
        
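User defined functions: 
    create_distance_matrix(df, score_col): builds a symmetric distance matrix (100 - score) from the pairwise scores
    plot_dendrogram(dist_matrix, title, filename, show_plot): clusters the distance matrix, then plots and saves the dendrogram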
Procedure:
    1. Import numpy, pandas, matplotlib and scipy
    2. Assess the format of the input file
    3. Identify the two different clusters in the file
    4. Define plot style
    5. Create the actual dendrograms
 
Input:
    input_file (This is the output_file of student 2)

Output:
    dendrogram_mtDNA_alignment.png, dendrogram_mtDNA_identity.png,
    dendrogram_Y_alignment.png, dendrogram_Y_identity.png

Usage: 
    python PlotDistMatrices.py input_file output_file

Version: 1.0
Date: 2025-10-23
Author: Ariane Neumann
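
Step 2 of the procedure (assessing the format of the input file) is not shown as a separate step in the cells below. A minimal sketch of such a check follows, assuming the real file uses the same tab-separated columns as the mock file (SampleA, SampleB, IdentityScore, ORScore). The helper name check_input_format and the two-row example data are hypothetical and not part of the original script.

```python
import io
import pandas as pd

# Column names assumed from the mock comparison file used in this notebook
REQUIRED_COLUMNS = ["SampleA", "SampleB", "IdentityScore", "ORScore"]

def check_input_format(source):
    """Hypothetical helper: load a tab-separated comparison file and
    raise ValueError if an expected column is missing."""
    df = pd.read_csv(source, sep="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

# Made-up two-row example standing in for the real input file
example = io.StringIO(
    "SampleA\tSampleB\tIdentityScore\tORScore\n"
    "A\tB\t98.5%\t90\n"
    "A\tC\t75.0%\t60\n"
)
checked = check_input_format(example)
print(checked.shape)
```

Failing early with a clear message is one way to notice a differently formatted real input file before any plotting starts.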

In [1]:
#!pip install numpy pandas matplotlib scipy  # only needed if the packages are not yet in the environment

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

In [8]:
# Load the cleaned mock comparison file
mock_df = pd.read_csv("mock_comparison.txt", sep='\t')

# Clean percentage and convert to float
mock_df['IdentityScore'] = mock_df['IdentityScore'].str.replace('%', '').astype(float)

# Extract unique individuals
individuals = pd.unique(mock_df[['SampleA', 'SampleB']].values.ravel())
individuals.sort()

# Define Romanov identifiers
romanov_names = [
    'Princess Irene', 'Prince Fred', 'Nicolas II Romanov', 'Alexandra Romanov',
    'Olga Romanov', 'Tatiana Romanov', 'Maria Romanov', 'Alexei Romanov',
    'Suspected body of Anastasia Romanov']

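# Function to create distance matrix from similarity scores
def create_distance_matrix(df, score_col):
    # Distance = 100 - score, so scores are assumed to lie on a 0-100 scale.
    # Pairs that do not appear in df keep the maximum distance of 100.
    # Uses the global `individuals` array for the row and column labels.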
    matrix = pd.DataFrame(np.ones((len(individuals), len(individuals))) * 100,
                          index=individuals, columns=individuals)
    for _, row in df.iterrows():
        a, b = row['SampleA'], row['SampleB']
        score = row[score_col]
        matrix.loc[a, b] = 100 - score
        matrix.loc[b, a] = 100 - score
    np.fill_diagonal(matrix.values, 0)
    return matrix

# Function to plot and save dendrogram
def plot_dendrogram(dist_matrix, title, filename, show_plot):
    condensed = squareform(dist_matrix.values)
    linkage_matrix = linkage(condensed, method='average')

    def label_colors(label):
        return 'teal' if label in romanov_names else 'purple'

    plt.figure(figsize=(12, 8))
    dendro = dendrogram(linkage_matrix, labels=dist_matrix.index.tolist(),
                        leaf_font_size=10, leaf_rotation=0, orientation='left',
                        link_color_func=lambda k: 'black')

    # With orientation='left' the leaf labels sit on the y-axis
    ax = plt.gca()
    ylbls = ax.get_ymajorticklabels()
    for lbl in ylbls:
        lbl.set_color(label_colors(lbl.get_text()))

    plt.title(title)
    plt.tight_layout()
    plt.savefig(filename)
    if show_plot:
        plt.show()
    plt.close()

# User option to show plots
show_plots = False  # Set to True to display plots inline

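# Generate and save dendrograms
# Note: with the mock file, the mtDNA and Y chromosome plots are built
# from the same data, so each pair of plots is identical.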
plot_dendrogram(create_distance_matrix(mock_df, 'ORScore'),
                "mtDNA - Alignment Score", "dendrogram_mtDNA_alignment.png", show_plots)

plot_dendrogram(create_distance_matrix(mock_df, 'IdentityScore'),
                "mtDNA - Identity Score", "dendrogram_mtDNA_identity.png", show_plots)

plot_dendrogram(create_distance_matrix(mock_df, 'ORScore'),
                "Y Chromosome - Alignment Score", "dendrogram_Y_alignment.png", show_plots)

plot_dendrogram(create_distance_matrix(mock_df, 'IdentityScore'),
                "Y Chromosome - Identity Score", "dendrogram_Y_identity.png", show_plots)

print("Dendrograms saved as PNG files.")


Dendrograms saved as PNG files.
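
To make the conversion from similarity scores to a clustering easier to follow, here is a small worked sketch on made-up data. It applies the same `100 - score` conversion as `create_distance_matrix` and the same `squareform` and average `linkage` calls as `plot_dendrogram`. The three sample names and their scores are invented for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Made-up percent similarities between three hypothetical samples
names = ["A", "B", "C"]
similarity = np.array([
    [100.0, 90.0, 40.0],
    [90.0, 100.0, 50.0],
    [40.0, 50.0, 100.0],
])

# Same conversion as create_distance_matrix: distance = 100 - similarity
distance = 100.0 - similarity

# squareform turns the symmetric, zero-diagonal matrix into the condensed
# vector (upper triangle, row by row) that linkage expects:
# one entry each for the pairs (A,B), (A,C), (B,C)
condensed = squareform(distance)
print(condensed)

# Average linkage: the closest pair (A, B) merges first, then C joins
# at the mean of its distances to A and B
Z = linkage(condensed, method="average")
print(Z)
```

Each row of `Z` records one merge: the two cluster indices, the merge distance, and the size of the new cluster. The merge distances are what the dendrogram draws as branch heights.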
