In [10]:
import pandas as pd

# Read the CSV files
zero_shot_df = pd.read_csv('Data/Gemini/llm_responses_final_zero_shot.csv')
few_shot_df = pd.read_csv('Data/Gemini/llm_responses_final_few_shot.csv')
chain_of_thoughts_df = pd.read_csv('Data/Gemini/llm_responses_final_chain_of_thoughts.csv')

# Create the combined text column for original title and body
zero_shot_df['Title_Body'] = '<<' + zero_shot_df['Title'] + '>>\n<<' + zero_shot_df['Body'] + '>>'

# Create combined response columns for each LLM
zero_shot_df['llm_response_combined'] = '<<' + zero_shot_df['llm_title_response'] + '>>\n<<' + zero_shot_df['llm_body_response'] + '>>'
few_shot_df['llm_response_combined'] = '<<' + few_shot_df['llm_title_response'] + '>>\n<<' + few_shot_df['llm_body_response'] + '>>'
chain_of_thoughts_df['llm_response_combined'] = '<<' + chain_of_thoughts_df['llm_title_response'] + '>>\n<<' + chain_of_thoughts_df['llm_body_response'] + '>>'

# Create the final dataframe with all columns
combined_df = pd.DataFrame({
    'Id': zero_shot_df['Id'],
    'Title': zero_shot_df['Title'],
    'Body': zero_shot_df['Body'],
    'Title_Body': zero_shot_df['Title_Body'],
    'ImageURLs': zero_shot_df['ImageURLs'],
    # Zero-shot responses
    'llm_zero_shot_title': zero_shot_df['llm_title_response'],
    'llm_zero_shot_body': zero_shot_df['llm_body_response'],
    'llm_zero_shot_combined': zero_shot_df['llm_response_combined'],
    # Few-shot responses
    'llm_few_shot_title': few_shot_df['llm_title_response'],
    'llm_few_shot_body': few_shot_df['llm_body_response'],
    'llm_few_shot_combined': few_shot_df['llm_response_combined'],
    # Chain of thoughts responses
    'llm_cot_title': chain_of_thoughts_df['llm_title_response'],
    'llm_cot_body': chain_of_thoughts_df['llm_body_response'],
    'llm_cot_combined': chain_of_thoughts_df['llm_response_combined']
})

# Save the combined dataframe to a new CSV file
combined_df.to_csv('Data/Gemini/llm_responses_combined.csv', index=False)

# Print basic information about the combined file
print(f"Combined file created with {len(combined_df)} rows")
print("\nColumns in the combined file:")
for col in combined_df.columns:
    print(f"- {col}")

# Print a sample row to verify the structure
print("\nSample of first row:")
for col in combined_df.columns:
    print(f"\n{col}:")
    value = str(combined_df[col].iloc[0])
    print(value[:100] + "..." if len(value) > 100 else value)

Combined file created with 143 rows

Columns in the combined file:
- Id
- Title
- Body
- Title_Body
- ImageURLs
- llm_zero_shot_title
- llm_zero_shot_body
- llm_zero_shot_combined
- llm_few_shot_title
- llm_few_shot_body
- llm_few_shot_combined
- llm_cot_title
- llm_cot_body
- llm_cot_combined

Sample of first row:

Id:
79146548

Title:
GitHub Copilot responds to 'Hey Code' but dictation doesn't work

Body:
As the title explains, I can start an inline chat session using the 'hey code' voice command in VS C...

Title_Body:
<<GitHub Copilot responds to 'Hey Code' but dictation doesn't work>>
<<As the title explains, I can ...

ImageURLs:
['https://i.sstatic.net/MgGjdapB.png']

llm_zero_shot_title:
Why is my IDE/Code editor suggesting "Ask Copilot" when I have code on the next line?

llm_zero_shot_body:
I'm trying to use the "Ask Copilot" feature in my IDE, but it's not working as expected. When I clic...

llm_zero_shot_combined:
<<Why is my IDE/Code editor suggesting "Ask Copilot" when I

In [11]:
combined_df.head()

Unnamed: 0,Id,Title,Body,Title_Body,ImageURLs,llm_zero_shot_title,llm_zero_shot_body,llm_zero_shot_combined,llm_few_shot_title,llm_few_shot_body,llm_few_shot_combined,llm_cot_title,llm_cot_body,llm_cot_combined
0,79146548,GitHub Copilot responds to 'Hey Code' but dict...,"As the title explains, I can start an inline c...",<<GitHub Copilot responds to 'Hey Code' but di...,['https://i.sstatic.net/MgGjdapB.png'],"Why is my IDE/Code editor suggesting ""Ask Copi...","I'm trying to use the ""Ask Copilot"" feature in...","<<Why is my IDE/Code editor suggesting ""Ask Co...",How to Implement Drag and Drop Functionality B...,I am trying to find a reference to the `SetupE...,<<How to Implement Drag and Drop Functionality...,How to use GitHub Copilot to generate code for...,I'm trying to use GitHub Copilot to help me wr...,<<How to use GitHub Copilot to generate code f...
1,79146419,How can I fix my Workflow file to successfully...,I am trying to use Github Actions with Azure S...,<<How can I fix my Workflow file to successful...,['https://i.sstatic.net/THwNK2Jj.png'],"""npm: not found"" error when deploying .NET Cor...",I am trying to deploy my ASP.NET Core applicat...,"<<""npm: not found"" error when deploying .NET C...","Azure DevOps Build Fails: ""The command 'npm ru...",I am trying to publish my .NET Core applicatio...,"<<Azure DevOps Build Fails: ""The command 'npm ...","Azure deployment fails with ""npm: not found"" w...",I am trying to deploy my ASP.NET Core applicat...,"<<Azure deployment fails with ""npm: not found""..."
2,79146412,LINQPad 8 Dump Property Order different that L...,"In LINQPad 5 with Linq-to-Sql DataContext, if ...",<<LINQPad 8 Dump Property Order different that...,['https://i.sstatic.net/efq4SfvI.png'],C# - LINQ Select new object with Property Type...,## C# Select Statement Returning Unexpected Pr...,<<C# - LINQ Select new object with Property Ty...,LINQPad Error: No native tree vulnerabilities ...,I am receiving a LINQPad error message stating...,<<LINQPad Error: No native tree vulnerabilitie...,How to Select Specific Properties from a User ...,I'm trying to retrieve user properties and the...,<<How to Select Specific Properties from a Use...
3,79146127,SyntaxError: Cannot use import statement outsi...,"I'm using TypeScript, ESM, npm, and ts-jest. U...",<<SyntaxError: Cannot use import statement out...,['https://i.sstatic.net/Jp5wj6k2.png'],"""SyntaxError: Cannot use import statement outs...",## SyntaxError: Cannot use import statement ou...,"<<""SyntaxError: Cannot use import statement ou...","How to Fix ""SyntaxError: Cannot use import sta...",I am trying to import `chalk` and `picocolors`...,"<<How to Fix ""SyntaxError: Cannot use import s...","""SyntaxError: Cannot use import statement outs...","I'm encountering a ""SyntaxError: Cannot use im...","<<""SyntaxError: Cannot use import statement ou..."
4,79145758,Typescript Polymorphic Component Event Handler,I have written a strongly-typed Polymorphic Ty...,<<Typescript Polymorphic Component Event Handl...,['https://i.sstatic.net/19LCKEF3.png'],TypeScript Error: Property 'currentTarget' doe...,"## TypeScript error: ""Property 'currentTarget'...",<<TypeScript Error: Property 'currentTarget' d...,Property 'currentTarget' does not exist on typ...,I am working with a React project using TypeSc...,<<Property 'currentTarget' does not exist on t...,Property 'currentTarget' does not exist on typ...,I'm encountering a TypeScript error in my Reac...,<<Property 'currentTarget' does not exist on t...
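
The combined frame above is assembled column by column, which relies on the three CSVs listing the same questions in the same row order. A minimal sketch of a more defensive variant joins the frames on `Id` instead, assuming `Id` is unique within each file. The helper name `combine_by_id` and the toy frames are illustrative only and are not part of the notebook.

```python
import pandas as pd

def combine_by_id(frames):
    """Join per-strategy response frames on Id.

    frames: dict mapping a strategy label (e.g. 'zero_shot') to a DataFrame
    with columns Id, llm_title_response, llm_body_response.
    """
    combined = None
    for label, frame in frames.items():
        part = frame[['Id', 'llm_title_response', 'llm_body_response']].rename(columns={
            'llm_title_response': f'llm_{label}_title',
            'llm_body_response': f'llm_{label}_body',
        })
        if combined is None:
            combined = part
        else:
            # validate='one_to_one' raises if an Id is duplicated on either side
            combined = combined.merge(part, on='Id', how='inner', validate='one_to_one')
    return combined

# Toy data: the second frame lists the same Ids in a different order
zero = pd.DataFrame({'Id': [1, 2],
                     'llm_title_response': ['a', 'b'],
                     'llm_body_response': ['A', 'B']})
few = pd.DataFrame({'Id': [2, 1],
                    'llm_title_response': ['d', 'c'],
                    'llm_body_response': ['D', 'C']})

combined = combine_by_id({'zero_shot': zero, 'few_shot': few})
print(combined)
```

With positional assembly, the row for `Id` 1 would have picked up the few-shot response written for `Id` 2. The join keeps each response attached to its own question.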


In [12]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

def create_embeddings_and_analyze(df, model_name='sentence-transformers/all-mpnet-base-v1'):
    """
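    Embed the original and LLM-generated titles, bodies, and combined texts with
    a SentenceTransformer model and compute per-row cosine similarities.

    Adds similarity_* columns to df in place, saves the embeddings as .npy files
    under Data/, and returns (df, similarities).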
    """
    # Initialize the model
    model = SentenceTransformer(model_name)
    
    # Generate embeddings for original content
    print("Generating embeddings for original content...")
    title_embeddings = model.encode(df['Title'].tolist(), show_progress_bar=True)
    body_embeddings = model.encode(df['Body'].tolist(), show_progress_bar=True)
    title_body_embeddings = model.encode(df['Title_Body'].tolist(), show_progress_bar=True)
    
    # Generate embeddings for LLM responses
    print("\nGenerating embeddings for LLM responses...")
    response_embeddings = {
        # Title responses
        'zero_shot_title': model.encode(df['llm_zero_shot_title'].tolist(), show_progress_bar=True),
        'few_shot_title': model.encode(df['llm_few_shot_title'].tolist(), show_progress_bar=True),
        'cot_title': model.encode(df['llm_cot_title'].tolist(), show_progress_bar=True),
        
        # Body responses
        'zero_shot_body': model.encode(df['llm_zero_shot_body'].tolist(), show_progress_bar=True),
        'few_shot_body': model.encode(df['llm_few_shot_body'].tolist(), show_progress_bar=True),
        'cot_body': model.encode(df['llm_cot_body'].tolist(), show_progress_bar=True),
        
        # Combined responses
        'zero_shot_combined': model.encode(df['llm_zero_shot_combined'].tolist(), show_progress_bar=True),
        'few_shot_combined': model.encode(df['llm_few_shot_combined'].tolist(), show_progress_bar=True),
        'cot_combined': model.encode(df['llm_cot_combined'].tolist(), show_progress_bar=True)
    }
    
    # Calculate similarities
    similarities = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Calculate title similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['title'][response_type] = np.diagonal(
            cosine_similarity(title_embeddings, response_embeddings[f'{response_type}_title'])
        )
    
    # Calculate body similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['body'][response_type] = np.diagonal(
            cosine_similarity(body_embeddings, response_embeddings[f'{response_type}_body'])
        )
    
    # Calculate combined similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['combined'][response_type] = np.diagonal(
            cosine_similarity(title_body_embeddings, response_embeddings[f'{response_type}_combined'])
        )
    
    # Add similarity scores to dataframe
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        df[f'similarity_{response_type}_title'] = similarities['title'][response_type]
        df[f'similarity_{response_type}_body'] = similarities['body'][response_type]
        df[f'similarity_{response_type}_combined'] = similarities['combined'][response_type]
    
    # Save embeddings
    np.save('Data/title_embeddings_st.npy', title_embeddings)
    np.save('Data/body_embeddings_st.npy', body_embeddings)
    np.save('Data/title_body_embeddings_st.npy', title_body_embeddings)
    
    for response_type, embeddings in response_embeddings.items():
        np.save(f'Data/{response_type}_embeddings_st.npy', embeddings)
    
    return df, similarities

def analyze_similarities(similarities):
    """
    Analyze and visualize similarity distributions for title, body, and combined responses
    """
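    # Half-open bins [low, high): scores below 0.0 or exactly 1.0 match no bin,
    # so the percentages can sum to less than 100.
    categories = {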
        'Very High': (0.8, 1.0),
        'High': (0.6, 0.8),
        'Moderate': (0.4, 0.6),
        'Low': (0.2, 0.4),
        'Very Low': (0.0, 0.2)
    }
    
    results = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Create figures for each type of response
    for response_category in ['title', 'body', 'combined']:
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        fig.suptitle(f'Similarity Analysis for {response_category.capitalize()} Responses')
        
        # Plot distributions and calculate statistics for each LLM type
        for idx, response_type in enumerate(['zero_shot', 'few_shot', 'cot']):
            scores = similarities[response_category][response_type]
            
            # Calculate statistics
            stats = {
                'mean': np.mean(scores),
                'median': np.median(scores),
                'std': np.std(scores),
                'min': np.min(scores),
                'max': np.max(scores)
            }
            
            # Calculate distribution across categories
            distribution = {}
            for category, (low, high) in categories.items():
                count = np.sum((scores >= low) & (scores < high))
                percentage = (count / len(scores)) * 100
                distribution[category] = percentage
                
            results[response_category][response_type] = {
                'statistics': stats,
                'distribution': distribution
            }
            
            # Plot histogram
            ax = axes[idx]
            sns.histplot(scores, bins=30, ax=ax)
            ax.set_title(f'{response_type.replace("_", " ").title()}')
            ax.set_xlabel('Similarity Score')
            ax.set_ylabel('Count')
            
            # Add category boundaries
            for category, (low, high) in categories.items():
                if low > 0:  # Don't plot the lowest boundary
                    ax.axvline(x=low, color='r', linestyle='--', alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f'Data/similarity_distributions_{response_category}_st.png')
        plt.close()
    
    # Print analysis
    for response_category in ['title', 'body', 'combined']:
        print(f"\n{response_category.upper()} Response Analysis:")
        for response_type, result in results[response_category].items():
            print(f"\n{response_type.upper()}:")
            print("\nBasic Statistics:")
            for metric, value in result['statistics'].items():
                print(f"{metric}: {value:.4f}")
            
            print("\nDistribution across categories:")
            for category, percentage in result['distribution'].items():
                print(f"{category}: {percentage:.2f}%")
    
    return results

# Usage example:
if __name__ == "__main__":
    # Read the combined CSV file
    df = pd.read_csv('Data/Gemini/llm_responses_combined.csv')
    
    # Generate embeddings and calculate similarities
    df_with_similarities, similarities = create_embeddings_and_analyze(df)
    
    # Analyze and visualize results
    analysis_results = analyze_similarities(similarities)
    
    # Save updated dataframe with similarities
    df_with_similarities.to_csv('Data/Gemini/llm_responses_with_similarities_st.csv', index=False)



Generating embeddings for original content...



Generating embeddings for LLM responses...



TITLE Response Analysis:

ZERO_SHOT:

Basic Statistics:
mean: 0.4581
median: 0.4816
std: 0.2075
min: -0.0375
max: 0.8861

Distribution across categories:
Very High: 3.50%
High: 25.87%
Moderate: 32.17%
Low: 24.48%
Very Low: 12.59%

FEW_SHOT:

Basic Statistics:
mean: 0.3552
median: 0.3461
std: 0.2145
min: -0.1036
max: 0.9188

Distribution across categories:
Very High: 1.40%
High: 14.69%
Moderate: 27.27%
Low: 27.97%
Very Low: 26.57%

COT:

Basic Statistics:
mean: 0.4512
median: 0.4544
std: 0.2340
min: -0.0135
max: 0.9497

Distribution across categories:
Very High: 6.99%
High: 25.17%
Moderate: 24.48%
Low: 25.17%
Very Low: 16.78%

BODY Response Analysis:

ZERO_SHOT:

Basic Statistics:
mean: 0.4923
median: 0.5016
std: 0.2138
min: -0.0008
max: 0.8823

Distribution across categories:
Very High: 5.59%
High: 30.07%
Moderate: 32.87%
Low: 18.88%
Very Low: 11.89%

FEW_SHOT:

Basic Statistics:
mean: 0.4208
median: 0.4270
std: 0.2038
min: -0.0833
max: 0.8904

Distribution across categories:
Very Hig
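
The five zero-shot title percentages printed above sum to 98.61%, not 100%. The bins are half-open and start at 0.0, so negative scores (the minimum here is -0.0375) are not counted in any category. A minimal sketch of one way to make the bins exhaustive uses `np.digitize` on the interior boundaries only. The helper name `bin_scores` and the decision to fold negative scores into 'Very Low' are assumptions, not the notebook's method.

```python
import numpy as np

LABELS = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']
INTERIOR_EDGES = [0.2, 0.4, 0.6, 0.8]  # same boundaries as the categories dict

def bin_scores(scores):
    """Return the percentage of scores per label, covering the full [-1, 1] range."""
    scores = np.asarray(scores, dtype=float)
    # digitize returns 0 for x < 0.2, 1 for 0.2 <= x < 0.4, ..., 4 for x >= 0.8,
    # so negatives land in 'Very Low' and 1.0 lands in 'Very High'.
    idx = np.digitize(scores, INTERIOR_EDGES)
    counts = np.bincount(idx, minlength=len(LABELS))
    return {label: 100.0 * count / len(scores) for label, count in zip(LABELS, counts)}

# Toy scores, including a negative value and an exact 1.0
dist = bin_scores([-0.04, 0.1, 0.2, 0.3, 0.5, 0.79, 0.8, 1.0])
for label, pct in dist.items():
    print(f"{label}: {pct:.2f}%")
```

Interior boundaries keep the original `[low, high)` convention, so only the treatment of the two open ends changes.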

In [13]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns

def create_embeddings_and_analyze(df, model_name='sentence-transformers/all-MiniLM-L6-v2'):
    """
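    Embed the original and LLM-generated titles, bodies, and combined texts with
    a SentenceTransformer model and compute per-row cosine similarities.

    Adds similarity_* columns to df in place, saves the embeddings as .npy files
    under Data/, and returns (df, similarities).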
    """
    # Initialize the model
    model = SentenceTransformer(model_name)
    
    # Generate embeddings for original content
    print("Generating embeddings for original content...")
    title_embeddings = model.encode(df['Title'].tolist(), show_progress_bar=True)
    body_embeddings = model.encode(df['Body'].tolist(), show_progress_bar=True)
    title_body_embeddings = model.encode(df['Title_Body'].tolist(), show_progress_bar=True)
    
    # Generate embeddings for LLM responses
    print("\nGenerating embeddings for LLM responses...")
    response_embeddings = {
        # Title responses
        'zero_shot_title': model.encode(df['llm_zero_shot_title'].tolist(), show_progress_bar=True),
        'few_shot_title': model.encode(df['llm_few_shot_title'].tolist(), show_progress_bar=True),
        'cot_title': model.encode(df['llm_cot_title'].tolist(), show_progress_bar=True),
        
        # Body responses
        'zero_shot_body': model.encode(df['llm_zero_shot_body'].tolist(), show_progress_bar=True),
        'few_shot_body': model.encode(df['llm_few_shot_body'].tolist(), show_progress_bar=True),
        'cot_body': model.encode(df['llm_cot_body'].tolist(), show_progress_bar=True),
        
        # Combined responses
        'zero_shot_combined': model.encode(df['llm_zero_shot_combined'].tolist(), show_progress_bar=True),
        'few_shot_combined': model.encode(df['llm_few_shot_combined'].tolist(), show_progress_bar=True),
        'cot_combined': model.encode(df['llm_cot_combined'].tolist(), show_progress_bar=True)
    }
    
    # Calculate similarities
    similarities = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Calculate title similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['title'][response_type] = np.diagonal(
            cosine_similarity(title_embeddings, response_embeddings[f'{response_type}_title'])
        )
    
    # Calculate body similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['body'][response_type] = np.diagonal(
            cosine_similarity(body_embeddings, response_embeddings[f'{response_type}_body'])
        )
    
    # Calculate combined similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['combined'][response_type] = np.diagonal(
            cosine_similarity(title_body_embeddings, response_embeddings[f'{response_type}_combined'])
        )
    
    # Add similarity scores to dataframe
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        df[f'similarity_{response_type}_title'] = similarities['title'][response_type]
        df[f'similarity_{response_type}_body'] = similarities['body'][response_type]
        df[f'similarity_{response_type}_combined'] = similarities['combined'][response_type]
    
    # Save embeddings
    np.save('Data/title_embeddings_st.npy', title_embeddings)
    np.save('Data/body_embeddings_st.npy', body_embeddings)
    np.save('Data/title_body_embeddings_st.npy', title_body_embeddings)
    
    for response_type, embeddings in response_embeddings.items():
        np.save(f'Data/{response_type}_embeddings_st.npy', embeddings)
    
    return df, similarities

def analyze_similarities(similarities):
    """
    Analyze and visualize similarity distributions for title, body, and combined responses
    """
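    # Half-open bins [low, high): scores below 0.0 or exactly 1.0 match no bin,
    # so the percentages can sum to less than 100.
    categories = {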
        'Very High': (0.8, 1.0),
        'High': (0.6, 0.8),
        'Moderate': (0.4, 0.6),
        'Low': (0.2, 0.4),
        'Very Low': (0.0, 0.2)
    }
    
    results = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Create figures for each type of response
    for response_category in ['title', 'body', 'combined']:
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        fig.suptitle(f'Similarity Analysis for {response_category.capitalize()} Responses')
        
        # Plot distributions and calculate statistics for each LLM type
        for idx, response_type in enumerate(['zero_shot', 'few_shot', 'cot']):
            scores = similarities[response_category][response_type]
            
            # Calculate statistics
            stats = {
                'mean': np.mean(scores),
                'median': np.median(scores),
                'std': np.std(scores),
                'min': np.min(scores),
                'max': np.max(scores)
            }
            
            # Calculate distribution across categories
            distribution = {}
            for category, (low, high) in categories.items():
                count = np.sum((scores >= low) & (scores < high))
                percentage = (count / len(scores)) * 100
                distribution[category] = percentage
                
            results[response_category][response_type] = {
                'statistics': stats,
                'distribution': distribution
            }
            
            # Plot histogram
            ax = axes[idx]
            sns.histplot(scores, bins=30, ax=ax)
            ax.set_title(f'{response_type.replace("_", " ").title()}')
            ax.set_xlabel('Similarity Score')
            ax.set_ylabel('Count')
            
            # Add category boundaries
            for category, (low, high) in categories.items():
                if low > 0:  # Don't plot the lowest boundary
                    ax.axvline(x=low, color='r', linestyle='--', alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f'Data/similarity_distributions_{response_category}_st.png')
        plt.close()
    
    # Print analysis
    for response_category in ['title', 'body', 'combined']:
        print(f"\n{response_category.upper()} Response Analysis:")
        for response_type, result in results[response_category].items():
            print(f"\n{response_type.upper()}:")
            print("\nBasic Statistics:")
            for metric, value in result['statistics'].items():
                print(f"{metric}: {value:.4f}")
            
            print("\nDistribution across categories:")
            for category, percentage in result['distribution'].items():
                print(f"{category}: {percentage:.2f}%")
    
    return results

# Usage example:
if __name__ == "__main__":
    # Read the combined CSV file
    df = pd.read_csv('Data/Gemini/llm_responses_combined.csv')
    
    # Generate embeddings and calculate similarities
    df_with_similarities, similarities = create_embeddings_and_analyze(df)
    
    # Analyze and visualize results
    analysis_results = analyze_similarities(similarities)
    
    # Save updated dataframe with similarities
    df_with_similarities.to_csv('Data/Gemini/llm_responses_with_similarities_st.csv', index=False)



Generating embeddings for original content...



Generating embeddings for LLM responses...



TITLE Response Analysis:

ZERO_SHOT:

Basic Statistics:
mean: 0.4720
median: 0.4942
std: 0.2152
min: -0.0771
max: 0.9190

Distribution across categories:
Very High: 2.80%
High: 30.77%
Moderate: 30.07%
Low: 24.48%
Very Low: 11.19%

FEW_SHOT:

Basic Statistics:
mean: 0.3751
median: 0.3910
std: 0.2242
min: -0.1247
max: 0.9602

Distribution across categories:
Very High: 2.80%
High: 14.69%
Moderate: 27.97%
Low: 29.37%
Very Low: 21.68%

COT:

Basic Statistics:
mean: 0.4654
median: 0.4642
std: 0.2299
min: 0.0050
max: 0.9656

Distribution across categories:
Very High: 8.39%
High: 25.87%
Moderate: 25.17%
Low: 25.87%
Very Low: 14.69%

BODY Response Analysis:

ZERO_SHOT:

Basic Statistics:
mean: 0.5186
median: 0.5462
std: 0.2058
min: -0.0848
max: 0.9123

Distribution across categories:
Very High: 8.39%
High: 27.97%
Moderate: 35.66%
Low: 21.68%
Very Low: 4.20%

FEW_SHOT:

Basic Statistics:
mean: 0.4432
median: 0.4576
std: 0.1916
min: 0.0127
max: 0.8875

Distribution across categories:
Very High: 
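
Both sentence-transformers cells above write to the same `_st` paths (the `.npy` embeddings, the PNG plots, and `llm_responses_with_similarities_st.csv`). The all-MiniLM-L6-v2 run therefore replaces the files written by the all-mpnet-base-v1 run. A minimal sketch of one way to keep the runs apart derives a filename suffix from the model name. `model_suffix` and `output_paths` are hypothetical helpers, and the path layout simply mirrors the one used in the cells.

```python
import re

def model_suffix(model_name):
    """Turn a Hugging Face model id into a filesystem-safe suffix."""
    short = model_name.split('/')[-1].lower()
    return re.sub(r'[^a-z0-9]+', '_', short).strip('_')

def output_paths(model_name, base_dir='Data'):
    """Build per-model output paths so runs with different models do not collide."""
    suffix = model_suffix(model_name)
    return {
        'csv': f'{base_dir}/Gemini/llm_responses_with_similarities_{suffix}.csv',
        'title_embeddings': f'{base_dir}/title_embeddings_{suffix}.npy',
        # '{category}' is left as a placeholder to fill with title/body/combined
        'plot_template': f'{base_dir}/similarity_distributions_{{category}}_{suffix}.png',
    }

mpnet_paths = output_paths('sentence-transformers/all-mpnet-base-v1')
minilm_paths = output_paths('sentence-transformers/all-MiniLM-L6-v2')
print(mpnet_paths['csv'])
print(minilm_paths['csv'])
```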

In [15]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
import seaborn as sns

def create_embeddings_and_analyze(df, model_name='OrlikB/KartonBERT-USE-base-v1'):
    """
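    Embed the original and LLM-generated titles, bodies, and combined texts with
    KartonBERT, using L2-normalized embeddings so that the row-wise dot product
    equals cosine similarity.

    Adds similarity_* columns to df in place, saves the embeddings as .npy files
    under Data/, and returns (df, similarities).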
    """
    # Initialize the model
    model = SentenceTransformer(model_name)
    
    # Generate embeddings for original content
    print("Generating embeddings for original content...")
    title_embeddings = model.encode(df['Title'].tolist(), normalize_embeddings=True, show_progress_bar=True)
    body_embeddings = model.encode(df['Body'].tolist(), normalize_embeddings=True, show_progress_bar=True)
    title_body_embeddings = model.encode(df['Title_Body'].tolist(), normalize_embeddings=True, show_progress_bar=True)
    
    # Generate embeddings for LLM responses
    print("\nGenerating embeddings for LLM responses...")
    response_embeddings = {
        # Title responses
        'zero_shot_title': model.encode(df['llm_zero_shot_title'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'few_shot_title': model.encode(df['llm_few_shot_title'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'cot_title': model.encode(df['llm_cot_title'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        
        # Body responses
        'zero_shot_body': model.encode(df['llm_zero_shot_body'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'few_shot_body': model.encode(df['llm_few_shot_body'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'cot_body': model.encode(df['llm_cot_body'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        
        # Combined responses
        'zero_shot_combined': model.encode(df['llm_zero_shot_combined'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'few_shot_combined': model.encode(df['llm_few_shot_combined'].tolist(), normalize_embeddings=True, show_progress_bar=True),
        'cot_combined': model.encode(df['llm_cot_combined'].tolist(), normalize_embeddings=True, show_progress_bar=True)
    }
    
    # Calculate similarities using dot product
    similarities = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Calculate title similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        # Using dot product for normalized vectors
        similarities['title'][response_type] = np.sum(
            title_embeddings * response_embeddings[f'{response_type}_title'], axis=1
        )
    
    # Calculate body similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['body'][response_type] = np.sum(
            body_embeddings * response_embeddings[f'{response_type}_body'], axis=1
        )
    
    # Calculate combined similarities
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        similarities['combined'][response_type] = np.sum(
            title_body_embeddings * response_embeddings[f'{response_type}_combined'], axis=1
        )
    
    # Add similarity scores to dataframe
    for response_type in ['zero_shot', 'few_shot', 'cot']:
        df[f'similarity_{response_type}_title'] = similarities['title'][response_type]
        df[f'similarity_{response_type}_body'] = similarities['body'][response_type]
        df[f'similarity_{response_type}_combined'] = similarities['combined'][response_type]
    
    # Save embeddings
    np.save('Data/title_embeddings_kartonbert.npy', title_embeddings)
    np.save('Data/body_embeddings_kartonbert.npy', body_embeddings)
    np.save('Data/title_body_embeddings_kartonbert.npy', title_body_embeddings)
    
    for response_type, embeddings in response_embeddings.items():
        np.save(f'Data/{response_type}_embeddings_kartonbert.npy', embeddings)
    
    return df, similarities

def analyze_similarities(similarities):
    """
    Analyze and visualize similarity distributions for title, body, and combined responses
    """
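    # Half-open bins [low, high): scores below 0.0 or exactly 1.0 match no bin,
    # so the percentages can sum to less than 100.
    categories = {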
        'Very High': (0.8, 1.0),
        'High': (0.6, 0.8),
        'Moderate': (0.4, 0.6),
        'Low': (0.2, 0.4),
        'Very Low': (0.0, 0.2)
    }
    
    results = {
        'title': {},
        'body': {},
        'combined': {}
    }
    
    # Create figures for each type of response
    for response_category in ['title', 'body', 'combined']:
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        fig.suptitle(f'Similarity Analysis for {response_category.capitalize()} Responses (KartonBERT)')
        
        # Plot distributions and calculate statistics for each LLM type
        for idx, response_type in enumerate(['zero_shot', 'few_shot', 'cot']):
            scores = similarities[response_category][response_type]
            
            # Calculate statistics
            stats = {
                'mean': np.mean(scores),
                'median': np.median(scores),
                'std': np.std(scores),
                'min': np.min(scores),
                'max': np.max(scores)
            }
            
            # Calculate distribution across categories
            distribution = {}
            for category, (low, high) in categories.items():
                count = np.sum((scores >= low) & (scores < high))
                percentage = (count / len(scores)) * 100
                distribution[category] = percentage
                
            results[response_category][response_type] = {
                'statistics': stats,
                'distribution': distribution
            }
            
            # Plot histogram
            ax = axes[idx]
            sns.histplot(scores, bins=30, ax=ax)
            ax.set_title(f'{response_type.replace("_", " ").title()}')
            ax.set_xlabel('Similarity Score')
            ax.set_ylabel('Count')
            
            # Add category boundaries
            for category, (low, high) in categories.items():
                if low > 0:  # Don't plot the lowest boundary
                    ax.axvline(x=low, color='r', linestyle='--', alpha=0.3)
        
        plt.tight_layout()
        plt.savefig(f'Data/similarity_distributions_{response_category}_kartonbert.png')
        plt.close()
    
    # Print analysis
    for response_category in ['title', 'body', 'combined']:
        print(f"\n{response_category.upper()} Response Analysis (KartonBERT):")
        for response_type, result in results[response_category].items():
            print(f"\n{response_type.upper()}:")
            print("\nBasic Statistics:")
            for metric, value in result['statistics'].items():
                print(f"{metric}: {value:.4f}")
            
            print("\nDistribution across categories:")
            for category, percentage in result['distribution'].items():
                print(f"{category}: {percentage:.2f}%")
    
    return results

# Usage example:
if __name__ == "__main__":
    # Read the combined CSV file
    df = pd.read_csv('Data/Gemini/llm_responses_combined.csv')
    
    # Generate embeddings and calculate similarities
    df_with_similarities, similarities = create_embeddings_and_analyze(df)
    
    # Analyze and visualize results
    analysis_results = analyze_similarities(similarities)
    
    # Save updated dataframe with similarities
    df_with_similarities.to_csv('Data/Gemini/llm_responses_with_similarities_kartonbert.csv', index=False)

Generating embeddings for original content...



Generating embeddings for LLM responses...



TITLE Response Analysis (KartonBERT):

ZERO_SHOT:

Basic Statistics:
mean: 0.6121
median: 0.6075
std: 0.1376
min: 0.2549
max: 0.9097

Distribution across categories:
Very High: 11.19%
High: 41.26%
Moderate: 42.66%
Low: 4.90%
Very Low: 0.00%

FEW_SHOT:

Basic Statistics:
mean: 0.5506
median: 0.5403
std: 0.1565
min: 0.1331
max: 0.9077

Distribution across categories:
Very High: 8.39%
High: 28.67%
Moderate: 46.15%
Low: 15.38%
Very Low: 1.40%

COT:

Basic Statistics:
mean: 0.6026
median: 0.6009
std: 0.1582
min: 0.2320
max: 0.9381

Distribution across categories:
Very High: 13.29%
High: 37.06%
Moderate: 38.46%
Low: 11.19%
Very Low: 0.00%

BODY Response Analysis (KartonBERT):

ZERO_SHOT:

Basic Statistics:
mean: 0.6950
median: 0.7261
std: 0.1323
min: 0.2278
max: 0.9387

Distribution across categories:
Very High: 17.48%
High: 61.54%
Moderate: 17.48%
Low: 3.50%
Very Low: 0.00%

FEW_SHOT:

Basic Statistics:
mean: 0.6448
median: 0.6551
std: 0.1139
min: 0.3176
max: 0.9371

Distribution across ca
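
The KartonBERT cell replaces `np.diagonal(cosine_similarity(...))` with a row-wise dot product over embeddings encoded with `normalize_embeddings=True`. For unit-length vectors the two give the same per-row scores, and the row-wise form avoids building the full N x N matrix. A minimal sketch with random arrays standing in for embeddings follows. `rowwise_cosine` is an illustrative helper and assumes no row is all zeros.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Normalize each row to unit length (assumes non-zero rows)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Dot product of matching rows only: O(N * d) instead of O(N^2 * d)
    return np.sum(a_unit * b_unit, axis=1)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))  # stand-in for original-text embeddings
b = rng.normal(size=(4, 8))  # stand-in for LLM-response embeddings

# Full pairwise cosine matrix, whose diagonal holds the matching-row scores
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
full_matrix = a_unit @ b_unit.T

paired = rowwise_cosine(a, b)
print(np.allclose(np.diagonal(full_matrix), paired))
```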