Team Name: DataVortex_005_020_044_045

Name 1: Abhishek Bhat - PES1UG22AM005

Name 2: Anagha S Bharadwaj - PES1UG22AM020

Name 3: C Hemachandra - PES1UG22AM044

Name 4: Chaitra V - PES1UG22AM045

**Secure Multi-Party Computation (SMPC) Principles:**

The query and dataset are split into encrypted shares that are distributed across multiple parties. The parties then perform the similarity search collaboratively, so no individual participant sees the sensitive data in the clear.
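The idea of splitting data into shares can be illustrated with additive secret sharing. The following is a minimal sketch, not the method used in this notebook (which relies on Fernet encryption). The names `PRIME`, `share_vector`, `local_dot`, and `reconstruct` are hypothetical, the values are integers, and the weight vector is assumed to be public for simplicity; a dot product between two private vectors would need an extra protocol such as Beaver triples.

```python
import secrets

PRIME = 2**61 - 1  # public modulus; an illustrative choice, not taken from the notebook

def share_vector(values, n_parties):
    """Split each integer in `values` into n_parties additive shares modulo PRIME."""
    shares = [[] for _ in range(n_parties)]
    for v in values:
        # n_parties - 1 random parts, plus one part that makes the sum equal v
        parts = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
        last = (v - sum(parts)) % PRIME
        for party, part in zip(shares, parts + [last]):
            party.append(part)
    return shares

def local_dot(share, public_weights):
    """Each party computes a dot product using only its own share."""
    return sum(s * w for s, w in zip(share, public_weights)) % PRIME

def reconstruct(partials):
    """Adding the partial results modulo PRIME recovers the true value."""
    return sum(partials) % PRIME

secret_vector = [3, 0, 5, 2]   # hypothetical integer term counts of a private document
public_weights = [1, 4, 2, 0]  # assumed public in this sketch
shares = share_vector(secret_vector, n_parties=3)
partials = [local_dot(s, public_weights) for s in shares]
result = reconstruct(partials)
print(result)
```

Each individual share is uniformly random on its own, so a single party learns nothing about `secret_vector`; only the combination of all partial results reveals the dot product.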

A function is integrated to remove duplicate entries from the aggregated results, keeping the highest similarity score for each unique text.

Additions Made:

**Federated Environment:** Simulate data storage across multiple nodes. Each node will store and process its data locally, and results will be aggregated securely.
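The federated pattern described above (local processing, then aggregation) can be sketched without any external libraries. This is an illustrative sketch only: `jaccard`, `local_top_k`, and `federated_top_k` are hypothetical names, and token-overlap (Jaccard) similarity stands in for the TF-IDF cosine score used in the notebook.

```python
import heapq

def jaccard(query, text):
    """Token-overlap similarity; a simple stand-in for a TF-IDF cosine score."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0

def local_top_k(query, node_texts, k):
    """Runs on a node: score local texts and return only the k best (score, text) pairs."""
    return heapq.nlargest(k, ((jaccard(query, t), t) for t in node_texts))

def federated_top_k(query, nodes, k=2):
    """Runs on the coordinator: merge per-node candidates into a global top-k."""
    candidates = []
    for node_texts in nodes:
        candidates.extend(local_top_k(query, node_texts, k))
    return heapq.nlargest(k, candidates)

# Hypothetical data held by two nodes
node_a = ["delicious sweet candy", "stale bread"]
node_b = ["delicious candy", "dog food"]
top = federated_top_k("delicious candy", [node_a, node_b], k=2)
for score, text in top:
    print(f"{score:.4f}  {text}")
```

Only each node's top candidates leave the node; the raw local dataset is never sent to the coordinator.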

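**SMPC:** Avoid decrypting data during similarity computation by using homomorphic encryption principles or secure aggregation. In this notebook the step is only simulated: the query is encrypted with Fernet (symmetric encryption) and decrypted on each node before scoring.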
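Secure aggregation can be sketched with pairwise masks that cancel when the masked values are summed. This is a simplified illustration under stated assumptions, not the notebook's implementation: `MODULUS`, `pairwise_masks`, and `mask_value` are hypothetical names, the values are integers, and the masks are generated in one place here, whereas a real protocol would have each pair of nodes agree on its mask privately.

```python
import secrets

MODULUS = 2**32  # public modulus; an illustrative choice

def pairwise_masks(n_nodes):
    """masks[i][j] (for i < j) is the random value shared between nodes i and j."""
    masks = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            masks[i][j] = secrets.randbelow(MODULUS)
    return masks

def mask_value(node_id, value, masks):
    """Add the masks shared with higher-numbered nodes, subtract those shared with lower ones."""
    masked = value
    for other in range(len(masks)):
        if node_id < other:
            masked += masks[node_id][other]
        elif node_id > other:
            masked -= masks[other][node_id]
    return masked % MODULUS

local_values = [12, 7, 30]  # hypothetical private integers, one per node
masks = pairwise_masks(len(local_values))
masked = [mask_value(i, v, masks) for i, v in enumerate(local_values)]
total = sum(masked) % MODULUS  # every mask is added once and subtracted once
print(total)
```

The aggregator sees only the masked values, which look random individually, yet their sum equals the sum of the private inputs.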

The code was modified to display both the encrypted and decrypted versions of each result (the similar statements).

In [None]:
import pandas as pd
from cryptography.fernet import Fernet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Step 1: Generate encryption key and create Fernet instance
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Step 2: Load dataset (modify the file path as needed)
file_path = 'D:/SEMESTER 5/Algorithms and Optimizations in Machine Learning (AOML)/AOML_Project/Reviews.csv/Reviews.csv'
df = pd.read_csv(file_path)

# Ensure relevant columns exist
if 'Text' not in df.columns:
    raise ValueError("The dataset must contain a 'Text' column.")

# Truncate long text entries to reduce memory usage
df['Text'] = df['Text'].apply(lambda x: x[:500] if isinstance(x, str) else x)

# Simulate distributed nodes
node_1_data = df.iloc[:len(df) // 2]
node_2_data = df.iloc[len(df) // 2:]

# Function to encrypt text
def encrypt(text):
    return cipher_suite.encrypt(text.encode()).decode()

# Function to decrypt text
def decrypt(encrypted_text):
    return cipher_suite.decrypt(encrypted_text.encode()).decode()

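# Function for local processing on each node
def local_similarity_search(encrypted_query, node_data, top_k=5):
    """Return the top_k (score, text) pairs for the query within one node's data."""
    # The query is decrypted on the node, so this simulates the protocol
    # rather than computing on encrypted data.
    decrypted_query = decrypt(encrypted_query)
    dataset_texts = node_data['Text'].dropna().tolist()
    # Each node fits its own TF-IDF vocabulary on its local texts only
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(dataset_texts)
    query_vector = vectorizer.transform([decrypted_query])
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # Indices of the top_k highest scores, best first
    top_indices = similarity_scores.argsort()[::-1][:top_k]
    return [(similarity_scores[i], dataset_texts[i]) for i in top_indices]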

# Securely aggregate results from multiple nodes with duplicate filtering
def secure_aggregate_results(encrypted_query, nodes):
    all_results = []
    for node_data in nodes:
        local_results = local_similarity_search(encrypted_query, node_data)
        encrypted_results = [(score, encrypt(text)) for score, text in local_results]
        all_results.extend(encrypted_results)

    # Decrypt results for final similarity ranking (simulate secure decryption in real SMPC)
    decrypted_results = [(score, decrypt(enc_text)) for score, enc_text in all_results]

    # Filter duplicates while keeping the highest score for each unique text
    unique_results = {}
    for score, text in decrypted_results:
        if text not in unique_results or unique_results[text] < score:
            unique_results[text] = score

    # Sort the unique results by score in descending order
    sorted_results = sorted(unique_results.items(), key=lambda x: x[1], reverse=True)
    return [(score, text) for text, score in sorted_results[:5]]

# Full workflow
def privacy_preserving_workflow_federated(dataset_nodes):
    # Take user input for the query
    query = input("Enter your query: ")
    print("\n1. Original Query:", query)

    # Encrypt the query
    encrypted_query = encrypt(query)
    print("\n2. Encrypted Query:", encrypted_query)

    # Aggregate results from nodes
    similar_statements = secure_aggregate_results(encrypted_query, dataset_nodes)
    print("\n3. Similar Statements Found (Aggregated from Nodes):")

    for i, (score, stmt) in enumerate(similar_statements, 1):
        encrypted_stmt = encrypt(stmt)
        # Print Encrypted statement first
        print(f"{i}. [Score: {score:.4f}] Encrypted: {encrypted_stmt}")
        # Then print the decrypted statement below
        print(f"   Decrypted: {stmt}\n")

# Simulate distributed nodes
nodes = [node_1_data, node_2_data]

# Run the federated privacy-preserving workflow
privacy_preserving_workflow_federated(nodes)


1. Original Query: Delicious

2. Encrypted Query: gAAAAABnPdwNcpbKI2psH4-AKldFkeqZZt8w91k83mAfJHI-bOfz3Nwwj4cNvHe8eIGGLzEvttIZVBHwDwMTXdLuBKLWA0H51Q==

3. Similar Statements Found (Aggregated from Nodes):
1. [Score: 0.8720] Encrypted: gAAAAABnPdw0nwe5K8zQjaSohUYnwhGWKTmYXEXH7F12lGdPMZnYJ3_W7xYbowtTvSgApErJLU8aC0LZboNwBQDFTQSKpU7ZDE7naOGOEo5O22auwng_i7i5fWUuaL3hjPpXY_ZobVS6W9tEzfI2QpEIh2j0cK7CM5NpCb0bVfzTnGih0cmGaZGnOXRJIgtd8jUMO2wPXRCIefcHPTiDchkJ_yVcs-fBvoqj_QlRE8zJu-uRbdiDkzzRv86oJ_2esj137tYhXFZsRWj7X0_5q0df6l6u56qmpFf4V3H0sroiPasea-7eTEzxqgWGTMP3qwUlU8g987HM
   Decrypted: DELICIOUS!!! DELICIOUS!!! DELICIOUS!!! DELICIOUS!!! By the way, did I mention.....DELICIOUS!!! I love this stuff. It's sweet, melts-in-your-mouth, and DELICIOUS!!! DELICIOUS!!!

2. [Score: 0.7100] Encrypted: gAAAAABnPdw0OgUhlAz21O09-PruqRIGmVW6BhL0b45D4N3at4h2W7oClyxUcUM5zl5ndHEXfWBdLLlyxyUWQze3ySUkb9-Ddo-xsdVtXiokPtBKz7al_FKXhW2M93WTzPUx1Mme4P9XYo12PUyqICXSd4LtphgfyPd6RAd-WujpP0gN1kWmQ_WiNJOGmCpNemzWcbJ1bw7BtdYdf