
<h2>Operational Research Project</h2>
<h1>Dataset Synchronization Algorithms: Comparing Greedy, Local Search, and Genetic Approaches</h1>


<p style="font-size: 25px;">Summary</p>
<p style="font-size: 20px;">1. Data Synchronization Between Two Identical Datasets</p>
<p style="font-size: 20px;">1.1 Size of Dataset = 1000 Rows</p>
<p style="font-size: 20px;">1.2 Size of Dataset = 100 000 Rows</p>
<p style="font-size: 20px;">1.3 Size of Dataset = 200 000 Rows</p>
<p style="font-size: 20px;">2. Conclusion</p>


<p style="font-size: 25px;">1. Data Synchronization Between Two Identical Datasets </p>

<span style="font-size: 20px;">**1.1 size of dataset = 1000 rows** </span>

In [11]:
import pandas as pd
import random
import string
import time

# Function to generate a random dataset
def generate_dataset(size, id_start=1):
    data = {
        "ID": list(range(id_start, id_start + size)),
        "name": [
            ''.join(random.choices(string.ascii_letters, k=random.randint(5, 10)))
            for _ in range(size)
        ]
    }
    return pd.DataFrame(data)

# Cost function: counts the rows whose 'name' differs between two datasets.
# The comparison is positional, so both frames must share the same index labels.
def calculate_cost(original_dataset, synced_dataset):
    return (original_dataset['name'] != synced_dataset['name']).sum()

# Function to synchronize two datasets with a greedy algorithm
def greedy_sync(dataset1, dataset2):
    start_time = time.time()
    merged = pd.merge(dataset1, dataset2, on="ID", how="outer", suffixes=("_1", "_2"))
    merged["name"] = merged["name_1"].combine_first(merged["name_2"])
    merged = merged[["ID", "name"]]
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, merged)
    return merged, execution_time, cost

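# Function to synchronize two datasets with a local search algorithm (row-by-row update).
# Each iteration scans the whole ID column of dataset2, so the work grows with
# len(dataset1) * len(dataset2).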
def local_search_sync(dataset1, dataset2):
    start_time = time.time()
    result = dataset2.copy()  # Synchronize dataset2 based on dataset1
    for index, row in dataset1.iterrows():
        result.loc[result['ID'] == row['ID'], 'name'] = row['name']  # Update dataset2
    result = result.sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, result)
    return result, execution_time, cost

# Function to synchronize two datasets with a genetic algorithm
# (heavily simplified: a single left merge, with no population, crossover, or mutation step)
def genetic_sync(dataset1, dataset2):
    start_time = time.time()
    # First, prioritize dataset1 by updating dataset2 with dataset1 values where IDs match
    combined = pd.merge(dataset2, dataset1, on="ID", how="left", suffixes=("_2", "_1"))
    combined["name"] = combined["name_1"].combine_first(combined["name_2"])  # Prefer dataset1 values
    combined = combined[["ID", "name"]].sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, combined)
    return combined, execution_time, cost

# Generate datasets
size_dataset1 = 1000  # Size of the first dataset
size_dataset2 = 1000  # Size of the second dataset

dataset1 = generate_dataset(size_dataset1)
dataset2 = dataset1.copy()  # Dataset2 starts as an identical copy of dataset1

# Save the datasets to CSV (optional)
dataset1.to_csv("dataset1.csv", index=False)
dataset2.to_csv("dataset2.csv", index=False)

# Display the original datasets
print("\n=== Dataset 1 ===")
print("-" * 40)
print(dataset1.head())
print("-" * 40)

print("\n=== Dataset 2 ===")
print("-" * 40)
print(dataset2.head())
print("-" * 40)

# Modify values in dataset1 (Manual input)
print("\nEnter the modifications for dataset1 (ID and name)")

# Prompt the user to modify values
while True:
    try:
        row_id = int(input("Enter the ID of the row to modify (or 0 to stop): "))
        if row_id == 0:
            break
        new_name = input("Enter the new name: ")
        dataset1.loc[dataset1['ID'] == row_id, 'name'] = new_name
    except ValueError:
        print("Invalid entry. Please try again: ")

# Apply synchronization algorithms

# Greedy Synchronization
print("\n===== 1. Synchronization with the Greedy Algorithm =====")
print("-" * 40)
updated_dataset2_greedy, time_greedy, cost_greedy = greedy_sync(dataset1, dataset2)
print("Dataset2 after Greedy synchronization:")
print("-" * 40)
print(updated_dataset2_greedy.head())
print("-" * 40)
print(f"Cost of Greedy synchronization (number of modifications): {cost_greedy}")
print(f"Execution Time: {time_greedy:.6f} seconds\n")

# Local Search Synchronization
print("\n===== 2. Synchronization with the Local Search Algorithm =====")
print("-" * 40)
updated_dataset2_local, time_local, cost_local = local_search_sync(dataset1, dataset2)
print("Dataset2 after Local Search synchronization:")
print("-" * 40)
print(updated_dataset2_local.head())
print("-" * 40)
print(f"Cost of Local Search synchronization (number of modifications): {cost_local}")
print(f"Execution Time: {time_local:.6f} seconds\n")

# Genetic Synchronization
print("\n===== 3. Synchronization with the Genetic Algorithm =====")
print("-" * 40)
updated_dataset2_genetic, time_genetic, cost_genetic = genetic_sync(dataset1, dataset2)
print("Dataset2 after Genetic synchronization:")
print("-" * 40)
print(updated_dataset2_genetic.head())
print("-" * 40)
print(f"Cost of Genetic synchronization (number of modifications): {cost_genetic}")
print(f"Execution Time: {time_genetic:.6f} seconds\n")

# Display execution times for all algorithms
print("\n===== Summary of Execution Times =====")
print("-" * 40)
print(f"Greedy: {time_greedy:.6f} seconds")
print(f"Local Search: {time_local:.6f} seconds")
print(f"Genetic: {time_genetic:.6f} seconds")
print("-" * 40)


=== Dataset 1 ===
----------------------------------------
   ID     name
0   1   ZawjlR
1   2   vSkNlf
2   3    pmzHd
3   4  BPnOUWu
4   5   mcuJXz
----------------------------------------

=== Dataset 2 ===
----------------------------------------
   ID     name
0   1   ZawjlR
1   2   vSkNlf
2   3    pmzHd
3   4  BPnOUWu
4   5   mcuJXz
----------------------------------------

Enter the modifications for dataset1 (ID and name)


Enter the ID of the row to modify (or 0 to stop):  1
Enter the new name:  dhouha
Enter the ID of the row to modify (or 0 to stop):  2
Enter the new name:  nouha 
Enter the ID of the row to modify (or 0 to stop):  3
Enter the new name:  nour
Enter the ID of the row to modify (or 0 to stop):  0



===== 1. Synchronization with the Greedy Algorithm =====
----------------------------------------
Dataset2 after Greedy synchronization:
----------------------------------------
   ID     name
0   1   dhouha
1   2   nouha 
2   3     nour
3   4  BPnOUWu
4   5   mcuJXz
----------------------------------------
Cost of Greedy synchronization (number of modifications): 3
Execution Time: 0.006624 seconds


===== 2. Synchronization with the Local Search Algorithm =====
----------------------------------------
Dataset2 after Local Search synchronization:
----------------------------------------
   ID     name
0   1   dhouha
1   2   nouha 
2   3     nour
3   4  BPnOUWu
4   5   mcuJXz
----------------------------------------
Cost of Local Search synchronization (number of modifications): 3
Execution Time: 0.462626 seconds


===== 3. Synchronization with the Genetic Algorithm =====
----------------------------------------
Dataset2 after Genetic synchronization:
----------------------------------

<p style="font-size: 20px;"> => At 1000 rows, the genetic algorithm is the fastest, with an execution time of 0.004732 seconds, compared with 0.006624 seconds for the greedy algorithm and 0.462626 seconds for local search.</p>
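
The gap between the genetic and greedy timings at this size is about two milliseconds, taken from a single run, which is small enough to be affected by measurement noise. A minimal sketch of a repeated-measurement helper follows. `median_runtime`, `repeats`, and `sample_workload` are hypothetical names that do not appear in the notebook, and `time.perf_counter` is swapped in for `time.time` because it is intended for timing short intervals.

```python
import statistics
import time


def median_runtime(func, *args, repeats=5):
    """Return the median wall-clock time of func(*args) over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload (hypothetical); in the notebook this would be a sync function.
def sample_workload(n):
    return sum(range(n))


elapsed = median_runtime(sample_workload, 100_000, repeats=5)
print(f"Median time: {elapsed:.6f} seconds")
```

With the notebook's functions, a call such as `median_runtime(greedy_sync, dataset1, dataset2)` would time the whole call, including the cost computation that the functions currently leave outside their own timers.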

<span style="font-size: 20px;">**1.2 size of dataset = 100 000 rows** </span>

In [21]:
import pandas as pd
import random
import string
import time

# Function to generate a random dataset
def generate_dataset(size, id_start=1):
    data = {
        "ID": list(range(id_start, id_start + size)),
        "name": [
            ''.join(random.choices(string.ascii_letters, k=random.randint(5, 10)))
            for _ in range(size)
        ]
    }
    return pd.DataFrame(data)

# Cost function: counts the rows whose 'name' differs between two datasets.
# The comparison is positional, so both frames must share the same index labels.
def calculate_cost(original_dataset, synced_dataset):
    return (original_dataset['name'] != synced_dataset['name']).sum()

# Function to synchronize two datasets with a greedy algorithm
def greedy_sync(dataset1, dataset2):
    start_time = time.time()
    merged = pd.merge(dataset1, dataset2, on="ID", how="outer", suffixes=("_1", "_2"))
    merged["name"] = merged["name_1"].combine_first(merged["name_2"])
    merged = merged[["ID", "name"]]
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, merged)
    return merged, execution_time, cost

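# Function to synchronize two datasets with a local search algorithm (row-by-row update).
# Each iteration scans the whole ID column of dataset2, so the work grows with
# len(dataset1) * len(dataset2).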
def local_search_sync(dataset1, dataset2):
    start_time = time.time()
    result = dataset2.copy()  # Synchronize dataset2 based on dataset1
    for index, row in dataset1.iterrows():
        result.loc[result['ID'] == row['ID'], 'name'] = row['name']  # Update dataset2
    result = result.sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, result)
    return result, execution_time, cost

# Function to synchronize two datasets with a genetic algorithm
# (heavily simplified: a single left merge, with no population, crossover, or mutation step)
def genetic_sync(dataset1, dataset2):
    start_time = time.time()
    # First, prioritize dataset1 by updating dataset2 with dataset1 values where IDs match
    combined = pd.merge(dataset2, dataset1, on="ID", how="left", suffixes=("_2", "_1"))
    combined["name"] = combined["name_1"].combine_first(combined["name_2"])  # Prefer dataset1 values
    combined = combined[["ID", "name"]].sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, combined)
    return combined, execution_time, cost

# Generate datasets
size_dataset1 = 100000  # Size of the first dataset
size_dataset2 = 100000  # Size of the second dataset

dataset1 = generate_dataset(size_dataset1)
dataset2 = dataset1.copy()  # Dataset2 starts as an identical copy of dataset1

# Save the datasets to CSV (optional)
dataset1.to_csv("dataset1.csv", index=False)
dataset2.to_csv("dataset2.csv", index=False)

# Display the original datasets
print("\n=== Dataset 1 ===")
print("-" * 40)
print(dataset1.head())
print("-" * 40)

print("\n=== Dataset 2 ===")
print("-" * 40)
print(dataset2.head())
print("-" * 40)

# Modify values in dataset1 (Manual input)
print("\nEnter the modifications for dataset1 (ID and name)")

# Prompt the user to modify values
while True:
    try:
        row_id = int(input("Enter the ID of the row to modify (or 0 to stop): "))
        if row_id == 0:
            break
        new_name = input("Enter the new name: ")
        dataset1.loc[dataset1['ID'] == row_id, 'name'] = new_name
    except ValueError:
        print("Invalid entry. Please try again: ")

# Apply synchronization algorithms

# Greedy Synchronization
print("\n===== 1. Synchronization with the Greedy Algorithm =====")
print("-" * 40)
updated_dataset2_greedy, time_greedy, cost_greedy = greedy_sync(dataset1, dataset2)
print("Dataset2 after Greedy synchronization:")
print("-" * 40)
print(updated_dataset2_greedy.head())
print("-" * 40)
print(f"Cost of Greedy synchronization (number of modifications): {cost_greedy}")
print(f"Execution Time: {time_greedy:.6f} seconds\n")

# Local Search Synchronization
print("\n===== 2. Synchronization with the Local Search Algorithm =====")
print("-" * 40)
updated_dataset2_local, time_local, cost_local = local_search_sync(dataset1, dataset2)
print("Dataset2 after Local Search synchronization:")
print("-" * 40)
print(updated_dataset2_local.head())
print("-" * 40)
print(f"Cost of Local Search synchronization (number of modifications): {cost_local}")
print(f"Execution Time: {time_local:.6f} seconds\n")

# Genetic Synchronization
print("\n===== 3. Synchronization with the Genetic Algorithm =====")
print("-" * 40)
updated_dataset2_genetic, time_genetic, cost_genetic = genetic_sync(dataset1, dataset2)
print("Dataset2 after Genetic synchronization:")
print("-" * 40)
print(updated_dataset2_genetic.head())
print("-" * 40)
print(f"Cost of Genetic synchronization (number of modifications): {cost_genetic}")
print(f"Execution Time: {time_genetic:.6f} seconds\n")

# Display execution times for all algorithms
print("\n===== Summary of Execution Times =====")
print("-" * 40)
print(f"Greedy: {time_greedy:.6f} seconds")
print(f"Local Search: {time_local:.6f} seconds")
print(f"Genetic: {time_genetic:.6f} seconds")
print("-" * 40)



=== Dataset 1 ===
----------------------------------------
   ID       name
0   1  ZwzhfvOBS
1   2  LOljuEQuc
2   3    eKoMlpQ
3   4   dXirDhqw
4   5  asysQAwgS
----------------------------------------

=== Dataset 2 ===
----------------------------------------
   ID       name
0   1  ZwzhfvOBS
1   2  LOljuEQuc
2   3    eKoMlpQ
3   4   dXirDhqw
4   5  asysQAwgS
----------------------------------------

Enter the modifications for dataset1 (ID and name)


Enter the ID of the row to modify (or 0 to stop):  1
Enter the new name:  dhouha
Enter the ID of the row to modify (or 0 to stop):  2
Enter the new name:  nouha
Enter the ID of the row to modify (or 0 to stop):  0



===== 1. Synchronization with the Greedy Algorithm =====
----------------------------------------
Dataset2 after Greedy synchronization:
----------------------------------------
   ID       name
0   1     dhouha
1   2      nouha
2   3    eKoMlpQ
3   4   dXirDhqw
4   5  asysQAwgS
----------------------------------------
Cost of Greedy synchronization (number of modifications): 2
Execution Time: 0.024652 seconds


===== 2. Synchronization with the Local Search Algorithm =====
----------------------------------------
Dataset2 after Local Search synchronization:
----------------------------------------
   ID       name
0   1     dhouha
1   2      nouha
2   3    eKoMlpQ
3   4   dXirDhqw
4   5  asysQAwgS
----------------------------------------
Cost of Local Search synchronization (number of modifications): 2
Execution Time: 67.743525 seconds


===== 3. Synchronization with the Genetic Algorithm =====
----------------------------------------
Dataset2 after Genetic synchronization:
---------

<p style="font-size: 20px;"> => At 100 000 rows, the greedy algorithm is the fastest, with an execution time of 0.024652 seconds, while local search takes 67.743525 seconds.</p>
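
The 67.743525 seconds reported for local search come from its loop: for each of the 100 000 rows in dataset1, it compares one ID against the entire ID column of dataset2. The sketch below performs the same update with a single lookup. It assumes that IDs are unique in the source frame, and `indexed_sync` is a hypothetical name that is not part of the notebook.

```python
import pandas as pd


def indexed_sync(source, target):
    """Copy names from source into target for matching IDs, in one pass."""
    # Build an ID -> name lookup once instead of scanning target for every row.
    # Assumption: IDs are unique in source.
    lookup = source.set_index("ID")["name"]
    result = target.copy()
    # map() yields NaN for IDs missing from source; keep target's name there.
    result["name"] = result["ID"].map(lookup).fillna(result["name"])
    return result.sort_values(by="ID").reset_index(drop=True)


source = pd.DataFrame({"ID": [1, 2, 3], "name": ["dhouha", "nouha", "c"]})
target = pd.DataFrame({"ID": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})

synced = indexed_sync(source, target)
changed = int((synced["name"] != target["name"]).sum())
print(synced)
print(f"Rows changed: {changed}")
```

This version does the same job as the loop without repeating a full column scan per row, so it is a different technique from the notebook's local search rather than a faster copy of it.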

<span style="font-size: 20px;">**1.3 size of dataset = 200 000 rows**</span>

In [33]:
import pandas as pd
import random
import string
import time

# Function to generate a random dataset
def generate_dataset(size, id_start=1):
    data = {
        "ID": list(range(id_start, id_start + size)),
        "name": [
            ''.join(random.choices(string.ascii_letters, k=random.randint(5, 10)))
            for _ in range(size)
        ]
    }
    return pd.DataFrame(data)

# Cost function: counts the rows whose 'name' differs between two datasets.
# The comparison is positional, so both frames must share the same index labels.
def calculate_cost(original_dataset, synced_dataset):
    return (original_dataset['name'] != synced_dataset['name']).sum()

# Function to synchronize two datasets with a greedy algorithm
def greedy_sync(dataset1, dataset2):
    start_time = time.time()
    merged = pd.merge(dataset1, dataset2, on="ID", how="outer", suffixes=("_1", "_2"))
    merged["name"] = merged["name_1"].combine_first(merged["name_2"])
    merged = merged[["ID", "name"]]
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, merged)
    return merged, execution_time, cost

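# Function to synchronize two datasets with a local search algorithm (row-by-row update).
# Each iteration scans the whole ID column of dataset2, so the work grows with
# len(dataset1) * len(dataset2).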
def local_search_sync(dataset1, dataset2):
    start_time = time.time()
    result = dataset2.copy()  # Synchronize dataset2 based on dataset1
    for index, row in dataset1.iterrows():
        result.loc[result['ID'] == row['ID'], 'name'] = row['name']  # Update dataset2
    result = result.sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, result)
    return result, execution_time, cost

# Function to synchronize two datasets with a genetic algorithm
# (heavily simplified: a single left merge, with no population, crossover, or mutation step)
def genetic_sync(dataset1, dataset2):
    start_time = time.time()
    # First, prioritize dataset1 by updating dataset2 with dataset1 values where IDs match
    combined = pd.merge(dataset2, dataset1, on="ID", how="left", suffixes=("_2", "_1"))
    combined["name"] = combined["name_1"].combine_first(combined["name_2"])  # Prefer dataset1 values
    combined = combined[["ID", "name"]].sort_values(by="ID").reset_index(drop=True)
    execution_time = time.time() - start_time
    cost = calculate_cost(dataset2, combined)
    return combined, execution_time, cost

# Generate datasets
size_dataset1 = 200000  # Size of the first dataset
size_dataset2 = 200000  # Size of the second dataset

dataset1 = generate_dataset(size_dataset1)
dataset2 = dataset1.copy()  # Dataset2 starts as an identical copy of dataset1

# Save the datasets to CSV (optional)
dataset1.to_csv("dataset1.csv", index=False)
dataset2.to_csv("dataset2.csv", index=False)

# Display the original datasets
print("\n=== Dataset 1 ===")
print("-" * 40)
print(dataset1.head())
print("-" * 40)

print("\n=== Dataset 2 ===")
print("-" * 40)
print(dataset2.head())
print("-" * 40)

# Modify values in dataset1 (Manual input)
print("\nEnter the modifications for dataset1 (ID and name)")

# Prompt the user to modify values
while True:
    try:
        row_id = int(input("Enter the ID of the row to modify (or 0 to stop): "))
        if row_id == 0:
            break
        new_name = input("Enter the new name: ")
        dataset1.loc[dataset1['ID'] == row_id, 'name'] = new_name
    except ValueError:
        print("Invalid entry. Please try again: ")

# Apply synchronization algorithms

# Greedy Synchronization
print("\n===== 1. Synchronization with the Greedy Algorithm =====")
print("-" * 40)
updated_dataset2_greedy, time_greedy, cost_greedy = greedy_sync(dataset1, dataset2)
print("Dataset2 after Greedy synchronization:")
print("-" * 40)
print(updated_dataset2_greedy.head())
print("-" * 40)
print(f"Cost of Greedy synchronization (number of modifications): {cost_greedy}")
print(f"Execution Time: {time_greedy:.6f} seconds\n")

# Local Search Synchronization
print("\n===== 2. Synchronization with the Local Search Algorithm =====")
print("-" * 40)
updated_dataset2_local, time_local, cost_local = local_search_sync(dataset1, dataset2)
print("Dataset2 after Local Search synchronization:")
print("-" * 40)
print(updated_dataset2_local.head())
print("-" * 40)
print(f"Cost of Local Search synchronization (number of modifications): {cost_local}")
print(f"Execution Time: {time_local:.6f} seconds\n")

# Genetic Synchronization
print("\n===== 3. Synchronization with the Genetic Algorithm =====")
print("-" * 40)
updated_dataset2_genetic, time_genetic, cost_genetic = genetic_sync(dataset1, dataset2)
print("Dataset2 after Genetic synchronization:")
print("-" * 40)
print(updated_dataset2_genetic.head())
print("-" * 40)
print(f"Cost of Genetic synchronization (number of modifications): {cost_genetic}")
print(f"Execution Time: {time_genetic:.6f} seconds\n")

# Display execution times for all algorithms
print("\n===== Summary of Execution Times =====")
print("-" * 40)
print(f"Greedy: {time_greedy:.6f} seconds")
print(f"Local Search: {time_local:.6f} seconds")
print(f"Genetic: {time_genetic:.6f} seconds")
print("-" * 40)



=== Dataset 1 ===
----------------------------------------
   ID        name
0   1       lkhFX
1   2  wVVzVDUhHP
2   3     jJHVIuu
3   4       hbVlA
4   5    kLhJsveK
----------------------------------------

=== Dataset 2 ===
----------------------------------------
   ID        name
0   1       lkhFX
1   2  wVVzVDUhHP
2   3     jJHVIuu
3   4       hbVlA
4   5    kLhJsveK
----------------------------------------

Enter the modifications for dataset1 (ID and name)


Enter the ID of the row to modify (or 0 to stop):  1
Enter the new name:  dhouha
Enter the ID of the row to modify (or 0 to stop):  0



===== 1. Synchronization with the Greedy Algorithm =====
----------------------------------------
Dataset2 after Greedy synchronization:
----------------------------------------
   ID        name
0   1      dhouha
1   2  wVVzVDUhHP
2   3     jJHVIuu
3   4       hbVlA
4   5    kLhJsveK
----------------------------------------
Cost of Greedy synchronization (number of modifications): 1
Execution Time: 0.038137 seconds


===== 2. Synchronization with the Local Search Algorithm =====
----------------------------------------
Dataset2 after Local Search synchronization:
----------------------------------------
   ID        name
0   1      dhouha
1   2  wVVzVDUhHP
2   3     jJHVIuu
3   4       hbVlA
4   5    kLhJsveK
----------------------------------------
Cost of Local Search synchronization (number of modifications): 1
Execution Time: 175.489956 seconds


===== 3. Synchronization with the Genetic Algorithm =====
----------------------------------------
Dataset2 after Genetic synchronizati

<p style="font-size: 20px;"> => At 200 000 rows, the greedy algorithm is again the fastest, with an execution time of 0.038137 seconds, while local search takes 175.489956 seconds.</p>
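
`calculate_cost` compares the two `name` columns position by position, which works in these runs because both frames hold the same IDs in the same order. The sketch below is an ID-aligned variant, under the assumption that a row present in only one frame should count as one modification. `cost_by_id` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd


def cost_by_id(before, after):
    """Count rows whose name differs, matching rows on ID instead of position."""
    merged = before.merge(
        after,
        on="ID",
        how="outer",
        suffixes=("_before", "_after"),
        indicator=True,
    )
    in_both = merged["_merge"] == "both"
    # Rows found in both frames count when their names differ.
    changed = (in_both & (merged["name_before"] != merged["name_after"])).sum()
    # Assumption: rows found in only one frame count as one modification each.
    unmatched = (~in_both).sum()
    return int(changed + unmatched)


before = pd.DataFrame({"ID": [1, 2, 3], "name": ["a", "b", "c"]})
# Same data in a different row order, with one rename and one extra row.
after = pd.DataFrame({"ID": [3, 1, 2, 4], "name": ["c", "dhouha", "b", "d"]})

cost = cost_by_id(before, after)
print(f"Cost: {cost}")
```

Matching on ID keeps the count meaningful if a synchronization step reorders rows or if the two datasets stop being the same length.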

<p style="font-size: 25px;">2. Conclusion </p>


<p style="font-size: 20px;"> => In this analysis, the genetic algorithm was the fastest of the three on the small dataset (1000 rows). <br><br> => It loses this advantage on the larger datasets (100 000 and 200 000 rows), where the greedy algorithm is the fastest, outperforming both the local search and genetic algorithms. Local search scales worst: its execution time rises from 0.462626 seconds at 1000 rows to 175.489956 seconds at 200 000 rows. <br><br> => These results highlight the importance of selecting the algorithm according to dataset size and characteristics. </p>
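
The scaling behaviour can be read directly from the reported timings. The sketch below computes, for each pair of consecutive dataset sizes, how much the row count and the execution time grew. The numbers are copied from the outputs above; the genetic timings are left out because the captured output is cut off before they are printed. `timings` and `growth_ratios` are hypothetical names.

```python
# Execution times in seconds, copied from the notebook outputs, keyed by row count.
timings = {
    "greedy": {1_000: 0.006624, 100_000: 0.024652, 200_000: 0.038137},
    "local_search": {1_000: 0.462626, 100_000: 67.743525, 200_000: 175.489956},
}


def growth_ratios(times_by_size):
    """Return (size_ratio, time_ratio) pairs for consecutive dataset sizes."""
    sizes = sorted(times_by_size)
    return [
        (larger / smaller, times_by_size[larger] / times_by_size[smaller])
        for smaller, larger in zip(sizes, sizes[1:])
    ]


for name, times in timings.items():
    for size_ratio, time_ratio in growth_ratios(times):
        print(f"{name}: rows x{size_ratio:.0f} -> time x{time_ratio:.1f}")
```

Doubling the rows from 100 000 to 200 000 multiplies the local search time by roughly 2.6 and the greedy time by roughly 1.5. These are single-run measurements, so the ratios are indicative only.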