In [1]:

# A while ago I was tasked with organizing 1 million time series files in a data lake. I had to merge files that were
# substantially similar, based on column names, into one shared dataframe; CSV files that differed had to be merged
# into their own dataframes. Eventually, everything was pushed into a Synapse database (SQL Server) in Azure.
# Why was Bloomberg changing the column names for its bond data? I have no idea, but that's what we had to deal with.
# The main challenge was to figure out which files were similar, or dissimilar, and then do all the appropriate merges.
# At the time, the process was manual, tedious, and frustrating. My team achieved the intended goal, and we
# merged 1 million files into 250 tables in the database, but the process was painful! I was recently thinking of a
# more elegant solution to handle this kind of problem. The solution that I came up with is described below.


In [8]:

# What are the benefits of loading CSV files into dataframes and merging those dataframes based on the
# similarity of their column names?

# Simplified Analysis: When working with multiple datasets, especially those with similar but not identical column names, 
# merging based on similarity can simplify data analysis. It allows you to combine related information from different 
# sources without the need for extensive data cleaning or renaming.

# Preservation of Data Structure: By merging dataframes with similar column names, you preserve the overall structure and 
# integrity of the data. This ensures that related information remains organized and accessible within a single dataframe.

# Improved Data Consistency: Merging based on column name similarity can help ensure consistency across datasets. It reduces 
# the risk of discrepancies or errors that may arise from combining data with mismatched column names.

# Efficient Data Integration: Instead of manually aligning column names or performing complex data transformations, merging 
# based on similarity streamlines the integration process. It saves time and effort, especially when dealing with large or 
# complex datasets.

# Enhanced Insights: Merging similar datasets enables comprehensive analysis and provides a more complete picture of the 
# underlying data. It allows you to leverage information from multiple sources to gain deeper insights and make 
# better-informed decisions.

# Scalability and Flexibility: This approach is scalable and flexible, allowing you to merge dataframes with varying 
# degrees of similarity. Whether the column names are nearly identical or exhibit some variation, you can adapt the 
# merging process to accommodate different scenarios.
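
# Since the grouping depends only on column names, the headers can be collected without loading any data rows.
# The sketch below assumes pandas' read_csv with nrows=0; the helper name read_column_names and the in-memory
# sample are illustrative and were not part of the original workflow.

```python
import io
import pandas as pd

def read_column_names(source):
    """Return the CSV header as a set of names, without parsing any data rows."""
    return set(pd.read_csv(source, nrows=0).columns)

# In-memory stand-in for a CSV file (hypothetical content); a file path works the same way
sample_csv = io.StringIO("ID,Name,Age\n1,Alice,25\n2,Bob,30\n")
print(read_column_names(sample_csv))
```

# Reading only the header keeps the comparison step cheap even when the files themselves are large.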


In [3]:

# Let's create generic sample dataframes. I no longer have the original data, and I would not post anything
# confidential or protected as intellectual property anyway.

import pandas as pd
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

# Create the dataframes
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})
df2 = pd.DataFrame({'ID': [4, 5, 6], 'Person': ['David', 'Edward', 'Fiona'], 'Age': [40, 45, 50]})
df3 = pd.DataFrame({'ID': [7, 8, 9], 'FName': ['George', 'Hannah', 'Ian'], 'LName': ['Peterson', 'Smith', 'Jones'], 'Age': [55, 60, 65]})
df4 = pd.DataFrame({'ID': [10, 11, 12], 'FullName': ['Jack', 'Katie', 'Leo'], 'Years': [28, 33, 38]})
df5 = pd.DataFrame({'ID': [13, 14, 15], 'Name': ['Mona', 'Nick', 'Olivia'], 'Years': [43, 48, 53]})
df6 = pd.DataFrame({'ID': [16, 17, 18], 'FullName': ['Paul', 'Quinn', 'Rachel'], 'Score': [58, 63, 68]})
df7 = pd.DataFrame({'EmpID': [19, 20, 21], 'EmployeeName': ['Steve', 'Tina', 'Uma'], 'Experience': [5, 10, 15]})
df8 = pd.DataFrame({'EmpID': [22, 23, 24], 'Name': ['Victor', 'Wendy', 'Xander'], 'WorkYears': [20, 25, 30]})
df9 = pd.DataFrame({'ID': [25, 26, 27], 'EmployeeName': ['Yara', 'Zack', 'Amy'], 'Experience': [35, 40, 45]})

dataframes = [df1, df2, df3, df4, df5, df6, df7, df8, df9]
print(dataframes)


[   ID     Name  Age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35,    ID  Person  Age
0   4   David   40
1   5  Edward   45
2   6   Fiona   50,    ID   FName     LName  Age
0   7  George  Peterson   55
1   8  Hannah     Smith   60
2   9     Ian     Jones   65,    ID FullName  Years
0  10     Jack     28
1  11    Katie     33
2  12      Leo     38,    ID    Name  Years
0  13    Mona     43
1  14    Nick     48
2  15  Olivia     53,    ID FullName  Score
0  16     Paul     58
1  17    Quinn     63
2  18   Rachel     68,    EmpID EmployeeName  Experience
0     19        Steve           5
1     20         Tina          10
2     21          Uma          15,    EmpID    Name  WorkYears
0     22  Victor         20
1     23   Wendy         25
2     24  Xander         30,    ID EmployeeName  Experience
0  25         Yara          35
1  26         Zack          40
2  27          Amy          45]


In [4]:

# Function to calculate Jaccard similarity between two sets: |intersection| / |union|
def jaccard_similarity(set1, set2):
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    # Two empty sets have an empty union; treat them as identical instead of dividing by zero
    if union == 0:
        return 1.0
    return intersection / union

# Calculate the Jaccard similarity of the column names for all pairs of dataframes
def calculate_similarity_matrix(dataframes):
    columns_list = [set(df.columns) for df in dataframes]
    n = len(dataframes)
    similarity_matrix = np.zeros((n, n))
    
    # Each unordered pair is visited once and mirrored; the diagonal is left at 0
    for (i, cols1), (j, cols2) in combinations(enumerate(columns_list), 2):
        sim = jaccard_similarity(cols1, cols2)
        similarity_matrix[i, j] = sim
        similarity_matrix[j, i] = sim
    
    return similarity_matrix


# Calculate similarity matrix
similarity_matrix = calculate_similarity_matrix(dataframes)
print(similarity_matrix)


[[0.         0.5        0.4        0.2        0.5        0.2
  0.         0.2        0.2       ]
 [0.5        0.         0.4        0.2        0.2        0.2
  0.         0.         0.2       ]
 [0.4        0.4        0.         0.16666667 0.16666667 0.16666667
  0.         0.         0.16666667]
 [0.2        0.2        0.16666667 0.         0.5        0.5
  0.         0.         0.2       ]
 [0.5        0.2        0.16666667 0.5        0.         0.2
  0.         0.2        0.2       ]
 [0.2        0.2        0.16666667 0.5        0.2        0.
  0.         0.         0.2       ]
 [0.         0.         0.         0.         0.         0.
  0.         0.2        0.5       ]
 [0.2        0.         0.         0.         0.2        0.
  0.2        0.         0.        ]
 [0.2        0.2        0.16666667 0.2        0.2        0.2
  0.5        0.         0.        ]]
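
# As a sanity check on the matrix above, two of its entries can be reproduced by hand. The column sets below are
# copied from df1, df2, and df7; returning 1.0 for two empty sets is a convention assumed here for illustration.

```python
def jaccard_similarity(set1, set2):
    """Return |A & B| / |A | B|, treating two empty sets as identical."""
    union = set1 | set2
    if not union:
        return 1.0
    return len(set1 & set2) / len(union)

cols_df1 = {"ID", "Name", "Age"}
cols_df2 = {"ID", "Person", "Age"}
cols_df7 = {"EmpID", "EmployeeName", "Experience"}

# df1 vs df2: 2 shared names (ID, Age) out of 4 distinct names
print(jaccard_similarity(cols_df1, cols_df2))
# df1 vs df7: no shared names at all
print(jaccard_similarity(cols_df1, cols_df7))
```

# These correspond to the entries at row 0, column 1 (0.5) and row 0, column 6 (0.) of the printed matrix.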


In [5]:

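# Cluster the dataframes based on their similarity
def cluster_dataframes(similarity_matrix, n_clusters):
    # Convert the similarity matrix to a distance matrix
    distance_matrix = 1 - similarity_matrix
    
    # Perform complete-linkage hierarchical clustering.
    # Note: linkage() interprets a square 2-D array as one observation vector per row,
    # not as precomputed pairwise distances, so each dataframe is represented here by
    # its row of distances to every dataframe (rows are compared with Euclidean distance).
    Z = linkage(distance_matrix, 'complete')
    
    # Cut the tree into at most n_clusters flat clusters
    labels = fcluster(Z, n_clusters, criterion='maxclust')
    
    return labels

# Split the dataframes into at most 3 groups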
labels = cluster_dataframes(similarity_matrix, 3)
print(labels)
    

[1 2 2 1 2 2 3 2 2]
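
# A caveat worth knowing: scipy's linkage() accepts either a condensed (1-D) distance matrix or a 2-D array of
# observation vectors. To cluster on the pairwise Jaccard distances themselves, one possible variant converts the
# square matrix with squareform() first. This is a sketch on toy numbers; the helper name cluster_from_similarity
# and the toy matrix are hypothetical, and the labels printed above were not produced this way.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_similarity(similarity_matrix, n_clusters):
    """Cluster items using a square similarity matrix as precomputed distances."""
    distance_matrix = 1.0 - np.asarray(similarity_matrix, dtype=float)
    np.fill_diagonal(distance_matrix, 0.0)  # each item is at distance 0 from itself
    condensed = squareform(distance_matrix)  # square matrix -> condensed 1-D form
    Z = linkage(condensed, method="complete")
    return fcluster(Z, n_clusters, criterion="maxclust")

# Toy similarities (made up): items 0 and 1 are alike, items 2 and 3 are alike
toy_similarity = np.array([
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.2],
    [0.1, 0.1, 0.0, 0.8],
    [0.1, 0.2, 0.8, 0.0],
])
toy_labels = cluster_from_similarity(toy_similarity, 2)
print(toy_labels)
```

# With this variant the two tight pairs end up in separate clusters. On real data the result may differ from the
# row-as-observation approach, so it is worth comparing both before trusting either grouping.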


In [6]:

# Group dataframes based on labels
groups = {}
for idx, label in enumerate(labels):
    if label not in groups:
        groups[label] = []
    groups[label].append(dataframes[idx])

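# Function to merge dataframes: rows are stacked, columns are aligned by name,
# and columns missing from a dataframe are filled with NaN
def merge_dataframes(dfs):
    return pd.concat(dfs, ignore_index=True)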
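
# To make the behavior of pd.concat concrete, here is a small sketch with two made-up frames that share only the
# Name column. The frames left and right are illustrative; they are cut-down versions of the sample data.

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
right = pd.DataFrame({"EmpID": [22], "Name": ["Victor"]})

# Columns are matched by name; anything missing on one side becomes NaN
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# Integer columns that receive NaN are upcast to float64
print(stacked.dtypes)
```

# This is why a loosely matched group produces a wide, sparse table, and why integer IDs can show up as floats
# such as 4.0 after the merge.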
    

In [7]:

# Merge dataframes in each group
merged_dfs = []

for key, group in groups.items():
    merged_df = merge_dataframes(group)
    merged_dfs.append(merged_df)

# Display merged dataframes
for idx, merged_df in enumerate(merged_dfs):
    print(f"Merged DataFrame {idx + 1}:")
    print(merged_df)
    print("\n")
    

Merged DataFrame 1:
   ID     Name   Age FullName  Years
0   1    Alice  25.0      NaN    NaN
1   2      Bob  30.0      NaN    NaN
2   3  Charlie  35.0      NaN    NaN
3  10      NaN   NaN     Jack   28.0
4  11      NaN   NaN    Katie   33.0
5  12      NaN   NaN      Leo   38.0


Merged DataFrame 2:
      ID  Person   Age   FName     LName    Name  Years FullName  Score  \
0    4.0   David  40.0     NaN       NaN     NaN    NaN      NaN    NaN   
1    5.0  Edward  45.0     NaN       NaN     NaN    NaN      NaN    NaN   
2    6.0   Fiona  50.0     NaN       NaN     NaN    NaN      NaN    NaN   
3    7.0     NaN  55.0  George  Peterson     NaN    NaN      NaN    NaN   
4    8.0     NaN  60.0  Hannah     Smith     NaN    NaN      NaN    NaN   
5    9.0     NaN  65.0     Ian     Jones     NaN    NaN      NaN    NaN   
6   13.0     NaN   NaN     NaN       NaN    Mona   43.0      NaN    NaN   
7   14.0     NaN   NaN     NaN       NaN    Nick   48.0      NaN    NaN   
8   15.0     NaN   NaN  

In [None]:

# Jaccard similarity measures how alike two sets are: the size of their intersection divided by the size of their 
# union, ranging from 0 (nothing in common) to 1 (identical). It is commonly used in data mining, information 
# retrieval, and text analysis to quantify the similarity between two collections of objects.

# This solution can simplify a complex process and automate a task that is otherwise very manual and time consuming.
# Manual processes tend to waste time and introduce errors, which can then propagate within the environment 
# for a long time to come.
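
# One scaling note: a pairwise matrix over n files needs n * (n - 1) / 2 comparisons, which is roughly 5 * 10^11
# for 1 million files. If many files share exactly the same header, a possible first step is to collapse them by
# column signature and compare only the distinct signatures. The sketch below assumes that situation; the file
# names, headers, and the helper group_by_column_signature are hypothetical.

```python
from collections import defaultdict

def group_by_column_signature(columns_by_file):
    """Map each distinct set of column names to the files that share it."""
    groups = defaultdict(list)
    for file_name, columns in columns_by_file.items():
        # frozenset ignores column order and can be used as a dict key
        groups[frozenset(columns)].append(file_name)
    return dict(groups)

# Hypothetical file names and headers, for illustration only
columns_by_file = {
    "a.csv": ["ID", "Name", "Age"],
    "b.csv": ["Age", "ID", "Name"],
    "c.csv": ["ID", "FullName", "Years"],
}
signatures = group_by_column_signature(columns_by_file)
for signature, files in signatures.items():
    print(sorted(signature), files)
```

# The Jaccard matrix and clustering can then run on the distinct signatures instead of on every file.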


