## Identifying duplicates

This notebook evaluates several techniques for identifying duplicates. It compares the methods to determine which is most effective at identifying and handling duplicates within a dataset.

## Import Twitter data
The data is stored in cpt.parquet. The sample dataset includes x entries.

In [None]:
import pandas as pd 

# need to specify this to show full text in one line
pd.options.display.max_colwidth = 0

df_path = "../data/data.csv"
df = pd.read_csv(df_path)

# Print basic information about the dataset
print("dataframe shape", df.shape)
print("--------------------------------")
print("value counts of the feed column", df.feed.value_counts())
print("--------------------------------")
df.head(3)


### Modify Report to Include Duplicated Rows in the DataFrame

The objective of this modification is to include the duplicated rows in the DataFrame, along with an additional column indicating whether each row is duplicated or not. By doing so, we can have a clear view of the duplicated data and facilitate further analysis.

To achieve this, the following steps were taken:

**The full dataset:**

- Two disjoint samples were drawn from the dataset: one to be turned into duplicate rows and one to serve as non-duplicate rows. This ensures that both types of data are included in the analysis.

**Duplicate Sample with ChatGPT:**

- The duplicate sample was passed to ChatGPT, which rewrote each text to produce a variation that preserves the essence of the original information. This step simulates duplicated rows that are worded differently.

**Labeling Duplicate Rows:**

- In order to differentiate between duplicate and non-duplicate rows, a new column was added to the DataFrame. This additional column indicates whether each row is a duplicate or not. By labeling the rows accordingly, we can easily identify and analyze the duplicated data within the dataset.

With these modifications, the DataFrame includes the duplicated rows, and the new column records the true label of each pair so that the predictions of each method can be checked against it.

In [None]:
cols=['report_description','feed']
sample = 300

sample_filter = df.loc[:,cols]

# Select the sample that is going to be passed to chatGPT to generate duplicates
duplicate_set = sample_filter.sample(n=sample, replace=False)

# Select the second sample, excluding the data that is passed to chatGPT for duplicate generation
non_duplicate_set = sample_filter.drop(duplicate_set.index).sample(n=sample, replace=False)

# Add label column
duplicate_set['state'] = ['duplicated'] * sample
duplicate_set = duplicate_set.reset_index(drop=True)

# Add label column
non_duplicate_set['state'] = ['not duplicated'] * sample
non_duplicate_set = non_duplicate_set.reset_index(drop=True)

non_duplicate_set.head(5)


### ChatGPT

You are doing some prompt engineering with ChatGPT. My suggestions:
- Move this piece of code out: put it in a separate file, then call the method from the notebook and comment the call out.
- Keep all the ChatGPT code together and add a comment saying that the code can be uncommented if you have an API key.

You should:
- Run this notebook and save the ChatGPT data in the parquet file. This is a once-off step, so do it as a separate script.
- If the key is discontinued, use ChatGPT manually.
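
The fallback order in these notes (saved file first, then the API, then manual use) could be sketched as below. This is only an illustration: the function name, the variable name `OPENAI_API_KEY`, and the parquet path are assumptions, not names taken from the project.

```python
import os
from pathlib import Path


def choose_paraphrase_source(cache_path, env=None, key_name="OPENAI_API_KEY"):
    """Decide where the paraphrased reports should come from.

    Returns "cache" when the saved file exists, "api" when a key is
    configured, and "manual" otherwise.
    """
    env = os.environ if env is None else env
    if Path(cache_path).exists():
        return "cache"
    # A missing key and an empty or None value are treated the same way
    if (env.get(key_name) or "").strip():
        return "api"
    return "manual"


# Hypothetical path, used only for illustration
source = choose_paraphrase_source("../data/paraphrased_reports.parquet")
print(source)
```

The `env` argument also accepts the dictionary returned by `dotenv_values`, so the same check works with the `.env` file that the notebook loads.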


In [None]:
from dotenv import dotenv_values, find_dotenv
import os
import sys
sys.path.append('../scripts')
from scripts import rewrite_text

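config = dotenv_values(find_dotenv())
# Show only the variable names; displaying `config` itself would print secrets in the notebook output
list(config)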

In [None]:
import openai
from dotenv import dotenv_values, find_dotenv

# Set up OpenAI API credentials from the .env file instead of hard-coding the key.
# Assumption: the key is stored under the name OPENAI_API_KEY.
config = dotenv_values(find_dotenv())
openai.api_key = config["OPENAI_API_KEY"]

# Paraphrase each report so that the copy is a duplicate with different wording
duplicate_set_gpt = duplicate_set.copy()
duplicate_set_gpt["report_description"] = duplicate_set["report_description"].apply(rewrite_text)
duplicate_set_gpt

### Show the original and paraphrased sentences side by side

In [None]:
duplicate_set["openai_report_description"] = duplicate_set_gpt["report_description"]
duplicate_set

#### Put it all together

In [None]:
# Pair each non-duplicated report with the original text of a different report
# (both samples were re-indexed from 0, so the assignment aligns row by row)
non_duplicate_set["openai_report_description"] = duplicate_set["report_description"]

# Combine both samples and shuffle them
sample_filter = pd.concat([non_duplicate_set, duplicate_set])
sample_filter = sample_filter.sample(frac=1, random_state=42)
sample_filter = sample_filter.reset_index(drop=True)
sample_filter

## Entity Recognition

In [None]:
import spacy
 
 
# Load the English language model
nlp = spacy.load('en_core_web_sm')
def perform_entity_recognition(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return  ', '.join(entities)

 

# Apply entity recognition to both description columns
sample_filter['entities'] = sample_filter['report_description'].apply(perform_entity_recognition)
sample_filter['entities_openai'] = sample_filter['openai_report_description'].apply(perform_entity_recognition)
sample_filter = sample_filter.reset_index(drop=True)

sample_filter

## Cosine similarity using TF-IDF

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
threshold = 70  # similarity scores are percentages (0-100)

def add_label_column(df, existing_column, threshold):
    """Label each row 'duplicated' when its score reaches the threshold."""
    df['predicted state'] = df[existing_column].apply(lambda x: 'not duplicated' if x < threshold else 'duplicated')
    return df

def is_duplicate_cosine_similarity(df, cols, threshold):
    # The vocabulary is learned from the first column only; words that appear
    # only in the second column are ignored by transform()
    vectorizer = TfidfVectorizer()
    tfidf_matrix1 = vectorizer.fit_transform(df[cols[0]])
    tfidf_matrix2 = vectorizer.transform(df[cols[1]])
    # Full pairwise matrix; the diagonal holds the score of each row's own pair
    cosine_sim = cosine_similarity(tfidf_matrix1, tfidf_matrix2)

    comparison_df = pd.DataFrame({
        'report_description': df[cols[0]],
        'openai_report_description': df[cols[1]],
        'Similarity': cosine_sim.diagonal() * 100,
        'state': df[cols[2]]
    })

    comparison_df = add_label_column(comparison_df, 'Similarity', threshold)
    return comparison_df
 



is_duplicate_cosine_similarity_df = is_duplicate_cosine_similarity(sample_filter, ['report_description',"openai_report_description","state"],threshold)
is_duplicate_cosine_similarity_df  

In [None]:
is_duplicate_cosine_similarity_df = (is_duplicate_cosine_similarity_df['state'] == is_duplicate_cosine_similarity_df['predicted state']).mean()
is_duplicate_cosine_similarity_df

In [None]:
entities_is_duplicate_cosine_similarity = is_duplicate_cosine_similarity(sample_filter,['report_description','entities',"state"],threshold)
 
accuracy_entities_is_duplicate_cosine_similarity = (entities_is_duplicate_cosine_similarity['state'] == entities_is_duplicate_cosine_similarity['predicted state']).mean()
accuracy_entities_is_duplicate_cosine_similarity

## fuzzywuzzy

In [None]:
from fuzzywuzzy import fuzz
import pandas as pd

def is_duplicate_fuzzywuzzy(df, cols, threshold=70):
    """Score each (cols[0], cols[1]) pair with token_set_ratio and label it against the threshold."""
    comparison_results = []
    for _, row in df.iterrows():
        text1 = row[cols[0]]
        text2 = row[cols[1]]
        # token_set_ratio returns an integer score between 0 and 100
        similarity_ratio = fuzz.token_set_ratio(text1, text2)
        comparison_results.append({
            'Text1': text1,
            'Text2': text2,
            'Similarity ratio': similarity_ratio,
            'predicted state': 'duplicated' if similarity_ratio >= threshold else 'not duplicated',
            'state': row[cols[2]]
        })

    return pd.DataFrame(comparison_results)


is_duplicate_fuzzywuzzy_df = is_duplicate_fuzzywuzzy(sample_filter, ['report_description',"openai_report_description","state"],threshold)
is_duplicate_fuzzywuzzy_df  

In [None]:
accuracy_is_duplicate_fuzzywuzzy_df = (is_duplicate_fuzzywuzzy_df['state'] == is_duplicate_fuzzywuzzy_df['predicted state']).mean()
accuracy_is_duplicate_fuzzywuzzy_df


In [None]:
entities_is_duplicate_fuzzywuzzy = is_duplicate_fuzzywuzzy(sample_filter,['report_description','entities',"state"],threshold)

accuracy_entities_is_duplicate_fuzzywuzzy = (entities_is_duplicate_fuzzywuzzy['state'] == entities_is_duplicate_fuzzywuzzy['predicted state']).mean()
accuracy_entities_is_duplicate_fuzzywuzzy


## Levenshtein

In [None]:
import pandas as pd
import Levenshtein
 

# Compare texts using Levenshtein distance and return the similarity as a percentage
def is_duplicate_Levenshtein(df, cols, threshold):
    similarities = []
    for text1, text2 in zip(df[cols[0]], df[cols[1]]):
        distance = Levenshtein.distance(text1, text2)
        max_length = max(len(text1), len(text2))
        # Two empty strings are identical; this also avoids dividing by zero
        similarity = (max_length - distance) / max_length * 100 if max_length else 100
        similarities.append(int(similarity))

    comparison_df = pd.DataFrame({
        'report_description': df[cols[0]],
        'openai_report_description': df[cols[1]],
        'Similarity Percentage': similarities,
        'state': df[cols[2]]
    })
    comparison_df = add_label_column(comparison_df, 'Similarity Percentage', threshold)

    return comparison_df

# Compare texts using Levenshtein distance and return similarity as a percentage
is_duplicate_Levenshtein_df = is_duplicate_Levenshtein(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_Levenshtein_df


In [None]:
accuracy_is_duplicate_Levenshtein_df_df = (is_duplicate_Levenshtein_df['state'] == is_duplicate_Levenshtein_df['predicted state']).mean()
accuracy_is_duplicate_Levenshtein_df_df

In [None]:
entities_is_duplicate_Levenshtein = is_duplicate_Levenshtein(sample_filter,['report_description','entities',"state"],threshold)

accuracy_entities_is_duplicate_Levenshtein = (entities_is_duplicate_Levenshtein['state'] == entities_is_duplicate_Levenshtein['predicted state']).mean()
accuracy_entities_is_duplicate_Levenshtein
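
To make the similarity formula used above concrete, here is a small pure-Python sketch of the same calculation. It assumes the standard edit distance with unit costs for insertion, deletion, and substitution; the helper names are illustrative, and the notebook itself relies on the `Levenshtein` package rather than on this code.

```python
def levenshtein_distance(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_percentage(a, b):
    """(max_length - distance) / max_length * 100, with empty pairs treated as identical."""
    max_length = max(len(a), len(b))
    if max_length == 0:
        return 100.0
    return (max_length - levenshtein_distance(a, b)) / max_length * 100


print(levenshtein_distance("kitten", "sitting"))        # 3
print(int(similarity_percentage("kitten", "sitting")))  # 57
```

Because the score is normalised by the longer text, a paraphrase that reorders or rewrites most words can score low even when the meaning is unchanged, which is worth keeping in mind when reading the accuracy for this method.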


## Embeddings

##### SentenceTransformer('distilbert-base-nli-mean-tokens')

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the SentenceTransformer model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Function to compare texts using SentenceTransformer
def compare_texts_sentence_transformer1(df, cols, threshold):
    embeddings1 = model.encode(df[cols[0]].tolist())
    embeddings2 = model.encode(df[cols[1]].tolist())
    # Pairwise cosine similarities; the diagonal holds the score of each row's own pair
    similarities = util.cos_sim(embeddings1, embeddings2)

    comparison_df = pd.DataFrame({
        'Text1': df[cols[0]],
        'Text2': df[cols[1]],
        # Convert the torch tensor to a NumPy array so pandas stores plain floats
        'Similarity Percentage': similarities.diagonal().numpy() * 100,
        'state': df[cols[2]]
    })
    comparison_df = add_label_column(comparison_df, 'Similarity Percentage', threshold)
    comparison_df['Similarity Percentage'] = comparison_df['Similarity Percentage'].astype(int)
    return comparison_df

# Compare texts using SentenceTransformer
is_duplicate_SentenceTransformer1_df = compare_texts_sentence_transformer1(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_SentenceTransformer1_df


In [None]:
accuracy_is_duplicate_SentenceTransformer1_df = (is_duplicate_SentenceTransformer1_df['state'] == is_duplicate_SentenceTransformer1_df['predicted state']).mean()
accuracy_is_duplicate_SentenceTransformer1_df


In [None]:
entities_is_duplicate_SentenceTransformer1 = compare_texts_sentence_transformer1(sample_filter,['report_description','entities',"state"],threshold)
 
accuracy_entities_is_duplicate_SentenceTransformer1 = (entities_is_duplicate_SentenceTransformer1['state'] == entities_is_duplicate_SentenceTransformer1['predicted state']).mean()
accuracy_entities_is_duplicate_SentenceTransformer1


##### SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to compare texts using SentenceTransformer
def compare_texts_sentence_transformer2(df, cols, threshold):
    embeddings1 = model.encode(df[cols[0]].tolist())
    embeddings2 = model.encode(df[cols[1]].tolist())
    # Pairwise cosine similarities; the diagonal holds the score of each row's own pair
    similarities = util.cos_sim(embeddings1, embeddings2)

    comparison_df = pd.DataFrame({
        'Text1': df[cols[0]],
        'Text2': df[cols[1]],
        # Convert the torch tensor to a NumPy array so pandas stores plain floats
        'Similarity Percentage': similarities.diagonal().numpy() * 100,
        'state': df[cols[2]]
    })
    comparison_df = add_label_column(comparison_df, 'Similarity Percentage', threshold)
    comparison_df['Similarity Percentage'] = comparison_df['Similarity Percentage'].astype(int)
    return comparison_df

# Compare texts using SentenceTransformer
is_duplicate_SentenceTransformer2_df = compare_texts_sentence_transformer2(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_SentenceTransformer2_df


In [None]:
accuracy_is_duplicate_SentenceTransformer2_df = (is_duplicate_SentenceTransformer2_df['state'] == is_duplicate_SentenceTransformer2_df['predicted state']).mean()
accuracy_is_duplicate_SentenceTransformer2_df


In [None]:
entities_is_duplicate_SentenceTransformer2_df = compare_texts_sentence_transformer2(sample_filter,['report_description','entities',"state"],threshold)
accuracy_entities_is_duplicate_SentenceTransformer2_df = (entities_is_duplicate_SentenceTransformer2_df['state'] == entities_is_duplicate_SentenceTransformer2_df['predicted state']).mean()
accuracy_entities_is_duplicate_SentenceTransformer2_df



At the moment, the entity-based runs only compare the full text of a report with the entities extracted from it. It would make more sense to mark a pair as duplicated when both texts contain the same entities, in which case I think the prediction accuracy would be somewhat higher.
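
One way to compare entities with entities is a set overlap (Jaccard) score. The sketch below assumes the entity strings are comma-separated, as produced by `perform_entity_recognition`; the helper names and the example entities are invented for illustration, and this is not something the notebook currently does.

```python
def entity_set(entities):
    """Split a comma-separated entity string into a set of lower-cased names."""
    return {part.strip().lower() for part in entities.split(",") if part.strip()}


def entity_overlap_percentage(entities_a, entities_b):
    """Jaccard overlap of two entity strings, as a percentage (0-100)."""
    set_a, set_b = entity_set(entities_a), entity_set(entities_b)
    union = set_a | set_b
    if not union:
        return 0.0  # no entities on either side: no evidence of a duplicate
    return len(set_a & set_b) / len(union) * 100


# Made-up entity strings, for illustration only
original = "Cape Town, Monday, Eskom"
paraphrase = "Eskom, Cape Town"
print(round(entity_overlap_percentage(original, paraphrase), 1))  # 66.7
```

Applied row by row to the `entities` and `entities_openai` columns, the result could be thresholded in the same way as the other scores. An entity whose own text contains a comma would be split into two, so the score is only approximate.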

In [None]:
# Collect the accuracy of every method; each list follows the order of 'Method'.
# Note: is_duplicate_cosine_similarity_df holds the accuracy value here,
# because an earlier cell overwrote the DataFrame of the same name.
results_df = pd.DataFrame({
    'Method': ["SentenceTransformer1", "SentenceTransformer2", "Levenshtein", "fuzzywuzzy", "cosine_similarity"],
    'text prediction': [accuracy_is_duplicate_SentenceTransformer1_df, accuracy_is_duplicate_SentenceTransformer2_df, accuracy_is_duplicate_Levenshtein_df_df, accuracy_is_duplicate_fuzzywuzzy_df, is_duplicate_cosine_similarity_df],
    'entities text prediction': [accuracy_entities_is_duplicate_SentenceTransformer1, accuracy_entities_is_duplicate_SentenceTransformer2_df, accuracy_entities_is_duplicate_Levenshtein, accuracy_entities_is_duplicate_fuzzywuzzy, accuracy_entities_is_duplicate_cosine_similarity]
})
# Convert the accuracies to percentages
results_df['text prediction'] = results_df['text prediction'] * 100
results_df['entities text prediction'] = results_df['entities text prediction'] * 100
# Display the DataFrame
results_df
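
All methods above are scored at the same threshold of 70, although their similarity scales differ. As a rough sketch of how the threshold could be examined per method, the helper below recomputes accuracy for a range of candidate thresholds, using the same rule as `add_label_column` (a score at or above the threshold counts as duplicated). The function names and the scores are made up for illustration.

```python
def accuracy_at_threshold(scores, labels, threshold):
    """Share of pairs whose thresholded prediction matches the true label."""
    predictions = ["duplicated" if score >= threshold else "not duplicated" for score in scores]
    matches = sum(p == t for p, t in zip(predictions, labels))
    return matches / len(labels)


def best_threshold(scores, labels, candidates=range(0, 101, 5)):
    """First candidate threshold that reaches the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at_threshold(scores, labels, t))


# Made-up scores and labels, for illustration only
scores = [92, 81, 64, 58, 35, 20]
labels = ["duplicated", "duplicated", "duplicated",
          "not duplicated", "not duplicated", "not duplicated"]
print(accuracy_at_threshold(scores, labels, 70))
print(best_threshold(scores, labels))
```

A threshold chosen this way is tuned on the same pairs it is evaluated on, so the resulting accuracy would be optimistic; a held-out sample would give a fairer comparison.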