## Identifying duplicate

This notebook focuses on the evaluation of various techniques for identifying duplicate. It aims to explore and compare different methods to determine which approach is most effective in identifying and handling duplicates within a dataset.

## Import Twitter data 
The data is stored in cpt.pARQUET. hge sample dataset includes x entries

In [8]:
import pandas as pd 

# need to specify this to show full text in one line
pd.options.display.max_colwidth = 0

df_path = "data.csv"
df = pd.read_csv(df_path)

# just printing the informaition 
print("dataframe shape", df.shape)
print("--------------------------------")
print("the columns feed value counts", df.feed.value_counts())
print("-------------------------------")
df.head(3)


dataframe shape (46726, 3)
--------------------------------
the columns feed value counts feed
CapeTownTrains     19812
CapeTownFreeway    8552 
WCLiveTraffic      7490 
CityofCTAlerts     5507 
MetrorailWC        3491 
TrafficSA          1660 
MyCiTiBus          202  
Rumbi_CPT          12   
Name: count, dtype: int64
-------------------------------


Unnamed: 0,country,report_description,feed
0,South Africa,Update: #MyCiTiAlert Power has been restored at Kuyasa station.\nhttps://t.co/rnAmfA6bwc,MyCiTiBus
1,South Africa,#MyCiTiAlert: Melkbosstrand station is currently offline due to loadshedding. Passengers are advised to use the validators on the bus for tapping in and out.,MyCiTiBus
2,South Africa,Update: #MyCiTiAlert Power has been restored at Melkbosstrand station.\nhttps://t.co/gJy9b8ST6i,MyCiTiBus


### Modify report to include the duplicated rows in the DataFrame, along with an additional column indicating whether each row is duplicated or not.
---
### Training datatset

- Splitting the dataset and getting a sample for both duplicates and non duplicates
- doing the chatGPT thing with the duplicate sample
- Make sure to label which sample was the duplicates

In [None]:
cols=['report_description','feed']
sample = 100

sample_filter = df.loc[:,cols]

# Select the sample that is going to be passed to chatGPT to generate duplicates
duplicate_set = sample_filter.sample(n=sample, replace=False)

# Select the second sample, excluding the data that is passed to chatGPT for duplicate generation
non_duplicate_set = sample_filter.drop(duplicate_set.index).sample(n=sample, replace=False)

# Add label column
duplicate_set['state'] = ['duplicated'] * sample
duplicate_set = duplicate_set.reset_index(drop=True)

# Add the repeated word column to the DataFrame
non_duplicate_set['state'] = ['not duplicated'] * sample
non_duplicate_set = non_duplicate_set.reset_index(drop=True)

non_duplicate_set.head(5)


Unnamed: 0,report_description,feed,state
0,Update: 340295: Road closure due to crash: N2 inbound at N2/M3 interchange. use alt route.\n \n#BeTheChange https://t.co/mZ99CatVNr,CapeTownFreeway,not duplicated
1,"There is a water outage in Mission Street, Sir Lowry's Pass, which could also be affecting the surrounds. The department is attending to this.",CityofCTAlerts,not duplicated
2,#NorthernLineCT \nInbound-\nT3506 approaching Eikenfontein station en-route \nOutbound-\nT2509 departed Bellville station en-route Kraaifontein station,CapeTownTrains,not duplicated
3,#SouthernLineCT : \nOutbound \nT0115 departed Retreat station en-route to Fish Hoek,MetrorailWC,not duplicated
4,MBA\nBlaauwberg Road\nLandmark: Bayside Mall\n(Tableview)\nServices Notified\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #MikeCharlie1,WCLiveTraffic,not duplicated


### chatGPT

You are doing some prompt engineering with chatGPT. My suggestion:
Move this piece of code out - put it in a separate file and the call the method and comment it out
Write a comment around if you have AN API KEY THEN YOU CAN UNCOMMENT THIS CODE OUT (ll the code for chatgpt together)

You should:
Run this notebook and save the chatgpt data in the parquet file. Once-off so do as separate script
If the key is dicontinued - then use chatgpt manually


In [None]:
import openai
# Set up OpenAI API credentials
from dotenv import dotenv_values, find_dotenv
import os


config = dotenv_values(find_dotenv())
  
openai.api_key = config['openai_api_key']
 
# Function to rewrite text using OpenAI
def rewrite_text(text):
    response = openai.Completion.create(
        engine='text-davinci-003',
        prompt=f"Rephrase with keeping the road names and the city: \"{text}\"",
        max_tokens=50,
        temperature=1
    )
    # print(response)
    rewritten_text = response.choices[0].text.strip()
    return rewritten_text
duplicate_reports_gpt = duplicate_reports.copy()

duplicate_reports_gpt["report_description"] = duplicate_reports["report_description"].apply(rewrite_text)
duplicate_reports_gpt

### show the actual and Paraphrased sentences side by side

In [None]:
duplicate_reports["openai_report_description"] = duplicate_reports_gpt["report_description"]
duplicate_reports

Unnamed: 0,report_description,feed,state,openai_report_description
0,"The water supply in Altena Road, Strand , has been restored.",CityofCTAlerts,duplicated,Water has been restored to Altena Road in Strand.
1,#NorthernLineCT \nOutbound - \nT2515 approaching Maitland station en-route Kraaifontein station.,MetrorailWC,duplicated,#NorthernLineCT \nHeading out - \nT2515 is about to reach Maitland station and then will continue its journey to Kraaifontein station.
2,#CapeFlatsLineCT \nInbound \nT0504 arrived Koeberg station en-route to Cape Town,CapeTownTrains,duplicated,The T0504 train has just arrived at Koeberg station on its journey towards Cape Town along the Cape Flats Line.
3,"Update,343127: Stationary Vehicle on N1 Outbound after Lower Church. All lanes open, No delays...#BeTheChange https://t.co/6SGn65Vai8",CapeTownFreeway,duplicated,Reporting from 343127: We have a stationary vehicle on the N1 Outbound after Lower Church. All lanes open with no delays. Stay safe out there! #BeTheChange https://t.co/6SGn65Vai
4,"There is a water outage in Liedla Street, Elim, which could also be affecting the surrounds. The department is attending to this.",CityofCTAlerts,duplicated,"Residents of Liedla Street in Elim may be affected by a water outage, and the relevant department is working to resolve the issue and any potential impacts on the surrounding areas."
...,...,...,...,...
95,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",WCLiveTraffic,duplicated,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No disruptions.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Anticipate delays. #WCLive"
96,#NorthernLineCT \nInbound - \nT2506 at Bellville station en-route Cape Town station .,CapeTownTrains,duplicated,T2506 on the Northern Line is travelling inbound from Bellville station to Cape Town station.
97,#SouthernLineCT \nInbound\nT0122 departed Heathfield en-route to Cape Town station.,CapeTownTrains,duplicated,T0122 has departed Heathfield and is heading towards Cape Town Station via the Southern Line.
98,#NorthernLineCT\nOutbound \nT3503 approaching Huguenot Station. \nInbound\nT3504 departed Klapmuts Station enroute to Muldersvlei Station.,CapeTownTrains,duplicated,Outbound\nT3503 approaching Huguenot Station.\nInbound\nT3504 departing Klapmuts Station heading to Muldersvlei Station.


#### put all togther

In [None]:
### make a non-duplicated report for each report in the sample 
non_duplicate_reports["openai_report_description"] = duplicate_reports["report_description"]

# combine both samples
sample_filter = pd.concat([non_duplicate_reports,duplicate_reports])
sample_filter = sample_filter.sample(frac=1, random_state=42)
sample_filter = sample_filter.reset_index(drop=True)
sample_filter

Unnamed: 0,report_description,feed,state,openai_report_description
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,CapeTownTrains,not duplicated,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic"
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,CapeTownTrains,not duplicated,Durban - South Beach area https://t.co/ySUwf3ljyL
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,WCLiveTraffic,not duplicated,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC"
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio",WCLiveTraffic,duplicated,"Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #"
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,CapeTownFreeway,duplicated,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL
...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.",CityofCTAlerts,duplicated,"A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter."
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,CapeTownTrains,not duplicated,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,CapeTownTrains,not duplicated,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.",CityofCTAlerts,duplicated,"A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully."


## Entity Recognition

In [None]:
import spacy
 
 
# Load the English language model
nlp = spacy.load('en_core_web_sm')
def perform_entity_recognition(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents]
    return  ', '.join(entities)

 

# Apply entity recognition to the 'text' column
sample_filter['entities'] = sample_filter['report_description'].apply(perform_entity_recognition)
sample_filter['entities_openai'] = sample_filter['openai_report_description'].apply(perform_entity_recognition)
sample_filter = sample_filter.reset_index(drop=True)

sample_filter

Unnamed: 0,report_description,feed,state,openai_report_description,entities,entities_openai
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,CapeTownTrains,not duplicated,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic","Fish Hoek, Cape Town","Roadworks, Potsdam, Borcherds Quarry, Two, #, #BokRadio #, #, #, EWNTraffic"
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,CapeTownTrains,not duplicated,Durban - South Beach area https://t.co/ySUwf3ljyL,"Fisantekraal, Cape Town",Durban - South Beach
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,WCLiveTraffic,not duplicated,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC","Lowrys Pass Village, Approach &, #HappeningRadio #MikeCharlie1 #, #EWNTraffic #, #WCLiveTelegram #, BokRadio","334366, Veld Fire, N1 Inbound, Klapmuts"
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio",WCLiveTraffic,duplicated,"Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #","MVA, Sienna, Burgandy Estate, Plattekloof Bridge, #WCLiveTraffic , #, EWNTraffic, BokRadio","Plattekloof Bridge, Sienna Drive, #WCLiveTraffic, #Bosbeer2006, #"
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,CapeTownFreeway,duplicated,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,"N2 Inbound, Shell Fuel Station, #BoozeFreeRoads","the N2 Highway Inbound, the Shell Fuel Station, #BoozeFreeRoads"
...,...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.",CityofCTAlerts,duplicated,"A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",Fink Rd Bridgetown,"Fink Rd, Bridgetown"
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,CapeTownTrains,not duplicated,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,"Fish Hoek, Cape Town, Cape Town","Cape Town, Fish Hoek"
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,CapeTownTrains,not duplicated,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,"Langa Station, Inbound, Langa Station, Cape Town, Pinelands Station, 4.30pm","MBA, Victoria Rd &amp, Twine Rd, #WCLiveTelegram #WCLiveZello, #WCLiveTraffic , #EWNTraffic #, BokRadio #, CompassTowing"
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.",CityofCTAlerts,duplicated,"A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.","Jakes Gerwel Dr, Viking Way",Jakes Gerwel Drive


## cosine similarity using TFIDF

In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
threshold = 70
def add_label_column(df, existing_column, threshold):
    df['predicted state'] = df[existing_column].apply(lambda x: 'not duplicated' if x < threshold else 'duplicated')
    return df

def is_duplicate_cosine_similarity(df,  cols,threshold):
    vectorizer = TfidfVectorizer()
    tfidf_matrix1 = vectorizer.fit_transform(df[cols[0]])
    tfidf_matrix2 = vectorizer.transform(df[cols[1]])
    cosine_sim = cosine_similarity(tfidf_matrix1, tfidf_matrix2)
    # print(cosine_sim.diagonal())

    comparison_df = pd.DataFrame({
        'report_description': df[cols[0]],
        'openai_report_description': df[cols[1]],
        'Similarity': cosine_sim.diagonal()*100,
        'state': df[cols[2]]
    })
    
    
    comparison_df =  add_label_column(comparison_df, 'Similarity', threshold)
 
    return comparison_df
 



is_duplicate_cosine_similarity_df = is_duplicate_cosine_similarity(sample_filter, ['report_description',"openai_report_description","state"],threshold)
is_duplicate_cosine_similarity_df  

Unnamed: 0,report_description,openai_report_description,Similarity,state,predicted state
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",0.000000,not duplicated,not duplicated
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,Durban - South Beach area https://t.co/ySUwf3ljyL,0.000000,not duplicated,not duplicated
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC",0.000000,not duplicated,not duplicated
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio","Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #",70.009098,duplicated,duplicated
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,70.004658,duplicated,duplicated
...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.","A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",55.300734,duplicated,not duplicated
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,60.853729,not duplicated,not duplicated
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,0.000000,not duplicated,not duplicated
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.","A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.",47.536413,duplicated,not duplicated


In [None]:
is_duplicate_cosine_similarity_df = (is_duplicate_cosine_similarity_df['state'] == is_duplicate_cosine_similarity_df['predicted state']).mean()
is_duplicate_cosine_similarity_df

0.67

In [None]:
entities_is_duplicate_cosine_similarity = is_duplicate_cosine_similarity(sample_filter,['report_description','entities',"state"],threshold)
 
accuracy_entities_is_duplicate_cosine_similarity = (entities_is_duplicate_cosine_similarity['state'] == entities_is_duplicate_cosine_similarity['predicted state']).mean()
accuracy_entities_is_duplicate_cosine_similarity

0.52

## fuzzywuzzy

In [None]:
from fuzzywuzzy import fuzz
import pandas as pd

def is_duplicate_fuzzywuzzy(df, cols, threshold=70):
    comparison_results = []
     
    for index, row in df.iterrows():
        text1 = row[cols[0]]
        text2 = row[cols[1]]
        # print( row)
        similarity_ratio = fuzz.token_set_ratio(text1, text2)
        label =  row[cols[2]]
        if similarity_ratio >= threshold:
            comparison_results.append({
                'Text1': text1,
                'Text2': text2,
                'Similarity ratio': similarity_ratio,
                "predicted state":"duplicated",
                'state': label
            })
        else:

            
             
            comparison_results.append({
                'Text1': text1,
                'Text2': text2,
                'Similarity ratio': int(similarity_ratio),
                "predicted state":"not duplicated",
                "state":label
            })

    comparison_df = pd.DataFrame(comparison_results)
    return comparison_df
 

is_duplicate_fuzzywuzzy_df = is_duplicate_fuzzywuzzy(sample_filter, ['report_description',"openai_report_description","state"],threshold)
is_duplicate_fuzzywuzzy_df  

Unnamed: 0,Text1,Text2,Similarity ratio,predicted state,state
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",33,not duplicated,not duplicated
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,Durban - South Beach area https://t.co/ySUwf3ljyL,35,not duplicated,not duplicated
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC",33,not duplicated,not duplicated
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio","Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #",88,duplicated,duplicated
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,91,duplicated,duplicated
...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.","A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",61,not duplicated,duplicated
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,89,duplicated,not duplicated
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,39,not duplicated,not duplicated
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.","A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.",69,not duplicated,duplicated


In [None]:
accuracy_is_duplicate_fuzzywuzzy_df = (is_duplicate_fuzzywuzzy_df['state'] == is_duplicate_fuzzywuzzy_df['predicted state']).mean()
accuracy_is_duplicate_fuzzywuzzy_df


0.84

In [None]:
entities_is_duplicate_fuzzywuzzy = is_duplicate_fuzzywuzzy(sample_filter,['report_description','entities',"state"],threshold)

accuracy_entities_is_duplicate_fuzzywuzzy = (entities_is_duplicate_fuzzywuzzy['state'] == entities_is_duplicate_fuzzywuzzy['predicted state']).mean()
accuracy_entities_is_duplicate_fuzzywuzzy


0.51

## Levenshtein

In [None]:
import pandas as pd
import Levenshtein
 

# Function to compare texts using Levenshtein distance and return similarity as a percentage
def is_duplicate_Levenshtein(df,cols,threshold):
    similarities = []
    for text1, text2 in zip(df[cols[0]], df[cols[1]]):
        distance = Levenshtein.distance(text1, text2)
        max_length = max(len(text1), len(text2))
        similarity = (max_length - distance) / max_length * 100
        similarities.append(int(similarity))
    
    comparison_df = pd.DataFrame({
        'report_description': df[cols[0]],
        'openai_report_description': df[cols[1]],
        'Similarity Percentage':  similarities,
         "state": df[cols[2]]        
    })
    comparison_df =  add_label_column(comparison_df, 'Similarity Percentage', threshold)
    
    return comparison_df

# Compare texts using Levenshtein distance and return similarity as a percentage
is_duplicate_Levenshtein_df = is_duplicate_Levenshtein(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_Levenshtein_df


Unnamed: 0,report_description,openai_report_description,Similarity Percentage,state,predicted state
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",19,not duplicated,not duplicated
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,Durban - South Beach area https://t.co/ySUwf3ljyL,20,not duplicated,not duplicated
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC",16,not duplicated,not duplicated
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio","Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #",37,duplicated,not duplicated
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,69,duplicated,not duplicated
...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.","A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",44,duplicated,not duplicated
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,49,not duplicated,not duplicated
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,17,not duplicated,not duplicated
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.","A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.",44,duplicated,not duplicated


In [None]:
accuracy_is_duplicate_Levenshtein_df_df = (is_duplicate_Levenshtein_df['state'] == is_duplicate_Levenshtein_df['predicted state']).mean()
accuracy_is_duplicate_Levenshtein_df_df

0.53

In [None]:
entities_is_duplicate_Levenshtein = is_duplicate_Levenshtein(sample_filter,['report_description','entities',"state"],threshold)

accuracy_entities_is_duplicate_Levenshtein = (entities_is_duplicate_Levenshtein['state'] == entities_is_duplicate_Levenshtein['predicted state']).mean()
accuracy_entities_is_duplicate_Levenshtein


0.5

## embeddings

 ##### SentenceTransformer('distilbert-base-nli-mean-tokens')

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer,util
from sklearn.metrics.pairwise import cosine_similarity
 

# Load the SentenceTransformer model
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Function to compare texts using SentenceTransformer
def compare_texts_sentence_transformer1(df, cols,threshold):
    embeddings1 = model.encode(df[cols[0]].tolist())
    embeddings2 = model.encode(df[cols[1]].tolist())
    similarities = util.cos_sim(embeddings1, embeddings2)

    comparison_df = pd.DataFrame({
        'Text1': df[cols[0]],
        'Text2': df[cols[1]],
        'Similarity Percentage': similarities.diagonal()*100,
         "state": df[cols[2]] 
          
    })
    comparison_df =  add_label_column(comparison_df, 'Similarity Percentage', threshold)
    comparison_df['Similarity Percentage'] = comparison_df['Similarity Percentage'].astype(int)
    return comparison_df

# Compare texts using SentenceTransformer
is_duplicate_SentenceTransformer1_df = compare_texts_sentence_transformer1(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_SentenceTransformer1_df


Unnamed: 0,Text1,Text2,Similarity Percentage,state,predicted state
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",58,not duplicated,not duplicated
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,Durban - South Beach area https://t.co/ySUwf3ljyL,46,not duplicated,not duplicated
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC",54,not duplicated,not duplicated
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio","Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #",75,duplicated,duplicated
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,89,duplicated,duplicated
...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.","A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",92,duplicated,duplicated
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,94,not duplicated,duplicated
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,51,not duplicated,not duplicated
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.","A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.",90,duplicated,duplicated


In [None]:
accuracy_is_duplicate_SentenceTransformer1_df = (is_duplicate_SentenceTransformer1_df['state'] == is_duplicate_SentenceTransformer1_df['predicted state']).mean()
accuracy_is_duplicate_SentenceTransformer1_df


0.865

In [None]:
entities_is_duplicate_SentenceTransformer1 = compare_texts_sentence_transformer1(sample_filter,['report_description','entities',"state"],threshold)
 
accuracy_entities_is_duplicate_SentenceTransformer1 = (entities_is_duplicate_SentenceTransformer1['state'] == entities_is_duplicate_SentenceTransformer1['predicted state']).mean()
accuracy_entities_is_duplicate_SentenceTransformer1


0.555

##### SentenceTransformer('all-MiniLM-L6-v2')

In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer,util
from sklearn.metrics.pairwise import cosine_similarity
 


# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Function to compare texts using SentenceTransformer
def compare_texts_sentence_transformer2(df, cols,threshold):
    # embeddings1 = model.encode(df[cols[0]].tolist())
    # embeddings2 = model.encode(df[cols[1]].tolist())
    # similarities = cosine_similarity(embeddings1, embeddings2)
    embeddings1 = model.encode(df[cols[0]].tolist())
    embeddings2 = model.encode(df[cols[1]].tolist())
    similarities = util.cos_sim(embeddings1, embeddings2)
   
    comparison_df = pd.DataFrame({
        'Text1': df[cols[0]],
        'Text2': df[cols[1]],
        'Similarity Percentage':similarities.diagonal()*100,
         "state": df[cols[2]] 
          
    })
    comparison_df =  add_label_column(comparison_df, 'Similarity Percentage', threshold)
    comparison_df['Similarity Percentage'] = comparison_df['Similarity Percentage'].astype(int)
    return comparison_df

# Compare texts using SentenceTransformer
is_duplicate_SentenceTransformer2_df = compare_texts_sentence_transformer2(sample_filter, ['report_description',"openai_report_description","state"],threshold)

is_duplicate_SentenceTransformer2_df


Unnamed: 0,Text1,Text2,Similarity Percentage,state,predicted state
0,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town,"🚧Roadworks🚧\n- N7 Southbound before Potsdam. No delays.\n- N2 Outbound before Borcherds Quarry, Two lanes closed, Expect delays.\n#WCLiveTraffic #BokRadio #Bosbeer2006 #happeningradio #EWNTraffic",39,not duplicated,not duplicated
1,#NorthernLineCT \nInbound - \nT2630 approaching Fisantekraal station en-route Cape Town station .,Durban - South Beach area https://t.co/ySUwf3ljyL,40,not duplicated,not duplicated
2,Protest Action:\nSir Lowrys Pass Village side near to the clinic \nServices on scene \nApproach & pass with caution \n#HappeningRadio #MikeCharlie1 #WCLiveTraffic #EWNTraffic #Bosbeer2006 #WCLiveTelegram #happeningradio #BokRadio,"334366: Veld Fire, N1 Inbound after Klapmuts, all lanes open, drive carefully. #ShareTheRoads https://t.co/KA2BEyW7SC",15,not duplicated,not duplicated
3,"MVA \nSienna drive, Burgandy Estate, on Plattekloof Bridge\n#WCLiveTraffic #Bosbeer2006 \n#happeningradio #EWNTraffic #BokRadio","Drivers in the Burgandy Estate area of Plattekloof Bridge, keep an eye out on the traffic on Sienna Drive due to #WCLiveTraffic, #Bosbeer2006, #happeningradio, #",75,duplicated,duplicated
4,Veld Fire on N2 Inbound after Shell Fuel Station. All lanes open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,There is a wildfire on the N2 Highway Inbound near the Shell Fuel Station. All traffic lanes are open. #BoozeFreeRoads https://t.co/RJzqpEk1pL,81,duplicated,duplicated
...,...,...,...,...,...
195,"There is a water outage in Fink Rd Bridgetown, which could also be affecting the surrounds. The department is attending to this.","A water disruption is being experienced on Fink Rd in Bridgetown, and the issue may have a ripple effect on the surrounding area. The responsible department is on its way to resolve the matter.",83,duplicated,duplicated
196,#SouthernLineCT \nInbound \nT0108 departed Fish Hoek station en-route to Cape Town \nT0106 arrived Newlands station en-route to Cape Town,#SouthernLineCT \nOutbound\nT0119 departed Cape Town station en-route to Fish Hoek.,90,not duplicated,duplicated
197,#CentralLineCT\nOutbound\nT9519 arrived at Langa Station will turn around as Inbound T9520 departing Langa Station enroute to Cape Town via Pinelands Station at 4.30pm.,MBA\nVictoria Rd &amp; Twine Rd (Plumstead)\nTowing On Scene\nServices On Scene\nPass With Caution\n #WCLiveTelegram #WCLiveZello #Bosbeer2006 #WCLiveTraffic #EWNTraffic #BokRadio #happeningradio #CompassTowing,23,not duplicated,not duplicated
198,"Bus broken down on Jakes Gerwel Dr northbound before Viking Way, left lane obstructed. Proceed with caution.","A bus has malfunctioned on Jakes Gerwel Drive heading in a northerly direction before Viking Way, blocking the left lane. Please drive carefully.",76,duplicated,duplicated


In [None]:
accuracy_is_duplicate_SentenceTransformer2_df = (is_duplicate_SentenceTransformer2_df['state'] == is_duplicate_SentenceTransformer2_df['predicted state']).mean()
accuracy_is_duplicate_SentenceTransformer2_df


0.935

In [None]:
entities_is_duplicate_SentenceTransformer2_df = compare_texts_sentence_transformer2(sample_filter,['report_description','entities',"state"],threshold)
accuracy_entities_is_duplicate_SentenceTransformer2_df = (entities_is_duplicate_SentenceTransformer2_df['state'] == entities_is_duplicate_SentenceTransformer2_df['predicted state']).mean()
accuracy_entities_is_duplicate_SentenceTransformer2_df


0.5


at the moment is just comparing the text in whole report with the entities - it make sinse to mark the pair as duplicated if they contain the same entities and in that case the prediction would be a bit higher I think 

In [None]:
df = pd.DataFrame({
    'Method':["SentenceTransformer1","SentenceTransformer2","Levenshtein","fuzzywuzzy","cosine_similarity"],
    'text prediction':  [accuracy_is_duplicate_SentenceTransformer1_df,accuracy_is_duplicate_SentenceTransformer2_df,accuracy_is_duplicate_Levenshtein_df_df,accuracy_is_duplicate_fuzzywuzzy_df,is_duplicate_cosine_similarity_df],  
    'entities text prediction': [accuracy_entities_is_duplicate_SentenceTransformer2_df,accuracy_entities_is_duplicate_SentenceTransformer1,accuracy_entities_is_duplicate_Levenshtein,accuracy_entities_is_duplicate_fuzzywuzzy,accuracy_entities_is_duplicate_cosine_similarity]
})
df['text prediction'] = df['text prediction']*100
df['entities text prediction'] = df['entities text prediction']*100
# Display the DataFrame
# Display the DataFrame
df

Unnamed: 0,Method,text prediction,entities text prediction
0,SentenceTransformer1,86.5,50.0
1,SentenceTransformer2,93.5,55.5
2,Levenshtein,53.0,50.0
3,fuzzywuzzy,84.0,51.0
4,cosine_similarity,67.0,52.0
