## Content
  - [Approach 1: Simple Duplicate Check](#Approach-1-Simple-Duplicate-Check)
  - [Approach 2: Advanced Similarity Check](#Approach-2-Advanced-Similarity-Check)
    - [Threshold = 0.95](#Threshold-0.95)
    - [Threshold = 0.9](#Threshold-0.9)
    - [Threshold = 0.8](#Threshold-0.8)

## Approach 1: Simple Duplicate Check

In [1]:
# 1. Import Libraries
import pandas as pd

# 2. Load Data
df = pd.read_csv('train_split.csv')

# 3. Find duplicates in the 'full_text' column
duplicates = df[df.duplicated(subset=['full_text'], keep=False)]

# 4. Display duplicates
print(f"Number of duplicate entries: {duplicates.shape[0]}")
print(duplicates)


Number of duplicate entries: 0
Empty DataFrame
Columns: [essay_id, full_text, score]
Index: []
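An exact match on the raw strings finds nothing, but it would also miss copies that differ only in case or whitespace. The sketch below is one way to check for that before moving to similarity scores. The `normalize_text` helper and the toy rows are assumptions made for illustration, not part of the original notebook or of `train_split.csv`.

```python
import re
import pandas as pd

def normalize_text(text: str) -> str:
    """Collapse whitespace runs to one space, trim, and lowercase (hypothetical helper)."""
    return re.sub(r"\s+", " ", text).strip().lower()

# Toy rows standing in for train_split.csv (made up for illustration).
toy = pd.DataFrame({
    "essay_id": ["a1", "b2", "c3"],
    "full_text": ["Phones and  driving.\n", "phones and driving.", "A different essay."],
    "score": [3, 3, 4],
})

toy["normalized"] = toy["full_text"].map(normalize_text)

# Same duplicated() call as in the cell above, once on raw and once on normalized text.
exact = toy[toy.duplicated(subset=["full_text"], keep=False)]
normalized = toy[toy.duplicated(subset=["normalized"], keep=False)]

print(f"Exact duplicates: {len(exact)}")
print(f"Duplicates after normalization: {len(normalized)}")
```

If the normalized check also returns nothing on the real data, the remaining near-copies differ in wording, which is what the similarity approach below targets.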


## Approach 2: Advanced Similarity Check

### Threshold = 0.95

In [4]:
# 1. Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 2. Load Data
df = pd.read_csv('train_split.csv')

# 3. Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['full_text'])

# 4. Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 5. Find duplicates based on a similarity threshold
threshold = 0.95  # You can adjust this threshold
duplicates = []
seen_indices = set()

for idx, row in enumerate(cosine_sim):
    if idx in seen_indices:
        continue

    similar_indices = [i for i, sim in enumerate(row) if sim > threshold and i != idx]
    if similar_indices:
        duplicates.append((idx, similar_indices))
        seen_indices.update(similar_indices)

# 6. Collect data for display
duplicate_data = []

for dup in duplicates:
    original_idx = dup[0]
    duplicate_indices = dup[1]
    
    original_text = df.iloc[original_idx]['full_text']
    original_score = df.iloc[original_idx]['score']
    original_text_length = len(original_text)
    
    for duplicate_idx in duplicate_indices:
        duplicate_text = df.iloc[duplicate_idx]['full_text']
        duplicate_score = df.iloc[duplicate_idx]['score']
        duplicate_text_length = len(duplicate_text)
        
        duplicate_data.append({
            'Original Index': original_idx,
            'Duplicate Index': duplicate_idx,
            'Original Text': original_text,
            'Duplicate Text': duplicate_text,
            'Original Score': original_score,
            'Duplicate Score': duplicate_score,
            'Original Text Length': original_text_length,
            'Duplicate Text Length': duplicate_text_length
        })

# 7. Create DataFrame from the collected data
df_duplicates = pd.DataFrame(duplicate_data)

# Reorder columns
df_duplicates = df_duplicates[[
    'Original Index', 'Duplicate Index', 
    'Original Text', 'Duplicate Text', 
    'Original Score', 'Duplicate Score', 
    'Original Text Length', 'Duplicate Text Length'
]]

# Count the number of duplicates
num_duplicates = len(duplicate_data)
print(f"Number of duplicate entries: {num_duplicates}")


Number of duplicate entries: 8
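Since only the threshold changes between runs, the similarity matrix can be computed once and reused. The sketch below assumes a hypothetical `similar_pairs` helper and a few made-up sentences in place of the essays. It counts each unordered pair once, so its counts may differ from the `seen_indices` loop above, which counts matches per starting essay.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_pairs(sim_matrix, threshold):
    """Return (i, j) pairs with i < j whose similarity exceeds a positive threshold."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-similarity of 1.0) is ignored.
    upper = np.triu(sim_matrix, k=1)
    rows, cols = np.where(upper > threshold)
    return list(zip(rows.tolist(), cols.tolist()))

# Made-up sentences standing in for df['full_text'].
texts = [
    "cell phones distract drivers on the road",
    "cell phones distract drivers on the road today",
    "students learn better with summer projects",
]
tfidf = TfidfVectorizer().fit_transform(texts)
sim = cosine_similarity(tfidf)  # computed once, reused for every threshold

for threshold in (0.95, 0.9, 0.8):
    pairs = similar_pairs(sim, threshold)
    print(f"threshold={threshold}: {len(pairs)} pair(s)")
```

Note that the dense matrix grows with the square of the number of essays, so this form is only practical while it fits in memory.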


In [7]:
# Display the DataFrame
df_duplicates.head(3)

| | Original Index | Duplicate Index | Original Text | Duplicate Text | Original Score | Duplicate Score | Original Text Length | Duplicate Text Length |
|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 4576 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 5 | 3331 | 3021 |
| 1 | 125 | 10898 | Phones & Driving\n\nCell phones were introduce... | Phones and Driving\n\nEveryday people die in c... | 5 | 5 | 3331 | 3168 |
| 2 | 125 | 17988 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 6 | 3331 | 3167 |


Let's check threshold = 0.9

### Threshold = 0.9

In [10]:
# 1. Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 2. Load Data
df = pd.read_csv('train_split.csv')

# 3. Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['full_text'])

# 4. Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 5. Find duplicates based on a similarity threshold
threshold = 0.9  # You can adjust this threshold
duplicates = []
seen_indices = set()

for idx, row in enumerate(cosine_sim):
    if idx in seen_indices:
        continue

    similar_indices = [i for i, sim in enumerate(row) if sim > threshold and i != idx]
    if similar_indices:
        duplicates.append((idx, similar_indices))
        seen_indices.update(similar_indices)

# 6. Collect data for display
duplicate_data = []

for dup in duplicates:
    original_idx = dup[0]
    duplicate_indices = dup[1]
    
    original_text = df.iloc[original_idx]['full_text']
    original_score = df.iloc[original_idx]['score']
    original_text_length = len(original_text)
    
    for duplicate_idx in duplicate_indices:
        duplicate_text = df.iloc[duplicate_idx]['full_text']
        duplicate_score = df.iloc[duplicate_idx]['score']
        duplicate_text_length = len(duplicate_text)
        
        duplicate_data.append({
            'Original Index': original_idx,
            'Duplicate Index': duplicate_idx,
            'Original Text': original_text,
            'Duplicate Text': duplicate_text,
            'Original Score': original_score,
            'Duplicate Score': duplicate_score,
            'Original Text Length': original_text_length,
            'Duplicate Text Length': duplicate_text_length
        })

# 7. Create DataFrame from the collected data
df_duplicates = pd.DataFrame(duplicate_data)

# Reorder columns
df_duplicates = df_duplicates[[
    'Original Index', 'Duplicate Index', 
    'Original Text', 'Duplicate Text', 
    'Original Score', 'Duplicate Score', 
    'Original Text Length', 'Duplicate Text Length'
]]

# Count the number of duplicates
num_duplicates = len(duplicate_data)
print(f"Number of duplicate entries: {num_duplicates}")


Number of duplicate entries: 12


In [11]:
# Display the DataFrame
df_duplicates.head(4)

| | Original Index | Duplicate Index | Original Text | Duplicate Text | Original Score | Duplicate Score | Original Text Length | Duplicate Text Length |
|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 4576 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 5 | 3331 | 3021 |
| 1 | 125 | 10414 | Phones & Driving\n\nCell phones were introduce... | Everyday people die or get injured in car acci... | 5 | 5 | 3331 | 2881 |
| 2 | 125 | 10898 | Phones & Driving\n\nCell phones were introduce... | Phones and Driving\n\nEveryday people die in c... | 5 | 5 | 3331 | 3168 |
| 3 | 125 | 17988 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 6 | 3331 | 3167 |
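
The rows above list each match against essay 125 separately. One way to see them as a single group of near-copies is to merge the index pairs with a small union-find. This is only a sketch: `group_pairs` is a hypothetical helper, and the pairs are the index columns of the four rows shown above.

```python
from collections import defaultdict

def group_pairs(pairs):
    """Merge (i, j) pairs into groups of linked indices using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for node in list(parent):
        groups[find(node)].append(node)
    return [sorted(members) for members in groups.values()]

# (Original Index, Duplicate Index) from the threshold = 0.9 rows shown above.
pairs = [(125, 4576), (125, 10414), (125, 10898), (125, 17988)]
print(group_pairs(pairs))  # [[125, 4576, 10414, 10898, 17988]]
```

Working per group rather than per pair makes it easier to decide which single essay to keep from each cluster.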


Thoughts
- This threshold looks like the best choice for removal: the matched essays have similar scores and similar lengths.
- Which essay of each match should be removed? After an additional review of the texts, those containing [proper name] will be removed. These are 789 and 3109.
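
The removal rule described above could be sketched as follows. Everything here is an assumption for illustration: the `drop_marked_rows` helper, the `NAME_PLACEHOLDER` marker string, and the toy rows. Whether 789 and 3109 are positional rows or `essay_id` values is not stated, so the sketch does not use them.

```python
import pandas as pd

def drop_marked_rows(df, candidate_rows, marker):
    """Drop the candidate rows (by position) whose full_text contains marker."""
    candidates = df.iloc[candidate_rows]
    # Plain substring match; regex=False avoids treating the marker as a pattern.
    has_marker = candidates["full_text"].str.contains(marker, regex=False)
    to_drop = candidates[has_marker].index
    return df.drop(index=to_drop).reset_index(drop=True)

# Toy rows standing in for train_split.csv (made up for illustration).
toy = pd.DataFrame({
    "essay_id": ["a1", "b2", "c3"],
    "full_text": [
        "Dear NAME_PLACEHOLDER, phones distract drivers.",
        "Phones distract drivers.",
        "Another essay.",
    ],
    "score": [5, 5, 3],
})

cleaned = drop_marked_rows(toy, candidate_rows=[0, 1], marker="NAME_PLACEHOLDER")
print(cleaned["essay_id"].tolist())  # ['b2', 'c3']
```

Restricting the check to the candidate rows means essays outside the matched set are never dropped, even if they contain the marker.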

### Threshold = 0.8

In [14]:
# 1. Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 2. Load Data
df = pd.read_csv('train_split.csv')

# 3. Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['full_text'])

# 4. Calculate cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 5. Find duplicates based on a similarity threshold
threshold = 0.80  # You can adjust this threshold
duplicates = []
seen_indices = set()

for idx, row in enumerate(cosine_sim):
    if idx in seen_indices:
        continue

    similar_indices = [i for i, sim in enumerate(row) if sim > threshold and i != idx]
    if similar_indices:
        duplicates.append((idx, similar_indices))
        seen_indices.update(similar_indices)

# 6. Collect data for display
duplicate_data = []

for dup in duplicates:
    original_idx = dup[0]
    duplicate_indices = dup[1]
    
    original_text = df.iloc[original_idx]['full_text']
    original_score = df.iloc[original_idx]['score']
    original_text_length = len(original_text)
    
    for duplicate_idx in duplicate_indices:
        duplicate_text = df.iloc[duplicate_idx]['full_text']
        duplicate_score = df.iloc[duplicate_idx]['score']
        duplicate_text_length = len(duplicate_text)
        
        duplicate_data.append({
            'Original Index': original_idx,
            'Duplicate Index': duplicate_idx,
            'Original Text': original_text,
            'Duplicate Text': duplicate_text,
            'Original Score': original_score,
            'Duplicate Score': duplicate_score,
            'Original Text Length': original_text_length,
            'Duplicate Text Length': duplicate_text_length
        })

# 7. Create DataFrame from the collected data
df_duplicates = pd.DataFrame(duplicate_data)

# Reorder columns
df_duplicates = df_duplicates[[
    'Original Index', 'Duplicate Index', 
    'Original Text', 'Duplicate Text', 
    'Original Score', 'Duplicate Score', 
    'Original Text Length', 'Duplicate Text Length'
]]

# Count the number of duplicates
num_duplicates = len(duplicate_data)
print(f"Number of duplicate entries: {num_duplicates}")


Number of duplicate entries: 46


In [15]:
# Display the DataFrame
df_duplicates.head(16)

| | Original Index | Duplicate Index | Original Text | Duplicate Text | Original Score | Duplicate Score | Original Text Length | Duplicate Text Length |
|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 4576 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 5 | 3331 | 3021 |
| 1 | 125 | 5578 | Phones & Driving\n\nCell phones were introduce... | Drivers should not be able to use cell phones ... | 5 | 5 | 3331 | 2982 |
| 2 | 125 | 5691 | Phones & Driving\n\nCell phones were introduce... | Phones and driving\n\nEveryday people die in c... | 5 | 5 | 3331 | 2743 |
| 3 | 125 | 10414 | Phones & Driving\n\nCell phones were introduce... | Everyday people die or get injured in car acci... | 5 | 5 | 3331 | 2881 |
| 4 | 125 | 10898 | Phones & Driving\n\nCell phones were introduce... | Phones and Driving\n\nEveryday people die in c... | 5 | 5 | 3331 | 3168 |
| 5 | 125 | 10905 | Phones & Driving\n\nCell phones were introduce... | Why drivers shouldn't or should use phone whil... | 5 | 5 | 3331 | 4586 |
| 6 | 125 | 11745 | Phones & Driving\n\nCell phones were introduce... | Cell phones have become an epidemic to drivers... | 5 | 5 | 3331 | 2885 |
| 7 | 125 | 17749 | Phones & Driving\n\nCell phones were introduce... | Phones & Driving\n\nAlthough cell phones have ... | 5 | 5 | 3331 | 3024 |
| 8 | 125 | 17988 | Phones & Driving\n\nCell phones were introduce... | Everyday people die in car accidents because t... | 5 | 6 | 3331 | 3167 |
| 9 | 247 | 11725 | Would it be surprising to know that kids learn... | Would it be surprising to know if students to ... | 5 | 5 | 3417 | 4103 |


At this threshold the matched texts receive different scores, apparently because their lengths differ, so this option is skipped. It appears that students may have copied essays from friends and then expanded them.
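
To put numbers on that observation, the score gap and the length ratio can be computed per matched pair. The sketch below uses four rows copied from the threshold = 0.8 table above (text columns omitted). The column names `Score Diff` and `Length Ratio` are assumptions introduced here.

```python
import pandas as pd

# Rows 0, 5, 8 and 9 of the threshold = 0.8 table above, without the text columns.
pairs = pd.DataFrame({
    "Original Index": [125, 125, 125, 247],
    "Duplicate Index": [4576, 10905, 17988, 11725],
    "Original Score": [5, 5, 5, 5],
    "Duplicate Score": [5, 5, 6, 5],
    "Original Text Length": [3331, 3331, 3331, 3417],
    "Duplicate Text Length": [3021, 4586, 3167, 4103],
})

# Absolute score gap and relative length of the duplicate versus the original.
pairs["Score Diff"] = (pairs["Duplicate Score"] - pairs["Original Score"]).abs()
pairs["Length Ratio"] = pairs["Duplicate Text Length"] / pairs["Original Text Length"]

print(pairs[["Original Index", "Duplicate Index", "Score Diff", "Length Ratio"]].round(2))
```

Applied to the full `df_duplicates` frame, the same two columns would show how often a large length ratio coincides with a score gap.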