# Data Processing for Dataset 1 - AI Vs Human Text 

## 1. Load the Dataset

In [2]:
import pandas as pd

# Load the dataset from your local path
df = pd.read_csv('../AI_Human.csv')  

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 487235 entries, 0 to 487234
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   text       487235 non-null  object 
 1   generated  487235 non-null  float64
dtypes: float64(1), object(1)
memory usage: 7.4+ MB


Unnamed: 0,text,generated
0,Cars. Cars have been around since they became ...,0.0
1,Transportation is a large necessity in most co...,0.0
2,"""America's love affair with it's vehicles seem...",0.0
3,How often do you ride in a car? Do you drive a...,0.0
4,Cars are a wonderful thing. They are perhaps o...,0.0


In [3]:
# Checking sample data
df['text'][0]

'Cars. Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then. But now, people are starting to question if limiting car usage would be a good thing. To me, limiting the use of cars might be a good thing to do.\n\nIn like matter of this, article, "In German Suburb, Life Goes On Without Cars," by Elizabeth Rosenthal states, how automobiles are the linchpin of suburbs, where middle class families from either Shanghai or Chicago tend to make their homes. Experts say how this is a huge impediment to current efforts to reduce greenhouse gas emissions from tailpipe. Passenger cars are responsible for 12 percent of greenhouse gas emissions in Europe...and up to 50 percent in some carintensive areas in the United States. Cars are the main reason for the greenhouse gas emissions because of a lot of people driving them around all the time getting where they need to go. Article

## 2. Check Null Values & Noise

In [4]:
# Check for missing values in the dataset
df.isnull().sum()


text         0
generated    0
dtype: int64

In [5]:
# Calculate text_length (number of whitespace-separated words)
df['text_length'] = df['text'].apply(lambda x: len(str(x).split()))

# Find rows with 15 words or fewer
target_rows = df[df['text_length'] <= 15]

# Display the number of such rows and print them
print(f"Number of rows with text length <= 15: {len(target_rows)}")
target_rows

Number of rows with text length <= 15: 32


Unnamed: 0,text,generated,text_length
2380,Code]\n[Email Address]\n[Phone Number],1.0,5
2381,]\n\n[Email]\n\n[Phone Number],1.0,4
2384,],1.0,1
2385,]\n[Email]\n[Phone Number],1.0,4
2388,]\n[Email Address]\n[Phone Number],1.0,5
28737,Facial action coding,1.0,3
29318,Community service is an integral part of ever...,1.0,14
29331,Community service.,1.0,2
29337,Community service refers to the activities an...,1.0,11
29374,"Write an essay on the topic ""A Cowboy Who Rod...",1.0,12


In [6]:
# find duplicates
duplicate_mask = df.duplicated(subset='text', keep=False)
df[duplicate_mask]

Unnamed: 0,text,generated,text_length


## 3. Data Cleaning and Processing

### 3.1 Remove Tags

The text contains newline characters (`\n`) and apostrophes (`'`). Newlines are replaced with a space and apostrophes are removed.

In [8]:
def remove_tags(text):
    tags = ['\n', '\'']
    for tag in tags:
        text = text.replace(tag, ' ' if tag == '\n' else '')
    
    return text


df['text'] = df['text'].apply(remove_tags)

# https://www.kaggle.com/code/saurabhkailaskuche/ai-generated-vs-human

### 3.2 Remove Noise 

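Remove the short rows flagged in Section 2 (15 words or fewer).

In [9]:
# Keep only rows with more than 15 words, matching the `<= 15` check in Section 2
df = df[df['text_length'] > 15]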

### 3.3 Pre-process Data

#### 3.3.1 Process the Data in Parallel using Joblib

In [7]:
from joblib import Parallel, delayed
import multiprocessing
num_cores = multiprocessing.cpu_count()
num_cores

32

#### 3.3.2 Pre-process the text

The `preprocess_text` function (imported from `utils` in the cell below) performs these steps:
1. Lowercases the input text.
2. Tokenizes the text into words.
3. Removes English stopwords and non-alphanumeric tokens (e.g., punctuation).
4. Applies either lemmatization or stemming to the remaining words.
5. Joins the processed words back into a single string.
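
The `utils` module itself is not shown in this notebook. As a rough illustration of the five steps, here is a minimal standard-library sketch. The names `preprocess_text_sketch`, `toy_stem`, and `STOPWORDS` are hypothetical, the stopword list is a tiny hand-picked sample, and the "stemmer" only strips a trailing plural `s`; the real function presumably relies on a full stopword list and a proper lemmatizer or stemmer.

```python
import re

# Assumption: a tiny illustrative stopword list, not the one used by utils.preprocess_text
STOPWORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "have", "been", "they"}

def toy_stem(word: str) -> str:
    """Crude stand-in for stemming/lemmatization: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess_text_sketch(text: str) -> str:
    # 1. Lowercase the input text
    lowered = text.lower()
    # 2. Tokenize into words; the regex keeps only alphanumeric runs,
    #    which also drops punctuation (part of step 3)
    tokens = re.findall(r"[a-z0-9]+", lowered)
    # 3. Remove stopwords
    kept = [t for t in tokens if t not in STOPWORDS]
    # 4. Apply the (toy) stemmer to the remaining words
    stemmed = [toy_stem(t) for t in kept]
    # 5. Join the processed words back into a single string
    return " ".join(stemmed)

print(preprocess_text_sketch("The cars are in the garage."))  # car garage
```

Returning a single space-separated string matches how the `tokens` column is later consumed with `str.split()`.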

In [12]:
from utils import preprocess_text

df['tokens'] = Parallel(n_jobs=num_cores)(
    delayed(preprocess_text)(text) for text in df['text']
)

In [13]:
df.head()

Unnamed: 0,text,generated,text_length,tokens
0,Cars. Cars have been around since they became ...,0.0,584,car car around since became famous 1900s henry...
1,Transportation is a large necessity in most co...,0.0,462,transportation large necessity country worldwi...
2,"""Americas love affair with its vehicles seems ...",0.0,744,america love affair vehicle seems cooling say ...
3,How often do you ride in a car? Do you drive a...,0.0,686,often ride car drive one motor vehicle work st...
4,Cars are a wonderful thing. They are perhaps o...,0.0,871,car wonderful thing perhaps one world greatest...


#### 3.3.3 Examine the Number of Unique Tokens

In [14]:
# Split all tokens and flatten them into a single list
all_tokens = df['tokens'].str.split().explode()

# Get number of unique tokens
unique_token_count = all_tokens.nunique()
print("Number of unique tokens:", unique_token_count)

Number of unique tokens: 250972


#### 3.3.4 Remove Highly Similar Rows

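Some rows are near-duplicates of each other. We treat two rows as highly similar when their token sets have an overlap coefficient of at least 80% (`THRESH_OVERLAP = 0.80` in the code below), where the overlap coefficient of sets A and B is |A ∩ B| / min(|A|, |B|).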
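
The cell below verifies candidates with the overlap coefficient, while the LSH index that proposes those candidates is built on Jaccard similarity. The two measures can differ a lot when one row is much shorter than the other. A small sketch with made-up token sets (the function names here are illustrative) shows the gap:

```python
def overlap_coefficient(a: set, b: set) -> float:
    """|A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Toy example: the short row is fully contained in the long row
short_row = {"car", "road", "city", "fuel"}
long_row = {"car", "road", "city", "fuel", "smog", "bus", "train", "bike"}

print(overlap_coefficient(short_row, long_row))  # 1.0
print(jaccard(short_row, long_row))              # 0.5
```

A pair like this passes the 80% overlap check but has a Jaccard similarity well below the LSH threshold, so it may never be returned as a candidate. The procedure is therefore likely to find mostly pairs of similar length.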

##### The following finds the near-duplicate rows.

In [15]:
# ---------- Imports ----------
from datasketch import MinHash, MinHashLSH

# ---------- Config ----------
THRESH_OVERLAP = 0.80        # 80% overlapping tokens
NUM_PERM = 64                # MinHash permutations (64 is a good accuracy/memory tradeoff)
LSH_JACCARD_THRESHOLD = 0.75 # LSH threshold (slightly lower to avoid missing borderline pairs)



# ---------- Helpers ----------
def tokens_to_set(tokens_str: str):
    # tokens column is space-separated; convert to a set of unique tokens
    # (if your 'tokens' already has unique words, this is cheap)
    return set(tokens_str.split())

def minhash_from_set(s: set, num_perm=NUM_PERM) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for w in s:
        m.update(w.encode('utf-8'))
    return m

def overlap_coeff(a: set, b: set) -> float:
    # |A ∩ B| / min(|A|, |B|)
    if not a or not b:
        return 0.0
    return len(a & b) / float(min(len(a), len(b)))

# ---------- Main (Incremental LSH to limit memory) ----------
# We iterate rows once; for each row:
# 1) build its MinHash
# 2) query LSH for candidates among prior rows
# 3) verify with exact overlap
# 4) insert into LSH
# This avoids querying every pair twice and keeps RAM in check.

tokens_series = df['tokens']      # <- your DataFrame column
lsh = MinHashLSH(threshold=LSH_JACCARD_THRESHOLD, num_perm=NUM_PERM)

similar_pairs = []  # will store tuples: (i, j, overlap)

for i, tok_str in enumerate(tokens_series):
    A = tokens_to_set(tok_str)

    # Build MinHash for this row
    mh = minhash_from_set(A)

    # Get candidates among previously inserted rows
    candidates = lsh.query(mh)

    # Verify with exact overlap, report each pair once (j < i because only previous inserted)
    for j in candidates:
        B = tokens_to_set(tokens_series.iloc[j])
        oc = overlap_coeff(A, B)
        if oc >= THRESH_OVERLAP:
            similar_pairs.append((j, i, oc))  # store (older_index, current_index, score)

    # Insert current row into index *after* querying to avoid self/duplicate matches
    lsh.insert(i, mh)

# Convert to a DataFrame if you want
pairs_df = pd.DataFrame(similar_pairs, columns=['idx_a', 'idx_b', 'overlap_coeff'])
print("Found candidate near-duplicates:", len(pairs_df))
pairs_df.head()


Found candidate near-duplicates: 1886422


Unnamed: 0,idx_a,idx_b,overlap_coeff
0,1384,1385,0.896552
1,1385,1389,0.906404
2,1387,1389,0.860465
3,1388,1390,0.85446
4,1385,1391,0.866995
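
To make the role of `NUM_PERM` more concrete, the following is a simplified standard-library sketch of the MinHash idea: the fraction of matching signature slots estimates the Jaccard similarity of two sets. It is an illustration only. `datasketch` uses its own hashing scheme internally, and the seeded SHA-1 approach and the function names here are assumptions made for this example.

```python
import hashlib

def minhash_signature(tokens: set, num_perm: int = 64) -> list:
    """For each of num_perm seeded hash functions, keep the minimum hash over the set.

    Assumes a non-empty token set.
    """
    signature = []
    for seed in range(num_perm):
        prefix = f"{seed}:".encode("utf-8")
        min_value = min(
            int.from_bytes(hashlib.sha1(prefix + t.encode("utf-8")).digest()[:8], "big")
            for t in tokens
        )
        signature.append(min_value)
    return signature

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature slots in which the two minimum hashes agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = {"car", "road", "city", "fuel", "smog"}
b = {"car", "road", "city", "fuel", "bus"}

sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(estimate_jaccard(sig_a, sig_b))  # an estimate of the true Jaccard, 4/6
```

More permutations reduce the variance of the estimate at the cost of more memory and hashing time per row.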


##### The following puts the similar rows into groups

In [18]:
import numpy as np

# Use positional labels array to avoid index/key issues
labels_arr = df['generated'].astype(int).to_numpy()
N = len(df)

# Edges as numpy arrays (ints)
a = pairs_df['idx_a'].to_numpy(dtype=np.int64)
b = pairs_df['idx_b'].to_numpy(dtype=np.int64)

# --- Union-Find (Disjoint Set) over N rows (fast & memory-light) ---
class DSU:
    def __init__(self, n):
        self.parent = np.arange(n, dtype=np.int64)
        self.size = np.ones(n, dtype=np.int64)
    def find(self, x):
        # path compression
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry: 
            return
        # union by size
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]

dsu = DSU(N)
for i, j in zip(a, b):
    dsu.union(i, j)

# Only nodes that appear in at least one pair
involved = np.unique(np.concatenate([a, b]))

# Root for each involved node -> group id 0..G-1
roots = np.array([dsu.find(i) for i in involved], dtype=np.int64)
uniq_roots, group_ids = np.unique(roots, return_inverse=True)

# Build nodes_df: each member with its group and label
nodes_df = pd.DataFrame({
    'row_idx': involved,           # positional row index in df
    'group_id': group_ids,         # 0..num_groups-1
    'label':   labels_arr[involved]
})

# Per-group label composition
label_counts = (nodes_df
                .groupby(['group_id','label'])
                .size()
                .unstack(fill_value=0)
                .rename(columns={0:'n_human', 1:'n_ai'}))

label_counts['size'] = label_counts['n_human'] + label_counts['n_ai']
label_counts = label_counts.sort_values('size', ascending=False)

# Keep only true "similarity groups" (size >= 2)
label_counts = label_counts[label_counts['size'] >= 2]
num_groups = label_counts.shape[0]

print(f"Total similar groups (connected components, size ≥ 2): {num_groups}")

# Pure vs mixed groups
pure_groups = (label_counts[['n_human','n_ai']].min(axis=1) == 0).sum()
mixed_groups = num_groups - pure_groups
print(f"Pure groups: {pure_groups}   Mixed groups: {mixed_groups}")

# Size distribution (first few sizes)
print("\nGroup size distribution (size -> #groups):")
print(label_counts['size'].value_counts().sort_index().head(20))

# Top 10 largest groups with composition
summary = label_counts.reset_index()
print("\nTop 10 largest groups with label composition:")
print(summary.head(10))

# Members of the largest group (positional row indices in df)
largest_gid = summary.iloc[0]['group_id'] if len(summary) else None
if largest_gid is not None:
    members_largest = nodes_df.loc[nodes_df['group_id'] == largest_gid, 'row_idx'].tolist()
    print(f"\nLargest group id: {largest_gid}  size: {len(members_largest)}")
    # Example: inspect their labels quickly
    print(pd.Series(labels_arr[members_largest]).value_counts().rename(index={0:'human',1:'ai'}))
    # You can also peek at the texts:
    df.iloc[members_largest][['generated','text']].head()


Total similar groups (connected components, size ≥ 2): 51055
Pure groups: 51005   Mixed groups: 50

Group size distribution (size -> #groups):
size
2     1564
3      992
4     2465
5     4986
6     7558
7     8816
8     7556
9     5002
10    2654
11    1184
12     812
13     949
14    1086
15    1072
16    1047
17     774
18     589
19     398
20     259
21     176
Name: count, dtype: int64

Top 10 largest groups with label composition:
label  group_id  n_human  n_ai  size
0          2198      126     0   126
1         25041        0    72    72
2         14120       36     0    36
3           531       33     0    33
4           957       33     0    33
5           530       32     0    32
6          1088       32     0    32
7           291       32     0    32
8           303       32     0    32
9           216       32     0    32

Largest group id: 2198  size: 126
human    126
Name: count, dtype: int64


An excerpt of the group size distribution printed above:

```
2     1564
3      992
4     2465
...
```
Explanation of the output:

- The first column is the group size.

- The second column is the number of groups of that size.

For instance, `2     1564` means that there are 1564 groups of size 2, i.e., 1564 groups in which two rows are highly similar to each other.
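
To see how similar pairs turn into groups and then into a size distribution, here is a minimal pure-Python sketch of the same union-find idea on a toy input. The function name `connected_components` and the sample pairs are made up for illustration.

```python
from collections import Counter

def connected_components(num_rows: int, pairs: list) -> list:
    """Group rows linked by similar pairs; return only groups with at least 2 rows."""
    parent = list(range(num_rows))

    def find(x: int) -> int:
        # Follow parent links to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for row in range(num_rows):
        groups.setdefault(find(row), []).append(row)
    return [members for members in groups.values() if len(members) >= 2]

# Toy input: rows 0-1 and 1-2 are similar, rows 3-4 are similar, row 5 is unique
pairs = [(0, 1), (1, 2), (3, 4)]
groups = connected_components(6, pairs)
size_distribution = Counter(len(g) for g in groups)

print(groups)                             # [[0, 1, 2], [3, 4]]
print(sorted(size_distribution.items()))  # [(2, 1), (3, 1)]
```

Note that rows 0 and 2 end up in the same group even though they were never compared directly. Grouping by connected components is transitive, which may explain why some groups are much larger than any single pair would suggest.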


##### The following filters out the similar rows:

- Remove all mixed groups (groups whose rows are similar but carry inconsistent labels)

- In each pure group, keep only the row with the most unique tokens (ties go to the smallest row index)
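
The selection rule for grouped rows can be sketched in plain Python on toy data. The helper name `select_representatives` and the sample rows are hypothetical. The sketch covers only rows that belong to a group; in the notebook, rows outside every group are kept unchanged.

```python
def select_representatives(groups: list, labels: list, tokens: list) -> list:
    """Drop mixed-label groups; from each pure group keep the row with the most unique tokens."""
    kept = []
    for members in groups:
        if len({labels[i] for i in members}) > 1:
            continue  # mixed group: all of its rows are discarded
        # Most unique tokens first; ties resolved by the smallest row index
        best = min(members, key=lambda i: (-len(set(tokens[i].split())), i))
        kept.append(best)
    return sorted(kept)

labels = [0, 0, 0, 1, 0, 1, 1]
tokens = [
    "car road city",        # row 0: 3 unique tokens
    "car road city fuel",   # row 1: 4 unique tokens
    "car car road",         # row 2: 2 unique tokens
    "bus train",            # row 3
    "bus train tram",       # row 4
    "essay topic",          # row 5: 2 unique tokens
    "topic essay",          # row 6: 2 unique tokens (tie with row 5)
]
groups = [[0, 1, 2], [3, 4], [5, 6]]

print(select_representatives(groups, labels, tokens))  # [1, 5]
```

Rows 3 and 4 disappear entirely because their labels disagree, while the tie in the last group is settled in favor of the smaller index.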

In [19]:
# 1) Identify mixed vs pure groups
grp_nlabels = nodes_df.groupby('group_id')['label'].nunique()
mixed_gids = grp_nlabels[grp_nlabels > 1].index
pure_gids  = grp_nlabels[grp_nlabels == 1].index

# 2) All rows that appeared in any pair (i.e., belong to some group)
a = pairs_df['idx_a'].to_numpy(np.int64)
b = pairs_df['idx_b'].to_numpy(np.int64)
involved = np.unique(np.concatenate([a, b]))

# 3) Precompute unique-token counts only for involved rows (saves RAM/CPU)
tokens_array = df['tokens'].to_numpy()
uniq_counts = np.fromiter(
    (len(set(tokens_array[i].split())) for i in involved),
    dtype=np.int32,
    count=len(involved)
)
uniq_df = pd.DataFrame({'row_idx': involved, 'uniq_len': uniq_counts})

# 4) DROP all rows that are in mixed groups
mixed_rows = nodes_df.loc[nodes_df['group_id'].isin(mixed_gids), 'row_idx']
mixed_rows_set = set(mixed_rows.to_numpy())

# 5) For pure groups: keep the row with the most unique tokens (tie-breaker: smallest row_idx)
pure_nodes = nodes_df.loc[nodes_df['group_id'].isin(pure_gids), ['row_idx', 'group_id']]
pure_with_scores = pure_nodes.merge(uniq_df, on='row_idx', how='left')

# Sort so that within each group_id, the first row is the desired representative
pure_reps = (pure_with_scores
             .sort_values(['group_id', 'uniq_len', 'row_idx'],
                          ascending=[True, False, True])
             .drop_duplicates(subset='group_id', keep='first')
             ['row_idx']
             .to_numpy())

pure_reps_set = set(pure_reps)

# 6) Rows not involved in any group are kept as-is
all_idx = set(range(len(df)))
involved_set = set(involved)
not_involved_set = all_idx - involved_set

# 7) Final keep set:
#    - all non-involved rows
#    - one representative per pure group
#    - EXCLUDE all rows from mixed groups
rows_to_keep = (not_involved_set | pure_reps_set) - mixed_rows_set

# 8) Build the deduplicated dataframe
df_dedup = df.iloc[sorted(rows_to_keep)].reset_index(drop=True)

# 9) Reporting
print(f"Original rows                  : {len(df)}")
print(f"Rows in any group              : {len(involved_set)}")
print(f"Pure groups                    : {len(pure_gids)}")
print(f"Mixed groups (fully removed)   : {len(mixed_gids)}")
print(f"Kept reps from pure groups     : {len(pure_reps_set)}")
print(f"Not involved (kept as-is)      : {len(not_involved_set)}")
print(f"New dataset size               : {len(df_dedup)}")


Original rows                  : 487203
Rows in any group              : 431780
Pure groups                    : 51005
Mixed groups (fully removed)   : 50
Kept reps from pure groups     : 51005
Not involved (kept as-is)      : 55423
New dataset size               : 106428


Sample and examine rows from the similar groups:

In [20]:
import random

# Ensure reproducibility (optional)
random.seed(42)

# --- Pure Groups ---
pure_sample_gids = random.sample(list(pure_gids), min(5, len(pure_gids)))

print("\n=== Pure Groups Sample ===")
for gid in pure_sample_gids:
    members = nodes_df.loc[nodes_df['group_id'] == gid, 'row_idx'].tolist()
    sample_rows = random.sample(members, min(3, len(members)))
    print(f"\nGroup {gid} (pure) - {len(members)} rows total")
    for idx in sample_rows:
        print(f"[Row {idx}] Label={df.iloc[idx]['generated']}")
        print(df.iloc[idx]['text'])
        print("-" * 40)

# --- Mixed Groups ---
mixed_sample_gids = random.sample(list(mixed_gids), min(3, len(mixed_gids)))

print("\n=== Mixed Groups Sample ===")
for gid in mixed_sample_gids:
    members = nodes_df.loc[nodes_df['group_id'] == gid, 'row_idx'].tolist()
    sample_rows = random.sample(members, min(3, len(members)))
    print(f"\nGroup {gid} (mixed) - {len(members)} rows total")
    for idx in sample_rows:
        print(f"[Row {idx}] Label={df.iloc[idx]['generated']}")
        print(df.iloc[idx]['text'])
        print("-" * 40)



=== Pure Groups Sample ===

Group 41951 (pure) - 2 rows total
[Row 74953] Label=1.0
 The character of a person is influenced by various factors beyond their control, but it is ultimately their responsibility to make it right. This statement is true because people make mistakes, but that does not make them bad. Our character is revealed through our actions and choices, and it is important to strive to do the right thing, even when it is difficult.  Influence beyond our control can be a challenge, but it is also an opportunity to learn and grow. We must be strong and kind to others, and we must learn to control ourselves in order to make the right choices. We must also be truthful to ourselves and to others, even when it is hard.  It is not always easy to be true to ourselves and to make the right choices, but it is important to do so. We must not be selfish, but we must also be aware of the feelings of others and strive to make choices that will not hurt them. We must also be willing t

Use the filtered data frame and free up memory:

In [21]:
df = df_dedup
del df_dedup
df.shape

(106428, 4)

### 3.4 Save the data frame

In [22]:
df.to_csv("df1_cleaned.csv", index=False)