**Plagiarized Text Extraction and Translation**

In this stage, we extracted plagiarized text segments from the **PAN corpora** using the metadata provided in the XML annotation files (such as offset and length values). These English text segments were then **translated into Kazakh** using machine translation. The resulting translations form the foundation for a **Kazakh dataset** tailored to **text similarity detection tasks**.
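
The extraction step described above can be sketched as follows. This is a minimal illustration, not the exact code used later in the notebook: the helper name `extract_segments` and the inline sample data are hypothetical, the attribute name `this_length` is assumed from the PAN annotation format, and the offsets are assumed to count characters in the decoded text.

```python
import xml.etree.ElementTree as ET

def extract_segments(xml_text, document_text):
    """Return the passages marked as plagiarism in a PAN-style annotation."""
    root = ET.fromstring(xml_text)
    segments = []
    for feature in root.iter("feature"):
        if feature.get("name") != "plagiarism":
            continue
        offset = feature.get("this_offset")
        length = feature.get("this_length")  # assumed attribute name
        if offset is None or length is None:
            continue  # annotation carries no position information
        start = int(offset)
        segments.append(document_text[start:start + int(length)])
    return segments

# Hypothetical sample data, for illustration only
sample_xml = (
    '<document reference="suspicious-document00001.txt">'
    '<feature name="about" lang="en"/>'
    '<feature name="plagiarism" this_offset="6" this_length="11"/>'
    '</document>'
)
sample_text = "Intro copied text outro"
print(extract_segments(sample_xml, sample_text))
```

Each returned string is one candidate segment that could then be passed on to the translation step.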

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import xml.etree.ElementTree as ET

# Path to the suspicious documents folder
suspicious_folder = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"

total_xml_files = 0
has_plagiarism_count = 0
no_plagiarism_count = 0

for filename in os.listdir(suspicious_folder):
    if filename.endswith(".xml"):
        total_xml_files += 1
        file_path = os.path.join(suspicious_folder, filename)

        tree = ET.parse(file_path)
        root = tree.getroot()

        # We'll scan all <feature> elements until we find a plagiarism element
        file_has_plagiarism = False
        for feature in root.findall(".//feature"):
            if feature.get("name") == "plagiarism":
                file_has_plagiarism = True
                # As soon as we find one plagiarism feature, we can skip checking the rest
                break

        if file_has_plagiarism:
            has_plagiarism_count += 1
        else:
            no_plagiarism_count += 1

print("Total XML files in the folder:", total_xml_files)
print("Number of XML files with a plagiarism feature:", has_plagiarism_count)
print("Number of XML files without a plagiarism feature:", no_plagiarism_count)


Total XML files in the folder: 11148
Number of XML files with a plagiarism feature: 5581
Number of XML files without a plagiarism feature: 5567
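
The scan above loops over every `<feature>` element and breaks on the first match. As a hedged alternative sketch, the same check can be written with an attribute predicate, which `xml.etree.ElementTree` supports in its limited XPath syntax. The helper name `has_plagiarism` and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET

def has_plagiarism(root):
    """True if any <feature name="plagiarism"> exists below root."""
    # Compare against None explicitly: an Element without children is falsy.
    return root.find(".//feature[@name='plagiarism']") is not None

# Hypothetical sample annotations
with_plag = ET.fromstring(
    '<document><feature name="about"/>'
    '<feature name="plagiarism" this_offset="0"/></document>'
)
without_plag = ET.fromstring('<document><feature name="about"/></document>')

print(has_plagiarism(with_plag), has_plagiarism(without_plag))
```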


In [None]:
import os
import shutil
import xml.etree.ElementTree as ET

# Path to the suspicious documents folder
suspicious_folder = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"

# Paths to the new folders you want to create
plagiarised_folder = "/content/drive/MyDrive/experimental-models/plagiarised_xml"
none_plagiarised_folder = "/content/drive/MyDrive/experimental-models/none_plagiarised_xml"

# Create the new folders if they don't exist
os.makedirs(plagiarised_folder, exist_ok=True)
os.makedirs(none_plagiarised_folder, exist_ok=True)

# Loop over each XML file, check for plagiarism, then copy to the correct folder
for filename in os.listdir(suspicious_folder):
    if filename.endswith(".xml"):
        file_path = os.path.join(suspicious_folder, filename)

        # Parse the XML to see if it has any plagiarism features
        tree = ET.parse(file_path)
        root = tree.getroot()

        file_has_plagiarism = False
        for feature in root.findall(".//feature"):
            if feature.get("name") == "plagiarism":
                file_has_plagiarism = True
                break

        # Copy the file to the appropriate folder
        if file_has_plagiarism:
            shutil.copy2(file_path, plagiarised_folder)
        else:
            shutil.copy2(file_path, none_plagiarised_folder)

print("Finished copying files!")
print("Files with plagiarism have been copied to:", plagiarised_folder)
print("Files without plagiarism have been copied to:", none_plagiarised_folder)


Finished copying files!
Files with plagiarism have been copied to: /content/drive/MyDrive/experimental-models/plagiarised_xml
Files without plagiarism have been copied to: /content/drive/MyDrive/experimental-models/none_plagiarised_xml


In [None]:
import os

# Path to the folder
folder_path1 = "/content/drive/MyDrive/experimental-models/plagiarised_xml"
folder_path2 = "/content/drive/MyDrive/experimental-models/none_plagiarised_xml"

# Function to count files in a folder
def count_files_in_folder(folder_path):
    try:
        # List all entries in the directory
        entries = os.listdir(folder_path)

        # Filter out directories, keeping only files
        files = [entry for entry in entries if os.path.isfile(os.path.join(folder_path, entry))]

        # Count the number of files
        file_count = len(files)

        return file_count

    except FileNotFoundError:
        print(f"Folder not found: {folder_path}")
        return 0

# Call the function to count files in the folder
print(f"The total number of files in the folder '{folder_path1}' is: {count_files_in_folder(folder_path1)}")
print(f"The total number of files in the folder '{folder_path2}' is: {count_files_in_folder(folder_path2)}")

The total number of files in the folder '/content/drive/MyDrive/experimental-models/plagiarised_xml' is: 5581
The total number of files in the folder '/content/drive/MyDrive/experimental-models/none_plagiarised_xml' is: 5567


In [None]:
import os
import xml.etree.ElementTree as ET

# Folders we want to scan
plagiarised_folder = "/content/drive/MyDrive/experimental-models/plagiarised_xml"
none_plagiarised_folder = "/content/drive/MyDrive/experimental-models/none_plagiarised_xml"

# A) Counters for the plagiarised folder
#    1) no this_offset + no source_offset
#    2) yes this_offset + no source_offset
#    3) no this_offset + yes source_offset
#    4) yes this_offset + yes source_offset
p_noT_noS = 0
p_yesT_noS = 0
p_noT_yesS = 0
p_yesT_yesS = 0

# B) Counters for the non-plagiarised folder
#    1) no this_offset + no source_offset
#    2) yes this_offset + no source_offset
#    3) no this_offset + yes source_offset
#    4) yes this_offset + yes source_offset
np_noT_noS = 0
np_yesT_noS = 0
np_noT_yesS = 0
np_yesT_yesS = 0

# ---------- PART A: Analyze plagiarised_folder ----------
for filename in os.listdir(plagiarised_folder):
    if filename.endswith(".xml"):
        file_path = os.path.join(plagiarised_folder, filename)

        # Track if we find any this_offset or source_offset
        has_this_offset = False
        has_source_offset = False

        # Parse the annotation file (ET.parse reads the whole document)
        tree = ET.parse(file_path)
        root = tree.getroot()

        for feature in root.findall(".//feature"):
            if feature.get("this_offset") is not None:
                has_this_offset = True
            if feature.get("source_offset") is not None:
                has_source_offset = True

            # If we've found both, no need to keep checking
            if has_this_offset and has_source_offset:
                break

        # Classify this file into one of the four categories
        if not has_this_offset and not has_source_offset:
            p_noT_noS += 1
        elif has_this_offset and not has_source_offset:
            p_yesT_noS += 1
        elif not has_this_offset and has_source_offset:
            p_noT_yesS += 1
        else:
            # has_this_offset and has_source_offset
            p_yesT_yesS += 1

# ---------- PART B: Analyze none_plagiarised_folder ----------
for filename in os.listdir(none_plagiarised_folder):
    if filename.endswith(".xml"):
        file_path = os.path.join(none_plagiarised_folder, filename)

        # Track if we find any this_offset or source_offset
        has_this_offset = False
        has_source_offset = False

        # Parse the annotation file (ET.parse reads the whole document)
        tree = ET.parse(file_path)
        root = tree.getroot()

        for feature in root.findall(".//feature"):
            if feature.get("this_offset") is not None:
                has_this_offset = True
            if feature.get("source_offset") is not None:
                has_source_offset = True

            # If we've found both, no need to keep checking
            if has_this_offset and has_source_offset:
                break

        # Classify this file into one of the four categories
        if not has_this_offset and not has_source_offset:
            np_noT_noS += 1
        elif has_this_offset and not has_source_offset:
            np_yesT_noS += 1
        elif not has_this_offset and has_source_offset:
            np_noT_yesS += 1
        else:
            # has_this_offset and has_source_offset
            np_yesT_yesS += 1

# ---------- Print the results ----------
print("----- Plagiarised Folder -----")
print("1) No this_offset, No source_offset:", p_noT_noS)
print("2) Yes this_offset, No source_offset:", p_yesT_noS)
print("3) No this_offset, Yes source_offset:", p_noT_yesS)
print("4) Yes this_offset, Yes source_offset:", p_yesT_yesS)

print("\n----- None-Plagiarised Folder -----")
print("1) No this_offset, No source_offset:", np_noT_noS)
print("2) Yes this_offset, No source_offset:", np_yesT_noS)
print("3) No this_offset, Yes source_offset:", np_noT_yesS)
print("4) Yes this_offset, Yes source_offset:", np_yesT_yesS)


----- Plagiarised Folder -----
1) No this_offset, No source_offset: 0
2) Yes this_offset, No source_offset: 1659
3) No this_offset, Yes source_offset: 0
4) Yes this_offset, Yes source_offset: 3922

----- None-Plagiarised Folder -----
1) No this_offset, No source_offset: 5567
2) Yes this_offset, No source_offset: 0
3) No this_offset, Yes source_offset: 0
4) Yes this_offset, Yes source_offset: 0


**Plagiarised Folder:**
1.   No this_offset, No source_offset: 0
2.   Yes this_offset, No source_offset: 1659
3.   No this_offset, Yes source_offset: 0
4.   Yes this_offset, Yes source_offset: 3922

Total in plagiarised_folder = 1659 + 3922 = 5581

**None-Plagiarised Folder:**
1.   No this_offset, No source_offset: 5567
2.   Yes this_offset, No source_offset: 0
3.   No this_offset, Yes source_offset: 0
4.   Yes this_offset, Yes source_offset: 0

Total in none_plagiarised_folder = 5567

**Total files (plagiarised and non-plagiarised): 5581 + 5567 = 11148**
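
The four-way breakdown above can also be tallied in a single pass with `collections.Counter`, which avoids keeping eight separate counter variables. The sketch below is illustrative only: the helper name `offset_category` and the three inline annotations are hypothetical, and the key is the pair `(has_this_offset, has_source_offset)`.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def offset_category(root):
    """Return (has_this_offset, has_source_offset) for one annotation."""
    features = list(root.iter("feature"))
    has_this = any(f.get("this_offset") is not None for f in features)
    has_source = any(f.get("source_offset") is not None for f in features)
    return (has_this, has_source)

# Hypothetical sample annotations, one per category seen in the counts above
samples = [
    '<document><feature name="plagiarism" this_offset="0" source_offset="5"/></document>',
    '<document><feature name="plagiarism" this_offset="0"/></document>',
    '<document><feature name="about"/></document>',
]

counts = Counter(offset_category(ET.fromstring(xml)) for xml in samples)
for key, value in sorted(counts.items()):
    print(key, value)
```

Applied to real files, the same function could be called on `ET.parse(path).getroot()` for each annotation in a folder.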

In [None]:
import os
import shutil
import xml.etree.ElementTree as ET

# Existing folders (already split by plagiarism presence)
plagiarised_folder = "/content/drive/MyDrive/experimental-models/plagiarised_xml"
none_plagiarised_folder = "/content/drive/MyDrive/experimental-models/none_plagiarised_xml"

# New folders you want to create for each category
category_plag_su1_so1 = "/content/drive/MyDrive/experimental-models/su1_so1"
category_plag_su1_so0 = "/content/drive/MyDrive/experimental-models/su1_so0"
category_nonplag_su0_so0 = "/content/drive/MyDrive/experimental-models/su0_so0"

# Make sure these folders exist
os.makedirs(category_plag_su1_so1, exist_ok=True)
os.makedirs(category_plag_su1_so0, exist_ok=True)
os.makedirs(category_nonplag_su0_so0, exist_ok=True)

def has_offsets(file_path):
    """
    Parse the given XML file quickly to see whether it has
    any 'this_offset' or 'source_offset' in any <feature>.
    Returns (has_this_offset, has_source_offset).
    """
    tree = ET.parse(file_path)
    root = tree.getroot()
    found_this_offset = False
    found_source_offset = False
    for feature in root.findall(".//feature"):
        if feature.get("this_offset") is not None:
            found_this_offset = True
        if feature.get("source_offset") is not None:
            found_source_offset = True
        # If both are found, no need to keep checking
        if found_this_offset and found_source_offset:
            break
    return (found_this_offset, found_source_offset)

# ----- A) Plagiarised Folder -----
# We know from your counts that there should be only two real sub-cases here:
#   1) yes this_offset & yes source_offset
#   2) yes this_offset & no source_offset
for filename in os.listdir(plagiarised_folder):
    if filename.endswith(".xml"):
        file_path = os.path.join(plagiarised_folder, filename)
        (has_T, has_S) = has_offsets(file_path)

        # Check which sub-category
        if has_T and has_S:
            # 1) Plagiarised_su1_so1
            shutil.copy2(file_path, category_plag_su1_so1)
        elif has_T and not has_S:
            # 2) Plagiarised_su1_so0
            shutil.copy2(file_path, category_plag_su1_so0)
        else:
            # According to your counts, you don't actually have
            # other categories in the plagiarised folder. But if
            # something was off, you'd handle it here.
            pass

# ----- B) None-Plagiarised Folder -----
# We know from your counts that all are no T & no S.
for filename in os.listdir(none_plagiarised_folder):
    if filename.endswith(".xml"):
        file_path = os.path.join(none_plagiarised_folder, filename)
        (has_T, has_S) = has_offsets(file_path)

        # The only sub-category we expect in non-plagiarised is no T & no S
        if not has_T and not has_S:
            shutil.copy2(file_path, category_nonplag_su0_so0)
        else:
            # If you unexpectedly find any other combos, you'd handle them here
            pass

print("Finished copying files into the three categories:")
print(f"1) plagiarised_su1_so1 -> {category_plag_su1_so1}")
print(f"2) plagiarised_su1_so0 -> {category_plag_su1_so0}")
print(f"3) not_plagiarised_su0_so0 -> {category_nonplag_su0_so0}")


Finished copying files into the three categories:
1) plagiarised_su1_so1 -> /content/drive/MyDrive/experimental-models/su1_so1
2) plagiarised_su1_so0 -> /content/drive/MyDrive/experimental-models/su1_so0
3) not_plagiarised_su0_so0 -> /content/drive/MyDrive/experimental-models/su0_so0


In [None]:
import os

# Paths to the three folders
folder_path1 = "/content/drive/MyDrive/experimental-models/su1_so1"
folder_path2 = "/content/drive/MyDrive/experimental-models/su1_so0"
folder_path3 = "/content/drive/MyDrive/experimental-models/su0_so0"

def count_xml_files_in_folder(folder_path):
    """Count only .xml files in the given folder."""
    try:
        # List all entries in the directory
        entries = os.listdir(folder_path)

        # Filter out anything that's not a file or doesn't end with .xml
        xml_files = [
            entry for entry in entries
            if entry.endswith(".xml") and os.path.isfile(os.path.join(folder_path, entry))
        ]

        return len(xml_files)

    except FileNotFoundError:
        print(f"Folder not found: {folder_path}")
        return 0

# Print the count of .xml files in each folder
print(f"The total number of XML files in the folder '{folder_path1}' is: {count_xml_files_in_folder(folder_path1)}")
print(f"The total number of XML files in the folder '{folder_path2}' is: {count_xml_files_in_folder(folder_path2)}")
print(f"The total number of XML files in the folder '{folder_path3}' is: {count_xml_files_in_folder(folder_path3)}")


The total number of XML files in the folder '/content/drive/MyDrive/experimental-models/su1_so1' is: 3922
The total number of XML files in the folder '/content/drive/MyDrive/experimental-models/su1_so0' is: 1659
The total number of XML files in the folder '/content/drive/MyDrive/experimental-models/su0_so0' is: 5567


In [None]:
import os
import shutil

su1_so1_folder = "/content/drive/MyDrive/experimental-models/su1_so1"

for entry in os.listdir(su1_so1_folder):
    # We only care about files ending with ".xml"
    if entry.endswith(".xml"):
        xml_file_path = os.path.join(su1_so1_folder, entry)

        # Remove the ".xml" extension to get something like "suspicious-documentABCDE"
        base_name = os.path.splitext(entry)[0]

        # Build the path to the subfolder with the same base name
        subfolder_path = os.path.join(su1_so1_folder, base_name)

        # If that subfolder exists, copy the XML file into it
        if os.path.isdir(subfolder_path):
            dst_path = os.path.join(subfolder_path, entry)
            shutil.copy2(xml_file_path, dst_path)
            print(f"Copied: {xml_file_path} -> {dst_path}")
        else:
            print(f"No matching subfolder found for {entry}")

print("Finished copying matching XML files to their subfolders.")


Copied: /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10897.xml -> /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10897/suspicious-document10897.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10884.xml -> /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10884/suspicious-document10884.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10924.xml -> /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10924/suspicious-document10924.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10814.xml -> /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10814/suspicious-document10814.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10934.xml -> /content/drive/MyDrive/experimental-models/su1_so1/suspicious-document10934/suspicious-document10934.xml
Copied: /content/dri

In [None]:
import os
import shutil

su1_so0_folder = "/content/drive/MyDrive/experimental-models/su1_so0"

for entry in os.listdir(su1_so0_folder):
    # We only care about files ending with ".xml"
    if entry.endswith(".xml"):
        xml_file_path = os.path.join(su1_so0_folder, entry)

        # Remove the ".xml" extension to get something like "suspicious-documentABCDE"
        base_name = os.path.splitext(entry)[0]

        # Build the path to the subfolder with the same base name
        subfolder_path = os.path.join(su1_so0_folder, base_name)

        # If that subfolder exists, copy the XML file into it
        if os.path.isdir(subfolder_path):
            dst_path = os.path.join(subfolder_path, entry)
            shutil.copy2(xml_file_path, dst_path)
            print(f"Copied: {xml_file_path} -> {dst_path}")
        else:
            print(f"No matching subfolder found for {entry}")

print("Finished copying matching XML files to their subfolders.")


Copied: /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08695.xml -> /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08695/suspicious-document08695.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08748.xml -> /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08748/suspicious-document08748.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08809.xml -> /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08809/suspicious-document08809.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08692.xml -> /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08692/suspicious-document08692.xml
Copied: /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08775.xml -> /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document08775/suspicious-document08775.xml
Copied: /content/dri

In [None]:
import os
import shutil
import xml.etree.ElementTree as ET

# Folders containing the original suspicious and source documents
suspicious_docs_folder = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"
source_docs_folder = "/content/drive/MyDrive/experimental-models/pan2010/source-documents"

# Folder containing the .xml files we want to parse
plagiarised_xml_folder = "/content/drive/MyDrive/experimental-models/su1_so1"

for filename in os.listdir(plagiarised_xml_folder):
    if filename.endswith(".xml"):
        xml_path = os.path.join(plagiarised_xml_folder, filename)

        # --- Parse the XML to extract references ---
        tree = ET.parse(xml_path)
        root = tree.getroot()

        # The top-level document reference, e.g., suspicious-document00123.txt
        suspicious_doc_txt_name = root.get("reference")
        if suspicious_doc_txt_name is None:
            print(f"[WARNING] No 'reference' attribute in {filename}, skipping.")
            continue

        # Gather all source references from <feature name="plagiarism" source_reference="...">
        source_doc_names = []
        for feature in root.findall(".//feature"):
            if feature.get("name") == "plagiarism":
                source_ref = feature.get("source_reference")
                if source_ref is not None and source_ref not in source_doc_names:
                    source_doc_names.append(source_ref)

        # --- Create a new folder named without the .xml extension ---
        # e.g., from suspicious-document00123.xml -> suspicious-document00123
        base_name = os.path.splitext(filename)[0]  # "suspicious-document00123"
        new_subfolder_path = os.path.join(plagiarised_xml_folder, base_name)
        os.makedirs(new_subfolder_path, exist_ok=True)

        # --- Copy the suspicious document .txt as-is into the new folder ---
        # e.g., suspicious-document00123.txt stays suspicious-document00123.txt
        suspicious_src_path = os.path.join(suspicious_docs_folder, suspicious_doc_txt_name)
        suspicious_dst_path = os.path.join(new_subfolder_path, suspicious_doc_txt_name)

        if os.path.isfile(suspicious_src_path):
            shutil.copy2(suspicious_src_path, suspicious_dst_path)
        else:
            print(f"[WARNING] Could not find suspicious doc: {suspicious_src_path}")

        # --- Copy each source doc as-is (keeping its .txt name) ---
        for src_doc_name in source_doc_names:
            source_src_path = os.path.join(source_docs_folder, src_doc_name)
            source_dst_path = os.path.join(new_subfolder_path, src_doc_name)

            if os.path.isfile(source_src_path):
                shutil.copy2(source_src_path, source_dst_path)
            else:
                print(f"[WARNING] Could not find source doc: {source_src_path}")

print("Finished copying suspicious and source documents into subfolders (with .txt intact).")


Finished copying suspicious and source documents into subfolders (with .txt intact).


In [None]:
import os
import shutil
import xml.etree.ElementTree as ET

# Folder containing the original suspicious documents
suspicious_docs_folder = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"

# Folder containing the .xml files we want to parse
plagiarised_xml_folder = "/content/drive/MyDrive/experimental-models/su1_so0"

for filename in os.listdir(plagiarised_xml_folder):
    if filename.endswith(".xml"):
        xml_path = os.path.join(plagiarised_xml_folder, filename)

        # --- Parse the XML to extract references ---
        tree = ET.parse(xml_path)
        root = tree.getroot()

        # The top-level document reference, e.g., suspicious-document00123.txt
        suspicious_doc_txt_name = root.get("reference")
        if suspicious_doc_txt_name is None:
            print(f"[WARNING] No 'reference' attribute in {filename}, skipping.")
            continue

        # --- Create a new folder named without the .xml extension ---
        # e.g., from suspicious-document00123.xml -> suspicious-document00123
        base_name = os.path.splitext(filename)[0]  # "suspicious-document00123"
        new_subfolder_path = os.path.join(plagiarised_xml_folder, base_name)
        os.makedirs(new_subfolder_path, exist_ok=True)

        # --- Copy the suspicious document .txt as-is into the new folder ---
        # e.g., suspicious-document00123.txt stays suspicious-document00123.txt
        suspicious_src_path = os.path.join(suspicious_docs_folder, suspicious_doc_txt_name)
        suspicious_dst_path = os.path.join(new_subfolder_path, suspicious_doc_txt_name)

        if os.path.isfile(suspicious_src_path):
            shutil.copy2(suspicious_src_path, suspicious_dst_path)
        else:
            print(f"[WARNING] Could not find suspicious doc: {suspicious_src_path}")

print("Finished copying suspicious documents into subfolders (with .txt intact).")

Finished copying suspicious documents into subfolders (with .txt intact).


In [None]:
import os

# Paths to the three folders
folder_path1 = "/content/drive/MyDrive/experimental-models/su1_so1"
folder_path2 = "/content/drive/MyDrive/experimental-models/su1_so0"
folder_path3 = "/content/drive/MyDrive/experimental-models/su0_so0"

def count_xml_files_and_subfolders(folder_path):
    """
    Returns a tuple (num_xml_files, num_subfolders) for the given folder.
    - num_xml_files: count of all .xml files (not directories) in the folder
    - num_subfolders: count of subdirectories in the folder
    """
    try:
        entries = os.listdir(folder_path)

        # Count .xml files (ensure each entry is a file that ends with .xml)
        xml_files = [
            entry
            for entry in entries
            if entry.endswith(".xml") and os.path.isfile(os.path.join(folder_path, entry))
        ]
        num_xml_files = len(xml_files)

        # Count subfolders (directories within this folder)
        subfolders = [
            entry
            for entry in entries
            if os.path.isdir(os.path.join(folder_path, entry))
        ]
        num_subfolders = len(subfolders)

        return num_xml_files, num_subfolders

    except FileNotFoundError:
        print(f"Folder not found: {folder_path}")
        return 0, 0

# Now let's retrieve and print the info for each folder
for folder_path in [folder_path1, folder_path2, folder_path3]:
    xml_count, subfolder_count = count_xml_files_and_subfolders(folder_path)
    print(f"Folder: {folder_path}")
    print(f"  Total number of XML files: {xml_count}")
    print(f"  Total number of subfolders: {subfolder_count}")
    print("--------------------------------------------------")


Folder: /content/drive/MyDrive/experimental-models/su1_so1
  Total number of XML files: 3922
  Total number of subfolders: 3922
--------------------------------------------------
Folder: /content/drive/MyDrive/experimental-models/su1_so0
  Total number of XML files: 1659
  Total number of subfolders: 1659
--------------------------------------------------
Folder: /content/drive/MyDrive/experimental-models/su0_so0
  Total number of XML files: 5567
  Total number of subfolders: 0
--------------------------------------------------
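
Equal counts of XML files and subfolders suggest, but do not prove, that every annotation has its own subfolder. A stricter check compares names directly. The sketch below is a hedged illustration on a throwaway directory: the helper name `xml_without_subfolder` and the sample file names are hypothetical.

```python
import tempfile
from pathlib import Path

def xml_without_subfolder(folder):
    """List .xml files in `folder` that have no same-named subfolder."""
    folder = Path(folder)
    return sorted(
        p.name for p in folder.glob("*.xml")
        if p.is_file() and not (folder / p.stem).is_dir()
    )

# Demonstration on a temporary directory with one matched and one unmatched file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "suspicious-document00001.xml").write_text("<document/>")
    (root / "suspicious-document00001").mkdir()
    (root / "suspicious-document00002.xml").write_text("<document/>")
    missing = xml_without_subfolder(root)

print(missing)
```

An empty list for `su1_so1` and `su1_so0` would confirm the pairing by name rather than by count alone.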


In [None]:
import os
import shutil
import xml.etree.ElementTree as ET

# Folder containing the .txt suspicious documents (original location)
suspicious_docs_folder = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"

# Folder containing the XML files (where we also want to place the .txt files)
not_plagiarised_xml_files = "/content/drive/MyDrive/experimental-models/su0_so0"

for filename in os.listdir(not_plagiarised_xml_files):
    if filename.endswith(".xml"):
        xml_path = os.path.join(not_plagiarised_xml_files, filename)

        # Parse the XML
        tree = ET.parse(xml_path)
        root = tree.getroot()

        # e.g. suspicious-document00123.txt
        suspicious_doc_txt_name = root.get("reference")
        if suspicious_doc_txt_name is None:
            print(f"[WARNING] No 'reference' attribute in {filename}, skipping.")
            continue

        # Where the suspicious text file currently lives
        suspicious_src_path = os.path.join(suspicious_docs_folder, suspicious_doc_txt_name)

        # Where we want to copy the suspicious text file
        # (Same folder as the XML, keep the same .txt name)
        suspicious_dst_path = os.path.join(not_plagiarised_xml_files, suspicious_doc_txt_name)

        if os.path.isfile(suspicious_src_path):
            shutil.copy2(suspicious_src_path, suspicious_dst_path)
            print(f"Copied {suspicious_src_path} to {suspicious_dst_path}")
        else:
            print(f"[WARNING] Could not find suspicious doc: {suspicious_src_path}")

print("Finished copying suspicious .txt documents next to their .xml files.")


Streaming output truncated to the last 5000 lines.
Copied /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents/suspicious-document09263.txt to /content/drive/MyDrive/experimental-models/su0_so0/suspicious-document09263.txt
Copied /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents/suspicious-document09217.txt to /content/drive/MyDrive/experimental-models/su0_so0/suspicious-document09217.txt
Copied /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents/suspicious-document09241.txt to /content/drive/MyDrive/experimental-models/su0_so0/suspicious-document09241.txt
Copied /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents/suspicious-document09203.txt to /content/drive/MyDrive/experimental-models/su0_so0/suspicious-document09203.txt
Copied /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents/suspicious-document09229.txt to /content/drive/MyDrive/experimental-models/su0_so0/suspiciou

In [None]:
import os

folder_path = "/content/drive/MyDrive/experimental-models/su0_so0"

# Initialize counters
xml_count = 0
txt_count = 0

for entry in os.listdir(folder_path):
    file_path = os.path.join(folder_path, entry)
    if os.path.isfile(file_path):
        if entry.endswith(".xml"):
            xml_count += 1
        elif entry.endswith(".txt"):
            txt_count += 1

print(f"Number of .xml files in '{folder_path}': {xml_count}")
print(f"Number of .txt files in '{folder_path}': {txt_count}")


Number of .xml files in '/content/drive/MyDrive/experimental-models/su0_so0': 5567
Number of .txt files in '/content/drive/MyDrive/experimental-models/su0_so0': 5567


In [None]:
import os
import shutil

# 1) Path to the master source-documents folder
source_docs_folder = "/content/drive/MyDrive/experimental-models/pan2010/source-documents"

# 2) Path to the folder containing subfolders for plagiarised_su1_so1
plagiarised_su1_so1 = "/content/drive/MyDrive/experimental-models/su1_so1"

# 3) Path to the folder where we want to copy the remaining source files
not_plagiarised_su0_so0 = "/content/drive/MyDrive/experimental-models/su0_so0"

# -------------------------------------------------------------------
# A) Gather all .txt file names from source_docs_folder into list_1
# -------------------------------------------------------------------
list_1 = []
for item in os.listdir(source_docs_folder):
    if item.endswith(".txt"):
        list_1.append(item)

# Convert to a set to simplify the difference operation below
list_1 = set(list_1)

# -------------------------------------------------------------------
# B) Gather all .txt file names from ALL subfolders in plagiarised_su1_so1 into list_2
#    We'll use os.walk(...) to recurse subdirectories
# -------------------------------------------------------------------
list_2 = []
for root_dir, subdirs, files in os.walk(plagiarised_su1_so1):
    for file_name in files:
        if file_name.endswith(".txt"):
            list_2.append(file_name)

# Convert to a set to handle duplicates and allow easy difference
list_2 = set(list_2)

# -------------------------------------------------------------------
# C) Compute the difference
# -------------------------------------------------------------------
remaining_source_files = list_1 - list_2  # all .txt files in source_docs not used in plagiarised_su1_so1

# -------------------------------------------------------------------
# D) Copy the remaining source files into not_plagiarised_su0_so0
# -------------------------------------------------------------------
for txt_file in remaining_source_files:
    src_path = os.path.join(source_docs_folder, txt_file)
    dst_path = os.path.join(not_plagiarised_su0_so0, txt_file)

    # Double-check the file actually exists in source_docs_folder
    if os.path.isfile(src_path):
        shutil.copy2(src_path, dst_path)
        print(f"Copied: {txt_file} -> {not_plagiarised_su0_so0}")
    else:
        print(f"[WARNING] Missing file in source_docs_folder: {txt_file}")

print("\nDone! Copied all remaining source .txt files to:", not_plagiarised_su0_so0)


Streaming output truncated to the last 5000 lines.
Copied: source-document04013.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document08325.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document09486.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document01867.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document00739.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document09157.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document06568.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document07072.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document07647.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document03150.txt -> /content/drive/MyDrive/experimental-models/su0_so0
Copied: source-document07151.txt -> /content/drive/MyDrive/experi
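
The cell above selects the unused source documents by a set difference between a flat listing and a recursive walk. A minimal sketch of that step as a reusable function follows; the function name `unused_source_files` and the demo file names are hypothetical, and the demo runs on a temporary directory rather than the Drive folders.

```python
import os
import tempfile

def unused_source_files(source_dir, used_dir, ext=".txt"):
    """Return sorted names in source_dir (flat) that never appear under used_dir (recursive)."""
    all_names = {n for n in os.listdir(source_dir) if n.endswith(ext)}
    used_names = set()
    for _root, _dirs, files in os.walk(used_dir):
        used_names.update(n for n in files if n.endswith(ext))
    return sorted(all_names - used_names)

# Tiny demo on a throwaway directory tree (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source-documents")
    used = os.path.join(tmp, "su1_so1", "suspicious-document00001")
    os.makedirs(src)
    os.makedirs(used)
    for name in ("source-document00001.txt", "source-document00002.txt"):
        open(os.path.join(src, name), "w").close()
    open(os.path.join(used, "source-document00001.txt"), "w").close()
    demo_result = unused_source_files(src, used)
    print(demo_result)
```

Returning a sorted list instead of a set keeps the copy order reproducible between runs.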

In [None]:
import os

# --- Folder paths ---
folder_1 = "/content/drive/MyDrive/experimental-models/pan2010/source-documents"
folder_2 = "/content/drive/MyDrive/experimental-models/su1_so1"
folder_3 = "/content/drive/MyDrive/experimental-models/su0_so0"

# ----------------------------------------------------------------------------
# A) Count all source files (.txt) in folder_1 that start with "source-"
#    We'll do a simple listdir for a flat directory.
# ----------------------------------------------------------------------------
all_source_files = [
    f for f in os.listdir(folder_1)
    if f.startswith("source-") and f.endswith(".txt")
       and os.path.isfile(os.path.join(folder_1, f))
]
count_1 = len(all_source_files)

# ----------------------------------------------------------------------------
# B) Count all *unique* source-*.txt files in folder_2 (including subfolders)
#    We'll use os.walk(...) to gather them recursively.
# ----------------------------------------------------------------------------
unique_files_folder_2 = set()
for root_dir, sub_dirs, files in os.walk(folder_2):
    for f in files:
        if f.startswith("source-") and f.endswith(".txt"):
            unique_files_folder_2.add(f)
count_2 = len(unique_files_folder_2)

# ----------------------------------------------------------------------------
# C) Count all source-*.txt files in folder_3 (no recursion needed here)
# ----------------------------------------------------------------------------
files_folder_3 = [
    f for f in os.listdir(folder_3)
    if f.startswith("source-") and f.endswith(".txt")
       and os.path.isfile(os.path.join(folder_3, f))
]
count_3 = len(files_folder_3)

# ----------------------------------------------------------------------------
# D) Print all three counts & check if (count_2 + count_3 == count_1)
# ----------------------------------------------------------------------------
print(f"Total 'source-*.txt' files in folder-1 ({folder_1}): {count_1}")
print(f"Unique 'source-*.txt' files in subfolders of folder-2 ({folder_2}): {count_2}")
print(f"Total 'source-*.txt' files in folder-3 ({folder_3}): {count_3}")

if (count_2 + count_3) == count_1:
    print("\nSUCCESS: (count_2 + count_3) == count_1. The sum matches exactly.")
else:
    print("\nWARNING: (count_2 + count_3) != count_1. The sum does NOT match.")


Total 'source-*.txt' files in folder-1 (/content/drive/MyDrive/experimental-models/pan2010/source-documents): 11148
Unique 'source-*.txt' files in subfolders of folder-2 (/content/drive/MyDrive/experimental-models/su1_so1): 4195
Total 'source-*.txt' files in folder-3 (/content/drive/MyDrive/experimental-models/su0_so0): 6953

SUCCESS: (count_2 + count_3) == count_1. The sum matches exactly.
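
The count check above confirms that the sizes add up, but equal sums alone do not prove that the folders form a clean partition: one missing file and one duplicated file would cancel out. A hedged sketch of a stricter check on the file-name sets themselves is shown below; `check_partition` and the demo names are hypothetical and not part of the original notebook.

```python
def check_partition(universe, *parts):
    """Report names missing from all parts, names present in more than one part,
    and names that are in some part but not in the universe."""
    universe = set(universe)
    seen = set()
    overlap = set()
    for part in parts:
        part = set(part)
        overlap |= seen & part   # names already seen in an earlier part
        seen |= part
    return {
        "missing": universe - seen,
        "overlap": overlap,
        "extra": seen - universe,
    }

# Demo: the sizes satisfy 2 + 1 == 3, yet the split is not a valid partition
demo_universe = {"a.txt", "b.txt", "c.txt"}
demo_report = check_partition(demo_universe, {"a.txt", "b.txt"}, {"b.txt"})
print(demo_report)
```

With the notebook's variables, the call would presumably be `check_partition(all_source_files, unique_files_folder_2, files_folder_3)`, and all three returned sets should be empty.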


In [None]:
import os

# --- Folder paths ---
folder_1 = "/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents"
folder_2 = "/content/drive/MyDrive/experimental-models/su1_so1"
folder_3 = "/content/drive/MyDrive/experimental-models/su1_so0"
folder_4 = "/content/drive/MyDrive/experimental-models/su0_so0"

# ----------------------------------------------------------------------------
# A) Count all suspicious files (.txt) in folder_1 that start with "suspicious-"
#    We'll do a simple listdir for a flat directory.
# ----------------------------------------------------------------------------
all_suspicious_files = [
    f for f in os.listdir(folder_1)
    if f.startswith("suspicious-") and f.endswith(".txt")
       and os.path.isfile(os.path.join(folder_1, f))
]
count_1 = len(all_suspicious_files)

# ----------------------------------------------------------------------------
# B) Count all *unique* suspicious-*.txt files in folder_2 (including subfolders)
#    We'll use os.walk(...) to gather them recursively.
# ----------------------------------------------------------------------------
unique_files_folder_2 = set()
for root_dir, sub_dirs, files in os.walk(folder_2):
    for f in files:
        if f.startswith("suspicious-") and f.endswith(".txt"):
            unique_files_folder_2.add(f)
count_2 = len(unique_files_folder_2)

# ----------------------------------------------------------------------------
# C) Count all *unique* suspicious-*.txt files in folder_3 (including subfolders)
#    We'll use os.walk(...) to gather them recursively.
# ----------------------------------------------------------------------------
unique_files_folder_3 = set()
for root_dir, sub_dirs, files in os.walk(folder_3):
    for f in files:
        if f.startswith("suspicious-") and f.endswith(".txt"):
            unique_files_folder_3.add(f)
count_3 = len(unique_files_folder_3)

# ----------------------------------------------------------------------------
# D) Count all suspicious-*.txt files in folder_4 (no recursion needed here)
# ----------------------------------------------------------------------------
files_folder_4 = [
    f for f in os.listdir(folder_4)
    if f.startswith("suspicious-") and f.endswith(".txt")
       and os.path.isfile(os.path.join(folder_4, f))
]
count_4 = len(files_folder_4)

# ----------------------------------------------------------------------------
# E) Print all four counts & check if (count_2 + count_3 + count_4 == count_1)
# ----------------------------------------------------------------------------
print(f"Total 'suspicious-*.txt' files in folder-1 ({folder_1}): {count_1}")
print(f"Unique 'suspicious-*.txt' files in subfolders of folder-2 ({folder_2}): {count_2}")
print(f"Unique 'suspicious-*.txt' files in subfolders of folder-3 ({folder_3}): {count_3}")
print(f"Total 'suspicious-*.txt' files in folder-4 ({folder_4}): {count_4}")

if (count_2 + count_3 + count_4) == count_1:
    print("\nSUCCESS: (count_2 + count_3 + count_4) == count_1. The sum matches exactly.")
else:
    print("\nWARNING: (count_2 + count_3 + count_4) != count_1. The sum does NOT match.")


Total 'suspicious-*.txt' files in folder-1 (/content/drive/MyDrive/experimental-models/pan2010/suspicious-documents): 11148
Unique 'suspicious-*.txt' files in subfolders of folder-2 (/content/drive/MyDrive/experimental-models/su1_so1): 3922
Unique 'suspicious-*.txt' files in subfolders of folder-3 (/content/drive/MyDrive/experimental-models/su1_so0): 1659
Total 'suspicious-*.txt' files in folder-4 (/content/drive/MyDrive/experimental-models/su0_so0): 5567

SUCCESS: (count_2 + count_3 + count_4) == count_1. The sum matches exactly.


In [None]:
import os
import xml.etree.ElementTree as ET

# Path to your main directory containing subfolders like "suspicious-document00001"
base_dir = "/content/drive/MyDrive/experimental-models/su1_so1"

# -------------------------------------------------------------------------
# Process every subfolder in base_dir (no skipping based on previous runs)
# -------------------------------------------------------------------------
for subfolder in os.listdir(base_dir):
    doc_folder = os.path.join(base_dir, subfolder)

    # Only handle directories whose names start with "suspicious-document"
    if os.path.isdir(doc_folder) and subfolder.startswith("suspicious-document"):
        doc_name = subfolder  # e.g. "suspicious-document00192"

        xml_file = os.path.join(doc_folder, f"{doc_name}.xml")
        txt_file = os.path.join(doc_folder, f"{doc_name}.txt")

        # Check if required files exist. If not, skip before creating any output folder
        if not (os.path.isfile(xml_file) and os.path.isfile(txt_file)):
            print(f"[WARNING] {subfolder} missing .xml or .txt file, skipping.")
            continue

        # We'll store extracted snippets in a new subfolder named "suspicious-documentXXXXX-parts"
        parts_folder = os.path.join(doc_folder, f"{doc_name}-parts")
        os.makedirs(parts_folder, exist_ok=True)

        print(f"\nProcessing subfolder: {subfolder} ...")

        # Parse the XML
        tree = ET.parse(xml_file)
        root = tree.getroot()
        features = root.findall(".//feature")

        # Read the entire suspicious doc text
        with open(txt_file, "r", encoding="utf-8") as f:
            suspicious_full_text = f.read()

        # For each <feature>, extract from suspicious & source if present
        for feature in features:
            this_offset = feature.get("this_offset")
            this_length = feature.get("this_length")

            source_reference = feature.get("source_reference")
            source_offset = feature.get("source_offset")
            source_length = feature.get("source_length")

            # --------- A) Suspicious snippet ---------
            if this_offset and this_length:
                offset_val = int(this_offset)
                length_val = int(this_length)

                # Take the raw substring, unmodified
                raw_snippet = suspicious_full_text[offset_val : offset_val + length_val]

                out_filename = f"{doc_name}-{offset_val}-{length_val}.txt"
                out_path = os.path.join(parts_folder, out_filename)

                with open(out_path, "w", encoding="utf-8") as out_f:
                    out_f.write(raw_snippet)

                print(f"  Suspicious snippet saved: {out_filename}")

            # --------- B) Source snippet ---------
            if source_reference and source_offset and source_length:
                source_doc_path = os.path.join(doc_folder, source_reference)
                if os.path.isfile(source_doc_path):
                    with open(source_doc_path, "r", encoding="utf-8") as src_f:
                        source_full_text = src_f.read()

                    src_offset_val = int(source_offset)
                    src_length_val = int(source_length)

                    # Again, take the raw substring
                    raw_src_snippet = source_full_text[src_offset_val : src_offset_val + src_length_val]

                    source_basename = os.path.splitext(source_reference)[0]
                    out_source_filename = f"{source_basename}-{src_offset_val}-{src_length_val}.txt"
                    out_source_path = os.path.join(parts_folder, out_source_filename)

                    with open(out_source_path, "w", encoding="utf-8") as src_out_f:
                        src_out_f.write(raw_src_snippet)

                    print(f"  Source snippet saved: {out_source_filename}")
                else:
                    print(f"  [WARNING] Source file not found: {source_doc_path}")

        print(f"Finished processing {subfolder}. Extracted snippets are in {parts_folder}")

print("\nAll subfolders processed.")


Streaming output truncated to the last 5000 lines.
  Suspicious snippet saved: suspicious-document10074-12900-2354.txt
  Source snippet saved: source-document11000-17165-2304.txt
  Suspicious snippet saved: suspicious-document10074-27238-2395.txt
  Source snippet saved: source-document11000-25560-2716.txt
  Suspicious snippet saved: suspicious-document10074-56759-3781.txt
  Source snippet saved: source-document01483-191238-3778.txt
  Suspicious snippet saved: suspicious-document10074-96844-1944.txt
  Source snippet saved: source-document11000-4605-1944.txt
  Suspicious snippet saved: suspicious-document10074-110132-224.txt
  Source snippet saved: source-document01483-61168-238.txt
  Suspicious snippet saved: suspicious-document10074-113229-169.txt
  Source snippet saved: source-document11000-25236-226.txt
  Suspicious snippet saved: suspicious-document10074-122947-1428.txt
  Source snippet saved: source-document11000-15061-1430.txt
  Suspicious snippet saved: suspicious-d
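
The extraction cell above slices each snippet out of the document text with `this_offset` and `this_length`. The sketch below isolates that step so it can be tried on an in-memory example. It is an illustration under assumptions: the function name, the demo XML, and the non-plagiarism `about` feature are hypothetical; the filter on `name="plagiarism"` and the length check are additions that the original cell does not perform.

```python
import xml.etree.ElementTree as ET

def extract_plagiarism_spans(xml_text, full_text):
    """Return a list of (offset, length, snippet) for every plagiarism feature."""
    root = ET.fromstring(xml_text)
    spans = []
    for feature in root.iter("feature"):
        if feature.get("name") != "plagiarism":
            continue  # ignore other metadata features
        offset = feature.get("this_offset")
        length = feature.get("this_length")
        if offset is None or length is None:
            continue
        start, size = int(offset), int(length)
        end = start + size
        if end > len(full_text):
            # Plain slicing would silently return a shorter string here
            raise ValueError(f"span {start}-{size} exceeds text length {len(full_text)}")
        spans.append((start, size, full_text[start:end]))
    return spans

# Hypothetical annotation file and document text
demo_xml = """<document reference="suspicious-document00000.txt">
  <feature name="about" title="demo" />
  <feature name="plagiarism" this_offset="4" this_length="5" />
</document>"""
demo_text = "The quick brown fox"
demo_spans = extract_plagiarism_spans(demo_xml, demo_text)
print(demo_spans)
```

Raising on an out-of-range span makes encoding or offset mismatches visible early, instead of writing truncated snippet files.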

In [None]:
import os
import xml.etree.ElementTree as ET

base_dir = "/content/drive/MyDrive/experimental-models/su1_so0"

# -------------------------------------------------------------------------
# Process every subfolder in base_dir (no skipping based on previous runs)
# -------------------------------------------------------------------------
for subfolder in os.listdir(base_dir):
    doc_folder = os.path.join(base_dir, subfolder)

    # Only handle directories whose names start with "suspicious-document"
    if os.path.isdir(doc_folder) and subfolder.startswith("suspicious-document"):
        doc_name = subfolder  # e.g. "suspicious-document00192"

        xml_file = os.path.join(doc_folder, f"{doc_name}.xml")
        txt_file = os.path.join(doc_folder, f"{doc_name}.txt")

        # Check if required files exist. If not, skip before creating any output folder
        if not (os.path.isfile(xml_file) and os.path.isfile(txt_file)):
            print(f"[WARNING] {subfolder} missing .xml or .txt file, skipping.")
            continue

        # We'll store extracted snippets in a new subfolder named "suspicious-documentXXXXX-parts"
        parts_folder = os.path.join(doc_folder, f"{doc_name}-parts")
        os.makedirs(parts_folder, exist_ok=True)

        print(f"\nProcessing subfolder: {subfolder} ...")

        # Parse the XML
        tree = ET.parse(xml_file)
        root = tree.getroot()
        features = root.findall(".//feature")

        # Read the entire suspicious doc text
        with open(txt_file, "r", encoding="utf-8") as f:
            suspicious_full_text = f.read()

        # For each <feature>, extract the raw substring from suspicious doc
        for feature in features:
            this_offset = feature.get("this_offset")
            this_length = feature.get("this_length")

            if this_offset and this_length:
                offset_val = int(this_offset)
                length_val = int(this_length)

                # Take the raw substring without adjustments
                raw_snippet = suspicious_full_text[offset_val : offset_val + length_val]

                out_filename = f"{doc_name}-{offset_val}-{length_val}.txt"
                out_path = os.path.join(parts_folder, out_filename)

                # Write the raw snippet as-is
                with open(out_path, "w", encoding="utf-8") as out_f:
                    out_f.write(raw_snippet)

                print(f"  Suspicious snippet saved: {out_filename}")

        print(f"Finished processing {subfolder}. Extracted snippets are in {parts_folder}")

print("\nAll subfolders processed.")


Streaming output truncated to the last 5000 lines.
  Suspicious snippet saved: suspicious-document10673-31865-249.txt
Finished processing suspicious-document10673. Extracted snippets are in /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document10673/suspicious-document10673-parts

Processing subfolder: suspicious-document10711 ...
  Suspicious snippet saved: suspicious-document10711-40829-17536.txt
  Suspicious snippet saved: suspicious-document10711-153206-21359.txt
  Suspicious snippet saved: suspicious-document10711-295941-1731.txt
  Suspicious snippet saved: suspicious-document10711-340053-22837.txt
  Suspicious snippet saved: suspicious-document10711-478580-244.txt
Finished processing suspicious-document10711. Extracted snippets are in /content/drive/MyDrive/experimental-models/su1_so0/suspicious-document10711/suspicious-document10711-parts

Processing subfolder: suspicious-document10702 ...
  Suspicious snippet saved: suspicious-document10702-9663-24

In [None]:
import os
import shutil

# Parent folder containing "suspicious-documents" and "source-documents"
pan2010_folder = "/content/drive/MyDrive/experimental-models/pan2010"

# Subfolders we want to copy
subfolders = ["suspicious-documents", "source-documents"]

# Destination folder for all files
destination_folder = "/content/drive/MyDrive/experimental-models/all-suspicious-and-source-files"

# Make sure the destination folder exists
os.makedirs(destination_folder, exist_ok=True)

# Extensions to skip – these are not real files but Google Doc placeholders
skip_extensions = {".gdoc", ".gsheet", ".gslides", ".gdraw"}

for subfolder in subfolders:
    current_subfolder_path = os.path.join(pan2010_folder, subfolder)
    if os.path.isdir(current_subfolder_path):
        # Copy every *file* in this subfolder, skipping .gdoc etc.
        for file_name in os.listdir(current_subfolder_path):
            source_file_path = os.path.join(current_subfolder_path, file_name)

            if not os.path.isfile(source_file_path):
                # Skip directories or anything not a file
                continue

            # Check if file extension is in our skip list
            ext = os.path.splitext(file_name)[1].lower()
            if ext in skip_extensions:
                print(f"Skipping Google placeholder file: {file_name}")
                continue

            # Otherwise, copy it
            destination_file_path = os.path.join(destination_folder, file_name)
            shutil.copy2(source_file_path, destination_file_path)

        print(f"Copied all supported files from {current_subfolder_path} to {destination_folder}")
    else:
        print(f"[WARNING] Subfolder not found: {current_subfolder_path}")

print("Done copying all suspicious and source documents (skipping .gdoc, etc.)!")


Skipping Google placeholder file: suspicious-document00036.gdoc
Copied all supported files from /content/drive/MyDrive/experimental-models/pan2010/suspicious-documents to /content/drive/MyDrive/experimental-models/all-suspicious-and-source-files
Copied all supported files from /content/drive/MyDrive/experimental-models/pan2010/source-documents to /content/drive/MyDrive/experimental-models/all-suspicious-and-source-files
Done copying all suspicious and source documents (skipping .gdoc, etc.)!
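
Because the cell above flattens two folders into one destination, a file name that exists in both folders would be silently overwritten by the second copy. The following sketch shows one possible pre-copy guard; `find_name_collisions` and the demo file names are hypothetical, and the demo uses temporary folders rather than the PAN directories.

```python
import os
import tempfile
from collections import defaultdict

def find_name_collisions(folders, skip_extensions=frozenset()):
    """Map each file name that occurs in more than one folder to those folders."""
    owners = defaultdict(list)
    for folder in folders:
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue  # skip directories
            if os.path.splitext(name)[1].lower() in skip_extensions:
                continue  # skip placeholder files such as .gdoc
            owners[name].append(folder)
    return {name: dirs for name, dirs in owners.items() if len(dirs) > 1}

# Demo with two throwaway folders that share one file name
with tempfile.TemporaryDirectory() as tmp:
    folder_a = os.path.join(tmp, "a")
    folder_b = os.path.join(tmp, "b")
    os.makedirs(folder_a)
    os.makedirs(folder_b)
    for folder, name in ((folder_a, "readme.txt"), (folder_a, "x.txt"), (folder_b, "readme.txt")):
        open(os.path.join(folder, name), "w").close()
    demo_collisions = sorted(find_name_collisions([folder_a, folder_b]))
    print(demo_collisions)
```

An empty result means the flat copy cannot overwrite anything; for the `suspicious-` and `source-` prefixes used here that is the expected outcome.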


In [None]:
import os
import xml.etree.ElementTree as ET

# -------------------------------------------------------------------------
# Main configuration
# -------------------------------------------------------------------------
su1_so1_base = "/content/drive/MyDrive/experimental-models/su1_so1"

# Suspicious and source documents were copied into one flat folder, so both paths are the same:
all_susp_folder = "/content/drive/MyDrive/experimental-models/all-suspicious-and-source-files"
all_src_folder = "/content/drive/MyDrive/experimental-models/all-suspicious-and-source-files"

def annotate_snippet_in_text(full_text, snippet, offset_val, length_val):
    """
    Wrap the snippet (as-is) with " . offset-length . " markers on both ends,
    and replace only the first occurrence of the snippet in full_text.
    Returns full_text unchanged if the snippet is not found.
    """
    # Build the annotated snippet
    annotation = f" . {offset_val}-{length_val} . "
    annotated_version = f"{annotation}{snippet}{annotation}"

    # Locate the first occurrence of the snippet
    index = full_text.find(snippet)
    if index == -1:
        # Not found; leave the text untouched
        print(f"    [WARNING] Could not find snippet in text for offset={offset_val}, length={length_val}")
        return full_text

    # Replace the first occurrence
    new_text = (full_text[:index]
                + annotated_version
                + full_text[index + len(snippet):])
    return new_text

def process_subfolder(folder_name):
    """
    Given a subfolder in su1_so1 (e.g. suspicious-document00192),
    parse the .xml, then for each <feature>:
      - read suspicious snippet from the -parts .txt,
      - find & annotate in all-suspicious .txt,
      - if source snippet exists, read from -parts, find & annotate in all-source .txt
    """
    doc_folder = os.path.join(su1_so1_base, folder_name)
    doc_name = folder_name  # e.g. "suspicious-document00192"

    xml_file = os.path.join(doc_folder, f"{doc_name}.xml")
    if not os.path.isfile(xml_file):
        print(f"[WARNING] No XML file found in {doc_folder}, skipping.")
        return

    # Parse the XML
    tree = ET.parse(xml_file)
    root = tree.getroot()
    features = root.findall(".//feature")

    # Find the suspicious doc reference
    # (Might also be in <document reference="suspicious-document00192.txt">, etc.)
    # But we can assume it matches the subfolder naming:
    suspicious_txt_name = f"{doc_name}.txt"  # "suspicious-document00192.txt"
    suspicious_main_path = os.path.join(all_susp_folder, suspicious_txt_name)

    # Read the existing suspicious doc text from the main folder.
    # If it is missing, skip this subfolder so the final write does not create an empty file.
    if not os.path.isfile(suspicious_main_path):
        print(f"[WARNING] Suspicious doc {suspicious_txt_name} not in {all_susp_folder}, skipping.")
        return
    with open(suspicious_main_path, "r", encoding="utf-8") as f:
        suspicious_main_text = f.read()

    # We'll accumulate changes in memory and overwrite at the end
    updated_susp_text = suspicious_main_text

    # For each <feature>, we might have this_offset/length + source_reference/offset/length
    for feature in features:
        this_offset = feature.get("this_offset")
        this_length = feature.get("this_length")

        # If suspicious snippet is defined
        if this_offset and this_length:
            snippet_file = os.path.join(doc_folder, f"{doc_name}-parts",
                                        f"{doc_name}-{this_offset}-{this_length}.txt")
            if os.path.isfile(snippet_file):
                with open(snippet_file, "r", encoding="utf-8") as sf:
                    snippet_text = sf.read()

                # Attempt to annotate in the suspicious document
                updated_susp_text = annotate_snippet_in_text(
                    updated_susp_text,
                    snippet_text,
                    this_offset,
                    this_length
                )
            else:
                print(f"    [WARNING] Suspicious snippet file not found: {snippet_file}")

        # If source snippet is defined
        source_reference = feature.get("source_reference")
        source_offset = feature.get("source_offset")
        source_length = feature.get("source_length")

        if source_reference and source_offset and source_length:
            # The snippet file for the source in the -parts folder
            source_snippet_file = os.path.join(
                doc_folder,
                f"{doc_name}-parts",
                f"{os.path.splitext(source_reference)[0]}-{source_offset}-{source_length}.txt"
            )
            if os.path.isfile(source_snippet_file):
                with open(source_snippet_file, "r", encoding="utf-8") as sf:
                    source_snippet_text = sf.read()

                # Now we want to annotate the source doc in the flat "all-suspicious-and-source-files" folder
                source_main_path = os.path.join(all_src_folder, source_reference)
                if os.path.isfile(source_main_path):
                    with open(source_main_path, "r", encoding="utf-8") as fsrc:
                        source_main_text = fsrc.read()

                    # Annotate
                    updated_source_text = annotate_snippet_in_text(
                        source_main_text,
                        source_snippet_text,
                        source_offset,
                        source_length
                    )

                    # Overwrite the source doc with the updated text
                    with open(source_main_path, "w", encoding="utf-8") as fsrc:
                        fsrc.write(updated_source_text)
                    print(f"    Updated source doc {source_reference} with offset={source_offset}, length={source_length}")
                else:
                    print(f"    [WARNING] Source doc {source_reference} not found in {all_src_folder}")
            else:
                print(f"    [WARNING] Source snippet file not found: {source_snippet_file}")

    # Finally, overwrite the suspicious doc with the updated text
    with open(suspicious_main_path, "w", encoding="utf-8") as f:
        f.write(updated_susp_text)

    print(f"Done updating suspicious doc: {suspicious_txt_name} in {all_susp_folder}")

# -------------------------------------------------------------------------
# MAIN LOOP: Process each subfolder in su1_so1
# -------------------------------------------------------------------------
for entry in os.listdir(su1_so1_base):
    subfolder_path = os.path.join(su1_so1_base, entry)
    if os.path.isdir(subfolder_path) and entry.startswith("suspicious-document"):
        print(f"\n>>> Processing subfolder: {entry}")
        process_subfolder(entry)

print("\nAll subfolders processed!")


In [None]:
import os
import xml.etree.ElementTree as ET

# -------------------------------------------------------------------------
# Folders where we read the original suspicious and source docs
# -------------------------------------------------------------------------
all_susp_folder = "/content/drive/MyDrive/experimental-models/all-suspicious-and-source-files"
all_src_folder  = "/content/drive/MyDrive/experimental-models/all-suspicious-and-source-files"

# -------------------------------------------------------------------------
# The base folder with subfolders like "suspicious-document00192"
# -------------------------------------------------------------------------
su1_so1_base = "/content/drive/MyDrive/experimental-models/su1_so1"

# Subfolders to process (here we'll process all that start with "suspicious-document")
subfolders_to_process = [entry for entry in os.listdir(su1_so1_base)
                         if os.path.isdir(os.path.join(su1_so1_base, entry))
                         and entry.startswith("suspicious-document")]

# Global list to accumulate missing file names (only the file names or paths)
missing_files = []

def log_missing(file_info):
    missing_files.append(file_info)

def annotate_snippet_in_text(full_text, snippet, offset_val, length_val):
    """
    Wrap the snippet (as-is) with offset-length tags on both ends,
    and replace only the first occurrence of the snippet in full_text.
    """
    annotation = f" . {offset_val}-{length_val} . "
    annotated_version = f"{annotation}{snippet}{annotation}"

    index = full_text.find(snippet)
    if index == -1:
        # If snippet is not found, log its information.
        print(f"    [WARNING] Snippet not found for offset={offset_val}, length={length_val}")
        # (Not a missing file, so we do not log the file name here.)
        return full_text

    new_text = full_text[:index] + annotated_version + full_text[index + len(snippet):]
    return new_text

def process_subfolder(folder_name):
    doc_folder = os.path.join(su1_so1_base, folder_name)
    doc_name = folder_name  # e.g., "suspicious-document00192"

    # Check XML file
    xml_file = os.path.join(doc_folder, f"{doc_name}.xml")
    if not os.path.isfile(xml_file):
        log_missing(f"XML file missing: {doc_name}.xml (in {doc_folder})")
        return

    # Parse the XML to get snippet references
    tree = ET.parse(xml_file)
    root = tree.getroot()
    features = root.findall(".//feature")

    # Suspicious doc file name (should be .txt)
    suspicious_txt_name = f"{doc_name}.txt"
    original_susp_path = os.path.join(all_susp_folder, suspicious_txt_name)
    if not os.path.isfile(original_susp_path):
        # Skip this subfolder so the final write does not create an empty file
        log_missing(f"Suspicious doc missing: {suspicious_txt_name} (in {all_susp_folder})")
        return
    with open(original_susp_path, "r", encoding="utf-8") as f:
        suspicious_main_text = f.read()

    updated_susp_text = suspicious_main_text

    for feature in features:
        # Process suspicious snippet
        this_offset = feature.get("this_offset")
        this_length = feature.get("this_length")
        if this_offset and this_length:
            snippet_file = os.path.join(
                doc_folder,
                f"{doc_name}-parts",
                f"{doc_name}-{this_offset}-{this_length}.txt"
            )
            if os.path.isfile(snippet_file):
                with open(snippet_file, "r", encoding="utf-8") as sf:
                    snippet_text = sf.read()
                updated_susp_text = annotate_snippet_in_text(
                    updated_susp_text,
                    snippet_text,
                    this_offset,
                    this_length
                )
            else:
                log_missing(f"Suspicious snippet missing: {snippet_file}")

        # Process source snippet
        source_reference = feature.get("source_reference")
        source_offset = feature.get("source_offset")
        source_length = feature.get("source_length")
        if source_reference and source_offset and source_length:
            snippet_file_src = os.path.join(
                doc_folder,
                f"{doc_name}-parts",
                f"{os.path.splitext(source_reference)[0]}-{source_offset}-{source_length}.txt"
            )
            if os.path.isfile(snippet_file_src):
                with open(snippet_file_src, "r", encoding="utf-8") as sf:
                    source_snippet_text = sf.read()
                original_source_path = os.path.join(all_src_folder, source_reference)
                if os.path.isfile(original_source_path):
                    with open(original_source_path, "r", encoding="utf-8") as fsrc:
                        source_main_text = fsrc.read()
                    updated_source_text = annotate_snippet_in_text(
                        source_main_text,
                        source_snippet_text,
                        source_offset,
                        source_length
                    )
                    # Overwrite the original source doc with the updated text
                    with open(original_source_path, "w", encoding="utf-8") as fsrc:
                        fsrc.write(updated_source_text)
                    print(f"    Updated source doc '{source_reference}' with offset={source_offset}, length={source_length}")
                else:
                    log_missing(f"Source doc missing: {source_reference} (in {all_src_folder})")
            else:
                log_missing(f"Source snippet missing: {os.path.splitext(source_reference)[0]}-{source_offset}-{source_length}.txt (in {doc_folder})")

    # Overwrite the original suspicious doc with the updated text
    with open(original_susp_path, "w", encoding="utf-8") as f:
        f.write(updated_susp_text)
    print(f"Done updating suspicious doc: {suspicious_txt_name} in {all_susp_folder}")

# MAIN: Process each subfolder in su1_so1 that starts with "suspicious-document"
for entry in os.listdir(su1_so1_base):
    subfolder_path = os.path.join(su1_so1_base, entry)
    if os.path.isdir(subfolder_path) and entry.startswith("suspicious-document"):
        print(f"\n>>> Processing subfolder: {entry}")
        process_subfolder(entry)

# Save the log file with missing file names to a new folder
log_folder = "/content/drive/MyDrive/experimental-models/update_logs"
os.makedirs(log_folder, exist_ok=True)
log_file_path = os.path.join(log_folder, "missing_files_log.txt")
with open(log_file_path, "w", encoding="utf-8") as log_f:
    log_f.write("\n".join(missing_files))

print("\nAll subfolders processed!")
print(f"Log of missing files saved to: {log_file_path}")



>>> Processing subfolder: suspicious-document02708
    Updated source doc 'source-document00216.txt' with offset=115094, length=4959
    Updated source doc 'source-document06124.txt' with offset=1715, length=273
    Updated source doc 'source-document06124.txt' with offset=706, length=242
Done updating suspicious doc: suspicious-document02708.txt in /content/drive/MyDrive/experimental-models/all-suspicious-and-source-files

>>> Processing subfolder: suspicious-document02630
    Updated source doc 'source-document05999.txt' with offset=176156, length=2454
    Updated source doc 'source-document05999.txt' with offset=117535, length=1565
    Updated source doc 'source-document05999.txt' with offset=103171, length=1469
Done updating suspicious doc: suspicious-document02630.txt in /content/drive/MyDrive/experimental-models/all-suspicious-and-source-files

>>> Processing subfolder: suspicious-document02576
    Updated source doc 'source-document01602.txt' with offset=83571, length=23056
   

KeyboardInterrupt: 

In [None]:
import os
import xml.etree.ElementTree as ET
import itertools

# -------------------------------------------------------------------------
# Folders for example files
# -------------------------------------------------------------------------
susp_xml_folder = "/content/drive/MyDrive/example-three-xml-files/selected-files/suspicious-xml"
source_files_folder = "/content/drive/MyDrive/example-three-xml-files/selected-files/source-files"

# -------------------------------------------------------------------------
# Get list of XML files in the suspicious-xml folder
# -------------------------------------------------------------------------
xml_files = [f for f in os.listdir(susp_xml_folder)
             if f.lower().endswith(".xml") and os.path.isfile(os.path.join(susp_xml_folder, f))]

print("Found the following XML files in suspicious-xml folder:")
for xml_file in sorted(xml_files):
    print("  " + xml_file)

# -------------------------------------------------------------------------
# Get list of available source file names (only TXT files) from source-files folder
# -------------------------------------------------------------------------
available_source_files = set(f for f in os.listdir(source_files_folder)
                             if f.lower().endswith(".txt"))

print("\nAvailable source files (from source-files folder):")
for src in sorted(available_source_files):
    print("  " + src)

# -------------------------------------------------------------------------
# For each XML file, parse it and extract its referenced source file names
# -------------------------------------------------------------------------
xml_sources = {}  # key: XML file name, value: set of referenced source file names (that exist)

for xml_file in xml_files:
    xml_path = os.path.join(susp_xml_folder, xml_file)
    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
        # Find all <feature> elements and collect their source_reference attribute
        sources = { feature.get("source_reference")
                    for feature in root.findall(".//feature")
                    if feature.get("source_reference") is not None }
        # Retain only those that exist in the source-files folder
        sources = sources.intersection(available_source_files)
        xml_sources[xml_file] = sources
        print(f"{xml_file} references {len(sources)} source file(s).")
    except Exception as e:
        print(f"[ERROR] Could not parse {xml_file}: {e}")

# -------------------------------------------------------------------------
# Compute pairwise intersections
# -------------------------------------------------------------------------
xml_files_list = list(xml_sources.keys())
print("\nPairwise intersections (2-wise):")
for combo in itertools.combinations(xml_files_list, 2):
    common = set.intersection(*(xml_sources[f] for f in combo))
    print(f"{combo[0]} & {combo[1]}: {len(common)} common source file(s)")

# -------------------------------------------------------------------------
# Compute intersections for 3-wise up to 10-wise (or n-wise if n < 10)
# -------------------------------------------------------------------------
n = len(xml_files_list)
for k in range(3, min(n, 10) + 1):
    print(f"\n{k}-wise intersections:")
    for combo in itertools.combinations(xml_files_list, k):
        common = set.intersection(*(xml_sources[f] for f in combo))
        combo_str = ", ".join(combo)
        print(f"{combo_str}: {len(common)} common source file(s)")

Found the following XML files in suspicious-xml folder:
  suspicious-document02849.xml
  suspicious-document03110.xml
  suspicious-document07293.xml

Available source files (from source-files folder):
  source-document00026.txt
  source-document00114.txt
  source-document00117.txt
  source-document00129.txt
  source-document00181.txt
  source-document00209.txt
  source-document00250.txt
  source-document00280.txt
  source-document00291.txt
  source-document00440.txt
  source-document00455.txt
  source-document00479.txt
  source-document00483.txt
  source-document00495.txt
  source-document00503.txt
  source-document00677.txt
  source-document00829.txt
  source-document00867.txt
  source-document00942.txt
  source-document00955.txt
  source-document01012.txt
  source-document01113.txt
  source-document01258.txt
  source-document01277.txt
  source-document01390.txt
  source-document01436.txt
  source-document01649.txt
  source-document01666.txt
  source-document01704.txt
  source-documen

In [None]:
import os
import re
import xml.etree.ElementTree as ET
import itertools
from collections import defaultdict

# -------------------------------------------------------------------------
# Base folders (adjust these paths as needed)
# -------------------------------------------------------------------------
# Folder with three subfolders: suspicious-xml, suspicious-files, source-files.
selected_files_folder = "/content/drive/MyDrive/example-three-xml-files/selected-files-1"

susp_xml_folder = os.path.join(selected_files_folder, "suspicious-xml")
susp_main_folder = os.path.join(selected_files_folder, "suspicious-files")
source_main_folder = os.path.join(selected_files_folder, "source-files")

# The folder that contains the "parts" subfolders (external folder structure)
parts_base_folder = "/content/drive/MyDrive/example-three-xml-files"

# -------------------------------------------------------------------------
# Summary data structures
# -------------------------------------------------------------------------
summary = {
    "found": [],
    "missing": []
}

# -------------------------------------------------------------------------
# Step 1: Get list of XML files from suspicious-xml folder
# -------------------------------------------------------------------------
xml_files = [os.path.join(susp_xml_folder, f)
             for f in os.listdir(susp_xml_folder)
             if f.lower().endswith(".xml") and os.path.isfile(os.path.join(susp_xml_folder, f))]
print(f"Found {len(xml_files)} XML files in {susp_xml_folder}")

if not xml_files:
    # Stop this cell without shutting down the notebook kernel
    raise SystemExit("No XML files found. Exiting.")

# Process all XML files (or you can adjust if needed)
for xml_path in xml_files:
    xml_filename = os.path.basename(xml_path)            # e.g., suspicious-document02849.xml
    base_name = os.path.splitext(xml_filename)[0]          # e.g., suspicious-document02849
    print(f"\nProcessing XML file: {xml_filename}")

    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
    except Exception as e:
        print(f"Error parsing {xml_filename}: {e}")
        continue

    # ---------------------------------------------------------------------
    # Step 2: Collect snippet information (each <feature> element)
    # ---------------------------------------------------------------------
    snippet_infos = []  # List of dicts.
    for feature in root.findall(".//feature"):
        # Suspicious snippet details (if available)
        this_offset = feature.get("this_offset")
        this_length = feature.get("this_length")
        if this_offset and this_length:
            snippet_infos.append({
                "type": "suspicious",
                "offset": this_offset,
                "length": this_length
            })
        # Source snippet details (if available)
        source_reference = feature.get("source_reference")
        source_offset = feature.get("source_offset")
        source_length = feature.get("source_length")
        if source_reference and source_offset and source_length:
            snippet_infos.append({
                "type": "source",
                "source_reference": source_reference,
                "offset": source_offset,
                "length": source_length
            })

    # ---------------------------------------------------------------------
    # Step 3: Create expected snippet file names from the XML snippet info.
    # ---------------------------------------------------------------------
    expected_files = []
    for info in snippet_infos:
        if info["type"] == "suspicious":
            expected_filename = f"{base_name}-{info['offset']}-{info['length']}.txt"
        else:
            expected_filename = f"{os.path.splitext(info['source_reference'])[0]}-{info['offset']}-{info['length']}.txt"
        expected_files.append((info["type"], expected_filename))

    # ---------------------------------------------------------------------
    # Step 4: Find the corresponding parts folder.
    # We assume that in parts_base_folder there is a subfolder whose name
    # starts with base_name (e.g., suspicious-document02849) and inside it,
    # there is a subfolder ending with "-parts".
    # ---------------------------------------------------------------------
    matched_parts_folder = None
    for candidate in os.listdir(parts_base_folder):
        candidate_path = os.path.join(parts_base_folder, candidate)
        if os.path.isdir(candidate_path) and candidate.startswith(base_name):
            for sub_candidate in os.listdir(candidate_path):
                if sub_candidate.endswith("-parts"):
                    matched_parts_folder = os.path.join(candidate_path, sub_candidate)
                    break
            if matched_parts_folder:
                break

    if not matched_parts_folder:
        summary["missing"].append(f"{base_name}: Parts folder not found in {parts_base_folder}")
        continue

    # List files in the matched parts folder.
    parts_files = os.listdir(matched_parts_folder)

    # ---------------------------------------------------------------------
    # Step 5: For each expected snippet file, match with files in parts folder.
    # ---------------------------------------------------------------------
    for snip_type, expected_filename in expected_files:
        matched = [f for f in parts_files if f.startswith(expected_filename)]
        if matched:
            matched_file = matched[0]
            parts_file_path = os.path.join(matched_parts_folder, matched_file)
            with open(parts_file_path, "r", encoding="utf-8") as pf:
                parts_text = pf.read()

            # Drop the last character of the snippet file before comparing; the rest is kept unchanged.
            if len(parts_text) > 2:
                parts_text = parts_text[0:-1]

            # Determine the corresponding main file based on snippet type.
            if snip_type == "suspicious":
                main_file_path = os.path.join(susp_main_folder, f"{base_name}.txt")
            else:
                src_ref = None
                for info in snippet_infos:
                    if info["type"] == "source":
                        exp_fn = f"{os.path.splitext(info['source_reference'])[0]}-{info['offset']}-{info['length']}.txt"
                        if exp_fn == expected_filename:
                            src_ref = info["source_reference"]
                            break
                if src_ref is None:
                    main_file_path = None
                else:
                    main_file_path = os.path.join(source_main_folder, src_ref)

            if main_file_path and os.path.isfile(main_file_path):
                with open(main_file_path, "r", encoding="utf-8") as mf:
                    main_text = mf.read()
                # Compare the parts text with the main text without any whitespace normalization.
                if parts_text in main_text:
                    summary["found"].append(f"{matched_file} found in {os.path.basename(main_file_path)}")
                else:
                    summary["missing"].append(f"{matched_file} NOT found in {os.path.basename(main_file_path)}")
            else:
                summary["missing"].append(f"{matched_file}: Main file not found (expected: {main_file_path})")
        else:
            summary["missing"].append(f"Expected parts file with prefix '{expected_filename}' not found in {matched_parts_folder} for {base_name}")

# -------------------------------------------------------------------------
# Step 6: Print the summary
# -------------------------------------------------------------------------
print("\nSummary of matched files:")
print("Found:")
for line in summary["found"]:
    print("  " + line)
print("\nMissing:")
for line in summary["missing"]:
    print("  " + line)


Found 3 XML files in /content/drive/MyDrive/example-three-xml-files/selected-files-1/suspicious-xml

Processing XML file: suspicious-document03110.xml

Processing XML file: suspicious-document07293.xml

Processing XML file: suspicious-document02849.xml

Summary of matched files:
Found:
  suspicious-document03110-5514-14996.txt found in suspicious-document03110.txt
  source-document04483-161-15034.txt found in source-document04483.txt
  suspicious-document03110-25140-20998.txt found in suspicious-document03110.txt
  source-document04288-1380-21016.txt found in source-document04288.txt
  suspicious-document03110-54013-18561.txt found in suspicious-document03110.txt
  source-document10795-154980-18552.txt found in source-document10795.txt
  suspicious-document03110-86274-23949.txt found in suspicious-document03110.txt
  source-document08758-334957-24015.txt found in source-document08758.txt
  suspicious-document03110-111731-3752.txt found in suspicious-document03110.txt
  source-document0

In [None]:
!pip install googletrans==3.1.0a0




The next cell translates the extracted plagiarized text segments into Kazakh. Before translating a file, it checks whether a translated version already exists and skips the file if so.
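The skip check can be sketched in isolation as follows. This is a minimal sketch, not the notebook's own code: the helper name `pending_translations` is hypothetical, and it assumes translated files sit next to their originals and carry the `kz-` prefix used in this notebook.

```python
import os

def pending_translations(folder, prefix="kz-"):
    """Return the .txt files in 'folder' that have no '<prefix><name>' counterpart yet.

    Hypothetical helper: assumes translated outputs are stored in the same
    folder as the originals and are named prefix + original file name.
    """
    names = set(os.listdir(folder))
    pending = []
    for name in sorted(names):
        if not name.lower().endswith(".txt"):
            continue  # not a text snippet
        if name.startswith(prefix):
            continue  # this is a translated output, not an input
        if prefix + name in names:
            continue  # already translated on an earlier run
        pending.append(name)
    return pending
```

Calling `pending_translations(root)` for each `-parts` folder would give the list of files that still need a translation request, which makes interrupted runs cheap to resume.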

In [None]:
import os
import re
import time
from googletrans import Translator

base_folder = "/content/drive/MyDrive/xml-files"

translator = Translator()

def chunk_text_by_paragraph(text, max_size=2000):
    """
    Splits 'text' into chunks of roughly 'max_size' characters by grouping
    whole paragraphs. Paragraphs are joined with a single space; a single
    paragraph longer than 'max_size' becomes its own (oversized) chunk.
    """
    # Split text into paragraphs by identifying natural paragraph breaks
    paragraphs = re.split(r'\n\s*\n', text)

    # Combine paragraphs into chunks without adding extra newlines
    chunks = []
    current_chunk = []
    current_length = 0
    for paragraph in paragraphs:
        if current_length + len(paragraph) <= max_size:
            current_chunk.append(paragraph)
            current_length += len(paragraph)
        else:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [paragraph]
            current_length = len(paragraph)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Walk through each folder under base_folder
for root, dirs, files in os.walk(base_folder):
    # Only process folders that end with '-parts'
    if root.endswith("-parts"):
        print(f"Processing folder: {root}")
        for file in files:
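            # Skip files that are themselves translated outputs ("kz-" prefix),
            # so a rerun does not translate them again as "kz-kz-..." files.
            if file.lower().endswith(".txt") and not file.startswith("kz-"):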
                original_file_path = os.path.join(root, file)
                # Build the expected translated file name "kz-" + original file name
                translated_file_name = "kz-" + file
                translated_file_path = os.path.join(root, translated_file_name)

                # Check if "kz-" version already exists
                if os.path.isfile(translated_file_path):
                    print(f"Skipping '{file}', already translated as '{translated_file_name}'")
                    continue

                try:
                    with open(original_file_path, "r", encoding="utf-8") as f:
                        content = f.read()

                    max_retries = 3
                    wait_time = 2
                    translated_text = None

                    for attempt in range(max_retries):
                        try:
                            # Chunk the text by paragraph boundaries (max ~2000 chars each)
                            chunks = chunk_text_by_paragraph(content, max_size=2000)

                            outchunks = []
                            for c in chunks:
                                # Attempt translation for each chunk
                                res = translator.translate(c, dest='kk')
                                if not res or not res.text:
                                    raise ValueError("Translation result is None.")
                                outchunks.append(res.text)

                            # Join translated chunks with a single space between them
                            translated_text = " ".join(outchunks)
                            break  # Successful translation; break out of retry loop
                        except Exception as e:
                            print(f"[{file}] Attempt {attempt+1} failed: {e}")
                            time.sleep(wait_time)

                    if translated_text is None:
                        print(f"[ERROR] Could not translate '{file}' after {max_retries} attempts.")
                        continue

                    # Save the result with "kz-" prepended to the original file name
                    with open(translated_file_path, "w", encoding="utf-8") as out_f:
                        out_f.write(translated_text)
                    print(f"Translated {file} -> {translated_file_name}")

                except Exception as e:
                    print(f"Error reading or translating {file}: {e}")


In [None]:
import os
import shutil

# Base folder to search (adjust as needed)
base_folder = "/content/drive/MyDrive/xml-files"

# Destination folder for all unique translated Kazakh documents
dest_folder = "/content/drive/MyDrive/example-three-xml-files/all-unique-translated-kazakh-documents"
os.makedirs(dest_folder, exist_ok=True)

# Walk through the directory tree in the base folder
for root, dirs, files in os.walk(base_folder):
    # Skip if we're already in the destination folder to avoid moving files repeatedly
    if dest_folder in root:
        continue
    for file in files:
        # Identify files that are translated (prefix "kz-" and ending with .txt)
        if file.startswith("kz-") and file.lower().endswith(".txt"):
            source_path = os.path.join(root, file)
            dest_path = os.path.join(dest_folder, file)
            print(f"Moving {source_path} to {dest_path}")
            shutil.move(source_path, dest_path)

print("All translated files have been moved to the destination folder.")


Moving /content/drive/MyDrive/example-three-xml-files/suspicious-document03110/suspicious-document03110-parts/kz-suspicious-document03110-5514-14996.txt to /content/drive/MyDrive/example-three-xml-files/all-unique-translated-kazakh-documents/kz-suspicious-document03110-5514-14996.txt
Moving /content/drive/MyDrive/example-three-xml-files/suspicious-document03110/suspicious-document03110-parts/kz-suspicious-document03110-25140-20998.txt to /content/drive/MyDrive/example-three-xml-files/all-unique-translated-kazakh-documents/kz-suspicious-document03110-25140-20998.txt
Moving /content/drive/MyDrive/example-three-xml-files/suspicious-document03110/suspicious-document03110-parts/kz-suspicious-document03110-54013-18561.txt to /content/drive/MyDrive/example-three-xml-files/all-unique-translated-kazakh-documents/kz-suspicious-document03110-54013-18561.txt
Moving /content/drive/MyDrive/example-three-xml-files/suspicious-document03110/suspicious-document03110-parts/kz-source-document10795-154980-

In [None]:
import os
import re
import time
import xml.etree.ElementTree as ET
from collections import defaultdict
from googletrans import Translator

# -------------------------------------------------------------------------
# Base folders (adjust these paths as needed)
# -------------------------------------------------------------------------
selected_files_folder = "/content/drive/MyDrive/xml-files"

susp_xml_folder = os.path.join(selected_files_folder, "suspicious-xml")
susp_main_folder = os.path.join(selected_files_folder, "suspicious-files")
source_main_folder = os.path.join(selected_files_folder, "source-files")

# Folder with all the Kazakh translated snippet files (unique translated documents)
kazakh_folder = "/content/drive/MyDrive/xml-files/all-unique-translated-kazakh-documents"

# The folder that contains the "parts" subfolders (external folder structure)
parts_base_folder = "/content/drive/MyDrive/xml-files"

# -------------------------------------------------------------------------
# Summary data structures
# -------------------------------------------------------------------------
summary = {
    "found": [],
    "skipped": [],
    "missing": []
}

# Dictionary to track which snippet (expected filename) has been updated in a main file.
updated_snippets = defaultdict(set)

translator = Translator()

# -------------------------------------------------------------------------
# Step 1: Get list of XML files from suspicious-xml folder
# -------------------------------------------------------------------------
xml_files = [os.path.join(susp_xml_folder, f)
             for f in os.listdir(susp_xml_folder)
             if f.lower().endswith(".xml") and os.path.isfile(os.path.join(susp_xml_folder, f))]
print(f"Found {len(xml_files)} XML files in {susp_xml_folder}")

if not xml_files:
    # Stop this cell without shutting down the notebook kernel
    raise SystemExit("No XML files found. Exiting.")

# Process all XML files
for xml_path in xml_files:
    xml_filename = os.path.basename(xml_path)            # e.g., suspicious-document02849.xml
    base_name = os.path.splitext(xml_filename)[0]          # e.g., suspicious-document02849
    print(f"\nProcessing XML file: {xml_filename}")

    try:
        tree = ET.parse(xml_path)
        root = tree.getroot()
    except Exception as e:
        print(f"Error parsing {xml_filename}: {e}")
        continue

    # ---------------------------------------------------------------------
    # Step 2: Collect snippet information (each <feature> element)
    # ---------------------------------------------------------------------
    snippet_infos = []  # List of dicts.
    for feature in root.findall(".//feature"):
        # Suspicious snippet details (if available)
        this_offset = feature.get("this_offset")
        this_length = feature.get("this_length")
        if this_offset and this_length:
            snippet_infos.append({
                "type": "suspicious",
                "offset": this_offset,
                "length": this_length
            })
        # Source snippet details (if available)
        source_reference = feature.get("source_reference")
        source_offset = feature.get("source_offset")
        source_length = feature.get("source_length")
        if source_reference and source_offset and source_length:
            snippet_infos.append({
                "type": "source",
                "source_reference": source_reference,
                "offset": source_offset,
                "length": source_length
            })

    # ---------------------------------------------------------------------
    # Step 3: Create expected snippet file names from the XML snippet info.
    # ---------------------------------------------------------------------
    expected_files = []
    for info in snippet_infos:
        if info["type"] == "suspicious":
            expected_filename = f"{base_name}-{info['offset']}-{info['length']}.txt"
        else:
            expected_filename = f"{os.path.splitext(info['source_reference'])[0]}-{info['offset']}-{info['length']}.txt"
        expected_files.append((info["type"], expected_filename))

    # ---------------------------------------------------------------------
    # Step 4: Find the corresponding parts folder.
    # We assume that in parts_base_folder there is a subfolder whose name
    # starts with base_name and inside it, there is a subfolder ending with "-parts".
    # ---------------------------------------------------------------------
    matched_parts_folder = None
    for candidate in os.listdir(parts_base_folder):
        candidate_path = os.path.join(parts_base_folder, candidate)
        if os.path.isdir(candidate_path) and candidate.startswith(base_name):
            for sub_candidate in os.listdir(candidate_path):
                if sub_candidate.endswith("-parts"):
                    matched_parts_folder = os.path.join(candidate_path, sub_candidate)
                    break
            if matched_parts_folder:
                break

    if not matched_parts_folder:
        summary["missing"].append(f"{base_name}: Parts folder not found in {parts_base_folder}")
        continue

    # List files in the matched parts folder.
    parts_files = os.listdir(matched_parts_folder)

    # ---------------------------------------------------------------------
    # Step 5: For each expected snippet file, match with files in parts folder and update main file.
    # ---------------------------------------------------------------------
    for snip_type, expected_filename in expected_files:
        # Determine the main file path for this snippet.
        if snip_type == "suspicious":
            main_file_path = os.path.join(susp_main_folder, f"{base_name}.txt")
        else:
            src_ref = None
            for info in snippet_infos:
                if info["type"] == "source":
                    exp_fn = f"{os.path.splitext(info['source_reference'])[0]}-{info['offset']}-{info['length']}.txt"
                    if exp_fn == expected_filename:
                        src_ref = info["source_reference"]
                        break
            if src_ref is None:
                main_file_path = None
            else:
                main_file_path = os.path.join(source_main_folder, src_ref)

        if main_file_path is None or not os.path.isfile(main_file_path):
            summary["missing"].append(f"{expected_filename}: Main file not found (expected: {main_file_path})")
            continue

        if expected_filename in updated_snippets[main_file_path]:
            summary["skipped"].append(f"{expected_filename} already updated in {os.path.basename(main_file_path)}")
            continue

        matched = [f for f in parts_files if f.startswith(expected_filename)]
        if not matched:
            summary["missing"].append(f"Expected parts file with prefix '{expected_filename}' not found in {matched_parts_folder} for {base_name}")
            continue

        matched_file = matched[0]
        parts_file_path = os.path.join(matched_parts_folder, matched_file)
        with open(parts_file_path, "r", encoding="utf-8") as pf:
            parts_text = pf.read()

        # Drop the last character of the snippet (same slice as in the verification cell above),
        # so that the search string is never empty.
        if len(parts_text) > 2:
            parts_text = parts_text[0:-1]

        with open(main_file_path, "r", encoding="utf-8") as mf:
            main_text = mf.read()

        # Load the corresponding Kazakh snippet file
        kazakh_file = os.path.join(kazakh_folder, "kz-" + expected_filename)
        if not os.path.isfile(kazakh_file):
            summary["missing"].append(f"Translated Kazakh file not found: {kazakh_file}")
            continue
        with open(kazakh_file, "r", encoding="utf-8") as kf:
            kazakh_snippet = kf.read()

        # Add a leading and trailing space to the Kazakh snippet
        kazakh_snippet = " " + kazakh_snippet + " "

        # Check if the main text already contains the Kazakh snippet
        if kazakh_snippet in main_text:
            summary["skipped"].append(f"{expected_filename} already replaced in {os.path.basename(main_file_path)}")
            updated_snippets[main_file_path].add(expected_filename)
            continue

        # Otherwise, try to find the original snippet in the main text.
        if parts_text in main_text:
            updated_main_text = main_text.replace(parts_text, kazakh_snippet, 1)
            with open(main_file_path, "w", encoding="utf-8") as mf:
                mf.write(updated_main_text)
            summary["found"].append(f"{matched_file} replaced in {os.path.basename(main_file_path)}")
            updated_snippets[main_file_path].add(expected_filename)
        else:
            summary["missing"].append(f"{matched_file} (original snippet) NOT found in {os.path.basename(main_file_path)}")

# -------------------------------------------------------------------------
# Step 6: Print the summary
# -------------------------------------------------------------------------
print("\nSummary of updated files:")
print("Replaced:")
for line in summary["found"]:
    print("  " + line)
print("\nSkipped (already updated):")
for line in summary["skipped"]:
    print("  " + line)
print("\nMissing:")
for line in summary["missing"]:
    print("  " + line)


Found 3 XML files in /content/drive/MyDrive/example-three-xml-files/selected-files-1/suspicious-xml

Processing XML file: suspicious-document03110.xml

Processing XML file: suspicious-document07293.xml

Processing XML file: suspicious-document02849.xml

Summary of updated files:
Replaced:
  source-document05880-7640-1951.txt replaced in source-document05880.txt
  source-document08018-12475-15603.txt replaced in source-document08018.txt

Skipped (already updated):
  suspicious-document03110-5514-14996.txt already replaced in suspicious-document03110.txt
  source-document04483-161-15034.txt already replaced in source-document04483.txt
  suspicious-document03110-25140-20998.txt already replaced in suspicious-document03110.txt
  source-document04288-1380-21016.txt already replaced in source-document04288.txt
  suspicious-document03110-54013-18561.txt already replaced in suspicious-document03110.txt
  source-document10795-154980-18552.txt already replaced in source-document10795.txt
  suspi

In [None]:
import os
import re
import time
from googletrans import Translator

base_folder = "/content/drive/MyDrive/xml-files/suspicious-files"

translator = Translator()

def chunk_text_by_paragraph(text, max_size=2000):
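    """
    Splits 'text' into chunks of roughly 'max_size' characters by grouping
    whole paragraphs. Paragraphs are joined with a single space; a single
    paragraph longer than 'max_size' becomes its own (oversized) chunk.
    """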
    # Split text into paragraphs by natural paragraph breaks
    paragraphs = re.split(r'\n\s*\n', text)

    # Combine paragraphs into chunks without adding extra newlines
    chunks = []
    current_chunk = []
    current_length = 0
    for paragraph in paragraphs:
        if current_length + len(paragraph) <= max_size:
            current_chunk.append(paragraph)
            current_length += len(paragraph)
        else:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [paragraph]
            current_length = len(paragraph)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

# Walk through each folder under base_folder
for root, dirs, files in os.walk(base_folder):
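    # Translate whole documents: every "suspicious-document*.txt" file under base_folder.
    # The prefix check also skips previously written "kz-" outputs.
    if files:
        print(f"Processing folder: {root}")
        for file in files:
            if file.startswith("suspicious-document") and file.lower().endswith(".txt"):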
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, "r", encoding="utf-8") as f:
                        content = f.read()

                    max_retries = 3
                    wait_time = 2
                    translated_text = None

                    for attempt in range(max_retries):
                        try:
                            # Chunk the text by paragraph boundaries (max ~2000 chars each)
                            chunks = chunk_text_by_paragraph(content, max_size=2000)

                            outchunks = []
                            for c in chunks:
                                # Attempt translation for each chunk
                                res = translator.translate(c, dest='kk')
                                if not res or not res.text:
                                    raise ValueError("Translation result is None.")
                                outchunks.append(res.text)

                            # Join translated chunks with a single space between them
                            translated_text = " ".join(outchunks)
                            break  # Successful translation; break out of retry loop
                        except Exception as e:
                            print(f"[{file}] Attempt {attempt+1} failed: {e}")
                            time.sleep(wait_time)

                    if translated_text is None:
                        print(f"[ERROR] Could not translate '{file}' after {max_retries} attempts.")
                        continue

                    # Save the result with "kz-" prepended to the original file name
                    new_file_name = "kz-" + file
                    new_file_path = os.path.join(root, new_file_name)
                    with open(new_file_path, "w", encoding="utf-8") as out_f:
                        out_f.write(translated_text)
                    print(f"Translated {file} -> {new_file_name}")

                except Exception as e:
                    print(f"Error reading or translating {file}: {e}")


The next cell translates whole documents. Before translating, it checks whether a `kz-` version of the document already exists and skips the document if so.

If you see repeated failures with 2,000-character chunks, you can reduce the limit to 1,000 or even 500 to lower the chance of hitting unexpected rate limits or long-response issues. In many workflows, though, 2,000 characters per chunk is a good balance between preserving translation context and minimizing request failures.


In [None]:
import os
import re
import time
from googletrans import Translator

# Folders containing files you want to translate
susp_main_folder = "/content/drive/MyDrive/xml-files/suspicious-files"
source_main_folder = "/content/drive/MyDrive/xml-files/source-files"

folders_to_translate = [susp_main_folder, source_main_folder]
translator = Translator()

def chunk_text_by_paragraph(text, max_size=2000):
    paragraphs = re.split(r'\n\s*\n', text)
    chunks = []
    current_chunk = []
    current_length = 0
    for paragraph in paragraphs:
        if current_length + len(paragraph) <= max_size:
            current_chunk.append(paragraph)
            current_length += len(paragraph)
        else:
            if current_chunk:
                chunks.append(" ".join(current_chunk))
            current_chunk = [paragraph]
            current_length = len(paragraph)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

max_retries = 3
wait_time = 2

for folder_path in folders_to_translate:
    print(f"\n=== Now processing folder: {folder_path} ===\n")
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            # Only handle .txt and skip files already named "kz-..."
            if file.lower().endswith(".txt") and not file.startswith("kz-"):
                original_file_path = os.path.join(root, file)

                # If "kz-" version already exists, skip
                translated_file_name = "kz-" + file
                translated_file_path = os.path.join(root, translated_file_name)
                if os.path.isfile(translated_file_path):
                    print(f"Skipping {file}, already translated as {translated_file_name}")
                    continue

                try:
                    with open(original_file_path, "r", encoding="utf-8") as f:
                        content = f.read()

                    translated_text = None
                    for attempt in range(max_retries):
                        try:
                            chunks = chunk_text_by_paragraph(content, max_size=2000)
                            outchunks = []
                            for c in chunks:
                                res = translator.translate(c, dest='kk')
                                if not res or not res.text:
                                    raise ValueError("Translation result is None.")
                                outchunks.append(res.text)
                            translated_text = " ".join(outchunks)
                            break
                        except Exception as e:
                            print(f"[{file}] Attempt {attempt+1} failed: {e}")
                            time.sleep(wait_time)

                    if translated_text is None:
                        print(f"[ERROR] Could not translate '{file}' after {max_retries} attempts.")
                        continue

                    # Save the result
                    with open(translated_file_path, "w", encoding="utf-8") as out_f:
                        out_f.write(translated_text)

                    print(f"Translated {file} -> {translated_file_name}")

                except Exception as e:
                    print(f"Error reading or translating {file}: {e}")
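
The chunk-size advice above can also be automated. The following is a minimal sketch, not part of the original pipeline: `translate_with_fallback` and its `translate_fn` parameter are hypothetical names, and the fallback sizes simply mirror the 2,000 / 1,000 / 500 values mentioned in the text. It retries the whole document with a smaller chunk size whenever any chunk fails.

```python
import re

def chunk_text_by_paragraph(text, max_size=2000):
    """Group paragraphs into chunks of roughly max_size characters or fewer."""
    paragraphs = re.split(r'\n\s*\n', text)
    chunks, current, length = [], [], 0
    for paragraph in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and length + len(paragraph) > max_size:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(paragraph)
        length += len(paragraph)
    if current:
        chunks.append(" ".join(current))
    return chunks

def translate_with_fallback(text, translate_fn, sizes=(2000, 1000, 500)):
    """Try each chunk size in turn; return the first complete translation."""
    last_error = None
    for size in sizes:
        try:
            chunks = chunk_text_by_paragraph(text, max_size=size)
            return " ".join(translate_fn(c) for c in chunks)
        except Exception as e:  # broad on purpose: client errors vary
            last_error = e
    raise RuntimeError(f"All chunk sizes failed: {last_error}")
```

With the `translator` object from the cell above, `translate_fn` could be `lambda c: translator.translate(c, dest='kk').text`. A single paragraph longer than the limit is still sent as one chunk, so this sketch does not guarantee a hard upper bound on request size.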

The next cell translates whole documents one at a time. This makes it easier to recover when Google Translate stops unexpectedly, because you know which document to restart from.

In [None]:
!pip install googletrans==3.1.0a0

import os
import os.path
import time
from googletrans import Translator
from google.colab import drive

target_language = "kk"

#path for google drive
googledrive='/content/drive'
drive.mount(googledrive)

#folders' paths in the google-drive from where files are going to be read
path1=googledrive+'/MyDrive/PAN2010/suspicious-files/'
path2=googledrive+'/MyDrive/PAN2010/suspicious-translated-files/'

#counting the number of documents for translation
#n=len([entry for entry in os.listdir(path1) if os.path.isfile(os.path.join(path1, entry))])

# Desired chunk size
chunk_size = 4040

m=15820
n=15926

#function for translation
def translate_text(text):
  translator = Translator()
  if len(text)!=0:
    translated_text = translator.translate(text, dest=target_language)
    time.sleep(2)
    return translated_text.text+' '
  else:
    return '\n'

def divide_string(text, chunk_size):
  chunks = []
  while len(text) > chunk_size:
    # Find the last period before the chunk_size limit
    last_period_index = text.rfind('.', 0, chunk_size)

    # If no period is found, just split at chunk_size
    if last_period_index == -1:
      last_period_index = chunk_size

    # Create a chunk that ends with the last period
    chunk = text[:last_period_index + 1].strip()

    # Add the chunk to the list
    chunks.append(chunk)

    # Remove the processed text from the original string
    text = text[last_period_index + 1:].strip()

  # Add the remaining text as the last chunk
  if text:
    chunks.append(text)

  return chunks

#searching and writing the path of each document
for i in range(m,n):

  # Zero-pad the document number to five digits, e.g. 42 -> "00042"
  path3 = path1 + f'suspicious-document{i:05d}.txt'
  if not os.path.isfile(path3):
    continue

  path4=path2+'suspicious-translated'+str(i)+'.txt'

#reading a document from a file

  bla=""

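  with open(path3,encoding="utf8", errors='ignore') as file:
    for item in file:
      # Skip blank lines; join the remaining lines with single spaces
      if item.strip():
        bla+=item.strip()
        bla+=' '

  #print("File:",i)

  # Divide the large string into chunks
  text_chunks = divide_string(bla, chunk_size)

  # Append each translated chunk as soon as it is ready, so completed work
  # survives an interruption; write as UTF-8 so Kazakh text is stored correctly
  for chunk in text_chunks:
    with open(path4, 'a+', encoding='utf8') as f:
      f.write(translate_text(chunk))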
      #f.write("\n")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
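
Rather than editing `m` by hand after an interruption, the restart point can be located by comparing the two folders. This is a minimal sketch under stated assumptions: `first_untranslated` is a hypothetical helper, and the file-name patterns are copied from the cell above (`suspicious-documentNNNNN.txt` for sources, `suspicious-translatedN.txt` for outputs).

```python
import os

def first_untranslated(src_dir, dst_dir, start, stop):
    """Return the first index in [start, stop) that has a source file
    but no translated file yet, or None if every source is translated."""
    for i in range(start, stop):
        src = os.path.join(src_dir, f"suspicious-document{i:05d}.txt")
        dst = os.path.join(dst_dir, f"suspicious-translated{i}.txt")
        if os.path.isfile(src) and not os.path.isfile(dst):
            return i
    return None
```

Because the cell above appends to the output file chunk by chunk, a document that was interrupted midway already has an output file and would be treated as done here. It is safer to restart one document before the index this helper returns and delete that partial output first.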


The next cell translates the source documents in the index range `m` to `n`, using thread pools to work on several documents and chunks at once. If Google Translate stops unexpectedly, the translated files already written show from which document it is best to start again.

In [None]:
!pip install googletrans==3.1.0a0

import os
import concurrent.futures
import time
from googletrans import Translator
from google.colab import drive

# ... (drive mounting and paths setup)
googledrive = '/content/drive'
drive.mount(googledrive)

# Drive paths
path1 = googledrive + '/MyDrive/PAN2010/source-files/'
path2 = googledrive + '/MyDrive/PAN2010/translated/'

# ... (chunk_size, m, n, and other configurations)
target_language = "kk"
chunk_size = 4040
max_workers = 8
delay_between_chunks = 2

#first and last number of documents
m=1
n=11149

# ... (translate_text and divide_string functions)
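# Function for translation
def translate_text(text):
    # Create the client here, as in the previous cell; no `translator` is defined elsewhere in this cell
    translator = Translator()
    return translator.translate(text, dest=target_language).text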

#function takes care to ensure that chunks end with a complete sentence
def divide_string(text, chunk_size):
  chunks = []
  while len(text) > chunk_size:
    # Find the last period before the chunk_size limit
    last_period_index = text.rfind('.', 0, chunk_size)

    # If no period is found, just split at chunk_size
    if last_period_index == -1:
      last_period_index = chunk_size

    # Create a chunk that ends with the last period
    chunk = text[:last_period_index + 1].strip()

    # Add the chunk to the list
    chunks.append(chunk)

    # Remove the processed text from the original string
    text = text[last_period_index + 1:].strip()

  # Add the remaining text as the last chunk
  if text:
    chunks.append(text)
  return chunks

# Create target directory if not exists
os.makedirs(path2, exist_ok=True)

# Function to translate and save a chunk
def translate_and_save_chunk(chunk, target_path):
    translated_text = translate_text(chunk)
    with open(target_path, 'a+', encoding='utf8') as target_file:
        target_file.write(translated_text + ' ')

def process_document(i):
    # Zero-pad the document index to five digits, e.g. 42 -> "00042"
    file_num_str = f"{i:05d}"

    # Construct the source and target file paths
    source_path = path1 + f'source-document{file_num_str}.txt'
    target_path = path2 + f'source-translated{file_num_str}.txt'

    # Check if the source file exists before processing
    if os.path.isfile(source_path):
        # Open and process the source file
        with open(source_path, encoding="utf8", errors='ignore') as file:
            text = ' '.join(line.strip() for line in file if line.strip())

        text_chunks = divide_string(text, chunk_size)

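        # Translate the chunks concurrently but keep them in document order:
        # executor.map returns results in input order, whereas appending from
        # each worker as it finishes would interleave chunks unpredictably.
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            translated_chunks = list(executor.map(translate_text, text_chunks))

        with open(target_path, 'a+', encoding='utf8') as target_file:
            target_file.write(' '.join(translated_chunks) + ' ')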

# Main loop
if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in range(m, n):
            executor.submit(process_document, i)
            time.sleep(delay_between_chunks)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
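
When chunks of one document are translated in parallel, workers can finish in any order, so the order of the output has to be enforced explicitly. The sketch below is an illustration only: `translate_chunks_in_order` and `translate_fn` are hypothetical names. It relies on `Executor.map`, which yields results in the order of its inputs regardless of which worker finishes first.

```python
import concurrent.futures

def translate_chunks_in_order(chunks, translate_fn, max_workers=8):
    """Translate chunks in parallel and return results in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() preserves input order even if later chunks finish earlier
        return list(executor.map(translate_fn, chunks))
```

The returned list can then be joined and written to the target file in a single step, which avoids several threads appending to the same file at once.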


### **Purpose of the Code:**

The code below is used to **organize all pairs of plagiarised texts** extracted from the **PAN corpora**, along with their **Kazakh translations**, into **separate, structured folders**.


### **Background Workflow Summary:**

1. **Extraction:**  
   Initially, we extracted all plagiarised segments from the **PAN dataset** using the **offset values** provided in the corresponding **PAN XML files**. These segments were saved as **separate `.txt` files** in designated **English folders** (suspicious and source).

2. **Translation:**  
   Each English snippet was then **translated into Kazakh** using **Google Translate**, and stored in parallel **Kazakh folders**. The **file names** of the Kazakh translations **mirror** their English counterparts, except that they include the prefix **`kz-`**.

3. **Organization:**  
   The code below automates the process of **grouping corresponding English and Kazakh text pairs** (both suspicious and source) into **individual folders**. These folders are named based on document references and offset values as specified in the original PAN XML files.
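
As a minimal illustration of step 1 (extraction), the sketch below slices one plagiarised segment out of a suspicious document from the `this_offset` and `this_length` attributes and builds the `<document>-<start>-<end>.txt` snippet name used throughout this notebook. It assumes the offsets are character offsets into the decoded text; `extract_plagiarised_segments` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def extract_plagiarised_segments(xml_string, suspicious_text):
    """Yield (snippet_filename, segment_text) for each plagiarism feature."""
    root = ET.fromstring(xml_string)
    doc_ref = root.attrib.get("reference", "")
    doc_ref = doc_ref.replace(".xml", "").replace(".txt", "")
    for feature in root.findall(".//feature[@name='plagiarism']"):
        offset = int(feature.attrib["this_offset"])
        length = int(feature.attrib["this_length"])
        end = offset + length
        # Assumption: offsets count characters, not bytes
        yield f"{doc_ref}-{offset}-{end}.txt", suspicious_text[offset:end]
```

The same slicing applies to the source side with `source_offset` and `source_length`, read from the document named in `source_reference`.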

In [None]:
import os
import shutil
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor, as_completed

# Define directories
suspicious_xml_folder = "/content/drive/MyDrive/updating-pan-xml-files/all-xml-files-with-plagiarism"
eng_suspicious_docs_folder = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised-partitioned-suspicious-and-source-files/su1so1-plagiarised-english-suspicious-parts-with-english-offsets"
eng_source_docs_folder = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised-partitioned-suspicious-and-source-files/su1so1-plagiarised-english-source-parts-with-english-offsets"
kaz_suspicious_docs_folder = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised-partitioned-suspicious-and-source-files/su1so1-plagiarised-kazakh-suspicious-parts-with-english-offsets"
kaz_source_docs_folder = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised-partitioned-suspicious-and-source-files/su1so1-plagiarised-kazakh-source-parts-with-english-offsets"
output_base_folder = "/content/drive/MyDrive/updating-pan-xml-files/su1so1-plagiarised-english-and-kazakh-suspicious-and-source-matched-texts-2"

# Create output directory if it doesn't exist
os.makedirs(output_base_folder, exist_ok=True)

# Define worker tasks
def create_folder_and_return_paths(folder_index, document_ref, this_offset, this_length, source_reference, source_offset, source_length):
    """Create output folder and prepare file paths for processing."""
    try:
        this_end_offset = this_offset + this_length
        source_end_offset = source_offset + source_length

        output_folder = os.path.join(output_base_folder, f"{document_ref.replace('.txt', '')}-{folder_index}")
        os.makedirs(output_folder, exist_ok=True)

        file_paths = {
            "output_folder": output_folder,
            "eng_suspicious": os.path.join(eng_suspicious_docs_folder, f"{document_ref.replace('.txt', '')}-{this_offset}-{this_end_offset}.txt"),
            "kaz_suspicious": os.path.join(kaz_suspicious_docs_folder, f"kz-{document_ref.replace('.txt', '')}-{this_offset}-{this_end_offset}.txt"),
            "eng_source": os.path.join(eng_source_docs_folder, f"{source_reference.replace('.txt', '')}-{source_offset}-{source_end_offset}.txt"),
            "kaz_source": os.path.join(kaz_source_docs_folder, f"kz-{source_reference.replace('.txt', '')}-{source_offset}-{source_end_offset}.txt")
        }

        return file_paths
    except Exception as e:
        print(f"Error creating folder and paths: {e}")
        return None


def copy_file(source_path, destination_folder):
    """Copy a file to the destination folder."""
    try:
        if os.path.isfile(source_path):
            shutil.copy(source_path, destination_folder)
            return True
        else:
            return False
    except Exception as e:
        print(f"Error copying file {source_path}: {e}")
        return False

# Main processing loop
xml_files = [f for f in os.listdir(suspicious_xml_folder) if f.endswith('.xml')]
print(f"Total XML files to process: {len(xml_files)}")

folder_index = 1

with ThreadPoolExecutor(max_workers=5) as executor:
    for xml_filename in xml_files:
        try:
            xml_path = os.path.join(suspicious_xml_folder, xml_filename)
            tree = ET.parse(xml_path)
            root = tree.getroot()

            document_ref = root.attrib.get('reference', '').replace('.xml', '.txt')

            for feature in root.findall(".//feature[@name='plagiarism']"):
                try:
                    this_offset = int(feature.attrib.get("this_offset", -1))
                    this_length = int(feature.attrib.get("this_length", 0))
                    source_offset = int(feature.attrib.get("source_offset", -1))
                    source_length = int(feature.attrib.get("source_length", 0))
                    source_reference = feature.attrib.get("source_reference", "").strip()
                except ValueError:
                    print(f"Invalid feature attributes in {xml_filename}. Skipping.")
                    continue

                if not source_reference or this_length <= 0 or source_length <= 0 or this_offset < 0 or source_offset < 0:
                    print(f"Skipping invalid feature in {xml_filename}")
                    continue

                file_paths = create_folder_and_return_paths(folder_index, document_ref, this_offset, this_length, source_reference, source_offset, source_length)

                if file_paths:
                    futures = []
                    futures.append(executor.submit(copy_file, file_paths["eng_suspicious"], file_paths["output_folder"]))
                    futures.append(executor.submit(copy_file, file_paths["kaz_suspicious"], file_paths["output_folder"]))
                    futures.append(executor.submit(copy_file, file_paths["eng_source"], file_paths["output_folder"]))
                    futures.append(executor.submit(copy_file, file_paths["kaz_source"], file_paths["output_folder"]))

                    # Wait for all futures to complete
                    for future in as_completed(futures):
                        if not future.result():
                            print(f"File missing or failed to copy for folder {file_paths['output_folder']}")

                    print(f"Completed folder: {file_paths['output_folder']}")  # Added print statement to confirm folder completion

                folder_index += 1

        except Exception as e:
            print(f"Error processing file {xml_filename}: {e}")

print("Processing completed.")


We then saved all plagiarised text pairs in CSV format.

In [None]:
import os
import pandas as pd

# Define the base directory
base_dir = "/content/drive/MyDrive/updating-pan-xml-files/su1so1-plagiarised-english-and-kazakh-suspicious-and-source-matched-texts-1"

# Initialize a list to store data
data = []

# Iterate over subfolders
for subfolder in os.listdir(base_dir):
    subfolder_path = os.path.join(base_dir, subfolder)

    if os.path.isdir(subfolder_path):
        # Extract the core suspicious document identifier
        subfolder_name_parts = subfolder.split("-")
        if len(subfolder_name_parts) >= 2:
            suspicious_doc_id = subfolder_name_parts[0]  # Yields only "suspicious"; "-".join(subfolder_name_parts[:2]) would give "suspicious-documentXXXXX"
        else:
            continue  # Skip if format is unexpected

        # Initialize variables for file content and names
        eng_source_text = eng_suspicious_text = kz_source_text = kz_suspicious_text = ""
        eng_suspicious_filename = eng_source_filename = ""

        # Iterate over files in the subfolder
        for file_name in os.listdir(subfolder_path):
            file_path = os.path.join(subfolder_path, file_name)
            if os.path.isfile(file_path):
                try:
                    with open(file_path, "r", encoding="utf-8") as f:
                        text = f.read().strip()

                    if file_name.startswith("source-document"):
                        eng_source_text = text
                        eng_source_filename = file_name
                    elif file_name.startswith("suspicious-document"):
                        eng_suspicious_text = text
                        eng_suspicious_filename = file_name
                    elif file_name.startswith("kz-source-document"):
                        kz_source_text = text
                    elif file_name.startswith("kz-suspicious-document"):
                        kz_suspicious_text = text
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")

        # Append the extracted data to the list
        data.append([
            eng_source_text, eng_suspicious_text, kz_source_text, kz_suspicious_text,
            1, eng_suspicious_filename, eng_source_filename, suspicious_doc_id
        ])

# Create a DataFrame
df = pd.DataFrame(data, columns=[
    "english_source_text", "english_suspicious_text", "kazakh_source_text",
    "kazakh_suspicious_text", "label", "english_suspicious_filename",
    "english_source_filename", "subfolder_name"
])

# Define output CSV file path
output_csv_path = "/content/drive/MyDrive/updating-pan-xml-files/su1so1_matched_texts.csv"

# Save DataFrame to CSV
df.to_csv(output_csv_path, index=False, encoding="utf-8")

# Display the first rows of the DataFrame
print(df.head())


In [None]:
import os
import pandas as pd

# Define the base directory
base_dir = "/content/drive/MyDrive/updating-pan-xml-files/su1so1-plagiarised-english-and-kazakh-suspicious-and-source-matched-texts-1"
output_csv = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised_texts_dataset.csv"

# Initialize a list to store data
data = []

# Iterate over subfolders
for subfolder in os.listdir(base_dir):
    subfolder_path = os.path.join(base_dir, subfolder)

    if os.path.isdir(subfolder_path):  # Check if it's a directory
        # Identify file paths
        eng_source_file = None
        eng_suspicious_file = None
        kz_source_file = None
        kz_suspicious_file = None

        for filename in os.listdir(subfolder_path):
            file_path = os.path.join(subfolder_path, filename)

            if filename.startswith("source-document") and not filename.startswith("kz-"):
                eng_source_file = file_path
            elif filename.startswith("suspicious-document") and not filename.startswith("kz-"):
                eng_suspicious_file = file_path
            elif filename.startswith("kz-source-document"):
                kz_source_file = file_path
            elif filename.startswith("kz-suspicious-document"):
                kz_suspicious_file = file_path

        # Ensure all required files exist
        if eng_source_file and eng_suspicious_file and kz_source_file and kz_suspicious_file:
            try:
                with open(eng_source_file, "r", encoding="utf-8") as f:
                    eng_source_text = f.read().strip()

                with open(eng_suspicious_file, "r", encoding="utf-8") as f:
                    eng_suspicious_text = f.read().strip()

                with open(kz_source_file, "r", encoding="utf-8") as f:
                    kz_source_text = f.read().strip()

                with open(kz_suspicious_file, "r", encoding="utf-8") as f:
                    kz_suspicious_text = f.read().strip()

                # Extract file names and subfolder info
                eng_suspicious_filename = os.path.basename(eng_suspicious_file)
                eng_source_filename = os.path.basename(eng_source_file)
                subfolder_name = subfolder.split("-")[0]  # Yields only "suspicious", not the full "suspicious-documentXXXXX"

                # Append to dataset list
                data.append([
                    eng_source_text, eng_suspicious_text,
                    kz_source_text, kz_suspicious_text,
                    1,  # Label = 1 (Plagiarized)
                    eng_suspicious_filename, eng_source_filename,
                    subfolder_name
                ])
            except Exception as e:
                print(f"Error reading files in {subfolder}: {e}")

# Create a DataFrame and save to CSV
df = pd.DataFrame(data, columns=[
    "english_source_text", "english_suspicious_text",
    "kazakh_source_text", "kazakh_suspicious_text",
    "label", "english_suspicious_filename", "english_source_filename",
    "subfolder_name"
])

df.to_csv(output_csv, index=False, encoding="utf-8")

print(f"✅ Dataset successfully created and saved to {output_csv}")


In [None]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised_texts_dataset.csv"

# Load only the header row (nrows=0) to avoid reading the full dataset
df = pd.read_csv(dataset_path, nrows=0)

# Get the column names
column_names = df.columns.tolist()

# Print the column names
print("Column Names:", column_names)


Column Names: ['english_source_text', 'english_suspicious_text', 'kazakh_source_text', 'kazakh_suspicious_text', 'label', 'english_suspicious_filename', 'english_source_filename', 'subfolder_name']


In [None]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised_texts_dataset.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Get the first three rows
first_three_rows = df.head(3)

# Print each row on a new line
for index, row in first_three_rows.iterrows():
    print(f"Row {index + 1}:\n{row.to_string(index=False)}\n{'-'*80}")


Row 1:
. These provisions, whatever the opinion\nenter...
the These provisions, whatever the opinion ente...
Бұл ережелер, мұндай заңдардың даналығына қатыс...
Бұл ережелер, мұндай жинақтың даналығына қатыст...
                                                 1
            suspicious-document00001-1269-4699.txt
                 source-document07076-422-3872.txt
                                        suspicious
--------------------------------------------------------------------------------
Row 2:
. And being come to her\nHouse, (to find which ...
And being come to House, (to find which she of ...
Оның үйіне келгенде (ол бұрын оған нұсқау берге...
Үйге келгенде (ол оның нұсқау бергенін білу үші...
                                                 1
          suspicious-document08204-21799-22043.txt
                source-document09029-2918-3164.txt
                                        suspicious
--------------------------------------------------------------------------------
Row 3:
.

In [None]:
import pandas as pd

dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised_texts_dataset.csv"
df = pd.read_csv(dataset_path)

# Count label distribution
label_counts = df['label'].value_counts()
print("Label Distribution:\n", label_counts)


Label Distribution:
 label
1    38999
Name: count, dtype: int64


**Human Evaluation Set:**  
In addition, we prepared a set of **2,000 English–Kazakh text pairs** specifically for **native Kazakh speakers** to evaluate the **quality of plagiarism detection** and the **accuracy of Kazakh translations**. This dataset supports human assessment of the alignment between original and translated texts in the context of plagiarism:

In [None]:
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

# Define the main folder and the destination folder
main_folder = "/content/drive/MyDrive/updating-pan-xml-files/su1so1-plagiarised-english-and-kazakh-suspicious-and-source-matched-texts-1"
destination_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/kz-eng-1500-similar-files"

# Create the destination folder if it doesn't exist
os.makedirs(destination_folder, exist_ok=True)

# Function to calculate the size of a folder
def get_folder_size(folder_path):
    try:
        return sum(
            os.path.getsize(os.path.join(dirpath, filename))
            for dirpath, _, filenames in os.walk(folder_path)
            for filename in filenames
        )
    except Exception as e:
        print(f"Error calculating size for {folder_path}: {e}")
        return 0

# Function to copy a folder
def copy_folder(src, dest):
    try:
        shutil.copytree(src, dest)
        print(f"Copied subfolder: {os.path.basename(src)}")
    except Exception as e:
        print(f"Error copying {src} to {dest}: {e}")

# List all subfolders in the main folder
subfolders = [
    os.path.join(main_folder, subfolder)
    for subfolder in os.listdir(main_folder)
    if os.path.isdir(os.path.join(main_folder, subfolder))
]

# Calculate folder sizes in parallel
print("Calculating folder sizes...")
with ThreadPoolExecutor(max_workers=8) as executor:
    subfolder_sizes = list(executor.map(get_folder_size, subfolders))

# Pair subfolders with their sizes
subfolder_sizes = list(zip(subfolders, subfolder_sizes))

# Sort subfolders by size in ascending order
print("Sorting subfolders by size...")
subfolder_sizes.sort(key=lambda x: x[1])

# Select the 1500 smallest subfolders
least_sized_subfolders = subfolder_sizes[:1500]

# Copy the selected subfolders in parallel
print("Copying subfolders...")
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [
        executor.submit(
            copy_folder,
            subfolder,
            os.path.join(destination_folder, os.path.basename(subfolder))
        )
        for subfolder, _ in least_sized_subfolders
    ]
    for future in futures:
        future.result()

print("Process completed. First 1500 least sized subfolders have been copied.")


**Human Evaluation Setup:**  
We divided the **2,000 English–Kazakh text pairs** among native Kazakh speakers in such a way that **each evaluator received an equal number of plagiarised and non-plagiarised text pairs**, along with their **Kazakh translations**. The dataset was also **shuffled** to prevent evaluators from predicting the labels based on the order or position of the text pairs, ensuring a fair and unbiased evaluation process:

In [None]:
import os
import shutil

# Define directories
source_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/kz-eng-1500-similar-files"
destination_base = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/sorted-by-obfuscation"

# Create destination subfolders
obfuscation_folders = {
    "high": os.path.join(destination_base, "obfuscation-high"),
    "low": os.path.join(destination_base, "obfuscation-low"),
    "none": os.path.join(destination_base, "obfuscation-none"),
    "absent": os.path.join(destination_base, "obfuscation-absent")
}

for folder in obfuscation_folders.values():
    os.makedirs(folder, exist_ok=True)

# Counters
obfuscation_counts = {
    "high": 0,
    "low": 0,
    "none": 0,
    "absent": 0
}

# Process each subfolder
subfolders = [
    os.path.join(source_folder, subfolder)
    for subfolder in os.listdir(source_folder)
    if os.path.isdir(os.path.join(source_folder, subfolder))
]

for subfolder in subfolders:
    try:
        plagiarism_info_path = os.path.join(subfolder, "plagiarism_info.txt")

        if not os.path.isfile(plagiarism_info_path):
            print(f"plagiarism_info.txt not found in {subfolder}. Skipping.")
            continue

        # Read the file
        with open(plagiarism_info_path, "r", encoding="utf-8") as f:
            content = f.read()

        # Determine obfuscation level
        if "obfuscation: high" in content:
            category = "high"
        elif "obfuscation: low" in content:
            category = "low"
        elif "obfuscation: none" in content:
            category = "none"
        else:
            category = "absent"

        # Copy the subfolder to the appropriate destination
        destination_subfolder = os.path.join(obfuscation_folders[category], os.path.basename(subfolder))
        shutil.copytree(subfolder, destination_subfolder)

        # Increment counter
        obfuscation_counts[category] += 1
        print(f"Copied {subfolder} to {destination_subfolder} (category: {category})")

    except Exception as e:
        print(f"Error processing {subfolder}: {e}")

# Print summary
print("\nSummary:")
print(f"Obfuscation High: {obfuscation_counts['high']}")
print(f"Obfuscation Low: {obfuscation_counts['low']}")
print(f"Obfuscation None: {obfuscation_counts['none']}")
print(f"Obfuscation Absent: {obfuscation_counts['absent']}")
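
The substring checks above match `obfuscation: high` anywhere in the file. A stricter variant, sketched below, matches only a line that starts with the field name. The helper name `parse_obfuscation` and the sample file content are hypothetical; the real layout of `plagiarism_info.txt` may differ.

```python
import re

# Hypothetical helper: matches a line such as "obfuscation: high" (assumed layout).
_OBFUSCATION_RE = re.compile(
    r"^\s*obfuscation\s*:\s*(high|low|none)\b",
    re.IGNORECASE | re.MULTILINE,
)

def parse_obfuscation(content: str) -> str:
    """Return 'high', 'low' or 'none', or 'absent' when no obfuscation line exists."""
    match = _OBFUSCATION_RE.search(content)
    return match.group(1).lower() if match else "absent"

# Made-up file content, used only to illustrate the call.
sample = "source_reference: source-document00001.txt\nobfuscation: low\n"
print(parse_obfuscation(sample))
```

Anchoring the pattern to the start of a line avoids a false match when the phrase appears inside another field's value.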


In [None]:
import os
import shutil

# Define source folders
obfuscation_high_folder = "/content/drive/MyDrive/obfuscation-high"
obfuscation_low_folder = "/content/drive/MyDrive/obfuscation-low"
obfuscation_none_folder = "/content/drive/MyDrive/obfuscation-none"
obfuscation_absent_folder = "/content/drive/MyDrive/obfuscation-absent"
not_similar_folder = "/content/drive/MyDrive/not-similar"

# Define destination base folder
output_base_folder = "/content/drive/MyDrive/distributed-folders"
os.makedirs(output_base_folder, exist_ok=True)

# Create 10 output folders
output_folders = [os.path.join(output_base_folder, f"folder-{i+1}") for i in range(10)]
for folder in output_folders:
    os.makedirs(folder, exist_ok=True)

# Collect subfolders from each source folder
subfolders_high = [os.path.join(obfuscation_high_folder, f) for f in os.listdir(obfuscation_high_folder) if os.path.isdir(os.path.join(obfuscation_high_folder, f))]
subfolders_low = [os.path.join(obfuscation_low_folder, f) for f in os.listdir(obfuscation_low_folder) if os.path.isdir(os.path.join(obfuscation_low_folder, f))]
subfolders_none = [os.path.join(obfuscation_none_folder, f) for f in os.listdir(obfuscation_none_folder) if os.path.isdir(os.path.join(obfuscation_none_folder, f))]
subfolders_absent = [os.path.join(obfuscation_absent_folder, f) for f in os.listdir(obfuscation_absent_folder) if os.path.isdir(os.path.join(obfuscation_absent_folder, f))]
subfolders_not_similar = [os.path.join(not_similar_folder, f) for f in os.listdir(not_similar_folder) if os.path.isdir(os.path.join(not_similar_folder, f))]

# Combine all subfolders in the specified order
all_subfolders = []
while any([subfolders_high, subfolders_low, subfolders_none, subfolders_absent, subfolders_not_similar]):
    if subfolders_high:
        all_subfolders.append(subfolders_high.pop(0))
    if subfolders_low:
        all_subfolders.append(subfolders_low.pop(0))
    if subfolders_none:
        all_subfolders.append(subfolders_none.pop(0))
    if subfolders_absent:
        all_subfolders.append(subfolders_absent.pop(0))
    if subfolders_not_similar:
        all_subfolders.append(subfolders_not_similar.pop(0))

# Distribute subfolders equally into 10 folders
folder_index = 0
for subfolder in all_subfolders:
    destination_folder = output_folders[folder_index]
    destination_subfolder = os.path.join(destination_folder, os.path.basename(subfolder))
    shutil.copytree(subfolder, destination_subfolder)
    print(f"Copied {subfolder} to {destination_folder}")
    folder_index = (folder_index + 1) % 10

print("Distribution completed.")
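
The interleave-then-distribute logic above can be checked on plain lists, without touching the file system. This is a minimal sketch; `interleave` and `assign_round_robin` are illustrative names, not part of the original notebook.

```python
from itertools import chain, zip_longest

_MISSING = object()  # sentinel that marks an exhausted group

def interleave(*groups):
    """Take one item from each group in turn until every group is empty."""
    mixed = chain.from_iterable(zip_longest(*groups, fillvalue=_MISSING))
    return [item for item in mixed if item is not _MISSING]

def assign_round_robin(items, num_bins):
    """Place item k into bin k % num_bins, as the copy loop does with folders."""
    bins = [[] for _ in range(num_bins)]
    for position, item in enumerate(items):
        bins[position % num_bins].append(item)
    return bins

order = interleave(["h1", "h2"], ["l1"], ["n1", "n2"])
bins = assign_round_robin(order, 2)
print(order)
print(bins)
```

Because the cyclic assignment follows the interleaved order, each bin receives a mix of categories only when the number of bins and the number of categories do not divide each other evenly; with 5 categories and 10 folders, each folder sees a single category per pass, which is worth verifying on the real counts.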


In [None]:
import os
import shutil

# Define source folders
obfuscation_high_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-high-365"
obfuscation_low_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-low-365"
obfuscation_none_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-none-365"
obfuscation_absent_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-absent-365"
not_similar_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/not-similar-365"

# Define destination base folder
output_base_folder = "/content/drive/MyDrive/experimental-models/1820-tasks-to-10-people"
os.makedirs(output_base_folder, exist_ok=True)

# Create 10 output folders
output_folders = [os.path.join(output_base_folder, f"folder-{i+1}") for i in range(10)]
for folder in output_folders:
    os.makedirs(folder, exist_ok=True)

# Create example folder for the last 5 subfolders
example_folder = os.path.join(output_base_folder, "example")
os.makedirs(example_folder, exist_ok=True)

# Log file path
log_file_path = os.path.join(output_base_folder, "shared_subfolders_log.txt")

# Initialize log file
with open(log_file_path, "w", encoding="utf-8") as log_file:
    log_file.write("Shared Subfolders Log\n")
    log_file.write("=" * 50 + "\n")

# Collect subfolders from each source folder
subfolders_high = [os.path.join(obfuscation_high_folder, f) for f in os.listdir(obfuscation_high_folder) if os.path.isdir(os.path.join(obfuscation_high_folder, f))]
subfolders_low = [os.path.join(obfuscation_low_folder, f) for f in os.listdir(obfuscation_low_folder) if os.path.isdir(os.path.join(obfuscation_low_folder, f))]
subfolders_none = [os.path.join(obfuscation_none_folder, f) for f in os.listdir(obfuscation_none_folder) if os.path.isdir(os.path.join(obfuscation_none_folder, f))]
subfolders_absent = [os.path.join(obfuscation_absent_folder, f) for f in os.listdir(obfuscation_absent_folder) if os.path.isdir(os.path.join(obfuscation_absent_folder, f))]
subfolders_not_similar = [os.path.join(not_similar_folder, f) for f in os.listdir(not_similar_folder) if os.path.isdir(os.path.join(not_similar_folder, f))]

# Set to track used subfolders
used_subfolders = set()

# Combine all subfolders in the specified order
all_subfolders = []
while any([subfolders_high, subfolders_low, subfolders_none, subfolders_absent, subfolders_not_similar]):
    if subfolders_high:
        subfolder = subfolders_high.pop(0)
        if subfolder not in used_subfolders:
            all_subfolders.append(subfolder)
            used_subfolders.add(subfolder)
    if subfolders_low:
        subfolder = subfolders_low.pop(0)
        if subfolder not in used_subfolders:
            all_subfolders.append(subfolder)
            used_subfolders.add(subfolder)
    if subfolders_none:
        subfolder = subfolders_none.pop(0)
        if subfolder not in used_subfolders:
            all_subfolders.append(subfolder)
            used_subfolders.add(subfolder)
    if subfolders_absent:
        subfolder = subfolders_absent.pop(0)
        if subfolder not in used_subfolders:
            all_subfolders.append(subfolder)
            used_subfolders.add(subfolder)
    if subfolders_not_similar:
        subfolder = subfolders_not_similar.pop(0)
        if subfolder not in used_subfolders:
            all_subfolders.append(subfolder)
            used_subfolders.add(subfolder)

# Distribute first 1800 subfolders equally into 10 folders
print("Distributing the first 1800 subfolders...")
folder_index = 0
for subfolder in all_subfolders[:1800]:
    destination_folder = output_folders[folder_index]
    destination_subfolder = os.path.join(destination_folder, os.path.basename(subfolder))
    shutil.copytree(subfolder, destination_subfolder)
    with open(log_file_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"Copied to {destination_folder}: {subfolder}\n")
    print(f"Copied {subfolder} to {destination_folder}")
    folder_index = (folder_index + 1) % 10

# Prepare remaining 25 subfolders
remaining_subfolders = all_subfolders[1800:]

# Separate 20 shared subfolders
shared_subfolders = remaining_subfolders[:20]

# Add 20 shared subfolders to all 10 folders
print("Adding 20 shared subfolders to each folder...")
for subfolder in shared_subfolders:
    for destination_folder in output_folders:
        destination_subfolder = os.path.join(destination_folder, os.path.basename(subfolder))
        shutil.copytree(subfolder, destination_subfolder)
        with open(log_file_path, "a", encoding="utf-8") as log_file:
            log_file.write(f"Shared with {destination_folder}: {subfolder}\n")
        print(f"Shared {subfolder} with {destination_folder}")

# Add the last 5 subfolders to the "example" folder
print("Saving the last 5 subfolders to the example folder...")
for subfolder in remaining_subfolders[20:]:
    destination_subfolder = os.path.join(example_folder, os.path.basename(subfolder))
    shutil.copytree(subfolder, destination_subfolder)
    with open(log_file_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"Saved to example folder: {subfolder}\n")
    print(f"Saved {subfolder} to example folder")

print("Final distribution completed.")


In [None]:
import os
import shutil
from math import floor

# Define source folders
obfuscation_high_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-high-365"
obfuscation_low_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-low-365"
obfuscation_none_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-none-365"
obfuscation_absent_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/obfuscation-absent-365"
not_similar_folder = "/content/drive/MyDrive/experimental-models/kz-eng-1500-matched-files/3000-files-with-obfuscation/not-similar-365"

# Define destination base folder
output_base_folder = "/content/drive/MyDrive/experimental-models/distribution-200-tasks-to-10-people"
os.makedirs(output_base_folder, exist_ok=True)

# Create 10 output folders
output_folders = [os.path.join(output_base_folder, f"folder-{i+1}") for i in range(10)]
for folder in output_folders:
    os.makedirs(folder, exist_ok=True)

# Create folders for shared subfolders and example
shared_folder = os.path.join(output_base_folder, "20-shared-matched-folders")
os.makedirs(shared_folder, exist_ok=True)
example_folder = os.path.join(output_base_folder, "example")
os.makedirs(example_folder, exist_ok=True)

# Log file path
log_file_path = os.path.join(output_base_folder, "shared_subfolders_log.txt")

# Initialize log file
with open(log_file_path, "w", encoding="utf-8") as log_file:
    log_file.write("Shared Subfolders Log\n")
    log_file.write("=" * 50 + "\n")

# Collect subfolders from each source folder
subfolders_high = [os.path.join(obfuscation_high_folder, f) for f in os.listdir(obfuscation_high_folder) if os.path.isdir(os.path.join(obfuscation_high_folder, f))]
subfolders_low = [os.path.join(obfuscation_low_folder, f) for f in os.listdir(obfuscation_low_folder) if os.path.isdir(os.path.join(obfuscation_low_folder, f))]
subfolders_none = [os.path.join(obfuscation_none_folder, f) for f in os.listdir(obfuscation_none_folder) if os.path.isdir(os.path.join(obfuscation_none_folder, f))]
subfolders_absent = [os.path.join(obfuscation_absent_folder, f) for f in os.listdir(obfuscation_absent_folder) if os.path.isdir(os.path.join(obfuscation_absent_folder, f))]
subfolders_not_similar = [os.path.join(not_similar_folder, f) for f in os.listdir(not_similar_folder) if os.path.isdir(os.path.join(not_similar_folder, f))]

# Ensure equal distribution
num_folders = 10
subfolders_per_type = floor(180 / 5)  # 36 subfolders of each type per folder

# Copy subfolders of one type round-robin so that every output folder
# receives `subfolders_per_type` of them (36 x 10 = 360 per type)
def distribute_subfolders(subfolders, folder_index):
    distributed = []
    for _ in range(subfolders_per_type * num_folders):
        if subfolders:
            subfolder = subfolders.pop(0)
            destination_folder = output_folders[folder_index % num_folders]
            destination_subfolder = os.path.join(destination_folder, os.path.basename(subfolder))
            shutil.copytree(subfolder, destination_subfolder)
            distributed.append(subfolder)
            folder_index += 1
    return folder_index, distributed

# Distribute 1800 unique subfolders (5 types x 360)
distributed_subfolders = []
folder_index = 0
for subfolder_group in [subfolders_high, subfolders_low, subfolders_none, subfolders_absent, subfolders_not_similar]:
    folder_index, distributed = distribute_subfolders(subfolder_group, folder_index)
    distributed_subfolders.extend(distributed)

# Distribute 20 shared subfolders equally
shared_subfolders = []
for _ in range(4):
    for subfolder_group in [subfolders_high, subfolders_low, subfolders_none, subfolders_absent, subfolders_not_similar]:
        if subfolder_group:
            subfolder = subfolder_group.pop(0)
            shared_subfolders.append(subfolder)
            shared_destination = os.path.join(shared_folder, os.path.basename(subfolder))
            shutil.copytree(subfolder, shared_destination)

# Copy shared subfolders to all 10 folders
for subfolder in shared_subfolders:
    for destination_folder in output_folders:
        destination_subfolder = os.path.join(destination_folder, os.path.basename(subfolder))
        shutil.copytree(subfolder, destination_subfolder)
        with open(log_file_path, "a", encoding="utf-8") as log_file:
            log_file.write(f"Shared with {destination_folder}: {subfolder}\n")

# Save last 5 subfolders to the "example" folder
last_subfolders = subfolders_high + subfolders_low + subfolders_none + subfolders_absent + subfolders_not_similar
for subfolder in last_subfolders[:5]:
    destination_subfolder = os.path.join(example_folder, os.path.basename(subfolder))
    shutil.copytree(subfolder, destination_subfolder)
    with open(log_file_path, "a", encoding="utf-8") as log_file:
        log_file.write(f"Saved to example folder: {subfolder}\n")

print("Final distribution completed.")
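
The counts used in this cell can be checked with a few lines of arithmetic. The sketch below assumes 365 subfolders per category (as the `-365` folder names suggest) and 5 categories; `plan_allocation` is a hypothetical helper.

```python
def plan_allocation(per_type=365, num_types=5, num_folders=10,
                    unique_per_folder=180, shared_total=20):
    """Split each category into unique, shared and example subfolders."""
    unique_per_type_per_folder = unique_per_folder // num_types
    unique_per_type = unique_per_type_per_folder * num_folders
    shared_per_type = shared_total // num_types
    leftover_per_type = per_type - unique_per_type - shared_per_type
    if leftover_per_type < 0:
        raise ValueError("not enough subfolders per category for this plan")
    return {
        "unique_per_type_per_folder": unique_per_type_per_folder,
        "unique_total": unique_per_type * num_types,
        "shared_per_type": shared_per_type,
        "example_total": leftover_per_type * num_types,
        "tasks_per_folder": unique_per_folder + shared_total,
    }

plan = plan_allocation()
print(plan)
```

Under these assumptions every evaluator receives 180 unique plus 20 shared pairs, and one subfolder per category is left over for the example folder.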


Shuffling the prepared evaluation texts for the Kazakh-speaking evaluators

In [None]:
import os
import random

# Define folders
folder1_path = "/content/drive/MyDrive/experimental-models/200-tasks-for-10/folder-1"
matched_folders_path = "/content/drive/MyDrive/experimental-models/200-tasks-for-10/20-matched-folders"

# Define the obfuscation groups
primary_group = ["obfuscation-high", "obfuscation-low"]
secondary_group = ["obfuscation-none", "obfuscation-absent"]
other_group = ["not-similar"]

# Helper function to determine obfuscation type from plagiarism_info.txt
def get_obfuscation_type(subfolder_path):
    info_file_path = os.path.join(subfolder_path, "plagiarism_info.txt")
    if os.path.isfile(info_file_path):
        try:
            with open(info_file_path, "r", encoding="utf-8") as f:
                lines = f.readlines()
            obfuscation_line = next((line for line in lines if line.startswith("obfuscation:")), None)
            if obfuscation_line:
                if "high" in obfuscation_line:
                    return "obfuscation-high"
                elif "low" in obfuscation_line:
                    return "obfuscation-low"
                elif "none" in obfuscation_line:
                    return "obfuscation-none"
            else:
                return "obfuscation-absent"
        except Exception as e:
            print(f"Error reading {info_file_path}: {e}")
    return "not-similar"

# Collect subfolder names
folder1_subfolders = [f for f in os.listdir(folder1_path) if os.path.isdir(os.path.join(folder1_path, f))]
matched_subfolders = [f for f in os.listdir(matched_folders_path) if os.path.isdir(os.path.join(matched_folders_path, f))]
matched_set = set(matched_subfolders)

# Classify subfolders by obfuscation type
classified_subfolders = {
    "obfuscation-high": [],
    "obfuscation-low": [],
    "obfuscation-none": [],
    "obfuscation-absent": [],
    "not-similar": []
}

for subfolder in folder1_subfolders:
    subfolder_path = os.path.join(folder1_path, subfolder)
    obfuscation_type = get_obfuscation_type(subfolder_path)
    classified_subfolders[obfuscation_type].append(subfolder)

# Initialize counters and renaming logic
regular_index = 1
remaining_index = 181
unprocessed_subfolders = []

# Process folder1 subfolders based on the dynamically shuffled sequence
while any(classified_subfolders.values()):
    sequence = [primary_group, secondary_group, other_group]
    random.shuffle(sequence)  # Shuffle the sequence dynamically for each iteration

    for group in sequence:
        random.shuffle(group)  # Shuffle the group order dynamically
        for obfuscation_type in group:
            if classified_subfolders[obfuscation_type]:
                subfolder = classified_subfolders[obfuscation_type].pop(0)
                if subfolder in matched_set:
                    unprocessed_subfolders.append(subfolder)
                else:
                    subfolder_path = os.path.join(folder1_path, subfolder)
                    new_name = f"{regular_index}-{subfolder}"
                    new_path = os.path.join(folder1_path, new_name)
                    os.rename(subfolder_path, new_path)
                    print(f"Renamed: {subfolder} -> {new_name}")
                    regular_index += 1

# Shuffle the remaining matched subfolders for each iteration
while unprocessed_subfolders:
    random.shuffle(unprocessed_subfolders)
    for subfolder in unprocessed_subfolders[:]:
        subfolder_path = os.path.join(folder1_path, subfolder)
        new_name = f"{remaining_index}-{subfolder}"
        new_path = os.path.join(folder1_path, new_name)
        os.rename(subfolder_path, new_path)
        print(f"Renamed: {subfolder} -> {new_name}")
        remaining_index += 1
        unprocessed_subfolders.remove(subfolder)

print("Processing completed.")
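
The renaming above depends on the state of the global random generator, so a rerun produces a different order. A simplified, reproducible variant is sketched below: it plans the renames with a seeded generator and returns them for inspection before anything is renamed on disk. It does not reproduce the group-wise ordering of the cell above; `plan_renames` and the seed value are assumptions for illustration.

```python
import random

def plan_renames(names, start_index=1, seed=None):
    """Return (old_name, new_name) pairs in a shuffled, index-prefixed order."""
    rng = random.Random(seed)  # local generator, independent of global state
    shuffled = list(names)
    rng.shuffle(shuffled)
    return [(name, f"{index}-{name}")
            for index, name in enumerate(shuffled, start=start_index)]

# Made-up subfolder names, used only to illustrate the call.
names = ["pair-a", "pair-b", "pair-c", "pair-d"]
plan = plan_renames(names, start_index=1, seed=42)
for old_name, new_name in plan:
    print(f"{old_name} -> {new_name}")
```

Keeping the plan as data makes it possible to write it to a log first and apply `os.rename` only after the mapping has been reviewed.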


Based on the Kazakh speakers' evaluation of the 2,000 plagiarised and non-plagiarised text pairs, we create a small CSV dataset of 2,000 rows.

In [None]:
import os
import pandas as pd

# Define the base folder path
base_folder = "/content/drive/MyDrive/experimental-models/200-tasks-for-10"
output_csv = "/content/drive/MyDrive/bert_training_dataset_numeric_labels_cleaned.csv"

# Function to clean text
def clean_text(text):
    text = text.strip()  # Remove leading/trailing spaces
    text = text.lstrip(".").lstrip()  # Remove leading dots, then any spaces again
    text = " ".join(text.split())  # Normalize all sequences of spaces to a single space
    return text

# Initialize an empty list to store dataset rows
dataset = []

# Iterate over the 10 subfolders (folder-1 to folder-10)
for folder_name in os.listdir(base_folder):
    folder_path = os.path.join(base_folder, folder_name)

    if os.path.isdir(folder_path):
        # Iterate over the 200 sub-subfolders
        for subfolder_name in os.listdir(folder_path):
            subfolder_path = os.path.join(folder_path, subfolder_name)

            if os.path.isdir(subfolder_path):
                # Check for the existence of plagiarism_info.txt
                info_file_path = os.path.join(subfolder_path, "plagiarism_info.txt")
                label = 1 if os.path.isfile(info_file_path) else 0

                # Locate source and suspicious document files
                source_file = None
                suspicious_file = None

                for file_name in os.listdir(subfolder_path):
                    if file_name.startswith("source-document"):
                        source_file = os.path.join(subfolder_path, file_name)
                    elif file_name.startswith("suspicious-document"):
                        suspicious_file = os.path.join(subfolder_path, file_name)

                # Ensure both source and suspicious files exist
                if source_file and suspicious_file:
                    try:
                        # Read the content of the source and suspicious documents
                        with open(source_file, "r", encoding="utf-8") as src:
                            source_text = clean_text(src.read())
                        with open(suspicious_file, "r", encoding="utf-8") as susp:
                            suspicious_text = clean_text(susp.read())

                        # Append the data to the dataset
                        dataset.append({
                            "source_text": source_text,
                            "suspicious_text": suspicious_text,
                            "label": label
                        })
                    except Exception as e:
                        print(f"Error reading files in {subfolder_path}: {e}")

# Create a DataFrame and remove duplicate rows
df = pd.DataFrame(dataset)
df = df.drop_duplicates()  # Remove duplicate rows

# Save the cleaned dataset to CSV
df.to_csv(output_csv, index=False, encoding="utf-8")

print(f"Cleaned dataset created successfully and saved to {output_csv}")
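
To make the effect of `clean_text` concrete, the sketch below applies the same three steps to a few made-up strings, including one edge case that the function does not fully clean.

```python
def clean_text(text: str) -> str:
    """Same steps as the cell above: trim, drop leading dots, collapse whitespace."""
    text = text.strip()
    text = text.lstrip(".").lstrip()
    return " ".join(text.split())

# Made-up inputs, used only for illustration.
examples = ["  ...Hello   world ", ". Қазақ\n\tтілі", ". . spaced dots"]
for raw in examples:
    print(repr(raw), "->", repr(clean_text(raw)))
```

Leading dots separated by spaces, as in the last example, are only partly removed, because `lstrip(".")` stops at the first space.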


**Model Training on Small Dataset:**  
Using the small dataset of **2,000 English–Kazakh text pairs**, we trained two separate BERT-based models for comparison purposes. Specifically:  
- An **English BERT model** was trained on the English portion of the dataset.  
- A **Kazakh BERT model** was trained on the Kazakh translations of the same pairs.

This setup allowed us to assess the performance of language-specific models on a controlled and aligned dataset.

In [None]:
from datasets import load_dataset
from google.colab import drive
from transformers import TrainingArguments, Trainer, BertTokenizer, BertForSequenceClassification
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Mount Google Drive
googledrive = "/content/drive"
drive.mount(googledrive)

# Load dataset
dataset = load_dataset('csv', data_files='/content/drive/MyDrive/bert_training_dataset_numeric_labels_cleaned3.csv')

# Initialize tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def tokenize_function(examples):
    # Combine source and suspicious texts with a separator
    combined_texts = [s + " [SEP] " + t for s, t in zip(examples['source_text'], examples['suspicious_text'])]
    return tokenizer(combined_texts, padding="max_length", truncation=True)

# Tokenize dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Add labels to the tokenized dataset
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Split the dataset into training and testing sets
tokenized_datasets = tokenized_datasets['train'].train_test_split(test_size=0.2)

# Initialize model for binary classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define compute_metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/bert-training/results',  # Directory in Google Drive
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='/content/drive/MyDrive/bert-training/logs',  # Log directory
    save_strategy='epoch',  # Save the model after every epoch
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()
print("Evaluation Results:", results)

# Save the model and tokenizer
model.save_pretrained('/content/drive/MyDrive/bert-training/saved_model')
tokenizer.save_pretrained('/content/drive/MyDrive/bert-training/saved_model')
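
For reference, the four numbers returned by `compute_metrics` can be computed by hand for the positive class (label = 1). The sketch below is a plain-Python illustration on made-up labels, not a replacement for the scikit-learn calls used above.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels and predictions.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(labels, preds)
print(metrics)
```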


**Generating Negative Samples (Label=0) for the Large Dataset:**  
Our large dataset initially consisted only of **positive examples (label = 1)**—pairs of texts that are similar or plagiarized. To generate **negative examples (label = 0)**, we utilized the **BERT models trained on the small aligned English–Kazakh dataset**.

These trained models were used to evaluate text pairs within the large dataset and identify those that are **dissimilar**. By assigning label = 0 to such pairs, we augmented the dataset with **non-plagiarized examples**, ultimately balancing the dataset for training purposes.

In [None]:
import pandas as pd
import torch
import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import csv

# ✅ Load BERT model & tokenizer
model_path = "/content/drive/MyDrive/bert-training/saved_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval().to("cuda" if torch.cuda.is_available() else "cpu")

# ✅ Load dataset
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/plagiarised_texts_dataset.csv"
df = pd.read_csv(dataset_path)

# ✅ Set to track already used negative samples
used_negatives = set()

# ✅ Output file path
output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_plagiarism_dataset_balanced1.csv"

# ✅ Define CSV headers
headers = ["english_source_text", "english_suspicious_text", "kazakh_source_text", "kazakh_suspicious_text", "label"]

# ✅ Write headers to file (overwrite mode)
with open(output_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(headers)  # Write column headers

# ✅ Text preprocessing function
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Function to classify similarity using BERT
def classify_plagiarism(text1, text2):
    inputs = tokenizer(f"{text1} {tokenizer.sep_token} {text2}",
                       padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    inputs = inputs.to(model.device)  # Follow the model's device instead of assuming CUDA
    with torch.no_grad():
        outputs = model(**inputs)
        prediction = torch.argmax(outputs.logits, dim=-1).item()
    return prediction  # 1 = Plagiarized, 0 = Not Plagiarized

# ✅ Process dataset row by row
for i in range(len(df)):
    # ✅ Process source texts
    source_text_en = clean_text(df.loc[i, "english_source_text"])
    source_text_kz = df.loc[i, "kazakh_source_text"]

    # ✅ Copy **original positive pairs (label = 1)**
    suspicious_text_en = clean_text(df.loc[i, "english_suspicious_text"])
    suspicious_text_kz = df.loc[i, "kazakh_suspicious_text"]

    # ✅ Append positive sample to CSV
    with open(output_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([source_text_en, suspicious_text_en, source_text_kz, suspicious_text_kz, 1])

    # ✅ **Find a single unique negative sample**
    source_length = df.loc[i, "english_source_text_length"]
    found_negative = False

    # ✅ Try different word length differences: 0 → 1 → 2 → 3 → 4 → 5
    for length_diff in range(6):
        # Candidates: other rows whose suspicious text is about as long as this source text
        potential_neg_sample = df.query(
            "(index != @i) & (abs(english_suspicious_text_length - @source_length) <= @length_diff)"
        )

        # ✅ If a candidate is found, classify it and stop widening the search
        if not potential_neg_sample.empty:
            neg_row = potential_neg_sample.iloc[0]  # ✅ Pick the first candidate
            neg_suspicious_text_en = clean_text(neg_row["english_suspicious_text"])
            neg_suspicious_text_kz = neg_row["kazakh_suspicious_text"]

            # **Avoid reusing negative samples**
            pair_key = (source_text_en, neg_suspicious_text_en)
            if pair_key in used_negatives:
                continue  # ✅ Skip if already used

            # ✅ Mark as used
            used_negatives.add(pair_key)

            # ✅ Predict label with BERT
            label = classify_plagiarism(source_text_en, neg_suspicious_text_en)

            # ✅ Only add if BERT says it's NOT plagiarized (label=0)
            if label == 0:
                with open(output_path, "a", newline="", encoding="utf-8") as f:
                    writer = csv.writer(f)
                    writer.writerow([source_text_en, neg_suspicious_text_en, source_text_kz, neg_suspicious_text_kz, 0])
                found_negative = True
            break  # ✅ Stop after the first new candidate has been classified

    # ✅ If no valid negative is found after checking all length differences, skip

print(f"✅ New balanced dataset saved: {output_path}")


Removing duplicates if they exist

In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_plagiarism_dataset_balanced1.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_pan_paragraph_texts_balanced_unique.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function to normalize text for comparison
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text.lower()  # Convert to lowercase for better duplicate detection

# ✅ Apply cleaning to all text columns
df["english_source_text"] = df["english_source_text"].apply(clean_text)
df["english_suspicious_text"] = df["english_suspicious_text"].apply(clean_text)
df["kazakh_source_text"] = df["kazakh_source_text"].apply(clean_text)
df["kazakh_suspicious_text"] = df["kazakh_suspicious_text"].apply(clean_text)

# ✅ Remove duplicate rows
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_pan_paragraph_texts_balanced_unique.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_plagiarised_texts_labelled.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only English texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning **only to English texts**
df["english_source_text"] = df["english_source_text"].apply(clean_text)
df["english_suspicious_text"] = df["english_suspicious_text"].apply(clean_text)

# ✅ Remove duplicates **based only on English source & suspicious text pairs**
df_unique = df.drop_duplicates(subset=["english_source_text", "english_suspicious_text"])

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")
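
One alternative to overwriting the text columns before deduplication is to compare rows on a normalized key while keeping the original text in the output. The sketch below shows this on plain tuples; like the cell above, it ignores the label when deciding whether two pairs are duplicates. The names `normalize` and `dedupe_pairs` are illustrative.

```python
import re

def normalize(text: str) -> str:
    """Comparison key: collapse whitespace, trim, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe_pairs(rows):
    """Keep the first (source, suspicious, label) row for each normalized pair."""
    seen = set()
    unique = []
    for source, suspicious, label in rows:
        key = (normalize(source), normalize(suspicious))
        if key not in seen:
            seen.add(key)
            unique.append((source, suspicious, label))
    return unique

# Made-up rows, used only for illustration.
rows = [
    ("The cat sat.", "A cat was sitting.", 1),
    ("the  cat sat. ", "a cat was sitting.", 1),
    ("The cat sat.", "Stocks fell sharply.", 0),
]
unique_rows = dedupe_pairs(rows)
print(len(rows), "->", len(unique_rows))
```

With this design, casing and spacing differences do not hide duplicates, and the saved text keeps its original capitalization.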


We split the combined English–Kazakh dataset into two separate datasets, one English and one Kazakh.

In [None]:
import pandas as pd

# Load the original dataset
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_kz_pan_unique_paragraph_texts_with_balanced_labels3.csv"
df = pd.read_csv(dataset_path)

# ✅ Create English dataset
english_df = df[['english_source_text', 'english_suspicious_text', 'label']]
english_output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_plagiarism_dataset_60000_paragraphs.csv"
english_df.to_csv(english_output_path, index=False)
print(f"✅ English dataset saved at: {english_output_path}")

# ✅ Create Kazakh dataset
kazakh_df = df[['kazakh_source_text', 'kazakh_suspicious_text', 'label']]
kazakh_output_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs.csv"
kazakh_df.to_csv(kazakh_output_path, index=False)
print(f"✅ Kazakh dataset saved at: {kazakh_output_path}")

In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs_updated.csv"

# Load dataset
df = pd.read_csv(dataset_path)

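# ✅ Rename columns by name so source and suspicious texts keep their roles
# (a positional assignment would label kazakh_source_text as suspicious_text)
df = df.rename(columns={
    "kazakh_source_text": "source_text",
    "kazakh_suspicious_text": "suspicious_text",
})[["suspicious_text", "source_text", "label"]]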

# ✅ Save the updated dataset
df.to_csv(output_path, index=False)

print(f"✅ Column names updated! New dataset saved at: {output_path}")


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs_updated.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs_update1.csv"

# ✅ Load dataset
df = pd.read_csv(dataset_path)

# ✅ Separate `label=1` and `label=0` rows
df_label_1 = df[df["label"] == 1].reset_index(drop=True)
df_label_0 = df[df["label"] == 0].reset_index(drop=True)

# ✅ Interleave rows: (1,0,1,0,1,0,...)
min_length = min(len(df_label_1), len(df_label_0))
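# A stable sort ("mergesort") keeps the concat order for equal index values, giving 1,0,1,0,...
df_interleaved = pd.concat([df_label_1.iloc[:min_length], df_label_0.iloc[:min_length]]).sort_index(kind="mergesort")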

# ✅ Add remaining rows (if any)
remaining_rows = pd.concat([df_label_1.iloc[min_length:], df_label_0.iloc[min_length:]])
df_final = pd.concat([df_interleaved, remaining_rows]).reset_index(drop=True)

# ✅ Save the updated dataset
df_final.to_csv(output_path, index=False)

print(f"✅ Dataset updated! New dataset saved at: {output_path}")
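The interleaving step above relies on a stable sort: after `reset_index`, both label groups share the index values 0, 1, 2, ..., so sorting the concatenated frame by index with a stable algorithm alternates the two groups. A minimal sketch on a toy frame (the `text` values are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": [1, 1, 1, 0, 0],
})

ones = toy[toy["label"] == 1].reset_index(drop=True)
zeros = toy[toy["label"] == 0].reset_index(drop=True)
n = min(len(ones), len(zeros))

# Stable sort keeps the concat order among equal index values
interleaved = pd.concat([ones.iloc[:n], zeros.iloc[:n]]).sort_index(kind="mergesort")
leftover = pd.concat([ones.iloc[n:], zeros.iloc[n:]])
result = pd.concat([interleaved, leftover]).reset_index(drop=True)

print(result["label"].tolist())
```

The surplus rows of the larger class end up at the tail of the file, so a later full shuffle is still needed if label order matters.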


Shuffling the datasets

In [None]:
import pandas as pd

# ✅ Define dataset paths
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs_update1.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_plagiarism_dataset_60000_paragraphs9_fully_shuffled.csv"

# ✅ Load dataset
df = pd.read_csv(dataset_path)

# ✅ Shuffle all rows randomly
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# ✅ Save shuffled dataset
df_shuffled.to_csv(output_path, index=False)

# ✅ Display the first rows of the shuffled dataset
# (ace_tools is not available in Colab, so plain pandas output is used instead)
print(df_shuffled.head())

print(f"✅ Fully shuffled dataset saved at: {output_path}")


In [None]:
import pandas as pd

# ✅ Define dataset paths
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_plagiarism_dataset_60000_paragraphs_update1.csv"
output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_plagiarism_dataset_60000_paragraphs_fully_shuffled.csv"

# ✅ Load dataset
df = pd.read_csv(dataset_path)

# ✅ Shuffle all rows randomly
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)

# ✅ Save shuffled dataset
df_shuffled.to_csv(output_path, index=False)

print(f"✅ Fully shuffled dataset saved at: {output_path}")
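`df.sample(frac=1, random_state=42)` returns every row exactly once in a random order, and the fixed seed makes that order repeatable. A small sketch on a toy frame (the values are invented) that checks both properties:

```python
import pandas as pd

toy = pd.DataFrame({"text": list("abcdef"), "label": [1, 0, 1, 0, 1, 0]})

shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

same_rows = sorted(shuffled_a["text"]) == sorted(toy["text"])  # nothing lost or duplicated
repeatable = shuffled_a.equals(shuffled_b)                     # same seed, same order
print(same_rows, repeatable)
```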


Dataset is now balanced:

In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs1.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    30810
1    30810
Name: count, dtype: int64


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs7.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    30810
0    30529
Name: count, dtype: int64


Checking for duplicates and, if any exist, removing them so that the dataset contains only unique rows:
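The cells below use two deduplication variants: `drop_duplicates()` compares whole rows (both texts and the label), while `drop_duplicates(subset=[...])` compares only the text pair and keeps the first row it sees. A toy sketch of the difference (rows invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "suspicious_text": ["x", "x", "x", "y"],
    "source_text":     ["s", "s", "s", "t"],
    "label":           [1,   1,   0,   0],
})

full_row = toy.drop_duplicates()
pair_only = toy.drop_duplicates(subset=["suspicious_text", "source_text"])
print(len(full_row), len(pair_only))
```

When the same text pair appears with conflicting labels, only the subset variant removes it, which is one way the two variants can produce different label counts.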

In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs1.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique1.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning to both Kazakh text columns
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)


# ✅ Remove duplicates **based only on suspicious and source text pairs**
df_unique = df.drop_duplicates(subset=["suspicious_text", "source_text"])

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique1.csv


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs1.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique3.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning to both Kazakh text columns
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)


# ✅ Remove duplicates
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique3.csv


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs7.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning to both Kazakh text columns
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)

# ✅ Remove duplicates **based only on suspicious and source text pairs**
df_unique = df.drop_duplicates(subset=["suspicious_text", "source_text"])

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique.csv


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60000_paragraphs7.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique2.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning to both Kazakh text columns
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)

# ✅ Remove duplicates
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique2.csv


Checking again whether the dataset is still balanced or not:

In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    30810
0    30529
Name: count, dtype: int64
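Raw counts can be turned into class shares to judge how far a dataset is from a perfect 50/50 split. The sketch below rebuilds a label column from the counts reported above for `kz_pan_dataset_60K_paragraphs_unique.csv` (30,810 and 30,529) rather than reading the file:

```python
import pandas as pd

# Label column rebuilt from the reported counts, not loaded from disk
labels = pd.Series([1] * 30810 + [0] * 30529, name="label")

shares = labels.value_counts(normalize=True)
imbalance = abs(shares.loc[1] - shares.loc[0])
print(shares.round(4).to_dict(), round(imbalance, 4))
```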


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique1.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    30810
1    30810
Name: count, dtype: int64


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique2.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    30810
0    30529
Name: count, dtype: int64


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique3.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    30810
1    30810
Name: count, dtype: int64


Doing the same for the English dataset: checking for duplicates, removing them, and then checking the label balance.

In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60000_paragraphs1.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique1.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning **only to English texts**
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)

# ✅ Remove duplicates
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique1.csv


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique1.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    30824
1    30824
Name: count, dtype: int64


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique1.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique2.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning **only to English texts**
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)


# ✅ Remove duplicates **based only on suspicious and source text pairs**
df_unique = df.drop_duplicates(subset=["suspicious_text", "source_text"])

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique2.csv


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique2.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    30824
1    30824
Name: count, dtype: int64


In [None]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60000_paragraphs7.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique3.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning **only to English texts**
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)

# ✅ Remove duplicates **based only on suspicious and source text pairs**
df_unique = df.drop_duplicates(subset=["suspicious_text", "source_text"])

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")

✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique3.csv


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique3.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    30824
0    30506
Name: count, dtype: int64


## ✅ **Summary-1:**

### 🧩 **Plagiarized Text Extraction and Translation**  
We extracted paragraph-level plagiarism cases from the **PAN corpora** using XML metadata (i.e., offset and length values). The identified **English paragraphs** were then **translated into Kazakh** using machine translation, enabling us to build a **bilingual English–Kazakh dataset** for plagiarism and semantic similarity analysis.

### 👥 **Human Evaluation Set**  
To evaluate the translation quality and plagiarism detection effectiveness, we curated a dataset of **2,000 English–Kazakh text pairs** for **native Kazakh speakers**. Each pair consists of an original English paragraph and its Kazakh translation, labeled as either **plagiarized** or **non-plagiarized**.

### 📝 **Human Evaluation Setup**  
The 2,000 text pairs were **shuffled** and then **evenly distributed** among evaluators. Each Kazakh-speaking participant received a **balanced mix** of plagiarized and non-plagiarized pairs, ensuring a fair and unbiased evaluation without label-position bias.
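
One way to realize such a split is to shuffle each label group separately and deal the pairs out round-robin, so every evaluator gets a near-equal share of both classes. The sketch below is an assumption about how this could be done, not the procedure actually used. The evaluator count (`n_evaluators=10`), the 50/50 label split of the input, and the `assign_balanced` helper are hypothetical.

```python
import random


def assign_balanced(pairs, n_evaluators, seed=42):
    """Deal (pair_id, label) items to evaluators so each gets a similar label mix."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_evaluators)]
    offset = 0
    for label in (0, 1):
        group = [p for p in pairs if p[1] == label]
        rng.shuffle(group)
        for i, item in enumerate(group):
            batches[(i + offset) % n_evaluators].append(item)
        offset += len(group)
    for batch in batches:
        rng.shuffle(batch)  # avoid a fixed label order inside each batch
    return batches


# Hypothetical input: 2,000 pair ids, half labelled 1 and half labelled 0
pairs = [(i, i % 2) for i in range(2000)]
batches = assign_balanced(pairs, n_evaluators=10)
print([len(b) for b in batches])
```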

### 🤖 **Model Training on Small Dataset**  
We trained two separate **BERT-based models** on this manually curated dataset:

- 📘 **English BERT** was trained on English text pairs.  
- 📗 **Kazakh BERT** was trained on their Kazakh translations.

This allowed us to **benchmark** the effectiveness of language-specific models in identifying text similarity and plagiarism across both languages.

### ⚖️ **Generating Negative Samples (Label = 0) for the Large Dataset**  
The original large dataset contained only **positive samples (label = 1)**. We used our trained BERT models to identify **dissimilar pairs**, which were then labeled as **negative examples (label = 0)**. This strategy allowed us to **balance the dataset** and make it suitable for supervised learning.
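
The idea can be sketched as follows: pair each suspicious text with a source text from a different row, score the pair, and keep it as a negative example only if the score falls below a threshold. In this sketch a simple token-overlap (Jaccard) score stands in for the trained BERT model, and the threshold of 0.2, the helper names, and the toy rows are all assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Token-overlap score in [0, 1]; a stand-in for a model similarity score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def make_negatives(positives, score_fn, threshold=0.2, seed=42):
    """Build label-0 pairs by mismatching rows and keeping low-scoring pairs."""
    rng = random.Random(seed)
    negatives = []
    for i, (suspicious, _) in enumerate(positives):
        candidates = [j for j in range(len(positives)) if j != i]
        rng.shuffle(candidates)
        for j in candidates:
            source = positives[j][1]
            if score_fn(suspicious, source) < threshold:
                negatives.append((suspicious, source, 0))
                break  # one negative per positive keeps the classes balanced
    return negatives


# Toy positive pairs (suspicious_text, source_text), invented for illustration
positives = [
    ("the cat sat on the mat", "a cat was sitting on the mat"),
    ("stock prices fell sharply", "share prices dropped steeply"),
    ("rain is expected tomorrow", "tomorrow it will probably rain"),
]
negatives = make_negatives(positives, jaccard)
print(len(negatives))
```

Producing one negative per positive is a design choice in this sketch; it yields equal class sizes before any later deduplication.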

### 📦 **Final Training Datasets**  
We cleaned, deduplicated, balanced, and augmented our data to create two final, ready-to-train datasets:

- ✅ `kz_pan_dataset_60K_paragraphs_unique.csv`  
- ✅ `en_pan_dataset_60K_paragraphs_unique3.csv`

These datasets are well-suited for building robust **text similarity and plagiarism detection models** in both **Kazakh** and **English**.

**Data Augmentation by Swapping Text Pairs**

To enhance the robustness of the model and increase the size of the training data, we applied a simple yet effective data augmentation technique. Specifically, we **swapped the positions of the suspicious and source texts** for each data pair while preserving the original label. For example, the initial dataset consisted of pairs such as:

1. *(Text-A1, Text-A2, Label = 1)*  
2. *(Text-B1, Text-B2, Label = 0)*

After augmentation, the dataset became:

1. *(Text-A1, Text-A2, Label = 1)*  
2. *(Text-A2, Text-A1, Label = 1)*  
3. *(Text-B1, Text-B2, Label = 0)*  
4. *(Text-B2, Text-B1, Label = 0)*

The rationale behind this approach is that if the model is trained to recognize that “hi” is similar to “hello” (Label = 1), it should also learn that “hello” is similar to “hi”. By introducing both directions of the text pair, we help the model generalize better and reinforce the concept of semantic equivalence. This technique effectively **doubles the dataset size** and maintains class balance, thereby supporting better learning of text similarity patterns.
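
A toy sketch of the swap, using invented rows. It also shows one way exact duplicates can appear after swapping, for example when both texts of a pair are identical, which is why a deduplication pass follows the augmentation.

```python
import pandas as pd

toy = pd.DataFrame({
    "suspicious_text": ["A1", "B1", "C"],
    "source_text":     ["A2", "B2", "C"],   # third row has identical texts
    "label":           [1, 0, 1],
})

# Swap the two text columns by name, then restore the original column order
swapped = toy.rename(columns={"suspicious_text": "source_text",
                              "source_text": "suspicious_text"})[list(toy.columns)]
combined = pd.concat([toy, swapped], ignore_index=True)
deduplicated = combined.drop_duplicates()

print(len(toy), len(combined), len(deduplicated))
```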

In [20]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_pan_dataset_60K_paragraphs_unique.csv"

# Load the dataset
df = pd.read_csv(dataset_path)
print("Original dataset shape:", df.shape)

# Create the swapped DataFrame
swapped_df = pd.DataFrame({
    'suspicious_text': df['source_text'],
    'source_text': df['suspicious_text'],
    'label': df['label']
})

# Concatenate the original and swapped DataFrames
combined_df = pd.concat([df, swapped_df], ignore_index=True)
print("Combined dataset shape before duplicate removal:", combined_df.shape)

# Save the new dataset to a CSV file
output_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_paragraphs.csv"
combined_df.to_csv(output_path, index=False)
print("New dataset saved to:", output_path)

Original dataset shape: (61339, 3)
Combined dataset shape before duplicate removal: (122678, 3)
New dataset saved to: /content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_paragraphs.csv


The label counts show that the 120K dataset is still nearly balanced:

In [23]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_paragraphs.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    61620
0    61058
Name: count, dtype: int64


In [21]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_paragraphs.csv"

# Read only the header row (nrows=0), which is enough to get the column names
df = pd.read_csv(dataset_path, nrows=0)

# Get the column names
column_names = df.columns.tolist()

# Print the column names
print("Column Names:", column_names)

Column Names: ['suspicious_text', 'source_text', 'label']


We check for duplicates again; if any exist, the code removes them and saves a new dataset containing only unique rows.

In [22]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_paragraphs.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_ready_for_training.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning to both Kazakh text columns
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)


# ✅ Remove fully duplicated rows (same suspicious text, source text, and label)
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")


✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_ready_for_training.csv


In [24]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_ready_for_training.csv"

# Read only the header row (nrows=0), which is enough to get the column names
df = pd.read_csv(dataset_path, nrows=0)

# Get the column names
column_names = df.columns.tolist()

# Print the column names
print("Column Names:", column_names)

Column Names: ['suspicious_text', 'source_text', 'label']


Counting the rows with label 1 and label 0 to see whether the dataset is still balanced after removing duplicates.

In [25]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_ready_for_training.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    61109
0    61058
Name: count, dtype: int64


We used this code to print the first five and last five rows of the dataset and check whether the rows are shuffled.
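
Looking at a handful of rows only gives a rough impression. As a complementary check, one could measure the longest run of identical consecutive labels: in a well-shuffled, balanced dataset very long runs are unlikely, whereas a file grouped by label shows one long run per class. The `longest_label_run` helper below is a hypothetical illustration and is not part of the original pipeline.

```python
from itertools import groupby


def longest_label_run(labels):
    """Return the length of the longest run of identical consecutive labels."""
    return max((sum(1 for _ in run) for _, run in groupby(labels)), default=0)


grouped = [1] * 6 + [0] * 6   # sorted by label: not shuffled
alternating = [1, 0] * 6      # strictly interleaved
print(longest_label_run(grouped), longest_label_run(alternating))
```

On a real file the helper could be called as `longest_label_run(df["label"].tolist())`.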

In [26]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz_dataset_120K_ready_for_training.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Function to print each row in a structured format
def print_structured_rows(df_subset, title):
    print(f"\n🔹 {title}")
    for index, row in df_subset.iterrows():
        print(f"Row {index + 1}:")
        for col in df_subset.columns:
            print(f"  {col}: {row[col]}")
        print("-" * 50)  # Separator for readability

# Print the first 5 rows
print_structured_rows(df.head(5), "First 5 Rows")

# Print the last 5 rows
print_structured_rows(df.tail(5), "Last 5 Rows")



🔹 First 5 Rows
Row 1:
  suspicious_text: Бірақ, көріп тұрсыңдар, ешкім оның құпиясын жасырмады - бәрі оның назарын дұрыс емес бағыттады және әр жолы оның орындауын қате түсіндірді - сондықтан олар оның ақымақ қателіктерін данышпандық шабыт деп қабылдады; олар адал істеді!
  source_text: Жолбасшымыздың сұр жүзі енді оқуға айналды. оның мұндай төтенше жағдай туралы нұсқауы болмады. бұл сұрақ оның бейшара ақылымен соғысты. бір сәт олар селт еткізіп, жеңілгендерін сезіп, берілмек болды.
  label: 0
--------------------------------------------------
Row 2:
  suspicious_text: Менің, бірақ басқа жаратылыстардың мойынсұнудан бас тартуы қате болды ма - олар талап ететін бірдеңені істеу керек «адамның менен балама күнкөрісі болмауы мүмкін, ал мен істеуім керек және олар мұны қалай талап етеді. олардың қажеттіліктерінен гөрі өз талғамым мен бейімділігіммен кеңесу керек пе?' содан кейін оның ойына келеді, бұл үлкен заң қалай болғанда да, машинадағы әрбір дұрыс және лайықты орынға кіру үшін шынайы 

We did the same for the English dataset below:

In [27]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_pan_dataset_60K_paragraphs_unique3.csv"

# Load the dataset
df = pd.read_csv(dataset_path)
print("Original dataset shape:", df.shape)

# Create the swapped DataFrame
swapped_df = pd.DataFrame({
    'suspicious_text': df['source_text'],
    'source_text': df['suspicious_text'],
    'label': df['label']
})

# Concatenate the original and swapped DataFrames
combined_df = pd.concat([df, swapped_df], ignore_index=True)
print("Combined dataset shape before duplicate removal:", combined_df.shape)

# Save the new dataset to a CSV file
output_path = "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_paragraphs.csv"
combined_df.to_csv(output_path, index=False)
print("New dataset saved to:", output_path)

Original dataset shape: (61330, 3)
Combined dataset shape before duplicate removal: (122660, 3)
New dataset saved to: /content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_paragraphs.csv


Checking the label counts to ensure they are balanced:

In [28]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_paragraphs.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
1    61648
0    61012
Name: count, dtype: int64


Removing duplicates, if any exist, so that the dataset contains only unique rows.

In [29]:
import pandas as pd
import re

# Define dataset paths
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_paragraphs.csv"
output_path =  "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_ready_for_training.csv"

# Load dataset
df = pd.read_csv(dataset_path)

# ✅ Text preprocessing function (clean only texts)
def clean_text(text):
    if pd.isna(text):
        return ""
    text = re.sub(r'^\W+|\d+', '', text)  # Remove leading non-word characters and every run of digits
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

# ✅ Apply cleaning **only to English texts**
df["suspicious_text"] = df["suspicious_text"].apply(clean_text)
df["source_text"] = df["source_text"].apply(clean_text)


# ✅ Remove fully duplicated rows (same suspicious text, source text, and label)
df_unique = df.drop_duplicates()

# ✅ Save new dataset without duplicates
df_unique.to_csv(output_path, index=False)

print(f"✅ Unique dataset saved at: {output_path}")


✅ Unique dataset saved at: /content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_ready_for_training.csv


After removing duplicate rows, we check the balance again. The label-1 count dropped from 61,648 to 58,155 and the label-0 count from 61,012 to 61,010, so the dataset is smaller but still roughly balanced.

In [30]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_ready_for_training.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    61010
1    58155
Name: count, dtype: int64


Here we print the first and last rows to check whether the data are shuffled:

In [31]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en_dataset_120K_ready_for_training.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Function to print each row in a structured format
def print_structured_rows(df_subset, title):
    print(f"\n🔹 {title}")
    for index, row in df_subset.iterrows():
        print(f"Row {index + 1}:")
        for col in df_subset.columns:
            print(f"  {col}: {row[col]}")
        print("-" * 50)  # Separator for readability

# Print the first 5 rows
print_structured_rows(df.head(5), "First 5 Rows")

# Print the last 5 rows
print_structured_rows(df.tail(5), "Last 5 Rows")



🔹 First 5 Rows
Row 1:
  suspicious_text: Relatively often there are pale gray to brownish, strongly succulent appearing colonies, whose more or less broad marginal zone has a jagged edge and transparent. Their surface is smooth, often radiating striations bear.
  source_text: On certain occasions he even went out after wood in the daylight, slithering along on all fours towards his objective, and would be fired at until recalled by one of his own officers.
  label: 0
--------------------------------------------------
Row 2:
  suspicious_text: Relatively often there are pale gray to brownish, strongly succulent appearing colonies, whose more or less broad marginal zone has a jagged edge and transparent. Their surface is smooth, often radiating striations bear.
  source_text: Do let me see at once, HELENE I [Looks eagerly.] Ah, yes; all gone; nothing visible but one smoke-pipe, three stove-pipe hats, four bits of orange-peel, some pea-nut shells, and thirteen copies of the New-York Ledg

**Summary**

We have independently prepared **Kazakh** and **English** datasets tailored for text similarity detection tasks. Both datasets are **labeled**, which is a critical requirement for supervised learning. Additionally, they have been carefully processed to ensure they are **balanced**, **unique**, and **shuffled**. As a result, these datasets are now well-suited for training text similarity models.

- `kz_dataset_120K_ready_for_training.csv`  
- `en_dataset_120K_ready_for_training.csv`




### **Dataset for Testing Purposes**

Next, we prepared separate **English** and **Kazakh** datasets specifically for **testing purposes**. As with the training datasets, we conducted duplication checks to ensure all entries are **unique**. The datasets were also **shuffled** and **balanced** to support reliable and unbiased evaluation of the models.

In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz-3950-sentences-for-testing1.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    1978
1    1977
Name: count, dtype: int64


In [None]:
import pandas as pd

# Define dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en-4500-sentences-for-testing6.csv"

# Load dataset fully
df = pd.read_csv(dataset_path)

# Count occurrences of each label (1 = Plagiarized, 0 = Not Plagiarized)
label_counts = df["label"].value_counts()

# Print results
print(label_counts)

label
0    2268
1    2267
Name: count, dtype: int64


In [None]:
import pandas as pd

# Define the dataset path
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/kz-3950-sentences-for-testing1.csv"

# Read only the header row (nrows=0), which is enough to get the column names
df = pd.read_csv(dataset_path, nrows=0)

# Get the column names
column_names = df.columns.tolist()

# Print the column names
print("Column Names:", column_names)

Column Names: ['source_text', 'suspicious_text', 'label']


In [None]:
import pandas as pd

# Define the dataset path
dataset_path =  "/content/drive/MyDrive/updating-pan-xml-files/en-4500-sentences-for-testing6.csv"

# Read only the header row (nrows=0), which is enough to get the column names
df = pd.read_csv(dataset_path, nrows=0)

# Get the column names
column_names = df.columns.tolist()

# Print the column names
print("Column Names:", column_names)

Column Names: ['source_text', 'suspicious_text', 'label']


In [None]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/kz-3950-sentences-for-testing1.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Function to print each row in a structured format
def print_structured_rows(df_subset, title):
    print(f"\n🔹 {title}")
    for index, row in df_subset.iterrows():
        print(f"Row {index + 1}:")
        for col in df_subset.columns:  # use the argument, not the global df
            print(f"  {col}: {row[col]}")
        print("-" * 50)  # Separator for readability

# Print the first 5 rows
print_structured_rows(df.head(5), "First 5 Rows")

# Print the last 5 rows
print_structured_rows(df.tail(5), "Last 5 Rows")



🔹 First 5 Rows
Row 1:
  source_text: Медитациядан туындаған алдамшы болғандықтан, ол болғандай, ол армандар еліне кірді.
  suspicious_text: Алдағы ой жүгіртуден туындаған галлюцинация түрінде, ол болғандықтан, ол армандар саласына кірді.
  label: 1
--------------------------------------------------
Row 2:
  source_text: Олар үшін олар кәсіпорынды көрсетпейтіндер үшін, кем дегенде, данышпанның аяқ киімін көрсетпейді, бұл дақылдар ешқашан болмайды.
  suspicious_text: Олар кәсіпорындағы кәсіпорынды аз данышпанның аяқ киімін көрсетпейді, олар ешқашан дақылдардан асып кетпейді.
  label: 1
--------------------------------------------------
Row 3:
  source_text: Сібір қарағайы, балқарағай және балқарағай тәрізді бірнеше азиялық түрлер, сондай-ақ солтүстік-шығыста өседі, сонымен қатар солтүстік-шығысқа қарай азиаттық далалар мен шөптер өседі.
  suspicious_text: Екінші жағынан, бірнеше азиялық түрі Сібір қарағайы, балқарағай, балқарағай солтүстік-шығыста еркін өседі, ал азиаттық даладан бірнеш

In [None]:
import pandas as pd

# Define the dataset path
dataset_path = "/content/drive/MyDrive/updating-pan-xml-files/en-4500-sentences-for-testing6.csv"

# Load the dataset
df = pd.read_csv(dataset_path)

# Function to print each row in a structured format
def print_structured_rows(df_subset, title):
    print(f"\n🔹 {title}")
    for index, row in df_subset.iterrows():
        print(f"Row {index + 1}:")
        for col in df_subset.columns:  # use the argument, not the global df
            print(f"  {col}: {row[col]}")
        print("-" * 50)  # Separator for readability

# Print the first 5 rows
print_structured_rows(df.head(5), "First 5 Rows")

# Print the last 5 rows
print_structured_rows(df.tail(5), "Last 5 Rows")



🔹 First 5 Rows
Row 1:
  source_text: This being true it is not necessary to turn needle in to the right firmly but merely far enough to be sure that when turning back to the left, to the notch registering with guide post, that the needle is not more than once around or one turn from its seat.
  suspicious_text: Model "s" marvel carbureter if the engine idles too fast with throttle closed, the latter may be adjusted by means of the throttle lever adjusting screw.
  label: 0
--------------------------------------------------
Row 2:
  source_text: Next come descriptions of regions, cities and architectural marvels and then follow articles on the various manners and customs of rural and town life.
  suspicious_text: After this description of regions, cities and architectural marvels are given then article on the variouss manners and customs of rural and town life is described afterwards.the national arts are treated elaborately and at the end chapter of the latest statistics is given whic

## 🔍 Summary

We created high-quality **English and Kazakh datasets** for **text similarity and plagiarism detection** by:

- **Extracting** plagiarized text pairs from the PAN corpora using XML offset metadata.
- **Translating** English text segments into **Kazakh** to build a multilingual dataset.
- Preparing a **human evaluation set** of 2,000 English–Kazakh pairs, evenly distributed and shuffled, for native Kazakh speakers to assess plagiarism and translation quality.
- **Training BERT-based models** separately for English and Kazakh using the human-curated dataset.
- Using the trained models to **generate negative samples** (label = 0) for the larger dataset, ensuring label balance.
- **Augmenting** the dataset by swapping suspicious and source texts to increase size and improve model generalization.
- Producing final labeled, balanced, deduplicated, and shuffled datasets in both languages, ready for training and testing:

### ✅ Final Training Datasets:
- `kz_dataset_120K_ready_for_training.csv`  
- `en_dataset_120K_ready_for_training.csv`  

### ✅ Final Testing Datasets:
- `kz-3950-sentences-for-testing1.csv`  
- `en-4500-sentences-for-testing6.csv`  

These resources support robust model development in both **Kazakh (low-resource)** and **English**.
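
As an illustration of the swap-based augmentation step listed in the summary, a minimal sketch might look like the following. The function name `augment_by_swapping` and the toy rows are hypothetical, and the sketch assumes that every row is swapped regardless of its label, which may differ from what was actually done for the 120K datasets.

```python
import pandas as pd

def augment_by_swapping(df: pd.DataFrame) -> pd.DataFrame:
    """Append a copy of each pair with source and suspicious texts exchanged.

    Hypothetical helper for illustration only.
    """
    # Exchange the two text columns; the label stays with its pair
    swapped = df.rename(columns={
        "source_text": "suspicious_text",
        "suspicious_text": "source_text",
    })[df.columns]  # restore the original column order
    combined = pd.concat([df, swapped], ignore_index=True)
    # Swapping can recreate an existing pair, so remove exact duplicates
    combined = combined.drop_duplicates(subset=["source_text", "suspicious_text"])
    return combined.reset_index(drop=True)

# Toy rows (made up) to show the effect
pairs = pd.DataFrame({
    "source_text": ["a", "b"],
    "suspicious_text": ["x", "y"],
    "label": [1, 0],
})
augmented = augment_by_swapping(pairs)
print(augmented)
```

Because each swapped row keeps its original label, the label balance of the input is preserved while the number of pairs roughly doubles.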