# Parallel Corpus Alignment Verification
## Objective
This notebook performs a single data validation check: it verifies that two separate, monolingual text files (one in Odia, one in German) have exactly the same number of lines. This is an essential quality assurance step before merging them into a parallel corpus, because a line-aligned merge pairs sentences by position, so any mismatch in line counts would misalign the sentence pairs and corrupt the data used for model training.
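
To illustrate why a single missing line matters, the following sketch pairs two toy lists by position, the way a line-aligned merge would. The placeholder strings and the `pair_lines` helper are invented for illustration and are not part of the corpus or of this notebook.

```python
def pair_lines(source_lines, target_lines):
  """Pair lines by position (hypothetical helper, for illustration only)."""
  return list(zip(source_lines, target_lines))

# Toy data: "s" entries stand for source sentences, "t" for their translations.
source = ["s1", "s2", "s3", "s4"]
target_ok = ["t1", "t2", "t3", "t4"]
target_missing = ["t1", "t3", "t4"]  # "t2" was lost somewhere upstream

aligned = pair_lines(source, target_ok)
shifted = pair_lines(source, target_missing)

print(aligned)  # every "sN" is paired with its own "tN"
print(shifted)  # from the second pair on, each source line gets the wrong target
```

In the shifted case, every pair after the missing line is wrong, and `zip` silently drops the unmatched final source line, so nothing fails loudly unless the counts are checked first.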

## Methodology
The script counts the total number of lines, including blank lines, and the number of non-empty lines (reported as sentences) in each of the two source files. It then compares the counts and prints a success, warning, or error message, immediately flagging any alignment issue.
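
As a minimal sketch of the counting rule, assuming files are read by iterating over a text-mode file object: blank lines are counted like any other line, and a final line without a trailing newline still counts. The helper names `count_lines` and `write_temp` are illustrative and do not appear in the notebook.

```python
import os
import tempfile

def count_lines(path):
  """Count every line the file iterator yields, blank lines included."""
  with open(path, "r", encoding="utf-8") as f:
    return sum(1 for _ in f)

def write_temp(text):
  """Write text to a temporary UTF-8 file and return its path (illustrative helper)."""
  fd, path = tempfile.mkstemp(suffix=".txt")
  with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
    f.write(text)
  return path

with_trailing = write_temp("a\n\nb\n")  # blank line in the middle, final newline
no_trailing = write_temp("a\n\nb")      # same content, no final newline

counts = (count_lines(with_trailing), count_lines(no_trailing))
print(counts)  # (3, 3)

# Clean up the temporary files.
os.remove(with_trailing)
os.remove(no_trailing)
```

Counting this way can differ from tools that count newline characters, which would report 2 for the second file.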

## Workflow
1. Mounts Google Drive to access the source files.

2. Configures the file paths for the Odia and German `.txt` files.

3. Reads both files line-by-line to get an accurate count.

4. Compares the final line counts.

5. Prints a status report indicating whether the files are perfectly aligned or if a mismatch exists.

## Input & Output
* **Input:** Two `.txt` files (`authentic_odia_corpus_v1.txt` and `authentic_german_corpus_v1.txt`).
* **Output:** A printed message to the console confirming the line counts and the alignment status. No new files are created.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os

In [None]:
ODIA_FILE = '/content/drive/MyDrive/Thesis/test/data/raw/authentic_odia_corpus_v1.txt'
GERMAN_FILE = '/content/drive/MyDrive/Thesis/test/data/raw/authentic_german_corpus_v1.txt'

In [None]:
def verify_line_counts():
  """
  Verifies that two files, specified by ODIA_FILE and GERMAN_FILE, have the same number of lines.

  This function checks that both files exist at the configured paths and compares their line counts.
  If either file is missing, it prints an error message and returns early. If the files exist, it uses
  the `count_lines_and_sentences` helper to count total lines and non-empty lines in each file,
  ensuring they match for parallel corpus validation. Results are printed to the console.

  Note:
    - Assumes global variables `ODIA_FILE` and `GERMAN_FILE` are defined with the file paths.
    - Does not return a value; outputs results or errors to the console.

  Example:
    >>> ODIA_FILE = "odia.txt"
    >>> GERMAN_FILE = "german.txt"
    >>> verify_line_counts()
    # Output if files are missing:
    ⛔️ ERROR: The Odia file was not found.
    Please make sure the file 'odia.txt' exists.
  """
  # --- Check if the files actually exist ---
  if not os.path.exists(ODIA_FILE):
    print("⛔️ ERROR: The Odia file was not found.")
    print(f"Please make sure the file '{ODIA_FILE}' exists.")
    return

  if not os.path.exists(GERMAN_FILE):
    print("⛔️ ERROR: The German file was not found.")
    print(f"Please make sure the file '{GERMAN_FILE}' exists.")
    return

  # --- Define a helper function to count lines efficiently ---
  def count_lines_and_sentences(filename):
    """
    Counts the total lines and non-empty lines (sentences) in a specified file.

    This function reads a file with UTF-8 encoding and counts:
    - Total lines, including empty lines.
    - Non-empty lines (sentences) after stripping whitespace.
    It is designed for efficient line counting in text files, such as those used in parallel corpora.

    Args:
      filename (str): The path to the file to be analyzed.

    Returns:
      tuple[int, int]: A tuple containing:
      - Total number of lines in the file.
      - Number of non-empty lines (sentences) after stripping whitespace.

    Example:
      >>> with open("sample.txt", "w", encoding="utf-8") as f:
      ...     f.write("Line 1\\n\\nLine 2\\n")
      >>> count_lines_and_sentences("sample.txt")
      (3, 2)
    """
    total_lines = 0
    sentence_count = 0
    with open(filename, 'r', encoding='utf-8') as f:
      for line in f:
        total_lines += 1
        # Count non-empty lines after stripping whitespace
        if line.strip():
          sentence_count += 1
    return total_lines, sentence_count

  # --- Count the lines in both files ---
  print("Counting lines and sentences in each file...")
  odia_total_lines, odia_sentences = count_lines_and_sentences(ODIA_FILE)
  german_total_lines, german_sentences = count_lines_and_sentences(GERMAN_FILE)

  # --- Display results ---
  print("\n--- Results ---")
  print(f"Odia file '{ODIA_FILE}':")
  print(f"  Total lines: {odia_total_lines}")
  print(f"  Sentences: {odia_sentences}")
  print(f"German file '{GERMAN_FILE}':")
  print(f"  Total lines: {german_total_lines}")
  print(f"  Sentences: {german_sentences}")

  # --- Compare line counts and report status ---
  if odia_total_lines == german_total_lines:
    if odia_total_lines == 0:
      print("\n⚠️ Warning: Both files are empty.")
    else:
      print("\n✅ Success! The files are perfectly aligned by line count.")
      if odia_sentences == german_sentences:
        print("✅ Sentence counts also match, ready for the next step.")
      else:
        print(f"⚠️ Warning: Sentence counts do not match (Odia: {odia_sentences}, German: {german_sentences}).")
        print("Please inspect the files for formatting issues.")
  else:
    print("\n⛔️ ERROR! Line counts do not match.")
    difference = abs(odia_total_lines - german_total_lines)
    print(f"There is a mismatch of {difference} line(s).")
    print("Please manually inspect the files to fix the discrepancy before proceeding.")

In [None]:
# --- Run the main function ---
if __name__ == "__main__":
  verify_line_counts()

Counting lines and sentences in each file...

--- Results ---
Odia file '/content/drive/MyDrive/Thesis/test/data/raw/authentic_odia_corpus_v1.txt':
  Total lines: 7351
  Sentences: 3676
German file '/content/drive/MyDrive/Thesis/test/data/raw/authentic_german_corpus_v1.txt':
  Total lines: 7351
  Sentences: 3676

✅ Success! The files are perfectly aligned by line count.
✅ Sentence counts also match, ready for the next step.
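
Matching totals do not by themselves show that blank lines sit at the same positions in both files. One possible follow-up check, sketched here under that assumption, walks both files in step and reports the first line number where one side is blank and the other is not. The function `first_blank_mismatch` and the `write_temp` helper are hypothetical and not part of this notebook; the demo uses small temporary files rather than the real corpus.

```python
import os
import tempfile
from itertools import zip_longest

def first_blank_mismatch(path_a, path_b):
  """Return the 1-based number of the first line that is blank in exactly one
  file (a missing line is treated as blank), or None if there is no such line."""
  with open(path_a, "r", encoding="utf-8") as fa, \
       open(path_b, "r", encoding="utf-8") as fb:
    for number, (a, b) in enumerate(zip_longest(fa, fb), start=1):
      a_blank = a is None or not a.strip()
      b_blank = b is None or not b.strip()
      if a_blank != b_blank:
        return number
  return None

def write_temp(text):
  """Write text to a temporary UTF-8 file and return its path (illustrative helper)."""
  fd, path = tempfile.mkstemp(suffix=".txt")
  with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
    f.write(text)
  return path

# All three toy files have 3 lines and 2 non-empty lines.
file_a = write_temp("x\n\ny\n")
file_b = write_temp("p\n\nq\n")   # blank line in the same position as file_a
file_c = write_temp("p\nq\n\n")   # blank line moved to the end

same_layout = first_blank_mismatch(file_a, file_b)
moved_blank = first_blank_mismatch(file_a, file_c)
print(same_layout, moved_blank)  # None 2

for path in (file_a, file_b, file_c):
  os.remove(path)
```

A `None` result only means the blank-line layout agrees; it says nothing about whether the paired sentences are actual translations of each other.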
