# Assignment 2: N-grams and Language Identification
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name: Nagme Cagla Golcu**  
**Student ID: 2526366**  
**Due Date:** 16 November 2025 (Sunday) before midnight

---

## Overview

This assignment focuses on:
1. Building **character-based** 2-gram and 3-gram language models with Laplace smoothing
2. Sentence-based language identification using 10-fold cross-validation
3. Evaluation using accuracy, precision, recall, and F1-score
4. Comparison and analysis

**Note:** For language identification, we use **character n-grams** rather than word n-grams because they better capture language-specific patterns like letter combinations, diacritics, and writing systems.
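To make the note above concrete, here is a minimal sketch of character n-gram extraction. The helper `char_ngrams` and the `<s>`/`</s>` boundary symbols are illustrative assumptions, not part of the assignment template.

```python
from typing import List, Tuple

def char_ngrams(sentence: str, n: int) -> List[Tuple[str, ...]]:
    """Return padded character n-grams of a sentence (illustrative helper)."""
    # Pad with n-1 boundary symbols on each side so sentence edges are modelled too.
    padded = ["<s>"] * (n - 1) + list(sentence) + ["</s>"] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

bigrams = char_ngrams("çay", 2)
print(bigrams)
```

A letter such as `ç` becomes part of several n-grams, which is one way diacritics and letter combinations end up as language-specific evidence.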

**Grading:**
- Written Questions (7 × 4 pts): **28 pts**
- Code Tasks with TODO (11 total): **72 pts** distributed by effort level:
  - Simple tasks: 4 pts each (2 cells)
  - Moderate tasks: 6 pts each (4 cells)
  - Complex tasks: 8 pts each (5 cells)
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import re

# Scikit-learn for cross-validation and metrics
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import ttest_rel


# Set random seed for reproducibility
np.random.seed(42)

---

# Task 1: Corpus Preparation and Statistics (22 points)

## 1.1: Upload Corpus Files

Prepare your text files in **two different languages** (accepted formats: `.txt`, `.pdf`, or `.docx`). When you run the cell below, you'll be prompted to upload files for each language separately. Make sure your files contain substantial text (reports, essays, or similar content from other courses). Each language requires at least **5000** words in its corpus.

In [2]:
from google.colab import files

print("Upload your ENGLISH corpus file(s):")
english_files = files.upload()

print("\nUpload your SECOND LANGUAGE corpus file(s):")
second_lang_files = files.upload()


Upload your ENGLISH corpus file(s):


Saving pg46.txt to pg46.txt

Upload your SECOND LANGUAGE corpus file(s):


Saving turkish.txt to turkish.txt


## 1.2: Load and Preprocess Data (12 points)

Load your uploaded files, extract text, preprocess, split into sentences, and tokenize. You'll need helper functions to handle different file formats.

**Steps:**
1. Read files based on format (`.txt`, `.pdf`, `.docx`) and combine them into single text for each language
2. Apply preprocessing (e.g., lowercasing, handling punctuation)
3. Split each corpus into individual sentences
4. Tokenize each sentence into words (for statistics)
5. Store the results as two lists of tokenized sentences

**Important:** You'll use word tokenization for calculating statistics, but for the n-gram models in Task 2, you'll work with character n-grams directly on the sentence strings.

In [7]:
import re
from typing import List

def read_txt_file(filename: str) -> str:
    """Read a .txt file and return its content."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

def read_pdf_file(filename: str) -> str:
    """Read a .pdf file and return its text content."""
    # PDF files are not used at the moment, so this returns an empty string.
    return ""

def read_docx_file(filename: str) -> str:
    """Read a .docx file and return its text content."""
    # DOCX files are not used at the moment, so this returns an empty string.
    return ""

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    return sentences

def tokenize_sentence(sentence: str) -> List[str]:
    """Tokenize a sentence into words."""
    sentence = sentence.lower()
    tokens = sentence.split()
    return tokens

# 1. Read and combine files for each language

lang1_text = ""
for filename in english_files.keys():
    if filename.endswith(".txt"):
        lang1_text += read_txt_file(filename) + " "
    elif filename.endswith(".pdf"):
        lang1_text += read_pdf_file(filename) + " "
    elif filename.endswith(".docx"):
        lang1_text += read_docx_file(filename) + " "

lang2_text = ""
for filename in second_lang_files.keys():
    if filename.endswith(".txt"):
        lang2_text += read_txt_file(filename) + " "
    elif filename.endswith(".pdf"):
        lang2_text += read_pdf_file(filename) + " "
    elif filename.endswith(".docx"):
        lang2_text += read_docx_file(filename) + " "

# 2. Apply preprocessing (lowercasing, removing extra whitespace)
def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace into a single space
    return text.strip()

lang1_text = preprocess(lang1_text)
lang2_text = preprocess(lang2_text)

# 3. Split each corpus into sentences
lang1_sentences = split_into_sentences(lang1_text)
lang2_sentences = split_into_sentences(lang2_text)

# 4. Tokenize each sentence
lang1_sentences_tokenized = [tokenize_sentence(s) for s in lang1_sentences]
lang2_sentences_tokenized = [tokenize_sentence(s) for s in lang2_sentences]





# TODO:
#
# 1. Read and combine files for each language
#    - Loop through lang1_files and lang2_files
#    - Use appropriate read function based on file extension
#    - Combine all text into lang1_text and lang2_text
#
# 2. Apply preprocessing to both lang1_text and lang2_text
#    (e.g., lowercasing, removing extra whitespace)
#
# 3. Split each corpus into sentences using split_into_sentences()
#
# 4. Tokenize each sentence using tokenize_sentence()
#
# Note: These tokenized sentences will be used for statistics in Task 1.
# In Task 2, you'll work with the raw sentence strings for character n-grams.

# At the end, you should have:
# lang1_sentences = [sent1, sent2, ...]
# lang2_sentences = [sent1, sent2, ...]
# lang1_sentences_tokenized = [[word1, word2, ...], [word1, word2, ...], ...]
# lang2_sentences_tokenized = [[word1, word2, ...], [word1, word2, ...], ...]


# [8 pts]

**Question 1.1:** What preprocessing choices did you make and why? (3-5 sentences)

**I first lowercased all text so that the models treat differently capitalized forms of a word as the same unit. I then collapsed runs of whitespace into single spaces, which gives cleaner sentence splitting and avoids empty tokens. Sentence boundaries were identified with a simple punctuation-based regex that segments both languages consistently. Finally, each sentence was tokenized by splitting on whitespace, producing word-level units for the statistics.**
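The preprocessing choices described above can be traced on a toy input. This is a minimal sketch; `toy_pipeline` is a hypothetical helper that chains the same steps (lowercasing, whitespace collapsing, regex sentence splitting, whitespace tokenization) and is not one of the notebook's functions.

```python
import re
from typing import List

def toy_pipeline(text: str) -> List[List[str]]:
    """Lowercase, collapse whitespace, split into sentences, then into words."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return [s.split() for s in sentences]

tokens = toy_pipeline("Marley was dead.  There is NO doubt!\nNone whatever.")
print(tokens)
```

Because tokens are produced by whitespace splitting alone, punctuation stays attached to words (`dead.` rather than `dead`), so the word vocabulary counts such forms as separate types.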

## 1.3: Basic Statistics (10 points)

Calculate and display key statistics for both language corpora to understand their characteristics.

In [19]:
# --- STATISTICS FOR LANGUAGE 1 ---

# Total character count
lang1_total_characters = sum(len(sentence) for sentence in lang1_sentences)

# Special character / punctuation count
lang1_special_characters = sum(
    sum(1 for ch in sentence if not ch.isalnum() and not ch.isspace())
    for sentence in lang1_sentences
)

# Unique character vocabulary size
lang1_char_vocabulary = len(set(lang1_text))

# Total word count
lang1_total_words = sum(len(tokens) for tokens in lang1_sentences_tokenized)

# Unique word vocabulary size
lang1_word_vocabulary = len(set(word for sent in lang1_sentences_tokenized for word in sent))

# Sentence count
lang1_sentence_count = len(lang1_sentences)

# Average sentence length (in words)
lang1_avg_sentence_length = lang1_total_words / lang1_sentence_count if lang1_sentence_count > 0 else 0


# --- STATISTICS FOR LANGUAGE 2 ---

lang2_total_characters = sum(len(sentence) for sentence in lang2_sentences)

lang2_special_characters = sum(
    sum(1 for ch in sentence if not ch.isalnum() and not ch.isspace())
    for sentence in lang2_sentences
)

lang2_char_vocabulary = len(set(lang2_text))

lang2_total_words = sum(len(tokens) for tokens in lang2_sentences_tokenized)

lang2_word_vocabulary = len(set(word for sent in lang2_sentences_tokenized for word in sent))

lang2_sentence_count = len(lang2_sentences)

lang2_avg_sentence_length = lang2_total_words / lang2_sentence_count if lang2_sentence_count > 0 else 0


# --- PRINT RESULTS SIDE BY SIDE ---

print("------- STATISTICS --------- Language1 |Language2")

print("Total characters:             ", lang1_total_characters, " | ", lang2_total_characters)
print("Special characters:           ", lang1_special_characters, "   | ", lang2_special_characters)
print("Character vocabulary size:    ", lang1_char_vocabulary, "     | ", lang2_char_vocabulary)
print("Total words count:            ", lang1_total_words, "  | ", lang2_total_words)
print("Word vocabulary size:         ", lang1_word_vocabulary, "   | ", lang2_word_vocabulary)
print("Sentence count:               ", lang1_sentence_count, "   | ", lang2_sentence_count)
print("Avg sentence length(in words): ", round(lang1_avg_sentence_length,2), " | ", round(lang2_avg_sentence_length,2))



------- STATISTICS --------- Language1 |Language2
Total characters:              175116  |  45351
Special characters:            7993    |  1215
Character vocabulary size:     64      |  48
Total words count:             31611   |  5130
Word vocabulary size:          7235    |  145
Sentence count:                1629    |  315
Avg sentence length(in words):  19.41  |  16.29


**Question 1.2:** What are the key differences between your two corpora? (2-3 sentences)

**The English corpus is much larger than the Turkish one: it has far more characters (175,116 vs. 45,351), words (31,611 vs. 5,130), and sentences (1,629 vs. 315), and a much richer vocabulary (7,235 vs. 145 unique words). Its sentences are also longer on average (19.41 vs. 16.29 words), which may indicate more intricate or descriptive sentence structure. The Turkish corpus's very small vocabulary relative to its word count suggests shorter source texts or highly repetitive content.**
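One way to quantify the vocabulary difference is the type-token ratio (unique words divided by total words). This measure is not required by the assignment; the sketch below simply applies it to the figures in the statistics table, and `type_token_ratio` is an illustrative helper.

```python
def type_token_ratio(vocabulary_size: int, total_words: int) -> float:
    """Unique word types divided by total word tokens."""
    return vocabulary_size / total_words if total_words else 0.0

ttr_lang1 = type_token_ratio(7235, 31611)  # Language 1 figures from the table
ttr_lang2 = type_token_ratio(145, 5130)    # Language 2 figures from the table
print(f"Language 1 TTR: {ttr_lang1:.3f}")
print(f"Language 2 TTR: {ttr_lang2:.3f}")
```

The ratio tends to fall as a corpus grows, so comparing corpora of different sizes this way is only a rough indication.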

---

# Task 2: Character N-gram Language Identification (58 points)

**Baseline (46 pts):** Implement character-based 2-gram and 3-gram models, run 10-fold CV, report accuracy.  
**Creativity (12 pts):** Out-of-vocabulary analysis.

## 2.1: Implement Character N-gram Models (12 points)

Implement the `CharNgramLanguageModel` class with Laplace smoothing using NLTK's n-gram utilities. The model should count **character** n-grams during training and calculate sentence probabilities with smoothing.

**Key difference from word n-grams:** Instead of tokenizing sentences into words, you'll work with individual characters in each sentence.
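As a worked illustration of Laplace (add-1) smoothing for character bigrams, the estimate is (count(prev, char) + 1) / (count(prev) + V), where V is the vocabulary size. The sketch below is a simplified stand-in: `laplace_bigram_prob` is a hypothetical helper, and unlike NLTK it ignores padding symbols and the unknown-token entry.

```python
from collections import Counter

def laplace_bigram_prob(prev: str, char: str, bigram_counts: Counter,
                        context_counts: Counter, vocab_size: int) -> float:
    """Add-one estimate: (count(prev, char) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, char)] + 1) / (context_counts[prev] + vocab_size)

text = "abab"
bigram_counts = Counter(zip(text, text[1:]))   # ('a','b'): 2, ('b','a'): 1
context_counts = Counter(text[:-1])            # 'a': 2, 'b': 1
vocab_size = len(set(text))                    # 2 distinct characters

seen = laplace_bigram_prob("a", "b", bigram_counts, context_counts, vocab_size)
unseen = laplace_bigram_prob("a", "a", bigram_counts, context_counts, vocab_size)
print(seen, unseen)
```

The unseen bigram `aa` still receives a non-zero probability, and the two estimates for context `a` sum to 1.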

In [27]:
import nltk
from nltk.util import ngrams, pad_sequence
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from typing import List

# Download required NLTK data
nltk.download('punkt', quiet=True)

class CharNgramLanguageModel:
    """
    Character-based N-gram language model with Laplace (add-1) smoothing using NLTK.
    """

    def __init__(self, n: int = 2):
        """
        Initialize the character n-gram model.

        Args:
            n: Order of n-gram (2 for bigram, 3 for trigram)
        """
        self.n = n
        self.model = Laplace(n)

    def train(self, sentences: List[str]):
        """
        Train the model on a list of sentences.

        Args:
            sentences: List of sentences (each sentence is a string)
        """
        # Convert each sentence to list of characters
        char_sequences = [list(s) for s in sentences]

        # Prepare n-gram training data with padding
        train_ngrams, vocab = padded_everygram_pipeline(self.n, char_sequences)

        # Fit Laplace n-gram model
        self.model.fit(train_ngrams, vocab)

    def get_probability(self, sentence: str) -> float:
        """
        Calculate the probability of a sentence.
        """
        import math

        # Convert sentence to list of characters
        chars = list(sentence)

        # Pad the character sequence
        padded_chars = pad_sequence(
            chars,
            n=self.n,
            pad_left=True,
            pad_right=True,
            left_pad_symbol="<s>",
            right_pad_symbol="</s>",
        )

        # Generate n-grams
        sent_ngrams = ngrams(padded_chars, self.n)

        # Compute log probability
        log_prob = 0.0
        for ng in sent_ngrams:
            context = ng[:-1]
            char = ng[-1]
            prob = self.model.score(char, context)

            if prob <= 0:
                prob = 1e-12

            log_prob += math.log(prob)

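        # Caution: exp() underflows to 0.0 once log_prob drops below roughly -745,
        # which can happen for long sentences.
        return math.exp(log_prob)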


### Spot Check: Inspect Your N-gram Models

After implementing the model, train sample models on both languages and inspect what they learned.

In [28]:
# TODO: Train sample models and inspect them
#
# 1. Create 2-gram and 3-gram models for both languages
# 2. Train them on your full datasets (lang1_sentences and lang2_sentences)
# 3. Inspect the models to see what n-grams they learned
#
# Example:
# model_2gram_lang1 = CharNgramLanguageModel(n=2)
# model_2gram_lang1.train(lang1_sentences)
# model_3gram_lang1 = CharNgramLanguageModel(n=3)
# model_3gram_lang1.train(lang1_sentences)
#
# model_2gram_lang2 = CharNgramLanguageModel(n=2)
# model_2gram_lang2.train(lang2_sentences)
# model_3gram_lang2 = CharNgramLanguageModel(n=3)
# model_3gram_lang2.train(lang2_sentences)
#
# Display sample n-grams and their counts from each model
# Check vocabulary size: len(model.model.vocab)
# Show most common n-grams or test probabilities on sample sentences

# 1. Create 2-gram and 3-gram models for both languages
model_2gram_lang1 = CharNgramLanguageModel(n=2)
model_3gram_lang1 = CharNgramLanguageModel(n=3)

model_2gram_lang2 = CharNgramLanguageModel(n=2)
model_3gram_lang2 = CharNgramLanguageModel(n=3)

# 2. Train them on your full datasets (lang1_sentences and lang2_sentences)
model_2gram_lang1.train(lang1_sentences)
model_3gram_lang1.train(lang1_sentences)

model_2gram_lang2.train(lang2_sentences)
model_3gram_lang2.train(lang2_sentences)

# 3. Inspect the models

print("===== Vocabulary sizes =====")
print("Lang1 2-gram vocab size:", len(model_2gram_lang1.model.vocab))
print("Lang1 3-gram vocab size:", len(model_3gram_lang1.model.vocab))
print("Lang2 2-gram vocab size:", len(model_2gram_lang2.model.vocab))
print("Lang2 3-gram vocab size:", len(model_3gram_lang2.model.vocab))


# Helper function to print some example n-grams and counts
def print_some_char_ngrams(model, n_order=2, limit=10):
    """
    Print a few example character n-grams and their counts from a trained model.
    """
    counter = model.model.counts[n_order]   # n-gram level (2 or 3)
    printed = 0
    for context in counter:
        for ch, cnt in counter[context].items():
            ngram = "".join(context) + ch
            print(f"{repr(ngram)} -> count={cnt}")
            printed += 1
            if printed >= limit:
                return

print("\n===== Sample 2-gram char n-grams (Lang1) =====")
print_some_char_ngrams(model_2gram_lang1, n_order=2, limit=10)

print("\n===== Sample 2-gram char n-grams (Lang2) =====")
print_some_char_ngrams(model_2gram_lang2, n_order=2, limit=10)

# Test probabilities on a couple of sample sentences
if len(lang1_sentences) > 0:
    sample1 = lang1_sentences[0]
    print("\nExample Lang1 sentence:", sample1[:80], "...")
    print("P_2gram(Lang1) =", model_2gram_lang1.get_probability(sample1))
    print("P_3gram(Lang1) =", model_3gram_lang1.get_probability(sample1))

if len(lang2_sentences) > 0:
    sample2 = lang2_sentences[0]
    print("\nExample Lang2 sentence:", sample2[:80], "...")
    print("P_2gram(Lang2) =", model_2gram_lang2.get_probability(sample2))
    print("P_3gram(Lang2) =", model_3gram_lang2.get_probability(sample2))


# [4 pts]

===== Vocabulary sizes =====
Lang1 2-gram vocab size: 67
Lang1 3-gram vocab size: 67
Lang2 2-gram vocab size: 51
Lang2 3-gram vocab size: 51

===== Sample 2-gram char n-grams (Lang1) =====
'<s>\ufeff' -> count=1
'<s>y' -> count=30
'<s>i' -> count=195
'<s>t' -> count=237
'<s>m' -> count=32
'<s>d' -> count=23
'<s>c' -> count=42
'<s>s' -> count=109
'<s>o' -> count=27
'<s>b' -> count=69

===== Sample 2-gram char n-grams (Lang2) =====
'<s>b' -> count=45
'<s>f' -> count=45
'<s>n' -> count=45
'<s>k' -> count=90
'<s>g' -> count=45
'<s>a' -> count=45
'bi' -> count=360
'il' -> count=360
'im' -> count=360
'in' -> count=1125

Example Lang1 sentence: the project gutenberg ebook of a christmas carol in prose; being a ghost story  ...
P_2gram(Lang1) = 4.0905016374237176e-258
P_3gram(Lang1) = 2.868153326444669e-206

Example Lang2 sentence: bilim (1), evrenin işleyişini anlamak için sistematik gözlem, deney ve mantıksal ...
P_2gram(Lang2) = 5.1918384750125995e-114
P_3gram(Lang2) = 1.08249384179

## 2.2: Implement Language Identification (8 points)

Create a function that compares sentence probabilities from two language models and returns the predicted label.

In [30]:
def identify_language(sentence: str,
                     model_lang1: CharNgramLanguageModel,
                     model_lang2: CharNgramLanguageModel) -> int:
    """
    Identify the language of a sentence using two character-based language models.

    Args:
        sentence: Sentence string
        model_lang1: Language model for language 1 (label 0)
        model_lang2: Language model for language 2 (label 1)

    Returns:
        Predicted label (0 or 1)
    """

    # 1. Probability from language 1 model
    prob1 = model_lang1.get_probability(sentence)

    # 2. Probability from language 2 model
    prob2 = model_lang2.get_probability(sentence)

    # 3. Compare and choose the higher probability
    if prob1 > prob2:
        return 0   # language 1
    else:
        return 1   # language 2
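One caveat when comparing raw probabilities: a product of many per-character probabilities can fall below the smallest positive float, at which point `math.exp` returns `0.0`. If both models underflow on the same long sentence, `prob1 > prob2` is false and label 1 is returned by default. The sketch below uses hypothetical log-probabilities and an illustrative helper `pick_language` (not part of the notebook) to show how comparing in log space keeps the ordering.

```python
import math

def pick_language(log_prob_lang1: float, log_prob_lang2: float) -> int:
    """Return 0 if language 1 has the higher log-probability, else 1."""
    return 0 if log_prob_lang1 > log_prob_lang2 else 1

# Hypothetical log-probabilities for one long sentence under each model.
log_p1, log_p2 = -800.0, -950.0

# Both raw probabilities underflow to 0.0, so the comparison is a tie
# and the else-branch (label 1) wins by default.
raw_choice = 0 if math.exp(log_p1) > math.exp(log_p2) else 1

# Comparing in log space preserves which model scored higher.
log_choice = pick_language(log_p1, log_p2)
print(raw_choice, log_choice)
```

Whether this affects a given run depends on how long the test sentences are, so it is worth checking on the actual corpus.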

## 2.3: Implement Evaluation Function (6 points)

Create a function that calculates accuracy, precision, recall, and F1-score given predicted and true labels.

In [31]:
from typing import List, Dict
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """
    Calculate evaluation metrics.

    Args:
        y_true: True labels
        y_pred: Predicted labels

    Returns:
        Dictionary with accuracy, precision, recall, f1_score
    """

    # Accuracy
    accuracy = accuracy_score(y_true, y_pred)

    # Precision, Recall, F1 (binary classification)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary'
    )

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }
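For reference, the same four metrics can be computed by hand from the prediction counts. With `average='binary'` scikit-learn reports precision, recall, and F1 for the positive class, which by default is label 1 (language 2 here). The sketch below assumes that convention; `binary_metrics` is an illustrative helper and the labels are made up.

```python
from typing import Dict, List

def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(y_true), "precision": precision,
            "recall": recall, "f1_score": f1}

# Made-up labels: one language-1 sentence is wrongly predicted as language 2.
m = binary_metrics([0, 0, 0, 1, 1], [0, 1, 0, 1, 1])
print(m)
```

In this toy case recall for label 1 is perfect while precision drops, because the only error is a language-1 sentence labelled as language 2.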


## 2.4: 10-Fold Cross-Validation for Language Identification (8 points)

Implement 10-fold cross-validation to evaluate your character-based n-gram models. In each fold, split the data, train separate models for each language and n-gram order, make predictions, and evaluate performance.
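The fold partitioning itself can be sketched without scikit-learn. The helper below is a simplified, hypothetical stand-in for `KFold(n_splits=10, shuffle=True)`: it shuffles the indices and gives the first `n % k` folds one extra sample. It will not reproduce scikit-learn's exact shuffling, only the fold sizes.

```python
import random
from typing import List

def kfold_indices(n_samples: int, n_splits: int = 10, seed: int = 42) -> List[List[int]]:
    """Shuffle indices and partition them into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for k in range(n_splits):
        size = base + (1 if k < extra else 0)  # first `extra` folds get one more sample
        folds.append(indices[start:start + size])
        start += size
    return folds

folds = kfold_indices(1944)  # 1629 + 315 sentences in this notebook
print([len(f) for f in folds])
```

Plain k-fold does not balance the two languages within each fold; with 1629 versus 315 sentences, a stratified split would be one option to consider.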

In [34]:
from sklearn.model_selection import KFold

# Prepare dataset: combine sentence STRINGS from both languages with labels
X = lang1_sentences + lang2_sentences
y = [0] * len(lang1_sentences) + [1] * len(lang2_sentences)

print(f"Dataset prepared:")
print(f"  Total sentences: {len(X)}")
print(f"  Language 1 (label 0): {sum(1 for label in y if label == 0)} sentences")
print(f"  Language 2 (label 1): {sum(1 for label in y if label == 1)} sentences")
print()

# Initialize 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Store results for each fold
results_2gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1_score': []}
results_3gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1_score': []}

# 10-fold cross-validation
for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    print(f"\n{'='*50}")
    print(f"Fold {fold_idx}/10")
    print(f"{'='*50}")

    # 1. Split data into train and test sets
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test  = [X[i] for i in test_idx]
    y_test  = [y[i] for i in test_idx]

    # 2. Separate training sentences by language based on labels
    train_lang1_sentences = [sent for sent, label in zip(X_train, y_train) if label == 0]
    train_lang2_sentences = [sent for sent, label in zip(X_train, y_train) if label == 1]

    print(f"  Train size: {len(X_train)}, Test size: {len(X_test)}")
    print(f"  Train Lang1: {len(train_lang1_sentences)}, Train Lang2: {len(train_lang2_sentences)}")

    # 3. Train FOUR models total:
    #    - Two 2-gram models (one per language)
    #    - Two 3-gram models (one per language)

    # 2-gram models
    model2_lang1 = CharNgramLanguageModel(n=2)
    model2_lang2 = CharNgramLanguageModel(n=2)
    model2_lang1.train(train_lang1_sentences)
    model2_lang2.train(train_lang2_sentences)

    # 3-gram models
    model3_lang1 = CharNgramLanguageModel(n=3)
    model3_lang2 = CharNgramLanguageModel(n=3)
    model3_lang1.train(train_lang1_sentences)
    model3_lang2.train(train_lang2_sentences)

    # 4. Make predictions on test sentences
    y_pred_2gram = []
    y_pred_3gram = []

    for sent in X_test:
        # 2-gram prediction
        y_pred_2gram.append(
            identify_language(sent, model2_lang1, model2_lang2)
        )
        # 3-gram prediction
        y_pred_3gram.append(
            identify_language(sent, model3_lang1, model3_lang2)
        )

    # 5. Calculate metrics
    metrics_2 = calculate_metrics(y_test, y_pred_2gram)
    metrics_3 = calculate_metrics(y_test, y_pred_3gram)

    # 6. Store results
    for k in results_2gram.keys():
        results_2gram[k].append(metrics_2[k])
        results_3gram[k].append(metrics_3[k])

    # Print the results for this fold
    print("  2-gram -> acc: {:.3f}, prec: {:.3f}, rec: {:.3f}, f1: {:.3f}".format(
        metrics_2["accuracy"], metrics_2["precision"], metrics_2["recall"], metrics_2["f1_score"]
    ))
    print("  3-gram -> acc: {:.3f}, prec: {:.3f}, rec: {:.3f}, f1: {:.3f}".format(
        metrics_3["accuracy"], metrics_3["precision"], metrics_3["recall"], metrics_3["f1_score"]
    ))

print("\n" + "="*50)
print("Cross-validation completed!")
print("="*50)

Dataset prepared:
  Total sentences: 1944
  Language 1 (label 0): 1629 sentences
  Language 2 (label 1): 315 sentences


Fold 1/10
  Train size: 1749, Test size: 195
  Train Lang1: 1465, Train Lang2: 284
  2-gram -> acc: 0.990, prec: 0.939, rec: 1.000, f1: 0.969
  3-gram -> acc: 0.995, prec: 0.969, rec: 1.000, f1: 0.984

Fold 2/10
  Train size: 1749, Test size: 195
  Train Lang1: 1468, Train Lang2: 281
  2-gram -> acc: 0.969, prec: 0.850, rec: 1.000, f1: 0.919
  3-gram -> acc: 0.974, prec: 0.872, rec: 1.000, f1: 0.932

Fold 3/10
  Train size: 1749, Test size: 195
  Train Lang1: 1464, Train Lang2: 285
  2-gram -> acc: 0.933, prec: 0.698, rec: 1.000, f1: 0.822
  3-gram -> acc: 0.969, prec: 0.833, rec: 1.000, f1: 0.909

Fold 4/10
  Train size: 1749, Test size: 195
  Train Lang1: 1469, Train Lang2: 280
  2-gram -> acc: 0.949, prec: 0.778, rec: 1.000, f1: 0.875
  3-gram -> acc: 0.974, prec: 0.875, rec: 1.000, f1: 0.933

Fold 5/10
  Train size: 1750, Test size: 194
  Train Lang1: 1466, Train

## 2.5: Display Results (12 points)

*Create a table showing for each model:*
Mean accuracy, precision, recall, F1 (with std)

In [36]:
import numpy as np
import pandas as pd  # used to build the results DataFrame

# TODO: Calculate and display summary statistics
#
#
# Example:
# results_df = pd.DataFrame({
#     'Model': ['2-gram', '3-gram'],
#     'Accuracy': [...],
#     'Precision': [...],
#     ...
# })

# Calculate mean and standard deviation for 2-gram model
mean_acc_2 = np.mean(results_2gram['accuracy'])
std_acc_2 = np.std(results_2gram['accuracy'])
mean_prec_2 = np.mean(results_2gram['precision'])
std_prec_2 = np.std(results_2gram['precision'])
mean_rec_2 = np.mean(results_2gram['recall'])
std_rec_2 = np.std(results_2gram['recall'])
mean_f1_2 = np.mean(results_2gram['f1_score'])
std_f1_2 = np.std(results_2gram['f1_score'])

# Calculate mean and standard deviation for 3-gram model
mean_acc_3 = np.mean(results_3gram['accuracy'])
std_acc_3 = np.std(results_3gram['accuracy'])
mean_prec_3 = np.mean(results_3gram['precision'])
std_prec_3 = np.std(results_3gram['precision'])
mean_rec_3 = np.mean(results_3gram['recall'])
std_rec_3 = np.std(results_3gram['recall'])
mean_f1_3 = np.mean(results_3gram['f1_score'])
std_f1_3 = np.std(results_3gram['f1_score'])

# Create a DataFrame for better display
results_df = pd.DataFrame({
    'Model': ['2-gram', '3-gram'],
    'Accuracy (Mean ± Std)': [
        f"{mean_acc_2:.3f} ± {std_acc_2:.3f}",
        f"{mean_acc_3:.3f} ± {std_acc_3:.3f}"
    ],
    'Precision (Mean ± Std)': [
        f"{mean_prec_2:.3f} ± {std_prec_2:.3f}",
        f"{mean_prec_3:.3f} ± {std_prec_3:.3f}"
    ],
    'Recall (Mean ± Std)': [
        f"{mean_rec_2:.3f} ± {std_rec_2:.3f}",
        f"{mean_rec_3:.3f} ± {std_rec_3:.3f}"
    ],
    'F1-Score (Mean ± Std)': [
        f"{mean_f1_2:.3f} ± {std_f1_2:.3f}",
        f"{mean_f1_3:.3f} ± {std_f1_3:.3f}"
    ]
})

print("\n===== Summary of 10-Fold Cross-Validation Results =====\n")
print(results_df.to_string(index=False))

# [4 pts]


===== Summary of 10-Fold Cross-Validation Results =====

 Model Accuracy (Mean ± Std) Precision (Mean ± Std) Recall (Mean ± Std) F1-Score (Mean ± Std)
2-gram         0.969 ± 0.017          0.842 ± 0.073       1.000 ± 0.000         0.912 ± 0.044
3-gram         0.980 ± 0.008          0.889 ± 0.045       1.000 ± 0.000         0.941 ± 0.025
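A note on the standard deviations: `np.std` divides by n by default (population standard deviation), whereas the sample standard deviation divides by n - 1 and is slightly larger. The sketch below illustrates the difference using only the 2-gram accuracies of the first four folds from the fold log, so its numbers will not match the ten-fold table.

```python
import statistics

# 2-gram accuracies of the first four folds, as printed in the fold log
fold_accuracies = [0.990, 0.969, 0.933, 0.949]

mean_acc = statistics.fmean(fold_accuracies)
pop_std = statistics.pstdev(fold_accuracies)    # divides by n, like np.std's default
sample_std = statistics.stdev(fold_accuracies)  # divides by n - 1
print(f"{mean_acc:.3f} ± {pop_std:.3f} (population), ± {sample_std:.3f} (sample)")
```

With only ten folds the choice matters a little; passing `ddof=1` to `np.std` gives the sample version.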


**Question 2.1:** Which of your trained models performed best on the validation data, and why? (3-4 sentences)

**[YOUR ANSWER HERE]**

**Question 2.2:** Were the results consistent across different folds of cross-validation? (2-3 sentences)

**[YOUR ANSWER HERE]**

## 2.6: Out-of-Vocabulary Testing (12 points)

Test your models with **five** sentences containing characters or character combinations not common in your training corpus. For character n-grams, this might include unusual letter combinations, foreign words, or made-up words that still follow language patterns.
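Before scoring such sentences, it can help to list which of their characters never occur in the training data. The sketch below is illustrative only: `unseen_characters` is a hypothetical helper and the two training sentences are made up.

```python
from typing import List, Set

def unseen_characters(sentence: str, training_sentences: List[str]) -> Set[str]:
    """Characters of `sentence` that never occur in the training sentences."""
    seen = set("".join(training_sentences))
    return set(sentence) - seen

toy_training = ["the ghost spoke.", "scrooge said nothing."]  # hypothetical data
print(sorted(unseen_characters("Sträxønite reacted.", toy_training)))
```

If the training text was lowercased, capital letters in a test sentence also count as unseen unless the same preprocessing is applied to it first.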

In [39]:
# 2.6: Out-of-Vocabulary Testing

# First, train new models on the full corpus (all sentences)
full_2_lang1 = CharNgramLanguageModel(n=2)
full_2_lang2 = CharNgramLanguageModel(n=2)
full_3_lang1 = CharNgramLanguageModel(n=3)
full_3_lang2 = CharNgramLanguageModel(n=3)

full_2_lang1.train(lang1_sentences)
full_2_lang2.train(lang2_sentences)
full_3_lang1.train(lang1_sentences)
full_3_lang2.train(lang2_sentences)

# Five OOV sentences (unusual characters, made-up words, mixed languages)
oov_sentences = [
    "Çelvünleşmiş töbrikşenme yapısı şörkütgünlü bir toprağa düştü.",  # Turkish structure + made-up words
    "The zellartonic core emitted scryphtonic pulses across the quantisphere.",  # English structure + made-up words
    "Sträxønite reacted with løfthællium inside the mülgrøxen chamber.",  # English + Scandinavian letters
    "Bürsünteşmiş sümünle  gövüntei sismik modeli bozdu.",  # Turkish + made-up words with umlauts
    "A gravistorm event fractured the cryphexial driftline entirely.",  # English + made-up words
]

print("===== OOV Sentence Testing =====\n")

for i, s in enumerate(oov_sentences, 1):
    # 2-gram prediction
    prob1_2 = full_2_lang1.get_probability(s)
    prob2_2 = full_2_lang2.get_probability(s)
    pred_2 = 0 if prob1_2 > prob2_2 else 1

    # 3-gram prediction
    prob1_3 = full_3_lang1.get_probability(s)
    prob2_3 = full_3_lang2.get_probability(s)
    pred_3 = 0 if prob1_3 > prob2_3 else 1

    print(f"OOV Sentence {i}: {s}")
    print(f"  2-gram -> predicted language: {pred_2} (P1={prob1_2:.2e}, P2={prob2_2:.2e})")
    print(f"  3-gram -> predicted language: {pred_3} (P1={prob1_3:.2e}, P2={prob2_3:.2e})")
    print()


# [8 pts]

===== OOV Sentence Testing =====

OOV Sentence 1: Çelvünleşmiş töbrikşenme yapısı şörkütgünlü bir toprağa düştü.
  2-gram -> predicted language: 1 (P1=1.62e-133, P2=4.69e-109)
  3-gram -> predicted language: 1 (P1=4.47e-118, P2=1.49e-113)

OOV Sentence 2: The zellartonic core emitted scryphtonic pulses across the quantisphere.
  2-gram -> predicted language: 0 (P1=3.24e-89, P2=2.42e-134)
  3-gram -> predicted language: 0 (P1=2.73e-95, P2=1.84e-131)

OOV Sentence 3: Sträxønite reacted with løfthællium inside the mülgrøxen chamber.
  2-gram -> predicted language: 0 (P1=3.77e-96, P2=6.99e-128)
  3-gram -> predicted language: 0 (P1=1.06e-92, P2=3.82e-120)

OOV Sentence 4: Bürsünteşmiş sümünle  gövüntei sismik modeli bozdu.
  2-gram -> predicted language: 1 (P1=2.49e-98, P2=9.51e-82)
  3-gram -> predicted language: 1 (P1=1.56e-97, P2=7.68e-86)

OOV Sentence 5: A gravistorm event fractured the cryphexial driftline entirely.
  2-gram -> predicted language: 0 (P1

**Question 2.3:** How well did your models handle out-of-vocabulary (OOV) samples? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 3: Statistical Analysis (20 points)

**Baseline (10 pts):** Statistical significance testing and comparison.  
**Creativity (10 pts):** Advanced analysis (confusion matrices, error analysis, etc.).

## 3.1: Statistical Significance Testing (10 points)

Use a paired t-test to compare the models. A p-value < 0.05 indicates a statistically significant difference.
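The statistic behind the paired t-test is t = mean(d) / (stdev(d) / sqrt(n)), where d holds the per-fold differences between the two models. The sketch below computes only the t-statistic (not the p-value) with the standard library; `paired_t_statistic` is an illustrative helper and the scores are hypothetical.

```python
import math
import statistics
from typing import List, Optional

def paired_t_statistic(a: List[float], b: List[float]) -> Optional[float]:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        return None  # zero variance in the differences: the statistic cannot be computed
    return statistics.fmean(diffs) / (sd / math.sqrt(len(diffs)))

# Hypothetical per-fold scores, not the notebook's actual results
scores_a = [0.98, 0.97, 0.99, 0.96]
scores_b = [0.96, 0.96, 0.97, 0.95]
t_value = paired_t_statistic(scores_a, scores_b)
print(t_value)
print(paired_t_statistic([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```

When two models score identically on every fold, every difference is zero and the statistic is undefined, which is why a metric such as recall can come out as `nan`.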

In [42]:
from scipy.stats import ttest_rel

print("===== Paired t-tests: 2-gram vs 3-gram =====\n")

for metric in ["accuracy", "precision", "recall", "f1_score"]:
    t_stat, p_value = ttest_rel(results_3gram[metric], results_2gram[metric])
    print(f"{metric.upper()}:\n")
    print(f"  t-statistic = {t_stat:.4f}")
    print(f"  p-value     = {p_value:.4f}")

    if p_value < 0.05:
        print("  → Statistically significant difference (p < 0.05)\n")
    else:
        print("  → NOT statistically significant (p ≥ 0.05)\n")

===== Paired t-tests: 2-gram vs 3-gram =====

ACCURACY:

  t-statistic = 3.1673
  p-value     = 0.0114
  → Statistically significant difference (p < 0.05)

PRECISION:

  t-statistic = 3.5904
  p-value     = 0.0058
  → Statistically significant difference (p < 0.05)

RECALL:

  t-statistic = nan
  p-value     = nan
  → NOT statistically significant (p ≥ 0.05)

F1_SCORE:

  t-statistic = 3.3371
  p-value     = 0.0087
  → Statistically significant difference (p < 0.05)



**Question 3.1:** Are the performance differences statistically significant? Explain what 'statistical significance' means in this context. (2-3 sentences)

**[YOUR ANSWER HERE]**

## 3.2: Advanced Analysis (10 points)

Perform deeper analysis such as per-language performance, misclassification patterns, etc.
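As one possible starting point for such an analysis, the sketch below counts (true, predicted) label pairs and collects the misclassified sentences. Both helpers are illustrative assumptions, and the sentences and labels are toy data rather than results from this notebook.

```python
from collections import Counter
from typing import List, Tuple

def confusion_counts(y_true: List[int], y_pred: List[int]) -> Counter:
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def misclassified(sentences: List[str], y_true: List[int],
                  y_pred: List[int]) -> List[Tuple[str, int, int]]:
    """Return (sentence, true, predicted) for every wrong prediction."""
    return [(s, t, p) for s, t, p in zip(sentences, y_true, y_pred) if t != p]

# Hypothetical toy data for illustration only
sents = ["a ghost story", "bilim ve deney", "it was cold", "evrenin yapısı"]
true_labels = [0, 1, 0, 1]
pred_labels = [0, 1, 1, 1]
counts = confusion_counts(true_labels, pred_labels)
print(counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)])
print(misclassified(sents, true_labels, pred_labels))
```

Inspecting the length and content of the misclassified sentences is a simple way to look for error patterns per language.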

In [None]:
# TODO: Your advanced analysis here

# [6 pts]

**Question 3.2:** What interesting patterns or insights did you discover from your results? (4-5 sentences)

**[YOUR ANSWER HERE]**

# Convert Your Colab Notebook to PDF

### Step 1: Download Your Notebook
- Go to **File → Download → Download .ipynb**
- Save the file to your computer

### Step 2: Upload to Colab
- Click the **📁 folder icon** on the left sidebar
- Click the **upload button**
- Select your downloaded .ipynb file

### Step 3: Run the Code Below
- **Uncomment the cell below** and run the cell
- This will take about 1-2 minutes to install required packages
- When prompted, type your notebook name (e.g. `gs_000000_as2.ipynb`) and press Enter

### The PDF will be automatically downloaded to your computer


In [None]:
# # Install required packages (this takes about 30 seconds)
# print("Installing PDF converter... please wait...")
# !apt-get update -qq
# !apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
# !pip install -q nbconvert

# print("\n" + "="*50)

# # Get notebook name from user
# notebook_name = input("\nEnter your notebook name: ")

# # Add .ipynb if missing
# if not notebook_name.endswith('.ipynb'):
#     notebook_name += '.ipynb'

# import os
# notebook_path = f'/content/{notebook_name}'

# # Check if file exists
# if not os.path.exists(notebook_path):
#     print(f"\n⚠ Error: '{notebook_name}' not found in /content/")
#     print("\nMake sure you uploaded the file using the folder icon (📁) on the left!")
# else:
#     print(f"\n✓ Found {notebook_name}")
#     print("Converting to PDF... this may take 1-2 minutes...\n")

#     # Convert the notebook to PDF
#     !jupyter nbconvert --to pdf "{notebook_path}"

#     # Download the PDF
#     from google.colab import files
#     pdf_name = notebook_name.replace('.ipynb', '.pdf')
#     pdf_path = f'/content/{pdf_name}'

#     if os.path.exists(pdf_path):
#         print("✓ SUCCESS! Downloading your PDF now...")
#         files.download(pdf_path)
#         print("\n✓ Done! Check your downloads folder.")
#     else:
#         print("⚠ Error: Could not create PDF")