## Load Bengali dataset

- Imports the ```load_dataset``` function from the Hugging Face ```datasets``` library <br>
- ```split="ben_Beng"``` loads the Bengali subset of the dataset <br>
- ```streaming=True``` tells Hugging Face to stream the data lazily, so records are fetched as they are iterated over instead of the full dataset being downloaded first

In [10]:
from datasets import load_dataset

dataset = load_dataset("ai4bharat/IndicCorpV2", "indiccorp_v2", split="ben_Beng", streaming=True)
first_item = next(iter(dataset))
print(first_item['text'])

নয়াদিল্লি: আজ ভ্যালেন্টাইনস ডে। এমন একটি দিনে প্রিয় মানুষকে হৃদয় উজাড় করা ভালোবাসার বার্তা দিতে পৌঁছে দিতে উন্মুখ অনেকেই। এমন একটি দিনে ভারতীয় দলের প্রাক্তন ক্রিকেটার সচিন তেন্ডুলকর জানালেন তাঁর প্রথম ভালোবাসার কথা। একটি ভিডিও পোস্ট করেছেন ভারতীয় দলের প্রাক্তন ব্যাটিং স্তম্ভ। ভিডিওতে আন্তর্জাতিক ক্রিকেট ১০০ শতরানের মালিককে নেটে অনুশীলন করতে দেখা যাচ্ছে। ওই ভিডিও শেয়ার করে সচিন লিখেছেন, আমার প্রথম ভালোবাসা।
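
Because the dataset is streamed, a few records can be inspected without walking the whole corpus. The sketch below assumes only that each record is a dict with a ```text``` field, as in the output above. ```peek_texts``` and the stand-in generator ```fake_stream``` are hypothetical names, used so the example runs without a download; with the real stream the equivalent call would be ```peek_texts(dataset, 3)```.

```python
from itertools import islice

def peek_texts(records, n):
    """Return the 'text' field of the first n records of any iterable."""
    return [record["text"] for record in islice(records, n)]

consumed = []  # tracks how many records the stand-in stream actually produced

def fake_stream():
    # Stand-in for the streamed dataset: yields one record at a time.
    for i in range(1_000_000):
        consumed.append(i)
        yield {"text": f"paragraph {i}"}

preview = peek_texts(fake_stream(), 3)
print(preview)
print(len(consumed))  # only the requested records were generated
```

```islice``` stops pulling from the iterator once it has ```n``` items, which is what keeps the preview cheap on a lazy source.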


## Sentence Tokenizer and word tokenizer

In [None]:
import re

def tokenize_sentences(text):
    sentences = re.split(r'[।.!?]+\s*', text)
    return [s.strip() for s in sentences if s.strip()]

def tokenize_words(text):
    pattern = r'''
        \d{1,2}[/-]\d{1,2}[/-]\d{2,4}     |  # Dates like 12/08/2025
        \d{4}-\d{2}-\d{2}                 |  # Dates like 2025-08-05
        https?://\S+                     |  # URLs with http or https
        www\.\S+                         |  # URLs starting with www
        [\w._%+-]+@[\w.-]+\.\w+          |  # Email addresses
        \d+\.\d+                         |  # Decimal numbers
        \d+                              |  # Whole numbers
        [\u0980-\u09FF]+                 |  # Bengali words
        [^\s\u0980-\u09FF]               # Any other single non-space, non-Bengali character (punctuation, symbols, Latin letters)
    '''
    return re.findall(pattern, text, re.VERBOSE)

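- ```import re``` loads Python's regular expression module, used here for pattern matching <br>
- The sentence tokenizer splits a paragraph into sentences at the Bengali danda (```।```) and at ```.```, ```!``` and ```?```. Because ```.``` is a delimiter, decimals, URLs and email addresses are cut apart if they reach this step <br>
- The word tokenizer breaks each sentence into individual units called **tokens**: dates, URLs, email addresses, numbers, Bengali words, and any remaining single characters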
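
A quick check on a short sample shows what the two functions return and how splitting on ```.``` interacts with decimals. This is a minimal sketch: ```tokenize_words``` here is a reduced stand-in that keeps only the decimal, integer, Bengali-word and single-character alternatives of the full pattern, and the sample strings are made up for illustration.

```python
import re

def tokenize_sentences(text):
    # Same splitting rule as in the cell above
    sentences = re.split(r'[।.!?]+\s*', text)
    return [s.strip() for s in sentences if s.strip()]

def tokenize_words(text):
    # Reduced stand-in: decimals, integers, Bengali words, other single characters
    pattern = r'\d+\.\d+|\d+|[\u0980-\u09FF]+|[^\s\u0980-\u09FF]'
    return re.findall(pattern, text)

sample = "আমি ভাত খাই। তুমি কি খাও?"
sentences = tokenize_sentences(sample)
tokens = [tokenize_words(s) for s in sentences]
print(sentences)
print(tokens)

# Caveat: '.' is a sentence delimiter, so a decimal is cut in two
# before the word tokenizer ever sees it.
price = "দাম 12.50 টাকা।"
split_first = [tokenize_words(s) for s in tokenize_sentences(price)]
direct = tokenize_words(price)
print(split_first)  # the decimal ends up in two different "sentences"
print(direct)       # applied directly, the decimal stays one token
```

The danda (U+0964) lies outside the Bengali block U+0980 to U+09FF, so when the word tokenizer is applied directly it comes out as a separate single-character token.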

## Tokenization and writing into file

In [None]:
output_file = "tokenized_bengali.txt"

sentence_list = []
word_list = []

with open(output_file, "w", encoding="utf-8") as f_out:
    count = 0
    for item in dataset:
        paragraph = item["text"]
        sentences = tokenize_sentences(paragraph)

        for sent in sentences:
            words = tokenize_words(sent)
            if words:
                f_out.write(" ".join(words) + "\n")
                sentence_list.append(words)
                word_list.extend(words)

        # Limit for quick testing
        count += 1
        if count >= 1000:
            break


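- ```output_file = "tokenized_bengali.txt"``` is where the tokenized sentences are saved, one sentence per line with tokens separated by spaces
- ```sentence_list``` holds one list of tokens per sentence and ```word_list``` holds all tokens in one flat list; both are used later to compute the corpus statistics
- The loop stops after 1000 paragraphs to keep the run short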
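
Since each line of the output file is one sentence with space-separated tokens, the two lists can be rebuilt from the file without streaming the dataset again. The sketch below relies on tokens containing no whitespace; ```load_tokenized``` is a hypothetical helper, and the demonstration writes to a temporary file rather than the real output.

```python
import os
import tempfile

def load_tokenized(path):
    """Rebuild (sentence_list, word_list) from a file with one tokenized sentence per line."""
    sentence_list, word_list = [], []
    with open(path, encoding="utf-8") as f_in:
        for line in f_in:
            words = line.split()  # tokens were joined with single spaces
            if words:
                sentence_list.append(words)
                word_list.extend(words)
    return sentence_list, word_list

# Demonstration on a temporary file with two made-up sentences
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_tokens.txt")
    with open(demo_path, "w", encoding="utf-8") as f_out:
        f_out.write("আমি ভাত খাই\n")
        f_out.write("দাম 12.50 টাকা\n")
    demo_sentences, demo_words = load_tokenized(demo_path)

print(demo_sentences)
print(len(demo_words))
```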

## Corpus Statistics

In [8]:
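def compute_statistics(sentences, words):
    # sentences: list of token lists (one per sentence); words: flat list of all tokens
    total_sentences = len(sentences)
    total_words = len(words)
    # len(word) counts Unicode code points, so Bengali vowel signs are counted separately
    total_chars = sum(len(word) for word in words)
    # The conditionals guard against division by zero on an empty corpus
    avg_sentence_len = total_words / total_sentences if total_sentences > 0 else 0
    avg_word_len = total_chars / total_words if total_words > 0 else 0
    # Type/token ratio: number of distinct tokens divided by the total number of tokens
    ttr = len(set(words)) / total_words if total_words > 0 else 0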

    return {
        'Total Sentences': total_sentences,
        'Total Words': total_words,
        'Total Characters': total_chars,
        'Average Sentence Length': round(avg_sentence_len, 2),
        'Average Word Length': round(avg_word_len, 2),
        'Type/Token Ratio (TTR)': round(ttr, 4)
    }
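
A toy corpus small enough to check by hand makes each statistic concrete. The function is repeated in condensed form so the block runs on its own, and the two sample sentences are made up for illustration.

```python
def compute_statistics(sentences, words):
    # Same logic as the cell above, condensed so this block is self-contained
    total_sentences = len(sentences)
    total_words = len(words)
    total_chars = sum(len(word) for word in words)
    return {
        'Total Sentences': total_sentences,
        'Total Words': total_words,
        'Total Characters': total_chars,
        'Average Sentence Length': round(total_words / total_sentences, 2) if total_sentences else 0,
        'Average Word Length': round(total_chars / total_words, 2) if total_words else 0,
        'Type/Token Ratio (TTR)': round(len(set(words)) / total_words, 4) if total_words else 0,
    }

toy_sentences = [["আমি", "ভাত", "খাই"], ["আমি", "যাই"]]
toy_words = [w for sent in toy_sentences for w in sent]
toy_stats = compute_statistics(toy_sentences, toy_words)
for key, value in toy_stats.items():
    print(f"{key}: {value}")
```

Here five tokens over two sentences give an average sentence length of 2.5, and four distinct tokens out of five give a TTR of 0.8. Each word counts as three characters because ```len``` counts code points, so a vowel sign such as the one in "আমি" is counted on its own.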


## Print Corpus Statistics

In [9]:
stats = compute_statistics(sentence_list, word_list)
print("\n--- Corpus Statistics ---")
for key, value in stats.items():
    print(f"{key}: {value}")


--- Corpus Statistics ---
Total Sentences: 1914
Total Words: 24216
Total Characters: 117045
Average Sentence Length: 12.65
Average Word Length: 4.83
Type/Token Ratio (TTR): 0.3351
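
The averages can be checked against the reported totals with plain arithmetic. The sketch below only reuses the numbers printed above; the count of distinct tokens is not printed, so the value derived from the rounded TTR is an approximation.

```python
total_sentences = 1914
total_words = 24216
total_chars = 117045

avg_sentence_len = round(total_words / total_sentences, 2)
avg_word_len = round(total_chars / total_words, 2)
print(avg_sentence_len, avg_word_len)

# Approximate, because the printed TTR is rounded to four decimals
approx_types = round(0.3351 * total_words)
print(approx_types)
```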
