<div style="text-align: center">
    <img src="https://y.yarn.co/879fb637-70a2-4697-9dbf-e078573403e6_text.gif" alt="Alt Text" style="display: block; margin: 0 auto;">
</div>



In [None]:
# Import libraries
import torch
import torch.nn as nn
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
from transformers import BertModel


In [None]:
# Download pytorch model

# Define the repository name and filename
repo_name = "zicsx/Hindi-Punk"
filename = "Hindi-Punk-model.pth"

# Download the file
model_path = hf_hub_download(repo_id=repo_name, filename=filename)




In [3]:
# Define Punctuation Model Class

class CustomTokenClassifier(nn.Module):
    def __init__(self, hidden_size, num_classes):
        super(CustomTokenClassifier, self).__init__()
        if num_classes > 0:
            self.classifier = nn.Linear(hidden_size, num_classes)
        else:
            self.classifier = None

    def forward(self, hidden_states):
        # Compare against None explicitly instead of relying on module truthiness
        if self.classifier is not None:
            return self.classifier(hidden_states)
        return None

class PunctuationModel(nn.Module):
    def __init__(self, bert_model_name, punct_num_classes, hidden_size):
        super(PunctuationModel, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.punct_classifier = CustomTokenClassifier(hidden_size, punct_num_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden_states = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]
        # The classifier wrapper always exists; it returns None itself when punct_num_classes == 0
        punct_logits = self.punct_classifier(hidden_states)
        return punct_logits


In [6]:
model = PunctuationModel(
    bert_model_name='google/muril-base-cased',
    punct_num_classes=5,  # Number of punctuation classes (including 'O')
    hidden_size=768       # Hidden size of the BERT model
)

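model.eval()  # Set the model to evaluation mode (disables dropout) before inference

# map_location="cpu" lets the checkpoint load on machines without a GPU
model.load_state_dict(torch.load(model_path, map_location="cpu"))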


<All keys matched successfully>

In [7]:
# Load and test the tokenizer
tokenizer = AutoTokenizer.from_pretrained("zicsx/Hindi-Punk", use_fast=True)
# test an example input
example_text = "आप कैसे हैं मुझे आपसे मिलकर खुशी हुई"
encoded_input = tokenizer(example_text, return_tensors="pt")
print(encoded_input['input_ids'])


tensor([[  104,  1840,  6345,  1145,  4254, 48690, 13570, 20597,  2044,   105]])


In [11]:
# Function to perform inference and get punctuation predictions
# (despite its name, the model has a punctuation head only; no capitalization is predicted)
def predict_punctuation_capitalization(model, text, tokenizer):
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Determine the device to use (CPU or GPU)
    device = next(model.parameters()).device

    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Perform inference
    with torch.no_grad():
        punct_logits = model(**inputs)

    # Convert logits to probabilities and get the indices of the highest probability labels
    punct_probs = torch.nn.functional.softmax(punct_logits, dim=-1)
    punct_predictions = torch.argmax(punct_probs, dim=-1)

    return punct_predictions

# Function to map predictions to labels and combine them with the original text
def combine_predictions_with_text(text, tokenizer, punct_predictions, punct_index_to_label):
    # Re-tokenize with the same truncation setting used at prediction time,
    # so that tokens and predictions stay aligned for long inputs
    encoded = tokenizer(text, return_tensors='pt', truncation=True)
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])

    # Combine tokens with their predictions
    combined = []
    current_word = ''
    current_punct = ''
    for token, punct in zip(tokens, punct_predictions.squeeze()):
        # Skip special tokens
        if token in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
            continue

        # Remove "##" prefix from subword tokens
        if token.startswith("##"):
            token = token[2:]
        else:
            # A new word starts here: flush the previous word together with its punctuation
            if current_word:
                combined.append(current_word + current_punct)
                current_word = ''
                current_punct = ''
        
        current_word += token

        # Update the current punctuation if predicted
        if punct_index_to_label[punct.item()] != 'O':
            current_punct = punct_index_to_label[punct.item()]

    # Append the last word and punctuation (if any) to the combined text
    combined.append(current_word + current_punct)

    return ' '.join(combined)


In [12]:
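# Index-to-label mapping for the punctuation classes.
# Index 0 is the "no punctuation" class. It maps to '' rather than 'O', so the
# `!= 'O'` check in combine_predictions_with_text never skips it and the
# prediction for the last subword of each word is the one that gets appended.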
punct_index_to_label = {0: '', 1: '!', 2: ',', 3: '?', 4: '।'}

# Example usage
text = "सलामअलैकुम कहाँ जा रहे हैं जी आओ बैठो छोड़ देता हूँ हेलो एक्सक्यूज मी आपका क्या नाम है तुम लोगों को बाद में देख लेता हूँ"

# Predict punctuation
punct_predictions = predict_punctuation_capitalization(model, text, tokenizer)

# Combine predictions with the original text
combined_text = combine_predictions_with_text(text, tokenizer, punct_predictions, punct_index_to_label)
print("Combined Text:", combined_text)


Combined Text: सलामअलैकुम, कहाँ जा रहे हैं जी? आओ, बैठो, छोड़ देता हूँ? हेलो, एक्सक्यूज, मी आपका क्या नाम है? तुम लोगों को बाद में देख लेता हूँ।
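
The word-merging step inside `combine_predictions_with_text` can be exercised without downloading the model. The following is a minimal sketch, assuming hand-written WordPiece tokens and label ids (they are illustrative, not real tokenizer or model output); `merge_wordpieces` is a hypothetical helper that mirrors the loop in the function above.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def merge_wordpieces(tokens, label_ids, index_to_label):
    """Join WordPiece tokens into words and append the label of each word's last piece."""
    words = []
    current_word, current_punct = "", ""
    for token, label_id in zip(tokens, label_ids):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: glue it onto the current word
            token = token[2:]
        elif current_word:
            # A new word starts: flush the previous one
            words.append(current_word + current_punct)
            current_word = ""
        current_word += token
        # The most recent piece decides the punctuation for the whole word
        current_punct = index_to_label[label_id]
    if current_word:
        words.append(current_word + current_punct)
    return " ".join(words)

index_to_label = {0: "", 1: "!", 2: ",", 3: "?", 4: "।"}
tokens = ["[CLS]", "नम", "##स्ते", "दुनिया", "[SEP]"]   # hypothetical token split
label_ids = [0, 2, 0, 4, 0]                            # hypothetical predictions
print(merge_wordpieces(tokens, label_ids, index_to_label))
```

In this toy input the comma predicted on the first piece of the first word is discarded, because the second piece of the same word predicts class 0. Only the danda on the last word survives.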


First, let's see how the tokenizer behaves and how pre-processing and post-processing steps can fix the issues.

In [25]:
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("zicsx/Hindi-Punk", use_fast=True)
# Test cases with Hindi text strings, newline separators, and special characters
test_cases = [
    "नमस्ते दुनिया।\nकैसे हो?",
    "यह एक परीक्षण है।\n#विशेष_वर्ण",
    "प्रोग्रामिंग में नया पंक्ति विभाजक।\nनई लाइन।\nऔर एक।",
    "पहली पंक्ति।\nदूसरी पंक्ति।\nतीसरी पंक्ति।",
    "विराम चिह्न: , ; : ' \" ( ) [ ] { }",
    "Hello नमस्ते 123 १२३ + - = * /",
    "यह एक स्माइली है :) 😉",
    "यह एक ईमोजी है 🌟🚀"
]

# Tokenize and decode each test case
for test_case in test_cases:
    encoded = tokenizer.encode(test_case)
    decoded = tokenizer.decode(encoded)
    print(f"Original: {test_case}")
    print(f"Decoded: {decoded}")
    print("\n")


Original: नमस्ते दुनिया।
कैसे हो?
Decoded: [CLS] नमस्ते दुनिया । कैसे हो? [SEP]


Original: यह एक परीक्षण है।
#विशेष_वर्ण
Decoded: [CLS] यह एक परीक्षण है । # विशेष _ वर्ण [SEP]


Original: प्रोग्रामिंग में नया पंक्ति विभाजक।
नई लाइन।
और एक।
Decoded: [CLS] प्रोग्रामिंग में नया पंक्ति विभाजक । नई लाइन । और एक । [SEP]


Original: पहली पंक्ति।
दूसरी पंक्ति।
तीसरी पंक्ति।
Decoded: [CLS] पहली पंक्ति । दूसरी पंक्ति । तीसरी पंक्ति । [SEP]


Original: विराम चिह्न: , ; : ' " ( ) [ ] { }
Decoded: [CLS] विराम चिह्न :, ; :'" ( ) [ ] { } [SEP]


Original: Hello नमस्ते 123 १२३ + - = * /
Decoded: [CLS] Hello नमस्ते 123 १२३ + - = * / [SEP]


Original: यह एक स्माइली है :) 😉
Decoded: [CLS] यह एक स्माइली है : ) [UNK] [SEP]


Original: यह एक ईमोजी है 🌟🚀
Decoded: [CLS] यह एक ईमोजी है [UNK] [SEP]




Based on the outputs from the test cases, we can conclude the following about how the tokenizer handles text:

1. **New Line Characters**: The tokenizer does not preserve new line characters (`\n`). Text that was originally separated by new lines is merged into a single line in the decoded output.

2. **Punctuation and Special Characters**: Spacing around punctuation is not preserved in the decoded text. The Hindi full stop (`।`), the colon, the hashtag (`#`), and the underscore (`_`) come back separated from the adjacent words by spaces, while `?`, `,`, and quote characters come back attached to whatever precedes them (so `: ,` becomes `:,`). Brackets and arithmetic symbols that were already space-separated in the input are unchanged.

3. **Mixed English and Hindi Text**: The tokenizer handles text that mixes English words, Hindi words, and both Latin and Devanagari digits. The decoded text matches the original in this respect.

4. **Emoticons and Emoji**: The tokenizer does not handle emoticons and emoji well. The emoticon `:)` is split into `:` and `)`, and emoji are replaced by `[UNK]` (the unknown token): `😉` becomes `[UNK]`, and the adjacent pair `🌟🚀` collapses into a single `[UNK]`. The vocabulary therefore appears to have no entries for these emoji, and the original characters cannot be recovered from the token IDs.

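5. **Special Tokens**: The tokenizer adds `[CLS]` at the beginning and `[SEP]` at the end of each encoded sequence, so both appear in the decoded text. In BERT-style models, `[CLS]` marks the start of the input (its representation is commonly used for sequence-level classification) and `[SEP]` marks the end of a segment. Passing `skip_special_tokens=True` to `tokenizer.decode` omits them from the decoded string.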

6. **Consistency in Decoding**: Aside from the points above, the decoded text keeps the original words and their order.

These behaviors matter for punctuation restoration because the model's predictions are attached to tokens: lost new lines, re-spaced punctuation, and `[UNK]` tokens all have to be accounted for when mapping predictions back onto the original text.
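
These observations can also be checked mechanically. The following is a minimal sketch, assuming the decoded strings are already available (the literals are copied from the outputs above); `roundtrip_report` is a hypothetical helper, not part of the tokenizer API.

```python
import re

SPECIAL_TOKEN_RE = re.compile(r"\[(?:CLS|SEP|PAD)\]")

def strip_whitespace(text):
    """Remove all whitespace, including new lines."""
    return "".join(text.split())

def roundtrip_report(original, decoded):
    """Compare an input string with its encode/decode round trip."""
    body = SPECIAL_TOKEN_RE.sub("", decoded)
    return {
        # True if some input characters had no vocabulary entry
        "has_unk": "[UNK]" in body,
        # True if the round trip changed nothing except spacing and new lines
        "only_whitespace_changed": strip_whitespace(original) == strip_whitespace(body),
    }

print(roundtrip_report("नमस्ते दुनिया।\nकैसे हो?", "[CLS] नमस्ते दुनिया । कैसे हो? [SEP]"))
# {'has_unk': False, 'only_whitespace_changed': True}
print(roundtrip_report("यह एक ईमोजी है 🌟🚀", "[CLS] यह एक ईमोजी है [UNK] [SEP]"))
# {'has_unk': True, 'only_whitespace_changed': False}
```

A result of `only_whitespace_changed: True` means post-processing only has to repair spacing. `has_unk: True` means some characters must be protected before tokenization, since they cannot be restored afterwards.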

---

To address the issues identified with the tokenizer, we can implement pre-processing and post-processing steps. Here are some suggestions:

### Pre-processing:
1. **Replace New Line Characters**: Replace new line characters (`\n`) with a special token or a unique string that we can easily identify and convert back to new lines in post-processing.

2. **Handle Special Characters and Punctuation**: If certain punctuation marks are being incorrectly separated, consider replacing them with equivalent tokens or merging them with adjacent words before tokenization.

3. **Encode Emoticons and Emoji**: Convert emoticons and emoji into text representations or special tokens that the tokenizer can handle.
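
The pre-processing steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `NEWLINEMARK` and `EMOJIMARK` are hypothetical placeholder strings, the emoji pattern is a rough range that is not exhaustive, and whether the placeholders pass through the tokenizer and the model without side effects still has to be verified.

```python
import re

NEWLINE_MARK = "NEWLINEMARK"  # hypothetical placeholder for "\n"
EMOJI_MARK = "EMOJIMARK"      # hypothetical placeholder for one emoji
# Rough emoji ranges (assumption): pictographs plus misc symbols and dingbats
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def protect(text):
    """Replace new lines and emoji with placeholders; return the text and the emoji in order."""
    emojis = EMOJI_RE.findall(text)
    text = EMOJI_RE.sub(f" {EMOJI_MARK} ", text)
    text = text.replace("\n", f" {NEWLINE_MARK} ")
    # Collapse the extra spaces introduced around the placeholders
    text = re.sub(r"[ \t]+", " ", text).strip()
    return text, emojis

protected, emojis = protect("यह एक ईमोजी है 🌟🚀\nनई लाइन")
print(protected)  # यह एक ईमोजी है EMOJIMARK EMOJIMARK NEWLINEMARK नई लाइन
print(emojis)     # ['🌟', '🚀']
```

Keeping the emoji in a list, in order of appearance, is what allows them to be put back later. Adjacent emoji end up separated by a space, which is a limitation of this sketch.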

### Post-processing:
1. **Restore New Line Characters**: Convert the special tokens or unique strings used to represent new lines back into actual new line characters (`\n`).

2. **Adjust Punctuation and Special Characters**: If punctuation or special characters were modified during pre-processing, revert them to their original form.

3. **Handle Special Tokens (`[CLS]` and `[SEP]`)**: Remove the `[CLS]` and `[SEP]` tokens added by the tokenizer if the task does not need them.

4. **Decode Emoticons and Emoji**: If emoticons and emoji were converted to text representations or special tokens, convert them back to their original form.
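
The post-processing steps above can be sketched as follows. This is a minimal sketch, assuming the input was pre-processed so that new lines appear as the hypothetical placeholder `NEWLINEMARK`, each emoji appears as the hypothetical placeholder `EMOJIMARK`, and the original emoji were saved in a list in order of appearance. The example string is hand-written, not real model output.

```python
import re

NEWLINE_MARK = "NEWLINEMARK"  # hypothetical placeholder for "\n"
EMOJI_MARK = "EMOJIMARK"      # hypothetical placeholder for one emoji

def postprocess(decoded, emojis):
    """Undo the placeholders and tidy up a decoded string."""
    # 1. Drop the special tokens added by the tokenizer
    text = re.sub(r"\[(?:CLS|SEP|PAD)\]", "", decoded)
    # 2. Re-attach sentence punctuation that was separated by a space
    text = re.sub(r"\s+([।?!,])", r"\1", text)
    # 3. Put the saved emoji back, in order; leave the placeholder if the list runs out
    remaining = iter(emojis)
    text = re.sub(re.escape(EMOJI_MARK), lambda m: next(remaining, m.group(0)), text)
    # 4. Turn new-line placeholders back into real new lines
    text = re.sub(rf"\s*{NEWLINE_MARK}\s*", "\n", text)
    return text.strip()

decoded = "[CLS] यह एक परीक्षण है । EMOJIMARK NEWLINEMARK नई लाइन । [SEP]"
print(postprocess(decoded, ["🌟"]))
```

The punctuation set in step 2 is limited to the danda and the marks the model predicts. Extending it to `:` or `)` would also glue emoticon fragments onto the preceding word.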

By applying these pre-processing and post-processing steps, we can mitigate some of the issues observed with the tokenizer, so that the input text is handled more accurately and the output stays closer to the original formatting.