# Turkish Diacritisation | YZV 405E NLP Term Project

Author: Bora Boyacıoğlu

Student ID: 150200310

## Step 5: Combination of the Model and the Rule-Based Algorithm

In this final notebook, I combine the model's predictions with the candidate probabilities produced by the rule-based algorithm.

Import necessary libraries.

In [45]:
import csv
import json

import datetime as dt
from typing import Dict, List

from dataset import DiacritizationDataset
from utils.main_utils import *

In [46]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Define the selection function.

In [47]:
def select(probs: Dict[str, Dict[str, int]], sent: List[str], pred: List[str]) -> str:
    """Pick a diacritized form for every word of a sentence.

    Args:
        probs (Dict[str, Dict[str, int]]): Maps each undiacritized word to its
            candidate diacritized forms and their scores.
        sent (List[str]): Tokens of the sentence to diacritize.
        pred (List[str]): Tokens of the model's prediction for the sentence.

    Returns:
        str: Diacritized sentence.
    """
    diacritized = []

    for word in sent:

        ### 1st Case: The word has no entry in probs, so keep it as is.
        if word not in probs:
            diacritized.append(word)
            continue

        ### 2nd Case: There is only one candidate.
        if len(probs[word]) == 1:
            diacritized.append(list(probs[word].keys())[0])
            continue

        ### 3rd Case: There are multiple candidates.

        # Get the possible candidates.
        possible = list(probs[word].keys()) + [word]  # Add the word itself just in case.
        in_pred = False

        for candidate in possible:
            # Check if the candidate appears anywhere in the prediction.
            if candidate not in pred:
                continue

            ## 3.1. Case: Select the candidate and break.
            diacritized.append(candidate)
            in_pred = True
            break

        ## 3.2. Case: Fall back to the most probable candidate.
        if not in_pred:
            most_probable = max(probs[word], key=probs[word].get)
            diacritized.append(most_probable)

    # Return the diacritized sentence.
    return ' '.join(diacritized)
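
To make the three cases concrete, here is a minimal sketch on toy data. The word table, the sentence and the prediction are all made up for illustration, and `choose` is a hypothetical one-word condensation of the selection logic above, not the function used in the project.

```python
from typing import Dict, List


def choose(probs: Dict[str, Dict[str, float]], word: str, pred: List[str]) -> str:
    """Hypothetical one-word version of the selection logic."""
    candidates = probs.get(word)

    # Case 1: no entry for this word, keep it unchanged.
    if candidates is None:
        return word

    # Case 2: a single candidate, nothing to decide.
    if len(candidates) == 1:
        return next(iter(candidates))

    # Case 3.1: prefer a candidate that the model also produced.
    for cand in list(candidates) + [word]:
        if cand in pred:
            return cand

    # Case 3.2: otherwise fall back to the highest-scoring candidate.
    return max(candidates, key=candidates.get)


# Toy inputs (assumed values, not taken from the real data files).
toy_probs = {
    'agac': {'ağaç': 1.0},
    'acik': {'açık': 0.7, 'acık': 0.3},
    'oldu': {'oldu': 0.6, 'öldü': 0.4},
}
sent = ['agac', 'acik', 'oldu', 've']
pred = ['ağaç', 'acık', 'xxx', 've']

result = [choose(toy_probs, w, pred) for w in sent]
print(' '.join(result))
```

Here `acik` resolves to the lower-scored `acık` because the model predicted it, while `oldu` falls back to its highest-scoring candidate because none of its candidates appear in the prediction. Note that the membership test looks at the whole predicted sentence rather than the token at the same position, so a candidate predicted for a different word also counts as a match.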

Load the probabilities and predictions.

In [48]:
# Load "data/comb/probs.json".
with open('data/comb/probs.json', 'r', encoding='utf-8') as f:
    probs = json.load(f)

# Load "data/comb/predictions.csv".
preds = []
with open('data/comb/predictions.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    
    # Skip the header.
    next(reader)
    
    # Read the predictions.
    for row in reader:
        splits = row[1].split()
        preds.append(splits)
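
The loading code implies a particular shape for both files: a JSON object mapping each undiacritized word to its candidates, and a CSV with a header row whose second column holds the predicted sentence. The sketch below parses small in-memory stand-ins with the same steps. The contents, and the `ID,Sentence` header in particular, are assumptions for illustration and not the real files.

```python
import csv
import io
import json

# In-memory stand-ins for the two files (assumed contents).
probs_text = '{"agac": {"ağaç": 1.0}, "acik": {"açık": 0.7, "acık": 0.3}}'
preds_text = 'ID,Sentence\n0,ağaç açık\n1,açık\n'

toy_probs = json.loads(probs_text)

toy_preds = []
reader = csv.reader(io.StringIO(preds_text))
next(reader)  # Skip the header row.
for row in reader:
    # Column 1 holds the predicted sentence; split it into tokens.
    toy_preds.append(row[1].split())

print(toy_probs['acik'])
print(toy_preds)
```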

Load the non-filtered test data.

In [49]:
# Load "data/test.csv".
test_data = DiacritizationDataset('data/test.csv', type='test', filter=False)

# Normalize the test data.
normalize(test_data)

# Tokenize the test data.
tokenize(test_data)

length = len(test_data)

Normalizing text 100.00%
Tokenizing... 100.00%


Do the selection.

In [50]:
diacritized = []

for i in range(length):
    sent = test_data.get(i, 'und')
    pred = preds[i]
    
    d = select(probs, sent, pred)
    diacritized.append(d)
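
The loop pairs sentence `i` with `preds[i]` purely by index, so it relies on the predictions file having exactly one row per test sentence. A small guard could make that assumption explicit before the loop runs. `check_alignment` below is a hypothetical helper, not part of `utils.main_utils`.

```python
from typing import List, Sequence


def check_alignment(n_sentences: int, preds: Sequence[List[str]]) -> None:
    """Raise if the number of predictions differs from the number of sentences."""
    if n_sentences != len(preds):
        raise ValueError(
            f'Expected {n_sentences} predictions, got {len(preds)}.'
        )


# Passes silently when the counts agree.
check_alignment(2, [['ağaç'], ['açık']])

# Raises when a prediction row is missing.
try:
    check_alignment(3, [['ağaç'], ['açık']])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```

In the notebook this would be called as `check_alignment(length, preds)` right before the selection loop.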

Save the diacritised outputs.

In [51]:
timestamp = dt.datetime.now().strftime('%Y%m%d%H%M%S')
save_file = f'submits/{timestamp}.csv'

with open(save_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['ID', 'Sentence'])
    
    for i, sentence in enumerate(diacritized):
        if not sentence:
            sentence = ' '
        
        writer.writerow([i, sentence])
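
As a rough check of the output format, the sketch below writes two toy rows to a temporary file in the same way and reads them back. The sentences and the file name are made up, and the explicit UTF-8 encoding is an assumption about what the submission expects.

```python
import csv
import os
import tempfile

rows = ['ağaç açık', '']  # The second sentence is empty on purpose.

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_submit.csv')

    # Write the rows the same way as the cell above.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['ID', 'Sentence'])
        for i, sentence in enumerate(rows):
            # Replace an empty sentence with a single space.
            writer.writerow([i, sentence or ' '])

    # Read the file back to inspect what was stored.
    with open(path, 'r', newline='', encoding='utf-8') as f:
        read_back = list(csv.reader(f))

print(read_back)
```

The empty sentence comes back as a single space, so every ID keeps a non-empty `Sentence` field.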