# Building a Unigram Language Model in NLP - Japanese

## By Brea Koenes

### Overview


Develop a program that calculates the probability of each word within a provided Japanese text corpus using a unigram model.

Steps:

- 1 - Importing the Data

- 2 - Data Cleaning and Tokenization: Preprocess the text by removing punctuation, English text, and stop words to ensure uniformity. Tokenize the text and split it into individual characters, which serve as the unigrams whose frequencies are analyzed.

- 3 - Probability Calculation: Determine the probability of each word by dividing its frequency by the total number of words in the text.

- 4 - Presentation: Display the top 10 most probable words along with their corresponding probabilities.

### 01 - Importing the Data

Data: Phillips Oppenheim's novel "入れかわった男".

In [4]:
# Read in data
data = 'pg34158.txt'

# Each element is a line from the book
oppenheim = []
# Assumes the file is UTF-8 encoded; adjust the encoding if it is not
with open(data, encoding='utf-8') as file:
    for line in file:
        line = line.rstrip('\n')
        oppenheim.append(line)
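
Because the corpus is Japanese, it helps to decode the file explicitly rather than rely on the platform default. The following is a minimal sketch of the same read step using `pathlib`, assuming UTF-8 input; the temporary file and its two sample lines are made up for illustration and stand in for `pg34158.txt`.

```python
import tempfile
from pathlib import Path

# Hypothetical two-line sample standing in for the real book file.
sample_text = "吾輩は猫である。\n名前はまだ無い。\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(sample_text, encoding="utf-8")

    # read_text decodes with the given encoding; splitlines drops the line terminators.
    lines = sample_path.read_text(encoding="utf-8").splitlines()

print(lines)
```

Note that `splitlines` also breaks on a few line-boundary characters other than `\n`, so it is not a strict drop-in for `rstrip('\n')` on every input.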

### 02 - Data Cleaning

In traditional Japanese writing, spaces are not used to separate words; text is typically a continuous run of characters. I therefore split the text into individual Japanese characters (kana and kanji), and each character is treated as one unigram.

Create a function that takes a list of text lines as input and returns a list of cleaned tokens. Each token in this list should be stripped of punctuation and English text, and be free of ['stop words'](https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47).

- For the 'stop words' the list is: ['の', 'に', 'は', 'を', 'た', 'が', 'で', 'て', 'と', 'し', 'れ', 'さ', 'ある', 'いる', 'も', 'する', 'から', 'な', 'こと', 'として', 'い', 'や', 'する', 'など', 'なり', 'なく', 'まで', 'だ', 'へ', 'か', 'だっ', 'その', 'あっ', 'よう', 'また', 'もの', 'という', 'あり', 'まし', 'ませ', 'う', 'ない', 'ながら', 'なけれ', 'なし', 'ず', 'なっ', 'れる', 'られ', 'なる', 'べき', 'ほど', 'ます', 'てる', 'なら', 'せる', 'され', 'して']

- For the punctuation the list is: ['。', '、', '？', '！', '「', '」', '『', '』', '（', '）', '；', '：', '-']
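
Several entries in the stop-word list are longer than one character (for example 'ある' or 'として'), so they can only match when whole tokens are compared rather than single characters. The sketch below illustrates that difference. The shortened stop-word set and the hand-written token list are illustrative assumptions, not real tokenizer output.

```python
# Shortened stop-word set for illustration; the full list is given above.
stop_words = {"の", "に", "は", "を", "が", "ある", "として"}

# Hand-written tokens standing in for the output of a morphological tokenizer.
tokens = ["本", "が", "ある"]

# Filtering whole tokens removes the multi-character stop word 'ある'.
kept_tokens = [token for token in tokens if token not in stop_words]

# Filtering after splitting into characters only compares single characters,
# so 'あ' and 'る' survive even though 'ある' is a stop word.
chars = [char for token in tokens for char in token]
kept_chars = [char for char in chars if char not in stop_words]

print(kept_tokens)  # ['本']
print(kept_chars)   # ['本', 'あ', 'る']
```

Which level to filter at is a design choice: token-level filtering honours the whole list, while character-level filtering effectively uses only its one-character entries.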

In [5]:
import re
def is_japanese(word):
    """
    Determine if a word is written in Japanese script.

    This function checks if the given word contains any Japanese characters (Hiragana, Katakana, or Kanji)
    and does not contain any Latin alphabet characters. If the word contains Japanese characters and no
    Latin characters, it is considered a Japanese word.

    Parameters:
    word (str): The word to check.

    Returns:
    bool: True if the word is Japanese, False otherwise.
    """
    # Regex to identify if the word contains any Japanese character
    # Hiragana: U+3040-U+309F, Katakana: U+30A0-U+30FF, Kanji: U+4E00-U+9FAF
    jap_regex = r'[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FAF]'

    # Regex to identify if the word contains any Latin alphabet character
    eng_regex = r'[a-zA-Z]'

    # A word counts as Japanese if it contains a Japanese character and no Latin character
    return bool(re.search(jap_regex, word) and not re.search(eng_regex, word))
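
To see which characters the three Unicode ranges actually cover, the following sketch classifies single characters by range. `SCRIPT_PATTERNS` and `script_of` are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
import re

# Unicode ranges mirroring the ones listed in the is_japanese comment.
SCRIPT_PATTERNS = {
    "hiragana": re.compile(r"[\u3040-\u309F]"),
    "katakana": re.compile(r"[\u30A0-\u30FF]"),
    "kanji": re.compile(r"[\u4E00-\u9FAF]"),
}


def script_of(char):
    """Return the name of the first range that matches char, or 'other'."""
    for name, pattern in SCRIPT_PATTERNS.items():
        if pattern.match(char):
            return name
    return "other"


for char in ["あ", "ア", "ー", "漢", "々", "a", "。"]:
    print(char, script_of(char))
```

The long-vowel mark 'ー' (U+30FC) sits inside the Katakana block, while the iteration mark '々' (U+3005) and the full stop '。' (U+3002) fall outside all three ranges.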

In [15]:
# Imports
from janome.tokenizer import Tokenizer

# Takes list of text as input and returns a list of cleaned tokens.
def cleanText(texts):
    # Initialize
    punctuation_list = {'。', '、', '？', '！', '「', '」', '『', '』', '（', '）', '；', '：', '-'}
    stop_words = {'の', 'に', 'は', 'を', 'た', 'が', 'で', 'て', 'と', 'し', 'れ', 'さ', 'ある', 'いる', 'も', 'する', 'から', 'な', 'こと', 'として', 'い', 'や', 'する', 'など', 'なり', 'なく', 'まで', 'だ', 'へ', 'か', 'だっ', 'その', 'あっ', 'よう', 'また', 'もの', 'という', 'あり', 'まし', 'ませ', 'う', 'ない', 'ながら', 'なけれ', 'なし', 'ず', 'なっ', 'れる', 'られ', 'なる', 'べき', 'ほど', 'ます', 'てる', 'なら', 'せる', 'され', 'して'}
    
    cleaned_texts = []

    # Initialize tokenizer
    tokenizer = Tokenizer()

    for text in texts:
        
        # Remove punctuation first
        text = ''.join([char if char not in punctuation_list else ' ' for char in text])

        # Tokenize text and remove non-japanese
        tokens = [token.surface for token in tokenizer.tokenize(text) if is_japanese(token.surface)]

        # Split each token into individual characters
        char_list = [char for token in tokens for char in token]

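        # Remove stop words. The comparison is made on single characters,
        # so only the one-character entries in stop_words can match here.
        filtered_chars = [char for char in char_list if char not in stop_words]

        # Remove the iteration mark '々' (U+3005), which can remain inside
        # tokens that passed the is_japanese check
        filtered_chars = [char for char in filtered_chars if char != '々']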

        cleaned_texts.append(filtered_chars)

    return cleaned_texts

oppenheim = cleanText(oppenheim)
oppenheim[81:83]

[['そ', 'ん', '長', 'ド', 'ミ', 'ニ', 'ー', 'つ', 'ぶ', '総', '督', '何'],
 ['植', '民', '地', '領', '陸', '軍', '司', '令', '官', 'す', '特', '別', '任', '務', '受', 'け', 'こ', '地', 'お', 'ら', 'る', 'す']]

### 03 - Probability Calculation

Calculate the probability of occurrence for each token in the text (here, each character, since the cleaned lines were split into characters). The probability of a token is its frequency divided by the total number of tokens in the text.
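
As a small worked example of this relative-frequency estimate, the sketch below computes the probabilities by hand with `collections.Counter`. The toy corpus is made up for illustration; on the same input, the result is expected to match what an MLE unigram model reports.

```python
from collections import Counter

# Toy corpus: each inner list is one cleaned line of tokens.
corpus = [["猫", "犬", "猫"], ["鳥", "猫"]]

# Count every token across all lines.
counts = Counter(token for line in corpus for token in line)
total = sum(counts.values())

# Maximum-likelihood estimate: relative frequency of each token.
probs = {token: count / total for token, count in counts.items()}

print(probs)
```

Here '猫' occurs 3 times out of 5 tokens, so its probability is 3/5 = 0.6, and the probabilities of all tokens sum to 1.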

In [16]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Build unigram (order 1) training data and vocabulary from the cleaned oppenheim lines
train_data, padded_vocab = padded_everygram_pipeline(1, oppenheim)

# Transform padded_vocab into a list
padded_vocab = list(padded_vocab)

# Create an MLE unigram model
unigram_model = MLE(1)

# Fit the model using train_data and padded_vocab
unigram_model.fit(train_data, padded_vocab)

# Construct a dictionary of unigram probabilities, scoring each distinct token once
unigram_probs = {word: unigram_model.score(word) for word in dict.fromkeys(padded_vocab)}

### 04 - Presentation

Display the top 10 most probable words along with their corresponding probabilities.

In [17]:
top_10_probs = sorted(unigram_probs.items(), key=lambda x: x[1], reverse=True)[:10]

for word, prob in top_10_probs:
    print(f"{word}:{prob}")

っ:0.03913621814856383
る:0.02861952861952862
ら:0.025501932909340316
こ:0.024036662925551816
ー:0.019131645674855553
ま:0.018404206675811614
あ:0.017915783347882113
す:0.017915783347882113
り:0.017271480234443196
わ:0.017177952363137548
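
When only the top few entries are needed, sorting the whole dictionary is not required. A possible alternative is `heapq.nlargest`, sketched here on made-up probabilities (the values are illustrative, not taken from the model).

```python
import heapq

# Made-up probabilities for illustration only.
probs = {"猫": 0.5, "犬": 0.25, "鳥": 0.15, "魚": 0.1}

# nlargest returns the k items with the highest key value, largest first.
top_2 = heapq.nlargest(2, probs.items(), key=lambda item: item[1])

for token, prob in top_2:
    print(f"{token}: {prob:.4f}")
```

For a vocabulary of a few thousand characters the difference is negligible, so the `sorted(...)[:10]` form is equally reasonable here.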


### 05 - Export the Model

In [9]:
import pickle

# Save the model to disk
with open('unigram_model_japanese.pkl', 'wb') as file:
    pickle.dump(unigram_model, file)
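
To confirm that an export can be read back, a round trip through `pickle.load` is a quick check. The sketch below uses a plain dictionary as a stand-in for the fitted model and a temporary directory instead of the real output path; both are assumptions made so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; in the notebook this would be the fitted unigram model.
model = {"猫": 0.6, "犬": 0.2, "鳥": 0.2}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"

    # Serialize to disk in binary mode.
    with open(model_path, "wb") as file:
        pickle.dump(model, file)

    # Deserialize; only unpickle files that come from a trusted source.
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

print(restored == model)
```

Unpickling an object requires its class to be importable, so loading the real model file needs `nltk` installed in the loading environment.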