<a href="https://colab.research.google.com/github/dominiksakic/sentimentAnalysisJp/blob/main/prep_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Vectorizing text means transforming text into numeric tensors.

- It takes three steps: standardize the text, tokenize it, and convert each token into a numeric vector (typically by indexing the tokens first).

- There are 3 different ways to tokenize the standardized text:

1. Word-level tokenization
2. N-gram tokenization
3. Character-level tokenization
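
As a minimal sketch of the three options, the snippet below tokenizes one already-standardized English sentence in each way. The helper names (`word_tokens`, `ngram_tokens`, `char_tokens`) and the sample sentence are illustrative assumptions. Whitespace splitting only works for languages that separate words with spaces, so Japanese needs a dedicated tokenizer instead.

```python
def word_tokens(text):
    # Word-level: split on whitespace (assumes space-separated words)
    return text.split()

def ngram_tokens(text, n=2):
    # N-gram: every run of n consecutive words, joined back into one string
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    # Character-level: every character is its own token (spaces dropped here)
    return [ch for ch in text if not ch.isspace()]

sentence = "the cat sat"  # hypothetical, already standardized
print(word_tokens(sentence))   # ['the', 'cat', 'sat']
print(ngram_tokens(sentence))  # ['the cat', 'cat sat']
print(char_tokens(sentence))   # ['t', 'h', 'e', 'c', 'a', 't', 's', 'a', 't']
```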

- A model can either use word order or ignore it:
1. Sequence models (word order matters)
2. Bag-of-words models (word order is ignored)
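
The difference shows up as soon as two sentences contain the same words in a different order. In the sketch below (the two token lists are made-up examples), the index sequences differ while the bags of words are identical.

```python
from collections import Counter

# Hypothetical pre-tokenized sentences containing the same words in a different order
a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]

# Shared vocabulary: each new token gets the next free index
vocabulary = {}
for token in a + b:
    vocabulary.setdefault(token, len(vocabulary))

# Sequence view: a list of indices, so order is preserved
seq_a = [vocabulary[t] for t in a]
seq_b = [vocabulary[t] for t in b]

# Bag-of-words view: only the count per token survives
bag_a = Counter(a)
bag_b = Counter(b)

print(seq_a, seq_b)    # [0, 1, 2] [2, 1, 0]
print(bag_a == bag_b)  # True
```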


# Workflow:
1. Preprocess the text: clean it and remove unnecessary characters.

2. Train a SentencePiece model on the entire dataset. It learns how to split words into subwords.

3. Tokenize the dataset using the SentencePiece model.

4. Generate bigrams.

5. Build the vocabulary.

6. Generate BoW vectors.

# Vocab indexing

## Main idea
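```
import numpy as np

# Placeholders so the sketch runs; swap in the real corpus and preprocessing
dataset = ["The cat sat.", "The dog sat."]

def standardize(text):
  return text.lower()

def tokenize(text):
  return text.split()

# Give every new token the next free index
vocabulary = {}
for text in dataset:
  text = standardize(text)
  tokens = tokenize(text)
  for token in tokens:
    if token not in vocabulary:
      vocabulary[token] = len(vocabulary)

def one_hot_encode_token(token):
  vector = np.zeros(len(vocabulary))
  token_index = vocabulary[token]
  vector[token_index] = 1
  return vector
```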
- The result is a vector of zeros with a single 1 at the token's index.

In [15]:
!pip install fugashi[unidic-lite] -q
!pip install sentencepiece -q

# Step 1 and 2
- No cleaning is needed here: the sample sentences below are already clean.
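
The sample sentences in this notebook need no cleaning, but for noisier Japanese text Step 1 might look like the sketch below. It is an assumption about what "cleaning" could involve, not part of the pipeline that follows: `clean_text` is a hypothetical helper that applies Unicode NFKC normalization and collapses whitespace.

```python
import re
import unicodedata

def clean_text(text):
    # NFKC folds full-width Latin letters, digits, and punctuation to half-width,
    # and half-width katakana to full-width
    text = unicodedata.normalize("NFKC", text)
    # Collapse any run of whitespace into a single space and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("  ＡＩは　すごい！ "))  # AIは すごい!
print(clean_text("ﾃｽﾄ"))                  # テスト
```

Whether punctuation should also be folded this way depends on the task, so treat the normalization form as a choice to revisit.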

In [37]:
import sentencepiece as spm

texts = [
    "このコードは非常に効率的です",
    "プログラムのデバッグを行います",
    "新しいアルゴリズムを試しています",
    "関数のテストを実行します",
    "データベースに接続しています",
    "ユーザーインターフェイスを更新しています",
    "エラーが発生しました",
    "新しい機能を追加しました",
    "コードの最適化を行っています",
    "APIのドキュメントを作成しています",
    "ユーザーからのフィードバックを受け取っています",
    "セキュリティ対策を強化しました",
    "パフォーマンスを向上させました",
    "プロジェクトの進行状況を確認しています",
    "バージョン管理を使用しています",
    "プログラムの動作確認をしています",
    "コードのリファクタリングを行います",
    "新しいライブラリをインストールしました",
    "複雑なデータ構造を処理しています",
    "メモリ使用量を最適化しています",
    "コードの可読性を向上させました",
    "ユニットテストを追加しました",
    "デバッグログを出力しています",
    "コードの最終チェックを行っています"
]

# Write the texts to a file
with open('text_data.txt', 'w', encoding='utf-8') as f:
    for text in texts:
        f.write(text + '\n')

# Train a SentencePiece Model on my text
spm.SentencePieceTrainer.train(input='text_data.txt', model_prefix='spm_model', vocab_size=125)


# Example outputs
- Vocabulary
- One-hot encoding

In [44]:
import numpy as np

# Load model
sp = spm.SentencePieceProcessor(model_file='spm_model.model')


# Get the size of the vocabulary
vocab_size = sp.get_piece_size()
print(f"Vocabulary size: {vocab_size}")

# Build vocabulary from SentencePiece model
vocabulary = {sp.id_to_piece(i): i for i in range(vocab_size)}

# Print some sample vocabulary items
for token, idx in list(vocabulary.items())[:5]:
    print(f"Token: {token}, Index: {idx}")

def one_hot_encode_token(token, vocabulary):
    vector = np.zeros(len(vocabulary), dtype=int)
    if token in vocabulary:
        vector[vocabulary[token]] = 1

    return vector

# One-hot encode each token in the vocabulary
one_hot_vectors = {token: one_hot_encode_token(token, vocabulary) for token in vocabulary}

# Display the one-hot encoding of the first few tokens
for token, vector in list(one_hot_vectors.items())[:5]:
    print(f"Token: {token} -> One-Hot Vector: {vector}")

Vocabulary size: 125
Token: <unk>, Index: 0
Token: <s>, Index: 1
Token: </s>, Index: 2
Token: ま, Index: 3
Token: す, Index: 4
Token: <unk> -> One-Hot Vector: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: <s> -> One-Hot Vector: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: </s> -> One-Hot Vector: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: ま -> One-H

# Step 3 Tokenize Text

In [45]:
texts = [
    "私はAIが大好きです",
    "AIは素晴らしい技術です",
    "日本語のテキストを処理します"
]

# Tokenize the texts
tokenized_texts = [sp.encode(text, out_type=str) for text in texts]
print(f"Tokenized texts: {tokenized_texts}")

Tokenized texts: [['▁', '私', 'は', 'A', 'I', 'が', '大好き', 'で', 'す'], ['▁', 'A', 'I', 'は', '素晴', 'ら', 'し', 'い', '技術', 'で', 'す'], ['▁', '日本語', 'の', 'テ', 'キ', 'ス', 'ト', 'を', '処', '理', 'し', 'ま', 'す']]
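
In this output, `▁` (U+2581) is the marker SentencePiece uses for whitespace, and one is placed at the start of each sentence. As a rough sketch, the pieces can be turned back into text with plain string operations. `detokenize` is a hypothetical helper written for illustration; with the model loaded, the processor's own `decode` method is the usual route.

```python
def detokenize(pieces):
    # Concatenate the pieces, turn the '▁' whitespace marker back into a space,
    # and drop the leading space left by the sentence-initial marker
    return "".join(pieces).replace("▁", " ").strip()

# First tokenized text from the output above
pieces = ['▁', '私', 'は', 'A', 'I', 'が', '大好き', 'で', 'す']
print(detokenize(pieces))  # 私はAIが大好きです
```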


# Step 4 Generate N-Grams

In [46]:
def generate_bigrams(tokens):
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

# Generate bigrams for each text
bigram_texts = [generate_bigrams(text) for text in tokenized_texts]
print(f"Bigram texts: {bigram_texts}")

Bigram texts: [[('▁', '私'), ('私', 'は'), ('は', 'A'), ('A', 'I'), ('I', 'が'), ('が', '大好き'), ('大好き', 'で'), ('で', 'す')], [('▁', 'A'), ('A', 'I'), ('I', 'は'), ('は', '素晴'), ('素晴', 'ら'), ('ら', 'し'), ('し', 'い'), ('い', '技術'), ('技術', 'で'), ('で', 'す')], [('▁', '日本語'), ('日本語', 'の'), ('の', 'テ'), ('テ', 'キ'), ('キ', 'ス'), ('ス', 'ト'), ('ト', 'を'), ('を', '処'), ('処', '理'), ('理', 'し'), ('し', 'ま'), ('ま', 'す')]]
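
The same idea extends from bigrams to any n. A possible generalization, with `generate_ngrams` as a hypothetical name, zips n staggered views of the token list:

```python
def generate_ngrams(tokens, n=2):
    # Zip n staggered views of the token list; zip stops at the shortest view,
    # so a list with fewer than n tokens yields no n-grams
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ['▁', 'A', 'I', 'は']  # hypothetical short token list
print(generate_ngrams(tokens, 2))  # [('▁', 'A'), ('A', 'I'), ('I', 'は')]
print(generate_ngrams(tokens, 3))  # [('▁', 'A', 'I'), ('A', 'I', 'は')]
```

With `n=2` this returns the same tuples as `generate_bigrams`.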


# Step 5 Build Vocab
- Why build the vocab from the bigrams and not the other way around?
- The vocab defines which features the model looks at. With bigrams, each feature is a pair of adjacent tokens.

In [48]:
# Build vocabulary from bigrams
vocabulary = {}
for text in bigram_texts:
    for bigram in text:
        if bigram not in vocabulary:
            vocabulary[bigram] = len(vocabulary)

print(f"Vocabulary: {vocabulary}")

Vocabulary: {('▁', '私'): 0, ('私', 'は'): 1, ('は', 'A'): 2, ('A', 'I'): 3, ('I', 'が'): 4, ('が', '大好き'): 5, ('大好き', 'で'): 6, ('で', 'す'): 7, ('▁', 'A'): 8, ('I', 'は'): 9, ('は', '素晴'): 10, ('素晴', 'ら'): 11, ('ら', 'し'): 12, ('し', 'い'): 13, ('い', '技術'): 14, ('技術', 'で'): 15, ('▁', '日本語'): 16, ('日本語', 'の'): 17, ('の', 'テ'): 18, ('テ', 'キ'): 19, ('キ', 'ス'): 20, ('ス', 'ト'): 21, ('ト', 'を'): 22, ('を', '処'): 23, ('処', '理'): 24, ('理', 'し'): 25, ('し', 'ま'): 26, ('ま', 'す'): 27}


# Step 6 Generate BoW

In [49]:

def bag_of_words_bigrams(text, vocabulary):
    vector = np.zeros(len(vocabulary), dtype=int)

    # Count the frequency of each bigram in the vocabulary
    for bigram in text:
        if bigram in vocabulary:
            vector[vocabulary[bigram]] += 1

    return vector

# Generate BoW
bow_vectors = [bag_of_words_bigrams(text, vocabulary) for text in bigram_texts]

# Display BoW vectors
for i, bow_vector in enumerate(bow_vectors):
    print(f"Text {i + 1} BoW: {bow_vector}")

Text 1 BoW: [1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Text 2 BoW: [0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
Text 3 BoW: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1]
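
Because the encoder only counts bigrams that are already in the vocabulary, bigrams from unseen text are silently dropped. The sketch below makes that visible by also returning how many were dropped. The function name, the two-entry vocabulary, and the new text are all made up for illustration, and a plain list stands in for the NumPy array.

```python
def encode_with_oov_count(bigrams, vocabulary):
    # Same counting as a bag-of-words vector, plus a tally of bigrams
    # that are missing from the vocabulary and therefore contribute nothing
    vector = [0] * len(vocabulary)
    dropped = 0
    for bigram in bigrams:
        if bigram in vocabulary:
            vector[vocabulary[bigram]] += 1
        else:
            dropped += 1
    return vector, dropped

# Hypothetical two-entry vocabulary and an unseen text
vocabulary = {('A', 'I'): 0, ('で', 'す'): 1}
new_text = [('A', 'I'), ('I', 'も'), ('も', '好き'), ('好き', 'で'), ('で', 'す')]
vector, dropped = encode_with_oov_count(new_text, vocabulary)
print(vector, dropped)  # [1, 1] 3
```

A high dropped count is a hint that the vocabulary was built from too little text.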


# Compare to Unigram output

In [50]:
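# Hand-segmented word-level tokens for comparison (not SentencePiece output);
# this overwrites the earlier `tokenized_texts`
tokenized_texts = [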
    ["私", "は", "AI", "が", "大好き", "です"],
    ["AI", "は", "素晴らしい", "技術", "です"],
    ["日本語", "の", "テキスト", "を", "処理", "します"]
]

# Build unigram vocabulary
unigram_vocab = {}
for text in tokenized_texts:
    for token in text:
        if token not in unigram_vocab:
            unigram_vocab[token] = len(unigram_vocab)

print("Unigram Vocab:", unigram_vocab)

# Create BoW vectors
def unigram_bow_vector(tokens, vocab):
    vector = np.zeros(len(vocab), dtype=int)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] += 1
    return vector

unigram_bows = [unigram_bow_vector(text, unigram_vocab) for text in tokenized_texts]

for i, bow in enumerate(unigram_bows):
    print(f"Text {i+1} Unigram BoW: {bow}")

Unigram Vocab: {'私': 0, 'は': 1, 'AI': 2, 'が': 3, '大好き': 4, 'です': 5, '素晴らしい': 6, '技術': 7, '日本語': 8, 'の': 9, 'テキスト': 10, 'を': 11, '処理': 12, 'します': 13}
Text 1 Unigram BoW: [1 1 1 1 1 1 0 0 0 0 0 0 0 0]
Text 2 Unigram BoW: [0 1 1 0 0 1 1 1 0 0 0 0 0 0]
Text 3 Unigram BoW: [0 0 0 0 0 0 0 0 1 1 1 1 1 1]
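
One practical difference between the two feature sets: unigram counts lose word order completely, while bigram counts keep some local order. A small sketch with two made-up, hand-segmented sentences that use the same words with opposite meanings:

```python
from collections import Counter

def bigrams(tokens):
    # Pairs of adjacent tokens
    return list(zip(tokens, tokens[1:]))

# Hypothetical hand-segmented sentences: same words, opposite meaning
a = ["犬", "が", "猫", "を", "追う"]  # the dog chases the cat
b = ["猫", "が", "犬", "を", "追う"]  # the cat chases the dog

same_unigrams = Counter(a) == Counter(b)
same_bigrams = Counter(bigrams(a)) == Counter(bigrams(b))
shared = set(bigrams(a)) & set(bigrams(b))

print(same_unigrams)  # True
print(same_bigrams)   # False
print(shared)         # {('を', '追う')}
```

The price is a larger and sparser feature space: in the outputs above, the same three texts produce 28 bigram features versus 14 unigram features, although the two runs use different tokenizations.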
