<a href="https://colab.research.google.com/github/dominiksakic/sentimentAnalysisJp/blob/main/prep_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- Vectorizing text is the process of transforming text into numeric tensors

- It has three steps: standardize the text, tokenize it, and convert each token into a numeric vector

- There are three ways to tokenize the standardized text:

1. Word-level tokenization
2. N-gram tokenization
3. Character-level tokenization
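
As a rough illustration of the three schemes, the sketch below tokenizes one short English sentence in each way. The sample sentence, the whitespace split, and the helper name `ngrams` are assumptions made for this example. Japanese text has no spaces between words, so word-level tokenization there would need a segmenter first.

```python
# Toy sentence, assumed to be already standardized (lowercase, no punctuation).
text = "the cat sat on the mat"

# 1. Word-level: split on whitespace.
word_tokens = text.split()

# 2. N-gram: groups of n consecutive words (here n = 2, i.e. bigrams).
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigram_tokens = ngrams(word_tokens, 2)

# 3. Character-level: every character (including spaces) is its own token.
char_tokens = list(text)

print(word_tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(bigram_tokens)    # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(char_tokens[:7])  # ['t', 'h', 'e', ' ', 'c', 'a', 't']
```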

- Models differ in whether they take word order into account:
1. Sequence models (word order matters)
2. Bag-of-words models (word order is ignored)
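
To make the difference concrete, here is a minimal sketch of the input each kind of model would see for the same tokens. It assumes every token has already been mapped to an integer index. The toy `vocabulary` and the sample tokens are made up for illustration.

```python
# Hypothetical toy vocabulary mapping each token to an integer index.
vocabulary = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Sequence representation: one index per token, order preserved.
sequence = [vocabulary[token] for token in tokens]

# Bag-of-words representation: one count per vocabulary entry, order discarded.
bag_of_words = [0] * len(vocabulary)
for token in tokens:
    bag_of_words[vocabulary[token]] += 1

print(sequence)      # [0, 1, 2, 3, 0, 4]
print(bag_of_words)  # [2, 1, 1, 1, 1]
```

Shuffling `tokens` changes `sequence` but leaves `bag_of_words` unchanged, which is exactly the information a bag-of-words model gives up.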

# Vocab indexing

## Main idea
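```
import numpy as np

# `dataset`, `standardize`, and `tokenize` are placeholders that are not defined here.

# Assign each distinct token the next free integer index.
vocabulary = {}
for text in dataset:
  text = standardize(text)
  tokens = tokenize(text)
  for token in tokens:
    if token not in vocabulary:
      vocabulary[token] = len(vocabulary)

def one_hot_encode_token(token):
  # All zeros except for a 1 at the token's index.
  vector = np.zeros(len(vocabulary))
  token_index = vocabulary[token]
  vector[token_index] = 1
  return vector
```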
- The result is a vector of length `len(vocabulary)` that is all zeros except for a single 1 at the token's index.

In [15]:
!pip install fugashi[unidic-lite] -q
!pip install sentencepiece -q

In [20]:
import sentencepiece as spm

texts = [
    "私はAIが大好きです",
    "AIは素晴らしい技術です",
    "日本語のテキストを処理します"
]

# Write the texts to a file
with open('text_data.txt', 'w', encoding='utf-8') as f:
    for text in texts:
        f.write(text + '\n')

# Train a SentencePiece Model on my text
spm.SentencePieceTrainer.train(input='text_data.txt', model_prefix='spm_model', vocab_size=34)


In [24]:
# Load model
sp = spm.SentencePieceProcessor(model_file='spm_model.model')

# Example Japanese text to tokenize
text = "私はAIが大好きです"

# Tokenize the text into subword units
tokens = sp.encode(text, out_type=str)

print("Tokens:", tokens)

Tokens: ['▁', '私', 'は', 'AI', 'が', '大', '好', 'き', 'で', 'す']


In [34]:
import numpy as np

# Get the size of the vocabulary
vocab_size = sp.get_piece_size()
print(f"Vocabulary size: {vocab_size}")

# Build vocabulary from SentencePiece model
vocabulary = {sp.id_to_piece(i): i for i in range(vocab_size)}

# Print some sample vocabulary items
for token, idx in list(vocabulary.items())[:5]:
    print(f"Token: {token}, Index: {idx}")

Vocabulary size: 34
Token: <unk>, Index: 0
Token: <s>, Index: 1
Token: </s>, Index: 2
Token: ▁, Index: 3
Token: す, Index: 4


In [33]:
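def one_hot_encode_token(token, vocabulary):
    # Return a vector of zeros with a 1 at the token's index.
    # A token that is missing from the vocabulary yields an all-zero vector.
    vector = np.zeros(len(vocabulary), dtype=int)
    if token in vocabulary:
        vector[vocabulary[token]] = 1

    return vector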

# One-hot encode each token in the vocabulary
one_hot_vectors = {token: one_hot_encode_token(token, vocabulary) for token in vocabulary}

# Display the one-hot encoding of the first few tokens
for token, vector in list(one_hot_vectors.items())[:5]:
    print(f"Token: {token} -> One-Hot Vector: {vector}")

Token: <unk> -> One-Hot Vector: [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: <s> -> One-Hot Vector: [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: </s> -> One-Hot Vector: [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: ▁ -> One-Hot Vector: [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Token: す -> One-Hot Vector: [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
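
The same idea extends from single tokens to a whole tokenized sentence, giving one row per token. The sketch below is an illustration under stated assumptions, not the notebook's own method. The helper name `one_hot_encode_sequence` and the toy vocabulary are hypothetical, since the real indices come from the trained SentencePiece model. Unlike the function above, it maps unseen tokens to `<unk>` instead of returning an all-zero row, which is one possible design choice.

```python
import numpy as np

# Hypothetical toy vocabulary; the real indices come from the trained model.
vocabulary = {"<unk>": 0, "私": 1, "は": 2, "AI": 3, "が": 4}

def one_hot_encode_sequence(tokens, vocabulary):
    # Map each token to its index, falling back to <unk> for unseen tokens.
    ids = np.array(
        [vocabulary.get(token, vocabulary["<unk>"]) for token in tokens],
        dtype=int,
    )
    # One row per token, one column per vocabulary entry.
    matrix = np.zeros((len(tokens), len(vocabulary)), dtype=int)
    matrix[np.arange(len(tokens)), ids] = 1
    return matrix

# "猫" is not in the toy vocabulary, so its row points at <unk>.
matrix = one_hot_encode_sequence(["私", "は", "AI", "が", "猫"], vocabulary)
print(matrix.shape)  # (5, 5)
print(matrix)
```

Each row contains exactly one 1, so the matrix has shape `(number of tokens, vocabulary size)` and grows quickly with larger vocabularies.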
