### Text processing
- What is the topic of this text? (classification)
- Does this text contain abuse? (moderation)
- Positive or negative? (sentiment analysis)
- Sentence completion. (language modelling)
- How do you say this in Dutch? (translation)
- Produce a summary. (summarization)
  
Important: the network does not actually understand anything; it is just mathematics in the background.
  
Until roughly 2017, RNNs and LSTMs dominated text processing. After that, LLMs took over.  
  
#### What needs to be done to process text for neural networks?
- Standardizing: convert to lower (or upper) case and remove punctuation (with care, since some periods, question marks, etc. carry information about sentiment).
  - For example, a question mark signals that the sentence is a question.
- Tokenization: split the text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc.
- Convert all tokens to a tensor. This means (typically) indexing the tokens.
  
#### Example
The cat sat on the mat.  
the cat sat on the mat  
["cat", "sat", "on", "mat"] (here the stop word "the" has been dropped)  
[2, 34, 53, 8]

The token indices are then usually one-hot encoded. The resulting vectors are very large: they have as many dimensions as there are tokens in the vocabulary.  
Another standardization step is stripping diacritics, for example è => e (this is risky, since the meaning can suffer).  
In many languages punctuation cannot be removed indiscriminately; what is safe depends on the language.  
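
The steps above (standardize, tokenize, index, one-hot encode) can be sketched in plain Python. This is only an illustration, not what Keras does internally: the helper names are made up for this note, the vocabulary is built from the single example sentence (so the indices differ from the numbers shown above), and reserving index 0 for padding and 1 for unknown tokens is an assumption.

```python
import string

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()
    return "".join(ch for ch in text if ch not in string.punctuation)

def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()

def build_vocabulary(tokens):
    """Map each distinct token to an integer, in order of first appearance.

    Index 0 (padding) and 1 (unknown token) are reserved; this is an
    assumption made for the sketch.
    """
    vocab = {"": 0, "[UNK]": 1}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def one_hot(index, size):
    """Return a vector of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

sentence = "The cat sat on the mat."
clean = standardize(sentence)
tokens = tokenize(clean)
vocab = build_vocabulary(tokens)
indices = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
vectors = [one_hot(i, len(vocab)) for i in indices]

print(clean)
print(tokens)
print(indices)
```

Each one-hot vector is as long as the vocabulary, which is why this encoding becomes very large for a realistic vocabulary of, say, 20000 tokens.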

### Three ways of handling tokens
#### Word-level tokenization
So-called "word-level tokenization".  
Tokens are space-separated substrings (or punctuation-separated if appropriate). A variant also splits words into subwords, which is especially important for agglutinating and compounding languages such as Finnish or Swedish (bildörr, en, ett, dörren, bilen, bil, etc.). This requires knowledge of the languages involved.
#### N-gram tokenization (most common)
- Tokens are groups of N consecutive words. For example, "the cat", "he was", "over there" ("2-grams", "bigrams").
#### Character-level tokenization
- Each character is its own token. In practice, useful for languages with rich writing systems or pictographic writing (Cyrillic, Japanese, Chinese, etc.).
- This looks at every single letter, but it is a special case.
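
A minimal sketch of the three tokenization schemes on an already standardized sentence. The function names are hypothetical and the splitting is deliberately naive (whitespace only, no subword handling).

```python
def word_tokens(text):
    """Word-level: one token per whitespace-separated word."""
    return text.split()

def ngram_tokens(text, n=2):
    """N-gram: every run of n consecutive words becomes one token."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_tokens(text):
    """Character-level: one token per character (spaces included)."""
    return list(text)

text = "the cat sat on the mat"
print(word_tokens(text))
print(ngram_tokens(text, n=2))
print(char_tokens("cat"))
```

Note that a sentence with W words yields W - N + 1 N-grams, so bigrams produce one token fewer than word-level tokenization.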
  
Dataset: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


### Download and extract the data:
- Download the data file (a .gz file) and place it in the data folder.
- Extract it to a .tar file:

In [3]:
import gzip
import shutil

# Path to the .gz file and the output file
input_file = r"data\aclImdb_v1.tar.gz"
output_file = r"data\aclImdb_v1.tar"

# Extract the .gz file
with gzip.open(input_file, 'rb') as f_in:
    with open(output_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

print("Extraction complete!")


Extraction complete!


Extract the .tar file into the data\text folder:

In [6]:
import tarfile

# Path to the .tar file
input_tar = r"data\aclImdb_v1.tar"
output_dir = r"data\text"

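# Extract the .tar file.
# filter="data" rejects unsafe members (e.g. absolute paths) and avoids the
# DeprecationWarning; it needs a Python version with tarfile extraction
# filters (3.12+, or a patched earlier release).
with tarfile.open(input_tar, 'r:') as tar:
    tar.extractall(path=output_dir, filter="data")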

print("Extraction complete!")




Extraction complete!
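
The two extraction steps can also be done in one pass, since `tarfile` can read gzip-compressed archives directly via the `"r:gz"` mode. The sketch below is an alternative under that assumption, not the approach used in the cells above; `extract_tar_gz` is a hypothetical helper name, and the `filter="data"` argument is only passed when the running Python version supports extraction filters.

```python
import tarfile

def extract_tar_gz(archive_path, output_dir):
    """Extract a .tar.gz archive in one step (tarfile handles the gzip layer)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters are available: reject unsafe members
            # such as absolute paths or links pointing outside output_dir.
            tar.extractall(path=output_dir, filter="data")
        else:
            # Older Python without extraction filters.
            tar.extractall(path=output_dir)

# Example call with the paths from this notebook (not executed here):
# extract_tar_gz(r"data\aclImdb_v1.tar.gz", r"data\text")
```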


Remove unnecessary files:

In [7]:
import os
import shutil  # needed for shutil.rmtree below

files_to_remove = [
    r"data\text\aclImdb\train\unsup",
    r"data\text\aclImdb\train\urls_unsup.txt",
    r"data\aclImdb_v1.tar",
    r"data\aclImdb_v1.tar.gz"
]

for file_path in files_to_remove:
    try:
        if os.path.isfile(file_path):
            os.remove(file_path)
            print(f"Removed file: {file_path}")
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
            print(f"Removed directory: {file_path}")
    except Exception as e:
        print(f"Error removing {file_path}: {e}")

Removed directory: data\text\aclImdb\train\unsup
Removed file: data\text\aclImdb\train\urls_unsup.txt
Removed file: data\aclImdb_v1.tar
Removed file: data\aclImdb_v1.tar.gz


### Creating the validation dataset

In [1]:
import os, pathlib, shutil, random
basedir = pathlib.Path("data/text/aclImdb")
val_dir = basedir / "val"
train_dir = basedir / "train"

for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # Shuffle the file list with a fixed seed
    random.Random(1337).shuffle(files)
    # Set aside the last 20% of the shuffled files for validation
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)

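FileNotFoundError: [WinError 3] Det går inte att hitta sökvägen: 'data\\text\\aclImdb\\train\\neg'

(The Windows message means "The system cannot find the path specified". The relative path `data/text/aclImdb` is resolved against the current working directory, so the cell must be run from the folder that contains `data`.)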

In [6]:
import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(val_dir, batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(basedir / "test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [8]:
for inputs, targets in train_ds:
    print(f"inputs: {inputs.shape}, {inputs.dtype}")
    print(f"targets: {targets.shape}, {targets.dtype}")
    break

inputs: (32,), <dtype: 'string'>
targets: (32,), <dtype: 'int32'>


Text vectorization, done outside the model (applied to the datasets rather than as a layer inside the network).

In [12]:
from keras import layers
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)  # adapt is similar to fit: it scans the whole dataset to build the vocabulary

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))


In [None]:
def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [17]:
model = get_model()
model.summary()

In [18]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]
model.fit(binary_1gram_train_ds, validation_data=binary_1gram_val_ds, epochs=10, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 78s 121ms/step - accuracy: 0.7862 - loss: 0.4836 - val_accuracy: 0.8750 - val_loss: 0.3087
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 37s 49ms/step - accuracy: 0.8978 - loss: 0.2666 - val_accuracy: 0.8844 - val_loss: 0.3042
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 36s 40ms/step - accuracy: 0.9208 - loss: 0.2294 - val_accuracy: 0.8854 - val_loss: 0.3259
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 38s 34ms/step - accuracy: 0.9260 - loss: 0.2116 - val_accuracy: 0.8858 - val_loss: 0.3445
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 42s 36ms/step - accuracy: 0.9339 - loss: 0.2011 - val_accuracy: 0.8826 - val_loss: 0.3549
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 25s 40ms/step - accuracy: 0.9368 - loss: 0.1960 - val_accuracy: 0.8844 - val_loss: 0.3749
Epoch 7/10
[...]

<keras.src.callbacks.history.History at 0x1d56778f320>

Load the saved model and evaluate it on the test set.

In [19]:
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 179s 227ms/step - accuracy: 0.8873 - loss: 0.2906
Test acc: 0.887


You can see that the input vector contains only 1s and 0s.

In [21]:
for inputs, targets in binary_1gram_test_ds:
    print(f"inputs: {inputs.shape}, {inputs.dtype}")
    print(f"targets: {targets.shape}, {targets.dtype}")
    print(f"inputs[0]: {inputs[0].numpy()}")
    print(f"targets[0]: {targets[0].numpy()}")
    break

inputs: (32, 20000), <dtype: 'int64'>
targets: (32,), <dtype: 'int32'>
inputs[0]: [1 1 1 ... 0 0 0]
targets[0]: 0
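
What the `multi_hot` output mode produces can be illustrated with a small pure-Python sketch: one position per vocabulary entry, set to 1 if that token occurs anywhere in the text. The function name and the toy indices are assumptions for illustration only; this is not the Keras implementation.

```python
def multi_hot(token_indices, vocab_size):
    """Return a 0/1 vector with a 1 at every index that occurs in the text."""
    vec = [0] * vocab_size
    for idx in token_indices:
        vec[idx] = 1  # repeated tokens still give just a single 1
    return vec

# Toy review: token 2 occurs twice, token 7 never occurs
review_indices = [2, 3, 4, 5, 2, 6]
encoded = multi_hot(review_indices, vocab_size=8)
print(encoded)
```

Word order and word counts are both lost in this representation; only the presence of each token is kept.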


Trying bigrams, this time with TF-IDF weighting.

In [20]:
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="tf_idf")
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [22]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras", save_best_only=True)]
model.fit(tfidf_2gram_train_ds, validation_data=tfidf_2gram_val_ds, epochs=10, callbacks=callbacks)


Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 126ms/step - accuracy: 0.7243 - loss: 0.8280 - val_accuracy: 0.8830 - val_loss: 0.3030
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - accuracy: 0.8673 - loss: 0.3307 - val_accuracy: 0.8860 - val_loss: 0.3072
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 48ms/step - accuracy: 0.8896 - loss: 0.2914 - val_accuracy: 0.8786 - val_loss: 0.3287
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 26s 42ms/step - accuracy: 0.9035 - loss: 0.2576 - val_accuracy: 0.8814 - val_loss: 0.3244
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 42ms/step - accuracy: 0.9055 - loss: 0.2449 - val_accuracy: 0.8778 - val_loss: 0.3328
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step - accuracy: 0.9122 - loss: 0.2336 - val_accuracy: 0.8856 - val_loss: 0.3325
Epoch 7/10
[...]

<keras.src.callbacks.history.History at 0x1d5681a9af0>

In [24]:
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 156s 199ms/step - accuracy: 0.8892 - loss: 0.2975
Test acc: 0.888
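
The idea behind TF-IDF weighting can be sketched as follows: a token's count in a document (term frequency) is scaled down by how many documents contain it (inverse document frequency), so words that appear everywhere contribute little. The sketch uses the textbook form `tf * log(N / df)`; the exact formula used by `TextVectorization` may differ, and the function name and toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)  # term frequency within this document
        weighted.append({
            tok: counts[tok] * math.log(n_docs / doc_freq[tok])
            for tok in counts
        })
    return weighted

docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["the", "plot", "was", "great"],
]
weights = tf_idf(docs)
print(weights[1])
```

In this toy corpus "the" occurs in every document and gets weight 0, while "terrible" occurs in only one and gets the highest weight.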


Part 3: now we look at sequences. Preparing the data for an LSTM.

In [26]:
max_length = 600
max_tokens = 20000

text_vectorization = layers.TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))

In [29]:
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(max_tokens, 128)(inputs)
#embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])


Part 4, Lucia:
### Embeddings:
- Page 332 in the book.
- Bidirectional layers: the sequence is processed in both directions (forward and backward).

Build a new neural network that adds something called a "mask": set `mask_zero=True` on the `Embedding` layer.
This does not improve the results by much.  
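
What the mask represents can be sketched in plain Python. Sequences shorter than `output_sequence_length` are padded with index 0, and with `mask_zero=True` the `Embedding` layer marks those positions so that the LSTM can skip them. The helpers below are hypothetical and only mimic that idea; they are not the Keras code.

```python
def pad_sequence(indices, length, pad_value=0):
    """Truncate or right-pad a list of token indices to a fixed length."""
    return (indices + [pad_value] * length)[:length]

def compute_mask(padded, pad_value=0):
    """True for real tokens, False for padding positions."""
    return [idx != pad_value for idx in padded]

review = [12, 7, 305]           # a short review of three tokens
padded = pad_sequence(review, 6)
mask = compute_mask(padded)
print(padded)
print(mask)
```

Without the mask, the recurrent layer would also process the trailing zeros, which for short reviews means most of the 600 time steps carry no information.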

So far, TF-IDF is the best.

The next step is what is called Transformers! Creating a new file, 11, for that.