### Text processing
- What is the topic of this text? (classification)
- Does this text contain abuse? (moderation)
- Positive or negative? (sentiment analysis)
- Sentence completion. (language modelling)
- How do you say this in Dutch? (translation)
- Produce a summary. (summarization)
  
Important: the network does not really understand anything; it is just mathematics in the background.
  
Until around 2017, RNNs and LSTMs dominated text processing. After that, LLMs took over.  
  
#### What needs to be done to process text for neural networks?
- Standardizing: convert to lower (or upper) case, remove punctuation (with care, since some periods, question marks, etc. carry information about sentiment).
  - For example, a question mark signals that the sentence is a question.
- Tokenization: split the text into units (tokens), such as characters, words, groups of words, clauses in sentences, etc.
- Convert all tokens to a tensor. This means (typically) indexing the tokens.
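
The three steps above can be sketched in plain Python. This is a minimal illustration, not how any particular library does it: the helper names (`standardize`, `tokenize`, `build_vocabulary`, `encode`) and the choice to reserve index 0 for unknown tokens are assumptions made for this sketch.

```python
import string


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))


def tokenize(text):
    """Word-level tokenization: split on whitespace."""
    return text.split()


def build_vocabulary(token_lists):
    """Give each distinct token an integer index, in order of first appearance.

    Index 0 is reserved for unknown tokens (an assumption of this sketch).
    """
    vocab = {"[UNK]": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each token with its index; unseen tokens map to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


tokens = tokenize(standardize("The cat sat on the mat."))
vocab = build_vocabulary([tokens])
indices = encode(tokens, vocab)
print(tokens)
print(indices)
```

A real vocabulary is built from the whole training corpus, usually ordered by word frequency, so the indices will differ from the ones printed here.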
  
#### Example
The cat sat on the mat.  
the cat sat on the mat  
["cat", "sat", "on", "mat"]  
[2, 34, 53, 8]

After that, one-hot encoding is typically applied. The resulting vectors get very large: one dimension per token in the vocabulary.  
Another standardization step is stripping diacritics: è => e (this is risky, since the meaning can suffer).  
In many languages punctuation cannot be handled arbitrarily; it depends on the language.  
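
To make the size of one-hot vectors and the risk of stripping diacritics concrete, here is a small sketch. The vocabulary size of 60 is a made-up number that is just large enough for the indices in the example, and `one_hot` and `strip_accents` are hypothetical helper names.

```python
import unicodedata


def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a single 1 at position index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks, so è becomes e."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


indices = [2, 34, 53, 8]  # the token indices from the example
vocab_size = 60           # hypothetical; a real vocabulary has thousands of entries
vectors = [one_hot(i, vocab_size) for i in indices]
print(len(vectors), "vectors of length", len(vectors[0]))

# French "où" (where) collapses into "ou" (or): the meaning is lost.
print(strip_accents("où"))
```

With a vocabulary of 20000 tokens, every single token would become a 20000-dimensional vector that is almost entirely zeros.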

### Three ways of handling tokens
#### Word-level tokenization
So-called "word-level tokenization".  
Tokens are space-separated substrings (or punctuation-separated if appropriate). A variant also splits into subwords, which is especially important for agglutinating and compounding languages, such as Finnish or Swedish (bildörr, en, ett, dörren, bilen, bil, etc.). This requires knowledge of the languages.
#### N-gram tokenization (most common)
- Tokens are groups of N consecutive words. For example, "the cat", "he was", "over there" ("2-grams", "bigrams")
#### Character-level tokenization
- Each character is its own token. In practice, useful for languages with rich writing systems or pictographic writing (Cyrillic, Japanese, Chinese, etc.).
- Looks at every single character, but this is a special case.
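
The three tokenization schemes can be compared side by side with a few lines of Python. This is a minimal sketch with hypothetical function names; subword splitting is left out because it needs language-specific knowledge.

```python
def word_tokens(text):
    """Word-level: space-separated substrings."""
    return text.split()


def ngram_tokens(text, n=2):
    """N-gram: every group of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_tokens(text):
    """Character-level: every character (including spaces) is a token."""
    return list(text)


sentence = "the cat sat on the mat"
print(word_tokens(sentence))
print(ngram_tokens(sentence, n=2))
print(char_tokens("the cat"))
```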
  
Dataset: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


### Downloading and extracting the data:
- Download the file (a .gz file) that contains all the data. Manually move it from the downloads folder to the data folder in the repo.
- The code below extracts the .gz file to a .tar file:

In [50]:
import gzip
import shutil

# Path to the .gz file and the output file
input_file = r"data\aclImdb_v1.tar.gz"
output_file = r"data\aclImdb_v1.tar"

# Extract the .gz file
with gzip.open(input_file, 'rb') as f_in:
    with open(output_file, 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

print("Extraction complete!")


Extraction complete!


Extract the .tar file to the data\text folder. All folders etc. that are needed are created automatically.

In [51]:
import tarfile

# Path to the .tar file
input_tar = r"data\aclImdb_v1.tar"
output_dir = r"data\text"

# Extract the .tar file
with tarfile.open(input_tar, 'r:') as tar:
    tar.extractall(path=output_dir)

print("Extraction complete!")




Extraction complete!
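
The two extraction steps above can also be combined, since `tarfile` can read gzip-compressed archives directly with mode `"r:gz"`. The sketch below is an alternative, not what the notebook ran; `extract_tar_gz` is a hypothetical helper name. On Python versions that support extraction filters, passing `filter="data"` should also avoid the deprecation warning about unfiltered extraction.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, output_dir):
    """Decompress and unpack a .tar.gz archive in one pass."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Extraction filters are available: reject unsafe archive members.
            tar.extractall(path=output_dir, filter="data")
        else:
            # Older Python without extraction filters.
            tar.extractall(path=output_dir)
    return output_dir


# Usage, with the same locations as in the cells above:
# extract_tar_gz(Path("data") / "aclImdb_v1.tar.gz", Path("data") / "text")
```

Using `pathlib.Path` instead of raw backslash strings also keeps the paths portable across operating systems.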


Remove the downloaded file and a few other files that are not needed.

In [52]:
import os
import shutil  # needed for shutil.rmtree below

files_to_remove = [
    r"data\text\aclImdb\train\unsup",
    r"data\text\aclImdb\train\urls_unsup.txt",
    r"data\aclImdb_v1.tar",
    r"data\aclImdb_v1.tar.gz"
]

for file_path in files_to_remove:
    try:
        if os.path.isfile(file_path):
            os.remove(file_path)
            print(f"Removed file: {file_path}")
        elif os.path.isdir(file_path):
            shutil.rmtree(file_path)
            print(f"Removed directory: {file_path}")
    except Exception as e:
        print(f"Error removing {file_path}: {e}")

Removed directory: data\text\aclImdb\train\unsup
Removed file: data\text\aclImdb\train\urls_unsup.txt
Removed file: data\aclImdb_v1.tar
Removed file: data\aclImdb_v1.tar.gz


### Creating the validation dataset

In [53]:
import os, pathlib, shutil, random
basedir = pathlib.Path("data/text/aclImdb")
val_dir = basedir / "val"
train_dir = basedir / "train"

for category in ("neg", "pos"):
    if not os.path.exists(val_dir / category):
        os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)
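
The split logic in the cell above can be isolated from the file moving and tried on a plain list. This sketch makes two additions that are assumptions, not part of the original cell: it sorts the names before shuffling (the order returned by `os.listdir` is arbitrary, so a fixed seed alone does not guarantee the same split on every machine), and it guards against a validation size of zero, where `files[-0:]` would return the whole list.

```python
import random


def split_filenames(filenames, val_fraction=0.2, seed=1337):
    """Return (train_files, val_files) using a reproducible shuffle."""
    files = sorted(filenames)            # fixed starting order
    random.Random(seed).shuffle(files)   # same seed as in the cell above
    num_val = int(val_fraction * len(files))
    if num_val == 0:
        return files, []
    return files[:-num_val], files[-num_val:]


names = [f"{i}_7.txt" for i in range(10)]  # hypothetical file names
train_files, val_files = split_filenames(names)
print(len(train_files), len(val_files))
```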

In [54]:
import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(train_dir, batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(val_dir, batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(basedir / "test", batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


Examining the train_ds variable a bit. Each element is a batch of texts with their associated labels:

In [55]:
print(train_ds.class_names)
print(train_ds.element_spec)
print(type(train_ds))

for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(f"Text: {text_batch.numpy()[i]}")
        print(f"Label: {label_batch.numpy()[i]}")

print(train_ds.element_spec)
print(f"Number of batches in train_ds: {len(train_ds)}")

['neg', 'pos']
(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))
<class 'tensorflow.python.data.ops.prefetch_op._PrefetchDataset'>
Text: b'"8 SIMPLE RULES... FOR DATING MY TEENAGE DAUGHTER," is my opinion, is an absolute ABC classic! I\'m not sure I haven\'t seen every episode, but I still enjoyed it. It\'s hard to say which episode was my favorite. However, I think it was always funny when a mishap occurred. I always laughed at that. Despite the fact that James Garner and David Spade were good, I liked the show more when John Ritter was the leading man. If you ask me, his sudden passing was very tragic. Everyone always gave a good performance, the production design was spectacular, the costumes were well-designed, and the writing was always very strong. In conclusion, I hope some network brings it back on the air for fans of the show to see.'
Label: 1
Text: b"Rob Roy is and underrated epic of passion and action!SOME MILD SPOI

In [58]:
for inputs, targets in train_ds:
    print(f"inputs: {inputs.shape}, {inputs.dtype}")
    print(f"targets: {targets.shape}, {targets.dtype}")
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs: (32,), <dtype: 'string'>
targets: (32,), <dtype: 'int32'>
inputs[0]: tf.Tensor(b"This movie is a blatant attempt by the left in Hollywood to portray Reagan's administration as incompetent and bungling. Some mistakes may have been made at the time of the crisis, but I'm sure not to the extent portrayed in this lame movie. My first reaction was that this movie had to have been directed by Oliver Stone, but I was wrong this time. There are apparently many others.", shape=(), dtype=string)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


In [85]:
for text_batch, label_batch in train_ds.take(1):
    print("Targets:", label_batch.numpy())
    print("Text:", text_batch.numpy())

Targets: [1 0 0 1 0 1 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1]
Text: [b'After the SuperFriends and Scooby Doo left the Saturday morning airwaves in the fall of 1986, I pretty much stopped watching Saturday morning cartoons at that point since those were the only two that kept me tuning in. And since neither the Real Ghostbusters nor the Flintstone Kids seemed very promising to me, I "retired" and started sleeping in on Saturday mornings. I only returned to Saturday morning TV in 1988 for that one year only for one and only one animated show. <br /><br />A new animated show of Superman was something I was not going to pass up. I was 17 and in high school at the time, but so what! I loved this show. From what I can recall, this series was a gift to fans I suppose in celebration of Superman\'s 50th birthday that particular year. It had the theme music and the music style reminiscent of John Williams movie score from the Richard Donner/Christopher Reeve Superman movies. I hones

Print the first text in the first batch:

In [64]:
for text_batch, label_batch in train_ds.take(1):
    print(f"First text in the first batch: {text_batch.numpy()[0].decode('utf-8')}")

First text in the first batch: awful, just awful! my old room mate used to watch this junk and it drove me crazy. the book is one of my favorites and its a shame that some people will never know what it is really like because their first impressions are from dribble like this. they changed so much it is hardly recognisable. which baffles me since the book reads like a soap opera anyway, providing enough fodder for modern day entertainment. it's like one of those Lifetime movies that say "based on a true story" but are completely fictional. there is none of the emotion or depth of the book, just mindless melodrama. if you are a high school student looking for a way to get out of reading, i suggest you try another version.


Text vectorization, done outside the model:

In [65]:
from keras import layers
text_vectorization = layers.TextVectorization(max_tokens=20000, output_mode="multi_hot")
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)  # adapt is similar to fit: it scans the dataset to build the vocabulary

binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))


The text is now vectorized, and the targets are binary as well.

In [91]:
for text_batch, label_batch in binary_1gram_train_ds.take(1):
    print("Targets:", label_batch.numpy())
    print("Text:", text_batch.numpy())

Targets: [0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0]
Text: [[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]]


In [74]:
for text_batch, label_batch in train_ds.take(1):
    vectorized_text = text_vectorization(text_batch)
    print(f"First vectorized text in the first batch: {vectorized_text.numpy()[0]}")
    print(f"Length of the first vectorized text: {len(vectorized_text.numpy()[0])}")

First vectorized text in the first batch: [1 1 0 ... 0 0 0]
Length of the first vectorized text: 20000
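
To see what a multi-hot vector like `[1 1 0 ... 0 0 0]` encodes, here is a plain-Python sketch with a tiny, made-up vocabulary. Reserving index 0 for unknown words is an assumption of this sketch, chosen to resemble how a vocabulary with an out-of-vocabulary slot behaves.

```python
def multi_hot(tokens, vocab):
    """Mark which vocabulary entries occur in the text; counts and order are lost."""
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[vocab.get(tok, 0)] = 1  # unseen words fall into slot 0
    return vec


# Hypothetical vocabulary, ordered roughly from most to least frequent.
vocab = {"[UNK]": 0, "the": 1, "movie": 2, "was": 3, "great": 4, "awful": 5}

vec = multi_hot("the movie was great great".split(), vocab)
print(vec)
print(multi_hot(["plot"], vocab))
```

Each review becomes one vector as long as the vocabulary, which matches the length of 20000 seen above. If low indices hold the unknown slot and the most frequent words, that would explain why the real vectors tend to start with 1s.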


In [75]:
def get_model(max_tokens=20000, hidden_dim=16):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

In [76]:
model = get_model()
model.summary()
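
The parameter count of this model can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout has no parameters. The helper name `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a dense layer."""
    return n_inputs * n_units + n_units


max_tokens, hidden_dim = 20000, 16
hidden = dense_params(max_tokens, hidden_dim)  # Dense(16) on the 20000-long input
output = dense_params(hidden_dim, 1)           # Dense(1) sigmoid output
total = hidden + output
print(hidden, output, total)
```

Almost all parameters sit in the first layer, because every one of the 20000 vocabulary entries connects to each hidden unit.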

In [77]:
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]
model.fit(binary_1gram_train_ds, validation_data=binary_1gram_val_ds, epochs=10, callbacks=callbacks)

Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 5s 7ms/step - accuracy: 0.7723 - loss: 0.4865 - val_accuracy: 0.8796 - val_loss: 0.3040
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 7ms/step - accuracy: 0.8978 - loss: 0.2737 - val_accuracy: 0.8852 - val_loss: 0.3028
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.9171 - loss: 0.2429 - val_accuracy: 0.8876 - val_loss: 0.3177
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.9250 - loss: 0.2208 - val_accuracy: 0.8844 - val_loss: 0.3393
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.9312 - loss: 0.2137 - val_accuracy: 0.8860 - val_loss: 0.3554
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.9351 - loss: 0.2113 - val_accuracy: 0.8840 - val_loss: 0.3707
Epoch 7/10
625/625

<keras.src.callbacks.history.History at 0x134eb250ec0>

Load the saved model and evaluate it on the test set.

In [19]:
model = keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 179s 227ms/step - accuracy: 0.8873 - loss: 0.2906
Test acc: 0.887


In [92]:
# Example text to predict
example_texts = [
    "This movie was fantastic! I really enjoyed it.",
    "I did not like this movie at all. It was boring and too long."
]

# Vectorize the example texts
vectorized_texts = text_vectorization(example_texts)

# Make predictions
predictions = model.predict(vectorized_texts)

# Print the predictions
for text, prediction in zip(example_texts, predictions):
    print(f"Text: {text}")
    print(f"Prediction (0 = negative, 1 = positive): {prediction[0]:.4f}")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
Text: This movie was fantastic! I really enjoyed it.
Prediction (0 = negative, 1 = positive): 0.9442
Text: I did not like this movie at all. It was boring and too long.
Prediction (0 = negative, 1 = positive): 0.3506
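
The sigmoid output is a score between 0 and 1, so turning it into a class label needs a cut-off. A minimal sketch, assuming the conventional threshold of 0.5 and the class order `['neg', 'pos']` reported by the dataset earlier; `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5, class_names=("neg", "pos")):
    """Map a sigmoid output to a class name."""
    return class_names[int(probability >= threshold)]


for p in (0.9442, 0.3506):  # the two predictions printed above
    print(f"{p:.4f} -> {to_label(p)}")
```

Note that 0.3506 is classified as negative but is not far from the threshold, so the model is much less confident about the second review than about the first.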


You can see that the input vector contains only 1s and 0s.

In [21]:
for inputs, targets in binary_1gram_test_ds:
    print(f"inputs: {inputs.shape}, {inputs.dtype}")
    print(f"targets: {targets.shape}, {targets.dtype}")
    print(f"inputs[0]: {inputs[0].numpy()}")
    print(f"targets[0]: {targets[0].numpy()}")
    break

inputs: (32, 20000), <dtype: 'int64'>
targets: (32,), <dtype: 'int32'>
inputs[0]: [1 1 1 ... 0 0 0]
targets[0]: 0


Trying bigrams (with TF-IDF output).

In [20]:
text_vectorization = layers.TextVectorization(ngrams=2, max_tokens=20000, output_mode="tf_idf")
text_vectorization.adapt(text_only_train_ds)
tfidf_2gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
tfidf_2gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
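
The idea behind `output_mode="tf_idf"` can be sketched in plain Python: a token's count in a document is scaled down when the token appears in many documents. The formula below is one common smoothed variant chosen for illustration; the exact weighting that `TextVectorization` uses may differ, and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # term frequency within this document
        weights.append({
            tok: count * math.log(1 + n_docs / (1 + doc_freq[tok]))
            for tok, count in counts.items()
        })
    return weights


docs = [
    ["great", "movie"],
    ["awful", "movie"],
    ["great", "great", "acting"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

In the second document, "awful" gets a higher weight than "movie" because it is rarer across the corpus. With `ngrams=2`, the same weighting is applied to word pairs as well as single words.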

In [22]:
model = get_model()
callbacks = [keras.callbacks.ModelCheckpoint("tfidf_2gram.keras", save_best_only=True)]
model.fit(tfidf_2gram_train_ds, validation_data=tfidf_2gram_val_ds, epochs=10, callbacks=callbacks)


Epoch 1/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 82s 126ms/step - accuracy: 0.7243 - loss: 0.8280 - val_accuracy: 0.8830 - val_loss: 0.3030
Epoch 2/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 43ms/step - accuracy: 0.8673 - loss: 0.3307 - val_accuracy: 0.8860 - val_loss: 0.3072
Epoch 3/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 30s 48ms/step - accuracy: 0.8896 - loss: 0.2914 - val_accuracy: 0.8786 - val_loss: 0.3287
Epoch 4/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 26s 42ms/step - accuracy: 0.9035 - loss: 0.2576 - val_accuracy: 0.8814 - val_loss: 0.3244
Epoch 5/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 27s 42ms/step - accuracy: 0.9055 - loss: 0.2449 - val_accuracy: 0.8778 - val_loss: 0.3328
Epoch 6/10
625/625 ━━━━━━━━━━━━━━━━━━━━ 24s 39ms/step - accuracy: 0.9122 - loss: 0.2336 - val_accuracy: 0.8856 - val_loss: 0.3325
Epoch 7/10

<keras.src.callbacks.history.History at 0x1d5681a9af0>

In [24]:
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

782/782 ━━━━━━━━━━━━━━━━━━━━ 156s 199ms/step - accuracy: 0.8892 - loss: 0.2975
Test acc: 0.888


Part 3: now we look at sequences. Preparing the data for use with an LSTM.

In [26]:
max_length = 600
max_tokens = 20000

text_vectorization = layers.TextVectorization(max_tokens=max_tokens, output_mode="int", output_sequence_length=max_length)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
int_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
int_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))
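
With `output_mode="int"` and a fixed `output_sequence_length`, each review becomes a sequence of word indices that is truncated or zero-padded to the same length. The sketch below mimics that in plain Python. The tiny vocabulary is made up, and reserving 0 for padding and 1 for unknown words is an assumption of this sketch.

```python
def to_int_sequence(tokens, vocab, sequence_length):
    """Map tokens to indices, then truncate or right-pad with 0 to a fixed length."""
    ids = [vocab.get(tok, 1) for tok in tokens]  # 1 = unknown word
    ids = ids[:sequence_length]                  # truncate long texts
    return ids + [0] * (sequence_length - len(ids))  # pad short texts


# Hypothetical vocabulary: 0 = padding, 1 = unknown.
vocab = {"": 0, "[UNK]": 1, "the": 2, "movie": 3, "was": 4, "great": 5}

short = to_int_sequence("the movie was great".split(), vocab, 6)
long = to_int_sequence("the plot was great fun indeed really".split(), vocab, 6)
print(short)
print(long)
```

Unlike the multi-hot representation, this keeps word order, which is what a sequence model such as an LSTM needs.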

In [29]:
import tensorflow as tf

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(max_tokens, 128)(inputs)
#embedded = tf.one_hot(inputs, depth=max_tokens)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
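
An embedding layer is essentially a trainable lookup table: row `i` of the table is the vector for token index `i`. The plain-Python sketch below shows that this lookup gives the same result as multiplying a one-hot vector by the table, which is why the commented-out `tf.one_hot` line is a (much more memory-hungry) alternative input. The 4x2 table is made up; in the model above the table would be 20000 x 128, i.e. 2,560,000 weights.

```python
def embed(token_ids, table):
    """Embedding lookup: pick row i of the table for each token index i."""
    return [table[i] for i in token_ids]


def one_hot_matmul(token_ids, table):
    """Same result, computed the slow way: one-hot vector times the table."""
    vocab_size, dim = len(table), len(table[0])
    out = []
    for i in token_ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(vocab_size)]
        out.append([
            sum(one_hot[j] * table[j][d] for j in range(vocab_size))
            for d in range(dim)
        ])
    return out


# Hypothetical table: 4 tokens, 2 dimensions each.
table = [[0.0, 0.0], [0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
ids = [2, 3, 0]
print(embed(ids, table))
print(one_hot_matmul(ids, table))
```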


Part 4, Lucia:
### Embeddings:
- page 332 in the book
- Bidirectional layers: the sequence is processed in both directions (forward and backward)

Build a new neural network and add something called a "mask": set `mask_zero=True` on the `Embedding` layer.  
This does not improve things very much.  
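
What a mask is for can be shown with a toy recurrent update in plain Python. This is a conceptual sketch only, not how an LSTM is implemented: the update rule `state = 0.5 * state + x` and the function names are made up. The point is that padded positions (index 0) are skipped, so the final state is not diluted by the padding at the end of a short sequence.

```python
def padding_mask(token_ids):
    """True for real tokens, False for padding (index 0)."""
    return [tid != 0 for tid in token_ids]


def run_toy_rnn(inputs, mask):
    """Toy recurrence; on masked steps the state is carried over unchanged."""
    state = 0.0
    for x, keep in zip(inputs, mask):
        if keep:
            state = 0.5 * state + x
    return state


token_ids = [7, 3, 0, 0]       # hypothetical sequence, padded to length 4
inputs = [1.0, 2.0, 0.0, 0.0]  # hypothetical per-step input values
mask = padding_mask(token_ids)

print(mask)
print(run_toy_rnn(inputs, mask))        # padding skipped
print(run_toy_rnn(inputs, [True] * 4))  # padding processed like real input
```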

So far, TF-IDF is the best.

The next step is what is called TRANSFORMERS! Creating a new file, 11, for that.