# Meaningful memories: oral history pipeline - processing text

## Prepare data
Paste the raw text you want to process into the cell under "Load data". If your data is in another format, we can help you during the workshop.
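
If your text lives in a plain-text file rather than in the notebook, a minimal sketch for reading it into `your_text` could look like the following. The helper `load_text` and the file name `my_interview.txt` are hypothetical and not part of the pipeline.

```python
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read a plain-text file and return its contents without outer whitespace."""
    return Path(path).read_text(encoding=encoding).strip()


# Hypothetical file name; replace it with the path to your own file.
# your_text = load_text("my_interview.txt")
```

In Colab, the file first needs to be uploaded to the session (Files panel) before it can be read.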

## Prepare environment

In [2]:
!git clone https://github.com/SURF-ML/meaningful-memories.git

Cloning into 'meaningful-memories'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 72 (delta 26), reused 56 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (72/72), 2.37 MiB | 5.11 MiB/s, done.
Resolving deltas: 100% (26/26), done.


In [3]:
!pip install -e meaningful-memories --quiet


Restart the runtime (Runtime -> Restart session) so that the newly installed package is recognized.

## Optional: update config
To experiment with different models (sizes or types) or parameters, edit the config file at `meaningful-memories/meaningful_memories/configs/config.yaml`.

After any change, restart the session again (Runtime -> Restart session) so that the updated config is loaded in this environment.
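
Before editing the config, it can help to look at its current contents and keep a copy to fall back on. The sketch below uses only the standard library; the helper `backup_and_show` and the `.bak` suffix are illustrative choices, not something the pipeline provides.

```python
import shutil
from pathlib import Path

# Path taken from the text above; adjust it if you cloned the repository elsewhere.
CONFIG_PATH = Path("meaningful-memories/meaningful_memories/configs/config.yaml")


def backup_and_show(config_path):
    """Copy the config to '<name>.bak' next to it and return its current text."""
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")
    shutil.copy(config_path, backup_path)
    return config_path.read_text(encoding="utf-8")


# Only runs when the repository has been cloned into the working directory.
if CONFIG_PATH.exists():
    print(backup_and_show(CONFIG_PATH))
```

Printing the file first shows which keys actually exist, so you only change values that the pipeline reads.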

## Load relevant pipeline modules

In [1]:
from meaningful_memories.interview import Interview
from meaningful_memories.transcript import Transcript
from meaningful_memories.transcriber import WhisperTranscriber
from meaningful_memories.extracter import EntityExtracter, LLMTopicExtracter

In [2]:
import pandas as pd
import os
from tqdm.notebook import tqdm


## Load data

In [8]:
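# Replace this example sentence (Dutch: "This is a text about Amsterdam.") with your own text.
your_text = "Dit is een tekst over Amsterdam."

# original_uri and interview_label are placeholders; replace them with your own values.
interview = Interview(
    ".",
    skip_convert=True,
    original_uri="some-uri",
    interview_label="some-label",
)
# The transcript is a list of chunks, each a dict with a "text" key.
interview.transcript = Transcript([{"text": your_text}])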
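
For longer texts, one option is to pass several chunks instead of a single one. The sketch below splits a text on blank lines and builds the same list-of-dicts shape used above. The helper `text_to_chunks` and the second example sentence are hypothetical, and it is an assumption that `Transcript` handles multiple chunks that carry only a `"text"` key.

```python
def text_to_chunks(text):
    """Split text on blank lines and return one {"text": ...} dict per paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return [{"text": p} for p in paragraphs if p]


# Illustrative two-paragraph input.
long_text = "Dit is een tekst over Amsterdam.\n\nDit is een tekst over Rotterdam."
chunks = text_to_chunks(long_text)
print(chunks)

# Assumption: Transcript accepts several chunk dicts of this shape, e.g.
# interview.transcript = Transcript(chunks)
```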

## Run extraction

In [9]:
extracter = EntityExtracter()
# topic_extracter = LLMTopicExtracter()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
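
The warning above is harmless for public models. If you do want to authenticate, a minimal sketch is to read the token from the Colab secrets and expose it through the `HF_TOKEN` environment variable before creating the extracter. The helper `get_hf_token` is hypothetical; it falls back to the environment when not running in Colab or when the secret is missing.

```python
import os


def get_hf_token():
    """Return the Hugging Face token from Colab secrets, else from the environment."""
    try:
        from google.colab import userdata  # only available inside Google Colab
    except ImportError:
        return os.environ.get("HF_TOKEN")
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # Secret not defined, or notebook access to it not granted.
        return os.environ.get("HF_TOKEN")


token = get_hf_token()
if token:
    os.environ["HF_TOKEN"] = token
```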





In [10]:
extracter.extract(interview)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Write results to file
The final output consists of:

* JSON file with the transcript, extracted entities, etc.
* JSON-LD file with W3C annotations
* HTML visualization of the transcript and tagged entities



In [11]:
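# Minimal stand-in for an arguments object, carrying only text_only=True.
args = type('Config', (), {"text_only": True})()
# Merge the transcript chunks, render the visualization, then write the output files.
interview.combine_chunks()
interview.visualize()
interview.write_to_file(args)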
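
To check what was written, you can list the files with the expected extensions. This is a sketch only: the helper `list_outputs` is hypothetical, and it is an assumption that the output lands in or below the current directory (the `"."` passed to `Interview`). Set `recursive=True` if the files end up in a subfolder.

```python
from pathlib import Path


def list_outputs(directory=".", recursive=False):
    """Return sorted paths of .json, .jsonld and .html files in directory."""
    root = Path(directory)
    wanted = {".json", ".jsonld", ".html"}
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in wanted)


for path in list_outputs("."):
    print(path)
```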