# Meaningful memories: oral history pipeline - processing videos

## Prepare data
Put the data you want to process in a subdirectory of this notebook's root directory. In Colab, open the Files panel, create a directory named "data", and upload your video into its own subdirectory of "data".

Your directory structure should look like this:
```
notebook_root/
  data/
    dir_name_of_your_video/   # name it after your video; use underscores instead of spaces
      yourvideo.mp4
```
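
The layout above could also be created from Python rather than through the Files panel. This is a minimal sketch: the helper `prepare_video_dir` and its `data_root` argument are hypothetical and not part of the meaningful-memories package. It derives the directory name from the video's file name, replaces whitespace with underscores, and moves the video into that directory.

```python
import re
import shutil
from pathlib import Path


def prepare_video_dir(video_path, data_root="data"):
    """Move a video into data_root/<name_without_spaces>/ and return its new path.

    Hypothetical helper, not part of the meaningful-memories package.
    """
    video = Path(video_path)
    # Directory name: file name without extension, whitespace runs -> underscore.
    dir_name = re.sub(r"\s+", "_", video.stem.strip())
    target_dir = Path(data_root) / dir_name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / video.name
    shutil.move(str(video), str(target))
    return target
```

For example, `prepare_video_dir("my video.mp4")` would place the file at `data/my_video/my video.mp4`.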

## Prepare environment

In [None]:
!git clone https://github.com/SURF-ML/meaningful-memories.git

Cloning into 'meaningful-memories'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 72 (delta 26), reused 56 (delta 15), pack-reused 0 (from 0)
Receiving objects: 100% (72/72), 2.37 MiB | 3.26 MiB/s, done.
Resolving deltas: 100% (26/26), done.


In [None]:
!pip install -e meaningful-memories --quiet

  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done

Restart the notebook session (Runtime -> Restart session) so that the newly installed package is recognized.
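
After the restart, a quick check can confirm that Python can locate the package before you run the import cells. This is a small sketch using only the standard library; the helper name `is_importable` is an illustration, and `meaningful_memories` is the module name used in the imports further down this notebook.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate a top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "meaningful_memories" is the module name used by the import cells below.
if not is_importable("meaningful_memories"):
    print("meaningful_memories not found; re-run the install cell and restart the session.")
```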

## Optional: update config
If you want to experiment with different models (sizes or types) or parameters, edit the config in ```meaningful-memories/meaningful_memories/configs/config.yaml```

Restart the session again after any change so that the updated config is loaded in this environment.
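
When experimenting, it can help to keep a copy of the original config so you can return to the defaults. The sketch below is one possible way to do that with the standard library; `backup_config` is a hypothetical helper, and the `.bak` suffix is an arbitrary choice. The path is the one given above, relative to the notebook root.

```python
import shutil
from pathlib import Path

# Path taken from the text above, relative to the notebook root.
CONFIG_PATH = Path("meaningful-memories/meaningful_memories/configs/config.yaml")


def backup_config(config_path=CONFIG_PATH):
    """Copy the config to <name>.bak next to it, unless a backup already exists.

    Hypothetical helper; returns the path of the backup file.
    """
    config_path = Path(config_path)
    backup_path = config_path.with_name(config_path.name + ".bak")
    if not backup_path.exists():
        # The first backup is kept, so later edits never overwrite the defaults.
        shutil.copy2(config_path, backup_path)
    return backup_path
```

Call `backup_config()` once before the first edit; copying the `.bak` file back over `config.yaml` restores the original settings.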

## Load relevant pipeline modules

In [1]:
from meaningful_memories.interview import Interview
from meaningful_memories.transcript import Transcript
from meaningful_memories.transcriber import WhisperTranscriber
from meaningful_memories.extracter import EntityExtracter, LLMTopicExtracter

In [2]:
import pandas as pd
import os
from tqdm.notebook import tqdm


In [3]:
# Set flags
skip_convert = False  # set to True when processing audio-only input to skip the video-to-audio conversion

## Load data

In [4]:
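path = "data/"

interviews = []
# Each subdirectory of `path` holds one interview.
for dir_name in os.listdir(path):  # `dir_name` avoids shadowing the built-in dir()
  # Skip Jupyter's .ipynb_checkpoints directory
  if dir_name.startswith(".ipynb"):
    continue
  print(f"Running pipeline for {dir_name}")
  interviews.append(
      Interview(
          input_dir=os.path.join(path, dir_name),
          skip_convert=skip_convert,
      )
  )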

Running pipeline for 20_jaar_internet
Converting data/20_jaar_internet/20_jaar_internet.mp4 to audio (WAV).


  audio, sr = librosa.load(str(input_path))
	Deprecated as of librosa version 0.10.0.
	It will be removed in librosa version 1.0.
  y, sr_native = __audioread_load(path, offset, duration, dtype)


## Transcribe data

In [5]:
transcriber = WhisperTranscriber()
for interview in interviews:
  transcriber.transcribe(interview)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cuda:0
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` 

## Run extraction

In [6]:
extracter = EntityExtracter()
# Optional: uncomment to also extract topics with an LLM
# topic_extracter = LLMTopicExtracter()





In [7]:
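for interview in interviews:
    extracter.extract(interview)
    # Optional LLM topic extraction; requires a topic_extracter instance (LLMTopicExtracter).
    # topic_extracter.extract(interview)
    # topic_extracter.aggregate_topics(interview)
    # Note: location_extracter is not defined in this notebook.
    # location_extracter.extract(interview)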



Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Write results to file
The final output consists of:

* converted audio
* JSON file with the transcript, extracted entities, etc.
* JSON-LD file with W3C annotations
* HTML visualization of the transcript and tagged entities



In [11]:
from types import SimpleNamespace

# Minimal stand-in for the arguments object that write_to_file expects
args = SimpleNamespace(text_only=False)
for interview in interviews:
    interview.combine_chunks()
    interview.visualize()
    interview.write_to_file(args)
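
Once the cell above has run, you can check which output files were produced. The sketch below groups the files in a directory by extension. `list_outputs` is a hypothetical helper, and where the pipeline writes its files is an assumption here: pass whichever directory actually contains your results.

```python
from collections import defaultdict
from pathlib import Path


def list_outputs(directory):
    """Group the files directly inside `directory` by lowercase file extension.

    Hypothetical helper; subdirectories are ignored.
    """
    by_suffix = defaultdict(list)
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file():
            by_suffix[entry.suffix.lower()].append(entry.name)
    return dict(by_suffix)
```

For example, `list_outputs("data/20_jaar_internet")` would show the files present in that interview's directory, assuming the results are written next to the input video.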