# Access and Push

In [None]:
# Access my drive
from google.colab import drive
drive.mount('/content/drive')

# Access github
!git clone

Mounted at /content/drive
Cloning into 'Capstone-Tang'...
remote: Enumerating objects: 7079, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 7079 (delta 4), reused 12 (delta 1), pack-reused 7056 (from 1)
Receiving objects: 100% (7079/7079), 1.09 GiB | 54.83 MiB/s, done.
Resolving deltas: 100% (5/5), done.
Updating files: 100% (6474/6474), done.


Git push:

In [None]:
%cd /content/Capstone-Tang
!git status

/content/Capstone-Tang
On branch main
Your branch is up to date with 'origin/main'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	MOSEI/-3g5yACwYnA/
	MOSEI/-3nNcZdcdvU/
	MOSEI/-HwX2H8Z4hY/
	MOSEI/-NFrJFQijFE/
	MOSEI/-THoVjtIkeU/
	MOSEI/-UuX1xuaiiE/
	MOSEI/-WXXTNIJcVM/
	MOSEI/-ZgjBOA1Yhw/
	MOSEI/-a55Q6RWvTA/
	MOSEI/-aNfi7CP8vM/
	MOSEI/-aqamKhZ1Ec/
	MOSEI/-dxfTGcXJoc/
	MOSEI/-egA8-b7-3M/
	MOSEI/-iRBcNs9oI8/
	MOSEI/-lqc32Zpr7M/
	MOSEI/-lzEya4AM_4/
	MOSEI/-mJ2ud6oKI8/
	MOSEI/-mqbVkbCndg/
	MOSEI/-t217m2on-s/
	MOSEI/-tANM6ETl_M/
	MOSEI/-tPCytz4rww/
	MOSEI/-vxjVxOeScU/
	MOSEI/-wMB_hJL-3o/
	MOSEI/-wny0OAz3g8/
	MOSEI/03X1FwF6udc/
	MOSEI/07Z16yRBFUQ/
	MOSEI/08d4NTXkSxw/
	MOSEI/0AA-wmk8WdA/
	MOSEI/0AkGtPzl7D8/
	MOSEI/0BVed2nBq

In [None]:
!git add .
!git config --global user.email "tianyitang666@gmail.com"
!git config --global user.name "floragreen666"
!git commit -m "<commit message>"  # replace the placeholder; git aborts a commit with an empty message
!git push

# MOSEI Dataset

Since the goal of this capstone is to transcribe audio and run sentiment analysis on the transcripts, we'll need a multimodal sentiment dataset that contains audio, transcripts, and sentiment labels.

MOSEI is a large-scale dataset with diverse, spontaneous spoken content from online videos. It includes transcriptions, which can be used to fine-tune transcription models. It also includes sentiment and emotion labels, which can be used to train sentiment analysis models on transcripts.

The original paper of MOSEI can be found here: https://aclanthology.org/P18-1208/

I also considered several other datasets: MELD is limited to the TV show "Friends", so models trained on it may be hard to generalize; IEMOCAP has a limited number of actors and features acted emotions, which may not reflect diverse and spontaneous speech (the same applies to CREMA-D and SAVEE); MOSI is similar to MOSEI, but not as comprehensive.

# Convert Videos to Audio

I tried to access the dataset through its official repo, but following its instructions resulted in errors. I then looked through a range of repos and datasets and got several pkl files that contain features for the video and audio of the dataset. I tried to decode them, but there were no instructions on the format of the feature files, which made this hard.

I finally came across a repo that has a google drive link to the raw videos of the dataset: https://drive.google.com/drive/folders/1o2pOWQg8fxJkgBJVWk9mjCrbcc1jX4eq

In this part, I convert them to audio. For now, I have converted only 6,000 of the videos, and there are more than 22,000 videos to convert.
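Because only part of the dataset has been converted so far, it can help to list which clips still lack a WAV file before starting another run. The following is a minimal sketch, assuming the `<audio_dir>/<video_id>/<clip_id>.wav` layout used in this notebook; `pending_clips` is a hypothetical helper name, not part of any library.

```python
import os

def pending_clips(clips, audio_dir):
    """Return the (video_id, clip_id) pairs that have no WAV file yet.

    `clips` is any iterable of (video_id, clip_id) pairs (assumed input shape).
    """
    pending = []
    for video_id, clip_id in clips:
        # Same path convention as the conversion step: <audio_dir>/<video_id>/<clip_id>.wav
        wav_path = os.path.join(audio_dir, str(video_id), f"{clip_id}.wav")
        if not os.path.exists(wav_path):
            pending.append((video_id, clip_id))
    return pending
```

Once the label file is loaded into `df`, this could be called as `pending_clips(zip(df['video_id'], df['clip_id']), audio_base_path)` to see how much work remains.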

In [None]:
import os
import pandas as pd
import subprocess

In [None]:
# Paths
input_csv_path = '/content/Capstone-Tang/MOSEI/label.csv'
video_base_path = '/content/drive/MyDrive/MOSEI/Raw'
audio_base_path = '/content/Capstone-Tang/MOSEI/Audio'

In [None]:
# Create the audio base path if it does not exist
os.makedirs(audio_base_path, exist_ok=True)

# Load the CSV file
df = pd.read_csv(input_csv_path)

# Function to extract and save audio using ffmpeg
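def extract_audio_ffmpeg(video_id, clip_id):
    # str() guards against pandas parsing an all-digit video ID as an integer
    video_path = os.path.join(video_base_path, str(video_id), f"{clip_id}.mp4")
    audio_output_dir = os.path.join(audio_base_path, str(video_id))
    os.makedirs(audio_output_dir, exist_ok=True)
    audio_output_path = os.path.join(audio_output_dir, f"{clip_id}.wav")

    # Skip clips converted in an earlier run so the loop can be resumed
    # without redoing or tripping over existing files
    if os.path.exists(audio_output_path):
        return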

    # Use ffmpeg to extract audio
    command = [
        'ffmpeg',
        '-i', video_path,
        '-vn',  # No video
        '-ar', '16000',  # Set audio sample rate to 16kHz
        '-ac', '1',  # Set number of audio channels to 1 (mono)
        audio_output_path
    ]
    subprocess.run(command, check=True)

# Iterate through the CSV and process each video
for index, row in df.iterrows():
    video_id = row['video_id']
    clip_id = row['clip_id']
    extract_audio_ffmpeg(video_id, clip_id)

    if index % 30 == 0:
        print(f"Processed {index} rows")

print("Audio extraction complete.")

Audio extraction complete.
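
Since the ffmpeg command requests 16 kHz mono output, a quick header check on a few converted files can confirm that the settings took effect. This is a sketch only, assuming ffmpeg wrote plain PCM WAV files, which the standard `wave` module can read; `wav_format` and `is_16k_mono` are hypothetical helper names.

```python
import wave

def wav_format(path):
    """Return (sample_rate_hz, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()

def is_16k_mono(path):
    """True if the file matches the 16 kHz, single-channel target format."""
    return wav_format(path) == (16000, 1)
```

Running `is_16k_mono` over a handful of files under `audio_base_path` is usually enough to catch a misconfigured conversion before transcribing.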


# Transcribe Audios

Next, I'll choose a model that transcribes audio to text. I'll fine-tune it on the MOSEI data to get better performance.

Wav2Vec 2.0 is a strong speech recognition model, which makes it a good fit for this task.

I also found one of its variants: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english. This model is fine-tuned for English, and since English is our target language, it would likely have higher accuracy. However, upon testing, I found it 2 to 3 times slower than the original Wav2Vec 2.0. Thus, I'll stick with the original version for the sake of time.
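
A speed comparison like this can be made repeatable with a small timing helper. The sketch below is not the measurement actually used here; `seconds_per_clip` is a hypothetical name, and `transcribe_fn` stands for any function that takes an audio path, for example one wrapper per model.

```python
import time

def seconds_per_clip(transcribe_fn, audio_paths):
    """Average wall-clock seconds that `transcribe_fn` spends per audio path."""
    audio_paths = list(audio_paths)
    if not audio_paths:
        raise ValueError("need at least one audio path")
    start = time.perf_counter()
    for path in audio_paths:
        transcribe_fn(path)
    return (time.perf_counter() - start) / len(audio_paths)
```

Timing both models on the same handful of clips, after one warm-up call each, gives a fairer ratio than timing a single file.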

First, I'll verify that the model works on my converted MOSEI audio.

In [None]:
output_csv_path = '/content/Capstone-Tang/MOSEI/transcriptions.csv'

In [None]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf

In [None]:
# Load model and processor
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Function to transcribe audio
def transcribe_audio(audio_path):
    # Load the audio file
    audio_input, sample_rate = sf.read(audio_path)

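    # Convert the waveform to model inputs; passing the file's actual sample rate
    # makes the processor raise an error if it is not the 16 kHz the model expects
    input_values = processor(audio_input, return_tensors="pt", padding="longest", sampling_rate=sample_rate).input_values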

    # Perform inference
    with torch.no_grad():
        logits = model(input_values).logits

    # Decode the logits to get the transcription
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    return transcription[0]

transcriptions = []
count = 0
for video_id in os.listdir(audio_base_path):
    video_folder_path = os.path.join(audio_base_path, video_id)
    for clip_filename in os.listdir(video_folder_path):
        clip_path = os.path.join(video_folder_path, clip_filename)
        clip_id = os.path.splitext(clip_filename)[0]
        transcription = transcribe_audio(clip_path)
        transcriptions.append({
            'video_id': video_id,
            'clip_id': clip_id,
            'transcription': transcription
        })
        count += 1
        if count % 30 == 0:
            print(f"Processed {count} rows")
    # temporary: stop here to see an example
    break

transcriptions_df = pd.DataFrame(transcriptions)
transcriptions_df.to_csv(output_csv_path, index=False)
print("Transcription complete.")

print(f'\n{transcriptions_df.head()}')

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You sho

Transcription complete.

      video_id clip_id                                      transcription
0  -3g5yACwYnA      10  ON KEYS PART OF AH THE PEOPLE THAT WE USE TOTO...
1  -3g5yACwYnA      13  THAT WE DO O THEY'VE BEEN ABLE TO FIND SOLUTIO...
2  -3g5yACwYnA       3  OM WE'RE A HUGE A USERVE IT HE SAYS FOR OUR OP...
3  -3g5yACwYnA       2  ERATIONS AM KEE BRINGS THE KEEP ALROM BRINGS A...
4  -3g5yACwYnA       4  KEY BRINGS THOSE TYPES OF A ASPECTS TO OUR BUS...
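
One way to put a number on how well the model works is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal pure-Python sketch, assuming reference transcripts are available as plain strings in the same casing as the model output; `word_error_rate` is a hypothetical helper, not a library function.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Averaging this over a sample of clips before and after fine-tuning would show whether fine-tuning actually helps.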


Next up:
1. Convert all videos to audio
2. Fine-tune the transcription model on MOSEI data
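
For fine-tuning, the reference transcripts will probably need to match the model's output alphabet. Judging by the sample transcriptions above, `facebook/wav2vec2-base-960h` emits uppercase letters and apostrophes. The sketch below is one possible normalization under that assumption; `normalize_transcript` is a hypothetical helper, and it drops digits and non-ASCII letters, which may or may not be acceptable for MOSEI.

```python
import re

def normalize_transcript(text: str) -> str:
    """Uppercase `text` and keep only A-Z, apostrophes, and single spaces."""
    text = text.upper()
    # Replace every run of other characters (punctuation, digits, newlines) with a space
    text = re.sub(r"[^A-Z' ]+", " ", text)
    # Collapse repeated spaces and trim the ends
    return re.sub(r" +", " ", text).strip()
```

Applying the same normalization to references before computing any accuracy metric keeps punctuation and casing from being counted as errors.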

# Sentiment Analysis on Transcripts

Although this part is future work, I've thought about some candidate text-based sentiment analysis models for transfer learning.

I would like to go with the BERT family, which includes several popular models: BERT, RoBERTa, ALBERT, and DistilBERT.

I'm still deciding which one to use, since higher accuracy generally comes with a larger, more expensive model. I have to consider my limited computing resources (I'm currently on Colab Pro+, which costs $50 per month).
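
Parameter count is a rough but quick proxy for how much memory and compute each candidate needs. The helper below is a sketch that works with any object exposing the `parameters()`, `numel()`, and `requires_grad` interface of a PyTorch `torch.nn.Module`; `count_parameters` is a hypothetical name.

```python
def count_parameters(model, trainable_only=False):
    """Total number of scalar parameters in a PyTorch-style model.

    With trainable_only=True, parameters with requires_grad=False are skipped.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )
```

For example, after loading a candidate with `transformers.AutoModel.from_pretrained(...)`, calling `count_parameters(model)` on each one gives a like-for-like size comparison before committing Colab time to training.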