<a href="https://colab.research.google.com/github/cgenevier/CSCI5622-HW4/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Study 1: Designing explainable speech-based machine learning models of depression

To open this notebook in Colab, click the "Open in Colab" button at the top of the notebook on GitHub, or [follow this link](https://colab.research.google.com/github/cgenevier/CSCI5622-HW4/blob/main/main.ipynb).

Colab does not automatically load the data or helper functions from the GitHub repo, so the code below clones the repo into the workspace directory. To save this notebook back to GitHub, select **File > Save** (which should show the repo if you're signed in) or **File > Save a copy in GitHub** if it's in the menu.

Note that changes to the data files or to any other files in the cloned repo are not saved back to GitHub, so commit any such changes to GitHub separately.

In [2]:
# Clone the GitHub repo into the temporary local environment so the data can be accessed and manipulated
!git clone https://github.com/cgenevier/CSCI5622-HW4.git
%cd CSCI5622-HW4

Cloning into 'CSCI5622-HW4'...
remote: Enumerating objects: 406, done.
remote: Counting objects: 100% (406/406), done.
remote: Compressing objects: 100% (403/403), done.
remote: Total 406 (delta 10), reused 391 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (406/406), 5.14 MiB | 5.55 MiB/s, done.
Resolving deltas: 100% (10/10), done.
/Library/WebServer/Sites/School/CSCI5622-HW4/CSCI5622-HW4
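
The working directory printed above ends in `CSCI5622-HW4/CSCI5622-HW4`, which suggests the clone cell was run from inside an existing checkout. A minimal sketch of a guard that avoids a nested clone follows; the helper name `clone_if_missing` and its `cwd` parameter are hypothetical and not part of the repo.

```python
import os
import subprocess

def clone_if_missing(repo_url, dest, cwd=None):
    """Return the path of the checkout, cloning only when it is absent.

    Hypothetical helper for illustration: skips the clone when the current
    directory already is the checkout, or when `dest` already exists in it.
    """
    cwd = os.path.abspath(cwd or os.getcwd())
    if os.path.basename(cwd) == dest:
        return cwd  # already inside the checkout
    target = os.path.join(cwd, dest)
    if not os.path.isdir(target):
        subprocess.run(["git", "clone", repo_url, target], check=True)
    return target
```

In the notebook this could be called as `clone_if_missing("https://github.com/cgenevier/CSCI5622-HW4.git", "CSCI5622-HW4")`, followed by `%cd` into the returned path.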


In [17]:
# Import necessary libraries

# Helpers
import glob

# Pandas, seaborn, and numpy for data manipulation
import pandas as pd
pd.set_option("display.max_rows", None)
import statistics as stat
import seaborn as sns
import numpy as np
np.random.seed(42)

# Keras & TensorFlow for building the neural networks
import itertools, json, time
from itertools import count
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, regularizers, callbacks, backend as K
tf.random.set_seed(42)

# Feature extraction
!pip install vaderSentiment transformers torch
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import BertTokenizer, BertModel
import torch

# Matplotlib for graphing
import matplotlib.pyplot as plt



In [18]:
# Import Depression Labels
# Columns: Participant_ID, PHQ_Score
depression_labels = pd.read_csv("data/DepressionLabels.csv")

# Rename Participant_ID to ParticipantID to match acoustic files below & force trimmed string type
depression_labels = depression_labels.rename(columns={"Participant_ID": "ParticipantID"})
depression_labels["ParticipantID"] = depression_labels["ParticipantID"].astype(str).str.strip()

# Import Text Dataset (for text feature extraction)
# Note: Comparing the E-DAIC_Transcripts files with the corresponding E-DAIC Acoustics files shows
# that the transcripts sometimes contain only part of the Text column in the acoustics files
# (for example, 386_Transcript.csv), so the Text column of the acoustics files is concatenated
# here instead, for completeness.
rows = []
for p in glob.glob("data/E-DAIC_Acoustics/*_utterance_agg.csv"):
    df = pd.read_csv(p)
    df["ParticipantID"] = df["ParticipantID"].astype(str).str.strip()
    full_text = " ".join(df["Text"].dropna().astype(str))
    full_text = " ".join(full_text.split())  # collapse whitespace
    rows.append({"ParticipantID": df["ParticipantID"].iloc[0], "FullText": full_text})

# Columns: ParticipantID, FullText
text_df = pd.DataFrame(rows)
# Merge with labels. Columns: ParticipantID, FullText, PHQ_Score
lang_df = depression_labels.merge(text_df, on="ParticipantID", how="inner")


# Import Acoustic Dataset (for parts c, d)
# Note: May need to aggregate the data per participant, ex: mean, std, iqr for each feature
# @TODO
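
The note above suggests collapsing the utterance-level acoustic rows into one row per participant using the mean, standard deviation, and IQR of each feature. A minimal sketch of that aggregation with pandas follows. The helper name `aggregate_per_participant` and the toy feature names (`pitch`, `loudness`) are assumptions for illustration; the real column names in the `*_utterance_agg.csv` files are not shown here.

```python
import pandas as pd

def aggregate_per_participant(df, id_col="ParticipantID"):
    """Collapse utterance-level rows into one row per participant.

    For every numeric column, computes the mean, standard deviation,
    and interquartile range (75th minus 25th percentile).
    """
    numeric = df.select_dtypes("number").columns.drop(id_col, errors="ignore")
    grouped = df.groupby(id_col)[list(numeric)]
    mean = grouped.mean().add_suffix("_mean")
    std = grouped.std().add_suffix("_std")
    iqr = (grouped.quantile(0.75) - grouped.quantile(0.25)).add_suffix("_iqr")
    return pd.concat([mean, std, iqr], axis=1).reset_index()

# Toy utterance-level frame; the feature names are made up for illustration.
toy = pd.DataFrame({
    "ParticipantID": ["386", "386", "386", "387", "387"],
    "Text": ["a", "b", "c", "d", "e"],
    "pitch": [100.0, 110.0, 120.0, 200.0, 220.0],
    "loudness": [0.1, 0.2, 0.3, 0.5, 0.7],
})
toy_acoustic_df = aggregate_per_participant(toy)
```

The resulting frame has one row per `ParticipantID` and can be merged with `depression_labels` in the same way as the text features.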


### (a) (2 points) Extracting language features.

**Syntactic vectorizers:** count vectorizer (e.g., CountVectorizer from sklearn), which transforms
a collection of text documents into a numerical matrix of word or token counts; TF-IDF vectorizer (e.g., TfidfVectorizer from sklearn), which incorporates document-level weighting
to emphasize words significant to specific documents; part-of-speech features, which count
the distribution of part-of-speech tags over a document.

In [19]:
# Use TfidfVectorizer from sklearn
vect = TfidfVectorizer(max_features=1000)
X_tfidf = vect.fit_transform(lang_df["FullText"])

# Convert sparse matrix to DataFrame
syntactic_df = pd.DataFrame(
    X_tfidf.toarray(),
    columns=vect.get_feature_names_out()
)

# Add ParticipantID column & move to first column
syntactic_df["ParticipantID"] = lang_df["ParticipantID"].values
cols = ["ParticipantID"] + [c for c in syntactic_df.columns if c != "ParticipantID"]
syntactic_df = syntactic_df[cols]

# Add back in PHQ_Score & move to second column
syntactic_df = syntactic_df.merge(depression_labels, on="ParticipantID", how="inner")
cols = ["ParticipantID", "PHQ_Score"] + [c for c in syntactic_df.columns if c not in ["ParticipantID", "PHQ_Score"]]
syntactic_df = syntactic_df[cols]

# Inspect dataframe
syntactic_df.head()

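| | ParticipantID | PHQ_Score | 10 | 12 | 15 | 16 | 18 | 19 | 20 | 30 | ... | yes | yesterday | yet | york | you | young | younger | youngest | your | yourself |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 386 | 11 | 0.008627 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.010623 | 0.0 | 0.0 | 0.0 | 0.352474 | 0.011232 | 0.0 | 0.0 | 0.016914 | 0.019322 |
| 1 | 387 | 2 | 0.027755 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.032038 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.047486 | 0.346512 | 0.072275 | 0.0 | 0.0 | 0.108838 | 0.031083 |
| 2 | 388 | 17 | 0.031186 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.038404 | 0.0 | 0.040133 | 0.0 | 0.283166 | 0.0 | 0.0 | 0.0 | 0.020382 | 0.034926 |
| 3 | 389 | 14 | 0.054573 | 0.052964 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.325182 | 0.0 | 0.0 | 0.0 | 0.089168 | 0.0 |
| 4 | 390 | 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.016775 | 0.0 | ... | 0.008948 | 0.020119 | 0.0 | 0.0 | 0.123706 | 0.0 | 0.0 | 0.0 | 0.009498 | 0.016275 |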


**Semantic features:** sentiment scores (e.g., Vader, https://github.com/cjhutto/vaderSentiment),
topic distribution (using topic modeling), or named entities
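
Of the semantic options listed, topic distribution is the one not exercised in this notebook. As a minimal sketch, assuming scikit-learn's `LatentDirichletAllocation` is an acceptable topic model here, the following fits two topics on made-up documents and returns one topic-probability row per document. The number of topics and the toy documents are arbitrary choices for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only.
topic_docs = [
    "sleep tired sleep night tired",
    "work job boss work office",
    "night sleep dream tired bed",
    "office job meeting boss work",
]

# LDA works on raw token counts rather than TF-IDF weights.
topic_counts = CountVectorizer().fit_transform(topic_docs)

# n_components is the number of topics; 2 is an arbitrary choice here.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
topic_dist = lda.fit_transform(topic_counts)  # shape: (n_docs, n_topics)
```

Each row of `topic_dist` sums to 1, so its columns could be appended to a feature frame as `topic_0`, `topic_1`, and so on.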

In [20]:
# Using Vader to analyze sentiment of the text data
analyzer = SentimentIntensityAnalyzer()

# Apply Vader to the text data (creates 4 new columns)
vader_scores = lang_df["FullText"].apply(lambda x: pd.Series(analyzer.polarity_scores(str(x))))
semantic_df = pd.concat([lang_df, vader_scores], axis=1)

# Inspect dataframe
semantic_df.head()


| | ParticipantID | PHQ_Score | FullText | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|
| 0 | 386 | 11 | might have pulled something that I'm going to ... | 0.046 | 0.77 | 0.184 | 0.9999 |
| 1 | 387 | 2 | when she's done she'll let you know alrighty t... | 0.05 | 0.665 | 0.285 | 0.9996 |
| 2 | 388 | 17 | are you okay with yes doing all right from Pas... | 0.07 | 0.769 | 0.161 | 0.9953 |
| 3 | 389 | 14 | and please are you okay sure I'm okay small to... | 0.057 | 0.827 | 0.116 | 0.9822 |
| 4 | 390 | 9 | and now she's going to chat with you for a bit... | 0.067 | 0.74 | 0.193 | 0.9996 |
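
The `compound` values above are all close to 1. VADER's compound score is a normalized sum of word valences, so it tends to saturate on long inputs such as a whole concatenated interview. One possible alternative, sketched below, is to score each utterance separately and average the results per participant. The helper `mean_utterance_scores` is hypothetical; it takes the scoring function as a parameter so the sketch runs without VADER installed.

```python
from statistics import mean

def mean_utterance_scores(utterances, score_fn):
    """Average per-utterance score dictionaries key by key.

    `score_fn` maps a string to a dict of floats, for example
    SentimentIntensityAnalyzer().polarity_scores. Blank utterances are skipped.
    """
    scores = [score_fn(str(u)) for u in utterances if str(u).strip()]
    if not scores:
        return {}
    return {key: mean(s[key] for s in scores) for key in scores[0]}

# Stand-in scorer for demonstration only; it is not VADER.
def toy_scorer(text):
    return {"compound": 1.0 if "good" in text else -1.0}

toy_sentiment = mean_utterance_scores(["a good day", "a bad day", " "], toy_scorer)
```

In the notebook, `score_fn` would be `analyzer.polarity_scores` and `utterances` would be each participant's `df["Text"].dropna()` from the acoustics files.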


**Advanced features:** word embeddings, such as Word2Vec or BERT (e.g., pytorch-pretrained-bert), for capturing contextual meaning
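
BERT-base accepts at most 512 tokens per input, so a full interview transcript passed through it in one call is truncated to its opening portion. One possible workaround is to split the token IDs into overlapping windows and embed each window separately. The sketch below covers only the windowing step; `chunk_token_ids` is a hypothetical helper, and the default window of 510 assumes two positions are reserved for the `[CLS]` and `[SEP]` tokens.

```python
def chunk_token_ids(token_ids, window=510, stride=255):
    """Split a token-ID list into overlapping windows of at most `window` IDs.

    Consecutive windows start `stride` positions apart, so with
    stride < window neighbouring windows overlap.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not token_ids:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return chunks

# Small demonstration with integers standing in for token IDs.
demo_chunks = chunk_token_ids(list(range(10)), window=4, stride=2)
```

Each window could then be wrapped with the tokenizer's special tokens, passed through the model, and the resulting `[CLS]` vectors averaged into one participant-level embedding. Whether that improves on simple truncation for this dataset is untested here.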

In [21]:
# Use BERT to capture contextual meaning

# Load uncased base model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval() 

# Loop through text data and get embeddings
embeddings = []
for text in lang_df["FullText"]:
    # Truncate long text (BERT max = 512 tokens)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze().numpy()  # [CLS] token
    embeddings.append(cls_embedding)

# Convert list of embeddings (each 768-dim) to DataFrame
bert_df = pd.DataFrame(np.vstack(embeddings))
bert_df.columns = [f"bert_{i}" for i in range(bert_df.shape[1])]

# Add ParticipantID and PHQ_Score
bert_df = pd.concat([lang_df[["ParticipantID", "PHQ_Score"]].reset_index(drop=True), bert_df], axis=1)

# Inspect dataframe
bert_df.head()

ImportError: 
BertModel requires the PyTorch library but it was not found in your environment.
However, we were able to find a TensorFlow installation. TensorFlow classes begin
with "TF", but are otherwise identically named to our PyTorch classes. This
means that the TF equivalent of the class you tried to import would be "TFBertModel".
If you want to use TensorFlow, please use TF classes instead!

If you really do want to use PyTorch please go to
https://pytorch.org/get-started/locally/ and follow the instructions that
match your environment.
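
The traceback above indicates that `transformers` could not see PyTorch when `BertModel` was loaded. In a notebook this can happen when `torch` is installed with `!pip install` after `transformers` has already been imported in the same session, in which case restarting the runtime is a common fix. That diagnosis is an assumption; the output alone does not confirm it. A minimal sketch for checking which backends are importable before choosing between `BertModel` and `TFBertModel` follows; the helper name `backend_available` is hypothetical.

```python
import importlib.util

def backend_available(name):
    """Return True if a top-level module called `name` can be imported."""
    return importlib.util.find_spec(name) is not None

HAS_TORCH = backend_available("torch")
HAS_TF = backend_available("tensorflow")
print(f"PyTorch importable: {HAS_TORCH}, TensorFlow importable: {HAS_TF}")
```

If `HAS_TORCH` is `False` after installation, the install likely went to a different environment than the one the kernel is running in.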


### (b) (2 points) Estimating depression severity with interpretable models using language features.

### (c) (2 points) Estimating depression severity with interpretable models using acoustic features.

### (d) (2 points) Estimating depression severity with unimodal and multimodal deep learning models.

### (e) (2 points) Explainable ML.

### (f) (Bonus, 2 points) Experimenting with transformers.