<a href="https://colab.research.google.com/github/eliseobao/redsm5/blob/main/analysis/linguistic/verbs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Verbs Usage Analysis

This notebook will guide you through analyzing the usage of verbs in the ReDSM5 dataset. We will use the `spaCy` library to process and analyze the text data.

## Setup

First, we need to set up the environment and install the necessary libraries.

In [10]:
import os

os.environ["SHELL"] = "/bin/bash"

## Installing Required Libraries

We will install `spaCy` and download the English language model `en_core_web_sm`.

In [11]:
%%capture
!pip install spacy
!python3 -m spacy download en_core_web_sm

## Importing Libraries

Next, we import the necessary libraries for our analysis.

In [12]:
import spacy
import pandas as pd
from tqdm import tqdm

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm")

## Defining Symptoms

We define a list of symptoms that we will analyze. Each symptom corresponds to a category in our dataset.

In [13]:
SYMPTOMS = [
    "NO_SYMPTOMS",
    "DEPRESSED_MOOD",
    "ANHEDONIA",
    "APPETITE_CHANGE",
    "SLEEP_ISSUES",
    "PSYCHOMOTOR",
    "FATIGUE",
    "WORTHLESSNESS",
    "COGNITIVE_ISSUES",
    "SUICIDAL_THOUGHTS",
]

## Loading Data

We load the dataset from a CSV file and organize the texts by symptom.

In [14]:
# Load the dataset
data = pd.read_csv("data/redsm5.csv")

# Organize texts by symptom
texts_per_symptom = {}
for symptom in SYMPTOMS:
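    # regex=False matches the symptom name literally; na=False skips rows with missing labels
    texts_per_symptom[symptom] = data.loc[
        data["labels"].str.contains(symptom, regex=False, na=False), "text"
    ].tolist()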
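
The substring filter can be illustrated on a toy frame. This is a minimal sketch: the rows below are made up, the helper name `texts_for` is hypothetical, and it assumes the `labels` column holds symptom names inside a single string (the actual layout of `redsm5.csv` may differ).

```python
import pandas as pd

# Hypothetical toy rows; the real redsm5.csv layout may differ.
toy = pd.DataFrame(
    {
        "text": ["I slept badly again.", "Nothing feels fun.", "Had a nice walk."],
        "labels": ["SLEEP_ISSUES,FATIGUE", "ANHEDONIA", None],
    }
)


def texts_for(frame, symptom):
    """Return the texts whose label string contains the symptom name."""
    # regex=False treats the symptom name as a literal substring;
    # na=False makes rows with a missing label count as non-matches.
    mask = frame["labels"].str.contains(symptom, regex=False, na=False)
    return frame.loc[mask, "text"].tolist()


sleep_texts = texts_for(toy, "SLEEP_ISSUES")
fatigue_texts = texts_for(toy, "FATIGUE")
no_symptom_texts = texts_for(toy, "NO_SYMPTOMS")
```

With this kind of matching, a text that carries several labels lands in several groups, so the per-symptom groups are not mutually exclusive.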

## Counting Verb Tenses

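We define a function that approximates tense from spaCy's fine-grained part-of-speech tags (`token.tag_`): `VBD` counts as past, `VBP` and `VBZ` count as present, and `MD` (modal) serves as a rough proxy for future. Note that `MD` covers every modal (for example *can* and *would*, not only *will*), so the future count is only approximate.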

In [15]:
def get_verb_counts(text):
    """
    Count the occurrences of past, present, and future tense verbs in the given text.

    Parameters:
    - text (str): Input text to analyze.

    Returns:
    Tuple[int, int, int]: A tuple containing counts of past, present, and future tense verbs.
    """
    past_count, present_count, future_count = 0, 0, 0
    doc = nlp(text)

    for token in doc:
        # Check the fine-grained part-of-speech tag of each token
        if token.tag_ == "VBD":  # Past tense
            past_count += 1
        elif token.tag_ in ("VBP", "VBZ"):  # Present tense
            present_count += 1
        elif token.tag_ == "MD":  # Modal (will, can, would, ...): rough proxy for future
            future_count += 1

    return past_count, present_count, future_count
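
Because the `MD` tag is shared by all modals, one possible refinement is to count only a short list of modal forms as future markers and keep the rest separate. The sketch below is not the notebook's method and would change the reported percentages. The names `FUTURE_MODALS` and `count_tenses` are hypothetical, and the listed surface forms (including the contracted pieces) are an assumption about how the tokenizer splits the text. It works on `(text, tag)` pairs so that it runs without a language model.

```python
from collections import Counter

# Penn Treebank style tags, as read from `token.tag_` in the function above.
PAST_TAGS = {"VBD"}
PRESENT_TAGS = {"VBP", "VBZ"}
# Assumption: only these modal forms are treated as future markers here.
FUTURE_MODALS = {"will", "shall", "'ll", "wo"}


def count_tenses(tagged_tokens):
    """Count tense categories from an iterable of (text, tag) pairs."""
    counts = Counter(past=0, present=0, future=0, other_modal=0)
    for text, tag in tagged_tokens:
        if tag in PAST_TAGS:
            counts["past"] += 1
        elif tag in PRESENT_TAGS:
            counts["present"] += 1
        elif tag == "MD":
            # Split modals into future markers and everything else.
            key = "future" if text.lower() in FUTURE_MODALS else "other_modal"
            counts[key] += 1
    return counts


# Tags below are written by hand for illustration, not produced by a model.
example = [
    ("I", "PRP"), ("was", "VBD"), ("tired", "JJ"),
    ("but", "CC"), ("I", "PRP"), ("will", "MD"), ("try", "VB"),
    ("and", "CC"), ("I", "PRP"), ("can", "MD"), ("cope", "VB"),
    ("she", "PRP"), ("says", "VBZ"),
]
tense_counts = count_tenses(example)
```

With spaCy, the input pairs could be built as `[(t.text, t.tag_) for t in nlp(text)]`.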

## Analyzing Texts

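We analyze the texts for each symptom, summing the three counts and expressing each as a percentage of their combined total. `total_verbs` is therefore the number of tokens tagged `VBD`, `VBP`, `VBZ`, or `MD`, not the number of all verbs in the texts.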

In [16]:
results = {}

for symptom, texts in texts_per_symptom.items():
    print(f"Analyzing {symptom} texts")

    total_past_count, total_present_count, total_future_count = 0, 0, 0

    for text in tqdm(texts):
        past_count, present_count, future_count = get_verb_counts(text)

        total_past_count += past_count
        total_present_count += present_count
        total_future_count += future_count

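    # Percentages are relative to the three counted categories combined
    total_verbs = total_past_count + total_present_count + total_future_count
    denominator = total_verbs or 1  # avoid ZeroDivisionError when nothing was counted
    past_percentage = (total_past_count / denominator) * 100
    present_percentage = (total_present_count / denominator) * 100
    future_percentage = (total_future_count / denominator) * 100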

    results[symptom] = {
        "total_verbs": total_verbs,
        "past_percentage": past_percentage,
        "present_percentage": present_percentage,
        "future_percentage": future_percentage,
    }

Analyzing NO_SYMPTOMS texts


100%|██████████| 392/392 [00:28<00:00, 13.99it/s]


Analyzing DEPRESSED_MOOD texts


100%|██████████| 328/328 [00:14<00:00, 22.10it/s]


Analyzing ANHEDONIA texts


100%|██████████| 124/124 [00:03<00:00, 32.34it/s]


Analyzing APPETITE_CHANGE texts


100%|██████████| 44/44 [00:02<00:00, 21.89it/s]


Analyzing SLEEP_ISSUES texts


100%|██████████| 102/102 [00:04<00:00, 25.35it/s]


Analyzing PSYCHOMOTOR texts


100%|██████████| 35/35 [00:02<00:00, 17.48it/s]


Analyzing FATIGUE texts


100%|██████████| 124/124 [00:05<00:00, 21.75it/s]


Analyzing WORTHLESSNESS texts


100%|██████████| 311/311 [00:12<00:00, 24.28it/s]


Analyzing COGNITIVE_ISSUES texts


100%|██████████| 59/59 [00:02<00:00, 29.28it/s]


Analyzing SUICIDAL_THOUGHTS texts


100%|██████████| 165/165 [00:05<00:00, 31.21it/s]
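
The percentage arithmetic from the analysis cell can be checked in isolation. This is a small sketch with a hypothetical helper name, `tense_shares`, and made-up counts chosen so the numbers are easy to follow. It returns zeros when no tokens were counted rather than dividing by zero.

```python
def tense_shares(past, present, future):
    """Return each count as a percentage of the three counts combined."""
    total = past + present + future
    if total == 0:
        # No counted tokens: report zeros instead of dividing by zero.
        return {
            "total_verbs": 0,
            "past_percentage": 0.0,
            "present_percentage": 0.0,
            "future_percentage": 0.0,
        }
    return {
        "total_verbs": total,
        "past_percentage": 100.0 * past / total,
        "present_percentage": 100.0 * present / total,
        "future_percentage": 100.0 * future / total,
    }


# Made-up counts: 1 past, 1 present, 2 modal tokens.
shares = tense_shares(past=1, present=1, future=2)
empty = tense_shares(0, 0, 0)
```

Since the three shares are computed over the same denominator, they always sum to 100 for any non-empty group.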


## Displaying Results

Finally, we display the results of our analysis, showing the percentage of past, present, and future tense verbs for each symptom.
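
A tabular view is another option for comparing symptoms side by side. The sketch below builds a pandas DataFrame from a dictionary shaped like `results`. The name `results_sample` is hypothetical, and its values are the rounded figures copied from this notebook's printed output for three symptoms, not recomputed.

```python
import pandas as pd

# Rounded values copied from the notebook's printed output (three symptoms only).
results_sample = {
    "NO_SYMPTOMS": {
        "total_verbs": 23881,
        "past_percentage": 49.99,
        "present_percentage": 38.96,
        "future_percentage": 11.04,
    },
    "DEPRESSED_MOOD": {
        "total_verbs": 13163,
        "past_percentage": 42.74,
        "present_percentage": 46.40,
        "future_percentage": 10.86,
    },
    "ANHEDONIA": {
        "total_verbs": 3369,
        "past_percentage": 26.51,
        "present_percentage": 61.83,
        "future_percentage": 11.67,
    },
}

# One row per symptom, sorted so the most present-oriented group comes first.
table = pd.DataFrame.from_dict(results_sample, orient="index")
table = table.sort_values("present_percentage", ascending=False)
print(table)
```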

In [17]:
# Use `stats` as the loop variable so the `data` DataFrame loaded earlier is not overwritten
for symptom, stats in results.items():
    print(f"\nSymptom: {symptom}")
    print(f"Total Verbs: {stats['total_verbs']}")
    print(f"Past Tense Percentage: {stats['past_percentage']:.2f}%")
    print(f"Present Tense Percentage: {stats['present_percentage']:.2f}%")
    print(f"Future Tense Percentage: {stats['future_percentage']:.2f}%")


Symptom: NO_SYMPTOMS
Total Verbs: 23881
Past Tense Percentage: 49.99%
Present Tense Percentage: 38.96%
Future Tense Percentage: 11.04%

Symptom: DEPRESSED_MOOD
Total Verbs: 13163
Past Tense Percentage: 42.74%
Present Tense Percentage: 46.40%
Future Tense Percentage: 10.86%

Symptom: ANHEDONIA
Total Verbs: 3369
Past Tense Percentage: 26.51%
Present Tense Percentage: 61.83%
Future Tense Percentage: 11.67%

Symptom: APPETITE_CHANGE
Total Verbs: 1706
Past Tense Percentage: 40.33%
Present Tense Percentage: 50.23%
Future Tense Percentage: 9.44%

Symptom: SLEEP_ISSUES
Total Verbs: 3451
Past Tense Percentage: 46.19%
Present Tense Percentage: 43.03%
Future Tense Percentage: 10.78%

Symptom: PSYCHOMOTOR
Total Verbs: 1761
Past Tense Percentage: 40.55%
Present Tense Percentage: 47.81%
Future Tense Percentage: 11.64%

Symptom: FATIGUE
Total Verbs: 4727
Past Tense Percentage: 39.37%
Present Tense Percentage: 48.04%
Future Tense Percentage: 12.59%

Symptom: WORTHLESSNESS
Total Verbs: 11762
Past Tens