In [1]:
import pandas as pd
import numpy as np
import pickle

Approach:

1. Import the vector embedding model
2. Create three lexicons:
    - Crisis
        - High
        - Medium
        - Low
        - Compute the TF-IDF for each category and use the average as its score
    - Temporal
    - Severity
3. Use argmax(thresholded average similarity) against the crisis lexicon to classify embeddings
4. Use TF-IDF-weighted average similarity against the temporal and severity lexicons to classify embeddings
5. Final score = (crisis_score + temporal_score * tf-idf + severity_score * tf-idf) / 3

1. Data Preprocessing:

Clean the text data (remove irrelevant characters, handle URLs, etc.).
2. Lexicon Creation:

Crisis Keyword Lexicon: Keywords with base risk scores (e.g., "kill myself": 90, "hopeless": 60, "sad": 30). Categorize these (e.g., ideation, planning, attempt).
Temporal Indicator Lexicon: Terms with weights (e.g., "tonight": +20, "tomorrow": +10, "someday": +2).
Severity Modifier Lexicon: Modifiers with multipliers (e.g., "extremely": 1.8, "slightly": 0.7).
3. Sample Phrase Processing (for Cluster Initialization):

Create a sample set of phrases categorized as High, Medium, and Low risk. This is used only for initializing the clusters, not for direct comparison later.
Generate embeddings for these sample phrases.
Perform k-means clustering (k=3) on these embeddings.
Calculate the centroid (average embedding) for each of the three clusters. These centroids represent your High, Medium, and Low risk categories in the embedding space.
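
The clustering step can be sketched in plain Python as below. This is a minimal illustration of Lloyd's k-means, not the notebook's implementation: the function names, the toy 2-D points, and the choice of initial centroids are all assumptions, and in practice a library implementation would normally be used. Note that k-means does not label its clusters, so deciding which centroid is High, Medium, or Low still has to be done from the sample phrases in each cluster.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centroids = [list(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean_vector(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Toy 2-D "embeddings" (hypothetical); real embeddings have many more dimensions.
toy_points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0], [10.0, 1.0]]
toy_centroids, toy_clusters = kmeans(toy_points, [toy_points[0], toy_points[2], toy_points[4]])
print(toy_centroids)
```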
4. Inference Text Processing and Initial Risk Assignment (Embedding-Based):

Generate an embedding for the inference text.
Calculate the cosine similarity between the inference text embedding and each of the three cluster centroids (from Step 3).
Assign the inference text to the category (High, Medium, Low) with the highest similarity.
Assign an Initial Risk score based on this category:
High: Initial Risk = 3
Medium: Initial Risk = 2
Low: Initial Risk = 1
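
Step 4 amounts to a nearest-centroid lookup under cosine similarity. The sketch below assumes the three centroids are already available as plain lists; the names `initial_risk`, `INITIAL_RISK`, and the toy 2-D centroids are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Initial Risk assigned to each category, as listed in Step 4.
INITIAL_RISK = {"High": 3, "Medium": 2, "Low": 1}

def initial_risk(embedding, centroids):
    """Return (category, Initial Risk) for the centroid most similar to the embedding."""
    category = max(centroids, key=lambda name: cosine_similarity(embedding, centroids[name]))
    return category, INITIAL_RISK[category]

# Toy 2-D centroids (hypothetical), for illustration only.
toy_centroids = {"High": [1.0, 0.0], "Medium": [1.0, 1.0], "Low": [0.0, 1.0]}
print(initial_risk([0.9, 0.1], toy_centroids))
```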
5. Lexicon-Based Scoring (Refinement and Transitions):

5.1 Keyword Scoring:
Find all crisis keywords in the inference text.
For each keyword, calculate: Keyword Score = TF-IDF(keyword) * Base Risk(keyword).
5.2 Severity Modification:
Use dependency parsing to find severity modifiers linked to crisis keywords.
For each modified keyword: Modified Keyword Score = Keyword Score * Modifier Multiplier.
5.3 Temporal Adjustment:
Find temporal indicators in the inference text.
Temporal Score = Weight(temporal indicator).

5.4 Sentiment Analysis:
Use VADER to get a sentiment score.
5.5 Combined Lexicon Score:
Lexicon Score = Sum(Modified Keyword Scores) + Temporal Score + Sentiment Score
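
A rough sketch of Steps 5.1 to 5.5, using the example lexicon entries from Step 2. Several simplifications here are assumptions rather than part of the method: TF-IDF weights are passed in as a dictionary (defaulting to 1.0), a modifier only counts when it is the word immediately before a keyword (a stand-in for dependency parsing), the temporal score takes the largest matching weight, and the sentiment score is supplied by the caller rather than computed with VADER.

```python
import re

# Example entries taken from the lexicon descriptions in Step 2.
CRISIS_KEYWORDS = {"kill myself": 90, "hopeless": 60, "sad": 30}
TEMPORAL_WEIGHTS = {"tonight": 20, "tomorrow": 10, "someday": 2}
SEVERITY_MODIFIERS = {"extremely": 1.8, "slightly": 0.7}

def lexicon_score(text, tfidf=None, sentiment_score=0.0):
    """Sum of modified keyword scores + temporal score + sentiment score."""
    tfidf = tfidf or {}
    lowered = text.lower()

    keyword_total = 0.0
    for keyword, base_risk in CRISIS_KEYWORDS.items():
        pattern = r"\b" + re.escape(keyword) + r"\b"
        for match in re.finditer(pattern, lowered):
            # 5.1: Keyword Score = TF-IDF(keyword) * Base Risk(keyword)
            score = tfidf.get(keyword, 1.0) * base_risk
            # 5.2: apply a multiplier if the preceding word is a severity modifier
            preceding_words = lowered[:match.start()].split()
            if preceding_words and preceding_words[-1] in SEVERITY_MODIFIERS:
                score *= SEVERITY_MODIFIERS[preceding_words[-1]]
            keyword_total += score

    # 5.3: temporal adjustment (largest matching weight; an assumption)
    words = set(re.findall(r"[a-z']+", lowered))
    temporal_score = max((w for term, w in TEMPORAL_WEIGHTS.items() if term in words), default=0)

    # 5.5: combined lexicon score
    return keyword_total + temporal_score + sentiment_score

print(lexicon_score("I feel extremely hopeless tonight"))
```

Replacing the adjacency heuristic with a real dependency parse and the `sentiment_score` argument with a VADER compound score would bring the sketch closer to the steps as written.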
6. Final Risk Score (with Embedding-Informed Transitions):

This is where we combine the embedding-based Initial Risk with the Lexicon Score. The key is to use the Lexicon Score to adjust the Initial Risk, allowing for transitions between risk levels.
Threshold-Based Transitions:

If Lexicon Score > Transition Threshold Up:
  Final Risk = min(Initial Risk + 1, 3)  # Move up one level, max High
Else If Lexicon Score < Transition Threshold Down:
  Final Risk = max(Initial Risk - 1, 1)  # Move down one level, min Low
Else:
  Final Risk = Initial Risk  # Stay at the initial level
Transition Threshold Up and Transition Threshold Down are values you'll need to tune (e.g., 25 and -15). These control how easily posts move between risk levels.
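
The transition rule translates directly into Python. The default thresholds below are just the example values mentioned above (25 and -15), and the function name is illustrative.

```python
def final_risk(initial_risk, lexicon_score, threshold_up=25.0, threshold_down=-15.0):
    """Adjust the embedding-based Initial Risk (1-3) by at most one level."""
    if lexicon_score > threshold_up:
        return min(initial_risk + 1, 3)   # move up one level, max High
    if lexicon_score < threshold_down:
        return max(initial_risk - 1, 1)   # move down one level, min Low
    return initial_risk                   # stay at the initial level

print(final_risk(2, 40.0))   # Medium with a high lexicon score
print(final_risk(2, -30.0))  # Medium with a strongly negative lexicon score
print(final_risk(2, 0.0))    # Medium with a neutral lexicon score
```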

Method: Sigmoid Mapping of Similarity Difference

Prerequisites (Steps 1-3 remain the same):

Clean text.

Create lexicons (needed for later refinement, not this step).

Create sample phrases (Low, Med, High).

Generate embeddings for samples.

K-means clustering (k=3) on sample embeddings.

Calculate centroids: cent_low, cent_med, cent_high.

Inference Text Embedding (Step 4a):

Generate the embedding for the inference text: emb_inf.

Calculate Similarities (Step 4b):

Calculate cosine similarities:

sim_low = cosine_similarity(emb_inf, cent_low)

sim_med = cosine_similarity(emb_inf, cent_med) (We might not use this directly in the simplest sigmoid, but calculate it anyway)

sim_high = cosine_similarity(emb_inf, cent_high)

(Ensure similarities are in a reasonable range, typically [-1, 1] for cosine similarity, though often [0, 1] with common embedding models).

Calculate Similarity Difference (New Step 4c):

The core idea is to see how much more "High" the text is than "Low" based on embedding similarity.

similarity_diff = sim_high - sim_low

This value (similarity_diff) is higher when the text is much closer to the High centroid than to the Low centroid, lower when it is closer to Low, and near zero when the text is equidistant from both (including when both similarities are high or both are low). Its range is bounded by [-2, 2], and in practice it is usually much narrower, depending on your embeddings and data.

Apply Scaled Sigmoid Function (New Step 4d):

Use the logistic function, scaled and shifted, to map similarity_diff to your desired score range (e.g., 1 to 3).

The general form of a scaled logistic function is:
Score = RangeMin + (RangeMax - RangeMin) / (1 + exp(-k * (input - midpoint)))

Let's set:

RangeMin = 1

RangeMax = 3

input = similarity_diff

midpoint = 0 (We assume a difference of 0 should correspond to the middle score, i.e., 2).

k = Steepness parameter (Controls how quickly the score transitions from 1 to 3. Higher k means a sharper transition around similarity_diff = 0. This needs tuning).

Initial_Risk_Score = 1 + (3 - 1) / (1 + exp(-k * (similarity_diff - 0)))

Initial_Risk_Score = 1 + 2 / (1 + exp(-k * (sim_high - sim_low)))
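
As a small sketch, the general scaled logistic and the specialised Initial_Risk_Score can be written as follows. The function and parameter names are illustrative, and k=1.0 is only a placeholder default.

```python
import math

def scaled_sigmoid(value, k=1.0, range_min=1.0, range_max=3.0, midpoint=0.0):
    """Map value onto the open interval (range_min, range_max) with a scaled logistic."""
    return range_min + (range_max - range_min) / (1.0 + math.exp(-k * (value - midpoint)))

def initial_risk_score(sim_high, sim_low, k=1.0):
    """Initial_Risk_Score = 1 + 2 / (1 + exp(-k * (sim_high - sim_low)))."""
    return scaled_sigmoid(sim_high - sim_low, k=k)

# Equal similarities give a difference of 0, which maps to the middle of the range.
print(initial_risk_score(0.6, 0.6))
```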

Tuning Parameter k:

Start with k=1 or k=2.

Look at the similarity_diff values for your sample texts (or a validation set).

If texts clearly belonging to High risk often have sim_high - sim_low around, say, 0.5, and Low risk texts have it around -0.5, you want k such that the function gives scores close to 3 for inputs >= 0.5 and close to 1 for inputs <= -0.5.

If your similarity_diff values tend to be much smaller (e.g., +/- 0.1), you'll need a larger k (e.g., k=10 or k=20) to get sufficient separation.

Plot the function 1 + 2 / (1 + exp(-k*x)) for different k values with x ranging across your typical similarity_diff values to visualize the effect.
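
Before plotting, a quick text table can give a feel for the effect of k. The sketch below evaluates the same function on a handful of similarity_diff values; the helper names and the specific k and x values are arbitrary choices for illustration.

```python
import math

def risk_curve(x, k):
    """The function 1 + 2 / (1 + exp(-k * x))."""
    return 1 + 2 / (1 + math.exp(-k * x))

def tabulate(k_values, diffs):
    """Return {k: [score for each diff]} so the curves can be compared side by side."""
    return {k: [round(risk_curve(x, k), 3) for x in diffs] for k in k_values}

diffs = [-0.5, -0.1, 0.0, 0.1, 0.5]
table = tabulate([1, 2, 10, 20], diffs)

print("similarity_diff:", diffs)
for k, scores in table.items():
    print(f"k={k:>2}: {scores}")
```

With small k the scores stay close to 2 across this range, while larger k pushes the ends toward 1 and 3, which is the separation the tuning notes above are looking for.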

In [2]:

# Severity risk sentences
high_risk = [
    "I don't want to live anymore", "I feel like ending it all", "No point in going on",
    "I need to disappear", "I wish I never woke up", "It hurts too much to stay",
    "I'm ready to give up", "I need a way to make it stop", "I keep thinking about ending things",
    "I just want the pain to end", "Nobody would miss me if I were gone", 
    "I'm tired of pretending everything is okay", "I can't do this anymore",
    "Everything is too much", "I feel trapped in my own mind", "I see no future for myself",
    "No one understands what I'm going through", "I can't escape my thoughts",
    "I don't deserve to live", "I just want to disappear forever",
    "I wish I could sleep and never wake up", "I feel like I'm drowning and no one notices",
    "I hate myself so much", "I can't stop thinking about how to end it",
    "I've been planning my way out", "I hope this is the last time I feel this way",
    "I feel like a burden to everyone", "Nothing makes sense anymore",
    "Every day is unbearable", "The pain never stops", "I want to turn everything off",
    "I just need a way out", "I'm done with everything", "I've lost all hope",
    "No one would care if I was gone", "I feel completely empty inside",
    "I have no energy left to fight", "I'm exhausted from pretending I'm okay",
    "I wish I had the courage to do it", "I don't belong in this world",
    "Why am I still here?", "I'm scared of myself", "I can't keep fighting these thoughts",
    "I don't want to feel this anymore", "I feel like I'm suffocating",
    "The voices in my head won't stop", "I'm tired of struggling", "I don't see the point anymore",
    "I just need the pain to end", "I feel so hopeless", "This is my last goodbye"
]

# medium_risk = [
#     "I feel lost lately", "I don't know who to talk to", "I'm struggling to cope",
#     "Everything feels overwhelming", "I hate waking up every day", 
#     "I can't concentrate on anything anymore", "I just feel so numb",
#     "I feel so alone even with people around", "I'm exhausted but I can't sleep",
#     "Nothing makes me happy anymore", "I cry myself to sleep every night",
#     "I feel like I'm drowning in my thoughts", "I don't want to burden anyone with my problems",
#     "I drink to forget", "Smoking helps me escape my thoughts", 
#     "I don't even get high for fun anymore", "I need something stronger to cope",
#     "I can't stop using, even though I want to", "I feel like I'm stuck in a loop",
#     "I have no motivation for anything", "I'm just so tired all the time",
#     "I keep overthinking every little thing", "I hate the way I feel every day",
#     "Even simple tasks feel impossible", "I'm scared to open up to anyone",
#     "I don't know how to explain what I'm feeling", "Everything feels so heavy",
#     "I feel like I'm just existing, not living", "I hate how my mind never shuts up",
#     "I can't tell if I'm sad or just empty", "Every day feels like the same nightmare",
#     "I wish someone would check on me", "I don't know what I need, I just need help",
#     "I feel like I'm constantly disappointing everyone", "I'm stuck in my own thoughts",
#     "My mind is my worst enemy", "I've lost interest in everything I used to love",
#     "I just want to feel normal again", "No matter what I do, I feel the same",
#     "I keep pretending I'm okay but I'm not", "I can't stop the negative thoughts",
#     "I feel like I'm losing myself", "I don't know what's wrong with me",
#     "I wish I could just pause life for a while", "I miss being happy",
#     "I wish someone understood me", "I feel so disconnected from reality",
#     "I can't shake this feeling of emptiness", "I'm scared I'll feel like this forever",
#     "I just want to be okay"
# ]

low_risk = [
    "Mental health is important", "Therapy has helped me a lot", 
    "We need to talk about depression more", "Journaling has really helped my anxiety",
    "It's okay to ask for help", "Taking a break for my mental health",
    "Finding the right medication changed my life", "Self-care is so important",
    "It's okay to not be okay", "Healing takes time", "Learning to set boundaries is hard",
    "Exercise really helps my mood", "Meditation helps me stay grounded",
    "Talking to friends makes a big difference", "Therapy isn't just for when you're struggling",
    "Getting enough sleep is key to my mental health",
    "Protecting my peace at all costs", "Gotta focus on my mental health today",
    "Normalize taking mental health days", "I need to touch grass",
    "Sending good vibes to everyone struggling", "Being mindful helps me stay present",
    "A good routine helps my mental health", "Music is my therapy",
    "I'm finally learning to love myself", "Having a support system is everything",
    "Deep breathing really helps my anxiety", "Taking time for myself feels so good",
    "Trying to stay positive every day", "Therapy has changed my perspective",
    "I'm working on improving my mindset", "Setting boundaries has been life-changing",
    "Talking about mental health should be normal", "Happiness is a journey, not a destination",
    "Being kind to yourself is the first step", "Gratitude helps shift my perspective",
    "Mental health days should be mandatory", "I'm finally prioritizing myself",
    "Healing isn't linear, and that's okay", "Sleep is my best coping mechanism",
    "Sometimes all you need is a deep breath", "I'm learning to forgive myself",
    "Fresh air and a walk always help", "I try to focus on the little things",
    "Checking in with yourself is important", "Mental health matters more than productivity",
    "Journaling my thoughts helps me process emotions", "Being in nature helps clear my mind",
    "Meditation is a game-changer", "Prioritizing my mental well-being every day"
]


In [3]:
# Find the average embedding for each risk category
from dotenv import load_dotenv
import os
import requests
load_dotenv("./.env")
def get_response(data):
    """Return the embedding of the first input as a list of floats."""
    api_key = os.getenv("JINA_API_KEY")
    URL = "https://api.jina.ai/v1/embeddings"

    resp = requests.post(
        URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"input": data, "model": "jina-embeddings-v3", "task": "classification"},
    )
    resp.raise_for_status()  # fail loudly on HTTP errors instead of on a missing "data" key
    return resp.json()["data"][0]["embedding"]

In [4]:
import numpy as np
def calculate_mean_pool(sentences):
    """Embed each sentence and return the element-wise mean of the embeddings."""
    embeddings = []
    for sentence in sentences:
        # get_response already returns the embedding list, so no further indexing is needed
        embeddings.append(get_response(sentence))
    return np.mean(embeddings, axis=0)

In [5]:
high_risk_avg = calculate_mean_pool(high_risk)
low_risk_avg = calculate_mean_pool(low_risk)

In [None]:
# import pickle

# with open("high_risk_avg.pkl", "wb") as f:
#     pickle.dump(high_risk_avg, f)

# with open("medium_risk_avg.pkl", "wb") as f:
#     pickle.dump(medium_risk_avg, f)

# with open("low_risk_avg.pkl", "wb") as f:
#     pickle.dump(low_risk_avg, f)

In [16]:
with open("high_risk_avg.pkl", "rb") as f:
    high_risk_avg = pickle.load(f)
with open("low_risk_avg.pkl", "rb") as f:
    low_risk_avg = pickle.load(f)

In [None]:
# Unit vector pointing from the low-risk centroid toward the high-risk centroid
sim_diff = high_risk_avg - low_risk_avg
sim_diff = sim_diff / np.linalg.norm(sim_diff)

# Save the sim_diff vector
with open("sim_diff.pkl", "wb") as f:
    pickle.dump(sim_diff, f)

# Load the sim_diff vector
with open("sim_diff.pkl", "rb") as f:
    sim_diff = pickle.load(f)

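def sigmoid_function(x, k=4.5):
    # Scaled logistic: maps x to (1, 3), with x = 0 giving 2; k sets the steepness
    return 1 + (2 / (1 + np.exp(-k * x)))

def cosine_similarity_with_sim_diff(text_embedding, sim_diff_vec=sim_diff):
    # Cosine similarity between the text embedding and the low-to-high direction vector
    return np.dot(text_embedding, sim_diff_vec) / (np.linalg.norm(text_embedding) * np.linalg.norm(sim_diff_vec))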

In [18]:
def inference_basic_score(inference_text):
    # Embed the text, measure its cosine similarity to the low-to-high direction, and map it to (1, 3)
    text_embedding = get_response(inference_text)
    return sigmoid_function(cosine_similarity_with_sim_diff(text_embedding))

In [19]:
inference_text = "Prioritizing my mental well-being every day"
# inf_2 = "when all else is lost, the hope still remains. It is mankind's greatest gift"
# inf_2 = "I'm working on improving my mindset"

print(f"Score for neg statement : {inference_basic_score(inference_text)}")
# print(f"Score for pos statement : {inference_basic_score(inf_2)}")

Score for neg statement : 1.3821438542628512
