<a href="https://colab.research.google.com/github/UBGidado/My_Research/blob/main/df3_Relation_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installing Dependencies and Imports

In [1]:
!pip install -q kaggle datasets spacy transformers
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:
import os
import numpy as np
import pandas as pd
import kagglehub
import spacy
from spacy.matcher import Matcher
from datasets import load_dataset
from google.colab import files

## Setting up Kaggle API

In [3]:
# Upload kaggle.json file
uploaded = files.upload()  # Upload your kaggle.json here
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Saving kaggle (1).json to kaggle (1) (2).json
mv: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
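The output above shows the upload being saved under a renamed file, so `mv kaggle.json` finds nothing to move. A minimal sketch of one way around this is to write the uploaded bytes straight to the target path, whatever the browser called the file. It assumes `uploaded` is the filename-to-bytes dictionary returned by `files.upload()`; the helper name `install_kaggle_credentials` and the placeholder token are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def install_kaggle_credentials(uploaded, kaggle_dir):
    """Write the first uploaded .json file to <kaggle_dir>/kaggle.json (mode 600)."""
    json_names = [name for name in uploaded if name.endswith(".json")]
    if not json_names:
        raise FileNotFoundError("No .json file found in the upload")
    target_dir = Path(kaggle_dir).expanduser()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    target.write_bytes(uploaded[json_names[0]])
    os.chmod(target, 0o600)  # same permissions as `chmod 600`
    return target


# Simulated upload with a renamed file like the one in the output above.
# The token content is a placeholder, not a real credential.
fake_upload = {"kaggle (1) (2).json": b'{"username": "demo", "key": "placeholder"}'}
demo_dir = tempfile.mkdtemp()
saved_path = install_kaggle_credentials(fake_upload, demo_dir)
print(saved_path.name)
```

In Colab the call would presumably look like `install_kaggle_credentials(files.upload(), "~/.kaggle")`.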


## Loading Dataset 1

In [4]:
!wget https://huggingface.co/datasets/halilbabacan/autotrain-data-cognitive_distortions/resolve/main/raw/Cognitive_distortions.csv

# Load and structure the data
df1 = pd.read_csv("Cognitive_distortions.csv")
df1 = df1.rename(columns={
    'Text': 'Patient Question',
    'Label': 'Dominant Distortion'
})
df1.insert(1, "Distorted part", value=np.nan)
df1.insert(3, "Secondary Distortion (Optional)", value=np.nan)

# Display the formatted DataFrame
df1.head()

--2025-06-12 11:34:55--  https://huggingface.co/datasets/halilbabacan/autotrain-data-cognitive_distortions/resolve/main/raw/Cognitive_distortions.csv
Resolving huggingface.co (huggingface.co)... 13.35.202.121, 13.35.202.40, 13.35.202.34, ...
Connecting to huggingface.co (huggingface.co)|13.35.202.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021440 (998K) [text/plain]
Saving to: ‘Cognitive_distortions.csv.2’


2025-06-12 11:34:57 (934 KB/s) - ‘Cognitive_distortions.csv.2’ saved [1021440/1021440]



Unnamed: 0,Patient Question,Distorted part,Dominant Distortion,Secondary Distortion (Optional)
0,I'm such a failure I never do anything right.,,Distortion,
1,Nobody likes me because I'm not interesting.,,Distortion,
2,I can't try new things because I'll just mess...,,Distortion,
3,My boss didn't say 'good morning' she must be...,,Distortion,
4,My friend didn't invite me to the party I mus...,,Distortion,
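The `wget` log above saves the download as `Cognitive_distortions.csv.2`, which means an earlier copy already existed and `pd.read_csv("Cognitive_distortions.csv")` reads that older file. A minimal sketch of a download step that always overwrites a fixed filename is shown here. The helper name `fetch_to` is hypothetical, and a local `file://` URL stands in for the Hugging Face URL so the example runs offline.

```python
import tempfile
import urllib.request
from pathlib import Path


def fetch_to(url, dest):
    """Download `url` to `dest`, overwriting any earlier copy, and return the path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response:
        dest.write_bytes(response.read())
    return dest


# Offline demonstration: a local file stands in for the remote CSV.
work_dir = Path(tempfile.mkdtemp())
source = work_dir / "remote.csv"
source.write_text("Text,Label\nexample sentence,Distortion\n")

dest = work_dir / "Cognitive_distortions.csv"
dest.write_text("stale copy\n")  # simulates a leftover file from an earlier run

fetch_to(source.as_uri(), dest)
print(dest.read_text().splitlines()[0])
```

On the command line, `wget -O Cognitive_distortions.csv <url>` has a similar effect, since `-O` writes to the named file instead of creating numbered copies.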


## Loading Dataset 2 (Kaggle)

In [5]:
# Download the dataset
multiclass_dataset_path = kagglehub.dataset_download(
    "sagarikashreevastava/cognitive-distortion-detetction-dataset"
)
print("Path to dataset files:", multiclass_dataset_path)
multiclass_dataset_file_path = multiclass_dataset_path + "/Annotated_data.csv"

Path to dataset files: /kaggle/input/cognitive-distortion-detetction-dataset


### Cleaning & Structuring df2

In [6]:
df2 = pd.read_csv(multiclass_dataset_file_path)
df2 = df2.drop('Id_Number', axis=1)  # drop the ID column
df2

Unnamed: 0,Patient Question,Distorted part,Dominant Distortion,Secondary Distortion (Optional)
0,"Hello, I have a beautiful,smart,outgoing and a...",The voice are always fimilar (someone she know...,Personalization,
1,Since I was about 16 years old I’ve had these ...,I feel trapped inside my disgusting self and l...,Labeling,Emotional Reasoning
2,So I’ve been dating on and off this guy for a...,,No Distortion,
3,My parents got divorced in 2004. My mother has...,,No Distortion,
4,I don’t really know how to explain the situati...,I refused to go because I didn’t know if it wa...,Fortune-telling,Emotional Reasoning
...,...,...,...,...
2525,I’m a 21 year old female. I spent most of my l...,,No Distortion,
2526,I am 21 female and have not had any friends fo...,Now I am at university my peers around me all ...,Overgeneralization,
2527,From the U.S.: My brother is 19 years old and ...,He claims he’s severely depressed and has outb...,Mental filter,Mind Reading
2528,From the U.S.: I am a 21 year old woman who ha...,,No Distortion,


## Concatenate into df3

In [7]:
df3 = pd.concat([df1, df2], ignore_index=True)
df3

Unnamed: 0,Patient Question,Distorted part,Dominant Distortion,Secondary Distortion (Optional)
0,I'm such a failure I never do anything right.,,Distortion,
1,Nobody likes me because I'm not interesting.,,Distortion,
2,I can't try new things because I'll just mess...,,Distortion,
3,My boss didn't say 'good morning' she must be...,,Distortion,
4,My friend didn't invite me to the party I mus...,,Distortion,
...,...,...,...,...
6052,I’m a 21 year old female. I spent most of my l...,,No Distortion,
6053,I am 21 female and have not had any friends fo...,Now I am at university my peers around me all ...,Overgeneralization,
6054,From the U.S.: My brother is 19 years old and ...,He claims he’s severely depressed and has outb...,Mental filter,Mind Reading
6055,From the U.S.: I am a 21 year old woman who ha...,,No Distortion,


In [8]:
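# Rebuild df3 keeping only the text column from df1. Its rows therefore get NaN
# in the label columns (see the output below); only df2 rows keep their labels.
df3 = pd.concat([df1[["Patient Question"]], df2], ignore_index=True)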
df3 = df3.rename(columns={"Patient Question": "text"})
df3 = df3.reset_index().rename(columns={"index": "id"})

print("✅ Dataset 1 and Dataset 2 loaded and combined into df3.")
print(df3.head())

✅ Dataset 1 and Dataset 2 loaded and combined into df3.
   id                                               text Distorted part  \
0   0      I'm such a failure I never do anything right.            NaN   
1   1       Nobody likes me because I'm not interesting.            NaN   
2   2   I can't try new things because I'll just mess...            NaN   
3   3   My boss didn't say 'good morning' she must be...            NaN   
4   4   My friend didn't invite me to the party I mus...            NaN   

  Dominant Distortion Secondary Distortion (Optional)  
0                 NaN                             NaN  
1                 NaN                             NaN  
2                 NaN                             NaN  
3                 NaN                             NaN  
4                 NaN                             NaN  


In [9]:
import tensorflow as tf

gpu_available = tf.config.list_physical_devices('GPU')

if gpu_available:
    print("GPU is available!")
    # Print GPU details
    for gpu in gpu_available:
        print("Name:", gpu.name, "Type:", gpu.device_type)
else:
    print("No GPU available. Make sure you have selected 'GPU' as the hardware accelerator in the Runtime settings.")

GPU is available!
Name: /physical_device:GPU:0 Type: GPU


### Extracting triples from text column

In [10]:
nlp = spacy.load("en_core_web_sm")

# Lemmas of verbs treated as social/psychological relationship markers
SOCIAL_RELATION_VERBS = {
    "like", "love", "hate", "trust", "distrust", "fear", "admire", "resent",
    "blame", "support", "oppose", "befriend", "avoid", "confide", "believe",
    "doubt", "respect", "despise", "envy", "forgive", "help", "betray"
}

def extract_entity_or_chunk(token, doc):
    """Return the named entity or noun chunk containing `token`, else the token text."""
    for ent in doc.ents:
        if ent.start <= token.i < ent.end:
            return ent.text
    for chunk in doc.noun_chunks:
        if chunk.start <= token.i < chunk.end:
            return chunk.text
    return token.text

def extract_relationships(text):
    """
    Extract social/psychological relationships (subject-verb-object triples)
    with sentence-level context.
    Returns: List of {'subject':, 'relation':, 'object':, 'context':}
    """
    doc = nlp(text)
    relationships = []

    for sent in doc.sents:
        for token in sent:
            # Check if verb is a social relationship marker
            if token.pos_ == "VERB" and token.lemma_ in SOCIAL_RELATION_VERBS:
                subj = obj = None

                # Find subject and object
                for child in token.children:
                    if "subj" in child.dep_ or child.dep_ == "agent":
                        subj = extract_entity_or_chunk(child, doc)
                    if "obj" in child.dep_:  # substring test also covers "pobj"
                        obj = extract_entity_or_chunk(child, doc)

                if subj and obj:
                    relationships.append({
                        'subject': subj,
                        'relation': token.lemma_,
                        'object': obj,
                        'context': sent.text
                    })

    return relationships

# Example usage:
sample_text = "John trusts Mary but fears hospitals. He eats breakfast daily."
print(extract_relationships(sample_text))

[{'subject': 'John', 'relation': 'trust', 'object': 'Mary', 'context': 'John trusts Mary but fears hospitals.'}]
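The sample output contains the `trust` triple but not one for "fears hospitals". A likely reason is that the second verb in a coordination has no subject child of its own, so `subj` stays `None`. The sketch below shows one possible way to inherit the subject from the verb it is conjoined to. It is not part of the notebook: `find_subject_token` is a hypothetical helper, and the tokens are hand-built stand-ins that assume the parser labels "fears" as `conj` of "trusts".

```python
from types import SimpleNamespace


def find_subject_token(verb):
    """Return the verb's subject child, or the subject of the verb it is conjoined to."""
    current = verb
    while True:
        for child in current.children:
            if "subj" in child.dep_:
                return child
        # Climb only through coordination links; the root is its own head.
        if current.dep_ == "conj" and current.head is not current:
            current = current.head
        else:
            return None


# Hand-built stand-ins for spaCy tokens (assumed parse, not real parser output).
john = SimpleNamespace(text="John", dep_="nsubj", children=[])
mary = SimpleNamespace(text="Mary", dep_="dobj", children=[])
hospitals = SimpleNamespace(text="hospitals", dep_="dobj", children=[])
trusts = SimpleNamespace(text="trusts", dep_="ROOT", children=[john, mary])
trusts.head = trusts
fears = SimpleNamespace(text="fears", dep_="conj", head=trusts, children=[hospitals])
trusts.children.append(fears)

print(find_subject_token(fears).text)
```

Because the helper only reads `dep_`, `head`, and `children`, it should accept real spaCy tokens as well, though that has not been checked against the notebook's data.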


In [11]:
from tqdm import tqdm

texts = df3["text"].tolist()
all_relationships = []

for doc in tqdm(nlp.pipe(texts, batch_size=64, n_process=1), total=len(texts)):
    relationships = []
    for sent in doc.sents:
        for token in sent:
            if token.pos_ == "VERB" and token.lemma_ in SOCIAL_RELATION_VERBS:
                subj = obj = None
                for child in token.children:
                    if "subj" in child.dep_ or child.dep_ == "agent":
                        subj = extract_entity_or_chunk(child, doc)
                    if "obj" in child.dep_:  # substring test also covers "pobj"
                        obj = extract_entity_or_chunk(child, doc)
                if subj and obj:
                    relationships.append({
                        "subject": subj,
                        "relation": token.lemma_,
                        "object": obj,
                        "context": sent.text
                    })
    all_relationships.append(relationships)

# Save results to DataFrame
df3["relationships"] = all_relationships

100%|██████████| 6057/6057 [01:25<00:00, 70.48it/s]


## Combined Solution (Summary + Detailed Triples)

In [13]:
import pandas as pd

# 1. Create Summary DataFrame (id, text, list-of-relationships)
summary_df = df3[["id", "text"]].copy()
summary_df["list-of-relationships"] = df3["relationships"]

# 2. Create Detailed Relationships DataFrame
detailed_records = []
for _, row in df3.iterrows():
    for relationship in row["relationships"]:
        subj = relationship.get('subject')
        rel = relationship.get('relation')
        obj = relationship.get('object')

        if subj and rel and obj:
            detailed_records.append({
                "text_id": row["id"],
                "context": row["text"],  # Using full text as context
                "relationship": (rel, subj, obj)  # Stored as (relation, subject, object)
            })

triples_df = pd.DataFrame(detailed_records)
# Only reset and rename index if the DataFrame is not empty
if not triples_df.empty:
    triples_df.reset_index(inplace=True)
    triples_df.rename(columns={"index": "id"}, inplace=True)

# Display the first few rows of the created DataFrames
print("Summary DataFrame:")
display(summary_df.head())

print("\nTriples DataFrame:")
display(triples_df.head())

Summary DataFrame:


Unnamed: 0,id,text,list-of-relationships
0,0,I'm such a failure I never do anything right.,[]
1,1,Nobody likes me because I'm not interesting.,"[{'subject': ' Nobody', 'relation': 'like', 'o..."
2,2,I can't try new things because I'll just mess...,[]
3,3,My boss didn't say 'good morning' she must be...,[]
4,4,My friend didn't invite me to the party I mus...,[]



Triples DataFrame:


Unnamed: 0,id,text_id,context,relationship
0,0,1,Nobody likes me because I'm not interesting.,"(like, Nobody, me)"
1,1,10,My partner didn't say 'I love you' today our ...,"(love, I, you)"
2,2,12,I didn't get a reply to my email they must ha...,"(hate, they, me)"
3,3,23,No one will ever love me because I'm too shy.,"(love, No one, me)"
4,4,43,My girlfriend broke up with me nobody can lov...,"(love, nobody, me)"


## Saving Results

In [17]:
from google.colab import drive
import os

drive.mount('/content/drive')

drive_folder_path = '/content/drive/My Drive/df3_extraction_results'

os.makedirs(drive_folder_path, exist_ok=True)

# Define the full paths for the files
summary_file_path = os.path.join(drive_folder_path, "texts_with_relationships.csv")
triples_file_path = os.path.join(drive_folder_path, "detailed_relationships.csv")

# 3. Save Results
summary_df.to_csv(summary_file_path, index=False)
triples_df.to_csv(triples_file_path, index=False)

# Show completion summary
print("✅ Processing complete!")
print(f"🗂️ Total texts processed: {len(summary_df)}")
print(f"🔍 Total relationship triples extracted: {len(triples_df)}")
print("📁 Files saved to Google Drive:")
print(f" - {summary_file_path}")
print(f" - {triples_file_path}")

Mounted at /content/drive
✅ Processing complete!
🗂️ Total texts processed: 6057
🔍 Total relationship triples extracted: 1545
📁 Files saved to Google Drive:
 - /content/drive/My Drive/df3_extraction_results/texts_with_relationships.csv
 - /content/drive/My Drive/df3_extraction_results/detailed_relationships.csv
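One caveat when reloading these files: CSV has no list or tuple type, so the `list-of-relationships` and `relationship` columns come back as plain strings. A minimal sketch of the round trip follows, using the standard `csv` module in place of pandas and assuming each cell was written as Python's string form of the object, which is what `to_csv` is expected to produce for object columns.

```python
import ast
import csv
import io

relationships = [{
    "subject": "Nobody",
    "relation": "like",
    "object": "me",
    "context": "Nobody likes me because I'm not interesting.",
}]

# Write one row the way a CSV export would: the list becomes its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "text", "list-of-relationships"])
writer.writerow([1, "Nobody likes me because I'm not interesting.", str(relationships)])

# Read it back: the cell is now a str, not a list.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw = rows[0]["list-of-relationships"]
print(type(raw).__name__)

# ast.literal_eval safely parses Python literals back into objects.
restored = ast.literal_eval(raw)
print(restored[0]["relation"])
```

With pandas, the same conversion could be applied after `read_csv` through `df["list-of-relationships"].apply(ast.literal_eval)`.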


## Distribution Statistics

In [21]:
from collections import Counter

relationship_counts = summary_df["list-of-relationships"].apply(lambda x: len(x) if isinstance(x, list) else 0)
distribution = Counter(relationship_counts)

print("\n📊 Relationship Count Distribution:")
for rel_count, num_texts in sorted(distribution.items()):
    print(f"Texts with {rel_count} relationships: {num_texts}")

# Calculate and display the total ratio of texts with relationships
total_texts = len(summary_df)
texts_with_relationships = total_texts - distribution.get(0, 0) # Subtract texts with 0 relationships
ratio_with_relationships = (texts_with_relationships / total_texts) * 100 if total_texts > 0 else 0

print(f"\n Percentage of texts with at least one relationship: {ratio_with_relationships:.2f}%")

# Display the ratio in the requested format
print(f"\n Ratio of texts with relationships to total texts: {{{total_texts}: {texts_with_relationships}}}")


📊 Relationship Count Distribution:
Texts with 0 relationships: 4964
Texts with 1 relationships: 805
Texts with 2 relationships: 192
Texts with 3 relationships: 55
Texts with 4 relationships: 25
Texts with 5 relationships: 9
Texts with 6 relationships: 4
Texts with 7 relationships: 2
Texts with 8 relationships: 1

 Percentage of texts with at least one relationship: 18.05%

 Ratio of texts with relationships to total texts: {6057: 1093}
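As a consistency check, the printed distribution can be tied back to the totals reported when the files were saved. The sketch below copies the counts from the output above and recomputes the number of texts, the number of triples, and the share of texts with at least one relationship.

```python
# Counts copied from the distribution printed above:
# {relationships per text: number of texts}
distribution = {0: 4964, 1: 805, 2: 192, 3: 55, 4: 25, 5: 9, 6: 4, 7: 2, 8: 1}

total_texts = sum(distribution.values())
total_triples = sum(k * n for k, n in distribution.items())
texts_with_relationships = total_texts - distribution[0]
percentage = 100 * texts_with_relationships / total_texts
mean_per_matching_text = total_triples / texts_with_relationships

print(total_texts, total_triples, texts_with_relationships)
# → 6057 1545 1093
print(f"{percentage:.2f}% of texts, {mean_per_matching_text:.2f} triples per matching text")
# → 18.05% of texts, 1.41 triples per matching text
```

The totals match the 6057 texts and 1545 triples reported in the save step, so the summary and detailed DataFrames agree with each other.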
