<a href="https://colab.research.google.com/github/amgi22/HateXplain/blob/master/GPT4_Stories_Generation_2908.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Research questions

*   **Question 1.** Can LLMs provide useful prior knowledge to guide causal discovery and structure learning?
*   **Question 2.** Context versus training data: which approach provides more information for causal discovery and structure learning? (Possibly the subject of a second paper.)



## Background

**Causal Discovery**
Causal discovery is the process of inferring the causal relationships among a set of variables from data, either observational or experimental. It aims to go beyond statistical correlation and recover the cause-and-effect mechanisms that govern the system.

**Causal Structure**
A causal structure is the model that represents those causal relationships. It is the outcome of causal discovery, and it gives a graphical or mathematical description of how the variables in the system influence one another. Causal structures are often represented as directed acyclic graphs (DAGs), where nodes correspond to variables and directed edges represent direct causal influence.
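
To make the DAG requirement concrete, here is a minimal sketch that stores a causal structure as an edge list and checks that it contains no directed cycle, using Kahn's topological-sort algorithm. The function name `is_dag` and the example edges are assumptions chosen for illustration; the edges mirror the three-variable example used later in this notebook.

```python
from collections import deque

def is_dag(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Start from the nodes that have no parents
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # All nodes get removed exactly when the graph is acyclic
    return visited == len(nodes)

nodes = ["Difficulty", "Intelligence", "Grade"]
edges = [("Difficulty", "Grade"), ("Intelligence", "Grade")]
print(is_dag(nodes, edges))                              # True
print(is_dag(nodes, edges + [("Grade", "Difficulty")]))  # False
```

Adding the edge from `Grade` back to `Difficulty` creates a cycle, so the second graph is not a valid causal structure in the DAG sense.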





### Step 1: Generate Sample Set

Suppose we have a ground-truth distribution \( p(G, \theta_G) \) over graphs \( G \) and graph parameters \( \theta_G \), which jointly define Bayesian network (BN) structures over a joint distribution \( p(X_{1:d}) \) of random variables \( X_{1:d} \). (Note: in the simplest case, the graph distribution can be a Dirac delta on a single ground-truth graph.)



Given the ground-truth distribution \( p(G, \theta_G) \), we can generate \( N \) realizations of \( d \) discrete random variables (binary in the simplest case). Let's assume \( d = 3 \), with variables \( I \), \( D \), and \( G \) representing intelligence, exam difficulty, and grade, respectively. Note that in this example \( G \) denotes the grade variable, not the graph.
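
As a worked illustration, assume the ground-truth graph makes intelligence and difficulty the two parents of grade (this is the structure defined in the code cell that follows). Under that assumption, the Bayesian network factorizes the joint distribution as

\[
p(I, D, G) = p(I)\, p(D)\, p(G \mid I, D).
\]

More generally, for a graph \( G \) over \( X_{1:d} \) with parameters \( \theta_G \), each variable is conditioned only on its parents in the graph, written here as \( \mathrm{Pa}_G(X_i) \):

\[
p(X_{1:d} \mid G, \theta_G) = \prod_{i=1}^{d} p\big(X_i \mid \mathrm{Pa}_G(X_i); \theta_G\big).
\]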


In [None]:
!pip install pgmpy --upgrade
# Pin openai below 1.0: the cells below use the legacy openai.ChatCompletion interface
!pip install "openai<1.0"
!pip install pandas

Collecting pgmpy
  Downloading pgmpy-0.1.23-py3-none-any.whl (1.9 MB)
Installing collected packages: pgmpy
Successfully installed pgmpy-0.1.23
Collecting openai
  Downloading openai-0.27.9-py3-none-any.whl (75 kB)
Installing collected packages: openai
Successfully installed openai-0.27.9


In [None]:
from pgmpy.models import BayesianNetwork

# Define the model structure: Difficulty and Intelligence are both parents of Grade
model = BayesianNetwork([('Difficulty', 'Grade'),
                       ('Intelligence', 'Grade')])

# Print the edges to verify
print(model.edges())


[('Difficulty', 'Grade'), ('Intelligence', 'Grade')]


In [None]:
from pgmpy.factors.discrete import TabularCPD


# Difficulty CPD
cpd_difficulty = TabularCPD(variable='Difficulty', variable_card=2,
                            values=[[0.6], [0.4]],
                            state_names={'Difficulty': ['Easy', 'Hard']})
# Intelligence CPD
cpd_intelligence = TabularCPD(variable='Intelligence', variable_card=2,
                              values=[[0.7], [0.3]],
                              state_names={'Intelligence': ['Low', 'High']})
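# Grade CPD. Columns follow the evidence order, with the first evidence
# variable varying slowest: (Low, Easy), (Low, Hard), (High, Easy), (High, Hard).
# Listing Intelligence first makes a high grade most likely for an
# intelligent student on an easy exam.
cpd_grade = TabularCPD(variable='Grade', variable_card=3,
                       values=[[0.3, 0.05, 0.9, 0.5],
                               [0.4, 0.25, 0.08, 0.3],
                               [0.3, 0.7, 0.02, 0.2]],
                       evidence=['Intelligence', 'Difficulty'],
                       evidence_card=[2, 2],
                       state_names={'Grade': ['A', 'B', 'C'],
                                    'Intelligence': ['Low', 'High'],
                                    'Difficulty': ['Easy', 'Hard']})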

# Adding the CPDs to the model
model.add_cpds(cpd_difficulty, cpd_intelligence, cpd_grade)

# Verify the model
assert model.check_model()


In [None]:
from pgmpy.sampling import BayesianModelSampling

# Number of samples
N = 2

# Create a sampler object
sampler = BayesianModelSampling(model)

# Generate samples
samples = sampler.forward_sample(size=N)

# Print the first few samples
print(samples.head())



  Difficulty Grade Intelligence
0       Easy     B          Low
1       Hard     A          Low
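
Forward sampling draws each variable only after its parents have been drawn. The following standard-library sketch shows that idea for the three-variable network, without pgmpy. The names `draw`, `ancestral_sample`, and `toy_samples` are hypothetical, and the mapping of each grade column to a parent configuration (intelligence, then difficulty) is an assumption of this sketch.

```python
import random

# Marginals of the two root variables (values from the CPDs above)
P_DIFFICULTY = {"Easy": 0.6, "Hard": 0.4}
P_INTELLIGENCE = {"Low": 0.7, "High": 0.3}

# P(Grade | Intelligence, Difficulty); the keying of each column to a
# parent configuration is an assumption of this sketch
P_GRADE = {
    ("Low", "Easy"): {"A": 0.3, "B": 0.4, "C": 0.3},
    ("Low", "Hard"): {"A": 0.05, "B": 0.25, "C": 0.7},
    ("High", "Easy"): {"A": 0.9, "B": 0.08, "C": 0.02},
    ("High", "Hard"): {"A": 0.5, "B": 0.3, "C": 0.2},
}

def draw(dist, rng):
    """Draw one state from a {state: probability} dictionary."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]

def ancestral_sample(rng):
    """Sample the parents first, then the child given its parents."""
    difficulty = draw(P_DIFFICULTY, rng)
    intelligence = draw(P_INTELLIGENCE, rng)
    grade = draw(P_GRADE[(intelligence, difficulty)], rng)
    return {"Difficulty": difficulty, "Intelligence": intelligence, "Grade": grade}

rng = random.Random(0)  # fixed seed so the run is repeatable
toy_samples = [ancestral_sample(rng) for _ in range(5)]
for sample in toy_samples:
    print(sample)
```

Each dictionary has the same three keys as a row of the `samples` DataFrame, so the same translation step can be applied to either.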


### Step 2: Convert Sample Set to Human-Readable Language and Use as Context
We convert these samples to human-readable language and use them as context in a conversation with an LLM.

Example: suppose we have \( d = 3 \) binary random variables \( I \), \( D \), and \( G \) that denote intelligence, (exam) difficulty, and grade, respectively. Let a sample be \( (I = 1, D = 0, G = 0) \).
- We first convert the sample to "clunky" English automatically, using a template. E.g., "Alice is intelligent, Exam is not difficult, Exam grade is low".
- We use an LLM to turn these clunky sentences into a story. For instance, "Despite the fact that Alice is known to be quite intelligent and the Math exam was not that hard, unfortunately her performance was not satisfactory..."
- We feed this story as context material to another LLM (or use it as training data for fine-tuning, to answer Question 2).
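
The last step, feeding stories to a second LLM as context, could be prepared along the following lines. This is only a sketch: the function name `build_context_messages`, the wording of the question, and the message layout are assumptions, and no API call is made here.

```python
def build_context_messages(stories, variables):
    """Assemble chat messages that present stories as context for a causal query."""
    # Number the stories so the model can tell them apart
    context = "\n\n".join(f"Story {i + 1}: {s}" for i, s in enumerate(stories))
    # Hypothetical question wording; the actual query is a design choice
    question = (
        "Based only on the stories above, which of the following variables "
        f"directly cause which others? Variables: {', '.join(variables)}."
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ]

example_stories = [
    "Despite the fact that Alice is known to be quite intelligent and the "
    "Math exam was not that hard, unfortunately her performance was not satisfactory..."
]
messages = build_context_messages(example_stories, ["Intelligence", "Difficulty", "Grade"])
print(messages[1]["content"])
```

The returned list has the same shape as the `messages` argument used for story generation later in this notebook, so it could be passed to a chat model in the same way.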


In [None]:
# Function to translate a sample to English
def translate_to_english(sample):
    intelligence = "intelligent" if sample['Intelligence'] == 'High' else "not intelligent"
    difficulty = "not difficult" if sample['Difficulty'] == 'Easy' else "difficult"
    grade = sample['Grade']

    sentence = f"The student is {intelligence}, Exam is {difficulty}, Exam grade is {grade}."
    return sentence

# Apply the translation function to each row in the DataFrame
english_sentences = samples.apply(translate_to_english, axis=1)

# Print the first few translated sentences (at most 10)
for i in range(min(10, len(english_sentences))):
    print(f"Sample {i + 1}: {english_sentences.iloc[i]}\n")



Sample 1: The student is not intelligent, Exam is not difficult, Exam grade is C.

Sample 2: The student is intelligent, Exam is not difficult, Exam grade is C.

Sample 3: The student is not intelligent, Exam is not difficult, Exam grade is C.

Sample 4: The student is not intelligent, Exam is difficult, Exam grade is A.

Sample 5: The student is not intelligent, Exam is difficult, Exam grade is A.

Sample 6: The student is not intelligent, Exam is difficult, Exam grade is A.

Sample 7: The student is not intelligent, Exam is not difficult, Exam grade is B.

Sample 8: The student is not intelligent, Exam is difficult, Exam grade is A.

Sample 9: The student is intelligent, Exam is difficult, Exam grade is B.

Sample 10: The student is not intelligent, Exam is difficult, Exam grade is A.



### Generate the stories using GPT-4 and store the stories in a CSV file

We use an LLM to make a story out of each sample. For instance: “Despite the fact that Alice is known to be quite intelligent and the Math exam was not that hard, unfortunately her performance was not satisfactory...”

In [None]:
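import getpass
import os

import openai

# Read the OpenAI API key from the environment; never hard-code it in the notebook.
# If it is not set, prompt for it without echoing the input.
if not os.getenv('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API key: ')
openai.api_key = os.environ['OPENAI_API_KEY']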


In [None]:
import pandas as pd
import time
import openai
import csv

# Create or open a CSV file to store the stories.
# Google Drive must already be mounted at /content/drive (see the mount cell).
csv_file_path = '/content/drive/MyDrive/Colab Notebooks/Stories/stories.csv'
csv_file = open(csv_file_path, 'w', newline='', encoding='utf-8')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Story'])  # Write the header

def generate_chat_from_sample(translated_sentence, max_retries=3):
    # Create the messages for GPT-4 from the translated sentence
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Generate a story out of this sample: {translated_sentence}. Include a random first and last name for a student. Limit to 500 characters."},
    ]

    # Retry a bounded number of times instead of recursing without limit
    for attempt in range(1, max_retries + 1):
        try:
            # Make the API call to GPT-4 (legacy openai<1.0 interface)
            response = openai.ChatCompletion.create(
                model="gpt-4-0613",
                messages=messages,
                max_tokens=4000
            )

            # Extract the generated story from the response
            story = response.choices[0].message['content']

            # Write the story to the CSV file
            csv_writer.writerow([story])

            return story

        except openai.error.OpenAIError as e:
            # openai.Error does not exist (it raises AttributeError);
            # in openai<1.0 the base exception is openai.error.OpenAIError
            print(f"OpenAI API Error: {e}. Attempt {attempt} of {max_retries}; waiting 60 seconds before retry.")
            time.sleep(60)

        except Exception as e:
            print(f"An error occurred: {e}.")
            return None

    # All retries failed
    return None

# english_sentences is the pandas Series of translated sentences built above

# Apply the chat generation function to each translated sentence
stories_series = english_sentences.apply(generate_chat_from_sample)

# Close the CSV file
csv_file.close()

# Print the first two stories in full (fewer if fewer were generated)
for i in range(min(2, len(stories_series))):
    print(f"Story {i + 1}:\n{stories_series.iloc[i]}\n" + "=" * 50 + "\n")

# Confirmation message
print(f"Stories saved to {csv_file_path}")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
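
Once the stories are stored as a CSV with a single `Story` column, they can be read back for later use as context or training data. The sketch below shows a round trip with the standard `csv` module and a context manager, so the file is closed even if an error occurs. The helper names `save_stories` and `load_stories` and the temporary demo path are assumptions, not part of the notebook above.

```python
import csv
import os
import tempfile

def save_stories(path, stories):
    """Write one story per row under a 'Story' header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Story"])
        writer.writerows([s] for s in stories)

def load_stories(path):
    """Read the stories back, skipping rows with an empty story."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["Story"] for row in csv.DictReader(f) if row["Story"]]

# Round trip on a temporary file; a Drive path would be handled the same way
demo_path = os.path.join(tempfile.mkdtemp(), "stories_demo.csv")
save_stories(demo_path, ["First story, with a comma.", "Second story\nwith a newline."])
loaded = load_stories(demo_path)
print(loaded)
```

Opening the file with `newline=""` lets the `csv` module handle quoting, so commas and line breaks inside a story survive the round trip.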
