# Tetrad Example

This notebook shows how to use Tetrad (through the py-tetrad package) to analyze causal relationships between the sentiment of user posts taken from Twitter and the associated stock market data over time.

In [1]:
!sudo pip install -U --break-system-packages datasets JPype1 git+https://github.com/cmu-phil/py-tetrad
#!sudo pip install -U --break-system-packages torch

Collecting git+https://github.com/cmu-phil/py-tetrad
  Cloning https://github.com/cmu-phil/py-tetrad to /tmp/pip-req-build-5o9dvfo6
  Running command git clone --filter=blob:none --quiet https://github.com/cmu-phil/py-tetrad /tmp/pip-req-build-5o9dvfo6
  Resolved https://github.com/cmu-phil/py-tetrad to commit 275850d12bb29e392580ee96bff633928a8296bb
  Preparing metadata (setup.py) ... done

## Import Packages

In [2]:
from datetime import datetime
import pandas as pd
import pytetrad.tools.translate as ptt
import pytetrad.tools.TetradSearch as ts
import edu.cmu.tetrad.util as util
import edu.cmu.tetrad.data as td
import edu.cmu.tetrad.algcomparison.simulation as sim
import edu.cmu.tetrad.algcomparison.algorithm.multi as multi
import java.util as jutil

## Import and Clean Tweet Sentiment Data
Using the huggingface dataset emad12/stock_tweets_sentiment

In [3]:
splits = {'train': 'data/train-00000-of-00001-49baa0648effea14.parquet', 'test': 'data/test-00000-of-00001-cb0233e05c1cc1c9.parquet'}
twt_df = pd.read_parquet("hf://datasets/emad12/stock_tweets_sentiment/" + splits["train"])



In [4]:
tweet_df = twt_df.copy(deep=True)

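# Standardize the various date formats in the dataset to 'YYYY-MM-DD' strings
def fix_dates(r):
    first_str = r['post_date'].split(None, 1)[0]
    if '-' in first_str:
        # Already ISO-like, e.g. '2015-08-26 ...': the first token is the date
        return first_str
    elif first_str.isdigit():
        # Unix epoch seconds (converted in the local timezone)
        return datetime.fromtimestamp(int(first_str)).strftime('%Y-%m-%d')
    else:
        # Twitter-style timestamp, e.g. 'Wed Aug 26 12:00:00 +0000 2015'
        return datetime.strptime(r['post_date'], '%a %b %d %X %z %Y').strftime('%Y-%m-%d')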

# Normalize date format
tweet_df['post_date'] = tweet_df.apply(fix_dates, axis=1)
tweet_df['post_date'] = pd.to_datetime(tweet_df['post_date']).dt.strftime('%Y-%m-%d')

# Filter for only those ticker symbols that have a large amount of data
tweet_df = tweet_df[tweet_df['ticker_symbol'].isin(['AMZN', 'AAPL', 'TSLA', 'GOOG', 'MSFT'])]

# Drop columns we won't be using
tweet_df = tweet_df.drop(columns=['Unnamed: 0', 'tweet', 'tweet_cleaned', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'])

# Rename columns for later when we combine this dataset with the stock price dataset
tweet_df = tweet_df.rename(columns={'post_date': 'Date', 'sentiment': 'Sentiment', 'ticker_symbol': 'Ticker'})
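
The date normalization above handles three timestamp shapes. A minimal, pandas-free sketch of the same idea follows. The helper name `to_iso_date` and the sample strings are illustrative assumptions, not part of the dataset. Unlike the cell above, this sketch converts epoch seconds in UTC and spells out `%H:%M:%S` instead of the locale-dependent `%X`, so it gives the same result on every machine.

```python
from datetime import datetime, timezone

def to_iso_date(raw: str) -> str:
    """Return the calendar date of a raw timestamp string as 'YYYY-MM-DD'."""
    first = raw.split(None, 1)[0]
    if "-" in first:
        # ISO-like input such as '2015-08-26 10:15:00': the first token is the date
        return first
    if first.isdigit():
        # Unix epoch seconds, interpreted in UTC (an assumption of this sketch)
        return datetime.fromtimestamp(int(first), tz=timezone.utc).strftime("%Y-%m-%d")
    # Twitter-style timestamp such as 'Wed Aug 26 12:00:00 +0000 2015'
    return datetime.strptime(raw, "%a %b %d %H:%M:%S %z %Y").strftime("%Y-%m-%d")

# Hypothetical sample values, one per supported shape
samples = ["2015-08-26 10:15:00", "1440590400", "Wed Aug 26 12:00:00 +0000 2015"]
normalized = [to_iso_date(s) for s in samples]
print(normalized)
```

Returning a string from every branch keeps the column homogeneous before it is passed to `pd.to_datetime`.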

In [5]:
print(tweet_df['Ticker'].value_counts())
print(tweet_df['Date'].min())
print(tweet_df['Date'].max())
tweet_df.head(5)

Ticker
TSLA    34440
AAPL    21281
AMZN    10026
GOOG     6132
MSFT     2563
Name: count, dtype: int64
2015-01-01
2019-12-31


| | Date | Sentiment | Ticker |
|---|---|---|---|
| 2 | 2015-08-26 | 0 | AMZN |
| 3 | 2019-09-10 | 1 | AAPL |
| 4 | 2019-05-16 | 0 | AAPL |
| 5 | 2018-01-24 | 1 | TSLA |
| 6 | 2017-08-04 | -1 | TSLA |


## Import and Clean Historical Stock Data for Selected Stocks
Using the huggingface dataset no-ry/world-stock-prices-daily-updating

In [6]:
st_df = pd.read_csv("hf://datasets/no-ry/world-stock-prices-daily-updating/World-Stock-Prices-Dataset.csv")

In [7]:
stock_df = st_df.copy(deep=True)
# Convert dates to a standardized format
stock_df['Date'] = pd.to_datetime(stock_df['Date'], utc=True).dt.strftime('%Y-%m-%d')

# Drop columns we will not be using
stock_df = stock_df.drop(columns=['Industry_Tag', 'Country', 'Dividends', 'Stock Splits', 'Capital Gains', 'Brand_Name'])

# Keep only those ticker symbols for which we have sufficient sentiment data
stock_df = stock_df[stock_df['Ticker'].isin(['AMZN', 'AAPL', 'TSLA', 'GOOG', 'MSFT'])]

# Remove duplicate records 
stock_df = stock_df.drop_duplicates(subset=['Date', 'Ticker'], keep='last')

In [8]:
print(stock_df['Date'].min())
print(stock_df['Date'].max())
stock_df.head(5)

2000-01-03
2025-07-03


| | Date | Open | High | Low | Close | Volume | Ticker |
|---|---|---|---|---|---|---|---|
| 86 | 2025-07-03 | 212.149994 | 214.649994 | 211.809998 | 213.550003 | 34955800.0 | AAPL |
| 96 | 2025-07-03 | 493.809998 | 500.130005 | 493.440002 | 498.839996 | 13984800.0 | MSFT |
| 97 | 2025-07-03 | 221.820007 | 224.009995 | 221.360001 | 223.410004 | 29632400.0 | AMZN |
| 99 | 2025-07-03 | 317.98999 | 318.450012 | 312.76001 | 315.350006 | 58042300.0 | TSLA |
| 111 | 2025-07-02 | 219.729996 | 221.600006 | 219.059998 | 219.919998 | 30840800.0 | AMZN |


## Combine the Datasets 
Before we can analyze this data with Tetrad, we need to combine these two datasets, keeping relevant columns and creating new aggregated fields.

In [9]:
# Aggregate the tweets per date and ticker symbol, then join them with the stock data on the same keys
agg_tweet_df = tweet_df.groupby(['Date', 'Ticker'], as_index=False).agg(
    Sentiment=pd.NamedAgg(column='Sentiment', aggfunc='mean'),
    Tweet_Volume=pd.NamedAgg(column='Sentiment', aggfunc='count'))
merged_df = stock_df.merge(agg_tweet_df, on=['Date', 'Ticker'])
merged_df['Tweet_Volume'] = merged_df['Tweet_Volume'].astype(float)
print(merged_df['Date'].min())
print(merged_df['Date'].max())
merged_df.head(5)

2015-01-02
2019-12-31


| | Date | Open | High | Low | Close | Volume | Ticker | Sentiment | Tweet_Volume |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-12-31 | 27.0 | 28.086 | 26.805332 | 27.888666 | 154285500.0 | TSLA | -0.057143 | 35.0 |
| 1 | 2019-12-31 | 92.099998 | 92.663002 | 91.611504 | 92.391998 | 50130000.0 | AMZN | 0.266667 | 15.0 |
| 2 | 2019-12-31 | 70.707799 | 71.622344 | 70.607807 | 71.615028 | 100805600.0 | AAPL | 0.176471 | 17.0 |
| 3 | 2019-12-31 | 151.376144 | 152.341738 | 151.067147 | 152.274139 | 18369400.0 | MSFT | -1.0 | 1.0 |
| 4 | 2019-12-30 | 153.519779 | 153.548745 | 151.337527 | 152.167938 | 16348400.0 | MSFT | 0.166667 | 6.0 |
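
Because `merge` performs an inner join by default, any (date, ticker) pair that appears in only one dataset is silently dropped, for example tweets posted on a non-trading day. The sketch below reproduces the aggregate-then-merge step on toy data and uses `indicator=True` on an outer join to list what the inner join would discard. All values and the frame names `tweets`, `prices`, and `daily` are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for tweet_df and stock_df (values are invented)
tweets = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-02", "2015-01-02"],
    "Ticker": ["TSLA", "TSLA", "TSLA", "AAPL"],
    "Sentiment": [1, -1, 0, 1],
})
prices = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-02", "2015-01-05"],
    "Ticker": ["TSLA", "AAPL", "TSLA"],
    "Close": [14.6, 24.5, 14.0],
})

# One row per (date, ticker): mean sentiment and number of tweets
daily = tweets.groupby(["Date", "Ticker"], as_index=False).agg(
    Sentiment=("Sentiment", "mean"),
    Tweet_Volume=("Sentiment", "count"),
)

# Inner join (the default): keeps only keys present in both frames
inner = prices.merge(daily, on=["Date", "Ticker"])

# Outer join with an indicator column: shows which keys the inner join drops
outer = prices.merge(daily, on=["Date", "Ticker"], how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(inner)
print(unmatched[["Date", "Ticker", "_merge"]])
```

Checking the unmatched rows before committing to the inner join is a cheap way to confirm that the date range of the merged data shrinks for the expected reason.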


## Split the Dataset by Ticker Symbol, Normalize, and Create Features
Because price and volume magnitudes differ between ticker symbols, we first split the dataset by ticker symbol, then min-max normalize the "Volume" column and calculate relative price deltas within each ticker's data.

In [10]:
# A function to help normalize a column
def normalize_column(column, dataframe):
    return (dataframe[column] - dataframe[column].min()) \
        / (dataframe[column].max() - dataframe[column].min())

dfs = {}
symbols = merged_df['Ticker'].unique()
for symbol in symbols:
    # Filter per ticker symbol
    dfs[symbol] = merged_df[merged_df['Ticker'] == symbol].sort_values('Date').copy(deep=True)
    
    # Normalize volume
    dfs[symbol]['Volume'] = normalize_column('Volume', dfs[symbol])
    
    # Create delta columns as daily price movement indicators, relative to the open
    dfs[symbol]['Daily_Stock_Close_Delta'] = (dfs[symbol]['Close'] - dfs[symbol]['Open']) / dfs[symbol]['Open']
    dfs[symbol]['Daily_Stock_Low_Delta'] = (dfs[symbol]['Low'] - dfs[symbol]['Open']) / dfs[symbol]['Open']
    dfs[symbol]['Daily_Stock_High_Delta'] = (dfs[symbol]['High'] - dfs[symbol]['Open']) / dfs[symbol]['Open']

    # Create exponential moving average to capture a bit of history
    dfs[symbol]['EMA'] = dfs[symbol]['Open'].ewm(span=5).mean()
    dfs[symbol]['EMA_Delta'] = (dfs[symbol]['EMA'] - dfs[symbol]['Open']) / dfs[symbol]['Open']
    
    # Rename the volume column, drop the raw columns, and fill missing values with 0
    dfs[symbol] = dfs[symbol].rename(columns={'Volume': 'Stock_Volume'})
    dfs[symbol] = dfs[symbol].drop(columns=['Ticker', 'Date', 'High', 'Low', 'Open', 'Close', 'EMA'])
    dfs[symbol] = dfs[symbol].fillna(0)

dfs['TSLA'].head(10)

| | Stock_Volume | Sentiment | Tweet_Volume | Daily_Stock_Close_Delta | Daily_Stock_Low_Delta | Daily_Stock_High_Delta | EMA_Delta |
|---|---|---|---|---|---|---|---|
| 4582 | 0.123139 | 0.0 | 5.0 | -0.015973 | -0.043119 | 0.001705 | 0.0 |
| 4578 | 0.141477 | 0.0 | 4.0 | -0.020788 | -0.034444 | 0.009089 | 0.015512 |
| 4574 | 0.168598 | 0.333333 | 3.0 | 0.005808 | -0.027849 | 0.019709 | 0.019588 |
| 4569 | 0.08301 | -0.333333 | 3.0 | -0.010291 | -0.013157 | 0.004652 | 0.003749 |
| 4565 | 0.120221 | -0.75 | 4.0 | -0.010818 | -0.018955 | 0.005074 | 0.013825 |
| 4564 | 0.159139 | 0.166667 | 6.0 | -0.004137 | -0.018715 | 0.006993 | 0.027372 |
| 4560 | 0.114423 | -0.315789 | 19.0 | 0.004574 | -0.011853 | 0.0211 | 0.016799 |
| 4555 | 0.329185 | -0.157895 | 19.0 | 0.036915 | -0.004467 | 0.050422 | 0.073477 |
| 4554 | 0.136863 | 0.333333 | 3.0 | -0.013471 | -0.023086 | 0.006479 | 0.01689 |
| 4551 | 0.087889 | -1.0 | 2.0 | 0.012428 | -0.005506 | 0.019874 | 0.024515 |
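
The feature construction above can be traced by hand on a tiny series. The sketch below is a pure-Python approximation, assuming pandas' defaults for `ewm(span=5).mean()` (`alpha = 2 / (span + 1)` and `adjust=True`). The function names and the sample opening prices are hypothetical. It also shows why the first `EMA_Delta` in the table is exactly 0.0: the EMA of a single observation is that observation.

```python
def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def ema_adjusted(values, span):
    """Exponentially weighted mean with normalized weights (pandas adjust=True)."""
    alpha = 2.0 / (span + 1)
    out, num, den = [], 0.0, 0.0
    for v in values:
        # Weighted sum and weight total, both decayed by (1 - alpha) each step
        num = v + (1 - alpha) * num
        den = 1 + (1 - alpha) * den
        out.append(num / den)
    return out

opens = [10.0, 13.0, 12.0]          # invented opening prices
ema = ema_adjusted(opens, span=5)
ema_delta = [(e - o) / o for e, o in zip(ema, opens)]

print(min_max([2.0, 4.0, 6.0]))
print(ema)
print(ema_delta)
```

Dividing each delta by the open turns absolute price moves into relative ones, which is what makes the columns comparable across tickers with very different price levels.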


## Set the Prior Knowledge
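Before running a search with TetradSearch, we can encode our prior knowledge in the model. Variables are placed in ordered tiers, where a variable in a later tier cannot cause a variable in an earlier tier, and one edge (EMA_Delta to Sentiment) is marked as required.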

In [11]:
# Set knowledge tiers
kno_tier0 = ['EMA_Delta']
kno_tier1 = ['Daily_Stock_Low_Delta', 'Daily_Stock_High_Delta', 'Sentiment', 'Stock_Volume', 'Tweet_Volume']
kno_tier2 = ['Daily_Stock_Close_Delta']

knowledge = td.Knowledge()
for col in kno_tier0:
    knowledge.addToTier(0, col)
for col in kno_tier1:
    knowledge.addToTier(1, col)
for col in kno_tier2:
    knowledge.addToTier(2, col)
    
# Create a required edge
knowledge.setRequired('EMA_Delta', 'Sentiment')
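
To make the effect of the tiers concrete, the following pure-Python sketch lists the edge directions that a tier ordering rules out, assuming the usual reading that a variable in a later tier cannot cause one in an earlier tier. It does not call Tetrad; `forbidden_by_tiers` is a hypothetical helper written only for this illustration.

```python
from itertools import product

# Same tier assignment as the cell above
tiers = [
    ["EMA_Delta"],
    ["Daily_Stock_Low_Delta", "Daily_Stock_High_Delta", "Sentiment",
     "Stock_Volume", "Tweet_Volume"],
    ["Daily_Stock_Close_Delta"],
]

def forbidden_by_tiers(tiers):
    """Return every (cause, effect) pair that points from a later tier to an earlier one."""
    forbidden = set()
    for later in range(len(tiers)):
        for earlier in range(later):
            forbidden.update(product(tiers[later], tiers[earlier]))
    return forbidden

forbidden = forbidden_by_tiers(tiers)
required = {("EMA_Delta", "Sentiment")}

# A required edge must not contradict the tier ordering
conflicts = required & forbidden
print(len(forbidden), "forbidden directions;", len(conflicts), "conflicts")
```

Edges within a tier, and edges from an earlier tier to a later one, remain open for the search algorithm to decide.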

## Run the FGES Search Algorithm
We will use the FGES algorithm to determine the Completed Partially Directed Acyclic Graph (CPDAG).

In [12]:
# Create a TetradSearch object, set its knowledge, indicate tests to be used
fges_search = ts.TetradSearch(dfs['TSLA'])
fges_search.set_knowledge(knowledge)
fges_search.use_sem_bic()
fges_search.use_fisher_z(alpha=0.05)
fges_search.set_time_lag(1)


# Run the FGES algorithm
fges_result = fges_search.run_fges()
print(fges_search.get_string())

Error computing BIC: Graph must not be null.
here
Graph Nodes:
Stock_Volume;Sentiment;Tweet_Volume;Daily_Stock_Close_Delta;Daily_Stock_Low_Delta;Daily_Stock_High_Delta;EMA_Delta;Stock_Volume:1;Sentiment:1;Tweet_Volume:1;Daily_Stock_Close_Delta:1;Daily_Stock_Low_Delta:1;Daily_Stock_High_Delta:1;EMA_Delta:1

Graph Edges:
1. Daily_Stock_Close_Delta:1 --> Daily_Stock_Close_Delta
2. Daily_Stock_Close_Delta:1 --> Daily_Stock_High_Delta:1
3. Daily_Stock_Close_Delta:1 --> Daily_Stock_Low_Delta
4. Daily_Stock_Close_Delta:1 --> Daily_Stock_Low_Delta:1
5. Daily_Stock_Close_Delta:1 --> EMA_Delta
6. Daily_Stock_High_Delta --> Daily_Stock_Close_Delta
7. Daily_Stock_High_Delta --> Daily_Stock_Low_Delta
8. Daily_Stock_High_Delta --> EMA_Delta
9. Daily_Stock_High_Delta:1 --> Daily_Stock_Low_Delta
10. Daily_Stock_High_Delta:1 --> Stock_Volume:1
11. Daily_Stock_Low_Delta --> Daily_Stock_Close_Delta
12. Daily_Stock_Low_Delta:1 --> Daily_Stock_High_Delta:1
13. Daily_Stock_Low_Delta:1 --> Daily_Stock_Low_De

The "Error computing BIC" message in the output above is an issue with the py-tetrad library that appears on any FGES-based search; the search still returns a graph. The score attributes indicate the strength of the graph and its constituent nodes, with higher values indicating higher confidence in the determined graph. Each node's score represents how that node's relationships within the broader graph contribute to the strength of the graph as a whole.

These graph edges indicate some degree of causality determined by the search algorithm given the data. For example, "Tweet_Volume --> Sentiment" indicates that tweet volume has some causal effect on sentiment within this dataset. Similarly, "Tweet_Volume --- Stock_Volume" indicates a relationship between tweet volume and stock volume whose direction could not be determined: the DAGs that comprise this CPDAG all contain the edge but disagree on its orientation, because no collider (or orientation rule following from one) fixes it.
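
Why some edges stay unoriented can be seen on a three-node toy graph. The sketch below is a pure-Python illustration that does not use Tetrad: it enumerates every orientation of the skeleton X - Y - Z, groups the resulting DAGs by their unshielded colliders (DAGs with the same skeleton and the same colliders are Markov equivalent), and then marks an edge as directed only when all DAGs in a group agree. The node names and helper functions are hypothetical.

```python
from itertools import product

skeleton = [("X", "Y"), ("Y", "Z")]  # X and Z are not adjacent

def colliders(dag_edges):
    """Unshielded colliders a -> c <- b in a DAG given as (cause, effect) pairs."""
    adjacent = {frozenset(e) for e in dag_edges}
    found = set()
    for (a, c1), (b, c2) in product(dag_edges, repeat=2):
        if c1 == c2 and a < b and frozenset((a, b)) not in adjacent:
            found.add((a, c1, b))
    return frozenset(found)

# Group all orientations of the skeleton by their collider sets
classes = {}
for flips in product([False, True], repeat=len(skeleton)):
    dag = tuple((b, a) if flip else (a, b) for (a, b), flip in zip(skeleton, flips))
    classes.setdefault(colliders(dag), []).append(dag)

# Summarize each equivalence class as a CPDAG-style edge list
cpdags = {}
for key, members in classes.items():
    marks = []
    for a, b in skeleton:
        directions = {"-->" if (a, b) in dag else "<--" for dag in members}
        marks.append((a, directions.pop() if len(directions) == 1 else "---", b))
    cpdags[key] = marks
    print(len(members), "DAG(s):", marks)
```

The three collider-free orientations collapse into one class with two undirected edges, while the single collider orientation forms its own class with both edges directed. This is the same mechanism that leaves "---" edges in a CPDAG.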

## Run the FCI Search Algorithm 
We will use the FCI algorithm to determine a Partial Ancestral Graph (PAG).

In [13]:
# Create a TetradSearch object, set its knowledge, indicate tests to be used
fci_search = ts.TetradSearch(dfs['TSLA'])
fci_search.set_knowledge(knowledge)
fci_search.use_sem_bic()
fci_search.use_fisher_z(alpha=0.05)
fci_search.set_time_lag(1)

# Run the FCI algorithm
graph = fci_search.run_fci()
print(fci_search.get_string())

Doing possible dsep search.
Removed Sentiment:1 o-> Tweet_Volume:1 by possible dsep
Removed Sentiment:1 o-o EMA_Delta:1 by possible dsep
Graph Nodes:
Stock_Volume;Sentiment;Tweet_Volume;Daily_Stock_Close_Delta;Daily_Stock_Low_Delta;Daily_Stock_High_Delta;EMA_Delta;Stock_Volume:1;Sentiment:1;Tweet_Volume:1;Daily_Stock_Close_Delta:1;Daily_Stock_Low_Delta:1;Daily_Stock_High_Delta:1;EMA_Delta:1

Graph Edges:
1. Daily_Stock_Close_Delta:1 --> Daily_Stock_High_Delta:1
2. Daily_Stock_Close_Delta:1 o-> Daily_Stock_Low_Delta:1
3. Daily_Stock_Close_Delta:1 o-> EMA_Delta
4. Daily_Stock_High_Delta --> Daily_Stock_Close_Delta
5. Daily_Stock_High_Delta <-> Daily_Stock_Low_Delta
6. Daily_Stock_High_Delta:1 --> Tweet_Volume:1
7. Daily_Stock_Low_Delta --> Daily_Stock_Close_Delta
8. Daily_Stock_Low_Delta:1 --> Daily_Stock_High_Delta:1
9. Daily_Stock_Low_Delta:1 --> Tweet_Volume:1
10. EMA_Delta --> Sentiment
11. Sentiment o-o Sentiment:1
12. EMA_Delta:1 o-> Daily_Stock_Low_Delta:1
13. EMA_Delta:1 o-> EMA_

These edges in the produced PAG represent what the search algorithm can and cannot determine about the relationships between the nodes. For instance, "EMA_Delta --> Sentiment" indicates that EMA_Delta is a cause of Sentiment, though the effect may be direct or indirect and there may also exist an unmeasured confounder.

"EMA_Delta:1 o-> EMA_Delta" indicates that the previous day's EMA delta causes the current day's EMA delta, that there is an unmeasured variable that is a cause of both, or that both of the previous statements are true.

"Tweet_Volume <-> Stock_Volume:1" indicates that neither causes the other and that there exists an unmeasured variable that directly or indirectly causes both.

Lastly, "Sentiment o-o Sentiment:1" indicates some causal relationship between the two variables, or that there is an unmeasured variable that causes both variables, or a combination of the two. Since we know from the prior knowledge that a given day's sentiment cannot cause the previous day's sentiment, we can rule out that possibility.
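
The readings above can be applied mechanically to the printed edge list. The following sketch parses lines in the format shown in the output and attaches a plain-language reading to each edge type. The wording in `EDGE_READINGS` paraphrases the explanations in this section, and the function names are hypothetical helpers, not part of py-tetrad.

```python
import re

# Paraphrased readings of the PAG edge types discussed above
EDGE_READINGS = {
    "-->": "{a} is a cause of {b} (direct or indirect); an unmeasured confounder may also exist",
    "o->": "{a} causes {b}, or an unmeasured variable causes both, or both are true",
    "<->": "neither causes the other; an unmeasured variable causes both {a} and {b}",
    "o-o": "{a} causes {b}, {b} causes {a}, an unmeasured variable causes both, or a combination",
}

EDGE_PATTERN = re.compile(r"^\s*(?:\d+\.\s+)?(\S+)\s+(-->|o->|<->|o-o)\s+(\S+)\s*$")

def parse_edge(line):
    """Split a line such as '11. Sentiment o-o Sentiment:1' into (a, mark, b)."""
    match = EDGE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognized PAG edge: {line!r}")
    return match.groups()

def explain(line):
    a, mark, b = parse_edge(line)
    return EDGE_READINGS[mark].format(a=a, b=b)

# Lines copied from the FCI output above
for line in [
    "4. Daily_Stock_High_Delta --> Daily_Stock_Close_Delta",
    "5. Daily_Stock_High_Delta <-> Daily_Stock_Low_Delta",
    "11. Sentiment o-o Sentiment:1",
    "12. EMA_Delta:1 o-> Daily_Stock_Low_Delta:1",
]:
    print(explain(line))
```

Names ending in `:1` are the one-step lagged copies of a variable, so an edge into an unlagged node from a `:1` node describes an effect across days.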

## Simulation and Sensitivity Analysis

In [14]:
import pytetrad.tools.cpn as cpn
import torch.nn as nn

# Set up a CausalPerceptronNetwork to simulate data from the graph found by FGES
noise_distributions = {}
for node in fges_search.java.getNodes():
    noise_distributions[node] = cpn.NoiseDistribution(distribution_type="normal", mean=0, std=1)

# Use a distinct variable name so the cpn module is not shadowed
cpn_model = cpn.CausalPerceptronNetwork(
    graph=fges_search.java,
    num_samples=10000,
    noise_distributions=noise_distributions,  # Noise distribution for each node
    hidden_dimensions=[50, 50, 50, 50, 50],
    input_scale=1,
    activation_module=nn.LeakyReLU(),
    nonlinearity='leaky_relu',
    discrete_prob=0,  # No discrete variables
    seed=None  # Unseeded, to show how sensitive the result is across repeated runs
)

# Simulate the data from the CPN that was created from the FGES graph
simulated_df = cpn_model.generate_data()

# Reconstruct the CPDAG from this simulated data
search = ts.TetradSearch(simulated_df)
search.set_knowledge(knowledge)
search.use_sem_bic()
search.use_fisher_z(alpha=0.05)
search.set_time_lag(1)

# Run the FGES algorithm to try to reproduce the CPDAG
graph = search.run_fges()
print(search.get_string())

Error computing BIC: Graph must not be null.
here

Graph Nodes:
Stock_Volume;Sentiment;Tweet_Volume;Daily_Stock_Close_Delta;Daily_Stock_Low_Delta;Daily_Stock_High_Delta;EMA_Delta;Stock_Volume:1;Sentiment:1;Tweet_Volume:1;Daily_Stock_Close_Delta:1;Daily_Stock_Low_Delta:1;Daily_Stock_High_Delta:1;EMA_Delta:1

Graph Edges:
1. Daily_Stock_Close_Delta --> Daily_Stock_High_Delta:1
2. Daily_Stock_Close_Delta:1 --> Daily_Stock_High_Delta:1
3. Daily_Stock_Close_Delta:1 --> Daily_Stock_Low_Delta:1
4. Daily_Stock_Close_Delta:1 --> EMA_Delta
5. Daily_Stock_High_Delta --> Daily_Stock_Close_Delta
6. Daily_Stock_High_Delta --> Daily_Stock_Low_Delta
7. Daily_Stock_High_Delta --> Daily_Stock_Low_Delta:1
8. Daily_Stock_High_Delta --> EMA_Delta
9. Daily_Stock_High_Delta --> Sentiment:1
10. Daily_Stock_High_Delta --> Stock_Volume
11. Daily_Stock_High_Delta --> Tweet_Volume:1
12. Daily_Stock_Low_Delta --> Daily_Stock_Close_Delta
13. Daily_Stock_Low_Delta --> Sentiment
14. Daily_Stock_Low_Delta --> Daily_Sto

Comparing this graph's score to the original graph's score, we can see that the reconstructed graph is a weaker representation. It retains 49 edges, compared to 30 in the original graph, because the search algorithm was less able to establish independence between variables and therefore eliminated fewer edges.
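
Beyond edge counts and scores, the two graphs can be compared edge by edge. The sketch below computes adjacency precision and recall for a candidate graph against a reference graph, plus the number of edges that match exactly. The edge tuples are a small subset copied from the two FGES outputs above, which are truncated in this notebook, so the resulting numbers only illustrate the method. The function names are hypothetical.

```python
def adjacencies(edges):
    """Unordered node pairs, ignoring edge marks and direction."""
    return {frozenset((a, b)) for a, _, b in edges}

def compare(reference, candidate):
    ref_adj, cand_adj = adjacencies(reference), adjacencies(candidate)
    shared = ref_adj & cand_adj
    return {
        "adjacency_precision": len(shared) / len(cand_adj),
        "adjacency_recall": len(shared) / len(ref_adj),
        "identical_edges": len(set(reference) & set(candidate)),
    }

# Subset of the original FGES graph
original = [
    ("Daily_Stock_Close_Delta:1", "-->", "Daily_Stock_High_Delta:1"),
    ("Daily_Stock_Close_Delta:1", "-->", "EMA_Delta"),
    ("Daily_Stock_High_Delta", "-->", "Daily_Stock_Close_Delta"),
    ("Daily_Stock_High_Delta", "-->", "Daily_Stock_Low_Delta"),
    ("Daily_Stock_Low_Delta", "-->", "Daily_Stock_Close_Delta"),
]
# Subset of the graph reconstructed from simulated data
reconstructed = [
    ("Daily_Stock_Close_Delta", "-->", "Daily_Stock_High_Delta:1"),
    ("Daily_Stock_Close_Delta:1", "-->", "Daily_Stock_High_Delta:1"),
    ("Daily_Stock_Close_Delta:1", "-->", "EMA_Delta"),
    ("Daily_Stock_High_Delta", "-->", "Daily_Stock_Close_Delta"),
    ("Daily_Stock_High_Delta", "-->", "Sentiment:1"),
    ("Daily_Stock_Low_Delta", "-->", "Sentiment"),
]

result = compare(original, reconstructed)
print(result)
```

Low precision with reasonable recall would match the observation above: the reconstruction keeps most original adjacencies but adds extra ones it could not rule out.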

## An Alternative Algorithm: IMaGES
The IMaGES algorithm is effective at building a combined CPDAG from multiple separate but similarly structured datasets.

In [15]:
# Instantiate the Images algorithm 
alg = multi.Images()
# Set knowledge
alg.setKnowledge(knowledge)
# Set parameters. Images uses SEM BIC for scoring by default
params = util.Parameters()
params.set(util.Params.PENALTY_DISCOUNT, 2)
params.set(util.Params.TIME_LAG, 1)
# Add data to a list, keeping separated by ticker symbol
data_list = jutil.ArrayList()
for symbol in symbols:
    data_list.add(ptt.pandas_data_to_tetrad(dfs[symbol]))

# Run and print results
cpdag = alg.search(data_list, params)
ptt.print_java(cpdag)

Graph Nodes:
Stock_Volume;Sentiment;Tweet_Volume;Daily_Stock_Close_Delta;Daily_Stock_Low_Delta;Daily_Stock_High_Delta;EMA_Delta;Stock_Volume:1;Sentiment:1;Tweet_Volume:1;Daily_Stock_Close_Delta:1;Daily_Stock_Low_Delta:1;Daily_Stock_High_Delta:1;EMA_Delta:1

Graph Edges:
1. Daily_Stock_Close_Delta:1 --- Daily_Stock_Low_Delta:1
2. Daily_Stock_High_Delta --> Daily_Stock_Close_Delta
3. Daily_Stock_High_Delta:1 --- Daily_Stock_Close_Delta:1
4. EMA_Delta --> Sentiment
5. EMA_Delta --- Sentiment:1
6. EMA_Delta:1 --- EMA_Delta
7. Stock_Volume --- Daily_Stock_High_Delta
8. Stock_Volume --- Daily_Stock_Low_Delta
9. Stock_Volume --- Tweet_Volume
10. Stock_Volume:1 --- Stock_Volume
11. Stock_Volume:1 --- Tweet_Volume:1

Graph Attributes:
Score: 43235.130487

Graph Node Attributes:
Score: [Stock_Volume: 1806.561269494074;Sentiment: -944.0424844068725;Tweet_Volume: -8139.990429853084;Daily_Stock_Close_Delta: 7273.004519313469;Daily_Stock_Low_Delta: 7354.210952229297;Daily_Stock_High_Delta: 7398.1676

Comparing this result to the original graph, we can see that it has a slightly higher score but far fewer oriented edges. This indicates that the search algorithm was less able to orient the remaining edges but that it was able to determine more independence between variables, thus eliminating edges. 