[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/danielmlow/construct-tracker/blob/main/tutorials/construct_tracker.ipynb)

# Tutorial: using construct-tracker to measure the constructs you choose in text

- Author: Daniel M. Low
- License: Apache 2.0
- Date: 29/07/2024
- **If you use, please cite**: Low DM, Rankin O, Coppersmith DDL, Bentley KH, Nock MK, Ghosh SS (2024). Building lexicons with Generative AI result in lightweight and interpretable text models with high content validity. arXiv.


### construct-tracker
##### - **lightweight**: no GPU needed (unlike LLMs)
##### - **private and free**: you can run on your local computer instead of submitting to a cloud API (OpenAI) which may not be secure
##### - **interpretable**: understand why the model outputs a given score, which can help avoid biases
##### - **high content validity**: measure what you actually want to measure (unlike existing lexicons or models that measure something only slightly related)


<!-- There are three options:
* **Lexicon**: Create a lexicon with Generative AI for the constructs you want to measure and obtain the counts of those constructs in text.
* **Lexicon + construct-text similarity (CTS)**: find similar meaning phrases: Since counting exact matches will miss similar and relevant words, use CTS to include similar phrases. **recommended**
* **Just construct-text similarity (CTS)**: skip creating a lexicon and just provide CTS with a few examples (might not work as well). No API key needed.  -->


We provide (A) a "Quick start" section below, followed by (B) a "Use all special features" section.




In [4]:
# Install construct-tracker 
!pip install --upgrade construct-tracker 

Collecting construct-tracker
  Downloading construct_tracker-1.0.13-py3-none-any.whl.metadata (13 kB)
Downloading construct_tracker-1.0.13-py3-none-any.whl (12.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.7/12.7 MB 38.7 MB/s eta 0:00:00
Installing collected packages: construct-tracker
  Attempting uninstall: construct-tracker
    Found existing installation: construct-tracker 1.0.12
    Uninstalling construct-tracker-1.0.12:
      Successfully uninstalled construct-tracker-1.0.12
Successfully installed construct-tracker-1.0.13


In [7]:
# Import packages
import pandas as pd
import sys
import os
import shutil 
import datetime
import copy
from construct_tracker import lexicon

Let's create a lexicon to measure constructs related to insight and mindfulness. Here are some examples from a target dataset of documents (e.g., survey responses, social media posts).

In [8]:
documents = [
 'Every time I speak with my cousin Bob, I have great moments of insight, clarity, and wisdom',
 "He meditates a lot, but he's not super smart",
 'He is too competitive']	

In [9]:
# Or load from Google Drive 
google_drive = False # Load files from Google Drive

if google_drive:
	current_path = '/content/drive/My Drive/Colab Notebooks'
	# Sign into drive to gain access to your documents in a dataframe and/or api_keys.py
	from google.colab import drive
	drive.mount('/content/drive')
	sys.path.append(current_path)
	# Change path and column name accordingly.
	documents = pd.read_csv('/content/drive/My Drive/insight_project/my_documents.csv')['documents_column'].tolist()

In [10]:
# load reddit posts and count 
# Info: https://zenodo.org/records/3941387
# Citation: Low, D. M., Rumker, L., Torous, J., Cecchi, G., Ghosh, S. S., & Talkar, T. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of medical Internet research, 22(10), e22635.

reddit_df = pd.read_csv('https://mair.sites.fas.harvard.edu/datasets/rmhd_27subreddits_1300posts_train.csv', index_col = 0)
reddit_df

Unnamed: 0,subreddit,author,date,post,automated_readability_index,coleman_liau_index,flesch_kincaid_grade_level,flesch_reading_ease,gulpease_index,gunning_fog_index,...,tfidf_wish,tfidf_without,tfidf_wonder,tfidf_work,tfidf_worri,tfidf_wors,tfidf_would,tfidf_wrong,tfidf_x200b,tfidf_year
0,EDAnonymous,lillylourose,2018/11/28,"The reason why I stoped with eating? Well, for...",0.478964,2.747789,2.109524,95.205000,86.583333,4.761905,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0
1,EDAnonymous,tinyTRONgirl,2019/10/15,I’m freaking out WHY can’t my body just digest...,5.559945,5.475852,6.247874,80.769913,67.394161,8.309854,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.131764,0.000000,0.00000,0.0
2,EDAnonymous,Vetmyana,2019/07/02,Tw weight loss achievement Just lost 8lbs (wei...,2.546452,4.004821,3.520194,91.910290,75.451613,4.960000,...,0.0,0.254812,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0
3,EDAnonymous,Fastingcametome,2019/07/25,When not eating is your solution to everything...,0.836122,2.859536,2.359891,95.598204,82.741497,5.824762,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.145753,0.00000,0.0
4,EDAnonymous,bananamo7,2019/04/19,How to dedicate a long weekend to beginning re...,5.119444,6.724697,6.719444,70.932500,69.185185,10.585185,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28075,unitedkingdom,Anomalous-Entity,2019/05/07,When someone is having a go at the U.S. and ma...,1.758324,4.527215,4.326374,77.275769,88.038462,6.048352,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.258879,0.000000,0.00000,0.0
28076,unitedkingdom,Squigglish,2019/03/28,My MP voted against every single Indicative Mo...,6.098205,9.032098,6.976923,52.759744,104.384615,7.887179,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0
28077,unitedkingdom,javaxcore,2019/09/02,What is Order 66? I have heard much talk of th...,0.761500,2.701924,1.290000,103.625000,82.500000,4.000000,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0
28078,unitedkingdom,AlwaysGoForAusInRisk,2019/05/13,Seeing the GB Ambassador for Denmark today at ...,8.632452,8.773091,9.139258,61.853071,60.968504,12.296513,...,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.00000,0.0


In [11]:
if google_drive:
	lexicon_dir = current_path+'/ct_lexicons/' # where lexicons are saved
	output_dir = current_path+'/ct_datasets/' # where final datasets are saved
	# CHANGE PATH
	documents = pd.read_csv('/content/drive/My Drive/your_project_name/your_csv_file.csv')['documents_column_name'].tolist()
else:
	lexicon_dir = './ct_lexicons/' # where lexicons are saved
	output_dir = './ct_datasets/' # where final datasets are saved

os.makedirs(lexicon_dir, exist_ok=True)
os.makedirs(output_dir, exist_ok=True)

# Create lexicon with generative AI

If you want to use generative AI models from Python, you need to provide a key from a given provider (e.g., OpenAI, Google). Alternatively, you can use ChatGPT, Bing, or another chat interface in your browser to obtain a list, then copy and paste it when indicated below.
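
If you take the copy-and-paste route, the pasted text has to become a Python list before it can be added to a lexicon. Here is a minimal sketch, assuming you asked the chat interface for semicolon-separated tokens (the format requested by the default prompt shown later in this tutorial). `parse_pasted_tokens` is a hypothetical helper, not part of construct-tracker:

```python
def parse_pasted_tokens(pasted_text, separator=";"):
    """Split pasted text into clean, de-duplicated tokens (original order preserved)."""
    tokens = []
    seen = set()
    for raw_token in pasted_text.split(separator):
        # Trim the token and collapse internal whitespace or newlines
        token = " ".join(raw_token.split())
        if token and token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


# Illustrative pasted text (made up for this example)
pasted = "insight; realized; epiphany;\n clarity of  thought; insight;"
pasted_tokens = parse_pasted_tokens(pasted)
print(pasted_tokens)
```

The resulting list can then be passed as `value` to `my_lexicon.add(...)`, in the same way a manual list of tokens is added later in this tutorial.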

# Set OpenRouterAI API key and choose model
- The API key is associated with your account. Create an account at https://openrouter.ai/, add credits, and create an API key.

- All models: https://openrouter.ai/models
    - Paid models:
        - 'gpt-4o' (recommended)
        - 'gpt-4o-mini' (cheaper)
        - 'anthropic/claude-3.5-sonnet' (recommended)
    - Free models: https://openrouter.ai/models?max_price=0
        - Free models are rate limited: a certain number of requests per minute (e.g., 2) and per day (depends on the model). See https://openrouter.ai/docs/limits
        - "google/gemini-2.0-flash-exp:free"
        - "meta-llama/llama-3.1-70b-instruct:free"
        - "meta-llama/llama-3.1-8b-instruct:free"
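
Because free models are rate limited, a loop that sends several requests in a row (for example, one per construct and temperature) can hit the limit. Here is a minimal sketch of a client-side throttle, assuming a limit of 2 requests per minute as in the example above. `MinIntervalThrottle` is a hypothetical helper, not part of construct-tracker or OpenRouter:

```python
import time


class MinIntervalThrottle:
    """Wait so that consecutive calls are at least 60 / requests_per_minute seconds apart."""

    def __init__(self, requests_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock    # injectable so the class can be tested without real waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep if needed and return the number of seconds waited."""
        waited = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_call = self._clock()
        return waited


# Usage sketch (assumed limit of 2 requests per minute):
# throttle = MinIntervalThrottle(requests_per_minute=2)
# throttle.wait()  # call this before each request to a free model
```

Calling `throttle.wait()` before each request spaces the calls out evenly. It does not handle daily limits, which depend on the model.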

Add your private API key, which you can get from OpenRouter AI: https://openrouter.ai/settings/keys

- (a) 'input_box': paste the key into the input box that will appear below the cell or at the top of the screen. This keeps it hidden in case you share this notebook with someone. 
- (b) 'add_manually': assign it to the variable. But then do not share this notebook publicly or others can use your key. 
- (c) 'google_drive': add a file on Google Drive called api_keys.py with a variable called openrouter_key = 'YOUR_KEY'
- (d) 'local': add a file on your local computer called api_keys.py with a variable called openrouter_key = 'YOUR_KEY'. Be careful when sharing that file. Add it to your .gitignore if you're using GitHub.
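
For option (a), note that Python's built-in `input()` shows what you type, so the key may remain visible in the cell output. Here is a minimal sketch of an alternative, assuming you prefer the standard library's `getpass.getpass`, which does not echo the typed text. `read_api_key` is a hypothetical helper, and the environment variable name `api_key` is the one this tutorial sets in the next cell:

```python
import getpass
import os


def read_api_key(env_var="api_key", prompt=getpass.getpass):
    """Return the key from the environment if set; otherwise ask for it without echoing."""
    key = os.environ.get(env_var)
    if not key:
        key = prompt(f"Paste your API key for {env_var} (input is hidden): ")
    key = (key or "").strip()
    if not key:
        raise ValueError("No API key provided")
    return key


# Usage (commented out so this block does not stop and wait for input):
# OPENROUTER_API_KEY = read_api_key()
```

Checking the environment first means the prompt only appears when no key has been set in the current session.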


In [12]:
add_api_key = 'local' 

if add_api_key == 'input_box':
	# This key will only work for free models. get your own and add some dollars from OpenRouter AI: https://openrouter.ai/settings/keys
	# Here is one of mine with $0 "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" that might work
	OPENROUTER_API_KEY = input('Add key (input box may appear at the top of the notebook or below):') 
	
elif add_api_key == 'add_manually':
	OPENROUTER_API_KEY = 'YOUR-API-KEY'

elif add_api_key == 'google_drive':
# Sign into drive to gain access to your documents in a dataframe and/or api_keys.py
	from google.colab import drive
	drive.mount('/content/drive')
	sys.path.append('/content/drive/My Drive/Colab Notebooks')
	import api_keys
	OPENROUTER_API_KEY = api_keys.openrouter_key 
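elif add_api_key == 'local':
	try:
		import api_keys
		OPENROUTER_API_KEY = api_keys.openrouter_key 
	except ImportError:
		# Fail early with a clear message instead of a NameError on the line below
		raise ImportError("Module 'api_keys' not found. Create api_keys.py with openrouter_key = 'YOUR_KEY' or choose another option above.")
os.environ["api_key"]  = OPENROUTER_API_KEY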

		

Now specify which model to use

In [13]:
model = 'gpt-4o-mini' # Many options, including "gpt-4o", "gpt-4o-mini" (cheaper), and 'meta-llama/llama-3.1-70b-instruct:free' (free, but limited to a few requests per minute, so it won't work for the whole tutorial). Others: https://openrouter.ai/models


# A. Quick start

### 1. Build lexicon with Generative AI and then count in documents

In [14]:
my_lexicon = lexicon.Lexicon()         # Initialize lexicon

my_lexicon.add('Insight', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	  examples = ['insight', 'realized'])

my_lexicon.add('Mindfulness', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['mindfulness', 'meditate'])

my_lexicon.add('Compassion', section = 'tokens', value = 'create', source = model, max_tokens = 150,
	examples = ['compassion', 'love', 'kind', 'help others'])

# max_tokens=150 because creating larger lexicons will be harder to validate and will take longer to generate, but you can experiment. 

INFO: Adding 'Insight' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Insight']
INFO: Adding 'Mindfulness' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Mindfulness']
INFO: Adding 'Compassion' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Compassion']


1.1. View lexicons

In [15]:
print(my_lexicon.constructs['Insight']['tokens'])

print(my_lexicon.constructs['Mindfulness']['tokens'])

print(my_lexicon.constructs['Compassion']['tokens'])


['acumen', 'analysis', 'analytical thinking', 'astuteness', 'awareness', 'breakthrough', 'clarity', 'clarity of thought', 'clarity of vision', 'cognitive awareness', 'comprehension', 'critical thinking', 'deep dive', 'deep understanding', 'discern', 'discernment', 'discovery', 'enlightenment', 'epiphany', 'foresight', 'grasp', 'insight', 'insight-driven', 'insightfully', 'insightfulness', 'interpretation', 'intuition', 'intuition-based', 'keen observation', 'knowledge', 'lucidity', 'mental clarity', 'observation', 'perception', 'perceptive', 'perceptiveness', 'perspective', 'profound', 'realization', 'realized', 'recognition', 'reflection', 'revelation', 'sagacity', 'self-awareness', 'shrewdness', 'thoughtful', 'thoughtful analysis', 'uncover', 'understanding', 'unveil', 'unveil insights', 'unveilment', 'wisdom']
['acceptance', 'attention', 'authenticity', 'awareness', 'awareness of breath', 'balance', 'body scan', 'breathing', 'breathwork', 'calm', 'centering', 'clarity', 'compassion'

1.2. Now count whether tokens appear in document:


In [16]:

counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = False,
                                                                                      )

display(counts)

extracting... 


100%|██████████| 3/3 [00:00<00:00,  3.67it/s]


Unnamed: 0,document_id,document,Insight,Mindfulness,Compassion,word_count
0,0,"Every time I speak with my cousin Bob, I have ...",3,3,0,17
1,1,"He meditates a lot, but he's not super smart",0,1,0,8
2,2,He is too competitive,0,0,0,4
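
To make the counts above easier to interpret, here is a simplified illustration of what counting lexicon matches in a document means. It is a sketch under simple assumptions (case-insensitive, whole-word matching, no lemmatization) and is not construct-tracker's implementation. `count_lexicon_matches` and the toy token list are hypothetical:

```python
import re


def count_lexicon_matches(document, tokens):
    """Return {token: count} for tokens found as whole words or phrases in the document."""
    text = document.lower()
    matches = {}
    for token in tokens:
        # \b keeps 'insight' from matching inside a longer word such as 'insightful'
        pattern = r"\b" + re.escape(token.lower()) + r"\b"
        n_matches = len(re.findall(pattern, text))
        if n_matches:
            matches[token] = n_matches
    return matches


toy_tokens = ["insight", "clarity", "wisdom", "realized"]  # small made-up lexicon
toy_document = "Every time I speak with my cousin Bob, I have great moments of insight, clarity, and wisdom"
toy_matches = count_lexicon_matches(toy_document, toy_tokens)
print(toy_matches, "total:", sum(toy_matches.values()))
```

Keeping the matched tokens, not just the total, is what makes the score interpretable: you can see which words drove the count for each construct.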


You can save the results to files. If you're on Google Colab, you need to mount Google Drive first (see code above):
    
```python
counts.to_csv(f'{output_dir}insight_lexicon_counts.csv', index = False)
```


1.3. Interpret counts: visualize matches in context  


In [17]:
construct = 'Mindfulness'
print(f'Matches for {construct}:')
lexicon.highlight_matches(documents, construct, matches_construct2doc, max_matches = 2)
print()


construct = 'Insight'
print(f'Matches for {construct}:')
lexicon.highlight_matches(documents, construct,matches_construct2doc, max_matches = 2)
print()

Matches for Mindfulness:



Matches for Insight:



# B. Use all special features

Recommended: download and run this on your local computer using jupyter lab, vscode, or your favorite development environment. Saving to and from files is quicker if run locally on your computer than on Google Drive. 

### 1. Build lexicon with Generative AI 


#### Lexicon step 1.1: First provide api keys and general info on the lexicon

In [18]:
my_lexicon = lexicon.Lexicon()			# Initialize lexicon
my_lexicon.name = 'Insight'		# Set lexicon name
my_lexicon.description = 'Insight lexicon with constructs inspired by items of the Emotional Insight Scale'
my_lexicon.creator = 'DML' 				# your name or initials for transparency in logging who made changes
my_lexicon.version = '1.0'				# Set version. Over time, others may modify your lexicon, so good to keep track. MAJOR.MINOR. (e.g., MAJOR: new constructs or big changes to a construct, Minor: small changes to a construct)


In [19]:
# Fill out this information
domain = 'psychology' # Optional: bias definitions to a certain domain. this can be "mental health", "economics" or set to None (no quotation marks)
models_for_lexicon = [model] # Here you could add more: [model,command_nightly]
model_for_definition = model # You can ask model to create definition for each construct in the lexicon. Or set to None (no quotation marks)
temperatures = [0,0.5,1] # Temperature is another important parameter that defines how creative a model should be (0: a model outputs the most probable tokens given the prompt and training data; 1=more creative and less predictable responses, with a maximum of 2, which is not often used).
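
Each combination of model and temperature produces one generation per construct, and the resulting tokens are merged. Here is a rough sketch of that bookkeeping, using made-up token lists in place of real model output. The `fake_outputs` values are illustrative assumptions, not results:

```python
from itertools import product

models_demo = ["gpt-4o-mini"]
temperatures_demo = [0, 0.5, 1]

# Hypothetical outputs for each (model, temperature) run
fake_outputs = {
    ("gpt-4o-mini", 0): ["clarity", "insight", "wisdom"],
    ("gpt-4o-mini", 0.5): ["clarity", "epiphany", "insight"],
    ("gpt-4o-mini", 1): ["inner voice", "insight", "revelation"],
}

runs = list(product(models_demo, temperatures_demo))
print(f"{len(runs)} generations per construct")

# Merging is a set union: redundant tokens collapse, new ones are added
merged = set()
for run in runs:
    merged.update(fake_outputs[run])
merged_tokens = sorted(merged)
print(merged_tokens)
```

Adding models or temperatures grows the number of API calls multiplicatively, so the lists above are worth keeping short while you experiment.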

#### Lexicon step 1.2: Add examples and definitions manually or using a model 

**Examples**

Add a few prototypical tokens you don't want to miss. With these examples, help guide the model as to whether you want adjectives, nouns, phrases or all of the above. 


**Definitions**

We recommend adding definitions for expert validation (the step after creating the lexicon), so raters can decide whether to include a token based on a specific definition. You can also decide whether the Generative AI model sees the definition, and therefore whether it guides lexicon creation.

We'll show three cases: one where we provide a definition manually (Insight), one where the model generates it (Mindfulness), and one where no definition is included (Compassion).

In [20]:
# Fill in

construct_information = {
	'Insight': {
		'examples': ['clarity', 'enlightenment', 'wise'],
		'definition': "the clarity of understanding of one's thoughts, feelings and behavior",
		'reference': "Grant, A. M., Franklin, J., & Langford, P. (2002). The self-reflection and insight scale: A new measure of private self-consciousness. Social Behavior and Personality: an international journal, 30(8), 821-835."
			 },
	'Mindfulness': {
		'examples': ['mindful','meditation', 'awareness', 'nonjudgemental', 'present-focused'],
		'definition': model_for_definition, # it will later check if definition == model_for_definition, if it does, it will generate using that model
		'reference':model_for_definition
			 },	
	'Compassion': {
		'examples':['compassion', 'love', 'kind', 'help others'],
		'definition': None,
		'reference':None
			 },
}

## A definition will be added using our own prompt if model_for_definition is not None. Or you can directly prompt APIs here with your own prompt. This is our default one:
# prompt = f"Provide a brief definition of {construct} (in the {domain} domain). Provide reference in APA format where you got it from. Return result in the following format: {'{construct}': 'the_definition', 'reference':'the_reference'}"
# definition = api_request(prompt, model = model)

In [21]:
# Loop through constructs and generate lexicon and definition as instructed above in construct_information

for construct, information in construct_information.items():
	print('\n',construct, ' ------------------------')
	examples, definition, reference = information['examples'], information['definition'], information['reference']
	
	# DEFINITION: add definition or use model to get one
	# ===================================================
	construct_dict = construct_information.get(construct)
	if model_for_definition is not None and construct_dict['definition'] == model_for_definition:
		# Use model to create a definition
		definition, reference = my_lexicon.generate_definition(construct, model = model_for_definition, domain=domain)
		print(f'Generated definition: {definition}\n')
		print(f'Reference: {reference}\n')
		# Update with the generated definition, keeping the other keys (e.g., 'examples')
		construct_information[construct].update({'definition': definition, 'reference': reference})
	elif isinstance(construct_dict['definition'], str):
		# A definition was provided manually (is str): use it as is
		definition = construct_dict['definition']
		reference = construct_dict['reference']
	elif construct_dict['definition'] is None:
		definition = None
		reference = None

	# PROMPT: add definition and examples to prompt
	# ===================================================
	prompt = lexicon.generate_prompt(construct,
		prompt_name = construct,
		domain = domain,
		definition = definition,
		examples = examples)
	
	print('Prompt used:')
	print(prompt)
	print()

	'''
	## Or create your own prompt here:
	prompt = """
	 Create a list of words and phrases related to {construct} in the {domain} domain. Here are some examples: {examples}
	 """
	prompt = prompt.format(construct = construct, domain = domain, examples = examples)
	print(prompt)
	'''

	# Generate lexicon
	# ===================================================
	# Each model and temperature will create redundant and new tokens. They will merge with other tokens already generated. 
	for model in models_for_lexicon:
		for temperature in temperatures:
			my_lexicon.add(construct, section = 'tokens', value = 'create', prompt = prompt, source = model, temperature = temperature, max_tokens = 150,
				# so these are saved to metadata
				domain = domain, examples = examples, definition = definition,definition_references = reference)
	print('Models used to generate tokens:')
	for n in my_lexicon.constructs[construct]['tokens_metadata'].keys():
		print(n)
	print()
	print('Generated lexicon:', my_lexicon.constructs[construct]['tokens']) 
	print()




INFO: Adding 'Insight' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Insight']



 Insight  ------------------------
Prompt used:
Provide many single words and some short phrases (all in lowercase unless the word is generally in uppercase) related to Insight (in the psychology domain). Each token should be separated by a semicolon. Do not return duplicate tokens. Do not provide any explanation or additional text beyond the tokens.
Here is a definition of Insight: the clarity of understanding of one's thoughts, feelings and behavior
Here are some examples (include these in the list): clarity; enlightenment; wise

Models used to generate tokens:
gpt-4o-mini, temperature-0, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-04.346697
gpt-4o-mini, temperature-0.5, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-08.028427
gpt-4o-mini, temperature-1, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-11.554414

Generated lexicon: ['acceptance', 'acknowledgment', 'analysis', 'awaken', 'awareness', 'awareness of self', 'behavioral insight', 'breakthrough', 'clarity', 'clarity

INFO: Adding 'Mindfulness' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Mindfulness']


invalid syntax (<string>, line 1)
Error parsing string in content: {'id': 'gen-1737496274-zCocHH6WayoUJXm5LihH', 'provider': 'OpenAI', 'model': 'openai/gpt-4o-mini', 'object': 'chat.completion', 'created': 1737496274, 'choices': [{'logprobs': None, 'finish_reason': 'stop', 'native_finish_reason': 'stop', 'index': 0, 'message': {'role': 'assistant', 'content': '```json\n{"Mindfulness": "Mindfulness is the psychological process of bringing one\'s attention to the present moment, which can be cultivated through meditation and other practices, and is often associated with increased awareness and acceptance of one\'s thoughts, feelings, and bodily sensations.", "reference":"Kabat-Zinn, J. (1990). Full Catastrophe Living: Using the Wisdom of Your Body and Mind to Face Stress, Pain, and Illness. Delacorte."}\n```', 'refusal': ''}}], 'system_fingerprint': 'fp_72ed7ab54c', 'usage': {'prompt_tokens': 56, 'completion_tokens': 96, 'total_tokens': 152}}. Will try by splitting
New definition: 
"Mind

INFO: Adding 'Compassion' to the lexicon (it was not previously there). If you meant to add to an existing construct (and have a typo in the construct name, which is case sensitive), run: del your_lexicon_name.constructs['Compassion']


Models used to generate tokens:
gpt-4o-mini, temperature-0, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-16.047825
gpt-4o-mini, temperature-0.5, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-18.718109
gpt-4o-mini, temperature-1, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-21.401681

Generated lexicon: ['acceptance', 'acceptance and commitment', 'acceptance and commitment therapy', 'acceptance of feelings', 'acceptance of self', 'acceptance practice', 'alertness', 'attention', 'awareness', 'awareness of body', 'awareness of breath', 'awareness of feelings', 'awareness of sensations', 'awareness of thoughts', 'awareness training', 'body scan', 'breath', 'breathing', 'clarity', 'cognitive flexibility', 'compassion', 'conscious awareness', 'cultivating presence', 'deep listening', 'detachment', 'disengagement', 'emotional regulation', 'emotions', 'equanimity', 'feelings', 'flow state', 'flowing', 'focus', 'gratitude', 'grounding', 'harmony', 'inner peace', 'intention', 'intenti
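
The parsing error logged above for Mindfulness appears to occur because the model wrapped its JSON answer in a Markdown code fence, so the content cannot be parsed directly. Here is a minimal sketch of one way to handle such responses, assuming the payload inside the fence is valid JSON. `parse_definition_response` is a hypothetical helper and not construct-tracker's own parser:

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example can itself sit inside a fenced block


def parse_definition_response(content):
    """Strip an optional Markdown code fence around the content, then parse it as JSON."""
    text = content.strip()
    pattern = "^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + "$"
    match = re.match(pattern, text, flags=re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)


# Made-up response in the same shape as the one in the log above
raw_response = FENCE + 'json\n{"Mindfulness": "a short definition", "reference": "a reference"}\n' + FENCE
parsed_response = parse_definition_response(raw_response)
print(parsed_response["Mindfulness"], "|", parsed_response["reference"])
```

Responses without a fence pass through unchanged, so the same function covers both cases.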


#### Lexicon step 1.3: Important: review resulting lexicon. Add or remove tokens manually.

In [22]:
for construct in construct_information.keys():
	print(f"{construct}:", my_lexicon.constructs[construct]['tokens'])


Insight: ['acceptance', 'acknowledgment', 'analysis', 'awaken', 'awareness', 'awareness of self', 'behavioral insight', 'breakthrough', 'clarity', 'clarity of feeling', 'clarity of mind', 'clarity of perception', 'clarity of thought', 'cognition', 'cognitive', 'cognitive clarity', 'cognitive insight', 'coherence', 'comprehension', 'conscious thought', 'contemplation', 'critical thinking', 'deep understanding', 'delve', 'depth', 'depth of understanding', 'discernment', 'emotional awareness', 'emotional clarity', 'emotional intelligence', 'enlightenment', 'epiphany', 'focus', 'grasp', 'growth', 'holistic', 'inner vision', 'inner voice', 'inner wisdom', 'insightfulness', 'interpret', 'introspection', 'introspective journey', 'intuition', 'knowledge', 'learning', 'life lessons', 'lucidity', 'mental clarity', 'mental insight', 'mindfulness', 'observation', 'perception', 'perceptual insight', 'personal growth', 'personal insight', 'perspective', 'profound realization', 'profound understandin

Add or remove tokens that definitely should or shouldn't be in the list. BE CAREFUL WITH TYPOS OR MISSPELLINGS. 

In `source` you can add any additional description. We recommend also including your initials, which are available in `my_lexicon.creator`. 

In [23]:
print(my_lexicon.creator)
my_lexicon.add('Mindfulness', section ='tokens',value = ['meditate', 'pay attention'], source=my_lexicon.creator + ": added a few verb forms") # Up to you as an expert
my_lexicon.remove('Mindfulness', remove_tokens = ['being', 'flow'], source =my_lexicon.creator + ": might capture non-compassion-related context too often") # Up to you as an expert

DML


Now those appear as entries in the metadata. All additions are merged into the token list, and all removed tokens are taken out of it.

In [24]:
for n in my_lexicon.constructs['Mindfulness']['tokens_metadata'].keys():
	print('action: ', 
	my_lexicon.constructs['Mindfulness']['tokens_metadata'][n]['action'],
	'; from source: ',
	n)
	

action:  create ; from source:  gpt-4o-mini, temperature-0, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-16.047825
action:  create ; from source:  gpt-4o-mini, temperature-0.5, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-18.718109
action:  create ; from source:  gpt-4o-mini, temperature-1, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-21.401681
action:  manually added ; from source:  DML: added a few verb forms 25-01-21T21-51-37.825581
action:  remove ; from source:  DML: might capture non-compassion-related context too often 25-01-21T21-51-37.826021
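
One way to think about the metadata is as a log that can be replayed to obtain the final token list. Here is a minimal sketch, assuming entries are applied in chronological order and that each entry has the `action` and `tokens` fields seen in the metadata. `replay_token_history` and the toy history are hypothetical, and this is not construct-tracker's internal logic:

```python
def replay_token_history(history):
    """Apply create/add/remove entries in order and return the final sorted tokens."""
    tokens = set()
    for entry in history:
        if entry["action"] == "remove":
            tokens -= set(entry["tokens"])
        else:
            # 'create' and 'manually added' both merge tokens into the set
            tokens |= set(entry["tokens"])
    return sorted(tokens)


# Toy history (made up) mirroring the kinds of actions listed above
toy_history = [
    {"action": "create", "tokens": ["awareness", "being", "breathing"]},
    {"action": "create", "tokens": ["awareness", "flow", "grounding"]},
    {"action": "manually added", "tokens": ["meditate", "pay attention"]},
    {"action": "remove", "tokens": ["being", "flow"]},
]
final_tokens = replay_token_history(toy_history)
print(final_tokens)
```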


Or export to CSV, rate the tokens in a spreadsheet, and import the result back into Python.

See section "Validate with experts or just crowdsourcing" below on how to code and reload spreadsheet back into python

#### Lexicon step 1.4: Confirm additions and removals were successful. 

`['tokens']` contains the final tokens. 



In [25]:
for construct in construct_information.keys():
	print(f"{construct}:",my_lexicon.constructs[construct]['tokens'])



Insight: ['acceptance', 'acknowledgment', 'analysis', 'awaken', 'awareness', 'awareness of self', 'behavioral insight', 'breakthrough', 'clarity', 'clarity of feeling', 'clarity of mind', 'clarity of perception', 'clarity of thought', 'cognition', 'cognitive', 'cognitive clarity', 'cognitive insight', 'coherence', 'comprehension', 'conscious thought', 'contemplation', 'critical thinking', 'deep understanding', 'delve', 'depth', 'depth of understanding', 'discernment', 'emotional awareness', 'emotional clarity', 'emotional intelligence', 'enlightenment', 'epiphany', 'focus', 'grasp', 'growth', 'holistic', 'inner vision', 'inner voice', 'inner wisdom', 'insightfulness', 'interpret', 'introspection', 'introspective journey', 'intuition', 'knowledge', 'learning', 'life lessons', 'lucidity', 'mental clarity', 'mental insight', 'mindfulness', 'observation', 'perception', 'perceptual insight', 'personal growth', 'personal insight', 'perspective', 'profound realization', 'profound understandin

#### Lexicon step 1.5: Save preliminary lexicon

In [26]:
clean_lexicon_name = my_lexicon.name.replace(" ", "-").lower()+'_v'+my_lexicon.version.replace('.', '-').lower()

# The same relative layout is used on Google Drive and locally (lexicon_dir was set accordingly above)
preprocessing_dir = f'{lexicon_dir}/{clean_lexicon_name}/preprocessing'

my_lexicon.save(preprocessing_dir) # may take a few minutes to appear on Google Drive

INFO: Saved lexicon to ./ct_lexicons//insight_v1-0/preprocessing/insight_25-01-21T21-51-44


For a given construct, you can review the metadata, which records the history of actions (creating with Gen AI, adding, removing). Each entry comes with a timestamp and the specific prompt used (including examples and definition, if any).

In [27]:

import json
for construct in construct_information.keys():
	print("==="*30)
	print(construct)
	print()
	print(json.dumps(my_lexicon.constructs[construct]['tokens_metadata'], indent=4))



Insight

{
    "gpt-4o-mini, temperature-0, top_p-1, max_tokens-150, seed-42, 25-01-21T21-51-04.346697": {
        "action": "create",
        "tokens": [
            "acceptance",
            "analysis",
            "awareness",
            "awareness of self",
            "breakthrough",
            "clarity",
            "clarity of feeling",
            "clarity of mind",
            "clarity of perception",
            "clarity of thought",
            "cognition",
            "cognitive insight",
            "comprehension",
            "contemplation",
            "critical thinking",
            "depth",
            "discernment",
            "emotional awareness",
            "emotional intelligence",
            "enlightenment",
            "epiphany",
            "focus",
            "growth",
            "holistic",
            "inner vision",
            "insightfulness",
            "introspection",
            "intuition",
            "knowledge",
            "learning",

This metadata is also available in the `_metadata.json` file when saving the lexicon.



#### Lexicon step 1.6: Extract counts

Calibration step: this can be done on examples you come up with or on a training set to make sure the lexicon is doing a good job.

In [29]:
# First lemmatize lexicon tokens

my_lexicon = lexicon.lemmatize_tokens(my_lexicon) # if you skip this, lemmatization will be done automatically, but it won't be saved in my_lexicon.

100%|██████████| 3/3 [00:00<00:00,  3.42it/s]


In [30]:
# Now count whether tokens appear in document:

counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = False,
                                                                                      )
display(counts)

extracting... 


100%|██████████| 3/3 [00:00<00:00, 2707.75it/s]


Unnamed: 0,document_id,document,Insight,Mindfulness,Compassion,word_count
0,0,"Every time I speak with my cousin Bob, I have ...",2,1,0,17
1,1,"He meditates a lot, but he's not super smart",0,1,0,8
2,2,He is too competitive,0,0,0,4


In [31]:
# Interpret counts: visualize matches in context  

for construct in construct_information.keys():
	print(f'------- Matches for {construct}:')
	lexicon.highlight_matches(documents, construct, matches_construct2doc, max_matches = 2)
	print()


------- Matches for Insight:

------- Matches for Mindfulness:



------- Matches for Compassion:



#### Lexicon step 1.7: Validate with experts or just crowdsourcing (OPTIONAL)

1.7.1. Create `human_ratings/` folder

In [32]:
perform_human_ratings = False

if perform_human_ratings:
	# Lowercase the name so the folder matches the one created when saving the preliminary lexicon
	clean_lexicon_name = my_lexicon.name.replace(" ", "-").lower()+'_v'+my_lexicon.version.replace('.', '-').lower()

	# The same relative layout is used on Google Drive and locally (lexicon_dir was set accordingly above)
	ratings_dir = f'{lexicon_dir}/{clean_lexicon_name}/human_ratings/'

	os.makedirs(ratings_dir, exist_ok=True)
	print(f'saving to {ratings_dir}')

1.7.2. Take the `<date>_ratings.csv` in the `preprocessing/` directory and send a copy to each human rater (aka coder, annotator) with their ID appended to the filename. Here we'll automatically find the last one created.

In [33]:
if perform_human_ratings:

	import shutil

	# Create one copy of the ratings file per rater, identified by the rater's initials
	# or by zero-padded numeric IDs:
	n_raters = 3
	rater_ids = [str(i).zfill(3) for i in range(1, n_raters + 1)] # ['001', '002', '003']
	# Keep a separate record of which name goes with each ID in case you need to follow up on a rating.


	copy_annotation_file_to_ratings_dir = False

	if copy_annotation_file_to_ratings_dir:
		# Find the most recent ratings file (timestamped filenames of the same lexicon sort chronologically)
		ratings_filenames = sorted(n for n in os.listdir(preprocessing_dir) if n.endswith('ratings.csv'))
		unvalidated_filename = ratings_filenames[-1]
		path_to_unvalidated_lexicon = os.path.join(preprocessing_dir, unvalidated_filename)

		for rater_id in rater_ids:
			new_path = os.path.join(ratings_dir, unvalidated_filename.replace('.csv', f'_{rater_id}.csv'))
			# copy from preprocessing to human_ratings
			shutil.copy(path_to_unvalidated_lexicon, new_path)


	# You should now have these files in human_ratings/: Insight_24-08-08T21-29-52_ratings_001.csv, ..._002.csv, ..._003.csv.

	


1.7.3. Have the raters rate the tokens following these or similar instructions. Save the files as CSV rather than Excel so that the code below works.

[PDF version](https://github.com/danielmlow/construct-tracker/blob/daniels_branch/tutorials/lexicon%20risk%20factors%20final%20instructions.pdf)

[Google docs version](https://docs.google.com/document/d/1pu89KmU31grhzFeZmwOoN0U2plSWaDMSKWYUXy_acL4/edit?usp=sharing)

In this toy example, I simply duplicated the files, so all raters have the same ratings. One of the raters added some suggested tokens. Suggested tokens can either be treated as rated 3/3, or the rater can add them at the bottom of the file together with their own rating. The suggestions can then be sent to the other raters in a second round before moving on to the next step.



1.7.4. Take average ratings, discard tokens below a certain average

In [34]:
if perform_human_ratings:
    # average_ratings

    rating_files = os.listdir(ratings_dir)
    rating_files = [n for n in rating_files if n not in ['.DS_Store']]
    print(rating_files) # this should not be empty, you should have some human_rating files for each rater

In [35]:
if perform_human_ratings:
	from construct_tracker.lexicon import merge_rating_dfs

	# Copy and paste correct files here to make sure you don't add the same ratings with two different extensions:
	rating_files = [
		'Insight_24-08-08T21-29-52_ratings_001.csv', 'Insight_24-08-08T21-29-52_ratings_002.csv', 'Insight_24-08-08T21-29-52_ratings_003.csv', 
		]

	# Ratings by all raters in a single dictionary
	# all_ratings_per_construct_dict = {
		# construct_1: {token_1: [rating_1, rating_2, ...], token_2: [rating_1, rating_2, ...], ...},
		# construct_2: {token_1: [rating_1, rating_2, ...], token_2: [rating_1, rating_2, ...], ...},
		# ...} 

	all_ratings_per_construct_dict = merge_rating_dfs(ratings_dir, rating_files, construct_information.keys())

	# Save ratings to lexicon
	todays_date = datetime.datetime.now(datetime.timezone.utc).strftime("%y-%m-%d") # utcnow() is deprecated since Python 3.12
	my_lexicon.set_attribute('ratings_'+todays_date, all_ratings_per_construct_dict) 


In [36]:

if perform_human_ratings:
	# Average and remove tokens < a threshold
	ratings_avg, ratings_removed = lexicon.avg_above_thresh(all_ratings_per_construct_dict, thresh = 1.3)

	# Remove tokens with low avg. ratings
	for construct in ratings_removed:
		remove_tokens = ratings_removed[construct]
		my_lexicon.remove(construct, remove_tokens=remove_tokens, 
						source = my_lexicon.creator + ": tokens rated lower than 1.3 by raters")
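
The averaging and thresholding idea can be sketched in plain Python on a dictionary with the same nested shape as `all_ratings_per_construct_dict`. This is only an illustration, not the implementation of `lexicon.avg_above_thresh`: the function name `split_by_avg_rating`, the example ratings, and the choice of `>=` for the comparison are assumptions.

```python
from statistics import mean

# Hypothetical ratings: construct -> token -> one rating per rater (assumed 1 to 3 scale)
toy_ratings = {
    "Insight": {
        "realize": [3, 3, 2],
        "epiphany": [3, 3, 3],
        "think": [1, 1, 1],
    },
}

def split_by_avg_rating(ratings, thresh):
    """Return (kept, removed): average ratings of kept tokens and a list of removed tokens."""
    kept, removed = {}, {}
    for construct, token_ratings in ratings.items():
        kept[construct] = {}
        removed[construct] = []
        for token, values in token_ratings.items():
            avg = mean(values)
            if avg >= thresh:  # assumption: tokens exactly at the threshold are kept
                kept[construct][token] = avg
            else:
                removed[construct].append(token)
    return kept, removed

toy_kept, toy_removed = split_by_avg_rating(toy_ratings, thresh=1.3)
print(toy_kept)
print(toy_removed)
```

Raising `thresh` to 3 on the same data would keep only "epiphany", which mirrors the stricter prototype filtering used later in this section.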




In [37]:
my_lexicon.name

'Insight'

In [38]:
clean_lexicon_name

'insight_v1-0'

In [39]:
if perform_human_ratings:

	# Average and only keep tokens rated 3/3 on prototypicality. These are definitely related to the construct and can be used for CTS or to avoid false positives.
	ratings_avg_prototypical, ratings_removed_prototypical = lexicon.avg_above_thresh(all_ratings_per_construct_dict, thresh = 3) # set to 3, or slightly above 2 (e.g., 2.1) to be less strict

	my_lexicon_prototypes = copy.deepcopy(my_lexicon)
	my_lexicon_prototypes.name = my_lexicon.name + ' prototypes'

	# Remove tokens with low avg. ratings (use the prototypical removal list, not the 1.3-threshold one)
	for construct in ratings_removed_prototypical:
		remove_tokens = ratings_removed_prototypical[construct]
		my_lexicon_prototypes.remove(construct, remove_tokens=remove_tokens, 
									source = my_lexicon_prototypes.creator + ": tokens rated lower than 3 by raters")



	# Save final prototypes lexicon
	
	my_lexicon_prototypes.save(f'{lexicon_dir}/{clean_lexicon_name}/', filename = clean_lexicon_name+'_validated_prototypes-3-3') # save prototypes lexicon (tokens rated 3/3)



In [40]:
if perform_human_ratings:
	# List constructs with fewer than 10 tokens (which might be too few for a lexicon but sufficient for the CTS method). [] means none.
	print([(k,v) for k,v in ratings_avg_prototypical.items() if len(v)<10])

1.7.5. Inter-rater reliability (OPTIONAL)

See the manuscript for more information.


In [41]:
# In this toy example, all raters rated the tokens the same, so IRR = 1 out of 1. 

if perform_human_ratings:
	from construct_tracker.utils import irr
	import numpy as np


	# df = all_ratings_per_construct_dict
	# or from: all_ratings_per_construct_dict = my_lexicon.get_attribute('ratings_24-08-08')

	# TODO: turn into function
	cohens_kappa_all = {}
	fleiss_kappa_all = {}
	for construct, tokens in all_ratings_per_construct_dict.items():
		construct_c_ratings = list(tokens.values())
		# Typical number of ratings per token (the median, despite the variable name)
		construct_c_ratings_mode = int(np.median([len(n) for n in construct_c_ratings]))
		construct_c_ratings_all_annotated = []
		for token_i_ratings in construct_c_ratings:
			token_i_ratings = list(token_i_ratings)

			# Keep only tokens rated by the typical number of raters and with an average rating above 1.3
			if len(token_i_ratings)==construct_c_ratings_mode and np.mean(token_i_ratings)>1.3:
				construct_c_ratings_all_annotated.append(token_i_ratings)

		# Kappa can only be calculated if every token has the same number of ratings
		construct_c_ratings = np.array(construct_c_ratings_all_annotated)
		if construct_c_ratings.ndim != 2:
			print(f'No tokens with complete ratings for {construct}, skipping')
			continue
		construct_c_ratings = construct_c_ratings.astype(int)
		
		if construct_c_ratings.shape[1] == 2:
			# kappa = binary_inter_rater_reliability(construct_c_ratings[:,0], construct_c_ratings[:,1])
			kappa = irr.cohens_kappa(construct_c_ratings)
			cohens_kappa_all[construct.replace('_include','')] = kappa
			print(f"Cohen's Weighted Kappa (2 raters) for {construct}: {kappa}")
		elif construct_c_ratings.shape[1] >= 3:
			kappa = irr.calculate_fleiss_kappa(construct_c_ratings)
			fleiss_kappa_all[construct.replace('_include','')] = kappa
			print(f"Fleiss' Kappa (3 or more raters) for {construct}: {kappa}")


	weighted_kappa = pd.DataFrame(cohens_kappa_all, index = ['weighted_kappa']).T.mean()
	fleiss_kappa = pd.DataFrame(fleiss_kappa_all, index = ['fleiss_kappa']).T


	display(weighted_kappa)
	display(fleiss_kappa)
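
For readers who want to see what Fleiss' kappa measures, here is a minimal pure-Python sketch of the standard formula. It is not the code behind `irr.calculate_fleiss_kappa`: the function name `fleiss_kappa_sketch`, the example ratings, and the 1 to 3 rating scale are assumptions for illustration.

```python
def fleiss_kappa_sketch(ratings, categories):
    """Fleiss' kappa for a list of rows, one row of rater scores per token.

    Every row must contain the same number of ratings.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: how many raters gave token i the rating categories[j]
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Proportion of all ratings that fall into each category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    # Observed agreement per token, then averaged over tokens
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        return float("nan")  # all ratings in a single category: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical ratings from three raters (rows are tokens)
identical = [[3, 3, 3], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
mixed = [[3, 3, 3], [1, 1, 2], [2, 2, 2], [3, 2, 3]]

kappa_identical = fleiss_kappa_sketch(identical, categories=[1, 2, 3])
kappa_mixed = fleiss_kappa_sketch(mixed, categories=[1, 2, 3])
print(kappa_identical, round(kappa_mixed, 3))
```

When all raters give identical ratings, as in the toy example of this tutorial, kappa is 1; any disagreement pulls it below 1.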



#### Lexicon step 1.8: Save final lexicon

In [42]:
clean_lexicon_name

'insight_v1-0'

In [43]:
# Main lexicon in the main folder. 
my_lexicon.save(f'{lexicon_dir}/{clean_lexicon_name}/', filename = f'{clean_lexicon_name}_validated') # save lexicon

INFO: Saved lexicon to ./ct_lexicons//insight_v1-0//insight_v1-0_validated_25-01-21T21-52-07


#### Lexicon step 1.9: Final feature extraction

If you want to load the lexicon in a different script in the future:
```python
path = f'./ct_lexicons/{clean_lexicon_name}/'
my_lexicon = load_lexicon(path = path)
```

In [44]:
# Now count whether tokens appear in document:

# First with normalize = False, which returns the raw number of matches per document.
counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = False,
                                                                                      )
display(counts)
# Might be worth calibrating (removing words that create false positives too often) on a training set. 

extracting... 


100%|██████████| 3/3 [00:00<00:00, 1799.62it/s]


| | document_id | document | Insight | Mindfulness | Compassion | word_count |
|---|---|---|---|---|---|---|
| 0 | 0 | Every time I speak with my cousin Bob, I have ... | 2 | 1 | 0 | 17 |
| 1 | 1 | He meditates a lot, but he's not super smart | 0 | 1 | 0 | 8 |
| 2 | 2 | He is too competitive | 0 | 0 | 0 | 4 |


In [45]:
# Now repeat the extraction with normalized counts:

normalize = True
# normalize = True divides the extracted counts by each document's word count, so 3 matches in a short document are weighted higher than 3 matches in a long document.
counts, matches_by_construct, matches_doc2construct, matches_construct2doc  = my_lexicon.extract(documents,
                                                                                      normalize = normalize,
                                                                                      )
display(counts)
# Might be worth calibrating (removing words that create false positives too often) on a training set. 

extracting... 


100%|██████████| 3/3 [00:00<00:00, 1506.76it/s]


| | document_id | document | Insight | Mindfulness | Compassion | word_count |
|---|---|---|---|---|---|---|
| 0 | 0 | Every time I speak with my cousin Bob, I have ... | 0.117647 | 0.058824 | 0.0 | 17 |
| 1 | 1 | He meditates a lot, but he's not super smart | 0.0 | 0.125 | 0.0 | 8 |
| 2 | 2 | He is too competitive | 0.0 | 0.0 | 0.0 | 4 |
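
The normalized values are the raw counts divided by `word_count`. The sketch below reproduces that arithmetic from the raw counts shown earlier in this section; `normalize_row` is a hypothetical helper written for illustration, not part of construct-tracker.

```python
# Raw counts and word counts copied from the normalize = False output
raw_rows = [
    {"Insight": 2, "Mindfulness": 1, "Compassion": 0, "word_count": 17},
    {"Insight": 0, "Mindfulness": 1, "Compassion": 0, "word_count": 8},
    {"Insight": 0, "Mindfulness": 0, "Compassion": 0, "word_count": 4},
]
constructs = ["Insight", "Mindfulness", "Compassion"]

def normalize_row(row, construct_names):
    """Divide each construct count by the document's word count."""
    word_count = row["word_count"]
    normalized = dict(row)
    for name in construct_names:
        # Guard against empty documents to avoid division by zero
        normalized[name] = row[name] / word_count if word_count else 0.0
    return normalized

normalized_rows = [normalize_row(row, constructs) for row in raw_rows]
for row in normalized_rows:
    print({name: round(row[name], 6) for name in constructs})
```

For the first document, 2 matches in 17 words gives 2/17 ≈ 0.117647 for Insight and 1/17 ≈ 0.058824 for Mindfulness, matching the table above.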


In [46]:
# save counts
output_dir = './data/insight_project/'
os.makedirs(output_dir, exist_ok=True)
counts.to_csv(f'{output_dir}insight_lexicon_counts_normalize-{normalize}.csv', index = False)
print('saved to', f'{output_dir}insight_lexicon_counts_normalize-{normalize}.csv')

saved to ./data/insight_project/insight_lexicon_counts_normalize-True.csv
