### Welcome! This notebook describes a method of generating inferential decompositions from text as introduced in our EMNLP 2023 paper: [Natural Language Decompositions of Implicit Content Enable Better Text Representations](https://arxiv.org/pdf/2305.14583.pdf)!

We will guide you through the process step-by-step, and provide explanations and code snippets along the way. The method can be broken down into the following steps:

1. **Sample a small number of items from your dataset**: Here, we use a dataset of tweets posted by legislators during the 115th, 116th, and 117th US Congresses.
2. **Craft Implicit and Explicit Propositions**: Refer to Appendix 2 of our paper for the instructions used to craft the exemplar propositions. We will use the same instructions to craft implicit and explicit propositions for our dataset.
3. **Prompt an LLM with the crafted exemplars**: Here, we will use GPT3.5 Turbo for our experiments.
4. **Validate**: Confirm that a random sample of the generated decompositions is _plausible_.
5. **Downstream Usage**: Use the decompositions in the target task. 

#### Getting Started

To begin, run the first cell below to import the necessary packages and set up the environment. The helper functions and accompanying code are in `eval_mteb.py` and `generation_utils.py`. 

##### Note: 
We assume that the `OPENAI_API_KEY` environment variable is set. The key can also be set manually in the config via `config["llm"]["openai_api_key"]`.
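One way to handle both options, sketched below, is to prefer a key set in the config and fall back to the environment variable. The helper name `resolve_openai_api_key` is hypothetical and not part of `eval_mteb.py`; only the `config["llm"]["openai_api_key"]` field comes from this notebook.

```python
import os


def resolve_openai_api_key(config, env_var="OPENAI_API_KEY"):
    """Return the API key from the config if set, else from the environment.

    Hypothetical helper for illustration; not part of eval_mteb.py.
    """
    key = (config.get("llm") or {}).get("openai_api_key") or os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found: set {env_var} or config['llm']['openai_api_key']."
        )
    return key
```

Failing early with a clear message avoids discovering a missing key only after the first generation request.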

In [3]:
import json
import os
import random
from pathlib import Path

import pandas as pd
from tqdm import tqdm
from transformers import GenerationConfig

# raises KeyError if the environment variable is not set
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']


#### Choosing the data 

For the purpose of this tutorial, we choose a dataset of congressional tweets sampled from the 115th, 116th and 117th Congress. The data can be found in `data/sampled_tweets_senate_115-117.jsonl`.

### Step 1: Sample a small number of items from your dataset

We sampled the following tweets from the dataset to create our exemplar propositions.

```
1. "The #HonestAds Act will strengthen protections against foreign interference in our election. No more election ads paid for in rubles."
2. "Our nation is hurting.\\n\\nGeorge Floyd's death was horrific and justice must be served. A single act of violence at the hands of an officer is one too many. \\n\\nGeorge Floyd deserved better. All black Americans do. Indeed, all Americans do."
3. "Happy Wyoming Day! Today, our great Equality State celebrates 151 years of being the first to officially recognize women's inherent right to vote and to hold office. "
4. "\"More apologies from Mark Zuckerberg won't fix Facebook. We need accountability and action \\u2013 not vague commitments to do better while continuing to profit off of users' personal data. "
5. "Finding a permanent solution to #ProtectDreamers is as urgent a task as ever. President Trump created this crisis, and he should stop tanking bipartisan congressional efforts to solve it. We owe it to these kids to keep them in the only country they've ever known as home."
6. "Qualified immunity reform should have as its focus professionalizing police departments, institutionalizing best police practices when it comes to use of force, and protecting constitutional rights of American citizens."
7. "Survivors of the coronavirus show symptoms of ME/CFS, a debilitating and chronic illness that already impacts 2.5 million Americans. I am fighting to secure the funding needed to treat this disease and give patients the care they need."
```

### Step 2: Craft Implicit and Explicit Propositions

We then craft both explicit and implicit propositions for each of the tweets. These can be found in `exemplars/leg_tweets_exemplars.jsonl`.

### DIY:  Create your own exemplars! 

#### Step 1: Sample random tweets from the dataset

In [5]:
from eval_mteb import read_jsonl

TWEETS_FILEPATH = Path('data/sampled_tweets_senate_115-117.jsonl')
tweets = read_jsonl(TWEETS_FILEPATH)

random.seed(42)
exemplar_candidates = random.sample(tweets, 10)
exemplar_tweets = [x['tweet'] for x in exemplar_candidates]
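`read_jsonl` comes from `eval_mteb.py`. If you want to sample exemplar candidates without that module, a minimal stand-in could look like the sketch below. It assumes one JSON object per line with a `tweet` field, as in the cell above; the names `read_jsonl_simple` and `sample_tweet_texts` are hypothetical.

```python
import json
import random
from pathlib import Path


def read_jsonl_simple(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def sample_tweet_texts(records, k, seed=42):
    """Draw up to k records without replacement using a local, seeded RNG."""
    rng = random.Random(seed)
    picked = rng.sample(records, min(k, len(records)))
    return [record["tweet"] for record in picked]
```

Using a local `random.Random(seed)` instead of `random.seed` keeps the draw reproducible without changing the global random state used elsewhere in the notebook.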

#### Step 2: Create your own exemplars

Once you have written the exemplars for a tweet, press the "Submit" button.

In [6]:
from widget_utils import create_textboxes, show_document

In [11]:
tweet_decomp_exemplars = []

for index, tweet in enumerate(exemplar_tweets[:2]): # remove [:2] to include all tweets
    fancy_text = show_document(index, tweet)

    # Display the fancy text
    display(fancy_text)

    decomps = create_textboxes()
    tweet_decomp_exemplars.append([tweet, decomps])

        

Output: for each document (Document 0 and Document 1), the cell renders the tweet text followed by an "Add Decomposition" button, a "Submit" button, and five text boxes (Box 1 to Box 5) pre-filled with placeholder text.

#### Step 3: Save them in the right format 

In [13]:
with open("exemplars/custom_collected_exemplars.jsonl", "w") as f: 
    for elem in tweet_decomp_exemplars: 
        s = json.dumps([elem[0], elem[1][0]])
        f.write(f"{s}\n")
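Before pointing a config at the new file, it can help to read it back and check its shape. The sketch below assumes each line is a two-element JSON array whose first element is the tweet text, which is what the cell above writes. The checker name `check_exemplar_file` is hypothetical, and the exact structure expected by `GenerationEmbedder` should be confirmed against `exemplars/leg_tweets_exemplars.jsonl`.

```python
import json
from pathlib import Path


def check_exemplar_file(path):
    """Return the parsed rows, raising ValueError on a malformed line."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            # Assumed shape: [tweet_text, decompositions]
            if not (isinstance(row, list) and len(row) == 2 and isinstance(row[0], str)):
                raise ValueError(f"line {line_no}: expected [tweet, decompositions]")
            rows.append(row)
    return rows
```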

### Step 3: Prompt an LLM with the crafted exemplars

We use the `GenerationEmbedder` class from `eval_mteb.py` along with the hyperparameters specified in `configs/leg-tweet-gen-gpt3.5-propositions-all.yaml` to prompt GPT3.5 Turbo with the exemplars. The generated decompositions can be found in `data/gpt3.5_tweets_to_gen_all.jsonl`. 

In [2]:
from eval_mteb import GenerationEmbedder, read_jsonl, load_config, write_jsonl

TWEETS_FILEPATH = Path('data/sampled_tweets_senate_115-117.jsonl')
tweets = read_jsonl(TWEETS_FILEPATH)

# load the config file and the exemplars 
config = load_config('configs/leg-tweet-gen-gpt3.5-propositions-all.yaml')
exemplars = read_jsonl(config["data"]['exemplars_path'])

# initialize the generation object with hyperparameters loaded from the config file
model = GenerationEmbedder(
    instructions=config["data"]["instructions"],
    openai_api_key=config["llm"]["openai_api_key"],
    exemplar_pool=exemplars,
    exemplar_format=config["exemplars"]["format"],
    exemplar_sep=config["exemplars"]["separator"],
    multi_output_sep=config["exemplars"]["multi_output_separator"],
    exemplars_per_prompt=config["exemplars"]["exemplars_per_prompt"],
    draws_per_pool=config["exemplars"]["draws_per_pool"],
    repeat_draws=config["exemplars"]["repeat_draws"],
    shuffles_per_draw=config["exemplars"]["shuffles_per_draw"],
    output_combination_strategy=config["embeddings"]["output_combination_strategy"],
    include_original_doc=config["embeddings"]["include_original_doc"],
    embedding_model_name=config["embeddings"]["embedding_model_name"],
    gen_model_name=config["llm"]["gen_model_name"],
    generations_per_prompt=config["llm"]["generations_per_prompt"],
    temperature=config["llm"]["temperature"],
    top_p=config["llm"]["top_p"],
    generation_kwargs=config["llm"]["generation_kwargs"],
    max_tokens=config["llm"]["max_tokens"],
    cache_db_path=config["main"]["cache_db_path"],
    dry_run=config["main"]["dry_run"],
    device=config["embeddings"]["device"],
    seed=config["main"]["seed"],
)
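To make the exemplar-related hyperparameters more concrete, here is a rough sketch of how a few-shot prompt could be assembled from instructions, exemplars, and separators. This is not the `GenerationEmbedder` implementation: the function `build_prompt`, the `{input}`/`{output}` template placeholders, the default separators, and the sample instruction and propositions are all assumptions made up for illustration. The actual values live in the config file.

```python
def build_prompt(instructions, exemplars, doc,
                 exemplar_format="Tweet: {input}\nPropositions:\n{output}",
                 exemplar_sep="\n\n", multi_output_sep="\n"):
    """Assemble instructions, formatted exemplars, and the new document."""
    blocks = []
    for text, outputs in exemplars:
        joined = multi_output_sep.join(outputs)
        blocks.append(exemplar_format.format(input=text, output=joined))
    # The final block leaves the output empty for the model to complete.
    blocks.append(exemplar_format.format(input=doc, output="").rstrip())
    return instructions + exemplar_sep + exemplar_sep.join(blocks)


# Made-up instruction and exemplar propositions, for illustration only.
prompt = build_prompt(
    "List the explicit and implicit propositions in each tweet.",
    [("No more election ads paid for in rubles.",
      ["Foreign money has paid for election ads",
       "Foreign-funded election ads should be stopped"])],
    "Happy Wyoming Day!",
)
print(prompt)
```

Parameters such as `exemplars_per_prompt`, `draws_per_pool`, and `shuffles_per_draw` would then control which exemplars are passed in and in what order.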



For the purpose of the tutorial, we generate decompositions for the first 10 tweets only. Decompositions for the whole dataset can be found in `data/gpt3.5_tweets_to_gen_all.jsonl`.

In [21]:
# generate propositions from tweets 
# simple batching code that deals with breaks in connections

# use a small sample of tweets to test the generations
OUTPUT_PATH = Path("test.jsonl")

tweet_texts = [tweet['tweet'] for tweet in tweets][:10] # remove [:10] to run on all tweets

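# resume from any previously saved output; start fresh if the file does not exist yet
propositions = read_jsonl(OUTPUT_PATH) if OUTPUT_PATH.exists() else []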
batch_size = 100
for index in tqdm(range(len(propositions), len(tweet_texts), batch_size)):
    batch = tweet_texts[index:index+batch_size]
    propositions.extend(model.generate_from_inputs(batch))
    write_jsonl(propositions, OUTPUT_PATH)
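The loop above resumes from wherever the output file left off: `len(propositions)` is the number of inputs already processed, so a rerun after a dropped connection skips the finished batches. The self-contained sketch below shows the same pattern with a stand-in generator. The names `load_done` and `generate_resumable` are hypothetical, and `model.generate_from_inputs` is replaced by whatever function you pass as `generate_fn`.

```python
import json
from pathlib import Path


def load_done(path):
    """Load previously saved outputs, or start fresh if no file exists."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def generate_resumable(inputs, generate_fn, path, batch_size=100):
    """Process inputs in batches, checkpointing to `path` after each batch."""
    outputs = load_done(path)
    for start in range(len(outputs), len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        outputs.extend(generate_fn(batch))
        # Rewrite the whole file so it always reflects completed batches.
        with Path(path).open("w", encoding="utf-8") as f:
            for row in outputs:
                f.write(json.dumps(row) + "\n")
    return outputs
```

This relies on `generate_fn` returning exactly one output per input, in order; otherwise the resume offset would no longer line up with the inputs.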






### Step 4: Validate

We sample some of the generated decompositions and confirm that they are _plausible_. In our paper, this was done using a human study. Please refer to Section 3 of our paper for more details. 


In [22]:
# before sampling, make sure to keep the tweet with the generations: 

for tweet_text, props in zip(tweet_texts, propositions):
    props.append(tweet_text)

# sample from the propositions
random.seed(42)
sample = random.sample(propositions, 5)

for elem in sample: 
    print(f"TWEET: {elem[-1]}")
    print("PROPOSITIONS:")
    for prop in elem[:-1]:
        print(prop)
    print("---------------------------")

TWEET: Cindy &amp; I are praying for all those in the path of #HurricaneIrma. We thank the brave volunteers &amp; urge all to listen to local officials.
PROPOSITIONS:
Cindy and I are offering prayers for those affected by Hurricane Irma
Gratitude towards brave volunteers
People should follow the guidance of local officials during the hurricane
Hurricane Irma poses a significant threat
---------------------------
TWEET: We must do more to address mental health issues our veterans face and ensure all have access to treatment. @WSAZnews #suicidepreventionmonth  
PROPOSITIONS:
Veterans face mental health issues that need to be addressed
Access to mental health treatment for veterans should be ensured
Suicide prevention is important
Mental health support for veterans needs improvement
---------------------------
TWEET: This project will bring hundreds of jobs &amp; millions of $ in economic growth to Northwest MT. #EmployMT #ConnectMT
 
PROPOSITIONS:
The project will create job opportunities i
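If the plausibility check is done by human annotators, one simple option is to export a sample to a spreadsheet with an empty label column and compute the share of plausible propositions once it is filled in. The sketch below is an assumption about how this could be organized, not the protocol used in the paper. The function names, the column names, and the `yes` label are hypothetical, and it assumes `propositions[i]` holds only the generated strings for `tweets[i]`.

```python
import csv
import random


def write_annotation_sheet(tweets, propositions, path, k=5, seed=42):
    """Write up to k sampled tweets, one row per proposition, with an empty label."""
    rng = random.Random(seed)
    pairs = list(zip(tweets, propositions))
    sample = rng.sample(pairs, min(k, len(pairs)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["tweet", "proposition", "plausible"])
        for tweet, props in sample:
            for prop in props:
                writer.writerow([tweet, prop, ""])
    return len(sample)


def plausibility_rate(path):
    """Share of labeled rows marked 'yes'; None if nothing is labeled yet."""
    with open(path, newline="", encoding="utf-8") as f:
        labels = [row["plausible"].strip().lower() for row in csv.DictReader(f)]
    labeled = [label for label in labels if label]
    if not labeled:
        return None
    return sum(label == "yes" for label in labeled) / len(labeled)
```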

### Step 5: Use the propositions for your own downstream task!
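As one possible downstream use, the config in Step 3 exposes `output_combination_strategy` and `include_original_doc`, which suggests representing a document by combining the embeddings of its propositions, optionally together with the embedding of the original text. The sketch below shows a mean-pooling version of that idea. The function `combine_embeddings` and the toy `toy_embed` function are hypothetical stand-ins: in practice you would substitute a real sentence-embedding model, and the strategies actually supported by `GenerationEmbedder` should be checked in `eval_mteb.py`.

```python
def toy_embed(text, dim=8):
    """Toy stand-in for a sentence embedder: character-bucket counts."""
    vec = [0.0] * dim
    for ch in text.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def combine_embeddings(doc, propositions, embed_fn=toy_embed,
                       include_original_doc=True):
    """Mean-pool the embeddings of the propositions (and optionally the doc)."""
    texts = list(propositions) + ([doc] if include_original_doc else [])
    if not texts:
        raise ValueError("nothing to embed")
    vectors = [embed_fn(text) for text in texts]
    # Average each dimension across all vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

The resulting vector can be used wherever a single document embedding is expected, for example as input to clustering or nearest-neighbor search.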

## TODOs:

1. Clustering and t-SNE visualization
2. far -> close in embedding space 