In [1]:
import sys
sys.path.append('..')
from dotenv import load_dotenv
load_dotenv()
%load_ext autoreload
%autoreload 2

In [22]:
import pandas as pd
import textwrap
import json

from src.generate import generate_questions, rewrite_text, generate_experiment
from src.search import get_reference_questions

## Survey-it!

This proof-of-concept (POC) Jupyter notebook extracts insights from a given text and checks them against survey data. It generates targeted survey questions from the text, then searches a reference database of existing survey questions for empirical results that support those insights.

The process begins with a short input sentence. Survey-it! then turns that sentence into a concise, data-backed blog post in the style of fivethirtyeight.com, so the output is engaging while staying grounded in survey data.

<img src="../reports/figures/diagram_rag.PNG" alt="Diagram of the retrieval-augmented generation pipeline" width="800"/>

### Input Sentence

In [84]:
data = 'The president of United States likes to drink coffee in the morning.'

### Synthetic Survey Questions Generation

We use `gpt-4-1106-preview` to create synthetic survey questions based on the input text.

In [85]:
questions = generate_questions(data=data)

In [86]:
for q in questions.questions:
    print(f'Question {q.id}: {q.question}')

Question 1: Do you typically drink coffee in the morning?
Question 2: How influential is the public behavior of the president on your personal habits?
Question 3: What is your favorite morning beverage?


### Searching for Similar Questions in Reference Database

We utilize a small dataset of survey questions embedded using `text-embedding-ada-002`. In this step, we perform semantic text search to retrieve relevant data as context.
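
As a rough illustration of what a thresholded semantic search involves, the sketch below ranks stored questions by cosine similarity to a query embedding and keeps the top `n` whose score reaches `threshold`. It is not the code in `src.search`: the function name `top_matches`, the toy three-dimensional vectors, and the reading of `threshold` as a minimum cosine similarity are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vec, reference, n=1, threshold=0.8):
    """Return up to n (question, score) pairs scoring at least threshold.

    `reference` maps question text to its embedding (hypothetical layout).
    """
    scored = [(q, cosine_similarity(query_vec, vec)) for q, vec in reference.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]


# Toy embeddings standing in for real text-embedding-ada-002 vectors.
reference = {
    "Do you prefer hot coffee or iced coffee?": [0.9, 0.1, 0.0],
    "Do you support the President?": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of a coffee-related question
matches = top_matches(query, reference, n=1, threshold=0.8)
print(matches)
```

An empty list from a function like this means nothing in the reference set was close enough to the query, which is the situation explored under Future Work.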

In [87]:
survey_questions = [q.question for q in questions.questions]

In [88]:
reference_questions = []

for q in survey_questions:
    reference_questions.append(get_reference_questions(question=q, n=1, threshold=0.8))

In [89]:
reference_questions = (pd.concat(reference_questions)
                       .drop_duplicates()
                       .reset_index(drop=True)
                    )
reference_questions

| | isPanel | surveyCountry | surveyQuestion | surveyData |
|---|---|---|---|---|
| 0 | True | United States | In general, do you prefer hot coffee or iced c... | [{'Hot coffee': 0.73, 'Iced coffee': 0.27}, {'... |
| 1 | True | United States | Do you support the President? | [{'No': 0.45, 'Yes': 0.55}, {'No': 0.14, 'Yes'... |


In [111]:
survey_context = ''

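for index, row in reference_questions.iterrows():
    # surveyData is a stringified list of dicts; strip the brackets and keep only
    # the last entry, which this notebook treats as the 2023 results.
    surveyData_2023 = row['surveyData'].replace('[', '').replace(']', '').split('},')[-1]
    # [1:] drops the leading space left over from the split.
    survey_context += f"Reference Question {index+1} - Survey Country: {row['surveyCountry']}, Survey Year: 2023, Survey Question: {row['surveyQuestion']}, Survey Data: {surveyData_2023[1:]}\n"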

print(survey_context)

Reference Question 1 - Survey Country: United States, Survey Year: 2023, Survey Question: In general, do you prefer hot coffee or iced coffee?, Survey Data: {'Hot coffee': 0.6, 'Iced coffee': 0.4}
Reference Question 2 - Survey Country: United States, Survey Year: 2023, Survey Question: Do you support the President?, Survey Data: {'No': 0.48, 'Yes': 0.52}
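
The string splitting used to pull out the last `surveyData` entry works for the rows shown, but it depends on the exact text layout. If the column really is a stringified Python list of dicts, as the `.replace(...)` calls suggest, `ast.literal_eval` offers a less fragile alternative. The sketch below assumes that storage format; `raw` is a shortened, illustrative value (the real column holds more entries, which are truncated in the table display), and `latest_survey_data` is a hypothetical helper name.

```python
import ast

# Shortened illustrative value; the real column has more entries than shown here.
raw = "[{'Hot coffee': 0.73, 'Iced coffee': 0.27}, {'Hot coffee': 0.6, 'Iced coffee': 0.4}]"


def latest_survey_data(raw_value):
    """Parse the stringified list and return its last entry as a dict."""
    waves = ast.literal_eval(raw_value)  # safely evaluates Python literals only
    return waves[-1]


latest = latest_survey_data(raw)
print(latest)
```

Working with a real dict also makes it easy to format percentages or validate that the shares sum to one before they reach the prompt.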



### Output Generation

We use `gpt-4-1106-preview` to create a data-backed blog post in the style of fivethirtyeight.com.

In [128]:
response = rewrite_text(text=data, context=survey_context)

In [129]:
print(textwrap.fill(response, 200, break_long_words=False, replace_whitespace=False))

Picture this: as dawn breaks over the White House, the leader of the free world loves nothing more than to kickstart the day with a good old cup of joe. And while the president's morning ritual might
seem as American as tweeting from bed, let's pour over some fresh, piping hot data that gives us the buzz on the nation's coffee preferences.

In a recent survey swooping across the States in 2023,
when asked about their coffee allegiance, 60% of participants steamed towards hot coffee, leaving iced coffee to chill with the remaining 40%. That's right, despite the trendy cold brews and iced
lattes fencing the coffee landscape, it's the classic hot coffee that's still brewing strong in the hearts of American caffeine aficionados.

But wait, there's more to stir into this beverage tale.
The same year, the nation seems somewhat split on their support for the President, with a narrow 52% leaning in favor, while 48% aren't exactly toasting to their commander-in-chief's health. It's a
political 
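
The printed post breaks lines in odd places because `textwrap.fill` with `replace_whitespace=False` keeps the newlines already in the response but still counts them as ordinary characters when measuring line width. The `textwrap` documentation recommends splitting the text into paragraphs and wrapping each one separately. A minimal sketch of that approach follows; `wrap_paragraphs` is a hypothetical helper, not part of the project.

```python
import textwrap


def wrap_paragraphs(text, width=200):
    """Wrap each newline-separated paragraph separately, keeping blank lines."""
    wrapped = [
        textwrap.fill(paragraph, width, break_long_words=False) if paragraph.strip() else ""
        for paragraph in text.splitlines()
    ]
    return "\n".join(wrapped)


sample = "First paragraph with several words in it.\n\nSecond paragraph, also short."
print(wrap_paragraphs(sample, width=20))
```

In the notebook, the same helper could be called as `print(wrap_paragraphs(response))`.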

### Future Work

What happens when the questions stored in the reference database are not relevant to the input text?

In [12]:
new_data = 'New UFO videos were released by the Pentagon.'

In [13]:
new_questions = generate_questions(data=new_data)

In [14]:
for q in new_questions.questions:
    print(f'Question {q.id}: {q.question}')

Question 1: Have you watched the new UFO videos released by the Pentagon?
Question 2: Do you believe that the UFOs shown in the Pentagon's videos are evidence of extraterrestrial life?
Question 3: How likely are you to trust the authenticity of videos released by the Pentagon?


In [11]:
new_survey_questions = [q.question for q in new_questions.questions]

new_reference_questions = []

for q in new_survey_questions:
    new_reference_questions.append(get_reference_questions(question=q, n=1, threshold=0.8))

new_reference_questions = (pd.concat(new_reference_questions)
                           .drop_duplicates()
                           .reset_index(drop=True)
                           )

new_reference_questions

| | isPanel | surveyCountry | surveyQuestion | surveyData |
|---|---|---|---|---|


In this case, the search returns no matches. We can instead use the OpenAI API to generate the data structure needed to query an API that returns results from a synthetic experiment.

<img src="../reports/figures/diagram_experiment_simulator.PNG" alt="Diagram of the experiment simulator workflow" width="800"/>

In [19]:
new_target_question = 'Do you trust the authenticity of videos released by the Pentagon?'

In [20]:
experiment = generate_experiment(data=new_target_question)

In [27]:
print(experiment.model_dump_json(indent=2))

{
  "id": 1,
  "pre_cooked_levels_lookup": [
    "Yes",
    "No"
  ],
  "population_traits": [
    "level of education",
    "political affiliation"
  ],
  "chain_of_thought": "Trust in the authenticity of videos released by the Pentagon may be influenced by an individual's level of education and their political affiliation. Someone with higher education might be more critical and question the credibility, whereas political affiliation could influence trust based on the individual's perception of the government."
}
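
The `model_dump_json` call suggests that `generate_experiment` returns a Pydantic model. Its shape can be mirrored with a standard-library dataclass, as sketched below. The class name `Experiment`, the field types, and the defaults are assumptions inferred from the printed JSON, not taken from `src.generate`.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Experiment:
    """Hypothetical mirror of the structure printed above."""
    id: int
    pre_cooked_levels_lookup: list[str] = field(default_factory=list)  # answer options
    population_traits: list[str] = field(default_factory=list)  # traits assumed to drive answers
    chain_of_thought: str = ""  # the model's stated reasoning


experiment_sketch = Experiment(
    id=1,
    pre_cooked_levels_lookup=["Yes", "No"],
    population_traits=["level of education", "political affiliation"],
    chain_of_thought="(model's reasoning goes here)",
)
payload = json.dumps(asdict(experiment_sketch), indent=2)
print(payload)
```

Keeping the answer levels and population traits as explicit fields is what lets a downstream simulator API consume the structure without further parsing.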


Then we can use a Causal Experiment Simulator to create synthetic survey results like this:

In [29]:
new_survey_context = "Reference Question - Survey Country: United States, Survey Year: 2023, Survey Question: Do you trust the authenticity of videos released by the Pentagon?, Survey Data: {'No': 0.4, 'Yes': 0.6}"
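
The context string here is written by hand. Once a simulator returns an answer distribution, the same string could be built programmatically. The sketch below assumes the simulator's output can be reduced to a plain dict of answer shares; `format_reference` and its parameters are hypothetical names introduced for illustration.

```python
def format_reference(question, distribution, country="United States", year=2023):
    """Build a context line in the format passed to rewrite_text in this notebook."""
    return (
        f"Reference Question - Survey Country: {country}, Survey Year: {year}, "
        f"Survey Question: {question}, Survey Data: {distribution}"
    )


# Illustrative shares matching the hand-written example, not real survey results.
simulated = {"No": 0.4, "Yes": 0.6}
context_line = format_reference(
    "Do you trust the authenticity of videos released by the Pentagon?", simulated
)
print(context_line)
```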


Finally, we can create the blog post:

In [30]:
new_response = rewrite_text(text=new_data, context=new_survey_context)

In [31]:
print(textwrap.fill(new_response, 200, break_long_words=False, replace_whitespace=False))

Alright folks, let's dive into the great unknown with a side of skepticism. The Pentagon has dropped some fresh UFO footage that's got everyone talking. But before you strap on your tinfoil hats,
let's look at how much trust Americans place in these otherworldly releases. According to a recent 2023 survey, it turns out 60% of those polled in the United States are inclined to believe the
Pentagon's videos are the real McCoy. Meanwhile, a not-insignificant 40% are giving the side-eye, casting doubt on the legitimacy of these extraterrestrial exposés. So, as the debate over little green
men rages on, it's clear that while the majority may be on board with military-released footage, there's a substantial chunk who remain staunchly in the skeptic's corner. Keep your eyes on the skies,
but maybe take what you see with a grain of interstellar salt.
