# Title

In [144]:
from openai import OpenAI
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer
import random
import pandas as pd

In [145]:
client = OpenAI()
detokenize = TreebankWordDetokenizer().detokenize

In [266]:
# Import tokenized sentences from three Jane Austen novels in NLTK's Gutenberg corpus.
sentences = nltk.corpus.gutenberg.sents('austen-emma.txt') + nltk.corpus.gutenberg.sents('austen-persuasion.txt') + nltk.corpus.gutenberg.sents('austen-sense.txt')

# Limit API cost by keeping only sentences of 10 to 20 tokens (punctuation marks count as tokens).
sentences = [s for s in sentences if 9 < len(s) < 21]

# The first and last few sentences can be weird, so out of an abundance of caution, drop the first and last 10 sentences.
sentences = sentences[10:-10]

In [267]:
'Our filtered corpus contains {} sentences'.format(len(sentences))

'Our filtered corpus contains 4984 sentences'

Some helper functions for tagging sentences and for quickly printing a random sample of them.

In [268]:
def tags_present_in_sentence(s):
    tagged = nltk.pos_tag(s)
    tags = [w[1] for w in tagged]
    return tags

def display_sampling(list_of_sentences, k = 5, seed = None):
    random.seed(seed)
    to_display = random.choices(list_of_sentences, k = k)
    for s in to_display:
        print(detokenize(s))
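
One thing to keep in mind about `display_sampling`: `random.choices` draws with replacement, so the same sentence can be printed twice, and `random.seed` resets the global random state. A minimal sketch of an alternative, assuming distinct samples are preferred (the helper name `sample_without_replacement` is hypothetical and not part of this notebook):

```python
import random

def sample_without_replacement(items, k=5, seed=None):
    # Hypothetical helper, not part of the notebook above.
    # A private Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    # random.sample never picks the same position twice; cap k at len(items).
    return rng.sample(list(items), min(k, len(items)))

# Toy tokenized sentences, made up for illustration.
toy = [["One", "."], ["Two", "."], ["Three", "."]]
picked = sample_without_replacement(toy, k=2, seed=0)
print(picked)
```

Passing the same `seed` reproduces the same sample, which is handy when comparing prompts.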


In [269]:
def ends_in_vbd(s):
    "Ends in a verb in the past tense"
    # The final token is assumed to be punctuation, so check the token before it.
    return nltk.pos_tag(s)[-2][1] == 'VBD'

def ends_in_married(s):
    return s[-2] == 'married'

def contains_fairfax(s):
    return 'Fairfax' in s

def contains_two(s):
    return 'two' in s or 'Two' in s

def contains_CD(s):
    "Contains a numerical reference"
    return "CD" in tags_present_in_sentence(s)
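
The token-based conditions can be sanity-checked on hand-built token lists without loading the corpus. The sketch below copies two of the conditions from the cell above and runs them on made-up sentences. It shows that `s[-2]` relies on the last token being punctuation, and that `'two' in s` matches whole tokens only.

```python
def ends_in_married(s):
    return s[-2] == 'married'

def contains_two(s):
    return 'two' in s or 'Two' in s

# Made-up token lists, in the same shape as the NLTK sentences.
toy = ['She', 'has', 'two', 'sisters', 'and', 'both', 'are', 'married', '.']
no_final_punct = ['Both', 'are', 'married']
twofold = ['The', 'benefit', 'was', 'twofold', '.']

print(ends_in_married(toy))             # True
print(contains_two(toy))                # True
print(ends_in_married(no_final_punct))  # False: s[-2] is 'are'
print(contains_two(twofold))            # False: 'twofold' is a different token
```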

In [270]:
display_sampling([s for s in sentences if contains_CD(s)])

" I was six weeks with Edward," said he, " and saw him happy.
But thirty - five has nothing to do with matrimony."
However, I think it answered so far as to tempt one to go again.
He positively said that it had been known to no being in the world but their two selves."
" You think so, do you?-- I wanted the opinion of some one who could really judge.


The above is helpful for playing around and getting a sense of what our sentences look like. Now build a dataframe that holds the sentences along with the result of each condition.

In [271]:
df = pd.DataFrame(pd.Series(sentences, name = 'sentences'))

In [272]:
df.head()

Unnamed: 0,sentences
0,"[This, is, three, times, as, large, .--, And, ..."
1,"[_We_, must, begin, ;, we, must, go, and, pay,..."
2,"["", My, dear, ,, how, am, I, to, get, so, far, ?]"
3,"["", No, ,, papa, ,, nobody, thought, of, your,..."
4,"[We, must, go, in, the, carriage, ,, to, be, s..."


In [273]:
def add_condition_column(df, condition):
    df[condition.__name__] = df.sentences.apply(condition)
    

In [274]:
for cond in [ends_in_vbd, ends_in_married, contains_fairfax, contains_two, contains_CD]:
    add_condition_column(df, cond)

In [275]:
df.head()

Unnamed: 0,sentences,ends_in_vbd,ends_in_married,contains_fairfax,contains_two,contains_CD
0,"[This, is, three, times, as, large, .--, And, ...",False,False,False,False,True
1,"[_We_, must, begin, ;, we, must, go, and, pay,...",False,False,False,False,False
2,"["", My, dear, ,, how, am, I, to, get, so, far, ?]",False,False,False,False,False
3,"["", No, ,, papa, ,, nobody, thought, of, your,...",False,False,False,False,False
4,"[We, must, go, in, the, carriage, ,, to, be, s...",False,False,False,False,False
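
Since most rows above are `False`, it can be useful to count how many sentences satisfy each condition: a rare condition leaves few positive examples to draw from. A minimal sketch on plain lists, with a hypothetical helper `count_matches` and made-up sentences:

```python
def contains_two(s):
    return 'two' in s or 'Two' in s

def contains_fairfax(s):
    return 'Fairfax' in s

def count_matches(sentences, conditions):
    # Hypothetical helper: map each condition's name to the number of sentences it accepts.
    return {cond.__name__: sum(1 for s in sentences if cond(s)) for cond in conditions}

# Made-up token lists standing in for the corpus.
toy_sentences = [
    ['Two', 'rooms', 'were', 'ready', '.'],
    ['Miss', 'Fairfax', 'had', 'two', 'letters', '.'],
    ['He', 'said', 'nothing', '.'],
]
counts = count_matches(toy_sentences, [contains_two, contains_fairfax])
print(counts)  # {'contains_two': 2, 'contains_fairfax': 1}
```

On the real dataframe, summing a boolean column gives the same count.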


Now that we have a dataset (and a suite of tools for adding to it), let's focus on getting the LLM to learn a condition in context and on evaluating how well it learned it.

In [277]:
def get_learning_data(df, cond, k = 5, seed = None):
    """Sample k sentences that satisfy the condition and k that do not.

    Returns a list of two Series: [satisfying, not_satisfying].
    Raises ValueError if either group has fewer than k sentences.
    """

    # The condition column is boolean, so it can be used directly as a mask.
    satisfying = df.loc[df[cond.__name__]].sentences.sample(n = k, random_state = seed)
    not_satisfying = df.loc[~df[cond.__name__]].sentences.sample(n = k, random_state = seed)

    return [satisfying, not_satisfying]

In [289]:
ld = get_learning_data(df, ends_in_married)

In [293]:
def base_prompt(learning_data):
    p = "I have a secret condition in mind. The following sentences are labeled 'True' if the condition is satisfied and labeled 'False' otherwise:\n\n"
    for s in learning_data[0]:
        p += detokenize(s) + ": " + 'True\n'
    p += '\n'
    for s in learning_data[1]:
        p += detokenize(s) + ": " + 'False\n'
    
    p += "\nYour task is to label the following sentences according to the secret condition. Return an ordered list using only the words 'True' or 'False':"
    return p
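
The prompt asks for an ordered list of 'True'/'False' words, so evaluating the model comes down to parsing that list and comparing it with the known labels. A minimal sketch, assuming the reply is plain text. The helpers `parse_labels` and `accuracy` are hypothetical, and the reply string is made up rather than real model output.

```python
import re

def parse_labels(reply):
    # Hypothetical helper: pull every standalone 'True'/'False' out of the reply, in order.
    words = re.findall(r'\b(true|false)\b', reply, flags=re.IGNORECASE)
    return [w.lower() == 'true' for w in words]

def accuracy(predicted, expected):
    # Fraction of positions where the labels agree.
    # An empty or mismatched-length answer is treated as an error rather than scored.
    if not expected or len(predicted) != len(expected):
        raise ValueError('expected {} labels, got {}'.format(len(expected), len(predicted)))
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)

reply = "1. True\n2. False\n3. True\n4. True"  # made-up reply for illustration
expected = [True, False, False, True]          # labels from the condition column
predicted = parse_labels(reply)
print(predicted)                      # [True, False, True, True]
print(accuracy(predicted, expected))  # 0.75
```

Raising on a length mismatch is a design choice: it flags replies where the model skipped or added a label instead of silently misaligning the comparison.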

In [294]:
print(base_prompt(ld))

I have a secret condition in mind. The following sentences are labeled 'True' if the condition is satisfied and labeled 'False' otherwise:

Every friend of Miss Taylor must be glad to have her so happily married.": True
Though I think he had better not have married.: True
It was very wrong of me, you know, to keep any remembrances, after he was married.: True
I am sure I was very much surprized when I first heard she was going to be married.": True
And I am so glad your sister is going to be well married!: True

He looked completely astonished, but not more astonished than pleased; his eyes brightened!: False
He begged her pardon, but she must be applied to, to explain Italian again.: False
It is a corner room, and has windows on two sides.: False
" Now, ma' am," said Jane to her aunt, " shall we join Mrs.: False
Edward is very amiable, and I love him tenderly.: False

Your task is to label the following sentences according to the secret condition. Return an ordered list using only the