# GA Capstone
## Causal Model Evaluation

The goal here is to evaluate the causal model using the classification model. To do so, I will generate text from a set of prompts, keep only the first sentence of each generated output, and run each sentence through the classification model. Since the causal model is supposed to generate Shakespearean text, the aim is for as many sentences as possible to be classified as Shakespearean. The score is the percentage of sentences classified as Shakespearean out of all sentences classified.
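
As a minimal sketch of the scoring rule described above, assuming the classifier yields one binary label per sentence (1 for Shakespearean, 0 otherwise). The helper name `shakespearean_score` and the example labels are hypothetical and are not part of the project's utilities.

```python
def shakespearean_score(labels):
    """Return the percentage of labels equal to 1 (Shakespearean)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return 100.0 * sum(labels) / len(labels)

# illustrative labels only, not real classifier output
example_labels = [1, 0, 1, 1]
example_score = shakespearean_score(example_labels)
print(f"Shakespearean: {example_score}%")
```

The same ratio is what the classification cell later computes per prompt source.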

### Imports and Preliminaries

In [1]:
# tokenizer
from transformers import AutoTokenizer

# models
from transformers import TFAutoModelForCausalLM, TFAutoModelForSequenceClassification

# custom utilities
from utilities.utilities import load_config, get_model_path, load_model, load_tokenizer
from utilities.utilities import load_text_from_config
from utilities.utilities import generate_from
from utilities.utilities import classify_from
from utilities.utilities import extract_sentences

# pandas for csv read and extract
import pandas as pd

# other
import random
import os

In [2]:
# model config file
CONFIG_FILE = 'config.json'
cfgvars = load_config(CONFIG_FILE)

### Load Models and Model Support

In [3]:
# get model locations, load models, and load tokenizers
causal_model_path = get_model_path(CONFIG_FILE, 'causal')
class_model_path = get_model_path(CONFIG_FILE, 'class')

causal_model = load_model(causal_model_path, 'causal')
class_model = load_model(class_model_path, 'class')

causal_tokenizer = load_tokenizer(cfgvars['CAUSAL_MODEL'])
class_tokenizer = load_tokenizer(cfgvars['CLASS_MODEL'])

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ../models/shakespeare.distilgpt2.8.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
Some layers from the model checkpoint at ../models/shakespeare.distilbert-base-uncased.2 were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a 

### Load and Prep Test Data

In [4]:
# get some test data to fuel generator
test_data = dict()

wines_test_data_path = os.path.join(cfgvars['DATA_DIR'], 'winemag-data-130k-v2.csv')
wines_test_data = pd.read_csv(wines_test_data_path)
wines_test_data = ' '.join(list(wines_test_data['description']))
wines_test_data = extract_sentences(wines_test_data)
test_data['wines'] = wines_test_data

s, o = load_text_from_config(cfgvars)
test_data['shakespeare'] = extract_sentences(s)
test_data['other'] = extract_sentences(o)

for v in test_data.values():
    print(v[:2])

['Aromas include tropical fruit, broom, brimstone and dried herb.', "The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."]
['From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory:', 'But thou, contracted to thine own bright eyes, Feed’st thy light’s flame with self-substantial fuel, Making a famine where abundance lies, Thyself thy foe, to thy sweet self too cruel:']
['Lift up your hearts in Gumber, laugh the Weald And you my mother the Valley of Arun sing.', 'Here am I homeward from my wandering Here am I homeward and my heart is healed.']


In [5]:
# FORMAT TEST DATA

# fragment ratio for prompt selection
FRAG_RAT = 0.4

# sample size
SAMPLES = 100

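# function to extract a leading fragment from a sentence
def get_frag(text, rat=0.2):
    # rat falsy: keep all words; rat < 1: keep that fraction (at least one word);
    # rat >= 1: keep that many words, capped at the sentence length
    words = text.split()
    nwords = len(words)
    if not rat:
        nout = nwords
    elif rat < 1:
        nout = int(nwords * rat) or 1
    else:
        nout = min(int(rat), nwords)

    return ' '.join(words[:nout])

# draw random sentences and reduce each one to a prompt fragment
def get_samples(data, samples=10):
    sampled = random.sample(data, samples)
    return [get_frag(sentence, FRAG_RAT) for sentence in sampled]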

# create samples
samples = {k: get_samples(v, SAMPLES) for k, v in test_data.items()}

for v in samples.values():
    print(v[:2])

["At four years old, it's still young, its", 'Drink']
['And yet not so,', 'Stay, Humphrey,']
['It gave them both', 'Thou shalt, when he hath runne the']


### Generate Text from Test Data

In [6]:
# GENERATE TEXT

# maximum lines of output per category
MAX_OUTPUT_LINES = 10

# generate and store
generated = {k: list() for k in samples.keys()}
for k,v in samples.items():
    count = 0
    print(f'\n*--- {k} ---*')
    for line in v:
        count += 1
        if count <= MAX_OUTPUT_LINES: print(f'Input: {line}')
        gs = extract_sentences(generate_from(line, causal_model, causal_tokenizer))[0]
        if count <= MAX_OUTPUT_LINES: print(f'> Output: {gs}')
        if count > MAX_OUTPUT_LINES: print(f'{count},', end=' ')
        generated[k].append(gs)
    print('\n')


*--- wines ---*
Input: At four years old, it's still young, its
> Output: At four years old, it's still young, its very full.
Input: Drink
> Output: Drink, and make haste.
Input: On the velvety palate, firm, polished tannins
> Output: On the velvety palate, firm, polished tannins.
Input: This is open and and very
> Output: This is open and and very little in my life.
Input: Drink
> Output: Drink but a drink and bring the lads out.
Input: It's an opulent, delicious wine, immediately likeable now
> Output: It's an opulent, delicious wine, immediately likeable now than a year in Italy.
Input: Would pair with
> Output: Would pair with her so to say I was a monster?
Input: Oak provides fatness, and the right
> Output: Oak provides fatness, and the right wing makes it.
Input: drink
> Output: drink and eat.
Input: A hand-stitched, red vinyl body suit, tailored to the bottle’s seductive
> Output: A hand-stitched, red vinyl body suit, tailored to the bottle’s seductive breath?
11, 12, 13, 14, 

### Classify and Score Generated Text

In [7]:
# CLASSIFY

# reuse the unknown token as EOS and pad token (set once, before the loop)
class_tokenizer.eos_token = class_tokenizer.unk_token
class_tokenizer.pad_token = class_tokenizer.eos_token

for k, v in generated.items():
    results = classify_from(v, class_model, class_tokenizer)
    shakespearean_ratio = sum(results.c) / len(results.c)
    score_mean = sum(results.s) / len(results.s)
    
    print(f'\n*--- {k} ---*')
    print(f'Shakespearean: {shakespearean_ratio * 100}%')
    print(f'Mean score: {score_mean}')


*--- wines ---*
Shakespearean: 63.0%
Mean score: 0.5876119114458561

*--- shakespeare ---*
Shakespearean: 99.0%
Mean score: 0.8906413942575455

*--- other ---*
Shakespearean: 61.0%
Mean score: 0.5830802789190784


### Conclusion

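Perhaps unsurprisingly, Shakespearean input generates the most Shakespearean output: 99% of the sentences generated from Shakespeare prompts were classified as Shakespearean (mean score 0.89), compared with 63% for the wine-review prompts (mean score 0.59) and 61% for the other prompts (mean score 0.58). For all three prompt sources, more than half of the generated sentences were classified as Shakespearean.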

Future tests could try different-sized sentence fragments to see whether the causal model generates more Shakespearean text with more or fewer words in the input prompt.
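
One way such a test might be organized is sketched below. It is only a sketch: `first_words`, `sweep_fragment_ratios`, and the `score_prompts` callback are hypothetical names, and the stand-in scorer merely reports mean prompt length. In a real run, `score_prompts` would presumably generate text from each prompt with the causal model, classify the first sentence, and return the Shakespearean percentage.

```python
def first_words(text, ratio):
    """Keep the leading fraction `ratio` of the words in `text` (at least one word)."""
    words = text.split()
    n_keep = max(1, int(len(words) * ratio))
    return ' '.join(words[:n_keep])

def sweep_fragment_ratios(sentences, ratios, score_prompts):
    """Map each ratio to the score `score_prompts` returns for prompts cut at that ratio."""
    return {r: score_prompts([first_words(s, r) for s in sentences]) for r in ratios}

# stand-in scorer for illustration only: mean prompt length in words
def mean_prompt_length(prompts):
    return sum(len(p.split()) for p in prompts) / len(prompts)

sentences = ['From fairest creatures we desire increase that thereby beauty might never die']
lengths = sweep_fragment_ratios(sentences, [0.2, 0.4, 0.8], mean_prompt_length)
print(lengths)
```

Passing the scorer in as a callback keeps the sweep independent of how generation and classification are implemented, so the same loop could be reused for each prompt source.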