# Exp8: Generate text of a certain difficulty level
Each prompt describes the requirements of a CEFR level to the Gemini Pro model, which is then asked to generate a text at that level.

In [1]:
from vertexai.preview.generative_models import GenerativeModel, Part, HarmCategory, HarmBlockThreshold
import pandas as pd
import sys
import os
# Make the parent directory importable so that the project-level config module is found.
sys.path.append(os.path.dirname(os.getcwd()))
import config
import random
import torch
import time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

random.seed(config.SEED)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = config.PATH_TO_GCP_CREDS

from sentence_transformers import SentenceTransformer

import spacy
nlp = spacy.load("en_core_web_sm")

egp = pd.read_csv('../dat/egponline.csv')

Let's create some prompts from existing stories.

In [2]:
cefr_texts = pd.read_csv("../dat/cefr_leveled_texts.csv")
cefr_texts.head()

Unnamed: 0,text,label
0,Hi!\nI've been meaning to write for ages and f...,B2
1,It was not so much how hard people found the ...,B2
2,Keith recently came back from a trip to Chicag...,B2
3,"The Griffith Observatory is a planetarium, and...",B2
4,-LRB- The Hollywood Reporter -RRB- It's offici...,B2


In [3]:
print(cefr_texts.text[20][:100])

Who says adult parties have to be boring. More and more adults are reliving their childhoods or crea


We define a reading descriptor for each CEFR level and build story prompts from the first 100 characters of each story.

In [4]:
description = {
    "C2": "Can understand and interpret critically virtually all forms of the written language including abstract, structurally complex, or highly colloquial literary and non-literary writings. Can understand a wide range of long and complex texts, appreciating subtle distinctions of style and implicit as well as explicit meaning.",
    "C1": "Can understand in detail lengthy, complex texts, whether or not they relate to his/her own area of speciality, provided he/she can reread difficult sections.",
    "B2": "Can read with a large degree of independence, adapting style and speed of reading to different texts and purposes, and using appropriate reference sources selectively. Has a broad active reading vocabulary, but may experience some difficulty with low-frequency idioms.",
    "B1": "Can read straightforward factual texts on subjects related to his/her field and interest with a satisfactory level of comprehension.",
    "A2": "Can understand short, simple texts on familiar matters of a concrete type which consist of high frequency everyday or job-related language. Can understand short, simple texts containing the highest frequency vocabulary, including a proportion of shared international vocabulary items.",
    "A1": "Can understand very short, simple texts a single phrase at a time, picking up familiar names, words and basic phrases and rereading as required."
}

storyPrompts = [f"{text[:100]}..." for text in cefr_texts.text]
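To make the prompt format concrete, here is a minimal sketch of how one story opening is combined with a level descriptor, mirroring the template used in `generate`. The helper name `build_prompt`, the shortened descriptor, and the sample text are assumptions for illustration and are not part of the notebook.

```python
# Illustrative sketch: the shortened descriptor and sample text are placeholders,
# and build_prompt is a hypothetical helper, not part of the notebook.
sample_description = {
    "A1": "Can understand very short, simple texts a single phrase at a time.",
}


def build_prompt(level: str, story_prompt: str, descriptions: dict) -> str:
    """Combine a CEFR level, its descriptor, and a story opening into one prompt."""
    return (
        f"Write a story using the following prompt on CEFR level {level} "
        f"(Description: {descriptions[level]})\n\n{story_prompt}"
    )


sample_text = "Who says adult parties have to be boring. " * 5
story_prompt = f"{sample_text[:100]}..."  # first 100 characters plus an ellipsis
prompt = build_prompt("A1", story_prompt, sample_description)
print(prompt)
```

Keeping the template in one function makes it easy to inspect a prompt without calling the model.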

In [None]:
def generate(level, storyPrompt):
    model = GenerativeModel("gemini-pro")
    print(level)
    print(storyPrompt)

    prompt = f"Write a story using the following prompt on CEFR level {level} (Description: {description[level]})\n\n{storyPrompt}"
    print(prompt)
    responses = model.generate_content(
        prompt,
        safety_settings={  # was necessary due to weird model behavior
            HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
        },
        generation_config={
            "max_output_tokens": 1024,
            "temperature": 1,
            "top_p": 0.9,
        },
        stream=True,
    )

    # Concatenate the streamed chunks; a chunk without text is logged and skipped.
    text = ""
    for response in responses:
        try:
            text += response.candidates[0].content.parts[0].text
        except Exception as e:
            print(response.candidates)
            print(e)
            #return generate(level, storyPrompt)
    time.sleep(10)  # pause between requests (presumably to respect API rate limits)
    return text

num_stories = 50
random.shuffle(storyPrompts)

file_path = "../dat/generated_texts_2.csv"
if os.path.exists(file_path):
    # Resume: reuse the prompts already present in the file and top up with new ones.
    existing_df = pd.read_csv(file_path)
    existing_stories = list(existing_df.story.unique())
    num_new = max(0, num_stories - len(existing_stories))
    storyPrompts = storyPrompts[:num_new] + existing_stories
else:
    existing_df = pd.DataFrame(columns=["label", "story", "text"])

story_counts = existing_df['label'].value_counts()
for level in description.keys():
    current_count = story_counts.get(level, 0)
    # Clamp at zero: a negative slice bound would select prompts instead of none.
    stories_to_add = max(0, num_stories - current_count)

    for story in storyPrompts[:stories_to_add]:
        text = generate(level, story)
        new_row = {"label": level, "story": story, "text": text}
        # Append each row immediately so that progress survives an interruption.
        pd.DataFrame([new_row]).to_csv(file_path, mode='a', index=False, header=not os.path.exists(file_path))
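The commented-out recursive retry in `generate` would recurse without limit if a prompt were blocked every time. One possible alternative, sketched under the assumption that a failed generation yields an empty string, is a bounded retry wrapper. The names `generate_with_retry` and `fake_generate` are hypothetical; the stand-in function replaces the real model call so that the sketch runs without Vertex AI.

```python
import time


def generate_with_retry(generate_fn, level, story_prompt, max_attempts=3, wait_seconds=0.0):
    """Call generate_fn until it returns non-empty text or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        text = generate_fn(level, story_prompt)
        if text.strip():
            return text
        print(f"Attempt {attempt} for level {level} returned no text")
        time.sleep(wait_seconds)
    return ""  # the caller decides whether to skip or log the failed prompt


# Stand-in for the real model call (assumption): fails twice, then succeeds.
calls = {"n": 0}


def fake_generate(level, story_prompt):
    calls["n"] += 1
    return "" if calls["n"] < 3 else f"[{level}] {story_prompt} Once upon a time"


result = generate_with_retry(fake_generate, "B1", "Hi!...")
print(result)
```

In the notebook, `generate` could be passed as `generate_fn`; because it already sleeps for 10 seconds per call, `wait_seconds` could likely stay at 0.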

In [14]:
# Load the generated stories written by the previous cell.
text_df = pd.read_csv(file_path)
print(text_df.text[5])

George had perpetually exuded an aura of hilarity, captivating those around him with his infectious humor. Our paths crossed serendipitously at the local cinema, where I had eagerly anticipated the release of the latest Spider-Man installment. As fate would have it, George occupied the adjacent seat, and our conversation ignited with an effortless camaraderie that defied the constraints of time.

During the ensuing months, George became an integral part of my life. His presence radiated an infectious energy that transformed the mundane into the extraordinary. Whether we embarked on spontaneous road trips, engaged in intellectually stimulating debates, or simply reveled in shared moments of laughter, George possessed an uncanny ability to elevate every experience.

One particularly memorable evening, as we strolled through the picturesque streets of our quaint town, George recounted tales of his eccentric family, each anecdote punctuated with his signature wit and charm. His grandmother