# Assignment 2
## P+7 (Oulipian language modelling)

### Background

The Oulipo (*Ouvroir de Littérature Potentielle*, or “Workshop of Potential Literature”) is a French literary group founded in 1960 by writer Raymond Queneau and mathematician François Le Lionnais. The group focuses on using rules and constraints in writing as a way to spark creativity. Rather than seeing constraints as obstacles, Oulipians treat them as tools to inspire new forms of storytelling and poetry. Their work combines mathematics, language, and playfulness, making their approach both unique and influential in modern literature.

One of the most famous Oulipian writers is Georges Perec, who is known for his creative use of constraints. His novel *La Disparition* (“A Void”) is written entirely without the letter "e," which is especially challenging given how common "e" is in French. Perec’s writing often plays with the structure of language in surprising ways. One popular Oulipian technique is N+7, where each noun in a text is replaced by the noun seven entries later in a dictionary. This creates unusual, absurd, and often funny results, encouraging writers to think differently about language and meaning.

![Georges Perec](https://upload.wikimedia.org/wikipedia/commons/7/76/Myart_georges-perec_1978.jpg)

(Georges Perec, 1978. From Wikimedia Commons)



### How it works

The N+7 process is straightforward:

- **Start with a text**. Choose any text—this could be a poem, a sentence, or a passage.
- **Use a dictionary**. Have a dictionary (or word list) handy.
- **Replace each noun**. Replace every noun in the original text with the noun that appears seven nouns later in the dictionary. If you reach the end of the dictionary, loop back to the beginning.
- **Maintain grammar**. Keep the new text as grammatically correct as possible, even though the results often turn out surreal and nonsensical.

For example, take the sentence below and apply the N+7 technique with a standard English dictionary:

*The cat sat on the mat.*

- “Cat” → the seventh noun after “cat” is “catalog.”
- “Mat” → the seventh noun after “mat” is “material.”

The result is:

*The catalog sat on the material.*
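The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a full implementation: `TOY_NOUNS` is a made-up eight-word list standing in for a dictionary, the offset is 2 rather than 7 so that the small list reproduces the cat/mat substitutions, and nouns are recognised by simple lookup rather than by part-of-speech tagging. The function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical miniature "dictionary" of nouns, for illustration only.
TOY_NOUNS = ["cat", "catacomb", "catalog", "catapult",
             "mat", "match", "material", "matter"]


def shift_noun(word, nouns, offset=7):
    """Return the noun `offset` entries after `word`, wrapping past the end."""
    ordered = sorted(set(nouns))
    i = bisect_left(ordered, word)
    if i < len(ordered) and ordered[i] == word:
        return ordered[(i + offset) % len(ordered)]
    # Word is not listed: count forward from where it would be inserted.
    return ordered[(i + offset - 1) % len(ordered)]


def n_plus(sentence, nouns, offset=7):
    """Shift every whitespace-separated token that appears in `nouns`."""
    known = set(nouns)
    out = []
    for token in sentence.split():
        core = token.strip(".,;:!?\"'")  # ignore surrounding punctuation
        if core.lower() in known:
            token = token.replace(core, shift_noun(core.lower(), nouns, offset))
        out.append(token)
    return " ".join(out)


print(n_plus("The cat sat on the mat.", TOY_NOUNS, offset=2))
# prints: The catalog sat on the material.
```

With a real dictionary, the same functions would be called with `offset=7` and a much longer noun list.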

### Assignment and deliverables

For this assignment, you will create a variation of the N+7 technique that we will call P+7. Using the GPT-2 language model, you will replace the last word of each line of Wallace Stevens's poem *The Snow Man* with the word that has the seventh-highest probability according to the model’s predictions.
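To make "seventh-highest probability" concrete, here is a small sketch of rank selection. The probabilities in `TOY_PROBS` are invented for illustration and are not real GPT-2 output; `word_at_rank` is a hypothetical helper name.

```python
def word_at_rank(probabilities, rank):
    """Return the word with the rank-th highest probability (rank 1 = most likely)."""
    if not 1 <= rank <= len(probabilities):
        raise ValueError("rank must be between 1 and the number of candidates")
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[rank - 1][0]


# Invented next-word probabilities for the prompt "One must have a mind of".
# These numbers are assumptions for the example, NOT model output.
TOY_PROBS = {
    "winter": 0.30, "his": 0.20, "its": 0.15, "steel": 0.10,
    "stone": 0.08, "fire": 0.07, "glass": 0.06, "ice": 0.04,
}

print(word_at_rank(TOY_PROBS, 1))  # prints: winter
print(word_at_rank(TOY_PROBS, 7))  # prints: glass
```

In this toy setting, P+7 would turn the first line into "One must have a mind of glass", and P+x corresponds to calling the helper with a different `rank`.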

By the end of this assignment, submit a link to a GitHub repository named `CART498-GenAI` containing a folder labelled `A02` with the following items:

- A version of the text processed with your P+7 technique, saved as a `.txt` file.
- A second version of the text processed with `P+x`. Choose an `x` value that produces the funniest, wittiest, or most absurd version of the original text. Save this as a `.txt` file, and include the `x` value in the filename (e.g., `P+23.txt` or `P+12.txt`).
- A Python notebook with the script used to generate your P+7 and P+x transformations using the GPT-2 language model.
- A short reflection (250–350 words) explaining how altering the `x` value impacted the output of your P+x version. Additionally, discuss how you would implement a P+7 technique in which all nouns are replaced with their seventh-highest probability alternatives.

### *The Snow Man*
by Wallace Stevens (1879–1955)


> One must have a mind of winter  
> To regard the frost and the boughs  
> Of the pine-trees crusted with snow;  
> And have been cold a long time  
> To behold the junipers shagged with ice,  
> The spruces rough in the distant glitter  
> Of the January sun; and not to think  
> Of any misery in the sound of the wind,  
> In the sound of a few leaves,  
> Which is the sound of the land  
> Full of the same wind  
> That is blowing in the same bare place  
> For the listener, who listens in the snow,  
> And, nothing himself, beholds  
> Nothing that is not there and the nothing that is.  




```python
# prompt: Generate a script that takes "The Snow Man" text by Wallace Stevens, tokenize the text, and replace the word at the end of each line replace by the seventh-highest generative probability according to the ChatGPT 2.0 text model

# Import libraries
from transformers import pipeline, GPT2TokenizerFast

# Load the pre-trained GPT-2 text-generation pipeline and its tokenizer
generator = pipeline('text-generation', model='gpt2')
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# Input text
text = """One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is."""

# Process the text line by line
lines = text.strip().split('\n')
modified_lines = []

for line in lines:
    words = line.split()
    if words:
        # Build the prompt from every word except the last one
        input_text = " ".join(words[:-1])
        # Sample a continuation (len(input_text) counts characters, not tokens)
        generated_text = generator(input_text, max_length=len(input_text) + 10, num_return_sequences=1)[0]['generated_text']

        # Tokenize the prompt together with its sampled continuation
        tokens = tokenizer.encode(generated_text)

        # Simplified stand-in for P+7: take the token eight positions from the
        # end of the sampled text. This is a positional pick, not the token
        # ranked seventh by probability, which would require the model's logits.
        try:
            new_word = tokenizer.decode(tokens[-8])
        except IndexError:
            new_word = ""

        modified_line = input_text + " " + new_word
        modified_lines.append(modified_line)
    else:
        modified_lines.append(line)  # handle empty lines

# Join the modified lines into a single string
modified_text = "\n".join(modified_lines)


# Save the result to a file
with open('P+7.txt', 'w') as f:
    f.write(modified_text)

modified_text
```


"One must have a mind of What\nTo regard the frost and the  separated\nOf the pine-trees crusted with  the\nAnd have been cold a long \n\nTo behold the junipers shagged with  description\nThe spruces rough in the distant  cliff\nOf the January sun; and not to  ninth\nOf any misery in the sound of the  a\nIn the sound of a few  more\nWhich is the sound of the  of\nFull of the same  Linux\nThat is blowing in the same bare  When\nFor the listener, who listens in the  middle\nAnd, nothing himself, 're\nNothing that is not there and the nothing that  could"

```python
# prompt: Generate a copy of the code above but replace with seventh-highest probability with a variable probability of x

from transformers import pipeline, GPT2TokenizerFast

# Load the pre-trained GPT-2 text-generation pipeline and its tokenizer
generator = pipeline('text-generation', model='gpt2')
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# Input text
text = """One must have a mind of winter
To regard the frost and the boughs
Of the pine-trees crusted with snow;
And have been cold a long time
To behold the junipers shagged with ice,
The spruces rough in the distant glitter
Of the January sun; and not to think
Of any misery in the sound of the wind,
In the sound of a few leaves,
Which is the sound of the land
Full of the same wind
That is blowing in the same bare place
For the listener, who listens in the snow,
And, nothing himself, beholds
Nothing that is not there and the nothing that is."""

# Process the text line by line
lines = text.strip().split('\n')
modified_lines = []

# Offset used both for the token pick and for the output filename
x = 25

for line in lines:
    words = line.split()
    if words:
        # Build the prompt from every word except the last one
        input_text = " ".join(words[:-1])
        # Sample a continuation (len(input_text) counts characters, not tokens)
        generated_text = generator(input_text, max_length=len(input_text) + 10, num_return_sequences=1)[0]['generated_text']

        # Tokenize the prompt together with its sampled continuation
        tokens = tokenizer.encode(generated_text)

        # Simplified stand-in for P+x: take the token x+1 positions from the
        # end of the sampled text. This is a positional pick, not the token
        # ranked x-th by probability, which would require the model's logits.
        try:
            new_word = tokenizer.decode(tokens[-(x+1)])
        except IndexError:
            new_word = ""

        modified_line = input_text + " " + new_word
        modified_lines.append(modified_line)
    else:
        modified_lines.append(line)  # handle empty lines

# Join the modified lines into a single string
modified_text = "\n".join(modified_lines)

# Save the result to a file
with open(f'P+{x}.txt', 'w') as f:
    f.write(modified_text)

modified_text
```


'One must have a mind of  own\nTo regard the frost and the  we\nOf the pine-trees crusted with pill\nAnd have been cold a long  you\nTo behold the junipers shagged with  the\nThe spruces rough in the distant ,\nOf the January sun; and not to  sun\nOf any misery in the sound of the  the\nIn the sound of a few  few\nWhich is the sound of the  to\nFull of the same Full\nThat is blowing in the same bare  with\nFor the listener, who listens in the  be\nAnd, nothing himself,  no\nNothing that is not there and the nothing that  the'
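Both cells above pick a token by its position in a sampled continuation, so the replacement changes from run to run and is not tied to the model's probability ranking. One possible way to select by rank instead is to read the next-token scores directly. The sketch below is an assumption-laden alternative, not the code that produced the outputs shown: the helper names (`split_last_word`, `load_gpt2`, `token_at_rank`, `p_plus_x`) are hypothetical, and because GPT-2 predicts subword tokens, the result may be a word fragment or punctuation rather than a whole word.

```python
def split_last_word(line):
    """Split a line into (prefix, last_word) at the final space."""
    prefix, _, last = line.rstrip().rpartition(" ")
    return prefix, last


def load_gpt2(model_name="gpt2"):
    """Load tokenizer and model once (imported lazily, only when called)."""
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    return tokenizer, model


def token_at_rank(prompt, rank, tokenizer, model):
    """Return the next token ranked `rank`-th by probability (rank 1 = most likely)."""
    import torch

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # Scores over the vocabulary for the token that would follow the prompt
        next_token_logits = model(**inputs).logits[0, -1]
    # topk returns scores in descending order, so the last index is the rank-th best.
    token_id = torch.topk(next_token_logits, rank).indices[-1].item()
    return tokenizer.decode([token_id]).strip()


def p_plus_x(poem, x, tokenizer, model):
    """Replace the last word of each line with the rank-x next token."""
    out = []
    for line in poem.split("\n"):
        prefix, _ = split_last_word(line)
        if not prefix:  # empty or one-word line: nothing to condition on
            out.append(line)
            continue
        out.append(prefix + " " + token_at_rank(prefix, x, tokenizer, model))
    return "\n".join(out)
```

A possible usage, assuming `text` holds the poem as in the cells above, would be `tokenizer, model = load_gpt2()` followed by `print(p_plus_x(text, 7, tokenizer, model))`. Since this reads the ranking rather than sampling, repeated runs with the same model should give the same text.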