# Day 5
In today's exercise, we will extract information from PDF articles and interview transcripts, again using the `transformers` text generation pipeline. The first part of the exercise works with a PDF article, while the second part works with an interview transcript.

By the end of this exercise, you should be able to:
- Extract information from text using the `transformers` text generation pipeline
- Gain a sense for how subtle changes in the prompt can affect the quality of the generated text

## Environment Setup
**Make sure to set your runtime to use a GPU by going to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`**

In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Installing requisite packages
    !pip install transformers accelerate pymupdf &> /dev/null

    # Change working directory to day_5
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_5

In [None]:
import fitz
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
import torch
import textwrap

## Part 1: Extracting Information from PDF articles
This section will focus on extracting information from a PDF article. We will use our working paper "Against Justaism: A call for more measured discussions on LLM cognition".

The code begins by opening the PDF file with the `fitz` library (PyMuPDF). It then extracts the text from the first four pages, which contain the whole article apart from the conclusion and references. We limit the input in this way so that the prompt does not exhaust the GPU memory available in Colab:

In [None]:
# Extract text from PDF
pdf = fitz.open('Against_Justaism.pdf')
against_justaism = ""
for page in pdf[:4]:
    against_justaism += page.get_text()

against_justaism[:1000]  # Display the first 1000 characters to verify content extraction
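
The page-limited extraction above can also be wrapped in a small helper. This is a minimal sketch: `extract_text`, `max_pages`, and `FakePage` are illustrative names that are not part of `fitz`, and the helper only assumes that each page object exposes a `get_text()` method, as PyMuPDF pages do.

```python
from itertools import islice


def extract_text(pages, max_pages=4):
    """Concatenate the text of the first `max_pages` pages.

    `pages` is any iterable of objects with a `get_text()` method
    (for example, an open PyMuPDF document).
    """
    return "".join(page.get_text() for page in islice(pages, max_pages))


class FakePage:
    """Hypothetical stand-in for a PDF page so the sketch runs without a file."""

    def __init__(self, text):
        self.text = text

    def get_text(self):
        return self.text


demo_pages = [FakePage(f"page {i}\n") for i in range(1, 7)]
demo_text = extract_text(demo_pages, max_pages=4)
print(demo_text)
```

With the document opened in this notebook, the call would presumably look like `extract_text(pdf, max_pages=4)`.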

The code next loads the `'microsoft/Phi-3-mini-128k-instruct'` model and tokenizer:

In [None]:
torch.random.manual_seed(42) # For reproducibility

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda", # Use GPU
    torch_dtype=torch.float16, # Use half-precision
    trust_remote_code=True,
    attn_implementation='eager' # Standard attention; FlashAttention-2 is not supported on T4 GPUs
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
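
To get a feel for why half-precision matters on a T4, the memory needed for the model weights alone can be estimated with simple arithmetic. The sketch below assumes roughly 3.8 billion parameters (the figure reported for Phi-3-mini) and ignores activations and the attention cache, which grow with prompt length. `weight_memory_gb` is an illustrative helper, not a library function.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 3.8e9  # Assumed parameter count for Phi-3-mini

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 32-bit floats take 4 bytes each
fp16_gb = weight_memory_gb(N_PARAMS, 2)  # 16-bit floats take 2 bytes each

print(f"float32 weights: ~{fp32_gb:.1f} GB")
print(f"float16 weights: ~{fp16_gb:.1f} GB")
```

A T4 has 16 GB of memory, so under these assumptions float32 weights would leave almost no room for the prompt, while float16 weights leave about half of the memory free.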

The code then creates a text generation pipeline using the loaded model and tokenizer. It sets `"do_sample": False` to use greedy decoding and `"max_new_tokens": 300` to cap the length of the generated response.

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 300,  # Maximum number of tokens to generate
    "return_full_text": False, # Return only the generated text
    "do_sample": False, # Use greedy decoding
    # "temperature": 0.0, # TASK 2 of Part 2: uncomment, set a positive value, and set "do_sample" to True
}
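
To make the difference between greedy decoding and sampling concrete, here is a toy sketch in plain Python. It does not call the model: the tokens and scores are invented for illustration, and `softmax` and `greedy_choice` are hypothetical helpers rather than `transformers` functions.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # Subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits):
    """Greedy decoding: always pick the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Invented scores for three candidate next tokens
toy_tokens = ["cat", "dog", "fish"]
toy_logits = [2.0, 1.0, 0.1]

print("greedy pick:", toy_tokens[greedy_choice(toy_logits)])
for t in (0.5, 1.0, 2.0):
    probs = softmax(toy_logits, temperature=t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```

Greedy decoding always returns the top-scoring token, so repeated runs give the same text. Sampling draws from the probabilities instead, and raising the temperature flattens them so that less likely tokens are chosen more often.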

We next use a specific prompting format recommended on the [model card](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct). The format alternates between the `"assistant"` and `"user"` roles to generate a chat-like conversation. In this case, the assistant provides the text as context, and the user asks questions about the text. The assistant then generates a response to the user's query.

In [None]:
def query_paper(query, text):
    """Generates a response to a query about a text using the text generation pipeline."""
    prompt = [
        {"role": "assistant", "content": "What do you want to know about the text?:\n-------------------\n" + text},
        {"role": "user", "content": query}
    ]
    output = pipe(prompt, **generation_args)
    return '\n'.join(textwrap.wrap(output[0]['generated_text'], 100))

print(query_paper("What is the title of the paper?", against_justaism))
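
Behind the scenes, the list of role and content pairs is turned into a single tagged string before it reaches the model. The sketch below is a simplified approximation of the tag-based layout described on the model card (`<|user|>`, `<|assistant|>`, `<|end|>`). The real template is applied by the tokenizer and may differ in details, and `render_chat` is a hypothetical helper written only for illustration.

```python
def render_chat(messages):
    """Render chat messages as one tagged string (simplified illustration)."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    parts.append("<|assistant|>\n")  # Cue the model to write the next assistant turn
    return "".join(parts)


demo_messages = [
    {
        "role": "assistant",
        "content": "What do you want to know about the text?:\n"
                   "-------------------\nSome article text.",
    },
    {"role": "user", "content": "What is the title of the paper?"},
]

rendered = render_chat(demo_messages)
print(rendered)
```

Seeing the flattened string makes it clearer why the model treats the article as something it "said" earlier and the query as the latest user turn.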

In [None]:
print(query_paper("Return the full abstract from the paper", against_justaism))

In [None]:
print(query_paper("Summarise the paper", against_justaism))

**TASK 1**: Come up with your own prompts to extract information from the paper. Make the prompts progressively more difficult until the model no longer provides satisfactory responses.

**TASK 2**: Try tweaking the prompt that produced the unsatisfactory response to see if you can get a better one.
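
One convenient way to work through these tasks is to run several prompt variants against the same text and read the responses side by side. The sketch below is only a suggestion: `compare_prompts` and `fake_ask` are hypothetical helpers, and `fake_ask` stands in for the model so that the example runs without a GPU.

```python
def compare_prompts(prompts, text, ask):
    """Run each prompt against the same text and collect the responses.

    `ask` is any function taking (query, text) and returning a string,
    such as `query_paper` from this notebook.
    """
    return {prompt: ask(prompt, text) for prompt in prompts}


def fake_ask(query, text):
    """Stand-in for the model so the sketch runs without a GPU."""
    return f"[response to {query!r} over {len(text)} characters]"


variants = [
    "Summarise the paper",
    "Summarise the paper in three bullet points",
]
results = compare_prompts(variants, "Some article text.", fake_ask)
for prompt, response in results.items():
    print(prompt, "->", response)
```

In the notebook, passing `query_paper` as `ask` and `against_justaism` as `text` would compare real model responses.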

## Part 2: Extracting information from Interview Transcripts
This section will focus on extracting information from interview transcripts. Extracting information from interviews can be tricky because spoken language is informal and often ungrammatical. We will be working with the transcript of an interview with Ilya Sutskever (conducted by Sven Strohband, processed [here](https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a)). The code begins by reading the interview transcript from the `ilya_interview.txt` file:

In [None]:
# Reads 'ilya_interview.txt' file into a string
with open('ilya_interview.txt', 'r') as file:
    ilya_interview = file.read()

# Print the first 1000 characters to verify content extraction
ilya_interview[:1000]

We start by asking a simple, high-level question about the interview:

In [None]:
print(query_paper("What does Ilya think of AI safety?", ilya_interview))

As you can see, the response isn't quite satisfactory: it sometimes misrepresents Ilya's views or hallucinates information. In the next prompt, we guide the model to focus on key sentences that illustrate Ilya's stance on AI safety, which should reduce the risk of hallucination:

In [None]:
print(query_paper("Extract key sentences illustrating Ilya's stance about AI Safety", ilya_interview))
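
When the model is asked to extract sentences, one rough way to spot hallucinated quotes is to check whether each returned sentence actually occurs in the transcript. The sketch below is a crude check under stated assumptions: `normalise` and `unsupported_sentences` are hypothetical helpers, the sentence splitting is naive, and responses that add list markers or quotation marks around a sentence would be flagged even when the quote is genuine.

```python
import re


def normalise(text):
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_sentences(response, source):
    """Return sentences from `response` that do not occur verbatim in `source`."""
    haystack = normalise(source)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    return [s for s in sentences if normalise(s) not in haystack]


# Invented example text, not taken from the interview
demo_source = "I think safety is important.\nWe should work on it now."
demo_response = "I think safety is important. Safety is already solved."

flagged = unsupported_sentences(demo_response, demo_source)
print(flagged)
```

Any sentence that ends up in `flagged` is worth checking by hand against the transcript.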

The two prompting styles can be combined to provide more context while still focusing on key sentences:

In [None]:
# Trying the first question again but with a more directed style of prompting to reduce risks of hallucination
print(query_paper("What does Ilya think of AI safety? Focus on key sentences from the transcript and expand on these", ilya_interview))


**TASK 1**: Now prompt the model in the code block below to pretend that it is Ilya, based on the interview transcript, and ask it some completely unrelated and absurd questions. What kind of responses do you get?

In [ ]:
# Add your TASK 1 prompt here
print(query_paper("[ADD PROMPT]", ilya_interview))

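**TASK 2**: Try tweaking the `"temperature"` parameter in the `generation_args` dictionary to see how it affects the generated text. You will need to uncomment it first and, because temperature only has an effect when sampling, also set `"do_sample"` to `True` and use a value greater than 0. Does it improve the quality of the response for TASK 1?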
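
If you want to try several temperatures without editing the dictionary each time, a small helper can build a sampling variant of the arguments. This is a minimal sketch: `with_sampling` is a hypothetical helper, and the base values are repeated here only so the block is self-contained.

```python
def with_sampling(base_args, temperature):
    """Return a copy of `base_args` with sampling enabled at `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be a positive number when sampling")
    return {**base_args, "do_sample": True, "temperature": temperature}


# Same values as the notebook's `generation_args`
base_generation_args = {
    "max_new_tokens": 300,
    "return_full_text": False,
    "do_sample": False,
}

sampling_args = with_sampling(base_generation_args, temperature=0.7)
print(sampling_args)
```

The result could then be passed to the pipeline in place of `generation_args`, for example as `pipe(prompt, **sampling_args)`. The original dictionary is left unchanged, so greedy decoding remains available for comparison.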