# **Week 3 Hands-on Lab: Experimenting With Large Language Models**

**Introduction:**

In this hands-on Python notebook, we will experiment with LLMs.
This will help you:
1.	Use a pre-trained LLM from the Hugging Face library for text summarization.
2.	Implement a question-answering task using a pre-trained LLM.
3.	Understand how LLMs perform NLP tasks in real-world scenarios.


# **Part 1: Text Summarization**

**1. Import Necessary Libraries**

We will be using the [Transformers](https://huggingface.co/docs/transformers/en/index) library from [Hugging Face](https://huggingface.co/). The Transformers library provides APIs and tools to easily download and train state-of-the-art pretrained models.

In [None]:
from transformers import pipeline

**2. Set up the Summarization Pipeline**

In [None]:
# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Input a long text for summarization
long_text = """
Artificial intelligence (AI) is rapidly transforming industries, from healthcare and education to finance and entertainment.
Generative AI models, such as Large Language Models (LLMs), are at the forefront of this transformation. These models are
trained on vast datasets and can generate human-like text, enabling applications like automated customer support,
personalized education tools, and content creation. Despite their potential, challenges such as bias, ethical concerns,
and the environmental impact of large-scale training persist. Addressing these challenges is crucial for the responsible
deployment of AI technologies in the future.
"""

# Generate a summary
summary = summarizer(long_text, max_length=50, min_length=25, do_sample=False)
print("Summary:")
print(summary[0]['summary_text'])


Device set to use cuda:0


Summary:
Artificial intelligence (AI) is rapidly transforming industries, from healthcare and education to finance and entertainment. Generative AI models, such as Large Language Models (LLMs), are at the forefront of this transformation. Despite their potential, challenges


**3. Experiment**

* Replace long_text with any article or paragraph of your choice.
* Try different max_length and min_length values to see how they affect the summary.
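
One way to run this experiment is to sweep several length settings in a loop. The sketch below is a minimal helper and not part of the original lab: `sweep_summary_lengths` and the example settings are hypothetical, and it assumes `summarizer` behaves like the pipeline created above (it accepts `max_length`, `min_length`, and `do_sample`, and returns a list of dicts with a `summary_text` key).

```python
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_summary_lengths(
    summarizer: Callable,
    text: str,
    settings: Iterable[Tuple[int, int]],
) -> List[Dict]:
    """Run `summarizer` once per (min_length, max_length) pair and collect the results."""
    results = []
    for min_len, max_len in settings:
        if min_len >= max_len:
            raise ValueError(f"min_length ({min_len}) must be smaller than max_length ({max_len})")
        output = summarizer(text, max_length=max_len, min_length=min_len, do_sample=False)
        summary = output[0]["summary_text"].strip()
        results.append(
            {
                "min_length": min_len,
                "max_length": max_len,
                "summary": summary,
                "words": len(summary.split()),  # rough size; the limits themselves are in tokens
            }
        )
    return results


# Example usage in this notebook (the settings are arbitrary illustration values):
# for row in sweep_summary_lengths(summarizer, long_text, [(10, 30), (25, 50), (40, 100)]):
#     print(row["min_length"], row["max_length"], row["words"], row["summary"])
```

Note that `max_length` and `min_length` are measured in tokens, so the word counts reported here will usually be somewhat lower than the token limits.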


# Added a text from Week 1 on AlphaFold: [P1_FE_Practical Exercise (Case Study) AlphaFold](https://raw.githubusercontent.com/falawar7/AAI_633O/refs/heads/main/Week3/P1_LLM.txt)

In [None]:
from transformers import pipeline
import requests

# Load the summarization pipeline
summarizer = pipeline("summarization")

# Input a long text for summarization (URL)
FE_text_url = "https://raw.githubusercontent.com/falawar7/AAI_633O/refs/heads/main/Week3/P1_LLM.txt"

# Fetch the text content from the URL
response = requests.get(FE_text_url)
FE_text = response.text

# Generate a summary
summary1 = summarizer(FE_text, max_length=100, min_length=25, do_sample=False)
print("Summary1:")
print(summary1[0]['summary_text'])

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Summary1:
 Alpha fold AI was developed by Google Deep mind, the first non-experimental method that can rapidly accomplish accuracy with comparable experiments . It can predict the 3D structures of proteins based on their amino acid sequences . AlphaFold could help us face up to the challenge of cleaning up our world .


**Added Different Model: google/flan-t5-base**

In [None]:
from transformers import pipeline
import requests

# Load the summarization pipeline
summarizer = pipeline("summarization", model="google/flan-t5-base")

# Input a long text for summarization (URL)
FE_text_url = "https://raw.githubusercontent.com/falawar7/AAI_633O/refs/heads/main/Week3/P1_LLM.txt"

# Fetch the text content from the URL
response = requests.get(FE_text_url)
FE_text = response.text

# Generate a summary
summary2 = summarizer(FE_text, max_length=100, min_length=25, do_sample=False)
print("Summary2:")
print(summary2[0]['summary_text'])

Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (876 > 512). Running this sequence through the model will result in indexing errors


Summary2:
Alpha fold AI was developed by Google Deep mind, the first non-experimental method that can rapidly accomplish accuracy with comparable experiments. It can predict the 3D structures of proteins based on their amino acid sequences. Researchers were successfully determined protein structured by methods and tools such as X-ray crystallography.
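
The warning in the output above (876 > 512) says the input has more tokens than the model's configured maximum. One common workaround is to split the text into smaller chunks and summarize each chunk separately. The sketch below is an illustration under stated assumptions, not part of the lab: `chunk_by_token_budget` is a hypothetical helper, sentences are split with a simple regular expression, and `count_tokens` is any function you supply that returns a token count for a string.

```python
import re
from typing import Callable, List


def chunk_by_token_budget(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Greedily pack whole sentences into chunks of at most `max_tokens` tokens.

    A single sentence longer than the budget is kept as its own (oversized) chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the current chunk and start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Example usage in this notebook (450 is an arbitrary budget that leaves some headroom):
# count_tokens = lambda s: len(summarizer.tokenizer(s)["input_ids"])
# chunks = chunk_by_token_budget(FE_text, count_tokens, max_tokens=450)
# partial_summaries = [
#     summarizer(c, max_length=100, min_length=25, do_sample=False)[0]["summary_text"]
#     for c in chunks
# ]
```

Summarizing chunk by chunk yields one summary per chunk; joining them (or summarizing the joined result again) is a design choice left open here.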


**Added Different Model: facebook/bart-large-xsum**

In [None]:
from transformers import pipeline
import requests

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-xsum")

# Input a long text for summarization (URL)
FE_text_url = "https://raw.githubusercontent.com/falawar7/AAI_633O/refs/heads/main/Week3/P1_LLM.txt"

# Fetch the text content from the URL
response = requests.get(FE_text_url)
FE_text = response.text

# Generate a summary
summary3 = summarizer(FE_text, max_length=100, min_length=25, do_sample=False)
print("Summary3:")
print(summary3[0]['summary_text'])

Device set to use cuda:0


Summary3:
An artificial intelligence (AI) that can predict the 3D structure of proteins with high accuracy has been developed by Google.


# **All models used max_length = 100 and min_length = 25. The outputs are below:**

**Summary on sshleifer/distilbart-cnn-12-6:** Alpha fold AI was developed by Google Deep mind, the first non-experimental method that can rapidly accomplish accuracy with comparable experiments . It can predict the 3D structures of proteins based on their amino acid sequences . AlphaFold could help us face up to the challenge of cleaning up our world .

**Summary on Model: google/flan-t5-base:** Alpha fold AI was developed by Google Deep mind, the first non-experimental method that can rapidly accomplish accuracy with comparable experiments. It can predict the 3D structures of proteins based on their amino acid sequences. Researchers were successfully determined protein structured by methods and tools such as X-ray crystallography.

**Summary on Model: facebook/bart-large-xsum:**
An artificial intelligence (AI) that can predict the 3D structure of proteins with high accuracy has been developed by Google.
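
The three runs above differ only in the model name, so the comparison can be collected in a single loop. A minimal sketch, assuming each loaded summarizer accepts the same arguments as the pipelines above; `compare_summarizers` and `load_summarizer` are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, Iterable


def compare_summarizers(
    model_names: Iterable[str],
    text: str,
    load_summarizer: Callable[[str], Callable],
    max_length: int = 100,
    min_length: int = 25,
) -> Dict[str, str]:
    """Summarize `text` with each model and return {model_name: summary}."""
    summaries = {}
    for name in model_names:
        summarizer = load_summarizer(name)  # e.g. builds a summarization pipeline
        output = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
        summaries[name] = output[0]["summary_text"].strip()
    return summaries


# Example usage in this notebook (each model is downloaded on first use):
# from transformers import pipeline
# summaries = compare_summarizers(
#     ["sshleifer/distilbart-cnn-12-6", "google/flan-t5-base", "facebook/bart-large-xsum"],
#     FE_text,
#     lambda name: pipeline("summarization", model=name),
# )
# for name, summary in summaries.items():
#     print(f"{name}:\n{summary}\n")
```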


# **Part 2: Question Answering**

**1.	Set Up the Question-Answering Pipeline**

In [None]:
# Load the question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Input context and questions
context = """
The Large Language Model (LLM) GPT-3, developed by OpenAI, is known for its exceptional ability to generate human-like text.
It uses the Transformer architecture and has 175 billion parameters, making it one of the largest AI models in the world.
LLMs like GPT-3 are widely used in applications such as content creation, summarization, and question answering.
"""

question = "What architecture does GPT-3 use?"

# Get the answer
answer = qa_pipeline(question=question, context=context)
print("Answer:")
print(answer['answer'])

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


Answer:
Transformer


**2.	Experiment**

* Modify the context and question variables with your own text and queries.
* Observe how the model adjusts its answers based on the provided input.
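
When trying several questions, it can help to look at the confidence score next to each answer. The question-answering pipeline returns a dict that includes a `score` alongside `answer`. The sketch below is a hypothetical helper (`ask_all` is not part of the lab), and the `min_score` threshold of 0.3 is an arbitrary assumption used only to flag answers worth a second look.

```python
from typing import Callable, Dict, Iterable, List


def ask_all(
    qa: Callable,
    context: str,
    questions: Iterable[str],
    min_score: float = 0.3,
) -> List[Dict]:
    """Ask every question against the same context and record answer and score."""
    results = []
    for question in questions:
        result = qa(question=question, context=context)
        results.append(
            {
                "question": question,
                "answer": result["answer"],
                "score": result["score"],
                "low_confidence": result["score"] < min_score,  # arbitrary cut-off
            }
        )
    return results


# Example usage in this notebook:
# for row in ask_all(qa_pipeline, context, ["What architecture does GPT-3 use?",
#                                           "How many parameters does GPT-3 have?"]):
#     print(f"{row['question']} -> {row['answer']} (score={row['score']:.2f})")
```

Because this kind of model extracts a span from the context, a low score often signals that the context does not contain a direct answer to the question.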


In [None]:
# Load the question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Input context and questions
context = """
Alpha fold AI was developed by Google Deep mind, the first non-experimental method that can rapidly accomplish accuracy with comparable experiments. It can predict the 3D structures of proteins based on their amino acid sequences.
Researchers were successfully determined protein structured by methods and tools such as X-ray crystallography Nuclear Magnetic Resonance and cryoelectronic microscopy but these methods are expensive and take time to have results , but now with Alpha fold 2 and its Evoformer  Neural network unique it can leverage its generative capabilities for generating new output based on the learned patterns from Data processing , analysis , visualization on 3d Models and prediction. The generative capabilities listed below are:
1.	Learning patterns : the learn patterns is being utilizes by the Deep Neural Network(Evoformer) in the data where it focus on the relationships between the sequence of amino acid the outputs in the 3D protein structure
2.	Predictions: once its trained, it uses the learned pattern to predict the 3d model for the protein structures that not been determined experimentally.
3.	Reinforcement process: it refines it predictions by leveraging the data evolution and it constraints which improves and boost the accuracy of prediction over time.

Below is the complete process:




What patterns does AlphaFold learn from training data (e.g. protein structure datasets)?
Well it works on the Amino Acid Iterations by applying the refined MSA and representation using Rotations and translations on each Amino Acid revealing a guess in the 3D protein Structure , the cross -communicated calculations happen 48 times before it reaches the MSA and Pair refined, then it applies physical and chemical constrains dictated by atomic bond, angles and torsional angles then iterates it back to the evoformer and the structures the model process by additional 3 times to have a total for 4 cycles before it predictive the 3d coordinated and output it.
Real world applications:
Discovering Drugs that bind tightly to protein pockets
SARS-CoV-2:  Targeting the Spike protein of SARS Cov-2
Alpahfold’s ability to predict protein structures with high accuracy helps by identifying the precise shape and characteristics of these binding packets.
The challenge was that they had to target  on human cells and they have to go on bed experimental using the tools such as X-ray crystallography Nuclear Magnetic Resonance and cryoelectronic microscopy but these methods are expensive and take time to have results to rebuild the structure but with alphafold’s and its high accuracy things were easier to come as DNA was transcribed to RNA and translated into Amino acid sequence.
The outcome is to accelerate the drug and vaccine development to overcome the pandemic happened in year 2020, these techniques helped to come up with a vaccination in short notice compared to previous virus spread , even though we were obliged to had the vaccination nerveless if it boosted our immunity or didn’t , the future will reveal its advantages and disadvantages .
Other industries can rely on its research:
•	Managing plastic pollution
91% of all plastic ever produced has never been recycled. AlphaFold could help us face up to the challenge of cleaning up our world.
•	Supporting global food supplies
40% of the world’s crops are lost to disease each year. AlphaFold could unlock insights that help keep food on tables.
•	Increasing honeybees’ chances of survival
Researchers have used AlphaFold to understand vitellogenin, a protein fundamental to the immune system of honeybees. This allowed them to cut research time from years to days, laying the foundation for new work that could help increase this vital species’ chances of survival

Conclusion :
Alphafold’s success highlights AI’s potential to accelerate the progress in complex real-world problems by its accuracy in predicting protein structure , within its Deep mind and Evo former neural networks , doors can be opened to ither industries for impactful , successful solution within Animals, Plants , Environment and others.

"""

question1 = "Why Alpha fold was developed?"
question2= "What is the main purpose of Alphafold?"
question3= "how does alpha fold works?"
question4= "how does alphafold helped in covid reserach?"
question5= "what are some other appications other than medical research that alphafold can contribute to?"

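# Get the answer for question 1
# Use question1 here; `question` still holds the GPT-3 question from the previous cell
answer = qa_pipeline(question=question1, context=context)
print("Answer 1:")
print(answer['answer'])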

# Get the answer for question 2
answer = qa_pipeline(question=question2, context=context)
print("\nAnswer 2:")
print(answer['answer'])

# Get the answer for question 3
answer = qa_pipeline(question=question3, context=context)
print("\nAnswer 3:")
print(answer['answer'])

# Get the answer for question 4
answer = qa_pipeline(question=question4, context=context)
print("\nAnswer 4:")
print(answer['answer'])

# Get the answer for question 5
answer = qa_pipeline(question=question5, context=context)
print("\nAnswer 5:")
print(answer['answer'])


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Answer:
cryoelectronic microscopy

Answer 2:
It can predict the 3D structures of proteins based on their amino acid sequences

Answer 3:
DNA was transcribed to RNA and translated into Amino acid sequence

Answer 4:
it refines it predictions by leveraging the data evolution and it constraints

Answer 5:
Nuclear Magnetic Resonance and cryoelectronic microscopy


In [None]:
# Load the question-answering pipeline
qa_pipeline = pipeline("question-answering")

# Reuse the AlphaFold `context` defined in the previous cell


question6= "what are some other appications other than medical research that alphafold can contribute to? name at least 2"

# Get the answer
answer = qa_pipeline(question=question6, context=context)
print("Answer 6:")
print(answer['answer'])


No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


Answer 6:
Animals, Plants , Environment


# **Part 3: Combine Summarization and Question Answering**

**1. Pipeline Integration**

Combine the two pipelines to first summarize a long text and then extract answers to specific questions from the summary.
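
The two steps can also be wrapped in one function so the same flow is easy to rerun with other inputs. This is a minimal sketch, not the lab's reference solution: `summarize_then_answer` is a hypothetical helper, and it assumes `summarizer` and `qa` behave like the pipelines created earlier in this notebook.

```python
from typing import Callable, Dict


def summarize_then_answer(
    summarizer: Callable,
    qa: Callable,
    text: str,
    question: str,
    max_length: int = 50,
    min_length: int = 25,
) -> Dict:
    """Summarize `text`, then answer `question` using only the summary as context."""
    output = summarizer(text, max_length=max_length, min_length=min_length, do_sample=False)
    summary = output[0]["summary_text"].strip()
    result = qa(question=question, context=summary)
    return {"summary": summary, "answer": result["answer"], "score": result.get("score")}


# Example usage in this notebook:
# out = summarize_then_answer(summarizer, qa_pipeline, long_text,
#                             "What are the challenges mentioned in the summary?")
# print(out["summary"])
# print(out["answer"], out["score"])
```

Returning the summary together with the answer makes it easier to check whether the answer was actually present in the summarized text.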


In [None]:
# Summarize the text
summary = summarizer(long_text, max_length=50, min_length=25, do_sample=False)[0]['summary_text']

# Define a question based on the summary
question = "What are the challenges mentioned in the summary?"

# Use the QA pipeline to extract the answer
answer = qa_pipeline(question=question, context=summary)
print("Question:", question)
print("Answer:", answer['answer'])


Question: What are the challenges mentioned in the summary?
Answer: high accuracy


**2. Experiment**

* Change the input text and questions to test the robustness of the combined approach.
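
One simple robustness check is to ask the same question against both the full text and its summary and see whether the answer survives summarization. The sketch below is a hypothetical helper (`compare_full_vs_summary` is not part of the lab) and assumes `qa` behaves like the question-answering pipeline used above.

```python
from typing import Callable, Dict


def compare_full_vs_summary(qa: Callable, question: str, full_text: str, summary: str) -> Dict:
    """Answer `question` from the full text and from the summary, then compare."""
    from_full = qa(question=question, context=full_text)["answer"].strip()
    from_summary = qa(question=question, context=summary)["answer"].strip()
    return {
        "answer_from_full_text": from_full,
        "answer_from_summary": from_summary,
        # Case-insensitive exact match between the two answers
        "same_answer": from_full.lower() == from_summary.lower(),
        # Whether the full-text answer still appears anywhere in the summary
        "answer_text_in_summary": from_full.lower() in summary.lower(),
    }


# Example usage in this notebook:
# report = compare_full_vs_summary(qa_pipeline, question, long_text, summary)
# print(report)
```

If `answer_text_in_summary` is false, the summarizer dropped the relevant detail, so the combined pipeline cannot recover that answer no matter how the question is phrased.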


In [None]:
# Summarize the text
summary = summarizer(long_text, max_length=50, min_length=25, do_sample=False)[0]['summary_text']

# Define a question based on the summary
question22 = "How we can Leverage these techniques?"

# Use the QA pipeline to extract the answer
answer = qa_pipeline(question=question22, context=summary)
print("Question:", question22)
print("Answer:", answer['answer'])


Question: How we can Leverage these techniques?
Answer: high accuracy


# **Summary:**

By completing this activity, you have:

* Gained hands-on experience using LLMs for real-world NLP tasks.
* Understood the capabilities and limitations of pre-trained LLMs.
* Appreciated the practical applications of LLMs in summarization and question answering.

This activity ensures practical understanding of LLMs while showcasing their real-world relevance.
