# Unit 1 – Generative AI & NLP Hands-On

## Objective
To explore foundational NLP tasks and benchmark different Transformer architectures
by observing their behavior across text generation, tokenization, POS tagging, NER,
summarization, and architecture–task mismatch experiments.


In [21]:
#PES2UG23CS703
from transformers import pipeline, set_seed, GPT2Tokenizer
import os
import nltk

In [22]:
file_path = "unit1.txt" #PES2UG23CS703

In [23]:
#PES2UG23CS703
try:
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    print("File loaded successfully!")
except FileNotFoundError:
    print(f"Error: '{file_path}' not found.")

File loaded successfully!


In [24]:
#PES2UG23CS703
print("---Data Preview---")
print(text[:500] + "...")

---Data Preview---
Generative AI and Its Applications: A Foundational Briefing

Executive Summary

This document provides a comprehensive overview of Generative AI, synthesizing foundational concepts, technological underpinnings, and practical applications as outlined in the course materials from PES University. Generative AI represents a transformative subset of Artificial Intelligence focused on creating novel content, a capability primarily driven by the advent of Large Language Models (LLMs). The evolution of ...


In [25]:
#PES2UG23CS703
fast_generator = pipeline('text-generation', model='distilgpt2')
set_seed(42)
output_fast = fast_generator(
    "Generative AI transformed pattern learning into content creation",
    max_new_tokens=80,
    min_new_tokens=30,
    do_sample=True,
    temperature=0.95,
    top_p=0.9,
    repetition_penalty=1.2
)

print(output_fast[0]['generated_text'])

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generative AI transformed pattern learning into content creation, which could be applied to software development.
The study is supported by a grant from the Foundation for Science and Technology at Cambridge University (LDS) in collaboration with IETF Grant W5-2317087 as part of an extension of its grants agreement between RIAA GIS Research Fund under CC BY2B - US$1M/S0110G0RQ4
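
The generation call above samples with `temperature=0.95` and `top_p=0.9`. The following is a minimal sketch of what those two settings do to a next-token distribution, using made-up logits over a four-token vocabulary. The helper names and the logits are illustrative assumptions, and this is not the `transformers` implementation, which works on tensors and differs in detail.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores, not model output
probs = softmax_with_temperature(toy_logits, temperature=0.95)
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)  # token index -> renormalized probability
```

With these toy values the least likely token falls outside the nucleus and can never be sampled. Lower temperatures sharpen the distribution, so fewer tokens are needed to reach the `top_p` mass.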


In [26]:
#PES2UG23CS703
generator_gpt2 = pipeline(
    "text-generation",
    model="gpt2"
)

set_seed(42)

output_gpt2 = generator_gpt2(
    "Generative AI transformed pattern learning into content creation",
    max_new_tokens=80,
    min_new_tokens=30,
    do_sample=True,
    temperature=0.95,
    top_p=0.9,
    repetition_penalty=1.2
)

print(output_gpt2[0]["generated_text"])

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generative AI transformed pattern learning into content creation, which could be applied to existing systems.
: A technique developed by a team of researchers led by neuroscientist Professors Kao-Yee and Tseung Lee that generates artificial intelligence at scale for example as well as in real life using human brain waves rather than machines or computer code : As part with the Neural Information Processing Toolkit (NIPT), an advanced development tool called


In [27]:
#PES2UG23CS703
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sentence = "Generative AI transformed pattern learning into content creation."

tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['Gener', 'ative', 'ĠAI', 'Ġtransformed', 'Ġpattern', 'Ġlearning', 'Ġinto', 'Ġcontent', 'Ġcreation', '.']
Token IDs: [8645, 876, 9552, 14434, 3912, 4673, 656, 2695, 6282, 13]
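
In the output above, `Ġ` is how GPT-2's byte-level BPE vocabulary displays a leading space, so `'ĠAI'` means " AI". A small sketch that rebuilds the sentence from the printed tokens is shown here. The plain string replacement is a simplification that only works for ASCII text like this sentence; the tokenizer's own `convert_tokens_to_string` handles the general case.

```python
# Tokens copied from the output above.
tokens = ['Gener', 'ative', 'ĠAI', 'Ġtransformed', 'Ġpattern',
          'Ġlearning', 'Ġinto', 'Ġcontent', 'Ġcreation', '.']

# Concatenate the subwords and turn the space marker back into a space.
reconstructed = "".join(tokens).replace("Ġ", " ")
print(reconstructed)

# Tokens without the marker continue the previous word ("Gener" + "ative").
word_starts = [t for t in tokens if t.startswith("Ġ")]
print(len(word_starts), "tokens begin a new space-separated word")
```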


In [28]:
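#PES2UG23CS703
import nltk
# Tokenizer and POS-tagger resources. Newer NLTK releases look for the
# "_tab" / "_eng" variants, so both old and new names are downloaded.
nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("averaged_perceptron_tagger")
nltk.download("averaged_perceptron_tagger_eng")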

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [29]:
#PES2UG23CS703
from nltk import word_tokenize, pos_tag
# Tokenize into words first, then tag each word with its part of speech.
pos_tags = pos_tag(word_tokenize(sentence))
print(f"POS TAGS: {pos_tags}")

POS TAGS: [('Generative', 'JJ'), ('AI', 'NNP'), ('transformed', 'VBD'), ('pattern', 'JJ'), ('learning', 'VBG'), ('into', 'IN'), ('content', 'JJ'), ('creation', 'NN'), ('.', '.')]
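
The tags above follow the Penn Treebank scheme (`JJ` adjective, `NN`/`NNP` noun, `VBD`/`VBG` verb forms, `IN` preposition). Note that the tagger marks "pattern" and "content" as adjectives, although in this sentence they arguably act as noun modifiers. A short sketch that tallies the tags by coarse class follows. Grouping by the first two characters of the tag is an assumption made for this illustration, not an NLTK feature.

```python
from collections import Counter

# (word, tag) pairs copied from the POS TAGS output above.
pos_tags = [('Generative', 'JJ'), ('AI', 'NNP'), ('transformed', 'VBD'),
            ('pattern', 'JJ'), ('learning', 'VBG'), ('into', 'IN'),
            ('content', 'JJ'), ('creation', 'NN'), ('.', '.')]

# Collapse fine-grained tags (NNP, VBD, ...) to a two-character prefix.
coarse_counts = Counter(tag[:2] for _, tag in pos_tags)

for coarse_tag, count in coarse_counts.most_common():
    print(f"{coarse_tag:<3} {count}")
```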


In [30]:
#PES2UG23CS703
ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple"  # merge subword pieces into whole entities
)

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


In [31]:
#PES2UG23CS703
snippet = text[:1000]
entities = ner_pipeline(snippet)
print(f"{'Entity':<20} | {'Type':<10} | {'Score':<6}")
print("-" * 45)

for entity in entities:
    if entity["score"] > 0.90:
        print(f"{entity['word']:<20} | {entity['entity_group']:<10} | {entity['score']:.2f}")

Entity               | Type       | Score 
---------------------------------------------
AI                   | MISC       | 0.98
PES University       | ORG        | 0.99
AI                   | MISC       | 0.98
Large Language Models | MISC       | 0.91
LLMs                 | MISC       | 0.90
Transformer          | MISC       | 0.99
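
The "Large Language Models" row above spills past its column because the entity is 21 characters long and the format string pads to 20. One possible fix is to size the first column from the data, sketched here on a few rows copied from the output. The scores are the rounded values as printed, used only for illustration.

```python
# (entity, type, rounded score) rows copied from the table above.
rows = [
    ("AI", "MISC", 0.98),
    ("PES University", "ORG", 0.99),
    ("Large Language Models", "MISC", 0.91),
    ("Transformer", "MISC", 0.99),
]

# Size the first column to the longest entity (or the header, if longer).
width = max(len("Entity"), *(len(word) for word, _, _ in rows))

header = f"{'Entity':<{width}} | {'Type':<10} | {'Score':<6}"
lines = [header, "-" * len(header)]
for word, group, score in rows:
    lines.append(f"{word:<{width}} | {group:<10} | {score:.2f}")

print("\n".join(lines))
```

The "LLMs" row in the original output shows 0.90 even though the filter is `score > 0.90`. That is consistent with a raw score slightly above 0.90 that `:.2f` rounds down for display.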


In [32]:
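#PES2UG23CS703
# Note: this reassigns `text`, replacing the contents loaded from unit1.txt.
# The summarization and question-answering cells below run on this passage.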
text = """Natural Language Processing (NLP) forms the foundation of Large Language Models and focuses on enabling computers to understand, interpret, and generate human language. Since machines cannot process raw text directly, NLP techniques convert language into numerical representations that models can learn from.

The first step in this process is tokenization, where input text is broken into smaller units such as words or subwords. Tokenization defines the vocabulary of a language model and allows text to be handled systematically. After tokenization, word embeddings map each token to a numerical vector. These vectors are learned representations that capture semantic relationships, allowing related words to be positioned closer together in vector space.

NLP also analyzes the grammatical structure of language. Part-of-Speech tagging assigns grammatical categories such as nouns, verbs, and adjectives to words, helping resolve ambiguity and improve syntactic understanding. Named Entity Recognition builds on this by identifying real-world entities including people, organizations, locations, numerical values, and temporal expressions, enabling deeper semantic interpretation.

A major challenge in NLP is ambiguity. Words may have multiple meanings, sentences can be structured in more than one valid way, and pronouns may refer to different entities depending on context. Figurative language further complicates interpretation, requiring models to infer meaning beyond literal text.

Text classification is a fundamental NLP task that involves assigning predefined labels to text. One commonly used algorithm is the Naive Bayes classifier, which applies probabilistic reasoning to estimate class membership based on word occurrence. Although it assumes independence between words, it remains efficient and effective. Techniques such as Laplace smoothing are used to handle unseen words and ensure reliable predictions."""
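
The last paragraph of the passage describes Naive Bayes classification with Laplace smoothing. A minimal sketch of that idea on a toy corpus is given here. The documents, labels, and function names are all invented for illustration, and words that never appeared in training are simply skipped, which is one common convention rather than the only one.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """docs is a list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {word for tokens, _ in docs for word in tokens}
    return class_counts, word_counts, vocab

def predict(model, tokens, alpha=1.0):
    """Return the label with the highest log-probability.
    alpha=1.0 is Laplace (add-one) smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for word in tokens:
            if word not in vocab:
                continue  # word never seen in training: ignored here
            # Smoothing keeps the probability non-zero for words
            # that were seen in other classes but not in this one.
            score += math.log((word_counts[label][word] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

toy_docs = [  # hypothetical training data
    ("tokenization splits text".split(), "nlp"),
    ("embeddings map tokens".split(), "nlp"),
    ("goal scored match".split(), "sports"),
    ("team won match".split(), "sports"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, "tokens text".split()))
print(predict(model, "team match".split()))
```

Without smoothing, a single word missing from a class would drive that class's probability to zero regardless of the other words.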


In [33]:
#PES2UG23CS703
summarizer_distil = pipeline(
    "summarization",
    model="sshleifer/distilbart-cnn-12-6"
)

summary_distil = summarizer_distil(
    text,
    max_length=120,
    min_length=50,
    do_sample=False
)

print("DistilBART Summary:")
print(summary_distil[0]["summary_text"])

Device set to use cpu


DistilBART Summary:
 Natural Language Processing (NLP) forms the foundation of Large Language Models and focuses on enabling computers to understand, interpret, and generate human language . NLP techniques convert language into numerical representations that models can learn from . The first step in this process is tokenization, where input text is broken into smaller units such as words or subwords .


In [34]:
#PES2UG23CS703
summarizer_bart = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"
)

summary_bart = summarizer_bart(
    text,
    max_length=80,
    min_length=30,
    do_sample=False
)

print("\nBART-Large Summary:")
print(summary_bart[0]["summary_text"])

Device set to use cpu



BART-Large Summary:
Natural Language Processing (NLP) forms the foundation of Large Language Models. It focuses on enabling computers to understand, interpret, and generate human language. NLP also analyzes the grammatical structure of language.


In [35]:
#PES2UG23CS703
qa_pipeline = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad"
)

qa_questions = [
    "What role does Natural Language Processing play in Large Language Models?",
    "What challenges are associated with human language understanding?"
]

for question in qa_questions:
    result = qa_pipeline(
        question=question,
        context=text[:5000]
    )
    print(f"\nQuestion: {question}")
    print(f"Answer: {result['answer']}")

Device set to use cpu



Question: What role does Natural Language Processing play in Large Language Models?
Answer: enabling computers to understand, interpret, and generate human language

Question: What challenges are associated with human language understanding?
Answer: ambiguity


In [36]:
#PES2UG23CS703
mask_filler = pipeline(
    "fill-mask",
    model="bert-base-uncased"
)

masked_sentence = "The objective of Generative AI is to produce [MASK] outputs."

predictions = mask_filler(masked_sentence)

for pred in predictions:
    print(f"{pred['token_str']}: {pred['score']:.2f}")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


optimal: 0.13
efficient: 0.07
sustainable: 0.06
desired: 0.05
effective: 0.02

