In [1]:
# install required packages (Colab)
!pip install -q transformers sentencepiece torch sentence-transformers faiss-cpu nltk

In [9]:
# Imports
import re, json
from pprint import pprint
import numpy as np
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the Punkt tokenizer data from this resource

from nltk.tokenize import sent_tokenize, word_tokenize

from transformers import pipeline
from sentence_transformers import SentenceTransformer
import faiss

print("Imports done")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Imports done


In [10]:
# Utilities
def clean_text(text):
    text = re.sub(r"\s+", " ", text).strip()
    return text

def sent_tokenize_wrapper(text):
    return sent_tokenize(text)


In [11]:
# Simple extractive summarizer (frequency scoring)
import string
from collections import defaultdict

def extractive_summarizer(text, top_n=3):
    text = clean_text(text)
    sentences = sent_tokenize_wrapper(text)
    if len(sentences) <= top_n:
        return " ".join(sentences)

    words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
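    # nltk.corpus exposes `stopwords` as a lazy loader, so testing for the attribute
    # never detects a missing download. Fetch the corpus on demand instead.
    try:
        stopwords = set(nltk.corpus.stopwords.words("english"))
    except LookupError:
        nltk.download("stopwords", quiet=True)
        stopwords = set(nltk.corpus.stopwords.words("english"))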
    freq = defaultdict(int)
    for w in words:
        if w not in stopwords:
            freq[w] += 1
    if not freq:
        return " ".join(sentences[:top_n])

    max_freq = max(freq.values())
    for w in list(freq.keys()):
        freq[w] = freq[w] / max_freq

    sent_scores = {}
    for s in sentences:
        for w in word_tokenize(s.lower()):
            if w in freq:
                sent_scores[s] = sent_scores.get(s, 0) + freq[w]

    ranked = sorted(sent_scores.items(), key=lambda x: x[1], reverse=True)
    top_sentences = [s for s,_ in ranked[:top_n]]
    # keep original order for coherence
    top_sentences = sorted(top_sentences, key=lambda s: sentences.index(s))
    return " ".join(top_sentences)


In [12]:
# Testing extractive Summarizer
sample_text = """Deep learning has revolutionized many fields by enabling neural networks to learn complex representations from data.
However, training deep models requires lots of labeled data and careful regularization.
Transfer learning and pretrained transformer models have alleviated some of these constraints by offering strong starting points.
Transformers in particular have set new state-of-the-art results across many NLP tasks such as summarization and translation.
This ecosystem makes practical NLP development much faster."""
print("Extractive summary:")
print(extractive_summarizer(sample_text, top_n=2))


Extractive summary:
Deep learning has revolutionized many fields by enabling neural networks to learn complex representations from data. Transfer learning and pretrained transformer models have alleviated some of these constraints by offering strong starting points.
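
To make the scoring rule easier to inspect, here is a minimal, dependency-free sketch of the same idea: count words, divide by the largest count, and sum those weights per sentence. The helper name `score_sentences` and the regex tokenizer are assumptions for illustration; the cell above uses NLTK's tokenizers and a stopword list.

```python
import re
from collections import Counter

def score_sentences(sentences, stopwords=frozenset()):
    """Score each sentence by the summed normalized frequency of its words."""
    # Lowercase alphabetic tokens only, as a stand-in for word_tokenize + isalpha().
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words if w not in stopwords)
    if not counts:
        return [0.0] * len(sentences)
    max_count = max(counts.values())
    return [sum(counts[w] / max_count for w in words if w in counts) for words in tokenized]

sentences = ["Cats chase mice.", "Mice eat cheese.", "Dogs bark."]
scores = score_sentences(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(scores, sentences[best])
```

Because the score is a sum, longer sentences tend to rank higher. Dividing each score by the sentence's word count is one possible variant if that bias is unwanted.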


In [13]:
# Setting up abstractive summarizer (small model)
# Using a compact distilbart model for speed.
summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')

def abstractive_summarize(text, min_length=25, max_length=100):
    text = clean_text(text)
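    # The pipeline does not chunk long inputs. truncation=True clips text beyond the
    # model's input limit; for long documents, split the text and summarize each chunk.
    out = summarizer(text, min_length=min_length, max_length=max_length, truncation=True)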
    return out[0]['summary_text']


Device set to use cpu


In [14]:
# Test abstractive summarizer
print("Abstractive summary:")
print(abstractive_summarize(sample_text, min_length=20, max_length=60))


Abstractive summary:
 Transfer learning and pretrained transformer models have alleviated some of these constraints by offering strong starting points . Transformers in particular have set new state-of-the-art results across many NLP tasks such as summarization and translation .
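
The summarizer cell recommends manual chunking for long text. One way to do that is sketched below, under the assumption that a word count is a good enough proxy for the model's token budget. The helper `chunk_sentences`, its regex sentence splitter, and the default `max_words=300` are illustrative choices, not part of the notebook.

```python
import re

def chunk_sentences(text, max_words=300):
    """Group whole sentences into chunks of at most max_words words each."""
    # Naive split after ., ! or ? followed by whitespace; sent_tokenize would be more robust.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = "One two three. Four five six. Seven eight nine ten."
print(chunk_sentences(demo, max_words=6))
```

In the notebook, each chunk could be passed to `abstractive_summarize` and the partial summaries joined. A single sentence longer than `max_words` still becomes its own oversized chunk, so truncation remains a useful safety net.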


In [17]:
# Building a small KB for our chatbot
kb_texts = [
    "NLP stands for Natural Language Processing and deals with analyzing and generating human language.",
    "Text summarization shortens long documents into concise summaries while retaining key information.",
    "Transformers are deep learning models using self-attention; commonly used for text generation and understanding.",
    "Transfer learning uses pretrained models as a starting point for new tasks, saving time and data."
]

# Optionally load KB from a JSON file or text files:
# with open('my_kb.json') as f: kb_texts = json.load(f)
print("KB documents:", len(kb_texts))
pprint(kb_texts)


KB documents: 4
['NLP stands for Natural Language Processing and deals with analyzing and '
 'generating human language.',
 'Text summarization shortens long documents into concise summaries while '
 'retaining key information.',
 'Transformers are deep learning models using self-attention; commonly used '
 'for text generation and understanding.',
 'Transfer learning uses pretrained models as a starting point for new tasks, '
 'saving time and data.']


In [18]:
# Encode KB and build FAISS (cosine similarity via normalized inner product)
embed_model = SentenceTransformer('all-MiniLM-L6-v2')  # small & fast
kb_embeddings = embed_model.encode(kb_texts, convert_to_numpy=True, show_progress_bar=True)

# normalize for cosine similarity
faiss.normalize_L2(kb_embeddings)

dim = kb_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)   # inner product on normalized vectors == cosine similarity
index.add(kb_embeddings)
print("FAISS index built with dimension", dim)


FAISS index built with dimension 384
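
The comment in the cell above relies on the identity that the inner product of two L2-normalized vectors equals their cosine similarity. The small pure-Python check below illustrates it on toy 2-D vectors; the helper names are made up for this sketch and FAISS is not involved.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
ip_normalized = dot(l2_normalize(a), l2_normalize(b))
print(ip_normalized, cosine(a, b))  # both values are approximately 0.96
```

This is why `faiss.normalize_L2` has to be applied to both the KB embeddings and every query embedding before an `IndexFlatIP` search returns cosine scores.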


In [19]:
# Retrieval
def retrieve(query, top_k=2):
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q_emb)
    D, I = index.search(q_emb, top_k)
    results = [kb_texts[i] for i in I[0]]
    scores = [float(d) for d in D[0]]
    return list(zip(results, scores))

# quick test
print(retrieve("What are transformers used for?", top_k=2))


[('Transformers are deep learning models using self-attention; commonly used for text generation and understanding.', 0.5590986013412476), ('Transfer learning uses pretrained models as a starting point for new tasks, saving time and data.', 0.2340046465396881)]
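
In the test above, the second hit scores only about 0.23, which suggests it adds little to the context. A possible refinement is to drop hits below a similarity threshold before building the prompt. The helper `filter_hits` and the cutoff `min_score=0.3` are assumptions for illustration and would need tuning on real queries.

```python
def filter_hits(hits, min_score=0.3):
    """Keep (text, score) pairs whose similarity is at least min_score."""
    return [(text, score) for text, score in hits if score >= min_score]

# Scores rounded from the retrieval test above.
hits = [
    ("Transformers are deep learning models using self-attention.", 0.559),
    ("Transfer learning uses pretrained models as a starting point.", 0.234),
]
kept = filter_hits(hits)
print(kept)
```

If every hit falls below the threshold the list is empty, so a caller would need a fallback such as answering "I don't know" or keeping the single best hit.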


In [20]:
# Generator pipeline (T5 small)
generator = pipeline('text2text-generation', model='t5-small')

def generate_answer(user_query, top_k=2):
    retrieved = retrieve(user_query, top_k=top_k)
    contexts = [t for t,_ in retrieved]
    # creating a concise prompt: context + question
    prompt = "context: " + " || ".join(contexts) + " question: " + user_query + " answer:"
    out = generator(prompt, max_new_tokens=128, do_sample=False)
    return out[0]['generated_text'], retrieved

# test
ans, retrieved = generate_answer("How can transfer learning help me?")
print("Generated answer:", ans)
print("Retrieved sources:")
pprint(retrieved)


Device set to use cpu


Generated answer: True
Retrieved sources:
[('Transfer learning uses pretrained models as a starting point for new tasks, '
  'saving time and data.',
  0.6322948932647705),
 ('Transformers are deep learning models using self-attention; commonly used '
  'for text generation and understanding.',
  0.3159871995449066)]
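
The generated answer "True" suggests the prompt layout may not suit the model. T5's question-answering training data used inputs of the form `question: ... context: ...`, so one thing to try is a prompt in that order. The helper `build_prompt` below is a hypothetical sketch; whether it changes the answer above has not been tested here.

```python
def build_prompt(question, contexts):
    """Format a T5-style QA prompt: question first, then the joined contexts."""
    # Skip empty passages and join the rest with single spaces.
    context = " ".join(c.strip() for c in contexts if c.strip())
    return f"question: {question.strip()} context: {context}"

prompt = build_prompt(
    "How can transfer learning help me?",
    ["Transfer learning uses pretrained models as a starting point for new tasks, saving time and data."],
)
print(prompt)
```

Inside `generate_answer`, the result of `build_prompt(user_query, contexts)` could replace the hand-built `prompt` string.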


In [24]:
# Save KB and FAISS index (so you can reuse without re-encoding)
import json
import numpy as np
np.save('kb_embeddings.npy', kb_embeddings)
with open('kb_texts.json', 'w') as f:
    json.dump(kb_texts, f)
faiss.write_index(index, 'kb_index.faiss')
print("Saved: kb_embeddings.npy, kb_texts.json, kb_index.faiss")



Saved: kb_embeddings.npy, kb_texts.json, kb_index.faiss
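
To reuse the saved files in a later session, the texts and the index need to be loaded back and should still line up one-to-one. The sketch below assumes the file names used in the save cell; `load_kb_texts` and `load_kb` are hypothetical helper names. `faiss.read_index` is the counterpart of `faiss.write_index`, and `index.ntotal` reports how many vectors the index holds.

```python
import json
import os
import tempfile

def load_kb_texts(path="kb_texts.json"):
    """Load the KB passages and check they are a JSON list of strings."""
    with open(path, encoding="utf-8") as f:
        texts = json.load(f)
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise ValueError(f"{path} should contain a JSON list of strings")
    return texts

def load_kb(texts_path="kb_texts.json", index_path="kb_index.faiss"):
    """Reload passages and the FAISS index, checking that their sizes agree."""
    import faiss  # imported lazily so load_kb_texts works without faiss installed
    texts = load_kb_texts(texts_path)
    index = faiss.read_index(index_path)
    if index.ntotal != len(texts):
        raise ValueError("index size and number of KB texts differ")
    return texts, index

# Round-trip check for the JSON part in a temporary directory (no FAISS needed).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kb_texts.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(["doc one", "doc two"], f)
    reloaded = load_kb_texts(path)
print(reloaded)
```

The saved `kb_embeddings.npy` is only needed if the index has to be rebuilt; a flat index already stores the vectors it was given.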


In [25]:
# Small interactive loop: choose mode
print("Choose mode: 1) summarize  2) chat  3) exit")
while True:
    mode = input("\nEnter mode (1/2/3): ").strip()
    if mode == '1':
        txt = input("\nPaste text to summarize (or 'back' to choose mode):\n")
        if txt.strip().lower() == 'back': continue
        which = input("Which summarizer? (e) extractive, (a) abstractive [e/a]: ").strip().lower()
        if which == 'e':
            print("\nExtractive summary:\n", extractive_summarizer(txt, top_n=3))
        else:
            print("\nAbstractive summary:\n", abstractive_summarize(txt, min_length=20, max_length=80))
    elif mode == '2':
        q = input("\nEnter your question (or 'back' to choose mode):\n")
        if q.strip().lower() == 'back': continue
        answer, retrieved = generate_answer(q, top_k=2)
        print("\nBot answer:\n", answer)
        print("\n(Used sources and scores):")
        pprint(retrieved)
    elif mode == '3' or mode.lower() in ['exit','quit']:
        print("Exiting. Re-run cell to start again.")
        break
    else:
        print("Invalid option. Enter 1, 2, or 3.")


Choose mode: 1) summarize  2) chat  3) exit

Enter mode (1/2/3): 1

Paste text to summarize (or 'back' to choose mode):
A data scientist is a professional who extracts knowledge and insights from large, complex datasets by using a combination of programming, mathematics, statistics, and domain expertise to build predictive models and solve business problems. They employ advanced tools like machine learning algorithms to identify patterns, make forecasts, and translate data into actionable insights that guide organizational decision-making and improve operations.  
Which summarizer? (e) extractive, (a) abstractive [e/a]: e

Extractive summary:
 A data scientist is a professional who extracts knowledge and insights from large, complex datasets by using a combination of programming, mathematics, statistics, and domain expertise to build predictive models and solve business problems. They employ advanced tools like machine learning algorithms to identify patterns, make forecasts, and trans

Your max_length is set to 80, but your input_length is only 73. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)



Abstractive summary:
  Data scientist is a professional who extracts knowledge and insights from large, complex datasets . They employ advanced tools like machine learning algorithms to identify patterns, make forecasts, and translate data into actionable insights that guide organizational decision-making .

Enter mode (1/2/3): 2

Enter your question (or 'back' to choose mode):
What does NLP deal with?



Bot answer:
 analyzing and generating human language

(Used sources and scores):
[('NLP stands for Natural Language Processing and deals with analyzing and '
  'generating human language.',
  0.7764977216720581),
 ('Text summarization shortens long documents into concise summaries while '
  'retaining key information.',
  0.3089718818664551)]

Enter mode (1/2/3): 3
Exiting. Re-run cell to start again.
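
The `max_length` warning in the transcript appears because the fixed budget of 80 tokens exceeds the 73-token input. One way to avoid it is to derive the limits from the input length, following the warning's own suggestion of roughly half the input. The helper `pick_lengths` is a hypothetical sketch; in the notebook the token count could come from `len(summarizer.tokenizer(text)["input_ids"])`.

```python
def pick_lengths(n_input_tokens, min_length=20, max_length=80):
    """Shrink (min_length, max_length) so the summary budget stays below the input length."""
    # Cap the maximum at half the input, but never below one token.
    capped_max = min(max_length, max(1, n_input_tokens // 2))
    # The minimum may not exceed the (possibly reduced) maximum.
    capped_min = min(min_length, capped_max)
    return capped_min, capped_max

print(pick_lengths(73))   # (20, 36)
print(pick_lengths(500))  # (20, 80)
```

The returned pair could then be passed as `min_length` and `max_length` to `abstractive_summarize` in mode 1 of the loop.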
