In [1]:
import re
# Sample text
sample_text = """
Chapter 1: Introduction to NLP

Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It allows computers to understand, interpret, and generate human languages. NLP is widely used in applications such as machine translation, speech recognition, chatbots, and text analysis.

Chapter 2: Tokenization

Tokenization is the process of breaking down text into individual words or phrases, known as tokens. These tokens form the building blocks of NLP tasks. There are different levels of tokenization, such as word-level and sentence-level tokenization. Word-level tokenization breaks a sentence into words, while sentence-level tokenization splits the text into sentences.

Chapter 3: Named Entity Recognition

Named Entity Recognition (NER) is a technique used in NLP to identify and classify key information (entities) within a text. Entities include names of people, organizations, locations, dates, etc. NER models are trained on large datasets and are often fine-tuned for specific domains, such as biomedical texts or legal documents.

Chapter 4: Part-of-Speech Tagging

Part-of-Speech (POS) tagging is the process of labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, etc. POS tagging is essential for understanding the syntactic structure of a sentence and is commonly used in parsing and other NLP tasks.
"""

# Cleaning function
def clean_text(text):
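    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace (including newlines) into single spaces
    # Keep only letters, digits, underscores, whitespace, and . , ! ?
    # Note: this also strips hyphens, colons, and parentheses, so "Word-level" becomes "Wordlevel"
    text = re.sub(r'[^\w\s.,!?]', '', text)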
    return text

# Split the text into chunks
def split_into_chunks(text, chunk_size=100):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Clean and preprocess the text
cleaned_text = clean_text(sample_text)
chunks = split_into_chunks(cleaned_text)

# Display the chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{chunk}\n")

print(f"Preprocessing complete! The text has been split into {len(chunks)} chunks.")


Chunk 1:
Chapter 1 Introduction to NLP Natural Language Processing NLP is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It allows computers to understand, interpret, and generate human languages. NLP is widely used in applications such as machine translation, speech recognition, chatbots, and text analysis. Chapter 2 Tokenization Tokenization is the process of breaking down text into individual words or phrases, known as tokens. These tokens form the building blocks of NLP tasks. There are different levels of tokenization, such as wordlevel and sentencelevel tokenization. Wordlevel tokenization breaks a sentence into words,

Chunk 2:
while sentencelevel tokenization splits the text into sentences. Chapter 3 Named Entity Recognition Named Entity Recognition NER is a technique used in NLP to identify and classify key information entities within a text. Entities include names of people, organizations, locations, d
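
The output above shows Chunk 1 ending mid-sentence ("breaks a sentence into words,"), because fixed-size word windows ignore sentence boundaries. One common mitigation is to let consecutive chunks share some words. The following is a minimal sketch of that idea, not part of the original notebook; the function name `split_into_overlapping_chunks` and the `overlap` parameter are hypothetical, chosen for illustration.

```python
def split_into_overlapping_chunks(text, chunk_size=100, overlap=20):
    """Split text into word chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration, not from the original notebook.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demonstration on ten placeholder words
demo_text = ' '.join(f"w{i}" for i in range(10))
demo_chunks = split_into_overlapping_chunks(demo_text, chunk_size=4, overlap=2)
for demo_chunk in demo_chunks:
    print(demo_chunk)
```

With overlap, a sentence cut at the end of one chunk appears whole at the start of the next (as long as it is shorter than the overlap), at the cost of storing and embedding some words twice.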

In [2]:
!pip install sentence-transformers faiss-cpu


Collecting sentence-transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading sentence_transformers-3.2.1-py3-none-any.whl (255 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 255.8/255.8 kB 4.2 MB/s eta 0:00:00
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.5/27.5 MB 76.3 MB/s eta 0:00:00
Installing collected packages: faiss-cpu, sentence-transformers
Successfully installed faiss-cpu-1.9.0 sentence-transformers-3.2.1


In [3]:
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load pre-trained Sentence-BERT model
model = SentenceTransformer('all-MiniLM-L6-v2')  # You can change to a different BERT-based model if needed

# Generate embeddings for each text chunk
embeddings = model.encode(chunks)

# Convert embeddings to a numpy array for FAISS indexing
embedding_array = np.array(embeddings)

# Build the FAISS index
d = embedding_array.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(d)  # Exact (brute-force) search; reported distances are squared L2
index.add(embedding_array)  # Add embeddings to the index

print(f"Index contains {index.ntotal} vectors.")



Index contains 3 vectors.
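
To make the indexing step less of a black box, here is a pure-Python sketch of what a flat L2 index does conceptually: it stores every vector and, at query time, compares the query against all of them. FAISS documents `IndexFlatL2` as returning squared Euclidean distances, so the sketch does the same. The names `squared_l2` and `flat_l2_search` and the toy vectors are hypothetical; the real index is far faster and operates on float32 NumPy arrays.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`.

    Illustrative stand-in for an exhaustive L2 search, not the FAISS API.
    """
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# Toy 2-D "embeddings" (made up for the example)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
toy_distances, toy_indices = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(toy_distances, toy_indices)  # [1.0, 2.0] [1, 0]
```

Exhaustive search is a reasonable choice for a handful of chunks like the three indexed here; approximate index types only start to pay off for much larger collections.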


In [4]:
# Example query
query = "What is tokenization in NLP?"

# Embed the query
query_embedding = model.encode([query])

# Search the index for the top 3 closest embeddings to the query
k = 3  # Number of nearest neighbors to return
distances, indices = index.search(np.array(query_embedding), k)

# Display the most similar chunks
print("Top matching chunks:")
for i, idx in enumerate(indices[0]):
    print(f"Chunk {idx+1} (Distance: {distances[0][i]:.4f}):")
    print(chunks[idx])
    print()


Top matching chunks:
Chunk 1 (Distance: 0.4059):
Chapter 1 Introduction to NLP Natural Language Processing NLP is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It allows computers to understand, interpret, and generate human languages. NLP is widely used in applications such as machine translation, speech recognition, chatbots, and text analysis. Chapter 2 Tokenization Tokenization is the process of breaking down text into individual words or phrases, known as tokens. These tokens form the building blocks of NLP tasks. There are different levels of tokenization, such as wordlevel and sentencelevel tokenization. Wordlevel tokenization breaks a sentence into words,

Chunk 2 (Distance: 0.8379):
while sentencelevel tokenization splits the text into sentences. Chapter 3 Named Entity Recognition Named Entity Recognition NER is a technique used in NLP to identify and classify key information entities within a text. En
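
The distances printed above are easier to read once converted to cosine similarity. For unit-length vectors, squared L2 distance and cosine similarity are related by d² = 2 − 2·cos, so cos = 1 − d²/2. The sketch below applies that identity to the two distances shown. It assumes the embeddings are unit-normalized, which to my understanding holds for `all-MiniLM-L6-v2` but is worth verifying (for example with `np.linalg.norm(embedding_array, axis=1)`); the function name is hypothetical.

```python
def cosine_from_squared_l2(squared_distance):
    """Convert a squared L2 distance between two UNIT vectors to cosine similarity.

    Assumption: both vectors have length 1; the identity does not hold otherwise.
    """
    return 1.0 - squared_distance / 2.0


# Distances taken from the search output above
for squared_distance in (0.4059, 0.8379):
    similarity = cosine_from_squared_l2(squared_distance)
    print(f"squared L2 = {squared_distance:.4f} -> cosine similarity = {similarity:.3f}")
```

Under that assumption, a smaller distance corresponds directly to a higher cosine similarity, so the ranking is the same either way.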

In [5]:
!pip install transformers




In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer

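# Load Falcon model and tokenizer
# Note: this rebinds `model`, replacing the SentenceTransformer loaded in cell 3;
# re-run cell 3 before embedding any new queries.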
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")


In [8]:
# Set pad_token to eos_token for Falcon
tokenizer.pad_token = tokenizer.eos_token

# Use only the top-ranked retrieved chunk as context, to keep the prompt short
retrieved_text = " ".join([chunks[idx] for idx in indices[0][:1]])  # Raise the slice to [:2] to include a second chunk


input_text = f"Question: {query}\nRelevant Information: {retrieved_text}\nBased on the provided information, explain the process of tokenization and its role in NLP:\nAnswer:"


# Tokenize the prompt for Falcon; the result contains both input_ids and attention_mask
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a response using Falcon, passing attention_mask and limiting new tokens
outputs = model.generate(inputs['input_ids'],
                         attention_mask=inputs['attention_mask'],
                         max_new_tokens=100,
                         num_return_sequences=1,
                         no_repeat_ngram_size=2)

# Decode and print the generated answer
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Answer:\n", answer)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Answer:
 Question: What is tokenization in NLP?
Relevant Information: Chapter 1 Introduction to NLP Natural Language Processing NLP is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. It allows computers to understand, interpret, and generate human languages. NLP is widely used in applications such as machine translation, speech recognition, chatbots, and text analysis. Chapter 2 Tokenization Tokenization is the process of breaking down text into individual words or phrases, known as tokens. These tokens form the building blocks of NLP tasks. There are different levels of tokenization, such as wordlevel and sentencelevel tokenization. Wordlevel tokenization breaks a sentence into words,
Based on the provided information, explain the process of tokenization and its role in NLP:
Answer: Tokenizing text is breaking it down into its smallest units, typically words. This process is essential in natural-langua
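
As the output shows, the decoded text repeats the whole prompt before the answer, because a causal LM returns the prompt tokens followed by the continuation. The sketch below shows one way to build the prompt from a helper and then keep only the text after it. The helper names `build_prompt` and `extract_answer`, the shortened instruction wording, and the demo strings are all hypothetical and not part of the original notebook.

```python
ANSWER_MARKER = "Answer:"


def build_prompt(query, retrieved_chunks):
    """Assemble a question, the retrieved context, and an answer cue into one prompt."""
    context = " ".join(retrieved_chunks)
    return (
        f"Question: {query}\n"
        f"Relevant Information: {context}\n"
        "Based on the provided information, answer the question.\n"
        f"{ANSWER_MARKER}"
    )


def extract_answer(generated_text, prompt):
    """Return only the newly generated part of `generated_text`."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fallback if decoding altered the prompt slightly: cut at the last marker
    _, found, tail = generated_text.rpartition(ANSWER_MARKER)
    return tail.strip() if found else generated_text.strip()


# Demo with a made-up generation instead of a real model call
demo_prompt = build_prompt(
    "What is tokenization in NLP?",
    ["Tokenization splits text into tokens."],
)
demo_generated = demo_prompt + " Tokenizing text breaks it into units."
demo_answer = extract_answer(demo_generated, demo_prompt)
print(demo_answer)
```

An alternative that avoids string matching is to slice the generated token IDs at the prompt length before decoding, for example `outputs[0][inputs['input_ids'].shape[1]:]`.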