
In [26]:
# Install dependencies (uncomment on the first run; -q keeps pip output quiet)
# !pip install -q langchain
# !pip install -q chromadb
# !pip install -q bertopic
# !pip install -q sentence_transformers
# !pip install -q pypdf

In [27]:
import uuid
import chromadb
from langchain.text_splitter import RecursiveCharacterTextSplitter
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


In [28]:

# 1. Read the sample text (replace the path with your own text file)
with open("/content/my_temp_text.txt", "r", encoding="utf-8") as file:
    sample_text = file.read()


# 2. Clean the Text (adapt cleaning as needed for your data)
def clean_text(text):
    text = text.lower()
    # Add more cleaning steps here as necessary (e.g., punctuation removal, etc.)
    return text

cleaned_text = clean_text(sample_text)
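
The cleaning step above only lowercases the text. A slightly fuller version might also rejoin words hyphenated across line breaks and collapse repeated whitespace, both of which are common in text extracted from PDFs. The following is a minimal sketch using only the standard library; the name `clean_text_extended` and the specific rules are assumptions for illustration, not part of the original notebook.

```python
import re

def clean_text_extended(text):
    """Hypothetical variant of clean_text with a few extra normalization steps."""
    text = text.lower()
    # Rejoin words split by a hyphen at a line break: "compound-\ning" -> "compounding"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "Time is  Compound-\ning's   magic.\n\n\n\nFar superior is flexibility. "
print(clean_text_extended(sample))
```

Newlines are deliberately kept rather than flattened, because `RecursiveCharacterTextSplitter` tries paragraph and line breaks as split points before falling back to spaces.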


In [29]:

# 3. Split Text into Chunks using Langchain's RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=400) # Adjust chunk_size & overlap
initial_chunks = text_splitter.create_documents([cleaned_text])
# Extract the text content from Langchain documents
chunk_texts = [doc.page_content for doc in initial_chunks]
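
To see what `chunk_size` and `chunk_overlap` mean, it can help to look at a plain sliding window over characters. The sketch below is a simplified illustration only: `sliding_window_chunks` is a hypothetical helper, and it does not reproduce LangChain's separator-aware splitting, which prefers to cut at paragraph, line, and word boundaries.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Simplified fixed-width chunking; NOT LangChain's separator-aware algorithm."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has already reached the end of the text
    return chunks

demo = "abcdefghijklmnopqrst"  # 20 characters
chunks = sliding_window_chunks(demo, chunk_size=8, chunk_overlap=4)
print(chunks)
```

With an overlap of half the chunk size, as in the cell above (800 and 400), most of the text ends up in two chunks. That roughly doubles the number of stored embeddings and explains why consecutive retrieved chunks can repeat the same sentences.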



In [30]:

# 4. Topic Modeling with BERTopic

topic_model = BERTopic(embedding_model='all-MiniLM-L6-v2') # Pass embedding model directly
topics, probs = topic_model.fit_transform(chunk_texts)
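
BERTopic assigns the label -1 to chunks it treats as outliers, so a quick check after fitting is how many chunks landed there. The sketch below summarizes a list of topic labels with the standard library; `topic_summary` and the example labels are hypothetical stand-ins for the `topics` list returned above.

```python
from collections import Counter

def topic_summary(topics):
    """Return per-topic counts and the fraction of chunks labelled as outliers (-1)."""
    counts = Counter(topics)
    outlier_share = counts.get(-1, 0) / len(topics) if topics else 0.0
    return counts, outlier_share

# Hypothetical topic assignments standing in for the `topics` list from fit_transform
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1, 1, -1]
counts, outlier_share = topic_summary(example_topics)
print(counts.most_common())
print(f"Outlier share: {outlier_share:.0%}")
```

A large outlier share suggests that the topic label will be of limited use as a retrieval filter for many chunks.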



In [31]:

# 5. Prepare and Store Data in ChromaDB

# Initialize the Sentence Transformer model used to embed both chunks and queries
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')  # Or any other preferred model

# Initialize ChromaDB client
client = chromadb.Client()

# Create a ChromaDB collection
collection = client.create_collection(name="my_knowledge_base_C")

embeddings = []
metadatas = []
documents = []  # Store the original text for retrieval
chunk_ids = []


for chunk, topic in zip(chunk_texts, topics):
    chunk_id = str(uuid.uuid4())
    embedding = model.encode(chunk).tolist()
    # Cast the topic to a plain int so it is a valid ChromaDB metadata value
    metadata = {"topic": int(topic), "chunk_id": chunk_id, "source": "your_text_source"}  # Basic metadata
    embeddings.append(embedding)
    metadatas.append(metadata)
    documents.append(chunk)
    chunk_ids.append(chunk_id)

collection.add(
        embeddings=embeddings,
        metadatas=metadatas,
        documents=documents,
        ids=chunk_ids
)


# 6. Retrieval Example with ChromaDB

def retrieve_relevant_chunks(query, top_k=5):  # Retrieve the top 5 chunks by default
    query_embedding = model.encode(query).tolist()

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        # where={"topic": 3},  # Optional: uncomment to restrict results to a single topic ID
    )

    retrieved_chunks = results["documents"][0]
    retrieved_metadatas = results["metadatas"][0]
    retrieved_ids = results["ids"][0]

    return retrieved_chunks, retrieved_metadatas, retrieved_ids


user_query = "What are the main challenges discussed in the text?" # Example query
retrieved_chunks, retrieved_metadatas, retrieved_ids  = retrieve_relevant_chunks(user_query)

for chunk_id, metadata, text in zip(retrieved_ids, retrieved_metadatas, retrieved_chunks):
    print(f"Chunk ID: {chunk_id}")
    print(f"Metadata: {metadata}")
    print(f"Text: {text}\n")




# ... Further processing with Langchain (augmentation, generation, etc.) would follow here ...

Chunk ID: 82d60f65-d26b-40e1-84a2-859c04c39b4e
Metadata: {'chunk_id': '82d60f65-d26b-40e1-84a2-859c04c39b4e', 'source': 'your_text_source', 'topic': -1}
Text: what looks unsustainable but is actually a new trend we haven’t
accepted yet?
who do i think is smart but is actually full of it?
am i prepared to handle risks i can’t even envision?
which of my current views would change if my incentives were
different?
what are we ignoring today that will seem shockingly obvious in
the future?
what events very nearly happened that would have fundamentally
changed the world i know if they had occurred?
how much have things outside my control contributed to things i
take credit for?
how do i know if i’m being patient (a skill) or stubborn (a flaw)?
who do i look up to that is secretly miserable?
what hassle am i trying to eliminate that’s actually an unavoidable
cost of success?
what crazy genius that i aspire to emulate is actually just crazy?

Chunk ID: 7e62ea99-ba3f-4973-ba60-6d12f44d1d3a
Meta
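
Conceptually, the query step embeds the question and returns the stored chunks whose vectors are closest to it. The brute-force sketch below illustrates that idea with toy vectors. It is an assumption-laden illustration, not ChromaDB's implementation: ChromaDB uses an approximate nearest-neighbor index and whichever distance metric is configured on the collection, while cosine similarity is used here only for clarity. The helper names and vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_similarity(query_vec, stored, k):
    """stored: list of (chunk_id, vector) pairs; returns the k most similar ids."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in stored]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy 3-dimensional vectors (hypothetical); all-mpnet-base-v2 embeddings have 768 dimensions
stored = [("a", [1.0, 0.0, 0.0]), ("b", [0.9, 0.1, 0.0]), ("c", [0.0, 1.0, 0.0])]
nearest = top_k_by_similarity([1.0, 0.05, 0.0], stored, k=2)
print(nearest)
```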

In [32]:
len(topic_model.get_topic_info())

24

In [33]:
print(topic_model.get_topic_info())

    Topic  Count                              Name  \
0      -1    246                  -1_the_to_of_and   
1       0     50          0_reference_note_text_go   
2       1     48         1_species_evolution_to_of   
3       2     41           2_the_of_depression_was   
4       3     37          3_it_the_that_innovation   
5       4     36            4_he_burns_the_stories   
6       5     29                5_1950s_the_in_and   
7       6     29               6_you_cancer_is_the   
8       7     27           7_bryan_brendan_ski_the   
9       8     26              8_hill_pavlov_the_in   
10      9     24              9_is_future_you_what   
11     10     21              10_is_the_to_markets   
12     11     19  11_disease_infectious_decline_in   
13     12     18                 12_is_you_to_that   
14     13     18              13_news_bad_of_local   
15     14     17            14_sears_the_in_stores   
16     15     17       15_incentives_people_to_you   
17     16     16          16

In [36]:
# Collect every chunk assigned to one topic (topic 3 here; use -1 to inspect BERTopic's outliers)
target_topic = 3
topic_chunks = [chunk for chunk, topic in zip(chunk_texts, topics) if topic == target_topic]

# Print the chunks to inspect or process them further
for chunk in topic_chunks:
    print(chunk, "\n")



few looked at early cars and said, “oh, there’s a thing i can commute to
work in.”
few saw a plane and said, “aha, i can use that to get to my next
vacation.”
it took decades for people to see that potential.
what they did say early on was, “can we mount a machine gun on that?
can we drop bombs out of it?”
adolphus greely was one of the first people outside the car industry to
realize the “horseless carriage” could be useful. greely, a brigadier general,
purchased three cars in 1899—almost a decade before ford’s model t—for
the u.s. army to experiment with.
in one of its first mentions of automobiles, the los angeles times wrote
about general greely’s purchase:
it can be used for the transportation of light artillery such as
machine guns. it can be utilized for the carrying of equipment, 

purchased three cars in 1899—almost a decade before ford’s model t—for
the u.s. army to experiment with.
in one of its first mentions of automobiles, the los angeles times wrote
about general greely’
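
Inspecting one topic at a time, as above, generalizes to grouping all chunks by their topic label in a single pass. The following is a minimal sketch; `group_chunks_by_topic` and the example data are hypothetical, standing in for the notebook's `chunk_texts` and `topics` lists.

```python
from collections import defaultdict

def group_chunks_by_topic(chunk_texts, topics):
    """Map each topic label to the list of chunks assigned to it, in original order."""
    grouped = defaultdict(list)
    for chunk, topic in zip(chunk_texts, topics):
        grouped[topic].append(chunk)
    return dict(grouped)

# Hypothetical chunks and labels standing in for chunk_texts and topics
example_chunks = ["cars and planes", "time horizons", "machine guns", "filler text"]
example_labels = [3, 19, 3, -1]
grouped = group_chunks_by_topic(example_chunks, example_labels)

for topic, chunks_in_topic in sorted(grouped.items()):
    print(topic, len(chunks_in_topic), chunks_in_topic)
```

Looking at a handful of chunks per group is a quick way to judge whether a topic is coherent enough to use as a metadata filter at query time.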

In [39]:
user_query = "how to futureproof our life from failing?" # Example query
retrieved_chunks, retrieved_metadatas, retrieved_ids  = retrieve_relevant_chunks(user_query)

for chunk_id, metadata, text in zip(retrieved_ids, retrieved_metadatas, retrieved_chunks):
    print(f"Chunk ID: {chunk_id}")
    print(f"Metadata: {metadata}")
    print(f"Text: {text}\n")


Chunk ID: 5c612d7a-35b2-4abe-87c6-7efb90853229
Metadata: {'chunk_id': '5c612d7a-35b2-4abe-87c6-7efb90853229', 'source': 'your_text_source', 'topic': 19}
Text: putting everything else in a bucket that’s in constant need of updating and
adapting. the few (very few) things that never change are candidates for
long-term thinking. everything else has a shelf life.
long term is less about time horizon and more about flexibility.
if it’s 2010 and you say “i have a ten-year time horizon,” your target date is
2020. which is when the world fell to pieces. if you were a business or an
investor it was a terrible time to assume the world was ready to hand you
the reward you had been patiently awaiting.
a long time horizon with a firm end date can be as reliant on chance as a
short time horizon.
far superior is flexibility.
time is compounding’s magic, and its importance can’t be minimized.

Chunk ID: 5a5f4f01-3fd6-49b6-ab41-817a45e3288b
Metadata: {'chunk_id': '5a5f4f01-3fd6-49b6-ab41-817a45e3288b',