## RAG with TF-IDF

TF-IDF
=====

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure of how important a word is to a document relative to a collection of documents (a corpus): a word scores high when it appears often in that document but in few other documents.
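To make the definition concrete, here is a minimal from-scratch sketch using the textbook weighting `tf(t, d) * log(N / df(t))`. The toy corpus and the helper name `tf_idf` are illustrative assumptions, not part of this notebook. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not the notebook's documents)
corpus = [
    "solar power is clean power",
    "wind power is clean",
    "recycling reduces waste",
]
tokenized = [doc.split() for doc in corpus]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in all_docs if term in tokens)
    if df == 0:
        return 0.0  # Term never seen in the corpus
    idf = math.log(len(all_docs) / df)
    return tf * idf

# "power" occurs twice in the first document but also appears in 2 of 3 documents
power_score = tf_idf("power", tokenized[0], tokenized)
# "solar" occurs once in the first document and nowhere else
solar_score = tf_idf("solar", tokenized[0], tokenized)
print(f"power: {power_score:.3f}, solar: {solar_score:.3f}")
```

In this toy corpus "solar" ends up with a higher score than "power" even though "power" occurs more often in the document, because "solar" is rarer across the corpus.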

TF-IDF **does not understand the underlying meaning of words**. Instead, it is a purely statistical method that relies on word frequency to determine the importance of words in documents.

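In TF-IDF, the words "car" and "automobile" are unrelated: each term is its own dimension, so a query containing "car" does not match a document that only says "automobile". A term's weight is determined purely by how often it occurs in a document and across the corpus.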
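The synonym problem can be checked with a small sketch. It uses raw word counts rather than TF-IDF weights to stay short, which is a simplifying assumption: TF-IDF only rescales the dimensions of terms that are present, so two texts with no shared words score zero either way. The example texts and the `cosine` helper are illustrative and not part of this notebook.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query_vec = Counter("car maintenance".split())
doc_same_word = Counter("car repair guide".split())
doc_synonym = Counter("automobile repair guide".split())

sim_same = cosine(query_vec, doc_same_word)   # Shares the word "car"
sim_synonym = cosine(query_vec, doc_synonym)  # Shares no words at all
print(sim_same, sim_synonym)
```

The document that uses "automobile" gets a similarity of exactly zero, even though a human reader would consider it just as relevant as the one that uses "car".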

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Data: List of documents on sustainable living
documents = [
    "Renewable energy sources, such as solar and wind power, are crucial for reducing our dependence on fossil fuels and mitigating climate change. Solar panels convert sunlight into electricity, while wind turbines harness the power of the wind to generate energy. Both sources are sustainable and produce no greenhouse gas emissions during operation.",
    "Sustainable agriculture involves practices that protect the environment, public health, human communities, and animal welfare. Techniques such as crop rotation, organic farming, and agroforestry help maintain soil health, reduce pesticide use, and enhance biodiversity. Sustainable agriculture aims to produce food while ensuring the long-term health of ecosystems.",
    "Reducing waste and recycling materials are key components of sustainable living. By minimizing waste production, reusing products, and recycling materials, we can conserve natural resources, reduce pollution, and lower greenhouse gas emissions. Programs that promote composting, upcycling, and responsible consumption habits contribute to waste reduction efforts.",
    "Water conservation is essential for sustainable living, especially in regions facing water scarcity. Techniques such as rainwater harvesting, using water-efficient appliances, and xeriscaping can significantly reduce water usage. Conservation efforts help ensure that water resources are available for future generations and support healthy ecosystems.",
    "Sustainable transportation options, such as electric vehicles, public transit, cycling, and walking, can reduce carbon emissions and alleviate traffic congestion. Investing in infrastructure that supports these modes of transportation, like bike lanes and charging stations, is essential for promoting sustainable urban mobility.",
    "Green building practices aim to reduce the environmental impact of construction and create healthier living spaces. Techniques include using energy-efficient materials, incorporating natural lighting, and designing buildings that minimize energy consumption. Green buildings often feature renewable energy systems and are designed to be environmentally friendly.",
    "Community-based sustainability initiatives involve collective efforts to promote sustainable living practices. These initiatives can include community gardens, local recycling programs, and educational workshops on sustainability. By working together, communities can create a more sustainable and resilient environment for all residents."
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Function to retrieve the documents most similar to a query
def get_similar_documents(query, top_k=3):
    # Project the query into the same TF-IDF space as the corpus
    query_tfidf = vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_tfidf, tfidf_matrix).flatten()
    # argsort is ascending, so take the last top_k indices and reverse them
    similar_indices = cosine_similarities.argsort()[-top_k:][::-1]
    return [(documents[i], cosine_similarities[i]) for i in similar_indices]

# Load pre-trained model and tokenizer
model_name = "gpt2"  # You can replace this with another model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Function to generate responses
def generate_response(prompt, max_length=300):
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,  # Counts prompt tokens plus generated tokens
        num_return_sequences=1,
        no_repeat_ngram_size=2,  # Never repeat the same 2-gram
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
    )
    # The decoded text includes the prompt followed by the continuation
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# RAG pipeline function
def rag_pipeline(query, max_length=150):
    similar_docs = get_similar_documents(query)
    context = "\n".join([doc for doc, _ in similar_docs])
    prompt = f"Based on the following documents:\n{context}\nAnswer the following question: {query}"
    return generate_response(prompt, max_length=max_length)

# Example usage without RAG
query = "What are the benefits of renewable energy?"
response_without_rag = generate_response(query, max_length=300)
print("Response without RAG:")
print(response_without_rag)

# Example usage with RAG
response_with_rag = rag_pipeline(query, max_length=300)
print("\nResponse with RAG:")
print(response_with_rag)


Response without RAG:
What are the benefits of renewable energy?

The benefits are obvious. The cost of electricity is lower than in the past, and the cost is higher than the costs of other forms of energy.
...
 (1) The benefits for the environment are greater than for other types of power. (2) There is no need for a new power plant. There are no new plants. No new energy sources are needed. And there is nothing to stop the development of new sources of generation. In fact, the only way to reduce the amount of carbon dioxide in our atmosphere is to use less fossil fuels. This is the most important thing to do. It is also the least expensive. If we could reduce our carbon emissions by a factor of 10, we would reduce emissions of about one-third of the world's greenhouse gas emissions. We would also reduce greenhouse gases by about half of what we are currently emitting. That is a huge reduction in emissions, but it is not a major one. So, if we can reduce carbon pollution by 10 percent,

## We can see that

1.   TF-IDF retrieves the documents that share words with the query.
2.   Only those retrieved documents are passed to the model as context for RAG.
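The ranking step behind these observations can be sketched in isolation. The scores below are made-up values, and `top_k_matches` with its `min_score` cutoff is a hypothetical helper that is not part of this notebook. It illustrates one possible refinement: taking the top `k` indices by sorting alone always returns `k` documents, even when fewer than `k` actually share words with the query, so a cutoff can keep zero-similarity documents out of the prompt.

```python
def top_k_matches(scores, top_k=3, min_score=0.0):
    """Return (index, score) pairs for the best-scoring documents.

    Hypothetical helper: keeps at most top_k documents and drops any
    whose similarity does not exceed min_score.
    """
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked[:top_k] if s > min_score]

# Made-up cosine similarities for a 5-document corpus
example_scores = [0.0, 0.42, 0.0, 0.17, 0.0]
matches = top_k_matches(example_scores)
print(matches)
```

Here only two of the five documents overlap with the query, so only two are returned even though `top_k` is 3.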



In [14]:
# Example usage without RAG
query = "What are some community efforts people can take for sustainability"
response_without_rag = generate_response(query, max_length=300)
print("Response without RAG:")
print(response_without_rag)

# Example usage with RAG
response_with_rag = rag_pipeline(query, max_length=300)
print("\nResponse with RAG:")
print(response_with_rag)

Response without RAG:
What are some community efforts people can take for sustainability?

We have a lot of community initiatives that are focused on sustainability. We have the Sustainable Cities Initiative, which is a community-based initiative that is focused around sustainability in the city.
... We are working with the City of Toronto to develop a sustainable city plan. The city is working on a plan that will be implemented by the end of the year. It will include a number of initiatives, including the Green Building Plan, the Healthy Cities Plan and the Clean Cities plan, and we are also working to create a Sustainable City Plan for the Toronto area. So we have some of these initiatives in place, but we also have other initiatives. And we're working very hard to get them implemented. But we want to make sure that we do everything we can to ensure that the people of our city are able to live in a city that's sustainable.

Response with RAG:
Based on the following documents:
Communi