<a href="https://colab.research.google.com/github/YeshwanthMotivity/Weather-Retrieval-and-Analysis/blob/main/Predict__Weather_using_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Predict Weather using RAG**


PART 1 – Setup, Load Dataset, Preprocess, Disable Telemetry, No API Key


Install Required Packages

In [11]:
# Install FAISS, Gradio, and the Hugging Face libraries (transformers, sentence-transformers)
!pip install faiss-cpu gradio transformers sentence-transformers --quiet

Disable Telemetry (No API Prompts)



In [12]:
import os
# Disable telemetry
os.environ["WANDB_DISABLED"] = "true"
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

Imports and Dataset Load




In [13]:
import pandas as pd
import faiss
import numpy as np
import gradio as gr
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
from transformers import pipeline

Load Dataset

In [14]:
import pandas as pd

# Load the Excel file
excel_path = '/content/drive/MyDrive/Jena Dataset.xlsx'
df = pd.read_excel(excel_path)

# View sample rows
df.head()


Unnamed: 0,Date Time,p (mbar),T (degC),Tpot (K),Tdew (degC),rh (%),VPmax (mbar),VPact (mbar),VPdef (mbar),sh (g/kg),H2OC (mmol/mol),rho (g/m**3),wv (m/s),max. wv (m/s),wd (deg)
0,01.01.2009 00:10:00,996.52,T (degC),265.4,-8.9,93.3,3.33,3.11,0.22,1.94,3.12,1307.75,1.03,1.75,152.3
1,01.01.2009 00:20:00,996.57,T (degC),265.01,-9.28,93.4,3.23,3.02,0.21,1.89,3.03,1309.8,0.72,1.5,136.1
2,01.01.2009 00:30:00,996.53,T (degC),264.91,-9.31,93.9,3.21,3.01,0.2,1.88,3.02,1310.24,0.19,0.63,171.6
3,01.01.2009 00:40:00,996.51,T (degC),265.12,-9.07,94.2,3.26,3.07,0.19,1.92,3.08,1309.19,0.34,0.5,198.0
4,01.01.2009 00:50:00,996.51,T (degC),265.15,-9.04,94.1,3.27,3.08,0.19,1.92,3.09,1309.0,0.32,0.63,214.3


Convert Rows to Text

In [15]:
# Convert rows into text chunks (for embedding)
def row_to_text(row):
    return f"DateTime: {row['Date Time']}, Temperature: {row['T (degC)']}°C, Humidity: {row['rh (%)']}%, Wind Speed: {row['wv (m/s)']} m/s"

# Apply to a subset for speed (e.g., 10,000 rows)
texts = df.head(10000).apply(row_to_text, axis=1).tolist()

# Preview one
print(texts[0])

DateTime: 01.01.2009 00:10:00, Temperature: T (degC)°C, Humidity: 93.3%, Wind Speed: 1.03 m/s
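The preview above prints `Temperature: T (degC)°C`, which suggests that the temperature cell of that row holds the column label rather than a number. The following is a minimal sketch of one way to guard against this, assuming such rows should simply be skipped. The helper name `row_to_text_safe` is hypothetical and not part of the notebook; it accepts plain dicts as well as pandas rows.

```python
def row_to_text_safe(row):
    """Return the text chunk for a row, or None if a reading is not numeric."""
    try:
        temp = float(row['T (degC)'])
        humidity = float(row['rh (%)'])
        wind = float(row['wv (m/s)'])
    except (TypeError, ValueError):
        return None  # e.g. the cell contains the header text "T (degC)"
    return (f"DateTime: {row['Date Time']}, Temperature: {temp}°C, "
            f"Humidity: {humidity}%, Wind Speed: {wind} m/s")

# Two sample rows taken from the outputs shown in this notebook
rows = [
    {'Date Time': '01.01.2009 00:10:00', 'T (degC)': 'T (degC)', 'rh (%)': 93.3, 'wv (m/s)': 1.03},
    {'Date Time': '14.01.2009 14:10:00', 'T (degC)': -1.58, 'rh (%)': 64.13, 'wv (m/s)': 0.4},
]
clean_texts = [t for t in map(row_to_text_safe, rows) if t is not None]
print(clean_texts)
```

Only the second row survives the filter, so a malformed reading never reaches the embedding step.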


PART 2 – Embeddings + FAISS Setup

Load SentenceTransformer Model for Embeddings

We’ll use "all-MiniLM-L6-v2", a lightweight sentence-embedding model that maps each text chunk to a 384-dimensional vector.

In [16]:
# Load embedding model (efficient & suitable for Colab)
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Generate Embeddings for Text Chunks

In [17]:
# Generate embeddings (batch processing for speed)
embeddings = embedder.encode(texts, show_progress_bar=True, convert_to_numpy=True)

# Shape of embeddings
print(f"Embeddings shape: {embeddings.shape}")

Embeddings shape: (10000, 384)


Store Embeddings in FAISS Index

In [18]:
# Create FAISS index
embedding_dim = embeddings.shape[1]
index = faiss.IndexFlatL2(embedding_dim)

# Add embeddings to the index
index.add(embeddings)

# Save the mapping between embeddings and original text
text_mapping = {i: text for i, text in enumerate(texts)}

# Confirm size
print(f"Number of vectors in FAISS index: {index.ntotal}")

Number of vectors in FAISS index: 10000
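`IndexFlatL2` is FAISS's exact (brute-force) index: it compares the query against every stored vector and ranks by squared Euclidean distance. The sketch below reproduces that computation in plain Python on toy 2-dimensional vectors, only to illustrate what the search returns. The function `l2_search` is a hypothetical stand-in, not a FAISS call.

```python
def l2_search(vectors, query, k):
    """Exact nearest-neighbour search by squared L2 distance."""
    scored = []
    for i, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy data standing in for the 384-dimensional embeddings
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(vectors, query=[0.9, 0.1], k=2)
print(indices)  # positions of the two closest vectors
```

The returned positions play the same role as the `indices` array from `index.search`, which is why a mapping from position to original text is kept alongside the index.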


Retrieval Function

This function will:

*   Convert the user query into an embedding.
*   Search FAISS for top-k similar weather data.
*   Return retrieved text chunks.

In [19]:

def retrieve_similar_chunks(query, k=5):
    # Embed the query
    query_embedding = embedder.encode([query], convert_to_numpy=True)

    # Search FAISS
    distances, indices = index.search(query_embedding, k)

    # Retrieve corresponding text
    results = [text_mapping[idx] for idx in indices[0]]

    return results

Test Retrieval Example

In [20]:
# Example user query
query = "What was the weather like on 2014-01-01?"

# Retrieve similar weather chunks
results = retrieve_similar_chunks(query)

# Display results
for res in results:
    print(res)


DateTime: 14.01.2009 14:10:00, Temperature: -1.58°C, Humidity: 64.13%, Wind Speed: 0.4 m/s
DateTime: 14.01.2009 16:50:00, Temperature: -1.42°C, Humidity: 68.68%, Wind Speed: 0.36 m/s
DateTime: 14.01.2009 14:30:00, Temperature: -1.35°C, Humidity: 64.3%, Wind Speed: 0.43 m/s
DateTime: 14.01.2009 13:50:00, Temperature: -1.94°C, Humidity: 64.47%, Wind Speed: 0.67 m/s
DateTime: 14.01.2009 15:10:00, Temperature: -1.61°C, Humidity: 64.52%, Wind Speed: 0.69 m/s
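The query asks about 2014-01-01, yet every result is from 14.01.2009. The indexed subset (the first 10,000 rows, roughly 69 days at 10-minute intervals) contains no 2014 data, and embedding similarity only matches date strings loosely. One possible complement is an exact date filter applied to the chunks. The sketch below assumes the query contains an ISO date (YYYY-MM-DD); both helper names are hypothetical.

```python
import re
from datetime import datetime

def extract_date_prefix(query):
    """Find an ISO date (YYYY-MM-DD) in the query; return it as DD.MM.YYYY or None."""
    match = re.search(r"\b(\d{4})-(\d{2})-(\d{2})\b", query)
    if not match:
        return None
    try:
        parsed = datetime(int(match[1]), int(match[2]), int(match[3]))
    except ValueError:
        return None  # digits that do not form a valid calendar date
    return parsed.strftime("%d.%m.%Y")

def filter_by_date(chunks, query):
    """Keep only chunks whose DateTime field starts with the date in the query."""
    prefix = extract_date_prefix(query)
    if prefix is None:
        return list(chunks)  # no date in the query: nothing to filter on
    return [c for c in chunks if c.startswith(f"DateTime: {prefix}")]

chunks = [
    "DateTime: 14.01.2009 14:10:00, Temperature: -1.58°C, Humidity: 64.13%, Wind Speed: 0.4 m/s",
    "DateTime: 14.01.2009 16:50:00, Temperature: -1.42°C, Humidity: 68.68%, Wind Speed: 0.36 m/s",
]
print(filter_by_date(chunks, "What was the weather like on 2014-01-01?"))
print(len(filter_by_date(chunks, "What was the weather like on 2009-01-14?")))
```

An empty result for the 2014 query makes it explicit that the requested date is not in the index, instead of silently returning look-alike rows.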


PART 3 – Local LLM Response Generation

*   Concatenate the retrieved chunks.
*   Use a local LLM to answer your weather query.
*   Return the LLM-generated response.



Load the LLM

We use google/flan-t5-base.

In [21]:
# Load the tokenizer and seq2seq model directly (small model for speed)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

llm_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
model = AutoModelForSeq2SeqLM.from_pretrained(llm_name)

# Define text generation function
def generate_answer(prompt, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_length=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


Combine Retrieval + Generation

We will:
*   Retrieve relevant weather chunks.
*   Construct a prompt.
*   Generate the answer using Flan-T5.


In [22]:
def answer_query(query):
    # Step 1: Retrieve
    retrieved_chunks = retrieve_similar_chunks(query)

    # Step 2: Combine chunks
    context = "\n".join(retrieved_chunks)

    # Step 3: Construct prompt
    prompt = f"""Given the following weather data:\n{context}\nAnswer the question: {query}"""

    # Step 4: Generate response
    answer = generate_answer(prompt)

    return answer


Test LLM Response

In [23]:
query = "What was the weather like on 14th January 2009 afternoon?"
response = answer_query(query)

print("LLM Response:")
print(response)

LLM Response:
Windy
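A one-word answer such as "Windy" is hard to check against the data. One option is to compute simple statistics from the retrieved chunks and show them next to the generated text. The sketch below parses the chunk format produced by `row_to_text` and averages the readings; it uses the five chunks printed in the earlier retrieval test as sample input. The helper `summarize_chunks` and the dictionary keys are hypothetical.

```python
import re
from statistics import mean

# Matches the numeric fields in the text produced by row_to_text
PATTERN = re.compile(
    r"Temperature: (-?\d+(?:\.\d+)?)°C, Humidity: (\d+(?:\.\d+)?)%, "
    r"Wind Speed: (\d+(?:\.\d+)?) m/s"
)

def summarize_chunks(chunks):
    """Average temperature, humidity and wind speed over the parsable chunks."""
    readings = [tuple(map(float, m.groups()))
                for m in map(PATTERN.search, chunks) if m]
    if not readings:
        return None
    temps, humidities, winds = zip(*readings)
    return {
        "temp_c": round(mean(temps), 2),
        "humidity_pct": round(mean(humidities), 2),
        "wind_m_s": round(mean(winds), 2),
    }

chunks = [
    "DateTime: 14.01.2009 14:10:00, Temperature: -1.58°C, Humidity: 64.13%, Wind Speed: 0.4 m/s",
    "DateTime: 14.01.2009 16:50:00, Temperature: -1.42°C, Humidity: 68.68%, Wind Speed: 0.36 m/s",
    "DateTime: 14.01.2009 14:30:00, Temperature: -1.35°C, Humidity: 64.3%, Wind Speed: 0.43 m/s",
    "DateTime: 14.01.2009 13:50:00, Temperature: -1.94°C, Humidity: 64.47%, Wind Speed: 0.67 m/s",
    "DateTime: 14.01.2009 15:10:00, Temperature: -1.61°C, Humidity: 64.52%, Wind Speed: 0.69 m/s",
]
print(summarize_chunks(chunks))
```

For these five chunks every wind reading is below 1 m/s, so a summary like this gives the reader a concrete figure to compare with the model's wording.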


PART 4 – Gradio Web Interface + Graph

1.   Create a Gradio app for text input + LLM output.


2.   Optionally plot weather data (matching the query) using matplotlib. The first version below reserves an image output for the plot but returns no image yet.



Gradio Interface + Plot

In [24]:
!pip install requests --quiet

In [25]:
import pandas as pd
import requests
from datetime import datetime
import gradio as gr

# Load entire dataset (no row limit)
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects(AI ML)/jena_climate_2009_2016.csv')
df['Date Time'] = pd.to_datetime(df['Date Time'], format='%d.%m.%Y %H:%M:%S')

# Filter only data from 2009
df_2009 = df[df['Date Time'].dt.year == 2009]

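# OpenWeather API key: read from an environment variable rather than hardcoding it.
# The variable name OPENWEATHER_API_KEY is an assumption; set it before running this cell.
import os
API_KEY = os.environ.get("OPENWEATHER_API_KEY", "")

# Fetch real-time weather
def get_current_weather(location):
    # Pass the query as params so the city name is URL-encoded; time out instead of hanging
    url = "https://api.openweathermap.org/data/2.5/weather"
    params = {"q": location, "appid": API_KEY, "units": "metric"}
    response = requests.get(url, params=params, timeout=10)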

    if response.status_code == 200:
        data = response.json()
        return {
            "temp": data['main']['temp'],
            "humidity": data['main']['humidity'],
            "wind_speed": data['wind']['speed'],
            "description": data['weather'][0]['description'].capitalize()
        }
    else:
        return None

# Get 2009 temperature for today's date (e.g., 20-03-2009)
def get_2009_temp_for_today(df_2009):
    today = datetime.now()
    try:
        target_date_2009 = datetime(2009, today.month, today.day)
    except ValueError:
        # 29 February does not exist in 2009 (not a leap year)
        return None, None

    filtered = df_2009[df_2009['Date Time'].dt.date == target_date_2009.date()]

    if not filtered.empty:
        temp_2009 = round(filtered.iloc[0]['T (degC)'], 2)
        date_2009 = filtered.iloc[0]['Date Time'].strftime('%d.%m.%Y %H:%M:%S')
        return temp_2009, date_2009
    else:
        return None, None

# Gradio interface function
def compare_weather(query, location):
    temp_2009, date_2009 = get_2009_temp_for_today(df_2009)
    current_weather = get_current_weather(location)

    if temp_2009 is None or current_weather is None:
        return "Error retrieving data.", "Check dataset or API", None

    comparison = f"📅 2009 Date: {date_2009} | 🌡️ Temp: {temp_2009}°C\n"
    comparison += f"📍 Current Temp in {location}: {current_weather['temp']}°C\n"

    diff = round(current_weather['temp'] - temp_2009, 2)
    if diff > 0:
        comparison += f"Today is {diff}°C warmer than the same day in 2009."
    elif diff < 0:
        comparison += f"Today is {abs(diff)}°C colder than the same day in 2009."
    else:
        comparison += "Today’s temperature is the same as in 2009!"

    real_time_info = (
        f"Location: {location}, Temperature: {current_weather['temp']}°C, "
        f"Humidity: {current_weather['humidity']}%, Wind Speed: {current_weather['wind_speed']} m/s, "
        f"Weather: {current_weather['description']}"
    )

    return comparison, real_time_info, None

# Gradio UI
gr.Interface(
    fn=compare_weather,
    inputs=[
        gr.Textbox(label="Enter Your Weather Question"),
        gr.Textbox(label="Enter City for Real-Time Weather")
    ],
    outputs=[
        gr.Textbox(label="Comparison (2009 vs Today)"),
        gr.Textbox(label="Real-Time Weather Data"),
        gr.Image(type="pil", label="Weather Plot (Optional)")
    ],
    title="Weather Comparison: 2009 vs Today",
    description="Compare today’s temperature with the same day in 2009 (Jena Climate Dataset)."
).launch()
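The comparison takes the first 2009 reading for today's calendar date, which is a single measurement rather than a value for the whole day, and 29 February has no counterpart in 2009. The sketch below shows one possible alternative under two stated assumptions: 29 February falls back to 28 February, and the day is represented by the mean of all its readings. It uses plain `(timestamp, temperature)` tuples as illustrative sample data instead of the DataFrame, and both helper names are hypothetical.

```python
from datetime import date, datetime
from statistics import mean

def same_day_in_2009(today):
    """Map a date onto 2009; 29 February falls back to 28 February."""
    try:
        return date(2009, today.month, today.day)
    except ValueError:
        return date(2009, 2, 28)

def daily_mean_temp(readings, target):
    """Average every reading taken on `target`; None if there are none."""
    temps = [temp for stamp, temp in readings if stamp.date() == target]
    return round(mean(temps), 2) if temps else None

# Illustrative readings, not values from the Jena dataset
readings = [
    (datetime(2009, 2, 28, 0, 0), -1.0),
    (datetime(2009, 2, 28, 12, 0), 3.0),
    (datetime(2009, 3, 1, 0, 0), 5.0),
]
target = same_day_in_2009(date(2024, 2, 29))
print(target, daily_mean_temp(readings, target))
```

With a DataFrame the same idea would be a filter on the date followed by a mean over the `T (degC)` column.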






"Real-Time Weather vs Historical Climate: Temperature Comparison Using Jena Climate Dataset"

In [26]:
!pip install requests gradio --quiet

import pandas as pd
import requests
from datetime import datetime
import gradio as gr

# Load dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects(AI ML)/jena_climate_2009_2016.csv')
df['Date Time'] = pd.to_datetime(df['Date Time'], format='%d.%m.%Y %H:%M:%S')

# Filter data for 2009
df_2009 = df[df['Date Time'].dt.year == 2009]

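# OpenWeatherMap API key: read from an environment variable rather than hardcoding it.
# The variable name OPENWEATHER_API_KEY is an assumption; set it before running this cell.
import os
API_KEY = os.environ.get("OPENWEATHER_API_KEY", "")

# Get real-time weather
def get_current_weather(location):
    # Pass the query as params so the city name is URL-encoded; time out instead of hanging
    url = "https://api.openweathermap.org/data/2.5/weather"
    params = {"q": location, "appid": API_KEY, "units": "metric"}
    response = requests.get(url, params=params, timeout=10)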
    if response.status_code == 200:
        data = response.json()
        return {
            "temp": data['main']['temp'],
            "humidity": data['main']['humidity'],
            "wind_speed": data['wind']['speed'],
            "description": data['weather'][0]['description'].capitalize()
        }
    else:
        return None

# Get 2009 temperature for today's date
def get_2009_temp_for_today(df_2009):
    today = datetime.now()
    try:
        target_date_2009 = datetime(2009, today.month, today.day)
    except ValueError:
        # 29 February does not exist in 2009 (not a leap year)
        return None, None
    filtered = df_2009[df_2009['Date Time'].dt.date == target_date_2009.date()]
    if not filtered.empty:
        temp_2009 = round(filtered.iloc[0]['T (degC)'], 2)
        date_2009 = filtered.iloc[0]['Date Time'].strftime('%d.%m.%Y %H:%M:%S')
        return temp_2009, date_2009
    else:
        return None, None

# Gradio interface function (No plot)
def compare_weather(query, location):
    temp_2009, date_2009 = get_2009_temp_for_today(df_2009)
    current_weather = get_current_weather(location)

    if temp_2009 is None or current_weather is None:
        return "Error retrieving data.", "Check dataset or API"

    comparison = f"📅 2009 Date: {date_2009} | 🌡️ Temp: {temp_2009}°C\n"
    comparison += f"📍 Current Temp in {location}: {current_weather['temp']}°C\n"

    diff = round(current_weather['temp'] - temp_2009, 2)
    if diff > 0:
        comparison += f"Today is {diff}°C warmer than the same day in 2009."
    elif diff < 0:
        comparison += f"Today is {abs(diff)}°C colder than the same day in 2009."
    else:
        comparison += "Today’s temperature is the same as in 2009!"

    real_time_info = (
        f"Location: {location}, Temperature: {current_weather['temp']}°C, "
        f"Humidity: {current_weather['humidity']}%, Wind Speed: {current_weather['wind_speed']} m/s, "
        f"Weather: {current_weather['description']}"
    )

    return comparison, real_time_info

# Gradio UI (No plot output)
gr.Interface(
    fn=compare_weather,
    inputs=[
        gr.Textbox(label="Enter Your Weather Question"),
        gr.Textbox(label="Enter City for Real-Time Weather")
    ],
    outputs=[
        gr.Textbox(label="Comparison (2009 vs Today)"),
        gr.Textbox(label="Real-Time Weather Data")
    ],
    title="Weather Comparison: 2009 vs Today",
    description="Compare today’s temperature with the same day in 2009 (Jena Climate Dataset)."
).launch(share=True)  # share=True for Colab use




