# 🔧 Refactored Notebook: Prompt Analysis for Energy Prediction using Deep Learning

This notebook extends the chatbot prototype with NLP tools for analyzing a prompt and extracting features relevant to energy prediction.

### 📘 Curriculum Learning 1.1: Token Count as Proxy for Complexity

In [2]:
# Install dependencies
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import os

# 1. Download NLTK data
# Ensure 'punkt' is available and nltk_data path is set
nltk_data_path = os.path.join(os.getcwd(), 'nltk_data')
print("Downloading 'punkt' tokenizer...")
nltk.download('punkt', download_dir=nltk_data_path, force=True)
print("Downloading 'punkt_tab' tokenizer...")
nltk.download('punkt_tab', download_dir=nltk_data_path, force=True)

# Always append the custom nltk_data path (if not already present)
if nltk_data_path not in nltk.data.path:
    nltk.data.path.append(nltk_data_path)

# Debugging paths and contents
print("NLTK Data Paths:", nltk.data.path)
print("Contents of nltk_data:", os.listdir(nltk_data_path))

Downloading 'punkt' tokenizer...


[nltk_data] Downloading package punkt to c:\StudentWork\Code\CSCN8010\
[nltk_data]     Sustainable_AI_Project_Protocol\Phase One\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to c:\StudentWork\Code\CSCN8
[nltk_data]     010\Sustainable_AI_Project_Protocol\Phase
[nltk_data]     One\nltk_data...


Downloading 'punkt_tab' tokenizer...
NLTK Data Paths: ['C:\\Users\\Eespinosa/nltk_data', 'c:\\StudentWork\\Code\\CSCN8010\\Sustainable_AI_Project_Protocol\\.venv\\nltk_data', 'c:\\StudentWork\\Code\\CSCN8010\\Sustainable_AI_Project_Protocol\\.venv\\share\\nltk_data', 'c:\\StudentWork\\Code\\CSCN8010\\Sustainable_AI_Project_Protocol\\.venv\\lib\\nltk_data', 'C:\\Users\\Eespinosa\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', 'D:\\nltk_data', 'E:\\nltk_data', 'c:\\StudentWork\\Code\\CSCN8010\\Sustainable_AI_Project_Protocol\\Phase One\\nltk_data']
Contents of nltk_data: ['tokenizers']


[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


In [3]:
from nltk.tokenize import word_tokenize

def count_tokens(prompt):
    return len(word_tokenize(prompt.lower()))

prompt = "How can I reduce my energy cost when using LLMs?"
print("Token count:", count_tokens(prompt))

Token count: 11


### 🔗 Curriculum Learning 2.1: Chain Rule-Based Scoring with N-Grams

In [8]:
from nltk.util import bigrams
from collections import Counter

# Generate a large synthetic token corpus with over 1000 domain-specific words
# Covering automotive, food production, oil, electronics, agriculture, medical equipment, hygiene products

import random

# Word banks per domain
domains = {
    "automotive": [
        "engine", "transmission", "assembly", "OEM", "dealership", "tire", "brake", "dashboard", "fuel", "ignition",
        "torque", "suspension", "steering", "battery", "chassis", "gearbox", "clutch", "radiator", "axle", "bumper"
    ],
    "food_production": [
        "harvest", "fertilizer", "pesticide", "organic", "irrigation", "yield", "poultry", "dairy", "grain", "corn",
        "packaging", "labeling", "nutrition", "distribution", "shelf_life", "inspection", "processing", "warehouse", "frozen", "perishable"
    ],
    "oil_gas": [
        "refinery", "barrel", "pipeline", "petroleum", "diesel", "gasoline", "offshore", "drilling", "rig", "lubricant",
        "exploration", "upstream", "downstream", "crude", "hydrocarbon", "fracturing", "compressor", "reservoir", "seismic", "platform"
    ],
    "electronics": [
        "semiconductor", "resistor", "capacitor", "PCB", "microchip", "soldering", "transistor", "diode", "oscilloscope", "breadboard",
        "LED", "IC", "voltage", "current", "amplifier", "microcontroller", "sensor", "IoT", "firmware", "FPGA"
    ],
    "agriculture": [
        "tractor", "crop", "cultivation", "acreage", "cooperative", "greenhouse", "farm", "livestock", "milking", "plowing",
        "seeding", "spraying", "harvesting", "rotation", "fertility", "tillage", "yield", "grains", "paddock", "pasture"
    ],
    "medical": [
        "stethoscope", "thermometer", "syringe", "scalpel", "respirator", "defibrillator", "catheter", "IV", "diagnosis", "radiology",
        "sterile", "disinfectant", "bandage", "infusion", "gown", "PPE", "intubation", "oxygen", "nebulizer", "vaccine"
    ],
    "hygiene": [
        "soap", "sanitizer", "toilet_paper", "shampoo", "conditioner", "toothbrush", "toothpaste", "detergent", "cleanser", "disinfectant",
        "deodorant", "napkin", "razor", "floss", "moisturizer", "lotion", "wipe", "spray", "cotton", "absorbent"
    ],
    "logistics": [
        "shipping", "freight", "customs", "tariff", "inventory", "procurement", "tracking", "dispatch", "container", "pallet",
        "warehouse", "fulfillment", "courier", "manifest", "cross_docking", "handover", "last_mile", "airwaybill", "consignment", "bill_of_lading"
    ]
}

# Multiply each domain by 6 to exceed 1000 tokens and shuffle
token_corpus = []
for domain_words in domains.values():
    token_corpus.extend(domain_words * 6)

# Shuffle for realism
random.shuffle(token_corpus)

# Truncate if needed
token_corpus = token_corpus[:1050]

# Output the Python code version of this
token_list_code = "tokens = [\n" + ",\n".join(f'    "{word}"' for word in token_corpus) + "\n]"
token_list_code[:1000]  # preview only first 1000 characters

unigrams = Counter(tokens)
bigrams_list = list(bigrams(tokens))
bigram_counts = Counter(bigrams_list)

def bigram_probability(prompt):
    words = word_tokenize(prompt.lower())
    score = 1.0
    for i in range(1, len(words)):
        prev, curr = words[i - 1], words[i]
        score *= bigram_counts[(prev, curr)] / unigrams[prev] if unigrams[prev] else 1e-6
    return score

print("Bigram Score:", bigram_probability(prompt))

Bigram Score: 0.0


#### 📘 Embedding Support

In [9]:
from gensim.models import Word2Vec
import numpy as np

# Get training corpus from the preious section
corpus = token_corpus
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1, seed=42)

def sentence_vector(sentence):
    tokens = word_tokenize(sentence.lower())
    vecs = [model.wv[word] for word in tokens if word in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

### 📊 Curriculum Learning 3.1: Combined Prompt Feature Extraction

In [15]:
def extract_prompt_features(prompt):
    tokens = word_tokenize(prompt.lower())
    token_count = len(tokens)
    avg_vec = sentence_vector(prompt)
    avg_norm = np.linalg.norm(avg_vec)
    bigram_score = bigram_probability(prompt)
    return {
        "token_count": token_count,
        "avg_embedding_norm": avg_norm,
        "bigram_likelihood": bigram_score
    }

# Example usage using the provided long prompt
prompt = "Provide a comprehensive report detailing the environmental, economic, and geopolitical impact of offshore drilling in the Gulf of Mexico. Include historical data on oil spills, current safety regulations, the role of independent contractors, transportation logistics to refineries in Texas and Louisiana, and how recent changes in U.S. trade policy affect crude exports to Canada and Mexico. Also assess fluctuations in barrel pricing, regional employment statistics, and supply chain vulnerabilities related to pipeline maintenance and hurricane disruptions. Finally, compare with alternative extraction methods like fracking and tar sands."

features = extract_prompt_features(prompt)

print(features)

{'token_count': 100, 'avg_embedding_norm': 9.92105490783582, 'bigram_likelihood': 0.034}


#### 🧮 Prompt Feature Extraction + Model Metadata + Preprocessing

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import re

# 🔧 Placeholder functions
def sentence_vector(prompt):
    return np.random.rand(300)

def bigram_probability(prompt):
    return 0.034

# 🧠 Simple tokenizer (fallback)
def simple_tokenize(text):
    return re.findall(r'\b\w+\b', text.lower())

# 🔍 Feature extraction function
def extract_prompt_features(prompt):
    tokens = simple_tokenize(prompt)
    token_count = len(tokens)
    avg_vec = sentence_vector(prompt)
    avg_norm = np.linalg.norm(avg_vec)
    bigram_score = bigram_probability(prompt)
    return {
        "token_count": token_count,
        "avg_embedding_norm": avg_norm,
        "bigram_likelihood": bigram_score
    }

# ✏️ User prompt from the previous section
print(f"Prompt = ",prompt)

# ⚙️ LLM + server metadata
llm_metadata = {
    "layers": 12,
    "training_time": 48,
    "flops_per_hour": 2e12
}

# 📦 Combine prompt + metadata
prompt_features = extract_prompt_features(prompt)
all_features = {**llm_metadata, **prompt_features}

# 📊 DataFrame and scaling
df = pd.DataFrame([all_features])
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_features, columns=df.columns)

print("✅ Normalized input features ready for model:")
display(scaled_df)


Prompt =  Provide a comprehensive report detailing the environmental, economic, and geopolitical impact of offshore drilling in the Gulf of Mexico. Include historical data on oil spills, current safety regulations, the role of independent contractors, transportation logistics to refineries in Texas and Louisiana, and how recent changes in U.S. trade policy affect crude exports to Canada and Mexico. Also assess fluctuations in barrel pricing, regional employment statistics, and supply chain vulnerabilities related to pipeline maintenance and hurricane disruptions. Finally, compare with alternative extraction methods like fracking and tar sands.
✅ Normalized input features ready for model:


Unnamed: 0,layers,training_time,flops_per_hour,token_count,avg_embedding_norm,bigram_likelihood
0,0.0,0.0,0.0,0.0,0.0,0.0


### 📊 Curriculum Learning (8): Multi-Layer Perceptron (MLP) Architecture

Prepare features from the user prompt using classic NLP tools. These features will feed into a lightweight neural network to predict the energy cost of remote LLM inference.

Note that the model is currently performing a forward pass using randomly initialized weights, without any training. That’s why you're seeing output, but it’s not meaningful yet.

In [17]:
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# 📊 DataFrame and scaling
# This dictionary should come from your Prompt + Model metadata extractor
all_features = {
    "layers": 12,
    "training_time": 48,
    "flops_per_hour": 2e12,
    "token_count": 105,
    "avg_embedding_norm": 0.87,
    "bigram_likelihood": 0.031
}
df = pd.DataFrame([all_features])

# 🧪 Normalize features
scaler = StandardScaler()
scaled_array = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_array, columns=df.columns)

# 🧠 Define MLP Model
class EnergyEstimatorMLP(nn.Module):
    def __init__(self, input_dim):
        super(EnergyEstimatorMLP, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Dropout(p=0.2),
            nn.Linear(8, 1)  # Output: Energy prediction (kWh)
        )

    def forward(self, x):
        return self.model(x)

# 🧮 Prepare input and model
input_tensor = torch.tensor(scaled_array, dtype=torch.float32)
input_dim = input_tensor.shape[1]
mlp_model = EnergyEstimatorMLP(input_dim)

# 🔮 Forward pass to get energy estimate
with torch.no_grad():
    prediction = mlp_model(input_tensor)
    predicted_energy_kwh = prediction.item()

print("🔋 Estimated Energy Consumption:", predicted_energy_kwh, "kWh")


🔋 Estimated Energy Consumption: -0.5921454429626465 kWh


✅ Let's Add Training Capability
Below is a minimal training pipeline for the EnergyEstimatorMLP using synthetic data. You can later replace this with real, labeled energy usage data.

🔁 Training Code: Curriculum Learning (8.1) – MLP Training Loop

In [18]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# 🔁 Simulated training data (replace with real data when available)
# Each row corresponds to scaled features: [layers, training_time, flops/hour, token_count, avg_embedding_norm, bigram_likelihood]
X_train = torch.tensor(np.tile(scaled_array, (100, 1)), dtype=torch.float32)

# Simulated target energy consumption values (in kWh)
y_train = torch.tensor(np.random.normal(loc=150.0, scale=30.0, size=(100, 1)), dtype=torch.float32)

# 📐 Initialize model
input_dim = X_train.shape[1]
model = EnergyEstimatorMLP(input_dim)

# ⚙️ Loss and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 📊 Training loop
epochs = 100
for epoch in range(epochs):
    model.train()
    
    # Forward pass
    predictions = model(X_train)
    loss = loss_fn(predictions, y_train)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")

# ✅ Save trained model (optional)
torch.save(model.state_dict(), "mlp_energy_estimator.pth")
print("Model trained and saved successfully.")


Epoch 0, Loss: 23924.6406
Epoch 10, Loss: 23916.6367
Epoch 20, Loss: 23909.8145
Epoch 30, Loss: 23904.6777
Epoch 40, Loss: 23899.2734
Epoch 50, Loss: 23894.3203
Epoch 60, Loss: 23887.1348
Epoch 70, Loss: 23881.5098
Epoch 80, Loss: 23873.3496
Epoch 90, Loss: 23865.5820
Model trained and saved successfully.


Here are **5 practical actions** to help replace the synthetic data with **real, labeled energy usage data** and repeat the training of the `EnergyEstimatorMLP` model.


## ✅ Project-Based Actions

### 🔍 1. **Gather Prompt Logs and Metadata from an API Call**

* **Task**: Use OpenAI or Hugging Face APIs to send prompts and collect metadata.
* **Output**: For each prompt, record:

  * Prompt text
  * Number of tokens
  * Model used
  * Inference latency
  * Model configuration (layers, type, etc.)
* **Example Toolkits**: Python + `openai` or `transformers` library.

### 🔌 2. **Estimate Energy Cost per Inference**

* **Task**: Estimate the energy (kWh) consumed per prompt execution using:

  * Known model compute cost (FLOPs or GPU time)
  * Device power consumption specs (from server logs or public benchmarks)
* **Formula**:

  $$
  \text{kWh} = \frac{\text{FLOPs per inference} \times \text{Watt-hours per FLOP}}{3.6 \times 10^6}
  $$
* **Tools**: Spreadsheet + Python script to convert FLOPs → kWh.

### 📊 3. **Label Real Prompts with Energy Use Estimates**

* **Task**: Combine prompt features and estimated energy cost into a labeled dataset.
* **Output**: A CSV or DataFrame with:

  * Columns: `token_count`, `layers`, `flops_per_hour`, `avg_embedding_norm`, `bigram_likelihood`, `energy_kwh`
  * Format: Ready for training the neural net.

### 🧪 4. **Train and Evaluate the Model**

* **Task**: Replace `X_train` and `y_train` with real data.
* **Metrics to Report**:

  * Mean Absolute Error (MAE)
  * Mean Squared Error (MSE)
  * R² Score
* **Stretch Goal**: Plot training loss over epochs.

### 🧼 5. **Improve Data Quality (Optional)**

* **Tasks**:

  * Remove outliers (e.g. prompts with >2000 tokens)
  * Normalize inputs again
  * Add missing values imputation if needed
  * Use embeddings from multiple prompt types for variety

## 📁 Example Dataset Output

| layers | training\_time | flops/hour | token\_count | avg\_embedding\_norm | bigram\_likelihood | energy\_kwh |
| ------ | -------------- | ---------- | ------------ | -------------------- | ------------------ | ----------- |
| 12     | 48             | 2.0E+12    | 105          | 0.87                 | 0.031              | 148.2       |
| 24     | 96             | 4.5E+12    | 330          | 0.92                 | 0.021              | 290.1       |
| ...    | ...            | ...        | ...          | ...                  | ...                | ...         |

