# Automatic Diagnostics Report Generator Using a Small Language Model

An automated diagnostics report generator for VSB power line fault detection. It uses open-source language models to generate analysis reports and produces professional PDF reports for power grid operations.

Model predictions and interpretations, trend analysis, visualizations, and alerts can be combined with carefully generated explanations from a small language model (SLM) such as __Hugging Face Microsoft DialoGPT-small__, __Ollama__, or __GPT-2 variants__. Report generation can be scheduled at regular intervals in a cloud environment connected to the grid, with the results pushed to a dashboard or sent by email.

## Value Proposition for Automatic Diagnostics Report Generation

* Transforms raw ML predictions into actionable operational intelligence
* Reduces manual analysis time from hours to minutes and keeps reporting consistent
* Has no external API dependencies and falls back to template-based insights when the SLM is unavailable
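
The template-based fallback mentioned in the last bullet could look roughly like the sketch below. It is not the project's implementation: the `stats` keys, the thresholds, and the `llm_generate` callback are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Optional


def template_insights(stats: Dict[str, float]) -> List[str]:
    """Build plain-language insights from summary statistics without a language model.

    The keys read from `stats` are illustrative assumptions.
    """
    total = int(stats["total_signals"])
    faults = int(stats["predicted_faults"])
    fault_rate = faults / total if total else 0.0
    insights = [
        f"{faults} of {total} signals ({fault_rate:.1%}) were flagged as potential faults."
    ]
    # The 10% threshold is an arbitrary placeholder, not a value from the project.
    if fault_rate > 0.10:
        insights.append("Fault rate is elevated; prioritize inspection of flagged lines.")
    elif faults == 0:
        insights.append("No faults were flagged in this batch.")
    else:
        insights.append("Fault rate is moderate; continue routine monitoring.")
    mean_conf = stats.get("mean_confidence")
    if mean_conf is not None:
        insights.append(f"Mean predicted fault probability: {mean_conf:.2f}.")
    return insights


def generate_insights(
    stats: Dict[str, float],
    llm_generate: Optional[Callable[[Dict[str, float]], str]] = None,
) -> List[str]:
    """Try the language model first; fall back to templates if it is missing or fails."""
    if llm_generate is not None:
        try:
            return [llm_generate(stats)]
        except Exception:
            pass  # fall through to the template path
    return template_insights(stats)


example_stats = {"total_signals": 1000, "predicted_faults": 50, "mean_confidence": 0.07}
for line in generate_insights(example_stats):
    print(line)
```

Keeping the fallback as a pure function of the summary statistics means the report can still be produced when no model can be loaded or when generation raises an error.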

#### This notebook shows the art of the possible; it can be extended in different ways to produce the output that stakeholders need

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import json
import os
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# For PDF generation
try:
    from reportlab.lib.pagesizes import letter, A4
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Image, Table, TableStyle, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.lib.units import inch
    from reportlab.lib import colors
    from reportlab.lib.enums import TA_CENTER, TA_LEFT, TA_JUSTIFY
    from reportlab.graphics.shapes import Drawing, Rect
    from reportlab.graphics.charts.linecharts import HorizontalLineChart
    from reportlab.graphics.charts.barcharts import VerticalBarChart
    REPORTLAB_AVAILABLE = True
except ImportError:
    REPORTLAB_AVAILABLE = False
    print("reportlab not available. Install with: pip install reportlab")

# For open-source language models
try:
    from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False
    print("transformers not available. Install with: pip install transformers torch")

try:
    import ollama
    OLLAMA_AVAILABLE = True
except ImportError:
    OLLAMA_AVAILABLE = False


# Imports from helper functions
from helper.common import prepare_data, create_feature_tier_mapping
from helper.load_save_model import load_saved_model
from helper.report_generation import generate_diagnostics_report

In [2]:
DATA_PATH = r'./data/vsb-power-line-fault-detection/'
FEATURE_PATH = r"./features"  # no trailing slash; a "/" is added when the file name is appended
CHUNK_SIZE = 1000
SAMPLE_SIZE = 1000
BEST_MODEL_PATH = r'saved_models\best_model_catboost_fast_20250928_173554.pkl'  # Windows-style separator
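
The configuration above mixes forward slashes and a Windows-style backslash. A minimal sketch of the same paths built with `pathlib`, which picks the right separator on each platform, is shown here. The directory and file names mirror the configuration cell; the variable names are assumptions.

```python
from pathlib import Path

# Names mirror the configuration cell; the variable names are illustrative.
DATA_DIR = Path("data") / "vsb-power-line-fault-detection"
FEATURE_DIR = Path("features")
MODEL_DIR = Path("saved_models")

feature_file = FEATURE_DIR / "final_features.parquet"
best_model_file = MODEL_DIR / "best_model_catboost_fast_20250928_173554.pkl"

# as_posix() renders forward slashes on every platform, which is convenient for logs.
print(feature_file.as_posix())
print(best_model_file.name)
```

`Path` objects can generally be passed wherever a string path is accepted, for example to `open`.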

### Prepare Data for Creating the Diagnostics Report

In [3]:
# Step 1: Load the optimized feature set
feature_df = pd.read_parquet(f"{FEATURE_PATH}/final_features.parquet")

In [4]:
data_splits = prepare_data(feature_df, target_col='target', random_state=123)

Preparing data for model training...
Feature matrix shape: (5000, 124)
Target distribution:
target
0    0.9336
1    0.0664
Name: proportion, dtype: float64
Training set: 3200 samples
Validation set: 800 samples
Test set: 1000 samples
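
The logged sizes are consistent with a two-stage hold-out split: 20% of the 5000 rows held out for testing, then 20% of the remaining 4000 rows held out for validation. This is an inference from the numbers above, not something the `prepare_data` helper confirms, so the fractions in this sketch are assumptions.

```python
def split_sizes(n_samples, test_fraction=0.2, val_fraction=0.2):
    """Return (train, validation, test) sizes for a two-stage hold-out split.

    The default fractions are assumptions chosen to match the logged sizes.
    """
    n_test = round(n_samples * test_fraction)
    n_remaining = n_samples - n_test
    n_val = round(n_remaining * val_fraction)
    return n_remaining - n_val, n_val, n_test


train_size, val_size, test_size = split_sizes(5000)
print(train_size, val_size, test_size)  # 3200 800 1000

# With a positive-class proportion of 0.0664, the full set holds about 332 positives.
print(round(5000 * 0.0664))  # 332
```

With so few positive examples, a split that preserves the class proportion in each subset (stratification) is usually preferable; whether the helper does this is not shown here.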


In [5]:
X_train, X_test = data_splits['X_train'], data_splits['X_test']
y_train, y_test = data_splits['y_train'], data_splits['y_test']

### Loading the Best Model

In [6]:
model, metadata = load_saved_model(BEST_MODEL_PATH)

✓ Loaded model: saved_models\best_model_catboost_fast_20250928_173554.pkl
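
The body of `load_saved_model` is not shown in this notebook. As a rough idea of how a loader returning a `(model, metadata)` pair from a `.pkl` file might work, here is a sketch using the standard `pickle` module. The function names and the bundle layout are hypothetical, and a plain dictionary stands in for the CatBoost model.

```python
import pickle
import tempfile
from pathlib import Path


def save_model_bundle(model, metadata, path):
    """Persist a model together with its metadata in a single pickle file."""
    with open(path, "wb") as fh:
        pickle.dump({"model": model, "metadata": metadata}, fh)


def load_model_bundle(path):
    """Return (model, metadata) from a file written by save_model_bundle."""
    with open(path, "rb") as fh:
        bundle = pickle.load(fh)  # only unpickle files from a trusted source
    return bundle["model"], bundle.get("metadata", {})


# Round trip with a stand-in object instead of a real trained model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_model.pkl"
    save_model_bundle({"kind": "stand-in"}, {"model_name": "demo"}, demo_path)
    loaded_model, loaded_metadata = load_model_bundle(demo_path)

print(loaded_model, loaded_metadata)
```

Storing metadata next to the model makes it possible to record details such as the training date or feature list alongside the weights, which the report can then cite.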


### Preparing the Predictions

In [7]:
y_pred = model.predict(X_test)

In [8]:
y_pred_proba = model.predict_proba(X_test)

In [9]:
predictions_df = pd.DataFrame({
    'prediction': y_pred,               # Required: 0 or 1
    'confidence': y_pred_proba[:, 1],   # Optional but recommended: predicted probability of class 1
    'actual': y_test,                   # Optional: ground truth, used for performance evaluation
    'signal_id': X_test.index,          # Test-set index, used as the signal identifier
})
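
Because only about 6.6% of the signals are positive, accuracy alone is misleading: predicting 0 for every signal would already score about 93%. The sketch below computes precision, recall, and the Matthews correlation coefficient (MCC) from `actual` and `prediction` values. It uses plain Python lists with toy data; the function names are illustrative and not part of the helper modules.

```python
import math


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: 2 positives among 10 signals.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
accuracy = (tp + tn) / len(actual)
print(precision, recall, accuracy, mcc(tp, fp, fn, tn))  # 0.5 0.5 0.8 0.375
```

The same functions can be applied to `predictions_df['actual']` and `predictions_df['prediction']`, since iterating over a pandas Series yields its values.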

### Report Generation

#### Calling the Report Generation Helper Function

##### Uses the __Microsoft DialoGPT-small__ SLM for report preparation

In [10]:
report_path = generate_diagnostics_report(
    predictions_df=predictions_df,
    output_format="pdf",
    data_source_type="features"  # or "auto" for auto-detection
)

Analyzing data trends...
No timestamp column found. Creating synthetic timestamps for analysis...
Setting up huggingface language model...


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Successfully loaded Hugging Face model: microsoft/DialoGPT-small
Generating insights...
Error generating LLM insights: Input length of input_ids is 240, but `max_length` is set to 240. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
Creating visualizations...
Generating PDF report...
PDF report generated successfully: diagnostics_reports\VSB_Diagnostics_Report_20250928_193620.pdf
Diagnostics report generated successfully!
Report: diagnostics_reports\VSB_Diagnostics_Report_20250928_193620.pdf
JSON Data: diagnostics_reports\analysis_data_20250928_193620.json
Visualizations: 2 plots saved
Data source type: features
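
The `Error generating LLM insights` line in the log above arises because `max_length` in Hugging Face generation counts the prompt tokens plus the generated tokens. A 240-token prompt with `max_length=240` leaves no room for output, and the message itself recommends `max_new_tokens`. The sketch below computes a safe budget; the function name, the default of 80 new tokens, and the 1024-token context window (typical for GPT-2 family models) are assumptions.

```python
def generation_budget(prompt_tokens, desired_new_tokens=80, context_window=1024):
    """Return generation keyword arguments that leave room for new tokens.

    context_window=1024 is an assumption; check the model configuration in practice.
    """
    if prompt_tokens >= context_window:
        raise ValueError("Prompt fills the context window; shorten or truncate it first.")
    room = context_window - prompt_tokens
    return {"max_new_tokens": min(desired_new_tokens, room)}


kwargs = generation_budget(prompt_tokens=240)
print(kwargs)  # {'max_new_tokens': 80}

# The result would then be passed to the generation call, for example
# generator(prompt, **kwargs) for a transformers text-generation pipeline.
```

Because the insight step failed in this run, the insights in the generated PDF presumably came from the template fallback rather than from the language model.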


#### This helper function is provided only as a prototype to showcase what is possible. It makes assumptions about timestamps and intervals, and uses a report template that presents trends, anomalies, and analysis results as a PDF.
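
The log reports that no timestamp column was found and that synthetic timestamps were created. One simple way to do that is to assign evenly spaced timestamps ending at a chosen time, as sketched below. The 15-minute interval and the function name are assumptions; the helper's actual spacing is not shown in this notebook.

```python
from datetime import datetime, timedelta


def synthetic_timestamps(n, end=None, interval=timedelta(minutes=15)):
    """Return n evenly spaced timestamps, oldest first, with the last one at `end`."""
    end = end or datetime.now()
    start = end - interval * (n - 1)
    return [start + interval * i for i in range(n)]


stamps = synthetic_timestamps(4, end=datetime(2025, 9, 28, 19, 0))
for ts in stamps:
    print(ts.isoformat())
```

Trends computed over such timestamps only reflect row order, so they should be read as illustrative until real measurement times are available.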

### End of Automatic Diagnostics Report Generator