In [None]:
import os
from pathlib import Path

from dotenv import load_dotenv
from IPython.display import Markdown, display
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI


# Load API key
load_dotenv()

MODEL = "gpt-4o"
OUTPUT_FILE = "FinalReport.txt"
openai_api_key = os.getenv("OPENAI_API_KEY")

# Setup LLM
llm = ChatOpenAI(
    model=MODEL,
    temperature=0,
    openai_api_key=openai_api_key
)

def read_file(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")


def generate_summary(Push_Commit_summary: str, White_paper_comparision: str) -> str:
    """Build one prompt from the instruction and the two input documents; return the LLM's reply text."""
    llm = ChatOpenAI(model=MODEL, temperature=0.0, openai_api_key=os.getenv("OPENAI_API_KEY"))

    # Define the instruction that heads the prompt
    instruction = (

    '''
        You are an AI report generator. Based on the inputs provided, create a structured report in HTML format with the following three sections. The HTML will be injected into a React component via dangerouslySetInnerHTML={{ __html: reportMarkdown }}, so it must not affect other elements on the page:

        1. **Validation Metrics**
            - Summarize model evaluation metrics (Accuracy, Precision, Recall, F1 Score, AUC, etc.).
            - Highlight strengths or weaknesses in these metrics.

        2. **Code Comparison Inferences**
            - Compare implemented code functionalities with described goals or documentation (e.g., whitepaper).
            - Identify alignments, discrepancies, or missing elements.
            - Highlight improvements or regressions.

        3. **Recommendations**
            - Suggest improvements or next steps based on validation results and code comparison.
            - Include actionable changes to improve accuracy, consistency, or system robustness.

    '''
    )

    user_content = (
        f"{instruction}\n\n"
        f"## Document A\n```\n{Push_Commit_summary}\n```\n\n"
        f"## Document B\n```\n{White_paper_comparision}\n```\n"
    )
    
    return llm.invoke([
        SystemMessage(content="You are a concise, structured assistant. Use Markdown."),
        HumanMessage(content=user_content)
    ]).content



Push_Commit_summary = read_file("Push_Commit_summary_outout.txt")
White_paper_comparision = read_file("white_paper_comparision.txt")


# 3) Get output
output = generate_summary(Push_Commit_summary, White_paper_comparision)

# 4) Print + save
print(output)
Path(OUTPUT_FILE).write_text(output, encoding="utf-8")
print(f"\nSaved to {OUTPUT_FILE}")
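
The reply printed and saved by this cell may arrive wrapped in a Markdown code fence with an `html` tag, in which case the saved file starts with the fence line rather than with the markup itself. A minimal sketch of one way to unwrap it before saving; `strip_code_fence` is a hypothetical helper introduced here for illustration, not part of LangChain:

```python
import re

# Matches a reply that is exactly one fenced block with an optional language tag.
_FENCE_RE = re.compile(r"\A\s*```[\w-]*[ \t]*\n(.*?)\n?```\s*\Z", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the body of a reply wrapped in a single code fence.

    A reply without a complete opening and closing fence is returned
    unchanged apart from trimming surrounding whitespace.
    """
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text.strip()
```

It could be applied at the save step, for example `Path(OUTPUT_FILE).write_text(strip_code_fence(output), encoding="utf-8")`.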

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>AI Report</title>
</head>
<body>

    <h1>AI Report</h1>

    <h2>1. Validation Metrics</h2>
    <p>The code implementation provides specific metrics such as F1 Score, Precision, Recall, and Accuracy. However, the white paper emphasizes high precision, recall, and F1-score without specifying exact values. The strengths of the code lie in its detailed metric reporting, but it lacks alignment with the white paper's focus on logistic regression, which may affect the interpretability of these metrics.</p>

    <h2>2. Code Comparison Inferences</h2>
    <p>The code uses XGBoost (XGBClassifier) instead of the Logistic Regression model specified in the white paper. This represents a significant deviation from the documented goals. The data splitting strategy and feature selection also differ, with the code lacking explicit preprocessi

In [8]:

def generate_summary(Push_Commit_summary: str, White_paper_comparision: str) -> str:
    """Build one prompt from the instruction and the two input documents; return the LLM's reply text."""
    llm = ChatOpenAI(model=MODEL, temperature=0.0, openai_api_key=os.getenv("OPENAI_API_KEY"))

    # Define the instruction that heads the prompt
    instruction = (

    '''
        You are an AI report generator. Based on the inputs provided, create a structured report:

        1. **Validation Metrics**
            - Summarize model evaluation metrics (Accuracy, Precision, Recall, F1 Score, AUC, etc.).
            - Highlight strengths or weaknesses in these metrics.

        2. **Code Comparison Inferences**
            - Compare implemented code functionalities with described goals or documentation (e.g., whitepaper).
            - Identify alignments, discrepancies, or missing elements.
            - Highlight improvements or regressions.

        3. **Recommendations**
            - Suggest improvements or next steps based on validation results and code comparison.
            - Include actionable changes to improve accuracy, consistency, or system robustness.

    '''
    )

    user_content = (
        f"{instruction}\n\n"
        f"## Document A\n```\n{Push_Commit_summary}\n```\n\n"
        f"## Document B\n```\n{White_paper_comparision}\n```\n"
    )
    
    return llm.invoke([
        SystemMessage(content="You are a concise, structured assistant. Use Markdown."),
        HumanMessage(content=user_content)
    ]).content
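
The prompt assembly inside `generate_summary` can be checked without calling the API if it lives in its own function. The sketch below is one way to do that, assuming the same "Document A / Document B" layout; `build_user_content` and the empty-input check are hypothetical additions, not part of the original cell:

```python
FENCE = "`" * 3  # Markdown code-fence delimiter placed around each document


def build_user_content(instruction: str, doc_a: str, doc_b: str) -> str:
    """Assemble the prompt: instruction first, then each document in its own fenced block."""
    for name, text in (("Document A", doc_a), ("Document B", doc_b)):
        if not text.strip():
            # An empty input file would otherwise produce a report about nothing.
            raise ValueError(f"{name} is empty")
    return (
        f"{instruction.strip()}\n\n"
        f"## Document A\n{FENCE}\n{doc_a.strip()}\n{FENCE}\n\n"
        f"## Document B\n{FENCE}\n{doc_b.strip()}\n{FENCE}\n"
    )
```

With this split, `generate_summary` would only need to pass the returned string to `HumanMessage(content=...)`.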

In [9]:
# 3) Get output
output = generate_summary(Push_Commit_summary, White_paper_comparision)

# 4) Print + save
print(output)
Path(OUTPUT_FILE).write_text(output, encoding="utf-8")
print(f"\nSaved to {OUTPUT_FILE}")

# Structured Report

## 1. Validation Metrics

### Summary of Model Evaluation Metrics
- **Accuracy**: Not explicitly mentioned in the white paper; code provides `accuracy_score`.
- **Precision, Recall, F1 Score**: White paper emphasizes high values; code includes these metrics but does not specify target values.
- **AUC**: Not mentioned in either document.

### Strengths and Weaknesses
- **Strengths**: The code provides a comprehensive set of evaluation metrics, allowing for detailed performance analysis.
- **Weaknesses**: The white paper lacks specific metric targets, making it difficult to assess alignment with code results.

## 2. Code Comparison Inferences

### Alignment with Goals or Documentation
- **Model Selection**: White paper specifies Logistic Regression; code uses XGBoost, indicating a significant deviation.
- **Data Splitting**: Discrepancy in validation and testing data proportions.
- **Feature Selection**: Code uses a subset of features mentioned in the white paper.
- 
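
Both printed reports above break off before the Recommendations section, and the printout alone does not show whether the saved file is complete. A minimal sketch of a completeness check on the returned text, assuming the three section titles requested in the prompt; `missing_sections` is a hypothetical helper:

```python
import re

REQUIRED_SECTIONS = ("Validation Metrics", "Code Comparison Inferences", "Recommendations")


def missing_sections(report: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section titles that appear on no heading line of the report."""
    # Heading lines are Markdown "#" headings or HTML <h1>..<h6> elements.
    heading_lines = [
        line
        for line in report.splitlines()
        if re.match(r"\s*(#{1,6}\s|<h[1-6][\s>])", line, re.IGNORECASE)
    ]
    return [
        title
        for title in required
        if not any(title.lower() in line.lower() for line in heading_lines)
    ]
```

Calling `missing_sections(output)` before writing the file returns an empty list when all three sections are present.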

In [None]:
# Function to create report
def CreateReport(input1, input2, input3):
    formatted_prompt = f"""
   You are an AI report generator. Based on the inputs provided, create a structured report with the following three sections:

   1. **Validation Metrics**
      - Summarize model evaluation metrics (Accuracy, Precision, Recall, F1 Score, AUC, etc.).
      - Highlight strengths or weaknesses in these metrics.

   2. **Code Comparison Inferences**
      - Compare implemented code functionalities with described goals or documentation (e.g., whitepaper).
      - Identify alignments, discrepancies, or missing elements.
      - Highlight improvements or regressions.

   3. **Recommendations**
      - Suggest improvements or next steps based on validation results and code comparison.
      - Include actionable changes to improve accuracy, consistency, or system robustness.

   ---

   ### 🔢 Input Block 1 (Enhancements & Changes):
   {input1}

   ---

   ### 🧩 Input Block 2 (Version-wise Comparison):
   {input2}

   ---

   ### 📄 Input Block 3 (Whitepaper vs Code Analysis):
   {input3}

   Generate the report in markdown format with bullet points for clarity.
      """

    response = llm.invoke([
        SystemMessage(content="You are a helpful assistant that creates structured reports for ML systems."),
        HumanMessage(content=formatted_prompt)
    ])

    return response.content.strip()
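
Because the prompt in `CreateReport` is an indented f-string, every line reaches the model with the source code's leading spaces, and the pasted inputs keep their own indentation as well. One possible alternative, sketched below, dedents a template once and fills the placeholders afterwards. `REPORT_TEMPLATE` and `build_report_prompt` are hypothetical names, and the template is shortened to the three input blocks:

```python
import textwrap

# Dedented once; the placeholders are filled later with str.format.
REPORT_TEMPLATE = textwrap.dedent(
    """\
    ### Input Block 1 (Enhancements & Changes):
    {input1}

    ---

    ### Input Block 2 (Version-wise Comparison):
    {input2}

    ---

    ### Input Block 3 (Whitepaper vs Code Analysis):
    {input3}

    Generate the report in markdown format with bullet points for clarity.
    """
)


def _strip_lines(text: str) -> str:
    """Remove leading and trailing whitespace from every line of a pasted input."""
    return "\n".join(line.strip() for line in text.strip().splitlines())


def build_report_prompt(input1: str, input2: str, input3: str) -> str:
    """Fill the template with the three inputs, each normalized line by line."""
    return REPORT_TEMPLATE.format(
        input1=_strip_lines(input1),
        input2=_strip_lines(input2),
        input3=_strip_lines(input3),
    )
```

Braces inside the inputs are safe here because `str.format` only interprets braces in the template, not in the substituted values.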


In [None]:
    
# Run the report generator
if __name__ == "__main__":
    input1 = '''New Features / Enhancements:
    Added feature engineering pipeline for income, credit history, and loan term normalization.
    Integrated missing value imputation using median/mode strategies.
    Introduced XGBoost and Random Forest classifiers alongside logistic regression for improved model performance.
    Implemented model selection and hyperparameter tuning using GridSearchCV.
    Added streamlit-based frontend for interactive loan approval predictions.
    Included model versioning with MLflow for tracking experiments.
    🐛 Bug Fixes / Code Refactoring:
    Fixed issue with incorrect encoding of categorical variables (replaced LabelEncoder with OneHotEncoder).
    Refactored data loading and preprocessing into modular functions (data_utils.py).
    Improved error handling and logging across preprocessing and inference scripts.
    📁 Repository Structure Updates:
    Created notebooks/, src/, and models/ directories for cleaner project organization.
    Added requirements.txt and README.md with setup instructions.
    Updated .gitignore to exclude model artifacts and environment files.
    📈 Performance Changes:
    Validation accuracy improved from ~78% to 84% with model tuning and feature engineering.
    Reduced training time by 20% after optimizing preprocessing and model pipeline.
    📌 Commit Comparison Highlights:
    Compared commits: a1c2b3d (old baseline model) → d4e5f6g (latest tuned system).
    Major differences:
    Introduction of new ML models and evaluation metrics.
    UI integration for real-time prediction.
    Codebase modularization and documentation improvements.'''

    input2 = '''Structure & Organization
    V1: Single Jupyter notebook; all logic inline.
    V2: Modular scripts (data_utils.py, model.py, app.py); clean folder structure.
    Inference: Shift from exploratory to production-grade code.
    🧮 Data Preprocessing
    V1: Basic null handling and LabelEncoder.
    V2: SimpleImputer, OneHotEncoder, scaling, ColumnTransformer.
    Inference: More robust and reusable preprocessing pipeline.
    🧠 Feature Engineering
    V1: Used raw features.
    V2: Added domain-driven features (e.g., debt-to-income ratio, loan amount bins).
    Inference: Better input representation, likely improved model performance.
    🤖 Modeling
    V1: Logistic Regression, no tuning.
    V2: Added Random Forest, XGBoost, and GridSearchCV.
    Inference: More powerful models with hyperparameter optimization.
    📊 Evaluation
    V1: Accuracy only.
    V2: Precision, Recall, F1, AUC, confusion matrix.
    Inference: Deeper insight into performance, especially for imbalanced classes.
    🌐 Deployment/UI
    V1: No interface
    V2: Streamlit app for real-time predictions.
    Inference: User-friendly and deployable.

    📈 Experiment Tracking
    V1: None.
    V2: MLflow used for tracking metrics and versions.
    Inference: Enables reproducibility and team collaboration.

    ✅ Final Verdict:
    V2 demonstrates a professional-grade ML system — modular, explainable, user-facing, and maintainable — a significant upgrade over the initial proof-of-concept in V1.
    '''
   
    input3 = '''AI Feature Mapping Validator
    Compare functionalities between a Whitepaper and its Codebase
    Whitepaper file: white paper.txt (329.0B)
    Code file: model.ipynb (8.9KB)

    Comparing Functionalities
    To perform a comprehensive comparison between the whitepaper and the code functionalities, let's break down the information provided and identify any discrepancies or updates needed:

    Comparison of Features:
    Whitepaper Features:

    Sepal width (cm)
    Petal length (cm)
    Petal width (cm)
    Code Features:

    Sepal length (cm)
    Sepal width (cm)
    Petal length (cm)
    Petal width (cm)
    Missing Feature in Whitepaper:

    Sepal length (cm) is used in the code but not mentioned in the whitepaper.
    Comparison of Model:
    Both the whitepaper and the code use Logistic Regression. There is no discrepancy here.
    Comparison of Validation Metrics:
    Whitepaper Metrics:

    Accuracy: 90%
    Precision: 90%
    Recall: 85%
    F1 Score: 88%
    Code Metrics:

    Accuracy: 96.67%
    Precision: 96.67%
    Recall: 96.67%
    F1 Score: 96.67%
    Comparison of Scores:

    The code metrics are higher than those specified in the whitepaper. This indicates that the model performs better than the expectations set in the whitepaper.
    Critical Validation Metrics:
    The critical metrics (Accuracy, Precision, Recall, F1 Score) in the code are all greater than those in the whitepaper.
    Conclusion:
    Feature Discrepancy: The whitepaper is missing the feature "Sepal length (cm)" which is used in the code.
    Validation Metrics Discrepancy: The code achieves higher validation metrics than those specified in the whitepaper.
    Action Required:

    White Paper Update Needed: The whitepaper is not updated. Please update the document to include the missing feature "Sepal length (cm)" and revise the validation metrics to reflect the improved performance of the model as demonstrated in the code.
    This analysis ensures that the documentation accurately reflects the implementation and performance of the model.
    '''
 
    result = CreateReport(input1, input2, input3)
    # 
    # print("=== Summary ===")
    # print(result)


In [8]:
Markdown(result)

# AI System Evaluation Report

## 1. Validation Metrics

- **Accuracy**: Improved from ~78% to 84% after enhancements.
- **Precision, Recall, F1 Score, AUC**: Newly introduced metrics provide a comprehensive evaluation of model performance, especially for imbalanced classes.
- **Strengths**:
  - Significant improvement in accuracy and overall model performance due to feature engineering and model tuning.
  - Introduction of multiple evaluation metrics offers deeper insights into model performance.
- **Weaknesses**:
  - No specific weaknesses identified in the validation metrics; however, continuous monitoring is recommended to ensure consistent performance across different datasets.

## 2. Code Comparison Inferences

- **Implemented Code vs. Described Goals**:
  - **Alignments**:
    - The codebase now includes modular scripts and a clean folder structure, aligning with best practices for production-grade systems.
    - Feature engineering and preprocessing enhancements align with the goals of improving model input representation.
    - The introduction of Random Forest and XGBoost models, along with hyperparameter tuning, aligns with the goal of improving model performance.
    - The addition of a Streamlit app for real-time predictions aligns with the goal of creating a user-friendly interface.
    - MLflow integration for experiment tracking aligns with the goal of enhancing reproducibility and collaboration.
  - **Discrepancies**:
    - The whitepaper lacks mention of the "Sepal length (cm)" feature used in the code.
    - Validation metrics in the code exceed those specified in the whitepaper, indicating a need for documentation updates.
- **Improvements/Regressions**:
  - The system has improved significantly in terms of modularity, explainability, and user interaction compared to the initial version.
  - No regressions identified; the system has evolved from a proof-of-concept to a professional-grade ML system.

## 3. Recommendations

- **Documentation Updates**:
  - Update the whitepaper to include the "Sepal length (cm)" feature and revise validation metrics to reflect the improved performance.
- **Model and Feature Enhancements**:
  - Continue exploring additional feature engineering techniques to further enhance model performance.
  - Consider implementing additional models or ensemble methods to further improve accuracy and robustness.
- **System Robustness and Consistency**:
  - Regularly validate the model on new datasets to ensure consistent performance.
  - Implement automated testing for data preprocessing and model inference to catch potential errors early.
- **User Interface and Deployment**:
  - Gather user feedback on the Streamlit app to identify areas for improvement in user experience.
  - Explore deployment options to scale the application for larger user bases if needed.

By following these recommendations, the system can maintain its high performance and continue to evolve in response to user needs and technological advancements.
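
The figures quoted in the report (for example the accuracy change from ~78% to 84%) come from the input blocks, so a crude sanity check is to confirm that every percentage in the report also occurs in at least one input. The sketch below assumes plain `NN%` or `NN.NN%` notation; `unsupported_percentages` is a hypothetical helper and cannot tell whether a figure is used in the right context:

```python
import re

_PERCENT_RE = re.compile(r"\d+(?:\.\d+)?%")


def unsupported_percentages(report: str, *sources: str) -> list[str]:
    """Return percentages quoted in the report that occur in none of the source texts."""
    known = set()
    for source in sources:
        known.update(_PERCENT_RE.findall(source))
    # Preserve report order and list each unsupported value once.
    seen, missing = set(), []
    for value in _PERCENT_RE.findall(report):
        if value not in known and value not in seen:
            seen.add(value)
            missing.append(value)
    return missing
```

For the cell above, the call would be `unsupported_percentages(result, input1, input2, input3)`; a non-empty list is a prompt to reread the report against its inputs rather than proof of an error.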