# Financial Document Analyzer: Step-by-Step Demo

This notebook walks you through the full workflow of the **Financial Document Analyzer**, a tool that extracts, classifies, and analyzes expenses from financial documents such as the Italian "nota integrativa" (the notes to the financial statements). The focus is on energy-related expenses and their share of total service and production costs across sectors.

### Project Structure
- **`scripts/`**: Contains modular Python scripts for data loading, HTML parsing, expense classification, and analysis.
- **`data/`**: Holds sample or synthetic financial data.
- **`outputs/`**: Stores generated visualizations.

### Notebook Overview
1. **Load Data**: Start with synthetic or real financial data.
2. **Extract Tables**: Parse HTML to extract expense descriptions and values.
3. **Classify Expenses**: Use zero-shot NLP to categorize expenses.
4. **Analyze Energy Costs**: Calculate energy expenses and their ratios to service and production costs.
5. **Visualize Trends**: Generate sector-based visualizations.

Let’s get started!

## Step 1: Load Data
We begin by loading financial data into a DataFrame. For this demo, we’ll generate synthetic data on the fly. In a real-world scenario, you can replace this with your own CSV file from the `data/` directory.

In [None]:
import sys
sys.path.append('../scripts')  # Add scripts directory to path

from data_loader import load_data

# Load synthetic data (or specify a file path to load real data)
df = load_data()

# Display the first few rows
print("Sample DataFrame:")
display(df.head())

**Explanation**
- **`load_data()`**: This function (from `scripts/data_loader.py`) generates synthetic data with columns like `Company_ID`, `Sector_Code`, `HTML` (containing expense tables), `Total_Production_Costs`, and `Service_Costs`.
- **Output**: A pandas DataFrame with sample financial data for multiple companies. For example:
  ```
  Company_ID  Sector_Code  HTML                Total_Production_Costs  Service_Costs
  Company_1   A            <table>...</table>  1000000                 300000
  Company_2   B            <table>...</table>  2000000                 500000
  ```
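
To make the shape of this data concrete, here is a minimal sketch of how rows like the ones above could be generated. It is not the code in `scripts/data_loader.py`: the helper name `make_synthetic_records`, its parameters, and the value ranges are all assumptions chosen for illustration.

```python
import random


def make_synthetic_records(n_companies=4, sectors=("A", "B"), seed=0):
    """Build synthetic rows shaped like the sample output (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the demo data is reproducible
    items = ["Electricity bill", "Consulting fees", "Raw materials"]
    records = []
    for i in range(1, n_companies + 1):
        # One <tr> per expense, with description and value in separate <td> cells.
        rows = "".join(
            f"<tr><td>{name}</td><td>{rng.randint(10_000, 200_000)}</td></tr>"
            for name in items
        )
        records.append({
            "Company_ID": f"Company_{i}",
            "Sector_Code": sectors[(i - 1) % len(sectors)],  # cycle through sectors
            "HTML": f"<table>{rows}</table>",
            "Total_Production_Costs": rng.randint(1_000_000, 3_000_000),
            "Service_Costs": rng.randint(300_000, 600_000),
        })
    return records


records = make_synthetic_records()
```

Passing `records` to `pd.DataFrame(records)` would give a DataFrame with the same five columns used in the rest of this notebook.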

## Step 2: Extract Tables from HTML
Next, we parse the HTML content in the `HTML` column to extract expense descriptions and their corresponding values.

In [None]:
from html_parser import extract_tables

# Extract expenses from the HTML column
df['Expenses'] = df['HTML'].apply(extract_tables)

# Display extracted expenses for the first company
print("\nExtracted Expenses for Company_1:")
display(df['Expenses'].iloc[0])

**Explanation**
- **`extract_tables(html)`**: This function (from `scripts/html_parser.py`) uses BeautifulSoup to parse HTML `<table>` elements, extracting rows (`<tr>`) and cells (`<td>`) into a list of tuples: `(description, value)`.
- **Output**: For each company, a list of expense items. Example:
  ```
  [('Electricity bill', 50000), ('Consulting fees', 20000), ('Raw materials', 150000)]
  ```
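
The row-and-cell logic can be sketched without any third-party dependency. The example below uses the standard library's `html.parser` as a stand-in for BeautifulSoup, so it is not the implementation in `scripts/html_parser.py`. The names `ExpenseTableParser` and `extract_tables_sketch` are hypothetical, and the number format (plain digits with optional comma thousands separators) is an assumption.

```python
from html.parser import HTMLParser


class ExpenseTableParser(HTMLParser):
    """Collect (description, value) pairs from two-cell <tr> rows."""

    def __init__(self):
        super().__init__()
        self.expenses = []
        self._row = None   # cells of the current row, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Keep only rows with exactly two <td> cells; header rows are skipped.
            if len(self._row) == 2:
                description, raw_value = self._row
                try:
                    # Assumed format: digits with optional comma thousands separators.
                    self.expenses.append((description, float(raw_value.replace(",", ""))))
                except ValueError:
                    pass  # non-numeric value cell: skip the row
            self._row, self._cell = None, None


def extract_tables_sketch(html):
    """Return a list of (description, value) tuples found in the HTML string."""
    parser = ExpenseTableParser()
    parser.feed(html)
    parser.close()
    return parser.expenses


sample = (
    "<table><tr><th>Item</th><th>Amount</th></tr>"
    "<tr><td>Electricity bill</td><td>50,000</td></tr>"
    "<tr><td>Consulting fees</td><td>20000</td></tr></table>"
)
expenses = extract_tables_sketch(sample)
```

Real Italian filings often write numbers as `50.000,00`, so the value-parsing line is the part most likely to need adapting to your documents.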

## Step 3: Classify Expenses Using NLP
We use zero-shot classification to categorize each expense into predefined categories (e.g., "Energy", "Services") without requiring labeled training data.

In [None]:
from expense_classifier import classify_expense

# Define candidate labels for classification
candidate_labels = ["Energy", "Services", "Materials", "Other"]

# Classify each expense
df['Classified_Expenses'] = df['Expenses'].apply(
    lambda exps: [(classify_expense(desc, candidate_labels), value) for desc, value in exps]
)

# Display classified expenses for the first company
print("\nClassified Expenses for Company_1:")
display(df['Classified_Expenses'].iloc[0])

**Explanation**
- **`classify_expense(desc, candidate_labels)`**: This function (from `scripts/expense_classifier.py`) uses a pre-trained zero-shot classification model (e.g., from Hugging Face) to predict the most likely category for each expense description.
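- **Logic**: If the model’s confidence score for the top label exceeds 0.7, that label is assigned; otherwise, the expense is marked as "Uncategorized". This fallback is separate from the "Other" candidate label, which the model can still predict.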
- **Output**: A list of tuples with classified categories and values. Example:
  ```
  [('Energy', 50000), ('Services', 20000), ('Materials', 150000)]
  ```
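
The thresholding step can be illustrated on its own. The sketch below assumes the model output is a dictionary with parallel `labels` and `scores` lists, which is the shape returned by the Hugging Face `transformers` zero-shot classification pipeline. The function name `label_from_result` and the hand-written score values are hypothetical, not taken from `scripts/expense_classifier.py`.

```python
def label_from_result(result, threshold=0.7, fallback="Uncategorized"):
    """Return the top label if its score exceeds the threshold, else the fallback."""
    # Take the highest-scoring label explicitly rather than relying on list order.
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score > threshold else fallback


# Hand-written results standing in for real model output (illustrative values).
# A real result would come from something like:
#   pipeline("zero-shot-classification")(description, candidate_labels)
confident = {
    "sequence": "Electricity bill",
    "labels": ["Energy", "Services", "Materials", "Other"],
    "scores": [0.91, 0.04, 0.03, 0.02],
}
ambiguous = {
    "sequence": "Miscellaneous charges",
    "labels": ["Other", "Services", "Materials", "Energy"],
    "scores": [0.40, 0.35, 0.15, 0.10],
}

confident_label = label_from_result(confident)
ambiguous_label = label_from_result(ambiguous)
```

Here `confident_label` is `"Energy"` and `ambiguous_label` is `"Uncategorized"`. In the pipeline's default single-label mode the scores are normalized across the candidate labels, so with four labels a 0.7 cutoff is fairly strict and may need tuning.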

## Step 4: Analyze Energy-Related Expenses
Now, we calculate the total energy costs for each company and compute their ratios to service and production costs.

In [None]:
from analysis import analyze_energy_costs

# Perform analysis for each company
df['Analysis'] = df.apply(
    lambda row: analyze_energy_costs(
        row['Classified_Expenses'],
        row['Service_Costs'],
        row['Total_Production_Costs'],
    ),
    axis=1,
)

# Display analysis for the first company
print("\nEnergy Cost Analysis for Company_1:")
display(df['Analysis'].iloc[0])

**Explanation**
- **`analyze_energy_costs(classified_expenses, service_costs, production_costs)`**: This function (from `scripts/analysis.py`) does the following:
  1. Sums all expenses classified as "Energy" to get `Energy_Costs`.
  2. Calculates `Energy_to_Service_Ratio` = `Energy_Costs / Service_Costs`.
  3. Calculates `Energy_to_Production_Ratio` = `Energy_Costs / Total_Production_Costs`.
- **Output**: A dictionary with analysis results. Example:
  ```
  {'Energy_Costs': 50000, 'Energy_to_Service_Ratio': 0.1667, 'Energy_to_Production_Ratio': 0.05}
  ```
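
The three steps above are simple enough to sketch directly. This is an illustrative version, not the code in `scripts/analysis.py`: the name `analyze_energy_costs_sketch`, the rounding to four decimals, and the choice to return `None` for a zero or missing cost base are assumptions.

```python
def analyze_energy_costs_sketch(classified_expenses, service_costs, production_costs):
    """Sum "Energy" expenses and relate them to service and production costs."""
    energy_costs = sum(
        value for label, value in classified_expenses if label == "Energy"
    )

    def ratio(numerator, denominator):
        # Assumed policy: return None rather than raise when the base is zero or missing.
        return round(numerator / denominator, 4) if denominator else None

    return {
        "Energy_Costs": energy_costs,
        "Energy_to_Service_Ratio": ratio(energy_costs, service_costs),
        "Energy_to_Production_Ratio": ratio(energy_costs, production_costs),
    }


result = analyze_energy_costs_sketch(
    [("Energy", 50000), ("Services", 20000), ("Materials", 150000)],
    service_costs=300000,
    production_costs=1000000,
)
```

With the Company_1 figures, `result` matches the example output: 50000 / 300000 rounds to 0.1667 and 50000 / 1000000 is 0.05.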

## Step 5: Visualize Sector Trends
Finally, we aggregate the energy cost ratios by sector and create bar charts to visualize trends.

In [None]:
from analysis import plot_sector_analysis

# Generate and display the plots
plot_sector_analysis(df)

**Explanation**
- **`plot_sector_analysis(df)`**: This function (from `scripts/analysis.py`) performs the following:
  1. Groups the DataFrame by `Sector_Code`.
  2. Calculates the mean `Energy_to_Service_Ratio` and `Energy_to_Production_Ratio` for each sector.
  3. Creates two bar charts:
     - **Chart 1**: Average energy costs as a percentage of service costs by sector.
     - **Chart 2**: Average energy costs as a percentage of production costs by sector.
- **Output**: Visualizations saved to `outputs/sector_energy_analysis.png`. Example plots might show:
  - Sector A: 16% of service costs, 5% of production costs.
  - Sector B: 20% of service costs, 8% of production costs.
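
The aggregation behind these charts can be sketched with pandas. Because the `Analysis` column holds dictionaries, the ratios have to be expanded into their own columns before grouping. The demo values below are made up for illustration, and multiplying by 100 to get percentages is an assumption about how `plot_sector_analysis` scales its axes.

```python
import pandas as pd

# Tiny stand-in for the notebook's DataFrame (illustrative values only).
demo = pd.DataFrame({
    "Sector_Code": ["A", "A", "B"],
    "Analysis": [
        {"Energy_to_Service_Ratio": 0.10, "Energy_to_Production_Ratio": 0.04},
        {"Energy_to_Service_Ratio": 0.20, "Energy_to_Production_Ratio": 0.06},
        {"Energy_to_Service_Ratio": 0.20, "Energy_to_Production_Ratio": 0.08},
    ],
})

ratio_columns = ["Energy_to_Service_Ratio", "Energy_to_Production_Ratio"]

# Expand the dicts into columns, reusing the original index so rows stay aligned.
ratios = pd.DataFrame(demo["Analysis"].tolist(), index=demo.index)

# Mean ratio per sector, then scaled from fractions to percentages.
sector_means = ratios.join(demo["Sector_Code"]).groupby("Sector_Code")[ratio_columns].mean()
sector_pct = sector_means * 100
```

From here, a call such as `sector_pct.plot.bar()` would draw the per-sector bars with matplotlib, one bar group per `Sector_Code`.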

## Conclusion
This notebook demonstrates how to:
- **Extract structured data** from unstructured HTML financial documents.
- **Apply NLP** for zero-shot classification of expenses.
- **Perform financial analysis** to understand energy cost impacts.
- **Visualize insights** across sectors using bar charts.

### Next Steps
You can extend this project by:
- Loading real data from the `data/` directory instead of synthetic data.
- Fine-tuning the underlying classification model on labeled, domain-specific examples.
- Adding interactive visualizations using tools like Streamlit or Plotly.

Happy analyzing!