# Agricultural Analytics Demo

This notebook demonstrates the complete workflow of our agricultural analytics system:
1. PDF Text and Table Extraction
2. Futures Prices EDA and Anomaly Detection
3. Report Summarization Using an LLM

Let's start by importing the required libraries and initializing our components.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import json

from agri_analytics import AgriAnalytics

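# Set up plotting. The bare 'seaborn' style name was deprecated in
# Matplotlib 3.6, so apply seaborn's default theme directly instead.
sns.set_theme()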
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)

## 1. PDF Processing

First, we'll extract text and tables from the Oilseeds Outlook Report using three extraction methods: direct PDF text parsing, tabula-py for tables, and OCR for complex layouts.

In [None]:
# Initialize the analytics system
analytics = AgriAnalytics()

# Extract PDF content
pdf_text, pdf_tables, cv_tables = analytics.extract_pdf_content('Oilseeds Outlook Report 122024.pdf')

print(f"Number of tables extracted: {len(pdf_tables)}\n")
print("First 500 characters of extracted text:")
print(pdf_text[:500])

Let's examine the extracted tables. The system uses both tabula-py for structured tables and computer vision techniques for complex layouts:

In [None]:
# Display extracted tables
print("Tables extracted using tabula-py:")
for i, table in enumerate(pdf_tables):
    print(f"\nTable {i+1}:")
    display(table)

print("\nTables detected using CV:")
for i, table in enumerate(cv_tables):
    print(f"\nCV-detected Table {i+1}:")
    # Drop empty OCR tokens and print the remaining words as a single line
    text_data = [word for word in table['text'] if word.strip()]
    print(' '.join(text_data))
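
The CV-detected tables above are printed as a flat stream of words. As a minimal sketch, assuming each `table` dictionary follows the layout that pytesseract's `image_to_data` returns with `Output.DICT` (parallel lists keyed by `text`, `block_num`, `par_num`, and `line_num`), the words can be regrouped into text lines, which is closer to table rows. The helper name `ocr_words_to_lines` and the sample data are hypothetical; only the `text` key is confirmed by the cell above.

```python
def ocr_words_to_lines(ocr):
    """Group OCR words into text lines, preserving reading order.

    Assumes `ocr` holds parallel lists under the keys 'text', 'block_num',
    'par_num' and 'line_num' (the pytesseract Output.DICT layout).
    """
    lines = {}  # dicts keep insertion order, so reading order is preserved
    for i, word in enumerate(ocr["text"]):
        if not word.strip():
            continue  # skip empty tokens emitted for layout boxes
        key = (ocr["block_num"][i], ocr["par_num"][i], ocr["line_num"][i])
        lines.setdefault(key, []).append(word)
    return [" ".join(words) for words in lines.values()]


# Hypothetical OCR output for a two-row table (values are placeholders)
sample = {
    "text": ["", "Soybeans", "10", "", "Canola", "20"],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [0, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 0, 2, 2],
}
rows = ocr_words_to_lines(sample)
print(rows)
```

Each returned string corresponds to one OCR text line. Splitting those lines into columns would additionally need the word positions, which this sketch leaves out.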

## 2. Futures Prices Analysis

Now let's analyze the futures prices data. We'll look at price trends for different commodities and detect anomalies using the Isolation Forest algorithm.

In [None]:
# Load futures data
futures_data = analytics.load_futures_data('futures_prices.csv')

# Basic statistics
print("Basic statistics for futures prices:")
display(futures_data.groupby('Symbol')['Close'].describe())

# Plot price trends by commodity
plt.figure(figsize=(15, 8))
for symbol in futures_data['Symbol'].unique():
    symbol_data = futures_data[futures_data['Symbol'] == symbol]
    plt.plot(symbol_data['Date'], symbol_data['Close'], label=symbol, alpha=0.7)

plt.title('Futures Prices Over Time by Commodity')
plt.xlabel('Date')
plt.ylabel('Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Calculate volatility as the coefficient of variation (std / mean, in percent).
# The ratio uses unrounded values; rounding is applied only for display.
volatility = futures_data.groupby('Symbol')['Close'].agg(['std', 'mean'])
volatility['cv'] = volatility['std'] / volatility['mean'] * 100
print("\nPrice Volatility by Commodity (Coefficient of Variation):")
display(volatility.sort_values('cv', ascending=False).round(2))
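
To make the coefficient of variation concrete, here is a small worked example on made-up prices. The symbols `A` and `B` and their values are illustrative and are not taken from `futures_prices.csv`. Symbol A has a sample standard deviation of 10 around a mean of 100 (CV of 10%), while symbol B has a standard deviation of 1 around a mean of 50 (CV of 2%).

```python
import pandas as pd

# Hypothetical closing prices for two symbols
prices = pd.DataFrame({
    "Symbol": ["A", "A", "A", "B", "B", "B"],
    "Close": [100.0, 110.0, 90.0, 50.0, 51.0, 49.0],
})

# Sample standard deviation (ddof=1) and mean per symbol
cv_stats = prices.groupby("Symbol")["Close"].agg(["std", "mean"])

# Coefficient of variation in percent: dispersion relative to the price level
cv_stats["cv"] = cv_stats["std"] / cv_stats["mean"] * 100

print(cv_stats.round(2))
```

Because the CV divides by the mean, it lets commodities quoted at very different price levels be compared on the same scale.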

### Anomaly Detection

Let's detect and visualize price anomalies using Isolation Forest. We set the contamination factor to 0.1, which tells the model to flag roughly 10% of the points as anomalies, and use 150 estimators (trees); more trees generally give more stable anomaly scores at the cost of extra compute.
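
The internals of `detect_anomalies` are not shown in this notebook. As a rough sketch of how these two parameters behave, assuming an implementation along the lines of scikit-learn's `IsolationForest`, the example below fits a model to synthetic prices with one obvious spike. The data, the variable names, and the fixed `random_state` are all illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic prices: 199 points near 100 plus one obvious spike at the end
prices = np.append(rng.normal(loc=100.0, scale=1.0, size=199), 150.0)
X = prices.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# contamination sets the share of points to flag; n_estimators is the tree count
model = IsolationForest(contamination=0.1, n_estimators=150, random_state=0)
labels = model.fit_predict(X)  # -1 marks an anomaly, 1 marks a normal point

n_flagged = int((labels == -1).sum())
print(f"Flagged {n_flagged} of {len(prices)} points")
print("Spike flagged:", bool(labels[-1] == -1))
```

Note that the contamination value fixes the share of flagged points in advance, so about 10% of the data is labeled anomalous whether or not that many unusual prices exist.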

In [None]:
# Detect anomalies
anomalies, stats = analytics.detect_anomalies(contamination=0.1, n_estimators=150)

print("Anomaly Detection Statistics:")
print(json.dumps(stats, indent=2))

# Visualize anomalies
analytics.visualize_prices_and_anomalies()

# Additional analysis of anomalies (rows labeled -1 in 'is_anomaly' are the flagged points)
anomaly_dates = futures_data[futures_data['is_anomaly'] == -1][['Date', 'Symbol', 'Close']]
print("\nTen Highest-Priced Anomalous Points:")
display(anomaly_dates.sort_values('Close', ascending=False).head(10))

## 3. Report Summarization

Finally, let's use our LLM-based summarizer to generate a structured summary of the report. The summarizer uses a Llama 2 model to analyze the text and extract key insights in four categories:
- Market Trends
- Price Forecasts
- Supply/Demand Analysis
- Key Risk Factors
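
The code cell that follows reads four top-level keys from the summarizer's JSON output: `summary`, `key_metrics`, `extracted_tables_count`, and `cv_detected_tables_count`. Since LLM output can be incomplete, it may help to check for those keys before printing. The sketch below is one way to do that; the helper `parse_summary`, the snake_case section names, and the placeholder values are assumptions, not the system's documented schema.

```python
import json

# Top-level keys read by the summary cell in this notebook
REQUIRED_KEYS = ("summary", "key_metrics",
                 "extracted_tables_count", "cv_detected_tables_count")


def parse_summary(summary_json):
    """Parse the summarizer output and fail early if a key is missing."""
    data = json.loads(summary_json)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise KeyError(f"summary is missing keys: {missing}")
    return data


# Hypothetical payload; section names and placeholder values are assumptions
example = json.dumps({
    "summary": {"market_trends": "...", "price_forecasts": "...",
                "supply_demand_analysis": "...", "key_risk_factors": "..."},
    "key_metrics": {},
    "extracted_tables_count": 0,
    "cv_detected_tables_count": 0,
})
parsed = parse_summary(example)

# Same title formatting as the notebook cell: underscores to spaces, title case
titles = [name.replace("_", " ").title() for name in parsed["summary"]]
print(titles)
```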

In [None]:
# Generate report summary
summary_json = analytics.generate_report_summary()
summary = json.loads(summary_json)

print("Report Summary:")
print("=============\n")

for section, content in summary['summary'].items():
    print(f"{section.replace('_', ' ').title()}:")
    print(f"{content}\n")

print("Key Metrics:")
print(json.dumps(summary['key_metrics'], indent=2))

print("\nExtracted Data Statistics:")
print(f"Number of tables found: {summary['extracted_tables_count']}")
print(f"Number of CV-detected tables: {summary['cv_detected_tables_count']}")

## Conclusion

This notebook has demonstrated the complete workflow of our agricultural analytics system:
1. We extracted text and tables from the PDF report using multiple methods (PDF parsing, tabula-py, and OCR)
2. We analyzed futures prices data, visualized trends, and detected anomalies using Isolation Forest
3. We generated a structured summary of the report using our LLM-based summarizer

Together, these steps combine traditional data analysis (descriptive statistics and trend plots) with ML techniques (Isolation Forest and an LLM summarizer) to give a combined view of the report and the futures price data.