# Telangana Crop Yield Prediction - Data Processing Pipeline

## Overview
This notebook demonstrates the complete data processing pipeline for the Telangana crop yield prediction system.

### What You'll Learn:
1. Loading crop and weather datasets
2. Data cleaning and preprocessing
3. Feature engineering for agricultural predictions
4. Merging multiple data sources
5. Creating the final training dataset

### Dataset Information:
- **Crop Data**: 32 crops across 32 districts (2018-2022)
- **Weather Data**: Daily rainfall, temperature, humidity (47+ months)
- **Final Dataset**: 5,621 records with 25+ features

---

## Step 1: Import Required Libraries

We'll use:
- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computations
- **matplotlib/seaborn**: Data visualization

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Configure display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print(f"   Pandas version: {pd.__version__}")
print(f"   NumPy version: {np.__version__}")

## Step 2: Load Crop Data

### About the Crop Dataset:
- Contains agricultural production data for Telangana
- Covers 32 different crops
- Spans across 32 districts and multiple years
- Includes: District, Season, Crop, Year, Area, Production
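Before exploring the file, it can help to confirm that the expected columns are present. The sketch below is a minimal check on a made-up toy frame; the helper name `missing_columns` and the toy values are assumptions for illustration and are not part of the original pipeline.

```python
import pandas as pd

# Columns the notebook expects in the crop CSV (taken from the list above).
REQUIRED_COLUMNS = ["District", "Season", "Crop", "Year", "Area", "Production"]


def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]


# Toy frame standing in for the real CSV; the values are made up.
toy_crop_df = pd.DataFrame({
    "District": ["Adilabad"],
    "Season": ["Kharif"],
    "Crop": ["Cotton"],
    "Year": [2018],
    "Area": [100.0],
})

print(missing_columns(toy_crop_df))  # ['Production']
```

Running the same helper on the real `crop_df` right after loading would surface a renamed or missing column before any later cell fails on it.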

In [None]:
# Load crop data
crop_df = pd.read_csv('Telangana_AgriData_Suite/sample_data/telangana_crop_data_cleaned.csv')

print("📊 Crop Dataset Loaded!")
print(f"   Total Records: {len(crop_df):,}")
print(f"   Columns: {list(crop_df.columns)}")
print(f"\n   Shape: {crop_df.shape}")

# Display first few records
print("\n📋 Sample Data:")
crop_df.head(10)

### Data Exploration: Crop Dataset

In [None]:
# Basic statistics
print("📊 Dataset Information:")
print(f"   Unique Districts: {crop_df['District'].nunique()}")
print(f"   Unique Crops: {crop_df['Crop'].nunique()}")
print(f"   Unique Seasons: {crop_df['Season'].nunique()}")
print(f"   Year Range: {crop_df['Year'].min()} - {crop_df['Year'].max()}")

print("\n🌾 Available Crops:")
print(crop_df['Crop'].unique())

print("\n🏘️ Available Districts:")
print(sorted(crop_df['District'].unique()))

### Calculate Yield (Target Variable)

**Formula**: Yield (kg/ha) = (Production in tons × 1000) / Area in hectares

This is our **target variable** that we want to predict!
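As a worked instance of the formula, the sketch below applies it to three made-up rows, including a zero-area row, to show where infinite and missing values come from. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({
    "Production": [250.0, 10.0, 0.0],  # tons
    "Area": [100.0, 0.0, 0.0],         # hectares
})

# tons per hectare -> kg per hectare
toy["Yield"] = toy["Production"] / toy["Area"] * 1000

# 10/0 gives inf and 0/0 gives NaN; map inf to NaN so both count as missing.
toy["Yield"] = toy["Yield"].replace([np.inf, -np.inf], np.nan)

print(toy["Yield"].tolist())  # [2500.0, nan, nan]
```

Note that the rows with missing yield are still in the frame; whether to drop or impute them is a separate decision for the modeling step.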

In [None]:
# Calculate yield in kg/ha
# Multiply by 1000 to convert from tons to kg
crop_df['Yield'] = (crop_df['Production'] / crop_df['Area']) * 1000

# Division by a zero area yields inf; convert it to NaN so it is treated as missing
crop_df['Yield'] = crop_df['Yield'].replace([np.inf, -np.inf], np.nan)

print("✅ Yield Calculated!")
print(f"\n📊 Yield Statistics (kg/ha):")
print(crop_df['Yield'].describe())

# Visualize yield distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

crop_df['Yield'].hist(bins=50, edgecolor='black', ax=axes[0])
axes[0].set_xlabel('Yield (kg/ha)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Crop Yields')
axes[0].grid(True, alpha=0.3)

# Pass ax explicitly; otherwise DataFrame.boxplot draws on a new figure
# and the second panel stays empty
crop_df.boxplot(column='Yield', by='Season', ax=axes[1])
axes[1].set_xlabel('Season')
axes[1].set_ylabel('Yield (kg/ha)')
axes[1].set_title('Yield by Season')
fig.suptitle('')  # Remove the automatic "Boxplot grouped by Season" title

plt.tight_layout()
plt.show()

## Step 3: Load Weather Data

### About the Weather Dataset:
- Daily weather observations (47+ months)
- Parameters: Rainfall, Temperature (Min/Max), Humidity (Min/Max)
- Aggregated to seasonal level (Kharif: Jun-Oct, Rabi: Nov-Mar)
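The file loaded in the next cell is already seasonal, so the aggregation itself is not shown in this notebook. The sketch below illustrates one way such an aggregation could be done. The daily column names (`Date`, `Rainfall`, `Temp_Min`, `Temp_Max`) are hypothetical, and labeling a Rabi season by the year in which it starts is an assumption, not something the source confirms.

```python
import pandas as pd

# Month -> season, following Kharif: Jun-Oct and Rabi: Nov-Mar.
# April and May fall outside both seasons and are dropped.
SEASON_BY_MONTH = {
    6: "Kharif", 7: "Kharif", 8: "Kharif", 9: "Kharif", 10: "Kharif",
    11: "Rabi", 12: "Rabi", 1: "Rabi", 2: "Rabi", 3: "Rabi",
}


def to_seasonal(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily observations to one row per District/Season/Year."""
    month = daily["Date"].dt.month
    year = daily["Date"].dt.year
    labeled = daily.assign(
        Season=month.map(SEASON_BY_MONTH),
        # Assumption: Jan-Mar belong to the Rabi season that began the previous year.
        Year=year.where(month > 3, year - 1),
    ).dropna(subset=["Season"])
    return labeled.groupby(["District", "Season", "Year"], as_index=False).agg(
        Total_Rainfall=("Rainfall", "sum"),
        Avg_Temp_Min=("Temp_Min", "mean"),
        Avg_Temp_Max=("Temp_Max", "mean"),
    )


# Made-up daily records for a single district.
daily = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2020-06-01", "2020-07-15", "2020-12-10", "2021-01-05", "2021-04-20"]
    ),
    "District": ["Adilabad"] * 5,
    "Rainfall": [10.0, 30.0, 0.0, 2.0, 5.0],
    "Temp_Min": [24.0, 23.0, 15.0, 14.0, 22.0],
    "Temp_Max": [36.0, 32.0, 28.0, 29.0, 39.0],
})

seasonal = to_seasonal(daily)
print(seasonal)
```

The output column names match the ones used later in this notebook (`Total_Rainfall`, `Avg_Temp_Min`, `Avg_Temp_Max`), so the result has the same shape as the seasonal file.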

In [None]:
# Load rainfall/weather data
weather_df = pd.read_csv('Telangana_AgriData_Suite/sample_data/telangana_rainfall_seasonal.csv')

print("🌦️ Weather Dataset Loaded!")
print(f"   Total Records: {len(weather_df):,}")
print(f"   Columns: {list(weather_df.columns)}")
print(f"\n   Shape: {weather_df.shape}")

# Display sample
print("\n📋 Sample Weather Data:")
weather_df.head(10)

### Weather Data Statistics

In [None]:
# Weather statistics
print("📊 Weather Statistics:")
print(f"   Districts with weather data: {weather_df['District'].nunique()}")
print(f"   Seasons: {weather_df['Season'].unique()}")
print(f"   Year Range: {weather_df['Year'].min()} - {weather_df['Year'].max()}")

print("\n🌧️ Rainfall Statistics (mm):")
print(weather_df['Total_Rainfall'].describe())

print("\n🌡️ Temperature Statistics (°C):")
print(weather_df[['Avg_Temp_Min', 'Avg_Temp_Max']].describe())

## Step 4: Merge Crop and Weather Data

### Merging Strategy:
- Join on: **District, Season, Year**
- This links crop performance to corresponding weather conditions
- Enables weather-driven yield predictions
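An inner join silently drops crop rows that have no weather match, for example when a district name is spelled differently in the two files. The sketch below, on made-up toy frames, shows one way to inspect this before committing to the join, using the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

keys = ["District", "Season", "Year"]

# Toy frames; the values are made up.
crop = pd.DataFrame({
    "District": ["Adilabad", "Adilabad", "Nirmal"],
    "Season": ["Kharif", "Kharif", "Rabi"],
    "Year": [2020, 2020, 2020],
    "Crop": ["Cotton", "Maize", "Rice"],
})
weather = pd.DataFrame({
    "District": ["Adilabad"],
    "Season": ["Kharif"],
    "Year": [2020],
    "Total_Rainfall": [850.0],
})

# A left join with indicator=True keeps unmatched crop rows visible.
# validate="many_to_one" raises if weather has duplicate key combinations,
# which would otherwise duplicate crop rows.
checked = crop.merge(
    weather, on=keys, how="left", indicator=True, validate="many_to_one"
)
unmatched = checked[checked["_merge"] == "left_only"]

print(f"{len(unmatched)} of {len(crop)} crop rows have no weather match")
# 1 of 3 crop rows have no weather match
```

If the unmatched count is unexpectedly high, comparing the sorted unique district names of both frames is usually the quickest way to find the mismatch.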

In [None]:
# Merge crop and weather data
print("🔗 Merging crop and weather datasets...")
print(f"   Crop records before merge: {len(crop_df):,}")
print(f"   Weather records: {len(weather_df):,}")

# Perform merge
merged_df = crop_df.merge(
    weather_df,
    on=['District', 'Season', 'Year'],
    how='inner'  # Keep only matching records
)

print(f"\n✅ Merge Complete!")
print(f"   Final records: {len(merged_df):,}")
print(f"   Total features: {len(merged_df.columns)}")
print(f"\n   Districts in final dataset: {merged_df['District'].nunique()}")

# Check for data quality
print(f"\n📊 Data Quality Check:")
print(f"   Missing values:\n{merged_df.isnull().sum()[merged_df.isnull().sum() > 0]}")

## Step 5: Feature Engineering

### Creating Advanced Features:
We'll create domain-specific features that help the model learn better:

1. **Temporal Features**: Years since start
2. **Weather Derivatives**: Rainfall per day, average temperature, temperature range, average humidity, humidity range
3. **Agricultural Indicators**: GDD (Growing Degree Days)
4. **Stress Indicators**: Heat stress, water stress, optimal conditions
5. **Interaction Features**: Area × Rainfall, Area × Temperature
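The GDD indicator is the excess of mean temperature over a base temperature, floored at zero. The sketch below compares two ways of computing it on made-up values and shows that they differ on missing data. The base of 10°C is the value used in this notebook; the sample temperatures are assumptions for illustration.

```python
import numpy as np
import pandas as pd

BASE_TEMP = 10  # °C, the base temperature used in this notebook

# Made-up mean temperatures, including one missing value.
temp_avg = pd.Series([27.5, 8.0, np.nan])

# Vectorized form: subtract the base and floor at zero. NaN stays NaN.
gdd_clip = (temp_avg - BASE_TEMP).clip(lower=0)

# Element-wise form with Python's max(): NaN silently becomes 0,
# because every comparison against NaN is False.
gdd_max = temp_avg.apply(lambda t: max(0, t - BASE_TEMP))

print(gdd_clip.tolist())  # [17.5, 0.0, nan]
print(gdd_max.tolist())   # [17.5, 0.0, 0.0]
```

The vectorized form is usually preferable, since a zero GDD for a row with no temperature data would look like a valid cold-season value to the model.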

In [None]:
print("🔧 Engineering Features...\n")

# 1. Temporal Feature
merged_df['Years_Since_Start'] = merged_df['Year'] - merged_df['Year'].min()
print("✅ Created: Years_Since_Start")

# 2. Rainfall-based features
# Days in season (Kharif: 153 days, Rabi: 151 days)
merged_df['Season_Days'] = merged_df['Season'].map({'Kharif': 153, 'Rabi': 151})
merged_df['Rainfall_Per_Day'] = merged_df['Total_Rainfall'] / merged_df['Season_Days']
print("✅ Created: Rainfall_Per_Day")

# 3. Temperature features
merged_df['Temp_Avg'] = (merged_df['Avg_Temp_Min'] + merged_df['Avg_Temp_Max']) / 2
merged_df['Temp_Range'] = merged_df['Avg_Temp_Max'] - merged_df['Avg_Temp_Min']
print("✅ Created: Temp_Avg, Temp_Range")

# 4. Humidity features
merged_df['Humidity_Avg'] = (merged_df['Avg_Humidity_Min'] + merged_df['Avg_Humidity_Max']) / 2
merged_df['Humidity_Range'] = merged_df['Avg_Humidity_Max'] - merged_df['Avg_Humidity_Min']
print("✅ Created: Humidity_Avg, Humidity_Range")

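# 5. Growing Degree Days (GDD)
# Formula: max(0, Temp_Avg - Base_Temp), applied to the seasonal mean temperature,
# so this is a mean daily GDD rather than a season-accumulated total
# Base temperature for crops (typically 10°C)
base_temp = 10
# clip() is vectorized and keeps NaN as NaN (Python's max(0, NaN) would return 0)
merged_df['GDD'] = (merged_df['Temp_Avg'] - base_temp).clip(lower=0)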
print("✅ Created: GDD (Growing Degree Days)")

# 6. Stress Indicators
# Heat Stress: seasonal average maximum temperature above 35°C
merged_df['Heat_Stress'] = (merged_df['Avg_Temp_Max'] > 35).astype(int)

# Water Stress: seasonal total rainfall below 500 mm
merged_df['Water_Stress'] = (merged_df['Total_Rainfall'] < 500).astype(int)

# Optimal Conditions: No stress
merged_df['Optimal_Conditions'] = ((merged_df['Heat_Stress'] == 0) & 
                                    (merged_df['Water_Stress'] == 0)).astype(int)
print("✅ Created: Heat_Stress, Water_Stress, Optimal_Conditions")

# 7. Interaction Features
# These capture combined effects
merged_df['Area_Rainfall_Interaction'] = merged_df['Area'] * merged_df['Total_Rainfall']
merged_df['Area_Temp_Interaction'] = merged_df['Area'] * merged_df['Temp_Avg']
print("✅ Created: Area_Rainfall_Interaction, Area_Temp_Interaction")

print(f"\n🎉 Feature Engineering Complete!")
print(f"   Total features now: {len(merged_df.columns)}")

### Visualize Engineered Features

In [None]:
# Visualize key engineered features
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# GDD distribution
axes[0, 0].hist(merged_df['GDD'], bins=30, edgecolor='black', color='skyblue')
axes[0, 0].set_title('Growing Degree Days (GDD)')
axes[0, 0].set_xlabel('GDD')
axes[0, 0].set_ylabel('Frequency')

# Rainfall per day
axes[0, 1].hist(merged_df['Rainfall_Per_Day'], bins=30, edgecolor='black', color='lightgreen')
axes[0, 1].set_title('Daily Rainfall')
axes[0, 1].set_xlabel('Rainfall (mm/day)')
axes[0, 1].set_ylabel('Frequency')

# Temperature range
axes[0, 2].hist(merged_df['Temp_Range'], bins=30, edgecolor='black', color='coral')
axes[0, 2].set_title('Temperature Range')
axes[0, 2].set_xlabel('Range (°C)')
axes[0, 2].set_ylabel('Frequency')

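# Stress indicators
# Reindex to [0, 1] so the bars always line up with the 'No', 'Yes' tick labels,
# even if value_counts() returns a different order or a flag never occurs
stress_counts = pd.DataFrame({
    'Heat Stress': merged_df['Heat_Stress'].value_counts(),
    'Water Stress': merged_df['Water_Stress'].value_counts(),
    'Optimal': merged_df['Optimal_Conditions'].value_counts()
}).reindex([0, 1]).fillna(0)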
stress_counts.plot(kind='bar', ax=axes[1, 0], color=['red', 'blue', 'green'])
axes[1, 0].set_title('Stress Indicators')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_xticklabels(['No', 'Yes'], rotation=0)

# Yield vs GDD
axes[1, 1].scatter(merged_df['GDD'], merged_df['Yield'], alpha=0.5, s=10)
axes[1, 1].set_title('Yield vs Growing Degree Days')
axes[1, 1].set_xlabel('GDD')
axes[1, 1].set_ylabel('Yield (kg/ha)')

# Yield vs Rainfall
axes[1, 2].scatter(merged_df['Total_Rainfall'], merged_df['Yield'], alpha=0.5, s=10, color='green')
axes[1, 2].set_title('Yield vs Total Rainfall')
axes[1, 2].set_xlabel('Rainfall (mm)')
axes[1, 2].set_ylabel('Yield (kg/ha)')

plt.tight_layout()
plt.show()

## Step 6: Save Processed Dataset

We'll save the fully processed dataset for use in machine learning.
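A quick round-trip check can confirm that the saved CSV reloads with the same shape and columns. The sketch below uses a made-up frame and a temporary directory; the file name `roundtrip_check.csv` is hypothetical and is not the notebook's output file.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with one missing target value.
df = pd.DataFrame({
    "District": ["Adilabad", "Nirmal"],
    "Year": [2020, 2021],
    "Yield": [2500.0, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip_check.csv")  # hypothetical file name
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    size_kb = os.path.getsize(path) / 1024  # actual size on disk

# Same shape, same columns, and the missing value survives as NaN.
assert reloaded.shape == df.shape
assert list(reloaded.columns) == list(df.columns)
print(f"Round trip OK, {size_kb:.2f} KB on disk")
```

`os.path.getsize` reports the size of the file on disk, which generally differs from the in-memory size reported by `DataFrame.memory_usage`.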

In [None]:
# Save processed dataset
output_file = 'telangana_complete_processed.csv'
merged_df.to_csv(output_file, index=False)

print(f"✅ Dataset Saved: {output_file}")
print(f"\n📊 Final Dataset Summary:")
print(f"   Total Records: {len(merged_df):,}")
print(f"   Total Features: {len(merged_df.columns)}")
print(f"   Districts: {merged_df['District'].nunique()}")
print(f"   Crops: {merged_df['Crop'].nunique()}")
print(f"   Year Range: {merged_df['Year'].min()} - {merged_df['Year'].max()}")
print(f"\n   In-Memory Size: {merged_df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")

# Display final columns
print(f"\n📋 Final Columns:")
for i, col in enumerate(merged_df.columns, 1):
    print(f"   {i:2d}. {col}")

## Summary

### What We Accomplished:
✅ Loaded crop and weather datasets

✅ Calculated yield (target variable)

✅ Merged data sources on District, Season, Year

✅ Engineered 13 features (including the helper column Season_Days)

✅ Created comprehensive training dataset

✅ Saved processed data for ML modeling

### Key Features Created:
- **Temporal**: Years_Since_Start
- **Weather**: Rainfall_Per_Day, Temp_Avg, Temp_Range, Humidity_Avg, Humidity_Range
- **Agricultural**: GDD, Heat_Stress, Water_Stress, Optimal_Conditions
- **Interactions**: Area_Rainfall_Interaction, Area_Temp_Interaction

### Next Step:
Move to **02_ML_Model_Training.ipynb** to build and train the prediction model!

---