# SmartCrops IoT ML System - Data Analysis

This notebook contains the machine learning analysis for the SmartCrops IoT system, including:
- **Entrega 1**: Crop Yield Analysis with clustering and regression models
- **Ir Além 2**: Sensor Data Classification for real-time health monitoring

In [None]:
# Import libraries (all in one cell for reproducibility)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix
import joblib  # For saving models

# Entrega 1: Crop Yield Analysis

This section focuses on analyzing crop yield data through clustering and regression modeling to identify patterns and predict yields.

In [None]:
# Load Dataset (place crop_yield.csv in same folder)
df_crop = pd.read_csv('crop_yield.csv')

# Exploratory Data Analysis

In [None]:
# Display basics
print(df_crop.head())
print(df_crop.describe())
print(df_crop.info())
print(df_crop.isnull().sum())  # Check missing

In [None]:
# Visualize distributions and correlations to understand the data before modeling
sns.histplot(df_crop['Yield'], kde=True)
plt.title('Yield Distribution')
plt.show()

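# numeric_only=True skips text columns such as 'Crop', which would otherwise raise an error
sns.heatmap(df_crop.corr(numeric_only=True), annot=True, cmap='coolwarm')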
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Handle Categoricals (e.g., one-hot encode 'Crop')
df_crop = pd.get_dummies(df_crop, columns=['Crop'])

## Clustering for Trends/Outliers

In [None]:
# Prep features (exclude Yield)
X_crop = df_crop.drop('Yield', axis=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_crop)

In [None]:
# Elbow Method for K
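inertias = []
k_values = range(1, 11)
for k in k_values:
    # n_init is set explicitly because its default differs across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()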
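
The elbow in an inertia plot can be hard to read by eye, so a silhouette score is a common second check when choosing k. The sketch below is only an illustration: it uses synthetic blobs (`X_demo`, a made-up stand-in for `X_scaled`) rather than the crop data, and the variable names are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_scaled (assumption: three well-separated groups)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]], cluster_std=0.6, random_state=42
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

silhouettes = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    silhouettes[k] = silhouette_score(X_demo_scaled, labels)

# Higher is better; pick the k with the largest mean silhouette
best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes)
print(f"Best k by silhouette: {best_k}")
```

On the real data, the same loop could run on `X_scaled`; if the elbow and the silhouette disagree, that is itself a sign the cluster structure is weak.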

In [None]:
# Fit KMeans (choose k based on plot, e.g., 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df_crop['Cluster'] = kmeans.fit_predict(X_scaled)

# Visualize (example)
plt.scatter(df_crop['Temperature'], df_crop['Yield'], c=df_crop['Cluster'])
plt.xlabel('Temperature')
plt.ylabel('Yield')
plt.title('Clusters by Temperature and Yield')
plt.show()

In [None]:
# Outliers
iso = IsolationForest(contamination=0.05, random_state=42)
df_crop['Outlier'] = iso.fit_predict(X_scaled)  # -1 = outlier

## Clustering Findings

Discuss trends (e.g., "Cluster 0: High yield in moderate conditions; outliers in extremes.")

- **Cluster Analysis**: Review the scatter plot to identify distinct groups in the data
- **Outlier Detection**: Points marked as -1 represent potential anomalies that may require special attention
- **Agricultural Insights**: Different clusters may represent different farming conditions or crop varieties
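
One way to back these observations with numbers is to profile each cluster: its size, the mean and spread of key columns, and the share of flagged outliers. The sketch below uses a small synthetic frame (`demo`, with made-up values) in place of `df_crop`; only the column names `Temperature`, `Yield`, `Cluster`, and `Outlier` come from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_crop after clustering and outlier detection
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'Temperature': rng.normal(25, 3, 60),
    'Yield': rng.normal(50, 10, 60),
    'Cluster': rng.integers(0, 3, 60),
    'Outlier': rng.choice([1, -1], size=60, p=[0.95, 0.05]),  # -1 = outlier
})

# Mean and standard deviation of each feature per cluster
cluster_profile = demo.groupby('Cluster')[['Temperature', 'Yield']].agg(['mean', 'std'])

# Number of rows per cluster
cluster_sizes = demo['Cluster'].value_counts().sort_index()

# Fraction of rows in each cluster that IsolationForest would have flagged
outlier_share = (demo['Outlier'] == -1).groupby(demo['Cluster']).mean()

print(cluster_profile)
print(cluster_sizes)
print(outlier_share)
```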

## Regression Models

In [None]:
# Prep
X = df_crop.drop(['Yield', 'Cluster', 'Outlier'], axis=1)  # Clean
y = df_crop['Yield']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
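# Models (5 different)
# SVR and KNN are distance-based, so they are wrapped in a pipeline that
# standardizes features using statistics from the training split only.
from sklearn.pipeline import make_pipeline

models = {
    'Linear': LinearRegression(),
    'DecisionTree': DecisionTreeRegressor(random_state=42),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf')),
    'KNN': make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2}
    print(f"{name}: MSE={mse:.3f}, R2={r2:.3f}")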

# Compare
results_df = pd.DataFrame(results).T
print(results_df)

## Model Evaluation

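**Best model**: RandomForest (high R2 on the test split). **Strengths**: Handles non-linear relationships. **Limits**: Dataset size, so metrics from a single train/test split may vary from one split to another.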

- **Linear Regression**: Simple baseline model, good for understanding linear relationships
- **Decision Tree**: Interpretable but may overfit
- **Random Forest**: Generally robust, handles feature interactions well
- **SVR**: Good for non-linear patterns but sensitive to feature scaling
- **KNN**: Non-parametric, good for local patterns but sensitive to the curse of dimensionality
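
Because a single split can be noisy on a small dataset, the comparison above could be complemented with k-fold cross-validation and a small hyperparameter search using `GridSearchCV` (already imported in the first cell). The sketch below is a hedged illustration on synthetic regression data (`X_demo`, `y_demo`); the parameter grid is an arbitrary assumption, not a tuned recommendation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the crop features and yield
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 5-fold cross-validated R2 for the linear baseline
baseline_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')

# Small, illustrative grid for the random forest
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), param_grid, cv=5, scoring='r2'
)
search.fit(X_demo, y_demo)

print(f"Linear baseline R2: {baseline_r2.mean():.3f} +/- {baseline_r2.std():.3f}")
print(f"Best forest params: {search.best_params_}")
print(f"Best forest CV R2: {search.best_score_:.3f}")
```

Reporting the mean and standard deviation across folds makes it easier to tell whether one model's lead over another is larger than the split-to-split variation.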

# Ir Além 2: Sensor Data Classification

This section focuses on real-time sensor data classification for crop health monitoring using data from the IoT sensor network.

In [None]:
# Load Sensor CSV (from collector.py)
df_sensor = pd.read_csv('sensor_data.csv')  # Columns: temperature, humidity, moisture, label

In [None]:
# Prep
# Map labels to binary (1 = Healthy, 0 = Unhealthy); any other label value becomes NaN
df_sensor['health'] = df_sensor['label'].map({'Healthy': 1, 'Unhealthy': 0})
X_sensor = df_sensor[['temperature', 'humidity', 'moisture']]
y_sensor = df_sensor['health']

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_sensor, y_sensor, test_size=0.2, random_state=42)

In [None]:
# Train Classifier
# Alternative: RandomForestClassifier (needs `from sklearn.ensemble import RandomForestClassifier`)
clf = LogisticRegression(random_state=42)
clf.fit(X_train_s, y_train_s)

# Validate
y_pred_s = clf.predict(X_test_s)
acc = accuracy_score(y_test_s, y_pred_s)
print(f"Accuracy: {acc}")
print(confusion_matrix(y_test_s, y_pred_s))

# Cross-Validation
cv_scores = cross_val_score(clf, X_sensor, y_sensor, cv=5)
print(f"CV Mean: {cv_scores.mean()}")
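
Accuracy alone can be misleading if healthy and unhealthy readings are unevenly represented. A possible refinement, sketched here on synthetic data (`X_demo`, `y_demo`, with an assumed 80/20 class mix), is to stratify the split, report per-class precision and recall, and cross-validate with stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the three sensor features; class 1 is the minority
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)

# stratify keeps the class proportions the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf_demo = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf_demo.predict(X_te),
                            target_names=['Unhealthy', 'Healthy']))

# Stratified 5-fold CV scored with F1 instead of plain accuracy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv, scoring='f1')
print(f"CV F1 mean: {f1_scores.mean():.3f}")
```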

In [None]:
# Save model for real-time use in collector.py
joblib.dump(clf, 'classifier.pkl')

## Classification Findings

**Accuracy**: XX%; validates sensors for corn health. **Integration**: Real-time via collector.py.

### Key Results:
- **Model Performance**: The logistic regression classifier provides binary classification of crop health (healthy vs. unhealthy), judged by the test accuracy and confusion matrix above
- **Feature Importance**: Temperature, humidity, and moisture are the three model inputs; the coefficients in `clf.coef_` indicate how each one shifts the prediction
- **Integration Ready**: The saved model can be loaded by `collector.py` for real-time predictions
- **Cross-Validation**: CV scores estimate how well the model generalizes across different data splits
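
To illustrate the integration point, here is a minimal sketch of how a script such as `collector.py` might load the saved model and classify one reading. The contents of `collector.py` are not shown in this notebook, so the helper `classify_reading`, the synthetic training data, and the labelling rule are all hypothetical; only `joblib.dump`/`joblib.load` and the three feature names come from the notebook.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = ['temperature', 'humidity', 'moisture']

# Synthetic training data standing in for sensor_data.csv
rng = np.random.default_rng(42)
train = pd.DataFrame({
    'temperature': rng.normal(25, 5, 200),
    'humidity': rng.normal(60, 10, 200),
    'moisture': rng.normal(40, 10, 200),
})
labels = (train['moisture'] > 40).astype(int)  # made-up rule, for illustration only

# Save the fitted model to a temporary location (the notebook uses 'classifier.pkl')
model_path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
joblib.dump(LogisticRegression(max_iter=1000).fit(train, labels), model_path)


def classify_reading(reading, path=model_path):
    """Hypothetical helper: load the saved model and classify one sensor reading."""
    model = joblib.load(path)
    # Build a one-row frame with the same column order used in training
    row = pd.DataFrame([reading], columns=FEATURES)
    prediction = int(model.predict(row)[0])              # 1 = Healthy, 0 = Unhealthy
    probability = float(model.predict_proba(row)[0, 1])  # P(Healthy)
    return prediction, probability


prediction, probability = classify_reading(
    {'temperature': 24.0, 'humidity': 55.0, 'moisture': 70.0}
)
print(prediction, round(probability, 3))
```

In a long-running collector, the model would more sensibly be loaded once at startup rather than on every reading; loading inside the helper just keeps the sketch short.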

### Next Steps:
1. Deploy the saved model in the IoT data collection pipeline
2. Monitor model performance with live sensor data
3. Consider retraining with more diverse seasonal data
4. Implement alerting system for unhealthy crop conditions