# Analyze problem statement to identify approach to fulfill goals

The company faces significant challenges in optimizing crop yields and resource management <br>
The focus needs to be prioritized on these 2 objectives:

1. Predict temperature conditions within farm's closed environment, ensuring optimal plant growth
  - Regression modelling task
  - Need to identify relevant/related features within provided database
<br><br>
2. Categorize combined "Plant Type-Stage" based on sensor data, aiding in strategic planning and resource allocation
  - Classification modelling task
  - Need to identify relevant/related features within provided database

# Setup environment, SQL connection and analyze SQL database

Necessary libraries will be imported when needed

Establish connection to the SQL database (agri.db) using the relative path 'data/agri.db'

In [None]:
# Import libraries as needed
import sqlite3

# Set path to SQL database
db_path = "data/agri.db"

# Create connection to SQL database
conn = sqlite3.connect(db_path)

Set pandas options for better readability

In [None]:
import pandas as pd

pd.set_option('display.max_columns', None) # Display all columns in DataFrame
pd.set_option('display.max_rows', 100)     # Limit number of rows displayed to 100

Explore database structure by listing all tables to identify available tables for extraction

In [None]:
import pandas as pd

# Query to list all tables in database
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql(query, conn)

# Display list of tables
tables

Since 'farm_data' is the only table in the database, its first few rows can be previewed to understand the column structure <br>
This can be cross-checked with the provided list of attributes in the PDF

In [None]:
# Preview first few rows of 'farm_data' table
farm_data_10_query = "SELECT * FROM farm_data LIMIT 10;"
farm_data_10_df = pd.read_sql(farm_data_10_query, conn)

# Display first few rows of table
farm_data_10_df.head(10)

A few issues in the current database will have to be sorted out before the data can be used for feature engineering or machine learning modelling <br>
<br>
Currently identified issues:
| Column Name | Issue |
| :---------: | :---: |
| Plant Type  | Non-standardized naming format |
| Plant Stage | Non-standardized naming format |
| Temperature Sensor | Missing values (NaN) |
|                    | Negative value |
| Humidity Sensor | Missing values (NaN) |
| Nutrient * Sensor | Missing value (None) |
|                   | Values with units |
| Water Level Sensor | Missing value (NaN) |

The schema of 'farm_data' table can be retrieved to understand the columns' data type <br>
However, this might not be fully accurate prior to data clean-up due to missing or incorrectly labelled values

In [None]:
# Get schema of 'farm_data' table
schema_query = "PRAGMA table_info(farm_data);"
schema_df = pd.read_sql(schema_query, conn)

# Display schema information
schema_df

Most of the columns' data types match expectations, except for the Nutrient Sensors <br>
These columns should have REAL/INTEGER type but are currently of TEXT type <br> <br>
To handle this, the missing values will have to be resolved and the values with units need to be processed <br>
After both steps are done, the columns' data can be converted to REAL/INTEGER type
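
The gap between a declared column type and the values actually stored can be inspected directly in SQLite. The sketch below is only an illustration: it uses a hypothetical in-memory table (`farm_demo`, with made-up rows) instead of `agri.db`, and compares `PRAGMA table_info` with `typeof()`, which reports the storage class of each stored value.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for 'farm_data'; rows are made up
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    'CREATE TABLE farm_demo ("Plant Type" TEXT, "Nutrient N Sensor (ppm)" TEXT)'
)
demo_conn.executemany(
    "INSERT INTO farm_demo VALUES (?, ?)",
    [("Herbs", "150 ppm"), ("herbs", None), ("HERBS", "161")],
)
demo_conn.commit()

# Declared type of each column
demo_schema_df = pd.read_sql("PRAGMA table_info(farm_demo);", demo_conn)
declared_types = dict(zip(demo_schema_df["name"], demo_schema_df["type"]))

# Storage class of the values actually stored in the nutrient column
typeof_query = """
SELECT typeof("Nutrient N Sensor (ppm)") AS storage_class, COUNT(*) AS n_rows
FROM farm_demo
GROUP BY typeof("Nutrient N Sensor (ppm)");
"""
storage_df = pd.read_sql(typeof_query, demo_conn)
storage_counts = dict(zip(storage_df["storage_class"], storage_df["n_rows"]))

print(declared_types)
print(storage_counts)
demo_conn.close()
```

In this toy table the nutrient column is declared TEXT and holds a mix of text and NULL values, which is the same pattern described above.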

# Perform Exploratory Data Analysis (EDA) on SQL table data

Load all data from 'farm_data' table into a DataFrame to start data analysis

In [None]:
# Get all data from 'farm_data' table
farm_data_query = "SELECT * FROM farm_data;"
farm_data_df = pd.read_sql_query(farm_data_query, conn)

Start with data preprocessing to clean up missing values, non-standardized naming formats and extra info in values

The data in columns (Plant Type, Plant Stage) will all be changed to lowercase characters to standardize the data

In [None]:
non_standard_name_list = ["Plant Type", "Plant Stage"]

for col_name in non_standard_name_list:
    farm_data_df[col_name] = farm_data_df[col_name].str.lower()


To prepare for the classification task, 'Plant Type' and 'Plant Stage' columns will need to be merged

In [None]:
#TODO: Double check on implementation
#farm_data_df["Plant Type-Stage"] = farm_data_df["Plant Type"] + " " + farm_data_df["Plant Stage"]
#farm_data_df

Remove 'ppm' from Nutrient * Sensor column data and convert the data into numeric type

In [None]:
ppm_drop_list = ["Nutrient N Sensor (ppm)", "Nutrient P Sensor (ppm)", "Nutrient K Sensor (ppm)"]

for col_name in ppm_drop_list:
    farm_data_df[col_name] = farm_data_df[col_name].str.replace("ppm", "", regex=False)
    farm_data_df[col_name] = pd.to_numeric(farm_data_df[col_name], errors="coerce")

Remove negative sign in Temperature Sensor column data

In [None]:
# Get Temperature Sensor column name
farm_data_df_col_list = farm_data_df.columns

for col_name in farm_data_df_col_list:
    if "Temperature Sensor" in col_name:
        temp_sensor_col_name = col_name

farm_data_df[temp_sensor_col_name] = farm_data_df[temp_sensor_col_name].abs()

After checking the current DataFrame, it seems that Light Intensity Sensor column also has missing values <br>
So that will be handled together with the other affected columns
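
A check of this kind can be sketched as follows. The readings are made up and the column names only mimic the real table; on the actual data the same calls would be applied to `farm_data_df`.

```python
import numpy as np
import pandas as pd

# Made-up readings; column names only mimic the real table
demo_df = pd.DataFrame({
    "Temperature Sensor": [23.1, -24.0, np.nan, 22.5],
    "Light Intensity Sensor": [400.0, np.nan, 380.0, np.nan],
    "CO2 Sensor": [900, 950, 1010, 980],
})

# Missing values per column, keeping only the affected columns
missing_counts = demo_df.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

# Negative readings per numerical column (NaN compares as False)
negative_counts = (demo_df.select_dtypes(include="number") < 0).sum()

print(missing_counts)
print(negative_counts)
```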

The remaining columns with missing values will be filled with either the mean or the median of their zone <br>
The existing data of each column will be grouped by System Location Code to obtain the mean and median values

Here is the breakdown of each column:
| Column Name | Mean/Median | Reason |
| :---------: | :---------: | :----: |
| Temperature Sensor | Mean | Data is relatively stable and has normal distribution |
| Humidity Sensor | Median  | Data is not normally distributed, no clear pattern |
| Light Intensity Sensor | Median | Data is skewed towards the upper half of the spectrum |
| Nutrient N Sensor | Median | Data has sudden dip and spike near right of spectrum |
| Nutrient P Sensor | Median | Data has sudden dip and spike near left of spectrum |
| Nutrient K Sensor | Median | Data has sudden dip and spike near right of spectrum |
| Water Level Sensor | Median | To avoid outliers at extreme ends of spectrum |
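
One way to apply a per-zone statistic is `groupby(...).transform(...)`, which returns the group value aligned to every original row. This is a minimal sketch on made-up data; the zone labels `Zone_A` and `Zone_B` and the readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: two hypothetical zones with one missing reading each
demo_df = pd.DataFrame({
    "System Location Code": ["Zone_A", "Zone_A", "Zone_A", "Zone_B", "Zone_B", "Zone_B"],
    "Water Level Sensor": [10.0, 12.0, np.nan, 30.0, np.nan, 34.0],
})

# Median of each zone, broadcast back to the original row order
zone_median = demo_df.groupby("System Location Code")["Water Level Sensor"].transform("median")

# Fill each gap with the median of its own zone
demo_df["Water Level Sensor"] = demo_df["Water Level Sensor"].fillna(zone_median)
print(demo_df)
```

Calling `.median()` on the whole column instead gives a single farm-wide value, which ignores differences between zones.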

The mean and median of each column with missing values are still calculated and displayed

In [None]:
agg_list = ["mean", "median"]
no_nan_col_list = farm_data_df.columns[farm_data_df.isnull().sum() == 0].tolist()
# Don't drop 'System Location Code' column else there is no zone to groupby
no_nan_col_list = [col for col in no_nan_col_list if col != "System Location Code"]

nan_farm_data_df =  farm_data_df.drop(columns=no_nan_col_list)

nan_farm_data_grouped_df = nan_farm_data_df.groupby("System Location Code").agg(agg_list)

nan_farm_data_grouped_df

It can be seen that for each column, the mean and median values differ by some margin <br>
So choosing the appropriate one to replace missing values is important

The missing values of each column will be replaced as shown in the above table <br>
The exception is 'Temperature Sensor': the approach is changed to dropping rows with missing values instead of replacing them with the mean

In [None]:
# Work on the data under a new name for the clean-up steps
# Note: this is the same object as farm_data_df, so the median fills below also apply to farm_data_df
clean_farm_data_df = farm_data_df

# Replace 'Temperature Sensor' column missing value with mean (no longer used)
#clean_farm_data_df[temp_sensor_col_name] = clean_farm_data_df[temp_sensor_col_name].fillna(clean_farm_data_df[temp_sensor_col_name].mean())

nan_col_list = clean_farm_data_df.columns[clean_farm_data_df.isnull().any()].tolist()
# Remove 'Temperature Sensor' column name from list as its missing rows are dropped instead of filled
nan_col_list = [col for col in nan_col_list if col != temp_sensor_col_name]

# Replace remaining affected column missing value with median of the whole column
for col_name in nan_col_list:
    clean_farm_data_df[col_name] = clean_farm_data_df[col_name].fillna(clean_farm_data_df[col_name].median())

# Drop all rows with missing 'Temperature Sensor' values
# drop=True prevents the old index from being added as an extra 'index' column
clean_farm_data_df = clean_farm_data_df.dropna(subset=[temp_sensor_col_name]).reset_index(drop=True)

clean_farm_data_df

Based on the current data and data types, each column can be categorized as categorical or numerical types <br>
Categorical represents categories or labels, so usually the data are of string or object data type <br>
Numerical represents quantitative data, which can be either continuous or discrete

Here is the breakdown:
| Column Name | Type |
| :---------: | :--: |
| System Location Code | Categorical |
| Previous Cycle Plant Type | Categorical |
| Plant Type | Categorical |
| Plant Stage | Categorical |
| Temperature Sensor | Numerical |
| Humidity Sensor | Numerical |
| Light Intensity Sensor | Numerical |
| CO2 Sensor | Numerical |
| EC Sensor | Numerical |
| O2 Sensor | Numerical |
| Nutrient * Sensor | Numerical |
| pH Level | Numerical |
| Water Level Sensor | Numerical |
| Plant Type-Stage | Categorical |
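
A split like the one in the table can also be derived from the dtypes, as a quick cross-check. The frame below is made up (the plant type labels are hypothetical); `select_dtypes` is used to separate numeric columns from the rest.

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns
demo_df = pd.DataFrame({
    "Plant Type": ["herbs", "vine crops", "herbs"],
    "pH Level": [6.1, 5.9, 6.3],
    "CO2 Sensor": [900, 950, 1010],
})

# Non-numeric columns are treated as categorical, numeric dtypes as numerical
demo_cat_cols = demo_df.select_dtypes(exclude="number").columns.tolist()
demo_num_cols = demo_df.select_dtypes(include="number").columns.tolist()

print(demo_cat_cols)
print(demo_num_cols)
```

A dtype check only works after clean-up, since numeric columns stored as text would otherwise be counted as categorical.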

The distributions of the categorical variables are plotted for visualization <br>
This gives a sense of how the data is categorized and how even each distribution is

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#cat_col_list = ["System Location Code", "Previous Cycle Plant Type", "Plant Type", "Plant Stage", "Plant Type-Stage"]
cat_col_list = ["System Location Code", "Previous Cycle Plant Type", "Plant Type", "Plant Stage"]

for col_name in cat_col_list:
    plt.figure(figsize=(10,5))
    sns.countplot(x=col_name, data=clean_farm_data_df)
    plt.title(f"Distribution of {col_name}")
    plt.xticks(rotation=45)
    plt.show()

It can be seen that the data distribution of all 4 columns is rather even across all the distinct values <br>
It means that there is no need to do any further data processing to balance out skewed data
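
The evenness read off the plots can also be expressed as numbers. The sketch below uses made-up labels and counts; the ratio between the largest and smallest class share is one simple, informal indicator, not a fixed rule.

```python
import pandas as pd

# Made-up stage labels with slightly uneven counts
demo_stage = pd.Series(
    ["seedling"] * 4 + ["vegetative"] * 3 + ["maturity"] * 3,
    name="Plant Stage",
)

# Share of rows per class
class_share = demo_stage.value_counts(normalize=True)

# Ratio of the largest to the smallest class share (1.0 means perfectly even)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```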

These columns will need to have their data values converted into numeric values via label encoding and/or one-hot encoding<br>
Otherwise it is not possible to use these columns for correlation analysis and machine learning modelling in later steps
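
One detail worth keeping in mind for ordered categories: `LabelEncoder` assigns codes in sorted (alphabetical) order of the labels. If the codes should follow the growth order, an explicit mapping is one option. The stage names and their order below are assumptions for illustration and may differ from the real labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stage names and growth order; the real labels may differ
stage_order = ["seedling", "vegetative", "maturity"]
demo_df = pd.DataFrame({"Plant Stage": ["vegetative", "seedling", "maturity", "seedling"]})

# LabelEncoder numbers the classes alphabetically: maturity=0, seedling=1, vegetative=2
alpha_codes = LabelEncoder().fit_transform(demo_df["Plant Stage"])

# Explicit mapping that follows the assumed growth order
stage_to_code = {stage: code for code, stage in enumerate(stage_order)}
ordered_codes = demo_df["Plant Stage"].map(stage_to_code)

print(list(alpha_codes))
print(ordered_codes.tolist())
```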

In [None]:
from sklearn.preprocessing import LabelEncoder

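# Create new DataFrame for post-encoding (copy so clean_farm_data_df keeps the original labels)
encoded_farm_data_df = clean_farm_data_df.copy()

# Perform label encoding on 'Plant Stage' as there is an ordered stage to it
# Note: LabelEncoder assigns codes in alphabetical order of the labels, which may not match the growth order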
#lab_enc_list = ["Plant Stage", "Plant Type-Stage"]
lab_enc_list = ["Plant Stage"]
lab_enc = LabelEncoder()
for col_name in lab_enc_list:
    encoded_farm_data_df[col_name] = lab_enc.fit_transform(encoded_farm_data_df[col_name])

# Perform one-hot encoding on the other columns as there is no order
fil_cat_col_list = [item for item in cat_col_list if item not in lab_enc_list]
encoded_farm_data_df = pd.get_dummies(encoded_farm_data_df, columns=fil_cat_col_list, drop_first=True)
bool_col = encoded_farm_data_df.select_dtypes(include=["bool"]).columns
encoded_farm_data_df[bool_col] = encoded_farm_data_df[bool_col].astype(int)
encoded_farm_data_df

The distribution and relationship of numerical variables are plotted for visualization

In [None]:
num_col_list = [item for item in farm_data_df_col_list if item not in cat_col_list]

for col_name in num_col_list:
    plt.figure(figsize=(10,5))
    sns.histplot(clean_farm_data_df[col_name], kde=True)
    plt.title(f"Distribution of {col_name}")
    plt.show()

After replacing the missing values, there is a sharp spike at the median for the affected features <br>
The distributions look more centred as a result, but the spike comes from the number of replaced values rather than from the sensors, so the impact will have to be assessed in a later stage
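
The size of such a spike can be quantified. The sketch below uses synthetic readings and an assumed missing rate of about 40% (both made up) to show how median filling concentrates values at one point and shrinks the spread.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic readings; roughly 40% are knocked out to mimic missing values
demo_values = pd.Series(rng.normal(loc=60.0, scale=5.0, size=1000))
demo_values = demo_values.mask(rng.random(1000) < 0.4)

# Fill the gaps with the median of the observed values
demo_median = demo_values.median()
filled_values = demo_values.fillna(demo_median)

# Share of rows sitting exactly at the median, and spread before/after filling
share_at_median = (filled_values == demo_median).mean()
std_before = demo_values.std()
std_after = filled_values.std()

print(f"Share of rows at the median: {share_at_median:.2f}")
print(f"Std before fill: {std_before:.2f}, after fill: {std_after:.2f}")
```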

Humidity shows the strongest bias around the median and has too many missing values, so the column will be dropped

In [None]:
drop_col_list = ["Humidity Sensor (%)"]
clean_farm_data_df = clean_farm_data_df.drop(columns=drop_col_list)
encoded_farm_data_df = encoded_farm_data_df.drop(columns=drop_col_list)

num_col_list = [item for item in num_col_list if item not in drop_col_list]

These numerical values will need to be standardized via standard scaling <br>
This makes the features have mean of 0 and standard deviation of 1 <br>
This helps the algorithm perform better as the input features would be on a similar scale
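
A possible variation, shown here on synthetic data and not what this notebook does, is to place the scaler inside a pipeline so that the mean and standard deviation are learned from the training rows only. The feature values and coefficients below are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales, plus a noisy linear target
demo_X = rng.normal(loc=[50.0, 500.0], scale=[5.0, 80.0], size=(200, 2))
demo_y = 0.3 * demo_X[:, 0] + 0.01 * demo_X[:, 1] + rng.normal(scale=0.5, size=200)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training rows only when the pipeline is fitted
scaled_lin_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_lin_model.fit(demo_X_train, demo_y_train)

# Training features after scaling: mean close to 0 and standard deviation close to 1
train_scaled = scaled_lin_model.named_steps["standardscaler"].transform(demo_X_train)
print(train_scaled.mean(axis=0).round(3), train_scaled.std(axis=0).round(3))

demo_r2 = scaled_lin_model.score(demo_X_test, demo_y_test)
print(f"R2 on held-out rows: {demo_r2:.3f}")
```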

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize numerical data-set
scaler = StandardScaler()
encoded_farm_data_df[num_col_list] = scaler.fit_transform(encoded_farm_data_df[num_col_list])
encoded_farm_data_df

# Analyze patterns and distribution in DataFrame

# Part 1: Predict temperature conditions within farm's closed environment

Plot heatmap for visualization to perform dimension reduction in later steps <br>
Dimension reduction is needed to eliminate redundant/irrelevant data that are not important in predicting/classifying the expected outcome

In [None]:
# Calculate correlation matrix
corr_matrix = encoded_farm_data_df.corr()

# Create heatmap of correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

# Show plot
plt.title("Correlation matrix heatmap")
plt.tight_layout()
plt.show()

Focus on 'Temperature Sensor' first <br>
Print out correlation values with relation to 'Temperature Sensor'

In [None]:
temp_sens_corr = corr_matrix[temp_sensor_col_name]

# Sort correlations by absolute value (if strong correlations should be prioritized)
sorted_temp_sens_corr = temp_sens_corr.abs().sort_values(ascending=False)

print("Correlation with 'Temperature Sensor': ")
print(sorted_temp_sens_corr)

As most correlation values are less than 0.1, the threshold value to drop features will be set at 0.1 <br>
This applies to 'Temperature Sensor'

In [None]:
corr_threshold = 0.1

temp_sens_drop_col = sorted_temp_sens_corr[sorted_temp_sens_corr < corr_threshold].index

temp_sens_farm_data_df = encoded_farm_data_df.drop(columns=temp_sens_drop_col)
temp_sens_farm_data_df

Check correlation matrix of 'Temperature Sensor' again for the features present after dropping lower correlation values

In [None]:
# Calculate correlation matrix again
final_temp_sens_corr_matrix = temp_sens_farm_data_df.corr()

# Create heatmap of correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(final_temp_sens_corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

# Show plot
plt.title("Temperature sensor correlation matrix heatmap")
plt.tight_layout()
plt.show()

Print out the numerical correlation values for 'Temperature Sensor' after dropping lower correlation values

In [None]:
final_temp_sens_corr = final_temp_sens_corr_matrix[temp_sensor_col_name]

# Sort correlations by absolute value (if strong correlations should be prioritized)
final_sorted_temp_sens_corr = final_temp_sens_corr.abs().sort_values(ascending=False)

# Print numerical correlation values
print("Correlation with 'Temperature Sensor': ")
print(final_sorted_temp_sens_corr)

Using the DataFrame with 'Temperature Sensor' as the target feature, split the data into train and test sets

In [None]:
# Split DataFrame into target feature and correlated features
X = temp_sens_farm_data_df.drop([temp_sensor_col_name], axis=1) # Correlated features
Y = temp_sens_farm_data_df[temp_sensor_col_name]                # Target feature

from sklearn.model_selection import train_test_split

# Split data into test and train sets (20-80 split)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Start with Linear Regression to train model and check on metrics

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lin_model = LinearRegression()
lin_model.fit(X_train, Y_train)
lin_Y_predict = lin_model.predict(X_test)

lin_mse = mean_squared_error(Y_test, lin_Y_predict)
lin_r2 = r2_score(Y_test, lin_Y_predict)

print("Pre-tuned Linear Regression - ")
print(f"Mean Squared Error: {lin_mse}")
print(f"R2 Score: {lin_r2}")

Based on Linear Regression results, the mean squared error and R2 score values are not optimal <br>
Expectation - <br>
- Mean squared error value close to 0
- R2 score value close to 1
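
How close to 0 the mean squared error needs to be depends on the scale of the target. A useful reference point is a model that always predicts the mean: it scores an R2 of 0 and a mean squared error equal to the variance of the target. The values below are made up.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up target values
demo_y_true = np.array([20.0, 22.0, 24.0, 26.0])

# Baseline that always predicts the mean of the target
baseline_pred = np.full_like(demo_y_true, demo_y_true.mean())

baseline_mse = mean_squared_error(demo_y_true, baseline_pred)
baseline_r2 = r2_score(demo_y_true, baseline_pred)
target_variance = demo_y_true.var()  # population variance (ddof=0)

print(f"Baseline MSE: {baseline_mse}")
print(f"Baseline R2: {baseline_r2}")
print(f"Target variance: {target_variance}")
```

A model is only useful for this task if its mean squared error is clearly below the target variance, which corresponds to an R2 above 0.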

Try out with Random Forest Regression to see if there are better results

In [None]:
from sklearn.ensemble import RandomForestRegressor

rand_for_reg_model = RandomForestRegressor()
rand_for_reg_model.fit(X_train, Y_train)
rand_for_reg_Y_predict = rand_for_reg_model.predict(X_test)

rand_for_reg_mse = mean_squared_error(Y_test, rand_for_reg_Y_predict)
rand_for_reg_r2 = r2_score(Y_test, rand_for_reg_Y_predict)

print("Pre-tuning Random Forest Regression - ")
print(f"Mean Squared Error: {rand_for_reg_mse}")
print(f"R2 Score: {rand_for_reg_r2}")

The Random Forest Regression results with default parameters are noticeably better than Linear Regression <br>
However, the results can be improved by tweaking the parameters via GridSearchCV or RandomizedSearchCV

Tune Random Forest Regression for better results

In [None]:
def tune_n_eval_forest_regression(X_train, Y_train, X_test, Y_test, search_method = "grid", param_grid = None, param_dist = None, random_iter = 50, cv = 5, num_jobs = 4):
    """
    Automates the tuning and evaluation of a Random Forest Regression model.

    Parameters:
        X_train: Training features (DataFrame or array).
        Y_train: Training target variable (Series or array).
        X_test: Test features (DataFrame or array).
        Y_test: Test target variable (Series or array).
        search_method: 'grid' for GridSearchCV, 'random' for RandomizedSearchCV.
        param_grid: Dictionary of hyperparameter ranges for GridSearchCV.
        param_dist: Dictionary of hyperparameter distributions for RandomizedSearchCV.
        random_iter: Number of iterations for RandomizedSearchCV.
        cv: Number of cross-validation folds.
        num_jobs: Number of parallel jobs passed to the search as n_jobs.

    Returns:
        best_model: The tuned Random Forest Regression model.
        best_params: The best hyperparameters found.
    """
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.metrics import mean_squared_error, r2_score

    # Initialize parameters
    if param_grid is None:
        param_grid = {
            "n_estimators": [50, 100, 150, 200],
            "max_depth": [None, 5, 10]
        }
    if param_dist is None:
        from scipy.stats import randint
        param_dist = {
            "n_estimators": randint(100, 300),
            "max_depth": [None, 5, 10, 15]
        }

    if search_method == "grid":
        search = GridSearchCV(
            RandomForestRegressor(random_state = 42),
            param_grid = param_grid,
            cv = cv,
            scoring = "neg_mean_squared_error",
            n_jobs = num_jobs
        )
    elif search_method == "random":
        search = RandomizedSearchCV(
            RandomForestRegressor(random_state = 42),
            param_distributions = param_dist,
            n_iter = random_iter,
            cv = cv,
            scoring = "neg_mean_squared_error",
            random_state = 42,
            n_jobs = num_jobs
        )
    else:
        raise ValueError("search_method must be either 'grid' or 'random'")
    
    # Fit the search
    print(f"Running {search_method.capitalize()} Search...")
    search.fit(X_train, Y_train)

    # Best model and parameters
    best_model = search.best_estimator_
    best_params = search.best_params_
    print(f"\nBest Parameters: {best_params}")

    # Test set evaluation
    tuned_rand_for_reg_Y_predict = best_model.predict(X_test)
    tuned_rand_for_reg_mse = mean_squared_error(Y_test, tuned_rand_for_reg_Y_predict)
    tuned_rand_for_reg_r2 = r2_score(Y_test, tuned_rand_for_reg_Y_predict)
    print("Tuned Random Forest Regression -")
    print(f"Tuned Mean Squared Error: {tuned_rand_for_reg_mse}")
    print(f"Tuned R2 Score: {tuned_rand_for_reg_r2}")


    return best_model, best_params

Call the tuning and evaluation function <br>
Run the function with both GridSearchCV and RandomizedSearchCV

In [None]:
# Using Grid Search
grid_rand_for_best_model, grid_rand_for_best_param = tune_n_eval_forest_regression(
    X_train,
    Y_train,
    X_test,
    Y_test,
    search_method = "grid",
    num_jobs = 4
)

In [None]:
# Using Random Search
rand_rand_for_best_model, rand_rand_for_best_param = tune_n_eval_forest_regression(
    X_train,
    Y_train,
    X_test,
    Y_test,
    search_method = "random",
    num_jobs = 4
)

While there isn't a noticeable difference in metrics after tuning, the overall result of the Random Forest Regression is the best at the moment

Try out with XGBoost to see how gradient boosting framework performs for this task

In [None]:
from xgboost import XGBRegressor

xgb_reg_model = XGBRegressor(random_state=42)
xgb_reg_model.fit(X_train, Y_train)
xgb_reg_Y_predict = xgb_reg_model.predict(X_test)

xgb_reg_mse = mean_squared_error(Y_test, xgb_reg_Y_predict)
xgb_reg_r2 = r2_score(Y_test, xgb_reg_Y_predict)

print("XGBoost Regression - ")
print(f"Mean Squared Error: {xgb_reg_mse}")
print(f"R2 Score: {xgb_reg_r2}")

XGBoost Regression currently performs somewhere in between Linear Regression and Random Forest Regression <br>
XGBoost Regression's parameters can also be tuned to help improve the metric values

In [None]:
def tune_n_eval_xgb_regression(X_train, Y_train, X_test, Y_test, search_method = "grid", param_grid = None, param_dist = None, random_iter = 50, cv = 5, num_jobs = 4):
    """
    Automates the tuning and evaluation of an XGBoost Regression model.

    Parameters:
        X_train: Training features (DataFrame or array).
        Y_train: Training target variable (Series or array).
        X_test: Test features (DataFrame or array).
        Y_test: Test target variable (Series or array).
        search_method: 'grid' for GridSearchCV, 'random' for RandomizedSearchCV.
        param_grid: Dictionary of hyperparameter ranges for GridSearchCV.
        param_dist: Dictionary of hyperparameter distributions for RandomizedSearchCV.
        random_iter: Number of iterations for RandomizedSearchCV.
        cv: Number of cross-validation folds.
        num_jobs: Number of parallel jobs passed to the search as n_jobs.

    Returns:
        best_model: The tuned XGBoost Regression model.
        best_params: The best hyperparameters found.
    """
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.metrics import mean_squared_error, r2_score

    # Initialize parameters
    if param_grid is None:
        param_grid = {
            "n_estimators": [100, 200, 300],
            "max_depth": [3, 5, 7],
            "learning_rate": [0.01, 0.05, 0.1],
            "subsample": [0.6, 0.8, 1.0]
        }
    if param_dist is None:
        from scipy.stats import randint
        param_dist = {
            "n_estimators": randint(100, 300),
            "max_depth": [3, 5, 7],
            "learning_rate": [0.01, 0.05, 0.1],
            "subsample": [0.6, 0.8, 1.0]
        }

    if search_method == "grid":
        search = GridSearchCV(
            XGBRegressor(random_state = 42),
            param_grid = param_grid,
            cv = cv,
            scoring = "neg_mean_squared_error",
            n_jobs = num_jobs
        )
    elif search_method == "random":
        search = RandomizedSearchCV(
            XGBRegressor(random_state = 42),
            param_distributions = param_dist,
            n_iter = random_iter,
            cv = cv,
            scoring = "neg_mean_squared_error",
            random_state = 42,
            n_jobs = num_jobs
        )
    else:
        raise ValueError("search_method must be either 'grid' or 'random'")
    
    # Fit the search
    print(f"Running {search_method.capitalize()} Search...")
    search.fit(X_train, Y_train)

    # Best model and parameters
    best_model = search.best_estimator_
    best_params = search.best_params_
    print(f"\nBest Parameters: {best_params}")

    # Test set evaluation
    tuned_xgb_reg_Y_predict = best_model.predict(X_test)
    tuned_xgb_reg_mse = mean_squared_error(Y_test, tuned_xgb_reg_Y_predict)
    tuned_xgb_reg_r2 = r2_score(Y_test, tuned_xgb_reg_Y_predict)
    print("Tuned XGBoost Regression -")
    print(f"Tuned Mean Squared Error: {tuned_xgb_reg_mse}")
    print(f"Tuned R2 Score: {tuned_xgb_reg_r2}")


    return best_model, best_params

Call the tuning and evaluation function <br>
Run the function with both GridSearchCV and RandomizedSearchCV

In [None]:
# Using Grid Search
grid_xgb_best_model, grid_xgb_best_param = tune_n_eval_xgb_regression(
    X_train,
    Y_train,
    X_test,
    Y_test,
    search_method = "grid",
    num_jobs = 4
)

In [None]:
# Using Random Search
rand_xgb_best_model, rand_xgb_best_param = tune_n_eval_xgb_regression(
    X_train,
    Y_train,
    X_test,
    Y_test,
    search_method = "random",
    num_jobs = 4
)

After tuning the parameters for XGBoost Regression, it looks like there is no noticeable difference in metric values

The variance and range of the target variable - 'Temperature Sensor' can be checked

In [None]:
# Calculate variance of target variable
variance_temp_sens = Y_test.var()
print(f"Variance of 'Temperature Sensor': {variance_temp_sens:.3f}")

# Calculate range of target variable
range_temp_sens = Y_test.max() - Y_test.min()
print(f"Range of 'Temperature Sensor': {range_temp_sens:.3f}")

Run algorithm with PCA (95% variance) features <br>
The purpose of using PCA (Principal Component Analysis) is to reduce the dimensionality of the feature space while retaining as much variance as possible <br>
It can help to improve model performance by reducing noise, handling multicollinearity, and improving training speed
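
To illustrate how `n_components=0.95` behaves, the sketch below builds synthetic data with four columns, two of which are near-copies of the other two. The data is made up purely for illustration; PCA is expected to keep only as many components as are needed to reach 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two independent signals, plus two columns that are near-copies of them
base = rng.normal(size=(300, 2))
demo_X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + rng.normal(scale=0.05, size=300),
    base[:, 1] + rng.normal(scale=0.05, size=300),
])

# A float between 0 and 1 asks PCA for the smallest number of components
# whose cumulative explained variance reaches that fraction
demo_pca = PCA(n_components=0.95)
demo_X_pca = demo_pca.fit_transform(demo_X)

n_kept = demo_pca.n_components_
cum_var = np.cumsum(demo_pca.explained_variance_ratio_)

print(f"Components kept: {n_kept} of {demo_X.shape[1]}")
print(f"Cumulative explained variance: {cum_var}")
```

The redundant columns add almost no new variance, so they are absorbed into the retained components.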

In [None]:
from sklearn.decomposition import PCA
import numpy as np

pca_temp_sens = PCA(n_components=0.95) # Keep 95% of variance
X_pca = pca_temp_sens.fit_transform(X)

print("Explained variance ratio: ", pca_temp_sens.explained_variance_ratio_)
print("Cumulative explained variance: ", np.cumsum(pca_temp_sens.explained_variance_ratio_))

pca_temp_sens_X_train, pca_temp_sens_X_test, pca_temp_sens_Y_train, pca_temp_sens_Y_test = train_test_split(X_pca, Y, test_size=0.2, random_state=42)

pca_temp_sens_model = RandomForestRegressor()
pca_temp_sens_model.fit(pca_temp_sens_X_train, pca_temp_sens_Y_train)
pca_temp_sens_Y_predict = pca_temp_sens_model.predict(pca_temp_sens_X_test)
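# Evaluate against the target values from the PCA split (same rows as Y_test, as the same random_state is used)
pca_temp_sens_mse = mean_squared_error(pca_temp_sens_Y_test, pca_temp_sens_Y_predict)
pca_temp_sens_r2 = r2_score(pca_temp_sens_Y_test, pca_temp_sens_Y_predict)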

print("PCA Random Forest Regression - ")
print(f"Mean Squared Error: {pca_temp_sens_mse}")
print(f"R2 Score: {pca_temp_sens_r2}")

Perform cross-validation with PCA features <br>
The purpose of cross-validation is to evaluate the performance of a machine learning model on unseen data <br>
This helps to detect overfitting by training and testing the model on different subsets of data

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rand_for_model = RandomForestRegressor(random_state=42)

temp_sens_pipeline = Pipeline([
    ("pca", pca_temp_sens),
    ("rf", rand_for_model)
])

# Perform 5-fold cross-validation
cv_scores = cross_val_score(temp_sens_pipeline, X, Y, cv=5, scoring="neg_mean_squared_error")
cv_r2 = cross_val_score(temp_sens_pipeline, X, Y, cv=5, scoring="r2")

print(f"Cross-validation MSE/Fold: {-cv_scores}")
print(f"Cross-validation Mean MSE: {-cv_scores.mean():.4f}")
print(f"Cross-validation STD MSE: {cv_scores.std():.4f}")
print(f"Cross-validation R2 Score/Fold: {cv_r2}")
print(f"Cross-validation Mean R2 Score: {cv_r2.mean():.4f}")

# Part 2: Categorize combined "Plant Type-Stage" based on sensor data

For this part, the DataFrame from Part 1 is reused in its state prior to any row or column dropping and feature engineering <br>
This is done to preserve as much of the original content as possible for a preliminary assessment

In [None]:
farm_data_df

As stated in the requirements, 'Plant Type' and 'Plant Stage' will need to be combined to form a new column named 'Plant Type-Stage'

In [None]:
farm_data_df["Plant Type-Stage"] = farm_data_df["Plant Type"] + " " + farm_data_df["Plant Stage"]
farm_data_df