# Analyze problem statement to identify approach to fulfill goals

The company faces significant challenges in optimizing crop yields and resource management <br>
The analysis therefore focuses on these 2 objectives

1. Predict temperature conditions within farm's closed environment, ensuring optimal plant growth
  - Regression modelling task
  - Need to identify relevant/related features within provided database
<br><br>
2. Categorize combined "Plant Type-Stage" based on sensor data, aiding in strategic planning and resource allocation
  - Classification modelling task
  - Need to identify relevant/related features within provided database
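
For the second objective, the combined "Plant Type-Stage" label has to be built from the two separate columns before it can serve as a classification target. A minimal sketch of one way to do this, assuming both columns hold plain strings; the plant and stage names below are hypothetical placeholders, not values taken from the database:

```python
import pandas as pd

# Toy rows; the plant names and stage names are hypothetical placeholders
toy_df = pd.DataFrame({
    "Plant Type": ["herbs", "herbs", "leafy greens"],
    "Plant Stage": ["seedling", "maturity", "seedling"],
})

# Join the two categorical columns into a single target label
toy_df["Plant Type-Stage"] = toy_df["Plant Type"] + "-" + toy_df["Plant Stage"]

print(toy_df["Plant Type-Stage"].tolist())
```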

# Setup environment, SQL connection and analyze SQL database

Necessary libraries will be imported when needed

Establish a connection to the SQL database (agri.db) using the relative path 'data/agri.db'
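
One thing to watch with a relative path is that `sqlite3.connect` creates a new, empty database file when the path does not exist, so a wrong working directory would only surface later as a missing table. A minimal sketch of a guard, where `connect_existing` is a hypothetical helper name and not part of this notebook:

```python
import sqlite3
from pathlib import Path


def connect_existing(db_file):
    """Open a SQLite database, raising instead of silently creating a new file."""
    path = Path(db_file)
    if not path.is_file():
        raise FileNotFoundError(f"Database not found: {path}")
    return sqlite3.connect(str(path))


# Demonstration with a path that is assumed not to exist
try:
    connect_existing("no_such_dir/no_such.db")
except FileNotFoundError as exc:
    print(exc)
```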

In [None]:
# Import libraries as needed
import sqlite3

# Set path to SQL database
db_path = "data/agri.db"

# Create connection to SQL database
conn = sqlite3.connect(db_path)

Set pandas options for better readability

In [None]:
import pandas as pd

pd.set_option('display.max_columns', None) # Display all columns in DataFrame
pd.set_option('display.max_rows', 100)     # Limit number of rows displayed to 100

Explore database structure by listing all tables to identify available tables for extraction

In [None]:
import pandas as pd

# Query to list all tables in database
query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql(query, conn)

# Display list of tables
tables

Since 'farm_data' is the only table in the database, its first few rows can be previewed to understand the column structure <br>
This can be cross-checked with the provided list of attributes in the PDF

In [None]:
# Preview first few rows of 'farm_data' table
farm_data_10_query = "SELECT * FROM farm_data LIMIT 10;"
farm_data_10_df = pd.read_sql(farm_data_10_query, conn)

# Display first few rows of table
farm_data_10_df.head(10)

A few issues in the current data have to be resolved before it can be used for feature engineering or machine learning modelling <br>
<br>
Currently identified issues:
| Column Name | Issue |
| :---------: | :---: |
| Plant Type  | Non-standardized naming format |
| Plant Stage | Non-standardized naming format |
| Temperature Sensor | Missing values (NaN) |
|                    | Negative value |
| Humidity Sensor | Missing values (NaN) |
| Nutrient * Sensor | Missing value (None) |
|                   | Values with units |
| Water Level Sensor | Missing value (NaN) |

The schema of 'farm_data' table can be retrieved to understand the columns' data type <br>
However, this might not be fully accurate prior to data clean-up due to missing or incorrectly labelled values

In [None]:
# Get schema of 'farm_data' table
schema_query = "PRAGMA table_info(farm_data);"
schema_df = pd.read_sql(schema_query, conn)

# Display schema information
schema_df

Most of the columns' data types match expectations, except for the Nutrient Sensors <br>
These columns should have REAL/INTEGER type but are currently of TEXT type <br> <br>
To handle this, the missing values will have to be resolved and the values with units will have to be processed <br>
After both steps are done, the columns' data can be converted to REAL/INTEGER type
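
The two steps can be illustrated on a toy column before touching the real data. This is only a sketch: the readings are made up, and it assumes the unit always appears as the literal text "ppm":

```python
import pandas as pd

# Toy TEXT column; the readings are made up for illustration
raw = pd.Series(["150 ppm", None, "200", "175.5 ppm"])

# Strip the unit and surrounding whitespace, then coerce to numbers;
# missing or unparseable entries become NaN
without_unit = raw.str.replace("ppm", "", regex=False).str.strip()
cleaned = pd.to_numeric(without_unit, errors="coerce")

print(cleaned.dtype)
print(cleaned.isna().sum())
```

With `errors="coerce"`, any entry that still cannot be parsed is turned into NaN instead of raising, so it is worth counting NaN values before and after the conversion.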

# Perform Exploratory Data Analysis (EDA) on SQL table data

Load all data from 'farm_data' table into a DataFrame to start data analysis

In [None]:
# Get all data from 'farm_data' table
farm_data_query = "SELECT * FROM farm_data;"
farm_data_df = pd.read_sql_query(farm_data_query, conn)

Start with data preprocessing to clean up missing values, non-standardized naming formats and extra information (units) in values

The data in columns (Plant Type, Plant Stage) will all be changed to lowercase characters to standardize the data

In [None]:
non_standard_name_list = ["Plant Type", "Plant Stage"]

for col_name in non_standard_name_list:
    farm_data_df[col_name] = farm_data_df[col_name].str.lower()


Remove 'ppm' from Nutrient * Sensor column data and convert the data into numeric type

In [None]:
ppm_drop_list = ["Nutrient N Sensor (ppm)", "Nutrient P Sensor (ppm)", "Nutrient K Sensor (ppm)"]

for col_name in ppm_drop_list:
    farm_data_df[col_name] = farm_data_df[col_name].str.replace("ppm", "", regex=False)
    farm_data_df[col_name] = pd.to_numeric(farm_data_df[col_name], errors="coerce")

Remove negative sign in Temperature Sensor column data

In [None]:
# Get Temperature Sensor column name
farm_data_df_col_list = farm_data_df.columns

for col_name in farm_data_df_col_list:
    if "Temperature Sensor" in col_name:
        temp_sensor_col_name = col_name

farm_data_df[temp_sensor_col_name] = farm_data_df[temp_sensor_col_name].abs()

After checking the current DataFrame, it seems that Light Intensity Sensor column also has missing values <br>
So that will be handled together with the other affected columns

The remaining columns with missing values will be filled with either the mean or the median value of their zone <br>
The existing data of each column will be grouped by System Location Code to obtain the mean and median values
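
A per-zone fill as described above could be sketched with `groupby(...).transform(...)`, which returns one value per row aligned to the original index. The zone codes, the column name "Humidity Sensor (%)" and the readings below are hypothetical toy values:

```python
import numpy as np
import pandas as pd

# Toy data; zone codes and readings are hypothetical
toy_df = pd.DataFrame({
    "System Location Code": ["Zone_A", "Zone_A", "Zone_A", "Zone_B", "Zone_B", "Zone_B"],
    "Humidity Sensor (%)": [60.0, np.nan, 70.0, 40.0, 50.0, np.nan],
})

col = "Humidity Sensor (%)"

# transform("median") gives each row the median of its own zone
zone_median = toy_df.groupby("System Location Code")[col].transform("median")

# Only the missing entries are replaced; measured values stay untouched
toy_df[col] = toy_df[col].fillna(zone_median)

print(toy_df)
```

The same pattern works with `transform("mean")` for columns that should use the zone mean.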

Here is the breakdown of each column:
| Column Name | Mean/Median | Reason |
| :---------: | :---------: | :----: |
| Temperature Sensor | Mean | Data is relatively stable and has normal distribution |
| Humidity Sensor | Median  | Data is not normally distributed, no clear pattern |
| Light Intensity Sensor | Median | Data is skewed towards the upper half of the spectrum |
| Nutrient N Sensor | Median | Data has sudden dip and spike near right of spectrum |
| Nutrient P Sensor | Median | Data has sudden dip and spike near left of spectrum |
| Nutrient K Sensor | Median | Data has sudden dip and spike near right of spectrum |
| Water Level Sensor | Median | To avoid outliers at extreme ends of spectrum |

The mean and median of each column with missing values are still calculated and displayed for comparison

In [None]:
agg_list = ["mean", "median"]
no_nan_col_list = farm_data_df.columns[farm_data_df.isnull().sum() == 0].tolist()
# Don't drop 'System Location Code' column else there is no zone to groupby
no_nan_col_list = [col for col in no_nan_col_list if col != "System Location Code"]

nan_farm_data_df =  farm_data_df.drop(columns=no_nan_col_list)

nan_farm_data_grouped_df = nan_farm_data_df.groupby("System Location Code").agg(agg_list)

nan_farm_data_grouped_df

It can be seen that for each column, the mean and median values differ by some margin <br>
So choosing the appropriate one to replace missing value is important
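
The gap between mean and median is largely driven by skew and outliers. A small sketch with made-up readings shows how a single extreme value pulls the mean while the median stays put; `Series.skew()` is used here as one possible numeric check and is not part of the original analysis:

```python
import pandas as pd

# Toy readings with one large outlier; values are made up
readings = pd.Series([10.0, 11.0, 12.0, 13.0, 100.0])

print(readings.mean())    # pulled upwards by the outlier
print(readings.median())  # unaffected by the outlier
print(readings.skew())    # positive value indicates a right-skewed distribution
```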

The missing values of each column will be replaced as shown in the above table

In [None]:
# Create new DataFrame for data after removing missing values
# (copy so that farm_data_df is not modified through the same object)
clean_farm_data_df = farm_data_df.copy()

# Replace 'Temperature Sensor' column missing value with mean
clean_farm_data_df[temp_sensor_col_name] = clean_farm_data_df[temp_sensor_col_name].fillna(clean_farm_data_df[temp_sensor_col_name].mean())

nan_col_list = clean_farm_data_df.columns[clean_farm_data_df.isnull().any()].tolist()
# Remove 'Temperature Sensor' column name from list as it uses mean instead of median
nan_col_list = [col for col in nan_col_list if col != temp_sensor_col_name]

# Replace remaining affected column missing value with median
for col_name in nan_col_list:
    clean_farm_data_df[col_name] = clean_farm_data_df[col_name].fillna(clean_farm_data_df[col_name].median())

clean_farm_data_df

Based on the current data and data types, each column can be categorized as categorical or numerical types <br>
Categorical data represents categories or labels, so it is usually of string or object data type <br>
Numerical data represents quantitative values, which can be either continuous or discrete

Here is the breakdown:
| Column Name | Type |
| :---------: | :--: |
| System Location Code | Categorical |
| Previous Cycle Plant Type | Categorical |
| Plant Type | Categorical |
| Plant Stage | Categorical |
| Temperature Sensor | Numerical |
| Humidity Sensor | Numerical |
| Light Intensity Sensor | Numerical |
| CO2 Sensor | Numerical |
| EC Sensor | Numerical |
| O2 Sensor | Numerical |
| Nutrient * Sensor | Numerical |
| pH Level | Numerical |
| Water Level Sensor | Numerical |

The distribution of each categorical variable is plotted for visualization <br>
This gives a sense of how the data is categorized and how even the distribution is

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

cat_col_list = ["System Location Code", "Previous Cycle Plant Type", "Plant Type", "Plant Stage"]

for col_name in cat_col_list:
    plt.figure(figsize=(10,5))
    sns.countplot(x=col_name, data=clean_farm_data_df)
    plt.title(f"Distribution of {col_name}")
    plt.xticks(rotation=45)
    plt.show()

It can be seen that the data distribution of all 4 columns is rather even across all the distinct values <br>
It means that there is no need to do any further data processing to rebalance the categories
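
The visual impression of evenness can also be backed by a number. A minimal sketch on toy labels, where the ratio of the largest to the smallest category share is an assumed, informal measure of imbalance:

```python
import pandas as pd

# Toy labels standing in for one categorical column
labels = pd.Series(["a", "b", "a", "c", "b", "c", "a", "b"])

# Share of rows per category; roughly equal shares indicate a balanced column
shares = labels.value_counts(normalize=True)

# Ratio of the largest to the smallest share as a single imbalance figure
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(imbalance_ratio)
```

A ratio close to 1 supports the conclusion that no rebalancing is needed.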

These columns will need to have their data values converted into numeric values via label encoding and/or one-hot encoding <br>
Otherwise it is not possible to use these columns for correlation analysis and machine learning modelling in later steps
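
When a column has a natural order, the integer codes should follow that order. `LabelEncoder` assigns codes in sorted label order, which is alphabetical for strings and may differ from the growth order. One possible alternative is an ordered categorical with an explicit category list; the stage names and their order below are hypothetical and should be replaced with the values found in the data:

```python
import pandas as pd

# Hypothetical stage names and growth order; replace with the dataset's actual values
stage_order = ["seedling", "vegetative", "maturity"]

stages = pd.Series(["maturity", "seedling", "vegetative", "seedling"])

# An ordered categorical assigns codes by position in stage_order, not alphabetically
stage_codes = pd.Categorical(stages, categories=stage_order, ordered=True).codes

print(list(stage_codes))
```

Any value that is not listed in `stage_order` receives the code -1, which makes unexpected labels easy to spot.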

In [None]:
from sklearn.preprocessing import LabelEncoder

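# Create new DataFrame for post-encoding
# (copy so that clean_farm_data_df keeps the original text labels)
encoded_farm_data_df = clean_farm_data_df.copy()

# Perform label encoding on 'Plant Stage' as there is an ordered stage to it
# Note: LabelEncoder assigns codes in sorted (alphabetical) label order,
# which may not match the actual growth order of the stages
lab_enc = LabelEncoder()
encoded_farm_data_df["Plant Stage"] = lab_enc.fit_transform(encoded_farm_data_df["Plant Stage"])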

# Perform one-hot encoding on the other columns as there is no order
fil_cat_col_list = [item for item in cat_col_list if item != "Plant Stage"]
encoded_farm_data_df = pd.get_dummies(encoded_farm_data_df, columns=fil_cat_col_list, drop_first=True)
bool_col = encoded_farm_data_df.select_dtypes(include=["bool"]).columns
encoded_farm_data_df[bool_col] = encoded_farm_data_df[bool_col].astype(int)
encoded_farm_data_df

The distribution and relationship of numerical variables are plotted for visualization

In [None]:
num_col_list = [item for item in farm_data_df_col_list if item not in cat_col_list]

for col_name in num_col_list:
    plt.figure(figsize=(10,5))
    sns.histplot(clean_farm_data_df[col_name], kde=True)
    plt.title(f"Distribution of {col_name}")
    plt.show()

After replacing the missing values, the affected features show a sharp spike around the imputed value (mean or median) <br>
The fill keeps each feature centred on its typical value, but the spike comes from the large number of replaced missing values rather than from measured data, so its impact will have to be assessed at a later stage

These numerical values will need to be standardized via standard scaling <br>
This makes the features have mean of 0 and standard deviation of 1 <br>
This helps the algorithm perform better as the input features would be on a similar scale
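
Standard scaling applies z = (x - mean) / std to each column. The sketch below reproduces this by hand on toy values to make the effect visible; it uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import pandas as pd

# Toy features on very different scales
toy_df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# z = (x - mean) / std, using the population standard deviation (ddof=0)
scaled_df = (toy_df - toy_df.mean()) / toy_df.std(ddof=0)

# After scaling, every column has mean 0 and standard deviation 1
print(scaled_df.mean().round(6).tolist())
print(scaled_df.std(ddof=0).round(6).tolist())
```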

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize numerical data-set
scaler = StandardScaler()
encoded_farm_data_df[num_col_list] = scaler.fit_transform(encoded_farm_data_df[num_col_list])
encoded_farm_data_df

# Analyze patterns and distribution in DataFrame

Plot heatmap for visualization to perform dimension reduction in later steps <br>
Dimension reduction is needed to eliminate redundant/irrelevant data that are not important in predicting/classifying the expected outcome
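
Alongside the heatmap, strongly correlated feature pairs can be listed directly from the correlation matrix, which is easier to read when there are many columns. This is a sketch on toy data; the 0.8 threshold is a hypothetical cut-off and not a value taken from this analysis:

```python
import numpy as np
import pandas as pd

# Toy features; "b" is an exact multiple of "a", "c" is only weakly related
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

threshold = 0.8  # hypothetical cut-off, not taken from the analysis
corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) pairs and keep those above the threshold
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > threshold]

print(high_pairs)
```

One feature from each listed pair is a candidate for removal, since it carries largely the same information as the other.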

In [None]:
# Calculate correlation matrix
corr_matrix = encoded_farm_data_df.corr()

# Create heatmap of correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

# Show plot
plt.title("Correlation matrix heatmap")
plt.tight_layout()
plt.show()