# Feature Selection 

This notebook selects features for each of the three response variables. To use it:

1. Feed in a dataframe containing all of the features you would like to test
2. Run the code in this notebook

This should return, for each of the three response variables, the features that are the best predictors.

## Notes and Literature:

Lasso is often a strong choice for feature selection. It returns a list of features and 'shrinks' the coefficients of the remainder to 0, which helps avoid overfitting and tends to test better on out-of-sample data (prediction). The downside is that it assumes a linear relationship between the variables (variables can be transformed, but after transformation it still produces a linear fit). It can be fine-tuned with:
 1. Choice of alpha (a higher alpha gives more penalization, shrinking more features to 0)
 2. Transformation of the response variable (GLMs)
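
As a rough illustration of the alpha behaviour described above, here is a minimal sketch on synthetic data. The dataset, the alpha values and every variable name are assumptions made for the example, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 candidate features, 3 of which drive the response
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # the L1 penalty is scale-sensitive

# Let cross-validation pick alpha, then read off the surviving features
lasso_cv = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
kept = np.flatnonzero(lasso_cv.coef_)
print("alpha chosen by CV:", lasso_cv.alpha_)
print("indices of features kept:", kept)

# A higher alpha penalizes more: count non-zero coefficients at a few values
nonzero_counts = {}
for alpha in (0.1, 10.0, 1e4):
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y).coef_
    nonzero_counts[alpha] = int(np.count_nonzero(coef))
print(nonzero_counts)
```

The features with non-zero coefficients are the ones Lasso "selects"; at a large enough alpha every coefficient is shrunk to 0.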

Random Forest is a non-linear approach to the data, but it will not give a set of features. It ranks the features by importance; where to cut the list off depends on other rules that the user needs to apply. Tuning is given by:
 1. n_estimators, which is the number of trees in the forest (although this is more of a compute-performance trade-off than model tuning)
 2. Where to cut features off
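
A minimal sketch of ranking features by importance and then applying a cut-off rule, on synthetic data. The "keep anything above the mean importance" rule is only one hypothetical choice; the notebook itself does not prescribe a rule.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical feature names
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       noise=5.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# Hypothetical cut-off rule: keep features above the mean importance
cutoff = importances.mean()
selected = importances[importances > cutoff].index.tolist()
print("selected:", selected)
```

Other cut-offs (a fixed number of features, a cumulative-importance threshold) slot into the last step in the same way.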

Different choices can give different results, so for each set of features, pick the tuning that gives the best model w.r.t. the chosen criteria.

## Preparation

This section sets up the packages and data needed to run the feature selection.

### Loading packages / libraries

In [None]:
!pip install uv
!uv pip install  -r requirements.txt 

#new library
!pip install mlxtend


Restart Kernel here

Then we load in packages

In [None]:
## import packages

import snowflake
from snowflake.snowpark.context import get_active_session
session = get_active_session()

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Data manipulation and analysis
import numpy as np
import pandas as pd
from IPython.display import display

# Multi-dimensional arrays and datasets (e.g., NetCDF, Zarr)
import xarray as xr

# Geospatial raster data handling with CRS support
import rioxarray as rxr

# Raster operations and spatial windowing
import rasterio
from rasterio.windows import Window

# Feature preprocessing and data splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from scipy.spatial import cKDTree

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Planetary Computer tools for STAC API access and authentication
import pystac_client
import planetary_computer as pc
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

from datetime import date
from tqdm import tqdm
import os 

#NEW PACKAGES
import planetary_computer 
import dask 
from scipy import stats
from datetime import datetime
from dask.distributed import Client

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
from sklearn.model_selection import KFold
import statsmodels.api as sm
from sklearn.feature_selection import RFE

import xgboost as xgb

from sklearn.model_selection import GroupKFold

def run_groupkfold_cv(X, y, groups, n_splits=5, param_name="Parameter"):
    gkf = GroupKFold(n_splits=n_splits)
    fold_results = []

    for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
        # print(f"\n=== Fold {fold+1} ===")

        # Split
        X_train, X_test = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[val_idx]

        # Scale
        X_train_scaled, X_test_scaled, scaler = scale_data(X_train, X_test)

        # Train
        model = train_model(X_train_scaled, y_train)

        # Evaluate (in-sample)
        y_train_pred, r2_train, rmse_train = evaluate_model(model, X_train_scaled, y_train, "Train")

        # Evaluate (out-sample)
        y_test_pred, r2_test, rmse_test = evaluate_model(model, X_test_scaled, y_test, "Test")

        fold_results.append((r2_train, rmse_train, r2_test, rmse_test))

    df_results_kfold = pd.DataFrame(fold_results, columns=['R2_Train', 'RMSE_Train', 'R2_Test', 'RMSE_Test']).reset_index().rename(columns={"index": "fold"})
    df_results_kfold['Parameter'] = param_name
    df_results_kfold['Features'] = ', '.join([col for col in X.columns if col != 'sample_location_group'])
    df_results_kfold = df_results_kfold[['Parameter', 'Features', 'R2_Train', 'RMSE_Train', 'R2_Test', 'RMSE_Test']]

    return df_results_kfold
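
run_groupkfold_cv calls scale_data, train_model and evaluate_model, none of which are defined in this notebook. The sketch below shows one shape they might take; only the names and call signatures come from the function above, while the bodies (a StandardScaler, a RandomForestRegressor, R2 and RMSE) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

def scale_data(X_train, X_test):
    """Fit the scaler on the training fold only, then apply it to both folds."""
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler

def train_model(X_train_scaled, y_train):
    """Hypothetical model choice: a small random forest."""
    model = RandomForestRegressor(n_estimators=100, random_state=88)
    model.fit(X_train_scaled, np.ravel(y_train))
    return model

def evaluate_model(model, X_scaled, y_true, dataset_name="Test"):
    """Return predictions together with R2 and RMSE for one dataset."""
    y_pred = model.predict(X_scaled)
    r2 = r2_score(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    print(f"{dataset_name}: R2={r2:.3f}, RMSE={rmse:.3f}")
    return y_pred, r2, rmse

# Quick check on synthetic data (not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(2.0 * X_demo["a"] + rng.normal(scale=0.1, size=100))

X_tr, X_te = X_demo.iloc[:80], X_demo.iloc[80:]
y_tr, y_te = y_demo.iloc[:80], y_demo.iloc[80:]
X_tr_s, X_te_s, scaler = scale_data(X_tr, X_te)
model = train_model(X_tr_s, y_tr)
_, r2_train, rmse_train = evaluate_model(model, X_tr_s, y_tr, "Train")
_, r2_test, rmse_test = evaluate_model(model, X_te_s, y_te, "Test")
```

Fitting the scaler inside each fold, rather than once on the full dataset, keeps validation rows from influencing the scaling.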


### Loading in Data

Loading in dataframe for model selection

N.B. When applying a custom dataset, load it in as wq_data below. This cell applies some transformations to scrap junk data and remove all nulls, which may or may not be required for other datasets.

In [None]:
Water_Quality_df = pd.read_csv("data/water_quality_training_dataset.csv")

landsat_train_features = pd.read_csv("data/landsat/landsat_features_training_mvdb.csv")
landsat_train_features['Sample Date'] = pd.to_datetime(landsat_train_features['Sample Date'], format="%d/%m/%Y")

Terraclimate_df = pd.read_csv("data/terraclimate/terraclimate_features_training_pet.csv")

q_terraclimateload = pd.read_csv("data/terraclimate/terraclimate_features_training_q.csv")
q_terraclimate = q_terraclimateload[['LATITUDE', 'LONGITUDE', 'SAMPLE DATE', 'Q']].rename(columns={
    'LATITUDE': 'Latitude',
    'LONGITUDE': 'Longitude',
    'SAMPLE DATE': 'Sample Date',
    '2.Q': 'Q'
})


#landsat_train_features['NDMI'] = landsat_train_features['NDMI'].astype(float)
#landsat_train_features['MNDWI'] = landsat_train_features['MNDWI'].astype(float)
#landsat_train_features['Sample Date'] = pd.to_datetime(landsat_train_features['Sample Date'],  format='%d-%m-%Y')

def combine_two_datasets(dataset1,dataset2,dataset3, dataset4):
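    '''
    Returns a column-wise (axis=1) concatenation of the datasets,
    keeping the first occurrence of any duplicated column name.
    Attributes:
    dataset1 - Dataset 1 to be combined 
    dataset2 - Dataset 2 to be combined
    dataset3 - Dataset 3 to be combined
    dataset4 - Dataset 4 to be combined
    '''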
    
    data = pd.concat([dataset1,dataset2,dataset3, dataset4], axis=1)
    data = data.loc[:, ~data.columns.duplicated()]
    return data

wq_data = combine_two_datasets(Water_Quality_df, landsat_train_features, Terraclimate_df, q_terraclimate)
wq_data['Sample Date'] = pd.to_datetime(wq_data['Sample Date'],  format='mixed')

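# Nullify all negative observations.
# Assumption: coordinates are skipped because they can be legitimately negative.
skip_columns = {"Sample Date", "Latitude", "Longitude"}
for column in wq_data.columns:
    if column in skip_columns:
        continue
    if not pd.api.types.is_numeric_dtype(wq_data[column]):
        # Coerce text columns to numbers; unparseable entries become NaN
        wq_data[column] = pd.to_numeric(wq_data[column], errors='coerce')
    else:
        # .loc assigns in place; chained indexing only modifies a temporary copy
        wq_data.loc[wq_data[column] < -0.1, column] = np.nan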
    
#number of cv groups
cv_groups = 6

wq_data = wq_data.drop(columns=['qa_radsat', 'cloud_qa', 'Unnamed: 0'])
wq_data = wq_data.dropna(how='any',axis=0)

# Split into cv groups by Sample Date quantiles
wq_data['cv_group'] = pd.qcut(wq_data['Sample Date'], q=cv_groups, labels=False)

plt.scatter(wq_data['Sample Date'], wq_data['cv_group'])
plt.xlabel('Sample Date')
plt.ylabel('cv_group')
plt.show()

wq_data['Month_cosine'] = np.cos((wq_data['Sample Date'].dt.month + (wq_data['Sample Date'].dt.day/31))* np.pi / 6)
plt.scatter(wq_data['Sample Date'], wq_data['Month_cosine'])
plt.xlabel('Sample Date')
plt.ylabel('Month_Cosine')
plt.show()

wq_data = wq_data.drop(columns=['Sample Date'])

print(wq_data.info())

# Specify the number of folds: one fewer than cv_groups, since cv_group 0 is held out as the test set
lat_sep_kf = GroupKFold(n_splits=cv_groups - 1)
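
The cell above groups rows into date-quantile blocks for grouped cross-validation and encodes seasonality with a cosine. A small self-contained sketch of both ideas follows, using made-up dates; unlike the notebook, it keeps all six groups in the split rather than holding group 0 out.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical sample dates standing in for the 'Sample Date' column
dates = pd.Series(pd.date_range("2020-01-01", periods=120, freq="D"))

# Six equal-sized, contiguous time blocks labelled 0..5
cv_group = pd.qcut(dates, q=6, labels=False)

# GroupKFold never places rows from one group on both sides of a split
X_dummy = np.zeros((len(dates), 1))
n_overlaps = 0
for train_idx, val_idx in GroupKFold(n_splits=5).split(X_dummy, groups=cv_group):
    shared = set(cv_group.iloc[train_idx]) & set(cv_group.iloc[val_idx])
    n_overlaps += len(shared)
print("groups shared between train and validation:", n_overlaps)

def month_cosine(ts):
    """Same cyclical encoding as Month_cosine: period of 12 months."""
    return np.cos((ts.month + ts.day / 31) * np.pi / 6)

# The encoding wraps around the year end, so late December sits next to early January
dec_31 = month_cosine(pd.Timestamp("2020-12-31"))
jan_01 = month_cosine(pd.Timestamp("2021-01-01"))
jul_01 = month_cosine(pd.Timestamp("2021-07-01"))
print(dec_31, jan_01, jul_01)
```

A cosine alone cannot tell spring from autumn (both map to similar values); pairing it with the matching sine term is a common way to resolve that, if it matters for the model.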

This next section applies any transformations to the dataset. For example, we take the Box-Cox transformation of each response variable (the Y data), and we square the predictor lwir.1.

These particular transformations are arbitrary. Fit any others here.
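
Box-Cox requires strictly positive data, and a model trained on the transformed response predicts on the transformed scale. A minimal sketch of the forward and inverse transforms on synthetic data (the lognormal sample is an assumption for illustration):

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Synthetic, strictly positive, right-skewed response
rng = np.random.default_rng(0)
y = rng.lognormal(mean=3.0, sigma=0.8, size=500)

# Forward transform: lambda is chosen by maximum likelihood
y_bc, lambda_opt = stats.boxcox(y)
print("optimal lambda:", lambda_opt)
print("skew before:", stats.skew(y), "after:", stats.skew(y_bc))

# Inverse transform: map values back to the original units
y_back = inv_boxcox(y_bc, lambda_opt)
```

Keeping each lambda (as the cell below does with the *_lambda_opt variables) is what makes it possible to convert predictions back to the original units later.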



In [None]:
# Box-Cox of the response variables
Total_Alkalinity_bc, Total_Alkalinity_lambda_opt = stats.boxcox(wq_data['Total Alkalinity'])
Electrical_Conductance_bc, Electrical_Conductance_lambda_opt = stats.boxcox(wq_data['Electrical Conductance'])
Dissolved_Reactive_Phosphorus_bc, Dissolved_Reactive_Phosphorus_lambda_opt = stats.boxcox(wq_data['Dissolved Reactive Phosphorus'])

wq_data['Total Alkalinity'] = Total_Alkalinity_bc
wq_data['Electrical Conductance'] = Electrical_Conductance_bc
wq_data['Dissolved Reactive Phosphorus'] = Dissolved_Reactive_Phosphorus_bc

#Square x value
squarelwir1 = wq_data['lwir.1'] ** 2
wq_data['lwir.1'] = squarelwir1

Splitting into X and Y, and then test and training data.

In [None]:

# Test/train split based on cv_group (Sample Date quantile): group 0 is held out for testing
wq_data_test = wq_data[wq_data['cv_group'] == 0]
wq_data_train = wq_data[wq_data['cv_group'] > 0]

#then split into X and Y 
Y_train = wq_data_train[["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]]
X_train = wq_data_train.drop(columns=["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus", "Longitude", "Latitude"])
Y_test = wq_data_test[["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]]
X_test = wq_data_test.drop(columns=["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus", "Longitude", "Latitude"])

'''
Y_data = wq_data[["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]]
X_data = wq_data.drop(columns=["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus", "Longitude", "Latitude"])

X_train, X_test, Y_train, Y_test = train_test_split(
    X_data, Y_data, test_size=0.2, random_state=88, shuffle=False
)
'''


print(Y_train.shape)
print(X_train.shape)
print(Y_test.shape)
print(X_test.shape)

## Random Forest Regression

Let's build a tree-based regression using our data now. The cells below currently fit an XGBoost regressor (XGBRegressor), with the equivalent RandomForestRegressor left commented out as an alternative. We will use RFE and Cross Validation to choose the optimal number of predictors.

Recursive Feature Elimination (RFE) is a backward-selection, wrapper-based machine learning technique that aims to improve model performance and reduce overfitting by recursively removing the least significant features. In this case, we iterate through regressions, using RFE to select the variables that perform best, combined with Cross Validation as follows:

1. Iterative RFE inside a Cross-Validation Loop: The data is split into cross-validation folds. In each fold's training set, the RFE process runs multiple times, eliminating a set number or percentage of the least important features at each "step".
2. Performance Evaluation: For each number of features evaluated within the RFE process, the model's performance (e.g., accuracy, F1 score, R-squared; negative mean squared error in this notebook) is scored on the corresponding test (validation) fold.
3. Averaging Scores: The scores for each feature subset size are averaged across all cross-validation folds.
4. Optimal Feature Selection: The number of features that yields the highest average cross-validation score is identified as the optimal number.
5. Final RFE Fit: A final RFE process is run on the entire dataset using the determined optimal number of features to select the final feature set


### Total Alkalinity

First, let's choose the features and create the dataframe:

In [None]:
Y_train_totalalkalinity = Y_train[["Total Alkalinity"]]
Y_test_totalalkalinity = Y_test[["Total Alkalinity"]]


rf_totalalkalinity =  xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.3,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_totalalkalinity =  RandomForestRegressor(
    n_estimators=100,
    random_state=88,
    max_features=0.5
)
'''
rfecv_totalalkalinity = RFECV(estimator=rf_totalalkalinity
    , step=1
    , cv=lat_sep_kf
    , scoring='neg_mean_squared_error'
    , n_jobs=-1)
'''
rfecv_totalalkalinity = RFE(estimator=rf_totalalkalinity
    , step=1
)
'''
#rfecv_totalalkalinity.fit(X_train.drop(columns=['cv_group']), Y_train_totalalkalinity)    
rfecv_totalalkalinity.fit(X_train.drop(columns=['cv_group']), Y_train_totalalkalinity, groups= X_train['cv_group'])

print(f"Optimal number of features: {rfecv_totalalkalinity.n_features_}")
print(f"Selected features mask: {rfecv_totalalkalinity.support_}")


mean_scores = rfecv_totalalkalinity.cv_results_['mean_test_score']
num_features = rfecv_totalalkalinity.cv_results_['n_features']
std_error = rfecv_totalalkalinity.cv_results_['std_test_score']

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Negative MSE")
plt.errorbar(
    x=num_features,
    y=mean_scores,
    yerr=std_error,
)
plt.title("Recursive Feature Elimination Total Alkalinity")
plt.show()

selected_features_totalalkalinity = rfecv_totalalkalinity.get_feature_names_out() 
X_train_selected_totalalkalinity = X_train[selected_features_totalalkalinity]
X_test_selected_totalalkalinity = X_test[selected_features_totalalkalinity]

print(X_train_selected_totalalkalinity.info())

Fit a new model on the selected features, and then evaluate its performance:

In [None]:
rf_totalalkalinity_selected = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.01,
    max_depth=8,
    subsample=0.5,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_totalalkalinity_selected = RandomForestRegressor(n_estimators = 100, random_state = 88, max_features = 0.5)
'''
rf_totalalkalinity_selected.fit(X_train_selected_totalalkalinity, Y_train_totalalkalinity)

# Make predictions on the training set
Y_pred_train_totalalkalinity = rf_totalalkalinity_selected.predict(X_train_selected_totalalkalinity)

mse = mean_squared_error(Y_train_totalalkalinity, Y_pred_train_totalalkalinity)
r2 = r2_score(Y_train_totalalkalinity, Y_pred_train_totalalkalinity)

print("Training: MSE:", mse, "R2:", r2)

# Make predictions on the test set
Y_pred_test_totalalkalinity = rf_totalalkalinity_selected.predict(X_test_selected_totalalkalinity)

mse = mean_squared_error(Y_test_totalalkalinity, Y_pred_test_totalalkalinity)
r2 = r2_score(Y_test_totalalkalinity, Y_pred_test_totalalkalinity)

print("Test: MSE:", mse, "R2:", r2)

Then check residuals:

In [None]:
Y_train_totalalkalinity_plot =  Y_train_totalalkalinity.to_numpy().flatten()

resid_train_totalalkalinity = Y_train_totalalkalinity_plot - Y_pred_train_totalalkalinity

plt.scatter(Y_train_totalalkalinity_plot, resid_train_totalalkalinity)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Total Alkalinity')
plt.ylabel('Residuals')
plt.show()

sm.qqplot(resid_train_totalalkalinity)
plt.show()

In [None]:
slope, intercept = np.polyfit(Y_train_totalalkalinity_plot, Y_pred_train_totalalkalinity, 1) #
line_of_best_fit = slope * Y_train_totalalkalinity_plot + intercept
plt.scatter(Y_train_totalalkalinity_plot, Y_pred_train_totalalkalinity)
plt.plot(Y_train_totalalkalinity_plot, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Total Alkalinity Train')
plt.ylabel('Predict')
plt.show()

slope, intercept = np.polyfit(Y_test_totalalkalinity.to_numpy().flatten(), Y_pred_test_totalalkalinity, 1) #
line_of_best_fit = slope * Y_test_totalalkalinity.to_numpy().flatten() + intercept
plt.scatter(Y_test_totalalkalinity, Y_pred_test_totalalkalinity)
plt.plot(Y_test_totalalkalinity, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Total Alkalinity Test')
plt.ylabel('Predict')
plt.show()

### Electrical Conductance

In [None]:
Y_train_electricalconductance = Y_train[["Electrical Conductance"]]
Y_test_electricalconductance = Y_test[["Electrical Conductance"]]


rf_electricalconductance =  xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.3,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_electricalconductance =  RandomForestRegressor(
    n_estimators=100,
    random_state=88,
    max_features=0.5
)
'''
rfecv_electricalconductance = RFECV(estimator=rf_electricalconductance
    , step=1
    , cv=lat_sep_kf
    , scoring='neg_mean_squared_error'
    , n_jobs=-1)
    
rfecv_electricalconductance.fit(X_train.drop(columns=['cv_group']), Y_train_electricalconductance, groups= X_train['cv_group'])

print(f"Optimal number of features: {rfecv_electricalconductance.n_features_}")
print(f"Selected features mask: {rfecv_electricalconductance.support_}")

mean_scores = rfecv_electricalconductance.cv_results_['mean_test_score']
num_features = rfecv_electricalconductance.cv_results_['n_features']
std_error = rfecv_electricalconductance.cv_results_['std_test_score']

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Negative MSE")
plt.errorbar(
    x=num_features,
    y=mean_scores,
    yerr=std_error,
)
plt.title("Recursive Feature Elimination Electrical Conductance")
plt.show()

selected_features_electricalconductance = rfecv_electricalconductance.get_feature_names_out() 
X_train_selected_electricalconductance = X_train[selected_features_electricalconductance]
X_test_selected_electricalconductance = X_test[selected_features_electricalconductance]

print(X_train_selected_electricalconductance.info())

In [None]:

rf_electricalconductance_selected = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.5,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_electricalconductance_selected = RandomForestRegressor(n_estimators = 100, random_state = 88, max_features = 0.5)
'''
rf_electricalconductance_selected.fit(X_train_selected_electricalconductance, Y_train_electricalconductance)

# Make predictions on the training set
Y_pred_train_electricalconductance = rf_electricalconductance_selected.predict(X_train_selected_electricalconductance)

mse = mean_squared_error(Y_train_electricalconductance, Y_pred_train_electricalconductance)
r2 = r2_score(Y_train_electricalconductance, Y_pred_train_electricalconductance)

print("Training: MSE:", mse, "R2:", r2)

# Make predictions on the test set
Y_pred_test_electricalconductance = rf_electricalconductance_selected.predict(X_test_selected_electricalconductance)

mse = mean_squared_error(Y_test_electricalconductance, Y_pred_test_electricalconductance)
r2 = r2_score(Y_test_electricalconductance, Y_pred_test_electricalconductance)

print("Test: MSE:", mse, "R2:", r2)

In [None]:
Y_train_electricalconductance_plot =  Y_train_electricalconductance.to_numpy().flatten()

resid_train_electricalconductance = Y_train_electricalconductance_plot - Y_pred_train_electricalconductance

plt.scatter(Y_train_electricalconductance_plot, resid_train_electricalconductance)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Electrical Conductance')
plt.ylabel('Residuals')
plt.show()

sm.qqplot(resid_train_electricalconductance)
plt.show()

In [None]:
slope, intercept = np.polyfit(Y_train_electricalconductance_plot, Y_pred_train_electricalconductance, 1) #
line_of_best_fit = slope * Y_train_electricalconductance_plot + intercept
plt.scatter(Y_train_electricalconductance_plot, Y_pred_train_electricalconductance)
plt.plot(Y_train_electricalconductance_plot, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Electrical Conductance Train')
plt.ylabel('Predict')
plt.show()

slope, intercept = np.polyfit(Y_test_electricalconductance.to_numpy().flatten(), Y_pred_test_electricalconductance, 1) #
line_of_best_fit = slope * Y_test_electricalconductance.to_numpy().flatten() + intercept
plt.scatter(Y_test_electricalconductance, Y_pred_test_electricalconductance)
plt.plot(Y_test_electricalconductance, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Electrical Conductance Test')
plt.ylabel('Predict')
plt.show()

### Dissolved Reactive Phosphorus



In [None]:
Y_train_dissolvedreactivephosphorus = Y_train[["Dissolved Reactive Phosphorus"]]
Y_test_dissolvedreactivephosphorus = Y_test[["Dissolved Reactive Phosphorus"]]


rf_dissolvedreactivephosphorus = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    subsample=0.3,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_dissolvedreactivephosphorus = RandomForestRegressor(
    n_estimators = 100
    , random_state = 88
    , max_features = 0.5
)    
'''
rfecv_dissolvedreactivephosphorus = RFECV(estimator=rf_dissolvedreactivephosphorus
    , step=1
    , cv=lat_sep_kf
    , scoring='neg_mean_squared_error'
    , n_jobs=-1)
    
rfecv_dissolvedreactivephosphorus.fit(X_train.drop(columns=['cv_group']), Y_train_dissolvedreactivephosphorus, groups= X_train['cv_group'])

print(f"Optimal number of features: {rfecv_dissolvedreactivephosphorus.n_features_}")
print(f"Selected features mask: {rfecv_dissolvedreactivephosphorus.support_}")

mean_scores = rfecv_dissolvedreactivephosphorus.cv_results_['mean_test_score']
num_features = rfecv_dissolvedreactivephosphorus.cv_results_['n_features']
std_error = rfecv_dissolvedreactivephosphorus.cv_results_['std_test_score']

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Negative MSE")
plt.errorbar(
    x=num_features,
    y=mean_scores,
    yerr=std_error,
)
plt.title("Recursive Feature Elimination Dissolved Reactive Phosphorus")
plt.show()

selected_features_dissolvedreactivephosphorus = rfecv_dissolvedreactivephosphorus.get_feature_names_out() 
X_train_selected_dissolvedreactivephosphorus = X_train[selected_features_dissolvedreactivephosphorus]
X_test_selected_dissolvedreactivephosphorus = X_test[selected_features_dissolvedreactivephosphorus]

print(X_train_selected_dissolvedreactivephosphorus.info())

In [None]:

rf_dissolvedreactivephosphorus_selected = xgb.XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.5,
    objective='reg:squarederror',
    random_state=88,
    alpha = 1,
    reg_lambda = 1
)
'''
rf_dissolvedreactivephosphorus_selected = RandomForestRegressor(n_estimators = 100, random_state = 88, max_features = 0.5)
'''
rf_dissolvedreactivephosphorus_selected.fit(X_train_selected_dissolvedreactivephosphorus, Y_train_dissolvedreactivephosphorus)

# Make predictions on the training set
Y_pred_train_dissolvedreactivephosphorus = rf_dissolvedreactivephosphorus_selected.predict(X_train_selected_dissolvedreactivephosphorus)

mse = mean_squared_error(Y_train_dissolvedreactivephosphorus, Y_pred_train_dissolvedreactivephosphorus)
r2 = r2_score(Y_train_dissolvedreactivephosphorus, Y_pred_train_dissolvedreactivephosphorus)

print("Training: MSE:", mse, "R2:", r2)

# Make predictions on the test set
Y_pred_test_dissolvedreactivephosphorus = rf_dissolvedreactivephosphorus_selected.predict(X_test_selected_dissolvedreactivephosphorus)

mse = mean_squared_error(Y_test_dissolvedreactivephosphorus, Y_pred_test_dissolvedreactivephosphorus)
r2 = r2_score(Y_test_dissolvedreactivephosphorus, Y_pred_test_dissolvedreactivephosphorus)

print("Test: MSE:", mse, "R2:", r2)

In [None]:
Y_train_dissolvedreactivephosphorus_plot =  Y_train_dissolvedreactivephosphorus.to_numpy().flatten()

resid_train_dissolvedreactivephosphorus = Y_train_dissolvedreactivephosphorus_plot - Y_pred_train_dissolvedreactivephosphorus

plt.scatter(Y_train_dissolvedreactivephosphorus_plot, resid_train_dissolvedreactivephosphorus)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Dissolved Reactive Phosphorus')
plt.ylabel('Residuals')
plt.show()

sm.qqplot(resid_train_dissolvedreactivephosphorus)
plt.show()

In [None]:

slope, intercept = np.polyfit(Y_train_dissolvedreactivephosphorus_plot, Y_pred_train_dissolvedreactivephosphorus, 1) #
line_of_best_fit = slope * Y_train_dissolvedreactivephosphorus_plot + intercept

plt.scatter(Y_train_dissolvedreactivephosphorus_plot, Y_pred_train_dissolvedreactivephosphorus)
plt.plot(Y_train_dissolvedreactivephosphorus_plot, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Dissolved Reactive Phosphorus Train')
plt.ylabel('Predict')
plt.show()


slope, intercept = np.polyfit(Y_test_dissolvedreactivephosphorus.to_numpy().flatten(), Y_pred_test_dissolvedreactivephosphorus, 1) #
line_of_best_fit = slope * Y_test_dissolvedreactivephosphorus.to_numpy().flatten() + intercept
plt.scatter(Y_test_dissolvedreactivephosphorus, Y_pred_test_dissolvedreactivephosphorus)
plt.plot(Y_test_dissolvedreactivephosphorus, line_of_best_fit, color='red', label='Line of Best Fit') #
plt.xlabel('Dissolved Reactive Phosphorus Test')
plt.ylabel('Predict')
plt.show()