In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Fish, Weight, and Species

The following is a model to estimate the weight of a fish based on its measurements.

The data used to train and test the model is stored as a CSV file in the data folder.

## Data

In [None]:
# Read data into a DataFrame
fish_data = pd.read_csv('./data/Fish.csv')

fish_data.info()

While the data has no null values, it does contain weights of zero.

A fish cannot weigh nothing, so these entries are removed during cleaning.
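
As a quick sanity check, the zero weights can be counted and filtered with a boolean mask. The sketch below runs on a small made-up DataFrame (the values are illustrative and not taken from Fish.csv); only the `Weight` column name is borrowed from the dataset, and `sample`, `zero_mask`, and `cleaned` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample rows for illustration only; not real entries from Fish.csv
sample = pd.DataFrame({
    "Weight": [242.0, 0.0, 340.0, 0.0, 500.0],
    "Height": [11.52, 6.48, 12.38, 5.90, 14.18],
})

# Boolean mask marking rows with an impossible weight of zero
zero_mask = sample["Weight"] == 0.0
n_zero = int(zero_mask.sum())

# Keep only the rows whose weight is not zero
cleaned = sample[~zero_mask].reset_index(drop=True)

print(f"Rows with zero weight: {n_zero}")
print(cleaned)
```

Filtering with a mask returns a new DataFrame and leaves the original untouched, which makes it easy to compare row counts before and after cleaning.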

In [None]:
# Drop Weights of 0
index = fish_data[ fish_data['Weight'] == 0.0].index
fish_data.drop(index, inplace = True)

fish_data.describe()

The species of each fish is not used in this model. Including it was found to make the model less accurate and more prone to error.

In [None]:
# Separate target from features (select by column name rather than position)
y_fish = fish_data['Weight']
X_fish = fish_data.drop(columns=['Weight', 'Species'])

# Convert into numpy arrays
X_fish = X_fish.values
y_fish = y_fish.values

# Split into Training and Test Groups
X_fish_train, X_fish_test, y_fish_train, y_fish_test = train_test_split(X_fish, y_fish, test_size = 0.2, random_state = 0)

## Training

The model is built using scikit-learn's Ridge model with an alpha of 0.

With an alpha of 0 the regularization term vanishes, so this is technically ordinary linear regression. Even so, scikit-learn's LinearRegression model produced worse results here.
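
To illustrate the equivalence in principle, the sketch below fits both estimators on synthetic, well-conditioned data and compares their coefficients. The data, the variable names, and the choice of three plain features are assumptions made for illustration; this is not the fish data or the polynomial pipeline used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, well-conditioned data for illustration only (not the fish data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# With alpha=0 the Ridge penalty term is zero, leaving plain least squares
ridge = Ridge(alpha=0.0).fit(X, y)
ols = LinearRegression().fit(X, y)

# Largest absolute difference between the two sets of coefficients
max_coef_diff = float(np.max(np.abs(ridge.coef_ - ols.coef_)))
print(f"Max coefficient difference: {max_coef_diff:.2e}")
```

On data like this the two fits should agree to numerical precision. Degree-3 polynomial features can be far less well-conditioned, and the two estimators use different solvers, so their results may diverge in that setting. Whether that explains the gap seen in this notebook has not been checked.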

In [None]:
# Build Model
regressor = make_pipeline(PolynomialFeatures(3), Ridge(alpha=0.0))

# Train on split data
regressor.fit(X_fish_train, y_fish_train)

## Evaluation

The model will be evaluated utilizing three methods:

1) Comparing Predictions to Actual Test Values
2) Mean Squared Error
3) R2 Score

The last two are computed from the predictions generated for the first method.
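
For reference, both metrics can be computed by hand from a set of predictions. The sketch below follows the standard definitions (MSE as the mean of squared residuals, R2 as one minus the ratio of residual to total sum of squares) and uses small made-up arrays, not output from this model.

```python
import numpy as np

# Made-up values for illustration only; not output from the fish model
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 380.0])

# Mean squared error: average of the squared residuals
mse = float(np.mean((y_true - y_pred) ** 2))

# R2: one minus residual sum of squares over total sum of squares
ss_res = float(np.sum((y_true - y_pred) ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse} | R2: {r2:.3f}")
```

An MSE of 0 and an R2 of 1 would indicate perfect predictions. MSE is expressed in squared units of the target, so it is easiest to interpret alongside the R2 score.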

In [None]:
# Predict using Test Set
y_fish_pol_pred = regressor.predict(X_fish_test)

# Build a comparison DataFrame with predicted and actual values
pred_pol_compare = pd.DataFrame({'Prediction': y_fish_pol_pred, 'Actual': y_fish_test})

print(pred_pol_compare)

In [None]:
# Evaluate using MSE and R2Score
mse_pol = mean_squared_error(y_fish_test, y_fish_pol_pred)
r2s_pol = r2_score(y_fish_test, y_fish_pol_pred)

print("MSE: %s | R2S: %s" % (mse_pol, r2s_pol))