# Regression Analysis of Fish Dataset on Kaggle

The data chosen is the Kaggle fish market dataset available [here](https://www.kaggle.com/aungpyaeap/fish-market).
This dataset allows one to explore linear regression for predicting continuous variables and logistic regression for
predicting discrete variables. In the case of linear regression, I predicted the weight of a fish based on its attributes,
and in the case of logistic regression I predicted the species of a fish based on its attributes.

In [1]:
# importing needed libraries
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.metrics import accuracy_score

In [2]:
# reading csv file of data
df = pd.read_csv('fish.csv')

In [3]:
# peeking at the available data
df.head()

|   | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.52 | 4.02 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.48 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.73 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.444 | 5.134 |


In [4]:
# viewing basic statistics about the dataset to get a better sense of what the data contains
df.describe()

|   | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|
| count | 159.0 | 159.0 | 159.0 | 159.0 | 159.0 | 159.0 |
| mean | 398.326415 | 26.24717 | 28.415723 | 31.227044 | 8.970994 | 4.417486 |
| std | 357.978317 | 9.996441 | 10.716328 | 11.610246 | 4.286208 | 1.685804 |
| min | 0.0 | 7.5 | 8.4 | 8.8 | 1.7284 | 1.0476 |
| 25% | 120.0 | 19.05 | 21.0 | 23.15 | 5.9448 | 3.38565 |
| 50% | 273.0 | 25.2 | 27.3 | 29.4 | 7.786 | 4.2485 |
| 75% | 650.0 | 32.7 | 35.5 | 39.65 | 12.3659 | 5.5845 |
| max | 1650.0 | 59.0 | 63.4 | 68.0 | 18.957 | 8.142 |
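
The summary above reports a minimum `Weight` of 0.0, which is not a plausible measurement for a fish. A minimal sketch of how such rows could be located and, if desired, dropped. The frame `toy` and its values are hypothetical stand-ins, not the real data:

```python
import pandas as pd

# Hypothetical miniature of the fish table; the values are illustrative only.
toy = pd.DataFrame({
    "Species": ["Bream", "Roach", "Perch"],
    "Weight": [242.0, 0.0, 110.0],
})

# Flag rows whose recorded weight is not strictly positive.
suspicious = toy[toy["Weight"] <= 0]
print(suspicious)

# One possible treatment: drop those rows before modelling.
cleaned = toy[toy["Weight"] > 0].reset_index(drop=True)
print(len(cleaned))
```

Whether to drop, correct, or keep such a row is a judgment call; the notebook keeps all rows.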


In [5]:
# splitting the data into train and test
data_train, data_test = train_test_split(df, test_size=0.3, random_state=42)
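
Since the species is later used as a classification target, one option is to stratify the split so that each species keeps roughly the same share in the training and test parts. The notebook does not do this; the following is only a sketch on a hypothetical frame `toy`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 rows for each of two species.
toy = pd.DataFrame({
    "Species": ["Bream"] * 10 + ["Roach"] * 10,
    "Weight": range(20),
})

# stratify= keeps the class proportions equal in both parts of the split.
toy_train, toy_test = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["Species"]
)

print(toy_train["Species"].value_counts())
print(toy_test["Species"].value_counts())
```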

## Performing Linear Regression on chosen dataset

With linear regression, the goal is to predict the weight of a fish from its measurements and species.

In [6]:
def feature_selection_and_preprocessing_linear_regression(dataset):
    # Since there are no null values, all the attributes of the dataset are used
    features = dataset[['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Species']].copy()
    return features

In [7]:
# Creating the linear regression model
linear_model = make_pipeline(
    make_column_transformer(
        # Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        (OneHotEncoder(sparse=False), ['Species']),
        remainder='passthrough'
    ),
    LinearRegression()
)
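
To illustrate what the one-hot encoding step in the pipeline above does to the `Species` column, here is a small standalone sketch. The frame `toy` is hypothetical, and the encoder is created with default arguments and converted with `.toarray()` so the example does not depend on the `sparse` / `sparse_output` naming:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column of species labels.
toy = pd.DataFrame({"Species": ["Bream", "Roach", "Bream", "Pike"]})

encoder = OneHotEncoder()  # default output is a sparse matrix
encoded = encoder.fit_transform(toy[["Species"]]).toarray()

# Categories are sorted; each row has a single 1 in its species' column.
print(encoder.categories_[0])
print(encoded)
```

Each species becomes its own 0/1 column, which lets the linear model learn a separate offset per species.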

In [8]:
# Obtaining dependent and independent variables for training the linear regressor
linear_X_train = feature_selection_and_preprocessing_linear_regression(
                    data_train.drop('Weight', axis=1)
                ) 
linear_y_train =  data_train['Weight']

In [9]:
# Performing 5-fold cross validation before fitting the model
# (with scoring='r2' the printed value is the R^2 score, not a classification accuracy)
scores = cross_val_score(linear_model, linear_X_train, linear_y_train, scoring='r2', cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.91 (+/- 0.07)
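
The value printed above is labelled "Accuracy", but with `scoring='r2'` it is the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal sketch of that formula on made-up numbers (`y_true` and `y_pred` are hypothetical), checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted weights.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the targets.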


In [10]:
# Fitting the linear regression model
linear_model.fit(linear_X_train, linear_y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(sparse=False),
                                                  ['Species'])])),
                ('linearregression', LinearRegression())])

In [11]:
# Scoring the linear regression model on the test data (score() returns R^2 for regressors)
linear_model.score(
        feature_selection_and_preprocessing_linear_regression(
            data_test.drop('Weight', axis=1)
        ), 
        data_test['Weight']
)

0.9379921317795373
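
R² is unitless, so it can help to also report an error in the units of the target (here, the weight units of the dataset). The notebook does not compute these; a sketch on hypothetical values using `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics`:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted weights.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 320.0]

# Mean absolute error: average size of the miss.
mae = mean_absolute_error(y_true, y_pred)

# Root mean squared error: penalizes large misses more strongly.
rmse = mean_squared_error(y_true, y_pred) ** 0.5

print(mae, rmse)
```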

## Performing Logistic Regression on chosen dataset

With logistic regression, the goal is to predict the species of a fish from its weight and measurements.

In [12]:
def feature_selection_and_preprocessing_logistic_regression(dataset):
    # Since there are no null values all the attributes will be used to compute the species
    features = dataset[['Length1', 'Length2', 'Length3', 'Height', 'Width', 'Weight']].copy()
    return features

In [13]:
# creating the logistic regression model
logistic_model = make_pipeline(
    make_column_transformer(
        (StandardScaler(), ['Weight']),
        remainder='passthrough'
    ),
    LogisticRegression(max_iter=1000,solver='liblinear')
)
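
The pipeline above standardizes only `Weight` and passes the other measurements through unscaled. An alternative, not used in this notebook, is to standardize every numeric column before a regularized model such as `LogisticRegression`. A sketch under that assumption, with hypothetical toy data (`toy_X`, `toy_y`):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements for two species.
toy_X = pd.DataFrame({
    "Weight": [240.0, 290.0, 40.0, 70.0, 340.0, 55.0],
    "Height": [11.5, 12.5, 4.0, 5.2, 12.4, 4.6],
})
toy_y = ["Bream", "Bream", "Roach", "Roach", "Bream", "Roach"]

# StandardScaler is applied to all columns, then the classifier is fitted.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(toy_X, toy_y)

# After scaling, every column has mean 0 and unit standard deviation.
scaled = scaled_model.named_steps["standardscaler"].transform(toy_X)
print(scaled.mean(axis=0), scaled.std(axis=0))
```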

In [14]:
# Fitting the logistic regression model
logistic_model.fit(
    feature_selection_and_preprocessing_logistic_regression(
        data_train.drop('Species', axis=1)
    ),
    data_train['Species']
)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['Weight'])])),
                ('logisticregression',
                 LogisticRegression(max_iter=1000, solver='liblinear'))])

In [15]:
# Testing the accuracy of the logistic regression model
train_predictions = logistic_model.predict(
    feature_selection_and_preprocessing_logistic_regression(
        data_train.drop('Species', axis=1)
    )
)

test_predictions = logistic_model.predict(
    feature_selection_and_preprocessing_logistic_regression(
        data_test.drop('Species', axis=1)
    )
)

print("Train accuracy:", accuracy_score(
    data_train['Species'],
    train_predictions
))
print("Test accuracy:", accuracy_score(
    data_test['Species'],
    test_predictions
))

Train accuracy: 0.963963963963964
Test accuracy: 0.9375
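
A single accuracy number does not show which species are confused with each other. One way to inspect that is a confusion matrix. The labels below are hypothetical and only illustrate how `sklearn.metrics.confusion_matrix` is read (rows are true classes, columns are predicted classes):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted species.
y_true = ["Bream", "Roach", "Perch", "Perch", "Roach", "Bream"]
y_pred = ["Bream", "Perch", "Perch", "Perch", "Roach", "Bream"]

# Fix the label order so rows and columns are easy to interpret.
labels = ["Bream", "Perch", "Roach"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

In this toy case the only off-diagonal entry is a Roach predicted as Perch.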


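Both models perform fairly well on the held-out test data: the linear regression reaches an R² of about 0.94, and the logistic regression reaches a classification accuracy of about 0.94.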