### General Instructions
Abalone are a popular shellfish. Pressure on the abalone population from the fishing industry has caused the species to go into decline.  Efforts have been underway for some time to limit the harvest to abalone above a certain age, but there is no way to accurately determine the age of an abalone without counting the layers (rings) of its shell, and counting the rings requires harvesting the animal. In the dataset used here, the ring count plus 1.5 gives the age in years.

Researchers from the University of Tasmania have compiled a [dataset](https://archive.ics.uci.edu/ml/datasets/Abalone) of physical characteristics, many of which can be measured without harming the animal, along with a count of rings for a large number of abalone harvested off the Australian coast.  Use these data, which should be loaded to the *abalone* folder under your file store root folder, to build a regression model that predicts the number of rings (and therefore the age) of an abalone based on the following characteristics:

* sex
* mm_length
* mm_diameter
* mm_height
* g_whole_weight

Replace any missing values for the last four of these characteristics with the median value.  Replace any missing values for sex with the most frequently occurring value. Handle sex as a categorical feature.  Build a linear regression model and package your data transformations with the model as a pipeline, so that the model can more easily be converted into an application that could be deployed to aid fishermen collecting abalone.

Be sure to score your model and use 5-fold cross-validation to reduce the impact of random splits on your results.  Print the model score where indicated in the cells below.

In [0]:
# Notebook config
USER_NAME = dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
FILE_STORE_ROOT = '/FileStore/shared_uploads/'+USER_NAME
DATA_FILE_NAME = FILE_STORE_ROOT + '/abalone'

In [0]:
# Read the data to a pandas DataFrame and assemble feature and label arrays
import pandas as pd
import numpy as np

df = (
  spark
    .read
    .csv(
      FILE_STORE_ROOT + '/abalone/abalone.data',
      sep=',',
      header=True,
      inferSchema=True,
      nanValue='?'
      )
  ).toPandas()

# Encode 'sex' as integer category codes
# (note: .codes represents a missing value as -1, not NaN)
df['sex'] = pd.Categorical(df['sex']).codes

# Remove unnecessary columns
df_clean = df.drop(columns=['g_shucked_weight', 'g_viscera_weight', 'g_shell_weight'])

# Assemble feature and label arrays from the cleaned frame
features = df_clean[['sex', 'mm_length', 'mm_diameter', 'mm_height', 'g_whole_weight']]
labels = df_clean['rings']

display(df_clean)



sex,mm_length,mm_diameter,mm_height,g_whole_weight,rings
2,0.455,0.365,0.095,0.514,15
2,0.35,0.265,0.09,0.2255,7
0,0.53,0.42,0.135,0.677,9
2,0.44,0.365,0.125,0.516,10
1,0.33,0.255,0.08,0.205,7
1,0.425,0.3,0.095,0.3515,8
0,0.53,0.415,0.15,0.7775,20
0,0.545,0.425,0.125,0.768,16
2,0.475,0.37,0.125,0.5095,9
0,0.55,0.44,0.15,0.8945,19
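
One detail of the encoding step is worth checking on a small example: `pd.Categorical(...).codes` represents a missing entry as `-1` rather than `NaN`. The sketch below uses a made-up `sex` column (the values and the `codes_with_nan` name are illustrative assumptions, not taken from the abalone file) to show that behaviour and one possible way to keep missing entries visible to a later imputer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry (not real abalone data)
sex = pd.Series(['M', 'F', None, 'I', 'M'])

# Categories are sorted (F, I, M); the missing entry becomes -1
codes = pd.Categorical(sex).codes
print(codes)

# One option: turn the -1 sentinel back into NaN so an imputer can find it
codes_with_nan = pd.Series(codes, dtype='float').replace(-1, np.nan)
print(codes_with_nan.isna().sum())
```

Whether this matters in practice depends on whether the loaded file actually contains missing `sex` values.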


In [0]:
# Assemble your model pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
 

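# Chain imputation and encoding for 'sex' so the encoder sees the imputed values
# (ColumnTransformer applies its transformers side by side, not in sequence)
sex_pipeline = Pipeline([
    ('most_frequent_missing', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('cat_encoder', OneHotEncoder())
])

# Define stages for ColumnTransformer
transformer = ColumnTransformer(
  [
    ('sex', sex_pipeline, ['sex'] ),
    ('median_missing', SimpleImputer(missing_values=np.nan, strategy='median'), ['mm_length','mm_diameter','mm_height','g_whole_weight'] )
  ],
  remainder='passthrough'
  )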

# Instantiate and configure model
reg = LinearRegression()

# Define the pipeline
pipeline = Pipeline([
    ('transformer', transformer),
    ('linear_regression', reg)
])
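
A minimal sketch of how the imputation and encoding requirements can be checked on a tiny frame with deliberately missing values. It assumes the imputer and encoder for `sex` are chained in a nested `Pipeline`, since `ColumnTransformer` applies each listed transformer to the original columns independently. The frame `toy` and the names `sex_steps` and `toy_transformer` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    'sex': [2.0, 0.0, np.nan, 1.0, 2.0],
    'mm_length': [0.45, np.nan, 0.53, 0.44, 0.33],
})

# Impute first, then one-hot encode the imputed column
sex_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder()),
])

toy_transformer = ColumnTransformer([
    ('sex', sex_steps, ['sex']),
    ('numeric', SimpleImputer(strategy='median'), ['mm_length']),
])

out = toy_transformer.fit_transform(toy)
# The result may be sparse depending on density; convert for inspection
out = out.toarray() if hasattr(out, 'toarray') else out

print(out.shape)            # three one-hot columns plus one numeric column
print(np.isnan(out).any())  # no missing values should remain
```

If the check passes on toy data, the same structure can be compared against the transformer used in the real pipeline.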

In [0]:
# Train your model using a 5-fold cross-validation
from sklearn.model_selection import cross_val_score
 
pipeline.fit(features, labels)
cv_scores = cross_val_score(pipeline, features, labels, cv=5)

In [0]:
# Present your model score
# (for a regressor, cross_val_score reports R^2 per fold; shown here as a percentage)
print('Result of 5-fold cross-validation', cv_scores)
print('Model Accuracy - Average Mean: %.2f%% ' % (np.mean(cv_scores)*100))
print('Model Accuracy - Average Standard Deviation: %.2f%% ' % (np.std(cv_scores)*100))

Result of 5-fold cross-validation [ 0.17605253 -0.17381158  0.22911154  0.37349785  0.29870041]
Model Accuracy - Average Mean: 18.07% 
Model Accuracy - Average Standard Deviation: 18.93% 
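
For a regressor such as `LinearRegression`, `cross_val_score` uses the estimator's default `score` method, which is R², so the values above are not classification accuracies and can be negative. Passing `cv=5` to a regressor also produces unshuffled folds, so the spread between folds may partly reflect the row order of the file. The sketch below, on synthetic stand-in data (an assumption, not the abalone measurements), shows how shuffled folds and an additional error metric could be requested.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the abalone measurements)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Shuffle rows before splitting so fold membership does not depend on file order
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# R^2 per fold (the default metric for regressors, requested explicitly here)
r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')

# Mean absolute error per fold; scikit-learn returns it negated, so flip the sign
mae = -cross_val_score(LinearRegression(), X, y, cv=cv,
                       scoring='neg_mean_absolute_error')

print('R^2 per fold:', r2)
print('MAE per fold:', mae)
```

With the abalone pipeline, the mean absolute error would be expressed in rings, which may be easier to interpret than a percentage.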


In [0]:
# Model score without applying 5-fold cross-validation
 
# fit the model on the full dataset
pipeline.fit(features, labels)
 
# make predictions (the model predicts ring counts, not years)
predicted_rings = pipeline.predict(features)
 
# calculate R^2 on the same data used for fitting
lr_score = pipeline.score(features, labels)
print('Linear regression Model Score: %.2f%% ' % (lr_score*100))

Linear regression Model Score: 36.91% 
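
Because the model predicts ring counts, a final conversion step is needed to report an age. The sketch below assumes the convention given in the UCI dataset description, where age in years is the ring count plus 1.5. The helper name `rings_to_age` and the example values are hypothetical.

```python
import numpy as np

def rings_to_age(rings, offset=1.5):
    """Convert ring counts to age in years, assuming age = rings + offset."""
    return np.asarray(rings, dtype=float) + offset

# Made-up ring predictions for illustration
example_rings = np.array([7, 9.4, 15])
print(rings_to_age(example_rings))
```

In the notebook this could be applied to the array returned by `pipeline.predict(features)`.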
