<a href="https://colab.research.google.com/github/dvisionst/Abalone_Exercise/blob/main/Abalone_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Abalone Core Exercise
- Jose Flores
- 22 July 2022

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer









## Prepare the Abalone Dataset for Modeling
The rings column will be your target column.

Note: As with tree rings, the number of rings on an abalone shell can be used to estimate its age.

In [12]:
# importing the data to use in a dataframe and displaying first 5 rows
data = '/content/abalone.data'
df = pd.read_csv(data)
df.head()

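|   | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 1 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 2 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 3 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |
| 4 | I | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |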
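
In the preview above, the first record (`M, 0.455, ...`) sits in the header position, which suggests the file has no header row and `read_csv` used one observation as column labels. The following is a minimal sketch of one way to keep that row, assuming the file really is header-less. The in-memory text stands in for `/content/abalone.data`, and the names `columns`, `raw`, and `sample` are illustrative only.

```python
import io

import pandas as pd

# Column names as listed in this notebook.
columns = ['Sex', 'Length', 'Diameter', 'Height',
           'Whole_weight', 'Shucked_weight', 'Viscera_weight',
           'Shell_weight', 'Rings']

# Stand-in for the header-less file: the first three records from the preview.
raw = io.StringIO(
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
    "F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9\n"
)

# header=None keeps the first record as data; names= labels the columns on load.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample.shape)
print(sample.head())
```

Loading the real file this way would change the row count, and so the exact rows that land in each split, compared with the outputs recorded in this notebook.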


In [13]:
# adding the column names from the names file that was downloaded

df.columns = ['Sex', 'Length', 'Diameter', 'Height',
              'Whole_weight', 'Shucked_weight', 'Viscera_weight',
              'Shell_weight', 'Rings']
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4176 non-null   object 
 1   Length          4176 non-null   float64
 2   Diameter        4176 non-null   float64
 3   Height          4176 non-null   float64
 4   Whole_weight    4176 non-null   float64
 5   Shucked_weight  4176 non-null   float64
 6   Viscera_weight  4176 non-null   float64
 7   Shell_weight    4176 non-null   float64
 8   Rings           4176 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


## 1) Separate your data into the features matrix (X) and target vector (y).

In [14]:
# no missing values, so go straight to the features matrix and target vector
# target vector (y) will be the Rings column

X = df.drop(columns='Rings')
y = df['Rings']

## 2) Train/test split the data. Please use random state 42 for consistency.

In [15]:
# splitting the data into train and test sets, using random_state=42 for consistency

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
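
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. The sketch below illustrates that default and the effect of a fixed `random_state` on a made-up 8-row frame. The names `toy`, `X_toy`, `y_toy`, and the values are hypothetical and are not taken from the abalone data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 8 rows, standing in for the abalone data.
toy = pd.DataFrame({'Length': np.linspace(0.1, 0.8, 8),
                    'Rings': [7, 9, 10, 7, 8, 15, 9, 11]})
X_toy = toy.drop(columns='Rings')
y_toy = toy['Rings']

# With no test_size given, 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(len(X_tr), len(X_te))

# Repeating the call with the same seed reproduces the same partition.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.index.equals(X_tr2.index))
```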

## 3) Create a ColumnTransformer to preprocess the data. Remember to:

    a) Create column selectors for the numeric and categorical columns

    b) Create a OneHotEncoder for one-hot encoding the categorical columns

    c) Create a StandardScaler for scaling numeric columns

    d) Match each transformer with the appropriate selector in a tuple

    e) Use the tuples to create a ColumnTransformer to preprocess the data.

In [16]:
# a) creating column selectors for both numeric and categorical data

cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
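
A selector built this way is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. Here is a small sketch on a made-up frame. `toy`, `cat_cols`, and `num_cols` are illustrative names, and the `Sex` column is created with an explicit `object` dtype so the selector behaves the same across pandas versions.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical two-row frame with one text column and two numeric columns.
toy = pd.DataFrame({
    'Sex': pd.Series(['M', 'F'], dtype='object'),
    'Length': [0.35, 0.53],
    'Height': [0.09, 0.135],
})

# Calling a selector on a DataFrame returns a list of matching column names.
cat_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)
print(cat_cols)
print(num_cols)
```

Calling `cat_selector(X_train)` and `num_selector(X_train)` in the same way is a quick check that every feature column is picked up by exactly one selector.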

In [19]:
# b) creating a OneHotEncoder for one-hot encoding the categorical columns

ohe = OneHotEncoder(handle_unknown='ignore')
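
With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as a row of zeros instead of raising an error. The sketch below shows this with a hypothetical unseen label `'X'`. The three fitted labels match the `Sex` codes shown in the preview, and `enc` and `unseen_row` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the three Sex codes that appear in the data preview.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['M'], ['F'], ['I']], dtype=object))

# 'X' is a made-up label that was not present at fit time.
unseen_row = enc.transform(np.array([['X']], dtype=object)).toarray()
print(unseen_row)

# A known label still gets a single 1 in its own column.
known_row = enc.transform(np.array([['F']], dtype=object)).toarray()
print(known_row)
```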


In [21]:
# c) Creating a standard scaler for the numeric features
scaler = StandardScaler()


In [24]:
# d) matching each transformer with its selector in a tuple

num_tuple = (scaler, num_selector)
cat_tuple = (ohe, cat_selector)

In [25]:
# e) Use the tuples to create a ColumnTransformer to preprocess the data.
col_transformer = make_column_transformer(num_tuple, cat_tuple,
                                          remainder='passthrough')

In [26]:
# fitting on the training set only, then transforming both the training and test sets
col_transformer.fit(X_train)

X_train_processed = col_transformer.transform(X_train)
X_test_processed = col_transformer.transform(X_test)
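
The cell above fits on `X_train` only and reuses the fitted statistics for `X_test`. The sketch below shows what that means for the scaler step, using made-up numbers: the mean and standard deviation come from the training rows, and the test row is scaled with those same values. `train_vals`, `test_vals`, and `scaler_demo` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data: three training rows and one test row.
train_vals = np.array([[1.0], [2.0], [3.0]])
test_vals = np.array([[4.0]])

# fit() learns the mean and standard deviation from the training rows only.
scaler_demo = StandardScaler().fit(train_vals)
print(scaler_demo.mean_, scaler_demo.scale_)

# transform() applies the training statistics to the unseen test row.
test_scaled = scaler_demo.transform(test_vals)
print(test_scaled)
```

Because the test row is scaled with training statistics, processed test columns are not expected to have exactly zero mean or unit variance.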


## Displaying Results of the transformed data

In [30]:

# transformed and processed training data
X_train_df = pd.DataFrame(X_train_processed)
X_train_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.546422 | -1.55617 | -1.053558 | -1.258947 | -1.260633 | -1.337852 | -1.212341 | 0.0 | 0.0 | 1.0 |
| 1 | 0.795725 | 0.521917 | 0.706869 | 0.605245 | 0.789463 | 0.749584 | 0.409117 | 0.0 | 0.0 | 1.0 |
| 2 | 0.252013 | 0.319177 | 0.354783 | 0.37885 | 0.602065 | 0.040129 | 0.172355 | 1.0 | 0.0 | 0.0 |
| 3 | 1.172142 | 0.927397 | 0.824231 | 1.234461 | 1.277152 | 1.490874 | 0.940037 | 1.0 | 0.0 | 0.0 |
| 4 | -1.462774 | -1.4548 | -1.17092 | -1.233452 | -1.177094 | -1.205966 | -1.212341 | 0.0 | 0.0 | 1.0 |


In [31]:
# transformed and processed testing data
X_test_df = pd.DataFrame(X_test_processed)
X_test_df.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.753901 | 0.927397 | 0.824231 | 1.115145 | 0.895581 | 1.35444 | 0.380418 | 1.0 | 0.0 | 0.0 |
| 1 | 0.544781 | 0.572602 | 0.237422 | 0.654196 | 1.141683 | 0.526742 | 0.089847 | 1.0 | 0.0 | 0.0 |
| 2 | 0.084716 | 0.116436 | 0.12006 | 0.195286 | 0.170822 | 0.14018 | 0.079085 | 0.0 | 1.0 | 0.0 |
| 3 | 0.963022 | 0.978082 | 0.589507 | 0.802066 | 0.728502 | 0.804157 | 0.868291 | 1.0 | 0.0 | 0.0 |
| 4 | -0.208052 | -0.289044 | 0.354783 | -0.357445 | -0.54039 | -0.346434 | -0.243771 | 1.0 | 0.0 | 0.0 |
