
# Practice Activity: Bagging
Week 3

## The Abalone Data

In [3]:
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"

# Column names for the data
column_names = ["sex", "length", "diameter", "height", "whole_weight", 
                "shucked_weight", "viscera_weight", "shell_weight", "rings"]

# Import data from web link into a pandas data frame
data = pd.read_csv(url, header=None, names=column_names)

# Print the data frame to verify the import (pandas shows the first and last 5 rows)
print(data)

     sex  length  diameter  height  whole_weight  shucked_weight  \
0      M   0.455     0.365   0.095        0.5140          0.2245   
1      M   0.350     0.265   0.090        0.2255          0.0995   
2      F   0.530     0.420   0.135        0.6770          0.2565   
3      M   0.440     0.365   0.125        0.5160          0.2155   
4      I   0.330     0.255   0.080        0.2050          0.0895   
...   ..     ...       ...     ...           ...             ...   
4172   F   0.565     0.450   0.165        0.8870          0.3700   
4173   M   0.590     0.440   0.135        0.9660          0.4390   
4174   M   0.600     0.475   0.205        1.1760          0.5255   
4175   F   0.625     0.485   0.150        1.0945          0.5310   
4176   M   0.710     0.555   0.195        1.9485          0.9455   

      viscera_weight  shell_weight  rings  
0             0.1010        0.1500     15  
1             0.0485        0.0700      7  
2             0.1415        0.2100      9  
3      

In [4]:
data.dtypes

sex                object
length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
rings               int64
dtype: object

## Preprocessing

In [10]:
df2 = pd.get_dummies(data, columns = ['sex'])

In [11]:
df2.dtypes

length            float64
diameter          float64
height            float64
whole_weight      float64
shucked_weight    float64
viscera_weight    float64
shell_weight      float64
rings               int64
sex_F               uint8
sex_I               uint8
sex_M               uint8
dtype: object
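
To make the effect of `pd.get_dummies` easier to see, here is a minimal sketch on a hypothetical four-row frame (the `toy` data is invented for illustration and is not part of the abalone set). It shows the single `sex` column being replaced by one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame (not the abalone data) with the same kind of categorical column
toy = pd.DataFrame({"sex": ["M", "F", "I", "M"], "rings": [15, 9, 7, 10]})

# One indicator column per category; the original "sex" column is dropped
encoded = pd.get_dummies(toy, columns=["sex"])

# Each row has exactly one indicator switched on
indicator_cols = ["sex_F", "sex_I", "sex_M"]
row_totals = encoded[indicator_cols].astype(int).sum(axis=1)

print(encoded.columns.tolist())
print(row_totals.tolist())
```

The indicator dtype depends on the pandas version: older releases produce `uint8` (as in the output above), while pandas 2.x produces `bool`. Either works as input to the scikit-learn models used here.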

## Build the bagging model

Model 1:

Bagged random forest, evaluated by MAE on a held-out test set

In [33]:
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Separate features and target variable
X = df2.drop(['rings'], axis=1)
y = df2['rings']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

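# Wrap a random forest in a bagging ensemble (10 bagged forests of 100 trees each)
# `estimator` replaces the older `base_estimator` keyword (scikit-learn 1.2+)
rf = RandomForestRegressor(n_estimators=100, random_state=7)
model = BaggingRegressor(estimator=rf, n_estimators=10, random_state=7)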

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on test data
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)



Mean Absolute Error: 1.4939904306220098
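
As a rough illustration of what bagging does under the hood, the sketch below bags decision trees by hand: it draws bootstrap samples, fits one tree per sample, and averages their predictions. It uses synthetic data from `make_regression` and plain decision trees as stand-ins (both are assumptions for speed, not the abalone data or the random forest above), and it is not how `BaggingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the abalone data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

rng = np.random.default_rng(7)
n_bags = 10
per_model_preds = []
for _ in range(n_bags):
    # Bootstrap: sample training rows with replacement, same size as the training set
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeRegressor(random_state=7).fit(X_tr[idx], y_tr[idx])
    per_model_preds.append(tree.predict(X_te))

# Aggregate: average the individual predictions
bagged_pred = np.mean(per_model_preds, axis=0)
bagged_mae = mean_absolute_error(y_te, bagged_pred)
single_maes = [mean_absolute_error(y_te, p) for p in per_model_preds]

print("Average MAE of individual trees:", np.mean(single_maes))
print("MAE of the bagged average:", bagged_mae)
```

The MAE of the averaged prediction can never exceed the average MAE of the individual models, which is one way to see why aggregation helps.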


Model 2:

Bagged RF with 10-fold cross-validation (note: only 2 bagged forests here, versus 10 in Model 1)

In [35]:
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.model_selection import cross_val_score

# Separate features and target variable
X = df2.drop(['rings'], axis=1)
y = df2['rings']

model = BaggingRegressor(
    RandomForestRegressor(n_estimators=100, random_state=7),
    n_estimators=2,
    random_state=7,
)
cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=10).mean()

-1.5672391799479075

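The score is negative because scikit-learn's `neg_mean_absolute_error` scorer negates MAE so that higher is always better. The cross-validated MAE is therefore about 1.567, which is higher (worse) than Model 1's test-set MAE of 1.494. The two numbers are not directly comparable, though: Model 1 was scored on a single 80/20 split with 10 bagged forests, while this model was scored with 10-fold cross-validation and only 2 bagged forests.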
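
The sign convention of the `neg_mean_absolute_error` scorer can be checked with a small sketch. It assumes synthetic data from `make_regression` and a bagged decision tree as stand-ins (neither is from this notebook), and it flips the sign of the fold scores to recover ordinary MAE values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the abalone data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=7)

# Small bagged model, chosen only so the sketch runs quickly
demo_model = BaggingRegressor(DecisionTreeRegressor(random_state=7),
                              n_estimators=5, random_state=7)

# The scorer returns negated MAE, one value per fold
neg_mae_scores = cross_val_score(demo_model, X_demo, y_demo,
                                 scoring="neg_mean_absolute_error", cv=5)

# Flip the sign to report MAE in the usual "lower is better" form
mae_scores = -neg_mae_scores
print("MAE per fold:", mae_scores)
print("Mean MAE:", mae_scores.mean())
```

Looking at the per-fold values rather than only the mean also gives a sense of how much the score varies between folds.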

Model 3: 

Bagged SVR (support vector regression) with 10-fold cross-validation

In [38]:
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

# Separate features and target variable
X = df2.drop(['rings'], axis=1)
y = df2['rings']

# No need to call .fit() first: cross_val_score clones and refits the model on each fold
regr = BaggingRegressor(estimator=SVR(), n_estimators=10, random_state=0)

cross_val_score(regr, X, y, scoring="neg_mean_absolute_error", cv=10).mean()

-1.5881219570171203

In [30]:
from sklearn.metrics import get_scorer_names
# Other scoring options accepted by cross_val_score
get_scorer_names()

['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'matthews_corrcoef',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_negative_likelihood_ratio',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'positive_likelihood_ratio',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'rand_score',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',

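Of the two cross-validated models (Models 2 and 3), we would choose Model 2, the bagged random forest: its mean cross-validated MAE (about 1.567) is slightly lower than the bagged SVR's (about 1.588).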
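
A fairer comparison scores every candidate on identical folds. The sketch below is one way to do that, assuming synthetic data from `make_regression` and deliberately small ensembles so it runs quickly; the `candidates` dictionary and the fold settings are illustrative choices, not the notebook's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the abalone data)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=7)

# One shared, shuffled fold definition so every model sees the same splits
folds = KFold(n_splits=5, shuffle=True, random_state=7)

# Hypothetical candidates with the same number of bagged estimators
candidates = {
    "bagged_rf": BaggingRegressor(
        RandomForestRegressor(n_estimators=10, random_state=7),
        n_estimators=3, random_state=7),
    "bagged_svr": BaggingRegressor(SVR(), n_estimators=3, random_state=7),
}

cv_mae = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_demo, y_demo,
                             scoring="neg_mean_absolute_error", cv=folds)
    cv_mae[name] = -scores.mean()  # flip the sign to get ordinary MAE

# Lower MAE is better
best_name = min(cv_mae, key=cv_mae.get)
print(cv_mae)
print("Lowest mean MAE:", best_name)
```

Keeping the folds and the number of bagged estimators fixed means any difference in MAE reflects the base model rather than the evaluation setup.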