# 1. Preprocessing

## a. Dataset Description
- Provide a brief description of the dataset, including details about its attributes and target attribute. Clearly state the purpose of the dataset. Identify and categorize attributes as discrete or continuous.

## b. Handling NaN Values
- Identify and handle NaN values in the dataset. Remove rows that contain NaN values to ensure data integrity.
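
Before dropping rows, it can help to see where the missing values are. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up and only stand in for the penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the penguins data.
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["MALE", "FEMALE", None],
})

nan_per_column = toy.isna().sum()              # number of missing values per column
rows_with_nan = toy.isna().any(axis=1).sum()   # rows that dropna() would remove
cleaned = toy.dropna()                         # keep only complete rows

print(nan_per_column)
print(rows_with_nan, len(cleaned))
```

Comparing `rows_with_nan` with the total row count shows how much data the removal costs.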

## c. Numerical Attribute Statistics
- Calculate the mean and variance for each numerical attribute in the dataset. This step helps in understanding the central tendency and spread of the numerical features.
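
One detail worth keeping in mind for this step: pandas computes the sample variance by default (`ddof=1`), while NumPy's `np.var` defaults to the population variance (`ddof=0`). A small sketch on made-up values (not taken from the penguins dataset) shows the difference:

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the penguins dataset.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = values.var()                     # pandas default: ddof=1 (divide by n - 1)
population_var = np.var(values.to_numpy())    # NumPy default: ddof=0 (divide by n)

print(sample_var, population_var)
```

Passing `ddof=0` to `Series.var` (or `ddof=1` to `np.var`) makes the two agree.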


In [1]:
import pandas as pd
import numpy as np



# Dataset
dataFrame = pd.read_csv('https://raw.githubusercontent.com/Vali2804/ML_setDate/main/penguins.csv')

# 1. Preprocessing
# Dataset Description 

#a.
# Penguin Dataset Description
attributes = ['species', 'island', 'bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
targetAttribute = 'species'
purpose = 'Classification of penguin species'

# Attribute types
discreteAttributes = ['species', 'island']
continuousAttributes = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# b.
# Remove NaN values
dataFrame = dataFrame.dropna()

# c.
# Calculate mean and variance for each numerical attribute
num_attributes = dataFrame[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
mean_values = num_attributes.mean()
variance_values = num_attributes.var()

print('Mean values for each attribute:\n', mean_values)
print('Variance values for each attribute:\n', variance_values)


Mean values for each attribute:
 bill_length_mm         43.992793
bill_depth_mm          17.164865
flipper_length_mm     200.966967
body_mass_g          4207.057057
dtype: float64
Variance values for each attribute:
 bill_length_mm           29.906333
bill_depth_mm             3.877888
flipper_length_mm       196.441677
body_mass_g          648372.487699
dtype: float64


# 2. Weak estimators
## a. Data Preprocessing - Handling Non-Numeric Discrete Attributes
- Convert the discrete attributes that are not numeric (such as strings or boolean values) into numerical values. If this doesn't apply to your dataset, provide a short explanation of how you would proceed.
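
As a small illustration of the conversion, the sketch below encodes a made-up series of island names with pandas category codes and keeps the code-to-name mapping, so that encoded values can be translated back later:

```python
import pandas as pd

# Illustrative values; the codes follow the sorted order of the categories.
islands = pd.Series(["Torgersen", "Biscoe", "Dream", "Biscoe"])
as_category = islands.astype("category")

codes = as_category.cat.codes                           # integer code per row
mapping = dict(enumerate(as_category.cat.categories))   # code -> original name

print(codes.tolist())
print(mapping)
```

Note that the codes impose an arbitrary order on the categories, which matters if the encoded column is later compared against a threshold.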

## b. Custom Function for Variable Discretization and Testing with Example
- Write a function get_splits which, given an attribute and the labels column, identifies the splits that could be used to discretize the variable. Test your function on an example.
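
One common way to discretize a continuous attribute for threshold-based weak learners is to place candidate thresholds at the midpoints between consecutive distinct values where the class label changes. The sketch below follows that idea; the name `get_splits_midpoints` is hypothetical and this is only one possible reading of the task, not necessarily the intended solution:

```python
import numpy as np

def get_splits_midpoints(attribute_values, labels):
    """Candidate thresholds between consecutive distinct attribute values.

    Hypothetical helper: a midpoint is skipped only when both neighbouring
    values carry one and the same single label.
    """
    attribute_values = np.asarray(attribute_values, dtype=float)
    labels = np.asarray(labels)
    unique_values = np.unique(attribute_values)  # sorted ascending

    splits = []
    for lower, upper in zip(unique_values[:-1], unique_values[1:]):
        labels_lower = set(labels[attribute_values == lower].tolist())
        labels_upper = set(labels[attribute_values == upper].tolist())
        same_pure_label = len(labels_lower) == 1 and labels_lower == labels_upper
        if not same_pure_label:
            splits.append(float((lower + upper) / 2))
    return splits

# Toy example (made-up data): the label changes between 2 and 3, and between 4 and 5.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_labels = [0, 0, 1, 1, 0]
toy_splits = get_splits_midpoints(toy_values, toy_labels)
print(toy_splits)
```

Skipping midpoints between values that share a single label keeps the number of candidate weak estimators small without discarding any useful threshold.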

## c. AdaBoost Configuration - Counting Weak Estimators and External Threshold
- How many weak estimators (or splits) are available in our version of AdaBoost? (include the external threshold)

In [2]:
# 2. Weak estimators
# a. convert discrete attributes to numeric
dataFrame['species'] = dataFrame['species'].astype('category').cat.codes
dataFrame['island'] = dataFrame['island'].astype('category').cat.codes

print('Discrete attributes converted to numeric:\n', dataFrame)
# b. the function get_splits()
def get_splits(attribute_values, labels):
    unique_values = np.unique(attribute_values)
    splits = []

    for value in unique_values:
        left_mask = attribute_values == value
        right_mask = ~left_mask

        left_labels = labels[left_mask]
        right_labels = labels[right_mask]

        split = (left_labels.mean() + right_labels.mean()) / 2
        splits.append((value, split))

    return splits

# testing get_splits()
attribute_values = dataFrame['species'].values
labels = dataFrame['island'].values
splits = get_splits(attribute_values, labels)
print('Splits for attribute species:\n', splits)

Discrete attributes converted to numeric:
      species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0          0       2            39.1           18.7              181.0   
1          0       2            39.5           17.4              186.0   
2          0       2            40.3           18.0              195.0   
4          0       2            36.7           19.3              193.0   
5          0       2            39.3           20.6              190.0   
..       ...     ...             ...            ...                ...   
338        2       0            47.2           13.7              214.0   
340        2       0            46.8           14.3              215.0   
341        2       0            50.4           15.7              222.0   
342        2       0            45.2           14.8              212.0   
343        2       0            49.9           16.1              213.0   

     body_mass_g     sex  
0         3750.0    MALE  
1         3800

# 3. AdaBoost
## a. The function calculate_split_probs
- Write a function calculate_split_probs that takes the following arguments:
    + attribute - the name of one attribute (a column from your dataframe)
    + split - the value of the split
    + D - the current distribution of weights for each instance of the training dataset (sum(D) should be 1)

The function should return a dictionary following this structure:

In [327]:
def calculate_split_probs(attribute, split, D):
    left_mask = attribute <= split
    right_mask = attribute > split

    # Total weight on each side of the split, used to normalize the per-label weights
    left_denominator = np.sum(D[left_mask])
    right_denominator = np.sum(D[right_mask])

    left_weights = {}
    right_weights = {}

    # Each distinct value of `attribute` is treated as a label here
    for label in np.unique(attribute):
        if left_denominator != 0:
            left_weights[label] = np.sum(D[left_mask & (attribute == label)]) / left_denominator
        else:
            left_weights[label] = 0

        if right_denominator != 0:
            right_weights[label] = np.sum(D[right_mask & (attribute == label)]) / right_denominator
        else:
            right_weights[label] = 0

    return {'left': left_weights, 'right': right_weights}

# testing calculate_split_probs()
attribute = dataFrame['species'].values
split = 0.5
D = np.ones(len(attribute)) / len(attribute)
split_probs = calculate_split_probs(attribute, split, D)
print('Split probabilities for attribute species:\n', split_probs)


Split probabilities for attribute species:
 {'left': {0: 1.0, 1: 0.0, 2: 0.0}, 'right': {0: 0.0, 1: 0.3636363636363635, 2: 0.6363636363636362}}
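
In the cell above, the loop runs over `np.unique(attribute)`, so the dictionary keys are attribute values. These coincide with class labels only when the attribute is the target column itself, as in this test. A possible variant that receives the labels explicitly is sketched below; the name `split_probs_with_labels` and the extra `labels` parameter are assumptions and not part of the signature requested in the assignment:

```python
import numpy as np

def split_probs_with_labels(attribute_values, labels, split, D):
    """Normalized weight of each class label on both sides of `split`.

    Hypothetical variant: the extra `labels` argument is an assumption,
    not part of the signature requested in the assignment.
    """
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    D = np.asarray(D, dtype=float)

    sides = {"left": attribute_values <= split, "right": attribute_values > split}
    result = {}
    for side, mask in sides.items():
        total = D[mask].sum()
        result[side] = {}
        for label in np.unique(labels):
            weight = D[mask & (labels == label)].sum()
            # An empty side gets probability 0 for every label
            result[side][label.item()] = float(weight / total) if total > 0 else 0.0
    return result

# Toy data (made up for illustration)
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = [0, 1, 1, 0]
toy_D = [0.1, 0.2, 0.3, 0.4]
probs = split_probs_with_labels(toy_values, toy_labels, 2.5, toy_D)
print(probs)
```

With this variant the keys are always class labels, whichever attribute is being split.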


## b. The function calculate_split_prediction
- Write a function calculate_split_prediction that takes the output of calculate_split_probs and returns a vector [label_left, label_right], where label_left is the label with the highest weight on the left side of the split, and label_right is the label with the highest weight on the right side of the split.

In [328]:
def calculate_split_prediction(split_probs):
    left_weights = split_probs['left']
    right_weights = split_probs['right']

    label_left = max(left_weights, key=lambda x: left_weights[x])
    label_right = max(right_weights, key=lambda x: right_weights[x])

    return [label_left, label_right]
# testing calculate_split_prediction()
split_prediction = calculate_split_prediction(split_probs)
print('Split prediction for attribute species:\n', split_prediction)

Split prediction for attribute species:
 [0, 2]


## c. The function calculate_split_errors
- Write a function calculate_split_errors that takes the output of calculate_split_probs and returns the error of the split, i.e. 1 - the largest weight on the left side of the split - the largest weight on the right side of the split.

In [329]:
def calculate_split_errors(split_probs):
    left_weights = split_probs['left']
    right_weights = split_probs['right']

    # Labels carrying the largest weight on each side of the split
    best_left_label = max(left_weights, key=left_weights.get)
    best_right_label = max(right_weights, key=right_weights.get)

    # Combined error, followed by the error on each side taken separately
    error = abs(1 - left_weights[best_left_label] - right_weights[best_right_label])
    left_error = abs(1 - left_weights[best_left_label])
    right_error = abs(1 - right_weights[best_right_label])
    return error, left_error, right_error

# testing calculate_split_errors()
split_errors = calculate_split_errors(split_probs)
print('Split errors for attribute species:\n', split_errors)

Split errors for attribute species:
 (0.6363636363636362, 0.0, 0.36363636363636376)
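
Because `calculate_split_probs` normalizes the weights on each side separately, `1 - max_left - max_right` can be negative (here 1 - 1.0 - 0.636...), which is why `abs` is needed above. If the weights are left unnormalized, so that both sides together sum to 1, the same formula gives the weighted misclassification error directly. A minimal sketch under that assumption (the helper name `weighted_split_error` and its arguments are hypothetical):

```python
import numpy as np

def weighted_split_error(attribute_values, labels, split, D):
    """Weighted error of a stump that predicts the heaviest label on each side.

    Hypothetical helper: works on unnormalized weights, assuming sum(D) == 1.
    """
    attribute_values = np.asarray(attribute_values)
    labels = np.asarray(labels)
    D = np.asarray(D, dtype=float)

    correct_weight = 0.0
    for mask in (attribute_values <= split, attribute_values > split):
        if mask.any():
            # Total weight of each label on this side; the largest one is predicted
            side_weights = [D[mask & (labels == label)].sum() for label in np.unique(labels)]
            correct_weight += max(side_weights)
    return float(1.0 - correct_weight)

# Toy data (made up for illustration)
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = [0, 1, 1, 0]
toy_D = [0.1, 0.2, 0.3, 0.4]
toy_error = weighted_split_error(toy_values, toy_labels, 2.5, toy_D)
print(toy_error)
```

On the toy data the left side contributes 0.2 and the right side 0.4 of correctly classified weight, so the error is 0.4 and always stays within [0, 1].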


## d. The function find_best_estimator
- Write a function find_best_estimator that, given a dataframe df, a list of splits for each attribute splits_list and a distribution of weights D, will find the attribute and split that achieves the lowest error.

The function should return the following: [(index_of_best_attribute, best_split, label_left_side, label_right_side), estimator_error].


In [330]:
def find_best_estimator(df, splits_list, D):
    best_attribute_index = None
    best_split_value = None
    best_label_left = None
    best_label_right = None
    best_estimator_error = float('inf')

    for i, attribute in enumerate(continuousAttributes):
        attribute_values = df[attribute].values
        labels = df[targetAttribute].values

        for split_value, _ in splits_list[attribute]:
            split_probs = calculate_split_probs(attribute_values, split_value, D)
            split_errors = calculate_split_errors(split_probs)

            total_error = split_errors[0]

            if total_error < best_estimator_error:
                best_attribute_index = i
                best_split_value = split_value
                best_label_left, best_label_right = calculate_split_prediction(split_probs)
                best_estimator_error = total_error

    return [(best_attribute_index, best_split_value, best_label_left, best_label_right), best_estimator_error]

# Testing the find_best_estimator function
splits_list = {attribute: get_splits(dataFrame[attribute].values, dataFrame[targetAttribute].values) for attribute in continuousAttributes}
weights_distribution = np.ones(len(dataFrame)) / len(dataFrame)
best_estimator_info = find_best_estimator(dataFrame, splits_list, weights_distribution)
print('Best Estimator Information:\n(best_attribute_index, best_split_value, best_label_left, best_label_right)\n', best_estimator_info)


Best Estimator Information:
(best_attribute_index, best_split_value, best_label_left, best_label_right)
 [(0, 32.1, 32.1, 41.1), 0.021084337349397585]


## e.
- Assuming that D follows the uniform distribution, what is the attribute and the split that achieves the lowest error? What are the labels predicted by the split?

## f. The function get_estimator_weight
- Write a function get_estimator_weight that will calculate the weight alpha of the estimator determined at point d.

In [331]:
def get_estimator_weight(estimator_error):
    return 0.5 * np.log((1 - estimator_error) / estimator_error)
# testing the get_estimator_weight function
estimator_weight = get_estimator_weight(best_estimator_info[1])
print('Estimator Weight:\n', estimator_weight)


Estimator Weight:
 1.9189575166372121
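
The formula above divides by the error, so an estimator with zero error would raise a division warning and produce an infinite weight. One possible guard is to clip the error away from 0 and 1 first. The sketch below assumes a small constant `eps`; both the helper name and the value of `eps` are illustrative choices:

```python
import numpy as np

def safe_estimator_weight(estimator_error, eps=1e-10):
    """alpha = 0.5 * ln((1 - err) / err), with err clipped to [eps, 1 - eps].

    `eps` is an assumed tolerance, not a value prescribed by the assignment.
    """
    err = np.clip(estimator_error, eps, 1 - eps)
    return float(0.5 * np.log((1 - err) / err))

alpha_example = safe_estimator_weight(0.021084337349397585)  # error printed earlier
alpha_chance = safe_estimator_weight(0.5)                     # error 0.5 gives weight 0
alpha_perfect = safe_estimator_weight(0.0)                    # large but finite
print(alpha_example, alpha_chance, alpha_perfect)
```

For the error reported by `find_best_estimator` this reproduces the weight shown above, and an estimator with error 0.5 receives weight 0.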


## g. The function update_D
- Write a function update_D that, given a dataframe df, an estimator current_h and the current distribution D, will return the vector with the updated weights of the training instances.
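
The reweighting step itself can be sketched independently of the dataframe. In the usual AdaBoost update, misclassified instances are multiplied by exp(alpha), correctly classified ones by exp(-alpha), and the result is normalized to sum to 1. The helper `reweight` and its boolean `is_misclassified` argument are hypothetical names used only for this illustration:

```python
import numpy as np

def reweight(D, is_misclassified, alpha):
    """One AdaBoost reweighting step (hypothetical helper).

    Misclassified instances are scaled by exp(+alpha), correctly classified
    ones by exp(-alpha); the result is normalized to sum to 1.
    """
    D = np.asarray(D, dtype=float)
    signs = np.where(np.asarray(is_misclassified), 1.0, -1.0)
    updated = D * np.exp(alpha * signs)
    return updated / updated.sum()

# Toy example: four instances, the last one is misclassified (weighted error 0.25)
toy_D = np.full(4, 0.25)
toy_missed = np.array([False, False, False, True])
toy_alpha = 0.5 * np.log((1 - 0.25) / 0.25)
toy_updated = reweight(toy_D, toy_missed, toy_alpha)
print(toy_updated)
```

After the update the misclassified instance holds half of the total weight, which is the standard AdaBoost property that the previous estimator has error 0.5 under the new distribution.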


In [332]:
def update_D(df, current_h, D):
    attribute = continuousAttributes[current_h[0]]
    split_value = current_h[1]
    label_left = current_h[2]
    label_right = current_h[3]

    attribute_values = df[attribute].values
    labels = df[targetAttribute].values

    left_mask = attribute_values <= split_value
    right_mask = attribute_values > split_value

    left_labels = labels[left_mask]
    right_labels = labels[right_mask]

    D_left = D[left_mask]
    D_right = D[right_mask]

    D_updated = np.zeros(len(D))

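    # Misclassified instances are scaled by exp(+alpha), correctly classified ones by exp(-alpha).
    # estimator_weight is read from the global scope; it is not passed as an argument.
    D_updated[left_mask] = D_left * np.exp(estimator_weight * np.where(left_labels != label_left, 1, -1))
    D_updated[right_mask] = D_right * np.exp(estimator_weight * np.where(right_labels != label_right, 1, -1))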

    D_updated /= np.sum(D_updated)

    return D_updated

#testing the update_D function
D_updated = update_D(dataFrame, best_estimator_info[0], weights_distribution)
print('Updated D:\n', D_updated)


Updated D:
 [0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003 0.003003
 0.003003 0.003003 0.003003 0.003003 0.

## h. The function adaboost
- Write a function adaboost that, given a dataframe df and a number of iterations niter, will train an AdaBoost model. The function should return the following dictionary:
```python
{
"estimators": the list of the `niter` estimators; each estimator should be described as (index_of_best_attribute, best_split,
label_left_side, label_right_side)
"estimators_weights": the list of the estimators' weights
}
```

In [333]:
def adaboost(df, niter):
    # Initialize weights
    D = np.ones(len(df)) / len(df)

    # Lists to store estimators and their weights
    estimators = []
    estimators_weights = []

    for _ in range(niter):
        # Get splits for continuous attributes
        splits_list = {attribute: get_splits(df[attribute].values, df[targetAttribute].values) for attribute in continuousAttributes}

        # Find the best weak estimator
        best_estimator_info = find_best_estimator(df, splits_list, D)

        # Get the weight for the weak estimator
        estimator_weight = get_estimator_weight(best_estimator_info[1])

        # Update the weights distribution
        D = update_D(df, best_estimator_info[0], D)

        # Store the weak estimator and its weight
        estimators.append(best_estimator_info[0])
        estimators_weights.append(estimator_weight)

    return {"estimators": estimators, "estimators_weights": estimators_weights}
# testing the adaboost function
adaboost_info = adaboost(dataFrame, 10)
print('AdaBoost Information:\n', adaboost_info)


AdaBoost Information:
 {'estimators': [(0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1), (0, 32.1, 32.1, 41.1)], 'estimators_weights': [1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121, 1.9189575166372121]}


## i. The function adaboost_predict
- Write a function adaboost_predict that, given an instance X and the model created at h, will return the label predicted by the majority of the estimators.

In [334]:
def adaboost_predict(X, model):
    # Initialize a dictionary to store the votes for each label
    label_votes = {}

    # Iterate through each weak estimator in the model
    for i, estimator_info in enumerate(model["estimators"]):
        # Get the information about the current weak estimator
        attribute_index, split_value, label_left, label_right = estimator_info

        # Get the attribute value for the current instance
        attribute_value = X.iloc[attribute_index]

        # Make a prediction based on the weak estimator
        if attribute_value <= split_value:
            label = label_left
        else:
            label = label_right

        # Update the votes for the predicted label
        if label in label_votes:
            label_votes[label] += model["estimators_weights"][i]
        else:
            label_votes[label] = model["estimators_weights"][i]

    # Return the label with the majority of votes
    return max(label_votes, key=label_votes.get)

# testing adaboost_predict function
X = dataFrame.iloc[0]
model = adaboost(dataFrame, 10)
prediction = adaboost_predict(X, model)

print('Prediction for the first instance:\n', prediction)


Prediction for the first instance:
 32.1
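
In the test above `X` is a full dataframe row, so `X.iloc[attribute_index]` counts positions from the first column (`species`), while the estimator indices count positions in `continuousAttributes`. A sketch of a weighted vote that looks the value up by column name instead is given below; `weighted_vote`, the `attribute_names` argument, and the toy model are all hypothetical:

```python
import pandas as pd

def weighted_vote(instance, model, attribute_names):
    """Return the label with the largest total estimator weight (hypothetical helper)."""
    votes = {}
    for (attr_index, split, label_left, label_right), alpha in zip(
            model["estimators"], model["estimators_weights"]):
        value = instance[attribute_names[attr_index]]  # look up by column name
        label = label_left if value <= split else label_right
        votes[label] = votes.get(label, 0.0) + alpha
    return max(votes, key=votes.get)

# Made-up instance and model, for illustration only
toy_attribute_names = ["bill_length_mm", "bill_depth_mm"]
toy_instance = pd.Series({"species": 0, "bill_length_mm": 39.1, "bill_depth_mm": 18.7})
toy_model = {
    "estimators": [(0, 42.0, 0, 1), (1, 16.0, 2, 1), (0, 38.0, 0, 1)],
    "estimators_weights": [1.2, 0.4, 0.5],
}
toy_prediction = weighted_vote(toy_instance, toy_model, toy_attribute_names)
print(toy_prediction)
```

In the toy model two of the three estimators vote for label 1, but the single estimator voting for label 0 has the larger weight (1.2 against 0.9), so the weighted vote returns 0.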


## j. The function cross_validation
- Use cross-validation to identify the appropriate number of iterations that will achieve good results on your dataset. The number of iterations shouldn't exceed 50.
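
One way to organize this search is plain k-fold cross-validation over the candidate iteration counts 1 to 50. The sketch below keeps the fold logic separate from the model: `choose_niter` and the `evaluate(train_idx, val_idx, niter)` callback are hypothetical names, and the callback is assumed to train on the training indices and return the accuracy on the validation indices. The number of folds `k=5` and the seed are arbitrary choices.

```python
import numpy as np

def choose_niter(n_samples, evaluate, candidates=range(1, 51), k=5, seed=0):
    """Pick the iteration count with the best mean validation score.

    `evaluate(train_idx, val_idx, niter)` is a hypothetical callback that trains
    on `train_idx` and returns the accuracy measured on `val_idx`.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)  # k disjoint index sets

    mean_scores = {}
    for niter in candidates:
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            scores.append(evaluate(train_idx, val_idx, niter))
        mean_scores[niter] = float(np.mean(scores))

    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def fake_evaluate(train_idx, val_idx, niter):
    # Stand-in score that peaks at niter == 7; replace with real training and scoring.
    return 1.0 - abs(niter - 7) / 50

best_niter, scores = choose_niter(n_samples=100, evaluate=fake_evaluate)
print(best_niter)
```

With a real callback, `evaluate` would call `adaboost` on `df.iloc[train_idx]` and compare `adaboost_predict` against the true labels of `df.iloc[val_idx]`.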
