## Assignment 2 (50 marks)
#### =====================================================================================================
### Deadline: 09/28 11:59 pm
#### =====================================================================================================

### Problem 1: Perceptron Learning (15 marks)

The dataset `lab02_dataset_1.csv` has a *3-dimensional input space* and a class label that is either *Positive* or *Negative*. For this task, you are **not allowed** to use any *functionality* of the `sklearn` module.

### 1.a (10 marks)

Write a function `my_perceptron()` which applies the perceptron algorithm (refer to the lecture slide covering linear separators for details of this algorithm) to the dataset to create a linear separator. `my_perceptron()` takes the dataset as its input and returns a ***3-dimensional weight vector*** which can be used to create the linear separator (assume `bias = 0`). Use the *initial weights* `w = [3.5, 0.5, -2.5]`. Use a classification threshold of `99%`, i.e., `my_perceptron()` terminates once the misclassification rate is less than `1%`.

In [2]:
import pandas as pd
import numpy as np

def my_perceptron():
    data = pd.read_csv('lab02_dataset_1.csv')

    inputs = data[['X', 'Y', 'Z']].to_numpy()
    classes = data['Class'].map({'Negative': -1, 'Positive': 1}).to_numpy()

    weights = np.array([3.5, 0.5, -2.5])
    bias = 0
    
    for epoch in range(1000):  # cap on full passes over the data; avoids shadowing the built-in iter()
        err = 0
        for i in range(len(inputs)):
            a = np.dot(weights, inputs[i]) + bias
            prediction = 1 if a >= 0 else -1
            if prediction != classes[i]:
                weights += classes[i] * inputs[i]
                bias += classes[i]
                err += 1
        err_rate = err / len(data)
        if err_rate <= 0.01:
            break
    
    return weights, bias

if __name__ == "__main__":
    weights, bias = my_perceptron()
    print("Weights:", weights)
    print("Bias:", bias)

Weights: [0.14536582 2.30319913 0.34978878]
Bias: -1
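
The error count in the training loop is accumulated while the weights are still being updated, so it is not necessarily the misclassification rate of the final separator. A minimal sketch of an independent check follows; the helper name `misclassification_rate` and the toy data are assumptions made for illustration and are not part of the assignment or the dataset.

```python
import numpy as np

def misclassification_rate(weights, bias, inputs, classes):
    """Fraction of points whose predicted sign differs from the +1/-1 label."""
    activations = inputs @ weights + bias
    predictions = np.where(activations >= 0, 1, -1)
    return float(np.mean(predictions != classes))

# Toy data for illustration only (not from lab02_dataset_1.csv)
toy_inputs = np.array([
    [1.0, 2.0, 0.5],
    [-1.0, -2.0, 0.0],
    [0.5, 1.0, 1.0],
    [-0.5, -1.5, -1.0],
])
toy_classes = np.array([1, -1, 1, -1])
toy_weights = np.array([0.0, 1.0, 0.0])

rate = misclassification_rate(toy_weights, 0, toy_inputs, toy_classes)
flipped_rate = misclassification_rate(-toy_weights, 0, toy_inputs, toy_classes)
print(rate, flipped_rate)
```

Calling the same helper on the real inputs and labels with the weights and bias returned by `my_perceptron()` would report the rate for the fixed, final separator.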


### 1.b (5 marks)

Create a *3D plot* which showcases the dataset in 3D space along with the *linear separator* you obtained from `my_perceptron()`. Use two different colors to represent the data points belonging to the two classes for ease of viewing.

In [6]:
import plotly.graph_objects as go

def plot(weights, bias):
    df = pd.read_csv('lab02_dataset_1.csv')
    inputs = df[['X','Y','Z']].to_numpy(dtype=float)
    cls = df['Class'].map({'Negative': -1, 'Positive': 1}).to_numpy(dtype=int)

    # Split for two colors
    pos = (cls == 1)
    neg = (cls == -1)

    fig = go.Figure()

    # Positive class
    fig.add_trace(go.Scatter3d(
        x=inputs[pos,0], y=inputs[pos,1], z=inputs[pos,2],
        mode="markers",
        marker=dict(size=4, symbol="circle", color="royalblue"),
        name="Positive",
        text=["Positive"]*pos.sum(),
        hovertemplate="X=%{x:.3f}<br>Y=%{y:.3f}<br>Z=%{z:.3f}<br>%{text}"
    ))

    # Negative class
    fig.add_trace(go.Scatter3d(
        x=inputs[neg,0], y=inputs[neg,1], z=inputs[neg,2],
        mode="markers",
        marker=dict(size=4, symbol="diamond", color="tomato"),
        name="Negative",
        text=["Negative"]*neg.sum(),
        hovertemplate="X=%{x:.3f}<br>Y=%{y:.3f}<br>Z=%{z:.3f}<br>%{text}"
    ))

    x_min, x_max = inputs[:,0].min(), inputs[:,0].max()
    y_min, y_max = inputs[:,1].min(), inputs[:,1].max()
    z_min, z_max = inputs[:,2].min(), inputs[:,2].max()

    pad = 0.05
    x_range, y_range, z_range = x_max-x_min, y_max-y_min, z_max-z_min
    x_min, x_max = x_min - pad*x_range, x_max + pad*x_range
    y_min, y_max = y_min - pad*y_range, y_max + pad*y_range
    z_min, z_max = z_min - pad*z_range, z_max + pad*z_range

    grid = 40

    x_grid = np.linspace(x_min, x_max, grid)
    y_grid = np.linspace(y_min, y_max, grid)
    X_grid, Y_grid = np.meshgrid(x_grid, y_grid)
    Z_grid = -(weights[0]*X_grid + weights[1]*Y_grid + bias) / weights[2]
    Z_grid = np.clip(Z_grid, z_min, z_max)
    fig.add_trace(go.Surface(
        x=X_grid, y=Y_grid, z=Z_grid,
        opacity=0.35, showscale=False,
        name="separator"
    ))
    
    fig.update_layout(
        title="Perceptron Linear Separator (3D, Plotly)",
        scene=dict(
            xaxis_title="X", yaxis_title="Y", zaxis_title="Z",
            aspectmode="data"
        ),
        legend=dict(x=0.02, y=0.98)
    )

    fig.show()

    
if __name__ == "__main__":
    weights, bias = my_perceptron()
    plot(weights, bias)

### Problem 2: SVM Classification (35 marks)

`lab02_dataset_2.xlsx` contains the claim history of 27,513 homeowner policies. The following table describes the eleven columns in the dataset.

| Name | Description |
| --- | --- |
| policy | Policy Identifier |
| exposure | Duration a Policy Is Exposed in a Year |
| num_claims | Number of Claims in a Year |
| amt_claims | Total Claim Amount in a Year |
| f_primary_age_tier | Age Tier of Primary Insured |
| f_primary_gender | Gender of Primary Insured |
| f_marital | Marital Status of Primary Insured |
| f_residence_location | Location of Residence Property |
| f_fire_alarm_type | Fire Alarm Type |
| f_mile_fire_station | Distance to Nearest Fire Station |
| f_aoi_tier | Amount of Insurance Tier |

We want to predict the *Frequency*, which is the *number of claims per unit of exposure*, using the above features. We first divide the reported number of claims by the exposure. This gives us the *Frequency*. Next, we put the policies into five groups according to their *Frequency* values. We will use this *Group* as our target variable, which has five classes.

| Group | Values |
| :--- | :--- |
| 0 | Frequency = 0 |
| 1 | 0 < Frequency <= 1 |
| 2 | 1 < Frequency <= 2 |
| 3 | 2 < Frequency <= 3 |
| 4 | 3 < Frequency |
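
To make the binning rule concrete, here is a minimal sketch on made-up values. It uses `np.digitize` with right-inclusive edges, which is one possible way to implement the table and not necessarily the intended one. The arrays below are hypothetical and are not taken from `lab02_dataset_2.xlsx`.

```python
import numpy as np

# Hypothetical toy values for illustration only
num_claims = np.array([0, 1, 2, 3, 2])
exposure = np.array([1.0, 1.0, 1.0, 1.0, 0.5])

# Frequency = number of claims per unit of exposure
frequency = num_claims / exposure

# With right=True, a value x gets index i when edges[i-1] < x <= edges[i],
# which matches the table: 0, (0, 1], (1, 2], (2, 3], and above 3
edges = np.array([0.0, 1.0, 2.0, 3.0])
group = np.digitize(frequency, edges, right=True)
print(frequency, group)
```

Note that this sketch assumes every exposure is non-zero; a policy with zero exposure would need separate handling before the division.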

### 2.a (6 marks)
Create a new column for the dataset which will indicate the *Frequency Group* and output the updated dataset.

In [31]:
def add_column(name):
    df = pd.read_excel('lab02_dataset_2.xlsx')
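    num_claims = df['num_claims'].to_numpy(dtype=float)
    exposure = df['exposure'].to_numpy(dtype=float)
    # Frequency = claims per unit of exposure; rows with zero exposure keep the default 0.0
    frequency = np.divide(num_claims, exposure, out=np.zeros_like(num_claims), where=exposure != 0)
    # Map each frequency to its group (0-3); anything else falls through to the default group 4
    groups = np.select(
        [frequency == 0, (frequency > 0) & (frequency <= 1), (frequency > 1) & (frequency <= 2), (frequency > 2) & (frequency <= 3)],
        [0, 1, 2, 3],
        default=4
    )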
    
    df[name] = groups
    df.to_excel('lab02_dataset_2_freq.xlsx', index=False)
    print(df.head())
    
if __name__ == "__main__":
    add_column('frequency_group')

   policy  exposure  num_claims  amt_claims f_primary_age_tier  \
0  P00001       1.0           0        0.00            21 - 27   
1  G00002       1.0           0        0.00            38 - 60   
2  A00003       1.0           2     3079.01            38 - 60   
3  P00004       1.0           1      804.87            28 - 37   
4  G00005       1.0           1      638.74            28 - 37   

  f_primary_gender   f_marital f_residence_location f_fire_alarm_type  \
0             Male     Married                Urban     Alarm Service   
1             Male  Un-Married             Suburban               NaN   
2           Female     Married             Suburban        Standalone   
3           Female  Un-Married             Suburban        Standalone   
4           Female  Un-Married             Suburban     Alarm Service   

  f_mile_fire_station   f_aoi_tier  frequency_group  
0            < 1 mile  351K - 600K                0  
1         1 - 5 miles       < 100K                0  
2 

### 2.b (6 marks)
There are seven categorical features in the dataset, namely *f_aoi_tier, f_primary_age_tier, f_fire_alarm_type, f_marital, f_mile_fire_station, f_primary_gender, f_residence_location*. Display all the unique values of these seven features.

In [21]:
CATEGORICAL = [
    'f_aoi_tier',
    'f_primary_age_tier',
    'f_fire_alarm_type',
    'f_marital',
    'f_mile_fire_station',
    'f_primary_gender',
    'f_residence_location'
]

def display_unique_values(df, column):
    # Print the distinct values (including NaN) found in one column
    unique_values = df[column].unique()
    print(f"Unique values in column '{column}': {unique_values}")

if __name__ == "__main__":
    # Read the workbook once instead of once per column
    df = pd.read_excel('lab02_dataset_2_freq.xlsx')
    for col in CATEGORICAL:
        display_unique_values(df, col)

Unique values in column 'f_aoi_tier': ['351K - 600K' '< 100K' '100K - 350K' '601K - 1M' '> 1M']
Unique values in column 'f_primary_age_tier': ['21 - 27' '38 - 60' '28 - 37' '> 60' '< 21']
Unique values in column 'f_fire_alarm_type': ['Alarm Service' nan 'Standalone']
Unique values in column 'f_marital': ['Married' 'Un-Married' 'Not Married']
Unique values in column 'f_mile_fire_station': ['< 1 mile' '1 - 5 miles' '> 10 miles' '6 - 10 miles']
Unique values in column 'f_primary_gender': ['Male' 'Female']
Unique values in column 'f_residence_location': ['Urban' 'Suburban' 'Rural']


### 2.c (6 marks)
We will train SVM models using those seven categorical features. Their values are currently categorical, but an SVM requires numerical inputs. Perform `One-hot Encoding` on these features to obtain an updated dataset which has only numerical values.

In [34]:
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(columns):
    df = pd.read_excel('lab02_dataset_2_freq.xlsx')
    dropped = df[['policy', 'frequency_group']]
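    # Fill NaN before casting: astype(str) first would turn NaN into the string 'nan' and make fillna a no-op
    data = df.drop(columns=['policy', 'frequency_group']).fillna('Missing').astype(str)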
    encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
    encoded_data = encoder.fit_transform(data[columns])
    encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(columns))
    df = pd.concat([encoded_df, dropped], axis=1)
    df.to_excel('lab02_dataset_2_encoded.xlsx', index=False)
    
if __name__ == "__main__":
    one_hot_encode(CATEGORICAL)
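
As a small illustration of what dropping the first level does, the sketch below one-hot encodes a single toy column. It swaps in `pd.get_dummies` with `drop_first=True` in place of `OneHotEncoder(drop='first')` purely to keep the example short; the toy frame is made up, and only the column name is borrowed from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({'f_residence_location': ['Urban', 'Suburban', 'Rural', 'Urban']})

# Levels are sorted (Rural, Suburban, Urban) and the first one is dropped,
# so 'Rural' becomes the reference level encoded as all zeros
encoded = pd.get_dummies(toy, columns=['f_residence_location'], drop_first=True, dtype=int)
print(encoded)
```

Dropping one level per feature removes a redundant column, since the dropped level can be inferred when all remaining indicator columns for that feature are zero.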

### 2.d (6 marks)
Divide the observations into training and testing partitions. Observations whose *Policy Identifier* starts with the letter A, G, or P will go to the training partition. The remaining observations go to the testing partition. Output the total number of policies present in the Training partition and Testing partition.

In [28]:
def split_by_policy():
    df = pd.read_excel('lab02_dataset_2_encoded.xlsx')
    policy = df['policy'].astype(str).str.strip().str.upper()
    
    mask = policy.str.startswith(('A', 'G', 'P'))
    
    training_data = df[mask]
    testing_data = df[~mask]
    
    training_data.to_excel('lab02_dataset_2_train.xlsx', index=False)
    testing_data.to_excel('lab02_dataset_2_test.xlsx', index=False)
    
    return len(training_data), len(testing_data)

if __name__ == "__main__":
    train, test = split_by_policy()
    print(f"Training set size: {train}")
    print(f"Testing set size: {test}")


Training set size: 20661
Testing set size: 6852
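
A quick sanity check on the split, sketched on made-up identifiers: the prefix mask and its complement should cover every row exactly once, and the two reported sizes should add up to the 27,513 policies stated in the problem description. The identifiers below are hypothetical.

```python
import pandas as pd

# Made-up identifiers for illustration only
policy = pd.Series(['P00001', 'G00002', 'A00003', 'X00004', ' b00005'])

# Normalize whitespace and case, then test the first letter
normalized = policy.str.strip().str.upper()
mask = normalized.str.startswith(('A', 'G', 'P'))

n_train = int(mask.sum())
n_test = int((~mask).sum())
print(n_train, n_test)

# The partition sizes reported for the real dataset should sum to the policy count
reported_total = 20661 + 6852
print(reported_total)
```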


### 2.e (6 marks)
Train an SVM model using [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). The input features will be the encoded version of the feature set *f_aoi_tier, f_primary_age_tier, f_fire_alarm_type, f_marital, f_mile_fire_station, f_primary_gender, f_residence_location* and the output is the *Frequency Group*. Use `verbose=1` to observe the optimization steps during the training process.

In [None]:
from sklearn.svm import LinearSVC

def _ohe_cols(df, cols):
    return [c for c in df.columns if any(c.startswith(col + '_') for col in cols)]

def train_svm():
    train_data = pd.read_excel('lab02_dataset_2_train.xlsx')
    test_data = pd.read_excel('lab02_dataset_2_test.xlsx')

    y_train = train_data['frequency_group'].astype(int).to_numpy()

    feat_cols_train = _ohe_cols(train_data, CATEGORICAL)
    feat_cols_test  = _ohe_cols(test_data,  CATEGORICAL)

    all_cols = sorted(set(feat_cols_train) | set(feat_cols_test))
    x_train = train_data.reindex(columns=all_cols, fill_value=0).to_numpy(dtype=float)
    
    clf = LinearSVC(C=1.0, dual='auto', max_iter=20000, verbose=1, random_state=0)
    clf.fit(x_train, y_train)
    
    return clf

### 2.f (5 marks)
Compute and output the accuracy score on the testing partition.

In [55]:
from sklearn.metrics import accuracy_score, classification_report

if __name__ == "__main__":
    svm_model = train_svm()
    
    train_data = pd.read_excel('lab02_dataset_2_train.xlsx')
    test_data = pd.read_excel('lab02_dataset_2_test.xlsx')

    feat_cols_train = _ohe_cols(train_data, CATEGORICAL)
    feat_cols_test  = _ohe_cols(test_data,  CATEGORICAL)

    all_cols = sorted(set(feat_cols_train) | set(feat_cols_test))
    x_train = train_data.reindex(columns=all_cols, fill_value=0).to_numpy(dtype=float)
    x_test  = test_data.reindex(columns=all_cols,  fill_value=0).to_numpy(dtype=float)

    train_predict = svm_model.predict(x_train)
    test_predict = svm_model.predict(x_test)
    
    print(f"Train accuracy: {accuracy_score(train_data['frequency_group'].astype(int).to_numpy(), train_predict)}")
    print(f"Test accuracy: {accuracy_score(test_data['frequency_group'].astype(int).to_numpy(), test_predict)}")
    
    print(classification_report(test_data['frequency_group'], test_predict, labels=[0,1,2,3,4], zero_division=0))

[LibLinear]Train accuracy: 0.49169933691496054
Test accuracy: 0.4918272037361354
              precision    recall  f1-score   support

           0       0.60      0.82      0.70      3858
           1       0.28      0.05      0.09      1750
           2       0.13      0.06      0.08       779
           3       0.05      0.08      0.06       214
           4       0.07      0.14      0.09       251

    accuracy                           0.49      6852
   macro avg       0.23      0.23      0.20      6852
weighted avg       0.43      0.49      0.43      6852
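
One way to put the test accuracy in context is to compare it with a majority-class baseline. The sketch below only uses the support counts printed in the classification report above; it does not retrain anything and makes no claim about why the model scores as it does.

```python
# Support counts per group, copied from the classification report above
support = {0: 3858, 1: 1750, 2: 779, 3: 214, 4: 251}

total = sum(support.values())
majority_group = max(support, key=support.get)

# Accuracy of a classifier that always predicts the most frequent group
baseline_accuracy = support[majority_group] / total
print(total, majority_group, round(baseline_accuracy, 3))
```

Always predicting Group 0 would score roughly 56% on the testing partition, which is higher than the reported test accuracy of about 49%, so the class imbalance is worth keeping in mind when reading these numbers.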

