## Course Code: DS4003
## Course Name: Principles and Techniques for Data Science
## Lab Session: 13-14: SVM, Ensemble Methods

# Today's Topics
In this lab, we will focus on the basic Python libraries and functions used to implement the methods covered in the lectures:
* SVMs
* Random Forest Classifier 
* Adaptive Boosting 
* Gradient Boosting 
* XGBoost

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from tqdm.auto import tqdm 
import random 
import warnings 
warnings.simplefilter('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

## Now, we will use the Iris dataset for the following tasks


In [2]:
df = pd.read_csv('Iris.csv')
print(pd.unique(df.Species))

['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']


In [3]:
label_dict = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
df2 = df.replace({'Species': label_dict})

In [4]:
df2.head()

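|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1  | 5.1           | 3.5          | 1.4           | 0.2          | 0       |
| 1 | 2  | 4.9           | 3.0          | 1.4           | 0.2          | 0       |
| 2 | 3  | 4.7           | 3.2          | 1.3           | 0.2          | 0       |
| 3 | 4  | 4.6           | 3.1          | 1.5           | 0.2          | 0       |
| 4 | 5  | 5.0           | 3.6          | 1.4           | 0.2          | 0       |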


In [5]:
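# your code here
# Binary (setosa vs. rest) labels; df2 above still uses the 0/1/2 encoding
label_dict = {'Iris-setosa': 1, 'Iris-versicolor': -1, 'Iris-virginica': -1}
# Names of the four feature columns used as model inputs
feat_list = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']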


# SVMs
A Support Vector Machine (SVM) is a classic machine learning classifier that attempts to separate classes of data using a hyperplane, chosen to maximize the margin between the classes.

Different kernels (for example linear, polynomial, or RBF) let the model learn differently shaped decision boundaries, which helps when the classes are not linearly separable.

Reading material: 
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html 
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html 
* https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html

## Task: Run a classic SVC on the iris data

To access the SVM code from sklearn, we need to import it as follows:

In [6]:
from sklearn.svm import SVC

What arguments can we supply to this model, and what are their defaults?

In [7]:
svc_model = SVC()
params = svc_model.get_params()

print(params)


{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}


These arguments are explained in the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html.

In [8]:
train, test = train_test_split(df2, test_size=0.25)
scaler = StandardScaler() 
train_X = train[feat_list].copy() 
test_X = test[feat_list].copy() 
train_X = scaler.fit_transform(train_X) 
test_X = scaler.transform(test_X)
train_y = train['Species'].values 
test_y = test['Species'].values

In [9]:

clf = SVC(gamma='auto')
clf.fit(train_X, train_y)

SVC(gamma='auto')

In [10]:
pred = clf.predict(test_X)
acc = accuracy_score(y_true=test_y, y_pred=pred)
print(acc)

0.9736842105263158


## Task: Run a Linear SVC on the iris data

In [11]:
# Your code here
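
One possible solution is sketched below. It is self-contained, so it makes a few assumptions that are not part of the lab: it loads the Iris copy bundled with scikit-learn (`load_iris`) instead of `Iris.csv`, fixes `random_state=42` and stratifies the split for reproducibility, and raises `max_iter` so the solver has room to converge. The hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# LinearSVC learns one linear boundary per class (one-vs-rest).
linear_clf = LinearSVC(C=1.0, max_iter=10000, random_state=42)
linear_clf.fit(X_train_s, y_train)

linear_acc = accuracy_score(y_test, linear_clf.predict(X_test_s))
print(f"LinearSVC accuracy: {linear_acc:.3f}")
```

Unlike `SVC(kernel='linear')`, `LinearSVC` has no `kernel` argument, so it can only learn linear boundaries.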


## Task: Run a non-linear SVC on the iris data
Additionally, vary the kernel function to see whether it has any effect on the performance.

In [12]:
# Your code here
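
A minimal sketch of such a comparison follows, assuming the bundled `load_iris` data in place of `Iris.csv` and a fixed `random_state=42` split (both are choices made here for a self-contained example). It loops over the four built-in `SVC` kernels with otherwise default settings.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train one SVC per kernel and record its test accuracy.
kernel_acc = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X_train_s, y_train)
    kernel_acc[kernel] = accuracy_score(y_test, clf.predict(X_test_s))

for kernel, acc in kernel_acc.items():
    print(f"{kernel:>8}: {acc:.3f}")
```

On a dataset this small, the differences between kernels can come down to one or two test samples, so repeating the run with other splits gives a fairer picture.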


# Decision Trees

Decision Trees are a widely used machine learning classifier that separates classes by learning a hierarchy of if-then rules on the individual features. One advantage of decision trees is that they do not require feature scaling: each split compares a single feature against a threshold, so rescaling a feature does not change how the data are partitioned.

Further reading: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

## Task: Apply Decision Tree Classifier on Iris data 

## Compare results for Decision Tree Classifiers using the gini, log loss, and entropy criteria



In [13]:
from sklearn.tree import DecisionTreeClassifier
print(f"This has the default parameters:\n {DecisionTreeClassifier().get_params()}")

This has the default parameters:
 {'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'random_state': None, 'splitter': 'best'}


In [14]:
# Your code here


# Ensemble Methods

Ensemble methods combine multiple models to improve performance, making predictions more accurate and robust. They work by leveraging the strengths of different models, reducing overfitting, and enhancing generalization.
The three main families are bagging, boosting, and stacking.


## Task: Apply RF on the Iris data 
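Random Forest (RF) is one of the most popular bagging methods. It trains many decision trees on different bootstrap samples of the data (also considering a random subset of features at each split), then averages or votes on their outputs to produce the final prediction.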

In [15]:
# Your code here 
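
A minimal Random Forest sketch follows. It assumes the bundled `load_iris` data instead of `Iris.csv`, and `n_estimators=100` and `random_state=42` are illustrative choices rather than tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)

# Each of the 100 trees is fit on a bootstrap sample of the training data.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"Random Forest accuracy: {rf_acc:.3f}")

# Impurity-based importances, normalized to sum to 1 across features.
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name:>20}: {importance:.3f}")
```

The `feature_importances_` printout is an optional extra that shows which measurements the forest relied on most.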


### Boosting: Builds models sequentially, with each new model correcting errors made by previous ones. Examples include AdaBoost and Gradient Boosting.

## Task: Apply AdaBoost on the Iris data

In [16]:
# Your code here 
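
An AdaBoost run could look like the sketch below. The assumptions are the bundled `load_iris` data instead of `Iris.csv` and illustrative settings (`n_estimators=50`, `learning_rate=1.0`, `random_state=42`). The base learner is left at scikit-learn's default, a depth-1 decision tree.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Weak learners are added one at a time; misclassified samples get
# larger weights so later learners focus on them.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X_train, y_train)

ada_acc = accuracy_score(y_test, ada.predict(X_test))
print(f"AdaBoost accuracy: {ada_acc:.3f}")
print(f"Weak learners actually fitted: {len(ada.estimators_)}")
```

Boosting can stop early, so the number of fitted learners may be smaller than `n_estimators`.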

## Task: Apply Gradient Boosting on Iris data 

Reading material: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

In [17]:
# Your code here 
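
A hedged sketch with `GradientBoostingClassifier` follows, assuming the bundled `load_iris` data instead of `Iris.csv`. The hyperparameters shown are the library defaults written out explicitly, plus a fixed `random_state=42`. `staged_predict` is used to show how test accuracy evolves as trees are added.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Each boosting stage fits small regression trees to the gradient of the loss.
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb.fit(X_train, y_train)

gb_acc = accuracy_score(y_test, gb.predict(X_test))
print(f"Gradient Boosting accuracy: {gb_acc:.3f}")

# Test accuracy after each boosting stage (one entry per stage).
staged_acc = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]
print(f"Accuracy after 1, 10, 100 stages: "
      f"{staged_acc[0]:.3f}, {staged_acc[9]:.3f}, {staged_acc[-1]:.3f}")
```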


## Task: Apply XGBoost classifier on the Iris data 

Reading material: https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier

The XGBoost module is not included in sklearn, so we have to install it separately:
- pip install xgboost

In [18]:
# Your code here 
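
The sketch below shows one way to use `XGBClassifier` through its scikit-learn-style interface. It assumes the bundled `load_iris` data instead of `Iris.csv`, uses illustrative hyperparameters, and guards the import so the cell still runs (and prints a hint) when `xgboost` is not installed.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBClassifier
except ImportError:  # xgboost is a separate package: pip install xgboost
    XGBClassifier = None

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
# Its labels are already the integers 0, 1, 2, which XGBClassifier expects.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

xgb_acc = None
if XGBClassifier is not None:
    xgb_clf = XGBClassifier(
        n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
    )
    xgb_clf.fit(X_train, y_train)
    xgb_acc = accuracy_score(y_test, xgb_clf.predict(X_test))
    print(f"XGBoost accuracy: {xgb_acc:.3f}")
else:
    print("xgboost is not installed; run `pip install xgboost` first.")
```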


### Stacking: Combines different models’ predictions using a meta-learner, enhancing accuracy by capturing various data patterns.


## Task: Build your own stacked model using three different classifiers
You can use the StackingClassifier module in sklearn: https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.StackingClassifier.html

In [19]:
from sklearn.ensemble import StackingClassifier
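
One possible stacked model is sketched below. The choice of base classifiers (a scaled SVC, a Random Forest, and a Decision Tree), the logistic-regression meta-learner, and `cv=5` are all assumptions made for illustration, as is loading the data with `load_iris` instead of `Iris.csv`. The base models are also scored on their own so the accuracies can be compared.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Three different base classifiers; the SVC gets its own scaler.
base_models = [
    ("svc", make_pipeline(StandardScaler(), SVC(random_state=42))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
]

# The meta-learner is trained on cross-validated base-model predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_acc = accuracy_score(y_test, stack.predict(X_test))

# Score each base model alone (refit on the full training split).
base_acc = {}
for name, model in base_models:
    model.fit(X_train, y_train)
    base_acc[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in base_acc.items():
    print(f"{name:>8}: {acc:.3f}")
print(f" stacked: {stack_acc:.3f}")
```

On a small, easy dataset like Iris, the stacked model may not beat its best base model.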

## Task: Is there a difference in accuracy?