# **Decision Tree Algorithm**

A **decision tree** is a supervised machine learning algorithm used for **classification** and **regression tasks**. It splits the data into subsets based on the value of input features, creating a tree-like structure of decisions.

### Key Components:
- **Nodes**: Represent decisions or tests on a feature.
- **Root Node**: The top node, where the first decision is made.
- **Leaf Nodes**: The final outcome, representing **class labels** (in classification) or **values** (in regression).
- **Branches**: The paths connecting nodes, representing outcomes of decisions.
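
To make these components concrete, here is a minimal sketch of how a tree could be represented and traversed in plain Python. The `Node` and `Leaf` classes, the `predict` helper, the feature names, and the thresholds are all hypothetical and chosen only for illustration; scikit-learn stores its trees differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Leaf node: holds the final outcome (a class label here)."""
    prediction: str


@dataclass
class Node:
    """Internal node: a test on one feature with two outgoing branches."""
    feature: str
    threshold: float
    left: Union["Node", Leaf]   # branch followed when value <= threshold
    right: Union["Node", Leaf]  # branch followed when value > threshold


def predict(tree, sample):
    """Walk from the root down the branches until a leaf is reached."""
    while isinstance(tree, Node):
        if sample[tree.feature] <= tree.threshold:
            tree = tree.left
        else:
            tree = tree.right
    return tree.prediction


# Hypothetical tree: the root tests x1, one child tests x2
root = Node(
    feature="x1",
    threshold=5.0,
    left=Node(feature="x2", threshold=1.0,
              left=Leaf("class_A"), right=Leaf("class_B")),
    right=Leaf("class_C"),
)

print(predict(root, {"x1": 3.0, "x2": 2.5}))  # x1 <= 5.0, then x2 > 1.0
print(predict(root, {"x1": 9.0, "x2": 0.0}))  # x1 > 5.0
```

The first sample follows the left branch at the root and the right branch at the second node; the second sample reaches a leaf directly from the root.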

The algorithm works by selecting the best feature to split the data at each node using criteria like **Gini Impurity** (for classification) or **Mean Squared Error** (for regression). This process repeats recursively until a stopping condition is met, such as reaching a **maximum tree depth** or having **pure nodes**.
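
As a rough illustration of the splitting criterion, the sketch below computes Gini impurity (one minus the sum of squared class proportions) and searches a single numeric feature for the threshold with the lowest weighted impurity. The function names and the toy data are assumptions made for this example; real implementations repeat this search over every feature and are far more optimized.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, impurity)."""
    distinct = sorted(set(values))
    best = (None, gini_impurity(labels))  # baseline: no split at all
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (t, score)
    return best


# Toy data (hypothetical): one numeric feature, two classes
toy_values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
toy_labels = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_threshold(toy_values, toy_labels)
print(threshold, impurity)  # 6.5 0.0
```

Here the split at 6.5 separates the two classes completely, so both child nodes are pure and the recursion would stop.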

![DT](https://images.spiceworks.com/wp-content/uploads/2022/05/04131724/How-does-a-decision-tree-work.png)

### Advantages:
- **Interpretability**: Decision trees are easy to interpret and visualize.
- Can handle both **numerical** and **categorical data**.

However, decision trees are prone to **overfitting**, especially when the tree is deep. They may also be **unstable**, with small changes in the data leading to drastically different trees. **Pruning** (removing unnecessary branches) helps mitigate overfitting.

### Additional Details:

- **Building the Tree**: The tree is built by iteratively splitting the data at each node based on a feature that best separates the target variable. The best feature is chosen using metrics like **Gini Impurity**, **Information Gain** (for classification), or **Mean Squared Error** (for regression).
  
- **Overfitting and Pruning**: Decision trees can become overly complex and perform well on training data but poorly on unseen data (overfitting). To address this, **pruning** removes branches that don't significantly improve accuracy. **Cost complexity pruning**, for example, is a **post-pruning** technique applied after the tree has been fully grown.

- **Advantages Over Other Models**: Decision trees are **non-parametric**, meaning they do not assume a specific form for the data distribution. They are also easy to interpret and visualize, which is beneficial for explaining the decision-making process.

- **Disadvantages**: Despite their interpretability, decision trees are sensitive to **noisy data** and can easily become too complex. Split criteria such as information gain can also be biased towards features with many distinct values, and performance can degrade as the number of features grows. **Ensemble methods** like **Random Forests** and **Gradient Boosting Trees** mitigate these issues by combining multiple trees to improve generalization.
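
The effect of cost complexity pruning can be sketched with scikit-learn's `ccp_alpha` parameter. The example below is only an illustration: it uses the Iris dataset bundled with scikit-learn as a stand-in (an assumption, not the dataset used later in this notebook), and the variable names are made up for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): Iris, not the dataset used in this notebook
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Grow an unpruned tree and ask it for the candidate pruning strengths
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Refit one tree per candidate alpha; larger alpha means heavier pruning
results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative rounding errors
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    results.append((alpha, pruned.get_n_leaves(), pruned.score(X_te, y_te)))

for alpha, n_leaves, test_acc in results:
    print(f"alpha={alpha:.4f}  leaves={n_leaves}  test accuracy={test_acc:.3f}")
```

In practice the value of `ccp_alpha` would usually be chosen with cross-validation rather than by looking at the test set, which is used here only to keep the sketch short.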

### Next Steps in the Decision Tree Process

After understanding the key concepts, advantages, and challenges associated with decision trees, we can proceed with the **practical implementation**. The next steps involve:

- **Preparing the dataset** (e.g., cleaning, encoding)
- **Configuring the model** (e.g., setting hyperparameters)
- **Training** the decision tree
- **Evaluating** its performance

These steps are critical to ensure the decision tree is effectively built and performs well on unseen data. We will begin by obtaining and pre-processing the data, followed by model training, prediction, and evaluation.


In [24]:
#Importing the libraries
import pandas as pd
import numpy as np

In [25]:
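# Installing seaborn with piplite, which is only available in JupyterLite (Pyodide) environments.
# Skip this step if seaborn is already installed, or install it with pip in a regular Python environment.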
import piplite
await piplite.install("seaborn")

# Importing the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Displaying plots inline in Jupyter/Colab notebooks
%matplotlib inline

In [26]:
# Suppressing warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

### About the Dataset

This dataset contains information about a set of patients who suffered from the same illness. During their treatment, each patient was prescribed one of five medications: **Drug A**, **Drug B**, **Drug C**, **Drug X**, or **Drug Y**. The goal is to build a model to predict which drug a future patient might respond to based on various features.

The features in the dataset are:
- **Age**: The age of the patient.
- **Sex**: The gender of the patient (male or female).
- **Blood Pressure (BP)**: The patient's blood pressure level.
- **Cholesterol**: The cholesterol level of the patient.
- **Na_to_K**: The sodium-to-potassium ratio in the patient's body.

The target variable is the **Drug**, which represents the medication that each patient responded to. This is a **multiclass classification** problem, where the objective is to classify the drug that a new or unknown patient would be prescribed, based on the features.

The dataset is useful for building a decision tree model that can predict the appropriate drug for a new patient based on their characteristics.


Now, read the data into a pandas DataFrame:


In [27]:
# Reading the dataset
df = pd.read_csv("drug200.csv", delimiter=",")

In [28]:
# The following command displays the first 10 rows of the DataFrame 'df'
# It is useful to quickly inspect the data and check its structure, columns, and values.

df.head(10) # Showing the first 10 rows of the DataFrame 'df'

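|   | Age | Sex | BP | Cholesterol | Na_to_K | Drug |
|---|-----|-----|----|-------------|---------|------|
| 0 | 23 | F | HIGH | HIGH | 25.355 | drugY |
| 1 | 47 | M | LOW | HIGH | 13.093 | drugC |
| 2 | 47 | M | LOW | HIGH | 10.114 | drugC |
| 3 | 28 | F | NORMAL | HIGH | 7.798 | drugX |
| 4 | 61 | F | LOW | HIGH | 18.043 | drugY |
| 5 | 22 | F | NORMAL | HIGH | 8.607 | drugX |
| 6 | 49 | F | NORMAL | HIGH | 16.275 | drugY |
| 7 | 41 | M | LOW | HIGH | 11.037 | drugC |
| 8 | 60 | M | NORMAL | HIGH | 15.171 | drugY |
| 9 | 43 | M | LOW | NORMAL | 19.368 | drugY |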


In [29]:
# The following command returns the number of rows and columns in the DataFrame 'df'.
# It is useful to quickly check the size of your dataset.

df.shape  # Returns a tuple (number_of_rows, number_of_columns)

(200, 6)

In [30]:
# The following command generates descriptive statistics for all the numerical columns in the DataFrame 'df'
# It helps you quickly summarize the distribution and central tendency of the data.

df.describe()  # Returns summary statistics for numerical columns in the DataFrame

|       | Age | Na_to_K |
|-------|-----|---------|
| count | 200.0 | 200.0 |
| mean | 44.315 | 16.084485 |
| std | 16.544315 | 7.223956 |
| min | 15.0 | 6.269 |
| 25% | 31.0 | 10.4455 |
| 50% | 45.0 | 13.9365 |
| 75% | 58.0 | 19.38 |
| max | 74.0 | 38.247 |


In [31]:
# The following command provides a summary of the DataFrame 'df'.
# It includes the column names, data types, and the number of non-null entries for each column.

df.info()  # Returns a summary of the DataFrame, including column data types and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Age          200 non-null    int64  
 1   Sex          200 non-null    object 
 2   BP           200 non-null    object 
 3   Cholesterol  200 non-null    object 
 4   Na_to_K      200 non-null    float64
 5   Drug         200 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.3+ KB


### Pre-processing

Using **df** as the drug200.csv data read by pandas, declare the following variables:

- **X** as the Feature Matrix (data of **df**)
- **y** as the response vector (target)

Exclude the target column (**Drug**) from the feature matrix, since it is the value the model has to predict.

In [41]:
# Declaring the feature matrix (X) and response vector (y)
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values  # Selecting relevant features
y = df['Drug']  # The target variable 'Drug'

In [47]:
# X contains the selected features 
X[0:5]

array([[23, 'F', 'HIGH', 'HIGH', 25.355],
       [47, 'M', 'LOW', 'HIGH', 13.093],
       [47, 'M', 'LOW', 'HIGH', 10.114],
       [28, 'F', 'NORMAL', 'HIGH', 7.798],
       [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)

In [None]:
# Importing preprocessing from sklearn
from sklearn import preprocessing

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the class labels.

# Encoding the 'Sex' feature (F = 0, M = 1)
le_sex = preprocessing.LabelEncoder()
le_sex.fit(['F', 'M'])
X[:, 1] = le_sex.transform(X[:, 1])  # Transforming the 'Sex' column in the feature matrix

# Encoding the 'Blood Pressure' feature (HIGH = 0, LOW = 1, NORMAL = 2)
le_BP = preprocessing.LabelEncoder()
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])
X[:, 2] = le_BP.transform(X[:, 2])  # Transforming the 'BP' column in the feature matrix

# Encoding the 'Cholesterol' feature (HIGH = 0, NORMAL = 1)
le_Chol = preprocessing.LabelEncoder()
le_Chol.fit(['NORMAL', 'HIGH'])
X[:, 3] = le_Chol.transform(X[:, 3])  # Transforming the 'Cholesterol' column in the feature matrix

# Displaying the first 5 rows of the feature matrix after encoding
X[0:5]

Now we can inspect the target variable.

In [44]:
#y contains the target (drug class)
y[0:5]

0    drugY
1    drugC
2    drugC
3    drugX
4    drugY
Name: Drug, dtype: object

As you may have noticed, some features in this dataset are categorical, such as **Sex** or **BP**. Unfortunately, scikit-learn decision trees do not handle categorical variables directly, which is why these features were converted to integer codes with **LabelEncoder** above. An alternative is **pandas.get_dummies()**, which converts a categorical variable into dummy/indicator variables.
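
The two encoding routes can be compared on a small sample. The sketch below retypes the first five rows of the dataset by hand, so it runs without the CSV file; the names `sample`, `bp_encoder`, `bp_mapping`, and `encoded` are introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# First five rows of the dataset shown earlier, retyped by hand for illustration
sample = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na_to_K": [25.355, 13.093, 10.114, 7.798, 18.043],
})

# Route 1: LabelEncoder gives one integer per category, in sorted label order
bp_encoder = LabelEncoder().fit(["LOW", "NORMAL", "HIGH"])
bp_mapping = {str(label): code for code, label in enumerate(bp_encoder.classes_)}
print(bp_mapping)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}

# Route 2: get_dummies gives one indicator column per category present in the data
encoded = pd.get_dummies(sample, columns=["Sex", "BP", "Cholesterol"])
print(encoded.columns.tolist())
```

Note that the integer codes follow alphabetical order rather than any medical ordering of the levels, and that `get_dummies` only creates columns for categories that actually occur in the data it is given (this sample contains only one cholesterol level).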

### Setting up the Decision Tree

To build and evaluate our decision tree model, we need to split the dataset into a **training set** and a **testing set**. This allows us to train the model on one portion of the data and evaluate its performance on another, unseen portion. We will use the **train_test_split** function from the **sklearn.model_selection** library to perform this split.

A common practice is to allocate **70%** of the data to the training set and **30%** to the testing set. This leaves the model a reasonable amount of data to learn from while keeping enough unseen data for a fair evaluation.

In [2]:
# Importing train_test_split from sklearn
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# Print the shape of the training and test sets
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
