# Introduction to Machine Learning with Decision Trees, Random Forests, and XGBoost

Lesson Overview:
In this lesson, we will cover the basics of machine learning using three popular algorithms: Decision Trees, Random Forests, and XGBoost. We'll start by exploring how to process data with both categorical and continuous features, followed by a step-by-step implementation of each algorithm.

In [4]:
!pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


## Data Processing

In this section, we load the Adult dataset, separate the features from the target variable, split the data into training and test sets, and one-hot encode the categorical features. These steps prepare the data for training our machine learning models.


In [6]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the Adult dataset
from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
X = adult.data.features
y = adult.data.targets

# variable information
print(adult.variables)

              name     role         type      demographic  \
0              age  Feature      Integer              Age   
1        workclass  Feature  Categorical           Income   
2           fnlwgt  Feature      Integer             None   
3        education  Feature  Categorical  Education Level   
4    education-num  Feature      Integer  Education Level   
5   marital-status  Feature  Categorical            Other   
6       occupation  Feature  Categorical            Other   
7     relationship  Feature  Categorical            Other   
8             race  Feature  Categorical             Race   
9              sex  Feature       Binary              Sex   
10    capital-gain  Feature      Integer             None   
11    capital-loss  Feature      Integer             None   
12  hours-per-week  Feature      Integer             None   
13  native-country  Feature  Categorical            Other   
14          income   Target       Binary           Income   

                       

### Split into Training and Test Sets

In [21]:
X_train_adult, X_test_adult, y_train_adult, y_test_adult = train_test_split(X, y, test_size=0.2, random_state=42)

### Encode Categorical Features

We one-hot encode the training and test sets separately with `pd.get_dummies`, then use the `align` method so that both DataFrames end up with the same columns. Any column present in one set but not in the other is added to the set that lacks it and filled with zeros.

In [22]:
# One-hot encode categorical features with column alignment
X_train_adult = pd.get_dummies(X_train_adult)
X_test_adult = pd.get_dummies(X_test_adult)

# Align columns to ensure consistency between training and testing sets
X_train_adult, X_test_adult = X_train_adult.align(X_test_adult, join='outer', axis=1, fill_value=0)
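To see what the alignment step does, here is a minimal sketch on two made-up frames. The names `train_toy` and `test_toy` and their values are hypothetical and are not taken from the Adult dataset; only `pd.get_dummies` and `DataFrame.align` are the calls used in the cell above.

```python
import pandas as pd

# Hypothetical toy data: the test set contains a category the training set lacks
train_toy = pd.DataFrame({"age": [25, 40], "workclass": ["Private", "State-gov"]})
test_toy = pd.DataFrame({"age": [31], "workclass": ["Self-emp"]})

# Encoding each set separately produces different dummy columns
train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# An outer join on the column axis gives both frames the union of the columns;
# columns that were missing from a frame are filled with 0
train_enc, test_enc = train_enc.align(test_enc, join="outer", axis=1, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames carry the same column list, so a model trained on one can score the other.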

## Decision Trees

A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It models decisions as a tree-like structure in which the data is progressively split into subsets based on feature values. Each internal node represents a test on a particular feature, and each leaf node represents an outcome or prediction.

The algorithm builds and uses a tree as follows:

1. **Root Node**: The tree starts with a root node that includes the entire dataset.
2. **Feature Selection**: The algorithm evaluates different features in the dataset to determine the best feature to split the data. The goal is to find the feature that provides the best separation of the data into distinct classes or values.
3. **Splitting**: The dataset is split into subsets based on the chosen feature. Each branch represents a possible outcome or decision based on the feature's value.
4. **Recursive Process**: The splitting process is then applied recursively to each subset. At each internal node, a decision is made based on a specific feature, and the dataset is divided into subsets accordingly.
5. **Stopping Criteria**: The recursive process continues until a stopping criterion is met. This could be a predefined maximum depth of the tree, a minimum number of samples in a leaf node, or another condition. These criteria help prevent overfitting.
6. **Leaf Nodes**: The final nodes of the tree are called leaf nodes, and they represent the output or decision. For classification tasks, each leaf corresponds to a specific class, while for regression tasks, the leaf nodes contain the predicted values.
7. **Prediction**: To make predictions for new data, you traverse the tree from the root to a leaf node based on the values of the input features. The predicted output is then based on the majority class (for classification) or the average value (for regression) of the samples in the leaf node.
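Steps 2 and 3 above can be sketched in a few lines of plain Python. The sketch assumes Gini impurity as the quality measure (one common choice among several) and searches a single numeric feature for the best threshold. The helper names `gini` and `best_threshold` and the toy humidity readings are hypothetical. Unlike library implementations, it only tries observed values as thresholds.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:  # skip the max so the right side is non-empty
        left = [label for value, label in zip(values, labels) if value <= threshold]
        right = [label for value, label in zip(values, labels) if value > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: humidity readings and whether golf was played
humidity = [65, 70, 75, 80, 90, 95]
play = ["Yes", "Yes", "Yes", "No", "No", "No"]

threshold, impurity = best_threshold(humidity, play)
print(threshold, impurity)  # prints: 75 0.0
```

A full tree builder would run this search over every feature, keep the best one, and recurse on the two resulting subsets.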


Decision trees are popular because they are easy to understand and interpret. However, they are prone to overfitting, and various techniques such as pruning and setting stopping criteria are used to mitigate this issue. Additionally, ensemble methods like Random Forests and Gradient Boosted Trees are often employed to improve the predictive performance of decision trees.
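As a rough illustration of stopping criteria in scikit-learn, the sketch below compares an unrestricted tree with one limited by `max_depth` and `min_samples_leaf`. The synthetic data from `make_classification` and the specific limits (3 and 10) are arbitrary assumptions chosen for the example, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the sizes are arbitrary
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# No stopping criteria: the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Stopping criteria: at most 3 levels of splits, at least 10 samples per leaf
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_toy, y_toy)

print("full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("small tree: depth", small_tree.get_depth(), "leaves", small_tree.get_n_leaves())
```

scikit-learn also offers cost-complexity pruning through the `ccp_alpha` parameter, which removes branches after the tree has been grown.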


### Example
Let's consider a simple example of a decision tree for a binary classification problem, where the goal is to determine whether a person will play golf based on weather conditions. The features are "*Outlook*," "*Temperature*," "*Humidity*," and "*Wind*."

Decision Tree for Play Golf:

                         Outlook
                    /       |       \
               Sunny    Overcast    Rainy
                 |          |         |
          Humidity <= 75   Yes        No
             /       \
           Yes        No



Here's a step-by-step breakdown:

1. **Root Node**: The root node considers the feature "Outlook."
2. **Splitting at Outlook**: Three branches are created based on different outlook conditions: Sunny, Overcast, and Rainy.
3. **Further Splitting**: For the "Sunny" branch, the decision is based on the "Humidity" feature. If humidity is less than or equal to 75, the decision is "Yes" (play golf), otherwise "No." For the "Overcast" branch, the decision is "Yes" directly without further splitting. For the "Rainy" branch, no additional splitting is done, and the decision is "No."
4. **Leaf Nodes**: The leaf nodes represent the final decisions. In this case, "Yes" means play golf, and "No" means do not play golf.


This simple decision tree provides a set of rules for deciding whether a person will play golf under the given weather conditions. Note that this is a basic example; in a real-world scenario, decision trees can be far more complex, with additional features and nodes. **It is also important to guard against overfitting, for example by pruning the tree.**
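The rules of this example tree can be written directly as nested conditions, which is all a trained decision tree does at prediction time. The function name `play_golf` is hypothetical; the branches follow the breakdown above.

```python
def play_golf(outlook, humidity):
    """Traverse the example tree from the root (Outlook) down to a leaf."""
    if outlook == "Sunny":
        # Only the Sunny branch is split further, on Humidity
        return "Yes" if humidity <= 75 else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rainy":
        return "No"
    raise ValueError(f"Unknown outlook: {outlook!r}")


print(play_golf("Sunny", 70))     # Yes
print(play_golf("Sunny", 80))     # No
print(play_golf("Overcast", 90))  # Yes
print(play_golf("Rainy", 50))     # No
```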

In [24]:
# Import DecisionTreeClassifier from scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Create a decision tree model
dt_model = DecisionTreeClassifier(random_state=42)

# Train the model
dt_model.fit(X_train_adult, y_train_adult)

# Make predictions on the test set
dt_predictions = dt_model.predict(X_test_adult)

# Evaluate the accuracy of the model
dt_accuracy = accuracy_score(y_test_adult, dt_predictions)
print(f'Test accuracy {dt_accuracy}')

Test accuracy 0.4677039615109018
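An accuracy below 50% on a two-class target suggests a problem with the data rather than with the model. One possible cause, offered here as an assumption because the output above does not show the label values, is that the `income` column mixes spellings with and without a trailing period (for example `<=50K` and `<=50K.`), so the classifier sees four classes instead of two. Running `y['income'].unique()` would confirm or rule this out. The sketch below uses a small made-up frame, `y_toy`, to show one way such labels could be normalized before the train/test split.

```python
import pandas as pd

# Hypothetical stand-in for the target frame; the spellings are an assumption
y_toy = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K.", ">50K.", "<=50K"]})
print(y_toy["income"].nunique())  # number of distinct labels before cleaning

# Strip surrounding whitespace and any trailing period
y_clean = y_toy.copy()
y_clean["income"] = y_clean["income"].str.strip().str.rstrip(".")

print(sorted(y_clean["income"].unique()))
```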


## Random Forests

In [26]:
# Import RandomForestClassifier from scikit-learn
from sklearn.ensemble import RandomForestClassifier

# Create a random forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train_adult, y_train_adult.values.ravel())  # Use .values.ravel() to convert to 1-dimensional array

# Make predictions on the test set
rf_predictions = rf_model.predict(X_test_adult)

# Evaluate the accuracy of the model
rf_accuracy = accuracy_score(y_test_adult, rf_predictions)
print(f'Test accuracy {rf_accuracy}')

Test accuracy 0.5371071757600573
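A random forest trains many decision trees, each on a bootstrap sample of the training rows, and combines their predictions. The sketch below shows just those two ingredients in plain Python, with hypothetical helper names `bootstrap_sample` and `majority_vote`. It uses hard majority voting for simplicity, whereas scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the label predicted by the most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
row_ids = list(range(10))

# Each tree would be trained on its own resampled view of the data
sample = bootstrap_sample(row_ids, rng)
print(sample)

# Hypothetical predictions from three trees for a single test row
tree_votes = [">50K", "<=50K", "<=50K"]
print(majority_vote(tree_votes))  # <=50K
```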


## XGBoost

In [32]:
# Install xgboost if not already installed
# !pip install xgboost

# Import XGBClassifier from xgboost
from xgboost import XGBClassifier

# Import accuracy_score for evaluation
from sklearn.metrics import accuracy_score

# Create an XGBoost model
xgb_model = XGBClassifier(random_state=42)

# Encode the target variable using Label Encoding
# (.values.ravel() flattens the single-column target to 1-D, as in the random forest cell)
label_encoder_xgb = LabelEncoder()
y_train_adult_encoded = label_encoder_xgb.fit_transform(y_train_adult.values.ravel())
y_test_adult_encoded = label_encoder_xgb.transform(y_test_adult.values.ravel())

# Train the model
xgb_model.fit(X_train_adult, y_train_adult_encoded)

# Make predictions on the test set
xgb_predictions = xgb_model.predict(X_test_adult)

# Decode the predictions back to the original labels
xgb_predictions_original_labels = label_encoder_xgb.inverse_transform(xgb_predictions)

# Evaluate the accuracy of the model
xgb_accuracy = accuracy_score(y_test_adult, xgb_predictions_original_labels)
print(f'Test accuracy {xgb_accuracy}')


Test accuracy 0.593100624424199
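XGBoost belongs to the gradient boosting family: trees are added one at a time, each fitted to the errors left by the trees before it. The sketch below is not XGBoost's algorithm (it has no regularization and no second-order gradients). It is a minimal plain-Python illustration of boosting with squared error on a single feature, using one-split "stumps" as the weak learners. The helper names `fit_stump` and `boost`, the toy data, and the settings are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the split `x <= threshold` whose two means best fit the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Move each prediction a fraction of the way toward the stump's correction
        predictions = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, predictions)
        ]
    return base, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical toy regression data with a jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

base, predictions = boost(xs, ys)
mse_start = mse(ys, [base] * len(ys))
mse_end = mse(ys, predictions)
print(mse_start, mse_end)
```

The error after boosting is lower than the error of the constant starting prediction, which is the behavior the boosting rounds are meant to produce.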
