<a href="https://colab.research.google.com/github/comparativechrono/Principles-of-Data-Science/blob/main/Week_3/Section_2_Python_Example__Building_a_Decision_Tree.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Section 2 - Building a Decision Tree with Python

Decision trees are a popular choice for classification because they are intuitive and simple. A decision tree classifier learns a sequence of yes/no questions about the input features (for example, "is the account length below some threshold?") and follows the answers from the root down to a leaf, which assigns the predicted class. In this section we will build a decision tree classifier with scikit-learn, one of the most widely used machine learning libraries for Python, and use it to predict customer churn from a handful of customer features.

1. Setting Up the Environment:

To follow along, you need Python with the scikit-learn, NumPy, pandas, and Matplotlib packages. Any that are missing can be installed with pip:

In [None]:
%pip install scikit-learn numpy pandas matplotlib

2. Importing Required Libraries:

Start by importing the necessary libraries. We'll use pandas for data manipulation, scikit-learn for splitting the data and building the decision tree, and Matplotlib for visualizing the tree.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

3. Preparing the Data:

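Let's assume we have a CSV file named customer_data.csv that contains customer churn data. The feature columns are Age, Income, and Account Length, and the target column is Churn (yes or no). We hold out 20% of the rows as a test set so that the model can later be evaluated on data it has not seen.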

In [None]:
# Load the dataset
data = pd.read_csv('customer_data.csv')

# Display the first few rows of the dataframe
print(data.head())

# Define the features and the target
X = data[['Age', 'Income', 'Account Length']]
y = data['Churn']  # the target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
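
The file customer_data.csv is not supplied with this section. If you don't have one, a stand-in can be generated so that the remaining cells run. The sketch below is only an illustration: the sample size, the value ranges, and the rule linking churn to account length and income are all invented for this example, and running it overwrites any existing customer_data.csv in the working directory.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500  # arbitrary sample size for this illustration

# Column names match the ones used in this section; the values are synthetic
synthetic = pd.DataFrame({
    "Age": rng.integers(18, 80, size=n),
    "Income": rng.normal(50_000, 15_000, size=n).round(2),
    "Account Length": rng.integers(1, 120, size=n),
})

# Hypothetical rule: newer accounts and lower incomes churn more often
score = (
    1.5
    - synthetic["Account Length"] / 60
    - (synthetic["Income"] - 50_000) / 30_000
)
prob = 1 / (1 + np.exp(-score))
synthetic["Churn"] = np.where(rng.random(n) < prob, "yes", "no")

# Write the file that the loading cell expects
synthetic.to_csv("customer_data.csv", index=False)
print(synthetic["Churn"].value_counts())
```

Because the churn rule is made up, any accuracy obtained on this file says nothing about real customers; it only lets you exercise the code.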

4. Building the Decision Tree Model:

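We will use scikit-learn to create and train a decision tree classifier. Setting max_depth=3 keeps the tree small enough to read and makes it less likely to overfit, and random_state=42 makes the result reproducible.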

In [None]:
# Create a decision tree classifier model
classifier = DecisionTreeClassifier(max_depth=3, random_state=42)

# Fit the model on the training data
classifier.fit(X_train, y_train)
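
To get a feel for what fitting does, it helps to look at how a candidate split is scored. By default, DecisionTreeClassifier uses the Gini impurity criterion and prefers splits that lower the weighted impurity of the child nodes. The following is a minimal hand-written sketch of that calculation on six made-up labels; the function names are hypothetical helpers, not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Made-up labels for six customers at one node
parent = ["yes", "yes", "no", "no", "no", "no"]
# A candidate split sends three customers to each child
left = ["yes", "yes", "no"]
right = ["no", "no", "no"]

parent_gini = gini_impurity(parent)                # 4/9, about 0.444
split_gini = weighted_split_impurity(left, right)  # 2/9, about 0.222
print(f"Parent impurity: {parent_gini:.3f}")
print(f"Impurity after split: {split_gini:.3f}")
```

The split halves the impurity because the right child is pure (all "no"). During fitting, the tree evaluates many such candidate thresholds at each node and keeps the one with the largest reduction.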

5. Visualizing the Decision Tree:

After training the model, visualize the decision tree to understand how it makes decisions:

In [None]:
# Plot the decision tree
# class_names must follow the sorted order of classifier.classes_ (e.g. 'no' before 'yes')
plt.figure(figsize=(12, 8))
plot_tree(classifier, filled=True, feature_names=['Age', 'Income', 'Account Length'], class_names=['No Churn', 'Churn'])
plt.title('Decision Tree for Customer Churn')
plt.show()
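
When a plot is inconvenient (for example in a terminal or a log file), the same rules can be printed as text with scikit-learn's export_text. The sketch below is self-contained and uses a small synthetic dataset in which churn depends only on account length; that rule and the 24-month cutoff are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic stand-in for the Age, Income, and Account Length columns
X_demo = np.column_stack([
    rng.integers(18, 80, size=200),
    rng.normal(50_000, 15_000, size=200),
    rng.integers(1, 120, size=200),
])
# Hypothetical rule: customers with short accounts churn
y_demo = np.where(X_demo[:, 2] < 24, "yes", "no")

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
demo_tree.fit(X_demo, y_demo)

# Render the learned rules as indented text
rules = export_text(demo_tree, feature_names=["Age", "Income", "Account Length"])
print(rules)
```

With the model trained in this section, the equivalent call would be `export_text(classifier, feature_names=list(X.columns))`.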

6. Evaluating the Model:

Evaluate the model's performance on the test set. As a simple first metric we'll calculate the accuracy, the fraction of test customers whose churn label is predicted correctly.

In [None]:
from sklearn.metrics import accuracy_score

# Predict on the test set
y_pred = classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f}")
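
Accuracy alone can be misleading when churners are a small minority, which is common in churn data. The toy example below, built on ten made-up labels rather than the dataset from this section, shows a model that never predicts churn still reaching 90% accuracy while catching no churners at all.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: nine customers stayed, one churned
y_true = ["no"] * 9 + ["yes"]
# A trivial "model" that always predicts "no"
y_hat = ["no"] * 10

acc = accuracy_score(y_true, y_hat)
# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_hat, labels=["no", "yes"])
# Recall for the churn class: the share of actual churners that were caught
churn_recall = recall_score(y_true, y_hat, pos_label="yes")

print(f"Accuracy: {acc:.2f}")
print(f"Churn recall: {churn_recall:.2f}")
print(cm)
```

On the real test set, the same functions can be called with `y_test` and `y_pred` to check whether a good accuracy figure hides poor recall on the churn class.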

7. Conclusion:

This example has demonstrated how to build and visualize a decision tree classifier using scikit-learn in Python. Decision trees are a valuable tool for classification because they show clearly how each decision is made, which makes them especially useful where interpretability is important. Although they are simple and effective, they are prone to overfitting: a tree that is allowed to grow very deep can memorize the training data instead of learning patterns that generalize. It is therefore often useful to prune the tree or limit its depth, as we did here with max_depth=3.
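
The effect of limiting depth can be checked by comparing training and test accuracy for a shallow tree and an unrestricted one. The sketch below uses synthetic data with deliberately noisy labels; the data-generating rule is an assumption for illustration, and the exact numbers will differ on a real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Three synthetic features; only the first one carries signal
X_syn = rng.normal(size=(600, 3))
# Noisy labels, so a perfect fit to the training data must memorize noise
y_syn = (X_syn[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te), tree.get_depth())
    train_acc, test_acc, actual_depth = results[depth]
    print(f"max_depth={depth}: depth={actual_depth}, "
          f"train={train_acc:.2f}, test={test_acc:.2f}")
```

The unrestricted tree fits the training set perfectly, and a large gap between its training and test accuracy is the usual sign of overfitting. The depth-limited tree gives up some training accuracy in exchange for a simpler, more readable model.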

By integrating decision tree classifiers into your data science workflow, you can efficiently tackle a wide range of classification problems with a model that stakeholders can easily understand and trust.