**Lab Title: Pima Indians Diabetes Prediction using Decision Trees**

**Objective:** In this lab, you will build and evaluate a Decision Tree classifier to predict the onset of diabetes in female patients of Pima Indian heritage. This exercise will provide you with hands-on experience in data preparation, model training, and performance evaluation for a real-world classification problem.

**Dataset Background:** We will be using the "Pima Indians Diabetes" dataset, which originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The goal is to predict whether a patient has diabetes from several diagnostic measurements. All patients included in this dataset are females of Pima Indian heritage and are at least 21 years old.

**Data Dictionary:** The dataset contains the following features:

**•	Preg:** Number of times pregnant

**•	Plas:** Plasma glucose concentration after 2 hours in an oral glucose tolerance test

**•	Pres:** Diastolic blood pressure (mm Hg)

**•	skin:** Triceps skin fold thickness (mm)

**•	test:** 2-Hour serum insulin (mu U/ml)

**•	mass:** Body mass index (BMI)

**•	pedi:** Diabetes pedigree function

**•	age:** Age (years)

**•	class:** The target variable (0 for non-diabetic, 1 for diabetic). Out of 768 total instances, 268 are class 1.
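The class counts above give a useful reference point for the accuracy you will measure in Step 7. As a small worked calculation (not part of the original lab), the sketch below derives the share of diabetic cases and the accuracy that a trivial model would reach by always predicting class 0. The variable names are illustrative.

```python
# Class counts quoted in the data dictionary
total = 768
positive = 268                 # class 1 (diabetic)
negative = total - positive    # class 0 (non-diabetic)

positive_rate = positive / total
majority_baseline = negative / total   # accuracy of always predicting class 0

print(f"Diabetic share: {positive_rate:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

A classifier is only informative if its test accuracy clearly exceeds this baseline.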

**Step-by-Step Solution Guide**

This guide will walk you through the process of building the Decision Tree classifier.

**Step 1:** Import Necessary Libraries

First, you need to import all the required Python libraries for data manipulation, modeling, and evaluation.

In [1]:
# Import libraries for data manipulation
import pandas as pd
import numpy as np

**Step 2:** Load and Inspect the Data

Load the dataset into a pandas DataFrame and perform a preliminary inspection to understand its structure and check for any missing values. The code below assumes the data is saved as diabetes.csv inside a folder named Lab; adjust the path if your copy is stored elsewhere.

In [9]:
# Load the dataset
df = pd.read_csv('Lab/diabetes.csv')

# Display the first 10 rows (print() is needed because a notebook cell
# only auto-displays its last expression)
print(df.head(10))

# Count missing values in each column
df.isnull().sum()


Preg     0
Plas     0
Pres     0
skin     0
test     0
mass     0
pedi     0
age      0
class    0
dtype: int64
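A count of zero nulls does not guarantee complete data. In this dataset, a value of 0 in Plas, Pres, skin, test, or mass is not physiologically plausible and is commonly treated as "not recorded". The sketch below counts such zeros; the helper name `count_zeros`, the column list, and the tiny demo frame are illustrative assumptions, and the lab itself does not require this check.

```python
import pandas as pd

# Columns where 0 is assumed to stand for a missing measurement
ZERO_AS_MISSING = ["Plas", "Pres", "skin", "test", "mass"]

def count_zeros(frame: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.Series:
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()

# Made-up rows for illustration only; in the lab, call count_zeros(df)
demo = pd.DataFrame({
    "Plas": [148, 0, 183],
    "Pres": [72, 66, 0],
    "skin": [35, 0, 0],
    "test": [0, 0, 0],
    "mass": [33.6, 26.6, 23.3],
})
print(count_zeros(demo))
```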

**Step 3:** Define Features (X) and Target (y)

Separate your DataFrame into the feature set (X) and the target variable (y). The "class" column is your target, and all other columns are your features.


In [3]:
# Define feature columns

# Create feature set (X) and target variable (y)
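One way this cell could be filled in, assuming the column names shown in the Step 2 output. The random stand-in frame exists only so the sketch runs on its own; in the notebook you would use the `df` loaded in Step 2 instead.

```python
import numpy as np
import pandas as pd

# Feature columns, as listed in the data dictionary
feature_cols = ["Preg", "Plas", "Pres", "skin", "test", "mass", "pedi", "age"]

# Stand-in frame with the lab's column names (random values, illustration only);
# in the notebook, skip these three lines and use the real df
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.random((20, len(feature_cols))), columns=feature_cols)
df["class"] = rng.integers(0, 2, size=len(df))

X = df[feature_cols]   # feature matrix: the eight measurements
y = df["class"]        # target vector: 0 = non-diabetic, 1 = diabetic

print(X.shape, y.shape)
```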



**Step 4:** Split Data into Training and Testing Sets


To estimate how well your model performs on unseen data, you must split your dataset into a training set (used to build the model) and a testing set (used to measure its accuracy). We will use an 80/20 split.


In [4]:
# Split data into 80% training and 20% testing


*Note: random_state=1 ensures that you get the same split every time you run the code, making your results reproducible.*
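A minimal sketch of the split described above, using `test_size=0.2` and `random_state=1`. The synthetic data from `make_classification` is an assumption that stands in for the lab's X and y (same number of rows and features) so the example runs without the CSV file.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 768 rows, 8 features (replace with the lab's X and y)
X, y = make_classification(n_samples=768, n_features=8, random_state=1)

# Hold back 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

print(X_train.shape, X_test.shape)
```

With 768 rows, this yields 614 training rows and 154 testing rows. Passing `stratify=y` would additionally keep the class ratio equal in both sets; the lab does not say whether that is intended.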

**Step 5:** Build and Train the Decision Tree Model

Create an instance of the DecisionTreeClassifier and fit it to your training data. Setting max_depth limits how many successive splits the tree may make, which helps prevent the model from becoming overly complex and overfitting the training data.


In [5]:
# Create a Decision Tree classifier object
# We'll set a max_depth of 4 to keep the tree simple and interpretable

# Train the model on the training data
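A sketch of how this cell might look with `max_depth=4`. The synthetic data is a stand-in for the lab's training set, and passing `random_state=1` to the classifier is an assumption added here to make tie-breaking between equally good splits repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab's data (replace with the real X and y)
X, y = make_classification(n_samples=768, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Create the classifier, capped at four levels of splits
clf = DecisionTreeClassifier(max_depth=4, random_state=1)

# Train the model on the training data only
clf.fit(X_train, y_train)

print("Depth of fitted tree:", clf.get_depth())
```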


**Step 6:** Make Predictions

Use your trained classifier to make predictions on the X_test data that you held back earlier.

In [6]:
# Make predictions on the test set
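A minimal sketch of the prediction step. As before, the synthetic data and the classifier's `random_state` are assumptions so the example is runnable on its own; in the notebook, only the `predict` call is new.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model (in the notebook these come from Steps 3 to 5)
X, y = make_classification(n_samples=768, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X_train, y_train)

# Predict a class label (0 or 1) for every row of the held-back test set
y_pred = clf.predict(X_test)

print(y_pred[:10])
```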


**Step 7:** Evaluate Model Performance

Now, compare the model's predictions (y_pred) with the actual target values (y_test) to assess its performance.

In [7]:
# 1. Check Accuracy


# 2. View the Confusion Matrix (graphical)


# 3. View the Classification Report


**Step 8:** Visualize the Decision Tree

A key advantage of Decision Trees is their interpretability. You can visualize the trained tree to understand the decision rules it learned from the data.

In [8]:
# Visualize the Decision Tree
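One possible way to draw the tree is scikit-learn's `plot_tree`. In this sketch the data is a synthetic stand-in, and the figure size, font size, and class labels are assumptions chosen for readability rather than values taken from the lab.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

feature_names = ["Preg", "Plas", "Pres", "skin", "test", "mass", "pedi", "age"]

# Stand-in data and model (in the notebook, reuse the clf trained in Step 5)
X, y = make_classification(n_samples=768, n_features=8, random_state=1)
clf = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Draw one box per node, showing its split rule, impurity, and class counts
fig, ax = plt.subplots(figsize=(16, 8))
annotations = plot_tree(
    clf,
    feature_names=feature_names,
    class_names=["non-diabetic", "diabetic"],
    filled=True,     # shade each node by its majority class
    rounded=True,
    fontsize=8,
    ax=ax,
)
plt.show()
```

Reading from the root downward, each path to a leaf is one decision rule the model learned.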

