# Decision Tree Basic Model

**Definition**
A supervised learning algorithm used for classification and regression tasks, structured like a tree with nodes and branches.

**Structure**
- **Root Node**: Represents the entire dataset and initiates the split.
- **Internal Nodes**: Represent features/attributes used for decisions.
- **Leaf Nodes**: Represent the final classification or output.

**Splitting Criteria**
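- **Gini Impurity**: Measures how mixed the classes are in a node; a score of 0 means the node contains a single class. *The lower the Gini score, the better.*
- **Entropy (Information Gain)**: Entropy measures the class disorder in a node. Information gain is the reduction in entropy produced by a split, and the split with the highest gain is preferred.
- **Mean Squared Error (Regression)**: Measures the variance of the continuous target within a node; splits that reduce it the most are preferred.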
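The three criteria above can be computed by hand for a single node. The following is a minimal sketch in plain Python; the function names `gini`, `entropy`, and `mse` and the toy label lists are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def mse(values):
    """Mean squared error around the node mean (the node's variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Toy nodes (hypothetical): one pure, one evenly mixed
pure = [1, 1, 1, 1]
mixed = [0, 0, 1, 1]

print("Gini    pure/mixed:", gini(pure), gini(mixed))
print("Entropy mixed     :", entropy(mixed))
print("MSE of [2.0, 4.0] :", mse([2.0, 4.0]))
```

For a two-class node, Gini peaks at 0.5 and entropy at 1 bit when the classes are evenly mixed; both are 0 for a pure node.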

**Advantages**
- Easy to understand and interpret.
- Handles both numerical and categorical data (scikit-learn's implementation expects categorical features to be encoded as numbers first).
- Requires minimal data preparation.

**Limitations**
- Prone to overfitting, especially when the tree is allowed to grow deep.
- Unstable: small changes in the data can significantly change the tree's structure.
- Can be biased towards features with many distinct levels.

**Pruning**
A process to remove unnecessary branches to reduce overfitting and simplify the model.
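One way to prune in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter. The sketch below is only an illustration: it uses the bundled iris dataset as stand-in data (an assumption, not the salaries data used later), and picking the mid-range alpha is arbitrary. In practice alpha would normally be chosen by validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): the iris dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)

# Grow an unpruned tree first
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Compute the candidate alpha values along the pruning path
path = full_tree.cost_complexity_pruning_path(X, y)

# Arbitrary choice for illustration: take an alpha from the middle of the path
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning :", pruned_tree.get_n_leaves())
```

Larger values of `ccp_alpha` remove more branches, trading training accuracy for a simpler tree.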

**Applications**
Used in finance, healthcare, marketing, and more for tasks like risk analysis, diagnosis, and customer segmentation.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv(r"C:\Users\Dell\Downloads\salaries.csv")
df.head()

| | company | job | degree | salary_more_then_100k |
|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 0 |
| 1 | google | sales executive | masters | 0 |
| 2 | google | business manager | bachelors | 1 |
| 3 | google | business manager | masters | 1 |
| 4 | google | computer programmer | bachelors | 0 |


In [6]:
# Independent variables (features) used for prediction

inputs = df.drop('salary_more_then_100k',axis='columns')

In [7]:
# Dependent variable (target) that we are trying to predict

target = df['salary_more_then_100k']

In [9]:
# Creating LabelEncoder instances for categorical columns

from sklearn.preprocessing import LabelEncoder
le_company = LabelEncoder()  # Encodes company names into numerical values
le_job = LabelEncoder()      # Encodes job titles into numerical values
le_degree = LabelEncoder()   # Encodes degree types into numerical values

In [11]:
# Encode categorical columns into numerical values 

inputs['company_n'] = le_company.fit_transform(inputs['company'])  # Convert company names into numeric labels
inputs['job_n'] = le_job.fit_transform(inputs['job'])              # Convert job titles into numeric labels
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])     # Convert degree types into numeric labels
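To read the encoded columns, it helps to know which number each label received. `LabelEncoder` assigns codes in sorted order of the labels, and the fitted `classes_` attribute exposes that order. A minimal sketch follows; the company list is hypothetical, and "example corp" is a placeholder name, not a value from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical company values; "example corp" is a placeholder
companies = ["google", "abc pharma", "example corp", "google"]

le = LabelEncoder()
codes = le.fit_transform(companies)

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
print(le.inverse_transform([2]))
```

Because the codes are just sorted positions, they carry no meaningful order; a tree can still split on them, but the numbers themselves should not be read as rankings.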

In [12]:
inputs

| | company | job | degree | company_n | job_n | degree_n |
|---|---|---|---|---|---|---|
| 0 | google | sales executive | bachelors | 2 | 2 | 0 |
| 1 | google | sales executive | masters | 2 | 2 | 1 |
| 2 | google | business manager | bachelors | 2 | 0 | 0 |
| 3 | google | business manager | masters | 2 | 0 | 1 |
| 4 | google | computer programmer | bachelors | 2 | 1 | 0 |
| 5 | google | computer programmer | masters | 2 | 1 | 1 |
| 6 | abc pharma | sales executive | masters | 0 | 2 | 1 |
| 7 | abc pharma | computer programmer | bachelors | 0 | 1 | 0 |
| 8 | abc pharma | business manager | bachelors | 0 | 0 | 0 |
| 9 | abc pharma | business manager | masters | 0 | 0 | 1 |


In [13]:
# Remove original categorical columns since we now have their encoded versions

inputs_n = inputs.drop(['company','job','degree'],axis='columns')

In [14]:
inputs_n

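| | company_n | job_n | degree_n |
|---|---|---|---|
| 0 | 2 | 2 | 0 |
| 1 | 2 | 2 | 1 |
| 2 | 2 | 0 | 0 |
| 3 | 2 | 0 | 1 |
| 4 | 2 | 1 | 0 |
| 5 | 2 | 1 | 1 |
| 6 | 0 | 2 | 1 |
| 7 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 |
| 9 | 0 | 0 | 1 |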


In [16]:
target

0     0
1     0
2     1
3     1
4     0
5     1
6     0
7     0
8     0
9     1
10    1
11    1
12    1
13    1
14    1
15    1
Name: salary_more_then_100k, dtype: int64

In [18]:
# Importing the Decision Tree module from scikit-learn
from sklearn import tree

# Create a DecisionTreeClassifier model instance
model = tree.DecisionTreeClassifier()

In [19]:
# Train the Decision Tree model using the encoded input data and target labels

model.fit(inputs_n, target)
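Once a tree is fitted, its learned rules can be printed as plain text with `sklearn.tree.export_text`, which is a quick way to check what the model actually split on. The sketch below fits a separate small tree on four hypothetical encoded rows laid out like `inputs_n`; the rows and labels are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded rows in the column layout company_n, job_n, degree_n
X = [[2, 2, 0], [2, 0, 0], [0, 2, 1], [0, 0, 1]]
y = [0, 1, 0, 1]  # in this toy data the label depends only on job_n

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Render the fitted tree as indented if/else style rules
rules = export_text(clf, feature_names=["company_n", "job_n", "degree_n"])
print(rules)
```

The same call should work on the notebook's `model` by passing the column names of `inputs_n` as `feature_names`.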

In [20]:
# Evaluate accuracy on the training data itself (this does not measure generalization)

model.score(inputs_n, target)

1.0
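A score of 1.0 on the rows the model was trained on mostly shows that the tree memorized them. A more informative check holds some rows out. The sketch below is an illustration under assumptions: it uses the bundled iris dataset as stand-in data, and the 25% test size and `random_state` values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris, not the salaries dataset
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # accuracy on rows seen during fitting
test_acc = clf.score(X_test, y_test)     # accuracy on held-out rows

print("Train accuracy:", train_acc)
print("Test accuracy :", test_acc)
```

With only 16 rows, as in the salaries data, a single held-out split is very noisy, so the result should be read with caution.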

**Predict whether a Google computer programmer with a bachelor's degree earns more than 100K**

In [21]:
model.predict([[2,1,0]])  # Output [0] means "No"



array([0], dtype=int64)

**Predict whether a Google computer programmer with a master's degree earns more than 100K**

In [23]:
model.predict([[2,1,1]])  # Output [1] means "Yes"



array([1], dtype=int64)
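Typing encoded numbers such as `[2, 1, 0]` by hand is easy to get wrong. One option is to reuse the fitted encoders to translate readable labels into codes. The sketch below shows the idea on a small hypothetical dataset; the rows, labels, and the helper `encode_query` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical rows in the same shape as the salaries data
raw = pd.DataFrame({
    "company": ["google", "google", "abc pharma", "abc pharma"],
    "job": ["sales executive", "business manager",
            "sales executive", "business manager"],
    "degree": ["bachelors", "masters", "masters", "bachelors"],
})
y = [0, 1, 0, 1]  # in this toy data only the job determines the label

# One fitted encoder per categorical column
encoders = {col: LabelEncoder().fit(raw[col]) for col in raw.columns}
X = pd.DataFrame({f"{col}_n": enc.transform(raw[col])
                  for col, enc in encoders.items()})

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

def encode_query(company, job, degree):
    """Hypothetical helper: turn readable labels into one encoded row."""
    values = {"company": company, "job": job, "degree": degree}
    return pd.DataFrame({f"{col}_n": encoders[col].transform([values[col]])
                         for col in encoders})

query = encode_query("google", "business manager", "bachelors")
prediction = clf.predict(query)
print(prediction)
```

Passing a DataFrame with the training column names also keeps the feature names consistent between fitting and prediction.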

**Note:**

**Decision Tree Splitting Criteria**

**1. Gini Index (Default)**
- Measures impurity using probability of misclassification.
- Formula: **Gini = 1 - Σ (pᵢ²)**, where pᵢ is the probability of each class.
- Lower Gini = Purer split = Better separation.

**2. Entropy (Alternative)**
- Uses **Information Gain** to determine the best split.
- Formula: **Entropy = -Σ (pᵢ log₂ pᵢ)**.
- Higher Information Gain = More informative split.

**Default Behavior**
- If no criterion is specified, **Gini Index** is used by default in `DecisionTreeClassifier()`.
- At each node, the tree evaluates candidate splits and chooses the one that gives the lowest weighted Gini impurity in the resulting child nodes.
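The split search can be sketched for a single feature. The code below is a simplified illustration, not scikit-learn's actual implementation: it tries the midpoints between sorted unique values and keeps the threshold with the lowest weighted Gini. The function names are assumptions; the `job_n` and `target` values are the six Google rows shown in the outputs above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return (threshold, score)."""
    unique = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = weighted_gini(left, right)
        if score < best[1]:  # strict comparison keeps the first of any ties
            best = (threshold, score)
    return best

# job_n and target for rows 0-5 of the notebook output
job_n = [2, 2, 0, 0, 1, 1]
target = [0, 0, 1, 1, 0, 1]

threshold, score = best_threshold(job_n, target)
print("Gini before split:", gini(target))
print("Best threshold   :", threshold, "weighted Gini:", score)
```

To use entropy instead of Gini in scikit-learn, pass `criterion="entropy"` to `DecisionTreeClassifier`.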