# Decision Tree on the Breast Cancer Dataset


# 1. Project Overview
This project aims to use Decision Tree algorithms to classify breast cancer data, improving model performance through various techniques including hyperparameter tuning, pruning, and feature engineering.


# 2. Preprocessing Data


2.1 Load Dataset

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/breast-cancer.csv')  # Adjust the path if necessary

# Display the first few rows of the dataframe
data.head()

2.2 Understanding the Nature and Status of the Dataset


Using the 'value_counts()', 'set_option()', 'columns.tolist()', and 'isnull().sum()' methods

In [None]:
data.value_counts().head()

In [None]:
import pandas as pd

# Set option to display all columns
pd.set_option('display.max_columns', None)

# Display the first few rows of the dataframe again
data.head()

In [None]:
data.columns.tolist()

In [None]:
data['diagnosis'].value_counts()

diagnosis,count
B,357
M,212


In [None]:
data.isnull().sum()

2.3 Convert the 'diagnosis' Target to Binary '1' and '0'

In [None]:
# Encode the target variable 'diagnosis'
data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})
data['diagnosis']

In [None]:
data['diagnosis'].value_counts()

diagnosis,count
0,357
1,212
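
A quick sanity check can follow the encoding step. The sketch below uses toy labels (the value 'X' is a hypothetical unexpected label, not something found in the dataset) to show that `Series.map` leaves `NaN` for any value missing from the dictionary, so counting missing values after mapping catches typos or unexpected categories.

```python
import pandas as pd

# Toy labels; 'X' stands in for a hypothetical unexpected value.
labels = pd.Series(['M', 'B', 'B', 'X'])
encoded = labels.map({'M': 1, 'B': 0})

# Series.map leaves NaN wherever a value is missing from the dictionary.
n_unmapped = int(encoded.isna().sum())
print(f'Unmapped labels: {n_unmapped}')
```

On the real data, the same check would be `data['diagnosis'].isna().sum()` right after the mapping.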


2.4 Splitting Data

In [None]:
from sklearn.model_selection import train_test_split

# Separate features and target variable
X = data.drop(columns=['diagnosis'])
y = data['diagnosis']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
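
Since the classes are imbalanced (357 benign, 212 malignant), a stratified split may be worth considering. The sketch below is not what this notebook ran; it uses a synthetic stand-in with the same class counts (`X_demo` and `y_demo` are hypothetical names) to show how `stratify` keeps the malignant share close to the overall share in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the dataset's class counts (357 benign, 212 malignant).
y_demo = np.array([0] * 357 + [1] * 212)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

overall_rate = y_demo.mean()
print(f'Malignant share overall:  {overall_rate:.3f}')
print(f'Malignant share in train: {y_tr.mean():.3f}')
print(f'Malignant share in test:  {y_te.mean():.3f}')
```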


# 3. Building the Model

# 3.1 Applying Class Weights and Training the Decision Tree:
Now, let's train the Decision Tree with class weights:


In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree model with class weights
model = DecisionTreeClassifier(class_weight='balanced', random_state=42)

# Train the model
model.fit(X_train, y_train)
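
To see what `class_weight='balanced'` does, the weights can be worked out by hand. scikit-learn documents the formula as `n_samples / (n_classes * class_count)`. The sketch below applies it to the training-split class counts, which are derived here from the outputs in this notebook (overall counts minus the test-set support) rather than read from the data.

```python
# Class counts in the training split, derived from the notebook's outputs:
# 357 benign and 212 malignant overall, minus 71 and 43 in the test split.
train_counts = {0: 357 - 71, 1: 212 - 43}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' weight for each class: n_samples / (n_classes * class_count)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(f'class {label}: weight {weight:.3f}')
```

The rarer malignant class gets the larger weight, so errors on malignant samples count more when the tree evaluates splits.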

# 3.2 Evaluating the Initial Model:
Let's also keep track of our model’s performance using accuracy, classification report, and confusion matrix:

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.96
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Confusion Matrix:
[[69  2]
 [ 3 40]]


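# Conclusion of Step 3
Accuracy is 96%. For benign cases (Class 0), precision is 96%, recall is 97%, and F1-score is 97%; for malignant cases (Class 1), they are 95%, 93%, and 94%. In sum, our initial results are good.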
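
The figures in the classification report can be reproduced from the confusion matrix alone. The helper below (`per_class_metrics` is a name introduced for this sketch) assumes scikit-learn's layout, where row i is the true class and column j is the predicted class.

```python
def per_class_metrics(cm, positive):
    """Precision, recall, and F1 for one class of a 2x2 confusion matrix.

    cm[i][j] counts samples of true class i predicted as class j.
    """
    negative = 1 - positive
    tp = cm[positive][positive]
    fp = cm[negative][positive]
    fn = cm[positive][negative]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Confusion matrix printed for the initial model in step 3.
cm_initial = [[69, 2], [3, 40]]

total = sum(sum(row) for row in cm_initial)
accuracy_initial = (cm_initial[0][0] + cm_initial[1][1]) / total
print(f'accuracy: {accuracy_initial:.2f}')

for label in (0, 1):
    p, r, f = per_class_metrics(cm_initial, positive=label)
    print(f'class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```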


# Next, We Try to Improve the Performance of Our Model

# 4. Improving the Decision Tree Model (By Tuning Hyperparameters)

# 4.1: Define the Hyperparameter Grid
We’ll specify a range of values for max_depth, min_samples_split, min_samples_leaf, and criterion:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Define the hyperparameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy']
}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='accuracy')


In [None]:
grid_search
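
It helps to know how much work the grid search implies before fitting it. The sketch below counts the parameter combinations in the grid with `itertools.product` and multiplies by the number of folds; the variable names are introduced for this example only.

```python
from itertools import product

# Same grid as above, repeated so this snippet runs on its own.
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
    'criterion': ['gini', 'entropy'],
}

combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
n_folds = 5  # matches cv=5
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
```

With the default `refit=True`, GridSearchCV also trains the best candidate once more on the full training set.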

# 4.2: Fit the Model
Fit the GridSearchCV object to the training data:

In [None]:
grid_search.fit(X_train, y_train)

# Display the best parameters
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')

Best parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best score: 0.9472527472527472
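
The search picked 'entropy' over 'gini'. As a rough illustration of what the two criteria measure, the sketch below computes both impurities by hand for a node holding the whole dataset's class counts (357 benign, 212 malignant). The helper names `gini` and `entropy` are introduced here and are not scikit-learn functions.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


root_counts = [357, 212]  # benign, malignant
print(f'gini:    {gini(root_counts):.4f}')
print(f'entropy: {entropy(root_counts):.4f}')
```

Both measures are zero for a pure node and largest for a 50/50 node, so they often rank splits similarly.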


# 4.3: Evaluate the Tuned Model
Use the best parameters to train and evaluate our final model:

In [None]:
# Use the best estimator
best_model = grid_search.best_estimator_

# Make predictions
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.95
Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.97      0.96        71
           1       0.95      0.91      0.93        43

    accuracy                           0.95       114
   macro avg       0.95      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

Confusion Matrix:
[[69  2]
 [ 4 39]]


# Conclusion of Step 4:
Hyperparameter tuning did not improve test performance. Accuracy dropped slightly from 0.96 to 0.95, precision for malignant cases (Class 1) is unchanged at 0.95, and recall for malignant cases fell from 0.93 to 0.91 (4 false negatives instead of 3). Because missed malignant cases are the most costly error in medical diagnostics, this decrease in recall should be monitored.
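
Because recall on malignant cases matters most here, one option is to let the grid search optimize recall instead of accuracy. The sketch below is an assumption about how that could look, not something this notebook ran. It uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV and a smaller, hypothetical grid (`small_grid`) to keep it quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: the bundled dataset encodes malignant as 0,
# so flip it to match this notebook's convention (1 = malignant).
bunch = load_breast_cancer()
X_demo = bunch.data
y_demo = (bunch.target == 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical reduced grid, chosen only to keep the example fast.
small_grid = {'max_depth': [3, 5, None], 'min_samples_leaf': [1, 5]}

recall_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    small_grid,
    cv=5,
    scoring='recall',  # recall of the positive class (label 1, malignant)
)
recall_search.fit(X_tr, y_tr)

print(f'Best parameters: {recall_search.best_params_}')
print(f'Best CV recall: {recall_search.best_score_:.3f}')
```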

# 5. Improving the Decision Tree Model (By Using the Pruning Method)

# 5.1: Get the cost complexity pruning path:

In [None]:
path = model.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
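
To get a feel for what the pruning path contains, the sketch below refits a tree at each `ccp_alpha` and records how many nodes survive. It runs on scikit-learn's bundled copy of the dataset as a stand-in for the CSV, so the exact alphas will differ from the ones in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled copy, not the notebook's CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

path_demo = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_demo, y_demo
)

node_counts = []
for alpha in path_demo.ccp_alphas:
    # max(..., 0.0) guards against a tiny negative alpha from floating-point error.
    pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=max(alpha, 0.0))
    pruned.fit(X_demo, y_demo)
    node_counts.append(pruned.tree_.node_count)

for alpha, n_nodes in zip(path_demo.ccp_alphas, node_counts):
    print(f'ccp_alpha={alpha:.5f} -> {n_nodes} nodes')
```

Larger alphas penalize tree size more heavily, so the node count shrinks as alpha grows.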

# 5.2: Evaluate Different ccp_alpha Values
Next, we'll prune the tree for different values of ccp_alpha and evaluate their performance:

In [None]:
models = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    models.append(clf)

# Evaluate models and select the best
train_scores = [model.score(X_train, y_train) for model in models]
test_scores = [model.score(X_test, y_test) for model in models]
train_scores

In [None]:
test_scores


# 5.2.1 Pairwise Comparison of Train and Test Scores

In [None]:
for train, test in zip(train_scores, test_scores):
    print(f'Train Score: {train}, Test Score: {test}')


Train Score: 1.0, Test Score: 0.9385964912280702
Train Score: 0.9956043956043956, Test Score: 0.9473684210526315
Train Score: 0.9956043956043956, Test Score: 0.9473684210526315
Train Score: 0.9934065934065934, Test Score: 0.9473684210526315
Train Score: 0.9934065934065934, Test Score: 0.9473684210526315
Train Score: 0.989010989010989, Test Score: 0.956140350877193
Train Score: 0.989010989010989, Test Score: 0.956140350877193
Train Score: 0.9846153846153847, Test Score: 0.956140350877193
Train Score: 0.9802197802197802, Test Score: 0.956140350877193
Train Score: 0.967032967032967, Test Score: 0.956140350877193
Train Score: 0.9582417582417583, Test Score: 0.9473684210526315
Train Score: 0.9582417582417583, Test Score: 0.9473684210526315
Train Score: 0.9208791208791208, Test Score: 0.8947368421052632
Train Score: 0.6285714285714286, Test Score: 0.6228070175438597


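# 5.2.2 Select the Best Pruning Level Manually
Row 10 of the output above (train_scores[9] and test_scores[9]) is the best pruning level for this model: it ties for the highest test score (0.956) and has the smallest gap to the train score (0.967).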
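
The manual choice can also be written as a rule. The sketch below hard-codes the scores printed above (rounded to four decimals), keeps the rows with the highest test score, and breaks the tie by the smallest train/test gap. The rule itself is one reasonable assumption, not the only possible one.

```python
# Scores copied from the output above, rounded to four decimals.
train_scores = [1.0, 0.9956, 0.9956, 0.9934, 0.9934, 0.9890, 0.9890,
                0.9846, 0.9802, 0.9670, 0.9582, 0.9582, 0.9209, 0.6286]
test_scores = [0.9386, 0.9474, 0.9474, 0.9474, 0.9474, 0.9561, 0.9561,
               0.9561, 0.9561, 0.9561, 0.9474, 0.9474, 0.8947, 0.6228]

# Keep every index that reaches the highest test score.
best_test = max(test_scores)
candidates = [i for i, score in enumerate(test_scores) if score == best_test]

# Among those, prefer the smallest gap between train and test score.
best_index = min(candidates, key=lambda i: train_scores[i] - test_scores[i])

print(f'Candidates with the top test score: {candidates}')
print(f'Selected index: {best_index}')
```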

# 5.3 Apply best Prune (ccp_alpha) to Our Model

In [None]:
# Manually set the best ccp_alpha from index 9
best_ccp_alpha = ccp_alphas[9]

# Initialize and train the Decision Tree with this ccp_alpha
best_model = DecisionTreeClassifier(random_state=42, ccp_alpha=best_ccp_alpha)
best_model.fit(X_train, y_train)

# Make predictions
y_pred = best_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.96
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.97      0.97        71
           1       0.95      0.93      0.94        43

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Confusion Matrix:
[[69  2]
 [ 3 40]]
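
Picking `ccp_alpha` by test score means the test set is no longer a fully independent check. A common alternative is to choose the alpha by cross-validation on the training data only and touch the test set once at the end. The sketch below shows that idea on scikit-learn's bundled copy of the dataset (a stand-in for the CSV); it is an assumption about how the selection could be done, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled copy, not the notebook's CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

path_cv = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
# Clip tiny negative values from floating-point error and drop duplicates.
alphas = np.unique(np.clip(path_cv.ccp_alphas, 0.0, None))

# Mean 5-fold accuracy on the training data for each candidate alpha.
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(random_state=42, ccp_alpha=alpha), X_tr, y_tr, cv=5
    ).mean()
    for alpha in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]

# The test set is used once, after the alpha has been fixed.
final_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
final_tree.fit(X_tr, y_tr)
test_accuracy = final_tree.score(X_te, y_te)

print(f'Selected ccp_alpha: {best_alpha:.5f}')
print(f'Test accuracy: {test_accuracy:.3f}')
```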


# 5.4 Overview of Results

# Accuracy
Initial Model (the unpruned tree from the first row in 5.2.1): 0.94; Pruned Model: 0.96
Analysis: A slight increase in accuracy, indicating that the pruned model generalizes better.

# Precision
Class 0 (Benign): Improved from 0.93 to 0.96
Class 1 (Malignant): Remained the same at 0.95

Analysis: Higher precision for Benign cases means fewer malignant tumors are predicted as benign (false negatives fell from 5 to 3), so a benign prediction is more reliable.

# Recall
Class 0 (Benign): Remained the same at 0.97
Class 1 (Malignant): Improved from 0.88 to 0.93

Analysis: Increased recall for Malignant cases shows fewer false negatives, which is crucial in medical diagnostics.

# F1-Score
Class 0 (Benign): Improved from 0.95 to 0.97
Class 1 (Malignant): Improved from 0.92 to 0.94

Analysis: Higher F1-scores indicate a better balance between precision and recall, making the pruned model more robust.

# Confusion Matrix
Initial Model:
True Negatives: 69, False Positives: 2, False Negatives: 5, True Positives: 38

Pruned Model:
True Negatives: 69, False Positives: 2, False Negatives: 3, True Positives: 40

Analysis: The pruned model has fewer false negatives and more true positives for Malignant cases, further confirming its improved performance.

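# Conclusion of Step 5 (Gains from Pruning)
The pruned model performs better than the unpruned tree overall: accuracy rises from 0.94 to 0.96, recall for malignant cases rises from 0.88 to 0.93, and F1-scores improve for both classes, while precision for malignant cases is unchanged at 0.95. Its test metrics match those of the class-weighted model from step 3. The drop in false negatives is what matters most in a diagnostic setting.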

# I Like the Pruning Method

# 6. Improving the Decision Tree Model (Feature Engineering)

6.1 Understanding Feature Importance


In [None]:
import pandas as pd

# Collect the feature names the model was trained on (every column except 'diagnosis')
features = data.drop(columns=['diagnosis']).columns

# Calculate feature importance
importances = best_model.feature_importances_
feature_importance = pd.DataFrame(importances, index=features, columns=['Importance']).sort_values(by='Importance', ascending=False)
print(feature_importance)


6.2 Creating New Features Based On Important Features

In [None]:
# Create interaction features based on the most important features
data['concave_mean_worst'] = data['concave points_mean'] * data['concave points_worst']
data['radius_perimeter_texture'] = data['radius_worst'] * data['perimeter_worst'] * data['texture_mean']
print(data['concave_mean_worst'])
print(data['radius_perimeter_texture'])


In [None]:
# Drop the 'id' and 'diagnosis' columns as they are not needed for training
data = data.drop(columns=['id', 'diagnosis'])


6.3 Standardizing data

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
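
Fitting the scaler on the full dataset lets test rows influence the mean and standard deviation. A minimal sketch of the alternative, assuming the goal is to keep test data out of those statistics, is to split first and put the scaler in a pipeline. It uses scikit-learn's bundled copy of the dataset as a stand-in for the CSV.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled copy, not the notebook's CSV.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipe.fit(X_tr, y_tr)

fitted_scaler = pipe.named_steps['standardscaler']
train_means = fitted_scaler.transform(X_tr).mean(axis=0)
pipe_accuracy = pipe.score(X_te, y_te)

print(f'Largest absolute train mean after scaling: {abs(train_means).max():.2e}')
print(f'Test accuracy: {pipe_accuracy:.3f}')
```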


6.4 Splitting the New Data

In [None]:
from sklearn.model_selection import train_test_split

X_train_new, X_test_new, y_train, y_test = train_test_split(scaled_data, y, test_size=0.2, random_state=42)


6.5 Retraining Our Best Model with the New Features

In [None]:
# Retrain the pruned model with the new features
best_model.fit(X_train_new, y_train)
y_pred = best_model.predict(X_test_new)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.95
Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.96      0.96        71
           1       0.93      0.93      0.93        43

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

Confusion Matrix:
[[68  3]
 [ 3 40]]


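# Conclusion of Step 6
Although the difference is minimal, the pruned model without the engineered features performs slightly better: accuracy drops from 0.96 to 0.95 after feature engineering, with one additional false positive (3 instead of 2) and the same three false negatives. Feature engineering therefore did not significantly degrade the model's performance, but it did not improve it either.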

## License
This project is licensed under the MIT License - see the LICENSE file for details.

## © 2024 Ali M Shafiei. All rights reserved.
