<a href="https://colab.research.google.com/github/frm1789/100DaysOfPython/blob/main/DT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### CART vs Random Forest vs Assemble models

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

# CART

How CART (Classification and Regression Trees) works?

1. Feature Selection: Finds the best feature to split the dataset.
2. Dataset Splitting: Divides the dataset into subsets based on features.
3. Recursion: Continues splitting subsets until stopping criteria are met.
4. Tree Construction: Constructs a decision tree with nodes and leaves.
5. Tree Pruning (optional): Removes subtrees to prevent overfitting.
6. Prediction: Uses the tree to predict outcomes for new samples.


In [136]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Load the diabetes dataset
df= pd.read_csv('diabetes.csv')

In [137]:
def check_data_quality(data):
    # Check for NaN values
    nan_count = np.sum(np.isnan(data))

    # Check for outliers (assuming outliers are values more than 3 standard deviations away from the mean)
    mean = np.mean(data)
    std_dev = np.std(data)
    outliers_count = np.sum(np.abs(data - mean) > 3 * std_dev)


    return nan_count, outliers_count, zero_count


In [149]:
def preprocess_data(df):
     # Replace zeros with NaN
    for col in df:
        df[col] = df[col].replace(0, np.nan)

    # Impute NaN values using KNN Imputer
    imputer = KNNImputer(n_neighbors=5)
    imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)


    return imputed_df



## A glimpse of the data

In [139]:
df.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1


In [140]:
df.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


###

In [141]:

nan_count, outliers_count, zero_count = check_data_quality(df)

print("\nNaN count:", nan_count)
print("\nOutliers count:", outliers_count)
print("\nZero count:", zero_count)


NaN count: Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Outliers count: Pregnancies                  4
Glucose                      5
BloodPressure               35
SkinThickness                1
Insulin                     18
BMI                         14
DiabetesPedigreeFunction    11
Age                          5
Outcome                      0
dtype: int64

Zero count: Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                     500
dtype: int64


  return mean(axis=axis, dtype=dtype, out=out, **kwargs)


In [155]:
#df_imp = preprocess_data(df)
df_imp

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6.0,148.0,72.0,35.0,169.0,33.6,0.627,50.0,1.0
1,1.0,85.0,66.0,29.0,58.6,26.6,0.351,31.0,1.0
2,8.0,183.0,64.0,26.8,186.6,23.3,0.672,32.0,1.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,1.0
4,6.6,137.0,40.0,35.0,168.0,43.1,2.288,33.0,1.0
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,48.0,180.0,32.9,0.171,63.0,1.0
764,2.0,122.0,70.0,27.0,165.0,36.8,0.340,27.0,1.0
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30.0,1.0
766,1.0,126.0,60.0,34.6,134.2,30.1,0.349,47.0,1.0


In [151]:
nan_counts = df_imp.isna().sum()
print(nan_counts)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [152]:
y =  df_imp['Outcome']
X =  df_imp.drop('Outcome', axis=1)

In [156]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the decision tree regressor
clf = DecisionTreeClassifier(random_state=42)

# Train the regressor on the training data
clf.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 0.0


In [154]:

# Initialize the decision tree regressor
clf = DecisionTreeClassifier(random_state=42, max_depth=4)

# Train the regressor on the training data
clf.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = clf.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 0.0
