### Decision Tree(s)

Root Node: 

The topmost node in the tree
Represents the entire dataset
The first decision or test occurs here, based on the attribute that best splits the data.

Splitting:

Process of dividing a node into two or more sub-nodes.
The choice of attribute for splitting is based on criteria like Gini impurity, information gain (using entropy), or variance reduction for regression.
The goal is to create subgroups that are as pure as possible.

Decision Nodes:

Nodes that are created after splitting the root node or other nodes.
Represent a test or condition on an attribute.
Each decision node leads to further branches or sub-nodes.

Leaf Nodes (Terminal Nodes):

Nodes that do not split further.
Represent the final decision or output (e.g., class label for classification or a value for regression).
A node becomes a leaf when it is already pure, when a stopping criterion is met, or when no further split would improve the outcome.

Branch/Sub-Tree:

A section of the decision tree formed by splitting a node.
Each branch represents an outcome of a decision and leads to another decision node or a leaf node.

Attribute Selection Criteria:

Determines which attribute to split on at each step.
Common criteria include:

- Gini Impurity: For classification. The probability that a randomly drawn sample from the node would be mislabeled if it were labeled at random according to the node's class distribution.
- Information Gain: Uses entropy to measure the reduction in uncertainty from a split.
- Variance Reduction: Used for regression trees to measure the reduction in variance from a split.

Tree Pruning:

Technique to reduce the complexity of the decision tree.
Removes nodes that provide little predictive power to prevent overfitting.
Can be done after training with cost-complexity pruning, or during training with constraints such as a minimum number of samples per leaf.

Stopping Criteria:

Determines when to stop splitting nodes.

Common criteria include:

- Maximum tree depth.
- Minimum number of samples required to split a node.
- Minimum impurity decrease for further splitting.
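In scikit-learn, the three stopping criteria listed above correspond to the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters. The sketch below compares an unrestricted tree with a restricted one. It uses synthetic stand-in data from `make_classification` (an assumption, not the 500 hits dataset), and the parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the 500 hits dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# No stopping criteria: the tree grows until every leaf is pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# The same tree with the three stopping criteria applied (arbitrary values).
restricted = DecisionTreeClassifier(
    max_depth=3,                 # maximum tree depth
    min_samples_split=20,        # minimum samples required to split a node
    min_impurity_decrease=0.01,  # minimum impurity decrease for a split
    random_state=0,
).fit(X_demo, y_demo)

print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("restricted:  ", restricted.get_depth(), "levels,", restricted.get_n_leaves(), "leaves")
```

A depth-3 binary tree can have at most 8 leaves, so the restricted model is always small enough to read by eye.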

#### Process of Building a Decision Tree

Select the best attribute to split the data using attribute selection criteria.

Split the data into subsets based on the attribute's values.

Repeat this process for each branch, choosing the best attribute to split the remaining data.

Stop splitting when a stopping criterion is met (e.g., max depth, minimum samples).

Assign a class label or value to the leaf nodes based on the majority class (classification) or average value (regression) of the data points in that node.
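The steps above can be sketched in plain Python. This is a minimal illustration and not how scikit-learn implements its trees: it assumes numeric features, binary threshold splits, Gini impurity as the selection criterion, and two stopping criteria (`max_depth` and `min_samples`). The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:  # keep only splits that reduce impurity
                best, best_score = (j, t), score
    return best


def build(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Recursively split until a stopping criterion is met, then return a leaf."""
    split = None
    if depth < max_depth and len(labels) >= min_samples:
        split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": build([r for r, _ in left], [y for _, y in left], depth + 1, max_depth, min_samples),
        "right": build([r for r, _ in right], [y for _, y in right], depth + 1, max_depth, min_samples),
    }


def predict(node, row):
    """Follow the tests from the root down to a leaf."""
    while isinstance(node, dict):
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node


# Hypothetical toy data: [hits, batting average] -> 1 (HOF) or 0.
rows = [[3000, 0.310], [2900, 0.300], [2500, 0.270], [2400, 0.260], [2700, 0.320], [2600, 0.255]]
labels = [1, 1, 0, 0, 1, 0]

tree = build(rows, labels)
print(tree)                          # {'feature': 0, 'threshold': 2600, 'left': 0, 'right': 1}
print(predict(tree, [2800, 0.305]))  # 1
```

On this toy data a single split on the first feature already separates the classes, so both children are pure and become leaves immediately.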

In [1]:
# Start with the 500 hits dataset
import pandas as pd
import warnings
warnings.filterwarnings("ignore")  # hides all warnings, including scikit-learn deprecation notices
# Load the data
hits = pd.read_csv(r"C:\Users\user\Downloads\500hits.csv", encoding='latin-1')
# Drop the player name and CS (caught stealing); neither is used as a predictor
hits = hits.drop(columns=['PLAYER', 'CS'])

## separate data

We split the data into X and y. The variable we are trying to predict, whether the player ended up in the Hall of Fame ('HOF'), is our y. The remaining columns are our predictors, X.

In [2]:
X = hits.iloc[:, 0:13]  # predictors: all rows, the first 13 columns (positions 0-12)

y = hits.iloc[:, 13]  # target: the Hall of Fame (HOF) column at position 13
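Selecting by position works as long as the column order never changes. A slightly more defensive variant, sketched here on a hypothetical three-column frame rather than the real CSV, selects the target by name instead.

```python
import pandas as pd

# Hypothetical miniature frame with the same layout: predictors first, HOF last.
demo = pd.DataFrame({"H": [3000, 2500], "BA": [0.310, 0.270], "HOF": [1, 0]})

X_demo = demo.drop(columns="HOF")  # every column except the target
y_demo = demo["HOF"]               # the target column, selected by name

print(list(X_demo.columns))  # ['H', 'BA']
print(y_demo.tolist())       # [1, 0]
```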

## train_test_split

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17, test_size=0.2)
# random_state=17 fixes the shuffle so the split is reproducible
# test_size=0.2 keeps 80% of the rows for training and 20% for testing
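The effect of `test_size` and `random_state` can be checked on a small array. The sketch below uses hypothetical data (100 rows, 30 of them positive). It also shows the optional `stratify` argument, which the split above does not use. Stratifying keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 hypothetical rows, 2 features
y_demo = np.array([0] * 70 + [1] * 30)   # 70 negatives, 30 positives

# The same random_state gives the same split every time.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=17)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=17)
print(X_tr.shape, X_te.shape)       # (80, 2) (20, 2)
print(np.array_equal(X_te, X_te2))  # True

# stratify=y_demo preserves the 70/30 class ratio in the test set.
_, _, _, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=17, stratify=y_demo
)
print(int(y_te_s.sum()), "positives out of", len(y_te_s))  # 6 positives out of 20
```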

#### Types of Decision Trees

Classification Trees:

Used when the target variable is categorical.
Predicts the class label for given input features.
Example: Predicting if a customer will purchase a product (Yes/No).

Regression Trees:

Used when the target variable is continuous.
Predicts a numerical value based on input features.
Example: Predicting the price of a house based on its features.

Decision trees are popular for their simplicity and interpretability, but they can become complex and prone to overfitting with noisy data. Pruning and ensemble methods like random forests or gradient boosting are often used to address these limitations.
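The rest of this notebook uses a classification tree. For contrast, here is a minimal regression tree on hypothetical house data (the sizes and prices are made up). With `max_depth=1` the tree makes a single split, and each leaf predicts the average price of the training rows that fall into it.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: house size in square metres -> price (arbitrary units).
size = np.array([[50], [60], [70], [120], [130], [140]])
price = np.array([100.0, 110.0, 120.0, 300.0, 310.0, 320.0])

# One split only: small houses on one side, large houses on the other.
reg = DecisionTreeRegressor(max_depth=1, random_state=0).fit(size, price)

# Each prediction is the mean price of the matching leaf.
print(reg.predict([[65], [125]]))  # [110. 310.]
```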

In [4]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

In [5]:
# Tip: get_params() lists every parameter of the estimator along with its current value
dtc.get_params()

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'random_state': None,
 'splitter': 'best'}

In [6]:
#Fit training data
dtc.fit(X_train, y_train)

In [7]:
#Prediction
y_pred = dtc.predict(X_test)

In [8]:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, y_pred))

[[51 10]
 [12 20]]


In [9]:
#Classification Report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.84      0.82        61
           1       0.67      0.62      0.65        32

    accuracy                           0.76        93
   macro avg       0.74      0.73      0.73        93
weighted avg       0.76      0.76      0.76        93
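The class 1 row and the accuracy in the report can be recomputed by hand from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. The variable names below are only for illustration.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class.
tn, fp = 51, 10  # actual 0: predicted 0, predicted 1
fn, tp = 12, 20  # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)                  # of the predicted HOF players, how many were right
recall = tp / (tp + fn)                     # of the actual HOF players, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of all 93 test rows predicted correctly

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
# 0.67 0.62 0.65 0.76
```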



In [10]:
# Which features had the biggest impact on the model? (one importance value per feature, summing to 1)
dtc.feature_importances_

array([0.03444801, 0.03355581, 0.03870641, 0.03806899, 0.38738161,
       0.05959745, 0.05013264, 0.        , 0.07925991, 0.09496146,
       0.03117801, 0.05632516, 0.09638455])

In [11]:
# A much easier way to read the importances: put them in a DataFrame indexed by feature name
features = pd.DataFrame(dtc.feature_importances_, index=X.columns)

In [12]:
# Hits (H) comes out as the most important feature, which fits: the dataset is the top 500 hitters in baseball, and we are predicting whether each made the Hall of Fame
features.head(15)

            0
YRS  0.034448
G    0.033556
AB   0.038706
R    0.038069
H    0.387382
2B   0.059597
3B   0.050133
HR   0.000000
RBI  0.079260
BB   0.094961
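Sorting makes the ranking easier to read than the column order. The sketch below hard-codes the ten values shown in the output above instead of refitting the model. With the fitted tree available, the same `sort_values` call would apply to a Series built from `dtc.feature_importances_`.

```python
import pandas as pd

# Values copied from the ten rows of the output above.
importances = pd.Series(
    [0.034448, 0.033556, 0.038706, 0.038069, 0.387382,
     0.059597, 0.050133, 0.0, 0.07926, 0.094961],
    index=["YRS", "G", "AB", "R", "H", "2B", "3B", "HR", "RBI", "BB"],
    name="importance",
)

# Largest first: H, then BB, then RBI among these ten rows.
top3 = importances.sort_values(ascending=False).head(3)
print(top3)
```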


#### Parameter Optimization
criterion='entropy' specifies the function used to measure the quality of a split at each node in the decision tree.

By setting criterion='entropy', the decision tree uses information gain based on entropy to evaluate splits. Entropy measures the impurity or randomness in a dataset. Information gain is the reduction in entropy after a dataset is split on an attribute, so each split is chosen to maximize that reduction.
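To make the entropy and information gain definitions concrete, here is a small standard-library sketch. The helper names and the toy labels are hypothetical. Gini impurity, the default criterion shown in the get_params() output earlier, is included for comparison.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy labels: 1 = Hall of Fame, 0 = not.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(gini(parent))                           # 0.5
print(entropy(parent))                        # 1.0
print(information_gain(parent, left, right))  # about 0.189
```

A 50/50 node is the worst case for both measures. The split shown moves each child to a 75/25 mix, which removes roughly 0.19 bits of uncertainty.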

ccp_alpha stands for Cost Complexity Pruning Alpha. It controls pruning of the decision tree, which helps prevent overfitting by removing branches that add complexity without significantly improving the predictive power of the model. Pruning works by adding a penalty, proportional to the number of leaves, to the tree's total impurity. As ccp_alpha increases, more branches are pruned, resulting in a simpler model.

With ccp_alpha=0.04, any sub-tree whose impurity reduction per additional leaf (its effective alpha) is below 0.04 is collapsed into a single leaf. This simplifies the tree structure and can make the model generalize better to new, unseen data.
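One way to see what a given ccp_alpha does is to list the effective alphas of a fully grown tree with `cost_complexity_pruning_path` and then count the leaves left at a few settings. The sketch below uses synthetic stand-in data (an assumption, not the 500 hits dataset), and the alpha values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the 500 hits dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Effective alphas at which successive sub-trees of the full tree get pruned.
path = DecisionTreeClassifier(criterion="entropy", random_state=0).cost_complexity_pruning_path(
    X_demo, y_demo
)
print(path.ccp_alphas[:5])

# A larger ccp_alpha prunes more, so the leaf count never goes up.
leaf_counts = []
for alpha in (0.0, 0.01, 0.04):
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha, random_state=0)
    tree.fit(X_demo, y_demo)
    leaf_counts.append(tree.get_n_leaves())
    print(alpha, tree.get_n_leaves())
```

In practice ccp_alpha is usually chosen by cross-validating over the values in `path.ccp_alphas` rather than fixed in advance.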

In [13]:
dtc2 = DecisionTreeClassifier(criterion='entropy', ccp_alpha=0.04)

In [14]:
dtc2.fit(X_train, y_train)

In [15]:
y_pred2 = dtc2.predict(X_test)

In [16]:
print(confusion_matrix(y_test, y_pred2))

[[50 11]
 [ 9 23]]


In [17]:
print(classification_report(y_test, y_pred2))

              precision    recall  f1-score   support

           0       0.85      0.82      0.83        61
           1       0.68      0.72      0.70        32

    accuracy                           0.78        93
   macro avg       0.76      0.77      0.77        93
weighted avg       0.79      0.78      0.79        93



Moving on to Random Forest with hyperparameter tuning. We continue with the 500 hits Hall of Fame data: above, we loaded the CSV file, dropped the PLAYER and CS columns, and split the data into X and y.

##### Random Forest train_test_split

In [19]:
from sklearn.model_selection import train_test_split
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X, y, random_state=21, test_size=0.2)
# random_state=21 fixes the shuffle so the split is reproducible
# test_size=0.2 keeps 80% of the rows for training and 20% for testing

In [22]:
from sklearn.ensemble import RandomForestClassifier

For the first model, we do no hyperparameter tuning.

In [24]:
rf = RandomForestClassifier()

rf.fit(X_train_rf, y_train_rf)

Building out Prediction

In [25]:
y_pred_rf_1 = rf.predict(X_test_rf)

In [26]:
rf.score(X_test_rf, y_test_rf)

0.7741935483870968

Classification Report

In [28]:
from sklearn.metrics import classification_report  # already imported above; repeated here for consistency
print(classification_report(y_test_rf, y_pred_rf_1))

              precision    recall  f1-score   support

           0       0.78      0.88      0.82        56
           1       0.77      0.62      0.69        37

    accuracy                           0.77        93
   macro avg       0.77      0.75      0.76        93
weighted avg       0.77      0.77      0.77        93



Feature Importance

In [29]:
features = pd.DataFrame(rf.feature_importances_, index=X.columns)
features.head(15)

            0
YRS  0.027702
G    0.050031
AB   0.079456
R    0.149169
H    0.127577
2B   0.051655
3B   0.051396
HR   0.058889
RBI  0.065624
BB   0.050098


Which features are most important for the model? Among the rows shown above, runs (R), hits (H), and at bats (AB) rank highest, while home runs (HR) were not as important. Batting average (not among the ten rows shown) had the largest influence on the model, likely because it was an important metric in 'old school' baseball and the reason many players were inducted into the Hall of Fame.

##### Hyperparameter Tuning

In [32]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
rf2 = RandomForestClassifier(random_state=21)

# Set up the hyperparameter grid
param_grid = {
    'n_estimators': [500, 1000, 1500],  # Number of trees in the forest
    'criterion': ['gini', 'entropy'],   # Split criterion
    'max_depth': [10, 14, 18, None],    # Maximum depth of each tree
    'min_samples_split': [2, 5, 10, 15],  # Minimum number of samples required to split
    'min_samples_leaf': [1, 2, 4],      # Minimum number of samples per leaf
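    # Note: 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3; on newer versions drop it from this list
    'max_features': ['auto', 'sqrt', 'log2'],  # Number of features to consider for best split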
}

# Initialize GridSearchCV with cross-validation
grid_search = GridSearchCV(
    estimator=rf2,
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation
    n_jobs=-1,  # Use all processors
    verbose=2,  # Print progress
    scoring='accuracy'  # Use accuracy as the scoring metric
)

# Fit the grid search on the training data only; the test split stays untouched
grid_search.fit(X_train_rf, y_train_rf)

# Get the best parameters and best score
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)


Fitting 5 folds for each of 864 candidates, totalling 4320 fits
Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 1500}
Best Score: 0.8628108108108108
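Best Score is the mean cross-validated accuracy of the best candidate on the training data, not a test-set result. A common follow-up is to score `best_estimator_` on the held-out test split. The sketch below does this on synthetic stand-in data with a deliberately tiny grid, both of which are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption; not the 500 hits dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

# Tiny illustrative grid: 2 x 2 = 4 candidates, 3 folds each.
search = GridSearchCV(
    RandomForestClassifier(random_state=21),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
best_rf = search.best_estimator_
test_accuracy = best_rf.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print("CV accuracy:    ", search.best_score_)
print("Test accuracy:  ", test_accuracy)
```

The cross-validated score and the test score usually differ somewhat. The test score is the fairer number to compare against the untuned forest's 0.774.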
