### 1. What is the estimated depth of a Decision Tree trained (unrestricted) on a one million instance training set?


An unrestricted Decision Tree is usually reasonably well balanced at the end of training, with roughly one leaf per training instance. A balanced binary tree with m leaves has a depth of log2(m), rounded up. For one million instances this gives log2(10^6) ≈ 20. In practice the tree is rarely perfectly balanced, so the actual depth is usually somewhat larger, and it also depends on the data, the number of features, and any stopping criteria.
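As a quick check of the arithmetic, here is a minimal sketch that computes the depth of a perfectly balanced binary tree with one leaf per instance. The helper name `estimated_depth` is hypothetical, and the balanced-tree assumption is an idealization, so treat the result as a lower-end estimate.

```python
import math


def estimated_depth(n_instances: int) -> int:
    """Depth of a perfectly balanced binary tree with one leaf per instance."""
    return math.ceil(math.log2(n_instances))


depth = estimated_depth(1_000_000)
print(depth)  # 20
```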

-------------

### 2. Is the Gini impurity of a node usually lower or higher than that of its parent? Is it always lower/greater, or is it usually lower/greater?

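A node's Gini impurity is generally lower than its parent's, but not always. The training algorithm chooses each split to minimize the weighted sum of the children's Gini impurities, so a single child can be more impure than its parent as long as the other child's decrease more than compensates.

For example, consider a node containing four instances of class A and one of class B. Its Gini impurity is 1 − (1/5)² − (4/5)² = 0.32. Suppose the data is one-dimensional and the instances are ordered A, B, A, A, A. The algorithm splits after the second instance, producing one child with A, B (Gini impurity 0.5, higher than the parent) and one child with A, A, A (Gini impurity 0). The weighted impurity is 2/5 × 0.5 + 3/5 × 0 = 0.2, which is lower than the parent's 0.32.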
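The following sketch works through one concrete case: a parent node with four instances of class A and one of class B, split into the children (A, B) and (A, A, A). The `gini` helper is written here for illustration and is not part of any library.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


parent = ["A", "B", "A", "A", "A"]
left, right = parent[:2], parent[2:]  # children: (A, B) and (A, A, A)

parent_gini = gini(parent)
left_gini = gini(left)
right_gini = gini(right)

# Weighted average of the children's impurities (the quantity a split minimizes)
weighted_gini = (len(left) * left_gini + len(right) * right_gini) / len(parent)

print(f"parent={parent_gini:.2f} left={left_gini:.2f} "
      f"right={right_gini:.2f} weighted={weighted_gini:.2f}")
```

The left child is more impure than the parent, yet the weighted impurity of the two children is still lower than the parent's.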

----------

### 3. Is it a good idea to reduce max_depth if a Decision Tree is overfitting the training set?

Reducing the maximum depth of a Decision Tree can be a good idea if the tree is overfitting the training set. 

Overfitting occurs when the Decision Tree becomes too complex and captures noise or irrelevant patterns in the training data, leading to poor generalization on unseen data. By reducing the maximum depth, you can apply regularization to the tree and improve its ability to generalize.
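To see the effect of this kind of regularization, a small sketch can compare an unrestricted tree with a depth-limited one. The dataset size, `random_state` values, and `max_depth=4` are arbitrary choices made for illustration, not tuned values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small noisy dataset (sizes and seeds are illustrative assumptions)
X, y = make_moons(n_samples=2000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for max_depth in (None, 4):  # None means the tree grows without a depth limit
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[max_depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={max_depth}: train={scores[max_depth][0]:.3f}, "
          f"test={scores[max_depth][1]:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting. Limiting the depth typically narrows that gap.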

------------

### 4. Is it a good idea to try scaling the input features if a Decision Tree underfits the training set?

If a Decision Tree underfits the training set, scaling the input features will not help.

Decision Trees are insensitive to feature scaling and centering. Each node compares a single feature against a threshold, so rescaling a feature only rescales the learned threshold and leaves the resulting partition of the training instances unchanged. To address underfitting, relax the tree's constraints instead, for example by increasing max_depth or max_leaf_nodes, or by lowering min_samples_split or min_samples_leaf.
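A minimal sketch of this insensitivity: train the same constrained tree on raw and on standardized features, then compare the predictions. The dataset, `max_depth=4`, and the seeds are assumptions chosen for illustration. Tiny floating-point effects could in principle cause rare disagreements, so the comparison reports an agreement fraction rather than asserting exact equality.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=2000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler().fit(X_train)

raw_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
raw_tree.fit(X_train, y_train)

scaled_tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scaled_tree.fit(scaler.transform(X_train), y_train)

raw_pred = raw_tree.predict(X_test)
scaled_pred = scaled_tree.predict(scaler.transform(X_test))

# Fraction of test instances on which the two trees agree
agreement = float(np.mean(raw_pred == scaled_pred))
print(f"Fraction of identical predictions: {agreement:.3f}")
```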

----------

### 5. How much time will it take to train another Decision Tree on a training set of 10 million instances if it takes an hour to train a Decision Tree on a training set with 1 million instances?


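Training a Decision Tree has a computational complexity of roughly O(n × m log(m)), where m is the number of training instances and n is the number of features. Multiplying the training set size by 10 therefore multiplies the training time by

K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m)

With m = 1 million, K = 10 × 7 / 6 ≈ 11.7, so training on 10 million instances should take roughly 11.7 hours.

This is only an estimate: the actual time also depends on the number of features, the hardware used, and the specific implementation of the algorithm.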
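The estimate can be reproduced with a few lines of Python, assuming training cost is proportional to n × m × log(m), so that the number of features n cancels out. The function name `training_time_ratio` is hypothetical.

```python
import math


def training_time_ratio(m_old: int, m_new: int) -> float:
    """Ratio of training times, assuming cost proportional to n * m * log(m)."""
    return (m_new * math.log(m_new)) / (m_old * math.log(m_old))


ratio = training_time_ratio(1_000_000, 10_000_000)
estimated_hours = 1.0 * ratio  # baseline: 1 hour for 1 million instances
print(f"{estimated_hours:.1f} hours")  # 11.7 hours
```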

----------

### 6. Will setting presort=True speed up training if your training set has 100,000 instances?

No. Setting presort=True speeds up training only for small training sets (fewer than a few thousand instances). With 100,000 instances, it will slow training down considerably and increase memory usage.

When presort=True, the algorithm sorts the data along each feature once before building the tree, instead of sorting at each node when searching for the best split. This pays off on small datasets, but the up-front sort becomes expensive on larger ones. Note that the presort parameter was deprecated in Scikit-Learn 0.22 and removed in 0.24, so it is only available in older versions.

----------

### 7. Follow these steps to train and fine-tune a Decision Tree for the moons dataset:

a. To build a moons dataset, use make_moons(n_samples=10000, noise=0.4).

b. Divide the dataset into a training set and a test set with train_test_split().

c. To find good hyperparameter values for a DecisionTreeClassifier, use grid search with cross-validation (with the GridSearchCV class). Try different values for max_leaf_nodes.

d. Use these hyperparameters to train the model on the entire training set, and then assess its performance on the test set. You can achieve an accuracy of 85 to 87 percent.


In [1]:
```python
# Step 1: Import necessary libraries and generate the moons dataset
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Generate the moons dataset with 10000 samples and noise=0.4
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train a baseline Decision Tree classifier on the training set
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

# Step 4: Fine-tune the Decision Tree hyperparameters using cross-validation
# Note: the exercise suggests tuning max_leaf_nodes; this grid tunes the
# depth and the split/leaf sizes instead.
param_grid = {
    'max_depth': [None, 5, 10, 15],  # You can add more values to explore different depths
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV clones the estimator, so the baseline fit above is not reused.
# With the default refit=True, the best model is retrained on the full training set.
grid_search = GridSearchCV(decision_tree, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Step 5: Evaluate the best model on the testing set
best_decision_tree = grid_search.best_estimator_
y_pred = best_decision_tree.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
```

Test Accuracy: 0.8615
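Since step c of the exercise asks for different values of max_leaf_nodes, here is a hedged sketch of that variant. The value ranges in the grid and the use of 3-fold cross-validation are illustrative assumptions, not values prescribed by the exercise, and the resulting accuracy will depend on them.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid: the ranges below are assumptions, not prescribed values
param_grid = {
    "max_leaf_nodes": list(range(2, 51, 4)),
    "min_samples_split": [2, 3, 4],
}

leaf_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
leaf_search.fit(X_train, y_train)  # refit=True retrains the best model on X_train

leaf_accuracy = accuracy_score(y_test, leaf_search.predict(X_test))
print("Best hyperparameters:", leaf_search.best_params_)
print("Test accuracy:", leaf_accuracy)
```

Limiting max_leaf_nodes regularizes the tree directly through the number of regions it can carve out, which is often easier to reason about than a depth limit.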


---------

### 8. Follow these steps to grow a forest:

a. Using the same method as before, create 1,000 subsets of the training set, each containing 100 instances chosen at random. You can do this with Scikit-Learn's ShuffleSplit class.

b. Using the best hyperparameter values found in the previous exercise, train one Decision Tree on each subset. Evaluate these 1,000 Decision Trees on the test set. These Decision Trees will likely perform worse than the first Decision Tree, achieving only around 80% accuracy, since they were trained on smaller sets.

c. Now the magic begins. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most common prediction (you can do this with SciPy's mode() function). This gives you majority-vote predictions over the test set.

d. Evaluate these predictions on the test set: you should achieve a slightly higher accuracy than the first model (approx 0.5 to 1.5 percent higher). You've successfully trained a Random Forest classifier!

In [3]:
```python
# Step 1: Import necessary libraries
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Step 2: Define the number of subsets and their size
num_subsets = 1000
subset_size = 100

# Step 3: Create a ShuffleSplit object
shuffle_split = ShuffleSplit(n_splits=num_subsets, train_size=subset_size, random_state=42)

# Step 4: Generate the subsets
subsets_X = []
subsets_y = []

for train_indices, _ in shuffle_split.split(X_train):
    X_subset = X_train[train_indices]
    y_subset = y_train[train_indices]
    subsets_X.append(X_subset)
    subsets_y.append(y_subset)

# At this point, you have 1,000 subsets in subsets_X and subsets_y.
# Each subset contains 100 randomly selected instances from the training set.
# You can now train 1,000 Decision Trees, one for each subset.

# Hyperparameters from the previous exercise, hard-coded here.
# These should match grid_search.best_params_ from the previous cell.
best_params = {'max_depth': 10, 'min_samples_split': 5, 'min_samples_leaf': 2}

# Train one Decision Tree on each subset
decision_trees = []
for X_subset, y_subset in zip(subsets_X, subsets_y):
    dt = DecisionTreeClassifier(**best_params, random_state=42)
    dt.fit(X_subset, y_subset)
    decision_trees.append(dt)

# Collect each tree's predictions on the test set
ensemble_predictions = []
for dt in decision_trees:
    y_pred = dt.predict(X_test)
    ensemble_predictions.append(y_pred)

# Combine the predictions using majority voting:
# rows are trees, columns are test instances
ensemble_predictions = np.array(ensemble_predictions)
final_predictions = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=0, arr=ensemble_predictions)

# Calculate accuracy
accuracy = accuracy_score(y_test, final_predictions)
print("Ensemble Accuracy:", accuracy)
```


Ensemble Accuracy: 0.87
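Step b also asks how the 1,000 trees perform individually, which the cell above does not report. The following self-contained sketch rebuilds the data and the subsets, then compares the mean individual accuracy with the majority-vote accuracy. It assumes the same hard-coded hyperparameters as above, and the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Assumed hyperparameters (the same hard-coded values as in the cell above)
params = {"max_depth": 10, "min_samples_split": 5, "min_samples_leaf": 2}

splitter = ShuffleSplit(n_splits=1000, train_size=100, random_state=42)
all_predictions = []
individual_scores = []
for subset_idx, _ in splitter.split(X_train):
    tree = DecisionTreeClassifier(**params, random_state=42)
    tree.fit(X_train[subset_idx], y_train[subset_idx])
    pred = tree.predict(X_test)
    all_predictions.append(pred)
    individual_scores.append(accuracy_score(y_test, pred))

mean_individual = float(np.mean(individual_scores))

# Majority vote for binary labels 0/1: predict 1 when more than half the trees do
votes = np.array(all_predictions)  # shape: (n_trees, n_test_instances)
majority = (votes.mean(axis=0) > 0.5).astype(int)
ensemble_score = accuracy_score(y_test, majority)

print("Mean individual accuracy:", mean_individual)
print("Majority-vote accuracy:", ensemble_score)
```

Each tree sees only 100 instances, so individual accuracy is expected to be lower than that of the tree trained on the full training set. The vote averages out much of the trees' individual errors.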


In [4]:
```python
# Import mode function from SciPy
from scipy.stats import mode

# Array of predictions from all Decision Trees on the test set
# (rows are trees, columns are test instances)
ensemble_predictions = np.array(ensemble_predictions)

# Perform majority voting to get the final predictions
final_predictions, _ = mode(ensemble_predictions, axis=0)

# Convert the final_predictions to a 1D array
final_predictions = final_predictions.ravel()

# Calculate the accuracy of the majority-vote ensemble on the test set
ensemble_accuracy = accuracy_score(y_test, final_predictions)
print("Random Forest Ensemble Accuracy:", ensemble_accuracy)
```

Random Forest Ensemble Accuracy: 0.87


The cell above uses SciPy's mode() function to take a majority vote across all 1,000 Decision Trees for each instance in the test set, and it gives the same accuracy as the NumPy bincount version. Thanks to the ensemble effect, the majority-vote accuracy (0.87) is slightly higher than that of the first Decision Tree (0.8615), in line with the expected gain of 0.5 to 1.5 percent.