# Ensemble Techniques and Its Types-4 Assignment
"""Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the
numerical features if necessary."""
Ans: Preprocessing has three steps: handle missing values, encode categorical variables, and scale
numerical features where needed. Here is a general outline of the steps you can follow:

#Load the dataset into a pandas DataFrame.
import pandas as pd
data = pd.read_csv("dataset.csv")  # Replace "dataset.csv" with the actual filename and path
data.head()  # Show the first 5 rows

#Handle missing values:
#Identify missing values in the dataset:
missing_values = data.isnull().sum()

Decide how to handle missing values based on the nature of the data:
If only a few values are missing in a column, you can either drop the affected rows or impute the
missing values with the mean or median (numerical columns) or the mode (categorical columns).
If most of the values in a column are missing, you might choose to drop that column altogether.
Then apply the chosen method:

# Example: impute missing values in numerical columns with the column mean
# (numeric_only=True skips text columns, which have no mean)
data.fillna(data.mean(numeric_only=True), inplace=True)
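The other options described above (dropping a mostly empty column, median and mode imputation) can be sketched on a small made-up DataFrame. The column names, the values, and the 50% threshold are assumptions for illustration only, not part of the assignment dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the assignment dataset)
toy = pd.DataFrame({
    "age": [63.0, np.nan, 41.0, 56.0],
    "sex": ["M", "F", None, "F"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Drop columns where more than half of the values are missing (assumed threshold)
missing_fraction = toy.isnull().mean()
toy = toy.drop(columns=missing_fraction[missing_fraction > 0.5].index)

# Impute numerical columns with the median and all other columns with the mode
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The median is often preferred over the mean when a column contains outliers, since a few extreme values do not shift it much.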


#Encode categorical variables:
#Identify categorical variables in the dataset:
categorical_vars = data.select_dtypes(include=['object']).columns


Choose an appropriate encoding method:
One-Hot Encoding: If the categorical variable has no inherent order or hierarchy, you can use one-hot 
encoding to create binary columns for each category.
Label Encoding: If the categorical variable has an inherent order or hierarchy, you can use label
(ordinal) encoding to convert categories into integers that preserve that order.
Apply the chosen encoding method to categorical variables:
    
# Example: One-Hot Encoding
data_encoded = pd.get_dummies(data, columns=categorical_vars)
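For an ordered category, one possible sketch is an explicit mapping from category to integer. The column name and its levels are hypothetical. A plain dictionary is used here instead of scikit-learn's LabelEncoder because LabelEncoder sorts the categories alphabetically, which would not respect the intended order.

```python
import pandas as pd

# Hypothetical ordered category (not from the assignment dataset)
toy = pd.DataFrame({"pain_level": ["mild", "severe", "moderate", "mild"]})

# State the order explicitly so the integers reflect it
order = {"mild": 0, "moderate": 1, "severe": 2}
toy["pain_level_encoded"] = toy["pain_level"].map(order)

print(toy)
```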

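#Scale numerical features if necessary:
#Identify numerical features in the dataset, excluding the target column so the class labels are not
#scaled (replace "target" with the actual name of the target column):
numerical_vars = data.select_dtypes(include=['int64', 'float64']).columns.drop('target', errors='ignore')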

#Choose an appropriate scaling method:
Standardization: Scales the features to have zero mean and unit variance.
Min-Max Scaling: Scales the features to a specific range, typically between 0 and 1.
Apply the chosen scaling method to numerical features:
    
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_encoded[numerical_vars] = scaler.fit_transform(data_encoded[numerical_vars])
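If Min-Max scaling is chosen instead of standardization, the same pattern applies with MinMaxScaler. This is a minimal sketch on made-up values; the column names are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical columns (not from the assignment dataset)
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0], "chol": [180.0, 240.0, 300.0]})

# Rescale every column to the range [0, 1]
minmax = MinMaxScaler()
toy_scaled = pd.DataFrame(minmax.fit_transform(toy), columns=toy.columns)

print(toy_scaled)
```

In each column the smallest value maps to 0, the largest to 1, and the middle value here lands at 0.5.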


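The resulting preprocessed dataset, data_encoded, can be used for further analysis, such as model training
and evaluation. Scaling is optional for the random forest used below, since tree-based models are not
sensitive to the scale of the features.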

"""Q2. Split the dataset into a training set (70%) and a test set (30%)."""
Ans: To split the dataset into a training set and a test set, you can use the train_test_split function
from the scikit-learn library. Here's how you can do it:

from sklearn.model_selection import train_test_split

# Assuming you have a preprocessed dataset named "data_encoded"

# Separate the features (X) from the target variable (y)
# Replace "target" with the actual name of the target variable column
X = data_encoded.drop('target', axis=1)
y = data_encoded['target']

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In the above code:

X represents the feature matrix (all the columns except the target variable).
y represents the target variable.
test_size=0.3 specifies that 30% of the data will be used for testing, and 70% will be used for training. 
You can adjust this ratio as needed.
random_state=42 sets a specific random seed for reproducibility. Change this value or remove it to have a 
different random split each time.
After executing the code, you will have the following variables:

X_train: The training set features.
X_test: The test set features.
y_train: The training set target variable.
y_test: The test set target variable.

You can then use these sets for training and evaluating your machine learning models.
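To check the split sizes, here is a small sketch on synthetic data. make_classification stands in for the real dataset, and stratify is an optional addition, not part of the answer above, that keeps the class proportions similar in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and target (assumption)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=42)

# stratify=y_demo keeps the class proportions similar in both sets (optional)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (70, 5) (30, 5)
```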

"""Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for 
each tree. Use the default values for other hyperparameters."""

Ans:
To train a Random Forest Classifier on the training set with 100 trees and a maximum depth of 10 for each
tree, you can use the RandomForestClassifier class from scikit-learn. Here's an example of how to do it:
    
from sklearn.ensemble import RandomForestClassifier

# Assuming you have split the dataset into X_train and y_train

# Instantiate the Random Forest Classifier with the desired hyperparameters
rf_classifier = RandomForestClassifier(n_estimators=100, max_depth=10)

# Train the classifier on the training set
rf_classifier.fit(X_train, y_train)

In the above code:

n_estimators=100 specifies the number of trees in the random forest. You can adjust this value based on 
your specific requirements.
max_depth=10 sets the maximum depth of each decision tree in the random forest. This parameter controls 
the complexity of the trees and helps prevent overfitting. Adjust this value as needed.

X_train represents the training set features.
y_train represents the training set target variable.
After executing the code, the rf_classifier object will be trained on the training set using the specified 
hyperparameters.

You can then use this trained classifier to make predictions on new data or evaluate its performance on 
the test set.
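To see what the two hyperparameters control, you can inspect a fitted forest. This is a sketch on synthetic data (an assumption, used only so the snippet runs on its own); random_state is added only to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train and y_train (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X_demo, y_demo)

# estimators_ holds the individual fitted decision trees
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)

print("Number of trees:", n_trees)
print("Depth of the deepest tree:", deepest)
print("Predicted classes for the first 3 rows:", forest.predict(X_demo[:3]))
```

The forest contains exactly n_estimators trees, and no tree grows deeper than max_depth.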

"""Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 
score."""
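Ans:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
df = pd.read_csv("dataset.csv")

# Separate the features from the target so the label does not leak into the inputs
X = df.drop("target", axis=1)
y = df["target"]

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)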

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10)

# Train the classifier on the training set
rf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)

# Evaluate the precision of the classifier
precision = precision_score(y_test, y_pred)

# Evaluate the recall of the classifier
recall = recall_score(y_test, y_pred)

# Evaluate the F1 score of the classifier
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
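To make the four metrics concrete, they can be computed by hand from the confusion matrix and compared with the scikit-learn functions. The labels below are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true and predicted labels (1 = disease, 0 = no disease)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision_manual = tp / (tp + fp)                   # share of predicted positives that are real
recall_manual = tp / (tp + fn)                      # share of real positives that are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Accuracy:", accuracy_manual, accuracy_score(y_true, y_hat))
print("Precision:", precision_manual, precision_score(y_true, y_hat))
print("Recall:", recall_manual, recall_score(y_true, y_hat))
print("F1 score:", f1_manual, f1_score(y_true, y_hat))
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, so accuracy is 0.8 and precision, recall, and F1 are all 0.75.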


"""Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart
disease risk. Visualise the feature importances using a bar chart."""
Ans:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("dataset.csv")

# Separate the features from the target so the label does not leak into the inputs
X = df.drop("target", axis=1)
y = df["target"]

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a random forest classifier
rf = RandomForestClassifier(n_estimators=100, max_depth=10)

# Train the classifier on the training set
rf.fit(X_train, y_train)

# Get the feature importances (one score per feature column)
feature_importances = rf.feature_importances_

# Sort the feature indices by importance in descending order
sorted_idx = np.argsort(feature_importances)[::-1]

# Get the top 5 features; X.columns matches the columns the model was trained on
top_5_features = X.columns[sorted_idx][:5]

# Print the top 5 features
print("Top 5 features:", list(top_5_features))

# Visualize the feature importances using a horizontal bar chart
plt.barh(top_5_features, feature_importances[sorted_idx][:5])
plt.gca().invert_yaxis()  # Put the most important feature at the top
plt.xlabel("Feature importance")
plt.ylabel("Feature")
plt.show()
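An alternative way to pair importances with feature names is a pandas Series. This is a sketch on synthetic data; the feature names are placeholders, not the columns of the heart disease dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with placeholder feature names (assumption)
X_arr, y_demo = make_classification(n_samples=300, n_features=8, n_informative=4,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
forest.fit(X_demo, y_demo)

# Index the importances by column name, then take the 5 largest
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
top5 = importances.nlargest(5)

print(top5)
```

Keeping the names attached to the scores avoids index mismatches between the importances and the column list.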


"""Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try
different values of the number of trees, maximum depth, minimum samples split, and minimum samples
leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters."""
Ans:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
df = pd.read_csv("dataset.csv")

# Separate the features from the target so the label does not leak into the inputs
X = df.drop("target", axis=1)
y = df["target"]

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the hyperparameters to tune
params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# Create a grid search object
grid_search = GridSearchCV(RandomForestClassifier(), params, cv=5, scoring="accuracy")

# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters
print(grid_search.best_params_)

# Make predictions on the test set using the best parameters
y_pred = grid_search.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
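The question also allows random search. A minimal sketch with RandomizedSearchCV follows, which samples a fixed number of hyperparameter combinations instead of trying every one. The synthetic data, the sampling ranges, and n_iter=5 are assumptions chosen to keep the example quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# randint(a, b) samples integers from a to b - 1
param_distributions = {
    "n_estimators": randint(20, 101),
    "max_depth": randint(3, 16),
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 6),
}

# Try 5 random combinations, each scored with 5-fold cross-validation
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Random search is usually cheaper than a full grid when the grid is large, at the cost of not covering every combination.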

"""Q7. Report the best set of hyperparameters found by the search and the corresponding performance
metrics. Compare the performance of the tuned model with the default model."""

Ans:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Load the data
X = np.loadtxt('data.txt', delimiter=',')
y = np.loadtxt('labels.txt', delimiter=',')

# Create a grid of hyperparameters to search over
# (learning_rate is not a RandomForestClassifier parameter, so it is not included)
param_grid = {
    'n_estimators': [10, 100, 1000],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10]
}

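# Hold out a test set so both models are compared on data not seen during tuning
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a RandomForestClassifier model
clf = RandomForestClassifier(random_state=42)

# Perform grid search with 5-fold cross-validation to find the best hyperparameters
grid_search = GridSearchCV(clf, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train)

# Print the best hyperparameters found by the search and their cross-validated accuracy
best_params = grid_search.best_params_
print(best_params)
print('Best cross-validated accuracy:', grid_search.best_score_)

# The grid search refits the best model on the whole training set
tuned_clf = grid_search.best_estimator_

# Train a model with the default hyperparameters for comparison
default_clf = RandomForestClassifier(random_state=42)
default_clf.fit(X_train, y_train)

# Evaluate both models on the test set and print the performance metrics
for name, model in [('Default', default_clf), ('Tuned', tuned_clf)]:
    y_pred = model.predict(X_test)
    print(name, 'model')
    print('Accuracy:', accuracy_score(y_test, y_pred))
    print('F1 score:', f1_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))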

"""Q8. Interpret the model by analysing the decision boundaries of the random forest classifier. Plot the
decision boundaries on a scatter plot of two of the most important features. Discuss the insights and
limitations of the model for predicting heart disease risk."""
Ans:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from matplotlib import pyplot as plt

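# Load the data
X = np.loadtxt('data.txt', delimiter=',')
y = np.loadtxt('labels.txt', delimiter=',')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Create a RandomForestClassifier model
clf = RandomForestClassifier()

# Train the model
clf.fit(X_train, y_train)

# Get the feature importances
feature_importances = clf.feature_importances_

# Get the indices of the two most important features (most important first)
most_important_features = np.argsort(feature_importances)[::-1][:2]
i, j = most_important_features

# Refit a forest on just these two features so it can be evaluated on a 2-D grid
# (this approximates the full model, which uses all of the features)
clf_2d = RandomForestClassifier()
clf_2d.fit(X_train[:, [i, j]], y_train)

# Predict the class at every point of a grid covering the two features
xx, yy = np.meshgrid(np.linspace(X_train[:, i].min(), X_train[:, i].max(), 200),
                     np.linspace(X_train[:, j].min(), X_train[:, j].max(), 200))
zz = clf_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot the decision regions with the training points on top
plt.contourf(xx, yy, zz, alpha=0.3, cmap='cool')
plt.scatter(X_train[:, i], X_train[:, j], c=y_train, cmap='cool', edgecolor='k')
plt.xlabel(f'Feature {i}')
plt.ylabel(f'Feature {j}')
plt.title('Decision Boundaries of Random Forest Classifier')
plt.show()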

# Evaluate the performance of the model on the testing set
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Print the performance metrics
print('Accuracy:', accuracy)
print('F1 score:', f1)
print('Precision:', precision)
print('Recall:', recall)

Insights:
The decision boundaries of a random forest are non-linear (each tree splits the feature space into
axis-aligned regions), so the model can capture non-linear relationships between the features and the
target variable.
A high accuracy on the testing set suggests that the model generalizes well to new data. Precision and
recall should be read alongside it, since accuracy alone can mislead when the classes are imbalanced.
The feature importances single out two features as the most important for predicting heart disease risk:
age and systolic blood pressure.

Limitations:
Feature importances rank only the features present in the dataset, so risk factors that were not recorded
cannot be identified.
The importances show which features the model relies on, not why they matter, and they do not establish
causation.
Predictions are unreliable for individuals whose values for the two most important features lie far
outside the range seen in training, because tree-based models cannot extrapolate beyond the training data.

Overall, the model can be a useful tool for predicting heart disease risk, but it is not a perfect one. It
should be used in conjunction with other tools, such as clinical judgment, to make decisions about
individual patients.

