# Part 1 - Install

In [None]:
!pip install catboost

In [None]:
!pip install scikit-learn

# Part 2 - CatBoost on Iris Dataset

In [4]:
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Pool object for CatBoost (optional but recommended)
train_pool = Pool(X_train, y_train)
test_pool = Pool(X_test, y_test)

# Initialize the CatBoost Classifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=6, loss_function='MultiClass', verbose=0)

# Train the model
model.fit(train_pool)

# Make predictions
predictions = model.predict(test_pool)

# Evaluate model accuracy
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 1.00


## What Is the Pool Object in CatBoost?

The Pool object is CatBoost's container for a dataset: it bundles the features, labels, and related metadata used for training and evaluation. Its key features are:

1. Encapsulation of Data: The Pool object holds the input features, the target labels (for supervised tasks), and metadata such as which columns are categorical. Keeping these in one object means a single argument can be passed to training and prediction calls.
2. Support for Categorical Features: CatBoost handles categorical features without explicit encoding such as one-hot encoding. When you mark columns as categorical through the Pool's `cat_features` argument, CatBoost converts them into numerical representations during training.
3. Memory and Performance Optimization: A Pool stores the data in CatBoost's own internal format, so the same Pool can be reused across calls such as `fit` and `predict` without converting the raw input each time. This matters most with large datasets.
4. Additional Metadata: Besides features and labels, the Pool can store sample weights, group identifiers for ranking tasks, and baseline values, which makes it usable across classification, regression, and ranking tasks.
5. Convenience for Data Management: Because features, labels, and metadata travel together in one object, there is less risk of features and labels falling out of alignment when datasets are passed around, saved, or loaded.
