<a href="https://colab.research.google.com/github/Virendrashah02/first-repo/blob/main/data_prepration_phase_to_model_the_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

4. DATA PREPARATION PHASE TO MODEL THE DATA

1. Partitioning the Data

Partitioning data into training, validation, and test sets is crucial for building machine learning models. The training set is used to train the model, the validation set is used to tune hyperparameters, and the test set evaluates the model's performance.

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Create a sample dataset
data = pd.DataFrame({
    'Feature1': np.random.rand(100),  # Random numbers between 0 and 1
    'Feature2': np.random.rand(100),
    'Label': np.random.choice([0, 1], size=100)  # Binary classification labels
})

# Partition the data: 70% training, 15% validation, 15% test
train_data, temp_data = train_test_split(data, test_size=0.3, random_state=42)
validation_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

print("Training Data Size:", train_data.shape)
print("Validation Data Size:", validation_data.shape)
print("Test Data Size:", test_data.shape)


Training Data Size: (70, 3)
Validation Data Size: (15, 3)
Test Data Size: (15, 3)


1. train_test_split: Splits the data into training and temporary datasets. The temporary data is then split again into validation and test sets.

2. Input Data: Feature1 and Feature2 are features, while Label is the target.

3. Proportions: 70% of the data is used for training (train_data). The remaining 30% is split equally into validation (validation_data) and test (test_data) sets, giving 15% each.

4. Output: The shapes of each dataset confirm the splits: with 100 rows, train_data has 70 rows, and validation_data and test_data have 15 rows each.
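Because the labels in the sample above are drawn at random, the class mix can drift between partitions. The following is a minimal sketch, assuming the same 70/15/15 proportions, of passing `stratify` to `train_test_split` so that each partition keeps roughly the original class ratio. The exactly balanced label vector and the variable names are illustrative assumptions, not part of the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (assumption): 100 rows, two features, exactly 50/50 labels
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = np.array([0] * 50 + [1] * 50)

# First split: 70% training, 30% held back, preserving the class ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Second split: divide the held-back 30% equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

# Report the size and class-1 share of each partition
for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    print(name, "size:", len(labels), "share of class 1:", round(labels.mean(), 2))
```

Stratification matters most when one class is rare, since an unstratified split can leave a small validation or test set with very few minority examples.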


2. Balancing the Training Dataset

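Imbalanced data occurs when one class significantly outnumbers the other, which can bias a model toward the majority class. Balancing equalizes class representation and can be achieved using oversampling (e.g., SMOTE), undersampling, or other techniques. In practice, resampling is applied to the training partition only, so that the validation and test sets keep the original class distribution.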

In [2]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import pandas as pd

# Create an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=5, n_clusters_per_class=1, n_samples=200, random_state=42)

# Before balancing
print("Before Balancing:", pd.Series(y).value_counts())

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# After balancing
print("After Balancing:", pd.Series(y_resampled).value_counts())


Before Balancing: 1    180
0     20
Name: count, dtype: int64
After Balancing: 1    180
0    180
Name: count, dtype: int64


1. Imbalanced Dataset:
make_classification generates synthetic data with 90% of samples in one class (weights=[0.1, 0.9]).
pd.Series(y).value_counts() shows the original class distribution.

2. SMOTE:
SMOTE generates synthetic samples for the minority class to balance the dataset.
fit_resample() returns a new dataset with balanced class representation.

3. Output:
Before Balancing: Shows the dataset is imbalanced (180 instances of class 1, 20 of class 0).
After Balancing: Confirms equal class distribution (180:180).
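The cell above resamples the whole dataset. A minimal sketch of balancing only the training partition is shown below, so that the test set keeps its original class distribution. Note that it swaps in plain random oversampling via `sklearn.utils.resample` rather than SMOTE, to avoid the extra `imblearn` dependency. The 70/30 split and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Same imbalanced synthetic dataset as in the cell above
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=5, n_clusters_per_class=1,
                           n_samples=200, random_state=42)

# Split first (assumed 70/30), so the test set is never resampled
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Identify the minority class in the training partition
train_counts = np.bincount(y_train)
minority_label = train_counts.argmin()
minority_mask = y_train == minority_label
n_majority = int(train_counts.max())

# Random oversampling (not SMOTE): draw minority rows with replacement
X_min_up, y_min_up = resample(
    X_train[minority_mask], y_train[minority_mask],
    replace=True, n_samples=n_majority, random_state=42
)

# Recombine the untouched majority rows with the oversampled minority rows
X_train_bal = np.vstack([X_train[~minority_mask], X_min_up])
y_train_bal = np.concatenate([y_train[~minority_mask], y_min_up])

print("Training counts before:", np.bincount(y_train))
print("Training counts after: ", np.bincount(y_train_bal))
print("Test counts (untouched):", np.bincount(y_test))
```

Random oversampling duplicates existing minority rows, whereas SMOTE interpolates new ones; either way, the resampling step should see only training data.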

3. Building CART Decision Trees

CART (Classification and Regression Trees) recursively splits the data into two subsets at each node, choosing the split that most reduces Gini impurity for classification or mean squared error for regression.
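To make the splitting criterion concrete, here is a small sketch of Gini impurity, computed as 1 minus the sum of squared class proportions, and of the size-weighted impurity used to score a candidate split. Only the classification case is shown. The helper names `gini_impurity` and `weighted_gini` are hypothetical and written for illustration; they are not scikit-learn functions.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Score of a candidate split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

parent = [0, 0, 0, 1, 1, 1]
print(gini_impurity(parent))                          # 0.5
print(weighted_gini([0, 0, 0], [1, 1, 1]))            # 0.0 (perfect split)
print(round(weighted_gini([0, 0, 1], [0, 1, 1]), 3))  # 0.444 (weak split)
```

A tree builder would evaluate many candidate thresholds and keep the one with the lowest weighted impurity.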

In [3]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Build a CART decision tree
cart_model = DecisionTreeClassifier(criterion='gini', random_state=42)
cart_model.fit(X, y)

# Predictions
predictions = cart_model.predict(X)
print(classification_report(y, predictions))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        50
           2       1.00      1.00      1.00        50

    accuracy                           1.00       150
   macro avg       1.00      1.00      1.00       150
weighted avg       1.00      1.00      1.00       150



1. Dataset:
load_iris() provides the Iris dataset with 4 features and 3 classes.
X: Features, y: Labels.
2. CART Model:
DecisionTreeClassifier: Builds a classification tree.
criterion='gini': Uses Gini impurity to split nodes.
3. Evaluation:
classification_report: Reports precision, recall, and F1-score for each class.
4. Output:
Precision, recall, and F1-score indicate how well the tree classifies the data.
The perfect scores (1.00) arise because the tree is evaluated on the same data it was trained on. An unpruned tree can memorize its training set, so these scores do not measure performance on unseen data.
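A minimal sketch of evaluating the same kind of tree on held-out data is given below, which gives a more honest estimate than scoring on the training set. The 70/30 stratified split is an assumption chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris features and labels
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows (assumed proportion) for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the CART tree on the training partition only
cart_model = DecisionTreeClassifier(criterion="gini", random_state=42)
cart_model.fit(X_train, y_train)

# Compare accuracy on seen versus unseen rows
train_acc = accuracy_score(y_train, cart_model.predict(X_train))
test_acc = accuracy_score(y_test, cart_model.predict(X_test))
print("Training accuracy:", train_acc)
print("Held-out accuracy:", test_acc)
```

A gap between the two numbers is the usual symptom of overfitting; limiting tree growth with parameters such as `max_depth` or `min_samples_leaf` is one common way to narrow it.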


     ---------------------------------------------------------------------------------------------------------------------------------------------------

4. Building C5.0 Decision Trees
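C5.0 is an improvement over C4.5 but is not directly supported in scikit-learn. Instead, we approximate it with an entropy-based DecisionTreeClassifier. This mimics only the entropy-based split criterion; the result is still a scikit-learn tree, not a C5.0 implementation.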
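To show what an entropy-based split measures, here is a short sketch of Shannon entropy and the information gain of a binary split. The helper names `entropy` and `information_gain` are hypothetical, written for illustration. Plain information gain is shown; C4.5-family algorithms normalize it into a gain ratio, which is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of -p_k * log2(p_k) over class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 1, 1]
print(entropy(parent))                             # 1.0
print(information_gain(parent, [0, 0], [1, 1]))    # 1.0 (perfect split)
print(information_gain(parent, [0, 1], [0, 1]))    # 0.0 (uninformative split)
```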

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# Generate a small dataset
X = np.random.rand(100, 2)  # Two random features
y = np.random.choice([0, 1], size=100)  # Binary target

# Build a C5.0-like Decision Tree
model = DecisionTreeClassifier(criterion='entropy', random_state=42)
model.fit(X, y)

# Predictions
y_pred = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred))


Accuracy: 1.0


1. Entropy:
Unlike criterion='gini', criterion='entropy' selects splits by information gain (the reduction in entropy).
2. Model Training:
Trains on the generated dataset (X, y).
3. Evaluation:
accuracy_score: Measures the proportion of correct predictions.
4. Output:
Shows training accuracy. Because the labels here are random, an accuracy of 1.0 means the tree has memorized the training data, which is a sign of overfitting.
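One way to check whether a high training accuracy reflects real signal is cross-validation. The sketch below, assuming 5 folds and a seeded random generator for reproducibility, compares training accuracy with the mean cross-validated accuracy on random labels like those above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Random features and random binary labels: there is no real signal to learn
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = rng.integers(0, 2, size=100)

model = DecisionTreeClassifier(criterion="entropy", random_state=42)

# Accuracy on the same rows the tree was fitted on
train_acc = model.fit(X, y).score(X, y)

# Accuracy on rows held out in each of 5 folds (assumed fold count)
cv_scores = cross_val_score(model, X, y, cv=5)

print("Training accuracy:", train_acc)
print("Mean 5-fold CV accuracy:", cv_scores.mean())
```

With random labels the cross-validated accuracy should hover around chance level, while the training accuracy stays at 1.0.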


5. Summary

* Partitioning: Splits data into training, validation, and test sets for unbiased evaluation.
* Balancing: Equalizes class representation in the training data so the model is not biased toward the majority class.
* CART Trees: Binary decision trees that split on Gini impurity for classification or mean squared error for regression.
* C5.0 Trees: Entropy-based trees with improved pruning mechanisms, approximated here with scikit-learn's criterion='entropy'.
* Random Forests: Ensemble method that reduces overfitting and improves accuracy.