# Machine Learning: Classification Model

Classification is a type of supervised machine learning where the goal is to predict a categorical label (also known as a class) based on a set of features or predictors.

A classification model takes a set of input features and maps them to one of several possible outputs or class labels. The model is trained on a labeled dataset where the correct class labels are known, and then used to make predictions on new, unseen data. The model learns to associate certain feature values with specific class labels, so that when new data is presented to the model, it can predict which class the data belongs to.

There are many algorithms for building classification models, including logistic regression, k-nearest neighbors, decision trees, random forests, and support vector machines, among others. The choice of algorithm depends on the problem, the type of data, and the computational resources available.
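One practical way to choose between algorithms is to score a few candidates with cross-validation on the same data. The snippet below is a minimal sketch of that idea, not a recommendation for any particular model. The dataset comes from scikit-learn's `make_classification`, and the names `X_cmp`, `y_cmp`, `candidates`, and `scores` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only, not real data)
X_cmp, y_cmp = make_classification(n_samples=200, n_features=8, random_state=0)

# A few candidate algorithms with mostly default settings
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {
    name: cross_val_score(model, X_cmp, y_cmp, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The ranking depends on the data, so a comparison like this is worth repeating on your own dataset rather than reusing the result from a synthetic one.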

Classification models have several benefits for industry, including:

a. Automated predictions: Classification models can automate the process of making predictions, which can be time-consuming and error-prone when performed manually.

b. Improved accuracy: With proper training and evaluation, classification models can be highly accurate, reducing the number of incorrect predictions compared to manual methods.

c. Scalability: Classification models can process large volumes of data, which makes them well suited to big-data workloads in industry.

d. Improved decision-making: By providing accurate and automated predictions, classification models can help improve decision-making in various industries.


Examples of applications of classification models in industry include:

1. Fraud detection: Classification models can be used to identify fraudulent activities, such as credit card fraud, by detecting patterns in transaction data that are indicative of fraud.

2. Customer segmentation: Classification models can be used to segment customers into different groups based on their demographics, purchase history, and other data. This information can be used to target marketing campaigns and improve customer retention.

3. Medical diagnosis: Classification models can be used in the medical field to diagnose diseases based on symptoms, test results, and other data.

4. Sentiment analysis: Classification models can be used in text analysis to determine the sentiment expressed in text data, such as social media posts or product reviews. This information can be used to understand customer sentiment and improve customer satisfaction.

Here are the general steps to create a classification model in machine learning:

1. Define the problem: Start by defining the problem you want to solve and determining whether it is a classification problem.

2. Collect and prepare the data: Gather the relevant data for your problem, and clean and preprocess the data as necessary. This may involve handling missing values, transforming variables, and scaling the data.

3. Split the data into training and test sets: Divide your data into two sets, one for training the model and one for testing its performance. It is important to keep the test set separate so that you can evaluate the model's performance on unseen data.

4. Choose a model: Select an appropriate model for the problem, such as logistic regression, decision trees, random forests, or support vector machines. Consider the type of data you are working with and the computational resources available.

5. Train the model: Train the model on the training data by fitting the model to the data and optimizing its parameters.

6. Evaluate the model: Use the test set to evaluate the performance of the model. This may involve computing metrics such as accuracy, precision, recall, and F1 score.

7. Fine-tune the model (optional): Based on the evaluation results, make any necessary changes to the model or the data to improve its performance. Repeat steps 5-7 until you are satisfied with the performance of the model.

8. Use the model to make predictions: Use the trained model to make predictions on new, unseen data.

9. Deploy the model: Deploy the model in a production environment and monitor its performance over time. Make any necessary updates to keep the model up-to-date and accurate.
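Steps 2 to 6 above can be sketched in a few lines with a scikit-learn `Pipeline`. This is a minimal sketch under stated assumptions: the columns `color` and `size`, the rule used to generate the label, and all variable names are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step 2: collect and prepare data (here: a small synthetic dataset)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=60),
    "size": rng.random(60),
})
data["label"] = (data["size"] > 0.5).astype(int)  # hypothetical labeling rule

# Step 3: split into training and test sets
X_demo = data.drop(columns="label")
y_demo = data["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Steps 4-5: choose a model and train it; the encoder handles the categorical column
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
        remainder="passthrough",
    )),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(X_tr, y_tr)

# Step 6: evaluate on the held-out test set
demo_accuracy = accuracy_score(y_te, pipeline.predict(X_te))
print(f"Test accuracy: {demo_accuracy:.2f}")
```

Keeping the preprocessing inside the pipeline means the same encoding is applied at training and prediction time, which is one common way to avoid mismatched columns later.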





## Import Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

## Define Dataframe

In [2]:
# create dataframe
df = pd.DataFrame({'categorical_column_1': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E'],
                   'categorical_column_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z','X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z','X', 'Y', 'Z'],
                   'numerical_column_1': np.random.rand(30),
                   'numerical_column_2': np.random.rand(30),
                   'numerical_column_3': np.random.rand(30),
                   'label': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0]})

In [3]:
# show the first 10 rows
df.head(10)

| | categorical_column_1 | categorical_column_2 | numerical_column_1 | numerical_column_2 | numerical_column_3 | label |
|---|---|---|---|---|---|---|
| 0 | A | X | 0.052037 | 0.16862 | 0.143165 | 0 |
| 1 | B | Y | 0.258971 | 0.503202 | 0.239435 | 1 |
| 2 | C | Z | 0.583829 | 0.393311 | 0.674859 | 1 |
| 3 | D | X | 0.585395 | 0.987178 | 0.67264 | 0 |
| 4 | E | Y | 0.40326 | 0.960202 | 0.142771 | 0 |
| 5 | A | Z | 0.945124 | 0.298565 | 0.900249 | 1 |
| 6 | B | X | 0.941275 | 0.665033 | 0.755958 | 0 |
| 7 | C | Y | 0.12374 | 0.566526 | 0.869093 | 1 |
| 8 | D | Z | 0.459953 | 0.494005 | 0.503336 | 1 |
| 9 | E | X | 0.102842 | 0.334663 | 0.480714 | 0 |


The label column is the target: we want to predict it from the other columns.

## Split Features and Label

In [5]:
# split the dataframe into X and y
X = df.drop('label', axis=1)
y = df['label']

## Create dummies

Convert the categorical variables into dummy/indicator variables, because the model needs numeric inputs.

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [6]:
# create dummies for the categorical variables in X
X = pd.get_dummies(X, columns=['categorical_column_1', 'categorical_column_2'])
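To see what get_dummies does, here is a small illustration on a toy dataframe. The dataframe `toy` and its columns are made up for this example. Passing `dtype=int` is an optional choice that gives 0/1 values; without it, recent pandas versions return boolean columns.

```python
import pandas as pd

# Toy dataframe with one categorical and one numerical column (illustrative)
toy = pd.DataFrame({"grade": ["A", "B", "A"], "score": [0.1, 0.5, 0.9]})

# Each category in "grade" becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["grade"], dtype=int)

print(encoded.columns.tolist())  # ['score', 'grade_A', 'grade_B']
print(encoded["grade_A"].tolist())  # [1, 0, 1]
```

Columns that are not encoded keep their place at the front, and the new indicator columns are appended after them.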

## Split Train & Test Data

In [7]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
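With 30 rows and test_size=0.3, the split gives 21 training rows and 9 test rows. The sketch below checks this on stand-in arrays of the same length and also shows the optional `stratify` argument, which keeps the class proportions similar in both sets. The arrays `features` and `labels` are placeholders, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the tutorial dataframe
features = np.arange(60).reshape(30, 2)
labels = np.array([0, 1] * 15)  # balanced classes, illustrative only

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f_train.shape, f_test.shape)  # (21, 2) (9, 2)
print(np.bincount(l_test))          # class counts in the test set
```

Stratifying is mostly useful for small or imbalanced datasets, where a purely random split could leave one class underrepresented in the test set.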

## Train Model

We use RandomForestClassifier in this case. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [8]:
# train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

RandomForestClassifier(random_state=42)
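The averaging mentioned above can be checked directly: the class probabilities returned by the forest are the mean of the probabilities returned by its individual trees. The following sketch verifies this on synthetic data from `make_classification`; the names `X_toy`, `y_toy`, and `forest` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the averaging step
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Collect the class probabilities predicted by every tree in the forest
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])

# Average across trees and compare with the forest's own probabilities
mean_proba = tree_probas.mean(axis=0)
matches = np.allclose(mean_proba, forest.predict_proba(X_toy))
print(matches)
```

The predicted class is then the one with the highest averaged probability, which is why adding more trees tends to give more stable predictions.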

## Predict on the Test Data

In [9]:
# make predictions on the test data
y_pred = clf.predict(X_test)

In [10]:
y_pred

array([1, 0, 1, 1, 1, 0, 1, 0, 0], dtype=int64)

## Evaluation

In this example, we use accuracy to measure how often the predictions match the true labels.

In [11]:
# evaluate the model
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 77.78%


In [12]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[4 2]
 [0 3]]
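Accuracy, precision, recall, and F1 score can all be derived from the four cells of this confusion matrix. The sketch below reproduces the arithmetic with the values printed above, treating class 1 as the positive class. In scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix from the output above: rows = true label, columns = predicted label
cm_demo = np.array([[4, 2],
                    [0, 3]])

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm_demo.ravel()

accuracy_demo = (tp + tn) / cm_demo.sum()   # 7 correct out of 9
precision = tp / (tp + fp)                  # 3 of the 5 predicted positives are correct
recall = tp / (tp + fn)                     # all 3 actual positives are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy_demo:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.7778 precision=0.60 recall=1.00 f1=0.75
```

Here recall is perfect while precision is lower, which shows why a single accuracy figure can hide the kind of errors a model makes.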


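The model classifies 7 of the 9 test samples correctly, and the confusion matrix shows that both errors are class 0 samples predicted as class 1. Because the test set is very small and the numerical features are random, treat this score as illustrative only. If you want to improve the model, you can try hyperparameter tuning, but this is optional if you are satisfied with the result.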
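Hyperparameter tuning can be done with scikit-learn's `GridSearchCV`, which tries every combination in a parameter grid and scores each one with cross-validation. The snippet below is a minimal sketch on synthetic data. The grid values and the names `X_tune`, `y_tune`, and `search` are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set
X_tune, y_tune = make_classification(n_samples=120, n_features=6, random_state=0)

# Candidate values to try (a small, arbitrary grid)
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 4, None],
}

# 3-fold cross-validation over all 6 combinations
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tune, y_tune)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

After fitting, `search.best_estimator_` holds a model refit on the full data with the best parameters, so it can be used for prediction directly.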

## Let's deploy and use the model to predict new, unlabeled data

### Create new dataframe

In [13]:
# create a new dataframe without the label column
# note: np.random.randn draws from a standard normal distribution,
# while the training data used np.random.rand (uniform on [0, 1))
new_df = pd.DataFrame({'categorical_column_1': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E'],
                       'categorical_column_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X'],
                       'numerical_column_1': np.random.randn(10),
                       'numerical_column_2': np.random.randn(10),
                       'numerical_column_3': np.random.randn(10)})

### Create dummies

In [14]:
# create dummies for the categorical variables in the new dataframe
new_df = pd.get_dummies(new_df, columns=['categorical_column_1', 'categorical_column_2'])
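Encoding new data separately only works when it contains every category seen during training. If a category is missing, the dummy columns will not match what the model expects. One common workaround, sketched below on made-up frames, is to reindex the new columns against the training columns. The names `train_raw`, `new_raw`, and `aligned` are hypothetical.

```python
import pandas as pd

# Training data contains categories A, B, C; the new data only contains A
train_raw = pd.DataFrame({"cat": ["A", "B", "C"], "num": [0.1, 0.2, 0.3]})
new_raw = pd.DataFrame({"cat": ["A", "A"], "num": [0.4, 0.5]})

train_enc = pd.get_dummies(train_raw, columns=["cat"], dtype=int)
new_enc = pd.get_dummies(new_raw, columns=["cat"], dtype=int)

# new_enc lacks cat_B and cat_C; add them as zeros, in the training column order
aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)

print(new_enc.columns.tolist())   # ['num', 'cat_A']
print(aligned.columns.tolist())   # ['num', 'cat_A', 'cat_B', 'cat_C']
```

In this tutorial the new dataframe happens to contain all categories, so the columns already line up, but the reindex step makes the prediction code safer for real data.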

### Predict new data

In [15]:
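# use the trained model to predict the labels for the new dataframe
# selecting X.columns keeps the feature order used in training and lets the cell be re-run safely
new_df['label_prediction'] = clf.predict(new_df[X.columns])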

In [16]:
# view the new dataframe with the label_prediction column
new_df.head()

| | numerical_column_1 | numerical_column_2 | numerical_column_3 | categorical_column_1_A | categorical_column_1_B | categorical_column_1_C | categorical_column_1_D | categorical_column_1_E | categorical_column_2_X | categorical_column_2_Y | categorical_column_2_Z | label_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.350096 | -0.473009 | 0.509288 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | -0.653313 | 1.469487 | -0.098168 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0.327763 | 0.412554 | -0.395216 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 0.114424 | 0.01781 | 1.084722 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0.980175 | 0.52806 | -0.226565 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
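For deployment, the trained model is usually saved to disk so that another process can load it and make predictions without retraining. The sketch below uses joblib for this, which is one common option for scikit-learn models. The file name `classifier.joblib` and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic data (stands in for the trained clf)
X_p, y_p = make_classification(n_samples=50, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_p, y_p)

# Save the model to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should give the same predictions as the original
same_predictions = np.array_equal(model.predict(X_p), restored.predict(X_p))
print(same_predictions)
```

A saved model should be loaded with the same scikit-learn version that created it, and only from files you trust, because loading executes pickled code.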
