In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Reading the file**

In [None]:
df = pd.read_csv('/kaggle/input/processed-dataset/test_balanced_dataset_with_updated_columns.csv')

In [None]:
df_original = pd.read_csv('/kaggle/input/credit-card-transactions-dataset/credit_card_transactions.csv')

# **Data Exploration**

In [None]:
df.head()

> **The following code first explores the dataset using the get_df_info function, which provides an overview of the dataset's shape, columns, data types, unique values, null values, duplicate rows, and descriptive statistics.**

In [None]:
def get_df_info(df):
    print("\n\033[1mShape of DataFrame:\033[0m ", df.shape)
    print("\n\033[1mColumns in DataFrame:\033[0m ", df.columns.to_list())
    print("\n\033[1mData types of columns:\033[0m\n", df.dtypes)
    
    print("\n\033[1mInformation about DataFrame:\033[0m")
    df.info()
    
    print("\n\033[1mNumber of unique values in each column:\033[0m")
    for col in df.columns:
        print(f"\033[1m{col}\033[0m: {df[col].nunique()}")
        
    print("\n\033[1mNull values in columns:\033[0m")
    null_counts = df.isnull().sum()
    null_columns = null_counts[null_counts > 0]
    if len(null_columns) > 0:
        for col, count in null_columns.items():
            print(f"\033[1m{col}\033[0m: {count}")
    else:
        print("There are no null values in the DataFrame.")
    
    print("\n\033[1mNumber of duplicate rows:\033[0m ", df.duplicated().sum())
    
    print("\n\033[1mDescriptive statistics of DataFrame:\033[0m\n",)
    return df.describe().transpose()

# Call the function
get_df_info(df)

Here's a brief explanation of the output:

**Data Overview**
The dataset contains 1,296,675 rows and 24 columns. The columns include transaction details such as date, time, credit card number, merchant information, transaction amount, and location data.

**Data Types**
The data types of the columns vary, with 6 float64, 6 int64, and 12 object (string) columns.

**Unique Values**
The number of unique values in each column ranges from 2 (gender, is_fraud) to 1,296,675 (trans_num, unix_time). This suggests that some columns have a high cardinality, while others have a low cardinality.

**Null Values**
There are null values in the merch_zipcode column, with 195,973 missing values.

**Duplicate Rows**
There are no duplicate rows in the dataset.

**Descriptive Statistics**
The descriptive statistics provide an overview of the central tendency and variability of the numerical columns. For example, the mean transaction amount is around 70, with a standard deviation of around 160. The mean latitude and longitude values suggest that the transactions are concentrated in a specific region.

Overall, this output provides a comprehensive overview of the dataset's structure, content, and distribution. It highlights the presence of null values, unique values, and duplicate rows, which can inform data preprocessing and feature engineering decisions.
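As a compact complement to get_df_info, the same checks can be gathered into a single table. The sketch below runs on a small made-up DataFrame (`toy`, with invented values) instead of the real transactions file, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the transactions table; all values are invented for illustration.
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.00],
    "gender": ["F", "M", "F", "M"],
    "merch_zipcode": [28705.0, None, 83236.0, None],
    "is_fraud": [0, 0, 1, 0],
})

# One row per column: data type, number of distinct non-null values, share of nulls.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "null_share": toy.isnull().mean(),
})
print(summary)
print("Duplicate rows:", int(toy.duplicated().sum()))
```

A null share per column is often easier to act on than a raw count, because it shows directly whether a column such as merch_zipcode is worth imputing or dropping.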



# **Machine Learning using EvalML**


The command pip install evalml is used to install the EvalML library, which is an automated machine learning (AutoML) library for Python. EvalML provides a simple and efficient way to perform automated machine learning tasks, including data preprocessing, model selection, hyperparameter tuning, and model evaluation.

Here's a brief explanation of the EvalML library:

**Key Features:**

* **Automated Machine Learning:** EvalML automates the machine learning process, from data preprocessing to model evaluation.

* **Data Preprocessing:** EvalML provides automatic data preprocessing, including handling missing values, encoding categorical variables, and feature scaling.

* **Model Selection:** EvalML supports a wide range of machine learning algorithms and automatically selects the best model for the problem.

* **Hyperparameter Tuning:** EvalML performs hyperparameter tuning to optimize model performance.

* **Model Evaluation:** EvalML provides comprehensive model evaluation metrics and visualizations.


**Benefits:**

* **Saves Time:** EvalML automates the machine learning process, saving time and effort.

* **Improves Performance:** EvalML's automated hyperparameter tuning and model selection improve model performance.

* **Easy to Use:** EvalML has a simple and intuitive API, making it easy to use for both beginners and experienced machine learning practitioners.

By installing EvalML using pip install evalml, you can leverage these features and benefits to streamline your machine learning workflow and improve your model's performance.



In [None]:
pip install evalml

In [None]:
pip install evalml packaging --upgrade

In [None]:
import evalml
from evalml import AutoMLSearch

In [None]:
# Splitting Features and target
X = df.drop(['is_fraud'], axis=1)
y = df['is_fraud']

In [None]:
# Use df_original
# Splitting Features and target
X_original = df_original.drop(['is_fraud'], axis=1)
y_original = df_original['is_fraud']

In [None]:
X_train_original, X_test_original, y_train_original, y_test_original = evalml.preprocessing.split_data(X_original, y_original, problem_type='binary')

In [None]:
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='binary')


This line of code splits the dataset into training and testing sets using EvalML's split_data function. Here's a breakdown of the parameters:
* X: The feature data (independent variables)
* y: The target data (dependent variable)
* problem_type='binary': Specifies that this is a binary classification problem (i.e., the target variable has two classes)

The function returns four pandas objects (DataFrames for the features, Series for the targets):

* X_train: The training feature data.
* X_test: The testing feature data.
* y_train: The training target data.
* y_test: The testing target data.
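To show why the way the data is split matters for a fraud dataset, here is a minimal pure-Python sketch of a stratified 80/20 split. It assumes that split_data stratifies on the target for classification problems and holds out 20% by default; the helper name `stratified_split` and the toy labels are hypothetical, and this is not EvalML's implementation.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class holdout size
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy target: 95 legitimate transactions and 5 frauds (invented numbers).
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels)

print("train size:", len(train_idx), "test size:", len(test_idx))
print("frauds in test:", sum(labels[i] for i in test_idx))
```

Without stratification, a rare class can end up almost entirely in one of the two sets, which makes holdout metrics for that class unreliable.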

In [None]:
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary', max_iterations=50)
automl.search()

The output shows the results of the AutoML search process. Here's a breakdown of the output:
* The search process evaluated multiple pipelines in batches, and search() returns a dictionary with batch numbers as keys (1, 2, ...).
* Each batch contains a dictionary with pipeline names as keys. The values are the time each pipeline took to run, in seconds, not cross-validation scores.
* The cross-validation scores themselves are reported in automl.rankings. Since no objective was passed to AutoMLSearch, the search uses its default for binary problems, Log Loss Binary, where lower is better.
* The pipelines are a combination of different algorithms (e.g., Random Forest, LightGBM, Extra Trees, Elastic Net, XGBoost, Logistic Regression) with various preprocessors and transformers (e.g., Label Encoder, Select Columns By Type Transformer, DateTime Featurizer, Imputer, One Hot Encoder, Undersampler).
* The "Total time of batch" key shows the total time taken for each batch.

From the output, we can see that:
* Batch 1 evaluated a single pipeline, Random Forest Classifier, which took 48.26 seconds.
* Batch 2 evaluated five different pipelines, with run times ranging from 23.45 seconds (Extra Trees Classifier) to 27.82 seconds (XGBoost Classifier).

The AutoML search process aims to find the best-performing pipeline for the given problem. You can access it using automl.best_pipeline, which returns the pipeline that ranked best on the search objective.



In [None]:
# Get the rankings with details
rankings = automl.rankings
print(rankings)

Here's a breakdown of the output:
The rankings attribute returns a pandas DataFrame containing the results of the AutoML search process.

The columns are:
* id: A unique identifier for each pipeline.
* pipeline_name: The name of the pipeline, including the algorithm and preprocessors/transformers used.
* search_order: The order in which the pipeline was evaluated during the search process.
* ranking_score: The score used to rank the pipelines. The search objective here is a loss (Log Loss Binary, the default for binary problems), so lower values indicate better performance.
* mean_cv_score: The mean cross-validation score for each pipeline.
* standard_deviation_cv_score: The standard deviation of the cross-validation scores for each pipeline.
* percent_better_than_baseline: The percentage improvement over the baseline model.
* high_variance_cv: A boolean indicating whether the cross-validation scores have high variance.
* parameters: The hyperparameters used for each pipeline.

The rows are sorted by the ranking_score, with the best-performing pipeline at the top.
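To see why a lower score ranks higher, here is a small hand-written version of binary log loss. It assumes the search used EvalML's default binary objective (Log Loss Binary), which is consistent with lower ranking scores being better; the function name `binary_log_loss` and the toy predictions are hypothetical.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted P(class 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 0, 0, 1]
confident = [0.05, 0.10, 0.05, 0.90]  # probabilities close to the true labels
hesitant = [0.40, 0.45, 0.40, 0.55]   # same decisions at 0.5, but less certain

print("confident model:", round(binary_log_loss(y_true, confident), 3))
print("hesitant model: ", round(binary_log_loss(y_true, hesitant), 3))
```

Both toy models make the same decisions at a 0.5 threshold, yet the confident one gets the lower loss. Log loss rewards well-calibrated probabilities, not only correct labels.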

From the output, we can see that:

* The top-performing pipeline is the LightGBM Classifier with a ranking score of 0.036276 and an 82.58% improvement over the baseline.
* The worst-performing pipeline is the Mode Baseline Binary Classification Pipeline with a ranking score of 0.208199 and no improvement over the baseline.


You can access the best-performing pipeline using automl.best_pipeline, which will return the pipeline with the lowest ranking score.
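The percent_better_than_baseline value can be reproduced from the two ranking scores quoted above. This sketch assumes the column is the relative reduction of a lower-is-better objective with respect to the baseline, which matches the reported figure; the variable names are illustrative.

```python
# Ranking scores quoted in the rankings table above.
baseline_score = 0.208199  # Mode Baseline Binary Classification Pipeline
best_score = 0.036276      # LightGBM Classifier pipeline

# Relative improvement for an objective where lower is better.
percent_better = (baseline_score - best_score) / abs(baseline_score) * 100
print(f"{percent_better:.2f}% better than baseline")
```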



In [None]:
automl.best_pipeline


The best_pipeline attribute returns the best-performing pipeline from the AutoML search process. 

**Here's a breakdown of the output:**

The pipeline is a complex graph of components, including:

* Preprocessors: Label Encoder, Select Columns By Type Transformer, Drop Columns Transformer, DateTime Featurizer, Imputer, One Hot Encoder
* Balancer: Undersampler
* Classifier: LightGBM Classifier

The pipeline has a large number of hyperparameters, which are tuned to optimize performance.

The hyperparameters are stored in the parameters dictionary, which contains settings for each component.

The random_seed parameter is set to 0, which ensures reproducibility of the results.
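The Undersampler step is what handles class imbalance during training. Here is a minimal sketch of random undersampling, assuming label 1 is the minority class and a target minority-to-majority ratio of 0.25; both the ratio and the helper name `undersample_majority` are assumptions for illustration, not values read from this pipeline.

```python
import random

def undersample_majority(labels, sampling_ratio=0.25, seed=0):
    """Keep every minority row and a random subset of majority rows.

    After sampling, minority / majority is approximately sampling_ratio.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_majority = min(len(majority), int(len(minority) / sampling_ratio))
    kept_majority = rng.sample(majority, n_majority)
    return sorted(minority + kept_majority)

# Toy target: 990 legitimate rows and 10 frauds (invented numbers).
labels = [0] * 990 + [1] * 10
kept = undersample_majority(labels)

n_fraud = sum(labels[i] for i in kept)
print("rows kept:", len(kept), "frauds kept:", n_fraud)
```

Undersampling applies only to the training data. The holdout set keeps its original class balance, which is why the metrics reported later still reflect the real fraud rate.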

In [None]:
best_pipeline=automl.best_pipeline

In [None]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])

The describe_pipeline method provides a detailed description of the pipeline, including:

**Pipeline Name:** LightGBM Classifier with various preprocessors and transformers.

**Problem Type:** Binary classification.

**Model Family:** LightGBM.

**Pipeline Steps:** A list of 13 components, including:

* Label Encoder (x3)
* Select Columns By Type Transformer
* Drop Columns Transformer
* DateTime Featurizer
* Imputer (x2)
* Select Columns Transformer (x2)
* One Hot Encoder
* Undersampler
* LightGBM Classifier

**Hyperparameters:** Detailed settings for each component, including:
* Column types and exclusions
* Imputation strategies
* Encoding schemes
* Sampling ratios
* LightGBM hyperparameters (e.g., learning rate, n_estimators, max_depth)

**Training:**
Total training time (including CV): 27.0 seconds

**Cross Validation:**
* Metrics: Log Loss, Binary MCC, Gini, AUC, Precision, F1, Balanced Accuracy, Binary Accuracy
* Mean and standard deviation of each metric across folds
* Coefficient of variation for each metric

This detailed description provides insight into the pipeline's architecture, hyperparameters, and performance.



In [None]:
# Evaluate on hold out data
best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])


The score method evaluates the performance of the best pipeline on the holdout data (X_test, y_test) using the specified objectives:

* AUC (Area Under the Receiver Operating Characteristic Curve)
* F1 (F1 score, the harmonic mean of precision and recall)
* Precision (the ratio of true positives to true positives plus false positives)
* Recall (the ratio of true positives to true positives plus false negatives)

The output is an OrderedDict with the objective names as keys and the corresponding scores as values.

Here's a brief interpretation of the scores:

* AUC: 0.997, indicating excellent performance, with a high degree of separation between positive and negative classes.
* F1: 0.819, indicating good balance between precision and recall.
* Precision: 0.840, indicating a high ratio of true positives to true positives plus false positives.
* Recall: 0.798, indicating a good ratio of true positives to true positives plus false negatives.

These scores suggest that the best pipeline is performing well on the holdout data, with excellent AUC and good balance between precision and recall.
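AUC can be read as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen legitimate one. The sketch below computes it by brute force over all positive/negative pairs; the function name `auc_by_pairs` and the four toy scores are hypothetical, and this is only practical for tiny inputs.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print("AUC:", auc_by_pairs(y_true, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

Because AUC depends only on the ranking of scores, it can stay very high on an imbalanced dataset even when precision and recall at a fixed threshold are noticeably lower.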



In [None]:
# Evaluate on hold out data
y_pred = best_pipeline.predict(X_test)
y_pred_proba = best_pipeline.predict_proba(X_test)

In [None]:
# Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Mechanics of classification_report:
classification_report from scikit-learn generates a report on the classification performance. 

Here's a breakdown of the output:
* Precision: The ratio of true positives (TP) to the sum of true positives and false positives (FP) for each class.
* Recall: The ratio of true positives (TP) to the sum of true positives and false negatives (FN) for each class.
* F1-score: The harmonic mean of precision and recall for each class.
* Support: The number of instances in each class.

For the given output:

**Class 0 (likely the negative class):**
* Precision: 1.00 (rounded to two decimals)
* Recall: 1.00 (rounded to two decimals)
* F1-score: 1.00 (rounded to two decimals)
* Support: 257834 instances


**Class 1 (likely the positive class):**
* Precision: 0.84
* Recall: 0.80
* F1-score: 0.82
* Support: 1501 instances

**The report also provides averages:**
* Accuracy: The overall accuracy of the classifier.
* Macro avg: The average of the precision, recall, and F1-score for all classes, weighted equally.
* Weighted avg: The average of the precision, recall, and F1-score for all classes, weighted by the support (number of instances) of each class.


In this case, the classifier performs almost perfectly on the negative class (Class 0), where the scores round to 1.00, but has some errors on the positive class (Class 1). The macro average and weighted average provide a summary of the performance across both classes.
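The difference between the two averages is easy to see with the rounded per-class F1-scores and supports from the report above. This is a plain arithmetic sketch; the variable names are illustrative.

```python
# Rounded per-class F1-scores and supports taken from the classification report.
f1_by_class = {0: 1.00, 1: 0.82}
support = {0: 257834, 1: 1501}

# Macro average: every class counts the same.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: every class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print("macro F1:   ", round(macro_f1, 3))
print("weighted F1:", round(weighted_f1, 3))
```

With fewer than 1% fraud cases, the weighted average is almost entirely determined by Class 0, so the macro average or the Class 1 row is the more informative figure for fraud detection.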



In [None]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# Calculate F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred, average='weighted')  # 'weighted' is dominated by the majority class; use average='binary' to score only the fraud class
print("F1 Score:", f1)


The confusion matrix provides a summary of the classifier's predictions against the actual true labels. Here's a breakdown of the output:

**Interpretation:**
* True Negatives (TN): 257606 instances correctly predicted as Class 0 (negative class)
* False Positives (FP): 228 instances incorrectly predicted as Class 1 (positive class) when they were actually Class 0
* False Negatives (FN): 303 instances incorrectly predicted as Class 0 when they were actually Class 1
* True Positives (TP): 1198 instances correctly predicted as Class 1 (positive class)

**The confusion matrix helps identify:**
* Errors in prediction (FP, FN)
* Correct predictions (TN, TP)
* Class imbalance (difference in support between classes)

In this case, the classifier performs well on the negative class (Class 0) but has some errors on the positive class (Class 1): it catches 1,198 of the 1,501 fraud cases in the holdout set, which matches the recall of about 0.80.
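The four cells can be cross-checked against the classification report. This sketch assumes scikit-learn's layout for binary labels, [[TN, FP], [FN, TP]], where each row sums to the support of its true class. It takes TN and TP from the matrix and the supports from the report, then derives the remaining cells and the Class 1 metrics.

```python
# Values taken from the outputs above.
tn, tp = 257606, 1198
support_0, support_1 = 257834, 1501

# Each row of the matrix sums to the support of its true class.
fp = support_0 - tn  # actual Class 0 predicted as Class 1 -> 228
fn = support_1 - tp  # actual Class 1 predicted as Class 0 -> 303

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"FP={fp}, FN={fn}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

The derived values agree with the holdout scores reported earlier (precision 0.840, recall 0.798, F1 0.819), which is a quick way to confirm that the matrix cells have been read in the right order.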



# **Conclusion**

The AutoML process successfully identified a high-performing pipeline for the binary classification problem, utilizing a LightGBM classifier with various preprocessors and transformers. 

The best pipeline achieved excellent performance on the holdout data, with:

* AUC: 0.997
* F1-score: 0.819
* Precision: 0.840
* Recall: 0.798

The classification report and confusion matrix revealed:

* High accuracy on the negative class (Class 0)
* Good performance on the positive class (Class 1), with some room for improvement

The AutoML process demonstrated its effectiveness in:

* Automating the machine learning workflow
* Identifying a high-performing pipeline
* Providing insights into the classification performance

However, there is still room for improvement, particularly in:
* Addressing class imbalance
* Further optimizing hyperparameters
* Exploring additional algorithms and techniques

Overall, the AutoML process provided valuable insights and a solid foundation for further development and improvement.
