<a href="https://colab.research.google.com/github/emm-gl/machine-learning-portfolio/blob/main/00_Classical_ML/Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working Analyst Module I: Classifier
**Student:** Emmanuel González Calitl

May 2025

Consider the FraudeCanastas.csv dataset available in the corresponding zip file.
To learn more about the original dataset (not the one we will evaluate), visit this [link](https://challengedata.ens.fr/challenges/104).

* Explore and prepare the dataset to build a classifier, choosing one of the algorithms discussed during the module.
* Use a Google Colab notebook where, in addition to the code, you discuss your findings and the steps followed to build the classifier.
* Justify the choice of algorithm and the parameters used.
* Evaluate the model.
* Write an executive summary of no more than two paragraphs in a text document.

## 1. Get Data

In [None]:
# Import libraries

import numpy as np   # Library for arrays and matrix operations
import pandas as pd  # Library for working with tabular data

# Scikit-learn (sklearn) is the main library for machine learning
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

# Libraries for plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

### Functions:

In [None]:
def describe_datos(df):
    """Summarize each column: dtype, null count, distinct count, and distinct values."""
    # Collect the array of distinct values for every column
    unicos = pd.Series([df[col].unique() for col in df], index=df.columns)
    descripcion = pd.concat([df.dtypes, df.isna().sum(), df.nunique(), unicos], axis=1)
    descripcion.columns = ['dtypes', 'null', 'nunique', 'unique']
    return descripcion

In [None]:

def plot_confusion_matrix(y_true, y_pred):
    """
    Plots a confusion matrix.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.
    """
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=['Predicted 0', 'Predicted 1'],
                yticklabels=['Actual 0', 'Actual 1'])
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.title("Confusion Matrix")
    plt.show()
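
To make the layout of the heatmap easier to read, here is a minimal sketch of the matrix that `confusion_matrix` returns. The labels are invented toy values, not results from FraudeCanastas.csv. In scikit-learn, rows correspond to the actual classes and columns to the predicted classes, which matches the tick labels used in `plot_confusion_matrix`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only (not taken from the dataset)
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes, so for a
# binary problem ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

For fraud detection the bottom-left cell (actual 1, predicted 0) is usually the one to watch, since it counts fraudulent baskets the model missed.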


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
zip_path = '/content/drive/MyDrive/TrackCienciaDeDatos/Mod_1_Data_baskets/FraudeCanastas.zip'

import zipfile

# Destination path for the extracted files
extract_path = '/content/fraude_canastas'

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_path)


In [None]:
df_raw = pd.read_csv('/content/fraude_canastas/FraudeCanastas.csv')
df_raw.shape

(9319, 2457)

The dataset has 9319 rows and 2457 columns.

In [None]:
df_raw.head(5)

Unnamed: 0,ID,APPLE PRODUCTDESCRIPTION | SAMSUNG | MODEL90,AUDIO ACCESSORIES | AB AUDIO | AB AUDIO GO AIR TRUE WIRELESS BLUETOOTH IN-EAR H,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH CHARGING CASE,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH CHARGING CASE 2ND GENERATI,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH WIRELESS CHARGING CASE,AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH WIRELESS CHARGING CASE 2ND,AUDIO ACCESSORIES | APPLE | 2021 APPLE AIRPODS WITH MAGSAFE CHARGING CASE 3RD,AUDIO ACCESSORIES | APPLE | AIRPODS PRO,AUDIO ACCESSORIES | APPLE | APPLE AIRPODS MAX,...,WOMEN S NIGHTWEAR | ANYDAY RETAILER | ANYDAY RETAILER LEOPARD PRINT JERSEY PY,WOMEN S NIGHTWEAR | RETAILER | RETAILER CLEO VELOUR JOGGER LOUNGE PANT,WOMEN S NIGHTWEAR | SOSANDAR | SOSANDAR ZEBRA PRINT PYJAMA BOTTOMS BLACK 10,Nb_of_items,total_of_items,costo_total,costo_medio_item,costo_item_max,costo_item_min,fraud_flag
0,130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,1299,649.5,1299,0.0,1.0
1,195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3,3,4119,1373.0,2470,0.0,1.0
2,217,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,2806,1403.0,2799,7.0,1.0
3,552,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,2,1206,603.0,1199,7.0,1.0
4,854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,19,27,1807,66.925926,195,4.0,1.0




In [None]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9319 entries, 0 to 9318
Columns: 2457 entries, ID to fraud_flag
dtypes: float64(2452), int64(5)
memory usage: 174.7 MB


***The dataset has far more columns (features) than we need.
Let's analyze which columns actually carry useful information:***



1.   Check NaNs
2.   Check unique values



In [None]:
df_raw.isna().sum().sort_values(ascending = False)

Unnamed: 0,0
fraud_flag,0
ID,0
APPLE PRODUCTDESCRIPTION | SAMSUNG | MODEL90,0
AUDIO ACCESSORIES | AB AUDIO | AB AUDIO GO AIR TRUE WIRELESS BLUETOOTH IN-EAR H,0
AUDIO ACCESSORIES | APPLE | 2019 APPLE AIRPODS WITH CHARGING CASE,0
...,...
AUDIO ACCESSORIES | APPLE | APPLE EARPODS WITH REMOTE AND MIC LIGHTNING CONNEC,0
AUDIO ACCESSORIES | APPLE | APPLE AIRPODS PRO WITH WIRELESS CHARGING CASE,0
AUDIO ACCESSORIES | APPLE | APPLE AIRPODS PRO WITH MAGSAFE CHARGING CASE,0
AUDIO ACCESSORIES | APPLE | APPLE AIRPODS MAX NOISE CANCELLING WIRELESS BLUETO,0


***No column has missing values, and all columns hold numerical data.***
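
With more than 2,000 columns, the truncated output above is hard to inspect by eye. The following is a small sketch of how both checks could be done programmatically. It uses a hypothetical toy frame (`toy`) that deliberately contains one NaN and one text column so that the checks have something to report; on `df_raw` both results would be expected to come back empty.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_raw (values are invented)
toy = pd.DataFrame({
    "costo_total": [1299, 4119, 2806],
    "costo_item_min": [0.0, np.nan, 7.0],  # one missing value on purpose
    "note": ["a", "b", "c"],               # one non-numeric column on purpose
})

# Number of missing values per column
missing_per_column = toy.isna().sum()

# True if any cell in the whole frame is missing
has_missing = bool(toy.isna().any().any())

# Names of the columns whose dtype is not numeric
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Any missing values:", has_missing)
print("Non-numeric columns:", non_numeric_columns)
```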

---



In [None]:
# prompt: show me the list of the first 100 columns of my dataset

print(df_raw.columns[:100])

In [None]:
# prompt: from the dataset, sort the column names by length, shortest first, and show the first 20 as a list

# Get the column names
column_names = df_raw.columns.tolist()

# Sort the column names by their length
sorted_column_names = sorted(column_names, key=len)

# Show the 20 shortest column names
sorted_column_names[:20]

['ID',
 'SERVICE',
 'WARRANTY',
 'COMPUTERS',
 'fraud_flag',
 'Nb_of_items',
 'costo_total',
 'total_of_items',
 'costo_item_max',
 'costo_item_min',
 'costo_medio_item',
 'BABY CHILD TRAVEL',
 'BEDROOM FURNITURE',
 'FULFILMENT CHARGE',
 'BAGS & CARRY CASES',
 'TOYS | MSPA | RETAILER',
 'LIVING DINING FURNITURE',
 'TELEVISIONS HOME CINEMA',
 'MAKEUP | DIOR | RETAILER',
 'LIVING & DINING FURNITURE']

***According to the dataset documentation, most of the columns identify which product, brand, or category was in the basket. With more than 2,000 such columns, using all of them is impractical, so we keep only the aggregate basket features.***
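
One way to back up this decision with numbers is to measure how often each product column is actually non-zero. The sketch below uses an invented miniature table (`toy`) whose column names only imitate the `CATEGORY | BRAND | MODEL` pattern. It assumes that a value of 0 means the product is absent from the basket, and the 0.5 threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical miniature basket table; names imitate the real pattern,
# values are invented
toy = pd.DataFrame({
    "COMPUTERS | APPLE | MODEL A": [1.0, 0.0, 0.0, 0.0],
    "TOYS | MSPA | RETAILER": [0.0, 0.0, 0.0, 0.0],
    "WARRANTY": [1.0, 1.0, 0.0, 2.0],
    "costo_total": [1299, 4119, 2806, 1206],
})

# Everything except the aggregate column is treated as a product column
product_cols = [c for c in toy.columns if c != "costo_total"]

# Share of baskets in which each product column is non-zero
nonzero_share = (toy[product_cols] != 0).mean().sort_values(ascending=False)

# Columns that appear in fewer than half of the baskets (arbitrary threshold)
rare_cols = nonzero_share[nonzero_share < 0.5].index.tolist()

print(nonzero_share)
print("Rarely used columns:", rare_cols)
```

If most product columns in the real data turn out to be non-zero in only a handful of baskets, that would support working with the aggregate features instead.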

In [None]:
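# Columns to keep: six aggregate basket features plus the target (fraud_flag)
features = ['Nb_of_items', 'costo_total', 'total_of_items', 'costo_item_max', 'costo_item_min', 'costo_medio_item', 'fraud_flag']

# .copy() makes df independent of df_raw, avoiding SettingWithCopyWarning on later edits
df = df_raw[features].copy()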
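
Before modeling, it is usually worth checking how the target is distributed, because fraud labels tend to be imbalanced. This is a minimal sketch on invented labels; the real distribution has to be computed from `df['fraud_flag']` and is not shown here.

```python
import pandas as pd

# Invented labels for illustration; not the real fraud_flag column
toy = pd.DataFrame({"fraud_flag": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]})

# Absolute number of baskets per class
counts = toy["fraud_flag"].value_counts()

# Same information as a fraction of all baskets
shares = toy["fraud_flag"].value_counts(normalize=True)

print(counts)
print(shares)
```

A strongly skewed result would suggest evaluating the classifier with more than plain accuracy, for example with the confusion matrix defined earlier.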

In [None]:
df

Unnamed: 0,Nb_of_items,costo_total,total_of_items,costo_item_max,costo_item_min,costo_medio_item,fraud_flag
0,2,1299,2,1299,0.0,649.500000,1.0
1,3,4119,3,2470,0.0,1373.000000,1.0
2,2,2806,2,2799,7.0,1403.000000,1.0
3,2,1206,2,1199,7.0,603.000000,1.0
4,19,1807,27,195,4.0,66.925926,1.0
...,...,...,...,...,...,...,...
9314,1,369,1,369,369.0,369.000000,0.0
9315,16,2667,20,423,15.0,133.350000,0.0
9316,1,849,1,849,849.0,849.000000,0.0
9317,2,1906,2,1899,7.0,953.000000,0.0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9319 entries, 0 to 9318
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Nb_of_items       9319 non-null   int64  
 1   costo_total       9319 non-null   int64  
 2   total_of_items    9319 non-null   int64  
 3   costo_item_max    9319 non-null   int64  
 4   costo_item_min    9319 non-null   float64
 5   costo_medio_item  9319 non-null   float64
 6   fraud_flag        9319 non-null   float64
dtypes: float64(3), int64(4)
memory usage: 509.8 KB
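
The reduced frame takes 509.8 KB, compared with 174.7 MB for the full one. If only these columns are needed, a possible alternative is to load just them from the CSV with the `usecols` parameter of `pd.read_csv`. The sketch below reads an invented in-memory CSV (`csv_text`) rather than the real file, and `wanted` is a shortened, hypothetical column list.

```python
import io
import pandas as pd

# Invented two-row CSV standing in for FraudeCanastas.csv
csv_text = (
    "ID,PRODUCT A,PRODUCT B,Nb_of_items,costo_total,fraud_flag\n"
    "130,0.0,0.0,2,1299,1.0\n"
    "195,1.0,0.0,3,4119,1.0\n"
)

# Shortened, hypothetical list of columns to load
wanted = ["Nb_of_items", "costo_total", "fraud_flag"]

# Only the listed columns are parsed; the others are skipped at read time
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(small)
print(small.shape)
```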


In [None]:
# Summarize every column: number of distinct values (NaN excluded) and dtype
df_summary = pd.DataFrame({
    'Column': df_raw.columns,
    'Total_Unique_Values': df_raw.nunique().to_numpy(),
    'Type': df_raw.dtypes.to_numpy()
})

In [None]:
df_summary.sort_values(by='Total_Unique_Values', ascending=False).head(10)

Unnamed: 0,Column,Total_Unique_Values,Type
0,ID,9319,int64
2453,costo_medio_item,2034,float64
2452,costo_total,1639,int64
2454,costo_item_max,540,int64
2455,costo_item_min,528,float64
1424,LIVING & DINING FURNITURE | RETAILER | RETAILER,62,float64
2451,total_of_items,34,int64
2450,Nb_of_items,28,int64
879,COMPUTERS | APPLE | 2020 APPLE IPAD AIR 10 9 A...,26,float64
729,COMPUTER PERIPHERALS & ACCESSORIES | APPLE | A...,25,float64
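
The table shows that `ID` has 9319 distinct values, one per row, so it behaves as an identifier rather than a feature. The same unique-value count can be used to flag columns that carry no information. The following sketch works on a hypothetical toy frame (`toy`); the column `ALWAYS ZERO PRODUCT` is invented to illustrate a constant column.

```python
import pandas as pd

# Hypothetical toy frame; ALWAYS ZERO PRODUCT is an invented constant column
toy = pd.DataFrame({
    "ID": [130, 195, 217, 552],
    "Nb_of_items": [2, 3, 2, 2],
    "ALWAYS ZERO PRODUCT": [0.0, 0.0, 0.0, 0.0],
    "fraud_flag": [1.0, 1.0, 0.0, 0.0],
})

n_unique = toy.nunique()

# A column with a single distinct value cannot help separate the classes
constant_cols = n_unique[n_unique <= 1].index.tolist()

# A column with one distinct value per row behaves like an identifier
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

reduced = toy.drop(columns=constant_cols + id_like_cols)

print("Constant:", constant_cols)
print("ID-like:", id_like_cols)
print(reduced.columns.tolist())
```

Note that the ID-like rule is only a heuristic: a continuous feature could also be unique in every row, so flagged columns should be reviewed before they are dropped.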
