### Online Shoppers Intention

- **Dataset Description**: This dataset contains feature vectors for 12,330 sessions, each belonging to a different user, collected over a 1-year period. The data is curated to avoid bias toward specific campaigns, special days, user profiles, or periods.

Source: [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset)

| Feature                     | Description                                                                                                          |
|-----------------------------|----------------------------------------------------------------------------------------------------------------------|
| **Administrative**              | The number of pages of this type (administrative) visited by the user in that session.                                |
| **Administrative_Duration**     | The total amount of time (in seconds) spent by the user on administrative pages during the session.                  |
| **Informational**               | The number of informational pages visited by the user in that session.                                                |
| **Informational_Duration**      | The total time spent by the user on informational pages.                                                               |
| **ProductRelated**              | The number of product-related pages visited by the user.                                                              |
| **ProductRelated_Duration**     | The total time spent by the user on product-related pages.                                                             |
| **BounceRates**                 | The average bounce rate of the pages visited by the user. The bounce rate is the percentage of visitors who navigate away from the site after viewing only one page. |
| **ExitRates**                   | The average exit rate of the pages visited by the user. The exit rate is a metric that shows the percentage of exits from a page. |
| **PageValues**                  | The average value of the pages visited by the user. This metric is often used as an indicator of how valuable a page is in terms of generating revenue. |
| **SpecialDay**                  | Indicates the closeness of the site visiting time to a specific special day (e.g., Mother’s Day, Valentine's Day) in which the sessions are more likely to be finalized with a transaction. |
| **Month**                       | The month of the year in which the session occurred.                                                                  |
| **OperatingSystems**            | The operating system used by the user.                                                                               |
| **Browser**                     | The browser used by the user.                                                                                        |
| **Region**                      | The region from which the user is accessing the website.                                                              |
| **TrafficType**                 | The type of traffic (e.g., direct, paid search, organic search, referral).                                           |
| **VisitorType**                 | A categorization of users (e.g., Returning Visitor, New Visitor).                                                    |
| **Weekend**                     | A boolean indicating whether the session occurred on a weekend.                                                       |
| **Revenue**                     | A binary variable indicating whether the session ended in a transaction (purchase).                                   |


- Objectives:

0. Target column: **Revenue**
1. Data cleaning.
2. Exploratory Data Analysis.
3. Try classification models.
4. Show the **Feature Importance** for the columns.
5. Apply **_SMOTE_** to balance the classes and repeat the models.
6. Define a neural network for classification (original dataset and balanced dataset).
7. For both neural network models compute: _**confusion_matrix**_, _**roc_auc**_, _**f1-score**_, _**recall**_, _**precision**_, _**accuracy**_.
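
Most of the metrics named in the last objective (accuracy, precision, recall, f1-score) follow directly from the four cells of a binary confusion matrix; roc_auc is the exception, since it needs scores rather than counts. The sketch below illustrates the arithmetic on made-up counts. The helper name `binary_metrics` and the numbers are assumptions for illustration, not results from this dataset.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an imbalanced problem (not taken from the dataset).
example = binary_metrics(tn=850, fp=50, fn=60, tp=40)

for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With imbalanced classes, accuracy can look high while precision and recall for the positive class stay low, which is why the objectives ask for all of them.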

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv("online_shoppers_intention.csv")

df

In [None]:
# Revenue

df["Revenue"] = df["Revenue"].astype(int)

df["Revenue"].value_counts(normalize = True)

In [None]:
# Revenue

plt.figure(figsize = (8, 6))

sns.countplot(x = df["Revenue"], hue = df["Revenue"])
plt.show()

In [None]:
# Weekend

df["Weekend"] = df["Weekend"].astype(int)

df["Weekend"].value_counts(normalize = True)

In [None]:
# Relationship between "Weekend" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Weekend"], hue = df["Weekend"], ax = axes[0]);
sns.countplot(x = df["Weekend"], hue = df["Revenue"], ax = axes[1]);
plt.show()


In [None]:
# Month

map_month = {"Jan" : 0, "Feb" : 1, "Mar" : 2, "Apr" : 3, "May" : 4, "June" : 5, 
             "Jul" : 6, "Aug" : 7, "Sep" : 8, "Oct" : 9, "Nov" : 10, "Dec" : 11}

df["Month"] = df["Month"].apply(lambda x : map_month[x])

df["Month"].value_counts()

In [None]:
# Relationship between "Month" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Month"], hue = df["Month"], ax = axes[0]);
sns.countplot(x = df["Month"], hue = df["Revenue"], ax = axes[1]);

plt.show()

In [None]:
# VisitorType

df["VisitorType"].value_counts()

In [None]:
# Relationship between "VisitorType" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["VisitorType"], hue = df["VisitorType"], ax = axes[0]);
sns.countplot(x = df["VisitorType"], hue = df["Revenue"], ax = axes[1]);

plt.show()

In [None]:
# VisitorType

df = pd.get_dummies(data = df, prefix = "VisitorType", columns = ["VisitorType"], dtype = int)

df.head(3)

In [None]:
df.columns

In [None]:
# Correlation heatmap

plt.figure(figsize = (12, 8))

sns.heatmap(data = df.corr().round(2), annot = True)
plt.show()

In [None]:
# Relationship between "Administrative" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Administrative"], hue = df["Administrative"], ax = axes[0]);
sns.countplot(x = df["Administrative"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "Administrative_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["Administrative_Duration"], kde = True, ax = axes[0]);
sns.boxplot(y = df["Administrative_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["Administrative_Duration"]);
plt.show()

In [None]:
# Log transform of "Administrative_Duration"; np.log1p(x) computes log(1 + x)

df["log_Administrative_Duration"] = np.log1p(df["Administrative_Duration"])

# Relationship between "log_Administrative_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["log_Administrative_Duration"], kde = True, ax = axes[0]);
sns.boxplot(y = df["log_Administrative_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["log_Administrative_Duration"]);
plt.show()
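
The `log(x + 1)` transform is applied because the duration columns are strongly right-skewed, with a long tail of very large values. As a rough illustration of the effect, the sketch below measures skewness before and after the transform on synthetic data; the lognormal sample and the `skewness` helper are assumptions for illustration, not the notebook's data.

```python
import numpy as np


def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float((centered ** 3).mean() / values.std() ** 3)


# Synthetic right-skewed "durations" (an assumption for illustration only).
rng = np.random.default_rng(42)
durations = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

# log1p(x) computes log(1 + x), so a duration of 0 maps to 0 instead of -inf.
log_durations = np.log1p(durations)

skew_raw = skewness(durations)
skew_log = skewness(log_durations)

print(f"skewness before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

A skewness closer to 0 means a more symmetric distribution, which tends to make the box plots and histograms easier to read.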

In [None]:
# Relationship between "Informational" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Informational"], hue = df["Informational"], ax = axes[0]);
sns.countplot(x = df["Informational"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "Informational_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["Informational_Duration"], kde = True, ax = axes[0]);
sns.boxplot(y = df["Informational_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["Informational_Duration"]);
plt.show()

In [None]:
# Log transform of "Informational_Duration"; np.log1p(x) computes log(1 + x)

df["log_Informational_Duration"] = np.log1p(df["Informational_Duration"])

# Relationship between "log_Informational_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["log_Informational_Duration"], hue = df["Revenue"], kde = True, ax = axes[0]);
sns.boxplot(y = df["log_Informational_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["log_Informational_Duration"]);
plt.show()

In [None]:
# Relationship between "ProductRelated" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["ProductRelated"], hue = df["ProductRelated"], ax = axes[0]);
sns.countplot(x = df["ProductRelated"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "ProductRelated_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["ProductRelated_Duration"], kde = True, ax = axes[0]);
sns.boxplot(y = df["ProductRelated_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["ProductRelated_Duration"]);
plt.show()

In [None]:
# Log transform of "ProductRelated_Duration"; np.log1p(x) computes log(1 + x)

df["log_ProductRelated_Duration"] = np.log1p(df["ProductRelated_Duration"])

# Relationship between "log_ProductRelated_Duration" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["log_ProductRelated_Duration"], hue = df["Revenue"], kde = True, ax = axes[0]);
sns.boxplot(y = df["log_ProductRelated_Duration"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["log_ProductRelated_Duration"]);
plt.show()

In [None]:
# Relationship between "BounceRates" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["BounceRates"], hue = df["Revenue"], kde = True, ax = axes[0]);
sns.boxplot(y = df["BounceRates"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["BounceRates"]);
plt.show()

In [None]:
# Relationship between "PageValues" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.histplot(x = df["PageValues"], hue = df["Revenue"], kde = True, ax = axes[0]);
sns.boxplot(y = df["PageValues"], x = df["Revenue"], hue = df["Revenue"], ax = axes[1]);
plt.show()

plt.figure(figsize = (18, 7))

sns.boxplot(x = df["PageValues"]);
plt.show()

In [None]:
# Relationship between "SpecialDay" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["SpecialDay"], hue = df["SpecialDay"], ax = axes[0]);
sns.countplot(x = df["SpecialDay"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "OperatingSystems" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["OperatingSystems"], hue = df["OperatingSystems"], ax = axes[0]);
sns.countplot(x = df["OperatingSystems"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "Browser" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Browser"], hue = df["Browser"], ax = axes[0]);
sns.countplot(x = df["Browser"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "Region" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["Region"], hue = df["Region"], ax = axes[0]);
sns.countplot(x = df["Region"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
# Relationship between "TrafficType" and "Revenue"

fig, axes = plt.subplots(1, 2, figsize = (18, 7))

sns.countplot(x = df["TrafficType"], hue = df["TrafficType"], ax = axes[0]);
sns.countplot(x = df["TrafficType"], hue = df["Revenue"], ax = axes[1]);
plt.show()

In [None]:
columnas = [x for x in df.columns if x not in ("Administrative_Duration", "Informational_Duration", "ProductRelated_Duration")]

print(columnas)

In [None]:
df = df[columnas].copy()

df.head(3)

### X, y

In [None]:
X = df.drop("Revenue", axis = 1)
y = df[["Revenue"]]

print(f"X: {X.shape}, y: {y.shape}")

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify = y)

print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

### MinMaxScaler

In [None]:
from sklearn.preprocessing import MinMaxScaler

X_scaler = MinMaxScaler()

X_train = X_scaler.fit_transform(X_train)

X_test = X_scaler.transform(X_test)
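
`MinMaxScaler` learns a per-column minimum and maximum from the training data and maps each value with `(x - min) / (max - min)`. The NumPy sketch below reproduces that arithmetic on toy matrices to show why the scaler is fitted on the training split only. The arrays and the `min_max_transform` helper are hypothetical, and this is an illustration rather than scikit-learn's implementation.

```python
import numpy as np

# Toy train/test matrices (hypothetical values, one column per feature).
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 40.0]])
test = np.array([[2.5, 25.0], [12.0, 5.0]])

# "fit": learn the per-column min and max from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def min_max_transform(data):
    """Apply (x - min) / (max - min) using the training statistics."""
    return (data - col_min) / (col_max - col_min)


train_scaled = min_max_transform(train)
test_scaled = min_max_transform(test)

print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Test values outside the training range land outside `[0, 1]`. That is expected, and it avoids leaking information about the test set into preprocessing.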

### Models

In [None]:
%%time

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

# Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import jaccard_score

modelos = [LogisticRegression(), KNeighborsClassifier(), NearestCentroid(), GaussianNB(), DecisionTreeClassifier(random_state = 42),
           RandomForestClassifier(random_state = 42), AdaBoostClassifier(random_state = 42), GradientBoostingClassifier(random_state = 42), SVC()]

data = list()

for model in modelos:
    
    # Pass a 1-D target; a single-column DataFrame triggers a DataConversionWarning
    model.fit(X_train, y_train["Revenue"])
    
    yhat = model.predict(X_test)
    
    acc = accuracy_score(y_test, yhat)
    pre = precision_score(y_test, yhat)
    rec = recall_score(y_test, yhat)
    roc = roc_auc_score(y_test, yhat)
    f1s = f1_score(y_test, yhat)
    jac = jaccard_score(y_test, yhat)
    mat = confusion_matrix(y_test, yhat)
    
    data.append([str(model), model, acc, pre, rec, roc, f1s, jac, mat])
    
df_metricas = pd.DataFrame(data = data,
                           columns = ["nombre", "modelo", "accuracy", "precision", "recall", "roc_auc", "f1_score", "jaccard", "confusion_matrix"])

df_metricas

In [None]:
df_metricas.sort_values("roc_auc", ascending = False)
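
The `roc_auc` column above is computed from hard 0/1 predictions. For a binary problem that value reduces to the average of the true-positive and true-negative rates, i.e. balanced accuracy. A threshold-independent AUC needs continuous scores instead. The sketch below shows the difference on synthetic data; the dataset, the variable names and the choice of `LogisticRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (an assumption; roughly 85% negatives).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.85, 0.15], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2,
                                      random_state=42, stratify=y_demo)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)

labels = clf.predict(Xte)              # hard 0/1 predictions
scores = clf.predict_proba(Xte)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(yte, labels)
auc_from_scores = roc_auc_score(yte, scores)
balanced_acc = balanced_accuracy_score(yte, labels)

print(f"roc_auc (labels):  {auc_from_labels:.3f}")
print(f"roc_auc (scores):  {auc_from_scores:.3f}")
print(f"balanced accuracy: {balanced_acc:.3f}")
```

Not every estimator exposes `predict_proba` by default (for example, `SVC` needs `probability = True`, or its `decision_function` output can be used as the score), so the comparison loop would need a small per-model branch to adopt this.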

In [None]:
df_metricas.iloc[7, -1]

In [None]:
# Compute feature importance (row 7 of df_metricas is GradientBoostingClassifier)
importances = df_metricas.iloc[7, 1].feature_importances_

df_importances = pd.DataFrame(data = zip([x for x in df.columns if x != "Revenue"], importances),
                              columns = ["Columnas", "Importancia"])

df_importances = df_importances.sort_values("Importancia", ascending = False)

print("Feature Importance:")

for index, (feature, importance) in enumerate(df_importances.values):
    
    print(f"{index + 1:2}. {feature} ({importance:.4f})")

plt.figure(figsize = (12, 8))

plt.title("Feature Importances")
sns.barplot(x = df_importances["Importancia"], y = df_importances["Columnas"], color = "red")

plt.grid()
plt.show()

### SMOTE

In [None]:
from collections import Counter
from imblearn.over_sampling import SMOTE

# Repeat train_test_split() to recover the unscaled splits

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42, stratify = y)

print(f"Shape: {X_train.shape}")
print(Counter(y_train["Revenue"]))

smote = SMOTE(sampling_strategy = 0.3, random_state = 42)

X_train_balanceado, y_train_balanceado = smote.fit_resample(X_train, y_train["Revenue"])

print(f"Shape: {X_train_balanceado.shape}")
print(Counter(y_train_balanceado))
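
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not imbalanced-learn's implementation; the points, the `smote_like_sample` helper and the class counts are hypothetical. It also shows the arithmetic behind `sampling_strategy = 0.3`, which asks for a minority/majority ratio of roughly 0.3 after resampling.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def smote_like_sample(points, k=2):
    """Create one synthetic point between a random sample and a near neighbour."""
    base = points[rng.integers(len(points))]
    # Euclidean distances from the base point to every minority point.
    distances = np.linalg.norm(points - base, axis=1)
    # Position 0 of the sorted order is the base point itself, so skip it.
    neighbour_ids = np.argsort(distances)[1:k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbour - base)


synthetic = np.array([smote_like_sample(minority) for _ in range(20)])
print(synthetic[:3])

# Target minority size implied by a 0.3 ratio, using made-up class counts.
n_majority, n_minority = 1000, 100
n_minority_target = round(0.3 * n_majority)
print(n_minority_target - n_minority, "synthetic samples would be needed")
```

Because every synthetic point lies on a segment between two existing minority points, it stays inside the per-feature range of the original data.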

In [None]:
# MinMaxScaler()

X_scaler = MinMaxScaler()

X_train_balanceado = X_scaler.fit_transform(X_train_balanceado)

X_test = X_scaler.transform(X_test)

In [None]:
X_train_balanceado.shape, y_train_balanceado.shape

In [None]:
%%time

modelos = [LogisticRegression(), KNeighborsClassifier(), NearestCentroid(), GaussianNB(), DecisionTreeClassifier(random_state = 42),
           RandomForestClassifier(random_state = 42), AdaBoostClassifier(random_state = 42), GradientBoostingClassifier(random_state = 42), SVC()]

data = list()

for model in modelos:
    
    model.fit(X_train_balanceado, y_train_balanceado)
    
    yhat = model.predict(X_test)
    
    acc = accuracy_score(y_test, yhat)
    pre = precision_score(y_test, yhat)
    rec = recall_score(y_test, yhat)
    roc = roc_auc_score(y_test, yhat)
    f1s = f1_score(y_test, yhat)
    jac = jaccard_score(y_test, yhat)
    mat = confusion_matrix(y_test, yhat)
    
    data.append([str(model), model, acc, pre, rec, roc, f1s, jac, mat])
    
df_metricas_smote = pd.DataFrame(data = data,
                                 columns = ["nombre", "modelo", "accuracy", "precision", "recall", "roc_auc", "f1_score", "jaccard", "confusion_matrix"])

df_metricas_smote

In [None]:
df_metricas_smote.sort_values("roc_auc", ascending = False)

In [None]:
df_metricas_smote.iloc[7, -1]

In [None]:
# Compute feature importance (row 5 of df_metricas_smote is RandomForestClassifier)
importances = df_metricas_smote.iloc[5, 1].feature_importances_

df_importances = pd.DataFrame(data = zip([x for x in df.columns if x != "Revenue"], importances),
                              columns = ["Columnas", "Importancia"])

df_importances = df_importances.sort_values("Importancia", ascending = False)

print("Feature Importance:")

for index, (feature, importance) in enumerate(df_importances.values):
    
    print(f"{index + 1:2}. {feature} ({importance:.4f})")

plt.figure(figsize = (12, 8))

plt.title("Feature Importances")
sns.barplot(x = df_importances["Importancia"], y = df_importances["Columnas"], color = "red")

plt.grid()
plt.show()

### PairPlot

In [None]:
%%time

sns.pairplot(data = df, hue = "Revenue")
plt.show()

### Clustering - KMeans

In [None]:
from sklearn.cluster import KMeans

# Elbow method

inercias = list()

for k in range(2, 11):
    
    kmeans = KMeans(n_clusters = k, random_state = 42)
    
    kmeans.fit(df)
    
    inercias.append(kmeans.inertia_)

In [None]:
plt.plot(range(2, 11), inercias, color = "red")

plt.grid()
plt.show()
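
The quantity plotted for the elbow method, `inertia_`, is the sum of squared distances from each sample to its nearest cluster centre. The NumPy sketch below computes it by hand for fixed centres to make the definition concrete. The points, centres and the `inertia` helper are hypothetical, and no clustering is fitted here.

```python
import numpy as np

# Toy 2-D data (hypothetical points) forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
two_centres = np.array([[0.0, 1.0], [10.0, 1.0]])
one_centre = points.mean(axis=0, keepdims=True)


def inertia(data, centres):
    """Sum of squared distances from each point to its nearest centre."""
    # Pairwise squared distances, shape (n_points, n_centres).
    sq_dist = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dist.min(axis=1).sum())


inertia_k1 = inertia(points, one_centre)
inertia_k2 = inertia(points, two_centres)

print(f"k = 1: {inertia_k1}, k = 2: {inertia_k2}")
```

Since inertia is built from raw squared distances, columns with large numeric ranges dominate it, so scaling the features before clustering usually changes the shape of the elbow curve.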

### Neural Network

In [None]:
# Import everything from tensorflow.keras to avoid mixing two Keras packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.utils import to_categorical

### Original Data

In [None]:
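# X_train is still unscaled after the SMOTE split, while X_test is already scaled;
# apply the same fitted scaler so both share one scale.
X_train = X_scaler.transform(X_train)

# One Hot Encoding
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)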
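
`to_categorical` turns integer class labels into one-hot rows, and the metrics cells later undo this with `np.argmax`. The NumPy sketch below mirrors both steps on a hypothetical label vector; it is an illustration of the encoding, not Keras's implementation.

```python
import numpy as np

labels = np.array([0, 1, 1, 0, 1])  # hypothetical binary targets

# One-hot encode by picking rows of the 2x2 identity matrix.
one_hot = np.eye(2, dtype=int)[labels]
print(one_hot)

# argmax along the class axis recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The same `argmax` step applied to the softmax output picks the class with the highest predicted probability.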

In [None]:
X_train.shape, y_train.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
model = Sequential()

model.add(Input(shape = (X_train.shape[1], )))

model.add(Dense(units = 512, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 128, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 64, activation = "relu"))
model.add(Dropout(0.5))

model.add(Dense(units = y_test.shape[1], activation = "softmax"))

model.summary()

# Compile the model
model.compile(optimizer = "adam",
              loss      = "categorical_crossentropy",
              metrics   = ["accuracy"])

In [None]:
# Train the model

history = model.fit(X_train,
                    y_train,
                    validation_data = (X_test, y_test),
                    epochs = 100,
                    verbose = 1)

In [None]:
# Metrics
scores = model.evaluate(X_test, y_test, verbose = 1)

scores

In [None]:
# loss
plt.plot(history.history["loss"], label = "loss")
plt.plot(history.history["val_loss"], label = "val_loss")
plt.legend()
plt.show()

In [None]:
# accuracy
plt.plot(history.history["accuracy"], label = "acc")
plt.plot(history.history["val_accuracy"], label = "val_acc")
plt.legend()
plt.show()

### Metrics on the Original Data

In [None]:
yhat = model.predict(X_test)

yhat.shape

In [None]:
yhat = [np.argmax(x) for x in yhat]
y_test = [np.argmax(x) for x in y_test]

In [None]:
print(f"accuracy: {accuracy_score(y_test, yhat)}")
print(f"recall: {recall_score(y_test, yhat)}")
print(f"f1-score: {f1_score(y_test, yhat)}")
print(f"precision: {precision_score(y_test, yhat)}")
print(f"roc_auc: {roc_auc_score(y_test, yhat)}")
print(f"confusion_matrix: ")
print(confusion_matrix(y_test, yhat))

### SMOTE Data

In [None]:
# One Hot Encoding 

y_train_balanceado = to_categorical(y_train_balanceado)
y_test = to_categorical(y_test)

In [None]:
X_train_balanceado.shape, y_train_balanceado.shape

In [None]:
X_test.shape, y_test.shape

In [None]:
model = Sequential()

model.add(Input(shape = (X_train_balanceado.shape[1], )))

model.add(Dense(units = 512, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 128, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(units = 64, activation = "relu"))
model.add(Dropout(0.5))

model.add(Dense(units = y_test.shape[1], activation = "softmax"))

model.summary()

# Compile the model
model.compile(optimizer = "adam",
              loss      = "categorical_crossentropy",
              metrics   = ["accuracy"])

In [None]:
# Train the model

history = model.fit(X_train_balanceado,
                    y_train_balanceado,
                    validation_data = (X_test, y_test),
                    epochs = 100,
                    verbose = 1)

In [None]:
# Metrics
scores = model.evaluate(X_test, y_test, verbose = 1)

scores

In [None]:
# loss
plt.plot(history.history["loss"], label = "loss")
plt.plot(history.history["val_loss"], label = "val_loss")
plt.legend()
plt.show()

In [None]:
# accuracy
plt.plot(history.history["accuracy"], label = "acc")
plt.plot(history.history["val_accuracy"], label = "val_acc")
plt.legend()
plt.show()

### Metrics on the SMOTE Data

In [None]:
yhat = model.predict(X_test)

yhat.shape

In [None]:
yhat = [np.argmax(x) for x in yhat]
y_test = [np.argmax(x) for x in y_test]

In [None]:
print(f"accuracy: {accuracy_score(y_test, yhat)}")
print(f"recall: {recall_score(y_test, yhat)}")
print(f"f1-score: {f1_score(y_test, yhat)}")
print(f"precision: {precision_score(y_test, yhat)}")
print(f"roc_auc: {roc_auc_score(y_test, yhat)}")
print(f"confusion_matrix: ")
print(confusion_matrix(y_test, yhat))
