# Project Title

## Problem Definition


State the business problem. Translate the business problem into a data science problem by stating what kind of problem it is (supervised vs. unsupervised) and whether it is a classification, regression, or clustering problem.

Business Problem

* We want to know whether a future customer will make a transaction, based on their previous transactions.
* This is a supervised problem because we have labeled data (the `target` column records whether or not the customer made the transaction, as 0 or 1).
* This is a classification problem because we are classifying into two categories (1 is a successful transaction, 0 is an unsuccessful one).


## Data Collection/Sources


Load Pandas, NumPy, and Matplotlib.

In [None]:
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from sklearn import datasets, metrics, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

Load the training data (Transaction.train.csv) from the URL below.

In [None]:
# First, store the file's URL in a variable
url = 'https://ddc-datascience.s3.amazonaws.com/Projects/Project.1-Transactions/Data/Transaction.train.csv'
# Read the CSV file into a pandas DataFrame and assign it to df
df = pd.read_csv(url)
# Use the .head() method to look at the first 5 rows of the DataFrame
df.head()

## Data Cleaning


In [None]:
# using .info to look at a general summary in the dataframe
df.info()

In [None]:
# using .describe to look at common statistical info of the data
df.describe()

In [None]:
# .shape gives info of (rows, columns)
df.shape

In [None]:
df.tail()

In [None]:
# using .copy allows me to make a new object with the data in the original dataframe without changing original dataframe
df_new = df.copy()
df_new

In [None]:
# Checking to see if the copy looks correct
df_new.head()

In [None]:
# Dropping the 'Unnamed: 0' column because it is just the row number (0-180000)
df_new.drop(['Unnamed: 0'], axis=1, inplace=True)
df_new.head()

In [None]:
# Also dropping ID_code because it gives the same info as the row number
df_new.drop(['ID_code'], axis=1, inplace=True)
df_new.head()

In [None]:
# Column 'target' has values of either 0 or 1 based on the .info
# to make sure we still need this column I will look at the data
df_new['target'].unique()

In [None]:
# I will keep this column because it holds the 0 or 1 label assigned to each row
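Beyond the unique values, the share of each class is worth checking, since an uneven split affects how accuracy should be read later on. A minimal sketch, using a small made-up frame (`toy`, hypothetical) in place of `df_new`:

```python
import pandas as pd

# Hypothetical stand-in for df_new; the real frame comes from Transaction.train.csv
toy = pd.DataFrame({"target": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# Number of rows per class, and the same expressed as a fraction of all rows
class_counts = toy["target"].value_counts()
class_share = toy["target"].value_counts(normalize=True)

print(class_counts)
print(class_share)
```

Running the same two `value_counts` calls on `df_new['target']` would show how skewed the real data is.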

In [None]:
# Check presence of nulls
df_new.isnull().sum()

In [None]:
# We have no nulls so we can move on from the data cleaning!

## Exploratory Data Analysis


In [None]:
predictors = df_new.drop(columns=['target'])  # Remove the target column

# Calculate the correlation matrix
correlation_matrix = predictors.corr()

# Plot the heatmap for visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of Predictor Variables')
plt.show()

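This shows that all pairwise correlations between the predictors are close to 0, which means there is little to no linear correlation between the variables. This suits Naive Bayes, which assumes the features are conditionally independent given the class. Note that near-zero correlation only rules out linear dependence; it does not by itself prove independence.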
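With many predictors a heatmap can be hard to read, so a single number can back up the visual impression. The sketch below is one possible check, run on synthetic independent features (the `var_i` names and the data are assumptions, not the project's columns): it reports the largest absolute off-diagonal correlation.

```python
import numpy as np
import pandas as pd

# Synthetic, independently drawn features standing in for the real predictors
rng = np.random.default_rng(0)
toy_predictors = pd.DataFrame(
    rng.normal(size=(5000, 6)),
    columns=[f"var_{i}" for i in range(6)],
)

corr = toy_predictors.corr().to_numpy()

# Keep only the off-diagonal entries (the diagonal is always 1)
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
max_abs_corr = np.abs(off_diag).max()

print(f"Largest absolute pairwise correlation: {max_abs_corr:.3f}")
```

Applied to the notebook's `predictors` frame, a small value here would support the "nearly 0" reading of the heatmap.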

In [None]:
# Here I'm not taking out the target column. This is an issue because I'm not looking only at the predictor variables.
corr = df_new.corr()
plt.figure(figsize=(8,4))
sns.heatmap(corr, cmap='RdYlBu',annot = False);

In [None]:
for col in df_new.columns:
    print(col)
    print(df_new[col].unique())

In [None]:
for plot in df_new.columns: # Looking at distribution of all variables
    fig = px.histogram(df_new, x=plot, color='target')
    fig.show()


In [None]:
for plot in df_new.columns:
    fig = px.box(df_new, x='target', y=plot)
    fig.show()

* Creating two data frames: one with successful transactions, one with unsuccessful transactions.

In [None]:
# Successful transactions (target == 1)
successful_transactions = df[df['target'] == 1].copy()

# Unsuccessful transactions (target == 0)
unsuccessful_transactions = df[df['target'] == 0].copy()

In [None]:
# Making sure the copy worked by checking original dataframe
print(df.head())  # Original DataFrame should remain unchanged

## Processing


Create two data frames: one with all the predictor columns (everything except for Unnamed: 0, ID_code and target) and one with just the target. Make sure they are copies and not slices.

In [None]:
# Create the predictors DataFrame (excluding 'Unnamed: 0', 'ID_code', and 'target')
predictors = df.drop(columns=['Unnamed: 0', 'ID_code', 'target']).copy()

# Create the target DataFrame (just the 'target' column)
target = df['target'].copy()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=42)

# Define the Gaussian Naïve Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

In [None]:
# Perform cross-validation loop to calculate accuracy of model
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(gnb, predictors, target, cv=5, scoring='accuracy')

# Calculate the mean accuracy from the cross-validation scores
mean_accuracy = np.mean(cv_scores)

# Report the cross-validation accuracy
print(f"Cross-Validation Accuracy: {mean_accuracy * 100:.2f}%")

Cross-validation accuracy is 91.12%, while the accuracy from the single train-test split above is 90.89%. The difference likely comes from the randomness of that one split. Cross-validation averages performance across multiple splits, so it gives a more robust general estimate of model performance.
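To see how much the estimate moves from fold to fold, the spread of the scores can be reported next to the mean. The sketch below uses synthetic imbalanced data from `make_classification` as a stand-in for `predictors` and `target` (an assumption for illustration). It also passes an explicit `StratifiedKFold` with shuffling; with `cv=5` and a classifier, scikit-learn already uses stratified folds, but without shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic imbalanced data (about 90% class 0) standing in for predictors/target
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# Shuffled, stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), X_toy, y_toy, cv=cv, scoring="accuracy")

print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

A small standard deviation relative to the gap between the two accuracy figures would suggest the gap is not just split-to-split noise.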


In [None]:
# plot histogram of the accuracy scores in cross-validation loop
plt.figure(figsize=(8, 6))
plt.hist(cv_scores, bins=5, edgecolor='black', color = 'green', alpha=0.7)  # Customize number of bins
plt.title("Histogram of Cross-Validation Accuracy Scores")
plt.xlabel("Accuracy")
plt.ylabel("Frequency")
plt.grid(True)

plt.show()
print("Cross-validation accuracy scores: ", cv_scores)

In [None]:
# Generate the confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_test, y_pred)

# Plot the confusion matrix as a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["0", "1"], yticklabels=["0", "1"])
plt.title("Confusion Matrix")
plt.xlabel("Predicted (0 = Unsuccessful, 1 = Successful)")
plt.ylabel("Actual (0 = Unsuccessful, 1 = Successful)")
plt.show()

# Generate the classification report
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)


In [None]:
# Separate the successful (target == 1) and unsuccessful (target == 0) transactions
successful_transactions = df[df['target'] == 1]
unsuccessful_transactions = df[df['target'] == 0]

# Determine the number of non-successful rows to keep
num_successful = len(successful_transactions)
num_unsuccessful_to_keep = num_successful  # We want a 50/50 split

# Randomly sample the non-successful transactions
unsuccessful_transactions_sampled = unsuccessful_transactions.sample(n=num_unsuccessful_to_keep, random_state=42)

# Combine the successful and non-successful transactions to form the balanced dataset
balanced_df = pd.concat([successful_transactions, unsuccessful_transactions_sampled])

# Shuffle the combined dataset to mix the rows of successful and non-successful transactions
balanced_df_x = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# balanced_df has a 50/50 split between successful and non-successful transactions
print(f"Balanced dataset shape: {balanced_df_x.shape}")

# Create the predictors DataFrame (excluding 'Unnamed: 0', 'ID_code', and 'target')
predictors_b = balanced_df_x.drop(columns=['Unnamed: 0', 'ID_code', 'target']).copy()

# Create the target DataFrame (just the 'target' column)
target_b = balanced_df_x['target'].copy()

balanced_df_x.head()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(predictors_b, target_b, test_size=0.2, random_state=42)

# Define the Gaussian Naïve Bayes model
gnb_b = GaussianNB()
gnb_b.fit(X_train_b, y_train_b)
y_pred_b = gnb_b.predict(X_test_b)

# Evaluate the model's performance
accuracy = accuracy_score(y_test_b, y_pred_b)
print(f"Accuracy: {accuracy * 100:.2f}%")

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics

# Generate confusion matrix
confusion_matrix_b = metrics.confusion_matrix(y_test_b, y_pred_b)

# Plot confusion matrix
plt.figure(figsize=(6, 5))

# Create heatmap
sns.heatmap(confusion_matrix_b, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Successful (0)', 'Successful (1)'], yticklabels=['Non-Successful (0)', 'Successful (1)'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
confusion_matrix_b = metrics.confusion_matrix(y_test_b, y_pred_b)
print(confusion_matrix_b)

In [None]:
X_balanced = balanced_df_x.drop(columns=['Unnamed: 0', 'ID_code', 'target'])  # Exclude the target column and other irrelevant columns
y_balanced = balanced_df_x['target']

gnb_b = GaussianNB()
cv_scores = cross_val_score(gnb_b, X_balanced, y_balanced, cv=5, scoring='accuracy')

# Calculate the mean accuracy from the cross-validation scores
mean_accuracy = np.mean(cv_scores)

# Report the cross-validation accuracy
print(f"Cross-Validation Accuracy on 50/50 Split: {mean_accuracy * 100:.2f}%")


In [None]:
X_imbalanced = df.drop(columns=['Unnamed: 0', 'ID_code', 'target'])
y_imbalanced = df['target']

# For balanced data (50/50 split)
X_balanced = balanced_df_x.drop(columns=['Unnamed: 0', 'ID_code', 'target'])
y_balanced = balanced_df_x['target']

# Define the Gaussian Naïve Bayes model
gnb = GaussianNB()

#  Perform cross-validation on the imbalanced training data
cv_scores_imbalanced = cross_val_score(gnb, X_imbalanced, y_imbalanced, cv=5, scoring='accuracy')

# Perform cross-validation on the 50/50 balanced training data
cv_scores_balanced = cross_val_score(gnb_b, X_balanced, y_balanced, cv=5, scoring='accuracy')

# Calculate and compare the mean accuracy for both
mean_accuracy_imbalanced = np.mean(cv_scores_imbalanced)
mean_accuracy_balanced = np.mean(cv_scores_balanced)

print(f"Cross-Validation Accuracy on Imbalanced Data: {mean_accuracy_imbalanced * 100:.2f}%")
print(f"Cross-Validation Accuracy on 50/50 Balanced Data: {mean_accuracy_balanced * 100:.2f}%")


print(f"Imbalanced Data CV Scores: {cv_scores_imbalanced}")
print(f"Balanced Data CV Scores: {cv_scores_balanced}")


In [None]:
report = classification_report(y_test, y_pred)
print("Classification Report:")
print(report)

report_b = classification_report(y_test_b, y_pred_b)
print("Classification Report on Balanced Data:")
print(report_b)

## Data Visualization/Communication of Results


In [None]:
report = classification_report(y_test, y_pred, output_dict=True)
report_b = classification_report(y_test_b, y_pred_b, output_dict=True)

df_report = pd.DataFrame(report).transpose()
df_report_b = pd.DataFrame(report_b).transpose()

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Original dataset: grouped bars, so the three metrics sit side by side instead of overlapping
df_report.iloc[:-3][['precision', 'recall', 'f1-score']].plot(kind='bar', ax=axes[0], color=['blue', 'green', 'hotpink'], rot=0)

axes[0].set_title('Classification Report - Original Data')
axes[0].set_xlabel('Classes')
axes[0].set_ylabel('Scores')
axes[0].legend()

# Balanced dataset: grouped bars, so the three metrics sit side by side instead of overlapping
df_report_b.iloc[:-3][['precision', 'recall', 'f1-score']].plot(kind='bar', ax=axes[1], color=['blue', 'green', 'hotpink'], rot=0)

axes[1].set_title('Classification Report - Balanced Data')
axes[1].set_xlabel('Classes')
axes[1].set_ylabel('Scores')
axes[1].legend()

plt.tight_layout()
plt.show()


As we can see, the model trained on the balanced dataset has a much better F1 score for the successful transactions (1). The F1 score is the harmonic mean of precision and recall, so a better F1 score means the model is doing a better job of balancing the two for that class. This makes sense: the original dataset was skewed toward the unsuccessful transactions, and we achieved a more balanced set by keeping only a random sample of them. In terms of our business problem, we would be better able to predict whether someone will purchase something based on their past transactions.
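For reference, the F1 score can be reproduced by hand from precision and recall. The sketch below uses made-up labels (`y_true` and `y_hat` are illustrative only, not results from this project) and checks the manual value against scikit-learn's `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration only; not results from this project
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_hat)  # TP / (TP + FP)
recall = recall_score(y_true, y_hat)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual, f1_score(y_true, y_hat))
```

Because the harmonic mean is pulled toward the smaller of the two values, a class with low recall keeps a low F1 even when its precision is high.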

We also want to focus on the successful transactions rather than the unsuccessful ones, because that tells us more about how to keep getting successful transactions.