# Naive Bayes Classifier

- A simple but powerful algorithm based on Bayes' Theorem. It assumes that features are independent of one another given the class (the "naive" part), which makes it fast and efficient for:
    - Text classification (spam detection, sentiment analysis)
    - Medical diagnosis
    - Real-time predictions

- Main concept: **"Given past data, what is the probability this new example belongs to a class?"**
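
The main concept can be written out as a formula. The notation here is introduced only for illustration: $C$ is a class and $x_1, \dots, x_n$ are the feature values of the new example. Bayes' Theorem gives the first line, and the independence assumption gives the second:

$$
P(C \mid x_1, \dots, x_n) = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
$$

$$
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$

The predicted class is the one with the largest value of the right-hand side. The denominator is the same for every class, so it can be ignored when comparing classes.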

## Step 1: Import necessary libraries

In [17]:
# Import pandas for data manipulation
import pandas as pd

## Step 2: Load and Explore the Data

In [25]:
# Load the dataset into a DataFrame
df = pd.read_csv("./data/online_shoppers_intention.csv")

# Display the first few rows to understand the data structure
df.head()

# Check the DataFrame shape and data types
df.info()

# Show how many sessions ended with a sale (Revenue = True) or not (False)
print("\n\nRevenue value counts:\n", df['Revenue'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           12330 non-null  int64  
 1   Administrative_Duration  12330 non-null  float64
 2   Informational            12330 non-null  int64  
 3   Informational_Duration   12330 non-null  float64
 4   ProductRelated           12330 non-null  int64  
 5   ProductRelated_Duration  12330 non-null  float64
 6   BounceRates              12330 non-null  float64
 7   ExitRates                12330 non-null  float64
 8   PageValues               12330 non-null  float64
 9   SpecialDay               12330 non-null  float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int64  
 14  TrafficType           

In [28]:
# Summarize numerical features
df.describe()

| Statistic | Administrative | Administrative_Duration | Informational | Informational_Duration | ProductRelated | ProductRelated_Duration | BounceRates | ExitRates | PageValues | SpecialDay | OperatingSystems | Browser | Region | TrafficType |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 | 12330.0 |
| mean | 2.315166 | 80.818611 | 0.503569 | 34.472398 | 31.731468 | 1194.74622 | 0.022191 | 0.043073 | 5.889258 | 0.061427 | 2.124006 | 2.357097 | 3.147364 | 4.069586 |
| std | 3.321784 | 176.779107 | 1.270156 | 140.749294 | 44.475503 | 1913.669288 | 0.048488 | 0.048597 | 18.568437 | 0.198917 | 0.911325 | 1.717277 | 2.401591 | 4.025169 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 184.1375 | 0.0 | 0.014286 | 0.0 | 0.0 | 2.0 | 2.0 | 1.0 | 2.0 |
| 50% | 1.0 | 7.5 | 0.0 | 0.0 | 18.0 | 598.936905 | 0.003112 | 0.025156 | 0.0 | 0.0 | 2.0 | 2.0 | 3.0 | 2.0 |
| 75% | 4.0 | 93.25625 | 0.0 | 0.0 | 38.0 | 1464.157214 | 0.016813 | 0.05 | 0.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 |
| max | 27.0 | 3398.75 | 24.0 | 2549.375 | 705.0 | 63973.52223 | 0.2 | 0.2 | 361.763742 | 1.0 | 8.0 | 13.0 | 9.0 | 20.0 |


## Step 3: Preprocess the Data

We need to convert the data into a numeric format suitable for the model. First, we map the boolean `Revenue` target to integers (0 and 1). Then we encode the remaining non-numeric columns: `VisitorType` and `Month` hold text labels, and `Weekend` is boolean. We use label encoding for simplicity, which replaces each distinct value with an integer code. Note that these codes impose an arbitrary numeric order on the categories.

In [29]:
# Map the target 'Revenue' from False/True to 0/1
df['Revenue'] = df['Revenue'].map({False: 0, True: 1})

# Import LabelEncoder for encoding categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Encode the categorical columns: 'VisitorType', 'Weekend', and 'Month'
df['VisitorType'] = le.fit_transform(df['VisitorType'])  # e.g., 'Returning_Visitor'->numeric
df['Weekend'] = le.fit_transform(df['Weekend'])          # False->0, True->1
df['Month']   = le.fit_transform(df['Month'])            # month names -> integer codes in alphabetical (not calendar) order
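
`LabelEncoder` numbers the distinct values in sorted order, so month names end up in alphabetical rather than calendar order. The following is a minimal plain-Python sketch that mimics this behaviour. The helper `label_encode` and the sample month list are hypothetical and are not taken from the dataset.

```python
def label_encode(values):
    """Mimic label encoding: sort the distinct values, then number them from 0."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    codes = [mapping[v] for v in values]
    return codes, mapping


# Hypothetical sample of month labels (illustration only)
months = ["Feb", "Mar", "May", "Nov", "Dec", "Mar"]
codes, mapping = label_encode(months)

print(mapping)  # {'Dec': 0, 'Feb': 1, 'Mar': 2, 'May': 3, 'Nov': 4}
print(codes)    # [1, 2, 3, 4, 0, 2]
```

Because the cell above reuses a single `le` object, each call to `fit_transform` refits it, and afterwards `le.classes_` holds only the mapping for the last column encoded. Using one encoder per column would keep every mapping available for later lookups.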


## Step 4: Split the data into training and testing sets

In [32]:
# Separate features (X) and target (y)
X = df.drop('Revenue', axis=1)  # all columns except target
y = df['Revenue']               # target variable

# Split into a training set and a test set (70% train, 30% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

## Step 5: Create and Train the Naive Bayes Classifier

**Explanation:**

- `model = GaussianNB()`
  - This creates an instance of the Gaussian Naive Bayes classifier.
  - It models each feature, within each class, as normally distributed.
  - With no arguments, the class priors are estimated from the class frequencies in the training data, and `var_smoothing` keeps its default value of `1e-09`.

- `model.fit(X_train, y_train)`
  - This trains (fits) the classifier on the training data (`X_train` for features, `y_train` for the target).
  - Fitting estimates the prior probability of each class and the per-class mean and variance of every feature, which the model later uses to make predictions on new, unseen data.

In [33]:
# Import Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# Initialize the classifier
model = GaussianNB()

# Fit the model on the training data
model.fit(X_train, y_train)

GaussianNB parameters:

| Parameter | Value |
| --- | --- |
| priors | None |
| var_smoothing | 1e-09 |
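
To make concrete what `GaussianNB` estimates during `fit`, here is a rough from-scratch sketch in plain Python. It is an illustration under stated assumptions, not scikit-learn's actual implementation. The function names (`fit_gaussian_nb`, `predict_proba`) and the toy two-feature data are hypothetical. The smoothing term follows the documented meaning of `var_smoothing`: a portion of the largest feature variance added to every variance.

```python
import math
from collections import defaultdict


def variance(values):
    """Population variance (divide by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the class prior and per-feature mean/variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    # Small constant added to every variance for numerical stability
    epsilon = var_smoothing * max(variance(col) for col in zip(*X))

    params = {}
    for label, rows in rows_by_class.items():
        cols = list(zip(*rows))
        prior = len(rows) / len(X)
        means = [sum(c) / len(c) for c in cols]
        variances = [variance(c) + epsilon for c in cols]
        params[label] = (prior, means, variances)
    return params


def predict_proba(params, x):
    """Return the posterior probability of each class for one sample x."""
    log_scores = {}
    for label, (prior, means, variances) in params.items():
        log_p = math.log(prior)
        for v, mu, var in zip(x, means, variances):
            # Log of the Gaussian density of v under this class
            log_p += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        log_scores[label] = log_p

    # Normalise in log space to avoid underflow
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / total for label, s in log_scores.items()}


# Toy data (hypothetical): two features per session, label 1 = purchase
X_toy = [[0.0, 0.05], [1.0, 0.04], [0.5, 0.06],
         [20.0, 0.01], [25.0, 0.02], [30.0, 0.015]]
y_toy = [0, 0, 0, 1, 1, 1]

params = fit_gaussian_nb(X_toy, y_toy)
proba_buyer = predict_proba(params, [22.0, 0.012])
proba_browser = predict_proba(params, [0.2, 0.05])
print(proba_buyer)
print(proba_browser)
```

The sketch works with log-probabilities because multiplying many small densities quickly underflows to zero in floating-point arithmetic.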


## Step 6: Evaluate the Model

`y_pred = model.predict(X_test)`

- This uses the trained Naive Bayes classifier (`model`) to predict the labels for the test set features (`X_test`). This gives you the model's predictions on data it has not seen before, which is necessary for evaluating its performance.

`print("Classification Report:\n", classification_report(y_test, y_pred))`
- This prints a detailed classification report comparing the true labels (`y_test`) to the predicted labels (`y_pred`). The report includes precision, recall, F1-score, and support for each class, enabling you to understand how well the model performs for each category.
- It shows the performance metrics for a model that predicts two classes:
    - Class 0
    - Class 1

- There are 3,699 sessions in the test set:
    - 3,124 sessions belong to class 0
    - 575 sessions belong to class 1

| Term          | Meaning                                                                      |
| ------------- | ---------------------------------------------------------------------------- |
| **Precision** | Out of all items the model said are class X, how many are actually class X?  |
| **Recall**    | Out of all actual items in class X, how many did the model correctly find?   |
| **F1-score**  | The harmonic mean of precision and recall; a higher value means both are high. |
| **Support**   | The number of actual items in that class.                                    |
| **Macro avg** | The average of precision, recall, and F1-score across both classes, treating them equally.|
| **Weighted avg**| The average of precision, recall, and F1-score, but weighted by how many samples are in each class (so class 0 has more influence).|

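- The results show that the model is much better at predicting class 0 (F1-score 0.91) than class 1 (F1-score 0.52), and overall gets about 85% of predictions correct. This is likely because there are far more class 0 cases in the data, so overall accuracy is dominated by the majority class.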

In [34]:
# Make predictions on the test set
y_pred = model.predict(X_test)

# Import evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Compute and display the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))  # Overall fraction of correct predictions

# Show precision, recall, F1-score for each class
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Compute and display the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.849418761827521

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.91      0.91      3124
           1       0.52      0.53      0.52       575

    accuracy                           0.85      3699
   macro avg       0.71      0.72      0.72      3699
weighted avg       0.85      0.85      0.85      3699

Confusion Matrix:
[[2835  289]
 [ 268  307]]
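
The figures in the classification report can be re-derived by hand from the confusion matrix above, where rows are the actual classes and columns are the predicted classes. The sketch below uses plain Python with the printed matrix values. The helper `precision_recall_f1` is a hypothetical name introduced for illustration.

```python
# Confusion matrix copied from the output above
# rows = actual class (0, 1), columns = predicted class (0, 1)
cm = [[2835, 289],
      [268, 307]]

tn, fp = cm[0]  # actual 0: predicted 0, predicted 1
fn, tp = cm[1]  # actual 1: predicted 0, predicted 1
total = tn + fp + fn + tp


def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 for one class treated as 'positive'."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tn + tp) / total
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)  # class 1 as positive
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)  # class 0 as positive

support0 = tn + fp
support1 = fn + tp
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (f1_0 * support0 + f1_1 * support1) / total

# Accuracy of always predicting the majority class (class 0)
baseline = support0 / total

print(f"accuracy:    {accuracy:.4f}")
print(f"class 0:     precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1:     precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"macro f1:    {macro_f1:.2f}")
print(f"weighted f1: {weighted_f1:.2f}")
print(f"baseline:    {baseline:.4f}")
```

The baseline line shows how much of the overall accuracy comes from class imbalance alone: predicting "no purchase" for every session would already be right for 3,124 of the 3,699 test sessions. For this reason the class 1 precision and recall are more informative here than accuracy.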


## Step 7: Use the model to make a prediction on a new sample

In [43]:
# Create a new sample (example values based on the dataset's feature order)
new_session = [[0,     # Administrative
                0.0,   # Administrative_Duration
                2,     # Informational
                0.0,   # Informational_Duration
                20,    # ProductRelated
                500.0, # ProductRelated_Duration
                0.02,  # BounceRates
                0.01,  # ExitRates
                0.005, # PageValues
                0.0,   # SpecialDay
                3,     # Month (encoded)
                2,     # OperatingSystems
                1,     # Browser
                1,     # Region
                1,     # TrafficType
                1,     # VisitorType (label-encoded; verify which category code 1 represents)
                0      # Weekend (False encoded as 0)
               ]]

# Use the same column names as the training data
new_session_df = pd.DataFrame(new_session, columns=X.columns)

# Predict using the trained model
prediction = model.predict(new_session_df)

# Display the result
print("\nPredicted Revenue:", "Yes" if prediction[0] == 1 else "No")


Predicted Revenue: No
