# Practice Assignment 2

In this assignment, you will explore the Online Shoppers Purchasing Intention Dataset and build a machine learning model that predicts whether a visitor will make a purchase. You will analyze the data in the online_shoppers_intention.csv file, using functions from the pandas module to load, inspect, and query it. You are expected to preprocess the data and then use it to train machine learning models from the sklearn library.

Dataset link: https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset

Each cell contains the instructions for its question. Follow those instructions and write your code after each block marked:

YOUR CODE HERE

Please use the exact variable name if it is specified in the comment.

We’ll run a Python test script against your program to test whether each function implementation is correct.

In [1]:
%%capture
########################################################
### EXECUTE THIS CELL BEFORE YOU TEST YOUR SOLUTIONS ###
########################################################
"""
Import the libraries and test helpers used throughout the exercise.
"""
import imp
import numpy as np
from nose.tools import assert_equal
from pandas.testing import assert_frame_equal, assert_series_equal
sol = imp.load_compiled("sol", "./.sol.py")

In [2]:
"""
Part 01: Setup and Data Loading

1. Import the pandas library.

2. Use pandas to read the dataset from the .csv file and save it to "shoppers_df". 

3. In some datasets, a value of 0 (zero) indicates missing information that was not recorded. 
   For this dataset, assume that 0s represent missing data for the following analysis.
   
   Replace all 0 or 0.0 values in the DataFrame with NaN using replace(). 
   Hint: Make sure to set inplace to True.

4. Display the first 5 rows of the DataFrame "shoppers_df".

"""

# 📦 Step 1: Import the pandas library
import pandas as pd

# 📥 Step 2: Load the dataset
shoppers_df = pd.read_csv("online_shoppers_intention.csv")

# 🔁 Step 3: Replace all 0 and 0.0 values with NaN
# (0 and 0.0 compare equal, so a single call covers both)
shoppers_df.replace(to_replace=0, value=np.nan, inplace=True)

# 👀 Step 4: Display the first 5 rows
shoppers_df.head()



Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,,,,,1.0,,0.2,0.2,,,Feb,1,1,1,1,Returning_Visitor,False,False
1,,,,,2.0,64.0,,0.1,,,Feb,2,2,1,2,Returning_Visitor,False,False
2,,,,,1.0,,0.2,0.2,,,Feb,4,1,9,3,Returning_Visitor,False,False
3,,,,,2.0,2.666667,0.05,0.14,,,Feb,3,2,2,4,Returning_Visitor,False,False
4,,,,,10.0,627.5,0.02,0.05,,,Feb,3,3,1,4,Returning_Visitor,True,False


In [3]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(shoppers_df, sol.shoppers_df)
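
The zero-to-NaN replacement in Part 01 can be tried on a small DataFrame first. The sketch below is illustrative only: `toy_df` and its column names are made up and are not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; these column names are invented for illustration.
toy_df = pd.DataFrame({
    "pages": [0, 3, 5],
    "duration": [0.0, 12.5, 0.0],
    "month": ["Feb", "Mar", "Feb"],
})

# 0 and 0.0 compare equal, so one replace() call handles both.
toy_df.replace(to_replace=0, value=np.nan, inplace=True)

print(toy_df)
# An integer column is upcast to float64 once it holds NaN.
print(toy_df.dtypes)
```

Note that the integer column changes dtype, which is why the count columns of the real dataset show up as float64 after the replacement.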


In [4]:
"""
Part 02: Data Exploration

1. Display the shape of the DataFrame "shoppers_df" to know the number of rows and columns. 
   Store the shape in a variable called "shape_of_df"

2. Use the `info()` method on "shoppers_df" to display the summary including the data types 
   and the presence of null values.

3. Create a variable named "missing_values_count" that stores the total count of missing 
   values in the DataFrame.
"""

# 🔍 Step 1: Display and store the shape
shape_of_df = shoppers_df.shape
print("Shape of DataFrame:", shape_of_df)

# 🧾 Step 2: Summary of data types and null values
shoppers_df.info()

# ❓ Step 3: Count total missing values
missing_values_count = shoppers_df.isnull().sum().sum()
print("Total missing values:", missing_values_count)



Shape of DataFrame: (12330, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Administrative           6562 non-null   float64
 1   Administrative_Duration  6427 non-null   float64
 2   Informational            2631 non-null   float64
 3   Informational_Duration   2405 non-null   float64
 4   ProductRelated           12292 non-null  float64
 5   ProductRelated_Duration  11575 non-null  float64
 6   BounceRates              6812 non-null   float64
 7   ExitRates                12254 non-null  float64
 8   PageValues               2730 non-null   float64
 9   SpecialDay               1251 non-null   float64
 10  Month                    12330 non-null  object 
 11  OperatingSystems         12330 non-null  int64  
 12  Browser                  12330 non-null  int64  
 13  Region                   12330 non-null  int

In [5]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(shape_of_df, sol.shape_of_df)
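
To see how the shape and the missing-value counts in Part 02 relate, here is a minimal sketch on a hypothetical two-column DataFrame (`toy_df` and its values are assumptions made for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing entries in total.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

rows, cols = toy_df.shape              # shape is a (rows, columns) tuple
per_column = toy_df.isnull().sum()     # Series: missing count per column
total_missing = int(per_column.sum())  # scalar: missing count overall

print(rows, cols)
print(per_column)
print(total_missing)
```

The first `sum()` collapses each column into one count, and the second `sum()` adds those counts into a single number.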


In [6]:
"""
Part 03: Data Pre-processing

1. First, save a copy of "shoppers_df" to "new_shoppers_df".

2. Inspect the dataset for missing values in each column and handle them appropriately. 
   For each column, fill any missing values with the mode of that column. Implement this using a for loop.

3. Drop the 'Month' column from "new_shoppers_df" as it will not be used in further analysis.

4. Perform one-hot encoding on the 'VisitorType' column. 
  
  First, use `pd.get_dummies()` to create a DataFrame "visitor_dummies". 
  
  Then, concatenate "visitor_dummies" with "new_shoppers_df".
  
  a. Hint: For more about pd.get_dummies(): 
       https://www.geeksforgeeks.org/python-pandas-get_dummies-method/
  b. Do not forget to drop the old 'VisitorType' column.
  c. Set parameter drop_first = True. We only need k-1 categories to represent
     k categorical values. 

5. After the encoding and concatenation, display the first 5 rows of "new_shoppers_df" to verify the changes.
  
"""

# 📝 Step 1: Make a copy of the DataFrame
new_shoppers_df = shoppers_df.copy()

# 🧮 Step 2: Fill missing values with mode for each column
for col in new_shoppers_df.columns:
    if new_shoppers_df[col].isnull().any():
        mode_val = new_shoppers_df[col].mode()[0]
        # Assign back: inplace=True on a column selection may act on a copy
        new_shoppers_df[col] = new_shoppers_df[col].fillna(mode_val)

# ✂️ Step 3: Drop the 'Month' column
new_shoppers_df.drop('Month', axis=1, inplace=True)

# 🔢 Step 4: One-hot encoding for 'VisitorType'
visitor_dummies = pd.get_dummies(new_shoppers_df['VisitorType'], drop_first=True)

# 🧬 Concatenate dummies and drop original column
new_shoppers_df = pd.concat([new_shoppers_df.drop('VisitorType', axis=1), visitor_dummies], axis=1)

# 👀 Step 5: Display the first 5 rows
new_shoppers_df.head()



Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,OperatingSystems,Browser,Region,TrafficType,Weekend,Revenue,Other,Returning_Visitor
0,1.0,4.0,1.0,9.0,1.0,17.0,0.2,0.2,53.988,0.6,1,1,1,1,False,False,0,1
1,1.0,4.0,1.0,9.0,2.0,64.0,0.2,0.1,53.988,0.6,2,2,1,2,False,False,0,1
2,1.0,4.0,1.0,9.0,1.0,17.0,0.2,0.2,53.988,0.6,4,1,9,3,False,False,0,1
3,1.0,4.0,1.0,9.0,2.0,2.666667,0.05,0.14,53.988,0.6,3,2,2,4,False,False,0,1
4,1.0,4.0,1.0,9.0,10.0,627.5,0.02,0.05,53.988,0.6,3,3,1,4,True,False,0,1


In [7]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(new_shoppers_df, sol.new_shoppers_df)
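
The mode imputation and one-hot encoding steps in Part 03 can be checked on a small example. This is a sketch under assumptions: `toy_df`, its columns, and its category names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a gap, one categorical column.
toy_df = pd.DataFrame({
    "clicks": [2.0, np.nan, 2.0, 7.0],
    "visitor": ["New", "Returning", "Returning", "Other"],
})

# mode() returns a Series because ties are possible, so take its first entry.
for col in toy_df.columns:
    if toy_df[col].isnull().any():
        toy_df[col] = toy_df[col].fillna(toy_df[col].mode()[0])

# drop_first=True drops the first category in sorted order ("New" here),
# leaving k-1 indicator columns for k categories.
dummies = pd.get_dummies(toy_df["visitor"], drop_first=True)
encoded = pd.concat([toy_df.drop("visitor", axis=1), dummies], axis=1)

print(encoded)
```

A row with every indicator column equal to zero belongs to the dropped category. The dtype of the indicator columns depends on the pandas version (integers in older releases, booleans from pandas 2.0 onward).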


In [None]:
"""
Part 04: Data Preparation for Modeling

1. Separate the dataset into features and target variable ("Revenue"), 
   storing them in "X" and "y" respectively.
   
2. Further, split the dataset into training and test sets using `train_test_split()` 
   with a test size of 0.2 and a random state of 42. 
   Store the outputs in "X_train", "X_test", "y_train", "y_test".

"""

# YOUR CODE HERE


In [None]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(X, sol.X)
assert_series_equal(y, sol.y)

assert_frame_equal(X_train, sol.X_train)
assert_frame_equal(X_test, sol.X_test)
assert_series_equal(y_train, sol.y_train)
assert_series_equal(y_test, sol.y_test)
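
The feature/target separation and the train/test split in Part 04 follow a common pattern, sketched here on made-up data. The names `toy_df`, `features`, `target`, and the `f_*`/`t_*` variables are illustrative and deliberately differ from the names the assignment requires.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two feature columns and one label column.
toy_df = pd.DataFrame({
    "f1": np.arange(10),
    "f2": np.arange(10) * 2.0,
    "label": [0, 1] * 5,
})

features = toy_df.drop("label", axis=1)  # everything except the target
target = toy_df["label"]                 # the target as a Series

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(f_train.shape, f_test.shape)
```

Fixing `random_state` matters for autograding: the same seed yields the same rows in each split on every run.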


In [None]:
"""
Part 05: Model Selection & Training

1. First, import libraries: LogisticRegression, accuracy_score and StandardScaler. 
   Use StandardScaler() to scale the features.
   
2. Train a logistic regression model using the training data and name the model "logistic_model". 
   Set max_iter to 1000.

3. Make predictions on the test set using the "logistic_model" 
   and store the predictions in a variable "y_pred".

4. Calculate the accuracy of the model on the test set and store it 
   in a variable "model_accuracy".

5. Display the result for "model_accuracy"
"""

# YOUR CODE HERE


In [None]:
##########################
### TEST YOUR SOLUTION ###
##########################

np.testing.assert_equal(y_pred, sol.y_pred)
assert_equal(model_accuracy, sol.model_accuracy)
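
The scaling, training, and scoring steps in Part 05 can be sketched on synthetic data. Everything below is an assumption made for illustration: the data is randomly generated, and the scaler is fitted on the training split only, which is one common convention. The reference solution may scale the features differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a linearly separable label.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
labels = (features[:, 0] + features[:, 1] / 10.0 > 0).astype(int)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Fit the scaler on the training split, then apply the same transform to both.
scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)
f_test_scaled = scaler.transform(f_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(f_train_scaled, l_train)

predictions = clf.predict(f_test_scaled)
acc = accuracy_score(l_test, predictions)
print(acc)
```

Standardizing gives each feature zero mean and unit variance on the training data, which usually helps the logistic regression solver converge within the iteration limit.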


In [None]:
"""
Part 06: Visualize the Confusion Matrix

1. Import matplotlib.pyplot and seaborn here

2. Calculate the confusion matrix using the actual and the predicted values.

3. Create a heatmap using `matplotlib` and `seaborn` to visualize the confusion matrix.

4. Be sure to label the heatmap with the actual labels for better readability.

NOTE: 
- Your plot cannot be autograded.  
- The guide contains a PDF with the solution plot.
- Compare your plot against it and, if they match, set the variable q6_plot_check = 'yes'.

"""

q6_plot_check = 'no' # change to yes after you verify your plot

# YOUR CODE HERE


In [None]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal('yes', q6_plot_check)
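
A confusion matrix heatmap like the one requested in Part 06 can be sketched as follows. The label lists and the class names are hypothetical, and the styling choices (color map, figure size, title) are assumptions rather than the layout of the solution plot.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for six samples.
actual = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]

# Rows correspond to actual classes, columns to predicted classes.
cm = confusion_matrix(actual, predicted)

class_names = ["No purchase", "Purchase"]  # illustrative display names
fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=class_names, yticklabels=class_names, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Confusion matrix (toy data)")
plt.show()
```

With `annot=True` and `fmt="d"`, each cell shows its integer count, so the diagonal (correct predictions) is easy to compare against the off-diagonal errors.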