# Predicting Whether a Customer Will Subscribe to the Service
## Logistic Regression - sklearn
- Target Variable: Subscription Status (Yes/No)
- Features: Customer Age, Gender, Location, Purchase Amount, Discount Applied
- Goal: Predict whether a customer will subscribe to the service based on their attributes and purchasing behavior, then provide recommendations to the Marketing and Sales teams.


In [1]:
import pandas as pd

# Load the data
data = pd.read_csv('../data/raw/shopping_trends.csv')

# Print column types, non-null counts, and the first rows
print(data.info())
print(data.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount

In [2]:
# Map the prediction target from 'Yes'/'No' strings to 1/0 integers
data['Subscription Status'] = data['Subscription Status'].map({'Yes': 1, 'No': 0})

# Check each column for missing values
print(data.isnull().sum())


Customer ID                 0
Age                         0
Gender                      0
Item Purchased              0
Category                    0
Purchase Amount (USD)       0
Location                    0
Size                        0
Color                       0
Season                      0
Review Rating               0
Subscription Status         0
Payment Method              0
Shipping Type               0
Discount Applied            0
Promo Code Used             0
Previous Purchases          0
Preferred Payment Method    0
Frequency of Purchases      0
dtype: int64


In [3]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


# Group the features by type so the preprocessor can treat each group differently
numeric_features = ['Age', 'Purchase Amount (USD)']
categorical_features = ['Gender', 'Location', 'Discount Applied']


numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])


# Select the features and target, then split into training (80%) and test (20%) sets
X = data[['Age', 'Purchase Amount (USD)', 'Gender', 'Location', 'Discount Applied']]
y = data['Subscription Status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X.info())
print(X.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Age                    3900 non-null   int64 
 1   Purchase Amount (USD)  3900 non-null   int64 
 2   Gender                 3900 non-null   object
 3   Location               3900 non-null   object
 4   Discount Applied       3900 non-null   object
dtypes: int64(2), object(3)
memory usage: 152.5+ KB
None
   Age  Purchase Amount (USD) Gender       Location Discount Applied
0   55                     53   Male       Kentucky              Yes
1   19                     64   Male          Maine              Yes
2   50                     73   Male  Massachusetts              Yes
3   21                     90   Male   Rhode Island              Yes
4   45                     49   Male         Oregon              Yes


In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Use LogisticRegression to build a model 
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', LogisticRegression(max_iter=1000))])

result = model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Evaluate the model with accuracy, the confusion matrix, and the classification report
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 0.8282051282051283
Confusion Matrix:
[[437 121]
 [ 13 209]]
Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.78      0.87       558
           1       0.63      0.94      0.76       222

    accuracy                           0.83       780
   macro avg       0.80      0.86      0.81       780
weighted avg       0.87      0.83      0.84       780



### Comprehensive Evaluation
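- Accuracy: 83% overall. For context, always predicting "No" would score about 72% on this test set (558 of 780 customers are non-subscribers), so the model improves on that baseline by roughly 11 percentage points.
- Precision: 63% for subscribers (1), meaning that about 63% of the customers predicted as subscribers actually are subscribers (209 of 330).
- Recall: 94% for subscribers (1), meaning that the model correctly identified about 94% of the actual subscribers (209 of 222).
- F1-Score: 76% for subscribers (1). As the harmonic mean of precision and recall, it balances the high recall against the lower precision.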
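
As a cross-check, the figures above can be recomputed directly from the confusion matrix printed earlier. This is a minimal sketch: the matrix values are copied from the notebook output, and the "always predict No" baseline is an added point of comparison rather than something the notebook computes.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = actual class, columns = predicted class)
conf_matrix = np.array([[437, 121],
                        [13, 209]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted subscribers, how many are real
recall = tp / (tp + fn)      # of real subscribers, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "No" (class 0)
baseline = (tn + fp) / total

print(f'Accuracy:  {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall:    {recall:.2f}')
print(f'F1-score:  {f1:.2f}')
print(f'Baseline:  {baseline:.2f}')
```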



### Using Model Predictions for Precision Marketing

- Use the model's predictions to segment customers, identify the characteristics of likely subscribers and non-subscribers, and develop a marketing strategy for each group.

- Subscription User Profile Analysis: Analyze the characteristics of predicted subscription users, such as age, gender, purchase amount, and purchase frequency, to understand which types of customers are more likely to subscribe.

- Customized Marketing: Based on these characteristics, create personalized marketing strategies. For example, promote fashionable new products to young subscribers, and offer higher membership discounts and exclusive benefits to frequent buyers.
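
One possible way to turn these ideas into a first segmentation is to bucket customers by predicted subscription probability and then profile each bucket. The sketch below is only an illustration under stated assumptions: it trains a simplified pipeline on synthetic data with a made-up subscription rule (so it runs without `shopping_trends.csv`), and the 0.3 and 0.6 cut-offs, the segment labels, and the names `customers`, `sketch_model`, `scored`, and `profile` are all hypothetical. With the real notebook, `model.predict_proba(X)` would take the place of the synthetic model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the customer data (assumption, not the real dataset)
rng = np.random.default_rng(42)
n = 500
customers = pd.DataFrame({
    'Age': rng.integers(18, 71, size=n),
    'Purchase Amount (USD)': rng.integers(20, 101, size=n),
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Discount Applied': rng.choice(['Yes', 'No'], size=n),
})
# Made-up rule: discount users subscribe about half the time, others never
subscribed = ((customers['Discount Applied'] == 'Yes')
              & (rng.random(n) < 0.5)).astype(int)

sketch_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Purchase Amount (USD)']),
        ('cat', OneHotEncoder(handle_unknown='ignore'),
         ['Gender', 'Discount Applied']),
    ])),
    ('classifier', LogisticRegression(max_iter=1000)),
])
sketch_model.fit(customers, subscribed)

# Column 1 of predict_proba is the probability of class 1 (subscriber)
scored = customers.copy()
scored['subscribe_prob'] = sketch_model.predict_proba(customers)[:, 1]

# Hypothetical cut-offs; in practice they would be tuned to campaign budgets
scored['segment'] = pd.cut(scored['subscribe_prob'],
                           bins=[0.0, 0.3, 0.6, 1.0],
                           labels=['low', 'medium', 'high'],
                           include_lowest=True)

# Profile each segment: size, average age and spend, share using discounts
profile = scored.groupby('segment', observed=True).agg(
    customers=('Age', 'size'),
    mean_age=('Age', 'mean'),
    mean_purchase=('Purchase Amount (USD)', 'mean'),
    discount_share=('Discount Applied', lambda s: (s == 'Yes').mean()),
)
print(profile)
```

Scoring with probabilities rather than hard 0/1 predictions lets each team choose its own cut-off, for example a broad one for low-cost email campaigns and a stricter one for costlier offers.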