<a href="https://colab.research.google.com/github/chechelan/0-chechelan/blob/main/Rayminder_machine_learning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About
- In this notebook, I first import the user behavioral dataset, which was built by merging exploratory group records with Visual Crossing API data.
- Then I clean the data and run an exploratory analysis on the dataset.
- Finally, I test and evaluate four predictive models (logistic regression, decision tree, random forest, and K-nearest neighbors).

# Load and review dataset

In [None]:
# initiate google drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
import pandas as pd

In [None]:
# load the dataset
df = pd.read_csv('gdrive/My Drive/rmdataset.csv')

In [None]:
df.head(2)

Unnamed: 0,city,date,datetime,uvindex,temp,conditions,icon,cloudcover,userid,age,...,skintype,skinconcerns,makeup,where,acitivity,SPF,forwhom,reminder_order,reminder,amount
0,beijing,2023-07-05,08:00:00,4.0,27.3,Clear,clear-day,0.0,25,31,...,4,no,yes,indoors,,30,myself,1,yes,c
1,beijing,2023-07-05,10:00:00,8.0,31.0,Clear,clear-day,0.0,25,31,...,4,no,yes,indoors,,30,myself,2,yes,a


# Data preprocessing

In [None]:
# check the missing values
missing_values = df.isnull().sum()
missing_values

city                0
date                0
datetime            0
uvindex             0
temp                0
conditions          0
icon                0
cloudcover          0
userid              0
age                 0
gender              0
skintype            0
skinconcerns        0
makeup              0
where               0
acitivity         744
SPF                 0
forwhom             0
reminder_order      0
reminder            0
amount              0
dtype: int64

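The 'acitivity' column (spelled this way in the dataset) was a free-text field that participants had to fill in themselves, and most of them left it blank: 744 values are missing. Rather than dropping those rows from an already small dataset, I fill the missing values with 'unknown'.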

In [None]:
# Replace missing values in the 'acitivity' column with 'unknown'
# (assign the result back instead of using inplace=True on a column selection)
df['acitivity'] = df['acitivity'].fillna('unknown')

# Check again for missing values to ensure they've been handled
missing_values_after = df.isnull().sum()

missing_values_after


city              0
date              0
datetime          0
uvindex           0
temp              0
conditions        0
icon              0
cloudcover        0
userid            0
age               0
gender            0
skintype          0
skinconcerns      0
makeup            0
where             0
acitivity         0
SPF               0
forwhom           0
reminder_order    0
reminder          0
amount            0
dtype: int64

In [None]:
# Check the distribution of the target column 'reminder'
reminder_distribution = df['reminder'].value_counts()

reminder_distribution


no     519
yes    321
Name: reminder, dtype: int64

The distribution of the reminder column shows 519 "no" entries (about 62%) and 321 "yes" entries (about 38%). The dataset is therefore somewhat imbalanced, but not severely.
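
As a rough reference point for the models later on, the counts above imply a majority-class baseline: a classifier that always predicts "no" would already be right about 62% of the time. A minimal sketch, assuming only the two counts reported above (the Series is rebuilt by hand for illustration, not loaded from the dataset):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"no": 519, "yes": 321}, name="reminder")

# Share of each class in the dataset
proportions = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()

print(proportions.round(3).to_dict())
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model accuracy reported below is easier to interpret when compared against this baseline rather than against 50%.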

# Build a new dataframe with selected columns for model training
The columns describing users' demographic information and environmental conditions are selected, because the qualitative research showed that these are the major decision factors for users.

In [None]:
selected_columns = [
    'uvindex', 'temp', 'conditions', 'cloudcover', 'age', 'gender',
    'skintype', 'makeup', 'where', 'SPF', 'forwhom', 'reminder_order', 'reminder'
]
new_dataframe = df[selected_columns]

# Display the first 5 rows of the new dataframe
new_dataframe.head()

Unnamed: 0,uvindex,temp,conditions,cloudcover,age,gender,skintype,makeup,where,SPF,forwhom,reminder_order,reminder
0,4.0,27.3,Clear,0.0,31,female,4,yes,indoors,30,myself,1,yes
1,8.0,31.0,Clear,0.0,31,female,4,yes,indoors,30,myself,2,yes
2,10.0,38.0,Clear,0.0,31,female,4,yes,indoors,30,myself,3,yes
3,9.0,38.8,Clear,0.0,31,female,4,yes,indoors,30,myself,4,yes
4,7.0,40.0,Clear,0.0,31,female,4,yes,indoors,30,myself,5,no


The dataset contains both numerical and categorical columns. To be used in the machine learning models, the categorical columns must be converted to numbers, so LabelEncoder is used to encode them.
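
To make the encoding concrete, here is a small sketch of what `LabelEncoder` does to a yes/no column. The toy list below is made up for illustration and only mirrors the values of the `reminder` column:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values mirroring the 'reminder' column (illustrative, not the real data)
values = ["yes", "no", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(values)

# classes_ holds the original labels in sorted order;
# the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the classes are sorted alphabetically, "no" becomes 0 and "yes" becomes 1, which matches the class labels in the classification reports further down. For nominal columns with more than two categories (such as `conditions`), the integer codes imply an ordering that may not be meaningful, so one-hot encoding is a common alternative.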

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Initialize label encoders for categorical columns
le_conditions = LabelEncoder()
le_gender = LabelEncoder()
le_makeup = LabelEncoder()
le_where = LabelEncoder()
le_forwhom = LabelEncoder()
le_reminder = LabelEncoder()

# Create a true copy of the sliced dataframe
new_dataframe = df[selected_columns].copy()

# Encode categorical columns; assigning to the column directly (rather than
# through .loc[:, col]) lets pandas replace the object column with an integer one
new_dataframe['conditions'] = le_conditions.fit_transform(new_dataframe['conditions'])
new_dataframe['gender'] = le_gender.fit_transform(new_dataframe['gender'])
new_dataframe['makeup'] = le_makeup.fit_transform(new_dataframe['makeup'])
new_dataframe['where'] = le_where.fit_transform(new_dataframe['where'])
new_dataframe['forwhom'] = le_forwhom.fit_transform(new_dataframe['forwhom'])
new_dataframe['reminder'] = le_reminder.fit_transform(new_dataframe['reminder'])



In [None]:
# Split the data into training and testing sets (80% train, 20% test)
X = new_dataframe.drop('reminder', axis=1)
y = new_dataframe['reminder']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Display the first few rows of the transformed dataframe
new_dataframe.head()

Unnamed: 0,uvindex,temp,conditions,cloudcover,age,gender,skintype,makeup,where,SPF,forwhom,reminder_order,reminder
0,4.0,27.3,0,0.0,31,0,4,1,0,30,1,1,1
1,8.0,31.0,0,0.0,31,0,4,1,0,30,1,2,1
2,10.0,38.0,0,0.0,31,0,4,1,0,30,1,3,1
3,9.0,38.8,0,0.0,31,0,4,1,0,30,1,4,1
4,7.0,40.0,0,0.0,31,0,4,1,0,30,1,5,0


In [None]:
# Split the data into features and target variable
X = new_dataframe.drop('reminder', axis=1)
y = new_dataframe['reminder']

In [None]:
# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
# Split the scaled data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

In [None]:
# Display the shape of the training and testing sets
X_train.shape, X_test.shape, y_train.shape, y_test.shape
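
The cells above fit `StandardScaler` on all rows before splitting, so the test rows contribute to the scaling statistics. One way to keep the test split fully unseen is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is fit on the training rows only. A minimal sketch on synthetic data (the `make_classification` data and the `*_demo` / `X_tr` names are stand-ins, not the notebook's dataset or variables):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's data (840 rows, 12 features)
X_demo, y_demo = make_classification(n_samples=840, n_features=12, random_state=42)

# Split first, so the test rows never influence preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when scoring the test rows
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(random_state=42)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print(f"Test accuracy on synthetic data: {test_accuracy:.3f}")
```

With an 80/20 split of 840 rows this gives 672 training rows and 168 test rows, the same test-set size as the support totals in the reports below.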

# Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize a logistic regression model
logreg = LogisticRegression(random_state=42)

# Train the model on the training data
logreg.fit(X_train, y_train)

# Predict on the test set
lg_y_pred = logreg.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, lg_y_pred)
classification_rep = classification_report(y_test, lg_y_pred)

accuracy, classification_rep


(0.7857142857142857,
 '              precision    recall  f1-score   support\n\n           0       0.83      0.86      0.84       113\n           1       0.69      0.64      0.66        55\n\n    accuracy                           0.79       168\n   macro avg       0.76      0.75      0.75       168\nweighted avg       0.78      0.79      0.78       168\n')

The logistic regression model achieved an accuracy of approximately 78.57%. Overall, the model performs reasonably well, with better performance on the "no reminder" class (F1-score 0.84) than on the "reminder" class (F1-score 0.66).

# Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize a Decision Tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Predict on the test data
dt_y_pred = clf.predict(X_test)

# Calculate and display the accuracy
accuracy = accuracy_score(y_test, dt_y_pred)
classification_rep = classification_report(y_test, dt_y_pred)

accuracy, classification_rep


(0.7976190476190477,
 '              precision    recall  f1-score   support\n\n           0       0.88      0.81      0.84       113\n           1       0.66      0.78      0.72        55\n\n    accuracy                           0.80       168\n   macro avg       0.77      0.79      0.78       168\nweighted avg       0.81      0.80      0.80       168\n')

The decision tree classifier achieved an accuracy of approximately 79.76% on the test set. Precision measures the proportion of correctly predicted positive observations out of all predicted positives. Recall (sensitivity) measures the proportion of actual positives that were identified correctly. F1-score is the harmonic mean of precision and recall, so it balances the two. Support is the number of actual occurrences of the class in the test set.
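
The definitions above can be checked by hand. The sketch below uses confusion-matrix counts for the "yes" class (label 1) that were reconstructed from the decision tree report above, not read from the notebook's own confusion matrix, so treat them as illustrative:

```python
# Counts for the "yes" class (label 1), reconstructed from the reported
# precision, recall, and support; they are an assumption, not notebook output.
tp, fp, fn, tn = 43, 22, 12, 91

precision = tp / (tp + fp)   # correct "yes" predictions / all "yes" predictions
recall = tp / (tp + fn)      # correct "yes" predictions / all actual "yes" rows
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
support = tp + fn            # actual "yes" rows in the test set
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} "
      f"support={support} accuracy={accuracy:.4f}")
```

These counts reproduce the 0.66 / 0.78 / 0.72 row for class 1 and the 0.7976 overall accuracy reported above.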

# Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize a Random Forest classifier
rf_clf = RandomForestClassifier(random_state=42, n_estimators=100)

# Train the classifier on the training data
rf_clf.fit(X_train, y_train)

# Predict on the test data
rf_y_pred = rf_clf.predict(X_test)

# Calculate and display the accuracy
rf_accuracy = accuracy_score(y_test, rf_y_pred)
rf_classification_rep = classification_report(y_test, rf_y_pred)

rf_accuracy, rf_classification_rep


(0.8095238095238095,
 '              precision    recall  f1-score   support\n\n           0       0.87      0.84      0.86       113\n           1       0.69      0.75      0.72        55\n\n    accuracy                           0.81       168\n   macro avg       0.78      0.79      0.79       168\nweighted avg       0.81      0.81      0.81       168\n')

The random forest classifier achieved an accuracy of approximately 80.95%, a slight improvement over the decision tree classifier (79.76%).

# K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Initialize a KNN classifier with k=5
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Train the classifier on the training data
knn_clf.fit(X_train, y_train)

# Predict on the test data
knn_y_pred = knn_clf.predict(X_test)

# Calculate and display the accuracy
knn_accuracy = accuracy_score(y_test, knn_y_pred)
knn_classification_rep = classification_report(y_test, knn_y_pred)

knn_accuracy, knn_classification_rep


(0.7142857142857143,
 '              precision    recall  f1-score   support\n\n           0       0.78      0.81      0.79       113\n           1       0.57      0.53      0.55        55\n\n    accuracy                           0.71       168\n   macro avg       0.67      0.67      0.67       168\nweighted avg       0.71      0.71      0.71       168\n')

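The K-Nearest Neighbors (KNN) classifier with k=5 achieved an accuracy of approximately 71.43%. The next cell uses 10-fold cross-validation on the training set to search for a better value of k.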

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np

# Define a range of k values to try
k_values = list(range(1, 51))

# Store cross-validation scores for each k value
cv_scores = []

# Perform cross-validation for each k value
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# Determine the best k value
best_k = k_values[np.argmax(cv_scores)]
best_accuracy = max(cv_scores)

best_k, best_accuracy


(9, 0.7425373134328359)

In [None]:
# Initialize a KNN classifier with the best k value
knn_best = KNeighborsClassifier(n_neighbors=best_k)

# Train the classifier on the training data
knn_best.fit(X_train, y_train)

# Predict on the test data
knn_best_y_pred = knn_best.predict(X_test)

# Calculate and display the accuracy
knn_best_accuracy = accuracy_score(y_test, knn_best_y_pred)
knn_best_classification_rep = classification_report(y_test, knn_best_y_pred)

knn_best_accuracy, knn_best_classification_rep


(0.7321428571428571,
 '              precision    recall  f1-score   support\n\n           0       0.78      0.84      0.81       113\n           1       0.61      0.51      0.55        55\n\n    accuracy                           0.73       168\n   macro avg       0.69      0.67      0.68       168\nweighted avg       0.72      0.73      0.73       168\n')


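With k=9, the value selected by cross-validation (mean cross-validation accuracy of about 74.25%), KNN achieved a test accuracy of approximately 73.21%, up from 71.43% with k=5.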

# Summary of model performance

In [None]:
# Metrics for logistic regression
lg_metrics = classification_report(y_test, lg_y_pred, output_dict=True)

# Metrics for Decision Tree
dt_metrics = classification_report(y_test, dt_y_pred, output_dict=True)

# Metrics for Random Forest
rf_metrics = classification_report(y_test, rf_y_pred, output_dict=True)

# Metrics for KNN with k=9
knn_metrics = classification_report(y_test, knn_best_y_pred, output_dict=True)

In [None]:
# Summarize the metrics in a structured manner
summary = {
    "Model": ["Logistic Regression", "Decision Tree", "Random Forest", "KNN (k=9)"],
    "Accuracy": [lg_metrics["accuracy"], dt_metrics["accuracy"], rf_metrics["accuracy"], knn_metrics["accuracy"]],
    "Precision (No)": [lg_metrics["0"]["precision"],dt_metrics["0"]["precision"], rf_metrics["0"]["precision"], knn_metrics["0"]["precision"]],
    "Recall (No)": [lg_metrics["0"]["recall"],dt_metrics["0"]["recall"], rf_metrics["0"]["recall"], knn_metrics["0"]["recall"]],
    "F1-Score (No)": [lg_metrics["0"]["f1-score"],dt_metrics["0"]["f1-score"], rf_metrics["0"]["f1-score"], knn_metrics["0"]["f1-score"]],
    "Precision (Yes)": [lg_metrics["1"]["precision"],dt_metrics["1"]["precision"], rf_metrics["1"]["precision"], knn_metrics["1"]["precision"]],
    "Recall (Yes)": [lg_metrics["1"]["recall"],dt_metrics["1"]["recall"], rf_metrics["1"]["recall"], knn_metrics["1"]["recall"]],
    "F1-Score (Yes)": [lg_metrics["1"]["f1-score"],dt_metrics["1"]["f1-score"], rf_metrics["1"]["f1-score"], knn_metrics["1"]["f1-score"]]
}

# Convert to DataFrame for display
model_comparison = pd.DataFrame(summary)
model_comparison

Unnamed: 0,Model,Accuracy,Precision (No),Recall (No),F1-Score (No),Precision (Yes),Recall (Yes),F1-Score (Yes)
0,Logistic Regression,0.785714,0.82906,0.858407,0.843478,0.686275,0.636364,0.660377
1,Decision Tree,0.797619,0.883495,0.80531,0.842593,0.661538,0.781818,0.716667
2,Random Forest,0.809524,0.87156,0.840708,0.855856,0.694915,0.745455,0.719298
3,KNN (k=9),0.732143,0.778689,0.840708,0.808511,0.608696,0.509091,0.554455


Accuracy shows the overall proportion of correct predictions. Random forest had the highest accuracy (80.95%), although the margin over the decision tree (79.76%) and logistic regression (78.57%) is small. Precision measures, for each class, the proportion of predictions of that class that were correct. Recall (or sensitivity) measures, for each class, the proportion of actual members of that class that were identified correctly. F1-score balances precision and recall for each class.

Random forest had the highest accuracy and generally well-balanced precision, recall, and F1-score values across both classes. The decision tree is close in performance, with slightly lower values for most metrics, though it has the highest recall for the "Yes" class (0.78). KNN (k=9) lagged behind the other three models, especially for the "Yes" class. Based on these metrics, random forest is the best-performing model among the four on this test set.
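
Because the accuracy gaps between the top three models are small and come from a single 168-row test split, a cross-validated comparison can indicate how stable the ranking is. The following is a sketch only, run on synthetic data from `make_classification` with roughly the same class balance; the `*_demo` names, the 5-fold setting, and the data itself are assumptions for illustration, not the notebook's dataset or results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 840 rows, 12 features, about 62% / 38% class balance
X_demo, y_demo = make_classification(
    n_samples=840, n_features=12, weights=[0.62, 0.38], random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "KNN (k=9)": KNeighborsClassifier(n_neighbors=9),
}

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the fold-to-fold standard deviation is larger than the gap between two models, the difference between them is probably not meaningful.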

In [None]:
# save the model
import joblib
save_path = "gdrive/My Drive/random_forest_model.pkl"
joblib.dump(rf_clf, save_path)

['gdrive/My Drive/random_forest_model.pkl']
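
The saved file contains only the classifier. Since the model was trained on scaled features, new data would need the same fitted scaler (and the same label encodings) before prediction. One possible approach is to save the scaler together with the model. This is a sketch on synthetic data; the bundle layout, the temporary path, and the `*_demo` names are hypothetical and not part of the notebook:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (12 features, as in the notebook)
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=42)

# Fit the scaler and the model the same way the notebook does
scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestClassifier(n_estimators=100, random_state=42)
model_demo.fit(scaler_demo.transform(X_demo), y_demo)

# Save both objects in one file so they cannot get out of sync
bundle_path = os.path.join(tempfile.mkdtemp(), "rf_bundle.pkl")
joblib.dump({"scaler": scaler_demo, "model": model_demo}, bundle_path)

# Later: load the bundle and apply the same scaling before predicting
loaded = joblib.load(bundle_path)
new_rows = X_demo[:5]
predictions = loaded["model"].predict(loaded["scaler"].transform(new_rows))
print(predictions)
```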