# Activity: Trying out k-means on the Adobo credit card dataset

1. Given the RFM customer table we produced last week, can you group customers with k-means using the recency, frequency, and monetary value columns, this time **without using rules to define levels**?

*   Scale the column values
*   Find the optimal k using the elbow and silhouette methods
*   Draw a radar plot for each cluster
*   Attach the resulting cluster labels to each row
*   Compare the new clusters with `rfm_level`. How are they similar or different?



2. Using your new results, can you describe each customer cluster using the transactions dataset?
*   Do you still see clearly ordered levels (1, 2, 3, where 1 = lowest and 3 = highest), or have they been mixed up?






In [1]:
# Import libraries
import pandas as pd

In [2]:
# Mount GDrive's folders
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# This code imports a library "os" that allows file navigation
import os
# This code sets the home directory
# Find your folder and put the path here as a string
os.chdir('/content/drive/MyDrive/my_workspace')

## Prepare data

Read the CSV files

In [4]:
df = pd.read_csv("Data/cc_clean.csv")
df.head()

Unnamed: 0,cc_num,gender,city,city_pop,job,dob,acct_num,acct_num2,trans_num,unix_time,category,amt,trans_datetime
0,676000000000.0,M,Dasmarinas,659019,Chartered loss adjuster,12/12/1958,798000000000.0,798000000000,a72eaa86b043eed95b25bbb25b3153a1,1581314011,shopping_net,68.88,2020-02-10 13:53:31
1,3520000000000000.0,M,Digos,169393,"Administrator, charities/voluntary organisations",31/08/1970,968000000000.0,968000000000,060d12f91c13871a13963041736a4702,1590902968,entertainment,50.06,2020-05-31 13:29:28
2,4.14e+18,M,Calapan,133893,Financial controller,23/07/1953,628000000000.0,628000000000,18aafb6098ab0923886c0ac83592ef8d,1585461157,food_dining,105.44,2020-03-29 13:52:37
3,4720000000000000.0,M,Laoag,111125,Dance movement psychotherapist,11/01/1954,257000000000.0,257000000000,c20ee88b451f637bc6893b7460e9fee0,1601282159,gas_transport,82.69,2020-09-28 16:35:59
4,3530000000000000.0,M,City of Paranaque,665822,"Engineer, water",31/07/1961,540000000000.0,540000000000,b389cc449c9c298e8c004024449f7a27,1594960430,shopping_net,363.49,2020-07-17 12:33:50


In [5]:
rfm_df = pd.read_csv("Data/cc_rfm.csv")
rfm_df.head()

Unnamed: 0,acct_num,recency,recency_score,frequency,frequency_score,total_amt,monetary_score,rfm_score,rfm_level
0,124000000000.0,24,3,931,3,66457.92,3,9,Top
1,169000000000.0,141,1,9,1,2814.6,1,3,Low
2,170000000000.0,24,3,890,3,64448.85,3,9,Top
3,201000000000.0,25,3,306,2,24489.46,2,7,Top
4,203800000000.0,111,1,12,1,8803.87,1,3,Low


In [6]:
# Convert to pandas datetimes
df['trans_datetime'] = pd.to_datetime(df['trans_datetime'])
df['dob'] = pd.to_datetime(df['dob'], format='%d/%m/%Y')
df.head()

Unnamed: 0,cc_num,gender,city,city_pop,job,dob,acct_num,acct_num2,trans_num,unix_time,category,amt,trans_datetime
0,676000000000.0,M,Dasmarinas,659019,Chartered loss adjuster,1958-12-12,798000000000.0,798000000000,a72eaa86b043eed95b25bbb25b3153a1,1581314011,shopping_net,68.88,2020-02-10 13:53:31
1,3520000000000000.0,M,Digos,169393,"Administrator, charities/voluntary organisations",1970-08-31,968000000000.0,968000000000,060d12f91c13871a13963041736a4702,1590902968,entertainment,50.06,2020-05-31 13:29:28
2,4.14e+18,M,Calapan,133893,Financial controller,1953-07-23,628000000000.0,628000000000,18aafb6098ab0923886c0ac83592ef8d,1585461157,food_dining,105.44,2020-03-29 13:52:37
3,4720000000000000.0,M,Laoag,111125,Dance movement psychotherapist,1954-01-11,257000000000.0,257000000000,c20ee88b451f637bc6893b7460e9fee0,1601282159,gas_transport,82.69,2020-09-28 16:35:59
4,3530000000000000.0,M,City of Paranaque,665822,"Engineer, water",1961-07-31,540000000000.0,540000000000,b389cc449c9c298e8c004024449f7a27,1594960430,shopping_net,363.49,2020-07-17 12:33:50


Add useful columns

In [7]:
# Define the current date
current_date = pd.to_datetime('2022-01-01')
# Calculate the elapsed days from transaction date to current date
df['elapsed_days'] = (current_date - df['trans_datetime']).dt.days
# Calculate approximate age of customer in whole years (dividing by 365 ignores leap days)
df['age'] = (current_date - df['dob']).dt.days//365
df

Unnamed: 0,cc_num,gender,city,city_pop,job,dob,acct_num,acct_num2,trans_num,unix_time,category,amt,trans_datetime,elapsed_days,age
0,6.760000e+11,M,Dasmarinas,659019,Chartered loss adjuster,1958-12-12,7.980000e+11,798000000000,a72eaa86b043eed95b25bbb25b3153a1,1581314011,shopping_net,68.88,2020-02-10 13:53:31,690,63
1,3.520000e+15,M,Digos,169393,"Administrator, charities/voluntary organisations",1970-08-31,9.680000e+11,968000000000,060d12f91c13871a13963041736a4702,1590902968,entertainment,50.06,2020-05-31 13:29:28,579,51
2,4.140000e+18,M,Calapan,133893,Financial controller,1953-07-23,6.280000e+11,628000000000,18aafb6098ab0923886c0ac83592ef8d,1585461157,food_dining,105.44,2020-03-29 13:52:37,642,68
3,4.720000e+15,M,Laoag,111125,Dance movement psychotherapist,1954-01-11,2.570000e+11,257000000000,c20ee88b451f637bc6893b7460e9fee0,1601282159,gas_transport,82.69,2020-09-28 16:35:59,459,68
4,3.530000e+15,M,City of Paranaque,665822,"Engineer, water",1961-07-31,5.400000e+11,540000000000,b389cc449c9c298e8c004024449f7a27,1594960430,shopping_net,363.49,2020-07-17 12:33:50,532,60
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
92427,3.530000e+15,M,Dasmarinas,659019,"Physicist, medical",1965-03-26,2.010000e+11,201000000000,4f77498d91283c4910a636b2e8149dda,1587273415,misc_pos,6.54,2020-04-19 13:16:55,621,56
92428,2.470000e+15,M,San Fernando,306659,"Surveyor, quantity",1935-11-01,5.811000e+11,581000000000,d44f411eabd406a76a60546e723a98fd,1628185569,kids_pets,98.23,2021-08-06 01:46:09,147,86
92429,3.520000e+15,M,Masbate,95389,Wellsite geologist,1967-11-20,5.310000e+11,531000000000,7e767a74cae901c13f1a9d1d37aa63d4,1621481285,grocery_pos,78.79,2021-05-20 11:28:05,225,54
92430,4.620000e+15,M,San Fernando,121812,Personnel officer,1934-11-20,5.550000e+11,555000000000,6ced184c93e66028e8d235ad3060de90,1625341374,personal_care,31.37,2021-07-04 03:42:54,180,87


In [10]:
# get hour of transaction
df['trans_hour'] = df['trans_datetime'].dt.hour
# get ISO weekday, 1=Monday ... 7=Sunday
# (`day` is a column of the frame returned by isocalendar(), not a method, so no parentheses)
df['trans_dayofweek'] = df['trans_datetime'].dt.isocalendar().day

# Perform k-means on data in `rfm_df`
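
As a starting point, here is a minimal sketch of the scale, choose-k, label, and compare steps. It assumes scikit-learn is available (the notebook itself only imports pandas) and runs on a small synthetic table named `demo_rfm`, which is hypothetical: it borrows the `recency`, `frequency`, `total_amt`, and `rfm_level` column names from `rfm_df`, but the values and the "Middle" level name are made up. Swap in `rfm_df` to try it on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_group(n, recency, frequency, total_amt, level):
    """Build n synthetic customers scattered around the given RFM values."""
    return pd.DataFrame({
        "recency": rng.normal(recency, 5, n),
        "frequency": rng.normal(frequency, 5, n),
        "total_amt": rng.normal(total_amt, 1000, n),
        "rfm_level": level,  # placeholder level names, not the real data
    })

# Synthetic stand-in for rfm_df: three clearly separated customer groups
demo_rfm = pd.concat([
    make_group(30, 140, 20, 3000, "Low"),
    make_group(30, 60, 300, 25000, "Middle"),
    make_group(30, 25, 900, 65000, "Top"),
], ignore_index=True)

# Scale so that total_amt (thousands) does not dominate recency (tens of days)
features = ["recency", "frequency", "total_amt"]
X = StandardScaler().fit_transform(demo_rfm[features])

# Collect inertia (for the elbow plot) and silhouette score for each candidate k
inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

# Pick the k with the highest silhouette score; check it against the elbow by eye
best_k = max(silhouettes, key=silhouettes.get)

# Refit with the chosen k and attach the labels to each row
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
demo_rfm["cluster"] = final.labels_

# Cross-tabulate the new clusters against the rule-based levels
comparison = pd.crosstab(demo_rfm["cluster"], demo_rfm["rfm_level"])
print(comparison)
```

Plotting `inertias` against k gives the elbow curve, and `demo_rfm.groupby("cluster")[features].mean()` gives the per-cluster averages that a radar plot would display. On real data the silhouette and elbow methods may disagree, so treat `best_k` as a suggestion rather than a final answer.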

# Describe clusters using `df`
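
One possible way to approach this step is to join the cluster labels onto the transactions by account and then aggregate per cluster. The sketch below uses two tiny hypothetical frames, `labels` and `tx`, as stand-ins for the labelled `rfm_df` and for `df`. The column names follow the notebook, but the account numbers and values are invented for illustration.

```python
import pandas as pd

# Hypothetical cluster label per account (stand-in for the labelled rfm_df)
labels = pd.DataFrame({"acct_num": [101, 102, 103], "cluster": [0, 0, 1]})

# Hypothetical transactions (stand-in for df), using the notebook's column names
tx = pd.DataFrame({
    "acct_num": [101, 101, 102, 103, 103, 103],
    "category": ["food_dining", "shopping_net", "food_dining",
                 "gas_transport", "gas_transport", "grocery_pos"],
    "amt": [100.0, 50.0, 30.0, 80.0, 20.0, 40.0],
    "age": [63, 63, 51, 68, 68, 68],
    "trans_hour": [13, 13, 9, 16, 18, 20],
})

# Attach each transaction's cluster via its account number
tx_labeled = tx.merge(labels, on="acct_num", how="left")

# Transaction-level summary per cluster
profile = tx_labeled.groupby("cluster").agg(
    n_customers=("acct_num", "nunique"),
    n_transactions=("amt", "size"),
    avg_amt=("amt", "mean"),
    median_hour=("trans_hour", "median"),
)

# Age belongs to the customer, so average over unique accounts, not transactions
profile["avg_age"] = (
    tx_labeled.drop_duplicates("acct_num").groupby("cluster")["age"].mean()
)

# Share of each cluster's transactions per category (each row sums to 1)
category_share = pd.crosstab(
    tx_labeled["cluster"], tx_labeled["category"], normalize="index"
)

print(profile)
print(category_share)
```

Averaging age over unique accounts keeps frequent spenders from being counted many times. Before merging the real tables, it is worth checking that the account column has the same type and precision in both, since a join on mismatched keys silently produces missing cluster labels.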