# Checking the Python Environment

This check ensures that the correct Python environment is set up and in use.

In [2]:
import sys
import IPython

# Change path to append here
sys.path.append(r'C:\Users\hp\Downloads\Machine Learning Assignment')

print("Python executable:", sys.executable)
print("Python version:", sys.version)
print("Kernel:", IPython.get_ipython().kernel)
print("Sys paths:", sys.path)

Python executable: C:\Users\hp\anaconda3\envs\machine_learning_assignment\python.exe
Python version: 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)]
Kernel: <ipykernel.ipkernel.IPythonKernel object at 0x0000020779035150>
Sys paths: ['C:\\Program Files\\JetBrains\\DataSpell 2023.3.4\\plugins\\python-ce\\helpers-pro\\jupyter_debug', 'C:\\Program Files\\JetBrains\\DataSpell 2023.3.4\\plugins\\python-ce\\helpers\\pydev', 'C:\\Users\\hp\\Downloads\\Machine_Learning_Assignment\\testing', 'C:\\Users\\hp\\Downloads\\Machine_Learning_Assignment', 'C:\\Users\\hp\\anaconda3\\envs\\machine_learning_assignment\\python311.zip', 'C:\\Users\\hp\\anaconda3\\envs\\machine_learning_assignment\\DLLs', 'C:\\Users\\hp\\anaconda3\\envs\\machine_learning_assignment\\Lib', 'C:\\Users\\hp\\anaconda3\\envs\\machine_learning_assignment', '', 'C:\\Users\\hp\\anaconda3\\envs\\machine_learning_assignment\\Lib\\site-packages', 'C:\\Users\\hp\\anaconda3\\envs\\machine_le

# Importing Libraries

This cell imports the libraries needed for the sampling methods below.

In [3]:
# For DataFrame
import pandas as pd

# For sampling methods
import random
import numpy as np
from sklearn.model_selection import train_test_split

# Sampling Methods
Many sampling methods are available, but not every method suits every dataset. The suitability of a sampling method depends on the nature of the dataset, the source of the data, and the objective of the study. The following sampling methods are implemented in Python below:
1. Simple Random Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Cluster Sampling
5. Reservoir Sampling

## 1. Simple Random Sampling
This is the most basic and intuitive form of sampling: each individual in the population has an equal chance of being selected. It is suitable when the dataset is not biased. However, the sample must be large enough to approximate the population, which is not always possible, and it can be unrepresentative when the population is not homogeneous.

This method may be useful for this dataset, but the sample might not be representative enough: the original dataset has only 8950 rows, which might not be large enough to represent the whole population.

Reference: 
https://www.investopedia.com/ask/answers/042815/what-are-disadvantages-using-simple-random-sample-approximate-larger-population.asp

In [4]:
# Reading preprocessed data
df_preprocessed = pd.read_csv(r'../raw_data/customer_preprocessed_imputed_knn_n3_outlier_retained.csv', index_col = 'CUST_ID')

# Defining sample size, checking if the sample size is larger than the dataset
n = min(3500, len(df_preprocessed))

# Simple random sampling, putting random_state for reproducibility
df_srs = df_preprocessed.sample(n = n, random_state = 42)

df_srs

CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,KNN_IMPUTED_CREDIT_LIMIT,PAYMENTS,KNN_IMPUTED_MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
C17875,16.834929,0.454545,15.00,15.00,0.00,209.025389,0.090909,0.090909,0.000000,0.090909,1,1,7500.0,430.213001,86.959785,0.000000,11
C16296,540.020858,1.000000,612.23,495.61,116.62,1708.923217,0.666667,0.166667,0.500000,0.333333,10,10,2000.0,1642.068707,419.956251,0.000000,12
C17219,119.237712,1.000000,342.74,0.00,342.74,0.000000,1.000000,0.000000,1.000000,0.000000,0,20,2000.0,327.166041,165.207233,0.000000,12
C13108,894.081947,1.000000,1901.71,1853.11,48.60,206.618780,0.666667,0.666667,0.416667,0.083333,1,33,1500.0,947.130141,220.745296,0.000000,12
C13576,1294.145453,1.000000,3059.10,1836.98,1222.12,0.000000,1.000000,0.416667,1.000000,0.000000,0,42,7000.0,5560.033502,497.637767,0.083333,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C14210,1539.422470,1.000000,0.00,0.00,0.00,1156.434609,0.000000,0.000000,0.000000,0.083333,4,0,3500.0,369.044340,475.102035,0.000000,12
C17858,21.870580,0.636364,405.00,255.00,150.00,0.000000,0.333333,0.083333,0.250000,0.000000,0,7,4000.0,414.423817,88.853589,0.000000,12
C12903,7.817753,0.272727,170.02,0.00,170.02,0.000000,0.500000,0.000000,0.500000,0.000000,0,6,2000.0,373.028410,73.301702,0.500000,12
C10444,6822.877573,1.000000,137.34,137.34,0.00,4815.112874,0.083333,0.083333,0.000000,0.666667,16,1,10000.0,1874.954894,1716.092764,0.000000,12


## 2. Systematic Sampling
This method selects every **k**th individual from the population and is easier to implement than Simple Random Sampling. It also avoids clustered selection, where the chosen samples happen to lie exceptionally close to each other; in Simple Random Sampling this can only be mitigated by increasing the sample size. However, it is not suitable when the dataset has a periodic pattern, since the sampling interval may align with that pattern.

This method may also be useful for this dataset, but the sample might not be representative enough, for the same reason given in 1. (Simple Random Sampling).

Reference:
https://www.investopedia.com/ask/answers/042415/what-are-advantages-and-disadvantages-using-systematic-sampling.asp

In [5]:
# Computing the k for the given sample size
k = len(df_preprocessed) // n

# Systematic sampling, limiting to get the desired sample size
df_systematic = df_preprocessed.iloc[::k, :].head(n)

df_systematic

CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,KNN_IMPUTED_CREDIT_LIMIT,PAYMENTS,KNN_IMPUTED_MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
C10001,40.900749,0.818182,95.40,0.00,95.40,0.000000,0.166667,0.000000,0.083333,0.000000,0,2,1000.0,201.802084,139.509787,0.000000,12
C10003,2495.148862,1.000000,773.17,773.17,0.00,0.000000,1.000000,1.000000,0.000000,0.000000,0,12,7500.0,622.066742,627.284787,0.000000,12
C10005,817.714335,1.000000,16.00,16.00,0.00,0.000000,0.083333,0.083333,0.000000,0.000000,0,1,1200.0,678.334763,244.791237,0.000000,12
C10007,627.260806,1.000000,7091.01,6402.63,688.38,0.000000,1.000000,1.000000,1.000000,0.000000,0,64,13500.0,6354.314328,198.065894,1.000000,12
C10009,1014.926473,1.000000,861.49,661.49,200.00,0.000000,0.333333,0.083333,0.250000,0.000000,0,5,7000.0,688.278568,311.963409,0.000000,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C17181,177.602057,1.000000,2863.31,2407.53,455.78,832.203874,1.000000,0.583333,0.750000,0.416667,10,27,1500.0,3344.466066,203.049504,0.583333,12
C17183,2493.589343,1.000000,4655.55,3299.43,1356.12,0.000000,0.750000,0.500000,0.666667,0.000000,0,44,6500.0,1231.610235,832.178621,0.000000,12
C17185,916.627912,1.000000,155.96,0.00,155.96,0.000000,0.333333,0.000000,0.333333,0.000000,0,8,1200.0,717.212090,389.194590,0.000000,12
C17187,29.602980,0.545455,144.94,0.00,144.94,0.000000,0.333333,0.000000,0.250000,0.000000,0,7,5000.0,249.685121,88.988794,0.181818,12
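
If the preprocessed frame still has the 8950 rows mentioned earlier, then `k = 8950 // 3500 = 2`, so `iloc[::k]` yields 4475 rows and `.head(n)` keeps only those from the first 7000 positions; the tail of the dataset is never sampled. One possible alternative, sketched here on synthetic data, uses a fractional interval with a random start so that the selected positions span the whole frame. The helper name `systematic_sample`, its `seed` parameter, and the `demo` frame are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, n, seed=42):
    """Hypothetical helper: pick n rows at evenly spaced positions over the whole frame."""
    if not 0 < n <= len(df):
        raise ValueError("n must be between 1 and len(df)")
    step = len(df) / n                      # fractional sampling interval (>= 1)
    rng = np.random.default_rng(seed)
    start = rng.uniform(0, step)            # random start within the first interval
    positions = np.floor(start + step * np.arange(n)).astype(int)
    return df.iloc[positions]

# Synthetic stand-in for df_preprocessed (8950 rows, one made-up column)
demo = pd.DataFrame({"value": np.arange(8950)})
sample = systematic_sample(demo, 3500)
print(len(sample), sample.index.min(), sample.index.max())
```

Because the interval is at least 1, consecutive positions never repeat, and the last position falls within the final interval of the frame rather than stopping short of it.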


## 3. Stratified Sampling
This method first defines the strata, which are subgroups of the population, and then selects a sample from each stratum. It is useful when the dataset as a whole is heterogeneous but each stratum is homogeneous. However, it requires strong prior knowledge of the dataset to define the strata; otherwise biases (e.g. selection bias) may be introduced into the sample.

This method is not suitable for this dataset, as no 'stratifiable' column is clearly observed or defined. For illustration purposes, 'TENURE' is used as the strata column.

Reference:
https://www.qualtrics.com/en-au/experience-management/research/stratified-random-sampling/?rid=ip&prevsite=en&newsite=au&geo=MY&geomatch=au

In [6]:
# Defining the strata column
strata_column = 'TENURE'

# Stratified sampling
_, df_stratified = train_test_split(df_preprocessed, test_size = n, stratify = df_preprocessed[strata_column], random_state=42)

df_stratified

CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,KNN_IMPUTED_CREDIT_LIMIT,PAYMENTS,KNN_IMPUTED_MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
C15563,2933.387442,0.727273,0.00,0.00,0.00,3317.250842,0.000000,0.000000,0.000000,0.250000,5,0,6500.0,9317.533927,929.334025,0.125000,12
C12893,2432.276042,1.000000,7156.11,5921.26,1234.85,1543.726862,0.833333,0.833333,0.666667,0.166667,3,198,9500.0,1914.832296,621.422606,0.083333,12
C16341,4183.825982,1.000000,490.61,45.65,444.96,178.950075,0.666667,0.083333,0.666667,0.333333,4,13,4500.0,1541.998312,2410.083966,0.000000,12
C12359,5355.922523,1.000000,2469.28,905.98,1563.30,5839.680075,0.916667,0.583333,0.833333,0.666667,27,33,8000.0,1226.857482,2016.382990,0.000000,12
C18783,221.810125,0.727273,0.00,0.00,0.00,903.684584,0.000000,0.000000,0.000000,0.166667,2,0,500.0,662.179022,168.620500,0.166667,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C13414,790.367002,1.000000,3663.31,2744.88,918.43,0.000000,1.000000,1.000000,0.833333,0.000000,0,70,5000.0,4232.922974,250.545557,0.000000,12
C16991,119.368715,0.909091,339.11,611.65,12.41,87.547315,0.416667,0.500000,0.083333,0.083333,2,5,6500.0,107.203529,187.367893,0.000000,12
C12846,179.882212,1.000000,828.00,0.00,828.00,0.000000,1.000000,0.000000,1.000000,0.000000,0,12,1000.0,649.951462,152.775223,0.300000,12
C13691,1.869435,0.454545,0.24,0.24,0.00,0.000000,0.083333,0.083333,0.000000,0.000000,0,0,3000.0,150.381107,53.294711,0.000000,12
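
One way to check that a stratified draw preserves the strata proportions is to compare the normalized value counts of the strata column before and after sampling. The sketch below does this on synthetic data; the `TENURE` values, their weights, and the `BALANCE` distribution are invented for illustration. It uses `DataFrameGroupBy.sample` with a common fraction as an alternative to `train_test_split`, so the total sample size may differ from the target by a few rows because of per-stratum rounding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-in: an imbalanced strata column (values and weights are made up)
demo = pd.DataFrame({
    "TENURE": rng.choice([6, 9, 12], size=9000, p=[0.05, 0.10, 0.85]),
    "BALANCE": rng.gamma(2.0, 800.0, size=9000),
})

# Draw the same fraction of rows from every stratum
frac = 3500 / len(demo)
stratified = demo.groupby("TENURE").sample(frac=frac, random_state=42)

# Compare strata proportions in the population and in the sample
proportions = pd.DataFrame({
    "population": demo["TENURE"].value_counts(normalize=True),
    "sample": stratified["TENURE"].value_counts(normalize=True),
}).sort_index()
print(proportions)
```

If the two columns agree closely, the sample reflects the strata sizes of the population, which is the property stratification is meant to guarantee.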


## 4. Cluster Sampling
This method divides the population into clusters and then randomly selects whole clusters to sample from. It is useful when the dataset is large. It works best when the clusters are similar to one another (homogeneity among clusters) while each cluster is internally diverse (heterogeneity within clusters), so that each cluster resembles the population. It is not suitable when the clusters differ markedly from one another or are not representative of the population.

This method is not suitable for this dataset, as no 'clusterable' column is clearly observed or defined, and it is also not suitable for clustering machine learning tasks. For illustration purposes, 'TENURE' is used as the cluster column.

Reference:

https://corporatefinanceinstitute.com/resources/data-science/cluster-sampling/

In [7]:
# Defining the clustering column
cluster_column = 'TENURE'

# Cluster sampling, setting random seed for reproducibility
clusters = df_preprocessed.groupby(cluster_column)
np.random.seed(42)

# Getting the desired sample size
df_cluster = pd.DataFrame()
selected_clusters = set()
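while len(df_cluster) < n:
    # Draw only from clusters not chosen yet, so no cluster (and no row) is added twice
    remaining = [c for c in clusters.groups.keys() if c not in selected_clusters]
    cluster = np.random.choice(remaining)
    selected_clusters.add(cluster)
    df_cluster = pd.concat([df_cluster, clusters.get_group(cluster)], axis = 0)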

# Selecting the desired sample size
df_cluster = df_cluster.head(n)

df_cluster

CUST_ID,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,KNN_IMPUTED_CREDIT_LIMIT,PAYMENTS,KNN_IMPUTED_MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
C10170,114.593817,1.000000,469.98,0.00,469.98,0.000000,1.000000,0.000000,0.833333,0.000000,0,6,1000.0,367.109312,86.102479,0.500000,6
C10274,7755.698607,1.000000,8533.54,4072.76,4460.78,7540.307350,1.000000,0.500000,0.833333,0.500000,12,72,10000.0,4758.209146,7256.951816,0.000000,6
C10411,918.342234,1.000000,1221.53,1180.00,41.53,1078.788118,0.333333,0.166667,0.166667,0.166667,7,2,1200.0,1421.035206,106.039362,0.250000,6
C10471,9601.071318,1.000000,238.34,0.00,238.34,4809.119550,0.833333,0.000000,0.833333,1.000000,8,5,15000.0,1194.510762,1206.257247,0.000000,6
C10567,76.935610,0.833333,277.00,277.00,0.00,0.000000,0.333333,0.333333,0.000000,0.000000,0,2,8500.0,814.605178,109.433095,0.333333,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C10884,10131.000550,1.000000,0.00,0.00,0.00,3330.326162,0.000000,0.000000,0.000000,0.166667,5,0,14500.0,4056.944212,2592.311811,0.000000,12
C10885,4419.302940,1.000000,12551.95,10901.24,1650.71,14896.540510,1.000000,0.666667,1.000000,0.500000,19,122,17000.0,31698.419020,1448.585054,0.416667,12
C10886,2173.621597,1.000000,200.00,200.00,0.00,201.404735,0.083333,0.083333,0.000000,0.250000,4,1,8000.0,495.968187,502.765453,0.000000,12
C10887,9901.685569,1.000000,2190.95,832.49,1358.46,1913.350112,1.000000,0.416667,0.916667,0.250000,4,43,11500.0,2972.248185,2849.235124,0.000000,12


## 5. Reservoir Sampling
This method selects a sample from the population without replacement, much like Simple Random Sampling, but in a single pass and while holding only the sample in memory. It is generally used for large datasets where Simple Random Sampling becomes inefficient, and when the size of the dataset is unknown in advance (as in event streaming). However, it might not be very suitable when the dataset is small, as the sample might not be representative enough.

This method may be of some use for this dataset, but the sample might not be representative enough, for the same reason given in 1. (Simple Random Sampling).

Reference:
https://medium.com/@choukibhanuprasad/reservoir-sampling-technique-71c9227d6743

In [8]:
# Defining function for Reservoir sampling
def reservoir_sampling(iterator, n):
    reservoir = []
    for i, row in enumerate(iterator):
        if i < n:
            reservoir.append(row)
        else:
            j = random.randint(0, i)
            if j < n:
                reservoir[j] = row
    return reservoir

# Reservoir sampling
# row is a tuple of (INDEX, DATA)
df_reservoir = pd.DataFrame([row[1] for row in reservoir_sampling(df_preprocessed.iterrows(), n)])

# iterrows() yields each row as a Series with a single dtype, so integer columns become float; convert them back
columns_int64 = ['CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'TENURE']
df_reservoir[columns_int64] = df_reservoir[columns_int64].astype('int64')

df_reservoir

Unnamed: 0,BALANCE,BALANCE_FREQUENCY,PURCHASES,ONEOFF_PURCHASES,INSTALLMENTS_PURCHASES,CASH_ADVANCE,PURCHASES_FREQUENCY,ONEOFF_PURCHASES_FREQUENCY,PURCHASES_INSTALLMENTS_FREQUENCY,CASH_ADVANCE_FREQUENCY,CASH_ADVANCE_TRX,PURCHASES_TRX,KNN_IMPUTED_CREDIT_LIMIT,PAYMENTS,KNN_IMPUTED_MINIMUM_PAYMENTS,PRC_FULL_PAYMENT,TENURE
C10001,40.900749,0.818182,95.40,0.00,95.40,0.000000,0.166667,0.000000,0.083333,0.000000,0,2,1000.0,201.802084,139.509787,0.000000,12
C17838,25.407311,1.000000,504.96,0.00,504.96,0.000000,1.000000,0.000000,1.000000,0.000000,0,48,3000.0,464.099225,166.397650,0.916667,12
C10003,2495.148862,1.000000,773.17,773.17,0.00,0.000000,1.000000,1.000000,0.000000,0.000000,0,12,7500.0,622.066742,627.284787,0.000000,12
C13849,86.157142,1.000000,915.64,0.00,915.64,0.000000,0.916667,0.000000,0.833333,0.000000,0,11,9500.0,855.755398,170.160443,0.727273,12
C16845,51.581189,1.000000,384.89,120.00,264.89,0.000000,0.666667,0.083333,0.583333,0.000000,0,10,6000.0,213.515669,156.429964,0.222222,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
C14254,3907.641073,1.000000,2001.35,147.84,1853.51,466.607620,1.000000,0.083333,1.000000,0.166667,2,141,4000.0,2111.180988,3216.413727,0.000000,12
C18223,1256.819270,1.000000,1674.96,0.00,1674.96,0.000000,1.000000,0.000000,1.000000,0.000000,0,12,1000.0,970.361646,3449.118706,0.000000,12
C17908,64.937308,0.636364,442.03,209.00,233.03,0.000000,0.583333,0.083333,0.500000,0.000000,0,8,1200.0,149.634213,204.316917,0.000000,12
C13597,2280.447419,1.000000,1681.78,1219.00,462.78,0.000000,1.000000,0.250000,0.916667,0.000000,0,21,7500.0,1398.406343,513.533438,0.000000,12
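
The replacement rule used above (often called Algorithm R) should leave every item of a stream of length N in the reservoir with probability n/N. The sketch below checks this empirically on a tiny stream. `reservoir_sample` mirrors the notebook's function but takes an explicit seeded `random.Random` instance; that extra parameter, and the values of `N`, `n`, and `trials`, are assumptions made so the check is repeatable.

```python
import random
from collections import Counter

def reservoir_sample(iterable, n, rng):
    """Algorithm R: keep a uniform sample of n items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < n:
            reservoir.append(item)      # fill the reservoir with the first n items
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < n:
                reservoir[j] = item     # replace a slot with probability n / (i + 1)
    return reservoir

rng = random.Random(42)                 # seeded generator so the check is repeatable
N, n, trials = 10, 3, 20000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(N), n, rng))

# Each item should be kept in roughly n / N = 30% of the trials
frequencies = {item: counts[item] / trials for item in range(N)}
print(frequencies)
```

Passing a seeded generator in this way would also make the notebook's reservoir sample reproducible, in the same spirit as `random_state = 42` in the other methods.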


# Evaluation of Sampling Methods using Kolmogorov-Smirnov Test

The two-sample Kolmogorov-Smirnov test is used to compare the distribution of each column in the original dataset with its distribution in the sampled dataset. The null hypothesis is that the two distributions are the same. If the p-value is less than the significance level (0.05), the null hypothesis is rejected, meaning that the two distributions differ, and we conclude that the sample drawn using that sampling method is not representative of the original dataset for that column.

\begin{align*}
H_{0} : P = P_{0}, \qquad H_{1} : P \neq P_{0}.
\end{align*}

Reference: https://www.statology.org/kolmogorov-smirnov-test-python/


In [28]:
from scipy.stats import ks_2samp

sampling_methods = {
    'Simple Random Sampling': df_srs,
    'Systematic Sampling': df_systematic,
    'Reservoir Sampling': df_reservoir
}

non_rejected_count = {}

for method, df in sampling_methods.items():
    count = 0
    
    for column in df_preprocessed.columns:
        ks_statistic, p_value = ks_2samp(df_preprocessed[column], df[column])
        if p_value > 0.05:
            count += 1
    
    non_rejected_count[method] = count

print("Number of columns not rejected \n")
for method, count in non_rejected_count.items():
    print(f"{method}: {count}/{len(df_preprocessed.columns)}")

Number of columns not rejected 

Simple Random Sampling: 17/17
Systematic Sampling: 7/17
Reservoir Sampling: 17/17
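
To illustrate what these counts measure, the sketch below runs `ks_2samp` on synthetic data for a simple random subsample and for a deliberately biased subsample that only sees the lowest 60% of values. The gamma-distributed `population`, the sample sizes, and the 60% cut-off are assumptions chosen for illustration and are unrelated to the customer dataset.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
# Synthetic, right-skewed "population" column (parameters are made up)
population = rng.gamma(2.0, 800.0, size=9000)

# A simple random subsample versus a biased one drawn from the lowest 60% of values
random_sample = rng.choice(population, size=3500, replace=False)
biased_sample = rng.choice(np.sort(population)[:5400], size=3500, replace=False)

for name, sample in [("random", random_sample), ("biased", biased_sample)]:
    result = ks_2samp(population, sample)
    print(f"{name}: D = {result.statistic:.4f}, p-value = {result.pvalue:.4g}")
```

The statistic D is the largest gap between the two empirical distribution functions, so the biased subsample should give a much larger D and a p-value far below 0.05. A sample that covers only part of the data can fail the test on many columns for the same reason.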


**Conclusion:**
The number of columns not rejected by the Kolmogorov-Smirnov test is highest for Simple Random Sampling and Reservoir Sampling (17/17 each), followed by Systematic Sampling (7/17). Of the two, Simple Random Sampling is preferable, because Reservoir Sampling is intended mainly for large or streaming datasets of unknown size (see 5. Reservoir Sampling above). Hence, Simple Random Sampling is employed to sample the dataset.

# Saving the Sampled Dataset
The sampled dataset is saved to a new CSV file for future use.

In [8]:
df_srs.to_csv(r'../raw_data/customer_sampled_srs_preprocessed_knn_n3.csv')