# Analytics Vidhya Janatahack: Customer Segmentation

This notebook refers to the Janatahack competition on customer segmentation conducted by Analytics Vidhya on 31-07-2020.

The model achieved a testing accuracy of 75% and ranked 145 on the leaderboard (LB).

Customer segmentation is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests and spending habits. Companies employing customer segmentation operate on the premise that every customer is different and that their marketing efforts are better served by targeting specific, smaller groups with messages that those consumers find relevant and that lead them to buy something. Companies also hope to gain a deeper understanding of their customers' preferences and needs, with the idea of discovering what each segment finds most valuable so that marketing materials can be tailored more accurately toward that segment.

More info: https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/#About

NB: I am not claiming this is one of the best models. I took the competition as a learning experience. Additional suggestions and insights on where I can improve are invited.

## Problem Statement

An automobile company has plans to enter new markets with their existing products (P1, P2, P3, P4 and P5). After intensive market research, they’ve deduced that the behavior of the new market is similar to their existing market.

In their existing market, the sales team has classified all customers into 4 segments (A, B, C, D). Then, they performed segmented outreach and communication for the different segments of customers. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 new potential customers.
You are required to help the manager predict the right group for the new customers.

More info: https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/#ProblemStatement

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Reading the data

In [None]:
train = pd.read_csv('../input/analytics-vidhya-janatahack-customer-segmentation/Train_aBjfeNk.csv')
test = pd.read_csv('../input/analytics-vidhya-janatahack-customer-segmentation/Test_LqhgPWU.csv')
train_copy = train.copy()
test_copy = test.copy()
train_copy = train_copy.drop(['Segmentation'], axis = 1)

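Since both the training and test data have to undergo the same preprocessing, they are concatenated into a single data frame 'data'. A 'train' flag column is added first so that the two sets can be separated again afterwards.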

In [None]:
# Flagging train and test rows before concatenating them for preprocessing
train_copy['train'] = 1
test_copy['train'] = 0

In [None]:
# Concatenating the train and test data
data = pd.concat([train_copy,test_copy], axis = 0)
data.shape

## Preprocessing

Checking whether there are any missing values in the training data (the combined data frame is checked again after imputation):

In [None]:
train_copy.isnull().sum()

Here, for the sake of simplicity, the missing values are filled with **mode value of the training data**. Care should be taken not to take the mode values of the whole 'data' dataframe to avoid information leakage from test data. 

In [None]:
# Treating missing values
# For the sake of time, just fill the missing values with the mode of the training data
data['Ever_Married'] = data['Ever_Married'].fillna(train_copy['Ever_Married'].mode()[0])
data['Graduated'] = data['Graduated'].fillna(train_copy['Graduated'].mode()[0])
data['Profession'] = data['Profession'].fillna(train_copy['Profession'].mode()[0])
data['Work_Experience'] = data['Work_Experience'].fillna(train_copy['Work_Experience'].mode()[0])
data['Family_Size'] = data['Family_Size'].fillna(train_copy['Family_Size'].mode()[0])
data['Var_1'] = data['Var_1'].fillna(train_copy['Var_1'].mode()[0])
data.isnull().sum()
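
The leakage point can be illustrated on a toy frame. This is only a minimal sketch with made-up data: `toy_train`, `toy_test`, `toy_full` and the helper `fill_with_train_mode` are hypothetical names and not part of the notebook. In the toy data the training mode ('Yes') differs from the mode of the combined frame ('No'), so the filled values show which statistic was used.

```python
import pandas as pd

def fill_with_train_mode(train_df, full_df, columns):
    """Fill missing values in full_df with modes computed on train_df only."""
    filled = full_df.copy()
    for col in columns:
        # mode() ignores missing values; [0] picks the most frequent value
        mode_value = train_df[col].mode()[0]
        filled[col] = filled[col].fillna(mode_value)
    return filled

# Made-up data: the training mode is 'Yes', the combined mode would be 'No'
toy_train = pd.DataFrame({'Graduated': ['Yes', 'Yes', 'No', None]})
toy_test = pd.DataFrame({'Graduated': ['No', 'No', 'No', None]})
toy_full = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

toy_filled = fill_with_train_mode(toy_train, toy_full, ['Graduated'])
print(toy_filled['Graduated'].tolist())
```

Both missing entries, including the one in the test rows, are filled with the training mode, so nothing from the test distribution enters the imputation.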

Encoding the categorical variables

In [None]:
data.head()

Here, the following variables can be label encoded:
* Gender
* Ever_Married
* Graduated
* Spending_Score

In [None]:
# Label encoding the variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
columns = ['Gender','Ever_Married','Graduated']
for col in columns:
    data[col] = le.fit_transform(data[col])
    



The variable 'Spending_Score' has three values: Low, Average and High. Since it is an ordinal variable, it has to be encoded separately with an explicit mapping that preserves the order.

In [None]:
data['Spending_Score'] = data['Spending_Score'].map({'Low': 0, 'Average':1, 'High':2}) 
data
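
The reason for mapping 'Spending_Score' by hand can be seen on a toy series. `LabelEncoder` assigns codes in sorted label order, so it would give Average=0, High=1, Low=2 and lose the Low < Average < High order. The sketch below uses made-up values (the misspelled 'high' is deliberate) and the hypothetical names `spending_order` and `toy`. It shows the explicit mapping plus a quick check for labels the mapping does not cover, which `map` silently turns into NaN.

```python
import pandas as pd

# Explicit order for the ordinal variable
spending_order = {'Low': 0, 'Average': 1, 'High': 2}

# Made-up values; the last one is misspelled on purpose
toy = pd.Series(['Low', 'High', 'Average', 'high'])

encoded = toy.map(spending_order)

# Labels missing from the mapping become NaN, so list them explicitly
unmapped = toy[encoded.isna()].unique().tolist()

print(encoded.tolist())
print('Unmapped labels:', unmapped)
```

Running a check like this after the mapping step is a cheap way to catch unexpected labels before they reach the model as missing values.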


If we look at the data, there are two additional categorical variables: 'Profession' and 'Var_1'. Since they are nominal variables, one-hot encoding is to be performed. But before encoding them directly, let's take a look at their distributions.

In [None]:
train_copy.Profession.value_counts()
# temp

# data['Profession_counts'] = data['Profession'].apply(lambda x: temp[x])
# data[['Profession','Profession_counts']].head()

In [None]:
# for i in range(0, len(data['Profession'])):
#     if (data.iloc[i]['Profession_counts'] < 1000):
#         data['Profession'][i] = 'Others'

The variable 'Profession' has 9 categories. But among these 9 categories, only the first few are frequent. Hence, the four least frequent ones are binned into a new category named 'Other'.

In [None]:
data['Profession'] = data['Profession'].replace(['Lawyer','Executive','Marketing','Homemaker'],'Other')
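
Instead of listing the rare categories by hand, the binning could also be driven by a frequency threshold computed on the training data. The following is a minimal sketch on made-up data, not the approach used for the submission: the helper `bin_rare_categories`, the `min_count` threshold and the toy series are all hypothetical.

```python
import pandas as pd

def bin_rare_categories(train_series, full_series, min_count, other_label='Other'):
    """Replace categories seen fewer than min_count times in training with other_label."""
    counts = train_series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare, replace the rest with other_label
    return full_series.where(~full_series.isin(rare), other_label)

# Made-up training distribution: 'Lawyer' and 'Homemaker' occur only once
toy_train = pd.Series(['Artist'] * 5 + ['Doctor'] * 3 + ['Lawyer'] + ['Homemaker'])
toy_full = pd.Series(['Artist', 'Lawyer', 'Doctor', 'Homemaker', 'Artist'])

binned = bin_rare_categories(toy_train, toy_full, min_count=2)
print(binned.tolist())
```

Counting on the training data only keeps the choice of rare categories independent of the test set, in line with the imputation step earlier.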


The same strategy is applied to the variable 'Var_1'.

In [None]:
train_copy['Var_1'].value_counts()

In [None]:
data['Var_1'] = data['Var_1'].replace(['Cat_5','Cat_1','Cat_7','Cat_2'],'Other')
data['Var_1'].value_counts()
# data.drop(['Profession_counts'], axis = 1,inplace = True)

Now, applying one-hot encoding to both variables:

In [None]:
data = pd.get_dummies(data, columns = ['Profession','Var_1'])
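
One benefit of encoding the concatenated frame is that train and test end up with the same dummy columns, even if a category is absent from one of them. A small sketch on made-up data (the `toy_*` names are hypothetical):

```python
import pandas as pd

# Made-up data: 'Cat_4' and 'Other' never appear in the test rows
toy_train = pd.DataFrame({'Var_1': ['Cat_6', 'Cat_4', 'Other'], 'train': 1})
toy_test = pd.DataFrame({'Var_1': ['Cat_6', 'Cat_6'], 'train': 0})

toy = pd.concat([toy_train, toy_test], axis=0)
encoded = pd.get_dummies(toy, columns=['Var_1'])

# Split back using the flag and compare the resulting feature columns
train_cols = encoded[encoded['train'] == 1].drop(columns='train').columns.tolist()
test_cols = encoded[encoded['train'] == 0].drop(columns='train').columns.tolist()

print(train_cols)
print(train_cols == test_cols)
```

Encoding the two frames separately would have produced only a `Var_1_Cat_6` column for the toy test rows, and the model would then see mismatched inputs.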

In [None]:
# ############### Temporary, remove after check ########################
# 'Var_1_Cat_3' 'Profession_Doctor
 
# data = data.drop(['Profession_Engineer','Var_1_Other',], axis = 1)

In [None]:
data 

In [None]:
#data['Age_Work_product'] = data['Age'] * data['Work_Experience']

After preprocessing, the training and test data are separated again using the 'train' flag.

In [None]:
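# .copy() so that adding the 'Segmentation' column later does not modify a slice of 'data'
training_preprocessed = data[data['train'] == 1].copy()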
training_preprocessed

In [None]:
training_preprocessed['Segmentation'] = train.Segmentation
#training_preprocessed['Segmentation'] = training_preprocessed['Segmentation'].map({'A':0,'B':1,'C':2,'D':3})
training_preprocessed = training_preprocessed.drop(['train'], axis = 1)

Splitting dependent and independent variables.

In [None]:
# Splitting into dependent and independent variables
X_train = training_preprocessed.drop(['Segmentation'],axis = 1)
Y_train = training_preprocessed['Segmentation']


In [None]:
training_preprocessed

In [None]:
X_train

In [None]:
X_test = data[data['train'] == 0]
X_test = X_test.drop(['train'], axis = 1)

## Training Stage

In [None]:
from sklearn.metrics import accuracy_score
def get_score(model,x_train, x_test, y_train, y_test):
    model.fit(x_train,y_train)
    y_predict = model.predict(x_test)
    accuracy = accuracy_score(y_test, y_predict)
    return accuracy

In [None]:
from sklearn.model_selection import KFold
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier



from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV

xgb = XGBClassifier()
lgbm = LGBMClassifier()
catb = CatBoostClassifier()

# model = lr
# learning_rates = [0.001,0.01, 0.1]# 0.01, 0.1
# n_estimators   = [10, 100, 1000]
# # subsample = [0.5, 0.7, 1.0]
# # max_depth = [3, 7, 9]
# grid = dict(learning_rate = learning_rates, n_estimators = n_estimators)
# cv = RepeatedStratifiedKFold(n_splits = 6)
# grid_search = GridSearchCV(model, param_grid = grid,  cv=cv, scoring='accuracy')
# opt_param_result = grid_search.fit(X_train,Y_train)

# print('Best :{} using parameters:{}'.format(grid_search.best_score_,grid_search.best_params_))



kf = KFold(n_splits = 6)
score_xgb= []
# score_lgbm = []
# score_catb = []
# score_ada = []

for train_index, test_index in kf.split(X_train):
    x_train, x_val, y_train, y_val = X_train.iloc[train_index],X_train.iloc[test_index],Y_train.iloc[train_index],Y_train.iloc[test_index]
    score_xgb.append(get_score(xgb,x_train, x_val, y_train, y_val))
#     #score_lgbm.append(get_score(lgbm,x_train, x_val, y_train, y_val)) #Using LGBMBooster
#     #score_xgb.append(get_score(xgb,x_train, x_val, y_train, y_val)) #Using XGBooster

print('Accuracy for XGB is {} ({})'.format(np.mean(score_xgb), np.std(score_xgb)))
# # print('Accuracy for xgb is {} ({})'.format(np.mean(score_xgb), np.std(score_xgb)))
# # print('Accuracy for catb is {} ({})'.format(np.mean(score_catb), np.std(score_catb)))
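
The manual fold loop could also be written with `cross_val_score`. The sketch below is an alternative under stated assumptions, not the code behind the reported scores: it swaps in a `RandomForestClassifier` and synthetic data from `make_classification` so that it runs with scikit-learn alone, and it uses `StratifiedKFold` with shuffling, which keeps the four class proportions similar across folds. `X_toy`, `y_toy` and all parameter values are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class problem standing in for the real features and segments
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Six stratified, shuffled folds, mirroring n_splits = 6 used in the loop
cv = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='accuracy')

print('Accuracy: {:.3f} ({:.3f})'.format(scores.mean(), scores.std()))
```

Any estimator with `fit` and `predict` can be passed in place of the random forest, which makes it easy to compare several classifiers on identical folds.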

Additional evaluations were done with the LGBM and CatBoost classifiers. Yet, I wasn't able to get the training accuracy above 53%. Even with additional hyperparameter tuning on the XGB classifier, the training accuracy did not improve. Hence, I decided to stick with the XGB classifier.

In [None]:
# Training using the whole dataset
xgb.fit(X_train,Y_train)
predict = xgb.predict(X_test)

In [None]:
import matplotlib.pyplot as plt
from xgboost import plot_importance 
plot_importance(xgb)
plt.show()

After checking the feature importance, I tried removing some of the less important variables and checked the performance. But the accuracy decreased. Hence, those features are kept intact.
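
The comparison described here can be sketched as follows. This is a hedged illustration on synthetic data with a `RandomForestClassifier` swapped in for the XGB model, so the numbers it prints say nothing about the competition data; `X_toy`, `y_toy`, `keep` and the choice of dropping two features are hypothetical. As a simplification, the importances are computed once on the full toy data before cross-validating.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class problem with 4 informative and 4 noise features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_toy, y_toy)

# Indices sorted from least to most important; drop the two least important
order = np.argsort(model.feature_importances_)
keep = np.sort(order[2:])

# Compare cross-validated accuracy with all features and with the reduced set
full_score = cross_val_score(model, X_toy, y_toy, cv=6).mean()
reduced_score = cross_val_score(model, X_toy[:, keep], y_toy, cv=6).mean()

print('All features:     {:.3f}'.format(full_score))
print('Reduced features: {:.3f}'.format(reduced_score))
```

Comparing the two scores on the same folds shows whether dropping features helps or hurts for a given model.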

In [None]:
predict  

In [None]:
# Writing the test results to a separate dataframe
submission = pd.DataFrame()
submission['ID'] = test_copy['ID']
submission['Segmentation'] = predict
submission.to_csv('submission.csv', index = False)

Initially, I removed the 'ID' variable before training, which resulted in very low accuracy. I also tried some more feature engineering and hyperparameter tuning to improve the training accuracy. But even then, the accuracy couldn't be improved. So, I submitted the baseline model itself as the final result and got 74% testing accuracy. I hope someone can provide additional insight into that.

Since I am a newbie to this field, the hackathon was a nice learning experience.  

Additional suggestions and insights on where I can improve are invited.

Thank you