<a href="https://www.kaggle.com/prosperalikizang/customer-segmentation-xgb-lgbm-rf?scriptVersionId=90296548" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### Problem Statement
An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the new market behaves similarly to its existing market.
<br><br>
In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D). They then ran segmented outreach and communication for each customer segment. This strategy has worked exceptionally well for them. They plan to use the same strategy in the new markets and have identified 2627 potential new customers.
<br><br>
Our task is to help the manager predict the right segment for each new customer.<br><br>
The competition page is available here: https://datahack.analyticsvidhya.com/contest/janatahack-customer-segmentation/

### Variables Description

           
| Variable              | Definition                                                         |
|-----------------------|--------------------------------------------------------------------|
| ID                    | Unique ID                                                          |
| Gender                | Gender of the customer                                             |
| Ever_Married          | Marital status of the customer                                     |
| Age                   | Age of the customer                                                |
| Graduated             | Is the customer a graduate?                                        |
| Profession            | Profession of the customer                                         |
| Work_Experience       | Work experience in years                                           |
| Spending_Score        | Spending score of the customer                                     |
| Family_Size           | Number of family members for the customer (including the customer) |
| Var_1                 | Anonymised category for the customer                               |
| Segmentation (target) | Customer segment of the customer                                   |

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Import Scientific and Data Manipulation Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
# to increase no. of rows and column visibility in outputs
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [None]:
# Load data 

train = pd.read_csv('../input/customer/Train.csv')
test = pd.read_csv('../input/customer/Test.csv')
sub = pd.read_csv('../input/customer/sample_submission.csv')

#### Training data

In [None]:
train.head()

In [None]:
train.shape

In [None]:
# Inspect column types and non-null counts of the train dataset
train.info()

We have 7 categorical variables and 3 numerical variables in our training dataset (not counting `ID`).

#### Testing data

In [None]:
test.shape

In [None]:
# Inspect column types and non-null counts of the test dataset
test.info()

# EDA (Exploratory Data Analysis) and Visualization

## Segmentation

In [None]:
print('Count of each category of segmentation\n',train.Segmentation.value_counts(normalize=True))

### Check for class imbalance in the training data

In [None]:
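# seaborn is already imported as sns above; pass the column by keyword,
# since recent seaborn versions no longer accept it positionally
sns.countplot(x=train['Segmentation'], order=['A','B','C','D'])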

### Checking for duplicate rows in the train and test datasets

In [None]:
# Check for duplicates
print('Duplicated value(s) in our train dataset : ', train.duplicated().sum())
print('Duplicated value(s) in our test dataset : ', test.duplicated().sum())

### Checking for missing values

In [None]:
train.isnull().sum()

## Cleaning or filling missing data and creating new attributes based on the given data, domain knowledge or prior experience

### Missing-Data Techniques

CCA (complete case analysis): we drop the rows that contain NaN values.

3M (mean, median and mode): we impute missing values with one of these three statistics.

End-of-tail imputation: we impute missing values with a value taken from the far end of the distribution (for example mean + 3 standard deviations).

Missing tag imputation: we replace missing values with an explicit tag such as "Missing".

Random-sample imputation: we draw a random sample of observed values, of the same size as the number of missing values, and use it to fill them.
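
As a rough illustration of the techniques listed above, the sketch below applies each of them to a small made-up DataFrame. The columns `age` and `job` and all variable names are hypothetical and are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0],
    "job": ["Artist", None, "Doctor", "Artist", "Lawyer", None],
})

# CCA: keep only the rows without any missing value
cca = toy.dropna()

# 3M: fill a numeric column with the median, a categorical one with the mode
age_median = toy["age"].fillna(toy["age"].median())
job_mode = toy["job"].fillna(toy["job"].mode()[0])

# End-of-tail: fill with a value far in the tail, here mean + 3 * std
tail_value = toy["age"].mean() + 3 * toy["age"].std()
age_tail = toy["age"].fillna(tail_value)

# Missing tag: turn missingness into its own explicit category
job_tag = toy["job"].fillna("Missing")

# Random sample: draw as many observed values as there are missing ones
missing_mask = toy["age"].isna()
sampled = toy["age"].dropna().sample(
    int(missing_mask.sum()), replace=True, random_state=0
)
age_random = toy["age"].copy()
age_random[missing_mask] = sampled.to_numpy()
```

In this notebook only the mode (categorical columns and `Work_Experience`) and the mean (`Family_Size`) are actually used.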

## Var_1

In [None]:
print('The count of each category\n',train.Var_1.value_counts())

In [None]:
# Checking for null values
train.Var_1.isnull().sum()

In [None]:
# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Var_1'] = train['Var_1'].fillna(train['Var_1'].mode()[0])

In [None]:
# Counting Var_1 in each segment
ax1 = train.groupby(["Segmentation"])["Var_1"].value_counts().unstack().round(3)

# Percentage of category of Var_1 in each segment
ax2 = train.pivot_table(columns='Var_1',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

In every segment the count and the proportion of Cat_6 are very high, i.e. most entries in the data belong to Cat_6.
Cat_6, Cat_4, Cat_3 and Cat_2 are the most frequent categories.
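
The count and proportion tables built above with `groupby` and `pivot_table` can also be produced with `pd.crosstab`. The following is a minimal sketch on invented data (the toy frame below is an assumption, not the real training set):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Segmentation": ["A", "A", "B", "B", "B", "C"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# Raw counts of each category per segment
counts = pd.crosstab(toy["Segmentation"], toy["Gender"])

# Row-wise proportions: each segment's row sums to 1
shares = pd.crosstab(
    toy["Segmentation"], toy["Gender"], normalize="index"
).round(2)

print(counts)
print(shares)
```

Either table can then be passed to `.plot(kind="bar")` in the same way as `ax1` and `ax2`.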

## Gender

In [None]:
print('The count of gender\n',train.Gender.value_counts())

In [None]:
# Counting male-female in each segment
ax1 = train.groupby(["Segmentation"])["Gender"].value_counts().unstack().round(3)

# Percentage of male-female in each segment
ax2 = train.pivot_table(columns='Gender',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

All 4 segments have a similar male-female distribution, and in every segment there are more males than females. <br>
Segment D has the highest percentage of males of all segments.

## Ever Married

In [None]:
print('Count of married vs not married\n',train.Ever_Married.value_counts())

In [None]:
# Checking the count of missing values
train.Ever_Married.isnull().sum()

In [None]:
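# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Ever_Married'] = train['Ever_Married'].fillna(train['Ever_Married'].mode()[0])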

In [None]:
# Counting married and non-married in each segment
ax1 = train.groupby(["Segmentation"])["Ever_Married"].value_counts().unstack().round(3)

# Percentage of married and non-married in each segment
ax2 = train.pivot_table(columns='Ever_Married',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

Most customers in segment C are married, while segment D has the fewest married customers. This suggests that segment D is a group of single and possibly younger customers.

## Age

In [None]:
# Distribution of Age within each segment
a = train[train.Segmentation =='A']["Age"]
b = train[train.Segmentation =='B']["Age"]
c = train[train.Segmentation =='C']["Age"]
d = train[train.Segmentation =='D']["Age"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = train, x = "Segmentation", y="Age")
plt.title('Boxplot')

plt.subplot(1,2,2)
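# `shade` is deprecated in seaborn; `fill` is its replacement
sns.kdeplot(a, fill=False, label='A')
sns.kdeplot(b, fill=False, label='B')
sns.kdeplot(c, fill=False, label='C')
sns.kdeplot(d, fill=False, label='D')
plt.legend()  # show which curve belongs to which segment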
plt.xlabel('Age')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

The mean age of segment D is 33, so customers in this segment are mostly in their 30s, i.e. younger. The `Ever_Married` distribution also showed that segment D has the most single customers, which is consistent with a younger group.<br>
Segment C has a mean age of 49, and we also saw that most customers in this segment are married.

## Graduated

In [None]:
print('Count of each graduate and non-graduate\n',train.Graduated.value_counts())

In [None]:
# Checking the count of missing values
train.Graduated.isnull().sum()

In [None]:
# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Graduated'] = train['Graduated'].fillna(train['Graduated'].mode()[0])

In [None]:
# Counting graduate and non-graduate in each segment
ax1 = train.groupby(["Segmentation"])["Graduated"].value_counts().unstack().round(3)

# Percentage of graduate and non-graduate in each segment
ax2 = train.pivot_table(columns='Graduated',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

Segment C has the most graduate customers, while segment D has the fewest.

## Profession

In [None]:
print('Count of each profession\n',train.Profession.value_counts())

In [None]:
# Checking the count of missing values
train.Profession.isnull().sum()

In [None]:
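# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Profession'] = train['Profession'].fillna(train['Profession'].mode()[0])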

In [None]:
# Count of segments in each profession
ax1 = train.groupby(["Profession"])["Segmentation"].value_counts().unstack().round(3)

# Percentage of segments in each profession
ax2 = train.pivot_table(columns='Segmentation',index='Profession',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (16,5))
label = ['Artist','Doctor','Engineer','Entertainment','Executives','Healthcare','Homemaker','Lawyer','Marketing']
ax[0].set_xticklabels(labels = label,rotation = 45)

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (16,5))
ax[1].set_xticklabels(labels = label,rotation = 45)

plt.show()

In segments A, B and C the largest group of customers are **Artists**, while in segment D the largest group works in **Healthcare**. <br>
**Homemaker** is the least common profession in all four segments.

## Work Experience

In [None]:
# Checking the count of missing values
train.Work_Experience.isnull().sum()

In [None]:
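# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Work_Experience'] = train['Work_Experience'].fillna(train['Work_Experience'].mode()[0])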

In [None]:
# Distribution of Work_Experience within each segment
a = train[train.Segmentation =='A']["Work_Experience"]
b = train[train.Segmentation =='B']["Work_Experience"]
c = train[train.Segmentation =='C']["Work_Experience"]
d = train[train.Segmentation =='D']["Work_Experience"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = train, x = "Segmentation", y="Work_Experience")
plt.title('Boxplot')

plt.subplot(1,2,2)
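# `shade` is deprecated in seaborn; `fill` is its replacement
sns.kdeplot(a, fill=False, label='A')
sns.kdeplot(b, fill=False, label='B')
sns.kdeplot(c, fill=False, label='C')
sns.kdeplot(d, fill=False, label='D')
plt.legend()  # show which curve belongs to which segment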
plt.xlabel('Work Experience')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

Segment D has relatively more experienced customers than the other segments, while segment C has customers with little experience.

## Spending Score

In [None]:
print('Count of spending score\n',train.Spending_Score.value_counts())

In [None]:
# Counting different category of spending score in each segment
ax1 = train.groupby(["Segmentation"])["Spending_Score"].value_counts().unstack().round(3)

# Percentage of spending score in each segment
ax2 = train.pivot_table(columns='Spending_Score',index='Segmentation',values='ID',aggfunc='count')
ax2 = ax2.div(ax2.sum(axis=1), axis = 0).round(2)

#count plot
fig, ax = plt.subplots(1,2)
ax1.plot(kind="bar",ax = ax[0],figsize = (15,4))
ax[0].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[0].set_title(str(ax1))

#stacked bars
ax2.plot(kind="bar",stacked = True,ax = ax[1],figsize = (15,4))
ax[1].set_xticklabels(labels = ['A','B','C','D'],rotation = 0)
ax[1].set_title(str(ax2))
plt.show()

## Family Size

In [None]:
# Checking the count of missing values
train.Family_Size.isnull().sum()

In [None]:
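# Assign the result back: in-place fillna on a selected column is deprecated in recent pandas
train['Family_Size'] = train['Family_Size'].fillna(train['Family_Size'].mean())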

In [None]:
# Distribution of Family_Size within each segment
a = train[train.Segmentation =='A']["Family_Size"]
b = train[train.Segmentation =='B']["Family_Size"]
c = train[train.Segmentation =='C']["Family_Size"]
d = train[train.Segmentation =='D']["Family_Size"]

plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
sns.boxplot(data = train, x = "Segmentation", y="Family_Size")
plt.title('Boxplot')

plt.subplot(1,2,2)
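# `shade` is deprecated in seaborn; `fill` is its replacement
sns.kdeplot(a, fill=False, label='A')
sns.kdeplot(b, fill=False, label='B')
sns.kdeplot(c, fill=False, label='C')
sns.kdeplot(d, fill=False, label='D')
plt.legend()  # show which curve belongs to which segment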
plt.xlabel('Family Size')
plt.ylabel('Density')
plt.title("Mean\n A: {}\n B: {}\n C: {}\n D: {}".format(round(a.mean(),0),round(b.mean(),0),round(c.mean(),0),round(d.mean(),0)))

plt.show()

Most customers in the data have a family size of 1 or 2 (i.e. a small family).<br> However, segment D has more large families than the other segments.

In [None]:
train.Segmentation.value_counts(normalize=True)

Our baseline accuracy is 0.28: the score obtained by always predicting the most frequent segment.
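
To make the baseline concrete, here is a minimal sketch of how a majority-class baseline can be computed. The `segments` series is invented for illustration; on the real data the same steps would be applied to `train.Segmentation`.

```python
import pandas as pd

# Toy labels, invented for illustration only
segments = pd.Series(["D", "A", "D", "B", "C", "D", "A", "B"])

# Share of each class, sorted from most to least frequent
shares = segments.value_counts(normalize=True)

# A model that always predicts the most frequent class
# reaches an accuracy equal to that class's share
majority_class = shares.index[0]
baseline_accuracy = shares.iloc[0]

print(f"Always predicting {majority_class!r} scores {baseline_accuracy:.3f}")
```

Any trained model should beat this number on the validation set to be considered useful.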

# Feature Engineering

### Data Encoding- Label Encoding

The ID column is an arbitrary identifier with no meaningful relationship to the segmentation, so it is not useful for dividing customers into segments and we can drop it.

In [None]:
train.drop(['ID'],inplace=True,axis=1)
train.head(5)

In [None]:
# Encoding
from sklearn.preprocessing import LabelEncoder


binary_columns = ["Gender", "Ever_Married" , "Graduated"]


Encoder = LabelEncoder()
for column in binary_columns:
    # Each binary column is refit separately and mapped to 0/1
    train[column] = Encoder.fit_transform(train[column])

train.head()

In [None]:
# We use Dummy Variable Encoding for profession
profession=pd.get_dummies(train.Profession)
train.drop(['Profession'],axis=1,inplace=True)
profession.head()

In [None]:
# We join our profession dataframe to our train dataset
train=train.join(profession)

In [None]:
# Spending_Score
train.Spending_Score=pd.Categorical(train.Spending_Score,categories=['Low','Average','High'],ordered=True).codes

# Var_1
train.Var_1=pd.Categorical(train.Var_1).codes
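
The ordered encoding used for `Spending_Score` can be checked on a few values. This is a small sketch with made-up inputs; it also shows that a value outside the declared categories is encoded as `-1`, which is worth keeping in mind when the same encoding is applied to the test set.

```python
import pandas as pd

levels = ["Low", "Average", "High"]

# Made-up values, for illustration only
scores = pd.Series(["Low", "High", "Average", "Low"])
codes = pd.Categorical(scores, categories=levels, ordered=True).codes
print(list(codes))  # [0, 2, 1, 0]

# A value that is not one of the declared categories gets the code -1
unseen = pd.Categorical(["Low", "Unknown"], categories=levels, ordered=True).codes
print(list(unseen))  # [0, -1]
```

For `Var_1` no category order is given, so `pd.Categorical` falls back to the sorted unique values.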

### Checking for the correlation between all features and the labels

In [None]:
# Creating encoded label Dataframe
label=pd.Categorical(train.Segmentation,categories=['A','B','C','D']).codes
label

In [None]:
# Correlation between features and label
correlation_data=pd.DataFrame(label,columns=['label'])
correlation_data=correlation_data.join(train)
correlation_data.head()

In [None]:
plt.figure(figsize=(15,10))
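# Drop the string-valued Segmentation column: corr() needs numeric data,
# and the encoded target is already present as 'label'
sns.heatmap(correlation_data.drop(columns='Segmentation').corr(), annot=True)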

In [None]:
# Label and Features datasets
# For the Y_train we will use the label encoded dataframe
# and for the X_train we will drop the label from our dataset
Y_train=label
X_train=train.drop('Segmentation',axis=1)

In [None]:
Y_train.shape

X_train.shape

## Splitting the data for training and validation

In [None]:
# Splitting our data into train (X_train, y_train) and validation (X_valid, y_valid) subsets
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, Y_train, test_size=.25, random_state=2)

## XGBClassifier Model

In [None]:
# from sklearn.model_selection import GridSearchCV
# from sklearn.model_selection import KFold
# grid = {
#         'min_child_weight': [1, 5, 10],
#         'gamma': [0.5, 1, 1.5, 2, 5],
#         'subsample': [0.6, 0.8, 1.0],
#         'colsample_bytree': [0.6, 0.8, 1.0],
#         'max_depth': [3, 4, 5, 15],
#         }
# kf = KFold(n_splits=2)

# gs = GridSearchCV(estimator = XGBClassifier(n_estimators=500), param_grid = grid, scoring='accuracy', n_jobs=4, cv=kf)

In [None]:
# gs.fit(X_train, y_train)

# y_pred = gs.predict(X_valid)

In [None]:
# from sklearn.metrics import accuracy_score

# accuracy = accuracy_score(y_valid, y_pred)
# print("Gs Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
# gs.best_estimator_

In [None]:
from xgboost.sklearn import XGBClassifier

# missing=np.nan: the original missing=1 would make XGBoost treat every
# feature value equal to 1 (e.g. encoded binary columns) as missing
xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=5,
              learning_rate=0.3, max_delta_step=0, max_depth=4,
              min_child_weight=5, missing=np.nan, n_estimators=500, n_jobs=4,
              objective='multi:softprob', eval_metric='mlogloss', use_label_encoder=False,
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
              subsample=0.8, verbosity=None)

In [None]:
from sklearn.metrics import accuracy_score
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_valid)

In [None]:
accuracy = accuracy_score(y_valid, y_pred)
print("Xgb Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
# Show features importances according to the model
xgb.feature_importances_
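
The raw importance array is easier to read once it is paired with the column names. The sketch below uses hypothetical names and values as stand-ins for `X_train.columns` and `xgb.feature_importances_`; the numbers are not results from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train.columns and xgb.feature_importances_
feature_names = ["Age", "Gender", "Spending_Score", "Family_Size"]
importances = np.array([0.45, 0.05, 0.30, 0.20])

# Label each importance with its feature and sort from high to low
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

With the fitted model, the same pattern would read `pd.Series(xgb.feature_importances_, index=X_train.columns)`.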

## LGBMClassifier Model

In [None]:
# Set the core hyperparameters of the LightGBM classifier:
from lightgbm import LGBMClassifier

model = LGBMClassifier(    boosting_type='gbdt', 
                           max_depth=5, 
                           learning_rate=0.01, 
                           objective='multiclass', # Multi Class Classification
                           n_estimators=100,
                           n_jobs=-1 )

Lgbm = model.fit(X_train, y_train,eval_metric='multi_logloss',eval_set=(X_valid , y_valid))
valid_accuracy = Lgbm.score(X_valid , y_valid)

In [None]:
accuracy = valid_accuracy
print("LGBM Accuracy: %.2f%%" % (accuracy * 100.0))

## RandomForestClassifier Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
Rf = RandomForestClassifier(
    max_depth=2, 
    random_state=0,
    n_estimators=1000)

Rf.fit(X_train, y_train)
y_pred = Rf.predict(X_valid)

accuracy = accuracy_score(y_valid, y_pred)
print("RF Accuracy: %.2f%%" % (accuracy * 100.0))
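
Accuracy alone does not show which segments get confused with each other. As a possible complement, the sketch below builds a confusion table and per-segment recall with pandas. The label arrays are invented; on the real data they would be `y_valid` and a model's `y_pred`.

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted label codes (0=A, 1=B, 2=C, 3=D)
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_hat = np.array([0, 1, 1, 1, 2, 0, 3, 3])

# Rows are true segments, columns are predicted segments.
# The table is square here because every class occurs in both arrays.
confusion = pd.crosstab(
    pd.Series(y_true, name="true"), pd.Series(y_hat, name="predicted")
)

# Overall accuracy: share of exact matches
accuracy = (y_true == y_hat).mean()

# Per-segment recall: correct predictions divided by true count per row
recall = pd.Series(np.diag(confusion.to_numpy()), index=confusion.index)
recall = recall / confusion.sum(axis=1)

print(confusion)
print(f"accuracy={accuracy:.2f}")
print(recall)
```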