
# **Market Entry Analysis for ABG Motors in India**

**Import Essential Libraries**

In [73]:
import warnings
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
warnings.filterwarnings("ignore")

## **Classification Model for Japanese Market**

**Initialize Dataset**

In [68]:
df_jp = pd.read_excel('JPN Data.xlsx')

**Data Overview**

In [69]:
df_jp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID          40000 non-null  object 
 1   CURR_AGE    40000 non-null  int64  
 2   GENDER      40000 non-null  object 
 3   ANN_INCOME  40000 non-null  float64
 4   AGE_CAR     40000 non-null  int64  
 5   PURCHASE    40000 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 1.8+ MB


In [70]:
df_jp.describe()

Unnamed: 0,CURR_AGE,ANN_INCOME,AGE_CAR,PURCHASE
count,40000.0,40000.0,40000.0,40000.0
mean,44.99745,359398.87805,359.08025,0.575775
std,11.82008,175109.26295,203.063724,0.494231
min,25.0,70089.0,1.0,0.0
25%,35.0,219766.0,235.0,0.0
50%,45.0,337656.833333,331.0,1.0
75%,55.0,464261.0,444.0,1.0
max,65.0,799970.666667,1020.0,1.0


In [None]:
# Showing the first five rows
df_jp.head()

Unnamed: 0,ID,CURR_AGE,GENDER,ANN_INCOME,AGE_CAR,PURCHASE
0,00001Q15YJ,50,M,445344.0,439,0
1,00003I71CQ,35,M,107634.0,283,0
2,00003N47FS,59,F,502786.666667,390,1
3,00005H41DE,43,M,585664.0,475,0
4,00007E17UM,39,F,705722.666667,497,1


**Data Preprocessing**

In [None]:
# Checking for null data
df_jp.isnull().sum().sort_values(ascending=False)

ID            0
CURR_AGE      0
GENDER        0
ANN_INCOME    0
AGE_CAR       0
PURCHASE      0
dtype: int64

In [None]:
# Checking for duplicate data
df_jp.duplicated().any()

False

In [71]:
# Encoding gender attribute
encoder = LabelEncoder()
df_jp['GENDER']=encoder.fit_transform(df_jp['GENDER'])

# Dividing maintenance days into segments
df_jp['SEGMENT'] = np.where(
    df_jp['AGE_CAR'] < 200, 1, np.where(
    df_jp['AGE_CAR'] <= 360, 2, np.where(
    df_jp['AGE_CAR'] <= 500, 3, 4
    )))
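
The nested `np.where` calls can be hard to scan. As a sketch, the same thresholds (below 200, up to 360, up to 500, above 500) can be written with `np.select`. The helper name `to_segment` and the sample values are illustrative and not part of the original notebook.

```python
import numpy as np

def to_segment(days):
    """Map a day count to segments 1-4 using the notebook's thresholds."""
    days = np.asarray(days)
    # Conditions are checked in order; the first one that matches wins
    conditions = [days < 200, days <= 360, days <= 500]
    return np.select(conditions, [1, 2, 3], default=4)

# Illustrative values on both sides of each boundary
print(to_segment([1, 199, 200, 360, 361, 500, 501, 1020]))
```

On the real data this would be used as `df_jp['SEGMENT'] = to_segment(df_jp['AGE_CAR'])`.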

In [6]:
# Dropping unnecessary attributes
df_jp.drop(columns=['ID', 'AGE_CAR'], inplace=True)

**Data Visualization**

In [85]:
numerical = df_jp.select_dtypes(include=['int64','float64']).corr()
fig = px.imshow(numerical, color_continuous_scale='rdbu')
fig.update_layout(title={'text': 'Correlation Analysis','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'})

**Insight gained from the heatmap:**
1. CURR_AGE: The correlation with PURCHASE is negligible (-0.012). The negative sign points to slightly more purchases among younger customers, but the effect is too small to rely on.
2. GENDER: The correlation with PURCHASE is very weak (0.037). Since males are encoded as 1, the positive sign means males are slightly more likely to purchase the vehicle.
3. ANN_INCOME: The correlation with PURCHASE is weak but noticeable (0.169). The positive sign means customers with higher incomes are more likely to purchase the vehicle.
4. SEGMENT: This attribute has the strongest correlation with PURCHASE (0.369), a moderate positive relationship: customers in higher segments are more likely to purchase the vehicle.
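
The values in the heatmap can also be read off as a ranked table. The following is a minimal sketch: the helper `rank_correlations` and the toy frame are hypothetical, and on the real data `df_jp` would be passed in instead.

```python
import pandas as pd

def rank_correlations(df, target):
    """Pearson correlation of every numeric column with `target`,
    ordered from strongest to weakest by absolute value."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "SEGMENT":  [1, 2, 3, 4, 1, 4, 2, 3],
    "CURR_AGE": [30, 60, 45, 50, 40, 35, 55, 25],
    "PURCHASE": [0, 0, 1, 1, 0, 1, 0, 1],
})
ranked = rank_correlations(toy, "PURCHASE")
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top instead of pushing them to the bottom of the list.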

In [86]:
fig = px.violin(df_jp, x='PURCHASE', y="CURR_AGE")
fig.update_layout(title={'text': 'Current Age and Purchase Relation','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'}, yaxis_title='CURRENT AGE')

**Insight gained from the violin plot:**

The median CURR_AGE is 45 in both groups, and there are no outliers.

In the **Non-Purchase Graph (0)**, the density dips in the 38-48 range, so fewer non-purchasers fall in these ages.
In the **Purchase Graph (1)**, the density is highest in the 34-51 range, so more purchasers fall in these ages. The two graphs are consistent with each other. The plot also suggests that the youngest customers tend not to purchase the vehicle, but the near-zero correlation with PURCHASE (-0.012) shows that any overall age effect is very small.

In [None]:
fig = px.histogram(df_jp, x='PURCHASE', color='GENDER', barmode='group')
newnames = {'1':'Male', '0': 'Female'}
fig.for_each_trace(lambda t: t.update(name = newnames[t.name], legendgroup = newnames[t.name], hovertemplate = t.hovertemplate.replace(t.name, newnames[t.name])))
fig.update_layout(title={'text': 'Gender and Purchase Relation','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'}, yaxis_title='COUNT')

**Insight gained from the histogram:**

Males outnumber females in both the **Non-Purchase Graph (0)** and the **Purchase Graph (1)**.

The gap between males and females is small in the **Non-Purchase Graph (0)** (1211) but considerably larger in the **Purchase Graph (1)** (3359). In summary, males are more likely to purchase the car than females.
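
Raw count differences depend on how many males and females are in the data, so a purchase rate per gender is a fairer comparison. A minimal sketch, using made-up rows and a hypothetical helper `purchase_rate_by`:

```python
import pandas as pd

def purchase_rate_by(df, column, target="PURCHASE"):
    """Share of rows with target == 1 within each value of `column`."""
    return df.groupby(column)[target].mean()

# Toy rows for illustration (1 = male, 0 = female, as in the encoded data)
toy = pd.DataFrame({
    "GENDER":   [1, 1, 1, 1, 0, 0, 0, 0],
    "PURCHASE": [1, 1, 1, 0, 1, 0, 0, 1],
})
rates = purchase_rate_by(toy, "GENDER")
print(rates)
```

On the notebook's data the call would be `purchase_rate_by(df_jp, 'GENDER')`.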

In [None]:
fig = px.violin(df_jp, x='PURCHASE', y="ANN_INCOME")
fig.update_layout(title={'text': 'Annual Income and Purchase Relation','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'}, yaxis_title='ANNUAL INCOME')

**Insight gained from the violin plot:**

The median in the **Non-Purchase Graph (0)** is 296761, lower than the median in the **Purchase Graph (1)**, which is 364932. This indicates that people with higher incomes are more likely to buy the car. There are also no outliers in the data.

In detail, the distribution of ANN_INCOME in the **Non-Purchase Graph (0)** peaks at around 135k. For higher ANN_INCOME, roughly 575k and above, the density in the **Purchase Graph (1)** is higher than in the **Non-Purchase Graph (0)**. The two graphs are consistent with each other.

In [67]:
fig = px.histogram(df_jp, x='PURCHASE', color='SEGMENT', barmode='group', category_orders={"SEGMENT": [1,  2, 3, 4]})
fig.update_layout(title={'text': 'Segment and Purchase Relation','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'}, yaxis_title='COUNT')

**Insight gained from the histogram:**

The largest group in the **Non-Purchase Graph (0)** is segment 2, whereas the largest group in the **Purchase Graph (1)** is segment 3. Overall, purchasers are concentrated in the higher segments, while non-purchasers are mainly in the lower segments. We can conclude that a higher segment means a higher probability of purchasing the car.

**Splitting Dataset Into Training and Testing**

In [18]:
X = df_jp.drop(['PURCHASE'], axis=1)
y = df_jp['PURCHASE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
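
The split above is unseeded, so the reported accuracies change from run to run. A sketch of a reproducible, stratified variant follows. `random_state=42` is an arbitrary choice, and the toy arrays stand in for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 100 rows, 60% positive class
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixes the shuffle so the split can be reproduced
    stratify=y_toy,    # keeps the class ratio the same in both parts
)
print(len(y_te), y_te.mean())
```

Stratifying matters less with 40,000 rows than with small samples, but it removes one source of variation when comparing models.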

**Model Selection**

In [19]:
models = {
    'LogisticRegression': LogisticRegression(),
    'RandomForest': RandomForestClassifier(),
    'DecisionTree': DecisionTreeClassifier(),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=2),
    'XGBoost': XGBClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = round(accuracy_score(y_test, y_pred) * 100, 2)
    print(name, 'Analysis\n')
    print(f"Accuracy: {accuracy}%\n")

LogisticRegression Analysis

Accuracy: 62.17%

RandomForest Analysis

Accuracy: 65.57%

DecisionTree Analysis

Accuracy: 63.55%

SVC Analysis

Accuracy: 63.22%

KNeighborsClassifier Analysis

Accuracy: 50.62%

XGBoost Analysis

Accuracy: 70.02%
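
KNN, SVC and logistic regression are sensitive to feature scale, and ANN_INCOME is several orders of magnitude larger than the other columns. The sketch below standardizes features inside a pipeline on synthetic data. The data, the seed and `n_neighbors=5` are assumptions for illustration; whether scaling helps on the real dataset is not tested here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
flag = rng.integers(0, 2, size=400)          # small-scale feature that decides the label
income = rng.uniform(70_000, 800_000, 400)   # large-scale feature unrelated to the label
X_syn = np.column_stack([flag, income])
y_syn = flag

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Same classifier, without and with standardization
raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)

raw_acc = raw.score(X_te, y_te)
scaled_acc = scaled.score(X_te, y_te)
print("unscaled:", raw_acc)
print("scaled:  ", scaled_acc)
```

Putting the scaler in the pipeline means it is fitted on the training part only, so no information from the test set leaks into the preprocessing.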



**Hyperparameter Tuning for XGBoost Model**

In [35]:
params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'subsample': [0.5, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.4, 0.6, 0.8, 1.0],
    'gamma': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
}
random = RandomizedSearchCV(models['XGBoost'], param_distributions=params, n_iter=10, scoring='roc_auc', n_jobs=-1, cv=5, verbose=3)
random.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
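
For scale, the search space above contains 3 × 7 × 5 × 4 × 4 × 6 = 10,080 combinations, of which `n_iter=10` evaluates only ten. The sketch below counts the grid with scikit-learn's `ParameterGrid` and draws ten candidates with `ParameterSampler`, the sampler that `RandomizedSearchCV` uses. `random_state=0` is an arbitrary choice, so the drawn candidates will differ from those in the run above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7, 8, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'subsample': [0.5, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.4, 0.6, 0.8, 1.0],
    'gamma': [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}

total = len(ParameterGrid(params))                                   # size of the full grid
candidates = list(ParameterSampler(params, n_iter=10, random_state=0))

print(total)
print(len(candidates))
print(candidates[0])
```

Because so little of the grid is visited, raising `n_iter` or fixing `random_state` in the search is a cheap way to make the tuning result more stable.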


In [36]:
# Best parameters for XGBoost model
xgb_best = random.best_estimator_
xgb_best

**Model Result**

In [37]:
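xgb_pred = xgb_best.predict(X_test)
print('XGBoost Hyperparameter Tuned Analysis\n')
print('XGBoost Best Parameters:', random.best_params_)
accuracy = round(accuracy_score(y_test, xgb_pred) * 100, 2)
# Evaluate the tuned model's predictions (xgb_pred); y_pred still holds the
# predictions of the last model fitted in the selection loop (untuned XGBoost)
conf_matrix = confusion_matrix(y_test, xgb_pred)
classification_rep = classification_report(y_test, xgb_pred)
print(f"Accuracy: {accuracy}%")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{classification_rep}")

XGBoost Hyperparameter Tuned Analysis

XGBoost Best Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 6, 'learning_rate': 0.1, 'gamma': 0.0, 'colsample_bytree': 1.0}
Accuracy: 70.65%
Confusion Matrix:
[[2094 1245]
 [1153 3508]]
Classification Report:
              precision    recall  f1-score   support

           0       0.64      0.63      0.64      3339
           1       0.74      0.75      0.75      4661

    accuracy                           0.70      8000
   macro avg       0.69      0.69      0.69      8000
weighted avg       0.70      0.70      0.70      8000

Note: the confusion matrix and classification report recorded above come from a run that passed `y_pred` (the untuned XGBoost predictions left over from the model-selection loop) instead of `xgb_pred`. In that matrix, 2094 + 3508 = 5602 correct predictions out of 8000 is 70.02%, the untuned score. Rerunning the cell yields the tuned model's matrix and report; the 70.65% accuracy was already computed from `xgb_pred`.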
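
The figures in the classification report can be recomputed by hand from the printed confusion matrix, which also gives a majority-class baseline to compare the accuracy against. A short worked sketch using only the four counts shown in the output:

```python
# Confusion matrix as printed: rows are true classes, columns are predictions
tn, fp = 2094, 1245
fn, tp = 1153, 3508

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all predictions that are correct
precision_1 = tp / (tp + fp)      # of predicted purchasers, share that really purchased
recall_1 = tp / (tp + fn)         # of actual purchasers, share the model found
baseline = (tp + fn) / total      # accuracy of always predicting class 1

print(f"accuracy    {accuracy:.4f}")
print(f"precision_1 {precision_1:.4f}")
print(f"recall_1    {recall_1:.4f}")
print(f"baseline    {baseline:.4f}")
```

The baseline is the useful reference point: a model is only informative to the extent that it beats always predicting the more common class.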



## **Prediction of Potential Customers in the Indian Market Based on the Model**

**Initialize Dataset**

In [80]:
df_in = pd.read_excel('IN_Data.xlsx')

**Data Overview**

In [39]:
df_in.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   ID          70000 non-null  object        
 1   CURR_AGE    70000 non-null  int64         
 2   GENDER      70000 non-null  object        
 3   ANN_INCOME  70000 non-null  int64         
 4   DT_MAINT    70000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 2.7+ MB


In [40]:
df_in.describe()

Unnamed: 0,CURR_AGE,ANN_INCOME,DT_MAINT
count,70000.0,70000.0,70000
mean,44.995314,1148679.0,2018-06-28 16:10:28.662856960
min,25.0,300033.0,2016-09-14 00:00:00
25%,35.0,856823.8,2018-03-15 00:00:00
50%,45.0,1125152.0,2018-07-26 00:00:00
75%,55.0,1438676.0,2018-12-24 00:00:00
max,65.0,1999989.0,2019-06-30 00:00:00
std,11.822122,399450.5,


In [41]:
df_in.head()

Unnamed: 0,ID,CURR_AGE,GENDER,ANN_INCOME,DT_MAINT
0,20710B05XL,54,M,1425390,2018-04-20
1,89602T51HX,47,M,1678954,2018-06-08
2,70190Z52IP,60,M,931624,2017-07-31
3,25623V15MU,55,F,1106320,2017-07-31
4,36230I68CE,32,F,748465,2019-01-27


**Data Preprocessing**

In [42]:
# Checking for null data
df_in.isnull().sum().sort_values(ascending=False)

ID            0
CURR_AGE      0
GENDER        0
ANN_INCOME    0
DT_MAINT      0
dtype: int64

In [43]:
# Checking for duplicate data
df_in.duplicated().any()

False

In [81]:
# Encoding gender attribute
encoder = LabelEncoder()
df_in['GENDER']=encoder.fit_transform(df_in['GENDER'])

# Converting the maintenance date into days since maintenance (using 1 July 2019 as the reference date for every customer in the Indian dataset)
df_in['DT_MAINT'] = (np.datetime64('2019-07-01') - df_in['DT_MAINT']).dt.days.astype('int64')

# Converting INR to JPY
df_in['ANN_INCOME'] = df_in['ANN_INCOME'] * 0.52

# Dividing maintenance days into segments
df_in['SEGMENT'] = np.where(
    df_in['DT_MAINT'] < 200, 1, np.where(
    df_in['DT_MAINT'] <= 360, 2, np.where(
    df_in['DT_MAINT'] <= 500, 3, 4
    )))
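
As a check on the date conversion, the sketch below applies the same subtraction to three sample dates, including the earliest (2016-09-14) and latest (2019-06-30) values of DT_MAINT reported by `describe()`. The helper name `days_since` is illustrative.

```python
import pandas as pd

REFERENCE_DATE = pd.Timestamp("2019-07-01")  # reference date used in the notebook

def days_since(dates, reference=REFERENCE_DATE):
    """Whole days between each date and the reference date."""
    return (reference - pd.to_datetime(dates)).dt.days

sample = pd.Series(["2019-06-30", "2018-07-01", "2016-09-14"])
result = days_since(sample).tolist()
print(result)
```

The earliest maintenance date maps to 1020 days, which equals the maximum AGE_CAR in the Japanese data, so the Indian values fall in the same range as the feature the model was trained on.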

In [82]:
# Dropping unnecessary attributes
df_in.drop(columns=['ID', 'DT_MAINT'], inplace=True)

**Predict the Indian Market Using the Tuned XGBoost Model**

In [84]:
df_in['PURCHASE'] = xgb_best.predict(df_in)
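
The prediction relies on `df_in` having the same feature columns, in the same order, as the training data. A slightly more defensive sketch makes that explicit. The helper `align_features` and the sample row are hypothetical.

```python
import pandas as pd

def align_features(df, feature_names):
    """Return df restricted to feature_names, in that order.
    Raises KeyError if an expected column is missing."""
    missing = [c for c in feature_names if c not in df.columns]
    if missing:
        raise KeyError(f"missing feature columns: {missing}")
    return df[list(feature_names)]

# Column order used for training in the notebook (after the drops)
train_columns = ["CURR_AGE", "GENDER", "ANN_INCOME", "SEGMENT"]

# A made-up row with the columns in a different order
new_data = pd.DataFrame({
    "SEGMENT": [2], "ANN_INCOME": [500000.0], "GENDER": [1], "CURR_AGE": [40],
})
aligned = align_features(new_data, train_columns)
print(aligned.columns.tolist())
```

In the notebook this would correspond to `xgb_best.predict(align_features(df_in, X.columns))`.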

**Data Visualization**

In [86]:
# Creating new column for counting purpose
df_in['COUNT'] = 1

In [87]:
# Map the predicted classes to readable labels for the legend
df_in['LABEL'] = df_in['PURCHASE'].map({1: 'Potential', 0: 'Not Potential'})
fig = px.pie(df_in, names='LABEL', values='COUNT', hole=.3)
fig.update_layout(legend_title='PURCHASE',title={'text': 'Potential Customer Ratio in Indian Market','y':0.95,'x':0.5,'xanchor': 'center','yanchor': 'top'})

**Insight gained from the pie chart:**

Most customers in the Indian market are predicted to purchase a new vehicle. The model labels 71.3% of customers as potential buyers: 49889 out of 70000 are predicted to purchase a new vehicle, and the remaining 20111 are predicted not to.
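
The percentages in the pie chart follow directly from the predicted class counts. A minimal sketch of that arithmetic, using the two counts quoted above and a hypothetical helper `class_shares`:

```python
from collections import Counter

def class_shares(predictions):
    """Percentage of each predicted class, rounded to one decimal."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Counts quoted in the text: 49889 predicted purchasers, 20111 non-purchasers
preds = [1] * 49889 + [0] * 20111
shares = class_shares(preds)
print(shares)
```

With a pandas column the same figures come from `df_in['PURCHASE'].value_counts(normalize=True)`.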

## **Conclusion**

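The tuned XGBoost model reaches an accuracy of 70.65% on the held-out Japanese test set. For comparison, always predicting the majority class would score about 58% (4661 of the 8000 test customers are purchasers), so the model adds useful signal, although roughly three in ten predictions are still wrong.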

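Based on the model's predictions, about 71.3% of the Indian customers are potential buyers, which suggests that entering the Indian market is a good decision for ABG Motors. This conclusion assumes that purchase behaviour learned from Japanese customers carries over to Indian customers.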