### Problem Statement

A cloth manufacturing company wants to know which segments or attributes drive high sales.

Approach - A Random Forest can be built with Sales as the target variable (first converted to a categorical variable), with all other variables treated as independent variables in the analysis.

About the data: 
    
Let’s consider a Company dataset with 11 variables and 400 records.

The attributes are as follows: 
    
 Sales -- Unit sales (in thousands) at each location

 Competitor Price -- Price charged by competitor at each location

 Income -- Community income level (in thousands of dollars)

 Advertising -- Local advertising budget for company at each location (in thousands of dollars)

 Population -- Population size in region (in thousands)

 Price -- Price company charges for car seats at each site

 Shelf Location at stores -- A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

 Age -- Average age of the local population

 Education -- Education level at each location

 Urban -- A factor with levels No and Yes to indicate whether the store is in an urban or rural location

 US -- A factor with levels No and Yes to indicate whether the store is in the US or not

In [136]:
# Importing the required libraries

import pandas as pd
import numpy  as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_score, KFold, RepeatedStratifiedKFold, RepeatedKFold

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Company_Data.csv')
df

,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.50,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.40,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No
...,...,...,...,...,...,...,...,...,...,...,...
395,12.57,138,108,17,203,128,Good,33,14,Yes,Yes
396,6.14,139,23,3,37,120,Medium,55,11,No,Yes
397,7.41,162,26,12,368,159,Medium,40,18,Yes,Yes
398,5.94,100,79,7,284,95,Bad,50,12,Yes,Yes


In [3]:
# Getting the descriptive statistics

df.describe()

,Sales,CompPrice,Income,Advertising,Population,Price,Age,Education
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,7.496325,124.975,68.6575,6.635,264.84,115.795,53.3225,13.9
std,2.824115,15.334512,27.986037,6.650364,147.376436,23.676664,16.200297,2.620528
min,0.0,77.0,21.0,0.0,10.0,24.0,25.0,10.0
25%,5.39,115.0,42.75,0.0,139.0,100.0,39.75,12.0
50%,7.49,125.0,69.0,5.0,272.0,117.0,54.5,14.0
75%,9.32,135.0,91.0,12.0,398.5,131.0,66.0,16.0
max,16.27,175.0,120.0,29.0,509.0,191.0,80.0,18.0


In [4]:
# Getting the information about the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Sales        400 non-null    float64
 1   CompPrice    400 non-null    int64  
 2   Income       400 non-null    int64  
 3   Advertising  400 non-null    int64  
 4   Population   400 non-null    int64  
 5   Price        400 non-null    int64  
 6   ShelveLoc    400 non-null    object 
 7   Age          400 non-null    int64  
 8   Education    400 non-null    int64  
 9   Urban        400 non-null    object 
 10  US           400 non-null    object 
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB


In [5]:
df.isnull().sum()

Sales          0
CompPrice      0
Income         0
Advertising    0
Population     0
Price          0
ShelveLoc      0
Age            0
Education      0
Urban          0
US             0
dtype: int64

There are no null values in any feature of the data set.

In [6]:
# Separating the independent variables from the data set

X = df.drop('Sales', axis = 1)

In [7]:
X

,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,138,73,11,276,120,Bad,42,17,Yes,Yes
1,111,48,16,260,83,Good,65,10,Yes,Yes
2,113,35,10,269,80,Medium,59,12,Yes,Yes
3,117,100,4,466,97,Medium,55,14,Yes,Yes
4,141,64,3,340,128,Bad,38,13,Yes,No
...,...,...,...,...,...,...,...,...,...,...
395,138,108,17,203,128,Good,33,14,Yes,Yes
396,139,23,3,37,120,Medium,55,11,No,Yes
397,162,26,12,368,159,Medium,40,18,Yes,Yes
398,100,79,7,284,95,Bad,50,12,Yes,Yes


In [8]:
# Applying Label encoding for the 'ShelveLoc' feature

X.ShelveLoc.value_counts().to_dict()

{'Medium': 219, 'Bad': 96, 'Good': 85}

In [9]:
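# Ordinal encoding (Bad < Medium < Good); assigning back avoids inplace modification via attribute access
X['ShelveLoc'] = X['ShelveLoc'].map({'Bad': 1, 'Medium': 2, 'Good': 3})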

In [10]:
X

,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,138,73,11,276,120,1,42,17,Yes,Yes
1,111,48,16,260,83,3,65,10,Yes,Yes
2,113,35,10,269,80,2,59,12,Yes,Yes
3,117,100,4,466,97,2,55,14,Yes,Yes
4,141,64,3,340,128,1,38,13,Yes,No
...,...,...,...,...,...,...,...,...,...,...
395,138,108,17,203,128,3,33,14,Yes,Yes
396,139,23,3,37,120,2,55,11,No,Yes
397,162,26,12,368,159,2,40,18,Yes,Yes
398,100,79,7,284,95,1,50,12,Yes,Yes
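The mapping above encodes an order (Bad < Medium < Good). As a minimal sketch of the same idea, an ordered categorical makes that order explicit instead of relying on hand-typed numbers. The toy `shelf` series and the name `shelf_encoded` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy column standing in for X['ShelveLoc'] (values are illustrative only).
shelf = pd.Series(['Bad', 'Good', 'Medium', 'Medium', 'Bad'], name='ShelveLoc')

# State the order once: Bad < Medium < Good.
order = ['Bad', 'Medium', 'Good']
shelf_cat = pd.Categorical(shelf, categories=order, ordered=True)

# .codes is zero-based, so add 1 to reproduce the 1/2/3 scheme used above.
shelf_encoded = pd.Series(shelf_cat.codes + 1, index=shelf.index, name='ShelveLoc')
print(shelf_encoded.tolist())  # [1, 3, 2, 2, 1]
```

A value outside `order` would get code -1 (0 after the shift), which makes unexpected levels easy to spot.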


In [11]:
# Applying One Hot Encoding for the features 'Urban' and 'US'

X = pd.get_dummies(X)
X

,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban_No,Urban_Yes,US_No,US_Yes
0,138,73,11,276,120,1,42,17,0,1,0,1
1,111,48,16,260,83,3,65,10,0,1,0,1
2,113,35,10,269,80,2,59,12,0,1,0,1
3,117,100,4,466,97,2,55,14,0,1,0,1
4,141,64,3,340,128,1,38,13,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
395,138,108,17,203,128,3,33,14,0,1,0,1
396,139,23,3,37,120,2,55,11,1,0,0,1
397,162,26,12,368,159,2,40,18,0,1,0,1
398,100,79,7,284,95,1,50,12,0,1,0,1
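For a two-level factor, `Urban_No` and `Urban_Yes` carry the same information (one is always the complement of the other). The sketch below shows one way to keep a single indicator per factor with `drop_first=True`. The small `toy` frame is an assumption made for illustration. Tree-based models are not harmed by the redundant columns, so this is optional.

```python
import pandas as pd

# Illustrative three-row frame with the two binary factors from the data set.
toy = pd.DataFrame({
    'Price': [120, 83, 128],
    'Urban': ['Yes', 'Yes', 'No'],
    'US': ['Yes', 'No', 'No'],
})

# drop_first=True drops the first level ('No'), leaving one 0/1 column per factor.
encoded = pd.get_dummies(toy, columns=['Urban', 'US'], drop_first=True, dtype=int)
print(encoded)
```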


In [12]:
# Converting the values of the dependent variable into categorical form

y = pd.cut(df.Sales, bins = [-1, 10,17], labels = ['low', 'high'])
y

0       low
1      high
2      high
3       low
4       low
       ... 
395    high
396     low
397     low
398     low
399     low
Name: Sales, Length: 400, dtype: category
Categories (2, object): ['low' < 'high']
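`pd.cut` uses right-closed intervals by default, so the bins `[-1, 10, 17]` mean (-1, 10] is 'low' and (10, 17] is 'high'. A short sketch on a few hand-picked values (assumed for illustration) shows where the boundary falls:

```python
import pandas as pd

# Hand-picked values around the boundary at 10.
sales = pd.Series([0.0, 9.5, 10.0, 10.06, 16.27])

# Same bins and labels as in the cell above.
labels = pd.cut(sales, bins=[-1, 10, 17], labels=['low', 'high'])
print(labels.tolist())  # ['low', 'low', 'low', 'high', 'high']
```

A sale of exactly 10.0 is labelled 'low', and the lower edge of -1 is what lets a sale of 0.0 fall inside the first bin.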

In [73]:
y.value_counts()

low     322
high     78
Name: Sales, dtype: int64

The classes are imbalanced: 322 low versus 78 high, roughly 80% to 20%.

In [17]:
# Splitting the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
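With an 80/20 class balance, a purely random split can shift the share of 'high' rows between the training and test sets. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions similar in both parts. The synthetic `X_toy` and `y_toy` are stand-ins that only reuse the 322/78 counts; this is not the split the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as y (322 low, 78 high).
y_toy = np.array(['low'] * 322 + ['high'] * 78)
X_toy = np.arange(400).reshape(-1, 1)

# stratify=y_toy preserves the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

train_share = np.mean(y_tr == 'high')
test_share = np.mean(y_te == 'high')
print(round(train_share, 3), round(test_share, 3))
```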

### Building the Random Forest Classifier Model

In [103]:
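# Building the model without balancing the data
# (random_state = True works only because Python treats True as the integer 1; an explicit integer seed would be clearer)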

model_rf = RandomForestClassifier(n_estimators = 150, oob_score = True, random_state = True, max_features = 0.8, 
                                  criterion = 'entropy')
model_rf.fit(X_train,y_train)
y_pred_test = model_rf.predict(X_test)

In [104]:
accuracy_score(y_test,y_pred_test)

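0.8608247422680413

Note: this value does not match the confusion matrix and classification report below, which give (17 + 92) / 120 ≈ 0.91 on the 120-row test set. It equals 167/194, and the execution count (104) suggests the cell was re-run after the data had been resampled.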

In [54]:
confusion_matrix(y_test,y_pred_test)

array([[17,  7],
       [ 4, 92]], dtype=int64)

In [55]:
print(classification_report(y_test,y_pred_test))

              precision    recall  f1-score   support

        high       0.81      0.71      0.76        24
         low       0.93      0.96      0.94        96

    accuracy                           0.91       120
   macro avg       0.87      0.83      0.85       120
weighted avg       0.91      0.91      0.91       120
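The figures in the report can be recomputed directly from the confusion matrix. In the sketch below the rows are the true classes and the columns the predicted classes, both in the order high, low, as in the output above. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the cell above.
# Rows: true high, true low. Columns: predicted high, predicted low.
cm = np.array([[17, 7],
               [4, 92]])

accuracy = np.trace(cm) / cm.sum()            # (17 + 92) / 120
recall_high = cm[0, 0] / cm[0].sum()          # 17 / (17 + 7)
precision_high = cm[0, 0] / cm[:, 0].sum()    # 17 / (17 + 4)

print(round(accuracy, 2), round(precision_high, 2), round(recall_high, 2))
# 0.91 0.81 0.71
```

The recall of 0.71 for 'high' means 7 of the 24 high-sales stores in the test set are missed, which overall accuracy alone does not show.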



In [59]:
model_rf.score(X_train,y_train)

1.0
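A training score of 1.0 mostly reflects that fully grown trees memorise their training rows. The models here are built with `oob_score = True`, but the resulting estimate is never read. The sketch below, on synthetic data from `make_classification` (an assumption made so the example runs on its own), compares the training accuracy with the out-of-bag accuracy stored in `oob_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, imbalanced stand-in data (not the Company data set).
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

rf = RandomForestClassifier(n_estimators=150, oob_score=True, random_state=1)
rf.fit(X_toy, y_toy)

train_acc = rf.score(X_toy, y_toy)  # accuracy on rows the trees have seen
oob_acc = rf.oob_score_             # accuracy on rows left out of each tree's bootstrap sample
print(round(train_acc, 3), round(oob_acc, 3))
```

The out-of-bag figure is usually the more honest of the two, since each row is scored only by trees that did not train on it.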

In [None]:
# Creating Random Forest Classifier with oversampling

In [74]:
from imblearn.over_sampling import SMOTE

In [76]:
smt = SMOTE()

In [77]:
X_resampled, y_resampled = smt.fit_resample(X,y)

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)
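Resampling before the split means the test set contains synthetic rows, some of them interpolated from rows that end up in the training set, so the test scores may be optimistic. A common alternative is to split first and balance only the training part. The sketch below illustrates that order of operations with plain random oversampling from `sklearn.utils.resample`, used here as a simple stand-in for SMOTE so the example depends only on scikit-learn. The data and all variable names are assumptions. With imbalanced-learn, `SMOTE().fit_resample(X_train, y_train)` would take the place of step 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in with the notebook's class counts (322 low, 78 high).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(400, 5))
y_toy = np.array(['low'] * 322 + ['high'] * 78)

# Step 1: split first, so the test set holds only original rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Step 2: oversample the minority class in the training part only.
minority = y_tr == 'high'
n_major = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_major, random_state=42
)
X_tr_bal = np.vstack([X_tr[~minority], X_min_up])
y_tr_bal = np.concatenate([y_tr[~minority], y_min_up])

print((y_tr_bal == 'high').sum(), (y_tr_bal == 'low').sum(), len(y_te))
```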

In [80]:
model_rf = RandomForestClassifier(n_estimators = 150, oob_score = True, random_state = True, max_features = 0.8)
model_rf.fit(X_train,y_train)
y_pred_test = model_rf.predict(X_test)

In [83]:
y_pred_test = model_rf.predict(X_test)

In [84]:
accuracy_score(y_test,y_pred_test)

0.845360824742268

In [85]:
confusion_matrix(y_test,y_pred_test)

array([[76, 13],
       [17, 88]], dtype=int64)

In [86]:
print(classification_report(y_test,y_pred_test))

              precision    recall  f1-score   support

        high       0.82      0.85      0.84        89
         low       0.87      0.84      0.85       105

    accuracy                           0.85       194
   macro avg       0.84      0.85      0.84       194
weighted avg       0.85      0.85      0.85       194



### Hyperparameter Tuning

In [121]:
model1 = RandomForestClassifier(oob_score = True, random_state = True)

In [122]:
hyp = { 'n_estimators': np.arange(20,150),
        'criterion': ['gini', 'entropy'],
        'max_depth': np.arange(3,9),
        'min_samples_split': np.arange(3,20),
        'min_samples_leaf': np.arange(2,10),
      }

random_CV = RandomizedSearchCV(model1, hyp, cv = 7)
random_CV.fit(X_train,y_train)

RandomizedSearchCV(cv=7,
                   estimator=RandomForestClassifier(oob_score=True,
                                                    random_state=True),
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': array([3, 4, 5, 6, 7, 8]),
                                        'min_samples_leaf': array([2, 3, 4, 5, 6, 7, 8, 9]),
                                        'min_samples_split': array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
                                        'n_estimators': array([ 20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31...
        46,  47,  48,  49,  50,  51,  52,  53,  54,  55,  56,  57,  58,
        59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,  70,  71,
        72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,  84,
        85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
        98,  99, 100, 101, 102, 103, 

In [123]:
random_CV.best_score_

0.9000343406593406

In [129]:
random_CV.best_params_

{'n_estimators': 95,
 'min_samples_split': 18,
 'min_samples_leaf': 7,
 'max_depth': 6,
 'criterion': 'entropy'}
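`RandomizedSearchCV` samples a fixed number of candidates (`n_iter`, 10 by default) and, without a `random_state`, draws a different set on every run, so the best parameters above may not be reproducible. The sketch below shows a seeded search over `scipy.stats.randint` distributions. The synthetic data, the smaller ranges and the low `n_iter` are assumptions chosen to keep the example quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the Company data set).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# randint(low, high) samples integers from low up to high - 1.
param_dist = {
    'n_estimators': randint(20, 60),
    'max_depth': randint(3, 9),
    'min_samples_leaf': randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,          # number of sampled candidates
    cv=3,
    random_state=42,   # fixes which candidates are sampled
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```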

In [130]:
model2 = random_CV.best_estimator_

In [131]:
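# best_estimator_ is already refit on X_train (refit=True by default), so this call is redundant but harmless
model2.fit(X_train,y_train)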

RandomForestClassifier(criterion='entropy', max_depth=6, min_samples_leaf=7,
                       min_samples_split=18, n_estimators=95, oob_score=True,
                       random_state=True)

In [132]:
y_pred_test = model2.predict(X_test)

In [133]:
accuracy_score(y_test,y_pred_test)

0.8402061855670103

In [None]:
# Tuning changes the test accuracy very little (0.840 versus 0.845 for the untuned model).

### Using Cross Validation

In [139]:
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

In [140]:
scores = cross_val_score(model_rf, X_resampled, y_resampled, cv= kfold, scoring="accuracy")
print(scores)

[0.87692308 0.86153846 0.87692308 0.93846154 0.890625   0.953125
 0.96875    0.921875   0.9375     0.875     ]


In [142]:
np.max(scores)

0.96875

In [143]:
np.min(scores)

0.8615384615384616

In [144]:
np.mean(scores)

0.9100721153846154

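The mean cross-validated accuracy (0.910) is higher than the hold-out accuracy on the resampled data (about 0.85). This is a different estimate for the same model rather than an improvement to it. Because SMOTE was applied before the folds were created, synthetic points can land in a different fold from the rows they were generated from, so the figure is probably optimistic.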
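When reporting cross-validation it helps to give the spread as well as the mean, and to keep the class ratio constant across folds. The sketch below uses `StratifiedKFold` for that. The synthetic data and the smaller forest are assumptions made so the example runs quickly on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the Company data set).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Each fold keeps roughly the same class proportions as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=1),
    X_toy, y_toy, cv=cv, scoring='accuracy'
)
print(f"{scores.mean():.3f} +/- {scores.std():.3f}")
```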

### Using AdaBoost

In [145]:
from sklearn.ensemble import AdaBoostClassifier

In [162]:
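# Note: scikit-learn 1.2 renamed base_estimator to estimator, and later versions accept only the new name
model3 = AdaBoostClassifier(base_estimator=model_rf, n_estimators=150, random_state=42)
model3.fit(X_train, y_train)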

AdaBoostClassifier(base_estimator=RandomForestClassifier(criterion='entropy',
                                                         max_features=0.8,
                                                         n_estimators=150,
                                                         oob_score=True,
                                                         random_state=True),
                   n_estimators=150, random_state=42)
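AdaBoost is usually applied to weak learners, and boosting a 150-tree random forest that already fits the training data perfectly leaves little for later rounds to correct. For comparison, the sketch below fits AdaBoost with its default base learner, a depth-1 decision tree. The synthetic data and variable names are assumptions, and no claim is made about how this would score on the Company data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Company data set).
X_toy, y_toy = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# With no estimator argument, AdaBoost boosts depth-1 decision trees (stumps).
ada = AdaBoostClassifier(n_estimators=150, random_state=42)
ada.fit(X_tr, y_tr)

ada_acc = ada.score(X_te, y_te)
print(round(ada_acc, 3))
```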

In [163]:
y_pred_test = model3.predict(X_test)

In [164]:
accuracy_score(y_test,y_pred_test)

0.8556701030927835

Accuracy with AdaBoost (0.856) is slightly higher than the 0.845 obtained by the random forest on the resampled data.

In [165]:
confusion_matrix(y_test,y_pred_test)

array([[79, 10],
       [18, 87]], dtype=int64)

In [166]:
print(classification_report(y_test,y_pred_test))

              precision    recall  f1-score   support

        high       0.81      0.89      0.85        89
         low       0.90      0.83      0.86       105

    accuracy                           0.86       194
   macro avg       0.86      0.86      0.86       194
weighted avg       0.86      0.86      0.86       194
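The problem statement asks which attributes drive high sales, and a fitted random forest exposes one view of this through `feature_importances_`. The sketch below ranks features on synthetic data. The column names `feature_0` to `feature_5` are placeholders; on the real data the index would be `X.columns`. Impurity-based importances can favour features with many distinct values, so the ranking is a starting point rather than a conclusion.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder column names.
X_arr, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_toy, y_toy)

# Importances sum to 1; sort so the most influential features come first.
importances = pd.Series(
    rf.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```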

