# Model Iteration 2 – Logistic Regression Model with Feature Engineered Variables

As the bank's data scientist, you created a benchmark model to predict which customers are likely to buy a term deposit. However, management wants to improve on the benchmark results. In Exercise 3.04, Creating New Features from Existing Ones, you discussed the business scenario with the marketing and operations teams and created a new variable, assetIndex, by feature engineering three raw variables. You will now fit another logistic regression model on the feature engineered variables to try to improve the results.

In this activity, you will be feature engineering some of the variables to verify their effects on the predictions.

### 1. Open the Colab notebook used for the feature engineering in Exercise 3.04, Creating New Features from Existing Ones. Perform all of the steps up to Step 18.

In [14]:
import pandas as pd
import numpy as np

In [15]:
file_url = 'https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter03/bank-full.csv'
bankData = pd.read_csv(file_url, sep=";")

In [16]:
# Normalising data
from sklearn import preprocessing
x = bankData[['balance']].values.astype(float)
# Creating the scaling function
minmaxScaler = preprocessing.MinMaxScaler()
# Transforming the balance data by normalising it with minmaxScaler
bankData['balanceTran'] = minmaxScaler.fit_transform(x)
# Adding a small numerical constant to eliminate 0 values
bankData['balanceTran'] = bankData['balanceTran'] + 0.00001
# Let us transform values for loan data
bankData['loanTran'] = 1
# Giving a weight of 5 if there is no loan
bankData.loc[bankData['loan'] == 'no', 'loanTran'] = 5
# Let us transform values for Housing data
bankData['houseTran'] = 5
# Giving a weight of 1 if the customer does not have a house
bankData.loc[bankData['housing'] == 'no', 'houseTran'] = 1
# Let us now create the new variable which is a product of all these
bankData['assetIndex'] = bankData['balanceTran'] * bankData['loanTran'] * bankData['houseTran']
bankData.head()

,age,job,marital,education,default,balance,housing,loan,contact,day,...,duration,campaign,pdays,previous,poutcome,y,balanceTran,loanTran,houseTran,assetIndex
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,261,1,-1,0,unknown,no,0.092269,5,5,2.306734
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,151,1,-1,0,unknown,no,0.073077,5,5,1.826916
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,76,1,-1,0,unknown,no,0.072832,1,5,0.364158
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,92,1,-1,0,unknown,no,0.086486,5,5,2.162153
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,198,1,-1,0,unknown,no,0.072822,5,1,0.364112
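
To see how the three transformed columns combine into assetIndex, here is a minimal sketch on three made-up rows. The toy values are hypothetical and are not taken from bank-full.csv. The sketch applies the min-max formula by hand instead of calling MinMaxScaler, and it uses `Series.map` in place of the `.loc` assignments above, which gives the same weights only under the assumption that loan and housing contain nothing but yes and no.

```python
import pandas as pd

# Hypothetical toy rows, not taken from bank-full.csv
toy = pd.DataFrame({
    "balance": [0.0, 500.0, 1000.0],
    "loan": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
})

# Min-max scale balance to [0, 1], then add the small constant
bal = toy["balance"]
toy["balanceTran"] = (bal - bal.min()) / (bal.max() - bal.min()) + 0.00001

# Same weights as in the cell above: no loan -> 5, loan -> 1;
# housing 'yes' -> 5, housing 'no' -> 1
toy["loanTran"] = toy["loan"].map({"no": 5, "yes": 1})
toy["houseTran"] = toy["housing"].map({"no": 1, "yes": 5})

# The index is the product of the three transformed columns
toy["assetIndex"] = toy["balanceTran"] * toy["loanTran"] * toy["houseTran"]
print(toy[["balanceTran", "loanTran", "houseTran", "assetIndex"]])
```

The first toy row has the lowest balance, so its scaled balance would be exactly 0 without the added constant, and the product would be 0 no matter what the loan and housing weights were.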


### 2. Create dummy variables for the categorical variables using the pd.get_dummies() function. Exclude original raw variables such as loan and housing, which were used to create the new variable, assetIndex.

In [17]:
bankData.drop(columns=['balanceTran', 'loanTran', 'houseTran', 'loan', 'housing', 'balance'], inplace=True)

In [18]:
bankData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   age         45211 non-null  int64  
 1   job         45211 non-null  object 
 2   marital     45211 non-null  object 
 3   education   45211 non-null  object 
 4   default     45211 non-null  object 
 5   contact     45211 non-null  object 
 6   day         45211 non-null  int64  
 7   month       45211 non-null  object 
 8   duration    45211 non-null  int64  
 9   campaign    45211 non-null  int64  
 10  pdays       45211 non-null  int64  
 11  previous    45211 non-null  int64  
 12  poutcome    45211 non-null  object 
 13  y           45211 non-null  object 
 14  assetIndex  45211 non-null  float64
dtypes: float64(1), int64(6), object(8)
memory usage: 5.2+ MB


In [19]:
# Creating dummy variables for the categorical columns
cat = ['job', 'marital', 'education', 'default', 'contact', 'month', 'poutcome']
bankCat = pd.get_dummies(bankData[cat])
bankCat.shape

(45211, 40)
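
The column count of the dummy frame is the total number of distinct levels across the encoded columns. The sketch below illustrates this on a small hypothetical frame (the rows are invented for illustration). Depending on the pandas version, the dummy values may display as True/False rather than 1/0.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "single"],
    "default": ["no", "no", "yes", "no"],
})

# One indicator column per (column, level) pair
dummies = pd.get_dummies(toy)

# Total number of distinct levels across the columns
n_levels = sum(toy[col].nunique() for col in toy.columns)

print(list(dummies.columns))
print(dummies.shape, n_levels)
```

Here marital has three levels and default has two, so the result has five columns.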

### 3. Select the numerical variables including the new feature engineered variable, assetIndex, that was created.
### 4. Transform some of the numerical variables by normalizing them using the MinMaxScaler() function.

In [20]:
num = ['age', 'day', 'duration', 'campaign', 'pdays', 'previous', 'assetIndex']
bankNum = bankData[num]
bankNum.shape

(45211, 7)

In [21]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
bankNum = pd.DataFrame(scaler.fit_transform(bankNum), columns=bankNum.columns)
bankNum

,age,day,duration,campaign,pdays,previous,assetIndex
0,0.519481,0.133333,0.053070,0.000000,0.000000,0.000000,0.152681
1,0.337662,0.133333,0.030704,0.000000,0.000000,0.000000,0.120922
2,0.194805,0.133333,0.015453,0.000000,0.000000,0.000000,0.024103
3,0.376623,0.133333,0.018707,0.000000,0.000000,0.000000,0.143111
4,0.194805,0.133333,0.040260,0.000000,0.000000,0.000000,0.024100
...,...,...,...,...,...,...,...
45206,0.428571,0.533333,0.198658,0.032258,0.000000,0.000000,0.026576
45207,0.688312,0.533333,0.092721,0.016129,0.000000,0.000000,0.029292
45208,0.701299,0.533333,0.229158,0.064516,0.212156,0.010909,0.041268
45209,0.506494,0.533333,0.103294,0.048387,0.000000,0.000000,0.026104
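
With its default settings, MinMaxScaler rescales each column independently to the range 0 to 1 using (x - min) / (max - min). The following sketch checks this against a manual calculation on a hypothetical two-column frame. The toy values are assumptions chosen for easy arithmetic.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values, not taken from the bank data
toy = pd.DataFrame({"age": [20, 30, 60], "campaign": [1, 3, 11]})

# Scale with scikit-learn and wrap the array back into a DataFrame
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# The same scaling written out column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(np.allclose(scaled.values, manual.values))
```

Because fit_transform returns a plain NumPy array, wrapping it in pd.DataFrame creates a fresh 0 to n-1 index. That matches bankData here only because bankData also has a default RangeIndex.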


### 5. Concatenate the numerical variables and categorical variables using the pd.concat() function and then create X and Y variables.

In [22]:
bankData = pd.concat([bankCat, bankNum, bankData['y']], axis=1)
bankData

,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,poutcome_success,poutcome_unknown,age,day,duration,campaign,pdays,previous,assetIndex,y
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0.519481,0.133333,0.053070,0.000000,0.000000,0.000000,0.152681,no
1,0,0,0,0,0,0,0,0,0,1,...,0,1,0.337662,0.133333,0.030704,0.000000,0.000000,0.000000,0.120922,no
2,0,0,1,0,0,0,0,0,0,0,...,0,1,0.194805,0.133333,0.015453,0.000000,0.000000,0.000000,0.024103,no
3,0,1,0,0,0,0,0,0,0,0,...,0,1,0.376623,0.133333,0.018707,0.000000,0.000000,0.000000,0.143111,no
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0.194805,0.133333,0.040260,0.000000,0.000000,0.000000,0.024100,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,0,0,0,0,0,0,0,0,0,1,...,0,1,0.428571,0.533333,0.198658,0.032258,0.000000,0.000000,0.026576,yes
45207,0,0,0,0,0,1,0,0,0,0,...,0,1,0.688312,0.533333,0.092721,0.016129,0.000000,0.000000,0.029292,yes
45208,0,0,0,0,0,1,0,0,0,0,...,1,0,0.701299,0.533333,0.229158,0.064516,0.212156,0.010909,0.041268,yes
45209,0,1,0,0,0,0,0,0,0,0,...,0,1,0.506494,0.533333,0.103294,0.048387,0.000000,0.000000,0.026104,no


In [34]:
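# Predictors: every column except the target
X = bankData.drop('y', axis=1)
# Target kept as a single-column DataFrame of 'yes'/'no' labels
y = bankData[['y']]
# Numeric copy of the target (yes = 1, no = 0); the split below uses the string labels in y
Y = y.replace({'yes' : 1, 'no' : 0})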
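
As an alternative way to encode the target, the sketch below maps a one-dimensional Series of labels to 0 and 1 with `Series.map`. The toy labels are hypothetical. With this encoding, the mean of the result is the share of yes labels, which is a quick check of class balance.

```python
import pandas as pd

# Hypothetical toy target, not taken from the bank data
toy_y = pd.Series(["no", "yes", "no", "no"], name="y")

# Map the string labels to integers
encoded = toy_y.map({"yes": 1, "no": 0})

print(encoded.tolist())
# With 1 = yes, the mean is the fraction of positive labels
print(encoded.mean())
```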

### 6. Split the dataset using the train_test_split() function and then fit a new model using the LogisticRegression() model on the new features.

In [42]:
from sklearn.model_selection import train_test_split
# Seeding NumPy's global random number generator
np.random.seed(1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(31647, 47) (13564, 47) (31647, 1) (13564, 1)
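
The row counts in the printed shapes follow from the 45211 rows and test_size=0.3. The sketch below reproduces them with plain arithmetic, assuming the test set size is rounded up and the remaining rows go to training, which is consistent with the shapes shown above.

```python
import math

n_samples = 45211   # rows reported by bankData.info()
test_size = 0.3     # fraction passed to train_test_split

# Test rows are rounded up; training gets whatever is left
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

The 47 columns in X are the 40 dummy columns plus the 7 scaled numerical columns.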


In [43]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# Passing y as a 1-D array, which is the shape scikit-learn expects for the target
model.fit(X_train, y_train.values.ravel())

LogisticRegression()

### 7. Analyze the results after generating the confusion matrix and classification report.

In [44]:
model.score(X_test, y_test)

0.9006192863462106

In [40]:
from sklearn.metrics import confusion_matrix, classification_report

In [48]:
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))

[[11733   280]
 [ 1068   483]]
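
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, with labels in sorted order (no, then yes). The sketch below unpacks the four cells printed above and recomputes the accuracy, together with the score a model would get by always predicting no. The variable names are chosen for illustration.

```python
# Confusion matrix copied from the output above
cm = [[11733, 280],
      [1068, 483]]

# Treating 'yes' as the positive class
(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

# Share of all test rows that were classified correctly
accuracy = (tn + tp) / total

# Accuracy of always predicting the majority class 'no'
baseline = (tn + fp) / total

print(total, accuracy, baseline)
```

The baseline comes out at about 0.886, so the accuracy of roughly 0.90 is only a modest gain over always predicting no.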


In [47]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          no       0.92      0.98      0.95     12013
         yes       0.63      0.31      0.42      1551

    accuracy                           0.90     13564
   macro avg       0.77      0.64      0.68     13564
weighted avg       0.88      0.90      0.89     13564
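
The yes row of the report can be recomputed from the confusion matrix counts. The following sketch does this with the standard definitions of precision, recall, and F1, using the four counts from the matrix printed earlier.

```python
# Counts from the confusion matrix, with 'yes' as the positive class
tn, fp, fn, tp = 11733, 280, 1068, 483

# Of the customers predicted 'yes', how many actually bought
precision = tp / (tp + fp)

# Of the customers who actually bought, how many were found
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values match the report: the model is right about 63% of the time when it predicts yes, but it finds only about 31% of the customers who actually bought a term deposit.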

