# Fitting a Logistic Regression Model on a High-Dimensional Dataset

You want to test how your model performs when the dataset is large. To do this, you artificially augment the internet ads dataset by replicating its columns, so that the new dataset has 100 times as many features as the original. You will then fit a logistic regression model on this new dataset and observe the results.

Hint: In this activity, we will use a notebook similar to Exercise 14.01, Loading and Cleaning the Dataset, and we will also be fitting a logistic regression model as done in Chapter 3, Binary Classification.

Note: We will be using the same ads dataset for this activity.

The internet_ads dataset has been uploaded to GitHub.

In [7]:
import time
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Load the DataFrame, skipping malformed lines
# (on_bad_lines='skip' replaces the deprecated error_bad_lines=False; needs pandas 1.3 or later)
adData = pd.read_csv('../Dataset/ad.data', header=None, on_bad_lines='skip')
# Separating the dependent and independent variables
X = adData.iloc[:, :-1].copy()  # copy, so the column assignments below do not write into a slice of adData
Y = adData.iloc[:, -1]
# Replacing special characters in the first 3 columns, which are of type object
for i in range(0, 3):
    X[i] = X[i].str.replace("?", 'NaN', regex=False).values.astype(float)
# Replacing special characters in the remaining columns, which are of type integer
for i in range(3, X.shape[1]):
    X[i] = X[i].replace("?", 'NaN').values.astype(float)
# Imputing the NaN values with the column mean
for i in range(X.shape[1]):
    X[i] = X[i].fillna(X[i].mean())
# Normalising the dataset
minmaxScaler = MinMaxScaler()
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
# Printing the output
X_tran

,0,1,2,3,4,5,6,7,8,9,...,1548,1549,1550,1551,1552,1553,1554,1555,1556,1557
0,0.194053,0.194053,0.016642,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.087637,0.730829,0.136820,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.050078,0.358372,0.116138,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.092332,0.730829,0.129978,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.092332,0.730829,0.129978,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3274,0.264476,0.145540,0.009190,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3275,0.156495,0.217527,0.023077,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3276,0.034429,0.186228,0.086932,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3277,0.098626,0.241541,0.065176,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
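
To make the cleaning steps concrete, here is a minimal plain-Python sketch of what they do to a single column: treat `?` as missing, fill the gaps with the column mean, then rescale to the [0, 1] range as min-max scaling does. The function name `impute_and_scale` and the toy values are hypothetical, and the notebook itself uses pandas and `MinMaxScaler` rather than this code.

```python
def impute_and_scale(raw):
    """Mean-impute '?' entries, then min-max scale the column to [0, 1]."""
    # Parse the strings; '?' marks a missing value
    values = [None if v.strip() == "?" else float(v) for v in raw]
    # Mean of the values that are present
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    # Replace each missing value with that mean
    filled = [mean if v is None else v for v in values]
    # Min-max scaling: (x - min) / (max - min)
    lo, hi = min(filled), max(filled)
    if hi == lo:
        # A constant column carries no spread; map it to zeros
        return [0.0 for _ in filled]
    return [(v - lo) / (hi - lo) for v in filled]


# Hypothetical toy column with one missing entry
raw_column = ["10", "?", "30", "20"]
scaled = impute_and_scale(raw_column)
print(scaled)  # [0.0, 0.5, 1.0, 0.5]
```

The missing entry is filled with the mean of 10, 30, and 20, which is 20, and therefore lands at 0.5 after scaling.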


Create a high-dimensional dataset by replicating the columns 100 times using the np.tile() function. Print the shape of the new dataset and observe the number of features in the new dataset.

In [3]:
# Creating a high-dimensional dataset by tiling the columns 100 times
# (pd.np was removed from pandas, so NumPy is called directly)
X_hd = pd.DataFrame(np.tile(X_tran, (1, 100)))
X_hd.shape

(3279, 155800)
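
To see what the tiling does, here is a minimal sketch on a toy array. The array `small` is made up for illustration and is not part of the ads dataset. With `reps=(1, 3)`, `np.tile` leaves the rows as they are and repeats the block of columns three times, so every copy is identical to the original.

```python
import numpy as np

# Hypothetical 2 x 2 array standing in for X_tran
small = np.array([[1, 2],
                  [3, 4]])

# reps=(1, 3): repeat once along the rows, three times along the columns
tiled = np.tile(small, (1, 3))

print(tiled.shape)  # (2, 6)
print(tiled)
# [[1 2 1 2 1 2]
#  [3 4 3 4 3 4]]
```

The same logic explains the shape printed above: 1,558 original columns tiled 100 times give 155,800 columns, while the number of rows stays at 3,279.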

In [4]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_hd, Y, test_size=0.3, random_state=123)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(2295, 155800)
(984, 155800)
(2295,)
(984,)
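
The split sizes printed above can be checked by hand. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the test-set size up and gives the remaining rows to the training set. Treat that rounding rule as an assumption; it is consistent with the shapes shown here.

```python
import math

n_samples = 3279   # rows in X_hd
test_size = 0.3    # fraction passed to train_test_split

# Assumed rounding rule: the test-set size is rounded up
n_test = math.ceil(test_size * n_samples)
# The remaining rows go to the training set
n_train = n_samples - n_test

print(n_train, n_test)  # 2295 984
```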


In [5]:
t0=time.time()
# Fit a logistic regression model on the new dataset and note the time it takes to fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

print("Total training time:", round(time.time()-t0, 3), "s")

Total training time: 85.046 s
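
If you want to time several fits in the same way, one option is a small helper built on `time.perf_counter()`, which is designed for measuring elapsed intervals. The helper name `timed` and the stand-in workload are hypothetical; in the notebook you would call `timed(model.fit, X_train, y_train)` instead.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without the dataset
result, seconds = timed(sum, range(1_000_000))
print("Total time:", round(seconds, 3), "s")
```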


In [6]:
# make prediction
y_pred = model.predict(X_test)

In [11]:
# metrics

print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print('Confusion matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification report:')
print(classification_report(y_test, y_pred))

Accuracy: 0.97
Confusion matrix:
[[110  16]
 [ 12 846]]
Classification report:
              precision    recall  f1-score   support

         ad.       0.90      0.87      0.89       126
      nonad.       0.98      0.99      0.98       858

    accuracy                           0.97       984
   macro avg       0.94      0.93      0.94       984
weighted avg       0.97      0.97      0.97       984
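
The figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes them for the `ad.` class from the four counts printed above. It assumes scikit-learn's layout, in which rows are true labels and columns are predicted labels, ordered `ad.` then `nonad.`.

```python
# Counts copied from the confusion matrix above
tp, fn = 110, 16   # true ad.: predicted ad., predicted nonad.
fp, tn = 12, 846   # true nonad.: predicted ad., predicted nonad.

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # share of predicted ads that really are ads
recall = tp / (tp + fn)      # share of real ads that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.97 precision=0.90 recall=0.87 f1=0.89
```

Because only 126 of the 984 test rows are ads, the overall accuracy is dominated by the `nonad.` class, and the per-class recall of 0.87 is the more telling number for the minority class.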

