## Code Along :: Feature Selection and Logistic Regression

## About the Dataset

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... 

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. 

- Number of Instances: 4601 (1813 Spam = 39.4%)
- Number of Attributes: 58 (57 continuous, 1 nominal class label)
- Attribute Information:
    - The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0).
    - 48 attributes are continuous real [0,100] numbers of type `word freq WORD`, i.e. the percentage of words in the e-mail that match WORD.
    - 6 attributes are continuous real [0,100] numbers of type `char freq CHAR`, i.e. the percentage of characters in the e-mail that match CHAR.
    - 1 attribute is a continuous real [1,...] number of type `capital run length average`, i.e. the average length of uninterrupted sequences of capital letters.
    - 1 attribute is a continuous integer [1,...] number of type `capital run length longest`, i.e. the length of the longest uninterrupted sequence of capital letters.
    - 1 attribute is a continuous integer [1,...] number of type `capital run length total`, i.e. the sum of the lengths of uninterrupted sequences of capital letters in the e-mail.
    - 1 attribute is a nominal {0,1} class of type spam, i.e. it denotes whether the e-mail was considered spam (1) or not (0).
- Missing Attribute Values: None
- Class Distribution:
    - Spam: 1813 (39.4%)
    - Non-Spam: 2788 (60.6%)



You can read more about the dataset [here](https://archive.ics.uci.edu/ml/datasets/spambase).
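
The class percentages quoted above follow directly from the counts. A minimal check, using only the figures listed in this section (the variable names are illustrative):

```python
# Counts taken from the dataset description above.
n_spam = 1813
n_non_spam = 2788
n_total = n_spam + n_non_spam

spam_share = n_spam / n_total
non_spam_share = n_non_spam / n_total

print(f"total={n_total}, spam={spam_share:.1%}, non-spam={non_spam_share:.1%}")
```

The non-spam share is a useful baseline: a classifier that always answers "non-spam" would already be right about 60.6% of the time, so any model below should be judged against that figure rather than against 50%.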


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings("ignore")

## Loading the dataset

In [2]:
# Load the Spam data for the mini challenge
# The target variable is the last column (index 57): 1 = spam, 0 = non-spam
df = pd.read_csv('spambase.data.csv',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### 1. Get an overview of your data by using info() and describe() functions of pandas.


In [3]:
df.shape

(4601, 58)

In [4]:
# Overview of the data
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58 columns):
0     4601 non-null float64
1     4601 non-null float64
2     4601 non-null float64
3     4601 non-null float64
4     4601 non-null float64
5     4601 non-null float64
6     4601 non-null float64
7     4601 non-null float64
8     4601 non-null float64
9     4601 non-null float64
10    4601 non-null float64
11    4601 non-null float64
12    4601 non-null float64
13    4601 non-null float64
14    4601 non-null float64
15    4601 non-null float64
16    4601 non-null float64
17    4601 non-null float64
18    4601 non-null float64
19    4601 non-null float64
20    4601 non-null float64
21    4601 non-null float64
22    4601 non-null float64
23    4601 non-null float64
24    4601 non-null float64
25    4601 non-null float64
26    4601 non-null float64
27    4601 non-null float64
28    4601 non-null float64
29    4601 non-null float64
30    4601 non-null float64
31    4601 non-null float

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


### 2. Split the data into train and test set and fit the base logistic regression model on train set.

In [5]:
# Divide the dataset into train and test sets and fit the base logistic model
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
lr = LogisticRegression()
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
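
The split above is purely random, so the spam share in the test set can drift from the 39.4% of the full data. As an optional variation (not what this notebook does, and it would change the scores reported below), `train_test_split` accepts a `stratify` argument that keeps the class ratio the same in both parts. A minimal sketch on made-up labels with a similar imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with roughly the same imbalance as the spam data (an assumption).
y_demo = np.array([1] * 394 + [0] * 606)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the share of each class equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(round(y_demo.mean(), 3), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```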

### 3. Find out the accuracy, print out the Classification report and Confusion Matrix.

In [6]:
from sklearn.metrics import accuracy_score

In [7]:
# Calculate accuracy, print out the Classification report and Confusion Matrix.
print("Accuracy on test data:", lr.score(X_test,y_test))
y_pred = lr.predict(X_test)
#accuracy_score(y_test, y_pred)

Accuracy on test data: 0.9319333816075308


In [8]:
accuracy_score(y_test, y_pred)

0.9319333816075308

In [9]:
print("Confusion Matrix:")#,confusion_matrix(y_test,y_pred))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp, fp)
print(fn, tn)

Confusion Matrix:
517 34
60 770


In [10]:
print("Classification Report: \n",classification_report(y_test,y_pred))

Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.94      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381
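
The accuracy and the class-1 row of the report can be recomputed by hand from the four confusion-matrix counts printed above. A short sketch, treating spam as the positive class (variable names are illustrative):

```python
# Counts printed by the confusion-matrix cell above (spam = positive class).
tp, fp = 517, 34
fn, tn = 60, 770

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision_spam = tp / (tp + fp)   # of e-mails flagged as spam, how many were spam
recall_spam = tp / (tp + fn)      # of actual spam, how much was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_spam:.2f} recall={recall_spam:.2f} f1={f1_spam:.2f}")
```

For a spam filter the two error types are not equally costly: the 34 false positives are legitimate e-mails that would be hidden from the user, while the 60 false negatives are spam that slips through.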



In [11]:
lr.coef_

array([[-2.16805450e-01, -1.28166822e-01,  1.14405099e-01,
         7.34284240e-01,  5.99767114e-01,  5.23393968e-01,
         2.04994572e+00,  4.44215763e-01,  1.14834883e+00,
         8.35366878e-02, -3.54906079e-01, -1.09211759e-01,
         3.41020364e-02,  1.57280831e-01,  1.33967166e+00,
         1.06926351e+00,  9.85716797e-01,  7.79073045e-02,
         1.02520248e-01,  9.26969623e-01,  2.48487177e-01,
         2.49447422e-01,  2.14271849e+00,  3.07180333e-01,
        -1.77622081e+00, -8.36888511e-01, -3.76150550e+00,
         3.37513893e-01, -1.17319555e+00, -6.76766515e-01,
        -1.36354100e-01, -6.88026680e-02, -1.17460011e+00,
         1.46772108e-01, -9.10339625e-01,  9.83000129e-01,
        -2.81212165e-01, -5.86646162e-01, -7.76659081e-01,
        -1.77394328e-01, -1.16818857e+00, -1.46324346e+00,
        -6.29666835e-01, -1.72698946e+00, -6.63654824e-01,
        -1.31544276e+00, -5.85044405e-01, -1.43031573e+00,
        -1.07896296e+00, -1.82356030e-01, -3.81539565e-0

In [12]:
lr.intercept_

array([-1.49485805])
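
`coef_` and `intercept_` are all the fitted model stores: the predicted spam probability is the sigmoid of the weighted sum of the features plus the intercept. The sketch below illustrates this on synthetic data from `make_classification` (an assumption, used so the example is self-contained; it is not the spam dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the spam dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = X @ w + b, then the sigmoid maps it to P(class 1).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with what scikit-learn reports for the positive class.
p_sklearn = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```

Read this way, a positive coefficient raises the log-odds of spam as that feature grows, and a negative one lowers them. Because the features are on different scales, the raw magnitudes are not directly comparable across columns.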

### 4. Copy dataset df into df1 variable and apply correlation on df1

In [13]:
# Copy df in new variable df1
df1 = df.copy()

### 5. As we have learned, one of the assumptions of the logistic regression model is that the independent features should not be highly correlated with each other (i.e. no multicollinearity). So we have to find the features that have a correlation higher than 0.75 and remove them so that this assumption is satisfied.

In [14]:
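# Drop the target column (label 57) and look at absolute pairwise correlations.
# `columns=57` replaces the positional axis argument, which newer pandas rejects.
df1.drop(columns=57).corr().abs()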

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,47,48,49,50,51,52,53,54,55,56
0,1.0,0.016759,0.065627,0.013273,0.023119,0.059674,0.007669,0.00395,0.106263,0.041198,...,0.017755,0.026505,0.021196,0.033301,0.058292,0.117419,0.008844,0.044491,0.061382,0.089165
1,0.016759,1.0,0.033526,0.006923,0.02376,0.02484,0.003918,0.01628,0.003826,0.032962,...,0.015747,0.007282,0.049837,0.018527,0.014461,0.009605,0.001946,0.002083,0.000271,0.02268
2,0.065627,0.033526,1.0,0.020246,0.077734,0.087564,0.036677,0.012003,0.093786,0.032075,...,0.026344,0.033213,0.016495,0.03312,0.10814,0.087618,0.003336,0.097398,0.107463,0.070114
3,0.013273,0.006923,0.020246,1.0,0.003238,0.010014,0.019784,0.010268,0.002454,0.004947,...,0.001924,0.000591,0.01237,0.007148,0.003138,0.010862,0.000298,0.00526,0.022081,0.021369
4,0.023119,0.02376,0.077734,0.003238,1.0,0.054054,0.147336,0.029598,0.020823,0.034495,...,0.032005,0.032759,0.046361,0.02639,0.025509,0.041582,0.002016,0.052662,0.05229,0.002492
5,0.059674,0.02484,0.087564,0.010014,0.054054,1.0,0.061163,0.079561,0.117438,0.013897,...,0.031693,0.019119,0.008705,0.015133,0.065043,0.105692,0.019894,0.010278,0.090172,0.082089
6,0.007669,0.003918,0.036677,0.019784,0.147336,0.061163,1.0,0.044545,0.050786,0.056809,...,0.031408,0.033089,0.051885,0.027653,0.053706,0.070127,0.046612,0.041565,0.059677,0.008344
7,0.00395,0.01628,0.012003,0.010268,0.029598,0.079561,0.044545,1.0,0.105302,0.083129,...,0.021224,0.027432,0.032494,0.019548,0.031454,0.05791,0.008012,0.011254,0.037575,0.040252
8,0.106263,0.003826,0.093786,0.002454,0.020823,0.117438,0.050786,0.105302,1.0,0.130624,...,0.026017,0.014646,0.031003,0.013601,0.043639,0.149365,0.000522,0.111308,0.189247,0.248724
9,0.041198,0.032962,0.032075,0.004947,0.034495,0.013897,0.056809,0.083129,0.130624,1.0,...,0.016842,0.011945,0.003936,0.007357,0.036737,0.075786,0.04483,0.073677,0.103308,0.087273


In [15]:
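# Remove features correlated above 0.75 and then apply the logistic model
corr_matrix = df1.drop(columns=57).corr().abs()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
# The builtin `bool` is used because the `np.bool` alias was removed from NumPy.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))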

In [16]:
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]
print("Columns to be dropped: ")
print(to_drop)

Columns to be dropped: 
[33, 39]
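
To see what the upper-triangle filter does, here is a minimal sketch on a made-up three-column frame (the column names and data are assumptions for illustration). Column `b` is a noisy copy of `a`, and `c` is independent:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                  # independent column
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal so each pair is seen once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped when it correlates above the threshold with any earlier column.
to_drop_demo = [col for col in upper.columns if (upper[col] > 0.75).any()]
print(to_drop_demo)
```

Of each highly correlated pair, this approach keeps the column that appears first and drops the later one. It does not check which of the two is more useful for predicting the target.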


In [17]:
df1.drop(to_drop,axis=1,inplace=True)

### 6. Split the new subset of the data acquired by feature selection into train and test set and fit the logistic regression model on train set.

In [18]:
# Split the new subset of data and fit the logistic model on training data
X = df1.iloc[:,:-1]
y = df1.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### 7. Find out the accuracy, print out the Classification report and Confusion Matrix.

In [19]:
# Calculate accuracy , print out the Classification report and Confusion Matrix for new data
print("Accuracy on test data:", lr.score(X_test,y_test))
y_pred = lr.predict(X_test)
print("=="*20)
print("Confusion Matrix:")#,confusion_matrix(y_test,y_pred))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp, fp)
print(fn, tn)
print("=="*20)
print("Classification Report: \n",classification_report(y_test,y_pred))

Accuracy on test data: 0.9304851556842868
Confusion Matrix:
517 36
60 768
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.93      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381



In [20]:
lr.coef_

array([[-2.16312961e-01, -1.22685386e-01,  1.09229931e-01,
         7.34777380e-01,  6.11079101e-01,  5.46169663e-01,
         1.95734252e+00,  4.46812689e-01,  1.09448415e+00,
         8.41690171e-02, -3.69402937e-01, -1.19030303e-01,
         3.85172140e-02,  1.52881485e-01,  1.31317730e+00,
         1.05423405e+00,  9.70568403e-01,  9.14412771e-02,
         9.95073451e-02,  9.28763362e-01,  2.48941194e-01,
         2.52146762e-01,  2.12427485e+00,  2.89857438e-01,
        -1.74008716e+00, -8.26775316e-01, -3.81041612e+00,
         3.51311910e-01, -1.18688124e+00, -6.82835117e-01,
        -2.71736695e-01, -8.12329143e-02, -1.15157973e+00,
        -9.06714014e-01,  9.29402313e-01, -2.69508399e-01,
        -5.58584648e-01, -7.71684461e-01, -1.10916437e+00,
        -1.44185421e+00, -6.36660371e-01, -1.74713780e+00,
        -6.60213632e-01, -1.28250737e+00, -5.60665124e-01,
        -1.41402657e+00, -1.07372006e+00, -1.66838694e-01,
        -3.86902330e-01,  2.17592041e-01,  3.42981521e+0

In [21]:
lr.intercept_

array([-1.47098721])

### 8. After removing the highly correlated features, there is not much change in the score. Let's apply another feature selection technique (the chi-squared test) to see whether we can increase our score. Find the optimum number of features using chi-squared and fit the logistic model on train data.

In [22]:
# Apply chi-squared selection and fit the logistic model on train data.
# X and y are still the df1 subset from step 6 (55 features), so k=55 keeps every column.
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=chi2 , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    
    model = LogisticRegression(random_state=101)
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))
    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

For no of features= 20 , score= 0.9029688631426502
For no of features= 25 , score= 0.9152787834902245
For no of features= 30 , score= 0.9131064446053584
For no of features= 35 , score= 0.9196234612599565
For no of features= 40 , score= 0.9254163649529327
For no of features= 50 , score= 0.9254163649529327
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55
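
The loop above picks `k` by scoring each candidate on the test set, which tends to make the winning score optimistic. One common alternative, sketched here as a suggestion rather than as this notebook's method, is to put the selector and the classifier in a `Pipeline` and let `GridSearchCV` choose `k` by cross-validation on the training split only. The data is synthetic and non-negative (`chi2` requires non-negative features), and the grid values are arbitrary:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic non-negative data (an assumption; chi2 requires non-negative features).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = rng.poisson(lam=1.0, size=(300, 10)).astype(float)
X_demo[:, 0] += 3 * y_demo  # make the first column informative

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

pipe = Pipeline([
    ("select", SelectKBest(score_func=chi2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
# k is tuned by 5-fold cross-validation on the training split only.
search = GridSearchCV(pipe, {"select__k": [2, 5, 10]}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["select__k"]
test_accuracy = search.score(X_te, y_te)  # the test set is used once, at the end
print(best_k, round(test_accuracy, 3))
```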


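### 9. Find out the accuracy, print out the Confusion Matrix.

In [23]:
# Calculate accuracy, print out the Classification report and Confusion Matrix
# for `model`, the classifier fitted in the last loop iteration (k=55).
# X_test was transformed by the selector, so it is scored with `model`, not `lr`.
print("Accuracy on test data:", model.score(X_test,y_test))
y_pred = model.predict(X_test)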
print("=="*20)
print("Confusion Matrix:")#,confusion_matrix(y_test,y_pred))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp, fp)
print(fn, tn)
print("=="*20)
print("Classification Report: \n",classification_report(y_test,y_pred))

Accuracy on test data: 0.9304851556842868
Confusion Matrix:
517 36
60 768
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.93      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381



In [24]:
# Calculate accuracy , print out the Confusion Matrix 
# y_pred = lr.predict(X_test)
# print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))

### 10. With the chi-squared test the score does not improve, and the best result uses all 55 features, i.e. no feature is actually removed. Now let's see whether another feature selection technique, ANOVA, can increase our score. Find the optimum number of features using ANOVA and fit the logistic model on train data.

In [25]:
# Apply ANOVA (f_classif) selection and fit the logistic model on train data.
# X and y are still the df1 subset from step 6 (55 features).
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=f_classif , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    model = LogisticRegression()
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))

    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

# Calculate accuracy, print out the Confusion Matrix
# Use `model` (fitted on the selected features in the last iteration), not `lr`.
y_pred = model.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))


For no of features= 20 , score= 0.889210716871832
For no of features= 25 , score= 0.9015206372194062
For no of features= 30 , score= 0.9145546705286025
For no of features= 35 , score= 0.9203475742215785
For no of features= 40 , score= 0.9217958001448225
For no of features= 50 , score= 0.9261404779145547
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55
Confusion Matrix: 
 [[768  36]
 [ 60 517]]
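
`SelectKBest` also exposes which columns it kept and the score each column received, which is often more informative than the accuracy alone. A minimal sketch on toy data where only the first column carries signal (the data and names are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
# Only column 0 carries signal in this toy setup (an assumption for illustration).
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)

kept = selector.get_support(indices=True)   # column positions that survive
f_scores = selector.scores_                 # one ANOVA F-statistic per column
print("kept columns:", kept)
print("top-scoring column:", int(np.argmax(f_scores)))
```

On the spam data the same two attributes would show which of the 55 columns are discarded at each `k`.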


### 11. Find out the accuracy, print out the Confusion Matrix.

In [26]:
# Calculate accuracy, print out the Confusion Matrix
# Use `model` (fitted on the selected features in the last loop iteration), not `lr`.
y_pred = model.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))

Confusion Matrix: 
 [[768  36]
 [ 60 517]]


### 12. Unfortunately, ANOVA also couldn't give us a better score. Finally, let's apply PCA on the train data and see whether reducing the number of features gives a better model.

In [27]:
# Apply PCA and fit the logistic model on train data.
# X and y are still the df1 subset from step 6 (55 features).
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    pca = PCA(n_components=n)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    logistic = LogisticRegression(random_state=101)
    logistic.fit(X_train, y_train)
    print("For no of features=",n,", score=", logistic.score(X_test,y_test))
    
    if logistic.score(X_test,y_test)>high_score:
        high_score=logistic.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)


For no of features= 20 , score= 0.9036929761042722
For no of features= 25 , score= 0.9109341057204924
For no of features= 30 , score= 0.9138305575669804
For no of features= 35 , score= 0.9239681390296887
For no of features= 40 , score= 0.9188993482983345
For no of features= 50 , score= 0.9275887038377987
For no of features= 55 , score= 0.9312092686459088
High Score is: 0.9312092686459088 with features= 55
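
PCA picks directions of largest variance, so unscaled columns with large values can dominate the components. In the `describe()` output above, column 56 has a standard deviation of about 606, while the word-frequency columns shown stay below about 1.4. A hedged sketch of one common remedy, standardising before PCA and letting PCA choose the number of components from a variance target, on toy data (the data, the 0.95 target and the names are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy data: 6 columns driven by 2 latent factors, plus a little noise.
latent = rng.normal(size=(300, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))
X_demo[:, -1] *= 500  # one large-scale column, like a run-length total

# A float n_components asks PCA for the fewest components that
# explain at least that fraction of the variance.
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
Z = reducer.fit_transform(X_demo)

pca_step = reducer.named_steps["pca"]
cumulative = np.cumsum(pca_step.explained_variance_ratio_)
print(pca_step.n_components_, cumulative.round(3))
```

Whether scaling helps on the spam data is an empirical question; the point is only that the number of components can be tied to explained variance instead of a fixed list.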


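### 13. Find out the accuracy, print out the Confusion Matrix.

In [28]:
# Calculate accuracy, print out the Confusion Matrix.
# Predict with `logistic`, which was fitted on the PCA components. `lr` was fitted on
# the untransformed columns, so calling it on the PCA-transformed X_test is invalid.
y_pred = logistic.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))

The matrix recorded below came from the original `lr.predict(X_test)` call. It gets only 462 of 1381 test e-mails right (about 33%), which reflects the model/feature mismatch rather than the PCA model's 0.9312 accuracy:

Confusion Matrix: 
 [[124 680]
 [239 338]]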


### 14. You can also compare your predicted values and observed values by printing out values of logistic.predict(X_test[]) and  y_test[].values

In [29]:
# Compare observed value and Predicted value
print("Prediction for 10 observation:    ",logistic.predict(X_test[0:10]))
print("Actual values for 10 observation: ",y_test[0:10].values)

Prediction for 10 observation:     [0 0 0 1 0 1 0 0 0 0]
Actual values for 10 observation:  [0 0 0 1 0 1 0 0 0 0]
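
For more than a handful of rows, a side-by-side table with a match flag is easier to scan than two printed arrays. A small sketch using the ten values shown above (the DataFrame layout and names are illustrative):

```python
import numpy as np
import pandas as pd

# Values copied from the output above.
predicted = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
actual = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

comparison = pd.DataFrame({"predicted": predicted, "actual": actual})
comparison["match"] = comparison["predicted"] == comparison["actual"]

n_mismatches = int((~comparison["match"]).sum())
print(comparison)
print("mismatches:", n_mismatches)
```

In the notebook itself, `predicted` would come from `logistic.predict(X_test)` and `actual` from `y_test.values`, so the table can cover the whole test set.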
