## Code Along :: Feature Selection and Logistic Regression

## About the Dataset

The "spam" concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography... 

Our collection of spam e-mails came from our postmaster and individuals who had filed spam. Our collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter. 

-  Number of Instances: 4601 (1813 Spam = 39.4%)
-  Number of Attributes: 58 (57 continuous, 1 nominal class label)

-  Attribute Information:

    - The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0).

    - 48 attributes are continuous real [0,100] numbers of type `word freq WORD`, i.e. the percentage of words in the e-mail that match WORD.

    - 6 attributes are continuous real [0,100] numbers of type `char freq CHAR`, i.e. the percentage of characters in the e-mail that match CHAR.

    - 1 attribute is a continuous real [1,...] number of type `capital run length average`, i.e. the average length of uninterrupted sequences of capital letters.

    - 1 attribute is a continuous integer [1,...] number of type `capital run length longest`, i.e. the length of the longest uninterrupted sequence of capital letters.

    - 1 attribute is a continuous integer [1,...] number of type `capital run length total`, i.e. the sum of the lengths of uninterrupted sequences of capital letters in the e-mail.

    - 1 attribute is a nominal {0,1} class of type spam, i.e. it denotes whether the e-mail was considered spam (1) or not (0).

- Missing Attribute Values: None

- Class Distribution:
    - Spam: 1813 (39.4%)
    - Non-Spam: 2788 (60.6%)



You can read more about the dataset [here](https://archive.ics.uci.edu/ml/datasets/spambase).


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,confusion_matrix
import warnings
warnings.filterwarnings("ignore")

## Loading the dataset

In [2]:
# Load the Spam data for the mini challenge
# The target variable is column 57 (the last column): 1 = spam, 0 = non-spam
df = pd.read_csv('spambase.data.csv',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


### 1. Get an overview of your data by using info() and describe() functions of pandas.


In [90]:
# Overview of the data
# df.info()
df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
count,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,...,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0,4601.0
mean,0.104553,0.213015,0.280656,0.065425,0.312223,0.095901,0.114208,0.105295,0.090067,0.239413,...,0.038575,0.13903,0.016976,0.269071,0.075811,0.044238,5.191515,52.172789,283.289285,0.394045
std,0.305358,1.290575,0.504143,1.395151,0.672513,0.273824,0.391441,0.401071,0.278616,0.644755,...,0.243471,0.270355,0.109394,0.815672,0.245882,0.429342,31.729449,194.89131,606.347851,0.488698
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.588,6.0,35.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.065,0.0,0.0,0.0,0.0,2.276,15.0,95.0,0.0
75%,0.0,0.0,0.42,0.0,0.38,0.0,0.0,0.0,0.0,0.16,...,0.0,0.188,0.0,0.315,0.052,0.0,3.706,43.0,266.0,1.0
max,4.54,14.28,5.1,42.81,10.0,5.88,7.27,11.11,5.26,18.18,...,4.385,9.752,4.081,32.478,6.003,19.829,1102.5,9989.0,15841.0,1.0


In [4]:
df.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45    0
46    0
47    0
48    0
49    0
50    0
51    0
52    0
53    0
54    0
55    0
56    0
57    0
dtype: int64
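
With 58 columns, the per-column listing above is long to scan. A more compact check is sketched below on a small hypothetical frame (`toy` stands in for `df` and is not part of the notebook): it collapses the counts into one total and lists only the columns that contain missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: column 1 holds a single missing value
toy = pd.DataFrame({0: [0.1, 0.2, 0.3], 1: [1.0, np.nan, 3.0]})

# Total number of NaN cells across the whole frame
total_missing = int(toy.isna().sum().sum())

# Labels of the columns that contain at least one NaN
cols_with_missing = toy.columns[toy.isna().any()].tolist()

print(total_missing, cols_with_missing)
```

For the spam data, which documents no missing attribute values, the same two expressions applied to `df` should give a total of zero and an empty list.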

### 2. Split the data into train and test set and fit the base logistic regression model on train set.

In [5]:

# Divide the dataset into train and test sets and fit the base logistic model
X = df.iloc[:,:-1]
y = df.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
y.value_counts()/len(df)*100   # 0 is not spam and 1 is spam

0    60.595523
1    39.404477
Name: 57, dtype: float64
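
The same percentages can be obtained with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the class counts given in the dataset description (2788 non-spam, 1813 spam) instead of reading the file; `labels` is a hypothetical stand-in for `y`.

```python
import pandas as pd

# Stand-in for y, built from the documented class counts
labels = pd.Series([0] * 2788 + [1] * 1813)

# normalize=True returns class fractions; multiply by 100 for percentages
share = labels.value_counts(normalize=True) * 100

print(share.round(1))
```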

### 3. Find out the accuracy, print out the Classification report and Confusion Matrix.

In [13]:
y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]
confusion_matrix(y_true, y_pred)

array([[2, 0, 0],
       [0, 0, 1],
       [1, 0, 2]], dtype=int64)
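
To read this output: in scikit-learn's `confusion_matrix`, row `i` is the true label and column `j` is the predicted label, with labels in sorted order. The sketch below recomputes the toy matrix by hand to make that layout explicit; the name `manual` is introduced here for illustration only.

```python
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# manual[i][j] counts the samples whose true label is i and predicted label is j
manual = [
    [sum(t == i and p == j for t, p in zip(y_true, y_pred)) for j in range(3)]
    for i in range(3)
]

print(cm.tolist() == manual)
```

For example, the `1` in the bottom-left corner is the single sample with true label 2 that was predicted as 0.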

In [43]:
# Calculate the accuracy, print out the Classification report and Confusion Matrix.
print("Accuracy on test data:", lr.score(X_test,y_test))   # score() returns the mean accuracy
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))

print("Classification Report: \n",classification_report(y_test,y_pred))


Accuracy on test data: 0.9319333816075308
Confusion Matrix: 
 [[770  34]
 [ 60 517]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.94      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381
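
The numbers in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the spam-class (label 1) precision, recall, and F1 from the four counts printed above; the variable names are chosen here for illustration.

```python
import numpy as np

# Rows: actual class (0, 1); columns: predicted class (0, 1)
cm = np.array([[770, 34],
               [60, 517]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()          # share of all e-mails classified correctly
precision_spam = tp / (tp + fp)          # of e-mails predicted spam, share that is spam
recall_spam = tp / (tp + fn)             # of actual spam, share that was caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(round(accuracy, 4), round(precision_spam, 2),
      round(recall_spam, 2), round(f1_spam, 2))
```

The 60 false negatives are spam e-mails that reached the inbox, and the 34 false positives are legitimate e-mails flagged as spam; which error matters more depends on how the filter is used.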



In [42]:
# Calculate accuracy , print out the Classification report and Confusion Matrix.
# print("Accuracy on test data:", lr.score(X_test,y_test))
# y_pred = lr.predict(X_test)
a = list(y_test.iloc[15:20,])
b = list(y_pred[15:20])
print('Actual' + str(a))
print('Predic' + str(b))
print("Confusion Matrix: \n",confusion_matrix(a,b))

print("Classification Report: \n",classification_report(a,b))


Actual[0, 1, 1, 1, 0]
Predic[0, 1, 0, 1, 0]
Confusion Matrix: 
 [[2 0]
 [1 2]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.67      0.80         3

   micro avg       0.80      0.80      0.80         5
   macro avg       0.83      0.83      0.80         5
weighted avg       0.87      0.80      0.80         5



In [16]:
y_pred.sum()

551

In [23]:
a = y_pred.tolist()
print(a.count(1))
print(a.count(0))

551
830
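
These counts can be cross-checked against the confusion matrix of the base model: its column sums are the number of predictions per class, and its row sums are the number of actual samples per class. A small sketch using the matrix printed earlier:

```python
import numpy as np

# Confusion matrix of the base model (rows: actual, columns: predicted)
cm = np.array([[770, 34],
               [60, 517]])

predicted_counts = cm.sum(axis=0)   # predictions per class: non-spam, spam
actual_counts = cm.sum(axis=1)      # actual samples per class: non-spam, spam

print(predicted_counts, actual_counts)
```

The column sums reproduce the 830 non-spam and 551 spam predictions counted above, and the row sums match the `support` column of the classification report.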


### 4. Copy dataset df into df1 variable and apply correlation on df1

In [81]:
# Copy df in new variable df1
df1 = df.copy()

### 5. One of the assumptions of the logistic regression model is that the independent features are not highly correlated with each other (i.e. no multicollinearity). Find the feature pairs that have a correlation higher than 0.75 and drop one feature from each such pair so that this assumption is satisfied.

In [82]:
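a = np.array([[4, 3, 8], [8, 3, 4], [5, 7, 6]])
# print('matrix:\n', a)

# print(np.triu(a))  # a is a matrix
# Upper triangle above the diagonal, cast to booleans (use the builtin bool; the np.bool alias was removed in NumPy 1.24)
print(np.triu(a, 1).astype(bool))
# print(np.tril(a, -1))  # lower triangle below the diagonal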

[[False  True  True]
 [False False  True]
 [False False False]]


In [83]:
# Remove features with a pairwise correlation above 0.75, then apply the logistic model
corr_matrix = df1.drop(columns=57).corr().abs()
# np.triu(..., k=1) builds a boolean mask that is True strictly above the diagonal;
# where() keeps the correlations at those positions and sets the rest to NaN,
# so each feature pair is examined only once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# upper
to_drop = [column for column in upper.columns if any(upper[column] > 0.75)]

print("Columns to be dropped: ")
print(to_drop)
df1.drop(to_drop,axis=1,inplace=True)

Columns to be dropped: 
[33, 39]
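
The correlation filter can be wrapped in a small reusable function. The sketch below is one possible packaging, not part of the original notebook: `correlated_columns` and the toy frame are hypothetical, and the function flags the later column of each pair whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

def correlated_columns(frame, threshold=0.75):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # True strictly above the diagonal, so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy data: "b" is almost a multiple of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

dropped = correlated_columns(toy)
print(dropped)
```

Only the second member of a correlated pair is dropped, which is why the notebook removes columns 33 and 39 but keeps the earlier columns they correlate with.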


### 6. Split the new subset of the data acquired by feature selection into train and test set and fit the logistic regression model on train set.

In [89]:
# Split the new subset of data and fit the logistic model on training data
X = df1.iloc[:,:-1]
y = df1.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
lr = LogisticRegression(random_state=101)
lr.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### 7. Find out the accuracy, print out the Classification report and Confusion Matrix.

In [85]:
# Calculate accuracy , print out the Classification report and Confusion Matrix for new data
print("Accuracy on test data:", lr.score(X_test,y_test))
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))
print("=="*20)
print("Classification Report: \n",classification_report(y_test,y_pred))

Accuracy on test data: 0.9304851556842868
Confusion Matrix: 
 [[768  36]
 [ 60 517]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.96      0.94       804
           1       0.93      0.90      0.92       577

   micro avg       0.93      0.93      0.93      1381
   macro avg       0.93      0.93      0.93      1381
weighted avg       0.93      0.93      0.93      1381



### 8. After removing the highly correlated features, there is not much change in the score. Let's apply another feature selection technique (the Chi-Squared test) to see whether we can increase our score. Find the optimum number of features using Chi Square and fit the logistic model on train data.

In [91]:
# Apply Chi Square and fit the logistic model on train data
# Note: X and y still come from df1 (step 6), which has 55 feature columns, so k=55 keeps every feature
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=chi2 , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    
    model = LogisticRegression(random_state=101)
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))
    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

For no of features= 20 , score= 0.9029688631426502
For no of features= 25 , score= 0.9152787834902245
For no of features= 30 , score= 0.9131064446053584
For no of features= 35 , score= 0.9196234612599565
For no of features= 40 , score= 0.9254163649529327
For no of features= 50 , score= 0.9254163649529327
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55
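
The loop reports how many features were kept but not which ones. A fitted `SelectKBest` exposes this through `get_support()` and `scores_`. The sketch below uses hypothetical toy data (chi-squared scoring requires non-negative features) in which only column 2 depends on the label.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: four non-negative features, only column 2 is shifted by the label
rng = np.random.default_rng(0)
y_toy = np.arange(100) % 2
X_toy = rng.random((100, 4))
X_toy[:, 2] += 5 * y_toy

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

selected = selector.get_support(indices=True)   # positions of the kept columns
scores = selector.scores_                       # chi-squared statistic per column

print(selected, scores.round(2))
```

In the notebook, calling `test.get_support(indices=True)` after the loop would give the column positions kept in the last iteration.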


### 9. Find out the accuracy, print out the Confusion Matrix.

In [92]:
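# Calculate accuracy, print out the Confusion Matrix
# `model` is the last model fitted in the chi-squared loop (k=55); X_test holds the matching selected features
y_pred = model.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))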

Confusion Matrix: 
 [[768  36]
 [ 60 517]]


### 10. With the chi-squared test there is no change in the score, and the optimum number of features is 55. Now let's see if we can increase our score using another feature selection technique called ANOVA. Find the optimum number of features using ANOVA and fit the logistic model on train data.

In [98]:
# Apply ANOVA and fit the logistic model on train data
# Note: X and y still come from df1 (step 6), which has 55 feature columns, so k=55 keeps every feature
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    test = SelectKBest(score_func=f_classif , k= n )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state = 42)
    X_train = test.fit_transform(X_train,y_train)
    X_test = test.transform(X_test)
    model = LogisticRegression()
    model.fit(X_train,y_train)
    print("For no of features=",n,", score=", model.score(X_test,y_test))

    if model.score(X_test,y_test)>high_score:
        high_score=model.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)

# Calculate accuracy, print out the Confusion Matrix
# `model` is the last model fitted in the loop above (k=55)
y_pred = model.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))


For no of features= 20 , score= 0.889210716871832
For no of features= 25 , score= 0.9015206372194062
For no of features= 30 , score= 0.9145546705286025
For no of features= 35 , score= 0.9203475742215785
For no of features= 40 , score= 0.9217958001448225
For no of features= 50 , score= 0.9261404779145547
For no of features= 55 , score= 0.9304851556842868
High Score is: 0.9304851556842868 with features= 55
Confusion Matrix: 
 [[768  36]
 [ 60 517]]
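
`f_classif` can also be called directly to inspect the ANOVA F-statistic and p-value of each feature, which is what `SelectKBest` ranks by. The sketch below runs it on hypothetical toy data in which only column 1 differs between the two classes.

```python
import numpy as np
from sklearn.feature_selection import f_classif

# Toy data: four features, only column 1 has a class-dependent mean
rng = np.random.default_rng(0)
y_toy = np.arange(100) % 2
X_toy = rng.normal(size=(100, 4))
X_toy[:, 1] += 3 * y_toy

# One F-statistic and one p-value per feature
f_scores, p_values = f_classif(X_toy, y_toy)

# Feature positions ordered from most to least discriminative
ranking = np.argsort(f_scores)[::-1]

print(ranking, p_values.round(4))
```

Unlike `chi2`, `f_classif` accepts negative feature values, so it also applies to centered or standardized data.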


### 11. Find out the accuracy, print out the Confusion Matrix.

In [94]:
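# Calculate accuracy, print out the Confusion Matrix
# `model` is the last model fitted in the ANOVA loop (k=55); X_test holds the matching selected features
y_pred = model.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))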

Confusion Matrix: 
 [[768  36]
 [ 60 517]]


### 12. Unfortunately, ANOVA also couldn't give us a better score. Let's finally attempt PCA on train data and see whether it gives a better model by reducing the features.

In [95]:
# Apply PCA and fit the logistic model on train data (X and y still come from df1, step 6)
nof_list=[20,25,30,35,40,50,55]
high_score=0
nof=0

for n in nof_list:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state = 42)
    pca = PCA(n_components=n)
    pca.fit(X_train)
    X_train = pca.transform(X_train)
    X_test = pca.transform(X_test)
    logistic = LogisticRegression(solver = 'lbfgs')
    logistic.fit(X_train, y_train)
    print("For no of features=",n,", score=", logistic.score(X_test,y_test))
    
    if logistic.score(X_test,y_test)>high_score:
        high_score=logistic.score(X_test,y_test)
        nof=n 
print("High Score is:",high_score, "with features=",nof)


For no of features= 20 , score= 0.9007965242577842
For no of features= 25 , score= 0.8986241853729182
For no of features= 30 , score= 0.9102099927588704
For no of features= 35 , score= 0.9167270094134685
For no of features= 40 , score= 0.9188993482983345
For no of features= 50 , score= 0.9145546705286025
For no of features= 55 , score= 0.9174511223750905
High Score is: 0.9188993482983345 with features= 40
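
One possible reason PCA does not help here is feature scale. In the `describe()` output, the capital-run-length columns (54 to 56) have standard deviations of roughly 32, 195, and 606, while the word-frequency columns shown stay below 1.5. PCA ranks directions by variance, so unscaled large-range columns dominate the leading components. The sketch below illustrates the effect on hypothetical toy data and shows standardization as a common remedy; whether it improves the score on the spam data is not tested here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: three independent features, one on a much larger scale
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
X_toy[:, 2] *= 1000

# PCA on raw data: the large-scale column absorbs almost all the variance
raw_ratio = PCA().fit(X_toy).explained_variance_ratio_

# PCA after standardization: the variance is spread across components
scaled = make_pipeline(StandardScaler(), PCA()).fit(X_toy)
scaled_ratio = scaled.named_steps["pca"].explained_variance_ratio_

print(raw_ratio.round(4), scaled_ratio.round(4))
```

In the notebook, the scaler would have to be fitted on `X_train` only and then applied to `X_test`, in the same way the loop already treats `pca`.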


### 13. Find out the accuracy, print out the Confusion Matrix.

In [96]:
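# Calculate accuracy, print out the Confusion Matrix
# Caution: `lr` was fitted on the untransformed features in step 6, while X_test here is
# PCA-transformed, so this matrix does not describe the PCA model.
# Use logistic.predict(X_test) to evaluate the last PCA model fitted in the loop.
y_pred = lr.predict(X_test)
print("Confusion Matrix: \n",confusion_matrix(y_test,y_pred))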

Confusion Matrix: 
 [[124 680]
 [239 338]]


### 14. You can also compare your predicted values and observed values by printing out values of logistic.predict(X_test[]) and y_test[].values

In [97]:
# Compare observed value and Predicted value
print("Prediction for 10 observation:    ",logistic.predict(X_test[0:10]))
print("Actual values for 10 observation: ",y_test[0:10].values)

Prediction for 10 observation:     [0 0 0 1 0 1 0 0 0 1]
Actual values for 10 observation:  [0 0 0 1 0 1 0 0 0 0]
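
Comparing the two rows by eye does not scale beyond a handful of observations. The sketch below takes the ten values printed above and locates the disagreements programmatically; `predicted` and `actual` are names chosen here for illustration.

```python
import numpy as np

# The ten predictions and actual labels printed above
predicted = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 1])
actual = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])

# Positions where the model and the label disagree
mismatch_positions = np.flatnonzero(predicted != actual)

# Fraction of the ten observations predicted correctly
agreement = float((predicted == actual).mean())

print(mismatch_positions, agreement)
```

The single mismatch is the last observation, a non-spam e-mail predicted as spam, which makes it a false positive.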
