# Pipelines in Logistic Regression

Attribute Information:

1) Number of times pregnant

2) Plasma glucose concentration at 2 hours in an oral glucose tolerance test

3) Diastolic blood pressure (mm Hg)

4) Triceps skin fold thickness (mm)

5) 2-Hour serum insulin (mu U/ml)

6) Body mass index (weight in kg/(height in m)^2)

7) Diabetes pedigree function

8) Age (years)

9) Class variable (0 or 1)

In [1]:
#Importing the necessary libraries.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#Importing data from CSV.
df = pd.read_csv('diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
#Checking the info for the dataframe.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
#Using train_test_split to split the data into training and test sets: 25% test data and 75% training data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('Outcome',axis=1), 
                                                    df['Outcome'], test_size=0.25, 
                                                    random_state=101)

In [5]:
#Importing the logistic regression model.
from sklearn.linear_model import LogisticRegression

In [6]:
#Creating an instance of the model and fitting the training data into it to train our model.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [7]:
#Predicting on the test data to see how our model performs.
predictions = logmodel.predict(X_test)

In [8]:
#Importing and running the confusion matrix and classification report to check how our model has performed.
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test,predictions))
print('\n')
print(confusion_matrix(y_test,predictions))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85       124
           1       0.76      0.62      0.68        68

   micro avg       0.80      0.80      0.80       192
   macro avg       0.79      0.76      0.77       192
weighted avg       0.79      0.80      0.79       192



[[111  13]
 [ 26  42]]
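The figures in the classification report follow directly from the confusion matrix above. As a quick check, the sketch below recomputes precision, recall and accuracy for class 1 with NumPy. The variable names are illustrative only; the layout (rows are true classes, columns are predicted classes) is the one scikit-learn's `confusion_matrix` uses.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows are the true classes (0, 1), columns are the predicted classes (0, 1).
cm = np.array([[111, 13],
               [26, 42]])

tn, fp = cm[0]   # true class 0: predicted 0, predicted 1
fn, tp = cm[1]   # true class 1: predicted 0, predicted 1

precision_1 = tp / (tp + fp)     # of everything predicted as 1, the share that really is 1
recall_1 = tp / (tp + fn)        # of everything that really is 1, the share that was found
accuracy = (tp + tn) / cm.sum()  # share of all test rows predicted correctly

print(round(float(precision_1), 2), round(float(recall_1), 2), round(float(accuracy), 2))
```

The rounded values agree with the row for class 1 and with the overall average in the report.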


In [9]:
#Now to make the same model using pipelines.
#Importing the pipeline package.
from sklearn.pipeline import Pipeline

In [10]:
#Creating the pipeline. In this case it consists of only one step, which is the model itself.
pipeline = Pipeline([
    ('logmodel', LogisticRegression()),
])

In [11]:
#Fitting the training data into the pipeline.
pipeline.fit(X_train,y_train)



Pipeline(memory=None,
     steps=[('logmodel', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

In [12]:
#Creating predictions on the test data with the pipeline.
predictions = pipeline.predict(X_test)

In [13]:
#Checking how our model has performed using the pipeline model structure
print(classification_report(y_test,predictions))
print('\n')
print(confusion_matrix(y_test,predictions))

              precision    recall  f1-score   support

           0       0.81      0.90      0.85       124
           1       0.76      0.62      0.68        68

   micro avg       0.80      0.80      0.80       192
   macro avg       0.79      0.76      0.77       192
weighted avg       0.79      0.80      0.79       192



[[111  13]
 [ 26  42]]
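Both reports are identical because a pipeline whose only step is the estimator simply forwards `fit` and `predict` to that estimator. A minimal sketch of this equivalence follows. It uses synthetic data from `make_classification` as a stand-in for the diabetes CSV, and `max_iter=1000` is set only so the solver converges quietly; both are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 8 numeric features and a binary target.
X, y = make_classification(n_samples=400, n_features=8, random_state=101)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101)

# The bare estimator.
bare = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The same estimator wrapped as the only step of a pipeline.
wrapped = Pipeline([('logmodel', LogisticRegression(max_iter=1000))])
wrapped.fit(X_train, y_train)

# Both should give exactly the same predictions.
same = np.array_equal(bare.predict(X_test), wrapped.predict(X_test))
print(same)

# The fitted step is still reachable by the name given in the pipeline.
coef_shape = wrapped.named_steps['logmodel'].coef_.shape
print(coef_shape)
```

A pipeline starts to pay off once preprocessing steps are placed in front of the model, as in the KNN section further down.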


# Pipelines in Linear Regression

Attribute Information:

1) Avg. Area Income (dollars) 

2) Avg. Area House Age (years)

3) Avg. Area Number of Rooms 

4) Avg. Area Number of Bedrooms

5) Area Population

6) Price (dollars)

7) Address


In [14]:
#Importing data from CSV.
df1 = pd.read_csv('USA_Housing.csv')
df1.head()

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


In [15]:
#Checking the info for the dataframe.
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
Avg. Area Income                5000 non-null float64
Avg. Area House Age             5000 non-null float64
Avg. Area Number of Rooms       5000 non-null float64
Avg. Area Number of Bedrooms    5000 non-null float64
Area Population                 5000 non-null float64
Price                           5000 non-null float64
Address                         5000 non-null object
dtypes: float64(6), object(1)
memory usage: 273.5+ KB


In [16]:
#Using train_test_split to split the data into training and test sets: 25% test data and 75% training data.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(df1.drop(['Price','Address'],axis=1), 
                                                    df1['Price'], test_size=0.25, 
                                                    random_state=101)

In [17]:
#Creating an instance of the model and fitting the training data into it.
lm = LinearRegression()
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [18]:
#Predicting on the test data to see how our model performs.
predictions = lm.predict(X_test)

In [19]:
#Checking how our model has done.
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

MAE: 81877.46809001848
MSE: 10364756049.53774
RMSE: 101807.44594349542
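The three metrics are closely related: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which brings it back to the units of the target (dollars here). A small sketch with made-up numbers, not taken from the housing data:

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

print(mae, mse, rmse)

# The same relation holds for the figures reported above:
# the square root of the printed MSE gives the printed RMSE.
rmse_from_report = np.sqrt(10364756049.53774)
print(rmse_from_report)
```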


In [20]:
#Now to make the same model using pipelines.
#Creating the pipeline. In this case it consists of only one step, which is the model itself.
pipeline = Pipeline([
    ('linmodel', LinearRegression()),
])

In [21]:
#Fitting the training data into the pipeline.
pipeline.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('linmodel', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False))])

In [22]:
#Creating predictions on the test data with the pipeline.
predictions = pipeline.predict(X_test)

In [23]:
#Checking how our pipeline model has done.
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

MAE: 81877.46809001848
MSE: 10364756049.53774
RMSE: 101807.44594349542
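Wrapping the regressor in a pipeline does not hide the fitted model: the step can be fetched by name to inspect its coefficients. The sketch below assumes a noise-free synthetic target (`y = 3*x0 + 2*x1 + 1`) instead of the housing data, so the recovered values are easy to recognise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic, noise-free data: the true coefficients are 3 and 2, the intercept is 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0

pipeline = Pipeline([('linmodel', LinearRegression())])
pipeline.fit(X, y)

# Fetching the fitted step by the name it was given in the pipeline.
lin = pipeline.named_steps['linmodel']
print(lin.coef_, lin.intercept_)
```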


# Pipelines in KNN

Attribute Information:

1) WTT

2) PTI

3) EQW 

4) SBI

5) LQE

6) QWG

7) FDJ

8) PJF

9) HQE

10) NXJ

11) Class variable (0 or 1)

In [24]:
#Importing data from CSV.
df2 = pd.read_csv("Classified Data",index_col=0)
df2.head()

Unnamed: 0,WTT,PTI,EQW,SBI,LQE,QWG,FDJ,PJF,HQE,NXJ,TARGET CLASS
0,0.913917,1.162073,0.567946,0.755464,0.780862,0.352608,0.759697,0.643798,0.879422,1.231409,1
1,0.635632,1.003722,0.535342,0.825645,0.924109,0.64845,0.675334,1.013546,0.621552,1.492702,0
2,0.72136,1.201493,0.92199,0.855595,1.526629,0.720781,1.626351,1.154483,0.957877,1.285597,0
3,1.234204,1.386726,0.653046,0.825624,1.142504,0.875128,1.409708,1.380003,1.522692,1.153093,1
4,1.279491,0.94975,0.62728,0.668976,1.232537,0.703727,1.115596,0.646691,1.463812,1.419167,1


In [25]:
#Importing the standardscaler and creating an instance of it.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [26]:
#Fitting the data into the scaler and creating its scaled features.
scaler.fit(df2.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df2.drop('TARGET CLASS',axis=1))

In [27]:
#Creating a dataframe without the target class to run operations using its data.
df_feat = pd.DataFrame(scaled_features,columns=df2.columns[:-1])
df_feat.head()

Unnamed: 0,WTT,PTI,EQW,SBI,LQE,QWG,FDJ,PJF,HQE,NXJ
0,-0.123542,0.185907,-0.913431,0.319629,-1.033637,-2.308375,-0.798951,-1.482368,-0.949719,-0.643314
1,-1.084836,-0.430348,-1.025313,0.625388,-0.444847,-1.152706,-1.129797,-0.20224,-1.828051,0.636759
2,-0.788702,0.339318,0.301511,0.755873,2.031693,-0.870156,2.599818,0.285707,-0.682494,-0.37785
3,0.982841,1.060193,-0.621399,0.625299,0.45282,-0.26722,1.750208,1.066491,1.241325,-1.026987
4,1.139275,-0.640392,-0.709819,-0.057175,0.822886,-0.936773,0.596782,-1.472352,1.040772,0.27651


In [28]:
#Splitting the data into training and test data. 
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df2['TARGET CLASS'],
                                                    test_size=0.30,random_state=101)

In [29]:
#Importing the KNN package, creating an instance of it and fitting the training data into it.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

In [30]:
#Creating predictions on the test data.
pred = knn.predict(X_test)

In [31]:
#Checking how our model has performed without a pipeline.
#This is the baseline with K=1 that the pipeline version is compared against.
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

WITH K=1


[[151   8]
 [ 15 126]]


              precision    recall  f1-score   support

           0       0.91      0.95      0.93       159
           1       0.94      0.89      0.92       141

   micro avg       0.92      0.92      0.92       300
   macro avg       0.92      0.92      0.92       300
weighted avg       0.92      0.92      0.92       300



In [32]:
#Now to make the same model using pipelines.
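#Creating the pipeline. In this case it consists of a StandardScaler step followed by the model.
#Note that X_train was already scaled above, so the pipeline scales it a second time, this time using only the training rows.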
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('KNN',KNeighborsClassifier(n_neighbors=1)),
])

In [33]:
#Fitting the training data into the pipeline.
pipeline.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('KNN', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform'))])

In [34]:
#Creating predictions on the test data with the pipeline.
predictions = pipeline.predict(X_test)

In [35]:
#Checking how our model has performed using the pipeline model structure.
#Compare this with the K=1 result from the original model above.
print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))

WITH K=1


[[151   8]
 [ 14 127]]


              precision    recall  f1-score   support

           0       0.92      0.95      0.93       159
           1       0.94      0.90      0.92       141

   micro avg       0.93      0.93      0.93       300
   macro avg       0.93      0.93      0.93       300
weighted avg       0.93      0.93      0.93       300
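The pipeline's result differs by one prediction from the manual version. A likely reason is that the features were standardised twice: once by hand with statistics from the whole data set, and once more inside the pipeline with statistics from the training rows only. A more common arrangement is to split the raw features first and let the pipeline do the only scaling, so the scaler never sees the test rows. The sketch below shows that arrangement on synthetic data (an assumption, since the "Classified Data" file is not available here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 10 numeric features and a binary target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=101)

# Split the raw, unscaled features first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('KNN', KNeighborsClassifier(n_neighbors=1)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the pipeline learned its means from the training rows only.
scaler_mean = pipeline.named_steps['scale'].mean_
print(np.allclose(scaler_mean, X_train.mean(axis=0)))

# predict and score apply the same fitted scaling to the test rows automatically.
test_predictions = pipeline.predict(X_test)
print(pipeline.score(X_test, y_test))
```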



# Pipelines in SVM

Attribute Information:

       mean radius
       mean texture 
       mean perimeter
       mean area,
       mean smoothness
       mean compactness
       mean concavity
       mean concave points 
       mean symmetry
       mean fractal dimension
       radius error 
       texture error
       perimeter error
       area error
       smoothness error
       compactness error
       concavity error
       concave points error
       symmetry error
       fractal dimension error 
       worst radius 
       worst texture
       worst perimeter 
       worst area 
       worst smoothness
       worst compactness 
       worst concavity 
       worst concave points
       worst symmetry 
       worst fractal dimension

In [36]:
#Importing the dataset from saved data.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [37]:
#Since it's a dictionary-like object, we check its keys.
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [38]:
#Creating a dataframe using data key and its values.
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df_feat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 

In [39]:
#Creating another dataframe for target column which consists of ones and zeroes.
df_target = pd.DataFrame(cancer['target'],columns=['Cancer'])

In [40]:
#Checking the first dataframe.
df_feat.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [41]:
#Splitting the data into training data and test data.
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)

In [42]:
#Importing the SVC package and creating an instance of it.
from sklearn.svm import SVC
model = SVC()

In [43]:
#Creating a param_grid dictionary for the GridSearchCV.
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 

In [44]:
#Importing the GridSearchCV and creating an instance of it.
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)

In [45]:
#Fitting the data into the model.
grid.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ................................

[CV]  C=100, gamma=0.0001, kernel=rbf, score=0.9172932330827067, total=   0.0s
[CV] C=100, gamma=0.0001, kernel=rbf .................................
[CV]  C=100, gamma=0.0001, kernel=rbf, score=0.9774436090225563, total=   0.0s
[CV] C=100, gamma=0.0001, kernel=rbf .................................
[CV]  C=100, gamma=0.0001, kernel=rbf, score=0.9393939393939394, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV]  C=1000, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV]  C=1000, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV]  C=1000, gamma=1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    1.5s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [46]:
#Predicting off test data.
grid_predictions = grid.predict(X_test)

In [47]:
#Seeing how our model has done.
print(confusion_matrix(y_test,grid_predictions))
print('\n')
print(classification_report(y_test,grid_predictions))

[[ 60   6]
 [  3 102]]


              precision    recall  f1-score   support

           0       0.95      0.91      0.93        66
           1       0.94      0.97      0.96       105

   micro avg       0.95      0.95      0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171



In [48]:
#Now to make the same model using pipelines.
#Creating the pipeline. In this case its only step is the GridSearchCV object.
pipeline = Pipeline([
    ('CV',GridSearchCV(SVC(),param_grid,refit=True,verbose=3)),
])

In [49]:
#Fitting the training data into the pipeline.
pipeline.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s


Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=1, kernel=rbf ......................................
[CV]  C=0.1, gamma=1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=0.1, gamma=0.1, kernel=rbf ....................................
[CV]  C=0.1, gamma=0.1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=0.1, gamma=0.01, kernel=rbf ................................

[CV]  C=1000, gamma=1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=1, kernel=rbf .....................................
[CV]  C=1000, gamma=1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=0.1, kernel=rbf ...................................
[CV]  C=1000, gamma=0.1, kernel=rbf, score=0.6363636363636364, total=   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV]  C=1000, gamma=0.01, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ..................................
[CV]  C=1000, gamma=0.01, kernel=rbf, score=0.631578947368421, total=   0.0s
[CV] C=1000, gamma=0.01, kernel=rbf ......

[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:    1.8s finished


Pipeline(memory=None,
     steps=[('CV', GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, to...   pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3))])

In [50]:
#Creating predictions on the test data with the pipeline.
predictions = pipeline.predict(X_test)

In [51]:
#Seeing how our pipeline model has done, using the predictions made by the pipeline.
print(confusion_matrix(y_test,predictions))
print('\n')
print(classification_report(y_test,predictions))

[[ 60   6]
 [  3 102]]


              precision    recall  f1-score   support

           0       0.95      0.91      0.93        66
           1       0.94      0.97      0.96       105

   micro avg       0.95      0.95      0.95       171
   macro avg       0.95      0.94      0.94       171
weighted avg       0.95      0.95      0.95       171
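Putting `GridSearchCV` inside a pipeline works, but the more usual arrangement is the reverse: the pipeline is handed to `GridSearchCV`, and the parameters of a step are addressed as `<step name>__<parameter name>`. The sketch below illustrates this on the same breast cancer data. The `StandardScaler` step, the step names `scale` and `svc`, the smaller grid and `cv=3` are all assumptions added here, not part of the notebook above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

# Scaling is refitted inside every cross-validation fold, then the SVC is trained.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('svc', SVC(kernel='rbf')),
])

# Parameters of a step are written as <step name>__<parameter name>.
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': [0.1, 0.01, 0.001],
}

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
test_accuracy = grid.score(X_test, y_test)
print(test_accuracy)
```

Because the search wraps the whole pipeline, each candidate is evaluated with preprocessing fitted on the training folds only.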



# Pipelines in NLP

Attribute Information:
Messages are classified on the basis of ham or spam.

In [52]:
#Importing the nltk library.
import nltk

In [53]:
#Importing the data.
with open('SMSSpamCollection') as f:
    messages = [line.rstrip() for line in f]

In [54]:
#Printing the first few messages in a readable form.
for message_no, message in enumerate(messages[:3]):
    print(message_no, message)
    print('\n')

0 ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


1 ham	Ok lar... Joking wif u oni...


2 spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's




In [66]:
#Getting the data in the form of a dataframe.
messages = pd.read_csv('SMSSpamCollection', sep='\t',
                           names=["label", "message"])
messages.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [67]:
#Adding another column to help check the length of message.
messages['length'] = messages['message'].apply(len)
messages.head()

Unnamed: 0,label,message,length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61


In [68]:
#Importing the stopwords and seeing the first ten.
from nltk.corpus import stopwords
stopwords.words('english')[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [69]:
#Creating a function to tokenise the message.
import string
def text_process(mess):
    
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns a list of the cleaned text
    """
    nopunc = [char for char in mess if char not in string.punctuation]

    nopunc = ''.join(nopunc)
    
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
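To see what this kind of function returns, here is a self-contained variant that can be run without downloading the NLTK stopword corpus. The name `text_process_demo` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins for `text_process` and `stopwords.words('english')`; the real list is much longer.

```python
import string

# Hypothetical, very small stand-in for the NLTK English stopword list.
DEMO_STOPWORDS = {'i', 'me', 'my', 'so', 'then', 'a', 'the', 'in', 'to'}

def text_process_demo(mess, stop_words=DEMO_STOPWORDS):
    """Remove punctuation, then drop stopwords, and return the remaining words."""
    # Keeping every character that is not punctuation.
    nopunc = ''.join(char for char in mess if char not in string.punctuation)
    # Splitting on whitespace and filtering stopwords case-insensitively.
    return [word for word in nopunc.split() if word.lower() not in stop_words]

tokens = text_process_demo('U dun say so early hor... U c already then say...')
print(tokens)
```

Looking the stopwords up in a set that is built once is also noticeably faster than calling `stopwords.words('english')` for every word, which matters when the function is applied to thousands of messages.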

In [70]:
#We have the messages as lists of tokens and we need to convert each of those messages into a vector.
from sklearn.feature_extraction.text import CountVectorizer

In [71]:
#Creating an instance of the CountVectorizer.
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])

In [72]:
#Printing the size of the vocabulary, i.e. the number of unique words across all the messages.
print(len(bow_transformer.vocabulary_))

11425


In [73]:
#Taking a single message as a sample so we can check how the transformer handles it.
message4 = messages['message'][3]
print(message4)

U dun say so early hor... U c already then say...


In [74]:
#Putting the message into the transformer to vectorise it.
bow4 = bow_transformer.transform([message4])
print(bow4)
print(bow4.shape)

  (0, 4068)	2
  (0, 4629)	1
  (0, 5261)	1
  (0, 6204)	1
  (0, 6222)	1
  (0, 7186)	1
  (0, 9554)	2
(1, 11425)


In [75]:
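#Checking which words the two indices that appear twice correspond to.
#(On scikit-learn 1.0 and later, get_feature_names_out() replaces get_feature_names().)
print(bow_transformer.get_feature_names()[4068])
print(bow_transformer.get_feature_names()[9554])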

U
say


In [76]:
#Now we will vectorise the whole messages column.
messages_bow = bow_transformer.transform(messages['message'])

In [77]:
#Printing out the shape of the sparse matrix and the number of non-zero occurrences.
print('Shape of Sparse Matrix: ', messages_bow.shape)
print('Amount of Non-Zero occurences: ', messages_bow.nnz)

Shape of Sparse Matrix:  (5572, 11425)
Amount of Non-Zero occurences:  50548


In [78]:
#Now to compute the term frequency and inverse document frequency we import TfidfTransformer.
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(messages_bow)
tfidf4 = tfidf_transformer.transform(bow4)
print(tfidf4)

  (0, 9554)	0.5385626262927564
  (0, 7186)	0.4389365653379857
  (0, 6222)	0.3187216892949149
  (0, 6204)	0.29953799723697416
  (0, 5261)	0.29729957405868723
  (0, 4629)	0.26619801906087187
  (0, 4068)	0.40832589933384067


In [79]:
#Checking the idf of any two words.
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['u']])
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['university']])

3.2800524267409408
8.527076498901426
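With its default settings (`smooth_idf=True`), `TfidfTransformer` computes the inverse document frequency of a term as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. The sketch below checks that formula against scikit-learn on a made-up count matrix. The document frequency of 2 used for 'university' at the end is inferred from the printed value, not something the notebook states.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 4 documents (rows) by 3 terms (columns); made-up numbers.
counts = np.array([[1, 0, 2],
                   [1, 1, 0],
                   [1, 0, 0],
                   [1, 0, 1]])

tfidf = TfidfTransformer().fit(counts)

# Default formula with smoothing: idf = ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(np.allclose(manual_idf, tfidf.idf_))

# With n = 5572 messages, an assumed document frequency of 2
# reproduces the idf printed above for 'university'.
idf_university = np.log((1 + 5572) / (1 + 2)) + 1
print(idf_university)
```

Rare words therefore get a high idf and common words such as 'u' get a low one.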


In [80]:
#Transforming the whole bag-of-words matrix with the tfidf transformer and then checking its shape.
messages_tfidf = tfidf_transformer.transform(messages_bow)
print(messages_tfidf.shape)

(5572, 11425)


In [81]:
#Now to train our model we use the Naive Bayes model and fit it on the output of the tfidf transformer.
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])

In [82]:
#Predicting the labels for all the messages the model was trained on.
all_predictions = spam_detect_model.predict(messages_tfidf)
print(all_predictions)

['ham' 'ham' 'spam' ... 'ham' 'ham' 'ham']


In [83]:
#Checking how our model has done. This is measured on the same data the model was trained on, so it is an optimistic estimate.
from sklearn.metrics import classification_report
print (classification_report(messages['label'], all_predictions))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      4825
        spam       1.00      0.85      0.92       747

   micro avg       0.98      0.98      0.98      5572
   macro avg       0.99      0.92      0.95      5572
weighted avg       0.98      0.98      0.98      5572



In [84]:
#Now to do the same with pipelines, we first split the data into training and test data, where message plays the role of X and label the role of y.
from sklearn.model_selection import train_test_split

msg_train, msg_test, label_train, label_test = \
train_test_split(messages['message'], messages['label'], test_size=0.2)

print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))

4457 1115 5572
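The split above has no `random_state`, so it changes on every run, and since spam is the minority class its share in the test set can vary. As a hedged sketch, `train_test_split` accepts `random_state` for reproducibility and `stratify` to keep the class proportions the same in both parts. The labels below are made up (80 ham, 20 spam) so that the effect is easy to verify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 80 ham and 20 spam.
labels = np.array(['ham'] * 80 + ['spam'] * 20)
texts = np.array(['message %d' % i for i in range(100)])

# stratify keeps the ham/spam ratio equal in the training and test parts;
# random_state makes the split repeatable.
msg_train, msg_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.2, random_state=101, stratify=labels)

spam_in_test = int((label_test == 'spam').sum())
print(len(msg_train), len(msg_test), spam_in_test)
```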


In [85]:
#Creating the pipeline with all the steps: the CountVectorizer, the TfidfTransformer and the model, which is Naive Bayes.
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])

In [86]:
#Fitting the training data into the pipeline.
pipeline.fit(msg_train,label_train)

Pipeline(memory=None,
     steps=[('bow', CountVectorizer(analyzer=<function text_process at 0x0000020963F741E0>,
        binary=False, decode_error='strict', dtype=<class 'numpy.int64'>,
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
        max_features=None, min_df=1, ngram_range=(1, 1), preprocesso...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [87]:
#Creating the predictions using test data.
predictions = pipeline.predict(msg_test)

In [88]:
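#Checking how our model has performed.
#Note: classification_report expects (y_true, y_pred). With the arguments in the order used here,
#the precision and recall columns below are swapped and support counts the predicted labels.
print(classification_report(predictions,label_test))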

              precision    recall  f1-score   support

         ham       1.00      0.96      0.98      1011
        spam       0.73      1.00      0.84       104

   micro avg       0.97      0.97      0.97      1115
   macro avg       0.86      0.98      0.91      1115
weighted avg       0.97      0.97      0.97      1115
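`classification_report` takes the true labels first and the predictions second. Passing them the other way round exchanges precision and recall for each class. The sketch below shows the effect on made-up labels, using `precision_score` and `recall_score` so the individual numbers are easy to follow.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only: one spam message is missed.
y_true = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'spam']
y_pred = ['ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam']

# Correct order: true labels first, predictions second.
precision = precision_score(y_true, y_pred, pos_label='spam')
recall = recall_score(y_true, y_pred, pos_label='spam')

# Swapped order: what is reported as "precision" is really the recall.
swapped_precision = precision_score(y_pred, y_true, pos_label='spam')

print(precision, recall, swapped_precision)
```

Read with that in mind, a spam row showing precision 0.73 and recall 1.00 from a swapped call corresponds to a spam precision of 1.00 and a spam recall of 0.73.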

