### Build text classification models using scikit-learn
- Use TfidfVectorizer to transform input texts into a TF-IDF-encoded floating-point matrix
- Build a pipeline that includes both feature extraction and a classification model
- Build and train models
- Evaluate model performance
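
To make the first bullet concrete, here is a minimal sketch of what TfidfVectorizer produces. The corpus `toy_docs` is a hypothetical stand-in, not the BBC data used in this notebook, and the vectorizer options simply mirror the ones used in the pipelines further down.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption for illustration only).
toy_docs = [
    "the market rallied as shares rose",
    "the team won the match",
    "shares fell after the match report",
]

vectorizer = TfidfVectorizer(norm="l2", sublinear_tf=True, stop_words="english")
toy_matrix = vectorizer.fit_transform(toy_docs)  # SciPy sparse matrix of floats

print(toy_matrix.shape)               # (number of documents, vocabulary size)
print(toy_matrix.dtype)               # floating-point values
print(sorted(vectorizer.vocabulary_)) # learned terms, stop words removed

# With norm="l2", every non-empty document row has unit Euclidean length.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

Each row is one document and each column is one vocabulary term, so the matrix can be passed directly to a scikit-learn classifier.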

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn import metrics
import pickle

In [2]:
# ! ls -ltr

Read the preprocessed BBC news data into pandas

In [3]:
df = pd.read_csv('bbc-text.csv')
print(df.shape, df['category'].nunique())
df.head(2)

(2225, 2) 5


   category                                               text
0      tech  tv future in the hands of viewers with home th...
1  business  worldcom boss left books alone former worldc...


Split the dataset into train and test subsets. We train the models on the train subset and measure their accuracy on the test subset.
We also check the shapes of the resulting train and test sets.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=.2, stratify=df['category'], random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1780,) (445,) (1780,) (445,)
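
The `stratify` argument keeps the class proportions the same in both subsets. A small sketch on hypothetical toy labels (the `toy` frame below is an assumption for illustration, not the BBC data) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60% "sport", 40% "tech".
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(50)],
    "category": ["sport"] * 30 + ["tech"] * 20,
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["category"], random_state=42
)

# Both subsets should keep the 60/40 class ratio of the full data.
print(toy_train["category"].value_counts(normalize=True))
print(toy_test["category"].value_counts(normalize=True))
```

Without `stratify`, a random split of a small or imbalanced dataset can leave a class under-represented in the test subset, which makes per-class metrics less reliable.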


### Build and train model pipeline
- Build and train two classification models: [SGD](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) and [SVM](https://scikit-learn.org/stable/modules/svm.html)
- Report the metrics of each trained model
- Save the models so they can be used later for prediction

In [5]:
def sgd_pipeline():
    return Pipeline(
        [
            (
                "tfidf_vector_com",
                TfidfVectorizer(
                    input="content",  # valid values are "filename", "file", "content"
                    norm="l2",
                    max_features=None,
                    sublinear_tf=True,
                    stop_words="english",
                ),
            ),
            (
                "clf",
                SGDClassifier(
                    loss="log_loss",  # logistic regression; spelled "log" before scikit-learn 1.1
                    penalty="l2",
                    class_weight='balanced',
                    tol=0.001,
                ),
            ),
        ]
    )

def svc_pipleline():
    return Pipeline(
        [
            (
                "tfidf_vector_com",
                TfidfVectorizer(
                    input="content",  # valid values are "filename", "file", "content"
                    norm="l2",
                    max_features=None,
                    sublinear_tf=True,
                    stop_words="english",
                ),
            ),
            (
                "clf",
                SVC(
                    C=10,
                    kernel="rbf",
                    gamma=0.1,
                    probability=True,
                    class_weight=None,
                ),
            ),
        ]
    )

In [6]:
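def print_metrics(pred_test, y_test, pred_train, y_train):
    """Print test/train accuracy, the confusion matrix, and a per-class report."""
    print("test accuracy", str(np.mean(pred_test == y_test)))
    print("train accuracy", str(np.mean(pred_train == y_train)))
    # Note: this heading is hard-coded, so it also says "SVM" when reporting the SGD model.
    print("\n Metrics and Confusion for SVM \n")
    # Rows are true classes, columns are predicted classes (labels in sorted order).
    print(metrics.confusion_matrix(y_test, pred_test))
    print(metrics.classification_report(y_test, pred_test))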
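
To help read the confusion matrix that `print_metrics` prints, here is a small sketch on hypothetical labels (`y_true_toy` and `y_pred_toy` are made up for illustration). It assumes nothing beyond `sklearn.metrics`.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (assumption for illustration only).
y_true_toy = ["sport", "sport", "tech", "tech", "tech"]
y_pred_toy = ["sport", "tech", "tech", "tech", "sport"]
toy_labels = ["sport", "tech"]

# Row i is the true class, column j is the predicted class.
toy_cm = metrics.confusion_matrix(y_true_toy, y_pred_toy, labels=toy_labels)
print(toy_cm)

# Accuracy is the diagonal (correct predictions) divided by the total count.
toy_accuracy = toy_cm.trace() / toy_cm.sum()
print(toy_accuracy)
```

Off-diagonal entries show which classes get confused with each other. Here, one "sport" document was predicted as "tech" and one "tech" document was predicted as "sport".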

Train the SVC model and report its metrics

In [7]:
%%time
svc_pipe = svc_pipleline()
svc_pipe.fit(X_train, y_train)
pred_test = svc_pipe.predict(X_test)
pred_train = svc_pipe.predict(X_train)
print_metrics(pred_test, y_test, pred_train, y_train)

test accuracy 0.9842696629213483
train accuracy 1.0

 Metrics and Confusion for SVM 

[[100   0   2   0   0]
 [  0  77   0   0   0]
 [  1   1  81   0   1]
 [  0   0   0 102   0]
 [  1   1   0   0  78]]
               precision    recall  f1-score   support

     business       0.98      0.98      0.98       102
entertainment       0.97      1.00      0.99        77
     politics       0.98      0.96      0.97        84
        sport       1.00      1.00      1.00       102
         tech       0.99      0.97      0.98        80

     accuracy                           0.98       445
    macro avg       0.98      0.98      0.98       445
 weighted avg       0.98      0.98      0.98       445

CPU times: user 22.3 s, sys: 22.3 ms, total: 22.3 s
Wall time: 22.3 s


Train the SGD model and report its metrics

In [8]:
%%time
sgd_pipe = sgd_pipeline()
sgd_pipe.fit(X_train, y_train)
pred_test = sgd_pipe.predict(X_test)
pred_train = sgd_pipe.predict(X_train)
print_metrics(pred_test, y_test, pred_train, y_train)

test accuracy 0.9820224719101124
train accuracy 1.0

 Metrics and Confusion for SVM 

[[ 99   0   2   0   1]
 [  0  77   0   0   0]
 [  1   1  81   0   1]
 [  0   0   0 102   0]
 [  1   1   0   0  78]]
               precision    recall  f1-score   support

     business       0.98      0.97      0.98       102
entertainment       0.97      1.00      0.99        77
     politics       0.98      0.96      0.97        84
        sport       1.00      1.00      1.00       102
         tech       0.97      0.97      0.97        80

     accuracy                           0.98       445
    macro avg       0.98      0.98      0.98       445
 weighted avg       0.98      0.98      0.98       445

CPU times: user 2.38 s, sys: 14.3 ms, total: 2.39 s
Wall time: 819 ms


### Conclusion:

Using TF-IDF vectorization, we trained two models: SGD and SVM.
The SGD model is very efficient: fitting and scoring it took about 2.4 CPU seconds (under 1 second of wall time), and it achieved 98.2% test accuracy.
The SVM model took much longer, about 22 seconds, and its test accuracy is only marginally better at 98.4%.

Now we can save both models to pickle files so we can load them later for prediction.

In [None]:
with open('scikit_learn_sgd.pickle', 'wb') as f:
    pickle.dump(sgd_pipe, f)
    
with open('scikit_learn_svm.pickle', 'wb') as f:
    pickle.dump(svc_pipe, f)
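
Loading a saved pipeline back for prediction could look like the sketch below. It is self-contained, so it trains a small stand-in pipeline on hypothetical toy texts, with `LogisticRegression` swapped in for brevity, and writes to a temporary file (`toy_model.pickle` is a made-up name) rather than the files saved above.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the BBC articles.
toy_texts = [
    "the striker scored a late goal",
    "the club won the cup final",
    "new smartphone ships with faster chip",
    "software update fixes browser bug",
]
toy_targets = ["sport", "sport", "tech", "tech"]

# Stand-in pipeline, not the SGD or SVC pipelines defined in this notebook.
toy_pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_pipe.fit(toy_texts, toy_targets)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pickle")
    with open(path, "wb") as f:
        pickle.dump(toy_pipe, f)      # save, same pattern as above
    with open(path, "rb") as f:
        restored_pipe = pickle.load(f)  # load for later prediction

# The restored pipeline takes raw text: vectorization is part of the pickle.
new_texts = ["goal in the cup final", "browser software update"]
original_pred = toy_pipe.predict(new_texts)
restored_pred = restored_pipe.predict(new_texts)
print(restored_pred)
```

Because the vectorizer and classifier are pickled together, the loaded object accepts raw strings directly. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.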