
## Build text classification models using scikit-learn
- Use TfidfVectorizer to transform input texts into a TF-IDF-encoded floating-point matrix
- Build a pipeline that includes both feature extraction and a classification model
- Build and train models
- Evaluate model performance
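
To make the first step concrete, here is a minimal sketch of what TfidfVectorizer produces. The three toy documents are hypothetical and are not taken from the BBC dataset; the point is only to show the shape and dtype of the resulting matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, used only for illustration
docs = [
    "shares in the bank rose",
    "the team won the match",
    "the bank cut interest rates",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, one row per document

n_docs, n_terms = X.shape
print(n_docs, n_terms, X.dtype)       # rows = documents, columns = vocabulary terms
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary (term -> column index)
```

Each row is one document and each column is one term from the learned vocabulary, which is why the vectorizer can feed a linear classifier directly inside a Pipeline.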


In [51]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn import metrics

In [52]:
df = pd.read_csv('kaggle_data/bbc-text.csv')
print(df.shape, df['category'].nunique())
df.head(2)

(2225, 2) 5


   category  text
0  tech      tv future in the hands of viewers with home th...
1  business  worldcom boss left books alone former worldc...


In [54]:
df['category'].value_counts()

sport            511
business         510
politics         417
tech             401
entertainment    386
Name: category, dtype: int64

In [55]:
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=.2, stratify=df['category'], random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(1780,) (445,) (1780,) (445,)


In [56]:
sgd = Pipeline(
        [
            (
                "tfidf_vector_com",
                TfidfVectorizer(),
            ),
            (
                "clf",
                SGDClassifier(),
            ),
        ]
    )

In [57]:
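def print_metrics(pred_test, y_test, pred_train, y_train):
    """Print test/train accuracy, then the test confusion matrix and per-class report."""
    print("test accuracy", str(np.mean(pred_test == y_test)))
    print("train accuracy", str(np.mean(pred_train == y_train)))
    # SGDClassifier's default hinge loss gives a linear SVM, hence the "SVM" label
    print("\n Metrics and Confusion for SVM \n")
    print(metrics.confusion_matrix(y_test, pred_test))
    print(metrics.classification_report(y_test, pred_test))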

In [59]:
%%time
sgd.fit(X_train, y_train)
pred_test = sgd.predict(X_test)
pred_train = sgd.predict(X_train)
print_metrics(pred_test, y_test, pred_train, y_train)

test accuracy 0.9820224719101124
train accuracy 1.0

 Metrics and Confusion for SVM 

[[100   1   1   0   0]
 [  0  76   0   0   1]
 [  1   1  81   0   1]
 [  0   0   0 102   0]
 [  1   1   0   0  78]]
               precision    recall  f1-score   support

     business       0.98      0.98      0.98       102
entertainment       0.96      0.99      0.97        77
     politics       0.99      0.96      0.98        84
        sport       1.00      1.00      1.00       102
         tech       0.97      0.97      0.97        80

     accuracy                           0.98       445
    macro avg       0.98      0.98      0.98       445
 weighted avg       0.98      0.98      0.98       445

CPU times: user 2.65 s, sys: 46.1 ms, total: 2.69 s
Wall time: 1.07 s
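
The precision, recall, and accuracy figures in the report can be read straight off the confusion matrix. As a rough sketch, the block below copies the matrix printed above (rows are true classes, columns are predicted classes, in the order scikit-learn sorts the labels) and recomputes the metrics with NumPy alone.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
classes = ["business", "entertainment", "politics", "sport", "tech"]
cm = np.array([
    [100, 1, 1, 0, 0],
    [0, 76, 0, 0, 1],
    [1, 1, 81, 0, 1],
    [0, 0, 0, 102, 0],
    [1, 1, 0, 0, 78],
])

correct = np.diag(cm)                 # correctly classified documents per class
recall = correct / cm.sum(axis=1)     # divide by row sums (true counts)
precision = correct / cm.sum(axis=0)  # divide by column sums (predicted counts)
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(classes, precision, recall):
    print(f"{name:<14} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

For example, politics has 81 correct predictions out of 84 true politics articles, which gives the recall of 0.96 shown in the report.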


# Understand model coefficients: what are the most important features/words for classification?

In [106]:
# the model pipeline
sgd

Pipeline(steps=[('tfidf_vector_com', TfidfVectorizer()),
                ('clf', SGDClassifier())])

In [107]:
# model classes
sgd.classes_

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype='<U13')

In [108]:
# the feature coefficients for business class
sgd['clf'].coef_[0]

array([-0.04044558,  0.37417501,  0.        , ...,  0.        ,
        0.        ,  0.        ])
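
The array above is one row of a larger matrix. With more than two classes, SGDClassifier fits one linear model per class (one-vs-all), so `coef_` has one row per entry in `classes_` and one column per vocabulary term. The sketch below checks this on a hypothetical toy corpus; the texts, labels, and the `toy` pipeline are illustrative assumptions, not part of the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with three classes
texts = [
    "bank shares profit market", "company sales bank profit",
    "match team goal win", "team coach match season",
    "software phone computer users", "internet software users computer",
]
labels = ["business", "business", "sport", "sport", "tech", "tech"]

toy = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])
toy.fit(texts, labels)

n_classes = len(toy.classes_)
n_features = len(toy["tfidf"].vocabulary_)
print(toy["clf"].coef_.shape, (n_classes, n_features))  # the two shapes should match
```

Row `i` of `coef_` therefore belongs to `classes_[i]`, which is why `coef_[0]` is the business row in the notebook's model.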

In [110]:
# top 10 positive feature coefficients for business class
top_coef_idx = sorted([(v,i) for (i, v) in enumerate(sgd['clf'].coef_[0])], reverse=True)[:10]
top_coef_idx

[(2.482477596699568, 8434),
 (2.3783413285203436, 13260),
 (2.338598659020302, 3128),
 (2.0065195482212057, 21702),
 (1.8021871092631958, 9882),
 (1.733046180786052, 5806),
 (1.62326722059601, 4489),
 (1.4838528375223712, 24535),
 (1.4690300566766994, 13107),
 (1.415096153103106, 20986)]

In [111]:
# the top indices 
top_idx = [e[1] for e in top_coef_idx]
top_idx

[8434, 13260, 3128, 21702, 9882, 5806, 4489, 24535, 13107, 20986]

In [112]:
# the top words for the business class (listed in vocabulary order, not ranked by coefficient)
[ word for (word,seq) in sgd['tfidf_vector_com'].vocabulary_.items() if seq in top_idx]

['business',
 'its',
 'economic',
 'bank',
 'trade',
 'firm',
 'sales',
 'company',
 'investment',
 'shares']

## Exercise:
The 10 most important words above are not ordered by importance (coefficient). I leave the following as an exercise for you to complete:
- Get the top N words, ordered by coefficient, for a given class
- Report the top words for every class category
- Also report the top N negative coefficients for each class; the words with the most negative coefficients in one class are likely to be top words in other class categories
- Format the report nicely