# Practical 3: Feature Selection & Dimension Reduction
<img src="img/uu_logo.png" alt="logo" align="right" title="UU" width="50" height="20" />

#### Applied Text Mining - Utrecht Summer School

In this practical, we are going to learn about feature selection and dimension reduction methods for text data.

### Let's get started!

1\. **Here we are going to use a news article data set originating from BBC News (http://mlg.ucd.ie/datasets/bbc.html). This data set is provided for use as a benchmark for machine learning research. The BBC data set consists of 2225 documents in 5 categories: business, entertainment, politics, sport, and tech. Your first task is to load the data set so that you can proceed. Do not forget to import the dependencies you are going to need; you can import the others as you go along.**

In [27]:
from sklearn.datasets import load_files
import pandas as pd
import numpy as np

# for reproducibility
random_state = 321 

DATA_DIR = "data/bbc"
data = load_files(DATA_DIR, encoding="utf-8", decode_error="replace", random_state=random_state)
df = pd.DataFrame(list(zip(data['data'], data['target'])), columns=['text', 'label'])
df.head()

| | text | label |
|---|---|---|
| 0 | Chris Evans back on the market\n\nBroadcaster ... | 1 |
| 1 | Giggs handed Wales leading role\n\nRyan Giggs ... | 3 |
| 2 | Wales silent on Grand Slam talk\n\nRhys Willia... | 3 |
| 3 | Kenya lift Chepkemei's suspension\n\nKenya's a... | 3 |
| 4 | Lee to create new film superhero\n\nComic book... | 1 |


Here you can check the number of articles in each category:

In [28]:
labels, counts = np.unique(data.target, return_counts=True)
# convert data.target_names to np array for fancy indexing
labels_str = np.array(data.target_names)[labels]
print(dict(zip(labels_str, counts)))

{'business': 510, 'entertainment': 386, 'politics': 417, 'sport': 511, 'tech': 401}


In [153]:
labels_str

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype='<U13')

In [130]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df["text"], df["label"], test_size=0.2, random_state=random_state)

2\. **Use the CountVectorizer from sklearn and convert the text data into a document-term matrix. What is the difference between CountVectorizer and TfidfVectorizer(use_idf=False)?**

In [63]:
from sklearn.feature_extraction.text import CountVectorizer # for bag of words feature extraction

from nltk.tokenize import RegexpTokenizer
# tokenizer to remove unwanted elements such as symbols from our data
# (defined for reference only: the vectorizer below is created with tokenizer=None)
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

# Initialize the "CountVectorizer" object, which is scikit-learn's bag of words tool.
# If you have memory issues, reduce the max_features value so you can continue with the practical
vectorizer = CountVectorizer(lowercase=True,
                             tokenizer=None,
                             stop_words='english',
                             ngram_range=(1, 2),
                             analyzer='word',
                             min_df=3,
                             max_features=None)

# fit_transform() does two things: first, it fits the model and learns the vocabulary;
# second, it transforms our data into feature vectors.
# The input to fit_transform should be an iterable of strings.
bbc_dtm = vectorizer.fit_transform(X_train)
print(bbc_dtm.shape)

(1780, 23908)


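Both vectorizers build the same vocabulary and start from the same term counts. CountVectorizer() returns the raw counts as integers. TfidfVectorizer(use_idf=False) skips the inverse-document-frequency weighting but, with its default norm='l2', still rescales every document vector to unit length, so it returns normalized term frequencies as floats. Only if you also pass norm=None does it reproduce the counts of CountVectorizer(), stored as floats instead of ints.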
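
The comparison can be checked on a toy corpus. The sketch below uses two made-up sentences (not taken from the BBC data) and illustrative variable names; it compares the output of CountVectorizer with TfidfVectorizer(use_idf=False), with and without the default L2 normalization.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration (not part of the BBC data)
docs = ["the cat sat on the mat", "the dog chased the cat"]

counts = CountVectorizer().fit_transform(docs)

# use_idf=False alone still applies the default L2 row normalization
tf_l2 = TfidfVectorizer(use_idf=False).fit_transform(docs)

# Switching the normalization off as well gives back the raw counts, as floats
tf_raw = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)

print("dtypes:", counts.dtype, tf_l2.dtype)
print("same values as counts:", np.allclose(counts.toarray(), tf_raw.toarray()))
print("row norms with norm='l2':", np.linalg.norm(tf_l2.toarray(), axis=1))
```

In this sketch the row norms of the L2-normalized matrix are all 1, which is why its values differ from the counts even though no IDF weighting is applied.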

3\. **Print the 20 most frequent words in the training set.**

In [65]:
importance = np.argsort(np.asarray(bbc_dtm.sum(axis=0)).ravel())[::-1]
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = np.array(vectorizer.get_feature_names_out())
feature_names[importance[:20]]

array(['said', 'mr', 'year', 'people', 'new', 'time', 'world',
       'government', 'uk', 'years', 'best', 'just', 'told', 'film',
       'make', 'game', 'like', 'music', 'labour', '000'], dtype='<U27')

You can also sort the counts based on a document:

In [68]:
counts = pd.DataFrame(bbc_dtm.toarray(),
                      columns=vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0

# Show us the top 10 most common words in document 2
counts.T.sort_values(by=2, ascending=False).head(10)

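| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1770 | 1771 | 1772 | 1773 | 1774 | 1775 | 1776 | 1777 | 1778 | 1779 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| retailers | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| figures | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| sales | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| retail | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| december | 0 | 0 | 5 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| christmas | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ons | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| worst | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| said | 6 | 3 | 4 | 2 | 0 | 8 | 2 | 2 | 0 | 4 | ... | 3 | 6 | 0 | 5 | 5 | 3 | 2 | 4 | 1 | 1 |
| bank england | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |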


### Feature selection

4\. **From the feature selection library in sklearn, load the SelectKBest function and apply it to the BBC dataset using the chi-squared method. Extract the top 20 features.**

In [72]:
X_test_vectorized = vectorizer.transform(X_test)


In [78]:
from sklearn.feature_selection import SelectKBest, chi2
ch2 = SelectKBest(chi2, k=20)
ch2.fit_transform(bbc_dtm, y_train)

In [79]:
feature_names_chi = [feature_names[i] for i
                         in ch2.get_support(indices=True)]

In [80]:
feature_names_chi

['best',
 'blair',
 'brown',
 'computer',
 'digital',
 'election',
 'film',
 'government',
 'labour',
 'minister',
 'mobile',
 'mr',
 'mr blair',
 'music',
 'net',
 'party',
 'people',
 'software',
 'technology',
 'users']

5\. **Extract the 20 top features according to the mutual information method. Do you get the same list of words as compared with the chi-squared method?**

In [83]:
from sklearn.feature_selection import mutual_info_classif
mutual_info = SelectKBest(mutual_info_classif, k=20)
mutual_info.fit_transform(bbc_dtm, y_train)

<1780x20 sparse matrix of type '<class 'numpy.int64'>'
	with 6350 stored elements in Compressed Sparse Row format>

In [85]:
feature_names_mutual_info = [feature_names[i] for i
                         in mutual_info.get_support(indices=True)]
feature_names_mutual_info

['blair',
 'coach',
 'election',
 'film',
 'firm',
 'game',
 'government',
 'labour',
 'market',
 'minister',
 'mr',
 'music',
 'party',
 'people',
 'said',
 'secretary',
 'technology',
 'tory',
 'users',
 'win']
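
One way to answer the question is to compare the two lists as sets. The sketch below copies the feature names from the two outputs above; the variable names are illustrative.

```python
# Feature names copied from the chi-squared and mutual information outputs above
chi2_features = ['best', 'blair', 'brown', 'computer', 'digital', 'election',
                 'film', 'government', 'labour', 'minister', 'mobile', 'mr',
                 'mr blair', 'music', 'net', 'party', 'people', 'software',
                 'technology', 'users']
mi_features = ['blair', 'coach', 'election', 'film', 'firm', 'game',
               'government', 'labour', 'market', 'minister', 'mr', 'music',
               'party', 'people', 'said', 'secretary', 'technology', 'tory',
               'users', 'win']

shared = sorted(set(chi2_features) & set(mi_features))
only_chi2 = sorted(set(chi2_features) - set(mi_features))
only_mi = sorted(set(mi_features) - set(chi2_features))

print(len(shared), "shared:", shared)
print(len(only_chi2), "only chi-squared:", only_chi2)
print(len(only_mi), "only mutual information:", only_mi)
```

The two methods agree on 12 of the 20 terms, so the lists overlap substantially but are not the same.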

Now you can build a classifier and train it using the output of these feature selection techniques. We are not going to do this right now, but if you are interested you can transform your training and test set using the selected features and continue with your classifier! Here are some tips:

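In [155]:
# store the selected features under new names so that the raw-text splits
# X_train and X_test stay available for the pipelines later in this practical
X_train_selected = mutual_info.fit_transform(bbc_dtm, y_train)
X_test_selected = mutual_info.transform(X_test_vectorized)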
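
As a rough illustration of the whole workflow, the sketch below fits a feature selector on training data only, reuses it on test data, and trains a naive Bayes classifier on the result. The corpus, labels, and the choices of chi-squared with k=5 and MultinomialNB are assumptions made for the example, not part of the practical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration (0 = sport, 1 = tech)
train_docs = [
    "the team won the match",
    "the coach praised the team",
    "the match ended in a draw",
    "new software for the phone",
    "the phone runs new software",
    "software update for users",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ["the team lost the match", "new phone software"]

vec = CountVectorizer()
train_dtm = vec.fit_transform(train_docs)   # learn the vocabulary on the training set
test_dtm = vec.transform(test_docs)         # reuse it for the test set

selector = SelectKBest(chi2, k=5)
train_selected = selector.fit_transform(train_dtm, train_labels)  # scores from training data only
test_selected = selector.transform(test_dtm)                      # same columns for the test set

nb = MultinomialNB().fit(train_selected, train_labels)
predictions = nb.predict(test_selected)
print(train_selected.shape, test_selected.shape, predictions)
```

The important design point is that both the vectorizer and the selector are fitted on the training set and only applied to the test set, so no information from the test labels leaks into the selection.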

### Embedded feature selection

6\. **One of the functions for embedded feature selection is the SelectFromModel function in sklearn. Use this function with an L1 norm SVM and check how many non-zero coefficients are left in the model.**

In [128]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

print("shape of the matrix before applying the embedded feature selection:", bbc_dtm.shape)

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
model = SelectFromModel(lsvc).fit(bbc_dtm, y_train) # you can add threshold=0.18 as another argument to select features that have an importance of more than 0.18
X_new = model.transform(bbc_dtm)
print("shape of the matrix after applying the embedded feature selection:", X_new.shape)


shape of the matrix before applying the embedded feature selection: (1780, 23908)
shape of the matrix after applying the embedded feature selection: (1780, 156)


In [127]:
model

SelectFromModel(estimator=LinearSVC(C=0.01, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    loss='squared_hinge', max_iter=1000,
                                    multi_class='ovr', penalty='l1',
                                    random_state=None, tol=0.0001, verbose=0),
                max_features=None, norm_order=1, prefit=False, threshold=None)

In [234]:
# you can also check the coefficient values
model.estimator_.coef_

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        , -0.08075104,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
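
The coefficient matrix has one row per class, so the number of non-zero coefficients and the number of selected features are two different counts. The sketch below uses synthetic data from `make_classification` as a stand-in for the document-term matrix (an assumption for illustration, as is the value of C) and shows how both can be read off a fitted SelectFromModel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the document-term matrix (illustrative assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=50, n_informative=5,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=321)

selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False)).fit(X_demo, y_demo)
coef = selector.estimator_.coef_        # shape: (n_classes, n_features)

# Total number of non-zero weights over all one-vs-rest classifiers
print("non-zero coefficients:", np.count_nonzero(coef))

# A feature is kept when its summed absolute weight over the classes
# reaches the threshold (1e-5 by default for an L1-penalized estimator)
kept = np.abs(coef).sum(axis=0) >= selector.threshold_
print("features kept:", kept.sum(), "of", coef.shape[1])
```

A feature survives if at least one class gives it a non-negligible weight, which is why the second count can be smaller than the first.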

7\. **What are the top features according to the SVM model? Tip: Use the function model.get_support() to find these features.**

In [120]:
model.get_support()

array([False,  True, False, ..., False, False, False])

In [124]:
print("Features selected by SelectFromModel: ", feature_names[model.get_support()])

Features selected by SelectFromModel:  ['000' '2004' 'airlines' 'album' 'analysts' 'apple' 'arsenal' 'athens'
 'athletics' 'award' 'ballet' 'ban' 'band' 'bank' 'bbc' 'best' 'bid'
 'blair' 'blog' 'book' 'britain' 'british' 'broadband' 'brown' 'business'
 'champion' 'chart' 'chelsea' 'chief' 'children' 'china' 'club' 'coach'
 'comedy' 'committee' 'companies' 'company' 'computer' 'content' 'council'
 'countries' 'cup' 'data' 'deal' 'deutsche' 'digital' 'dollar' 'doping'
 'drugs' 'economic' 'economy' 'education' 'election' 'england' 'european'
 'euros' 'film' 'final' 'financial' 'firm' 'firms' 'fraud' 'game' 'games'
 'gaming' 'glazer' 'good' 'government' 'great' 'group' 'growth' 'half'
 'high' 'home' 'howard' 'hunting' 'iaaf' 'including' 'information'
 'injury' 'internet' 'ireland' 'jones' 'just' 'labour' 'like' 'liverpool'
 'lord' 'mail' 'make' 'market' 'match' 'microsoft' 'million' 'minister'
 'mobile' 'mps' 'mr' 'music' 'musical' 'net' 'new' 'nintendo' 'number'
 'oil' 'old' 'olympic' 'o

### Model comparison

8\. **Create a pipeline with tfidf representation and a random forest classifier.**

In [157]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf1 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('classification', RandomForestClassifier())
])

8\. **Fit the pipeline on the training set.**

In [159]:
clf1.fit(X_train, y_train)
clf1.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, vocabulary=None)),
  ('feature_extraction',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('classification',
   RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                          criterion='gini', max_depth=None, max_features='auto',
                          max_leaf_nodes=None, max_samples=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                         

8\. **Use the pipeline to predict the outcome variable on your test set. Evaluate the performance of the pipeline using the classification_report function on the test subset. How do you analyze your results?**

In [161]:
y_pred = clf1.predict(X_test)
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.92      0.97      0.94        92
entertainment       0.97      0.88      0.93        84
     politics       0.96      0.90      0.93        77
        sport       0.94      1.00      0.97       111
         tech       0.95      0.96      0.96        81

     accuracy                           0.95       445
    macro avg       0.95      0.94      0.94       445
 weighted avg       0.95      0.95      0.95       445
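
When analyzing a report like this, it helps to remember how the columns are derived from the confusion matrix: precision divides the correct predictions of a class by everything predicted as that class, and recall divides them by the support, that is, all true members of the class. The sketch below recomputes these quantities by hand on made-up labels (not the BBC predictions).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Made-up labels for a three-class problem, for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)  # rows: true class, columns: predicted class
tp = np.diag(cm)
precision = tp / cm.sum(axis=0)   # correct predictions / all predictions of the class
recall = tp / cm.sum(axis=1)      # correct predictions / all true members (support)
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
```

Read this way, a recall of 1.00 for sport in the report above means every sport article in the test set was found, while the lower recall for entertainment means some of its articles were assigned to other classes.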



8\. **Create another pipeline with tfidf representation and a random forest classifier, adding an embedded feature selection step that uses the SVM classification method with an L1 penalty. Fit the pipeline on your training set and test it with the test set. How does the performance change?**

In [150]:
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])

In [151]:
clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabula...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,
                                

In [145]:
clf.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, vocabulary=None)),
  ('feature_extraction',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('feature_selection',
   SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False,
                                       fit_intercept=True, intercept_scaling=1,
                                       loss='squared_hinge', max_iter=1000,
                                       multi_class='ovr', penalty='l1',
                                       random_state=None, tol=0.0001, verbose=0),
  

In [152]:
y_pred = clf.predict(X_test)

In [154]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.92      0.93      0.93        92
entertainment       0.98      0.94      0.96        84
     politics       0.92      0.90      0.91        77
        sport       0.99      0.98      0.99       111
         tech       0.94      1.00      0.97        81

     accuracy                           0.95       445
    macro avg       0.95      0.95      0.95       445
 weighted avg       0.95      0.95      0.95       445




9\. **Create another pipeline with tfidf representation, chi2 feature selection, and random forest classifier.**

In [162]:
clf3 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectKBest(chi2, k=20)),
    ('classification', RandomForestClassifier())
])

In [164]:
clf3.fit(X_train, y_train)
clf3.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, vocabulary=None)),
  ('feature_extraction',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('feature_selection',
   SelectKBest(k=20, score_func=<function chi2 at 0x000001EA0ADB0288>)),
  ('classification',
   RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                          criterion='gini', max_depth=None, max_features='auto',
                          max_leaf_nodes=None, max_samples=None,
                          min_impurity_decrease=0.0, min_impurity_split=N

In [166]:
y_pred = clf3.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.64      0.46      0.53        92
entertainment       0.80      0.56      0.66        84
     politics       0.79      0.73      0.76        77
        sport       0.63      0.98      0.76       111
         tech       0.88      0.81      0.85        81

     accuracy                           0.72       445
    macro avg       0.75      0.71      0.71       445
 weighted avg       0.73      0.72      0.71       445



In [173]:
clf4 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectKBest(chi2, k=200)),
    ('classification', RandomForestClassifier())
])

In [174]:
clf4.fit(X_train, y_train)
clf4.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, vocabulary=None)),
  ('feature_extraction',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('feature_selection',
   SelectKBest(k=200, score_func=<function chi2 at 0x000001EA0ADB0288>)),
  ('classification',
   RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                          criterion='gini', max_depth=None, max_features='auto',
                          max_leaf_nodes=None, max_samples=None,
                          min_impurity_decrease=0.0, min_impurity_split=

In [175]:
y_pred = clf4.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.87      0.91      0.89        92
entertainment       0.97      0.92      0.94        84
     politics       0.93      0.88      0.91        77
        sport       0.96      0.99      0.98       111
         tech       0.95      0.96      0.96        81

     accuracy                           0.94       445
    macro avg       0.94      0.93      0.94       445
 weighted avg       0.94      0.94      0.94       445
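
The jump in accuracy from k=20 to k=200 suggests treating k as a hyperparameter. Instead of building one pipeline per value, a grid search can vary the parameter of a named step. The sketch below does this on a small synthetic corpus; the corpus, the grid of k values, and the forest size are assumptions chosen to keep the example fast.

```python
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic two-topic corpus, invented for illustration
random.seed(321)
topics = {
    0: ["match", "team", "coach", "goal", "league"],
    1: ["software", "phone", "users", "internet", "computer"],
}
filler = ["said", "year", "new", "people", "time"]
docs, labels = [], []
for label, words in topics.items():
    for _ in range(40):
        docs.append(" ".join(random.choices(words, k=4) + random.choices(filler, k=4)))
        labels.append(label)

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("feature_extraction", TfidfTransformer()),
    ("feature_selection", SelectKBest(chi2)),
    ("classification", RandomForestClassifier(n_estimators=50, random_state=321)),
])

# "feature_selection__k" addresses the k parameter of the step named "feature_selection"
search = GridSearchCV(pipe, {"feature_selection__k": [1, 3, 5, 10]}, cv=3)
search.fit(docs, labels)

for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"k={params['feature_selection__k']}: mean CV accuracy = {score:.2f}")
print("best k:", search.best_params_["feature_selection__k"])
```

Because the selection happens inside the pipeline, it is refitted on every cross-validation fold, which keeps the estimate of each k honest.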



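10\. **Train both pipelines on a training subset of the BBC dataset.**

In [4]:
# fit the two feature selection pipelines defined above on the training split
clf.fit(X_train, y_train)   # L1 SVM feature selection + random forest
clf3.fit(X_train, y_train)  # chi2 feature selection + random forest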

11\. **Evaluate the performance of the models using the classification_report function on the test subset. How do you analyze your results?**

12\. **We can change the learner by simply plugging a different classifier object into our pipeline. Create your third pipeline with L1 norm SVM for the feature selection method and naive Bayes for the classifier. Compare your results on the test set with the previous two pipelines.**

In [177]:
from sklearn.naive_bayes import MultinomialNB

clf5 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', MultinomialNB(alpha=0.01))
])

In [178]:
clf5.fit(X_train, y_train)
clf5.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, vocabulary=None)),
  ('feature_extraction',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('feature_selection',
   SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False,
                                       fit_intercept=True, intercept_scaling=1,
                                       loss='squared_hinge', max_iter=1000,
                                       multi_class='ovr', penalty='l1',
                                       random_state=None, tol=0.0001, verbose=0),
  

In [179]:
y_pred = clf5.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.96      0.93      0.95        92
entertainment       1.00      0.94      0.97        84
     politics       0.95      0.99      0.97        77
        sport       1.00      1.00      1.00       111
         tech       0.93      0.98      0.95        81

     accuracy                           0.97       445
    macro avg       0.97      0.97      0.97       445
 weighted avg       0.97      0.97      0.97       445



### Dimension reduction

13\. **Dimensionality reduction methods such as PCA and SVD can be used to project the data into a lower dimensional space. If you run PCA with your text data, you might end up with the message "PCA does not support sparse input. See TruncatedSVD for a possible alternative." Therefore, we will use the Truncated SVD function from the sklearn package and we want to find out how much of the variance in the BBC data set is explained with different components. For this, first create a tfidf matrix and use that to make a co-occurrence matrix.**

In [185]:
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()
X = tfidf_vect.fit_transform(X_train)
Xc = (X.T @ X) # sparse term-term co-occurrence matrix
Xc.setdiag(0) # set the diagonal to 0 to ignore each word's co-occurrence with itself
print("Shape of the TF-IDF matrix:", X.shape)



In [None]:
print(Xc[:10, :10].todense()) # print a small corner in dense format; the full dense matrix may not fit in memory

In [None]:
tfidf_vect.vocabulary_

14\. **Run the SVD function with different values for components: 1, 2, 4, 5, 10, 15, 20, 50, 100. Plot the explained variance ratio for each component of Truncated SVD.**

In [None]:
import matplotlib.pyplot as plt

n_comp = [1, 2, 4, 5, 10, 15, 20, 50, 100] # list containing different values of components
explained = [] # explained variance ratio for each component of Truncated SVD
for x in n_comp:
    svd = TruncatedSVD(n_components=x, random_state=321)
    svd.fit(Xc)
    explained.append(svd.explained_variance_ratio_.sum())
    print("Number of components = %r and explained variance = %r"%(x,svd.explained_variance_ratio_.sum()))

plt.plot(n_comp, explained)
plt.xlabel('Number of components')
plt.ylabel("Explained Variance")
plt.title("Number of components vs. explained variance")
plt.show()

15\. **How many components are needed to explain at least 95% of the variance?**

Among the values tried above, 15 components appear to be the smallest number that explains at least 95% of the variance.
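
The threshold can also be located programmatically. The sketch below fits TruncatedSVD once with the largest number of components of interest and takes the cumulative sum of `explained_variance_ratio_`. It runs on a synthetic rank-5 matrix as a stand-in for the real data (an assumption for illustration), and with the randomized solver the cumulative sums of a single fit only approximate what separate fits would give.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic rank-5 matrix standing in for the real data (illustrative assumption)
rng = np.random.default_rng(321)
X_demo = sparse.csr_matrix(rng.random((200, 5)) @ rng.random((5, 50)))

# Fit once with the largest number of components of interest
svd = TruncatedSVD(n_components=20, random_state=321).fit(X_demo)
cumulative = np.cumsum(svd.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95
reached = cumulative >= target
n_needed = int(np.argmax(reached)) + 1 if reached.any() else None
print("components needed for", target, "of the variance:", n_needed)
```

If no tested number of components reaches the target, `n_needed` stays `None`, which signals that more components should be tried.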

16\. **Use these components and train an SVM model on the BBC dataset. Make a pipeline for your model. Compare your results on the test set with the previous pipelines.**

In [222]:
# named clf6 so that it does not overwrite the naive Bayes pipeline clf5
clf6 = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('feature_extraction', TfidfTransformer()),
    ('dimension_reduction', TruncatedSVD(n_components=15, random_state=321)),
    ('classification', LinearSVC())
])

In [223]:
clf6.fit(X_train, y_train)

In [221]:
y_pred = clf6.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=labels_str))

               precision    recall  f1-score   support

     business       0.91      0.92      0.92        92
entertainment       1.00      0.94      0.97        84
     politics       0.91      0.92      0.92        77
        sport       0.99      1.00      1.00       111
         tech       0.94      0.96      0.95        81

     accuracy                           0.95       445
    macro avg       0.95      0.95      0.95       445
 weighted avg       0.95      0.95      0.95       445

