We now combine text normalization, feature extraction, modeling, and evaluation to build a multi-class text classification system.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
def get_data():
    data = fetch_20newsgroups(subset = 'all',
                             shuffle = True,
                             remove = ('headers','footers','quotes'))
    return data
def prepare_datasets(corpus,labels,test_data_proportion = 0.3):
    # Use the caller-supplied proportion instead of a hard-coded test size
    train_X,test_X,train_Y,test_Y = train_test_split(corpus,labels,
                                                    test_size = test_data_proportion,
                                                    random_state = 42)
    return train_X,test_X,train_Y,test_Y
def remove_empty_docs(corpus,labels):
    filtered_corpus = []
    filtered_labels = []
    for doc,label in zip(corpus,labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
            
    return filtered_corpus,filtered_labels



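With these helper functions defined, we fetch the data, look at the categories in the dataset, and then use the code below to split the dataset into a training set and a test set.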

In [2]:
dataset = get_data()
print(dataset.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [3]:
corpus,labels = dataset.data,dataset.target
corpus,labels = remove_empty_docs(corpus,labels)
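
Removing headers, footers, and quotes can leave some posts with no text at all, which is why empty documents are dropped before the split. The following is a minimal sketch on a made-up four-document corpus (the documents and label values are invented for illustration); it shows that each label is filtered together with its document so the two lists stay aligned:

```python
# Toy stand-in for the corpus after header/footer/quote stripping (invented data)
docs = ["Graphics cards for sale.", "", "   \n\t", "Is the moon landing on TV?"]
labels = [6, 1, 19, 14]

# Keep only (document, label) pairs whose text is non-empty after stripping whitespace
kept = [(doc, label) for doc, label in zip(docs, labels) if doc.strip()]
filtered_docs = [doc for doc, _ in kept]
filtered_labels = [label for _, label in kept]

print(len(docs), "->", len(filtered_docs))   # 4 -> 2
print(filtered_labels)                       # [6, 14]
```

Documents that contain only whitespace count as empty here because `str.strip()` returns an empty string for them.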

In [4]:
print('Sample document:',corpus[10])
print('Class label :',labels[10])
print('Actual class label:',dataset.target_names[labels[10]])

Sample document: the blood of the lamb.

This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?

Cheers,
Kent
Class label : 19
Actual class label: talk.religion.misc


In [7]:
train_corpus,test_corpus,train_labels,test_labels = prepare_datasets(corpus,labels,test_data_proportion  = 0.3)

In [11]:
from normalization2 import normalize_corpus
norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)
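
The normalization2 module is a local file that is not shown in this section, so what normalize_corpus does exactly is not known here. As a rough sketch of what a normalizer of this kind might do, the hypothetical function below lowercases the text, strips everything that is not a letter, and drops a few stopwords. The function name and the stopword list are assumptions made for illustration, not the contents of normalization2:

```python
import re

# Hypothetical, deliberately tiny stopword list (illustration only)
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "it"}

def simple_normalize_corpus(corpus):
    """Lowercase, remove non-letters, drop stopwords; returns one string per document."""
    normalized = []
    for doc in corpus:
        text = doc.lower()
        # Replace every character that is not a lowercase letter or whitespace with a space
        text = re.sub(r"[^a-z\s]", " ", text)
        tokens = [tok for tok in text.split() if tok not in STOPWORDS]
        normalized.append(" ".join(tokens))
    return normalized

result = simple_normalize_corpus(["The blood of the LAMB.", "Hmm, what about used computers?"])
print(result)   # ['blood lamb', 'hmm what about used computers']
```

Returning one space-joined string per document keeps the output usable both by vectorizers that expect raw strings and by a tokenizer applied afterwards.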

In [19]:
from feature_extractors import bow_extractor,tfidf_extractor
from feature_extractors import averaged_word_vectorizer
from feature_extractors import tfidf_weighted_averaged_word_vectorizer
import nltk
import gensim
bow_vectorizer,bow_train_features = bow_extractor(norm_train_corpus)
bow_test_features = bow_vectorizer.transform(norm_test_corpus)

tfidf_vectorizer,tfidf_train_features = tfidf_extractor(norm_train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)
tokenized_train = [nltk.word_tokenize(text) for text in norm_train_corpus]
tokenized_test = [nltk.word_tokenize(text) for text in norm_test_corpus]
# Note: `size` is the parameter name in gensim < 4.0; gensim >= 4.0 calls it `vector_size`
model = gensim.models.Word2Vec(tokenized_train,
                              size = 500,
                              window = 100,
                              min_count = 30,
                              sample = 1e-3)
avg_wv_train_features = averaged_word_vectorizer(corpus = tokenized_train,model = model,num_features = 500)
avg_wv_test_features = averaged_word_vectorizer(corpus = tokenized_test,model = model,num_features = 500 )
vocab = tfidf_vectorizer.vocabulary_
tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_train,
                                       tfidf_vectorizer = tfidf_train_features,
                                       tfidf_vocabulary = vocab,model = model,
                                       num_features = 500)
tfidf_wv_test_features = tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_test,
                                       tfidf_vectorizer = tfidf_test_features,
                                       tfidf_vocabulary = vocab,model = model,
                                       num_features = 500)

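AttributeError: 'Word2Vec' object has no attribute 'index2word'

The cell stops with this error because the installed gensim release no longer exposes the vocabulary list on the Word2Vec object itself. The list lives on the keyed vectors instead: model.wv.index2word in releases before 4.0, and model.wv.index_to_key from gensim 4.0 on. The helper functions in feature_extractors presumably still read model.index2word and need to be updated accordingly.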
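
The feature_extractors module is not shown here, so the following is only a sketch of how an averaged word vectorizer could be written without touching the vocabulary list at all. It relies on membership tests and item lookup, which both a plain dict and gensim's keyed vectors (model.wv) support. The function names are hypothetical and the toy vectors are invented:

```python
import numpy as np

def average_word_vectors(tokens, word_vectors, num_features):
    """Mean of the vectors of all in-vocabulary tokens; a zero vector if none are known."""
    total = np.zeros(num_features, dtype="float64")
    n_known = 0
    for token in tokens:
        if token in word_vectors:        # works for dicts and for gensim KeyedVectors
            total += word_vectors[token]
            n_known += 1
    return total / n_known if n_known else total

def averaged_word_vectorizer_sketch(corpus, word_vectors, num_features):
    """Stack one averaged vector per tokenized document into a 2-D array."""
    return np.array([average_word_vectors(doc, word_vectors, num_features)
                     for doc in corpus])

# Toy 2-dimensional "embeddings" standing in for model.wv (invented values)
toy_vectors = {"blood": np.array([1.0, 0.0]), "lamb": np.array([0.0, 1.0])}
features = averaged_word_vectorizer_sketch(
    [["blood", "lamb", "unknown"], ["unknown"]], toy_vectors, num_features=2)
print(features)
```

With a trained model, the same function would be called with word_vectors = model.wv. Out-of-vocabulary tokens are skipped, so a document with no known words maps to the zero vector.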

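Having extracted all the necessary features from the text documents with the feature extractors above, we define a function that evaluates a classification model on the four metrics discussed earlier, as shown in the following code: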
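
Before that function, here is a plain-Python sketch of what such metrics compute in the multi-class case, assuming the four metrics are accuracy, precision, recall, and F1 score. The macro averaging used below (an unweighted mean over classes) is an assumption; the function in this document may average differently, for example through the `average` argument of scikit-learn's scoring functions. The labels are invented:

```python
def macro_metrics(true_labels, predicted_labels):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)   # correctly predicted as c
        fp = sum(t != c and p == c for t, p in pairs)   # wrongly predicted as c
        fn = sum(t == c and p != c for t, p in pairs)   # class c that was missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(classes)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Invented three-class example
true = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
acc, prec, rec, f1 = macro_metrics(true, pred)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))
```

Each class is treated in turn as the positive class, so precision and recall are computed one-vs-rest and then averaged.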

In [None]:
from sklearn import metrics
import numpy as np
def get_metrics(true_labels,predicted_labels):
    print('Accuracy:',np.round(metrics.accuracy_score(true_labels,predicted_labels),2))