# Multiclass Classification

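Although sklearn lets us treat a multiclass problem in exactly the same way as a binary one, it is worth understanding how logistic regression (LR) handles multiple classes.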

Multinomial logistic regression has more recently also become known as softmax regression.

Recall the binary case: the model is represented by a single weight vector w, and the probability that the target is positive (1) is: ![being_1](probability.png)
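As a rough illustration of the binary case, the sketch below assumes the formula in the image is the standard logistic form, P(y = 1 | x) = 1 / (1 + exp(-w·x)). The names `sigmoid` and `prob_positive` and the numbers are hypothetical and chosen only for this example.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_positive(w, x):
    """P(y = 1 | x) for a binary model with a single weight vector w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

# Hypothetical weights and features, for illustration only.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
p = prob_positive(w, x)  # here w.x = 0.5 - 2.0 + 1.0 = -0.5
print(p)
```

The probability of the negative class is simply `1 - p`, which is why one weight vector is enough when there are only two classes.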



K-class case: the model is represented by K weight vectors w1, w2, ..., wK, and the probability that the target is class k is: ![being_k](k_class_probability.png)

Note the normalization term: it ensures that the probabilities over all K classes sum to 1.
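The normalization can be checked numerically. The sketch below assumes the usual softmax form, P(y = k | x) = exp(wk·x) / Σj exp(wj·x). The helper names and the weight values are made up for illustration.

```python
import math

def softmax(scores):
    """Turn K raw scores into K probabilities that sum to 1."""
    m = max(scores)  # subtracting the max leaves the result unchanged but avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(W, x):
    """W is a list of K weight vectors; returns P(y = k | x) for every k."""
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) for w_k in W]
    return softmax(scores)

# Hypothetical 3-class model on 2 features; the scores are 2, 1 and -3.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [2.0, 1.0]
probs = class_probabilities(W, x)
print(probs, sum(probs))
```

With K = 2 and the second score fixed at 0, `softmax([z, 0.0])[0]` equals 1 / (1 + exp(-z)), so the binary model is a special case of this one.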

The cost function for binary classification is: ![binary_cost_function](cost_function_T.png)

The cost function for K-class classification is: ![k_cost_function](k_cost_function.png)
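To make the two cost functions concrete, here is a minimal sketch that assumes both are the standard mean negative log-likelihood (cross-entropy). The formulas in the images may differ by a scaling constant or a regularization term. The function names are hypothetical.

```python
import math

def binary_cost(probs, labels):
    """Mean cross-entropy; probs[i] = P(y = 1 | x_i), labels are 0 or 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def multiclass_cost(prob_rows, labels):
    """Mean negative log of the probability given to the true class."""
    total = 0.0
    for row, y in zip(prob_rows, labels):
        total += -math.log(row[y])
    return total / len(labels)

# Two samples: the first is positive (p = 0.9), the second negative (p = 0.2).
cost_2 = binary_cost([0.9, 0.2], [1, 0])
# The same two samples written as 2-class probability rows [P(y=0), P(y=1)].
cost_k = multiclass_cost([[0.1, 0.9], [0.8, 0.2]], [1, 0])
print(cost_2, cost_k)
```

Under this assumption the K-class cost with K = 2 gives the same value as the binary cost, since each sample only contributes the log-probability of its own class.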

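With the cost function in hand, the optimal weights w can be found iteratively: ![weights_delta](k_weights_delta.png)
Once the optimal w is found, we have a model for predicting new samples: ![predictor](predictor_with_optimal_w.png)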
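The iterate-then-predict procedure can be sketched in plain Python. This is an illustration under stated assumptions, not the method in the images: it uses batch gradient descent on the mean cross-entropy cost, whose gradient with respect to wk is the average of (P(y = k | x) - 1[y = k]) · x, and it predicts the class with the highest score. The toy data, learning rate `eta`, and epoch count are all made up.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train(X, y, n_classes, eta=0.1, n_epochs=200):
    """Batch gradient descent on the mean cross-entropy cost."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(n_epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x_i, y_i in zip(X, y):
            probs = softmax([dot(w_k, x_i) for w_k in W])
            for k in range(n_classes):
                # Per-sample gradient for w_k: (p_k - 1[y_i == k]) * x_i
                err = probs[k] - (1.0 if y_i == k else 0.0)
                for j in range(n_features):
                    grad[k][j] += err * x_i[j]
        for k in range(n_classes):
            for j in range(n_features):
                W[k][j] -= eta * grad[k][j] / len(X)
    return W

def predict(W, x):
    """Return the class whose weight vector gives the highest score."""
    scores = [dot(w_k, x) for w_k in W]
    return max(range(len(W)), key=lambda k: scores[k])

# Toy, well-separated data; the constant last feature plays the role of a bias.
X = [[2.0, 0.0, 1.0], [2.5, 0.5, 1.0],
     [0.0, 2.0, 1.0], [0.5, 2.5, 1.0],
     [-2.0, -2.0, 1.0], [-2.5, -1.5, 1.0]]
y = [0, 0, 1, 1, 2, 2]
W = train(X, y, n_classes=3)
predictions = [predict(W, x) for x in X]
print(predictions)
```

Taking the argmax of the scores is equivalent to taking the argmax of the softmax probabilities, because softmax preserves the ordering of its inputs.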

### Example: news topic classification

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import SGDClassifier
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer

In [2]:
all_names = set(names.words())
lemmatizer = WordNetLemmatizer()

In [3]:
def letters_only(astr):
    # True only if every character is a letter; tokens from str.split() are never empty
    return astr.isalpha()

def clean_text(docs):
    # Keep alphabetic tokens that are not person names, then lowercase and lemmatize them
    cleaned_docs = []
    for doc in docs:
        cleaned_docs.append(' '.join([lemmatizer.lemmatize(word.lower())
                                         for word in doc.split()
                                         if letters_only(word)
                                         and word not in all_names]))
    return cleaned_docs

In [5]:
data_train = fetch_20newsgroups(subset='train', categories=None, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=None, random_state=42)

cleaned_train = clean_text(data_train.data)
label_train = data_train.target
cleaned_test = clean_text(data_test.data)
label_test = data_test.target

tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english',
                                   max_features=40000)
term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)

In [7]:
from sklearn.model_selection import GridSearchCV
parameters = {'penalty': ['l2', None],
              'alpha': [1e-07, 1e-06, 1e-05, 1e-04],
              'eta0': [0.01, 0.1, 1, 10]}

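# loss='log_loss' is the logistic regression loss (spelled 'log' before scikit-learn 1.1).
# max_iter replaces the removed n_iter argument; tol=None keeps a fixed 10 epochs.
sgd_lr = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=0.01,
                       fit_intercept=True, max_iter=10, tol=None)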

grid_search = GridSearchCV(sgd_lr, parameters, n_jobs=-1, cv=3)

grid_search.fit(term_docs_train, label_train)
print(grid_search.best_params_)

sgd_lr_best = grid_search.best_estimator_
accuracy = sgd_lr_best.score(term_docs_test, label_test)
print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))

{'alpha': 1e-06, 'eta0': 10, 'penalty': None}
The accuracy on testing set is: 79.6%
