### 20 Newsgroups Classification

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', random_state=2023)

- Data exploration

In [3]:
print(news.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [4]:
print(news.data[0])

From: alin@nyx.cs.du.edu (ailin lin)
Subject: very cheap 386 motherboard
Organization: Nyx, Public Access Unix @ U. of Denver Math/CS dept.
Lines: 7

Novell 386dx16 motherboard with cpu, 4 megs of memory and I/O ports for
$160 + shipping / firm.             

let me know if you are interested.      

ailin
803-654-8817



In [5]:
from pprint import pprint
pprint(news.target_names)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [7]:
np.unique(news.target, return_counts=True)

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19]),
 array([799, 973, 985, 982, 963, 988, 975, 990, 996, 994, 999, 991, 984,
        990, 987, 997, 910, 940, 775, 628], dtype=int64))
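The raw count array is easier to read when paired with the class names. The following is a minimal sketch in plain Python that copies the names and counts printed above into two lists (`names`, `counts`, and `by_class` are illustrative variable names, not part of the notebook) and picks out the largest and smallest classes.

```python
# Class names and per-class counts copied from the notebook outputs above.
names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]
counts = [799, 973, 985, 982, 963, 988, 975, 990, 996, 994,
          999, 991, 984, 990, 987, 997, 910, 940, 775, 628]

# Map each class name to its number of posts.
by_class = dict(zip(names, counts))

largest = max(by_class, key=by_class.get)
smallest = min(by_class, key=by_class.get)
total = sum(counts)

print(largest, by_class[largest])
print(smallest, by_class[smallest])
print(total)
```

The classes are roughly balanced, so plain accuracy is a reasonable first metric here, although the smallest class has noticeably fewer posts than the largest.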

- Dataset extraction

In [8]:
train_new = fetch_20newsgroups(
    subset='train', random_state=2023, remove=('headers','quotes','footers')
)
X_train = train_new.data
y_train = train_new.target

In [9]:
test_new = fetch_20newsgroups(
    subset='test', random_state=2023, remove=('headers','quotes','footers')
)
X_test = test_new.data
y_test = test_new.target

In [11]:
len(X_train), len(X_test)

(11314, 7532)

##### Feature vectorization + machine learning model training/evaluation

- Case 1. CountVectorizer + LogisticRegression

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(stop_words='english')
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((11314, 101322), (7532, 101322))

In [15]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=2023)
%time lr.fit(X_train_cv, y_train)
lr.score(X_test_cv, y_test)

CPU times: total: 52 s
Wall time: 44.9 s


0.6259957514604355

- Case 2. TfidfVectorizer + LogisticRegression

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvect = TfidfVectorizer(stop_words='english')
tvect.fit(X_train)
X_train_tv = tvect.transform(X_train)
X_test_tv = tvect.transform(X_test)
X_train_tv.shape, X_test_tv.shape

((11314, 101322), (7532, 101322))

In [18]:
lr = LogisticRegression(random_state=2023)
%time lr.fit(X_train_tv, y_train)
lr.score(X_test_tv, y_test)

CPU times: total: 46.9 s
Wall time: 39.7 s


0.6909187466808284
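The only change from Case 1 is how the term counts are weighted. The following is a simplified sketch, in plain Python, of the weighting that `TfidfVectorizer` applies with its default settings: a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf`, the whitespace tokenizer, and the toy documents are assumptions for illustration; the real vectorizer also handles tokenization, stop words, and sparse storage.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    # Simplified tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so every non-empty document has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical toy documents, not taken from the dataset.
toy_docs = [
    "cheap motherboard for sale",
    "motherboard with cpu",
    "hockey game tonight",
]
toy_rows = tfidf(toy_docs)
print(toy_rows[0])
```

In the first toy document, `cheap` receives a higher weight than `motherboard` because it appears in fewer documents.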

- Case 3. TfidfVectorizer(N-gram) + LogisticRegression

In [21]:
tvect2 = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
tvect2.fit(X_train)
X_train_tv2 = tvect2.transform(X_train)
X_test_tv2 = tvect2.transform(X_test)
X_train_tv2.shape, X_test_tv2.shape

((11314, 943737), (7532, 943737))

In [22]:
lr = LogisticRegression(random_state=2023)
%time lr.fit(X_train_tv2, y_train)
lr.score(X_test_tv2, y_test)

CPU times: total: 6min 8s
Wall time: 5min 20s


0.6868029739776952
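With `ngram_range=(1,2)` the vocabulary holds word pairs as well as single words, which is why the feature count grows from 101,322 to 943,737 and the fit takes several minutes. The following is a minimal sketch of how unigrams and bigrams are generated from one token sequence; `word_ngrams` and the toy tokens are hypothetical, and this does not reproduce the vectorizer's own tokenization or stop-word handling.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical token list for illustration.
toy_tokens = "cheap 386 motherboard".split()
toy_grams = word_ngrams(toy_tokens)
print(toy_grams)
```

Three tokens yield three unigrams and two bigrams. Across a whole corpus most bigrams are rare, so the vocabulary grows much faster than the amount of useful signal, which is consistent with the slightly lower accuracy in Case 3.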

- Pipeline / GridSearchCV

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('TVECT', TfidfVectorizer(stop_words='english')),
    ('LR', LogisticRegression(random_state=2023, max_iter=500))
])
params = {
    # Keys follow '<step name>__<parameter>'; step names are case-sensitive,
    # so they must match 'TVECT' and 'LR' exactly as defined in the Pipeline.
    'TVECT__max_df':[300,700],
    'LR__C':[1,10]
}
grid_pipe = GridSearchCV(
    pipeline, params, scoring='accuracy', cv=3, n_jobs=-1
)

In [25]:
%time grid_pipe.fit(X_train, y_train)

In [None]:
grid_pipe.best_params_, grid_pipe.best_estimator_.score(X_test, y_test)
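`GridSearchCV` addresses pipeline parameters as `<step name>__<parameter>`, and the step name must match the one given to `Pipeline` exactly, including case. The following is a minimal sketch for checking grid keys against a pipeline before launching a long search. It assumes scikit-learn is installed; `toy_pipeline`, `grid_keys`, and `missing` are illustrative names, and no data is fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative pipeline with lowercase step names.
toy_pipeline = Pipeline([
    ('tvect', TfidfVectorizer(stop_words='english')),
    ('lr', LogisticRegression(max_iter=500)),
])

# get_params() lists every name that set_params() and GridSearchCV accept.
valid_names = set(toy_pipeline.get_params().keys())

# Candidate grid keys to validate before fitting.
grid_keys = ['tvect__max_df', 'lr__C']
missing = [key for key in grid_keys if key not in valid_names]

print(missing)
```

An empty `missing` list means every grid key resolves to a real parameter. A key such as `TVECT__max_df` would be reported as missing for this pipeline because its steps are named in lowercase.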