# 20 Newsgroups Classification

### 1. Importing the required modules and loading the data

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset='all', random_state=2022)

In [11]:
news.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [18]:
# DESCR only renders cleanly when it is passed through print

print(news['DESCR'])

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [19]:
print(news.data[2])

From: astein@nysernet.org (Alan Stein)
Subject: Re: Hamza Salah, the Humanist
Organization: NYSERNet, Inc.
Lines: 16

dzk@cs.brown.edu (Danny Keren) writes:

>cl056@cleveland.Freenet.Edu (Hamaza H. Salah) writes:

># Well said Mr. Beyer :)

>He-he. The great humanist speaks. One has to read Mr. Salah's posters,
>in which he decribes Jews as "sons of pigs and monkeys", keeps
>promising the "final battle" between Muslims and Jews (in which the
>stons and the trees will "cry for the Muslims to come and kill the
>Jews hiding behind them"), makes jokes about Jews dying from heart
>attacks etc, to realize his objective stance on the matters involved.

Humanist, or sub-humanist? :-)
-- 
Alan H. Stein                     astein@israel.nysernet.org



### 2. Extracting the train/test data

In [20]:
train_news = fetch_20newsgroups(
    subset = 'train', random_state=2022, remove = ('headers', 'quotes', 'footers')
)
X_train = train_news.data
y_train = train_news.target

In [21]:
print(X_train[1])

I have the local bus card also, and don't have any such problems with it
now, but this is the second card I've gotten - the first card didn't work
in VGA mode correctly.  Maybe they still have some quality control problems.
I would suggest checking with ATI (I went through the vendor I bought the
card from since the problem showed up immediately).  I never was able to
get through to ATI's technical support number.  

I sure like the way the card performs though.  I have the 2MB ATI ultra
pro - local bus, and it is fast even in 1024x768x16bpp mode.


Cheers,
Phil




In [32]:
y_train[1], train_news.target_names[y_train[1]]

# X_train[1] holds the text and y_train[1] holds its label, here 2.
# To get the name for label 2, index train_news.target_names[2].
# Since y_train[1] == 2, y_train[1] can be used as the index directly.

(2, 'comp.os.ms-windows.misc')
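The same lookup extends to a whole array of labels. A minimal sketch, using a shortened, hypothetical `target_names` list and label list in place of the real `train_news.target_names` and `y_train`:

```python
# Hypothetical stand-ins for train_news.target_names and y_train
target_names = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']
labels = [2, 0, 1, 2]

# Each integer label is simply an index into target_names
label_names = [target_names[i] for i in labels]
print(label_names)
```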

In [30]:
train_news.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [33]:
test_news = fetch_20newsgroups(
    subset = 'test', random_state=2022, remove = ('headers', 'quotes', 'footers')
)
X_test = test_news.data
y_test = test_news.target

In [34]:
len(X_train), len(X_test)

(11314, 7532)

### 3. Feature vectorization and training/evaluating machine learning models

Case 1) CountVectorizer + LogisticRegression

In [35]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()
cvect.fit(X_train)
X_train_cv = cvect.transform(X_train)
X_test_cv = cvect.transform(X_test)
X_train_cv.shape, X_test_cv.shape

((11314, 101631), (7532, 101631))
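Both matrices have 101631 columns because the vocabulary is built from the training data only and then reused for the test data. The sketch below illustrates that fit/transform split in plain Python. It is a simplification under stated assumptions: the toy documents are hypothetical, and it splits on whitespace instead of using CountVectorizer's own tokenization.

```python
from collections import Counter

# Hypothetical toy corpus standing in for X_train and X_test
train_docs = ["the card works", "the card is fast"]
test_docs = ["the fast bus card"]

# "fit": build a sorted vocabulary from the training documents only
vocab = sorted({word for doc in train_docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def transform(docs):
    """Count vocabulary words per document; words not in the vocabulary are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(w for w in doc.split() if w in index)
        rows.append([counts[w] for w in vocab])
    return rows

train_matrix = transform(train_docs)
test_matrix = transform(test_docs)
print(vocab)
print(train_matrix)
print(test_matrix)
```

The word "bus" appears only in the test document, so it gets no column; the test matrix keeps the same width as the training matrix.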

In [37]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression(random_state=2022)
%time lrc.fit(X_train_cv, y_train)
lrc.score(X_test_cv, y_test)

Wall time: 47.7 s


0.6066117896972916
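For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, with hypothetical label lists:

```python
# Hypothetical true labels and predictions
y_true = [2, 0, 1, 2, 1]
y_pred = [2, 0, 2, 2, 0]

# Accuracy = number of matching positions / number of samples
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)
```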

Case 2) TfidfVectorizer + SVC

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer
tvect = TfidfVectorizer()
tvect.fit(X_train)
X_train_tv = tvect.transform(X_train)
X_test_tv = tvect.transform(X_test)
X_train_tv.shape, X_test_tv.shape

((11314, 101631), (7532, 101631))
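The shapes match Case 1 because TF-IDF reweights the same term counts rather than changing the vocabulary. The sketch below computes the weights by hand, assuming scikit-learn's documented defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The tokenized toy documents are hypothetical.

```python
import math

# Hypothetical, already tokenized documents
docs = [["card", "fast", "card"], ["card", "bus"]]
vocab = sorted({w for d in docs for w in d})
n = len(docs)

# Document frequency: in how many documents each term occurs
df = {w: sum(w in d for d in docs) for w in vocab}
# Smoothed inverse document frequency
idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Raw count times idf for each vocabulary term, scaled to unit L2 norm."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(d) for d in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

A term that occurs in every document ("card") gets the minimum idf of 1, so in the second document the rarer "bus" outweighs it.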

In [39]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train_tv, y_train)
svc.score(X_test_tv, y_test)

0.6575942644715879

### 4. Finding the best hyperparameters with Pipeline/GridSearchCV

In [40]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [47]:
pipeline = Pipeline([
    ('tvect', TfidfVectorizer(stop_words = 'english')),
    ('lrc', LogisticRegression(random_state=2022))
])
params = {
    'tvect__max_df' : [0.9, 0.95],
    'tvect__ngram_range' : [(1, 1), (1, 2)],
    'lrc__C' : [1, 10]
}

In [48]:
grid_pipe = GridSearchCV(
    pipeline, param_grid = params, scoring = 'accuracy', cv = 3, n_jobs=-1
)
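The grid keys follow the `<step name>__<parameter>` convention, and the search tries every combination of the listed values. The sketch below only enumerates the grid to show how much work that implies; it does not run scikit-learn.

```python
from itertools import product

# Same grid and fold count as in the cells above
params = {
    'tvect__max_df': [0.9, 0.95],
    'tvect__ngram_range': [(1, 1), (1, 2)],
    'lrc__C': [1, 10],
}
cv = 3

# Cartesian product of all value lists gives the candidate settings
names = list(params)
combos = [dict(zip(names, values)) for values in product(*params.values())]
n_fits = len(combos) * cv
print(len(combos), n_fits)

# Each key splits into the pipeline step name and that step's parameter
for key in names:
    step, param = key.split('__', 1)
    print(step, param)
```

That is 8 candidates and 24 cross-validation fits, plus one final refit of the best candidate on the full training set under the default `refit=True`.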

In [49]:
%time grid_pipe.fit(X_train, y_train)

In [None]:
grid_pipe.best_params_

{'lrc__C': 1, 'tvect__max_df': 0.9, 'tvect__ngram_range': (1, 1)}

In [None]:
grid_pipe.best_estimator_.score(X_test, y_test)

## Document Clustering - Opinion Review Dataset

In [1]:
import pandas as pd
import glob, os

path = 'OpinosisDataset1.0/topics'
os.path.join(path, '*.data')

'OpinosisDataset1.0/topics\\*.data'

In [3]:
all_files = glob.glob(os.path.join(path, '*.data'))
len(all_files)

51

In [11]:
file = all_files[0]
file

'OpinosisDataset1.0/topics\\accuracy_garmin_nuvi_255W_gps.txt.data'

In [15]:
file.split('\\')[-1].split('.')[0]

'accuracy_garmin_nuvi_255W_gps'
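Splitting on `'\\'` works here because the paths came back with Windows separators. On other platforms the separator differs, so a variant based on `os.path.basename` may be more portable. A sketch, assuming the same file name as above:

```python
import os

# Build the path with the running platform's separator
path = os.path.join('OpinosisDataset1.0', 'topics',
                    'accuracy_garmin_nuvi_255W_gps.txt.data')

# basename drops the directories; the first dot-separated piece is the topic name
stem = os.path.basename(path).split('.')[0]
print(stem)
```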

In [31]:
with open(file, encoding='latin1') as f:
    text = f.read()
text

", and is very, very accurate .\n but for the most part, we find that the Garmin software provides accurate directions, whereever we intend to go .\n This function is not accurate if you don't leave it in battery mode say, when you stop at the Cracker Barrell for lunch and to play one of those trangle games with the tees .\n It provides immediate alternatives if the route from the online map program was inaccurate or blocked by an obstacle .\n I've used other GPS units, as well as GPS built into cars   and to this day NOTHING beats the accuracy of a Garmin GPS .\n It got me from point A to point B with 100% accuracy everytime .\n It has yet to disappoint, getting me everywhere with 100% accuracy .\n0 out of 5 stars Honest, accurate review, , PLEASE READ !\n Aside from that, every destination I've thrown at has been 100% accurate .\nIn closing, this is a fantastic GPS with some very nice features and is very accurate in directions .\n Plus, I've always heard that there are  quirks  with

In [33]:
filename_list = [file.split('\\')[-1].split('.')[0] for file in all_files]
opinion_list = []
for file in all_files:
    with open(file, encoding='latin1') as f:
        text = f.read()
    opinion_list.append(text)


df = pd.DataFrame({
    'filename' : filename_list, 'opinion' : opinion_list
})
df.head()

| | filename | opinion |
|---|---|---|
| 0 | accuracy_garmin_nuvi_255W_gps | , and is very, very accurate .\n but for the m... |
| 1 | bathroom_bestwestern_hotel_sfo | The room was not overly big, but clean and ve... |
| 2 | battery-life_amazon_kindle | After I plugged it in to my USB hub on my com... |
| 3 | battery-life_ipod_nano_8gb | short battery life I moved up from an 8gb .\... |
| 4 | battery-life_netbook_1005ha | 6GHz 533FSB cpu, glossy display, 3, Cell 23Wh ... |


* Converting the text to features with a simple tokenizer function

In [1]:
from nltk import word_tokenize

def simple_tokenizer(text):
    word_list = word_tokenize(text)
    word_list = [word for word in word_list if len(word) > 2]
    return word_list
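
To see what the length filter does without downloading the NLTK tokenizer data, the sketch below swaps `word_tokenize` for a regular expression. The function name `simple_tokenizer_sketch` and the regex are assumptions made for illustration, and the regex only approximates NLTK's tokenization.

```python
import re

def simple_tokenizer_sketch(text):
    # Stand-in for nltk.word_tokenize: runs of letters, digits and apostrophes
    word_list = re.findall(r"[A-Za-z0-9']+", text)
    # Same filter as simple_tokenizer: keep tokens longer than two characters
    return [word for word in word_list if len(word) > 2]

sample = "It got me from point A to point B with 100% accuracy everytime ."
tokens = simple_tokenizer_sketch(sample)
print(tokens)
```

Short tokens such as "It", "me", "A" and "to" are dropped, which removes many function words and stray punctuation before vectorization.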