## 20 Newsgroups Classification

In [30]:
import numpy as np
import pandas as pd

In [31]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset = 'all', random_state = 2021)

- Data exploration

In [32]:
news.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [33]:
news.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [34]:
pd.Series(news.target).value_counts().sort_index()

0     799
1     973
2     985
3     982
4     963
5     988
6     975
7     990
8     996
9     994
10    999
11    991
12    984
13    990
14    987
15    997
16    910
17    940
18    775
19    628
dtype: int64
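The counts above are indexed by integer label, which makes it hard to see which newsgroups are under-represented. A minimal sketch of one way to attach the group names follows. The helper `class_counts` and the toy arrays are hypothetical stand-ins for `news.target` and `news.target_names`, used so the example runs without downloading the dataset.

```python
import numpy as np
import pandas as pd

def class_counts(target, target_names):
    """Return document counts indexed by newsgroup name instead of label id."""
    counts = pd.Series(target).value_counts().sort_index()
    # Replace each integer label with the matching entry of target_names.
    counts.index = [target_names[i] for i in counts.index]
    return counts

# Hypothetical miniature stand-in for news.target / news.target_names.
toy_target = np.array([0, 1, 1, 2, 2, 2])
toy_names = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']

toy_counts = class_counts(toy_target, toy_names)
print(toy_counts)
```

On the real data the same call would be `class_counts(news.target, news.target_names)`.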

In [35]:
print(news.data[0])

From: dagibbs@quantum.qnx.com (David Gibbs)
Subject: Re: Countersteering sans Hands
Organization: QNX Software Systems, Ltd.
Lines: 22

In article <1993Apr20.203344.8417@cs.cornell.edu> karr@cs.cornell.edu (David Karr) writes:
>In article <Clarke.6.735328328@bdrc.bd.com> Clarke@bdrc.bd.com (Richard Clarke) writes:
>>So how do I steer when my hands aren't on the bars? (Open Budweiser in left 
>>hand, Camel cigarette in the right, no feet allowed.) 
>
>>If I lean, and the 
>>bike turns, am I countersteering?
>
>No, the bars would turn only *toward* the direction of turn in
>no-hands steering.

Just in case the original poster was looking for a serious answer,
I'll supply one.

Yes, even when steering no hands you do something quite similar
to countersteering.  Basically to turn left, you to a quick wiggle
of the bike to the right first, causing a counteracting lean to
occur to the left.  It is a lot more difficult to do on a motorcycle
than a bicycle though, because of the extra weight. 

### Train / test split

In [36]:
train_news = fetch_20newsgroups(subset = 'train', random_state = 2021, remove = ('headers', 'footers', 'quotes'))
X_train = train_news.data
y_train = train_news.target


In [37]:
print(train_news.data[0])


Stop! Hold it! You have a few problems here. Official history says that 
the first accusations of homosexuality in the SA came from OUTSIDE of the Nazi 
party, long BEFORE the Nazis ever came to power. So this objection is a red
herring, even if established history is wrong on this point. Moreover, none of 
the histories I've read ever made mention of Hitler or anyone else ever using 
homosexuality as a pretext for purging Roehm. A point I saw reiterated was that
Hitler and the party covered up these accusations. If you are going to accuse
official history of being a fabrication, you should at least get your facts
right. The pretext for purging Roehm was that he was planning to use the SA in
a coup against Hitler. Nowhere is there mention of using allegations of
homosexuality as a pretext for the purge, nor as a justification afterwards (it
is possible that the histories I've read have not mentioned this, but I doubt
it - would it be in Hitler's best interest to admit to the world tha

In [38]:
train_news.target[0], train_news.target_names[train_news.target[0]]

(19, 'talk.religion.misc')

In [39]:
test_news = fetch_20newsgroups(subset = 'test', random_state = 2021, remove = ('headers', 'footers', 'quotes'))
X_test = test_news.data
y_test = test_news.target

In [40]:
len(X_train), len(X_test)

(11314, 7532)

### Feature vectorization and machine learning model training / prediction / evaluation

- Case 1. CountVectorizer + LogisticRegression

In [41]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
cv.fit(X_train)
X_train_cv = cv.transform(X_train)
X_test_cv = cv.transform(X_test)

In [42]:
X_train_cv.shape, X_test_cv.shape

((11314, 101631), (7532, 101631))
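Both matrices have 101631 columns because the vocabulary is learned from the training text only and then reused for the test text. The toy sketch below illustrates this on a few made-up sentences (the documents and the `toy_` names are assumptions, not part of the notebook): a word that appears only in the test document is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only.
train_docs = ["the bike turns left", "the car turns right"]
test_docs = ["the bike turns right quickly"]

toy_cv = CountVectorizer()
toy_train = toy_cv.fit_transform(train_docs)  # learns the vocabulary, then counts
toy_test = toy_cv.transform(test_docs)        # reuses the learned vocabulary

print(sorted(toy_cv.vocabulary_))
print(toy_train.shape, toy_test.shape)
# "quickly" was never seen during fit, so it has no column in toy_test.
print('quickly' in toy_cv.vocabulary_)
```

Fitting the vectorizer on the test text as well would change the column set and leak test information into the features, which is why the notebook calls `fit` on `X_train` only.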

In [43]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter = 300)
lr.fit(X_train_cv, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(max_iter=300)

In [44]:
from sklearn.metrics import accuracy_score
pred = lr.predict(X_test_cv)
accuracy_score(y_test, pred)

0.5969198088157196

In [45]:
y_test[:5], pred[:5]

(array([13, 11,  9,  6, 19]), array([13, 12,  9,  6, 13]))
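Accuracy is simply the fraction of predictions that match the true labels. As a small worked check, the sketch below recomputes it for the five labels and predictions printed above; the variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The first five true labels and Case 1 predictions, copied from the output above.
y_true_5 = np.array([13, 11, 9, 6, 19])
y_pred_5 = np.array([13, 12, 9, 6, 13])

# Fraction of positions where prediction and label agree (3 of 5 here).
manual_acc = (y_true_5 == y_pred_5).mean()
acc5 = accuracy_score(y_true_5, y_pred_5)

print(manual_acc, acc5)
```

Three of the five samples match, so both values are 0.6, close to the 0.597 measured over the full test set.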

- Case 2. TfidfVectorizer + LogisticRegression

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
tv.fit(X_train)
X_train_tv = tv.transform(X_train)
X_test_tv = tv.transform(X_test)

In [47]:
X_train_tv.shape, X_test_tv.shape

((11314, 101631), (7532, 101631))
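The feature matrix has the same shape as in Case 1, since only the cell values change: raw counts are replaced by TF-IDF weights. A toy sketch with made-up documents shows two properties of `TfidfVectorizer` under its default settings: each row is L2-normalized, and a word found in every document is weighted below a word found in only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
toy_docs = ["the bike turns left", "the car turns right", "the bike is fast"]

toy_tv = TfidfVectorizer()
toy_tfidf = toy_tv.fit_transform(toy_docs)

# Euclidean length of every document vector (default norm='l2').
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()

# Weights within the first document: "the" occurs in all three documents,
# "left" occurs in only one.
w_the = toy_tfidf[0, toy_tv.vocabulary_['the']]
w_left = toy_tfidf[0, toy_tv.vocabulary_['left']]

print(row_norms)
print(w_the, w_left)
```

Down-weighting ubiquitous words is a plausible reason for the accuracy gain over raw counts reported below, although the notebook does not isolate that effect.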

In [48]:
lr = LogisticRegression(max_iter = 300)
lr.fit(X_train_tv, y_train)

LogisticRegression(max_iter=300)

In [49]:
pred = lr.predict(X_test_tv)
accuracy_score(y_test, pred)

0.6736590546999469

In [50]:
y_test[:5], pred[:5]

(array([13, 11,  9,  6, 19]), array([13, 12,  9,  6,  1]))

- Case 3. stop_words filtering, ngram_range = (1, 2), max_df = 300

In [51]:
tv = TfidfVectorizer(ngram_range = (1, 2), max_df = 300, stop_words = 'english')
tv.fit(X_train)
X_train_tv2 = tv.transform(X_train)
X_test_tv2 = tv.transform(X_test)

In [52]:
lr = LogisticRegression(max_iter = 300)
lr.fit(X_train_tv2, y_train)
pred = lr.predict(X_test_tv2)
accuracy_score(y_test, pred)

0.6922464152947424

In [53]:
y_test[:5], pred[:5]

(array([13, 11,  9,  6, 19]), array([13, 11,  9,  6, 12]))
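In Case 3, `max_df = 300` is an integer, so it is read as an absolute document count: any term appearing in more than 300 training documents is dropped. A float would be read as a proportion of documents instead. The toy sketch below, on made-up documents, contrasts the two readings and shows the bigrams that `ngram_range = (1, 2)` adds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
toy_docs = [
    "graphics card driver",
    "graphics card memory",
    "graphics software update",
]

# Integer max_df: drop terms found in more than 2 documents.
tv_int = TfidfVectorizer(ngram_range=(1, 2), max_df=2).fit(toy_docs)
vocab_int = set(tv_int.vocabulary_)

# Float max_df: drop terms found in more than 50% of the documents.
tv_float = TfidfVectorizer(ngram_range=(1, 2), max_df=0.5).fit(toy_docs)
vocab_float = set(tv_float.vocabulary_)

print(sorted(vocab_int))
print(sorted(vocab_float))
```

With the integer threshold only `graphics` (present in all three documents) is removed, while the bigram `graphics card` survives. With the float threshold `card` and `graphics card` are removed too, because they occur in two of three documents.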

- Case 4. Case 3 features with LogisticRegression(C = 10)

In [54]:
%%time
lr = LogisticRegression(max_iter = 300, C = 10)
lr.fit(X_train_tv2, y_train)
pred = lr.predict(X_test_tv2)
accuracy_score(y_test, pred)

CPU times: user 14min 2s, sys: 12min 15s, total: 26min 17s
Wall time: 11min 28s


0.7012745618693574
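In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so `C = 10` penalizes large coefficients less than the default `C = 1`. The sketch below illustrates this on a small synthetic dataset. The data from `make_classification` and the sizes chosen are assumptions for demonstration and have nothing to do with the newsgroup features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic problem, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=100, n_features=20, n_informative=5, random_state=0
)

norms = {}
for C in (1, 10):
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X_toy, y_toy)
    # Overall size of the learned weight vector.
    norms[C] = np.linalg.norm(clf.coef_)

print(norms)
```

Weaker regularization lets the weights grow, which can fit the training data more closely. Whether that helps on held-out data has to be measured, as the notebook does with the test set.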

### Hyperparameter tuning through Pipeline and GridSearchCV

In [55]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tv', TfidfVectorizer(stop_words = 'english')),
    ('lr', LogisticRegression())
])

In [56]:
params = {
    'tv__ngram_range' : [(1, 1), (1, 2)],
    'tv__max_df' : [300, 700],
    'lr__C' : [1, 10]
}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_pipe = GridSearchCV(pipeline, param_grid = params, cv = 3, scoring = 'accuracy', verbose = 1, n_jobs = -1)

In [58]:
%%time
grid_pipe.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


RuntimeError: cannot release un-acquired lock
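The run above stops with a `RuntimeError`, and the notebook does not show what caused it. One thing that may be worth trying is a serial search (`n_jobs = 1`), which avoids spawning parallel workers; this is an assumption about a possible workaround, not a confirmed fix. The sketch below runs the same kind of pipeline search serially on a tiny made-up corpus, with a reduced grid, to show how the results are read once a search completes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up two-class corpus, for illustration only.
toy_docs = [
    "the engine and brakes of the car", "new tires for my car",
    "motorcycle engine oil change", "car dealer sold the engine",
    "brakes and tires on the motorcycle", "oil change for the car",
    "the pitcher threw the ball", "hockey team won the game",
    "baseball game tonight", "the team lost the hockey game",
    "pitcher and catcher in baseball", "ball game and team score",
]
toy_labels = [0] * 6 + [1] * 6

toy_pipeline = Pipeline([
    ('tv', TfidfVectorizer(stop_words='english')),
    ('lr', LogisticRegression(max_iter=300)),
])

# Reduced grid (assumption): the max_df values from the notebook are
# meaningless on a 12-document corpus, so they are left out here.
toy_params = {
    'tv__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [1, 10],
}

# n_jobs=1 runs every fit in the current process.
toy_grid = GridSearchCV(toy_pipeline, param_grid=toy_params, cv=3,
                        scoring='accuracy', n_jobs=1)
toy_grid.fit(toy_docs, toy_labels)

print(toy_grid.best_params_)
print(toy_grid.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every fold, so the cross-validated score is not inflated by vocabulary learned from the held-out part. A serial search over the full 11314-document training set is likely to be slow, given that a single fit in Case 4 took over 11 minutes of wall time.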