# Topic modeling

## Imports

In [4]:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import joblib

In [5]:
train_data = datasets.fetch_20newsgroups(subset='train')
test_data = datasets.fetch_20newsgroups(subset='test')

In [8]:
print(test_data.data[10])

From: Greg.Reinacker@FtCollins.NCR.COM
Subject: Windows On-Line Review uploaded
Reply-To: Greg.Reinacker@FtCollinsCO.NCR.COM
Organization: NCR Microelectronics, Ft. Collins, CO
Lines: 12

I have uploaded the Windows On-Line Review shareware edition to
ftp.cica.indiana.edu as /pub/pc/win3/uploads/wolrs7.zip.

It is an on-line magazine which contains reviews of some shareware
products...I grabbed it from the Windows On-Line BBS.

--
--------------------------------------------------------------------------
Greg Reinacker                          (303) 223-5100 x9289
NCR Microelectronic Products Division   VoicePlus 464-9289
2001 Danfield Court                     Greg.Reinacker@FtCollinsCO.NCR.COM
Fort Collins, CO  80525



In [17]:
test_data.target_names[test_data.target[10]]

'comp.os.ms-windows.misc'

In [15]:
print(test_data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features      

In [11]:
test_data.target

array([ 7,  5,  0, ...,  9,  6, 15])

In [10]:
test_data.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Model training

In [18]:
vectorizer = TfidfVectorizer(min_df=100)

vectorized_train_data = vectorizer.fit_transform(train_data.data)
vectorized_test_data = vectorizer.transform(test_data.data)
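
To make the numbers produced by `TfidfVectorizer` less opaque, here is a minimal pure-Python sketch of the weighting it applies with its default settings, as far as I understand them: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper `tfidf_rows` and the toy documents are hypothetical and not part of the notebook. The sketch also leaves out the `min_df=100` filter used above, which additionally drops every term that occurs in fewer than 100 training documents.

```python
import math
import re
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # Scale the row to unit Euclidean length (guard against empty rows).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


# Hypothetical toy documents, not taken from 20 newsgroups.
toy_docs = [
    "hockey is a winter sport",
    "windows shareware review",
    "hockey hockey playoffs",
]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

A term repeated within a document gains weight through its count, while a term shared by many documents loses weight through its lower IDF.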

In [20]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)

In [21]:
model.fit(vectorized_train_data, train_data.target)

RandomForestClassifier(n_jobs=-1)

In [25]:
train_preds = model.predict(vectorized_train_data)
test_preds = model.predict(vectorized_test_data)

In [26]:
accuracy_score(train_data.target, train_preds)

0.9999116139296447

In [27]:
accuracy_score(test_data.target, test_preds)

0.6929102496016994
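
A single accuracy figure does not show which newsgroups are being confused with each other. One possible way to break it down is a confusion matrix with per-class recall. The sketch below uses small hypothetical label arrays in place of `test_data.target` and `test_preds`, so the numbers are illustrative only.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for three classes; in the notebook these would be
# test_data.target and test_preds.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
names = ["alt.atheism", "comp.graphics", "comp.os.ms-windows.misc"]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)

# Recall per class: correct predictions divided by true examples of that class.
per_class_recall = matrix.diagonal() / matrix.sum(axis=1)

print("accuracy:", accuracy_score(y_true, y_pred))
for name, recall in zip(names, per_class_recall):
    print(f"{name}: recall {recall:.2f}")
```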

In [28]:
test_obj = 'I like to play hockey. Hockey is winter sport, a lot of people like it as well'

In [30]:
vectorized_test_obj = vectorizer.transform([test_obj])

In [31]:
print(vectorized_test_obj)

  (0, 2354)	0.19004648736253527
  (0, 2198)	0.07995502983108255
  (0, 2039)	0.39594617400909415
  (0, 1646)	0.29624377122737316
  (0, 1613)	0.1796512227862086
  (0, 1523)	0.0818410487920468
  (0, 1305)	0.25035611191746915
  (0, 1270)	0.30968642249164924
  (0, 1166)	0.09596766531354757
  (0, 1156)	0.09193929454610761
  (0, 1065)	0.6936193159452821
  (0, 268)	0.12895354534317988
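
The printed sparse row only shows column indices, which are hard to read on their own. A minimal sketch of mapping them back to terms is to invert the fitted vectorizer's `vocabulary_` dictionary. It is shown here on a hypothetical toy corpus (`toy_corpus`, `toy_vectorizer`) rather than on the notebook's fitted `vectorizer`, but the same steps should apply to it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for train_data.data.
toy_corpus = [
    "I like to play hockey",
    "hockey is a winter sport",
    "a lot of people like winter",
]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

# vocabulary_ maps term -> column index; invert it to look terms up by index.
index_to_term = {index: term for term, index in toy_vectorizer.vocabulary_.items()}

# A single transformed document is a one-row sparse matrix whose
# .indices and .data hold the non-zero columns and their weights.
row = toy_vectorizer.transform(["I like hockey in winter"])
term_weights = sorted(
    ((index_to_term[i], w) for i, w in zip(row.indices, row.data)),
    key=lambda pair: pair[1],
    reverse=True,
)
for term, weight in term_weights:
    print(f"{term}: {weight:.3f}")
```

Words that were not seen during fitting (here "in") are silently ignored by `transform`, which is why they never appear in the output.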


In [32]:
model.predict(vectorized_test_obj)[0]

10

In [34]:
train_data.target_names[model.predict(vectorized_test_obj)[0]]

'rec.sport.hockey'

## Dump vectorizer and model

In [35]:
with open('news_vectorizer_dump_tfidf.pkl', 'wb') as output_file:
    joblib.dump(vectorizer, output_file)

In [36]:
with open('news_model_dump_rf.pkl', 'wb') as output_file:
    joblib.dump(model, output_file)
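
The dumped files are only useful if both can be loaded again and applied in the same order: vectorizer first, then model. The following round-trip sketch uses a hypothetical toy dataset, toy file names and a temporary directory, so it does not touch the two `.pkl` files written above. For the notebook's artifacts, `joblib.load` would be called on those file names instead.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the 20 newsgroups training set.
toy_texts = [
    "hockey puck ice goal",
    "ice hockey playoffs",
    "windows driver install",
    "install windows shareware",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = RandomForestClassifier(n_estimators=10, random_state=0)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.pkl")
    model_path = os.path.join(tmp_dir, "toy_model.pkl")

    # Persist both objects, then read them back as a fresh process would.
    joblib.dump(toy_vectorizer, vectorizer_path)
    joblib.dump(toy_model, model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)
    loaded_model = joblib.load(model_path)

# The restored pair should reproduce the original prediction.
new_text = ["hockey playoffs on ice"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_text))
restored_pred = loaded_model.predict(loaded_vectorizer.transform(new_text))
print(original_pred, restored_pred)
```

The vectorizer and the model have to be saved and versioned together, because the model's input columns are defined by the vocabulary the vectorizer learned.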