# Text Analysis Tutorial

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## Getting Started

I cloned the tutorial from the [scikit-learn GitHub repo](https://github.com/scikit-learn/scikit-learn):

`git clone git@github.com:scikit-learn/scikit-learn.git`

Then I copied the tutorial files to my own working directory. I have scikit-learn installed through conda, so I should be good to go.

## fetch_data.py

The script pulls paragraphs from Wikipedia pages in 11 different languages and stores them in the paragraphs folder. It also splits each paragraph into 5-word chunks and saves them as synthetic short-paragraph data.

### TIL

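* `np.array_split(array, n_groups)` - splits an array into `n_groups` subarrays of nearly equal size. It fixes the number of groups rather than the group size, so it is not quite Ruby's `each_slice(n)`; I wonder what the direct equivalent is.
* `urllib.request` module's `build_opener` constructs an opener object that can be used to make HTTP requests.
* `'%s_%04d.txt' % (lang, i)` - old-style `%` string formatting is good to review: `%s` inserts a string and `%04d` zero-pads an integer to 4 digits.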
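
As a rough answer to the `each_slice` question, here is a minimal pure-Python sketch. The helper name `each_slice` and the sample sentence are my own (hypothetical), not part of NumPy, the standard library, or `fetch_data.py`; it only illustrates fixed-size chunking next to the `%` formatting noted above.

```python
def each_slice(items, size):
    """Yield consecutive chunks of at most `size` items, like Ruby's each_slice."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample text: 12 words
words = "the quick brown fox jumps over the lazy dog again and again".split()

# Fixed chunk size of 5: the last chunk holds whatever is left over
chunks = list(each_slice(words, 5))
chunk_sizes = [len(chunk) for chunk in chunks]

# Old-style % formatting: %s inserts a string, %04d zero-pads an int to width 4
filename = '%s_%04d.txt' % ('en', 7)

print(chunk_sizes)
print(filename)
```

The difference from `np.array_split` is which quantity is fixed: here the chunk size is given and the number of chunks follows, while `np.array_split` takes the number of groups and works out the sizes.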

# Getting Started II

Okay, seems like the above setup was unnecessary since we're using the 20 newsgroups dataset...




In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [3]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [4]:
len(twenty_train.data)

2257

In [5]:
len(twenty_train.filenames)

2257

In [6]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



In [8]:
twenty_train.target_names[twenty_train.target[0]]

'comp.graphics'

## ML Formulation

* Input - Documents
* Labels - Categories, such as `comp.graphics`
* Features - Words in the document

### Bag of words

* Assume you have a dictionary that assigns an integer to every word in your corpus.
* Each document can be represented as a frequency vector
  * The i-th entry is the number of times the i-th word in the dictionary occurs in the document
* Most entries are zero, so `scipy.sparse` matrices represent these vectors efficiently by storing only the nonzero entries
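
To make the bag-of-words idea concrete, here is a small hand-rolled sketch in plain Python. The two toy documents and the helper names `to_count_vector` and `to_sparse` are assumptions of mine for illustration; scikit-learn's actual implementation tokenizes differently and stores the result in a `scipy.sparse` matrix rather than a dict.

```python
from collections import Counter

# Hypothetical toy corpus
docs = ["the cat sat", "the cat saw the dog"]

# Dictionary: assign an integer index to every distinct word in the corpus
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))


def to_count_vector(doc):
    """Dense count vector: entry i is how often word i occurs in doc."""
    counts = Counter(doc.split())
    # dicts keep insertion order, so words come out in index order
    return [counts.get(word, 0) for word in vocabulary]


def to_sparse(doc):
    """Sparse form: keep only the nonzero entries as {index: count}."""
    return {vocabulary[word]: count for word, count in Counter(doc.split()).items()}


vectors = [to_count_vector(doc) for doc in docs]
sparse_first = to_sparse(docs[0])

print(vocabulary)
print(vectors)
print(sparse_first)
```

With a real vocabulary of tens of thousands of words, almost every entry of the dense vector is zero, which is why the sparse form is the practical one.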

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

- `CountVectorizer` builds a vocabulary from the input data and a document-term count matrix
- Entry `(i, j)` of the matrix is the number of times the `j`-th vocabulary word occurs in document `i`
- The `vocabulary_` attribute is a dict mapping features (words) to column indices
- `get_feature_names()` gives you the vocabulary list (renamed to `get_feature_names_out()` in scikit-learn 1.0 and later)

## Tf-idf

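* Raw occurrence counts favor long documents, which have higher counts than short documents on the same topic, and give too much weight to words that are frequent everywhere
* Term frequency (tf) - normalize a document's count vector by the number of words in that document
* Inverse document frequency (idf) - downweight words that occur in many documents of the corpus, since they carry less information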
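
Here is a minimal sketch of the textbook tf-idf computation, to check my understanding. The toy corpus and the function names are hypothetical. As far as I know, `TfidfTransformer` does not use exactly these formulas: by default it smooths the idf as `ln((1 + n) / (1 + df)) + 1` and L2-normalizes each row, so its numbers will differ from the ones here.

```python
import math

# Hypothetical toy corpus, already tokenized
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "the", "dog", "ran"],
    ["the", "bird", "flew"],
]


def tf(term, doc):
    """Term frequency: count of the term divided by the document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Textbook inverse document frequency: log(N / df).

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tfidf(term, doc, docs):
    """Weight of a term in one document: tf times idf."""
    return tf(term, doc) * idf(term, docs)


# "the" is in every document, so its idf (and its tf-idf) is zero
weight_the = tfidf("the", docs[1], docs)
# "dog" is frequent in one document and absent elsewhere, so it scores high
weight_dog = tfidf("dog", docs[1], docs)

print(weight_the, weight_dog)
```

In this variant a word that appears in every document gets weight zero no matter how often it occurs, while a word concentrated in one document keeps a high weight.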

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

In [24]:
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)

In [25]:
# Train a Naive Bayes classifier on the labelled tfidf feature vectors
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [29]:
# Use the classifier to classify new unseen text.
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predictions = classifier.predict(X_new_tfidf)
for doc, category in zip(docs_new, predictions):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


In [30]:
# Compose all the above into a reusable pipeline
from sklearn.pipeline import Pipeline
text_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB()),
])
text_classifier.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [32]:
# Calculate classifier accuracy on held-out test set
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_classifier.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.8348868175765646

In [33]:
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

             micro avg       0.83      0.83      0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502



In [34]:
metrics.confusion_matrix(twenty_test.target, predicted)

array([[192,   2,   6, 119],
       [  2, 347,   4,  36],
       [  2,  11, 322,  61],
       [  2,   2,   1, 393]])
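
The classification report can be recomputed from the confusion matrix by hand, which helps in reading both. The sketch below copies the matrix from the output above and assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are my own.

```python
# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class)
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [
    [192,   2,   6, 119],
    [  2, 347,   4,  36],
    [  2,  11, 322,  61],
    [  2,   2,   1, 393],
]

total = sum(sum(row) for row in cm)
# Accuracy: correctly classified documents (the diagonal) over all documents
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

report = {}
for i, label in enumerate(labels):
    support = sum(cm[i])                         # documents whose true class is i
    predicted_as_i = sum(row[i] for row in cm)   # documents predicted as class i
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / support
    report[label] = (precision, recall, support)
    print('%-24s precision=%.2f recall=%.2f support=%d'
          % (label, precision, recall, support))

print('accuracy=%.4f over %d documents' % (accuracy, total))
```

The first row explains the lopsided numbers: 119 of the 319 `alt.atheism` documents were predicted as `soc.religion.christian`, which drags down recall for the former and precision for the latter.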

# Troubleshooting

```
$ python fetch_data.py 
Traceback (most recent call last):
  File "fetch_data.py", line 8, in <module>
    import lxml.html
  File "/home/atsui/anaconda3/lib/python3.7/site-packages/lxml/html/__init__.py", line 54, in <module>
    from .. import etree
ImportError: libicui18n.so.58: cannot open shared object file: No such file or directory
```

The error shows lxml looking for `libicui18n.so.58`, so I guess conda is missing the package that provides it.

```
$ conda search icu
Loading channels: done
# Name                       Version           Build  Channel             
icu                             54.1               0  pkgs/free           
icu                             58.2      h211956c_0  pkgs/main           
icu                             58.2      h9c2bf20_1  pkgs/main       
```

I guess my problem will be solved by installing icu 58.2. Don't I already have it?

```
$ conda list | grep icu
icu                       64.2                 he1b5a44_1    conda-forge
```

So I have icu 64.2 from conda-forge, not 58.x. Will I get in trouble if I install an older version...?

```
$ conda install icu=58.2
Collecting package metadata: done
Solving environment: \ 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::conda-build==3.17.6=py37_0
  - defaults/linux-64::lxml==4.2.5=py37hefd8a0e_0
  - defaults/linux-64::libxml2==2.9.8=h26e45fe_1
  - defaults/linux-64::qt==5.9.7=h5867ecd_1
  - defaults/linux-64::qtconsole==4.4.3=py37_0
  - conda-forge/linux-64::jupyter_contrib_nbextensions==0.5.1=py37_0
  - defaults/linux-64::pango==1.42.4=h049681c_0
  - defaults/linux-64::fontconfig==2.13.0=h9420a91_0
  - defaults/linux-64::libarchive==3.3.3=h5d8350f_5
  - defaults/linux-64::harfbuzz==1.8.8=hffaf4a1_0
  - defaults/linux-64::libxslt==1.1.32=h1312cb7_0
  - defaults/linux-64::anaconda==2018.12=py37_0
  - defaults/linux-64::python-libarchive-c==2.8=py37_6
  - defaults/linux-64::anaconda-navigator==1.9.6=py37_0
  - defaults/linux-64::navigator-updater==0.2.1=py37_0
  - defaults/linux-64::jupyter==1.0.0=py37_7
  - defaults/linux-64::pyqt==5.9.2=py37h05f1152_2
  - defaults/linux-64::spyder==3.3.2=py37_0
  - defaults/linux-64::matplotlib==3.0.2=py37h5429711_0
  - defaults/linux-64::cairo==1.14.12=h8948797_3
  - defaults/linux-64::seaborn==0.9.0=py37_0
  - defaults/linux-64::scikit-image==0.14.1=py37he6710b0_0
done

## Package Plan ##

  environment location: /home/atsui/anaconda3

  added / updated specs:
    - icu=58.2


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    arrow-cpp-0.11.1           |   py37h5c3f529_1         6.7 MB
    boost-cpp-1.67.0           |       h14c3975_4          11 KB
    ca-certificates-2020.1.1   |                0         132 KB
    certifi-2019.11.28         |           py37_0         156 KB
    conda-4.8.3                |           py37_0         3.0 MB
    conda-package-handling-1.6.0|   py37h7b6447c_0         872 KB
    glog-0.3.5                 |       hf484d3e_1         158 KB
    libboost-1.67.0            |       h46d08c1_4        20.9 MB
    openssl-1.1.1e             |       h7b6447c_0         3.8 MB
    pyarrow-0.11.1             |   py37he6710b0_0         1.9 MB
    thrift-cpp-0.11.0          |       h02b749d_3         2.3 MB
    ------------------------------------------------------------
                                           Total:        40.0 MB

The following NEW packages will be INSTALLED:

  conda-package-han~ pkgs/main/linux-64::conda-package-handling-1.6.0-py37h7b6447c_0
  libboost           pkgs/main/linux-64::libboost-1.67.0-h46d08c1_4

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2019.11.~ --> pkgs/main::ca-certificates-2020.1.1-0
  conda                                       4.6.14-py37_0 --> 4.8.3-py37_0
  openssl            conda-forge::openssl-1.1.1d-h516909a_0 --> pkgs/main::openssl-1.1.1e-h7b6447c_0

The following packages will be SUPERSEDED by a higher-priority channel:

  arrow-cpp          conda-forge::arrow-cpp-0.15.1-py37h98~ --> pkgs/main::arrow-cpp-0.11.1-py37h5c3f529_1
  boost-cpp          conda-forge::boost-cpp-1.70.0-h8e57a9~ --> pkgs/main::boost-cpp-1.67.0-h14c3975_4
  certifi                                       conda-forge --> pkgs/main
  glog                   conda-forge::glog-0.4.0-he1b5a44_1 --> pkgs/main::glog-0.3.5-hf484d3e_1
  icu                      conda-forge::icu-64.2-he1b5a44_1 --> pkgs/main::icu-58.2-h9c2bf20_1
  pyarrow            conda-forge::pyarrow-0.15.1-py37h8b68~ --> pkgs/main::pyarrow-0.11.1-py37he6710b0_0
  thrift-cpp         conda-forge::thrift-cpp-0.12.0-hf3afd~ --> pkgs/main::thrift-cpp-0.11.0-h02b749d_3
  zstd                   conda-forge::zstd-1.4.4-h3b9ef0a_1 --> pkgs/main::zstd-1.3.7-h0b5b093_0
```

Let's just go for it.

After installing icu 58.2, the Python script worked.