## Working With Text Data

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics.

In [1]:
import numpy as np
import pandas as pd

#### Import the data, limiting the selection to 4 classes

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [5]:
type(twenty_train)

sklearn.utils.Bunch

The returned dataset is a scikit-learn “bunch”: a simple holder object whose fields can be accessed either as Python dict keys or as object attributes, for convenience. For instance, target_names holds the list of the requested category names:

In [6]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
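
To make the two access styles concrete, here is a minimal sketch that builds a small bunch by hand instead of downloading the dataset. The object `toy` and its contents are made up for illustration; only the field names mirror those of twenty_train.

```python
from sklearn.utils import Bunch

# Hypothetical toy bunch; the contents are invented for this example.
toy = Bunch(
    data=["first post", "second post"],
    target_names=["alt.atheism", "comp.graphics"],
)

# The same field is reachable as a dict key or as an attribute.
by_key = toy["target_names"]
by_attr = toy.target_names
print(by_key is by_attr)

# Because a Bunch behaves like a dict, its field names can be listed.
print(sorted(toy.keys()))
```

Listing the keys this way is a quick way to discover which fields a loader such as fetch_20newsgroups returned.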

The files themselves are loaded in memory in the data attribute. 
For reference the filenames are also available:

In [7]:
len(twenty_train.data)

2257

In [8]:
len(twenty_train.filenames)

2257

In [15]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



For speed and space efficiency, scikit-learn loads the target attribute as an array of integers, each of which is the index of the sample's category name in the target_names list. The integer category id of each sample is therefore stored in the target attribute:

In [17]:
print(twenty_train.target[0])

1


In [16]:
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics
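
The same lookup can be done for many samples at once with NumPy indexing. The sketch below uses the four category names from above but a made-up `target` array, so the printed values are illustrative and are not the real labels of the dataset.

```python
import numpy as np

target_names = ["alt.atheism", "comp.graphics", "sci.med", "soc.religion.christian"]
# Made-up integer ids standing in for twenty_train.target.
target = np.array([1, 1, 3, 0, 2])

# Index an array of names with the integer ids to decode all samples at once.
names = np.array(target_names)[target]
print(names)

# Count how many samples fall into each category.
counts = np.bincount(target, minlength=len(target_names))
print(dict(zip(target_names, counts.tolist())))
```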


### Extracting features from text files
In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

#### Bags of words
The most intuitive way to do so is the bags of words representation:

1. assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
2. for each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.

The bags of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.
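
The two steps above can be sketched in plain Python on a toy corpus. The documents and the whitespace tokenization are assumptions made for illustration; scikit-learn's own tokenizer is more elaborate.

```python
# Toy corpus (made up for this example).
docs = ["the cat sat", "the dog sat on the mat"]

# Step 1: assign a fixed integer id to each distinct word.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] counts how often word j occurs in document i.
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        X[i][vocabulary[word]] += 1

print(vocabulary)
print(X)
```

Here n_features is `len(vocabulary)`, which grows with every new distinct word in the corpus.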

If n_samples == 10000, storing X as a numpy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM which is barely manageable on today’s computers.

Fortunately, most values in X will be zeros, since any given document uses fewer than a few thousand distinct words. For this reason we say that bags of words are typically high-dimensional sparse datasets. We can save a lot of memory by storing only the non-zero parts of the feature vectors.

scipy.sparse matrices are data structures that do exactly this, and scikit-learn has built-in support for these structures.
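
As a rough illustration of the savings, the sketch below first redoes the 4 GB arithmetic and then compares a small dense array with its CSR (compressed sparse row) equivalent. The matrix size and the number of non-zero entries are arbitrary choices for the example, not properties of the newsgroups data.

```python
import numpy as np
from scipy import sparse

# Dense cost from the text: 10,000 samples x 100,000 features x 4 bytes.
n_samples, n_features = 10_000, 100_000
dense_bytes = n_samples * n_features * 4
print(dense_bytes / 1e9, "GB")

# A much smaller, mostly-zero matrix that fits comfortably in memory.
rng = np.random.default_rng(0)
dense = np.zeros((1000, 5000), dtype=np.float32)
rows = rng.integers(0, 1000, size=2000)
cols = rng.integers(0, 5000, size=2000)
dense[rows, cols] = 1.0

# CSR stores only the non-zero values plus two index arrays.
X = sparse.csr_matrix(dense)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense.nbytes, "bytes dense vs", sparse_bytes, "bytes sparse")
```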

#### Tokenizing text with scikit-learn
Text preprocessing, tokenizing and (optionally) filtering of stop words are handled by a single high-level component, CountVectorizer, which builds a dictionary of features and transforms documents into feature vectors:

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)

In [26]:
list(count_vect.vocabulary_)[50]

'email'

In [27]:
len(count_vect.vocabulary_)

35788

The shape of the count matrix is (number of documents, number of words in the vocabulary), that is, (n_samples, n_features):

In [29]:
X_train_counts.shape

(2257, 35788)
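
The relation between the vocabulary and the matrix shape is easier to see on a corpus small enough to print in full. This is a minimal sketch on two made-up sentences, using CountVectorizer with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for this example).
docs = ["the cat sat", "the dog sat on the mat"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# One row per document, one column per vocabulary entry.
print(X.shape)

# vocabulary_ maps each word to its column index.
print(sorted(vect.vocabulary_.items(), key=lambda item: item[1]))

# The matrix is small enough here to densify and inspect.
print(X.toarray())
```

Densifying with `toarray()` is only reasonable for tiny examples like this one; on the full dataset the sparse form should be kept.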

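For instance, look up the word Michael, which appears in the first document. CountVectorizer lowercases the text by default (lowercase=True), so the vocabulary holds 'michael' rather than 'Michael', and the capitalized lookup raises a KeyError: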

In [32]:
count_vect.vocabulary_['Michael']

KeyError: 'Michael'

In [34]:
X_train_counts[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
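
A dense printout of one row hides the few non-zero counts among tens of thousands of zeros. The sketch below, on a made-up two-document corpus, shows the lowercase lookup succeeding and lists only the non-zero entries of the first row. The helper name `index_to_word` is introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up for this example).
docs = ["Michael asked Michael a question", "Please email any response"]

vect = CountVectorizer()  # lowercase=True by default
X = vect.fit_transform(docs)

# The capitalized form is absent; the lowercased form is present.
print("Michael" in vect.vocabulary_)
j = vect.vocabulary_["michael"]
print(X[0, j])

# Invert the vocabulary to map column indices back to words.
index_to_word = {idx: word for word, idx in vect.vocabulary_.items()}

# Show only the non-zero counts of the first document.
row = X[0].toarray().ravel()
counts = {index_to_word[k]: int(row[k]) for k in row.nonzero()[0]}
print(counts)
```

Note that the single-letter word "a" does not appear in the counts: with the default token pattern, CountVectorizer keeps only tokens of two or more characters.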