## Text classification using Python

In this task, I want to go through a collection of news stories and classify them into topics such as sports, politics, and finance.

I will use NLTK's Reuters news corpus for this exercise.

Let's start by importing the corpus and checking the total number of articles to be classified.

In [15]:
from nltk.corpus import reuters
len(reuters.fileids())

10788

1. Lowercasing, tokenization and stop words

I will first work through all the steps on one article from the corpus, then convert the steps into a repeatable method that can be plugged into the overall pipeline.

In [16]:
# import numpy to facilitate data manipulation
import numpy as np

# convert the list of fileids into a numpy array. There may be a better way to do this though
np_fileids = np.array(reuters.fileids())

# observe the shape of the numpy array
print('Shape :: ', np_fileids.shape)

# Observe the raw text of the first article in the corpus (first 300 characters)
print('Raw text :: ', reuters.raw(np_fileids[0])[0:300])

Shape ::  (10788,)
Raw text ::  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian 


Let's start with lowercasing.

In [17]:
# Use the built-in lower() method for strings
article_lower = reuters.raw(np_fileids[0]).lower()

# Observe the converted article text
print('Lower case :: ', article_lower[:300])

Lower case ::  asian exporters fear damage from u.s.-japan rift
  mounting trade friction between the
  u.s. and japan has raised fears among many of asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      they told reuter correspondents in asian 


Next up, I am going to use NLTK's stop word list to remove the stop words.

__TODO:__ Use scikit-learn's TF-IDF vectorizer to generate a custom stop word list from the Reuters corpus.
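
A minimal sketch of how such a custom stop word list might be derived. It is written in plain Python so it runs without extra dependencies, and it scores each term with the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`. The function name `candidate_stop_words`, the `max_idf` threshold, and the toy corpus are all hypothetical and only serve the illustration.

```python
import math
import re
from collections import Counter


def candidate_stop_words(documents, max_idf=1.3):
    """Return terms whose smoothed IDF is at most max_idf, sorted alphabetically.

    A low IDF means the term occurs in most documents, which makes it a
    candidate stop word. The threshold is an assumption and needs tuning.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for text in documents:
        # count each term at most once per document
        doc_freq.update(set(re.findall(r"[a-z']+", text.lower())))
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return sorted(term for term, score in idf.items() if score <= max_idf)


# toy stand-in for the Reuters articles (hypothetical data)
toy_corpus = [
    "the trade deficit widened in the first quarter",
    "the central bank raised the interest rate",
    "wheat exports rose as the harvest improved",
    "oil prices fell in the spot market",
]
print(candidate_stop_words(toy_corpus))
```

With scikit-learn itself, the same scores should be readable from the `idf_` attribute of a fitted `TfidfVectorizer`. A suitable threshold for the real corpus would have to be found by inspecting the lowest-scoring terms.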

In [40]:
# import the stopwords list from nltk corpus
from nltk.corpus import stopwords
# import the word_tokenize function from nltk
from nltk.tokenize import word_tokenize

# generate a set of stop words
stop_words = set(stopwords.words('english'))

stop_words.add('.')

# tokenize the article text
word_tokens = word_tokenize(article_lower)

# collect the filtered tokens of the article in a list
article_filtered = []

# iterate through the article's word tokens and skip those that appear in the stop word set
for w in word_tokens:
    if w not in stop_words:
        article_filtered.append(w)

# a more concise form of the same logic; it rebuilds the list from scratch
article_filtered = [w for w in word_tokens if w not in stop_words]
        
print('Original length of article :: ', len(word_tokens))
print('Filtered length of article :: ', len(article_filtered))
# observe a part of the filtered article
print(article_filtered[:100])

# put the article back together for our next step
raw_filtered = " ".join(article_filtered)
    
raw_filtered[:100]

Original length of article ::  811
Filtered length of article ::  529
['asian', 'exporters', 'fear', 'damage', 'u.s.-japan', 'rift', 'mounting', 'trade', 'friction', 'u.s.', 'japan', 'raised', 'fears', 'among', 'many', 'asia', "'s", 'exporting', 'nations', 'row', 'could', 'inflict', 'far-reaching', 'economic', 'damage', ',', 'businessmen', 'officials', 'said', 'told', 'reuter', 'correspondents', 'asian', 'capitals', 'u.s.', 'move', 'japan', 'might', 'boost', 'protectionist', 'sentiment', 'u.s.', 'lead', 'curbs', 'american', 'imports', 'products', 'exporters', 'said', 'conflict', 'would', 'hurt', 'long-run', ',', 'short-term', 'tokyo', "'s", 'loss', 'might', 'gain', 'u.s.', 'said', 'impose', '300', 'mln', 'dlrs', 'tariffs', 'imports', 'japanese', 'electronics', 'goods', 'april', '17', ',', 'retaliation', 'japan', "'s", 'alleged', 'failure', 'stick', 'pact', 'sell', 'semiconductors', 'world', 'markets', 'cost', 'unofficial', 'japanese', 'estimates', 'put', 'impact', 'tariffs', '10', 'bil

'asian exporters fear damage u.s.-japan rift mounting trade friction u.s. japan raised fears among ma'
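
The lowercasing, tokenization, and stop word steps above could be wrapped into one repeatable function along these lines. This is only a sketch: `preprocess`, `simple_tokenize`, and `DEFAULT_STOP_WORDS` are hypothetical names, and the regex tokenizer and the tiny stop word set are stand-ins so the example runs without NLTK data. They will not split text exactly the way `word_tokenize` does.

```python
import re

# small illustrative stop word set (an assumption, not NLTK's English list)
DEFAULT_STOP_WORDS = frozenset(
    {"the", "and", "has", "of", "from", "between", "that", "among", "many", "."}
)


def simple_tokenize(text):
    """Fallback tokenizer: words (with inner '.', "'" or '-') or single punctuation marks."""
    return re.findall(r"\w+(?:[.'-]\w+)*|[^\w\s]", text)


def preprocess(text, tokenize=simple_tokenize, stop_words=DEFAULT_STOP_WORDS):
    """Lowercase the text, tokenize it, drop stop words, and rejoin the rest."""
    tokens = tokenize(text.lower())
    return " ".join(token for token in tokens if token not in stop_words)


sample = "Mounting trade friction between the U.S. and Japan has raised fears."
print(preprocess(sample))
```

To reproduce the cell above, the function can be called as `preprocess(reuters.raw(fileid), tokenize=word_tokenize, stop_words=stop_words)`, which swaps in NLTK's tokenizer and stop word set.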

I am planning to skip stemming and only do lemmatization. I will use spaCy for the lemmatization, which is why the tokens were recombined into raw text above.

__TODO:__ Get rid of this roundabout way of lemmatizing.
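
One possible way to remove the round trip is to let spaCy handle the raw article directly and filter on its token attributes (`is_stop`, `is_punct`, `is_space`, `lemma_`) in a single pass. The sketch below assumes that approach. `lemmatize_and_filter` and `fake_token` are hypothetical helpers, and the stand-in tokens exist only so the example runs without a spaCy model installed. Note that spaCy's own stop word list differs from NLTK's, so the result would not match the earlier output exactly.

```python
from types import SimpleNamespace


def lemmatize_and_filter(doc):
    """Return lowercased lemmas of tokens that are not stop words, punctuation, or whitespace.

    `doc` can be a spaCy Doc or any iterable of objects exposing
    lemma_, is_stop, is_punct and is_space.
    """
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


def fake_token(lemma, is_stop=False, is_punct=False, is_space=False):
    """Build a stand-in object mimicking the spaCy token attributes used above."""
    return SimpleNamespace(
        lemma_=lemma, is_stop=is_stop, is_punct=is_punct, is_space=is_space
    )


# hand-made stand-in for a parsed sentence (hypothetical data)
fake_doc = [
    fake_token("the", is_stop=True),
    fake_token("exporter"),
    fake_token(",", is_punct=True),
    fake_token("fear"),
]
print(lemmatize_and_filter(fake_doc))
```

With a loaded pipeline this would be called as `lemmatize_and_filter(nlp(reuters.raw(fileid)))`. Because the lemmatizer then sees complete sentences rather than a stop-word-stripped string, the lemmas are likely to be at least as good as with the recombined text.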

In [56]:
# import spacy
import spacy

# load the built-in English model; the 'en' shortcut used originally only exists in spaCy 2.x,
# so the full model name is used here (this assumes en_core_web_sm is installed)
nlp = spacy.load('en_core_web_sm')

# run the spaCy pipeline on the filtered text
%time doc = nlp(raw_filtered)

# collect the lemma of every token in the article
%time tokens_lemmatized = [token.lemma_ for token in doc]
    
# put the lemmatized tokens back together into a sentence
article_lemmatized = ' '.join(tokens_lemmatized)

# Observe the lemmatized article text
article_lemmatized

CPU times: user 43.6 ms, sys: 108 µs, total: 43.7 ms
Wall time: 43.7 ms
CPU times: user 240 µs, sys: 57 µs, total: 297 µs
Wall time: 301 µs


"asian exporter fear damage u.s .- japan rift mount trade friction u.s . japan raise fear among many asia 's export nation row could inflict far - reach economic damage , businessman official say tell reuter correspondent asian capital u.s . move japan may boost protectionist sentiment u.s . lead curb american import product exporter say conflict would hurt long - run , short - term tokyo 's loss may gain u.s . say impose 300 mln dlrs tariff import japanese electronic good april 17 , retaliation japan 's alleged failure stick pact sell semiconductor world market cost unofficial japanese estimate put impact tariff 10 billion dlr spokesman major electronic firm say would virtually halt export product hit new tax `` would n't able business , '' say spokesman lead japanese electronic firm matsushita electric industrial co ltd & lt ; mc.t > `` tariff remain place length time beyond month mean complete erosion export ( good subject tariff ) u.s . , '' say tom murtha , stock analyst tokyo off