# Getting started with Python for text processing

# Part 0 : Installation
The easiest way to get most of the packages you need is to install the Anaconda distribution:
http://continuum.io/downloads

You will also need the nltk library and, optionally, TextBlob for extra functionality:
Installing nltk: http://www.nltk.org/install.html
Installing nltk data: http://www.nltk.org/data.html 
[Optional] Installing TextBlob: https://textblob.readthedocs.org/en/dev/

___

# Part 1: Getting started

## Variables

Variables are created by assigning a value to a name of your choice; you do not need to declare their types explicitly:

In [43]:
var_str = 'Hello'
var_int = 2
var_float = 2.0

# Variable types:
print type(var_float)
print type(var_str)
print isinstance(var_str, str)


<type 'float'>
<type 'str'>
True


In [44]:
# Two important collection types in Python are lists and sets
mylist = [1,2,5,7]
myset = {'a','b'}

# dictionaries provide mappings:
mydict = {'cat': 1, 'dog': 2}
print mydict['cat']
print mydict.keys()
print mydict.values()


1
['dog', 'cat']
[2, 1]


In [45]:
# Accessing members of a list by index
print mylist[0]

1


In [46]:
# Adding elements
mylist.append(8)
print mylist

myset.add('c')
myset.add('b')
print myset

[1, 2, 5, 7, 8]
set(['a', 'c', 'b'])
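
The outputs above list the dictionary keys and the set elements in an arbitrary order. When a predictable order matters, one option is to sort explicitly. The following is a minimal sketch written with ``print()`` so that it also runs on Python 3; the variable names ``sorted_keys``, ``sorted_items`` and ``sorted_set`` are illustrative only.

```python
# Same example data as in the cells above.
mydict = {'cat': 1, 'dog': 2}
myset = {'a', 'b'}
myset.add('c')
myset.add('b')  # adding an element that is already present leaves the set unchanged

# sorted() returns a new list in a well-defined order,
# regardless of how the dict or set stores its elements internally.
sorted_keys = sorted(mydict.keys())
sorted_items = sorted(mydict.items())
sorted_set = sorted(myset)

print(sorted_keys)    # ['cat', 'dog']
print(sorted_items)   # [('cat', 1), ('dog', 2)]
print(sorted_set)     # ['a', 'b', 'c']
print(len(myset))     # 3
```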


___
## Operations on variables
Python supports the standard arithmetic operators:
``+``, ``-``, ``*``, ``/``, ``//``, ``%``, and ``**`` (exponentiation)

In [11]:
a = 10
print a*2 + a**2 

a += 1   # i.e: a = a + 1
print a


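# Note: in Python 2, the ``/`` operator performs integer division when both operands are integers.
# (The first result below is 3.66666666667 rather than 3, presumably because the
# ``__future__`` import further down was already active in this session.)
print a/3
# If you need float division, you can explicitly divide by a float
print a/3.0
print a/float(3)

# If you have lots of those operations, put the following import at the top of your .py file:
from __future__ import division
print a/3

# The ``//`` operator explicitly does floor division (the result is a float if an operand is a float)
print a//float(3)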

# operator ``%`` is the modulo operator
print a%3

120
11
3.66666666667
3.66666666667
3.66666666667
3.66666666667
3.0
2
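
To make the difference between the division operators explicit, here is a small sketch that gives the same results on Python 2 and Python 3 because it never relies on ``/`` with two integers. The variable names are illustrative, and ``divmod`` is a built-in that was not used above.

```python
a = 11

true_div = a / 3.0            # float division: one operand is a float
floor_div = a // 3            # floor division of two integers gives an integer
floor_div_float = a // 3.0    # floor division with a float operand gives a float
remainder = a % 3             # remainder of the division

# divmod() returns the floor quotient and the remainder in one call.
quotient, rest = divmod(a, 3)

# Floor division rounds toward negative infinity, not toward zero.
neg_floor = -a // 3

print(floor_div)         # 3
print(floor_div_float)   # 3.0
print(remainder)         # 2
print(neg_floor)         # -4
```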


___


## More list and string operations

In [19]:
# Access list elements by index range (slicing)
a = range(10) # i.e. a = [0,1,2,...,9]
print a

print a[2:5]
print a[-1]
print a[-4:-2]
print a[:4]

# Concatenate, slice and search strings
a = 'str1 '
b = 'str2 '
print a + b
print a[2:]
print a.find('t')
print a.find('G')
print a[a.find('t'):]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 3, 4]
9
[6, 7]
[0, 1, 2, 3]
str1 str2 
r1 
1
-1
tr1 


___

## Comprehensions

Comprehensions provide a concise way to create lists, sets, and dictionaries. Common applications are building a new collection in which each element is the result of an operation applied to each member of another sequence or iterable, and selecting the subsequence of elements that satisfy a certain condition.

In [39]:
a = []
for i in range(10):
    a.append(i)

# List comprehension:
b = [i for i in range(10)] # Concise way of defining a list
print a == b

# Set comprehension:
a = {x**2 for x in {-2,-1,1,2}}
print a

# Dictionary comprehension:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
print zip(keys, values)
a = {key: value for key, value in zip(keys, values)}
print a
print a['a']



True
set([1, 4])
[('a', 1), ('b', 2), ('c', 3)]
{'a': 1, 'c': 3, 'b': 2}
1
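
The second use mentioned above, keeping only the elements that satisfy a condition, can be sketched by adding an ``if`` clause to the comprehension. The data and variable names below are made up for illustration, and the code is written so that it also runs on Python 3.

```python
numbers = range(10)

# List comprehension with a condition: squares of the even numbers only.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]

words = ['Term', 'term', 'Terms', 'text']

# Set comprehension: duplicates disappear after lowercasing.
unique_lower = {w.lower() for w in words}

# Dictionary comprehension with a condition: lengths of words longer than 4 characters.
long_word_lengths = {w: len(w) for w in words if len(w) > 4}

print(even_squares)          # [0, 4, 16, 36, 64]
print(sorted(unique_lower))  # ['term', 'terms', 'text']
print(long_word_lengths)     # {'Terms': 5}
```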


___

## Flow control
The main flow control statements in Python are ``if``, ``for``, and ``while``.

Python does not delimit code blocks with braces such as ``{`` ``}``.
Instead, code blocks are identified by their indentation.

In [40]:
a = range(10)

for element in a:
    if element > 8:
        print element

evens = []
odds = []
for element in a:
    if element % 2 == 0:
        evens.append(element)
    elif element % 2 == 1:
        odds.append(element)

print 'evens: ' + str(evens)
print 'odds: ' + str(odds)

while len(a) < 12:
    a.append(30)

print a

9
evens: [0, 2, 4, 6, 8]
odds: [1, 3, 5, 7, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 30]


___

## Functions

In Python, the ``def`` keyword (short for "define") is used to define functions.

In [47]:
def square(x):
    return x**2

def first_n_items(x, n):
    return x[:n]

print square(21)
print first_n_items(range(5, 20), 3)


441
[5, 6, 7]


In [49]:
# Functions without return statements return None

def print_line(a): # Prints a list with each element in a separate line
    for element in a:
        print element
        
mylist = range(3)
print_line(mylist)
print print_line(mylist)

# Note that the last statement first runs the function and then
#    prints its return value, which in this case is None

0
1
2
0
1
2
None


___

## Classes

Python supports object-oriented programming. The ``class`` statement creates a new class definition. We only give a very basic introduction to defining classes here. For more information, please refer to the Python documentation: https://docs.python.org/2/tutorial/classes.html



In [34]:
class Dog:
    kind = 'canine'         # class variable shared by all instances

    def __init__(self, name, breed=None):
        '''
        __init__ is called right after an instance of the class is created.
        Arguments can be mandatory or optional; optional arguments are assigned a default value.
        '''
        self.name = name    # instance variable unique to each instance
        if breed is not None:
            self.breed = breed
        else:
            self.breed = 'Unknown'

d = Dog('Fido')
print d.kind
print d.name
print d.breed

d = Dog('Buddy', 'Labrador')
print d.breed

canine
Fido
Unknown
Labrador
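
Classes usually also define methods, which the example above does not show. The sketch below is one possible variant of the ``Dog`` class: the ``describe`` method is hypothetical (it is not part of the original example), and the default value ``'Unknown'`` is placed directly in the signature instead of being set through an ``if``/``else``.

```python
class Dog(object):
    kind = 'canine'  # class variable shared by all instances

    def __init__(self, name, breed='Unknown'):
        # Instance variables are unique to each instance.
        self.name = name
        self.breed = breed

    def describe(self):
        # Hypothetical helper method: methods receive the instance as ``self``.
        return '%s (%s, %s)' % (self.name, self.breed, self.kind)


fido = Dog('Fido')
buddy = Dog('Buddy', 'Labrador')

print(fido.describe())   # Fido (Unknown, canine)
print(buddy.describe())  # Buddy (Labrador, canine)
```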


___

## Imports
When you want to use a package that is not loaded by default, you need to ``import`` it. A whole library is imported with ``import [libname]``. You can also import individual functions with ``from [libname] import [funcname]``, and rename them on import with ``as``.

In [None]:
import nltk
from nltk import word_tokenize
from nltk import word_tokenize as wrd_tok
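
The three import forms behave the same way for any package. As a sketch that does not require nltk to be installed, the example below uses ``Counter`` from the standard library ``collections`` module; the alias ``C`` is arbitrary.

```python
import collections                    # access names as collections.<name>
from collections import Counter       # import a single name directly
from collections import Counter as C  # import a single name under an alias

# All three names refer to the same class.
counts_a = collections.Counter('banana')
counts_b = Counter('banana')
counts_c = C('banana')

print(counts_a == counts_b == counts_c)  # True
print(counts_a['a'])                     # 3
```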

***


# Part 1: Preprocessing text

Textual data come in different formats. The simplest form is a flat text file.

## 1-1 I/O Operations, accessing content of the files

Python provides an easy way to perform I/O operations on files:

In [36]:
file_path = '/tmp/test.txt'
text = 'This is a sample text to be written to file.'

with open(file_path, 'w') as file: # the second argument is the mode; ``w`` means write
    file.write(text)
    
# Now the file at ``file_path`` contains the desired text
# Let's read it with python:

with open(file_path, 'r') as file:
    content = file.read()
    
print content

# Sometimes you need to handle international characters
# In that case you can use the ``codecs`` package

import codecs
with codecs.open(file_path, 'r', encoding='utf-8') as file:
    contents = file.read()

print contents


This is a sample text to be written to file.
This is a sample text to be written to file.
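
Larger files are often read line by line rather than in one ``read()`` call. The following sketch assumes a hypothetical file name inside the system's temporary directory and uses ``io.open``, which accepts an ``encoding`` argument on both Python 2 and Python 3 (an alternative to the ``codecs`` approach shown above).

```python
import io
import os
import tempfile

# Hypothetical file location, chosen for illustration only.
file_path = os.path.join(tempfile.gettempdir(), 'sample_lines.txt')
text = u'caf\u00e9\nna\u00efve\nplain\n'  # includes non-ASCII characters

with io.open(file_path, 'w', encoding='utf-8') as handle:
    handle.write(text)

# Iterating over the file object yields one line at a time.
with io.open(file_path, 'r', encoding='utf-8') as handle:
    lines = [line.rstrip('\n') for line in handle]

print(len(lines))  # 3

os.remove(file_path)  # clean up the temporary file
```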


## 1-2 Preprocessing using Python built-in functions

### Tokenization or splitting into words
Tokenization, i.e. splitting text into words or other units, is an essential part of any text processing application.

In [69]:
sentence = """Albert Einstein (14 March 1879 - 18 April 1955) was a German-born theoretical physicist. He developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics).
Einstein's work is also known for its influence on the philosophy of science. Einstein is best known in popular culture for his mass-energy equivalence formula E = mc2 (which has been dubbed "the world's most famous equation").
He received the 1921 Nobel Prize in Physics for his "services to theoretical physics", in particular his discovery of the law of the photoelectric effect, a pivotal step in the evolution of quantum theory.
"""
a = sentence.split()
print a[:10]


['Albert', 'Einstein', '(14', 'March', '1879', '-', '18', 'April', '1955)', 'was']


In [56]:
# Join lists to form a sentence
print ' '.join(a[:10])

Albert Einstein (14 March 1879 - 18 April 1955) was


As you can see, Python's built-in ``split`` only splits on whitespace. In many cases this is not optimal (e.g. in the above text ``(14`` is a single token, although the parenthesis and the number should be separate tokens). We will now use nltk to do better preprocessing.
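
For comparison, punctuation can also be separated with the standard library alone. The sketch below uses a regular expression chosen for this illustration (it is not the rule set that nltk applies) on a shortened version of the example text.

```python
import re

sample = 'Albert Einstein (14 March 1879 - 18 April 1955) was a German-born physicist.'

# \w+(?:-\w+)*  matches a word, keeping hyphenated words such as "German-born" together
# [^\w\s]       matches any single character that is neither a word character nor whitespace
token_pattern = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
tokens = token_pattern.findall(sample)

print(tokens[:10])
# ['Albert', 'Einstein', '(', '14', 'March', '1879', '-', '18', 'April', '1955']
```

A single pattern like this is easy to adapt, but it does not handle cases such as abbreviations or contractions, which is where a dedicated tokenizer helps.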

___

## 1-3 nltk
### Sentence tokenization (sentence boundary detection):
You can use the Punkt sentence tokenizer to split a text into sentences.
This tokenizer divides a text into a list of sentences using an unsupervised algorithm.


In [61]:
from nltk.tokenize import sent_tokenize
sents = sent_tokenize(sentence)
for sent in sents:
    print sent[:50] + ' ...'

Albert Einstein (14 March 1879 - 18 April 1955) wa ...
He developed the general theory of relativity, one ...
Einstein's work is also known for its influence on ...
Einstein is best known in popular culture for his  ...
He received the 1921 Nobel Prize in Physics for hi ...


### Tokenizing into words

In [62]:
# A tokenizer that uses the TreeBankTokenizer
# Notice that parentheses are handled correctly here
from nltk.tokenize import word_tokenize

words = word_tokenize(sentence)

print words[:10]

['Albert', 'Einstein', '(', '14', 'March', '1879', '-', '18', 'April', '1955']


In [64]:
# Another tokenizer
from nltk.tokenize import WordPunctTokenizer
words = WordPunctTokenizer().tokenize(sentence)
print words[:10]

['Albert', 'Einstein', '(', '14', 'March', '1879', '-', '18', 'April', '1955']


### Normalizing Text
It often happens that you need to handle terms such as 'Term', 'term' and 'Terms' in a similar way and consider them identical. To do that, you need to lowercase all the terms and also apply stemming:

In [71]:
import nltk
porter = nltk.PorterStemmer()
words = [porter.stem(w) for w in word_tokenize(sentence.lower())] # Comprehension to stem words
print words[:10]

[u'albert', u'einstein', u'(', u'14', u'march', u'1879', u'-', u'18', u'april', u'1955']


In [72]:
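# Lemmatization: mapping terms to their lemma (dictionary form)
# Without a part-of-speech argument, lemmatize() treats every token as a noun,
# which is why 'was' becomes 'wa' in the output below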
import nltk
lemmatizer = nltk.WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in word_tokenize(sentence.lower())]
print words[10:20]

[')', u'wa', 'a', 'german-born', 'theoretical', 'physicist', '.', 'he', 'developed', 'the']
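
To illustrate the idea behind normalization without any external library, here is a deliberately simplistic sketch. The function ``toy_stem`` and its three suffix rules are made up for this example; this is not the Porter algorithm, which applies a much larger and more careful rule set.

```python
def toy_stem(word):
    """Lowercase a word and strip at most one common English suffix (toy rules)."""
    word = word.lower()
    for suffix in ('ing', 'ed', 's'):
        # Only strip the suffix if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


terms = ['Term', 'term', 'Terms', 'developed', 'was']
stems = [toy_stem(t) for t in terms]

print(stems)  # ['term', 'term', 'term', 'develop', 'was']
```

With these rules 'Term', 'term' and 'Terms' all map to the same string, which is the effect that lowercasing and stemming aim for.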


___
### Part of speech tagging

Part-of-speech (POS) tags provide additional information that can help in various text analysis tasks. nltk provides an easy way to extract POS tags. To use it, you first need to download ``taggers/maxent_treebank_pos_tagger/english.pickle`` with ``nltk.download()``:

In [113]:
import nltk
text = nltk.word_tokenize('This is a sample sentence for which we want to extract part of speech tags.')
print nltk.pos_tag(text)[:5]

# To get help about specific tags use nltk help module
nltk.help.upenn_tagset('VBZ')

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NN'), ('sentence', 'NN')]
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...


___
# Part 2: Text classification

Text is unstructured, and in order to find interesting patterns in it you often need to convert it into structured data first. For example, if you want to classify texts into different categories, you need to turn the free-form text into features and then use those features in your classifier.
Here, we walk you through text classification using Python and scikit-learn [1].

[1] reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html


In [93]:
# First, let's get some data
from sklearn.datasets import fetch_20newsgroups
train = fetch_20newsgroups(subset='train',
    shuffle=True, random_state=2)
print train.data[100]
print train.target[100]
print train.target_names[train.target[100]]

From: rsilver@world.std.com (Richard Silver)
Subject: Barbecued foods and health risk
Organization: The World Public Access UNIX, Brookline, MA
Lines: 10


Some recent postings remind me that I had read about risks 
associated with the barbecuing of foods, namely that carcinogens 
are generated. Is this a valid concern? If so, is it a function 
of the smoke or the elevated temperatures? Is it a function of 
the cooking elements, wood or charcoal vs. lava rocks? I wish 
to know more. Thanks. 


 

13
sci.med


In [94]:
# We want to assign each text to its corresponding category
# To do that, we first need to extract features

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train.data)
print count_vect.vocabulary_.get(u'algorithm')

27366
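
Conceptually, the count features map every distinct term to a column index and store, for each document, how often each term occurs. The sketch below reproduces that idea in plain Python on two made-up documents. It only splits on whitespace, whereas ``CountVectorizer`` applies its own lowercasing and tokenization and returns a sparse matrix, so this is an illustration of the principle rather than of its implementation.

```python
from collections import Counter

# Two toy documents, invented for this example.
docs = ['the cat sat', 'the dog sat on the cat']
tokenized = [doc.split() for doc in docs]

# Vocabulary: each distinct term gets a column index (alphabetical order here).
all_terms = sorted(set(term for doc in tokenized for term in doc))
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One count vector per document, with one entry per vocabulary term.
count_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    count_vectors.append([counts[term] for term in all_terms])

print(vocabulary['dog'])  # 1
print(count_vectors[0])   # [1, 0, 0, 1, 1]
print(count_vectors[1])   # [1, 1, 1, 1, 2]
```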


In [97]:
# Raw occurrence counts have an issue: longer documents tend to have higher counts
# So we can normalize the counts by dividing each one by the total number of words in its document
# We also want terms that occur in all documents to have a lower impact; this is achieved by weighting with idf values

# Term Frequency features:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

# Term Frequency - Inverse Document Frequency features:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
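
The weighting idea described in the comments above can be sketched with the textbook definition tf-idf(t, d) = tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. The helper functions and toy documents below are made up for illustration. ``TfidfTransformer`` uses a smoothed idf and normalizes each document vector by default, so its numbers will differ from this sketch.

```python
import math

# Three toy documents, already tokenized (invented for this example).
docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat', 'on', 'the', 'cat'],
    ['the', 'bird', 'flew'],
]


def term_frequency(term, doc):
    # Occurrences of the term divided by the document length.
    return doc.count(term) / float(len(doc))


def inverse_document_frequency(term, docs):
    # Assumes the term occurs in at least one document.
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / float(doc_freq))


def tfidf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)


score_the = tfidf('the', docs[0], docs)    # occurs in every document
score_cat = tfidf('cat', docs[0], docs)    # occurs in two of three documents
score_bird = tfidf('bird', docs[2], docs)  # occurs in one document only

print(score_the)               # 0.0
print(score_bird > score_cat)  # True
```

A term that appears in every document gets a weight of zero, while rarer terms get higher weights.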

In [103]:
# Now that we have all the features, let's train the classifier and evaluate its performance
from sklearn.linear_model import SGDClassifier
import numpy as np
clf = SGDClassifier().fit(X_train_tfidf, train.target)
test = fetch_20newsgroups(subset='test',
    shuffle=True, random_state=2)
X_test_counts = count_vect.transform(test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)
print np.mean(predicted == test.target)    

0.85116834838
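
The number printed above is the accuracy, i.e. the fraction of test documents whose predicted category equals the true one. As a plain-Python sketch of the same computation, with made-up label lists standing in for ``predicted`` and ``test.target``:

```python
# Hypothetical predicted and true category indices for five documents.
predicted_labels = [0, 2, 1, 1, 0]
true_labels = [0, 1, 1, 1, 0]

# Count the positions where the prediction matches the true label.
correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
accuracy = correct / float(len(true_labels))

print(correct)   # 4
print(accuracy)  # 0.8
```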


In [104]:
# Let's get the categories of some new documents:

docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => rec.autos
