# Part 2: Create a Text Analysis sevice
## Introduction to Machine Learning

In this chapter, we won’t be doing the classic introduction to Machine Learning.
You can find a lot of good classical introductions that usually start with a drawing
which shows how a black box is a function that maps inputs to outputs, but I am
going for a different, more intuitive approach of understanding ML.
Machine Learning is a form of AI which, in general terms, implies performing some
reasoning given some previous experience.
At the other end of the spectrum we have classic AI reasoning paradigms, and here
some of the most popular:

- **Rule Based Systems**: given a set of rules, figure out which ones apply. The
process would look like this: apply the rules, get the results and by interpreting
them, take one decision or another.
- **Case Based Reasoning**: given a list of previous cases and the correct label, get
the most similar ones and figure out a way to use the labels of these cases to
come up with a label for a new given case.
- **Knowledge Based Reasoning**: use a graph-like taxonomy of the world to
perform reasoning (e.g. if all mammals give birth to living babies & dolphins
are fish & dolphins are mammals, then there are some fish that give birth to
alive babies)


Machine learning is basically a way of finding out what are the best rules for some
given data and also a way of updating the system when we get new information.
Let’s take an example to better understand this concept. Suppose we want to predict
the weather and, from a classic approach, we can build the following rules:

- if in the previous day it rained and today the sky is cloudy then today it will
rain
-  if today the humidity is above a threshold T and the sky is cloudy then today it
will rain
- otherwise it won’t rain today

## A Practical Machine Learning Example

A while ago, I have written on my blog a post about a practical Machine Learning
example. I will use the same example in this book, because it fits very well. We
will get started with Machine Learning by building a predictor that is able to tell
if a name is a girl name or a boy name. I think that this example is one of the best
starting points, and here’s why:
-  it’s works on text and is thus, representative for this book’s subject

- it forces you to do actual feature extraction, rather than getting them from a
table
- it models a problem that is familiar rather than dealing with numbers or
abstract concepts

But before diving into the ML example, I want to continue the analogy with human
reasoning from the previous example (predicting the weather). Building the rules
for the weather prediction system, we took into consideration things like:

- Is the temperature above threshold T? (Y/N)
- Is the humidity below threshold H? (Y/N)
- Did it rain yesterday? (Y/N)

It is obvious that we’re not taking into consideration all the information we have.
We don’t care that 100 days ago it rained, 101 days ago it didn’t, and 102 days ago it
poured cats and dogs. Some information might be relevant, but we don’t use it. We
try to simplify the world we live in, aiming only to extract some of its features.
Now let’s dive into the machine learning problem, starting by examining the data a
bit:

**Names Corpus**

In [1]:
import collections
from nltk.corpus import names
# Check the girl names
girl_names = names.words('female.txt')
print(girl_names[:10], '...')

['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abigael', 'Abigail', 'Abigale'] ...


In [2]:
print('#GirlNames=', len(girl_names)) # 5001

#GirlNames= 5001


In [3]:
# Check the boy names
boy_names = names.words('male.txt')
print(boy_names[:10], '...')

['Aamir', 'Aaron', 'Abbey', 'Abbie', 'Abbot', 'Abbott', 'Abby', 'Abdel', 'Abdul', 'Abdulkarim'] ...


In [4]:
print('#BoyNames=', len(boy_names)) # 2943

#BoyNames= 2943


In [5]:
# We know that there are a lot of girl names that end in `a`
# Let's see how many
girl_names_ending_in_a = [name for name in girl_names if name.endswith('a')]
print('#GirlNamesEndingInA=', len(girl_names_ending_in_a)) # 1773

#GirlNamesEndingInA= 1773


In [6]:
# Approx. a third of girl names end in `a`. That's a good insight
# Let's see what are the most common letters girl names end with
girl_ending_letters = collections.Counter([name[-1] for name in girl_names])

"MostCommonEndingLettersForGirls=", girl_ending_letters
# Our intuition was right. The most common letter is `a`
# Here are the first 3: 'a': 1773, 'e': 1432, 'y': 461
# I'm not sure what's the most common last letter for English boy names

('MostCommonEndingLettersForGirls=',
 Counter({'l': 179,
          'e': 1432,
          'y': 461,
          'i': 317,
          'a': 1773,
          'h': 105,
          's': 93,
          'd': 39,
          'n': 386,
          'g': 10,
          'x': 10,
          'o': 33,
          'r': 47,
          't': 68,
          'b': 9,
          'z': 4,
          'u': 6,
          'v': 2,
          'k': 3,
          'm': 13,
          'w': 5,
          ' ': 1,
          'p': 2,
          'j': 1,
          'f': 2}))

In [7]:
# Let's see the stats
boy_ending_letters = collections.Counter([name[-1] for name in boy_names])
"MostCommonEndingLettersForBoys=", boy_ending_letters

('MostCommonEndingLettersForBoys=',
 Counter({'r': 190,
          'n': 478,
          'y': 332,
          'e': 468,
          't': 164,
          'l': 187,
          'm': 70,
          'h': 93,
          'd': 228,
          's': 230,
          'a': 29,
          'i': 50,
          'f': 25,
          'o': 165,
          'k': 69,
          'c': 25,
          'x': 10,
          'j': 3,
          'w': 17,
          'g': 32,
          'b': 21,
          'z': 11,
          'u': 12,
          'p': 18,
          'v': 16}))

**Here’s a problem**: the 2nd and 3rd most common last letters are the same for both
genders. Let’s see how to deal with that.

First, we will build a predictive system and analyze its results. This type of system
is called a classifier. As you can guess, the role of a classifier is to classify data. The
idea is pretty basic: it takes some input and it assigns that input to a class. At first,
we are going to use the NLTK classifier because it is a little bit easier to use. After we
clarify the concepts, we will move to a more complex classifier from Scikit-Learn.

**Build the first name classifier**

In [8]:
import nltk
import random
from nltk.corpus import names

In [9]:
def extract_features(name):
    """
    Get the features used for name classification
    """
    return {'last_letter': name[-1]}

In [10]:
# Get the names
boy_names = names.words('male.txt')
girl_names = names.words('female.txt')

In [11]:
# Build the dataset
boy_names_dataset = [(extract_features(name), 'boy') for name in boy_names]
girl_names_dataset = [(extract_features(name), 'girl') for name in girl_names]
# Put all the names together
data = boy_names_dataset + girl_names_dataset
# Mix everything together
random.shuffle(data)

In [12]:
# Let's take a look at the first few entries
data[:10]

[({'last_letter': 'a'}, 'girl'),
 ({'last_letter': 'a'}, 'girl'),
 ({'last_letter': 'n'}, 'boy'),
 ({'last_letter': 'a'}, 'girl'),
 ({'last_letter': 'e'}, 'girl'),
 ({'last_letter': 'a'}, 'girl'),
 ({'last_letter': 'n'}, 'boy'),
 ({'last_letter': 'y'}, 'girl'),
 ({'last_letter': 'y'}, 'girl'),
 ({'last_letter': 'f'}, 'boy')]

In [13]:
# Split the dataset into training data and test data
cutoff = int(0.75 * len(data))
train_data, test_data = data[:cutoff], data[cutoff + 1:]
# Let's train probably the most popular classifier in the world
name_classifier = nltk.DecisionTreeClassifier.train(train_data)

In [14]:
# Take if for a spin
print(name_classifier.classify(extract_features('Bono'))) # boy
print(name_classifier.classify(extract_features('Latiffa'))) # girl

boy
girl


In [15]:
print(name_classifier.classify(extract_features('Gaga'))) # girl
print(name_classifier.classify(extract_features('Joey'))) # girl
# Sorry Joey :(

girl
girl


In [16]:
# Test how well it performs on the test data:
# Accuracy = correctly_labelled_samples / all_samples
print(nltk.classify.accuracy(name_classifier, test_data)) # 0.7420654911838791
# Look at the tree
print(name_classifier.pretty_format())

0.7531486146095718
last_letter= ? ........................................ girl
last_letter=a? ........................................ girl
last_letter=b? ........................................ boy
last_letter=c? ........................................ boy
last_letter=d? ........................................ boy
last_letter=e? ........................................ girl
last_letter=f? ........................................ boy
last_letter=g? ........................................ boy
last_letter=h? ........................................ girl
last_letter=i? ........................................ girl
last_letter=j? ........................................ boy
last_letter=k? ........................................ boy
last_letter=l? ........................................ boy
last_letter=m? ........................................ boy
last_letter=n? ........................................ boy
last_letter=o? ........................................ boy
last_letter=p? .

At this point, in our example, we only have one input (the last letter) and two classes
(boy/girl). 

Turns out that our experiment with the four names (Bono, Latiffa, Gaga
and Joey) was pretty good since one in four names got misclassified.

Using the `nltk.DecisionTreeClassifier` class we’re effectively building a tree, where
nodes hold conditions that try to optimally classify the training data.


It may not seem like much of a tree at this point (it only has one level), but don’t
worry, by the time we’ll be done with it, it will.
Let’s add another feature: the number of vowels in the name. In order to do this,
we need to change the `extract_features` function and retrain.

**New feature function**

In [17]:
def extract_features2(name):
    """
    Get the features used for name classification
    """
    return {
    'last_letter': name[-1],
    'vowel_count': len([c for c in name if c in 'AEIOUaeiou'])
    }

In [18]:
# Build the dataset
boy_names_dataset = [(extract_features2(name), 'boy') for name in boy_names]
girl_names_dataset = [(extract_features2(name), 'girl') for name in girl_names]
# Put all the names together
data = boy_names_dataset + girl_names_dataset
# Mix everything together
random.shuffle(data)

In [19]:
# Split the dataset into training data and test data
cutoff = int(0.75 * len(data))
train_data, test_data = data[:cutoff], data[cutoff + 1:]
# Let's train probably the most popular classifier in the world
name_classifier = nltk.DecisionTreeClassifier.train(train_data)

In [20]:
# Test how well it performs on the test data:
# Accuracy = correctly_labelled_samples / all_samples
print(nltk.classify.accuracy(name_classifier, test_data)) #
# Look at the tree
print(name_classifier.pretty_format())

0.7758186397984886
last_letter=a? ........................................ girl
last_letter=b? ........................................ boy
  vowel_count=0? ...................................... girl
  vowel_count=1? ...................................... boy
  vowel_count=2? ...................................... boy
last_letter=c? ........................................ boy
last_letter=d? ........................................ boy
last_letter=e? ........................................ girl
last_letter=f? ........................................ boy
last_letter=g? ........................................ boy
  vowel_count=1? ...................................... boy
  vowel_count=2? ...................................... boy
  vowel_count=3? ...................................... girl
last_letter=h? ........................................ girl
  vowel_count=1? ...................................... boy
  vowel_count=2? ...................................... girl
  vowel_count=3

At this point, we can notice that more features are somehow better, allowing us to
capture different aspects of a given sample. Let’s add a few more and see how far
it takes us:

**Final feature extraction function**

In [21]:
import string
def extract_features3(name):
    """
    Get the features used for name classification
    """
    features = {
    # Last letter
    'last_letter': name[-1],
    # First letter
    'first_letter': name[0],
    # How many vowels
    'vowel_count': len([c for c in name if c in 'AEIOUaeiou'])
    }
    # Build letter and letter count features
    for c in string.ascii_lowercase:
        features['contains_' + c] = c in name
        features['count_' + c] = name.lower().count(c)
        return features

In [22]:
# Build the dataset
boy_names_dataset = [(extract_features3(name), 'boy') for name in boy_names]
girl_names_dataset = [(extract_features3(name), 'girl') for name in girl_names]
# Put all the names together
data = boy_names_dataset + girl_names_dataset
# Mix everything together
random.shuffle(data)

In [23]:
# Split the dataset into training data and test data
cutoff = int(0.75 * len(data))
train_data, test_data = data[:cutoff], data[cutoff + 1:]
# Let's train probably the most popular classifier in the world
name_classifier = nltk.DecisionTreeClassifier.train(train_data)

In [24]:
# Test how well it performs on the test data:
# Accuracy = correctly_labelled_samples / all_samples
print(nltk.classify.accuracy(name_classifier, test_data)) #
# Look at the tree
print(name_classifier.pretty_format())

0.7783375314861462
last_letter= ? ........................................ girl
last_letter=a? ........................................ girl
last_letter=b? ........................................ girl
  first_letter=B? ..................................... boy
    count_a=0? ........................................ boy
    count_a=1? ........................................ girl
  first_letter=C? ..................................... boy
    vowel_count=0? .................................... girl
    vowel_count=1? .................................... boy
    vowel_count=2? .................................... boy
  first_letter=H? ..................................... boy
  first_letter=J? ..................................... boy
  first_letter=K? ..................................... boy
  first_letter=L? ..................................... girl
  first_letter=M? ..................................... girl
  first_letter=R? ..................................... boy
  first_letter