# Getting Started with Scikit-Learn

NLTK models are a good way to get started with machine learning, but the Scikit-Learn
implementations are better in several respects. Before we repeat the
process from the previous section, here are some Scikit-Learn characteristics worth emphasizing:
- offers a greater diversity of models, whereas NLTK offers only a handful of classifiers,
such as MaxentClassifier, DecisionTreeClassifier and NaiveBayesClassifier
- provides a consistent API for data preprocessing, vectorization and model
training
- has an extensively documented API and many tutorials
- is generally fast, because it is built on top of NumPy
- provides some parallelization features

**Build dataset with scikit-learn**

In [1]:
# The `names` corpus must be available locally (run nltk.download('names') once)
from nltk.corpus import names
from sklearn.model_selection import train_test_split

In [2]:
def extract_features(name):
    """
    Get the features used for name classification
    """
    return {
        'last_letter': name[-1]
    }

In [3]:
# Get the names
boy_names = names.words('male.txt')
girl_names = names.words('female.txt')

In [4]:
# Build the dataset
boy_names_dataset = [(extract_features(name), 'boy') for name in boy_names]
girl_names_dataset = [(extract_features(name), 'girl') for name in girl_names]

In [5]:
# Put all the names together
data = boy_names_dataset + girl_names_dataset
# Split the data into features and classes
X, y = list(zip(*data))
# Shuffle, then split: 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, shuffle=True)
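
Because `train_test_split` shuffles the data, every run produces a different split, so the accuracy and the predictions shown later can vary between runs. A minimal sketch of making the split repeatable with the `random_state` parameter follows. The `toy_X` and `toy_y` data and the `split` helper are made up for illustration and are not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data, shaped like the names dataset
toy_X = [{'last_letter': letter} for letter in 'aaeeinny']
toy_y = ['girl', 'girl', 'girl', 'girl', 'girl', 'boy', 'boy', 'boy']


def split(seed):
    # A fixed random_state makes the shuffle deterministic
    return train_test_split(toy_X, toy_y, test_size=.25, shuffle=True, random_state=seed)


first = split(42)
second = split(42)

print(first == second)               # the same seed gives the same split
print(len(first[0]), len(first[1]))  # sizes of the training and test features
```

Passing the same `random_state` to the real split would keep the train and test sets fixed across notebook runs.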

In [6]:
X_train

[{'last_letter': 'y'},
 {'last_letter': 'i'},
 {'last_letter': 'a'},
 {'last_letter': 'e'},
 {'last_letter': 'l'},
 {'last_letter': 'a'},
 {'last_letter': 'a'},
 {'last_letter': 'a'},
 {'last_letter': 'y'},
 {'last_letter': 'y'},
 {'last_letter': 'h'},
 {'last_letter': 'a'},
 {'last_letter': 'e'},
 {'last_letter': 'e'},
 {'last_letter': 'y'},
 {'last_letter': 'e'},
 {'last_letter': 'n'},
 {'last_letter': 'e'},
 {'last_letter': 'y'},
 {'last_letter': 'l'},
 {'last_letter': 'o'},
 {'last_letter': 'e'},
 {'last_letter': 'e'},
 {'last_letter': 'n'},
 {'last_letter': 'y'},
 {'last_letter': 'a'},
 {'last_letter': 'y'},
 {'last_letter': 'a'},
 {'last_letter': 'a'},
 {'last_letter': 'n'},
 {'last_letter': 'y'},
 {'last_letter': 'a'},
 {'last_letter': 'a'},
 {'last_letter': 'e'},
 {'last_letter': 'o'},
 {'last_letter': 'a'},
 {'last_letter': 'o'},
 {'last_letter': 'e'},
 {'last_letter': 'a'},
 {'last_letter': 't'},
 {'last_letter': 'l'},
 {'last_letter': 'l'},
 {'last_letter': 'a'},
 ...

In [9]:
print(y_train)

['boy', 'boy', 'girl', 'girl', 'boy', 'girl', 'girl', 'girl', 'girl', 'boy', 'girl', 'girl', ...

## Training a Scikit-Learn Model

We are going to rewrite the training process using the equivalent scikit-learn
models. The `scikit-learn` API has two main types of components:

1. Transformers
2. Predictors

Here are the most used methods:

- fit - trains a scikit-learn component (Transformer or Predictor)
- transform - Transformer method that converts the data to another format
- predict - Predictor method that classifies a data point (in our case)
- score - Predictor method that evaluates the accuracy on labeled data
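
To make this contract concrete, here is a toy sketch in plain Python. `LastLetterTransformer` and `MajorityPredictor` are hypothetical classes invented for illustration. They only mimic the method names and are not how scikit-learn implements its components.

```python
from collections import Counter


class LastLetterTransformer:
    """Toy Transformer: maps each name to the index of its last letter."""

    def fit(self, names):
        # Learn which last letters appear in the training data
        self.letters_ = sorted({name[-1].lower() for name in names})
        return self

    def transform(self, names):
        # Convert each name to a one-element numeric row (-1 for unseen letters)
        rows = []
        for name in names:
            letter = name[-1].lower()
            rows.append([self.letters_.index(letter) if letter in self.letters_ else -1])
        return rows


class MajorityPredictor:
    """Toy Predictor: remembers the most common label for each feature row."""

    def fit(self, X, y):
        votes = {}
        for row, label in zip(X, y):
            votes.setdefault(tuple(row), Counter())[label] += 1
        self.table_ = {row: counts.most_common(1)[0][0] for row, counts in votes.items()}
        self.default_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table_.get(tuple(row), self.default_) for row in X]

    def score(self, X, y):
        # Accuracy: the fraction of predictions that match the true labels
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


train_names = ['Anna', 'Maria', 'John', 'Ethan', 'Laura', 'Kevin']
train_labels = ['girl', 'girl', 'boy', 'boy', 'girl', 'boy']

transformer = LastLetterTransformer().fit(train_names)
predictor = MajorityPredictor().fit(transformer.transform(train_names), train_labels)

test_names = ['Olivia', 'Martin']
predictions = predictor.predict(transformer.transform(test_names))
accuracy = predictor.score(transformer.transform(test_names), ['girl', 'boy'])
print(predictions, accuracy)
```

The order of calls is the part that carries over to the real library: fit the Transformer, transform the data, fit the Predictor on the transformed data, then predict or score.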

**Train a Decision Tree Classifier**

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
dict_vectorizer = DictVectorizer()
name_classifier = DecisionTreeClassifier()

In [19]:
# Scikit-Learn models work with numeric arrays, not dicts.
# We need to fit the vectorizer first so that
# it learns which features appear in the dicts
dict_vectorizer.fit(X_train)
# Vectorize the training data
X_train_vectorized = dict_vectorizer.transform(X_train)
# Train the classifier on vectorized data
name_classifier.fit(X_train_vectorized, y_train)
# Vectorize the test data with the same, already fitted vectorizer
X_test_vectorized = dict_vectorizer.transform(X_test)
# Compute accuracy (0.75 in the original run; it varies with the random split)
accuracy = name_classifier.score(X_test_vectorized, y_test)

In [20]:
print(X_test_vectorized)

  (0, 17)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 9)	1.0
  (4, 17)	1.0
  (5, 1)	1.0
  (6, 1)	1.0
  (7, 5)	1.0
  (8, 14)	1.0
  (9, 24)	1.0
  (10, 4)	1.0
  (11, 14)	1.0
  (12, 1)	1.0
  (13, 17)	1.0
  (14, 1)	1.0
  (15, 5)	1.0
  (16, 12)	1.0
  (17, 5)	1.0
  (18, 5)	1.0
  (19, 14)	1.0
  (20, 24)	1.0
  (21, 12)	1.0
  (22, 18)	1.0
  (23, 1)	1.0
  (24, 5)	1.0
  :	:
  (1961, 4)	1.0
  (1962, 9)	1.0
  (1963, 14)	1.0
  (1964, 9)	1.0
  (1965, 1)	1.0
  (1966, 5)	1.0
  (1967, 1)	1.0
  (1968, 24)	1.0
  (1969, 1)	1.0
  (1970, 14)	1.0
  (1971, 1)	1.0
  (1972, 1)	1.0
  (1973, 5)	1.0
  (1974, 5)	1.0
  (1975, 1)	1.0
  (1976, 14)	1.0
  (1977, 12)	1.0
  (1978, 6)	1.0
  (1979, 1)	1.0
  (1980, 5)	1.0
  (1981, 14)	1.0
  (1982, 15)	1.0
  (1983, 1)	1.0
  (1984, 9)	1.0
  (1985, 1)	1.0


I encourage you to retrain the classifier using the extended feature extraction
function.

At this point, some concepts might seem ambiguous. For instance, what gets passed
between the DictVectorizer and the DecisionTreeClassifier? Scikit-Learn models
only work with numeric matrices, and the role of a vectorizer is to convert data from
various formats to numeric matrices. The output type of our vectorizer is
scipy.sparse.csr.csr_matrix (the exact module path may differ between SciPy versions).
This is a sparse matrix: a memory-efficient format that stores only the non-zero
entries. Unfortunately, such a matrix is a bit hard to read. We still need to
understand this data format and be able to work with it, and here’s how we can do that:

**Understanding vectorized data**

In [15]:
NAMES = ['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']
transformed = dict_vectorizer.transform([extract_features(name) for name in NAMES])
# We get a scipy sparse matrix, which is a bit hard to read

In [16]:
print(transformed)

  (0, 1)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 5)	1.0
  (4, 5)	1.0
  (5, 14)	1.0
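
Each printed line reads as `(row, column) value`: row 0 (Lara) has a 1.0 in column 1, and every entry that is not listed is zero. The sketch below expands such entries into dense rows in plain Python, assuming the 26 columns reported by `feature_names_`. `to_dense` is a hypothetical helper written for illustration, not a SciPy function.

```python
def to_dense(entries, n_rows, n_columns):
    """Expand (row, column, value) triples into a list of dense rows."""
    dense = [[0.0] * n_columns for _ in range(n_rows)]
    for row, column, value in entries:
        dense[row][column] = value
    return dense


# The entries printed for Lara, Carla, Ioana, George, Steve and Stephan
entries = [(0, 1, 1.0), (1, 1, 1.0), (2, 1, 1.0), (3, 5, 1.0), (4, 5, 1.0), (5, 14, 1.0)]
dense = to_dense(entries, n_rows=6, n_columns=26)

print(dense[0][:6])  # first six columns of the row for 'Lara'
print(sum(len(row) for row in dense), 'dense values versus', len(entries), 'stored entries')
```

On the real matrix, `transformed.toarray()` returns the equivalent dense NumPy array.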


In [17]:
print(type(transformed)) # <class 'scipy.sparse.csr.csr_matrix'>

<class 'scipy.sparse.csr.csr_matrix'>


In [18]:
# We can transform the data back
dict_vectorizer.inverse_transform(transformed)

[{'last_letter=a': 1.0},
 {'last_letter=a': 1.0},
 {'last_letter=a': 1.0},
 {'last_letter=e': 1.0},
 {'last_letter=e': 1.0},
 {'last_letter=n': 1.0}]

In [22]:
# We can check the feature names
print(dict_vectorizer.feature_names_)

['last_letter= ', 'last_letter=a', 'last_letter=b', 'last_letter=c', 'last_letter=d', 'last_letter=e', 'last_letter=f', 'last_letter=g', 'last_letter=h', 'last_letter=i', 'last_letter=j', 'last_letter=k', 'last_letter=l', 'last_letter=m', 'last_letter=n', 'last_letter=o', 'last_letter=p', 'last_letter=r', 'last_letter=s', 'last_letter=t', 'last_letter=u', 'last_letter=v', 'last_letter=w', 'last_letter=x', 'last_letter=y', 'last_letter=z']
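
Each entry in this list is one column of the matrix, built from a `feature=value` pair seen during `fit`. There is no `last_letter=q` column, which suggests that no name in this training split ended in `q`. The sketch below imitates the idea in plain Python. `fit_vocabulary` and `one_hot` are hypothetical helpers, and the real `DictVectorizer` covers more cases (numeric values, sparse output).

```python
def fit_vocabulary(feature_dicts):
    """Collect every 'key=value' pair seen in training, in sorted order."""
    return sorted({f'{key}={value}' for d in feature_dicts for key, value in d.items()})


def one_hot(feature_dict, vocabulary):
    """Encode one feature dict as a row with 1.0 in each matching column."""
    row = [0.0] * len(vocabulary)
    for key, value in feature_dict.items():
        column = f'{key}={value}'
        if column in vocabulary:  # pairs not seen during fitting are skipped
            row[vocabulary.index(column)] = 1.0
    return row


training = [{'last_letter': 'a'}, {'last_letter': 'e'}, {'last_letter': 'n'}, {'last_letter': 'a'}]
vocabulary = fit_vocabulary(training)

print(vocabulary)
print(one_hot({'last_letter': 'e'}, vocabulary))
print(one_hot({'last_letter': 'q'}, vocabulary))  # unseen value: all zeros
```

Skipping unseen pairs mirrors how `DictVectorizer.transform` ignores features it did not encounter during `fit`.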


In [23]:
# Just as an exercise, let's build a feature dictionary in reverse:
import numpy as np
# Build a vector with a 0 value for each feature
vectorized = np.zeros(len(dict_vectorizer.feature_names_))

In [24]:
# Let's set the features by hand for an imagined name: "Wwbwi"
# The index in `feature_names_` represents the index of the feature
# in a row of the sparse matrix we previously discussed
# Our name has the last letter `i`
vectorized[dict_vectorizer.feature_names_.index('last_letter=i')] = 1.0

In [25]:
# The name also contains 3 `w`s, one `b` and one `i`. However, `count_*` and
# `contains_*` features belong to a richer feature set (such as the extended
# feature extraction function). This vectorizer was fitted on `last_letter`
# only, so looking them up fails:
vectorized[dict_vectorizer.feature_names_.index('count_w')] = 3.0

ValueError: 'count_w' is not in list

The same ValueError is raised for 'count_b', 'count_i', 'contains_b', 'contains_w'
and 'contains_i', so the vector keeps only the `last_letter=i` entry.

In [32]:
# Let's see what we've built:
print(vectorized)
# Let's now apply the inverse transformation, to get the feature dict
print(dict_vectorizer.inverse_transform([vectorized]))

[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]
[{'last_letter=i': 1.0}]



## Making Predictions

As we covered in the Scikit-Learn API summary, predictions are made using the
predict method. This is an important step in our process, but keep in mind that you
can call predict only on a model that has already been trained with the fit method.
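
A minimal sketch of what happens otherwise: calling `predict` on a classifier that was never fitted raises scikit-learn's `NotFittedError`. The `fresh_classifier` name and the tiny dataset are made up for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.tree import DecisionTreeClassifier

fresh_classifier = DecisionTreeClassifier()  # created but never fitted

try:
    fresh_classifier.predict([[0.0, 1.0]])
    raised = False
except NotFittedError as error:
    raised = True
    print('Cannot predict yet:', type(error).__name__)

# After fitting on a tiny made-up dataset, predict works
fresh_classifier.fit([[0.0, 1.0], [1.0, 0.0]], ['girl', 'boy'])
print(fresh_classifier.predict([[0.0, 1.0]]))
```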

**Making predictions**

In [34]:
# Let's make some predictions
NAMES = ['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']
transformed = dict_vectorizer.transform([extract_features(name) for name in NAMES])

In [36]:
print(NAMES)

['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']


In [35]:
print(name_classifier.predict(transformed))
# George and Steve both end in `e`, so with `last_letter` as the only
# feature they get the same label. Oops, sorry George and Steve!

['girl' 'girl' 'girl' 'girl' 'girl' 'boy']


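
With `last_letter` as the only feature, names that end in the same letter produce identical feature dicts, so the classifier has to give them the same label. That is why George and Steve are labeled together. A quick plain-Python check, redefining `extract_features` so the snippet is self-contained (the `groups` dict is introduced here for illustration):

```python
def extract_features(name):
    """Same single-feature extractor used in this section."""
    return {'last_letter': name[-1]}


NAMES = ['Lara', 'Carla', 'Ioana', 'George', 'Steve', 'Stephan']

# Group the names by their only feature; names in one group
# are indistinguishable to the model
groups = {}
for name in NAMES:
    groups.setdefault(extract_features(name)['last_letter'], []).append(name)

print(groups)
```

Telling George and Steve apart would require additional features, for example the extended feature extraction function mentioned earlier.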