## Outlook
### Approaching a machine learning problem

Let’s say your goal is fraud detection.

Then the following questions open up:
- How do I measure if my fraud prediction is actually working?
- Do I have the right data to evaluate an algorithm?
- If I am successful, what will be the business impact of my solution?

As we discussed in Chapter 5, it is best if you can measure the performance of your
algorithm directly using a business metric, like increased profit or decreased losses.
This is often hard to do, though. A question that can be easier to answer is “What if I
built the perfect model?” If perfectly detecting any fraud will save your company $100 a month, these possible savings will probably not be enough to warrant the effort of you even starting to develop an algorithm. On the other hand, if the model might save your company tens of thousands of dollars every month, the problem might be
worth exploring.
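
The “perfect model” question can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function names, the case counts, the per-case losses, and the recall value are all hypothetical and would have to come from your own business data.

```python
def perfect_model_savings(cases_per_month, avg_loss_per_case):
    """Upper bound on monthly savings: a perfect model prevents every loss."""
    return cases_per_month * avg_loss_per_case


def expected_savings(cases_per_month, avg_loss_per_case, recall):
    """Savings if the model only catches a fraction (recall) of the fraud."""
    return recall * perfect_model_savings(cases_per_month, avg_loss_per_case)


# Hypothetical scenario 1: fraud is rare and cheap, so even perfection buys little
small = perfect_model_savings(cases_per_month=2, avg_loss_per_case=50.0)

# Hypothetical scenario 2: fraud is frequent enough to justify a closer look
large = perfect_model_savings(cases_per_month=400, avg_loss_per_case=75.0)

print(small)  # 100.0
print(large)  # 30000.0
```

Because a real model will not be perfect, the upper bound is best read as a filter: if even the perfect model does not pay for the effort, no realistic model will.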

### Humans in the loop

Many applications are dominated by “simple cases,” for which an algorithm
can make a decision, with relatively few “complicated cases,” which can be
rerouted to a human.
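
One possible way to separate the two kinds of cases is to look at how confident a classifier is. The following is a minimal sketch, assuming a probabilistic scikit-learn classifier and a synthetic dataset; the `route` helper and the threshold of 0.9 are hypothetical choices for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def route(model, X, confidence_threshold=0.9):
    """Return predictions plus a mask of samples to send to a human.

    Hypothetical helper: a sample is "complicated" if the highest
    predicted class probability is below confidence_threshold.
    """
    proba = model.predict_proba(X)
    confidence = proba.max(axis=1)
    predictions = model.classes_[proba.argmax(axis=1)]
    needs_human = confidence < confidence_threshold
    return predictions, needs_human


predictions, needs_human = route(model, X_test)
print("decided automatically: %d" % (~needs_human).sum())
print("rerouted to a human: %d" % needs_human.sum())
```

The threshold trades automation against risk: raising it sends more cases to people, lowering it lets the algorithm decide more often.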


In many companies, the data analytics teams
work with languages like Python and R that allow the quick testing of ideas, while
production teams work with languages like Go, Scala, C++, and Java to build robust,
scalable systems. 

If you are building involved machine learning systems, we highly recommend
reading the paper “Machine Learning: The High Interest Credit Card of Technical
Debt”, published by researchers in Google’s machine learning team. The paper
highlights the trade-off in creating and maintaining machine learning software
in production at a large scale.


There are more elaborate mechanisms for online testing that go beyond A/B
testing, such as bandit algorithms. A great introduction to this subject can be
found in the book Bandit Algorithms for Website Optimization by John Myles
White (O’Reilly).
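
To give a flavor of what a bandit algorithm does, here is a minimal sketch of the epsilon-greedy strategy, one of the simplest members of this family. It is not taken from the book mentioned above; the class name, the two variants, and their conversion rates are hypothetical and only serve to simulate user feedback.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit: mostly exploit the best arm, sometimes explore."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # how often each arm was shown
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select_arm(self):
        # With probability epsilon pick a random arm (exploration)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        # Otherwise pick the arm with the best estimate so far (exploitation)
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the mean reward of the chosen arm
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Simulated experiment with two website variants (hypothetical rates)
true_rates = [0.05, 0.10]
bandit = EpsilonGreedy(n_arms=2, epsilon=0.1, seed=0)
environment = random.Random(1)
for _ in range(5000):
    arm = bandit.select_arm()
    reward = 1.0 if environment.random() < true_rates[arm] else 0.0
    bandit.update(arm, reward)

print("times each variant was shown:", bandit.counts)
print("estimated conversion rates:", bandit.values)
```

Unlike a fixed A/B split, the traffic each variant receives here shifts over time toward whichever variant currently looks better.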

### From prototype to production

### Testing production systems

In this book, we covered how to evaluate algorithmic predictions based on a test set
that we collected beforehand. This is known as offline evaluation. If your machine
learning system is user-facing, this is only the first step in evaluating an algorithm,
though. The next step is usually online testing or live testing, where the consequences
of employing the algorithm in the overall system are evaluated. Changing the
recommendations or search results users are shown by a website can drastically change
their behavior and lead to unexpected consequences. To protect against these
surprises, most user-facing services employ A/B testing, a form of blind user study.

### Building your own estimator

This book has covered a variety of tools and algorithms implemented in
scikit-learn that can be used on a wide range of tasks. However, often there will be some
particular processing you need to do for your data that is not implemented in
scikit-learn. It may be enough to just preprocess your data before passing it to your
scikit-learn model or pipeline. However, if your preprocessing is data dependent,
and you want to apply a grid search or cross-validation, things become trickier.


In Chapter 6 we discussed the importance of putting all data-dependent processing
inside the cross-validation loop. So how can you use your own processing together
with the scikit-learn tools? There is a simple solution: build your own estimator!
Implementing an estimator that is compatible with the scikit-learn interface, so
that it can be used with Pipeline, GridSearchCV, and cross_val_score, is quite easy.
You can find detailed instructions in the scikit-learn documentation, but here is
the gist. The simplest way to implement a transformer class is by inheriting from
BaseEstimator and TransformerMixin, and then implementing the __init__, fit,
and transform functions like this:

In [1]:
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, first_parameter=1, second_parameter=2):
        # all parameters must be specified in the __init__ function
        # and stored unchanged, so that get_params and set_params work
        self.first_parameter = first_parameter
        self.second_parameter = second_parameter
        
    def fit(self, X, y=None):
        # fit should only take X and y as parameters
        # even if your model is unsupervised, you need to accept a y argument!
        
        # Model fitting code goes here
        print("fitting the model right here")
        # fit returns self
        return self
    
    def transform(self, X):
        # transform takes as parameter only X
        
        # apply some transformation to X:
        X_transformed = X + 1
        return X_transformed
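
To show how such a transformer plugs into the scikit-learn tools mentioned above, here is a minimal sketch with a concrete, data-dependent preprocessing step. The `QuantileClipper` class, its parameters, and the synthetic dataset are hypothetical and chosen only for illustration; the point is that the clipping bounds are learned inside `fit`, so they are recomputed on each training fold during the grid search.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each feature to quantiles seen in fit."""

    def __init__(self, lower=0.01, upper=0.99):
        # store constructor arguments unchanged under the same names
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X)
        # learned attributes get a trailing underscore by convention
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        # apply the bounds learned on the training data
        return np.clip(np.asarray(X), self.low_, self.high_)


X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([("clip", QuantileClipper()),
                 ("clf", LogisticRegression(max_iter=1000))])
# step name, double underscore, parameter name
param_grid = {"clip__lower": [0.0, 0.05], "clip__upper": [0.95, 1.0]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Because `__init__` only stores its arguments, `GridSearchCV` can clone the transformer and set `lower` and `upper` for every parameter combination.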

### Where to go from here
#### Theory

We already mentioned Hastie, Tibshirani, and Friedman’s book The Elements of
Statistical Learning in the Preface, but it is worth repeating this recommendation
here. Another quite accessible book, with accompanying Python code, is Machine
Learning: An Algorithmic Perspective by Stephen Marsland (Chapman and Hall/CRC).
Two other highly recommended classics are Pattern Recognition and Machine Learning
by Christopher Bishop (Springer), a book that emphasizes a probabilistic framework,
and Machine Learning: A Probabilistic Perspective by Kevin Murphy (MIT Press), a
comprehensive (read: 1,000+ pages) dissertation on machine learning methods
featuring in-depth discussions of state-of-the-art approaches, far beyond what we
could cover in this book.

#### Other machine learning frameworks and packages

Another reason you might want to look beyond scikit-learn is if you are
more interested in statistical modeling and inference than prediction. In this case,
you should consider the statsmodels package for Python, which implements several
linear models with a more statistically minded interface.

Another popular machine learning package is vowpal wabbit (often called vw to
avoid possible tongue twisting), a highly optimized machine learning package written
in C++ with a command-line interface. vw is particularly useful for large datasets and
for streaming data. For running machine learning algorithms distributed on a cluster,
one of the most popular solutions at the time of writing is mllib, a Scala library built
on top of the spark distributed computing environment.

#### Ranking, recommender systems, time series, and other kinds of learning

There are two particularly important topics that we did not cover in this book.
The first is ranking, in which we want to retrieve answers to a particular query,
ordered by their relevance. You’ve probably already used a ranking system today;
this is how search engines operate. You input a search query and obtain a sorted
list of answers, ranked by how relevant they are. A great introduction to ranking
is provided in Manning, Raghavan, and Schütze’s book Introduction to Information
Retrieval.



The second topic is recommender systems, which provide suggestions to users based
on their preferences. You’ve probably encountered recommender systems under
headings like “People You May Know,” “Customers Who Bought This Item Also Bought,”
or “Top Picks for You.” There is plenty of literature on the topic, and if you want
to dive right in you might be interested in the now classic “Netflix prize
challenge”, in which the Netflix video streaming site released a large dataset of
movie preferences and offered a prize of $1 million to the team that could provide
the best recommendations.



Another
common application is prediction of time series (like stock prices), which also has a
whole body of literature devoted to it.


#### Probabilistic modeling, inference and probabilistic programming

#### Neural Networks

#### Scaling to larger datasets

In this book, we always assumed that the data we were working with could be stored
in a NumPy array or SciPy sparse matrix in memory (RAM). Even though modern
servers often have hundreds of gigabytes (GB) of RAM, this is a fundamental
restriction on the size of data you can work with. Not everybody can afford to buy
such a large machine, or even to rent one from a cloud provider. In most
applications, the data that is used to build a machine learning system is relatively
small, though, and few machine learning datasets consist of hundreds of gigabytes of
data or more. This makes expanding your RAM or renting a machine from a cloud
provider a viable solution in many cases. If you need to work with terabytes of
data, however, or you need to process large amounts of data on a budget, there are
two basic strategies: out-of-core learning and parallelization over a cluster.
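
Out-of-core learning generally means learning from data that does not fit into memory by processing it one chunk at a time. In scikit-learn, some estimators support this through the `partial_fit` method. The following is a minimal sketch, assuming an `SGDClassifier` and a synthetic dataset; the `iter_batches` helper is hypothetical and stands in for code that would read chunks from disk or over the network.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in; real out-of-core data would never be loaded all at once
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)


def iter_batches(X, y, batch_size=500):
    """Hypothetical helper that yields the data one chunk at a time."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


clf = SGDClassifier(random_state=0)
# partial_fit needs the full list of classes on the first call, because
# a single chunk is not guaranteed to contain every class
classes = np.unique(y)
for X_batch, y_batch in iter_batches(X, y):
    clf.partial_fit(X_batch, y_batch, classes=classes)

print("accuracy on the streamed data: %.2f" % clf.score(X, y))
```

Only one chunk has to be in memory at any time, so the size of the dataset is limited by storage rather than by RAM.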

The other strategy for scaling is distributing the data over multiple machines in a
compute cluster, and letting each computer process part of the data. This can be
much faster for some models, and the size of the data that can be processed is only
limited by the size of the cluster. However, such computations often require relatively
complex infrastructure. One of the most popular distributed computing platforms at
the moment is the spark platform built on top of Hadoop. spark includes some
machine learning functionality within the MLlib package. If your data is already on a
Hadoop filesystem, or you are already using spark to preprocess your data, this might
be the easiest option. If you don’t already have such infrastructure in place,
establishing and integrating a spark cluster might be too large an effort, however.
The vw package mentioned earlier provides some distributed features and might be a
better solution in this case.

#### Honing your skills

#### Conclusion