## How to Build a Machine Learning App from Scratch
This is an *interactive notebook version* of an article originally published by my friend [Cameron Smith](https://camtsmith.com/about). The original tutorial can be found [here](https://camtsmith.com/articles/2017-10/naive-bayes-text-classification).

## Introduction
This project has three stages: 
1. The first is constructing a corpus of language data.
2. The second is training and testing a language classifier model to predict categories. 
3. The third is deploying the application to the web along with an API (not covered in this notebook).

You can find the complete source code [here](https://github.com/camoverride/language-classifier). If you'd like a sneak peek at what the application looks like in the wild, click [here](https://language-classifier-app.herokuapp.com/).

## Building a Corpus

To perform language classification, we need a data source. A good source will have a large amount of text and accurate category labels. Wikipedia seems like a great place to start. Not only does it have a well-documented API, but that API also allows language-specific querying.
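
To make "language-specific querying" concrete, here is a minimal sketch of how a language prefix selects a Wikipedia edition. The helper name `random_articles_url` is made up for this illustration; the query parameters are the same ones used later in this notebook. Nothing is sent over the network here, the code only builds the URL.

```python
from urllib.parse import urlencode


def random_articles_url(language_id, number_of_articles):
    """Build the URL that asks one Wikipedia edition for random article ids.

    Hypothetical helper for illustration only; it does not send a request.
    """
    params = {
        'format': 'json',
        'action': 'query',
        'list': 'random',
        'rnlimit': number_of_articles,
        'rnnamespace': 0,  # 0 restricts the results to articles
    }
    # The language prefix becomes the subdomain of the request
    return 'https://{}.wikipedia.org/w/api.php?{}'.format(
        language_id, urlencode(params))


for language_id in ['en', 'sco']:
    print(random_articles_url(language_id, 5))
```

Only the subdomain changes between editions, which is what lets us label every downloaded article with the language it was requested from.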

We'll start off by creating a list of languages to be used (prefixes were taken from [here](https://en.wikipedia.org/wiki/List_of_Wikipedias)):

In [None]:
LANGUAGES = ['en', 'sv', 'de', 'fr', 'nl', 'ru', 'it', 'es', 'pl', 'vi', 'pt', 'uk', 'fa', 'sco']

We're using languages with large numbers of articles (but feel free to play with the list). We also include [Scots](https://en.wikipedia.org/wiki/Scots_language), because it's quite similar to English and will provide an interesting challenge for our algorithm.

Let's not forget to import these libraries:

In [None]:
import re
import json
import requests

Next, we create one class, **GetArticles**, which will be used to grab data from Wikipedia. It also performs basic document sanitizing, such as removing punctuation, HTML tags, and citation brackets.

In [None]:
class GetArticles(object):
    """
    Fetches random Wikipedia articles and sanitizes their text. The original project also
    describes a public method write_articles(language_id, number_of_articles, db_location)
    that writes sanitized articles to text files; that method is not defined in this notebook.
    """
    def __init__(self):
        pass


    def _get_random_article_ids(self, language_id, number_of_articles):
        """
        Makes a request for random article ids. "rnnamespace=0" means that only articles are chosen,
        as opposed to user-talk pages or category pages. These ids are used by the _get_article_text
        function to request articles.
        """
        query = (
            'https://' + language_id
            + '.wikipedia.org/w/api.php?format=json&action=query&list=random&rnlimit='
            + str(number_of_articles) + '&rnnamespace=0'
        )

        # reads the response into a json object that can be iterated over
        data = json.loads(requests.get(query).text)

        # collects the ids from the json
        ids = []
        for article in data['query']['random']:
            ids.append(article['id'])

        return ids


    def _get_article_text(self, language_id, article_id_list):
        """
        This function takes a list of article ids and yields one tuple
        (article_title, article_text) per article that could be fetched.
        """
        for idx in article_id_list:
            idx = str(idx)
            query = (
                'https://' + language_id
                + '.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&pageids='
                + idx + '&redirects=true'
            )

            data = json.loads(requests.get(query).text)

            try:
                title = data['query']['pages'][idx]['title']
                text_body = data['query']['pages'][idx]['extract']
            except KeyError as error:
                # if nothing is returned for the request, skip to the next item
                # if it is important to download a precise number of files
                # then this can be repeated for every error to get a new file
                # but getting a precise number of files shouldn't matter
                print(error)
                continue

            def clean(text):
                """
                Sanitizes the document by removing HTML tags, citations, and punctuation. This
                function can also be expanded to remove headers, footers, side-bar elements, etc.
                """
                match_tag = re.compile(r'(<[^>]+>|\[\d+\]|[,.\'\"()])')
                return match_tag.sub('', text)

            yield title, clean(text_body)

Let's test it with N random articles (we recommend scraping fewer than 100, but feel free to break things as part of your learning).

In [None]:
# initializing object
getdata = GetArticles()
# getting ids of N random articles
articles = getdata._get_random_article_ids('en', 10)
# fetching text
text_list = getdata._get_article_text('en', articles)
# printing titles
for title, text in text_list:
    print(title)
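
The sanitizing step can also be tried offline. The sketch below reuses the same regular expression as the `clean` helper inside `_get_article_text`, but as a standalone function. The sample string is hand-written for illustration and is not real API output.

```python
import re

# Same pattern as in GetArticles: HTML tags, [n] citation markers,
# and a small set of punctuation characters
MATCH_TAG = re.compile(r'(<[^>]+>|\[\d+\]|[,.\'\"()])')


def clean(text):
    """Remove every match of MATCH_TAG from the text."""
    return MATCH_TAG.sub('', text)


# Hand-written sample, not an actual Wikipedia extract
sample = '<p>Edinburgh is the capital of Scotland.[1] It is "auld" (old).</p>'
cleaned = clean(sample)
print(cleaned)
```

Note that the pattern removes every period and apostrophe, so a number like `3.14` would become `314`. That is acceptable for word-level language classification, but may not be for other tasks.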

## Classification

Now that we have some data to play with, it's time to choose a classification algorithm. [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) seems promising. In the first half of this section, we'll explain the math behind Naive Bayes. In the second half, we'll use [scikit-learn](https://scikit-learn.org/stable/) to train a Multinomial Naive Bayes classifier on our data.

#### Naive Bayes

Naive Bayes classifiers are a family of classifiers that take inspiration from Bayes' Theorem. Most people like to memorize Bayes' theorem and go from there, but I find that it's more useful to derive Bayes' theorem instead, as it sheds some light on how the pieces fit together. This is especially important when we need to fiddle with the prior probability (which the nature of our corpus will force us to do). I'll add other languages later on, but I'll keep things simple for the purposes of this tutorial.

We can think of the probability of A given B as the probability of the intersection of A and B divided by the probability of B. The probability of B, in turn, is the probability of the intersection of A and B plus the probability of the intersection of B and not A (see Cameron's post [*Math to Code*](https://camtsmith.com/articles/2017-12/math-to-code) for a different perspective):

\begin{equation*}
P(A | B) = {\dfrac{P(A \cap B)}{P(B)}} = {\dfrac{P(A \cap B)}{P(A \cap B) + P(B \cap \overline{A})}}
\end{equation*}

The formula above is written in terms of set intersections, but it's more useful to rewrite it in terms of conditional probabilities so that we can play around with specific quantities. The probability of the intersection of two events, A and B, is the same as the probability of B given A multiplied by the probability of A. The middle expression below is the canonical form of Bayes' Theorem:

\begin{equation*}
P(A | B) = {\dfrac{P(B|A) \cdot P(A)}{P(B)}} = {\dfrac{P(B|A) \cdot P(A)}{ P(B|A) \cdot P(A) + P(B|\overline{A}) \cdot P(\overline{A})  }}
\end{equation*}
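
As a quick numerical check, the two forms can be compared on a toy sample space. Everything in this sketch is an assumption made for illustration: the sample space `omega` (the integers 1 to 12, equally likely), the events `A` (even numbers) and `B` (multiples of 3), and the helper names `p` and `p_given`.

```python
from fractions import Fraction

# Hypothetical sample space: the integers 1..12, each equally likely
omega = set(range(1, 13))
A = {n for n in omega if n % 2 == 0}  # event A: n is even
B = {n for n in omega if n % 3 == 0}  # event B: n is a multiple of 3
not_A = omega - A


def p(event):
    """Probability of an event under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))


def p_given(event, condition):
    """Conditional probability P(event | condition), by counting."""
    return Fraction(len(event & condition), len(condition))


# Set form: P(A|B) = P(A and B) / P(B)
direct = p(A & B) / p(B)

# Bayes form, with P(B) expanded over A and not A
numerator = p_given(B, A) * p(A)
bayes = numerator / (numerator + p_given(B, not_A) * p(not_A))

print(direct, bayes)
```

Both expressions come out to the same fraction, which is what the derivation predicts.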

The intuition behind Naive Bayes is quite simple. Let's say we have three documents: one is English and contains \["auld", "man", "girl"\]

In [None]:
en = ["auld", "man", "girl"]

and the other two are Scots and are \["the", "auld"\] and \["auld", "auld"\] (not the most realistic data, but it'll suit our purposes).

In [None]:
sco = ["the", "auld"] + ["auld", "auld"]
sco

Let's define some probabilities, counting over words rather than documents:

In [None]:
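# P(auld): share of all words in the corpus that are "auld"
p_auld = (en + sco).count('auld') / len(en + sco)
# P(Scots): share of all words that come from Scots documents
p_scot = len(sco) / len(en + sco)
# P(English): share of all words that come from English documents
p_engl = len(en) / len(en + sco)
# P(auld|Scots): share of Scots words that are "auld"
p_auld_scot = sco.count('auld') / len(sco)
# P(auld|English): share of English words that are "auld"
p_auld_engl = en.count('auld') / len(en)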

If we are judging the input sentence \[“auld”\] (a vector of words containing only one element), then the probability of this belonging to Scots is:

\begin{equation*}
P(Scots|auld) = {\dfrac{P(auld|Scots) \cdot P(Scots)}{P(auld)}}
\end{equation*}

In [None]:
p_auld_scot * p_scot / p_auld

Intuitively, this result makes sense. Three out of four occurrences of the word "auld" are in Scots text, and Scots text makes up more of the corpus than English text.

The same calculation for English is:

\begin{equation*}
P(English|auld) = {\dfrac{P(auld|English) \cdot P(English)}{P(auld)}}
\end{equation*}

In [None]:
p_auld_engl * p_engl / p_auld
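
Since the input must be either Scots or English in this toy setup, the two posteriors should add up to 1. The sketch below recomputes both with exact fractions as a sanity check. It assumes, as in the cells above, that priors are measured over words rather than documents; the names `corpus` and `posterior` are introduced only for this illustration.

```python
from fractions import Fraction

# The toy corpus from above, with the two Scots documents concatenated
corpus = {
    'English': ['auld', 'man', 'girl'],
    'Scots': ['the', 'auld', 'auld', 'auld'],
}
total_words = sum(len(words) for words in corpus.values())


def posterior(word, language):
    """P(language | word) via Bayes' theorem, using word-level counts."""
    words = corpus[language]
    likelihood = Fraction(words.count(word), len(words))  # P(word | language)
    prior = Fraction(len(words), total_words)             # P(language)
    evidence = Fraction(                                  # P(word)
        sum(w.count(word) for w in corpus.values()), total_words)
    return likelihood * prior / evidence


p_scots = posterior('auld', 'Scots')
p_english = posterior('auld', 'English')
print(p_scots, p_english, p_scots + p_english)
```

If the two values did not sum to 1, that would point to an inconsistency in how the prior, likelihood, and evidence were counted.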

### *To be continued*