# Introduction to Machine Learning

## Overview

### Learning Goals

### Structure

- Conceptual introduction to machine learning
- Hands-on document classification as an example machine learning workflow
- Library applications of machine learning
- Resources to further your understanding of machine learning and its applications

## Some Definitions
### Data Science and its Subfields
**Data science** is an all-encompassing process of collecting, managing, processing, analyzing, and visualizing data, and reporting the inferences made from it. The following graphic from UC Berkeley depicts the data science cycle in five stages, along with some of the techniques associated with each stage.

<img
src="https://github.com/vrtompki/jumpstart/blob/master/Images/Data Science Process.png?raw=1" style="width: 500px;"/>

**Artificial intelligence** (AI) describes the ability of a computer to perform learned tasks. **Machine learning** (ML) is the process of training a machine or computer to perform a task. In the previous image, ML is a tool used primarily during, but not restricted to, the "Process" and "Analyze" portions of the data science cycle. **Deep learning** (DL) refers to an area of machine learning focused on using large, multi-layered neural networks. The depth of a neural network depends on how many hidden layers the network has. Deep learning is a subset of machine learning, machine learning is a subset of artificial intelligence, and all three areas are tools that data scientists can use to understand and communicate the analysis of data.

<img
src="https://github.com/vrtompki/jumpstart/blob/master/Images/AI_ML_DL.png?raw=1"/>

### Machine Learning from a Distance
Teaching or training a computer or machine to perform a task is very similar to how humans learn throughout their lifetime. Bob Ross was a famous jovial painter who was known for his television program that tried to teach viewers to paint amazing scenes from nature that included his signature "happy trees". He would often mix colors to add to his palette to get various shades of green to paint the trees. Given only the primary colors, how would one replicate Bob's technique to create a nice shade of green for their "happy tree"?

<p float="left"><img src="https://github.com/vrtompki/Linear_Regression_Tutorial/blob/master/images/bob_ross.jpg?raw=1" style="width: 200px;"/>
    <img src="https://github.com/vrtompki/Linear_Regression_Tutorial/blob/master/images/primary_colors.png?raw=1" style="width: 400px;"/>
</p>

Well, you may remember that green is a combination of the colors *yellow* and *blue*, and you may notice that these two colors were already provided. You try mixing different amounts of these colors until you are finally able to create the three desired shades of *green*!

<img src="https://github.com/vrtompki/Linear_Regression_Tutorial/blob/master/images/color_mixing_example.png?raw=1" alt="Drawing" style="width: 600px;"/>

Different shades of green consist of various amounts of two primary colors, *yellow* and *blue*. So, theoretically, we could represent any shade of green using a combination of the two colors:

$$ Green = Yellow*amount_{yellow} + Blue*amount_{blue}$$

$$ \begin{align*}
Green_{1} &= Yellow*1 + Blue*2 \\
Green_{2} &= Yellow*2 + Blue*1 \\
Green_{3} &= Green_{1} + Green_{2} + Blue*1 \\
\end{align*} $$

When we train a machine, we are trying to find the best parameters of the equations to complete the task. In the previous example, albeit an impractical one, we would train the model to predict the shade of green and the amount of blue and yellow required to make it. One of the simplest examples is the use of a straight line to predict an output, similar to the equations above, each of which represents the sum of two lines. For simplicity, if we look at a single line equation, it contains a slope (the rate of change of the line, or the degree of incline) and an intercept (the baseline value of the output when the input is zero). By finding the slope and the intercept of the line, we can then use it to map the inputs to specific output values. This particular machine learning task is called regression. Next we will go over some of the tasks and the types of data used for machine learning.
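
As a minimal sketch of what "finding the slope and the intercept" can look like in practice, the snippet below fits a straight line to a handful of points with NumPy's `polyfit`. The data values and variable names are made up for illustration and are not part of the workshop materials.

```python
import numpy as np

# Made-up inputs and outputs that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line); polyfit returns the
# coefficients from the highest power down: [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Use the learned parameters to predict the output for a new input
x_new = 5.0
y_new = slope * x_new + intercept
print(slope, intercept, y_new)
```

Here the "training" step is the call to `polyfit`, and the learned parameters are `slope` and `intercept`.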

## Types of Machine Learning Tasks
### Supervised vs Unsupervised
Tasks are performed in one of three manners:
- **Supervised** - the input data is labeled, or samples with known classes are used to train the model.
- **Unsupervised** - class membership of the input data is not known, or is not important.
- **Semi-supervised** - uses both supervised and unsupervised techniques for analyzing data.

### Classification vs Prediction
All machine learning models are designed to do one of two tasks: **classification** or **prediction**. Classifiers are models trained to identify inputs as belonging to a specific group of objects (e.g., determining if an image is a cat or a dog).

<img src="https://github.com/vrtompki/jumpstart/blob/master/Images/Machine Learning.png?raw=1" style="width: 400px;"/>


## The What, Why, and How of machine learning - a conceptual introduction

<figure>
  <img  src="http://scikit-learn.org/stable/_static/ml_map.png" width="75%" />  <figcaption><div align="left" style="padding-top: 4px;">Source: <a href="http://scikit-learn.org/stable/tutorial/machine_learning_map/">`scikit-learn`: Choosing the right estimator</a></div></figcaption>
</figure>

## Classifying textual documents - an example ML workflow

### A typical ML classification workflow
The workflow for a classification task is usually as follows:

1. **Collect** or create **labeled data**
2. **Transform** that data into a numeric representation
  - Each numeric value representing a characteristic of the data is called a **feature**
  - The set of all features representing a single input sample is called its **feature vector**
  - The whole labeled data set is split into two parts (at least) to train, evaluate and refine the model: a **training set** and a **test set**
3. **Train** (learn/fit) a model on a part of the transformed labeled data (the training set)
4. **Test** the model predictions on the test set to evaluate its performance
5. **Assess** your model and revisit each of the previous steps, if necessary
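
The five steps above can be sketched end to end on a tiny dataset. This is only an illustration under stated assumptions: the eight example texts, their labels, and the choice of `MultinomialNB` are made up here to keep the sketch short, and the real corpus used later is much larger.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# 1. Collect labeled data (tiny, made-up examples)
texts = [
    "the council approved the budget", "officials announced a new policy",
    "the mayor spoke at the meeting", "the committee released its report",
    "lol this is so funny", "omg cant wait for the weekend",
    "this meme is hilarious lol", "so hyped for the show tonight",
]
labels = ["news"] * 4 + ["tweet"] * 4

# 2. Split into training and test sets, then transform texts into features
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)
vectorizer = CountVectorizer()
features_train = vectorizer.fit_transform(texts_train)
features_test = vectorizer.transform(texts_test)

# 3. Train a model on the training set
model = MultinomialNB()
model.fit(features_train, labels_train)

# 4. Test: predict labels for the held-out texts
predicted = model.predict(features_test)

# 5. Assess: compare predictions against the true labels
print(accuracy_score(labels_test, predicted))
```

With so few examples the score itself means little; the point is the order of the steps and the fact that the vectorizer is fit on the training texts only.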

#### Text classification


Given the specific task of assigning a category to a new text based on a set of labeled input texts:

1. **Collect and label.** In the case of Twitter data, for example, you'd download a bunch of tweets using the Twitter API, extract their texts from the JSON format of the API response, and then assign one or more labels to each tweet (the easiest way is just to use the tweet's hashtags as labels).
2. **Transform.** There are many strategies for turning your textual data into numbers, and `scikit-learn` has built-in tools for most of them. Usually you'll begin by making a list of the words that appear in a document (a "bag of words"), but probably you'll also want to count the *frequencies* of the words in each document (BoW counts). In `scikit-learn`, this is done by a type of transformer called a `CountVectorizer`. We also must choose whether to exclude uncommon words (i.e., words that only appear in a few documents) or very common words ("stopwords"). These high-level settings as a whole are called **hyperparameters** (different from **parameters**, which are the values learned by the model from the training data that enable it to make predictions).
3. **Train.** You'll need to choose which learning model/algorithm to use, either by reading the documentation or talking to your friendly neighborhood data scientist. After choosing a model, we train it on part of the labeled input data (the training set).
4. **Test.** We then use the trained model to predict/infer the labels of the remainder of the labeled input data (the test/validation set).
5. **Assess.** Apply one or more metrics (scores) to evaluate how well the predicted labels match the actual labels of the test/validation set. If the performance is unsatisfactory, we'll need to backtrack, possibly all the way to step #1, getting more labeled data and applying different transformations/hyperparameters as needed, and/or trying a different model.

In [1]:
!which python

/Users/csbaile3/opt/anaconda3/envs/jumpstart/bin/python


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
documents = [
    "Libraries are cool places.",
    "Librarians can be cool too.",
    "Digital libraries provide lots of resources.",
    "Are digital libraries places?"
]

In [None]:
count_vectorizer = CountVectorizer()
count_vectorizer.fit(documents)
print("Vocabulary size: ", len(count_vectorizer.vocabulary_))
count_vectorizer.vocabulary_

In [None]:
counts = count_vectorizer.transform(documents)
print(counts)
print("   ^  ^         ^\n   |  |         |\n  doc word_id count")

In [None]:
# This will go through the entire vocabulary, but only show counts from the first doc
doc = 0
for word, word_id in count_vectorizer.vocabulary_.items():
    print(word, ":", counts[doc, word_id])

In [None]:
# We previously used both the fit and transform methods.
# Vectorizers typically have a single fit_transform method we can use to do both
# in one step.
counts = count_vectorizer.fit_transform(documents)
print(counts)

`CountVectorizer` also has some options to disregard stopwords, count ngrams (multiple adjacent words) instead of single words, cap the size of the vocabulary, normalize case and strip accents, or count only terms within a document frequency range. It is worth exploring the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
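
As a brief sketch of a few of these options, the snippet below uses the `stop_words`, `ngram_range`, and `min_df` parameters of `CountVectorizer`. The specific settings are arbitrary choices for illustration, and `docs` simply repeats the four example sentences so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Libraries are cool places.",
    "Librarians can be cool too.",
    "Digital libraries provide lots of resources.",
    "Are digital libraries places?",
]

# Drop English stopwords and count single words plus adjacent word pairs
bigram_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
bigram_vectorizer.fit(docs)
print(sorted(bigram_vectorizer.vocabulary_))

# Keep only terms that appear in at least 2 documents
frequent_vectorizer = CountVectorizer(min_df=2)
frequent_vectorizer.fit(docs)
print(sorted(frequent_vectorizer.vocabulary_))
```

Each of these settings is a hyperparameter: it is chosen by us before training rather than learned from the data.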

### Our text corpus

Our actual corpus is made up of two pieces: tweets and texts from the [Brown corpus](https://www.nltk.org/book/ch02.html) included in the Natural Language Toolkit (`nltk`), which contains more than a million words of English from 500 texts, where each text is categorized into one of 15 genres. We'll specifically work with the `news` genre. Our goal is to create a classifier that can distinguish between a tweet and news based solely on textual contents.


We'll start by importing our tweets and breaking them down into a format that works with our classifier.

In [None]:
import pandas as pd

In [None]:
tweets_df = pd.read_csv("sample_tweets_do_not_share.csv")
tweets_df.head()

In [None]:
tweets_english = tweets_df[tweets_df.lang == "en"]["text"]
tweets_english.shape

Let's take a sample of 4000 tweets, which roughly equals the number of sentences we'll get from the news corpus below.

In [None]:
tweets_english_sample = tweets_english.sample(4000)
tweets_english_sample.shape

In [None]:
# Build list of text, label tuples
tweet_corpus = [(text, "tweet") for text in tweets_english_sample]

In [None]:
import nltk
nltk.download("brown")
from nltk.corpus import brown

In [None]:
for category in brown.categories():
    print(category, len(brown.fileids(category)))

In [None]:
news_corpus = []
for fileid in brown.fileids("news"):
    text = " ".join(brown.words(fileids=fileid))
    news_corpus.append((text, "news"))

len(news_corpus)

In [None]:
# Typically you do want to hoist imports to the top of a file rather than have them sprinkled throughout. That said, while experimenting and prototyping, we often don't know what we need until we need it, and we can always reformat later.
from nltk import sent_tokenize
from itertools import chain

In [None]:
# We need to download punkt, a specific sentence tokenizer
nltk.download('punkt')

In [None]:
# Function that takes a list of (text, label) tuples that all share one label,
# and returns a list of (sentence, label) tuples, one per sentence
def sent_chunker(labelled_corpus):
    # All entries share the same label, so take it from the first one
    label = labelled_corpus[0][1]
    chunks = [text for text, _ in labelled_corpus]
    sents = [sent_tokenize(text) for text in chunks]
    # Flatten the list of per-text sentence lists into a single list
    flat_sents = list(chain.from_iterable(sents))
    labelled_flat_sents = [(sent, label) for sent in flat_sents]
    return labelled_flat_sents

In [None]:
import numpy as np

In [None]:
sentences = sent_chunker(news_corpus)
sentences[0]

To see if the texts in our two corpora are roughly equivalent in length, let's check the average length of the texts. 

In [None]:
np.mean([len(sent) for (sent, _) in sentences])

In [None]:
# Let's check the avg length of the tweets
np.mean([len(tweet) for (tweet, _) in tweet_corpus])

We'll combine the two lists of tuples together to create a single corpus.

In [None]:
full_corpus = tweet_corpus + sentences

For `scikit-learn`, we need to separate the texts and the labels. We'll use a couple of list comprehensions for that.

Usually, the variable containing the labels is named `y`, and the one containing the input features (in our case, the texts) is named `X`, as in you can obtain the output `y` as a function of the inputs `X`, which is the core abstraction in `scikit-learn`. But using arbitrary letters is confusing when you're trying to learn a new concept, so we'll add some explanatory info to the variable names after `X_` and `y_`.

In [None]:
texts = [text for (text, _) in full_corpus]
labels = [label for (_, label) in full_corpus]
X_texts = np.array(texts)
y_labels = np.array(labels)

Before we can move forward, we need to take a brief conceptual detour and talk about the ways that we often split corpora for machine learning tasks. 

Many machine learning approaches call for splitting the labeled data into three sets:
- **training** data (usually the largest set) for the initial model training
- **validation** data, which is then used to evaluate the initial performance of the model and subsequently fine-tune the model settings and **hyperparameters** in the hopes of getting better results
- **testing** data is "held out" until all model tuning is completed and then is used to give a final evaluation score or *benchmark* of the model's performance.

`scikit-learn` provides functions to split a labeled dataset into training and testing sets.
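
`train_test_split` produces two sets at a time. One common way to get the three sets described above (an assumption here, not something this workshop does) is to call it twice. The placeholder data and the 60/20/20 proportions below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 made-up "documents" with alternating labels
X = np.array([f"document {i}" for i in range(100)])
y = np.array(["news" if i % 2 == 0 else "tweet" for i in range(100)])

# First hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Then split the remainder into training (75%) and validation (25%),
# giving a 60/20/20 split of the original data overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

In the rest of this notebook we keep things simple and use only a training set and a test set.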

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
(X_texts_train, X_texts_test, y_labels_train, y_labels_test) = train_test_split(X_texts, y_labels, test_size=0.25, random_state=42)

print(f"{len(X_texts_train)} training documents")
print(f"{len(X_texts_test)} testing documents")

In [None]:
vectorizer = CountVectorizer()
X_features_train = vectorizer.fit_transform(X_texts_train)
X_features_test = vectorizer.transform(X_texts_test)

print(f"{X_features_train.shape[0]} training documents with {X_features_train.shape[1]} features")
print(f"{X_features_test.shape[0]} test documents with {X_features_test.shape[1]} features")

### Classification (prediction)

Let's start with one of the Naïve Bayes classifiers.

**Naïve Bayes** is a family of classifiers based on Bayes' Theorem of probability, which describes the probability of an event based on prior knowledge of possibly relevant conditions. Although its formulation can get confusing, all the math boils down to counting, multiplication and division, making Naïve Bayes (NB) classifiers very fast. On the other hand, NB makes the assumption that all of the features in the data set are equally important and independent, which is obviously not true for words. Despite this, Naïve Bayes classifiers are generally very accurate as text classifiers.

<div align="left"><b>"All models are wrong but some are useful" - George Box (1978)</b></div>

Three of the Naïve Bayes algorithms in `scikit-learn` are:
- Gaussian: assumes that features follow a normal distribution.
- Multinomial: good for discrete counts, like in text classification problems using counts of words.
- Bernoulli: useful for feature vectors that are binary (i.e. zeros and ones), like classic bag of words.

Given that our feature vectors are counts of the words in each document with some additional vocabulary constraints, we will use `MultinomialNB`.
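
To illustrate the claim that the math boils down to counting and division, the sketch below reproduces the per-class word probabilities that `MultinomialNB` learns, using nothing but word counts and add-one smoothing. The four training texts are made up, and `alpha=1.0` is set explicitly so that the hand calculation matches.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set
texts = ["budget vote council", "council vote report", "lol meme lol", "meme weekend lol"]
labels = np.array(["news", "news", "tweet", "tweet"])

counts = CountVectorizer().fit_transform(texts).toarray()

clf = MultinomialNB(alpha=1.0)  # alpha=1.0 is add-one (Laplace) smoothing
clf.fit(counts, labels)

# Reproduce the per-class word probabilities by counting and dividing
manual_log_prob = []
for cls in clf.classes_:
    word_totals = counts[labels == cls].sum(axis=0)  # count of each word in this class
    smoothed = word_totals + 1.0                      # add-one smoothing
    manual_log_prob.append(np.log(smoothed / smoothed.sum()))
manual_log_prob = np.array(manual_log_prob)

print(np.allclose(manual_log_prob, clf.feature_log_prob_))
```

The smoothing step keeps a word that never appears in one class from getting a probability of zero for that class.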

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
classifier = MultinomialNB()
classifier.fit(X_features_train, y_labels_train)

In [None]:
np.shape(classifier.feature_log_prob_)

Now we can predict the categories of previously unseen texts and assess how good our classifier is at classifying them.

In [None]:
samples = [
    "The city's hospitals celebrated a grand opening on Tuesday.",
    "This meme is lit.",
    "The forest fire burned for 12 days before firefighters could extinguish it.",
    "The disease spread quickly through Brazil.",
    "@jack needs to get back to verifying folks #bluecheck"
]

transformed_samples = vectorizer.transform(samples)
classifier.predict(transformed_samples)

How could we start to understand why the model makes the predictions that it does? 

One of the easiest things to do is to use a package that breaks down the prediction probabilities and which features led to those predictions. 

There are a number of options out there, but we'll use Lime. 

In [None]:
# If it isn't already in your environment.
!pip install lime

In [None]:
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline

In [None]:
def explain(entry, clf, vectorizer=None, n=10):
    # If no vectorizer is given, clf is assumed to be a fitted Pipeline
    # whose final step is the classifier
    if vectorizer is None:
        class_names = clf.steps[-1][1].classes_.tolist()
        pipeline = clf
    else:
        class_names = clf.classes_.tolist()
        pipeline = make_pipeline(vectorizer, clf)
    explainer = LimeTextExplainer(class_names=class_names)
    exp = explainer.explain_instance(entry, pipeline.predict_proba, num_features=n)
    exp.show_in_notebook()

explain("Reports indicated that the cave entrance collapsed, trapping them inside.", classifier, vectorizer)
#explain("They went to a beautiful restaurant, and drank wine together.", classifier, vectorizer)

We can extract some of this information from the model ourselves. If we wanted to know which words are being used to decide whether a text is either `news` or `tweet`, we need to take into account the word vocabulary built by the vectorizer we used to transform our data, and the feature (word) probabilities calculated for each label by the model at training time.

The following is a fairly standard most informative features function. 

In [None]:
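def most_informative_features(classifier, vectorizer=None, n=10):
    # If no vectorizer is given, classifier is assumed to be a fitted Pipeline
    # of (vectorizer, classifier)
    if vectorizer is None:
        vectorizer = classifier.steps[0][1]
        classifier = classifier.steps[-1][1]
    class_labels = classifier.classes_
    # get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2
    feature_names = vectorizer.get_feature_names_out()
    # Sort each class's (log probability, word) pairs and keep the n most probable
    topn_class1 = sorted(zip(classifier.feature_log_prob_[0], feature_names))[-n:]
    topn_class2 = sorted(zip(classifier.feature_log_prob_[1], feature_names))[-n:]
    for prob, feat in reversed(topn_class2):
        print(class_labels[1], prob, feat)
    print()
    for prob, feat in reversed(topn_class1):
        print(class_labels[0], prob, feat)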

most_informative_features(classifier, vectorizer)

You'll notice that a lot of the words here are commonly occurring words. We could decide that these are "stopwords" and remove them from the corpus in a preprocessing step. I tend to do first runs in an ML task without much preprocessing when possible, in order to get a baseline for the model and understand the corpus a bit better.

### Model Evaluation and Selection

Changing model hyperparameters without a way to assess the model's performance can get problematic. If, for example, you find better informative features but the classifier is failing 50% of the time, it doesn't really matter much how good you think those features are. In some cases, very performant classifiers make use of unexpected or counterintuitive features.

The only real way to assess a classifier's performance is by making it predict labels for unseen data, and then comparing the predicted labels with the real labels. This is why separating training and testing data is so important.

In [None]:
y_pred = classifier.predict(X_features_test)
y_pred

In [None]:
print("Label     Predicted    Result\n-----     ---------    ------")
for i, real_label in enumerate(y_labels_test[:20]):
    predicted_label = y_pred[i]
    if real_label == predicted_label:
        result =  "hit"
    else:
        result = "miss"
    print(real_label, "    ", predicted_label, "       ", result)

We have almost entirely hits here because of how different the types of text within our corpus are. If our texts had more overlap in their content, it would be harder for the classifier to work so effectively.

This basic test gives us a sense of what's going on, but doesn't say too much about what types of fails we have. Were the tweets being classified as news or the other way around? This may not be too significant here, but if we think about using ML for medical diagnosis, it's a lot more important to understand if your model is giving you false positives or false negatives. 

One way of gaining this information is to build a confusion matrix. You could do this manually, but scikit-learn does contain a convenience function. 

![Confusion Matrix c/o Towards Data Science](https://miro.medium.com/max/712/1*Z54JgbS4DUwWSknhDCvNTQ.png)

In [None]:
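from sklearn.metrics import confusion_matrix

# Rows are the actual labels, columns are the predicted labels.
# TN = true news   (news predicted as news)
# FT = false tweet (news predicted as tweet)
# FN = false news  (tweet predicted as news)
# TT = true tweet  (tweet predicted as tweet)

print("TN FT")
print(confusion_matrix(y_labels_test, y_pred, labels=["news", "tweet"]))
print("FN TT")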

We can see that when our classifier misses, it thinks that pieces of news are actually tweets. 

Let's look quickly at one other common evaluation before moving on. **Accuracy** is the ratio of correct predictions ("hits") to the total number of predictions.

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_labels_test, y_pred)

Two other common metrics are *precision* and *recall*. They can be calculated (averaged) for the whole model or for each category. We'll think about the latter here.

**Precision**: out of the test texts the model classified as tweets, what fraction of them were actually tweets?

**Recall**: out of the texts in the test set that actually are tweets, what fraction of them did the model correctly classify as tweets?

We aren't going to run these right now, but if you'd like to practice exploring the scikit-learn docs, you might try to find the correct functions and use them with our model. 
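
To make the two definitions concrete without giving away the scikit-learn function names, here is a hand computation in plain Python. The eight actual and predicted labels are made up for illustration and do not come from our classifier.

```python
# Made-up true and predicted labels for eight test texts
actual =    ["tweet", "tweet", "tweet", "tweet", "news", "news", "news", "news"]
predicted = ["tweet", "tweet", "news", "news", "tweet", "news", "news", "news"]

pairs = list(zip(actual, predicted))
true_tweets = sum(1 for a, p in pairs if a == "tweet" and p == "tweet")
false_tweets = sum(1 for a, p in pairs if a == "news" and p == "tweet")
missed_tweets = sum(1 for a, p in pairs if a == "tweet" and p == "news")

# Precision: of everything predicted as a tweet, how much really was one?
precision = true_tweets / (true_tweets + false_tweets)
# Recall: of all the real tweets, how many did the model find?
recall = true_tweets / (true_tweets + missed_tweets)

print(precision, recall)
```

In this example the model is more precise than it is thorough: most of what it calls a tweet is one, but it misses half of the real tweets.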

## Library applications of machine learning

There are a large number of potential applications of machine learning in the library world, including working with collections and metadata, improving administrative workflows, and provisioning data services. We're only going to focus on one application today, which is creating metadata for digitized collections through machine learning. There are a lot of different reports on data science and machine learning in libraries out there, but if you're looking for follow-up reading, we specifically recommend these:

- [Responsible Operations: Data Science, Machine Learning, and AI in Libraries](https://www.oclc.org/research/publications/2019/oclcresearch-responsible-operations-data-science-machine-learning-ai.html) by Thomas Padilla
- [Machine Learning + Libraries: A Report on the State of the Field](https://labs.loc.gov/static/labs/work/reports/Cordell-LOC-ML-report.pdf?loclr=blogsig) from LC Labs and written by Ryan Cordell. 
- [Mapping the Current Landscape of Research Library Engagement with Emerging Technologies in Research and Learning](https://www.arl.org/resources/mapping-the-current-landscape-of-research-library-engagement-with-emerging-technologies-in-research-and-learning/) from ARL, CNI, and Educause. This isn't specifically and only about ML in libraries, but it's a heavy and recurring theme. 
- [Shifting to Data Savvy: The Future of Data Science in Libraries](http://d-scholarship.pitt.edu/33891/1/Shifting%20to%20Data%20Savvy.pdf), an IMLS funded report on librarians and data science. 

None of these should be taken as exhaustive, and I think that all are at least somewhat controversial, but they are good starts to understanding potential futures of machine learning (and data science) in libraries.

In looking at metadata creation, we're going to work with a collection of digitized materials from NC State Libraries' Animal Turn collection. We're going to try to generate some meaningful metadata from both the text of the items and the page images.

The overview from special collections is [here](https://www.lib.ncsu.edu/animal-turn).

The digitized materials are [here](https://d.lib.ncsu.edu/collections/catalog?_=1543444883344&f%5Bispartof_facet%5D%5B%5D=Animal+Turn).

We've taken the liberty of writing some scraping code that uses the IIIF manifests that back the collection to retrieve the OCR text and page images of each item.  Due to time, we won't be able to walk through all of the code to produce these models or run them during this workshop. Instead we're going to load the models we've built and talk through what we might gain from using these models and methods.

Let's start just by looking at a couple of texts to see what we have. 

In [None]:
import gensim
import glob

In [None]:
fns = glob.glob("texts/*.txt")
fns[:5]

In [None]:
with open(fns[0], 'r') as f:
    print(f.read())

We'll start by using topic modeling to build a semantic model of the corpus. Topic modeling refers to a set of unsupervised machine learning algorithms and approaches to modeling the underlying thematic structure of a corpus of texts. The assumption is that the corpus is composed of a range of themes, and each text can be broken down as being some percentage about each theme. Each theme, or topic, is an aggregation of regularly co-occurring words.

For example, in a corpus about animal rights, a single text or document might be 75% about household pets, 15% about training strategies, 5% about food, and 5% about cleaning. Another document might be 60% about farm animals, 25% about veterinary care, and 15% about food. These breakdowns are determined by word occurrence and co-occurrence in relation to the topics the model constructed.

I've built a model that assumes there are roughly 40 topics or themes in the text of our digitized Animal Turn collection. I've done a bit of preprocessing with the corpus, removing stopwords, numbers, and non-letter characters. I've also removed single-letter words that weren't caught by the stopword list.

In [None]:
model = gensim.models.LdaModel.load("animalturn_40_full.model")

In [None]:
model.print_topics(-1)

That's useful, but it can be hard to think of this as an overview. Let's switch to visualizing the model using the `pyLDAvis` package.

In [None]:
import pyLDAvis
# In newer pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim

In [None]:
pyLDAvis.enable_notebook()

For `pyLDAvis` to work, we do need to load in data constructed during the creation of the model, namely the processed `gensim` corpus and the dictionary of words in the corpus.

In [None]:
loaded_corpus = gensim.corpora.MmCorpus("items_bow_lg_full.mm")

In [None]:
id2word = gensim.corpora.dictionary.Dictionary.load("animalturn_40_full.model.id2word")

In [None]:
pyLDAvis.gensim.prepare(model, loaded_corpus, id2word)

This is a much better way to explore the topics, and to get some sense of how the topics might relate to each other. A librarian could use this sort of visualization to help explore and describe large digitized collections and how patrons could engage with them.

What this view doesn't let us explore is how the individual documents in the collection are modeled according to topics, or how they might relate to each other.

We'll start by looking at a single document's significant topics. 

In [None]:
model.get_document_topics(loaded_corpus[0])

In [None]:
model.show_topic(1)

This is an effective but slow way to understand individual documents. What if we were interested in the groupings of documents based on our topic model? Based on the topic distributions for each document, I fit the corpus into 2D space using t-distributed stochastic neighbor embedding (t-SNE), a dimensionality reduction algorithm for visualization.

<iframe src="animal_turn_tsne_40.html" height="800" width="800"></iframe>
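
As a rough sketch of the dimensionality reduction step, the snippet below runs scikit-learn's `TSNE` on a made-up matrix of document-topic proportions. The random data, the five topics, and the `perplexity` value are assumptions for illustration; the embedded visualization was produced from the real 40-topic model.

```python
import numpy as np
from sklearn.manifold import TSNE

# Made-up topic distributions: 30 documents over 5 topics, each row sums to 1
rng = np.random.default_rng(0)
doc_topic_matrix = rng.dirichlet(alpha=np.ones(5), size=30)

# Reduce the 5-dimensional topic space to 2 dimensions for plotting;
# perplexity must be smaller than the number of documents
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(doc_topic_matrix)

print(coords.shape)
```

Each row of `coords` is an (x, y) position for one document, so documents with similar topic mixtures tend to land near each other in a scatter plot.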

In [None]:
- Make sure we have both text and image examples. 
- Examples could be aspirational, not just things we definitely already know how to do. 
- Maybe use Animal Turn for both text and images; see if you can identify when an image is a photograph vs a cartoon or illustration. That could be useful metadata to help someone access the collection. 