# Brief intro to ML

Machine learning emerged from inquiries in statistics, computer science, information theory, artificial intelligence, and pattern recognition. We can think of it as sets of tools for investigating, modeling, and understanding data. 

[Data splitting](https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf) is a fundamental preprocessing step used to divide a dataset into "training" and "test" sets. Most of the data (say, 75%) is assigned to the training set, and the remaining 25% is assigned to the test set. Missing data should be handled before the splitting process commences.  
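
As a rough illustration of a 75/25 split, here is a minimal sketch using scikit-learn's `train_test_split`. The arrays `X_toy` and `y_toy` are made-up data for this example only, and `random_state=0` is just one way to make the shuffle repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 observations with a single predictor column
X_toy = np.arange(8).reshape(-1, 1)
y_toy = np.arange(8) * 2

# Hold out 25% of the rows as the test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during training.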

[k-fold cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) is often preferred because it repeats this splitting process **k** times: the data are divided into **k** folds, and each fold serves as part of the training set **k-1** times and as the test set exactly once. 

Thus, in 4-fold cross validation, the data are divided into four equal-sized chunks. In the first iteration, the first 25% of the data is the test set, while the remaining 75% is the training set. In the second iteration, the second 25% is the test set and the other 75% is the training set, and so on:
![cv](img/K-fold_cross_validation_EN.jpg)
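
The same 4-fold scheme can be sketched with scikit-learn's `KFold`. The eight-element `data` array is a hypothetical stand-in for a real dataset, and shuffling is left off so the folds are contiguous chunks like those in the figure.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(8)  # hypothetical: 8 observations

kf = KFold(n_splits=4)  # no shuffling, so each fold is a contiguous chunk
folds = []
for i, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print(f"iteration {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each observation appears in the test set in exactly one iteration and in the training set in the other three.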

# Supervised machine learning
The syntax looks like this: **_Y ~ X_**

_*Y*_ = dependent/target/outcome/response variable  
_*X*_ = independent/predictor/input variable  

In supervised machine learning, the training dataset is used to train a model that estimates the target function: the actual but unknown function that maps X to Y. The model learns the characteristics of the data, and its performance is evaluated using a "performance metric" that describes how well it predicts the outcomes (accuracy, MSE, RMSE, AUC, etc.). This gives us a way to measure the goodness of fit for the model on the training data. 

However, we also want to see how well it performs on the test dataset: data it has not yet seen! A model that performs about as well on the test dataset as on the training dataset is said to generalize to new data. If the model performs poorly on the training dataset it is said to be **underfit** because it was not able to learn the relationships between the X and Y variables. If it performs well on the training set but poorly on the test set, the model is said to be **overfit**: it has learned patterns specific to the training data, some of which may be noise, that do not carry over to new data.  
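
To make the training-versus-test comparison concrete, here is a small sketch that fits two polynomial models with `numpy.polyfit` and reports RMSE on both sets. Everything in it is an assumption made for illustration: the "unknown" function `true_f`, the noise level, the sample points, and the choice of polynomial degrees.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "unknown" function that generates the data
    return 2 * x

x_train = np.linspace(0, 1, 6)
x_test = np.linspace(0.1, 0.9, 5)
y_train = true_f(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = true_f(x_test) + rng.normal(scale=0.2, size=x_test.size)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

results = {}
for degree in (1, 5):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        rmse(y_train, np.polyval(coefs, x_train)),  # error on data the model saw
        rmse(y_test, np.polyval(coefs, x_test)),    # error on unseen data
    )
    print(f"degree {degree}: train RMSE={results[degree][0]:.3f}, "
          f"test RMSE={results[degree][1]:.3f}")
```

The degree-5 polynomial passes through all six training points, so its training error is essentially zero. A training error that is much lower than the test error is the pattern to watch for when checking for overfitting.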

This can be extended to unsupervised learning as well, where there is no **_Y_** variable - the target function seeks to identify patterns in the data rather than predicting an outcome. 


# Classification or regression?

**Classification** is used when the Y outcome variable is categorical. "yes" or "no" is a _binary_ example: the model predicts 1 for the "yes" category and 0 for the "no" category. This can be extended to multi-level classification as well.  

**Regression** is used when the Y outcome variable is continuous and we want the model to predict it using X variable(s). 
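
The binary encoding described above might look like this in practice. The `answers` list is invented for the example.

```python
answers = ["yes", "no", "no", "yes"]  # hypothetical categorical outcome

# Binary encoding: 1 for the "yes" category, 0 for the "no" category
y_binary = [1 if answer == "yes" else 0 for answer in answers]
print(y_binary)
```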

# The value of understanding simple linear regression - by hand! 

Doing a simple [OLS](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression by hand provides a way to understand what supervised machine learning is doing under the hood. OLS regression builds a model that tries to predict Y using X (regresses Y onto X). 

First, let's generate toy predictor (X) and response (Y) variables and compute their means. This will be our "training set":

In [None]:
# Generate toy predictor (X) and response (Y) variables
import numpy
X = numpy.array([2, 4, 8, 12, 18, 20])
Y = numpy.array([1, 3, 5, 9, 19, 21])

## Calculate their means
## (keep full precision for later calculations; round only when printing)
mean_X = sum(X) / len(X)
mean_Y = sum(Y) / len(Y)
print("mean of X is:", round(mean_X, 2))
print("mean of Y is:", round(mean_Y, 2))

# Challenge 

Draw a scatter plot with your X values on the x-axis and your Y values on the y-axis

We can also perform "vectorized" operations, such as subtracting the mean of X from each X value or the mean of Y from each Y value. That is, we can do math on arrays of numbers simultaneously: 

In [None]:
print(X-mean_X)
print(Y-mean_Y)

This is important because it helps us calculate our [beta](https://en.wikipedia.org/wiki/Standardized_coefficient) coefficients in the regression. 
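
In symbols, the slope and intercept estimated in the next cell are the standard simple OLS formulas, where $\bar{x}$ and $\bar{y}$ are the means of X and Y and $n$ is the number of observations:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$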

In [None]:
## Estimate the B1 coefficient (slope)
B1 = sum((X-mean_X) * (Y-mean_Y)) / sum((X-mean_X)**2)
print("slope is equal to", round(B1,2))

## Estimate B0 coefficient (intercept)
B0 = mean_Y - (B1 * mean_X)
print("intercept is equal to", round(B0, 2))

# Challenge

Plot the best fit line using the intercept and slope!

Now that we have calculated the OLS "best fit line" (the line that minimizes the sum of the squared errors), we can generate predicted values and assess the performance of the model. In this toy example the predictions are made on the same X values used to fit the line, so they stand in for a "test set":

In [None]:
## Generate predicted Y values by plugging in our X values to the equation: 
Y_hat = B0 + B1 * X
print(Y_hat)

Our performance metric will be [root mean square error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation), or the standard deviation of the residuals (aka prediction errors). Error is measured as the vertical distance from a data point to the best fit line. 

However, we have to do some calculations first before we get to RMSE! 
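
Written as a single formula, with $y_i$ the observed values, $\hat{y}_i$ the predicted values, and $n$ the length of Y, the quantity built up step by step in the following cells is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$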

In [None]:
## First, calculate the error for each observation by subtracting the predicted value from it:
Y_err = Y - Y_hat
print(Y_err)

In [None]:
## Second, calculate the square of each of these errors:
Y_err_sq = Y_err**2
print(Y_err_sq)

In [None]:
## Third, sum these values
sum_squared_err = sum(Y_err_sq)
print(sum_squared_err)

In [None]:
## Fourth, calculate the RMSE - take the square root of the summed squared error divided by the length of Y:
import math
RMSE = math.sqrt(sum_squared_err / len(Y))
print(round(RMSE, 2))
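
One way to sanity-check the by-hand results is to compare them against a library fit. The sketch below uses `numpy.polyfit` with degree 1, which fits a least-squares line. The names `X_check`, `Y_check`, and `rmse_check` are introduced here only so the notebook's own variables are left untouched.

```python
import numpy as np

# Same toy data as above, under new names so the notebook's X and Y are untouched
X_check = np.array([2, 4, 8, 12, 18, 20])
Y_check = np.array([1, 3, 5, 9, 19, 21])

# A degree-1 least-squares fit; coefficients are returned highest power first
slope, intercept = np.polyfit(X_check, Y_check, 1)

# Recompute the RMSE from the fitted line
Y_fit = intercept + slope * X_check
rmse_check = np.sqrt(np.mean((Y_check - Y_fit) ** 2))

print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("RMSE:", round(rmse_check, 2))
```

If the hand calculation is right, these values should agree with `B1`, `B0`, and `RMSE` up to rounding.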

# Bag of words model

Before we can "do" machine learning, keep in mind you might want to perform preprocessing steps as outlined in notebook 1-4. Then, we need to get our text into numeric form so we can plug it into our machine learning models. 

A bag of words model represents a text as a "bag" of its words: the words are normalized and counted, and their order is ignored. 
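
Before reaching for a library, the idea can be sketched by hand with `collections.Counter`. The lowercase-and-split normalization here is a simplifying assumption, not necessarily the preprocessing used elsewhere in these notebooks.

```python
from collections import Counter
import re

text = "This is the second second document."

# Normalize: lowercase the text and keep runs of word characters as tokens
tokens = re.findall(r"\w+", text.lower())

# Count each word; the order of the words is discarded
bag = Counter(tokens)
print(bag)
```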

# CountVectorizer

**`CountVectorizer`** will help us quickly tokenize text, learn its vocabulary, and encode the text as a vector for use in machine learning. This is often referred to as document encoding. 

In [None]:
#!pip install -U numpy scipy scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?"
]
# Define our vectorizer
vectorizer = CountVectorizer()

# Use the .fit method to tokenize the text and learn the vocabulary
vectorizer.fit(corpus)

# Print the vocabulary
print(vectorizer.vocabulary_)

In [None]:
# Encode the document
vector = vectorizer.transform(corpus)
print(vector) # 4 x 9 sparse matrix
print(vector.shape)
print(type(vector))

In [None]:
# View the vectors as arrays (4 x 9)
print(vector.toarray())

In [None]:
# Look at the arrays in the above cell. In which documents does "and" appear? What about "document"? What about "the"?
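# (get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out())
vectorizer.get_feature_names_out()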

In [None]:
# What does this tell us? 
vectorizer.transform(['this is']).toarray()

# Bigrams

In addition to uni-grams, using bigrams can be useful to preserve some ordering information. 

> NOTE: **`ngram_range=(1,2)`** will get you unigrams and bigrams, **`ngram_range=(1,3)`** adds tri-grams, **`ngram_range=(1,4)`** adds quad-grams, etc. 

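> **`token_pattern=r'\b\w+\b'`** is a regular expression that treats every run of one or more word characters as a token, so single-character words such as "a" are kept (the default pattern drops them).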
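
To see what the custom pattern changes, the sketch below applies both regular expressions directly with Python's `re` module. The example sentence is made up, and this only imitates the tokenizing step rather than running `CountVectorizer` itself.

```python
import re

text = "I am a bigram"

default_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's documented default token_pattern
custom_pattern = r"\b\w+\b"         # the pattern used in this notebook

default_tokens = re.findall(default_pattern, text.lower())
custom_tokens = re.findall(custom_pattern, text.lower())

print(default_tokens)  # one-character words are dropped
print(custom_tokens)   # one-character words are kept
```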

In [None]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r'\b\w+\b', min_df=1)

In [None]:
analyze = bigram_vectorizer.build_analyzer()
analyze('Bi-grams are cool!')

In [None]:
X = bigram_vectorizer.fit_transform(corpus).toarray()
X

In [None]:
bigram_vectorizer.get_feature_names_out()

In [None]:
feature_index = bigram_vectorizer.vocabulary_.get('this is')
X[:, feature_index]

# Challenge

Repeat these steps using the text below:

In [None]:
# quote from Kathryn E. Piquette: http://dhdebates.gc.cuny.edu/debates/text/40
DH = ["The digital humanities are a 'community of practice'",
      "(to borrow Etienne Wenger’s phrase)", 
      "whereby the learning, construction, and sharing of humanities knowledge",
      "is undertaken with the application of digital technologies",
      "in a reflexive, theoretically informed, and collaborative manner."]

# Challenge 

Define the following machine learning terms:

supervised =  

unsupervised =  

classification =  

regression =  

dependent variable =  

independent variable = 

performance metric =  

data split =  

training data =  

test data =  

cross-validation =  

**In the next lesson we will check out [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) - another type of document encoding that identifies unique words in text and documents.**