# Advanced Machine Learning & Signal Processing

#### Linear Algebra Terminology Review

* Scalar: a single numerical value, e.g. 1, 5, 42, $\pi$
* Vector: a one-dimensional array (m rows x 1 col)
* Matrix: a two-dimensional array (m rows x n cols)
* Tensor: any multi-dimensional array of numbers, for example: rank 0 (scalar), rank 1 (vector), rank 2 (matrix), rank 3 (3D array)

#### Tensors

More broadly, tensors are built from vectors and covectors that are combined using the tensor product ($\otimes$). Tensors feature heavily in the field of quantum computing: the joint state space of two quantum systems is the tensor product of their individual state spaces, and entangled states are those joint states that cannot be written as a single tensor product of individual state vectors.

#### Sparse Vectors

Sparse vectors contain predominantly zero values.

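Ex: `(12, [3], [1.0])` = a vector of 12 elements with the value 1.0 at index 3 (zero-based) and zeros everywhere else. The three parts are (size, indices, values).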
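
As a rough illustration, the (size, indices, values) triple can be expanded into its dense form with plain Python. The helper name `sparse_to_dense` is hypothetical and not part of Spark; indices are assumed to be zero-based.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# The example from the text: 12 elements, a single 1.0 at index 3
dense = sparse_to_dense(12, [3], [1.0])
print(dense)
```

Storing only the non-zero entries is what makes the sparse form cheaper when most values are zero.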

#### Spark ML

**StringIndexer** = a class that transforms a string class label into a numerical class index

**OneHotEncoder** = a class that transforms a column of category indices into one-hot encoded vectors with one binary element per category (by default Spark drops the last category, so the vector has one element fewer than the number of categories)

**VectorAssembler** = a class that merges a set of columns into a single vector column.

**Pipelines** speed up ML development and enable us to express an end-to-end workflow within a single framework.
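
To make the indexing and encoding steps concrete, here is a minimal plain-Python sketch of what they do conceptually. It is not Spark code: the function names and sample labels are hypothetical, labels are indexed by descending frequency (with alphabetical tie-breaking assumed for this sketch), and every category is kept in the one-hot vector for readability.

```python
from collections import Counter


def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories):
    """Return a vector with a single 1.0 at the given index."""
    vector = [0.0] * num_categories
    vector[index] = 1.0
    return vector


# Hypothetical class labels
labels = ["walk", "run", "walk", "sit", "walk", "run"]
label_to_index = fit_string_index(labels)
encoded = [one_hot(label_to_index[label], len(label_to_index)) for label in labels]

print(label_to_index)
print(encoded[0])
```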

#### Pipeline Example

```python
# Retrieve data from repo
!git clone url_to_data

# Confirm data download
!ls dataset_name

from pyspark.sql.types import StructType, StructField, IntegerType
import os
from pyspark.sql.functions import lit
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline

schema = StructType([
    StructField('x', IntegerType(), True),
    StructField('y', IntegerType(), True),
    StructField('z', IntegerType(), True)])

file_list = os.listdir("dataset_name")
file_list_filtered = [f for f in file_list if "_" in f]

df = None

# Iterate through files, appending file data to end of dataframe
for category in file_list_filtered:
    data_files = os.listdir(os.path.join("dataset_name", category))
    
    for data_file in data_files:
        print(data_file)
        temp_df = spark.read.option("header", "false").option("delimiter", " ").csv("dataset_name/" + category + '/' + data_file, schema=schema)
        temp_df = temp_df.withColumn("class", lit(category))
        temp_df = temp_df.withColumn("source", lit(data_file))
        
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

# Assign numerical value to each class
indexer = StringIndexer(inputCol="class", outputCol="classIndex")
indexed = indexer.fit(df).transform(df)

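# One-hot encode the numerical class index into a sparse vector
# (in Spark 3.x OneHotEncoder is an Estimator and must be fit first;
# in Spark 2.x it is a Transformer, so encoder.transform(indexed) is called directly)
encoder = OneHotEncoder(inputCol="classIndex", outputCol="categoryVector")
encoded = encoder.fit(indexed).transform(indexed)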

# Create a vector column from the input columns to be passed into an ML algorithm
vectorAssembler = VectorAssembler(inputCols=['x','y','z'], outputCol="features")
features_vectorized = vectorAssembler.transform(encoded)

# Normalize features
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
normalized = normalizer.transform(features_vectorized)

# Create a pipeline with the desired data processing stages
pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer])
model = pipeline.fit(df)
prediction = model.transform(df)

# Visualize the transformations
prediction.show()

# Drop unnecessary columns, leaving only the processed features column and the vectorized category column
df_train = prediction.drop('x').drop('y').drop('z').drop("class").drop("source").drop("features").drop("classIndex")

```

#### SystemML

Apache SystemML enables algorithms to be reused across data-parallel frameworks such as Hadoop and Spark, which streamlines deployment across varying environments. It provides an API called MLContext that allows the user to register RDDs and DataFrames that were previously created through Spark SQL or other libraries.

### Machine Learning with Spark ML

#### Linear Regression

First, create a VectorAssembler and a Normalizer. Then create a LinearRegression model. Finally, combine the stages into a Pipeline.

```python
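from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# Elastic-net regularized linear regression. By default it reads the "features"
# column and expects a numeric "label" column in df.
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

pipeline = Pipeline(stages=[vectorAssembler, normalizer, lr])
model = pipeline.fit(df)
predictions = model.transform(df)

# R^2 (coefficient of determination) of the fitted model on the training data
print(model.stages[2].summary.r2)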
```

#### Logistic Regression

Logistic regression passes the output of a linear regression through a sigmoid function, which maps it to a probability between 0 and 1. It is a supervised machine learning algorithm used to predict discrete categorical values.

```python
from pyspark.ml.classification import LogisticRegression

logr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
```
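
The sigmoid step can be sketched in plain Python. This is only an illustration of the idea, not how Spark fits the model; the weights, bias and feature values are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Linear combination of the features passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical, already-trained parameters
weights = [0.8, -0.4]
bias = 0.1

probability = predict_proba([2.0, 1.0], weights, bias)
predicted_class = 1 if probability >= 0.5 else 0
print(probability, predicted_class)
```

Thresholding the probability at 0.5 turns the continuous output into a discrete class.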

### Probabilities & Naive Bayes

* Marginal probability - probability of an event irrespective of any other event
* Joint probability - probability of events occurring together
* Conditional probability - probability of an event given that another event has occurred

**Bayes Rule Derivation**

* Sum Rule: $P(x) = \sum_{y}P(x,y)$
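* Product Rule: $P(x,y) = P(y|x)P(x)$

Because the joint probability is symmetric, the product rule can be written two ways: $P(x,y) = P(y|x)P(x) = P(x|y)P(y)$. Dividing both sides by $P(x)$, we can derive the Bayes rule:

$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$

This enables us to describe the probability of an event occurring based on prior knowledge of other events.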
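
A small worked example may help here. The joint distribution below is made up purely for illustration; the sketch applies the sum rule to obtain the marginals and then checks that Bayes rule gives the same answer as computing the conditional directly from the joint table.

```python
# Hypothetical joint distribution P(x, y) over weather x and activity y
joint = {
    ("rain", "stay_in"): 0.30,
    ("rain", "go_out"): 0.10,
    ("sun", "stay_in"): 0.15,
    ("sun", "go_out"): 0.45,
}


def marginal_x(x):
    """Sum rule: P(x) = sum over y of P(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(y):
    """Sum rule: P(y) = sum over x of P(x, y)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def conditional_x_given_y(x, y):
    """Product rule rearranged: P(x|y) = P(x, y) / P(y)."""
    return joint[(x, y)] / marginal_y(y)


def bayes_y_given_x(y, x):
    """Bayes rule: P(y|x) = P(x|y) P(y) / P(x)."""
    return conditional_x_given_y(x, y) * marginal_y(y) / marginal_x(x)


posterior = bayes_y_given_x("stay_in", "rain")
direct = joint[("rain", "stay_in")] / marginal_x("rain")
print(posterior, direct)
```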

**Gaussian Distribution**

The Gaussian (or Normal) distribution is a very common continuous distribution that arises frequently in nature. Because it is a valid probability density function, the area under the curve always integrates to one. The Gaussian is often used in machine learning because of the central limit theorem: the mean of many independent samples drawn from any distribution with finite variance is approximately Gaussian. However, Naive Bayes can also use other distributions, including Binomial and Multinomial.

$N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
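
The density above translates directly into code. The following sketch also approximates the area under the curve with a simple Riemann sum to illustrate that it is close to one; the integration range of eight standard deviations and the step count are arbitrary choices for this example.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(x | mu, sigma^2)."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


mu, sigma = 0.0, 1.0

# Approximate the integral over [mu - 8 sigma, mu + 8 sigma] with a Riemann sum
steps = 16000
lower = mu - 8 * sigma
step = 16 * sigma / steps
area = sum(gaussian_pdf(lower + i * step, mu, sigma) * step for i in range(steps))

print(gaussian_pdf(0.0, mu, sigma))
print(area)
```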

Bayesian inference is the process of adjusting the probability of a hypothesis as new evidence becomes available. This involves:

* Obtaining a prior for the hypothesis (a distribution) $P(H)$
* Collecting new data $E$ with a marginal likelihood $P(E)$
* Calculating the likelihood, i.e. how probable the new data is if the hypothesis holds, $P(E|H)$
* Obtaining a posterior, i.e. the probability of our hypothesis given the new data, $P(H|E) = \frac{P(E|H)P(H)}{P(E)}$

The likelihood is calculated by plugging the new data into a Gaussian equation, which is defined by the $\mu$ and $\sigma$ of the original data.

The goal is to select the $H$ that maximizes the posterior $\frac{P(E|H)P(H)}{P(E)}$. Because the denominator $P(E)$ does not depend on $H$, it can be ignored, leaving only the numerator $P(E|H)P(H)$. This is called the maximum a posteriori (MAP) estimate.

Naive Bayes is "naive" because it assumes that, when $x$ is a vector with multiple features, all features are conditionally independent given the class $y$. This simplifies the likelihood to a product of per-feature terms: $P(x|y) = \prod_{i}P(x_i|y)$.
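
Putting the pieces together, a Gaussian Naive Bayes classifier can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names and toy data are hypothetical, every feature is assumed to have non-zero variance within each class, and log probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow.

```python
import math
from statistics import mean, pstdev


def log_gaussian(x, mu, sigma):
    """Log of the Gaussian density N(x | mu, sigma^2)."""
    variance = sigma ** 2
    return -0.5 * math.log(2 * math.pi * variance) - (x - mu) ** 2 / (2 * variance)


def fit_naive_bayes(rows, labels):
    """Estimate a prior plus one (mu, sigma) pair per feature for each class."""
    model = {}
    for label in sorted(set(labels)):
        class_rows = [row for row, row_label in zip(rows, labels) if row_label == label]
        prior = len(class_rows) / len(rows)
        stats = [(mean(column), pstdev(column)) for column in zip(*class_rows)]
        model[label] = (prior, stats)
    return model


def predict_map(model, row):
    """Return the class maximizing log P(H) + sum of log P(x_i | H)."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(value, mu, sigma) for value, (mu, sigma) in zip(row, stats)
        )
    return max(scores, key=scores.get)


# Hypothetical two-feature training data with two classes
rows = [[1.0, 2.1], [1.2, 1.9], [0.9, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit_naive_bayes(rows, labels)
print(predict_map(model, [1.1, 2.0]))
print(predict_map(model, [3.1, 4.0]))
```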

### Support Vector Machines

SVM is a binary linear classifier that finds the maximum-margin separating hyperplane between point clouds. Often, it is necessary to transform data into another feature space using kernels in order to identify a separation boundary. SVM can also be used with multiple classes by training one classifier per class against all remaining classes (the "one-versus-all" approach).

```python
from pyspark.ml.classification import LinearSVC

lsvc = LinearSVC(maxIter=10, regParam=0.1)
```

### Classification Evaluation Measures

* Precision: $\frac{t_p}{t_p + f_p}$ - number of true positives divided by the total number of positive predictions made
* Recall: $\frac{t_p}{t_p + f_n}$ - number of true positives divided by the total number of instances that should have been identified as positive
* F1 Score: $\frac{2 \cdot precision \cdot recall}{precision + recall}$ - the harmonic mean of precision and recall
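
These three measures can be computed by hand from the true and predicted labels. The sketch below assumes binary labels with 1 as the positive class; the function name and sample labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```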

### Ensemble Learning

Decision Trees, while often weak performers on their own, become quite powerful and less prone to overfitting when ensembled. Bootstrap aggregation, or bagging, draws random samples (with replacement) from the dataset and trains a Decision Tree model on each sample; a "Random Forest" additionally considers only a random subset of features at each split. Results are then aggregated by voting or averaging. Note that bagging is a parallel process, since each tree is trained independently.

An alternative technique called Boosting trains models sequentially, with each new model fit to the residuals (errors) of the ensemble built so far. This process, which creates "Gradient Boosted Trees", is more computationally expensive, although optimized implementations such as XGBoost reduce the cost.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Create vector of input features
vectorAssembler = VectorAssembler(inputCols=['X', 'Y', 'Z'], outputCol="features")

# Create model, identifying input/output cols
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)

# Feed into pipeline
pipeline = Pipeline(stages=[vectorAssembler, gbt])
model = pipeline.fit(df)
prediction = model.transform(df)
prediction.show()

# Evaluate accuracy; the label column must match the labelCol used by the classifier
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("label")
binEval.evaluate(prediction)
```

### Cross-Validation, Grid Search & Testing

Using a validation set is important for detecting overfitting. However, this means that a section of data must be reserved and cannot be used for training, which is not ideal. Instead, cross-validation splits the data into k folds and iteratively trains on k-1 folds while validating on the remaining one. Model results are then averaged across folds.

Grid search optimizes models by iterating over the multi-dimensional hyperparameter space. Two important things to keep in mind with grid search: 1) because hyperparameters are tuned against the validation data, an additional held-out test set is required to assess overfitting and 2) the number of combinations to evaluate grows exponentially with the number of hyperparameters.
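
The cost of combining both techniques can be estimated with a short plain-Python sketch. It assumes contiguous, unshuffled folds for simplicity (real implementations typically split randomly), and the helper name, row count and grid values are hypothetical.

```python
from itertools import product


def k_fold_indices(num_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(num_rows))
    base, remainder = divmod(num_rows, num_folds)
    start = 0
    for fold in range(num_folds):
        size = base + (1 if fold < remainder else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 4))

# Hypothetical hyperparameter grid: every combination is one candidate model
grid = {
    "p": [1.0, 2.0, 10.0],
    "maxBins": [2, 4, 8, 16],
    "maxDepth": [2, 4, 8, 16],
}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

# Each combination is trained once per fold
total_fits = len(combinations) * len(folds)
print(len(combinations), total_fits)
```

Adding one more hyperparameter with four candidate values would multiply the number of fits by four, which is why grids are usually kept small.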

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# 3 x 4 x 4 = 48 combinations; normalizer.p only matters if normalizer is a stage of `pipeline`
paramGrid = (ParamGridBuilder()
             .addGrid(normalizer.p, [1.0, 2.0, 10.0])
             .addGrid(gbt.maxBins, [2, 4, 8, 16])
             .addGrid(gbt.maxDepth, [2, 4, 8, 16])
             .build())

# df_train and df_test are assumed to come from an earlier split, e.g. df.randomSplit([0.8, 0.2])
crossval = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator(), numFolds=4)
cvModel = crossval.fit(df_train)
prediction = cvModel.transform(df_test)
binEval.evaluate(prediction)
```

#### Regularization

Regularization is used to prevent overfitting by penalizing model complexity, typically the size of the coefficients. Two common penalties are L1 and L2. Lasso regression (L1) minimizes $SSE + \lambda\sum_{j}|\beta_j|$, i.e. the sum of squared errors plus the penalty strength $\lambda$ multiplied by the sum of the absolute values of the regression coefficients $\beta_j$. Ridge regression (L2) penalizes $\lambda\sum_{j}\beta_j^2$ instead. Because the L1 penalty can shrink coefficients to exactly zero, an equilibrium is reached between the goodness of fit and the number of features that are used in the model.
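
To see the trade-off numerically, the lasso objective can be evaluated for two candidate coefficient vectors. This sketch only evaluates the objective and does not fit a model; the function name, data and coefficients are hypothetical.

```python
def lasso_objective(rows, targets, coefficients, intercept, lam):
    """Sum of squared errors plus lam times the sum of absolute coefficients."""
    sse = 0.0
    for row, target in zip(rows, targets):
        prediction = intercept + sum(b * x for b, x in zip(coefficients, row))
        sse += (target - prediction) ** 2
    penalty = lam * sum(abs(b) for b in coefficients)
    return sse + penalty


# Hypothetical data where the target depends only on the first feature
rows = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
targets = [2.0, 4.0, 6.0]

# One candidate uses a single feature, the other also uses the second feature
sparse_cost = lasso_objective(rows, targets, [2.0, 0.0], 0.0, lam=0.5)
dense_cost = lasso_objective(rows, targets, [2.0, 0.5], 0.0, lam=0.5)
print(sparse_cost, dense_cost)
```

The candidate that uses the extra feature fits no better and pays a larger penalty, so the objective favors the sparser model.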

### Unsupervised Machine Learning

Unsupervised ML requires that we understand distance between points. Commonly, this involves Euclidean distance, which is an extension of the Pythagorean formula to vector spaces of any dimension. Another example of a distance measure is Manhattan distance, the sum of the absolute differences along each dimension.

Euclidean distance in n dimensions: $d(p,q) = d(q,p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2}$
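
Both distance measures are short enough to write out directly. The function names and sample points below are only for illustration.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences per dimension."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of the absolute differences per dimension."""
    return sum(abs(b - a) for a, b in zip(p, q))


p = (0.0, 0.0, 0.0)
q = (1.0, 2.0, 2.0)

print(euclidean_distance(p, q))
print(manhattan_distance(p, q))
```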

#### k-Means Clustering

* Initialize k cluster centroids in n-dimensional space
* Calculate distances between each point and each cluster centroid
* Assign each point to its nearest centroid
* Re-calculate each centroid as the mean of the points assigned to it
* Repeat until the assignments stop changing, i.e. the total distance between points and their centroids converges on a (local) minimum
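
The steps above can be sketched in plain Python. This is a minimal illustration rather than Spark's implementation: centroids are initialized by sampling k of the input points, the function names and toy data are hypothetical, and the loop stops as soon as assignments no longer change.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance (the square root is not needed for comparisons)."""
    return sum((b - a) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100, seed=1):
    """Return (centroids, assignments) after running k-means on the points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```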

k-Means is a simple algorithm with several shortcomings:

* It requires a preset number of clusters
* It uses every point when computing centroids, so it is sensitive to outliers
* It might incorrectly identify clusters due to initial centroid placement

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans().setK(5).setSeed(1)
pipeline = Pipeline(stages=[vectorAssembler, kmeans])
model = pipeline.fit(df)

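# Compute the within-set sum of squared errors (WSSSE)
# Note: computeCost was deprecated in Spark 2.4 and removed in 3.0;
# on newer versions use model.stages[1].summary.trainingCost instead
wssse = model.stages[1].computeCost(vectorAssembler.transform(df))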
```
 
#### Hierarchical Clustering

Unlike k-Means, Hierarchical Clustering algorithms do not require the number of clusters to be pre-specified. They also offer the ability to learn non-spherical boundaries. In the agglomerative variant of this technique, each point starts as its own cluster and the nearest clusters are merged step by step, creating a history of merges (a dendrogram) that can be reviewed by the end-user. This means that the number of clusters can be chosen after the fact.

#### Density-Based Clustering aka DBSCAN

DBSCAN offers improved clustering capabilities and outlier identification. The model is initialized with two parameters: epsilon, the maximum neighborhood radius, and the minimum number of points that must lie within a point's epsilon neighborhood for it to seed a cluster.

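* Select an unvisited point and classify it as a core point (at least the minimum number of points in its epsilon neighborhood), a border point (not core, but within epsilon of a core point) or an outlier.
* If it is not a core point, select another unvisited point.
* If it is a core point, start a cluster, add every point within its epsilon neighborhood, and keep expanding from each newly added core point until no more points are reachable.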
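
A compact plain-Python sketch of this procedure follows. It is an illustration under stated assumptions, not a library implementation: points are visited in list order rather than randomly, a point counts as part of its own neighborhood, outliers are labeled -1, and all names and data are hypothetical.

```python
import math

NOISE = -1


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (including itself)."""
    origin = points[index]
    return [
        j for j, other in enumerate(points)
        if math.sqrt(sum((b - a) ** 2 for a, b in zip(origin, other))) <= eps
    ]


def dbscan(points, eps, min_points):
    """Return one cluster label per point, with NOISE for outliers."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster_id += 1
    return labels


# Hypothetical data: two dense groups and one isolated point
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.0), (1.1, 0.8),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1), (5.2, 4.9),
    (9.0, 1.0),
]
labels = dbscan(points, eps=0.5, min_points=3)
print(labels)
```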

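DBSCAN is often preferred when the data is noisy or the clusters are irregularly shaped, since it is resilient to outliers and does not require the number of clusters in advance.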

#### Dimensionality Reduction

Dimensionality reduction is a broad term for techniques such as feature selection, which is often driven by domain knowledge, and feature reduction, which is achieved by constructing a new, smaller feature set that retains most of the relevant information. **Principal Component Analysis (PCA)** is a widely used mechanism for feature reduction.

The idea behind PCA is that we want to find the direction in the dataset that preserves the most variance. The data can then be projected onto this direction, which becomes the new coordinate system. For example, in 2D space, data points are projected onto the line that defines the direction that maximally explains variance, transforming the data from 2D into 1D. In 3D space, you would repeat this process twice: first, finding the vector that explains the most variance and then finding another vector, perpendicular to the first, that explains the most of the remaining variance. You could then keep both vectors (reducing the data to 2D) or only the first (reducing it to 1D).

PCA Process:

* Center the data by subtracting the mean of each feature
* Compute the covariance matrix $\Sigma$
* Find the eigenvectors and eigenvalues of $\Sigma$ and sort by decreasing eigenvalue
    * The eigenvectors become the principal components, or the directions that explain the most of the variance
    * The eigenvalues represent the amount of explained variance for each eigenvector
* Select the desired number of new dimensions and project the data by taking the dot product of each centered observation with each of the selected eigenvectors
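
The process above can be sketched in plain Python for the first principal component. As a simplification, this example uses power iteration, which only recovers the eigenvector with the largest eigenvalue; a full PCA would use a complete eigendecomposition from a numerical library. The function names and data are hypothetical.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenvector(matrix, iterations=100):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = [sum(m * v for m, v in zip(row, vector)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # Rayleigh quotient v^T M v for a unit vector v
    eigenvalue = sum(
        v * sum(m * w for m, w in zip(row, vector)) for v, row in zip(vector, matrix)
    )
    return eigenvalue, vector


# Hypothetical 2D data lying roughly along the line y = x
rows = [[2.0, 1.9], [1.0, 1.1], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]

centered = center(rows)
sigma = covariance_matrix(centered)
eigenvalue, component = top_eigenvector(sigma)

# Project each centered observation onto the first principal component (2D -> 1D)
projected = [sum(x * c for x, c in zip(row, component)) for row in centered]
explained_ratio = eigenvalue / (sigma[0][0] + sigma[1][1])
print(component, explained_ratio)
```

Because the data lies close to a line, the first component explains nearly all of the variance, so little information is lost by dropping the second dimension.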