# Advanced Machine Learning & Signal Processing

#### Linear Algebra Terminology Review

* Scalar: a single numerical value, e.g. 1, 5, 42, $\pi$
* Vector: a one-dimensional array (m rows x 1 col)
* Matrix: a two-dimensional array (m rows x n cols)
* Tensor: an array of numbers with any number of dimensions, e.g. rank 0 (scalar), rank 1 (vector), rank 2 (matrix), rank 3 (3D array)
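
To make the rank terminology concrete, here is a minimal sketch that counts the number of axes of a value stored as nested Python lists. The helper name `tensor_rank` is an assumption for illustration, and it assumes the nesting is non-empty and rectangular.

```python
def tensor_rank(value):
    """Return the nesting depth of a rectangular nested list; a bare number has rank 0."""
    rank = 0
    while isinstance(value, list):
        rank += 1
        value = value[0]  # assumes non-empty, rectangular nesting
    return rank

scalar = 42
vector = [1, 2, 3]
matrix = [[1, 2], [3, 4]]
cube = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

for name, value in [("scalar", scalar), ("vector", vector), ("matrix", matrix), ("cube", cube)]:
    print(name, tensor_rank(value))
```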

#### Tensors

More broadly, tensors are built from vectors and covectors that are combined using the tensor product ($\otimes$). Tensors feature heavily in the field of quantum computing: the joint state of two quantum systems lives in the tensor product of their individual state spaces, and an entangled state is one that cannot be written as a single tensor product of individual state vectors.
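
As a small illustration of the tensor product itself, the sketch below combines two single-qubit basis states into one two-qubit state vector. It uses plain Python lists, and the helper name `tensor_product` is an assumption for illustration rather than a library function.

```python
def tensor_product(a, b):
    """Kronecker (tensor) product of two state vectors given as flat lists."""
    return [x * y for x in a for y in b]

zero = [1, 0]  # basis state |0>
one = [0, 1]   # basis state |1>

# |0> tensor |1> = |01>: a four-element vector with a single 1 at index 1
state_01 = tensor_product(zero, one)
print(state_01)
```

The resulting vector has length 2 x 2 = 4, which is why the state space grows exponentially with the number of qubits.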

#### Sparse Vectors

Sparse vectors contain predominantly zero values.

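Ex: (12, [3], [1.0]) = a vector of 12 elements that is zero everywhere except for a 1.0 at index 3 (zero-based). The three parts are the size, the indices of the non-zero entries, and their values.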
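
The sparse triple above can be expanded into its dense form with a few lines of plain Python. This is only a sketch: the helper name `to_dense` is hypothetical, and it assumes zero-based indices.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense

dense = to_dense(12, [3], [1.0])
print(dense)
```

Storing only the non-zero entries pays off when the vector is long and mostly zeros, as is typical for one-hot encoded categories.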

#### Spark ML

**StringIndexer** = a class that transforms a string class label into a numerical class index

**OneHotEncoder** = a class that transforms a column of category indices into a one-hot encoded vector with one binary element per category (by default Spark drops the last category, so the vector has one element fewer than there are categories)

**VectorAssembler** = a class that combines a set of columns into a single vector column.

**Pipelines** speed up ML development and enable us to express an end-to-end workflow within a single framework.
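
To make the StringIndexer and OneHotEncoder definitions above concrete, here is a plain-Python imitation of what the two stages compute. It is a sketch, not the Spark implementation: the function names and the sample labels are hypothetical, the index order (most frequent label first) mirrors Spark's default ordering, and the one-hot vectors keep every category instead of dropping the last one.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories):
    """Return a full-length one-hot list for a class index."""
    return [1.0 if i == index else 0.0 for i in range(num_categories)]

# Hypothetical class labels
labels = ["walk", "run", "walk", "sit", "walk", "run"]

mapping = fit_string_index(labels)
encoded = [one_hot(mapping[label], len(mapping)) for label in labels]

print(mapping)
print(encoded[0])
```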

#### Pipeline Example

```python
# Retrieve data from repo
!git clone url_to_data

# Confirm data download
!ls dataset_name

from pyspark.sql.types import StructType, StructField, IntegerType
import os
from pyspark.sql.functions import lit
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, Normalizer
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline

# Each file holds three integer columns: x, y and z
schema = StructType([
    StructField('x', IntegerType(), True),
    StructField('y', IntegerType(), True),
    StructField('z', IntegerType(), True)])

file_list = os.listdir("dataset_name")
file_list_filtered = [f for f in file_list if "_" in f]

df = None

# Iterate through files, appending file data to end of dataframe
for category in file_list_filtered:
    data_files = os.listdir("dataset_name/" + category)
    
    for data_file in data_files:
        print(data_file)
        temp_df = spark.read.option("header", "false").option("delimiter", " ").csv("dataset_name/" + category + '/' + data_file, schema=schema)
        temp_df = temp_df.withColumn("class", lit(category))
        temp_df = temp_df.withColumn("source", lit(data_file))
        
        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

# Assign numerical value to each class
indexer = StringIndexer(inputCol="class", outputCol="classIndex")
indexed = indexer.fit(df).transform(df)

# One-hot encode the numerical class index into a sparse vector
# (in Spark 3.0+, OneHotEncoder is an Estimator and must be fit before transforming)
encoder = OneHotEncoder(inputCol="classIndex", outputCol="categoryVector")
encoded = encoder.fit(indexed).transform(indexed)

# Create a vector column from the input columns to pass into an ML algorithm
vectorAssembler = VectorAssembler(inputCols=['x','y','z'], outputCol="features")
features_vectorized = vectorAssembler.transform(encoded)

# Normalize features
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
normalized = normalizer.transform(features_vectorized)

# Create a pipeline with the desired data processing stages
pipeline = Pipeline(stages=[indexer, encoder, vectorAssembler, normalizer])
model = pipeline.fit(df)
prediction = model.transform(df)

# Visualize the transformations
prediction.show()

# Drop unnecessary columns, leaving only the processed features column and the vectorized category column
df_train = prediction.drop("x", "y", "z", "class", "source", "features", "classIndex")
```
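
The `Normalizer` stage in the pipeline scales each feature vector to unit p-norm, with `p=1.0` meaning the absolute values of the entries sum to one. The sketch below reproduces that arithmetic in plain Python; the function name `normalize` and the sample row are assumptions for illustration.

```python
def normalize(vector, p=1.0):
    """Scale a vector so its p-norm equals 1 (zero vectors are returned unchanged)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)
    return [v / norm for v in vector]

row = [3.0, -1.0, 4.0]  # hypothetical x, y, z reading

print(normalize(row, p=1.0))  # absolute values sum to 1
print(normalize(row, p=2.0))  # squares sum to 1
```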

#### System ML

System ML enables algorithms to be reused across data-parallel frameworks such as Hadoop and Spark, streamlining deployment across different environments. It provides an API called MLContext that allows the user to register RDDs and DataFrames that were previously created through Spark SQL or other libraries.

### Machine Learning with Spark ML

#### Linear Regression

First, create a Vector Assembler and Normalizer. Then create a Linear Regression model. Finally, combine stages into a Pipeline.

```python
from pyspark.ml.regression import LinearRegression

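# featuresCol and labelCol default to "features" and "label";
# set them explicitly if the DataFrame uses different column names
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)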

pipeline = Pipeline(stages=[vectorAssembler, normalizer, lr])
model = pipeline.fit(df)
predictions = model.transform(df)

# R^2 of the fitted LinearRegression stage (index 2 in the pipeline) on the training data
print(model.stages[2].summary.r2)
```
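
To show what the fitted model and its `r2` value represent, here is a minimal sketch of ordinary least squares for a single feature, followed by the R² calculation. It is plain Python with made-up data, and unlike the Spark example it applies no regularization.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(ys, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
predictions = [slope * x + intercept for x in xs]
r2 = r_squared(ys, predictions)
print(slope, intercept, r2)
```

An R² close to 1 means the line explains almost all of the variance in the target.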

#### Logistic Regression

Logistic regression passes the output of a linear model through a sigmoid function, which maps it to a probability between 0 and 1. It is a supervised machine learning algorithm used to predict discrete class labels.

```python
# LogisticRegression is a classifier, so it lives in pyspark.ml.classification
from pyspark.ml.classification import LogisticRegression

logr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
```
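
The sigmoid step can be sketched in a few lines of plain Python. The weights, bias, and the 0.5 decision threshold below are hypothetical values chosen for illustration, not parameters learned by Spark.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Linear combination of the features passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

weights = [0.8, -0.5]  # hypothetical learned weights
bias = 0.1             # hypothetical learned intercept

p = predict_proba([2.0, 1.0], weights, bias)
label = 1 if p >= 0.5 else 0  # threshold the probability to get a class
print(p, label)
```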

### Probabilities & Naive Bayes

* Marginal probability - probability of an event irrespective of the outcome of any other event
* Joint probability - probability of events occurring together
* Conditional probability - probability of an event given that another event has occurred
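
The three definitions above can be read off a small joint probability table. The sketch below uses made-up numbers for two events (weather and arrival time); the dictionary layout and function names are assumptions for illustration.

```python
# Hypothetical joint distribution P(weather, arrival); the values sum to 1
joint = {
    ("rain", "late"): 0.15,
    ("rain", "on_time"): 0.15,
    ("dry", "late"): 0.07,
    ("dry", "on_time"): 0.63,
}

def marginal_weather(weather):
    """P(weather): sum the joint over every arrival outcome."""
    return sum(p for (w, _), p in joint.items() if w == weather)

def conditional_arrival(arrival, weather):
    """P(arrival | weather) = P(weather, arrival) / P(weather)."""
    return joint[(weather, arrival)] / marginal_weather(weather)

print(marginal_weather("rain"))             # marginal
print(joint[("rain", "late")])              # joint
print(conditional_arrival("late", "rain"))  # conditional
```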

**Bayes Rule Derivation**

* Sum Rule: $P(x) = \sum_{y}P(x,y)$
* Product Rule: $P(x,y) = P(y|x)P(x)$

Because $P(x,y) = P(y,x)$, the product rule can be written in two ways: $P(y|x)P(x) = P(x|y)P(y)$. Dividing both sides by $P(x)$ gives the Bayes rule:

$P(y|x) = \frac{P(x|y)P(y)}{P(x)}$

This enables us to describe the probability of an event occurring based on prior knowledge of other events.
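
A short numeric example shows how the sum and product rules feed into the Bayes rule. The numbers are invented: $y$ is whether a message is spam, and $x$ is the event that a particular word appears in it.

```python
# Hypothetical prior P(y) and likelihood P(x | y)
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.6, "ham": 0.05}

# Sum rule combined with the product rule: P(x) = sum over y of P(x | y) P(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes rule: P(y | x) = P(x | y) P(y) / P(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(evidence)
print(posterior)
```

Even though only 20% of messages are spam in this example, seeing the word raises the spam probability to 75%, because the word is much more likely under the spam hypothesis.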

**Gaussian Distribution**

The Gaussian (or Normal) distribution is a very common continuous distribution that arises in many natural processes. Because it is a valid probability density function, the area under the curve always integrates to one. The Gaussian is often used in machine learning because of the central limit theorem: the mean of many independent samples drawn from any distribution with finite variance is approximately Gaussian. However, Naive Bayes can also use other distributions, including Binomial and Multinomial.

$N(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
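
The density formula translates directly into code. The sketch below implements it with the standard library and numerically checks that the area under the curve is one; the integration helper, its range of eight standard deviations, and the step count are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(x | mu, sigma^2)."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def integrate(mu, sigma, width=8.0, steps=10_000):
    """Midpoint Riemann sum over [mu - width * sigma, mu + width * sigma]."""
    lower = mu - width * sigma
    step = 2 * width * sigma / steps
    return sum(gaussian_pdf(lower + (i + 0.5) * step, mu, sigma) for i in range(steps)) * step

print(gaussian_pdf(0.0, 0.0, 1.0))  # peak height of the standard normal
print(integrate(2.0, 0.5))          # should be very close to 1
```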

Bayesian inference is the process of adjusting the probability of a hypothesis as new evidence becomes available. This involves:

* Obtaining a prior, i.e. the probability of the hypothesis before seeing the new data $P(H)$
* Collecting new data $E$ with a marginal likelihood $P(E)$
* Calculating the likelihood, i.e. how probable the new data is if the hypothesis holds $P(E|H)$
* Obtaining a posterior, i.e. the probability of our hypothesis given the new data $P(H|E) = \frac{P(E|H)*P(H)}{P(E)}$

The likelihood is calculated by plugging the new data into a Gaussian density whose $\mu$ and $\sigma$ are estimated from the original data.

The goal is to maximize the posterior distribution, i.e. select the $H$ which maximizes $\frac{P(E|H)*P(H)}{P(E)}$. Notice that the denominator $P(E)$ does not depend on $H$, so it can be ignored, leaving only the numerator. The resulting choice is called the maximum a posteriori (MAP) estimate.
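
The MAP selection can be sketched as follows: score each hypothesis by likelihood times prior and pick the largest. The two hypotheses, their Gaussian parameters, and their priors are invented for illustration, and the function names are assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(x | mu, sigma^2), used here as the likelihood P(E | H)."""
    variance = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical hypotheses: Gaussian parameters of the original data plus a prior P(H)
hypotheses = {
    "walking": {"mu": 1.0, "sigma": 0.5, "prior": 0.7},
    "running": {"mu": 3.0, "sigma": 0.8, "prior": 0.3},
}

def map_estimate(evidence_value):
    """Return the hypothesis maximizing P(E | H) * P(H); P(E) is omitted on purpose."""
    scores = {
        name: gaussian_pdf(evidence_value, h["mu"], h["sigma"]) * h["prior"]
        for name, h in hypotheses.items()
    }
    return max(scores, key=scores.get), scores

best, scores = map_estimate(2.6)
print(best, scores)
```

Dividing every score by the same $P(E)$ would rescale them into proper probabilities but would not change which hypothesis wins.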

Naive Bayes is "naive" because it assumes that, when $x$ is a vector with multiple features, all features are conditionally independent given the class. This lets us simplify the calculation: the likelihood of the whole vector becomes the product of the per-feature likelihoods.
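
Under that independence assumption, a Gaussian Naive Bayes classifier fits in a few lines. This is a minimal sketch with invented data, not the Spark ML implementation: it estimates a mean and variance per class and per feature, then sums log-likelihoods (the log of the product) together with the log prior. The function names and the small variance floor of `1e-9` are assumptions.

```python
import math
from collections import defaultdict

def fit(rows, labels):
    """Estimate the prior and per-feature Gaussian parameters for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(c) / len(c) for c in columns]
        # Small floor keeps the variance strictly positive
        variances = [sum((v - m) ** 2 for v in c) / len(c) + 1e-9
                     for c, m in zip(columns, means)]
        model[label] = {"prior": len(members) / len(rows),
                        "means": means, "variances": variances}
    return model

def log_gaussian(x, mean, variance):
    """Log of the Gaussian density."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)

def predict(model, row):
    """Pick the class with the highest log prior plus summed per-feature log-likelihoods."""
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        score += sum(log_gaussian(x, m, v)
                     for x, m, v in zip(row, params["means"], params["variances"]))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical two-feature training data with two classes
rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]]
labels = ["a", "a", "a", "b", "b", "b"]

model = fit(rows, labels)
print(predict(model, [1.1, 2.0]))
print(predict(model, [3.1, 0.6]))
```

Working in log space avoids numerical underflow when many small per-feature probabilities are multiplied together.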