# Machine Learning

## Definition

**The field of study that focuses on the development of algorithms and models that allow computers to learn patterns and relationships from data and make predictions or decisions without being explicitly programmed.**

This learning process involves the use of statistical techniques, mathematical optimization, and algorithms to enable computers to:
- **Recognize Patterns**: Identify and learn patterns or structures within data.
- **Make Predictions**: Based on learned patterns, make predictions or decisions about new or unseen data.
- **Improve Performance**: Continuously refine and improve predictions or decisions as new data becomes available.

The goal of machine learning is to build models that **generalize well to new, unseen data**, allowing for accurate predictions or decisions in various domains and business settings.

## Common Terms

- **Supervised Learning**: A type of machine learning where models learn from labeled data to make predictions or decisions. It involves input-output pairs for training.
  - **Classification**: A type of supervised learning where the model predicts categories or classes for new instances based on past observations.
  - **Regression**: A type of supervised learning where the model predicts continuous values based on input features.
- **Unsupervised Learning**: Learning from data without labeled responses. Algorithms try to uncover hidden patterns or structures within the data.
  - **Clustering**: Unsupervised learning technique that groups similar data points together based on certain criteria.
- **Semi-Supervised Learning**: A combination of supervised and unsupervised learning where a model learns from both labeled and unlabeled data.
- **Reinforcement Learning**: Learning through interaction with an environment to achieve a goal. The model learns by receiving rewards or penalties for its actions.
- **Generative AI**: Models that generate novel, realistic content (e.g. text, images, or audio) resembling human-created data.

- **Training Data**: The data used to train a machine learning model, consisting of input features and corresponding labels/targets.
- **Validation Data**: A dataset held out during training and used to tune hyperparameters and detect overfitting.
- **Test Data**: A separate dataset used to assess the model's performance after it has been trained and validated.
- **Feature**: An individual measurable property or characteristic of a phenomenon being observed. Features are used as input variables in machine learning models.
- **Label/Target**: The output variable that a supervised learning algorithm aims to predict. It's the value or category being predicted.
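The three datasets above can be produced by splitting twice. This is a minimal sketch on made-up data; the variable names and the 60/20/20 ratio are assumptions chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features, balanced binary label (all values made up).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
labels = np.array([0, 1] * 50)

# First split off 20% as the test set ...
x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)
# ... then 25% of the remainder as validation set (20% of the total).
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(x_train), len(x_val), len(x_test))
```

The test set is only touched once, at the very end; all tuning decisions are made on the validation set.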

- **Feature Engineering**: The process of creating new features or transforming existing ones to improve a model's performance.
- **Dimensionality Reduction**: Techniques to reduce the number of input features, often using methods like Principal Component Analysis (PCA) or t-SNE.

- **Algorithm**: Mathematical procedures or techniques used to learn patterns and relationships from data, enabling computers to make predictions or decisions without explicit programming.
- **Model**: The result of applying an algorithm to training data: an (approximate) representation of the patterns learned from the data.
- **Hyperparameters**: Parameters of a machine learning model that are set before the training process begins. They control the learning process and affect the model's performance.
- **Overfitting**: Occurs when a model learns too much from the training data, capturing noise and irrelevant patterns, leading to poor performance on new data.
- **Underfitting**: Occurs when a model is too simple to capture the underlying patterns in the training data, resulting in poor performance on both training and new data.
- **Regularization**: Techniques used to prevent overfitting by adding penalties or constraints to the model's parameters.
- **Cross-Validation**: A technique used to assess the performance of a model by splitting the data into multiple subsets, training on some and validating on others.
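Cross-validation and a simple complexity constraint can be sketched together as follows. The example assumes the Iris data bundled with scikit-learn and uses `max_depth=3` as an arbitrary hyperparameter value that limits how closely the tree can fit the training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)

# max_depth is a hyperparameter: it is set before training and caps model complexity.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation: each fold is held out once for validation
# while the model is trained on the remaining four folds.
scores = cross_val_score(model, x, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average validation accuracy
```

Comparing the mean score for several values of `max_depth` is one way to pick a setting that neither underfits nor overfits.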

- **Metric**: Quantifiable measure used to evaluate the performance or quality of a machine learning model.
- **Accuracy**: The proportion of correctly classified instances out of the total instances in classification tasks.
- **Precision**: In binary classification, the proportion of true positive predictions out of all positive predictions. It measures the accuracy of positive predictions.
- **Recall**: The proportion of true positive predictions out of all actual positive instances. It measures the model's ability to identify all positive instances. 
- **F1 Score**: The harmonic mean of precision and recall, providing a single score that balances both metrics.
- **Mean Absolute Error (MAE)**: The average of absolute differences between predicted and actual values, measuring the average magnitude of errors. 
- **Mean Squared Error (MSE)**: The average of squared differences between predicted and actual values, giving higher weight to larger errors. 
- **Root Mean Squared Error (RMSE)**: The square root of the mean squared error, expressing the typical error magnitude in the same units as the target.
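To make the metric definitions concrete, here is a small worked example in plain Python. The labels, predictions, and variable names are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math

# Hypothetical binary labels (1 = positive class) and model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)

# Hypothetical regression targets and predictions.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
errors = [p - a for p, a in zip(predicted, actual)]

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)
print(mae, mse, rmse)
```

Note how the single error of size 2 contributes 4 to the squared sum, which is why MSE and RMSE react more strongly to large errors than MAE.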

- **Bias**: Systematic error introduced by approximations, assumptions, or simplifications in a model.
- **Variance**: The variability of model predictions for a given data point, which indicates the model's sensitivity to variations in the training data.
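- **Bias-Variance Tradeoff**: The tension between two sources of error: models that are too simple tend to have high bias (underfitting), while very flexible models tend to have high variance (overfitting). Lowering one typically raises the other, so the aim is a model complexity that minimizes the total error on new data.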
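For squared-error loss, this is usually written as the standard decomposition of the expected prediction error. The symbols are assumptions introduced here for illustration: $f$ is the true function, $\hat{f}$ is the model fitted on a random training set, and $\sigma^2$ is the variance of the irreducible noise in $y = f(x) + \varepsilon$.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

The expectation is taken over training sets and noise; only the first two terms can be influenced by the choice of model.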

- **Gradient Descent**: An optimization algorithm used to minimize the loss function by adjusting the model's parameters iteratively.
- **Loss Function**: A measure of how well a model performs on the training data by quantifying the difference between predicted and actual values.
- **Optimization**: The process of adjusting the model's parameters to minimize the loss function, often using optimization algorithms like gradient descent.
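How loss function, gradient descent, and optimization fit together can be sketched in a few lines of plain Python. This is a minimal illustration on noise-free toy data, assuming a one-feature linear model and a hand-picked learning rate; real libraries use more refined variants.

```python
# Toy data generated from y = 2x + 1 (no noise), so the optimum is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0        # model parameters, arbitrary starting point
learning_rate = 0.05   # hyperparameter: step size
n = len(xs)

for step in range(2000):
    # Prediction errors of the current model w*x + b.
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    # Gradients of the MSE loss mean((w*x + b - y)^2) w.r.t. w and b.
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

loss = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
print(round(w, 3), round(b, 3), loss)
```

With a learning rate that is too large the updates overshoot and diverge; with one that is too small, convergence takes many more steps.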


- **(Training) Pipeline**: Sequence of data processing steps or operations that are chained together in a specific order to automate a workflow. Typically includes data preprocessing, feature engineering, model training, and model evaluation, and possibly more.
- **Deployment**: The process of putting a trained machine learning model into operation or making it available for use in real-world applications.
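A pipeline in scikit-learn can look like the following sketch. The choice of steps (standard scaling followed by logistic regression) and the step names `"scale"` and `"model"` are assumptions for illustration; any compatible transformers and estimator could be chained the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# Steps run in order: preprocessing first, the model last.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns the scaling on the training data only, then trains the model;
# score() applies the same scaling to the test data before predicting.
pipeline.fit(x_train, y_train)
test_accuracy = pipeline.score(x_test, y_test)
print(test_accuracy)
```

Bundling preprocessing and model into one object keeps test data out of the preprocessing fit and gives a single artifact to hand over for deployment.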

- **Linear Regression**: A linear approach to modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data.
- **Coefficients (Weights)**: The values assigned to each independent variable in a linear equation representing the strength and direction of the relationship with the dependent variable.
- **Intercept**: The constant term in a linear equation that represents the predicted value when all independent variables are zero.
- **Residuals**: The differences between observed and predicted values in a regression model, indicating the model's accuracy.
- **R-squared (R²)**: A measure indicating the proportion of variance in the dependent variable explained by the independent variables in the model.
- **Adjusted R-squared**: A modified version of R-squared that adjusts for the number of predictors in the model.
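Residuals, R², and adjusted R² can be computed by hand as in this sketch. The values and the assumed number of predictors are hypothetical and only serve to show the arithmetic.

```python
# Hypothetical observed and predicted values of the dependent variable.
actual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predicted = [1.5, 1.5, 3.0, 4.5, 4.5, 6.0]
n_predictors = 2  # assumed number of independent variables in the model

n = len(actual)
mean_actual = sum(actual) / n

# Residuals: observed minus predicted values.
residuals = [a - p for a, p in zip(actual, predicted)]

ss_res = sum(r ** 2 for r in residuals)               # unexplained variation
ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variation

r_squared = 1 - ss_res / ss_tot
# Adjusted R² penalizes additional predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

print(r_squared, adjusted_r_squared)
```

Adjusted R² is always at most R², and the gap widens as more predictors are added without improving the fit.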

- **Decision Tree**: A hierarchical tree-like structure consisting of nodes and branches used to make decisions by partitioning the data based on features.
- **Node**: Represents a decision point in a decision tree, containing a feature and a threshold used to split the data.
- **Root Node**: The top node of a decision tree, representing the initial split.
- **Leaf Node (Terminal Node)**: End nodes in a decision tree that do not split further, providing the final decision or prediction.
- **Random Forest**: An ensemble learning method that builds multiple decision trees and aggregates their predictions to improve accuracy and reduce overfitting.
- **Gradient Boosting**: An ensemble learning technique that builds multiple weak learners (usually decision trees) sequentially, each one correcting the errors of its predecessor.
- **Weak Learner**: Individual models that perform slightly better than random guessing; typically, shallow decision trees are used as weak learners in gradient boosting.

- **Neural Networks**: A set of algorithms designed to recognize patterns, inspired by the human brain's structure, consisting of interconnected nodes (neurons) arranged in layers.
- **Deep Learning**: Subset of machine learning using neural networks with multiple layers (deep neural networks) to learn from data.
- **Convolutional Neural Networks (CNNs)**: Deep learning models specifically designed for processing structured grid-like data, often used in image recognition.
- **Recurrent Neural Networks (RNNs)**: Neural networks designed for sequential data by introducing connections that loop back on themselves, used in natural language processing and time series analysis.
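- **Transformers**: A modern alternative to recurrent architectures such as RNNs for processing sequential data. They rely on attention instead of recurrence, so sequences can be processed in parallel, which makes training more efficient.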


# sklearn

In [None]:
%pip install scikit-learn

In [None]:
import numpy as np
import pandas as pd
import sklearn  # usually: import of individual submodules

## A simple classification example

### Read Data

In [None]:
# read the data
df = pd.read_csv('../data/iris.csv')
df

### Train-Test-Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# separate features and target;
# the target stays as string class labels (sklearn handles these directly)
x = df.iloc[:, :4]
y = df[['variety']]

In [None]:
# stratified sampling in the train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y)

In [None]:
x_train.head()

In [None]:
x_train.shape

In [None]:
x_test.head()

In [None]:
x_test.shape

In [None]:
y_train.head()

In [None]:
y_train.shape

In [None]:
y_test.head()

In [None]:
y_test.shape

In [None]:
# verify that the stratify option did what it should
np.unique(y_test, return_counts=True)

### Train a simple Decision Tree

In [None]:
from sklearn import tree

In [None]:
classifier = tree.DecisionTreeClassifier()

In [None]:
# training the classifier
classifier.fit(x_train, y_train)

### Make predictions on the test set

In [None]:
y_pred = classifier.predict(x_test)
y_pred

In [None]:
y_pred = pd.DataFrame(y_pred, columns=['prediction'], index=y_test.index)

In [None]:
results = pd.concat([y_test, y_pred], axis=1)
results['correct'] = results.variety == results.prediction
results.head()

### Evaluating Predictions

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
ConfusionMatrixDisplay.from_estimator(classifier, x_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
f1_score(y_test, y_pred, average='macro')

In [None]:
matthews_corrcoef(y_test, y_pred)

In [None]:
# training set 'memorized', predictions on training are perfect
# some (fairly minor here) overfitting -- predictions on test not as good as on train
matthews_corrcoef(y_train, classifier.predict(x_train))

### Understanding the model better

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay, permutation_importance

In [None]:
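# class_names must follow the order of classifier.classes_, not the order of appearance in y_train
tree.plot_tree(classifier, feature_names=list(x_train.columns), class_names=list(classifier.classes_), filled=True);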

In [None]:
# .importances holds the raw per-repeat values, shape (n_features, n_repeats)
test_importance = permutation_importance(classifier, x_test, y_test, n_repeats=20).importances
train_importance = permutation_importance(classifier, x_train, y_train, n_repeats=20).importances

In [None]:
test_importance = pd.DataFrame(test_importance.T, columns=x_test.columns)
train_importance = pd.DataFrame(train_importance.T, columns=x_train.columns)

In [None]:
test_importance.head()

In [None]:
import plotnine as p9

In [None]:
molten_test_importance = test_importance.melt(var_name='feature', value_name='importance')
molten_train_importance = train_importance.melt(var_name='feature', value_name='importance')
molten_test_importance['set'] = 'test'
molten_train_importance['set'] = 'train'

In [None]:
combined_importance = pd.concat([molten_test_importance, molten_train_importance])
combined_importance

In [None]:
p9.ggplot(combined_importance, p9.aes(x='feature', y='importance', color='set')) + p9.geom_boxplot()

## A simple regression example

In [None]:
from sklearn.datasets import fetch_california_housing

In [None]:
housing_data_features, housing_data_price = fetch_california_housing(return_X_y=True, as_frame=True)

In [None]:
housing_data_features

In [None]:
housing_data_price

### Splitting the data

In [None]:
x_train, x_test, y_train, y_test = train_test_split(housing_data_features, housing_data_price, test_size=0.1)

### A **very** simple example

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# for the very first example we'll fit to just one feature
feature = 'MedInc'

In [None]:
# Create linear regression object and fit on one feature
regressor = linear_model.LinearRegression()
regressor.fit(x_train[[feature]], y_train)

In [None]:
# Make predictions using the testing set
y_pred = regressor.predict(x_test[[feature]])

In [None]:
# The coefficients
print("Coefficients & Intercept: \n", regressor.coef_, regressor.intercept_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))

In [None]:
results_train = pd.DataFrame({
    feature: x_train[feature],
    'true': y_train,
    'predicted': regressor.predict(x_train[[feature]]),
    'type': 'train'
})
results_test = pd.DataFrame({
    feature: x_test[feature],
    'true': y_test,
    'predicted': regressor.predict(x_test[[feature]]),
    'type': 'test'
})
results = pd.concat([results_train, results_test])

In [None]:
import plotnine as p9

In [None]:
p9.ggplot(results, p9.aes(x=feature, y='true', color='type')) + p9.geom_point() + p9.geom_line(p9.aes(y='predicted'), color='black')

In [None]:
p9.ggplot(results, p9.aes(x='true', y='predicted', color='type')) + p9.geom_point() + p9.scale_x_continuous(limits=(0, 6)) + p9.scale_y_continuous(limits=(0, 6))

### Using all the features

In [None]:
from sklearn import linear_model
from sklearn import ensemble
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
housing_data_features, housing_data_price = fetch_california_housing(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(housing_data_features, housing_data_price, test_size=0.1)

In [None]:
x_train

In [None]:
# regressor = tree.DecisionTreeRegressor(max_depth=10)
regressor = linear_model.LinearRegression()

# Train the model using the training sets
regressor.fit(x_train, y_train)

In [None]:
y_pred = regressor.predict(x_test)

In [None]:
# The coefficients
print("Coefficients & Intercept: \n", regressor.coef_, regressor.intercept_)
# The mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# The coefficient of determination: 1 is perfect prediction
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))

In [None]:
results_train = pd.DataFrame({
    'true': y_train,
    'predicted': regressor.predict(x_train),
    'type': 'train'
})
results_test = pd.DataFrame({
    'true': y_test,
    'predicted': regressor.predict(x_test),
    'type': 'test'
})
results = pd.concat([results_train, results_test])

In [None]:
p9.ggplot(results, p9.aes(x='true', y='predicted', color='type')) + p9.geom_point() + p9.scale_x_continuous(limits=(0, 6)) + p9.scale_y_continuous(limits=(0, 6))

## Clustering
- unsupervised: the labels are not used for fitting (only for visualisation)
- without labels, the quality of a clustering is hard to evaluate in real settings
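When no ground-truth labels are available, one common internal measure is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster (range -1 to 1, higher is better). The sketch below assumes synthetic blob data and an arbitrary set of candidate cluster counts; it is a heuristic, not a definitive quality measure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data; the generated labels are discarded on purpose.
coords, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Fit KMeans for several cluster counts and score each result without labels.
scores = {}
for k in (2, 3, 4, 5):
    predicted = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(coords)
    scores[k] = silhouette_score(coords, predicted)

for k, score in scores.items():
    print(k, round(score, 3))
```

A higher score suggests better separated clusters, but it favours compact, roughly spherical groups and should be read alongside a visual check.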

In [None]:
from sklearn import datasets, cluster

In [None]:
coords, labels = datasets.make_blobs(n_samples=500, n_features=2, centers=3, cluster_std=3)
# coords, labels = datasets.make_moons(n_samples=500)
data = pd.DataFrame({
    'x': coords[:, 0],
    'y': coords[:, 1],
    'cluster': labels
})

In [None]:
p9.ggplot(data, p9.aes(x='x', y='y', color='factor(cluster)')) + p9.geom_point()

In [None]:
clusterer = cluster.KMeans(n_clusters=3, n_init='auto')
# clusterer = cluster.DBSCAN()

In [None]:
# not using the labels here!
clusterer.fit(coords)

In [None]:
# reuse the assignments from the fit above instead of fitting a second time
data['prediction'] = clusterer.labels_

In [None]:
p9.ggplot(data, p9.aes(x='x', y='y', color='factor(prediction)')) + p9.geom_point()

## Dimensionality Reduction
- not predictive as such
- useful for visualizations and preprocessing
- can be used for anomaly detection

### Principal Component Analysis (PCA)
- concentrates variance in first components
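How strongly the variance is concentrated can be read from `explained_variance_ratio_`. This is a minimal sketch on synthetic, strongly correlated 2-D data; the distribution parameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Two strongly correlated variables: y is roughly 1.2 * x plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(20, 5, 1000)
y = 1.2 * x + rng.normal(0, 1, 1000)
points = np.column_stack([x, y])

pca = PCA()
pca.fit(points)

# Fraction of the total variance captured by each principal component.
ratios = pca.explained_variance_ratio_
print(ratios)
```

Because the points lie close to a line, almost all of the variance ends up in the first component; the second component captures the small deviations from that line.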

In [None]:
from sklearn import decomposition
from matplotlib import pyplot as plt

In [None]:
x_normal = np.random.normal(20, 5, 10000)
y_normal = 1.2*x_normal + np.random.normal(0, 1, 10000)
data_normal = pd.DataFrame({
    'x': x_normal,
    'y': y_normal,
    'type': 'normal'
})

x_abnormal = np.random.normal(25, 1, 1000)
y_abnormal = np.random.normal(20, 1, 1000)
data_abnormal = pd.DataFrame({
    'x': x_abnormal,
    'y': y_abnormal,
    'type': 'anomaly'
})

data = pd.concat([data_normal, data_abnormal])


In [None]:
p9.ggplot(data, p9.aes(x='x', y='y', color='type')) + p9.geom_point(alpha=0.1)

In [None]:
pca = decomposition.PCA()
new_coords = pca.fit_transform(data.loc[:, ['x', 'y']])
transformed_data = pd.DataFrame(new_coords, columns=['v0', 'v1'])
transformed_data['type'] = data.type.values

In [None]:
p9.ggplot(transformed_data, p9.aes(x='v0', y='v1', color='type')) + p9.geom_point(alpha=0.1)

### t-SNE
- nonlinear embedding into a small, fixed number of dimensions (2 by default in sklearn) that tries to keep similar points close together
- useful for visualisation

In [None]:
from sklearn import manifold

In [None]:
data = pd.read_csv('../data/iris.csv')

In [None]:
sne = manifold.TSNE()
transformed_coordinates = sne.fit_transform(data.iloc[:, :4])

In [None]:
transformed_data = pd.DataFrame(transformed_coordinates, columns=['v0', 'v1'])
transformed_data['variety'] = data.variety

In [None]:
p9.ggplot(transformed_data, p9.aes(x='v0', y='v1', color='variety')) + p9.geom_point()

In [None]:
images, labels = datasets.load_digits(return_X_y=True, as_frame=True)

In [None]:
from matplotlib import pyplot as plt
plt.imshow(images.loc[15].values.reshape(8, 8))

In [None]:
sne = manifold.TSNE()
transformed = sne.fit_transform(images)

In [None]:
data = pd.DataFrame(transformed, columns=['v0', 'v1'])
data['label'] = labels

In [None]:
p9.ggplot(data, p9.aes(x='v0', y='v1', color='factor(label)')) + p9.geom_point(alpha=0.3)