Implementations of commonly used machine learning algorithms from scratch using only numpy. Each algorithm is standalone, with no dependencies on other algorithms.
All models are intended to provide a transparent look into their implementation. They are not meant to be efficient or used in practical applications; they simply offer an aid to anyone studying machine learning.
Clone the repository.

```bash
git clone https://github.com/f4str/ml-algorithms
```

Change directories into the cloned repository.

```bash
cd ml-algorithms
```

Install Python and create a virtual environment.

```bash
python3 -m venv venv
source venv/bin/activate
```

Install the dev dependencies using pip.

```bash
pip install -e .[dev]
```
All implementations follow a class style and structure similar to scikit-learn and Keras.
All models are created by initializing a class with its hyperparameters. Every model has default hyperparameters, so a model can also be created without any arguments.
```python
classifier = LogisticRegression(penalty='l1', C=0.001)  # specify hyperparameters
regressor = RidgeRegression()  # use default hyperparameters
```
Since various other parameters are set up when a model is created, it is recommended to completely reinitialize the model rather than changing a hyperparameter directly.
```python
tree_clf = DecisionTreeClassifier(criterion='gini')
tree_clf.criterion = 'entropy'  # will not work, do not use
tree_clf = DecisionTreeClassifier(criterion='entropy')  # use this instead
```
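To see why direct reassignment fails, consider a minimal sketch of a hypothetical constructor. This is illustrative only, not the repository's actual code: derived state is resolved once at construction time, so later attribute changes are never seen by `fit`.

```python
import numpy as np

# Hypothetical sketch (not the repository's actual code) of why reassigning a
# hyperparameter has no effect: derived state is resolved once in __init__,
# so fit() never sees the new attribute value.
class DecisionTreeClassifierSketch:
    def __init__(self, criterion='gini'):
        self.criterion = criterion
        # The impurity function is chosen here, at construction time.
        self._impurity = {
            'gini': lambda p: 1.0 - np.sum(p ** 2),
            'entropy': lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0])),
        }[criterion]

tree = DecisionTreeClassifierSketch(criterion='gini')
tree.criterion = 'entropy'                   # only rebinds the attribute
print(tree._impurity(np.array([0.5, 0.5])))  # prints 0.5: still the Gini impurity
```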
All models are trained using the `fit(X, y)` method. This always takes the parameters `X`, a matrix of training features, and `y`, the training labels. If the algorithm uses gradient descent, it may also take the optional parameters `epochs` and `lr` to override the defaults, and `fit(X, y)` will return two lists containing the training loss and the evaluation metric (accuracy or R2 score) per training epoch. Otherwise, `fit(X, y)` returns the final loss and evaluation metric of the trained model, equivalent to calling `evaluate(X, y)`.
```python
training_loss, training_acc = classifier.fit(X, y)  # returns Tuple[list, list]
loss, r2 = regressor.fit(X, y)  # returns Tuple[float, float]
```
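For gradient descent models, the defaults can be overridden at fit time and the returned per-epoch lists inspected. The specific values below are illustrative:

```python
# Override the gradient descent defaults and inspect the per-epoch history.
training_loss, training_acc = classifier.fit(X, y, epochs=200, lr=0.01)
print(f'final loss: {training_loss[-1]:.4f}, final accuracy: {training_acc[-1]:.4f}')
```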
All models have a `predict(X)` method which can be called after training. This returns the predicted values based on the weights learned during training.
```python
y_pred = classifier.predict(X)  # returns class labels
y_pred = regressor.predict(X)  # returns real value predictions
```
In addition, some classifiers have `predict_proba(X)` and `predict_log_proba(X)` methods to get the class probabilities and log probabilities.
```python
y_pred_prob = classifier.predict_proba(X)
y_pred_log_prob = classifier.predict_log_proba(X)
```
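As a sanity check of how these two methods presumably relate (assuming, as in scikit-learn, that the log probabilities are the elementwise logarithm of the probabilities; this is an assumption about the implementation):

```python
import numpy as np

# Assumed relationship: predict_log_proba(X) == log(predict_proba(X)).
assert np.allclose(classifier.predict_log_proba(X),
                   np.log(classifier.predict_proba(X)))
```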
To evaluate a model, it is recommended to run `predict(X)` and use your evaluation metrics of choice (accuracy, R2 score, F1 score, cross entropy, MSE, etc.). However, to get a quick and rough estimate of model performance, all models have an `evaluate(X, y)` method which returns the default loss and evaluation metric. These metrics are model specific.
```python
ce, acc = classifier.evaluate(X, y)  # cross entropy and binary accuracy
mse, r2 = regressor.evaluate(X, y)  # mean squared error and R2 score
```
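Putting these pieces together, a hypothetical end-to-end session might look like the following. The import path and toy data are illustrative, not taken from the repository:

```python
import numpy as np
from ml_algorithms import LogisticRegression  # import path is illustrative

# Toy binary classification data: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

classifier = LogisticRegression()
classifier.fit(X, y)

y_pred = classifier.predict(X)       # class labels
ce, acc = classifier.evaluate(X, y)  # cross entropy and accuracy
print(f'cross entropy: {ce:.4f}, accuracy: {acc:.4f}')
```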
Various algorithms are implemented for both supervised and unsupervised learning tasks. All models are separated into their own category, located in their respective subdirectory. Aside from the `utils` submodule with helper functions, all implementations are completely standalone, so there are no other dependencies and each class can be used immediately out of the box. A minimal sketch of this class style follows the list below.
- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet Regression
- Logistic Regression
  - L1 Penalty
  - L2 Penalty
  - ElasticNet Penalty
- Decision Tree Classifier
  - Gini Split
  - Entropy Split
  - Misclassification Split
- Decision Tree Regressor
  - Mean Squared Error Split
  - Mean Absolute Error Split
  - Poisson Deviance Split
- K-Nearest Neighbors Classifier (in progress)
- K-Nearest Neighbors Regressor (in progress)
- Support Vector Classifier (in progress)
- Support Vector Regressor (in progress)
- Multilayer Perceptron Regressor
- Multilayer Perceptron Classifier
- Principal Component Analysis (in progress)
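As promised above, here is a minimal sketch of the class style these implementations follow: a from-scratch linear regression using only numpy with the `fit(X, y)`, `predict(X)`, and `evaluate(X, y)` interface described earlier. It is illustrative only, not the repository's actual code.

```python
import numpy as np

class LinearRegressionSketch:
    """Gradient descent linear regression mirroring the interface above.
    Illustrative sketch only; not the repository's implementation."""

    def __init__(self, epochs=100, lr=0.01):
        # Default hyperparameters, overridable per fit as described above.
        self.epochs = epochs
        self.lr = lr
        self.w = None
        self.b = 0.0

    def fit(self, X, y, epochs=None, lr=None):
        epochs = self.epochs if epochs is None else epochs
        lr = self.lr if lr is None else lr
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        self.w, self.b = np.zeros(X.shape[1]), 0.0
        losses, r2s = [], []
        for _ in range(epochs):
            error = self.predict(X) - y
            # Gradient descent step on the mean squared error.
            self.w -= lr * 2.0 * X.T @ error / len(y)
            self.b -= lr * 2.0 * error.mean()
            loss, r2 = self.evaluate(X, y)
            losses.append(loss)
            r2s.append(r2)
        return losses, r2s  # per-epoch training loss and R2 score

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.w + self.b

    def evaluate(self, X, y):
        y = np.asarray(y, dtype=float)
        y_pred = self.predict(X)
        mse = float(np.mean((y - y_pred) ** 2))
        r2 = float(1.0 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2))
        return mse, r2  # mean squared error and R2 score
```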
The `tox` library is used to run all tests and code formatting. This is automatically installed with the dev requirements. The available options are as follows.

- Run linting checks using `flake8`: `tox -e lint`
- Run type checks using `mypy`: `tox -e type`
- Run unit tests using `pytest`: `tox -e test`
- Run all three of the tests above: `tox`
- Format the code using `black` and `isort` to comply with linting conventions: `tox -e format`
Upon a pull request, merge, or push to the `master` branch, the three tests with `tox` will be run using GitHub Actions. The workflow will fail if any of the tests fail. See `.github/workflows/python-package.yml` for more information on how the CI works.