<div align="center">
    <h1>Classes, unit tests and other good coding practices 👌 </h1>
    <h3> Weizmann AI Hub for Scientific Discovery </h3>
    <h4>Nathan LEVY</h4>
    nathan.levy@weizmann.ac.il
    <p>with inputs from M.Kim</p>
    <p>Summer 2024</p>
</div>

The goal of this tutorial is to build a Python package for k-nn classification. We will implement the k-nn algorithm, write unit tests for it, and package it in a clean and organized way.

Specifically, we will cover the following topics:
- Code linting and formatting (using `ruff`)
- Code documentation (using docstrings)
- Unit tests (using the `unittest` module)
- Object-oriented programming (developing the knn classifier as a class inspired by scikit-learn architecture)
- Package management (using `pyproject.toml`)


Overall, the goal of this tutorial is to provide a hands-on experience with good coding practices and to give you the tools to develop your own Python packages, _escaping from Jupyter notebooks!_ 🎉

💡 This tutorial is built for everyone in the hub and does not assume specific knowledge apart from Python programming. We assume that you have already solved `ex-home-knn` before starting this tutorial.

💡 We will not cover version control tools (Git) but we highly encourage you to start a git repository for this project. 

💡 We recommend using VSCode GUI for this project, and getting familiar with the [debugger](https://code.visualstudio.com/docs/editor/debugging). You may check the [VSCode on WEXAC](https://hpcwiki.weizmann.ac.il/en/home/ai_hub) wiki section.

Over the course of its development, Python experienced many enhancements and new features, detailed in PEPs (Python Enhancement Proposals). For instance:

- PEP 8 is a style guide which introduced naming styles, indentation, and other conventions, cf [PEP 8](https://realpython.com/python-pep8/#toc)
- PEP 257 is a docstring convention, cf [PEP 257](https://peps.python.org/pep-0257/)

You may briefly look at these PEPs. In this tutorial we will see how to use dedicated tools named _linters_ to make your code compliant with these conventions!

## 2. Linting your code with `ruff`

Linting is the process of checking the source code for programmatic and stylistic errors. Linters are tools that perform static analysis on your code to find potential errors, bugs, and stylistic issues. Linters can help you catch bugs early in the development process, and ensure that your code is clean, readable, and maintainable. In our case, we will use `ruff`.

First of all, we need to install `ruff`. If you are working with VSCode, you can install the `ruff` extension directly from the marketplace. If not, simply add it to your environment with `pip install ruff`.

Ruff can check for more than 800 rules, which are listed in the [ruff documentation](https://docs.astral.sh/ruff/rules/). 
You can select them in the ruff config file `ruff.toml`. To begin with, you can use the default rules, which cover the F (Pyflakes) rules and a subset of the E (pycodestyle) rules in the documentation and are sufficient to catch most common errors. 

- We reproduced the [tutorial](https://docs.astral.sh/ruff/tutorial/#getting-started) example file in `example_ruff.py`. You can run the linting with the following command:

    `ruff check example_ruff.py`

    You can also run the linting on a whole directory with:

    `ruff check .`

- You can then fix the auto-fixable errors: 

     `ruff check --fix .`

- And finally format your files, which among other things wraps lines that exceed the configured line length: 

    `ruff format .`
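
To give a feel for what the default rules catch, here is a small hypothetical file (it is not `example_ruff.py`, and the function names are made up for illustration). It runs without error, but `ruff check` should report the issues noted in the comments. The rule codes are the ones listed in the ruff rules documentation.

```python
import os  # F401: `os` is imported but never used


def is_missing(value):
    # E711: comparison to None should be written `value is None`
    return value == None


def total(values):
    count = len(values)  # F841: local variable `count` is assigned but never used
    return sum(values)


print(is_missing(None), total([1, 2, 3]))
```

Note that the code still runs: a linter reports problems through static analysis, without executing the file. Some findings, typically the unused import, can be removed automatically with `--fix`, while others are only reported and left for you to decide.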

<div class="alert alert-block alert-info">
<b>Task 1</b>
<p>Run the linting on the example_ruff.py file. What are the errors? Are they fixable? What does the file look like after formatting?</p>
</div>

## 3. Type hints and documentation 

As you can see in the `example_ruff.py` file, we have added type hints to the arguments and output of the function `sum_even_numbers`. Type hints are a way to specify the type of a variable in Python, and they can be used to make your code more readable and maintainable. Let's explain the annotations in `sum_even_numbers(numbers: Iterable[int]) -> int`:

- `numbers: Iterable[int]` specifies that the argument `numbers` is an iterable of integers. The `Iterable` type hint is a generic type hint that specifies that the argument is an iterable, and the `[int]` part specifies that the elements of the iterable are integers.

- `-> int` specifies that the return value of the function is an integer.

Common built-in types in Python include:
- `int`: integer
- `float`: floating point number
- `str`: string
- `bool`: boolean
- `list`: list
- `tuple`: tuple
- `dict`: dictionary
- `set`: set

You can also use the `typing` module to specify more complex types, for instance `Iterable` to specify that a variable is an iterable. You can read more about type hints in the [cheat sheet](https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html).
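
As a minimal sketch of how these annotations combine, the two hypothetical functions below (they are not part of `example_ruff.py`) annotate an iterable argument, a dictionary return value, and a return value that may be `None`. The sketch assumes Python 3.9 or later, where built-in types such as `dict` and `list` can be subscripted directly.

```python
from typing import Iterable, Optional


def count_words(words: Iterable[str]) -> dict[str, int]:
    """Count how many times each word appears."""
    counts: dict[str, int] = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def first_negative(numbers: list[float]) -> Optional[float]:
    """Return the first negative number, or None if there is none."""
    for number in numbers:
        if number < 0:
            return number
    return None


print(count_words(["a", "b", "a"]))
print(first_negative([1.0, -2.0, 3.0]))
```

Type hints are not enforced when the code runs: Python will not stop you from calling `count_words(42)`. They document intent and let editors and static type checkers warn you before execution.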

<div class="alert alert-block alert-info">
<b>Task 2</b>
<p>I have a function that takes a list of strings and returns the maximum length. How would you annotate it? What if the function returns both min and max lengths? </p>
</div>

We have also added a one-line docstring to the function `sum_even_numbers`. A docstring is a string that appears at the beginning of a module, class, or function definition, and it is used to document the purpose and usage of the code. Docstrings should be enclosed in triple quotes, and they can span multiple lines. We'll use the numpy convention. To enforce it, we simply added a `[lint.pydocstyle]` section in the `ruff.toml` file:

```
[lint.pydocstyle]
convention = "numpy"
```

For a function, the docstring should contain the following information:
- A brief description of what the function does
- A description of the arguments and their types
- A description of the return value and its type

You can read how to write docstrings in the numpy convention [here](https://numpydoc.readthedocs.io/en/latest/format.html).
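
For illustration, here is what a numpy-style docstring could look like on a small hypothetical function (the function `scale` and its parameters are assumptions made for this example, not part of the tutorial files). It contains the three pieces of information listed above, plus an optional `Examples` section.

```python
def scale(values: list[float], factor: float = 2.0) -> list[float]:
    """Multiply every value by a constant factor.

    Parameters
    ----------
    values : list of float
        The numbers to scale.
    factor : float, default=2.0
        The multiplier applied to each value.

    Returns
    -------
    list of float
        A new list with each value multiplied by ``factor``.

    Examples
    --------
    >>> scale([1.0, 2.5], factor=2.0)
    [2.0, 5.0]
    """
    return [value * factor for value in values]


# The docstring is available at runtime, e.g. through help(scale).
print(scale.__doc__.splitlines()[0])
```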


<div class="alert alert-block alert-info">
<b>Task 3</b>
<p>Complete the docstrings in test_ruff.py. Run ruff again. </p>
</div>

## 4. Classes

Python is an object-oriented programming language, which means that it supports classes and objects. You should first review the basic concepts of object-oriented programming by reading this presentation: [Object-oriented programming](https://realpython.com/python-classes/#getting-started-with-python-classes) until "Using Inheritance and Building Class Hierarchies - Simple Inheritance".


Let's play a bit with the example given in the tutorial: in the `car_class.py` file, we build a `Vehicle` class and a `Car` class that inherits from the `Vehicle` class. Make sure to understand the `super()` function by reading this [explanation](https://realpython.com/python-super/). 
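
To make the role of `super()` concrete without giving away the contents of `car_class.py`, here is a minimal sketch on two hypothetical classes, `Animal` and `Dog`. The child class calls `super().__init__` so that the parent sets up its own attributes, and it reuses the parent's method instead of duplicating it.

```python
class Animal:
    """A generic animal with a name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def describe(self) -> str:
        return f"{self.name} is an animal"


class Dog(Animal):
    """A dog, which is an animal with a breed."""

    def __init__(self, name: str, breed: str) -> None:
        # Let the parent class initialise the attributes it owns.
        super().__init__(name)
        self.breed = breed

    def describe(self) -> str:
        # Extend, rather than replace, the parent's behaviour.
        return f"{super().describe()} ({self.breed})"


rex = Dog("Rex", "beagle")
print(rex.describe())
print(isinstance(rex, Animal))
```

The same pattern should apply to `Car` and `Vehicle`: attributes shared by all vehicles belong in the parent constructor, and car-specific ones are added after the `super().__init__(...)` call.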


If you completed the _Type hints and documentation_ section, please add the docstrings and type hints to the classes and methods 😌. 

```python
from car_class import Car
# auto = Car(...)
```

<div class="alert alert-block alert-info">
<b>Task 4</b>
<p>Add two arguments to the Car: fuel_capacity (how many liters the tank holds) and efficiency (fuel consumption in liters per 100 km). Then add a method called range, taking as argument fuel_level (how many liters are left) and computing the range of the car based on fuel level and efficiency.</p>
</div>

## 5. Unit testing

💡 The part about _Classes_ needs to be completed first.

We're diving into unit testing. Imagine you're building a LEGO masterpiece 🏗️: unit testing is like checking each brick before you snap it into place. In the coding universe, a "unit" is like the atom of your program - the smallest bit that can stand on its own. In object-oriented programming, think of a unit as a single method in a class.

Tests are the quality control making sure your code is pitch-perfect before it hits the big stage. Let's get testing! 💪🧪

We will use the `unittest` module, which is part of the Python standard library. Full documentation can be found [here](https://docs.python.org/3/library/unittest.html), but we'll cover the basics. 

- we group a set of tests in a _testcase_, a class that inherits from `unittest.TestCase`. 
- each test is a method of this class that starts with `test_`.
- we use the `assert` statement to check if the output of a function is what we expect. It raises an `AssertionError` if the condition is not met. We write it as `assert CONDITION, MESSAGE`, where `MESSAGE` is shown if the condition is not met.
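
The three points above can be sketched as follows. The function `rectangle_area` and the testcase name are hypothetical, chosen only so that the example is self-contained; the tests for the `Car` class follow the same structure.

```python
import unittest


def rectangle_area(width: float, height: float) -> float:
    """Return the area of a rectangle (hypothetical function under test)."""
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    return width * height


class TestRectangleArea(unittest.TestCase):
    """A testcase groups related tests for one unit."""

    def test_area(self):
        area = rectangle_area(2, 3)
        assert area == 6, f"expected 6, got {area}"

    def test_zero_width(self):
        assert rectangle_area(0, 5) == 0, "a zero-width rectangle has zero area"

    def test_negative_side_raises(self):
        # TestCase also provides helper methods such as assertRaises.
        with self.assertRaises(ValueError):
            rectangle_area(-1, 2)
```

Saved in a file named, say, `test_rectangle.py`, these tests can be run with `python -m unittest test_rectangle.py`. Besides plain `assert`, `unittest.TestCase` offers methods such as `self.assertEqual(a, b)`, which produce more detailed failure messages.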

In the `test_car_class.py` file, we have written a testcase for the `Car` class. We have tested the `range` method. 

The `unittest` module can be run using command line:

`python -m unittest test_car_class.py` #to run all tests in the file

`python -m unittest test_car_class.TestCar` #to run a specific testcase

`python -m unittest test_car_class.TestCar.test_range` #to run a specific test

<div class="alert alert-block alert-info">
<b>Task 5</b>
<p>Run the testcase for the Car. Add a new test making sure that the number of seats is greater than zero. </p>
</div>

## 6. Application: build a k-nn classifier

We now put all these concepts together to build a k-nn classifier, inspired by the scikit-learn package - which is a good example of object-oriented programming in Python.


We will build a class `KNeighborsClassifier` inspired by the [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) architecture. 


The class is built in the `knn.py` file and needs the following methods:

- `__init__(self, n_neighbors: int = 5)`: the constructor of the class, taking as argument the number of neighbors to consider. The default value is 5.

- `fit(self, X: np.ndarray, y: np.ndarray) -> None`: a method to fit the model on the data. `X` is a 2D numpy array of shape `(n_samples, n_features)` and `y` is a 1D numpy array of shape `(n_samples,)`.

- `predict(self, X: np.ndarray) -> np.ndarray`: a method to predict the labels of the data. `X` is a 2D numpy array of shape `(n_samples, n_features)` and the output is a 1D numpy array of shape `(n_samples,)`.

- `predict_proba(self, X: np.ndarray) -> np.ndarray`: a method to predict the probabilities of the labels. `X` is a 2D numpy array of shape `(n_samples, n_features)` and the output is a 2D numpy array of shape `(n_samples, n_classes)`.

- `score(self, X: np.ndarray, y: np.ndarray) -> float`: a method to compute the accuracy of the model on the data. `X` is a 2D numpy array of shape `(n_samples, n_features)` and `y` is a 1D numpy array of shape `(n_samples,)`.
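
Before writing the k-nn logic, it may help to see the shape of this interface on a much simpler model. The sketch below is deliberately not a k-nn classifier: it is a hypothetical baseline, `MajorityClassifier`, that always predicts the most frequent training label. To stay dependency-free it uses plain Python lists where the real class uses numpy arrays, and it has no `n_neighbors` argument. What it shows is the division of labour: `fit` stores what is learned from the data, and the other methods only read it.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predict the most frequent training label."""

    def __init__(self) -> None:
        # Nothing is learned yet. Attributes set by `fit` end with "_",
        # a naming convention borrowed from scikit-learn.
        self.classes_: list = []
        self.proba_: list = []
        self.majority_ = None

    def fit(self, X: list, y: list) -> None:
        """Store the sorted class labels and their frequencies in ``y``."""
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of samples")
        counts = Counter(y)
        self.classes_ = sorted(counts)
        self.proba_ = [counts[label] / len(y) for label in self.classes_]
        self.majority_ = counts.most_common(1)[0][0]

    def predict(self, X: list) -> list:
        """Return one label per row of ``X``."""
        return [self.majority_ for _ in X]

    def predict_proba(self, X: list) -> list:
        """Return one row of class frequencies per sample."""
        return [list(self.proba_) for _ in X]

    def score(self, X: list, y: list) -> float:
        """Return the accuracy: the fraction of predictions equal to ``y``."""
        predictions = self.predict(X)
        correct = sum(pred == true for pred, true in zip(predictions, y))
        return correct / len(y)


X_train = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
y_train = ["a", "b", "a"]
model = MajorityClassifier()
model.fit(X_train, y_train)
print(model.predict([[5.0, 5.0]]))
print(model.score(X_train, y_train))
```

A baseline like this is also a useful sanity check later on: a working k-nn classifier should reach a higher `score` than it on the same data.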

<div class="alert alert-block alert-info">
<b>Task 6</b>
<p></p>
</div>


A Python package is a directory that contains a special file called `__init__.py`. This file can be empty, but it is necessary to tell Python that the directory should be considered a package. The package directory can also contain other Python files, modules, and subpackages.

We'll again use the Iris dataset, which you can visualize [here](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).

## BONUSES

## Package structure in Python

A Python package is a way of organizing related Python code into a directory hierarchy. Most of the tools that you use in Python, such as NumPy and pandas, are organized as Python packages. Packages can be installed using package managers like pip, and are often distributed through the Python Package Index [PyPI](https://pypi.org/). This system allows Python developers to easily share and reuse code across projects and with the wider Python community.

The `pyproject.toml` file was introduced in [PEP 518](https://pip.pypa.io/en/stable/reference/build-system/pyproject-toml/).  It is the configuration file for a package. It is used to specify the project's dependencies, build system, and other configuration options. 


It needs to be located at the root of the project directory and has 3 main sections: `[build-system]`, `[project]`, and `[tool]`.

##### 1. `[build-system]` section

This section specifies the build backend to be used. A build backend for a Python package is a tool or system that handles the process of building, packaging, and preparing your Python project for distribution. Here we will use the default, Hatchling, as the build backend, so you can leave this section as is.


##### 2. `[project]` section

This section specifies the project's metadata. You can fill the following fields:
- `name`: the name of the project (as it will appear when you want to install it with pip)
- `version`: the version of the project
- `description`: a one-line description of the project
- `readme`: the path to the README file
- `requires-python`: the minimum Python version supported by the project
- `authors`: the authors of the project
- `license`: the license of the project
- `dependencies`: the dependencies of the project
- `urls`: the URLs of the project, e.g. a public git repository


The non-trivial field is the `dependencies` field. It specifies the dependencies of the project, i.e. the packages that need to be installed in order to use the project. For instance, if you want to specify that you need `numpy` and `scipy` to run your project, you can write:

`dependencies = ["numpy", "scipy"]`

You can also specify the version of the dependencies that you need. For instance, if you need `numpy` version greater than or equal to 1.20.0 and `scipy` version exactly 1.6.0, you can write:

`dependencies = ["numpy>=1.20.0", "scipy==1.6.0"]`


We'll fill this field as we progress on our project.

We also specify optional dependencies, which are needed only for specific features of our package, in our case the packages required to run the tests. We will need `pytest` later in the tutorial, so we write: 

```
[project.optional-dependencies]
test = ["pytest"]
```

##### 3. `[tool]` section

In this last part, we specify the configuration of the external tools that we need. For our project, we want to do **linting** with `ruff`, the static analysis tool introduced in Section 2, so this is the tool we configure in the `[tool]` section.

### B1. Build the package

https://docs.google.com/presentation/d/1AKVx6vlzv6sAVBoyT7gLJnJtRaNXMFf1/edit#slide=id.p22

### B2. More guidelines

https://www.datacamp.com/tutorial/coding-best-practices-and-guidelines