# Style guide & practical tips on Machine Learning Python code

### 1. Structure your project
- Use one *Jupyter* cell per action: one for imports, one for reading data, one for displaying a plot, etc.  
- Use *Markdown* syntax to add headings and comments (switch cell type to *Markdown*).
- Don't split your code into separate *.py* files: our projects are small enough to fit into a single *Jupyter* notebook.
- Keep cells in logical order: rerunning the whole notebook (Kernel > Restart & Run All) shouldn't break your program.

### 2. Organize imports
- Import everything you need in the first cell.
- Use conventional aliases for packages: they are so widely accepted that you can even google for something like 'pd remove column'.

In [1]:
import os
import numpy as np
import scipy
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans

### 3. Obey naming convention
- Obey the uppercase/lowercase rule: vectors and scalars are expected to be stored in lowercase variables, matrices in uppercase ones.
- Use generally accepted names:  
store the feature matrix in capital `X`,  
the target variable in lowercase `y`,  
a single observation in lowercase `x`,  
a pandas.**DataFrame** in just `df` if it is the only DataFrame in your program, or in a more specific name if needed, e.g. `test_df`,  
etc.

In [2]:
DATA_DIR = 'data'
FILE_NAME = 'iris.data.csv'

file_path = os.path.join(DATA_DIR, FILE_NAME)

In [3]:
df = pd.read_csv(file_path)

In [4]:
X = df.values[:, :-1]
y = df.values[:, -1]

- The uppercase/lowercase rule also applies to function parameters: an uppercase parameter means that a matrix is expected.

In [5]:
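def fit(X, y):
    # X: feature matrix, y: target vector, x: a single observation (row of X)
    for x in X:
        w = None  # placeholder for a vector of weights
        b = None  # placeholder for a scalar bias
    return None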

### 4. Respect levels of abstraction
- *pandas* is a high-level API on top of *numpy* arrays. So use the high level whenever it is logical (i.e. almost always):

In [6]:
# low level: what are features 0 and 2?
X = df.values[:, [0, 2]]
X.shape

(150, 2)

In [7]:
# high level: we want to select just ['sepal length', 'petal length']
X = df[['sepal length', 'petal length']].values
X.shape

(150, 2)

- The common rule is to use *pandas* for data preprocessing and *numpy* for training and everything after it:

In [8]:
# preprocessing, use df
# ...

X = df.values[:, :-1]
y = df.values[:, -1]

# training, use X and y
# ...
# performance evaluation
# etc.

- Respecting levels of abstraction applies not only to *pandas*/*numpy* but, even more so, to every single line of code.
- Define functions even if they won't be reused, just to keep your code at the same level of abstraction:

In [9]:
def read_data():
    return pd.read_csv(file_path)


def preprocess(df):
    return df

In [10]:
# high level
df = read_data()
preprocessed_df = preprocess(df)

X = preprocessed_df.values[:, :-1]
y = preprocessed_df.values[:, -1]

# low level model training
# to be replaced with 'fit(X, y)'
for x in X:
    w = None
    b = None
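
As a sketch of what a non-trivial `preprocess` could look like, the version below drops incomplete rows and standardizes the feature columns. The in-memory data, the `species` target column and the `target` parameter are assumptions made for this illustration; they are not taken from the actual `iris.data.csv` file.

```python
import pandas as pd


def preprocess(df, target='species'):
    """Drop incomplete rows and standardize feature columns (hypothetical steps)."""
    df = df.dropna().copy()
    features = df.columns.drop(target)
    # column-wise standardization: zero mean, unit standard deviation
    df[features] = (df[features] - df[features].mean()) / df[features].std()
    return df


# toy data standing in for the real file (assumed values)
raw_df = pd.DataFrame({
    'sepal length': [5.1, 4.9, None, 6.3],
    'petal length': [1.4, 1.4, 1.3, 6.0],
    'species': ['setosa', 'setosa', 'setosa', 'virginica'],
})

# high level
preprocessed_df = preprocess(raw_df)

X = preprocessed_df.values[:, :-1]
y = preprocessed_df.values[:, -1]
```

The calling code stays at the same level of abstraction regardless of how many steps `preprocess` performs internally.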

### 5. Avoid explicit loops
- *numpy* provides built-in implementations of matrix operations; using them instead of explicit loops is called *vectorization*. It's much faster and often makes code clearer.

$$d(x, x^{'}) = \sqrt{\sum_{j = 1}^{m} (x_j - x_j^{'}) ^ {2}}$$

In [11]:
def euclidean_distance(x1, x2):
    return np.sqrt(((x1 - x2) ** 2).sum())

In [12]:
X = np.array([
    [1, 4, 3, 5],
    [2, 3, 6, 10]
])

euclidean_distance(X[0], X[1])

6.0
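
To make the comparison concrete, the sketch below computes the same distance with an explicit loop, then uses broadcasting to get the distances from one point to every row of a matrix in a single expression. The helper names `euclidean_distance_loop` and `distances_to_all` are made up for this illustration.

```python
import numpy as np


def euclidean_distance_loop(x1, x2):
    # explicit loop: correct, but slower and harder to read
    total = 0.0
    for a, b in zip(x1, x2):
        total += (a - b) ** 2
    return total ** 0.5


def distances_to_all(X, x):
    # vectorized: x is broadcast against every row of X
    return np.sqrt(((X - x) ** 2).sum(axis=1))


X = np.array([
    [1, 4, 3, 5],
    [2, 3, 6, 10]
])

print(euclidean_distance_loop(X[0], X[1]))  # 6.0
print(distances_to_all(X, X[0]))            # [0. 6.]
```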

- Note that it is not always possible to avoid explicit loops. For example, explicit loops cannot be avoided in the following cases:  
\- iterating over epochs (passes over the entire training set when training your model),  
\- iterating over mini-batches when using mini-batch gradient descent.  
Still, analyze each case carefully before settling for a loop.
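
A minimal sketch of the two loops that usually stay explicit, assuming a linear regression model trained on the mean squared error. The function name `fit_minibatch`, its parameters and the toy data are hypothetical. Note that everything inside the loops is still vectorized.

```python
import numpy as np


def fit_minibatch(X, y, alpha=0.5, n_epochs=200, batch_size=4, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros((m, 1))
    b = 0.0
    for epoch in range(n_epochs):                # explicit loop over epochs
        order = rng.permutation(n)               # reshuffle observations each epoch
        for start in range(0, n, batch_size):    # explicit loop over mini-batches
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ w + b - y_batch    # vectorized over the whole batch
            w -= alpha * X_batch.T @ error / len(idx)
            b -= alpha * error.mean()
    return w, b


X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 3 * X + 1    # noiseless toy target: y = 3x + 1

w, b = fit_minibatch(X, y)
print(w.ravel(), b)
```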

### 6. Don't use 1-D *numpy* arrays

- Avoid 1-D *numpy* arrays wherever possible: matrix operations on them may return unexpected results. Use numpy.**reshape** to convert arrays to 2-D:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html

In [13]:
x = np.array([1, 2, 3, 4, 5])
x_row = x.reshape(1, -1)
x_column = x.reshape(-1, 1)

print('1-D:', x.shape)
print('2-D, row:', x_row.shape)
print('2-D, column:', x_column.shape)

1-D: (5,)
2-D, row: (1, 5)
2-D, column: (5, 1)
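
A few examples of the "unexpected results" mentioned above, shown on the same toy vector. This is only an illustration of standard *numpy* transpose and broadcasting behavior.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
x_column = x.reshape(-1, 1)

# transposing a 1-D array does nothing
print(x.T.shape)                        # (5,)

# mixing a 1-D array with a column silently broadcasts to a 5x5 matrix
print((x_column - x).shape)             # (5, 5)

# with explicit 2-D shapes the results are what you would expect
print((x_column - x_column).shape)      # (5, 1)
print((x_column.T @ x_column).shape)    # (1, 1)
```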


- Some library functions require 1-D arrays as input. Use numpy.**ravel** to make your arrays 1-D:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html

In [14]:
x_row.ravel()

array([1, 2, 3, 4, 5])

In [15]:
x_column.ravel()

array([1, 2, 3, 4, 5])
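
One example of such a function is `np.bincount`, which counts occurrences of non-negative integers and accepts only 1-D input. The label column below is made-up data for illustration.

```python
import numpy as np

y_column = np.array([[0], [1], [1], [2], [1]])    # labels stored as a (5, 1) column

try:
    np.bincount(y_column)                         # 2-D input is rejected
except ValueError as error:
    print('ValueError:', error)

counts = np.bincount(y_column.ravel())            # works after flattening to 1-D
print(counts)                                     # [1 3 1]
```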

### 7. Follow *scikit-learn* interface when implementing your own models
- For example, see the sklearn.cluster.**KMeans** docs when developing your own *k-means* implementation:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

This interface is well thought out and widely regarded as best practice. Moreover, following it lets you easily switch between your own and out-of-the-box implementations of Machine Learning models without changing the surrounding code.

In [16]:
class MyKMeans:
    def __init__(self, n_clusters, init='random', n_init=1, max_iter=100):
        self.n_clusters = n_clusters
        self.init = init
        self.n_init = n_init
        self.max_iter = max_iter

    def fit(self, X):
        # training logic goes here
        self.cluster_centers_ = None
        return self

    def predict(self, X):
        return None

In [17]:
kmeans = MyKMeans(n_clusters=3, init='k-means++')
kmeans.fit(X)
print(kmeans.cluster_centers_)

None


- Implement every option that *scikit-learn* offers for model training if you are familiar with the theory behind it, or if you just think it is easy and useful. We didn't cover the `max_iter` option for the *k-means* method, but it is easy. So why not implement it yourself?

### 8. Use library code wherever possible
- Implement Machine Learning models on your own for study purposes, but maximize library usage inside your implementation and in everything around it.
- Use *pandas* for high-level data preprocessing.
- Use *numpy* and *scipy* for fast and readable code with matrix operations and other calculations.
- Use *scikit-learn* for out-of-the-box implementation of general Machine Learning operations: sample split, performance evaluation etc.
- Learn best practices by examples. Google for every high-level concept you are going to implement. Need to randomly choose some rows of *numpy* matrix? Google for examples!
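
For instance, a search for the "random rows" question typically leads to something like the sketch below, which samples row indices without replacement and uses them to index the matrix. The toy matrix and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded generator for reproducibility
X = np.arange(20).reshape(10, 2)      # toy matrix with 10 rows

# pick 3 distinct row indices, then use them to index the matrix
row_indices = rng.choice(X.shape[0], size=3, replace=False)
X_sample = X[row_indices]

print(X_sample.shape)                 # (3, 2)
```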

By the way, did you notice that we used `os.path.join` to build the path to our data source? This cross-platform implementation saves us time by handling many edge cases, and it makes our code more readable. Once again: google for examples of every concept you are going to implement!

### 9. Do your own research
- Back up your decisions with examples and statistics.

**Example 1.** You need to choose learning rate $\alpha$.  
Try several options. Measure the number of iterations needed for learning to converge. Track the dynamics of model performance during learning (measure the cost); you may even plot it. Use metrics (e.g. *accuracy*) to evaluate the performance of the trained model on the training and validation sets.
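
A possible shape for such an experiment, assuming batch gradient descent on a linear regression problem with synthetic, noiseless data. The function `gradient_descent`, its stopping rule and the candidate values of $\alpha$ are all assumptions made for illustration.

```python
import numpy as np


def gradient_descent(X, y, alpha, max_iter=10000, tol=1e-8):
    """Batch gradient descent for linear regression; returns (n_iter, costs)."""
    n, m = X.shape
    w = np.zeros((m, 1))
    costs = []
    for n_iter in range(1, max_iter + 1):
        error = X @ w - y
        costs.append(float((error ** 2).mean() / 2))
        # stop when the cost has practically stopped changing
        if len(costs) > 1 and abs(costs[-2] - costs[-1]) < tol:
            break
        w -= alpha * X.T @ error / n
    return n_iter, costs


rng = np.random.default_rng(0)
# bias column plus one feature; the target is generated without noise
X = np.hstack([np.ones((50, 1)), rng.uniform(-1, 1, size=(50, 1))])
y = X @ np.array([[1.0], [3.0]])

results = {}
for alpha in [0.01, 0.1, 0.5]:
    n_iter, costs = gradient_descent(X, y, alpha)
    results[alpha] = n_iter
    print(f'alpha={alpha}: {n_iter} iterations, final cost {costs[-1]:.2e}')
```

The `costs` list returned for each run can be passed straight to `plt.plot` to compare the learning curves.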

**Example 2.** You have implemented an improvement: *k-means++* initialization.  
Why is it better than random initialization? Run experiments. Train your model many times, say 100. Measure the average number of epochs (passes over the training set) needed for convergence with each type of initialization. Measure classification performance (*accuracy*, at least). Compare the results and voila! Your decision is justified!

**Example 3.** You have implemented an improvement: *mini-batch gradient descent* for model training.  
Why is it better than batch gradient descent? Track learning performance. How many epochs were needed for learning to converge in each case? How much wall-clock time did the calculations take? Plot the results!
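
Wall-clock time can be measured with `time.perf_counter` from the standard library. The sketch below wraps an arbitrary training function in a hypothetical `timed` helper; `dummy_train` is only a stand-in for your real batch or mini-batch training routine.

```python
import time

import numpy as np


def timed(train, *args, **kwargs):
    """Run train(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = train(*args, **kwargs)
    return result, time.perf_counter() - start


def dummy_train(X, y, n_epochs):
    # stand-in for a real training routine: plain batch gradient descent
    w = np.zeros((X.shape[1], 1))
    for _ in range(n_epochs):
        w -= 0.1 * X.T @ (X @ w - y) / X.shape[0]
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.ones((5, 1))

for n_epochs in [10, 100]:
    w, seconds = timed(dummy_train, X, y, n_epochs)
    print(f'{n_epochs} epochs: {seconds:.4f} s')
```

Timings vary between runs and machines, so repeat each measurement several times and compare averages rather than single values.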