# ML Engineering Basics

## Introduction

In this notebook you will set up an environment and try out the packages mentioned in the theory part of the module.

> We assume that you have created a fresh environment and installed the notebook package on top of it, or that you are using Google Colab.

## Conda environment

At this point you already have a working environment, since you've opened this notebook. The only package installed is **notebook**, which is part of the Jupyter ecosystem. Let's proceed with the installation of other packages.

## Using shell commands from the notebook

First, let's install numpy via pip. It is as easy as:

```bash
pip install numpy
```

To run this or any other shell command from the notebook, put the `!` symbol before the command itself.

In [None]:
!pip install numpy
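The `!` prefix is a notebook feature and is not valid in a plain Python script. As a rough sketch of how a similar effect could be achieved outside a notebook, the standard `subprocess` module can run a command; `run_command` is a hypothetical helper introduced here only for illustration.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return its standard output as text (hypothetical helper)."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


# Ask the interpreter that runs this code for its version.
print(run_command([sys.executable, "--version"]))

# Calling pip through the same interpreter would follow the same pattern:
# run_command([sys.executable, "-m", "pip", "install", "numpy"])
```

Going through `sys.executable` keeps the command tied to the interpreter that is currently running, which helps avoid installing into a different environment by accident.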

To install pandas we are going to use conda instead. Look at the command below:
- You must use the `-y` parameter when running `conda install` inside the notebook. Conda normally prompts you to confirm the installation; with `-y`, the prompt is skipped.
- We also use `-c conda-forge` to install from a specific conda channel. A channel is just a place where conda packages are stored. Many packages are not available in the official channels, but they are most likely available in conda-forge. Note that pandas is available in the official channel; we use conda-forge here just to get familiar with it.
- The numbers after the package name are the version of the package to install. Here we have pandas version 1.0.5.

In [None]:
!conda install -y -c conda-forge pandas==1.0.5
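After installing, it can be useful to confirm which versions ended up in the environment. The sketch below uses the standard-library `importlib.metadata` module (Python 3.8+); `installed_version` is a hypothetical helper name, not part of pip or conda.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the packages installed in the cells above.
for name in ("numpy", "pandas"):
    print(name, installed_version(name))
```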

## Numpy, vectorization

NumPy is a rich and powerful library built around optimized operations on matrices (or tensors, when there are more than 2 dimensions). Its core is written in C, which gives much better performance than pure Python. The same (or a similar) interface is used in the most popular deep learning libraries, such as TensorFlow and PyTorch.

The core principle of NumPy is vectorization. Instead of computing matrix operations one number at a time in a Python `for` loop, NumPy applies each operation to whole arrays at once in optimized compiled code, which gives much better performance.
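A minimal sketch of the idea on a tiny elementwise example (the array contents are made up for illustration):

```python
import numpy as np

values = np.arange(5)  # array([0, 1, 2, 3, 4])

# Loop version: one Python-level multiplication per element.
squares_loop = [v * v for v in values.tolist()]

# Vectorized version: a single expression over the whole array.
squares_vec = values * values

print(squares_loop)          # [0, 1, 4, 9, 16]
print(squares_vec.tolist())  # [0, 1, 4, 9, 16]
```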

Let's see it for ourselves. First we create a 200x200 matrix of random integers from 0 to 99 (the upper bound passed to `randint` is exclusive). Then we measure the time of matrix multiplication in two cases:
- Matrix as a list of lists in pure Python.
- Matrix as an optimized numpy array.

In [None]:
import numpy as np

mtx = np.random.randint(0, 100, size=(200, 200))

In [None]:
print(f'Shape: {mtx.shape}')
mtx

In [None]:
mtx_as_list = mtx.tolist()

In [None]:
%%time
result = np.zeros_like(mtx).tolist()
for i in range(len(mtx_as_list)):
    for j in range(len(mtx_as_list)):
        for k in range(len(mtx_as_list)):
            result[i][j] += mtx_as_list[i][k] * mtx_as_list[k][j]

In [None]:
%%time
# @ - matrix multiplication operation in Numpy
numpy_result = mtx @ mtx

As you can see, matrix operations in NumPy are highly optimized. Many other libraries, like Pandas, build their operations on top of NumPy.
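It is also worth checking that the two approaches compute the same product. The sketch below repeats the comparison on a smaller 20x20 matrix so that it runs quickly; the seed and the size are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only
small = rng.integers(0, 100, size=(20, 20))
small_as_list = small.tolist()
n = len(small_as_list)

# Triple-loop multiplication on plain Python lists.
loop_result = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        for k in range(n):
            loop_result[i][j] += small_as_list[i][k] * small_as_list[k][j]

# The same product with NumPy's matrix multiplication operator.
numpy_result = small @ small

# Compare the two results element by element.
print(np.array_equal(np.array(loop_result), numpy_result))
```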

## Pandas

Now let's look at Pandas. It's like a Swiss Army knife for data wrangling and analysis. We'll go through indexing, column creation, data saving and loading, and custom functions. There is much more inside the library, but that is for you to explore on your own.

In [None]:
!pip install bds_courseware

In [1]:
import pandas as pd

from bds_courseware import read_drive_dataset
from bds_courseware import WORKSHOP_DATASETS, HOMEWORK_DATASETS

name = "msft_store"
read_drive_dataset(*WORKSHOP_DATASETS[name])

| Unnamed: 0 | Name | Rating | No of people Rated | Category | Date | Price |
|---|---|---|---|---|---|---|
| 0 | Dynamic Reader | 3.5 | 268 | Books | 07-01-2014 | Free |
| 1 | Chemistry, Organic Chemistry and Biochemistry-... | 3.0 | 627 | Books | 08-01-2014 | Free |
| 2 | BookViewer | 3.5 | 593 | Books | 29-02-2016 | Free |
| 3 | Brick Instructions | 3.5 | 684 | Books | 30-01-2018 | Free |
| 4 | Introduction to Python Programming by GoLearni... | 2.0 | 634 | Books | 30-01-2018 | Free |
| ... | ... | ... | ... | ... | ... | ... |
| 5317 | JS King | 1.0 | 720 | Developer Tools | 19-07-2018 | ₹ 269.00 |
| 5318 | MQTTSniffer | 2.5 | 500 | Developer Tools | 10-04-2017 | ₹ 64.00 |
| 5319 | Dev Utils - JSON, CSV and XML | 4.0 | 862 | Developer Tools | 18-11-2019 | ₹ 269.00 |
| 5320 | Simply Text | 4.0 | 386 | Developer Tools | 23-01-2014 | ₹ 219.00 |


Data is frequently stored as plain text in comma-separated values (CSV) format, which is simple and human-readable.

The Pandas parser supports many options for reading such files. When reading a CSV file with a known structure, it is best to parse every column into its proper type at read time (e.g. timestamps). See the documentation of `pandas.read_csv` for all options.

In [None]:
df = read_drive_dataset(*WORKSHOP_DATASETS[name])

In [None]:
df.head(10)

In [None]:
df.to_csv('data.csv', index=False)

In [None]:
pd.read_csv('data.csv')
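As a sketch of parsing a column at read time, the example below reads a few rows in the same layout from an in-memory string, so it does not depend on a file on disk. It assumes the dates are written day first (DD-MM-YYYY), which matches values such as 29-02-2016 in the data shown earlier.

```python
import io

import pandas as pd

# A few rows in the same layout as the dataset, held in memory.
csv_text = """Name,Rating,No of people Rated,Category,Date,Price
Dynamic Reader,3.5,268,Books,07-01-2014,Free
BookViewer,3.5,593,Books,29-02-2016,Free
Brick Instructions,3.5,684,Books,30-01-2018,Free
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],  # convert this column to datetime while reading
    dayfirst=True,         # assumption: dates are written as DD-MM-YYYY
)
print(sample.dtypes)
```

With a datetime column in place, accessors such as `sample["Date"].dt.year` become available without a separate conversion step.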

You can refer to a column by name, either like a dictionary key or like an attribute. The returned object is a **pandas.Series**. You can treat it like a data vector.

In [None]:
df["Name"]
# or
df.Name
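One practical difference between the two forms: attribute access only works when the column name is a valid Python identifier. A small sketch on a made-up two-row frame:

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Dynamic Reader", "BookViewer"],
    "No of people Rated": [268, 593],
})

# Both forms return the same Series for a simple column name.
print(sample["Name"].tolist() == sample.Name.tolist())

# A name containing spaces cannot be written as an attribute,
# so only the bracket form works for this column.
print(sample["No of people Rated"].tolist())
```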

You can see that some names are truncated. When working with text data, it is useful to turn truncation off with the following line of code:

In [None]:
pd.options.display.max_colwidth = None

**Series** support common math and boolean operations.

In [None]:
df["Rating"] < 2

In [None]:
df["Rating"] * 10
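The boolean result of a comparison is itself a Series, so it can be counted, combined with other conditions, or used to select values. A minimal sketch with made-up ratings:

```python
import pandas as pd

ratings = pd.Series([3.5, 1.0, 4.0, 1.5, 2.5])  # made-up ratings

# True counts as 1, so summing a boolean Series counts the matches.
low = ratings < 2
print(int(low.sum()))

# Combine conditions with & and wrap each one in parentheses.
middle = (ratings >= 2) & (ratings < 4)
print(ratings[middle].tolist())
```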

Another useful function is `value_counts()`. Use it to calculate the frequency of each unique value in a **Series**.

In [None]:
df["Rating"].value_counts()
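By default the result is ordered by frequency. If an ordering by the values themselves is easier to read, the counts can be sorted by index, as in this small sketch with made-up ratings:

```python
import pandas as pd

ratings = pd.Series([3.5, 3.5, 3.0, 2.0, 3.5])  # made-up ratings

counts = ratings.value_counts()  # most frequent value first
print(counts)

# Order the same counts by rating value instead of by frequency.
print(counts.sort_index())
```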

In [None]:
# Calculate percentage of each rating number instead of absolute value
###
### YOUR CODE HERE
###

You can create a new column or update an existing one using simple assignment. Another way is to use the `.assign()` function, which is especially useful in long chained expressions.

In [None]:
df['rating_power'] = df["Rating"] * df["No of people Rated"]
# or
df = df.assign(rating_power=lambda this_df: this_df["Rating"] * this_df["No of people Rated"])
df.head()
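To illustrate why `.assign()` suits chained expressions, here is a sketch on a made-up three-row frame: the new column is created, used for sorting, and the top rows are taken in a single expression, without modifying the original frame.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["A", "B", "C"],  # made-up rows
    "Rating": [3.5, 1.0, 4.0],
    "No of people Rated": [268, 720, 862],
})

top = (
    sample
    .assign(rating_power=lambda d: d["Rating"] * d["No of people Rated"])
    .sort_values("rating_power", ascending=False)
    .head(2)
)
print(top[["Name", "rating_power"]])

# The original frame is left untouched by the chain.
print("rating_power" in sample.columns)
```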

Common chart types are implemented in Pandas. You can access them through the `.plot` accessor, on either a DataFrame or a Series. To draw charts, we need to install the matplotlib library first.

In [None]:
!pip install matplotlib

In [None]:
df.rating_power.plot.hist(bins=50)

If you need to know which unique labels a column contains, you can use `.unique()`:

In [None]:
df['Category'].unique()

One of the most frequent operations in Pandas is filtering. You can do it using boolean indexing or `.query()` function.

In [None]:
df.loc[df['Category'] == 'Music']

In [None]:
df.loc[df['Category'] == 'Music'].sort_values('rating_power', ascending=False)

In [None]:
df.query('Category == "Music"')
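Filters often involve more than one condition. The sketch below, on a made-up frame, shows the same two-condition filter written with boolean indexing and with `.query()`, plus the backtick syntax that `.query()` uses for column names containing spaces.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],  # made-up rows
    "Category": ["Music", "Books", "Music", "Music"],
    "Rating": [4.5, 4.0, 2.0, 5.0],
    "No of people Rated": [300, 900, 800, 150],
})

# Boolean indexing: wrap each condition in parentheses and join them with &.
mask = (sample["Category"] == "Music") & (sample["Rating"] >= 4)
by_mask = sample.loc[mask]

# The same filter as a query string.
by_query = sample.query('Category == "Music" and Rating >= 4')

# Backticks quote a column name that contains spaces.
widely_rated = sample.query("`No of people Rated` > 500")

print(by_mask["Name"].tolist())
print(widely_rated["Name"].tolist())
```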

In [None]:
# Find Lifestyle apps with the best rating
###
### Your code here
###

In [None]:
# Find all apps from `Health and Fitness` category which more than 600 people rated
###
### Your code here
###

Another common use case is applying a custom function via `.apply()`. You can apply any function to a dataframe row-wise or column-wise.

In [None]:
# compute a boolean indicator of non-free apps (axis=1 passes one row at a time)
df.apply(lambda row: row.Price != 'Free', axis=1)
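For a simple comparison like this one, a vectorized expression on the column gives the same values and is usually the faster choice; `.apply()` is most useful when the logic cannot be expressed that way. A sketch on made-up rows, which also shows the column-wise direction:

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["A", "B", "C"],  # made-up rows
    "Price": ["Free", "₹ 269.00", "Free"],
})

# Row-wise apply: the function receives one row (a Series) at a time.
paid_apply = sample.apply(lambda row: row["Price"] != "Free", axis=1)

# Vectorized comparison on the whole column.
paid_vec = sample["Price"] != "Free"

print(paid_apply.tolist() == paid_vec.tolist())

# Column-wise apply (axis=0, the default): the function receives each column.
unique_per_column = sample.apply(lambda col: col.nunique())
print(unique_per_column)
```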

In [None]:
def price_to_num(p):
    """
    Function to convert price columns to float values.
    In case of free app set price to zero.
    Leave NaNs as is.
    """
    ###
    ### Your code here
    ###

df.Price.apply(price_to_num)

That's it for now. Pandas has a user guide with best practices for all of its functionality. Find it [here](https://pandas.pydata.org/docs/user_guide/index.html). Pandas is one of the essential tools for a data scientist, so spending more time learning it is highly recommended.