# Basic Engineering

## Introduction

In this notebook you will go through the process of setting up an environment and try the packages mentioned in the theory part of the module.

> We assume that you have created a fresh environment and installed the notebook package on top of it.

## Conda environment

At this point you already have a working environment, since you opened this notebook. The only package installed is **notebook**, which is part of the Jupyter ecosystem. Let's proceed with installing other packages.

## Using shell commands from the notebook

First, let's install numpy via pip. It is as easy as:

```bash
pip install numpy
```

To run this or any other shell command from a notebook cell, put the `!` symbol before the command.

In [1]:
!pip install numpy

Collecting numpy
  Using cached numpy-1.20.1-cp38-cp38-win_amd64.whl (13.7 MB)
Installing collected packages: numpy
Successfully installed numpy-1.20.1


To install pandas we are going to use conda instead. Look at the command below:
- You must pass the `-y` parameter when running `conda install` inside the notebook. Conda normally prompts you to confirm the installation, and the notebook cannot answer that prompt. With `-y` the prompt is skipped.
- We also use `-c conda-forge` to install from a specific conda channel. A channel is just a place where conda packages are stored. Many packages are not available in the official channels, but they are most likely available in conda-forge. Note that pandas is available in the official channel; we use conda-forge here just to get familiar with it.
- The number after `==` is the version of the package to install. Here we install pandas version 1.0.5.

In [3]:
!conda install -y -c conda-forge pandas==1.0.5

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\agama\anaconda3\envs\course

  added / updated specs:
    - pandas==1.0.5


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.12.5  |       h5b45459_0         173 KB  conda-forge
    certifi-2020.12.5          |   py38haa244fe_1         144 KB  conda-forge
    intel-openmp-2020.3        |     h57928b3_311         2.0 MB  conda-forge
    libblas-3.9.0              |            8_mkl         3.9 MB  conda-forge
    libcblas-3.9.0             |            8_mkl         3.9 MB  conda-forge
    liblapack-3.9.0            |            8_mkl         3.9

The following NEW packages will be INSTALLED:

  intel-openmp       conda-forge/win-64::intel-openmp-2020.3-h57928b3_311
  libblas            conda-forge/win-64::libblas-3.9.0-8_mkl
  libcblas           conda-forge/win-64::libcblas-3.9.0-8_mkl
  liblapack          conda-forge/win-64::liblapack-3.9.0-8_mkl
  mkl                conda-forge/win-64::mkl-2020.4-hb70f87d_311
  numpy              conda-forge/win-64::numpy-1.20.0-py38h0cc643e_0
  pandas             conda-forge/win-64::pandas-1.0.5-py38he6e81aa_0
  python_abi         conda-forge/win-64::python_abi-3.8-1_cp38
  pytz               c
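
After the installs finish, it can be handy to confirm which versions actually ended up in the environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the course material.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Print the versions of the packages installed above (None means "not installed").
for name in ("numpy", "pandas"):
    print(name, installed_version(name))
```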

## Numpy, vectorization

Numpy is a rich and powerful library. Its main idea is optimized operations on matrices (or tensors, when there are more than 2 dimensions). Its core is written in C, so it performs much better than pure Python. The same (or a similar) interface is used in the most popular deep learning libraries, such as TensorFlow and PyTorch.

The core principle of Numpy is vectorization. Instead of computing matrix operations one number at a time in a Python `for` loop, Numpy applies each operation to whole arrays in optimized compiled code, which gives much better performance.
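
As a small illustration of the idea, the sketch below squares every element of an array twice: once with a Python-level loop and once with a single vectorized expression. The array size and variable names are arbitrary choices for this example.

```python
import numpy as np

# int64 is requested explicitly so the squares cannot overflow on any platform.
values = np.arange(100_000, dtype=np.int64)

# Loop version: one Python-level multiplication per element.
squares_loop = [v * v for v in values.tolist()]

# Vectorized version: one expression applied to the whole array at once.
squares_vec = values * values

# Both approaches produce the same numbers.
print(squares_vec.tolist() == squares_loop)
```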

Let's see it for ourselves. First we create a 200x200 matrix of random integers from 0 to 99. Then we measure the time of matrix multiplication in two cases:
- Matrix as a list of lists in pure Python.
- Matrix as an optimized numpy array.

In [4]:
import numpy as np

mtx = np.random.randint(0, 100, size=(200, 200))

In [5]:
print(f'Shape: {mtx.shape}')
mtx

Shape: (200, 200)


array([[43, 26, 71, ..., 18, 39, 96],
       [58, 64, 99, ..., 12,  9, 69],
       [86, 46, 98, ..., 55, 32, 25],
       ...,
       [81, 60, 97, ...,  6, 32, 44],
       [30, 86, 30, ..., 24, 97, 24],
       [33, 95, 26, ..., 37, 44, 47]])

In [6]:
mtx_as_list = mtx.tolist()

In [7]:
%%time
# Naive matrix multiplication: three nested Python loops over lists of lists
result = np.zeros_like(mtx).tolist()
for i in range(len(mtx_as_list)):
    for j in range(len(mtx_as_list)):
        for k in range(len(mtx_as_list)):
            result[i][j] += mtx_as_list[i][k] * mtx_as_list[k][j]

Wall time: 3.12 s


In [8]:
%%time
# @ - matrix multiplication operation in Numpy
numpy_result = mtx @ mtx

Wall time: 12 ms


As you can see, matrix operations in Numpy are highly optimized. Many other libraries, like Pandas, build their operations on top of numpy.
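
The timings only matter if both approaches compute the same thing. A minimal check on a small matrix might look like this; `matmul_loops` is a hypothetical helper that mirrors the triple loop from the timing cell.

```python
import numpy as np


def matmul_loops(a, b):
    """Multiply two matrices given as lists of lists using plain Python loops."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                result[i][j] += a[i][k] * b[k][j]
    return result


# A small random matrix keeps the pure-Python version fast.
rng = np.random.default_rng(seed=0)
small = rng.integers(0, 100, size=(5, 5))

loop_result = matmul_loops(small.tolist(), small.tolist())
numpy_result = small @ small

# True when the loop version and the @ operator agree element by element.
print(np.array_equal(np.array(loop_result), numpy_result))
```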

## Pandas

Now let's look at Pandas. It is like a Swiss Army knife for data wrangling and analysis. We'll go through indexing, column creation, saving and loading data, and custom functions.

In [9]:
!pip install bds_courseware

Collecting bds_courseware
  Using cached bds_courseware-1.0.3-py3-none-any.whl (3.1 kB)
Collecting gdown
  Using cached gdown-3.12.2-cp38-none-any.whl
Collecting filelock
  Using cached filelock-3.0.12-py3-none-any.whl (7.6 kB)
Collecting requests[socks]
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting chardet<5,>=3.0.2
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting PySocks!=1.5.7,>=1.5.6
  Using cached PySocks-1.7.1-py3-none-any.whl (16 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.3-py2.py3-none-any.whl (137 kB)
Collecting tqdm
  Downloading tqdm-4.56.1-py2.py3-none-any.whl (72 kB)
Installing collected packages: urllib3, idna, chardet, requests, PySocks, tqdm, filelock, gdown, bds-courseware
Successfully installed PySocks-1.7.1 bds-courseware-1.0.3 chardet-4.0.0 filelock-3.0.12 gdown-3.12.2 idna-2.10 requests-2.25.1 tqdm-4.56.1 urllib3-1.26.3


In [10]:
from bds_courseware import get_msft_store_dataset
import pandas as pd

df = get_msft_store_dataset()
df.head()

ImportError: cannot import name 'get_msft_store_dataset' from 'bds_courseware' (C:\Users\agama\anaconda3\envs\course\lib\site-packages\bds_courseware\__init__.py)

Data is frequently stored as plain text in a comma-separated format (CSV), which is simple and human-readable.

The Pandas parser supports many options for reading such files. When reading a CSV file with a known structure, it is best to parse every column into its proper type at read time (e.g. timestamps). See the documentation of `pandas.read_csv` for all options.
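
As a sketch of parsing columns at read time, the example below reads a tiny CSV from an in-memory string. The column names and values are made up for illustration and are not taken from the course dataset; `dtype` and `parse_dates` are parameters of `pandas.read_csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,rating,released
App A,4.5,2020-01-15
App B,3.0,2019-11-02
"""

apps = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"rating": "float64"},   # set the numeric type explicitly
    parse_dates=["released"],      # parse this column as timestamps
)

print(apps.dtypes)
```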

In [None]:
df.to_csv('data.csv', index=False)

In [None]:
pd.read_csv('data.csv')

You can refer to a column by name, either like a dictionary key or like an attribute. The returned object is a **pandas.Series**. You can treat it like a data vector.

In [None]:
df["Name"]
# or
df.Name
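
One practical difference between the two styles: attribute access only works when the column name is a valid Python identifier. The sketch below uses a small hand-made frame (the values are hypothetical) to show that a name containing spaces needs bracket access.

```python
import pandas as pd

# Hypothetical data; only the column names matter here.
apps = pd.DataFrame({"Name": ["App A", "App B"], "No of people Rated": [120, 45]})

names = apps.Name                    # attribute access works for simple names
counts = apps["No of people Rated"]  # names with spaces need bracket access

print(type(names).__name__, type(counts).__name__)
```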

You can see that some names are truncated. When working with text data, it is useful to disable truncation with the following command:

In [None]:
pd.options.display.max_colwidth = None

**Series** support common math and boolean operations.

In [None]:
df["Rating"] < 2

In [None]:
df["Rating"] * 10
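
Boolean Series can also be combined element by element with `|` (or), `&` (and) and `~` (not), and then used to select values. A minimal sketch on a made-up ratings Series:

```python
import pandas as pd

# Hypothetical ratings, not taken from the course dataset.
ratings = pd.Series([1.0, 4.5, 3.0, 5.0], name="Rating")

low = ratings < 2
high = ratings > 4
extreme = low | high   # element-wise OR of two boolean Series
middle = ~extreme      # element-wise NOT

print(ratings[extreme].tolist())
print(ratings[middle].tolist())
```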

Another useful method is `value_counts()`. Use it to count how often each unique value occurs in a **Series**.

In [None]:
df["Rating"].value_counts()
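
To see the shape of the result, here is a sketch on a small made-up Series. The counts come back indexed by the unique values and ordered from most to least frequent; `sort_index()` reorders them by value instead.

```python
import pandas as pd

# Hypothetical ratings, not taken from the course dataset.
ratings = pd.Series([5, 3, 5, 4, 5, 3])

counts = ratings.value_counts()      # most frequent value first
counts_by_value = counts.sort_index()  # ordered by the rating itself

print(counts)
print(counts_by_value)
```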

In [None]:
# Calculate percentage of each rating number instead of absolute value
###
### YOUR CODE HERE
###

You can create a new column or update an existing one with a simple assignment. Another way is the `.assign()` method, which returns a new DataFrame. `.assign()` is especially useful in long chained expressions.

In [None]:
df['rating_power'] = df["Rating"] * df["No of people Rated"]
# or
df = df.assign(rating_power=lambda this_df: this_df["Rating"] * this_df["No of people Rated"])
df.head()
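
The sketch below shows `.assign()` inside a chained expression, using a small hand-made frame with the same column names as above (the rows are hypothetical). Because `.assign()` returns a new DataFrame, the original frame is left untouched.

```python
import pandas as pd

# Hypothetical rows that reuse the column names from the cell above.
apps = pd.DataFrame({
    "Name": ["App A", "App B", "App C"],
    "Rating": [4.5, 3.0, 5.0],
    "No of people Rated": [100, 400, 10],
})

top = (
    apps
    .assign(rating_power=lambda d: d["Rating"] * d["No of people Rated"])
    .sort_values("rating_power", ascending=False)
    .head(2)
)

print(top["Name"].tolist())
print("rating_power" in apps.columns)  # the original frame has no new column
```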

Pandas implements common chart types. You can access them through the `.plot` accessor, on either a DataFrame or a Series.

In [None]:
df.rating_power.plot.hist(bins=50)

If you need to know which unique labels a column contains, you can use `.unique()`.

In [None]:
df['Category'].unique()

One of the most frequent operations in Pandas is filtering. You can do it using boolean indexing or `.query()` function.

In [None]:
df.loc[df['Category'] == 'Music']

In [None]:
df.loc[df['Category'] == 'Music'].sort_values('rating_power', ascending=False)

In [None]:
df.query('Category == "Music"')
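
Both styles also handle several conditions at once. In the sketch below (a made-up products table, unrelated to the course dataset), the boolean masks are combined with `&` and each wrapped in parentheses, while `.query()` uses `and` and needs backticks around a column name that contains spaces.

```python
import pandas as pd

# Hypothetical data for illustration only.
products = pd.DataFrame({
    "product": ["pen", "desk", "lamp", "chair"],
    "category": ["office", "furniture", "office", "furniture"],
    "units sold": [300, 20, 75, 80],
})

# Boolean indexing: parentheses are required around each condition.
by_mask = products.loc[
    (products["category"] == "furniture") & (products["units sold"] > 50)
]

# query(): backticks quote a column name with spaces.
by_query = products.query('category == "furniture" and `units sold` > 50')

print(by_query["product"].tolist())
```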

In [None]:
# Find Lifestyle apps with best rating
###
### Your code here
###

In [None]:
# Find all apps from `Health and Fitness` category with more than 600 people rated
###
### Your code here
###

Another common use case is applying a custom function via `.apply()`. You can apply any function to a DataFrame row-wise or column-wise.

In [None]:
# With axis=1 the function receives one row at a time
df.apply(lambda row: row.Price != 'Free', axis=1)
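
To make the row-wise versus column-wise distinction concrete, here is a sketch on a tiny made-up table. With the default `axis=0` the function is called once per column; with `axis=1` it is called once per row.

```python
import pandas as pd

# Hypothetical scores for illustration only.
scores = pd.DataFrame({"math": [90, 60], "art": [70, 80]}, index=["ann", "bob"])

# axis=0 (default): the function receives each column as a Series.
col_max = scores.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series.
row_best = scores.apply(lambda row: row.idxmax(), axis=1)

print(col_max)
print(row_best)
```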

In [None]:
def price_to_num(p):
    """
    Function to convert price columns to float values.
    In case of free app set price to zero.
    Leave NaNs as is.
    """
    ###
    ### Your code here
    ###

df.Price.apply(price_to_num)

That's it for now. Pandas has a user guide with best practices for all of its functionality. Find it [here](https://pandas.pydata.org/docs/user_guide/index.html). Pandas is one of the essential tools for a data scientist, so spending more time learning it is highly recommended.