# Machine Learning for Software Engineers

> Topics covered include data analysis/visualization, feature engineering, supervised learning, unsupervised learning, and deep learning. All topics are covered using industry-standard frameworks: NumPy, pandas, scikit-learn, XGBoost, TensorFlow, and Keras.

- author: Victor Omondi
- toc: true
- comments: true
- categories: [software-engineer, machine-learning]
- image: images/mlse-shield.png

# Libraries

In [1]:
import warnings

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")

## Libraries setup

In [2]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

# Overview

## A. What is Machine Learning?

Machine learning is the branch of science that deals with algorithms and systems performing specific tasks using patterns and inference, rather than explicitly programmed instructions. There are a variety of different use cases for machine learning, from image recognition to text generation. Most machine learning tasks generalize to one of the following two learning types:

- **Supervised learning**: Using labeled data to train a model. The labels for the training dataset represent the class/category that each data observation belongs to. After training, the model should be able to predict labels for new data observations (from the same population distribution as the training data).
  - ***Example***: Let’s say you’re training a machine learning model to predict whether a picture contains a lake or not. With supervised learning, you would train a model on a dataset of pictures where the label for each picture is “Yes” if it contains a lake or “No” if it doesn’t. After training, the model will be able to take in a picture and determine whether or not it contains a lake.

- **Unsupervised Learning**: Using unlabeled data to allow a model to learn relationships between data observations and pick up on underlying patterns. Most data in the world is unlabeled, which makes unsupervised learning a very useful method of machine learning.
  - ***Example***: Going back to the same picture dataset from above, but now assume the training dataset is unlabeled. Using unsupervised learning, a model will be able to pick up on the inherent differences between pictures with a lake and pictures without a lake, e.g. differences in pixel color or orientation. This allows the model to cluster the pictures into two separate groups.

If it is possible to get large enough labeled training datasets, supervised learning is the way to go. However, it is often difficult to get fully labeled datasets, which is why many tasks require unsupervised learning or semi-supervised learning (a mix of supervised and unsupervised learning). Deciding which type of learning method to use is only the first step towards creating a machine learning model. You also need to choose the proper model architecture for your task and, most importantly, be able to process data into a training pipeline and interpret/analyze model results.
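
To make the contrast between the two learning types concrete, here is a minimal NumPy-only sketch. The 1-D "blueness" feature, the data values, and the `predict` helper are all hypothetical, chosen only for illustration. The supervised half uses a nearest-class-mean rule and the unsupervised half uses a few rounds of 2-means clustering; both are stand-ins for real models, not the methods used later in this course.

```python
import numpy as np

# Hypothetical 1-D feature: a "blueness" score for each picture.
features = np.array([0.1, 0.2, 0.15, 0.9, 0.8, 0.85])
# Labels exist only in the supervised setting: 1 = lake, 0 = no lake.
labels = np.array([0, 0, 0, 1, 1, 1])

# Supervised: learn one mean per labeled class, then predict the nearest mean.
class_means = np.array([features[labels == k].mean() for k in (0, 1)])

def predict(x):
    """Return the index of the closest class mean for each value in x."""
    x = np.asarray(x, dtype=float)
    return np.argmin(np.abs(x[:, None] - class_means), axis=1)

# Unsupervised: 2-means clustering that never looks at the labels.
centers = np.array([features.min(), features.max()])
for _ in range(5):
    assignment = np.argmin(np.abs(features[:, None] - centers), axis=1)
    centers = np.array([features[assignment == k].mean() for k in (0, 1)])

print(predict([0.05, 0.95]))  # predicted labels for two new pictures
print(assignment)             # cluster index found for each training picture
```

Note that the clustering step only groups the pictures; without labels it cannot say which group is the "lake" group.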

## B. ML vs. AI vs. Data Science

People often throw around the terms “machine learning”, “artificial intelligence”, and “data science” interchangeably. In reality, machine learning is a subset of artificial intelligence and overlaps heavily with data science. Artificial intelligence deals with any technique that allows machines to display “intelligence”, similar to humans. Machine learning is one of the main techniques used to create artificial intelligence, but other non-ML techniques (e.g. alpha-beta pruning, rule-based systems) are also widely used in AI.

On the other hand, data science deals with gathering insights from datasets. Traditionally, data scientists have used statistical methods for gathering these insights. However, as machine learning continues to grow, it has also penetrated into the field of data science.

In industry, any data scientist or AI researcher needs to have a good understanding of machine learning. Machine learning in industry has allowed us to create wonderful autonomous systems. These systems have matched, or sometimes even exceeded, the best human performance in their respective fields. A good example is AlphaGo, a machine-learning based system that has beaten the best human Go players in the world.

## C. 7 Steps of the Machine Learning Process

1. **Data Collection**: The process of extracting raw datasets for the machine learning task. This data can come from a variety of places, ranging from open-source online resources to paid crowdsourcing. The first step of the machine learning process is arguably the most important. If the data you collect is poor quality or irrelevant, then the model you train will be poor quality as well.
2. **Data Processing and Preparation**: Once you’ve gathered the relevant data, you need to process it and make sure that it is in a usable format for training a machine learning model. This includes handling missing data, dealing with outliers, etc.
3. **Feature Engineering**: Once you’ve collected and processed your dataset, you will likely need to transform some of the features (and sometimes even drop some features) in order to optimize how well a model can be trained on the data.
4. **Model Selection**: Based on the dataset, you will choose which model architecture to use. This is one of the main tasks of industry engineers. Rather than attempting to come up with a completely novel model architecture, most tasks can be thoroughly performed with an existing architecture (or combination of model architectures).
5. **Model Training and Data Pipeline**: After selecting the model architecture, you will create a data pipeline for training the model. This means creating a continuous stream of batched data observations to efficiently train the model. Since training can take a long time, you want your data pipeline to be as efficient as possible.
6. **Model Validation**: After training the model for a sufficient amount of time, you will need to validate the model’s performance on a held-out portion of the overall dataset. This data needs to come from the same underlying distribution as the training dataset, but needs to be different data that the model has not seen before.
7. **Model Persistence**: Finally, after training and validating the model’s performance, you need to be able to properly save the model weights and possibly push the model to production. This means setting up a process with which new users can easily use your pre-trained model to make predictions.

## D. What this course will provide

We’ll be able to process and clean a raw dataset, train a machine learning model on the data, and validate the model’s performance. Specifically, we will be able to:
- Take a raw dataset and process it for a given task. This means dealing with missing data and outliers, normalizing and transforming features, figuring out which features are the most relevant to the task, and picking out the best combination of features to use.
- Pick the correct model architecture to use based on the data. Many people default to using a large neural network for any machine learning task, but this is often unnecessary and can even hurt the model’s final performance if the dataset is not large enough.
- Code a machine learning model and train it on processed data. Validate the model’s performance on held-out data and understand techniques to improve a model’s performance.

# Data Manipulation with NumPy

## Introduction

An overview of data processing and the NumPy library.

In the **Data Manipulation** section, we will explore how to perform data manipulation using NumPy.

### A. Data Processing

When asked about Google's model for success, Peter Norvig, the director of research at Google, famously stated,

> "We don't have better algorithms than anyone else; we just have more data."

Though probably an understatement (given the amount of talent employed at Google), the quote does provide a sense of just how vital data is to having successful outcomes.

People normally discuss the importance of data in the context of machine learning. No matter how sophisticated a machine learning model is, it will not perform well unless it has a reasonable amount of data to train on. On the other hand, given a large and diverse set of training data, a good deep learning model will often significantly outperform non-deep learning algorithms.

However, data is not just limited to machine learning. Companies use data to identify customer trends, political parties use data to determine which demographics they should target, sports teams use data to analyze players, etc.

![Example.jpg](datasets/images/example.jpg "Example baseball data used in sabermetrics. The concept was popularized by the 2011 film, Moneyball.")

The universal usage of data makes **data processing**, the act of converting raw data into a meaningful form, an essential skill to have.

### B. NumPy

Many scenarios involve mostly numeric datasets. For example, medical data contains many numeric metrics, such as height, weight, and blood pressure. Furthermore, the majority of neural networks use input data that is either numeric or has been converted to a numeric form.

When we deal with numeric data, the best Python library to use is [NumPy](http://www.numpy.org/). The NumPy library allows us to perform many operations on numeric data, and convert the data to more usable forms.

In [4]:
# Initializing a NumPy array
arr = np.array([-1, 2, 5], dtype=np.float32)

# Print the representation of the array
arr

array([-1.,  2.,  5.], dtype=float32)

In the following chapters, we’ll explore all the necessary NumPy operations for data manipulation.

## NumPy Arrays

Exploring NumPy arrays and how they're used.

### Goals:

- Explore NumPy arrays and how to initialize them
- Write code to create several NumPy arrays

### A. Arrays

NumPy arrays are basically just Python lists with added features. In fact, we can easily convert a Python list to a NumPy array using the `np.array` function, which takes in a Python list as its required argument. The function also has quite a few keyword arguments, but the main one to know is `dtype`. The `dtype` keyword argument takes in a [NumPy type](https://docs.scipy.org/doc/numpy/user/basics.types.html) and manually casts the array to the specified type.

The code below is an example usage of  `np.array`  to create a 2-D matrix. 

> Note: the array is manually cast to  `np.float32` .

In [5]:
arr = np.array([[0, 1, 2], [3, 4, 5]],
               dtype=np.float32)
arr

array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)

When the elements of a NumPy array are mixed types, then the array's type will be  *upcast*  to the highest level type. This means that if an array input has mixed  `int`  and  `float`  elements, all the integers will be cast to their floating-point equivalents. If an array is mixed with  `int` ,  `float` , and  `string`  elements, everything is cast to strings.

The code below is an example of  `np.array`  upcasting. Both integers are cast to their floating-point equivalents.

In [6]:
arr = np.array([0, 0.1, 2])
arr

array([0. , 0.1, 2. ])
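
The string case of upcasting described above is not shown in the cell, so here is a small sketch of it. The values are arbitrary and used only for illustration.

```python
import numpy as np

# Mixing int, float, and str elements: everything is upcast to a string.
arr = np.array([0, 0.1, 'abc'])
print(repr(arr))

# A dtype kind of 'U' means the array holds Unicode strings.
print(arr.dtype.kind)
```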

### B. Copying

Similar to Python lists, when we make a reference to a NumPy array it doesn't create a different array. Therefore, if we change a value using the reference variable, it changes the original array as well. We get around this by using an array's inherent  `copy`  function. The function has no required arguments, and it returns the copied array.

In the code example below, `c` is a reference to `a` while `d` is a copy of `b`. Therefore, changing `c` leads to the same change in `a`, while changing `d` does not change the value of `b`.

In [7]:
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print(f'Array a: {a}')

Array a: [0 1]


In [8]:
c[0] = 5
print(f'Array a: {a}')

Array a: [5 1]


In [9]:
d = b.copy()
d[0] = 6
print(f'Array b: {b}')

Array b: [9 8]
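
As a complementary check, the sketch below confirms the difference between a reference and a copy without relying on printed values. It uses the `is` operator and `np.shares_memory`; the variable names are only illustrative.

```python
import numpy as np

a = np.array([0, 1])
c = a          # a second name for the same array object
d = a.copy()   # an independent array holding the same values

print(c is a)                  # True: both names point at one array
print(np.shares_memory(a, d))  # False: the copy owns its own data

# Changing the copy leaves the original untouched.
d[0] = 5
print(f'Array a: {a}')
```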


### C. Casting

We cast NumPy arrays through their inherent  `astype`  function. The function's required argument is the new type for the array. It returns the array cast to the new type.

The code below shows an example of casting using the  `astype`  function. The  `dtype`  property returns the type of an array.

In [10]:
arr = np.array([0, 1, 2])
arr.dtype

dtype('int32')

In [11]:
arr = arr.astype(np.float32)
arr.dtype

dtype('float32')
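
One detail worth seeing in code is what happens when casting in the other direction, from floating point to integer. The sketch below uses arbitrary example values; it shows that `astype` returns a new array and that the fractional part is dropped rather than rounded.

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.0])

# astype returns a new array; the original keeps its dtype.
ints = floats.astype(np.int32)

print(repr(ints))    # fractional parts are truncated toward zero
print(floats.dtype)  # the source array is still float64
```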

### D. NaN

When we don't want a NumPy array to contain a value at a particular index, we can use  `np.nan`  to act as a placeholder. A common usage for  `np.nan`  is as a filler value for incomplete data.

The code below shows an example usage of  `np.nan` . Note that  `np.nan`  cannot take on an integer type.

In [12]:
arr = np.array([np.nan, 1, 2])
arr

array([nan,  1.,  2.])

In [13]:
arr = np.array([np.nan, 'abc'])
arr

array(['nan', 'abc'], dtype='<U32')

In [14]:
# Will result in a ValueError
np.array([np.nan, 1, 2], dtype=np.int32)

ValueError: cannot convert float NaN to integer
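
Since `np.nan` usually marks incomplete data, a natural follow-up is how to find those entries. The sketch below is one way to do it with `np.isnan`; a comparison with `==` does not work because NaN is never equal to anything, including itself.

```python
import numpy as np

arr = np.array([np.nan, 1, 2])

# NaN never compares equal, even to itself.
print(np.nan == np.nan)  # False

# np.isnan builds a boolean mask that marks the missing entries.
missing = np.isnan(arr)
print(missing)

# Inverting the mask keeps only the entries that are present.
print(arr[~missing])
```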

### E. Infinity

To represent infinity in NumPy, we use the  `np.inf`  special value. We can also represent negative infinity with  `-np.inf` .

The code below shows an example usage of  `np.inf` . Note that  `np.inf`  cannot take on an integer type.

In [15]:
print(np.inf > 1000000)

True


In [17]:
arr = np.array([np.inf, 5])
arr

array([inf,  5.])

In [18]:
arr = np.array([-np.inf, 1])
arr

array([-inf,   1.])

In [19]:
# Will result in an OverflowError
np.array([np.inf, 3], dtype=np.int32)

OverflowError: cannot convert float infinity to integer

## NumPy Basics

Perform basic operations to create and modify NumPy arrays.

### Goals:

- Explore some basic NumPy operations
- Write code using the basic NumPy functions

### A. Ranged data

While `np.array` can be used to create any array, it is equivalent to hardcoding an array. This won't work when the array has hundreds of values. Instead, NumPy provides an option to create ranged data arrays using `np.arange`. The function acts very similarly to the `range` function in Python, and will always return a 1-D array.

The code below contains example usages of  `np.arange`.

In [20]:
arr = np.arange(5)
arr

array([0, 1, 2, 3, 4])

In [21]:
arr = np.arange(5.1)
arr

array([0., 1., 2., 3., 4., 5.])

In [22]:
arr = np.arange(-1, 4)
arr

array([-1,  0,  1,  2,  3])

In [23]:
arr = np.arange(-1.5, 4, 2)
arr

array([-1.5,  0.5,  2.5])

The output of  `np.arange`  is specified as follows:

* If only a single number, *n*, is passed in as an argument, `np.arange` will return an array with all the integers in the range [0, *n*). Note: the lower end is inclusive while the upper end is exclusive.
* For two arguments,  *m*  and  *n* ,  `np.arange`  will return an array with all the integers in the range [ *m* ,  *n* ).
* For three arguments,  *m* ,  *n* , and  *s* ,  `np.arange`  will return an array with the integers in the range [ *m* ,  *n* ) using a step size of  *s* .
* Like  `np.array` ,  `np.arange`  performs upcasting. It also has the  `dtype`  keyword argument to manually cast the array.
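
The last rule in the list is not covered by the cells above, so here is a short sketch of it. The argument values are arbitrary.

```python
import numpy as np

# Manually cast a ranged array with the dtype keyword.
arr = np.arange(5, dtype=np.float32)
print(repr(arr))

# A single floating-point argument upcasts the whole result.
stepped = np.arange(0, 2, 0.5)
print(stepped.dtype)
```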

To specify the number of elements in the returned array, rather than the step size, we can use the  `np.linspace`  function.

This function takes in a required first two arguments, for the start and end of the range, respectively. The end of the range is inclusive for  `np.linspace` , unless the keyword argument  `endpoint`  is set to  `False` . To specify the number of elements, we set the  `num`  keyword argument (its default value is  `50` ).

The code below shows example usages of `np.linspace`. The function also takes in the `dtype` keyword argument for manual casting.

In [24]:
arr = np.linspace(5, 11, num=4)
arr

array([ 5.,  7.,  9., 11.])

In [25]:
arr = np.linspace(5, 11, num=4, endpoint=False)
arr

array([5. , 6.5, 8. , 9.5])

In [26]:
arr = np.linspace(5, 11, num=4, dtype=np.int32)
arr

array([ 5,  7,  9, 11])

### B. Reshaping data

The function we use to reshape data in NumPy is  `np.reshape` . It takes in an array and a new shape as required arguments. The new shape must exactly contain all the elements from the input array. For example, we could reshape an array with 12 elements to  `(4, 3)` , but we can't reshape it to  `(4, 4)` .

We are allowed to use the special value of -1 in at most one dimension of the new shape. The dimension with -1 will take on the value necessary to allow the new shape to contain all the elements of the array.

The code below shows example usages of  `np.reshape` .

In [27]:
arr = np.arange(8)

reshaped_arr = np.reshape(arr, (2, 4))
reshaped_arr

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [28]:
print(f'New shape: {reshaped_arr.shape}')

New shape: (2, 4)


In [29]:
reshaped_arr = np.reshape(arr, (-1, 2, 2))
reshaped_arr

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [30]:
print(f'New shape: {reshaped_arr.shape}')

New shape: (2, 2, 2)
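
The 12-element case mentioned at the start of this section can be checked directly. In the sketch below, the valid shape succeeds and the invalid one raises a `ValueError`, which is caught so the cell keeps running.

```python
import numpy as np

arr = np.arange(12)

# 4 * 3 == 12, so this shape holds every element exactly once.
print(np.reshape(arr, (4, 3)).shape)

# 4 * 4 == 16, so 12 elements cannot fill this shape.
try:
    np.reshape(arr, (4, 4))
except ValueError as err:
    print(f'ValueError: {err}')
```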


While the `np.reshape` function can perform any reshaping we need, NumPy provides an inherent function for flattening an array. Flattening an array reshapes it into a 1-D array. Since we need to flatten data quite often, it is a useful function.

The code below flattens an array using the inherent  `flatten`  function.

In [32]:
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
flattened = arr.flatten()
arr

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [33]:
print(f'arr shape: {arr.shape}')

arr shape: (2, 4)


In [34]:
flattened

array([0, 1, 2, 3, 4, 5, 6, 7])

In [35]:
print(f'flattened shape: {flattened.shape}')

flattened shape: (8,)
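
Two properties of `flatten` are easy to verify with a short sketch: it returns a copy of the data, and it produces the same layout as reshaping with `-1`. The value `99` is arbitrary.

```python
import numpy as np

arr = np.reshape(np.arange(8), (2, 4))
flattened = arr.flatten()

# flatten returns a copy, so editing it leaves arr untouched.
flattened[0] = 99
print(arr[0, 0])

# Reshaping with -1 gives the same 1-D ordering as flatten.
print(np.array_equal(arr.flatten(), np.reshape(arr, -1)))
```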


### C. Transposing

Similar to how it is common to reshape data, it is also common to transpose data. Perhaps we have data that's supposed to be in a particular format, but some new data we get is rearranged. We can just transpose the data, using the  `np.transpose`  function, to convert it to the proper format.

The code below shows an example usage of the  `np.transpose`  function. The matrix rows become columns after the transpose.

In [36]:
arr = np.arange(8)
arr = np.reshape(arr, (4, 2))
transposed = np.transpose(arr)
display(arr)
print('arr shape: {}'.format(arr.shape))
display(transposed)
print('transposed shape: {}'.format(transposed.shape))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

arr shape: (4, 2)


array([[0, 2, 4, 6],
       [1, 3, 5, 7]])

transposed shape: (2, 4)


The function takes in a required first argument, which will be the array we want to transpose. It also has a single keyword argument called  `axes` , which represents the new  *permutation*  of the dimensions.

The permutation is a tuple/list of integers, with the same length as the number of dimensions in the array. It tells us how to rearrange the dimensions. Dimensions are numbered from 0, so if the permutation had 2 at index 1, it means the old third dimension of the data becomes the new second dimension (since index 1 represents the second dimension).

The code below shows an example usage of the  `np.transpose`  function with the  `axes`  keyword argument. The  `shape`  property gives us the shape of an array.

In [37]:
arr = np.arange(24).reshape((3, 4, 2))
print('arr shape: {}'.format(arr.shape))
transposed = np.transpose(arr, axes=(1, 2, 0))
print('transposed shape: {}'.format(transposed.shape))

arr shape: (3, 4, 2)
transposed shape: (4, 2, 3)


In this example, the old first dimension became the new third dimension, the old second dimension became the new first dimension, and the old third dimension became the new second dimension. The default value for  `axes`  is a dimension reversal (e.g. for 3-D data the default  `axes`  value is  `[2, 1, 0]` ).
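
To check the dimension mapping at the level of individual elements, the sketch below compares one entry before and after the transpose. The index `(3, 1, 2)` is an arbitrary choice.

```python
import numpy as np

arr = np.arange(24).reshape((3, 4, 2))
transposed = np.transpose(arr, axes=(1, 2, 0))

# New axis 0 is old axis 1, new axis 1 is old axis 2, and new axis 2 is
# old axis 0, so transposed[i, j, k] holds the value of arr[k, i, j].
print(transposed[3, 1, 2] == arr[2, 3, 1])

# With no axes argument, the dimensions are simply reversed.
print(np.transpose(arr).shape)
```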

### D. Zeros and ones

Sometimes, we need to create arrays filled solely with 0 or 1. For example, since binary data is labeled with 0 and 1, we may need to create dummy datasets of strictly one label. For creating these arrays, NumPy provides the functions `np.zeros` and `np.ones`. They both take in the same arguments, which include just one required argument, the array shape. The functions also allow for manual casting using the `dtype` keyword argument.

The code below shows example usages of  `np.zeros`  and  `np.ones` .

In [38]:
arr = np.zeros(4)
arr

array([0., 0., 0., 0.])

In [39]:
arr = np.ones((2, 3))
arr

array([[1., 1., 1.],
       [1., 1., 1.]])

In [40]:
arr = np.ones((2, 3), dtype=np.int32)
arr

array([[1, 1, 1],
       [1, 1, 1]])

If we want to create an array of 0's or 1's with the same shape as another array, we can use  `np.zeros_like`  and  `np.ones_like` .

The code below shows example usages of  `np.zeros_like`  and  `np.ones_like` .

In [41]:
arr = np.array([[1, 2], [3, 4]])
np.zeros_like(arr)

array([[0, 0],
       [0, 0]])

In [42]:
arr = np.array([[0., 1.], [1.2, 4.]])
np.ones_like(arr)

array([[1., 1.],
       [1., 1.]])

In [43]:
np.ones_like(arr, dtype=np.int32)

array([[1, 1],
       [1, 1]])

## Math

Understand how arithmetic and linear algebra work in NumPy.

### Goals:

- How to perform math operations in NumPy
- Write code using NumPy math functions

### A. Arithmetic

One of the main purposes of NumPy is to perform multi-dimensional arithmetic. Using NumPy arrays, we can apply arithmetic to each element with a single operation.

The code below shows multi-dimensional arithmetic with NumPy.

In [44]:
arr = np.array([[1, 2], [3, 4]])
# Add 1 to element values
arr + 1

array([[2, 3],
       [4, 5]])

In [45]:
# Subtract 1.2 from element values
arr - 1.2

array([[-0.2,  0.8],
       [ 1.8,  2.8]])

In [46]:
# Double element values
arr * 2

array([[2, 4],
       [6, 8]])

In [47]:
# Halve element values
arr / 2

array([[0.5, 1. ],
       [1.5, 2. ]])

In [48]:
# Integer division (half)
arr // 2

array([[0, 1],
       [1, 2]], dtype=int32)

In [49]:
# Square element values
arr**2

array([[ 1,  4],
       [ 9, 16]], dtype=int32)

In [50]:
# Square root element values
print(repr(arr**0.5))

array([[1.        , 1.41421356],
       [1.73205081, 2.        ]])


Using NumPy arithmetic, we can easily modify large amounts of numeric data with only a few operations. For example, we could convert a dataset of Fahrenheit temperatures to their equivalent Celsius form.

The code below converts Fahrenheit to Celsius in NumPy.

In [51]:
def f2c(temps):
    return (5/9)*(temps-32)

fahrenheits = np.array([32, -4, 14, -40])
celsius = f2c(fahrenheits)
print(f'Celsius: {repr(celsius)}')

Celsius: array([  0., -20., -10., -40.])


It is important to note that performing arithmetic on NumPy arrays  **does not change the original array** , and instead produces a new array that is the result of the arithmetic operation.
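
A quick sketch of this behavior: the result of the operation is a separate array, and the original only changes if we assign the result back to the same name.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# The arithmetic result is a new array.
doubled = arr * 2
print(repr(doubled))

# The original array still holds its old values.
print(repr(arr))

# To keep the result under the same name, assign it back.
arr = arr * 2
print(repr(arr))
```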

### B. Non-linear functions

Apart from basic arithmetic operations, NumPy also allows you to use non-linear functions such as exponentials and logarithms.

The function  `np.exp`  performs a base  *e*  exponential on an array, while the function  `np.exp2`  performs a base 2 exponential. Likewise,  `np.log` ,  `np.log2` , and  `np.log10`  all perform logarithms on an input array, using base  *e* , base 2, and base 10, respectively.

The code below shows various exponentials and logarithms with NumPy. Note that  `np.e`  and  `np.pi`  represent the mathematical constants  $e$  and $\pi$, respectively.

In [52]:
arr = np.array([[1, 2], [3, 4]])
# e raised to the power of each element
np.exp(arr)

array([[ 2.71828183,  7.3890561 ],
       [20.08553692, 54.59815003]])

In [53]:
# 2 raised to the power of each element
np.exp2(arr)

array([[ 2.,  4.],
       [ 8., 16.]])

In [54]:
arr2 = np.array([[1, 10], [np.e, np.pi]])
# Natural logarithm
np.log(arr2)

array([[0.        , 2.30258509],
       [1.        , 1.14472989]])

In [55]:
# Base 10 logarithm
np.log10(arr2)

array([[0.        , 1.        ],
       [0.43429448, 0.49714987]])
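
The cells above do not exercise `np.log2`, so here is a small sketch of it. It also shows that `np.log2` undoes `np.exp2`. The input values are arbitrary powers of 2.

```python
import numpy as np

arr = np.array([1, 2, 8])

# Base 2 logarithm of each element.
print(np.log2(arr))

# np.log2 reverses np.exp2, recovering the original values.
print(np.log2(np.exp2(arr)))
```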

To do a regular power operation with any base, we use  `np.power` . The first argument to the function is the base, while the second is the power. If the base or power is an array rather than a single number, the operation is applied to every element in the array.

The code below shows examples of using  `np.power` .

In [56]:
arr = np.array([[1, 2], [3, 4]])
# Raise 3 to power of each number in arr
np.power(3, arr)

array([[ 3,  9],
       [27, 81]], dtype=int32)

In [57]:
arr2 = np.array([[10.2, 4], [3, 5]])
# Raise arr2 to power of each number in arr
np.power(arr2, arr)

array([[ 10.2,  16. ],
       [ 27. , 625. ]])

In addition to exponentials and logarithms, NumPy has various other mathematical functions, which are listed [here](https://docs.scipy.org/doc/numpy/reference/routines.math.html).

### C. Matrix multiplication

Since NumPy arrays are basically vectors and matrices, it makes sense that there are functions for dot products and matrix multiplication. Specifically, the main function to use is  `np.matmul` , which takes two vector/matrix arrays as input and produces a dot product or matrix multiplication.

The code below shows various examples of matrix multiplication. When both inputs are 1-D, the output is the dot product.

Note that the dimensions of the two input matrices must be valid for a matrix multiplication. Specifically, the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise  `np.matmul`  will result in a  `ValueError` .

In [58]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([-3, 0, 10])
np.matmul(arr1, arr2)

27

In [59]:
arr3 = np.array([[1, 2], [3, 4], [5, 6]])
arr4 = np.array([[-1, 0, 1], [3, 2, -4]])
np.matmul(arr3, arr4)

array([[  5,   4,  -7],
       [  9,   8, -13],
       [ 13,  12, -19]])

In [60]:
np.matmul(arr4, arr3)

array([[  4,   4],
       [-11, -10]])

In [61]:
# This will result in ValueError
np.matmul(arr3, arr3)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 2)
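
The shape rule behind this error can be checked before multiplying. The sketch below uses a hypothetical helper, `can_matmul`, which only handles the 2-D case. It also shows the `@` operator, which gives the same result as `np.matmul` for these arrays.

```python
import numpy as np

arr3 = np.array([[1, 2], [3, 4], [5, 6]])    # shape (3, 2)
arr4 = np.array([[-1, 0, 1], [3, 2, -4]])    # shape (2, 3)

def can_matmul(a, b):
    """Hypothetical helper: check the 2-D shape rule for a x b."""
    return a.shape[1] == b.shape[0]

print(can_matmul(arr3, arr4))  # True: 2 columns match 2 rows
print(can_matmul(arr3, arr3))  # False: 2 columns do not match 3 rows

# The @ operator is shorthand for matrix multiplication.
print(np.array_equal(arr3 @ arr4, np.matmul(arr3, arr4)))
```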