# Machine Learning for Software Engineers

> Topics covered include data analysis/visualization, feature engineering, supervised learning, unsupervised learning, and deep learning. All topics are covered using industry-standard frameworks: NumPy, pandas, scikit-learn, XGBoost, TensorFlow, and Keras.

- author: Victor Omondi
- toc: true
- comments: true
- categories: [software-engineer, machine-learning]
- image: images/mlse-shield.png

# Libraries

In [1]:
import warnings

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")

## Libraries setup

In [148]:
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
warnings.filterwarnings("ignore", message="The default dtype for empty Series will be 'object' instead of 'float64' in a future version")

# Overview

## A. What is Machine Learning?

Machine learning is the branch of science that deals with algorithms and systems performing specific tasks using patterns and inference, rather than explicitly programmed instructions. There are a variety of different use cases for machine learning, from image recognition to text generation. Most machine learning tasks generalize to one of the following two learning types:

- **Supervised learning**: Using labeled data to train a model. The labels for the training dataset represent the class/category that each data observation belongs to. After training, the model should be able to predict labels for new data observations (from the same population distribution as the training data).
  - ***Example***: Let’s say you’re training a machine learning model to predict whether a picture contains a lake or not. With supervised learning, you would train a model on a dataset of pictures where the label for each picture is “Yes” if it contains a lake or “No” if it doesn’t. After training, the model will be able to take in a picture and determine whether or not it contains a lake.

- **Unsupervised Learning**: Using unlabeled data to allow a model to learn relationships between data observations and pick up on underlying patterns. Most data in the world is unlabeled, which makes unsupervised learning a very useful method of machine learning.
  - ***Example***: Going back to the same picture dataset from above, but now assume the training dataset is unlabeled. Using unsupervised learning, a model will be able to pick up on the inherent differences between pictures with a lake and pictures without a lake, e.g. differences in pixel color or orientation. This allows the model to cluster the pictures into two separate groups.

If it is possible to get large enough labeled training datasets, supervised learning is the way to go. However, it is often difficult to get fully labeled datasets, which is why many tasks require unsupervised learning or semi-supervised learning (a mix of supervised and unsupervised learning). Deciding which type of learning method to use is only the first step towards creating a machine learning model. You also need to choose the proper model architecture for your task and, most importantly, be able to process data into a training pipeline and interpret/analyze model results.

## B. ML vs. AI vs. Data Science

People often throw around the terms “machine learning”, “artificial intelligence”, and “data science” interchangeably. In reality, machine learning is a subset of artificial intelligence and overlaps heavily with data science. Artificial intelligence deals with any technique that allows machines to display “intelligence”, similar to humans. Machine learning is one of the main techniques used to create artificial intelligence, but other non-ML techniques (e.g. alpha-beta pruning, rule-based systems) are also widely used in AI.

On the other hand, data science deals with gathering insights from datasets. Traditionally, data scientists have used statistical methods for gathering these insights. However, as machine learning continues to grow, it has also penetrated into the field of data science.

In industry, any data scientist or AI researcher needs to have a good understanding of machine learning. Machine learning in industry has allowed us to create wonderful autonomous systems. These systems have matched, or sometimes even exceeded, the best human performance in their respective fields. A good example is AlphaGo, a machine-learning based system that has beaten the best human Go players in the world.

## C. 7 Steps of the Machine Learning Process

1. **Data Collection**: The process of extracting raw datasets for the machine learning task. This data can come from a variety of places, ranging from open-source online resources to paid crowdsourcing. The first step of the machine learning process is arguably the most important. If the data you collect is poor quality or irrelevant, then the model you train will be poor quality as well.
2. **Data Processing and Preparation**: Once you’ve gathered the relevant data, you need to process it and make sure that it is in a usable format for training a machine learning model. This includes handling missing data, dealing with outliers, etc.
3. **Feature Engineering**: Once you’ve collected and processed your dataset, you will likely need to transform some of the features (and sometimes even drop some features) in order to optimize how well a model can be trained on the data.
4. **Model Selection**: Based on the dataset, you will choose which model architecture to use. This is one of the main tasks of industry engineers. Rather than attempting to come up with a completely novel model architecture, most tasks can be thoroughly performed with an existing architecture (or combination of model architectures).
5. **Model Training and Data Pipeline**: After selecting the model architecture, you will create a data pipeline for training the model. This means creating a continuous stream of batched data observations to efficiently train the model. Since training can take a long time, you want your data pipeline to be as efficient as possible.
6. **Model Validation**: After training the model for a sufficient amount of time, you will need to validate the model’s performance on a held-out portion of the overall dataset. This data needs to come from the same underlying distribution as the training dataset, but needs to be different data that the model has not seen before.
7. **Model Persistence**: Finally, after training and validating the model’s performance, you need to be able to properly save the model weights and possibly push the model to production. This means setting up a process with which new users can easily use your pre-trained model to make predictions.
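
The held-out split from step 6 can be sketched with NumPy alone. This is a minimal illustration, not a full pipeline: the toy dataset, the 80/20 ratio, and all variable names are assumptions made for the example.

```python
import numpy as np

# Toy dataset (hypothetical): 10 observations with 3 features each, plus labels
features = np.arange(30).reshape((10, 3))
labels = np.arange(10) % 2

# Shuffle the row indices so the split is random but reproducible
np.random.seed(0)
indices = np.random.permutation(len(features))

# Keep the first 80% of the shuffled rows for training, hold out the rest
split = int(0.8 * len(features))
train_idx, val_idx = indices[:split], indices[split:]

train_features, train_labels = features[train_idx], labels[train_idx]
val_features, val_labels = features[val_idx], labels[val_idx]

print(train_features.shape, val_features.shape)  # (8, 3) (2, 3)
```

Because every index appears in exactly one of the two groups, the model is never validated on an observation it was trained on.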

## D. What this course will provide

We’ll be able to process and clean a raw dataset, train a machine learning model on the data, and validate the model’s performance. Specifically, we will be able to:
- Take a raw dataset and process it for a given task. This means dealing with missing data and outliers, normalizing and transforming features, figuring out which features are the most relevant to the task, and picking out the best combination of features to use.
- Pick the correct model architecture to use based on the data. Many people default to using a large neural network for any machine learning task, but this is often unnecessary and can even hurt the model’s final performance if the dataset is not large enough.
- Code a machine learning model and train it on processed data. Validate the model’s performance on held-out data and understand techniques to improve a model’s performance.

# Data Manipulation with NumPy

## Introduction

An overview of data processing and the NumPy library.

In the **Data Manipulation** section, we will explore how to perform data manipulation using NumPy.

### A. Data Processing

When asked about Google's model for success, Peter Norvig, the director of research at Google, famously stated,

> "We don't have better algorithms than anyone else; we just have more data."

Though probably an understatement (given the amount of talent employed at Google), the quote does provide a sense of just how vital data is to having successful outcomes.

People normally discuss the importance of data in the context of machine learning. No matter how sophisticated a machine learning model is, it will not perform well unless it has a reasonable amount of data to train on. On the other hand, given a large and diverse set of training data, a good deep learning model will often significantly outperform non-deep learning algorithms.

However, data is not just limited to machine learning. Companies use data to identify customer trends, political parties use data to determine which demographics they should target, sports teams use data to analyze players, etc.

![Example.jpg](datasets/images/example.jpg "Example baseball data used in sabermetrics. The concept was popularized by the 2011 film, Moneyball.")

The universal usage of data makes **data processing**, the act of converting raw data into a meaningful form, an essential skill to have.

### B. NumPy

Many scenarios involve mostly numeric datasets. For example, medical data contains many numeric metrics, such as height, weight, and blood pressure. Furthermore, the majority of neural networks use input data that is either numeric or has been converted to a numeric form.

When we deal with numeric data, the best Python library to use is [NumPy](http://www.numpy.org/). The NumPy library allows us to perform many operations on numeric data, and convert the data to more usable forms.

In [3]:
# Initializing a NumPy array
arr = np.array([-1, 2, 5], dtype=np.float32)

# Print the representation of the array
arr

array([-1.,  2.,  5.], dtype=float32)

In the following chapters, we’ll explore all the necessary NumPy operations for data manipulation.

## NumPy Arrays

Exploring NumPy arrays and how they're used.

### Goals:

- Explore NumPy arrays and how to initialize them
- Write code to create several NumPy arrays

### A. Arrays

NumPy arrays are basically just Python lists with added features. In fact, we can easily convert a Python list to a NumPy array using the `np.array` function, which takes in a Python list as its required argument. The function also has quite a few keyword arguments, but the main one to know is `dtype`. The `dtype` keyword argument takes in a [NumPy type](https://docs.scipy.org/doc/numpy/user/basics.types.html) and manually casts the array to the specified type.

The code below is an example usage of  `np.array`  to create a 2-D matrix. 

> Note: the array is manually cast to  `np.float32` .

In [4]:
arr = np.array([[0, 1, 2], [3, 4, 5]],
               dtype=np.float32)
arr

array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)

When the elements of a NumPy array are mixed types, then the array's type will be  *upcast*  to the highest level type. This means that if an array input has mixed  `int`  and  `float`  elements, all the integers will be cast to their floating-point equivalents. If an array is mixed with  `int` ,  `float` , and  `string`  elements, everything is cast to strings.

The code below is an example of  `np.array`  upcasting. Both integers are cast to their floating-point equivalents.

In [5]:
arr = np.array([0, 0.1, 2])
arr

array([0. , 0.1, 2. ])

### B. Copying

Similar to Python lists, when we make a reference to a NumPy array it doesn't create a different array. Therefore, if we change a value using the reference variable, it changes the original array as well. We get around this by using an array's inherent  `copy`  function. The function has no required arguments, and it returns the copied array.

In the code example below, `c` is a reference to `a`, while `d` is a copy of `b`. Therefore, changing `c` leads to the same change in `a`, while changing `d` does not change the value of `b`.

In [6]:
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print(f'Array a: {a}')

Array a: [0 1]


In [7]:
c[0] = 5
print(f'Array a: {a}')

Array a: [5 1]


In [8]:
d = b.copy()
d[0] = 6
print(f'Array b: {b}')

Array b: [9 8]
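
To check whether two variables refer to the same data, one option is the `is` operator together with `np.shares_memory`. Neither is used elsewhere in this section, so treat this as a supplementary sketch.

```python
import numpy as np

a = np.array([0, 1])
c = a            # second name for the same array object
d = a.copy()     # independent array with its own data

print(c is a)                   # True
print(d is a)                   # False
print(np.shares_memory(a, c))   # True
print(np.shares_memory(a, d))   # False
```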


### C. Casting

We cast NumPy arrays through their inherent  `astype`  function. The function's required argument is the new type for the array. It returns the array cast to the new type.

The code below shows an example of casting using the  `astype`  function. The  `dtype`  property returns the type of an array.

In [9]:
arr = np.array([0, 1, 2])
arr.dtype

dtype('int32')

In [10]:
arr = arr.astype(np.float32)
arr.dtype

dtype('float32')
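
One detail worth knowing when casting in the other direction: converting floats to an integer type drops the fractional part rather than rounding. A small sketch, with values chosen only for illustration:

```python
import numpy as np

arr = np.array([1.7, -1.7, 2.0])
as_int = arr.astype(np.int32)   # fractional part is dropped (truncation toward zero)

print(as_int.tolist())   # [1, -1, 2]
print(arr.dtype)         # float64 (astype returned a new array, arr is unchanged)
```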

### D. NaN

When we don't want a NumPy array to contain a value at a particular index, we can use  `np.nan`  to act as a placeholder. A common usage for  `np.nan`  is as a filler value for incomplete data.

The code below shows an example usage of  `np.nan` . Note that  `np.nan`  cannot take on an integer type.

In [11]:
arr = np.array([np.nan, 1, 2])
arr

array([nan,  1.,  2.])

In [12]:
arr = np.array([np.nan, 'abc'])
arr

array(['nan', 'abc'], dtype='<U32')

In [13]:
# Will result in a ValueError
np.array([np.nan, 1, 2], dtype=np.int32)

ValueError: cannot convert float NaN to integer
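
Since `np.nan` marks missing values, we usually need a way to find it afterwards. Comparing with `==` does not work, because NaN is not equal to anything, including itself. The sketch below uses `np.isnan`, which is not introduced above, to build an element-wise mask.

```python
import numpy as np

arr = np.array([np.nan, 1, 2])

print(np.nan == np.nan)   # False: NaN never compares equal, even to itself

mask = np.isnan(arr)      # element-wise NaN check
print(mask.tolist())      # [True, False, False]
print(int(mask.sum()))    # 1 (number of missing values)
```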

### E. Infinity

To represent infinity in NumPy, we use the  `np.inf`  special value. We can also represent negative infinity with  `-np.inf` .

The code below shows an example usage of  `np.inf` . Note that  `np.inf`  cannot take on an integer type.

In [14]:
print(np.inf > 1000000)

True


In [15]:
arr = np.array([np.inf, 5])
arr

array([inf,  5.])

In [16]:
arr = np.array([-np.inf, 1])
arr

array([-inf,   1.])

In [17]:
# Will result in an OverflowError
np.array([np.inf, 3], dtype=np.int32)

OverflowError: cannot convert float infinity to integer
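
Infinite values can be detected in a similar element-wise fashion. The sketch below assumes the helper functions `np.isinf` and `np.isfinite`, which are not covered in the text above.

```python
import numpy as np

arr = np.array([np.inf, -np.inf, np.nan, 5])

print(np.isinf(arr).tolist())      # [True, True, False, False]
print(np.isfinite(arr).tolist())   # [False, False, False, True]
```

Note that `np.isfinite` is `False` for both infinities and NaN, which makes it a convenient single check for "ordinary" numbers.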

## NumPy Basics

Perform basic operations to create and modify NumPy arrays.

### Goals:

- Explore some basic NumPy operations
- Write code using the basic NumPy functions

### A. Ranged data

While  `np.array`  can be used to create any array, it is equivalent to hardcoding an array. This won't work when the array has hundreds of values. Instead, NumPy provides an option to create ranged data arrays using  `np.arange` . The function acts very similar to the  `range`  function in Python, and will always return a 1-D array.

The code below contains example usages of  `np.arange`.

In [18]:
arr = np.arange(5)
arr

array([0, 1, 2, 3, 4])

In [19]:
arr = np.arange(5.1)
arr

array([0., 1., 2., 3., 4., 5.])

In [20]:
arr = np.arange(-1, 4)
arr

array([-1,  0,  1,  2,  3])

In [21]:
arr = np.arange(-1.5, 4, 2)
arr

array([-1.5,  0.5,  2.5])

The output of  `np.arange`  is specified as follows:

* If only a single number, *n*, is passed in as an argument, `np.arange` will return an array with all the integers in the range [0, *n*). Note that the lower end is inclusive while the upper end is exclusive.
* For two arguments, *m* and *n*, `np.arange` will return an array with all the integers in the range [*m*, *n*).
* For three arguments, *m*, *n*, and *s*, `np.arange` will return an array with the values in the range [*m*, *n*) using a step size of *s*.
* Like  `np.array` ,  `np.arange`  performs upcasting. It also has the  `dtype`  keyword argument to manually cast the array.

To specify the number of elements in the returned array, rather than the step size, we can use the  `np.linspace`  function.

This function takes in a required first two arguments, for the start and end of the range, respectively. The end of the range is inclusive for  `np.linspace` , unless the keyword argument  `endpoint`  is set to  `False` . To specify the number of elements, we set the  `num`  keyword argument (its default value is  `50` ).

The code below shows example usages of `np.linspace`. Like `np.arange`, the function accepts the `dtype` keyword argument for manual casting.

In [22]:
arr = np.linspace(5, 11, num=4)
arr

array([ 5.,  7.,  9., 11.])

In [23]:
arr = np.linspace(5, 11, num=4, endpoint=False)
arr

array([5. , 6.5, 8. , 9.5])

In [24]:
arr = np.linspace(5, 11, num=4, dtype=np.int32)
arr

array([ 5,  7,  9, 11])
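
To see how the two functions relate, the sketch below builds the same array twice: once by step size with `np.arange` and once by element count with `np.linspace`. The range and step are arbitrary values picked so that the results are exact in floating point.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)                       # step size of 0.25
by_count = np.linspace(0, 1, num=4, endpoint=False)   # 4 evenly spaced values

print(by_step.tolist())    # [0.0, 0.25, 0.5, 0.75]
print(by_count.tolist())   # [0.0, 0.25, 0.5, 0.75]
```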

### B. Reshaping data

The function we use to reshape data in NumPy is  `np.reshape` . It takes in an array and a new shape as required arguments. The new shape must exactly contain all the elements from the input array. For example, we could reshape an array with 12 elements to  `(4, 3)` , but we can't reshape it to  `(4, 4)` .

We are allowed to use the special value of -1 in at most one dimension of the new shape. The dimension with -1 will take on the value necessary to allow the new shape to contain all the elements of the array.

The code below shows example usages of  `np.reshape` .

In [25]:
arr = np.arange(8)

reshaped_arr = np.reshape(arr, (2, 4))
reshaped_arr

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [26]:
print(f'New shape: {reshaped_arr.shape}')

New shape: (2, 4)


In [27]:
reshaped_arr = np.reshape(arr, (-1, 2, 2))
reshaped_arr

array([[[0, 1],
        [2, 3]],

       [[4, 5],
        [6, 7]]])

In [28]:
print(f'New shape: {reshaped_arr.shape}')

New shape: (2, 2, 2)
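
The 12-element example from the text can be checked directly. This sketch shows the -1 dimension being inferred, and the error raised when the new shape cannot hold every element.

```python
import numpy as np

arr = np.arange(12)

# -1 is resolved to 4, because 3 * 4 == 12
print(np.reshape(arr, (3, -1)).shape)   # (3, 4)

# 12 elements cannot be split evenly into 5 rows, so NumPy raises a ValueError
try:
    np.reshape(arr, (5, -1))
except ValueError as err:
    print(f'ValueError: {err}')
```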


While the  `np.reshape`  function can perform any reshaping utilities we need, NumPy provides an inherent function for flattening an array. Flattening an array reshapes it into a 1D array. Since we need to flatten data quite often, it is a useful function.

The code below flattens an array using the inherent  `flatten`  function.

In [29]:
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
flattened = arr.flatten()
arr

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [30]:
print(f'arr shape: {arr.shape}')

arr shape: (2, 4)


In [31]:
flattened

array([0, 1, 2, 3, 4, 5, 6, 7])

In [32]:
print(f'flattened shape: {flattened.shape}')

flattened shape: (8,)
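
As the output above suggests, `flatten` leaves the original array untouched. A short sketch confirming that the result is an independent copy (the value 100 is arbitrary):

```python
import numpy as np

arr = np.reshape(np.arange(8), (2, 4))
flattened = arr.flatten()

flattened[0] = 100        # only the flattened copy changes
print(arr[0, 0])          # 0
print(np.shares_memory(arr, flattened))  # False
```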


### C. Transposing

Similar to how it is common to reshape data, it is also common to transpose data. Perhaps we have data that's supposed to be in a particular format, but some new data we get is rearranged. We can just transpose the data, using the  `np.transpose`  function, to convert it to the proper format.

The code below shows an example usage of the  `np.transpose`  function. The matrix rows become columns after the transpose.

In [33]:
arr = np.arange(8)
arr = np.reshape(arr, (4, 2))
transposed = np.transpose(arr)
display(arr)
print('arr shape: {}'.format(arr.shape))
display(transposed)
print('transposed shape: {}'.format(transposed.shape))

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])

arr shape: (4, 2)


array([[0, 2, 4, 6],
       [1, 3, 5, 7]])

transposed shape: (2, 4)


The function takes in a required first argument, which will be the array we want to transpose. It also has a single keyword argument called  `axes` , which represents the new  *permutation*  of the dimensions.

The permutation is a tuple/list of integers, with the same length as the number of dimensions in the array. It tells us where to switch up the dimensions. Each integer is the zero-based index of an old dimension. For example, if the permutation had 2 at index 1, it means the old third dimension of the data becomes the new second dimension (since index 1 represents the second dimension).

The code below shows an example usage of the  `np.transpose`  function with the  `axes`  keyword argument. The  `shape`  property gives us the shape of an array.

In [34]:
arr = np.arange(24).reshape((3, 4, 2))
print('arr shape: {}'.format(arr.shape))
transposed = np.transpose(arr, axes=(1, 2, 0))
print('transposed shape: {}'.format(transposed.shape))

arr shape: (3, 4, 2)
transposed shape: (4, 2, 3)


In this example, the old first dimension became the new third dimension, the old second dimension became the new first dimension, and the old third dimension became the new second dimension. The default value for  `axes`  is a dimension reversal (e.g. for 3-D data the default  `axes`  value is  `[2, 1, 0]` ).
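
One way to convince ourselves of this mapping is to follow a single element through the transpose. The sketch below reuses the same array and `axes` value. The `.T` attribute it uses is not introduced above, but it is the standard shorthand for the default transpose.

```python
import numpy as np

arr = np.arange(24).reshape((3, 4, 2))
transposed = np.transpose(arr, axes=(1, 2, 0))

# Element (i, j, k) of arr moves to position (j, k, i) in transposed
print(arr[2, 3, 1] == transposed[3, 1, 2])   # True

# With no axes argument the dimensions are reversed, same as the .T attribute
print(np.transpose(arr).shape)                    # (2, 4, 3)
print(np.array_equal(np.transpose(arr), arr.T))   # True
```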

### D. Zeros and ones

Sometimes, we need to create arrays filled solely with 0 or 1. For example, since binary data is labeled with 0 and 1, we may need to create dummy datasets of strictly one label. For creating these arrays, NumPy provides the functions `np.zeros` and `np.ones`. They both take in the same arguments, which include just one required argument, the array shape. The functions also allow for manual casting using the `dtype` keyword argument.

The code below shows example usages of  `np.zeros`  and  `np.ones` .

In [35]:
arr = np.zeros(4)
arr

array([0., 0., 0., 0.])

In [36]:
arr = np.ones((2, 3))
arr

array([[1., 1., 1.],
       [1., 1., 1.]])

In [37]:
arr = np.ones((2, 3), dtype=np.int32)
arr

array([[1, 1, 1],
       [1, 1, 1]])

If we want to create an array of 0's or 1's with the same shape as another array, we can use  `np.zeros_like`  and  `np.ones_like` .

The code below shows example usages of  `np.zeros_like`  and  `np.ones_like` .

In [38]:
arr = np.array([[1, 2], [3, 4]])
np.zeros_like(arr)

array([[0, 0],
       [0, 0]])

In [39]:
arr = np.array([[0., 1.], [1.2, 4.]])
np.ones_like(arr)

array([[1., 1.],
       [1., 1.]])

In [40]:
np.ones_like(arr, dtype=np.int32)

array([[1, 1],
       [1, 1]])
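
As a sketch of the dummy-label use case mentioned at the start of this section, we can build a small binary label vector from the two functions. The class sizes are made up, and `np.concatenate` is an extra function not covered here. It joins arrays end to end.

```python
import numpy as np

negatives = np.zeros(3, dtype=np.int32)   # three observations labeled 0
positives = np.ones(2, dtype=np.int32)    # two observations labeled 1

labels = np.concatenate([negatives, positives])
print(labels.tolist())   # [0, 0, 0, 1, 1]
```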

## Math

Understand how arithmetic and linear algebra work in NumPy.

### Goals:

- How to perform math operations in NumPy
- Write code using NumPy math functions

### A. Arithmetic

One of the main purposes of NumPy is to perform multi-dimensional arithmetic. Using NumPy arrays, we can apply arithmetic to each element with a single operation.

The code below shows multi-dimensional arithmetic with NumPy.

In [41]:
arr = np.array([[1, 2], [3, 4]])
# Add 1 to element values
arr + 1

array([[2, 3],
       [4, 5]])

In [42]:
# Subtract 1.2 from element values
arr - 1.2

array([[-0.2,  0.8],
       [ 1.8,  2.8]])

In [43]:
# Double element values
arr * 2

array([[2, 4],
       [6, 8]])

In [44]:
# Halve element values
arr / 2

array([[0.5, 1. ],
       [1.5, 2. ]])

In [45]:
# Integer division (half)
arr // 2

array([[0, 1],
       [1, 2]], dtype=int32)

In [46]:
# Square element values
arr**2

array([[ 1,  4],
       [ 9, 16]], dtype=int32)

In [47]:
# Square root element values
print(repr(arr**0.5))

array([[1.        , 1.41421356],
       [1.73205081, 2.        ]])


Using NumPy arithmetic, we can easily modify large amounts of numeric data with only a few operations. For example, we could convert a dataset of Fahrenheit temperatures to their equivalent Celsius form.

The code below converts Fahrenheit to Celsius in NumPy.

In [48]:
def f2c(temps):
    return (5/9)*(temps-32)

fahrenheits = np.array([32, -4, 14, -40])
celsius = f2c(fahrenheits)
print(f'Celsius: {repr(celsius)}')

Celsius: array([  0., -20., -10., -40.])


It is important to note that performing arithmetic on NumPy arrays  **does not change the original array** , and instead produces a new array that is the result of the arithmetic operation.
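
The sketch below illustrates this with two same-shaped arrays. It also shows the one common exception: augmented assignment operators such as `*=` do modify the array in place. The values in `other` are arbitrary.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
other = np.array([[10, 20], [30, 40]])

total = arr + other       # element-wise sum of two same-shaped arrays
print(total.tolist())     # [[11, 22], [33, 44]]
print(arr.tolist())       # [[1, 2], [3, 4]]  (unchanged)

arr *= 2                  # augmented assignment modifies arr in place
print(arr.tolist())       # [[2, 4], [6, 8]]
```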

### B. Non-linear functions

Apart from basic arithmetic operations, NumPy also allows you to use non-linear functions such as exponentials and logarithms.

The function  `np.exp`  performs a base  *e*  exponential on an array, while the function  `np.exp2`  performs a base 2 exponential. Likewise,  `np.log` ,  `np.log2` , and  `np.log10`  all perform logarithms on an input array, using base  *e* , base 2, and base 10, respectively.

The code below shows various exponentials and logarithms with NumPy. Note that  `np.e`  and  `np.pi`  represent the mathematical constants  $e$  and $\pi$, respectively.

In [49]:
arr = np.array([[1, 2], [3, 4]])
# Raise e to the power of each element
np.exp(arr)

array([[ 2.71828183,  7.3890561 ],
       [20.08553692, 54.59815003]])

In [50]:
# Raise 2 to the power of each element
np.exp2(arr)

array([[ 2.,  4.],
       [ 8., 16.]])

In [51]:
arr2 = np.array([[1, 10], [np.e, np.pi]])
# Natural logarithm
np.log(arr2)

array([[0.        , 2.30258509],
       [1.        , 1.14472989]])

In [52]:
# Base 10 logarithm
np.log10(arr2)

array([[0.        , 1.        ],
       [0.43429448, 0.49714987]])

To do a regular power operation with any base, we use  `np.power` . The first argument to the function is the base, while the second is the power. If the base or power is an array rather than a single number, the operation is applied to every element in the array.

The code below shows examples of using  `np.power` .

In [53]:
arr = np.array([[1, 2], [3, 4]])
# Raise 3 to power of each number in arr
np.power(3, arr)

array([[ 3,  9],
       [27, 81]], dtype=int32)

In [54]:
arr2 = np.array([[10.2, 4], [3, 5]])
# Raise arr2 to power of each number in arr
np.power(arr2, arr)

array([[ 10.2,  16. ],
       [ 27. , 625. ]])

In addition to exponentials and logarithms, NumPy has various other mathematical functions, which are listed [here](https://docs.scipy.org/doc/numpy/reference/routines.math.html).
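
A quick sanity check on these functions is that each logarithm undoes the matching exponential. The sketch below uses `np.allclose` and `np.array_equal` for the comparisons. Both are comparison helpers that the text above does not cover.

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])

# log undoes exp, up to floating-point rounding
print(np.allclose(np.log(np.exp(arr)), arr))     # True
print(np.allclose(np.log2(np.exp2(arr)), arr))   # True

# np.power with an array base matches the ** operator
print(np.array_equal(np.power(arr, 2), arr**2))  # True
```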

### C. Matrix multiplication

Since NumPy arrays are basically vectors and matrices, it makes sense that there are functions for dot products and matrix multiplication. Specifically, the main function to use is  `np.matmul` , which takes two vector/matrix arrays as input and produces a dot product or matrix multiplication.

The code below shows various examples of matrix multiplication. When both inputs are 1-D, the output is the dot product.

Note that the dimensions of the two input matrices must be valid for a matrix multiplication. Specifically, the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise  `np.matmul`  will result in a  `ValueError` .

In [55]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([-3, 0, 10])
np.matmul(arr1, arr2)

27

In [56]:
arr3 = np.array([[1, 2], [3, 4], [5, 6]])
arr4 = np.array([[-1, 0, 1], [3, 2, -4]])
np.matmul(arr3, arr4)

array([[  5,   4,  -7],
       [  9,   8, -13],
       [ 13,  12, -19]])

In [57]:
np.matmul(arr4, arr3)

array([[  4,   4],
       [-11, -10]])

In [58]:
# This will result in ValueError
np.matmul(arr3, arr3)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 2)
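
Python also offers the `@` operator as shorthand for matrix multiplication on NumPy arrays. The text above does not use the operator, so the following is a supplementary sketch with the same matrices.

```python
import numpy as np

arr3 = np.array([[1, 2], [3, 4], [5, 6]])       # shape (3, 2)
arr4 = np.array([[-1, 0, 1], [3, 2, -4]])       # shape (2, 3)

product = arr3 @ arr4   # same result as np.matmul(arr3, arr4)
print(product.shape)    # (3, 3)
print(np.array_equal(product, np.matmul(arr3, arr4)))   # True
```

The inner dimensions (2 and 2) match, and the result takes the outer dimensions (3 and 3).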

## Random

Generate numbers and arrays from different random distributions.

### Goals:

- Explore random operations in NumPy
- Write code using the np.random submodule

### A. Random integers

Similar to the Python `random` module, NumPy has its own submodule for pseudo-random number generation called `np.random`. It provides all the necessary randomized operations and extends them to multi-dimensional arrays. To generate pseudo-random integers, we use the `np.random.randint` function.

The code below shows example usages of  `np.random.randint` .

In [59]:
np.random.randint(5)

3

In [60]:
np.random.randint(5)

0

In [61]:
np.random.randint(5, high=6)

5

In [62]:
random_arr = np.random.randint(-3, high=14,
                               size=(2, 2))
random_arr

array([[ 7, 13],
       [ 8,  3]])

The  `np.random.randint`  function takes in a single required argument, which actually depends on the  `high`  keyword argument. If  `high=None`  (which is the default value), then the required argument represents the upper (exclusive) end of the range, with the lower end being 0. Specifically, if the required argument is  *n* , then the random integer is chosen uniformly from the range [0,  *n* ).

If  `high`  is not  `None` , then the required argument will represent the lower (inclusive) end of the range, while  `high`  represents the upper (exclusive) end.

The  `size`  keyword argument specifies the size of the output array, where each integer in the array is randomly drawn from the specified range. As a default,  `np.random.randint`  returns a single integer.
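
The inclusive lower end and exclusive upper end can be checked empirically by drawing many samples. In this sketch the sample count of 1000 and the seed are arbitrary choices.

```python
import numpy as np

np.random.seed(0)
samples = np.random.randint(-3, high=14, size=1000)

print(samples.shape)         # (1000,)
print(samples.min() >= -3)   # True (the lower end is inclusive)
print(samples.max() < 14)    # True (the upper end is exclusive)
```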

### B. Utility functions

Some fundamental utility functions from the  `np.random`  module are  `np.random.seed`  and  `np.random.shuffle` . We use the  `np.random.seed`  function to set the [random seed](https://en.wikipedia.org/wiki/Random_seed), which allows us to control the outputs of the pseudo-random functions. The function takes in a single integer as an argument, representing the random seed.

The code below uses  `np.random.seed`  with the same random seed. Note how the outputs of the random functions in each subsequent run are identical when we set the same random seed.

In [63]:
np.random.seed(1)
np.random.randint(10)

5

In [64]:
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
random_arr

array([[15, 75],
       [12, 78]])

In [65]:
# New seed
np.random.seed(2)
np.random.randint(10)

8

In [66]:
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
random_arr

array([[18, 75],
       [25, 46]])

In [67]:
# Original seed
np.random.seed(1)
np.random.randint(10)

5

In [68]:
random_arr = np.random.randint(3, high=100,
                               size=(2, 2))
random_arr

array([[15, 75],
       [12, 78]])

The  `np.random.shuffle`  function allows us to randomly shuffle an array. Note that the shuffling happens in place (i.e. no return value), and shuffling multi-dimensional arrays only shuffles the first dimension.

The code below shows example usages of  `np.random.shuffle` . Note that only the rows of  `matrix`  are shuffled (i.e. shuffling along first dimension only).

In [69]:
vec = np.array([1, 2, 3, 4, 5])
np.random.shuffle(vec)
vec

array([3, 4, 2, 5, 1])

In [70]:
np.random.shuffle(vec)
vec

array([5, 3, 4, 2, 1])

In [71]:
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
np.random.shuffle(matrix)
matrix

array([[4, 5, 6],
       [7, 8, 9],
       [1, 2, 3]])
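
If the original order needs to be preserved, one option is `np.random.permutation`, which returns a shuffled copy instead of shuffling in place. It is not covered above, so take the following as a hedged sketch of the difference.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

result = np.random.shuffle(matrix)        # shuffles the rows in place
print(result)                             # None

shuffled = np.random.permutation(matrix)  # returns a shuffled copy instead
# Each row is still intact; only the row order may differ
print(sorted(shuffled.tolist()))          # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```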

### C. Distributions

Using  `np.random`  we can also draw samples from probability distributions. For example, we can use  `np.random.uniform`  to draw pseudo-random real numbers from a [uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)).

The code below shows usages of  `np.random.uniform` .

In [72]:
np.random.uniform()

0.3132735169322751

In [73]:
np.random.uniform(low=-1.5, high=2.2)

0.4408281904196243

In [74]:
np.random.uniform(size=3)

array([0.44345289, 0.22957721, 0.53441391])

In [75]:
np.random.uniform(low=-3.4, high=5.9,
                             size=(2, 2))

array([[5.09984683, 0.85200471],
       [0.60549667, 5.33388844]])

The function  `np.random.uniform`  actually has no required arguments. The keyword arguments,  `low`  and  `high` , represent the inclusive lower end and exclusive upper end from which to draw random samples. Since they have default values of 0.0 and 1.0, respectively, the default outputs of  `np.random.uniform`  come from the range [0.0, 1.0).

The  `size`  keyword argument is the same as the one for  `np.random.randint` , i.e. it represents the output size of the array.

Another popular distribution we can sample from is the [normal (Gaussian) distribution](https://en.wikipedia.org/wiki/Normal_distribution). The function we use is  `np.random.normal` .

The code below shows usages of  `np.random.normal` .

In [76]:
np.random.normal()

0.7252740646272712

In [77]:
np.random.normal(loc=1.5, scale=3.5)

4.772112039383628

In [78]:
np.random.normal(loc=-2.4, scale=4.0,
                            size=(2, 2))

array([[ 2.07318791, -2.17754724],
       [-0.89337346, -0.89545991]])

Like  `np.random.uniform` ,  `np.random.normal`  has no required arguments. The  `loc`  and  `scale`  keyword arguments represent the mean and standard deviation, respectively, of the normal distribution we sample from.

NumPy provides quite a few more built-in distributions, which are listed [here](https://docs.scipy.org/doc/numpy-1.14.1/reference/routines.random.html).
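
To connect the keyword arguments to the samples they produce, we can draw a large sample and compare its statistics with the requested parameters. This is a rough empirical check only. The sample size and seed are arbitrary, and the printed values will be close to the parameters but not exactly equal.

```python
import numpy as np

np.random.seed(0)

# Sample mean and standard deviation should be near loc and scale
samples = np.random.normal(loc=1.5, scale=3.5, size=100_000)
print(samples.mean())   # close to 1.5
print(samples.std())    # close to 3.5

# Uniform samples stay inside [low, high)
uniform = np.random.uniform(low=-1.5, high=2.2, size=100_000)
print(uniform.min() >= -1.5, uniform.max() < 2.2)   # True True
```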

### D. Custom sampling

While NumPy provides built-in distributions to sample from, we can also sample from a custom distribution with the  `np.random.choice`  function.

The code below shows example usages of  `np.random.choice` .

In [79]:
colors = ['red', 'blue', 'green']
np.random.choice(colors)

'green'

In [80]:
np.random.choice(colors, size=2)

array(['blue', 'red'], dtype='<U5')

In [81]:
np.random.choice(colors, size=(2, 2),
                 p=[0.8, 0.19, 0.01])

array([['red', 'red'],
       ['blue', 'red']], dtype='<U5')

The required argument for  `np.random.choice`  is the custom distribution we sample from. The  `p`  keyword argument denotes the probabilities given to each element in the input distribution. Note that the list of probabilities for  `p`  must sum to 1.

In the example, we set  `p`  such that  `'red'`  has a probability of 0.8 of being chosen,  `'blue'`  has a probability of 0.19, and  `'green'`  has a probability of 0.01. When  `p`  is not set, the probabilities are equal for each element in the distribution (and sum to 1).
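The effect of `p` is easier to see over many draws. The sketch below counts how often each color appears in 10,000 draws (an arbitrary count) and converts the counts to frequencies, which should land near the probabilities we supplied.

```python
import numpy as np

np.random.seed(0)  # fixed seed (arbitrary) so the run is repeatable

colors = ['red', 'blue', 'green']
draws = np.random.choice(colors, size=10_000, p=[0.8, 0.19, 0.01])

# Count occurrences of each distinct color, then convert to frequencies
values, counts = np.unique(draws, return_counts=True)
freqs = {str(v): c / draws.size for v, c in zip(values, counts)}
print(freqs)
```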

## Indexing

Index into NumPy arrays to extract data and array slices.

### Goals:

* Explore indexing arrays in NumPy
* Write code for indexing and slicing arrays

### A. Array accessing

Accessing NumPy arrays is identical to accessing Python lists. For multi-dimensional arrays, it is equivalent to accessing Python lists of lists.

The code below shows example accesses of NumPy arrays.

In [82]:
arr = np.array([1, 2, 3, 4, 5])
arr[0]

1

In [83]:
arr[4]

5

In [84]:
arr = np.array([[6, 3], [0, 2]])
# Subarray
arr[0]

array([6, 3])
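To reach a single element of a 2-D array, we can chain brackets as we would with a list of lists. NumPy also accepts one index per dimension inside a single pair of brackets. A minimal sketch comparing the two forms:

```python
import numpy as np

arr = np.array([[6, 3], [0, 2]])

# List-of-lists style: take row 1, then element 0 of that row
chained = arr[1][0]
# NumPy style: one index per dimension, separated by a comma
direct = arr[1, 0]
print(chained, direct)  # 0 0
```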

### B. Slicing

NumPy arrays also support slicing. Similar to Python, we use the colon operator (i.e.  `arr[:]` ) for slicing. We can also use negative indexing to slice in the backwards direction.

The code below shows example slices of a 1-D NumPy array.

In [85]:
arr = np.array([1, 2, 3, 4, 5])
arr[:]

array([1, 2, 3, 4, 5])

In [86]:
arr[1:]

array([2, 3, 4, 5])

In [87]:
arr[2:4]

array([3, 4])

In [88]:
arr[:-1]

array([1, 2, 3, 4])

In [89]:
arr[-2:]

array([4, 5])

For multi-dimensional arrays, we can use a comma to separate slices across each dimension.

The code below shows example slices of a 2-D NumPy array.

In [90]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
arr[:]

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [91]:
arr[1:]

array([[4, 5, 6],
       [7, 8, 9]])

In [92]:
arr[:, -1]

array([3, 6, 9])

In [93]:
arr[:, 1:]

array([[2, 3],
       [5, 6],
       [8, 9]])

In [94]:
arr[0:1, 1:]

array([[2, 3]])

In [95]:
arr[0, 1:]

array([2, 3])

### C. Argmin and argmax

In addition to accessing and slicing arrays, it is useful to figure out the actual indexes of the minimum and maximum elements. To do this, we use the  `np.argmin`  and  `np.argmax`  functions.

The code below shows example usages of  `np.argmin`  and  `np.argmax` . Note that the index of element  `-6`  is index  `5`  in the flattened version of  `arr` .

In [96]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
np.argmin(arr[0])

2

In [97]:
np.argmax(arr[2])

1

In [98]:
np.argmin(arr)

5

The  `np.argmin`  and  `np.argmax`  functions take the same arguments. The required argument is the input array and the  `axis`  keyword argument specifies which dimension to apply the operation on.

The code below shows how the  `axis`  keyword argument is used for these functions.

In [99]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])
np.argmin(arr, axis=0)

array([2, 0, 1], dtype=int64)

In [100]:
np.argmin(arr, axis=1)

array([2, 2, 0], dtype=int64)

In [101]:
np.argmax(arr, axis=-1)

array([1, 1, 1], dtype=int64)

In our example, using  `axis=0`  meant the function found the index of the minimum  *row*  element for each column. When we used  `axis=1` , the function found the index of the minimum  *column*  element for each row.

Setting  `axis`  to -1 just means we apply the function across the last dimension. In this case,  `axis=-1`  is equivalent to  `axis=1` .
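When `np.argmin` or `np.argmax` is called without `axis`, the result is an index into the flattened array. If we need the row and column instead, one option is `np.unravel_index`, which converts a flat index into per-dimension indices for a given shape. A short sketch using the same array as above:

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [-3, 9, 1]])

# Index of the minimum in the flattened array
flat_index = np.argmin(arr)
# Convert the flat index into (row, column) for the array's shape
row, col = np.unravel_index(flat_index, arr.shape)
print(flat_index, row, col)  # 5 1 2
print(arr[row, col])         # -6
```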

## Filtering

Filter NumPy data for specific values.

### Goals:

* Explore how to filter data in NumPy
* Write code for filtering NumPy arrays

### A. Filtering data

Sometimes we have data that contains values we don't want to use. For example, when tracking the best hitters in baseball, we may want to only use the batting average data above .300. In this case, we should  *filter*  the overall data for only the values that we want.

The key to filtering data is through basic relation operations, e.g.  `==` ,  `>` , etc. In NumPy, we can apply basic relation operations element-wise on arrays.

The code below shows relation operations on NumPy arrays. The  `~`  operation represents a boolean negation, i.e. it flips each truth value in the array.

In [102]:
arr = np.array([[0, 2, 3],
                [1, 3, -6],
                [-3, -2, 1]])
arr == 3

array([[False, False,  True],
       [False,  True, False],
       [False, False, False]])

In [103]:
arr > 0

array([[False,  True,  True],
       [ True,  True, False],
       [False, False,  True]])

In [104]:
arr != 1

array([[ True,  True,  True],
       [False,  True,  True],
       [ True,  True, False]])

In [105]:
# Negated from the previous step
~(arr != 1)

array([[False, False, False],
       [ True, False, False],
       [False, False,  True]])

Something to note is that `np.nan` can't be located with relation operations: an equality comparison with `np.nan` always evaluates to `False`, even `np.nan == np.nan`. Instead, we use `np.isnan` to filter for the location of `np.nan`.

The code below uses  `np.isnan`  to determine which locations of the array contain  `np.nan`  values.

In [106]:
arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6],
                [np.nan, -2, 1]])
np.isnan(arr)

array([[False, False,  True],
       [False,  True, False],
       [ True, False, False]])
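The sketch below illustrates why `==` is not useful for finding `np.nan`, using a small 1-D array chosen for illustration.

```python
import numpy as np

arr = np.array([0.0, np.nan, 2.0])

# NaN is not equal to anything, including itself
print(np.nan == np.nan)   # False

eq_mask = arr == np.nan   # never matches, even where the value is NaN
nan_mask = np.isnan(arr)  # correctly flags the NaN location
print(eq_mask.tolist())   # [False, False, False]
print(nan_mask.tolist())  # [False, True, False]
```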

Each boolean array in our examples represents the location of elements we want to filter for. The way we perform the filtering itself is through the  `np.where`  function.

### B. Filtering in NumPy

The  `np.where`  function takes in a required first argument, which is a boolean array where  `True`  represents the locations of the elements we want to filter for. When the function is applied with only the first argument, it returns a tuple of 1-D arrays.

The tuple will have size equal to the number of dimensions in the data, and each array represents the  `True`  indices for the corresponding dimension. Note that the arrays in the tuple will all have the same length, equal to the number of  `True`  elements in the input argument.

The code below shows how to use  `np.where`  with a single argument.

In [107]:
np.where([True, False, True])

(array([0, 2], dtype=int64),)

In [108]:
arr = np.array([0, 3, 5, 3, 1])
np.where(arr == 3)

(array([1, 3], dtype=int64),)

In [109]:
arr = np.array([[0, 2, 3],
                [1, 0, 0],
                [-3, 0, 0]])
x_ind, y_ind = np.where(arr != 0)
x_ind # x indices of non-zero elements

array([0, 0, 1, 2], dtype=int64)

In [110]:
y_ind # y indices of non-zero elements

array([1, 2, 0, 0], dtype=int64)

In [111]:
arr[x_ind, y_ind]

array([ 2,  3,  1, -3])

In [112]:
np.where(arr != 0)

(array([0, 0, 1, 2], dtype=int64), array([1, 2, 0, 0], dtype=int64))
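As a side note, a boolean array can also be used directly as an index. For extracting values, this gives the same result as indexing with the tuple returned by `np.where`. The sketch below compares the two approaches on the same array.

```python
import numpy as np

arr = np.array([[0, 2, 3],
                [1, 0, 0],
                [-3, 0, 0]])

mask = arr != 0
via_where = arr[np.where(mask)]  # index with the tuple of index arrays
via_mask = arr[mask]             # index with the boolean array itself
print(via_where.tolist())  # [2, 3, 1, -3]
print(via_mask.tolist())   # [2, 3, 1, -3]
```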

The interesting thing about  `np.where`  is that it must be applied with exactly 1 or 3 arguments. When we use 3 arguments, the first argument is still the boolean array. However, the next two arguments represent the  `True`  replacement values and the  `False`  replacement values, respectively. The output of the function now becomes an array with the same shape as the first argument.

The code below shows how to use  `np.where`  with 3 arguments.

In [113]:
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
negatives = np.array([[-2, -5], [-1, -8]])
np.where(np_filter, positives, negatives)

array([[ 1, -5],
       [-1,  4]])

In [114]:
np_filter = positives > 2
np.where(np_filter, positives, negatives)

array([[-2, -5],
       [ 3,  4]])

In [115]:
np_filter = negatives > 0
np.where(np_filter, positives, negatives)

array([[-2, -5],
       [-1, -8]])

Note that our second and third arguments necessarily had the same shape as the first argument. However, if we wanted to use a constant replacement value, e.g.  `-1` , we could incorporate [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html). Rather than using an entire array of the same value, we can just use the value itself as an argument.

The code below showcases broadcasting with  `np.where` .

In [116]:
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
np.where(np_filter, positives, -1)

array([[ 1, -1],
       [-1,  4]])
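A common use of this pattern is replacing unwanted values while leaving the rest of the array untouched. As an illustrative sketch, the code below replaces every negative value with the constant `0` and keeps the original value everywhere else.

```python
import numpy as np

arr = np.array([[0, 2, 3],
                [1, 3, -6],
                [-3, -2, 1]])

# Where arr < 0 use the constant 0 (broadcast), otherwise keep arr's value
clipped = np.where(arr < 0, 0, arr)
print(clipped.tolist())  # [[0, 2, 3], [1, 3, 0], [0, 0, 1]]
```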

### C. Axis-wise filtering

If we wanted to filter based on rows or columns of data, we could use the  `np.any`  and  `np.all`  functions. Both functions take in the same arguments, and return a single boolean or a boolean array. The required argument for both functions is a boolean array.

The code below shows usage of  `np.any`  and  `np.all`  with a single argument.

In [117]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
arr > 0

array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True]])

In [118]:
np.any(arr > 0)

True

In [119]:
np.all(arr > 0)

False

The `np.any` function is equivalent to performing a logical OR (`or`) across the elements of the first argument, while the `np.all` function is equivalent to a logical AND (`and`). In other words, `np.any` returns `True` if at least one element in the array meets the condition, and `np.all` returns `True` only if all the elements meet the condition. When only a single argument is passed in, the function is applied across the entire input array, so the returned value is a single boolean.

However, if we use a multi-dimensional input and specify the `axis` keyword argument, the returned value will be an array. The `axis` argument has the same meaning as it did for `np.argmin` and `np.argmax` from the previous chapter. Using `axis=0` means the function is applied down the rows of each column, returning one boolean per column. Using `axis=1` means the function is applied across the columns of each row, returning one boolean per row.

Setting  `axis`  to -1 just means we apply the function across the last dimension.

The code below shows examples of using  `np.any`  and  `np.all`  with the  `axis`  keyword argument.

In [120]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
arr > 0

array([[False, False, False],
       [ True,  True, False],
       [ True,  True,  True]])

In [121]:
np.any(arr > 0, axis=0)

array([ True,  True,  True])

In [122]:
np.any(arr > 0, axis=1)

array([False,  True,  True])

In [123]:
np.all(arr > 0, axis=1)

array([False, False,  True])

We can use  `np.any`  and  `np.all`  in tandem with  `np.where`  to filter for entire rows or columns of data.

In the code example below, we use  `np.any`  to obtain a boolean array representing the rows that have at least one positive number. We then use the boolean array as the input to  `np.where` , which gives us the actual indices of the rows with at least one positive number.

![widget](https://www.educative.io/api/collection/6083138522447872/5629499534213120/page/5728116278296576/image/6652560851075072.png)

In [124]:
arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
has_positive = np.any(arr > 0, axis=1)
has_positive

array([False,  True,  True])

In [125]:
arr[np.where(has_positive)]

array([[ 4,  5, -6],
       [ 3,  9,  1]])
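The same idea works for columns. The sketch below, using a small made-up array, combines `np.isnan` with `np.any` along `axis=0` to find columns that contain a missing value, then keeps only the remaining columns. Here the boolean array is used directly as a column index rather than going through `np.where`.

```python
import numpy as np

# Hypothetical data with one missing value in the middle column
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, 6.0]])

# One boolean per column: True if any row in that column is NaN
has_nan = np.any(np.isnan(data), axis=0)
# Keep every row, but only the columns without NaN
clean = data[:, ~has_nan]
print(has_nan.tolist())  # [False, True, False]
print(clean.tolist())    # [[1.0, 3.0], [4.0, 6.0]]
```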

## Statistics

Apply statistical metrics to NumPy data.

### Goals:

* Explore basic statistical analysis in NumPy
* Write code to obtain statistics for NumPy arrays

### A. Analysis

It is often useful to analyze data for its main characteristics and interesting trends.

For example, we can obtain minimum and maximum values of a NumPy array using its built-in `min` and `max` methods. This gives us an initial sense of the data's range, and can alert us to extreme outliers in the data.

The code below shows example usages of the  `min`  and  `max`  functions.

In [126]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
arr.min()

-60

In [127]:
arr.max()

72

In [128]:
arr.min(axis=0)

array([ -3,  -2, -60])

In [129]:
arr.max(axis=-1)

array([72,  3,  4])

The `axis` keyword argument is identical to how it was used in `np.argmin` and `np.argmax` from the chapter on Indexing. In our example, we use `axis=0` to find an array of the minimum values in each column of `arr` and `axis=-1` (equivalent to `axis=1` for a 2-D array) to find an array of the maximum values in each row of `arr`.

### B. Statistical metrics

NumPy also provides basic statistical functions such as  `np.mean` ,  `np.var` , and  `np.median` , to calculate the mean, variance, and median of the data, respectively.

The code below shows how to obtain basic statistics with NumPy. Note that  `np.median`  applied without  `axis`  takes the median of the flattened array.

In [130]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
np.mean(arr)

2.0

In [131]:
np.var(arr)

977.3333333333334

In [132]:
np.median(arr)

1.0

In [133]:
np.median(arr, axis=-1)

array([ 3.,  1., -2.])

Each of these functions takes in the data array as a required argument and  `axis`  as a keyword argument. For a more comprehensive list of statistical functions (e.g. calculating percentiles, creating histograms, etc.), check out the NumPy [statistics page](https://docs.scipy.org/doc/numpy/reference/routines.statistics.html).
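As a brief sketch of two related functions, `np.percentile` computes percentiles of the data (the 50th percentile matches the median), and `np.std` returns the standard deviation, which is the square root of the variance. The choice of the 25th, 50th, and 75th percentiles is just for illustration.

```python
import numpy as np

arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])

# 25th, 50th, and 75th percentiles of the flattened array
quartiles = np.percentile(arr, [25, 50, 75])
print(quartiles.tolist())  # [-2.0, 1.0, 3.0]

# Standard deviation is the square root of the variance
std = np.std(arr)
print(np.isclose(std ** 2, np.var(arr)))  # True
```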

## Aggregation

Use aggregation techniques to combine NumPy data and arrays.

### Goals:

* Explore how to aggregate data in NumPy
* Write code to obtain sums and concatenations of NumPy arrays

### A. Summation

Previously, we calculated the sum of individual values between multiple arrays. To sum the values within a single array, we use the `np.sum` function.

The function takes in a NumPy array as its required argument, and uses the  `axis`  keyword argument. 

If the  `axis`  keyword argument is not specified,  `np.sum`  returns the overall sum of the array.

The code below shows how to use  `np.sum` .

In [134]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
np.sum(arr)

18

In [135]:
np.sum(arr, axis=0)

array([ -2,  73, -53])

In [136]:
np.sum(arr, axis=1)

array([ 75, -56,  -1])

In addition to regular sums, NumPy can perform cumulative sums using  `np.cumsum` . Like  `np.sum` ,  `np.cumsum`  also takes in a NumPy array as a required argument and uses the  `axis`  argument. If the  `axis`  keyword argument is not specified,  `np.cumsum`  will return the cumulative sums for the flattened array.

The code below shows how to use  `np.cumsum` . For a 2-D NumPy array, setting  `axis=0`  returns an array with cumulative sums across each column, while  `axis=1`  returns the array with cumulative sums across each row. Not setting  `axis`  returns a cumulative sum across all the values of the flattened array.

In [137]:
arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])
np.cumsum(arr)

array([ 0, 72, 75, 76, 79, 19, 16, 14, 18], dtype=int32)

In [138]:
np.cumsum(arr, axis=0)

array([[  0,  72,   3],
       [  1,  75, -57],
       [ -2,  73, -53]], dtype=int32)

In [139]:
np.cumsum(arr, axis=1)

array([[  0,  72,  75],
       [  1,   4, -56],
       [ -3,  -5,  -1]], dtype=int32)
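One way to relate the two functions: the final entry of a cumulative sum is the full sum. The sketch below checks this for the flattened array and for the column-wise case.

```python
import numpy as np

arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])

# Flattened: the last running total equals the overall sum
running = np.cumsum(arr)
flat_match = bool(running[-1] == np.sum(arr))

# Column-wise: the last row of running totals equals the column sums
col_running = np.cumsum(arr, axis=0)
col_match = bool(np.array_equal(col_running[-1], np.sum(arr, axis=0)))
print(flat_match, col_match)  # True True
```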

### B. Concatenation

An important part of aggregation is combining multiple datasets. In NumPy, this equates to combining multiple arrays into one. The function we use to do this is  `np.concatenate` .

Like the summation functions,  `np.concatenate`  uses the  `axis`  keyword argument. However, the default value for  `axis`  is  `0`  (i.e. dimension 0). Furthermore, the required argument for  `np.concatenate`  is a list of arrays, which the function combines into a single array.

The code below shows how to use  `np.concatenate` , which aggregates arrays by joining them along a specific dimension. For 2-D arrays, not setting the  `axis`  argument (defaults to  `axis=0` ) concatenates the arrays vertically. When we set  `axis=1` , the arrays are concatenated horizontally.

In [140]:
arr1 = np.array([[0, 72, 3],
                 [1, 3, -60],
                 [-3, -2, 4]])
arr2 = np.array([[-15, 6, 1],
                 [8, 9, -4],
                 [5, -21, 18]])
np.concatenate([arr1, arr2])

array([[  0,  72,   3],
       [  1,   3, -60],
       [ -3,  -2,   4],
       [-15,   6,   1],
       [  8,   9,  -4],
       [  5, -21,  18]])

In [141]:
np.concatenate([arr1, arr2], axis=1)

array([[  0,  72,   3, -15,   6,   1],
       [  1,   3, -60,   8,   9,  -4],
       [ -3,  -2,   4,   5, -21,  18]])

In [142]:
np.concatenate([arr2, arr1], axis=1)

array([[-15,   6,   1,   0,  72,   3],
       [  8,   9,  -4,   1,   3, -60],
       [  5, -21,  18,  -3,  -2,   4]])
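Concatenation only works when the arrays agree on every dimension other than the one being joined. The sketch below uses two placeholder arrays of shapes (3, 3) and (1, 3): joining along `axis=0` succeeds, while joining along `axis=1` raises a `ValueError` because the row counts differ.

```python
import numpy as np

arr1 = np.zeros((3, 3))      # placeholder 3x3 array
extra_row = np.ones((1, 3))  # placeholder 1x3 array

# Shapes (3, 3) and (1, 3) share the column count, so axis=0 works
stacked = np.concatenate([arr1, extra_row])
print(stacked.shape)  # (4, 3)

# Along axis=1 the row counts (3 vs. 1) must match, so this fails
try:
    np.concatenate([arr1, extra_row], axis=1)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)  # True
```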

## Saving Data

Save and load NumPy data.

### Goals:

* Explore how to save and load data in NumPy
* Write code to save NumPy data to a file

### A. Saving

After performing data manipulation with NumPy, it's a good idea to save the data in a file for future use. To do this, we use the  `np.save`  function.

The first argument for the function is the name/path of the file we want to save our data to. The file name/path should have a ".npy" extension. If it does not, then  `np.save`  will append the ".npy" extension to it.

The second argument for `np.save` is the NumPy data we want to save. The function has no return value. Also, ".npy" files use a binary format, so their contents look largely like gibberish when viewed with a text editor.

If  `np.save`  is called with the name of a file that already exists, it will overwrite the previous file.

The code below shows examples of saving NumPy data.

In [144]:
arr = np.array([1, 2, 3])
# Saves to 'arr.npy'
np.save('datasets/numpy_files/arr.npy', arr)
# Also saves to 'arr.npy'
np.save('datasets/numpy_files/arr', arr)

### B. Loading

After saving our data, we can load it again using  `np.load` . The function's required argument is the file name/path that contains the saved data. It returns the NumPy data exactly as it was saved.

Note that  `np.load`  will not append the ".npy" extension to the file name/path if it is not there.

The code below shows how to use  `np.load`  to load NumPy data.

In [145]:
load_arr = np.load('datasets/numpy_files/arr.npy')
load_arr

array([1, 2, 3])

In [146]:
# Will result in FileNotFoundError
load_arr = np.load('datasets/numpy_files/arr')

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/numpy_files/arr'
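The sketch below puts saving and loading together as a round trip. It writes to a temporary directory (an assumption made here so the example does not depend on the `datasets/numpy_files` folder existing) and confirms both the appended ".npy" extension and that the loaded data matches the original.

```python
import os
import tempfile

import numpy as np

arr = np.array([1, 2, 3])

with tempfile.TemporaryDirectory() as tmp_dir:
    base_path = os.path.join(tmp_dir, 'arr')  # no extension given
    np.save(base_path, arr)                   # np.save appends '.npy'
    saved_path = base_path + '.npy'
    file_created = os.path.exists(saved_path)
    loaded = np.load(saved_path)              # full file name required

print(file_created)                 # True
print(np.array_equal(arr, loaded))  # True
```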

# Data Analysis with pandas

## Introduction

An overview of data analysis with pandas.

We will be using [pandas](https://en.wikipedia.org/wiki/Pandas_(software)) to analyze Major League Baseball (MLB) data. The data comes courtesy of [Sean Lahman](https://en.wikipedia.org/wiki/Sean_Lahman), and contains statistics for every player, manager, and team in MLB history. The full database can be found and downloaded [here](http://www.seanlahman.com/baseball-archive/statistics/).

### A. Data analysis

Before doing any task with a dataset, it is a good idea to perform preliminary [data analysis](https://en.wikipedia.org/wiki/Data_analysis). Data analysis allows us to understand the dataset, find potential outlier values, and figure out which features of the dataset are most important to our application.

### B. pandas

Since most machine learning frameworks (e.g. TensorFlow) are used through Python, it is beneficial to use a Python-based data analysis toolkit like pandas. pandas (all lowercase) is an excellent tool for processing and analyzing real-world data, with utilities ranging from parsing multiple file formats to converting an entire data table into a NumPy array.

We'll dive into the main data analysis functionalities of pandas. For a complete overview of the pandas toolkit, you can visit the official [pandas website](https://pandas.pydata.org/pandas-docs/stable/).

### C. Matplotlib and pyplot

An essential part of data analysis is creating charts and plots to visualize the data. Similar to the saying, "a picture is worth a thousand words", data visualization can convey key data trends and correlations through a single figure.

The library we will use for data visualization in Python is [Matplotlib](https://en.wikipedia.org/wiki/Matplotlib). Specifically, we'll be using the [pyplot](https://matplotlib.org/api/pyplot_api.html) API of Matplotlib, which provides a variety of plotting tools from simple line plots to advanced visuals like heatmaps and 3-D plots. While we will only touch on the basic necessities for our data analysis (e.g. line plots, boxplots, etc.), a full overview of Matplotlib can be found at the official [website](https://matplotlib.org/index.html).

## Series

The pandas Series object for 1-D data.

### Goals

* Explore the pandas Series object and its basic utilities
* Write code to create several Series objects

### A. 1-D data

Similar to NumPy, pandas frequently deals with 1-D and 2-D data. However, we use two separate objects to deal with 1-D and 2-D data in pandas. For 1-D data, we use the  `pandas.Series`  objects, which we'll refer to simply as a Series.

A Series is created through the  `pd.Series`  constructor, which takes in no required arguments but does have a variety of keyword arguments.

The first keyword argument is  `data` , which specifies the elements of the Series. If  `data`  is not set,  `pd.Series`  returns an empty Series. Since the  `data`  keyword argument is almost always used, we treat it like a regular first argument (i.e. skip the  `data=`  prefix).

Similar to the  `np.array`  constructor,  `pd.Series`  also takes in the  `dtype`  keyword argument for manual casting.

The code below shows how to create pandas Series objects using  `pd.Series` .

In [149]:
series = pd.Series()
# Newline to separate series print statements
print('{}\n'.format(series))

Series([], dtype: float64)



In [150]:
series = pd.Series(5)
print('{}\n'.format(series))

0    5
dtype: int64



In [151]:
series = pd.Series([1, 2, 3])
print('{}\n'.format(series))

0    1
1    2
2    3
dtype: int64



In [152]:
series = pd.Series([1, 2.2]) # upcasting
series

0    1.0
1    2.2
dtype: float64

In [153]:
arr = np.array([1, 2])
series = pd.Series(arr, dtype=np.float32)
series

0    1.0
1    2.0
dtype: float32

In [155]:
series = pd.Series([[1, 2], [3, 4]])
series

0    [1, 2]
1    [3, 4]
dtype: object

In our examples, we initialized each Series with its values by setting the first argument using a scalar, list, or NumPy array. Note that `pd.Series` upcasts values in the same way as `np.array`. Furthermore, since Series objects are 1-D, the `series` variable in the final example represents a Series with lists as elements, rather than a 2-D matrix.

### B. Index

In the previous examples, you may have noticed the zero-indexed integers to the left of the elements in each Series. These integers are collectively referred to as the  *index*  of a Series, and each individual index element is referred to as a  *label* .

The default index is integers from 0 to  *n*  - 1, where  *n*  is the number of elements in the Series. However, we can specify a custom index via the  `index`  keyword argument of  `pd.Series` .

The code below shows how to use the  `index`  keyword argument with  `pd.Series` .

In [156]:
series = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series

a    1
b    2
c    3
dtype: int64

In [157]:
series = pd.Series([1, 2, 3], index=['a', 8, 0.3])
series

a      1
8      2
0.3    3
dtype: int64

The  `index`  keyword argument needs to be a list or array with the same length as the  `data`  argument for  `pd.Series` . The values in the  `index`  list can be any hashable type (e.g. integer, float, string).
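The sketch below shows the length requirement in action. It reads the labels back through the Series `index` attribute, then shows that passing an index of the wrong length raises a `ValueError`.

```python
import pandas as pd

series = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
labels = series.index.tolist()
print(labels)  # ['a', 'b', 'c']

# Three values but only two labels: the lengths do not match
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
    length_error = False
except ValueError:
    length_error = True
print(length_error)  # True
```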

### C. Dictionary input

Another way to set the index of a Series is by using a Python dictionary for the  `data`  argument. The keys of the dictionary represent the index of the Series, while each individual key is the label for its corresponding value.

The code below shows how to use  `pd.Series`  with a Python dictionary as the first argument. In our example, we set  `'a'` ,  `'b'` , and  `'c'`  as the Series index, with corresponding values  `1` ,  `2` , and  `3` .

In [158]:
series = pd.Series({'a':1, 'b':2, 'c':3})
series

a    1
b    2
c    3
dtype: int64

In [159]:
series = pd.Series({'b':2, 'a':1, 'c':3})
series

b    2
a    1
c    3
dtype: int64

## DataFrame

The pandas DataFrame object for 2-D data.

### Goals:

* Explore the pandas DataFrame object and its basic utilities
* Write code to create and manipulate a pandas DataFrame

### A. 2-D data

One of the main purposes of pandas is to deal with tabular data, i.e. data that comes from tables or spreadsheets. Since tabular data contains rows and columns, it is 2-D. For working with 2-D data, we use the  `pandas.DataFrame`  object, which we'll refer to simply as a DataFrame.

A DataFrame is created through the  `pd.DataFrame`  constructor, which takes in essentially the same arguments as  `pd.Series` . However, while a Series could be constructed from a scalar (representing a single value Series), a DataFrame cannot.

Furthermore,  `pd.DataFrame`  takes in an additional  `columns`  keyword argument, which represents the labels for the columns (similar to how  `index`  represents the row labels).

The code below shows how to use the  `pd.DataFrame`  constructor.

In [160]:
df = pd.DataFrame()
df

In [161]:
df = pd.DataFrame([5, 6])
df

   0
0  5
1  6


In [162]:
df = pd.DataFrame([[5,6]])
df

   0  1
0  5  6


In [163]:
df = pd.DataFrame([[5, 6], [1, 3]],
                  index=['r1', 'r2'],
                  columns=['c1', 'c2'])
df

    c1  c2
r1   5   6
r2   1   3


In [164]:
df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]},
                  index=['r1', 'r2'])
df

    c1  c2
r1   1   3
r2   2   4


> Note: When we use a Python dictionary for initialization, the DataFrame takes the dictionary's keys as its column labels.

### B. Upcasting

When we initialize a DataFrame of mixed types, upcasting occurs on a per-column basis. The  `dtypes`  property returns the types in each column as a Series of types.

The code below shows how upcasting works in DataFrames. You'll notice that upcasting only occurs in the first column for the DataFrame below, because the second column's values are both integers.

In [165]:
upcast = pd.DataFrame([[5, 6], [1.2, 3]])
upcast

     0  1
0  5.0  6
1  1.2  3


In [166]:
# Datatypes of each column
upcast.dtypes

0    float64
1      int64
dtype: object

### C. Appending rows

We can append additional rows to a given DataFrame through the  `append`  function. The required argument for the function is either a Series or DataFrame, representing the row(s) we append.

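Note that the `append` function returns the modified DataFrame but doesn't actually change the original. Furthermore, when we append a Series to the DataFrame, we either need to specify the `name` for the series or use the `ignore_index` keyword argument. Setting `ignore_index=True` will change the row labels to integer indexes. Also be aware that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the examples here require an older pandas version; newer versions use `pd.concat` for the same task.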

The code below shows example usages of the  `append`  function.

In [167]:
df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')

df_app = df.append(ser)
df_app

      0  1
0   5.0  6
1   1.2  3
r3  0.0  0


In [168]:
df_app = df.append(ser, ignore_index=True)
df_app

     0  1
0  5.0  6
1  1.2  3
2  0.0  0


In [169]:
df2 = pd.DataFrame([[0,0],[9,9]])
df_app = df.append(df2)
df_app

     0  1
0  5.0  6
1  1.2  3
0  0.0  0
1  9.0  9
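On pandas versions where `append` is unavailable (it was removed in pandas 2.0), the same results can be obtained with `pd.concat`. The sketch below is one possible equivalent rather than the only approach: it converts the named Series into a one-row DataFrame with `to_frame().T`, then concatenates it with the original.

```python
import pandas as pd

df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')

# Turn the Series into a one-row DataFrame; its name becomes the row label
row = ser.to_frame().T

# Keeps the row label 'r3', like df.append(ser)
df_named = pd.concat([df, row])
# Renumbers the rows, like df.append(ser, ignore_index=True)
df_renumbered = pd.concat([df, row], ignore_index=True)

print(df_named.index.tolist())       # [0, 1, 'r3']
print(df_renumbered.index.tolist())  # [0, 1, 2]
```

As with `append`, `pd.concat` returns a new DataFrame and leaves `df` unchanged.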


### D. Dropping data

We can drop rows or columns from a given DataFrame through the `drop` function. There is no required argument, but the keyword arguments of the function give us two ways to drop rows/columns from a DataFrame.

The first way is using the  `labels`  keyword argument to specify the labels of the rows/columns we want to drop. We use this alongside the  `axis`  keyword argument (which has default value of  `0` ) to drop from the rows or columns axis.

The second method is to directly use the  `index`  or  `columns`  keyword arguments to specify the labels of the rows or columns directly, without needing to use  `axis` .

The code below shows examples on how to use the  `drop`  function.

In [175]:
df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4],
                   'c3': [5, 6]},
                  index=['r1', 'r2'])
df

    c1  c2  c3
r1   1   3   5
r2   2   4   6


In [176]:
# Drop row r1
df_drop = df.drop(labels='r1')
df_drop

    c1  c2  c3
r2   2   4   6


In [177]:
# Drop columns c1, c3
df_drop = df.drop(labels=['c1', 'c3'], axis=1)
df_drop

    c2
r1   3
r2   4


In [178]:
df_drop = df.drop(index='r2')
df_drop

    c1  c2  c3
r1   1   3   5


In [179]:
df_drop = df.drop(columns='c2')
df_drop

    c1  c3
r1   1   5
r2   2   6


In [180]:
# Drop row r2 and column c2 in a single call
df_drop = df.drop(index='r2', columns='c2')
df_drop

    c1  c3
r1   1   5


Similar to  `append` , the  `drop`  function returns the modified DataFrame but doesn't actually change the original.


> Note: When using  `labels`  and  `axis` , we can't drop both rows and columns from the DataFrame.
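The sketch below checks both points: the original DataFrame keeps its shape after `drop`, and the `index` and `columns` keyword arguments together can remove a row and a column in one call.

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4], 'c3': [5, 6]},
                  index=['r1', 'r2'])

# Remove row r2 and column c2 in one call; df itself is not modified
df_drop = df.drop(index='r2', columns='c2')

print(df.shape)                  # (2, 3)
print(df_drop.shape)             # (1, 2)
print(df_drop.columns.tolist())  # ['c1', 'c3']
```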

## Combining

Combine multiple DataFrames through concatenation and merging.