# CS-6600 Lecture 2 - The Tools of the Trade

**Instructor: Dylan Zwick**

*Weber State University*

References:
* [Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) by Aurélien Géron - [Chapter 2: End-to-End Machine Learning Project](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb)

* [Python for Data Analysis](https://wesmckinney.com/book/) by Wes McKinney

<center>
  <img src="https://imgs.xkcd.com/comics/python_environment.png" alt="Python Environment">
</center>

All the programming work we do in this class will be in Python, and much of the work will be done in Jupyter notebooks. A Jupyter notebook is a powerful tool for interactively developing and presenting data science projects.

There are many options for running Jupyter notebooks. There are cloud-based options like:

*   [Google Colab](https://colab.google/)
*   [Anaconda Cloud](https://anaconda.cloud/)

You can also download and run [Anaconda](https://www.anaconda.com/) locally on your computer. Please note that Anaconda is big (on the order of gigabytes), and can take a little while (but not too long) to install. If you're running locally, once you have Anaconda installed, you should open Anaconda Navigator, and launch Jupyter Notebook.

For this class I'll be using Google Colab, and that's what I'd encourage you to use as well, although it's not strictly required. If you're going to set up a Google Colab or other cloud account, please do so with your Weber State email. There may be a fee involved you'll need to pay. It's cheaper than textbooks - which are available online for free.

I must admit I do feel a bit silly writing this, as the very fact that you're reading it means you've probably got this part figured out.


_What is a Jupyter notebook_?

A notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media (like comics!). It's a single document where you can run code, display the output, and add explanations.

Note that it is possible to use many different programming languages in a Jupyter Notebook. However, Jupyter was built with Python in mind (it's the "Py" in "JuPYter"), and Python is the most common language used in Jupyter notebooks. It's what we'll be using in this class.

## The Notebook Interface

Now that you have an open notebook in front of you, its interface will hopefully not look entirely alien. After all, Jupyter is kind of like an advanced word processor. Take a look around.

Had a chance to look around? Good. Now, let's talk a bit about some important terms for Jupyter Notebooks.

First, there are two fairly prominent terms which you'll need to know: _cells_ and _kernels_.

* A _cell_ is a container for text to be displayed in the notebook or code to be executed by the notebook’s kernel.

* A _kernel_ is a “computational engine” that executes the code contained in a notebook document.

**Cells**

We’ll return to kernels a little later, but first let's talk about cells. Cells form the body of a notebook. There are two main cell types:

* A _code cell_ contains code to be executed in the kernel. When the code is run, the notebook displays the output below the code cell that generated it.

* A _Markdown cell_ contains text formatted using Markdown and displays its output in-place when the Markdown cell is run.

The default first cell in a new notebook is always a code cell.

Let’s test out a classic hello world example: Type _print('Hello World!')_ into the cell below and then press _Ctrl + Enter_.

In [None]:
print('Hello World!')

When we run the cell, its output is displayed below it and the label to its left will have changed from In [ ] to In [1].

The output of a code cell also forms part of the document, which is why you can see it after you run the cell. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not.

The label number indicates when the cell was executed on the kernel — in this case the cell was executed first.

Run the cell again and the label will change to a higher number. It will become clearer why this is useful later on when we take a closer look at kernels.

Now, try running the cell below.

In [None]:
time.sleep(3)

Whoops! ERROR.

What happened? Well, we're calling a function from the *time* library, but in order to do this, we need to import the library. Let's do that and try again.

In [None]:
import time
time.sleep(3)

That's better. Note that for this lecture we'll be importing libraries inline as we need them, but that won't always be the case. Usually, instead of importing libraries "in line", I'll import all the libraries we need right at the top. Perhaps this is just because I'm old and was trained writing C programs where you need to specify all this up front, but I prefer doing this because I find it makes it easier to debug. You don't need to go hunting through your notebook trying to find if and where you've imported a given library. However, this is just my style, and yours may differ.

Did you notice anything different? (Besides that it ran.)

This cell doesn’t produce any output, but it does take three seconds to execute. Notice how Jupyter signifies when the cell is currently running by changing its label to In [*].

In general, the output of a cell comes from any text data specifically printed during the cell’s execution, as well as the value of the last line in the cell, be it a lone variable, a function call, or something else.

**Keyboard Shortcuts**

One thing you may have already observed is that some cells have a white background, while others have a gray background. A white background means the cell is for text, while a gray background means it's for code. There is always one “active” cell highlighted within a notebook. Active cells can also be in either "command" mode or "edit" mode.

How can we create or modify cells? Well, keyboard shortcuts are a very popular feature of the Jupyter environment because they facilitate a speedy cell-based workflow. Many of these are actions you can carry out on the active cell when it’s in command mode.

Below, you’ll find a list of some of Jupyter’s keyboard shortcuts. You don’t need to memorize them all immediately (or ever), but this list should give you a good idea of what’s possible, and you'll find you quickly memorize the ones you frequently need.

* Switch to command mode with Esc, and to edit mode with Enter.

* Once in command mode:
    * Scroll up and down your cells with your Up and Down keys.
    * Press A or B to insert a new cell above or below the active cell.
    * Ctrl-M + M will transform an active code cell to a Markdown cell.
    * Ctrl-M + Y will transform an active Markdown cell to a code cell.
    * Ctrl-M + D will delete the active cell.
    * Ctrl-M + Z will undo cell deletion.

Go ahead and give some of these a try.


If you want to practice deleting and undoing deletion, you can use this cell.

**Markdown**

[Markdown](https://www.markdownguide.org/) is a lightweight, easy-to-learn markup language for formatting plain text. Much of its syntax corresponds directly to HTML tags, so some prior knowledge of HTML would be helpful but is definitely not a prerequisite.

All the narrative text above was written in Markdown. Let’s cover the basics with a quick example:
````
# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph. Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
* Which can be bulleted using asterisks.

1. Lists can also be numbered.
2. If we want an ordered list.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
```
bar()
```
Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![This is alt text](https://imgs.xkcd.com/comics/selection_bias.png)
````

Below is the text above rendered with Markdown:

# This is a level 1 heading

## This is a level 2 heading

This is some plain text that forms a paragraph. Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
* Which can be bulleted using asterisks.

1. Lists can also be numbered.
2. If we want an ordered list.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:
```
bar()
```
Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![This is alt text](https://imgs.xkcd.com/comics/selection_bias.png)

Please note the image above comes from the ["xkcd" comic](https://xkcd.com/). I'll be using images from xkcd frequently in this class.

**Kernels**

Behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel. Any output is returned back to the cell to be displayed. The kernel’s state persists over time and between cells — it pertains to the document as a whole and not individual cells.

For example, if you import libraries or declare variables in one cell, they will be available in another. Let’s try this out to get a feel for it. First, we’ll import a Python package and define a function:

In [None]:
import numpy as np
def square(x):
    return x * x

Once we’ve executed the cell above, we can reference np and square in any other cell.

In [None]:
x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))

This will work regardless of the order of the cells in your notebook. As long as a cell has been run, any variables you declared or libraries you imported will be available in other cells.

For example, remember that *time* command that didn't work the first time we ran it? Scroll up and try running the cell again. It should run now, because you've imported the required library.

Most of the time when you create a notebook, the flow will be top-to-bottom. But it’s common to go back to make changes. When we do need to make changes to an earlier cell, the order of execution we can see on the left of each cell can help us diagnose problems by seeing the order in which the cells have run. You see why they're numbered? I told you it would be explained later. :)

But what happens if we change the value of y specified in the code cell right above?

In [None]:
y = 10
print('Is %d squared %d?' % (x, y))

No. No it's not. We get this output because once we’ve run the y = 10 code cell, y is no longer equal to the square of x in the kernel.

And if we ever wish to reset things, there are several incredibly useful options from the Runtime menu, including:

* Restart session: restarts the kernel, thus clearing all the variables etc that were defined.
* Restart session and run all: same as above but will also run all your cells in order from first to last.

If your kernel is ever stuck on a computation and you wish to stop it, you can choose the *Interrupt execution* option.

**Choosing a Runtime**

When you're running a Jupyter notebook in Colab, you've got a few options that determine your *runtime*. First, you need to pick which language you're using - for us it will almost always be Python 3. You also need to pick which hardware you're using. This won't be important at the start, but later on we'll definitely want to be using GPUs.

There are kernels for different versions of Python, and also for over 100 languages including Java, C, and even Fortran. Data scientists may be particularly interested in the kernels for R and Julia, as well as both imatlab and the Calysto MATLAB Kernel for Matlab.

## The NumPy Library

Now let's go over the basics of NumPy, which is short for "Numerical Python". It's one of the great data science workhorse libraries of Python, and we'll be using it all the time. We'll just be scratching the surface of its features and capabilities today.

The basic object of interest in NumPy is the array, or the *ndarray*, which is an efficient, multidimensional array optimized for storage and computation. NumPy also comes with tools for reading/writing array data to disk and some nice built-in math functions, plus tools for integrating with C libraries.

The basic idea behind NumPy is that, while Python is great for doing many things, it's not an inherently optimized language for handling large-scale numeric computations or storing large data files. Instead of forcing data analysts who need to do such things to switch to a different language, NumPy provides tools that, essentially, translate basic numeric analysis needs into much more efficient, lower-level (C-style) implementations.

To give you an idea of the performance difference, check out the following:

In [None]:
import numpy as np  # NumPy is almost always abbreviated as np

In [None]:
my_arr = np.arange(1000000)
my_list = list(range(1000000))

In [None]:
%timeit my_arr2 = my_arr*2

In [None]:
%timeit my_list2 = [x * 2 for x in my_list]

That's quite the difference! Generally, NumPy-based algorithms are 10 to 100 times faster (or more!) than their pure Python counterparts and use significantly less memory. So, if you're writing for loops to go through lists for large numeric computations - you're probably doing it wrong.
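Note that `%timeit` is an IPython "magic" command, so it only works inside a notebook or IPython session. As a rough sketch, assuming you wanted to run the same comparison from a plain Python script, you could use the standard library's `timeit` module instead. The variable names and the smaller array size below are just my choices for illustration, and the timings you see will depend on your machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than in the cells above so a script finishes quickly
my_arr = np.arange(n)
my_list = list(range(n))

# Time 10 runs of each approach; timeit.timeit returns the total seconds.
numpy_seconds = timeit.timeit(lambda: my_arr * 2, number=10)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=10)

print(f"NumPy: {numpy_seconds:.4f} s, list comprehension: {list_seconds:.4f} s")

# Both approaches compute the same values.
same_values = np.array_equal(my_arr * 2, np.array([x * 2 for x in my_list]))
print(same_values)
```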

### The NumPy ndarray: A Multidimensional Array Object

The basic data object in NumPy is the *ndarray*, and applying standard arithmetic operations to these arrays uses syntax very similar to standard arithmetic.

For example, let's create an array:

In [None]:
data = np.array([[1, 4.2, 7], [5, 8, 2.71]])
data

We can multiply every element in the array by 2 with the following:

In [None]:
data * 2

Or, we can get the same result by adding the array to itself:


In [None]:
data + data

We could even square every element in the array:

In [None]:
data ** 2

Again, these operations are generally quite fast relative to trying to implement them with a for loop.

#### Data Types for ndarrays

In general, an array is for homogeneous data; in other words, all the elements must be the same type. Every array also has a *shape*, which is a tuple indicating the size of each dimension, and a *dtype*, which describes the data type of the array.

In [None]:
data.shape

In [None]:
data.dtype

How do we create ndarrays? The easiest way is with the *array* function used above, where we just specify the elements in a list or nested lists. Unless explicitly told otherwise, the array function will try to infer a good data type for the array based upon its inputs.
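To make the type inference a little more concrete, here is a small sketch with some made-up inputs. The exact integer dtype can vary by platform, so I only print it here rather than promise a particular one.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # integers mixed with a float -> floats
strings = np.array(["a", "bc"])  # strings -> a fixed-width Unicode dtype

print(ints.dtype, mixed.dtype, strings.dtype)
print(mixed.tolist())  # [1.0, 2.5, 3.0] (the integers were upcast to floats)
```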

In addition to the *array* function, there are a number of other functions for creating new arrays. For example:

In [None]:
np.zeros(10)

In [None]:
np.zeros((2,3))

In [None]:
np.ones((4,3))

The function *arange* is like the built-in *range* function in Python, except it returns a NumPy array instead of a range object.

In [None]:
np.arange(12)

In [None]:
np.arange(3,11)
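Like *range*, *arange* also takes an optional step as a third argument. One difference worth knowing is that *arange* accepts non-integer steps, which *range* does not. A quick sketch (the particular numbers are arbitrary):

```python
import numpy as np

evens = np.arange(2, 11, 2)       # start, stop (exclusive), step
quarters = np.arange(0, 1, 0.25)  # unlike range, arange accepts float steps

print(evens.tolist())         # [2, 4, 6, 8, 10]
print(quarters.tolist())      # [0.0, 0.25, 0.5, 0.75]
print(list(range(2, 11, 2)))  # [2, 4, 6, 8, 10]
```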

We can, if desired, explicitly convert or *cast* an array from one data type to another using the *astype* method:

In [None]:
arr = np.array([1,2,3,4,5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr

If we go the other way, from floats to ints, that will not generate an error, but the decimal parts will be truncated.

In [None]:
arr = np.array([2.3, 5.0, 4.999, 2.71828])
arr.dtype

In [None]:
int_arr = arr.astype(np.int64)
int_arr
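Truncation means rounding toward zero, which is not the same as rounding to the nearest integer or always rounding down. The difference shows up with negative numbers. Here's a small sketch with some made-up values comparing the three:

```python
import numpy as np

values = np.array([-2.7, -0.2, 0.2, 2.7])

truncated = values.astype(np.int64)          # drops the decimal part
rounded = np.round(values).astype(np.int64)  # rounds to the nearest integer first
floored = np.floor(values).astype(np.int64)  # always rounds down

print(truncated.tolist())  # [-2, 0, 0, 2]
print(rounded.tolist())    # [-3, 0, 0, 3]
print(floored.tolist())    # [-3, -1, 0, 2]
```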

We can convert strings that make sense as numbers to numbers:

In [None]:
arr = np.array(["2.3", "1", "42"])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr

But, this will cause an error if the strings don't make sense as numbers:

In [None]:
arr = np.array(["2.3", "1", "42", "Shrubbery"])
arr

In [None]:
float_arr = arr.astype(np.float64)

The datatype for an array can also be specified when you create it:


In [None]:
arr = np.array([1, 2, 3])
arr

In [None]:
arr = np.array([1, 2, 3], dtype=np.float64)
arr

#### Arithmetic with NumPy Arrays

Arrays enable you to express batch operations on data without writing any for loops. In NumPy this is called "vectorization". Any arithmetic operations between equal-sized arrays apply the operation element-wise:

In [None]:
arr = np.array([[1,2,3],[3,2,1]])
arr

In [None]:
arr * arr

In [None]:
arr + 2 * arr

#### Basic indexing and slicing

Selecting a subset of an array is a somewhat deep topic that we'll only touch on here. One-dimensional arrays on the surface act similarly to Python lists:

In [None]:
arr = np.arange(10)
arr[5]

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12
arr

As you can see, assigning a scalar value to a slice propagates (or *broadcasts*) the value to the entire selection.

An important distinction from Python lists is that array slices are views **on the original array**, which means any modification to them is reflected in the source array.

For example:

In [None]:
arr_slice = arr[5:8]
arr_slice

In [None]:
arr_slice[1] = 42

In [None]:
arr

This might seem surprising if you are used to Python lists. The idea behind this is that NumPy has been designed to work with very large arrays, and you can imagine that lots of copying of big arrays could lead to performance and memory problems.

If you do want to create a copy and not just a view, you can do so with the *copy* method:

In [None]:
arr = np.arange(10)
arr

In [None]:
arr_slice = arr[3:6]
arr_slice

In [None]:
arr_copy = arr[3:6].copy()
arr_copy

In [None]:
arr_slice[0] = 23
arr_copy[1] = 42
arr
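If you're ever unsure whether you're holding a view or a copy, one way to check (a sketch, not the only way) is NumPy's `np.shares_memory` function, or the `base` attribute of the array:

```python
import numpy as np

arr = np.arange(10)
arr_slice = arr[3:6]        # a view on arr
arr_copy = arr[3:6].copy()  # an independent copy

print(np.shares_memory(arr, arr_slice))  # True
print(np.shares_memory(arr, arr_copy))   # False
print(arr_slice.base is arr)             # True: the view's data belongs to arr
```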

For higher dimensional arrays there are more options.

In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [None]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2d[2]

If we wanted to access the third element of the first array, we could do so either recursively:

In [None]:
arr2d[0][2]

Or, using a comma-separated list:

In [None]:
arr2d[0,2]

If we assign a scalar value to an entire row, it broadcasts that value to every entry in the row:

In [None]:
arr2d[0]

In [None]:
arr2d[0] = 42
arr2d

Multiple slices can be passed to multi-dimensional arrays just like we can pass multiple indices.

In [None]:
arr2d[:2]

In [None]:
arr2d[:2,1:]

By mixing indexes and slices, you can get lower dimensional slices:

In [None]:
arr2d[:2,2]

Keep in mind, these slices are *views*, so assigning to them changes the original array:

In [None]:
arr2d[:2,2] = 801
arr2d

We can index the elements in an array with **boolean indexing**. For example, suppose we have the following array of names:

In [None]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4,7], [0,2], [-5,6], [0,0], [1,2], [-12,-4], [3,4]])

In [None]:
names

In [None]:
data

Suppose each name corresponds with a row in the *data* array and we wanted to select all the rows with the corresponding name "Bob". Like arithmetic operations, comparisons (like ==) with arrays are also vectorized. So, this command produces a Boolean array:

In [None]:
names == "Bob"

If we pass this as an index to our array we get:

In [None]:
data[names=="Bob"]

Note of course the Boolean array must be of the same length as the array axis it's indexing.
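Here is a small sketch of what happens when the lengths don't match. The arrays are made up for illustration, and the exact wording of the error message may differ between NumPy versions.

```python
import numpy as np

data = np.array([[4, 7], [0, 2], [-5, 6]])  # 3 rows
good_mask = np.array([True, False, True])   # one entry per row
bad_mask = np.array([True, False])          # too short

print(data[good_mask].tolist())  # [[4, 7], [-5, 6]]

error_message = None
try:
    data[bad_mask]
except IndexError as err:
    error_message = str(err)
    print("IndexError:", error_message)
```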

You can even mix and match Boolean arrays with slices:

In [None]:
data[names == "Bob", 1:]

In [None]:
data[names == "Bob", 1]

We can also use the standard *and* (&), *or* (|), and *not* (~) operations.

In [None]:
data[names != "Bob"]

In [None]:
data[~(names == "Bob")]

In [None]:
data[(names == "Bob") | (names == "Will")]
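Two things tend to trip people up here: the Python keywords `and` and `or` don't work with Boolean arrays, and each comparison needs its own parentheses because `&` and `|` bind more tightly than `==`. A short sketch, where the second condition (on the first column of *data*) is just something I made up for illustration:

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

# Rows named "Bob" whose first column is positive.
mask = (names == "Bob") & (data[:, 0] > 0)
print(data[mask].tolist())  # [[4, 7]]

# The keyword version fails because an array has no single truth value.
try:
    (names == "Bob") and (data[:, 0] > 0)
except ValueError as err:
    print("ValueError:", err)
```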

Note selecting data from an array by Boolean indexing and assigning the results to a new variable creates a *copy* of the data. I know it's a bit confusing when a copy is created and when it's not. Sorry, I didn't make the rules.
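To make the distinction concrete, here is a sketch with a shortened version of the arrays above. Selecting with a Boolean array gives you a copy, but using a Boolean array on the left-hand side of an assignment modifies the original array in place.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0]])

bob_rows = data[names == "Bob"]  # selection: a new array (a copy)
bob_rows[:] = 99                 # changing the copy...
print(data.tolist())             # ...leaves data alone: [[4, 7], [0, 2], [-5, 6], [0, 0]]

data[data < 0] = 0               # assignment: modifies data itself
print(data.tolist())             # [[4, 7], [0, 2], [0, 6], [0, 0]]
```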

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Sorry, I didn't make this terminology either. For example:

In [None]:
arr = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
arr

In [None]:
arr[[1,0,2]]

Using negative indices selects rows from the end:

In [None]:
arr[[-1,-2]]

Passing multiple index arrays selects an array of elements corresponding to each tuple of indices:

In [None]:
arr[[1,0,2],[2,0,1]]
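This pairing behavior surprises a lot of people, who expect to get back the rectangular block formed by those rows and columns. If the block is what you want, two options (sketched below) are to index in two steps, or to use NumPy's `np.ix_` helper:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

pairs = arr[[1, 0, 2], [2, 0, 1]]             # elements (1,2), (0,0), (2,1)
block = arr[[1, 0, 2]][:, [2, 0, 1]]          # rows 1,0,2 then columns 2,0,1
block_ix = arr[np.ix_([1, 0, 2], [2, 0, 1])]  # the same block via np.ix_

print(pairs.tolist())                   # [6, 1, 8]
print(block.tolist())                   # [[6, 4, 5], [3, 1, 2], [9, 7, 8]]
print(np.array_equal(block, block_ix))  # True
```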

Fancy indexing, unlike slicing, always copies the data into a new array when assigning the results to a new variable. Again, not my rules.

Finally, we can transpose an array with the *T* attribute:

In [None]:
arr

In [None]:
arr.T

## The Pandas Library

The other great workhorse library for data science with Python is *pandas*. We will not cover, or even come close to covering, the entire library today. Also, there are many aspects and facets of Pandas that you'll learn and internalize only by using it. However, today should - ideally - give you a starting point.

First, some notational conventions. As with NumPy and "np", almost always the name of the library "pandas" is abbreviated as "pd" when it's imported, and pd is used to reference it afterwards.

Joke - How do you ruin a data scientist's day?

```
import numpy as pd
import pandas as np
```

Let's do it right:

In [None]:
import pandas as pd

### Pandas Basics

The dataset with which we'll play around is the "royal line" dataset, which was created from public sources and contains family history information about Elizabeth II, the Queen of England at the time the dataset was compiled. Note a few things about the command below:

* It uses the read_csv command from pandas, which is used to read in "comma separated value" files. This is a very common format for storing tabular data, as it's not tied to a particular program like, for example, Excel files are. However, Pandas also has functionality for reading in pretty much any type of data format commonly found in practice.

* The basic data object in Pandas is a "dataframe", which is created by the read_csv command. Frequently, a dataframe is denoted with the abbreviation "df", which we do here.

In [None]:
url = 'https://drive.google.com/uc?export=download&id=1k7k-ObAKWhW0iPbyCplogGMGsy29Mif5' #This URL points to the royal_line.csv file stored on my (Dylan's) Google Drive. You should all be able to access it.

In [None]:
df = pd.read_csv(url)

We can then take a look at this dataframe using the "print" command, which will by default print the first five and last five rows of the dataframe.

In [None]:
print(df)

If we just want to check out the first $n$ values, we can use the "head" function:

In [None]:
print(df.head())
print(df.head(12))

And similarly the "tail" function returns the last $n$ values:

In [None]:
print(df.tail(15))

You may have noticed in the printed dataframe an additional column of numbers located to the far left of the data. For instance, if we just call df.head() with the default option (which is 5), we get:

In [None]:
df.head()

That first column is an index column created by Pandas for the dataframe. It starts at $0$ and enumerates from there. Note this column is *not* in our original csv - it's created.

In this example, our original csv already has an index column, called ID, that starts at 1, so this additional data column is a bit redundant. We can specify an index column when we read in the data using the "index_col" parameter in read_csv.

In [None]:
df = pd.read_csv(url, index_col='ID')

Now, the index column is the "ID" column from the original dataset.

In [None]:
df.head()
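If you'd like to experiment with "index_col" without downloading anything, here is a self-contained sketch. The tiny CSV below is made up purely for illustration (it is not the royal line data), and it's read from a string using `io.StringIO` instead of from a file.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for a real file.
csv_text = """ID,first_name,title
1,Ann,Queen
2,Ben,Duke
3,Cat,Princess
"""

default_df = pd.read_csv(io.StringIO(csv_text))
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(default_df.index.tolist())   # [0, 1, 2]  (created by pandas)
print(indexed_df.index.tolist())   # [1, 2, 3]  (taken from the ID column)
print(indexed_df.loc[2, "title"])  # Duke
```

With the ID column as the index, `.loc` looks rows up by their ID rather than by their position.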

If we only wanted to see the "title" column, we could do so as follows:

In [None]:
df['title'].head()

We can specify more than one column in this way as well. Note the *double brackets*. Think of this as the outer brackets accessing the dataframe, while the inner brackets specify a list.

In [None]:
df[['title', 'first_name']].tail()
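The brackets also determine what type of object you get back: single brackets with one column name return a pandas *Series*, while double brackets return a *DataFrame*, even if the list only contains one column. A quick sketch on a made-up two-row dataframe:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Queen", "Duke"], "first_name": ["Ann", "Ben"]})

one_column = df["title"]       # single brackets -> Series
one_col_frame = df[["title"]]  # double brackets -> DataFrame with one column

print(type(one_column).__name__)     # Series
print(type(one_col_frame).__name__)  # DataFrame
print(one_col_frame.shape)           # (2, 1)
```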

In [None]:
df.columns

Note this returns an index object that behaves as an iterable list, so you could, for example, go through the columns with a for loop.

You can also find out more information about a dataframe using the "info" function:

In [None]:
df.info()

### Dropping or Removing Data

There are many reasons why you might want to drop or remove data from a dataframe. For example, it could be that only some data is relevant to your analysis. Or, it could be that some data is insufficient or corrupt, and leaving it in would lead to incorrect conclusions. Also, sometimes certain columns aren't of interest to the analysis in question.

If we want to drop entire columns, we can use the "drop" function and specify the columns as a list in the argument:

In [None]:
df.drop(columns = ['birth_place', 'death_place'])

However, while the dataframe above only has six columns, if we call it again we get:

In [None]:
df

What?!? I thought we dropped two!

What's going on here is that the drop command creates a new dataframe as its output. It *does not* modify the original dataframe. So, for example, we could say:

In [None]:
df2 = df.drop(columns = ['birth_place', 'death_place'])
df2

Here, df2 is the dataframe with those two columns dropped, while df is the original, unchanged dataframe.

If we want to actually make the change to the original dataframe, we can do this with the "inplace" argument.

In [None]:
df.drop(columns = ['birth_place', 'death_place'], inplace = True)
df

This is the same as:

In [None]:
df = pd.read_csv(url, index_col='ID')
df = df.drop(columns = ['birth_place', 'death_place'])
df

You can also drop rows by indicating specific indices with the *index* argument:

In [None]:
df.drop(index=[4,5,6], inplace=True)

Or, by using df.index, which looks up index labels by position (starting with $0$) and so works no matter how the index happens to be numbered.

In [None]:
df.drop(df.index[0], inplace=True)
df.drop(df.index[0], inplace=True)

Each line drops the first row of the dataframe, whatever that first row might be. So, these two lines together drop the first two rows. We can also use standard Python indexing and slicing notation to specify indices here.

In [None]:
df.drop(df.index[2:5], inplace=True)
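The difference between dropping by label and dropping by position is easiest to see on a dataframe whose index doesn't start at 0 or 1. Here's a sketch using a made-up dataframe with labels 10, 20, 30, and 40. I'm leaving off "inplace" so each result can be compared against the original.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

by_label = df.drop(index=[20])      # drops the row labeled 20
by_position = df.drop(df.index[0])  # drops the first row, whatever its label
by_slice = df.drop(df.index[1:3])   # drops the second and third rows

print(by_label.index.tolist())     # [10, 30, 40]
print(by_position.index.tolist())  # [20, 30, 40]
print(by_slice.index.tolist())     # [10, 40]
```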

The "drop_duplicates" function can be used to drop duplicate rows, while the "dropna" function drops every row that includes at least one NA entry. Be careful with this one, as it could potentially drop a lot of rows!

In [None]:
df.dropna(inplace=True)
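If dropping every row with any NA is too aggressive, "dropna" has a few options for narrowing it down. The sketch below uses a small made-up dataframe to compare "drop_duplicates" with a few "dropna" variants: the default, `how="all"`, and `subset`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Ben", np.nan],
    "title": ["Queen", "Queen", np.nan, np.nan],
})

no_dupes = df.drop_duplicates()               # removes the repeated Ann row
any_na = df.dropna()                          # drops rows with any NA
all_na = df.dropna(how="all")                 # drops rows where every entry is NA
name_only = df.dropna(subset=["first_name"])  # only checks the first_name column

print(len(no_dupes), len(any_na), len(all_na), len(name_only))  # 3 2 3 3
```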

### Adding, Modifying Data, and Mapping

Suppose we have a dataframe with NA values, and instead of dropping them we want to fill them with some values we determine. Here are a few ways to do that:

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace ALL NA entries with a fixed value:
df.fillna(0, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace the first 2 NA entries in each column with a fixed value:
df.fillna(0, limit=2, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace ALL NA first names with a fixed value:
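df['first_name'] = df['first_name'].fillna('no first name')  # assign back; an inplace fill on a selected column may not update df in newer pandas versions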
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace specific columns with specific values provided by a dictionary:
values = {'first_name': 'no_first_name', 'last_name': 'no_last_name', 'sex': 'no_sex', 'title': 'no_title', 'birth_date': 'no_birth_date', 'birth_place': 'no_birth_place', 'death_date': 'no_death_date', 'death_place': 'no_death_place'}
df.fillna(value=values, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# ffill and pad: from first row to last row, propagate the most recent row that is not an NA forward until next valid row
df.ffill(inplace = True)

In [None]:
df = pd.read_csv(url, index_col='ID')
# bfill and backfill: like ffill, except from last row to first row
df.bfill(inplace=True)
df
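The forward and backward fills are easier to see on something small. Here's a sketch on a made-up five-element series. Notice that a fill can only use values that exist in its direction, so the trailing NaN survives the backward fill.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carry the last valid value forward
backward = s.bfill()  # carry the next valid value backward

print(forward.tolist())   # [1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 4.0, 4.0, 4.0, nan]
```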

We can also create new columns from existing ones. For example:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['full_name'] = df['first_name'] + ' ' + df['last_name']
df

This illustrates a problem. Anytime we have an NaN value, the string concatenation is also NaN. How could we get around this? Well, we could create our own specific function that handles this, and then apply that to our dataframe:

In [None]:
def create_full_name(row):
    if isinstance(row['first_name'], str) and isinstance(row['last_name'], str):  # both first_name and last_name are strings
        result = row['first_name'] + ' ' + row['last_name']
    elif isinstance(row['first_name'], str):  # only first_name is a string
        result = row['first_name']
    elif isinstance(row['last_name'], str):  # only last_name is a string
        result = row['last_name']
    else:  # neither first_name nor last_name are strings, they are both NaN
        result = np.nan
    return result

df = pd.read_csv(url, index_col='ID')

df['full_name'] = df.apply(create_full_name, axis=1)
df
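Writing a function and using "apply" is the most flexible approach, but it isn't the only one. As an alternative sketch (my own variation, shown on a small made-up dataframe with the same column names), you can get the same kind of result with vectorized string operations by temporarily treating missing names as empty strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Ann", "Ben", np.nan, np.nan],
    "last_name": ["Windsor", np.nan, "Tudor", np.nan],
})

# Treat missing names as empty strings, join, then trim the stray space.
combined = (df["first_name"].fillna("") + " " + df["last_name"].fillna("")).str.strip()
# Rows where both names were missing end up empty; turn those back into NaN.
df["full_name"] = combined.mask(combined == "")

print(df["full_name"].tolist())  # ['Ann Windsor', 'Ben', 'Tudor', nan]
```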

This "apply" operation applies the specified function. You could also use Python lambda functions to create a function inline if needed. Note the option "axis = 1" means to process the data row by row. The option "axis = 0" would process the data column by column. For example:

In [None]:
# Create a dataframe that is a 6 x 2 array formed from a list of 12 numbers ordered from 0 to 11.
df = pd.DataFrame(np.arange(12).reshape(6,2), columns = ['column 1', 'column 2'])
print(df)

In [None]:
# Create a new dataframe that takes the maximum value of each column in the dataframe we just created.
new_df = df.apply(lambda column: column.max())
print(new_df)
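For comparison, here is a sketch of a lambda used with "axis = 1" on the same kind of dataframe, so the lambda receives one row at a time. For something this simple you wouldn't actually need "apply" at all, since adding the two columns directly gives the same answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["column 1", "column 2"])

# axis=1: the lambda receives one row at a time.
row_sums = df.apply(lambda row: row["column 1"] + row["column 2"], axis=1)
print(row_sums.tolist())  # [1, 5, 9, 13, 17, 21]

# The vectorized equivalent gives the same result without apply.
column_sum = df["column 1"] + df["column 2"]
print(column_sum.tolist())  # [1, 5, 9, 13, 17, 21]
```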

There are three main functions used to create or change data in dataframes: apply, map, and applymap.

In [None]:
df = pd.DataFrame(np.arange(8).reshape(4,2), columns = ['column 1', 'column 2'])
print(df)

The *apply* function can be used to apply a function along either axis of a dataframe.

In [None]:
print(df.apply(np.max))

In [None]:
print(df.apply(np.max, axis = 1))

The *map* function is a bit more limited and is used to apply a function element-wise to a series. It's more efficient than *apply* when used in this way.

In [None]:
print(df['column 1'].map(lambda x: x*2))
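The *map* function also accepts a dictionary in place of a function, which is handy for recoding values. In the sketch below the codes and labels are made up for illustration. Any value that doesn't appear in the dictionary becomes NaN.

```python
import pandas as pd

sex = pd.Series(["M", "F", "F", "U"])

# Values missing from the dictionary (here "U") become NaN.
labels = sex.map({"M": "Male", "F": "Female"})
print(labels.tolist())  # ['Male', 'Female', 'Female', nan]
```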

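The *applymap* function is used for element-wise operations across all elements of a dataframe, not just those within a series. (As of pandas 2.1, *applymap* is deprecated in favor of the equivalent dataframe *map* function, so you may see a warning when you run this.)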

In [None]:
print(df.applymap(lambda x: x*2))

### Changing Datatypes of Series or Columns

The datatypes for our "royal_line" examples have all been 'objects' because every column has had data that's been interpreted as a string. This is a general, default datatype that is quite encompassing in what it can handle. However, there are some functions, like maximum or average, that make sense for certain types of numeric data, but not for general data, and if we try to apply these functions to objects we'll have a bad time.

In [None]:
# Let's create a simple dataframe with three columns containing different types of data:
df = pd.DataFrame({'ints': [1,2,3,4], 'strings': ['a','b','c','d'], 'floats': [1.1, '2.2', '3.3', 4]})
print(df)
print(df.dtypes)

Here, the second and third columns are interpreted as objects because both contained strings (the values 2.2 and 3.3 in the floats column were entered as strings).

To convert these to a different datatype, we can convert a single column, or multiple columns using a dictionary.

In [None]:
df['floats'] = df['floats'].astype(float)
print(df.dtypes)

In [None]:
convert_dict = {'ints': int, 'strings': str, 'floats': float}
df = df.astype(convert_dict)
print(df.dtypes)

In [None]:
# The following command would also work:
df['ints'] = df['ints'].astype(float)
print(df)
print(df.dtypes)

In [None]:
# But this one won't:
df['strings'] = df['strings'].astype(int)
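When only some of the entries in a column can't be converted, an all-or-nothing error isn't always what you want. One option, sketched below on a made-up series, is the pandas "to_numeric" function with `errors="coerce"`, which converts what it can and replaces the rest with NaN.

```python
import pandas as pd

raw = pd.Series(["1.5", "2", "Shrubbery"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers.tolist())  # [1.5, 2.0, nan]
print(numbers.dtype)     # float64
```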

Pandas has many built-in conversion functions (to_datetime, to_timedelta, to_numeric, etc.) but you'll sometimes encounter data that's formatted in such a way that it's not possible to immediately convert it to the format you want using one of the built-in functions. To deal with this, sometimes you need to write your own conversion function.

For example, if we check out the 'birth_date' column in our royal_line dataset, we see:

In [None]:
df = pd.read_csv(url, index_col='ID')
print(df['birth_date'])

A lot of NaN. OK, let's remove these and see what we get:

In [None]:
df.dropna(subset = ['birth_date'], inplace=True)
print(df['birth_date'])

If we then try to convert these values to datetimes we get:

In [None]:
df['birth_date'] = pd.to_datetime(df['birth_date']) # This fails

This generates errors due to several issues.

First, there are entries in the dataset formatted like the following: ABT 751. This notation means that the family history experts believe the person was born about (ABT) 751.

The second is an out-of-bounds nanosecond timestamp error: with nanosecond-resolution timestamps, Pandas only supports a span of approximately 580 years, from around 1677 to 2262.
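
The bounds mentioned above can be inspected directly. A quick sketch, assuming nanosecond-resolution timestamps:

```python
import pandas as pd

# Earliest and latest instants representable at nanosecond resolution
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# Rough width of the supported range, in calendar years
span_years = pd.Timestamp.max.year - pd.Timestamp.min.year
print(span_years)
```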

To get around these issues, we'll write and then apply our own function. Note that we're not dropping the NaN values here; the function passes them through as NaN.

In [None]:
def get_year(x):
    if pd.isna(x):
        year_result = np.nan  # if the birth_date is nan then return nan
    else:  # checking a number of edge cases in the data and stripping it out:
        if "ABT" in x:  # for example: ABT  1775
            x = x[3:]
            x = x.strip()
        if "/" in x:  #  For example: 1775/1776
            x = x[:x.find('/')]
        num_spaces = x.count(' ')
        if num_spaces == 0:  # only has the year
            year_result = int(x)
        elif num_spaces == 1:  # example: FEB 1337
            x = x[x.rfind(' ') + 1:]  # 'rfind' finds the last space. The 'r' stands for 'reverse.'
            if x.isnumeric():
                year_result = int(x)
            else:  # This could happen if there is only a day and month, like '10 JAN'
                year_result = np.nan
        elif num_spaces == 2:  # example: 16 FEB 1337
            x = x[x.rfind(' ') + 1:]
            year_result = int(x)
        else:
            year_result = np.nan  # There are a few other strange dates that aren't worth our time to fix, so just return nan for those.
    return year_result

df['birth_year'] = df['birth_date'].map(get_year)

print(df['birth_year'])
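
For comparison, here's a shorter, vectorized sketch using *Series.str.extract*. This is not the function above, and it rests on a simplifying assumption: the year is the first run of three or four digits in the string. The sample strings are made up to mirror the formats discussed, and the assumption would break on years with fewer than three digits.

```python
import pandas as pd

# Illustrative date strings in the formats described above
dates = pd.Series(['ABT 751', '16 FEB 1337', 'FEB 1337', '1775/1776', '10 JAN', None],
                  dtype=object)

# Grab the first run of 3 or 4 digits; rows with no match (or NaN) become NaN
year_strings = dates.str.extract(r'(\d{3,4})', expand=False)
years = pd.to_numeric(year_strings)

print(years)
```

The hand-written function is longer, but it makes each edge case explicit, which is often worth it when you're still getting to know a messy dataset.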

### Conditionals in Dataframes and Series

Conditionals are a very useful feature of Pandas that typically produce a NumPy array of Booleans or a Pandas Boolean series.

For example, consider the following code that produces True if the birth_year column (calculated above) is greater than or equal to 1990, and False otherwise.

In [None]:
boolean_mask = (df.birth_year >= 1990)
print(boolean_mask)

We can then use this to, for example, only print the entries for which the boolean is True.

In [None]:
print(df[boolean_mask][['first_name', 'last_name', 'birth_year']])

We can combine Boolean expressions using the logical operators & ("and"), | ("or"), and ~ ("not"). For example:

In [None]:
print(df[(df.birth_year >= 1500) & (df.title.str.contains('Queen'))][['first_name', 'title', 'birth_year']])
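
The example above only uses &. Here's a small sketch of | and ~ as well, on a made-up dataframe (the names, titles, and years are hypothetical, not from the royal_line data). Note the parentheses around each comparison: &, |, and ~ bind more tightly than comparison operators in Python, so they're required.

```python
import pandas as pd

# Hypothetical toy data for illustration
people = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy', 'Di'],
                       'birth_year': [1480, 1533, 1819, 1926],
                       'title': ['Queen', None, 'Prince', 'Queen']})

# "or": born before 1500 or after 1900
early_or_recent = (people.birth_year < 1500) | (people.birth_year > 1900)
print(people[early_or_recent])

# "not": everyone whose title does not contain 'Queen'
# na=False treats missing titles as "does not contain"
is_queen = people.title.str.contains('Queen', na=False)
print(people[~is_queen])
```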

### loc and iloc Functions

One of the most common tasks for data scientists is filtering information to more efficiently derive actionable insights. Marketers also like saying things like that.

We've seen the "head" and "tail" functions, which provide a quick, truncated view of the beginning or end, respectively, of the dataframe or series. But what if you're interested in examining results that are not necessarily at the very beginning or end of the dataset?

For this purpose, the loc function is designed to access rows and columns by label. In contrast, the iloc function is used to access rows and columns by integer position - the "i" stands for "integer".

A quick example of the difference is illustrated below, where both commands do the same thing:

In [None]:
print(df.loc[1])

In [None]:
print(df.iloc[0])

However, the following will produce an error:

In [None]:
print(df.loc[0]) #This fails

This fails because there is no row with label 0.

Now, row indices (labels) don't need to be unique. For example:

In [None]:
# Five rows, each with an associated index label
df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'], index=['cat', 42, 'stone', 42, 12345])
print(df)

The index "42" appears twice, and some indices are numbers, while some are strings. Let's look at some examples:

In [None]:
print(df.loc[12345])

In [None]:
print(df.loc['stone'])

In [None]:
print(df.loc[42])

In [None]:
print(df.loc['A']) #This will fail

In [None]:
print(df.loc['cat':'stone'])

In [None]:
print(df.loc[['cat','stone']])

In [None]:
print(df.loc['stone', 'B'])

In [None]:
print(df.loc[df['A'] > 3])

Now let's take a look at some iloc examples:

In [None]:
print(df.iloc[0])

In [None]:
print(df.iloc[0:3])

In [None]:
print(df.iloc[[0,2,4]])

In [None]:
print(df.iloc[0,1])

In [None]:
print(df.iloc[0:3,1])
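
One difference that's easy to miss in the examples above: a loc slice includes its end label, while an iloc slice excludes its end position, just like ordinary Python slicing. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# Hypothetical toy dataframe with string labels
letters = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['w', 'x', 'y', 'z'])

by_label = letters.loc['w':'y']    # label slice: 'y' IS included
by_position = letters.iloc[0:2]    # position slice: position 2 is NOT included

print(by_label)
print(by_position)
```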

Returning to the royal family history data as an example, let's create a new column named "era". The "era" column signifies if a person was born in one of three distinct time periods: 'ancient', 'middle_years', or 'modern'. The following creates a new column and initially assigns the value 'unknown' to every entry within it:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)
df['era'] = 'unknown'

In [None]:
print(df)

The next question is how to divide the birth years. If we check out their maximum and minimum values, we get:

In [None]:
print(f"The earliest year = {df['birth_year'].min()} and the latest year = {df['birth_year'].max()}.")

So, 686 is the earliest year, and 1991 is the latest. This is a difference of 1991 - 686 = 1305 years, which divided by 3 gives us 435 years per era. So, the "ancient" royals are those born between 686 and 1121, the "middle_years" royals are those born between 1122 and 1555, and the "modern" royals are those born after 1555. (Not all that modern!) We can assign these three eras with the following code:

In [None]:
df.loc[df['birth_year'] < 1122, 'era'] = 'ancient'  # 686 – 1121
df.loc[(df['birth_year'] >= 1122) & (df['birth_year'] <= 1555), 'era'] = 'middle_years'  # 1122 – 1555
df.loc[df['birth_year'] > 1555, 'era'] = 'modern'  # after 1555
print(df)

We could have also done this using a custom function and the "map" utility.
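
Here's a sketch of what that might look like. The function name `get_era` is my own invention for illustration, and the sample years are made up to hit each boundary.

```python
import numpy as np
import pandas as pd

def get_era(year):
    """Return the era label for a birth year, using the same cutoffs as above."""
    if pd.isna(year):
        return 'unknown'
    if year < 1122:
        return 'ancient'
    if year <= 1555:
        return 'middle_years'
    return 'modern'

# Illustrative birth years, including the boundary values and a missing one
birth_years = pd.Series([686, 1121, 1122, 1555, 1556, 1991, np.nan])
eras = birth_years.map(get_era)
print(eras)
```

On the royal data, this would be applied as `df['era'] = df['birth_year'].map(get_era)`.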

### Reshaping with Pivot, Pivot_Table, Groupby, and Transpose

Frequently it is convenient or informative to restructure data contained in a dataframe, effectively organizing the data into a different shape or format. This section will cover the most common reshaping functions provided by Pandas.

The pivot function reshapes a dataframe into a cross-tabular format: the distinct values of one column become the row labels, the distinct values of another become the column headers, and a third column fills in the cells. For example:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2],
                   'Type': ['new', 'used', 'new', 'used'],
                   'Price': [10, 5, 12, 7]})
print(df)

Calling the pivot function on this dataframe will reshape the data into a more compact and usable format. In the example above, we want to reshape the data such that each car is represented on a single row. A use case for this particular reorganization would be a car salesperson who needs to quickly view all the prices of a given car for the different 'Type' categories.

In [None]:
p = df.pivot(index='Car', columns='Type', values='Price')
print(p)

The pivot function only works if there is at most one entry per cell in the result. Suppose we have the following dataframe:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})
print(df)

Invoking the pivot function on this dataframe will generate an error:

In [None]:
p = df.pivot(index='Car', columns='Type', values='Price') #This will fail

The reason is that there are two price entries for the used version of car 2. In this case we can use the "pivot_table" function with an aggregator, which specifies how to combine values when more than one occurs.

In [None]:
p = df.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')  # the string 'mean' avoids a FutureWarning that np.mean triggers in recent Pandas
print(p)

Pivot tables can result in immensely complex tabular formats with multiple indexes, multiple columns, and various aggregation functions specified. The example above demonstrates only the basic single-index, single-column case.

Here's another example of a pivot_table using our royal family history dataset:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)

df.dropna(inplace=True, subset=['title', 'sex', 'birth_year'])
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc='mean')
print(p)

Here are two more examples. The first fills blank entries in the resulting pivot table after aggregation with $0$ instead of NaN, and uses two aggregate functions, mean and count.

In [None]:
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

The second is similar to the first, but instead declares two indexes and two columns producing a much more complicated, nested output result.

In [None]:
p = df.pivot_table(index=['title', 'first_name'], columns=['sex', 'last_name'], values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

The groupby function's recasting of information is very similar to that of the pivot_table function. In general, the main difference is how the resulting output is shaped. Note it's a common mistake to create a group object without specifying an aggregating function like mean, sum, or std.

Consider the following:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})

g = df.groupby(by='Car')
print(g)

That's not particularly helpful. However, if we group by 'Car' and invoke the 'mean' function, we might expect to obtain a more useful result.

In [None]:
g = df.groupby(by='Car').mean() #This will fail
print(g)

Whoa! What happened there? Well, it's trying to apply the aggregate function to both the price, which is fine, and the type, which is not. This used to result in a warning and dropping the type column, but in more recent versions of Pandas it gives an error.

How can we get around this? Well, we can drop the *Type* column first.

In [None]:
g = df[['Car','Price']].groupby(by='Car').mean() #Need to toss out the "Type" column, or it will try to take its mean.
print(g)
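
Two other ways to sidestep the error, sketched on the same toy dataframe: select the numeric column after grouping, or ask *mean* to skip non-numeric columns with `numeric_only=True`.

```python
import pandas as pd

df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})

# Option 1: pick the column to aggregate after grouping (returns a Series)
price_means = df.groupby(by='Car')['Price'].mean()
print(price_means)

# Option 2: keep the dataframe shape but ignore non-numeric columns
numeric_means = df.groupby(by='Car').mean(numeric_only=True)
print(numeric_means)
```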

If instead of mean we wanted to use the more robust count, we don't need to toss out the type:

In [None]:
g = df.groupby(by='Car').count()
print(g)

If we wanted to group by both 'Car' and 'Type', and use two different aggregation functions, we could do that:

In [None]:
g = df.groupby(by=['Car','Type']).agg(['mean','count'])
print(g)
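
To make the similarity between groupby and pivot_table concrete, here's a sketch on the same toy data. Grouping by both columns gives a "long" result with a two-level index, and calling *unstack* on it moves the inner level ('Type') into the columns, giving the same cross-tab layout as pivot_table.

```python
import pandas as pd

df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})

# Wide, cross-tab result from pivot_table
wide = df.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')
print(wide)

# Long result from groupby: one row per (Car, Type) pair
long = df.groupby(by=['Car', 'Type'])['Price'].mean()
print(long)

# unstack() pivots the inner index level into columns
print(long.unstack())
```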

Finally, as with NumPy arrays, the transpose function (or simply T) transposes a dataframe. For example:

In [None]:
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
print(f'Original:\n{df}')

df = df.transpose()  # or df.T

print(f'\nTransposed:\n{df}')

## References

At the end of most of the lecture notes I like to provide references for further reading. Please note these are generally here in case you're interested, but you're not required to check them out.

Sometimes they may explore a topic in more depth, sometimes they might provide additional learning resources, and sometimes they might just be fun (I'll try to provide a link to a song each lecture).

* [Introduction to NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html)

* [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)

* Song Of The Day (SOTD) - [Royals](https://youtu.be/nlcIKh6sBtc?si=duDiNtjAzRyh5OUJ) by Lorde