# CS-6600 Lecture 3 - More Tools of the Trade

**Instructor: Dylan Zwick**

*Weber State University*

References:
* [Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow](https://www.oreilly.com/library/view/hands-on-machine-learning/9781098125967/) by Aurélien Géron - [Chapter 2: End-to-End Machine Learning Project](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb)

* [Python for Data Analysis](https://wesmckinney.com/book/) by Wes McKinney

* [Data Visualization in Python](https://www.amazon.com/Data-Visualization-Python-Pandas-Matplotlib/dp/B0972TFYN8) by David Landup (Full course [here](https://stackabuse.com/courses/data-visualization-in-python-with-matplotlib-and-pandas/))

<center>
  <img src="https://imgs.xkcd.com/comics/real_estate_analysis.png" alt="xkcd planetary analysis">
</center>

In our last class I tried to cover an intro to Jupyter notebooks, NumPy, and Pandas all in one 75-minute class. I was not successful. Specifically, we made it through Jupyter notebooks and NumPy, but didn't even start with Pandas.

So, today we'll do a bit of catching up and finish the material on Pandas. We'll also try to cover some basic data visualization in Python.

Next week, we'll get going on data science and modeling.


I mentioned last time my style is to usually import the libraries I need right at the start of the notebook. Let's do that here. The first three libraries we import are ones we'll be using *all the time*, and so you'll see them at the top of pretty much every notebook from now on.

Joke - How do you ruin a data scientist's day?

```
import numpy as pd
import pandas as np
```

Let's do it right and import them with the standard abbreviations. Just as NumPy is almost always imported as "np", the "pandas" library is almost always imported as "pd", and pd is used to reference it afterwards.

In [None]:
import numpy as np
import pandas as pd #This is the way.
import matplotlib.pyplot as plt

from matplotlib.patches import Circle
from matplotlib.patheffects import withStroke
from matplotlib.ticker import AutoMinorLocator, MultipleLocator

## The Pandas Library

Along with NumPy, the other great workhorse library for data science with Python is *pandas*. As with NumPy, we will not cover, or even come close to covering, the entire library today. Also, there are many aspects and facets of Pandas that you'll learn and internalize only by using it. However, today should - ideally - give you a starting point.

A note on nomenclature. The name "Pandas" did not originate as a reference to the deceptively cute type of bear, but as an abbreviation and amalgamation of "Panel Data".

### Pandas Basics

The dataset with which we'll play around is the "royal line" dataset, which was created from public sources and contains family history information about Elizabeth II, the Queen of England at the time the dataset was compiled. Note a few things about the command below:

* It uses the read_csv command from pandas, which is used to read in "comma separated value" files. This is a very common format for storing tabular data, as it's not tied to a particular program like, for example, Excel files are. However, Pandas also has functionality for reading in pretty much any type of data format commonly found in practice.

* The basic data object in Pandas is a "dataframe", which is created by the read_csv command. Frequently, a dataframe is denoted with the abbreviation "df", which we do here.

In [None]:
url = 'https://drive.google.com/uc?export=download&id=1k7k-ObAKWhW0iPbyCplogGMGsy29Mif5' #This URL points to the royal_line.csv file stored on my (Dylan's) Google Drive. You should all be able to access it.

In [None]:
df = pd.read_csv(url)

We can then take a look at this dataframe using the "print" command, which will by default print the first five and last five rows of the dataframe.

In [None]:
print(df)

If we just want to check out the first $n$ rows, we can use the "head" function:

In [None]:
print(df.head())
print(df.head(12))

And similarly the "tail" function returns the last $n$ rows:

In [None]:
print(df.tail(15))

You may have noticed in the printed dataframe an additional column of numbers located to the far left of the data. For instance, if we just call df.head() with the default option (which is 5), we get:

In [None]:
df.head()

That first column is an index column created by Pandas for the dataframe. It starts at $0$ and enumerates from there. Note this column is *not* in our original csv - it's created.

In this example, our original csv already has an index column, called ID, that starts at 1, so this additional data column is a bit redundant. We can specify an index column when we read in the data using the "index_col" parameter in read_csv.

In [None]:
df = pd.read_csv(url, index_col='ID')

Now, the index column is the "ID" column from the original dataset.

In [None]:
df.head()
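If you'd like to see the effect of "index_col" in isolation, here's a minimal sketch that doesn't need the download. The three-row CSV and the names csv_text, default_df, and indexed_df are made up for illustration; they are not part of the royal_line data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for royal_line.csv (made-up values).
csv_text = """ID,first_name,title
1,Anne,Princess
2,Bob,Duke
3,Cara,
"""

# Without index_col, pandas creates its own index starting at 0.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col='ID', the ID column becomes the index and is no longer a data column.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col='ID')

print(default_df.index.tolist())    # [0, 1, 2]
print(indexed_df.index.tolist())    # [1, 2, 3]
print(indexed_df.columns.tolist())  # ['first_name', 'title']
```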

If we only wanted to see the "title" column, we could do so as follows:

In [None]:
df['title'].head()

We can specify more than one column in this way as well. Note the *double brackets*. Think of this as the outer brackets accessing the dataframe, while the inner brackets specify a list.

In [None]:
df[['title', 'first_name']].tail()

In [None]:
df.columns

Note this returns an index object that behaves as an iterable list, so you could, for example, go through the columns with a for loop.

You can also find out more information about a dataframe using the "info" function:

In [None]:
df.info()

### Dropping or Removing Data

There are many reasons why you might want to drop or remove data from a dataframe. For example, it could be that only some data is relevant to your analysis. Or, it could be that some data is insufficient or corrupt, and leaving it in would lead to incorrect conclusions. Also, sometimes certain columns aren't of interest to the analysis in question.

If we want to drop entire columns, we can use the "drop" function and specify the columns as a list in the argument:

In [None]:
df.drop(columns = ['birth_place', 'death_place'])

However, while the dataframe above only has six columns, if we call it again we get:

In [None]:
df

What?!? I thought we dropped two!

What's going on here is that the drop command creates a new dataframe as its output. It *does not* modify the original dataframe. So, for example, we could say:

In [None]:
df2 = df.drop(columns = ['birth_place', 'death_place'])
df2

Here, df2 is the dataframe with those two columns dropped, while df is the original, unchanged dataframe.

If we want to actually make the change to the original dataframe, we can do this with the "inplace" argument.

In [None]:
df.drop(columns = ['birth_place', 'death_place'], inplace = True)
df

This is the same as:

In [None]:
df = pd.read_csv(url, index_col='ID')
df = df.drop(columns = ['birth_place', 'death_place'])
df

You can also drop rows by indicating specific indices with the *index* argument:

In [None]:
df.drop(index=[4,5,6], inplace=True)

Or, by using df.index, which avoids potential variations in index numbering and always references the first row starting with $0$.

In [None]:
df.drop(df.index[0], inplace=True)
df.drop(df.index[0], inplace=True)

Each line drops the first row of the dataframe, whatever that first row might be. So, these two lines together drop the first two rows. We can also use standard Python indexing and slicing notation to specify indices here.

In [None]:
df.drop(df.index[2:5], inplace=True)

The "drop_duplicates" function can be used to drop duplicate rows, while the "dropna" function drops every row that includes at least one NA entry. Be careful with this one, as it could potentially drop a lot of rows!

In [None]:
df.dropna(inplace=True)
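To make the "be careful" warning concrete, here's a small sketch on a made-up dataframe (the name toy and its values are hypothetical). It also shows the "subset" argument, which restricts dropna to the columns you actually care about.

```python
import numpy as np
import pandas as pd

# Made-up data: one exact duplicate row, and two rows that each have one NA.
toy = pd.DataFrame({'name': ['Ann', 'Ann', 'Bob', 'Cy'],
                    'year': [1900, 1900, np.nan, 1950],
                    'place': ['York', 'York', 'Bath', np.nan]})

no_dupes = toy.drop_duplicates()            # removes the repeated 'Ann' row
strict = no_dupes.dropna()                  # drops any row with at least one NA
by_year = no_dupes.dropna(subset=['year'])  # drops only rows where 'year' is NA

print(len(toy), len(no_dupes), len(strict), len(by_year))  # 4 3 1 2
```

Notice the plain dropna leaves only one row, while the subset version keeps two.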

### Adding, Modifying Data, and Mapping

Suppose we have a dataframe with NA values, and instead of dropping them we want to fill them with some values we determine. Here are a few ways to do that:

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace ALL NA entries with a fixed value:
df.fillna(0, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace the first 2 NA entries in each column with a fixed value:
df.fillna(0, limit=2, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace ALL NA first names with a fixed value:
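# Assign the result back to the column. Calling fillna with inplace=True on a
# selected column is deprecated in recent versions of Pandas and may not update df.
df['first_name'] = df['first_name'].fillna('no first name')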
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# replace specific columns with specific values provided by a dictionary:
values = {'first_name': 'no_first_name', 'last_name': 'no_last_name', 'sex': 'no_sex', 'title': 'no_title', 'birth_date': 'no_birth_date', 'birth_place': 'no_birth_place', 'death_date': 'no_death_date', 'death_place': 'no_death_place'}
df.fillna(value=values, inplace=True)
df

In [None]:
df = pd.read_csv(url, index_col='ID')
# ffill and pad: from first row to last row, propagate the most recent row that is not an NA forward until next valid row
df.ffill(inplace = True)

In [None]:
df = pd.read_csv(url, index_col='ID')
# bfill and backfill: like ffill, except from last row to first row
df.bfill(inplace=True)
df
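The forward and backward fills are easier to see on a tiny series. This is a sketch with made-up values (the names s, forward, and backward are just for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # each NA takes the most recent valid value above it
backward = s.bfill()  # each NA takes the next valid value below it

print(pd.DataFrame({'original': s, 'ffill': forward, 'bfill': backward}))
```

Note that ffill can't fill a leading NA (there's nothing above it), and bfill can't fill a trailing one.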

We can also create new columns from existing ones. For example:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['full_name'] = df['first_name'] + ' ' + df['last_name']
df

This illustrates a problem. Anytime we have an NaN value, the string concatenation is also NaN. How could we get around this? Well, we could create our own specific function that handles this, and then apply that to our dataframe:

In [None]:
def create_full_name(row):
    if isinstance(row['first_name'], str) and isinstance(row['last_name'], str):  # both first_name and last_name are strings
        result = row['first_name'] + ' ' + row['last_name']
    elif isinstance(row['first_name'], str):  # only first_name is a string
        result = row['first_name']
    elif isinstance(row['last_name'], str):  # only last_name is a string
        result = row['last_name']
    else:  # neither first_name nor last_name are strings, they are both NaN
        result = np.nan
    return result

df = pd.read_csv(url, index_col='ID')

df['full_name'] = df.apply(create_full_name, axis=1)
df

This "apply" operation calls the specified function once for each row or column of the dataframe and collects the results. You could also use Python lambda functions to create a function inline if needed. Note the option "axis = 1" means to process the data row by row. The option "axis = 0" (the default) would process the data column by column. For example:

In [None]:
# Create a dataframe that is a 6 x 2 array formed from a list of 12 numbers ordered from 0 to 11.
df = pd.DataFrame(np.arange(12).reshape(6,2), columns = ['column 1', 'column 2'])
print(df)

In [None]:
# Create a new dataframe that takes the maximum value of each column in the dataframe we just created.
new_df = df.apply(lambda column: column.max())
print(new_df)
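For comparison, here's a sketch of a lambda used row by row with "axis = 1". The names toy and row_sums are just for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['column 1', 'column 2'])

# axis=1: the lambda receives one row at a time (as a Series indexed by column name).
row_sums = toy.apply(lambda row: row['column 1'] + row['column 2'], axis=1)
print(row_sums.tolist())  # [1, 5, 9, 13, 17, 21]
```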

There are three main functions used to create or change data in dataframes: apply, map, and applymap.

In [None]:
df = pd.DataFrame(np.arange(8).reshape(4,2), columns = ['column 1', 'column 2'])
print(df)

The *apply* function can be used to apply a function along either axis of a dataframe.

In [None]:
print(df.apply(np.max))

In [None]:
print(df.apply(np.max, axis = 1))

The *map* function is a bit more limited and is used to apply a function element-wise to a series. It's more efficient than *apply* when used in this way.

In [None]:
print(df['column 1'].map(lambda x: x*2))

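The *applymap* function is used for element-wise operations across all elements of a dataframe, not just those within a series. (In Pandas 2.1 and later, the same operation is available as the dataframe's own *map* function, and *applymap* is deprecated in its favor, so the cell below may print a warning.)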

In [None]:
print(df.applymap(lambda x: x*2))
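If you want code that runs without a deprecation warning on newer versions of Pandas but still works on older ones, one possible sketch is to check which function is available. The hasattr check is my own workaround, not an official recommendation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['column 1', 'column 2'])

# Prefer DataFrame.map where it exists (pandas 2.1+); otherwise fall back to applymap.
if hasattr(pd.DataFrame, 'map'):
    doubled = toy.map(lambda x: x * 2)
else:
    doubled = toy.applymap(lambda x: x * 2)

print(doubled)
```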

### Changing Datatypes of Series or Columns

The datatypes for our "royal_line" examples have all been 'objects' because every column has had data that's been interpreted as a string. This is a general, default datatype that is quite encompassing in what it can handle. However, there are some functions, like maximum or average, that make sense for certain types of numeric data, but not for general data, and if we try to apply these functions to objects we'll have a bad time.

In [None]:
# Let's create a simple dataframe with three columns containing different types of data:
df = pd.DataFrame({'ints': [1,2,3,4], 'strings': ['a','b','c','d'], 'floats': [1.1, '2.2', '3.3', 4]})
print(df)
print(df.dtypes)

Here, the second and third column are interpreted as objects because both contained strings (the values 2.2 and 3.3 in the floats column were entered as strings).

To convert these to a different datatype, we can convert a single column, or multiple columns using a dictionary.

In [None]:
df['floats'] = df['floats'].astype(float)
print(df.dtypes)

In [None]:
convert_dict = {'ints': int, 'strings': str, 'floats': float}
df = df.astype(convert_dict)
print(df.dtypes)

In [None]:
# The following command would also work:
df['ints'] = df['ints'].astype(float)
print(df)
print(df.dtypes)

In [None]:
# But this one won't:
df['strings'] = df['strings'].astype(int)
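When only some entries can't be converted, one option is the to_numeric function with errors='coerce', which replaces anything it can't parse with NaN instead of raising an error. A minimal sketch on made-up values:

```python
import pandas as pd

mixed = pd.Series(['1', '2.5', 'three', '4'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numbers = pd.to_numeric(mixed, errors='coerce')
print(numbers.tolist())  # [1.0, 2.5, nan, 4.0]
```

Whether silently turning bad values into NaN is what you want depends on the analysis, so it's worth checking how many NaN values you created.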

Pandas has many built in conversion functions (to_datetime, to_timedelta, to_numeric, etc...) but you'll sometimes encounter data that's formatted in such a way that it's not possible to immediately convert it to the format you want using one of the built in functions. To deal with this, sometimes you need to write your own conversion function.

For example, if we check out the 'birth_date' column in our royal_line dataset, we see:

In [None]:
df = pd.read_csv(url, index_col='ID')
print(df['birth_date'])

A lot of NaN. OK, let's remove these and see what we get:

In [None]:
df.dropna(subset = ['birth_date'], inplace=True)
print(df['birth_date'])

If we then try to convert these values to datetimes we get:

In [None]:
df['birth_date'] = pd.to_datetime(df['birth_date']) # This fails

This generates errors due to several issues.

First, there are entries in the dataset formatted like the following: ABT 751. This notation means that the family history experts believe the person was born about (ABT) 751.

The second is an out of bounds nanosecond timestamp error related to Pandas only supporting approximately 580 years in the range from around 1677 to 2262.
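You can check these limits yourself. As far as I know, the bounds come from storing nanoseconds in a 64-bit integer, and Pandas exposes them as Timestamp.min and Timestamp.max:

```python
import pandas as pd

# The default nanosecond-resolution timestamp can only represent dates in this range.
print(pd.Timestamp.min)
print(pd.Timestamp.max)
print(pd.Timestamp.max.year - pd.Timestamp.min.year)  # span in years
```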

To get around these issues, we'll write and then apply our own function. Note we're not dropping the NaN values here.

In [None]:
def get_year(x):
    if pd.isna(x):
        year_result = np.nan  # if the birth_date is nan then return nan
    else:  # checking a number of edge cases in the data and stripping it out:
        if "ABT" in x:  # for example: ABT  1775
            x = x[3:]
            x = x.strip()
        if "/" in x:  #  For example: 1775/1776
            x = x[:x.find('/')]
        num_spaces = x.count(' ')
        if num_spaces == 0:  # only has the year
            year_result = int(x)
        elif num_spaces == 1:  # example: FEB 1337
            x = x[x.rfind(' ') + 1:]  # 'rfind' finds the last space. The 'r' stands for 'reverse.'
            if x.isnumeric():
                year_result = int(x)
            else:  # This could happen if there is only a day and month, like '10 JAN'
                year_result = np.nan
        elif num_spaces == 2:  # example: 16 FEB 1337
            x = x[x.rfind(' ') + 1:]
            year_result = int(x)
        else:
            year_result = np.nan  # There are a few other strange dates that aren't worth our time to fix, so just return nan for those.
    return year_result

df['birth_year'] = df['birth_date'].map(get_year)

print(df['birth_year'])
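As an aside, a regular expression can handle many of the same cases in one line. This is an alternative sketch, not the approach used above: it grabs the first standalone run of 3 or 4 digits, so it would behave differently from get_year on unusual entries (for example, years with fewer than 3 digits). The sample strings are made up to mirror the formats mentioned in the comments above.

```python
import numpy as np
import pandas as pd

# Made-up samples mirroring the formats handled by get_year.
dates = pd.Series(['ABT 751', '1775/1776', 'FEB 1337', '16 FEB 1337', '10 JAN', np.nan])

# Extract the first standalone 3- or 4-digit number; rows with no match become NaN.
year_text = dates.str.extract(r'\b(\d{3,4})\b', expand=False)
years = pd.to_numeric(year_text, errors='coerce')

print(years.tolist())  # [751.0, 1775.0, 1337.0, 1337.0, nan, nan]
```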

### Conditionals in Dataframes and Series

Conditionals are a very useful feature of Pandas which typically produce a NumPy array of Booleans or a Pandas Boolean series.

For example, consider the following code that produces True if the birth_year column (calculated above) is greater than or equal to 1990, and False otherwise.

In [None]:
boolean_mask = (df.birth_year >= 1990)
print(boolean_mask)

We can then use this to, for example, only print the entries for which the boolean is True.

In [None]:
print(df[boolean_mask][['first_name', 'last_name', 'birth_year']])

We can combine Boolean expressions using the logical operators & ("and"), | ("or"), and ~ ("not"). For example:

In [None]:
# na=False treats rows with a missing title as "does not contain 'Queen'".
print(df[(df.birth_year >= 1500) & (df.title.str.contains('Queen', na=False))][['first_name', 'title', 'birth_year']])
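Here's a small sketch showing all three operators on a made-up dataframe (the names toy, is_queen, and late are hypothetical). The parentheses matter when you write the conditions inline, because & and | bind more tightly than comparisons like >=.

```python
import pandas as pd

toy = pd.DataFrame({'title': ['Queen', 'King', None, 'Queen Consort'],
                    'birth_year': [1533, 1491, 1600, 1450]})

is_queen = toy['title'].str.contains('Queen', na=False)  # missing titles count as False
late = toy['birth_year'] >= 1500

print(toy[is_queen & late])  # title contains 'Queen' AND born 1500 or later
print(toy[is_queen | late])  # title contains 'Queen' OR born 1500 or later
print(toy[~is_queen])        # title does NOT contain 'Queen'
```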

### loc and iloc Functions

One of the most common tasks for data scientists is filtering information to more efficiently derive actionable insights. Marketers also like saying things like that.

We've seen the "head" and "tail" functions, which provide a quick, truncated view of the beginning or end, respectively, of the dataframe or series. But what if you're interested in examining results that are not necessarily at the very beginning or end of the dataset?

For this purpose, the loc function is designed to access rows and columns by label. In contrast, the iloc function is used to access rows and columns by integer value - the "i" stands for "integer".

A quick example of the difference is illustrated below, where both commands do the same thing:

In [None]:
print(df.loc[1])

In [None]:
print(df.iloc[0])

However, the following will produce an error:

In [None]:
print(df.loc[0]) #This fails

Because there is no row with label 0.

Now, row indices (labels) don't need to be unique. For example:

In [None]:
df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'], index=['cat', 42, 'stone', 42, 12345])# Five rows each with an associated index
print(df)

The index "42" appears twice, and some indices are numbers, while some are strings. Let's look at some examples:

In [None]:
print(df.loc[12345])

In [None]:
print(df.loc['stone'])

In [None]:
print(df.loc[42])

In [None]:
print(df.loc['A']) #This will fail

In [None]:
print(df.loc['cat':'stone'])

In [None]:
print(df.loc[['cat','stone']])

In [None]:
print(df.loc['stone', 'B'])

In [None]:
print(df.loc[df['A'] > 3])

Now let's take a look at some iloc examples:

In [None]:
print(df.iloc[0])

In [None]:
print(df.iloc[0:3])

In [None]:
print(df.iloc[[0,2,4]])

In [None]:
print(df.iloc[0,1])

In [None]:
print(df.iloc[0:3,1])
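One difference worth pointing out from the examples above: slicing with loc includes both endpoints, while slicing with iloc follows the usual Python convention and excludes the stop position. A quick sketch on a made-up dataframe with unique labels:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'],
                   index=['a', 'b', 'c', 'd', 'e'])

by_label = toy.loc['a':'c']   # label slice: includes both 'a' and 'c'
by_position = toy.iloc[0:2]   # position slice: excludes position 2

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```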

Returning to the royal family history data as an example, let's create a new column named "era". The "era" column signifies if a person was born in one of three distinct time periods: 'ancient', 'middle_years', or 'modern'. The following creates a new column and initially assigns the value 'unknown' to every entry within it:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)
df['era'] = 'unknown'

In [None]:
print(df)

The next question is how to divide the birth years. If we check out their maximum and minimum values, we get:

In [None]:
print(f"The earliest year = {df['birth_year'].min()} and the latest year = {df['birth_year'].max()}.")

So, 686 is the earliest year, and 1991 is the latest. This is a difference of 1991 - 686 = 1305 years, which divided by 3 gives us 435 years per era. So, the "ancient" royals are those born between 686 and 1121, the "middle_years" royals are those born between 1122 and 1555, and the "modern" royals are those born after 1555. (Not all that modern!) We can assign these three eras with the following code:

In [None]:
df.loc[df['birth_year'] < 1122, 'era'] = 'ancient'  # 686 – 1121
df.loc[(df['birth_year'] >= 1122) & (df['birth_year'] <= 1555), 'era'] = 'middle_years'  # 1122 – 1555
df.loc[df['birth_year'] > 1555, 'era'] = 'modern'  # after 1555
print(df)

We could have also done this using a custom function and the "map" utility.
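Here's a sketch of what that might look like. The function name assign_era and the sample birth years are made up for illustration; the cutoffs are the same ones used above.

```python
import numpy as np
import pandas as pd

def assign_era(year):
    """Map a birth year to an era label, using the same cutoffs as above."""
    if pd.isna(year):
        return 'unknown'
    if year < 1122:
        return 'ancient'
    if year <= 1555:
        return 'middle_years'
    return 'modern'

# Hypothetical birth years standing in for df['birth_year'].
birth_year = pd.Series([686, 1121, 1122, 1555, 1556, np.nan])
era = birth_year.map(assign_era)
print(era.tolist())
# ['ancient', 'ancient', 'middle_years', 'middle_years', 'modern', 'unknown']
```

On the real data this would be df['era'] = df['birth_year'].map(assign_era), with no need to initialize the column to 'unknown' first.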

### Reshaping with Pivot, Pivot_Table, Groupby, and Transpose

Frequently it is convenient or informative to restructure data contained in a dataframe, effectively organizing the data into a different shape or format. This section will cover the most common reshaping functions provided by Pandas.

The pivot function addresses the situation in which separate categories of a dataset feature are enumerated and highlighted using a cross tabular format. For example:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2],
                   'Type': ['new', 'used', 'new', 'used'],
                   'Price': [10, 5, 12, 7]})
print(df)

Calling the pivot function on this dataframe will rework the data into a more compact and usable format. In the example above, we want to reshape the data such that each car brand is represented on a single row. A use case for this particular reorganization would be a car salesperson who needs to quickly view all the prices of a given car brand for the different 'Type' categories.

In [None]:
p = df.pivot(index='Car', columns='Type', values='Price')
print(p)

The pivot function only works if there is at most one entry per cell in the result. Suppose we have the following dataframe:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})
print(df)

Invoking the pivot function on this dataframe will generate an error:

In [None]:
p = df.pivot(index='Car', columns='Type', values='Price') #This will fail

The reason for this is there are two price entries for the used version of car 2. In this case what we could do is use the "pivot_table" function with an aggregator, which specifies how to combine values when more than 1 occurs.

In [None]:
p = df.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')  # the string 'mean' avoids a warning that recent Pandas versions give for np.mean
print(p)

Pivot tables can result in immensely complex tabular formats with multiple indexes, multiple columns, and various aggregation functions specified. Today, we demonstrate only the basic single-index, single-column case.

Here's another example of a pivot_table using our royal family history dataset:

In [None]:
df = pd.read_csv(url, index_col='ID')
df['birth_year'] = df['birth_date'].map(get_year)

df.dropna(inplace=True, subset=['title', 'sex', 'birth_year'])
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc='mean')
print(p)

Here are two more examples. The first fills blank entries in the resulting pivot table after aggregation with $0$ instead of NaN, and uses two aggregate functions, mean and count.

In [None]:
p = df.pivot_table(index='title', columns='sex', values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

The second is similar to the first, but instead declares two indexes and two columns producing a much more complicated, nested output result.

In [None]:
p = df.pivot_table(index=['title', 'first_name'], columns=['sex', 'last_name'], values='birth_year', aggfunc=['mean', 'count'], fill_value=0)
print(p)

The groupby function's recasting of information is very similar to that of the pivot_table function. In general, the main difference is how the resulting output is shaped. Note it's a common mistake to create a group object without specifying an aggregating function like mean, sum, or std.

Consider the following:

In [None]:
df = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                   'Type': ['new', 'used', 'new', 'used', 'used'],
                   'Price': [10, 5, 12, 7, 6]})

g = df.groupby(by='Car')
print(g)

That's not particularly helpful. However, if we group by 'Car' and invoke the 'mean' function, we obtain a more useful result.

In [None]:
g = df.groupby(by='Car').mean() #This will fail
print(g)

Whoa! What happened there? Well, it's trying to apply the aggregate function to both the price, which is fine, and the type, which is not. This used to result in a warning and dropping the type column, but in more recent versions of Pandas it gives an error.

How can we get around this? Well, we can drop the *Type* column first.

In [None]:
g = df[['Car','Price']].groupby(by='Car').mean() #Need to toss out the "Type" column, or it will try to take its mean.
print(g)
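Two other ways to get the same result, sketched on the same data (the names cars, mean_price, and mean_numeric are just for illustration): select the column after grouping, or tell mean to ignore non-numeric columns with the numeric_only argument.

```python
import pandas as pd

cars = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                     'Type': ['new', 'used', 'new', 'used', 'used'],
                     'Price': [10, 5, 12, 7, 6]})

# Option 1: select the column to aggregate after grouping (returns a Series).
mean_price = cars.groupby('Car')['Price'].mean()

# Option 2: ask mean() to skip non-numeric columns (returns a DataFrame).
mean_numeric = cars.groupby('Car').mean(numeric_only=True)

print(mean_price)
print(mean_numeric)
```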

If instead of mean we wanted to use the more robust count we don't need to toss out the type:

In [None]:
g = df.groupby(by='Car').count()
print(g)

If we wanted to group by both 'Car' and 'Type', and use two different aggregation functions, we could do that:

In [None]:
g = df.groupby(by=['Car','Type']).agg(['mean','count'])
print(g)
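To see how the shape of a groupby result relates to a pivot table, here's a sketch that moves one level of the grouped index into the columns with "unstack" and compares it with pivot_table. The names cars, grouped, wide, and pivoted are just for illustration.

```python
import pandas as pd

cars = pd.DataFrame({'Car': [1, 1, 2, 2, 2],
                     'Type': ['new', 'used', 'new', 'used', 'used'],
                     'Price': [10, 5, 12, 7, 6]})

# groupby gives a "long" result: a Series indexed by (Car, Type) pairs.
grouped = cars.groupby(['Car', 'Type'])['Price'].mean()

# unstack moves 'Type' from the index to the columns, giving a "wide" table.
wide = grouped.unstack('Type')

# pivot_table produces the wide table directly.
pivoted = cars.pivot_table(index='Car', columns='Type', values='Price', aggfunc='mean')

print(grouped)
print(wide)
print(pivoted)
```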

Finally, as with NumPy arrays, the transpose function (or simply T) transposes a dataframe. For example:

In [None]:
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
print(f'Original\n{df}')

df = df.transpose()  # or df.T

print(f'\nTransposed:\n{df}')

## Matplotlib and Pyplot

Matplotlib is the JavaScript of Python.

I'll explain what I mean. It's a tool that was built with a fairly limited purpose for a relatively small audience, that has then grown far beyond the vision of its creators, and its current version is an amazing amalgam built upon a flimsy foundation.

Most notably, matplotlib was originally built to try to mirror the functionality and structure of the data visualization tools in MATLAB. This made a lot of sense 20 years ago, but is certainly not something that would be a goal if you were trying to build a data visualization tool in Python from scratch today. A fun and interesting article about its history and issues can be found [here](https://ryxcommar.com/2020/04/11/why-you-hate-matplotlib/).

So, if you find yourself struggling with matplotlib, don't beat yourself up. However, the most important thing to know about matplotlib is that it's widely used, and if you're doing data visualization in Python, for better or worse it's the foundation for everything, and you'll need to understand it.

"There are only two kinds of languages: the ones people complain about and the ones nobody uses". - [Bjarne Stroustrup](https://www.youtube.com/watch?v=JBjjnqG0BP8) (Inventor of C++)

OK, now that's out of the way, let's look at some of the basic visualization capabilities in matplotlib. First, we'll typically not import matplotlib per se, but rather pyplot, which is a collection of functions that make matplotlib work like MATLAB. We'll be using pyplot so frequently that from now on it will be joining the pantheon of libraries that we import at the start of everything we do, and which we always abbreviate according to convention.

This is done at the start of the notebook.

### The Plotting Paradigm

The paradigm we're working with in pyplot is essentially that all our functions are altering a figure, and this changes the figure's *state*. The state is saved and carried across function calls so calling multiple functions will essentially build on top of the state left from the previous functions.

For example, calling *plt.plot()* multiple times will plot multiple plots on top of each other, after which you can *plt.show()* them. Let's construct a simple line plot.

In [None]:
x = [1,2,3,4,5]
y = [1,4,5,7,2]
plt.plot(x,y)
plt.show()

Nice.

OK, let's do this again, but this time with two plots containing the same values on the horizontal ($x$) axis, but different values on the vertical ($y$) axis.

In [None]:
x = [1,2,3,4,5]
y = [1,4,5,7,2]
z = [1,6,2,5,1]
plt.plot(x,y)
plt.plot(x,z)
plt.show()

We've now got two plots displayed within the same figure, and along the same x-axis.

OK, what if we have not just different y-axis values, but different x-axis values?

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = [6,7,8,9,15]
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2)
plt.show()

Nice.

The x-axis adjusts its size to accommodate both x-value inputs.

But, what happens if no such sensible accommodation can be made? For example, we're not just restricted to having numeric values on our x-axis. We can have categorical ones as well.

In [None]:
x = ['a','b','c','d','e']
y = [1,6,2,5,1]
plt.plot(x,y)
plt.show()

What happens if we combine this and a plot with numeric x-values?

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = ['a','b','c','d','e']
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2)
plt.show()

It does its best. It displays on the x-axis the most recent set of x-values that make sense, and tries to interpret earlier x-values appropriately. In this case, it associates the value x = 1 with the index 1 categorical term.

The important thing to note is that matplotlib really *tries* here. It doesn't just throw an error and say what you gave it doesn't make sense. Whether this behavior is good or bad is a matter of debate, but that's what it does.

The plots above defaulted to line graphs using default colors. These defaults are not in any way mandatory, and can be modified. Like, a lot.

For example, if instead of the above we coded this:

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = ['a','b','c','d','e']
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2, 'ro')
plt.show()

What's going on here? Well, when we specified 'ro' in the plot command, this specified:

* 'r' - Color (r is red, g is green, b is blue, ...)
* 'o' - Shape (o is circle, - is line, ...)

If instead we wanted green "x" marks, we could do that too!

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = ['a','b','c','d','e']
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2, 'gx')
plt.show()
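The format string can also combine a color, a marker, and a line style all at once. Here's a small sketch; capturing the returned line object in the variable line is only so we can inspect it, and isn't needed for plotting.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 5, 7, 2]

# 'go--' = green ('g'), circle markers ('o'), dashed line ('--')
(line,) = plt.plot(x, y, 'go--')
plt.show()

print(line.get_marker(), line.get_linestyle())
```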

A list of the matplotlib markers and their corresponding terms can be found [here](https://matplotlib.org/stable/api/markers_api.html). A list of matplotlib colors and their corresponding terms can be found [here](https://matplotlib.org/stable/gallery/color/named_colors.html). Have fun.

The marks and colors are not by any means the only things we can customize. We could, for example, add labels to our axes:

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = ['a','b','c','d','e']
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2, 'gx')
plt.ylabel('Y-Axis Label')
plt.xlabel('X-Axis Label')
plt.show()

This is all very nice for quick, simple plots. We can write just a few lines of code and have a working visualization, letting matplotlib auto-configure the elements of the plot for us.

Now, this functional style approach we've used so far comes from matplotlib's MATLAB antecedents. It's what we'll use for most of the class, but it should be noted that there are also more object-oriented approaches to using matplotlib - particularly regarding "getter" and "setter" functions. We'll demonstrate a few of these here, although we'll mostly stick with the functional paradigm. But keep in mind that essentially everything we'll do under the functional paradigm has an object-oriented parallel and vice-versa. These two approaches do exactly the same thing, just with different commands.

If we wanted to create a line plot like our first one (here with slightly different y-values) using a more object-oriented approach, we could do so with the code below:

In [None]:
figure = plt.figure()
ax = figure.add_axes([0,0,1,1])

x = [1,2,3,4,5]
y = [1,7,3,9,3]

ax.plot(x,y)
plt.show()
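To make the functional/object-oriented parallel concrete, here's a sketch of the labeled plot from earlier written in the object-oriented style. It uses plt.subplots(), a common shortcut that creates a figure and a single set of axes in one call, rather than add_axes as above; the comments note the pyplot function each line corresponds to.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [1, 4, 5, 7, 2]
z = [1, 6, 2, 5, 1]

fig, ax = plt.subplots()       # one figure containing one set of axes
ax.plot(x, y)                  # parallels plt.plot(x, y)
ax.plot(x, z, 'gx')            # parallels plt.plot(x, z, 'gx')
ax.set_xlabel('X-Axis Label')  # parallels plt.xlabel('X-Axis Label')
ax.set_ylabel('Y-Axis Label')  # parallels plt.ylabel('Y-Axis Label')
plt.show()
```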

We'll get into the details of this in a bit, but first, let's talk about the objects. What are the objects in a plot? Well, the good folks at matplotlib generated a wonderful figure that points out many of the objects, and even provided the code for creating it! Check it out below (don't worry about understanding the code right now).

In [None]:
royal_blue = [0, 20/256, 82/256]


# make the figure

np.random.seed(24601) #Anybody get the reference?

X = np.linspace(0.5, 3.5, 100)
Y1 = 3+np.cos(X)
Y2 = 1+np.cos(1+X/0.75)/2
Y3 = np.random.uniform(Y1, Y2, len(X))

fig = plt.figure(figsize=(7.5, 7.5))
ax = fig.add_axes([0.2, 0.17, 0.68, 0.7], aspect=1)

ax.xaxis.set_major_locator(MultipleLocator(1.000))
ax.xaxis.set_minor_locator(AutoMinorLocator(4))
ax.yaxis.set_major_locator(MultipleLocator(1.000))
ax.yaxis.set_minor_locator(AutoMinorLocator(4))
ax.xaxis.set_minor_formatter("{x:.2f}")

ax.set_xlim(0, 4)
ax.set_ylim(0, 4)

ax.tick_params(which='major', width=1.0, length=10, labelsize=14)
ax.tick_params(which='minor', width=1.0, length=5, labelsize=10,
               labelcolor='0.25')

ax.grid(linestyle="--", linewidth=0.5, color='.25', zorder=-10)

ax.plot(X, Y1, c='C0', lw=2.5, label="Blue signal", zorder=10)
ax.plot(X, Y2, c='C1', lw=2.5, label="Orange signal")
ax.plot(X[::3], Y3[::3], linewidth=0, markersize=9,
        marker='s', markerfacecolor='none', markeredgecolor='C4',
        markeredgewidth=2.5)

ax.set_title("Anatomy of a figure", fontsize=20, verticalalignment='bottom')
ax.set_xlabel("x Axis label", fontsize=14)
ax.set_ylabel("y Axis label", fontsize=14)
ax.legend(loc="upper right", fontsize=14)


# Annotate the figure

def annotate(x, y, text, code):
    # Circle marker
    c = Circle((x, y), radius=0.15, clip_on=False, zorder=10, linewidth=2.5,
               edgecolor=royal_blue + [0.6], facecolor='none',
               path_effects=[withStroke(linewidth=7, foreground='white')])
    ax.add_artist(c)

    # use path_effects as a background for the texts
    # draw the path_effects and the colored text separately so that the
    # path_effects cannot clip other texts
    for path_effects in [[withStroke(linewidth=7, foreground='white')], []]:
        color = 'white' if path_effects else royal_blue
        ax.text(x, y-0.2, text, zorder=100,
                ha='center', va='top', weight='bold', color=color,
                style='italic', fontfamily='monospace',
                path_effects=path_effects)

        color = 'white' if path_effects else 'black'
        ax.text(x, y-0.33, code, zorder=100,
                ha='center', va='top', weight='normal', color=color,
                fontfamily='monospace', fontsize='medium',
                path_effects=path_effects)


annotate(3.5, -0.13, "Minor tick label", "ax.xaxis.set_minor_formatter")
annotate(-0.03, 1.0, "Major tick", "ax.yaxis.set_major_locator")
annotate(0.00, 3.75, "Minor tick", "ax.yaxis.set_minor_locator")
annotate(-0.15, 3.00, "Major tick label", "ax.yaxis.set_major_formatter")
annotate(1.68, -0.39, "xlabel", "ax.set_xlabel")
annotate(-0.38, 1.67, "ylabel", "ax.set_ylabel")
annotate(1.52, 4.15, "Title", "ax.set_title")
annotate(1.75, 2.80, "Line", "ax.plot")
annotate(2.25, 1.54, "Markers", "ax.scatter")
annotate(3.00, 3.00, "Grid", "ax.grid")
annotate(3.60, 3.58, "Legend", "ax.legend")
annotate(2.5, 0.55, "Axes", "fig.subplots")
annotate(4, 4.5, "Figure", "plt.figure")
annotate(0.65, 0.01, "x Axis", "ax.xaxis")
annotate(0, 0.36, "y Axis", "ax.yaxis")
annotate(4.0, 0.7, "Spine", "ax.spines")

# frame around figure
fig.patch.set(linewidth=4, edgecolor='0.5')
plt.show()

* **Figure** - The figure contains *everything* that we'll be seeing within it. Each *Figure* object can have one or more *Axes* objects.
* **Axes** - Although the name *Axes* implies the actual axes of the plot, the *Axes* object can practically be seen as *the plot itself*. An *Axes* sits snug in the *Figure* and contains elements such as *Titles, Legends, Grids, etc.* Since a *Figure* can have multiple *Axes* objects, each would actually be a plot for itself. Keep in mind that in the previous example, where we've used *plot()* two times, we haven't created multiple *Axes* objects. Both of these lines were plotted on the same *Axes* object, as *plot* doesn't create a plot, it, well, plots.
* **Title** - The title of the *Axes* object.
* **Legend** - The legend of the *Axes* object.
* **Ticks** - Sub-divided into *major ticks* and *minor ticks*. These are the ticks on the X-axis and Y-axis we've seen in the examples above.
* **Labels** - Labels can be set for the X and Y-axis, or for ticks. They're used to, well, label certain elements of the plot for a finer user experience.
* **Grids** - Optional lines in the background of the plot, that help the viewer to distinguish between similar X and Y values, based on the frequency of grid lines.
* **Lines / Markers** - The actual lines / markers that are used to express records / data of a plot. Most of the time, you'll use lines to plot continuous data, while you'll use markers for discrete data.
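
To make this hierarchy a little more concrete, here's a small sketch that builds a figure and then pokes at the objects inside it. The data and label strings are made up for illustration. Note that calling *plot()* twice adds two lines to the same *Axes*. It doesn't create a second *Axes*.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

x = [1, 2, 3, 4, 5]
ax.plot(x, [1, 7, 3, 9, 3], label="first")
ax.plot(x, [1, 6, 2, 4, 5], label="second")
ax.set_title("Two lines, one Axes")
ax.legend()

# The Figure keeps a list of its Axes
print(len(fig.axes))

# The Axes keeps track of the lines drawn on it
print(len(ax.get_lines()))

# Title, legend, and the two axis objects all hang off the Axes
print(ax.title.get_text())
print(type(ax.get_legend()).__name__)
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)

plt.show()
```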

Alright, as mentioned, in the plots we've constructed so far we've only had one *Axes* object, even though we've drawn multiple lines on those axes. Suppose that instead of plotting everything on one *Axes*, we want to use several, but stay within the same figure. How might we do this?

Well, one way to do this is with the *add_subplot()* method of the *Figure*. It creates an *Axes* object on a *grid*, given the number of rows, the number of columns, and the index of the position to fill (counting from 1).

In [None]:
figure = plt.figure()
ax = figure.add_subplot(1,1,1)

x = [1,2,3,4,5]
y = [1,7,3,9,3]

ax.plot(x,y)
plt.show()

If we instead use a grid with one row and two columns, we can put one *Axes* at index 1 and another at index 2:

In [None]:
figure = plt.figure()
ax1 = figure.add_subplot(1,2,1)
ax2 = figure.add_subplot(1,2,2)

x = [1,2,3,4,5]
y = [1,7,3,9,3]
z = [1,6,2,4,5]

ax1.plot(x,y)
ax2.plot(x,z)
plt.show()
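
To see how the index works on a bigger grid, here's a small sketch on a 2x2 grid. The titles are just there so each *Axes* announces which index created it. The index counts left to right along the first row, then continues on the next row.

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# On a 2x2 grid the index runs from 1 to 4
axes = []
for index in range(1, 5):
    ax = fig.add_subplot(2, 2, index)
    ax.set_title(f"add_subplot(2, 2, {index})")
    axes.append(ax)

plt.show()
```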

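Another way is the *plt.subplots()* function, which creates a *Figure* and an *Axes* in a single call and returns both:

In [None]: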
fig, ax = plt.subplots()

x = [1,2,3,4,5]
y = [1,7,3,9,3]

ax.plot(x,y)
plt.show()

If we'd like to work with more than one subplot, we simply specify the number of them in *subplots()*. Let's create 4.

In [None]:
fig, ax = plt.subplots(4)

x = [5,4,2,6,2]
y = [1,7,3,9,3]
z = [1,6,2,4,5]
n = [7,3,2,5,2]

ax[0].plot(x)
ax[1].plot(y)
ax[2].plot(z)
ax[3].plot(n)

plt.show()

This creates four different *Axes* and stacks them in separate rows of a single column. We can also unpack them into separately named variables, instead of indexing into a NumPy array of four objects.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4)

ax1.plot(x)
ax2.plot(y)
ax3.plot(z)
ax4.plot(n)

plt.show()
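
As an aside, *subplots()* also accepts a number of rows and a number of columns. Here's a minimal sketch with a 2x2 grid, reusing the same four lists as above (gathered into a hypothetical `data` list for convenience). In this case the returned *Axes* come back as a two-dimensional NumPy array, indexed by row and column.

```python
import matplotlib.pyplot as plt

fig, axs = plt.subplots(2, 2)

# axs is a 2D NumPy array of Axes objects
print(type(axs).__name__, axs.shape)

data = [[5, 4, 2, 6, 2], [1, 7, 3, 9, 3], [1, 6, 2, 4, 5], [7, 3, 2, 5, 2]]

# Index a single Axes by [row, column]
axs[0, 1].set_title("row 0, column 1")

# axs.flat walks through all four Axes, row by row
for ax, values in zip(axs.flat, data):
    ax.plot(values)

plt.show()
```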

Creating these 4 plots is a bit of a squeeze for the default figure size. However, all of this is customizable, which we'll now go into in a bit more detail.

### Basic Matplotlib Customization ###

A good portion of matplotlib's popularity comes from its customizability.

There's a ton that can be customized, and a ton of options for customization. We won't and can't go over all of them. Instead, we'll explore a few common operations, such as changing the figure and the font size.

Going back to our plot above, we note it's squeezed and it doesn't look good. Let's change the figure size to allow all four of these *Axes* objects to fit nicely.

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, figsize=(6,8))

ax1.plot(x)
ax2.plot(y)
ax3.plot(z)
ax4.plot(n)

plt.show()

This created a figure object that's 6 inches wide by 8 inches tall.

Instead of using the *figsize* argument, we can also set the height and width of a figure after it's created. This can be done either via the *set()* function with the *figheight* and *figwidth* arguments, or via the *set_figheight()* and *set_figwidth()* functions. Many ways to do the same thing. Let's see some examples:

In [None]:
#Example using the set() function

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()

fig.set(figheight = 5, figwidth = 10)

# Adds subplot on index 1
ax1 = fig.add_subplot(121)
# Add subplot on index 2
ax2 = fig.add_subplot(122)

ax1.plot(x,y)
ax2.plot(x,z)

plt.show()

In [None]:
#Example using the set_figheight() and set_figwidth() functions.

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()

fig.set_figheight(5)
fig.set_figwidth(10)

# Adds subplot on index 1
ax1 = fig.add_subplot(121)
# Add subplot on index 2
ax2 = fig.add_subplot(122)

ax1.plot(x,y)
ax2.plot(x,z)

plt.show()

We can also use the *set_size_inches()* method of the *Figure* object, which takes the width first and then the height:

In [None]:
#Example using the set_size_inches() function.

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

fig = plt.figure()

fig.set_size_inches(10,5)

# Adds subplot on index 1
ax1 = fig.add_subplot(121)
# Add subplot on index 2
ax2 = fig.add_subplot(122)

ax1.plot(x,y)
ax2.plot(x,z)

plt.show()

Please note there's no right or wrong approach here - matplotlib allows you to customize the figures in many ways, because it's anticipated you might want to change the parameters in many ways. Also, because it wasn't planned out very well, and the community has just kept adding different ways to do the same thing aligned with different structural goals.
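
If you'd like to convince yourself that these approaches really do end up in the same place, here's a quick sketch that sizes four empty figures four different ways and reads the result back with *get_size_inches()*. The variable names are just for illustration.

```python
import matplotlib.pyplot as plt

# 1. The figsize argument at creation time
fig_a = plt.figure(figsize=(10, 5))

# 2. The generic set() function
fig_b = plt.figure()
fig_b.set(figheight=5, figwidth=10)

# 3. The dedicated setters
fig_c = plt.figure()
fig_c.set_figwidth(10)
fig_c.set_figheight(5)

# 4. set_size_inches(), width first
fig_d = plt.figure()
fig_d.set_size_inches(10, 5)

# Read the (width, height) of each figure back, in inches
sizes = [tuple(fig.get_size_inches()) for fig in (fig_a, fig_b, fig_c, fig_d)]
print(sizes)

# These figures are empty, so close them rather than show them
plt.close('all')
```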

Now let's talk a bit about text. Adding text to plots is a very common task. This text could be labels for axes, titles for plots, or even values of certain markers in the form of tooltips. We use text to give further context to the numerical and visual data on the plots.

One of the key classes here, unsurprisingly, is the *Text* class, which takes care of the parsing, storing, and drawing of textual data on plots, given certain coordinates. All the methods we'll use to add labels, titles, etc. rely on the functionality of this class.

For example, when we call:

In [None]:
x_1 = [1,2,3,4,5]
y_1 = [1,4,5,7,2]
x_2 = ['a','b','c','d','e']
y_2 = [1,6,2,5,1]
plt.plot(x_1,y_1)
plt.plot(x_2,y_2, 'gx')
plt.ylabel('Y-Axis Label')
plt.xlabel('X-Axis Label')
plt.show()

The "xlabel" and "ylabel" functions constructed *Text* instances with default parameters. We'll rarely work manually with *Text* instances. Most of the time we'll be using helper functions that construct instances and assign them to appropriate positions.
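
Just to show that there really is a *Text* instance under the hood, here's a small sketch that grabs the object returned by *set_xlabel()* and tweaks it directly. The font size and color are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.text import Text

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5], [1, 4, 5, 7, 2])

# The helper function returns the Text instance it set up
label = ax.set_xlabel('X-Axis Label')
print(isinstance(label, Text))

# Because it's an ordinary object, we can change it after the fact
label.set_fontsize(14)
label.set_color('green')
print(label.get_text(), label.get_fontsize(), label.get_color())

plt.show()
```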

Let's create a plot with some text:

In [None]:
fig, ax = plt.subplots()

fig.suptitle('This is the Figure-level Suptitle')
ax.set_title('This is the Axes-level Title')
ax.set_xlabel('X-label')
ax.set_ylabel('Y-label')
ax.text(0.5,0.5, 'This is generic text')
ax.annotate('This is an annotation, with an arrow between \n itself and generic text',
            xy = (0.625, 0.5),
            xytext = (0.25,0.25),
            arrowprops=dict(arrowstyle='<->',
                            connectionstyle='arc3, rad=0.15'))
plt.show()

The *suptitle()* is added at the Figure-level, and is above all of its subplots. The *title* and labels can be set on the Axes-level, where each *Axes* can have separate titles and labels.

We also used the generic *text()* method here. Its *x* and *y* arguments are in data coordinates by default: they refer to *actual* values on the plot, not fractions of the width and height. An empty *Axes* runs from 0 to 1 on both axes by default, so in this case our text starts at the very center.

Finally, regarding the *annotate()* method: the *xy* tuple is the point being annotated (where the arrow points), and *xytext* is the position of the text. The *arrowprops* argument accepts a dictionary with various properties you can use to customize the arrows. You can find details about the various options [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.annotate.html).
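
Returning to *text()* for a moment: if you do want to position text as a fraction of the *Axes* rather than by data values, one option is to pass the *transform* argument. Here's a minimal sketch, with made-up data and strings, that places one piece of text each way.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 10], [0, 100])

# Default: data coordinates, so this starts at x=5, y=50 on the plot
in_data = ax.text(5, 50, 'data coordinates (5, 50)')

# With transform=ax.transAxes, (0, 0) is the bottom-left corner of the
# Axes and (1, 1) is the top-right, whatever the data range happens to be
in_axes = ax.text(0.05, 0.9, 'axes coordinates (0.05, 0.9)',
                  transform=ax.transAxes)

plt.show()
```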

Annotations are not the only way to provide observers with a way to differentiate and interpret the data on the plot. We commonly color-code certain variables, so they can easily be differentiated with a simple glance. You can add annotations for these - the green line is the age variable, the red line is the population variable, etc., but this can become unwieldy. Also, it's a bit weird to annotate features. Annotations are typically used to point out certain observations.

To point out features, we typically use a legend, which could have, for example, a list of colors and a list of labels for these colors.

Let's first create a simple plot with two variables, each with a different color.

In [None]:
fig, ax = plt.subplots()

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue')
ax.plot(z, color = 'black')

plt.show()

Next, let's add a legend to this plot. First, we'll want to label our variables so we can refer to them in the legend. Once this is done, all we need to do is call the legend() function on the *Axes* instance.

In [None]:
fig, ax = plt.subplots()

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend()

plt.show()

Matplotlib does its best to fit the legend in a place where it'll obstruct the least of the plot.
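
If you're curious what the legend actually picks up, here's a small sketch using *get_legend_handles_labels()*. The third, unlabeled line is an addition for illustration: since it has no label, it isn't included in the legend.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

x = np.arange(0, 10, .1)

ax.plot(np.sin(x), color='blue', label='Sine wave')
ax.plot(np.cos(x), color='black', label='Cosine wave')
ax.plot(np.sin(x) + np.cos(x), color='gray')  # no label given

# The handles are the line objects, the labels are the strings we passed in
handles, labels = ax.get_legend_handles_labels()
print(labels)

leg = ax.legend()

plt.show()
```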

But, suppose we don't like where the legend has been placed. Let's place it somewhere else. Specifically, let's place it in the top-right corner, and let's remove the border.

In [None]:
fig, ax = plt.subplots()

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(loc='upper right', frameon=False)

plt.show()

Meh, that's not great. Let's maybe stretch out our figure a bit.

In [None]:
fig, ax = plt.subplots()

fig.set_figwidth(12)

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(loc='upper right', frameon=False)

plt.show()

Nice.

Sometimes, it's tricky to place the legend within the border box of a plot. There can be a lot going on within the plot! In these cases, you can place the legend *outside* the axes. This is done via the *bbox_to_anchor* argument, which specifies where we want to anchor the legend.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(bbox_to_anchor=(0.5, -0.10))

plt.show()

There are a few other parameters we can specify in the legend call as well. For example, the number of columns (*ncol*). If we set this to $2$ we would get:

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(bbox_to_anchor=(0.5, -0.10), ncol=2)

plt.show()

This doesn't look great. We can fix it by setting *loc='center'*, which places the center of the legend box at the anchor point given by *bbox_to_anchor*.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(loc='center', bbox_to_anchor=(0.5, -0.10), ncol=2, shadow = True) #Note the shadow

plt.show()

Nice.
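
The same two arguments can also park the legend to the right of the plot. Here's a sketch under the assumption that we want it just outside the top-right corner. The anchor `(1.02, 1)` and the `right=0.8` margin are illustrative values, not magic numbers. The anchor is given in axes coordinates, and *loc* names the part of the legend box that gets pinned to it.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(0, 10, .1)

ax.plot(np.sin(x), color='blue', label='Sine wave')
ax.plot(np.cos(x), color='black', label='Cosine wave')

# Pin the legend's upper-left corner just to the right of the Axes' top edge
leg = ax.legend(loc='upper left', bbox_to_anchor=(1.02, 1), frameon=False)

# Shrink the Axes a little so there's room for the legend inside the figure
fig.subplots_adjust(right=0.8)

plt.show()
```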

Let's add a title to our legend, and make the font size larger.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(title='Functions', fontsize=12, title_fontsize=14, loc='center', bbox_to_anchor=(0.5, -0.15), ncol=2)

plt.show()

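Here, we changed the font size for the legend and for the legend title. We can change the default font size for everything by modifying the "runtime configuration parameters", or rcParams, specifically "font.size". Note this is a global setting, so it also affects any plots created later in the session.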

In [None]:
fig, ax = plt.subplots(figsize=(12,6))

x = np.arange(0,10,.1)
y = np.sin(x)
z = np.cos(x)

plt.rcParams['font.size'] = '16'

ax.plot(y, color = 'blue', label ='Sine wave')
ax.plot(z, color = 'black', label='Cosine wave')
leg = ax.legend(title='Functions', fontsize=12, title_fontsize=14, loc='center', bbox_to_anchor=(0.5, -0.15), ncol=2)

plt.show()
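
If you'd rather not change the font size for every later plot, one option is *plt.rc_context()*, which applies the settings only inside a *with* block. Here's a minimal sketch with made-up data and a made-up title. The names `before`, `inside`, and `after` are just there to show the setting being restored.

```python
import matplotlib.pyplot as plt

before = plt.rcParams['font.size']

# The new font size applies only inside this block
with plt.rc_context({'font.size': 16}):
    inside = plt.rcParams['font.size']

    fig, ax = plt.subplots(figsize=(12, 6))
    ax.plot([1, 2, 3], [1, 4, 9])
    ax.set_title('Larger fonts, only in this block')

    plt.show()

# Once the block ends, the previous value is back
after = plt.rcParams['font.size']
print(before, inside, after)
```

If you've already changed rcParams globally and want to undo it, *plt.rcdefaults()* restores matplotlib's built-in defaults.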

Alright, so this was a quick look at some of the basic plotting capabilities of matplotlib and pyplot. There's much, much more but just armed with these basics you can still create cool things - which is what you'll be doing in Assignment 2.

## References

At the end of most of the lecture notes I like to provide references for further reading. Please note these are generally here in case you're interested, but you're not required to check them out.

Sometimes they may explore a topic in more depth, sometimes they might provide additional learning resources, and sometimes they might just be fun (I'll try to provide a link to a song each lecture).

* [Introduction to NumPy](https://numpy.org/doc/stable/user/absolute_beginners.html)

* [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html)

* [Pyplot tutorial](https://matplotlib.org/stable/tutorials/pyplot.html)

* Song Of The Day (SOTD) - [Royals](https://youtu.be/nlcIKh6sBtc?si=duDiNtjAzRyh5OUJ) by Lorde

* Why did I use $24601$ as a random seed? It's a reference to [this](https://youtu.be/rEi9wgbob-0?si=yHM-24r4yH8gR3S-).