# Lab 2.0<br>Introducing Numpy & Pandas

## BUS152 - Spring 2024 <br> Brian Brady

### __Objectives__

This lesson will introduce us to two extremely useful and popular libraries that extend what base Python can do.  We can only scratch the surface of everything these two libraries can do, but the plan is to introduce you to the basics now, and you'll pick up more advanced functionality as you begin working on real projects.

Below are the objectives for this lesson.

- Learn why we need to use external libraries and functions
- Learn how to load and call external libraries and functions
- Learn the basics of Numpy and Pandas

### __Loading Libraries__

Recall that everything we've been doing so far is with base Python, and Python is an open source, general purpose coding language.  What this means is that the source code can be adapted and extended to do additional things not included in the base distribution.  This is exactly what makes the language so powerful.

So in our case, for data analysis and mathematical modeling, we'll need to get comfortable working with a few extremely common add-on libraries:  Numpy and Pandas.  There will be others that are pretty common, but these first two are the basis on which everything else is built.

First things first, you'll need to _load_ whichever libraries you want to use every time you boot up your favorite IDE or start your Jupyter kernel.  You will almost always need to use Numpy, Pandas, and a few others we'll get to later, for every single project, so you might as well get in the habit of loading them straight away.  The general convention is to load all of your libraries at the top of your script and use common aliases (i.e. `np` for numpy, `pd` for pandas).

To load a library, we use the `import` command, followed by the library name, `numpy`, and then `as alias` if applicable.  This only has to be done once per file, notebook, or session.

In [None]:
# Numpy is imported with the alias "np", and Pandas as "pd", by convention
import numpy as np
import pandas as pd

### __Library Functionality__

We now have access to all of the functions that come with the `numpy` and `pandas` installations.  If we want to know what's available in a library, we can see that simply by using the `dir()` function.

In [None]:
# Truncated output to only the last 10 functions in the directory list
dir(np)[-10:]

Now, if you want to see what one of these functions does, try typing the library alias/name followed by a period, then the function, and finally a question mark.

In [None]:
np.zeros?
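The `?` shortcut is a Jupyter/IPython feature.  In a plain Python script, a rough equivalent is the built-in `help()` function or the function's `__doc__` attribute.  A minimal sketch (the variable names are just for illustration):

```python
import numpy as np

# Every documented function carries its docstring in the __doc__ attribute
doc = np.zeros.__doc__

# Print only the first few lines so the output stays short
first_lines = doc.strip().splitlines()[:3]
print("\n".join(first_lines))

# help(np.zeros) would print the full docstring instead
```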

We can see from the docstring output that `np.zeros()` is designed to "Return a new array of given shape and type, filled with zeros".  Google will also be your friend here when you want to know how to use some function, e.g. google "numpy zeros function examples" and you'll have plenty of tutorials to learn from in no time.

So to use `np.zeros()`, we can see under the "Parameters" section that all we need to do is to supply it with at least one number for the shape of the array we want.  All of the other parameters are optional.  Let's try it.

In [None]:
np.zeros(3)
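To see what the shape and the optional parameters do, here is a small sketch that passes a tuple for a 2-dimensional shape and sets the optional `dtype` parameter (the variable names `z1`, `z2`, `z3` are made up for this example):

```python
import numpy as np

# A single integer gives a 1-dimensional array of that length
z1 = np.zeros(3)

# A tuple gives a multidimensional array: here 2 rows x 4 columns
z2 = np.zeros((2, 4))

# The optional dtype parameter changes the element type (the default is float)
z3 = np.zeros(3, dtype=int)

print(z1.shape, z2.shape, z3.dtype)
```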

So anytime we want to use something from numpy, we have to prepend the function with `np.`.  You'll get the hang of this in no time.

Let's move on to what Numpy and Pandas actually do for us.

### __Numpy__

Numpy, short for Numerical Python, is one of the foundational libraries for numerical computing in Python.  Nearly all of the mathematical and scientific packages and capabilities will be built on top of this library (McKinney, 2022).  It is extremely fast and efficient, and will outperform native Python equivalents so we should try to use it whenever possible.  Numpy is an expansive topic, but below are the main points and a taste of what we're going to be using it for.

- ndarray, which are efficient multidimensional arrays providing fast array-oriented arithmetic operations and flexible broadcasting capabilities (operations between different sized arrays)
- Mathematical functions for fast operations on entire arrays of data without having to write loops
- Linear algebra and random number generation

Numpy by itself does not provide modeling or higher-level scientific functionality; however, it is an indispensable foundation for the libraries that do, when we get to them.
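To make the "no loops" point concrete, here is a minimal sketch comparing a base Python loop with the equivalent whole-array operation.  The `prices` values are made-up sample data.

```python
import numpy as np

prices = [10.0, 20.0, 30.0, 40.0]   # hypothetical sample data

# Base Python: visit each element one at a time
doubled_list = [p * 2 for p in prices]

# Numpy: one operation applied to the entire array at once
prices_arr = np.array(prices)
doubled_arr = prices_arr * 2

# Built-in array methods replace hand-written accumulation loops
total = prices_arr.sum()
average = prices_arr.mean()

print(doubled_arr, total, average)
```

Both approaches give the same numbers; the array version is shorter, and on large data it is typically much faster.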

The first thing we always need to do before using any library utilities is to load our library.

In [None]:
# Import numpy with the standard "np" alias
import numpy as np

##### __ndarray__

Let's start with the basic feature of Numpy: N-dimensional arrays

We can create them in any number of ways.  Below are several examples of a 1-dimensional array.

In [None]:
# Using the numpy arange() ("array range") function to generate the numbers 1 through 10 (the stop value, 11, is excluded)
np.arange(1,11)

In [None]:
# Recall working with Python's native range() function earlier?
# Notice the difference.  range() returns a range object (converted to a list here), while np.arange() returns an array.
list(range(1,11))

How about explicitly calling the `array()` function on a list?

In [None]:
np.array([1,2,3,4,5,6,7,8,9,10])

How about a multidimensional array?  Let's try a 3x3 (rows x columns) array of 3 separate lists and see what we get.

In [None]:
arr = np.array([ [1,2,3],
                 [4,5,6],
                 [7,8,9]
               ])
arr

In [None]:
# shape will tell us how many rows and columns we have
arr.shape
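Besides `shape`, arrays carry a few other descriptive attributes, and `reshape()` is another common way to build a multidimensional array from a 1-dimensional one.  A short sketch (the name `grid` is just for illustration):

```python
import numpy as np

# The numbers 1-9 arranged into 3 rows x 3 columns
grid = np.arange(1, 10).reshape(3, 3)

print(grid.shape)   # (rows, columns)
print(grid.ndim)    # number of dimensions
print(grid.size)    # total number of elements
print(grid.dtype)   # data type shared by every element
```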

##### __Arithmetic with Arrays__

Let's create an array and perform some basic arithmetic operations on it and see what happens.

In [None]:
arr = np.array([[1,2,3], [4,5,6]])
arr

Any arithmetic operation between _equal_ sized arrays is applied element-wise, like below.

In [None]:
# Multiply the array by itself
arr * arr

In [None]:
# Subtract the array from itself
arr - arr

Arithmetic operations with scalars (single values) propagate the scalar argument to each element in the array, like below.

In [None]:
2 * arr
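The same idea extends to arrays of different (but compatible) shapes, which is the "broadcasting" mentioned earlier.  As a minimal sketch, adding a 1-dimensional array of length 3 to a 2x3 array applies it to every row (the `row` array is made up for this example):

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])             # shape (3,)

# The 1-dimensional row is stretched across both rows of arr
result = arr + row
print(result)
```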

##### __Indexing and Slicing__

When it comes to indexing and slicing, 1-dimensional arrays behave just like lists, so no big surprises here.

In [None]:
arr1 = np.arange(5)
arr1

In [None]:
# Extract the 3rd element (remember, indexing is still 0-based)
arr1[2]

In [None]:
# Extract the 2nd through 4th elements
arr1[1:4]

Multidimensional arrays take a bit of getting used to but will become second nature soon.  The format is `[rows,columns]` using the slicing and indexing notations you've already learned.

In [None]:
arr2 = np.array([[1,2,3], [4,5,6]])
arr2

In [None]:
# Extract the value in the 1st row and 2nd column
arr2[0,1]

In [None]:
# Extract the 3rd column
arr2[:,2]

In [None]:
# Extract the 2nd row (row index 1, all columns)
arr2[1, :]
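One detail that trips people up: an integer index drops a dimension, while a slice keeps it.  The sketch below shows the difference, along with a couple of other `[rows,columns]` combinations (the variable names are made up for this example):

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])

# Integer row index: returns a 1-dimensional array
second_row = arr2[1]

# Slice starting at row 1: still 2-dimensional, with a single row
second_row_2d = arr2[1:]

# All rows, columns from index 1 onward
last_two_cols = arr2[:, 1:]

# Negative indexes count from the end, just like lists
corner = arr2[1, -1]

print(second_row.shape, second_row_2d.shape)
```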

Numpy is extremely vast and we could spend all day on it, but that's probably not the best use of our time.  You will mostly use a handful of Numpy's array-generation functions, which we will get to later.  We covered it here mainly because it's the foundation for the topic coming up next: Pandas.

### __Pandas__

Pandas will become a huge part of your Python programming life.  It will make things so much easier and enjoyable for you so definitely take your time to learn it.

Pandas provides data structures and functions designed to make working with structured or tabular (think Excel rows and columns) data super easy!  Under the hood, pandas combines Numpy's array computing with the flexible data manipulation capabilities of spreadsheets and relational databases.  It will allow us to easily reshape, slice and dice, perform aggregations, and apply functions (McKinney, 2022).  You'll be glad someone thought of it, trust me.

The two main data structures of the Pandas library are what we need to get to know next.

- Series
- DataFrame

Let's dive in and see what's what.

In [None]:
# Import the library (pd by convention)
import pandas as pd

Check out a few of the methods and functions that come with the Pandas library.

In [None]:
# Truncated output
dir(pd)[:10]

An important point to make here is that many of these are objects and constructors that have methods and functions underneath them.  Check out the <a href="https://pandas.pydata.org/docs/reference/frame.html">Pandas documentation</a> for more details.  To illustrate, let's look under the "pd.DataFrame" constructor shown under the `dir(pd)` call.

In [None]:
# Truncated output to only the last 15 objects in the directory list
dir(pd.DataFrame)[-15:]

Now we know that when we have a DataFrame object, we have many methods and functions we can use, such as `pd.DataFrame().transform()` or `pd.DataFrame().unstack()`.  Let's crawl before we walk, though, by starting with Series.

##### __Series__

A Series is a 1-dimensional array with an associated array of data labels called its "index".  Let's build one using the `pd.Series()` constructor.

In [None]:
srs1 = pd.Series([10,5,-3,6])
srs1

Notice the list of numbers on the left of our values?  That's the index.  We can access the values and index by dot notation like we're already familiar with.

In [None]:
srs1.values

In [None]:
srs1.index

We can index and slice these in much the same way as you're familiar with already.  For Series, we'll retrieve records by their index.

In [None]:
# Retrieve the 3rd value from series 1
srs1[2]

We could have also built our Series with a custom index if we had labels we wanted to apply instead of the range of numbers.

In [None]:
srs2 = pd.Series([2,3,4,5], index = ['blue','red','green','yellow'])
srs2

In [None]:
# And then retrieve the 3rd value from series 2 by its label
srs2['green']
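With a custom index, it helps to be explicit about whether you mean a label or a position.  As a minimal sketch, `.loc` looks up by label and `.iloc` looks up by integer position (the variable names are made up for this example):

```python
import pandas as pd

srs2 = pd.Series([2, 3, 4, 5], index=['blue', 'red', 'green', 'yellow'])

# Lookup by label
by_label = srs2.loc['green']

# Lookup by integer position (0-based, so position 2 is the 3rd value)
by_position = srs2.iloc[2]

# A list of labels returns a smaller Series
subset = srs2.loc[['blue', 'yellow']]

print(by_label, by_position)
print(subset)
```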

##### __DataFrame__

Let's move on to the much more exciting DataFrame.  DataFrames are rectangular tables of data that contain an ordered collection of columns, which can be homogeneous (all the same type) or heterogeneous (different types).  This is probably the data format that most people are familiar with because it looks much like Excel with rows and columns.

DataFrames have row _AND_ column indexes, and can be thought of as a Dictionary of Series, all sharing the same index.  Instead of just more words, let's just dive into some examples.  We'll start with the `pd.DataFrame()` constructor by passing it a dictionary.

In [None]:
# Manually create a DataFrame
dat = pd.DataFrame({'student': ['Violet','Jade','Bill','Ben','Pat','Mateo'],
                    'age': 16,
                    'gender': ['F','F','M','M','M','M']                    
                    }, index = ['a','b','c','d','e','f'])
dat
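A few attributes are handy for checking what you just built.  The sketch below rebuilds the same dataframe and inspects its shape, column names, and column types.  Notice that the single value `16` passed for "age" was repeated for every row.

```python
import pandas as pd

dat = pd.DataFrame({'student': ['Violet', 'Jade', 'Bill', 'Ben', 'Pat', 'Mateo'],
                    'age': 16,
                    'gender': ['F', 'F', 'M', 'M', 'M', 'M']
                    }, index=['a', 'b', 'c', 'd', 'e', 'f'])

print(dat.shape)             # (rows, columns)
print(dat.columns.tolist())  # column labels
print(dat.index.tolist())    # row labels
print(dat.dtypes)            # one data type per column
```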

BAM!  Easy enough.  Now you have your first dataframe!  Nicely done.

##### __Indexing and Slicing__

Want to access a column from your dataframe?  We can do that in a few different ways.  

- Index operator
- Attribute access (dot notation)
- .loc & .iloc

_Index Operator:_ If you want the data returned as a DataFrame, then you need to pass the column names in a list like below.

In [None]:
# Extract "student" column as a dataframe
dat[['student']]

In [None]:
# Extract two columns by passing a list of column names
dat[['student','gender']]

If you just want one column returned as a Series, then you do not need to pass the column name as a list.  See below.

In [None]:
# Return the values as a Series with single brackets
print(type(dat['student']))

# Return the values as a DataFrame with double square brackets
print(type(dat[['student']]))

_Dot Notation:_  We could have just as easily pulled the "student" column by accessing the object attributes by using the `dat.` syntax.

In [None]:
# Attribute access (dot notation)
dat.student
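One caveat worth knowing: dot notation only works when the column name is a valid Python name and does not collide with an existing DataFrame method.  The sketch below uses a made-up dataframe (the `count` and `test score` columns are hypothetical) to show where the index operator is the safer choice.

```python
import pandas as pd

df = pd.DataFrame({'student': ['Violet', 'Jade'],
                   'count': [3, 5],
                   'test score': [95, 91]})   # hypothetical data

# Works: "student" is a valid name and not a DataFrame method
names = df.student

# df.count refers to the DataFrame's count() method, not our column
clash = df.count

# The index operator always works, including for names with spaces
counts = df['count']
scores = df['test score']

print(callable(clash))
print(counts.tolist(), scores.tolist())
```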

_.loc and .iloc:_  These are a little confusing at first, but with a little practice they will make total sense.  "loc" stands for "label based location", and "iloc" stands for "integer based location".  The names refer to the kind of argument you pass to retrieve the data.  For example, if you want to pull the second row by its integer position, you would use ".iloc".  If you want to retrieve a row or column by its label (name), then you would use ".loc" instead.

See the table below for examples of how to use them.

| Syntax                                                                        | Comments                                                       |
| :---                                                                          | :---                                                           |
| dat[ 'col_name1' ]                                                            | Select a single column; Returns a Series                       |
| dat[[ 'col_name1', 'col_name2' ]]                                             | Select multiple columns; Returns a DataFrame                   |     
| dat.loc[ 'row_index_name1' ]                                                  | Select row by index LABEL; Returns a Series                    |  
| dat.loc[[ 'row_index_name1' , 'row_index_name2' ]]                            | Select row(s) by LABEL; Returns a DataFrame                    |
| dat.loc[ : , 'col_name1' ]                                                    | Select row(s) and/or column by LABEL; Returns a Series         |
| dat.loc[ : , ['col_name1' , 'col_name2' ]]                                    | Select row(s) and/or column(s) by LABEL; Returns a DataFrame   |
| dat.iloc[ index ]                                                             | Select row(s) by integer index position                        |
| dat.iloc[ : , index ]                                                         | Select row(s) and/or column(s) by integer index position       |

In [None]:
# Extract index label "c"
dat.loc['c']

In [None]:
# Extract index labels "a" & "f"
dat.loc[['a','f']]

In [None]:
# Extract the "student" column by name
dat.loc[:, 'student']

In [None]:
# Extract the row index names "b" and "d", and the "student" and "age" columns by name
dat.loc[['b','d'], ['student','age']]
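For comparison, here is a sketch of the position-based `.iloc` equivalents, using the same dataframe.  One difference to watch for: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, just like list slicing.

```python
import pandas as pd

dat = pd.DataFrame({'student': ['Violet', 'Jade', 'Bill', 'Ben', 'Pat', 'Mateo'],
                    'age': 16,
                    'gender': ['F', 'F', 'M', 'M', 'M', 'M']
                    }, index=['a', 'b', 'c', 'd', 'e', 'f'])

# 3rd row by position (same row as label "c"); returns a Series
third_row = dat.iloc[2]

# Single cell: 1st row, 1st column
cell = dat.iloc[0, 0]

# Rows at positions 1 and 3, columns at positions 0 and 1
block = dat.iloc[[1, 3], [0, 1]]

# Slices: .iloc excludes the end position, .loc includes the end label
first_two = dat.iloc[0:2]
a_through_c = dat.loc['a':'c']

print(block)
print(len(first_two), len(a_through_c))
```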

Ok, that's enough indexing.  How about adding a new column?  

##### __Adding a Column__
That's pretty easy too.  We'll just use the `dat['column_name']` syntax with assignment for now.

In [None]:
# Add a new column named "score" to your dataframe (6 random scores between 90 and 100)
dat['score'] = np.random.choice(range(90, 101), 6)
dat
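Because the scores are random, they will change every time the cell runs.  If you want the same "random" numbers each time, one option is a seeded generator.  This is only a sketch of an alternative, not what the cell above does, and the seed value 152 is an arbitrary choice for illustration.

```python
import numpy as np

# A seeded generator: the same seed always yields the same sequence
rng = np.random.default_rng(seed=152)

# 6 integers from 90 up to (but not including) 101, i.e. 90-100
scores = rng.integers(90, 101, size=6)

# A second generator with the same seed reproduces the scores exactly
scores_again = np.random.default_rng(seed=152).integers(90, 101, size=6)

print(scores)
print((scores == scores_again).all())
```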

In [None]:
# Create new column with no values
dat['comment'] = ''

# For any row with a score of 95 or higher, enter "Well done" in the new "comment" column
dat.loc[dat['score'] >= 95, 'comment'] = 'Well done'
dat
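New columns can also be computed directly from existing ones, since arithmetic and comparisons apply to a whole column at once.  A minimal sketch with a small made-up dataframe and fixed scores (the `points_missed` and `passed` column names are hypothetical):

```python
import pandas as pd

scores = pd.DataFrame({'student': ['Violet', 'Jade', 'Bill'],
                       'score': [92, 97, 100]})   # hypothetical fixed scores

# Arithmetic on a column is applied to every row
scores['points_missed'] = 100 - scores['score']

# A comparison produces a True/False column
scores['passed'] = scores['score'] >= 95

print(scores)
```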

##### __Deleting a Column__

There are a couple of ways we can delete columns:

- del statement
- .drop() method

_del statement:_  Just like it says.  We just call `del df['column_name']` and presto, it's gone.

In [None]:
# Delete the "comment" column
del dat['comment']
dat

In [None]:
# Delete the "age" column using the ".drop()" method - notice that we have to tell it which axis to use, 0 = rows, 1 = columns
dat = dat.drop('age', axis = 1)
dat
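Note that `.drop()` returns a new dataframe rather than changing the original, which is why the result was assigned back to `dat` above.  The sketch below shows this with a small made-up dataframe, and also uses the `columns=` and `index=` keywords as an alternative to `axis`.

```python
import pandas as pd

df = pd.DataFrame({'student': ['Violet', 'Jade'],
                   'age': [16, 16],
                   'score': [95, 91]})   # hypothetical data

# Same effect as df.drop('age', axis=1)
trimmed = df.drop(columns=['age'])

# The original dataframe is untouched until we reassign it
print(df.columns.tolist())
print(trimmed.columns.tolist())

# Rows can be dropped by index label the same way
no_first_row = df.drop(index=0)
print(no_first_row)
```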

##### __Masking & Filtering__

We will get much more into this in the next lesson, but for now see how we are going to start using the boolean logic from previous lessons.  Here we check which scores are greater than some value, which returns a boolean True or False for each row.  Then we pass that "mask" into our dataframe and filter to just the rows that evaluate to True.

In [None]:
# Boolean mask
msk = dat.score > 98
msk

In [None]:
# Filter by our mask
dat[msk]

We could have also used the ".loc" indexer, because our mask is a boolean Series that shares the dataframe's index labels.

In [None]:
dat.loc[msk]

And we could also just extract the Student names for anyone with a score over 98, like so.

In [None]:
dat.loc[msk, 'student']
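Masks can also be combined.  As a small preview, the sketch below uses `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses.  The scores here are fixed, made-up values so the results are predictable.

```python
import pandas as pd

dat = pd.DataFrame({'student': ['Violet', 'Jade', 'Bill', 'Ben', 'Pat', 'Mateo'],
                    'gender': ['F', 'F', 'M', 'M', 'M', 'M'],
                    'score': [99, 91, 100, 94, 99, 90]   # hypothetical fixed scores
                    }, index=['a', 'b', 'c', 'd', 'e', 'f'])

high = dat['score'] > 98
male = dat['gender'] == 'M'

# Both conditions must be True
high_and_male = dat[high & male]

# Either condition may be True
high_or_female = dat[high | (dat['gender'] == 'F')]

# Flip a mask with ~
not_high = dat[~high]

print(high_and_male['student'].tolist())
print(high_or_female['student'].tolist())
print(not_high['student'].tolist())
```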

Whew!  Another long one I know.  And believe it or not, we're just barely scratching the surface with the minimum you can learn just to get going.  There's still so much more you could learn.

Of course, we were not able to cover everything in these few short labs, but you'll pick up new Pandas and Numpy functions and operations as you get more experience under your belt working through the projects to come.  This was an excellent start.