## (Supplementary) Introduction to Numpy and Pandas

### Learning Objectives:
* Understand the relationship between Numpy and Pandas
* Identify the constituent parts of a Pandas dataframe

You've already been introduced to Python variables, data types, and functions. These are all *objects* within Python. Depending on how they are defined, objects may have attributes and methods. Attributes are accessed without parentheses, because they are simply values stored on the object and take no arguments. Methods are functions that are specific to certain types of objects. When you call a method on an object, you are applying that function to the object, optionally with some additional argument(s). Thus, methods are called with parentheses:

In [5]:
X = [3,1,2]
# Apply the 'sort' method to X
X.sort()
X

[1, 2, 3]
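To make the difference between attributes and methods concrete, here is a small sketch that uses only built-in Python objects. The complex number `z` and the list `names` are illustrative values chosen for this example.

```python
# A complex number has both attributes and methods.
z = 3 + 4j
real_part = z.real          # attribute: a stored value, no parentheses
conjugate = z.conjugate()   # method: a function called on the object

# A list method that takes an additional argument:
names = ["b", "c", "a"]
names.sort(reverse=True)    # sorts the list in place, in descending order

print(real_part, conjugate, names)
```

Note that `sort` changes the list itself rather than returning a new one, which is why the sorted result is read back from `names`.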

### Numpy

Most of the data manipulation you'll do with Python will use the Numpy and Pandas libraries.

Numpy stands for 'numerical python,' and it was one of the first tools to turn Python into a data science programming language.

In [6]:
# Import the library:
import numpy as np

Recall that objects in Python have types. Numpy provides many more types than basic Python:

In [7]:
num = 1
type(num)

int

In [8]:
num = np.int32(1)

type(num)

numpy.int32

Numpy's main contribution to the Python world is the *array*. Arrays are multi-dimensional collections of elements of the same type. Almost always, the elements of an array are numbers. You can create an array from a list of Python lists:

In [9]:
A = np.array([[1,2,3],[4,5,6]])
A

array([[1, 2, 3],
       [4, 5, 6]])

Or you can create an array using a Numpy function, such as np.full:

In [10]:
# The first argument specifies the dimensions, the second tells Numpy what to fill the array with:
# This array is two-dimensional. Most arrays you see will be two dimensional, but remember that an
# array can contain data in any number of dimensions.
np.full((3,3), 8)

array([[8, 8, 8],
       [8, 8, 8],
       [8, 8, 8]])

In [11]:
# A three-dimensional array. It is displayed as two 2x2 arrays, one 'stacked' below the other.
np.full((2,2,2), 7)

array([[[7, 7],
        [7, 7]],

       [[7, 7],
        [7, 7]]])

In [59]:
# Note that A is an array:
type(A)
# But the dtype of A is numeric:
A.dtype

# This is because the 'dtype' attribute of an array tells us the data type of the elements *inside* that array.
# Moving on, it will be helpful to remember the difference between the type of an object and the data type
# of the elements it contains.

dtype('int32')
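The distinction between an object's type and the data type of its elements can be checked directly. The sketch below uses two made-up arrays, `ints` and `floats`. Because the default integer dtype depends on the platform (it may be `int32` or `int64`), the example inspects `dtype.kind` rather than assuming an exact integer width.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([1.0, 2.5, 3.0])

# Both objects have the same type: they are Numpy arrays.
print(type(ints) is type(floats))   # True

# Their dtypes differ, because they hold different kinds of numbers.
print(ints.dtype.kind)     # 'i' means signed integer
print(floats.dtype)        # float64

# astype returns a new array with the elements converted to another dtype.
as_float = ints.astype(np.float64)
print(as_float.dtype)      # float64
```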

Arrays have their own attributes and methods. You've already seen the dtype attribute.

In [13]:
# The mean method returns the average of the numbers inside the array:
A.mean()

3.5

In [14]:
# The sum method returns the sum:
A.sum()

21

In [15]:
B = np.array([[0,1,0],[2,2,2]])
B

array([[0, 1, 0],
       [2, 2, 2]])

In [16]:
# Arrays can be added:
A + B

array([[1, 3, 3],
       [6, 7, 8]])

In [17]:
# And multiplied:
A * B

array([[ 0,  2,  0],
       [ 8, 10, 12]])

These operations are *element-wise*. That means that when we multiply two arrays, we are just multiplying the corresponding elements from each array. If you are familiar with matrix multiplication in math, you know it is also possible to multiply the arrays themselves together:

In [18]:
A.dot(B.T)
# If you don't understand this right now, don't worry about it. We will discuss matrix multiplication later in the course.

array([[ 2, 12],
       [ 5, 30]])

Arrays have a certain number of dimensions, or *axes*:

In [19]:
A.ndim # A is two-dimensional. It has rows and columns.

2

In [20]:
A.shape # A has 2 rows and 3 columns.

(2, 3)
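Axes matter because many array methods accept an `axis` argument that says which dimension to collapse. The following is a minimal sketch using the same `A` as above; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, leaving one result per column.
col_sums = A.sum(axis=0)
print(col_sums)    # [5 7 9]

# axis=1 collapses the columns, leaving one result per row.
row_sums = A.sum(axis=1)
print(row_sums)    # [ 6 15]

row_means = A.mean(axis=1)
print(row_means)   # [2. 5.]
```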

Just like Python lists, arrays can be indexed:

In [21]:
X[0] # 1 is the first element of the list X

1

In [22]:
# The first element of A is just a smaller array representing the first row of A:
A[0] 

array([1, 2, 3])

In [23]:
# We index the array twice (if it is two-dimensional) to get to a particular element
A[0][0]

1

In [24]:
# Or you can use tuple indexing:
A[(0,0)]

1
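Tuple indexing is usually written without the inner parentheses, and it can be combined with slices to pull out whole rows or columns. Here is a short sketch, again using the array `A` from above:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])

# A[0, 0] is the same as A[(0, 0)]: row 0, column 0.
first = A[0, 0]

# A colon selects everything along that axis.
second_col = A[:, 1]    # every row, column 1
last_row = A[-1]        # negative indices count from the end

print(first, second_col, last_row)
```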

#### Miscellaneous Numpy Functions:

In [25]:
# range is a built-in Python function that returns a sequence of numbers
range(0,10)

range(0, 10)

In [26]:
# Numpy has an equivalent, which is great whenever you need a series of numbers
# for the X-axis of a chart, for example.
# Note that arange returns an array, whereas range returns a Python iterable object
np.arange(0,10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
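Two practical differences between `range` and `np.arange` are sketched below: `np.arange` accepts a non-integer step, and its result supports arithmetic on all elements at once. The values used are arbitrary examples.

```python
import numpy as np

# range only works with integers and must be converted to see its values.
evens = list(range(0, 10, 2))
print(evens)       # [0, 2, 4, 6, 8]

# np.arange accepts a fractional step and returns an array.
quarters = np.arange(0, 1, 0.25)
print(quarters)

# Because the result is an array, arithmetic applies to every element.
doubled = np.arange(0, 5) * 2
print(doubled)     # [0 2 4 6 8]
```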

In [27]:
# You can create arrays full of zeros, ones, or with random numbers:
np.zeros((2,2))

array([[ 0.,  0.],
       [ 0.,  0.]])

In [28]:
np.ones((2,2))

array([[ 1.,  1.],
       [ 1.,  1.]])

In [29]:
np.random.random((2,2))

array([[ 0.78108897,  0.26729225],
       [ 0.16706886,  0.78819154]])

In [30]:
# An identity matrix is a two dimensional array with ones on the diagonal and zeros elsewhere:
np.eye(2,2)

array([[ 1.,  0.],
       [ 0.,  1.]])

In addition to arrays, Numpy provides many of the common statistical functions, such as the mean and the standard deviation:

In [31]:
np.mean([1,2,3])

2.0

In [32]:
np.std([1,3,7])

2.4944382578492941
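One detail worth knowing about `np.std`: by default it computes the population standard deviation, dividing by the number of elements. Passing `ddof=1` divides by one less than the number of elements, which gives the sample standard deviation. The sketch below compares the two on the same list used above.

```python
import math
import numpy as np

data = [1, 3, 7]

# Default (ddof=0): divide the squared deviations by n.
pop_std = np.std(data)

# ddof=1: divide by n - 1 instead.
sample_std = np.std(data, ddof=1)

print(pop_std, sample_std)

# The same population value, computed by hand for comparison:
mean = sum(data) / len(data)
by_hand = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
```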

You will learn more about statistics and Numpy in an upcoming lecture.

### Pandas

Pandas is a data manipulation library built on top of Numpy. Almost all of the time, you will use Pandas to store and interact with your data.

While Numpy gives us arrays to work with, Pandas provides Dataframes. A Dataframe is the basic tabular data structure you'll use in this course. Since they have rows and columns, they are always two-dimensional. Typically you will read in data from some source, like a CSV. But we can also construct them from scratch:

In [33]:
import pandas as pd

df = pd.DataFrame([[1,2,'A'],[3,4,'B'],[5,6,'C']])
df

   0  1  2
0  1  2  A
1  3  4  B
2  5  6  C


In [34]:
# Compare the dataframe above with an array constructed in the same way:
np.array([[1,2,'A'],[3,4,'B'],[5,6,'C']])

array([['1', '2', 'A'],
       ['3', '4', 'B'],
       ['5', '6', 'C']],
      dtype='<U11')

There are a few differences you should notice right away. First, Jupyter Notebook displays dataframes as nicely formatted tables, whereas arrays are just shown as nested lists.

Secondly, dataframes can contain elements of more than one type. Notice that Numpy converted all of the elements of the array to strings, because every element of an array must share one type. Dataframes can hold data of a different type in each column.

Lastly, the Dataframe is *labelled*. Each row and column has a label, or index. Dataframes can be thought of as labeled two-dimensional arrays.
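Labels can also be supplied when the dataframe is created. The following sketch passes the `index` and `columns` arguments of `pd.DataFrame`; the label names (`first`, `second`, `x`, `y`, `letter`) are made up for illustration.

```python
import pandas as pd

# Row and column labels here are hypothetical names for this example.
labeled = pd.DataFrame(
    [[1, 2, "A"], [3, 4, "B"]],
    index=["first", "second"],
    columns=["x", "y", "letter"],
)

print(list(labeled.index))     # ['first', 'second']
print(list(labeled.columns))   # ['x', 'y', 'letter']

# Each column keeps its own data type, unlike the array above.
print(labeled.dtypes)
```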

Let's play with some of these features of dataframes.

In [35]:
# Because Pandas is based on Numpy, many of the methods are the same or similar.
# In Numpy .dtype returns the type of the elements in the array.
# In Pandas, .dtypes returns the data type of each column:
df.dtypes
# This will be one of the Pandas commands you use most often

0     int64
1     int64
2    object
dtype: object

In [60]:
# You can examine the shape of a dataframe, just like you would with an array:
df.shape

(3, 3)

In [36]:
# Since dataframes are just labeled arrays, we can return the array
# that a dataframe is built on top of:
df.values

array([[1, 2, 'A'],
       [3, 4, 'B'],
       [5, 6, 'C']], dtype=object)

In [37]:
# The column labels of a dataframe can be accessed:
df.columns

RangeIndex(start=0, stop=3, step=1)

In [38]:
# And changed:
df.columns = ['Column One','Column Two','Column Three']
df

   Column One  Column Two Column Three
0           1           2            A
1           3           4            B
2           5           6            C


In [39]:
# The row labels of a dataframe are the index:
df.index
# Typically the index just counts up from 0, unless you've rearranged your data.

RangeIndex(start=0, stop=3, step=1)

In [40]:
# You can assign a new index to the data:
df.index = [3,4,5]

In [41]:
# Or you can choose a pre-existing column to be the index:
df.set_index('Column One')

# Side note about operations occurring in place:
# Notice that this operation hasn't occurred 'in place': it returned a new
# dataframe and left df unchanged. Some Pandas operations occur in place,
# and others don't. You'll just have to remember which is which.

            Column Two Column Three
Column One
1                    2            A
3                    4            B
5                    6            C


In [42]:
# For example, let's change the index back to [0,1,2]. This operation is not in place:
df = df.reset_index(drop=True)
df

   Column One  Column Two Column Three
0           1           2            A
1           3           4            B
2           5           6            C


In [43]:
# If an operation is 'in place,' then it modifies the object it is called on and
# returns None. Let's say I want to drop duplicate rows, which by default does
# not occur in place:
df = df.drop_duplicates()

In [44]:
# But I can make it occur in place:
df2 = df.drop_duplicates(inplace=True)

In [45]:
# If I do that, what have I assigned to df2?
df2

# When you run this cell, nothing is displayed. That's because an in-place operation returns None.

In [46]:
# df2 holds None, which is of type NoneType:
type(df2)
# If you ever see an error that mentions NoneType, you have probably assigned the result of an in-place operation.

NoneType
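The two behaviours can be compared on a small dataframe that actually contains a duplicate row. This is a sketch with made-up data; the column names `a` and `b` are hypothetical.

```python
import pandas as pd

# The first two rows are identical, so one of them is a duplicate.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# Not in place: a new dataframe is returned and df is left unchanged.
deduped = df.drop_duplicates()
print(len(df), len(deduped))    # 3 2

# In place: df itself is modified and the call returns None.
result = df.drop_duplicates(inplace=True)
print(result is None, len(df))  # True 2
```

A safe habit is to pick one style per line: either reassign the returned value, or use `inplace=True` without assigning anything.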

In [114]:
# To select a certain column, just pass the name of the column to the square brackets which we 
# usually use for subsetting and indexing. If you want more than one column, pass a list of column names.
df['Column One']

0    1
1    3
2    5
Name: Column One, dtype: int64

A Pandas series is just a one-dimensional collection of data. When you're working with dataframes, always remember that your columns are Pandas series:

In [115]:
type(df['Column One'])

pandas.core.series.Series
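Selecting with a single column name gives a series, while selecting with a list of names gives a dataframe, even if the list has only one entry. The sketch below rebuilds the example dataframe with a dictionary so that it can run on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Column One": [1, 3, 5],
    "Column Two": [2, 4, 6],
    "Column Three": ["A", "B", "C"],
})

one_column = df["Column One"]                    # a single name: Series
two_columns = df[["Column One", "Column Two"]]   # a list of names: DataFrame

print(type(one_column).__name__)    # Series
print(type(two_columns).__name__)   # DataFrame

# Series methods, such as mean, work on one column at a time.
print(one_column.mean())            # 3.0
```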

As you work with Pandas, remember that certain functions apply to series, and some to dataframes. For example, the functions dealing with setting the index operate on the entire dataframe. But if you want to calculate the mean of a column, you're working with each column as a series.

If you want to apply a function to every column of the dataframe, you can use the .apply method. Make sure the function you pass to the apply method is a function that works on series:

In [54]:
df.apply(np.max)

# Here we've applied the max function from Numpy to every column. Note that Python compares strings alphabetically, so 'C' > 'B' > 'A'.

Column One      5
Column Two      6
Column Three    C
dtype: object

Pandas knows there are a lot of functions we might want to apply to all the columns, though, so it lets us just apply them directly to the dataframe:

In [55]:
df.max()

Column One      5
Column Two      6
Column Three    C
dtype: object
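The `apply` method is most useful when Pandas has no built-in shortcut for what you want. As a sketch, the hypothetical function `value_range` below takes a series and returns the difference between its largest and smallest values. Only the numeric columns are included, since subtraction is not defined for strings.

```python
import pandas as pd

df = pd.DataFrame({
    "Column One": [1, 3, 5],
    "Column Two": [2, 4, 6],
})

def value_range(column):
    """Return max minus min for a single column (a Pandas series)."""
    return column.max() - column.min()

# apply calls value_range once per column and collects the results in a series.
ranges = df.apply(value_range)
print(ranges)
```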

Other times we might want to create new columns from our existing columns. We can work with the columns directly as series:

In [57]:
df['Column One'] * df['Column Two']

0     2
1    12
2    30
dtype: int64
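The expression above only computes a new series. To store it as a column, assign it to a new column name. In this sketch the name `Product` is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({
    "Column One": [1, 3, 5],
    "Column Two": [2, 4, 6],
    "Column Three": ["A", "B", "C"],
})

# Assigning to a name that does not exist yet creates a new column.
df["Product"] = df["Column One"] * df["Column Two"]

print(df["Product"].tolist())   # [2, 12, 30]
print(df.shape)                 # (3, 4)
```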

There's a lot more you can do with Pandas, which we'll cover in other lessons.

For now, you should understand the basic relationship between the Numpy array and the Pandas dataframe: that dataframes are labeled two-dimensional arrays that can contain different data types.

You should also understand what makes up a dataframe. Every dataframe is made up of columns, which are Pandas series, and a certain number of rows. The dataframe has row labels, which are the index, and column names. When you're working with dataframes, keep track of whether you're using series methods or dataframe methods.