# Numpy and Pandas Demo

This notebook provides a very basic introduction to the Numpy and Pandas Python packages. This notebook must be run with a Jupyter kernel that has Numpy, Pandas, and Matplotlib installed. You can execute the single cell in the following notebook to install such a kernel: https://github.com/frenchwr/data_viz_wg/blob/master/create_numpy_pandas_kernel.ipynb

# Numpy

Numpy is a popular Python library that is useful for rapid processing of numerical data.

In [None]:
import numpy as np  # np is now shorthand for "numpy" throughout this notebook

In [None]:
x = [] # this is a traditional Python list
for xi in range(10000):
    x.append(xi)

In [None]:
y = np.asarray(x) # now we are converting this list to a numpy array

One of the primary advantages of Numpy is its speed. Let's compare the processing time of a Numpy array versus a traditional Python list.

In [None]:
%%time
xnew = [xi*8 for xi in x] # here we use a list comprehension to scale each element of x by 8

In [None]:
%%time
ynew = y*8  # this scales each element of our numpy array by 8

This is a fairly simple example; the performance differences are often even more dramatic. The reason for these differences is that Numpy was designed from the ground up with performance in mind. Much of the module is implemented in pre-compiled, highly optimized C code. Numpy arrays are also laid out in memory more efficiently than traditional Python lists: the elements share a single data type and are stored in one block of memory.
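To make the memory-layout point a little more concrete, here is a minimal sketch that inspects an array's element type and storage. The names `values` and `arr` are illustrative only, and the exact numbers printed depend on your platform.

```python
import sys
import numpy as np

values = list(range(1000))        # plain Python list of int objects
arr = np.asarray(values)          # Numpy array holding the same numbers

# Every element of the array shares one fixed-size numeric type...
print(arr.dtype, arr.itemsize)    # element type and bytes per element

# ...and the elements sit in a single contiguous block of memory.
print(arr.flags['C_CONTIGUOUS'], arr.nbytes)

# A list, by contrast, stores references to separately allocated int objects.
print(sys.getsizeof(values[1]))   # size in bytes of one Python int object
```

The total array size is simply the element count times the bytes per element, which is part of what lets compiled code sweep through the data quickly.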

## Creating Numpy Arrays

Note, much of this tutorial is taken from: http://cs231n.github.io/python-numpy-tutorial/

In [None]:
a = np.array([1, 2, 3]) # rank 1 array
print(a)
print(a.shape)

In [None]:
b = np.array([[1,2,3],[4,5,6]]) # rank 2 array
print(b)
print(b.shape)

In [None]:
c = np.zeros((2,2))
print(c)

In [None]:
d = np.ones((1,2))
print(d)

In [None]:
e = np.full((2,2), 7)
print(e)

In [None]:
f = np.random.random((2,2))
print(f)

## Slicing and Dicing Data with Numpy

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

In [None]:
print(a[0,0]) # Numpy is zero-indexed, just like normal Python lists!

In [None]:
print(a[2,3])

In [None]:
b = a[:2, 1:3]  # The first index is requesting the first two rows of array a
print(b)

In [None]:
print(a[0, 1])

In [None]:
b[0, 0] = 77
print(a[0, 1]) # b is effectively an alias for a subset of array a!
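If you want a slice that does not write back to the original array, one option is to take an explicit copy. The sketch below uses hypothetical names (`grid`, `view`, `copy`) so that it does not disturb the arrays defined above.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = grid[:2, 1:3]           # basic slicing returns a view onto grid's data
copy = grid[:2, 1:3].copy()    # .copy() allocates independent storage

copy[0, 0] = -1                # changes only the copy
print(grid[0, 1])              # still 2

view[0, 0] = 77                # writes through to grid
print(grid[0, 1])              # now 77

# np.shares_memory reports whether two arrays overlap in memory
print(np.shares_memory(grid, view), np.shares_memory(grid, copy))
```

Views avoid copying data, which is usually what you want for speed; reach for `.copy()` when you need to modify a subset without touching the source.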

How would you slice array a to yield the following?
* the third column of data ([3 7 11])
* [[7 8] [11 12]]

In [None]:
print(a[a>8])  # boolean-based indexing is also possible in Numpy
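It can help to look at the boolean mask on its own before using it as an index. The following is a small sketch with a hypothetical array named `grid`; it also shows one way to combine two conditions.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

mask = grid > 8                 # boolean array with the same shape as grid
print(mask)

selected = grid[mask]           # 1-D array of the elements where mask is True
print(selected)

# Conditions combine element-wise with & (and) and | (or).
# The parentheses around each condition are required.
even_and_large = grid[(grid > 2) & (grid % 2 == 0)]
print(even_and_large)
```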

## Mathematical Operations on Numpy Arrays

In [None]:
print(np.sin(a))

In general, you want to avoid performing mathematical operations one element at a time inside a Python for loop (the same is true in Matlab and R). Instead, rely on the "vectorized" math functions available through Numpy, which operate on an entire array in a single call.

In [None]:
print(y.size)

In [None]:
%%time
for yi in y:
    np.sin(yi)

In [None]:
%%time
np.sin(y)  # notice we are now passing the entire array to this function
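Outside of a notebook the %%time magic is not available, so here is a rough sketch of the same comparison using the standard-library timeit module. The helper `loop_sin` and the array `samples` are made up for this example, and the timings will vary from machine to machine.

```python
import timeit
import numpy as np

samples = np.arange(10000)

def loop_sin(values):
    """Call np.sin once per element (one Python-level call each)."""
    return np.array([np.sin(v) for v in values])

# Both approaches produce the same numbers...
same = np.allclose(loop_sin(samples), np.sin(samples))
print(same)

# ...but the vectorized call does its looping in compiled code.
loop_time = timeit.timeit(lambda: loop_sin(samples), number=5)
vector_time = timeit.timeit(lambda: np.sin(samples), number=5)
print(f"loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s")
```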

# Pandas

Pandas is a Python package designed for working with tabular data. It provides a dataframe-based interface for storing, accessing, and processing data. Pandas makes use of Numpy and Matplotlib internally. For a more complete demo of the types of operations that can be performed on a Pandas dataframe, see: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min

First, let's download some Fitbit data from GitHub that we can wrangle into submission.

In [None]:
%%bash
git clone https://github.com/willie8338/Fitbit.git ~/Fitbit # download some fitbit data
ls -l --color ~/Fitbit

In [None]:
%%bash
ls -l --color ~/Fitbit/data

In [None]:
import pandas as pd

In [None]:
# The CSV has two sections - Activity and Sleep. Let's just focus on the Activity data.
# Excel spreadsheets can also be loaded directly into a Pandas dataframe
df = pd.read_csv("~/Fitbit/data/fitbit_export_20140710.csv",skiprows=1,nrows=50,index_col=0)

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.index

In [None]:
# Let's convert our index to a datetime type, which is more fully featured
df.index = pd.to_datetime(df.index)
df.index

In [None]:
df.dtypes

In [None]:
df.columns

Notice that the CSV uses commas as thousands separators when values are > 999. Pandas reads those columns as strings, which can cause issues, so let's remove the commas first and convert the type to float.

In [None]:
for c in df.columns:
    if (df[c].dtype == object):
        df[c] = df[c].str.replace(',','')
        df[c] = df[c].astype(float)
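As an alternative to cleaning the columns after loading, pd.read_csv accepts a thousands argument that handles the separators while parsing. The sketch below uses a made-up two-row sample (not the real Fitbit file) to show the idea; note that the parsed columns come out as integers here rather than floats.

```python
import io
import pandas as pd

# Hypothetical two-row sample written in the same style as the Fitbit export.
csv_text = 'Date,Steps,Floors\n2014-05-22,"12,345",10\n2014-05-23,"8,901",7\n'

# thousands=',' tells the parser to treat commas inside numbers as separators
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, thousands=',')

print(sample.dtypes)
print(sample)
```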

In [None]:
df.head()

Pandas dataframes provide a rich set of SQL-like functionality. Try typing "df." and then the tab key to see a drop down menu of the various operations that can be applied to your dataframe. The describe() method is great for generating a quick overview of your data:

In [None]:
df.describe()

In [None]:
df.sort_values(by='Steps',ascending=False)

## Selecting Data

In [None]:
df['Minutes Very Active'].head(10)

In [None]:
df[13:21]  # show rows 13-20

While the bracket-based Numpy-like indexing scheme above works, it is generally recommended to make use of the .at, .iat, .loc, and .iloc accessors, as they are optimized with performance and reliability in mind. The .loc accessor requires a label-based index for selecting data, while .iloc expects an integer-based index.
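The difference between label-based and integer-based selection is easiest to see on a tiny dataframe. The sketch below builds a hypothetical four-row dataframe named `demo` (it is not the Fitbit data) and selects the same rows both ways.

```python
import pandas as pd

# Hypothetical miniature dataframe indexed by date.
dates = pd.to_datetime(['2014-05-22', '2014-05-23', '2014-05-24', '2014-05-25'])
demo = pd.DataFrame({'Steps': [1000, 2000, 3000, 4000],
                     'Floors': [1, 2, 3, 4]}, index=dates)

# .loc selects by label; label slices include both endpoints.
by_label = demo.loc['2014-05-23':'2014-05-24', ['Steps']]

# .iloc selects by integer position; the stop position is excluded.
by_position = demo.iloc[1:3, [0]]

print(by_label)
print(by_label.equals(by_position))
```

Keep the endpoint behavior in mind: a label slice with .loc includes the last label, while a positional slice with .iloc follows the usual Python rule of excluding the stop value.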

In [None]:
df.loc['2014-05-22']

In [None]:
df.iloc[0]

In [None]:
df.loc[:, ['Steps', 'Floors']].head()  # select all indices (dates), with two columns

Exercises:
* Use df.loc to select data between May 24 - 29 with Columns "Distance" and "Minutes Sedentary"
* Use df.loc to select data from June 17 onwards with Columns "Floors" and "Activity Calories"
* Use df.iloc for selecting rows 2-5 and Columns 3-6

If you just want to retrieve a single scalar value from your dataframe, use the df.at and df.iat accessors for speed:

In [None]:
%%time
df.iat[1,2]

In [None]:
%%time
df.iloc[1,2]
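The cells above only exercise .iat. Its label-based counterpart, .at, works the same way but takes a row label and a column name. Here is a small sketch on a hypothetical two-row dataframe named `demo`:

```python
import pandas as pd

# Hypothetical two-row dataframe indexed by date.
demo = pd.DataFrame({'Steps': [1000, 2000], 'Floors': [1, 2]},
                    index=pd.to_datetime(['2014-05-22', '2014-05-23']))

# .at looks up one value by row label and column name
by_label = demo.at[pd.Timestamp('2014-05-23'), 'Floors']

# .iat looks up one value by integer row and column position
by_position = demo.iat[1, 1]

print(by_label, by_position)
```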

You can also select data based on boolean operations:

In [None]:
df[df.Steps > 15000]

In [None]:
df[df.Steps > 15000]['Activity Calories']
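The same kind of selection can also be written as a single .loc call, which takes the row mask and the column label together and avoids chaining two sets of brackets. The sketch below uses a hypothetical three-row dataframe named `demo` with column names borrowed from the Fitbit data; the values are invented.

```python
import pandas as pd

# Hypothetical values; only the column names mirror the Fitbit export.
demo = pd.DataFrame({'Steps': [9000, 16000, 21000],
                     'Floors': [5, 30, 12],
                     'Activity Calories': [800, 1500, 1900]})

# Row mask and column label in one .loc call
calories = demo.loc[demo['Steps'] > 15000, 'Activity Calories']
print(calories)

# Conditions combine with & (and) and | (or); each needs its own parentheses
busy = demo.loc[(demo['Steps'] > 15000) & (demo['Floors'] > 20)]
print(busy)
```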

## Plotting

In [None]:
%matplotlib inline
df.plot()

In [None]:
ax = df.plot(y="Distance",kind="bar",stacked=True,rot=45)  # stacked expects a boolean
n = 5
ticks = ax.xaxis.get_ticklocs()
ticklabels = [l.get_text() for l in ax.xaxis.get_ticklabels()]
ax.xaxis.set_ticks(ticks[::n])
ax.xaxis.set_ticklabels(ticklabels[::n])

In [None]:
ax = df.plot(x="Steps",y="Distance",kind="scatter")  # line-style strings do not apply to scatter plots