# **NumPy & Pandas**: The tools of the trade

---

# Overview

1. About NumPy
    * Importing
    * Basic data structure overview
    * Brief object creation tutorial
    * Indexing/Exploring/Manipulating
2. About Pandas
    * Importing
    * Basic data structure overview
    * Brief object creation tutorial
    * Indexing/Exploring/Manipulating
    * I/O in Pandas
3. Data visualization
4. Closing

# NumPy: Numeric Python ('numb-pie')

---

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/NumPy_logo.svg/1200px-NumPy_logo.svg.png" alt="NumPy logo" style="width:300">

NumPy (np) is the premier Python package for scientific computing. The reason np is so powerful is its N-dimensional array object.

np is a *lower*-level numerical computing library. This means that, while you can use it directly, most of its power comes from the packages built on top of np:
* Pandas (*pan*el *da*ta): the R Killer
* Scikit-learn (machine learning)
* Scikit-image (image processing)
* OpenCV (computer vision)
* more...

___

## Importing `np`

### The convention: aliasing numpy to np
```python
import numpy as np
```

The cell below will import numpy, but as many of you may not have the complete build, it is set up to install `np` and `pandas` if `np` is not present.

In [None]:
try:
    import numpy as np
except ImportError:
    import conda.cli as cc
    cc.main(*'conda install -y numpy pandas matplotlib seaborn'.split())
    import numpy as np

## `np` Data Structures

<img src="https://www.oreilly.com/library/view/elegant-scipy/9781491922927/assets/elsp_0105.png" alt="data structures" style="width:200">

* 1D array := vector
* 2D array := array or matrix
* ND array 

### What you know now:

In [None]:
# lists
some_list = 

# Remember, id() just gives us the 
# memory location of the object
for item in some_list:
    print(id(item))

### How `np` does it:

![](https://image.slidesharecdn.com/numpytalksiam-110305000848-phpapp01/95/numpy-talk-at-siam-17-728.jpg?cb=1299283822)

---

### The basics

In [None]:
# From Python list
arr = np.array([1,2,3])

Every array has these basic attributes:
* `shape`: Length of the array along each dimension
* `size`: Number of elements in the array
* `ndim`: Number of array dimensions (`len(arr.shape)`)
* `dtype`: Data-type of the array
* `T`: The transpose of the array
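The attributes listed above can be inspected on any array. The block below is a minimal sketch; the name `demo` and its values are made up for illustration and are not part of the original notebook.

```python
import numpy as np

# A small 2x3 example array (values chosen only for illustration)
demo = np.array([[1, 2, 3], [4, 5, 6]])

print(demo.shape)    # (2, 3): length along each axis
print(demo.size)     # 6: total number of elements
print(demo.ndim)     # 2: number of axes, same as len(demo.shape)
print(demo.dtype)    # an integer dtype; the exact width depends on the platform
print(demo.T.shape)  # (3, 2): the transpose swaps the axes
```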

In [None]:
# Playing with attributes


In [None]:
# 2D array


In [None]:
# Reshaping


In [None]:
# 3D array


#### Indexing

In [None]:
# Position-based


In [None]:
# Slicing


In [None]:
# Boolean
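The three indexing styles named in the cells above can be sketched on one small array. The array `grid` is an assumption made for this example only.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)  # rows: [0..3], [4..7], [8..11]

# Position-based: row 1, column 2
print(grid[1, 2])      # 6

# Slicing: first two rows, every other column
print(grid[:2, ::2])   # rows [0 2] and [4 6]

# Boolean: a mask picks out the elements where the condition holds
mask = grid > 8
print(grid[mask])      # [ 9 10 11]
```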


#### Time Check

In [None]:
a = list(range(int(1e6)))
b = np.array(a)

In [None]:
%%timeit
sum(a)

In [None]:
%%timeit
np.sum(b)

---

## Creating data

Since `list(range(n))` is a common prototyping setup, `np` makes it a little easier with:

In [None]:
# arange
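A short sketch of what `np.arange` looks like in practice; the arguments are arbitrary example values.

```python
import numpy as np

print(np.arange(5))         # [0 1 2 3 4], like list(range(5))
print(np.arange(2, 10, 3))  # [2 5 8]: start, stop (exclusive), step
```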


Remember `makeTable`?

In [None]:
numRows =
numCols = 

In [None]:
make_table = 

Look [here](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.random.html) for more random sampling methods.

### Pre-allocating

Unlike Python `list`s, `np` does better if the array size is pre-allocated. All this means is that you tell `np` how big of an array you want *before* you do anything to it.

#### Standard Python

In [None]:
%%timeit
a = []
for i in range(int(1e6)):
    a.append(i)

#### `np.append`

In [None]:
%%timeit
a = np.array([])
for i in range(int(1e6)):
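    # np.append returns a new array instead of changing `a` in place.
    # The result is discarded here, so this times only the per-call cost.
    np.append(a, i)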

#### Pre-allocated `np`

In [None]:
%%timeit
a = np.empty(int(1e6), dtype=np.int64)
for i in range(int(1e6)):
    a[i] = i

---

## Broadcasting

![](https://image.slidesharecdn.com/numpytalksiam-110305000848-phpapp01/95/numpy-talk-at-siam-27-728.jpg?cb=1299283822)
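As a rough illustration of the idea in the figure, broadcasting lets arrays of different but compatible shapes be combined without writing a loop. The arrays below are invented for this sketch.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)  # shape (2, 3): [[0 1 2] [3 4 5]]
row = np.array([10, 20, 30])         # shape (3,)
col = np.array([[100], [200]])       # shape (2, 1)

# The row is stretched across both rows of the matrix
print(matrix + row)  # rows [10 21 32] and [13 24 35]

# The column is stretched across all three columns
print(matrix + col)  # rows [100 101 102] and [203 204 205]
```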

---

![](https://pandas.pydata.org/_static/pandas_logo.png)

# About Pandas

## What is pandas?

[Pandas](https://pandas.pydata.org/) is a high-performance library that brings familiar data structures, like R's `data.frame`, and the data analysis tools that go with them to Python users.

## How does pandas work?

Pandas is built on top of [NumPy](http://www.numpy.org/), and therefore leverages NumPy's C-level speed for its data analysis.

However, there is one caveat: whereas a NumPy array holds elements of a single type, a pandas DataFrame can hold many types. Think of a SQL table, where each column can be whatever type you want it to be, so long as every item in the column is that same type.
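The difference can be seen in a small sketch. The names `mixed` and `table` and their contents are made up for this example.

```python
import numpy as np
import pandas as pd

# Mixing a string into a NumPy array forces every element to one common type
mixed = np.array([1, 2.5, 'three'])
print(mixed.dtype.kind)  # 'U': everything became a unicode string

# A DataFrame keeps a separate dtype for each column
table = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5], 'label': ['a', 'b']})
print(table.dtypes)
```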

## Why use pandas?

1. Data munging/wrangling: the cleaning and preprocessing of data
2. Loading data into memory from disparate data formats (SQL, CSV, TSV, JSON)
3. Tired of R, but still like `data.frame`
4. Don't know R
5. Want to forget R

# Importing

Probably the easiest step. Because pandas is built on top of NumPy, it is often useful to import NumPy at the same time. However, this isn't necessary.

In [None]:
import numpy as np
import numpy.random as nr
import pandas as pd

Additionally, we will be doing some very basic data visualization. This means we need a plotting library. The following cell shows the standard convention for importing [matplotlib](https://matplotlib.org/#) and how to use notebook `magic` to allow the plots to show in the notebook when they are generated.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

`poss_labels` below is just a quick way for me to name things.

In [None]:
poss_labels = nr.choice([word.strip() for word in open('./datasets/words_alpha.txt')], size=(1000,), replace=False)

Alternatively, one can use the `gen_lab` function to create *m* labels, each of length *n*.

In [None]:
import string

def gen_lab(n, m=1):
    '''Generate m random labels, each n characters long, from string.ascii_lowercase.'''
    labels = []
    for _ in range(m):
        # Draw n letters (with replacement) and join them into one label
        out = nr.choice(list(string.ascii_lowercase), n)
        labels.append(''.join(out))
    return labels

# Basic data structure overview

For a more thorough dive into the different data structures, feel free to read [this](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro) documentation.

The data structures of interest are:
1. `pd.Series`
2. `pd.DataFrame`

## Series

**One-dimensional** labeled array (or vector) 

```python
# Initialization Syntax
series = pd.Series(data, index, dtype) 
```

* **`data`** : what is going inside the Series (array-like, dict, or scalar value)
* **`index`**: row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`dtype`**: numpy/python based data types

## DataFrame

**Two-dimensional** labeled data structure with columns of *potentially* different types

```python
# Initialization Syntax
df = pd.DataFrame(data, index, columns, dtype)
```

* **`data`** : what is going inside the DataFrame (numpy ndarray (structured or homogeneous), dict, or DataFrame)
* **`index`** : row identifiers (doesn't have to be unique--think foreign key. Defaults to row number)
* **`columns`** : column identifiers
* **`dtype`** : numpy/python based data types

# Brief object creation tutorial

## Series

### From a Python list

In [None]:
data_list = list(range(5))
data_list

In [None]:
a = pd.Series(data_list)
a

------
Below is going to look a little crazy, but bear with me. Because pandas has a very rich API, even a simple `Series` has a large number of attributes and methods. I want to clean that up and look at the attributes *only*, so I use the helper function below to do that.

Let's look at the attributes of **`a`** to see what it can do...

In [None]:
dir(a)

In [None]:
import types

def get_attr_list(obj):
    attributes = []
    for i in dir(obj):
        if not i.startswith('_'):
            if not isinstance(getattr(obj,i), types.MethodType):
                attributes.append(i)
    return attributes

In [None]:
get_attr_list(a)

We can quickly generate additional data structures from the attributes of existing ones (so long as they are appropriate).

----

### Assigning a more helpful index

In [None]:
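# Passing the Series itself reindexes it against the new labels.
# None of them match the old integer index, so every value becomes NaN.
pd.Series(a, index=nr.choice(poss_labels, size=(a.size)))

In [None]:
# Passing the underlying values attaches the new labels by position instead
b = pd.Series(a.values, index=nr.choice(poss_labels, size=(a.size)))
b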

### From dictionary

In [None]:
data_dict = dict(zip(b.index, a.values))
data_dict

In [None]:
c = pd.Series(data_dict)
c

### Setting the dtype

In [None]:
d = pd.Series(b, dtype=np.float16)
d

### From numpy array

In [None]:
d = pd.Series(nr.randn(5), index=c.index, dtype=np.float64)
d

### Naming the series

In [None]:
e = pd.Series(d, name='Foo')
e

## DataFrame

DataFrames work in much the same way as `pd.Series`. As with `np.ndarray`, a DataFrame is just an extension (with some caveats) into a higher dimensionality.

### From list

In [None]:
in_dat = [data_list, data_list[::-1]]
in_dat

In [None]:
dfa = pd.DataFrame(in_dat)
dfa

----
Again, let's look at the attributes of a dataframe

In [None]:
get_attr_list(dfa)

### Add an index

In [None]:
dfb = pd.DataFrame(dfa.values, index=list(string.ascii_lowercase[:len(in_dat)]))
dfb

### Add column names

Because this is a dataframe, we can add both an index ***and*** column names.

In [None]:
dfc = pd.DataFrame(dfb.values, index=dfb.index, columns=gen_lab(2, len(in_dat[0])))
dfc

### Add a dtype

In [None]:
dfd = pd.DataFrame(dfc, dtype=np.float16)
dfd

In [None]:
dfd.dtypes

As seen, if you set a `dtype` for the `DataFrame`, you set it for ***all*** of the elements. However, you can always set the column dtypes individually. We will look at this later.
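As a small preview of per-column dtypes, `astype` accepts a dict that maps column names to types. The frame below is a made-up example, not one of the notebook's objects.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# A dict passed to astype converts only the named columns
converted = frame.astype({'a': np.float32})
print(converted.dtypes)
```

Column `a` becomes `float32` while column `b` keeps its original integer dtype.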

### From a dictionary

In [None]:
two_d_dict = dict(zip(dfd.index,in_dat))
two_d_dict

In [None]:
dfe = pd.DataFrame(two_d_dict, index = dfd.columns).T
dfe

### From numpy array

In [None]:
arr_1 = nr.randint(0, 100, (3,4))
arr_1

In [None]:
dff = pd.DataFrame(arr_1, index = list(string.ascii_lowercase[:arr_1.shape[0]]), columns = gen_lab(2, arr_1.shape[1]))
dff

### From `pd.Series`

#### Row-wise (`append`)

In [None]:
# Remember pd.Series a?
a = a.rename('original')
print(a)
a_rev = pd.Series(a.values[::-1], name='reversed', dtype=np.float16)
a_rev

In [None]:
# Create the original DataFrame
df_rows = pd.DataFrame(a).T
df_rows

In [None]:
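# Now add on a row. DataFrame.append was removed in pandas 2.0,
# so pd.concat with a one-row frame does the same job.
pd.concat([df_rows, a_rev.to_frame().T])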

#### Column-wise (`join`/`concat`)

In [None]:
df_columns = pd.DataFrame(nr.randn(3,4))
# Assign the flat list of labels (wrapping it in another list would create a MultiIndex)
df_columns.columns = gen_lab(2, 4)
df_columns

#### `join`

In [None]:
p = df_columns.join(a_rev)

In [None]:
p

In [None]:
p.dtypes

#### `concat`

In [None]:
# Same size
pd.concat([df_columns, a_rev[:3]], axis=1)

In [None]:
# Unequal size
pd.concat([df_columns, a_rev], axis=1)

### 3D DataFrame

Special instructions on multi-dimensional dataframes: it is necessary to use `pd.MultiIndex` for three-dimensional data. Previously, there was a `pd.Panel` data structure, but it has since been deprecated and removed.

In [None]:
three_d_idx = pd.MultiIndex.from_arrays([['x', 'x'], list('yz')])
three_d_idx

In [None]:
dfg = pd.DataFrame(nr.randn(2,4), index=three_d_idx)
dfg

A better way to think about this is like a treatment time series.

In [None]:
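# Five days, each repeated twice (one 'beg' and one 'stop' reading per day)
t = pd.date_range('20180409', periods=5).to_series()
# Series.append was removed in pandas 2.0; pd.concat does the same job
t1 = pd.concat([t, t]).sort_values()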
pos = ['beg_fluorescence', 'stop_fluorescence'] * 5
idx = pd.MultiIndex.from_tuples(list(zip(t1, pos)))

In [None]:
time_series = pd.DataFrame(nr.randn(10,10), columns=idx)
time_series.T

# I/O in Pandas

One of the biggest reasons people use pandas is to bring data in without having to mess around with file I/O, delimiters, and type conversion. Pandas is a one-stop shop for a lot of this.

### CSV Files

#### Input

In [None]:
pd.read_csv('./datasets/real_estate.csv')

#### Output

You can also, just as easily, save your `DataFrame`s.

In [None]:
time_series.to_csv()

**NOTE:** Without a filename, `to_csv()` returns the CSV text as a string. You can always put a filename in the function above to have it save to file...I just didn't feel like it.
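The two behaviours of `to_csv` can be sketched with a small made-up frame. An in-memory `io.StringIO` buffer stands in for a real file here so that nothing is written to disk.

```python
import io
import pandas as pd

small = pd.DataFrame({'x': [1, 2], 'y': [3.5, 4.5]})

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = small.to_csv(index=False)
print(csv_text)

# read_csv accepts a filename or any file-like object, so the text can be read back
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.equals(small))
```

Passing `index=False` leaves the row labels out of the output, which keeps the round trip symmetric.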

### Excel Files

#### Input

In [None]:
pd.read_excel('./datasets/sample_excel.xlsx').head()

#### Output

In [None]:
# The context manager saves and closes the file when the block ends
with pd.ExcelWriter('./datasets/excel_output.xlsx') as excel_writer:
    time_series.to_excel(excel_writer, sheet_name='Sheet1')

### TSV Files

#### Input

In [None]:
pd.read_table('./datasets/sample_tsv.tsv').tail()

### Clipboard

#### Copy

In [None]:
time_series.to_clipboard()

In [None]:
	2018-04-09 00:00:00	2018-04-09 00:00:00	2018-04-10 00:00:00	2018-04-10 00:00:00	2018-04-11 00:00:00	2018-04-11 00:00:00	2018-04-12 00:00:00	2018-04-12 00:00:00	2018-04-13 00:00:00	2018-04-13 00:00:00
	beg_fluorescence	stop_fluorescence	beg_fluorescence	stop_fluorescence	beg_fluorescence	stop_fluorescence	beg_fluorescence	stop_fluorescence	beg_fluorescence	stop_fluorescence
0	0.8437640628490046	0.29470436534984484	0.7955544709922411	-0.8481852473362237	0.9362531125924183	-0.06259146866549041	-1.2711725680382564	-2.709801295191622	0.6253899927979863	1.2483013014592819
1	1.2741200876778285	-1.4821571323062797	-1.3156081497359964	-1.1302016849085206	-0.018692129700282886	-1.3815550131512133	1.0809604003179176	-0.76629820632411	0.09772086261654515	1.5592813174378892
2	-1.5818311237633553	1.680214108420773	-0.2311027498272529	0.0035169311421782805	-0.7645153184729468	-1.2012564154453196	-0.10572029914008506	-0.3930911189232981	-0.07848255960577383	-0.2395697269927247
3	-0.2120692136483637	0.9292579293332327	-0.15942040184851514	-0.029682434111224214	0.8449170575134685	-0.5011827621774062	-0.37769458185574795	-0.043070972857119465	-0.7526235915567282	1.4278508995339716
4	0.5160571438348978	-0.5165802478905551	-0.18178499795112277	0.3152533197417652	-1.1496335731102205	-0.019384429879345955	-1.290757699317373	-0.7797856669149529	-0.17579641826784378	-0.05495629845113751
5	-0.05227120808556489	-0.2033537329597349	-0.13186110987444938	-0.37919145631996337	-0.7631958265345212	0.945914095824421	-0.04569495451190494	-0.5090454572888908	-0.12638447548209528	-0.9704243545819314
6	-0.3549529120949826	-0.9706710781093872	-0.399620527283899	0.40154564804264486	-1.170702745645506	1.8870107463622263	0.832710001248042	-0.6222044491855712	-1.2689159169302215	-1.2033226459213264
7	-0.9500457831291824	1.2447967753246547	0.2133064719381987	0.23173692995770692	-0.750332511558901	0.6580175977475702	-0.04455166589781693	-0.24526777439764524	0.6090609347644231	-0.15706186633158617
8	-1.2023915437972283	0.6513669480062141	-0.7409869235184222	0.3532704029344822	0.928200904505217	0.09323701308648707	-0.6815038634688395	-0.25830952369203947	-0.2791460704273421	0.566432083131961
9	1.7128882054368193	1.665135646740481	0.09495309350907842	2.3851193698190287	-1.3219689367683465	1.140675919055515	1.8274775589993553	1.7230222721130941	-0.8689105466060649	-0.06613561495067284


#### Paste

In [None]:
pd.read_clipboard()

### SQLite

In [None]:
import sqlite3

In [None]:
conn = sqlite3.connect('./datasets/flights.db')
sql_df = pd.read_sql('SELECT * FROM airlines LIMIT 10;', conn)
sql_df

# Indexing/Exploring/Manipulating in Pandas

While standard `'[]'` bracket slicing and `'.'` attribute access can be used, there are also 2 pandas-specific indexers:
1. `.loc` -> primarily label-based
2. `.iloc` -> primarily integer-based

Additionally, Pandas allows you to do random sampling from the dataframe
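Before working with the real data, here is a minimal sketch of label-based selection, position-based selection, and random sampling on a tiny made-up frame. The name `homes` and its values are assumptions for illustration only.

```python
import pandas as pd

homes = pd.DataFrame(
    {'beds': [2, 3, 4], 'price': [100, 150, 200]},
    index=['a', 'b', 'c'],
)

print(homes.loc['b', 'price'])  # label-based: row 'b', column 'price' -> 150
print(homes.iloc[1, 1])         # position-based: the same cell -> 150

# Random sampling; random_state fixes the seed so the draw is repeatable
print(homes.sample(n=2, random_state=0))
```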

In [None]:
# Load up some data to work with
index_example = pd.read_csv('./datasets/real_estate.csv')
small_idx = index_example.head(10)
small_idx

## `'[]'` slicing on a `pd.DataFrame` gives us a slice of **rows**

In [None]:
small_idx[:3]

## `'.'` operators and a column name can select a **specific named** column

In [None]:
small_idx.street

`'.'` operator selected columns are now just a `pd.Series` and can be `'[]'` sliced on further

In [None]:
small_idx.street[:3]

However, if a column name doesn't work as a `'.'` name (for example, it contains a space or clashes with an existing attribute), you can use `'[]'` selection as well.

In [None]:
small_idx['street'][:3]
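Two cases where bracket selection is the only option are sketched below. Both frames are hypothetical examples, not part of the real estate data.

```python
import pandas as pd

listing = pd.DataFrame({'street name': ['1 Main St', '2 Oak Ave'], 'beds': [2, 3]})

# listing.street name would be a syntax error, so brackets are required
print(listing['street name'])

# A column that collides with a DataFrame attribute also needs brackets
sizes = pd.DataFrame({'size': [10, 20]})
print(sizes['size'])  # the column
print(sizes.size)     # the DataFrame attribute: total number of cells (2)
```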

Furthermore, you can name your rows, and select them too

In [None]:
# Give our table index some string values
small_idx.index = gen_lab(2, len(small_idx))
small_idx[:10]

In [None]:
# gen_lab is random: replace 'dx' and 'sp' with two labels from your own index
small_idx['dx':'sp']

By using this, we can now select specific items/columns/rows of a dataframe and make changes to them

In [None]:
# Silence SettingWithCopy warning
pd.set_option('mode.chained_assignment', None)
small_idx.sq__ft.dtype

In [None]:
small_idx.sq__ft = small_idx.sq__ft.astype(np.float16)

In [None]:
small_idx.sq__ft.dtype

## Selection by label: the `.loc` method

```python
# .loc syntax
small_idx.loc[row_indexer, column_indexer]
```

#### A slice of specific items (based on label)

In [None]:
# Replace 'dx' and 'sp' with labels from your own (randomly generated) index
small_idx.loc['dx':'sp', 'street']

#### A list of desired items (based on label)

In [None]:
# The labels in the list must exist in your own (randomly generated) index
small_idx.loc[['dx', 'sp'], 'street']

#### Boolean indexing

In [None]:
small_idx.loc[small_idx.sq__ft > 1000]

## Selection by position: the `.iloc` method

#### A slice of specific items (based on position)

In [None]:
small_idx.iloc[:3,2]

#### A slice of rows with a list or mask of columns (based on position)

In [None]:
small_idx.iloc[:5,[0,1,6,9]]

In [None]:
small_idx.iloc[:5, small_idx.columns == 'sq__ft']

## Quick Exploration of the data

In [None]:
index_example.sq__ft.describe()

In [None]:
price = index_example.price
# Series.mad() was removed in pandas 2.0; compute the mean absolute deviation directly
print('MAD: {}'.format((price - price.mean()).abs().mean()))
print('SUM: {}'.format(price.aggregate('sum')))
print('Any missing values: {}'.format(price.hasnans))

## Object Manipulation

In [None]:
small_idx.loc[small_idx.sq__ft > 1000, 'price'] = 0 
small_idx.loc[small_idx.sq__ft > 1000]

# Data Visualization

Pandas uses a plotting library called `matplotlib` by default. You can start visualizing dataframes and series with a single command.

In [None]:
index_example.price.plot(kind='density')

In [None]:
index_example.plot.box()

In [None]:
# Count the listings per city, then draw the counts as a bar chart
index_example.city.value_counts().plot(kind='bar', rot=45)

In [None]:
import seaborn as sns

fig, axes = plt.subplots()
sns.violinplot(data=index_example.loc[:,['baths', 'beds']], ax=axes)
axes.set_ylabel('number')
plt.show()