# <p style="text-align: Center;">NumPy and Pandas</p>
## <p style="text-align: Center;">University of Wyoming COSC 1010</p>
### <p style="text-align: Center;">Adapted from: *Data Visualization with Python and JavaScript* By Kyran Dale </p>

## <p style="text-align: Center;">Introduction to NumPy</p>

## Introduction to NumPy
---
* NumPy is short for Numerical Python 
* It is a key building block of Pandas, the primary data analysis library 
* NumPy is a module that allows access to fast multi-dimensional array manipulation, implemented by low-level libraries 
    * These libraries are written in C and Fortran 
* Pure Python is comparatively slow at looping over numbers, which is why the heavy lifting is handed to these compiled libraries

## Introduction to NumPy
---
* NumPy allows you to apply an operation to every element of a large array in a single call 
    * This makes it quite quick 
* NumPy is the primary building block of many Python data processing libraries
* This ecosystem extends beyond just Pandas
* Understanding NumPy gives you an edge when working with a multitude of libraries 
* The key to understanding NumPy is its arrays 
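
A minimal sketch of what operating on an array "all at once" looks like, assuming a small made-up list of numbers (`values` is purely illustrative). The same squaring is done once with a Python loop and once with a single array expression:

```python
import numpy as np

values = list(range(10))  # hypothetical sample data

# Pure Python: the interpreter handles one element at a time
squares_loop = [v * v for v in values]

# NumPy: one expression applied to the whole array in compiled code
arr = np.array(values)
squares_vec = arr * arr

print(squares_loop)
print(squares_vec)
```

Both approaches give the same numbers; the array version simply avoids the per-element work in the interpreter.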

## The NumPy Array
--- 
* Everything in NumPy is built around its `ndarray` object
* Operations on these arrays are performed quickly using compiled libraries 
    * This allows NumPy to outperform native Python 
* Among other things you can perform standard arithmetic on these arrays
    * like a Python int or float 

In [34]:
import numpy as np 

a = np.array([1,2,3])

print(a+a)

[2 4 6]


## The NumPy Array
--- 
* There we were able to add an array to itself as easily as adding numbers 
* `import numpy as np` is the standard way to import the library 
* NumPy's compiled loops can take advantage of the vector (SIMD) instructions available on modern CPUs
    * This allows large computations to happen quickly 
* The key properties of a NumPy `ndarray` are:
    * The number of dimensions (`ndim`) 
    * shape (`shape`)
    * numeric type (`dtype`)

## The NumPy Array
--- 
* The same array of numbers can be given a new shape, usually without copying the underlying data
* This sometimes involves changing the array's number of dimensions 


In [2]:
def print_array_details(a):
    print(f"Dimensions: {a.ndim}\nshape: {a.shape}\ndtype: {a.dtype}")

a = np.array([1,2,3,4,5,6,7,8])
print(a)
print_array_details(a)

[1 2 3 4 5 6 7 8]
Dimensions: 1
shape: (8,)
dtype: int64


## The NumPy Array
--- 
* Using the `reshape` method the array shape and dimensions can be changed
    * An 8 element array can be changed to a 2d array of `2*4`
    * Or a 3d array of `2*2*2`
* The shape and numeric type of an array can be specified on creation of an array, or later 
* The easiest way to change an array's numeric type is by using the `astype` method
    * This returns a copy of the array converted to the new type; the original is left unchanged 

In [3]:
#first a 2d array from our 1d array 
a2 = a.reshape([2,4])
print(a2)

[[1 2 3 4]
 [5 6 7 8]]


In [4]:
#Then a 3d array from our 1d array 
a3 = a.reshape([2,2,2])
print(a3)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
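
The points about setting shape and type at creation, or changing the type later, can be sketched as follows. The variable names (`ints`, `floats`, `grid`) are illustrative only:

```python
import numpy as np

# Numeric type chosen when the array is created
ints = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype='int32')

# astype returns a new array; ints keeps its original type
floats = ints.astype('float64')

# Shape and type can both be given when creating a prefilled array
grid = np.zeros((2, 4), dtype='int32')

print(ints.dtype, floats.dtype)
print(grid.shape, grid.ndim)
```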


## Creating Arrays 
--- 
* As demonstrated, NumPy can create arrays from lists of numbers
* It also provides some utility functions to create arrays with a specific shape
    * `zeros` and `ones` are the most common functions to create prefilled arrays 
* Alternatively `empty` can be used, which is faster but leaves the initialization to you: its contents are whatever happened to be in memory 

In [8]:
print(np.zeros([2,3]))
print()
print(np.ones([3,3]))
print()
print(np.empty([2,2]))

[[0. 0. 0.]
 [0. 0. 0.]]

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

[[4.68677518e-310 0.00000000e+000]
 [4.68616830e-310 4.68616830e-310]]


## Creating Arrays 
--- 
* You can also create arrays with random elements using NumPy's `random` module
* The `linspace` function creates a specified number of evenly spaced samples over an interval 
* `arange` is similar, but uses a step-size argument 
    * Unlike `arange`, `linspace` includes the upper bound by default

In [11]:
print(np.random.random((2,2)))
print(np.linspace(2,10,5)) # five numbers range 2-10
print(np.arange(2,10,2)) # 2 being the step size

[[0.36800437 0.12884575]
 [0.40857825 0.43752508]]
[ 2.  4.  6.  8. 10.]
[2 4 6 8]


## Array Indexing and Slicing 
--- 
* One-dimensional arrays are indexed and sliced much like Python lists 
* Indexing multi-dimension arrays is also similar
    * Each dimension has its own indexing/slicing operation 
    * These are specified by comma separated tuples 
* If the number of objects in a selection tuple is fewer than the number of dimensions, the remaining dimensions are assumed to be fully selected 
* An ellipsis (`...`) can be used to denote a full selection of the dimensions it stands in for 

In [14]:
print(a[2])
print(a[3:5]) # a slice 
print(a[:4:2]) # slice of 0:4, every second item 
print(a[::-1]) #reversed 

3
[4 5]
[1 3]
[8 7 6 5 4 3 2 1]


In [17]:
ax = np.arange(16,dtype='int32')
ax = ax.reshape([2,2,4])
print(ax)
print(ax[1,1,2])

[[[ 0  1  2  3]
  [ 4  5  6  7]]

 [[ 8  9 10 11]
  [12 13 14 15]]]
14
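
The rules for partial selections and the ellipsis can be sketched with the same 2x2x4 array. The names `first_block`, `last_column` and `row_slice` are illustrative:

```python
import numpy as np

ax = np.arange(16, dtype='int32').reshape([2, 2, 4])

first_block = ax[0]          # fewer indices than dimensions: same as ax[0, :, :]
last_column = ax[..., 3]     # ellipsis fills in the leading dimensions: ax[:, :, 3]
row_slice = ax[1, 0, 1:3]    # one index or slice per dimension

print(first_block)
print(last_column)
print(row_slice)
```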


## A Few Basic Operations 
---
* With NumPy arrays you can perform basic (and more advanced) operations like you would with normal numeric variables 
* Being able to manipulate arrays as easily as single numbers is a huge strength of NumPy
* Comparison operators work in a similar way to arithmetic ones, returning an array of Boolean values


In [20]:
print(a)
print(a*2)#multiply all elements by 2
print(a-2)#subtract 2 from all elements
print(a/2)#divide all elements by 2
print(a < 4)#compare each element with 4

[1 2 3 4 5 6 7 8]
[ 2  4  6  8 10 12 14 16]
[-1  0  1  2  3  4  5  6]
[0.5 1.  1.5 2.  2.5 3.  3.5 4. ]
[ True  True  True False False False False False]
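
A short sketch of what the Boolean result is typically used for, assuming the same eight-element array. `mask` is just an illustrative name:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

mask = a < 4            # element-wise comparison gives a Boolean array
print(a[mask])          # keep only the elements where the mask is True
print(mask.all())       # True only if every element is < 4
print(mask.any())       # True if at least one element is < 4
```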


## A Few Basic Operations 
---
* Arrays also have a number of useful methods
    * `min` find the minimum 
    * `sum` sum the numbers 
    * `mean` take the average 
    * `std` find the standard deviation 
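
The methods listed above can be tried on the eight-element array from earlier; this is only a small sketch:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8])

print(a.min())   # smallest element
print(a.sum())   # total of all elements
print(a.mean())  # arithmetic mean
print(a.std())   # standard deviation (population form by default)
```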

## A Few Basic Operations 
---
* In addition there are a large number of built-in array functions 
    * `np.pi` not an array function, but gives you pi 
    * `np.degrees` translates radians to degrees 
    * `np.sin` takes the sine of each element in the array 
    * `np.round` round numbers to a provided number of decimal places 
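
A brief sketch of these functions applied to a made-up array of angles in radians (`angles` is illustrative):

```python
import numpy as np

angles = np.array([0, np.pi / 2, np.pi])  # radians

in_degrees = np.degrees(angles)           # convert each element to degrees
sines = np.round(np.sin(angles), 3)       # sine of each element, rounded to 3 places

print(in_degrees)
print(sines)
print(np.round(np.pi, 2))
```

Rounding is useful here because the sine of pi comes out as a tiny non-zero floating-point value rather than exactly zero.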

## <p style="text-align: Center;">Introduction to Pandas</p>

## Introduction to Pandas
---
* Pandas is a key element in a data visualization toolchain 
* Used for both cleaning and exploring data
* We will introduce the key concepts and show how Pandas works with existing data 

## Why Pandas is Tailor-Made for Dataviz 
---
* Pandas is tailor-made to manipulate row-column data tables
* This is done through its core datatype, the `DataFrame` 
    * The `DataFrame` can be thought of as a fast programmatic spreadsheet 

## Heterogeneous Data and Categorizing Measurements
---
* Pandas can be used to get data out of common data stores, like CSV files
* First though, we will talk about the heterogeneous datasets that Pandas was designed to work with
* A visualization likely presents the results of some sort of measurement 
* Measurements largely fall under two categories:
    * numerical
    * categorical 

## Heterogeneous Data and Categorizing Measurements
---
* Numerical values can be divided into interval and ratio scales
* Categorical values can in turn be divided into nominal and ordinal measurements 
* These give four broad categories of observation 

## Heterogeneous Data and Categorizing Measurements
---
* Suppose you had information about a tweet as JSON data:
```
    {
       "text": "#Python and #JavaScript sitting in a tree...",
        "id": 2103303030333004303,
        "favorited": true,
        "filter_level":"medium",
        "created_at": "Wed Mar 23 14:07:43 +0000 2015",
        "retweet_count":23,
        "coordinates":[-97.5, 45.3]
    }
```

## Heterogeneous Data and Categorizing Measurements
--- 
* The `text` and `id` fields are unique identifiers 
    * text may be categorical information 
* `favorited` is Boolean categorical information 
    * This would be nominal, it can be counted but not ordered
* `filter_level` is also categorical, but also ordinal
    * There is an order: low -> medium -> high 
* `created_at` is a timestamp, a numerical value on an interval scale 
    * Tweets could be ordered based on this value, something Pandas can do automatically
* `retweet_count` is also numerical, but on a ratio scale
    * A ratio scale has a meaningful zero, so 46 retweets is twice as many as 23 

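
The nominal versus ordinal distinction can be sketched in Pandas with an ordered categorical. The four `filter_level` values below are made up for illustration, and using `pd.Categorical` is just one way to express the ordering:

```python
import pandas as pd

# Hypothetical filter_level values for four tweets
levels = pd.Series(
    pd.Categorical(
        ['medium', 'low', 'high', 'medium'],
        categories=['low', 'medium', 'high'],  # declared order: low < medium < high
        ordered=True,
    )
)

sorted_levels = levels.sort_values().tolist()  # sorted by declared order, not alphabetically
above_low = (levels > 'low').tolist()          # comparisons respect the order
counts = levels.value_counts().to_dict()       # counting works for any categorical

print(sorted_levels)
print(above_low)
print(counts)
```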
## Heterogeneous Data and Categorizing Measurements
--- 
* Even this small record about a tweet contains heterogeneous information 
* This information covers all the generally accepted divisions of measurements 
* NumPy arrays are great for homogeneous number crunching
* Pandas, however, is designed for categorical data, time series, and items that reflect the heterogeneous nature of the world 

## The `DataFrame`
---
* The first step in a Pandas session is to load some data
* There are various ways to do so, and data can be loaded from multiple sources 
* For now we will start with some JSON data
* You can utilize Pandas' `read_json()` function to get a `DataFrame` based on the JSON 
* By convention, the names of `DataFrame` variables start with `df`

In [5]:
import pandas as pd 
df = pd.read_json('151pokemon.json')

## The `DataFrame`
---
* With the `df` in hand we can see and work with the contents
* A quick way to get the row-column structure of a `DataFrame` is to use the `head` method 
* The `head` method shows the first five rows by default 

In [6]:
df[['number', 'name', 'types']].head()  # head is a method, so it needs parentheses; the wide stats column is left out here

   number        name                                      types
0       1   Bulbasaur  [{'type1': 'Grass'}, {'type2': 'Poison'}]
1       2     Ivysaur  [{'type1': 'Grass'}, {'type2': 'Poison'}]
2       3    Venusaur  [{'type1': 'Grass'}, {'type2': 'Poison'}]
3       4  Charmander     [{'type1': 'Fire'}, {'type2': 'none'}]
4       5  Charmeleon     [{'type1': 'Fire'}, {'type2': 'none'}]

## Indices
---
* The `df`'s columns are indexed by the `columns` property
    * It is a Pandas `Index` instance 
    * The columns can be accessed and selected 
* Initially, Pandas rows have a single numeric index, available as `df.index` 
    * Pandas can handle multiple index levels if needed 


In [9]:
df.columns

Index(['number', 'name', 'types', 'stats'], dtype='object')

In [10]:
df.index

RangeIndex(start=0, stop=151, step=1)

## Indices
---
* As well as integers, row indices can be:
    * Strings
    * A `DatetimeIndex`
    * A `PeriodIndex` for time-based data 
* To aid selections, a column of a `DataFrame` can be set as the index via `set_index`
* The `loc` indexer can be used to select a row by its index label

In [20]:
dfx = df.set_index('name')
dfx.loc["Charizard"]

number                                                    6
types              [{'type1': 'Fire'}, {'type2': 'Flying'}]
stats     [{'total': '534'}, {'hp': '78'}, {'attack': '8...
Name: Charizard, dtype: object

## Rows and Columns 
---
* The rows and columns of a `DataFrame` are represented as Pandas `Series`
    * A heterogeneous counterpart of NumPy arrays
* These are essentially one-dimensional arrays that can contain any datatype 
* There are two main ways to select a row from the `DataFrame`
    * The `loc` indexer, which selects by label as demonstrated 
    * The `iloc` indexer, which selects by position

In [22]:
df.iloc[5]

number                                                    6
name                                              Charizard
types              [{'type1': 'Fire'}, {'type2': 'Flying'}]
stats     [{'total': '534'}, {'hp': '78'}, {'attack': '8...
Name: 5, dtype: object

## Rows and Columns 
---
* You can grab a column using `DataFrame` dot notation (`df.name`), or the conventional bracket access with the column name as key (`df['name']`)
* It will return a Pandas `Series` 
    * With all the column fields and their `DataFrame` indices preserved

In [24]:
name_col = df.name 
type(name_col)

pandas.core.series.Series

In [25]:
name_col

0       Bulbasaur
1         Ivysaur
2        Venusaur
3      Charmander
4      Charmeleon
          ...    
146       Dratini
147     Dragonair
148     Dragonite
149        Mewtwo
150           Mew
Name: name, Length: 151, dtype: object
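
The two ways of grabbing a column can be compared on a tiny stand-in frame. `df_small` and its three rows are a hypothetical substitute for the Pokemon data, used so the snippet runs without the JSON file:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the Pokemon data
df_small = pd.DataFrame({
    'number': [1, 4, 7],
    'name': ['Bulbasaur', 'Charmander', 'Squirtle'],
})

by_attribute = df_small.name    # dot notation
by_key = df_small['name']       # bracket notation, works for any column name

print(by_key.equals(by_attribute))
print(by_key.index.tolist())    # the DataFrame's row index is preserved
```

Bracket access is the safer habit: dot notation does not work for column names that contain spaces or that clash with an existing `DataFrame` attribute.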

## Selecting Groups 
---
* There are various ways to select groups (subsets of rows)
* This operation will occur on a `DataFrame` and return a filtered `DataFrame`
* Often it will be a selection of all rows with a specified column value
* One way to accomplish this is utilizing the `groupby` method
    * This groups the rows by the values in a column (or list of columns) 
* The `get_group` method then allows you to select the desired group
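
A minimal sketch of selecting a group, assuming a small made-up frame with a flat `type1` column (the real Pokemon data nests its types, so `df_small` is hypothetical). A Boolean mask is included as a second common way to get the same rows:

```python
import pandas as pd

# Hypothetical frame with a flat type column
df_small = pd.DataFrame({
    'name': ['Bulbasaur', 'Charmander', 'Charmeleon', 'Squirtle'],
    'type1': ['Grass', 'Fire', 'Fire', 'Water'],
})

# groupby collects the rows that share a value in type1
groups = df_small.groupby('type1')
fire_group = groups.get_group('Fire')

# The same rows selected with a Boolean mask
fire_mask = df_small[df_small['type1'] == 'Fire']

print(fire_group)
print(fire_mask['name'].tolist())
```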

## Creating and Saving DataFrames
---
* The easiest way to create a `DataFrame` is to use a Python dictionary 
    * Though it won't often be used 
    * Typically the data will be read in from a file 
* The `from_dict` method can be used; passing the records straight to `pd.DataFrame(...)` also works

In [26]:
dfn = pd.DataFrame.from_dict([
{'name': 'Albert Einstein', 'category':'Physics'},
{'name': 'Marie Curie', 'category':'Chemistry'},
{'name': 'William Faulkner', 'category':'Literature'}
])

dfn

               name    category
0   Albert Einstein     Physics
1       Marie Curie   Chemistry
2  William Faulkner  Literature


## Creating and Saving DataFrames
---
* Pandas has a large number of `read_[format]` and `to_[format]` methods
* These cover most data-loading use cases, from CSV to databases 
* By default Pandas will try to infer sensible column types for the loaded data
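
A small round trip illustrates the `to_[format]` and `read_[format]` pairing. This sketch keeps everything in memory with `io.StringIO` rather than writing files, which is a choice made here for convenience:

```python
import io
import pandas as pd

dfn = pd.DataFrame([
    {'name': 'Albert Einstein', 'category': 'Physics'},
    {'name': 'Marie Curie', 'category': 'Chemistry'},
    {'name': 'William Faulkner', 'category': 'Literature'},
])

# Save to CSV text (index=False leaves out the row numbers) and load it back
csv_text = dfn.to_csv(index=False)
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# The same idea with JSON, one object per row
json_text = dfn.to_json(orient='records')
df_from_json = pd.read_json(io.StringIO(json_text))

print(csv_text)
print(df_from_csv.columns.tolist())
print(df_from_json['name'].tolist())
```

Passing a filename instead of omitting the first argument of `to_csv` writes the data to disk.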

## CSV
---
* Pandas is able to handle CSV files in a sophisticated fashion 
* Conventional CSV files load without issue
* Many so-called CSV files, though, use a separator other than a comma, such as a semicolon or a tab 
* A non-standard separator can be specified with the `sep` parameter of `pd.read_csv()`
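
A minimal sketch of the `sep` parameter, assuming a made-up semicolon-separated text held in memory (`raw` is illustrative and stands in for a file):

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data
raw = "name;category\nAlbert Einstein;Physics\nMarie Curie;Chemistry\n"

df_semicolon = pd.read_csv(io.StringIO(raw), sep=';')

print(df_semicolon)
print(df_semicolon.columns.tolist())
```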

## Excel Files
---
* Pandas relies on an external engine to read Excel files: `xlrd` for legacy `.xls` workbooks and, in recent versions, `openpyxl` for `.xlsx` 
* Excel documents have multiple named sheets, each of which can be passed to a `DataFrame`
* The `read_excel` function can be used to easily read in a workbook, or sheets
    * Sheets can be specified via index or name with the `sheet_name` parameter (spelled `sheetname` in old Pandas releases) 
    * By default the first sheet is returned 

## Series into DataFrames
---
* The key idea with Pandas `Series` is the index
* These indices function as labels for the heterogeneous data
* When Pandas operates on more than one data object, these indices are used to align fields 
* Series can be created in one of three ways
    * From a Python list or NumPy array
    * From a Python `dict`
    * From a single scalar value 

In [27]:
s = pd.Series([1,2,3,4]) #automatically assigns int indices 
s

0    1
1    2
2    3
3    4
dtype: int64

In [28]:
# Alternatively can specify indices 
s = pd.Series([1,2,3,4],index=['a','b','c','d'])
s

a    1
b    2
c    3
d    4
dtype: int64

In [29]:
# A dict can be used to specify data and indices 

s = pd.Series({'a':1,'b':2,'c':3})
s

a    1
b    2
c    3
dtype: int64

In [30]:
# Finally a scalar can be passed with indices 
# (a list is used for the index because a set has no defined order)
pd.Series(9, index=['a','b','c'])

a    9
b    9
c    9
dtype: int64

## Series into DataFrames
---
* Slicing operations work as they would on a Python `list` or an `ndarray`
    * The labels will be preserved 
* Creating and manipulating individual series is useful when interacting with NumPy 
    * Or creating visualizations 
* Series are the building blocks of `DataFrames` so they can be easily joined together to create a `DataFrame`
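
The points about label-preserving slices and index alignment can be sketched as follows. The two Series `left` and `right` and the column names are made up for illustration:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
middle = s.iloc[1:3]            # positional slice; labels 'b' and 'c' are kept
print(middle)

# Two hypothetical Series whose labels only partly overlap
left = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
right = pd.Series({'b': 20.0, 'c': 30.0, 'd': 40.0})

# Building a DataFrame aligns the rows on the index; gaps become NaN
combined = pd.DataFrame({'left': left, 'right': right})
print(combined)
```

Because alignment is done by label rather than by position, the row for `'b'` pairs 2.0 with 20.0 even though those values sit at different positions in the two Series.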