<h1><center>Introduction to Programming in Python III</center></h1>
<h2><center>Pandas for Data Analysis and Manipulation</center></h2>
<center><h3>Notebook 1 - Pandas Data Structures</h3></center>

Helpful Links:

- [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Pandas Data Structures](https://pandas.pydata.org/docs/getting_started/dsintro.html#dsintro)
- [Getting Started](https://pandas.pydata.org/docs/getting_started/index.html)
- [Pandas Series and Numpy Arrays](https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7)

#### Import Numpy and Pandas

In [1]:
import numpy as np
import pandas as pd

Show the version numbers of the libraries we just imported:

In [2]:
print('numpy version', np.__version__)
print('pandas version', pd.__version__)

numpy version 1.18.1
pandas version 1.0.1


It is important to understand which version of a Python library we are using. See the different documentation versions hosted at: https://pandas.pydata.org/

Q: "Why do we import numpy in addition to pandas?"

A: Pandas includes data structures that build on top of data structures and programming logic already provided by Numpy.

#### Pandas Data Structures

###### Pandas "Series" type
"Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index." [Source](https://pandas.pydata.org/docs/getting_started/dsintro.html#series)

The basic syntax for creating a series is as follows:

`s = pd.Series(data, index=index)`

Show the Python docstring for the Series data type by appending `?` to its name. This is an IPython/Jupyter feature; uncomment the line below to run it.

In [3]:
#pd.Series?

##### Creating Series and understanding Index

Create a series from a single scalar value, 0

In [4]:
s = pd.Series(0)
s

0    0
dtype: int64

The index is automatically assigned since no index is provided.

Q: What is an index in Pandas?

A: "Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects."

In [5]:
s.index.dtype

dtype('int64')
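
The quoted definition calls the index immutable. The sketch below illustrates what that means in practice; the Series contents and the variable name `mutated` are made up for illustration. Assigning to a single index entry fails, while replacing the whole index is allowed.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# With no index given, pandas assigns integer labels 0..len(s)-1.
print(s.index)

# Index objects are immutable: assigning to one entry raises TypeError.
try:
    s.index[0] = 99
    mutated = True
except TypeError as err:
    mutated = False
    print('TypeError:', err)

# Replacing the entire index with a new set of labels is allowed.
s.index = ['x', 'y', 'z']
print(s)
```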

Create a series from a single scalar value of type str: `'a'`

In [6]:
s = pd.Series('a')
s

0    a
dtype: object

The dtype shown in the output above is *object*, not *str*. [Explanation](https://stackoverflow.com/questions/34881079/pandas-distinction-between-str-and-object-types)
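
As a small illustration of what the *object* dtype means, the sketch below builds a Series from mixed Python values (an example chosen here, not taken from the notebook). Each element keeps its own Python type, which is why pandas cannot use a single numeric dtype for the whole Series.

```python
import pandas as pd

# A Series holding values of different Python types.
mixed = pd.Series(['a', 1, 2.5])

# pandas falls back to the generic object dtype for mixed content.
print(mixed.dtype)

# Each element is still an ordinary Python object of its original type.
element_types = [type(v) for v in mixed]
print(element_types)
```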

Create a series from a Python list: `[0.0, 0.5, 1.0]`

In [7]:
s = pd.Series([0.0, 0.5, 1.0])
s

0    0.0
1    0.5
2    1.0
dtype: float64

So far, we have been allowing Pandas to automatically assign the index of our series since we have not specified one.

Create a Series by specifying both a **list for data** and a **list for the index**.

In [8]:
s = pd.Series(data=[0.0, 0.5, 1.0], index=['a','b','c'])
s

a    0.0
b    0.5
c    1.0
dtype: float64

In [9]:
s.index.dtype

dtype('O')

We can specify the same Series using a Python *dictionary* as the *data* argument.

In [10]:
s = pd.Series(data={
    'a': 0.0,
    'b': 0.5,
    'c': 1.0
})

s

a    0.0
b    0.5
c    1.0
dtype: float64
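
When a dictionary and an explicit `index` are passed together, pandas looks up each index label in the dictionary. The following sketch (the label `'z'` is a made-up label that is absent from the data) shows that missing labels become NaN and that dictionary keys not listed in the index are left out.

```python
import numpy as np
import pandas as pd

data = {'a': 0.0, 'b': 0.5, 'c': 1.0}

# Only the labels listed in `index` are kept, in the order given.
# 'z' is not a key of `data`, so its value is NaN; 'a' is dropped.
s = pd.Series(data, index=['c', 'b', 'z'])
print(s)
```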

Generate a range of values with `np.arange`, which returns a numpy.ndarray.

In [11]:
#np.arange?

In [12]:
a0 = np.arange(5.0)
a0

array([0., 1., 2., 3., 4.])

In [13]:
type(a0)

numpy.ndarray

In [14]:
s = pd.Series(data=a0)
s

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

Explore other arguments to pd.Series

In [15]:
#pd.Series?

Create a series from an ndarray created via np.arange and assign it a dtype of float64 and name it tempC.

In [16]:
s = pd.Series(data=np.arange(5), dtype='float64', name='tempC')
s

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
Name: tempC, dtype: float64

Describe the series and provide summary statistics using the describe method

In [17]:
s.describe()

count    5.000000
mean     2.000000
std      1.581139
min      0.000000
25%      1.000000
50%      2.000000
75%      3.000000
max      4.000000
Name: tempC, dtype: float64

Rename the series to `'tempF'` using the *rename* method.

In [18]:
s = s.rename('tempF')
s

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
Name: tempF, dtype: float64

Series are very similar to ndarrays and are valid arguments to *most* numpy functions. Note that slicing Series also slices the index.

Take a slice of the Series, s, containing only the 2nd and 3rd elements.
(Hint: the indexes we need are 1 and 2 and that range is expressed as: `1:3`.)

In [19]:
s[1:3]

1    1.0
2    2.0
Name: tempF, dtype: float64
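
Slicing can also be written explicitly by position or by label. The sketch below uses a Series with string labels (chosen here for illustration) to contrast `iloc`, which excludes the end position like a normal Python slice, with `loc`, which includes the end label.

```python
import pandas as pd

s = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd', 'e'])

# Positional slice: positions 1 and 2; the end position 3 is excluded.
by_position = s.iloc[1:3]

# Label slice: labels 'b' through 'c'; the end label is included.
by_label = s.loc['b':'c']

print(by_position)
print(by_label)
```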

Another example Series to show how slicing and indexing work with Series

In [20]:
s = pd.Series({
    4: 'a',
    3: 'b',
    2: 'c',
    1: 'd'
})
s

4    a
3    b
2    c
1    d
dtype: object

Contrast the positional slice shown earlier with the following, which looks up the value **at the specified index label** rather than at an integer position within the Series.

In [21]:
s[1]

'd'

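So, what happens if we specify `s[0]` here? Because `0` is not one of the index labels, the lookup raises a `KeyError`, which is why the cell below is commented out.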

In [22]:
#s[0]
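
A minimal sketch of the difference between label lookup and positional lookup on this integer-labelled Series. The variable names `raised`, `first`, and `by_label` are introduced here for illustration.

```python
import pandas as pd

s = pd.Series({4: 'a', 3: 'b', 2: 'c', 1: 'd'})

# With an integer index, s[0] is a label lookup, and 0 is not a label.
try:
    s[0]
    raised = False
except KeyError:
    raised = True

# iloc always selects by position; loc always selects by label.
first = s.iloc[0]    # first element, whatever its label
by_label = s.loc[1]  # element whose label is 1

print(raised, first, by_label)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about which kind of lookup is intended.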

Like NumPy ndarray, Series has a dtype

In [23]:
s = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
s

a    1.0
b    2.0
c    3.0
dtype: float64

In [24]:
s.dtype

dtype('float64')

Except for special cases, the dtype is normally a NumPy dtype.

While a Series is like a NumPy ndarray, if you need to produce an actual ndarray, you may call the *to_numpy* method on the Series.

In [25]:
s.to_numpy()

array([1., 2., 3.])
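
The sketch below contrasts passing a Series directly to a NumPy function with converting it first; the values are chosen here so the square roots are exact. Applying `np.sqrt` to the Series keeps the labels, while `to_numpy` returns a plain array with no index.

```python
import numpy as np
import pandas as pd

s = pd.Series({'a': 1.0, 'b': 4.0, 'c': 9.0})

# A NumPy ufunc applied to a Series returns a Series with the same index.
roots = np.sqrt(s)
print(roots)

# to_numpy() discards the labels and returns the underlying values.
arr = s.to_numpy()
print(type(arr), arr)
```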

Series is dict-like:

In [26]:
s

a    1.0
b    2.0
c    3.0
dtype: float64

In [27]:
s['a']

1.0

In [28]:
s['c'] = 4
s

a    1.0
b    2.0
c    4.0
dtype: float64

In [29]:
'a' in s

True

In [30]:
'd' in s

False

If the key is not contained in the index, a KeyError is raised.

In [31]:
#s['d']

We can use the *get* method to return `None` when the key is missing, or a specified default such as *np.nan*.

In [32]:
val = s.get('d')
print(val)

None


In [33]:
val = s.get('d', np.nan)
print(val)

nan


In [34]:
s['d'] = 8.0
print(s.get('d'))

8.0


In [35]:
s

a    1.0
b    2.0
c    4.0
d    8.0
dtype: float64

In [36]:
1.0 in s

False

In [37]:
1.0 in s.values

True
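
As with a dictionary, `in` tests the index labels of a Series, not its values. The following sketch shows a few ways to test for values instead; the variable names are introduced here for illustration.

```python
import pandas as pd

s = pd.Series({'a': 1.0, 'b': 2.0, 'c': 4.0, 'd': 8.0})

label_present = 'a' in s           # checks the index labels
value_as_label = 1.0 in s          # 1.0 is a value, not a label

# Element-wise comparison followed by any() checks the values.
value_present = bool((s == 1.0).any())

# isin() tests membership in a collection of candidate values.
any_of_several = bool(s.isin([3.0, 8.0]).any())

print(label_present, value_as_label, value_present, any_of_several)
```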

Vector operations and label alignment with Series

In [38]:
s

a    1.0
b    2.0
c    4.0
d    8.0
dtype: float64

In [39]:
t = s / 2
t

a    0.5
b    1.0
c    2.0
d    4.0
dtype: float64

In [40]:
s + t

a     1.5
b     3.0
c     6.0
d    12.0
dtype: float64

In [41]:
s - t

a    0.5
b    1.0
c    2.0
d    4.0
dtype: float64

In [42]:
s * t

a     0.5
b     2.0
c     8.0
d    32.0
dtype: float64

In [43]:
s ** 2

a     1.0
b     4.0
c    16.0
d    64.0
dtype: float64

In [44]:
t - s

a   -0.5
b   -1.0
c   -2.0
d   -4.0
dtype: float64

In [45]:
#np.exp?

In [46]:
np.exp(t - s)

a    0.606531
b    0.367879
c    0.135335
d    0.018316
dtype: float64

"A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels." [Source](https://pandas.pydata.org/docs/getting_started/dsintro.html#vectorized-operations-and-label-alignment-with-series)

Notice the NaN values in the result of an operation on two Series whose labels don't exactly match.

In [47]:
s = pd.Series({'a': 1., 'b': 2.0, 'c': 3.0})
t = pd.Series({'b': 4.0, 'c': 5.0, 'd': 6.0})
u = s + t
u

a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64
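
If NaN is not the desired result for labels that appear in only one Series, the `add` method accepts a `fill_value` that stands in for the missing side. The sketch below reuses the two Series above; choosing `0` as the fill value is an assumption made for this example.

```python
import pandas as pd

s = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
t = pd.Series({'b': 4.0, 'c': 5.0, 'd': 6.0})

# The + operator leaves NaN wherever a label is missing on one side.
plain = s + t

# add() with fill_value treats the missing side as 0 before adding.
filled = s.add(t, fill_value=0)

print(plain)
print(filled)
```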

The NaN values can be dropped if desired using the dropna method.

In [48]:
#u.dropna?

In [49]:
u.dropna()

b    6.0
c    8.0
dtype: float64

In [50]:
u

a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64

In [51]:
u = u.dropna()
u

b    6.0
c    8.0
dtype: float64

In [52]:
u = s + t
u

a    NaN
b    6.0
c    8.0
d    NaN
dtype: float64

In [53]:
u.dropna(inplace=True)
u

b    6.0
c    8.0
dtype: float64

###### Pandas "DataFrame" type
"DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules." [Source](https://pandas.pydata.org/docs/getting_started/dsintro.html#dataframe)
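
To illustrate the statement that a dict of Series plus a specific index discards data not matching the passed index, here is a small sketch. The Series contents are made up for the example.

```python
import pandas as pd

s1 = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
s2 = pd.Series({'b': 4.0, 'c': 5.0, 'd': 6.0})

# Without an index argument, the row labels are the union of both indexes.
df_all = pd.DataFrame({'one': s1, 'two': s2})

# With an explicit index, only the matching rows are kept.
df_sub = pd.DataFrame({'one': s1, 'two': s2}, index=['b', 'c'])

print(df_all)
print(df_sub)
```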

Create a DataFrame from a dict of lists or ndarrays.

In [54]:
#pd.DataFrame?

In [55]:
# from a dict-of-lists
df = pd.DataFrame({'one': [1., 2., 3., 4.],
                   'two': [4., 3., 2., 1.]})
df

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


As with Series, if an index is not specified, one is automatically assigned from 0 through n-1, where n is the length of each list.

Create a DataFrame from a dict of lists and also specify the index via a list of strings.

In [56]:
df = pd.DataFrame({'one': [1., 2., 3., 4.],
                   'two': [4., 3., 2., 1.]},
                  index=['a', 'b', 'c', 'd'])
df

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


Create a DataFrame from a list of dictionaries.

In [57]:
df = pd.DataFrame([{'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 4.0},
                   {'a': 4.0, 'b': 3.0, 'c': 2.0, 'd': 1.0}],
                  index=['one', 'two'])
df

       a    b    c    d
one  1.0  2.0  3.0  4.0
two  4.0  3.0  2.0  1.0


Create a DataFrame from a dict of Series (each Series becomes a column) and from a list of Series (each Series becomes a row).

In [58]:
s1 = pd.Series({'a': 1., 'b': 2., 'c': 3., 'd': 4.})
print(s1)

s2 = pd.Series({'a': 4., 'b': 3., 'c': 2., 'd': 1.})
print(s2)

a    1.0
b    2.0
c    3.0
d    4.0
dtype: float64
a    4.0
b    3.0
c    2.0
d    1.0
dtype: float64


In [59]:
df = pd.DataFrame({'one': s1, 'two': s2})
df

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


In [60]:
df = pd.DataFrame([s1, s2], index=['one', 'two'])
df

       a    b    c    d
one  1.0  2.0  3.0  4.0
two  4.0  3.0  2.0  1.0


*Transposing* a DataFrame (swapping rows with columns)

In [61]:
df.T

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


In [62]:
df.transpose()

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


#### Column Selection, Addition, Deletion

In [63]:
df = pd.DataFrame({'one': [0., 1., 2.], 'two': [2., 1., 0.]}, index=['a', 'b', 'c'])
df

   one  two
a  0.0  2.0
b  1.0  1.0
c  2.0  0.0


##### Column Selection

In [64]:
df['one']

a    0.0
b    1.0
c    2.0
Name: one, dtype: float64

##### Column Addition

In [65]:
df['three'] = df['one'] * df['two']
df

   one  two  three
a  0.0  2.0    0.0
b  1.0  1.0    1.0
c  2.0  0.0    0.0


In [66]:
df['four'] = df['two'] - df['three']
df

   one  two  three  four
a  0.0  2.0    0.0   2.0
b  1.0  1.0    1.0   0.0
c  2.0  0.0    0.0   0.0


In [67]:
df['five'] = df['two'] / df['four']
df

   one  two  three  four  five
a  0.0  2.0    0.0   2.0   1.0
b  1.0  1.0    1.0   0.0   inf
c  2.0  0.0    0.0   0.0   NaN
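
The division above produces `inf` where a nonzero value is divided by zero and NaN where zero is divided by zero. If both should be treated as missing, one option is to replace the infinities with NaN, as in this sketch. Treating `inf` as missing is an assumption about the analysis, not something the notebook requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'two': [2.0, 1.0, 0.0],
                   'four': [2.0, 0.0, 0.0]},
                  index=['a', 'b', 'c'])

# 2/2 -> 1.0, 1/0 -> inf, 0/0 -> NaN
ratio = df['two'] / df['four']

# Replace positive and negative infinity with NaN.
cleaned = ratio.replace([np.inf, -np.inf], np.nan)

print(ratio)
print(cleaned)
```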


##### Column Deletion

In [68]:
del df['one']
df

   two  three  four  five
a  2.0    0.0   2.0   1.0
b  1.0    1.0   0.0   inf
c  0.0    0.0   0.0   NaN


In [69]:
df.pop('two')
df

   three  four  five
a    0.0   2.0   1.0
b    1.0   0.0   inf
c    0.0   0.0   NaN
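
Both `del` and `pop` modify the DataFrame in place, but `pop` also returns the removed column as a Series. The sketch below shows this, along with `drop`, which returns a new DataFrame and leaves the original untouched. The column values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'one': [0.0, 1.0, 2.0],
                   'two': [2.0, 1.0, 0.0],
                   'three': [5.0, 6.0, 7.0]},
                  index=['a', 'b', 'c'])

# pop removes the column in place and hands it back as a Series.
popped = df.pop('two')
print(popped)

# drop returns a copy without the column; df itself is unchanged.
without_three = df.drop(columns=['three'])
print(list(df.columns), list(without_three.columns))
```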


When a new column is assigned a scalar value, that value is set at every index location.

In [70]:
df['one'] = 0.0
df

   three  four  five  one
a    0.0   2.0   1.0  0.0
b    1.0   0.0   inf  0.0
c    0.0   0.0   NaN  0.0


When inserting a Series that doesn't have the same index as the DataFrame, it will be conformed to the DataFrame's index: values are matched by label, and labels missing from the Series become NaN.

In [71]:
df['two'] = df['one'][0:1]
df

   three  four  five  one  two
a    0.0   2.0   1.0  0.0  0.0
b    1.0   0.0   inf  0.0  NaN
c    0.0   0.0   NaN  0.0  NaN
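
A further sketch of how an assigned Series is conformed to the DataFrame's index. The Series `partial` and its labels are made up for illustration: its values are matched by label regardless of order, the missing label `'b'` becomes NaN, and the extra label `'z'` is ignored.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.0, 0.0, 0.0]}, index=['a', 'b', 'c'])

# Labels are out of order, 'b' is missing, and 'z' is not in df's index.
partial = pd.Series({'c': 30.0, 'a': 10.0, 'z': 99.0})

df['two'] = partial
print(df)
```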


##### Column Insertion

Notice that newly created columns are added at the end of the DataFrame. If you want to insert a column at a specific position, you may use the *insert* method.

In [72]:
df.insert(0, 'zero', df['one'])
df

   zero  three  four  five  one  two
a   0.0    0.0   2.0   1.0  0.0  0.0
b   0.0    1.0   0.0   inf  0.0  NaN
c   0.0    0.0   0.0   NaN  0.0  NaN


##### Column Reordering

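Columns can also be reordered by creating a new DataFrame from the existing one. Selecting a list of column names returns those columns in the order given; any column left out of the list is not included.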

In [73]:
df = df[['zero', 'one', 'two']]
df

   zero  one  two
a   0.0  0.0  0.0
b   0.0  0.0  NaN
c   0.0  0.0  NaN
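
Selecting with a list of names returns a new DataFrame, so the original keeps its column order unless the result is assigned back. The sketch below also shows `reindex(columns=...)` as an alternative; unlike list selection, it accepts names that do not exist and fills them with NaN. The column name `'missing'` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'zero': [0.0, 0.0],
                   'one': [1.0, 1.0],
                   'two': [2.0, 2.0]})

# List selection: new DataFrame with the requested columns and order.
reordered = df[['two', 'zero']]

# reindex: same idea, but unknown names become all-NaN columns.
via_reindex = df.reindex(columns=['two', 'zero', 'missing'])

print(list(df.columns))
print(reordered)
print(via_reindex)
```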


#### Indexing / Selection of DataFrames

|Operation | Syntax | Result|
|:---------|:-------|:------|
|Select column|df[col]|Series|
|Select row by label|df.loc[label]|Series|
|Select row by integer location|df.iloc[loc]|Series|
|Slice rows|df[5:10]|DataFrame|
|Select rows by boolean vector|df[bool_vec]|DataFrame|

In [74]:
df = pd.DataFrame({
    'one': [0, 1, 2],
    'two': [3, 4, 5],
    'three': [6, 7, 8]}, index=['a', 'b', 'c'])
df

   one  two  three
a    0    3      6
b    1    4      7
c    2    5      8


Select the second column by column name.

In [75]:
df['two']

a    3
b    4
c    5
Name: two, dtype: int64

Select the 1st and 2nd columns by column name.

In [76]:
df[['one', 'two']]

   one  two
a    0    3
b    1    4
c    2    5


Select 3rd row by row (index) label

In [77]:
df.loc['c']

one      2
two      5
three    8
Name: c, dtype: int64

Select 3rd row by row integer location

In [78]:
df.iloc[2]

one      2
two      5
three    8
Name: c, dtype: int64

Select the 2nd and 3rd rows by slicing

In [79]:
df[1:3]

   one  two  three
b    1    4      7
c    2    5      8


Select the 2nd and 3rd rows using a boolean vector

In [80]:
df[[False, True, True]]

   one  two  three
b    1    4      7
c    2    5      8
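
In practice the boolean vector is usually computed from the data rather than typed by hand. The sketch below builds the same mask with a comparison and also shows `loc` and `iloc` selecting rows and columns together. The condition `one > 0` is chosen here only because it reproduces the mask above.

```python
import pandas as pd

df = pd.DataFrame({'one': [0, 1, 2],
                   'two': [3, 4, 5],
                   'three': [6, 7, 8]}, index=['a', 'b', 'c'])

# A comparison on a column yields a boolean Series aligned with the rows.
mask = df['one'] > 0
selected = df[mask]
print(selected)

# loc takes row and column labels; iloc takes integer positions.
cell = df.loc['b', 'three']
block = df.iloc[0:2, 0:2]
print(cell)
print(block)
```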
