# M.A.D. Python Libraries - `pandas`

<span style="color:red;">**M.A.D.** => **M**achine Learning **A**nd **D**ata Science</span>

**Purpose:** The purpose of this workbook is to help you get comfortable with the topics outlined below.

**Prereqs**
* None

**Recommended Usage**
* Run each of the cells (Shift+Enter) and edit them as necessary to solidify your understanding
* Do any of the exercises that are relevant to helping you understand the material

**Topics Covered**
* Pandas

# Workbook Setup

## Troubleshooting Tips

If you run into issues running any of the code in this notebook, check your version of Jupyter, Python, extensions, libraries, etc.

```bash
!jupyter --version

jupyter core     : 4.6.1
jupyter-notebook : 6.0.2
qtconsole        : not installed
ipython          : 7.9.0
ipykernel        : 5.1.3
jupyter client   : 5.3.4
jupyter lab      : 1.2.3
nbconvert        : 5.6.1
ipywidgets       : not installed
nbformat         : 4.4.0
traitlets        : 4.3.3
```

```bash
!jupyter-labextension list

JupyterLab v1.2.3
Known labextensions:
   app dir: /usr/local/share/jupyter/lab
        @aquirdturtle/collapsible_headings v0.5.0  enabled  OK
        @jupyter-widgets/jupyterlab-manager v1.1.0  enabled  OK
        @jupyterlab/git v0.8.2  enabled  OK
        @jupyterlab/github v1.0.1  enabled  OK
        jupyterlab-flake8 v0.4.0  enabled  OK

Uninstalled core extensions:
    @jupyterlab/github
    jupyterlab-flake8
```

In [6]:
# # Run this cell to check the version of Jupyter you are running
# !jupyter --version

In [2]:
# # Run one of these cells to check what extensions you are using
# !jupyter-labextension list
# !jupyter-nbextension list

In [1]:
# # Check Python version
# import sys
# print(sys.version)

## Notebook Configs

In [1]:
# AUTO GENERATED CELL FOR NOTEBOOK SETUP

# NOTEBOOK WIDE MAGICS

# Reload all modules before executing a new line
%load_ext autoreload
%autoreload 2

# Abide by PEP8 code style
%load_ext pycodestyle_magic
%pycodestyle_on

# LIBRARY SPECIFIC MAGICS - UNCOMMENT AS NEEDED

# Plot all matplotlib plots in output cell and save on close
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

Save an array as a .npy file

In [67]:
np.save('my_array', a)

In [25]:
np.load('my_array.npy')

array([1, 2, 3])

Save several arrays in an uncompressed .npz file

In [68]:
np.savez('array.npz', array1=a, array2=b)

In [33]:
my_arrays = np.load('array.npz')

In [34]:
my_arrays['array1']

array([1, 2, 3])
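
The cells above use arrays `a` and `b` that are defined elsewhere in the notebook. As a self-contained sketch of the same save/load round trip, the example below defines its own small arrays (an assumption made for illustration) and writes to a temporary directory so nothing is left on disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical example arrays; the notebook's `a` and `b` are defined elsewhere
a = np.array([1, 2, 3])
b = np.array([[1.5, 2.0, 3.0], [4.0, 5.0, 6.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, 'my_array.npy')
    npz_path = os.path.join(tmp_dir, 'array.npz')

    np.save(npy_path, a)                     # one array -> .npy
    np.savez(npz_path, array1=a, array2=b)   # several arrays -> .npz

    loaded = np.load(npy_path)

    # np.load on an .npz file returns a dict-like archive that reads each
    # array on access; the context manager closes the file afterwards
    with np.load(npz_path) as archive:
        names = sorted(archive.files)
        restored_b = archive['array2']

print(loaded)
print(names)
```

The keyword names passed to `np.savez` (`array1`, `array2`) become the keys used to look the arrays up again.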

In [94]:
i = a / b
print('Division\n a / b = i\n\n {}\n / \n{}\n =\n{}'.format(a, b, i))

Division
 a / b = i

 [1 2 3]
 / 
[[1.5 2.  3. ]
 [4.  5.  6. ]]
 =
[[0.66666667 1.         1.        ]
 [0.25       0.4        0.5       ]]


In [95]:
np.divide(a, b)

array([[0.66666667, 1.        , 1.        ],
       [0.25      , 0.4       , 0.5       ]])



In [96]:
j = a * b
print('Multiplication\n a * b = j\n\n {}\n * \n{}\n =\n{}'.format(a, b, j))

Multiplication
 a * b = j

 [1 2 3]
 * 
[[1.5 2.  3. ]
 [4.  5.  6. ]]
 =
[[ 1.5  4.   9. ]
 [ 4.  10.  18. ]]


In [97]:
np.multiply(a, b)

array([[ 1.5,  4. ,  9. ],
       [ 4. , 10. , 18. ]])

In [98]:
print(b)
np.exp(b)  # Element-wise exponentiation (i.e., e^b)

[[1.5 2.  3. ]
 [4.  5.  6. ]]


array([[  4.48168907,   7.3890561 ,  20.08553692],
       [ 54.59815003, 148.4131591 , 403.42879349]])

In [99]:
print(b)
np.sqrt(b)  # Square root

[[1.5 2.  3. ]
 [4.  5.  6. ]]


array([[1.22474487, 1.41421356, 1.73205081],
       [2.        , 2.23606798, 2.44948974]])

In [100]:
print(a)
np.sin(a)  # Element-wise sine

[1 2 3]


array([0.84147098, 0.90929743, 0.14112001])

In [101]:
print(b)
np.cos(b)  # Element-wise cosine

[[1.5 2.  3. ]
 [4.  5.  6. ]]


array([[ 0.0707372 , -0.41614684, -0.9899925 ],
       [-0.65364362,  0.28366219,  0.96017029]])

In [102]:
print(a)
np.log(a)  # Element-wise natural logarithm

[1 2 3]


array([0.        , 0.69314718, 1.09861229])

In [106]:
print_a(a)
a < 2  # Element-wise comparison

Shape: (3,)
[1 2 3]


array([ True, False, False])

In [105]:
print_a(a)
print_a(b)
a == b  # Element-wise comparison

Shape: (3,)
[1 2 3]
Shape: (2, 3)
[[1.5 2.  3. ]
 [4.  5.  6. ]]


array([[False,  True,  True],
       [False, False, False]])
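
The arithmetic and comparison cells above combine a `(3,)` array with a `(2, 3)` array. This works because of NumPy broadcasting. The sketch below recreates the two arrays from the printed outputs and makes the stretching explicit with `np.broadcast_to`; it is an illustration, not part of the original workbook.

```python
import numpy as np

a = np.array([1, 2, 3])                  # shape (3,)
b = np.array([[1.5, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # shape (2, 3)

# Broadcasting conceptually repeats `a` once per row of `b`
stretched = np.broadcast_to(a, b.shape)
print(stretched.shape)                   # (2, 3)

# so a / b and a == b match the results for the stretched array
assert np.array_equal(a / b, stretched / b)
assert np.array_equal(a == b, stretched == b)

# Shapes whose trailing dimensions disagree cannot be broadcast
try:
    np.array([1, 2]) + b
except ValueError as err:
    print('Broadcast error:', err)
```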

In [116]:
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

In [115]:
print(np.array_equal(a, b))
print(np.array_equiv(a, b))
print(np.allclose(a, b))

True
True
True


In [125]:
a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [1, 2, 3]])

In [126]:
print(np.array_equal(a, b))
print(np.array_equiv(a, b))
print(np.allclose(a, b))

False
True
True


In [129]:
a = np.array([1e10, 1e-8])
b = np.array([1.00001e10, 1e-9])

In [131]:
print(np.array_equal(a, b))
print(np.array_equiv(a, b))
print(np.allclose(a, b))

False
False
True
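
The three checks differ in strictness: `np.array_equal` requires the same shape and elements, `np.array_equiv` also accepts shapes that broadcast to each other, and `np.allclose` compares within a tolerance. As a sketch of why the last pair of arrays counts as "close", the example below reproduces the tolerance test by hand using the documented default values of `rtol` and `atol`.

```python
import numpy as np

a = np.array([1e10, 1e-8])
b = np.array([1.00001e10, 1e-9])

# Default tolerances of np.allclose / np.isclose
rtol, atol = 1e-05, 1e-08

# Element-wise tolerance test: |a - b| <= atol + rtol * |b|
manual = np.abs(a - b) <= atol + rtol * np.abs(b)

print(manual)             # [ True  True]
print(np.isclose(a, b))   # [ True  True]
print(np.allclose(a, b))  # True
```

Because the test scales with `|b|`, it is not symmetric: swapping the arguments can change the result for values near the tolerance.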


In [137]:
a = np.array([1, 2, 3])
b = np.array([[1, 2, 3], [4, 5, 6]])

In [134]:
a.sum()  # Array-wise sum

6

In [135]:
print('{}\n'.format(a))

print(a.min())  # Array-wise minimum value
print(a.max())  # Array-wise maximum value

[1 2 3]

1
3


In [138]:
b[0][2] = 0
b[1][1] = 1
print('{}\n'.format(b))

print(b.min(axis=0))  # Minimum of each column (reduces along axis 0)
print(b.max(axis=0))  # Maximum of each column (reduces along axis 0)

[[1 2 0]
 [4 1 6]]

[1 1 0]
[4 2 6]


In [139]:
print('{}\n'.format(b))

print(b.min(axis=1))  # Minimum of each row (reduces along axis 1)
print(b.max(axis=1))  # Maximum of each row (reduces along axis 1)

[[1 2 0]
 [4 1 6]]

[0 1]
[2 6]
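
A common source of confusion is which axis disappears in a reduction. The sketch below rebuilds the `(2, 3)` array from the output above and shows that `axis=0` collapses the rows while `axis=1` collapses the columns.

```python
import numpy as np

b = np.array([[1, 2, 0],
              [4, 1, 6]])    # shape (2, 3)

col_min = b.min(axis=0)      # collapses the 2 rows: one value per column
row_min = b.min(axis=1)      # collapses the 3 columns: one value per row

print(col_min, col_min.shape)   # [1 1 0] (3,)
print(row_min, row_min.shape)   # [0 1] (2,)

# keepdims=True keeps the reduced axis with length 1, which helps when
# the result is broadcast back against the original array
row_max = b.max(axis=1, keepdims=True)
print(row_max.shape)            # (2, 1)
print(b / row_max)              # each row scaled by its own maximum
```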


In [140]:
print('{}\n'.format(b))
b.cumsum(axis=1)  # Cumulative sum along each row

[[1 2 0]
 [4 1 6]]



array([[ 1,  3,  3],
       [ 4,  5, 11]])

In [141]:
print('{}\n'.format(a))
a.mean()  # Mean

[1 2 3]



2.0

In [142]:
print('{}\n'.format(b))
np.median(b)  # Median

[[1 2 0]
 [4 1 6]]



1.5

In [143]:
print('{}\n'.format(a))
np.corrcoef(a)  # Correlation coefficient

[1 2 3]



1.0

In [144]:
print('{}\n'.format(b))
np.std(b)  # Standard deviation

[[1 2 0]
 [4 1 6]]



2.0548046676563256

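**COPY / DEEP COPY:** A copy stores its contents in a separate memory location, so changes to the copy do not affect the original. For arrays of plain numeric data, `np.copy` and `ndarray.copy` produce this kind of deep copy.

**VIEW / SHALLOW COPY:** A view is a new array object that looks at the same memory as the original, so a change made through either one is visible in the other.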

In [145]:
print('Array: {} --> mem loc: {}\n'.format(a, a.__array_interface__['data']))

h = a.view()  # Create a view of the array with the same data

print('Array: {} --> mem loc: {}\n'.format(h, h.__array_interface__['data']))

Array: [1 2 3] --> mem loc: (140190773552048, False)

Array: [1 2 3] --> mem loc: (140190773552048, False)



In [149]:
print('Array: {} --> mem loc: {}\n'.format(a, a.__array_interface__['data']))

h = np.copy(a)  # Create a copy of the array

print('Array: {} --> mem loc: {}\n'.format(h, h.__array_interface__['data']))

Array: [1 2 3] --> mem loc: (140190773552048, False)

Array: [1 2 3] --> mem loc: (140190773429392, False)



In [148]:
print('Array: {} --> mem loc: {}\n'.format(a, a.__array_interface__['data']))

i = a.copy()  # Create a copy of the array

print('Array: {} --> mem loc: {}\n'.format(i, i.__array_interface__['data']))

Array: [1 2 3] --> mem loc: (140190773552048, False)

Array: [1 2 3] --> mem loc: (140190773577344, False)
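
Memory addresses show where the data lives, but the practical difference appears when you modify the arrays. The following sketch (variable names are chosen for illustration) writes through a view and a copy and checks the effect on the original.

```python
import numpy as np

a = np.array([1, 2, 3])
view = a.view()     # shares a's data buffer
dup = a.copy()      # owns a separate buffer

view[0] = 99        # writes through to `a`
dup[1] = -1         # does not affect `a`

print(a)                            # [99  2  3]
print(np.shares_memory(a, view))    # True
print(np.shares_memory(a, dup))     # False
print(view.base is a)               # True: the view points back to `a`

# Basic slicing also returns a view
tail = a[1:]
tail[:] = 0
print(a)                            # [99  0  0]
```

`np.shares_memory` is usually a more readable check than comparing raw addresses.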



In [152]:
a = np.array([2, 1, 4])
print(a)

a.sort()  # Sort an array in place
print(a)

[2 1 4]
[1 2 4]


In [154]:
c = np.random.randint(0, 9, (2, 2, 3))
print('{}\n\n'.format(c))

c.sort(axis=0)  # Sort the elements of an array's axis
print(c)

[[[1 3 1]
  [3 2 0]]

 [[1 7 7]
  [2 5 2]]]


[[[1 3 1]
  [2 2 0]]

 [[1 7 7]
  [3 5 2]]]


# [`pandas`](https://pandas.pydata.org/pandas-docs/stable/)

`pandas` is a library that comes with many easy-to-use data structures and data analysis tools.

[Pandas Cheatsheet (pdf)](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)

## Pandas Data Structures

### Series

A one-dimensional labeled array that can hold any data type

In [None]:
# series1 = pd.Series([10, 20, 30, 40])
# print(series1)

In [None]:
# names = np.array(['alskm', 'alskm', 'aslkdm'])
# s2 = pd.Series(names)
# print(s2)

In [148]:
# Series from np array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)

0    g
1    e
2    e
3    k
4    s
dtype: object


In [149]:
# Series with an explicit index
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
print(s)

a    3
b   -5
c    7
d    4
dtype: int64


In [150]:
# Series from a dictionary (can also be built from a list, etc.)
# Named `counts` rather than `dict` to avoid shadowing the built-in
counts = {'Geeks': 10,
          'for': 20,
          'geeks': 30}

ser = pd.Series(counts)
print(ser)

Geeks    10
for      20
geeks    30
dtype: int64
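
A Series pairs an index of labels with an array of values. The sketch below reuses the labeled Series from above to show access by label and by position; the second Series (`counts`, with made-up keys) illustrates what happens when a dictionary is combined with an explicit index.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

print(s.index.tolist())   # the labels
print(s.to_numpy())       # the underlying values

print(s.loc['b'])         # access by label
print(s.iloc[1])          # access by integer position
print(s.loc['b':'c'])     # label slices include both endpoints

# Passing an index together with a dict selects and reorders the keys;
# labels missing from the dict become NaN
counts = pd.Series({'x': 10, 'y': 20}, index=['y', 'x', 'z'])
print(counts)
```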


### DataFrame

A two-dimensional labeled data structure with columns of potentially different types

In [None]:
# df = pd.DataFrame({'Name': ['michelle', 'frank', 'joe']})
# print(df)

In [None]:
# numList = [0, 10, 20, 30, 40]
# df2 = pd.DataFrame(numList)
# print(df2)

In [None]:
# names = [['Michelle', 850], ['Nicholas', 320]]
# df = pd.DataFrame(names, columns=['Name', 'Salary'])
# print(df)

# df.describe()

# iterate over a df
# .items() - for iterating over (column name, Series) pairs (.iteritems() was removed in pandas 2.0)
# .iterrows() - for iterating over rows as (index,series) pairs
# .itertuples() - for iterating over rows as named tuples
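
The three iteration methods listed above can be compared side by side. This is a minimal sketch on a small DataFrame with the same `Name` and `Salary` columns as the commented example; the `Doubled` column is a hypothetical addition used to contrast loops with a vectorized expression.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Michelle', 'Nicholas'],
                   'Salary': [850, 320]})

# Column-wise: (column label, Series) pairs
for label, column in df.items():
    print(label, column.tolist())

# Row-wise: (index, Series) pairs; each row is converted to a Series,
# which is slow and can change dtypes on mixed-type rows
for idx, row in df.iterrows():
    print(idx, row['Name'], row['Salary'])

# Row-wise: named tuples, usually faster than iterrows()
for row in df.itertuples(index=False):
    print(row.Name, row.Salary)

# For arithmetic, a vectorized expression is usually preferable to a loop
df['Doubled'] = df['Salary'] * 2
```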

In [155]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
data

{'Country': ['Belgium', 'India', 'Brazil'],
 'Capital': ['Brussels', 'New Delhi', 'Brasília'],
 'Population': [11190846, 1303171035, 207847528]}

In [154]:
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df

   Country    Capital  Population
0  Belgium   Brussels    11190846
1    India  New Delhi  1303171035
2   Brazil   Brasília   207847528


In [158]:
# Specify the index explicitly
df = pd.DataFrame({"a": [4, 5, 6],
                   "b": [7, 8, 9],
                   "c": [10, 11, 12]},
                  index=[1, 2, 3])
df

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


In [159]:
# Specify the index and columns explicitly
df = pd.DataFrame([[4, 7, 10],
                   [5, 8, 11],
                   [6, 9, 12]],
                  index=[1, 2, 3],
                  columns=['a', 'b', 'c'])
df

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12


In [161]:
# DataFrame with a MultiIndex
df = pd.DataFrame({"a": [4, 5, 6],
                   "b": [7, 8, 9],
                   "c": [10, 11, 12]},
                  index=pd.MultiIndex.from_tuples(
                      [('d', 1), ('d', 2), ('e', 2)], names=['n', 'v']))
df

     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
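
With a MultiIndex, rows are addressed by tuples of labels. The sketch below rebuilds the same DataFrame and shows a few common ways to select from it; which one reads best depends on whether you select on the outer level, the inner level, or both.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([('d', 1), ('d', 2), ('e', 2)],
                                  names=['n', 'v'])
df = pd.DataFrame({'a': [4, 5, 6],
                   'b': [7, 8, 9],
                   'c': [10, 11, 12]},
                  index=index)

print(df.loc['d'])             # all rows whose outer level n == 'd'
print(df.loc[('d', 2), 'b'])   # one cell, addressed by the full key: 8
print(df.xs(2, level='v'))     # cross-section on the inner level v == 2
print(df.reset_index())        # turn the index levels back into columns
```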


## I/O

In [None]:
# IMPORTING DATA

# data = pd.read_csv("bla.csv")
# data.head()
# data.tail()
# data.sample(5)

### Read and Write to CSV

In [None]:
pd.read_csv('file.csv', header=None, nrows=5)

In [None]:
df.to_csv('myDataFrame.csv')
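
The file names in the cells above are placeholders. As a sketch that runs without any files, the example below round-trips a small DataFrame through an in-memory buffer (`io.StringIO` stands in for a file on disk) and shows the effect of writing the index.

```python
import io

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

# An in-memory text buffer stands in for a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False omits the row labels
buffer.seek(0)

restored = pd.read_csv(buffer)
print(restored.equals(df))

# By default to_csv also writes the index; reading that back without
# index_col yields an extra 'Unnamed: 0' column
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv` is another way to avoid the extra column when the index was written.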

### Read and Write to Excel

In [None]:
pd.read_excel('file.xlsx')

In [None]:
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')

In [None]:
# Read multiple sheets from the same file
xlsx = pd.ExcelFile('file.xls')

In [None]:
df = pd.read_excel(xlsx, 'Sheet1')

### Read and Write to SQL Query or Database Table

In [162]:
from sqlalchemy import create_engine

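This import raises `ModuleNotFoundError: No module named 'sqlalchemy'` when SQLAlchemy is not installed. Install it first (for example, `pip install sqlalchemy`) before running the cells below.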

In [None]:
engine = create_engine('sqlite:///:memory:')

In [None]:
pd.read_sql("SELECT * FROM my_table;", engine)

In [None]:
pd.read_sql_table('my_table', engine)

In [None]:
pd.read_sql_query("SELECT * FROM my_table;", engine)

In [None]:
df.to_sql('myDf', engine)
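
The cells above rely on SQLAlchemy. If it is not available, pandas can also talk to SQLite through a plain `sqlite3` connection from the standard library. The sketch below assumes that setup; the table name `my_table` and the data are placeholders. Note that `pd.read_sql_table` still requires SQLAlchemy, so only the query form is shown.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Population': [11190846, 1303171035, 207847528]})

# In-memory SQLite database; `my_table` is a hypothetical table name
conn = sqlite3.connect(':memory:')
try:
    df.to_sql('my_table', conn, index=False)
    large = pd.read_sql_query(
        'SELECT Country FROM my_table WHERE Population > ?',
        conn,
        params=(1_000_000_000,),   # bound parameter instead of string formatting
    )
finally:
    conn.close()

print(large)
```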

## Selection

### Getting

In [164]:
print(s)
s['b'] #=>Get one element

a    3
b   -5
c    7
d    4
dtype: int64


-5

In [166]:
df

     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12


In [167]:
df[1:] #=>Get subset of a DataFrame

     a  b   c
n v
d 2  5  8  11
e 2  6  9  12


### Selecting, Boolean Indexing & Setting

### By Position

In [169]:
df

     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12


In [170]:
df.iloc[[0], [0]] #=>Select by position (first row, first column); list indexers return a 1x1 DataFrame

     a
n v
d 1  4


In [None]:
df.iat[0, 0] #=>Select a single scalar value by position

### By Label

In [None]:
df.loc[[0], ['Country']] #=>Select by row and column labels (assumes df is the Country/Capital/Population DataFrame); list indexers return a 1x1 DataFrame

In [None]:
df.at[0, 'Country'] #=>'Belgium'

### By Label/Position

In [None]:
df.iloc[2] #=>Select a single row by position (.ix was removed in pandas 1.0; use .loc or .iloc): Country Brazil, Capital Brasília, Population 207847528

In [None]:
df.loc[:, 'Capital'] #=>Select a single column by label: 0 Brussels, 1 New Delhi, 2 Brasília

In [None]:
df.loc[1, 'Capital'] #=>Select a row and column by label: 'New Delhi'
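
The difference between position-based and label-based selection is easiest to see when the row labels are not `0, 1, 2`. The sketch below uses the country data from earlier with a deliberately different index (`10, 20, 30`, an assumption made for illustration).

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'India', 'Brazil'],
                   'Capital': ['Brussels', 'New Delhi', 'Brasília'],
                   'Population': [11190846, 1303171035, 207847528]},
                  index=[10, 20, 30])   # labels differ from positions

print(df.iloc[0, 0])            # position: first row, first column
print(df.iat[0, 0])             # same cell, scalar-only accessor
print(df.loc[20, 'Capital'])    # label: row 20, column 'Capital'
print(df.at[20, 'Capital'])     # same cell, scalar-only accessor

print(df.iloc[2])               # whole row by position, as a Series
print(df.loc[:, 'Capital'])     # whole column by label, as a Series

# Slices: iloc excludes the stop position, loc includes the stop label
print(len(df.iloc[0:1]), len(df.loc[10:20]))   # 1 2
```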

### Boolean Indexing

In [None]:
s[~(s > 1)] #=>Series s where value is not >1

In [None]:
s[(s < -1) | (s > 2)] #=>Series s where value is <-1 or >2

In [None]:
df[df['Population'] > 1200000000] #=>Keep only the rows where Population exceeds 1.2 billion
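
Boolean indexing works by building a mask of `True`/`False` values and using it to select entries. The sketch below uses the Series `s` as originally defined and shows why the element-wise operators `|`, `&` and `~` are needed instead of Python's `or`, `and` and `not`.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

mask = (s < -1) | (s > 2)     # element-wise OR; the parentheses are required
print(mask.tolist())          # [True, True, True, True]
print(s[mask])

print(s[~(s > 1)])            # ~ negates a mask; only 'b' (-5) remains

# Python's `or` asks for a single truth value, which a Series cannot give
try:
    s[(s < -1) or (s > 2)]
except ValueError as err:
    print('ValueError:', err)
```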

### Setting

In [None]:
s['a'] = 6 #=>Set index a of Series s to 6

## Dropping

In [None]:
s.drop(['a', 'c']) #=>Drop values from rows (axis=0)

In [None]:
df.drop('Country', axis=1) #=>Drop values from columns(axis=1)

## Sort and Rank

In [None]:
df.sort_index() #=>Sort by labels along an axis

In [None]:
df.sort_values(by='Country') #=>Sort by the values along an axis

In [None]:
df.rank() #=>Assign ranks to entries
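
As a small illustration of the three calls above, the sketch below sorts and ranks a DataFrame with an unordered index and a tied value. The `Score` column is hypothetical and exists only to show how ties are ranked.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['India', 'Belgium', 'Brazil'],
                   'Score': [7, 3, 7]},
                  index=[2, 0, 1])

print(df.sort_index())                              # by row labels: 0, 1, 2
print(df.sort_values(by='Country'))                 # alphabetical by Country
print(df.sort_values(by='Score', ascending=False))  # largest Score first

# rank() averages tied ranks by default; method='min' gives every
# tied entry the lowest rank in the tie
print(df['Score'].rank().tolist())               # [2.5, 1.0, 2.5]
print(df['Score'].rank(method='min').tolist())   # [2.0, 1.0, 2.0]
```

All of these return new objects; the original `df` keeps its order unless you reassign the result.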

## Retrieving Series/DataFrame Information

### Basic Info

In [None]:
df.shape #=>(rows,columns)

In [None]:
df.index #=>Describe index

In [None]:
df.columns #=>Describe DataFrame columns

In [None]:
df.info() #=>Info on DataFrame

In [None]:
df.count() #=>Number of non-NA values

### Summary Info

In [None]:
df.sum() #=>Sum of values

In [None]:
df.cumsum() #=>Cumulative sum of values

In [None]:
df.min() #=>Minimum values
df.max() #=>Maximum values

In [None]:
df.idxmin() #=>Index label of the minimum value
df.idxmax() #=>Index label of the maximum value

In [None]:
df.describe() #=>Summary statistics

In [None]:
df.mean() #=>Mean of values

In [None]:
df.median() #=>Median of values

## Applying Functions

In [None]:
f = lambda x: x*2

In [None]:
df.apply(f) #=>Apply function

In [None]:
df.applymap(f) #=>Apply function element-wise (renamed to DataFrame.map in pandas 2.1)
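
`apply` passes whole columns (or rows) to the function, while element-wise application calls it once per value. The sketch below shows both on a small numeric DataFrame. The element-wise step is written with `Series.map` inside `apply` so that it does not depend on which of `applymap` or `DataFrame.map` your pandas version provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [4, 5, 6], 'b': [7, 8, 9]})


def double(x):
    return x * 2


print(df.apply(double))          # called once per column (a Series)
print(df.apply(np.sum))          # column totals: a=15, b=24
print(df.apply(np.sum, axis=1))  # row totals: 11, 13, 15

# Element-wise application, one call per value
print(df.apply(lambda col: col.map(double)))
```

For simple arithmetic such as doubling, `df * 2` gives the same result without calling a Python function at all.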

## Data Alignment

### Internal Data Alignment

In [None]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

In [None]:
s + s3

### Arithmetic Operations with Fill Methods

In [None]:
s.add(s3, fill_value=0)

In [None]:
s.sub(s3, fill_value=2)

In [None]:
s.div(s3, fill_value=4)

In [None]:
s.mul(s3, fill_value=3)
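
Arithmetic between two Series aligns them on their index labels first. The sketch below uses `s` with its original values (before the `s['a'] = 6` assignment in the Setting section) to show where the missing value comes from and how `fill_value` changes the result.

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])

total = s + s3
print(total)     # 'b' has no partner in s3, so the result there is NaN

filled = s.add(s3, fill_value=0)
print(filled)    # the missing s3['b'] is treated as 0, giving -5.0
```

`fill_value` replaces a value that is missing on one side only; a label missing from both Series would still produce `NaN`.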

In [None]:
# Wrangling

# Sorting
# sort by labels
# sort by values
# sort using a specific sorting algorithm (quicksort, mergesort, etc)

# Handling Missing Data (replacing/dropping) and Duplicates
# replace()
# fillna()

# Joining, Merging, Concatenating, Grouping, Aggregating
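
The outline above lists wrangling topics without examples. As a minimal sketch of how a few of them fit together, the code below removes duplicates, fills a missing value, merges two tables and aggregates per group. All data, column names and city names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration
sales = pd.DataFrame({'store': ['A', 'A', 'B', 'B', 'B'],
                      'units': [10, 10, np.nan, 4, 6]})
stores = pd.DataFrame({'store': ['A', 'B'],
                       'city': ['Ghent', 'Pune']})

deduped = sales.drop_duplicates()         # drops the repeated ('A', 10) row
cleaned = deduped.fillna({'units': 0})    # replace missing units with 0

merged = cleaned.merge(stores, on='store', how='left')   # SQL-style join
totals = merged.groupby('city')['units'].sum()           # aggregate per city

print(totals)
```

Whether to fill or drop missing values depends on the data; `dropna()` would be the alternative to `fillna()` here.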

## Visualization, Aggregation, and Time Series

Visualization with Pandas



In [None]:
# Series.plot.box()
# DataFrame.boxplot() or DataFrame.plot.box()

# Series.plot.area()
# DataFrame.plot.area()

# DataFrame.plot.scatter()

# Pie chart

# Bar Plot

# Histogram

In [1]:
# TODO maybe include this?

# Exercises

### Exercise #1

Create a numpy array of tuples with dates from 2019-2030 in the first position followed by 6 random numbers. Do this in the most efficient way.

```python
date_to_create = [('2019-01-01', 100.  , 104.06,  95.96, 100.34, 22351900, 100.34),
                  ('2020-01-01', 101.01, 109.08, 100.5 , 108.31, 11428600, 108.31),
                  ('2021-01-01', 110.75, 113.48, 109.05, 109.4 ,  9137200, 109.4 ),
                  ...,
                  ('2028-01-01', 313.16, 341.89, 310.3 , 332.  , 10597800, 332.  ),
                  ('2029-01-01', 355.79, 381.95, 345.75, 381.02,  8905500, 381.02),
                  ('2030-01-01', 393.53, 394.5 , 357.  , 362.71,  7784800, 362.71)]
```

In [1]:
# Complete the exercise here

### Exercise #2

Create a numpy array ...

In [None]:
# Complete the exercise here

# Answers