# Module 3 (Pandas Core): Data Structures 

## Table of Contents

- [Import Libraries](#import-libraries)
- [Objects](#objects)
- [Key Data Structure](#key-data-structure)
- [Built-in Sequences](#built-in-sequences)
    - [List](#list)
    - [Dictionary](#dictionary)

## Import libraries <a class="anchor" id="import-libraries"></a>

In [1]:
import pandas as pd
import numpy as np
from pandas_extensions.database import collect_data

## Objects <a class="anchor" id="objects"></a>

In [4]:
# Import data
df = pd.DataFrame(collect_data())
# Get object class
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [5]:
# Get the object's inheritance structure in the order that methods are searched for
print(type(df).mro())

[<class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.generic.NDFrame'>, <class 'pandas.core.base.PandasObject'>, <class 'pandas.core.accessor.DirNamesMixin'>, <class 'pandas.core.base.SelectionMixin'>, <class 'pandas.core.indexing.IndexingMixin'>, <class 'pandas.core.arraylike.OpsMixin'>, <class 'object'>]


Python objects get their methods and attributes from the classes they inherit from. The output above is the method resolution order (MRO): the sequence of classes Python searches when looking up a method or attribute, similar to the search path through R's package environments. This is why a data frame can use methods defined on any class in its inheritance structure.
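
The lookup order can be seen on a much smaller hierarchy. The sketch below uses hypothetical classes (`Base`, `Mixin`, `Frame`) that are not part of Pandas; they only illustrate how Python walks the inheritance structure from left to right until it finds the requested method.

```python
# Hypothetical classes, used only to illustrate method lookup order.
class Base:
    def describe(self):
        return "Base.describe"

    def shape(self):
        return "Base.shape"


class Mixin:
    def describe(self):
        return "Mixin.describe"


class Frame(Mixin, Base):
    """Inherits from Mixin first, then Base."""


frame = Frame()

# The MRO lists the classes in the order Python searches them.
mro_names = [cls.__name__ for cls in Frame.mro()]
print(mro_names)

# describe() exists on both parents; the one earlier in the MRO wins.
print(frame.describe())

# shape() exists only on Base, so the search continues until it is found.
print(frame.shape())
```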

In [7]:
# Objects have attributes
# VS Code shows object attributes with the wrench icon
print(df.shape)

(15644, 13)


In [None]:
print(df.columns)

Index(['order_id', 'order_line', 'order_date', 'quantity', 'price',
       'total_revenue', 'model', 'category_1', 'category_2', 'frame_material',
       'bikeshop_name', 'city', 'state'],
      dtype='object')

In [54]:
# Objects have methods
# VS Code shows object methods with the cube icon
print(df.query("order_id == 3"))

   order_id  order_line order_date  quantity  price  total_revenue  \
4         3           1 2011-01-10         1  10660          10660   
5         3           2 2011-01-10         1   3200           3200   
6         3           3 2011-01-10         1  12790          12790   
7         3           4 2011-01-10         1   5330           5330   
8         3           5 2011-01-10         1   1570           1570   

                            model category_1      category_2 frame_material  \
4        Supersix Evo Hi-Mod Team       Road      Elite Road         Carbon   
5                 Jekyll Carbon 4   Mountain   Over Mountain         Carbon   
6         Supersix Evo Black Inc.       Road      Elite Road         Carbon   
7  Supersix Evo Hi-Mod Dura Ace 2       Road      Elite Road         Carbon   
8                Synapse Disc 105       Road  Endurance Road       Aluminum   

               bikeshop_name        city state  
4  Louisville Race Equipment  Louisville    KY  
5  Lou

## Key data structure <a class="anchor" id="key-data-structure"></a>

The data frame is the key Pandas data structure: a collection of Pandas series, one per column, that share a common index. It exposes `columns` and `index` attributes and, like any object, comes with many methods.
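
As a minimal sketch of this structure, the example below builds a small data frame from hypothetical toy values (the column names mirror a few of those in `df`, but the rows and the index labels are made up) and shows the difference between attributes and methods.

```python
import pandas as pd

# Hypothetical toy data; only the column names echo the course data set.
toy = pd.DataFrame(
    {"order_id": [1, 1, 2], "quantity": [1, 1, 2], "price": [6070, 5970, 2770]},
    index=["a", "b", "c"],
)

# columns and index are attributes (no parentheses)
print(list(toy.columns))
print(list(toy.index))

# Selecting one column returns a series that shares the frame's index
col = toy["price"]
print(type(col).__name__)
print(list(col.index))

# sum() is a method (called with parentheses)
total_quantity = toy["quantity"].sum()
print(total_quantity)
```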

In [55]:
# Each column in a data frame is a pandas series
print(type(df["order_date"]))

<class 'pandas.core.series.Series'>


In [56]:
# The methods available on a series depend on its data type
# The "dt" accessor exposes datetime properties, here the year of each value in the column
df["order_date"].dt.year

0        2011
1        2011
2        2011
3        2011
4        2011
         ... 
15639    2015
15640    2015
15641    2015
15642    2015
15643    2015
Name: order_date, Length: 15644, dtype: int64

Pandas series are built on top of NumPy arrays. The series adds an index and metadata such as the series name, while the NumPy array provides core functionality such as `numpy.sum()`.
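
The layering can be sketched with a small series built from hypothetical values. The example uses `Series.to_numpy()`, which returns the values as a NumPy array in the same way the `.values` attribute does for simple numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from the course data.
s = pd.Series([10, 20, 30], index=["x", "y", "z"], name="price")

# The series carries an index and a name on top of the values
print(s.index.tolist(), s.name)

# to_numpy() returns the values as a plain NumPy array
arr = s.to_numpy()
print(type(arr))

# NumPy functions work on the array and on the series itself
array_total = np.sum(arr)
series_total = np.sum(s)
print(array_total, series_total)
```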

In [57]:
# Access the array from the series
df["order_date"].values

array(['2011-01-07T00:00:00.000000000', '2011-01-07T00:00:00.000000000',
       '2011-01-10T00:00:00.000000000', ...,
       '2015-12-25T00:00:00.000000000', '2015-12-25T00:00:00.000000000',
       '2015-12-25T00:00:00.000000000'], dtype='datetime64[ns]')

In [58]:
# Get type
type(df["order_date"].values)

numpy.ndarray

In [60]:
# The NumPy array is a low-level object: the only parent class in its inheritance is "object"
type(df["order_date"].values).mro()

[numpy.ndarray, object]

NumPy data types extend Python's built-in data types. NumPy uses fixed-size types (e.g. int64), so every element of an array occupies the same number of bytes and memory can be allocated compactly.
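
To make the memory point concrete, here is a small sketch that stores the same three numbers with two different integer types. The arrays are hypothetical and the dtypes are set explicitly, since the default integer type can differ between platforms.

```python
import numpy as np

# Same three numbers stored with two different fixed-width integer types
a64 = np.array([1, 2, 3], dtype=np.int64)
a8 = np.array([1, 2, 3], dtype=np.int8)

# itemsize is bytes per element; nbytes is the total for the array
print(a64.dtype, a64.itemsize, a64.nbytes)  # int64 8 24
print(a8.dtype, a8.itemsize, a8.nbytes)     # int8 1 3
```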

In [71]:
# Get the data types that the NumPy arrays actually contain
print(df["price"].values.dtype)
print(df["order_date"].values.dtype)

int64
datetime64[ns]


## Built-in sequences <a class="anchor" id="built-in-sequences"></a>

*Container sequences*

- list, tuple, and collections.deque can hold items of different types, including nested containers.

*Flat sequences*

- str, bytes, bytearray, memoryview, and array.array hold items of one simple type.
  
Another way to group these sequences is by mutability: 

*Mutable sequences*

- list, bytearray, array.array, collections.deque, and memoryview

*Immutable sequences*

- tuple, str, and bytes
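
A quick sketch of the mutability split, using made-up values: item assignment works on a list and a bytearray, but raises `TypeError` on a tuple.

```python
mutable = [1, 2, 3]
mutable[0] = 99          # lists allow item assignment

immutable = (1, 2, 3)
try:
    immutable[0] = 99    # tuples do not
    tuple_error = None
except TypeError as exc:
    tuple_error = type(exc).__name__

raw = bytearray(b"abc")
raw[0] = ord("z")        # bytearray is the mutable counterpart of bytes

print(mutable)      # [99, 2, 3]
print(tuple_error)  # TypeError
print(raw)          # bytearray(b'zbc')
```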


One important distinction between container and flat sequences is: A container sequence holds references to the objects it contains, which may be of any type, while a flat sequence stores the value of its contents in its own memory space, and not as distinct objects. Examine the diagram below:

<p align="center">
  <img width="600" height="250" src="Image/sequence.png">
</p>
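
The distinction between holding references and holding values can also be sketched in code. The example below is illustrative only and uses made-up values: a list keeps a reference to the very same nested object, while `array.array` stores raw values of one declared type and rejects anything else.

```python
from array import array

# Container sequence: a list can mix types and holds references to objects
item = ["dev", "mgr"]
container = [1, "abc", item]
print(container[2] is item)   # True: the list stores a reference to the same object

# Mutating the referenced object is visible through the list
item.append("ceo")
print(container)              # [1, 'abc', ['dev', 'mgr', 'ceo']]

# Flat sequence: array.array stores raw values of one declared type
flat = array("d", [1.0, 2.5, 4.0])   # "d" = C double
print(flat.typecode)

try:
    flat.append("abc")        # a string is not a valid double
    flat_error = None
except TypeError as exc:
    flat_error = type(exc).__name__
print(flat_error)             # TypeError
```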

Here is a really nice visualization of the Python object ecosystem:

<p align="center">
  <img width="700" height="500" src="Image/data_structure.png">
</p>

[Link](https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747)


### List <a class="anchor" id="list"></a>

In [74]:
# List operations
# Empty list
L = []
print(L)

[]


In [None]:
# Create a list with 4 elements (indices 0 through 3)
L = [123, 'abc', 1.23, {}]
print(L)

[123, 'abc', 1.23, {}]


In [121]:
# Nested Sublist
L = ['Bob', 40.0, ['dev', 'mgr']]
print(L)


['Bob', 40.0, ['dev', 'mgr']]


In [122]:
# List of values from -4 to 3 (the stop value, 4, is excluded)
list(range(-4, 4))

[-4, -3, -2, -1, 0, 1, 2, 3]

In [123]:
# Subsetting
L[0]

'Bob'

In [124]:
# Index of index
# The third element is a sublist and we extract the first element from this sublist
print(L[2][0])
# This should be a string
print(type(L[2][0]))

dev
<class 'str'>


In [125]:
# Slice to subset multiple elements
L[0:2]

['Bob', 40.0]

In [126]:
# Length is the number of elements
len(L)

3

In [127]:
# Concatenate, repeat
print(L * 2)
print(L + L + L)

['Bob', 40.0, ['dev', 'mgr'], 'Bob', 40.0, ['dev', 'mgr']]
['Bob', 40.0, ['dev', 'mgr'], 'Bob', 40.0, ['dev', 'mgr'], 'Bob', 40.0, ['dev', 'mgr']]


In [128]:
# Iteration
for var in L: print(var)

Bob
40.0
['dev', 'mgr']


In [129]:
# Membership (are these elements in the list)
print(3 in L)
print("Bob" in L)

False
True


In [132]:
# Method append
# A list is mutable, so append modifies it in place; unlike growing a vector in R, this does not copy the whole object each time
L.append(False)
print(L)

['Bob', 40.0, ['dev', 'mgr'], False]
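
One way to check that `append` works in place is to compare the object's identity before and after. This is a sketch with a hypothetical list named `items`; `id()` returns an identifier that stays the same for as long as the object lives.

```python
items = ["Bob", 40.0]
alias = items               # a second name for the same list object
before = id(items)

items.append(False)         # modifies the existing object in place

same_object = id(items) == before
print(same_object)          # True
print(alias)                # ['Bob', 40.0, False]: the alias sees the change too

# Concatenation with + builds a new list instead
combined = items + [1]
print(combined is items)    # False
```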


In [133]:
# Method extend
L.extend([5, 6, 7])
print(L)

['Bob', 40.0, ['dev', 'mgr'], False, 5, 6, 7]


In [134]:
# Method insert
# Insert string object as the fifth element
L.insert(4, "Ken")
print(L)

['Bob', 40.0, ['dev', 'mgr'], False, 'Ken', 5, 6, 7]


In [140]:
# Method searching
# Get the index of the element
# This is similar to match() in R
print(L)
print(L.index("Ken"))
print(L.index(40))

['Bob', 40.0, ['dev', 'mgr'], False, 'Ken', 5, 6, 7]
4
1


In [141]:
# Method count returns the number of elements with the specified value
L.count("Ken")

1

In [146]:
# Sort in place; ascending order is the default, i.e., reverse=False
L_1 = [3, 4, 9, 3, 2, 24, 45, 13, 9]
L_1.sort()
print(L_1)

[2, 3, 3, 4, 9, 9, 13, 24, 45]
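
For comparison, here is a short sketch of the built-in `sorted()` next to `list.sort()`, using the same numbers under a hypothetical name `nums`. `sorted()` returns a new list, whereas `list.sort()` reorders the existing list and returns `None`.

```python
nums = [3, 4, 9, 3, 2, 24, 45, 13, 9]

# sorted() returns a new list and leaves the original untouched
descending = sorted(nums, reverse=True)
print(descending)   # [45, 24, 13, 9, 9, 4, 3, 3, 2]
print(nums)         # [3, 4, 9, 3, 2, 24, 45, 13, 9]

# list.sort() sorts in place and returns None
result = nums.sort()
print(result)       # None
print(nums)         # [2, 3, 3, 4, 9, 9, 13, 24, 45]
```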


In [154]:
# Reverses the list
L_2 = [3, 4, 5, 7]
L_2.reverse()
print(L_2)

[7, 5, 4, 3]


In [163]:
# copy() creates a shallow copy of the list
L_new = L.copy()
# Modify one element of the copy only
L_new[2] = 9
# Check new modified copy
print(L_new)
# Check original object (should be unchanged)
print(L)


[7, 6, 9, False, ['dev', 'mgr'], 40.0, 'Bob']
[7, 6, 4, False, ['dev', 'mgr'], 40.0, 'Bob']
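
One caveat is that `list.copy()` is shallow: the new list has its own slots, but nested objects are still shared with the original. The sketch below uses hypothetical lists and `copy.deepcopy` from the standard library to show the difference.

```python
import copy

original = ["Bob", 40.0, ["dev", "mgr"]]
shallow = original.copy()
deep = copy.deepcopy(original)

# Replacing a top-level element affects only the copy
shallow[0] = "Ken"

# Mutating the nested list through the shallow copy also changes the original
shallow[2].append("ceo")

print(original)  # ['Bob', 40.0, ['dev', 'mgr', 'ceo']]
print(shallow)   # ['Ken', 40.0, ['dev', 'mgr', 'ceo']]
print(deep)      # ['Bob', 40.0, ['dev', 'mgr']]
```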


In [172]:
# Before clear
print(L_1)
print(L_2)
L_1.clear()
L_2.clear()
# Clear all elements
print(L_1)
print(L_2)

[]
[]
[]
[]


In [173]:
# Pop removes and returns the element at the specified position
print(L)
L.pop(2)
print(L)

[7, 6, 4, False, ['dev', 'mgr'], 40.0, 'Bob']
[7, 6, False, ['dev', 'mgr'], 40.0, 'Bob']


In [174]:
# Another way to remove by position
print(L)
del L[0]
print(L)

[7, 6, False, ['dev', 'mgr'], 40.0, 'Bob']
[6, False, ['dev', 'mgr'], 40.0, 'Bob']


In [175]:
# Remove by value (the first matching element)
print(L)
L.remove(False)
print(L)

[6, False, ['dev', 'mgr'], 40.0, 'Bob']
[6, ['dev', 'mgr'], 40.0, 'Bob']
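
Two details of these removal methods are worth a quick sketch, shown here on a hypothetical list named `values`: `pop` hands back the element it removes, and `remove` deletes only the first match and raises `ValueError` when the value is absent.

```python
values = [6, False, "Bob", False]

popped = values.pop(0)      # pop returns the element it removes
values.remove(False)        # only the first matching element is removed
print(popped, values)       # 6 ['Bob', False]

try:
    values.remove("Ken")    # not present in the list
    remove_error = None
except ValueError as exc:
    remove_error = type(exc).__name__
print(remove_error)         # ValueError
```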


In [178]:
# Remove slices
print(L_new)
del L_new[0:3]
print(L_new)

[7, ['dev', 'mgr'], 40.0, 'Bob']
['Bob']


In [200]:
# Subset and assignment
# This is similar to list[1:2] <- NULL in R
L = [3, "ken", [2, "3"], True]
print(L)
L[0:2] = []
print(L)

[3, 'ken', [2, '3'], True]
[[2, '3'], True]


In [201]:
# Subset one specific element and assign
print(L)
L[1] = 3
print(L)

[[2, '3'], True]
[[2, '3'], 3]


In [204]:
# Subset a slice and assign
L = list(range(-4, 5))
print(L)
L[3:7] = ["Ken", "needs", "a", "job", "now"]
print(L)

[-4, -3, -2, -1, 0, 1, 2, 3, 4]
[-4, -3, -2, 'Ken', 'needs', 'a', 'job', 'now', 3, 4]
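
Slice assignment does not require the replacement to match the length of the slice, which is why the list above grew from nine to ten elements. A minimal sketch with a hypothetical list `seq` shows growing, shrinking, and pure insertion:

```python
seq = list(range(5))        # [0, 1, 2, 3, 4]

seq[1:3] = ["a", "b", "c"]  # replace 2 elements with 3: the list grows
grown = list(seq)
print(grown)                # [0, 'a', 'b', 'c', 3, 4]

seq[1:4] = []               # replace 3 elements with none: the list shrinks
shrunk = list(seq)
print(shrunk)               # [0, 3, 4]

seq[1:1] = [1, 2]           # empty slice: pure insertion at index 1
print(seq)                  # [0, 1, 2, 3, 4]
```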


### Dictionary <a class="anchor" id="dictionary"></a>