# Python Data Structures - Collections

## Pure Python

### Lists
- Ordered
- Mutable (can change what the list contains)

In [4]:
my_list = [1, 2, 3, 4, 5]
my_list[-2] = 7
my_list

[1, 2, 3, 7, 5]

In [7]:
names = ['Jay', 'Mary', 'Carina', 'Danielle']
names[1:3]

['Mary', 'Carina']

Python lists can hold values of different types, but mixing types in one list is rarely useful in data analysis.
Accessing an index that does not exist in a list raises an `IndexError`.

In [8]:
another_list = [7, 'fish', True]
another_list

[7, 'fish', True]
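The out-of-range behavior mentioned above can be sketched as follows. This is a minimal illustration that reuses the `names` list from earlier; the variable names `error_message`, `tail`, and `found` are made up for the example.

```python
names = ['Jay', 'Mary', 'Carina', 'Danielle']

# Indexing past the end of a list raises IndexError
try:
    names[10]
except IndexError as err:
    error_message = str(err)

# Slicing past the end is forgiving: it returns whatever elements exist
tail = names[2:10]

# Searching for a value that is absent raises ValueError
found = True
try:
    names.index('Teddy')
except ValueError:
    found = False

print(error_message)
print(tail)
print(found)
```

Slices never raise for out-of-range bounds, which is why `names[2:10]` quietly returns the last two names.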

In [13]:
my_dog_info = [
    ['Name', 'Teddy'],
    ['Species', 'Dog'],
    ['Color', 'Black']
]

my_dog_info2 = [
    ['Name', 'Species', 'Color'],
    ['Teddy', 'Dog', 'Black']
]

### Dictionaries
- Ordered by insertion (guaranteed as of Python 3.7)
- Mutable

In [14]:
my_dog = {
    'Name': 'Teddy',
    'Species': 'Dog',
    'Color': 'Black'
}
my_dog['Color'] = 'White'

In [15]:
my_dog['Color']

'White'

In [16]:
my_dog['Age'] = 5
my_dog

{'Name': 'Teddy', 'Species': 'Dog', 'Color': 'White', 'Age': 5}
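As a small sketch, the two list-of-lists layouts from the Lists section can be turned into dictionaries with the built-ins `dict` and `zip`. The names `from_pairs`, `from_columns`, `key_order`, and `weight` are illustrative only.

```python
my_dog_info = [
    ['Name', 'Teddy'],
    ['Species', 'Dog'],
    ['Color', 'Black']
]
my_dog_info2 = [
    ['Name', 'Species', 'Color'],
    ['Teddy', 'Dog', 'Black']
]

# A list of [key, value] pairs converts directly
from_pairs = dict(my_dog_info)

# A row of keys and a row of values can be paired up with zip
from_columns = dict(zip(my_dog_info2[0], my_dog_info2[1]))

# New keys are appended at the end, so insertion order is kept
from_pairs['Age'] = 5
key_order = list(from_pairs)

# .get returns None instead of raising KeyError for a missing key
weight = from_pairs.get('Weight')

print(from_columns)
print(key_order)
print(weight)
```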

### Tuples
- Immutable
- Ordered

In [18]:
my_tup = (255, 255, 255)
my_coords = (1, -2.5)
geo = (-45.6, 23)
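A quick sketch of what "immutable" means in practice, reusing `my_tup` from the cell above. The names `message`, `red`, `green`, `blue`, and `darker` are assumptions for illustration.

```python
my_tup = (255, 255, 255)

# Item assignment is not allowed on a tuple
try:
    my_tup[0] = 0
except TypeError as err:
    message = str(err)

# Reading and unpacking work the same way as with lists
red, green, blue = my_tup

# To "change" a tuple, build a new one from pieces of the old one
darker = my_tup[:2] + (128,)

print(message)
print(red, green, blue)
print(darker)
```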

### Sets
- Unordered
- Mutable
- All of the items must be unique 

In [22]:
list_o_names = ['Jay', 'DeAnna', 'Eddie', 'Jay', 'Jay']
list(set(list_o_names)) # drops duplicates and returns a list; a set does not guarantee the original order

['Jay', 'DeAnna', 'Eddie']
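Because sets are unordered, the order of the result above is not guaranteed. If order matters, one possible approach (a sketch, not the only option) is to use `dict.fromkeys`, which keeps the first occurrence of each item in insertion order, or to sort the set explicitly. The names `unique_in_order` and `unique_sorted` are illustrative.

```python
list_o_names = ['Jay', 'DeAnna', 'Eddie', 'Jay', 'Jay']

# Dictionary keys are unique and keep insertion order (Python 3.7+)
unique_in_order = list(dict.fromkeys(list_o_names))

# Alternatively, impose a deterministic order by sorting
unique_sorted = sorted(set(list_o_names))

print(unique_in_order)
print(unique_sorted)
```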

## Why is Pure Python not great for Data Analysis?

- Collections can contain mixed data types
- Everything is an object: the object machinery (constructors, methods, attributes) slows things down and makes each value much larger than the raw data it holds
- Python lists use much more memory than NumPy arrays
- A NumPy array holds only one data type
- Most operations are vectorized across NumPy arrays
- NumPy arrays are computationally more efficient than pure Python lists

In [24]:
s = 'DATA'
s.lower()

'data'

In [26]:
import sys
sys.getsizeof(5)  # not echoed: a cell only displays its last expression
sys.getsizeof(my_list)

120

#### Simple sequential list

In [33]:
my_list = []
for i in range(0, 1000):
    my_list.append(i)

In [34]:
len(my_list)

1000

In [39]:
sys.getsizeof(my_list)

8856

#### Evens only

In [37]:
evens_list = []
for i in range(0, 1000):
    if i % 2 == 0:
        evens_list.append(i)

In [38]:
len(evens_list)

500

#### List Comprehensions

In [41]:
my_list_comp = [i for i in range(0, 1000)]
len(my_list_comp)

1000

In [43]:
my_evens_comp = [i for i in range(0, 1000) if (i % 2 == 0) and (i != 6)]
len(my_evens_comp)

499

In [50]:
list_2d = []
for i in range(500):
    col = []
    for j in range(9):
        col.append(5)
    list_2d.append(col)

In [54]:
len(list_2d)

500

In [55]:
list_2d_comp = [[5 for i in range(9)] for j in range(500)]
len(list_2d_comp)

500

In [58]:
sys.getsizeof(list_2d_comp) * 10  # rough estimate only: getsizeof counts the outer list's references, not the rows or the ints

42160
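`sys.getsizeof` reports only the container itself, so the figure above is a rough estimate. A somewhat closer estimate, sketched below, adds up the outer list and every row list. It assumes CPython, where the small integer `5` is a single cached object shared by all cells, so it is counted once. The names `outer_only`, `rows_total`, and `estimated_total` are illustrative, and the exact byte counts vary by Python version and platform.

```python
import sys

list_2d_comp = [[5 for i in range(9)] for j in range(500)]

# Size of the outer list: its header plus 500 references
outer_only = sys.getsizeof(list_2d_comp)

# Each row is its own list object with its own overhead
rows_total = sum(sys.getsizeof(row) for row in list_2d_comp)

# On CPython every cell points at the same cached int object,
# so its size is added only once (an implementation detail)
int_size = sys.getsizeof(5)

estimated_total = outer_only + rows_total + int_size

print(outer_only, rows_total, estimated_total)
```

The row lists dominate the total, which is overhead that a single contiguous array avoids.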

In [63]:
import numpy as np

In [68]:
my_arr = np.array(list_2d_comp)
my_arr

array([[5, 5, 5, ..., 5, 5, 5],
       [5, 5, 5, ..., 5, 5, 5],
       [5, 5, 5, ..., 5, 5, 5],
       ...,
       [5, 5, 5, ..., 5, 5, 5],
       [5, 5, 5, ..., 5, 5, 5],
       [5, 5, 5, ..., 5, 5, 5]])

In [69]:
my_arr.nbytes

36000

In [71]:
list_c = [[col for col in range(20)] for row in range(10000)]
sys.getsizeof(list_c) * len(list_c[0])  # rough estimate: outer list size times the number of columns

1703520

In [83]:
big_arr = np.array(list_c, dtype=np.int8)
big_arr.nbytes


200000
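The saving above comes from the `dtype`: `nbytes` is the number of elements times the bytes per element. The sketch below compares `int64` with `int8` on the same data and checks the range `int8` can hold, since values outside that range would not fit. The names `arr_int64`, `arr_int8`, and `int8_info` are illustrative.

```python
import numpy as np

list_c = [[col for col in range(20)] for row in range(10000)]

# 8 bytes per element
arr_int64 = np.array(list_c, dtype=np.int64)

# 1 byte per element; safe here because every value is between 0 and 19
arr_int8 = np.array(list_c, dtype=np.int8)

print(arr_int64.itemsize, arr_int64.nbytes)
print(arr_int8.itemsize, arr_int8.nbytes)

# int8 can only represent whole numbers in this range
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)
```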

#### Generating data with NumPy

In [105]:
np.random.randint(2, 8, 10000).reshape(20, 500).ndim

2

In [106]:
np.random.randint(1, 10, 27).reshape(3, 3, 3)

array([[[3, 8, 7],
        [3, 8, 4],
        [9, 8, 2]],

       [[5, 6, 1],
        [8, 7, 7],
        [5, 2, 1]],

       [[4, 8, 3],
        [1, 8, 5],
        [1, 5, 4]]])

In [108]:
np.random.randint(0, 2, 16).reshape(2, 2, 2, 2)

array([[[[0, 0],
         [0, 1]],

        [[1, 1],
         [1, 0]]],


       [[[1, 1],
         [1, 1]],

        [[1, 0],
         [0, 1]]]])
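In each `reshape` call above, the product of the new dimensions equals the number of generated values (for example 20 * 500 = 10000). The sketch below shows that rule, using NumPy's newer `default_rng` generator with a seed so the run is repeatable. This is a swapped-in alternative to `np.random.randint`, not what the cells above use, and the names `rng`, `values`, `grid`, and `inferred` are illustrative.

```python
import numpy as np

# Seeded generator so the same numbers come back on every run
rng = np.random.default_rng(seed=42)

# Like randint, the upper bound is exclusive: values fall in 2..7
values = rng.integers(2, 8, size=10000)

# 20 * 500 == 10000, so this reshape is valid
grid = values.reshape(20, 500)

# -1 asks NumPy to infer the remaining dimension
inferred = values.reshape(20, -1)

# A shape whose product does not match the element count fails
reshape_failed = False
try:
    values.reshape(3, 500)
except ValueError:
    reshape_failed = True

print(grid.shape, inferred.shape, reshape_failed)
```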

#### Vectorized Operations

In [73]:
num_list = [2, 4, 6]

In [74]:
num_list + 10

TypeError: can only concatenate list (not "int") to list

Here's how we would add 10 to each item:

In [78]:
new_list = []

for el in num_list:
    new_list.append(el + 10)

new_list

[12, 14, 16]

In [75]:
num_arr = np.array(num_list)

In [76]:
num_arr + 10

array([12, 14, 16])

In [79]:
num_arr * 10

array([20, 40, 60])

In [80]:
list1 = [1, 2, 3]
list2 = [4, 5, 6]
list1 + list2

[1, 2, 3, 4, 5, 6]

In [81]:
arr1 = np.array(list1)
arr2 = np.array(list2)
arr1 + arr2

array([5, 7, 9])

In [84]:
num_arr / 37

array([0.05405405, 0.10810811, 0.16216216])

In [85]:
num_arr

array([2, 4, 6])
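For comparison, the pure Python equivalents of the array operations above can be written as list comprehensions, with `zip` for the element-wise sum. This sketch also confirms that arithmetic on an array returns a new array and leaves the original untouched. The names `plus_ten`, `pairwise`, and `scaled` are illustrative.

```python
import numpy as np

num_list = [2, 4, 6]
list1 = [1, 2, 3]
list2 = [4, 5, 6]

# Pure Python: one explicit step per element
plus_ten = [el + 10 for el in num_list]
pairwise = [a + b for a, b in zip(list1, list2)]

# NumPy: the same work expressed as a single operation
num_arr = np.array(num_list)
scaled = num_arr * 10

print(plus_ten)
print(pairwise)
print(scaled.tolist())
print(num_arr.tolist())  # original array is unchanged
```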

#### Here's where vectorizing operations gets a little tricky

In [95]:
import math
import pandas as pd

In [90]:
sq_list = [16, 144, 10000, 81, 4]
sq_arr = np.array(sq_list)

In [98]:
pd_arr = pd.Series(sq_arr)
pd_arr.apply(math.sqrt)

0      4.0
1     12.0
2    100.0
3      9.0
4      2.0
dtype: float64
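The tricky part is that functions from the `math` module expect a single number, not an array. One way around it, sketched here as an alternative to `Series.apply`, is NumPy's own `np.sqrt`, which works element-wise on the whole array. The names `scalar_only` and `roots` are illustrative.

```python
import math
import numpy as np

sq_arr = np.array([16, 144, 10000, 81, 4])

# math.sqrt only accepts a single value, so a multi-element array fails
scalar_only = False
try:
    math.sqrt(sq_arr)
except TypeError:
    scalar_only = True

# np.sqrt is vectorized: it takes the root of every element at once
roots = np.sqrt(sq_arr)

print(scalar_only)
print(roots)
```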

#### NumPy arrays vs. pandas DataFrames
- ndarrays are a little faster
- pandas Series: 1 dimension; pandas DataFrame: 2 dimensions -> no 3D DataFrames!
- ndarrays can have n dimensions (fields such as aerospace may require 3+ dimensions)
- ndarrays ALWAYS index numerically from 0
- pandas Series and DataFrames are indexed numerically from 0 by default, but the index (row labels) can be anything

In [109]:
sq_arr[-1]

4
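The indexing difference in the list above can be sketched as follows. With an ndarray, `-1` always means "last position". With a pandas Series, square brackets look up labels, so position-based access goes through `.iloc`. The string labels and the names `labeled`, `default_index`, and `negative_label_missing` are made up for the example.

```python
import numpy as np
import pandas as pd

sq_arr = np.array([16, 144, 10000, 81, 4])

# ndarray: purely positional, negative positions count from the end
last_from_array = sq_arr[-1]

# Series with custom row labels
labeled = pd.Series(sq_arr, index=['a', 'b', 'c', 'd', 'e'])
by_label = labeled['c']          # look up by label
by_position = labeled.iloc[-1]   # look up by position

# With the default 0..n-1 index, -1 is treated as a label that does not exist
default_index = pd.Series(sq_arr)
negative_label_missing = False
try:
    default_index[-1]
except KeyError:
    negative_label_missing = True

print(last_from_array, by_label, by_position, negative_label_missing)
```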