# Python data structures     
## Author: Erika Duan

![](../02_figures/01_lists-header.jpg)

# Lists  

A list is an ordered, mutable container for data values (booleans, integers, floats, strings or even other lists) and has some similarities to vectors in R. 

Properties of lists include:  

+ A single list can store different primitive types and even other lists.    
+ Lists have an integer and 0-based index, which allows for list slicing (i.e. subsetting).  
+ Items can be added to a list using the methods `append()` (adds to the end) or `insert()` (adds at a given index).  
+ Values inside a list can be removed using the methods `remove()` or `pop()` or using the keyword `del`.  
+ The function `len()` can calculate the number of items in a list.  
+ Two lists can be concatenated with the operator `+`.  
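
The last two properties can be sketched as follows. This is a minimal illustration; the lists `first` and `second` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical example lists (assumptions for illustration only)
first = [1, 2, 3]
second = ["a", "b"]

combined = first + second  # + returns a new list; first and second are unchanged
print(combined)            # [1, 2, 3, 'a', 'b']
print(len(combined))       # 5
print(len(first))          # 3, the original list is not modified
```

Because `+` builds a new list, it is a safe choice when the original lists need to stay intact.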

In [1]:
#-----example 1-----  
list_a = [1, 2.4, "hello world", [0, 1, 2, 3]]
print(list_a)  

type(list_a) 

# Python allows lists containing different primitive types  

[1, 2.4, 'hello world', [0, 1, 2, 3]]


list

In [2]:
#-----example 2-----
list_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  

print(list_b[0]) # the integer 1 occupies position 0   
print(list_b[1]) # the integer 2 occupies position 0 + 1  
print(list_b[-2]) # the integer 9 occupies the second to last position i.e. position -2  
print(list_b[0:3+1]) # extract values from position 0 to position 3
print(list_b[0::2]) # start from position 0 and extract values from every subsequent second position 
print(list_b[::2]) # the same as list_b[0::2]  
print(list_b[:]) # the same as list_b.copy()

1
2
9
[1, 2, 3, 4]
[1, 3, 5, 7, 9]
[1, 3, 5, 7, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [3]:
#-----example 3----- 
list_c = ["apple", "bear", "donut", "elephant", "guava"] 

list_c.append("horse") # always appends onto the last position in a list
list_c.insert(2, "cat") # inserts "cat" at position 2 and shifts later items to the right
print(list_c)

list_c.insert(5, "french toast") # inserts "french toast" at position 5 of the updated list
print(list_c)

['apple', 'bear', 'cat', 'donut', 'elephant', 'guava', 'horse']
['apple', 'bear', 'cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']


In [4]:
#-----example 4----- 
list_d = list_c.copy() 
print("Original list: {}".format(list_d)) 

del list_d[0:1+1] # del removes objects in a list by index (accepts integers and slices)
print("Objects in positions 0 to 1 are deleted: {}".format(list_d)) 

Original list: ['apple', 'bear', 'cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']
Objects in positions 0 to 1 are deleted: ['cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']


In [5]:
#-----example 5-----
print(list_d)

list_d.append("cat") # append "cat" to last position in the list
list_d.remove("cat") # only removes the first reference to an object in the list
print("First reference to cat is removed: {}".format(list_d))  

['cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']
First reference to cat is removed: ['donut', 'elephant', 'french toast', 'guava', 'horse', 'cat']


In [6]:
#-----example 6-----  
french_toast = list_d.pop(2)
print("The function pop does two things. It removes the object from the original list: {}. It also stores the removed item: {}."
      .format(list_d, french_toast))

The function pop does two things. It removes the object from the original list: ['donut', 'elephant', 'guava', 'horse', 'cat']. It also stores the removed item: french toast.


![](../02_figures/01_tuples-and-sets-header.jpg)

# Tuples    

A tuple is an object container that behaves like a list but is immutable. (In contrast, items in a list can be reassigned by index or slice after the list is created.)    

Properties of tuples include:  
+ Tuples can be created by enclosing objects inside round brackets `()`. 
+ You can subset tuples (tuples can be referenced by index i.e. `[n]`).   
+ You cannot alter tuples after they are created.      
+ You can check whether an item exists in a tuple using the keyword `in`, which returns a boolean.  
+ You can iterate through a tuple using a `for` loop.     

**Note:** Tuples are not commonly used for data manipulation tasks (although their property of being immutable can make them more useful than lists in special circumstances).  
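
The membership check and the immutability of tuples can be sketched as below. The tuple `animals` and the dictionary `locations` are hypothetical names chosen for illustration. Using a tuple as a dictionary key is shown as one possible case where immutability is useful, not as the only one.

```python
# Hypothetical tuple used only for illustration
animals = ("bear", "cat", "horse")

print("cat" in animals)    # True
print("donut" in animals)  # False

try:
    animals[0] = "polar bear"  # tuples do not support item assignment
except TypeError as error:
    print("TypeError: {}".format(error))

# A tuple of hashable items is itself hashable, so it can act as a dictionary key
locations = {(0, 0): "origin"}
print(locations[(0, 0)])   # origin
```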

In [7]:
#-----example 1-----  
tuple_1 = ("apple", "bear", "cat", "donut", 1, 2, 3, 4)
print("Object tuple_1 is of type {} and can contain different primitive types in the same tuple: {}."
      .format(type(tuple_1), tuple_1)) 

Object tuple_1 is of type <class 'tuple'> and can contain different primitive types in the same tuple: ('apple', 'bear', 'cat', 'donut', 1, 2, 3, 4).


In [8]:
#-----example 2----- 
list_1 = list(tuple_1)
print(tuple_1[0] == list_1[0])

# tuple_1[0] = "apple_red" produces an error

list_1[0] = "apple_red"
list_1

# lists are mutable but tuples are immutable  

True


['apple_red', 'bear', 'cat', 'donut', 1, 2, 3, 4]

In [9]:
#-----example 3-----
for index, item in enumerate(tuple_1, 1): # start index at 1 instead of 0  
    print("Item {}: {}".format(index, item))

Item 1: apple
Item 2: bear
Item 3: cat
Item 4: donut
Item 5: 1
Item 6: 2
Item 7: 3
Item 8: 4


# Sets  

A set is a container that, like a tuple, can be iterated over and tested for membership, but it is unindexed, has no order and stores only unique objects (i.e. like a mathematical set).   

Properties of sets include:   
+ Sets can be created by including objects inside `{}`.    
+ Duplicate objects enclosed inside a set will be removed.   
+ You cannot subset a set (sets cannot be referenced by index i.e. `[n]`).    
+ You can check if an item is in a set using `in`.    
+ You can iterate through a set using a `for` loop.    

**Note:** Sets are not commonly used for data manipulation tasks.  
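
A short sketch of the properties above, assuming a hypothetical set called `subjects`. It also shows one detail that is easy to miss: empty curly brackets create a dictionary, so an empty set has to be created with `set()`.

```python
# Hypothetical set used only for illustration
subjects = {"maths", "physics", "maths"}

print(len(subjects))          # 2, the duplicate "maths" is dropped
print("physics" in subjects)  # True

try:
    subjects[0]               # sets cannot be referenced by index
except TypeError as error:
    print("TypeError: {}".format(error))

empty_braces = {}             # this is an empty dictionary, not a set
empty_set = set()             # this is an empty set
print(type(empty_braces), type(empty_set))
```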

In [10]:
#-----example 1-----  
set_1 = {"maths", "physics", "chemistry", "maths", "biology", "biology"}
print(set_1) # no duplicate values

{'biology', 'maths', 'physics', 'chemistry'}


In [11]:
#-----example 2-----  
for value in set_1:
    print(value)

biology
maths
physics
chemistry


![](../02_figures/01_dictionaries-header.jpg)

# Dictionary  

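A dictionary can be thought of as a list where every item is looked up by a key rather than by position (i.e. a self-defined index of strings, numbers or other hashable objects). Since Python 3.7, dictionaries preserve the order in which keys were inserted.    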
+ The index values are called **keys**.  
+ A dictionary therefore contains **key-value pairs** with the format `{key: value(s)}`.      

Dictionaries can be created using `dict()` or by listing key-value pairs inside `{}`.  
+ `{"key_1": ["value_1", "value_2"], "key_2": "value_3"}`     

Dictionary key-value pairs can be accessed by subsetting on the key or by using the `get()` method.   
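
As a minimal sketch, the dictionary below is created with `dict()` and keyword arguments instead of `{}`. The name `pets` and its contents are hypothetical. The sketch also contrasts the two access styles when a key is missing: subsetting raises a `KeyError`, whereas `get()` returns `None` by default.

```python
# Hypothetical dictionary built with dict() and keyword arguments
pets = dict(cat=["Luna", "Milo"], dog="Rex")
print(pets)          # {'cat': ['Luna', 'Milo'], 'dog': 'Rex'}

print(pets["dog"])   # Rex

try:
    pets["bird"]     # subsetting on a missing key raises an error
except KeyError as error:
    print("KeyError: {}".format(error))

print(pets.get("bird"))  # None, get() does not raise an error
```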

In [12]:
#-----example 1-----
branch_teams = {"dev_team": ["Jane", "Paul", "Gwen", "Suresh"],
                "user_design_team": ["Ming", "Sasha", "Amy"],
                "comms_team": ["David", "Prya", "Alice"]}  

# create a dictionary

branch_teams

{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'],
 'user_design_team': ['Ming', 'Sasha', 'Amy'],
 'comms_team': ['David', 'Prya', 'Alice']}

In [13]:
#-----example 2-----
branch_teams["comms_team"] 

# subsets the values associated with key = "comms_team"

['David', 'Prya', 'Alice']

In [14]:
#-----example 3----- 
comms_team = branch_teams.get("comms_team")
print(comms_team) # comms_team is stored as a list 

HR_team = branch_teams.get("HR_team", "No team exists")  
print(HR_team) 

# get() allows us to return an alternate string when the key is not found inside the dictionary 
# this prevents our code from returning an error message which halts the analysis

['David', 'Prya', 'Alice']
No team exists


To manipulate dictionaries, we can perform the following other actions:  
+ You can check whether a key exists in a dictionary using the keyword `in` i.e. `"key_2" in dict_1` should return `True`.  
+ You can retrieve dictionary keys using `dict_1.keys()`.  
+ You can retrieve dictionary values using `dict_1.values()`.  
+ You can retrieve dictionary items using `dict_1.items()`.    

In [15]:
#-----example 4----- 
print("dev_team" in branch_teams) 
print("policy_team" in branch_teams)

True
False


In [16]:
#-----example 5-----  
print(branch_teams.keys())
print(branch_teams.values())
print(branch_teams.items()) 

# compare the output difference for each statement i.e. item = all key + value pairs

dict_keys(['dev_team', 'user_design_team', 'comms_team'])
dict_values([['Jane', 'Paul', 'Gwen', 'Suresh'], ['Ming', 'Sasha', 'Amy'], ['David', 'Prya', 'Alice']])
dict_items([('dev_team', ['Jane', 'Paul', 'Gwen', 'Suresh']), ('user_design_team', ['Ming', 'Sasha', 'Amy']), ('comms_team', ['David', 'Prya', 'Alice'])])


You can modify key-value pairs in a dictionary.    
+ New items can be added through `dict_1["key_3"] = ["value_4", "value_5"]`   
+ You can delete a key-value pair using the `del` keyword or the `pop()` method (both will modify the dictionary in place).  
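
The `del` keyword can be sketched as follows, using a hypothetical dictionary `teams` so that `branch_teams` is left untouched. Passing a default value to `pop()` is shown as one way to avoid a `KeyError` when the key may not exist.

```python
# Hypothetical dictionary used only for illustration
teams = {"dev_team": ["Jane", "Paul"], "comms_team": ["David"]}

del teams["comms_team"]  # removes the key-value pair in place and returns nothing
print(teams)             # {'dev_team': ['Jane', 'Paul']}

# pop() with a default value does not raise an error for a missing key
removed = teams.pop("hr_team", None)
print(removed)           # None
```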

In [17]:
#-----example 6-----
branch_teams["data_science_team"] = ["Paula", "Mark"] 

# add a new key-value pair to dictionary

branch_teams

{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'],
 'user_design_team': ['Ming', 'Sasha', 'Amy'],
 'comms_team': ['David', 'Prya', 'Alice'],
 'data_science_team': ['Paula', 'Mark']}

In [18]:
#-----example 7-----
new_data_science_team = branch_teams.pop("data_science_team")
print(new_data_science_team)
branch_teams 

# dictionary is modified in place

['Paula', 'Mark']


{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'],
 'user_design_team': ['Ming', 'Sasha', 'Amy'],
 'comms_team': ['David', 'Prya', 'Alice']}

In [19]:
#-----example 8-----
branch_teams["comms_team"] = ["David", "Pryanna", "Alice", "Li"]

for key, values in branch_teams.items():
    print("The {} consists of members {}.".format(key, values))

The dev_team consists of members ['Jane', 'Paul', 'Gwen', 'Suresh'].
The user_design_team consists of members ['Ming', 'Sasha', 'Amy'].
The comms_team consists of members ['David', 'Pryanna', 'Alice', 'Li'].


**Note:** Dictionaries are mutable like lists because you can overwrite pre-existing key-value pairs.  

![](../02_figures/01_numpy-header.jpg)

# NumPy arrays  

Although lists are powerful and versatile containers, numerical operations on lists become computationally expensive as the list length increases.   
The alternative is to use [NumPy](https://www.nature.com/articles/s41586-020-2649-2), which provides an N-dimensional array container for storing numerical types.  

We can perform a diverse range of functions and mathematical operations using NumPy arrays. These include:  

+ Creating 1D or 2D NumPy arrays.   
+ Accessing elements inside NumPy arrays. 
+ Performing element-wise mathematical operations, linear algebra via **np.linalg** and aggregations e.g. `min()`, `max()`, `np.mean()`, `np.std()`.   
+ Combining arrays (subject to broadcasting rules).    

**Note:** A good NumPy tutorial can also be found [here](https://cs231n.github.io/python-numpy-tutorial/#numpy).  
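
The mathematical operations listed above can be sketched on a small array. The array `matrix` is a hypothetical example, and `np.linalg.inv()` is used here as just one of the functions available in `np.linalg`.

```python
import numpy as np

# Hypothetical 2 by 2 array used only for illustration
matrix = np.array([[2.0, 0.0],
                   [0.0, 4.0]])

print(matrix.min(), matrix.max())  # 0.0 4.0
print(np.mean(matrix))             # 1.5
print(np.std(matrix))              # standard deviation over all four values

inverse = np.linalg.inv(matrix)    # matrix inverse from the linear algebra module
print(inverse)
print(matrix @ inverse)            # matrix multiplication returns the identity matrix
```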

In [20]:
#-----import the numpy library-----
import numpy as np

In [21]:
#-----example 1----- 
vector_1 = np.array([1, 3, 5, 7, 9])
vector_2 = np.arange(start = 1,
                     stop = 9+1, # stop at x+1 to end with value x
                     step = 2) # step = 2 means to select every second value after the start position 
vector_3 = np.ones(5) 

print(vector_1)
print(vector_2)
print(vector_3)

# numpy.arange([start, ]stop, [step, ]dtype=None) -> numpy.ndarray  

print(type(vector_1))
print(vector_1.ndim) # 1-d array
print(vector_1.shape) # number of values in a 1-d array 

[1 3 5 7 9]
[1 3 5 7 9]
[1. 1. 1. 1. 1.]
<class 'numpy.ndarray'>
1
(5,)


In [22]:
#-----example 2-----
vector_1 = np.array([1, 3, 5, 7, 9]) 

print(vector_1 + 5) # you can perform mathematical operations on arrays
print(vector_1 ** 2)

print(min(vector_1)) # you can apply aggregation functions to arrays
print(vector_1.mean())

[ 6  8 10 12 14]
[ 1  9 25 49 81]
1
5.0


In [23]:
#-----example 3----- 
array_2d_1 = np.array([[1, 2, 3], [4, 5, 6]])
array_2d_2 = np.arange(10, 30, 2).reshape(2, 5) 
array_2d_3 = np.full(shape = (2, 5),
                     fill_value = 10) 

rng = np.random.default_rng(111)
array_2d_4 = rng.random(10).reshape(2, 5)

print(array_2d_1)
print(array_2d_2)
print(array_2d_3)
print(array_2d_4)  

print(array_2d_4.ndim) # 2-d array 
print(array_2d_4.shape) # 2 rows, 5 columns   

[[1 2 3]
 [4 5 6]]
[[10 12 14 16 18]
 [20 22 24 26 28]]
[[10 10 10 10 10]
 [10 10 10 10 10]]
[[0.15366136 0.1693033  0.50596431 0.65811887 0.76758088]
 [0.10922746 0.79759653 0.96874591 0.24694934 0.19751383]]
2
(2, 5)


**Note:** The recommended way to generate random numbers in NumPy has changed. New code should call `np.random.default_rng()` to get a new instance of a `Generator`, as documented [here](https://numpy.org/devdocs/reference/random/index.html).    
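
A minimal sketch of why a seed is passed to `default_rng`: two generators created with the same seed produce the same sequence of numbers, which makes results reproducible. The names `rng_a`, `rng_b`, `draw_a` and `draw_b` are hypothetical.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(111)
rng_b = np.random.default_rng(111)

draw_a = rng_a.random(3)
draw_b = rng_b.random(3)

print(np.array_equal(draw_a, draw_b))  # True, identical seeds give identical draws
```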

### NumPy numerical types  

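NumPy arrays can contain numerical types of different precision:  
+ `np.int8`  
+ `np.int32`  
+ `np.float32`  
+ `np.float64`   
+ `np.bool_`  
+ `np.complex128`  

**Note:** The aliases `np.float` and `np.complex` were deprecated in NumPy 1.20 and removed in NumPy 1.24. Use the built-in `float` and `complex` or an explicit type such as `np.float64` instead.  

Accessing the attribute `.dtype` of your NumPy array allows you to check its numerical type. You can also check the type of an individual value inside a NumPy array using the function `type`.     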
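
One way to see what "different precision" means is to compare how many bytes each type uses per value. The sketch below relies on the `itemsize` attribute of a dtype; the dictionary `sizes` is a hypothetical name.

```python
import numpy as np

# Number of bytes used to store a single value of each type
sizes = {}
for numeric_type in (np.int8, np.int32, np.float32, np.float64):
    dtype = np.dtype(numeric_type)
    sizes[dtype.name] = dtype.itemsize
    print(dtype.name, dtype.itemsize)

# int8 uses 1 byte, int32 and float32 use 4 bytes, float64 uses 8 bytes
```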

In [24]:
#-----example 3-----
vector_1 = np.zeros(6) 
print(vector_1.dtype)

vector_2 = np.zeros(5, dtype = np.int8)
print(vector_2.dtype)

print(type(vector_1))
print(type(vector_1[0])) # i.e. NumPy array with type float64

float64
int8
<class 'numpy.ndarray'>
<class 'numpy.float64'>


### Accessing NumPy arrays    

For 1D NumPy arrays, we can:  
+ Use the 0-based index to subset values from an array (similar to a list).  
+ Slice a section of the NumPy array. Note that this returns a view and not a copy of the original array.   
+ Filter by a list of boolean values (i.e. masking). Note that this returns a copy of the original array.  
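
In practice, a boolean mask is often produced by a comparison rather than typed out by hand. The sketch below assumes a hypothetical array `scores` and uses `np.shares_memory()` to check which result is a view and which is a copy.

```python
import numpy as np

# Hypothetical values used only for illustration
scores = np.array([47, 15, 72, 16, 71, 50])

mask = scores > 50        # element-wise comparison returns a boolean array
print(mask)               # [False False  True False  True False]

above_50 = scores[mask]   # masking returns a copy
print(above_50)           # [72 71]

first_three = scores[0:3] # slicing returns a view
print(np.shares_memory(scores, first_three))  # True
print(np.shares_memory(scores, above_50))     # False
```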

In [25]:
#-----example 4-----  
rng = np.random.default_rng(111)
vector_1 = rng.integers(low=0, high=100, size=6)

print(vector_1)
print(vector_1[0]) # subset using 0-based index  

vector_2 = vector_1[0::3] # warning: slicing vector_1 returns a view not a copy
print(vector_2)

vector_2[0] = 100 
print(vector_1)

[47 15 72 16 71 50]
47
[47 16]
[100  15  72  16  71  50]


In [26]:
#-----example 5-----  
vector_3 = vector_1[[False, True, True, False, False, True]] 

# boolean array needs to be the same length as NumPy array
# subsetting by a boolean array returns a copy   

print(vector_1)
print(vector_3)

vector_3[0] = 200 
print(vector_1) # vector_1 remains unchanged

[100  15  72  16  71  50]
[15 72 50]
[100  15  72  16  71  50]


For 2D NumPy arrays, we can:  
+ Still subset values using the 0-based index i.e. `x[row, column]`.      
+ Slice a part of the 2D array. Note that this returns a sliced view and not a copy of the original 2D array.   
+ Subset using boolean masking.    
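
If a slice should not change when the original array changes, one option is to call `.copy()` on the slice. This is a minimal sketch with a hypothetical array `grid`.

```python
import numpy as np

# Hypothetical 2 by 3 array used only for illustration
grid = np.arange(6).reshape(2, 3)

row_view = grid[0:1, :]         # a view that shares data with grid
row_copy = grid[0:1, :].copy()  # an independent copy

grid[0, 0] = 99

print(row_view)  # [[99  1  2]], the view reflects the change
print(row_copy)  # [[0 1 2]], the copy does not
```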

In [27]:
#-----example 6-----
array_1 = np.arange(1, 30, 2).reshape(3, 5) 
print(array_1)  

print(array_1[0, 4]) # first row, last column
print(array_1[-1, -1]) # last row, last column   

print(array_1[0:2+1, 0:2+1]) # subset rows and columns from position 0 to position 2
print(array_1[:, [1,2]]) # subset all rows, and the columns at positions 1 and 2  

# note that array_1[[0,1], [0,1]] does not return a 2D array
# instead, it returns a 1D array of individually subsetted values  

print(array_1[[0,1], [0,1]])

[[ 1  3  5  7  9]
 [11 13 15 17 19]
 [21 23 25 27 29]]
9
29
[[ 1  3  5]
 [11 13 15]
 [21 23 25]]
[[ 3  5]
 [13 15]
 [23 25]]
[ 1 13]


In [28]:
#-----example 7-----  
array_1 = np.arange(1, 30, 2).reshape(3, 5) 
print(array_1)

array_2 = array_1[0:1, :] # warning: slicing array_1 returns a view not a copy
print(array_2)  

array_1[0,0] = 3
print(array_2)

[[ 1  3  5  7  9]
 [11 13 15 17 19]
 [21 23 25 27 29]]
[[1 3 5 7 9]]
[[3 3 5 7 9]]


In [29]:
#-----example 8-----  
array_1 = np.arange(1, 30, 2).reshape(3, 5)   

# create random boolean 2D array
rng = np.random.default_rng(111)
bool_values = [True, False]

bool_mask = rng.choice(bool_values, size = (3, 5))
print(bool_mask)

# masking a 2D array produces a copy of a 1D vector 
array_2 = array_1[bool_mask]
print(array_2)

array_1[0, 0] = 3
print(array_2)

[[ True  True False  True False]
 [False  True False False False]
 [False  True  True False  True]]
[ 1  3  7 13 23 25 29]
[ 1  3  7 13 23 25 29]


### Broadcasting with NumPy arrays  

We can stack NumPy arrays with the `np.concatenate()` function if the two arrays have compatible shapes:  
+ Two arrays have the same number of columns for `concatenate(..., axis = 0)`.    
+ Two arrays have the same number of rows for `concatenate(..., axis = 1)`.    
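
The two shape requirements can be checked with a short sketch. The arrays `tall` and `short` are hypothetical; they share the same number of columns but not the same number of rows, so only `axis = 0` works.

```python
import numpy as np

tall = np.ones((3, 2), dtype=int)       # 3 rows and 2 columns
short = np.ones((2, 2), dtype=int) * 2  # 2 rows and 2 columns

stacked = np.concatenate([tall, short], axis=0)  # same number of columns
print(stacked.shape)  # (5, 2)

try:
    np.concatenate([tall, short], axis=1)  # different number of rows
except ValueError as error:
    print("ValueError: {}".format(error))
```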

In [30]:
#-----example 1-----
a = np.ones((3, 2), dtype = int) # 3 rows and 2 columns (np.int was removed in NumPy 1.24)
b = np.ones((2, 2), dtype = int) * 2  # 2 rows and 2 columns
print(a) 
print(b)

np.concatenate([a, b]) # stack arrays on top of each other  

# contrast with a + b, which produces the error message below
# ValueError: operands could not be broadcast together with shapes (3,2) (2,2) 

[[1 1]
 [1 1]
 [1 1]]
[[2 2]
 [2 2]]


array([[1, 1],
       [1, 1],
       [1, 1],
       [2, 2],
       [2, 2]])

In [31]:
#-----example 2-----
# axis = 0 or 1 represents the axis along which the arrays are joined  
# 0 represents bind rows together
# 1 represents bind columns together 

a = np.ones((3, 2), dtype = int) # 3 rows and 2 columns
b = np.ones((2, 2), dtype = int) * 2
c = np.ones((3, 3), dtype = int) * 3

print(np.concatenate([a, b], axis = 0))
print(np.concatenate([a, c], axis = 1)) # note that [a, b] cannot be joined along axis = 1 as a has 3 rows and b has 2

[[1 1]
 [1 1]
 [1 1]
 [2 2]
 [2 2]]
[[1 1 3 3 3]
 [1 1 3 3 3]
 [1 1 3 3 3]]


**NumPy broadcasting rules**    

When two arrays do not have the same shape, NumPy may still allow element-wise operations by stretching (i.e. virtually repeating) the values of one or both arrays until the shapes match.  

1. Compare the array dimensions (does `a.ndim == b.ndim`?).  
2. If `a.ndim != b.ndim`, then prepend `1`s to the shape of the array with fewer dimensions until both have the same number of dimensions.  
3. If `a.ndim == b.ndim`, examine `a.shape` and `b.shape`. 
4. If `a.shape == b.shape`, perform the element-wise operation.  
5. If `a.shape != b.shape`, compare each dimension in turn. Where one array has a length of 1 in a dimension, stretch it along that dimension to match the other array. 
6. Otherwise (the lengths differ and neither is 1), throw an error.  
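
The rules above can be sketched as a small function that predicts the shape of the result. This is an illustration only and not how NumPy implements broadcasting internally; the function name `broadcast_shape` is hypothetical.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes, or raise a ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # prepend 1s to the shape with fewer dimensions
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)  # equal lengths, or b is stretched
        elif dim_a == 1:
            result.append(dim_b)  # a is stretched
        else:
            raise ValueError(
                "shapes {} and {} cannot be broadcast".format(shape_a, shape_b))
    return tuple(result)

print(broadcast_shape((3, 2), (3, 1)))  # (3, 2)
print(broadcast_shape((3, 1), (3,)))    # (3, 3)

# compare with the shape NumPy actually produces
print((np.ones((3, 1)) + np.ones((3,))).shape)  # (3, 3)
```

Calling `broadcast_shape((2, 3), (3, 1))` raises a `ValueError` because the row lengths 2 and 3 differ and neither is 1.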

In [32]:
#-----example 1-----
a = np.ones((3, 2))
b = np.ones((3, 1)) * 2 

print(a) 
print(b)
a.ndim == b.ndim

# broadcasting so that b is a 3 by 2 array containing 2s

a + b

[[1. 1.]
 [1. 1.]
 [1. 1.]]
[[2.]
 [2.]
 [2.]]


array([[3., 3.],
       [3., 3.],
       [3., 3.]])

In [33]:
#-----example 2-----
a = np.ones((2, 3))
b = np.ones((3, 1)) * 2 

print(a) 
print(b)

# broadcasting can stretch b from shape (3, 1) to (3, 3) along its columns
# but the rows still do not match as a has 2 rows, b has 3 rows and neither is 1  

# a + b produces an error
# ValueError: operands could not be broadcast together with shapes (2,3) (3,1) 

[[1. 1. 1.]
 [1. 1. 1.]]
[[2.]
 [2.]
 [2.]]


In [34]:
#-----example 3-----
a = np.ones((3, 1))
b = np.ones((3, )) * 2 

print(a) 
print(b)

# b.shape = (3,) is first treated as (1, 3) and then stretched into a 3 by 3 array containing 2s
# a is then stretched from (3, 1) into a 3 by 3 array containing 1s to match the shape of b

a + b

[[1.]
 [1.]
 [1.]]
[2. 2. 2.]


array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])

![](../02_figures/01_pandas-header.jpg)

# Pandas DataFrame  

Pandas is one of the most popular libraries for structured data manipulations in Python.  
There are two data structures supported by the Pandas module:  
+ **pd.Series** - similar to a 1D NumPy array (but with extra indexing options).   
+ **pd.DataFrame** - similar to R data frames or tibbles.   
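
As a rough sketch of the two structures, the example below creates a small `pd.Series` with a custom index and a small `pd.DataFrame` from a dictionary. All names and values are hypothetical and are not taken from the examples in this notebook.

```python
import pandas as pd

# A Series is like a 1D array with an index of labels
series = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(series["b"])     # label-based access returns 20
print(series.iloc[1])  # position-based access also returns 20

# A DataFrame can be created from a dictionary of equal-length lists
df = pd.DataFrame({"name": ["Jane", "Paul"],
                   "team": ["dev_team", "dev_team"]})
print(df.shape)             # (2, 2)
print(df["name"].tolist())  # ['Jane', 'Paul']
```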

In [35]:
#-----import the pandas library-----
import pandas as pd

In [36]:
#-----example 1-----

In [37]:
#-----example 2-----

In [38]:
#-----example 3-----

In [39]:
#-----example 4-----