# Python data structures     
## Author: Erika Duan

![](../02_figures/03_lists-header.jpg)

# Lists  

A list is a sequential container for data values (whether logical, integers, floats, strings or other lists) and has some similarities to vectors in R. The key difference is that a list can contain elements of different types.  

Properties of lists include:  

+ A single list can store different primitive types and even other lists.    
+ Lists have a 0-based integer index, which allows for list slicing (i.e. subsetting).  
+ Items can be added to a list using the methods `append()` (adds to the end) or `insert()` (adds at a given position).  
+ Values inside a list can be removed using the methods `remove()` or `pop()` or using the keyword `del`.  
+ The function `len()` can calculate the number of items in a list.  
+ Two lists can be concatenated with the operator `+`.  
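
The last two properties in the list above, `len()` and `+`, can be sketched as follows. The list names `fruits` and `vegetables` are made up for illustration and are not part of the original examples.

```python
# hypothetical example lists, used only to illustrate len() and +
fruits = ["apple", "guava"]
vegetables = ["carrot", "leek", "kale"]

print(len(fruits))  # number of items in the list

produce = fruits + vegetables  # + returns a new list and leaves both inputs unchanged
print(produce)
print(len(produce))
```

Because `+` builds a new list, `fruits` and `vegetables` can still be used separately afterwards.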

In [1]:
#-----example 1.1-----  
list_a = [1, 2.4, "hello world", [0, 1, 2, 3]]

print(list_a)  
type(list_a) 

[1, 2.4, 'hello world', [0, 1, 2, 3]]


list

In [2]:
#-----example 1.2----- 
list_a = [1, 2.4, "hello world", [0, 1, 2, 3]]

for element in list_a:
    print(type(element))
    
# Python lists contain different primitive types  

<class 'int'>
<class 'float'>
<class 'str'>
<class 'list'>


In [3]:
#-----example 2.1-----
list_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  

print(list_b[0]) # the integer 1 occupies position 0   
print(list_b[1]) # the integer 2 occupies position 0 + 1  
print(list_b[-2]) # the integer 9 occupies the second to last position i.e. position -2  

1
2
9


In [4]:
#-----example 2.2----- 
print(list_b[0:3+1]) # extract values from position 0 to position 3

[1, 2, 3, 4]


In [5]:
#-----example 2.3----- 
print(list_b[0::2]) # start from position 0 and extract values from every subsequent second position 
print(list_b[::2]) # the same as list_b[0::2]  

[1, 3, 5, 7, 9]
[1, 3, 5, 7, 9]


In [6]:
#-----example 2.4-----
list_b[:] # the same as list_b.copy()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [7]:
#-----example 3.1----- 
list_c = ["apple", "bear", "donut", "elephant", "guava"] 

print("Original list: {}".format(list_c)) 
list_c.append("horse") # always appends onto the last position in the list
list_c.insert(2, "cat") # inserts "cat" in position 2 of the new list
print("Modified list: {}".format(list_c))

Original list: ['apple', 'bear', 'donut', 'elephant', 'guava']
Modified list: ['apple', 'bear', 'cat', 'donut', 'elephant', 'guava', 'horse']


In [8]:
#-----example 3.2-----
print("Original list: {}".format(list_c)) 

list_c.insert(5, "french toast") # inserts "french toast" in position 5 of the new list
print("Modified list: {}".format(list_c))

Original list: ['apple', 'bear', 'cat', 'donut', 'elephant', 'guava', 'horse']
Modified list: ['apple', 'bear', 'cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']


In [9]:
#-----example 4.1----- 
list_d = list_c.copy() 
print("Original list: {}".format(list_d)) 

del list_d[0:1+1] # del removes objects in a list by index (accepts integers and slices)
print("Objects in positions 0 to 1 are deleted: {}".format(list_d)) 

Original list: ['apple', 'bear', 'cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']
Objects in positions 0 to 1 are deleted: ['cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']


In [10]:
#-----example 4.2-----
print("Original list: {}".format(list_d)) 

list_d.append("cat") # append "cat" to last position in the list
list_d.remove("cat") # only removes the first reference to an object in the list
print("First reference to cat is removed: {}".format(list_d))  

Original list: ['cat', 'donut', 'elephant', 'french toast', 'guava', 'horse']
First reference to cat is removed: ['donut', 'elephant', 'french toast', 'guava', 'horse', 'cat']


In [11]:
#-----example 5-----  
french_toast = list_d.pop(2)
print("The function pop does two things. \nIt removes the object from the original list: {}. \nIt also stores the removed item: {}."
      .format(list_d, french_toast))

The function pop does two things. 
It removes the object from the original list: ['donut', 'elephant', 'guava', 'horse', 'cat']. 
It also stores the removed item: french toast.


![](../02_figures/03_tuples-and-sets-header.jpg)

# Tuples    

A tuple is an object container that behaves like a list but is immutable. In contrast, elements inside a list can be assigned a new value.        

Properties of tuples include:  
+ Tuples can be created by enclosing objects inside round brackets `()`. 
+ Tuples can be subsetted and referenced by index i.e. `[n]`.   
+ You cannot alter tuples after they are created.      
+ You can check whether an item exists in a tuple using the keyword `in`, which returns a Boolean.  
+ You can iterate through a tuple using a `for` loop.     

**Note:** Tuples are not commonly used for data manipulation tasks, although their property of being immutable can make them more useful than lists in special circumstances.     
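
A minimal sketch of the tuple properties listed above, using a made-up tuple named `tuple_demo`. The dictionary `coordinates` is also hypothetical and shows one of the special circumstances where immutability helps: a tuple can be used as a dictionary key, whereas a list cannot.

```python
tuple_demo = ("apple", "bear", 1, 2)  # hypothetical tuple for illustration

print(tuple_demo[0])          # reference an item by index
print("bear" in tuple_demo)   # membership test returns a Boolean

try:
    tuple_demo[0] = "apple_red"   # item assignment is not supported by tuples
except TypeError as error:
    message = str(error)
    print("TypeError: {}".format(message))

# tuples of immutable items are hashable, so they can act as dictionary keys
coordinates = {(0, 0): "origin", (1, 0): "east"}
print(coordinates[(1, 0)])
```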

In [12]:
#-----example 1-----  
tuple_1 = ("apple", "bear", "cat", "donut", 1, 2, 3, 4)
print("Object tuple_1 is of type {}. \nA tuple can contain different primitive types in the same tuple: {}."
      .format(type(tuple_1), tuple_1)) 

Object tuple_1 is of type <class 'tuple'>. 
A tuple can contain different primitive types in the same tuple: ('apple', 'bear', 'cat', 'donut', 1, 2, 3, 4).


In [13]:
#-----example 2----- 
print("Original tuple: {}".format(tuple_1))
list_1 = list(tuple_1)

# tuple_1[0] = "apple_red" produces an error

list_1[0] = "apple_red"
list_1

# lists are mutable but tuples are immutable  

Original tuple: ('apple', 'bear', 'cat', 'donut', 1, 2, 3, 4)


['apple_red', 'bear', 'cat', 'donut', 1, 2, 3, 4]

In [14]:
#-----example 3-----
for index, item in enumerate(tuple_1, 1): # start index at 1 instead of 0  
    print("Item {}: {} - type {}".format(index, item, type(item))) # `item` avoids shadowing the built-in `object`

Item 1: apple - type <class 'str'>
Item 2: bear - type <class 'str'>
Item 3: cat - type <class 'str'>
Item 4: donut - type <class 'str'>
Item 5: 1 - type <class 'int'>
Item 6: 2 - type <class 'int'>
Item 7: 3 - type <class 'int'>
Item 8: 4 - type <class 'int'>


# Sets  

A set is a container that is unindexed, has no order and stores only unique items (i.e. like a mathematical set). Unlike a tuple, a set is mutable: items can be added or removed after it is created.   

Properties of sets include:   
+ Sets can be created by enclosing objects inside curly brackets `{}`. An empty pair `{}` creates a dictionary, so use `set()` to create an empty set.    
+ Duplicate objects enclosed inside a set will be removed.   
+ You cannot subset a set (sets cannot be referenced by index i.e. `[n]`).    
+ You can check if an item is in a set using `in`.    
+ You can iterate through a set using a `for` loop.    

**Note:** Sets are not commonly used for data manipulation tasks.  
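
The set properties above can be sketched with a small made-up set named `subjects`. The names are illustrative only.

```python
subjects = {"maths", "physics", "maths"}  # hypothetical set; the duplicate is dropped
print(len(subjects))
print("physics" in subjects)

try:
    subjects[0]   # sets have no positions, so indexing fails
except TypeError as error:
    print("TypeError: {}".format(error))

empty_set = set()   # {} on its own creates an empty dictionary, not a set
print(type(empty_set), type({}))

subjects.add("biology")   # items can be added after the set is created
print(sorted(subjects))   # sorted() returns a list in a predictable order
```

Printing a set directly may show its items in a different order between sessions, which is why `sorted()` is used for the final print.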

In [15]:
#-----example 1-----  
set_1 = {"maths", "physics", "chemistry", "maths", "biology", "biology"}
set_1 # no duplicate values

{'biology', 'chemistry', 'maths', 'physics'}

In [16]:
#-----example 2-----  
for subject in set_1:
    print(subject)

maths
physics
chemistry
biology


![](../02_figures/03_dictionaries-header.jpg)

# Dictionary  

A dictionary can be thought of as a list where every item is associated with a key (i.e. a self-defined index of strings or numbers) instead of a position. Since Python 3.7, dictionaries preserve the order in which items were inserted.    
+ The index values are called **keys**.  
+ A dictionary therefore contains **key-value pairs** with the format `{key: value(s)}`.      

Dictionaries can be created using `dict()` or by listing key-value pairs inside `{}`.  
+ `{"key_1": ["value_1", "value_2"], "key_2": "value_3"}`     

Dictionary key-value pairs can be accessed by subsetting on the key or by using the `get()` method.   
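
A short sketch of both ways to create a dictionary and both ways to access it. The dictionary `stock` and its keys are made up for illustration.

```python
# hypothetical dictionary; dict() accepts keyword arguments when keys are valid identifiers
stock = dict(apples = 3, pears = 0)
same_stock = {"apples": 3, "pears": 0}
print(stock == same_stock)

print(stock["apples"])       # subsetting on the key
print(stock.get("apples"))   # get() returns the same value
print(stock.get("plums"))    # get() returns None when the key is missing

try:
    stock["plums"]           # subsetting on a missing key raises an error
except KeyError as error:
    print("KeyError: {}".format(error))
```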

In [17]:
#-----example 1-----
branch_teams = {"dev_team": ["Jane", "Paul", "Gwen", "Suresh"],
                "user_design_team": ["Ming", "Sasha", "Amy"],
                "comms_team": ["David", "Prya", "Alice"]} # create a dictionary
branch_teams

{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'],
 'user_design_team': ['Ming', 'Sasha', 'Amy'],
 'comms_team': ['David', 'Prya', 'Alice']}

In [18]:
#-----example 2-----
branch_teams["comms_team"] # subsets the values associated with key = "comms_team"

['David', 'Prya', 'Alice']

In [19]:
#-----example 3.1----- 
comms_team = branch_teams.get("comms_team")

print(comms_team)   
type(comms_team) # comms_team is stored as a list 

['David', 'Prya', 'Alice']


list

In [20]:
#-----example 3.2----- 
HR_team = branch_teams.get("HR_team", "No team exists")  
HR_team

# get() allows us to return an alternate string when the key is not found inside the dictionary 
# this prevents our code from returning an error message which halts the analysis

'No team exists'

To manipulate dictionaries, we can perform the following other actions:  
+ You can check whether a key exists in a dictionary using the keyword `in` i.e. `"key_2" in dict_1` should return `True`.  
+ You can retrieve dictionary keys using `dict_1.keys()`.  
+ You can retrieve dictionary values using `dict_1.values()`.  
+ You can retrieve dictionary items using `dict_1.items()`.    

In [21]:
#-----example 1----- 
print("dev_team" in branch_teams) 
print("policy_team" in branch_teams)

True
False


In [22]:
#-----example 2-----  
print(branch_teams.keys())
print(branch_teams.values())
print(branch_teams.items())

# compare the output difference for each statement i.e. item = all key + value pairs

dict_keys(['dev_team', 'user_design_team', 'comms_team'])
dict_values([['Jane', 'Paul', 'Gwen', 'Suresh'], ['Ming', 'Sasha', 'Amy'], ['David', 'Prya', 'Alice']])
dict_items([('dev_team', ['Jane', 'Paul', 'Gwen', 'Suresh']), ('user_design_team', ['Ming', 'Sasha', 'Amy']), ('comms_team', ['David', 'Prya', 'Alice'])])


You can modify key-value pairs in a dictionary.    
+ New items can be added through `dict_1["key_3"] = ["value_4", "value_5"]`   
+ You can delete a key-value pair using the `del` keyword or the `pop()` method (both approaches modify the dictionary in place).  
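
A minimal sketch of the `del` keyword applied to a dictionary, using a made-up dictionary named `teams`. It also shows `pop()` with a default value, which is one way to avoid an error when the key may not exist.

```python
# hypothetical dictionary for illustration
teams = {"dev_team": ["Jane", "Paul"], "comms_team": ["David"]}

del teams["comms_team"]   # del removes the key-value pair and returns nothing
print(teams)

removed = teams.pop("hr_team", None)   # the default is returned when the key is missing
print(removed)
```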

In [23]:
#-----example 1-----
branch_teams["data_science_team"] = ["Paula", "Mark"] 

# add a new key-value pair to dictionary

branch_teams

{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'],
 'user_design_team': ['Ming', 'Sasha', 'Amy'],
 'comms_team': ['David', 'Prya', 'Alice'],
 'data_science_team': ['Paula', 'Mark']}

In [24]:
#-----example 2-----
new_data_science_team = branch_teams.pop("data_science_team")
print(new_data_science_team)
print(branch_teams) 

# dictionary is modified in place i.e. data_science_team has been deleted via pop()

['Paula', 'Mark']
{'dev_team': ['Jane', 'Paul', 'Gwen', 'Suresh'], 'user_design_team': ['Ming', 'Sasha', 'Amy'], 'comms_team': ['David', 'Prya', 'Alice']}


In [25]:
#-----example 3-----
branch_teams["comms_team"] = ["David", "Pryanna", "Alice", "Li"]

for key, values in branch_teams.items():
    print("The {} now consists of members {}.".format(key, values))

The dev_team now consists of members ['Jane', 'Paul', 'Gwen', 'Suresh'].
The user_design_team now consists of members ['Ming', 'Sasha', 'Amy'].
The comms_team now consists of members ['David', 'Pryanna', 'Alice', 'Li'].


**Note:** Dictionaries are mutable like lists because you can overwrite pre-existing key-value pairs.  

![](../02_figures/03_numpy-header.jpg)

# NumPy arrays  

Although lists are powerful and versatile containers, they become computationally expensive as the list length increases.   
The alternative is to use [NumPy](https://www.nature.com/articles/s41586-020-2649-2), a library that provides an N-dimensional array container for storing numerical types.  

We can perform a diverse range of functions and mathematical operations using NumPy arrays.  

Functions include:  

+ Creating 1D or 2D NumPy arrays.   
+ Accessing elements inside NumPy arrays. 
+ Performing mathematical operations, linear algebra via **np.linalg** and aggregation functions e.g. `min()`, `max()`, `np.mean()`, `np.std()`.   
+ Combining arrays (with respect to broadcasting rules).    

**Note:** A good NumPy tutorial can also be found [here](https://cs231n.github.io/python-numpy-tutorial/#numpy).  
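
As a rough illustration of why arrays are convenient for numerical work, the sketch below contrasts a list with a NumPy array for the same arithmetic. The variable names are made up, and the example only shows the difference in behaviour, not in speed.

```python
import numpy as np

values_list = [1, 3, 5, 7, 9]   # hypothetical values
values_array = np.array(values_list)

doubled_list = [value * 2 for value in values_list]   # a list needs an explicit loop
doubled_array = values_array * 2                      # an array applies the operation element-wise

print(doubled_list)
print(doubled_array)
print(values_list * 2)   # for a list, * repeats the list instead of multiplying its values
```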

In [26]:
#-----import the numpy library-----
import numpy as np

In [27]:
#-----example 1.1----- 
vector_1 = np.array([1, 3, 5, 7, 9])
vector_1

array([1, 3, 5, 7, 9])

In [28]:
#-----example 1.2-----
print("Numpy array dimensions: {}".format(vector_1.ndim)) # 1D array
print("Numpy array length(s): {}".format(vector_1.shape)) # length of values for a 1D array 

Numpy array dimensions: 1
Numpy array length(s): (5,)


In [29]:
#-----example 2----- 
vector_2 = np.arange(start = 1,
                     stop = 9+1, # stop at x+1 to end with value x
                     step = 2) # step = 2 means to select every second value after the start position   

# numpy.arange([start, ]stop, [step, ], dtype=None) -> numpy.ndarray  

vector_2

array([1, 3, 5, 7, 9])

In [30]:
#-----example 3-----  
vector_3 = np.ones(5)   
vector_3

array([1., 1., 1., 1., 1.])

In [31]:
#-----example 4.1-----
vector_1 = np.array([1, 3, 5, 7, 9]) 

print(vector_1 + 5) 
print(vector_1 ** 2)

# you can perform mathematical operations on arrays

[ 6  8 10 12 14]
[ 1  9 25 49 81]


In [32]:
#-----example 4.2-----
print(min(vector_1)) 
print(vector_1.mean()) 

# you can apply aggregation functions to arrays

1
5.0


In [33]:
#-----example 5.1----- 
array_2d_1 = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d_1)  

# you can create a 2D array manually by enclosing each row in a list  

print("Numpy array dimensions: {}".format(array_2d_1.ndim))  

[[1 2 3]
 [4 5 6]]
Numpy array dimensions: 2


In [34]:
#-----example 5.2-----   
array_2d_2 = np.arange(10, 30, 2).reshape(2, 5) 
print(array_2d_2)

# reshape an 1D array into a 2D array with 2 rows of length 5  

print("Numpy array dimensions: {}".format(array_2d_2.shape))  

[[10 12 14 16 18]
 [20 22 24 26 28]]
Numpy array dimensions: (2, 5)


In [35]:
#-----example 5.3-----  
array_2d_3 = np.full(shape = (2, 5),
                     fill_value = 10) 
print(array_2d_3)  

# create a 2D Numpy array with 2 rows of length 5 and fill it with 10s   

print("Numpy array dimensions: {}".format(array_2d_3.shape))  

[[10 10 10 10 10]
 [10 10 10 10 10]]
Numpy array dimensions: (2, 5)


In [36]:
#-----example 5.4-----  
rng = np.random.default_rng(111) # set random seed
array_2d_4 = rng.random(10).reshape(5, 2) # pick 10 random floats from the half-open interval [0.0, 1.0)

print(array_2d_4)  
print("Numpy array dimensions: {}".format(array_2d_4.shape))  

[[0.15366136 0.1693033 ]
 [0.50596431 0.65811887]
 [0.76758088 0.10922746]
 [0.79759653 0.96874591]
 [0.24694934 0.19751383]]
Numpy array dimensions: (5, 2)


**Note:** The way that NumPy generates random numbers has changed. The recommended approach is now to call `default_rng` to get a new instance of a generator, as documented [here](https://numpy.org/devdocs/reference/random/index.html).    
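
A small sketch of how seeding works with `default_rng`, assuming the same seed value of 111 used above. The generator names `rng_a` and `rng_b` are made up for illustration.

```python
import numpy as np

rng_a = np.random.default_rng(111)   # two generators created with the same seed
rng_b = np.random.default_rng(111)

draws_a = rng_a.random(3)
draws_b = rng_b.random(3)
print(np.array_equal(draws_a, draws_b))   # identical seeds give identical draws

draws_c = rng_a.random(3)   # a generator keeps advancing, so a second call gives new values
print(np.array_equal(draws_a, draws_c))
```

This is why the examples in this section create a fresh generator with the same seed whenever a reproducible result is needed.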

## NumPy numerical types  

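NumPy arrays are efficient partly because every value in an array shares a single numerical type of fixed precision, such as:  
+ `np.int8`  
+ `np.int32`  
+ `np.float32`  
+ `np.float64`   
+ `np.bool_`  
+ `np.complex128`  

**Note:** The older aliases `np.int`, `np.float` and `np.complex` were removed in NumPy 1.24. Use the sized types above or the Python built-ins `int`, `float` and `complex` instead.  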

Accessing the attribute `.dtype` of your NumPy array allows you to check its numerical type. You can also check the type of an individual value inside a NumPy array using the function `type()`.     
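
The sketch below illustrates how the numerical type affects memory use and how to inspect types. The array names are made up, and the last lines show a consequence of choosing a small type: integer arrays wrap around when a result exceeds the range of the type.

```python
import numpy as np

small = np.ones(6, dtype = np.int8)    # 1 byte per value
large = np.ones(6, dtype = np.int64)   # 8 bytes per value

print(small.dtype, small.nbytes)   # nbytes = number of values * bytes per value
print(large.dtype, large.nbytes)

print(type(small[0]))   # type of a single value inside the array

# 127 is the largest int8 value, so adding 1 wraps around to the smallest value
wrapped = np.array([127], dtype = np.int8) + np.array([1], dtype = np.int8)
print(wrapped)
```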

In [37]:
#-----example 1-----
vector_1 = np.ones(6) 
print(vector_1.dtype)  

print("The range of a Numpy float64 is between {} and {}.".format(np.finfo(np.float64).min,
                                                                  np.finfo(np.float64).max))

float64
The range of a Numpy float64 is between -1.7976931348623157e+308 and 1.7976931348623157e+308.


In [38]:
#-----example 2-----  
vector_2 = np.ones(6, dtype = np.int8)
print(vector_2.dtype)

print("The range of a Numpy int8 is between {} and {}.".format(np.iinfo(np.int8).min,
                                                               np.iinfo(np.int8).max))  

# manually reduce the size of the integer type using dtype

int8
The range of a Numpy int8 is between -128 and 127.


## Accessing NumPy arrays    

For 1D NumPy arrays, we can:  
+ Use the 0-based index to subset values from an array (similar to a list).  
+ Slice a section of the NumPy array. Note that this only returns a view and not a copy of the original array.   
+ Filter by a list of boolean values (i.e. masking). Note that this returns a copy of the original array.  
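
One way to check whether a result is a view or a copy is `np.shares_memory()`. The sketch below uses a made-up array named `vector`.

```python
import numpy as np

vector = np.arange(6)   # hypothetical array: [0 1 2 3 4 5]

sliced = vector[0::3]          # slicing returns a view
masked = vector[vector > 2]    # boolean masking returns a copy
copied = vector[0::3].copy()   # .copy() turns a view into an independent array

print(np.shares_memory(vector, sliced))
print(np.shares_memory(vector, masked))
print(np.shares_memory(vector, copied))
```

Calling `.copy()` on a slice is a simple way to avoid accidentally changing the original array.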

In [39]:
#-----example 1-----  
rng = np.random.default_rng(111)
vector_1 = rng.integers(low=0, high=100, size=6)

print("Original vector_1: {}".format(vector_1))
print(vector_1[0]) # subset using 0-based index  

Original vector_1: [47 15 72 16 71 50]
47


In [40]:
#-----example 2-----
vector_2 = vector_1[0::3] # warning: slicing vector_1 returns a view not a copy
print(vector_2)

vector_2[0] = 100 
print("vector_1 is also changed: {}".format(vector_1))

# warning: changing vector_2 also changes vector_1

[47 16]
vector_1 is also changed: [100  15  72  16  71  50]


In [41]:
#-----example 3-----  
vector_3 = vector_1[[False, True, True, False, False, True]] 

# boolean array needs to be the same length as NumPy array
# subsetting by a boolean array returns a copy   

print("vector_1: {}".format(vector_1))
print("vector_3: {}".format(vector_3)) # only True values are returned i.e. masking  

vector_3[0] = 200 
print("vector_1 remains unchanged: {}".format(vector_1)) 

# vector_3 returns a new copy, not view, of vector_1 following boolean masking

vector_1: [100  15  72  16  71  50]
vector_3: [15 72 50]
vector_1 remains unchanged: [100  15  72  16  71  50]


For 2D NumPy arrays, we can:  
+ Still subset values using the 0-based index i.e. `x[row, column]`.      
+ Slice a part of the 2D array. Note that this returns a sliced view and not a copy of the original 2D array.   
+ Subset using boolean masking. Note that this returns a copy of the original 2D array.     

In [42]:
#-----example 1.1-----
array_1 = np.arange(1, 30, 2).reshape(3, 5) 
print(array_1)  

[[ 1  3  5  7  9]
 [11 13 15 17 19]
 [21 23 25 27 29]]


In [43]:
#-----example 1.2-----  
print(array_1[0, 4]) # subset value from first row, last column
print(array_1[-1, -1]) # subset value from last row, last column   

9
29


In [44]:
#-----example 1.3-----  
print(array_1[0:1+1, 0:2+1]) # subset rows from position 0 to 1 and columns from position 0 to 2

[[ 1  3  5]
 [11 13 15]]


In [45]:
#-----example 1.4-----
print(array_1[:, [0, 1, 2]]) # subset all rows and columns from position 0 to 2    

[[ 1  3  5]
 [11 13 15]
 [21 23 25]]


In [46]:
#-----example 1.5-----
array_1[[0,1], [0,1]]

# warning: array_1[[0,1], [0,1]] does not return a 2D array
# instead, the two lists are paired up and a 1D array of the values at (0, 0) and (1, 1) is returned  

array([ 1, 13])
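
If the intention in example 1.5 was to extract the 2D block formed by rows 0 to 1 and columns 0 to 1, one option is `np.ix_()`. The sketch below rebuilds the same values as `array_1` under the made-up name `array_demo`.

```python
import numpy as np

array_demo = np.arange(1, 30, 2).reshape(3, 5)   # same values as array_1 above

paired = array_demo[[0, 1], [0, 1]]          # pairs the lists: positions (0, 0) and (1, 1)
block = array_demo[np.ix_([0, 1], [0, 1])]   # rows 0 to 1 crossed with columns 0 to 1

print(paired)
print(block)
```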

In [47]:
#-----example 2-----  
array_1 = np.arange(1, 30, 2).reshape(3, 5) 
print("Original array_1: \n{}".format(array_1))

array_2 = array_1[0:1+1, :] 
print("New array_2 is a slice/view of array_1: \n{}".format(array_2))  

# warning: slicing array_1 returns a view not a copy

array_1[0,0] = 3
print("Changing array_1 also changes array_2: \n{}".format(array_2)) 

Original array_1: 
[[ 1  3  5  7  9]
 [11 13 15 17 19]
 [21 23 25 27 29]]
New array_2 is a slice/view of array_1: 
[[ 1  3  5  7  9]
 [11 13 15 17 19]]
Changing array_1 also changes array_2: 
[[ 3  3  5  7  9]
 [11 13 15 17 19]]


In [48]:
#-----example 3-----  
array_1 = np.arange(1, 30, 2).reshape(3, 5)   
print("Original array_1: \n{}".format(array_1))

# create random boolean 2D array filled with True and False values 
rng = np.random.default_rng(111)
bool_values = [True, False]

bool_mask = rng.choice(bool_values, size = (3, 5))
print("Random boolean mask: \n{}".format(bool_mask))

array_2 = array_1[bool_mask]
print("Masked array_1 values: \n{}".format(array_2))

# masking a 2D array produces a 1D vector copy  
# changing array_1 does not change masked array_2

array_1[0, 0] = 3
print("Changing array_1 does not change array_2: \n{}".format(array_2)) 

Original array_1: 
[[ 1  3  5  7  9]
 [11 13 15 17 19]
 [21 23 25 27 29]]
Random boolean mask: 
[[ True  True False  True False]
 [False  True False False False]
 [False  True  True False  True]]
Masked array_1 values: 
[ 1  3  7 13 23 25 29]
Changing array_1 does not change array_2: 
[ 1  3  7 13 23 25 29]


## Broadcasting with NumPy arrays  

We can stack NumPy arrays with the function `np.concatenate()` if the two arrays have compatible shapes:  
1. Two arrays have the same number of columns for `np.concatenate(..., axis = 0)`. This stacks the two arrays on top of each other, similar to `bind_rows()` in R.      
2. Two arrays have the same number of rows for `np.concatenate(..., axis = 1)`. This stacks the two arrays side by side, similar to `bind_cols()` in R.            

**Note:** By default, `axis = 0` i.e. the arrays are joined along the first axis, so rows are stacked on top of each other.  

In [49]:
#-----example 1-----
a = np.ones((3, 2), dtype = int) # 3 rows and 2 columns (np.int was removed in NumPy 1.24)
b = np.ones((2, 2), dtype = int) * 2  # 2 rows and 2 columns

np.concatenate([a, b]) # axis = 0 by default

# contrast with a + b, which produces the error message below
# ValueError: operands could not be broadcast together with shapes (3,2) (2,2) 

array([[1, 1],
       [1, 1],
       [1, 1],
       [2, 2],
       [2, 2]])

In [50]:
#-----example 2----- 
# axis = 1 binds columns together 

a = np.zeros((2, 2), dtype = int) # np.int was removed in NumPy 1.24
b = np.ones((2, 3), dtype = int) * 3

np.concatenate([a, b], axis = 1) 

array([[0, 0, 3, 3, 3],
       [0, 0, 3, 3, 3]])

### NumPy broadcasting rules

When two arrays do not have the same shape, NumPy still may allow element-wise operations to be performed by filling in copies of one array to match the other array's shape.    

1. Compare the array dimensions (does `a.ndim == b.ndim`?).  
2. If `a.ndim != b.ndim`, then prepend `1s` to the smaller array's shape.  
3. If `a.ndim == b.ndim`, examine `a.shape` and `b.shape`. 
4. If `a.shape == b.shape`, perform the element-wise operation.  
5. If `a.shape != b.shape`, compare the shapes one dimension at a time. Wherever the lengths differ and one of them is 1, stretch that dimension to match the other array. 
6. If the lengths of any dimension differ and neither is 1, throw an error.  
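
The rules above can be sketched as a small helper that predicts the shape produced by broadcasting. The function `broadcast_shape()` is a made-up illustration and not part of NumPy. It is compared against the shape that NumPy actually returns.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape of two shapes or raise a ValueError (illustrative helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # prepend 1s to the shorter shape so both shapes have the same number of dimensions
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for length_a, length_b in zip(padded_a, padded_b):
        if length_a == length_b or length_b == 1:
            result.append(length_a)   # equal lengths, or b is stretched
        elif length_a == 1:
            result.append(length_b)   # a is stretched
        else:
            raise ValueError("shapes {} and {} cannot be broadcast".format(shape_a, shape_b))
    return tuple(result)

print(broadcast_shape((3, 2), (3, 1)))
print(broadcast_shape((3, 1), (3,)))
print(broadcast_shape((3, 1), (3,)) == (np.zeros((3, 1)) + np.ones(3)).shape)
```

Calling `broadcast_shape((2, 3), (3, 1))` raises a `ValueError` because the first dimensions have lengths 2 and 3 and neither is 1.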

In [51]:
#-----example 1-----
a = np.zeros((3, 2))
b = np.ones((3, 1)) * 2 

print("A has 2 dimensions: \n{}".format(a)) 
print("B has 2 dimensions but 1 less column than A: \n{}". format(b))
print(a.ndim == b.ndim)

# broadcasting so that b is a 3 by 2 array containing 2s

a + b

A has 2 dimensions: 
[[0. 0.]
 [0. 0.]
 [0. 0.]]
B has 2 dimensions but 1 less column than A: 
[[2.]
 [2.]
 [2.]]
True


array([[2., 2.],
       [2., 2.],
       [2., 2.]])

In [52]:
#-----example 2-----
a = np.zeros((2, 3))
b = np.ones((2, 1)) * 2 

print("A has 2 rows and 3 columns: \n{}".format(a)) 
print("B has 2 rows and 1 column: \n{}". format(b))
print(a.ndim == b.ndim)

# broadcasting to expand b into a 2 by 3 array  

a + b

A has 2 rows and 3 columns: 
[[0. 0. 0.]
 [0. 0. 0.]]
B has 2 rows and 1 column: 
[[2.]
 [2.]]
True


array([[2., 2., 2.],
       [2., 2., 2.]])

In [53]:
#-----example 3----- 
a = np.zeros((2, 3))
b = np.ones((3, 1)) * 2 

print("A has 2 rows and 3 columns: \n{}".format(a)) 
print("B has 3 rows and 1 column: \n{}". format(b))
print(a.ndim == b.ndim)

# broadcasting can stretch the columns of b from 1 to 3, giving b.shape = (3, 3)
# but the row lengths of a and b (2 and 3) differ and neither is 1, so the shapes remain incompatible  

# a + b produces an error
# ValueError: operands could not be broadcast together with shapes (2,3) (3,1) 

A has 2 rows and 3 columns: 
[[0. 0. 0.]
 [0. 0. 0.]]
B has 3 rows and 1 column: 
[[2.]
 [2.]
 [2.]]
True


In [54]:
#-----example 4-----
a = np.zeros((3, 1)) 
b = np.ones((3, )) * 2 

print("A is a 2D array with 3 rows and 1 column: \n{}".format(a)) 
print("B is a 1D array with length 3: \n{}". format(b))
print(a.ndim == b.ndim)

# broadcasting first prepends a 1 to b.shape, so b is treated as a 2D array with 1 row and 3 columns
# both arrays are then stretched to 3 by 3: a along its columns and b along its rows

a + b

A is a 2D array with 3 rows and 1 column: 
[[0.]
 [0.]
 [0.]]
B is a 1D array with length 3: 
[2. 2. 2.]
False


array([[2., 2., 2.],
       [2., 2., 2.],
       [2., 2., 2.]])

![](../02_figures/03_pandas-header.jpg)

# Pandas DataFrame  

Pandas is one of the most popular libraries for structured data manipulation in Python.  
Pandas provides two main data structures.  

## pd.Series  

+ Similar to a 1D NumPy array (but with extra indexing options).   
+ Created from a list, NumPy array or dictionary.  
+ Subsetting a Series via its index is similar to indexing a Python dictionary.  
+ Use `iloc[]` to subset via the implicit index (i.e. 0-based index).  
+ Use `loc[]` to subset via the explicit index (i.e. explicit names).  

**Note:** Slicing with the implicit index does not include the last/upper value but slicing with the explicit index does.  
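
The difference between the two kinds of slicing is easiest to see when the explicit index also contains integers. The sketch below uses a made-up Series named `series_demo`.

```python
import pandas as pd

# hypothetical Series whose explicit index labels are integers that do not start at 0
series_demo = pd.Series(["a", "b", "c", "d"], index = [10, 20, 30, 40])

by_position = series_demo.iloc[0:2]   # positions 0 and 1; the upper bound is excluded
by_label = series_demo.loc[10:30]     # labels 10 to 30; the upper bound is included

print(by_position)
print(by_label)
```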

In [55]:
#-----import the pandas library-----
import pandas as pd

In [56]:
#-----example 1.1-----
series_1 = pd.Series([1, 2, 3, 4]) # manually creating a Series
print(series_1) 

# note the additional presence of a 0-based index

0    1
1    2
2    3
3    4
dtype: int64


In [57]:
#-----example 1.2-----  
series_2 = pd.Series(np.arange(5, 11+1, 2)) # converting a NumPy array into a Series  
print(series_2) 

0     5
1     7
2     9
3    11
dtype: int32


In [58]:
#-----example 1.3-----
series_3 = pd.Series({"item_1": "key", 
                      "item_2": "passport",
                      "item_3": "tickets",
                      "item_4": "snacks"}) # converting a dictionary into a Series
print(series_3) 

# note that the dictionary key is used as the Series index i.e. an explicit index   

item_1         key
item_2    passport
item_3     tickets
item_4      snacks
dtype: object


In [59]:
#-----example 1.4-----
series_4 = pd.Series([1, 2, 3, 4],
                     index = ["id1", "id2", "id3", "id4"]) # explicitly creating an index 
series_4

id1    1
id2    2
id3    3
id4    4
dtype: int64

In [60]:
#-----example 2.1-----
print("series_2: \n{}".format(series_2))
print("series_2 has an implicit index: {}".format(series_2.index)) 

series_2.iloc[2:3+1] # slicing does not include the last value when slicing via the implicit index    

series_2: 
0     5
1     7
2     9
3    11
dtype: int32
series_2 has an implicit index: RangeIndex(start=0, stop=4, step=1)


2     9
3    11
dtype: int32

In [61]:
#-----example 2.2----- 
print("series_3: \n{}".format(series_3))
print("series_3 has an explicit index: {}".format(series_3.index)) 

print(series_3.iloc[0:2+1]) # you can still slice via the implicit index

series_3: 
item_1         key
item_2    passport
item_3     tickets
item_4      snacks
dtype: object
series_3 has an explicit index: Index(['item_1', 'item_2', 'item_3', 'item_4'], dtype='object')
item_1         key
item_2    passport
item_3     tickets
dtype: object


In [62]:
#-----example 2.3-----
print(series_3.loc["item_1": "item_3"]) # slicing includes the last value for explicit index

item_1         key
item_2    passport
item_3     tickets
dtype: object


## pd.DataFrame    

The Pandas DataFrame is the data structure that is most commonly used for data analysis.  

+ Similar to R data frames or tibbles. 
+ Can be created from a Pandas Series, NumPy array or dictionary.  
+ To select columns from a DataFrame, subset on the column names.    
+ Use `iloc[]` to subset rows via the implicit index (i.e. 0-based index).  
+ Use `loc[]` to subset rows via the explicit index (i.e. explicit names).   
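
A minimal sketch of the three subsetting approaches listed above, using a made-up DataFrame named `df_demo` with an explicit string index. The column names and values are illustrative only.

```python
import pandas as pd

# hypothetical DataFrame with an explicit string index
df_demo = pd.DataFrame({"items": [1, 2, 3], "price": [0.5, 1.2, 3.0]},
                       index = ["ID1", "ID2", "ID3"])

print(df_demo["items"])              # a single column name returns a Series
print(df_demo[["items"]])            # a list of column names returns a DataFrame
print(df_demo.iloc[0:2])             # rows at positions 0 and 1
print(df_demo.loc[["ID2", "ID3"]])   # rows labelled ID2 and ID3
print(df_demo.loc["ID1", "price"])   # loc[] also accepts a row label and a column name together
```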

In [63]:
#-----example 1.1-----  
series_1 = pd.Series(np.arange(1, 3+1, 1))  

df_1 = pd.DataFrame(series_1, columns = ["items"]) # create a DataFrame with 1 column from a Series
df_1

   items
0      1
1      2
2      3


In [64]:
#-----example 1.2-----  
series_2 = pd.Series([f'ID {n}' for n in range(1, 3+1, 1)]) # f-strings allow easy string evaluations 

df_2 = pd.DataFrame({"ID" : series_2, 
                     "items" : series_1}) # create a DataFrame with 2 columns using a dictionary  
df_2  

     ID  items
0  ID 1      1
1  ID 2      2
2  ID 3      3


In [65]:
#-----example 2.1-----  
df_3 = pd.DataFrame(np.arange(1, 9+1).reshape(3,3)) # create a DataFrame from a 2D NumPy array  
df_3

   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9


In [66]:
#-----example 2.2----- 
df_3.columns = ["Monday", "Wednesday", "Friday"] # input column names as a list
df_3 # df_3 is modified in place

   Monday  Wednesday  Friday
0       1          2       3
1       4          5       6
2       7          8       9


In [67]:
#-----example 2.3----- 
df_3.iloc[1: 2+1] # slice rows implicitly 

# note that df_2.loc[["ID3", "ID4"]] raises a KeyError
# because loc[] looks up index labels and df_2 only has the labels 0, 1 and 2

   Monday  Wednesday  Friday
1       4          5       6
2       7          8       9


In [68]:
#-----example 2.4-----
df_3[["Monday", "Friday"]] # select columns explicitly via column name  

# use [[]] to return a DataFrame  

   Monday  Friday
0       1       3
1       4       6
2       7       9


You can use `df.index` to access information about the DataFrame index and `df.info()` to access information about the DataFrame columns.   

In [69]:
#-----example 1-----  
df_3.index

RangeIndex(start=0, stop=3, step=1)

In [70]:
#-----example 2-----
df_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Monday     3 non-null      int32
 1   Wednesday  3 non-null      int32
 2   Friday     3 non-null      int32
dtypes: int32(3)
memory usage: 164.0 bytes


In [71]:
#-----example 3-----  
df_3.iloc[0: 2+1].shape # returns a tuple  

(3, 3)

You can use `df.shape` to extract the dimensions of your DataFrame into a tuple.    

## Concatenating Pandas DataFrames  

The DataFrame index dictates where arithmetic operations are applied between DataFrames.    

In [72]:
#-----example 1-----   
df1 = pd.DataFrame(np.arange(3), columns = ['a'], index = [0, 1, 2])  
df2 = pd.DataFrame(np.arange(3), columns = ['a'], index = [1, 2, 3]) # columns need to be named the same

df3 = df1 + df2
df3

# we should avoid using `+` to combine DataFrames   
# as rows whose index is missing from either DataFrame silently become NaN

     a
0  NaN
1  1.0
2  3.0
3  NaN


In [73]:
#-----example 2-----
df3.dropna() # remove rows with NaNs

     a
1  1.0
2  3.0


In [74]:
#-----example 3----- 
df3.fillna(df3.mean()) # replace NaNs with another value (here, the column mean)

     a
0  2.0
1  1.0
2  3.0
3  2.0
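
If the goal is to add two DataFrames without producing `NaN` for mismatched rows, one option is the `add()` method with `fill_value`. The sketch below rebuilds the two DataFrames from example 1 under the made-up names `df_left` and `df_right`.

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame(np.arange(3), columns = ["a"], index = [0, 1, 2])
df_right = pd.DataFrame(np.arange(3), columns = ["a"], index = [1, 2, 3])

# fill_value = 0 treats a row that is missing from one DataFrame as 0 before adding
df_sum = df_left.add(df_right, fill_value = 0)
print(df_sum)
```

Whether 0 is a sensible substitute depends on the analysis, so this is a choice to make deliberately rather than a default.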


When we want to combine DataFrames, we can use `concat()` although other methods also exist.   

The concatenation of Pandas Series and DataFrames follows some simple rules:  
+ Unlike addition, the index values are not used to align rows and carry no special positional meaning. 
+ Indexes from each series or DataFrame are stacked on top of each other, unless `ignore_index = True` is specified.  
+ If DataFrames have different number of columns, the concatenation occurs by column name. DataFrames are stacked on top of each other and `NaN` is assigned to missing areas.   
+ We can also use `concat()` in combination with the argument `join = 'inner'` to only keep columns that are present in both DataFrames.  
+ Advanced methods for combining DataFrames are preferable and can be accessed via `join()` and `merge()`.  
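
As a brief sketch of the last point, `pd.merge()` combines DataFrames by matching values in a shared column instead of stacking them. The DataFrames `df_items` and `df_prices` and their contents are made up for illustration.

```python
import pandas as pd

# hypothetical tables that share an "ID" column
df_items = pd.DataFrame({"ID": ["ID 1", "ID 2", "ID 3"], "items": [1, 2, 3]})
df_prices = pd.DataFrame({"ID": ["ID 2", "ID 3", "ID 4"], "price": [1.2, 3.0, 0.8]})

inner = pd.merge(df_items, df_prices, on = "ID", how = "inner")   # keep IDs found in both tables
left = pd.merge(df_items, df_prices, on = "ID", how = "left")     # keep every ID from df_items

print(inner)
print(left)
```

In the left merge, the row for `ID 1` has no matching price, so its `price` value is `NaN`.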

In [75]:
#-----example 1.1-----  
series_1 = pd.Series([1, 2, 3], index = [0, 1, 2])
series_2 = pd.Series([1, 2, 3], index = [0, 1, 2])

pd.concat([series_1, series_2]) 

# two series are stacked on top of each other

0    1
1    2
2    3
0    1
1    2
2    3
dtype: int64

In [76]:
#-----example 1.2-----
pd.concat([series_1, series_2]).reset_index(drop = True)  

# two series are concatenated and the original index is reset but not saved as a new column

0    1
1    2
2    3
3    1
4    2
5    3
dtype: int64

In [77]:
#-----example 2-----  
df_1 = pd.DataFrame(np.arange(4).reshape(2, 2)) # numbers filled from left to right by row
df_2 = pd.DataFrame(np.arange(4).reshape(2, 2))  

pd.concat([df_1, df_2], ignore_index = True) 

# setting the argument to ignore_index = True also resets the DataFrame index  

   0  1
0  0  1
1  2  3
2  0  1
3  2  3


In [78]:
#-----example 3-----  
df_1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns = ['a', 'b', 'c']) 
df_2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns = ['a', 'd'])  

print("df_1 has 2 rows and 3 named columns: \n{}".format(df_1))
print("df_2 has 2 rows and 2 named columns: \n{}".format(df_2))

pd.concat([df_1, df_2]) 

# DataFrames are concatenated via matching column names  

df_1 has 2 rows and 3 named columns: 
   a  b  c
0  0  1  2
1  3  4  5
df_2 has 2 rows and 2 named columns: 
   a  d
0  0  1
1  2  3


   a    b    c    d
0  0  1.0  2.0  NaN
1  3  4.0  5.0  NaN
0  0  NaN  NaN  1.0
1  2  NaN  NaN  3.0


In [79]:
#-----example 4-----  
df_1 = pd.DataFrame(np.arange(6).reshape(2, 3), columns = ['a', 'b', 'c']) 
df_2 = pd.DataFrame(np.arange(4).reshape(2, 2), columns = ['a', 'c'])  

print("df_1 has 2 rows and 3 named columns: \n{}".format(df_1))
print("df_2 has 2 rows and 2 named columns: \n{}".format(df_2))

pd.concat([df_1, df_2], join = 'inner', ignore_index = True) 

# join = 'inner' only returns columns that are present in both df_1 and df_2    

df_1 has 2 rows and 3 named columns: 
   a  b  c
0  0  1  2
1  3  4  5
df_2 has 2 rows and 2 named columns: 
   a  c
0  0  1
1  2  3


   a  c
0  0  2
1  3  5
2  0  1
3  2  3
