# Python Basics: Data Types and Data Structures

## Chapter 1: Lists

Lists are ordered collections of variable length. They are mutable, i.e. their entries may be changed after
the list has been created. In particular, they can be modified in-place. This means that we can modify a list
without creating a new list object.
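
A minimal sketch of what "in-place" means, using a hypothetical list called `numbers` (the name is an assumption, not part of this notebook). The built-in `id()` function is used here only to show that the variable keeps pointing at the same object.

```python
# Hypothetical example list, used only for illustration
numbers = [1, 2, 3]
identity_before = id(numbers)

numbers[0] = 100        # overwrite an entry in place
numbers.append(4)       # grow the list in place

# The variable still refers to the very same list object
same_object = id(numbers) == identity_before
print(numbers, same_object)

# By contrast, concatenation with + builds a new list object
longer = numbers + [5]
print(longer is numbers)
```

Operations such as item assignment and `append()` change the existing object, while `+` leaves the original untouched and returns a new list.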

### Create a list

Create a list by writing its elements inside [ ] and assigning it to a variable. You can also pass an iterable
(for example a range object or a generator) to the built-in `list()` function, which also yields a list object.

In [None]:
L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
L2 = ['word', 'this is a sentence', 'a', 'b', 'C']

A list can also contain other lists. This is referred to as a nested list.

In [None]:
L4 = [L1, L2]

### Using a range object and the list() function

In [None]:
numbers = range(5)      # a lazy range object, not yet a list
list(numbers)

Note that `range()` creates a range object which is passed to the `list()` function to generate the actual list object.
We will later see that `range()` is particularly helpful in loops.


### Access elements in a list

Elements or groups of elements in a list can be accessed through their index or by using the slice grammar.
The indexing in Python starts at 0!

`my_list[start:stop:step]`

In [None]:
# Selection by index

L1[3]           # 4th element (index 3, counting starts at 0)

In [None]:
L1[-2]          # second-to-last element

In [None]:
# Selection by slice

L1[::3]     # every third

In [None]:
L1[::-2]    # every second reversed
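
The slice syntax `my_list[start:stop:step]` can be sketched on a small, hypothetical list of letters (the list `letters` is an assumption made for illustration). Note that `stop` is exclusive and that omitted parts fall back to their defaults.

```python
# Hypothetical example list, used only for illustration
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

first_three = letters[:3]       # start defaults to 0, stop is exclusive
middle = letters[2:5]           # indices 2, 3 and 4
every_other = letters[1::2]     # start at index 1, then every second element
last_two = letters[-2:]         # negative indices count from the end
reversed_copy = letters[::-1]   # a negative step walks backwards

print(first_three, middle, every_other, last_two, reversed_copy)
```

Every slice returns a new list, so `letters` itself stays unchanged.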

### ***Exercise: Access elements in a list***

In [None]:
# Consider

L1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


# Select elements from L1:
# starting with item 2 up to item 6 in steps of 3


# go backwards in steps of two, starting with the 3rd item from the back


# start with the 5th item from the back and go forward in steps of 3


### Change Elements in a list

As mentioned above, lists are mutable, so their elements can be changed: new entries can be inserted, and
existing entries can be deleted or overwritten.

In [None]:
# Overwriting

L1[1] = 8888                # by index
L1

In [None]:
# Entire slices can be overwritten as well; only for extended slices (step != 1) does the length have to match

L1[:3] = [33, 33, 33]

In [None]:
# Deletion

del L1[1]                       # by index
del L1[4:7]                     # by slicing
L2.remove('word')               # by value, works for any type (strings, numbers, ...)
L1.remove(33)                   # removes only the first matching item if the value occurs multiple times

In [None]:
# Insertion

L1.insert(4, 333)               # insert one element by index, here: index=4, number=333
L1[3:3] = [66, 77, 88]          # insert list elements (not list!) starting at index=3
L1.insert(2, [777, 777])        # inserts the list itself as one nested element, usually not intended

L1

In [None]:
# Append elements to a list

x = 5
L1.append(x)
L1

# This method is commonly used. In most cases append() is used in a for loop. This method is also used with other data
# structures

In [None]:
# Entire list (WRONG WAY)

L1.append(L2)       # append a list and not(!) the elements within the list, for that see concatenate
L1                  # yields a list of elements with a list in the end

In [None]:
# Concatenate lists

L1 + L2  # this is not(!) the same as [L1, L2]
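
The difference between appending, extending, concatenating and nesting can be sketched side by side. The lists `first` and `second` below are hypothetical stand-ins, not the notebook's `L1` and `L2`.

```python
# Hypothetical example lists, used only for illustration
first = [1, 2, 3]
second = ['a', 'b']

combined = first + second       # new flat list, the inputs stay unchanged
nested = [first, second]        # list that contains the two lists

appended = first.copy()
appended.append(second)         # adds the whole list as ONE element

extended = first.copy()
extended.extend(second)         # adds each element of second, in place

print(combined, nested, appended, extended, sep='\n')
```

`extend()` gives the same content as `+`, but modifies the existing list instead of building a new one.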

In [None]:
# We will also learn how to create a list via list comprehension instead of using the following for-loop

L = []

for i in range(10):
    x = i**2
    L.append(x)

L

### Iteration and List comprehension

In [None]:
# classical for-loop

for x in L1:
    print(x)

In [None]:
i = 1

while i < 6:
  print(i)
  i += 1    # equivalent to i = i + 1

### List comprehension

List comprehension offers an elegant way to create a new list based on another list and further conditions.

Syntax:

    [ 'expression' for 'item' in list if 'conditional' ]

Semantically, this means

    [*transform variable* given by the *iteration* with certain *conditions* ]

This is equivalent to

    result = []
    for item in list:
        if conditional:
            result.append(expression)

The purpose of list comprehension is to create a list only(!). It is supposed to make code more readable
and clarify its intention. It is preferable to use a for loop if your list comprehension statement
exceeds two lines.
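
As a small sketch of this equivalence, the following builds the same list twice, once with a for loop and once with a list comprehension. The list `values` and the chosen condition are assumptions made for illustration.

```python
# Hypothetical example list, used only for illustration
values = [1, 2, 3, 4, 5, 6]

# for-loop version: keep the even numbers and double them
doubled_evens_loop = []
for item in values:
    if item % 2 == 0:
        doubled_evens_loop.append(item * 2)

# list comprehension version: expression, iteration, condition
doubled_evens_comp = [item * 2 for item in values if item % 2 == 0]

print(doubled_evens_loop == doubled_evens_comp)
```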

In [None]:
# Define two lists, one with numbers and one with strings, we will use them in future exercises

L_comp = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
L_comp_str = ['aa', 'ab', 'bb', 'bc']

In [None]:
# Print/Return identity

lst_comp_1 = [k for k in L_comp]
lst_comp_1

In [None]:
# Return identity with boundary constraint

lst_comp_3 = [k for k in L_comp if k < 5]
lst_comp_3

In [None]:
lst_comp_str_1 = [any_loop_name for any_loop_name in L_comp_str]
lst_comp_str_1

### ***Exercise: List Comprehension***

In [None]:
# Compute the square of the list and return the elements below 20


In [None]:
# Return the elements of the list that contain an 'a'


### Some Additional Methods

In [None]:
'''
append()    Add an element to the end of the list
extend()    Add all elements of a list to another list
insert()    Insert an item at the defined index
remove()    Removes the first matching item from the list

pop()       Removes and returns the element at the given index (default: the last one)
clear()     Removes all items from the list
index()     Returns the index of the first matched item
count()     Returns the number of occurrences of the item passed as an argument
sort()      Sort items in a list in ascending order (in place)
reverse()   Reverse the order of items in the list (in place)
copy()      Returns a shallow copy of the list
'''
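
A short sketch of a few of these methods in action, using a hypothetical list `fruits` (an assumption for illustration only):

```python
# Hypothetical example list, used only for illustration
fruits = ['pear', 'apple', 'fig', 'apple']

position = fruits.index('apple')        # index of the first match
occurrences = fruits.count('apple')     # how often the value occurs
last = fruits.pop()                     # removes and returns the last element
snapshot = fruits.copy()                # independent shallow copy

fruits.sort()                           # sorts in place, returns None
fruits.reverse()                        # reverses in place, returns None

print(position, occurrences, last, snapshot, fruits)
```

Because `sort()` and `reverse()` work in place, the copy taken beforehand keeps the earlier order.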

<br/>

## Chapter 2: Dictionaries

Similar to lists, dictionaries are mutable data structures, that is, they can be altered. In contrast to list
objects, dictionaries are accessed by key rather than by position and contain so-called key-value pairs
(since Python 3.7 they preserve insertion order). Dictionaries are an essential
building block of the Python programming language and thus highly optimized. We will later see that
dictionaries offer a convenient way to create pandas dataframes with low effort.

### Create a dictionary

Use { } brackets or the `dict()` function to create dictionaries.

In [None]:
# Using {}-brackets

dict_1 = {
    "key_1": [1, 2, 3, 4, 5],
    "key_2": "Mustang",
    "key_3": 1964,
    "key_4": ['a', 'b', 'c']
}


dict_2 = {
    "key_3": [13, 22, 31, 44, 53],
    "key_4": [12, 33],
    "key_5": 1264
}


dict_3 = {
    "key_1": [13, 22, 31, 44, 53],
    "key_3": "Ford",
    "key_5": 1264
}


dict_4 = {
    "key_1": 1,
    "key_3": -12,
    "key_5": 3,
    "key_6": -7,
    "key_7": -3
}

In [None]:
# using dict() constructor

dict_dict = dict(key_1="banana", key_2="grape", key_3=12321)  # key = value

### Access elements in a dictionary

In [None]:
dict_1['key_1']         # by key

In [None]:
dict_1['key_1'][2]      # with nested objects

In [None]:
dict_1.get('key_2')     # by get() method
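
The practical difference between access by key and the `get()` method shows up when a key is missing. A minimal sketch with a hypothetical dictionary `car` (an assumption for illustration):

```python
# Hypothetical example dictionary, used only for illustration
car = {"brand": "Ford", "year": 1964}

year = car["year"]                              # existing key: returns the value
color = car.get("color")                        # missing key: returns None
color_default = car.get("color", "unknown")     # missing key: returns the default

try:
    car["color"]                                # missing key with [ ] raises KeyError
except KeyError as error:
    missing_key = error.args[0]

print(year, color, color_default, missing_key)
```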

In the next section we will explore some properties of dictionaries.

### Add elements

In [None]:
dict_1['new_key'] = [1, 3, 56]  # by specifying key and value

In [None]:
dict_1.update(dict_2)           # Adding a dictionary to another dictionary

dict_1.update(dict_3)           # update() overwrites the values of keys that already exist

### Delete elements

In [None]:
dict_1.pop('key_1')  # removes the key and returns its value

del dict_1['key_3']  # using del operator

### Change elements

In [None]:
# Change value given a key
dict_1['key_2'] = 123

In [None]:
# method 1: change keys by creating a new one and delete the old one
dict_1['new_key'] = dict_1['key_2']
del dict_1['key_2']

In [None]:
# method 2: change keys with pop(); 'key_2' was already removed above, so another key is renamed here
dict_1['renamed_key_4'] = dict_1.pop('key_4')

### Iterate over a dictionary

Iterating over dictionaries follows the same principle as with lists. The difference is that we can
iterate over the keys, the values, or both at once.

In [None]:
# Iterate over the keys, returns keys only

for key in dict_1:
    print(key)

In [None]:
# Iterate over the values, returns values only

for values in dict_1.values():
    print(values)

In [None]:
# Iterate over dictionary items, returns both key and values as a pair

for items in dict_1.items():
    print(items)

In [None]:
# Iterate over dictionary items and unpack each pair into key and value

for key, value in dict_1.items():
    print(key, value)

### Create dictionaries with dict comprehension

In [None]:
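# dict comprehension (CURLY brackets and a key: value pair!)

a = {key: 1 for key in dict_1}

# For comparison: square brackets give a list, curly brackets without "key: value" give a set
b = [k for (k, v) in dict_4.items() if v == 0]                  # list of keys whose value is 0 (empty here)
c_keys = {k for (k, v) in dict_4.items() if k == "key_1"}       # set of matching keys
c_values = {v for (k, v) in dict_4.items() if v < 0}            # set of negative values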

In [None]:
# with a for loop

dicts = {}
keys = range(4)
values = ["Hi", "I", "am", "Dennis"]
for i in keys:
    dicts[i] = values[i]
print(dicts)

### Some Additional Methods

In [None]:
'''
clear()         Removes all the elements from the dictionary
copy()          Returns a shallow copy of the dictionary
fromkeys()      Returns a dictionary with the specified keys and value
get()           Returns the value of the specified key (or a default if the key is missing)
items()         Returns a view of the (key, value) tuples
keys()          Returns a view of the dictionary's keys
pop()           Removes the element with the specified key and returns its value
popitem()       Removes and returns the last inserted key-value pair
setdefault()    Returns the value of the specified key. If the key does not exist: insert the key, with the specified value
update()        Updates the dictionary with the specified key-value pairs
values()        Returns a view of all the values in the dictionary
'''
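
A brief sketch of some of the less obvious methods. The dictionary `template` and its keys are assumptions made for illustration.

```python
# Hypothetical example dictionary, used only for illustration
template = dict.fromkeys(["a", "b"], 0)     # every key gets the same value

existing = template.setdefault("a", 99)     # key exists: value is returned unchanged
created = template.setdefault("c", 99)      # key is missing: it is inserted with 99

last_item = template.popitem()              # removes the most recently inserted pair

keys_view = template.keys()                 # a view, not a copy
template["z"] = 1                           # the view reflects later changes

print(existing, created, last_item, list(keys_view))
```

Since `keys()`, `values()` and `items()` return views, wrap them in `list()` if a fixed snapshot is needed.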


<br/>

## Chapter 3: Pandas

The main object pandas provides is the pandas dataframe. This data structure is one of the core objects in
the data science discipline.

The dataframe object itself is a two-dimensional tabular-styled data structure with
labeled axes (rows and columns). Formally, it has three components: data, rows=index and columns.

The pandas package needs to be installed first and then imported with the `import` statement.

In [1]:
import pandas as pd

### Create a pandas dataframe

**By "hand"**

This is a rather artificial example but offers the chance to inspect the building blocks of
the pandas data structure. As you can see, we are not restricted to a single data type. Be aware that the
number of rows in the data has to match the length of the index. The same goes for the number of values per row
and the list of column names.

In [None]:
# Define the data first which this dataframe should contain, note that the data is not homogeneous
dat_num_1 = [2, 3, 4, 5]
dat_num_2 = [32, 55, 4, 51]
dat_str_1 = ['a', 'b', 'c', 9]
dat_num_3 = [32, 55, 4, 51]
dat_str_2 = ['a', 'b', 'c', 10]
dat_list = [dat_num_1, dat_num_2, dat_str_1, dat_num_3, dat_str_2]


# Define the axis, i.e. columns and rows=index
col_name = ['word', 1234, 'col_label_3', 'col_label_4']
index_name = [1, 2, 'word', 'word 2', 5]


# Create the dataframe with the pd.DataFrame() command
dat_frame = pd.DataFrame(dat_list, columns=col_name, index=index_name)

dat_frame


**From a dictionary**

In Python, it is quite natural to create dataframes from dictionaries. Given a dictionary, pass it to the
`pd.DataFrame()` constructor. By default, the keys are recognized as column names. If no
specific index is given, the rows will simply be enumerated starting from 0.

In [None]:
dict_1 = {
    "key_1": [1, 2, 3, 4, 5],
    "key_2": "Mustang",
    "key_3": 1964
}

pd.DataFrame(dict_1)

In [None]:
# It is also possible to pass the arguments explicitly to the function without creating them beforehand

pd.DataFrame(dict_1, index=[10, 11, 12, 13, 14], columns=['key_1', 'key_3', 'key_2'])

### Access axis elements - Column & Row labels

Accessing dataframes is essential. We can select the content by names or by index/position. Furthermore, we can also
access the axes, i.e. index and column names.

First, let's consider the axes. Every dataframe has the attributes `.columns` and `.index`, each of which returns a
one-dimensional pandas Index object.

In [None]:
dat_frame.columns

In [None]:
dat_frame.index

In [None]:
# Check the datatype of your content by .dtypes

dat_frame.dtypes

### Access elements in a dataframe by key and label


Use the [ ] indexing style to select a column. Another way is attribute access via .column_name,
which only works if the column name is a valid Python identifier.

In [None]:
dat_frame['col_label_3']                                # Select data by column name
dat_frame.col_label_3                                   # Note: Returns column data and row labels


The .loc[] indexer selects data by label. Given a single label, .loc[] selects a row, whereas plain [ ] selects
a column. This matters when a label such as 'word' occurs on both axes.

In [None]:
dat_frame.loc['word']                                   # Select row by label
                                                        # Note: Returns both row data and column labels

In [None]:
dat_frame.loc[:, 'word']                                # select column by label; the explicit row/column form avoids ambiguity


In [None]:
dat_frame[1234]['word']                                 # Select Cell by double key
dat_frame.loc[1, 'col_label_3']                         # Select Cell by loc


In [None]:
dat_frame.loc[[2, 'word'], [1234, 'col_label_4']]       # Slicing by row and column keys
dat_frame.loc[:'word 2', :'col_label_3']
dat_frame.loc[:, :'col_label_3']


### Access by index

Alternatively, data can be selected by index/position. However, you have to pay attention where your data
is located.

In [None]:
dat_frame.iloc[1, 2]

In [None]:
dat_frame.iloc[:, ::2]                                     # same slicing syntax as for lists
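
The contrast between label-based and position-based access can be sketched on a small, hypothetical dataframe (`frame`, its labels and its values are assumptions, not the notebook's `dat_frame`). One detail worth noting: label slices with `.loc` include the end label, position slices with `.iloc` exclude the stop position, just like list slices.

```python
import pandas as pd

# Hypothetical example dataframe, used only for illustration
frame = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["r1", "r2", "r3"],
)

by_label = frame.loc["r2", "y"]             # row label, column label
by_position = frame.iloc[1, 1]              # row position, column position

label_slice = frame.loc["r1":"r2", "x"]     # end label is included
position_slice = frame.iloc[0:1, 0]         # stop position is excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```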


### Manipulating dataframes

**Change axis labels**

There are several ways to change labels. One recommended way is
to use the .rename() method (see the documentation for a detailed description).

In [None]:
dat_frame.rename(columns={'col_label_3': 'new_label_1', 1234: 'new_label_2'})   # returns a new dataframe copy

In [None]:
dat_frame.rename(columns={1234: 'new_label_2'}, inplace=True)                   # overwrites old dataframe
                                                                                # Note: columns = ... ; also takes functions

In [None]:
# Same procedure with rows

dat_frame.rename(index={'word': 'new_index_1', 5: 1234})
dat_frame.rename(index={5: 1234}, inplace=True)

Another way is to use the .set_axis() method in which the axis has to be specified first.

In [None]:
# Rename columns, axis=1 or axis='columns'
dat_frame.set_axis(['a', 'b', 'c', 'd'], axis=1)

In [None]:
dat_frame = dat_frame.set_axis(['aa', 'bb', 'cc', 'dd'], axis='columns')    # reassign; inplace= was removed in pandas 2.0

In [None]:
# Rename rows, axis=0 or axis='index'
dat_frame.set_axis([10, 20, 30, 40, 50], axis=0)

In [None]:
dat_frame = dat_frame.set_axis([100, 200, 300, 400, 500], axis='index')     # reassign; inplace= was removed in pandas 2.0

Often a column needs to be transformed into the index, e.g. a date column, which we will see later. The reverse
direction, copying the index into a regular column, looks like this:

In [None]:
dat_frame['transformed_index'] = dat_frame.index        # the index itself is still in effect
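
Turning a column into the index and back can be sketched with the `set_index()` and `reset_index()` methods. The dataframe `frame` and its column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical example dataframe, used only for illustration
frame = pd.DataFrame({"day": ["mon", "tue", "wed"], "sales": [3, 5, 4]})

indexed = frame.set_index("day")        # 'day' becomes the index, returns a new dataframe
tuesday_sales = indexed.loc["tue", "sales"]

restored = indexed.reset_index()        # the index becomes a regular column again

print(indexed)
print(restored)
```

Both methods return new dataframes here, so `frame` itself keeps its default integer index.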

**Change dataframe values**

In [None]:
# Direct assignment with replacement, if positions are known

dat_frame.loc[2, 1234] = 9999                   # by key

In [None]:
dat_frame.loc[[1, 2], 1234] = [333, 444]

In [None]:
dat_frame.iloc[:, 3] = [1, 2, 3, 4, 5]          # by index

In [None]:
# With the .replace() method, positions are not required to be known, but the values to replace are

dat_frame.replace(5, 'replaced 5')

In [None]:
dat_frame.replace('c', 'replaced c', inplace=True)          # replaces all(!) values that match

In [None]:
dat_frame.replace(['a', 'b', 'c', 'd'], 'all strings')      # replace a set of values with a single expression

In [None]:
dat_frame.replace([4, 4], 5555)                             # doesn't work

In [None]:
# replace a set of values with another set of equal length

dat_frame.replace(to_replace=['a', 'b', 'c', 'd'], value=[111, 222, 333, 444])


**Add columns and rows**

Let's add some columns first

In [None]:
dat_frame['new_unused_column_name'] = [1, 2, 3, 4, 5]       # setting with enlargement, new column is always at the end
                                                            # Note: Overwrites old dataframe

In [None]:
dat_frame.insert(2, 'new_col', [11, 22, 33, 44, 55])        # with .insert(): index position, column name, values
                                                            # Note: Overwrites old dataframe

In [None]:
dat_frame.assign(address=['D', 'B', 'C', 'P', 'H'])         # similar to setting with enlargement
                                                            # Note: Returns a new dataframe with the extra column, the original is unchanged

In [None]:
dat_frame['doubled'] = dat_frame['col_label_4'] + \
                       dat_frame['col_label_4']             # creation by adding two columns


Next, add rows.

In [None]:
# Adding rows by enlargement

new_row_index = len(dat_frame)

dat_frame.loc[new_row_index] = [1, 2, 3, 4, 1]


In [None]:
# Adding rows with pd.concat(); both inputs need to be pandas objects (DataFrame.append() was removed in pandas 2.0)

new_line = pd.Series([1, 22, 4, 5], index=dat_frame.columns)
new_line

In [None]:
pd.concat([dat_frame, new_line.to_frame().T], ignore_index=True)    # row is added at the end, old index is replaced by a new one

**Delete columns and rows**

In [None]:
# Delete column
dat_frame.drop(columns=['col_label_3', 'col_label_4'], inplace=False)       # drop by label
dat_frame.drop(dat_frame.columns[2], axis=1)                                # drop by index

In [None]:
# Delete row
dat_frame.drop(index=['word'], inplace=True)                                # drop by label
dat_frame.drop(dat_frame.index[2])                                          # drop by index

# See: .pop() for other methods
# Note: Dropping columns or rows has the same effect as slicing the dataframe down to the part you need


### Add Data, Merge and Concatenating

Probably one of the most important and ugliest tasks is to merge dataframes and identify differences. These
are the basic functionalities. Details will be explored in the exercises.

We consider two important operations, namely, concatenating and merging.

Use `pd.concat()` if the dataframes are homogeneous, i.e. share the same columns.

In [None]:
df1 = pd.DataFrame({'col1': ['k1', 'k2', 'k3', 'k4'],
                    'col2': [1, 2, 3, 5],
                    'col3': [16, 21, 32, 53]})
df1

In [None]:
df2 = pd.DataFrame({'col1': ['r1', 'r2', 'r3', 'r4'],
                    'col2': [5, 6, 7, 8]})
df2

In [None]:
df3 = pd.DataFrame({'col1': ['r1', 'r2', 'r3', 'r4', 'r5'],
                    'col2': [5, 6, 7, 8, 9],
                    'col3': [12, 34, 56, 78, 8]})
df3

In [None]:
# Enter the dataframes as a list (an iterable to be precise)

In [None]:
pd.concat([df1, df2])

In [None]:
pd.concat([df1, df2], axis=1)

In [None]:
pd.concat([df1, df3], axis=1, join='outer')

In [None]:
pd.concat([df1, df3], axis=0, join='outer')

In [None]:
pd.concat([df1, df3], axis=0, join='inner')
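
The effect of the `join` argument is easiest to see on two tiny dataframes that share only some columns. The dataframes `left` and `right` below are hypothetical and not the notebook's `df1` to `df3`.

```python
import pandas as pd

# Hypothetical example dataframes, used only for illustration
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
right = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# outer: keep the union of all columns, missing cells become NaN
outer_rows = pd.concat([left, right], axis=0, join="outer", ignore_index=True)

# inner: keep only the columns that both dataframes share
inner_rows = pd.concat([left, right], axis=0, join="inner", ignore_index=True)

print(outer_rows)
print(inner_rows)
```

With `ignore_index=True` the rows are renumbered from 0 instead of repeating the original row labels.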

This is sufficient for our purpose for now. There is also the .merge() method, which has more customization options.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
for more details.
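
As a first impression of `.merge()`, here is a minimal sketch that joins two hypothetical dataframes on a shared key column. The names `customers`, `orders` and `customer_id` are assumptions made for illustration; only the `on` and `how` parameters are used.

```python
import pandas as pd

# Hypothetical example dataframes, used only for illustration
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20, 35, 15, 50]})

# inner: only keys that occur in both dataframes
inner = customers.merge(orders, on="customer_id", how="inner")

# left: every customer is kept, customers without orders get NaN
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```

Unlike `pd.concat()`, which stacks dataframes along an axis, `.merge()` matches rows by the values in the key column.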