# Diving into Python types

An introduction to basic types in Python is outside of the scope of this
textbook. If needed, I refer readers to the official Python [tutorial](https://docs.python.org/3/tutorial/).
Instead, this section describes the differences among commonly used types,
and specifically why some types are preferred over others in certain
situations. Coding exercises later in the textbook refer back to this
section.

## Mutable versus Immutable

Two popular ways of storing a sequence of elements in Python are `list` and `tuple`.
Here we discuss the main differences between these types and when to use one over
the other.

The main difference is that lists are mutable and tuples are not.

Elements can be replaced, added, or removed from lists.


In [81]:
my_list = [1, 2, 'a', 'b', 'b', 0]
my_list[0] = '1' # note that '1' here is a string, not a number

However, elements **CANNOT** be replaced, added, or removed from tuples.


In [82]:
my_tup = (1, 2, 'a', 'b', 'b', 0)
my_tup[0] = '1' # raises a TypeError

TypeError: 'tuple' object does not support item assignment

For this reason, tuples lack most of the methods found in lists;
the exceptions are `index` and `count`.
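As a quick illustration of the two methods that tuples do keep, the sketch below calls `count` and `index` on a tuple and then lists the public method names of both types using `dir()`. The variable names are chosen only for this example.

```python
# A tuple keeps only two public methods: count() and index().
my_tup = (1, 2, 'a', 'b', 'b', 0)

b_count = my_tup.count('b')   # how many times 'b' appears
b_index = my_tup.index('b')   # position of the first 'b'

print(b_count)  # 2
print(b_index)  # 3

# Public (non-underscore) method names of each type, for comparison.
tuple_methods = [name for name in dir(tuple) if not name.startswith('_')]
list_methods = [name for name in dir(list) if not name.startswith('_')]

print(tuple_methods)  # ['count', 'index']
print(list_methods)   # includes append, insert, remove, sort, ...
```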

In [83]:
my_list.remove('a') # remove a specific element from the list
my_list.append('c') # append a new element at the end of the list
my_list.insert(1, "n") # insert a new element at index 1, shifting the following elements one position to the right

my_list

['1', 'n', 2, 'b', 'b', 0, 'c']

These methods reflect what a `list` is intended for: storing a sequence of elements that may
be updated by later instructions.

Tuples lack all of these methods. Because a tuple never has to grow or shrink, it can be stored more compactly, which makes it a "lighter" type than a list (in terms of bytes).

In [87]:
my_list = [1, 2, 3, 4, 5, 6]
my_tup = (1, 2, 3, 4, 5, 6)

from sys import getsizeof 
print(f'''
Size in bytes of my_list: {getsizeof(my_list)}
Size in bytes of my_tup: {getsizeof(my_tup)}
''')


Size in bytes of my_list: 104
Size in bytes of my_tup: 88
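To see whether the gap holds for other lengths, the following sketch repeats the measurement for a few sizes. Keep in mind that `getsizeof` reports only the container itself, not the objects it points to, and that the exact byte counts depend on the Python version and platform, so no specific numbers are shown here.

```python
from sys import getsizeof

# Build a list and a tuple holding the same n integers, for several values of n.
sizes = []
for n in (0, 10, 1_000, 100_000):
    as_list = list(range(n))
    as_tuple = tuple(as_list)
    sizes.append((n, getsizeof(as_list), getsizeof(as_tuple)))

for n, list_bytes, tuple_bytes in sizes:
    print(f'{n:>7} elements: list={list_bytes} bytes, tuple={tuple_bytes} bytes')
```

On CPython the difference per container stays small, so the saving matters mostly when a program holds a very large number of small collections.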



This characteristic of tuples, together with their immutability, makes them well suited to other uses.
They work well for maintaining the integrity of a collection of elements that you do not
want to modify, reducing conflicts with other parts of your code.
Because they are "lighter", they can save some memory in big projects (or when many small collections are created),
although the gain per object is small.
Additionally, another important feature of tuples is that they are hashable, provided that all of their elements are hashable.
The following section on hashing, sets, and dictionaries explores this feature.

So if your collection of elements will be static once you create it, it is recommended to
use a `tuple` instead of a `list`.
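One caveat is worth a short sketch: a tuple protects its own slots, but not the objects stored in them. The `settings` tuple below is hypothetical example data; it shows that item assignment fails while a list stored inside the tuple can still be changed in place.

```python
# Hypothetical example data: a tuple that contains a mutable list.
settings = ('localhost', 8080, ['admin', 'guest'])

# Replacing an element of the tuple is not allowed...
try:
    settings[1] = 9090
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print(f'TypeError: {err}')

# ...but a mutable object stored inside the tuple can still change in place.
settings[2].append('root')
print(settings)  # ('localhost', 8080, ['admin', 'guest', 'root'])
```

If the whole collection must stay fixed, it helps to store only immutable elements (numbers, strings, other tuples) inside the tuple.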

**Additional note**: Some implicit declarations in Python can be tricky. 
For example, the following code will result in a tuple of one element 
instead of a variable storing a string.

In [12]:
var = "anything", # note the trailing comma

In [14]:
type(var)

tuple
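The sketch below puts the three spellings side by side. It is the comma, not the parentheses, that creates the tuple; the variable names are chosen only for this example.

```python
with_comma = "anything",     # trailing comma: a one-element tuple
in_parens = ("anything")     # parentheses alone: still a string
explicit = ("anything",)     # the usual way to write a one-element tuple

print(type(with_comma).__name__)  # tuple
print(type(in_parens).__name__)   # str
print(with_comma == explicit)     # True
```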

## Hashing, Sets and Dictionaries

One of Python's strategies for speeding up some operations is hashing the elements of a collection of data. The details of hashing algorithms can be complex, but in general terms hashing is a process that maps a piece of data (a string, an integer, a float, etc.) to a representative integer value, which is then used to determine a position in a hash table. This lets Python compute where a given element should be stored using the same formula every time, instead of traversing the entire collection of data.

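To get the hash of a piece of data in Python, you can use the function `hash()`. String hashes are randomized per session by default, so the value below will differ on your machine.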

In [15]:
hash("anything")

7035250155953424183
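To make the idea of "computing a position from the hash" more concrete, here is a deliberately simplified toy model. `ToyHashSet` is a hypothetical class written for this illustration; it is not how CPython implements `set` or `dict` internally, but it shows why a membership test only needs to inspect one small bucket.

```python
# Toy hash table: a fixed number of buckets, each holding a small list.
# ToyHashSet is a hypothetical name used for illustration only.
class ToyHashSet:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket_for(self, item):
        # hash() maps the item to an integer; the modulo picks a bucket.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item):
        bucket = self._bucket_for(item)
        if item not in bucket:      # ignore duplicates
            bucket.append(item)

    def __contains__(self, item):
        # Only one bucket is searched, not the whole collection.
        return item in self._bucket_for(item)


toy = ToyHashSet()
for word in ('apple', 'banana', 'cherry', 'banana'):
    toy.add(word)

print('banana' in toy)  # True
print('durian' in toy)  # False
```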

Be aware that not every type of data can be hashed. If you try to get the
hash of a list, you will get an error.

In [17]:
hash([1,2,3]) # raises a TypeError

TypeError: unhashable type: 'list'
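A small helper makes it easy to check several types at once. `is_hashable` is a hypothetical function written for this sketch; it simply tries `hash()` and reports whether a `TypeError` was raised.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


samples = [1, 2.5, 'text', (1, 2), [1, 2], {1, 2}, {'a': 1}, (1, [2])]
results = {repr(obj): is_hashable(obj) for obj in samples}

for label, ok in results.items():
    print(f'{label:<10} hashable: {ok}')
```

The immutable built-ins (numbers, strings, tuples of hashable items) pass the check, while lists, sets, dictionaries, and a tuple that contains a list do not.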

Two built-in types in Python are based on hash tables:
`set` and `dict`. Both are declared using `{}` (an empty `{}` creates a
dictionary; use `set()` for an empty set), but they are intended for different purposes.

**Sets** are similar to lists in that they are mutable; however, sets are
unordered and DO NOT store duplicated values.

In [95]:
my_set = {1, 2, 'a', 'b', 'b', 0} # only one 'b' is stored; the display order is arbitrary and not guaranteed
my_set

{0, 1, 2, 'a', 'b'}

Similar to lists, sets have multiple methods for adding or removing elements.

In [96]:
my_set.remove('a')
my_set.add('c')

my_set

{0, 1, 2, 'b', 'c'}

And because they are built on a hash table, sets have handy methods for comparing different sets.

In [101]:
set_1 = {1,2,3,4,5}
set_2 = {1,2,8,4,5}

set_1.difference(set_2) # finds the elements of set_1 that are not in set_2

{3}
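The other common comparisons follow the same pattern. The sketch below uses the operator forms (`&`, `|`, `-`, `^`), which are equivalent to the methods `intersection`, `union`, `difference`, and `symmetric_difference`; results are passed through `sorted()` only so that the printed order is predictable.

```python
set_1 = {1, 2, 3, 4, 5}
set_2 = {1, 2, 8, 4, 5}

common = set_1 & set_2        # intersection: elements in both sets
combined = set_1 | set_2      # union: elements in either set
only_in_1 = set_1 - set_2     # difference, same as set_1.difference(set_2)
only_in_2 = set_2 - set_1     # difference in the other direction
in_one_only = set_1 ^ set_2   # symmetric difference: in exactly one set

print(sorted(common))       # [1, 2, 4, 5]
print(sorted(combined))     # [1, 2, 3, 4, 5, 8]
print(sorted(only_in_1))    # [3]
print(sorted(only_in_2))    # [8]
print(sorted(in_one_only))  # [3, 8]
```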

**Dictionaries** are also a useful type that provides a structured way of storing data (`value`) together with its identifier (`key`). Lookup times for dictionaries are usually similar to those observed in `set` operations.

One way to define a dictionary is to use `{}` to enclose one or more keys and their values. Each key is mapped to its value with a `:`.

The values they can store are highly variable: they can be any type of data in Python, even complex objects or other dictionaries.

In [204]:
my_dict = {
    'key1': 'A',
    '2': 1,
    'three': [0,1,2,3],
    4: 1, # unlike keys, values can be duplicated in a dictionary
    'key5': {"inner_key1": 33, "inner_key2": 66} # a small dict nested inside a dict
}

Like sets, dictionaries do not allow duplicated keys; nevertheless, Python does not raise an error when a key is repeated. Instead, it overwrites the previous value for that key, as shown in the following code.

In [182]:
dup_dict = {
    'key1': (9,5,6),
    'key1': 1,
    'key1': 'Last' # this is the value that remains under key1
}

dup_dict

{'key1': 'Last'}

Keys in a `dict` must be of a hashable type; for that reason, a `list` cannot be a key in a dictionary. However, as mentioned before, tuples are hashable, so a tuple can be a key in a dictionary.

In [196]:
dict_a = {
    (1,2,3): 'Success',
}

dict_a

{(1, 2, 3): 'Success'}

In [197]:
dict_b = {
    [1,2,3]: 'Failure'  # raises a TypeError
}

TypeError: unhashable type: 'list'
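When the data you want to use as a key already lives in a list, one common workaround is to convert it to a tuple first. The sketch below assumes the list contains only hashable elements; the names `coordinates` and `dict_c` are chosen only for this example.

```python
coordinates = [1, 2, 3]   # a list cannot be used as a key directly

# Converting the list to a tuple gives a hashable key with the same contents.
dict_c = {tuple(coordinates): 'Success'}

print(dict_c)             # {(1, 2, 3): 'Success'}
print(dict_c[(1, 2, 3)])  # Success
```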

One of the great strengths of dictionaries is their accessibility (speed and simplicity). To access a given `value` in a dictionary, you just need to look up the `key` that represents it.

In [207]:
my_dict['key1']

'A'

Like the other types discussed in this document, dictionaries have multiple methods for performing common operations. For example, the method `keys()` returns all the keys inside a given `dict`.

In [210]:
my_dict.keys()

dict_keys(['key1', '2', 'three', 4, 'key5'])
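A few other dictionary methods come up constantly. The sketch below uses a small hypothetical `inventory` dictionary to show `values()`, `items()`, and `get()`, the last of which returns a default instead of raising a `KeyError` when the key is missing.

```python
inventory = {'apples': 3, 'pears': 0}   # hypothetical example data

print(list(inventory.values()))   # [3, 0]

# items() yields (key, value) pairs, which is handy in loops.
for key, value in inventory.items():
    print(key, value)

# get() returns None (or a chosen default) when the key is absent.
missing = inventory.get('kiwis')
missing_with_default = inventory.get('kiwis', 0)
print(missing)               # None
print(missing_with_default)  # 0

# Assigning to a new key adds it to the dictionary.
inventory['kiwis'] = 12
print('kiwis' in inventory)  # True
```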

### Performance on hash-based types

As discussed before, hashing can improve the performance of some operations on a collection of data. But how big is the difference? Is it worth it? Let's check how long Python takes to find an element in a long `list`, `tuple`, or `set`.

In [218]:
# create a list 
long_list = list(range(100000)) # another way to create a list: the function list()

# create a tuple 
long_tuple = tuple(long_list) # likewise, tuples can be created with the function tuple()

# create a set
long_set = set(long_list) 

In [219]:
%%timeit ## using this built-in Jupyter magic function we can measure running time
99999 in long_list

519 µs ± 12.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [220]:
%%timeit 
99999 in long_tuple

470 µs ± 6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [221]:
%%timeit 
99999 in long_set

23.7 ns ± 0.0693 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


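Looking up a key in a dictionary achieves running times similar to those of sets. Note that the value searched for, 99999, is the last element, which is the worst case for the `list` and `tuple` scans.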
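The same comparison can be run outside Jupyter with the standard `timeit` module, this time including a dictionary. This is a rough sketch: `time_lookup` is a hypothetical helper, the `lambda` adds a little call overhead to every measurement, and the absolute numbers will vary from machine to machine.

```python
import timeit

long_list = list(range(100000))
long_set = set(long_list)
long_dict = dict.fromkeys(long_list)   # keys 0..99999, every value is None


def time_lookup(container, repeats=200):
    """Average seconds for one membership test of the last element."""
    total = timeit.timeit(lambda: 99999 in container, number=repeats)
    return total / repeats


list_time = time_lookup(long_list)
set_time = time_lookup(long_set)
dict_time = time_lookup(long_dict)

print(f'list: {list_time:.2e} s per lookup')
print(f'set:  {set_time:.2e} s per lookup')
print(f'dict: {dict_time:.2e} s per lookup')
```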

## Arrays and DataFrames (non-native data types)
There are many ways to store data in Python, and some of them come from third-party libraries. This means that they are not available in a plain Python installation. Two of the most popular such data types are arrays and DataFrames, thanks to their efficiency and rich set of methods.

### Arrays
**Arrays** are provided by the library [NumPy](https://numpy.org/). This library adds support for very large, multidimensional arrays, as well as many mathematical functions that operate on this type. This data type is highly efficient in terms of memory and running time. Arrays can vary in their number of dimensions: from a 1D array, similar to a `list`; to a 2D array, similar to a matrix or table; up to an array with many dimensions.

In contrast to the other data types explored in this document, NumPy arrays are homogeneous (e.g. strings and integers cannot be stored as such in the same array), so every element of a given array shares a single type. This does not mean that arrays are restricted to a few kinds of data; on the contrary, NumPy arrays support a very wide variety of data types. As you may expect, some of the handy functions included in this library are limited by the type of data stored in each array.

The following code shows how to create a 1D-array with NumPy.

In [249]:
import numpy as np

my_arr = np.array([1, 2, 3, 4, 5])
my_arr

array([1, 2, 3, 4, 5])

In the previous example, all elements of the input list are integers; if one of the elements has a different type (string, float, etc.), NumPy will try to convert the elements automatically so that the whole array shares a single type.

In [250]:
homogenized_arr = np.array([1, 2, 'string', 4, 5])
homogenized_arr # note that in this array all integers are now strings

array(['1', '2', 'string', '4', '5'], dtype='<U21')

In [251]:
homogenized_arr = np.array([1, 2, 3.0, 4, 5])
homogenized_arr # the same happens here, where NumPy converts all integers into floats

array([1., 2., 3., 4., 5.])
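The type that NumPy settled on can be inspected through the array's `dtype` attribute, and it can also be requested explicitly when the array is created. The sketch below assumes NumPy is installed; the exact integer dtype (for example `int64`) depends on the platform, so only its kind is checked.

```python
import numpy as np

int_arr = np.array([1, 2, 3])
float_arr = np.array([1, 2, 3.0])
str_arr = np.array([1, 2, 'string'])

print(int_arr.dtype)       # an integer dtype; the exact width is platform dependent
print(float_arr.dtype)     # float64
print(str_arr.dtype.kind)  # U (a Unicode string dtype)

# The dtype can also be chosen explicitly instead of being inferred.
explicit = np.array([1, 2, 3], dtype=np.float64)
print(explicit)            # [1. 2. 3.]
```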

As mentioned above, arrays can be multidimensional. Here is how a 2D array can be declared.

In [279]:
my_2darr = np.array([[1, 2],
                     [1, 3],
                     [1, 1]])
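Once a 2D array exists, its dimensions and individual elements can be inspected as in the sketch below, which assumes NumPy is installed and rebuilds the same array so that it can run on its own.

```python
import numpy as np

my_2darr = np.array([[1, 2],
                     [1, 3],
                     [1, 1]])

print(my_2darr.ndim)    # 2: the number of dimensions
print(my_2darr.shape)   # (3, 2): three rows, two columns

# Indexing takes one position per dimension: [row, column].
print(my_2darr[1, 1])   # 3

# A colon selects everything along that dimension.
second_column = my_2darr[:, 1]
print(second_column)    # [2 3 1]
```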

The functions included in this library make it possible to perform some operations with very little code, for example:

In [283]:
original_arr = np.array([[0, 0, 0],
                         [2, 0, 3]])

original_arr + 100 # adds 100 to every value stored in this 2D array (no explicit loop needed)

array([[100, 100, 100],
       [102, 100, 103]])

Some of those functions operate efficiently on whole matrices, for example by adding two arrays element by element.

In [284]:
arr1 = np.array([[0, 0, 1],
                 [3, 1, 2],
                 [1, 0, 1]])
arr2 = np.array([[1, 1, 1],
                 [1, 3, 2],
                 [0, 1, 1]])

np.add(arr1, arr2)

array([[1, 1, 2],
       [4, 4, 4],
       [1, 1, 2]])
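It may help to contrast element-wise operations with a true matrix product. In the sketch below, which assumes NumPy is installed and reuses the same two arrays, `*` multiplies matching positions, `@` performs matrix multiplication, and `.T` returns the transpose.

```python
import numpy as np

arr1 = np.array([[0, 0, 1],
                 [3, 1, 2],
                 [1, 0, 1]])
arr2 = np.array([[1, 1, 1],
                 [1, 3, 2],
                 [0, 1, 1]])

elementwise = arr1 * arr2      # multiplies values at matching positions
matrix_product = arr1 @ arr2   # rows of arr1 combined with columns of arr2
transposed = arr1.T            # rows become columns

print(elementwise)
# [[0 0 1]
#  [3 3 4]
#  [0 0 1]]
print(matrix_product)
# [[0 1 1]
#  [4 8 7]
#  [1 2 2]]
print(transposed)
# [[0 3 1]
#  [0 1 0]
#  [1 2 1]]
```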

There is a vast number of functions in this library. Understanding all of them is outside the scope of this course, but it is helpful to be familiar with the documentation to know the potential of this data type.

**Additional note**: Python also has a native array data type, found in the [module `array`](https://docs.python.org/3.8/library/array.html). This data type is intended to compactly store arrays of numeric values, all encoded with a single type code chosen when the array is created. It shares some attributes with NumPy arrays but lacks some of the functionality found in the latter.

The word "array" can be tricky, not only in Python but also in other programming languages. In some languages arrays are simple lists, while in others they share some of the characteristics described above for NumPy arrays.

The following is an example of an array definition using the native Python module.

In [237]:
import array
native_array = array.array('d', [1.0, 2.0, 3.14]) # 'd' is the type code of the values, in this case double-precision float
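The native array behaves much like a list whose elements must all match the type code. The following sketch shows appending a value, reading the type code and item size, and what happens when a value of the wrong type is added; the `rejected` flag is a name used only for this example.

```python
import array

native_array = array.array('d', [1.0, 2.0, 3.14])

native_array.append(4.5)       # same method name as on a list
print(native_array.typecode)   # d
print(native_array.itemsize)   # bytes used per element (8 for 'd' on typical platforms)
print(len(native_array))       # 4

# Values that do not match the type code are rejected.
try:
    native_array.append('text')
    rejected = False
except TypeError as err:
    rejected = True
    print(f'TypeError: {err}')
```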

### DataFrames
Similar to the previous data type, **DataFrames** are a specific type of data included in the library [Pandas](https://pandas.pydata.org/). This library provides not only a powerful and highly performant data structure, but also useful functions for manipulating numerical tables and data series. Tables are composed of columns and rows, and every cell is indexed.

DataFrames have some benefits over the other data types described in this document. Their structure allows easy interpretation and reduces abstraction by allowing the axes (rows and columns) to be labeled.

There are multiple ways to create a DataFrame. One of the most common is to use a `dict` in which each `key` is a column name and its `value` is a list holding the data for the cells of that column.

In [297]:
import pandas as pd
dict1 = {
    'Column+one': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Column_2': ['s', 't', 'r', 'i', 'n', 'g'],
    'Column*3': [1, 2, 3, 4, 4, 3],
}
  
my_df = pd.DataFrame(dict1) # creates the dataframe from our dictionary

my_df

  Column+one Column_2  Column*3
0          A        s         1
1          B        t         2
2          C        r         3
3          D        i         4
4          E        n         4
5          F        g         3
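Because the axes are labeled, data can be retrieved by name rather than by position. The sketch below assumes Pandas is installed and rebuilds the same DataFrame so that it can run on its own; it selects a column, filters rows with a condition, and reads a single cell with `.loc`.

```python
import pandas as pd

my_df = pd.DataFrame({
    'Column+one': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Column_2': ['s', 't', 'r', 'i', 'n', 'g'],
    'Column*3': [1, 2, 3, 4, 4, 3],
})

print(my_df.shape)                  # (6, 3): six rows, three columns
print(my_df['Column_2'].tolist())   # ['s', 't', 'r', 'i', 'n', 'g']

# Boolean filtering: keep only rows whose 'Column*3' value is at least 4.
high_rows = my_df[my_df['Column*3'] >= 4]
print(high_rows['Column+one'].tolist())   # ['D', 'E']

# .loc selects by row label and column label.
cell = my_df.loc[2, 'Column+one']
print(cell)                         # C
```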


Jupyter notebooks render DataFrames in a clear and appealing graphical form, which reduces the abstraction needed to work with tables. A DataFrame can also be created from a list of lists, and DataFrames can be combined with `pd.concat`.

In [305]:
list_of_lists = [['E','n', 4],['F', 'g', 3]]

my_df2 = pd.DataFrame(list_of_lists, columns=['Column+one', 'Column_2', 'Column*3']) # creates a dataframe from a list of lists

pd.concat([my_df, my_df2]) # concatenates both dataframes

  Column+one Column_2  Column*3
0          A        s         1
1          B        t         2
2          C        r         3
3          D        i         4
4          E        n         4
5          F        g         3
0          E        n         4
1          F        g         3
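Notice that the concatenated result keeps the original row labels, so `0` and `1` appear twice. If a fresh, continuous index is preferred, `pd.concat` accepts `ignore_index=True`. The sketch below assumes Pandas is installed and uses two small hypothetical DataFrames, `first` and `second`, to keep the example short.

```python
import pandas as pd

# Hypothetical example data, kept small on purpose.
first = pd.DataFrame({'letter': ['A', 'B'], 'number': [1, 2]})
second = pd.DataFrame([['C', 3]], columns=['letter', 'number'])

kept = pd.concat([first, second])                           # original labels are kept
renumbered = pd.concat([first, second], ignore_index=True)  # labels are rebuilt from 0

print(kept.index.tolist())          # [0, 1, 0]
print(renumbered.index.tolist())    # [0, 1, 2]
print(renumbered['letter'].tolist())  # ['A', 'B', 'C']
```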


Like NumPy, the Pandas library includes a multitude of functions, methods, and properties for managing data very efficiently. However, exploring all of them is beyond the scope of this unit.