# Diving into Python types

A complete introduction to the basic types in Python is outside the scope of this textbook. If needed, I refer readers to the [official Python tutorial](https://docs.python.org/3/tutorial/), or to this great [tutorial by Jake Vanderplas](https://jakevdp.github.io/WhirlwindTourOfPython/03-semantics-variables.html).
Instead, this section focuses on the differences among commonly used types,
and specifically on why some types are preferred over others in certain
situations. Later coding exercises in the textbook refer back to this section.

### Standard types

What do we mean by *types*?
As a programming language, Python contains a set of core building blocks that are used to store and modify information to accomplish an amazing array of possible computational tasks. The most fundamental of these building blocks are the Python types. These are *objects* that can be created, with specific attributes and methods. You have seen several of these already. They include objects for storing boolean and numeric data, such as `bool`, `int` and `float`; objects for storing text and binary data, such as `str` and `bytes`; objects for storing collections of objects, such as `list` and `tuple`; and hashed collections (more on this later), such as `set` and `dict`.

Each of these can be created by calling the type name as a constructor function, which returns an object that can then be stored in a variable. There is also a shortcut (literal) syntax for creating each of these types, such as the use of square brackets to create a list. A few examples of creating objects using either approach are shown below.

In [None]:
x = int(1)           # x is an integer
x = str("hello")     # x is a string
x = list([1, 2, 3])  # x is a list

In [None]:
x = 1                # x is an integer
x = "hello"          # x is a string
x = [1, 2, 3]        # x is a list

You can use the built-in function `type` to query an object's type. This will return one of the above types if the object is one of the core Python types. The example below shows that this object is of the class `int`.

In [None]:
# create an int object and return the type
x = 3
print(type(x))
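
When code needs to act on an object's type rather than just print it, the built-in `isinstance` function is often the more flexible check. The short sketch below compares the two approaches; the variable names are only illustrative.

```python
# compare type() with isinstance() for checking an object's type
x = 3
y = True

print(type(x) is int)                # True: x is exactly an int
print(isinstance(x, (int, float)))   # True: x is an int or a float

# bool is a subclass of int, so isinstance() accepts it but type() does not
print(type(y) is int)                # False
print(isinstance(y, int))            # True
```

Because `isinstance` also accepts subclasses, it is usually the safer choice when a function should work with "any kind of integer" rather than with one exact class.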

The `type` function can also return the type of custom objects. Everything in Python is an object, and in addition to the core object types, you can also create custom *classes*, which represent new object types. These are created in Python using the `class` statement (note the lowercase `c`). A simple example is below; we will discuss the creation of custom class objects much more in a later tutorial.

In [None]:
# create a custom class object and return the type
class CustomInt:
    def __init__(self, value):
        self.value = value
    
x = CustomInt(3)
print(type(x))

I named this class `CustomInt` and it stores an integer. If we added more code to this class we could define additional methods and attributes that operate on the integer it stores. In this way, you can see that my custom class uses and relies on the existence of the `int` class. Similarly, you might be surprised to learn that, by default, instances of custom classes store their attributes in a `dict` object, and so do classes and modules. This is the basis for a saying you may hear, **everything in Python is a dictionary**. The saying is an exaggeration, since in the standard CPython interpreter core types such as `str` and `list` are implemented in C and do not store their data in a `dict`, but it captures how much of Python's object machinery is built on this one type. Pretty cool. We'll return to this later as we delve deeper into the design and construction of classes.
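
As a small illustration of how a custom object stores its attributes, the sketch below redefines the `CustomInt` class from above and inspects it with the built-in `vars` function, which returns the attribute dictionary of an object that has one. The extra `label` attribute is a hypothetical addition made only for this example.

```python
class CustomInt:
    def __init__(self, value):
        self.value = value

x = CustomInt(3)

# the instance keeps its attributes in a dict
print(vars(x))                          # {'value': 3}

# the class keeps its methods in a dict-like namespace as well
print('__init__' in vars(CustomInt))    # True

# adding an attribute adds a key to the instance's dict
x.label = "three"
print(vars(x))                          # {'value': 3, 'label': 'three'}
```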

### Mutable versus Immutable

An important concept in Python that differentiates several of the core object types, and which can also be applied to custom classes, is whether or not the object is *mutable*. If an object is mutable, its values can be changed; if it is immutable, they cannot.

Let's look at a simple example. Two popular ways of storing a collection of objects in Python are the `list` and the `tuple`. The main difference is that lists are mutable and tuples are not. This means that elements can be replaced, added, or removed from a list, and these operations modify the same list in place. By contrast, the objects in a tuple are fixed, and can only be changed by creating a new tuple with a different collection of objects.

In [None]:
# create a list, mutate the first element, and print.
my_list = [1, 2, 'a', 'b', 'b', 0]
my_list[0] = 'x' # replace the first element (the integer 1) with the string 'x'
print(my_list)

However, elements **CANNOT** be replaced, added, or removed from tuples. The example below will raise `TypeError`.


In [None]:
my_tup = (1, 2, 'a', 'b', 'b', 0)
my_tup[0] = '1' # raises a TypeError
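
The sketch below shows one way to work around this: catch the error to confirm that the tuple is untouched, then build a *new* tuple from pieces of the old one. The name `new_tup` is introduced here only for illustration.

```python
my_tup = (1, 2, 'a', 'b', 'b', 0)

# item assignment fails, and the original tuple is left unchanged
try:
    my_tup[0] = '1'
except TypeError as err:
    print(f"TypeError: {err}")

# to "change" a tuple, build a new one from pieces of the old one
new_tup = ('1',) + my_tup[1:]
print(new_tup)
print(my_tup)
```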

For this reason, tuples lack most of the methods found on lists; the only ones they keep are `index` and `count`. Using tab-completion in a notebook you can see that list objects have many more methods available to them than tuples do, because there isn't all that much you can do with immutable objects. Many of the methods available to a `list` are intended for collections whose elements may be updated by later instructions.

In [None]:
my_list = [1, 2, 'a', 'b', 'b', 0]
my_list.remove('a') # remove a specific element from the list
my_list.append('c') # append a new element at the end of the list
my_list.insert(1, "n") # insert a new element at index 1, shifting later elements one position to the right
print(my_list)

So why do tuples even exist? There are several reasons. First off, tuples are a "lighter" type than lists (in terms of bytes). Because a tuple's length is fixed when it is created, it does not need the extra bookkeeping (and sometimes spare capacity) that a list keeps so that it can grow and shrink. Below we use a function from the Python standard library module `sys` to measure the size of each object in memory, in bytes.

In [None]:
from sys import getsizeof 

my_list = [1, 2, 3, 4, 5, 6]
my_tup = (1, 2, 3, 4, 5, 6)

print(f"Size of my_list: {getsizeof(my_list)} bytes")
print(f"Size of my_tup: {getsizeof(my_tup)} bytes")

Their immutability makes tuples suitable for other uses as well.
They work well for protecting the integrity of a collection of elements that you do not
want to modify, reducing conflicts with other parts of your code. This is particularly beneficial if you are designing a custom class object and storing data in an attribute that you *do not want to allow users to modify*; a tuple makes in-place changes to that collection impossible. Because they are lighter, tuples can also save some memory, and occasionally some running time, in big projects or with big collections.
Finally, another important feature of tuples is that they are *hashable* (provided that the elements they contain are hashable too). This feature will be explored in the following section about hashing, sets, and dictionaries.
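
To make the point about class attributes concrete, here is a minimal sketch using a hypothetical `Sample` class (not part of any library) that converts its input to a tuple so that the stored collection cannot be edited in place.

```python
class Sample:
    """Hypothetical class that stores a fixed collection of labels."""
    def __init__(self, labels):
        # convert to a tuple so the stored collection cannot be edited in place
        self.labels = tuple(labels)

s = Sample(["a", "b", "c"])

try:
    s.labels[0] = "z"
except TypeError:
    print("labels cannot be modified in place")

print(s.labels)
```

Note that this only prevents in-place edits: a user could still assign a completely new object to `s.labels`. Preventing that requires other tools, such as properties, which belong to the later discussion of classes.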

In summary, if your list of elements will be static once you create it, it is recommended to use `tuple` instead of a `list`.

**Additional note**: Some implicit declarations in Python can be tricky. 
For example, the following code will result in a tuple of one element 
instead of a variable storing a string.

In [None]:
var = "anything", # note the trailing comma
print(type(var))

## Hashing, Sets and Dictionaries

Following from our earlier anecdote that "everything in Python is a dictionary", you might expect that dictionaries are a particularly powerful and useful object type, and that is definitely true. And the reason for this is *hashing*. 

Hashing is a process by which an object (in practice, an immutable one) is converted to a number, its *hash*. A hash table uses that number to decide where in memory to store the item, so it can later jump almost directly to that location instead of searching through every item. Once an object is represented by a number, we can quickly ask whether it exists in the table, and we can quickly perform several types of operations, such as asking whether it matches another object. In this way hash tables turn a complex question (is this object the same as that object?) into a simple one (is this number the same as that number?), falling back on a full comparison only when the numbers match. This simplification comes with enormous speed benefits.


In [None]:
# an example, the hash function returns a number for a string object
hash("anything")
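
The property that makes hashing useful is that objects which compare as equal always produce the same hash. The short sketch below checks this for a few built-in types.

```python
# equal objects always have equal hashes
print(hash((1, 2, 3)) == hash((1, 2, 3)))   # True
print(hash(1) == hash(1.0))                 # True, because 1 == 1.0

# hash() always returns an int
print(isinstance(hash("anything"), int))    # True
```

The actual number returned for a string usually differs from one Python session to the next, because Python randomizes string hashes at startup by default. Within a single session the value is stable, which is all a hash table needs.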


Two core object types in Python that use hashing are `set` and `dict`. Both are written using curly brackets (note that an empty `{}` creates a `dict`; use `set()` to create an empty set), but they are intended for different purposes. A `set` stores objects which are hashed, whereas a `dict` stores *keys* which are hashed, and which act as pointers to other objects, termed *values*.

Fully understanding hash algorithms is complex. But understanding that operations that use hash algorithms are fast is an important lesson. In general, when you need to repeatedly look up, add, remove, or compare elements of a collection, `dict` and `set` objects, which use hashing, are usually the fastest option.


You should be aware that not all types of data can be hashed, and this limits what type of data can be used inside of `set` or `dict` objects.
Specifically, the built-in mutable objects (such as `list`, `dict`, and `set`), and objects that contain them, cannot be hashed (trying will raise a `TypeError`). This means these objects cannot be stored inside a `set`, and they cannot be used as keys of a `dict`.

In [None]:
# trying to hash a mutable object (a list) raises a TypeError
hash([1, 2, 3])
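
The sketch below wraps this check in a small helper. The function `is_hashable` is a hypothetical name defined here for illustration; it simply reports whether `hash` succeeds on an object.

```python
def is_hashable(obj):
    """Hypothetical helper: return True if hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable((1, 2, 3)))               # True: a tuple of ints
print(is_hashable([1, 2, 3]))               # False: lists are mutable
print(is_hashable((1, [2, 3])))             # False: a tuple that contains a list
print(is_hashable(frozenset([1, 2, 3])))    # True: frozenset is an immutable set
```

The built-in `frozenset` type is worth remembering: it plays the same role for sets that `tuple` plays for lists.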

### Dictionaries

Dictionaries are a useful type that provides a structured way of storing data (a `value`) by associating it with an identifier (a `key`), which must be hashable. Searching for a key in a dictionary is one of the fastest things you can do in Python, because it is simply a query to a hash table (see the performance section below).

One of the ways to define a dictionary is using `{}` to enclose one or more keys and their values. Each key is paired with its value using a `:` separator, and pairs are separated by commas. You can also create a dictionary using the `dict` function by entering keys as keyword arguments assigned to values (in this form the keys are always strings). Any object can be stored as a value in a dictionary, but keys must be hashable.

In [None]:
# using curly brackets to create a dict
my_dict = {'a': 1, 'b': 2}
print(my_dict)

In [None]:
# using the `dict` function to create a dict
my_dict = dict(a=1, b=2)
print(my_dict)

In [None]:
# many different types of values
my_dict = {
    'key1': 'A',
    '2': 1,
    'three': [0,1,2,3],
    4: 1, # unlike keys, values can be duplicated in a dictionary
    'key5': {"inner_key1": 33, "inner_key2": 66} # little dict inside a dict
}

Like sets, dictionaries do not allow duplicated keys. However, Python does not raise an error if you repeat a key. Instead, each repeated key overwrites the previous value stored under that key, as shown in the following code.

In [None]:
my_dict = {
    'key1': (9,5,6),
    'key1': 1,
    'key1': 'Last' # the last value assigned to 'key1' is the one that is kept
}

my_dict

Keys in a `dict` must be a hashable type; for that reason, a `list` cannot be a key in a dictionary (it can be a value though). However, as mentioned before, tuples are hashable (as long as their elements are), and so tuples are the best replacement when you wish to use a collection as a key in a dictionary.

In [None]:
dict_a = {
    (1,2,3): 'Success',
}

dict_a

In [None]:
dict_b = {
    [1,2,3]: 'Failure'  # raises a TypeError
}

One of the great strengths of dictionaries is how quickly and simply their contents can be accessed. To access a given `value` in a dictionary, you can index the dictionary with its `key` using square brackets, like below. Another option is to use the `.get` method of dictionaries.

In [None]:
# get value by indexing with key
my_dict['key1']

In [None]:
# get value by querying with .get function
my_dict.get("key1")
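
The two approaches differ in how they handle a key that is not in the dictionary, which is often the deciding factor between them. The sketch below uses a small dictionary named `example_dict`, introduced only for this illustration.

```python
example_dict = {'key1': 'Last'}

# .get returns None, or a default that you supply, when the key is missing
print(example_dict.get('key1'))         # Last
print(example_dict.get('missing'))      # None
print(example_dict.get('missing', 0))   # 0

# indexing with a missing key raises a KeyError instead
try:
    example_dict['missing']
except KeyError as err:
    print(f"KeyError: {err}")
```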

Like the other types discussed in this document, dictionaries have several other methods. For example, the method `.keys()` returns all the keys inside a given `dict`, and similarly, `.items()` returns tuples of (key, value) pairs.

In [None]:
my_dict.keys()

In [None]:
my_dict.items()

### Sets
Sets are similar to lists, and are mutable; however, sets are
unordered, and DO NOT store duplicated values. Similar to dictionaries, sets store data using a hash table, where each item is located by its hash number. Unlike a dictionary, the items in a set (similar to the keys of a dict) are not used to fetch other items. Instead, the main purpose of sets is for *comparing* collections of objects.

Let's think again about hashing. Each item is mapped to a number (its hash) in a hash table. This makes sets a very efficient tool for *comparing* collections, by asking whether their items are identical, whether one is a subset of the other, or which items are unique to one of them. Each of these queries mostly involves getting the hash numbers and then asking whether the numbers are the same or not.

Sets can be created using curly brackets or the `set` function. A key feature is that they only store unique objects; duplicates are discarded.

In [None]:
# create a set using curly brackets or set()
my_set = {1, 2, 3, 4}
my_set = set([1, 2, 3, 4])

In [None]:
# sets store only unique values (only one 'b' will be stored)
my_set = {1, 2, 'a', 'b', 'b', 0}
my_set

The order of objects in a set is arbitrary. Small sets of integers often print in what looks like sorted order, but this is not guaranteed and should not be relied on. Comparison methods on sets ask about the overlap in their contents, and not their order. The example below reports that the two sets contain the same items, and so the `==` operator returns True. Set objects include several methods for performing comparisons, and similarly Python operators (e.g., `-`, `|`) can be used to compare sets in a syntax similar to mathematical notation.

In [None]:
# are they equal? The order of objects in a set does not matter.
{"a", "b"} == {"b", "a"}

In [None]:
# is one subset of the other?
my_set1 = {"a", "b"}
my_set2 = {"b", "a"}
my_set1.issubset(my_set2)

In [None]:
my_set1 = {"a", "b", "c"}
my_set2 = {"c", "d"}

# which items are unique to set1
print(my_set1 - my_set2)
print(my_set1.difference(my_set2))

In [None]:
my_set1 = {"a", "b", "c"}
my_set2 = {"c", "d"}

# which items are not shared
print(my_set1 ^ my_set2)
print(my_set1.symmetric_difference(my_set2))
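
For completeness, the sketch below shows the remaining common set operations, union and intersection, using the same two sets. These operators and methods are part of the built-in `set` type.

```python
my_set1 = {"a", "b", "c"}
my_set2 = {"c", "d"}

# which items appear in either set (union)
print(my_set1 | my_set2)
print(my_set1.union(my_set2))

# which items appear in both sets (intersection)
print(my_set1 & my_set2)
print(my_set1.intersection(my_set2))

# do the sets have no items in common?
print(my_set1.isdisjoint(my_set2))   # False, because both contain "c"
```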

Similar to lists, sets have multiple methods for adding or removing elements. Because sets are unordered, there is no method for replacing the element at a given position.

In [None]:
my_set = {"a", "b", "c"}
my_set.remove('a') # remove an element from the set
my_set.add('c') # 'c' is already in the set, so this has no effect
my_set
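
A few other set methods are worth knowing when removing or adding elements. The sketch below, which uses a set named `letters` purely for illustration, contrasts `remove` with `discard` and adds several elements at once with `update`.

```python
letters = {"a", "b", "c"}

# discard() silently ignores elements that are not in the set
letters.discard("z")

# remove() raises a KeyError for elements that are not in the set
try:
    letters.remove("z")
except KeyError:
    print("'z' is not in the set")

# update() adds every element of another collection, skipping duplicates
letters.update(["c", "d", "e"])
print(letters == {"a", "b", "c", "d", "e"})   # True
```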

### Performance of hashing

As discussed before, hashing is *fast*. As a demonstration, let's compare how long Python takes to find an element in a long `list`, `tuple`, `set`, or `dict`. Below we create an object containing 100K integers for each of these object types.

In [None]:
# create a list 
long_list = list(range(100000)) # this is a different way to declare a list, using the function list()

# create a tuple 
long_tuple = tuple(long_list) # same as list, tuples can be declared with the function tuple()

# create a set
long_set = set(long_list) 

# create a dict
long_dict = dict(zip(range(100000), range(100000)))

To measure performance here we use a feature of Jupyter/IPython called a *magic function*. The `%%timeit` header in the cells below tells each cell to run many times and to report back how long it took to run each iteration on average. This is an easy way to compare the speed of different operations, and to find an optimal solution.

The operation we perform in each of the cells below is to ask `X in Y`: is this object in that collection? This will return True or False. In all cases the answer is True, because each of these objects contains the integer. However, the cells below will report the time it takes to perform the operation, rather than the answer to the query.

As you will see when you run the cells below, the queries on `set` and `dict` take a similar amount of time, and both are far faster than the same operation on a `list` or `tuple` (the exact ratio depends on your machine and on where the element sits in the collection). Furthermore, the speed of the hashed lookup stays roughly constant *no matter how large the dictionary or set is*, whereas the lookup in a list or tuple scans the elements one by one and so becomes slower as the collection grows. This demonstrates the utility of using hashed object types.

In [None]:
%%timeit
99999 in long_list

In [None]:
%%timeit 
99999 in long_tuple

In [None]:
%%timeit 
99999 in long_set

In [None]:
%%timeit 
99999 in long_dict
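
The `%%timeit` magic is only available inside Jupyter/IPython. As a rough sketch of how the same comparison could be run in a plain Python script, the code below uses the standard library `timeit` module. The variable names and the choice of 100 repetitions are assumptions made for this example, and the timings you see will depend on your machine.

```python
import timeit

n = 100_000
collections_to_test = {
    "list": list(range(n)),
    "tuple": tuple(range(n)),
    "set": set(range(n)),
    "dict": dict(zip(range(n), range(n))),
}

timings = {}
for name, collection in collections_to_test.items():
    # time 100 membership tests for the last element that was added
    timings[name] = timeit.timeit(lambda: (n - 1) in collection, number=100)
    print(f"{name:>5}: {timings[name]:.6f} seconds for 100 lookups")
```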

## Arrays and DataFrames (non-native data types)
There are many ways to store data in Python, and some of them come from third-party libraries. This means that they are not available in a plain Python installation and must be installed separately. Due to their efficiency and the variety of methods they include, two of the most popular data types are arrays and DataFrames, which are provided, respectively, by two of the most popular third-party libraries for doing data science in Python: NumPy and pandas.

### Arrays
NumPy arrays (also referred to as ndarrays) are provided by the library [NumPy](https://numpy.org/). This library adds support for very large, multidimensional arrays, along with many mathematical functions that operate on them. This data type is highly efficient in terms of memory and running time. Arrays can vary in their number of dimensions, from a 1D array, similar to a `list`; to a 2D array, similar to a matrix or table; to arrays with many more dimensions.

In contrast to the other data types explored in this document, NumPy arrays are homogeneous: every element of an array has the same data type (e.g., strings and integers cannot be mixed in the same array). This does not mean that arrays are limited to a few kinds of data. On the contrary, NumPy supports a wide variety of data types. As you may expect, some of the handy functions included in this library only work with certain data types.

The following code shows how to create a 1D-array with NumPy.

In [None]:
import numpy as np

my_arr = np.array([1, 2, 3, 4, 5])
my_arr

In our previous example, all elements of the input list are integers. If one of the elements is of a different type (string, float, etc.), NumPy will automatically convert the elements to a common type so that the array remains homogeneous.

In [None]:
homogenized_arr = np.array([1, 2, 'string', 4, 5])
homogenized_arr # note in this array, all integers are now strings

In [None]:
homogenized_arr = np.array([1, 2, 3.2, 4, 5])
homogenized_arr # the same happens here, where NumPy converts all integers into floats
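
One way to see which common type NumPy chose is to inspect the array's `dtype` attribute. The sketch below repeats the three arrays from above under new, illustrative names. The exact integer and string dtypes that are printed can vary by platform, so the comments describe only the general kind of each.

```python
import numpy as np

int_arr = np.array([1, 2, 3, 4, 5])
str_arr = np.array([1, 2, 'string', 4, 5])
float_arr = np.array([1, 2, 3.2, 4, 5])

print(int_arr.dtype)     # an integer dtype
print(str_arr.dtype)     # a unicode string dtype
print(float_arr.dtype)   # float64

# the integer 1 was converted to the string '1' in the mixed array
print(repr(str_arr[0]))
```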

As we mentioned above, arrays can be multidimensional. Below is how a 2D-array can be declared. The exact usage of indentation can vary depending on your preference, and is a stylistic choice that does not affect the outcome.

In [None]:
# align indentation at function opening
my_2darr = np.array([[1, 2],
                     [1, 3],
                     [1, 1]])

In [None]:
# or, align indentation using 1 indent
my_2darr = np.array([
    [1, 2],
    [1, 3],
    [1, 1],
])
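
Once a 2D array exists, its dimensions can be inspected and its cells accessed by row and column position. The sketch below uses the same array as above and shows a few of the most common attributes and indexing forms.

```python
import numpy as np

my_2darr = np.array([
    [1, 2],
    [1, 3],
    [1, 1],
])

print(my_2darr.ndim)     # 2: the number of dimensions
print(my_2darr.shape)    # (3, 2): three rows and two columns

print(my_2darr[1, 1])    # 3: the cell in row 1, column 1 (counting from 0)
print(my_2darr[0])       # [1 2]: the entire first row
print(my_2darr[:, 1])    # [2 3 1]: the entire second column
```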

Functions included in this library are very powerful and can perform operations on individual cells of an array, over entire rows or columns, or over every element in an array. Operations that affect many cells at once are very efficient because internally they run code written in compiled languages (C or Fortran). Thus, in contrast to many standard Python coding routines, such as for-loops, operating over many elements of a NumPy array can be done at the much faster speed of compiled code. This style of whole-array operation is called *vectorization*. A closely related feature is *broadcasting*: the set of rules by which NumPy combines arrays of different shapes, for example by applying a single number to every cell of an array.

In [None]:
original_arr = np.array([[0, 0, 0],
                         [2, 0, 3]])

print(original_arr + 100)       # add 100 to every cell of the 2D-array using broadcasting
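
The example above adds a single number to every cell. As a further sketch, the same mechanism can stretch a one-dimensional array across the rows of a two-dimensional one. The names `row`, `broadcast_sum`, and `loop_sum` are introduced here for illustration, and the explicit loops are included only to show what the one-line version is doing.

```python
import numpy as np

original_arr = np.array([[0, 0, 0],
                         [2, 0, 3]])
row = np.array([10, 20, 30])

# the 1D row is added to each row of the 2D array
broadcast_sum = original_arr + row
print(broadcast_sum)

# the same result computed with explicit Python loops
loop_sum = np.zeros_like(original_arr)
for i in range(original_arr.shape[0]):
    for j in range(original_arr.shape[1]):
        loop_sum[i, j] = original_arr[i, j] + row[j]

print(np.array_equal(broadcast_sum, loop_sum))   # True
```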

Some of those functions handle matrix operations efficiently. For example, the two arrays below can be summed cell-by-cell using either the `+` operator or the NumPy function `np.add`.

In [None]:
arr1 = np.array([[0, 0, 1],
                 [3, 1, 2],
                 [1, 0, 1]])

arr2 = np.array([[1, 1, 1],
                 [1, 3, 2],
                 [0, 1, 1]])

In [None]:
np.add(arr1, arr2)

In [None]:
arr1 + arr2
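
A common source of confusion is that the `*` operator on arrays also works cell-by-cell, whereas the matrix product from linear algebra uses the `@` operator (or `np.matmul`). The sketch below uses two small arrays, `a` and `b`, chosen so the results are easy to check by hand.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# cell-by-cell multiplication
print(a * b)           # [[ 5 12], [21 32]]

# matrix product (rows of a combined with columns of b)
print(a @ b)           # [[19 22], [43 50]]
print(np.matmul(a, b)) # same result as a @ b
```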

There is a vast number of functions in this library. Understanding all of them is outside this course's scope, but it is helpful to be familiar with the documentation to know the potential of this data type.

**Additional note**: The Python standard library also provides an array type of its own, in the [module `array`](https://docs.python.org/3.8/library/array.html). This data type is intended to store sequences of numeric values compactly: every element has the same basic type (such as integer or floating point), which is chosen when the array is created. It shares some attributes with NumPy arrays but lacks much of the functionality found in the latter.

The word *array* can be tricky, not only in Python but also in other programming languages. In some languages an array is a simple list, while in others it has characteristics closer to those described above for NumPy arrays.

The following is an example of array definition in the native Python module. 

In [None]:
import array
native_array = array.array('d', [1.0, 2.0, 3.14]) # 'd' is the type code for double-precision floats
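
The sketch below illustrates two of the differences mentioned above: a native array enforces its element type, and the `+` operator joins two arrays end to end instead of adding them cell-by-cell as NumPy does. The name `combined` is introduced only for this example.

```python
import array

native_array = array.array('d', [1.0, 2.0, 3.14])

# appending a number of the right type works, much like a list
native_array.append(4.5)
print(native_array.typecode)   # d
print(len(native_array))       # 4

# appending a value of the wrong type raises a TypeError
try:
    native_array.append("text")
except TypeError as err:
    print(f"TypeError: {err}")

# + concatenates two arrays; it does not add them element by element
combined = native_array + array.array('d', [9.0])
print(len(combined))           # 5
```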

### DataFrames
Similar to the previous data type, **DataFrames** are a data type provided by a third-party library, in this case [pandas](https://pandas.pydata.org/). This library provides a powerful and performant data structure and includes useful functions for manipulating tables and data series. Tables are composed of columns and rows, and every cell can be located by its row and column labels.

DataFrames have some benefits over the other data types described in this document. Their structure is easy to interpret, and labeled axes (rows and columns) reduce the amount of abstraction needed to work with them. For tabular data, they are often much easier to read and understand than lists or arrays.

There are multiple ways to create a DataFrame; one of the most common is from a `dict` in which each key is a column name and each value is a list holding the data for that column.

In [None]:
import pandas as pd
dict1 = {
    'Column+one': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Column_2': ['s', 't', 'r', 'i', 'n', 'g'],
    'Column*3': [1, 2, 3, 4, 4, 3],
}
  
my_df = pd.DataFrame(dict1) # creates the dataframe from our dictionary

my_df
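
The labeled axes are what make a DataFrame convenient to query. As a brief sketch, the code below rebuilds the same table and then selects a column by name, filters rows by a condition, and reads a single cell by its row and column labels. The name `subset` is introduced here for illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    'Column+one': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Column_2': ['s', 't', 'r', 'i', 'n', 'g'],
    'Column*3': [1, 2, 3, 4, 4, 3],
})

print(my_df.shape)                 # (6, 3): six rows and three columns

# select one column by its name and sum its values
print(my_df['Column*3'].sum())     # 17

# keep only the rows where a condition is true
subset = my_df[my_df['Column*3'] >= 4]
print(subset)

# read a single cell by row label and column label
print(my_df.loc[0, 'Column_2'])    # s
```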

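Jupyter notebooks render DataFrames as clearly formatted tables, which makes them easier to inspect and reason about. A DataFrame can also be created from a list of rows, and DataFrames that share the same column names can be stacked with `pd.concat`, as the following cell shows.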

In [None]:
list_of_lists = [['E','n', 4],['F', 'g', 3]]

my_df2 = pd.DataFrame(list_of_lists, columns=['Column+one', 'Column_2', 'Column*3']) # creates a dataframe from a list

pd.concat([my_df, my_df2]) # concatenate both dataframes
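
By default `pd.concat` keeps the original row labels of each DataFrame, so the combined table can contain repeated labels. The sketch below uses two small hypothetical tables, `df_a` and `df_b`, to show the effect of the `ignore_index=True` option, which renumbers the rows.

```python
import pandas as pd

cols = ['Column+one', 'Column_2', 'Column*3']
df_a = pd.DataFrame([['A', 's', 1], ['B', 't', 2]], columns=cols)
df_b = pd.DataFrame([['E', 'n', 4], ['F', 'g', 3]], columns=cols)

# the default keeps each table's own row labels
stacked = pd.concat([df_a, df_b])
print(list(stacked.index))        # [0, 1, 0, 1]

# ignore_index=True assigns fresh labels from 0 upward
renumbered = pd.concat([df_a, df_b], ignore_index=True)
print(list(renumbered.index))     # [0, 1, 2, 3]
```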

In the same way as NumPy, the pandas library includes a multitude of functions, methods, and properties for managing data in very efficient ways. However, exploring all of them is beyond the scope of this unit.