## Python data types 

### Data types

- int
- float
- range
- str
- list
- tuple
- dict
- set

#### Strings

Typical uses:
    
- Human readable text
- Mnemonic identifiers
- Intermediate representation of objects saved to a file or transmitted over a network  
    The process of converting an object (e.g. a list or a number) to a byte string (of type `bytes`) that can be written to a file or transmitted over a network is called *serialization*. The inverse process is called *deserialization*. Although it is not required, it is practical to first convert the object to a string, which is subsequently *encoded* to a byte string.
- Avoid uses beyond those above
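
The serialization round trip described above can be sketched as follows. This is a minimal illustration that assumes JSON as the intermediate string format and UTF-8 as the encoding; both are choices made for this example, and the variable names are hypothetical.

```python
import json

# An object we want to save or transmit (example data).
measurements = [1, 2.5, 3]

# Serialization: object -> string -> byte string.
text = json.dumps(measurements)   # '[1, 2.5, 3]'
payload = text.encode('utf-8')    # b'[1, 2.5, 3]'

# Deserialization: byte string -> string -> object.
restored = json.loads(payload.decode('utf-8'))

print(type(payload).__name__, restored == measurements)  # bytes True
```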

Characteristics:

- Immutable
- Ordered
- Iterable
- Indexable
- Sliceable

#### Lists

Typical uses:
 
- Used to hold multiple objects where order matters
- Sometimes very large (tens of millions of elements)
- May hold multiple instances of same object
- Used in implementation of queues, stacks and other data structures
- Typically, all elements of a list are of same type (but not a requirement)

Characteristics:

- Mutable
- Ordered
- Iterable
- Indexable
- Sliceable
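
A short sketch of these characteristics, and of a list used as a stack. The variable names are made up for the example.

```python
stack = []            # an empty list used as a stack
stack.append('a')     # push
stack.append('b')
stack.append('b')     # the same object may appear more than once
top = stack.pop()     # pop removes and returns the last element

letters = ['a', 'b', 'c', 'd']
letters[0] = 'z'              # mutable: elements can be replaced in place
first_two = letters[:2]       # sliceable: a new list ['z', 'b']
last = letters[-1]            # indexable: 'd'

print(top, stack, first_two, last)  # b ['a', 'b'] ['z', 'b'] d
```

For a queue, where elements are removed from the front, `collections.deque` is usually the better choice, since removing the first element of a list takes time proportional to the length of the list.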

#### Tuples

Typical uses:

- Used to hold multiple objects where order matters
- Normally holds few elements (often created using literals)
- Often holds elements of different type
- Used to represent data points, the attributes of an object

Characteristics:

- Immutable
- Otherwise, same as list
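
As a small illustration, the sketch below uses a tuple as a data point with elements of different types. The values are taken from one row of the honey production table used later in this section; the variable names are hypothetical.

```python
# A data point: (state, year, price per pound).
point = ('IN', 1998, 0.85)

state, year, price = point   # unpacking into separate names
print(point[0], point[1:])   # IN (1998, 0.85)

# Tuples are immutable: item assignment raises TypeError.
try:
    point[2] = 0.90
except TypeError as error:
    print('TypeError:', error)
```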

#### Sets

Typical uses:

- Container of *distinct* objects where order does *not* matter
- Sometimes very large (tens of millions of elements)
- It's faster to check if a large set contains an element than to check if a list contains an element
- Can do set operations such as union, intersection, set difference

Characteristics:

- Mutable (`frozenset` is the immutable counterpart)
- Elements must be *hashable* (which, for the built-in types, means immutable)
- Not ordered
- Iterable
- Not indexable
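
The following sketch illustrates the set operations and the hashability requirement listed above. The values and names are invented for the example.

```python
a = {1, 2, 3, 3}         # duplicates are dropped, leaving three elements
b = {3, 4}

union = a | b            # elements in a or b
intersection = a & b     # elements in both a and b
difference = a - b       # elements in a but not in b

# Elements must be hashable: a tuple works, a list does not.
points = {(0, 0), (1, 2)}
try:
    points.add([3, 4])
except TypeError as error:
    print('TypeError:', error)

# A frozenset is itself hashable, so it can be an element of another set.
nested = {frozenset(a), frozenset(b)}
print(len(a), union, intersection, difference, len(nested))
```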

#### Dictionaries

Typical uses:

- Used to look up objects by key rather than by index
- Used to map each object in a collection to some value, e.g. countries to population
- Used as a representation of an object, e.g. represent a book as a dictionary with keys: `'title'`, `'author'`, `'publication_year'`, `'publisher'`, etc.
- Used as caches

Characteristics:

- Mutable
- Keys must be hashable
- Iterable: iterating over a dictionary yields its keys, while `.values()` and `.items()` give the values and the key-value pairs
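
A brief sketch of two of the uses above: a mapping from countries to population, and a dictionary used as a cache. The population figures (in millions) are the ones used in the exercises at the end of this section; the `square` function is a hypothetical stand-in for an expensive computation.

```python
population = {'USA': 329, 'Mexico': 127, 'Ireland': 5}

population['Cuba'] = 11                 # mutable: add a new key
print(population['Mexico'])             # look up by key: 127
print(population.get('France'))         # missing key with .get(): None

for country, millions in population.items():   # iterate over key-value pairs
    print(country, millions)

# A dictionary used as a cache.
cache = {}

def square(n):
    """Return n * n, computing it only the first time it is requested."""
    if n not in cache:
        cache[n] = n * n      # stands in for a slow computation
    return cache[n]

print(square(12), square(12), cache)    # 144 144 {12: 144}
```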

We can illustrate the difference between a set and a list with the following example. Assume we want a list of *distinct* random numbers between 1 and 10,000,000. We can try to produce this list directly. The code in the cell below works fine as long as the number of random numbers is small. Try increasing `count` to 1,000,000 and see what happens.

In [0]:
from random import randrange

random_list = []
count = 1_000
while count > 0:
    n = randrange(1, 10_000_000)
    if n not in random_list:
        random_list.append(n)
        count -= 1
print(len(random_list))

1000


The problem is that the time it takes to perform the check `n not in random_list` is proportional to the size of the list. We can avoid this by using a set, for which the same check takes roughly constant time on average.

In [0]:
from random import randrange, shuffle
random_set = set()
count = 1_000_000
while count > 0:
    n = randrange(1, 10_000_000)
    if n not in random_set:
        random_set.add(n)
        count -= 1
random_list = list(random_set)
shuffle(random_list)
print(len(random_list))
# len(set(random_list)) == len(random_list)


1000000
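
To see the difference in membership-test cost directly, one can time the `in` operator on a list and on a set holding the same elements. The sketch below uses the standard `timeit` module; the container size and the number of repetitions are arbitrary choices for the illustration, and the exact timings will vary from machine to machine.

```python
from timeit import timeit

size = 100_000
as_list = list(range(size))
as_set = set(as_list)
missing = -1    # not present, so the list must be scanned to the end

# Time 100 membership tests against each container.
list_time = timeit(lambda: missing in as_list, number=100)
set_time = timeit(lambda: missing in as_set, number=100)

print(f'list: {list_time:.6f} s, set: {set_time:.6f} s')
```

On a typical machine the set lookups should finish several orders of magnitude faster than the list lookups.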


###  Worked example: Representing a database table

The task is to represent a database table in Python in such a way that one can easily and efficiently perform typical queries on it. Select appropriate data structures, e.g., lists, dictionaries, etc. The table we will be using can be downloaded from [here](https://www.kaggle.com/jessicali9530/honey-production/downloads/honey-production.zip/2).

(For the sake of the challenge, do not use `pandas`.)


To read the `honeyproduction.csv` file, use the `csv` package.

In [2]:
import csv

Let's read a few rows of the file to see what it looks like. 

The lines below show how to read a file. We use `with open ...`, which is the recommended way of accessing a file since the file is automatically closed when execution leaves the indented block below.

In the code lines below, `csv.reader` returns a reader, which allows us to *iterate* over the lines of the file. Anything we can iterate over is referred to as an *iterable*, so `reader` is an iterable. This means that we can, e.g., use it in a for-loop:

```Python
for line in reader:
    print(line)
```

But `reader` is also an *iterator*, so we can also call `next` on the reader, which will return the next line.

In [1]:
# Uncomment the two lines below and execute cell to upload honeyproduction.csv
# from google.colab import files
# files.upload()

In [3]:
with open('assets/honeyproduction.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for _ in range(5):
        print(next(reader))

['state', 'numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue', 'year']
['AL', '16000', '71', '1136000', '159000', '0.72', '818000', '1998']
['AZ', '55000', '60', '3300000', '1485000', '0.64', '2112000', '1998']
['AR', '53000', '65', '3445000', '1688000', '0.59', '2033000', '1998']
['CA', '450000', '83', '37350000', '12326000', '0.62', '23157000', '1998']


We can see that the file contains a heading row followed by some data rows. `reader` has split each line into a list of strings corresponding to the values in each column. We will need to convert some of the column entries to `int` or `float`. 

**A note on iterables and iterators.** An *iterable* is a thing that we can iterate over, e.g., using a for-loop. Examples of iterables that we have encountered so far are ranges, strings, lists, tuples, and sets. When iterating over an iterable, the iterable's *iterator* is used. The function `iter()` is used to access an iterable's iterator. A for-loop uses the same iterator, but obtains it without our having to invoke `iter()` explicitly.

The function `next()`, when invoked on an iterator, returns the next element of the iterator. As a side-effect, it also advances the iterator so that next time `next()` is called, the following element is returned. When iterating over an iterable with a for-loop, `next()` is called repeatedly on the iterable's iterator until the iterator is exhausted.
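
The relationship between a for-loop, `iter()` and `next()` can be sketched as below. This is an approximation of what the for-loop does, not the interpreter's actual implementation. It relies on the fact that `next()` raises `StopIteration` when the iterator is exhausted; the list of colors is example data.

```python
colors = ['red', 'green', 'blue']

# The for-loop form.
for color in colors:
    print(color)

# Roughly what the for-loop does behind the scenes.
iterator = iter(colors)
collected = []
while True:
    try:
        color = next(iterator)    # returns the next element and advances
    except StopIteration:         # raised when the iterator is exhausted
        break
    collected.append(color)

print(collected)   # ['red', 'green', 'blue']
```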

It is important to distinguish between iterables and iterators. For example, calling `next()` on an iterable will fail unless the iterable is also an iterator:

```Python
my_list = [1, 2, 3]
my_list_iterator = iter(my_list)
next(my_list_iterator) # evaluates to 1 and advances the iterator
next(my_list) # raises TypeError: 'list' object is not an iterator
```

It is not uncommon that an iterable is its own iterator (so that calling `iter()` on the iterable returns the iterable itself). This is the case for `csv.reader`.
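
This can be checked with the `is` operator. In the sketch below, `io.StringIO` stands in for an open CSV file so that the example runs without the data file; the two data rows are a shortened excerpt invented for the illustration.

```python
import csv
import io

# io.StringIO stands in for an open CSV file in this sketch.
csvfile = io.StringIO('state,year\nAL,1998\nAZ,1998\n')
reader = csv.reader(csvfile)

reader_is_own_iterator = iter(reader) is reader     # the reader is its own iterator
my_list = [1, 2, 3]
list_is_own_iterator = iter(my_list) is my_list     # a list hands out a separate iterator
print(reader_is_own_iterator, list_is_own_iterator) # True False

header = next(reader)      # ['state', 'year']
remaining = list(reader)   # the rows that have not been consumed yet
leftover = list(reader)    # []: the iterator is now exhausted
print(header, remaining, leftover)
```

One practical consequence is that a reader can only be traversed once, whereas a list can be iterated over as many times as we like.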

#### Alternative 1: Representing a table as a list of rows

The code below instantiates two variables: `heading_row`, which contains the first row as a list, and `data_rows`, which is a list of lists.

In [0]:
data_rows = []
with open('honeyproduction.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    heading_row = next(reader)
    for row in reader:
        for i, col in enumerate(row[1:5], start=1):
            row[i] = int(float(col))
        row[5] = float(row[5])
        for i, col in enumerate(row[6:], start=6):
            row[i] = int(float(col))
        data_rows.append(row)
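
To experiment with the conversion without downloading the file, the same idea can be tried on an in-memory sample. The sketch below is a variation, not the cell above: it picks the conversion by column name instead of by index, and the names `sample`, `sample_heading`, `sample_rows`, `FLOAT_COLUMNS` and `TEXT_COLUMNS` are hypothetical. The sample lines are the heading and the first two data rows printed earlier. Note that `int(float(value))` also accepts a value written with a decimal point, such as `'71.0'`, which `int()` alone would reject.

```python
import csv
import io

# Heading and first two data rows, copied from the output shown earlier.
sample = io.StringIO(
    'state,numcol,yieldpercol,totalprod,stocks,priceperlb,prodvalue,year\n'
    'AL,16000,71,1136000,159000,0.72,818000,1998\n'
    'AZ,55000,60,3300000,1485000,0.64,2112000,1998\n'
)

FLOAT_COLUMNS = {'priceperlb'}   # columns kept as float
TEXT_COLUMNS = {'state'}         # columns kept as str

reader = csv.reader(sample)
sample_heading = next(reader)
sample_rows = []
for row in reader:
    converted = []
    for name, value in zip(sample_heading, row):
        if name in TEXT_COLUMNS:
            converted.append(value)
        elif name in FLOAT_COLUMNS:
            converted.append(float(value))
        else:
            converted.append(int(float(value)))   # also accepts '71.0'
    sample_rows.append(converted)

print(sample_rows[0])   # ['AL', 16000, 71, 1136000, 159000, 0.72, 818000, 1998]
```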

Let's also create a `heading_dict` that maps the name of each column to the column's index. This will allow us to refer to a column by its name rather than by its index.

In [0]:
heading_dict = {}
for idx, name in enumerate(heading_row):
    heading_dict[name] = idx
heading_dict

{'numcol': 1,
 'priceperlb': 5,
 'prodvalue': 6,
 'state': 0,
 'stocks': 4,
 'totalprod': 3,
 'year': 7,
 'yieldpercol': 2}

An alternative way of defining the `heading_dict` is to use a *dictionary comprehension* as shown in the cell below. We will address comprehensions in more detail later in this course, so don't worry if you don't quite understand how to read (and write) them yet. This is just a small sample of what is to come later.

In [0]:
heading_dict = {name:idx for idx, name in enumerate(heading_row)}
heading_dict

{'numcol': 1,
 'priceperlb': 5,
 'prodvalue': 6,
 'state': 0,
 'stocks': 4,
 'totalprod': 3,
 'year': 7,
 'yieldpercol': 2}

We can reference any row by index:

In [0]:
data_rows[10]

['IN', 9000, 92, 828000, 489000, 0.85, 704000, 1998]

And we can get the value at a given column in a row by index. Instead of having to count the columns to find the index to use, we can use `heading_dict`.

In [0]:
f'The price per pound is: {data_rows[10][heading_dict["priceperlb"]]}'

'The price per pound is: 0.85'

We can make queries on our table. For example, we can select all rows where `totalprod` is greater than 35,000,000.

In [0]:
query_result = []
for row in data_rows:
    if row[heading_dict['totalprod']] > 35000000:
        query_result.append(row)
query_result

[['CA', 450000, 83, 37350000, 12326000, 0.62, 23157000, 1998],
 ['ND', 400000, 90, 36000000, 8640000, 1.36, 48960000, 2008],
 ['ND', 510000, 91, 46410000, 12995000, 1.5, 69615000, 2010]]

Alternatively, we can use a list comprehension. Compare the code in the cell below with the way we would write the query in SQL.
```SQL
SELECT * FROM honeyproductiontable WHERE totalprod > 35000000
```

In [0]:
query_result = [row for row in data_rows if row[heading_dict['totalprod']] > 35000000]
query_result

[['CA', 450000, 83, 37350000, 12326000, 0.62, 23157000, 1998],
 ['ND', 400000, 90, 36000000, 8640000, 1.36, 48960000, 2008],
 ['ND', 510000, 91, 46410000, 12995000, 1.5, 69615000, 2010]]
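
The same pattern extends to queries that combine several conditions and return only some of the columns. The sketch below is self-contained: it uses the three rows from the result above as its table, and the names `headings`, `rows` and `col` are hypothetical stand-ins for `heading_row`, `data_rows` and `heading_dict`.

```python
headings = ['state', 'numcol', 'yieldpercol', 'totalprod',
            'stocks', 'priceperlb', 'prodvalue', 'year']
rows = [
    ['CA', 450000, 83, 37350000, 12326000, 0.62, 23157000, 1998],
    ['ND', 400000, 90, 36000000, 8640000, 1.36, 48960000, 2008],
    ['ND', 510000, 91, 46410000, 12995000, 1.5, 69615000, 2010],
]
col = {name: idx for idx, name in enumerate(headings)}

# SELECT state, year FROM honeyproductiontable
# WHERE totalprod > 35000000 AND year >= 2008
result = [(row[col['state']], row[col['year']])
          for row in rows
          if row[col['totalprod']] > 35000000 and row[col['year']] >= 2008]
print(result)   # [('ND', 2008), ('ND', 2010)]
```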

#### Alternative 2: Representing a table as a dictionary

In this alternative, we represent the table as a dictionary where the keys are the column headings and the values are lists of all the data entries in the corresponding column.

In [0]:
table_dict = {}
for col_idx, heading in enumerate(heading_row): 
    column = []
    for row in data_rows:
        column.append(row[col_idx])
    table_dict[heading] = column

Or, using comprehension:

In [0]:
table_dict = {heading:[row[col_idx] for row in data_rows] 
              for col_idx, heading in enumerate(heading_row)}
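
Another way to turn rows into columns is to transpose the table with `zip`. The sketch below shows the idea on a shortened, three-column excerpt of the table; the names `headings`, `rows`, `columns` and `table` are chosen for the example. It is an alternative formulation, not a replacement for the cells above.

```python
headings = ['state', 'totalprod', 'year']
rows = [
    ['AL', 1136000, 1998],
    ['AZ', 3300000, 1998],
    ['AR', 3445000, 1998],
]

# zip(*rows) groups the first entries of all rows, then the second entries, etc.
columns = [list(column) for column in zip(*rows)]
table = dict(zip(headings, columns))

print(table['state'])       # ['AL', 'AZ', 'AR']
print(table['totalprod'])   # [1136000, 3300000, 3445000]
```

One difference to keep in mind: if `rows` is empty, `zip(*rows)` produces no columns at all, so `table` ends up empty, whereas the comprehension above would map every heading to an empty list.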

We can get a *slice* of the table:

In [0]:
{heading:column[:4] for heading, column in table_dict.items()}

{'numcol': [16000, 55000, 53000, 450000],
 'priceperlb': [0.72, 0.64, 0.59, 0.62],
 'prodvalue': [818000, 2112000, 2033000, 23157000],
 'state': ['AL', 'AZ', 'AR', 'CA'],
 'stocks': [159000, 1485000, 1688000, 12326000],
 'totalprod': [1136000, 3300000, 3445000, 37350000],
 'year': [1998, 1998, 1998, 1998],
 'yieldpercol': [71, 60, 65, 83]}

Now that we have our dictionary representing the table, we don't need `data_rows` any longer. We can get a row, represented as a dictionary mapping column heading to column entry, with the following function:

In [0]:
def row(row_idx):
    row = {}
    for heading, column in table_dict.items():
        row[heading] = column[row_idx]
    return row

Or, using comprehension:

In [0]:
def row(row_idx):
    return {heading:column[row_idx] for heading, column in table_dict.items()}

In [0]:
row(10)

{'numcol': 9000,
 'priceperlb': 0.85,
 'prodvalue': 704000,
 'state': 'IN',
 'stocks': 489000,
 'totalprod': 828000,
 'year': 1998,
 'yieldpercol': 92}

We can make queries:

In [0]:
query_result = []
for row_idx, total_prod in enumerate(table_dict['totalprod']):
    if total_prod > 35000000:
        query_result.append(row(row_idx))  
query_result

[{'numcol': 450000,
  'priceperlb': 0.62,
  'prodvalue': 23157000,
  'state': 'CA',
  'stocks': 12326000,
  'totalprod': 37350000,
  'year': 1998,
  'yieldpercol': 83},
 {'numcol': 400000,
  'priceperlb': 1.36,
  'prodvalue': 48960000,
  'state': 'ND',
  'stocks': 8640000,
  'totalprod': 36000000,
  'year': 2008,
  'yieldpercol': 90},
 {'numcol': 510000,
  'priceperlb': 1.5,
  'prodvalue': 69615000,
  'state': 'ND',
  'stocks': 12995000,
  'totalprod': 46410000,
  'year': 2010,
  'yieldpercol': 91}]

Using comprehension, the query becomes more declarative and succinct:

In [0]:
[row(row_idx) 
 for row_idx, total_prod in enumerate(table_dict['totalprod']) 
 if total_prod > 35000000]

[{'numcol': 450000,
  'priceperlb': 0.62,
  'prodvalue': 23157000,
  'state': 'CA',
  'stocks': 12326000,
  'totalprod': 37350000,
  'year': 1998,
  'yieldpercol': 83},
 {'numcol': 400000,
  'priceperlb': 1.36,
  'prodvalue': 48960000,
  'state': 'ND',
  'stocks': 8640000,
  'totalprod': 36000000,
  'year': 2008,
  'yieldpercol': 90},
 {'numcol': 510000,
  'priceperlb': 1.5,
  'prodvalue': 69615000,
  'state': 'ND',
  'stocks': 12995000,
  'totalprod': 46410000,
  'year': 2010,
  'yieldpercol': 91}]
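
One convenience of the column-oriented representation is that a whole column is available as a single list, so aggregates over a column need no explicit loop over rows. The sketch below uses three columns of the four-row slice shown earlier; `table` is a hypothetical stand-in for `table_dict`.

```python
table = {
    'state': ['AL', 'AZ', 'AR', 'CA'],
    'totalprod': [1136000, 3300000, 3445000, 37350000],
    'year': [1998, 1998, 1998, 1998],
}

total = sum(table['totalprod'])       # total production over all rows
largest = max(table['totalprod'])     # largest single entry

# The position of the largest entry identifies the matching row in every column.
top_state = table['state'][table['totalprod'].index(largest)]

print(total, largest, top_state)   # 45231000 37350000 CA
```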

### Exercises

In [7]:
# Replace None with a data structure that holds 5 country names.
names = ['USA', 'Mexico', 'Ireland', 'Cuba', 'Canada']

In [8]:
# Replace None with a data structure that holds the capitals of the countries.
capitals = ['Washington', 'Mexico City', 'Dublin', 'Havana', 'Ottawa']

In [0]:
# Replace None with the populations (in millions) of the 5 countries
populations = [329, 127, 5, 11, 37]

In [6]:
# Replace None with an appropriate data structure
usa = {'name': 'USA', 'capital':'Washington', 'population':329}
# countries = []
# for i in range(len(names)):
#   country = {'name':names[i], 'capital':capitals[i], 'population': populations[i]}
#   countries.append(country)
countries = [{'name':name, 'capital':capital, 'population': population} 
             for name, capital, population in zip(names, capitals, populations)]
countries


In [4]:
# Complete

def country_by_capital(capital):
    """Return the name of the country whose capital is `capital`, or None if there is no match."""
    # Equivalent loop version:
    # for country in countries:
    #     if country['capital'] == capital:
    #         return country['name']
    # return None

    return next((country['name']
                 for country in countries if country['capital'] == capital), None)

In [5]:
country_by_capital('Dublin') 

'Ireland'
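
The function above scans the list of countries on every call. If many lookups are needed, one option is to build a dictionary keyed by capital once and use it for all subsequent lookups. The sketch below is self-contained and uses three of the countries from the exercise; the name `name_by_capital` is hypothetical.

```python
countries = [
    {'name': 'USA', 'capital': 'Washington', 'population': 329},
    {'name': 'Ireland', 'capital': 'Dublin', 'population': 5},
    {'name': 'Canada', 'capital': 'Ottawa', 'population': 37},
]

# Build the index once: capital -> country name.
name_by_capital = {country['capital']: country['name'] for country in countries}

print(name_by_capital.get('Dublin'))   # Ireland
print(name_by_capital.get('Paris'))    # None
```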