
# Sequence and Map Data Structures - Strings, Tuples, Lists, Dictionaries
This is our week 2 examples notebook; it will be available on GitHub in the powderflask/cap-comp215 repository.

As usual, the first code block just imports the modules we will use.

In [1]:
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pprint import pprint

## f-strings
A `string` is a sequence of characters / symbols.
This familiar data structure is quite powerful, and format-strings (f-strings) take it to the next level...

In [2]:
today = datetime.date.today()
the_answer = 42
PI = 3.1415926535

f'{today:%d/%m/%Y} is not special, but {the_answer} and {PI:0.2f} are!'

'22/01/2023 is not special, but 42 and 3.14 are!'
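
As a further sketch of what the part after the colon can do, here are a few more format specifiers. The names `price` and `label` and their values are made up for illustration; they are not part of the notebook.

```python
# Hypothetical values, chosen only to illustrate format specifiers
price = 1234567.891
label = 'total'

grouped = f'{price:,.2f}'   # thousands separators, 2 decimal places
padded = f'{label:>10}'     # right-align in a field 10 characters wide
percent = f'{0.875:.1%}'    # multiply by 100 and append a percent sign
debug = f'{label=}'         # Python 3.8+: show the expression and its repr

print(grouped)
print(padded)
print(percent)
print(debug)
```

The `=` form is handy when printing intermediate values while debugging, since it labels each value with the expression that produced it.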

## List Comprehension
A list comprehension provides a compact syntax for two very common sequence-processing algorithms: Map and Filter.

Basic syntax:

In [21]:
# [i for i in range(10)]

# Generate a list of values based on a set of rules
list_1 = [2*i if i%2==0 else 3*i if i%3==0 else i*4 for i in range(0,100)]
print(list_1)

# The list comprehension can be made clearer by first defining a function...
def calculate_it(val):
    return 2*val if val%2==0 else 3*val if val%3==0 else val*4

# ...then using that function to manipulate i in the list comprehension
list_2 = [calculate_it(i) for i in range(0, 100)]
print(list_2)

from collections import defaultdict

def counter_factory(): # This factory function can be replaced with 'lambda : 0' in defaultdict
    return 0

counts_1 = defaultdict(counter_factory)  # We're not calling the counter_factory function so we don't need to include the parentheses
for i in range(0, 100):
    counts_1[calculate_it(i)] += 1
print(counts_1)

# Instead of defining the counter_factory function, we can use a lambda function
counts_2 = defaultdict(lambda:0)
for i in range(0, 100):
    counts_2[calculate_it(i)] += 1
print(counts_2)

# We can also do dictionary comprehensions which show mapping more explicitly
dict_comp = {i:calculate_it(i) for i in range(0, 100)}
print(dict_comp)

# We can also remove the duplicates using a set comprehension
set_comp = {calculate_it(i) for i in range(0, 100)}
print(set_comp)

# Questions:
# 1. Does the set comprehension automatically remove the duplicates? Is that a characteristic of a set in Python?

[0, 4, 4, 9, 8, 20, 12, 28, 16, 27, 20, 44, 24, 52, 28, 45, 32, 68, 36, 76, 40, 63, 44, 92, 48, 100, 52, 81, 56, 116, 60, 124, 64, 99, 68, 140, 72, 148, 76, 117, 80, 164, 84, 172, 88, 135, 92, 188, 96, 196, 100, 153, 104, 212, 108, 220, 112, 171, 116, 236, 120, 244, 124, 189, 128, 260, 132, 268, 136, 207, 140, 284, 144, 292, 148, 225, 152, 308, 156, 316, 160, 243, 164, 332, 168, 340, 172, 261, 176, 356, 180, 364, 184, 279, 188, 380, 192, 388, 196, 297]
[0, 4, 4, 9, 8, 20, 12, 28, 16, 27, 20, 44, 24, 52, 28, 45, 32, 68, 36, 76, 40, 63, 44, 92, 48, 100, 52, 81, 56, 116, 60, 124, 64, 99, 68, 140, 72, 148, 76, 117, 80, 164, 84, 172, 88, 135, 92, 188, 96, 196, 100, 153, 104, 212, 108, 220, 112, 171, 116, 236, 120, 244, 124, 189, 128, 260, 132, 268, 136, 207, 140, 284, 144, 292, 148, 225, 152, 308, 156, 316, 160, 243, 164, 332, 168, 340, 172, 261, 176, 356, 180, 364, 184, 279, 188, 380, 192, 388, 196, 297]
defaultdict(<function counter_factory at 0x7f507f5d2e50>, {0: 1, 4: 2, 9: 1, 8: 1, 20:
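
On the question in the cell above: yes, a Python set stores each value at most once, so a set comprehension discards duplicates as it is built. The sketch below re-defines `calculate_it` so it runs on its own. `collections.Counter` is shown as a standard-library alternative to the `defaultdict` counting loop; the notebook itself does not use it.

```python
from collections import Counter

def calculate_it(val):
    # Same rule as the notebook cell above
    return 2*val if val % 2 == 0 else 3*val if val % 3 == 0 else val*4

values = [calculate_it(i) for i in range(100)]
unique_values = {calculate_it(i) for i in range(100)}

# A set keeps each value once, so it can never be longer than the list
print(len(values), len(unique_values))

# Counter tallies occurrences without needing a factory function
counts = Counter(values)
print(counts[4])  # 4 is produced by both i=1 and i=2
```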

### Map Algorithm
Apply the same function to every item in a source sequence to build a new sequence (i.e., provide a "mapping" from the source sequence to the target).

In [19]:
# Example: Compute the first 10 natural squares
squares = [x**2 for x in range(1,11)]
print(squares)

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
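
For comparison, the same mapping can be written with the built-in `map()` function. This is only a sketch; the helper name `square` is introduced here for illustration.

```python
def square(x):
    # Hypothetical helper, used only for this comparison
    return x**2

squares_comp = [x**2 for x in range(1, 11)]
squares_map = list(map(square, range(1, 11)))  # map() is lazy, so wrap it in list()

print(squares_comp == squares_map)
```

Both forms apply one function to every item; the comprehension is usually the more readable choice when the function is a short expression.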


### Filter
Select a sub-set of the elements from another sequence based on some criteria.

In [20]:
VOWELS = 'aeiouAEIOU'
text = '''
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
'''
# Problem: create a string with just the vowels from the text, in order.
list_of_vowels = [char for char in text if char in VOWELS]

delimiter = ', '
str_of_vowels = delimiter.join(list_of_vowels)
print(str_of_vowels)

vowels_seen = set()
def is_seen(char):
  if char in vowels_seen:
    return True
  else:
    vowels_seen.add(char)
    return False

# [vowel for vowel in list_of_vowels if not is_seen(vowel)] would work, but it relies on is_seen() mutating vowels_seen.
# Prefer an explicit loop over a list comprehension when the filter has side effects (as seen below).

unique_vowels = []
# for vowel in list_of_vowels:
#   if not is_seen(vowel):
#     unique_vowels.append(vowel)
for vowel in list_of_vowels:
  if vowel in VOWELS and vowel not in vowels_seen:
    vowels_seen.add(vowel)
    unique_vowels.append(vowel)

print(vowels_seen)
print(unique_vowels)

# Questions:
# 1. Why do we need vowels_seen and unique_vowels? Both seem to contain the same information.

o, e, i, u, o, o, i, a, e, o, e, e, u, a, i, i, i, e, i, e, o, e, i, u, o, e, o, i, i, i, u, u, a, o, e, e, o, o, e, a, a, a, i, u, a, U, e, i, a, i, i, e, i, a, u, i, o, u, e, e, i, a, i, o, u, a, o, a, o, i, i, i, u, a, i, u, i, e, e, a, o, o, o, o, e, u, a
{'u', 'e', 'o', 'i', 'U', 'a'}
['o', 'e', 'i', 'u', 'a', 'U']
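
On the question in the cell above: `vowels_seen` is a set, which gives fast membership tests but no guaranteed order, while `unique_vowels` is a list that preserves first-seen order. As an alternative sketch, `dict.fromkeys` can do both jobs at once, relying on dictionaries preserving insertion order (Python 3.7+). The shortened `sample` string is made up for this example.

```python
VOWELS = 'aeiouAEIOU'
sample = 'Lorem ipsum dolor sit amet. Ut enim ad minim veniam.'  # shortened, made-up sample

vowels = [char for char in sample if char in VOWELS]

# The built-in filter() expresses the same selection as the comprehension
vowels_via_filter = list(filter(lambda char: char in VOWELS, sample))

# dict keys are unique and keep insertion order, so this de-duplicates in first-seen order
unique_in_order = list(dict.fromkeys(vowels))

print(unique_in_order)
```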


## Data Wrangling with List Comprehension
E-learn's Live Quiz module tracks quiz scores for each student, but does not store them in the gradebook, and instead reports them in the most useless way.

Let's do some "data wrangling" to make sense out of this mess!

### The Problem: Unstructured Data!
Notice it is just a single large string! The real data set has 36 students, and I need to do this every week!

In [23]:
quiz_scores = """
  1.                 Ali Oop scored  7/ 8 = 87%


  2.          Alison Ralison scored  8/ 8 = 100%


  3.         Ambily Piturbed scored  8/ 8 = 100%


  4.  Arshan Risnot Farquared scored  5/ 8 = 62%


  5.       Ayushma Jugernaugh scored  5/ 8 = 62%


  6.       Brayden Labaguette scored  7/ 8 = 87%
"""

### Goal
Turn this into structured data: a list of 2-tuples, each holding a student's full name and their integer score.

In [28]:
# Skip the blank lines, split each remaining line on whitespace, then slice off the
# leading entry number and the trailing total, '=', and percentage tokens
lines = [line.split()[1:-3] for line in quiz_scores.split('\n') if line.strip()]
pprint(lines)

# Join the name tokens (everything before 'scored') into a full name,
# and convert the score token (e.g. '7/') to an int by dropping the trailing '/'
scores = [(' '.join(line[:-2]), int(line[-1][:-1])) for line in lines]
pprint(scores)

[['Ali', 'Oop', 'scored', '7/'],
 ['Alison', 'Ralison', 'scored', '8/'],
 ['Ambily', 'Piturbed', 'scored', '8/'],
 ['Arshan', 'Risnot', 'Farquared', 'scored', '5/'],
 ['Ayushma', 'Jugernaugh', 'scored', '5/'],
 ['Brayden', 'Labaguette', 'scored', '7/']]
[('Ali Oop', 7),
 ('Alison Ralison', 8),
 ('Ambily Piturbed', 8),
 ('Arshan Risnot Farquared', 5),
 ('Ayushma Jugernaugh', 5),
 ('Brayden Labaguette', 7)]
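
The slicing approach above depends on every line ending with exactly the same number of tokens. As an alternative sketch, a regular expression can name the parts it extracts. The pattern below is an assumption based on the sample lines shown here, not on E-learn's actual report format, and only two of the sample lines are repeated to keep the block short.

```python
import re

quiz_scores = """
  1.                 Ali Oop scored  7/ 8 = 87%


  4.  Arshan Risnot Farquared scored  5/ 8 = 62%
"""

# Assumed line shape: "<n>. <full name> scored <score>/ <total> = <percent>%"
LINE_PATTERN = re.compile(r'^\s*\d+\.\s+(?P<name>.+?)\s+scored\s+(?P<score>\d+)\s*/')

scores = [
    (match.group('name'), int(match.group('score')))
    for match in map(LINE_PATTERN.match, quiz_scores.splitlines())
    if match  # blank lines do not match and are skipped
]
print(scores)
```

Named groups make it clearer which part of the line each value comes from, at the cost of a pattern that has to be maintained if the report layout changes.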


## Records
A *record* is a compound data value - a collection of simpler data values (fields) that all describe a single entity.

 * tuple
 * dictionary
 * object

Problem: develop the data representation for a `student` in a student record system,
where a `student` has a first, middle, and last name, a student id, and a date of birth.

In [30]:
# Tuple
tuple_students = [
    ('Bob', '', 'Squarepants', 123456789, datetime.date(year=1994, month=2, day=25)),
    ('Dora', 'The', 'Explora', 192837465, datetime.date(year=2000, month=8, day=14))
]

s = tuple_students[-1]
s[4]  # must know that the DOB is at index 4 to access the correct data
age = datetime.date.today() - s[4]  # calculate age in days
age.days // 365  # approximate age in years (ignores leap days)

# Dictionary
dict_students = [
    {
        'first': 'Bob',
        'middle': '',
        'last': 'Squarepants',
        'sn': 123456789,
        'dob': datetime.date(year=1994, month=2, day=25)
    },
    {
        'first': 'Dora',
        'middle': 'The',
        'last': 'Explora',
        'sn': 192837465,
        'dob': datetime.date(year=2000, month=8, day=14)
    }
]

s = dict_students[-1]
s['dob']

# We can use a list comprehension to convert each tuple record into a dictionary record
students = [{'first':s[0], 'middle':s[1], 'last':s[2], 'sn':s[3], 'dob':s[4]} for s in tuple_students]
students

[{'first': 'Bob',
  'middle': '',
  'last': 'Squarepants',
  'sn': 123456789,
  'dob': datetime.date(1994, 2, 25)},
 {'first': 'Dora',
  'middle': 'The',
  'last': 'Explora',
  'sn': 192837465,
  'dob': datetime.date(2000, 8, 14)}]
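
Dividing a day count by 365 drifts over time because of leap years. Below is a sketch of an exact calculation on a dictionary record. The helper name `age_in_years` and its `today` parameter are hypothetical, added here so the result is reproducible.

```python
import datetime

def age_in_years(dob, today=None):
    """Return the number of completed years between dob and today (hypothetical helper)."""
    today = today or datetime.date.today()
    # Subtract one if this year's birthday has not happened yet
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

dora = {'first': 'Dora', 'dob': datetime.date(year=2000, month=8, day=14)}
print(age_in_years(dora['dob'], today=datetime.date(2023, 1, 22)))
```

Accessing `dora['dob']` by key also avoids having to remember that the date of birth sits at index 4 of the tuple version.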

In [31]:
# Objects
from dataclasses import dataclass

@dataclass
class Student:
    first: str
    middle: str
    last: str
    sn: int
    dob: datetime.date

    def full_name(self):
        return f'{self.first} {self.last}'

@dataclass
class SkilledStudent(Student):  # Example of inheritance --> next week
    skill: str

students = [
    Student('Bob', '', 'Squarepants', 123456789, datetime.date(year=1994, month=2, day=25)),
    SkilledStudent('Dora', 'The', 'Explora', 192837465, datetime.date(year=2000, month=8, day=14), skill="Exploring")
]

dora = [s for s in students if s.first=='Dora'][0]
dora.full_name()

dora.first = 'Fred'  # Change Dora's first name to Fred
dora

# dora.skill = 'Exploring'  # Adds a skill attribute to this one instance only; the Student class is unaffected. NEVER do this
# dora.skill

# bob = students[0]
# bob.skill  # Returns an error b/c Bob does not have the skill attribute

SkilledStudent(first='Fred', middle='The', last='Explora', sn=192837465, dob=datetime.date(2000, 8, 14), skill='Exploring')
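
A short sketch of what `@dataclass` provides beyond field definitions, using a cut-down `Student` with fewer fields so the block stands alone. `dataclasses.asdict` and `dataclasses.replace` are standard-library helpers; the notebook itself does not use them.

```python
import datetime
from dataclasses import dataclass, asdict, replace

@dataclass
class Student:  # cut-down stand-in for the class above
    first: str
    last: str
    dob: datetime.date

dora = Student('Dora', 'Explora', datetime.date(2000, 8, 14))

# __eq__ is generated: two instances with equal fields compare equal
same = dora == Student('Dora', 'Explora', datetime.date(2000, 8, 14))

# replace() returns a modified copy and leaves the original untouched
fred = replace(dora, first='Fred')

# asdict() converts the object record back into a dictionary record
record = asdict(dora)

print(same, fred.first, dora.first, record['last'])
```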

**Data types** such as `int`, `float`, and `bool` are indivisible; such types are known as *atomic*.

Note that in Python, multi-character strings are not indivisible, because they can be divided into individual characters. A `str` containing a single character is *atomic*.

Returning to our `Student` class from above, we can see that the `Student` object can be divided into smaller parts. This is known as a *compound* data type. **Classes** allow us to define our own data types.

In [32]:
type('hello')  # Returns str --> NOT atomic
type('a')  # Returns str --> atomic
type(42)  # Returns int --> atomic
type(True)  # Returns bool --> atomic

type(dora)  # Returns SkilledStudent (a subclass of Student) --> NOT atomic
type(Student)  # Returns type
type(type)  # Returns type...

type
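
Because `dora` was created as a `SkilledStudent`, `type()` reports the subclass, while `isinstance()` also accepts parent classes. A minimal sketch, using stripped-down stand-in classes so the block runs on its own:

```python
from dataclasses import dataclass

@dataclass
class Student:  # stand-in with a single field, for illustration only
    first: str

@dataclass
class SkilledStudent(Student):
    skill: str

dora = SkilledStudent('Dora', skill='Exploring')

print(type(dora).__name__)         # the exact class of the object
print(isinstance(dora, Student))   # True: a SkilledStudent is also a Student
print(type(dora) is Student)       # False: type() does not look at parent classes
print(isinstance(Student, type))   # True: classes are themselves objects of type `type`
```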