# Programming and Data Analysis

> Data Types and Structures in Python

Kuo, Yao-Jen <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com/)

## Variables

## A variable is a name that refers to a value

```python
variable_name = literal_value
```

## Choose names for our variables: don'ts

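- Do not reuse the names of built-in functions such as `print`; doing so hides the built-in.
- Do not use [keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords); Python raises a `SyntaxError`.
- Do not start a name with a digit; that is also a `SyntaxError`.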

Source: <https://www.python.org/dev/peps/pep-0008/>
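
The three rules above can be checked programmatically. The following is a minimal sketch, not an official tool: `check_name` is a hypothetical helper written for this slide, built on the standard library's `keyword` and `builtins` modules and the `str.isidentifier()` method.

```python
import builtins
import keyword


def check_name(name):
    """Return a list of reasons why `name` is a poor choice for a variable name."""
    problems = []
    if not name.isidentifier():
        # e.g. names that start with a digit or contain spaces
        problems.append("not a valid identifier")
    if keyword.iskeyword(name):
        problems.append("is a keyword")
    if hasattr(builtins, name):
        # assigning to this name would hide a built-in such as print or len
        problems.append("shadows a built-in")
    return problems


print(check_name("print"))
print(check_name("for"))
print(check_name("1st_place"))
print(check_name("first_place"))
```

An empty list means the name passed all three checks; it says nothing about whether the name is meaningful, which is still up to us.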

## If you accidentally replace a built-in function with a variable, use `del` to restore it

```python
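print = 5566            # the name print now refers to an int, not the built-in function
print("Hello, world!")  # raises TypeError: 'int' object is not callable
#del print              # uncomment to delete our variable and expose the built-in again
#print("Hello, world!")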
```

## Choose names for our variables: dos

- Use a lowercase single letter, word, or words.
- Separate words with underscores to improve readability (so-called snake case).
- Be meaningful.

Source: <https://www.python.org/dev/peps/pep-0008/>

## Using `#` to write comments in our program

A comment can appear on a line by itself or at the end of a line of code.

In [1]:
# turn fahrenheit to celsius
def from_fahrenheit_to_celsius(x):
    out = (x - 32) * 5/9
    return out

print(from_fahrenheit_to_celsius(32))  # turn 32 fahrenheit to celsius
print(from_fahrenheit_to_celsius(212)) # turn 212 fahrenheit to celsius

0.0
100.0


## Everything from `#` to the end of the line is ignored during execution

## Data Types

## Values belong to different types; we commonly use

- `int` and `float` for numeric computing.
- `str` for text.
- `bool` for conditionals.
- `NoneType` for the absence of a value.

## Use the `type` function to check the type of a value or variable

In [2]:
print(type(5566))
print(type(42.195))
print(type("Hello, world!"))
print(type(True))
print(type(False))
print(type(None))

<class 'int'>
<class 'float'>
<class 'str'>
<class 'bool'>
<class 'bool'>
<class 'NoneType'>


## How to form a `str`?

Use paired `'`, `"`, or `"""` to enclose a sequence of characters.

In [3]:
str_with_single_quotes = 'Hello, world!'
str_with_double_quotes = "Hello, world!"
str_with_triple_double_quotes = """Hello, world!"""
print(type(str_with_single_quotes))
print(type(str_with_double_quotes))
print(type(str_with_triple_double_quotes))

<class 'str'>
<class 'str'>
<class 'str'>


## A quote character inside a `str` that matches its delimiters can cause a `SyntaxError`

```python
mcd = 'I'm lovin' it!'
```

## Use `\` to escape the quote, or delimit the string with paired `"` or `"""`

In [4]:
mcd = 'I\'m lovin\' it!'
mcd = "I'm lovin' it!"
mcd = """I'm lovin' it!"""

## We've seen arithmetic operators for numeric values

How about those for `str`?

## `str` type takes `+` and `*`

- `+` for concatenation.
- `*` for repetition.

In [5]:
mcd = "I'm lovin' it!"
print(mcd)
print(mcd + mcd)
print(mcd * 3)

I'm lovin' it!
I'm lovin' it!I'm lovin' it!
I'm lovin' it!I'm lovin' it!I'm lovin' it!


## Format our `str` printouts

- The `.format()` way.
- The `f-string` way.

## The `.format()` way: uses `{}` as placeholders that `.format()` fills in

In [6]:
def hello_anyone(anyone):
    out = "Hello, {}!".format(anyone)
    return out

print(hello_anyone("Anakin Skywalker"))
print(hello_anyone("Luke Skywalker"))

Hello, Anakin Skywalker!
Hello, Luke Skywalker!


## The `f-string` way: uses `{}` to embed variables directly in the string

In [7]:
def hello_anyone(anyone):
    out = f"Hello, {anyone}!"
    return out

print(hello_anyone("Anakin Skywalker"))
print(hello_anyone("Luke Skywalker"))

Hello, Anakin Skywalker!
Hello, Luke Skywalker!


## Commonly used format

- `{:.nf}` for a float with `n` decimal places, e.g. `{:.2f}`.
- `{:,}` for a comma as the thousands separator.

In [8]:
def format_pi(pi):
    return f"{pi:.2f}"

print(format_pi(3.1415))
print(format_pi(3.141592))

3.14
3.14


In [9]:
def format_krw(ntd):
    krw = ntd * 42.67
    return f"{ntd:,} NTD to {krw:,.0f} KRW."

print(format_krw(1000))
print(format_krw(5000))

1,000 NTD to 42,670 KRW.
5,000 NTD to 213,350 KRW.


## How to form a `bool`?

- Use keywords `True` and `False` directly.
- Use relational operators.
- Use logical operators.

## Use keywords `True` and `False` directly

In [10]:
print(True)
print(type(True))
print(False)
print(type(False))

True
<class 'bool'>
False
<class 'bool'>


## Use relational operators

We commonly use `==`, `!=`, `>`, `<`, `>=`, `<=` to compare values and `in`, `not in` to test membership; each of these expressions evaluates to a `bool`.

In [11]:
print(5566 == 5566.0)
print(5566 != 5566.0)
print('56' in '5566')

True
False
True


## Use logical operators

- We have `and`, `or`, `not` as common logical operators to combine `bool` values.
- `and` gives `True` only if both sides are `True`.
- `or` gives `False` only if both sides are `False`.

In [12]:
print(True and True)  # get True only when both sides are True
print(True and False)
print(False and False)
print(True or True)
print(True or False)
print(False or False) # get a False only when both sides are False
# use of not is quite straight-forward
print(not True)
print(not False)

True
False
False
True
True
False
False
True


## An example of using logical operators

Good marathon weather is often described as dry **and** cold. Say the probabilities of dry and of cold on race day are both 50% and the two are independent; then there is a 25% chance of good marathon weather.

In [13]:
def is_good_marathon_weather(is_dry, is_cold):
    return is_dry and is_cold

print(is_good_marathon_weather(True, True))
print(is_good_marathon_weather(True, False))
print(is_good_marathon_weather(False, True))
print(is_good_marathon_weather(False, False))

True
False
False
False


## An example of using logical operators (cont'd)

Suppose instead that good marathon weather means dry **or** cold. With the same independent 50% probabilities of dry and of cold on race day, there is a 75% chance of good marathon weather.

In [14]:
def is_good_marathon_weather(is_dry, is_cold):
    return is_dry or is_cold

print(is_good_marathon_weather(True, True))
print(is_good_marathon_weather(True, False))
print(is_good_marathon_weather(False, True))
print(is_good_marathon_weather(False, False))

True
True
True
False
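
The 25% and 75% figures can be checked by listing all four dry/cold combinations. This is a small sketch that assumes the two conditions are independent and each has a 50% chance, so every combination is equally likely; the names `outcomes`, `prob_and`, and `prob_or` are introduced here for illustration.

```python
from itertools import product

# All equally likely (is_dry, is_cold) combinations under the 50%/50% independence assumption
outcomes = list(product([True, False], repeat=2))

# True counts as 1 and False as 0 when summed
prob_and = sum(is_dry and is_cold for is_dry, is_cold in outcomes) / len(outcomes)
prob_or = sum(is_dry or is_cold for is_dry, is_cold in outcomes) / len(outcomes)

print(prob_and)  # 0.25
print(prob_or)   # 0.75
```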


## `bool` is quite useful in control flow and filtering data.

## Python has a special type, the `NoneType`, with a single value, `None`

- It is used to represent the absence of a value.
- It is not the same as `False`, an empty string `''`, or `0`.

In [15]:
a_none_type = None
print(type(a_none_type))
print(a_none_type == False)
print(a_none_type == '')
print(a_none_type == 0)
print(a_none_type == None)

<class 'NoneType'>
False
False
False
True
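
Because `None`, `False`, `''`, and `0` are all treated as false in a condition, a plain truth test cannot tell them apart. A common idiom is to test for `None` by identity with `is`. The sketch below illustrates this; `describe` is a hypothetical helper, not part of Python.

```python
def describe(value):
    """Distinguish a missing value (None) from other values that count as false."""
    if value is None:  # identity check: only None itself passes
        return "no value"
    if not value:      # False, 0, and '' end up here
        return "falsy but defined"
    return "truthy"


print(describe(None))
print(describe(0))
print(describe(''))
print(describe("0"))
```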


## A function without a `return` statement actually returns `None`.

In [16]:
def hello_anyone(anyone):
    print(f"Hello, {anyone}!")

hello_anyone("Anakin Skywalker")
hello_anyone("Luke Skywalker")

Hello, Anakin Skywalker!
Hello, Luke Skywalker!


In [17]:
func_out = hello_anyone("Anakin Skywalker")
type(func_out)

Hello, Anakin Skywalker!


NoneType

## Values can be converted between types using built-in functions

- `int()` for converting to `int`.
- `float()` for converting to `float`.
- `str()` for converting to `str`.
- `bool()` for converting to `bool`.

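## Upcasting (to a more general type) works one step at a time

`NoneType` -> `bool` -> `int` -> `float` -> `str`.

Each adjacent conversion in this chain succeeds, but skipping steps is not guaranteed to work: `int(None)`, for example, raises a `TypeError`.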

In [18]:
print(bool(None))
print(int(True))
print(float(1))
print(str(1.0))

False
1
1.0
1.0


## Downcasting (to a more specific type) needs a second look

In [19]:
print(float('1.0'))
print(int('1'))
print(bool('False'))
print(bool('NoneType'))

1.0
1
True
True
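
The last two results deserve the second look: `bool()` does not read the text of a string, it returns `True` for any non-empty string. In the same spirit, `int()` rejects text that looks like a float. The sketch below shows both; `try_convert` is a hypothetical helper that reports the exception name instead of stopping the program.

```python
def try_convert(converter, value):
    """Apply converter to value; return the exception name if the conversion fails."""
    try:
        return converter(value)
    except ValueError as error:
        return type(error).__name__


print(try_convert(bool, "False"))       # True: the string is not empty
print(try_convert(bool, ""))            # False: only the empty string is false
print(try_convert(int, "1.0"))          # ValueError: int() does not parse a decimal point
print(try_convert(int, float("1.0")))   # 1: convert to float first, then to int
print("False" == "True")                # False: comparing text is one way to read a bool from a str
```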


## Data Structures

## What is a data structure?

> In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

Source: <https://en.wikipedia.org/wiki/Data_structure>

## Why data structure?

A software engineer's main job is to perform operations on data. We can simplify those operations into three steps:

1. Take some input
2. Process it
3. Return the output

This is quite similar to the definition of a function.

## To make the process efficient, we need to choose a suitable data structure

A data structure decides how and where we store the data to be processed. A good choice of data structure makes processing more efficient.
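
As one illustration of how the choice matters, the sketch below times a membership test on a `list` and on a `set` holding the same numbers. The size `n` and the repetition count are arbitrary choices made for this example, and the measured times depend on the machine, so no specific numbers are shown. On typical hardware the `set` lookup is much faster, because a `list` is scanned element by element while a `set` uses hashing.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # the last element: the list has to scan all the way to the end

# Time 100 membership tests on each structure
list_seconds = timeit.timeit(lambda: target in as_list, number=100)
set_seconds = timeit.timeit(lambda: target in as_set, number=100)

print(f"list: {list_seconds:.6f} s, set: {set_seconds:.6f} s")
```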

## We will talk about 4 built-in data structures in Python

- `list`
- `tuple`
- `dict` as in dictionary
- `set`

## Built-in data structures are those that need no self-definition or importing

This is quite similar to the distinction between built-in functions and self-defined or third-party functions.

## Built-in Data Structure: `list`

## Lists

Lists are the basic ordered and mutable data collection type in Python. They can be defined with comma-separated values between square brackets.

In [20]:
primes = [2, 3, 5, 7, 11]
print(type(primes)) # use type() to check type
print(len(primes))  # use len() to check how many elements are stored in the list

<class 'list'>
5


## Lists have a number of useful methods

- `.append()`
- `.pop()`
- `.remove()`
- `.insert()`
- `.sort()`
- ...etc.

In a notebook environment, we can press `TAB` for method completion and `SHIFT + TAB` for documentation prompts.

In [21]:
primes.append(13) # appending an element to the end of a list
print(primes)
primes.pop() # popping out the last element of a list
print(primes)
primes.remove(2) # removing the first occurrence of an element within a list
print(primes)
primes.insert(0, 2) # inserting certain element at a specific index
print(primes)
primes.sort(reverse=True) # sorting a list, reverse=False => ascending order; reverse=True => descending order
print(primes)

[2, 3, 5, 7, 11, 13]
[2, 3, 5, 7, 11]
[3, 5, 7, 11]
[2, 3, 5, 7, 11]
[11, 7, 5, 3, 2]


## Python provides access to elements in compound types through

- **indexing** for a single element
- **slicing** for multiple elements

## Python uses zero-based indexing

In [22]:
primes.sort()
print(primes[0]) # the first element
print(primes[1]) # the second element

2
3


## Elements at the end of the list can be accessed with negative numbers, starting from -1

In [23]:
print(primes[-1]) # the last element
print(primes[-2]) # the second last element

11
7


## While indexing fetches a single value from the list, slicing fetches multiple values as a sub-list

A slice takes three parameters:

- start (inclusive)
- stop (exclusive)
- step

```python
# slicing syntax
OUR_LIST[start:stop:step]
```

In [24]:
print(primes[0:3:1]) # slicing the first 3 elements
print(primes[-3:len(primes):1]) # slicing the last 3 elements 
print(primes[0:len(primes):2]) # slicing every second element

[2, 3, 5]
[5, 7, 11]
[2, 5, 11]


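## If a slice parameter is left out, it defaults to

- start: 0
- stop: the length of the list, so the slice runs through the last element
- step: 1

With a negative step the defaults flip: the slice starts at the last element and runs back through the first.

So we can do the same slicing with defaults: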

In [25]:
print(primes[:3]) # slicing the first 3 elements
print(primes[-3:]) # slicing the last 3 elements 
print(primes[::2]) # slicing every second element
print(primes[::-1]) # a particularly useful tip: a negative step reverses the list

[2, 3, 5]
[5, 7, 11]
[2, 5, 11]
[11, 7, 5, 3, 2]
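
Leaving out `stop` is not the same as writing `-1`: an explicit `-1` stops before the last element. The sketch below shows the difference and uses the built-in `slice` object's `.indices()` method to reveal the concrete values Python fills in for a list of a given length.

```python
primes = [2, 3, 5, 7, 11]

print(primes[:-1])                          # [2, 3, 5, 7]: an explicit -1 excludes the last element
print(primes[:] == primes[0:len(primes):1]) # True: the omitted stop is the length of the list

# slice(start, stop, step).indices(length) returns the resolved (start, stop, step)
print(slice(None, None, None).indices(len(primes)))  # (0, 5, 1)
print(slice(None, None, -1).indices(len(primes)))    # (4, -1, -1)
```

In the last line the resolved stop of `-1` is an internal end marker meaning "before index 0"; it is not the same as typing `-1` in the brackets.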


## Built-in Data Structure: `tuple`

## Tuples

Tuples are in many ways similar to lists, but they are defined with parentheses rather than square brackets.

In [26]:
primes = (2, 3, 5, 7, 11)
print(type(primes)) # use type() to check type
print(len(primes))  # use len() to check how many elements are stored in the tuple

<class 'tuple'>
5


## The main distinguishing feature of tuples is that they are immutable

Once they are created, their size and contents cannot be changed.

In [27]:
primes = [2, 3, 5, 7, 11]
primes[-1] = 13
print(primes)
primes = tuple(primes)

[2, 3, 5, 7, 13]


In [28]:
try:
    primes[-1] = 11
except TypeError as e:
    print(e)

'tuple' object does not support item assignment


## Use TAB to see whether a tuple has any methods that modify it

```python
primes.<TAB>
```

## Tuples are often used in Python programs, for example by functions that return multiple values

In [29]:
def get_locale(country, city):
    return country, city

print(get_locale("Taiwan", "Taipei"))
print(type(get_locale("Taiwan", "Taipei")))

('Taiwan', 'Taipei')
<class 'tuple'>


## Multiple return values can also be individually assigned

In [30]:
my_country, my_city = get_locale("Taiwan", "Taipei")
print(my_country)
print(my_city)

Taiwan
Taipei


## Built-in Data Structure: `dict`

## Dictionaries

Dictionaries are extremely flexible mappings of keys to values, and form the basis of much of Python's internal implementation. They can be created via a comma-separated list of `key:value` pairs within curly braces.

In [31]:
the_celtics = {
    'isNBAFranchise': True,
    'city': "Boston",
    'fullName': "Boston Celtics",
    'tricode': "BOS",
    'teamId': 1610612738,
    'nickname': "Celtics",
    'confName': "East",
    'divName': "Atlantic"
}

print(type(the_celtics))
print(len(the_celtics))

<class 'dict'>
8


## Elements are accessed by key rather than by zero-based position

In [32]:
print(the_celtics['city'])
print(the_celtics['confName'])
print(the_celtics['divName'])

Boston
East
Atlantic
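
A key that is not in the dictionary raises a `KeyError`. The sketch below shows that error and the `.get()` method, which returns `None` or a chosen default instead. The small dictionary `team` and the key `'coach'` are made up for this example so that the slide's `the_celtics` is left untouched.

```python
team = {'city': "Boston", 'confName': "East"}

try:
    team['coach']  # this key does not exist
except KeyError as error:
    print(f"missing key: {error}")

print(team.get('coach'))             # None: no error is raised
print(team.get('coach', "unknown"))  # a default value of our choice
print(team.get('city', "unknown"))   # an existing key returns its value
```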


## A new key:value pair can be added by assignment

In [33]:
the_celtics['isMyFavorite'] = True
print(the_celtics)

{'isNBAFranchise': True, 'city': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': 1610612738, 'nickname': 'Celtics', 'confName': 'East', 'divName': 'Atlantic', 'isMyFavorite': True}


## Use `del` to remove a key:value pair from a dictionary

In [34]:
del the_celtics['isMyFavorite']
print(the_celtics)

{'isNBAFranchise': True, 'city': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': 1610612738, 'nickname': 'Celtics', 'confName': 'East', 'divName': 'Atlantic'}


## Common methods called on dictionaries

- `.keys()`
- `.values()`
- `.items()`

In [35]:
print(the_celtics.keys())
print(the_celtics.values())
print(the_celtics.items())

dict_keys(['isNBAFranchise', 'city', 'fullName', 'tricode', 'teamId', 'nickname', 'confName', 'divName'])
dict_values([True, 'Boston', 'Boston Celtics', 'BOS', 1610612738, 'Celtics', 'East', 'Atlantic'])
dict_items([('isNBAFranchise', True), ('city', 'Boston'), ('fullName', 'Boston Celtics'), ('tricode', 'BOS'), ('teamId', 1610612738), ('nickname', 'Celtics'), ('confName', 'East'), ('divName', 'Atlantic')])
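
These methods are most often used in loops and membership tests. The following sketch uses a small dictionary named `team`, a made-up subset of the one above, to walk through the pairs returned by `.items()` and to show that `in` checks keys unless we ask for `.values()`.

```python
team = {'city': "Boston", 'nickname': "Celtics", 'confName': "East"}

# .items() yields (key, value) tuples that can be unpacked in a for loop
lines = []
for key, value in team.items():
    lines.append(f"{key}: {value}")
print("\n".join(lines))

print('city' in team)             # True: in looks at the keys
print("Boston" in team)           # False: "Boston" is a value, not a key
print("Boston" in team.values())  # True: search the values explicitly
```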


## Built-in Data Structure: `set`

## Sets

The fourth basic collection is the set, an unordered collection of unique items. Sets are defined much like lists and tuples, except that they use curly braces.

In [36]:
primes = {2, 3, 5, 7, 11}
odds = {1, 3, 5, 7, 9}
print(type(primes))
print(len(odds))

<class 'set'>
5
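
Two consequences of this definition are worth seeing in code: duplicates disappear when a set is built, and empty curly braces create a `dict`, not a `set`. The sketch below uses made-up numbers in a list called `numbers`.

```python
numbers = [3, 1, 3, 2, 1]
unique_numbers = set(numbers)  # duplicates are dropped

print(len(numbers), len(unique_numbers))  # 5 3
print(sorted(unique_numbers))             # [1, 2, 3]: sorted() returns a list in a fixed order

print(type({}))     # <class 'dict'>: empty braces mean an empty dictionary
print(type(set()))  # <class 'set'>: use set() for an empty set
```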


## Python's sets support operations such as union, intersection, difference, and symmetric difference

## Union: elements appearing in either set

In [37]:
print(primes | odds)      # with an operator
print(primes.union(odds)) # equivalently with a method

{1, 2, 3, 5, 7, 9, 11}
{1, 2, 3, 5, 7, 9, 11}


## Intersection: elements appearing in both

In [38]:
print(primes & odds)             # with an operator
print(primes.intersection(odds)) # equivalently with a method

{3, 5, 7}
{3, 5, 7}


## Difference: elements in primes but not in odds

In [39]:
print(primes - odds)           # with an operator
print(primes.difference(odds)) # equivalently with a method

{2, 11}
{2, 11}


## Symmetric difference: items appearing in only one set

In [40]:
print(sorted((primes - odds) | (odds - primes))) # union two differences
print(primes ^ odds)                             # with an operator
print(primes.symmetric_difference(odds))         # equivalently with a method

[1, 2, 9, 11]
{1, 2, 9, 11}
{1, 2, 9, 11}


## One of the powerful features of Python's compound objects is that they can contain objects of any type, or even a mix of types

## Take data.nba.net for example

The official NBA API returns nested dictionaries that contain other dictionaries and lists as values.

Source: <https://data.nba.net/10s/prod/v1/today.json>
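
To show how such nesting is navigated, here is a small hand-written structure. It is only a sketch: the layout and key names of `nested` are made up for illustration and are not the actual format of the API response. Each level is reached by chaining a key or an index.

```python
# Hypothetical structure for illustration; not the real API layout
nested = {
    'league': "NBA",
    'teams': [
        {'nickname': "Celtics", 'city': "Boston", 'tags': {"East", "Atlantic"}},
        {'nickname': "Bulls", 'city': "Chicago", 'tags': {"East", "Central"}},
    ],
}

first_team = nested['teams'][0]  # dict -> list -> dict
print(first_team['city'])

nicknames = [team['nickname'] for team in nested['teams']]
print(nicknames)

print("Atlantic" in nested['teams'][0]['tags'])  # dict -> list -> dict -> set
```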