# Python Data Processing

## Comprehensions

#### Example: Regression

* $y$ is a quantity
    * $y \in \mathbb{R}$
        * $\mathbb{R}$ is approximately `float`

In [1]:
from random import randint

In [4]:
X = [
    (60, 120, 200),  # hr, bp, caffeine
    (randint(55, 100), randint(100, 140), 100 * randint(1, 10)),
    (randint(55, 100), randint(100, 140), 100 * randint(1, 10)),
    (randint(55, 100), randint(100, 140), 100 * randint(1, 10)),
]

In [7]:
y = [ # was sleep good?
    True, 
    True,
    False, 
    False
]

In [8]:
X

[(60, 120, 200), (97, 119, 300), (76, 112, 600), (60, 104, 300)]

#### Data Simulation: Generating $X$

In [10]:
[ i for i in range(0, 4)]

[0, 1, 2, 3]

By convention, a variable whose value we do not need is named `_`,

In [12]:
[ _ for _ in range(0, 4)]

[0, 1, 2, 3]

We can use `range` to generate `N` data points by ignoring the output of `range` and just writing the generation code, 

In [22]:
X = [ (randint(60, 100), randint(100, 200), randint(0, 600)) for _ in range(0, 4)]

In [16]:
print(X)

[(93, 103, 600), (89, 133, 176), (96, 139, 534), (72, 102, 22)]
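Because `randint` draws new values on every run, the outputs shown in this notebook change from run to run. A minimal sketch of making the simulation repeatable, assuming a fixed seed is acceptable (the seed value `0` and the helper name `simulate` are illustrative choices, not part of the original notebook):

```python
from random import randint, seed

def simulate(n):
    """Generate n rows of three features using the same ranges as above."""
    return [
        (randint(60, 100), randint(100, 200), randint(0, 600))
        for _ in range(n)
    ]

seed(0)               # fix the random state (0 is an arbitrary choice)
first = simulate(4)

seed(0)               # resetting the seed replays the same draws
second = simulate(4)

print(first == second)  # True: same seed, same data
```

Without the calls to `seed`, two calls to `simulate(4)` would almost certainly return different rows.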


#### Computing $y$ from $X$

Consider computing scores across $X$ based on some weighting factors, 

In [18]:
[ x for x in X ]

[(93, 103, 600), (89, 133, 176), (96, 139, 534), (72, 102, 22)]

In [40]:
[  2*x0 + 3*x1 + 4*x2 for (x0, x1, x2) in X ]

[1217, 1926, 1299, 3027]

If these scores lie between `0` and `1`, we can interpret them as the *probability* of $y$ being `True`,

In [33]:
[  x0/100 + x1/200 + x2/600 for (x0, x1, x2) in X ]

[1.9749999999999999, 2.356666666666667, 1.815, 2.631666666666667]

In [34]:
[  1/3 * x0/100 + 1/3 * x1/200 + 1/3 * x2/600 for (x0, x1, x2) in X ]

[0.6583333333333333, 0.7855555555555555, 0.605, 0.8772222222222221]

Above we weight each feature of $x$ *the same* (i.e., in the ratio `1 : 1 : 1`). We would like to weight them differently,

In [37]:
w = [1, 2, 3]
b = -1000
[ (w[0]*x0) + (w[1]*x1) + (w[2]*x2) + b for (x0, x1, x2) in X ]

[-175, 352, -100, 1187]

To return to probabilities, we can `scale` and `weight`, 

In [38]:
scale = [(1/3 * 1/100), (1/3 * 1/200), (1/3 * 1/600)]

w = [0.2, 0.4, 0.4]
b = -0.2

[ (w[0]*scale[0]*x0) + (w[1]*scale[1]*x1) + (w[2]*scale[2]*x2) + b for (x0, x1, x2) in X ]

[0.0026666666666666505,
 0.047555555555555545,
 -0.021999999999999992,
 0.10222222222222221]

In practice, weights may often include the scale factor,

In [39]:
w = [0.2 * (1/3 * 1/100), 0.4 * (1/3 * 1/200), 0.4 * (1/3 * 1/600)]
b = -0.2

[ (w[0]*x0) + (w[1]*x1) + (w[2]*x2) + b for (x0, x1, x2) in X ]

[0.0026666666666666505,
 0.047555555555555545,
 -0.021999999999999992,
 0.10222222222222221]

The predictions for $y$, i.e., $\hat{y}$, are whether these scores are `< 0`,

In [41]:
w = [0.2 * (1/3 * 1/100), 0.4 * (1/3 * 1/200), 0.4 * (1/3 * 1/600)]
b = -0.2

[ (w[0]*x0) + (w[1]*x1) + (w[2]*x2) + b < 0 for (x0, x1, x2) in X ]

[False, False, True, False]
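The cells above repeat the same weighted-sum expression several times. As a sketch, the pattern can be wrapped in two small functions. The names `score` and `predict` are hypothetical, `X` is the sample printed earlier rather than fresh random data, and `zip` (covered under Looping Idioms) pairs each weight with its feature:

```python
def score(x, w, b):
    """Weighted sum of one row's features plus the offset b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(X, w, b):
    """One boolean per row: True where the score is below zero."""
    return [score(x, w, b) < 0 for x in X]

X = [(93, 103, 600), (89, 133, 176), (96, 139, 534), (72, 102, 22)]
w = [1, 2, 3]
b = -1000

print([score(x, w, b) for x in X])  # [1099, -117, 976, -658]
print(predict(X, w, b))             # [False, True, False, True]
```

Writing the sum with `zip` means the same function works for any number of features, as long as `w` and each row have the same length.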

## Further Comprehensions

#### Filters

In [44]:
X = [
    (1, 2),
    (None, 3),
    (1.2, 2.4),
]

A decision tree formula is a formula involving logical operations, 

In [43]:
[ (x0 <3) and (x1 > 2) for x0, x1 in X]

[False, True]

There is missing data,

In [47]:
[ (x0 <3) and (x1 > 2) for x0, x1 in X]

TypeError: '<' not supported between instances of 'NoneType' and 'int'

Let's filter out the rows with missing data,

In [48]:
[ (x0 <3) and (x1 > 2) 
     for x0, x1 in X 
     if x0 is not None
]

[False, True]
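Filtering drops rows, so the result no longer lines up one-to-one with the original data. One alternative, sketched here under the assumption that `None` is an acceptable placeholder for an unknown result, is a conditional expression in the output position of the comprehension:

```python
X = [
    (1, 2),
    (None, 3),
    (1.2, 2.4),
]

# `if` at the end filters rows; `... if ... else ...` at the front
# chooses a value per row and keeps every row.
aligned = [
    ((x0 < 3) and (x1 > 2)) if x0 is not None else None
    for x0, x1 in X
]

print(aligned)  # [False, None, True]
```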

#### Dictionary Comprehensions

In [49]:
[ c for c in "Michael"]

['M', 'i', 'c', 'h', 'a', 'e', 'l']

A dictionary comprehension has the syntax, `{ k : v ... for ... in old }`, 

In [51]:
{ c : "Michael".index(c) for c in "Michael"}

{'M': 0, 'i': 1, 'c': 2, 'h': 3, 'a': 4, 'e': 5, 'l': 6}

You need some formula that produces a key `k` and a value `v` from each element of an existing collection, e.g., a list,

In [54]:
{ x0 : x1 for x0, x1  in X }

{1: 2, None: 3, 1.2: 2.4}
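Dictionary keys are unique, so if the key formula produces the same key twice, the later value overwrites the earlier one. `"Michael"` has no repeated letters, which hides this. The sketch below uses `"hello"` (an illustrative string, not from the notebook) to show the effect, and uses `enumerate` to get positions without calling `.index` for every character:

```python
word = "hello"

# .index(c) returns the FIRST position of c, so 'l' maps to 2
first_pos = {c: word.index(c) for c in word}

# enumerate yields (position, character); the later 'l' overwrites the earlier one
last_pos = {c: i for i, c in enumerate(word)}

print(first_pos)  # {'h': 0, 'e': 1, 'l': 2, 'o': 4}
print(last_pos)   # {'h': 0, 'e': 1, 'l': 3, 'o': 4}
```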

## Looping Idioms

Python has utility looping operations (aka iterators, aka "streaming operations") which enable different kinds of data processing patterns,

In [55]:
range(0, 10)

range(0, 10)

`range` for repetition,

In [60]:
[ "Ho" for _ in range(0, 3)]

['Ho', 'Ho', 'Ho']

`zip` for combination,

In [56]:
zip("Michael", [1, 2, 3, 4])

<zip at 0x7fc470713d40>
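A `zip` object is lazy, which is why the notebook shows only its address. Wrapping it in `list` reveals the pairs, and also shows that `zip` stops at the end of the shorter input:

```python
pairs = list(zip("Michael", [1, 2, 3, 4]))

# Only four pairs: the remaining letters of the string are ignored
print(pairs)  # [('M', 1), ('i', 2), ('c', 3), ('h', 4)]
```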

In [62]:
X = [1, 2, 3]
y = [True, True, False]

In [63]:
[ f"X is {x} and y is {y} " for x, y in zip(X, y) ]

['X is 1 and y is True ', 'X is 2 and y is True ', 'X is 3 and y is False ']

`enumerate` provides indexes, 

In [57]:
enumerate(X)

<enumerate at 0x7fc4604b87c0>

In [65]:
X = [
    (1, 2),
    (None, 3),
    (1.2, 2.4),
]

for i, x in enumerate(X):
    print(i, x)

0 (1, 2)
1 (None, 3)
2 (1.2, 2.4)


Using `list` we can convert an iterator to a list, which computes every element the iterator generates,

In [66]:
list(reversed(X))

[(1.2, 2.4), (None, 3), (1, 2)]
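One consequence of this streaming behaviour is that an iterator can be consumed only once. A short sketch with `zip` (the variable names are illustrative):

```python
pairs = zip([1, 2, 3], [True, True, False])

first_pass = list(pairs)   # consumes the iterator
second_pass = list(pairs)  # nothing left to produce

print(first_pass)   # [(1, True), (2, True), (3, False)]
print(second_pass)  # []
```

If the elements are needed more than once, store the result of `list(...)` and reuse that list.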

...sorting,

In [59]:
sorted(y)

[False, False, True, True]

In [67]:
sorted("Michael")

['M', 'a', 'c', 'e', 'h', 'i', 'l']

In [69]:
sorted("Michael", reverse=True)

['l', 'i', 'h', 'e', 'c', 'a', 'M']

Finding the minimum and maximum,

In [70]:
min("Michael")

'M'

In [71]:
max("Michael")

'l'

These are very useful when working with lists of tuples,

In [72]:
ranked = [
    (101, ["SOMEDATA", "SOMEMORE"]),
    (33, ["SOMEDATA", "SOMEMORE"]),
    (6, ["SOMEDATA", "SOMEMORE"])
]

`min` and `max` compare tuples element by element: the first element decides the result unless two tuples tie on it, in which case the next element is compared,

In [73]:
min(ranked)

(6, ['SOMEDATA', 'SOMEMORE'])

In [74]:
max(ranked)

(101, ['SOMEDATA', 'SOMEMORE'])
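When only one field should drive the comparison, `min` and `max` accept a `key` function. The sketch below reuses `ranked` and adds a small illustrative list (not from the notebook) to show what happens on a tie:

```python
ranked = [
    (101, ["SOMEDATA", "SOMEMORE"]),
    (33, ["SOMEDATA", "SOMEMORE"]),
    (6, ["SOMEDATA", "SOMEMORE"]),
]

# Compare on the rank only; the payload is never inspected
lowest = min(ranked, key=lambda row: row[0])
print(lowest)  # (6, ['SOMEDATA', 'SOMEMORE'])

# Without key=, a tie on the first element falls through to the second
tie_plain = min([(1, "b"), (1, "a")])
print(tie_plain)  # (1, 'a')

# With key=, a tie returns the first matching item instead
tie_keyed = min([(1, "b"), (1, "a")], key=lambda row: row[0])
print(tie_keyed)  # (1, 'b')
```

Using `key` also avoids a `TypeError` when the first elements tie and the remaining elements cannot be ordered, such as dictionaries.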

## Example

Consider a case of trying lots of different `w, b` values for a dataset and recording the error associated with each.

In [80]:
X_age = [18, 31, 23]
y_price = [2.3, 4.6, 3.2]

[x  for x in X_age ]

[18, 31, 23]

For example,

In [81]:
[ 0.1 * x + 1  for x in X_age ]

[2.8, 4.1, 3.3000000000000003]

In [83]:
w, b = 0.1, 1

[ w * x + b for x in X_age ]

[2.8, 4.1, 3.3000000000000003]

Let's compare with the known $y$, 

In [86]:
[(y, (w * x + b)) for x, y in zip(X_age, y_price) ]

[(2.3, 2.8), (4.6, 4.1), (3.2, 3.3000000000000003)]

The error is their difference, ignoring the sign,

In [94]:
[ abs(y - (w * x + b)) for x, y in zip(X_age, y_price) ]

[0.5, 0.5, 0.10000000000000009]

Let's try lots of different $w, b$, 

In [101]:
guesses = [guess/10 for guess in range(0, 10)]

guesses

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

We can pair up guesses to give us every combination of a weight guess `wg` and an offset guess `bg`,

In [108]:
[(wg, bg) for wg in [1, 2] for bg in [1, 2]]

[(1, 1), (1, 2), (2, 1), (2, 2)]

With ten guesses for each, this gives 100 pairs,

In [114]:
wb = [(wg, bg) for wg in guesses for bg in guesses ]

In [115]:
len(wb)

100

The first 5,

In [116]:
wb[:5]

[(0.0, 0.0), (0.0, 0.1), (0.0, 0.2), (0.0, 0.3), (0.0, 0.4)]
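The same pairing can be produced with the standard library's `itertools.product`. This is an alternative sketch rather than what the notebook uses; it yields the pairs in the same order as the nested `for` clauses:

```python
from itertools import product

guesses = [guess / 10 for guess in range(0, 10)]

nested = [(wg, bg) for wg in guesses for bg in guesses]
paired = list(product(guesses, repeat=2))

print(nested == paired)  # True
print(len(paired))       # 100
```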

In [118]:
trials = [ 
            ( sum([abs(y - (w * x + b)) for x, y in zip(X_age, y_price)]), (w, b) )
            for w, b in wb
]

In [121]:
trials[:2]

[(10.1, (0.0, 0.0)), (9.799999999999999, (0.0, 0.1))]

We can now search the `trials` history for the pair `w, b` with the lowest error,

In [120]:
min(trials)

(1.0, (0.1, 0.9))

...in one go, 

In [127]:
min(( sum(abs(y - (w/10 * x + b/10)) for x, y in zip(X_age, y_price))
     , (w, b)) 
    
    for w in range(0, 10) 
    for b in range(0, 10) 
)


(1.0, (1, 9))

Or,

In [132]:
def error(xs, ys, w, b):
    return sum(abs(y - (w * x + b)) for x, y in zip(xs, ys))

In [133]:
min(
    (error(X_age, y_price, w, b), (w, b)) 
    
    for w in [i/10 for i in range(0, 10)]
    for b in [i/10 for i in range(0, 10)]
)


(1.0, (0.1, 0.9))
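The search above is limited to steps of `0.1`. As a sketch, the grid resolution can be made a parameter. The function name `grid_search` and the `steps` argument are hypothetical, and unlike the cells above this grid includes the endpoint `1.0`:

```python
def error(xs, ys, w, b):
    """Sum of absolute differences between predictions and targets."""
    return sum(abs(y - (w * x + b)) for x, y in zip(xs, ys))

def grid_search(xs, ys, steps):
    """Try w and b at steps + 1 evenly spaced values from 0 to 1 inclusive."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        (error(xs, ys, w, b), (w, b))
        for w in candidates
        for b in candidates
    )

X_age = [18, 31, 23]
y_price = [2.3, 4.6, 3.2]

coarse = grid_search(X_age, y_price, steps=10)   # 11 * 11 = 121 trials
fine = grid_search(X_age, y_price, steps=100)    # 101 * 101 = 10201 trials

print(coarse)
print(fine)
```

The finer grid contains every point of the coarse one, so its best error can only match or improve on the coarse result. The cost is that the number of trials grows with the square of `steps`.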