In [1]:
import pandas as pd
import random
import numpy as np
from collections import Counter

# Expected value 

### Prerequisite: un-weighted, arithmetic mean

a list comprehension builds a new list by evaluating an expression once for each item of an iteration (you can also add an `if` clause at the end to filter items)
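a minimal sketch of both forms, with and without a filter (the names `squares` and `even_squares` are made up for illustration):

```python
# Expression evaluated once per item of the iteration.
squares = [n * n for n in range(10)]

# Same comprehension with a trailing `if` clause that filters the items.
even_squares = [n * n for n in range(10) if n % 2 == 0]

print(squares)       # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(even_squares)  # [0, 4, 16, 36, 64]
```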

In [2]:
rand_nums = [random.randint(0, 100) for _ in range(20)]

In [3]:
rand_nums

[94,
 13,
 80,
 14,
 32,
 79,
 97,
 12,
 85,
 63,
 87,
 86,
 53,
 15,
 59,
 46,
 99,
 93,
 86,
 61]

In [4]:
rand_nums[:5]

[94, 13, 80, 14, 32]

the arithmetic mean is the sum of the set divided by the number of items in the set.

In [5]:
np.mean(rand_nums)

62.7
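as a sanity check, the definition can be applied by hand. this sketch uses only the first five numbers shown above (the variable names are illustrative) and compares the result against the standard library:

```python
import statistics

values = [94, 13, 80, 14, 32]  # first five numbers from the list above

# Arithmetic mean: sum of the set divided by the number of items in the set.
manual_mean = sum(values) / len(values)

print(manual_mean)
print(statistics.mean(values))
```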

(yair) the standard deviation is the square root of the variance (the variance is the difference of each number from the mean, squared, and then those squared differences averaged, meaning summed and divided by the number of items in the set)

(abe) the standard deviation is the square root of the average of the squared deviations from the mean

In [6]:
np.std(rand_nums)

30.12490663885948
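the two definitions above can be followed step by step. this is a sketch on a small made-up list (not the random numbers above). note that `np.std` divides by the number of items by default (the population form), which is also what `statistics.pstdev` computes:

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, chosen so the result is a round number

mean = sum(values) / len(values)                  # 5.0
squared_devs = [(v - mean) ** 2 for v in values]  # squared deviations from the mean
variance = sum(squared_devs) / len(values)        # average of the squared deviations
std = math.sqrt(variance)                         # square root of the variance

print(mean, variance, std)        # 5.0 4.0 2.0
print(statistics.pstdev(values))  # population standard deviation, same definition
```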

### Weighted averages: values & frequencies

In [7]:
price_items = [20, 40, 60]

we've created a catalogue of three items and stored their prices in the above list. the average price, to the consumer, per item is 40

In [8]:
rand_nums2 = [random.choice(price_items) for _ in range(20)]

In [9]:
rand_nums2

[20,
 40,
 60,
 20,
 40,
 40,
 40,
 20,
 20,
 40,
 40,
 60,
 20,
 20,
 40,
 40,
 40,
 20,
 20,
 20]

In [10]:
rand_nums2[:5]

[20, 40, 60, 20, 40]

we've created a new list that represents 20 random transactions at the store, where a customer bought 1 of the above items.

In [11]:
np.mean(rand_nums2)

33.0

the unweighted mean of the three catalogue prices is (20 + 40 + 60)/3 = 40

note that the average here is 33 and not 40, why is that?

in this case we are weighting the average based on the frequency of purchases. the fact that this average is lower than 40 shows that more items were sold at the lower prices.

In [12]:
Counter(rand_nums2)

Counter({20: 9, 40: 9, 60: 2})

`Counter` is like a `value_counts` without needing a dataframe, and it can be used on lists. it returns a `Counter` object, which is a subclass of `dict`.
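a short sketch of what that means in practice, using a made-up five-item list rather than the 20 transactions above:

```python
from collections import Counter

purchases = [20, 40, 60, 20, 40]  # illustrative data
counts = Counter(purchases)

is_dict = isinstance(counts, dict)  # Counter subclasses dict
seen = counts[20]                   # how many times 20 occurs
unseen = counts[80]                 # a missing key gives 0 instead of raising KeyError

print(is_dict, seen, unseen)  # True 2 0
print(dict(counts))           # {20: 2, 40: 2, 60: 1}
```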

#### Side note: dictionary comprehensions

In [13]:
cnts = Counter(rand_nums2)

In [14]:
cnts.items()

dict_items([(20, 9), (40, 9), (60, 2)])

In [15]:
str_cnts = {str(k): v for k, v in cnts.items()}

In [16]:
str_cnts

{'20': 9, '40': 9, '60': 2}

#### Back to analysis...

In [17]:
cnts

Counter({20: 9, 40: 9, 60: 2})

In [18]:
freq_items = [cnts[k] for k in price_items]

In [19]:
freq_items

[9, 9, 2]

alternative way to see this...

In [20]:
for k in price_items:
    print(k, cnts[k])

20 9
40 9
60 2


In [21]:
freq_items

[9, 9, 2]

In [22]:
df = pd.DataFrame({"price": price_items, "cnt": freq_items})

In [23]:
df = df[['price', 'cnt']]

In [24]:
df

Unnamed: 0,price,cnt
0,20,9
1,40,9
2,60,2


*we have gone from a flat view (one entry per transaction) to an aggregated view (one row per price, with its count)*

#### Calculate frequencies with Pandas 

In [25]:
df2 = pd.DataFrame(rand_nums2)

In [26]:
df2.head()

Unnamed: 0,0
0,20
1,40
2,60
3,20
4,40


In [27]:
df2.columns = ['purchase']

In [28]:
df2.head()

Unnamed: 0,purchase
0,20
1,40
2,60
3,20
4,40


In [29]:
df2.purchase.value_counts()

20    9
40    9
60    2
Name: purchase, dtype: int64

In [30]:
df2['purchase'].value_counts()

20    9
40    9
60    2
Name: purchase, dtype: int64

In [31]:
df2.purchase.value_counts().to_frame()

Unnamed: 0,purchase
20,9
40,9
60,2


In [32]:
df2.purchase.value_counts().to_frame().reset_index()

Unnamed: 0,index,purchase
0,20,9
1,40,9
2,60,2


In [33]:
df2 = df2.purchase.value_counts().to_frame().reset_index()

In [34]:
df2.columns = ['price', 'cnt']

In [35]:
df2

Unnamed: 0,price,cnt
0,20,9
1,40,9
2,60,2
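the column names that come out of `value_counts().to_frame().reset_index()` depend on the pandas version (to my knowledge they changed in pandas 2.0), so the positional rename above may be fragile. a sketch that names both columns explicitly, using a small illustrative series:

```python
import pandas as pd

purchases = pd.Series([20, 40, 60, 20, 40], name="purchase")  # illustrative data

freq = (
    purchases.value_counts()
    .rename_axis("price")      # name the index (the distinct prices)
    .reset_index(name="cnt")   # turn the counts into a column called "cnt"
    .sort_values("price")      # deterministic row order
    .reset_index(drop=True)
)

print(freq)
```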


#### back to averages...

In [36]:
df

Unnamed: 0,price,cnt
0,20,9
1,40,9
2,60,2


In [37]:
df.price.mean()

40.0

In [38]:
(df.price * df.cnt / df.cnt.sum()).sum()

33.0

In [39]:
(df['price'] * df['cnt'] / df['cnt'].sum()).sum()

33.0
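written out with the prices $p_i$ and counts $c_i$ from the table (the symbols are introduced here only for illustration), the expression computes:

$$
\bar{p}_w = \sum_i p_i \frac{c_i}{\sum_j c_j} = \frac{20 \cdot 9 + 40 \cdot 9 + 60 \cdot 2}{9 + 9 + 2} = \frac{660}{20} = 33
$$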

*PEMDAS: parentheses, exponent, multiplication, division, addition, subtraction*

In [40]:
(df['price'] * df['cnt'] / df['cnt'].sum())

0     9.0
1    18.0
2     6.0
dtype: float64

method split out step by step...

In [41]:
df['cnt_rel'] = df.cnt / df.cnt.sum()

In [42]:
df

Unnamed: 0,price,cnt,cnt_rel
0,20,9,0.45
1,40,9,0.45
2,60,2,0.1


In [43]:
df['proportion_tot_rev'] = df.price * df.cnt_rel

In [44]:
df

Unnamed: 0,price,cnt,cnt_rel,proportion_tot_rev
0,20,9,0.45,9.0
1,40,9,0.45,18.0
2,60,2,0.1,6.0


In [45]:
weighted_avg_price = df.proportion_tot_rev.sum()

In [46]:
weighted_avg_price

33.0

the above shows the method to obtain the weighted average, done as follows:

- the relative count is each frequency divided by the total of all frequencies
- the proportion to total revenue (`proportion_tot_rev`) is the relative count times the price, i.e. each item's contribution to the average price per transaction
- the weighted average is the sum of the proportion to total revenue

when calculating a weighted average, you are basically taking a set of values, scaling them by their relative contribution to something else (e.g. frequency, inventory sell-through) and then taking the sum of those scaled values.
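the same three steps can be collapsed into one call. this sketch assumes the prices and counts from the table above and uses `np.average` with its `weights` argument, cross-checked against the manual formula:

```python
import numpy as np

prices = [20, 40, 60]
counts = [9, 9, 2]  # frequencies from the table above

# np.average normalizes the weights by their sum before averaging.
weighted = np.average(prices, weights=counts)

# Manual version: sum of price * count, divided by the total count.
manual = sum(p * c for p, c in zip(prices, counts)) / sum(counts)

print(weighted, manual)
```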

### Expected value

Let's say a data scientist on your team has been researching in-store customer activity and has observed the following:

- there is a 70% probability of a customer not buying anything

of the customers who do buy:
- 50% buy a t-shirt (\$20)
- 30% buy a polo (\$40)
- 20% buy a button-down (\$60)

What is the expected value of a customer's visit?

In [47]:
df3 = df2['price'].sort_values().to_frame()

In [48]:
df3

Unnamed: 0,price
0,20
1,40
2,60


In [49]:
df3.index = ['t-shirt', 'polo', 'button-down']

In [50]:
df3['prob'] = 3*[0]

In [51]:
df3

Unnamed: 0,price,prob
t-shirt,20,0
polo,40,0
button-down,60,0


In [52]:
no_purch = pd.DataFrame({"price": [0], "prob": [.7]}, 
                        index=["no-purchase"])

In [53]:
no_purch

Unnamed: 0,price,prob
no-purchase,0,0.7


In [54]:
pd.concat([df3, no_purch])  # DataFrame.append was removed in pandas 2.0

Unnamed: 0,price,prob
t-shirt,20,0.0
polo,40,0.0
button-down,60,0.0
no-purchase,0,0.7


In [55]:
df3 = pd.concat([df3, no_purch])  # DataFrame.append was removed in pandas 2.0

In [56]:
item_labels = ['t-shirt', 'polo', 'button-down']

In [57]:
probs_df = pd.DataFrame({'prob': [.5, .3, .2]}, 
                       index=item_labels)

In [58]:
probs_df

Unnamed: 0,prob
t-shirt,0.5
polo,0.3
button-down,0.2


In [59]:
df3.loc[item_labels, 'prob']

t-shirt        0.0
polo           0.0
button-down    0.0
Name: prob, dtype: float64

In [60]:
probs_df = probs_df * .3

In [61]:
probs_df

Unnamed: 0,prob
t-shirt,0.15
polo,0.09
button-down,0.06


In [62]:
df3.loc[item_labels, 'prob'] = probs_df

In [63]:
df3

Unnamed: 0,price,prob
t-shirt,20,0.15
polo,40,0.09
button-down,60,0.06
no-purchase,0,0.7


In [64]:
expected_value = (df3.price * df3.prob).sum()

In [65]:
expected_value

10.2

*this is the probability-weighted arithmetic mean or average*
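the same calculation without pandas, as a sketch. the dictionary layout and names are illustrative; the prices and probabilities are the ones stated in the problem:

```python
p_buy = 1 - 0.7  # probability that a customer buys anything

# (price, probability given that the customer buys)
conditional = {
    "t-shirt": (20, 0.5),
    "polo": (40, 0.3),
    "button-down": (60, 0.2),
}

# Unconditional outcomes: scale each conditional probability by p_buy.
outcomes = {"no-purchase": (0, 0.7)}
for name, (price, p) in conditional.items():
    outcomes[name] = (price, p * p_buy)

total_prob = sum(p for _, p in outcomes.values())               # should be 1
expected_value = sum(price * p for price, p in outcomes.values())

print(round(total_prob, 10), round(expected_value, 10))
```

checking that the probabilities sum to 1 is a cheap way to catch a forgotten outcome or an unscaled conditional probability.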

### random quiz questions...

In [66]:
np.mean([3, 5, 8, 10, 15, -23])

3.0

In [67]:
np.std([7, 10, 15, 5])

3.766629793329841

##### What does [1, 5, 3].sort() return?

In [68]:
d = [1, 5, 3]

In [69]:
d

[1, 5, 3]

In [70]:
d.sort()

In [71]:
d

[1, 3, 5]

In [72]:
sorted(d)

[1, 3, 5]

the `.sort()` method sorts the list in place and returns `None`. the `sorted()` function leaves the original untouched and returns a new, sorted list.
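a sketch that makes the difference visible by capturing both return values (variable names are illustrative):

```python
d = [1, 5, 3]
result = d.sort()  # sorts d itself; the return value is None

e = [1, 5, 3]
new_list = sorted(e)  # e is left as it was; a new list is returned

print(result, d)    # None [1, 3, 5]
print(new_list, e)  # [1, 3, 5] [1, 5, 3]
```

a common bug follows from this: `d = d.sort()` replaces the list with `None`.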

In [73]:
0 % 2

0

In [74]:
3 % 2

1

In [75]:
for i in range(1000):
    if i % 2:
        print(i + 1)

2
4
6
...
998
1000

`for i in range(1000)`
- this line loops i from 0 through 999
    
`if i % 2`
- the body of the `if` runs when `i % 2` is truthy; `None`, `False`, and 0 count as false, anything else counts as true
- in this case, `i % 2` is always 0 (even `i`, skipped) or 1 (odd `i`, printed)
    
`print(i + 1)`  
- prints i, which is the position in the loop, + 1; since only odd `i` reach this line, the output is the even numbers 2 through 1000
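to confirm that reading of the loop, this sketch collects the values instead of printing them and compares them with a direct `range` over the even numbers (names are illustrative):

```python
# Same logic as the loop above, collected into a list.
from_loop = [i + 1 for i in range(1000) if i % 2]

# Even numbers from 2 through 1000, generated directly with a step of 2.
direct = list(range(2, 1001, 2))

print(from_loop[:3], from_loop[-1], len(from_loop))  # [2, 4, 6] 1000 500
print(from_loop == direct)                           # True
```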

In [76]:
3 not in [1, 7, 5]

True

In [77]:
"ab" in "a beautiful day"

False

In [78]:
[1, 5, 7] == [1, 7, 5]

False
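a sketch of why the last two results are `False`, and of what to compare instead when order or spacing should not matter (variable names are illustrative):

```python
# `in` on strings looks for a contiguous substring; "a b" has a space in between.
has_ab = "ab" in "a beautiful day"
has_a_space_b = "a b" in "a beautiful day"

# List equality is order-sensitive; sort both sides to ignore order.
same_order = [1, 5, 7] == [1, 7, 5]
same_items = sorted([1, 5, 7]) == sorted([1, 7, 5])

print(has_ab, has_a_space_b)   # False True
print(same_order, same_items)  # False True
```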