# Log generation
Пусть у нас есть интернет-магазин, который посещают 10 посетителей в час. Итого за год имеем 87 600 посетителей (у wildberries - около 40 млн).  

Каждый посетитель совершает какие-то действия на сайте, которые записываются в log файл. Мы имитируем эти действия с помощью 100 случайных слов (из трех случайных букв).  

Наш лог файл получился небольшого размера и легко помещается в оперативную память одной машины. Но так бывает не всегда. Есть два пути решения проблемы.

![](scaling.png)

In [1]:
import random, string, datetime

In [None]:
letters = string.ascii_lowercase

LOG_SIZE = 10*24*365

def get_random_word(length=3):
   return ''.join(random.choice(letters) for i in range(length))

def get_line(length=100):
  t = str(datetime.datetime.now())
  words = [get_random_word() for n in range(length)]
  return t + ' ' + ' '.join(w for w in words) + '\n'

lines = [get_line() for l in range(LOG_SIZE)]

with open('log.txt', 'w') as f:
  f.writelines(lines)

In [3]:
with open('log.txt') as f:
  lines = f.readlines()

print(f'Total lines - {len(lines)}. First 5:')
lines[:5]

Total lines - 87600. First 5:


['2022-02-16 17:47:42.154547 zjv gfu dam iro hoj wuu biv lez qbw lgv dkz wvj tzj zdu iro ouo ydo mhx hsw tyi dwi pbj pww uwg sui mdi yss dks sev pnr bqh kkm svh opr eor kxq uau hir gga wsu wuj nqx uif qmk urg ccw bgl vkv wap ffa mfb ejg uig yig pbs hpk jtp dfm fpb ift ndv iiz bsm qsp zvu eny gma ekn sat hgk bgz kkm qmu cpi gmw qde wxc nab uiw piu ati clz aze fie tqt utg swn mkv ioj ljx img rli saj rsc tqx yqy szr mky kbs nob\n',
 '2022-02-16 17:47:42.155723 xhx xpp hnu axu sji xhy seq lhi sfi qdg kqq ytp jbu olb doh omn xap clr xxq dwm dco zfp lnz eyj zsx stk dtd eae fno aio xpu vuq kfa dff vdu jiu nwi iuu xse iwv nrs bmn nrm kod gtb fpt oxb ejj qfk yua hdb eny tgh ldb ezb avy fzi zqv hhf dxm tfk uak mes gko cxy eye kmj slf pwl hls ygk hhn pon gnd uiu sgf atb wuv gei zzq omc wfh yzg puu gfu sbf hvp iay sdz hby zhp hnw nmj fxk xnd zqy tnp czi nut vft\n',
 '2022-02-16 17:47:42.156961 urw wwl mec ism fkg pub byj ihz vqr hjk ruy ebn uel wkq iaa box gge pdu zxo ubo xjh xmi zkl imz hup ymq c

Проведем простую аналитику: сколько раз встречается слово 'cat' и слово 'dog' в нашем файле.

In [4]:
%%time
cat_count = 0
dog_count = 0

for line in lines:
  if 'cat' in line:
    cat_count += 1
  if 'dog' in line:
    dog_count += 1

print(f'cat - {cat_count}, dog - {dog_count}')

cat - 480, dog - 474
CPU times: user 36.6 ms, sys: 4.05 ms, total: 40.6 ms
Wall time: 43.5 ms


Немного усложним задачу: сколько раз встречается каждое слово? 

In [5]:
from collections import Counter

words = []
for line in lines:
  words += line.split(' ')[2:]

In [10]:
%%time
counts = Counter(words)

CPU times: user 1.06 s, sys: 4.07 ms, total: 1.06 s
Wall time: 1.05 s


Попробуем ускорить вычисления за счет разделения списка слов

In [11]:
%%time
LEN = len(words) // 2
words1 = words[:LEN]
words2 = words[LEN:]

counts1 = Counter(words1)
counts2 = Counter(words2)
counts3 = counts1 + counts2

CPU times: user 819 ms, sys: 24 ms, total: 843 ms
Wall time: 840 ms


In [12]:
counts3 == counts

True

Или с помощью параллельных процессов

In [13]:
from multiprocessing import Process, Queue, cpu_count

q = Queue()

def count(q, words):
    c = Counter(words)
    q.put(c)
    
c = cpu_count()
print(f'CPU count - {c}')

CPU count - 4


In [14]:
%%timeit
p1 = Process(target=count, args=(q, words1))
p1.start()
ac1 = q.get()
p1.join()

586 ms ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%%time
p1 = Process(target=count, args=(q, words1))
p1.start()
p2 = Process(target=count, args=(q, words2))
p2.start()

ac1 = q.get()
ac2 = q.get()

p1.join()
p2.join()

async_count = ac1 + ac2

CPU times: user 35.9 ms, sys: 12.3 ms, total: 48.2 ms
Wall time: 805 ms


In [16]:
async_count == counts

True

## What is map reduce?

Greeting programming language paradigm:
1. Imperative:
* Procedural (Fortran, Pascal, C ...)
* Object Oriented (Smalltalk, C++, Java ...)
* Parallel Processing (Ada, Occam, cuda ...)
2. Declarative:
* Logic (Prolog ...)
* Functional (Lisp, Scheme ...)
* Database (SQL ...) 

Большинство современных языков поддерживают несколько парадигм программирования.

## High-order functions (Функции высшего порядка)
#### map

In [17]:
data = list(range(5))
print(data)

[0, 1, 2, 3, 4]


In [18]:
def my_qube(x):
  return x*x*x

list(map(my_qube, data))

[0, 1, 8, 27, 64]

In [19]:
list(map(lambda x: x*x*x, data))

[0, 1, 8, 27, 64]

#### filter 

In [20]:
list(filter(lambda x: x%2==0, data))

[0, 2, 4]

#### reduce 

In [21]:
import functools as f

def my_sum(a, b):
  return a + b

f.reduce(my_sum, data)

10

In [22]:
f.reduce(lambda a, b: a + b, data)

10

# Домашнее задание

#### 1. Переведите все слова в лог-файле в uppercase, в решении используйте map 

In [1]:
with open('log.txt') as f:
  lines = f.readlines()

In [2]:
%%time
list(   map(str.upper, lines)    )

CPU times: user 47.9 ms, sys: 16.2 ms, total: 64.1 ms
Wall time: 64.3 ms


['2022-02-16 17:47:42.154547 ZJV GFU DAM IRO HOJ WUU BIV LEZ QBW LGV DKZ WVJ TZJ ZDU IRO OUO YDO MHX HSW TYI DWI PBJ PWW UWG SUI MDI YSS DKS SEV PNR BQH KKM SVH OPR EOR KXQ UAU HIR GGA WSU WUJ NQX UIF QMK URG CCW BGL VKV WAP FFA MFB EJG UIG YIG PBS HPK JTP DFM FPB IFT NDV IIZ BSM QSP ZVU ENY GMA EKN SAT HGK BGZ KKM QMU CPI GMW QDE WXC NAB UIW PIU ATI CLZ AZE FIE TQT UTG SWN MKV IOJ LJX IMG RLI SAJ RSC TQX YQY SZR MKY KBS NOB\n',
 '2022-02-16 17:47:42.155723 XHX XPP HNU AXU SJI XHY SEQ LHI SFI QDG KQQ YTP JBU OLB DOH OMN XAP CLR XXQ DWM DCO ZFP LNZ EYJ ZSX STK DTD EAE FNO AIO XPU VUQ KFA DFF VDU JIU NWI IUU XSE IWV NRS BMN NRM KOD GTB FPT OXB EJJ QFK YUA HDB ENY TGH LDB EZB AVY FZI ZQV HHF DXM TFK UAK MES GKO CXY EYE KMJ SLF PWL HLS YGK HHN PON GND UIU SGF ATB WUV GEI ZZQ OMC WFH YZG PUU GFU SBF HVP IAY SDZ HBY ZHP HNW NMJ FXK XND ZQY TNP CZI NUT VFT\n',
 '2022-02-16 17:47:42.156961 URW WWL MEC ISM FKG PUB BYJ IHZ VQR HJK RUY EBN UEL WKQ IAA BOX GGE PDU ZXO UBO XJH XMI ZKL IMZ HUP YMQ C

In [None]:
# expected
"""['2022-02-12 14:07:14.497548 ERK QJF REP DQL ZSC SHL BNB AAP OAU XML BKU PUD OYU LRZ KUF LYC XKH WVQ UVE LEW RVN HOD ZFJ SHF PAS YFC VTZ JPA GHU TJD HSS MPS NEO EYJ GNY VDD XLD XIK UMZ KBN WNE GYW MBX LNI TTA QCN BHF KUZ WCQ QBJ SHI YHR DLL MVH SPV BTG MTJ XLN AIY HKA BQO ZZH BBI XRH NEP CNK KUV RNE FNU SCZ KFV EHM WYI EWE OAP UWF IVJ HFL ODJ SOE RTE FXE OLY TOG VUX KYP RUK MCQ COO TBD XBI CUG TLB GJY GKQ TEO UZF SIR RGK KMP\n',
 '2022-02-12 14:07:14.498591 VIS BRA CHC GAM FYT VCP DWO VHT NEM LAA CXS YRC KOY ZTJ TZC MJI XPJ QBA XMT PJX QBE POY EUC MLH FPU CNX PSG QLN QMF APS URL GZZ CSU FYY IME LJG GGI LJA EDD TVG SJM JIZ EQM LRK GUN AQZ IGQ SBA LZU HUJ EQM HZJ KKJ JHE GJW ELG RXA RLH VBL YMR LDJ ISY VJX GVV UFX NYY LHG QYV XMT JET KRJ MNJ JKH MJL WVT RVW MOJ OES QXA ZJW DQV VNE ZEP UGV GXZ YCB FPS BZF ENH WSH PJD WED CLX WBN QGS BOP QFF ZWK CRQ XJR\n',
 '2022-02-12 14:07:14.499505 WXG DLB AKS IXN JEK QGD TUY KSU HDF CEY JEQ GLU MAW TEW YZO AVC IBD YSS HYL YBO DCF PNT UTB MZW ZVJ EYN YYW FLM TIL WYG WJR ETO XMD KRL OOE MEN HED QRJ JZC SFT RXN JNC PEG CWV TKO XRU YEP AQW ZZV VNU PYU THS PML ZEG XIK RYF DRS VFQ GLB FDZ SCT KEP GXP NKL XTE TRF LIX CDB ARM NBT GCV BNU AFX PPG PRV DYX AJN ZDJ QTU IRB QLI LNG RJR LFN KBD JTV IPX OIV YUS DZR MXG QHM CTI LLM IUT ZCK RCT QPN LQS OYO\n',
 '2022-02-12 14:07:14.500389 WOY WFE LUU BWK YBR FNU BRY CFR YWS TVA ABR TJI MWU THK NMJ HLB NFG ICR ITF DAY ZFB ZZI QHA VLA KZW WEW QYY IGI XLW VWB RYI HDH CTN MQE OYI ICD IEC CGL OIL GYR ZYI SGW DWU KQB CTR YIZ XHT EVM OOA UAF JUO NBK XMD VOM ZTQ GLO FTZ BPZ XNP FNH IQJ DAO CYN CDE MXS CEB VUV GEC BZI SRF XYH PWE UEZ QVA FKN WKA TJM VUI HSI EGS UTL WBT TUG JYC KOH NLO AYF QAZ FYM AFA WPH WKV DMA JDP NDW ROF SUQ QOK QZP PFZ\n',
 '2022-02-12 14:07:14.501292 ZWL QPO HOC WLB HRM DOM DIQ CKZ INA HBY DQR OMQ UWN NCW YIY GYB ZZZ HIC CHM LUY CIX MYD PPH NEK VUP IFH VRW XSX RMI RFH QLW WBT LDQ BXN YMA HXX INQ YZS NZI LCN AGU GNW MGB XYP WTM QMI QJC UOU RAC UAR NPD ELS CIN OMN QRJ KLU AIZ OVL GLF VIB QHF SRX CLH DGJ DQY CXK OQO IRC TKK PFZ PVE FBN NQT NDQ VMC ZGT KBI SOD QBB MQR LSL YMM MIQ HZF MGQ ZLM SZO YOL HWC WIY RAI NKK VAL VWM FPW PHL TZS DVU WKQ DNJ\n']"""

#### 2. Создайте новый список с логами, в которых будут только слова из white_list.txt, в решении используйте map и filter

In [3]:
with open('white_list.txt') as f:
    ww_lines = f.readlines()

ww = "".join([w for w in ww_lines]).split()
ww = list(map(str.lower, ww))

In [7]:
ww

['and',
 'fix',
 'own',
 'are',
 'fly',
 'odd',
 'ape',
 'fry',
 'our',
 'ace',
 'for',
 'pet',
 'act',
 'got',
 'pat',
 'ask',
 'get',
 'peg',
 'arm',
 'god',
 'paw',
 'age',
 'gel',
 'pup',
 'ago',
 'gas',
 'pit',
 'air',
 'hat',
 'put',
 'ate',
 'hit',
 'pot',
 'all',
 'has',
 'pop',
 'but',
 'had',
 'pin',
 'bye',
 'how',
 'rat',
 'bad',
 'her',
 'rag',
 'big',
 'his',
 'rub',
 'bed',
 'hen',
 'row',
 'bat',
 'ink',
 'rug',
 'boy',
 'ice',
 'run',
 'bus',
 'ill',
 'rap',
 'bag',
 'jab',
 'ram',
 'box',
 'jug',
 'sow',
 'bit',
 'jet',
 'see',
 'bee',
 'jam',
 'saw',
 'buy',
 'jar',
 'set',
 'bun',
 'job',
 'sit',
 'cub',
 'jog',
 'sir',
 'cat',
 'kit',
 'sat',
 'car',
 'key',
 'sob',
 'cut',
 'lot',
 'tap',
 'cow',
 'lit',
 'tip',
 'cry',
 'let',
 'top',
 'cab',
 'lay',
 'tug',
 'can',
 'mat',
 'tow',
 'dad',
 'man',
 'toe',
 'dab',
 'mad',
 'tan',
 'dam',
 'mug',
 'ten',
 'did',
 'mix',
 'two',
 'dug',
 'map',
 'use',
 'den',
 'mum',
 'van',
 'dot',
 'mud',
 'vet',
 'dip',
 'mom',


In [4]:
def white_filter(log_string):
    log_words = log_string.split()
    timestamp = str(log_words[0]) + str(w[1]) + " "
    words = w[2:]
    filtered_words = list(  filter(lambda x: x in ww, words)  )
    
    return timestamp + " ".join([w for w in filtered_words])  

In [8]:
white_filter(lines[1])

'2022-02-1617:47:42.155723 eye nut'

In [9]:
%%time
out = list(map(white_filter, lines))

CPU times: user 20.2 s, sys: 47.7 ms, total: 20.3 s
Wall time: 20.5 s


In [10]:
out[:5]

['2022-02-1617:47:42.154547 dam sat',
 '2022-02-1617:47:42.155723 eye nut',
 '2022-02-1617:47:42.156961 box',
 '2022-02-1617:47:42.163537 ',
 '2022-02-1617:47:42.164261 nod']

In [None]:
# expected 
"""['2022-02-1214:07:14.497548 sir',
 '2022-02-1214:07:14.498591 jet',
 '2022-02-1214:07:14.499505 peg arm',
 '2022-02-1214:07:14.500389 day tug',
 '2022-02-1214:07:14.501292 ']"""

#### 3. Посчитайте количество слов из white_list в файле log.txt

In [None]:
# expected
"""Counter({'arm': 158,
         'tug': 154,
         'may': 175,
         'tan': 194,
         'cry': 184,
         'fin': 174... })"""

In [12]:
from collections import Counter

In [14]:
%%time
words = []
for line in out:
  words += line.split(' ')[2:]

Counter(words)

CPU times: user 51.1 ms, sys: 20 µs, total: 51.1 ms
Wall time: 69.7 ms


Counter({'sat': 170,
         'nut': 185,
         'why': 194,
         'gel': 201,
         'jog': 191,
         'for': 179,
         'tow': 181,
         'air': 206,
         'bee': 190,
         'odd': 155,
         'vet': 195,
         'dot': 208,
         'ace': 190,
         'oar': 195,
         'yak': 182,
         'buy': 181,
         'who': 160,
         'pup': 201,
         'dip': 175,
         'yes': 167,
         'had': 175,
         'put': 131,
         'but': 184,
         'lay': 170,
         'car': 171,
         'job': 143,
         'pat': 182,
         'two': 197,
         'got': 200,
         'lit': 204,
         'arm': 177,
         'war': 177,
         'get': 186,
         'own': 193,
         'zap': 167,
         'rat': 186,
         'bye': 180,
         'jet': 180,
         'way': 186,
         'van': 182,
         'ear': 165,
         'kit': 182,
         'ape': 182,
         'pet': 160,
         'jam': 174,
         'ask': 170,
         'zip': 166,
         'all