-> The User Pattern Analysis with <-
______ ______ .___ ___. .___ ___. ___ .__ __. _______ __ __ .__ __. _______
/ | / __ \ | \/ | | \/ | / \ | \ | | | \ | | | | | \ | | | ____|
| ,----'| | | | | \ / | | \ / | / ^ \ | \| | | .--. |______| | | | | \| | | |__
| | | | | | | |\/| | | |\/| | / /_\ \ | . ` | | | | |______| | | | | . ` | | __|
| `----.| `--' | | | | | | | | | / _____ \ | |\ | | '--' | | `----.| | | |\ | | |____
\______| \______/ |__| |__| |__| |__| /__/ \__\ |__| \__| |_______/ |_______||__| |__| \__| |_______|
-> JeongMin Kwon <-
-> cojette@gmail.com <-
-> 2015-08-27 <-
-> http://bit.ly/1NINdeF <-
-
Situation
-
What is User Pattern?
-
Case with Simple Command-line tools
-
Conclusion
-
Grasping user/service status
-
Beta-test for new services
-
Examining poor-defined log
-
Experiment/Test Evaluation
-
Development based data analysis
For myself only : *Ad-hoc data analysis *
-
regular and predictuable action of users
-
5W1H
┍━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑
│ Basic Indicator │ Predict │
├───────━━━━━━━━━━━──━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━─┼─────────━━━━━━━━━━━━━━━━━━━━━━━━━━─━━━━━━┤
│ - Who :user id, ip... | - Why : purpose, target ... |
| - When : timestamp | |
| - Where : page_id, action_id, category ... | |
| - What : login, buy, action... | |
| - How : purchase method, device, reference ... | |
┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
- Basic Indicators cover lots of decision problems
-
Total user status
-
Time series insights of metrics
-
user count and durations
-
Understanding each user
-
Service login term
-
contents and duration per each weekday, hour, month
-
...and many more
-
Background: Launching new temporary items in a small mobile game
-
Goal: Understanding users shopping time, items users react
- If users do not react to new items, new items will be removed
-
log: flat file
$ head tr_sample.csv id,type,user_id,description,del_item_id,gem,bean,recip,recdate 1963179,13,13465,,200002,0,0,113.32.91.206,2015-07-01 00:00:01.000000 1963180,9,2910, 산책 ,,2,0,211.187.227.229,2015-07-01 00:00:01.000000 1963181,9,12341,건강관리,,0,110,202.43.69.142,2015-07-01 00:00:01.000000
-
Details: log overview, sell count, sell status per hour and users, and more
-
Data Collection
-
Data Cleansing
-
Data Exploration
-
Data Modeling
-
Data interpretation
-
Data Collection
-
Data Cleansing
-
Data Exploration
Data Wrangling
- Command Line Tool: Ubiquitous, Extensible, Agile
-
original : comma-separated values
-
usual: character-separated values (included tab, other delimeter)
-
basic text log file format
-
csvkit (https://github.com/onyxfish/csvkit)
-
suite of utilities for working with cvs format
-
install: sudo apt-get install
-
copy, mv, ...
-
Internet Download : curl
$ curl -L -O https://github.com/cojette/UserPatternwithCommandLine/log_sample.xlsx
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed
100 158 0 158 0 0 24 0 --:--:-- 0:00:06 --:--:-- 39 100 65331 100 65331 0 0 4978 0 0:00:13 0:00:13 --:--:-- 15918
-
Controlling xlsx : in2csv
$ in2csv data/log_sample.xlsx > data/log_sample.csv
-
Database : sql2csv
$ sql2csv --db 'sqlite:///data/iris.db' --query 'SELECT * FROM iris limit 5'
-
Data check : csvlook + head
$ head -n 5 tr_sample.csv |csvlook |----------+------+---------+------------------+-------------+-----+------+-----------------+-----------------------------| | id | type | user_id | description | del_item_id | gem | bean | recip | recdate | |----------+------+---------+------------------+-------------+-----+------+-----------------+-----------------------------| | 1963179 | 13 | 13465 | | 200002 | 0 | 0 | 113.32..06 | 2015-07-01 00:00:01.000000 | | 1963180 | 9 | 2910 | 산책 | | 2 | 0 | 211.187..29 | 2015-07-01 00:00:01.000000 | | 1963181 | 9 | 12341 | 건강관리 | | 0 | 110 | 202.43..42 | 2015-07-01 00:00:01.000000 | | 1963182 | 9 | 6286 | 뮤지컬 | | 0 | 1000 | 218.153..281 | 2015-07-01 00:00:04.000000 | |----------+------+---------+------------------+-------------+-----+------+-----------------+-----------------------------|
-
Syntax error check: csvclean
$ csvclean tr_sample.csv No errors.
-
Subset from xlsx : csvcut
$ csvcut -c 1,3,5,6,9 tr_sample.csv | csvlook |head |----------+---------+-------------+-----+-----------------------------| | id | user_id | del_item_id | gem | recdate | |----------+---------+-------------+-----+-----------------------------| | 1963179 | 13465 | 200002 | 0 | 2015-07-01 00:00:01.000000 | | 1963180 | 2910 | | 2 | 2015-07-01 00:00:01.000000 | | 1963181 | 12341 | | 0 | 2015-07-01 00:00:01.000000 | | 1963182 | 6286 | | 0 | 2015-07-01 00:00:04.000000 |
-
Filtering: grep, csvgrep
$ csvgrep -c 5 -m 200003 tr_sample.csv |head -n 3 | csvlook |----------+------+---------+-------------+-------------+-----+------+-----------+-----------------------------| | id | type | user_id | description | del_item_id | gem | bean | recip | recdate | |----------+------+---------+-------------+-------------+-----+------+-----------+-----------------------------| | 1963185 | 13 | 7410 | | 200003 | 0 | 0 | 223.18..3 | 2015-07-01 00:00:05.000000 | | 1963188 | 13 | 7410 | | 200003 | 0 | 0 | 223.18..3 | 2015-07-01 00:00:06.000000 | |----------+------+---------+-------------+-------------+-----+------+-----------+-----------------------------|
-
Simple statistics : csvstat
$ csvstat --unique tr_sample.csv
- id: 500
- type: 4
- user_id: 110
- description: 82
$ csvstat --freq tr_sample.csv
- id: { "1963208": 1, "1963209": 1, "1963204": 1, "1963205": 1, "1963206": 1 }
- type: { "9": 332, "13": 98, "4": 69, "12": 1 }
- user_id: { "10684": 24, "2910": 19, "231": 13, "12778": 12, "11124": 12 }
- description: { "보너스": 78, "룰렛": 75, "건강관리": 12, "토크": 11, "보너스": 9 }
-
Exploration with SQL : csvsql
$ csvsql --query "select substr(recdate, 15,2) as hour, count(1) > from tr_sample group by substr(recdate, 15,2)" tr_sample.csv | csvlook |-------+-----------| | hour | count(1) | |-------+-----------| | 00 | 108 | | 01 | 122 | | 02 | 119 | | 03 | 98 | | 04 | 53 | |-------+-----------|
-
Exploration with Python : csvpy
$csvpy tr_sample.csv Welcome! "tr_sample.csv" has been loaded in a CSVKitReader object named "reader". In [1]: reader.next() Out[1]: [u'1963179', u'13', u'13465',
-
Check users who bought items 2 or more times : items they bought, reaction for new items
$ csvsql --query "select user_id from (select user_id, count(1) as cnt from tr_sample group by user_id)
where cnt >=2" tr_sample.csv >tr_user.csv
$ csvjoin -c "user_id, user_id" tr_user.csv tr_sample.csv >tr_up.csv
$ head -n 5 tr_up.csv |csvlook |----------+---------+------+---------+-------------+-------------+-----+------+------------+-----------------------------| | user_id | id | type | user_id | description | del_item_id | gem | bean | recip | recdate | |----------+---------+------+---------+-------------+-------------+-----+------+------------+-----------------------------| | 231 | 1963189 | 9 | 231 | SNS포스팅 | | 0 | -90 | 180.70..18 | 2015-07-01 00:00:06.000000 | | 231 | 1963204 | 13 | 231 | | 200004 | 0 | 0 | 180.70..18 | 2015-07-01 00:00:16.000000 | | 231 | 1963235 | 9 | 231 | 토크 | | 0 | 210 | 180.70..18 | 2015-07-01 00:00:38.000000 | | 231 | 1963236 | 9 | 231 | 건강관리 | | 0 | 110 | 180.70..18 | 2015-07-01 00:00:38.000000 | |----------+---------+------+---------+-------------+-------------+-----+------+------------+-----------------------------|
$ csvstat tr_up.csv
1. user_id
<type 'int'>
Nulls: False
Min: 231
Max: 13482
Sum: 4303878
Mean: 8855.71604938
$ csvsql --query "select substr(recdate, 15, 2) as hour, del_item_id, count(1) as cnt from tr_up
> where length(del_item_id) > 2 group by substr(recdate, 15, 2) , del_item_id" tr_up.csv |csvlook
|-------+-------------+------|
| hour | del_item_id | cnt |
|-------+-------------+------|
| 00 | 200001 | 3 |
| 00 | 200002 | 3 |
| 00 | 200003 | 11 |
| 00 | 200004 | 3 |
-
Other Command-line tools: BigML ...
-
Programming pipeline : modeling code (python, R.. )
$ R CMD BATCH sampleml.R $ R -e 'hist(demo.data)' > hist.png
-
parallizing modeling code: parallel
$ ls data/*.txt | parallel --tag "cat {} | ./user_cluster.R 10"
-
-
Pandashells: shell pipeline with the statistical and visualization tools of the python data-stack (https://github.com/robdmc/pandashells)
$ p.example_data -d tips | p.plot -x total_bill -y tip -s 'o' --title 'Tip Vs Bill' > test.png
-
Data Analysis: Back To The Basic
-
User Pattern : Useful, important and easy
-
csvkit: Handful command-line tools for simple data analysis
-
Handful of options cover 90% of use cases
-
Useful for Ad-hoc analysis : Can speed up ordinary pipelines and code
-
Author: Jeroen Janssens
-
Published by O'Reilly in October 2014
-
Overview of all tools on http://datascienceatthecommandline.com
- csv*** --help
-> Thank you for listening. <-