# Data Science by Example
### Data Science for Data Scientists
---

## Overview
* What? Topic
* Why? Applications
* Who? Prereqs
* How? Objectives
* How? Process
* When? Lesson Plan

### What? Topic

"Data Science" is an opaque term; via examples, this module illustrates the kinds of problems a data scientist might solve. By illustration you will gain an appreciation of the technical skillset, mindset and methodology of data science. 

### Why? Applications

After this module you will better appreciate the value data science can bring to an organization, and what skills you may need to acquire to become a data scientist.

### Who? Prereqs

No prior knowledge is assumed; however, learners may benefit from some background in mathematics and computer science.

The module *illustrates* problems; we expect that many learners will not grasp every detail.

### When? Lesson Plan (c. 1h 30m)
* Case Study: Argonaught (10 min)
* Statistics & Machine Learning: The Risk of Return (20 min)
    * The Problem
    * Investigation
    * Ideation
    * Wb. Solution Design
    * Related Problems
* Algorithms & Graph Theory: Recommending Products (20 min)
    * ...
* Big Data: User Behaviour with Event Systems (20 min)
    * ...
* Discuss: What is Problem Solving? (5 min)
    * More Problems (2 min)
    * Reflect: What skills, concerns, topics, knowledge have we used? (5 min)
    * Review: What is a Data Scientist? (5 min)

### How? Objectives (30 min)
* Brainstorm: Mind-Map concerns for some example problems

### How? Process
* Module time: 2h
* Discussion & Slides (c. 30min)
* Whiteboards (c. 1h)
* Exercise Discussion (c. 30min)
---

## Learning

## Argonaught
### An Ecommerce Business

* argonaught is a traditional retail business
    * with an ecommerce website and mobile app
* argonaught's website has:
    * customers
    * product reviews
    * product wishlists
    * ... 
    

### How will we solve problems?

* investigation: what is the problem?
    * Q. What steps are relevant?
* ideation: what could we do to solve it?
    * Q. What steps are relevant?
* solution (design): what would a plausible solution look like?
    * Q. What steps are relevant?
* solution (build): how do we build it?
    * Q. What steps are relevant?

* investigation: what is the problem?
    * problem understanding
        * domain
        * objective
        * impact 
    * quantification & measurement
    * data exploration
* ideation: what could we do to solve it?
    * subproblems identified 
    * strategies to solve subproblems / problems
    * hypothesis generation
* solution (design): what would a plausible solution look like?
    * experimental design
    * product/implementation design
* solution (build): how do we build it?
    * ...

### Why not use business case studies ?
* all examples are based on real problems
    * but business case studies do not walk through the problem from a data *scientist*'s pov
* business case studies focus on the problem-solution and the solution's business value
    * not on how a data scientist has added value to an existing business
* business case studies seem to narrow the applicability of data science
    * and fail to educate on the mindset & value of the *role*
* some case studies do this -- and are often published as journal articles or youtube technical talks
    * these can be overly technical and omit clear motivations
    * several are included in this module

## Statistics & Machine Learning
### The Risk of Return

#### The Problem

* quite a lot of products are being returned within their 30 day window

#### Investigation
* what questions should we ask?

* what is the existing system/policy?
* how many products are being sold?
* how many are being returned?
* what is the loss of profit associated with a return?
* why is this profit lost?
* what causes returns?
* (what causes sales?)

* argonaught sell $100$ toys a year costing $£100$ each
    * each toy makes the company a profit of $£10$ after $£10$ of operating & delivery costs
    * a return costs them $£5$, reducing a resale to $0$ profit
* $20\%$ of toys are being returned within their $30$ day window
* problem is: we are making $10 \times 80 = £800$ profit (vs $10 \times 100 = £1000$ with no returns)
    * can we improve this?
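
The toy figures above can be checked in a few lines of Python. This is only a sketch: the function and key names are illustrative, and it assumes (following the bullets) that a returned unit ends up contributing zero profit.

```python
def profit_summary(units_sold, profit_per_unit, return_rate):
    """Compare yearly profit with and without returns.

    Assumption: a returned unit contributes 0 profit overall,
    because the return cost cancels out the profit of the resale.
    """
    returned = round(units_sold * return_rate)
    kept = units_sold - returned
    actual = kept * profit_per_unit
    ideal = units_sold * profit_per_unit
    return {
        "returned": returned,
        "actual": actual,
        "ideal": ideal,
        "lost": ideal - actual,
    }


# Figures from the toy example: 100 units, £10 profit each, 20% returned.
summary = profit_summary(units_sold=100, profit_per_unit=10, return_rate=0.20)
print(summary)
```

Changing `return_rate` shows how much profit each percentage point of returns is worth, which is the quantity any proposed fix has to beat.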

#### Ideation

* What actions can we take?
* What can we quantify?
* What data do we have?
* Is existing data relevant to taking possible actions?
* What can we do if we lack data?

* What can we quantify?
    * $P(Return | CustomerBehaviour)$ ?
    * $P(Purchase | ItemPrice, CustomerBehaviour)$ ?
* Can we:
    * modify prices?
    * modify return window?
* tiny amount of data
    * we should look at each return in detail (only 20 items)
    * perform an essentially qualitative analysis
    * even, perhaps: email the customer asking them why they returned
* data not relevant to certain strategies
    * eg., how do we know $P(Purchase | Price)$?
    * ... we need to experiment!
        * ie., to change prices and directly observe $P(Purchase)$
* we can look at qualitative analysis and *simulate* datasets which model our assumptions
    * eg., that "recent history of returns" increases $P(Return)$, etc.
* or we can experiment with customer base to collect more data 
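
One way to "simulate datasets which model our assumptions" is sketched below. Everything numeric here is invented for illustration: the base return rate, the size of the uplift for customers with a recent return, and the share of such customers are assumptions, not measurements.

```python
import random


def simulate_customers(n, base_rate=0.2, recent_return_uplift=0.3,
                       share_recent_returners=0.25, seed=0):
    """Generate fake orders under the assumption that a recent
    history of returns increases P(Return)."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    rows = []
    for _ in range(n):
        recent_returner = rng.random() < share_recent_returners
        p_return = base_rate + (recent_return_uplift if recent_returner else 0.0)
        returned = rng.random() < p_return
        rows.append({"recent_returner": recent_returner, "returned": returned})
    return rows


def return_rate(rows, recent_returner):
    """Estimate P(Return | recent_returner) from simulated rows."""
    group = [r for r in rows if r["recent_returner"] == recent_returner]
    return sum(r["returned"] for r in group) / len(group)


rows = simulate_customers(5000)
print("P(Return | recent returner)     ~", round(return_rate(rows, True), 2))
print("P(Return | not recent returner) ~", round(return_rate(rows, False), 2))
```

A simulated dataset like this cannot tell us whether the assumption is true; it only lets us test whether our analysis would detect the effect if it were.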

#### Wb. Solution Design

* let's change the return window if we think that a customer is likely to return an item
    * two windows: statutory minimum (eg., 7 day) and 30 day
* factors relevant to return:
    * age, location, history of recent returns, repeat customer, whether item is gifted
* build a logistic regression model using experimental customer data
    * $\hat{y} = \hat{f}(x_{Age}, x_{FqReturns}, x_{Gift}, \dots)$
* plug in model to orders page
    * define model in python: `def predict_returner(): ...`
    * define action: `if predict_returner():`
        * show short window, `else:` show long window
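
The design above might be sketched as follows. This is a minimal, standard-library-only illustration: the training data is made up, only two of the listed factors are used (recent returns and gift flag), the model is fitted with plain batch gradient descent, and `predict_returner` takes arguments that the one-line stub above does not specify.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit logistic regression weights (bias first) by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_returner(w, x, threshold=0.5):
    """True if the modelled P(Return | x) is at least the threshold."""
    p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return p >= threshold


def return_window_days(w, x, short=7, long=30):
    """Action: show the short window to likely returners, else the long one."""
    return short if predict_returner(w, x) else long


# Hypothetical experimental data: (number of recent returns, is_gift) -> returned?
X = [(0, 0), (0, 1), (1, 0), (3, 0), (4, 1), (5, 1), (0, 0), (2, 1)]
y = [0, 0, 0, 1, 1, 1, 0, 1]
weights = fit_logistic(X, y)

print(return_window_days(weights, (5, 1)))  # customer with many recent returns
print(return_window_days(weights, (0, 0)))  # customer with no recent returns
```

In practice one would more likely use an existing library for the fit; the point of the sketch is that the model reduces to a function from customer features to a yes/no decision that the orders page can call.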
        

#### Related Problems
* loan default risk analysis
* changing pricing models based on customer behaviour
* obtaining demographics from likes (/preference signals) -- cf. facebook pdf

## Algorithms & Graph Theory
### Recommending Products

#### The Problem
* recommendations are currently determined by a product score & sales team
* these are labour-intensive and not tailored to the customer

#### Investigation
* what questions should we ask?

* what is the existing system/policy?
* how are scores computed?
* how does the sales team select products?
* how effective is this system?
    * ie., what is the sales/profit impact?
* what impact will an improved system have?

* what is the current scoring system?
    * `score = rating_score + popularity_score`
    * based on filtered list of products by sales team
* why are recommendations not currently tailored?
    * very hard to compute personalised scores for all users, all products
* how effective is the current scoring system?
    * customer's probability of purchasing X:
        * in general, $P(X)$ (ie., #purchasesX / #purchases) 
        * via query, $P(X|Q)$
        * via recommendation, $P(X|R)$
    * suppose: $P(X|Q) = 0.8$, $P(X|R) = 0.2$, $P(X) = 0.01$
    * what do these say?
    * what don't they say?
        * ie., what experiments could we do?
        * eg., to determine the efficacy of various scoring strategies
* what profit increase is associated with an improvement in scoring?
    * eg., strategies $R_1$, $R_2$: $P(X|R_1) = 0.1$, $P(X|R_2) = 0.2$
    * what is this in profit?
        * 10% * Profit/Item
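
The profit comparison between two scoring strategies can be written out as below. The profit per item (£10, borrowed from the toy example earlier) and the number of recommendations shown are assumptions made for illustration.

```python
def expected_profit(p_purchase, profit_per_item, n_recommendations=1):
    """Expected profit from showing a recommendation n times."""
    return p_purchase * profit_per_item * n_recommendations


def uplift(p_old, p_new, profit_per_item, n_recommendations=1):
    """Extra expected profit from switching scoring strategy."""
    return (expected_profit(p_new, profit_per_item, n_recommendations)
            - expected_profit(p_old, profit_per_item, n_recommendations))


# P(X|R_1) = 0.1 vs P(X|R_2) = 0.2, with a hypothetical £10 profit per item
# and a hypothetical 1000 recommendations shown.
print(uplift(p_old=0.1, p_new=0.2, profit_per_item=10, n_recommendations=1000))
```

The difference is the "10% * Profit/Item" figure scaled by how many recommendations are actually shown, which is why the volume of recommendations matters as much as the conversion rate.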
      

#### Ideation

* What actions can we take?
* What can we quantify?
* What data do we have?
* Is existing data relevant to taking possible actions?
* What can we do if we lack data?

* What could we add to `score` ?
    * $P(Purchase|Age, Gender, Location)$ ?
        * requires knowing demographics 
        * & computing conditionals for lots of users/products
* Can we measure $P(Purchase|UserCharacteristics)$ indirectly?
    * eg., via "how often do people *similar to* this user purchase?"
* How can we determine who is *similar*?
    * community detection via reviews/purchases
    * users who "review-alike" are "alike"
* If we can determine a "similar user set" then:
    * we can specialize the `rating_score`
    * and add a `community_score`
        * how "important" that product is to that community
        * ie., a graph-centrality measure
 

#### Wb. Solution Design

* bipartite graph, $G$
    * two types of nodes: users, products
    * users are connected to products they like (high score review)
* we filter $G$ into $F$, when we know the user $u$ we are recommending to
    * filter by selecting "fans of the same products" 
    * fans = users who review the same products
* with $F$, we group users into communities
    * community = group of users who review related products
        * "related" = often reviewed by the same user
* and obtain $u$'s community, $C$
    * this contains user-products
    * where the products are those which are "preferred" by $u$'s community
* we extract the products $p$ from $C$ and score each with the same formula above
    * we can also add additional terms
    * eg., `community_popularity` which measures, in $C$, how often a product is reviewed
    * + many other terms
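
A small sketch of this design is given below. The users, products and likes are invented, and "community" is simplified to "users who share at least one liked product with $u$", which stands in for a real community-detection algorithm. The score used is only the `community_popularity` term, counted within $C$.

```python
from collections import Counter

# Hypothetical edges of the bipartite graph G: (user, product they rated highly)
likes = [
    ("ann", "lego"), ("ann", "puzzle"),
    ("bob", "lego"), ("bob", "robot"), ("bob", "puzzle"),
    ("cat", "lego"), ("cat", "robot"),
    ("dan", "kettle"), ("dan", "toaster"),
]


def build_graph(likes):
    """Map each user node to the set of product nodes it is connected to."""
    user_products = {}
    for user, product in likes:
        user_products.setdefault(user, set()).add(product)
    return user_products


def community_of(user, user_products):
    """Filter G to 'fans of the same products': users sharing a liked product."""
    mine = user_products[user]
    return {u for u, prods in user_products.items() if prods & mine}


def recommend(user, user_products, top_n=3):
    """Rank products the user has not liked by community_popularity."""
    counts = Counter()
    for member in community_of(user, user_products):
        if member != user:
            counts.update(user_products[member] - user_products[user])
    return counts.most_common(top_n)


G = build_graph(likes)
print(recommend("ann", G))
```

Here `dan` shares no products with `ann`, so his purchases never enter her recommendations; that filtering step is what makes the score personal rather than global.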
    

#### Related Problems
* (forensic) monitoring networking events in 4G networks
* amazon recommendations (cf. case study pdf)

## Big Data
### User Behaviour with Event Systems

#### The Problem
We need a data system which supports the inferences and techniques above. These require (or are otherwise helped by) data on user *behavior*.

However, we currently *only* store transactional retail data: orders, customers, etc. This does not capture: when a transaction occurred, who initiated it, from what system, and other context. It is this context which enables inference.

#### Investigation: What is the Problem?
* What questions should we ask?

* What do existing data systems track?
* What software do these systems support?
* How long have they been in operation?
* What current technical problems arise from their use?
* What kinds of queries do these systems answer?
* What kinds don't they?

* What do existing data systems track?
    * reviews, users, products, orders, stock, campaigns, pages...
    * customer_support, ...
* What software do these systems support?
    * ecommerce & review website
    * customer support website
    * mobile app
    * advert campaign system
    * editorial content system
    * stock & purchasing system
* How long have they been in operation?
    * v1 2005, v2 2010, v3 2017
* What current technical problems arise from their use?
    * poor performance in review system & mobile app
    * presently no recorded connections between ads, reviews, etc.
    * editorial content poorly monitored (eg., hits on article, but no other metrics)
* What kinds of queries do these systems answer?
* What kinds don't they?

#### Ideation: What could we do to solve it?

* What actions can we take?
* What can we quantify?
* What data do we have?
* Is existing data relevant to taking possible actions?
* What can we do if we lack data?

* We need to change the website & apps 
    *  many aspects of user interaction need to be recorded
* We need events:
    * subject, verb, object, context
    * CUSTOMER VIEWS PAGE {ID: 10, TIME: 10pm}
* We need a new system to track these events
* Do we keep existing transactional system?
    * YES... but where does it sit in relation to new system?
    * ... how complex can the migration be? (without too much impact)
* What kinds of queries should an event system answer?
    * Query:
        * What are customers doing now?
        * How often does a given customer visit the site?
            * read pages, ...
        * What campaigns/reviews/pages *impact* sales?
        * etc.
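
The subject/verb/object/context shape of an event, and two of the queries listed above, could look like this in Python. The field names, the context keys (`customer_id`, `page_id`, `product_id`, `time`) and the sample events are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Event:
    subject: str  # who acted, eg. "CUSTOMER"
    verb: str     # what they did, eg. "VIEWS"
    obj: str      # what they acted on, eg. "PAGE"
    context: dict = field(default_factory=dict)  # ids, time, source system, ...


# Hypothetical events; the date is arbitrary.
events = [
    Event("CUSTOMER", "VIEWS", "PAGE",
          {"customer_id": 1, "page_id": 10, "time": datetime(2024, 1, 1, 22, 0)}),
    Event("CUSTOMER", "VIEWS", "PAGE",
          {"customer_id": 2, "page_id": 11, "time": datetime(2024, 1, 1, 22, 30)}),
    Event("CUSTOMER", "VIEWS", "PAGE",
          {"customer_id": 1, "page_id": 12, "time": datetime(2024, 1, 1, 23, 15)}),
    Event("CUSTOMER", "REVIEWS", "PRODUCT",
          {"customer_id": 1, "product_id": 7, "time": datetime(2024, 1, 1, 23, 20)}),
]


def visit_count(events, customer_id):
    """How often does a given customer view a page?"""
    return sum(1 for e in events
               if e.verb == "VIEWS" and e.context.get("customer_id") == customer_id)


def active_since(events, now, window):
    """What are customers doing now? Events within `window` before `now`."""
    return [e for e in events if now - e.context["time"] <= window]


now = datetime(2024, 1, 1, 23, 30)
print(visit_count(events, customer_id=1))
print(len(active_since(events, now, timedelta(hours=1))))
```

Neither query can be answered from the transactional tables alone, because page views are never stored there.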
        

#### Solution Design: What would a plausible solution look like?
* event log
    * extremely fast append-only log of events
    * live log records 24 hr windows
    * daily archive & wipe of live log
* retail db *derived* from log
    * all existing tables populated from log *asynchronously* where possible
        * eg., reviews, but not orders
* existing queries point to retail db
* "log archive" supplied to data science & analytical teams 
* derive *new* per-service databases from main log
    * eg., campaign-targeting service
* migrate & replace "low-performance" services to new system
    * eg., with mobile app, feed to event log & query against new dbs
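
A toy, in-memory version of this design is sketched below to show the moving parts: an append-only log with a live window and an archive, and a reviews table derived by replaying the log. The class, method and key names are hypothetical, and a real system would use durable storage rather than Python lists.

```python
class EventLog:
    """Append-only event log with a live window and a daily archive."""

    def __init__(self):
        self._live = []
        self._archive = []

    def append(self, event):
        # Store a copy so later mutation by the caller cannot rewrite history.
        self._live.append(dict(event))

    def archive_and_wipe(self):
        """Daily job: move the live window into the archive, then clear it."""
        self._archive.extend(self._live)
        self._live = []

    def replay(self):
        """Yield every event in the order it was appended."""
        yield from self._archive
        yield from self._live


def derive_reviews(events):
    """Rebuild the reviews table: latest rating per (customer, product)."""
    table = {}
    for e in events:
        if e["verb"] == "REVIEWS":
            table[(e["customer_id"], e["product_id"])] = e["rating"]
    return table


log = EventLog()
log.append({"verb": "REVIEWS", "customer_id": 1, "product_id": 7, "rating": 4})
log.append({"verb": "VIEWS", "customer_id": 2, "page_id": 10})
log.append({"verb": "REVIEWS", "customer_id": 2, "product_id": 7, "rating": 3})
log.archive_and_wipe()
log.append({"verb": "REVIEWS", "customer_id": 1, "product_id": 7, "rating": 5})

reviews = derive_reviews(log.replay())
print(reviews)
```

Because the table is derived, it can be thrown away and rebuilt from the log at any time, and new per-service databases can be built the same way with a different `derive_*` function.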

#### Related Problems
* netflix data system ( https://www.youtube.com/watch?v=CZ3wIuvmHeM )

### More Problems

* Sales campaigns are targeted to demographics. We could improve sales with better targeting.


### Discuss: What is Problem Solving?

* open problems vs. closed problems
    * under-specified open problem solving requires:
        * establishing relevance criteria 
        * establishing success criteria
        * building the problem framing
    * over-specified open problem solving requires:
        * designing systems which conform to specification
        * control, quality, integrity (etc.) checks on specifications
        * reliability, performance (etc.) checks on implementation 
    * closed problem solving requires:
        * building systems to exact specification

### Brainstorm: Mind-Map concerns for some example problems

* example organizations (problem domains: amazon, spotify...)
* probabilistic questions (how likely is H...)
* what data is relevant?
* how would we collect it?
* how would we store/track/etc.?

### Reflect: What skills, concerns, topics, knowledge have we used?
### Review: What is a Data Scientist?

---
## Summary
