# Data Analyst Intern
Data Science team at Linx Impulse<br>
datascience@chaordicsystems.com

## Introduction

Linx Impulse (previously Linx+Neemu+Chaordic) is atop the world of e-commerce, delivering to each user a unique and personalized experience while shopping online. We have the biggest database of behavioral data in the Latam market, and we use it to build amazing products that deliver product recommendations, search results and targeted newsletters to millions of people every day.

Our Data Scientists are on a mission to increase company value by applying statistics and machine learning techniques to our vast sea of data. We are looking for someone who can tell stories with data and is willing to work on all stages of our pipeline.

You will be the go-to person for product performance analysis and insight generation, focusing exclusively on understanding how decisions affect conversion, lift and other KPIs across our product suite. This position has high visibility within the company, since it requires frequent reporting to managers, directors and heads of all teams.

But first, you have to prove yourself by completing this coding challenge. Since this is an intern position, we judge candidates more on the potential shown in their solutions than on their CVs. So do your best, and good luck!

## The Challenge

You are given a sample of a full week of anonymized interaction events from an e-commerce site.<br>
Your task is to understand the data and show us patterns and insights to further tune our solutions for that online retailer.

For that purpose, you will have to:

 * Download and parse the data
 * Understand (research if needed) what you are being asked to calculate
 * Write legible, reproducible and efficient code to compute those metrics
 * Report the results with the appropriate tables and plots


## Inputs

Download the data from this link: https://s3.amazonaws.com/ml-challenge/ecommerce-events.ndjson.xz<br>
20 MB compressed, 300 MB uncompressed.

All events captured are stored in JSON format.
There are 4 types of events therein: page, product, search and transaction.

Every field of each event type is described below. Many hints are hidden there (Google is your friend).
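As a starting point for the "download and parse" step, here is a minimal sketch of a reader. It assumes the file holds one JSON object per line, as the `.ndjson` extension suggests; the function name `read_events` is made up for illustration.

```python
import json
import lzma
from typing import Iterator


def read_events(path: str) -> Iterator[dict]:
    """Yield one event dict per non-empty line of an xz-compressed NDJSON file.

    Assumption: one JSON object per line (suggested by the .ndjson extension).
    """
    # lzma.open in text mode decompresses on the fly, so the 300 MB
    # uncompressed file never has to be written to disk.
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

A first sanity check could be counting events per type, for example `collections.Counter(e["eventType"] for e in read_events("ecommerce-events.ndjson.xz"))`.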

### Page
Generic pageview event.

```python
{'eventType': 'page', # identifies the type of event
 'date': '2017-06-03 18:20:53', # timestamp of event generation
 'visitor': '4ce48a50-f688-11e6-aed6-b52bfa51fc22', # anonymous visitor persistent and unique identifier
 'deviceType': 'mobile', # either desktop or mobile
 'osType': 'Android', # extracted from user-agent
 'osVersion': '6.0', # same
 'browserType': 'Chrome Mobile', # same
 'browserVersion': '58.0.3029', # same
 'referrerType': 'search', # Google,Yahoo,Bing->search, Facebook,Instagram->social, other hosts->other, no referrer->direct
 'utm_source': 'Partner_9', # redacted acquisition channel partner, extracted from the url querystring
 'utm_medium': 'Medium_7', # redacted acquisition medium (normally email or ads), from querystring
 'utm_campaign': 'Campaign_17', # redacted acquisition campaign (like cart abandonment email), from querystring
 'pageType': 'subcategory', # useful to build a conversion funnel
 'category': '319|689', # hierarchical product category (689 is inside 319)
 'url': '7806427dfc116ddbbcb37095a477330851b0afe9', # hashed URL
 'referrer': '18691404689e091548a691bf31b9995e8e7b9fcd' # hashed referrer
}
```

### Product

Product details pageview event

```python
{'eventType': 'product',
 'date': '2017-06-01 13:35:13',
 'visitor': '4806c650-46e8-11e7-8604-956bce5b488e',
 'deviceType': 'desktop',
 'osType': 'Windows',
 'osVersion': '7',
 'browserType': 'Chrome',
 'browserVersion': '58.0.3029',
 'referrerType': 'direct',
 'utm_source': 'Partner_4',
 'utm_medium': 'Medium_3',
 'utm_campaign': 'Campaign_3',
 'pageType': 'product',
 'category': '558|727',
 'url': '9a1cbff5cbff8666be513efeee781a71ff556f5f',
 'referrer': None,
 'product': 5639, # product unique identification
 'tags': ['216', '695', '944'], # tag codes associated with that product
 'price': 24.9, # product value
 'status': 'AVAILABLE' # AVAILABLE, UNAVAILABLE or REMOVED
}
```

### Search

Search results pageview event

```python
{'eventType': 'search',
 'date': '2017-06-03 09:02:06',
 'visitor': '0eccdd90-477d-11e6-b479-450984b9b0b7',
 'deviceType': 'desktop',
 'osType': 'Windows',
 'osVersion': None,
 'browserType': 'Chrome',
 'browserVersion': '51.0.2683',
 'referrerType': None,
 'utm_source': None,
 'utm_medium': None,
 'utm_campaign': None,
 'pageType': 'search',
 'category': None,
 'url': '95a9fe3739eef4f7f38c6c0644e319ba77a1349a',
 'referrer': '5c808dff012fcbe74dc2941a7664e357db367176',
 'query': 'papel ou casa ventosa', # obfuscated search query
 'searchItems': [6077,15749,3398,8007, # products returned as a result of that query, respecting the order they were shown
                 18741,832,19311,12891,
                 3264,16885,20613,156,
                 11030,16204,2247,127,
                 17178,6166,268,1851,
                 2521,20810,16551,18626]
}
```

### Transaction

Purchase confirmation pageview event

```python
{'eventType': 'transaction',
 'date': '2017-06-01 17:30:54',
 'visitor': '84cc0d00-462c-11e7-93a0-39eac0558d08',
 'deviceType': 'desktop',
 'osType': 'Windows',
 'osVersion': None,
 'browserType': 'Firefox',
 'browserVersion': '53.0.0',
 'referrerType': None,
 'utm_source': None,
 'utm_medium': None,
 'utm_campaign': None,
 'pageType': 'confirmation',
 'category': None,
 'url': 'da7caa77e2729e12b32a9d7d1a324652ce2264a6',
 'referrer': '6e03ee62984224d0c0f08d4b68b819297d7f4d14',
 'order': 5545, # unique transaction identification
 'orderItems': [{ # list of products in the cart in that transaction
     'product': 16493, # product id
     'price': 19.9, # product unit price
     'quantity': 1.0 # number of units bought
  },
  {'product': 1432, 'price': 20.9, 'quantity': 1.0},
  {'product': 4621, 'price': 395.01, 'quantity': 1.0}]
}
```
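
To illustrate how the `orderItems` structure can be read, the sketch below computes the value of a single transaction as the sum of unit price times quantity. The helper name `order_value` is hypothetical. Whether the same `order` id can appear in more than one event is not stated here, so deduplication is left as a decision for you to make and document.

```python
def order_value(event: dict) -> float:
    """Sum unit price * quantity over the items of one transaction event."""
    # Assumption: a missing or null 'orderItems' is treated as an empty cart.
    items = event.get("orderItems") or []
    return sum(item["price"] * item["quantity"] for item in items)


# The order items from the transaction example above.
example = {
    "eventType": "transaction",
    "order": 5545,
    "orderItems": [
        {"product": 16493, "price": 19.9, "quantity": 1.0},
        {"product": 1432, "price": 20.9, "quantity": 1.0},
        {"product": 4621, "price": 395.01, "quantity": 1.0},
    ],
}
print(round(order_value(example), 2))
```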

## Questions

Using this dataset, answer the questions below.

As you progress, you will notice the questions become increasingly ambiguous in their methodology. You will have to make decisions, and explicitly state what assumptions or simplifications you made.

Always keep in mind that usefulness is the most important characteristic of any good metric.

Also, timing is more important than completeness. We would rather receive fewer, thorough answers than all of them half-baked or late. Go as far as you can before the deadline.

### 1. What was the total revenue?

### 2. What percentage of visitors used a mobile device?

### 3. What search query had the highest click-through rate?
Among queries with at least 15 instances.<br>
It's considered a click if the target page is a product page and its product id was in the list of search results.
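
The click definition above leaves open how a search is linked to its target page. The sketch below makes one possible choice, stated as an assumption: the target page is the same visitor's next event in time. Matching the product event's `referrer` hash against the search `url` would be another option. The function name `query_ctr` and its parameters are hypothetical.

```python
from collections import defaultdict


def query_ctr(events, min_searches=15):
    """Click-through rate per query, under a 'next event' linking assumption."""
    by_visitor = defaultdict(list)
    for event in events:
        by_visitor[event["visitor"]].append(event)

    searches = defaultdict(int)
    clicks = defaultdict(int)
    for visitor_events in by_visitor.values():
        # 'YYYY-MM-DD HH:MM:SS' strings sort chronologically as plain text.
        visitor_events.sort(key=lambda e: e["date"])
        followers = visitor_events[1:] + [None]
        for current, following in zip(visitor_events, followers):
            if current["eventType"] != "search":
                continue
            query = current.get("query")
            searches[query] += 1
            # Assumption: the "target page" is the visitor's very next event.
            if (
                following is not None
                and following["eventType"] == "product"
                and following["product"] in (current.get("searchItems") or [])
            ):
                clicks[query] += 1

    # Keep only queries with enough instances to be meaningful.
    return {q: clicks[q] / n for q, n in searches.items() if n >= min_searches}
```

The query with the highest rate is then `max(rates, key=rates.get)`. Ties and query normalization (case, whitespace) are further decisions worth stating in the report.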

### 4. When is the site most busy?
Answer for both the time of day and the day of the week.
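
One simple way to approach this is to count events per hour and per weekday, as sketched below. Counting raw events (rather than visitors or sessions) and taking the `date` field at face value (its time zone is not stated) are both assumptions; `traffic_profile` is a hypothetical name.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def traffic_profile(events):
    """Return (events per hour of day, events per weekday name)."""
    by_hour, by_weekday = Counter(), Counter()
    for event in events:
        ts = datetime.strptime(event["date"], "%Y-%m-%d %H:%M:%S")
        by_hour[ts.hour] += 1
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        by_weekday[WEEKDAYS[ts.weekday()]] += 1
    return by_hour, by_weekday
```

`by_hour.most_common(3)` then lists the busiest hours. With only one week of data each weekday appears once, so weekday differences are worth reporting with that caveat.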

### 5. What is the share of revenue among categories brought by Campaign_2?
Use last-touch attribution.
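
Last-touch attribution credits a purchase to the most recent campaign the visitor arrived through before buying. The sketch below is one possible reading, with several assumptions: no lookback window, the top-level category (the first code in `category`) as the grouping key, and a product-to-category mapping taken from product events. The function name `campaign_revenue_share` is hypothetical.

```python
from collections import defaultdict


def campaign_revenue_share(events, campaign="Campaign_2"):
    """Share of revenue per top-level category for orders last touched by `campaign`."""
    ordered = sorted(events, key=lambda e: e["date"])

    # Assumption: a product's category is whatever its product events report;
    # only the top level of the hierarchy ('558' in '558|727') is kept.
    top_category = {
        e["product"]: e["category"].split("|")[0]
        for e in ordered
        if e["eventType"] == "product" and e.get("category")
    }

    last_campaign = {}  # visitor -> most recent non-null utm_campaign so far
    revenue = defaultdict(float)
    for e in ordered:
        if e.get("utm_campaign"):
            last_campaign[e["visitor"]] = e["utm_campaign"]
        if e["eventType"] == "transaction" and last_campaign.get(e["visitor"]) == campaign:
            for item in e.get("orderItems") or []:
                category = top_category.get(item["product"], "unknown")
                revenue[category] += item["price"] * item["quantity"]

    total = sum(revenue.values())
    return {c: v / total for c, v in revenue.items()} if total else {}
```

Products never seen in a product event fall into an `"unknown"` bucket here; how large that bucket is in the real data is something to check and report.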

### 6. Estimate the impact of unavailable products
Bonus points if you consider cross-elasticity<br>
(Don't fret too much. Deliver something; no model is ever going to be perfect anyway.)

## Results delivery

1. __The main deliverable is a Report with your answers__.<br>
And accompanying details, such as criteria used and neat plots.


2. __You may use any language, tools, and cloud services you want.__<br>
But if you use Python we will like you more ;)


3. __If you do well on the Report, we will want to see your code.__<br>
It should be hosted on GitHub or another online repository you can share with us.


## Documentation

We encourage you to write down every step along the way: design decisions, references you used, difficulties, insights and partial results.<br>
That way, wrapping it all up when we call you for the interview should be a breeze.

If you have any questions, feel free to contact us at datascience@chaordic.com.br.