# Views are different perspectives on our data
Looking at the data from several perspectives is a way to understand it better.  
If you have an idea for a view, share it or open an issue.

In [2]:
import altair as alt
import numpy as np
import pandas as pd

from mlrepricer.oldsql import schemas
from mlrepricer import setup, helper
from mlrepricer.oldsql.database import SQLite
alt.data_transformers.enable('default', max_rows=1000000)

DataTransformerRegistry.enable('default')

In [3]:
t = schemas.pricemonitor(SQLite)()  # tableobject
df = pd.read_sql_query(f'SELECT * FROM {t.table}', t.conn, parse_dates=[t.eventdate], index_col='ID')

### Filter1, aka features we probably should not track in the first place
For me it drops about 4% of the data.

In [4]:
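# this is available in mlrepricer.helper.cleanup()
# keep only offers that are in stock and come from a featured merchant
filter1 = df[(df.instock == 1) & (df.isfeaturedmerchant == 1)]
# both columns are constant after the filter, so they carry no information
filter1 = filter1.drop(['instock', 'isfeaturedmerchant'], axis=1)
# also drop offers whose min + max shipping hours add up to more than 72
filter1 = filter1[(filter1.shipping_maxhours + filter1.shipping_minhours) <= 72]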
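
To check how much data a filter like this removes, one option is to wrap the three conditions in a function and compare row counts. The sketch below runs on an invented toy frame, not on the real pricemonitor table, and `apply_filter1` is a hypothetical name, not something taken from `mlrepricer.helper`.

```python
import pandas as pd

# Toy stand-in for the pricemonitor table; all values are invented.
toy = pd.DataFrame({
    'instock':            [1, 1, 0, 1, 1],
    'isfeaturedmerchant': [1, 0, 1, 1, 1],
    'shipping_minhours':  [24, 24, 0, 48, 0],
    'shipping_maxhours':  [48, 48, 0, 96, 0],
})


def apply_filter1(frame):
    """Apply the same three conditions as the cell above in one step."""
    keep = (
        (frame.instock == 1)
        & (frame.isfeaturedmerchant == 1)
        & (frame.shipping_maxhours + frame.shipping_minhours <= 72)
    )
    return frame[keep].drop(['instock', 'isfeaturedmerchant'], axis=1)


kept = apply_filter1(toy)
dropped_share = 1 - len(kept) / len(toy)
print(f'filter1 drops {dropped_share:.0%} of the toy rows')
```

On the real `df` the same two lines at the end would give the share of rows the filter removes.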

We now look at each attribute separately.  
Basically it's OK to drop both features.  
Offers that are not in stock do include some winners, but that is too unlikely to matter.  
Only about 1/300 of the dataset has that feature.

In [5]:
drop2 = df[df.isfeaturedmerchant==0]  # it has a zero chance to be a winner
print(f'We can drop {len(drop2)} rows and lose {len(drop2[drop2.isbuyboxwinner==1])} winners, we expected this to be zero')

We can drop 10429 rows and lose 0 winners, we expected this to be zero


In [6]:

drop = df[df.instock==0]
# winner rate in the kept rows divided by the winner rate in the dropped rows
normalized_lostwinner = len(filter1[filter1.isbuyboxwinner==1])/len(filter1)/(len(drop[drop.isbuyboxwinner==1])/len(drop))
print(f"It's {normalized_lostwinner:.2f} times less likely to find a buyboxwinner in the data set we drop here, we have {len(drop)} datarows.")
if normalized_lostwinner < 5:
    raise ValueError('the dropped rows contain too many buyboxwinners')

It's 6.49 times less likely to find a buyboxwinner in the data set we drop here, we have 1022 datarows.
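
The ratio in the cell above raises a `ZeroDivisionError` when the dropped rows contain no winner at all. A minimal sketch of a guarded version follows; `winner_rate` and `times_less_likely` are hypothetical helper names and the toy frames are invented.

```python
import pandas as pd


def winner_rate(frame):
    """Share of rows that are flagged as buy box winner."""
    return frame.isbuyboxwinner.mean()


def times_less_likely(kept, dropped):
    """How many times lower the winner rate is in `dropped` than in `kept`."""
    dropped_rate = winner_rate(dropped)
    if dropped_rate == 0:
        return float('inf')  # no winner is lost by dropping these rows
    return winner_rate(kept) / dropped_rate


# Invented toy data: 2 of 4 kept rows win, 1 of 8 dropped rows wins.
toy_kept = pd.DataFrame({'isbuyboxwinner': [1, 0, 1, 0]})
toy_dropped = pd.DataFrame({'isbuyboxwinner': [0, 0, 0, 1, 0, 0, 0, 0]})

ratio = times_less_likely(toy_kept, toy_dropped)
print(f'{ratio:.2f} times less likely')
```

Returning infinity for the zero-winner case keeps a threshold check like `ratio < 5` usable without a special branch.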


# The stuff below is not wrong, but looking at the data this way makes little sense.
It is better to look at the data on a per-message level, because one message has all the information for a state.
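
A per-message view could look roughly like the sketch below. It assumes that one message corresponds to one `(asin, time_changed)` pair, which is an assumption and not something the schema confirms. The rows are copied from the groundtruth table further down, and the summary column names are made up for illustration.

```python
import pandas as pd

# Four offer rows taken from the groundtruth table in this notebook.
offers = pd.DataFrame({
    'asin': ['B0126QIK2K'] * 4,
    'time_changed': pd.to_datetime(
        ['2018-05-22 07:33:12.314'] * 2 + ['2018-05-22 07:42:09.934'] * 2
    ),
    'sellerid': ['A2EHWGAW6J9W8Q', 'AFEE1JGTCYXOG'] * 2,
    'price': [12.00, 11.85, 12.00, 21.57],
    'isbuyboxwinner': [0, 1, 1, 0],
})

rows = []
# Assumption: all offers sharing asin and time_changed belong to one message.
for (asin, changed), message in offers.groupby(['asin', 'time_changed']):
    winners = message[message.isbuyboxwinner == 1]
    rows.append({
        'asin': asin,
        'time_changed': changed,
        'n_offers': len(message),
        'min_price': message.price.min(),
        'winner_price': winners.price.min() if len(winners) else float('nan'),
    })

per_message = pd.DataFrame(rows)
print(per_message)
```

Each row of `per_message` then describes one complete state, for example whether the winner also had the lowest price.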

## What is the name for the type of feed data we get?
It's like time series data, but it tracks only changes.
I think we can imagine it in 5 dimensions without reducing the complexity.
 - asin
 - time_changed
 - competitor = groupby(['sellerid', 'isprime']).min()  # min because some sellers may have duplicate offers  
   those are two separate dimensions:
   - sellerid
   - isprime
 - features

Does this make sense?

Knowing this, we can make a pivot table. Let's refer to it as groundtruth.

In [8]:
filter1.groupby(['asin', 'sellerid', 'isprime', 'time_changed']).min()

| asin | sellerid | isprime | time_changed | feedback | feedbackpercent | isbuyboxwinner | price | shipping_maxhours | shipping_minhours |
|---|---|---|---|---|---|---|---|---|---|
| B0126QIK2K | A2EHWGAW6J9W8Q | 0 | 2018-05-22 07:33:12.314 | 1606 | 98 | 0 | 12.00 | 48 | 24 |
| B0126QIK2K | A2EHWGAW6J9W8Q | 0 | 2018-05-22 07:42:09.934 | 1606 | 98 | 1 | 12.00 | 48 | 24 |
| B0126QIK2K | A2EHWGAW6J9W8Q | 0 | 2018-05-22 09:38:58.802 | 1606 | 98 | 1 | 12.00 | 48 | 24 |
| B0126QIK2K | AFEE1JGTCYXOG | 0 | 2018-05-22 07:33:12.314 | 2072 | 100 | 1 | 11.85 | 24 | 24 |
| B0126QIK2K | AFEE1JGTCYXOG | 0 | 2018-05-22 07:42:09.934 | 2072 | 100 | 0 | 21.57 | 24 | 24 |
| B0126QIK2K | AFEE1JGTCYXOG | 0 | 2018-05-22 09:38:58.802 | 2072 | 100 | 0 | 32.29 | 24 | 24 |
| B015NJUI5O | A10GXGUDASI5WW | 1 | 2018-05-10 02:03:16.011 | 503 | 99 | 0 | 10.48 | 0 | 0 |
| B015NJUI5O | A10GXGUDASI5WW | 1 | 2018-05-10 06:03:58.780 | 503 | 99 | 1 | 10.48 | 0 | 0 |
| B015NJUI5O | A10GXGUDASI5WW | 1 | 2018-05-10 13:23:17.559 | 503 | 99 | 1 | 10.48 | 0 | 0 |
| B015NJUI5O | A10GXGUDASI5WW | 1 | 2018-05-10 20:30:04.405 | 504 | 99 | 0 | 12.99 | 0 | 0 |


What can we improve from here?
 - track only stable states, e.g. those that hold for more than 5 minutes
 - linearly transform isprime to remove a dimension  
More ideas are welcome on GitHub.
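
The stable-state idea could be sketched as follows. It assumes that a state of an asin lasts until the next `time_changed` of the same asin and that the frame has one row per message. The function name `stable_messages`, the asin values and the timestamps are all invented for illustration.

```python
import pandas as pd

# Invented toy messages, one row per (asin, time_changed).
toy = pd.DataFrame({
    'asin': ['A', 'A', 'A', 'B', 'B'],
    'time_changed': pd.to_datetime([
        '2018-05-22 07:00:00', '2018-05-22 07:02:00', '2018-05-22 09:00:00',
        '2018-05-22 08:00:00', '2018-05-22 08:30:00',
    ]),
})


def stable_messages(messages, min_duration=pd.Timedelta(minutes=5)):
    """Keep messages whose state held for more than `min_duration`."""
    messages = messages.sort_values(['asin', 'time_changed'])
    # Assumption: a state ends when the next message for the same asin arrives.
    next_change = messages.groupby('asin').time_changed.shift(-1)
    duration = next_change - messages.time_changed
    # The latest message per asin has no successor (NaT duration);
    # the comparison is False for NaT, so those rows are dropped as well.
    return messages[duration > min_duration]


stable = stable_messages(toy)
print(stable)
```

With offer-level data such as `filter1`, one would first reduce to unique `(asin, time_changed)` pairs, for example with `drop_duplicates`, and then merge the stable messages back onto the offers.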