# Exploring Retail Rocket

## Required Packages

In [37]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
from datetime import datetime

## Exploring Datasets

In [3]:
directory = Path("../Datasets/2/")
files = [f.name for f in Path(directory).iterdir()]

print(files)

['item_properties_part1.csv', 'item_properties_part2.csv', 'category_tree.csv', 'events.csv']


### Item Properties (1)

In [12]:
item_properties1 = pd.read_csv("../Datasets/2/item_properties_part1.csv")
print(f'The dataset contains {len(item_properties1)} rows')
item_properties1.head(20)

The dataset contains 10999999 rows


Unnamed: 0,timestamp,itemid,property,value
0,1435460400000,460429,categoryid,1338
1,1441508400000,206783,888,1116713 960601 n277.200
2,1439089200000,395014,400,n552.000 639502 n720.000 424566
3,1431226800000,59481,790,n15360.000
4,1431831600000,156781,917,828513
5,1436065200000,285026,available,0
6,1434250800000,89534,213,1121373
7,1431831600000,264312,6,319724
8,1433646000000,229370,202,1330310
9,1434250800000,98113,451,1141052 n48.000


### Item Properties (2)

In [13]:
item_properties2 = pd.read_csv("../Datasets/2/item_properties_part2.csv")
print(f'The dataset contains {len(item_properties2)} rows')
item_properties2.head(20)

The dataset contains 9275903 rows


Unnamed: 0,timestamp,itemid,property,value
0,1433041200000,183478,561,769062
1,1439694000000,132256,976,n26.400 1135780
2,1435460400000,420307,921,1149317 1257525
3,1431831600000,403324,917,1204143
4,1435460400000,230701,521,769062
5,1433041200000,286407,202,820407
6,1438484400000,256368,888,437265 1296497 n24.000 229949 651738 285933
7,1437879600000,307534,888,150169 212349 1095303 824508 1257235 153900
8,1439089200000,102767,888,5135 790941 1055803 221748 122132 n12.000 1135...
9,1431831600000,215180,71,1096621


Both item properties files list the properties of items and how those properties change over time.
- The files form a change log: whenever a property of an item changes, the new value is recorded along with a timestamp.
- Changes are logged only at specific intervals, so the data is a series of snapshots of the change log rather than a continuous record.
- The values in the property column are hashed (words replaced by numbers), except for categoryid and available, which appear as plain text in the output above.
- The values in the value column were normalised (cleaned and simplified, e.g., lowercased, special characters removed) and then stemmed (e.g., running to run, universities to univers).
- After stemming, each word is converted to a number using a hashing function. This removes the actual text (for privacy and compactness).

**Note:** Stemming reduces the vocabulary size, which can improve generalisation for machine learning models.
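
Because the files form a change log, the state of an item at a given moment is the most recent record per (itemid, property) at or before that moment. The following is a minimal sketch of that lookup on a small invented frame with the same columns. The helper name `snapshot_as_of` and the toy values are assumptions for illustration; they are not part of the dataset or of this notebook.

```python
import pandas as pd

def snapshot_as_of(log: pd.DataFrame, ts: int) -> pd.DataFrame:
    """Return the latest value per (itemid, property) recorded at or before ts (ms)."""
    past = log[log["timestamp"] <= ts]
    latest = (
        past.sort_values("timestamp", kind="stable")
            .drop_duplicates(subset=["itemid", "property"], keep="last")
    )
    return latest.sort_values(["itemid", "property"]).reset_index(drop=True)

# Toy change log (values are invented for illustration only)
toy_log = pd.DataFrame({
    "timestamp": [1000, 2000, 3000, 1500],
    "itemid":    [1, 1, 1, 2],
    "property":  ["available", "available", "categoryid", "available"],
    "value":     ["1", "0", "1338", "1"],
})

snap = snapshot_as_of(toy_log, ts=2500)
print(snap)
```

On the real data the same function could be called on item_properties with a timestamp taken from events, although sorting 20 million rows on every call would be slow and a pre-sorted frame would be preferable.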

In [18]:
# combining item_properties1 and item_properties2 (ignore_index avoids duplicate row labels)
item_properties = pd.concat([item_properties1, item_properties2], ignore_index=True)
print(f'Total number of rows after the concatenation: {len(item_properties)}')

Total number of rows after the concatenation: 20275902


In [25]:
# checking for null values in the dataset
item_properties.isnull().sum()

timestamp    0
itemid       0
property     0
value        0
dtype: int64

In [31]:
# counting the number of rows recorded for each itemid
item_properties['itemid'].value_counts()


itemid
158903    468
254069    462
91855     461
150800    459
120386    444
         ... 
251894     15
342400     14
73456      13
207227     12
243157     12
Name: count, Length: 417053, dtype: int64

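There are 417,053 distinct items, and each has at least 12 rows in the dataset. Since the data is a change log, these are property records rather than distinct properties: the same property can appear several times for one item.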
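
The difference between rows per item and distinct properties per item can be made explicit with a grouped aggregation. This is a sketch on an invented frame; the column names match the dataset, while the rows and the output column names `rows` and `distinct_properties` are made up for illustration.

```python
import pandas as pd

# Toy frame: item 1 has two records for the same property
toy = pd.DataFrame({
    "itemid":   [1, 1, 1, 2, 2],
    "property": ["available", "available", "790", "888", "categoryid"],
})

# size counts every row; nunique counts each property once per item
per_item = toy.groupby("itemid")["property"].agg(
    rows="size",
    distinct_properties="nunique",
)
print(per_item)
```

Running the same aggregation on item_properties would show how many of the rows per item are repeated updates of the same property.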

In [32]:
# counting the number of rows recorded for each property
item_properties['property'].value_counts()

property
888           3000398
790           1790516
available     1503639
categoryid     788214
6              631471
               ...   
782                 1
288                 1
722                 1
744                 1
769                 1
Name: count, Length: 1104, dtype: int64

There are 1,104 distinct properties in the dataset. The distribution is heavily skewed: property 888 alone accounts for about 3 million rows, while some properties appear only once.

In [45]:
# checking the time range for the entire dataset (timestamps are in milliseconds)
(datetime.utcfromtimestamp(item_properties['timestamp'].max() / 1000)
 - datetime.utcfromtimestamp(item_properties['timestamp'].min() / 1000))

datetime.timedelta(days=126)

So the item properties cover a span of exactly 126 days.
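
As an aside, `datetime.utcfromtimestamp` is deprecated since Python 3.12. One possible alternative is to let pandas convert the millisecond timestamps. The sketch below assumes a hypothetical helper named `time_span` and uses two timestamps copied from the head() output above, so the result is the span of that sample only, not of the full dataset.

```python
import pandas as pd

def time_span(ts_ms: pd.Series) -> pd.Timedelta:
    """Span between the earliest and latest millisecond epoch timestamps."""
    ts = pd.to_datetime(ts_ms, unit="ms", utc=True)
    return ts.max() - ts.min()

# Two timestamps taken from the item_properties1 sample rows above
sample = pd.Series([1431226800000, 1441508400000])
span = time_span(sample)
print(span)
```

Note that this returns a `pd.Timedelta` rather than a `datetime.timedelta`, so the printed representation differs from the cell output above.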

### Category Tree

In [14]:
category_tree = pd.read_csv("../Datasets/2/category_tree.csv")
print(f'The dataset contains {len(category_tree)} rows')
category_tree.head(20)

The dataset contains 1669 rows


Unnamed: 0,categoryid,parentid
0,1016,213.0
1,809,169.0
2,570,9.0
3,1691,885.0
4,536,1691.0
5,231,
6,542,378.0
7,1146,542.0
8,1140,542.0
9,1479,1537.0


Category tree lists each category alongside its parent category. A missing parentid (as for category 231 above) presumably marks a top-level category. A hierarchical structure could be constructed from this.
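
One way to build such a hierarchy is a child-to-parent lookup that is walked upward until no parent is found. The sketch below uses only the ten rows shown above, and it assumes that a missing parentid means the category is a root. The names `parent_of` and `path_to_root` are illustrative.

```python
import pandas as pd

# First rows of category_tree as shown above; None stands for the missing parentid
tree = pd.DataFrame({
    "categoryid": [1016, 809, 570, 1691, 536, 231, 542, 1146, 1140, 1479],
    "parentid":   [213.0, 169.0, 9.0, 885.0, 1691.0, None, 378.0, 542.0, 542.0, 1537.0],
})

# child -> parent lookup, skipping rows with no parent (assumed to be roots)
parent_of = {
    int(child): int(parent)
    for child, parent in zip(tree["categoryid"], tree["parentid"])
    if pd.notna(parent)
}

def path_to_root(category: int, parent_of: dict) -> list:
    """Walk parent links upward; stops at a category with no known parent."""
    path, seen = [category], {category}
    while path[-1] in parent_of:
        nxt = parent_of[path[-1]]
        if nxt in seen:  # guard against cycles in malformed data
            break
        path.append(nxt)
        seen.add(nxt)
    return path

print(path_to_root(1146, parent_of))
```

With only these ten rows the walk stops as soon as it reaches a category whose own row is not in the sample; on the full category_tree it would continue up to the actual root.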

### Events

In [15]:
events = pd.read_csv("../Datasets/2/events.csv")
print(f'The dataset contains {len(events)} rows')
events.head(20)

The dataset contains 2756101 rows


Unnamed: 0,timestamp,visitorid,event,itemid,transactionid
0,1433221332117,257597,view,355908,
1,1433224214164,992329,view,248676,
2,1433221999827,111016,view,318965,
3,1433221955914,483717,view,253185,
4,1433221337106,951259,view,367447,
5,1433224086234,972639,view,22556,
6,1433221923240,810725,view,443030,
7,1433223291897,794181,view,439202,
8,1433220899221,824915,view,428805,
9,1433221204592,339335,view,82389,


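Events contains a log of visitor interactions. Each row records a timestamp, the visitor ID, the event type (only view appears in the rows shown above), the item involved, and a transaction ID, which is empty for these view events.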

In [47]:
# checking the time range covered by events (timestamps are in milliseconds)
(datetime.utcfromtimestamp(events['timestamp'].max() / 1000)
 - datetime.utcfromtimestamp(events['timestamp'].min() / 1000))

datetime.timedelta(days=137, seconds=86383, microseconds=404000)

We have just under 138 days' worth of events (137 days, 23 hours and 59 minutes).

In [48]:
# comparing the latest timestamps of item_properties and events
(datetime.utcfromtimestamp(item_properties['timestamp'].max() / 1000)
 - datetime.utcfromtimestamp(events['timestamp'].max() / 1000))

datetime.timedelta(days=-5, seconds=12, microseconds=212000)

In [49]:
# comparing the earliest timestamps of item_properties and events
(datetime.utcfromtimestamp(item_properties['timestamp'].min() / 1000)
 - datetime.utcfromtimestamp(events['timestamp'].min() / 1000))

datetime.timedelta(days=6, seconds=86395, microseconds=616000)

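Events start about 7 days before item_properties (6 days and 23:59:55) and end about 5 days after it (a timedelta of days=-5, seconds=12 is 12 seconds short of minus 5 days). Together that accounts for the difference of just under 12 days between the two time spans (just under 138 days versus 126 days).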
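
The gaps can be checked directly from the four timedeltas printed above. The sketch below copies those values by hand rather than recomputing them from the data; the variable names `lead` and `lag` are illustrative. It also shows how Python normalises a negative timedelta: `days=-5, seconds=12` means minus five days plus twelve seconds.

```python
from datetime import timedelta

# Values copied from the cell outputs above
events_span = timedelta(days=137, seconds=86383, microseconds=404000)
props_span = timedelta(days=126)
end_diff = timedelta(days=-5, seconds=12, microseconds=212000)      # max(props) - max(events)
start_diff = timedelta(days=6, seconds=86395, microseconds=616000)  # min(props) - min(events)

lead = start_diff  # how long events run before item_properties starts
lag = -end_diff    # how long events run after item_properties ends

print("lead:", lead)
print("lag:", lag)
print("lead + lag:", lead + lag)
print("span difference:", events_span - props_span)
```

The two gaps sum to 11 days 23:59:43.404, which equals the difference between the two spans.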