## 1. The Hypothesis

In recent years, techniques from various deep learning areas have been applied across domains to real-life cases that have apparently little connection to their origin. One such case is applying deep representation learning from natural language processing to business analytics systems, and to recommender systems in particular.

The intuition behind our proposed deep learning pipeline is that we could, at least in theory, generate, through multiple modeling iterations, semantic vector space embeddings for each individual item (product) that are powerful enough to let us *infer replacement items and propose them in case of original product shortages*, all in a self-supervised setting.

The hypothesis is that we can apply both direct retrofitting-based fine-tuning of the pre-trained product embeddings and re-construction of the GloVe vector space with new co-occurrence matrices to obtain a product vector space model that yields item-replacement information. More precisely, we hypothesize that the proposed approach will reduce the cosine distance between products that can actually replace each other in real life, in a manner similar to the word-vector work of Faruqui et al. (2014) and Dingwall & Potts (2018), and will increase the distance between product embeddings that are not similar but still ended up semantically related as a result of the vector space optimization process.
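
To make the "pull replaceable products together" half of the hypothesis concrete, here is a minimal sketch of a retrofitting update in the style of Faruqui et al. It is not the pipeline's actual implementation: the `retrofit` and `cosine_distance` helpers, the toy embeddings and the neighbor graph are all hypothetical, and uniform neighbor weights are assumed. It does not cover the second half of the hypothesis (pushing dissimilar items apart).

```python
import numpy as np

def retrofit(emb, neighbors, n_iters=10, alpha=1.0):
    """Pull each vector toward its graph neighbors while staying near the original.

    emb:       (n_items, dim) array of pre-trained vectors
    neighbors: dict mapping item index -> list of neighbor indices
    """
    original = emb.astype(float)   # astype returns a copy, so emb is not modified
    new = original.copy()
    for _ in range(n_iters):
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # assumed uniform neighbor weights summing to 1
            # weighted average of the current neighbor vectors and the original vector
            new[i] = (beta * new[nbrs].sum(axis=0) + alpha * original[i]) / (beta * len(nbrs) + alpha)
    return new

def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical toy data: items 0 and 1 are assumed replaceable, item 2 has no neighbors
emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
neighbors = {0: [1], 1: [0]}

retro = retrofit(emb, neighbors)
before = cosine_distance(emb[0], emb[1])
after = cosine_distance(retro[0], retro[1])
print("cosine distance before: {:.3f}, after: {:.3f}".format(before, after))
```

In this toy setup the two linked items move closer in cosine distance, while the item without neighbors keeps its original vector.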

## 2. The Data

For our experiment we decided to use a real-life transactional dataset. Further information on the data can be found in the data loading, preparation and minimal visualization steps of the experiment.

#### We load the required packages

In [3]:
import numpy as np
from scipy import sparse
import itertools
import os
import pandas as pd
from datetime import datetime as dt
from time import time
import textwrap
from itertools import combinations

#### Set up global variables and pretty-printing

In [4]:
DATA_HOME = 'experiment_data'
DATA_FILE = os.path.join(DATA_HOME, 'df_tran_proc_top_15k.csv')
META_FILE = os.path.join(DATA_HOME, 'df_items.csv')

CHUNK_SIZE = 100 * 1024 ** 2  # passed to pd.read_csv(chunksize=...), which counts rows (not bytes): ~105M rows per chunk

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.precision', 4)  # full option name; the short 'precision' key is ambiguous in recent pandas
np.set_printoptions(precision=3)
np.set_printoptions(suppress=True)
np.set_printoptions(linewidth=500)
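
Since the chunk size is later passed to `pd.read_csv`, it helps to keep in mind that `chunksize` there is a number of rows, not a number of bytes. The short sketch below illustrates this on a hypothetical in-memory CSV (the `csv_text` content is made up for illustration and only borrows the `BasketId` and `IDE` column names).

```python
import io
import pandas as pd

# hypothetical in-memory CSV standing in for the transactional file
csv_text = "BasketId,IDE\n" + "\n".join(
    "{},{}".format(i // 2, i) for i in range(10))

# chunksize=4 means "4 rows per chunk", regardless of how many bytes they take
chunk_rows = [len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
print(chunk_rows)
```

One consequence is that a basket whose rows straddle a chunk boundary is seen as two partial baskets when each chunk is grouped on its own.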


### 2.1. A few utility methods

To give our experiment a real-life setup, we generate the MCO (the matrix of co-occurrences) by reading the transactional data in batches. For this purpose we use the `generate_sparse_mco(file_name)` function.

In [5]:
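lst_log = []
_date = dt.now().strftime("%Y%m%d_%H%M")
log_fn = os.path.join("logs", _date + "_log.txt")

def P(s=''):
    """Print a message and mirror the full log history to the log file."""
    lst_log.append(s)
    print(s, flush=True)
    try:
        # best effort: the whole log is rewritten on each call and
        # failures (e.g. a missing 'logs' folder) are silently ignored
        with open(log_fn, 'w') as f:
            for item in lst_log:
                f.write("{}\n".format(item))
    except Exception:
        pass
    return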

def Pr(s=''):
    print('\r' + str(s), end='', flush=True)

def add_to_mco(df_chunk, dct_mco, basket_id_field, item_id_field):
  Pr("  Grouping the transactions...")
  t1 = time()
  transactions = df_chunk.groupby(basket_id_field)[item_id_field].apply(list)
  nr_trans = df_chunk[basket_id_field].unique().shape[0]
  t_trans = time() - t1
  Pr("  {} transactions grouped in {:.2f}s".format(
      nr_trans, t_trans))
  P("")
  times = []
  time_delta = 1000
  last_time = time()
  for i, (index, l) in enumerate(transactions.items()):
    t1 = time()
    market_basket = np.unique(l)  # keep only unique elements
    if market_basket.shape[0] == 1:
      continue
    perm_market_basket = list(itertools.permutations(market_basket, 2))
    for pair in perm_market_basket:
      if pair not in dct_mco: 
        dct_mco[pair] = 0
      dct_mco[pair] += 1
    if (i % time_delta) == 0:
      elapsed = time() - last_time
      last_time = time()
      times.append(elapsed)
      mean_time = np.mean(times) / time_delta
      remain_time = nr_trans * mean_time - (i + 1) * mean_time
      Pr("  Processed transactions {:.1f}% - {:.2f} min remaining...".format(
          (i + 1) / nr_trans * 100,
          remain_time / 60))
    # endfor
  # endfor
  P("")
  return dct_mco


def generate_sparse_mco(file_name, chunk_size=CHUNK_SIZE, basket_id_field='BasketId', item_id_field='IDE'):
  data_size = os.path.getsize(file_name)
  P("Reading transactional data file '{}' of size {:.2f} GB...".format(file_name, data_size / 1024**3))
  t1 = time()
  chunk_generator = pd.read_csv(file_name, chunksize=chunk_size)

  dct_mco = {}
  n_rows = 0
  for i, df in enumerate(chunk_generator):
    n_rows += df.shape[0]
    P("Processing chunk {} of data - ({} rows so far) ...".format(i+1, n_rows))
    dct_mco = add_to_mco(df, dct_mco, basket_id_field, item_id_field)
    
  P("  Converting dict to sparse matrix...")
  t2 = time()
  csr_mco = sparse.csr_matrix((
          list(dct_mco.values()),
          [list(x) for x in zip(*list(dct_mco.keys()))],
      ))
  t3 = time()
  t_full = t3 - t1
  t_csr = t3 - t2
  P("  MCO Processing done in {:.2f} min (sparse mat creation: {:.2f} min):".format(
      t_full / 60, t_csr / 60))
  P("  Transactional data:")
  P(textwrap.indent(str(df.iloc[:15]), " " * 4))
  P("  MCO data:")
  P(textwrap.indent(str(csr_mco[:15,:15].toarray()), " " * 4))
  return csr_mco
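
As a toy illustration of what the functions above compute, the following sketch builds the co-occurrence matrix for a handful of hypothetical baskets. The basket contents are made up; the counting logic mirrors `add_to_mco` (unique items per basket, ordered pairs via `itertools.permutations`), so the resulting matrix is symmetric with an empty diagonal.

```python
import itertools
import numpy as np
from scipy import sparse

# hypothetical baskets of item ids (IDE values)
baskets = [[0, 1, 2], [0, 2], [1], [2, 0, 0]]

counts = {}
for basket in baskets:
    items = np.unique(basket)      # duplicates within a basket count once
    if items.shape[0] < 2:
        continue                   # single-item baskets carry no co-occurrence
    # ordered pairs: both (a, b) and (b, a) are counted
    for pair in itertools.permutations(items.tolist(), 2):
        counts[pair] = counts.get(pair, 0) + 1

rows, cols = zip(*counts.keys())
mco = sparse.csr_matrix((list(counts.values()), (rows, cols)))
dense = mco.toarray()
print(dense)
```

Items 0 and 2 appear together in three baskets, so the corresponding entries hold the value 3 in both directions.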

### 2.2 Loading, viewing and understanding

The provided real-life data comes in a few files. The most notable ones are the transactional database file and the metadata file. We are going to read the metadata file and view part of its content, then read a small chunk of the transactional dataset and display it.

In [6]:
df_meta = pd.read_csv(META_FILE)
df_meta.iloc[:15]

| Unnamed: 0 | ItemId | IDE | Freq | ItemName | Ierarhie1 | Ierarhie2 | IsActive |
|---|---|---|---|---|---|---|---|
| 0 | 545535 | 0 | 21248 | YUVAL NOAH HARARI / SAPIENS. SCURTA ISTORIE A OMENIRII | 1 | 17 | 1 |
| 1 | 398648 | 1 | 16058 | REZERVE STILOU T10 BLUE LAMY SET | 11 | 134 | 1 |
| 2 | 406083 | 2 | 13371 | FELICITARI A 7331335123458 | 11 | 107 | 1 |
| 3 | 576732 | 3 | 11656 | YUVAL NOAH HARARI / HOMO DEUS. SCURTA ISTORIE A VIITORULUI | 1 | 17 | 1 |
| 4 | 563633 | 4 | 10381 | MARK MANSON / ARTA SUBTILA A NEPASARII | 1 | 48 | 1 |
| 5 | 486656 | 5 | 8592 | ECKHART TOLLE / PUTEREA PREZENTULUI. ED. VI | 1 | 48 | 1 |
| 6 | 258901 | 6 | 8341 | TURTA DULCE TIP INIMIOARE | 20 | 269 | 1 |
| 7 | 602874 | 7 | 7955 | YUVAL NOAH HARARI / 21 DE LECTII PENTRU SECOLUL XXI | 1 | 17 | 1 |
| 8 | 163219 | 8 | 7921 | CARTI POSTALE (2 S DESIGN) #2000032105856 | 11 | 23 | 1 |
| 9 | 259150 | 9 | 7904 | JELLY BEAN 75G FT GOURMET BOX F0075-0696F | 20 | 161 | 1 |


The metadata, taken directly from a real-life production system (ERP), contains raw information minimally describing each product SKU (`ItemId`): the product name (`ItemName`), the number of item sales observed in the selected period (`Freq`), a unique sequential item identifier (`IDE`), and hierarchy information in two fields, `Ierarhie1` and `Ierarhie2`, that will be further used as a knowledge graph.
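
One possible way to read the hierarchy fields as a graph is sketched below. This is an assumption made for illustration, not necessarily how the pipeline builds its knowledge graph: items that share the same `Ierarhie2` value are treated as neighbors. The `df_sample` frame is hypothetical and only copies a few `IDE` / `Ierarhie2` pairs from the metadata sample above.

```python
import pandas as pd

# a few rows copied from the metadata sample above (IDE and Ierarhie2 only)
df_sample = pd.DataFrame({
    "IDE":       [0, 3, 4, 5, 7],
    "Ierarhie2": [17, 17, 48, 48, 17],
})

# assumption: items in the same Ierarhie2 group are linked in the graph
neighbors = {}
for _, group in df_sample.groupby("Ierarhie2"):
    ides = group["IDE"].tolist()
    for ide in ides:
        neighbors[ide] = [other for other in ides if other != ide]

print(neighbors)
```

A dictionary of this shape (item id to list of neighbor ids) is the kind of input that retrofitting-style methods typically consume.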

In [7]:
chunk_reader = pd.read_csv(DATA_FILE, iterator=True) 
chunk_reader.get_chunk(15)

| Unnamed: 0 | BasketId | ItemId | SiteId | TimeStamp | Qtty | ClientId | IsActive | IDE |
|---|---|---|---|---|---|---|---|---|
| 0 | 7130756 | 441093 | 26 | 2016-01-01 14:49:02.403 | 1.0 | -103 | 1 | 1638 |
| 1 | 7130756 | 464012 | 26 | 2016-01-01 14:49:02.403 | 1.0 | -103 | 1 | 1975 |
| 2 | 7130756 | 464013 | 26 | 2016-01-01 14:49:02.403 | 1.0 | -103 | 1 | 2192 |
| 3 | 7130802 | 377742 | 20 | 2016-01-01 15:49:03.860 | 1.0 | -103 | 1 | 2132 |
| 4 | 7130802 | 405083 | 20 | 2016-01-01 15:49:03.860 | 1.0 | -103 | 1 | 213 |
| 5 | 7130802 | 405084 | 20 | 2016-01-01 15:49:03.860 | 1.0 | -103 | 1 | 165 |
| 6 | 7130803 | 365473 | 26 | 2016-01-01 15:49:42.567 | 1.0 | -103 | 1 | 354 |
| 7 | 7130803 | 381249 | 26 | 2016-01-01 15:49:42.567 | 1.0 | -103 | 1 | 10603 |
| 8 | 7130809 | 344274 | 26 | 2016-01-01 15:53:19.953 | 1.0 | -103 | 1 | 5416 |
| 9 | 7130809 | 393877 | 26 | 2016-01-01 15:53:19.953 | 1.0 | -103 | 1 | 5693 |


In [8]:
chunk_reader.close()

## 3. Metrics and overall evaluation


## 4. The Models


## 5. The Pipeline 


## 6. Progress history


## 7. References

 - Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 39-41.

 - Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

 - Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. International conference on machine learning, (pp. 1188-1196).

 - Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), (pp. 1532-1543).
 
 - Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2014). Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166.

 - Mrkšić, N., Séaghdha, D. O., Thomson, B., Gašić, M., Rojas-Barahona, L., Su, P. H., & Young, S. (2016). Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892.

 - Ganitkevitch, J., Van Durme, B., & Callison-Burch, C. (2013). PPDB: The paraphrase database. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 758-764).

 - Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan, V., & Sharp, D. (2015). E-commerce in your inbox: Product recommendations at scale. Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining, (pp. 1809-1818).
 
 - Lengerich, B. J., Maas, A. L., & Potts, C. (2017). Retrofitting distributional embeddings to knowledge graphs with functional relations. arXiv preprint arXiv:1708.00112.

 - Volkovs, M., Yu, G. W., & Poutanen, T. (2017). Content-based Neighbor Models for Cold Start. In Proceedings of the Recommender Systems Challenge 2017, (pp. 1-6).

 - Dingwall, N., & Potts, C. (2018). Mittens: An Extension of GloVe for Learning Domain-Specialized Representations. arXiv preprint arXiv:1803.09901.
