# Analysis of a dataset using data mining techniques.

This notebook explores an anonymised retail transactions dataset released by Instacart, and mines basic information from it.

## Taking a first glance at the dataset
*Source code for this section may be found in file `dist/first-glance.class.ts`*

The dataset consists of information about 3.4 million grocery orders, distributed across 6 `.csv` files listed below:

In [1]:
/**
 * Folder in which the dataset files are.
 */
const __foldername: string = 'instacart_basket_data';


import * as cp from 'child_process';

/**
 * Function executes a child_process listing the files in the instacart_basket_data folder
 * and returns the listed files through a Promise.
 */
function listFiles(): Promise<string[]> {
    // Async behavior
    return new Promise( (resolve, reject) => {
        // Listing the files in the instacart_basket_data folder.
        // The callback receives the whole output at once, whereas 'data' events may deliver it in several chunks.
        cp.exec(`ls ${__foldername}`, (error: Error | null, stdout: string) => {
            if (error) return reject(error);
            // Formatting ls output as an Array of strings (representing the file names);
            // an empty folder yields an empty Array rather than null.
            resolve(stdout.match(/[^\r\n]+/g) || []);
        });
    });
}

listFiles().then( (files: string[]) => console.log(files) );

[ 'aisles.csv',
  'departments.csv',
  'formatted_itemsets.csv',
  'order_id__product_number.csv',
  'order_products__train.csv',
  'orders.csv',
  'product_id__order_number.csv',
  'products.csv' ]


undefined

Files composing the dataset are listed below:

```
[ 'aisles.csv',
  'departments.csv',
  'order_products__prior.csv',
  'order_products__train.csv',
  'orders.csv',
  'products.csv' ]
```

As a starting point, and to get a first look at the data we will be working with throughout this report, let us display the first rows of each `.csv` file listed above.

We will use Node.js's built-in file I/O module `fs` to open the files, and `'csv-parse'`'s `Parser` to parse them.

In [2]:
import * as fs from 'fs';
import { Parser } from 'csv-parse';

/**
 * Function reads a .csv file and returns it properly formatted.
 */
function readFile<T>(filePath: string): Promise<Array<T>> {
    // Async behavior
    return new Promise( (resolve, reject) => {
        let ret: Array<T> = [];

        // 'csv-parse' Parser, columns options groups each row by column in an object
        let parser: Parser = new Parser({
                delimiter: ',',
                columns: true
        });

        parser
            .on('data', (data: T) => ret.push(data) )
            .on('error', (error: any) => reject(error) )
            .on('end', () => resolve(ret) );

        // Reading the file and piping to parser
        fs.createReadStream(filePath).pipe(parser);
    });
}

listFiles().then( (files: string[]) => {
    files
        .forEach( (file: string) => {
            // Formatting to get absolute paths to files.
            let filePath: string = `${__dirname}/${__foldername}/${file}`;

            // Reading file
            readFile<any>(filePath)
                // Logging the first two elements of the array
                .then( (data: Array<any>) => {
                    console.log(`First two elements of file ${file}:`);
                    console.log(data.slice(0,2));
                })
        })
});

undefined

File `aisles.csv` has the following structure; its first two rows are:

| aisle_id | aisle |
|----------|-------|
| 1 | prepared soups salads |
| 2 | specialty cheeses |

File `departments.csv` has the following structure; its first two rows are:

| department_id | department |
|---------------|------------|
| 1 | frozen |
| 2 | other |


File `products.csv` has the following structure; its first two rows are:

| product_id | product_name | aisle_id | department_id | 
|------------|--------------|----------|---------------|
| 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 2 | All-Seasons Salt | 104 | 13 |


Files `aisles.csv`, `departments.csv` and `products.csv` (trivially) contain the names of the aisles, departments and products respectively, which are of little interest to us other than for enhanced visualisation. We'll thus have to perform the proper `JOIN`s between these tables and our future data / pattern collections when needed.
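
Such a join can be sketched as a lookup table built once from the reference file, then queried by id. The row shape below mirrors the `products.csv` columns shown above; the helper names `indexBy` and `withProductNames` are hypothetical and only serve the illustration.

```typescript
/** Row shape of products.csv, as produced by a header-aware CSV parser (all values are strings). */
interface ProductRow {
    product_id: string;
    product_name: string;
    aisle_id: string;
    department_id: string;
}

/** Builds a lookup table from the value of `key` to the row holding it (hypothetical helper). */
function indexBy<T>(rows: T[], key: keyof T): { [id: string]: T } {
    const index: { [id: string]: T } = {};
    for (const row of rows) {
        index[String(row[key])] = row;
    }
    return index;
}

/** Replaces each product_id by its product_name; unknown ids are kept as-is. */
function withProductNames(productIds: string[], products: { [id: string]: ProductRow }): string[] {
    return productIds.map((id: string) => {
        const product = products[id];
        return product ? product.product_name : id;
    });
}

// The two rows shown above for products.csv.
const sampleProducts: ProductRow[] = [
    { product_id: '1', product_name: 'Chocolate Sandwich Cookies', aisle_id: '61', department_id: '19' },
    { product_id: '2', product_name: 'All-Seasons Salt', aisle_id: '104', department_id: '13' }
];

const productIndex = indexBy(sampleProducts, 'product_id');
// '999' is a made-up id, used to show the fallback for unknown products.
const named = withProductNames(['2', '1', '999'], productIndex);
console.log(named);
```

Building the index once keeps every later lookup cheap, which matters when the same reference table is joined against many itemsets.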

File `orders.csv` is a bit more interesting, especially with **sequential pattern mining** in mind, as it lists all the orders and contains information on **when** and **by whom** each one was placed.

Finally, files `order_products__train.csv` and `order_products__prior.csv` share the same structure and hold the most valuable information with regard to pattern mining, as they list the products contained in each order.

## Warming up: first statistics over the dataset
*Source code for this section may be found in file `dist/stats.class.ts`*

We'll keep our first analysis of the data simple, and start by computing some basic statistics over the dataset: the number of orders, the number of products, the number of products per order, etc.
In addition to telling us more about the data, this will help us find appropriate and relevant criteria on which to base (and reduce) our dataset for extended itemset analysis, and give us a first look at pattern redundancy.

Keeping in mind that each row of the `order_products__prior.csv` file has the following structure:
- order_id
- product_id
- add_to_cart_order
- reordered

An interesting approach is to group these rows by `order_id` and, separately, by `product_id`: the former gives us a view of how products are distributed across orders, while the latter gives us an idea of each product's popularity.

To do so, we need to transform the dataset accordingly. Let's start by defining `ProductOrder` as the structure of the rows output by the `.csv` file parser:

In [3]:
/**
 * Data structure we gather from CSV File.
 */
interface ProductOrder {
    order_id: string,
    product_id: string,
    add_to_cart_order: string,
    reordered: string
}



First two elements of file departments.csv:
[ { department_id: '1', department: 'frozen' },
  { department_id: '2', department: 'other' } ]


undefined

From this point on, we focus on exploiting data from file `order_products__train.csv`, which contains more than 1 million records. For the sake of memory usage, we'll try to work on data streams as much as possible, rather than parsing and caching 1-million-element `Array`s when it comes to data transformation.
To that end, we'll be using the reactive programming library `RxJS`.

Reactive programming is nothing new: it simply consists in programming with asynchronous data streams, which is what languages like JS are largely about. `RxJS` nevertheless provides a complete approach, as well as a great toolbox of functions, to create, combine and filter such streams easily.
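
To illustrate why handling values as they arrive keeps memory usage flat, here is a minimal sketch in plain TypeScript (it does not use `RxJS`) of an accumulator that is fed one value at a time and never stores the values themselves. The class name `RunningStats` and its methods are hypothetical.

```typescript
/** Accumulates count, min, max and sum of a stream of numbers in constant memory (hypothetical helper). */
class RunningStats {
    count: number = 0;
    min: number = Infinity;
    max: number = -Infinity;
    sum: number = 0;

    /** Called once per value, as soon as the value is available. */
    push(value: number): void {
        this.count += 1;
        this.min = Math.min(this.min, value);
        this.max = Math.max(this.max, value);
        this.sum += value;
    }

    /** Mean of the values seen so far, or NaN when nothing has been pushed yet. */
    mean(): number {
        return this.count === 0 ? NaN : this.sum / this.count;
    }
}

const runningStats = new RunningStats();
// In practice each value would come from a stream callback; a small made-up array stands in for it here.
[3, 1, 8, 4].forEach((value: number) => runningStats.push(value));
console.log(runningStats.min, runningStats.max, runningStats.mean());
```

The same `push` method could be handed to a stream subscription, so that statistics are updated without the records ever being collected in an `Array`.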

### Grouping by orders
*Sources are available in file `dist/stats-on-orders.spaghetti.ts`.*

Let us define a function allowing us to group `ProductOrder` objects by any key of the `ProductOrder` interface. To that end, we'll use `RxJS`'s `groupBy` method, which groups the items emitted by an `Observable` (a.k.a. stream; in our case, the `ReadStream` of the considered file) according to a specified criterion (in our case, either the `product_id` or the `order_id`), and emits these grouped items as `GroupedObservable`s, one `GroupedObservable` per group.

Let's start by defining a `Group<T>` as an object containing an `id` (the value of the grouping criterion), as well as an `Array` of whatever items of type `T` we're grouping. Our function will emit one such object per group:

In [4]:
interface Group<T> {
    id: string,
    items: Array<T>
}

First two elements of file aisles.csv:
[ { aisle_id: '1', aisle: 'prepared soups salads' },
  { aisle_id: '2', aisle: 'specialty cheeses' } ]


undefined
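
To make the shape of these groups concrete, here is a minimal in-memory sketch of the same kind of grouping on a handful of rows. It is synchronous and holds everything in memory, so it only illustrates the output format, not the streaming approach; the function name `groupRows` is hypothetical, and the third sample row is made up.

```typescript
interface ProductOrder {
    order_id: string;
    product_id: string;
    add_to_cart_order: string;
    reordered: string;
}

interface Group<T> {
    id: string;
    items: Array<T>;
}

/** Groups rows by `key` and maps each row with `map`, keeping groups in first-seen order (hypothetical helper). */
function groupRows<T>(rows: ProductOrder[], key: keyof ProductOrder, map: (row: ProductOrder) => T): Array<Group<T>> {
    const byId: { [id: string]: Group<T> } = {};
    const groups: Array<Group<T>> = [];

    for (const row of rows) {
        const id: string = row[key];
        if (!byId[id]) {
            // First row seen for this id: opening a new group.
            byId[id] = { id: id, items: [] };
            groups.push(byId[id]);
        }
        byId[id].items.push(map(row));
    }
    return groups;
}

// The first two rows are the first two rows of order_products__train.csv; the third one is made up.
const sampleRows: ProductOrder[] = [
    { order_id: '1', product_id: '49302', add_to_cart_order: '1', reordered: '1' },
    { order_id: '1', product_id: '11109', add_to_cart_order: '2', reordered: '1' },
    { order_id: '36', product_id: '11109', add_to_cart_order: '1', reordered: '0' }
];

// Orders as arrays of product ids, and products as arrays of order ids.
const byOrder = groupRows<string>(sampleRows, 'order_id', (row: ProductOrder) => row.product_id);
const byProduct = groupRows<string>(sampleRows, 'product_id', (row: ProductOrder) => row.order_id);
console.log(JSON.stringify(byOrder));
console.log(JSON.stringify(byProduct));
```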

We'll also forge ahead (keeping pattern mining in mind) by allowing this function to `.map()` the `ProductOrder` objects to whatever we want (its `id`, its `product_name`...) depending on our needs.

Such a function is given below:

In [5]:
import { Observable } from 'rxjs/Observable';
import 'rxjs/add/operator/finally';
import 'rxjs/add/operator/groupBy';
import * as RxNode from 'rx-node';

/**
 * Function returns an Observable of `ProductOrder`s grouped by a defined criterion. You may map the parsed `ProductOrder` to whatever value you want.
 */
function _readAndGroupBy<T>( key: keyof ProductOrder, map: (val: ProductOrder) => T ): Rx.Observable<Group<T>> {
    /**
     * 'csv-parse' Parser, columns options groups each row by column in an object.
     */
    let parser: Parser = new Parser({
        delimiter: ',',
        columns: true
    });

    // Turning native stream into Observable
    return RxNode.fromStream( fs.createReadStream(`${__foldername}/order_products__train.csv`).pipe(parser) )
        // Grouping objects by order
        .groupBy( (data: ProductOrder) => data[key] )
        // At this point, we basically have an Observable by group. Thus we need to flatten that.
        .flatMap( (group: Rx.GroupedObservable<string, ProductOrder>) => {
            return group
                // Formatting the data
                .map(map)
                // And flattening the Observable array.
                .reduce( (concat: Group<T>, current: T) => {
                    concat.items.push(current);
                    return concat;
                }, {
                    id: group.key,
                    items: []
                })
        });
}


{}

Let us group the `ProductOrder`s by their `order_id`, each `ProductOrder` being characterized by its `product_id`. We'll thus trivially obtain a list of orders (or itemsets), as `Array`s of `product_id`s.

The function above emits all the processed groups. From there, we'll compute some basic statistics on them, such as:
- The number of orders (number of groups);
- The minimum number of products in an order (minimum array length);
- The maximum number of products in an order (maximum array length);
- The average number of products per order (average array length);
- The number of records in the `order_products__train.csv` file (sum of array lengths).

In [6]:
function statsOnOrders(): void {
    console.log('Gathering data, this might take a while...');
    
    /**
     * All the groups.
     */
    let groups: Array<Group<string>> = [];

    let stats: any = {
        max: 0,
        min: Infinity,
        sum: 0
    }

    /**
     * Reads the file and groups `ProductOrder`s as intended
     */

    _readAndGroupBy<string>('order_id', (productOrder: ProductOrder) => productOrder.product_id )
        // Once all groups are loaded, displaying them.
        .finally( () => {
            console.log(`Maximum number of ProductOrders: ${stats.max}`);
            console.log(`Minimum number of ProductOrders: ${stats.min}`);
            console.log(`Average number of ProductOrders: ${stats.sum / groups.length}`);
            console.log(`Total number of ProductOrders: ${stats.sum}`);
            console.log(`Number of itemsets: ${groups.length}`);
        })
        // Note that this behaviour (induced by the flatMap of readAndGroupBy) makes everything pretty much blocking again.
        .subscribe( (group: Group<string>) => {
            // Computing some basic stats on the fly
            stats.max = Math.max(group.items.length, stats.max);
            stats.min = Math.min(group.items.length, stats.min);
            stats.sum += group.items.length;

            // Pushing group to groups.
            groups.push(group)
        });
}

statsOnOrders();

Gathering data, this might take a while...


undefined

Grouping the records by their `order_id` gives us the following information:

| Number of orders | Minimum number of products per order | Maximum number of products per order | Average number of products per order | Total number of records |
|-|-|-|-|-|
| 131,209 | 1 | 80 | 10.6 | 1,384,617 |

Some trivial modifications of the function above would allow us to retrieve the number of products per order for enhanced visualisation (code can be found in the source files).
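
As a rough idea of what such a modification could look like, the sketch below counts how many orders contain a given number of products, which is the series one would plot. It works on already-collected groups; the function name `orderSizeHistogram` and the sample groups are hypothetical.

```typescript
interface Group<T> {
    id: string;
    items: Array<T>;
}

/** Maps each order size (number of products) to the number of orders having that size. */
function orderSizeHistogram(groups: Array<Group<string>>): { [size: number]: number } {
    const histogram: { [size: number]: number } = {};
    for (const group of groups) {
        const size: number = group.items.length;
        histogram[size] = (histogram[size] || 0) + 1;
    }
    return histogram;
}

// Made-up groups, only meant to exercise the function.
const sampleGroups: Array<Group<string>> = [
    { id: 'a', items: ['1', '2', '3'] },
    { id: 'b', items: ['4'] },
    { id: 'c', items: ['2', '5', '6'] }
];

const sizeHistogram = orderSizeHistogram(sampleGroups);
console.log(sizeHistogram);
```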

### Grouping by product

Creating itemsets (`Group`s) of orders based on the `product_id` of `ProductOrder`s may also be of interest to us, as it allows us to get a feel for a product's popularity by counting the number of orders it appears in. This is valuable with regard to pattern mining, as:
- a "frequent" product is more likely to appear in frequent itemsets;
- the number of orders it appears in is, by definition, the support of the itemset containing only that product, and an upper bound on the support of any itemset containing it. If product A is the most popular product in a dataset such as ours, the itemset { A } is trivially the most frequent itemset to be found in the entire dataset;
- if a product is too frequent, it may be worth ignoring it, as the itemsets found may not be relevant enough.
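
The first two points rely on the fact that an itemset can never be more frequent than any of the items it contains. A minimal sketch on made-up transactions (the function name `support` is hypothetical, and support is counted here as an absolute number of transactions):

```typescript
/** Number of transactions containing every item of `itemset` (absolute support). */
function support(transactions: string[][], itemset: string[]): number {
    let count: number = 0;
    for (const transaction of transactions) {
        const containsAll: boolean = itemset.every((item: string) => transaction.indexOf(item) !== -1);
        if (containsAll) {
            count += 1;
        }
    }
    return count;
}

// Made-up transactions in which A is the most popular item.
const transactions: string[][] = [
    ['A', 'B', 'C'],
    ['A', 'B'],
    ['A', 'C'],
    ['A'],
    ['B']
];

const supportA = support(transactions, ['A']);
const supportB = support(transactions, ['B']);
const supportAB = support(transactions, ['A', 'B']);

// The support of { A, B } is bounded by the support of its least frequent item.
console.log(supportA, supportB, supportAB, supportAB <= Math.min(supportA, supportB));
```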

The code is basically the same as before, and is thus not included in the notebook. Sources are however still available in file `dist/stats-on-products.spaghetti.ts`.
Grouping the records by their `product_id` gives us the following results:


| Number of products | Minimum number of orders per product | Maximum number of orders per product | Average number of orders per product | Total number of records |
|-|-|-|-|-|
| 39,123 | 1 | 18,726 | 35.39 | 1,384,617 |

Joining the retrieved data with table `products.csv` using the following (dirty, yet working) function:

In [7]:
/**
 * Function finds element of array of which the key corresponds to value; and returns another defined value of this element.
 */
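function join<T>( array: T[], initKey: keyof T, value: any, returnKey: keyof T): any {
    // Loose equality lets a numeric `value` match the string values produced by the CSV parser.
    let element: T | undefined = array.find((element: T) => element[initKey] == value );
    // Returning undefined rather than throwing when no element matches.
    return element ? element[returnKey] : undefined;
}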

undefined

... we are able to compare product popularity, and conclude that:
- Banana is the most popular product, having been ordered 18,726 times;
- 46 products, including 100% Black Cherry & Concord Grape Juice, Breaded Popcorn Turkey Dogs and Lip Balm, are the least popular, with only 1 order each.
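
The ranking itself can be sketched as a single pass over the per-product order counts. In the sample below, the counts for Banana, Lip Balm and Breaded Popcorn Turkey Dogs are the ones reported above, while the `Some Product` entry is made up; the names `ProductCount` and `popularityExtremes` are hypothetical.

```typescript
interface ProductCount {
    product_name: string;
    orders: number;
}

/** Returns the names of the most ordered and of the least ordered product(s); ties are all reported. */
function popularityExtremes(counts: ProductCount[]): { most: string[], least: string[] } {
    let highest: number = -Infinity;
    let lowest: number = Infinity;
    for (const entry of counts) {
        highest = Math.max(highest, entry.orders);
        lowest = Math.min(lowest, entry.orders);
    }
    return {
        most: counts
            .filter((entry: ProductCount) => entry.orders === highest)
            .map((entry: ProductCount) => entry.product_name),
        least: counts
            .filter((entry: ProductCount) => entry.orders === lowest)
            .map((entry: ProductCount) => entry.product_name)
    };
}

const sampleCounts: ProductCount[] = [
    { product_name: 'Banana', orders: 18726 },
    { product_name: 'Some Product', orders: 120 }, // made-up entry
    { product_name: 'Lip Balm', orders: 1 },
    { product_name: 'Breaded Popcorn Turkey Dogs', orders: 1 }
];

const extremes = popularityExtremes(sampleCounts);
console.log(extremes.most, extremes.least);
```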

## Frequent item sets

Knowing a little bit more about our data, we'll now move on to mining frequent itemsets from our **training** dataset (`order_products__train.csv`); in other words, **TODO**

### Dataset formatting

From this point on, we will be using the SPMF library's Java implementation of Apriori (through the command line) in order to mine frequent itemsets from our dataset. This implementation needs the data to be formatted as such:

```
A B C
D E
```

With `{ A, B, C }` and `{ D, E }` representing itemsets, and `A, B, C, D, E` being integers exclusively.
- Itemsets need to be separated by a line break (here `\r\n`, i.e. a carriage return followed by a line feed);
- Within itemsets, item ids are separated by a single `space` character.
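
The two formatting rules above can be sketched as a small pure function. The function name `toTransactionLines` is hypothetical. The sketch also sorts item ids in ascending order and drops duplicates, on the assumption that the miner expects ordered, duplicate-free transactions; that step can be removed if the assumption does not hold.

```typescript
/**
 * Formats itemsets as one line per itemset, items separated by a single space (hypothetical helper).
 * Items are sorted in ascending numeric order and de-duplicated (an assumption, not a documented requirement here).
 */
function toTransactionLines(itemsets: string[][]): string {
    return itemsets
        .map((items: string[]) => {
            const sorted: number[] = items
                .map((item: string) => parseInt(item, 10))
                .sort((a: number, b: number) => a - b);
            // Dropping consecutive duplicates, which is enough once the array is sorted.
            const unique: number[] = sorted.filter((item: number, i: number) => i === 0 || item !== sorted[i - 1]);
            return unique.join(' ');
        })
        .join('\r\n');
}

// Made-up itemsets, using integer ids only.
const formatted = toTransactionLines([['49302', '11109', '11109'], ['7', '3']]);
console.log(JSON.stringify(formatted));
```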

Transforming the dataset into this format is pretty straightforward using the code we wrote previously; though, for the sake of readability now that we're done tinkering with the data, these functions have been gathered in a proper class, `CSVParser`.

The following code processes the dataset into the proper format, and writes it into a new `formatted_itemsets.csv` file:

In [8]:
import { CSVParser } from './class/csv-parser.class';
import { Group } from './interface/group.interface';
import { Product } from './interface/product.interface';
import { ProductOrder } from './interface/product-order.interface';

export class FormatData {
    private readonly __foldername: string = 'instacart_basket_data';
    private _output: string[] = [];

    constructor() {
        new CSVParser<Product>(`${__dirname}/../${this.__foldername}/products.csv`).loadAll()
            .then( (products: Product[]) => {
                console.log('Products has been loaded');

                new CSVParser<ProductOrder>(`${__dirname}/../${this.__foldername}/order_products__train.csv`)
                    // Grouping items by order_id, and mapping every item composing these itemsets to their product_id.
                    .generateItemsets<string>('order_id', (productOrder: ProductOrder) => productOrder.product_id)
                    // Once execution is complete, writing the formatted dataset into a proper file.
                    .finally( () => {
                        // Writing the formatted itemsets into a new file: the array of already formatted rows is joined by a line break (\r\n).
                        fs.writeFile(`/Users/alexisfacques/Projects/python-apriori/formatted_itemsets.csv`, this._output.join('\r\n'), (err: any) => {
                            if(err) return console.log(err);
                            console.log('The file was saved!');
                        });
                    })
                    // On group reception, formatting its items as a row (joined by a plain space character), and pushing it to the output array.
                    .subscribe((group: Group<ProductOrder,string>) => this._output.push(group.items.join(' ')) )
            })
    }
}

new FormatData();

Error: Cannot find module './class/csv-parser.class'

First two elements of file order_id__product_number.csv:
[ { '1': '36', '8': '8' }, { '1': '38', '8': '9' } ]
First two elements of file products.csv:
[ { product_id: '1',
    product_name: 'Chocolate Sandwich Cookies',
    aisle_id: '61',
    department_id: '19' },
  { product_id: '2',
    product_name: 'All-Seasons Salt',
    aisle_id: '104',
    department_id: '13' } ]
First two elements of file order_products__train.csv:
[ { order_id: '1',
    product_id: '49302',
    add_to_cart_order: '1',
    reordered: '1' },
  { order_id: '1',
    product_id: '11109',
    add_to_cart_order: '2',
    reordered: '1' } ]
Maximum number of ProductOrders: 80
Minimum number of ProductOrders: 1
Average number of ProductOrders: 10.552759338155157
Total number of ProductOrders: 1384617
Number of itemsets: 131209
First two elements of file orders.csv:
[ { order_id: '2539329',
    user_id: '1',
    eval_set: 'prior',
    order_number: '1',
    order_dow: '2',
    order_hour_of_day: '08',
    days_since_p

The Apriori implementation used is the one from the SPMF library: http://www.philippe-fournier-viger.com/spmf/
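
Since the Apriori implementation is run through the command line, the invocation can be sketched as an argument list handed to `java`. This assumes the `java -jar spmf.jar run <Algorithm> <input> <output> <parameters>` form of SPMF's command-line interface; the jar path, the output file name and the `1%` minimum support are placeholders, not values used in this notebook, and the function name `aprioriArgs` is hypothetical.

```typescript
/**
 * Builds the argument list for running SPMF's Apriori from the command line (hypothetical helper).
 * `jarPath`, `inputPath`, `outputPath` and `minSupport` are placeholders to adapt.
 */
function aprioriArgs(jarPath: string, inputPath: string, outputPath: string, minSupport: string): string[] {
    return ['-jar', jarPath, 'run', 'Apriori', inputPath, outputPath, minSupport];
}

const spmfArgs = aprioriArgs('spmf.jar', 'formatted_itemsets.csv', 'frequent_itemsets.txt', '1%');
const spmfCommand = ['java'].concat(spmfArgs).join(' ');
console.log(spmfCommand);

// To actually run it (requires Java and the SPMF jar to be available), the list could be passed to
// child_process.execFile('java', spmfArgs, callback).
```

Keeping the arguments in an array rather than a single string avoids quoting issues when paths contain spaces.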