# Part 25: Creating a Graph around the Kaggle H&M Personalized Fashion Recommendations Competition

The purpose of this notebook is to set up a graph around the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/c/h-and-m-personalized-fashion-recommendations/overview).  You will need to download the data files from Kaggle yourself (see the Bite-Sized README for information on how to do that.)  You will also need to have a Neo4j database stood up (in the video, I will be demonstrating this with a Sandbox instance).

_Note that this notebook is just to get you started.  It does not contain a full graph model, since I assume you will want to think about an appropriate model yourself._ :)

_Also note: this notebook is not quite stand-alone and is intended to accompany the video ["Part 25: Creating a Graph for a Kaggle Competition"](https://dev.neo4j.com/bites_part25)._

In [1]:
import time
import pandas as pd
from neo4j import GraphDatabase

In [2]:
articles_df = pd.read_csv('./articles.csv.zip')
articles_df = articles_df.drop_duplicates()
articles_df.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


In [3]:
customers_df = pd.read_csv('./customers.csv.zip')
customers_df.drop_duplicates()
customers_df.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...


In [4]:
transactions_df = pd.read_csv('./transactions_train.csv.zip')
transactions_df.drop_duplicates()
transactions_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2


In [5]:
articles_df.shape, customers_df.shape, transactions_df.shape

((105542, 25), (1371980, 7), (31788324, 5))

## Connection to Neo4j

Below is a class for using the official Neo4j Python driver to connect to a running database.  In the following cell, you will establish the connection, which requires that you know the IP address and password of the `neo4j` user account.  For more information on how this part works, see the Bite-Sized video ["Part 1: Connect from Jupyter to a Neo4j Sandbox"](https://dev.neo4j.com/bites_part1) and ["Part 3: Using the Neo4j Python Driver"](https://dev.neo4j.com/bites_part3).

In [6]:
class Neo4jConnection:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response

In [7]:
uri = 'bolt://3.95.5.199:7687'
pwd = 'splice-mattress-buckle'

conn = Neo4jConnection(uri=uri, user="neo4j", pwd=pwd)

In [8]:
conn.query('CREATE CONSTRAINT articles IF NOT EXISTS ON (a:Article) ASSERT a.article_id IS UNIQUE')
conn.query('CREATE CONSTRAINT customers IF NOT EXISTS ON (c:Customer) ASSERT c.cusomter_id IS UNIQUE')

[]

### The following are helper functions to take a Pandas dataframe and use it to populate the graph

(This is done in batches via the `insert_data` function.)

In [9]:
def add_articles(rows, batch_size=1000):
    # Adds category nodes to the Neo4j graph.
  
    query = '''UNWIND $rows AS row
    MERGE (a:Article {article_id: row.article_id})
    SET a.prod_name = row.prod_name,
        a.color = toInteger(row.perceived_colour_value_id),
        a.section_name = row.section_name,
        a.detail_descr = row.detail_descr
    RETURN count(*) as total
    '''
    
    return insert_data(query, rows, batch_size)


def add_customers(rows, batch_size=1000):
    
    query = '''UNWIND $rows AS row
    MERGE (c:Customer {customer_id: row.customer_id})
    SET c.age = toInteger(row.age)
    RETURN COUNT(*) AS total
    '''
    
    return insert_data(query, rows, batch_size)


def add_rels(rows, batch_size=1000):
    
    query = '''UNWIND $rows AS row
    MATCH (c:Customer {customer_id: row.customer_id})
    MATCH (a:Article {article_id: row.article_id})
    MERGE (c)-[r:PURCHASED]->(a)
    SET r.price = toFloat(row.price)
    RETURN COUNT(r) AS total
    '''

    return insert_data(query, rows, batch_size)
    

def insert_data(query, rows, batch_size = 10000):
    # Function to handle the updating the Neo4j database in batch mode.

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):

        res = conn.query(query, parameters={'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        total += res[0]['total']
        batch += 1
        result = {"total":total, "batches":batch, "time":time.time()-start}
        print(result)

    return result

### For demonstration purposes we will do this on a subset of the overall data (i.e. the first 10,000 transactions)

In [16]:
tx_small_df = transactions_df.head(10000)
len(tx_small_df['customer_id'].unique()), len(tx_small_df['article_id'].unique())

(2954, 4903)

In [17]:
cust_ls = tx_small_df['customer_id'].unique().tolist()
art_ls = tx_small_df['article_id'].unique().tolist()

small_cust_df = customers_df[customers_df['customer_id'].isin(cust_ls)]
small_art_df = articles_df[articles_df['article_id'].isin(art_ls)]

small_cust_df.shape, small_art_df.shape

((2954, 7), (4903, 25))

In [18]:
add_articles(small_art_df)

{'total': 1000, 'batches': 1, 'time': 1.840942621231079}
{'total': 2000, 'batches': 2, 'time': 3.5251400470733643}
{'total': 3000, 'batches': 3, 'time': 5.217744827270508}
{'total': 4000, 'batches': 4, 'time': 6.932212591171265}
{'total': 4903, 'batches': 5, 'time': 8.439055442810059}


{'total': 4903, 'batches': 5, 'time': 8.439055442810059}

In [19]:
add_customers(small_cust_df)

{'total': 1000, 'batches': 1, 'time': 0.8283143043518066}
{'total': 2000, 'batches': 2, 'time': 2.1138017177581787}
{'total': 2954, 'batches': 3, 'time': 3.7562503814697266}


{'total': 2954, 'batches': 3, 'time': 3.7562503814697266}

In [20]:
add_rels(tx_small_df)

{'total': 1000, 'batches': 1, 'time': 1.7827272415161133}
{'total': 2000, 'batches': 2, 'time': 3.5216715335845947}
{'total': 3000, 'batches': 3, 'time': 5.189208745956421}
{'total': 4000, 'batches': 4, 'time': 6.85724663734436}
{'total': 5000, 'batches': 5, 'time': 8.514592409133911}
{'total': 6000, 'batches': 6, 'time': 10.161670923233032}
{'total': 7000, 'batches': 7, 'time': 11.804686784744263}
{'total': 8000, 'batches': 8, 'time': 13.555474042892456}
{'total': 9000, 'batches': 9, 'time': 15.300577878952026}
{'total': 10000, 'batches': 10, 'time': 16.93193745613098}


{'total': 10000, 'batches': 10, 'time': 16.93193745613098}