<a href="https://colab.research.google.com/github/drshyamsundaram/CommunityDataModel/blob/gh-pages/SNAP_copurchasing_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**SNAP Graph - Amazon Co-purchase Data Graph Analysis**
**Code base**  Dr Shyam Sundaram (drshyamsundaramindia@gmail.com) & Gaurav J (email id)

**Dataset** https://snap.stanford.edu/data/amazon0302.html

**About SNAP PY** https://snap.stanford.edu/snappy/index.html
**Documentation** https://snap.stanford.edu/snappy/doc/index.html

**Data set information**
The network was collected by crawling the Amazon website. It is based on the "Customers Who Bought This Item Also Bought" feature of the site. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j.

**Citation** J. Leskovec, L. Adamic and B. Huberman. The Dynamics of Viral Marketing. ACM Transactions on the Web (ACM TWEB), 1(1), 2007.

**Data Set Details**
The data was collected on March 2, 2003.
Nodes	262111
Edges	1234877
Nodes in largest WCC	262111 (1.000)
Edges in largest WCC	1234877 (1.000)
Nodes in largest SCC	241761 (0.922)
Edges in largest SCC	1131217 (0.916)
Average clustering coefficient	0.4198
Number of triangles	717719
Fraction of closed triangles	0.09339
Diameter (longest shortest path)	32
90-percentile effective diameter	11

**Data files**
https://snap.stanford.edu/data/index.html#amazon

**Data set files Details**
Name	Type	Nodes	Edges	Description
amazon0302	Directed	262,111	1,234,877	Amazon product co-purchasing network from March 2 2003
amazon0312	Directed	400,727	3,200,440	Amazon product co-purchasing network from March 12 2003
amazon0505	Directed	410,236	3,356,824	Amazon product co-purchasing network from May 5 2003
amazon0601	Directed	403,394	3,387,388	Amazon product co-purchasing network from June 1 2003
amazon-meta	Metadata	548,552	1,788,725	Amazon product metadata: product info and all reviews on around 548,552 products.


**Setting up key installations**
1. Set up OpenJDK and Spark 3.1.1 (prebuilt for Hadoop 2.7)
2. Set the JAVA_HOME and SPARK_HOME paths
3. Install SNAP and related libraries
4. Load key libraries

In [None]:
# (Step 1) Installing JDK & SPARK
#!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz
!tar xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
# (Step 2) PATH Setups for SPARK & JAVA
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
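The apt-get line in step 1 is commented out while JAVA_HOME names a Java 11 directory, so it can help to confirm that both directories exist before Spark is started. The following is a minimal sketch; the helper name `missing_home_dirs` is illustrative and not part of findspark or Spark.

```python
import os

def missing_home_dirs(names=("JAVA_HOME", "SPARK_HOME"), environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    return [name for name in names if not os.path.isdir(environ.get(name, ""))]

# An empty list means both directories were found
print("Unset or missing:", missing_home_dirs())
```

If a name is listed, the matching install or download step probably needs to be rerun, or the path adjusted to the directory that actually exists.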

In [None]:
#(Step 3) install key libraries
!pip install snap-stanford networkx stemming

Collecting snap-stanford
  Downloading https://files.pythonhosted.org/packages/9b/11/76c0bbe6dbd5004ea6ae2fcd748d5d6a17f65c506b894f2de13d36918d13/snap_stanford-6.0.0-cp37-cp37m-manylinux1_x86_64.whl (11.6MB)
     |████████████████████████████████| 11.6MB 8.6MB/s 
Collecting stemming
  Downloading https://files.pythonhosted.org/packages/d1/eb/fd53fb51b83a4e3b8e98cfec2fa9e4b99401fce5177ec346e4a5c61df71e/stemming-1.0.1.tar.gz
Building wheels for collected packages: stemming
  Building wheel for stemming (setup.py) ... done
  Created wheel for stemming: filename=stemming-1.0.1-cp37-none-any.whl size=11139 sha256=2944040c83e0db1773451605460c6eb69a12518a29bd731d7123eb86877784a3
  Stored in directory: /root/.cache/pip/wheels/e8/05/2e/2ddeb64d4464b854b48323f9676528c17560da7d153db7b0e2
Successfully built stemming
Installing collected packages: snap-stanford, stemming
Successfully installed snap-stanford-6.0.0 stemming-1.0.1


In [None]:
#(Step 4) Key library imports
import gzip
import io
import requests
import string
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from stemming.porter2 import stem
import networkx
import pandas as pd

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Core processing code blocks**
1. Load the metadata file
2. Read the data from the amazon-meta file and populate the Products list of per-product dictionaries


In [None]:
# (Step 1) Loading the SNAP product copurchase data set meta data
metafile_path="https://snap.stanford.edu/data/bigdata/amazon/amazon-meta.txt.gz"
web_response = requests.get(metafile_path, stream=True)

meta_data_file = web_response.content  # Gzip-compressed bytes from requests.get

f = io.BytesIO(meta_data_file)
fhr = gzip.open(f,'rt')

In [None]:
# (Step 2) Read the data from the amazon-meta file;
# populate the Products list of dictionaries; Products metadata model
# Convert to SPARK

Products = []

Id = ""
ASIN = ""
Title = ""
Categories = ""
Group = ""
Copurchased = ""
SalesRank = ""
TotalReviews = ""
AvgRating = ""

for line in fhr:
    line = line.strip()
    # a product block started
    if(line.startswith("Id")):
        Id = line[3:].strip()
    elif(line.startswith("ASIN")):
        ASIN = line[5:].strip()
    elif(line.startswith("title")):
        Title = line[6:].strip()
        Title = ' '.join(Title.split())
    elif(line.startswith("group")):
        Group = line[6:].strip()
    elif(line.startswith("salesrank")):
      
        SalesRank = line[10:].strip()
    elif(line.startswith("similar")):
        ls = line.split()
        Copurchased = ' '.join([c for c in ls[2:]])
    elif(line.startswith("categories")):
        ls = line.split()
        Categories = ' '.join((fhr.readline()).lower() for i in range(int(ls[1].strip())))
        Categories = re.compile('[%s]' % re.escape(string.digits+string.punctuation)).sub(' ', Categories)
        Categories = ' '.join(set(Categories.split())-set(stopwords.words("english")))        
        Categories = ' '.join(stem(word) for word in Categories.split())
    elif(line.startswith("reviews")):
        ls = line.split()
        TotalReviews = ls[2].strip()
        AvgRating = ls[7].strip()
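    elif (line==""):
        # A blank line ends a product block: build the record and store it
        try:
            MetaData = {}
            if (ASIN != ""):
                MetaData["ASIN"] = ASIN
                MetaData['Id'] = Id
                MetaData['Title'] = Title
                MetaData['Categories'] = ' '.join(set(Categories.split()))
                MetaData['Group'] = Group
                MetaData['Copurchased'] = Copurchased
                MetaData['SalesRank'] = int(SalesRank)
                MetaData['TotalReviews'] = int(TotalReviews)
                MetaData['AvgRating'] = float(AvgRating)

            # An empty dict is appended while no ASIN has been seen yet;
            # it becomes an all-NaN row that dropna() removes later
            Products.append(MetaData)

        except Exception as e:
            # int()/float() fail when a field is still an empty string
            print(e, ASIN)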

invalid literal for int() with base 10: '' 0878223398
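The parser above keeps its fields in variables that are never reset, so a block without a salesrank or reviews line either reuses the previous product's values or, when nothing has been read yet, passes an empty string to int(), which is the error printed above. The sketch below shows one possible variant that starts a fresh dictionary for every block. The function name `parse_records` and the sample text are hypothetical; the sample only imitates the amazon-meta layout and is not real data.

```python
def parse_records(lines):
    """Yield one dict per product block; fields start empty for every block."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("Id:"):
            record = {"Id": line[3:].strip()}
        elif line.startswith("ASIN:"):
            record["ASIN"] = line[5:].strip()
        elif line.startswith("title:"):
            record["Title"] = " ".join(line[6:].split())
        elif line.startswith("group:"):
            record["Group"] = line[6:].strip()
        elif line.startswith("salesrank:"):
            record["SalesRank"] = int(line[10:].strip())
        elif line.startswith("similar:"):
            record["Copurchased"] = line.split()[2:]
        elif line.startswith("reviews:"):
            parts = line.split()
            record["TotalReviews"] = int(parts[2])
            record["AvgRating"] = float(parts[7])
        elif line == "" and "ASIN" in record:
            yield record
            record = {}
    if "ASIN" in record:  # last block when the input has no trailing blank line
        yield record

# Hypothetical sample in the amazon-meta layout (invented values)
SAMPLE = """\
Id:   1
ASIN: 0000000001
  title: A   Hypothetical Title
  group: Book
  salesrank: 1234
  similar: 2  0000000002  0000000003
  reviews: total: 3  downloaded: 3  avg rating: 4.5

Id:   2
ASIN: 0000000002
  discontinued product
"""

products = list(parse_records(SAMPLE.splitlines()))
for product in products:
    print(product)
```

With this layout a discontinued product yields a record holding only Id and ASIN, so missing fields show up as NaN in a DataFrame instead of inheriting stale values.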


In [None]:
import pandas as pd

In [None]:
# Creation of the products meta data frame and saving file
df_products = pd.DataFrame(Products)
df_products.to_csv('amazonreviewsMetaData.csv',index=False)  # CSV Save
df_products.to_parquet('amazonreviewsMetaData.parquet.gzip',
              compression='gzip')  # Parquet Save

In [None]:
# Download the metadata file from the Google Colab files

#from google.colab import files
#files.download("amazonreviewsMetaData.csv")

# Reading as Parquet option
#df_products1 = pd.read_parquet('amazonreviewsMetaData.parquet.gzip')


In [None]:
# Remove NA(s) from the Products Meta
df_products.dropna(inplace=True)
# Show the metadata we have created
df_products

Unnamed: 0,ASIN,Id,Title,Categories,Group,Copurchased,SalesRank,TotalReviews,AvgRating
1,1890132586,51424,The Toilet Papers: Recycling Waste and Conserv...,general jp environment engin store specialti b...,Book,0964425890 0966678303 1559633891 0936070110,643446.0,2.0,5.0
2,0821225898,51425,Dan Kiley : The Complete Works of America's Ma...,general architectur home technic horticultur b...,Book,0262731169 1568982380 050028427X 1568981481 02...,56634.0,1.0,4.0
3,0686651545,51426,"One Hundred-One Sandwiches, Fish, Egg Salad, M...",,Book,,3468435.0,0.0,0.0
4,014036675X,51427,Alice's Adventures in Wonderland (Puffin Class...,general children fantasi author british c illu...,Book,6303212220 0486408787 0812523350 0486228533 06...,1057502.0,180.0,4.5
5,0816742359,51428,How to Draw Donkey Kong & Friends (How to Draw...,children popular literatur book cultur subject,Book,,212816.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
497124,B000059TOC,548547,The Drifter,suspens ann dvd general today b outlet titl ac...,DVD,630366704X B0002ERXB8 B0001932ZU B0001VTPUE B0...,0.0,1.0,5.0
497125,B00006JBIX,548548,The House Of Morecock,general anim genr featur art intern independ c...,DVD,B0002HOE6C B0002I84JO B00004WZQN B00069CQ8E B0...,0.0,8.0,3.0
497126,0879736836,548549,Catholic Bioethics and the Gift of Human Life,general histori spiritu scienc religion social...,Book,1931709920 188187110X 081890643X 1580510469 08...,0.0,1.0,4.0
497127,B00008DDST,548550,"1, 2, 3 Soleils: Taha, Khaled, Faudel",general today genr music featur categori conce...,DVD,B00012FWNC B0002UNQQI B00069FKLO B0000CNTHZ B0...,0.0,3.0,5.0


**Products Meta Model**
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   ASIN          497128 non-null  object 
 1   Id            497128 non-null  object 
 2   Title         497128 non-null  object 
 3   Categories    497128 non-null  object 
 4   Group         497128 non-null  object 
 5   Copurchased   497128 non-null  object 
 6   SalesRank     497128 non-null  float64
 7   TotalReviews  497128 non-null  float64
 8   AvgRating     497128 non-null  float64

**Unique Product Groups**
 array(['Book', 'Music', 'DVD', 'Video', 'Video Games', 'Software','Baby Product', 'CE', 'Toy', 'Sports'], dtype=object)

In [None]:
# Show the Products Meta Model
df_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 497128 entries, 1 to 497128
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   ASIN          497128 non-null  object 
 1   Id            497128 non-null  object 
 2   Title         497128 non-null  object 
 3   Categories    497128 non-null  object 
 4   Group         497128 non-null  object 
 5   Copurchased   497128 non-null  object 
 6   SalesRank     497128 non-null  float64
 7   TotalReviews  497128 non-null  float64
 8   AvgRating     497128 non-null  float64
dtypes: float64(3), object(6)
memory usage: 37.9+ MB


In [None]:
# Unique products groups
df_products.Group.unique()

array(['Book', 'Music', 'DVD', 'Video', 'Video Games', 'Software',
       'Baby Product', 'CE', 'Toy', 'Sports'], dtype=object)

**SPARK Executions**

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
from pyspark import SparkContext
try:
    sc = SparkContext('local', 'SNAP_copurchasing_example_context')
except ValueError:
    print('SparkContext already exists!')
from pyspark.sql import SparkSession
try:
    spark = SparkSession.builder.appName('SNAP_copurchasing_example').getOrCreate()
except ValueError:
    print('SparkSession already exists!')

SparkContext already exists!


In [None]:
# Network Build - Part 1 of the data set
import gzip
import io
import requests
import pandas

web_response_nw_build = requests.get("https://snap.stanford.edu/data/amazon0601.txt.gz",
                             stream=True)
web_response_nw_build_part1 = web_response_nw_build.content # Gzip-compressed bytes from requests.get

web_response_nw_build_part1_file = io.BytesIO(web_response_nw_build_part1)
fhr2 = gzip.open(web_response_nw_build_part1_file,'rt')

In [None]:
# Load into Pandas DF
df_network_part1 = pandas.read_table(fhr2, sep='\t', delimiter=None, skiprows = 3)
df_network_part1.tail()

Unnamed: 0,# FromNodeId,ToNodeId
3387383,403392,121379
3387384,403392,190663
3387385,403393,318438
3387386,403393,326962
3387387,403393,403383
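The column name "# FromNodeId" in the output above follows from the file layout: skiprows = 3 drops the first three comment lines, and the fourth comment line is then read as the header row. As a rough illustration of the same layout without pandas, the sketch below reads an edge list while ignoring every line that starts with "#". The function name `read_edge_list` and the miniature sample are assumptions made for this example.

```python
import io

def read_edge_list(handle):
    """Return (source, target) integer pairs, ignoring blank and '#' comment lines."""
    edges = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()
        edges.append((int(src), int(dst)))
    return edges

# Hypothetical miniature file in the same layout (not real data)
sample = io.StringIO(
    "# Directed graph: sample.txt\n"
    "# Toy co-purchasing network\n"
    "# Nodes: 3 Edges: 3\n"
    "# FromNodeId\tToNodeId\n"
    "0\t1\n"
    "1\t0\n"
    "1\t2\n"
)
edges = read_edge_list(sample)
print(edges)
```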


In [None]:
# Get the list of unique products id(s) from the products metadata
df_products.Id.unique()

array(['51424', '51425', '51426', ..., '548549', '548550', '548551'],
      dtype=object)

**Construct Undirected SNAP graph**

**Details** https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html#graph-and-network-types

In [None]:
import snap
# Create SNAP undirected graph:
G_UD = snap.TUNGraph.New()

# Add nodes:
nodes = set(df_network_part1['# FromNodeId'].tolist() + df_network_part1['ToNodeId'].tolist())
for node in nodes:
    G_UD.AddNode(int(node))

# Add edges:
for src, dst in zip(df_network_part1['# FromNodeId'], df_network_part1['ToNodeId']):
    G_UD.AddEdge(int(src), int(dst))

**Quick Graph Diagnostics**
snap.PrintInfo(G_UD, "amazon Stats", "amazon-info.txt", False)

**amazon Stats:**
  Nodes:                    403394
  Edges:                    2443408
  Zero Deg Nodes:           0
  Zero InDeg Nodes:         0
  Zero OutDeg Nodes:        0
  NonZero In-Out Deg Nodes: 403394
  Unique directed edges:    4886816
  Unique undirected edges:  2443408
  Self Edges:               0
  BiDir Edges:              4886816
  Closed triangles:         3986507
  Open triangles:           60250166
  Frac. of closed triads:   0.062060
  Connected component size: 0.999926
  Strong conn. comp. size:  0.999926
  Approx. full diameter:    20
  90% effective diameter:  7.589876
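The undirected graph reports 2,443,408 edges although the edge list holds 3,387,388 rows. An undirected graph stores i-j and j-i as a single edge, so reciprocal co-purchase pairs collapse into one. The sketch below shows the effect on a toy edge list; the helper name `undirected_edge_count` is illustrative.

```python
def undirected_edge_count(directed_edges):
    """Count edges after merging (i, j) and (j, i) into one undirected edge."""
    return len({tuple(sorted(edge)) for edge in directed_edges})

toy_edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
merged = undirected_edge_count(toy_edges)
print("directed rows:", len(toy_edges))
print("undirected edges:", merged)
print("collapsed pairs:", len(toy_edges) - merged)
```

Assuming the file contains no repeated rows, the same subtraction on the figures above, 3387388 - 2443408 = 943980, would be the number of product pairs linked in both directions.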

In [None]:
# Quick Graph Diagnostics
snap.PrintInfo(G_UD, "amazon Stats", "amazon-info.txt", False)

In [None]:
# Debug Code - In case required
# Print result:
#G_UD.Dump()

In [None]:
# Reference for TIntStrH https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html#hash-table-types
# Create a hash table of Node ID(s)
NIdName = snap.TIntStrH()

# Loop through nodes and add to Nodes Hashtable
for node in nodes:
    NIdName[node] = str(node)

In [None]:
# Length of Node Hash Table
print(len(NIdName))

403394


In [None]:
snap.DrawGViz(G_UD, snap.gvlDot, "SNAP_copurchasing_part1.png", "amazon", NIdName)

In [None]:
import snap

G = snap.GenGrid(snap.PUNGraph, 5, 3)
G.DrawGViz(snap.gvlDot, "grid5x3.png", "Grid 5x3")

snap.DrawGViz(G_UD, snap.gvlDot, "SNAP_copurchasing_part1.png", "amazon", NIdName)

**Computing Structural Properties**

**Reference** https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html#computing-structural-properties
1. Get a distribution of connected components (component size, count)
2. Get degree distribution pairs (out-degree, count)
3. Generate a Preferential Attachment graph on 100 nodes and out-degree of 3
4. Define a vector of floats and get first eigenvector of graph adjacency matrix
5. Get an approximation of graph diameter:
6. Count the number of triads:
7. Get the clustering coefficient


In [None]:
# Computing Structural Properties
# Reference: https://snap.stanford.edu/snappy/doc/tutorial/tutorial.html#computing-structural-properties
# 1. Get a distribution of connected components (component size, count)
CntV = snap.TIntPrV()
snap.GetWccSzCnt(G_UD, CntV)
for p in CntV:
    print("size %d: count %d" % (p.GetVal1(), p.GetVal2()))

size 2: count 2
size 3: count 2
size 5: count 1
size 15: count 1
size 403364: count 1
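The sizes above account for every node: 2*2 + 3*2 + 5 + 15 + 403364 = 403394, and the largest component holds 403364 / 403394 of the nodes, which matches the 0.999926 reported by PrintInfo. To make the (size, count) pairs concrete, here is a plain-Python sketch that produces the same kind of distribution with breadth-first search on a toy graph. It is an illustration only, not SNAP's implementation, and the function name is hypothetical.

```python
from collections import Counter, deque

def component_size_counts(adj):
    """Return sorted (component size, number of components) pairs."""
    seen = set()
    sizes = Counter()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        sizes[size] += 1
    return sorted(sizes.items())

# Toy undirected graph: two pairs and one triangle
toy = {0: {1}, 1: {0}, 2: {3}, 3: {2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
for size, count in component_size_counts(toy):
    print("size %d: count %d" % (size, count))
```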


In [None]:
# 2. Get degree distribution pairs (out-degree, count)
snap.GetOutDegCnt(G_UD, CntV)
for p in CntV:
    print("degree %d: count %d" % (p.GetVal1(), p.GetVal2()))
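On an undirected graph the out-degree of a node is simply its degree, so the pairs printed by this cell form the ordinary degree distribution. A small stand-alone sketch of the same tally on a toy edge list follows; the name `degree_counts` is illustrative and the code does not use SNAP.

```python
from collections import Counter

def degree_counts(edges):
    """Return sorted (degree, number of nodes) pairs for an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return sorted(Counter(len(n) for n in neighbours.values()).items())

# Toy graph: a triangle 0-1-2 with one extra node attached to node 2
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
for degree, count in degree_counts(toy_edges):
    print("degree %d: count %d" % (degree, count))
```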

In [None]:
# 3. Generate a Preferential Attachment graph on 100 nodes and out-degree of 3
# 4. Define a vector of floats and get first eigenvector of graph adjacency matrix
# 5. Get an approximation of graph diameter:
# 6. Count the number of triads:
# 7. Get the clustering coefficient


In [None]:

# define a vector of floats and get first eigenvector of graph adjacency matrix
EigV = snap.TFltV()
snap.GetEigVec(G_UD, EigV)
nr = 0
for f in EigV:
    nr += 1
    print("%d: %.6f" % (nr, f))


Streaming output truncated to the last 5000 lines.
398395: 0.000006
398396: 0.000005
398397: 0.000007
398398: -0.000007
398399: 0.000577
398400: 0.000820
398401: 0.000007
398402: 0.000005
398403: 0.000010
398404: 0.000409
398405: -0.000002
398406: -0.000018
398407: -0.000008
398408: 0.008621
398409: -0.000005
398410: -0.000005
398411: -0.000005
398412: -0.000009
398413: 0.000236
398414: -0.000002
398415: 0.000031
398416: -0.000001
398417: 0.000024
398418: 0.005247
398419: -0.000014
398420: 0.000011
398421: 0.000004
398422: 0.000578
398423: -0.000020
398424: -0.000023
398425: 0.000062
398426: 0.000056
398427: -0.000026
398428: 0.000099
398429: -0.000016
398430: -0.000002
398431: -0.000014
398432: -0.000028
398433: -0.000013
398434: -0.000012
398435: -0.000014
398436: 0.000008
398437: 0.000009
398438: 0.000008
398439: 0.000007
398440: 0.000005
398441: 0.000009
398442: 0.000009
398443: -0.000001
398444: 0.000010
398445: 0.000007
398446: 0.000075
398447: 0.004840
398448: -0.0
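Each printed value is one node's entry in the leading eigenvector of the adjacency matrix. One common way to approximate such a vector is power iteration: repeatedly replace each node's value with the sum of its neighbours' values and renormalise. The sketch below applies that idea to a toy graph. It is an assumption-laden illustration, not necessarily the method SNAP uses, and it relies on the graph being connected and non-bipartite so that the iteration converges.

```python
import math

def leading_eigenvector(adj, iterations=200):
    """Power iteration on an undirected adjacency dict {node: set(neighbours)}."""
    vec = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # Multiply by the adjacency matrix: sum the neighbours' current values
        nxt = {n: sum(vec[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(x * x for x in nxt.values()))
        vec = {n: x / norm for n, x in nxt.items()}
    return vec

# Toy graph: a triangle 0-1-2 with one extra node attached to node 2
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
vec = leading_eigenvector(toy)

# Rayleigh quotient: eigenvalue estimate for the unit-length vector
eigenvalue = sum(vec[n] * sum(vec[m] for m in toy[n]) for n in toy)

for node in sorted(vec):
    print("%d: %.6f" % (node, vec[node]))
```

In this toy graph the best-connected node (node 2) receives the largest entry, which is the intuition behind reading the values as a centrality score.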

In [None]:

# get an approximation of graph diameter
diam = snap.GetBfsFullDiam(G_UD, 10)
print("diam", diam)   

diam 19
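The second argument of GetBfsFullDiam is the number of BFS start nodes (10 here), so the result is an estimate from a sample of nodes. That is consistent with this run printing 19 while PrintInfo reported an approximate full diameter of 20. The sketch below shows why a sampled estimate can fall short of the true diameter; the function names are hypothetical and the code is plain Python, not SNAP.

```python
from collections import deque

def eccentricity(adj, start):
    """Longest shortest-path distance from start, found by breadth-first search."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return max(dist.values())

def sampled_diameter(adj, start_nodes):
    """Largest eccentricity over the chosen start nodes: a lower bound on the diameter."""
    return max(eccentricity(adj, s) for s in start_nodes)

# Toy path graph 0-1-2-3-4 whose true diameter is 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sampled_diameter(path, [2]))     # middle node only: prints 2
print(sampled_diameter(path, [0, 2]))  # includes an end node: prints 4
```

Raising the number of start nodes tightens the estimate at the cost of one more BFS per node.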


In [None]:
# count the number of triads:
triads = snap.GetTriads(G_UD)
print("triads", triads)

triads 3986507
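The 3,986,507 triads printed here equal the "Closed triangles" figure from PrintInfo, and the fraction reported there is consistent with closed / (closed + open): 3986507 / (3986507 + 60250166) ≈ 0.062060. The sketch below reproduces that bookkeeping on a toy graph, assuming that "open" means a pair of neighbours of a node that are not themselves connected. The function name is illustrative and the code is not SNAP's implementation.

```python
from itertools import combinations

def triad_counts(adj):
    """Return (closed triangles, open triads) for an undirected adjacency dict."""
    closed_wedges = 0
    open_wedges = 0
    for nbrs in adj.values():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                closed_wedges += 1  # the two neighbours are linked: part of a triangle
            else:
                open_wedges += 1
    # Every triangle is seen once from each of its three corners
    return closed_wedges // 3, open_wedges

# Toy graph: a triangle 0-1-2 with one extra node attached to node 2
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
closed, open_ = triad_counts(toy)
print("closed", closed, "open", open_, "fraction", closed / (closed + open_))
```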


In [None]:
# get the clustering coefficient
cf = snap.GetClustCf(G_UD)
print("cf", cf)
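As a point of reference for the value this cell prints, the sketch below computes an average clustering coefficient by hand: for each node, the share of its neighbour pairs that are linked to each other, averaged over all nodes. It assumes GetClustCf reports this average of local coefficients; the function name `average_clustering` is illustrative and the code does not call SNAP.

```python
from itertools import combinations

def average_clustering(adj):
    """Mean over all nodes of (linked neighbour pairs) / (possible neighbour pairs)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)

# Toy graph: a triangle 0-1-2 with one extra node attached to node 2
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
avg = average_clustering(toy)
print("cf", avg)
```

Here nodes 0 and 1 score 1, node 2 scores 1/3 and node 3 scores 0, giving an average of 7/12.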

DataTypes:
snap.TInt() - 0 if empty
snap.TFlt()
snap.TStr()

Vector types: sequence of values of the same type
TIntV, TFltV, TStrV
Add(), Len(), [index], for i in V
v = snap.TIntV()
v.Add(1)
v.Add(2)
v.Add(3)
v.Add(4)
v.Add(5)
print(v.Len())
print(v[3])
v[3] = 2*v[2]
print(v[3])
for item in v:
    print(item)
for i in range(0, v.Len()):
    print(i, v[i])

Hash table:
h = snap.TIntStrH()
h[5] = "apple"
h[3] = "tomato"
h[9] = "orange"
h[6] = "banana"
h[1] = "apricot"
print(h.Len())
print("h[3] =", h[3])
h[3] = "peach"
print("h[3] =", h[3])
for key in h:
    print(key, h[key])

Pairs:
TIntStrPrV: a vector of (integer, string) pairs
TIntPrV: a vector of (integer, integer) pairs
TIntPrFltH: a hash table with (integer, integer) pair keys and float values

p = snap.TIntStrPr(1,"one")
print(p.GetVal1())
1
print(p.GetVal2())
one