<a href="https://colab.research.google.com/github/Yunpei24/BigDataBase/blob/main/ProjetBigData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setting up the runtime environment

Install the JDK

In [29]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Download the Apache Spark archive

In [30]:
# Download Spark
!wget -q https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

Extract the archive into the current directory <mark>/content</mark>

In [31]:
# Unzip the file
!tar xf spark-3.3.1-bin-hadoop3.tgz

Install the Python modules <b>pyspark</b> and <b>findspark</b>

In [32]:
!pip install -q pyspark
!pip install -q findspark

Check the pyspark installation

In [33]:
!find /content -name "pyspark"

/content/spark-3.3.1-bin-hadoop3/python/pyspark
/content/spark-3.3.1-bin-hadoop3/python/pyspark/python/pyspark
/content/spark-3.3.1-bin-hadoop3/bin/pyspark


Create the environment variables <mark>SPARK_HOME</mark> and <mark>JAVA_HOME</mark>, which point to the Spark and Java installation directories respectively

In [34]:
import os
os.environ["SPARK_HOME"] =  "/content/spark-3.3.1-bin-hadoop3" 
os.environ["JAVA_HOME"] ="/usr/lib/jvm/java-8-openjdk-amd64"
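Before initializing Spark, it can help to confirm that both variables point at directories that actually exist. The helper below is a minimal sketch; `missing_env_dirs` is a hypothetical name introduced here for illustration and is not part of Spark or findspark.

```python
import os

def missing_env_dirs(names=("SPARK_HOME", "JAVA_HOME")):
    """Return the variables that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means every listed variable points to an existing directory.
print(missing_env_dirs())
```

A non-empty result usually means the download or extraction step did not complete as expected.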

Import the Spark SQL libraries

In [35]:
import findspark 
print("findspark.init() sets up the environment variables for Spark")
findspark.init()

# Pyspark session objects
from pyspark.sql import SparkSession
# Pyspark session configuration
from pyspark import SparkConf

# Pyspark functions
import pyspark.sql.functions as f
from pyspark.sql import *

# Pyspark SQL data types
from pyspark.sql.types import *

findspark.init() sets up the environment variables for Spark


# Data analysis and visualization

## Helper functions for the PySpark environment

The <mark>demarrer_spark</mark> function initializes a <i>client</i> session with Spark

In [36]:
def demarrer_spark():
  local = "local[*]"
  appName = "TP3"
  configLocale = SparkConf().setAppName(appName).setMaster(local).\
  set("spark.executor.memory", "100G").\
  set("spark.driver.memory","50G").\
  set("spark.sql.catalogImplementation","in-memory").\
  set("spark.driver.maxResultSize", "10G")
  
  spark = SparkSession.builder.config(conf = configLocale).getOrCreate()
  sc = spark.sparkContext
  sc.setLogLevel("ERROR")
  
  # spark.conf.set("spark.sql.autoBroadcastJoinThreshold","-1")
  # Tune the query execution environment to the cluster size (4 cores)
  # spark.conf.set("spark.sql.shuffle.partitions","200")

  print("Session started, application id:", sc.applicationId)
  return spark

Start the session

In [37]:
spark = demarrer_spark()


To simplify running SQL queries, we define the &#128526; <b><font color="blue">%%sql</font></b> magic command

In [None]:
from IPython.core.magic import (register_line_magic, register_cell_magic, register_line_cell_magic)
import gc

def removeComments(query):
  result = ""
  for line in query.split('\n'):
    if not(line.strip().startswith("--")):
      result += line + "\n"
  return result

@register_line_cell_magic
def sql(line, cell=None):
    "To run a sql query. Use:  %%sql"
    val = cell if cell is not None else line
    tabRequetes = removeComments(val).split(";")
    resultat = None
    est_une_requete = False
    for r in tabRequetes:
        r = r.strip()
        if len(r) > 2:
          resultat = spark.sql(r)
          est_une_requete = r.lower().startswith('select') or r.lower().startswith('with')  
    if(est_une_requete):
      # Explain the execution plan
      #resultat.explain()
      # Display the result
      return display(resultat)
    else:
      return print('ok')
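The statement handling inside the magic can be tried without a Spark session. The sketch below reproduces only the text processing (dropping `--` comment lines, splitting on `;`, and checking whether the last statement is a query); `split_statements` and `returns_rows` are hypothetical names used for illustration.

```python
def split_statements(cell):
    """Drop `--` comment lines, then split the cell on ';' into statements."""
    kept = [ln for ln in cell.split("\n") if not ln.strip().startswith("--")]
    parts = "\n".join(kept).split(";")
    # Same threshold as the magic: ignore fragments of two characters or fewer
    return [p.strip() for p in parts if len(p.strip()) > 2]

def returns_rows(statement):
    """True when the statement is a query whose result should be displayed."""
    return statement.lower().startswith(("select", "with"))

cell = """-- build a view, then query it
CREATE OR REPLACE TEMP VIEW v AS SELECT 1 AS x;
SELECT x FROM v;"""

statements = split_statements(cell)
print(statements)
print(returns_rows(statements[-1]))
```

Splitting on `;` is deliberately naive: a semicolon inside a string literal would cut a statement in two, so this approach suits simple cells only.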

Likewise, we redefine the <b>display</b> function for a better rendering of the data being handled.

In [None]:
import pandas as pd

def display(df, n=10):
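  # Use fully qualified option names; the short form can be ambiguous in recent pandas versions
  pd.set_option('display.max_columns', None)
  pd.set_option('display.max_colwidth', None)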
  pdf = df.limit(n).toPandas()
  # Free memory
  df.unpersist()
  # Force Spark to free memory
  spark.catalog.clearCache()
  # and Python too
  gc.collect(2)
  return pdf

print("display redefined")

## Visualization functions

Function that runs a SQL query and converts the result (a Spark DataFrame) into a Pandas DataFrame

In [None]:
import gc

def getPandasDataFrame(sqlQuery):
  # Execute SQL Query with PySpark
  dfSpark = spark.sql(sqlQuery)
  # Convert Spark dataframe to Pandas dataframe
  pdf = dfSpark.toPandas()
  # Force Spark to free memory
  dfSpark.unpersist()
  spark.catalog.clearCache()
  # and Python too
  gc.collect(2)
  # Return the Pandas Dataframe
  return pdf

Plotting functions

In [None]:
import plotly.graph_objects as go
import plotly.express as px
import plotly.tools as pt
import numpy as np
import math

def drawLine(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  # plotting the line chart
  fig = px.line(pdf, x=pdf.columns[0], y=pdf.columns[1])
  # showing the plot
  fig.show()

def drawBar(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  # plotting the bar chart
  fig = px.bar(pdf, x=pdf.columns[0], y=pdf.columns[1])
  # showing the plot
  fig.show()

def drawHistogram(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  # plotting the histogram chart
  fig = px.histogram(pdf, x=pdf.columns[0], y=pdf.columns[1])
  # showing the plot
  fig.show()

def drawHeatmap(sql, scale=lambda x: x):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
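  # Expect exactly three columns (x, y, value) with a numeric value column
  if len(pdf.columns) != 3 or not (pdf[pdf.columns[2]].dtype == np.float64 or pdf[pdf.columns[2]].dtype == np.int64):
    raise Exception("drawHeatmap expects 3 columns (x, y, value) with a numeric third column")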
  source = pdf[pdf.columns[0]].tolist()
  target = pdf[pdf.columns[1]].tolist()
  value = pdf[pdf.columns[2]].tolist()
  # plotting the figure
  fig = go.Figure(data = go.Heatmap(x = source, y = target, z = [scale(x) for x in value])) 
  fig.show()

def drawPie(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  # plotting the pie chart
  fig = px.pie(pdf, names=pdf.columns[0], values=pdf.columns[1])
  # showing the plot
  fig.show()

def drawStackedBar(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  # plotting the stacked bar chart
  fig = px.bar(pdf, x=pdf.columns[0], y=pdf.columns[2], color=pdf.columns[1], hover_data=[pdf.columns[1]], barmode = 'stack')
  # showing the plot
  fig.show()

def drawSankey(sql):
  # Getting Pandas Dataframe
  pdf = getPandasDataFrame(sql)
  
  labels = []
  x = set(pdf[pdf.columns[0]].tolist())
  dicX = {}
  i = 0
  for e in x:
    dicX[e] = i
    labels.append(e)
    i += 1
    
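  y = set(pdf[pdf.columns[1]].tolist())
  dicY = {}
  # i = len(labels)
  for e in y:
    if(e in dicX):
      # Node already listed as a source: reuse its index
      dicY[e] = dicX[e]
    else:
      # New node: its index must match its position in labels
      dicY[e] = i
      i += 1
      labels.append(e)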

  fig = go.Figure(data=[go.Sankey(
    node = dict(
      thickness = 5,
      line = dict(color = "green", width = 0.1),
      label = labels,
      color = "blue"
    ),
    link = dict(
      # indices correspond to labels
      source = [dicX[e] for e in pdf[pdf.columns[0]].tolist()],
      target = [dicY[e] for e in pdf[pdf.columns[1]].tolist()],
      value = pdf[pdf.columns[2]].tolist()
  ))])

  # showing the plot
  fig.show()
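In `go.Sankey`, each link refers to its source and target nodes by their position in the label list, so every distinct label has to be listed exactly once. The sketch below shows that indexing step in isolation on a hypothetical set of flows; `build_node_index` is an illustrative name, not a Plotly function.

```python
def build_node_index(sources, targets):
    """Give each distinct label one position; return (labels, index)."""
    labels, index = [], {}
    for label in list(sources) + list(targets):
        if label not in index:
            index[label] = len(labels)
            labels.append(label)
    return labels, index

# Hypothetical flows: "cart" appears both as a target and as a source.
sources = ["view", "view", "cart"]
targets = ["cart", "purchase", "purchase"]

labels, index = build_node_index(sources, targets)
print(labels)
print([index[s] for s in sources])
print([index[t] for t in targets])
```

Because `index[label]` always equals the label's position in `labels`, a node shared by both columns is drawn once and the links stay aligned with their labels.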

## Getting the dataset

Download the dataset

In [None]:
!curl -L -o ecommerce-behavior-data-from-multi-category-store.zip 'https://drive.google.com/u/0/uc?id=1CVhmxsU3GY0FYGS1uP3m_tGbyGjEfuQc&export=download&confirm=t'
#!curl -L -o ecommerce-behavior-data-from-multi-category-store.zip 'https://storage.googleapis.com/kaggle-data-sets/411512/835452/compressed/2019-Nov.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20230128%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20230128T111731Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2afcc1a011c86d89fe64d1c12bc1432f703c525f12c74f43fad4f455a5183c74589932fbabb73bce85de91427906abecec18c6929a894fd0ca8657683b665379deea648ef51f6bb4c114125998ee24b7fdd2b630cdc327e142d0f8130f2f5e9306d45293940e87b2c05aa32151f52ab4a85638d5920e6de0fbf13b8daaffd7fbeb21009fc42c8baf268a399a1419b0bf0c9a5a5150732d0d10d4a1b90c7b516d60a01ffb2dc3b42c9266f3acdecf42b791a074f379ec89295af92a337d89af4f092e6a74db6b74f75305604e9593e265dafdf6e25dbe9b9160840864260541f1a188473fa9c59514fd0d4136cd04066084275d95e238525b94333cac9a6b6ceb'

Extract the data

In [None]:
!unzip -o ecommerce-behavior-data-from-multi-category-store.zip
!ls .

Preview of the data format

In [None]:
#!head -10 2019-Nov.csv

Load the dataset into Spark

In [None]:
df = spark.read.csv("2019-Nov.csv", header=True, sep=',')
#display(df)

Display the number of records in the dataset

Display the schema of the Spark DataFrame

In [None]:
#df.printSchema()

Cast selected columns to the expected data types

In [None]:
df = df.withColumn("event_time",df.event_time.cast(TimestampType()))
df = df.withColumn("product_id",df.product_id.cast(IntegerType()))
# LongType here: category ids in this dataset are assumed to exceed the 32-bit IntegerType range
df = df.withColumn("category_id",df.category_id.cast(LongType()))
df = df.withColumn("price",df.price.cast(DoubleType()))
df = df.withColumn("user_id",df.user_id.cast(IntegerType()))
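Spark's `IntegerType` is a signed 32-bit integer, while `LongType` is 64-bit, so an identifier with many digits may not survive a cast to `IntegerType` (in Spark's default non-ANSI mode such a cast typically yields NULL rather than an error). The check below is a minimal sketch in plain Python; the two sample values are hypothetical and serve only to illustrate both cases.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def fits_int32(text):
    """True when the decimal string fits a signed 32-bit integer."""
    return INT32_MIN <= int(text) <= INT32_MAX

# Hypothetical sample values, chosen only to illustrate the two cases.
print(fits_int32("1003461"))              # True
print(fits_int32("2053013555631882655"))  # False
```

Running a check like this on a few raw values before casting helps decide between `IntegerType` and `LongType` for each id column.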

Display the new schema of the Spark DataFrame

In [None]:
#df.printSchema()

Register the DataFrame as a SQL temporary view named <mark>events</mark>

In [None]:
df.createOrReplaceTempView('events')

# 1. Revenue by day of the week

In [None]:
# Index 0 is a placeholder so that 1 maps to Monday and 7 to Sunday
dayNames = ['', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
  
def weekDayName(dayID):
  global dayNames
  val = int(dayID)
  if(1 <= val <= 7):
    return dayNames[val]
  else:
    return "Unknown"

In [None]:
spark.udf.register("weekDayName", weekDayName, StringType())

In [None]:
# weekday() returns 0 = Monday .. 6 = Sunday; the 'F' pattern does not give the day of the week
sql = "SELECT weekDayName(weekday(event_time) + 1) AS JoursSemaine, SUM(price) AS ChiffreAffaire FROM events WHERE event_type= 'purchase' GROUP BY weekDayName(weekday(event_time) + 1)"
drawHistogram(sql)
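The UDF expects a day number where 1 is Monday and 7 is Sunday. Python's `date.isoweekday()` uses that same numbering, so the mapping can be checked locally without Spark. This is a minimal sketch with English day names; `week_day_name` and `DAY_NAMES` are illustrative names.

```python
from datetime import date

# Position 0 is a placeholder so that index 1 maps to Monday.
DAY_NAMES = ["", "Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def week_day_name(day):
    """Map an ISO weekday number (1 = Monday .. 7 = Sunday) to its name."""
    value = int(day)
    return DAY_NAMES[value] if 1 <= value <= 7 else "Unknown"

d = date(2019, 11, 1)
print(d.isoweekday(), week_day_name(d.isoweekday()))  # 5 Friday
```

Comparing a few known dates this way is a quick sanity check that the SQL expression feeding the UDF produces the intended numbering.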

# 2. Number of product purchases per day of the month

In [None]:
def extractDay(day):
  day1 = str(day).strip().split(" ")
  if len(day1) == 2:
    Date, Heure = day1
    return Date

In [None]:
spark.udf.register("extractDay", extractDay, StringType())

In [None]:
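# Order by day so that the line chart connects the points chronologically
sql = "SELECT extractDay(event_time) AS JoursMois, COUNT(event_type) AS NombreAchats FROM events WHERE event_type='purchase' GROUP BY extractDay(event_time) ORDER BY JoursMois"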
drawLine(sql)
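The behavior of the day-extraction logic can be checked in plain Python. The sketch below mirrors the UDF under the assumption that Spark hands it a `datetime` value for the timestamp column; `extract_day` is an illustrative name.

```python
from datetime import datetime

def extract_day(value):
    """Return the date part of a 'YYYY-MM-DD HH:MM:SS' value, else None."""
    parts = str(value).strip().split(" ")
    return parts[0] if len(parts) == 2 else None

print(extract_day(datetime(2019, 11, 1, 8, 30, 0)))  # 2019-11-01
print(extract_day(None))                             # None
```

Any value that does not split into exactly two space-separated parts, such as a string with a trailing time zone, maps to `None` and would end up grouped under a NULL day.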

# 3. Top 3 product categories and their revenue by event type

In [None]:
%%sql
CREATE OR REPLACE TEMP VIEW TopProd AS SELECT category_code, event_type, SUM(price) AS chiffreDaffaire FROM events WHERE category_code != 'None' GROUP BY category_code, event_type

In [None]:
%%sql
CREATE OR REPLACE TEMP VIEW TopProd2 AS SELECT category_code, event_type, chiffreDaffaire, RANK() over (PARTITION BY event_type ORDER BY chiffreDaffaire DESC) AS rang FROM TopProd

In [None]:
# Top 3 product categories and their revenue for each event type
sql = "SELECT event_type, category_code, chiffreDaffaire FROM TopProd2 WHERE rang <= 3"
drawHeatmap(sql, scale=math.log)
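The `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)` filter can be illustrated in plain Python on a few made-up rows. In this sketch a row's rank is one plus the number of rows in its group with a strictly larger value, which is how `RANK()` treats ties; `top_n_per_group` and the sample rows are hypothetical.

```python
from collections import defaultdict

def top_n_per_group(rows, n=3):
    """rows: (group, item, value). Keep rows whose rank within the group <= n."""
    groups = defaultdict(list)
    for group, item, value in rows:
        groups[group].append((item, value))
    kept = []
    for group, items in groups.items():
        for item, value in items:
            # RANK(): 1 + number of rows in the partition with a larger value
            rank = 1 + sum(1 for _, other in items if other > value)
            if rank <= n:
                kept.append((group, item, value, rank))
    return sorted(kept, key=lambda r: (r[0], r[3], r[1]))

# Hypothetical (event_type, category_code, revenue) rows.
rows = [
    ("view", "a", 50.0), ("view", "b", 40.0),
    ("view", "c", 40.0), ("view", "d", 10.0),
    ("purchase", "a", 5.0),
]
for row in top_n_per_group(rows, n=3):
    print(row)
```

Because tied rows share a rank, a filter such as `rang <= 3` can return more than three rows per event type.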

# 4. Revenue per brand by day of the week

In [38]:
# weekday() returns 0 = Monday .. 6 = Sunday; the 'F' pattern does not give the day of the week
sql = "SELECT brand AS Marque, weekDayName(weekday(event_time) + 1) AS JoursSemaine, SUM(price) AS chiffreDaffaire FROM events GROUP BY weekDayName(weekday(event_time) + 1), brand"
drawHeatmap(sql, scale=math.log)
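`math.log` raises a `ValueError` when it receives 0, which could happen if some aggregated value is zero. One possible alternative, shown as a minimal sketch, is a scale based on `math.log1p`; `safe_log` is a hypothetical helper, and it assumes the values are non-null numbers.

```python
import math

def safe_log(x):
    """Return log(1 + x), clamping negative inputs to 0 so it never raises."""
    return math.log1p(max(float(x), 0.0))

print(safe_log(0))           # 0.0
print(safe_log(math.e - 1))  # approximately 1.0

# Possible usage with the helper defined earlier in this notebook:
# drawHeatmap(sql, scale=safe_log)
```

The trade-off is a slightly different color scale: small values are compressed toward zero instead of becoming negative.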
