# Project Steam

> Usage: this notebook is meant to run in the **Databricks** environment only.

## Loading raw dataset

In [0]:
# Imports used within this notebook
from pyspark.sql import functions as F
from pyspark.sql.types import *

In [0]:
steam_dataset_path = "s3://full-stack-bigdata-datasets/Big_Data/Project_Steam/steam_game_output.json"

raw_df = spark.read.json(steam_dataset_path)

print("Original Steam dataset loaded.")

Original Steam dataset loaded.


## Schema walkthrough

In [0]:
raw_df.count()

55691

In [0]:
raw_df.columns

['data', 'id']

In [0]:
raw_df.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- appid: long (nullable = true)
 |    |-- categories: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ccu: long (nullable = true)
 |    |-- developer: string (nullable = true)
 |    |-- discount: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- header_image: string (nullable = true)
 |    |-- initialprice: string (nullable = true)
 |    |-- languages: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- negative: long (nullable = true)
 |    |-- owners: string (nullable = true)
 |    |-- platforms: struct (nullable = true)
 |    |    |-- linux: boolean (nullable = true)
 |    |    |-- mac: boolean (nullable = true)
 |    |    |-- windows: boolean (nullable = true)
 |    |-- positive: long (nullable = true)
 |    |-- price: string (nullable = true)
 |    |-- publisher: string (nullable = true)
 |    |-- release_date: string (nullable = true)
 |    |-

There are 2 root properties: "data" and "id".

- "**data**" is a complex structure.
- "**id**" is a string.

**data**

The **tags** field is a somewhat unusual structure: each property is one tag, and its value is the number of times the application was tagged with it. Given what the analysis needs, this field can reasonably be ignored from here on.

In [0]:
# What the nested schema of "data" looks like
print(f"{'Property':<20} | Type")
print(f"{'-'*20} | {'-'*10}")

for field in raw_df.schema["data"].dataType.fields:
    print(f"{field.name:<20} | {type(field.dataType).__name__}")

Property             | Type
-------------------- | ----------
appid                | LongType
categories           | ArrayType
ccu                  | LongType
developer            | StringType
discount             | StringType
genre                | StringType
header_image         | StringType
initialprice         | StringType
languages            | StringType
name                 | StringType
negative             | LongType
owners               | StringType
platforms            | StructType
positive             | LongType
price                | StringType
publisher            | StringType
release_date         | StringType
required_age         | StringType
short_description    | StringType
tags                 | StructType
type                 | StringType
website              | StringType


In [0]:
from pyspark.sql.types import StructType, StructField
from typing import Callable, Dict, Generator, List, Optional, Union

def walkSchema(schema: Union[StructType, StructField],
               exclude: Optional[List[str]] = None) -> Generator[str, None, None]:
    """Explores a PySpark schema.

    schema: StructType | StructField
    exclude: prefixes of nested fields (e.g. "data.tags") to skip

    Yield
    -----
    A generator of strings, the fully qualified name of each field in the schema
    """
    # avoid a mutable default argument
    exclude = exclude or []

    # Some keys may have a dot in their name. This function escapes those properly if encountered.
    def _sanitizedKeyName(name: str) -> str:
        if '.' in name:
            return f"`{name}`"
        return name

    # _walk produces a string generator from a dictionary "schema_dct" and a string "prefix"
    def _walk(schema_dct: Dict[str, Union[str, list, dict]],
              prefix: str = "") -> Generator[str, None, None]:
        assert isinstance(prefix, str), "prefix should be a string"

        if prefix in exclude:
            print(f"!! -> prefix {prefix} is excluded")
            return

        # returns "name" if there is no prefix and "prefix.name" otherwise
        fullName: Callable[[str], str] = lambda name: (
            _sanitizedKeyName(name) if not prefix else f"{prefix}.{_sanitizedKeyName(name)}")

        # name of the current field (empty for a bare type description)
        name = schema_dct.get('name', '')

        # struct: recurse into each field, keeping the current prefix
        if schema_dct['type'] == 'struct':
            assert 'fields' in schema_dct, (
                "It's a StructType, we should have some fields")
            for field in schema_dct['fields']:
                yield from _walk(field, prefix=prefix)
        elif schema_dct['type'] == 'array':
            # {'type': 'array', 'elementType': 'string', 'containsNull': True}
            elementType = schema_dct['elementType']
            if isinstance(elementType, dict):
                # arrays of complex types (struct, array, map) are not expanded
                print(f"Array type '{elementType}' not handled")
            else:
                # base type, map array as is
                yield prefix
        # the field type is itself a dict (nested struct or array):
        # go one level deeper, using the field name as prefix
        elif isinstance(schema_dct['type'], dict):
            assert 'fields' not in schema_dct, (
                "A field with a nested type should not carry 'fields' itself")
            yield from _walk(schema_dct['type'], prefix=fullName(name))
        # leaf field: yield its full name
        elif name:
            yield fullName(name)

    yield from _walk(schema.jsonValue())


In [0]:
for col_name in walkSchema(raw_df.schema, exclude=["data.tags"]):
    print(col_name)

data.appid
data.categories
data.ccu
data.developer
data.discount
data.genre
data.header_image
data.initialprice
data.languages
data.name
data.negative
data.owners
data.platforms.linux
data.platforms.mac
data.platforms.windows
data.positive
data.price
data.publisher
data.release_date
data.required_age
data.short_description
!! -> prefix data.tags is excluded
data.type
data.website
id
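
To make the recursion in `walkSchema` easier to follow, here is a minimal sketch in plain Python. It assumes that `schema.jsonValue()` returns nested dictionaries shaped like the hand-written `sample_schema` below (a reduced, illustrative subset of the real schema). `flatten_fields` is a hypothetical helper, not part of the notebook, and it leaves out the `exclude` and key-escaping logic.

```python
from typing import Dict, Iterator

# Hand-written stand-in for raw_df.schema.jsonValue(), reduced to a few fields (assumption).
sample_schema: Dict = {
    "type": "struct",
    "fields": [
        {"name": "data", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "appid", "type": "long", "nullable": True, "metadata": {}},
                {"name": "categories", "nullable": True, "metadata": {}, "type": {
                    "type": "array", "elementType": "string", "containsNull": True}},
                {"name": "platforms", "nullable": True, "metadata": {}, "type": {
                    "type": "struct",
                    "fields": [
                        {"name": "linux", "type": "boolean", "nullable": True, "metadata": {}},
                    ]}},
            ]}},
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
    ],
}


def flatten_fields(node: Dict, prefix: str = "") -> Iterator[str]:
    """Yield dotted names for leaf fields; arrays are reported as a single column."""
    for field in node["fields"]:
        full_name = f"{prefix}.{field['name']}" if prefix else field["name"]
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type["type"] == "struct":
            # nested struct: recurse with the field name as the new prefix
            yield from flatten_fields(field_type, prefix=full_name)
        else:
            # simple type or array: the column is used as is
            yield full_name


flat_names = list(flatten_fields(sample_schema))
print(flat_names)
```

The dotted names produced this way can be passed directly to `F.col(...)`, which is how the remapping further down addresses nested fields such as `data.platforms.linux`.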


In [0]:
# display(raw_df.limit(5).toPandas())

In [0]:

# Let's look at one record
raw_df.select(F.col('id'), F.col('data')).take(1)

[Row(id='10', data=Row(appid=10, categories=['Multi-player', 'Valve Anti-Cheat enabled', 'Online PvP', 'Shared/Split Screen PvP', 'PvP'], ccu=13990, developer='Valve', discount='0', genre='Action', header_image='https://cdn.akamai.steamstatic.com/steam/apps/10/header.jpg?t=1666823513', initialprice='999', languages='English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean', name='Counter-Strike', negative=5199, owners='10,000,000 .. 20,000,000', platforms=Row(linux=True, mac=True, windows=True), positive=201215, price='999', publisher='Valve', release_date='2000/11/1', required_age='0', short_description="Play the world's number 1 online action game. Engage in an incredibly realistic brand of terrorist warfare in this wildly popular team-based game. Ally with teammates to complete strategic missions. Take out enemy sites. Rescue hostages. Your role affects your team's success. Your team's success affects your role.", tags=Row(1980s=266, 1990's=

In [0]:
# What does data.type contain?
c_df = raw_df.groupBy("data.type").count()
c_df.show()

+--------+-----+
|    type|count|
+--------+-----+
|hardware|    1|
|    game|55690|
+--------+-----+



Only the games are of interest here.

In [0]:
# id == appid ?
raw_df.filter(F.col("id") != F.col("data.appid")).count()

0

We can work with the contents of `data` alone, since `id` is simply a copy of `appid`, which is itself contained in `data`.

## Preparing a suitable dataset

In [0]:
# Transforming a few columns
# languages: from comma separated list to array (ex.: "languages": "Simplified Chinese, English, Traditional Chinese, Japanese, Korean",)
# genres: from comma separated list to array
# release_date: as Date ("release_date": "2019/06/24")

raw_df_columns_remap = {
  "appid": F.col("data.appid"),
  "ccu": F.col("data.ccu"),
  "categories": F.col("data.categories"),
  "developer": F.col("data.developer"),
  "genres": F.split("data.genre", r",\s*"),
  "name": F.col("data.name"),
  "initial_price": F.col("data.initialprice"),
  "price": F.col("data.price"),
  "discount": F.col("data.discount"),
  "positive_ratings": F.col('data.positive'),
  "negative_ratings": F.col('data.negative'),
  "publisher": F.col("data.publisher"),
  "release_date": F.to_date(F.col("data.release_date"), 'yyyy/MM/d'),
  "has_linux_support": F.col("data.platforms.linux"),
  "has_mac_support": F.col("data.platforms.mac"),
  "has_windows_support": F.col("data.platforms.windows"),
  "owners": F.col("data.owners"),
  "languages": F.split("data.languages", r",\s*"),
  "required_age": F.col("data.required_age"),
  "type": F.col('data.type'),
}
# Some casts may be used later:
# discount -> cast(DoubleType())
# price -> cast(DoubleType())
# initialprice -> cast(DoubleType())
# required_age -> cast(IntegerType())

# 1. map the columns of the nested schema
# 2. cast a few values (here only the date; the numeric casts listed above are deferred)
# 3. filter to keep only the games
steam_games_df = raw_df.withColumns(raw_df_columns_remap).drop('data').filter(F.col('type') == 'game')

steam_games_df.count()


55690
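
The string transformations used in the remapping can be checked on a single record in plain Python. This is only an illustration: `re.split` and `datetime.strptime` stand in for Spark's `F.split` and `F.to_date`, whose pattern syntax differs (`yyyy/MM/d` in Spark versus `%Y/%m/%d` here). The sample values are copied from the Counter-Strike record shown earlier.

```python
import re
from datetime import date, datetime

languages_raw = "English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean"
release_date_raw = "2000/11/1"

# Split on a comma followed by optional whitespace, like F.split(..., r",\s*")
languages = re.split(r",\s*", languages_raw)

# Parse the date; a single-digit day such as "1" is accepted by %d
release_date = datetime.strptime(release_date_raw, "%Y/%m/%d").date()

print(languages)
print(release_date)
```

Splitting on the comma rather than on whitespace keeps multi-word entries such as `Spanish - Spain` intact.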

In [0]:
display(steam_games_df.take(5))

id,appid,ccu,categories,developer,genres,name,initial_price,price,discount,positive_ratings,negative_ratings,publisher,release_date,has_linux_support,has_mac_support,has_windows_support,owners,languages,required_age,type
10,10,13990,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Valve,List(Action),Counter-Strike,999,999,0,201215,5199,Valve,2000-11-01,True,True,True,"10,000,000 .. 20,000,000","List(English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean)",0,game
1000000,1000000,0,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",IndigoBlue Game Studio,"List(Action, Adventure, Indie)",ASCENXION,999,999,0,27,5,PsychoFlux Entertainment,2021-05-14,False,False,True,"0 .. 20,000","List(English, Korean, Simplified Chinese)",0,game
1000010,1000010,99,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud, Steam Trading Cards)",NEXT Studios,"List(Adventure, Indie, RPG, Strategy)",Crown Trick,1999,599,70,4032,646,"Team17, NEXT Studios",2020-10-16,False,False,True,"200,000 .. 500,000","List(Simplified Chinese, English, Japanese, Traditional Chinese, French, German, Spanish - Spain, Russian, Portuguese - Brazil)",0,game
1000030,1000030,76,"List(Multi-player, Single-player, Co-op, Steam Achievements, Steam Cloud, Shared/Split Screen, Full controller support, Steam Trading Cards, Shared/Split Screen Co-op, Remote Play on Phone, Remote Play on Tablet, Remote Play on TV, Remote Play Together)",Vertigo Gaming Inc.,"List(Action, Indie, Simulation, Strategy)","Cook, Serve, Delicious! 3?!",1999,1999,0,1575,115,Vertigo Gaming Inc.,2020-10-14,False,True,True,"100,000 .. 200,000",List(English),0,game
1000040,1000040,0,List(Single-player),DoubleC Games,"List(Action, Casual, Indie, Simulation)",细胞战争,199,199,0,0,1,DoubleC Games,2019-03-30,False,False,True,"0 .. 20,000",List(Simplified Chinese),0,game


In [0]:
def build_null_or_empty_check(column, dtype):
    if dtype == "string":
        # for strings, also check for empty strings
        return F.when(F.isnull(column) | (F.trim(F.col(column)) == ''), column)

    return F.when(F.isnull(column), column)
    
steam_games_df.select([F.count(build_null_or_empty_check(c, t)).alias(c) for c, t in steam_games_df.dtypes]).show()

+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------------+---------+------------+-----------------+---------------+-------------------+------+---------+------------+----+
| id|appid|ccu|categories|developer|genres|name|initial_price|price|discount|positive_ratings|negative_ratings|publisher|release_date|has_linux_support|has_mac_support|has_windows_support|owners|languages|required_age|type|
+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------------+---------+------------+-----------------+---------------+-------------------+------+---------+------------+----+
|  0|    0|  0|         0|      126|     0|   0|            0|    0|       0|               0|               0|      134|         222|                0|              0|                  0|     0|        0|           0|   0|
+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------

A few release dates are missing.
Not all developers and publishers are filled in (when one is missing, we can always try substituting the other).
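
As a minimal sketch of that fallback, the plain-Python functions below mirror what `F.coalesce("publisher", "developer")` does once empty strings have been turned into missing values. `normalize` and `publisher_or_developer` are hypothetical helper names, and the second and third records are made up for illustration.

```python
from typing import Optional


def normalize(value: Optional[str]) -> Optional[str]:
    """Trim the value and map empty strings to None."""
    if value is None:
        return None
    trimmed = value.strip()
    return trimmed or None


def publisher_or_developer(publisher: Optional[str], developer: Optional[str]) -> Optional[str]:
    """Return the publisher when present, otherwise fall back to the developer."""
    normalized_publisher = normalize(publisher)
    if normalized_publisher is not None:
        return normalized_publisher
    return normalize(developer)


records = [
    {"publisher": "Valve", "developer": "Valve"},
    {"publisher": "  ", "developer": "Some Studio"},  # made-up record: blank publisher
    {"publisher": "", "developer": ""},               # made-up record: nothing provided
]
resolved = [publisher_or_developer(r["publisher"], r["developer"]) for r in records]
print(resolved)
```

The fallback only works if blank values are normalized first, since a coalesce treats an empty string as a present value.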

In [0]:
# Check: the empty string is what was standing in for "not provided"
print("Check for missing values (isnull):")
steam_games_df.select([F.count(F.when(F.isnull("publisher"), "publisher")).alias("publisher"), F.count(F.when(F.isnull("developer"), "developer")).alias("developer")]).show()

# Make sure publisher and developer are NULL when not provided. This will simplify the use of coalesce later.
columns_nullifying_on_empty = {
    "publisher": F.when(F.trim(F.col("publisher"))=="", None).otherwise(F.trim(F.col("publisher"))),
    "developer": F.when(F.trim(F.col("developer")) == "", None).otherwise(F.trim(F.col("developer"))),
}

# Update the DataFrame
steam_games_df = steam_games_df.withColumns(columns_nullifying_on_empty)

# Validation check
print("Check for missing values (isnull) after updating the DataFrame:")
steam_games_df.select([F.count(F.when(F.isnull("publisher"), "publisher")).alias("publisher"), F.count(F.when(F.isnull("developer"), "developer")).alias("developer")]).show()


Check for missing values (isnull):
+---------+---------+
|publisher|developer|
+---------+---------+
|        0|        0|
+---------+---------+

Check for missing values (isnull) after updating the DataFrame:
+---------+---------+
|publisher|developer|
+---------+---------+
|      134|      126|
+---------+---------+



In [0]:
steam_games_df.select([F.count(build_null_or_empty_check(c, t)).alias(c) for c, t in steam_games_df.dtypes]).show()

+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------------+---------+------------+-----------------+---------------+-------------------+------+---------+------------+----+
| id|appid|ccu|categories|developer|genres|name|initial_price|price|discount|positive_ratings|negative_ratings|publisher|release_date|has_linux_support|has_mac_support|has_windows_support|owners|languages|required_age|type|
+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------------+---------+------------+-----------------+---------------+-------------------+------+---------+------------+----+
|  0|    0|  0|         0|      126|     0|   0|            0|    0|       0|               0|               0|      134|         222|                0|              0|                  0|     0|        0|           0|   0|
+---+-----+---+----------+---------+------+----+-------------+-----+--------+----------------+----------

## Macro analysis

* Which publishers have released the most games on Steam?
* Which games are the best rated?
* Are there years with a higher number of releases? Did the Covid pandemic affect the volume of releases?
* How are prices distributed? Are many games on sale?
* Which languages are the most represented?
* How many games are restricted to players aged 16 or 18 and over?


### Which publishers have released the most games on Steam?


In [0]:
# .select(F.coalesce("publisher", "developer").alias("publisher")) \
# note: NULL (publisher not provided) counts as one distinct value
publisher_count = steam_games_df \
        .select("publisher") \
        .distinct() \
        .count()

print("Number of publishers:")
display(publisher_count)
print()

result_df = steam_games_df \
        .select(F.coalesce("publisher", "developer").alias("publisher")) \
        .groupBy("publisher").agg(F.count("*").alias("games_count")) \
        .orderBy(F.desc("games_count")) \
        .limit(10)

print("Top 10 publishers by number of games released:")
display(result_df)

Number of publishers:


29834


Top 10 publishers by number of games released:


publisher,games_count
Big Fish Games,423
8floor,202
SEGA,165
Strategy First,151
Square Enix,141
Choice of Games,140
Sekai Project,132
HH-Games,132
Ubisoft,127
Laush Studio,126



#### Analysis

The most represented publisher on Steam is [**Big Fish Games**](https://en.wikipedia.org/wiki/Big_Fish_Games), with 423 games in its catalogue.


### Which games are the best rated?

An older article proposes a formula for a fairer assessment of review scores (ref.: https://steamdb.info/blog/steamdb-rating/).
This ranking uses it.

$$\text{Total Reviews} = \text{Positive Reviews} + \text{Negative Reviews}$$

$$\text{Review Score} = \frac{\text{Positive Reviews}}{\text{Total Reviews}}$$

$$\text{Rating} = \text{Review Score} - (\text{Review Score} - 0.5) \cdot 2^{-\log_{10}(\text{Total Reviews} + 1)}$$
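
The formula can be checked by hand on one game. The sketch below is a plain-Python transcription of the three equations (the function name `steamdb_rating` is chosen here for illustration). Returning `None` when a game has no reviews is an assumption of this sketch, made to avoid a division by zero.

```python
import math
from typing import Optional


def steamdb_rating(positive: int, negative: int) -> Optional[float]:
    """Rating in percent following the formula above; None when there are no reviews."""
    total = positive + negative
    if total == 0:
        return None  # the review score is undefined without reviews
    score = positive / total
    # pull the raw score toward 0.5; the pull weakens as the number of reviews grows
    rating = score - (score - 0.5) * 2 ** (-math.log10(total + 1))
    return round(100 * rating, 2)


# Portal 2 figures taken from the results table of this notebook
portal_2_rating = steamdb_rating(305671, 3770)
print(portal_2_rating)
```

With few reviews the correction term dominates: a game with a single positive review stays close to 50% rather than jumping to 100%.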


In [0]:
result_df = steam_games_df \
        .withColumn("total_ratings", F.col("positive_ratings") + F.col("negative_ratings")) \
        .withColumn("rating_score", F.col("positive_ratings") / F.col("total_ratings")) \
        .withColumn("rating", F.round(100 * (F.col("rating_score") - (F.col("rating_score") - 0.5) * F.pow(2, -F.log10(F.col("total_ratings") + 1))), 2))

print("Top 20 games on Steam:")
display(result_df.select("name", "rating", "price", "positive_ratings", "negative_ratings", "publisher").orderBy(F.desc("rating")).limit(20))

Top 20 games on Steam:


name,rating,price,positive_ratings,negative_ratings,publisher
Portal 2,97.7,999,305671,3770,Valve
People Playground,97.49,999,142920,1649,Studio Minus
Hades,97.38,2499,199960,2829,Supergiant Games
Vampire Survivors,97.37,499,130311,1624,poncle
Stardew Valley,97.24,1499,497558,9283,ConcernedApe
Wallpaper Engine,97.18,399,561096,11031,Wallpaper Engine Team
Terraria,97.1,999,1014711,22380,Re-Logic
Portal,97.0,999,111786,1752,Valve
RimWorld,96.89,3499,142201,2550,Ludeon Studios
Half-Life: Alyx,96.81,5999,73942,1156,Valve


#### Analysis
The best games, according to the proposed formula, are:
1. Portal 2 (Valve)
2. People Playground (Studio Minus) [???]
3. Hades (Supergiant Games)


### Are there years with a higher number of releases? Did the Covid pandemic affect the volume of releases?

In [0]:
covid_years = [2020, 2021, 2022] # official end of the pandemic in May 2023

result_df = steam_games_df \
        .withColumn("release_year", F.year("release_date")) \
        .withColumn("is_covid", F.col("release_year").isin(covid_years)) \
        .groupBy("release_year", "is_covid").agg(F.count("*").alias("games_count"))

display(result_df \
          .orderBy(F.desc("release_year")))

release_year,is_covid,games_count
2022.0,True,7451
2021.0,True,8805
2020.0,True,8287
2019.0,False,6949
2018.0,False,7663
2017.0,False,6006
2016.0,False,4176
2015.0,False,2565
2014.0,False,1550
2013.0,False,469



#### Analysis

The number of games grew strongly until 2018, after which the number of releases stayed roughly stable on average, a kind of plateau over the COVID period. Figures after 2022 would be needed to refine the analysis.
Note that 2019 saw a drop compared with the previous year (6,949 releases vs. 7,663 in 2018), before releases rebounded and exceeded the 2018 figure in 2020 (8,287). Since 2019 is not among the COVID years defined in the code above, this dip cannot readily be attributed to the pandemic.

Based on this data, the most prolific year is 2021, in the middle of COVID, with 8,805 releases.
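
To put numbers on these trends, the year-over-year change can be computed directly from the counts in the table above. This is a small plain-Python sketch; the variable names are illustrative and the counts are copied by hand from the output.

```python
# Release counts per year, copied from the table above
releases_by_year = {2017: 6006, 2018: 7663, 2019: 6949, 2020: 8287, 2021: 8805, 2022: 7451}

years = sorted(releases_by_year)
yoy_change = {}
for previous, current in zip(years, years[1:]):
    before, after = releases_by_year[previous], releases_by_year[current]
    # relative change versus the previous year, in percent
    yoy_change[current] = round((after - before) / before * 100, 1)

for year, change in yoy_change.items():
    print(f"{year}: {change:+.1f}%")
```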


### How are prices distributed?

In [0]:
prices_df = steam_games_df \
    .withColumn("price", F.col("price").cast(DoubleType()) / 100.0) \
    .withColumn("initial_price", F.col("initial_price").cast(DoubleType()) / 100.0)

In [0]:
display(prices_df)

id,appid,ccu,categories,developer,genres,name,initial_price,price,discount,positive_ratings,negative_ratings,publisher,release_date,has_linux_support,has_mac_support,has_windows_support,owners,languages,required_age,type
10,10,13990,"List(Multi-player, Valve Anti-Cheat enabled, Online PvP, Shared/Split Screen PvP, PvP)",Valve,List(Action),Counter-Strike,9.99,9.99,0,201215,5199,Valve,2000-11-01,True,True,True,"10,000,000 .. 20,000,000","List(English, French, German, Italian, Spanish - Spain, Simplified Chinese, Traditional Chinese, Korean)",0,game
1000000,1000000,0,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud)",IndigoBlue Game Studio,"List(Action, Adventure, Indie)",ASCENXION,9.99,9.99,0,27,5,PsychoFlux Entertainment,2021-05-14,False,False,True,"0 .. 20,000","List(English, Korean, Simplified Chinese)",0,game
1000010,1000010,99,"List(Single-player, Partial Controller Support, Steam Achievements, Steam Cloud, Steam Trading Cards)",NEXT Studios,"List(Adventure, Indie, RPG, Strategy)",Crown Trick,19.99,5.99,70,4032,646,"Team17, NEXT Studios",2020-10-16,False,False,True,"200,000 .. 500,000","List(Simplified Chinese, English, Japanese, Traditional Chinese, French, German, Spanish - Spain, Russian, Portuguese - Brazil)",0,game
1000030,1000030,76,"List(Multi-player, Single-player, Co-op, Steam Achievements, Steam Cloud, Shared/Split Screen, Full controller support, Steam Trading Cards, Shared/Split Screen Co-op, Remote Play on Phone, Remote Play on Tablet, Remote Play on TV, Remote Play Together)",Vertigo Gaming Inc.,"List(Action, Indie, Simulation, Strategy)","Cook, Serve, Delicious! 3?!",19.99,19.99,0,1575,115,Vertigo Gaming Inc.,2020-10-14,False,True,True,"100,000 .. 200,000",List(English),0,game
1000040,1000040,0,List(Single-player),DoubleC Games,"List(Action, Casual, Indie, Simulation)",细胞战争,1.99,1.99,0,0,1,DoubleC Games,2019-03-30,False,False,True,"0 .. 20,000",List(Simplified Chinese),0,game
1000080,1000080,3,"List(Multi-player, Single-player, Steam Achievements, Full controller support, Steam Trading Cards)",IndieLeague Studio,"List(Action, Adventure, Indie, RPG)",Zengeon,19.99,7.99,60,1018,462,2P Games,2019-06-24,False,True,True,"100,000 .. 200,000","List(Simplified Chinese, English, Traditional Chinese, Japanese, Korean)",0,game
1000100,1000100,0,"List(Single-player, Steam Achievements, Steam Cloud)",七月九日,"List(Adventure, Indie, RPG, Strategy)",干支セトラ　陽ノ卷｜干支etc.　陽之卷,12.99,12.99,0,18,6,Starship Studio,2019-01-24,False,False,True,"0 .. 20,000","List(Japanese, Simplified Chinese, Traditional Chinese)",0,game
1000110,1000110,0,"List(Multi-player, Single-player, Co-op, Online PvP, Online Co-op, PvP)",重庆环游者网络科技,"List(Action, Adventure, Casual, Free to Play, Massively Multiplayer)",Jumping Master(跳跳大咖),0.0,0.0,0,50,34,重庆环游者网络科技,2019-04-08,False,False,True,"20,000 .. 50,000","List(English, Simplified Chinese, Traditional Chinese)",0,game
1000130,1000130,0,"List(Single-player, Steam Achievements, Steam Leaderboards)",Simon Codrington,"List(Casual, Indie)",Cube Defender,2.99,2.99,0,6,0,Simon Codrington,2019-01-06,False,True,True,"0 .. 20,000",List(English),0,game
1000280,1000280,0,List(Single-player),Villain Role,"List(Indie, RPG)",Tower of Origin2-Worm's Nest,13.99,13.99,0,32,12,Villain Role,2021-09-09,False,False,True,"0 .. 20,000","List(English, Simplified Chinese, Traditional Chinese)",0,game



There are a few outliers: games whose prices are extremely far from the mean.

In [0]:
display(prices_df.filter(F.col("price") > 100).orderBy(F.desc("price")))

id,appid,ccu,categories,developer,genres,name,initial_price,price,discount,positive_ratings,negative_ratings,publisher,release_date,has_linux_support,has_mac_support,has_windows_support,owners,languages,required_age,type
1200520,1200520,0,"List(Multi-player, Single-player, Co-op, LAN Co-op)",Fury Games,List(Action),Ascent Free-Roaming VR Experience,999.0,999.0,0,6,0,Fury Games,2019-12-27,False,False,True,"0 .. 20,000",List(English),0,game
253670,253670,0,List(Single-player),Aartform,List(Animation & Modeling),Aartform Curvy 3D 3.0,299.9,299.9,0,31,13,Aartform,2013-11-12,False,False,True,"0 .. 20,000",List(English),0,game
502570,502570,91,"List(Partial Controller Support, Steam Cloud)",SideFX,"List(Animation & Modeling, Design & Illustration, Video Production, Game Development)",Houdini Indie,269.99,269.99,0,152,8,SideFX,2018-10-10,False,True,True,"0 .. 20,000",List(English),0,game
2070990,2070990,2,List(),MAGIX Software GmbH,List(Video Production),VEGAS Edit 20 Steam Edition,249.0,249.0,0,1,0,MAGIX Software GmbH,2022-11-01,False,False,True,"0 .. 20,000","List(English, French, German, Spanish - Spain)",0,game
1022640,1022640,0,List(Single-player),wandwand,"List(Casual, Indie, RPG)",Lgnorant girl doll,199.99,199.99,0,2,2,wandwad,2019-02-15,False,False,True,"0 .. 20,000",List(English),0,game
1035340,1035340,0,List(),上海皋城软件有限公司,List(Education),眼睛（眼球）结构研究,199.99,199.99,0,1,0,上海皋城软件有限公司,2019-03-10,False,False,True,"0 .. 20,000","List(English, Simplified Chinese)",0,game
1103060,1103060,0,List(Single-player),H.G.G.,"List(Adventure, Casual, Indie, Early Access)",Run Thief,199.99,199.99,0,0,2,H.G.G.,2019-08-23,False,False,True,"0 .. 20,000","List(English, Spanish - Spain, Russian)",0,game
1259300,1259300,0,List(Single-player),"Patrick Kelley, CIT",List(Simulation),Spot Sample Witness Simulator,199.99,199.99,0,5,2,"Kelley Integrity Safety Solutions, LLC, Endless Simulations",2020-05-26,False,False,True,"0 .. 20,000",List(English),0,game
1289890,1289890,0,List(Single-player),RK,"List(Indie, RPG, Simulation)",VR Long March,199.99,199.99,0,2,1,RK,2020-11-04,False,False,True,"0 .. 20,000",List(Simplified Chinese),0,game
1429800,1429800,0,List(Single-player),Klip VR Immersive Technologies Pvt Ltd,"List(Adventure, Casual, Simulation)",Chandrayaan VR,199.99,199.99,0,1,0,Klip VR,2020-12-28,False,False,True,"0 .. 20,000",List(English),0,game


**Ascent Free-Roaming VR Experience** appears to be a licence aimed at VR arcades. It can be excluded from the analyses, and above all from the charts.

In [0]:
prices_df = prices_df.filter(F.col("price") < 300).orderBy(F.desc("price"))

In [0]:
display(prices_df \
        .select("price") \
        .orderBy(F.desc("price")))

price
299.9
269.99
249.0
199.99
199.99
199.99
199.99
199.99
199.99
199.99



In [0]:
result_df = steam_games_df \
    .withColumn("price", F.col("price").cast(DoubleType()) / 100.0) \
    .withColumn("initial_price", F.col("initial_price").cast(DoubleType()) / 100.0)

display(result_df.select(F.min("price").alias("min_price"), F.max("price").alias("max_price")))
# display(steam_games_df.select(F.min(F.col("price").cast(DoubleType())).alias("min_price"), F.max("price").alias("max_price")))

# width_bucket requires pyspark 3.5
# Bucket config
num_buckets = 20
min_price = 0.1
max_price = 200.0
bucket_width = (max_price - min_price) / num_buckets

# Create a new column with bucket number
result_df = result_df.withColumn(
    "bucket",
    F.width_bucket(F.col("price"), F.lit(min_price), F.lit(max_price), num_buckets)
)

# Compute bucket label as a string, e.g., "$0.0–$20.0"
result_df = result_df.withColumn(
    "bucket_label",
    F.expr(f"""
        CASE
            WHEN bucket = 0 THEN 'Below {min_price}'
            WHEN bucket = {num_buckets + 1} THEN 'Above {max_price}'
            ELSE concat(
                '$',
                CAST({min_price} + ({bucket_width}) * (bucket - 1) AS STRING),
                '–$',
                CAST({min_price} + ({bucket_width}) * (bucket) AS STRING)
            )
        END
    """)
)

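# Group by label and count
result_dist = result_df.groupBy("bucket_label").agg(F.count("*").alias("count"))

# Show result
# Note: ordering by the label sorts the buckets as strings ('$110...' comes before '$20...');
# grouping and ordering by "bucket" would give the numeric order.
display(result_dist.orderBy("bucket_label"))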


min_price,max_price
0.0,999.0


bucket_label,count
$0.100000000000000–$10.095000000000001,35941
$10.095000000000001–$20.090000000000002,9015
$110.045000000000011–$120.040000000000012,2
$120.040000000000012–$130.035000000000013,6
$140.030000000000014–$150.025000000000015,7
$150.025000000000015–$160.020000000000016,1
$190.005000000000019–$200.000000000000020,17
$20.090000000000002–$30.085000000000003,1789
$30.085000000000003–$40.080000000000004,600
$40.080000000000004–$50.075000000000005,238



In [0]:

result_df = steam_games_df \
    .withColumn("price", F.col("price").cast(DoubleType()) / 100.0) \
    .withColumn("initial_price", F.col("initial_price").cast(DoubleType()) / 100.0)

# Compute buckets with custom ranges
buckets = ["Free", "0<$<5", "5<$<10", "10<$<15", "15<$<20", "20<$<30", "30<$<50", "50<$<70", "70<$<100","Over 100"]
result_df = result_df.withColumn(
    "bucket",
    F.expr(f"""
        CASE
            WHEN price = 0 THEN 0
            WHEN price < 5 THEN 1
            WHEN price < 10 THEN 2
            WHEN price < 15 THEN 3
            WHEN price < 20 THEN 4
            WHEN price < 30 THEN 5
            WHEN price < 50 THEN 6
            WHEN price < 70 THEN 7
            WHEN price < 100 THEN 8
            ELSE 9
        END
    """)
)

# UDF (User Defined Function) that returns the label for a given bucket index
def get_bucket_label(bucket):
    if 0 <= bucket < len(buckets):
        return buckets[bucket]
    elif bucket >= len(buckets):
        return buckets[-1]
    else:
        return "Below range"

bucket_label_udf = F.udf(get_bucket_label, StringType())

result_df = result_df.withColumn("bucket_label", bucket_label_udf(F.col("bucket")))

# Group by label and count
result_dist = result_df.groupBy("bucket", "bucket_label").agg(F.count("*").alias("count"))

# Show result
display(result_dist.orderBy("bucket"))

bucket,bucket_label,count
0,Free,7779
1,0<$<5,23478
2,5<$<10,12450
3,10<$<15,5311
4,15<$<20,3711
5,20<$<30,1794
6,30<$<50,839
7,50<$<70,240
8,70<$<100,51
9,Over 100,37
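
The custom ranges can also be expressed without a `CASE` expression or a UDF. As a sketch in plain Python (not the notebook's method), `bisect_right` over the upper bounds reproduces the same bucket index. `price_bucket` is a hypothetical helper, and it assumes prices are never negative.

```python
from bisect import bisect_right

buckets = ["Free", "0<$<5", "5<$<10", "10<$<15", "15<$<20",
           "20<$<30", "30<$<50", "50<$<70", "70<$<100", "Over 100"]
# Upper bounds of buckets 1 to 8, in the same order as the CASE expression
upper_bounds = [5, 10, 15, 20, 30, 50, 70, 100]


def price_bucket(price: float) -> int:
    """Return the bucket index for a non-negative price."""
    if price == 0:
        return 0
    # number of bounds <= price, shifted by one because bucket 0 is "Free"
    return 1 + bisect_right(upper_bounds, price)


for sample_price in (0.0, 4.99, 5.0, 19.99, 99.99, 999.0):
    print(sample_price, buckets[price_bucket(sample_price)])
```

As in the `CASE` expression, a price that sits exactly on a bound (for example 5.0) falls into the higher bucket.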



#### Analysis

Many games are free (14%), and even more cost less than $5 (the largest share, at 42.2%).
Another 22% of games cost between $5 and $10.
Overall, 78% of games cost less than $10! Plenty of fun at low cost.
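
These percentages follow directly from the bucket counts. The short plain-Python sketch below recomputes them; the counts are copied by hand from the table above and the variable names are illustrative.

```python
# Bucket counts copied from the table above
bucket_counts = {
    "Free": 7779, "0<$<5": 23478, "5<$<10": 12450, "10<$<15": 5311, "15<$<20": 3711,
    "20<$<30": 1794, "30<$<50": 839, "50<$<70": 240, "70<$<100": 51, "Over 100": 37,
}
total_games = sum(bucket_counts.values())

# Share of each bucket, in percent of all games
shares = {label: round(count / total_games * 100, 2) for label, count in bucket_counts.items()}

# Cumulative share of the three cheapest buckets
under_10_count = bucket_counts["Free"] + bucket_counts["0<$<5"] + bucket_counts["5<$<10"]
under_10_share = round(under_10_count / total_games * 100, 2)

print(total_games, shares["Free"], shares["0<$<5"], shares["5<$<10"], under_10_share)
```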


### Are many games on sale?

In [0]:
result_df = steam_games_df \
    .withColumn("price", F.col("price").cast(DoubleType()) / 100.0) \
    .withColumn("initial_price", F.col("initial_price").cast(DoubleType()) / 100.0)

discounted_games = result_df.filter(F.col("discount") > 0).count() #.select("discount").distinct().show()
available_games = result_df.count()

print(f"Total number of games in the store: {available_games}")
print(f"Number of discounted games: {discounted_games} -> {discounted_games / available_games * 100:.2f}% of the total")

Total number of games in the store: 55690
Number of discounted games: 2518 -> 4.52% of the total


#### Analysis

About 4.5% of games are offered at a discount.
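
The `discount` field appears to be the percentage reduction between `initial_price` and `price`. The sketch below checks this on two records shown earlier in this notebook (Crown Trick and Zengeon), with prices expressed in cents. `discount_percent` is a hypothetical helper, and rounding to the nearest integer is an assumption.

```python
def discount_percent(initial_price_cents: int, price_cents: int) -> int:
    """Percentage reduction from the initial price, rounded to the nearest integer."""
    if initial_price_cents == 0:
        return 0  # a free game cannot be discounted
    return round((initial_price_cents - price_cents) / initial_price_cents * 100)


# (name, initial price in cents, current price in cents, discount reported in the dataset)
samples = [
    ("Crown Trick", 1999, 599, 70),
    ("Zengeon", 1999, 799, 60),
]
for name, initial_price, price, reported_discount in samples:
    print(name, discount_percent(initial_price, price), reported_discount)
```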

### Which languages are the most represented?


In [0]:
# In our DataFrame, "languages" is an array.
# To count each language, we need an "explode", which creates one row per array element.
result_df = steam_games_df \
    .withColumn("language", F.explode("languages"))

print("Top 10 des langues jouables :")
display(result_df \
    .groupBy("language").agg(F.count("*").alias("games_count")) \
    .orderBy(F.desc("games_count")) \
    .limit(15))

Top 10 des langues jouables :


language,games_count
English,55116
German,14019
French,13426
Russian,12922
Simplified Chinese,12782
Spanish - Spain,12233
Japanese,10368
Italian,9304
Portuguese - Brazil,6750
Korean,6600


Databricks visualization. Run in Databricks to view.


#### Analysis

Unsurprisingly, English leads by a wide margin: 55,116 of 55,690 games (about 99%) have an English version.
Ranking:
1. English
2. German
3. French

### How many games are prohibited for players under 16 or under 18?

In [0]:
result_df = steam_games_df \
    .withColumn("is_16_plus", F.col("required_age") >= 16) \
    .withColumn("is_18_plus", F.col("required_age") >= 18)

# Count the 16+ band separately from the 18+ band so the two do not overlap
is_16_plus_count = result_df.filter(F.col("is_16_plus") & ~F.col("is_18_plus")).count()
is_16_18_plus_count = result_df.filter(F.col("is_16_plus")).count()
is_18_plus_count = result_df.filter(F.col("is_18_plus")).count()

available_games = result_df.count()


print(f"Rated 16+ only: {is_16_plus_count} games ({is_16_plus_count / available_games * 100:.2f}% of total).")
print(f"In total (16+ and 18+), {is_16_18_plus_count} games ({is_16_18_plus_count / available_games * 100:.2f}% of total) are prohibited for players under 16.")
print(f"Prohibited under 18: {is_18_plus_count} games ({is_18_plus_count / available_games * 100:.2f}% of total).")


Rated 16+ only: 76 games (0.14% of total).
In total (16+ and 18+), 305 games (0.55% of total) are prohibited for players under 16.
Prohibited under 18: 229 games (0.41% of total).


#### Analysis
0.14% of games are rated 16+ only (prohibited under 16, allowed at 16 and 17).
In total, **0.55%** of games are prohibited for players under 16.

0.41% of games are prohibited for players under 18.

## Analysis by genre

* Which genres are the most represented on the platform?
* Do some genres have better positive/negative review ratios?
* Do some publishers favor particular genres?
* Which genres are the most lucrative?



In [0]:
# "genres" is an array -> explode
steam_games_genres_df = steam_games_df.withColumn('genre', F.explode('genres'))

### Which genres are the most represented on the platform?

In [0]:
# Total number of games: 55690
display(steam_games_genres_df \
    .groupBy("genre").agg(F.count("*").alias("games_count")) \
    .withColumn("ratio", 100 * F.col("games_count") / 55690) \
    .orderBy(F.desc("games_count")) \
    .limit(10))


genre,games_count,ratio
Indie,39681,71.25336685221764
Action,23759,42.662955647333455
Casual,22086,39.65882564194649
Adventure,21431,38.48267193391992
Strategy,10895,19.56365595259472
Simulation,10836,19.457712336146525
RPG,9534,17.119770156221943
Early Access,6145,11.034297001256958
Free to Play,3393,6.092655773029269
Sports,2666,4.787214939845573


Databricks visualization. Run in Databricks to view.


#### Analysis

71% of games are indie games.

43% of games are action games.

A game can carry several genres, so these shares overlap and do not sum to 100%.

### Do some genres have better positive/negative review ratios?

In [0]:
# Note: a game can belong to several genres
# Compute rating ratios per genre
result_df = steam_games_genres_df.groupBy("genre") \
    .agg((F.sum("positive_ratings")/F.sum("negative_ratings")).alias("ratings_ratio"))

display(result_df.orderBy(F.desc("ratings_ratio")).limit(10))


genre,ratings_ratio
Photo Editing,42.03353946889778
Animation & Modeling,26.17327220369809
Design & Illustration,24.958603325063876
Utilities,16.99503482518447
Game Development,8.38759926695174
Indie,7.67017877344188
Audio Production,7.331141281289775
,7.078350594848496
Video Production,6.815425987043149
Casual,6.527673915758578


Databricks visualization. Run in Databricks to view.

#### Analysis

The **Photo Editing** genre has the best ratio, with about 42 positive reviews for each negative one.
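
The ratio computed in the cell above is a pooled ratio: all positive reviews of a genre divided by all its negative reviews, so titles with many reviews dominate the result. As a hedged illustration, the plain-Python sketch below compares it with the average of per-title ratios. The review counts are hypothetical and not taken from the dataset.

```python
# Hypothetical (positive, negative) review counts for three titles of one genre.
reviews = [(9000, 1000), (50, 50), (10, 40)]

# Pooled ratio, as in the cell above: sum of positives over sum of negatives.
pooled_ratio = sum(p for p, _ in reviews) / sum(n for _, n in reviews)

# Alternative: average each title's own ratio, so every title weighs the same.
mean_of_ratios = sum(p / n for p, n in reviews) / len(reviews)

print(f"pooled ratio:   {pooled_ratio:.2f}")
print(f"mean of ratios: {mean_of_ratios:.2f}")
```

With these toy numbers the single heavily reviewed title pulls the pooled ratio well above the per-title average, which is worth keeping in mind when comparing genres of very different sizes.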

### Do some publishers favor particular genres?

In [0]:
from pyspark.sql.window import Window

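# Aggregate over publisher
wPublisher  = Window.partitionBy("publisher")

# Keep publishers with more than 50 rows. The DataFrame is exploded by genre,
# so "games_count" counts (game, genre) rows rather than distinct games.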
result_df = steam_games_genres_df \
        .select(F.coalesce("publisher", "developer").alias("publisher"), "genre") \
        .withColumn("games_count", F.count("*").over(wPublisher)) \
        .filter(F.col("games_count") > 50) \
        .groupBy("publisher", "genre", "games_count") \
        .agg((F.count("*")/F.col("games_count")).alias("genre_ratio"), F.count("*").alias("genre_games_count")) \
        .orderBy(F.desc("genre_ratio"))

display(result_df)

publisher,genre,games_count,genre_ratio,genre_games_count
8floor,Casual,243,0.831275720164609,202
MangaGamer,Adventure,66,0.6818181818181818,45
Reforged Group,Indie,135,0.6518518518518519,88
Slitherine Ltd.,Strategy,162,0.6049382716049383,98
William at Oxford,Casual,88,0.5795454545454546,51
"Humongous Entertainment, Nightdive Studios",Casual,56,0.5714285714285714,32
Zoo Corporation,Casual,55,0.5454545454545454,30
MumboJumbo,Casual,63,0.5396825396825397,34
ImperiumGame,Indie,75,0.5333333333333333,40
familyplay,Casual,66,0.5303030303030303,35


#### Analysis

This analysis is not very conclusive. Some publishers do have a dominant genre (casual games for 8floor, with a genre ratio of 0.83).

Note also that many games belong to several genres.
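
Because the window count in the cell above runs on the genre-exploded DataFrame, `games_count` is a number of (game, genre) rows, and `genre_ratio` is a share of those rows. The plain-Python sketch below shows, on a hypothetical three-game catalog (not taken from the dataset), how that differs from a share of distinct games.

```python
from collections import Counter

# Hypothetical toy catalog: (game, publisher, genres).
games = [
    ("Game A", "Pub X", ["Casual", "Indie"]),
    ("Game B", "Pub X", ["Casual"]),
    ("Game C", "Pub X", ["Casual", "Adventure", "Indie"]),
]

# Equivalent of the explode: one (publisher, genre) row per genre of each game.
rows = [(publisher, genre) for _, publisher, genres in games for genre in genres]

genre_rows = Counter(genre for _, genre in rows)
total_rows = len(rows)       # what a count over the exploded rows measures
distinct_games = len(games)  # actual number of games

# Share of genre rows versus share of games that carry the genre.
casual_share_of_rows = genre_rows["Casual"] / total_rows
casual_share_of_games = genre_rows["Casual"] / distinct_games

print(f"rows: {total_rows}, games: {distinct_games}")
print(f"Casual: {casual_share_of_rows:.2f} of rows, {casual_share_of_games:.2f} of games")
```

In this toy case every game is tagged Casual, yet the row-based ratio is only one half, so the ratios in the table are best read as lower bounds on the share of a publisher's games in a genre.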



### Which genres are the most lucrative?

In [0]:
# We need the number of copies sold -> approximate it with the min/max of the "owners" range?
steam_games_genres_df.select("owners").distinct().show()

+--------------------+
|              owners|
+--------------------+
|  100,000 .. 200,000|
|1,000,000 .. 2,00...|
|20,000,000 .. 50,...|
|5,000,000 .. 10,0...|
|         0 .. 20,000|
|    20,000 .. 50,000|
|2,000,000 .. 5,00...|
|50,000,000 .. 100...|
|200,000,000 .. 50...|
|500,000 .. 1,000,000|
|10,000,000 .. 20,...|
|  200,000 .. 500,000|
|   50,000 .. 100,000|
+--------------------+



In [0]:
# Approximate the number of copies sold with the min/max of the "owners" range
result_df = steam_games_genres_df \
    .withColumn("price", F.col("price").cast(DoubleType()) / 100.0) \
    .withColumn("owners", F.split(F.col("owners"), r"\s*\.+\s*")) \
    .withColumn("min_owners", F.replace(F.col("owners").getItem(0), F.lit(","), F.lit("")).cast(IntegerType())) \
    .withColumn("max_owners", F.replace(F.col("owners").getItem(1), F.lit(","), F.lit("")).cast(IntegerType())) \
    .withColumn("min_revenue", F.round(F.col("min_owners") * F.col("price"), 2)) \
    .withColumn("max_revenue", F.round(F.col("max_owners") * F.col("price"), 2)) \
    .withColumn("industry_min_revenue", F.sum("min_revenue").over(Window.partitionBy()))

# display(result_df)

result_df = result_df \
    .groupBy("genre", "industry_min_revenue") \
    .agg(F.sum("min_revenue").alias("total_min_revenue"), F.sum("max_revenue").alias("total_max_revenue")) \
    .withColumn("ratio_over_industry", F.round(100 * F.col("total_min_revenue") / F.col("industry_min_revenue"), 2)) \
    .orderBy(F.col("total_min_revenue").desc()) \
    .limit(10)

display(result_df)


genre,industry_min_revenue,total_min_revenue,total_max_revenue,ratio_over_industry
Action,133873939200.0,35929270100.0,81583638100.0,26.84
Adventure,133873939200.0,22618906500.0,51872570400.0,16.9
Indie,133873939200.0,19134717000.0,45558437200.0,14.29
RPG,133873939200.0,16675128300.0,37671157900.0,12.46
Strategy,133873939200.0,12362392100.0,27937690000.0,9.23
Simulation,133873939200.0,11422233300.0,26117265800.0,8.53
Casual,133873939200.0,4476568500.0,11685344800.0,3.34
Massively Multiplayer,133873939200.0,3692721300.0,8167594200.0,2.76
Early Access,133873939200.0,3147066800.0,7770256100.0,2.35
Sports,133873939200.0,1816053400.0,4483741300.0,1.36


Databricks visualization. Run in Databricks to view.

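#### Analysis

Action games appear to capture a large share of revenue: about 27% of the estimated lower-bound revenue of the Steam catalog. This is a rough estimate (lower bound of the owners range multiplied by the current price), and a game with several genres is counted once per genre.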
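
The revenue estimate rests on turning an `owners` string such as `20,000 .. 50,000` into two integers and multiplying each bound by the price. Here is a minimal plain-Python sketch of that step, assuming the same separator pattern as the Spark cell. The helper names `parse_owners_range` and `revenue_bounds` and the 9.99 price are illustrative, not part of the notebook.

```python
import re
from typing import Tuple


def parse_owners_range(owners: str) -> Tuple[int, int]:
    """Split an owners string such as '20,000 .. 50,000' into (min, max) integers."""
    low, high = re.split(r"\s*\.+\s*", owners.strip())
    # Drop the thousands separators before converting to int.
    return int(low.replace(",", "")), int(high.replace(",", ""))


def revenue_bounds(owners: str, price: float) -> Tuple[float, float]:
    """Lower and upper revenue estimates: each owners bound times the current price."""
    low, high = parse_owners_range(owners)
    return round(low * price, 2), round(high * price, 2)


print(parse_owners_range("100,000 .. 200,000"))
print(revenue_bounds("20,000 .. 50,000", 9.99))  # hypothetical price
```

The gap between the two bounds is wide (the upper bound is often more than double the lower one), so rankings between genres are more trustworthy than the absolute amounts.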

## Analysis by platform

* Are most games available on Windows, Mac or Linux?
* Are some genres more often available on certain platforms?

In [0]:
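# Note: built from the genre-exploded DataFrame, so each game appears once per genre
# and "total_games_count" is a number of (game, genre) rows.
steam_games_platforms_df = steam_games_genres_df \
    .withColumn("total_games_count", F.count("*").over(Window.partitionBy()))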

### Are most games available on Windows, Mac or Linux?

In [0]:
result_df = steam_games_platforms_df \
    .groupBy("has_windows_support", "has_mac_support", "has_linux_support", "total_games_count") \
    .agg(F.count("*").alias("games_count")) \
    .withColumn("total_games_ratio", F.round(100 * F.col("games_count") / F.col("total_games_count"), 2)) \
    .orderBy(F.desc("has_windows_support"), F.desc("has_mac_support"), F.desc("has_linux_support"))

display(result_df.select("has_windows_support", "has_mac_support", "has_linux_support", "games_count", "total_games_ratio"))

has_windows_support,has_mac_support,has_linux_support,games_count,total_games_ratio
True,True,True,19388,12.33
True,True,False,16535,10.51
True,False,True,4636,2.95
True,False,False,116675,74.19
False,True,True,3,0.0
False,True,False,30,0.02
False,False,True,3,0.0


In [0]:
result_df = steam_games_platforms_df \
    .agg(
        F.sum(F.col("has_linux_support").cast(IntegerType())).alias("linux_count"),
        F.sum(F.col("has_mac_support").cast(IntegerType())).alias("mac_count"),
        F.sum(F.col("has_windows_support").cast(IntegerType())).alias("windows_count"),
        F.max(F.col("total_games_count")).alias("total_games_count")
    ) \
    .withColumn("linux_games_ratio", F.round(100 * F.col("linux_count") / F.col("total_games_count"), 2)) \
    .withColumn("mac_games_ratio", F.round(100 * F.col("mac_count") / F.col("total_games_count"), 2)) \
    .withColumn("windows_games_ratio", F.round(100 * F.col("windows_count") / F.col("total_games_count"), 2))

display(result_df)

linux_count,mac_count,windows_count,total_games_count,linux_games_ratio,mac_games_ratio,windows_games_ratio
24030,35956,157234,157270,15.28,22.86,99.98


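#### Analysis

- Almost 100% of the games available on Steam support Windows.
- 23% run on Mac.
- 15% support Linux.

These shares are computed on the genre-exploded DataFrame (157,270 rows for 55,690 games), so games with several genres weigh more than single-genre games.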
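
The per-platform totals can be cross-checked against the combination table from the previous cell, since each platform total is the sum of the combinations in which that platform is supported. The plain-Python sketch below does this with counts copied by hand from the two outputs above. The variable names are illustrative.

```python
# (windows, mac, linux) -> row count, copied from the combination table above.
combos = {
    (True, True, True): 19388,
    (True, True, False): 16535,
    (True, False, True): 4636,
    (True, False, False): 116675,
    (False, True, True): 3,
    (False, True, False): 30,
    (False, False, True): 3,
}

total = sum(combos.values())

# A platform total is the sum of every combination where its flag is True.
windows = sum(n for (w, _, _), n in combos.items() if w)
mac = sum(n for (_, m, _), n in combos.items() if m)
linux = sum(n for (_, _, l), n in combos.items() if l)

for name, count in [("windows", windows), ("mac", mac), ("linux", linux)]:
    print(f"{name}: {count} rows, {100 * count / total:.2f}%")
```

Both tables agree, which confirms that the two cells aggregate the same 157,270 rows.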


### Are some genres more often available on certain platforms?

In [0]:
result_df = steam_games_genres_df \
    .groupBy('genre') \
    .agg(
        F.sum(F.col('has_linux_support').cast('int')).alias('linux_count'),
        F.sum(F.col('has_mac_support').cast('int')).alias('mac_count'),
        F.sum(F.col('has_windows_support').cast('int')).alias('windows_count')
    )

display(result_df)

genre,linux_count,mac_count,windows_count
Education,19,56,317
Massively Multiplayer,164,270,1459
Sexual Content,7,13,54
Adventure,3302,5039,21427
Sports,287,506,2665
Accounting,0,4,16
Audio Production,7,41,193
Video Production,6,29,247
Animation & Modeling,38,74,322
Racing,304,424,2154


#### Analysis

In every genre shown, games are available on Windows first, then Mac, then Linux.