## 1. PySpark environment setup

In [1]:
import findspark
findspark.init()

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

## 2. Data sources and Spark data abstraction (DataFrame) setup

### A. Artworks Data Loading

In [2]:
artworksDF = spark.read \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .csv("moma_artworks.csv")

### B. Artists Data Loading

In [3]:
artistsDF = spark.read \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .csv("artists.csv")

## 3. Data set metadata analysis
### A. Artworks data schema and size

In [4]:
from IPython.display import display, Markdown

artworksDF.printSchema()
display(Markdown("This DataFrame of Artworks has **%d rows**." % artworksDF.count()))

root
 |-- Artwork ID: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Artist ID: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Medium: string (nullable = true)
 |-- Dimensions: string (nullable = true)
 |-- Acquisition Date: string (nullable = true)
 |-- Credit: string (nullable = true)
 |-- Catalogue: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Classification: string (nullable = true)
 |-- Object Number: string (nullable = true)
 |-- Diameter (cm): string (nullable = true)
 |-- Circumference (cm): string (nullable = true)
 |-- Height (cm): string (nullable = true)
 |-- Length (cm): string (nullable = true)
 |-- Width (cm): string (nullable = true)
 |-- Depth (cm): string (nullable = true)
 |-- Weight (kg): string (nullable = true)
 |-- Duration (s): string (nullable = true)



This DataFrame of Artworks has **142853 rows**.

### B. Get one or multiple random samples from the artworks dataset

In [5]:
artworksDF.cache() # optimization to make the processing faster
artworksDF.sample(False, 0.1).take(2) # 10% sampling

[Row(Artwork ID='7', Title='The Manhattan Transcripts Project, New York, New York, Episode 1: The Park', Artist ID='7056', Name='Bernard Tschumi', Date='1976-77', Medium='Gelatin silver photograph', Dimensions='"14 x 18"" (35.6 x 45.7 cm)"', Acquisition Date='1995-01-17', Credit='Purchase and partial gift of the architect in honor of Lily Auchincloss', Catalogue='Y', Department='Architecture & Design', Classification='Architecture', Object Number='3.1995.1', Diameter (cm)=None, Circumference (cm)=None, Height (cm)='35.6', Length (cm)=None, Width (cm)='45.7', Depth (cm)=None, Weight (kg)=None, Duration (s)=None),
 Row(Artwork ID='17', Title='The Manhattan Transcripts Project, New York, New York, Episode 1: The Park', Artist ID='7056', Name='Bernard Tschumi', Date='1976-77', Medium='Gelatin silver photograph', Dimensions='"14 x 18"" (35.6 x 45.7 cm)"', Acquisition Date='1995-01-17', Credit='Purchase and partial gift of the architect in honor of Lily Auchincloss', Catalogue='Y', Departmen
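
A note on the sampling call: `sample(False, 0.1)` keeps each row independently with probability 0.1 (sampling without replacement), so the sample holds only approximately 10% of the rows and differs between runs unless a seed is supplied. The plain-Python sketch below mimics that behaviour on a toy list; `bernoulli_sample` and its `seed` parameter are illustrative names, not part of this notebook or of Spark.

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))  # toy stand-in for DataFrame rows
sample_a = bernoulli_sample(rows, 0.1, seed=42)
sample_b = bernoulli_sample(rows, 0.1, seed=42)

# Same seed gives the same sample; the size is near, not exactly, 10%.
print(len(sample_a), sample_a == sample_b)
```

In PySpark, the analogous way to make the sample repeatable would be to pass a seed, for example `artworksDF.sample(False, 0.1, seed=42)`.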

### C. Artists data schema and size

In [6]:
from IPython.display import display, Markdown

artistsDF.printSchema()

display(Markdown("This DataFrame of Artists has **%d rows**." % artistsDF.count()))

root
 |-- Artist ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Birth Year: integer (nullable = true)
 |-- Death Year: integer (nullable = true)



This DataFrame of Artists has **15091 rows**.

### D. Get one or multiple random samples from the artists dataset

In [193]:
artistsDF.cache() # optimization to make the processing faster
artistsDF.sample(False, 0.1).take(2) # 10% sampling

[Row(Artist ID=3, Name='Bill Arnold', Nationality='American', Gender='Male', Birth Year=1941, Death Year=None),
 Row(Artist ID=33, Name='A.A.P.', Nationality='American', Gender=None, Birth Year=None, Death Year=None)]

### E. Data entities, metrics and dimensions

For our analysis the following elements are identified:

* **Entities:** Artwork (the main entity being measured, i.e. the facts) and Artist (a dimension that is also measured)
* **Metrics:** Date, Height, Weight, Width, Diameter, Acquisition Date, ...
* **Dimensions:** Artwork ID, Title, Department, Classification, Artist ID, Name, Nationality, Gender, ...

### F. Column categorization

The following could be a potential column categorization:

* **Artwork related columns:** *Date*, *Acquisition Date*, *Medium*, *Dimensions*, *Weight (kg)*, *Height (cm)*, *Diameter (cm)* and *Width (cm)*
* **Artist related columns:** *Nationality*, *Gender*, *Birth Year* and *Death Year*
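
One way to make this categorization reusable in the profiling cells is to keep it in a small lookup structure. The sketch below is only an illustration: `COLUMN_GROUPS` and `columns_for` are hypothetical names, and the column names are copied from the schemas printed in section 3.

```python
# Column groups, using the exact column names from the printed schemas.
COLUMN_GROUPS = {
    "artwork": [
        "Date", "Acquisition Date", "Medium", "Dimensions",
        "Weight (kg)", "Height (cm)", "Diameter (cm)", "Width (cm)",
    ],
    "artist": ["Nationality", "Gender", "Birth Year", "Death Year"],
}

def columns_for(group):
    """Return a copy of the column names belonging to a group."""
    return list(COLUMN_GROUPS[group])

print(columns_for("artist"))
```

A list returned by `columns_for` could then be passed to a call such as `artworksDF.select(...)` instead of retyping the names in each cell.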

## 4. Basic profiling of column groups to better understand the dataset
### A. Artwork related columns basic profiling

In [7]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit


print("Checking for nulls on Artwork ID, Date, Department, Classification, Medium, Acquisition Date:")
artworksDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Artwork ID", "Date", "Department", "Classification" \
                                                                     , "Medium", "Acquisition Date"]]).show()

print("Checking for nulls on Dimensions, Weight, Height, Width, Diameter:")
artworksDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Dimensions", "Weight (kg)", "Height (cm)", \
                                                                     "Width (cm)", "Diameter (cm)"]]).show()

Checking for nulls on Artwork ID, Date, Department, Classification, Medium, Acquisition Date:
+----------+----+----------+--------------+------+----------------+
|Artwork ID|Date|Department|Classification|Medium|Acquisition Date|
+----------+----+----------+--------------+------+----------------+
|         0|7696|     23175|         16965| 17292|           26526|
+----------+----+----------+--------------+------+----------------+

Checking for nulls on Dimensions, Weight, Height, Width, Diameter:
+----------+-----------+-----------+----------+-------------+
|Dimensions|Weight (kg)|Height (cm)|Width (cm)|Diameter (cm)|
+----------+-----------+-----------+----------+-------------+
|     17842|     142202|      42273|     42852|       140645|
+----------+-----------+-----------+----------+-------------+
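
The raw null counts are easier to compare as a share of the 142853 artwork rows reported in section 3.A. A minimal sketch of that arithmetic, using the counts printed above (the helper name `null_share` is illustrative):

```python
TOTAL_ROWS = 142853  # row count of artworksDF reported in section 3.A

# Null counts copied from the second table above.
null_counts = {
    "Dimensions": 17842,
    "Weight (kg)": 142202,
    "Height (cm)": 42273,
    "Width (cm)": 42852,
    "Diameter (cm)": 140645,
}

def null_share(nulls, total=TOTAL_ROWS):
    """Percentage of rows that are null, rounded to one decimal place."""
    return round(nulls / total * 100, 1)

for column, nulls in null_counts.items():
    print(f"{column}: {null_share(nulls)}% null")
```

*Weight (kg)* and *Diameter (cm)* are null for more than 98% of the rows, so any statistic on those columns describes only a small part of the collection.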



### B. Artist related columns basic profiling

In [205]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

print("Checking for nulls on Nationality, Name, Artist ID, Gender, Birth Year, Death Year:")
artistsDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["Nationality","Name",\
                                                                     "Birth Year", "Death Year","Artist ID","Gender"]]).show()

Checking for nulls on Nationality, Name, Artist ID, Gender, Birth Year, Death Year:
+-----------+----+----------+----------+---------+------+
|Nationality|Name|Birth Year|Death Year|Artist ID|Gender|
+-----------+----+----------+----------+---------+------+
|       2487|   0|      3854|     10512|        0|  3072|
+-----------+----+----------+----------+---------+------+



In [8]:
print("Checking number of distinct values in columns Artist Name and Nationality:")
artistsDF.select([countDistinct(c).alias(c) for c in ["Name", "Nationality"]]).show()

Checking number of distinct values in columns Artist Name and Nationality:
+-----+-----------+
| Name|Nationality|
+-----+-----------+
|15039|        126|
+-----+-----------+



### C. Modifying the datasets to make them better suited for analysis

In [9]:
from IPython.display import display, Markdown
from pyspark.sql.functions import *

# To distinguish the artist's name from the Name column of the artworks, the column is renamed as below:

artistsDF_renamed = artistsDF.withColumnRenamed("Name", "Artist Name")

artistsDF_renamed.printSchema()

display(Markdown("**Name** column of Artists DataFrame is renamed to **Artist Name**." ))

root
 |-- Artist ID: integer (nullable = true)
 |-- Artist Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Birth Year: integer (nullable = true)
 |-- Death Year: integer (nullable = true)



**Name** column of Artists DataFrame is renamed to **Artist Name**.

In [13]:
from IPython.display import display, Markdown
from pyspark.sql.functions import *

# The datasets are joined so that queries can use artwork and artist columns together

joinedDF = artworksDF.join(artistsDF_renamed, "Artist ID")

joinedDF.printSchema()

joinedDF.head()

display(Markdown("Datasets are combined by column **Artist ID**" ))

root
 |-- Artist ID: string (nullable = true)
 |-- Artwork ID: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Medium: string (nullable = true)
 |-- Dimensions: string (nullable = true)
 |-- Acquisition Date: string (nullable = true)
 |-- Credit: string (nullable = true)
 |-- Catalogue: string (nullable = true)
 |-- Department: string (nullable = true)
 |-- Classification: string (nullable = true)
 |-- Object Number: string (nullable = true)
 |-- Diameter (cm): string (nullable = true)
 |-- Circumference (cm): string (nullable = true)
 |-- Height (cm): string (nullable = true)
 |-- Length (cm): string (nullable = true)
 |-- Width (cm): string (nullable = true)
 |-- Depth (cm): string (nullable = true)
 |-- Weight (kg): string (nullable = true)
 |-- Duration (s): string (nullable = true)
 |-- Artist Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Gender: string (null

Datasets are combined by column **Artist ID**

### D. Further basic profiling to better understand the collection

In [14]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

nationsDF = artistsDF.groupBy("Nationality").agg(count(lit(1)).alias("Total"))

display(Markdown("**Top 10 Nationalities** of Artists Who Contributed to the Collection:" ))

nationsDF.where(col("Nationality")!="NA").orderBy(col("Total").desc()).show(10)

**Top 10 Nationalities** of Artists Who Contributed to the Collection:

+-------------------+-----+
|        Nationality|Total|
+-------------------+-----+
|           American| 5198|
|             German|  930|
|             French|  839|
|            British|  835|
|            Italian|  531|
|           Japanese|  498|
|              Swiss|  280|
|              Dutch|  265|
|Nationality unknown|  255|
|           Austrian|  243|
+-------------------+-----+
only showing top 10 rows



In [15]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

artist_piecesDF = joinedDF.groupBy("Artist Name", "Gender") \
                            .agg(count("*").alias("Number of Pieces"))

display(Markdown("The **Most Productive Artists** of MoMA with gender info:" ))

artist_piecesDF.where(~col("Artist Name") \
               .like("Unknown%")).orderBy(col("Number of Pieces").desc()).show(20)


The **Most Productive Artists** of MoMA with gender info:

+--------------------+------+----------------+
|         Artist Name|Gender|Number of Pieces|
+--------------------+------+----------------+
|        Eugène Atget|  Male|            5049|
|    Louise Bourgeois|Female|            3318|
|Ludwig Mies van d...|  Male|            2533|
|       Jean Dubuffet|  Male|            1434|
|     Lee Friedlander|  Male|            1313|
|       Pablo Picasso|  Male|            1292|
|        Marc Chagall|  Male|            1154|
|       Henri Matisse|  Male|            1061|
|      Pierre Bonnard|  Male|             894|
|         Lilly Reich|Female|             807|
|  Frank Lloyd Wright|  Male|             785|
|     George Maciunas|  Male|             758|
|       August Sander|  Male|             749|
|       Émile Bernard|  Male|             631|
|     Georges Rouault|  Male|             603|
|         Ben Kinmont|  Male|             597|
|    Aristide Maillol|  Male|             579|
|        André Derain|  Male|             573|
|          So

In [16]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

display(Markdown("**The richest departments in terms of Art Pieces Type** of MoMA:" ))

joinedDF.where(col("Department")!="NA").groupBy("Department") \
          .agg(count("*").alias("Number of Pieces")).orderBy(col("Number of Pieces").desc()).show(8)


**The richest departments in terms of Art Pieces Type** of MoMA:

+--------------------+----------------+
|          Department|Number of Pieces|
+--------------------+----------------+
|Prints & Illustra...|           51803|
|         Photography|           26048|
|Architecture & De...|           13796|
|            Drawings|            9966|
|Painting & Sculpture|            3537|
|                Film|            2314|
|Media and Perform...|            2100|
|   Fluxus Collection|            1415|
+--------------------+----------------+
only showing top 8 rows



In [232]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

display(Markdown("This table shows **the richest collections in terms of Art Pieces Type** of MoMA:" ))

joinedDF.where(col("Classification")!="NA").groupBy("Classification") \
          .agg(count("*").alias("Number of Pieces")).orderBy(col("Number of Pieces").desc()).show(8)


This table shows **the richest collections in terms of Art Pieces Type** of MoMA:

+--------------------+----------------+
|      Classification|Number of Pieces|
+--------------------+----------------+
|          Photograph|           26569|
|               Print|           25617|
|    Illustrated Book|           23722|
|             Drawing|           10575|
|              Design|            9044|
|            Painting|            2187|
|        Architecture|            2106|
|Mies van der Rohe...|            1946|
+--------------------+----------------+
only showing top 8 rows



## 5. Answering some business questions to help establish the new exhibition

### A. Number of Pieces for Highlighted Artists of the Departments, with Nationality and Gender

In [21]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

joinedDF.where(col("Department")\
          .like("Prints & Ill%") | col("Department")\
          .like("Painting%")).groupBy("Artist Name", "Nationality", "Gender") \
          .agg(count("*").alias("Total")).orderBy(col("Total").desc()).limit(20).show()


+-------------------+-----------+------+-----+
|        Artist Name|Nationality|Gender|Total|
+-------------------+-----------+------+-----+
|   Louise Bourgeois|   American|Female| 3234|
|      Jean Dubuffet|     French|  Male| 1209|
|      Pablo Picasso|    Spanish|  Male| 1173|
|       Marc Chagall|     French|  Male| 1048|
|      Henri Matisse|     French|  Male|  994|
|     Pierre Bonnard|     French|  Male|  874|
|      Émile Bernard|     French|  Male|  630|
|        Ben Kinmont|   American|  Male|  587|
|   Aristide Maillol|     French|  Male|  563|
|       André Derain|     French|  Male|  554|
|    Georges Rouault|     French|  Male|  544|
|         Raoul Dufy|     French|  Male|  522|
|         Sol LeWitt|   American|  Male|  494|
|      Maurice Denis|     French|  Male|  493|
|          Joan Miró|    Spanish|  Male|  393|
|    George Maciunas|   American|  Male|  386|
|  Pierre Alechinsky|    Belgian|  Male|  356|
|      Thomas Bewick|    British|  Male|  323|
|       Jaspe

### B. How paintings from the Modern Era are distributed by 'Artistic Style of Period':

In [19]:
# In order to select 'paintings', relevant values are filtered from 'Department' and 'Classification'
# A subset is created to facilitate analysis

plasticArtsDF = artworksDF.filter((col("Department")=="Painting & Sculpture") & (col("Classification")=="Painting")\
                                  | (col("Department").like("%Illustrat%")) & (col("Classification")=="Painting"))

In [22]:
from pyspark.sql.functions import count, round

# From invaluable.com/blog/art-history-timeline/ modern artworks are categorized as follows:
#   "Impressionism"      - 1860s-1900s
#   "Fauvism"            - 1900s-1920s
# ...
# This categorization is made based on years and artists of the era

plasticArtsStylesDF = plasticArtsDF\
   .withColumn("Style of Period", when((col("Date").like("%186%") | col("Date").like("%187%") | col("Date").like("%188%")\
                                  | col("Date").like("%189%") | col("Date").like("%190%")) & (col("Name").like("%Monet%")\
                                  | col("Name").like("%Manet%") | col("Name").like("%Degas%")| col("Name").like("%Renoir%")\
                                  | col("Name").like("%zanne%") | col("Name").like("%Seurat%") | col("Name").like("%Utrillo%")\
                                  | col("Name").like("%Gauguin%")| col("Name").like("%Signac%")| col("Name").like("%Gogh%"))\
                                  , "Impressionism")\
                                  .when((col("Date").like("%190%") | col("Date").like("%191%") | col("Date").like("%192%"))\
                                  & (col("Name").like("%Matisse%") | col("Name").like("%Derain%") | col("Name")\
                                  .like("%Vlaminck%") | col("Name").like("%Dufy%") | col("Name").like("%Rouault%")\
                                  | col("Name").like("%Metzinger%") | col("Name").like("%Dongen%")),"Fauvism")\
                                  .when((col("Date").like("%190%") | col("Date").like("%191%") | col("Date").like("%192%"))\
                                  & (col("Name").like("%Gogh%") | col("Name").like("%Munch%") | col("Name").like("%Kandinsky%")\
                                  | col("Name").like("%Schiele%") | col("Name").like("%Bacon%") | col("Name").like("%Kirchner%")\
                                  | col("Name").like("%Marc%") | col("Name").like("%Klee%") | col("Name").like("%Nolde%")\
                                  | col("Name").like("%Freud%") | col("Name").like("%Chagall%") | col("Name").like("%Macke%")\
                                  | col("Name").like("%Kokoschka%") | col("Name").like("%Soutine%") | col("Name")\
                                  .like("%Giacometti%") | col("Name").like("%Buffet%") | col("Name").like("%Soutine%"))\
                                  ,"Expressionism")\
                                  .when((col("Date").like("%190%") | col("Date").like("%191%")) & (col("Name").like("%Picasso%")\
                                  | col("Name").like("%Braque%") | col("Name").like("%Gris%") | col("Name").like("%Gleizes%")\
                                  | col("Name").like("%Popova%") | col("Name").like("%Fernand L%") | col("Name")\
                                  .like("%Goncharova%") | col("Name").like("%Laurens%")),"Cubism")\
                                  .when((col("Date").like("%191%") | col("Date").like("%192%") | col("Date").like("%193%")\
                                  | col("Date").like("%194%") | col("Date").like("%195%")) & (col("Name").like("%Salvador D%")\
                                  | col("Name").like("%Magrit%") | col("Name").like("%Ernst%")| col("Name").like("%Tanguy%")\
                                  | col("Name").like("%Chirico%") | col("Name").like("%Kahlo%") | col("Name").like("%Joan M%")\
                                  | col("Name").like("%Duchamp%") | col("Name").like("%Man R%")),"Surrealism")\
                                  .when((col("Date").like("%194%") | col("Date").like("%195%")) & (col("Name").like("%Pollock%")\
                                  | col("Name").like("%Rothko%")  | col("Name").like("%Kline%") | col("Name").like("%Still%")\
                                  | col("Name").like("%Motherwell%") | col("Name").like("%Franken%") | col("Name")\
                                  .like("%Mitchell%") | col("Name").like("%Hofmann%") | col("Name").like("%Rauschen%")\
                                  | col("Name").like("%Newman%") | col("Name").like("%Nakamura%")), "Abstract Expressionism")\
                                  .when((col("Date").like("%195%") | col("Date").like("%196%")) & (col("Name").like("%Warhol%")\
                                  | col("Name").like("%Licht%") | col("Name").like("%Franken%") | col("Name").like("%Hockney%")\
                                  | col("Name").like("%Thiebaud%") | col("Name").like("%Johns%") | col("Name").like("%Indian%")\
                                  | col("Name").like("%Kusam%") | col("Name").like("%Hamilto%") | col("Name").like("%Murakam%")\
                                  ),"Pop Art")\
                                  .when((col("Date").like("%196%") | col("Date").like("%197%")) & (col("Name").like("%Boetti%")\
                                  | col("Name").like("%Pistol%") | col("Name").like("%Merz%") | col("Name").like("%Burri%")\
                                  | col("Name").like("%Manzoni%") | col("Name").like("%Fontana%")), "Arte Povera")\
                                  .when((col("Date").like("%196%") | col("Date").like("%197%")) & (col("Name").like("%Tony S%")\
                                  | col("Name").like("%Kelly%") | col("Name").like("%Judd%") | col("Name").like("%Stella%")\
                                  | col("Name").like("%Laughin%") | col("Name").like("%Hesse%") | col("Name").like("%Mangol%"))\
                                  ,"Minimalism")\
                                  .when((col("Date").isNull() | col("Date").like("%196%") | col("Date").like("%197%"))\
                                  & (col("Name").like("%Ducham%") | col("Name").like("%Klein%") | col("Name").like("%Kosuth%")\
                                  | col("Name").like("%Kawara%") | col("Name").like("%Baldessari%")\
                                  | col("Name").like("%Yoko Ono%") | col("Name").like("%Weiner%")),"Conceptual Art")\
                                  .when(col("Date").like("%197%") | col("Date").like("%198%") | col("Date").like("%199%")\
                                  ,"Contemporary Art").otherwise("others"))
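
The chained `when` calls above are evaluated in order, and the first matching condition wins. A data-driven version of the same idea can be easier to audit. The plain-Python sketch below covers only three of the styles plus the date-only fallback, and treats every pattern as a substring test in the manner of `like("%x%")`. `STYLE_RULES`, `contains_any` and `classify_style` are hypothetical names, and the rule table is a deliberately abbreviated subset of the conditions in the cell above.

```python
# Ordered rules: (style, date fragments, artist-name fragments).
# Abbreviated subset of the notebook's conditions, for illustration only.
STYLE_RULES = [
    ("Impressionism", ["186", "187", "188", "189", "190"], ["Monet", "Degas", "Gogh"]),
    ("Fauvism", ["190", "191", "192"], ["Matisse", "Derain", "Dufy"]),
    ("Cubism", ["190", "191"], ["Picasso", "Braque", "Gris"]),
]
CONTEMPORARY_DATES = ["197", "198", "199"]

def contains_any(text, fragments):
    """Substring test; None never matches, as a SQL NULL never satisfies LIKE."""
    return text is not None and any(fragment in text for fragment in fragments)

def classify_style(date, name):
    """Return the first style whose date and name rules both match."""
    for style, dates, names in STYLE_RULES:
        if contains_any(date, dates) and contains_any(name, names):
            return style
    if contains_any(date, CONTEMPORARY_DATES):
        return "Contemporary Art"
    return "others"

print(classify_style("1905", "Henri Matisse"))
print(classify_style("1912", "Pablo Picasso"))
```

Because the rules are ordered, an artist who appears in two lists is assigned to whichever style comes first for a matching decade, which is the same behaviour as the chained `when` expression.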


In [23]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit
from pyspark.sql.functions import max, min, avg, stddev

# Now compute the ratio per "Style of Period" for paintings in the MoMA collection

totalPlasticArtwork = plasticArtsStylesDF.count()

display(Markdown("**Ratio of 'Paintings' per Style of the Period:**"))
plasticArtsStylesDF.where(col("Style of Period")!="others")\
                   .groupBy("Style of Period").agg(count("Style of Period").alias("Number of Artwork")\
                   ,(count("Style of Period")/totalPlasticArtwork*100).alias("Ratio"))\
                   .orderBy(col("Number of Artwork").desc())\
                   .select("Style of Period","Number of Artwork",round("Ratio",1).alias("Ratio")).show()


**Ratio of 'Paintings' per Style of the Period:**

+--------------------+-----------------+-----+
|     Style of Period|Number of Artwork|Ratio|
+--------------------+-----------------+-----+
|    Contemporary Art|              387| 17.9|
|          Surrealism|               68|  3.1|
|              Cubism|               52|  2.4|
|Abstract Expressi...|               52|  2.4|
|       Expressionism|               43|  2.0|
|             Fauvism|               40|  1.8|
|      Conceptual Art|               37|  1.7|
|          Minimalism|               33|  1.5|
|             Pop Art|               32|  1.5|
|       Impressionism|               20|  0.9|
|         Arte Povera|                6|  0.3|
+--------------------+-----------------+-----+
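
Because `totalPlasticArtwork` counts every painting in the subset, including those labelled "others", the ratios above are shares of the whole subset and do not add up to 100. A quick check on the printed figures (the variable names are illustrative):

```python
# Counts and ratios copied from the table above.
style_counts = {
    "Contemporary Art": 387, "Surrealism": 68, "Cubism": 52,
    "Abstract Expressionism": 52, "Expressionism": 43, "Fauvism": 40,
    "Conceptual Art": 37, "Minimalism": 33, "Pop Art": 32,
    "Impressionism": 20, "Arte Povera": 6,
}
style_ratios = {
    "Contemporary Art": 17.9, "Surrealism": 3.1, "Cubism": 2.4,
    "Abstract Expressionism": 2.4, "Expressionism": 2.0, "Fauvism": 1.8,
    "Conceptual Art": 1.7, "Minimalism": 1.5, "Pop Art": 1.5,
    "Impressionism": 0.9, "Arte Povera": 0.3,
}

classified = sum(style_counts.values())
ratio_sum = round(sum(style_ratios.values()), 1)
print(classified, ratio_sum)
```

The eleven named styles cover 770 paintings, roughly 35.5% of the subset; the remaining paintings fall under "others".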



### C. Dimensions of the paintings, for spatial design and wall sizing:

In [24]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit
from pyspark.sql.functions import max, min, avg, stddev
from pyspark.sql.types import IntegerType

# Now compute the dimensions of paintings per "Style of Period"

print("Summary of Columns: Height and Width")
plasticArtsStylesDF.where(~(col("Height (cm)").isNull() | col("Width (cm)").isNull()))\
.select("Height (cm)","Width (cm)").summary().show()


display(Markdown("**Information about the Dimensions of Paintings:** per 'Style of Period'"))
plasticArtsStylesDF.where(~(col("Height (cm)").isNull() | col("Width (cm)").isNull()))\
                   .groupBy("Style of Period")\
                   .agg(round(avg("Height (cm)"),3).alias("AvgHeight"), round(max("Height (cm)"),3).alias("MaxHeight")\
                   ,round(avg("Width (cm)"),3).alias("AvgWidth"), round(max("Width (cm)"),3).alias("MaxWidth"))\
                   .orderBy("Style of Period").show()

Summary of Columns: Height and Width
+-------+-----------------+------------------+
|summary|      Height (cm)|        Width (cm)|
+-------+-----------------+------------------+
|  count|             2141|              2141|
|   mean|123.9236047366331|131.51749151271736|
| stddev|79.21204045608808| 151.3728718095856|
|    min|                0|                 0|
|    25%|          61.2776|              61.0|
|    50%|            104.5|              99.7|
|    75%|            181.9|             170.0|
|    max|             99.9|              99.8|
+-------+-----------------+------------------+



**Information about the Dimensions of Paintings:** per 'Style of Period'

+--------------------+---------+---------+--------+--------+
|     Style of Period|AvgHeight|MaxHeight|AvgWidth|MaxWidth|
+--------------------+---------+---------+--------+--------+
|Abstract Expressi...|  165.687|     86.1| 158.919|    96.9|
|         Arte Povera|   161.35|     41.0| 205.017|    80.3|
|      Conceptual Art|   32.351|    55.88|  33.045|    62.5|
|    Contemporary Art|  173.671|     99.2| 201.994|    99.7|
|              Cubism|   93.863|     96.8|  73.437|    99.7|
|       Expressionism|   89.865|     98.9|  86.223|    99.7|
|             Fauvism|  102.575|     99.3|  99.912|    99.1|
|       Impressionism|    75.27|     91.8|  76.942|    99.4|
|          Minimalism|  138.224|     76.2| 169.867|    95.6|
|             Pop Art|  169.611|     86.5| 152.279|    91.9|
|          Surrealism|   89.418|     96.5|  95.001|    99.7|
|              others|  114.732|     99.9| 119.056|    99.8|
+--------------------+---------+---------+--------+--------+
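
In the outputs above, the maxima are smaller than the 75th percentile and than several of the averages (for example, a maximum height of 99.9 against a mean of about 123.9). This is consistent with *Height (cm)* and *Width (cm)* being string columns in the schema: `avg` operates on the numeric value, whereas `max` and `min` on strings compare character by character. The sketch below reproduces the effect in plain Python with made-up values. In PySpark, casting first, for example `col("Height (cm)").cast("double")`, is one way to obtain numeric extremes.

```python
heights = ["99.9", "181.9", "104.5", "61.0"]  # made-up string values

lexicographic_max = max(heights)              # compares characters: "9" > "1"
numeric_max = max(float(h) for h in heights)  # compares numbers

print(lexicographic_max, numeric_max)
```

The string comparison picks "99.9" because its first character sorts after the "1" of "181.9", even though 181.9 is the larger number.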



### D. Weight info, crucial for organizing the logistics of the paintings to be picked

In [243]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit
from pyspark.sql.functions import max, min, avg, stddev

display(Markdown("**Information about the Weight of 'Paintings':**"))
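# Cast to double so that min/max compare numbers rather than strings
weightCol = col("Weight (kg)").cast("double")
plasticArtsStylesDF.agg(max(weightCol).alias("MaxWeight (kg)"), min(weightCol).alias("MinWeight (kg)")).show()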

**Information about the Weight of 'Paintings':**

+--------------+--------------+
|MaxWeight (kg)|MinWeight (kg)|
+--------------+--------------+
|  112.00007334|       18.0975|
+--------------+--------------+

