In [63]:
spark

# Tabular Data

In [64]:
my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
    ["Orange", 5, 2.50]
]

In [65]:
df_grocery_list = spark.createDataFrame(my_grocery_list, ["Item", "Quantity", "Price"])
df_grocery_list.printSchema()


root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)



In [66]:
col_name = ["Item", "Quantity", "Price"]
df_grocery_list = spark.createDataFrame(my_grocery_list, col_name)

In [67]:
df_grocery_list.show()

+------+--------+-----+
|  Item|Quantity|Price|
+------+--------+-----+
|Banana|       2| 1.74|
| Apple|       4| 2.04|
|Carrot|       1| 1.09|
|  Cake|       1|10.99|
|Orange|       5|  2.5|
+------+--------+-----+



                                                                                

In [68]:
df_grocery_list.printSchema()

root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)



<img src="../img/img8.png" alt="img8">

### Reading the dataset (`Canadian Radio-Television and Telecommunications Commission`)

<img src="../img/img9.png" alt="img9">

> **NOTE:**
>- `CSV` files are relatively easy to process
>- PySpark provides a whopping 25 optional parameters when ingesting a `CSV` file
>- Compare this to the two for reading `text` data

In [69]:
import os

data_set = 's3://fcc-spark-example/dataset/broadcast_logs'

# The file is pipe-delimited, has a header row, and stores dates as yyyy-MM-dd.
logs_df = spark.read.csv(os.path.join(data_set, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
                        sep="|",
                        header=True,
                        inferSchema=True,
                        timestampFormat="yyyy-MM-dd",
                        )

                                                                                

In [70]:
logs_df.show()

+--------------+------------+-------------------+----------+-------------------+----------------------+----------+---------------+-----------------+----------------+---------------+------------------+--------------+--------------------+------------+----------------+----------------+-------------------+------------+--------------------+----------------+--------+--------------------+------------------+----------------------+-------------+---------+---------+---------+---------+
|BroadcastLogID|LogServiceID|            LogDate|SequenceNO|AudienceTargetAgeID|AudienceTargetEthnicID|CategoryID|ClosedCaptionID|CountryOfOriginID|DubDramaCreditID|EthnicProgramID|ProductionSourceID|ProgramClassID|FilmClassificationID|ExhibitionID|        Duration|         EndTime|       LogEntryDate|ProductionNO|        ProgramTitle|       StartTime|Subtitle|NetworkAffiliationID|SpecialAttentionID|BroadcastOriginPointID|CompositionID|Producer1|Producer2|Language1|Language2|
+--------------+------------+---------

> **NOTE:** The optional `inferSchema` parameter forces PySpark to go over the ingested data twice:
> 1. once to set the type of each column, and
> 2. once to ingest the data.
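
To see why inference costs an extra pass, here is a toy sketch in plain Python. It is not Spark's implementation: the sample rows and the `infer_type` and `read_rows` helpers are hypothetical, and the only types considered are `int` and `str`. The first loop scans every value to pick a type per column, and only then can a second read cast the rows.

```python
import csv
import io

# Hypothetical pipe-delimited sample, loosely shaped like the broadcast logs.
raw = "LogServiceID|Duration\n3157|02:00:00.0000000\n3157|00:00:30.0000000\n"


def infer_type(values):
    """Return int if every value parses as an integer, otherwise str."""
    try:
        for value in values:
            int(value)
        return int
    except ValueError:
        return str


def read_rows(text):
    """Parse the pipe-delimited text into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


# Pass 1: scan the data only to decide each column's type.
first_pass = read_rows(raw)
schema = {name: infer_type(row[name] for row in first_pass) for name in first_pass[0]}

# Pass 2: read the data again, this time casting with the inferred schema.
typed_rows = [
    {name: schema[name](value) for name, value in row.items()}
    for row in read_rows(raw)
]

print(schema)
print(typed_rows[0])
```

If the column types are known up front, supplying them through the reader's `schema` argument is the usual way to skip the inference pass.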

In [71]:
logs_df.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string 

### Data Normalization and Denormalization

<img src="../img/img10.png" alt="img10">

In Spark’s universe, we often prefer working with a `single table` instead of linking a
multitude of tables to get the data. We call these `denormalized` tables, or, colloquially,
fat tables. We start by assessing the data directly available in the logs table before plumping our table. 
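
As a rough illustration of what "plumping" means, the sketch below denormalizes two tiny tables in plain Python. Everything in it is hypothetical: the `service_names` lookup, the made-up channel name, and the `denormalize` helper stand in for the link tables that accompany the logs.

```python
# Hypothetical normalized data: the log rows hold only an ID, and the
# human-readable name lives in a separate lookup table.
log_rows = [
    {"LogServiceID": 3157, "Duration": "02:00:00.0000000"},
    {"LogServiceID": 3157, "Duration": "00:00:30.0000000"},
]
service_names = {3157: "Example Channel"}  # made-up name, for illustration only


def denormalize(rows, lookup):
    """Copy the looked-up name onto every row, producing one 'fat' table."""
    return [
        {**row, "ServiceName": lookup.get(row["LogServiceID"])}
        for row in rows
    ]


fat_rows = denormalize(log_rows, service_names)
print(fat_rows[0])
```

The fat table repeats the name on every row, trading storage for the convenience of not having to link tables at query time.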

## Basics of data manipulation

Basic operations:

- select
- delete
- rename
- reorder
- create columns

### `select()`

In [72]:
logs_df.select("BroadcastLogID", "LogServiceID", "LogDate") \
       .show(5, False)

+--------------+------------+-------------------+
|BroadcastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157        |2018-08-01 00:00:00|
|1196192318    |3157        |2018-08-01 00:00:00|
|1196192319    |3157        |2018-08-01 00:00:00|
|1196192320    |3157        |2018-08-01 00:00:00|
+--------------+------------+-------------------+
only showing top 5 rows



In [73]:
import pyspark.sql.functions as F 

logs_df.select(*[F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")]) \
       .show(5, False)

+--------------+------------+-------------------+
|BroadCastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157        |2018-08-01 00:00:00|
|1196192318    |3157        |2018-08-01 00:00:00|
|1196192319    |3157        |2018-08-01 00:00:00|
|1196192320    |3157        |2018-08-01 00:00:00|
+--------------+------------+-------------------+
only showing top 5 rows



In [74]:
logs_df.select(F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")) \
       .show(5, False)

+--------------+------------+-------------------+
|BroadCastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157        |2018-08-01 00:00:00|
|1196192318    |3157        |2018-08-01 00:00:00|
|1196192319    |3157        |2018-08-01 00:00:00|
|1196192320    |3157        |2018-08-01 00:00:00|
+--------------+------------+-------------------+
only showing top 5 rows



In [75]:
logs_df.select("*").show(5, False)

+--------------+------------+-------------------+----------+-------------------+----------------------+----------+---------------+-----------------+----------------+---------------+------------------+--------------+--------------------+------------+----------------+----------------+-------------------+------------+-------------------------------------------+----------------+--------+--------------------+------------------+----------------------+-------------+---------+---------+---------+---------+
|BroadcastLogID|LogServiceID|LogDate            |SequenceNO|AudienceTargetAgeID|AudienceTargetEthnicID|CategoryID|ClosedCaptionID|CountryOfOriginID|DubDramaCreditID|EthnicProgramID|ProductionSourceID|ProgramClassID|FilmClassificationID|ExhibitionID|Duration        |EndTime         |LogEntryDate       |ProductionNO|ProgramTitle                               |StartTime       |Subtitle|NetworkAffiliationID|SpecialAttentionID|BroadcastOriginPointID|CompositionID|Producer1|Producer2|Language1|Lan

In [76]:
import numpy as np 

column_split = np.array_split(np.array(logs_df.columns), len(logs_df.columns) // 3)

In [77]:
arr = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8']
cols = np.array_split(np.array(arr), len(arr) // 3)

cols

[array(['col1', 'col2', 'col3', 'col4'], dtype='<U4'),
 array(['col5', 'col6', 'col7', 'col8'], dtype='<U4')]
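
Note what the output above shows: `np.array_split(arr, n)` splits into `n` nearly equal sections, so `len(arr) // 3` sections yields groups of at least three names, not exactly three. Eight names become two groups of four. If strict groups of three are wanted, a plain-Python sketch (the `chunk` helper is hypothetical) could look like this:

```python
def chunk(names, size):
    """Split names into consecutive groups of at most `size` items."""
    return [names[i:i + size] for i in range(0, len(names), size)]


arr = ["col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8"]
groups = chunk(arr, 3)
print(groups)
# [['col1', 'col2', 'col3'], ['col4', 'col5', 'col6'], ['col7', 'col8']]
```

Either approach works for displaying a wide data frame a few columns at a time; the difference only shows when the column count is not a multiple of three.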

In [78]:
column_split = np.array_split(np.array(logs_df.columns), len(logs_df.columns) // 3)

In [79]:
column_split

[array(['BroadcastLogID', 'LogServiceID', 'LogDate'], dtype='<U22'),
 array(['SequenceNO', 'AudienceTargetAgeID', 'AudienceTargetEthnicID'],
       dtype='<U22'),
 array(['CategoryID', 'ClosedCaptionID', 'CountryOfOriginID'], dtype='<U22'),
 array(['DubDramaCreditID', 'EthnicProgramID', 'ProductionSourceID'],
       dtype='<U22'),
 array(['ProgramClassID', 'FilmClassificationID', 'ExhibitionID'],
       dtype='<U22'),
 array(['Duration', 'EndTime', 'LogEntryDate'], dtype='<U22'),
 array(['ProductionNO', 'ProgramTitle', 'StartTime'], dtype='<U22'),
 array(['Subtitle', 'NetworkAffiliationID', 'SpecialAttentionID'],
       dtype='<U22'),
 array(['BroadcastOriginPointID', 'CompositionID', 'Producer1'],
       dtype='<U22'),
 array(['Producer2', 'Language1', 'Language2'], dtype='<U22')]

In [80]:
for x in column_split:
    logs_df.select(*x).show(5, False)

+--------------+------------+-------------------+
|BroadcastLogID|LogServiceID|LogDate            |
+--------------+------------+-------------------+
|1196192316    |3157        |2018-08-01 00:00:00|
|1196192317    |3157        |2018-08-01 00:00:00|
|1196192318    |3157        |2018-08-01 00:00:00|
|1196192319    |3157        |2018-08-01 00:00:00|
|1196192320    |3157        |2018-08-01 00:00:00|
+--------------+------------+-------------------+
only showing top 5 rows

+----------+-------------------+----------------------+
|SequenceNO|AudienceTargetAgeID|AudienceTargetEthnicID|
+----------+-------------------+----------------------+
|1         |4                  |null                  |
|2         |null               |null                  |
|3         |null               |null                  |
|4         |null               |null                  |
|5         |null               |null                  |
+----------+-------------------+----------------------+
only showing top 5 ro

### `DROP()`

In [81]:
logs_df = logs_df.drop("BroadcastLogID", "SequenceNO")

In [82]:
"BroadcastLogID" in logs_df.columns

False

In [83]:
logs_df.drop("Col_doesnt_exist")

DataFrame[LogServiceID: int, LogDate: timestamp, AudienceTargetAgeID: int, AudienceTargetEthnicID: int, CategoryID: int, ClosedCaptionID: int, CountryOfOriginID: int, DubDramaCreditID: int, EthnicProgramID: int, ProductionSourceID: int, ProgramClassID: int, FilmClassificationID: int, ExhibitionID: int, Duration: string, EndTime: string, LogEntryDate: timestamp, ProductionNO: string, ProgramTitle: string, StartTime: string, Subtitle: string, NetworkAffiliationID: int, SpecialAttentionID: int, BroadcastOriginPointID: int, CompositionID: int, Producer1: string, Producer2: string, Language1: int, Language2: int]

> **NOTE:** Unlike `select()`, where selecting a column that doesn’t exist
raises a runtime error, dropping a nonexistent column is a no-op. PySpark
simply ignores the columns it doesn’t find. Be careful with the spelling of
your column names!
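
Because a misspelled name is dropped silently, it can help to check the names first. The helper below is a hypothetical sketch that works on a plain list such as `logs_df.columns`; it compares names case-insensitively, on the assumption that Spark's default case-insensitive column resolution is in effect.

```python
def missing_columns(existing, requested):
    """Return the requested names that match no existing column.

    Matching ignores case, on the assumption that Spark's default
    case-insensitive column resolution is in effect.
    """
    known = {name.lower() for name in existing}
    return [name for name in requested if name.lower() not in known]


columns = ["LogServiceID", "LogDate", "Duration"]  # stand-in for logs_df.columns
not_found = missing_columns(columns, ["logdate", "Col_doesnt_exist"])
print(not_found)
# ['Col_doesnt_exist']
```

A non-empty result is a hint that a `drop()` call would quietly do less than intended.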

### `withColumn()`

In [84]:
logs_df.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttenti

In [85]:
logs_df.select(F.col('Duration')) \
       .show(5)

+----------------+
|        Duration|
+----------------+
|02:00:00.0000000|
|00:00:30.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
+----------------+
only showing top 5 rows



Let's say we want to break the `Duration` column into three columns:

- `duration_hours`
- `duration_minutes`
- `duration_seconds`

In [86]:
logs_df.select(F.col('Duration'),
               F.col('Duration').substr(1, 2).cast('int').alias('duration_hours'),
               F.col('Duration').substr(4, 2).cast('int').alias('duration_minutes'),
               F.col('Duration').substr(7, 2).cast('int').alias('duration_seconds')
              ) \
       .distinct() \
       .show(5)



+----------------+--------------+----------------+----------------+
|        Duration|duration_hours|duration_minutes|duration_seconds|
+----------------+--------------+----------------+----------------+
|00:04:52.0000000|             0|               4|              52|
|00:09:52.0000000|             0|               9|              52|
|01:34:00.0000000|             1|              34|               0|
|01:59:57.0000000|             1|              59|              57|
|00:38:10.0000000|             0|              38|              10|
+----------------+--------------+----------------+----------------+
only showing top 5 rows



                                                                                

In [87]:
# Let's combine it all together

logs_df.select(F.col('Duration'),
               (F.col('Duration').substr(1, 2).cast('int') * 60 * 60 +
               F.col('Duration').substr(4, 2).cast('int') * 60 +
               F.col('Duration').substr(7, 2).cast('int')).alias('duration_seconds')
              ) \
        .distinct() \
        .show(5)

+----------------+----------------+
|        Duration|duration_seconds|
+----------------+----------------+
|00:28:08.0000000|            1688|
|00:32:00.0000000|            1920|
|00:30:00.0000000|            1800|
|00:01:39.0000000|              99|
|00:29:50.0000000|            1790|
+----------------+----------------+
only showing top 5 rows
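
The positions passed to `substr()` are 1-based, which is easy to trip over when coming from Python slices. The plain-Python sketch below mirrors the same arithmetic on a single string; the `substr` and `duration_to_seconds` helpers are hypothetical stand-ins, not PySpark functions.

```python
def substr(text, pos, length):
    """Mimic a 1-based substring: pos=1 is the first character."""
    return text[pos - 1:pos - 1 + length]


def duration_to_seconds(duration):
    """Convert 'HH:MM:SS.fffffff' to whole seconds, ignoring the fraction."""
    hours = int(substr(duration, 1, 2))
    minutes = int(substr(duration, 4, 2))
    seconds = int(substr(duration, 7, 2))
    return hours * 60 * 60 + minutes * 60 + seconds


print(duration_to_seconds("00:28:08.0000000"))  # 1688
print(duration_to_seconds("01:59:57.0000000"))  # 7197
```

The first result matches the `00:28:08.0000000` row in the output above, which is a quick sanity check on the column expression.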



In [88]:
# Let's add this as a separate column
logs_df = logs_df.withColumn(
                    'Duration_seconds', 
                    (F.col('Duration').substr(1, 2).cast('int') * 60 * 60 +
                       F.col('Duration').substr(4, 2).cast('int') * 60 +
                       F.col('Duration').substr(7, 2).cast('int'))
                    )

logs_df.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttenti

In [89]:
logs_df.select(
                F.col('Duration'),
                F.col('Duration_seconds')
                ) \
       .distinct() \
       .show(5)

+----------------+----------------+
|        Duration|Duration_seconds|
+----------------+----------------+
|00:28:08.0000000|            1688|
|00:32:00.0000000|            1920|
|00:30:00.0000000|            1800|
|00:01:39.0000000|              99|
|00:29:50.0000000|            1790|
+----------------+----------------+
only showing top 5 rows



### `withColumnRenamed()`

In [90]:
logs_df = logs_df.withColumnRenamed("Duration_seconds", "Total_duration_seconds")

In [91]:
logs_df.printSchema()

root
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- SpecialAttenti

### Batch `lowercasing` using the toDF() method

In [93]:
logs_df.toDF(*[col.lower() for col in logs_df.columns]) \
       .printSchema()

root
 |-- logserviceid: integer (nullable = true)
 |-- logdate: timestamp (nullable = true)
 |-- audiencetargetageid: integer (nullable = true)
 |-- audiencetargetethnicid: integer (nullable = true)
 |-- categoryid: integer (nullable = true)
 |-- closedcaptionid: integer (nullable = true)
 |-- countryoforiginid: integer (nullable = true)
 |-- dubdramacreditid: integer (nullable = true)
 |-- ethnicprogramid: integer (nullable = true)
 |-- productionsourceid: integer (nullable = true)
 |-- programclassid: integer (nullable = true)
 |-- filmclassificationid: integer (nullable = true)
 |-- exhibitionid: integer (nullable = true)
 |-- duration: string (nullable = true)
 |-- endtime: string (nullable = true)
 |-- logentrydate: timestamp (nullable = true)
 |-- productionno: string (nullable = true)
 |-- programtitle: string (nullable = true)
 |-- starttime: string (nullable = true)
 |-- subtitle: string (nullable = true)
 |-- networkaffiliationid: integer (nullable = true)
 |-- specialattenti
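
`toDF()` replaces every column name in one call, so the new names must line up one-to-one with the existing columns. One thing worth checking before a batch rename is whether lowercasing would make two names identical. A minimal sketch, using a hypothetical `lowercase_collisions` helper on plain lists of names:

```python
from collections import Counter


def lowercase_collisions(names):
    """Return lowercased names that would appear more than once."""
    counts = Counter(name.lower() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


clean = lowercase_collisions(["Duration", "EndTime", "StartTime"])
clash = lowercase_collisions(["Duration", "duration", "EndTime"])
print(clean)  # []
print(clash)  # ['duration']
```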

### Selecting our columns in `alphabetical` order using select()

In [95]:
logs_df.select(sorted(logs_df.columns)).printSchema()

root
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)
 |-- LogDate: timestamp (nullable = true)
 |-- LogEntryDate: timestamp (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string (nullable = true)
 |-- ProductionNO: 
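
Python's `sorted()` compares strings by code point, so every uppercase letter sorts before any lowercase one. That makes no difference while all names start with a capital, but it changes the order once lowercase names are mixed in. A small sketch with hypothetical column names:

```python
names = ["LogDate", "duration_seconds", "EndTime"]  # hypothetical mixed-case names

# Default ordering: uppercase letters come before lowercase ones.
by_code_point = sorted(names)
print(by_code_point)
# ['EndTime', 'LogDate', 'duration_seconds']

# Case-insensitive ordering: compare the lowercased names instead.
ignoring_case = sorted(names, key=str.lower)
print(ignoring_case)
# ['duration_seconds', 'EndTime', 'LogDate']
```

Passing `key=str.lower` to `sorted()` before handing the list to `select()` gives the ordering most readers expect from "alphabetical".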

### `describe()`

In [98]:
for i in logs_df.columns:
    logs_df.describe(i).show()

                                                                                

+-------+------------------+
|summary|      LogServiceID|
+-------+------------------+
|  count|            238945|
|   mean| 3450.890284375065|
| stddev|199.50673962555592|
|    min|              3157|
|    max|              3925|
+-------+------------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+

+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|  count|              16112|
|   mean| 3.4929245283018866|
| stddev|  1.041596339474513|
|    min|                  1|
|    max|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|  count|                  1710|
|   mean|    120.56432748538012|
| stddev|     71.98694059436134|
|    min|                     4|
|    max|                   337|
+-------+----------------------+

+-------+------------------+
|summary|        CategoryID|
+-------+-----------

                                                                                

+-------+----------------+
|summary|        Duration|
+-------+----------------+
|  count|          236724|
|   mean|            null|
| stddev|            null|
|    min|00:00:01.0000000|
|    max|06:30:09.0000000|
+-------+----------------+



                                                                                

+-------+----------------+
|summary|         EndTime|
+-------+----------------+
|  count|          169979|
|   mean|            null|
| stddev|            null|
|    min|00:00:00.0000000|
|    max|23:59:59.0000000|
+-------+----------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    max|
+-------+

+-------+------------------+
|summary|      ProductionNO|
+-------+------------------+
|  count|              3519|
|   mean| 35710.61538461538|
| stddev|3749.1340008607654|
|    min|            030641|
|    max|            c34183|
+-------+------------------+



                                                                                

+-------+------------------+
|summary|      ProgramTitle|
+-------+------------------+
|  count|            238703|
|   mean|            1999.0|
| stddev|              null|
|    min|  !NO 5PM A ID(:5)|
|    max|�t� avec Jo�l 2/Un|
+-------+------------------+



                                                                                

+-------+----------------+
|summary|       StartTime|
+-------+----------------+
|  count|          238945|
|   mean|            null|
| stddev|            null|
|    min|00:00:00.0000000|
|    max|23:59:59.0000000|
+-------+----------------+

+-------+--------------------+
|summary|            Subtitle|
+-------+--------------------+
|  count|               15468|
|   mean|   3463.573913043478|
| stddev|   16251.27241914423|
|    min|                #001|
|    max|�tre dans le trou...|
+-------+--------------------+

+-------+--------------------+
|summary|NetworkAffiliationID|
+-------+--------------------+
|  count|              108807|
|   mean|   6.082200593711802|
| stddev|   2.990848675228516|
|    min|                   1|
|    max|                   9|
+-------+--------------------+

+-------+------------------+
|summary|SpecialAttentionID|
+-------+------------------+
|  count|              2395|
|   mean| 1.704384133611691|
| stddev|0.5394635034869688|
|    min|             
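
To make the five rows easier to read, here is a plain-Python sketch that computes the same statistics for one small list. It assumes that nulls are excluded from `count` and that `stddev` is the sample standard deviation (dividing by n - 1); the `describe` helper and the input values are hypothetical.

```python
import statistics


def describe(values):
    """Compute count, mean, stddev, min and max over the non-null values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "stddev": statistics.stdev(present),  # sample standard deviation (n - 1)
        "min": min(present),
        "max": max(present),
    }


stats = describe([1, 4, None, 4, 4, None])  # hypothetical column with nulls
print(stats)
```

The `count` row is the quickest way to spot sparse columns: in the output above, `AudienceTargetAgeID` has 16112 non-null values against 238945 for `LogServiceID`.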

### `summary()`

In [100]:
for i in logs_df.columns:
    logs_df.select(i).summary().show()

                                                                                

+-------+------------------+
|summary|      LogServiceID|
+-------+------------------+
|  count|            238945|
|   mean| 3450.890284375065|
| stddev|199.50673962555592|
|    min|              3157|
|    25%|              3287|
|    50%|              3379|
|    75%|              3627|
|    max|              3925|
+-------+------------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    25%|
|    50%|
|    75%|
|    max|
+-------+





+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|  count|              16112|
|   mean| 3.4929245283018866|
| stddev| 1.0415963394745122|
|    min|                  1|
|    25%|                  4|
|    50%|                  4|
|    75%|                  4|
|    max|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|  count|                  1710|
|   mean|    120.56432748538012|
| stddev|     71.98694059436134|
|    min|                     4|
|    25%|                    74|
|    50%|                    95|
|    75%|                   136|
|    max|                   337|
+-------+----------------------+

+-------+------------------+
|summary|        CategoryID|
+-------+------------------+
|  count|             25506|
|   mean|18.485297577040697|
| stddev| 9.655852252020837|
|    min|                 1|
|    25%|                11|
|    50%| 

                                                                                

+-------+----------------+
|summary|        Duration|
+-------+----------------+
|  count|          236724|
|   mean|            null|
| stddev|            null|
|    min|00:00:01.0000000|
|    25%|            null|
|    50%|            null|
|    75%|            null|
|    max|06:30:09.0000000|
+-------+----------------+



                                                                                

+-------+----------------+
|summary|         EndTime|
+-------+----------------+
|  count|          169979|
|   mean|            null|
| stddev|            null|
|    min|00:00:00.0000000|
|    25%|            null|
|    50%|            null|
|    75%|            null|
|    max|23:59:59.0000000|
+-------+----------------+

+-------+
|summary|
+-------+
|  count|
|   mean|
| stddev|
|    min|
|    25%|
|    50%|
|    75%|
|    max|
+-------+

+-------+------------------+
|summary|      ProductionNO|
+-------+------------------+
|  count|              3519|
|   mean| 35710.61538461538|
| stddev|3749.1340008607654|
|    min|            030641|
|    25%|           32974.0|
|    50%|           37775.0|
|    75%|           37775.0|
|    max|            c34183|
+-------+------------------+



                                                                                

+-------+------------------+
|summary|      ProgramTitle|
+-------+------------------+
|  count|            238703|
|   mean|            1999.0|
| stddev|              null|
|    min|  !NO 5PM A ID(:5)|
|    25%|            1999.0|
|    50%|            1999.0|
|    75%|            1999.0|
|    max|�t� avec Jo�l 2/Un|
+-------+------------------+



                                                                                

+-------+----------------+
|summary|       StartTime|
+-------+----------------+
|  count|          238945|
|   mean|            null|
| stddev|            null|
|    min|00:00:00.0000000|
|    25%|            null|
|    50%|            null|
|    75%|            null|
|    max|23:59:59.0000000|
+-------+----------------+

+-------+--------------------+
|summary|            Subtitle|
+-------+--------------------+
|  count|               15468|
|   mean|   3463.573913043478|
| stddev|  16251.272419144227|
|    min|                #001|
|    25%|                17.0|
|    50%|               106.0|
|    75%|              2014.0|
|    max|�tre dans le trou...|
+-------+--------------------+

+-------+--------------------+
|summary|NetworkAffiliationID|
+-------+--------------------+
|  count|              108807|
|   mean|   6.082200593711802|
| stddev|  2.9908486752285337|
|    min|                   1|
|    25%|                   5|
|    50%|                   6|
|    75%|              
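
The percentile rows show values that occur in the data rather than interpolated ones, which is why integer columns report integer quartiles. The sketch below uses an exact nearest-rank rule to show the idea; it is not Spark's algorithm, which computes percentiles approximately, and the `nearest_rank` helper and sample values are hypothetical.

```python
import math


def nearest_rank(values, fraction):
    """Return the smallest value with at least `fraction` of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


data = [3157, 3287, 3379, 3627, 3925, 3287, 3379, 3627]  # hypothetical values
quartiles = {
    label: nearest_rank(data, q)
    for label, q in [("25%", 0.25), ("50%", 0.50), ("75%", 0.75)]
}
print(quartiles)
```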

In [102]:
for i in logs_df.columns:
    logs_df.select(i).summary('mean', '25%').show()

+-------+-----------------+
|summary|     LogServiceID|
+-------+-----------------+
|   mean|3450.890284375065|
|    25%|             3287|
+-------+-----------------+

+-------+
|summary|
+-------+
|   mean|
|    25%|
+-------+

+-------+-------------------+
|summary|AudienceTargetAgeID|
+-------+-------------------+
|   mean| 3.4929245283018866|
|    25%|                  4|
+-------+-------------------+

+-------+----------------------+
|summary|AudienceTargetEthnicID|
+-------+----------------------+
|   mean|    120.56432748538012|
|    25%|                    74|
+-------+----------------------+

+-------+------------------+
|summary|        CategoryID|
+-------+------------------+
|   mean|18.485297577040697|
|    25%|                11|
+-------+------------------+

+-------+------------------+
|summary|   ClosedCaptionID|
+-------+------------------+
|   mean|1.0316174141185184|
|    25%|                 1|
+-------+------------------+

+-------+-----------------+
|summary|Cou

                                                                                

+-------+----------------+
|summary|    ExhibitionID|
+-------+----------------+
|   mean|4.52067364784627|
|    25%|               4|
+-------+----------------+



                                                                                

+-------+--------+
|summary|Duration|
+-------+--------+
|   mean|    null|
|    25%|    null|
+-------+--------+



                                                                                

+-------+-------+
|summary|EndTime|
+-------+-------+
|   mean|   null|
|    25%|   null|
+-------+-------+

+-------+
|summary|
+-------+
|   mean|
|    25%|
+-------+

+-------+-----------------+
|summary|     ProductionNO|
+-------+-----------------+
|   mean|35710.61538461538|
|    25%|          32974.0|
+-------+-----------------+



                                                                                

+-------+------------+
|summary|ProgramTitle|
+-------+------------+
|   mean|      1999.0|
|    25%|      1999.0|
+-------+------------+



                                                                                

+-------+---------+
|summary|StartTime|
+-------+---------+
|   mean|     null|
|    25%|     null|
+-------+---------+

+-------+-----------------+
|summary|         Subtitle|
+-------+-----------------+
|   mean|3463.573913043478|
|    25%|             17.0|
+-------+-----------------+

+-------+--------------------+
|summary|NetworkAffiliationID|
+-------+--------------------+
|   mean|   6.082200593711802|
|    25%|                   5|
+-------+--------------------+

+-------+------------------+
|summary|SpecialAttentionID|
+-------+------------------+
|   mean| 1.704384133611691|
|    25%|                 1|
+-------+------------------+

+-------+----------------------+
|summary|BroadcastOriginPointID|
+-------+----------------------+
|   mean|    2.1390058127881337|
|    25%|                     1|
+-------+----------------------+

+-------+------------------+
|summary|     CompositionID|
+-------+------------------+
|   mean|3.4141110442974543|
|    25%|                 3|
+---

In [103]:
logs_df.summary().show()



+-------+------------------+-------------------+----------------------+------------------+-------------------+------------------+------------------+------------------+------------------+-----------------+--------------------+------------------+----------------+----------------+------------------+------------------+----------------+--------------------+--------------------+------------------+----------------------+------------------+---------+---------+------------------+------------------+----------------------+
|summary|      LogServiceID|AudienceTargetAgeID|AudienceTargetEthnicID|        CategoryID|    ClosedCaptionID| CountryOfOriginID|  DubDramaCreditID|   EthnicProgramID|ProductionSourceID|   ProgramClassID|FilmClassificationID|      ExhibitionID|        Duration|         EndTime|      ProductionNO|      ProgramTitle|       StartTime|            Subtitle|NetworkAffiliationID|SpecialAttentionID|BroadcastOriginPointID|     CompositionID|Producer1|Producer2|         Language1|       

                                                                                