## Overview of Narrow and Wide Transformations
Narrow transformations, also known as row-level transformations, do not result in shuffling:
* `df.select`
* `df.filter`
* `df.withColumn`
* `df.withColumnRenamed`
* `df.drop`
* `df.union` (it only concatenates the partitions of both Data Frames)
* `map`, `coalesce`

Wide transformations result in shuffling and data movement, because they deal with groups of records based on a key:
* `df.distinct`
* `df.join`
* `df.groupBy`
* `df.sort` and `df.orderBy`
* `df.repartition` (it performs a full shuffle)
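
To make the distinction concrete, here is a toy pure-Python model (not Spark itself). It assumes a dataset is a list of partitions, each holding `(key, value)` records; the helper names `narrow_filter` and `wide_group_by_key` are hypothetical. The filter handles each partition on its own, while grouping by key has to move records between partitions, which is roughly what a shuffle does.

```python
from collections import defaultdict

# Toy model: a dataset is a list of partitions; each partition is a list of (key, value) records.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]

def narrow_filter(parts, predicate):
    """Narrow: every output partition depends on exactly one input partition."""
    return [[rec for rec in part if predicate(rec)] for part in parts]

def wide_group_by_key(parts, num_partitions):
    """Wide: records with the same key must end up in the same partition (a shuffle)."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    moved = 0
    for source_idx, part in enumerate(parts):
        for key, value in part:
            # Deterministic toy partitioner; Spark uses its own hash partitioning.
            target_idx = ord(key[0]) % num_partitions
            if target_idx != source_idx:
                moved += 1
            shuffled[target_idx][key].append(value)
    return [dict(p) for p in shuffled], moved

filtered = narrow_filter(partitions, lambda rec: rec[1] > 1)
grouped, records_moved = wide_group_by_key(partitions, num_partitions=2)

print(filtered)       # partition boundaries are unchanged
print(grouped)        # values for key 'a' now live in one partition
print(records_moved)  # number of records that crossed partitions
```

In this model the filter never looks outside a partition, whereas the grouping step cannot produce a correct result without relocating at least one record.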

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("instance").getOrCreate()



In [2]:
import datetime
from pyspark.sql import Row
users = [
    {
        'id': 1,
        'first_name': 'Corrie',
        'last_name': 'Van den Oord',
        'email': 'cvandenoord@etsy.com',
        'phone_numbers': Row(mobile='+1 234 567 8901', home='+1 234 567 8911'),
        'courses': [2, 4],
        'is_customer': True,
        'amount_paid': 1000.55,
        'customer_from': datetime.date(2021, 1, 15),
        'last_updated_ts': datetime.datetime(2021, 2, 18, 1, 15, 0)
    },
    {
        'id': 2,
        'first_name': 'Nicolaus',
        'last_name': 'Brewitt',
        'email': 'nbrewitt@dailymail.co.uk',
        'phone_numbers': Row(mobile='+1 234 567 8923', home='+1 234 567 8934'),
        'courses': [3],
        'is_customer': None,
        'amount_paid': 900.0,
        'customer_from': datetime.date(2021, 2, 14),
        'last_updated_ts': datetime.datetime(2021, 2, 18, 3, 33, 0)
    },
    {
        'id': 3,
        'first_name': 'Kurt',
        'last_name': 'Rome',
        'email': 'krome4@shutterfly.co.uk',
        'phone_numbers': Row(mobile=None, home=None),
        'courses': [],
        'is_customer': False,
        'amount_paid': None,
        'customer_from': datetime.date(2021, 2, 14),
        'last_updated_ts': datetime.datetime(2024, 2, 28, 5, 27, 0)
    }
]

In [3]:
import pandas as pd
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)

In [5]:
users_df=spark.createDataFrame(pd.DataFrame(users))
users_df.dtypes, users_df.show()

+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+



([('id', 'bigint'),
  ('first_name', 'string'),
  ('last_name', 'string'),
  ('email', 'string'),
  ('phone_numbers', 'struct<mobile:string,home:string>'),
  ('courses', 'array<bigint>'),
  ('is_customer', 'boolean'),
  ('amount_paid', 'double'),
  ('customer_from', 'date'),
  ('last_updated_ts', 'timestamp')],
 None)



## select

In [11]:
users_df.select('*').show()

+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+



In [26]:
users_df.select('id', 'first_name', 'last_name').show()
users_df.select(['id', 'first_name', 'last_name']).show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+



In [17]:
users_df.alias('u').select('u.*').show()
users_df.alias('u').select(['u.id', 'u.first_name', 'u.last_name']).show()

+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+

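+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+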

In [32]:
from pyspark.sql.functions import col
users_df.select(col('id'), 'first_name', 'last_name').show()
users_df.select([col('id'), 'first_name', 'last_name']).show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+



In [8]:
from pyspark.sql.functions import col, concat, lit
users_df.select(
    col('id'),
    'first_name',
    'last_name',
    concat(col('first_name'), lit(', '), col('last_name')).alias('full_name')
).show()

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



## selectExpr

In [6]:
users_df.selectExpr('*').show()
users_df.alias('u').selectExpr('u.*').show()

+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+

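+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+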



In [7]:
users_df.selectExpr('id', 'first_name', 'last_name').show()
users_df.selectExpr(['id', 'first_name', 'last_name']).show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+



In [13]:
from pyspark.sql.functions import col, concat, lit
users_df.select('id', 'first_name', 'last_name', concat(col('first_name'), lit(', '), col('last_name')).alias('full_name')).show()
users_df.selectExpr('id', 'first_name', 'last_name', "concat(first_name, ', ', last_name) as full_name").show()

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



In [18]:
users_df.createOrReplaceTempView('users')
spark.sql("""select * from users""").show()
spark.sql("""select id, first_name, last_name, concat(first_name, ', ', last_name) as full_name from users""").show()

+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
| id|first_name|   last_name|               email|       phone_numbers|courses|is_customer|amount_paid|customer_from|    last_updated_ts|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+
|  1|    Corrie|Van den Oord|cvandenoord@etsy.com|{+1 234 567 8901,...| [2, 4]|       true|    1000.55|   2021-01-15|2021-02-18 01:15:00|
|  2|  Nicolaus|     Brewitt|nbrewitt@dailymai...|{+1 234 567 8923,...|    [3]|       NULL|      900.0|   2021-02-14|2021-02-18 03:33:00|
|  3|      Kurt|        Rome|krome4@shutterfly...|        {NULL, NULL}|     []|      false|        NaN|   2021-02-14|2024-02-28 05:27:00|
+---+----------+------------+--------------------+--------------------+-------+-----------+-----------+-------------+-------------------+

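+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+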

## Referring to columns using Spark Data Frame names

In [12]:
from pyspark.sql.functions import col
users_df['id'], col('id'), type(users_df['id'])

(Column<'id'>, Column<'id'>, pyspark.sql.column.Column)

In [14]:
from pyspark.sql.functions import col
users_df.select(users_df['id'], col('first_name'), 'last_name').show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+



In [19]:
# This does not work as there is no object by name u in this session
# users_df.alias('u').select(u['id']).show() # NameError: name 'u' is not defined
users_df.alias('u').select('u.id').show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+



In [29]:
from pyspark.sql.functions import col, concat, lit
users_df.select(
    col('id'),
    'first_name',
    'last_name',
    concat(users_df['first_name'], lit(', '), col('last_name')).alias('full_name')
).show()
users_df.alias('u').select(col('u.id'),
    'u.first_name',
    'u.last_name',
    concat(col('u.first_name'), lit(', '), 'u.last_name').alias('full_name')
).show()


+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



In [24]:
users_df.createOrReplaceTempView('users')
spark.sql("""
    select u.id, u.first_name, u.last_name,
        concat(u.first_name, ', ', u.last_name) full_name
    from users u
""").show()

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nicolaus|     Brewitt|   Nicolaus, Brewitt|
|  3|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



## Understanding col function in Spark

In [36]:
from pyspark.sql.functions import col
users_df['id'], col('id')

(Column<'id'>, Column<'id'>)

In [6]:
users_df.select('id', 'first_name', 'last_name').show()

cols = ['id', 'first_name', 'last_name']
users_df.select(cols).show()
users_df.select(*cols).show() # star

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nicolaus|     Brewitt|
|  3|      Kurt|        Rome|
+---+----------+------------+



In [8]:
from pyspark.sql.functions import col

user_id = col('id')
type(user_id), users_df.select(user_id).show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+



(pyspark.sql.column.Column, None)

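Functions and methods that work with `Column` objects:
* `cast` (a `Column` method), and `select`, `filter`, `groupBy`, `orderBy`, etc. (Data Frame methods that accept `Column` objects)
* `asc` and `desc` (usually as part of `sort` and `orderBy`)
* `contains` (as part of `filter` or `where`)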

In [14]:
users_df.select('id', 'customer_from').show() 
users_df.select('id', 'customer_from').printSchema()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|   2021-01-15|
|  2|   2021-02-14|
|  3|   2021-02-14|
+---+-------------+

root
 |-- id: long (nullable = true)
 |-- customer_from: date (nullable = true)



In [13]:
from pyspark.sql.functions import date_format
users_df.select(
    col('id')
    , date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_type') # date_format returns a string, so an explicit cast to int is added
).show()
users_df.select(
    col('id')
    , date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_type') # same expression; printSchema confirms the integer type
).printSchema()

+---+-------------+
| id|customer_type|
+---+-------------+
|  1|     20210115|
|  2|     20210214|
|  3|     20210214|
+---+-------------+

root
 |-- id: long (nullable = true)
 |-- customer_type: integer (nullable = true)



In [16]:
cols=[col('id'), date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_type')]
users_df.select(*cols).show()
users_df.select(cols).show()

+---+-------------+
| id|customer_type|
+---+-------------+
|  1|     20210115|
|  2|     20210214|
|  3|     20210214|
+---+-------------+

+---+-------------+
| id|customer_type|
+---+-------------+
|  1|     20210115|
|  2|     20210214|
|  3|     20210214|
+---+-------------+



## Invoking Functions using Spark Column Objects

In [81]:
import datetime
users = [
    {
        "id": 1,
        "first_name": "Corrie",
        "last_name": "Van den Oord",
        "email": "cvandenoord0@etsy.com",
        "phone_numbers": Row(mobile="+1 234 567 8901", home="+1 234 567 8911"),
        "courses": [1, 2],
        "is_customer": True,
        "amount_paid": 1000.55,
        "customer_from": datetime.date(2021, 1, 15),
        "last_updated_ts": datetime.datetime(2021, 2, 10, 1, 15, 0)
    },
    {
        "id": 2,
        "first_name": "Nikolaus",
        "last_name": "Brewitt",
        "email": "nbrewitt1@dailymail.co.uk",
        "phone_numbers":  Row(mobile="+1 234 567 8923", home="1 234 567 8934"),
        "courses": [3],
        "is_customer": True,
        "amount_paid": 900.0,
        "customer_from": datetime.date(2021, 2, 14),
        "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
    },
    {
        "id": 3,
        "first_name": "Orelie",
        "last_name": "Penney",
        "email": "openney2@vistaprint.com",
        "phone_numbers": Row(mobile="+1 714 512 9752", home="+1 714 512 6601"),
        "courses": [2, 4],
        "is_customer": True,
        "amount_paid": 850.55,
        "customer_from": datetime.date(2021, 1, 21),
        "last_updated_ts": datetime.datetime(2021, 3, 15, 15, 16, 55)
    },
    {
        "id": 4,
        "first_name": "Ashby",
        "last_name": "Maddocks",
        "email": "amaddocks3@home.pl",
        "phone_numbers": Row(mobile=None, home=None),
        "courses": [],
        "is_customer": False,
        "amount_paid": None,
        "customer_from": None,
        "last_updated_ts": datetime.datetime(2021, 4, 10, 17, 45, 30)
    },
    {
        "id": 5,
        "first_name": "Kurt",
        "last_name": "Rome",
        "email": "krome4@shutterfly.com",
        "phone_numbers": Row(mobile="+1 817 934 7142", home=None),
        "courses": [],
        "is_customer": False,
        "amount_paid": None,
        "customer_from": None,
        "last_updated_ts": datetime.datetime(2021, 4, 2, 0, 55, 18)
    }
]

In [82]:
users_df=spark.createDataFrame(users)
users_df.show()

+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|amount_paid|courses|customer_from|               email|first_name| id|is_customer|   last_name|    last_updated_ts|       phone_numbers|
+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|    1000.55| [1, 2]|   2021-01-15|cvandenoord0@etsy...|    Corrie|  1|       true|Van den Oord|2021-02-10 01:15:00|{+1 234 567 8901,...|
|      900.0|    [3]|   2021-02-14|nbrewitt1@dailyma...|  Nikolaus|  2|       true|     Brewitt|2021-02-18 03:33:00|{+1 234 567 8923,...|
|     850.55| [2, 4]|   2021-01-21|openney2@vistapri...|    Orelie|  3|       true|      Penney|2021-03-15 15:16:55|{+1 714 512 9752,...|
|       NULL|     []|         NULL|  amaddocks3@home.pl|     Ashby|  4|      false|    Maddocks|2021-04-10 17:45:30|        {NULL, NULL}|
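|       NULL|     []|         NULL|krome4@shutterfly...|      Kurt|  5|      false|        Rome|2021-04-02 00:55:18|{+1 817 934 7142,...|
+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+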

* concatenate `first_name` and `last_name` to generate `full_name`

In [30]:
from pyspark.sql.functions import concat, lit, col
full_name_col=concat(col('first_name'), lit(', '), col('last_name')).alias('full_name')
type(full_name_col), users_df.select('id', full_name_col).show()

+---+--------------------+
| id|           full_name|
+---+--------------------+
|  1|Corrie, Van den Oord|
|  2|   Nikolaus, Brewitt|
|  3|      Orelie, Penney|
|  4|     Ashby, Maddocks|
|  5|          Kurt, Rome|
+---+--------------------+



(pyspark.sql.column.Column, None)

* convert data type of `customer_from` date to numeric 

In [31]:
users_df.select('id', 'customer_from').show()

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|   2021-01-15|
|  2|   2021-02-14|
|  3|   2021-01-21|
|  4|         NULL|
|  5|         NULL|
+---+-------------+



In [35]:
from pyspark.sql.functions import date_format

customer_from_numeric=date_format('customer_from', 'yyyyMMdd').cast('int').alias('customer_from')
users_df.select('id', customer_from_numeric).show(), users_df.select('id', customer_from_numeric).dtypes

+---+-------------+
| id|customer_from|
+---+-------------+
|  1|     20210115|
|  2|     20210214|
|  3|     20210121|
|  4|         NULL|
|  5|         NULL|
+---+-------------+



(None, [('id', 'bigint'), ('customer_from', 'int')])

## Understanding lit function in Spark

In [48]:
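#users_df.select('id', 'amount_paid'+25).show() # fails: Python raises TypeError (str + int) before Spark is involved
#users_df.select('id', 'amount_paid'+'25').show() # fails: the argument becomes the column name 'amount_paid25', which does not exist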
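
Both commented-out calls fail for reasons that can be checked in plain Python, because the argument expression is evaluated before `select` ever receives it. The snippet below is a small sketch that needs no Spark session; it only shows what Python does with the two expressions.

```python
# 'amount_paid' + 25 mixes str and int, so Python raises TypeError on its own.
try:
    'amount_paid' + 25
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# 'amount_paid' + '25' is ordinary string concatenation. select would treat the
# result as a column name, and users_df has no column called 'amount_paid25'.
concatenated = 'amount_paid' + '25'

print(error_name)     # TypeError
print(concatenated)   # amount_paid25
```

Wrapping the column name in `col` and the constant in `lit` builds `Column` objects instead, so the addition is carried out by Spark rather than by Python.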

In [46]:
users_df.createOrReplaceTempView('users')
spark.sql('select id, amount_paid+25 amount_paid from users').show()
users_df.selectExpr('id', 'amount_paid+25 amount_paid').show()

+---+-----------+
| id|amount_paid|
+---+-----------+
|  1|    1025.55|
|  2|      925.0|
|  3|     875.55|
|  4|       NULL|
|  5|       NULL|
+---+-----------+

+---+-----------+
| id|amount_paid|
+---+-----------+
|  1|    1025.55|
|  2|      925.0|
|  3|     875.55|
|  4|       NULL|
|  5|       NULL|
+---+-----------+



In [59]:
from pyspark.sql.functions import col, lit
lit(25), users_df.select('id', col('amount_paid')+lit(25)).show()

+---+------------------+
| id|(amount_paid + 25)|
+---+------------------+
|  1|           1025.55|
|  2|             925.0|
|  3|            875.55|
|  4|              NULL|
|  5|              NULL|
+---+------------------+



(Column<'25'>, None)

## Renaming Spark Data Frame Columns or Expressions
There are multiple ways to rename Spark Data Frame columns or expressions.
* We can rename a column or expression using `alias` as part of `select`.
* We can add a named column, or replace an existing one, using `withColumn` on a Data Frame.
* We can rename one column at a time using `withColumnRenamed` on a Data Frame.
* We typically use `withColumn` to perform row-level transformations and give the result a name. If the name matches an existing column, that column is replaced with the new one.
* If we only want to rename a column, it is better to use `withColumnRenamed`.
* If we want to apply a transformation, we need to use either `select` or `withColumn`.
* We can rename a bunch of columns at once using `toDF`.

## Naming derived columns using withColumn

In [60]:
users_df.select('id', 'first_name', 'last_name').show()

+---+----------+------------+
| id|first_name|   last_name|
+---+----------+------------+
|  1|    Corrie|Van den Oord|
|  2|  Nikolaus|     Brewitt|
|  3|    Orelie|      Penney|
|  4|     Ashby|    Maddocks|
|  5|      Kurt|        Rome|
+---+----------+------------+



* concatenate `first_name` and `last_name`, and alias the derived result as `full_name`

In [61]:
from pyspark.sql.functions import concat, lit
users_df.select('id', concat('first_name', lit(', '), 'last_name').alias('full_name')).show()

+---+--------------------+
| id|           full_name|
+---+--------------------+
|  1|Corrie, Van den Oord|
|  2|   Nikolaus, Brewitt|
|  3|      Orelie, Penney|
|  4|     Ashby, Maddocks|
|  5|          Kurt, Rome|
+---+--------------------+



In [64]:
users_df.select('id', 'first_name', 'last_name').withColumn('full_name', concat('first_name', lit(', '), 'last_name')).show()

+---+----------+------------+--------------------+
| id|first_name|   last_name|           full_name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nikolaus|     Brewitt|   Nikolaus, Brewitt|
|  3|    Orelie|      Penney|      Orelie, Penney|
|  4|     Ashby|    Maddocks|     Ashby, Maddocks|
|  5|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



In [71]:
from pyspark.sql.functions import col
users_df.select('id', 'first_name').withColumn('fn', col('first_name')).show()
users_df.select('id', 'first_name').withColumn('fn', users_df['first_name']).show()

+---+----------+--------+
| id|first_name|      fn|
+---+----------+--------+
|  1|    Corrie|  Corrie|
|  2|  Nikolaus|Nikolaus|
|  3|    Orelie|  Orelie|
|  4|     Ashby|   Ashby|
|  5|      Kurt|    Kurt|
+---+----------+--------+

+---+----------+--------+
| id|first_name|      fn|
+---+----------+--------+
|  1|    Corrie|  Corrie|
|  2|  Nikolaus|Nikolaus|
|  3|    Orelie|  Orelie|
|  4|     Ashby|   Ashby|
|  5|      Kurt|    Kurt|
+---+----------+--------+



* add a `course_count` column containing the number of courses the user is enrolled in

In [75]:
from pyspark.sql.functions import size
users_df.select('id', 'courses').withColumn('course_count', size('courses')).show()

+---+-------+------------+
| id|courses|course_count|
+---+-------+------------+
|  1| [1, 2]|           2|
|  2|    [3]|           1|
|  3| [2, 4]|           2|
|  4|     []|           0|
|  5|     []|           0|
+---+-------+------------+



## Renaming columns using withColumnRenamed

* rename `id` to `user_id`
* rename `first_name` to `user_first_name`
* rename `last_name` to `user_last_name`

In [77]:
users_df. \
    select('id', 'first_name', 'last_name'). \
    withColumnRenamed('id', 'user_id'). \
    withColumnRenamed('first_name', 'user_first_name'). \
    withColumnRenamed('last_name', 'user_last_name'). \
    show()

+-------+---------------+--------------+
|user_id|user_first_name|user_last_name|
+-------+---------------+--------------+
|      1|         Corrie|  Van den Oord|
|      2|       Nikolaus|       Brewitt|
|      3|         Orelie|        Penney|
|      4|          Ashby|      Maddocks|
|      5|           Kurt|          Rome|
+-------+---------------+--------------+



## Renaming Spark Data Frame columns or expressions using alias
* rename `id` to `user_id`
* rename `first_name` to `user_first_name`
* rename `last_name` to `user_last_name`
* add new `user_full_name` column which is derived by concatenating `first_name` and `last_name` with `, ` in between

In [89]:
from pyspark.sql.functions import col, concat, lit
users_df. \
    select( \
        col('id').alias('user_id'), \
        col('first_name').alias('user_first_name'), \
        col('last_name').alias('user_last_name'), \
        concat(col('first_name'), lit(', '), col('last_name')).alias('user_full_name')
    ). \
    show()

+-------+---------------+--------------+--------------------+
|user_id|user_first_name|user_last_name|      user_full_name|
+-------+---------------+--------------+--------------------+
|      1|         Corrie|  Van den Oord|Corrie, Van den Oord|
|      2|       Nikolaus|       Brewitt|   Nikolaus, Brewitt|
|      3|         Orelie|        Penney|      Orelie, Penney|
|      4|          Ashby|      Maddocks|     Ashby, Maddocks|
|      5|           Kurt|          Rome|          Kurt, Rome|
+-------+---------------+--------------+--------------------+



In [88]:
users_df. \
    select( \
        users_df['id'].alias('user_id'), \
        users_df['first_name'].alias('user_first_name'), \
        users_df['last_name'].alias('user_last_name'), \
        concat(users_df['first_name'], lit(', '), users_df['last_name']).alias('user_full_name')
    ). \
    show()

+-------+---------------+--------------+--------------------+
|user_id|user_first_name|user_last_name|      user_full_name|
+-------+---------------+--------------+--------------------+
|      1|         Corrie|  Van den Oord|Corrie, Van den Oord|
|      2|       Nikolaus|       Brewitt|   Nikolaus, Brewitt|
|      3|         Orelie|        Penney|      Orelie, Penney|
|      4|          Ashby|      Maddocks|     Ashby, Maddocks|
|      5|           Kurt|          Rome|          Kurt, Rome|
+-------+---------------+--------------+--------------------+



In [93]:
# using withColumn and alias (first select and then withColumn)
users_df. \
    select( \
        users_df['id'].alias('user_id'), \
        users_df['first_name'].alias('user_first_name'), \
        users_df['last_name'].alias('user_last_name'), \
    ). \
    withColumn('user_full_name', concat(col('user_first_name'), lit(', '), col('user_last_name'))). \
    show()

+-------+---------------+--------------+--------------------+
|user_id|user_first_name|user_last_name|      user_full_name|
+-------+---------------+--------------+--------------------+
|      1|         Corrie|  Van den Oord|Corrie, Van den Oord|
|      2|       Nikolaus|       Brewitt|   Nikolaus, Brewitt|
|      3|         Orelie|        Penney|      Orelie, Penney|
|      4|          Ashby|      Maddocks|     Ashby, Maddocks|
|      5|           Kurt|          Rome|          Kurt, Rome|
+-------+---------------+--------------+--------------------+



In [96]:
# using withColumn and alias (withColumn first and then select)
users_df. \
    withColumn('user_full_name', concat(col('first_name'), lit(', '), col('last_name'))). \
    select( \
        users_df['id'].alias('user_id'), \
        users_df['first_name'].alias('user_first_name'), \
        users_df['last_name'].alias('user_last_name'), \
        col('user_full_name')
    ). \
    show()

+-------+---------------+--------------+--------------------+
|user_id|user_first_name|user_last_name|      user_full_name|
+-------+---------------+--------------+--------------------+
|      1|         Corrie|  Van den Oord|Corrie, Van den Oord|
|      2|       Nikolaus|       Brewitt|   Nikolaus, Brewitt|
|      3|         Orelie|        Penney|      Orelie, Penney|
|      4|          Ashby|      Maddocks|     Ashby, Maddocks|
|      5|           Kurt|          Rome|          Kurt, Rome|
+-------+---------------+--------------+--------------------+



In [95]:
users_df. \
    withColumn('user_full_name', concat(users_df['first_name'], lit(', '), users_df['last_name'])). \
    select( \
        users_df['id'].alias('user_id'), \
        users_df['first_name'].alias('user_first_name'), \
        users_df['last_name'].alias('user_last_name'), \
        col('user_full_name')
    ). \
    show()

+-------+---------------+--------------+--------------------+
|user_id|user_first_name|user_last_name|      user_full_name|
+-------+---------------+--------------+--------------------+
|      1|         Corrie|  Van den Oord|Corrie, Van den Oord|
|      2|       Nikolaus|       Brewitt|   Nikolaus, Brewitt|
|      3|         Orelie|        Penney|      Orelie, Penney|
|      4|          Ashby|      Maddocks|     Ashby, Maddocks|
|      5|           Kurt|          Rome|          Kurt, Rome|
+-------+---------------+--------------+--------------------+



## Renaming and Reordering multiple Spark Data Frame Columns

In [100]:
# columns to keep from the original Data Frame
required_columns = ['id', 'first_name', 'last_name', 'email', 'phone_numbers', 'courses']
# new column names, in the same order as required_columns
target_column_names = ['user_id', 'user_first_name', 'user_last_name', 'user_email', 'user_phone_numbers', 'enrolled_courses']
users_df. \
    select(required_columns). \
    show()

users_df. \
    select(required_columns). \
    toDF(*target_column_names). \
    show()

+---+----------+------------+--------------------+--------------------+-------+
| id|first_name|   last_name|               email|       phone_numbers|courses|
+---+----------+------------+--------------------+--------------------+-------+
|  1|    Corrie|Van den Oord|cvandenoord0@etsy...|{+1 234 567 8901,...| [1, 2]|
|  2|  Nikolaus|     Brewitt|nbrewitt1@dailyma...|{+1 234 567 8923,...|    [3]|
|  3|    Orelie|      Penney|openney2@vistapri...|{+1 714 512 9752,...| [2, 4]|
|  4|     Ashby|    Maddocks|  amaddocks3@home.pl|        {NULL, NULL}|     []|
|  5|      Kurt|        Rome|krome4@shutterfly...|{+1 817 934 7142,...|     []|
+---+----------+------------+--------------------+--------------------+-------+

+-------+---------------+--------------+--------------------+--------------------+----------------+
|user_id|user_first_name|user_last_name|          user_email|  user_phone_numbers|enrolled_courses|
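+-------+---------------+--------------+--------------------+--------------------+----------------+
|      1|         Corrie|  Van den Oord|cvandenoord0@etsy...|{+1 234 567 8901,...|          [1, 2]|
|      2|       Nikolaus|       Brewitt|nbrewitt1@dailyma...|{+1 234 567 8923,...|             [3]|
|      3|         Orelie|        Penney|openney2@vistapri...|{+1 714 512 9752,...|          [2, 4]|
|      4|          Ashby|      Maddocks|  amaddocks3@home.pl|        {NULL, NULL}|              []|
|      5|           Kurt|          Rome|krome4@shutterfly...|{+1 817 934 7142,...|              []|
+-------+---------------+--------------+--------------------+--------------------+----------------+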

In [105]:
def myDF(*cols):
    print(type(cols))
    print(cols)

myDF('f1', 'f2'), myDF(['f1', 'f2']), myDF(*['f1', 'f2'])

<class 'tuple'>
('f1', 'f2')
<class 'tuple'>
(['f1', 'f2'],)
<class 'tuple'>
('f1', 'f2')


(None, None, None)
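
The output shows why `toDF(*target_column_names)` needs the star: without it, the whole list arrives as a single argument. Methods such as `select` accept both call styles. The helper below is a hypothetical pure-Python sketch of how a function can normalize the two forms; it is not PySpark's actual implementation.

```python
def normalize_cols(*cols):
    """Return a flat list of names whether called as f('a', 'b') or f(['a', 'b'])."""
    # A single list or tuple argument is treated as the full set of columns.
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return list(cols[0])
    return list(cols)

names = ['f1', 'f2']
print(normalize_cols('f1', 'f2'))   # separate arguments
print(normalize_cols(names))        # one list argument
print(normalize_cols(*names))       # list unpacked with the star
```

All three calls print the same flat list, which mirrors how `select(cols)` and `select(*cols)` behave the same way in the cells above.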