### Selecting and renaming

In [0]:
from pyspark.sql import Row
import datetime
users = [
    {
        "id": 1,
        "first_name": "Corrie",
        "last_name": "Van den Oord",
        "email": "cvandenoord0@etsy.com",
        "phone_numbers": Row(mobile="+1 234 567 8923", home="1 234 567 8934"),
        "courses": [1, 2],
        "is_customer": True,
        "amount_paid": 1000.55,
        "customer_from": datetime.date(2021, 1, 15),
        "last_updated_ts": datetime.datetime(2021, 2, 10, 1, 15, 0)
    },
    {
        "id": 2,
        "first_name": "Nikolaus",
        "last_name": "Brewitt",
        "email": "nbrewitt1@dailymail.co.uk",
        "phone_numbers": Row(mobile="+1 234 567 8923", home="1 234 567 8934"),
        "courses": [3],
        "is_customer": True,
        "amount_paid": 900.0,
        "customer_from": datetime.date(2021, 2, 14),
        "last_updated_ts": datetime.datetime(2021, 2, 18, 3, 33, 0)
    },
    {
        "id": 3,
        "first_name": "Orelie",
        "last_name": "Penney",
        "email": "openney2@vistaprint.com",
        "phone_numbers": Row(mobile="+1 714 512 9752", home="+1 714 512 6601"),
        "courses": [2, 4],
        "is_customer": True,
        "amount_paid": 850.55,
        "customer_from": datetime.date(2021, 1, 21),
        "last_updated_ts": datetime.datetime(2021, 3, 15, 15, 16, 55)
    },
    {
        "id": 4,
        "first_name": "Ashby",
        "last_name": "Maddocks",
        "email": "amaddocks3@home.pl",
        "phone_numbers": Row(mobile=None, home=None),
        "courses": [],
        "is_customer": False,
        "amount_paid": None,
        "customer_from": None,
        "last_updated_ts": datetime.datetime(2021, 4, 10, 17, 45, 30)
    },
    {
        "id": 5,
        "first_name": "Kurt",
        "last_name": "Rome",
        "email": "krome4@shutterfly.com",
        "phone_numbers": Row(mobile="+1 817 934 7142", home=None),
        "courses": [],
        "is_customer": False,
        "amount_paid": None,
        "customer_from": None,
        "last_updated_ts": datetime.datetime(2021, 4, 2, 0, 55, 18)
    }
]

### Overview of Narrow and Wide Transformations

* **Narrow transformations** convert each input partition to only one output partition. Each partition of the parent RDD is used by at most one partition of the child RDD, so each child partition depends on a single parent partition.

* **Wide transformations** have input partitions that contribute to many output partitions. Each partition of the parent RDD may be used by multiple partitions of the child RDD, so a child partition can depend on multiple parent partitions.
  * **Any function that results in shuffling is a wide transformation. Wide transformations typically deal with groups of records based on a key.**


* Here are some functions that perform narrow transformations. Narrow transformations don't result in shuffling. These are also known as row-level transformations.
  * >*df.select, df.filter, df.withColumn, df.withColumnRenamed, df.drop*


* Here are the functions related to wide transformations.
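  * >*df.distinct, set operations such as df.intersect or df.subtract, df.join or any join operation, df.groupBy, df.sort or df.orderBy*
  * Note that `df.union` on its own only concatenates the partitions of both Data Frames and does not shuffle; it becomes wide only when combined with something like `df.distinct`.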
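
The difference between the two kinds of transformation can be illustrated without Spark at all. The following is a minimal plain-Python sketch, not Spark's actual implementation: the partitions are ordinary lists, the records are made-up `(course_id, amount)` pairs, and the helper names `narrow_filter` and `wide_sum_by_key` are hypothetical.

```python
from collections import defaultdict

# Two input "partitions" of (course_id, amount) records. The data is made up.
partitions = [
    [(1, 100.0), (2, 50.0)],
    [(1, 25.0), (3, 75.0)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[rec for rec in part if predicate(rec)] for part in parts]


def wide_sum_by_key(parts, num_output_partitions):
    """Wide: records are redistributed (shuffled) by key before aggregating,
    so one output partition may receive records from every input partition."""
    shuffled = [defaultdict(float) for _ in range(num_output_partitions)]
    for part in parts:
        for key, amount in part:
            # Assumed partitioning rule for this sketch: key modulo partition count.
            shuffled[key % num_output_partitions][key] += amount
    return [sorted(bucket.items()) for bucket in shuffled]


filtered = narrow_filter(partitions, lambda rec: rec[1] >= 50.0)
totals = wide_sum_by_key(partitions, num_output_partitions=2)

print(filtered)  # partition boundaries are preserved
print(totals)    # course 1 is summed across both input partitions
```

In the narrow case the number and boundaries of partitions stay the same. In the wide case the total for course `1` needs records from both input partitions, which is the data movement that a shuffle performs.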

In [0]:
df_ = spark.createDataFrame(users)
df_.select('*').show()

# Alternatively, pass a list of column names, such as a slice of df_.columns

columns = df_.columns
df_.select(columns[2:4]).show()

+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|amount_paid|courses|customer_from|               email|first_name| id|is_customer|   last_name|    last_updated_ts|       phone_numbers|
+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|    1000.55| [1, 2]|   2021-01-15|cvandenoord0@etsy...|    Corrie|  1|       true|Van den Oord|2021-02-10 01:15:00|{+1 234 567 8923,...|
|      900.0|    [3]|   2021-02-14|nbrewitt1@dailyma...|  Nikolaus|  2|       true|     Brewitt|2021-02-18 03:33:00|{+1 234 567 8923,...|
|     850.55| [2, 4]|   2021-01-21|openney2@vistapri...|    Orelie|  3|       true|      Penney|2021-03-15 15:16:55|{+1 714 512 9752,...|
|       null|     []|         null|  amaddocks3@home.pl|     Ashby|  4|      false|    Maddocks|2021-04-10 17:45:30|        {null, null}|
|       null|     []|         null

In [0]:
df_.alias('master').select(['master.'+column for column in columns]).show()
# Passing a list of alias-qualified column names also works

+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|amount_paid|courses|customer_from|               email|first_name| id|is_customer|   last_name|    last_updated_ts|       phone_numbers|
+-----------+-------+-------------+--------------------+----------+---+-----------+------------+-------------------+--------------------+
|    1000.55| [1, 2]|   2021-01-15|cvandenoord0@etsy...|    Corrie|  1|       true|Van den Oord|2021-02-10 01:15:00|{+1 234 567 8923,...|
|      900.0|    [3]|   2021-02-14|nbrewitt1@dailyma...|  Nikolaus|  2|       true|     Brewitt|2021-02-18 03:33:00|{+1 234 567 8923,...|
|     850.55| [2, 4]|   2021-01-21|openney2@vistapri...|    Orelie|  3|       true|      Penney|2021-03-15 15:16:55|{+1 714 512 9752,...|
|       null|     []|         null|  amaddocks3@home.pl|     Ashby|  4|      false|    Maddocks|2021-04-10 17:45:30|        {null, null}|
|       null|     []|         null

### Overview of selectExpr

In [0]:
help(df_.selectExpr)  # selectExpr accepts SQL expressions, so the select output can be modified on the fly

Help on method selectExpr in module pyspark.sql.dataframe:

selectExpr(*expr: Union[str, List[str]]) -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
    Projects a set of SQL expressions and returns a new :class:`DataFrame`.
    
    This is a variant of :func:`select` that accepts SQL expressions.
    
    .. versionadded:: 1.3.0
    
    Examples
    --------
    >>> df.selectExpr("age * 2", "abs(age)").collect()
    [Row((age * 2)=4, abs(age)=2), Row((age * 2)=10, abs(age)=5)]



In [0]:
from pyspark.sql.functions import concat, lit, col
df_.select('id','first_name','last_name',concat(col('first_name'), lit(', '),col('last_name'))).show()

+---+----------+------------+---------------------------------+
| id|first_name|   last_name|concat(first_name, , , last_name)|
+---+----------+------------+---------------------------------+
|  1|    Corrie|Van den Oord|             Corrie, Van den Oord|
|  2|  Nikolaus|     Brewitt|                Nikolaus, Brewitt|
|  3|    Orelie|      Penney|                   Orelie, Penney|
|  4|     Ashby|    Maddocks|                  Ashby, Maddocks|
|  5|      Kurt|        Rome|                       Kurt, Rome|
+---+----------+------------+---------------------------------+



In [0]:
df_.selectExpr('id','first_name','last_name',"concat(first_name,', ',last_name) as name").show()

+---+----------+------------+--------------------+
| id|first_name|   last_name|                name|
+---+----------+------------+--------------------+
|  1|    Corrie|Van den Oord|Corrie, Van den Oord|
|  2|  Nikolaus|     Brewitt|   Nikolaus, Brewitt|
|  3|    Orelie|      Penney|      Orelie, Penney|
|  4|     Ashby|    Maddocks|     Ashby, Maddocks|
|  5|      Kurt|        Rome|          Kurt, Rome|
+---+----------+------------+--------------------+



There are multiple ways to rename Spark Data Frame columns or expressions.
* We can rename a column or expression using `alias` as part of `select`.
* We can add or replace a column or expression using `withColumn` on top of the Data Frame.
* We can rename one column at a time using `withColumnRenamed` on top of the Data Frame.
* We typically use `withColumn` to perform row-level transformations and then give the result a name. If we provide the name of an existing column, that column is replaced with the new one.
* If we just want to rename a column, it is better to use `withColumnRenamed`.
* If we want to apply any transformation, we need to use either `select` or `withColumn`.
* We can rename a bunch of columns at once using `toDF`.