<a href="https://colab.research.google.com/github/anil-chhetri/Miscellaneous/blob/main/ManipulatingDataframes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
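# Colab setup: install Java 8, download and unpack Spark 3.2.1, and install findspark.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
import os
# Point the environment at the Java and Spark installations from the steps above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"
import findspark
# Add the unpacked Spark distribution to sys.path so that pyspark can be imported.
findspark.init()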

In [3]:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
        .master("local")
        .appName("Colab")
        .config('spark.ui.port', '4050')
        .getOrCreate())

**String Manipulation Functions**
- Case Conversion - ``lower``, ``upper``
- Getting Length - ``length``
- Extracting substrings - ``substring``, ``split``
- Trimming - ``trim``, ``ltrim``, ``rtrim``
- Padding - ``lpad``, ``rpad``
- Concatenating strings - ``concat``, ``concat_ws``

**Date Manipulation Functions**
- Getting current date and time - ``current_date``, ``current_timestamp``
- Date Arithmetic - ``date_add``, ``date_sub``, ``datediff``, ``months_between``, ``add_months``, ``next_day``
- Beginning and Ending Date or Time - ``last_day``, ``trunc``, ``date_trunc``
- Formatting Date - ``date_format``
- Extracting Information - ``dayofyear``, ``dayofmonth``, ``dayofweek``, ``year``, ``month``

**Aggregate Functions**
- ``count``, ``countDistinct``
- ``sum``, ``avg``
- ``min``, ``max``

**Other Functions** - We will explore these as use cases arise.
- ``CASE`` and ``WHEN``
- ``CAST`` for type casting
- Functions to manage columns of special types such as ``ARRAY``, ``MAP``, and ``STRUCT``
- Many others

In [5]:
employees = [
    (1, "Scott", "Tiger", 1000.0,
     "united states", "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0,
     "India", "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0,
     "united KINGDOM", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0,
     "AUSTRALIA", "+61 987 654 3210", "789 12 6118"),
]

In [13]:
df = spark.createDataFrame(
    employees,
    schema="""employee_id INT, first_name STRING,
              last_name STRING, salary FLOAT,
              nationality STRING, phone_number STRING, ssn STRING""",
)

In [14]:
from pyspark.sql import functions as f

In [15]:
help(f.lower)

Help on function lower in module pyspark.sql.functions:

lower(col)
    Converts a string expression to lower case.
    
    .. versionadded:: 1.5



In [18]:
df.withColumn('nationality', f.lower(f.col('nationality'))).show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         india|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united kingdom|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     australia|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+

