# 0. **Install PySpark**:

In [3]:
!pip install pyspark



# 1. **Initialize Spark session**:


In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

   - `getOrCreate()` returns the existing Spark session if one is already running; otherwise it creates a new one with the application name 'SparkByExamples.com'.


# 2. **Import necessary functions**:


In [5]:
from pyspark.sql.functions import col, expr

   - `col` and `expr` are imported from `pyspark.sql.functions`: `col` references a DataFrame column by name, and `expr` turns a SQL expression string into a column.


# 3. **Define the data**:


In [6]:
data = [("2019-01-23", 1), ("2019-06-24", 2), ("2019-09-20", 3)]

   - This is a list of tuples, where each tuple contains a date string in `yyyy-MM-dd` form and an integer number of months to add to it.


# 4. **Create a DataFrame from the data**:


In [7]:
df = spark.createDataFrame(data).toDF("date", "increment")

In [10]:
df.show()

+----------+---------+
|      date|increment|
+----------+---------+
|2019-01-23|        1|
|2019-06-24|        2|
|2019-09-20|        3|
+----------+---------+



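   - This creates a Spark DataFrame from the list of tuples and names the columns "date" and "increment". Spark infers the column types from the Python values, so "date" is a string column at this point, not a date.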


# 5. **Perform the date manipulation and add a new column 'inc_date'**:


In [8]:
df_with_inc_date = df.select(
    col("date"),
    col("increment"),
    expr("add_months(to_date(date,'yyyy-MM-dd'), cast(increment as int))").alias("inc_date")
)

   - `col("date")` and `col("increment")`: Selects the "date" and "increment" columns.
   - `expr("add_months(to_date(date,'yyyy-MM-dd'), cast(increment as int))")`:
     - `to_date(date, 'yyyy-MM-dd')`: Converts the "date" string to a date object.
     - `cast(increment as int)`: Casts the "increment" value to an integer.
     - `add_months`: Adds the specified number of months to the date.
     - `.alias("inc_date")`: Assigns the result to a new column named "inc_date".


1. **`expr()`**:
   - **Purpose**: Allows you to use SQL expressions within the DataFrame API.
   - **Usage**: `expr("SQL expression")` enables you to write SQL syntax directly in your DataFrame transformations.

2. **`to_date(date, 'yyyy-MM-dd')`**:
   - **Purpose**: Converts a string column to a date column.
   - **Details**:
     - `date`: The column containing date strings.
     - `'yyyy-MM-dd'`: The format of the date strings in the `date` column.
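   - **Result**: Parses the `date` string (e.g., '2019-01-23') into a value of Spark's date type. The pattern `yyyy-MM-dd` describes how to read the input string; the resulting date carries no format of its own.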

3. **`cast(increment as int)`**:
   - **Purpose**: Converts the `increment` column values to integers.
   - **Details**:
     - `increment`: The column containing the number of months to add. Spark infers it as `bigint` from the Python integers, while `add_months` takes an `int` month count.
   - **Result**: Ensures that the `increment` values are treated as integers for the next operation.

4. **`add_months(to_date(date, 'yyyy-MM-dd'), cast(increment as int))`**:
   - **Purpose**: Adds a specified number of months to a date.
   - **Details**:
     - `to_date(date, 'yyyy-MM-dd')`: The date to which months will be added.
     - `cast(increment as int)`: The number of months to add, cast to an integer.
   - **Result**: Produces a new date by adding the `increment` number of months to the `date` column.

5. **`.alias("inc_date")`**:
   - **Purpose**: Assigns an alias (or name) to the resulting column from the expression.
   - **Details**:
     - `"inc_date"`: The name for the new column containing the resulting dates after the addition of months.
   - **Result**: The resulting column will be named `inc_date`.
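
To make the month arithmetic concrete, here is a plain-Python sketch of what adding months to a date means. It is an approximation written for illustration, not Spark's implementation; the function name `add_months_sketch` is hypothetical. It assumes the documented `add_months` behaviour that a day which does not exist in the target month is clamped to that month's last day.

```python
import calendar
from datetime import date

def add_months_sketch(start: date, months: int) -> date:
    """Shift `start` by `months` calendar months, clamping the day if needed."""
    # Count months from year 0 so that year rollover falls out of divmod.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # monthrange returns (weekday of the first day, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

rows = [("2019-01-23", 1), ("2019-06-24", 2), ("2019-09-20", 3)]
results = [add_months_sketch(date.fromisoformat(d), n).isoformat() for d, n in rows]
print(results)  # ['2019-02-23', '2019-08-24', '2019-12-20']

# The day is clamped when the target month is shorter.
clamped = add_months_sketch(date(2019, 1, 31), 1)
print(clamped)  # 2019-02-28
```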

# 6. **Show the resulting DataFrame**:


In [12]:
df_with_inc_date.show()

+----------+---------+----------+
|      date|increment|  inc_date|
+----------+---------+----------+
|2019-01-23|        1|2019-02-23|
|2019-06-24|        2|2019-08-24|
|2019-09-20|        3|2019-12-20|
+----------+---------+----------+



   - This displays the contents of the DataFrame, showing the original "date" and "increment" columns along with the new "inc_date" column.
