When we run Spark code in the `pyspark` shell, in Databricks, or in another cloud environment, a Spark session is already created and ready to use.  
This means we can start processing data right away by writing `spark.<methods>`.  

But when we build our own application to submit to a Spark cluster, we have to create the session ourselves, which starts with importing `SparkSession` from `pyspark.sql`:

In [1]:
from pyspark.sql import SparkSession

#### Logging in PySpark with Log4j

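For logging in PySpark, we work with `log4j`. We don't need to install it separately, because PySpark ships with Log4j by default. Note that Spark 3.3 and later bundle Log4j 2, whose configuration file is `log4j2.properties` with a different syntax; the examples here use the Log4j 1.x `log4j.properties` syntax.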

## What is Log4j?  
Log4j is similar to Python's `logging` module. It has three main components:

- **Logger**: Captures logging information.
- **Configuration**: Defines how logging information is captured and formatted, in the `log4j.properties` file.
- **Appender**: The output destination, that is, where the logs are written (console, file, etc.). Appenders are also configured in the `log4j.properties` file.
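
Because Log4j is compared to Python's `logging` module above, the same three roles can be sketched in pure Python. This is only an analogy, not Log4j itself; the logger name `demo.app` and the format string are illustrative assumptions.

```python
import io
import logging

# Appender equivalent: a handler decides where records go. An in-memory
# buffer is used here so the result can be inspected; sys.stderr would
# play the role of the console.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)

# Configuration equivalent: the output format (and, below, the level).
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

# Logger equivalent: the object the application calls to emit records.
logger = logging.getLogger("demo.app")  # hypothetical name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # do not also pass records to the root logger

logger.debug("dropped, below INFO")
logger.info("kept")
logger.warning("also kept")

output = buffer.getvalue()
print(output)
```

Only the `INFO` and `WARNING` records reach the handler, which mirrors how a Log4j level acts as a threshold.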

The most important part of Log4j is the **Configuration**. It lives in the `log4j.properties` file, so we need to understand this file.
- The `log4j.properties` file defines the configuration as a hierarchy.
    - The top of the hierarchy is the root category, defined by this line: `log4j.rootCategory=INFO, console`.  
        For any category, we define two things:
        - The log level: `DEBUG`, `INFO`, `WARN`, or `ERROR`, which are among the levels Log4j supports.
        - A list of appenders: here `console`, so messages at the chosen level and above go to the console.
- Next, the config file needs to define the console appender:  
    - The console appender settings look like this:
        ```
        # Define the console appender
        log4j.appender.console=org.apache.log4j.ConsoleAppender
        log4j.appender.console.target=System.err
        log4j.appender.console.layout=org.apache.log4j.PatternLayout
        log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
        ```
    These appender settings are standard; they stay almost the same in most PySpark projects.

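- Together, these two sections, the root **category** and the **appender**, set up root-level logging.  
    The root level decides which messages from Spark and other packages get through: with `INFO` as above, informational messages are still printed, while setting it to `WARN` suppresses everything except warnings and errors.  
    That gives us clean and minimal log output.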

- On top of that, we want different log settings for our own application (the PySpark code), so we define a second logger for it.  
    The logger name is `ashish.spark.example`; this is the name to use when requesting the Logger in the application. The related key `log4j.additivity.ashish.spark.example` controls whether its messages are also passed up to the appenders of the root category.

    `log4j.logger.ashish.spark.example=INFO, console, file` sets the application log level to `INFO` and sends its logs to both the **console** and a **log file** (the `file` appender has to be defined in the same configuration file).
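
To show where that logger name is used, here is a minimal sketch of a wrapper that fetches the JVM logger through an existing session. It assumes the commonly used `spark._jvm` Py4J gateway, which is a private PySpark attribute; the class name `Log4jLogger` is hypothetical.

```python
class Log4jLogger:
    """Thin wrapper around a JVM Log4j logger obtained through a SparkSession."""

    def __init__(self, spark, name="ashish.spark.example"):
        # spark._jvm is PySpark's Py4J gateway into the driver JVM. It is a
        # private attribute, so relying on it is an assumption of this sketch.
        log4j = spark._jvm.org.apache.log4j
        # The name must match the logger defined in the configuration file.
        self._logger = log4j.LogManager.getLogger(name)

    def info(self, message):
        self._logger.info(message)

    def warn(self, message):
        self._logger.warn(message)

    def error(self, message):
        self._logger.error(message)


# Usage, given an active SparkSession named `spark`:
# logger = Log4jLogger(spark)
# logger.info("Starting the application")
```

Keeping the JVM access in one small class means the rest of the application only calls `info`, `warn`, and `error`.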

## How to Use Log4j?

### 1. Create a Custom `log4j.properties` File

To customize logging behavior, create a `log4j.properties` file (or any preferred name) with the following configuration.

#### Example Configuration (`log4j.properties`):
```properties
# Set the root logger level to INFO and its appender to the console
log4j.rootCategory=INFO, console

# Define the console appender
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Optional: Set a specific logger level for Spark components
log4j.logger.org.apache.spark=INFO
```
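
For the driver JVM to pick up a custom file, Log4j is usually pointed at it with a system property. The helper below is a sketch of building that flag; the function name `log4j_java_option` is hypothetical, and the choice between the Log4j 1.x key (`log4j.configuration`) and the Log4j 2 key (`log4j.configurationFile`) depends on which Log4j version your Spark release bundles.

```python
from pathlib import Path


def log4j_java_option(properties_path, log4j2=False):
    """Build the JVM flag that points Log4j at a custom configuration file."""
    # Log4j 1.x reads -Dlog4j.configuration; Log4j 2 reads -Dlog4j.configurationFile.
    key = "log4j.configurationFile" if log4j2 else "log4j.configuration"
    # Resolve to an absolute file: URI so the JVM can locate the file.
    uri = Path(properties_path).resolve().as_uri()
    return f"-D{key}={uri}"


option = log4j_java_option("log4j.properties")
print(option)
```

The resulting string is the kind of value typically passed to `spark-submit` through `--driver-java-options`, or set under the `spark.driver.extraJavaOptions` configuration key; whether the latter takes effect depends on the driver JVM not having started yet.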


Below is how we start a SparkSession with the `builder` attribute.
- A Spark application can have only one driver, and the SparkSession is our handle on that driver, so an application works with a single active SparkSession.
- `builder` is an attribute of the `SparkSession` class; it helps create and configure the SparkSession.

In [2]:
# spark = SparkSession.builder \
#     .getOrCreate()

Spark is a highly configurable system, and so is SparkSession. Instead of the bare call above, we can set our own configuration while creating the `SparkSession`.

Let's add some config to the SparkSession:

In [3]:
spark = SparkSession.builder \
        .appName("1_basics_pyspark") \
        .master("local[3]") \
        .getOrCreate()

25/02/20 17:30:34 WARN Utils: Your hostname, lenovo resolves to a loopback address: 127.0.1.1; using 192.168.29.125 instead (on interface wlp3s0)
25/02/20 17:30:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/02/20 17:30:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/02/20 17:30:37 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


So, the code above works like this:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("1_basics_pyspark")  # Spark application name; think of it like a log file name in the logging module. For an application that fetches GCP billing data, I would use .appName("GCPBILLING")
    .master("local[3]")  # Cluster manager to use; this setting is known as the Spark master
    .getOrCreate()
)
```
Since my code runs in a local environment, I set the master to `local[3]`, **which means Spark runs locally in a single multithreaded JVM with 3 threads**.

That is how we create a Spark session, which means we have created our Spark driver and can now use it for data processing.
And **once we are done with data processing, we should stop the driver**:

In [4]:
# Stopping the driver
spark.stop()
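
One way to make sure the driver is always stopped, even when the processing code raises an error, is to wrap the session in a context manager. This is a minimal sketch; the helper name `managed_session` is hypothetical, and it accepts any builder object that exposes `getOrCreate()`.

```python
from contextlib import contextmanager


@contextmanager
def managed_session(builder):
    """Yield a session from the given builder and always stop it afterwards."""
    spark = builder.getOrCreate()
    try:
        yield spark
    finally:
        spark.stop()  # runs even if the processing code raises


# Usage with the builder from this notebook:
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("1_basics_pyspark").master("local[3]")
# with managed_session(builder) as spark:
#     spark.range(5).show()
```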