In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder \
            .appName("2_setting_log4j_in_spark_application") \
            .master("local[3]") \
            .config("spark.driver.extraJavaOptions", "-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=ashish-spark") \
            .getOrCreate()


25/02/20 18:19:47 WARN Utils: Your hostname, lenovo resolves to a loopback address: 127.0.1.1; using 192.168.29.125 instead (on interface wlp3s0)
25/02/20 18:19:47 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
25/02/20 18:19:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Above, I've added one more config setting when creating the `SparkSession`.  
So, when we add this:  
```
.config("spark.driver.extraJavaOptions", "-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=ashish-spark")
```
- you're instructing the Spark driver JVM to use additional system properties at startup.  
**These properties include:**
    - `-Dlog4j.configuration=file:log4j.properties`
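    This tells the JVM to load the Log4J configuration from `log4j.properties`, a file I created to define the logging setup. **We keep this log4j.properties at the root level of the Spark application.** The file controls how logging is handled (e.g., where log messages are written). Note that `log4j.configuration` is the Log4j 1.x property; Spark 3.3 and later ship Log4j 2, which reads `log4j2.configurationFile` and a `log4j2.properties` file instead.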

    - `-Dspark.yarn.app.container.log.dir=app-logs`
    This sets the directory where log files are stored, so `app-logs` becomes the name of the log folder. In a cluster environment (like YARN), this variable helps manage log files across containers, but it's also useful locally to define a consistent log directory.

    - `-Dlogfile.name=ashish-spark`
    This sets a custom property (used in your log4j.properties file) to define the log file name. For example, if your file appender is configured as:
    ```
    log4j.appender.file.File=${spark.yarn.app.container.log.dir}/${logfile.name}.log
    ```
    then the logs are written to `app-logs/ashish-spark.log`.

**In summary, this configuration ensures that your Spark application uses your custom Log4J settings for logging, allowing you to control where and how logs are written, even when running in a VS Code notebook environment.**
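
To make the placeholder substitution concrete, here is a small Python sketch that mimics how the `${...}` references resolve from the `-D` options. It is only an illustration: the real lookup happens inside Log4J on the JVM, and the helper names `parse_system_properties` and `resolve_placeholders` are hypothetical, not part of Spark or Log4J.

```python
import re

# The same option string that was passed to spark.driver.extraJavaOptions.
EXTRA_JAVA_OPTIONS = (
    "-Dlog4j.configuration=file:log4j.properties "
    "-Dspark.yarn.app.container.log.dir=app-logs "
    "-Dlogfile.name=ashish-spark"
)


def parse_system_properties(options: str) -> dict:
    """Collect every -Dkey=value token into a dict (hypothetical helper)."""
    props = {}
    for token in options.split():
        if token.startswith("-D") and "=" in token:
            key, value = token[2:].split("=", 1)
            props[key] = value
    return props


def resolve_placeholders(template: str, props: dict) -> str:
    """Replace each ${key} with props[key]; unknown keys are left untouched."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda match: props.get(match.group(1), match.group(0)),
        template,
    )


props = parse_system_properties(EXTRA_JAVA_OPTIONS)
log_path = resolve_placeholders(
    "${spark.yarn.app.container.log.dir}/${logfile.name}.log", props
)
print(log_path)  # app-logs/ashish-spark.log
```

The relative path is resolved against the working directory of the driver process, so in a notebook the `app-logs` folder would normally appear next to the notebook file.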


In [3]:
# import the custom Log4J wrapper class that I wrote
from helpers.logger import Log4J

In [4]:
logger = Log4J(spark_session=spark)
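
The `helpers/logger.py` module is not shown in this notebook. As a rough idea of what such a wrapper could look like, here is a minimal sketch. It assumes the class fetches a Log4J logger from the driver JVM through `spark._jvm` and names it after the Spark application, which would match the logger name seen in the output below. The method set (`info`, `warn`, `error`, `debug`) is also an assumption, so the actual class may differ.

```python
class Log4J:
    """Thin Python wrapper around a JVM Log4J logger (a sketch, not the actual helper)."""

    def __init__(self, spark_session):
        # Reach the Log4J classes inside the driver JVM through the Py4J gateway.
        log4j = spark_session._jvm.org.apache.log4j
        # Assumption: the logger is named after the Spark application.
        app_name = spark_session.sparkContext.getConf().get("spark.app.name")
        self.logger = log4j.LogManager.getLogger(app_name)

    def info(self, message):
        self.logger.info(message)

    def warn(self, message):
        self.logger.warn(message)

    def error(self, message):
        self.logger.error(message)

    def debug(self, message):
        self.logger.debug(message)
```

Because the messages go through the JVM logger, they follow the same appenders and formatting as Spark's own log lines instead of Python's `print` or `logging` output.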

Now I can test a few logging statements from PySpark.

In [5]:
logger.info("First Log")
logger.info("Last log and stopping spark session")

25/02/20 18:19:51 INFO 2_setting_log4j_in_spark_application: First Log
25/02/20 18:19:51 INFO 2_setting_log4j_in_spark_application: Last log and stopping spark session
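
If `log4j.properties` defines a file appender like the one shown earlier, the same messages should also end up in `app-logs/ashish-spark.log`. A quick way to check from the notebook is sketched below. The `tail_log` helper is hypothetical, and the path assumes the appender pattern from the example above.

```python
from pathlib import Path


def tail_log(path, n=5):
    """Return the last n lines of a log file, or an empty list if it does not exist yet."""
    log_file = Path(path)
    if not log_file.exists():
        return []
    return log_file.read_text(encoding="utf-8").splitlines()[-n:]


# Assumed location, based on the -D options passed to the driver.
for line in tail_log("app-logs/ashish-spark.log"):
    print(line)
```

An empty result usually means the file appender is not configured or the path resolved relative to a different working directory.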


In [6]:
spark.stop()