## Zero Class Features DataFrame Creation

### Data Seperation

Given your PC's resources (128 GB of RAM and 32 cores), you can optimize the Spark session configuration for better performance when processing large data, particularly a Parquet file and repartitioning it into 184 partitions.

### Suggested Configuration:

1. **`spark.driver.memory`**:
   - You have 128 GB of RAM, so it's safe to allocate a substantial portion to the driver. However, 24 GB is already a good starting point, as the driver does not need as much memory as executors. You can keep this value at `24g` or adjust based on job requirements.

2. **`spark.executor.memory`**:
   - Each executor should get a reasonable amount of memory without causing memory overuse. With 128 GB of RAM, allocating 4 GB to each executor is reasonable, but since you have 32 cores available, you can increase it to 8-16 GB per executor to fully utilize your machine's capacity.

3. **`spark.executor.instances`**:
   - You can set this to 8, as you're working with 32 cores. With 8 executors, each executor will get access to 4 cores (`32 cores / 8 executors = 4 cores per executor`), which ensures that each executor is properly utilized.

4. **`spark.executor.cores`**:
   - With 32 cores available and a total of 8 executors, allocating 4 cores per executor is a good choice, as mentioned above. This ensures that you are using all cores efficiently.

5. **`spark.driver.maxResultSize`**:
   - With a 128 GB machine, 4 GB for max result size should suffice, as the driver is mainly responsible for coordinating the job, not processing large data.

6. **`master`**:
   - Using `local[32]` will allow Spark to utilize all the cores on your machine for parallel computation. The `[*]` setting would use all cores, but specifying `local[32]` can give you more control over how many cores to allocate. 

7. **Repartitioning**:
   - When repartitioning to 184 partitions, ensure that you have enough executors to handle the task. With 8 executors and 184 partitions, each executor will handle around 23 partitions (`184 partitions / 8 executors ≈ 23 partitions per executor`). This will help in balancing the load effectively.

### Updated Configuration:

```python
spark = (
    SparkSession
    .builder
    .appName("leash belka3")
    .config("spark.driver.memory", "24g")  # Driver memory (sufficient for your system)
    .config("spark.executor.memory", "16g")  # Increased executor memory (optimal for large file)
    .config("spark.executor.instances", "8")  # 8 executors to utilize 32 cores
    .config("spark.executor.cores", "4")  # 4 cores per executor (balanced with 8 executors)
    .config("spark.driver.maxResultSize", "4g")  # Max result size for the driver
    .master("local[32]")  # Utilize all 32 cores
    .getOrCreate()
)
```

In [1]:
from pyspark.sql import SparkSession

In [2]:
# for 256 Gb and 64 Cores

spark = (
    SparkSession
    .builder
    .appName("leash belka3")
    .config("spark.driver.memory", "24g")  # Driver memory
    .config("spark.executor.memory", "32g")  # Executor memory (increased for large dataset)
    .config("spark.executor.instances", "16")  # Number of executors (16 executors for 64 cores)
    .config("spark.executor.cores", "4")  # Executor cores (4 cores per executor)
    .config("spark.driver.maxResultSize", "4g")  # Max result size for driver
    # .config("spark.local.dir", "temp")  # Specify a directory with enough space
    .config("spark.shuffle.file.buffer", "128k")  # Shuffle buffer size
    .config("spark.memory.fraction", "0.8")  # Spark memory fraction (80% of executor memory)
    .config("spark.shuffle.memoryFraction", "0.4")  # Shuffle memory fraction (40% of executor memory)
    .master("local[64]")  # Use all 64 cores on the machine
    .getOrCreate()
)

spark

24/12/24 00:51:51 WARN Utils: Your hostname, kanjur resolves to a loopback address: 127.0.1.1; using 10.119.2.14 instead (on interface eno3)
24/12/24 00:51:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/24 00:51:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# spark.stop()

In [4]:
df = spark.read.format('parquet').load('chunks_output')

print(df.rdd.getNumPartitions())
print(df.count())
df.show()

                                                                                

74


                                                                                

295246830
+---------+--------------------+--------------------+--------------------+--------------------+-------+---+
|       id|                 bb1|                 bb2|                 bb3|            molecule|protein|  y|
+---------+--------------------+--------------------+--------------------+--------------------+-------+---+
|107000000|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 1, 1, 1...|[5, 0, 2, 2, 2, 3...|      3|  0|
|107000001|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 0, 0, 0...|[5, 0, 2, 1, 1, 2...|      1|  0|
|107000002|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 0, 0, 0...|[5, 0, 2, 1, 1, 2...|      2|  0|
|107000003|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 0, 0, 0...|[5, 0, 2, 1, 1, 2...|      3|  0|
|107000004|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 0, 0, 0...|[5, 0, 2, 1, 1, 2...|      1|  0|
|107000005|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[2, 0, 0, 0, 0, 0...|[5, 0, 2, 1, 1, 2...|      2|  0|
|107000006|[6, 0, 

In [5]:
df0 = df.where('y == 0')
df0 = df0.repartition(184)

# print(df0.rdd.getNumPartitions())
# print(df0.count())
# df0.select('y').distinct().show()

In [None]:
df0.show()

In [None]:
print(df0.rdd.getNumPartitions())

In [6]:
df0.write.format('parquet').mode('overwrite').option('header', True).save('zero.parquet')

                                                                                

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

### Feature Dataset Creation

In [1]:
train_len = 295246830
one_len = 1589906
zero_len = 293656924
protein_map = {'BRD4': 1, 'HSA': 2, 'sEH': 3}
vocab = {'C': 6825082866, '#': 81527490, '@': 511451694, 'H': 456489972, '=': 1406606874, 'O': 2554179786, 
         'N': 2469595230, 'c': 12257477022, '-': 438483636, '.': 216945504, 'l': 491088828, 'B': 123330132, 
         'r': 121915914, 'n': 1997759694, 'D': 295246830, 'y': 295246830, 'o': 67918650, 's': 156618468, 
         'S': 90662574, 'F': 492710238, '+': 65206260, 'i': 1414026, '/': 11547096, 'I': 23972994}

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

from pyspark.sql.functions import col
from pyspark.sql.types import LongType, IntegerType, StructType, StructField

In [3]:
# for 128 Gb and 32 Cores
# spark = (
#     SparkSession
#     .builder
#     .appName("leash belka3")
#     .config("spark.driver.memory", "16g")
#     .config("spark.executor.memory", "16g")
#     .config("spark.executor.instances", "4")
#     .config("spark.executor.cores", "4")
#     .config("spark.driver.maxResultSize", "4g")
#     .master("local[*]")
#     .getOrCreate()
# )

# spark

# for 256 Gb and 64 Cores
spark = (
    SparkSession
    .builder
    .appName("leash belka3")
    .config("spark.driver.memory", "48g")  # Increased driver memory
    .config("spark.executor.memory", "48g")  # Increased executor memory
    .config("spark.executor.instances", "16")  # 16 executors
    .config("spark.executor.cores", "4")  # 4 cores per executor
    .config("spark.driver.maxResultSize", "4g")  # Driver result size limit
    .config("spark.local.dir", "temp")  # Specify a directory with enough space
    .config("spark.shuffle.file.buffer", "128k")  # Shuffle buffer size
    .config("spark.memory.fraction", "0.8")  # Memory fraction for tasks
    .config("spark.shuffle.memoryFraction", "0.6")  # Shuffle memory fraction
    .config("spark.executor.javaOptions", "-Xmx48g")  # JVM heap size for executors
    .master("local[64]")  # Use all 64 cores on the machine
    .getOrCreate()
)

spark

24/12/24 01:26:39 WARN Utils: Your hostname, kanjur resolves to a loopback address: 127.0.1.1; using 10.119.2.14 instead (on interface eno3)
24/12/24 01:26:39 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/24 01:26:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/24 01:26:40 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).


In [4]:
df0 = spark.read.format('parquet').load('zero.parquet')

print(df0.rdd.getNumPartitions())
print(df0.count())
df0.show()

                                                                                

92


                                                                                

293656924
+---------+--------------------+--------------------+--------------------+--------------------+-------+---+
|       id|                 bb1|                 bb2|                 bb3|            molecule|protein|  y|
+---------+--------------------+--------------------+--------------------+--------------------+-------+---+
|101628733|[6, 0, 2, 1, 3, 6...|[1, 0, 0, 0, 2, 2...|[5, 0, 0, 0, 0, 0...|[7, 0, 2, 1, 4, 5...|      2|  0|
|101298451|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[3, 0, 0, 0, 0, 3...|[7, 0, 2, 1, 1, 5...|      2|  0|
|101166654|[6, 0, 2, 1, 2, 4...|[1, 0, 0, 0, 0, 1...|[4, 0, 0, 0, 1, 1...|[8, 0, 2, 1, 2, 3...|      1|  0|
|107432121|[6, 0, 2, 1, 2, 4...|[2, 0, 0, 1, 0, 0...|[6, 0, 0, 0, 0, 0...|[8, 0, 1, 2, 1, 1...|      1|  0|
|107230085|[6, 0, 2, 1, 2, 4...|[8, 0, 0, 0, 1, 1...|[1, 0, 0, 0, 0, 0...|[10, 0, 2, 1, 2, ...|      3|  0|
|107926935|[6, 0, 2, 1, 2, 4...|[8, 0, 0, 0, 0, 2...|[9, 0, 2, 1, 0, 2...|[20, 0, 4, 2, 1, ...|      1|  0|
|263512551|[9, 0, 

In [5]:
cols = []
for i in range(24):
    cols.append(col('bb1').getItem(i).alias(f'a{i+1}'))
    cols.append(col('bb2').getItem(i).alias(f'b{i+1}'))
    cols.append(col('bb3').getItem(i).alias(f'c{i+1}'))
    cols.append(col('molecule').getItem(i).alias(f'd{i+1}'))

schema = StructType([
    StructField('id', LongType(), True),
    StructField('protein', IntegerType(), True),
    *[StructField(f'a{i+1}', IntegerType(), True) for i in range(24)],
    *[StructField(f'b{i+1}', IntegerType(), True) for i in range(24)],
    *[StructField(f'c{i+1}', IntegerType(), True) for i in range(24)],
    *[StructField(f'd{i+1}', IntegerType(), True) for i in range(24)],
    StructField('y', IntegerType(), True)
])

df0 = df0.select('id', 'protein', *cols, 'y')
df0 = spark.createDataFrame(df0.rdd, schema)

In [6]:
df0.first()

24/12/24 01:02:05 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

Row(id=101628733, protein=2, a1=6, a2=1, a3=5, a4=7, a5=0, a6=0, a7=0, a8=0, a9=2, a10=0, a11=0, a12=2, a13=1, a14=0, a15=0, a16=1, a17=3, a18=2, a19=0, a20=4, a21=6, a22=2, a23=0, a24=5, b1=2, b2=1, b3=1, b4=5, b5=18, b6=6, b7=5, b8=20, b9=2, b10=0, b11=0, b12=1, b13=0, b14=0, b15=2, b16=0, b17=0, b18=0, b19=2, b20=0, b21=0, b22=0, b23=0, b24=0, c1=0, c2=0, c3=0, c4=0, c5=0, c6=0, c7=1, c8=4, c9=0, c10=0, c11=0, c12=1, c13=0, c14=0, c15=0, c16=1, c17=0, c18=0, c19=0, c20=0, c21=0, c22=0, c23=0, c24=0, d1=0, d2=1, d3=0, d4=1, d5=0, d6=0, d7=0, d8=0, d9=1, d10=0, d11=0, d12=1, d13=0, d14=0, d15=0, d16=0, d17=0, d18=0, d19=0, d20=0, d21=0, d22=0, d23=0, d24=0, y=0)

In [7]:
print(df0.rdd.getNumPartitions())

92


In [7]:
df0 = df0.repartition(184)

In [9]:
print(df0.rdd.getNumPartitions())

[Stage 6:>                                                        (0 + 64) / 92]





184


In [8]:
df0.write.format('parquet').mode('overwrite').option('header', True).save('zero_features.parquet')

24/12/24 01:27:30 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

In [None]:
df0_features = spark.read.format('parquet').load('zero_features.parquet')

print(df0_features.rdd.getNumPartitions())
print(df0_features.count())
df0_features.printSchema()

In [None]:
df0_features.show()

////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////