# From nominal values to numerical values - Spark example (PySpark)


This is a practical example of how to transform nominal values to numerical values using Spark. 

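The main idea is to reuse the hash-and-modulo scheme behind the **Spark default partitioning problem**: how do I choose which keys go to one reducer and which go to another, and what is the fastest way to decide? Hashing the key and taking the result modulo the number of buckets answers both questions, and the same trick turns a nominal value into a number.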
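The hash-and-modulo idea can be sketched in plain Python, without Spark. This is only an illustration: `partition_for` is a hypothetical helper, and `zlib.crc32` is a deterministic stand-in for whatever hash function a real partitioner uses.

```python
import zlib

def partition_for(key, num_partitions):
    """Return the partition index for a key: hash(key) mod num_partitions."""
    # Assumption: zlib.crc32 stands in for the partitioner's real hash function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = ["apple", "banana", "cherry", "apple"]
assignments = [(k, partition_for(k, 4)) for k in keys]
print(assignments)
```

The same key always lands in the same partition, and computing the index needs no lookup table, which is what makes the scheme fast.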

Spark configuration and file loading:

In [7]:
import findspark
from pyspark.sql import SparkSession

findspark.init("/usr/local/spark")
spark = SparkSession.builder \
   .master("local[*]") \
   .appName("Test") \
   .getOrCreate()
sc = spark.sparkContext

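import os

file_path = "/Users/Desktop/data"
# os.listdir returns bare file names, so join each one with the directory.
# The [1:] slice skips the first listed entry (listing order is OS-dependent).
files = [os.path.join(file_path, name) for name in os.listdir(file_path)[1:]]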

Apply some transformations (this is just an example):

In [None]:
def apply_preprocessing(rdd):
    '''
    Drop the header line, parse each line into a pair, and count each pair.

    :param rdd: RDD of text lines to transform
    :type rdd: pyspark.rdd.RDD
    :return: RDD of ((a, b), count) records
    :rtype: pyspark.rdd.RDD
    '''
    header = rdd.first()
    rdd = rdd.filter(lambda line: line != header)
    # Assumption: each line holds two comma-separated fields, a and b.
    rdd = rdd.map(lambda line: tuple(line.split(",")[:2]))
    rdd = rdd.map(lambda ab: (ab, 1))
    rdd = rdd.reduceByKey(lambda a, b: a + b)
    return rdd

def main():
    # Preprocess every file, then merge the results and re-aggregate
    # pairs that occur in more than one file.
    rdds = [apply_preprocessing(sc.textFile(path)) for path in files]
    return sc.union(rdds).reduceByKey(lambda a, b: a + b)

rdd = main()

To generate a numeric ID for each instance, I first need the total number of distinct instances.
* *b* is the value to transform.
* *(There are other ways to count the total number of instances.)*

In [None]:
# Each record is ((a, b), count); count the distinct values of b.
tot_istances = rdd.map(lambda kv: kv[0][1]).distinct().count()

In [None]:
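def java_string_hashcode(s):
    # Reproduce Java's String.hashCode(): a signed 32-bit polynomial hash.
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return ((h + 0x80000000) & 0xFFFFFFFF) - 0x80000000

def get_hash(istance):
    # Clear the sign bit (0x7FFFFFFF is Java's Integer.MAX_VALUE; Python 3
    # has no sys.maxint), then reduce modulo the number of distinct values.
    return (java_string_hashcode(istance) & 0x7FFFFFFF) % tot_istances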
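As a sanity check, the hash can be run on a couple of short strings outside Spark. The sketch below repeats the hash function so it runs on its own. `bucket` is a hypothetical helper that mirrors the mask-and-modulo step, and the bucket count of 10 is an arbitrary choice for illustration.

```python
def java_string_hashcode(s):
    """Signed 32-bit polynomial hash, as in Java's String.hashCode()."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return ((h + 0x80000000) & 0xFFFFFFFF) - 0x80000000

def bucket(value, num_buckets):
    """Hypothetical helper: clear the sign bit, then reduce modulo num_buckets."""
    return (java_string_hashcode(value) & 0x7FFFFFFF) % num_buckets

print(java_string_hashcode("hello"))   # 99162322
print(java_string_hashcode("hello!"))  # -1220935281 (wrapped past 2**31)
print(bucket("hello", 10))             # 2
print(bucket("hello!", 10))            # 7
```

The second string shows why the mask is needed: its hash is negative, and clearing the sign bit first keeps the resulting ID non-negative.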

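Getting the numeric ID (two different values can collide on the same ID, because the hash is reduced modulo the number of distinct values):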

In [None]:
# Each record is ((a, b), count); replace b with its numeric ID.
rdd_final = rdd.map(lambda kv: (kv[0][0], get_hash(kv[0][1]), kv[1]))

Checking the results:

In [None]:
rdd_final.take(5)

In [None]:
rdd.take(5)
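Since the IDs come from a hash reduced modulo a fairly small number, it is also worth checking whether two different values ended up with the same ID. A minimal plain-Python sketch of such a check follows. `find_collisions` and `toy_id` are hypothetical names, and `toy_id` is a deliberately simple stand-in for the real ID function.

```python
from collections import defaultdict

def find_collisions(values, id_of):
    """Group distinct values by ID and keep only the IDs shared by several values."""
    by_id = defaultdict(set)
    for value in values:
        by_id[id_of(value)].add(value)
    return {i: sorted(vs) for i, vs in by_id.items() if len(vs) > 1}

def toy_id(value):
    # Hypothetical toy ID function, used only to keep the example deterministic.
    return len(value) % 3

collisions = find_collisions(["red", "green", "blue", "cyan"], toy_id)
print(collisions)  # {1: ['blue', 'cyan']}
```

Assuming the distinct values of *b* fit in driver memory, the same helper could be called with the collected values and `get_hash` in place of `toy_id`.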