## Entropy stub
@Citation: [(Wiki) Entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory))  

### Why entropy?

Entropy: $$H(X) = -\sum_i P(x_i)\,\log P(x_i)$$

Entropy measures the amount of variation (or surprise) in a signal. It is useful for categorical variables and can give a rough indication of how informative a categorical column is likely to be as a predictor. The intuition is that a low-probability event carries more information when it occurs.
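To make the "low probability means more information" intuition concrete, here is a minimal sketch of the per-event information content, -log(p). The helper name `surprisal` is an illustrative assumption, and the natural log is assumed (values in nats).

```python
import math

def surprisal(p):
    """Information content -log(p) of an event with probability p (natural log, nats)."""
    return -math.log(p)

# A rare event (p = 0.01) versus a common one (p = 0.9); values chosen for illustration.
rare = surprisal(0.01)
common = surprisal(0.9)
print(f"rare: {rare:.3f} nats, common: {common:.3f} nats")
```

Entropy is the probability-weighted average of this quantity over all events.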

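If the column has little variation (one category dominates), the entropy will be close to 0. If the values are perfectly random, that is, spread uniformly over k categories, the entropy reaches its maximum of log(k). For a binary variable the peak is at p = 0.5, as the curve below shows.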
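A minimal sketch for the two-outcome case: entropy is 0 when one outcome is certain and largest when both outcomes are equally likely. The function name `binary_entropy` is an illustrative assumption, and base 2 is assumed here so the peak is 1 bit (`entropyCat` in this notebook uses the natural log instead).

```python
import math

def binary_entropy(p, base=2):
    """Entropy of a two-outcome variable where one outcome has probability p."""
    if p in (0.0, 1.0):
        return 0.0  # convention: 0 * log(0) is treated as 0
    q = 1.0 - p
    return -(p * math.log(p, base) + q * math.log(q, base))

# Sweep a few probabilities; the value rises towards p = 0.5 and falls symmetrically.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))
```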

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Binary_entropy_plot.svg/1920px-Binary_entropy_plot.svg.png" alt="Entropy curve" width="200" align="left">

In [None]:
import pandas as pd
import findspark
findspark.init()
import pyspark
import pyspark.sql.functions as psq

In [None]:
def entropyCat(sdf, catcolname):
    """
    Return the Shannon entropy (natural log, so in nats) of a categorical column.
    
    Inputs:
    sdf - spark data frame
    catcolname - string column name of categorical column
    
    Outputs:
    entropy - float / numeric entropy value calculated
    
    Usage:
    entValue = entropyCat(data, 'someColumn')
    """
    total = sdf.count()
    tmp = (sdf
       .groupby(psq.col(catcolname).alias('catCol')).count()
       .withColumn('Probability', psq.col('count')/total)
       .withColumn('ProbintoLogProb', psq.col('Probability')*psq.log(psq.col('Probability')))
      )
    return tmp.select(psq.sum('ProbintoLogProb')).collect()[0][0]*-1
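For small samples it can help to cross-check the Spark result against a plain-Python calculation. The sketch below mirrors the same count, probability, and p*log(p) steps on an in-memory list. The name `entropy_local` and the sample lists are hypothetical, and the natural log is assumed to match `psq.log`.

```python
import math
from collections import Counter

def entropy_local(values):
    """Entropy (natural log) of an in-memory iterable of categorical values."""
    counts = Counter(values)             # same role as groupby(...).count()
    total = sum(counts.values())         # same role as sdf.count()
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical sample columns: one constant, one uniform over four categories.
constant = ["a"] * 8
uniform = ["a", "b", "c", "d"] * 2
print(entropy_local(constant), entropy_local(uniform), math.log(4))
```

The constant column should come out at 0 and the uniform one at log(4), the two extremes for a four-category column.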