
# Basic Statistics and Data Types

## Hypothesis Testing 

## Lesson Objectives 

After completing this lesson, you should be able to:

- Perform hypothesis testing for goodness of fit and independence
- Perform hypothesis testing for equality of probability distributions
- Perform kernel density estimation

## Hypothesis Testing 

- Used to determine whether a result is statistically significant, that is, whether it is unlikely to have occurred by chance alone
- Supported tests:
  - Pearson's Chi-Squared test for goodness of fit
  - Pearson's Chi-Squared test for independence
  - Kolmogorov-Smirnov test for equality of distribution
- The Chi-Squared test for independence also accepts inputs of type `RDD[LabeledPoint]`, enabling feature selection
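The test summaries printed in this lesson end with a verbal reading of the p-value, such as "No presumption against null hypothesis". As a rough illustration in plain Scala, a p-value could be mapped to such a phrase as follows. The helper name `presumption` is hypothetical, and the cut-offs 0.01, 0.05 and 0.1 are assumed conventional significance levels, not values stated in this lesson.

```scala
// Hypothetical helper: maps a p-value to a verbal strength of evidence.
// The cut-offs 0.01, 0.05 and 0.1 are assumptions (conventional levels).
def presumption(pValue: Double): String =
  if (pValue <= 0.01) "Very strong presumption against null hypothesis"
  else if (pValue <= 0.05) "Strong presumption against null hypothesis"
  else if (pValue <= 0.1) "Low presumption against null hypothesis"
  else "No presumption against null hypothesis"

// A small p-value argues against the null hypothesis; a large one does not.
val labels = Seq(0.0, 0.03, 0.08, 0.9995).map(presumption)
```

A small p-value means the observed result would be unlikely if the null hypothesis were true, so the smaller the value, the stronger the case against the null hypothesis.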


### Pearson's Chi-Squared Test for Goodness of Fit 

- Determines whether an observed frequency distribution differs from a given distribution
- Requires an input of type `Vector` containing the frequencies of the events
- Runs against a uniform distribution if a second vector to test against is not supplied
- Available as the `chiSqTest()` function in `Statistics`
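To make the statistic less of a black box, here is a minimal plain-Scala sketch (no Spark required) of the Pearson statistic against a uniform expectation. It uses the same frequencies as the `vec` example in this lesson; the names `observed`, `expected`, `gofStatistic` and `gofDegreesOfFreedom` are illustrative only, and the sketch does not compute the p-value.

```scala
// Observed frequencies, the same values used for `vec` in this lesson.
val observed = Array(0.3, 0.2, 0.15, 0.1, 0.1, 0.1, 0.05)

// With no expected vector supplied, the comparison is against a uniform
// distribution: every category expects the same share of the total.
val expected = Array.fill(observed.length)(observed.sum / observed.length)

// Pearson statistic: sum over categories of (observed - expected)^2 / expected.
val gofStatistic = observed.zip(expected).map { case (o, e) =>
  (o - e) * (o - e) / e
}.sum

// Degrees of freedom: number of categories minus one.
val gofDegreesOfFreedom = observed.length - 1
```

For these frequencies the statistic comes out at about 0.295 with 6 degrees of freedom, which agrees with the summary Spark prints for the same vector.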



### Libraries required for examples

In [1]:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

import org.apache.spark.mllib.stat.Statistics

val vec: Vector = Vectors.dense(0.3, 0.2, 0.15, 0.1, 0.1, 0.1, 0.05)

val goodnessOfFitTestResult = Statistics.chiSqTest(vec)

goodnessOfFitTestResult

vec = [0.3,0.2,0.15,0.1,0.1,0.1,0.05]
goodnessOfFitTestResult = 


Chi squared test summary:
method: pearson
degrees of freedom = 6
statistic = 0.295
pValue = 0.999520973435643
No presumption against null hypothesis: observed follows the same distribution as expected..


### Pearson's Chi-Squared Test for Independence

-	Determines whether unpaired observations on two variables are independent of each other 
-	Requires an input of type Matrix, representing a contingency table, or an `RDD[LabeledPoint]`
-	Available as `chiSqTest()` function in Statistics 
-	May be used for feature selection
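The independence test can also be worked through by hand. The following plain-Scala sketch (no Spark required) computes the expected counts and the Pearson statistic for the contingency table used as `mat` in this lesson; the variable names are illustrative, and the p-value is not computed.

```scala
// Contingency table from this lesson's `mat`, written row by row
// (3 rows, 2 columns).
val table: Array[Array[Double]] = Array(
  Array(13.0, 80.0),
  Array(47.0, 11.0),
  Array(40.0, 9.0))

val rowSums = table.map(_.sum)
val colSums = table.transpose.map(_.sum)
val total = rowSums.sum

// Under independence, the expected count of a cell is
// rowSum * colSum / total.
val indStatistic = (for {
  i <- table.indices
  j <- table(i).indices
} yield {
  val e = rowSums(i) * colSums(j) / total
  val d = table(i)(j) - e
  d * d / e
}).sum

// Degrees of freedom: (rows - 1) * (columns - 1).
val indDegreesOfFreedom = (table.length - 1) * (table.head.length - 1)
```

For this table the statistic is about 90.226 with 2 degrees of freedom, matching the Spark summary in this lesson.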

In [2]:
// Testing for Independence 

import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.stat.Statistics 
import org.apache.spark.rdd.RDD

val mat: Matrix = Matrices.dense(3, 2,
Array(13.0, 47.0, 40.0, 80.0, 11.0, 9.0))

val independenceTestResult = Statistics.chiSqTest(mat)
independenceTestResult

mat = 
13.0  80.0
47.0  11.0
40.0  9.0
independenceTestResult = 
Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 90.22588968846716
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 90.22588968846716
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..


In [3]:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.test.ChiSqTestResult

val obs: RDD[LabeledPoint] = sc.parallelize(Array(
    LabeledPoint(0, Vectors.dense(1.0, 2.0)),
    LabeledPoint(0, Vectors.dense(0.5, 1.5)),
    LabeledPoint(1, Vectors.dense(1.0, 8.0))))

val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
featureTestResults

obs = ParallelCollectionRDD[0] at parallelize at <console>:36
featureTestResults = 


Array(Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.75
pValue = 0.3864762307712326
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0000000000000004
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..)
res14: Array[org.apache.sp...


### Kolmogorov-Smirnov Test

- Determines whether or not two probability distributions are equal, here the distribution of a sample and a theoretical distribution
- One-sample, two-sided test
- Supported distributions to test against:
  - normal distribution (`distName = "norm"`)
  - custom cumulative distribution function (CDF)
- Available as the `kolmogorovSmirnovTest()` function in `Statistics`
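The statistic behind the test is the largest gap between the empirical CDF of the sample and the theoretical CDF. Below is a minimal plain-Scala sketch of that computation (no Spark required). The function `ksStatistic` and the `uniformCdf` used as a custom CDF are hypothetical names chosen for illustration, and the sketch does not compute the p-value.

```scala
// One-sample, two-sided KS statistic: the largest gap between the empirical
// CDF of the sample and the theoretical CDF.
def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
  require(sample.nonEmpty, "sample must not be empty")
  val sorted = sample.sorted
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    val f = cdf(x)
    // The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
    math.max((i + 1) / n - f, f - i / n)
  }.max
}

// Hypothetical custom CDF: the uniform distribution on [0, 1].
val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))

val d = ksStatistic(Seq(0.7, 0.1, 0.4), uniformCdf)
```

For the three-point sample above, the largest gap occurs at 0.7, where the empirical CDF reaches 1 while the uniform CDF is 0.7, giving a statistic of about 0.3.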

In [4]:
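// Test for Equality of Distribution: normally distributed sample
import org.apache.spark.mllib.random.RandomRDDs.normalRDD

// 100 draws from the standard normal distribution
val data: RDD[Double] = normalRDD(sc, size=100, numPartitions=1, seed=13L)

// Compare the sample against a normal distribution with mean 0 and standard deviation 1
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)

// Test for Equality of Distribution: uniformly distributed sample

import org.apache.spark.mllib.random.RandomRDDs.uniformRDD

// 100 draws from the uniform distribution on [0, 1]
val data1: RDD[Double] = uniformRDD(sc, size = 100, numPartitions=1, seed=13L)

// The same comparison against the standard normal should now reject the null hypothesis
val testResult1 = Statistics.kolmogorovSmirnovTest(data1, "norm", 0, 1)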

data = RandomRDD[5] at RDD at RandomRDD.scala:42
testResult = 
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.12019890461912125
pValue = 0.10230385223938121
No presumption against null hypothesis: Sample follows theoretical distribution.
data1 = RandomRDD[10] at RDD at RandomRDD.scala:42
testResult1 = 
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.5022419691869352
pValue = -2.220446049250313E-16
Very strong presumption against null hypothe...


### Kernel Density Estimation 

-	Computes an estimate of the probability density function of a random variable, evaluated at a given set of points 
-	Does not require assumptions about the particular distribution that the observed samples are drawn from 
- Requires an `RDD` of samples
- Available as the `estimate()` method of `KernelDensity`
- In Spark, only the Gaussian kernel is supported
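As a rough illustration of what a Gaussian kernel density estimate computes, here is a plain-Scala sketch (no Spark required). The function `gaussianKde` is a hypothetical name, and the sketch assumes the bandwidth is the standard deviation of each Gaussian kernel; it is not Spark's implementation.

```scala
// Gaussian kernel density estimate at each evaluation point:
//   f(x) = 1 / (n * h * sqrt(2 * pi)) * sum_i exp(-(x - x_i)^2 / (2 * h^2))
// where h is the bandwidth, assumed here to be the standard deviation
// of each kernel.
def gaussianKde(sample: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
  require(sample.nonEmpty && bandwidth > 0.0, "need samples and a positive bandwidth")
  val norm = 1.0 / (sample.length * bandwidth * math.sqrt(2.0 * math.Pi))
  points.map { x =>
    norm * sample.map { xi =>
      val z = (x - xi) / bandwidth
      math.exp(-0.5 * z * z)
    }.sum
  }
}

// Three samples, evaluated at the sample locations themselves.
val kdeDensities = gaussianKde(Seq(-1.0, 0.0, 1.0), bandwidth = 0.5, points = Seq(-1.0, 0.0, 1.0))
```

Each sample contributes a small bell curve centred on itself, and the estimate at a point is the average of those curves. A smaller bandwidth gives a spikier estimate, while a larger one smooths it out.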

In [5]:

// Kernel Density Estimation I

import org.apache.spark.mllib.stat.KernelDensity

val data: RDD[Double] = normalRDD(sc, size=1000, numPartitions=1, seed=17L)

val kd = new KernelDensity().setSample(data).setBandwidth(0.1)

val densities = kd.estimate(Array(-1.5, -1, -0.5, 1, 1.5))

densities 

data = RandomRDD[15] at RDD at RandomRDD.scala:42
kd = org.apache.spark.mllib.stat.KernelDensity@946de4b
densities = Array(0.13251324189510227, 0.2343205768786857, 0.37436865774453676, 0.2597908788293575, 0.11549809683090305)



In [6]:
// Kernel Density Estimation II 

val data: RDD[Double] = uniformRDD(sc, size=1000, numPartitions=1, seed=17L)

val kd = new KernelDensity().setSample(data).setBandwidth(0.1)

val densities = kd.estimate(Array(-0.25, 0.25, 0.5, 0.75, 1.25))

densities 

data = RandomRDD[16] at RDD at RandomRDD.scala:42
kd = org.apache.spark.mllib.stat.KernelDensity@5e8846b9
densities = Array(0.005891454217755318, 1.0011358547494325, 1.0157407141249963, 0.9352095006986689, 0.006607054892779689)



## Lesson Summary

Having completed this lesson, you should be able to:

- Perform hypothesis testing for goodness of fit and independence
- Perform hypothesis testing for equality of probability distributions
- Perform kernel density estimation

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is a Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with a specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.