## Clustered Bootstrap

In this notebook, we look at a different bootstrap technique called 'Clustered Bootstrap'. Here, sampling is done at a higher-level unit (the cluster), and the sampled units are then merged back onto the original data.

Check that the Spark and Spark SQL contexts have started successfully.

In [1]:
print sc
print sqlContext
print sqlCtx

<pyspark.context.SparkContext object at 0x7f815b520650>
<pyspark.sql.context.HiveContext object at 0x7f815b500e90>
<pyspark.sql.context.HiveContext object at 0x7f815b500e90>


Load the dataset

In [2]:
data_df = (sqlContext.read
           .format('com.databricks.spark.csv')
           .option("header", "true") # Use first line of all files as header
           .option("inferSchema", "true") # Automatically infer data types
           .load("skewdata-policy-new.csv")
           )

In [3]:
data_df.dtypes

[('policyid', 'int'), ('age', 'int'), ('values', 'double')]

In [4]:
data_df.show()

+--------+---+-----------+
|policyid|age|     values|
+--------+---+-----------+
|       0| 10|81.37291811|
|       0| 10|25.70097086|
|       0| 10|4.942646012|
|       1| 11|43.02085256|
|       1| 11|81.69058902|
|       1| 11|51.19523649|
|       2| 12|55.65990905|
|       2| 12|15.15315474|
|       2| 12|38.74578007|
|       3| 13|12.61038468|
|       3| 13|22.41509375|
|       3| 13| 18.3557207|
|       4| 14|38.08150137|
|       4| 14|48.17113476|
|       4| 14|18.46272527|
|       5| 15|44.64225129|
|       5| 15|25.39108197|
|       5| 15|20.41087394|
|       6| 16|15.77818657|
|       6| 16|19.35148454|
+--------+---+-----------+
only showing top 20 rows



We need to sample the above dataset in such a way that, whenever a policy id is selected, all the records for that policy id are included in the sample.
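
To make this requirement concrete, here is a minimal plain-Python sketch (no Spark) that uses the first few rows shown above. The names `rows`, `rows_by_policy`, `sampled_ids`, and `cluster_sample` are illustrative and not part of the notebook, and the choice of 2 policies and the seed are arbitrary.

```python
import random
from collections import defaultdict

# A few (policyid, age, values) rows copied from the table above.
rows = [
    (0, 10, 81.37291811), (0, 10, 25.70097086), (0, 10, 4.942646012),
    (1, 11, 43.02085256), (1, 11, 81.69058902), (1, 11, 51.19523649),
    (2, 12, 55.65990905), (2, 12, 15.15315474), (2, 12, 38.74578007),
    (3, 13, 12.61038468), (3, 13, 22.41509375), (3, 13, 18.3557207),
]

# Group the rows by policy id so that a policy is the sampling unit.
rows_by_policy = defaultdict(list)
for row in rows:
    rows_by_policy[row[0]].append(row)

# Sample at the policy level (hypothetical choice: 2 of the 4 policies).
rng = random.Random(0)  # arbitrary seed, only for repeatability
sampled_ids = rng.sample(sorted(rows_by_policy), 2)

# Pull in every record of each sampled policy.
cluster_sample = [row for pid in sampled_ids for row in rows_by_policy[pid]]

# Each sampled policy contributes all of its records, never a subset.
for pid in sampled_ids:
    kept = [row for row in cluster_sample if row[0] == pid]
    assert kept == rows_by_policy[pid]
```

The point of the sketch is that the random draw happens over policy ids, not over rows; the rows only come along afterwards through the lookup.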

First, find the distinct policy ids in the dataset and store them in a separate DataFrame.

In [5]:
policyids = data_df.select('policyid').distinct()

In [6]:
policyids.show()

+--------+
|policyid|
+--------+
|       0|
|       1|
|       2|
|       3|
|       4|
|       5|
|       6|
|       7|
|       8|
|       9|
+--------+



In [7]:
## Create a function that builds the specified number of clustered samples; the size of each sample is controlled by the fraction of policy ids drawn

def clusteredSamples(data,policies,policyid_sample_fraction,num_of_samples):
    
    #Initiate an empty sample list
    samples = []
    
    for n in range(0,num_of_samples):
        
        #Create a sample of the unique policy ids
        policyids_sample = policies.sample(withReplacement=False, fraction=policyid_sample_fraction)
    
        #Sample the data based on the sampled policyids
        sample = policyids_sample.join(data,on='policyid',how='inner')
        
        #Add the sample to the samples list
        samples.append(sample)
        
    #We will return a list of clustered samples
    return samples
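
One thing worth keeping in mind: with `withReplacement=False`, Spark's `sample` treats `fraction` as a per-row inclusion probability rather than an exact share, so the number of policy ids drawn can differ from sample to sample. The plain-Python sketch below imitates that behaviour for ten policy ids, mirroring the 20 samples at a fraction of 0.8 used in this notebook. `bernoulli_sample` is a hypothetical helper, not a Spark function, and the seed is arbitrary.

```python
import random

def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [item for item in items if rng.random() < fraction]

policy_ids = list(range(10))   # ten distinct policy ids, 0 through 9
rng = random.Random(42)        # arbitrary seed, only for repeatability

# Draw 20 samples and record how many policy ids each one contains.
sample_sizes = [len(bernoulli_sample(policy_ids, 0.8, rng)) for _ in range(20)]
print(sample_sizes)  # sizes average about 8 but need not all equal 8
```

If every sample must contain exactly the same number of policies, the ids would have to be drawn some other way than by `fraction` alone.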

In [8]:
sampleList = clusteredSamples(data_df,policyids,0.8,20)

In [9]:
sampleList

[DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: double],
 DataFrame[policyid: int, age: int, values: do

Run a linear regression on each sample in sampleList sequentially and save the coefficients.

In [10]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

def runLinearRegression(samples):
    #initiate a result list
    samples_coefficients = []
    
    #Create a vector Assembler
    feature_columns = ['age']
    vectorAssembler = VectorAssembler(inputCols = feature_columns, outputCol = 'features_vector')
    
    #Create a linear regression model
    lr = LinearRegression(featuresCol ='features_vector', 
                          labelCol = 'values',
                          predictionCol = 'predicted_values',
                          maxIter=5, 
                          elasticNetParam = 0.5,
                          solver="l-bfgs")
    for i in range(0,len(samples)):
        sample_df = samples[i]
        sample_df1 = vectorAssembler.transform(sample_df)
        
        #Fit the linear Regression model
        sample_lr = lr.fit(sample_df1)
        
        #Save the coefficients from the Regression model
        samples_coefficients.append(sample_lr.coefficients)
    
    #Return the list of coefficients from fitting the linear regression on each sample
    return samples_coefficients

In [11]:
sampleCoefficients = runLinearRegression(sampleList)

In [12]:
sampleCoefficients

[DenseVector([-2.5583]),
 DenseVector([-2.3188]),
 DenseVector([-2.3443]),
 DenseVector([-2.1986]),
 DenseVector([-2.6441]),
 DenseVector([-2.3188]),
 DenseVector([-1.8888]),
 DenseVector([-2.3443]),
 DenseVector([-2.1632]),
 DenseVector([-3.4732]),
 DenseVector([-3.6289]),
 DenseVector([-3.4164]),
 DenseVector([-2.6553]),
 DenseVector([-2.7576]),
 DenseVector([-2.3557]),
 DenseVector([-2.8209]),
 DenseVector([-2.6926]),
 DenseVector([-2.3467]),
 DenseVector([-2.4428]),
 DenseVector([-2.8209])]
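
A common next step with resampled estimates is to summarise their spread. The sketch below is not part of the original notebook: it copies the 20 age coefficients printed above into a plain Python list and computes their mean, standard deviation, and range with the standard library only. Inside the notebook, the values could presumably be extracted with something like `[float(c[0]) for c in sampleCoefficients]`; that line is an assumption about usage, not something the notebook shows.

```python
import statistics

# Age coefficients copied from the output above (one per clustered sample).
age_coefficients = [
    -2.5583, -2.3188, -2.3443, -2.1986, -2.6441,
    -2.3188, -1.8888, -2.3443, -2.1632, -3.4732,
    -3.6289, -3.4164, -2.6553, -2.7576, -2.3557,
    -2.8209, -2.6926, -2.3467, -2.4428, -2.8209,
]

mean_coef = statistics.mean(age_coefficients)
# Sample standard deviation of the estimates; a rough gauge of their variability.
std_coef = statistics.stdev(age_coefficients)

print("mean coefficient: %.4f" % mean_coef)
print("std deviation:    %.4f" % std_coef)
print("range:            [%.4f, %.4f]" % (min(age_coefficients), max(age_coefficients)))
```

Because the samples here were drawn without replacement at a fraction of 0.8, the spread should be read as a rough indication of variability rather than as a textbook bootstrap standard error.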