In [0]:
airlines_df = spark.read\
                .format("csv")\
                .option("header","true") \
                .option("inferSchema","true") \
                .option("samplingRatio","0.0001") \
                .load("/databricks-datasets/airlines/part-00000")
     

In [0]:
airlines_df.count()

Out[2]: 645918

How to randomly sample the dataframe?

*Approach 1:* ***sample method***

Also known as simple random sampling

In [0]:
random_sample_df = airlines_df.sample(fraction=0.1,withReplacement=False,seed=0)
random_sample_df.count()

# fraction: expected share of rows to keep (0.1 is about 10%); the sample size is approximate, not exact
# withReplacement: True lets the same row be picked more than once; False picks each row at most once
# seed: fixes the random generator so repeated runs on the same data return the same sample (reproducibility)

Out[3]: 64459
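
The count above is 64459 rather than exactly 10% of 645918. As a rough mental model, sampling without replacement can be pictured as keeping each row independently with probability `fraction`. The plain-Python sketch below illustrates only that idea; `bernoulli_sample` is a hypothetical helper written for this note, not Spark's implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction` (illustration only)."""
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))  # stand-in data, not the airlines dataset
sample_a = bernoulli_sample(rows, fraction=0.1, seed=0)
sample_b = bernoulli_sample(rows, fraction=0.1, seed=0)

print(len(sample_a))         # close to 10,000, but usually not exactly 10,000
print(sample_a == sample_b)  # same seed, same sample
```

Because every row is an independent draw, the sample size varies around `fraction * row_count`, which matches the behaviour seen in the cell above.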

*Approach 2:*  ***sampleBy***

Also known as stratified sampling

**When to use?**
- Suitable when you want to bucket the data and then take a sample from each bucket.
- In the example below, the data is bucketed by UniqueCarrier and a sample is taken from each bucket.
- Each bucket is known as a stratum (plural: strata), and the approach is called stratification.

In [0]:
base_df = airlines_df.filter("UniqueCarrier in ('AA','DL','PS')")
base_df.count()

Out[5]: 147140

In [0]:
base_df.groupBy("UniqueCarrier")\
       .count()\
       .orderBy("UniqueCarrier")\
       .show()

+-------------+-----+
|UniqueCarrier|count|
+-------------+-----+
|           AA|56091|
|           DL|63104|
|           PS|27945|
+-------------+-----+



***Requirements***

- About 20K sampled rows in total across AA and DL
- Random sampling
- Each carrier code should have approx. 10K samples

***Plan***

I need data from two carrier codes, so I will create two strata (aka buckets). To select records from a stratum, I need a sampling fraction for that stratum: the target sample size divided by the number of rows in the stratum.

Here is the calculation of the sampling fractions, using the counts above:
- AA: 10,000 / 56,091 ≈ 0.18
- DL: 10,000 / 63,104 ≈ 0.158
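
The same arithmetic can be written as a small dictionary comprehension, which is handy when there are more than a couple of strata. This is a minimal sketch: the counts are hard-coded from the groupBy output above, and the names `carrier_counts` and `target_per_carrier` are made up for illustration.

```python
# Per-carrier row counts copied from the groupBy output above.
carrier_counts = {"AA": 56091, "DL": 63104}
target_per_carrier = 10_000

# fraction = desired sample size / rows available in that stratum
fractions = {
    carrier: target_per_carrier / count
    for carrier, count in carrier_counts.items()
}

for carrier, fraction in fractions.items():
    print(carrier, round(fraction, 3))
# AA 0.178
# DL 0.158
```

Rounding AA up to 0.18 asks for slightly more than 10K rows, which is fine given the "approx." in the requirements.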

In [0]:
strata_df = base_df.sampleBy("UniqueCarrier",
                             fractions={"AA":0.18,"DL":0.158},
                             seed=0)

In [0]:
strata_df.groupBy("UniqueCarrier")\
       .count()\
       .orderBy("UniqueCarrier")\
       .show()

+-------------+-----+
|UniqueCarrier|count|
+-------------+-----+
|           AA|10123|
|           DL|10014|
+-------------+-----+
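
Two things stand out in this output: PS is gone, and the counts are close to 10K but not exact. A stratum with no entry in `fractions` is treated as fraction zero, and each row is a random draw. The plain-Python sketch below mimics that behaviour on made-up rows; `stratified_sample` is a hypothetical helper, not Spark's implementation.

```python
import random

def stratified_sample(rows, key, fractions, seed):
    """Keep each row with the probability assigned to its stratum (illustration only)."""
    rng = random.Random(seed)
    # Strata missing from `fractions` get probability 0.0 and are dropped.
    return [row for row in rows if rng.random() < fractions.get(row[key], 0.0)]

# Made-up rows: 5,000 per carrier.
rows = [
    {"UniqueCarrier": carrier, "id": i}
    for carrier in ("AA", "DL", "PS")
    for i in range(5_000)
]
sample = stratified_sample(rows, "UniqueCarrier", {"AA": 0.18, "DL": 0.158}, seed=0)

# Count sampled rows per carrier.
counts = {}
for row in sample:
    counts[row["UniqueCarrier"]] = counts.get(row["UniqueCarrier"], 0) + 1
print(counts)  # AA near 900, DL near 790, and no PS key
```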



- How do you split a dataframe?
- How do you split it randomly with different weights?

Approach: randomSplit

In [0]:
(df1, df2, df3) = airlines_df.randomSplit(weights=[0.25, 0.5, 0.25], seed=0)

In [0]:
print(df1.count(),df2.count(),df3.count())

161240 323047 161631


***Important***
- Always pass the weights as floating-point numbers. Spark normalizes them so that they sum to 1, and each normalized weight is the approximate share of rows that lands in that split.
- Otherwise, it throws an error (java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double).
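
To make the normalization concrete, here is a plain-Python sketch of what "normalize into a split percentage" means. `normalize_weights` is a hypothetical helper for illustration only; it also shows one way to turn integer weights into floats before handing them to randomSplit.

```python
def normalize_weights(weights):
    """Convert weights to floats and scale them so they sum to 1.0."""
    as_floats = [float(w) for w in weights]  # integers become floats here
    total = sum(as_floats)
    return [w / total for w in as_floats]

print(normalize_weights([1, 2, 1]))          # [0.25, 0.5, 0.25]
print(normalize_weights([0.25, 0.5, 0.25]))  # [0.25, 0.5, 0.25]
```

Weights of 1, 2, 1 and 0.25, 0.5, 0.25 describe the same split, so only the ratio between the weights matters.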

*How do you combine the different data frames?*

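Use union; the calls can be chained. Note that union matches columns by position, not by name, so the data frames should share the same column order.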

In [0]:
df4 = df1.union(df2).union(df3)

In [0]:
df4.count()

Out[26]: 645918