# Sampling and Subsetting

Once a Dataflow has been created, it is possible to act on only a subset of the records contained in it. This can help when working with very large datasets or when only a portion of the records is truly relevant.

## Head

The `head` method takes the specified number of records, runs them through the transformations in the Dataflow, and returns the result as a pandas DataFrame.

In [1]:
import azureml.dataprep as dprep

df = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv')
df.head(10)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
1,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
5,ALABAMA,1,101710,Hale County,10171000589,Moundville Elem Sch,304,95,.,.,...,.,.,.,.,.,.,.,.,.,.
6,ALABAMA,1,101710,Hale County,10171000592,Sunshine High Sch,137,80-84,.,.,...,.,.,.,.,.,.,.,.,.,.
7,ALABAMA,1,101920,Jefferson County,10192000681,Adamsville Elem Sch,170,80-84,1,PS,...,1,PS,.,.,.,.,.,.,.,.
8,ALABAMA,1,101920,Jefferson County,10192000684,Bagley Jr High,395,90,.,.,...,.,.,.,.,.,.,.,.,.,.
9,ALABAMA,1,101920,Jefferson County,10192000687,Bottenfield Middle Sch,794,69,.,.,...,.,.,.,.,.,.,.,.,.,.


## Take

The `take` method adds a step to the Dataflow that keeps the specified number of records (counting from the beginning) and drops the rest. Unlike `head`, which does not modify the Dataflow, `take` becomes part of it: every operation applied after `take` affects only the records kept.

In [2]:
top_five_df = df.take(5)
top_five_df.to_pandas_dataframe()

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
1,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.


## Skip

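It is also possible to skip a certain number of records in a Dataflow, so that transformations are applied only to the records after that point. Depending on the underlying data source, a Dataflow with a `skip` step might still have to scan through the data in order to skip past the records. In the cell below, `skip(1)` is applied to `top_five_df`, so the result contains the four records that remain once the first of the five is dropped.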

In [3]:
skip_top_one_df = top_five_df.skip(1)
skip_top_one_df.to_pandas_dataframe()

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
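The remark about scanning can be illustrated with a small pure-Python sketch. It is not how `azureml.dataprep` is implemented; `read_records` and this `skip` are hypothetical stand-ins for a forward-only source and a skip step, used only to show why skipped records may still have to be read.

```python
from itertools import islice

def read_records(n, counter):
    """Hypothetical forward-only source that counts every record it produces."""
    for i in range(n):
        counter["read"] += 1
        yield {"row": i}

def skip(records, count):
    """Hypothetical skip step: drop the first `count` records, yield the rest."""
    return islice(records, count, None)

counter = {"read": 0}
kept = skip(read_records(10, counter), 3)

# Asking for the first kept record forces the source to produce
# the three skipped records as well.
first_kept = next(kept)
print(first_kept)        # {'row': 3}
print(counter["read"])   # 4
```

A source that supports random access could jump directly to the offset instead, which is why the cost depends on the underlying data source.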


## Take Sample

In addition to taking records from the top of the dataset, it's also possible to take a random sample of the dataset. This is done through the `take_sample(probability, seed=None)` method. This method will scan through all of the records available in the Dataflow and include them based on the probability specified. The `seed` parameter is optional. If a seed is not provided, a stable one is generated, ensuring that the results for a specific Dataflow remain consistent. Different calls to `take_sample` will receive different seeds.

In [4]:
sampled_df = df.take_sample(0.1)
sampled_df.to_pandas_dataframe()

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1,101710,Hale County,10171000589,Moundville Elem Sch,304,95,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101920,Jefferson County,10192000691,Brighton Middle Sch,209,66,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101950,Lamar County,10195000756,Sulligent Sch,456,78,.,.,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,102010,Lauderdale County,10201000772,Rogers High Sch,719,90,.,.,...,.,.,.,.,.,.,.,.,.,.
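As a rough illustration of probability-based sampling, the sketch below keeps each record independently with the given probability. It uses only the Python standard library; the function is a hypothetical stand-in that shares the name `take_sample`, not the library's implementation. One deliberate difference: here `seed=None` falls back to Python's own nondeterministic seeding, rather than the stable generated seed described above.

```python
import random

def take_sample(records, probability, seed=None):
    """Hypothetical sketch: keep each record independently with `probability`."""
    rng = random.Random(seed)  # seed=None is NOT stable in this sketch
    return [r for r in records if rng.random() < probability]

records = list(range(1000))

first = take_sample(records, 0.1, seed=1)
second = take_sample(records, 0.1, seed=1)

# The same seed over the same records selects the same sample.
print(first == second)   # True

# The sample size is only approximately probability * len(records).
print(len(first))
```

Because every record is tested individually, the whole input has to be scanned, and the number of records returned varies around the expected 10% rather than matching it exactly.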


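`skip`, `take`, and `take_sample` can all be combined. This makes it possible, for example, to take a random sample of the N records that follow a given offset. The cell below skips the first record, keeps the next five, and samples those with a probability of 0.5.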

In [5]:
seed = 1
nested_sample_df = df.skip(1).take(5).take_sample(0.5, seed)
nested_sample_df.to_pandas_dataframe()

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
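When combining steps, their order matters. The sketch below uses `itertools.islice` on a plain list to mimic the two orderings that appear in this notebook; `skip` and `take` here are hypothetical helpers, not the Dataflow methods.

```python
from itertools import islice

def skip(records, count):
    """Hypothetical helper: drop the first `count` records."""
    return islice(records, count, None)

def take(records, count):
    """Hypothetical helper: keep only the first `count` records."""
    return islice(records, count)

rows = list(range(10))

# Mirrors top_five_df.skip(1): take five, then drop the first of them.
take_then_skip = list(skip(take(rows, 5), 1))
print(take_then_skip)   # [1, 2, 3, 4]

# Mirrors df.skip(1).take(5): drop the first record, then keep the next five.
skip_then_take = list(take(skip(rows, 1), 5))
print(skip_then_take)   # [1, 2, 3, 4, 5]
```

This matches the outputs above: `top_five_df.skip(1)` returned four records, while `df.skip(1).take(5)` hands five records to `take_sample`.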


## Caching

It is usually a good idea to cache the sampled Dataflow so that later operations can reuse it.

See [caching.ipynb](caching.ipynb) for more details about caching.