# Replace DataSource Reference
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

A common practice when performing DataPrep is to build up a script or set of cleaning operations against a small example file stored locally. This is quicker and easier than working with the full dataset from the start.

After building a Dataflow that performs the desired steps, it's time to run it against the larger dataset, which may be stored in the cloud or simply in a different local file. This is where `Dataflow.replace_datasource` comes in: it returns a Dataflow identical to the one built on the small data, but referencing the newly specified DataSource.

In [1]:
import azureml.dataprep as dprep

dflow = dprep.read_csv('./data/crime0-10.csv')
df = dflow.to_pandas_dataframe()
df

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,...,41,10,06,1129230.0,1933315.0,2015,07/12/2015 12:42:46 PM,41.973309466,-87.800174996,"(41.973309466, -87.800174996)"
1,10139776,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,False,True,...,49,1,08B,1167370.0,1946271.0,2015,07/12/2015 12:42:46 PM,42.008124017,-87.65955018,"(42.008124017, -87.65955018)"
2,10140270,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9,53,08B,,,2015,07/12/2015 12:42:46 PM,,,
3,10139885,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37,25,05,1141721.0,1907465.0,2015,07/12/2015 12:42:46 PM,41.902152027,-87.754883404,"(41.902152027, -87.754883404)"
4,10140379,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27,28,07,1168413.0,1901632.0,2015,07/12/2015 12:42:46 PM,41.885610142,-87.657008701,"(41.885610142, -87.657008701)"
5,10140868,HY330421,07/05/2015 10:54:00 PM,118XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,False,False,...,34,53,14,1172409.0,1826485.0,2015,07/12/2015 12:42:46 PM,41.6793109,-87.644545209,"(41.6793109, -87.644545209)"
6,10139762,HY329232,07/05/2015 10:42:00 PM,026XX W 37TH PL,1020,ARSON,BY FIRE,VACANT LOT/LAND,False,False,...,12,58,09,1159436.0,1879658.0,2015,07/12/2015 12:42:46 PM,41.825500607,-87.690578042,"(41.825500607, -87.690578042)"
7,10139722,HY329228,07/05/2015 10:30:00 PM,016XX S CENTRAL PARK AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,...,24,29,18,1152687.0,1891389.0,2015,07/12/2015 12:42:46 PM,41.857827814,-87.715028789,"(41.857827814, -87.715028789)"
8,10139774,HY329209,07/05/2015 10:15:00 PM,048XX N ASHLAND AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,46,3,14,1164821.0,1932394.0,2015,07/12/2015 12:42:46 PM,41.970099796,-87.669324377,"(41.970099796, -87.669324377)"
9,10139697,HY329177,07/05/2015 10:10:00 PM,058XX S ARTESIAN AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,...,16,63,14,1160997.0,1865851.0,2015,07/12/2015 12:42:46 PM,41.787580282,-87.685233078,"(41.787580282, -87.685233078)"


Here we have the first 10 rows of a dataset called 'Crime0'. The original dataset is over 100MB (admittedly not a very large dataset, but this is just an example).

We'll perform a few cleaning operations.

In [2]:
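# Drop columns we don't need for this example.
dropped_dflow = dflow.drop_columns(['Location', 'Updated On', 'X Coordinate', 'Y Coordinate', 'Description'])
# Let the builder learn a type for each remaining column from the data.
sctb = dropped_dflow.builders.set_column_types()
sctb.learn()
# Turn the learned conversions into a Dataflow step and preview the result.
typed_dflow = sctb.to_dataflow()
typed_dflow.head(10)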

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,Year,Latitude,Longitude
0,10140490.0,HY329907,2015-07-05 23:50:00,050XX N NEWLAND AVE,820.0,THEFT,STREET,False,False,1613.0,16.0,41.0,10.0,06,2015.0,41.973309,-87.800175
1,10139776.0,HY329265,2015-07-05 23:30:00,011XX W MORSE AVE,460.0,BATTERY,STREET,False,True,2431.0,24.0,49.0,1.0,08B,2015.0,42.008124,-87.65955
2,10140270.0,HY329253,2015-07-05 23:20:00,121XX S FRONT AVE,486.0,BATTERY,STREET,False,True,532.0,,9.0,53.0,08B,2015.0,,
3,10139885.0,HY329308,2015-07-05 23:19:00,051XX W DIVISION ST,610.0,BURGLARY,SMALL RETAIL STORE,False,False,1531.0,15.0,37.0,25.0,05,2015.0,41.902152,-87.754883
4,10140379.0,HY329556,2015-07-05 23:00:00,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,STREET,False,False,1215.0,12.0,27.0,28.0,07,2015.0,41.88561,-87.657009
5,10140868.0,HY330421,2015-07-05 22:54:00,118XX S PEORIA ST,1320.0,CRIMINAL DAMAGE,VEHICLE NON-COMMERCIAL,False,False,524.0,5.0,34.0,53.0,14,2015.0,41.679311,-87.644545
6,10139762.0,HY329232,2015-07-05 22:42:00,026XX W 37TH PL,1020.0,ARSON,VACANT LOT/LAND,False,False,911.0,9.0,12.0,58.0,09,2015.0,41.825501,-87.690578
7,10139722.0,HY329228,2015-07-05 22:30:00,016XX S CENTRAL PARK AVE,1811.0,NARCOTICS,ALLEY,True,False,1021.0,10.0,24.0,29.0,18,2015.0,41.857828,-87.715029
8,10139774.0,HY329209,2015-07-05 22:15:00,048XX N ASHLAND AVE,1310.0,CRIMINAL DAMAGE,APARTMENT,False,False,2032.0,20.0,46.0,3.0,14,2015.0,41.9701,-87.669324
9,10139697.0,HY329177,2015-07-05 22:10:00,058XX S ARTESIAN AVE,1320.0,CRIMINAL DAMAGE,ALLEY,False,False,824.0,8.0,16.0,63.0,14,2015.0,41.78758,-87.685233


Now that we have a Dataflow with all our desired steps, we're ready to run against the 'full' dataset stored in Azure Blob.
All we need to do is pass the BlobDataSource into `replace_datasource` and we'll get back an identical Dataflow with the new DataSource substituted in.

In [3]:
full_source = dprep.BlobDataSource('https://dpreptestfiles.blob.core.windows.net/testfiles/crime0.csv')
replaced_dflow = typed_dflow.replace_datasource(full_source)

`replaced_dflow` will now pull data from the 168MB (729734 rows) version of Crime0.csv stored in Azure Blob.
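
To make the idea concrete, here is a minimal sketch in plain Python of why swapping the source is cheap: a flow is just a source plus an ordered list of steps, so replacing the source leaves the steps untouched. `MiniFlow`, `csv_source`, `drop_columns`, and the inline CSV text are hypothetical names and data invented for this illustration; this is not how `azureml.dataprep` is implemented.

```python
import csv
import io


class MiniFlow:
    """Hypothetical stand-in for a Dataflow: a source plus ordered steps."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def add_step(self, step):
        # Return a new flow rather than mutating this one.
        return MiniFlow(self.source, self.steps + (step,))

    def replace_source(self, new_source):
        # Same steps, different input.
        return MiniFlow(new_source, self.steps)

    def run(self):
        rows = iter(self.source())
        for step in self.steps:
            rows = step(rows)
        return list(rows)


def csv_source(text):
    # A source is a zero-argument callable that yields rows as dicts.
    return lambda: csv.DictReader(io.StringIO(text))


def drop_columns(names):
    def step(rows):
        for row in rows:
            yield {k: v for k, v in row.items() if k not in names}
    return step


# Made-up data: a small file to build against and a larger one to run on.
small = csv_source("ID,Block,Location\n1,A ST,x\n")
large = csv_source("ID,Block,Location\n1,A ST,x\n2,B AVE,y\n3,C RD,z\n")

small_flow = MiniFlow(small).add_step(drop_columns({"Location"}))
large_flow = small_flow.replace_source(large)

small_result = small_flow.run()
large_result = large_flow.run()
print(len(small_result), len(large_result))
```

The step list is shared between `small_flow` and `large_flow`, so any cleaning logic developed on the small input applies unchanged to the large one.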

NOTE: A Dataflow can also be created by referencing a different Dataflow rather than a DataSource. For that case, there is a corresponding `replace_reference` method to use instead of `replace_datasource`.

We should be careful now: pulling all that data down into a pandas DataFrame isn't an ideal way to inspect the result of our Dataflow. Instead, to check that our steps are being applied to the new data, we can add a `take_sample` step, which returns records selected at random based on a given probability.

The probability below takes the ~730000 rows down to a more inspectable ~73, though the exact number will vary each time `to_pandas_dataframe()` is run, since the rows are selected at random.

In [4]:
random_sample_dflow = replaced_dflow.take_sample(probability=0.0001)
sample = random_sample_dflow.to_pandas_dataframe()
sample

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,Year,Latitude,Longitude
0,10029770.0,HY219716,2015-04-12 21:40:00,082XX S PULASKI RD,1320.0,CRIMINAL DAMAGE,VEHICLE NON-COMMERCIAL,False,False,834.0,8.0,13.0,70.0,14,2015.0,41.743922,-87.721883
1,9839428.0,HX488527,2014-10-30 17:30:00,005XX W 81ST ST,460.0,BATTERY,STREET,False,False,622.0,6.0,21.0,44.0,08B,2014.0,41.746959,-87.638242
2,9672814.0,HX322830,2014-06-28 17:56:00,052XX W ADAMS ST,460.0,BATTERY,SIDEWALK,False,False,1522.0,15.0,29.0,25.0,08B,2014.0,41.878349,-87.756137
3,9502101.0,HX157117,2014-02-19 15:45:00,063XX S COTTAGE GROVE AVE,5011.0,OTHER OFFENSE,SIDEWALK,True,False,312.0,3.0,20.0,42.0,26,2014.0,41.780230,-87.605780
4,9321475.0,HW465122,2013-09-24 12:18:00,061XX S MICHIGAN AVE,620.0,BURGLARY,VEHICLE NON-COMMERCIAL,True,False,311.0,3.0,20.0,40.0,05,2013.0,41.782093,-87.622346
5,9248963.0,HW394088,2013-08-04 15:30:00,009XX W ADDISON ST,870.0,THEFT,CTA BUS,False,False,1923.0,19.0,44.0,6.0,06,2013.0,41.947360,-87.653548
6,9220616.0,HW367197,2013-07-17 19:50:00,075XX N CLARK ST,560.0,ASSAULT,SIDEWALK,True,False,2422.0,24.0,49.0,1.0,08A,2013.0,42.018743,-87.675892
7,9168362.0,HW313516,2013-06-11 00:00:00,049XX W ALTGELD ST,1305.0,CRIMINAL DAMAGE,RESIDENCE-GARAGE,False,False,2521.0,25.0,31.0,19.0,14,2013.0,41.925925,-87.751045
8,9077387.0,HW222159,2013-04-06 16:45:00,026XX N MULLIGAN AVE,620.0,BURGLARY,RESIDENCE,False,False,2512.0,25.0,29.0,19.0,05,2013.0,41.928199,-87.784231
9,9053652.0,HW198727,2013-03-19 00:00:00,031XX W MADISON ST,2825.0,OTHER OFFENSE,RESIDENCE,False,True,1124.0,11.0,28.0,27.0,26,2013.0,41.880885,-87.704348
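
The run-to-run variation in sample size described above can be illustrated with a small standard-library simulation. It assumes each row is kept independently with the given probability, which is one plausible reading of "selected at random based on a given probability" and not a statement about how `take_sample` works internally. `bernoulli_sample` is a hypothetical helper, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def bernoulli_sample(rows, probability, rng):
    """Keep each row independently with the given probability."""
    return [row for row in rows if rng.random() < probability]


TOTAL_ROWS = 729734   # row count quoted for the full Crime0.csv
PROBABILITY = 0.0001  # same probability as passed to take_sample

# Expected sample size is simply rows * probability.
expected = TOTAL_ROWS * PROBABILITY

rng = random.Random(0)  # fixed seed: an assumption for repeatability only
counts = [
    len(bernoulli_sample(range(TOTAL_ROWS), PROBABILITY, rng))
    for _ in range(3)
]
print(f"expected about {expected:.0f} rows; three runs kept {counts}")
```

Under this assumption the counts cluster around the expected value but differ from run to run, which matches the behaviour noted for `to_pandas_dataframe()`.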
