# Data Ingestion
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
import azureml.dataprep as dprep

DataPrep can load several types of input data. Its smart reading functionality can detect a file's type automatically, but you can also specify the file type and its parameters explicitly.

## Table of Contents
[Read Lines](#Read-Lines)<br>
[Read CSV](#Read-CSV)<br>
[Read Excel](#Read-Excel)<br>
[Read Fixed Width Files](#Read-Fixed-Width-Files)<br>
[Read SQL](#Read-SQL)<br>
[Read from ADLS](#Read-from-ADLS)<br>

## Read Lines

One of the simplest ways to read a file into a dataframe is to just read it as text lines.

In [2]:
dataflow = dprep.read_lines(path='./data/text_lines.txt')
dataflow.head(20)

Unnamed: 0,Line
0,Date||Minimum temperature||Maximum temperature
1,2015-07-1||-4.1||10.0
2,2015-07-2||-0.8||10.8
3,2015-07-3||-7.0||10.5
4,2015-07-4||-5.5||9.3
5,2015-07-5||-4.7||7.3
6,2015-07-6||-2.4||11.2
7,2015-07-7||-4.7||11.5
8,2015-07-8||-3.0||12.6
9,2015-07-9||-1.3||13.8


With our ingestion done, we can go ahead and retrieve a Pandas DataFrame for the full dataset.

In [3]:
df = dataflow.to_pandas_dataframe()
df

Unnamed: 0,Line
0,Date||Minimum temperature||Maximum temperature
1,2015-07-1||-4.1||10.0
2,2015-07-2||-0.8||10.8
3,2015-07-3||-7.0||10.5
4,2015-07-4||-5.5||9.3
5,2015-07-5||-4.7||7.3
6,2015-07-6||-2.4||11.2
7,2015-07-7||-4.7||11.5
8,2015-07-8||-3.0||12.6
9,2015-07-9||-1.3||13.8
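
Each row at this point is still a raw line whose fields are separated by `||`. As a minimal sketch of how such lines could be split into typed fields in plain Python (the sample lines are copied from the output above; the variable names are illustrative and not part of DataPrep):

```python
# Sample lines copied from the output above; the first line holds the column names.
lines = [
    "Date||Minimum temperature||Maximum temperature",
    "2015-07-1||-4.1||10.0",
    "2015-07-2||-0.8||10.8",
    "2015-07-3||-7.0||10.5",
]

# Split each line on the literal '||' separator.
header, *rows = [line.split("||") for line in lines]

# Build one dict per data row, converting the two temperature fields to floats.
records = [
    {header[0]: date, header[1]: float(t_min), header[2]: float(t_max)}
    for date, t_min, t_max in rows
]

for record in records:
    print(record)
```

The dates are left as strings here because they are not zero-padded (`2015-07-1`), so they would need an explicit format to be parsed.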


## Read CSV

When reading delimited files, we can let the underlying runtime infer the parsing parameters (e.g. separator, encoding, whether to use headers, etc.) simply by not providing them. In this case, we will attempt to read a file by specifying only its location. Once this is done, we can retrieve the first 10 rows to evaluate the result.

In [4]:
# SAS expires June 16th, 2019
dataflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv?st=2018-06-15T23%3A01%3A42Z&se=2019-06-16T23%3A01%3A00Z&sp=r&sv=2017-04-17&sr=b&sig=ugQQCmeC2eBamm6ynM7wnI%2BI3TTDTM6z9RPKj4a%2FU6g%3D')
dataflow.head(10)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
1,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
5,ALABAMA,1,101710,Hale County,10171000589,Moundville Elem Sch,304,95,.,.,...,.,.,.,.,.,.,.,.,.,.
6,ALABAMA,1,101710,Hale County,10171000592,Sunshine High Sch,137,80-84,.,.,...,.,.,.,.,.,.,.,.,.,.
7,ALABAMA,1,101920,Jefferson County,10192000681,Adamsville Elem Sch,170,80-84,1,PS,...,1,PS,.,.,.,.,.,.,.,.
8,ALABAMA,1,101920,Jefferson County,10192000684,Bagley Jr High,395,90,.,.,...,.,.,.,.,.,.,.,.,.,.
9,ALABAMA,1,101920,Jefferson County,10192000687,Bottenfield Middle Sch,794,69,.,.,...,.,.,.,.,.,.,.,.,.,.


From the result, we can see that the delimiter and encoding were correctly detected. Column headers were also detected. However, the first line seems to be a duplicate of the column headers. One of the parameters we can specify is a number of lines to skip from the files we are reading. We will do so to filter out the duplicate line.

In [5]:
dataflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv',
                          skip_rows=1)
dataflow.head(10)

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1,101710,Hale County,10171002158,Greensboro Elem Sch,299,82,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1,101710,Hale County,10171002162,Greensboro High Sch,94,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1,101710,Hale County,10171002156,Greensboro Middle Sch,287,63,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1,101710,Hale County,10171000588,Hale Co High Sch,257,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1,101710,Hale County,10171000589,Moundville Elem Sch,304,95,.,.,...,.,.,.,.,.,.,.,.,.,.
5,ALABAMA,1,101710,Hale County,10171000592,Sunshine High Sch,137,80-84,.,.,...,.,.,.,.,.,.,.,.,.,.
6,ALABAMA,1,101920,Jefferson County,10192000681,Adamsville Elem Sch,170,80-84,1,PS,...,1,PS,.,.,.,.,.,.,.,.
7,ALABAMA,1,101920,Jefferson County,10192000684,Bagley Jr High,395,90,.,.,...,.,.,.,.,.,.,.,.,.,.
8,ALABAMA,1,101920,Jefferson County,10192000687,Bottenfield Middle Sch,794,69,.,.,...,.,.,.,.,.,.,.,.,.,.
9,ALABAMA,1,101920,Jefferson County,10192000689,Bragg Middle Sch,875,84,.,.,...,.,.,.,.,.,.,.,.,.,.
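
The duplicated header row was spotted by eye in the earlier output. The same check can be sketched in plain Python on a few values copied from that output; `first_row_repeats_header` is a hypothetical helper for illustration, not a DataPrep function.

```python
def first_row_repeats_header(header, rows):
    """Return True if the first data row is identical to the column names."""
    return bool(rows) and list(rows[0]) == list(header)


header = ["stnam", "fipst", "leaid", "leanm10"]

# First rows as returned when only the file location was given.
rows_before = [
    ["stnam", "fipst", "leaid", "leanm10"],
    ["ALABAMA", "1", "101710", "Hale County"],
]

# The same rows once the extra line has been skipped.
rows_after = rows_before[1:]

print(first_row_repeats_header(header, rows_before))  # True
print(first_row_repeats_header(header, rows_after))   # False
```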


Now we can see our data set contains the correct headers and the extraneous row has been skipped by read_csv. Next, we can take a look at the data types of the columns.

In [6]:
dataflow.head(1).dtypes

stnam                     object
fipst                     object
leaid                     object
leanm10                   object
ncessch                   object
schnam10                  object
ALL_MTH00numvalid_1011    object
ALL_MTH00pctprof_1011     object
MAM_MTH00numvalid_1011    object
MAM_MTH00pctprof_1011     object
MAS_MTH00numvalid_1011    object
MAS_MTH00pctprof_1011     object
MBL_MTH00numvalid_1011    object
MBL_MTH00pctprof_1011     object
MHI_MTH00numvalid_1011    object
MHI_MTH00pctprof_1011     object
MTR_MTH00numvalid_1011    object
MTR_MTH00pctprof_1011     object
MWH_MTH00numvalid_1011    object
MWH_MTH00pctprof_1011     object
F_MTH00numvalid_1011      object
F_MTH00pctprof_1011       object
M_MTH00numvalid_1011      object
M_MTH00pctprof_1011       object
CWD_MTH00numvalid_1011    object
CWD_MTH00pctprof_1011     object
ECD_MTH00numvalid_1011    object
ECD_MTH00pctprof_1011     object
LEP_MTH00numvalid_1011    object
LEP_MTH00pctprof_1011     object
          

Unfortunately, all of our columns came back as strings. This is because, by default, DataPrep will not change the type of your data. Since the data source we are reading from is a text file, all values are kept as strings. In this case, however, we do want to parse numeric columns as numbers. To do this, we can set the `inference_arguments` parameter to `InferenceArguments.current_culture()`.

In [7]:
dataflow = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/read_csv_duplicate_headers.csv',
                          skip_rows=1,
                          inference_arguments=dprep.InferenceArguments.current_culture())
dataflow.head(1).dtypes

stnam                      object
fipst                     float64
leaid                     float64
leanm10                    object
ncessch                   float64
schnam10                   object
ALL_MTH00numvalid_1011    float64
ALL_MTH00pctprof_1011      object
MAM_MTH00numvalid_1011     object
MAM_MTH00pctprof_1011      object
MAS_MTH00numvalid_1011     object
MAS_MTH00pctprof_1011      object
MBL_MTH00numvalid_1011     object
MBL_MTH00pctprof_1011      object
MHI_MTH00numvalid_1011     object
MHI_MTH00pctprof_1011      object
MTR_MTH00numvalid_1011     object
MTR_MTH00pctprof_1011      object
MWH_MTH00numvalid_1011     object
MWH_MTH00pctprof_1011      object
F_MTH00numvalid_1011      float64
F_MTH00pctprof_1011        object
M_MTH00numvalid_1011      float64
M_MTH00pctprof_1011        object
CWD_MTH00numvalid_1011    float64
CWD_MTH00pctprof_1011      object
ECD_MTH00numvalid_1011    float64
ECD_MTH00pctprof_1011      object
LEP_MTH00numvalid_1011     object
LEP_MTH00pctpr

Now we can see several of the columns were correctly detected as numbers and their type is set to float64. With our ingestion done, we can go ahead and retrieve a Pandas DataFrame for the full dataset.

In [8]:
df = dataflow.to_pandas_dataframe()
df

Unnamed: 0,stnam,fipst,leaid,leanm10,ncessch,schnam10,ALL_MTH00numvalid_1011,ALL_MTH00pctprof_1011,MAM_MTH00numvalid_1011,MAM_MTH00pctprof_1011,...,MIG_MTH05numvalid_1011,MIG_MTH05pctprof_1011,MIG_MTH06numvalid_1011,MIG_MTH06pctprof_1011,MIG_MTH07numvalid_1011,MIG_MTH07pctprof_1011,MIG_MTH08numvalid_1011,MIG_MTH08pctprof_1011,MIG_MTHHSnumvalid_1011,MIG_MTHHSpctprof_1011
0,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Greensboro Elem Sch,299.0,82,.,.,...,.,.,.,.,.,.,.,.,.,.
1,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Greensboro High Sch,94.0,55-59,.,.,...,.,.,.,.,.,.,.,.,.,.
2,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Greensboro Middle Sch,287.0,63,.,.,...,.,.,.,.,.,.,.,.,.,.
3,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Hale Co High Sch,257.0,74,2,PS,...,.,.,.,.,.,.,.,.,.,.
4,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Moundville Elem Sch,304.0,95,.,.,...,.,.,.,.,.,.,.,.,.,.
5,ALABAMA,1.0,101710.0,Hale County,1.017100e+10,Sunshine High Sch,137.0,80-84,.,.,...,.,.,.,.,.,.,.,.,.,.
6,ALABAMA,1.0,101920.0,Jefferson County,1.019200e+10,Adamsville Elem Sch,170.0,80-84,1,PS,...,1,PS,.,.,.,.,.,.,.,.
7,ALABAMA,1.0,101920.0,Jefferson County,1.019200e+10,Bagley Jr High,395.0,90,.,.,...,.,.,.,.,.,.,.,.,.,.
8,ALABAMA,1.0,101920.0,Jefferson County,1.019200e+10,Bottenfield Middle Sch,794.0,69,.,.,...,.,.,.,.,.,.,.,.,.,.
9,ALABAMA,1.0,101920.0,Jefferson County,1.019200e+10,Bragg Middle Sch,875.0,84,.,.,...,.,.,.,.,.,.,.,.,.,.
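
Some columns, such as `ALL_MTH00pctprof_1011`, remain `object` even with inference enabled. Their values include entries like `55-59`, `PS`, or the `.` placeholder, which are not numbers. The plain-Python sketch below illustrates that idea on values copied from the outputs above. `column_is_numeric` is a hypothetical helper, and DataPrep's actual inference rules may differ from this simplification.

```python
def column_is_numeric(values):
    """Return True if every value in the column parses as a float."""
    for value in values:
        try:
            float(value)
        except ValueError:
            return False
    return True


# A few values per column, copied from the outputs above.
columns = {
    "ALL_MTH00numvalid_1011": ["299", "94", "287", "257"],
    "ALL_MTH00pctprof_1011": ["82", "55-59", "63", "74"],
    "MAM_MTH00numvalid_1011": [".", ".", ".", "2"],
}

for name, values in columns.items():
    kind = "float64" if column_is_numeric(values) else "object"
    print(f"{name}: {kind}")
```

A single non-numeric value is enough to keep the whole column as strings in this sketch, which matches the mix of `float64` and `object` columns seen above.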


## Read Excel

DataPrep can also load Excel files using the `read_excel` function.

In [9]:
dataflow = dprep.read_excel(path='./data/excel.xlsx')
dataflow.head(10)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,Hoba,"Iron, IVB",60000000.0,Found,1920.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,-19.58333,17.91667
1,Cape York,"Iron, IIIAB",58200000.0,Found,1818.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,76.13333,-64.93333
2,Campo del Cielo,"Iron, IAB-MG",50000000.0,Found,1576.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,-27.46667,-60.58333
3,Canyon Diablo,"Iron, IAB-MG",30000000.0,Found,1891.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,35.05,-111.03333
4,Armanty,"Iron, IIIE",28000000.0,Found,1898.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,47.0,88.0
5,Gibeon,"Iron, IVA",26000000.0,Found,1836.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,-25.5,18.0
6,Chupaderos,"Iron, IIIAB",24300000.0,Found,1852.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,27.0,-105.1
7,Mundrabilla,"Iron, IAB-ung",24000000.0,Found,1911.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,-30.78333,127.55
8,Sikhote-Alin,"Iron, IIAB",23000000.0,Fell,1947.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,46.16,134.65333
9,Bacubirito,"Iron, ungrouped",22000000.0,Found,1863.0,http://www.lpi.usra.edu/meteor/metbull.php?cod...,26.2,-107.83333


Here we have loaded the first sheet in the Excel document. We could have achieved the same result by specifying the name of the sheet we want to load explicitly. Alternatively, if we wanted to load the second sheet instead, we would provide its name as an argument.

In [10]:
dataflow = dprep.read_excel(path='./data/excel.xlsx', sheet_name='Sheet2')
dataflow.head(10)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,Rank,Title,Studio,Worldwide,Domestic / %,Column1,Overseas / %,Column2,Year^
4,1,Avatar,Fox,2788,760.5,0.273,2027.5,0.727,2009^
5,2,Titanic,Par.,2186.8,658.7,0.301,1528.1,0.699,1997^
6,3,Marvel's The Avengers,BV,1518.6,623.4,0.41,895.2,0.59,2012
7,4,Harry Potter and the Deathly Hallows Part 2,WB,1341.5,381,0.284,960.5,0.716,2011
8,5,Frozen,BV,1274.2,400.7,0.314,873.5,0.686,2013
9,6,Iron Man 3,BV,1215.4,409,0.337,806.4,0.663,2013


As you can see, the table in the second sheet had headers as well as 3 empty rows, so we should modify the function's arguments accordingly.

In [11]:
dataflow = dprep.read_excel(path='./data/excel.xlsx', sheet_name='Sheet2', use_header=True, skip_rows=3)
dataflow.head(10)

Unnamed: 0,Rank,Title,Studio,Worldwide,Domestic / %,Column1,Overseas / %,Column2,Year^
0,1.0,Avatar,Fox,2788.0,760.5,0.273,2027.5,0.727,2009^
1,2.0,Titanic,Par.,2186.8,658.7,0.301,1528.1,0.699,1997^
2,3.0,Marvel's The Avengers,BV,1518.6,623.4,0.41,895.2,0.59,2012
3,4.0,Harry Potter and the Deathly Hallows Part 2,WB,1341.5,381.0,0.284,960.5,0.716,2011
4,5.0,Frozen,BV,1274.2,400.7,0.314,873.5,0.686,2013
5,6.0,Iron Man 3,BV,1215.4,409.0,0.337,806.4,0.663,2013
6,7.0,Transformers: Dark of the Moon,P/DW,1123.8,352.4,0.314,771.4,0.686,2011
7,8.0,The Lord of the Rings: The Return of the King,NL,1119.9,377.8,0.337,742.1,0.663,2003^
8,9.0,Skyfall,Sony,1108.6,304.4,0.275,804.2,0.725,2012
9,10.0,The Dark Knight Rises,WB,1084.4,448.1,0.413,636.3,0.587,2012


In [12]:
df = dataflow.to_pandas_dataframe()
df

Unnamed: 0,Rank,Title,Studio,Worldwide,Domestic / %,Column1,Overseas / %,Column2,Year^
0,1,Avatar,Fox,2788,760.5,0.273,2027.5,0.727,2009^
1,2,Titanic,Par.,2186.8,658.7,0.301,1528.1,0.699,1997^
2,3,Marvel's The Avengers,BV,1518.6,623.4,0.41,895.2,0.59,2012
3,4,Harry Potter and the Deathly Hallows Part 2,WB,1341.5,381,0.284,960.5,0.716,2011
4,5,Frozen,BV,1274.2,400.7,0.314,873.5,0.686,2013
5,6,Iron Man 3,BV,1215.4,409,0.337,806.4,0.663,2013
6,7,Transformers: Dark of the Moon,P/DW,1123.8,352.4,0.314,771.4,0.686,2011
7,8,The Lord of the Rings: The Return of the King,NL,1119.9,377.8,0.337,742.1,0.663,2003^
8,9,Skyfall,Sony,1108.6,304.4,0.275,804.2,0.725,2012
9,10,The Dark Knight Rises,WB,1084.4,448.1,0.413,636.3,0.587,2012
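
To make the combined effect of `skip_rows=3` and `use_header=True` concrete, here is a small plain-Python sketch that applies the same two steps to a few cells copied from the second sheet's raw output. `skip_and_promote` is a hypothetical helper for illustration only; it says nothing about how `read_excel` is implemented.

```python
def skip_and_promote(rows, skip_rows):
    """Drop the first `skip_rows` rows, then use the next row as column names."""
    remaining = rows[skip_rows:]
    header, data = remaining[0], remaining[1:]
    return [dict(zip(header, row)) for row in data]


# Three empty rows, then the header row, then the data rows.
raw_rows = [
    [None, None, None],
    [None, None, None],
    [None, None, None],
    ["Rank", "Title", "Studio"],
    [1, "Avatar", "Fox"],
    [2, "Titanic", "Par."],
]

for record in skip_and_promote(raw_rows, skip_rows=3):
    print(record)
```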


## Read Fixed Width Files

For fixed-width files, we can specify a list of offsets. The first column is always assumed to start at offset 0.

In [13]:
dataflow = dprep.read_fwf('./data/fixed_width_file.txt', offsets=[7, 13, 43, 46, 52, 58, 65, 73])
dataflow.head(10)

Unnamed: 0,010000,99999,BOGUS NORWAY,NO,NO_1,ENRS,Column7,Column8,Column9
0,10003,99999,BOGUS NORWAY,NO,NO,ENSO,,,
1,10010,99999,JAN MAYEN,NO,JN,ENJA,70933.0,-8667.0,90.0
2,10013,99999,ROST,NO,NO,,,,
3,10014,99999,SOERSTOKKEN,NO,NO,ENSO,59783.0,5350.0,500.0
4,10015,99999,BRINGELAND,NO,NO,ENBL,61383.0,5867.0,3270.0
5,10016,99999,RORVIK/RYUM,NO,NO,,64850.0,11233.0,140.0
6,10017,99999,FRIGG,NO,NO,ENFR,59933.0,2417.0,480.0
7,10020,99999,VERLEGENHUKEN,NO,SV,,80050.0,16250.0,80.0
8,10030,99999,HORNSUND,NO,SV,,77000.0,15500.0,120.0
9,10040,99999,NY-ALESUND II,NO,SV,ENAS,78917.0,11933.0,80.0


Looking at the data, we can see that the first row was used as headers. In this particular case, however, there are no headers in the file. Therefore, we want to treat the first row as data.

By passing in `PromoteHeadersMode.NONE` to the `header` keyword argument, we can avoid header detection and get the correct data.

In [14]:
dataflow = dprep.read_fwf('./data/fixed_width_file.txt',
                          offsets=[7, 13, 43, 46, 52, 58, 65, 73],
                          header=dprep.PromoteHeadersMode.NONE)
dataflow.head(10)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9
0,10000,99999,BOGUS NORWAY,NO,NO,ENRS,,,
1,10003,99999,BOGUS NORWAY,NO,NO,ENSO,,,
2,10010,99999,JAN MAYEN,NO,JN,ENJA,70933.0,-8667.0,90.0
3,10013,99999,ROST,NO,NO,,,,
4,10014,99999,SOERSTOKKEN,NO,NO,ENSO,59783.0,5350.0,500.0
5,10015,99999,BRINGELAND,NO,NO,ENBL,61383.0,5867.0,3270.0
6,10016,99999,RORVIK/RYUM,NO,NO,,64850.0,11233.0,140.0
7,10017,99999,FRIGG,NO,NO,ENFR,59933.0,2417.0,480.0
8,10020,99999,VERLEGENHUKEN,NO,SV,,80050.0,16250.0,80.0
9,10030,99999,HORNSUND,NO,SV,,77000.0,15500.0,120.0


In [15]:
df = dataflow.to_pandas_dataframe()
df

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9
0,010000,99999,BOGUS NORWAY,NO,NO,ENRS,,,
1,010003,99999,BOGUS NORWAY,NO,NO,ENSO,,,
2,010010,99999,JAN MAYEN,NO,JN,ENJA,+70933,-008667,+00090
3,010013,99999,ROST,NO,NO,,,,
4,010014,99999,SOERSTOKKEN,NO,NO,ENSO,+59783,+005350,+00500
5,010015,99999,BRINGELAND,NO,NO,ENBL,+61383,+005867,+03270
6,010016,99999,RORVIK/RYUM,NO,NO,,+64850,+011233,+00140
7,010017,99999,FRIGG,NO,NO,ENFR,+59933,+002417,+00480
8,010020,99999,VERLEGENHUKEN,NO,SV,,+80050,+016250,+00080
9,010030,99999,HORNSUND,NO,SV,,+77000,+015500,+00120
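
The offsets passed to `read_fwf` mark where each column after the first begins. The sketch below shows that slicing in plain Python, assuming each offset is the starting character position of a column and that padding whitespace is trimmed. `split_fixed_width` is a hypothetical helper; the field values are copied from the output above, but the padding used to build the sample line is an assumption made to fit the offsets.

```python
def split_fixed_width(line, offsets):
    """Slice `line` into columns; the first starts at 0, the rest at `offsets`."""
    starts = [0] + list(offsets)
    ends = list(offsets) + [len(line)]
    return [line[start:end].strip() for start, end in zip(starts, ends)]


offsets = [7, 13, 43, 46, 52, 58, 65, 73]

# Build a sample line by padding each field to the width implied by the offsets.
fields = ["010010", "99999", "JAN MAYEN", "NO", "JN", "ENJA", "+70933", "-008667", "+00090"]
widths = [7, 6, 30, 3, 6, 6, 7, 8, 6]
line = "".join(field.ljust(width) for field, width in zip(fields, widths))

print(split_fixed_width(line, offsets))
```

Eight offsets produce nine columns, which matches the `Column1` to `Column9` headers in the output above.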


## Read SQL

DataPrep can also get data from SQL servers. Currently, only Microsoft SQL Server is supported.

To read data from a SQL server, we have to create a data source object that contains the connection information.

In [16]:
secret = dprep.register_secret(value="dpr3pTestU$er", id="dprepTestUser")
ds = dprep.MSSQLDataSource(server_name="dprep-sql-test.database.windows.net",
                           database_name="dprep-sql-test",
                           user_name="dprepTestUser",
                           password=secret)

As you can see, the `password` parameter of `MSSQLDataSource` accepts a Secret object. You can get a Secret object in two ways:
1. Register the secret and its value with the execution engine (as done above with `register_secret`).
2. Create the secret with just an id (useful if the secret value was already registered in the execution environment).

Now that we have created a data source object, we can proceed to read data.

In [17]:
dataflow = dprep.read_sql(ds, "SELECT top 100 * FROM [SalesLT].[Product]")
dataflow.head(20)

Unnamed: 0,ProductID,Name,ProductNumber,Color,StandardCost,ListPrice,Size,Weight,ProductCategoryID,ProductModelID,SellStartDate,SellEndDate,DiscontinuedDate,ThumbNailPhoto,ThumbnailPhotoFileName,rowguid,ModifiedDate
0,680,"HL Road Frame - Black, 58",FR-R92B-58,Black,1059.31,1431.5,58,1016.04,18,6,2002-06-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,43dd68d6-14a4-461f-9069-55309d90ea7e,2008-03-11 10:01:36.827
1,706,"HL Road Frame - Red, 58",FR-R92R-58,Red,1059.31,1431.5,58,1016.04,18,6,2002-06-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,9540ff17-2712-4c90-a3d1-8ce5568b2462,2008-03-11 10:01:36.827
2,707,"Sport-100 Helmet, Red",HL-U509-R,Red,13.0863,34.99,,,35,33,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,2e1ef41a-c08a-4ff6-8ada-bde58b64a712,2008-03-11 10:01:36.827
3,708,"Sport-100 Helmet, Black",HL-U509,Black,13.0863,34.99,,,35,33,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,a25a44fb-c2de-4268-958f-110b8d7621e2,2008-03-11 10:01:36.827
4,709,"Mountain Bike Socks, M",SO-B909-M,White,3.3963,9.5,M,,27,18,2005-07-01,2006-06-30T00:00:00.000000,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,18f95f47-1540-4e02-8f1f-cc1bcb6828d0,2008-03-11 10:01:36.827
5,710,"Mountain Bike Socks, L",SO-B909-L,White,3.3963,9.5,L,,27,18,2005-07-01,2006-06-30T00:00:00.000000,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,161c035e-21b3-4e14-8e44-af508f35d80a,2008-03-11 10:01:36.827
6,711,"Sport-100 Helmet, Blue",HL-U509-B,Blue,13.0863,34.99,,,35,33,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,fd7c0858-4179-48c2-865b-abd5dfc7bc1d,2008-03-11 10:01:36.827
7,712,AWC Logo Cap,CA-1098,Multi,6.9223,8.99,,,23,2,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,b9ede243-a6f4-4629-b1d4-ffe1aedc6de7,2008-03-11 10:01:36.827
8,713,"Long-Sleeve Logo Jersey, S",LJ-0192-S,Multi,38.4923,49.99,S,,25,11,2005-07-01,,,b'GIF89aP\x002\x00\xf7\x00\x00\x14%g\xfc\xfc\x...,awc_jersey_male_small.gif,fd449c82-a259-4fae-8584-6ca0255faf68,2008-03-11 10:01:36.827
9,714,"Long-Sleeve Logo Jersey, M",LJ-0192-M,Multi,38.4923,49.99,M,,25,11,2005-07-01,,,b'GIF89aP\x002\x00\xf7\x00\x00\x14%g\xfc\xfc\x...,awc_jersey_male_small.gif,6a290063-a0cf-432a-8110-2ea0fda14308,2008-03-11 10:01:36.827


In [18]:
df = dataflow.to_pandas_dataframe()
df.dtypes

ProductID                          int64
Name                              object
ProductNumber                     object
Color                             object
StandardCost                     float64
ListPrice                        float64
Size                              object
Weight                           float64
ProductCategoryID                  int64
ProductModelID                     int64
SellStartDate             datetime64[ns]
SellEndDate                       object
DiscontinuedDate                  object
ThumbNailPhoto                    object
ThumbnailPhotoFileName            object
rowguid                           object
ModifiedDate              datetime64[ns]
dtype: object

## Read from ADLS

There are two ways the DataPrep API can acquire the OAuth token needed to access Azure Data Lake Storage (ADLS):
1. Retrieve the access token from a recent login session of the user's [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).
2. Use a ServicePrincipal (SP) with a certificate as its secret.

### Using Access Token from a recent Azure CLI session

On your local machine, run the following command:
```
az login
```
If your user account is a member of more than one Azure tenant, you must specify the tenant, either in the AAD URL hostname form '<your_domain>.onmicrosoft.com' or as the tenantId GUID. The latter can be retrieved as follows:
```
az account show --query tenantId
```

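from azureml.dataprep.api.datasources import DataLakeDataSource

# Read the file using the token from the Azure CLI session for the given tenant.
dataflow = dprep.read_csv(path=DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/farmers-markets.csv', tenant='microsoft.onmicrosoft.com'))
head = dataflow.head(5)
head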

### Create a ServicePrincipal via Azure CLI

A ServicePrincipal and the corresponding certificate can be created via [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).
This particular SP is configured as Reader, with its scope reduced to just the ADLS account 'dpreptestfiles'.
```
az account set --subscription "Data Wrangling development"
az ad sp create-for-rbac -n "SP-ADLS-dpreptestfiles" --create-cert --role reader --scopes /subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourceGroups/dpreptestfiles/providers/Microsoft.DataLakeStore/accounts/dpreptestfiles
```
This command emits the appId and the path to the certificate file (usually in the home folder). The .crt file contains both the public cert and the private key in PEM format.

Extract the thumbprint with:
```
openssl x509 -in adls-dpreptestfiles.crt -noout -fingerprint
```
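
For reference, the fingerprint that `openssl` prints by default is the SHA-1 digest of the DER-encoded certificate, written as colon-separated hex pairs. A minimal standard-library sketch of the same computation follows. `cert_thumbprint` is a hypothetical helper, and it assumes the file holds a PEM `CERTIFICATE` block, possibly alongside a private key as described above.

```python
import base64
import hashlib
import re


def cert_thumbprint(pem_text):
    """Return the SHA-1 fingerprint of the first certificate in a PEM string."""
    match = re.search(
        r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
        pem_text,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("no certificate block found")
    # The PEM body is the base64 encoding of the DER bytes, wrapped across lines.
    der_bytes = base64.b64decode("".join(match.group(1).split()))
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    # Group the hex digest into colon-separated byte pairs.
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

With a real certificate file, the call would look like `cert_thumbprint(open('adls-dpreptestfiles.crt', 'rt').read())`, and the result should match the value after `Fingerprint=` in the `openssl` output.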

### Configure ADLS Account for ServicePrincipal

To configure the ACL for the ADLS file system, use the objectId of the user or, as in this case, of the ServicePrincipal:
```
az ad sp show --id "8dd38f34-1fcb-4ff9-accd-7cd60b757174" --query objectId
```
Configure Read and Execute access for the ADLS file system. Since the underlying HDFS ACL model doesn't support inheritance, folders and files need to be ACL-ed individually.
```
az dls fs access set-entry --account dpreptestfiles --acl-spec "user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r-x" --path /
az dls fs access set-entry --account dpreptestfiles --acl-spec "user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r--" --path /farmers-markets.csv
```

References:
- [az ad sp](https://docs.microsoft.com/en-us/cli/azure/ad/sp?view=azure-cli-latest)
- [az dls fs access](https://docs.microsoft.com/en-us/cli/azure/dls/fs/access?view=azure-cli-latest)
- [ACL model for ADLS](https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-lake-store/data-lake-store-access-control.md)

In [19]:
certThumbprint = 'C2:08:9D:9E:D1:74:FC:EB:E9:7E:63:96:37:1C:13:88:5E:B9:2C:84'
certificate = ''
with open('./data/adls-dpreptestfiles.crt', 'rt', encoding='utf-8') as crtFile:
    certificate = crtFile.read()

servicePrincipalAppId = "8dd38f34-1fcb-4ff9-accd-7cd60b757174"

### Acquire an OAuth Access Token

Use the `adal` package (install it with `pip install adal`) to create an authentication context on the Microsoft tenant and acquire an OAuth access token. Note that for ADLS, the `resource` in the token request must be 'https://datalake.azure.net', which differs from most other Azure resources.

In [20]:
import adal
from azureml.dataprep.api.datasources import DataLakeDataSource

ctx = adal.AuthenticationContext('https://login.microsoftonline.com/microsoft.onmicrosoft.com')
token = ctx.acquire_token_with_client_certificate('https://datalake.azure.net/', servicePrincipalAppId, certificate, certThumbprint)
dataflow = dprep.read_csv(path = DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/farmers-markets.csv', accessToken=token['accessToken']))
dataflow.to_pandas_dataframe().head()

Unnamed: 0,FMID,MarketName,Website,Facebook,Twitter,Youtube,OtherMedia,street,city,County,...,Coffee,Beans,Fruits,Grains,Juices,Mushrooms,PetFood,Tofu,WildHarvested,updateTime
0,1012063,Caledonia Farmers Market Association - Danville,https://sites.google.com/site/caledoniafarmers...,https://www.facebook.com/Danville.VT.Farmers.M...,,,,,Danville,Caledonia,...,Y,Y,Y,N,Y,N,Y,N,N,6/28/2016 12:10
1,1011871,Stearns Homestead Farmers' Market,http://Stearnshomestead.com,,,,,6975 Ridge Road,Parma,Cuyahoga,...,N,N,Y,N,N,N,Y,N,N,4/9/2016 20:05
2,1011878,100 Mile Market,http://www.pfcmarkets.com,https://www.facebook.com/100MileMarket/?fref=ts,,,https://www.instagram.com/100milemarket/,507 Harrison St,Kalamazoo,Kalamazoo,...,N,N,Y,N,N,N,N,N,N,7/15/2016 19:20
3,1009364,106 S. Main Street Farmers Market,http://thetownofsixmile.wordpress.com/,,,,,106 S. Main Street,Six Mile,,...,,,,,,,,,,2013
4,1010691,10th Steet Community Farmers Market,,,,,http://agrimissouri.com/mo-grown/grodetail.php...,10th Street and Poplar,Lamar,Barton,...,N,N,Y,N,N,N,N,N,N,10/28/2014 9:49
