# Data ingestion via Azure Data Factory

In this notebook, you will create an Azure Data Factory (ADF) v2 pipeline to ingest data from a public dataset into your Azure Storage account.

In this lesson you will complete the following:
* Create a copy data pipeline.
* Monitor the pipeline run.
* Verify copied files exist in Blob storage.
* Examine the ingested data.

## Create a copy data pipeline

Start the lab by using the Copy Data wizard in the Azure Data Factory UI to create a new ADF pipeline. The wizard creates the Linked Services, Datasets, and Copy Activity for you.

To learn more about the various components of ADF, you can visit the following resources:

* [Pipeline and activities](https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities)
* [Dataset and linked services](https://docs.microsoft.com/en-us/azure/data-factory/concepts-datasets-linked-services)
* [Pipeline execution and triggers](https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipeline-execution-triggers)
* [Integration runtime](https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime)

In the [Azure portal](https://portal.azure.com/), navigate to the ADF instance you provisioned in the Getting Started notebook, and then launch the Azure Data Factory UI by selecting the **Author & Monitor** tile on the ADF Overview blade.

![Author & Monitor](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-author-and-deploy.png "Author & Monitor")

On the ADF UI landing page, select **Copy data**.

![Copy data](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-copy-data.png "Copy pipeline")

### Step 1: Properties

On the Properties page of the Copy Data wizard, do the following:

1. **Task name**: Enter a name, such as LabPipeline.
2. **Task cadence or Task schedule**: Choose Run once now.
3. Select **Next**.

![Copy Data wizard properties](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-copy-data-step-1.png "Copy Data wizard properties")

### Step 2: Source

The source will be configured to point to a publicly accessible Azure Storage account, which contains the files you will be copying into your own storage account.

On the Source data store page, select **+ Create new connection**.

![Create new connection](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-create-new-connection.png "Create new connection")

On the New Linked Service blade:

1. Enter "storage" into the search box.
2. Select **Azure Blob Storage**.
3. Select **Continue**.

![Add Azure Blob Storage Linked Service](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-linked-service-azure-blob-storage.png "Add Azure Blob Storage Linked Service")

Configure the New Linked Service (Azure Blob Storage) with the following values:

1. **Name**: Enter PublicDataset.
2. **Authentication method**: Select Use SAS URI.
3. **SAS URI**: Paste the following URI into the field: <https://databricksdemostore.blob.core.windows.net/?sv=2017-11-09&ss=b&srt=sco&sp=rl&se=2099-12-31T17:59:59Z&st=2018-09-22T15:21:51Z&spr=https&sig=LqDcqVNGNEuWILjNJoThzaXktV2N%2BFS354s716RJo80%3D>
4. Select **Test connection** and ensure a Connection successful message is displayed.
5. Select **Finish**.

![Configure Azure Blob Storage Linked Service for public dataset](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-linked-service-public-dataset.png "Configure Azure Blob Storage Linked Service for public dataset")
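
If you are curious what the SAS URI grants, the following minimal sketch splits it into its query parameters using only the Python standard library. The meanings noted in the comments follow the usual account SAS conventions and are included as an aid to reading, not as a full reference.

```python
from urllib.parse import urlsplit, parse_qs

# SAS URI copied from the linked service configuration above.
sas_uri = (
    "https://databricksdemostore.blob.core.windows.net/"
    "?sv=2017-11-09&ss=b&srt=sco&sp=rl"
    "&se=2099-12-31T17:59:59Z&st=2018-09-22T15:21:51Z"
    "&spr=https&sig=LqDcqVNGNEuWILjNJoThzaXktV2N%2BFS354s716RJo80%3D"
)

parts = urlsplit(sas_uri)
# parse_qs returns a list per key; each key appears once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print("Host:       ", parts.netloc)
print("Services:   ", params["ss"])   # b = Blob
print("Resources:  ", params["srt"])  # s = service, c = container, o = object
print("Permissions:", params["sp"])   # r = read, l = list
print("Valid from: ", params["st"])
print("Valid until:", params["se"])
print("Protocol:   ", params["spr"])
```

The `sp=rl` value is why this connection can serve as a source only: it allows reading and listing, but not writing.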

Back on the Source data store page, ensure **PublicDataset** is selected, and then select **Next**.

![Source data store](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-source-data-store.png "Source data store")

On the Choose the input file or folder page:

1. **File or folder**: Enter **training/crime-data-2016/**.
2. **Copy file recursively**: Check this box.
3. **Binary Copy**: Check this box.
4. Select **Next**.

![Choose the input file or folder](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-choose-input-file-or-folder.png "Choose the input file or folder")

### Step 3: Destination

The destination data store will be configured to point to the Azure Storage account you created in the [Getting Started notebook]($./01-Getting-Started.dbc) within this lab.

On the Destination data store page, select **+ Create new connection**.

![Create new connection](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-create-new-connection.png "Create new connection")

As you did previously, enter "storage" into the search box on the New Linked Service blade, choose **Azure Blob Storage** for the Linked Service and select **Continue**.

Configure the New Linked Service (Azure Blob Storage) with the following values:

1. **Name**: Enter DestinationContainer.
2. **Authentication method**: Select Use account key.
3. **Storage account name**: Select the name of the Storage account you created in the Getting Started notebook from the list.
4. Select **Test connection** and ensure a Connection successful message is displayed.
5. Select **Finish**.

![Configure Azure Blob Storage Linked Service for destination container](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-linked-service-destination-container.png "Configure Azure Blob Storage Linked Service for destination container")

Back on the Destination data store view, ensure **DestinationContainer** is selected, and then select **Next**.

On the Choose the output file or folder page:

1. **Folder path**: Enter **dwtemp/03.02/**.
2. **File name**: Leave empty.
3. **Copy behavior**: Select **Preserve hierarchy**.
4. Select **Next**.

![Choose the output file or folder](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-choose-output-file-or-folder.png "Choose the output file or folder")

### Step 4: Settings

On the Settings page, accept the default values, and select **Next**.

### Step 5: Summary

On the Summary page, you can review the copy pipeline settings, and then select **Next** to deploy the pipeline.

### Step 6: Deployment

The Deployment page displays the status of the pipeline deployment. This will create the Datasets and Pipeline, and then run the pipeline. Select **Monitor** to view the pipeline progress.

  ![Pipeline Deployment](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-deployment.png "Pipeline Deployment")
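
For orientation, the pipeline the wizard deploys is stored as a JSON definition. The sketch below builds a rough approximation of such a definition as a Python dictionary. The activity name and both dataset names are hypothetical placeholders, and the JSON the wizard actually generates will likely contain different names and additional properties, so treat this as an illustration of the overall shape only.

```python
import json

# Approximate shape of a copy pipeline definition.
# "CopyCrimeData", "SourceDataset", and "DestinationDataset" are hypothetical names.
pipeline = {
    "name": "LabPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyCrimeData",
                "type": "Copy",
                "inputs": [
                    {"referenceName": "SourceDataset", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "DestinationDataset", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    # Mirrors "Copy file recursively" on the source page.
                    "source": {"type": "BlobSource", "recursive": True},
                    # Mirrors "Preserve hierarchy" on the destination page.
                    "sink": {"type": "BlobSink", "copyBehavior": "PreserveHierarchy"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

You can compare this sketch with the real definition by opening the pipeline in the ADF authoring view and viewing its code.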

## Monitor the pipeline run

Selecting Monitor above will take you to the Pipeline Runs screen in the ADF UI, where you can monitor the status of your pipeline runs. Using the monitor dialog, you can track the completion status of pipeline runs, and access other details about the run.

  ![Monitor pipeline runs](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-monitor-pipeline-runs.png "Monitor pipeline runs")
  
> NOTE: You may need to select Refresh if you don't see the pipeline listed, or to update the status displayed.

You can select the View Activity Runs icon under Actions if you want to see the progress of the individual activities that make up the pipeline. On the Activity Runs dialog, you can select the various icons under Actions to display the inputs, outputs, and run details.

  ![Monitor activity runs](https://databricksdemostore.blob.core.windows.net/images/03/02/adf-monitor-activity-runs.png "Monitor activity runs")

When the copy activity is completed, its status will change to Succeeded (requires refreshing the activities list).
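
Refreshing until the status changes is a simple polling loop. The sketch below shows that pattern in plain Python. `wait_for_run` and `get_status` are hypothetical names introduced for illustration, and the status strings are assumed to match those shown in the ADF UI. Here `get_status` is a stand-in that replays a fixed sequence; in practice it could wrap a call to the Data Factory REST API or SDK.

```python
import time

# Run states after which the status no longer changes (assumed names).
TERMINAL_STATES = {"Succeeded", "Failed", "Cancelled"}

def wait_for_run(get_status, poll_seconds=15, max_polls=40):
    """Call get_status() until it returns a terminal state or polls run out."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    # Polls exhausted: return the last status seen.
    return status

# Stand-in for a real status lookup: replays a fixed sequence of states.
fake_states = iter(["Queued", "InProgress", "InProgress", "Succeeded"])
final_status = wait_for_run(lambda: next(fake_states), poll_seconds=0)
print(final_status)
```

Capping the number of polls keeps the loop from running indefinitely if a run never reaches a terminal state.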

## Verify files in blob storage

Now, navigate to your storage account in the Azure portal, open the `dwtemp` container, and locate the `03.02` folder within it. Observe the files copied via ADF.

  ![Copied files in Blob storage](https://databricksdemostore.blob.core.windows.net/images/03/02/blob-storage-files.png "Blob storage files")

## Examine the ingested data

Quickly examine a few of the crime datasets ingested by the above operation, and observe some of the differences in the datasets from each city.

First, run the cell below to create a direct connection to your Blob Storage account, replacing the values of `storageAccountName` and `storageAccountKey` with the appropriate values from your storage account. You retrieved these values in the [Getting Started notebook]($./01-Getting-Started).

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster before running any cells. In the notebook's toolbar, select the drop-down arrow next to Detached, then select your cluster under Attach to.

![Attached to cluster](https://databricksdemostore.blob.core.windows.net/images/03/03/databricks-cluster-attach.png)

In [6]:
containerName = "dwtemp"
storageAccountName = "cs793d050761b12x4bc1x83a"
storageAccountKey = "T0Iq+hcVt2sWWt55Lannl5C1MFv6/Kf6rt1VWM3tLopcrmVYt5um3dgWEai2vH+7EcpqxR8etvYQllRuVl8eiQ=="

spark.conf.set(
  "fs.azure.account.key.%(storageAccountName)s.blob.core.windows.net" % locals(),
  storageAccountKey)

connectionString = "wasbs://%(containerName)s@%(storageAccountName)s.blob.core.windows.net/03.02" % locals()
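
The `wasbs://` path built above follows the pattern `wasbs://<container>@<account>.blob.core.windows.net/<path>`. As a minimal sketch of the same construction without `% locals()`, the helper below assembles the path from its parts. `wasbs_path` is a hypothetical helper name, and `mystorageaccount` is a placeholder for your own account name.

```python
def wasbs_path(container, account, *parts):
    """Build a wasbs:// URI for a path inside a Blob Storage container."""
    base = f"wasbs://{container}@{account}.blob.core.windows.net"
    # Drop empty segments and stray slashes so the parts join cleanly.
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join([base] + cleaned)

# Placeholder account name; substitute your own.
folder = wasbs_path("dwtemp", "mystorageaccount", "03.02")
print(folder)

boston = wasbs_path("dwtemp", "mystorageaccount", "03.02", "Crime-Data-Boston-2016.parquet")
print(boston)
```

Passing the values explicitly avoids depending on whichever variables happen to be in scope when the string is formatted.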

Start by reviewing the `crime-data-2016` parquet files copied from the public storage account.

In the cell below, replace the storage account name in the path with the name of your storage account, and then run the cell. This lists the parquet files, which contain crime data for Boston, New York, and other cities.

In [8]:
%fs ls wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02

path,name,size
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-Boston-2016.parquet/,Crime-Data-Boston-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-Chicago-2016.parquet/,Crime-Data-Chicago-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-Dallas-2016.parquet/,Crime-Data-Dallas-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-Los-Angeles-2016.parquet/,Crime-Data-Los-Angeles-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-New-Orleans-2016.parquet/,Crime-Data-New-Orleans-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-New-York-2016.parquet/,Crime-Data-New-York-2016.parquet/,0
wasbs://dwtemp@cs793d050761b12x4bc1x83a.blob.core.windows.net/03.02/Crime-Data-Philadelphia-2016.parquet/,Crime-Data-Philadelphia-2016.parquet/,0
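
Each entry in the listing ends with a slash and reports a size of 0, which indicates that it is a folder rather than a single file. The names follow a regular pattern, so the cities covered can be extracted from them. The sketch below does this in plain Python, with the names copied from the listing; the regular expression is an assumption based on those seven names.

```python
import re

# Folder names as shown in the listing above.
names = [
    "Crime-Data-Boston-2016.parquet/",
    "Crime-Data-Chicago-2016.parquet/",
    "Crime-Data-Dallas-2016.parquet/",
    "Crime-Data-Los-Angeles-2016.parquet/",
    "Crime-Data-New-Orleans-2016.parquet/",
    "Crime-Data-New-York-2016.parquet/",
    "Crime-Data-Philadelphia-2016.parquet/",
]

# Crime-Data-<city>-<year>.parquet, with an optional trailing slash.
pattern = re.compile(r"^Crime-Data-(?P<city>.+)-(?P<year>\d{4})\.parquet/?$")

cities = []
for name in names:
    match = pattern.match(name)
    if match:
        # Hyphens inside the city part separate words, as in "Los-Angeles".
        cities.append(match.group("city").replace("-", " "))

print(cities)
# ['Boston', 'Chicago', 'Dallas', 'Los Angeles', 'New Orleans', 'New York', 'Philadelphia']
```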


The next step is to take a closer look at the data for a few cities by creating a DataFrame for each file.

Start by creating a DataFrame for the New York and Boston data.

In [10]:
crimeDataNewYorkDf = spark.read.parquet("%(connectionString)s/Crime-Data-New-York-2016.parquet" % locals())

In [11]:
crimeDataBostonDf = spark.read.load("%(connectionString)s/Crime-Data-Boston-2016.parquet" % locals())

With the two DataFrames created, you can now review the first few records of each file.

In [13]:
display(crimeDataNewYorkDf)

complaintNumber,keyCode,offenseDescription,policeDeptCode,policeDeptDescription,lawCategoryCode,jurisdictionDesc,borough,precinct,locationOfOccurrenceDesc,premiseTypeDesc,latitude,longitude,fromDate,fromTime,toDate,toTime,reportDate
227505849,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,MISDEMEANOR,N.Y. POLICE DEPT,BRONX,52,INSIDE,CLOTHING/BOUTIQUE,40.86483612,-73.892447136,2016-12-31T00:00:00.000+0000,2016-12-31T16:40:00.000+0000,,,2016-12-31T00:00:00.000+0000
586068632,235,DANGEROUS DRUGS,511.0,"CONTROLLED SUBSTANCE, POSSESSI",MISDEMEANOR,PORT AUTHORITY,MANHATTAN,14,INSIDE,BUS TERMINAL,40.756266207,-73.990501248,2016-12-31T00:00:00.000+0000,2016-12-31T23:55:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:56:00.000+0000,2016-12-31T00:00:00.000+0000
155423129,105,ROBBERY,389.0,"ROBBERY,DWELLING",FELONY,N.Y. POLICE DEPT,BRONX,43,INSIDE,RESIDENCE - APT. HOUSE,40.828754623,-73.866593516,2016-12-31T00:00:00.000+0000,2016-12-31T23:40:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:50:00.000+0000,2016-12-31T00:00:00.000+0000
653964645,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,MISDEMEANOR,N.Y. POLICE DEPT,MANHATTAN,25,FRONT OF,STREET,40.809859893,-73.937644103,2016-12-31T00:00:00.000+0000,2016-12-31T23:30:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:31:00.000+0000,2016-12-31T00:00:00.000+0000
988275798,235,DANGEROUS DRUGS,567.0,"MARIJUANA, POSSESSION 4 & 5",MISDEMEANOR,N.Y. POLICE DEPT,MANHATTAN,7,OPPOSITE OF,STREET,40.719711494,-73.9894242,2016-12-31T00:00:00.000+0000,2016-12-31T23:25:00.000+0000,,,2016-12-31T00:00:00.000+0000
225104473,106,FELONY ASSAULT,109.0,"ASSAULT 2,1,UNCLASSIFIED",FELONY,N.Y. POLICE DEPT,QUEENS,102,,STREET,40.694514975,-73.849134227,2016-12-31T00:00:00.000+0000,2016-12-31T23:24:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:30:00.000+0000,2016-12-31T00:00:00.000+0000
428909890,106,FELONY ASSAULT,109.0,"ASSAULT 2,1,UNCLASSIFIED",FELONY,N.Y. POLICE DEPT,BROOKLYN,70,INSIDE,RESIDENCE - APT. HOUSE,40.649370541,-73.960872294,2016-12-31T00:00:00.000+0000,2016-12-31T23:20:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:25:00.000+0000,2016-12-31T00:00:00.000+0000
313457048,344,ASSAULT 3 & RELATED OFFENSES,101.0,ASSAULT 3,MISDEMEANOR,N.Y. POLICE DEPT,BROOKLYN,79,INSIDE,RESIDENCE - APT. HOUSE,40.682000963,-73.948223153,2016-12-31T00:00:00.000+0000,2016-12-31T23:20:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:25:00.000+0000,2016-12-31T00:00:00.000+0000
816766111,126,MISCELLANEOUS PENAL LAW,198.0,CRIMINAL CONTEMPT 1,FELONY,N.Y. POLICE DEPT,MANHATTAN,13,,STREET,40.73137039,-73.982563257,2016-12-31T00:00:00.000+0000,2016-12-31T23:19:00.000+0000,,,2016-12-31T00:00:00.000+0000
323812425,348,VEHICLE AND TRAFFIC LAWS,916.0,LEAVING SCENE-ACCIDENT-PERSONA,MISDEMEANOR,N.Y. POLICE DEPT,BROOKLYN,67,,STREET,40.641135477,-73.945624473,2016-12-31T00:00:00.000+0000,2016-12-31T23:15:00.000+0000,2016-12-31T00:00:00.000+0000,2016-12-31T23:35:00.000+0000,2016-12-31T00:00:00.000+0000


In [14]:
display(crimeDataBostonDf)

INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,DAY_OF_WEEK,HOUR,UCR_PART,STREET,LATITUDE,LONGITUDE,LOCATION
I172071639,1402,Vandalism,VANDALISM,D4,905.0,,2017-08-29T19:38:00.000+0000,2017,8,Tuesday,19,Part Two,HARRISON ARCHWAYS,42.33936768,-71.07035467,"(42.33936768, -71.07035467)"
I172071637,3114,Investigate Property,INVESTIGATE PROPERTY,E13,574.0,,2017-08-29T16:00:00.000+0000,2017,8,Tuesday,16,Part Three,WASHINGTON ST,42.30971857,-71.10429432,"(42.30971857, -71.10429432)"
I172071635,3201,Property Lost,PROPERTY - LOST,A1,102.0,,2017-08-29T21:00:00.000+0000,2017,8,Tuesday,21,Part Three,TREMONT ST,42.35637531,-71.06213513,"(42.35637531, -71.06213513)"
I172071633,3115,Investigate Person,INVESTIGATE PERSON,C11,240.0,,2017-08-26T18:00:00.000+0000,2017,8,Saturday,18,Part Three,COLUMBIA RD,42.31959298,-71.062607,"(42.31959298, -71.06260700)"
I172071632,562,Other Burglary,BURGLARY - OTHER - NO FORCE,C6,232.0,,2017-08-29T21:04:00.000+0000,2017,8,Tuesday,21,Part One,FARRAGUT RD,42.33370773,-71.02500441,"(42.33370773, -71.02500441)"
I172071631,619,Larceny,LARCENY ALL OTHERS,E5,727.0,,2017-08-29T14:30:00.000+0000,2017,8,Tuesday,14,Part One,VFW PKWY,42.2802082,-71.17087959,"(42.28020820, -71.17087959)"
I172071630,613,Larceny,LARCENY SHOPLIFTING,E13,304.0,,2017-08-29T21:16:00.000+0000,2017,8,Tuesday,21,Part One,COLUMBUS AVE,42.31771534,-71.09823049,"(42.31771534, -71.09823049)"
I172071629,3115,Investigate Person,INVESTIGATE PERSON,B3,465.0,,2017-08-29T21:36:50.000+0000,2017,8,Tuesday,21,Part Three,BLUE HILL AVE,42.28482577,-71.09137369,"(42.28482577, -71.09137369)"
I172071624,1842,Drug Violation,"DRUGS - POSS CLASS A - HEROIN, ETC.",,,,2017-08-29T17:00:00.000+0000,2017,8,Tuesday,17,Part Two,,42.31523705,-71.09777089,"(42.31523705, -71.09777089)"
I172071624,1849,Drug Violation,"DRUGS - POSS CLASS B - COCAINE, ETC.",,,,2017-08-29T17:00:00.000+0000,2017,8,Tuesday,17,Part Two,,42.31523705,-71.09777089,"(42.31523705, -71.09777089)"


### Same Type of Data, Different Structure

Notice in the examples above:
* The `crimeDataNewYorkDf` and `crimeDataBostonDf` DataFrames use different names for the columns.
* The data itself is formatted differently and different names are used for similar concepts.

This is common when pulling data from disparate data sources.
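
To make the difference concrete, the sketch below compares the two sets of column names shown in the output above, first exactly and then after a loose normalization (lowercasing and removing underscores). The `normalize` function is a hypothetical helper for this comparison only, not a recommended cleanup rule.

```python
# Column names copied from the two display() outputs above.
ny_columns = (
    "complaintNumber,keyCode,offenseDescription,policeDeptCode,"
    "policeDeptDescription,lawCategoryCode,jurisdictionDesc,borough,precinct,"
    "locationOfOccurrenceDesc,premiseTypeDesc,latitude,longitude,"
    "fromDate,fromTime,toDate,toTime,reportDate"
).split(",")

boston_columns = (
    "INCIDENT_NUMBER,OFFENSE_CODE,OFFENSE_CODE_GROUP,OFFENSE_DESCRIPTION,"
    "DISTRICT,REPORTING_AREA,SHOOTING,OCCURRED_ON_DATE,YEAR,MONTH,"
    "DAY_OF_WEEK,HOUR,UCR_PART,STREET,LATITUDE,LONGITUDE,LOCATION"
).split(",")

def normalize(name):
    """Lowercase a column name and drop underscores for a loose comparison."""
    return name.replace("_", "").lower()

# Exact, case-sensitive matches between the two schemas.
exact = set(ny_columns) & set(boston_columns)

# Matches once naming style is ignored.
loose = {normalize(c) for c in ny_columns} & {normalize(c) for c in boston_columns}

print(sorted(exact))  # []
print(sorted(loose))  # ['latitude', 'longitude', 'offensedescription']
```

No column names match exactly, and only three line up once naming style is ignored. Other concepts, such as the date of the incident, are spread across differently named and differently structured columns (`fromDate` and `fromTime` versus `OCCURRED_ON_DATE`).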

In the next lesson, you will use an ADF Databricks Notebook activity to perform data cleanup and extract homicide statistics.

## Next Steps

Start the next lesson, [Data Transformation]($./03-Data-Transformation)