# Join
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

In DataPrep you can easily join two dataflows.

In [1]:
import azureml.dataprep as dprep


First let's get the left side of the data into a shape that is ready for the join.

In [2]:
# get the first dataflow and derive desired key column
dataflow_l = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/testfiles/BostonWeather.csv')
dataflow_l = dataflow_l.derive_column_by_example(source_columns='DATE', new_column_name='date_timerange',
                                                 example_data=[('11/11/2015 0:54', 'Nov 11, 2015 | 12AM-2AM'),
                                                              ('2/1/2015 0:54', 'Feb 1, 2015 | 12AM-2AM'),
                                                              ('1/29/2015 20:54', 'Jan 29, 2015 | 8PM-10PM')])
dataflow_l = dataflow_l.drop_columns(['DATE'])

# convert types and summarize data
dataflow_l = dataflow_l.set_column_types(type_conversions={'HOURLYDRYBULBTEMPF': dprep.TypeConverter(dprep.FieldType.DECIMAL)})
dataflow_l = dataflow_l.filter(expression=dprep.f_not(dprep.col('HOURLYDRYBULBTEMPF').is_error()))
dataflow_l = dataflow_l.summarize(
    group_by_columns=['date_timerange'],
    summary_columns=[dprep.SummaryColumnsValue('HOURLYDRYBULBTEMPF',
                                               dprep.api.engineapi.typedefinitions.SummaryFunction.MEAN,
                                               'HOURLYDRYBULBTEMPF_Mean')])

# cache the result so the steps above are not executed every time we pull on the data
import os
from pathlib import Path
cache_dir = str(Path(os.getcwd(), 'dataflow-cache'))
dataflow_l = dataflow_l.cache(directory_path=cache_dir)
dataflow_l.head(10)

Unnamed: 0,date_timerange,HOURLYDRYBULBTEMPF_Mean
0,"Jan 1, 2015 | 12AM-2AM",22.0
1,"Jan 1, 2015 | 2AM-4AM",23.0
2,"Jan 1, 2015 | 4AM-6AM",23.0
3,"Jan 1, 2015 | 6AM-8AM",22.666667
4,"Jan 1, 2015 | 8AM-10AM",26.5
5,"Jan 1, 2015 | 10AM-12PM",29.666667
6,"Jan 1, 2015 | 12PM-2PM",32.333333
7,"Jan 1, 2015 | 2PM-4PM",32.0
8,"Jan 1, 2015 | 4PM-6PM",30.666667
9,"Jan 1, 2015 | 6PM-8PM",30.666667
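
To make the derived key concrete, the sketch below reproduces the `date_timerange` format by hand in plain Python and then averages readings per bucket, loosely mirroring the derive, filter, and summarize steps above. It is not how `derive_column_by_example` works internally (DataPrep infers the transformation from the examples). The helper names `hour_label`, `to_two_hour_bucket`, and `mean_by_bucket`, as well as the sample readings, are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

MONTHS = ('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')


def hour_label(hour):
    """Format a 24-hour value as a 12-hour label, e.g. 0 -> '12AM', 20 -> '8PM'."""
    suffix = 'AM' if hour % 24 < 12 else 'PM'
    return f'{hour % 12 or 12}{suffix}'


def to_two_hour_bucket(timestamp, fmt='%m/%d/%Y %H:%M'):
    """Map a timestamp string to a label like 'Nov 11, 2015 | 12AM-2AM'."""
    dt = datetime.strptime(timestamp, fmt)
    start = dt.hour - dt.hour % 2  # round down to an even hour
    return (f'{MONTHS[dt.month - 1]} {dt.day}, {dt.year} | '
            f'{hour_label(start)}-{hour_label(start + 2)}')


def mean_by_bucket(readings):
    """Average numeric readings per two-hour bucket, skipping unparseable values."""
    totals = defaultdict(lambda: [0.0, 0])
    for timestamp, raw_value in readings:
        try:
            value = float(raw_value)
        except ValueError:
            continue  # analogous to filtering out type-conversion errors
        entry = totals[to_two_hour_bucket(timestamp)]
        entry[0] += value
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Illustrative readings only; the values are not taken from BostonWeather.csv.
sample = [('1/29/2015 20:54', '30'),
          ('1/29/2015 21:54', '32'),
          ('1/29/2015 22:54', 'n/a')]
bucket_means = mean_by_bucket(sample)
print(bucket_means)
```

Passing `fmt='%Y-%m-%d %H:%M:%S'` handles timestamps shaped like the bike-share `starttime` values, so both sides of the join end up with keys in the same format.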


Now let's prepare the data for the right side of the join.

In [3]:
# get the second dataflow and desired key column
dataflow_r = dprep.read_csv(path='https://dpreptestfiles.blob.core.windows.net/bike-share/*-hubway-tripdata.csv')
dataflow_r = dataflow_r.keep_columns(['starttime', 'start station id'])
dataflow_r = dataflow_r.derive_column_by_example(source_columns='starttime', new_column_name='l_date_timerange',
                                                 example_data=[('2015-01-01 00:21:44', 'Jan 1, 2015 | 12AM-2AM')])
dataflow_r = dataflow_r.drop_columns('starttime')

# cache the results
dataflow_r = dataflow_r.cache(directory_path=cache_dir)
dataflow_r.head(10)

Unnamed: 0,l_date_timerange,start station id
0,"Jan 1, 2015 | 12AM-2AM",115
1,"Jan 1, 2015 | 12AM-2AM",80
2,"Jan 1, 2015 | 12AM-2AM",91
3,"Jan 1, 2015 | 12AM-2AM",115
4,"Jan 1, 2015 | 12AM-2AM",105
5,"Jan 1, 2015 | 12AM-2AM",88
6,"Jan 1, 2015 | 12AM-2AM",91
7,"Jan 1, 2015 | 2AM-4AM",68
8,"Jan 1, 2015 | 4AM-6AM",117
9,"Jan 1, 2015 | 8AM-10AM",67


There are three ways to join two dataflows in DataPrep:
1. Create a `JoinBuilder` object for interactive join configuration.
2. Call `join()` on one of the dataflows and pass in the other, along with all other arguments.
3. Call the `Dataflow.join()` method and pass in both dataflows, along with all other arguments.

We will explore the builder object first because it makes the correct arguments easier to determine.

In [4]:
# construct a builder for joining dataflow_l with dataflow_r
join_builder = dataflow_l.builders.join(right_dataflow=dataflow_r, left_column_prefix='l', right_column_prefix='r')

join_builder

JoinBuilder:
    join_key_pairs: None
    left_column_prefix: l
    right_column_prefix: r
    left_non_prefixed_columns: []
    right_non_prefixed_columns: []
    is_join_suggestion_applied: False
    is_suggested_join_key_generated: N/A

As you can see, the builder so far has no properties set other than the default values.
From here you can set each option and preview its effect on the join result, or let DataPrep determine some of them.

Let's start by determining appropriate column prefixes for the left and right sides of the join, along with the lists of columns that do not conflict and therefore don't need to be prefixed.

In [5]:
join_builder.detect_column_info()
join_builder

JoinBuilder:
    join_key_pairs: None
    left_column_prefix: l2_
    right_column_prefix: r_
    left_non_prefixed_columns: ['date_timerange', 'HOURLYDRYBULBTEMPF_Mean', 'KEY_generated']
    right_non_prefixed_columns: ['l_date_timerange', 'start station id', 'KEY_generated']
    is_join_suggestion_applied: False
    is_suggested_join_key_generated: N/A

You can see that DataPrep has performed a pull on both dataflows to determine their column names. Because `dataflow_r` already had a column starting with `l_`, a new prefix was generated that does not collide with any existing column name.
Additionally, columns in each dataflow that won't conflict during the join remain unprefixed.
This approach to column naming is crucial for making the join robust to schema changes in the data. Suppose that at some point in the future the data consumed by the left dataflow also contains an `l_date_timerange` column.
Configured as above, the join will still run as expected, and the new column will be prefixed with `l2_`. If the existing `l_date_timerange` column is consumed by some later transformation, it remains unaffected.

Note: `KEY_generated` is appended to both lists and is reserved for DataPrep use in case Autojoin is performed.
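
The prefix selection described above can be pictured with a small plain-Python sketch. This is only one plausible strategy, assumed for illustration; the source does not say how `detect_column_info()` actually chooses prefixes, and `pick_prefix` is a hypothetical helper, not a DataPrep function.

```python
def pick_prefix(base, existing_columns):
    """Return the first of base_, base2_, base3_, ... that no existing column starts with."""
    candidate = f'{base}_'
    suffix = 2
    while any(name.startswith(candidate) for name in existing_columns):
        candidate = f'{base}{suffix}_'
        suffix += 1
    return candidate


# Column names as they appear in the two dataflows above.
left_columns = ['date_timerange', 'HOURLYDRYBULBTEMPF_Mean']
right_columns = ['l_date_timerange', 'start station id']
all_columns = left_columns + right_columns

left_prefix = pick_prefix('l', all_columns)
right_prefix = pick_prefix('r', all_columns)
print(left_prefix, right_prefix)  # l2_ r_
```

Because `l_date_timerange` already starts with `l_`, the sketch moves on to `l2_`, which matches the prefix shown in the builder output.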

### Autojoin
Autojoin is a DataPrep feature that determines suitable join arguments given the data on both sides. In some cases Autojoin can even derive a key column from a number of available columns in the data.
Here is how you can use Autojoin:

In [6]:
# generate join suggestions
join_builder.generate_suggested_join()

In [7]:
join_builder.list_join_suggestions()

'[{0: "\\nSuggestion:\\n    Left:\\n        Needs transform: False\\n        % of matched rows: 0.18604651162790697\\n    Right:\\n        Needs transform: False\\n        % of matched rows: 1.0\\n    Join keys: [(\'date_timerange\', \'l_date_timerange\')]\\n\\n"}]'
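
One way to read the `% of matched rows` figures in the suggestion is as the fraction of rows on each side whose key also appears on the other side. That interpretation is an assumption, since the source does not define the metric. The sketch below computes it in plain Python for the ten keys shown in each `head(10)` output above; `matched_fraction` is a hypothetical helper.

```python
def matched_fraction(keys, other_keys):
    """Fraction of `keys` that also occur at least once in `other_keys`."""
    if not keys:
        return 0.0
    lookup = set(other_keys)
    return sum(1 for key in keys if key in lookup) / len(keys)


day = 'Jan 1, 2015 | '
# Keys from the first ten rows of each dataflow, as displayed earlier.
left_keys = [day + r for r in ('12AM-2AM', '2AM-4AM', '4AM-6AM', '6AM-8AM', '8AM-10AM',
                               '10AM-12PM', '12PM-2PM', '2PM-4PM', '4PM-6PM', '6PM-8PM')]
right_keys = [day + r for r in ('12AM-2AM',) * 7 + ('2AM-4AM', '4AM-6AM', '8AM-10AM')]

left_fraction = matched_fraction(left_keys, right_keys)
right_fraction = matched_fraction(right_keys, left_keys)
print(left_fraction, right_fraction)
```

On this ten-row sample, only some left-side time ranges have trips, while every trip falls in a time range that has a weather reading. The full-data suggestion shows the same pattern, with a low left-side fraction and a right-side fraction of 1.0.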

In [8]:
join_builder.apply_suggestion(0)

In [9]:
join_builder.preview(10)

Unnamed: 0,date_timerange,HOURLYDRYBULBTEMPF_Mean,l_date_timerange,start station id
0,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",75
1,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",115
2,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",88
3,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",90
4,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",116
5,"Jan 1, 2015 | 10AM-12PM",29.666667,"Jan 1, 2015 | 10AM-12PM",88
6,"Jan 1, 2015 | 10AM-12PM",29.666667,"Jan 1, 2015 | 10AM-12PM",95
7,"Jan 1, 2015 | 10AM-12PM",29.666667,"Jan 1, 2015 | 10AM-12PM",116
8,"Jan 1, 2015 | 10AM-12PM",29.666667,"Jan 1, 2015 | 10AM-12PM",116
9,"Jan 1, 2015 | 10AM-12PM",29.666667,"Jan 1, 2015 | 10AM-12PM",110


Everything looks just as we would expect, so it is time to get our new joined dataflow.

In [10]:
dataflow_autojoined = join_builder.to_dataflow().drop_columns(['l_date_timerange'])

### Joining two dataflows without pulling the data

If you don't want to pull on the data and already know what the join should look like, you can use the `join` method on `Dataflow` directly.

In [11]:
dataflow_joined = dprep.Dataflow.join(left_dataflow=dataflow_l,
                                      right_dataflow=dataflow_r,
                                      join_key_pairs=[('date_timerange', 'l_date_timerange')],
                                      left_column_prefix='l2_',
                                      right_column_prefix='r_')


In [12]:
dataflow_joined.head(10)

Unnamed: 0,l2_date_timerange,l2_HOURLYDRYBULBTEMPF_Mean,r_l_date_timerange,r_start station id
0,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",115
1,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",80
2,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",91
3,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",115
4,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",105
5,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",88
6,"Jan 1, 2015 | 12AM-2AM",22.0,"Jan 1, 2015 | 12AM-2AM",91
7,"Jan 1, 2015 | 2AM-4AM",23.0,"Jan 1, 2015 | 2AM-4AM",68
8,"Jan 1, 2015 | 4AM-6AM",23.0,"Jan 1, 2015 | 4AM-6AM",117
9,"Jan 1, 2015 | 8AM-10AM",26.5,"Jan 1, 2015 | 8AM-10AM",67
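
To show how the prefixes end up on every output column, here is a minimal plain-Python sketch of a prefixed join over lists of dictionaries. It assumes inner-join semantics, which the output above is consistent with but which the source does not state. `inner_join` is a hypothetical helper, not part of DataPrep. The sample rows are copied from the tables shown earlier.

```python
def inner_join(left_rows, right_rows, key_pairs, left_prefix, right_prefix):
    """Join two lists of dict rows on key_pairs and prefix every output column."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            if all(left[l_key] == right[r_key] for l_key, r_key in key_pairs):
                row = {left_prefix + name: value for name, value in left.items()}
                row.update({right_prefix + name: value for name, value in right.items()})
                joined.append(row)
    return joined


left_rows = [
    {'date_timerange': 'Jan 1, 2015 | 12AM-2AM', 'HOURLYDRYBULBTEMPF_Mean': 22.0},
    {'date_timerange': 'Jan 1, 2015 | 2AM-4AM', 'HOURLYDRYBULBTEMPF_Mean': 23.0},
    {'date_timerange': 'Jan 1, 2015 | 6AM-8AM', 'HOURLYDRYBULBTEMPF_Mean': 22.666667},
]
right_rows = [
    {'l_date_timerange': 'Jan 1, 2015 | 12AM-2AM', 'start station id': 115},
    {'l_date_timerange': 'Jan 1, 2015 | 12AM-2AM', 'start station id': 80},
    {'l_date_timerange': 'Jan 1, 2015 | 2AM-4AM', 'start station id': 68},
]

joined_rows = inner_join(left_rows, right_rows,
                         key_pairs=[('date_timerange', 'l_date_timerange')],
                         left_prefix='l2_', right_prefix='r_')
for row in joined_rows:
    print(row)
```

The nested loop is quadratic and only meant to make the semantics visible. The 6AM-8AM weather row has no matching trip in this sample, so it does not appear in the result.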
