# Append Columns and Rows
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.<br>

Often the data we want does not come in a single dataset: it arrives from different locations, has its features split across sources, or is simply not homogeneous. Unsurprisingly, we typically want to work with a single dataset at a time.

Azure ML Data Prep allows the concatenation of two or more dataflows by means of column and row appends.

We will demonstrate this by defining a single dataflow that will pull data from multiple datasets.

## Table of Contents
[append_columns(dataflows)](#append_columns(dataflows))<br>
[append_rows(dataflows)](#append_rows(dataflows))

## `append_columns(dataflows)`
We can append data width-wise, which changes some or all existing rows and may add new rows (on the assumption that the data in the two datasets is aligned by row number).

However, we cannot do this if a reference dataflow has a schema that clashes with that of the target dataflow. Observe:

In [1]:
from azureml.dataprep import auto_read_file

In [2]:
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


In [3]:
dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')
dflow_chicago.head(5)

Unnamed: 0,Ward,Name,Took Office,Party
0,1.0,Proco Joe Moreno,2010*,Dem
1,2.0,Brian Hopkins,2015,Dem
2,3.0,Pat Dowell,2007,Dem
3,4.0,Sophia King,2016*,Dem
4,5.0,Leslie Hairston,1999,Dem


In [4]:
from azureml.dataprep import ExecutionError
try:
    dflow_combined_by_column = dflow.append_columns([dflow_chicago])
    dflow_combined_by_column.head(5)
except ExecutionError:
    print('Cannot append_columns with schema clash!')

Cannot append_columns with schema clash!


As expected, we cannot call `append_columns` with reference dataflows whose schema clashes with that of the target dataflow.

We can make the call once we rename or drop the offending columns. In more complex scenarios, we could opt to skip or filter to make rows align before appending columns. Here we will choose to simply drop the clashing column.
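
Before deciding what to rename or drop, it can help to list the clashing names. The following is a minimal pure-Python sketch, not part of the Data Prep API: `clashing_columns` is a hypothetical helper, and the column lists are copied by hand from the outputs above (the crime list is abbreviated). How the names are read from a dataflow is left out of the sketch.

```python
def clashing_columns(target_cols, reference_cols):
    """Return names present in both schemas, in the target's column order."""
    reference = set(reference_cols)
    return [name for name in target_cols if name in reference]


# Column names copied from the outputs above (crime list abbreviated).
crime_cols = ['ID', 'Case Number', 'Date', 'Block', 'Ward', 'Community Area']
aldermen_cols = ['Ward', 'Name', 'Took Office', 'Party']

clashes = clashing_columns(crime_cols, aldermen_cols)
print(clashes)  # ['Ward']
```

Any name in the returned list would need to be renamed or dropped on one side before the column append can succeed.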

In [5]:
dflow_combined_by_column = dflow.append_columns([dflow_chicago.drop_columns(['Ward'])])
dflow_combined_by_column.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,Name,Took Office,Party
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)",Proco Joe Moreno,2010*,Dem
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)",Brian Hopkins,2015,Dem
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,,,2015.0,07/12/2015 12:42:46 PM,,,,Pat Dowell,2007,Dem
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)",Sophia King,2016*,Dem
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)",Leslie Hairston,1999,Dem


Notice that the first N records of the resulting dataflow have more columns than before (N being the number of records in `dflow`; the extra columns are those of our reference dataflow, `dflow_chicago`, minus the `Ward` column). From record N+1 onwards, the records contain only the columns of the `Ward`-less chicago set.

Why is this? As much as possible, the data from the reference dataflow(s) will be attached to existing rows in the target dataflow. If there are not enough rows in the target dataflow to attach to, we simply append them as new rows.

Note that these are appends, not joins (for joins, see [Join](join.ipynb)), so the result may not be logically correct, but the append will take effect as long as there are no schema clashes.
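
The row-number alignment described above can be illustrated with a small pure-Python sketch. This is only a model of the observable behavior, not how Data Prep implements it: records are plain dicts, `append_columns_sketch` is a hypothetical name, and `ValueError` stands in for the `ExecutionError` raised on a schema clash.

```python
from itertools import zip_longest


def append_columns_sketch(target_rows, reference_rows):
    """Merge records pairwise by position; unmatched records pass through as-is."""
    combined = []
    for left, right in zip_longest(target_rows, reference_rows, fillvalue={}):
        clash = left.keys() & right.keys()
        if clash:
            raise ValueError(f'schema clash on columns: {sorted(clash)}')
        combined.append({**left, **right})
    return combined


# Two target records and three reference records (values from the outputs above).
crimes = [{'ID': 10140490, 'Primary Type': 'THEFT'},
          {'ID': 10139776, 'Primary Type': 'BATTERY'}]
aldermen = [{'Name': 'Proco Joe Moreno', 'Party': 'Dem'},
            {'Name': 'Brian Hopkins', 'Party': 'Dem'},
            {'Name': 'Pat Dowell', 'Party': 'Dem'}]

rows = append_columns_sketch(crimes, aldermen)
print(len(rows))  # 3
print(rows[0])    # widened: crime columns plus alderman columns
print(rows[2])    # past the end of the target: alderman columns only
```

The first two records gain the reference columns, while the third has nothing to attach to and is appended with only its own columns, which mirrors what the skipped rows show in the next cell.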

In [6]:
# Ward-less data after we skip the first N rows
dflow_len = dflow.row_count
dflow_combined_by_column.skip(dflow_len).head(5)

Unnamed: 0,Name,Took Office,Party
0,Patrick Daley Thompson,2015,Dem
1,George Cardenas,2003,Dem
2,Marty Quinn,2011,Dem
3,Edward M. Burke,1969,Dem
4,Raymond Lopez,2015,Dem


## `append_rows(dataflows)`
We can append data length-wise, which will only have the effect of adding new rows. No existing data will be changed.
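
As a rough mental model, a length-wise append behaves like concatenating lists of records. The sketch below is pure Python with dicts as records; `append_rows_sketch` is a hypothetical name, and the code is an illustration rather than Data Prep's implementation.

```python
from itertools import chain


def append_rows_sketch(target_rows, reference_row_lists):
    """Concatenate the target's records with each reference's records, in order."""
    return list(chain(target_rows, *reference_row_lists))


crimes = [{'ID': 10140490, 'Primary Type': 'THEFT'}]
aldermen = [{'Ward': 1.0, 'Name': 'Proco Joe Moreno'}]
spring = [{'ID': 10498554, 'Primary Type': 'DECEPTIVE PRACTICE'}]

rows = append_rows_sketch(crimes, [aldermen, spring])
print(len(rows))  # 3
print(rows[0])    # the target's first record, unchanged
print(rows[1])    # an alderman record, keeping its own columns
```

Each record keeps the columns it arrived with, so the references do not need to share the target's schema.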

In [7]:
from azureml.dataprep import auto_read_file

In [8]:
dflow = auto_read_file(path='../data/crime-dirty.csv')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


In [9]:
dflow_spring = auto_read_file(path='../data/crime-spring.csv')
dflow_spring.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554.0,HZ239907,2016-04-15 23:56:00,007XX E 111TH ST,1153.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9.0,50.0,11.0,1183356.0,1831503.0,2016.0,2016-05-11 15:48:00,41.692834,-87.604319,"(41.692833841, -87.60431945)"
1,10516598.0,HZ258664,2016-04-15 17:00:00,082XX S MARSHFIELD AVE,890.0,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21.0,71.0,6.0,1166776.0,1850053.0,2016.0,2016-05-12 15:48:00,41.744107,-87.664494,"(41.744106973, -87.664494285)"
2,10519196.0,HZ261252,2016-04-15 10:00:00,104XX S SACRAMENTO AVE,1154.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19.0,74.0,11.0,,,2016.0,2016-05-12 15:50:00,,,
3,10519591.0,HZ261534,2016-04-15 09:00:00,113XX S PRAIRIE AVE,1120.0,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9.0,49.0,10.0,,,2016.0,2016-05-13 15:51:00,,,
4,10534446.0,HZ277630,2016-04-15 10:00:00,055XX N KEDZIE AVE,890.0,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40.0,13.0,6.0,,,2016.0,2016-05-25 15:59:00,,,


In [10]:
dflow_chicago = auto_read_file(path='../data/chicago-aldermen-2015.csv')
dflow_chicago.head(5)

Unnamed: 0,Ward,Name,Took Office,Party
0,1.0,Proco Joe Moreno,2010*,Dem
1,2.0,Brian Hopkins,2015,Dem
2,3.0,Pat Dowell,2007,Dem
3,4.0,Sophia King,2016*,Dem
4,5.0,Leslie Hairston,1999,Dem


In [11]:
dflow_combined_by_row = dflow.append_rows([dflow_chicago, dflow_spring])
dflow_combined_by_row.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,07/05/2015 11:50:00 PM,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,07/12/2015 12:42:46 PM,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,07/05/2015 11:30:00 PM,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,07/12/2015 12:42:46 PM,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,07/05/2015 11:20:00 PM,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,07/12/2015 12:42:46 PM,,,
3,10139885.0,HY329308,07/05/2015 11:19:00 PM,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,07/12/2015 12:42:46 PM,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,07/05/2015 11:00:00 PM,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,07/12/2015 12:42:46 PM,41.88561,-87.657009,"(41.885610142, -87.657008701)"


Notice that neither schema nor data has changed for the target dataflow.

If we skip ahead, we will see the data from our reference dataflows.

In [12]:
# chicago data
dflow_len = dflow.row_count
dflow_combined_by_row.skip(dflow_len).head(5)

Unnamed: 0,Ward,Name,Took Office,Party
0,1.0,Proco Joe Moreno,2010*,Dem
1,2.0,Brian Hopkins,2015,Dem
2,3.0,Pat Dowell,2007,Dem
3,4.0,Sophia King,2016*,Dem
4,5.0,Leslie Hairston,1999,Dem


In [13]:
# crimes spring data
dflow_chicago_len = dflow_chicago.row_count
dflow_combined_by_row.skip(dflow_len + dflow_chicago_len).head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554.0,HZ239907,2016-04-15 23:56:00,007XX E 111TH ST,1153.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9.0,50.0,11.0,1183356.0,1831503.0,2016.0,2016-05-11 15:48:00,41.692834,-87.604319,"(41.692833841, -87.60431945)"
1,10516598.0,HZ258664,2016-04-15 17:00:00,082XX S MARSHFIELD AVE,890.0,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21.0,71.0,6.0,1166776.0,1850053.0,2016.0,2016-05-12 15:48:00,41.744107,-87.664494,"(41.744106973, -87.664494285)"
2,10519196.0,HZ261252,2016-04-15 10:00:00,104XX S SACRAMENTO AVE,1154.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19.0,74.0,11.0,,,2016.0,2016-05-12 15:50:00,,,
3,10519591.0,HZ261534,2016-04-15 09:00:00,113XX S PRAIRIE AVE,1120.0,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9.0,49.0,10.0,,,2016.0,2016-05-13 15:51:00,,,
4,10534446.0,HZ277630,2016-04-15 10:00:00,055XX N KEDZIE AVE,890.0,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40.0,13.0,6.0,,,2016.0,2016-05-25 15:59:00,,,
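
The `skip` offsets used above are running totals of the row counts of the dataflows that come earlier in the append. With more than a couple of references, those totals can be computed in one pass. This is a pure-Python sketch: `skip_offsets` is a hypothetical helper, and the row counts are placeholder numbers, not the actual sizes of the sample files.

```python
from itertools import accumulate


def skip_offsets(row_counts):
    """Return, for each dataflow in append order, the index of its first row."""
    # initial=0 requires Python 3.8+; the final total is dropped because it
    # marks the end of the combined data, not the start of a dataflow.
    return list(accumulate(row_counts, initial=0))[:-1]


# Placeholder row counts for a target followed by two references.
offsets = skip_offsets([10, 50, 10])
print(offsets)  # [0, 10, 60]
```

Each offset could then be passed to `skip` on the combined dataflow to jump to the first row contributed by the corresponding source.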
