# Amazon SageMaker Data Wrangler time series advanced transformations
Run this notebook only after you have finished the first part of the Data Wrangler time series transformation lab in the [`TS-Workshop.ipynb`](./TS-Workshop.ipynb) notebook.

In [None]:
%store -r bucket_name
%store -r data_uploaded
%store -r region

try:
    bucket_name
    data_uploaded
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] YOU HAVE TO RUN TS-Workshop notebook  ")
    print("++++++++++++++++++++++++++++++++++++++++++++++")

## Advanced time series dataset preparation
As we mentioned previously, the current dataset is suitable for managed services like [Amazon SageMaker Canvas](https://aws.amazon.com/sagemaker/canvas/), [Amazon Forecast](https://aws.amazon.com/forecast/) and the [DeepAR algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html) in SageMaker. 

All of them can automatically add multiple features, such as a weekend flag, day-of-week number, lag features, etc. If you are planning to use the dataset with other tools and algorithms, it is better to add a few more transformations. 

### Adding missing timestamp - locationID combinations. 
The current dataset might be missing some timestamp - locationID combinations. Let's add them with 0 as the value for the features. There is no built-in transformation for this, so we will create a custom one. 
To create a custom transformation you have to:
1. Click the plus sign next to a collection of transformation elements and choose Add transform.
![addMissingCombinations](./pictures/addMissingCombinations.png)
1. Click the orange "+ Add step" button in the TRANSFORMS menu.
![AddStep](./pictures/AddStep.png)
1. Choose Custom Transform. \
![CustomTransform](./pictures/CustomTransform.png)
1. In the drop-down menu select Python (PySpark) and use the code below. This code creates a new dataframe with all possible combinations of timestamps and location IDs and then joins it with the existing dataframe. All missing values are replaced by 0. 


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Build every (location, hour) pair: location IDs 1..263 and 10,200 hourly
# steps starting at Unix time 1546300800 (2019-01-01 00:00:00 UTC).
data = []
for location in range(1, 264):
    for hour in range(0, 10200):
        data.append((location, 1546300800 + hour * 3600))

schema = StructType([
    StructField("PULocationID", StringType(), True),
    StructField("pickup_time_temp", StringType(), True),
])
df_temp = spark.createDataFrame(data=data, schema=schema)

# Convert the Unix seconds into a timestamp column named like the one in df.
df_temp = df_temp.withColumn("timestamp", from_unixtime("pickup_time_temp"))
df_temp = df_temp.withColumn("pickup_time", col("timestamp").cast(TimestampType()))
df_temp = df_temp.drop("pickup_time_temp", "timestamp")

# The left join keeps every combination; combinations absent from df get
# nulls, which are then filled with 0 in the numeric columns.
df = df_temp.join(df, on=["pickup_time", "PULocationID"], how="left")
df = df.na.fill(value=0)

![combinationsCode](./pictures/combinationsCode.png)
1. Choose Preview
1. Choose Add to save the step.
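
If you want to sanity-check the constants used in the custom transform, the following plain-Python sketch decodes them. It assumes the Unix timestamps are interpreted in UTC; the variable names are ours and are not part of the Data Wrangler step.

```python
from datetime import datetime, timedelta, timezone

# Constants copied from the custom transform above.
START_EPOCH = 1546300800      # first hourly timestamp, in Unix seconds
HOURS = 10200                 # number of hourly steps generated per location
LOCATIONS = range(1, 264)     # location IDs 1..263

# First and last timestamps of the generated grid (UTC assumed).
first = datetime.fromtimestamp(START_EPOCH, tz=timezone.utc)
last = first + timedelta(hours=HOURS - 1)

# Size of the full grid before it is joined with the real data.
total_rows = len(LOCATIONS) * HOURS

print(first.isoformat(), last.isoformat(), total_rows)
# prints: 2019-01-01T00:00:00+00:00 2020-02-29T23:00:00+00:00 2682600
```

In other words, the grid covers 425 days of hourly slots for 263 locations, so the joined dataframe has about 2.7 million rows regardless of how sparse the original data is.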

When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset with a new column `pickup_time` and without the column `tpep_pickup_datetime`. Don't worry about the many zeros: the sample is only about 100 MB of the 6 GB dataset. 
![CombinationsResult](./pictures/CombinationsResult.png)

If you look at the data flow graph, you will notice a new branch with this transformation. This is because a flow can have several destination nodes to Amazon S3. 
![newDAG](./pictures/newDAG.png)

### Featurize datetime
The "Featurize date/time" time series transformation adds the year, month, day of the month, day of the year, week of the year, hour and quarter features to our dataset. Because we’re providing the date/time components as separate features, we enable ML algorithms to detect signals and patterns for improving prediction accuracy.

To create this transformation you have to:
1. Click the plus sign next to a collection of transformation elements and choose Add transform.
![addDateFeature](./pictures/addDateFeature.png)
1. Click the orange "+ Add step" button in the TRANSFORMS menu.
![AddStep](./pictures/AddStep.png)
1. Choose Time Series. \
![SelectTimeSeries](./pictures/SelectTimeSeries.png)
    1. For "Transform" choose "Featurize date/time"
    1. For "Input Column" choose pickup_time
    1. For "Output Column" enter "date" 
    1. For "Output mode" choose "Ordinal"
    1. For "Output format" choose "Columns"
    1. For date/time features to extract, select Year, Month, Day, Hour, Week of year, Day of year, and Quarter.
![dataFeatureConfig](./pictures/dataFeatureConfig.png)
1. Choose Preview
1. Choose Add to save the step.

When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset. 
![dateFeatureResult](./pictures/dateFeatureResult.png)
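
To get a feeling for what the ordinal date/time features look like, here is a minimal plain-Python sketch that derives the same kinds of values from a single timestamp. The function name `featurize` and the dictionary keys are hypothetical, and the week number uses the ISO 8601 definition, which may differ from the convention Data Wrangler applies.

```python
from datetime import datetime


def featurize(ts):
    """Split a timestamp into ordinal calendar features (illustrative names)."""
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "week_of_year": ts.isocalendar()[1],   # ISO 8601 week number
        "day_of_year": ts.timetuple().tm_yday,
        "quarter": (ts.month - 1) // 3 + 1,
    }


features = featurize(datetime(2019, 7, 4, 15))
print(features)
```

Each value becomes its own numeric column, which is what the "Ordinal" output mode with the "Columns" output format produces in spirit.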

### Lag feature
Let’s create lag features for the target column `count`. Lag features in time series analysis are values at prior timestamps that are considered helpful in inferring future values. They also help identify autocorrelation, also known as serial correlation, patterns in the residual series by quantifying the relationship of the observation with observations at previous time steps. Autocorrelation is similar to regular correlation but between the values in a series and its past values. It forms the basis for the autoregressive forecasting models in the ARIMA series.
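
To make autocorrelation concrete, here is a minimal plain-Python sketch of the usual sample autocorrelation at a given lag. The helper name `autocorrelation` and the toy series are ours, chosen only for illustration; they are not part of Data Wrangler.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    covariance_sum = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance_sum / variance_sum


# A toy series that alternates every step: neighbours disagree,
# values two steps apart agree.
alternating = [1, -1, 1, -1, 1, -1]
print(autocorrelation(alternating, 1))  # negative: adjacent values move in opposite directions
print(autocorrelation(alternating, 2))  # positive: values two steps apart move together
```

A strong positive value at some lag suggests that the observation at that lag is a useful input feature for predicting the current value.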

With the Data Wrangler Lag feature transform, you can easily create lag features `n` periods apart. Additionally, we often want to create multiple lag features at different lags and let the model decide the most meaningful features. For such a scenario, the **Lag features** transform helps create multiple lag columns over a specified window size.

To create this transformation you have to:
1. Click the plus sign next to a collection of transformation elements and choose Add transform.
![addLag](./pictures/addLag.png)
1. Click the orange "+ Add step" button in the TRANSFORMS menu.
![AddStep](./pictures/AddStep.png)
1. Choose Time Series. \
![SelectTimeSeries](./pictures/SelectTimeSeries.png)
    1. For "Transform" choose "Lag features"
    1. For "Generate lag features for this column" choose "count"
    1. For "ID column" enter "PULocationID" 
    1. For "Timestamp Column" choose "pickup_time"
    1. For Lag, enter 8. You could try different values; in our case 24 hours might make more sense. 
    1. Because we’re interested in observing up to the previous 8 lag values, let’s select Include the entire lag window.
    1. To create a new column for each lag value, select Flatten the output.
![lagConfig](./pictures/lagConfig.png)
1. Choose Preview
1. Choose Add to save the step.

When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset. 
![lagResult](./pictures/lagResult.png)
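
The idea behind the flattened lag window can be sketched in a few lines of plain Python. This is only an illustration for a single location's hourly series: the function name `add_lags` and the column names `count_lag_N` are hypothetical and do not reflect the names Data Wrangler generates.

```python
def add_lags(values, max_lag):
    """Return one row per observation with lag 1..max_lag as separate columns.

    Lags that reach before the start of the series are left as None.
    """
    rows = []
    for i, value in enumerate(values):
        row = {"count": value}
        for lag in range(1, max_lag + 1):
            row[f"count_lag_{lag}"] = values[i - lag] if i - lag >= 0 else None
        rows.append(row)
    return rows


# Toy hourly counts for one location, already sorted by pickup time.
hourly_counts = [5, 7, 0, 3, 9]
lagged = add_lags(hourly_counts, max_lag=2)
for row in lagged:
    print(row)
```

In the real dataset the same logic has to be applied separately per `PULocationID`, which is why the transform asks for an ID column and a timestamp column.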

### Rolling window features
We can also calculate meaningful statistical summaries across a range of values and include them as input features. Let’s extract common statistical time series features.

Data Wrangler implements automatic time series feature extraction capabilities using the open source `tsfresh` package. With the time series feature extraction transforms, you can automate the feature extraction process. This eliminates the time and effort otherwise spent manually implementing signal processing libraries. We will extract features using the **Rolling window** features transform. This method computes statistical properties across a set of observations defined by the window size.

To create this transformation you have to:
1. Click the plus sign next to a collection of transformation elements and choose Add transform.
![addRolling](./pictures/addRolling.png)
1. Click the orange "+ Add step" button in the TRANSFORMS menu.
![AddStep](./pictures/AddStep.png)
1. Choose Time Series. \
![SelectTimeSeries](./pictures/SelectTimeSeries.png)
    1. For "Transform" choose "Rolling window features"
    1. For "Generate rolling window features for this column" choose "count"
    1. For "Timestamp Column" choose "pickup_time"
    1. For "ID column" enter "PULocationID" 
    1. For "Window size", enter 8. You could try different values; in our case 24 hours might make more sense. 
    1. Select Flatten to create a new column for each computed feature.
    1. Choose "Strategy" as "Minimal subset". This strategy extracts eight features that are useful in downstream analyses. Other strategies include Efficient Subset, Custom subset, and All features. \
![rollingConfig](./pictures/rollingConfig.png)
1. Choose Preview
1. Choose Add to save the step.

When the transformation is applied to the sampled data, you should see all current steps and a preview of the resulting dataset. 
![rollingResult](./pictures/rollingResult.png)
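
As a rough illustration of what a rolling window computes, the plain-Python sketch below slides a fixed-size window over one series and summarizes each window. The four statistics and the name `rolling_features` are our own choice for the example; they are not the exact feature set produced by the "Minimal subset" strategy, and the way Data Wrangler aligns the window with the current row may differ.

```python
from statistics import mean, pstdev


def rolling_features(values, window):
    """Summarize each full window of `window` consecutive observations."""
    rows = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        rows.append(
            {
                "mean": mean(chunk),
                "min": min(chunk),
                "max": max(chunk),
                "std": pstdev(chunk),  # population standard deviation
            }
        )
    return rows


rolled = rolling_features([2, 4, 6, 8], window=2)
for row in rolled:
    print(row)
```

With flattening enabled, each statistic ends up in its own column, just like the keys of the dictionaries in this sketch.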

### Dataset export
We have transformed the time series dataset and are ready to use it as input for a custom forecasting algorithm. The last step is to export the transformed dataset to Amazon S3. We are going to repeat the steps from the "Dataset export" section in the [`TS-Workshop.ipynb`](./TS-Workshop.ipynb) notebook.

To do that you have to:
1. Click the plus sign next to a collection of transformation elements and choose "Add destination"->"Amazon S3".
![addNewExport](./pictures/addNewExport.png)
1. Provide parameters for S3 destination:
    1. "Dataset name" - a name for the new dataset, I used "NYC_export_advanced"
    1. "File type" - CSV
    1. Delimiter - Comma
    1. Compression - none
    1. "Amazon S3 location" - I use the bucket we created at the beginning with an additional path in it, like "s3://979894173312-us-east-1-datawranglertimeseries-6596/NYC_export_advanced/"
1. Click the orange "Add destination" button \
![newDestinationConfig](./pictures/newDestinationConfig.png)
1. Now your data flow has a new final step and you can click the orange "Create job" button. 
![flowCompletedNew](./pictures/flowCompletedNew.png)
1. Provide a "Job name" (you can keep the autogenerated one) and select a destination. We have two (the previous one and the new one); let's use only the new one, "S3: NYC_export_advanced", but you are allowed to select both. Leave the "KMS key ARN" field empty and click the orange "Next" button. 
![newJob1](./pictures/newJob1.png)
1. Now you have to provide the compute configuration for the job. You can keep all default values:
    1. For "Instance type" use "ml.m5.4xlarge"
    1. For "Instance count" use "2"
    1. You can explore "Additional configuration", but leave it unchanged. 
    1. Click the orange "Run" button \
![Job2](./pictures/Job2.png)
1. Your job has now started and will take about 3 hours to process 6 GB of data according to our workflow. The cost of this job will be around 6 USD, as "ml.m5.4xlarge" costs 0.922 USD per hour and we are using two of them. 
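
The cost estimate follows directly from the numbers above. A minimal sketch of the arithmetic, using the hourly price, instance count and runtime quoted in this notebook (actual prices depend on your region and may have changed):

```python
# Figures quoted in this lab; check current pricing for your region.
PRICE_PER_HOUR_USD = 0.922   # ml.m5.4xlarge, per instance
INSTANCE_COUNT = 2
ESTIMATED_HOURS = 3

estimated_cost = PRICE_PER_HOUR_USD * INSTANCE_COUNT * ESTIMATED_HOURS
print(f"Estimated job cost: {estimated_cost:.2f} USD")
# prints: Estimated job cost: 5.53 USD
```

If you change the instance type or count in the job configuration, adjust the constants to get a new rough estimate.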

Run the following cell and click on the link to go to the SageMaker console to see the processing jobs.

In [None]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Open <a target="_blank" href="https://{}.console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/">Processing Jobs</a></b>'.format(
            region, region
        )
    )
)

<div class="alert alert-info"> 💡
<b>Congratulations!</b><br/>
You reached the end of the Data Wrangler time series advanced lab! Now you know how to use Amazon SageMaker Data Wrangler for advanced time series transformations!
</div>

## Clean up
Please move to the cleanup notebook [`TS-Workshop-Cleanup.ipynb`](./TS-Workshop-Cleanup.ipynb) to remove resources incurring charges.

## Release resources
The following code will stop the kernel in this notebook.

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>