Using Machine Learning and the Internet of Things to predict the pollution level at a given time from the pollution levels of the last N timesteps. We simulate an IoT device using AWS Greengrass and collect its streaming readings in AWS IoT Core. Preconfigured rules push those readings to AWS IoT Analytics. Once sufficient data has accumulated, it is pulled into Amazon SageMaker, which trains a model and saves the artefacts to Amazon S3. A Lambda function at the edge (on the Greengrass core) then predicts the pollution level for the next hour, given the data for the last N hours.
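For illustration, here is a minimal sketch of how a simulated reading could be published to the project's IoT Core topic from Python, assuming boto3 with configured credentials; the payload fields are assumptions, not the actual simulator's schema.

```python
import json
import time

import boto3

iot = boto3.client("iot-data")

# Hypothetical hourly reading; the field names are assumptions for illustration.
reading = {"timestamp": int(time.time()), "pm25": 120.0}

# Publish to the topic the IoT rule listens on (see the deployment steps below).
iot.publish(
    topic="pollution/data",
    qos=0,
    payload=json.dumps(reading).encode("utf-8"),
)
```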
The dataset is the Beijing PM2.5 dataset, which contains hourly PM2.5 readings recorded at the US Embassy in Beijing between January 1st, 2010 and December 31st, 2014.
- Handling Missing Values - dropped, as they account for less than 5% of the data and the majority occur in the target feature
- Encoding Categorical Values - Label Encoder
- Data Normalization - Min-Max Scaler
- Converting Sequence Data to Time-Series Samples
- Reshaping the Dataset - (samples, time-steps, features); see the sketch below
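A minimal sketch of these preprocessing steps, assuming the raw CSV from the UCI repository; the file name, the lookback window, and the exact feature set are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

N_TIMESTEPS = 24  # assumed lookback window (last N hours)

df = pd.read_csv("PRSA_data_2010.1.1-2014.12.31.csv")   # assumed file name
df = df.dropna(subset=["pm2.5"])                        # drop missing targets (<5% of rows)
df["cbwd"] = LabelEncoder().fit_transform(df["cbwd"])   # encode wind direction

features = ["pm2.5", "DEWP", "TEMP", "PRES", "cbwd", "Iws", "Is", "Ir"]
scaled = MinMaxScaler().fit_transform(df[features])     # normalize to [0, 1]

# Convert the hourly sequence into supervised samples:
# X has shape (samples, time-steps, features); y is the next hour's PM2.5.
X = np.array([scaled[i : i + N_TIMESTEPS] for i in range(len(scaled) - N_TIMESTEPS)])
y = scaled[N_TIMESTEPS:, 0]
```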
As we forecast the pollution level from the last N timesteps, an LSTM network is a good fit for the problem. The working of the LSTM model is described in LSTM in Simple Words. A Long Short-Term Memory network is capable of retaining information such as seasonal and periodic variation in the data, along with the regression attributes. We explored several variations of the model before finalizing the current one.
Results of the different variations of the model we experimented with are below:
Model | Layers | Neurons | Activation | RMSE |
---|---|---|---|---|
LSTM | 1 | 50 | tanh | 32 |
LSTM | 1 | 50 | relu | 29 |
LSTM | 1 | 50 | selu | 27 |
LSTM | 3 | 50 | selu | 25 |
BiLSTM | 1 | 64 | selu | 27 |
BiLSTM | 3 | 64 | relu | 23 |
BiLSTM | 3 | 64 | selu | 20 |
Other hyperparameters, such as the number of epochs (20-100), the optimizer (Adam, SGD), and the learning rate (0.001 and 0.01), were also explored (see the notebook for details).
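For reference, a minimal Keras sketch of the best-performing configuration from the table (3-layer BiLSTM, 64 neurons, selu activation); the input shape, output layer, and compile settings are assumptions.

```python
from tensorflow.keras.layers import LSTM, Bidirectional, Dense
from tensorflow.keras.models import Sequential

N_TIMESTEPS = 24  # assumed lookback window
N_FEATURES = 8    # assumed feature count after encoding

# Three bidirectional LSTM layers with 64 units and selu activation,
# matching the last row of the results table above.
model = Sequential([
    Bidirectional(LSTM(64, activation="selu", return_sequences=True),
                  input_shape=(N_TIMESTEPS, N_FEATURES)),
    Bidirectional(LSTM(64, activation="selu", return_sequences=True)),
    Bidirectional(LSTM(64, activation="selu")),
    Dense(1),  # predicted PM2.5 level for the next hour
])
model.compile(optimizer="adam", loss="mse")
```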
The solution is deployed on AWS, and we leverage AWS CloudFormation (CFN) to maintain the infrastructure as code.
- Base Services - VPC, Subnet, Route Table, Security Group
- Simulation Services - Autoscaling + LaunchConfiguration + Greengrass Core
- IoT Services - Rule, Subscription, Topic
- Machine Learning Services - SageMaker Notebook and Training Job
- Auxiliary Services - S3, IAM, Certificates, etc.
- Configure AWS credentials
- Create an S3 bucket where the artefacts (files, model, CFN templates) will be uploaded for deployment
- Clone the repo
- From the root of the app directory, run `deploy.sh`, passing the required arguments; `-h` or `--help` can be used to see all available arguments
- Switch over to the AWS Console and check CloudFormation. Wait for the stack to reach the CREATE_COMPLETE status.
- Head over to the AWS Console for Greengrass and deploy the GreengrassGroup with Automatic Detection.
- Switch to the IoT Core console and subscribe to the `pollution/data` topic.
- Update the publish topic to `pollution/data/ingest/trigger` and click Publish to topic.
- Switch over to the IoT Analytics page and run the dataset `iotanalyticssqldataset` (the trigger publish and dataset run can also be scripted; see the sketch after this list).
- Head over to the SageMaker notebook instance, open the notebook named SagemakerNotebookInstance, and follow the instructions in it to run it.
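A hedged sketch of scripting the trigger publish and the IoT Analytics dataset run with boto3; the trigger payload content is an assumption (the trigger message itself, not its body, is presumably what starts ingestion).

```python
import boto3

iot = boto3.client("iot-data")
iota = boto3.client("iotanalytics")

# Nudge the ingest pipeline on the Greengrass core; the payload content
# is an assumption for illustration.
iot.publish(topic="pollution/data/ingest/trigger", qos=0, payload=b"{}")

# Materialize the SQL dataset so the SageMaker notebook can pull it.
iota.create_dataset_content(datasetName="iotanalyticssqldataset")
```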
- Switch over to the Greengrass console and click on Group.
- Select the group named GreengrassGroup and add a machine learning resource.
- When prompted, give it a name and select Use a model trained in AWS SageMaker. Select the training job you just ran, prefixed with pollution-forecasting-lstm.
- Give the local path as /dest/.
- Select the Lambda affiliation and pick the Lambda prefixed with -InferenceLambda (a sketch of such a Lambda follows this list). Leave the Read-only access and click Save.
- Expand Actions and click Deploy.
- Subscribe to the three topics `pollution/data`, `pollution/data/infer`, and `pollution/data/model/accuracy`.
- Publish the default message to the topic `pollution/data/infer/trigger`.
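For orientation, a hedged sketch of what the edge inference Lambda could look like, assuming the Greengrass Core SDK for Python and a Keras model mounted at the /dest/ path configured above; the artefact file name, event shape, and output fields are assumptions.

```python
import json

import greengrasssdk
import numpy as np
from tensorflow.keras.models import load_model

client = greengrasssdk.client("iot-data")
model = load_model("/dest/model.h5")  # artefact file name is an assumption

def function_handler(event, context):
    # Assume the event carries the last N hourly readings, already scaled,
    # as a list of feature vectors.
    readings = np.array(event["readings"])
    window = readings.reshape(1, readings.shape[0], readings.shape[1])
    prediction = float(model.predict(window)[0][0])
    # Publish the next-hour forecast to the topic subscribed to above.
    client.publish(
        topic="pollution/data/infer",
        payload=json.dumps({"pm25_next_hour": prediction}),
    )
```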
- Reset deployments on the Greengrass group.
- Delete the CloudFormation stack.
- Optionally, clear the artefacts from your S3 bucket.
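The cleanup can also be scripted; a minimal sketch with boto3, where the group ID and stack name are placeholder assumptions.

```python
import boto3

gg = boto3.client("greengrass")
cfn = boto3.client("cloudformation")

# Reset deployments on the Greengrass group (group ID is a placeholder).
gg.reset_deployments(GroupId="<greengrass-group-id>", Force=True)

# Tear down the CloudFormation stack (stack name is a placeholder).
cfn.delete_stack(StackName="<cloudformation-stack-name>")
```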