## Overview
A notification is a message that can originate from multiple sources as shown:  

<img src="./images/notifications.png" width=400 height=auto />

We need to design a system that sends notifications as push notifications, SMS messages and emails. The system should be as close to real time as possible, meaning users receive a notification soon after it is generated. Users should be able to opt out of notifications.

The system should be able to handle the following volume:
- 10 million mobile push notifications
- 1 million SMS
- 5 million emails

## Notification Modes
### iOS Push Notification
Involves the following entities:
- **Provider:** is a server that you deploy, manage and configure to work with APNs
- **APNs:** Apple Push Notification service, a service provided by Apple that sends notifications to devices
- **Client device:** the targeted device, such as an iPhone
- **Client app:** the developer's mobile app

Once the user installs the client app and grants notification permission, the app communicates with APNs to generate a device token. The provider stores device tokens, prepares the notification and sends the payload to APNs. APNs authenticates the request, verifies the payload and routes the notification to the correct device. If the device is turned off, APNs retries later.

<img src="images/ios_push_notification.png" />
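
As a rough illustration of the provider's "prepare the notification" step, the sketch below builds the JSON payload that would be sent to APNs for one device. The `aps` dictionary with its `alert` and `badge` keys follows the APNs payload format; the function name `build_apns_payload` is hypothetical, and authentication and the actual HTTP/2 request to APNs are deliberately left out.

```python
import json
from typing import Optional


def build_apns_payload(title: str, body: str, badge: Optional[int] = None) -> str:
    """Return the JSON body for a simple alert notification (hypothetical helper)."""
    aps = {"alert": {"title": title, "body": body}}
    if badge is not None:
        # Number shown on the app icon; omitted when the caller does not set it.
        aps["badge"] = badge
    return json.dumps({"aps": aps})


payload = build_apns_payload("Order shipped", "Your order is on its way", badge=1)
```

The provider would send this payload together with the stored device token, which tells APNs which device to route the notification to.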

### Android Push Notification
The architecture is the same, except that we use FCM (Firebase Cloud Messaging) instead of APNs.

### SMS
For sending SMS, we utilise services like Twilio or Nexmo.

### Email
For sending emails, we can run our own email server or use services like MailChimp or SendGrid. Commercial services typically provide better delivery rates and analytics.

## User Onboarding
Once a user signs up, we collect contact information such as email address and mobile number. If the user installs the app, we also collect device information such as the push notification device token. The architecture for this is very simple:

<img src="images/user_onboarding.png" width=600 height=auto/>
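
One possible shape for the collected data is sketched below. All names (`Device`, `UserContact`, `reachable_modes`, the mode strings) are assumptions made for illustration. A user can have several devices, and the opt-out requirement from the overview is modelled as a set of disabled modes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Device:
    device_token: str
    platform: str  # assumed values: "ios" or "android"


@dataclass
class UserContact:
    user_id: str
    email: Optional[str] = None
    phone_number: Optional[str] = None
    devices: List[Device] = field(default_factory=list)
    opted_out_modes: Set[str] = field(default_factory=set)

    def reachable_modes(self) -> List[str]:
        """Modes we have contact details for and the user has not opted out of."""
        modes = []
        if self.devices:
            modes.append("push")
        if self.phone_number:
            modes.append("sms")
        if self.email:
            modes.append("email")
        return [m for m in modes if m not in self.opted_out_modes]
```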

## Architecture
<img src="images/notification_architecture_1.png" />

This is a simple architecture where the volume of notifications being sent is low. It involves:
- **Services 1 to N:** different services that need to send out notifications. As an example, sales service would need to send out notifications related to orders, shipment service would send notifications regarding order shipment and delivery, etc.
- **Notification Service:** handles notification requests from other services and forwards them to third party notification service providers. Notification service would provide REST APIs through which other services can send notifications, check status, etc.
- **User Info Service:** provides information like email address, device token, etc. to the notification service. The notification service uses this information to identify the recipient and decide the notification delivery mode.
- **Third Party Notification Service:** like APNs, FCM, MailChimp, Twilio, etc. It is important that our system is flexible enough to switch service providers easily. Geographical availability is another factor to consider: a provider may not be available in every location, so different locations may need different service providers.
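
One way to keep providers swappable and region-aware is to hide them behind a small lookup keyed by notification mode and region. The sketch below is only an illustration: `ProviderRegistry`, the `Sender` signature and the `"default"` fallback region are assumptions, and real senders would wrap the provider SDK or HTTP API.

```python
from typing import Callable, Dict, Tuple

# A sender takes (recipient, message) and returns a provider-specific message id.
Sender = Callable[[str, str], str]


class ProviderRegistry:
    """Maps (mode, region) to the function that talks to a provider."""

    def __init__(self) -> None:
        self._senders: Dict[Tuple[str, str], Sender] = {}

    def register(self, mode: str, region: str, sender: Sender) -> None:
        self._senders[(mode, region)] = sender

    def resolve(self, mode: str, region: str) -> Sender:
        # Prefer a region-specific provider, then fall back to the default one.
        sender = self._senders.get((mode, region)) or self._senders.get((mode, "default"))
        if sender is None:
            raise LookupError(f"no provider configured for mode={mode!r}")
        return sender
```

Switching providers then means registering a different sender; the notification service itself does not change.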

The above architecture can be improved to resolve scaling issues and remove the single point of failure. Other improvements like rate limits, retry, etc are also considered:

<img src="images/notification_architecture_2.png" />

- **Notification Service:** no longer a single point of failure. Its responsibility shrinks to providing an API that triggers notifications and sends them to the appropriate notification queue. It performs basic validation on each request and communicates with the user info service to get the necessary details.
- **Rate Limiter:** ensures that no user is flooded by notifications. Rate limiting can be granular - at each user/notification mode level.
- **User Info Service:** as the volume of notifications increases, requests to the user info service also increase. It now makes sense to add a cache, since user details don't change frequently. If a detail such as the mobile number changes, this service invalidates the corresponding cache entry.
- **Message Queues:** a separate queue for each notification mode prevents failures in one mode from affecting the others. The queues also serve as buffers when high volumes of notifications are to be sent out.
- **Workers:** pull notification messages from the queues and prepare the notifications to be sent to third party services. To prepare the contents of a notification, a worker fetches the template from the CMS and combines it with the user data present in the notification message.
- **CMS:** stores notification templates. The CMS can be accessed by product managers without input from the development team.
- **Analytics:** captures stats around user-notification engagement.
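
To make the rate limiter concrete, here is a minimal in-memory sketch that limits each (user, notification mode) pair independently using a sliding window. The class name, the limits and the injectable `clock` are assumptions for illustration; a deployment with several notification service instances would need shared storage instead of a local dictionary.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class NotificationRateLimiter:
    """Allows at most `max_per_window` notifications per user and mode per window."""

    def __init__(self, max_per_window: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_per_window
        self._window = window_seconds
        self._clock = clock
        # Timestamps of recently allowed notifications, keyed by (user_id, mode).
        self._sent: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, mode: str) -> bool:
        now = self._clock()
        sent = self._sent[(user_id, mode)]
        # Drop timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self._window:
            sent.popleft()
        if len(sent) >= self._max:
            return False
        sent.append(now)
        return True
```

For example, `NotificationRateLimiter(max_per_window=2, window_seconds=60)` would let a user receive two SMS messages per minute while counting their emails separately.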

### Handling Failure
There can be multiple sources of failure in this entire flow:

**Scenario 1:** worker fails to send notification because Twilio was temporarily unavailable or there was a network glitch. We would ideally like to retry this type of failure. We can think of some approaches:
1. Attach a failure count to the message and push it back to the queue for it to be re-processed. Once the failure count reaches a certain number, we can discard the message. Problems with this approach:
    - the message is re-processed almost immediately, so if the transient problem has not yet cleared, it simply fails again.
    - failing messages delay other good messages from being processed, since both share the same queue.
2. [Retry ladder with dead-letter queue](https://littlehorse.io/blog/retries-and-dlq): the basic idea is to put failed messages into additional retry queues. Let there be 3 queues (representing three retry attempts): Q1, Q2 and Q3. A message moves from Q1 to Q2 to Q3 if it keeps failing. Workers pull messages from these queues after a delay of $baseDelay \times factor^{attempt}$, where $attempt = 1$ for Q1, $attempt = 2$ for Q2, and in general $attempt = n$ for Qn. If all retries are exhausted, the message is moved to a separate dead-letter queue and requires manual intervention.

<img src="images/retry_ladder.png" />
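
The routing decision in the retry ladder can be sketched as follows. The constants (`BASE_DELAY_S = 5`, `FACTOR = 2`), the `attempt` field on the message and the queue names are assumptions chosen for illustration; only the delay formula and the three-queue layout come from the description above.

```python
from typing import Optional, Tuple

MAX_ATTEMPTS = 3     # number of retry queues (Q1..Q3)
BASE_DELAY_S = 5.0   # assumed base delay in seconds
FACTOR = 2.0         # assumed backoff factor


def retry_delay(attempt: int, base_delay: float = BASE_DELAY_S,
                factor: float = FACTOR) -> float:
    """Delay before workers pick up a message from retry queue Q<attempt>."""
    return base_delay * factor ** attempt


def route_failed(message: dict) -> Tuple[str, Optional[float]]:
    """Pick the next queue for a message that has just failed.

    Returns (queue_name, delay_seconds); the delay is None for the
    dead-letter queue, where messages wait for manual intervention.
    """
    attempt = message.get("attempt", 0) + 1
    message["attempt"] = attempt
    if attempt > MAX_ATTEMPTS:
        return "DLQ", None
    return f"Q{attempt}", retry_delay(attempt)
```

With these values a message that keeps failing waits 10 s in Q1, 20 s in Q2 and 40 s in Q3 before landing in the dead-letter queue. Because retries live in their own queues, they no longer hold up fresh messages in the main queue.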

**Scenario 2:** what if the request to a third party provider like Twilio is disconnected, but the provider actually sent out the SMS? From the worker's perspective it was a failure, so it retries sending the notification, and the user receives a duplicate. Solving this requires support from the third party service as well:
1. At the very beginning of the process, before any request is sent to Twilio, the notification is assigned a unique id called an *idempotency key*. The idempotency key is saved in the database along with the notification status prior to sending the request to Twilio.
2. A failure happens and the message is placed in a retry queue.
3. The worker retrieves the message and obtains its idempotency key. It makes the request to Twilio using the same idempotency key. Twilio recognizes the key and does not send the SMS again if it had already been sent.
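
The three steps can be simulated end to end with an in-memory stand-in for the provider. Everything here is hypothetical: `FakeSmsProvider` only models a provider that deduplicates on an idempotency key, it does not reflect any real provider's API, and `store` stands in for the database.

```python
import uuid
from typing import Dict, List, Tuple


class FakeSmsProvider:
    """Stand-in for a provider that supports idempotency keys (assumption)."""

    def __init__(self) -> None:
        self.delivered: List[Tuple[str, str]] = []
        self._seen: Dict[str, str] = {}

    def send(self, idempotency_key: str, to: str, body: str) -> str:
        # A repeated key returns the earlier result instead of sending again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        message_id = f"msg-{len(self.delivered) + 1}"
        self.delivered.append((to, body))
        self._seen[idempotency_key] = message_id
        return message_id


def create_notification(store: Dict[str, dict], to: str, body: str) -> str:
    """Step 1: assign the key and persist it with the status before sending."""
    key = str(uuid.uuid4())
    store[key] = {"to": to, "body": body, "status": "PENDING"}
    return key


def deliver(store: Dict[str, dict], provider: FakeSmsProvider, key: str) -> str:
    """Steps 2 and 3: every attempt, including retries, reuses the stored key."""
    notification = store[key]
    message_id = provider.send(key, notification["to"], notification["body"])
    notification["status"] = "SENT"
    return message_id
```

Calling `deliver` twice with the same key, as a retry after a dropped connection would, results in a single delivered SMS.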