## Overview
We need to create an application that takes a URL as input and returns a shorter URL. For example, given an input URL like `https://photos.google.com/?pli=1`, the shortener would return something like `https://sh.rt/30b895e`.

The system should have the following characteristics:
- 100 million URLs generated per day
- Length of shortened URL should be as short as possible
- The shortened URL can contain both letters and digits

The system should support the following two operations:
- Given a URL, return a shortened URL
- Given a shortened URL, redirect to the original URL

## Calculations
At 100 million URLs generated per day, the write load is:
- $\frac{100000000}{24\times3600} \approx 1160$ writes per second
- $100000000\times365 = 36500000000$ URLs per year

Assuming a $10:1$ read-to-write ratio, the system would serve about $11600$ reads per second.

In terms of storage, let's assume an average URL is 100 characters long and each character takes 1 byte (assuming the ASCII character set). One year of URLs therefore needs $36500000000\times100$ bytes $= 3.65\times10^{12}$ bytes, which is about $3.65$TB.
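
The estimates above can be reproduced with a few lines of arithmetic. This is only a sketch of the calculation; the variable names are illustrative, and 1 TB is taken as $10^{12}$ bytes. The text rounds the per-second figures to 1160 and 11600.

```python
SECONDS_PER_DAY = 24 * 3600

urls_per_day = 100_000_000
read_write_ratio = 10   # reads per write, as assumed in the text
avg_url_bytes = 100     # 100 ASCII characters at 1 byte each

# Write and read throughput
writes_per_second = urls_per_day / SECONDS_PER_DAY
reads_per_second = writes_per_second * read_write_ratio

# Yearly volume and the storage needed for the original URLs alone
urls_per_year = urls_per_day * 365
storage_tb_per_year = urls_per_year * avg_url_bytes / 10**12  # 1 TB = 10^12 bytes

print(f"writes per second: {writes_per_second:.0f}")
print(f"reads per second:  {reads_per_second:.0f}")
print(f"URLs per year:     {urls_per_year}")
print(f"storage per year:  {storage_tb_per_year:.2f} TB")
```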

## API Design
The two operations are served by two APIs. To shorten a URL, our application would expose the following API:
- `POST https://sh.rt/api/v1/short-url` accepting body `{ "url": "https://photos.google.com/?pli=1" }`
- Response would be status 201 with response body `{ "url": "https://photos.google.com/?pli=1", "shortURL": "https://sh.rt/30b895e" }`

The read operation would be realised by `GET https://sh.rt/30b895e`. Our system can respond with either:
- Status 301 (Moved Permanently), with the `Location` header containing `https://photos.google.com/?pli=1`. Browsers cache a 301 response, so subsequent requests for the same short URL are redirected by the browser itself and never reach our server. This is preferred if we want to reduce server load.
- Status 302 (Found, a temporary redirect), with the same `Location` header. Browsers do not cache a 302 response by default, so every subsequent request is served by our server. This is preferred if analytics is important.
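
The choice between the two status codes can be sketched as a small function. This is a minimal illustration rather than a full HTTP handler: `redirect_response` and the dictionary `store` are hypothetical names, and returning 404 for an unknown key is an assumption that the text does not specify.

```python
from http import HTTPStatus


def redirect_response(store, key, permanent=True):
    """Return (status_code, headers) for a short-URL lookup.

    `store` is a plain dict standing in for the real key-value lookup.
    """
    long_url = store.get(key)
    if long_url is None:
        # Assumption: unknown keys produce 404 Not Found.
        return int(HTTPStatus.NOT_FOUND), {}
    # 301 lets browsers cache the redirect; 302 keeps every hit on the server.
    status = HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND
    return int(status), {"Location": long_url}


store = {"30b895e": "https://photos.google.com/?pli=1"}
print(redirect_response(store, "30b895e", permanent=True))
print(redirect_response(store, "30b895e", permanent=False))
```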

## Operations in Detail
### Shorten URL
A simplistic implementation would involve a map which has the shortened URL key as the map key and the original URL as the value. A hash function would convert the original URL to the shortened URL key.

<img src="images/hash_url.png"/>

An in-memory map is not feasible since we need about 3.65TB of storage per year. We could explore distributed key-value stores such as Cassandra. However, let's stick to an RDBMS for this example. We propose the following DB schema:

<img src="images/hash_db_schema.png"/>

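Let's focus on the hash function. Each character has 62 choices: *a-z*, *A-Z* and *0-9*. To cover 365 billion URLs, which corresponds to ten years of URLs at the rate above, we need the smallest $n$ such that:
$$62^n \geq 365000000000$$
$$n = \lceil\log_{62}(365000000000)\rceil = \lceil 6.45 \rceil = 7$$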
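
As a quick check of this arithmetic, the smallest key length can be found by comparing powers of 62 against the target count. The function name `min_key_length` is illustrative; integer arithmetic is used to avoid floating-point rounding near the boundary.

```python
import math

ALPHABET_SIZE = 62             # a-z, A-Z and 0-9
TOTAL_URLS = 365_000_000_000   # the target used in the equation above


def min_key_length(total, base):
    """Smallest n such that base**n >= total."""
    n = 1
    while base ** n < total:
        n += 1
    return n


print(math.log(TOTAL_URLS, ALPHABET_SIZE))        # real-valued log, between 6 and 7
print(min_key_length(TOTAL_URLS, ALPHABET_SIZE))  # smallest whole number of characters
```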

Therefore, we need a total of 7 characters. There are two ways to generate the shortened URL:  
**Hash + Collision Resolution:** Let's evaluate a few hash functions:  
<img src="images/hash_fn.png"/>

CRC32 looks to be a good choice, though its output is 8 characters long (1 more than required). We can keep only the first 7 characters, though this increases the probability of a hash collision. So how do we resolve collisions? One way is to repeatedly add extra characters to the input until the resulting key is unused. For example, if the original URL was `http://www.google.com` and it leads to a collision, we hash `(http://www.google.com)` instead.

<img src="images/flowchart_shorten.png"/>
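
The hash-and-retry flow can be sketched as follows. This is a minimal illustration under stated assumptions, not a production implementation: `crc32_key` and `shorten` are hypothetical names, a plain dict `store` stands in for the database table, and the CRC32 value is rendered as 8 hexadecimal characters before truncation.

```python
import zlib


def crc32_key(text, length=7):
    """First `length` hex characters of the CRC32 of `text`."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")[:length]


def shorten(url, store):
    """Return a key for `url`, wrapping the input in parentheses on collision.

    `store` maps key -> original URL and stands in for the database.
    """
    candidate = url
    while True:
        key = crc32_key(candidate)
        existing = store.get(key)
        if existing is None:
            store[key] = url           # free slot: claim it
            return key
        if existing == url:
            return key                 # same URL was shortened before
        candidate = "(" + candidate + ")"  # collision: alter the input and retry


store = {}
key = shorten("http://www.google.com", store)
print(key, "->", store[key])
```

Note that a hexadecimal digest uses only 16 of the 62 allowed characters, so 7 hex characters cover just $16^7 \approx 268$ million distinct keys. In a real system, each retry also costs a database lookup.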

**Base-62 Conversion:** There are 62 possible characters for the shortened URL, so base-62 conversion is another way to generate it. In this scheme, each character maps to a number.

<img src="images/base62.png" />

We first generate a unique number for the URL and then convert that number to base 62. As the generated number is unique, we don't have to worry about collisions. The downside is that the next short URL is easy to predict.
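
A minimal sketch of the conversion is shown below. The character ordering (`0-9`, then `a-z`, then `A-Z`) is an assumption, since any fixed ordering of the 62 characters works, and the function names are illustrative. How the unique number is generated is outside the scope of this sketch.

```python
import string

# Assumed mapping: 0-9 -> 0..9, a-z -> 10..35, A-Z -> 36..61
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def base62_encode(number):
    """Convert a non-negative integer to its base-62 string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def base62_decode(key):
    """Convert a base-62 string back to the integer it encodes."""
    number = 0
    for char in key:
        number = number * BASE + ALPHABET.index(char)
    return number


# Consecutive ids give consecutive keys, which is why keys are easy to predict.
for unique_id in (11157, 11158, 11159):
    print(unique_id, "->", base62_encode(unique_id))
```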

### Redirect
<img src="images/redirect.png" />