<center><img src="images/careers.jpg" width="700"/></center>

"How would you design Twitter?"
-------

<center><img src="https://tse3.mm.bing.net/th?id=OIP.ciFOF2jCpcN28FVQKpOokgHaEC&pid=Api" width="700"/></center>

Why System Design Questions?
-----

- Fuzzy definitions
- Open ended
- Testing your analytical ability 

[Source](https://www.quora.com/What-are-some-of-the-best-answers-to-the-question-How-would-you-design-Twitter-in-a-system-design-interview)

Steps
------

1. Requirements clarifications
2. Back-of-the-envelope estimation
3. System interface definition
4. Defining data model
5. High-level design
6. Detailed design
7. Bottlenecks

Anti-Steps
-----

- Jump into implementation
- Pick specific technologies

Requirements clarifications
-------

__Always ask__

Then compress Twitter to its MVP (minimum viable product)...

Twitter MVP
-----

<center><img src="images/publish_subscribe.png" width="700"/></center>

Pub (Publishers): People tweeting 

Sub (Subscribers): People following

What are other examples of [Pub Sub Models](https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern)?
------

- Newspapers 📰
- Video Streaming 📼
- Data Systems (Kafka)
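
The pattern behind all of these examples can be sketched in a few lines. This is a minimal in-memory illustration, not how Kafka or Twitter actually work; the names `subscribers`, `subscribe`, and `publish` are hypothetical and chosen only for this sketch.

```python
from collections import defaultdict

# topic -> list of subscriber callbacks (hypothetical names for illustration)
subscribers = defaultdict(list)

def subscribe(topic, callback):
    """Register a callback to be invoked for every message on a topic."""
    subscribers[topic].append(callback)

def publish(topic, message):
    """Deliver a message to every subscriber of the topic; return how many received it."""
    for callback in subscribers[topic]:
        callback(message)
    return len(subscribers[topic])

# A follower's "inbox" is just a list; following someone means subscribing to them.
inbox = []
subscribe('brianspiering', inbox.append)
delivered = publish('brianspiering', '#python tip: develop interactively')
```

The publisher never needs to know who is listening, which is the property that lets the two sides scale independently.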

Back-of-the-envelope estimation
-------

- Estimate scale and speed
    - Number of concurrent users
    - Volume of data at rest 

Estimation Example
-----

Number of tweets per day : 500 million    
Maximum size of a tweet : 140 chars + 1 byte for timestamp + 1 byte for userId = 142 bytes   
Provisioning for : 5 years = 365 * 5 days   
Space required : 142 bytes * 500M * 365 * 5 ≈ 129.6 TB  

[Source](http://massivetechinterview.blogspot.com/2016/07/design-twitter.html)
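
The arithmetic is easy to check in code. The sketch below uses only the figures from the estimate above and assumes one byte per character and decimal units (1 TB = 10^12 bytes); the constant names are made up for this example.

```python
# Figures taken from the estimate above
TWEETS_PER_DAY = 500_000_000
BYTES_PER_TWEET = 140 + 1 + 1   # 140 one-byte chars + timestamp + userId
DAYS = 365 * 5                  # five years, ignoring leap days

bytes_per_day = TWEETS_PER_DAY * BYTES_PER_TWEET
total_bytes = bytes_per_day * DAYS

print(f"per day: {bytes_per_day / 1e9:.0f} GB")   # per day: 71 GB
print(f"5 years: {total_bytes / 1e12:.1f} TB")    # 5 years: 129.6 TB
```

In an interview, the order of magnitude (roughly a hundred terabytes) matters more than the exact figure.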

System interface definition
------

Define APIs 

Nouns & Verbs

`postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)`
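
One way to write the interface down before any implementation exists is as a set of typed signatures, one method per verb. The following is a hedged sketch: the class name `TwitterAPI`, the method names, the parameter lists, and the default `limit` are all assumptions made for illustration, and the location parameters from the signature above are left out for brevity.

```python
from typing import List, Optional, Protocol

class TwitterAPI(Protocol):
    """Hypothetical interface: one method per verb, arguments are the nouns."""

    def post_tweet(self, user_id: str, tweet_data: str,
                   timestamp: Optional[float] = None) -> str:
        """Store a tweet and return its ID."""
        ...

    def follow(self, user_id: str, followee_id: str) -> None:
        """Record that user_id follows followee_id."""
        ...

    def get_feed(self, user_id: str, limit: int = 20) -> List[str]:
        """Return the newest tweets from the users that user_id follows."""
        ...

    def search(self, phrase: str, limit: int = 20) -> List[str]:
        """Return the newest tweets whose content matches phrase."""
        ...
```

Writing only signatures keeps the discussion on inputs and outputs, which fits the advice above not to jump into implementation.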

Defining data model
--------

What are the 4 things in data modeling 101?
-------

1. Entity - discrete thing in the real world
2. Attribute - information about an entity
3. Fact - value (level) of an attribute
4. Event - something that happens in the real world

What are the Entities (nouns) for Twitter?
-----

1. Users 

2. Tweets

Possible Data Model
-------

`User: UserID, UserName, FirstName, LastName, Password (salted), Email, DoB, TimeStamp`

`Tweet: TweetID, Content, UserID, TimeStamp`

Relations
------

`Followers: FollowerID, FolloweeID, TimeStamp`

`Favorites: UserID, TweetID, TimeStamp`
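
The entities and relations above map naturally onto small record types. The sketch below uses dataclasses and only a subset of the listed fields; the snake_case field names, the `password_hash` field, and the sample users are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    user_id: int
    user_name: str
    email: str
    password_hash: str   # salted hash, never the plain-text password
    created_at: float

@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    user_id: int         # refers to User.user_id
    content: str
    created_at: float

@dataclass(frozen=True)
class Follow:
    follower_id: int     # the user who follows
    followee_id: int     # the user being followed
    created_at: float

@dataclass(frozen=True)
class Favorite:
    user_id: int
    tweet_id: int
    created_at: float

# Hypothetical sample rows
alice = User(1, 'alice', 'alice@example.com', '<salted-hash>', 0.0)
bob = User(2, 'bob', 'bob@example.com', '<salted-hash>', 0.0)
tweet = Tweet(10, alice.user_id, 'hello', 1.0)
edges = {Follow(bob.user_id, alice.user_id, 2.0)}   # bob follows alice
```

The relations hold only IDs, so a follow or a favorite can be stored without copying user or tweet data.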

High-level design
-------

Draw a block diagram with 5-6 boxes representing core components of your system

<center><img src="images/db.png" width="700"/></center>

Detailed design
------

Focus on developing the most important 2-3 components

Verbs
-----

1. Create account
1. Post a tweet
1. Follow a user
1. Get feed
1. Search!

Implement Twitter
--------

Data Structures first!

[Source](https://github.com/rhettinger/modernpython/tree/master/pubsub)

In [195]:
from collections import namedtuple

In [196]:
Post = namedtuple('Post', ['timestamp', 'user', 'content'])

In [197]:
# A collection of all posts 
posts = None

In [198]:
# Efficient append left
from collections import deque

In [199]:
posts = deque() # From newest to oldest

In [200]:
# A collection of posts by user 
user_posts = None # user: posts

In [201]:
# A dictionary
# A special dictionary that accumulates posts per key
from collections import defaultdict

In [202]:
# When a key is encountered for the first time, automatically create an empty deque
user_posts = defaultdict(deque) 

After data structures (nouns), come functions (verbs)
------

In [203]:
from time import time

def post_message(user, text, timestamp=None):
    timestamp = time() if timestamp is None else timestamp  # Default to now
    post = Post(timestamp, user, text)
    posts.appendleft(post)             # Newest first
    user_posts[user].appendleft(post)  # Same Post object, stored by reference (no copy)

In [204]:
# Run sample data
now = time()
post_message('brianspiering', '#python tip: use the correctdata structures', now-3600*48)
post_message('jay', 'graded papers', now-3600)
post_message('selik', 'gradient descent save me money on travel', now-2500)
post_message('brianspiering', '#python tip: develop interactively', now-500)
post_message('jay', 'plan to heckle brianspiering', now-80)
post_message('jeffjohnson', 'teaching #python today', now-50)
post_message('selik', 'have you ever wanted to unpack mappings?', now-46)
post_message('brianspiering', '#python tip: have fun programming', now-40)
post_message('jeffjohnson', '#camping tip:  always take water', now-30)
post_message('barry', 'enums rock', now-20)
post_message('brianspiering', '#life tip: take frequent naps', now-10)
post_message('jeffjohnson', 'coriander and cilantro come from the same plant', now)

In [205]:
posts

deque([Post(timestamp=1523406271.574185, user='jeffjohnson', content='coriander and cilantro come from the same plant'),
       Post(timestamp=1523406261.574185, user='brianspiering', content='#life tip: take frequent naps'),
       Post(timestamp=1523406251.574185, user='barry', content='enums rock'),
       Post(timestamp=1523406241.574185, user='jeffjohnson', content='#camping tip:  always take water'),
       Post(timestamp=1523406231.574185, user='brianspiering', content='#python tip: have fun programming'),
       Post(timestamp=1523406225.574185, user='selik', content='have you ever wanted to unpack mappings?'),
       Post(timestamp=1523406221.574185, user='jeffjohnson', content='teaching #python today'),
       Post(timestamp=1523406191.574185, user='jay', content='plan to heckle brianspiering'),
       Post(timestamp=1523405771.574185, user='brianspiering', content='#python tip: develop interactively'),
       Post(timestamp=1523403771.574185, user='selik', content='gradient descent save me money on travel'),
       ...])

In [206]:
user_posts['brianspiering']

deque([Post(timestamp=1523406261.574185, user='brianspiering', content='#life tip: take frequent naps'),
       Post(timestamp=1523406231.574185, user='brianspiering', content='#python tip: have fun programming'),
       Post(timestamp=1523405771.574185, user='brianspiering', content='#python tip: develop interactively'),
       Post(timestamp=1523233471.574185, user='brianspiering', content='#python tip: use the correctdata structures')])

In [207]:
UserInfo = namedtuple('UserInfo', ['displayname', 
                                   'email',
                                   'hashed_password', 
                                   'bio'])

Let's add a following feature
------

In [208]:
following = defaultdict(set) # User: Other Users        
followers = defaultdict(set) # User: Other Users   

Follow function
-------

In [209]:
def follow(user, followed_user):
    "Update symmetric dicts"
    following[user].add(followed_user)
    followers[followed_user].add(user)

In [210]:
follow('jay', followed_user='brianspiering')
follow('jay', followed_user='barry')
follow('selik', followed_user='davin')
follow('brianspiering', followed_user='jay')
follow('brianspiering', followed_user='barry')

In [211]:
followers

defaultdict(set,
            {'barry': {'brianspiering', 'jay'},
             'brianspiering': {'jay'},
             'davin': {'selik'},
             'jay': {'brianspiering'}})

Let's get a couple of posts from a user
------

In [212]:
def posts_by_user(user):
    return user_posts[user]

posts_by_user('brianspiering')

deque([Post(timestamp=1523406261.574185, user='brianspiering', content='#life tip: take frequent naps'),
       Post(timestamp=1523406231.574185, user='brianspiering', content='#python tip: have fun programming'),
       Post(timestamp=1523405771.574185, user='brianspiering', content='#python tip: develop interactively'),
       Post(timestamp=1523233471.574185, user='brianspiering', content='#python tip: use the correctdata structures')])

What if the person has thousands of posts?

In [213]:
# Efficiently select elements from an iterator
from itertools import islice

In [214]:
def posts_by_user(user, limit=2):
    return list(islice(user_posts[user], limit))

posts_by_user('brianspiering')

[Post(timestamp=1523406261.574185, user='brianspiering', content='#life tip: take frequent naps'),
 Post(timestamp=1523406231.574185, user='brianspiering', content='#python tip: have fun programming')]
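
The "Get feed" verb can be built from the same pieces. The sketch below is one possible approach, not the only one: it assumes each user's deque is kept newest-first (as `appendleft` guarantees when posts arrive in time order) and merges the deques of followed users with `heapq.merge`. The function name `get_feed`, the integer timestamps, and the sample data are made up for this example.

```python
from collections import defaultdict, deque, namedtuple
from heapq import merge
from itertools import islice

Post = namedtuple('Post', ['timestamp', 'user', 'content'])
user_posts = defaultdict(deque)   # user -> posts, newest first
following = defaultdict(set)      # user -> users they follow

def post_message(user, text, timestamp):
    user_posts[user].appendleft(Post(timestamp, user, text))

def follow(user, followed_user):
    following[user].add(followed_user)

def get_feed(user, limit=3):
    """Return the newest posts from everyone `user` follows."""
    # Each per-user deque is already sorted newest-first, so merge can
    # combine them lazily instead of sorting every post.
    streams = [user_posts[u] for u in following[user]]
    return list(islice(merge(*streams, reverse=True), limit))

# Hypothetical sample data, posted in time order
post_message('barry', 'enums rock', 100)
post_message('brianspiering', '#python tip: develop interactively', 110)
post_message('barry', 'dataclasses rock too', 120)
post_message('selik', 'not followed by jay', 130)
follow('jay', 'brianspiering')
follow('jay', 'barry')

feed = get_feed('jay')   # barry (120), brianspiering (110), barry (100)
```

Because `merge` is lazy and `islice` stops after `limit` items, the cost depends on the number of followed users and the limit rather than on the total number of posts.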

How about search?
-----

In [219]:
phrase = 'python'

relevant_posts = []
for post in posts:
    if phrase in post.content:
        relevant_posts.append(post) 
print(*relevant_posts, sep="\n")

Post(timestamp=1523406231.574185, user='brianspiering', content='#python tip: have fun programming')
Post(timestamp=1523406221.574185, user='jeffjohnson', content='teaching #python today')
Post(timestamp=1523405771.574185, user='brianspiering', content='#python tip: develop interactively')
Post(timestamp=1523233471.574185, user='brianspiering', content='#python tip: use the correctdata structures')


In [220]:
# itertools.islice efficiently selects elements from an iterator
# Indexing only works with sequences.

def search(phrase, limit=2):
    return list(islice((post for post in posts if phrase in post.content), limit))

search(phrase='python')

Bottlenecks
------

Search is often a bottleneck.

For example, my code could benefit from caching and from building an index ahead of time.
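
A minimal sketch of the "index ahead of time" idea is an inverted index that maps each word to the posts containing it. Note the assumptions: it matches whole words case-insensitively, which is not the same as the substring search above, and the names `index` and `search_indexed` plus the sample posts are hypothetical.

```python
from collections import defaultdict, namedtuple
import re

Post = namedtuple('Post', ['timestamp', 'user', 'content'])

posts = [  # newest first, like the deque used earlier
    Post(30, 'jeffjohnson', 'teaching #python today'),
    Post(20, 'barry', 'enums rock'),
    Post(10, 'brianspiering', '#python tip: develop interactively'),
]

# word -> positions in `posts`, built once ahead of time
index = defaultdict(list)
for position, post in enumerate(posts):
    for word in set(re.findall(r'\w+', post.content.lower())):
        index[word].append(position)

def search_indexed(word, limit=2):
    """Look up a single word in the index instead of scanning every post."""
    return [posts[i] for i in index.get(word.lower(), [])[:limit]]

hits = search_indexed('python')   # the posts with timestamps 30 and 10
```

The trade-off is extra memory and the work of updating the index on every new post, in exchange for lookups that no longer scan the whole collection.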

On Your Own - "How would you design a search engine?"
------

<center><img src="https://upload.wikimedia.org/wikipedia/en/thumb/f/fa/MLR-search-engine-example.png/260px-MLR-search-engine-example.png" width="700"/></center>

Bottlenecks
------

Refactor for easy wins (sets, modularize, parallelize)

__Where would the pub-sub system fail first?__

Twitter is Power Law Distributed
-------

<center><img src="http://trak.in/wp-content/uploads/2009/07/Twitteruserfollowers.jpg" width="700"/></center>

For Next Time
------

- Search Algorithms
    - Binary Search
    - [Bisect](https://en.wikipedia.org/wiki/Bisect)
    - 2d matrices search
    - XOR might be helpful
- Linked List or Student's Choice

-------

Only 2 more sessions
