### What is a database view? What are some advantages views have over tables?

A view is a virtual table defined by a stored SQL query. A standard (non-materialized) view doesn’t store data itself; the database runs the underlying query whenever the view is referenced and presents the results as if they were a table. Views simplify complex queries for users, provide a layer of security by exposing only certain columns or rows, and keep business logic consistent across the application by defining it once in the database. They also act as a reusable abstraction over the schema, which makes the underlying tables easier to work with.
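
As a minimal sketch of these points, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `employees` table, the `engineering_directory` view, and all of the data are hypothetical names invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        department TEXT NOT NULL,
        salary REAL NOT NULL
    );
    INSERT INTO employees (name, department, salary) VALUES
        ('Ada', 'Engineering', 150000),
        ('Grace', 'Engineering', 140000),
        ('Linus', 'Support', 90000);

    -- The view hides the salary column and exposes one department only.
    CREATE VIEW engineering_directory AS
        SELECT id, name
        FROM employees
        WHERE department = 'Engineering';
""")

# Querying the view looks exactly like querying a table.
directory = conn.execute(
    "SELECT name FROM engineering_directory ORDER BY name"
).fetchall()
print(directory)

# The view stores nothing itself: a new base-table row is visible immediately.
conn.execute(
    "INSERT INTO employees (name, department, salary) "
    "VALUES ('Alan', 'Engineering', 130000)"
)
updated = conn.execute(
    "SELECT name FROM engineering_directory ORDER BY name"
).fetchall()
print(updated)
```

Because the view never selects `salary`, a user who is only granted access to the view cannot read that column in a database that supports per-object permissions (SQLite itself has no permission system, so that part is not demonstrated here).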

### Say you have a database system where most of the queries made were UPDATEs/INSERTs/DELETEs. How would this affect your decision to create indices? What if the queries made were mostly SELECTs and JOINs instead?

If most of my queries are writes (INSERT, UPDATE, or DELETE), I would be careful about creating many indexes, because every index on a table has to be updated whenever a row is written, and that slows down write operations. I would only create essential indexes, such as those on primary or foreign keys.

On the other hand, if the workload is mostly SELECTs or JOINs, I would create more indexes, typically on the columns used in WHERE clauses and JOIN conditions, because they significantly speed up read queries. The overhead of maintaining indexes is worth it in read-heavy scenarios.
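
The read-side benefit can be seen in a query plan. The sketch below uses Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the `orders` table, the index name `idx_orders_customer`, and the helper `query_plan` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = query_plan(lookup)  # no usable index: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(lookup)   # the planner can now search the index

print(plan_before)
print(plan_after)
```

The cost is on the write side: after the index exists, every INSERT into `orders`, and every UPDATE that changes `customer_id`, also has to modify `idx_orders_customer`.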

### What is a primary key? What characteristics does a good primary key have?

A primary key is a column, or a set of columns, that uniquely identifies each row in a table. A table can have only one primary key. A good primary key is unique, non-null, stable, minimal, and ideally small. Stability is important because changing a primary key value forces a matching change in every foreign key that references it and in the indexes built on it. Keeping it simple and small also helps with performance, especially when it is used in joins.
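
A small sketch of uniqueness and stability, again using `sqlite3`. The `users` table and its columns are hypothetical; the design assumption is a small surrogate integer key, with the email address kept as an ordinary unique column because it can change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- small, stable surrogate key
        email   TEXT NOT NULL UNIQUE   -- natural identifier that may change
    )
""")
conn.execute("INSERT INTO users (user_id, email) VALUES (1, 'ada@example.com')")

# Uniqueness: a second row with the same key value is rejected.
try:
    conn.execute("INSERT INTO users (user_id, email) VALUES (1, 'grace@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Stability: the email can change without touching the key other tables would reference.
conn.execute("UPDATE users SET email = 'ada@new.example.com' WHERE user_id = 1")
rows = conn.execute("SELECT user_id, email FROM users").fetchall()
print(duplicate_rejected, rows)
```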

### Advantages and disadvantages of relational databases vs. NoSQL

### Implementing a shuffle operator with MapReduce

At a high level, to implement a shuffle operator with MapReduce:

Map phase: Assign each record a random key, drawn from a range large enough that collisions are unlikely.

Shuffle phase: The framework automatically sorts and groups records by key, which distributes them across reducers.

Reduce phase: Ignore the key and output records as they arrive. Because the keys were random, the output dataset is randomly ordered.
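
The three phases can be simulated in a single process. This is only a sketch, not a real MapReduce job: the function names, the `num_reducers` parameter, and the range partitioning by key are assumptions made for illustration.

```python
import random
from collections import defaultdict


def map_phase(records, rng):
    """Map: tag every record with a random key in [0, 1)."""
    return [(rng.random(), record) for record in records]


def shuffle_phase(pairs, num_reducers):
    """Shuffle: route each pair to a reducer by key range, then sort by key."""
    partitions = defaultdict(list)
    for key, record in pairs:
        partitions[int(key * num_reducers)].append((key, record))
    return [
        sorted(partitions[i], key=lambda kv: kv[0]) for i in range(num_reducers)
    ]


def reduce_phase(partitions):
    """Reduce: drop the key and emit records in the order they arrive."""
    return [record for partition in partitions for _, record in partition]


def mapreduce_shuffle(records, num_reducers=4, seed=None):
    rng = random.Random(seed)
    return reduce_phase(shuffle_phase(map_phase(records, rng), num_reducers))


records = list(range(20))
shuffled = mapreduce_shuffle(records, seed=7)
print(shuffled)
```

The output contains exactly the input records, ordered by their random keys, so no record is lost or duplicated.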

### Similarity and difference between WHERE and HAVING

Similarity: Both filter a query’s results based on a condition.

Difference: WHERE filters individual rows before aggregation, while HAVING filters groups after aggregation. So if you want to filter on an aggregate such as SUM, AVG, or COUNT, you use HAVING; for raw rows, use WHERE.
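
A single query can use both clauses. The sketch below runs against a hypothetical `sales` table in an in-memory SQLite database; the table, columns, and data are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL, status TEXT);
    INSERT INTO sales VALUES
        ('north', 100, 'paid'),
        ('north', 250, 'paid'),
        ('north', 900, 'refunded'),
        ('south',  50, 'paid'),
        ('south',  60, 'paid');
""")

big_regions = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE status = 'paid'        -- row filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) > 200     -- group filter, applied after aggregation
    ORDER BY region
""").fetchall()
print(big_regions)  # [('north', 350.0)]
```

The refunded 900 row is removed by WHERE before any sum is computed, so north totals 350 and south totals 110; HAVING then drops south.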

### Foreign key and its relation to primary key

A foreign key is a column (or set of columns) in one table that references the primary key of another table (or, in most databases, any unique key). It enforces referential integrity by ensuring that every non-null value in the foreign key column matches an existing key value in the referenced table. Essentially, it links tables together.
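
The sketch below shows the integrity check rejecting an orphan row. The `customers` and `orders` tables are hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, which is specific to SQLite and not part of the general concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
    );
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1);   -- valid: customer 1 exists
""")

# An order that points at a customer who does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orphan_rejected, order_count)
```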

### Clustered index vs. non-clustered index

A clustered index determines the physical order in which a table’s rows are stored. There can be only one clustered index per table; in systems such as SQL Server and MySQL’s InnoDB it is on the primary key by default.

A non-clustered index is a separate structure that stores the indexed values together with pointers to the table rows; it doesn’t affect physical order. You can have multiple non-clustered indexes per table.

In short: clustered = data sorted on disk, non-clustered = separate lookup structure for faster queries.
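
The distinction can be modeled with plain Python lists. This is a toy model and not how any database engine is implemented: the `rows` list stands in for a table stored in key order, and `name_index` stands in for a separate index whose entries point back to row positions. All names and data are hypothetical.

```python
import bisect

# Rows stored in primary-key order, standing in for a clustered index on id.
rows = [(1, "grace", 45), (2, "ada", 36), (3, "linus", 29), (4, "alan", 41)]
ids = [row[0] for row in rows]


def range_by_id(low, high):
    """Clustered range scan: the result is one contiguous slice of the rows."""
    start = bisect.bisect_left(ids, low)
    end = bisect.bisect_right(ids, high)
    return rows[start:end]


# Non-clustered index on name: a separate sorted list of (name, row position).
name_index = sorted((row[1], position) for position, row in enumerate(rows))
names = [entry[0] for entry in name_index]


def find_by_name(name):
    """Non-clustered lookup: search the index, then follow the pointer to the row."""
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return rows[name_index[i][1]]
    return None


print(range_by_id(2, 3))
print(find_by_name("alan"))
```

More indexes like `name_index` could be added on other columns, but `rows` itself can only be stored in one order, which is why a table has at most one clustered index.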

### Say you had the entire Facebook social graph (users and their friendships). How would you use MapReduce to find the number of mutual friends for every pair of Facebook users?

To find the number of mutual friends for every pair of Facebook users, I’d use MapReduce like this. In the Map step, for each user, I look at all pairs of their friends and emit each pair as a key with a count of 1, ordering the two names consistently so that (Bob, Carol) and (Carol, Bob) become the same key. So if Alice is friends with Bob and Carol, I emit (Bob, Carol) → 1, because Alice is one mutual friend of that pair. Then, in the Shuffle/Sort phase, all values for the same pair are grouped together. In the Reduce step, I sum the counts for each pair to get its total number of mutual friends.

This approach scales horizontally for very large social graphs. The key trade-offs are handling users with huge friend lists, since a user with n friends produces n(n-1)/2 pairs, and the fact that this is a batch computation, not a real-time one.
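
The same steps can be sketched as a single-process simulation. The function names and the tiny `friendships` graph are hypothetical; a real job would run on a MapReduce framework with the graph stored as one adjacency list per user.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user, friends):
    """Map: every pair of this user's friends has this user as a mutual friend."""
    # Sorting first gives each pair one canonical key, e.g. ("bob", "carol").
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1


def shuffle_phase(emitted):
    """Shuffle: group all emitted values by their pair key."""
    grouped = defaultdict(list)
    for key, value in emitted:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the 1s to get the mutual-friend count for each pair."""
    return {pair: sum(counts) for pair, counts in grouped.items()}


friendships = {
    "alice": {"bob", "carol", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"alice", "carol"},
}

emitted = [
    pair_and_count
    for user, friends in friendships.items()
    for pair_and_count in map_phase(user, friends)
]
mutual_counts = reduce_phase(shuffle_phase(emitted))
print(mutual_counts[("bob", "dave")])  # 2 (alice and carol)
```

Pairs with no mutual friends are never emitted, so they are absent from the output and are not reported with a count of zero.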

### Assume you are tasked with designing a large-scale system that tracks a variety of search query strings and their frequencies. How would you design this, and what trade-offs would you need to consider?

For tracking query strings and their frequencies at scale, I’d design a system with distributed ingestion of user queries into sharded counters. Each shard could be a key-value store like Redis, DynamoDB, or Cassandra. For very high traffic, I’d use approximate counting with a Count-Min Sketch, which saves memory while keeping counts close to accurate: it can overestimate a count but never underestimates it. We can do real-time updates with streaming pipelines like Kafka and Spark Streaming, or batch aggregation for analytics. The main trade-offs are accuracy versus memory, real-time versus batch processing, and handling very popular queries to avoid hotspots.
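
A minimal Count-Min Sketch, to make the accuracy-versus-memory trade-off concrete. The `width` and `depth` defaults and the use of SHA-256 with a row prefix as the hash family are illustrative assumptions, not tuned choices; a production system would use faster hashes and size the table from a target error bound.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table: estimates may be too high, never too low."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash per row, derived by prefixing the row number to the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever add to a cell, so the smallest cell is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for query in ["weather"] * 5 + ["news"] * 2:
    sketch.add(query)

print(sketch.estimate("weather"), sketch.estimate("news"))
```

Memory use is `width * depth` counters regardless of how many distinct query strings arrive. A larger `width` reduces the chance and size of overestimates, and a larger `depth` makes a bad estimate less likely.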