# Data Science Interview Preparation: Concepts & Code Review

This notebook provides a comprehensive review of key data modeling, database, and data science concepts commonly encountered in technical interviews. Each section includes terminology definitions, conceptual explanations, and Python code examples.

---

## 1. Database Fundamentals

### Key Terminology

- **Database**: An organized collection of structured data stored electronically
- **DBMS (Database Management System)**: Software that manages database access and operations
- **RDBMS (Relational Database Management System)**: Database system based on the relational model (e.g., PostgreSQL, MySQL)
- **Schema**: The structure/organization of a database including tables, columns, and relationships
- **Table**: A collection of related data organized in rows and columns
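- **Primary Key**: Column (or set of columns) whose value uniquely identifies each row in a table and cannot be NULL
- **Foreign Key**: A column (or set of columns) that references the primary key (or another unique key) of a table, usually a different one
- **Index**: Data structure that speeds up lookups and sorting at the cost of extra storage and slower writes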
- **Normalization**: Process of organizing data to reduce redundancy
- **ACID Properties**: Atomicity, Consistency, Isolation, Durability - guarantees for database transactions
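
Atomicity is the easiest of the four ACID properties to demonstrate in a few lines. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the `transfer` helper, and the balances are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "  name TEXT PRIMARY KEY,"
    "  balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates are applied or neither is."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

In the failed transfer, the credit to `bob` succeeds first, the debit from `alice` then violates the `CHECK` constraint, and the rollback undoes both statements, so the balances are unchanged.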

### Connecting to Databases in Python

In [None]:
# Install required packages (run once)
# sqlite3 ships with Python's standard library and is not installed via pip
# !pip install pandas sqlalchemy psycopg2-binary

import pandas as pd
import sqlite3
from sqlalchemy import create_engine

# Example 1: SQLite connection (lightweight, file-based)
conn = sqlite3.connect('example.db')

# Example 2: PostgreSQL connection (production databases)
# engine = create_engine('postgresql://user:password@localhost:5432/database_name')

print("SQLite connection established")

---
## 2. SQL - Structured Query Language

### SQL Categories

1. **DDL (Data Definition Language)**: Defines database structure
   - `CREATE`, `ALTER`, `DROP`, `TRUNCATE`

2. **DML (Data Manipulation Language)**: Manipulates data
   - `INSERT`, `UPDATE`, `DELETE`

3. **DQL (Data Query Language)**: Queries data
   - `SELECT`

4. **DCL (Data Control Language)**: Controls access
   - `GRANT`, `REVOKE`

5. **TCL (Transaction Control Language)**: Manages transactions
   - `COMMIT`, `ROLLBACK`, `SAVEPOINT`
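
The TCL statements can be tried directly with `sqlite3`. This is a minimal sketch, assuming a hypothetical `audit_log` table and savepoint name; `isolation_level=None` is used so that Python does not open transactions implicitly and the statements run exactly as written.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# BEGIN / SAVEPOINT / COMMIT statements below are the only transaction control.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE audit_log (entry TEXT)")

conn.execute("BEGIN")
conn.execute("INSERT INTO audit_log VALUES ('step 1')")
conn.execute("SAVEPOINT before_step_2")
conn.execute("INSERT INTO audit_log VALUES ('step 2')")
conn.execute("ROLLBACK TO SAVEPOINT before_step_2")  # undo only step 2
conn.execute("INSERT INTO audit_log VALUES ('step 3')")
conn.execute("COMMIT")

entries = [row[0] for row in conn.execute("SELECT entry FROM audit_log ORDER BY rowid")]
print(entries)
```

Rolling back to the savepoint discards only the work done after it, which is why `step 1` and `step 3` are committed while `step 2` is not.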

In [None]:
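# DDL Example: Creating tables
cursor = conn.cursor()

# SQLite enforces FOREIGN KEY constraints only when this pragma is enabled on the connection
cursor.execute('PRAGMA foreign_keys = ON')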

# Create a customers table
cursor.execute('''
CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    email TEXT UNIQUE,
    created_date DATE DEFAULT CURRENT_DATE
)
''')

# Create an orders table with foreign key
cursor.execute('''
CREATE TABLE IF NOT EXISTS orders (
    order_id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER,
    order_date DATE,
    total_amount DECIMAL(10, 2),
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
)
''')

conn.commit()
print("Tables created successfully")

In [None]:
# DML Example: Inserting and updating data

# INSERT
cursor.execute('''
INSERT INTO customers (first_name, last_name, email)
VALUES (?, ?, ?)
''', ('John', 'Doe', 'john.doe@email.com'))

cursor.execute('''
INSERT INTO customers (first_name, last_name, email)
VALUES (?, ?, ?)
''', ('Jane', 'Smith', 'jane.smith@email.com'))

# INSERT multiple orders
orders_data = [
    (1, '2024-01-15', 150.00),
    (1, '2024-02-20', 275.50),
    (2, '2024-01-18', 99.99)
]

cursor.executemany('''
INSERT INTO orders (customer_id, order_date, total_amount)
VALUES (?, ?, ?)
''', orders_data)

# UPDATE
cursor.execute('''
UPDATE customers
SET email = ?
WHERE customer_id = ?
''', ('john.updated@email.com', 1))

conn.commit()
print("Data inserted and updated successfully")

In [None]:
# DQL Example: Querying data with SELECT

# Basic SELECT
df = pd.read_sql_query('SELECT * FROM customers', conn)
print("All Customers:")
print(df)
print()

# SELECT with JOIN
query = '''
SELECT
    c.first_name,
    c.last_name,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
ORDER BY o.order_date
'''

df_orders = pd.read_sql_query(query, conn)
print("Customer Orders:")
print(df_orders)
print()

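# Aggregation query
# Note: HAVING COUNT(...) > 0 drops customers without orders, so the LEFT JOIN acts like an INNER JOIN here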
agg_query = '''
SELECT
    c.customer_id,
    c.first_name || ' ' || c.last_name as customer_name,
    COUNT(o.order_id) as total_orders,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, customer_name
HAVING COUNT(o.order_id) > 0
'''

df_agg = pd.read_sql_query(agg_query, conn)
print("Customer Summary:")
print(df_agg)

### Common SQL Interview Queries

**Window Functions** - Used for ranking, running totals, moving averages

```sql
SELECT
    customer_id,
    order_date,
    total_amount,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) as order_number,
    SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) as running_total
FROM orders;
```
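
To see what the window functions return, the query can be run against an in-memory SQLite database. This is a sketch that reuses the notebook's three sample orders and assumes the bundled SQLite library is version 3.25 or newer, the first release with window function support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-15", 150.00), (1, "2024-02-20", 275.50), (2, "2024-01-18", 99.99)],
)

rows = conn.execute("""
    SELECT customer_id,
           order_date,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_number,
           SUM(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""").fetchall()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window functions keep one output row per order; the numbering and the running total restart for each `customer_id` partition.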

**Subqueries** - Query within a query

```sql
SELECT *
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE total_amount > 200
);
```
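
A scalar subquery is also one common way to answer the "second highest salary" prompt listed in the practice questions. The sketch below assumes a hypothetical `employees` table with made-up names and salaries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 120), ("Ben", 150), ("Cal", 150), ("Dee", 90)],
)

# Scalar subquery: the highest salary strictly below the overall maximum.
second_highest = conn.execute("""
    SELECT MAX(salary)
    FROM employees
    WHERE salary < (SELECT MAX(salary) FROM employees)
""").fetchone()[0]

print(second_highest)
```

Because the comparison is strict, ties at the top (Ben and Cal) do not affect the answer, and the query yields `NULL` (`None` in Python) when fewer than two distinct salaries exist.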

**CTEs (Common Table Expressions)** - Temporary named result sets

```sql
WITH high_value_customers AS (
    SELECT customer_id, SUM(total_amount) as total
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total_amount) > 500
)
SELECT c.*, hvc.total
FROM customers c
JOIN high_value_customers hvc ON c.customer_id = hvc.customer_id;
```
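
With the sample orders inserted earlier in this notebook, no customer's total exceeds 500 (customer 1 totals 425.50), so the CTE query above returns no rows. The sketch below parameterizes the threshold to show both outcomes; the trimmed-down tables are a stand-in for the notebook's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'John'), (2, 'Jane');
    INSERT INTO orders VALUES (1, 150.00), (1, 275.50), (2, 99.99);
""")

CTE_QUERY = """
    WITH high_value_customers AS (
        SELECT customer_id, SUM(total_amount) AS total
        FROM orders
        GROUP BY customer_id
        HAVING SUM(total_amount) > ?
    )
    SELECT c.first_name, hvc.total
    FROM customers c
    JOIN high_value_customers hvc ON c.customer_id = hvc.customer_id
"""

above_500 = conn.execute(CTE_QUERY, (500,)).fetchall()
above_400 = conn.execute(CTE_QUERY, (400,)).fetchall()
print("Threshold 500:", above_500)
print("Threshold 400:", above_400)
```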

---
## 3. Data Modeling Concepts

### Entity-Relationship Diagrams (ERD)

**Key Components:**
- **Entity**: Object or concept (represented as rectangles)
- **Attribute**: Property of an entity (represented as ovals)
- **Relationship**: Association between entities (represented as diamonds)

**Cardinality Types:**
- **One-to-One (1:1)**: Each entity in A relates to at most one entity in B, and vice versa
- **One-to-Many (1:N)**: Each entity in A can relate to many in B, while each entity in B relates to at most one in A
- **Many-to-Many (M:N)**: Entities in A can relate to many in B and vice versa
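
In a relational schema, a many-to-many relationship is usually implemented with a junction table that holds one row per related pair. The following is a minimal sketch using hypothetical `students`, `courses`, and `enrollments` tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE courses  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Junction table: one row per (student, course) pair.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        course_id  INTEGER NOT NULL REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );

    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO courses  VALUES (10, 'Databases'), (20, 'Statistics');
    INSERT INTO enrollments VALUES (1, 10), (1, 20), (2, 10);
""")

# One student can have many courses, and one course can have many students.
per_student = conn.execute("""
    SELECT s.name, COUNT(*) AS n_courses
    FROM students s
    JOIN enrollments e ON e.student_id = s.student_id
    GROUP BY s.student_id
    ORDER BY s.name
""").fetchall()
print(per_student)

# Enrolling a student that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO enrollments VALUES (3, 10)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("Orphan enrollment rejected:", rejected)
```

The composite primary key on the junction table also prevents the same student from being enrolled in the same course twice.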

### Normalization

**Purpose**: Organize data to reduce redundancy and improve data integrity

**Normal Forms:**

1. **First Normal Form (1NF)**:
   - Each column contains atomic (indivisible) values
   - Each column contains values of a single type
   - Each column has a unique name
   - The order of rows and columns carries no meaning

2. **Second Normal Form (2NF)**:
   - Must be in 1NF
   - All non-key attributes are fully dependent on the primary key
   - No partial dependencies

3. **Third Normal Form (3NF)**:
   - Must be in 2NF
   - No transitive dependencies (non-key attributes don't depend on other non-key attributes)

4. **Boyce-Codd Normal Form (BCNF)**:
   - Must be in 3NF
   - For every non-trivial functional dependency X → Y, X is a superkey (informally, every determinant is a candidate key)
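
The normal forms above are defined in terms of functional dependencies, so it can help to test a suspected dependency against sample rows. The helper below is an illustrative sketch, not a standard library function; `fd_holds` and the `order_lines` data are hypothetical. Sample data can only disprove a dependency, since a dependency that holds in a few rows may still be violated by future rows.

```python
from collections import defaultdict


def fd_holds(rows, lhs, rhs):
    """Return True if each combination of `lhs` values maps to exactly one combination of `rhs` values."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        seen[key].add(tuple(row[col] for col in rhs))
    return all(len(values) == 1 for values in seen.values())


# Hypothetical order-line table with composite key (order_id, product_id).
order_lines = [
    {"order_id": 1, "product_id": "A", "product_name": "Widget", "qty": 2},
    {"order_id": 1, "product_id": "B", "product_name": "Gadget", "qty": 1},
    {"order_id": 2, "product_id": "A", "product_name": "Widget", "qty": 5},
]

# product_name depends on only part of the key: a partial dependency (2NF violation).
name_on_product = fd_holds(order_lines, ["product_id"], ["product_name"])

# qty is not determined by product_id alone; it needs the whole key.
qty_on_product = fd_holds(order_lines, ["product_id"], ["qty"])
qty_on_full_key = fd_holds(order_lines, ["order_id", "product_id"], ["qty"])

print(name_on_product, qty_on_product, qty_on_full_key)
```

Here `product_name` would move to a separate products table keyed by `product_id`, while `qty` stays with the composite key.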

In [None]:
# Example: Unnormalized vs Normalized data

# Unnormalized (violates 1NF - multiple values in phone_numbers)
unnormalized_data = pd.DataFrame({
    'employee_id': [1, 2],
    'name': ['Alice Johnson', 'Bob Smith'],
    'department': ['Engineering', 'Marketing'],
    'phone_numbers': ['555-1234, 555-5678', '555-9999']
})

print("Unnormalized Table (Violates 1NF):")
print(unnormalized_data)
print()

# Normalized: phone numbers and departments are moved into their own tables
employees = pd.DataFrame({
    'employee_id': [1, 2],
    'name': ['Alice Johnson', 'Bob Smith'],
    'department_id': [1, 2]
})

phone_numbers = pd.DataFrame({
    'phone_id': [1, 2, 3],
    'employee_id': [1, 1, 2],
    'phone_number': ['555-1234', '555-5678', '555-9999']
})

departments = pd.DataFrame({
    'department_id': [1, 2],
    'department_name': ['Engineering', 'Marketing']
})

print("Normalized Tables (1NF, 2NF, 3NF):")
print("\nEmployees:")
print(employees)
print("\nPhone Numbers:")
print(phone_numbers)
print("\nDepartments:")
print(departments)

---
## 4. Python Data Manipulation

### Pandas - Essential Operations

**Key Terminology:**
- **DataFrame**: 2D labeled data structure (like a table)
- **Series**: 1D labeled array
- **Index**: Row labels
- **Columns**: Column labels
- **Vectorization**: Operations on entire arrays without explicit loops

In [None]:
import pandas as pd
import numpy as np

# Creating sample data
data = {
    'product_id': [101, 102, 103, 104, 105],
    'product_name': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'],
    'category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Accessories'],
    'price': [999.99, 25.50, 75.00, 299.99, 89.99],
    'quantity_sold': [50, 200, 150, 75, 120]
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print(f"\nShape: {df.shape}")
print(f"\nData Types:\n{df.dtypes}")

In [None]:
# Common DataFrame operations

# 1. Filtering
print("Products over $50:")
print(df[df['price'] > 50])
print()

# 2. Sorting
print("Sorted by price (descending):")
print(df.sort_values('price', ascending=False))
print()

# 3. Grouping and Aggregation
print("Summary by category:")
summary = df.groupby('category').agg({
    'price': ['mean', 'min', 'max'],
    'quantity_sold': 'sum'
})
print(summary)
print()

# 4. Creating new columns
df['revenue'] = df['price'] * df['quantity_sold']
print("DataFrame with revenue column:")
print(df[['product_name', 'price', 'quantity_sold', 'revenue']])

In [None]:
# Handling missing data
df_with_nulls = df.copy()
df_with_nulls.loc[1, 'price'] = np.nan
df_with_nulls.loc[3, 'quantity_sold'] = np.nan

print("DataFrame with missing values:")
print(df_with_nulls)
print(f"\nMissing values per column:\n{df_with_nulls.isnull().sum()}")
print()

# Methods to handle missing data
print("Fill with mean:")
df_filled = df_with_nulls.fillna(df_with_nulls.mean(numeric_only=True))
print(df_filled)
print()

print("Drop rows with any null:")
df_dropped = df_with_nulls.dropna()
print(df_dropped)

In [None]:
# Merging DataFrames (SQL-like joins)

# Create additional data
suppliers = pd.DataFrame({
    'product_id': [101, 102, 103, 106],
    'supplier_name': ['TechCorp', 'AccessoriesPlus', 'KeyboardKing', 'DisplayPro']
})

print("Original products DataFrame:")
print(df[['product_id', 'product_name']])
print("\nSuppliers DataFrame:")
print(suppliers)
print()

# Inner join
print("Inner Join:")
inner = pd.merge(df, suppliers, on='product_id', how='inner')
print(inner[['product_id', 'product_name', 'supplier_name']])
print()

# Left join
print("Left Join (keep all products):")
left = pd.merge(df, suppliers, on='product_id', how='left')
print(left[['product_id', 'product_name', 'supplier_name']])
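
When reconciling two tables, it is often useful to know which keys failed to match. The sketch below rebuilds the product and supplier data from the cell above and uses the `indicator` and `validate` parameters of `pd.merge`; the variable names `no_supplier` and `no_product` are illustrative.

```python
import pandas as pd

products = pd.DataFrame({
    "product_id": [101, 102, 103, 104, 105],
    "product_name": ["Laptop", "Mouse", "Keyboard", "Monitor", "Headphones"],
})
suppliers = pd.DataFrame({
    "product_id": [101, 102, 103, 106],
    "supplier_name": ["TechCorp", "AccessoriesPlus", "KeyboardKing", "DisplayPro"],
})

# indicator=True adds a `_merge` column saying where each row came from.
# validate="one_to_one" raises MergeError if either side has duplicate keys.
merged = pd.merge(
    products, suppliers, on="product_id", how="outer",
    indicator=True, validate="one_to_one",
)

no_supplier = merged.loc[merged["_merge"] == "left_only", "product_id"].tolist()
no_product = merged.loc[merged["_merge"] == "right_only", "product_id"].tolist()

print("Products without a supplier:", no_supplier)
print("Suppliers without a product:", no_product)
```

Filtering on `left_only` is the pandas equivalent of a SQL anti-join (`LEFT JOIN ... WHERE right.key IS NULL`).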

---
## 5. Database Design Life Cycle

### Phases

1. **Requirements Analysis**: Gather and document business requirements
2. **Conceptual Design**: Create ERD showing entities and relationships
3. **Logical Design**: Convert ERD to normalized relational schema
4. **Physical Design**: Define storage structures, indexes, partitions
5. **Implementation**: Create database, tables, and load data
6. **Testing & Deployment**: Validate and deploy to production
7. **Maintenance**: Monitor, optimize, and update as needed

### Database Types Comparison

| Feature | RDBMS | NoSQL (Document/Wide-Column) | Graph DB | Vector Store |
|---------|-------|-------|----------|-------------|
| **Structure** | Tables/Rows | Documents/Collections, Wide Rows | Nodes/Edges | Embeddings/Vectors |
| **Schema** | Fixed | Flexible | Flexible | Flexible |
| **Relationships** | Foreign Keys | Embedded/References | Native | Similarity |
| **Scaling** | Primarily Vertical | Horizontal | Horizontal | Horizontal |
| **Use Case** | Transactions | High-volume, flexible-schema data | Highly connected data | Semantic Search |
| **Examples** | PostgreSQL, MySQL | MongoDB, Cassandra | Neo4j, Amazon Neptune | Pinecone, Weaviate |
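
The "Similarity" relationship in the vector store column can be made concrete with cosine similarity. This is a brute-force sketch over hypothetical three-dimensional embeddings; production vector stores typically use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
store = {
    "laptop":   [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

# Rank stored items from most to least similar to the query vector.
ranked = sorted(store, key=lambda name: cosine_similarity(query, store[name]), reverse=True)
print(ranked)
```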

---
## 6. Data Warehousing Concepts

### Key Terminology

- **Data Warehouse**: Centralized repository for integrated data from multiple sources
- **ETL (Extract, Transform, Load)**: Process of moving data from sources to warehouse
- **OLTP (Online Transaction Processing)**: Day-to-day transactional systems
- **OLAP (Online Analytical Processing)**: Systems optimized for complex queries and analysis
- **Fact Table**: Contains measurable, quantitative data (sales, revenue)
- **Dimension Table**: Contains descriptive attributes (customer info, product details)
- **Star Schema**: Fact table surrounded by dimension tables
- **Snowflake Schema**: Variant of the star schema in which dimension tables are normalized into additional sub-dimension tables
- **Data Mart**: Subset of data warehouse focused on specific business area
- **Data Lake**: Storage repository that holds raw data in its native format, whether structured, semi-structured, or unstructured
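
The three ETL steps can be sketched with the standard library alone. In the example below, the CSV content, the `transform` rules, and the `fact_orders` table are all hypothetical; `io.StringIO` stands in for a real source file and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Extract: read records from a (simulated) CSV export.
raw_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1, Acme Corp ,200.00\n"
    "2,tech solutions,50.00\n"
    "3,Acme Corp,not_a_number\n"
)


def transform(record):
    """Clean one record; return None for rows that fail validation."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return (int(record["order_id"]), record["customer"].strip().title(), amount)


# Transform: normalize customer names and drop rows with invalid amounts.
rows = [t for t in map(transform, csv.DictReader(raw_csv)) if t is not None]

# Load: insert the cleaned rows in a single transaction.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
with warehouse:
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(loaded)
```

A real pipeline would usually log or quarantine rejected rows instead of silently dropping them.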

### Star Schema Example

In [None]:
# Creating a simple star schema example

# Fact table - Sales
fact_sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4],
    'date_key': [20240101, 20240102, 20240102, 20240103],
    'product_key': [1, 2, 1, 3],
    'customer_key': [101, 102, 101, 103],
    'quantity': [2, 1, 3, 1],
    'unit_price': [100.00, 50.00, 100.00, 75.00],
    'total_amount': [200.00, 50.00, 300.00, 75.00]
})

# Dimension table - Products
dim_products = pd.DataFrame({
    'product_key': [1, 2, 3],
    'product_name': ['Widget A', 'Widget B', 'Widget C'],
    'category': ['Electronics', 'Home', 'Electronics']
})

# Dimension table - Customers
dim_customers = pd.DataFrame({
    'customer_key': [101, 102, 103],
    'customer_name': ['Acme Corp', 'Tech Solutions', 'Global Industries'],
    'region': ['West', 'East', 'Central']
})

# Dimension table - Date
dim_date = pd.DataFrame({
    'date_key': [20240101, 20240102, 20240103],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'month': [1, 1, 1],
    'quarter': [1, 1, 1],
    'year': [2024, 2024, 2024]
})

print("Star Schema Structure:")
print("\nFact Table (Sales):")
print(fact_sales)
print("\nDimension - Products:")
print(dim_products)
print("\nDimension - Customers:")
print(dim_customers)
print("\nDimension - Date:")
print(dim_date)

In [None]:
# Analytical query on star schema
analysis = fact_sales.merge(dim_products, on='product_key') \
                     .merge(dim_customers, on='customer_key') \
                     .merge(dim_date, on='date_key')

# Summary by region
regional_summary = analysis.groupby('region').agg({
    'total_amount': 'sum',
    'quantity': 'sum',
    'sale_id': 'count'
}).rename(columns={'sale_id': 'num_transactions'})

print("Sales Summary by Region:")
print(regional_summary)

---
## 7. API Concepts

### REST API Principles

**REST (Representational State Transfer)** - Architectural style for web services

**HTTP Methods:**
- **GET**: Retrieve data (read-only)
- **POST**: Create new resource
- **PUT**: Replace an entire resource (or create it at a known URI)
- **PATCH**: Partial update of resource
- **DELETE**: Remove resource

**Status Codes:**
- **2xx**: Success (200 OK, 201 Created)
- **3xx**: Redirection
- **4xx**: Client errors (400 Bad Request, 404 Not Found)
- **5xx**: Server errors (500 Internal Server Error)
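
Client code often branches on the status class rather than on individual codes. The sketch below uses the standard library's `http.HTTPStatus`; the `status_class` and `should_retry` helpers are hypothetical, and the retry rule is an assumption chosen for illustration, not a universal policy.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to the category names used in the list above."""
    classes = {2: "success", 3: "redirection", 4: "client error", 5: "server error"}
    return classes.get(code // 100, "other")


def should_retry(code):
    # Assumption for illustration: retry server errors and 429 Too Many Requests.
    return code >= 500 or code == HTTPStatus.TOO_MANY_REQUESTS


for code in (200, 201, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

Most 4xx responses indicate a problem with the request itself, so repeating the same request unchanged is unlikely to succeed.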

### FastAPI Example

In [None]:
# Basic FastAPI structure (for reference - won't run in notebook)
"""
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Product(BaseModel):
    id: int
    name: str
    price: float

# In-memory database
products_db = {}

@app.get("/")
def read_root():
    return {"message": "Welcome to Products API"}

@app.get("/products/{product_id}")
def read_product(product_id: int):
    if product_id not in products_db:
        raise HTTPException(status_code=404, detail="Product not found")
    return products_db[product_id]

@app.post("/products/", status_code=201)
def create_product(product: Product):
    if product.id in products_db:
        raise HTTPException(status_code=409, detail="Product already exists")
    products_db[product.id] = product
    return product

@app.put("/products/{product_id}")
def update_product(product_id: int, product: Product):
    if product_id not in products_db:
        raise HTTPException(status_code=404, detail="Product not found")
    products_db[product_id] = product
    return product

@app.delete("/products/{product_id}")
def delete_product(product_id: int):
    if product_id not in products_db:
        raise HTTPException(status_code=404, detail="Product not found")
    del products_db[product_id]
    return {"message": "Product deleted"}
"""

print("FastAPI code structure shown above (reference only)")

In [None]:
# Working with APIs using requests library
import requests

# Example: Public API call
# (This is a real endpoint you can test)
try:
    # The timeout keeps the cell from hanging if the server never responds
    response = requests.get('https://api.github.com/users/github', timeout=10)

    print(f"Status Code: {response.status_code}")
    print(f"Response Headers: {dict(list(response.headers.items())[:3])}")

    if response.status_code == 200:
        data = response.json()
        print(f"\nUser: {data.get('login')}")
        print(f"Name: {data.get('name')}")
        print(f"Public Repos: {data.get('public_repos')}")
except requests.RequestException as e:
    print(f"API call failed: {e}")

---
## 8. Interview Preparation Tips

### Common Technical Interview Topics

1. **SQL Queries**:
   - Know JOINs (INNER, LEFT, RIGHT, FULL)
   - Window functions (ROW_NUMBER, RANK, LAG/LEAD)
   - Aggregations with GROUP BY and HAVING
   - Subqueries and CTEs

2. **Data Modeling**:
   - ERD creation and interpretation
   - Normalization and when to denormalize
   - Primary vs. foreign keys
   - Cardinality relationships

3. **Python/Pandas**:
   - DataFrame manipulation
   - Merging and joining data
   - Handling missing values
   - GroupBy operations

4. **Database Concepts**:
   - ACID properties
   - Indexes and performance
   - Transactions
   - Database types (RDBMS vs NoSQL)

5. **System Design**:
   - Star vs. snowflake schema
   - ETL processes
   - Data warehouse architecture
   - Scalability considerations

### Practice Questions to Consider

1. "Write a SQL query to find the second highest salary"
2. "Design a database schema for an e-commerce platform"
3. "How would you handle missing data in a dataset?"
4. "Explain the difference between DELETE, TRUNCATE, and DROP"
5. "When would you use a NoSQL database over a relational database?"
6. "How would you optimize a slow-running query?"
7. "Design an ETL pipeline for a data warehouse"
8. "What are database indexes and when should you use them?"
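
For the questions on indexes and slow queries, one practical habit is to compare the query plan before and after adding an index. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`; the table contents, the `plan` helper, and the index name `idx_orders_customer` are hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT total_amount FROM orders WHERE customer_id = 42"

before = plan(query)  # expected to describe a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)   # expected to mention the new index

print("Before index:", before)
print("After index: ", after)
```

Indexes help most on columns used in `WHERE`, `JOIN`, and `ORDER BY` clauses; each additional index adds storage and slows down writes.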

---
## Summary Checklist

Before your interview, ensure you can:

✓ Write basic to intermediate SQL queries (SELECT, JOIN, GROUP BY, window functions)  
✓ Explain database normalization and the different normal forms  
✓ Create and interpret ERDs  
✓ Manipulate data using pandas (filtering, grouping, merging)  
✓ Explain ACID properties and transactions  
✓ Describe the difference between OLTP and OLAP systems  
✓ Understand star schema vs. snowflake schema  
✓ Explain REST API principles and HTTP methods  
✓ Connect to databases using Python  
✓ Discuss trade-offs between different database types  

---

**Good luck with your interview preparation!**