# Analytical Quiz: NumPy & Pandas

### Questions

1. You are given a dataset with millions of rows. Why would using NumPy arrays or Pandas DataFrames be preferable to Python lists and dictionaries?

2. You need to compute the mean income by city across a large dataset. Describe how you would approach this task using Pandas.

3. Suppose you notice a large number of `NaN` values in your dataset. What steps would you take to decide whether to fill, drop, or investigate them?

4. You suspect some columns in your dataset are highly correlated. How would you use NumPy or Pandas to confirm this?

5. Describe a scenario where vectorized operations in NumPy would significantly improve performance compared to traditional Python loops.


# Analytical Quiz: Data Gathering

### Questions

1. You need to collect product data from an online store that doesn't offer an API. How would you approach this?

2. What ethical and legal considerations should you account for when scraping websites?

3. An API limits you to 1000 records per request. How would you plan to gather and store data from it efficiently?

4. You are scraping data from a public news site. Sometimes pages fail to load. How would you make your scraping process more reliable?

5. You’ve gathered data from multiple sources with overlapping content. What strategies would you use to reconcile and deduplicate it?


# Analytical Quiz: Data Cleaning

### Questions

1. You find that 30% of values in a key column are missing. What steps would you take to decide how to handle them?

2. Your data contains inconsistent category labels such as `NY`, `New York`, and `new york`. How would you standardize them?

3. A dataset has outliers that skew the mean. What strategies would you consider to treat or preserve those outliers?

4. You need to ensure all numeric columns are correctly typed and free of non-numeric entries. How would you validate and enforce this?

5. Describe how you would design a repeatable and auditable data cleaning pipeline for a continuously updated dataset.


# Analytical Quiz: Data Augmentation

### Questions

1. Your dataset is highly imbalanced (90% no, 10% yes). What data augmentation strategies could help and how would you validate them?

2. You are working on a text classification task with limited data. What augmentation approaches might you try and why?

3. Describe how feature engineering differs from data augmentation. Where might they overlap?

4. How could synthetic data generation introduce bias into your model? What checks would you implement?

5. Explain how data augmentation could be applied during model training in real time, rather than as a preprocessing step.


# Final Assignment: End-to-End Data Processing

### Objective

Build a small pipeline that simulates a real-world data task using NumPy, Pandas, and data preparation techniques.

### Tasks

1. **Generate Data:**  
   - Use NumPy to simulate a dataset of 1000 people with fields: `age`, `income`, `city`, and `purchase_status`.
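   - Randomly replace a small fraction of values with `NaN` (at least in `age` and `purchase_status`) so the cleaning step has something to fix.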

2. **Clean Data:**  
   - Fill missing `age` values with the column mean.
   - Drop rows where `purchase_status` is missing.
   - Convert `purchase_status` to binary: `yes` → 1, `no` → 0.

3. **Augment Data:**  
   - Add a new feature, `age_group`, by binning `age` into ranges.
   - Upsample the minority class (sampling with replacement) so both `purchase_status` classes have the same number of rows.

4. **Output:**  
   - Save the cleaned and augmented DataFrame as CSV.
   - Print basic statistics (mean, value counts, etc.).
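
The tasks above could be wired together roughly as follows. This is a minimal sketch, not a reference solution: the city names, the distributions and their parameters, the 5% missing-value rate, the age bin edges and labels, and the output filename `people_clean.csv` are all illustrative assumptions that the assignment does not specify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so runs are reproducible
n = 1000

# 1. Generate data (distributions and city names are assumed for illustration)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),  # float so NaN can be stored
    "income": rng.normal(50_000, 15_000, size=n).round(2),
    "city": rng.choice(["Springfield", "Riverton", "Lakeside"], size=n),
    "purchase_status": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})

# Knock out roughly 5% of the values in two columns (assumed rate)
for col in ["age", "purchase_status"]:
    missing = rng.random(n) < 0.05
    df.loc[missing, col] = np.nan

# 2. Clean data
df["age"] = df["age"].fillna(df["age"].mean())
df = df.dropna(subset=["purchase_status"]).copy()
df["purchase_status"] = df["purchase_status"].map({"yes": 1, "no": 0}).astype(int)

# 3. Augment data
# Bin edges and labels are an arbitrary choice for this sketch
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)

# Simple upsampling: resample the minority class with replacement
counts = df["purchase_status"].value_counts()
minority_label = counts.idxmin()
n_extra = counts.max() - counts.min()
extra = df[df["purchase_status"] == minority_label].sample(
    n=n_extra, replace=True, random_state=0
)
balanced = pd.concat([df, extra], ignore_index=True)

# 4. Output
balanced.to_csv("people_clean.csv", index=False)
print(balanced[["age", "income"]].mean())
print(balanced["purchase_status"].value_counts())
print(balanced["age_group"].value_counts())
```

Filling `age` before dropping rows means the mean is computed over all 1000 simulated people; reversing the order would be equally defensible, as long as the choice is deliberate.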
