# Analytical Quiz: NumPy & Pandas

1. You're given a dataset with millions of rows. Why are NumPy arrays or Pandas DataFrames preferred over Python lists and dictionaries?

A) They are easier to read and debug

B) They typically use less memory and support much faster vectorized computations

C) They automatically remove missing data

D) They support more data types
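
To make the memory comparison in question 1 concrete, here is a rough sketch. The size `n` and the variable names are illustrative assumptions, and the list figure is only an estimate because `sys.getsizeof` counts each integer object separately.

```python
import sys

import numpy as np

n = 100_000  # illustrative size, not taken from the quiz
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both parts.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
# An array stores the raw values in one contiguous buffer: n items * 8 bytes.
array_bytes = arr.nbytes

print(f"list:  ~{list_bytes / 1e6:.1f} MB")
print(f"array: ~{array_bytes / 1e6:.1f} MB")
```

Exact numbers depend on the Python build, but the array should come out several times smaller.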

2. How would you efficiently compute the mean income by city across a large dataset using Pandas?

A) Iterate manually over each row and calculate means

B) Group the data by city and use the .mean() method

C) Convert the Pandas DataFrame to a Python list, then calculate the mean

D) Use NumPy arrays exclusively for mean calculation
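
For reference on question 2, a grouped mean might look like the sketch below. The toy data and the column names `city` and `income` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; a real dataset would be loaded from a file or database.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Boston", "Denver"],
    "income": [50000, 70000, 60000, 90000, 65000],
})

# Split the rows by city, then aggregate each group's income without a Python loop.
mean_income_by_city = df.groupby("city")["income"].mean()
print(mean_income_by_city)
```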

3. If you find many NaN values in your dataset, what should your first step be?

A) Immediately drop all NaN values

B) Fill NaN values with zero

C) Investigate the cause and distribution of NaNs

D) Randomly replace NaNs with existing data
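
A first look at where NaNs occur (question 3) could be sketched as follows. The DataFrame and its columns are hypothetical; the point is to count missing values per column and check whether they cluster in one group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "income": [1.0, np.nan, np.nan, np.nan, 5.0, 6.0],
    "age": [20, 30, np.nan, 40, 50, 60],
})

missing_per_column = df.isna().sum()   # absolute count of NaNs per column
missing_share = df.isna().mean()       # fraction of NaNs per column

# Share of missing income within each city: uneven shares hint that
# the values are not missing completely at random.
missing_by_city = df["income"].isna().groupby(df["city"]).mean()

print(missing_per_column, missing_share, missing_by_city, sep="\n\n")
```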

4. To confirm a suspicion that some columns are highly correlated, using Pandas you would:

A) Calculate standard deviation

B) Perform a .corr() analysis

C) Inspect a histogram

D) Count unique values
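
As an illustration for question 4, the sketch below builds synthetic columns (the names `x`, `x_scaled`, and `noise` are made up) and inspects their pairwise correlations with `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)

df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.1, size=500),  # nearly a linear copy of x
    "noise": rng.normal(size=500),                        # unrelated to x
})

# Pearson correlation between every pair of numeric columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

Values close to 1 or -1 off the diagonal flag column pairs worth a closer look.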

5. Which scenario illustrates NumPy's vectorized operations significantly improving performance?

A) Reading data from CSV files

B) Summing large arrays element-wise

C) Printing data to a console

D) Loading data into memory
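
To see what vectorization means in question 5, compare a single array expression with the equivalent Python-level loop. The array size is an arbitrary assumption; timing the two variants (for example with `timeit`) is left to the reader.

```python
import numpy as np

n = 100_000  # arbitrary size for illustration
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Vectorized: one expression, the element-wise loop runs in compiled code.
vectorized = a + b

# Pure-Python equivalent: the interpreter handles every element individually.
looped = [x + y for x, y in zip(a.tolist(), b.tolist())]
```

Both produce the same values; the vectorized form usually runs far faster on large arrays.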

# Analytical Quiz: Data Gathering

1. How would you collect product data from an online store without an API?

A) Request database access from the store

B) Scrape the website using tools like BeautifulSoup or Selenium

C) Ask the store to manually send product data

D) Wait for the store to develop an API

2. When scraping websites, important ethical and legal considerations include:

A) Ignoring robots.txt for more data

B) Following the site's terms of service and respecting privacy

C) Scraping only at peak traffic times

D) Collecting personal data without consent

3. To efficiently gather data from an API limited to 1000 records per request:

A) Request all data at once

B) Batch requests and store data incrementally

C) Request fewer than 1000 records each time

D) Continuously repeat the same request
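
Batching against a per-request cap (question 3) can be sketched without any real network access. `fetch_page` is a hypothetical stand-in for an API client, and the offset/limit scheme is an assumption; real APIs may paginate with cursors or page tokens instead.

```python
PAGE_LIMIT = 1000  # per-request cap from the question


def fetch_page(offset, limit):
    """Hypothetical stand-in for an API call; serves 2,500 fake records."""
    total = 2500
    return [{"id": i} for i in range(offset, min(offset + limit, total))]


def fetch_all(fetch, limit=PAGE_LIMIT):
    """Request one page at a time and accumulate the results incrementally."""
    records = []
    offset = 0
    while True:
        page = fetch(offset, limit)
        records.extend(page)      # store what we have before asking for more
        if len(page) < limit:     # a short (or empty) page means no data is left
            break
        offset += limit
    return records


all_records = fetch_all(fetch_page)
print(len(all_records))
```

In practice each page would often be written to disk or a database as it arrives, so a failure midway does not lose earlier batches.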

4. How could you handle failed page loads when scraping a public news site?

A) Stop scraping immediately

B) Use retries with delays and error handling

C) Switch immediately to a different site

D) Ignore failed pages entirely
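
One way to handle failed loads (question 4) is to retry with increasing delays. In this sketch, `flaky_fetch` is a hypothetical function that simulates two failures; the retry count and delays are illustrative assumptions.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=0.01):
    """Call fetch(url), retrying on network-style errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            if attempt == max_attempts:
                raise    # give up and surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated failure")
    return f"<html>content of {url}</html>"


page = fetch_with_retries(flaky_fetch, "https://example.com/news")
print(calls["n"], page)
```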

5. To reconcile and deduplicate data from multiple overlapping sources, you'd:

A) Randomly remove duplicates

B) Always keep the latest entry

C) Use consistent identifiers and merge techniques

D) Avoid using multiple sources
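
Reconciling overlapping sources (question 5) might look like this sketch. The three sources and the shared key `product_id` are made up for illustration.

```python
import pandas as pd

# Hypothetical overlapping sources that share the identifier `product_id`.
source_a = pd.DataFrame({"product_id": [1, 2, 3], "name": ["Pen", "Cup", "Bag"]})
source_b = pd.DataFrame({"product_id": [2, 3, 4], "price": [4.5, 20.0, 7.0]})
source_c = pd.DataFrame({"product_id": [3, 5], "name": ["Bag", "Hat"]})

# Stack the sources that carry the same fields, then drop repeated identifiers.
names = pd.concat([source_a, source_c], ignore_index=True)
names = names.drop_duplicates(subset="product_id", keep="first")

# An outer merge keeps products that appear in only one source.
combined = names.merge(source_b, on="product_id", how="outer")
print(combined)
```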

# Analytical Quiz: Data Cleaning

1. When 30% of values in a key column are missing, your first step is to:

A) Immediately fill them with zeros

B) Delete the entire column

C) Analyze why the data is missing and evaluate its impact

D) Replace them randomly with non-missing values

2. To standardize inconsistent category labels like 'NY', 'New York', 'new york', you would:

A) Remove all non-standard labels

B) Replace them manually each time they appear

C) Apply string normalization techniques

D) Convert them all to numeric codes
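
String normalization for the labels in question 2 could be sketched as below. The `canonical` lookup table is an assumption; a real project would build it from the values actually observed.

```python
import pandas as pd

labels = pd.Series(["NY", "New York", "new york", " ny ", "Boston"])

# Hypothetical mapping from cleaned-up variants to one canonical label.
canonical = {"ny": "New York", "new york": "New York", "boston": "Boston"}

# Trim whitespace and lowercase first, so the mapping needs fewer entries.
normalized = labels.str.strip().str.lower().map(canonical)
print(normalized.tolist())
```

Labels missing from the mapping become NaN, which makes unexpected variants easy to spot.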

3. When outliers skew the mean, an appropriate strategy is to:

A) Always remove the outliers

B) Investigate outliers and choose between treatment or preservation based on context

C) Replace outliers with average values

D) Ignore the impact on mean

4. To ensure numeric columns are correctly typed, you would:

A) Visually inspect data

B) Explicitly convert the columns and handle conversion errors

C) Trust original data format

D) Replace all numeric data with strings
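
For question 4, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN so they can be reviewed rather than silently kept as strings. The sample values are invented.

```python
import pandas as pd

raw = pd.Series(["42", "3.5", "n/a", "1,200", None])

# Values that cannot be parsed become NaN instead of raising an exception.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to convert.
failed = raw[converted.isna() & raw.notna()]
print(converted.tolist())
print(failed.tolist())
```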

5. A repeatable and auditable data cleaning pipeline would ideally:

A) Be executed manually each time

B) Have clearly documented steps and be automated

C) Change methods frequently

D) Skip logging transformations for the sake of efficiency

# Analytical Quiz: Data Augmentation

1. For a highly imbalanced dataset (90% no, 10% yes), beneficial augmentation strategies include:

A) Randomly deleting the minority class

B) Duplicating the majority class

C) Using synthetic data generation (e.g., SMOTE)

D) Ignoring imbalance completely
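
The core idea behind SMOTE, mentioned in question 1, is to create new minority samples by interpolating between a sample and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that idea, not the reference implementation from any library; the function name `smote_like` and its parameters are hypothetical.

```python
import numpy as np


def smote_like(X_min, n_new, k=3, rng=None):
    """Simplified SMOTE-style interpolation between minority-class neighbors."""
    if rng is None:
        rng = np.random.default_rng()
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, len(X_min), size=n_new)           # random base samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor per base
    gap = rng.random((n_new, 1))                              # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))  # hypothetical minority-class features
synthetic = smote_like(X_min, n_new=25, k=3, rng=rng)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, it stays within the range of the original minority data.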

2. In text classification with limited data, useful augmentation approaches might be:

A) Decreasing vocabulary size

B) Introducing random spelling mistakes

C) Paraphrasing sentences or using back-translation

D) Removing punctuation

3. The main difference between feature engineering and data augmentation is:

A) Augmentation creates new data points; feature engineering modifies existing features

B) Feature engineering always reduces dataset size

C) Augmentation directly improves model accuracy

D) There is no difference

4. Synthetic data generation might introduce bias because:

A) It is always perfectly balanced

B) It duplicates existing bias in training data

C) It only generates neutral examples

D) It always reduces variance

5. Real-time data augmentation during model training involves:

A) Applying augmentations before training starts

B) Augmenting data on the fly as each training batch is drawn

C) Using only synthetic data

D) Storing augmented data permanently
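
On-the-fly augmentation (question 5) can be sketched as a generator that perturbs each batch when it is drawn, so no augmented copy is stored. Gaussian noise is used here purely as an example transformation; the function name and parameters are assumptions.

```python
import numpy as np


def augmented_batches(X, y, batch_size, rng, noise_scale=0.05):
    """Yield shuffled mini-batches with fresh random noise added to the features."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        # New noise is drawn for every batch, so each epoch sees different inputs.
        noise = rng.normal(scale=noise_scale, size=X[batch_idx].shape)
        yield X[batch_idx] + noise, y[batch_idx]


rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(10, 2)  # hypothetical features
y = np.arange(10) % 2                          # hypothetical labels
batches = list(augmented_batches(X, y, batch_size=4, rng=rng))
print(len(batches))
```

The original `X` is never modified; only the yielded batches carry the perturbation.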

# Final Assignment: End-to-End Data Processing

### Objective

Build a small pipeline that simulates a real-world data task using NumPy, Pandas, and data preparation techniques.

### Tasks

1. **Generate Data:**  
   - Use NumPy to simulate a dataset of 1000 people with fields: `age`, `income`, `city`, and `purchase_status`.
   - Add some missing values randomly.

2. **Clean Data:**  
   - Fill missing ages with the mean.
   - Drop rows where `purchase_status` is missing.
   - Convert `purchase_status` to binary: yes → 1, no → 0.

3. **Augment Data:**  
   - Add a new feature: `age_group` based on age.
   - Perform simple upsampling to balance `purchase_status`.

4. **Output:**  
   - Save the cleaned and augmented DataFrame as CSV.
   - Print basic statistics (mean, value counts, etc.).
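
As a starting point, the four tasks could be wired together roughly as follows. This is one possible sketch, not a model answer: the value ranges, city names, 5% missing rate, age-group boundaries, class ratio, and output filename are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# 1. Generate data (distributions and categories are assumed).
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.normal(50_000, 15_000, size=n).round(2),
    "city": rng.choice(["Austin", "Boston", "Denver"], size=n),
    "purchase_status": rng.choice(["yes", "no"], size=n, p=[0.2, 0.8]),
})
# Knock out roughly 5% of two columns at random.
df.loc[rng.random(n) < 0.05, "age"] = np.nan
df.loc[rng.random(n) < 0.05, "purchase_status"] = np.nan

# 2. Clean data.
df["age"] = df["age"].fillna(df["age"].mean())
df = df.dropna(subset=["purchase_status"]).copy()
df["purchase_status"] = df["purchase_status"].map({"yes": 1, "no": 0}).astype(int)

# 3. Augment data: derive a feature, then upsample the minority class.
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "senior"]
)
counts = df["purchase_status"].value_counts()
minority = counts.idxmin()
deficit = int(counts.max() - counts.min())
extra = df[df["purchase_status"] == minority].sample(
    n=deficit, replace=True, random_state=42
)
balanced = pd.concat([df, extra], ignore_index=True)

# 4. Output.
balanced.to_csv("cleaned_augmented.csv", index=False)  # hypothetical filename
print(balanced[["age", "income"]].mean())
print(balanced["purchase_status"].value_counts())
print(balanced["age_group"].value_counts())
```

Upsampling here simply repeats minority rows with replacement; if a model is trained afterwards, resampling should be applied only to the training split so that duplicated rows do not leak into evaluation data.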
