# Intro to Data Pipelines — Hands-On Lab for ML@P Accelerator Lab 2

## Overview
This lab is designed to give you practical experience with NumPy, Pandas, and Exploratory Data Analysis (EDA). 

The focus of this lab is not on writing perfect code, but on developing the ability to reason about data: its structure, quality, distributions, and relationships.

---

## Learning Objectives
- Create and manipulate NumPy arrays
- Use vectorized operations instead of loops
- Apply boolean masking and aggregation functions
- Load and inspect tabular data using Pandas
- Identify numerical and categorical columns
- Handle missing values appropriately
- Perform univariate and multivariate exploratory analysis
- Interpret plots and summarize insights in words

---


## Lab Structure
This lab is divided into two main parts:

1. NumPy Exercises
2. Exploratory Data Analysis (EDA) with Pandas

---


In [2]:
#Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Part 1: NumPy Exercises
**Tasks:**
1. Create a 1D NumPy array containing the integers from 1 to 10 (inclusive)
2. Create a 2D NumPy array of size (3, 3)
3. For both arrays, print: ndim, shape, size and dtype

4. Given
arr = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])

    Extract the first row.
    Extract the last column.
    Extract the element with value 25.

5. Using the array from task 1, filter for elements greater than 5
6. Using the array from task 1, set all elements below 4 to 0
7. Given data = np.array([10, 20, np.nan, 40, np.nan, 60])

    Replace all NaNs with the mean of the non-NaN values

In [32]:
# Do your work here!

# 1
arr1 = np.array([1,2,3,4,5,6,7,8,9,10])
# print("1:", arr1)

# 2
arr2 = np.array([[1,2,3], [4,5,6], [7,8,9]])
# print("2:", arr2)

# 3
print("\nQ3:")

print(arr1.ndim)
print(arr1.shape)
print(arr1.size)
print(arr1.dtype)

print()

print(arr2.ndim)
print(arr2.shape)
print(arr2.size)
print(arr2.dtype)

# 4
print("\nQ4:")

arr3 = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])

print(arr3[0])       # first row
print(arr3[:, -1])   # last column
print(arr3[1, 1])    # element with value 25 (row 1, column 1)

# 5
print("\nQ5:")

print(arr1[arr1 > 5])

# 6
print("\nQ6:")

arr4 = arr1.copy()
arr4[arr4 < 4] = 0
print(arr4)

# 7
print("\nQ7:")

data = np.array([10, 20, np.nan, 40, np.nan, 60])
data = np.where(np.isnan(data), np.nanmean(data), data)
print(data)




Q3:
1
(10,)
10
int64

2
(3, 3)
9
int64

Q4:
[ 5 10 15]
[15 30 45]
25

Q5:
[ 6  7  8  9 10]

Q6:
[ 0  0  0  4  5  6  7  8  9 10]

Q7:
[10.  20.  32.5 40.  32.5 60. ]
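
The Q6 solution works on `arr1.copy()` rather than on `arr1` itself. A minimal sketch of why that matters, assuming nothing beyond NumPy (the names `original`, `view`, `fresh`, and `safe` are illustrative): basic slicing returns a view that shares memory with the source array, while `.copy()` gives independent data.

```python
import numpy as np

original = np.arange(1, 11)   # integers 1..10, same values as task 1

view = original[:3]           # basic slicing returns a view, not a copy
view[:] = 0                   # writing through the view also changes `original`
print(original)

fresh = np.arange(1, 11)
safe = fresh.copy()           # independent copy of the data
safe[safe < 4] = 0            # boolean-mask assignment touches only the copy
print(fresh)                  # unchanged
print(safe)
```

If later tasks reuse the task 1 array, working on a copy keeps earlier results reproducible.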


## Part 2: EDA

For this part, we want you to explore a dataset of your own choosing. Kaggle (https://www.kaggle.com/) hosts datasets across many fields; try to pick one that will test the skills taught in the workshop.

Can't find something? Use the Titanic training set: https://www.kaggle.com/competitions/titanic/data?select=train.csv

### Step 1 - Load the dataset using Pandas.

**Tasks:**
1. Read the CSV file into a DataFrame.
2. Display the first 5 rows of the dataset.
3. Print the shape of the DataFrame.
4. Use info() to inspect column data types and missing values.
5. Use describe() to view summary statistics for numerical columns.

**Questions to answer:**
- How many rows and columns does the dataset have?
- Which columns are numerical?
- Which columns are categorical?
- Are there any obvious issues with data types or missing values?
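
One possible way to approach the numerical/categorical questions is to let Pandas split the columns by dtype. The sketch below uses a small made-up DataFrame (`toy` and all of its column names are hypothetical, not taken from any real dataset). Dtype alone can mislead when numbers are stored as text, so it also checks how many values fail to parse.

```python
import pandas as pd

# Hypothetical toy data; replace with the DataFrame you loaded.
toy = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "female"],
    "qty_as_text": ["2", "4", "ERROR", "1"],  # numeric data stored as strings
})

numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Non-numerical (by dtype):", categorical_cols)

# A text column may really be numeric: convert and count the failures.
converted = pd.to_numeric(toy["qty_as_text"], errors="coerce")
print("Values that failed to parse:", converted.isna().sum())
```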

---

### Step 2 — Identify and Handle Missing Values

**Tasks:**
1. Identify how many missing values exist in each column.
2. Decide which numerical columns should be imputed using:
   - Mean, or
   - Median, or
   - Mode
3. Perform the imputation.
4. Verify that missing values have been handled correctly.

**Questions to answer:**
- Which columns contained missing values?
- Why did you choose mean, median, or mode for each column?
- Were any columns dropped? If yes, explain why.
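
To make the mean-versus-median choice concrete, here is a minimal sketch on a made-up column (`income` and its values are hypothetical). It shows how a single extreme value pulls the mean upward while the median stays put, which is one common reason to prefer the median for skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one missing value and one extreme value.
income = pd.Series([30.0, 32.0, 35.0, 31.0, np.nan, 500.0], name="income")

print("Missing before:", income.isna().sum())
print("Mean:  ", income.mean())    # pulled upward by the extreme value
print("Median:", income.median())  # robust to the extreme value

# Fill the gap with the median and confirm nothing is left missing.
income_filled = income.fillna(income.median())
print("Missing after:", income_filled.isna().sum())
```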

---

### Step 3 — Univariate Analysis (Numerical Columns)

Analyze numerical columns one at a time.

**Tasks:**
For at least two numerical columns:
1. Plot a histogram.
2. Plot a boxplot.
3. Examine the summary statistics.

**Questions to answer:**
- Is the distribution skewed?
- Are there extreme values or outliers?
- Would scaling be required for this column before modeling?
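
The three tasks above could be sketched as follows. The data is synthetic (an exponential sample standing in for a right-skewed column; the name `fare` is only illustrative), so swap in one of your own numerical columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical right-skewed data; replace with one of your numerical columns.
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=10.0, size=500), name="fare")

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
ax_hist.hist(values.to_numpy(), bins=30)
ax_hist.set_title("Histogram of fare")
ax_box.boxplot(values.to_numpy())
ax_box.set_title("Boxplot of fare")
fig.tight_layout()

print(values.describe())
print("Skewness:", values.skew())  # a value above 0 suggests a right-skewed distribution
```

A mean noticeably larger than the median in `describe()` points the same way as a positive skewness value.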

---

### Step 4 — Univariate Analysis (Categorical Columns)

Analyze categorical columns individually.

**Tasks:**
1. Use value_counts() for each categorical column.
2. Create a bar chart showing the distribution of categories.

**Questions to answer:**
- Which category appears most frequently?
- Is there any noticeable class imbalance?
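
A small sketch of these two tasks, using a made-up categorical column (`embarked` and its values are hypothetical). Relative frequencies from `value_counts(normalize=True)` make class imbalance easier to judge than raw counts.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical categorical column; replace with one from your dataset.
embarked = pd.Series(["S", "C", "S", "Q", "S", "S", "C", "S"], name="embarked")

counts = embarked.value_counts()                # absolute frequencies, most common first
shares = embarked.value_counts(normalize=True)  # relative frequencies
print(counts)
print(shares.round(2))

fig, ax = plt.subplots()
ax.bar(counts.index.tolist(), counts.to_numpy())
ax.set_xlabel("embarked")
ax.set_ylabel("count")
ax.set_title("Category frequencies")
```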

---

### Step 5 — Feature Engineering 

Create at least one new feature that could be useful for analysis.

**Questions to answer:**
- Why might this feature be useful?
- Is this new feature numerical or categorical?
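
As an illustration only, the sketch below derives one numerical and one categorical feature from two made-up columns (`sibsp`, `parch`, `family_size`, and `is_alone` are hypothetical names loosely modelled on a passenger dataset). Your own feature should come from what you learned about your data in the earlier steps.

```python
import pandas as pd

# Hypothetical columns: siblings/spouses aboard and parents/children aboard.
passengers = pd.DataFrame({
    "sibsp": [1, 0, 0, 3],
    "parch": [0, 0, 2, 1],
})

# Numerical feature: family members aboard, counting the passenger.
passengers["family_size"] = passengers["sibsp"] + passengers["parch"] + 1

# Categorical (boolean) feature derived from it: travelling alone or not.
passengers["is_alone"] = passengers["family_size"] == 1

print(passengers)
```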

---

### Step 6 — Multivariate Analysis

Explore relationships between variables.

#### Numerical vs Numerical
**Tasks:**
1. Create a scatter plot for each of two pairs of numerical variables.
2. Compute the correlation between the variables in each pair.

**Questions to answer:**
- Is there a visible relationship?
- Is the correlation positive, negative, or close to 0?
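
A minimal sketch for one pair of variables, on synthetic data with a positive linear relationship deliberately built in (`height`, `weight`, and the coefficients are made up). It uses the Pearson correlation, which is the default for `Series.corr`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: weight depends linearly on height, plus noise.
rng = np.random.default_rng(1)
height = rng.normal(170, 10, size=200)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=200)
people = pd.DataFrame({"height": height, "weight": weight})

fig, ax = plt.subplots()
ax.scatter(people["height"], people["weight"], alpha=0.6)
ax.set_xlabel("height")
ax.set_ylabel("weight")

# Pearson r: +1 is a perfect positive linear relation, -1 a perfect negative one,
# and values near 0 indicate no linear relation.
r = people["height"].corr(people["weight"])
print("Pearson r:", round(r, 3))
```

A correlation near 0 rules out only a linear relationship, so the scatter plot is still worth reading on its own.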


#### Categorical vs Numerical
**Tasks:**
1. Compute summary statistics of numerical features grouped by a categorical feature.

**Questions to answer:**
- Are there any significant differences in the statistics across groups in the categorical variable?
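
Grouped statistics can be sketched with `groupby` followed by `agg`. The `orders` frame and its values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one categorical column and one numerical column.
orders = pd.DataFrame({
    "location": ["In-store", "Takeaway", "In-store", "Takeaway", "In-store"],
    "total_spent": [4.0, 12.0, 10.0, 6.0, 7.0],
})

# One row per category, one column per statistic.
summary = orders.groupby("location")["total_spent"].agg(["count", "mean", "median", "std"])
print(summary)
```

Including `count` alongside the other statistics helps when judging differences: a gap between groups means little if one group has only a handful of rows.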

---

### Step 7 — Outlier Analysis

Focus on one numerical column with potential outliers.

**Tasks:**
1. Identify potential outliers using visual inspection (boxplot).
2. Try to decide whether these outliers are:
   - Data errors, or
   - Legitimate extreme values

**Questions to answer:**
- Should these outliers be removed?
- What is the IQR, and what are the lower and upper limits for removal?
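
The limits in the last question are commonly computed with the 1.5 × IQR rule. The sketch below assumes that rule and uses made-up values (`price` is a hypothetical column).

```python
import pandas as pd

# Hypothetical column with one suspiciously large value.
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 95], name="price")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; anything outside is flagged, not automatically removed.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Limits: [{lower}, {upper}]")
print("Flagged values:", outliers.tolist())
```

Whether a flagged value is a data error or a legitimate extreme is still a judgment call based on what the column means.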

---

### Step 8 — Final Insights

Summarize your findings.

**Tasks:**
1. Write at least three meaningful insights derived from your analysis.
2. Reference plots or statistics where appropriate.

Examples of insights:
- Patterns between variables
- Differences across categories

---

In [136]:
# Do your work here!

### ! Step 1 - Load the dataset using Pandas.

# **Tasks:**
# TODO 1. Read the CSV file into a DataFrame.
# TODO 2. Display the first 5 rows of the dataset.
# TODO 3. Print the shape of the DataFrame.
# TODO 4. Use info() to inspect column data types and missing values.
# TODO 5. Use describe() to view summary statistics for numerical columns.

# **Questions to answer:**
# ?- How many rows and columns does the dataset have?
    # ^ rows = 10,000
    # ^ cols = 8

# ?- Which columns are numerical?
# ?- Which columns are categorical?
    # ^ Types of data for the columns, Transaction ID: Identifier (string), Item: Categorical, Quantity: Numerical, Price Per Unit: Numerical
    # ^ Total Spent: Numerical, Payment Method: Categorical, Location: Categorical, Transaction Date: DateTime

# ?- Are there any obvious issues with data types or missing values?
    # ^ Yes. Missing values are encoded inconsistently (NaN as well as the strings 'ERROR' and 'UNKNOWN'),
    # ^ and numerical columns such as Quantity are read in as strings (see info())


# TODO 1. Read the CSV file into a DataFrame.
data = pd.read_csv('dirty_cafe_sales.csv')

# TODO 2. Display the first 5 rows of the dataset.
print(data.head(5))

# TODO 3. Print the shape of the DataFrame.
print(data.shape)

# TODO 4. Use info() to inspect column data types and missing values.
data.info()

# TODO 5. Use describe() to view summary statistics for numerical columns.
data.describe()





  Transaction ID    Item Quantity Price Per Unit Total Spent  Payment Method  \
0    TXN_1961373  Coffee        2            2.0         4.0     Credit Card   
1    TXN_4977031    Cake        4            3.0        12.0            Cash   
2    TXN_4271903  Cookie        4            1.0       ERROR     Credit Card   
3    TXN_7034554   Salad        2            5.0        10.0         UNKNOWN   
4    TXN_3160411  Coffee        2            2.0         4.0  Digital Wallet   

   Location Transaction Date  
0  Takeaway       2023-09-08  
1  In-store       2023-05-16  
2  In-store       2023-07-19  
3   UNKNOWN       2023-04-27  
4  In-store       2023-06-11  
(10000, 8)
<class 'pandas.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Transaction ID    10000 non-null  str  
 1   Item              9667 non-null   str  
 2   Quantity          9862 non-null   str  
 

| | Transaction ID | Item | Quantity | Price Per Unit | Total Spent | Payment Method | Location | Transaction Date |
|---|---|---|---|---|---|---|---|---|
| count | 10000 | 9667 | 9862 | 9821.0 | 9827.0 | 7421 | 6735 | 9841 |
| unique | 10000 | 10 | 7 | 8.0 | 19.0 | 5 | 4 | 367 |
| top | TXN_1961373 | Juice | 5 | 3.0 | 6.0 | Digital Wallet | Takeaway | UNKNOWN |
| freq | 1 | 1171 | 2013 | 2429.0 | 979.0 | 2291 | 3022 | 159 |


In [None]:

### ! Step 2 — Identify and Handle Missing Values

# **Tasks:**
# TODO 1. Identify how many missing values exist in each column.
# TODO 2. Decide which numerical columns should be imputed using:
#    - Mean, or
#    - Median, or
#    - Mode
# TODO 3. Perform the imputation.
# TODO 4. Verify that missing values have been handled correctly.

# **Questions to answer:**
# ?- Which columns contained missing values?
# ^ All but the Transaction ID column

# ?- Why did you choose mean or median for each column?
# ^ I used the per-item mean for the numerical columns: an item's price does not change, so the group mean recovers the exact value, and for Quantity it gives a good approximation

# ?- Were any columns dropped? If yes, explain why.
# ^ I dropped the Transaction ID column, and I also dropped rows that were missing Payment Method, Location, or Transaction Date, since you can't really impute those values and they are important for analysis

# TODO 1. Identify how many missing values exist in each column.
data.isna().sum()

# TODO 2. Decide which numerical columns should be imputed using:
# ^ Quantity: Numerical - mean by item
# ^ Price Per Unit: Numerical - mean by item
# ^ Total Spent: Numerical - quantity * price per unit

# TODO 3. Perform the imputation.
# ~ Quantity
# data['Quantity'].unique()

# errors='coerce' converts non-numeric values to NaN, so the explicit replace() on the next line is not strictly necessary, but it makes the intent clearer
data['Quantity'] = data['Quantity'].replace(['ERROR', 'UNKNOWN'], np.nan)
data['Quantity'] = pd.to_numeric(data['Quantity'], errors='coerce')
data['Quantity'] = data['Quantity'].fillna(data.groupby('Item')['Quantity'].transform('mean'))
data['Quantity'] = data['Quantity'].round()

# data['Quantity'].unique()

# ~ Price Per Unit
# data['Price Per Unit'].unique()

data['Price Per Unit'] = data['Price Per Unit'].replace(['ERROR', 'UNKNOWN'], np.nan)
data['Price Per Unit'] = pd.to_numeric(data['Price Per Unit'], errors='coerce')
data['Price Per Unit'] = data['Price Per Unit'].fillna(data.groupby('Item')['Price Per Unit'].transform('mean'))
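# Caution: round() snaps fractional prices to whole numbers (e.g. 1.5 becomes 2.0); skip it if the menu has such prices
data['Price Per Unit'] = data['Price Per Unit'].round()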

# data['Price Per Unit'].unique()
# data[data['Price Per Unit'] < 1]
# data[data['Quantity'] * data['Price Per Unit'] != data['Total Spent']]

# ~ Total Spent
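# Convert to numeric first so 'ERROR'/'UNKNOWN' become NaN, then fill only the missing entries
data['Total Spent'] = pd.to_numeric(data['Total Spent'], errors='coerce')
mask = data['Total Spent'].isna()
data.loc[mask, 'Total Spent'] = data.loc[mask, 'Quantity'] * data.loc[mask, 'Price Per Unit']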

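# ~ Treat 'ERROR'/'UNKNOWN' as missing in the categorical and date columns.
# ~ data_values keeps the rows where all three numerical columns are present;
# ~ data_cleaned keeps the rows where Payment Method, Location, and Transaction Date are present, and drops the Transaction ID column.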
data[['Payment Method', 'Location', 'Transaction Date']] = data[['Payment Method', 'Location', 'Transaction Date']].replace(['ERROR', 'UNKNOWN'], np.nan)

data_values = data.dropna(subset=['Quantity', 'Price Per Unit', 'Total Spent'])

data_cleaned = data.dropna(subset=['Payment Method', 'Location', 'Transaction Date']).drop(columns=['Transaction ID'])

# len(data_cleaned)

# TODO 4. Verify that missing values have been handled correctly.
data_cleaned.isna().sum()

data['Quantity'].unique()
data[data['Quantity'].isna()]

data['Price Per Unit'].unique()




array([ 2.,  3.,  1.,  5.,  4., nan])
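
The `nan` that remains in the output above is consistent with how `groupby(...).transform` handles rows whose group key is missing: by default they belong to no group, so there is no group mean to fill them with. A minimal sketch on made-up data (the `sales` frame and the global-median fallback are assumptions for illustration, not part of the solution above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a known item but no price; row 3 has neither.
sales = pd.DataFrame({
    "item": ["Coffee", "Coffee", "Tea", np.nan, "Tea"],
    "price": [2.0, np.nan, 1.5, np.nan, 1.5],
})

# Per-item mean, aligned to the original rows; rows with a missing item get NaN.
group_mean = sales.groupby("item")["price"].transform("mean")
filled = sales["price"].fillna(group_mean)
print(filled.tolist())

# One possible fallback: use the overall median where no group mean exists.
filled_all = filled.fillna(sales["price"].median())
print(filled_all.tolist())
```

Dropping such rows instead is equally defensible; either way, the choice belongs in the written answers for Step 2.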