# Note: Advanced functions such as `melt` and `pivot_table` will not be on the mid-term exam. They are still worth learning: they may appear in the end-semester exam and will be useful for your project.

# Pandas Questions

This notebook contains **6 diverse, scenario-based Pandas problems**.  
Each problem includes:
- Clear instructions on how to construct the DataFrame
- A hint pointing to useful Pandas functions
- A short explanation of what the key function(s) do and example usage

---


## Setup
Run the import cell below before attempting the questions.

In [2]:
import pandas as pd
import numpy as np
# pandas and NumPy are the only libraries imported in this notebook


## Q1 — Grouping, aggregation and filtering

**Instructions (DataFrame construction):**
Create `df_sales` with columns: `OrderID`, `CustomerID`, `Product`, `Units`, `UnitPrice`, `OrderDate`.
- Create ~12 rows across 4 customers (CustomerID 1–4) and 5 products (P1–P5).
- `OrderDate` as strings like `'2025-01-03'`.

**Task:**
1. Add `Amount = Units * UnitPrice`.
2. Compute total `Amount` per `CustomerID`.
3. Show only customers with total > 300, sorted descending.

**Hint:** Use `.groupby()` and `.sum()` after creating the `Amount` column. You can filter using boolean indexing.

**Function explanation — `groupby`:**
`groupby()` splits the DataFrame into groups based on column values. You then apply aggregation functions (e.g., `sum`, `mean`, `count`) to each group. Example:
```python
df.groupby('CustomerID')['Amount'].sum()
```
This returns total `Amount` per `CustomerID`.
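
The last step of the hint, filtering the grouped totals with boolean indexing, can be sketched on a small hand-made frame. The name `df_demo` and its values are made up for illustration and are not the exercise data; the threshold of 300 is the one stated in the task.

```python
import pandas as pd

# Hypothetical toy data (not the exercise's df_sales)
df_demo = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Amount": [100, 250, 80, 400, 50],
})

# Total Amount per customer: a Series indexed by CustomerID
totals = df_demo.groupby("CustomerID")["Amount"].sum()

# Boolean indexing keeps totals above the threshold; then sort descending
big_spenders = totals[totals > 300].sort_values(ascending=False)
print(big_spenders)
```

Because the grouped result is a Series, the comparison `totals > 300` yields a boolean Series with the same index, which is what makes the bracket filter work.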


In [36]:
# YOUR CODE HERE: build df_sales and solve Q1
customerID_sample = ['customer 1', 'customer 2', 'customer 3', 'customer 4']
product_sample = ['P1', 'P2', 'P3', 'P4', 'P5']
n = 12

# No random seed is set, so the sampled values differ on every run
orderID_series = np.arange(1, n + 1)
customerID_series = np.random.choice(customerID_sample, size=n)
product_series = np.random.choice(product_sample, size=n)
units_series = np.random.randint(3, 12, size=n)  # 3..11 units per order
unit_price = np.random.randint(20, 80, size=n)   # prices 20..79
# One order on the 3rd of each month of 2025, kept as strings
order_date_series = [f"2025-{month:02d}-03" for month in range(1, n + 1)]

# The units column is named "Unit" here (the instructions call it "Units")
df_sales = pd.DataFrame({"OrderID": orderID_series, "CustomerID": customerID_series, "Product": product_series,
                         "Unit": units_series, "UnitPrice": unit_price, "OrderDate": order_date_series},
                        index=np.arange(1, n + 1))

print("Original")
print(df_sales)

Original
    OrderID  CustomerID Product  Unit  UnitPrice   OrderDate
1         1  customer 2      P5    11         35  2025-01-03
2         2  customer 3      P2    11         33  2025-02-03
3         3  customer 3      P2     9         61  2025-03-03
4         4  customer 2      P2     9         37  2025-04-03
5         5  customer 2      P4    11         36  2025-05-03
6         6  customer 4      P3     8         70  2025-06-03
7         7  customer 3      P2     5         48  2025-07-03
8         8  customer 4      P1     4         31  2025-08-03
9         9  customer 1      P4     6         54  2025-09-03
10       10  customer 2      P5     9         70  2025-10-03
11       11  customer 4      P2    11         70  2025-11-03
12       12  customer 1      P3     6         30  2025-12-03


In [37]:
df_sales["Amount"] = df_sales["Unit"] * df_sales["UnitPrice"]

print("\nAdd Amount = Units * UnitPrice.:\n\n", df_sales)
print("\nCompute total Amount per CustomerID.\n\n",df_sales.groupby("CustomerID")["Amount"].sum())


Add Amount = Units * UnitPrice.:

     OrderID  CustomerID Product  Unit  UnitPrice   OrderDate  Amount
1         1  customer 2      P5    11         35  2025-01-03     385
2         2  customer 3      P2    11         33  2025-02-03     363
3         3  customer 3      P2     9         61  2025-03-03     549
4         4  customer 2      P2     9         37  2025-04-03     333
5         5  customer 2      P4    11         36  2025-05-03     396
6         6  customer 4      P3     8         70  2025-06-03     560
7         7  customer 3      P2     5         48  2025-07-03     240
8         8  customer 4      P1     4         31  2025-08-03     124
9         9  customer 1      P4     6         54  2025-09-03     324
10       10  customer 2      P5     9         70  2025-10-03     630
11       11  customer 4      P2    11         70  2025-11-03     770
12       12  customer 1      P3     6         30  2025-12-03     180

Compute total Amount per CustomerID.

 CustomerID
customer 1     504
customer 2    1744
customer 3    1152
customer 4    1454
Name: Amount, dtype: int32

In [39]:
l1 = df_sales.groupby("CustomerID")["Amount"].sum()
print("\nShow only customers with total > 300, sorted descending.\n\n")
print(l1[l1>300].sort_values(ascending=False))


Show only customers with total > 300, sorted descending.


CustomerID
customer 2    1744
customer 4    1454
customer 3    1152
customer 1     504
Name: Amount, dtype: int32


In [41]:
print("\n\n Extra :\n\n ")
print(df_sales.groupby("CustomerID")["Amount"].mean())
print(df_sales.groupby("CustomerID")["Amount"].max())
print(df_sales.groupby("CustomerID")["Amount"].min())
print(df_sales.dtypes)



 Extra :

 
CustomerID
customer 1    252.000000
customer 2    436.000000
customer 3    384.000000
customer 4    484.666667
Name: Amount, dtype: float64
CustomerID
customer 1    324
customer 2    630
customer 3    549
customer 4    770
Name: Amount, dtype: int32
CustomerID
customer 1    180
customer 2    333
customer 3    240
customer 4    124
Name: Amount, dtype: int32
OrderID        int64
CustomerID    object
Product       object
Unit           int32
UnitPrice      int32
OrderDate     object
Amount         int32
dtype: object


## Q2 — Join (merge) and derived boolean column

**Instructions (DataFrame construction):**
Create `df_employees` with `EmpID`, `Name`, `DeptID` (6 rows). Create `df_depts` with `DeptID`, `DeptName`, `Manager` (3 rows).

**Task:**
1. Merge the two tables into `df_full` so that each employee row includes `DeptName` and `Manager`.
2. Create `IsManagedByAlice`, which is `True` when `Manager == 'Alice'`.
3. Show the number of employees per `DeptName` and the number of employees managed by Alice.

**Hint:** Use `merge()` to join the tables and boolean comparison `==` to produce the new column.

**Function explanation — `merge`:**
`merge()` combines two DataFrames on a key column (like SQL JOIN). Example:
```python
df_employees.merge(df_depts, on='DeptID', how='left')
```
This adds department details to each employee row. For booleans, compare a column to a value:
```python
df['IsManagedByAlice'] = df['Manager'] == 'Alice'
```


In [48]:
# YOUR CODE HERE: build df_employees, df_depts and solve Q2
df_employees = pd.DataFrame([
    {'EmpID':1,'Name':'Ravi','DeptID':10},
    {'EmpID':2,'Name':'Meera','DeptID':20},
    {'EmpID':3,'Name':'Ajay','DeptID':10},
    {'EmpID':4,'Name':'Priya','DeptID':30},
    {'EmpID':5,'Name':'Sameer','DeptID':20},
    {'EmpID':6,'Name':'Anita','DeptID':30},
    ])
df_depts = pd.DataFrame([
    {'DeptID':10,'DeptName':'Data','Manager':'Alice'},
    {'DeptID':20,'DeptName':'Infra','Manager':'Bob'},
    {'DeptID':30,'DeptName':'HR','Manager':'Alice'},
])

df_full = df_employees.merge(df_depts, on='DeptID', how='left')
# 2. Derived boolean
df_full['IsManagedByAlice'] = df_full['Manager'] == 'Alice'

print(df_full.to_string(index=False))

 EmpID   Name  DeptID DeptName Manager  IsManagedByAlice
     1   Ravi      10     Data   Alice              True
     2  Meera      20    Infra     Bob             False
     3   Ajay      10     Data   Alice              True
     4  Priya      30       HR   Alice              True
     5 Sameer      20    Infra     Bob             False
     6  Anita      30       HR   Alice              True


In [46]:
# 2. Derived boolean
df_full['IsManagedByAlice'] = df_full['Manager'] == 'Alice'
df_full

   EmpID   Name  DeptID DeptName Manager  IsManagedByAlice
0      1   Ravi      10     Data   Alice              True
1      2  Meera      20    Infra     Bob             False
2      3   Ajay      10     Data   Alice              True
3      4  Priya      30       HR   Alice              True
4      5 Sameer      20    Infra     Bob             False
5      6  Anita      30       HR   Alice              True
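
Task 3 (employee counts per department and the number managed by Alice) is not worked in the cells above. One possible sketch follows; it rebuilds the same two tables so the block runs on its own. Using `groupby().size()` and summing the boolean column is just one way to get the counts, not the only one.

```python
import pandas as pd

df_employees = pd.DataFrame({
    "EmpID": [1, 2, 3, 4, 5, 6],
    "Name": ["Ravi", "Meera", "Ajay", "Priya", "Sameer", "Anita"],
    "DeptID": [10, 20, 10, 30, 20, 30],
})
df_depts = pd.DataFrame({
    "DeptID": [10, 20, 30],
    "DeptName": ["Data", "Infra", "HR"],
    "Manager": ["Alice", "Bob", "Alice"],
})

df_full = df_employees.merge(df_depts, on="DeptID", how="left")
df_full["IsManagedByAlice"] = df_full["Manager"] == "Alice"

# Number of employees in each department
dept_counts = df_full.groupby("DeptName").size()

# True counts as 1 when summed, so this is the number managed by Alice
n_alice = int(df_full["IsManagedByAlice"].sum())

print(dept_counts)
print("Managed by Alice:", n_alice)
```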


## Q3 — Reshape: melt and pivot

**Instructions (DataFrame construction):**
Create `df_quarterly` (wide) with columns `Company`, `Q1`, `Q2`, `Q3`, `Q4` for 5 companies (numeric revenue).

**Task:**
1. Use `melt` to convert to long format (`Company`, `Quarter`, `Revenue`).
2. Pivot back to wide using `pivot_table`.
3. Compute each company's annual revenue and append as column to the pivot result.

**Hint:** `pd.melt()` and `pivot_table()` are complementary — melt turns wide -> long; pivot_table (or pivot) can aggregate and reshape long -> wide.

**Function explanation — `melt` and `pivot_table`:**
- `melt(df, id_vars=[...], value_vars=[...])` turns columns into rows, useful for tidy/long-format data.
- `pivot_table(index=..., columns=..., values=..., aggfunc=...)` reshapes long data to wide and aggregates any duplicate index/column pairs (the default `aggfunc` is the mean). Example:
```python
df_long = pd.melt(df_quarterly, id_vars=['Company'], value_vars=['Q1','Q2','Q3','Q4'], var_name='Quarter', value_name='Revenue')
df_wide = df_long.pivot_table(index='Company', columns='Quarter', values='Revenue')
```
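
The round trip plus the annual total from Task 3 can be sketched on a tiny frame. The two companies and their numbers below are hypothetical, and `Annual` is an assumed column name; `aggfunc="sum"` is passed explicitly here, which gives the same result as the default mean when each company/quarter pair occurs once.

```python
import pandas as pd

# Hypothetical two-company frame, just to show the round trip
df_wide_demo = pd.DataFrame({
    "Company": ["A", "B"],
    "Q1": [10, 20], "Q2": [11, 21], "Q3": [12, 22], "Q4": [13, 23],
})

# Wide -> long: one row per (Company, Quarter)
df_long = pd.melt(df_wide_demo, id_vars=["Company"],
                  value_vars=["Q1", "Q2", "Q3", "Q4"],
                  var_name="Quarter", value_name="Revenue")

# Long -> wide again, with Company as the index
df_back = df_long.pivot_table(index="Company", columns="Quarter",
                              values="Revenue", aggfunc="sum")

# Row-wise sum across the four quarter columns
df_back["Annual"] = df_back[["Q1", "Q2", "Q3", "Q4"]].sum(axis=1)
print(df_back)
```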


In [None]:
# YOUR CODE HERE: build df_quarterly and solve Q3


## Q4 — Time series: rolling mean and resample

**Instructions (DataFrame construction):**
Create `df_temp` with daily `Date` (as strings) from '2025-03-01' to '2025-03-10' and a `Temperature` float column.

**Task:**
1. Convert `Date` to datetime and set as index.
2. Add `Temp_3d_avg` as 3-day rolling mean of Temperature.
3. Resample to weekly frequency and report mean temperature per week.

**Hint:** Convert dates with `pd.to_datetime()`, set as index, then use `.rolling()` and `.resample()`.

**Function explanation — `rolling` and `resample`:**
- `rolling(window=n).mean()` computes a moving average over `n` rows (good for smoothing). Example: `df['Temperature'].rolling(window=3).mean()`.
- `resample('W')` groups data into fixed time windows (here weekly); you can then call aggregation like `.mean()` to get weekly averages.
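
A short sketch of both calls on four hypothetical readings (the values and the name `df_demo` are made up). It shows two details that are easy to miss: the first `window - 1` rolling values are `NaN`, and weekly bins end on Sunday by default.

```python
import pandas as pd

# Hypothetical readings; dates start as strings, as in the exercise
df_demo = pd.DataFrame({
    "Date": ["2025-03-01", "2025-03-02", "2025-03-03", "2025-03-04"],
    "Temperature": [20.0, 22.0, 24.0, 26.0],
})
df_demo["Date"] = pd.to_datetime(df_demo["Date"])
df_demo = df_demo.set_index("Date")  # resample needs a datetime index

# First two rows are NaN: a 3-row window is not yet full there
df_demo["Temp_3d_avg"] = df_demo["Temperature"].rolling(window=3).mean()

# 'W' bins end on Sunday, so 03-01 (Sat) and 03-02 (Sun) share one bin
weekly = df_demo["Temperature"].resample("W").mean()
print(df_demo)
print(weekly)
```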


In [None]:
# YOUR CODE HERE: build df_temp and solve Q4


## Q5 — Missing data handling and imputation

**Instructions (DataFrame construction):**
Create `df_products` with columns `SKU`, `Category`, `Price`, `Stock` (8 rows). Include some `NaN` values in `Price` and `Stock`. At least 3 categories.

**Task:**
1. Show number of missing values per column.
2. Fill missing `Price` with median `Price` of the same `Category` and missing `Stock` with 0.
3. After imputation, show SKUs where `Stock == 0` or `Price > 100`.

**Hint:** Use `isna()`, `fillna()` and `groupby().transform()` to fill per-group values.

**Function explanation — `isna`, `fillna`, and `transform`:**
- `isna()` identifies missing values. `df.isna().sum()` gives counts.
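- `fillna(value)` returns a new object with missing values replaced by `value`; assign the result back to keep it (`inplace=True` modifies the object directly instead, but assignment is usually clearer).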
- `groupby(...).transform()` applies a function per group and returns a Series aligned with the original DataFrame, useful for filling missing values using group statistics, e.g.:
```python
df['Price'] = df.groupby('Category')['Price'].transform(lambda x: x.fillna(x.median()))
```
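
An equivalent way to write the per-group fill, sketched on hypothetical data: compute the category median for every row with `transform("median")`, then pass that aligned Series to `fillna`. The frame `df_demo` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical products with gaps in Price and Stock
df_demo = pd.DataFrame({
    "SKU": ["S1", "S2", "S3", "S4", "S5"],
    "Category": ["Toys", "Toys", "Toys", "Books", "Books"],
    "Price": [10.0, np.nan, 30.0, 120.0, np.nan],
    "Stock": [5, np.nan, 2, np.nan, 7],
})

# Missing values per column
missing_per_column = df_demo.isna().sum()

# transform("median") returns one value per row: that row's category median
category_median = df_demo.groupby("Category")["Price"].transform("median")
df_demo["Price"] = df_demo["Price"].fillna(category_median)
df_demo["Stock"] = df_demo["Stock"].fillna(0)

# | is element-wise OR; each condition needs its own parentheses
flagged = df_demo[(df_demo["Stock"] == 0) | (df_demo["Price"] > 100)]
print(missing_per_column)
print(flagged["SKU"].tolist())
```

Note that a category whose prices are all missing has no median, so its rows would stay `NaN` after this step.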


In [None]:
# YOUR CODE HERE: build df_products and solve Q5


## Q6 — Apply/custom logic and string operations

**Instructions (DataFrame construction):**
Create `df_courses` with columns `CourseID`, `Title`, `Enrolled`, `StartDate` for 6 courses. Some titles should contain 'Intro' or 'Advanced'. `StartDate` as strings.

**Task:**
1. Create `Level`: 'Beginner' if Title contains 'Intro', 'Advanced' if contains 'Advanced', else 'Intermediate'. Use `apply` or vectorized string ops.
2. Create `StartMonth` derived from `StartDate`.
3. Show average `Enrolled` per `Level`.

**Hint:** Use `.str.contains()` for text checks and either `.apply()` or `np.where` for conditional column creation. Convert `StartDate` to datetime to extract month.

**Function explanation — `str.contains`, `apply`, and datetime accessor `.dt`:**
- `df['Title'].str.contains('Intro')` returns a boolean Series marking rows where 'Intro' appears.
- `apply()` runs a function row- or column-wise; useful for custom logic but can be slower than vectorized ops.
- After converting `StartDate` to datetime (`pd.to_datetime()`), use `.dt.month` to get the month number. Example:
```python
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['StartMonth'] = df['StartDate'].dt.month
```
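
The three-way `Level` rule can be written without `apply` by nesting `np.where`, as the hint suggests. This is a minimal sketch on three hypothetical courses; the titles, enrolment figures and dates are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical courses (not the exercise's df_courses)
df_demo = pd.DataFrame({
    "Title": ["Intro to Python", "Advanced SQL", "Statistics"],
    "Enrolled": [40, 20, 30],
    "StartDate": ["2025-01-15", "2025-02-01", "2025-03-10"],
})

is_intro = df_demo["Title"].str.contains("Intro")
is_advanced = df_demo["Title"].str.contains("Advanced")

# Nested np.where: check 'Intro' first, then 'Advanced', else the fallback
df_demo["Level"] = np.where(
    is_intro, "Beginner",
    np.where(is_advanced, "Advanced", "Intermediate"),
)

# Month number extracted from the parsed date
df_demo["StartMonth"] = pd.to_datetime(df_demo["StartDate"]).dt.month

avg_enrolled = df_demo.groupby("Level")["Enrolled"].mean()
print(df_demo)
print(avg_enrolled)
```

`str.contains` is case-sensitive by default, so a title such as 'intro to R' would fall through to 'Intermediate' unless `case=False` is passed.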


In [None]:
# YOUR CODE HERE: build df_courses and solve Q6
