Creating a Synthetic Dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.random.seed(42)
n = 50
data = {
    "ID": np.arange(1, n+1),  # unique IDs
    "Age": np.random.randint(20, 60, size=n),  # ages from 20 to 59 (upper bound is exclusive)
    "Income": np.random.randint(30000, 100000, size=n).astype(float),  # income values
    "Education": np.random.choice(["High School", "Graduate", "Postgraduate"], size=n)  # categorical
}
df = pd.DataFrame(data)
df.head()

   ID  Age   Income     Education
0   1   58  97121.0  Postgraduate
1   2   48  99479.0  Postgraduate
2   3   34  49457.0   High School
3   4   27  96557.0  Postgraduate
4   5   40  82995.0   High School


In [20]:
df.loc[np.random.choice(df.index, 10, replace=False), "Income"] = np.nan
df.loc[np.random.choice(df.index, 7, replace=False), "Age"] = np.nan

print("Synthetic Dataset with NaN values:\n")
print(df.head(20))

Synthetic Dataset with NaN values:

    ID   Age   Income     Education  Income_z Age_Bin
0    1  58.0  97121.0  Postgraduate  1.469151   45-60
1    2  48.0  99479.0  Postgraduate  1.579138   45-60
2    3  34.0      NaN   High School -0.754087   25-35
3    4  27.0      NaN  Postgraduate       NaN   25-35
4    5  40.0  82995.0   High School  0.810259   35-45
5    6  58.0  70757.0  Postgraduate  0.239430   45-60
6    7   NaN      NaN      Graduate       NaN     NaN
7    8  42.0  75758.0  Postgraduate  0.472696   35-45
8    9  30.0  95697.0   High School  1.402730   25-35
9   10   NaN      NaN   High School       NaN   25-35
10  11   NaN  62606.0      Graduate -0.140765     NaN
11  12   NaN  41534.0  Postgraduate -1.123647     NaN
12  13   NaN  70397.0  Postgraduate  0.222638     NaN
13  14   NaN  31016.0      Graduate -1.614248   35-45
14  15  22.0  85591.0  Postgraduate  0.931346   18-25
15  16  41.0      NaN  Postgraduate       NaN   35-45
16  17  21.0  54300.0   High School -0.528190 
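
The In [20] counter and the Income_z and Age_Bin columns in this output suggest the cell was re-run after the later problems had been solved. Because it modifies df in place, every re-run blanks out more values. A minimal sketch of a repeatable alternative is given below; the helper name inject_nans, the toy frame, and the use of a dedicated random Generator are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def inject_nans(frame, column, k, rng):
    """Return a copy of `frame` with exactly `k` values in `column` set to NaN."""
    out = frame.copy()
    out[column] = out[column].astype(float)  # NaN needs a float column
    rows = rng.choice(out.index, size=k, replace=False)
    out.loc[rows, column] = np.nan
    return out

# Hypothetical clean data standing in for the notebook's df
clean = pd.DataFrame({
    "Age": np.arange(20, 70),
    "Income": np.linspace(30000, 99000, 50),
})

rng = np.random.default_rng(0)
dirty = inject_nans(clean, "Income", 10, rng)
dirty = inject_nans(dirty, "Age", 7, rng)

print(dirty.isna().sum())   # missing values per column
print(clean.isna().sum())   # the original frame is untouched
```

Starting from the clean frame each time keeps the number of missing values fixed no matter how often the cell is executed.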

Problem 1: Compute (a) mean, (b) median, and (c) age-weighted mean of income. Ignore NaNs where appropriate. Explain when a weighted mean is preferable.

In [5]:
# Mean of income
mean_income = df["Income"].mean(skipna=True)
# Median of income
median_income = df["Income"].median(skipna=True)
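# Age-weighted mean of income
# Formula: (Σ (Income * Age)) / (Σ Age)
# Caveat: the numerator skips rows where Income or Age is NaN, but the
# denominator sums every known Age, including rows whose Income is missing.
age_weighted_mean = (df["Income"] * df["Age"]).sum(skipna=True) / df["Age"].sum(skipna=True)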


In [6]:
print("\nResults:")
print(f"Mean Income = {mean_income:.2f}")
print(f"Median Income = {median_income:.2f}")
print(f"Age-Weighted Mean Income = {age_weighted_mean:.2f}")


Results:
Mean Income = 65623.87
Median Income = 68571.50
Age-Weighted Mean Income = 48075.27
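
In the age-weighted formula, the two sums are taken over different sets of rows whenever Income and Age are missing in different places, which pulls the result downward. This may help explain why the reported age-weighted mean sits well below the plain mean and the median. The sketch below uses a small made-up frame (the name toy and its values are assumptions) to compare the formula as written with a version restricted to rows where both columns are present.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age":    [30.0, 40.0, np.nan, 50.0],
    "Income": [100.0, np.nan, 300.0, 200.0],
})

# Formula as written in the cell above: the denominator still counts
# Age = 40 even though that row has no income.
unaligned = (toy["Income"] * toy["Age"]).sum() / toy["Age"].sum()

# Keep only rows where both Income and Age are known, then weight.
both = toy["Income"].notna() & toy["Age"].notna()
aligned = np.average(toy.loc[both, "Income"], weights=toy.loc[both, "Age"])

print(f"unaligned = {unaligned:.2f}")
print(f"aligned   = {aligned:.2f}")
```

A correctly aligned weighted mean always lies between the smallest and largest income that entered the calculation; the unaligned version carries no such guarantee.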


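Q. Explain when a weighted mean is preferable.

Ans. Use a weighted mean when observations should not contribute equally to the result, for example when each value summarizes a group of a different size, or when some measurements are more reliable than others. A plain mean gives every observation the same weight.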
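
As a small illustration of the answer above, consider two groups of different size where only the group means are known. The numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

group_means = np.array([50_000.0, 80_000.0])  # hypothetical mean income per group
group_sizes = np.array([90, 10])               # hypothetical number of people per group

plain = group_means.mean()                                 # treats both groups equally
weighted = np.average(group_means, weights=group_sizes)    # respects group size

print(f"plain mean    = {plain:.0f}")
print(f"weighted mean = {weighted:.0f}")
```

The weighted figure equals the mean that would be obtained from the 100 individual incomes, whereas the plain mean overstates it because the small high-income group counts as much as the large one.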

Problem 2: Standardize income (z-score). Report how many incomes are outliers using the rule |z| > 3. Handle NaNs correctly (do not drop entire rows unnecessarily).

In [8]:
# Z-score standardization
mean_income = df["Income"].mean(skipna=True)
std_income = df["Income"].std(skipna=True)
# Compute z-scores for Income
df["Income_z"] = (df["Income"] - mean_income) / std_income
# Identify outliers using |z| > 3
outliers = df[(df["Income_z"].abs() > 3)]

In [10]:
print("\nStandardized Income (z-scores):\n")
print(df[["ID", "Income", "Income_z"]].head(15))

print(f"\nNumber of outliers = {outliers.shape[0]}")
print("\nOutlier Rows:\n")
print(outliers)


Standardized Income (z-scores):

    ID   Income  Income_z
0    1  97121.0  1.469151
1    2  99479.0  1.579138
2    3  49457.0 -0.754087
3    4      NaN       NaN
4    5  82995.0  0.810259
5    6  70757.0  0.239430
6    7      NaN       NaN
7    8  75758.0  0.472696
8    9  95697.0  1.402730
9   10      NaN       NaN
10  11  62606.0 -0.140765
11  12  41534.0 -1.123647
12  13  70397.0  0.222638
13  14  31016.0 -1.614248
14  15  85591.0  0.931346

Number of outliers = 0

Outlier Rows:

Empty DataFrame
Columns: [ID, Age, Income, Education, Income_z]
Index: []
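
Finding zero outliers is expected for this dataset. Income was drawn uniformly from a fixed range, and for a uniform distribution on [a, b] the standard deviation is (b - a)/√12 while the farthest possible value lies (b - a)/2 from the mean, so |z| cannot exceed √3 ≈ 1.73 in the population. The sketch below checks this on a fresh sample; the seed and the variable names are assumptions, and sample values can exceed √3 slightly because the mean and standard deviation are estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.integers(30_000, 100_000, size=50).astype(float)

# ddof=1 matches the default of pandas' Series.std()
z = (income - income.mean()) / income.std(ddof=1)

max_abs_z = np.abs(z).max()
n_outliers = int((np.abs(z) > 3).sum())

print(f"largest |z|      = {max_abs_z:.3f}")
print(f"population bound = {np.sqrt(3):.3f}")
print(f"outliers (|z|>3) = {n_outliers}")
```

The |z| > 3 rule is more informative on data with heavier tails than a uniform sample.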


Problem 3: Create age bins [18, 25), [25, 35), [35, 45), [45, 60) and compute for each bin:
● count of observations
● mean income
● median score
Show the result as a tidy DataFrame sorted by age bin.

In [11]:
bins = [18, 25, 35, 45, 60]
labels = ["18-25", "25-35", "35-45", "45-60"]

df["Age_Bin"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)
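# observed=False keeps every bin in the result, including empty ones,
# and states the behaviour explicitly instead of relying on the default
result = df.groupby("Age_Bin", observed=False).agg(
    Count=("Income", "count"),  # number of non-missing incomes in the bin
    Mean_Income=("Income", "mean"),
    Median_Income=("Income", "median")
).reset_index()
# groupby already returns the categorical bins in order; the sort makes that explicit
result = result.sort_values("Age_Bin").reset_index(drop=True)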



In [12]:
print("\nResult by Age Bin:\n")
print(result)


Result by Age Bin:

  Age_Bin  Count  Mean_Income  Median_Income
0   18-25      5    66781.400        69099.0
1   25-35      8    65553.750        61847.0
2   35-45      9    62503.000        69504.0
3   45-60      8    68580.625        69400.5
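
The Count column above comes from "count" applied to Income, which counts only non-missing incomes. If "count of observations" is meant to include rows whose income is missing, "size" is the aggregation to use. The sketch below shows both side by side on a small made-up frame; the column names Rows and Income_Count and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age":    [20, 22, 30, 31, 50],
    "Income": [1000.0, np.nan, 2000.0, 4000.0, np.nan],
})
toy["Age_Bin"] = pd.cut(
    toy["Age"],
    bins=[18, 25, 35, 45, 60],
    labels=["18-25", "25-35", "35-45", "45-60"],
    right=False,
)

summary = toy.groupby("Age_Bin", observed=False).agg(
    Rows=("Age", "size"),              # every row in the bin
    Income_Count=("Income", "count"),  # only rows with a known income
    Mean_Income=("Income", "mean"),
).reset_index()

print(summary)
```

Reporting both numbers makes it clear how many rows in each bin actually contributed to the mean and median.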


Problem 4: Create an array that is not one-dimensional, then demonstrate the following operations:
● Shape and Resize → shape, size, Transpose, Flatten
● Negative indexing, and the error raised while indexing or slicing out of bounds
● Arithmetic Operations → Broadcasting, Dot Product
● Linear Algebra → Determinant, Inverse

In [13]:
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=float)
print("Original Array:\n", arr)

Original Array:
 [[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]


In [16]:
# Shape and Resize
print("\nShape of array:", arr.shape)
print("Size of array:", arr.size)
print("Transpose of array:\n", arr.T)
print("Flattened array:\n", arr.flatten())


Shape of array: (3, 3)
Size of array: 9
Transpose of array:
 [[1. 4. 7.]
 [2. 5. 8.]
 [3. 6. 9.]]
Flattened array:
 [1. 2. 3. 4. 5. 6. 7. 8. 9.]
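
The cell above covers shape, size, transpose, and flatten, but not resizing itself. The following sketch, on a small array chosen for illustration, contrasts flatten (a copy), ravel (a view when the data is contiguous), reshape (same elements, new shape), and np.resize (which repeats elements to fill a larger shape).

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)

flat_copy = a.flatten()          # always a copy
flat_copy[0] = 99.0              # does not change `a`

flat_view = a.ravel()            # shares memory with `a` here (contiguous data)
reshaped = a.reshape(3, 2)       # same 6 elements arranged as 3 x 2
resized = np.resize(a, (3, 4))   # 12 slots, filled by cycling through the 6 elements

print("a[0, 0] after editing the copy:", a[0, 0])
print("ravel shares memory with a:", np.shares_memory(flat_view, a))
print("reshaped shape:", reshaped.shape)
print("resized:\n", resized)
```

reshape requires the element count to stay the same, while np.resize changes it, so the two are not interchangeable.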


In [17]:
# Negative Indexing
print("\nLast row using negative indexing:", arr[-1])
print("Last element using negative indexing:", arr[-1, -1])


Last row using negative indexing: [7. 8. 9.]
Last element using negative indexing: 9.0


In [18]:
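# Out-of-bounds integer index raises IndexError
# (a slice such as arr[-5:] would be clamped to the valid range instead)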
try:
    print(arr[-5])
except IndexError as e:
    print("\nIndexError:", e)


IndexError: index -5 is out of bounds for axis 0 with size 3
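
The error above comes from an integer index, not from a slice. NumPy, like Python lists, clamps slice bounds to the valid range, so an out-of-range slice returns a shorter or empty array rather than raising. A short sketch on the same 3 x 3 values:

```python
import numpy as np

arr = np.arange(1.0, 10.0).reshape(3, 3)

print(arr[-5:])        # start clamped to 0: all three rows
print(arr[1:10])       # stop clamped to 3: rows 1 and 2
print(arr[5:].shape)   # empty result, no exception

try:
    arr[0, 3]          # integer indices are bounds-checked
except IndexError as exc:
    print("IndexError:", exc)
```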


In [19]:
# Arithmetic Operations
# Broadcasting (adding scalar)
print("\nBroadcasting (arr + 5):\n", arr + 5)

# Dot product (matrix multiplication)
dot_product = np.dot(arr, arr)
print("\nDot Product (arr x arr):\n", dot_product)

# Linear Algebra Operations
det = np.linalg.det(arr)
print("\nDeterminant:", det)

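# Inverse (only if the determinant is not numerically zero;
# an exact != 0 test is fragile with floating-point results)
if not np.isclose(det, 0.0):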
    inv = np.linalg.inv(arr)
    print("Inverse:\n", inv)
else:
    print("Matrix is singular, inverse does not exist")


Broadcasting (arr + 5):
 [[ 6.  7.  8.]
 [ 9. 10. 11.]
 [12. 13. 14.]]

Dot Product (arr x arr):
 [[ 30.  36.  42.]
 [ 66.  81.  96.]
 [102. 126. 150.]]

Determinant: 0.0
Matrix is singular, inverse does not exist
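
The 3 x 3 array of consecutive integers is singular because its rows are linearly dependent (row 1 + row 3 = 2 × row 2), so the inverse branch never runs. The sketch below shows the inverse on an invertible matrix and uses the matrix rank as the singularity check; the helper name safe_inverse and the 2 x 2 example are assumptions for illustration.

```python
import numpy as np

def safe_inverse(m):
    """Return the inverse of square matrix `m`, or None if it is rank-deficient."""
    if np.linalg.matrix_rank(m) < m.shape[0]:
        return None
    return np.linalg.inv(m)

singular = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
regular = np.array([[2, 1], [1, 3]], dtype=float)

print("Inverse of singular matrix:", safe_inverse(singular))

inv = safe_inverse(regular)
print("Inverse of regular matrix:\n", inv)
print("regular @ inv is the identity:", np.allclose(regular @ inv, np.eye(2)))
```

Checking the rank avoids relying on the determinant, whose floating-point value for a singular matrix can be a tiny nonzero number rather than exactly 0.0.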
