# Cost of Living Pressures for Low-Income Households
- You may use this template to structure your Capstone Project in Jupyter Notebook.
- Feel free to add or remove sections as needed based on your project scope.
- You are encouraged to include code cells, Markdown explanations, charts, and summaries to clearly demonstrate your analytical thinking and process.

## 1️⃣ Project Title and Introduction:

Give your project a meaningful title. Then briefly describe the context or background of your analysis.

## 2️⃣ Scoping Your Data Analysis Project

- What are the big questions that you are exploring?
- What are the datasets and data columns that you will be exploring?
- What relationships between the data columns will you be exploring?

## 3️⃣ Data Preparation

In [3]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import skew

# load datasets
lowincome_cpi = pd.read_csv("data/raw/cpi_2d_lowincome.csv")
statelevel_cpi = pd.read_csv("data/raw/cpi_2d_state_inflation.csv")
population = pd.read_csv("data/raw/population_dun.csv")
MCOICOP = pd.read_csv("data/raw/mcoicop.csv")

# quick inspection function
def inspect_dataset(name, df):
    print(f"\n===== {name} =====")
    print("\nInfo:")
    df.info()
    print("="*50)

# inspect all datasets
inspect_dataset("Low-Income CPI", lowincome_cpi)
inspect_dataset("State-Level CPI", statelevel_cpi)
inspect_dataset("Population by DUN", population)
inspect_dataset("MCOICOP", MCOICOP)




===== Low-Income CPI =====

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2632 entries, 0 to 2631
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   date      2632 non-null   object 
 1   division  2632 non-null   object 
 2   index     2519 non-null   float64
dtypes: float64(1), object(2)
memory usage: 61.8+ KB

===== State-Level CPI =====

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41888 entries, 0 to 41887
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   state          41888 non-null  object 
 1   date           41888 non-null  object 
 2   division       41888 non-null  object 
 3   inflation_yoy  39424 non-null  float64
 4   inflation_mom  40078 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.6+ MB

===== Population by DUN =====

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9000 entries, 0 to 

### Data Cleaning & Data Transformation:

- Handle missing values in each dataset (e.g., missing `index`, `inflation_yoy`, `inflation_mom`, `age`, or `population` values).




In [19]:
# Report duplicate rows and drop them in place if any are found
def identify_duplicates(df):
    duplicates = df.duplicated().sum()
    print("\nDuplicates:", duplicates)

    if duplicates > 0:
        print("Duplicate rows found:")
        display(df[df.duplicated(keep=False)])
        # remove the duplicate rows
        remove_duplicates(df)

# Drop duplicate rows in place, keeping the first occurrence
def remove_duplicates(df):
    # drop_duplicates() returns a new frame by default; inplace=True modifies df itself
    df.drop_duplicates(inplace=True)

    # verify
    print(f"Number of duplicate rows after removal: {df.duplicated().sum()}")


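# Rule of thumb: skewness between -0.5 and 0.5 → roughly symmetric, impute with the mean;
# otherwise impute with the median
def check_skewness(col):
    col = pd.to_numeric(col, errors="coerce").dropna()   # coerce to numeric, drop missing values
    n = len(col)
    if n < 3:  # too few observations to estimate skewness
        return None

    mean = col.mean()
    std = col.std()   # sample standard deviation (ddof=1)

    # third central moment divided by std**3; because std uses ddof=1, the result
    # differs slightly from scipy.stats.skew, which uses population moments by default
    skewness = ((col - mean)**3).sum() / n / (std**3)
    return skewness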

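# fill missing values with the column mean or median
def impute_data(col, method='mean'):
    if method == 'mean':
        return col.fillna(col.mean())
    elif method == 'median':
        return col.fillna(col.median())
    # fail loudly instead of silently returning None
    raise ValueError(f"Unknown imputation method: {method!r}")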

#### low-income dataset
# 1. copy
lowincome_cpi_cleaned = lowincome_cpi.copy()

# 2. check for duplicates (on the copy, so any removal lands in the cleaned frame)
identify_duplicates(lowincome_cpi_cleaned)
# 3. check missing values
print("Missing Values Count:\n", lowincome_cpi_cleaned.isnull().sum())
# 4. check skewness
print("Skewness (index):", check_skewness(lowincome_cpi_cleaned['index']))
# 5. fill (skewness is well above 0.5, so use the median)
lowincome_cpi_cleaned['index'] = impute_data(lowincome_cpi_cleaned['index'], method='median')
# 6. verify
print("Missing values after cleaning:", lowincome_cpi_cleaned.isnull().sum())



Duplicates: 0
Missing Values Count:
 date          0
division      0
index       113
dtype: int64
Skewness (index): 1.6944989496677305
Missing values after cleaning: date        0
division    0
index       0
dtype: int64
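
The mean-versus-median rule of thumb used above can also be applied automatically. The following is a minimal sketch, not the notebook's own method: `choose_impute_method` and its `threshold` parameter are hypothetical names, the toy values are made up, and it relies on pandas' built-in `Series.skew()` (a bias-corrected estimate) rather than the hand-written `check_skewness`.

```python
import pandas as pd

def choose_impute_method(col: pd.Series, threshold: float = 0.5) -> str:
    """Return 'mean' for roughly symmetric data, otherwise 'median'."""
    values = pd.to_numeric(col, errors="coerce").dropna()
    if len(values) < 3:
        # Too few observations to estimate skewness; fall back to the median.
        return "median"
    skewness = values.skew()  # pandas' bias-corrected sample skewness
    return "mean" if abs(skewness) <= threshold else "median"

# Toy data (hypothetical values, not taken from the CPI files)
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, None])
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 50.0, None])

method_symmetric = choose_impute_method(symmetric)
method_skewed = choose_impute_method(right_skewed)

# Apply the chosen method to the skewed column
filled = right_skewed.fillna(right_skewed.median())
```

Because a single large value dominates `right_skewed`, the helper picks the median, which is less sensitive to that outlier than the mean would be.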


In [20]:
#### statelevel_cpi dataset
# 1. copy
statelevel_cpi_cleaned = statelevel_cpi.copy()

# 2. check for duplicates (on the copy, so any removal lands in the cleaned frame)
identify_duplicates(statelevel_cpi_cleaned)

# 3. check missing values
print("Missing Values Count:\n", statelevel_cpi_cleaned.isnull().sum())

# 4. check skew
print("Skewness (inflation_yoy):", check_skewness(statelevel_cpi_cleaned['inflation_yoy']))
print("Skewness (inflation_mom):", check_skewness(statelevel_cpi_cleaned['inflation_mom']))

# 5. fill missing data (both columns are strongly skewed, so use the median)
statelevel_cpi_cleaned['inflation_yoy'] = impute_data(statelevel_cpi_cleaned['inflation_yoy'], method='median')
statelevel_cpi_cleaned['inflation_mom'] = impute_data(statelevel_cpi_cleaned['inflation_mom'], method='median')

# 6. verify
print("Missing values after cleaning:\n", statelevel_cpi_cleaned.isnull().sum())


Duplicates: 0
Missing Values Count:
 state               0
date                0
division            0
inflation_yoy    2464
inflation_mom    1810
dtype: int64
Skewness (inflation_yoy): 1.8073331594698339
Skewness (inflation_mom): 3.5597662693415035
Missing values after cleaning:
 state            0
date             0
division         0
inflation_yoy    0
inflation_mom    0
dtype: int64
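
The cell above fills every gap with one global median. One possible alternative, shown here only as a sketch and not as what the notebook does, is to fill each gap with the median of its own `division`, so that a missing food inflation value is not replaced by a figure dominated by other categories. The toy values and the `inflation_yoy_filled` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy data (made-up values) with one gap per division
df = pd.DataFrame({
    "division": ["01", "01", "01", "07", "07", "07"],
    "inflation_yoy": [4.0, None, 6.0, 1.0, 2.0, None],
})

# transform("median") returns a Series aligned with df, holding each row's group median
group_median = df.groupby("division")["inflation_yoy"].transform("median")

# Fill only the missing entries with the median of their own division
df["inflation_yoy_filled"] = df["inflation_yoy"].fillna(group_median)
```

Whether a per-division (or per-state) median is more appropriate than a global one depends on how the missing values are distributed, which this sketch does not examine.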


In [16]:
# population dataset
# 1. copy
population_cleaned = population.copy()

# 2. check for duplicates (on the copy, so any removal lands in the cleaned frame)
identify_duplicates(population_cleaned)

# 3. check missing values
print("Missing Values Count:\n", population_cleaned.isnull().sum())

# 4. count distinct age groups and check skew
n_age_groups = population_cleaned['age'].nunique()   # nunique() must be called; NaN is not counted
print("Skewness (population):", check_skewness(population_cleaned['population']))

# 5. fill missing data
population_cleaned['age'] = population_cleaned['age'].fillna("overall")
population_cleaned['population'] = impute_data(population_cleaned['population'], method='median')

# 6. verify
print("Missing values after cleaning:\n", population_cleaned.isnull().sum())


Duplicates: 0
Missing Values Count:
 date            0
state           0
parlimen        0
dun             0
sex             0
age           409
ethnicity       0
population    388
dtype: int64
Skewness (population): 3.209489941188062
Missing values after cleaning:
 date          0
state         0
parlimen      0
dun           0
sex           0
age           0
ethnicity     0
population    0
dtype: int64


In [29]:
# MCOICOP dataset
# 1. copy
MCOICOP_cleaned = MCOICOP.copy()

# 2. check for duplicates (on the copy, so any removal lands in the cleaned frame)
identify_duplicates(MCOICOP_cleaned)

# 3. check missing data
print("Missing Values Count:\n", MCOICOP_cleaned.isna().sum())

# 4. fill missing data
# The NaN values in group, class and subclass mark hierarchy levels that do not
# apply to a row; they are not missing data. Fill them with the placeholder 'N/A'
# so that later checks report no missing values.
MCOICOP_cleaned = MCOICOP_cleaned.fillna('N/A')

# 5. verify
print("Missing values after cleaning:\n", MCOICOP_cleaned.isnull().sum())



Duplicates: 0
Missing Values Count:
 digits        0
division      0
group        14
class        61
subclass    162
desc_en       0
desc_bm       0
dtype: int64
Missing values after cleaning:
 digits      0
division    0
group       0
class       0
subclass    0
desc_en     0
desc_bm     0
dtype: int64


### Data Manipulation and Data Transformation:
- Ensure data types and formatting are consistent.
- Create new columns that are helpful for data analysis.
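
As a rough illustration of both points, the sketch below converts text columns to proper types and derives calendar columns from the date. It uses a made-up toy frame with the same `date`, `division` and `index` column names as the low-income CPI file; the `YYYY-MM-DD` date format and the `year` and `month` column names are assumptions, not something the datasets confirm.

```python
import pandas as pd

# Toy frame mimicking the date/division/index layout (values are made up)
cpi = pd.DataFrame({
    "date": ["2023-01-01", "2023-02-01", "2024-01-01"],
    "division": ["overall", "01", "1"],
    "index": ["120.5", "121.0", None],
})

# Consistent data types: real datetimes and floats instead of strings
cpi["date"] = pd.to_datetime(cpi["date"], format="%Y-%m-%d")
cpi["index"] = pd.to_numeric(cpi["index"], errors="coerce")

# Consistent formatting: zero-pad division codes so "1" and "01" match;
# "overall" is longer than two characters and is left unchanged
cpi["division"] = cpi["division"].astype(str).str.zfill(2)

# New helper columns for grouping by calendar period
cpi["year"] = cpi["date"].dt.year
cpi["month"] = cpi["date"].dt.month
```

With `date` stored as a datetime, later steps can group by `year` or `month` without string slicing.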



In [32]:
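# create calculated column

# 1. Create a mapping for division -> category (English meaning in the comments)
division_mapping = {
    "01": "Makanan & Minuman Bukan Alkohol",  # Food & Non-Alcoholic Beverages
    "02": "Minuman Beralkohol & Tembakau",  # Alcoholic Beverages & Tobacco
    "03": "Pakaian & Alas Kaki",  # Clothing & Footwear
    "04": "Perumahan, Air, Elektrik, Gas & Bahan Api Lain",  # Housing, Water, Electricity, Gas & Other Fuels
    "05": "Perabot, Peralatan Rumah & Penyelenggaraan Rutin Isi Rumah",  # Furnishings, Household Equipment & Routine Household Maintenance
    "06": "Kesihatan",  # Health
    "07": "Pengangkutan",  # Transport
    "08": "Komunikasi",  # Communication
    "09": "Rekreasi & Kebudayaan",  # Recreation & Culture
    "10": "Pendidikan",  # Education
    "11": "Restoran & Hotel",  # Restaurants & Hotels
    "12": "Perkhidmatan Kewangan & Insurans",  # Financial Services & Insurance
    "13": "Penjagaan Diri, Perlindungan Sosial & Pelbagai Barangan & Perkhidmatan"  # Personal Care, Social Protection & Miscellaneous Goods & Services
}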

# Make sure division column is string
MCOICOP_cleaned['division'] = MCOICOP_cleaned['division'].astype(str)

# Create category column
MCOICOP_cleaned['category'] = MCOICOP_cleaned['division'].map(division_mapping)
# For the overall row (digits == 1), assign "Semua item" (all items)
MCOICOP_cleaned.loc[MCOICOP_cleaned['digits'] == 1, 'category'] = 'Semua item'

# Check the result
print(MCOICOP_cleaned[['digits', 'division', 'category']].head(20))

# standardized formatting
# ensured consistent data types
# performed data quality checks

    digits division                         category
0        1  overall                       Semua item
1        2       01  Makanan & Minuman Bukan Alkohol
2        3       01  Makanan & Minuman Bukan Alkohol
3        4       01  Makanan & Minuman Bukan Alkohol
4        5       01  Makanan & Minuman Bukan Alkohol
5        3       01  Makanan & Minuman Bukan Alkohol
6        4       01  Makanan & Minuman Bukan Alkohol
7        5       01  Makanan & Minuman Bukan Alkohol
8        5       01  Makanan & Minuman Bukan Alkohol
9        5       01  Makanan & Minuman Bukan Alkohol
10       5       01  Makanan & Minuman Bukan Alkohol
11       5       01  Makanan & Minuman Bukan Alkohol
12       4       01  Makanan & Minuman Bukan Alkohol
13       5       01  Makanan & Minuman Bukan Alkohol
14       5       01  Makanan & Minuman Bukan Alkohol
15       5       01  Makanan & Minuman Bukan Alkohol
16       4       01  Makanan & Minuman Bukan Alkohol
17       5       01  Makanan & Minuman Bukan A

### Data Joining:

- Join the datasets using a unique identifier.
- Perform groupby to uncover relationships between variables.
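
A minimal sketch of the join-then-group pattern, assuming `division` is the shared key between a CPI table and a category lookup. The frames below are made-up toy data, and `lookup_raw`, `lookup`, `merged` and `mean_index` are hypothetical names. In the MCOICOP table the same `division` value repeats across hierarchy levels, so the sketch first reduces the lookup to one row per division before joining.

```python
import pandas as pd

# Toy CPI rows (made-up values)
cpi = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-02-01", "2024-02-01"],
    "division": ["01", "07", "01", "07"],
    "index": [130.0, 118.0, 132.0, 119.0],
})

# Toy lookup where a division appears on several hierarchy levels
lookup_raw = pd.DataFrame({
    "digits": [2, 3, 2],
    "division": ["01", "01", "07"],
    "category": ["Makanan & Minuman Bukan Alkohol",
                 "Makanan & Minuman Bukan Alkohol",
                 "Pengangkutan"],
})

# Keep one row per division so the join cannot multiply CPI rows
lookup = lookup_raw.drop_duplicates(subset="division")[["division", "category"]]

# Left join keeps every CPI row; validate raises if the lookup key is not unique
merged = cpi.merge(lookup, on="division", how="left", validate="many_to_one")

# Average index per category
mean_index = merged.groupby("category")["index"].mean()
```

The `validate="many_to_one"` argument is a safeguard: if the lookup still contained duplicate keys, pandas would raise an error instead of silently duplicating rows.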


## 4️⃣ Exploratory Data Analysis




### Descriptive Analysis:

- Explore overall descriptive analysis.
- Filter subsets to answer big questions.
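
The two steps above might look like the following sketch. It uses a made-up frame with the `state`, `division` and `inflation_yoy` columns of the state-level file; the state names and values are placeholders, and `summary`, `food` and `food_by_state` are hypothetical names.

```python
import pandas as pd

# Toy state-level inflation data (made-up values)
inflation = pd.DataFrame({
    "state": ["Johor", "Johor", "Selangor", "Selangor"],
    "division": ["overall", "01", "overall", "01"],
    "inflation_yoy": [2.0, 4.0, 3.0, 5.0],
})

# Overall descriptive statistics: count, mean, std, min, quartiles, max
summary = inflation["inflation_yoy"].describe()

# Filter a subset that speaks to one question, here food inflation (division "01")
food = inflation[inflation["division"] == "01"]

# Compare the subset across states
food_by_state = food.groupby("state")["inflation_yoy"].mean()
```

Filtering before grouping keeps the "overall" rows from being averaged together with individual divisions.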

### Data Visualisation:
- Visualise data in graphs to better understand the data.
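
One simple option is a line chart of the index over time, one line per category. This is a sketch on made-up values; the `Food` and `Transport` column names are placeholders, and the `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy monthly index values for two categories (made-up numbers)
trend = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "Food": [130.0, 131.5, 132.0],
    "Transport": [118.0, 118.4, 119.1],
})

fig, ax = plt.subplots(figsize=(8, 4))
for column in ["Food", "Transport"]:
    # One line per category, with markers on each monthly observation
    ax.plot(trend["date"], trend[column], marker="o", label=column)

ax.set_xlabel("Date")
ax.set_ylabel("CPI index")
ax.set_title("CPI by division (toy data)")
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders under the cell; in a script, `fig.savefig(...)` would be needed to write it to a file.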




## 5️⃣ Data Insights

- Summarize your main takeaways. What patterns or trends did you find?