## Section 1. Introduction ##

In this notebook, we process the Labor Force Survey (LFS) conducted in April 2016 and retrieved from the Philippine Statistics Authority database.



In [35]:
import random
import numpy as np
import pickle
import os
import h5py
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import ScalarFormatter, FuncFormatter 
%matplotlib inline
 
plt.rcParams['figure.figsize'] = (6.0, 6.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

plt.style.use('ggplot')

# autoreload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

<h1>Importing LFS PUF April 2016.CSV</h1>

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

try:
    lfs_data = pd.read_csv("LFS PUF April 2016.CSV")
except FileNotFoundError:
    print("Error: CSV file not found. Make sure the file exists in the working directory or provide the correct path.")
    # Re-raise so the cell stops here with the original traceback instead of calling exit()
    raise


<h1>Data Information, Pre-Processing, and Cleaning</h1>

Let's get an overview of our dataset.

In [13]:
lfs_data.info()

---
Of the columns in the dataset:
<ul><li>1 contains float values, </li>
<li>14 contain integer values, and </li>
<li><b>35 contain object values</b>.</li></ul>
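
As a quick cross-check of the breakdown above, the dtype tally can be computed directly instead of being read off the `info()` output. The sketch below uses a small hypothetical frame (`toy`, with made-up column names) in place of `lfs_data`; the same two calls should apply to the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for lfs_data (column names are made up)
toy = pd.DataFrame({
    "weight": [1.5, 2.0, 3.25],
    "age": [23, 41, 37],
    "grade": pd.Series(["010", "350", " "], dtype=object),
})

# Tally how many columns fall under each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# List the object columns, which are the ones that may hide whitespace placeholders
object_cols = toy.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Object columns deserve the closest look, since they usually hold strings or a mix of types.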

Let's check for duplicates:

In [14]:
lfs_data.duplicated().sum()

No duplicates here, so no cleaning is needed in this regard.

The dataset seems to encode missing values as whitespace-only strings. Let's count those:

In [15]:
has_null = lfs_data.apply(lambda col: col.str.isspace().sum() if col.dtype == 'object' else 0)

print("Number of empty cells per column:")
print(has_null[has_null > 0])

---
Next, we standardize these by replacing the whitespace-only values with NaN:

In [16]:
lfs_data.replace(r"^\s+$", np.nan, regex=True, inplace=True)
nan_counts_per_column = lfs_data.isna().sum()
print(nan_counts_per_column[nan_counts_per_column > 0])
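
To see what the count-then-replace steps do on a small scale, here is a minimal sketch on a hypothetical frame (`toy`, with made-up column names). It assumes the placeholders are whitespace-only strings in object columns, as in the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "code" holds two whitespace-only placeholders
toy = pd.DataFrame({
    "code": pd.Series(["010", " ", "350", "   "], dtype=object),
    "hours": [40, 8, 48, 0],
})

# Count whitespace-only cells per object column before replacing them
blank_counts = {
    col: int(toy[col].str.isspace().sum())
    for col in toy.select_dtypes(include="object").columns
}
print(blank_counts)

# Replace whitespace-only strings with NaN; numeric columns are left untouched
cleaned = toy.replace(r"^\s+$", np.nan, regex=True)
nan_counts = cleaned.isna().sum()
print(nan_counts)
```

Note that the pattern `^\s+$` requires at least one whitespace character, so a truly empty string would not match; `^\s*$` would cover both cases.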

---
Let's also count the unique values in each column with nunique().

In [17]:
lfs_data.apply(lambda x: x.nunique())
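
One way to read these counts is to flag columns with only a few distinct values as likely coded (categorical) questions. The sketch below illustrates this on a hypothetical frame; the cutoff `MAX_LEVELS` and the column names are assumptions for illustration, not values taken from the survey.

```python
import pandas as pd

# Hypothetical toy frame with two coded columns and one continuous column
toy = pd.DataFrame({
    "sex": [1, 2, 2, 1, 1, 2],
    "region": [1, 1, 2, 3, 3, 3],
    "basic_pay": [250, 310, 275, 400, 180, 520],
})

# Number of distinct values per column
unique_counts = toy.nunique()

# Hypothetical cutoff: columns with at most this many distinct values
# are treated as likely categorical
MAX_LEVELS = 3
likely_categorical = unique_counts[unique_counts <= MAX_LEVELS].index.tolist()
print(likely_categorical)
```

On the real data, a sensible cutoff depends on the questionnaire, so the flagged columns should still be checked against it.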

---
Considering our dataset has 18,000 entries, features with particularly low unique counts stand out as questions with a fixed set of choices. Reviewing the [questionnaire](https://psada.psa.gov.ph/catalog/67/download/537), we find that certain questions also ask the participant to specify an answer beyond the predefined choices.

The `PUFC07_GRADE` column may contain codes such as "010", which cannot be stored as an integer without losing the leading zero. We ensure this column is a string, and check for values not specified in the questionnaire.

In [18]:
lfs_data['PUFC07_GRADE'] = lfs_data['PUFC07_GRADE'].astype(str)
valid_codes = [
    "000", "010",                                      # No Grade, Preschool
    "210", "220", "230", "240", "250", "260", "280",  # Elementary
    "310", "320", "330", "340", "350",                # High School
    "410", "420",                                     # Post Secondary; If Graduate Specify
    "810", "820", "830", "840",                       # College; If Graduate Specify
    "900",                                            # Post Baccalaureate
    "nan"
]
invalid_rows = lfs_data[~(lfs_data['PUFC07_GRADE'].isin(valid_codes))]

unique_invalid_values = invalid_rows['PUFC07_GRADE'].unique()
print(unique_invalid_values)
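
The string type matters because of how codes with leading zeros are parsed. If a column like this were inferred as integer at load time, "010" would become 10 before `astype(str)` ever ran. The sketch below shows the difference on an in-memory CSV; the data and the `age` column are made up for illustration, and passing `dtype` to `read_csv` is one possible safeguard rather than something the notebook does.

```python
import io
import pandas as pd

# Hypothetical CSV content; only the PUFC07_GRADE name comes from the notebook
csv_text = "PUFC07_GRADE,age\n010,5\n350,17\n000,2\n"

# Default type inference parses the codes as integers and drops leading zeros
as_inferred = pd.read_csv(io.StringIO(csv_text))
print(as_inferred["PUFC07_GRADE"].tolist())

# Forcing the column to str keeps the codes exactly as written
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"PUFC07_GRADE": str})
print(as_string["PUFC07_GRADE"].tolist())
```

In this notebook the column appears to have been read as object already, because of the whitespace placeholders, so the codes should be intact. The explicit `dtype` only guards against the integer-inference case.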

---
Values of the form 5XX and 6XX are not detailed in the questionnaire. Since it instructs the participant to specify whether they graduated from post secondary or college, we'll recode these values to a single new code, "700".

In [19]:
lfs_data.loc[~lfs_data['PUFC07_GRADE'].isin(valid_codes), 'PUFC07_GRADE'] = '700'
print(lfs_data['PUFC07_GRADE'].unique())

In [21]:
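# Restrict the correlation to numeric columns; object columns cannot be correlated directly
corr_matrix = lfs_data.corr(numeric_only=True)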

plt.figure(figsize=(8, 6))  
sns.heatmap(
    corr_matrix,
    annot=False,  
    fmt=".2f",   
    annot_kws={"size": 8}, 
    cmap='coolwarm',
    cbar_kws={'shrink': 0.8} 
)
plt.title("Correlation Matrix", fontsize=14)
plt.show()

In [23]:

categorical_cols = ['PUFREG', 'PUFPRRCD', 'PUFC03_REL', 'PUFC23_PCLASS']
for col in categorical_cols:
    print(f"Frequency distribution for {col}:")
    print(lfs_data[col].value_counts())


In [52]:

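# Replace infinities with NaN so that dropna() removes them before plotting
lfs_data.replace([np.inf, -np.inf], np.nan, inplace=True)

# numerical_cols is assumed to be a list of numeric column names defined in an earlier cell
fig, axes = plt.subplots(len(numerical_cols), 1, figsize=(10, 20))

for i, col in enumerate(numerical_cols):
    # Drop missing values for this column
    data = lfs_data[col].dropna()

    if col == 'PUFC25_PBASIC' and data.min() > 0:
        # Plot on a log-scaled x-axis (requires strictly positive values)
        sns.histplot(data, bins=30, kde=True, ax=axes[i], log_scale=True)

        # Show plain numbers instead of powers of ten on the log axis
        axes[i].xaxis.set_major_formatter(ScalarFormatter())

        # Move the primary x-axis ticks and label to the top
        axes[i].xaxis.tick_top()
        axes[i].xaxis.set_label_position('top')

        # Add a secondary x-axis at the bottom with peso-labelled ticks
        secax = axes[i].secondary_xaxis('bottom')
        tick_locations = [50, 125, 300, 700, 2000, 5000]
        secax.set_xticks(tick_locations)
        secax.set_xticklabels([f"PhP{x:,}" for x in tick_locations])

        # Label both axes
        secax.set_xlabel("Income (Philippine Pesos)", labelpad=15)
        axes[i].set_xlabel("Income (log scale)", labelpad=15)

    elif col == 'PUFC19_PHOURS':
        # Use 5-unit-wide bins from 0 up to the column maximum
        max_val = float(data.max())
        sns.histplot(data, bins=np.arange(0, max_val+5, 5), kde=True, ax=axes[i])

    else:
        # All other columns: histogram with 20 bins
        sns.histplot(data, bins=20, kde=True, ax=axes[i])

    axes[i].set_title(f"Distribution of {col}")
    if col != 'PUFC25_PBASIC':
        axes[i].set_xlabel(col)
    axes[i].set_ylabel("Frequency")

    # Annotate each subplot with summary statistics
    textstr = f'Mean: {data.mean():.2f}\nMedian: {data.median():.2f}\nStd: {data.std():.2f}'
    props = dict(boxstyle='round', facecolor='wheat', alpha=0.5)
    axes[i].text(0.05, 0.95, textstr, transform=axes[i].transAxes,
                 verticalalignment='top', bbox=props)

# Leave vertical room between subplots and below the last secondary axis
plt.tight_layout(h_pad=4.0)
plt.subplots_adjust(bottom=0.15)
plt.show()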
