## Section 1. Introduction ##

In this notebook, the dataset to be processed is the Labor Force Survey conducted in April 2016 and retrieved from the Philippine Statistics Authority database.



In [None]:
import random
import numpy as np
import pickle
import os
import h5py
import matplotlib.pyplot as plt
%matplotlib inline

plt.rcParams['figure.figsize'] = (6.0, 6.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

plt.style.use('ggplot')

# autoreload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

<h1>Importing LFS PUF April 2016.CSV</h1>

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

try:
    lfs_data = pd.read_csv("LFS PUF April 2016.CSV")
except FileNotFoundError:
    print("Error: CSV file not found. Please make sure the file exists in the correct directory or provide the correct path.")
    # Re-raise so the cell stops here; calling exit() would shut down the notebook kernel.
    raise


<h1>Data Information</h1>

Let's get an overview of our dataset.

In [None]:
lfs_data.info()

---
Of interest to us are the column data types:
<ul><li>1 column contains float values, </li>
<li>14 columns contain integer values, and </li>
<li><b>35 columns contain object values</b>.</li></ul>
<br>
We can infer that the <b>35 object columns must be converted to a usable format</b> before any further processing.
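
One possible way to do that formatting is sketched below on a small made-up frame rather than the real file. It converts every non-numeric column with `pd.to_numeric`, turning anything unparseable (such as whitespace-only cells) into NaN. The toy values and the names `toy` and `converted` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for a few LFS columns that were read in as strings
toy = pd.DataFrame({
    "PUFC05_AGE": [25, 40, 31],
    "PUFC06_MSTAT": ["1", " ", "2"],
    "PUFC08_CURSCH": ["2", "1", "  "],
})

converted = toy.copy()
# Pick every column that is not already numeric
non_numeric_cols = converted.select_dtypes(exclude="number").columns
for col in non_numeric_cols:
    # errors="coerce" turns unparseable cells (e.g. whitespace) into NaN
    converted[col] = pd.to_numeric(converted[col], errors="coerce")

print(converted.dtypes)
print(converted.isna().sum())
```

Whether coercion is appropriate for every object column depends on the codebook; columns holding genuine text labels would need a different treatment.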

Let's also apply the nunique() function to count the distinct values in each column of our dataset.

In [None]:
lfs_data.nunique()

---
Considering our dataset has 18,000 entries, features with a particularly low number of unique values stand out as questions with a fixed, clearly defined set of choices.
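
As a rough sketch of how such low-cardinality features could be listed automatically, the snippet below filters the `nunique()` result by a cut-off. The toy frame and the threshold `MAX_CATEGORIES` are assumptions chosen for illustration, not values taken from the survey.

```python
import pandas as pd

# Hypothetical stand-in for a few LFS columns
toy = pd.DataFrame({
    "PUFC04_SEX": [1, 2, 2, 1, 1, 2],
    "PUFC05_AGE": [25, 40, 31, 18, 62, 47],
    "PUFURB2K10": [1, 1, 2, 2, 1, 2],
})

MAX_CATEGORIES = 3  # assumed cut-off for "looks categorical"

unique_counts = toy.nunique()
# Keep only the columns whose number of distinct values is at or below the cut-off
likely_categorical = unique_counts[unique_counts <= MAX_CATEGORIES].index.tolist()
print(likely_categorical)
```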

Let's check for duplicates:

In [None]:
lfs_data.duplicated().sum()

No duplicates here, and therefore no cleaning need follow in this regard.

The dataset seems to encode missing values as whitespace-only strings. Let's count those:

In [None]:
has_null = lfs_data.apply(lambda col: col.str.isspace().sum() if col.dtype == 'object' else 0)

print("Number of empty cells:")
print(has_null[has_null > 0])

---
And standardize, replacing these whitespace values with NaN:

In [None]:
lfs_data.replace(r"^\s+$", np.nan, regex=True, inplace=True)
nan_counts_per_column = lfs_data.isna().sum()
print(nan_counts_per_column[nan_counts_per_column > 0])

<h1>Data Preprocessing and Data Cleaning</h1>

<h2>PUFC06_MSTAT</h2>
Predictors:
<ul><li>PUFC05_AGE</li>
<li>PUFC04_SEX </li>
<li>PUFC03_REL </li></ul>

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

columns = ["PUFC06_MSTAT", "PUFC05_AGE", "PUFC04_SEX", "PUFC03_REL"]
# Coerce to numeric so the imputer can work on the values (unparseable cells become NaN)
lfs_data_PUFC06_MSTAT = lfs_data[columns].apply(pd.to_numeric, errors="coerce")
# Assign the result: get_dummies returns a new DataFrame instead of modifying in place
lfs_data_PUFC06_MSTAT = pd.get_dummies(lfs_data_PUFC06_MSTAT, columns=["PUFC04_SEX", "PUFC03_REL"], dtype=float)

imputer = IterativeImputer(random_state=42)



In [None]:
lfs_data_PUFC06_MSTAT = pd.DataFrame(imputer.fit_transform(lfs_data_PUFC06_MSTAT), columns=lfs_data_PUFC06_MSTAT.columns)

In [None]:
lfs_data_PUFC06_MSTAT["PUFC06_MSTAT"].round().astype(int)

<h2>PUFC08_CURSCH</h2>
Is the person currently attending school?

TODO:<br>
The column currently holds only the values 1 and 2,<br>
where 1 or 2 represent "elementary education" and "secondary and tertiary education" accomplishment.<br>
<br>
We will therefore replace all null values with 0<br>
to represent NOT having finished elementary, secondary, or tertiary education.<br>
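
The replacement described in this TODO could look roughly like the sketch below. It runs on a made-up column rather than the real data, and it assumes the whitespace cells have already been turned into NaN as in the earlier cleaning step.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for PUFC08_CURSCH after whitespace was replaced with NaN
toy = pd.DataFrame({"PUFC08_CURSCH": [1.0, np.nan, 2.0, np.nan]})

# Fill the missing entries with 0 and store the codes as integers
toy["PUFC08_CURSCH"] = toy["PUFC08_CURSCH"].fillna(0).astype(int)

print(toy["PUFC08_CURSCH"].value_counts().sort_index())
```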

PUFC31_FLWRK

PUFC32_JOBSM

PUFC33_WEEKS

PUFC35_LTLOOKW

PUFC36_AVAIL

PUFC37_WILLING

<h2>One Hot Encoding</h2>

In [None]:
# TODO: not yet decided which columns to one-hot encode
# lfs_data = pd.get_dummies(lfs_data, drop_first=True)  # One-hot encoding

<h2>Feature Selection</h2>

In [None]:
# TODO: still not decided what to do here

Binary 

TODO:<br>
Convert the columns that can be classified in a binary manner to 1s and 0s,<br>
e.g. employment status: instead of "employed" and "unemployed",<br>
use 1 and 0.<br>
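
A minimal sketch of that conversion, assuming a hypothetical text column named `employment_status` (the real LFS columns use numeric codes, so the mapping keys would need to be adapted):

```python
import pandas as pd

# Hypothetical column with text labels; not an actual LFS variable name
toy = pd.DataFrame({"employment_status": ["employed", "unemployed", "employed", None]})

# Explicit mapping keeps the encoding visible; unmapped or missing values become NaN
binary_map = {"employed": 1, "unemployed": 0}
toy["employed"] = toy["employment_status"].map(binary_map).astype("Int64")

print(toy)
```

The nullable `Int64` dtype is used here so that missing answers stay missing instead of silently becoming 0.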

<h1>kNN</h1>

Employability or Job Prediction (kNN) (??)

Rural vs Urban Workforce Disparities (kNN)
<ul><li>group up regions that have similar labor market characteristics, challenges, and/or opportunities</li></ul>

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# 1. Import the CSV file using pandas
try:
    lfs_data = pd.read_csv("LFS PUF April 2016.CSV")  # Replace with the actual file name/path
except FileNotFoundError:
    print("Error: CSV file not found. Please make sure the file exists in the correct directory or provide the correct path.")
    # Re-raise so the cell stops here; calling exit() would shut down the notebook kernel.
    raise

# 2. Data Cleaning and Preprocessing (Crucial for kNN)

# --- A. Handle Missing Values ---
# kNN is sensitive to missing values.  You'll need to decide how to handle them.  Common options:

# Option 1: Drop rows with any missing values (if the number of missing values is small)
lfs_data.dropna(inplace=True)  # This modifies the DataFrame in place

# Option 2: Impute missing values (more common and often better)
#   - Numerical features: Use mean, median, or more advanced imputation (e.g., KNN imputation)
#     Example using median:
#     for col in lfs_data.select_dtypes(include=['number']).columns:
#         lfs_data[col] = lfs_data[col].fillna(lfs_data[col].median())

#   - Categorical features: Use mode (most frequent value)
#     for col in lfs_data.select_dtypes(exclude=['number']).columns:
#         lfs_data[col] = lfs_data[col].fillna(lfs_data[col].mode()[0]) # mode() returns a series, take the first element


# --- B. Feature Selection/Engineering ---
# Identify your target variable (the one you want to predict) and features (the ones you'll use for prediction).
# Example (replace 'TARGET_COLUMN' and 'FEATURE_COLUMNS' with your actual column names):
TARGET_COLUMN = 'EmploymentStatus' # Example - replace with your actual target variable
FEATURE_COLUMNS = ['Age', 'EducationLevel', 'Occupation'] # Example - replace with your feature columns

y = lfs_data[TARGET_COLUMN]  # Target variable
X = lfs_data[FEATURE_COLUMNS] # Features

# --- C. Encode Categorical Features ---
# kNN works with numerical data. Convert categorical features to numerical using one-hot encoding or label encoding.

X = pd.get_dummies(X) # One-hot encoding (often preferred for kNN)
# OR
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# for col in X.select_dtypes(exclude=['number']).columns:
#     X[col] = le.fit_transform(X[col])

# 3. Split Data into Training and Testing Sets
# Split before scaling so that no statistics from the test set leak into the scaler.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

# --- D. Feature Scaling (Very Important for kNN) ---
# kNN is distance-based, so features with larger values can dominate. Scale your features.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on the training set only, then transform it
X_test = scaler.transform(X_test)        # Reuse the training-set statistics for the test set

# 4. Train the kNN Classifier
k = 5  # Choose an appropriate value for k (number of neighbors) - often needs tuning
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# 5. Make Predictions
y_pred = knn.predict(X_test)

# 6. Evaluate the Model
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))


# --- Important Notes ---

# * **File Path:** Double-check the path to your CSV file.  If it's not in the same directory as your notebook, provide the full path.
# * **Data Exploration:** Before preprocessing, explore your data: `lfs_data.head()`, `lfs_data.info()`, `lfs_data.describe()`. This will help you understand the data types, missing values, and potential issues.
# * **Feature Engineering:**  The choice of features and how you engineer them is *crucial* for model performance.
# * **k Value Tuning:** Experiment with different values of `k` to find the optimal one. You can use techniques like cross-validation.
# * **Handling Imbalanced Datasets:** If your target variable has imbalanced classes (e.g., many more examples of one class than another), consider techniques like oversampling or undersampling.
# * **Computational Resources:** The LFS PUF is a large dataset. Be mindful of your computer's memory.  You might need to process the data in chunks if you run into memory issues.
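
The note on tuning `k` with cross-validation can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, not on the LFS file; the candidate values of `k` and the `_demo` variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Putting the scaler inside the pipeline refits it on each training fold,
# so the validation folds never influence the scaling statistics.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Assumed candidate values for k; 5-fold cross-validation picks the best one
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [3, 5, 7, 9, 11]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_accuracy = search.score(X_te, y_te)
print("Best k:", best_k)
print("Held-out accuracy:", test_accuracy)
```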

<h1>Logistic Regression</h1>

Employability (Binary Logistic Regression)

In [None]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(
    loss='log_loss',
    eta0=0.001,
    max_iter=200,
    learning_rate='constant',
    random_state=1,
    verbose=1
)

<h1>try</h1>

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the data
try:
    lfs_data = pd.read_csv("LFS PUF April 2016.CSV")
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Error: CSV file not found. Please ensure the file is in the correct directory or provide the correct path.")
    # Re-raise so the cell stops here; calling exit() would shut down the notebook kernel.
    raise

In [None]:
# 2. Data Cleaning and Preprocessing

# a. Handle Missing Values (Important!)
# Explore missing data:
print(lfs_data.isnull().sum()) # Check for missing values in each column
# Strategies for handling missing data (choose one or combine as appropriate):
# 1. Drop rows with many missing values (if applicable).
# 2. Impute missing numerical values (e.g., mean, median).
# 3. Impute missing categorical values (e.g., mode, or create a "missing" category).
# Example imputation (replace with more suitable strategy as needed):
# lfs_data['PUFC25_PBASIC'] = lfs_data['PUFC25_PBASIC'].fillna(lfs_data['PUFC25_PBASIC'].median()) # Impute with median for basic pay
# lfs_data['PUFC07_GRADE'] = lfs_data['PUFC07_GRADE'].fillna("Unknown") # Impute with "Unknown" for grade

In [None]:
# PUFSVYMO (survey month) and PUFSVYYR (survey year) can probably be dropped

renamed_columns = {
    'PUFREG': 'Region',
    'PUFPRV' : 'Province code',
    'PUFPRRCD' : 'Province recode',
    'PUFHHNUM' : 'Household unique sequential number',
    'PUFURB2K10' : 'Urban / Rural',
    'PUFPWGTFIN' : 'Final weight',
    'PUFSVYMO' : 'Survey month',
    'PUFSVYYR' : 'Survey year',
    'PUFPSU' : 'PSU number',
    'PUFRPL' : 'Replicate',
    'PUFHHSIZE' : 'Number of household members',
    'PUFC01_LNO' : 'Line number used to identify each member of the household in the survey',
    'PUFC03_REL' : 'Relationship of the person to the household head',
    'PUFC04_SEX' : 'Sex of the person',
    'PUFC05_AGE' : 'Age of the person as of last birthday',
    'PUFC06_MSTAT' : 'Marital status of the person',
    'PUFC07_GRADE' : 'Highest grade completed of the person',
    'PUFC08_CURSCH' : 'Is the person currently attending school?',
    'PUFC09_GRADTECH' : 'Is the person a graduate of a technical / vocational course?',
    'PUFC10_CONWR' : 'Category of OFW',
    'PUFC11_WORK' : 'Did the person do any work for at least one hour during the past week?',
    'PUFC12_JOB' : 'Although the person did not work last week, did the person have a job or business during the past week?',
    'PUFC14_PROCC' : 'What is the primary occupation of the person during the past week?',
    'PUFC16_PKB' : 'Kind of business or industry of the person',
    'PUFC17_NATEM' : 'Nature of employment of the person.',
        # This refers to the permanence or regularity or seasonality with which a particular work or job/business is being pursued.
    'PUFC18_PNWHRS' : 'Normal working hours per day',
        # Normal working hours per day are the usual or prescribed working hours of a person in his primary job/business, which are considered a full day's work.
    'PUFC19_PHOURS' : 'Total number of hours worked during the past week',
        # The actual number of hours worked by a person in his primary job that he held during the past week, or in his other job(s)/business if there are any.
        # It includes the duration or the period the person was occupied in his work, including overtime, but excluding hours paid but not worked.
        # For wage and salary earners, it includes time worked without compensation in connection with their occupations,
        # such as the time a teacher spends at home preparing for the forthcoming lectures.
        # For own account workers, it includes the time spent in the shop, business or office, even if no sale or transaction has taken place.
    'PUFC20_PWMORE' : 'Did the person want more hours of work during the past week?',
    'PUFC21_PLADDW' : 'Did the person look for additional work during the past week?',
    'PUFC22_PFWRK' : "Was this the person's first time to do any work?",
        # This question determines whether a person is a “new entrant” to the labor force.
        # A person is a new entrant if it is his first time to do any work.
        # A person is considered to have worked only for the first time if he started working only during the current survey period.
        # Current survey period refers to April 1 - 30 for this survey round
    'PUFC23_PCLASS' : 'Class of worker for primary occupation',
        # Class of worker is the relationship of the worker to the establishment where he works.
    'PUFC24_PBASIS' : 'Basis of payment for primary occupation',
    'PUFC25_PBASIC' : 'Basic pay per day for primary occupation',
        # Basic pay is the pay for normal time, prior to deductions of social security contributions, withholding taxes, etc.
        # It excludes allowances, bonuses, commissions, overtime pay, benefits in kind, etc.
        # This is also called basic wage.
    'PUFC26_OJOB' : 'Did the person have other job or business during the past week?',
    'PUFC27_NJOBS' : 'Number of jobs the person had during the past week',
    'PUFC28_THOURS' : 'Total number of hours worked by the person for all his jobs during the past week',
    'PUFC29_WWM48H' : 'Main reason for not working more than 48 hours in the past week',
    'PUFC30_LOOKW' : 'Did the person look for work or try to establish a business in the past week?',
    'PUFC31_FLWRK' : "Was it the person's first time looking for work or trying to establish a business?",
    'PUFC32_JOBSM' : 'Job search method',
        # What has the person been doing to find work?
    'PUFC33_WEEKS' : 'Number of weeks spent in looking for work',
        # How many weeks has the person been looking for work?
    'PUFC34_WYNOT' : 'Reason for not looking for work',
        # Why did the person not look for work?
    'PUFC35_LTLOOKW' : 'When was the last time the person looked for work?',
    'PUFC36_AVAIL' : 'Had opportunity for work existed last week or within two weeks, would the person have been available?',
    'PUFC37_WILLING' : 'Is the person willing to take up work in the past week or within 2 weeks?',
    'PUFC38_PREVJOB' : 'Has the person worked at any time before?',
    'PUFC40_POCC' : 'What was the person’s last occupation?',
    'PUFC41_WQTR' : 'Did the person work at all or had a job or business during the past quarter?',
    'PUFC43_QKB' : 'Kind of business for the past quarter',
    'PUFNEWEMPSTAT' : 'New Employment Criteria'
}
pd.set_option('display.max_columns', None)

# Rename only for display: the later cells still select columns by their original PUF codes,
# so lfs_data itself is left unchanged here.
lfs_data.rename(columns=renamed_columns).head(50)

In [None]:
# b. Feature Selection (Crucial for a good model)
# Identify features relevant to employability.  Consider these factors:
# * Demographic features (age, sex, marital status, education)
# * Work experience (previous jobs, hours worked)
# * Job search activity (looking for work, methods used)
# * Availability and willingness to work
# * Location (region, urban/rural)

# Example: Select some potentially relevant features (you'll likely want to refine this):
selected_features = ['PUFC05_AGE', 'PUFC04_SEX', 'PUFC07_GRADE', 'PUFC11_WORK', 'PUFC14_PROCC', 'PUFC17_NATEM', 'PUFC23_PCLASS', 'PUFC30_LOOKW', 'PUFC36_AVAIL', 'PUFC37_WILLING', 'PUFNEWEMPSTAT']  # Add more!
lfs_data = lfs_data[selected_features]

# c. Encode Categorical Variables
# Logistic regression works with numerical data. Convert categorical features:
lfs_data = pd.get_dummies(lfs_data, columns=['PUFC04_SEX', 'PUFC07_GRADE', 'PUFC11_WORK', 'PUFC14_PROCC', 'PUFC17_NATEM', 'PUFC23_PCLASS', 'PUFC30_LOOKW', 'PUFC36_AVAIL', 'PUFC37_WILLING']) # One-hot encoding

# d. Define Target Variable (Employability)
# You'll need to define what "employable" means in your context.
# Example: If PUFNEWEMPSTAT indicates employment status, you might use it directly.
# Or, you might create a new target variable based on a combination of factors.
# Example (using PUFNEWEMPSTAT directly as a binary indicator - Adapt as needed):
# Coerce to numeric first in case the codes were read in as strings.
empstat = pd.to_numeric(lfs_data['PUFNEWEMPSTAT'], errors='coerce')
lfs_data['employable'] = empstat.isin([1, 2, 3]).astype(int) # Example: 1 if employed, 0 if not.  Check which codes mean "employed" in the data dictionary and adjust.
lfs_data.drop('PUFNEWEMPSTAT', axis=1, inplace=True) # Remove the original employment status column if you created a new 'employable' column


In [None]:

# 3. Model Training
X = lfs_data.drop('employable', axis=1)  # Features
y = lfs_data['employable']  # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # Split data

model = LogisticRegression(max_iter=1000) # Increase max_iter if needed.
model.fit(X_train, y_train)



In [None]:
# 4. Model Evaluation
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d")
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


In [None]:
# 5. Feature Importance (Optional but helpful)
# Logistic regression coefficients give some insight into feature importance.
# Rank by absolute value so that strong negative effects are not pushed to the bottom;
# magnitudes are only comparable across features that share a scale.
coefficients = model.coef_[0]
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': coefficients})
feature_importance = feature_importance.sort_values(by='Importance', key=abs, ascending=False)
print("\nFeature Importance:")
print(feature_importance)


# Suggestions for an Outstanding Model:

# * Thorough Data Cleaning: Handle missing values strategically.  Don't just drop them blindly. Imputation is often better.
# * Feature Engineering: Create new features from existing ones.  For example, combine education and work experience.
# * Feature Selection: Carefully choose the most relevant features. Use domain knowledge, statistical tests, or feature selection techniques (e.g., recursive feature elimination).
# * Model Selection: Don't be limited to logistic regression. Explore other models like Random Forest, Gradient Boosting, or Support Vector Machines.
# * Hyperparameter Tuning: Optimize the model's parameters using techniques like GridSearchCV or RandomizedSearchCV.
# * Cross-Validation: Use techniques like k-fold cross-validation to get a more robust estimate of model performance.
# * Address Class Imbalance (if present): If your dataset has a significantly unequal number of "employable" and "not employable" individuals, consider techniques like oversampling or undersampling.
# * Domain Expertise:  The most important thing! Work with people who understand the Philippine labor market. Their insights will be invaluable for feature selection, defining "employability," and interpreting the model's results.
# * Explainability: Consider using techniques to make your model more interpretable. This is important for understanding why the model is making certain predictions.
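
Two of the suggestions above, cross-validation and handling class imbalance, can be combined in a short sketch. It uses synthetic, deliberately imbalanced data instead of the LFS file. Here `class_weight="balanced"` is swapped in as a simpler alternative to the oversampling or undersampling mentioned in the list, and the `_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the employability data (about 85% / 15%)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.85, 0.15], random_state=42
)

# class_weight="balanced" reweights the loss so the minority class is not ignored
clf_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Stratified folds keep the class ratio roughly constant in every split
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=cv_demo, scoring="f1")

print("F1 per fold:", f1_scores)
print("Mean F1:", f1_scores.mean())
```

F1 is reported instead of accuracy because accuracy can look high on imbalanced data even when the minority class is predicted poorly.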