# Labor Force Survey 2024
#### Goal: Find what factors affect a person's employability

___

### CSMODEL Major Course Output *Phase 1*

Members:
* AVELINO, Sophia Kylie
* BALINGIT, Andrei Luis
* WONG, Ching Man
* YOUNG, Cedric Francis

___

### Phase 1: data description, target research question, preprocessing, and exploratory data analysis

> Deliverables:
>
> A Jupyter Notebook containing all the data processing you did in the project. The Notebook should include Markdown
cells explaining each process, and highlighting the insights and conclusions. The Notebook should be structured in a
way that (1) is easy to understand, and (2) can be run sequentially to reproduce all outputs in your work.

## Section 1 - Dataset Description

### Section 1.1 - Introduction ###

This project explores the Labor Force Survey (LFS) dataset, a nationwide quarterly household survey that captures demographic and socio-economic information about the labor force of the Philippines. The primary purpose of the LFS is to estimate employment and unemployment rates in the labor market and to provide a quantitative framework for formulating labor market policies. The data cover a wide range of individual and household characteristics.

The dataset, obtained from the Philippine Statistics Authority (PSA), covers a national sample of about 42,768 households (Batanes included) or 42,576 households (Batanes excluded) per survey round. It contains detailed data for each person in the surveyed households: demographic traits (age, sex, marital status), educational level, occupation, and work status. The reporting unit is the household, so the statistics describe only people living in private households, not those in institutions.
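Because the LFS is a sample survey, rates such as the share of people who worked are normally estimated with survey weights rather than raw counts. The sketch below illustrates a weighted share on made-up numbers; the function name `weighted_share` and the toy values are hypothetical, and in the notebook the weights would come from the weight column of the file.

```python
import numpy as np

def weighted_share(indicator, weights):
    """Weighted proportion of rows where indicator is 1.

    indicator: 0/1 values, one per respondent
    weights:   survey weights, one per respondent
    """
    indicator = np.asarray(indicator, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(indicator * weights) / np.sum(weights))

# Toy data (not taken from the LFS file): 1 = worked, 0 = did not work.
worked = [1, 0, 1, 1]
weights = [100.0, 300.0, 50.0, 50.0]

share = weighted_share(worked, weights)
print(share)  # (100 + 50 + 50) / 500 = 0.4
```

Three of the four toy respondents worked, but the weighted share is only 0.4 because the one respondent who did not work carries the largest weight.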

### Section 1.2 - Problem Statement

The task we address is binary classification. We seek to predict whether an individual worked in the past week (PUFC11_WORK, answered Yes or No) from the features provided in the LFS dataset: demographic variables, educational attainment, occupation, and household characteristics. In other words, the models classify each person into one of two groups, those who worked and those who did not.

The dataset comes with a PDF file titled 'Dataset 2- Labor Force Survey 2016', which describes each feature. Based on those descriptions, we categorized the features as follows:

#### 1. Demographic Variables

    PUFREG → Region

    PUFPRV → Province Code

    PUFPRRCD → Province Recode

    PUFSVYMO → Survey Month

    PUFSVYYR → Survey Year

    PUFC04_SEX → Sex

    PUFC03_REL → Relationship to Household Head

    PUFC05_AGE → Age as of Last Birthday

    PUFC06_MSTAT → Marital Status



#### 2. Household Characteristics

    PUFHHNUM → Household Unique Sequential Number

    PUFURB2K10 → 2010 Urban-Rural FIES
    
    PUFPWGTFIN → Final Weight Based on Projection (Provincial Projections)
    
    PUFPSU → PSU Number
    
    PUFRPL → Replicate
    
    PUFHHSIZE → Household Size
    
    PUFC01_LNO → Line Number (Household Member Identifier)



#### 3. Educational Attainment
    
    PUFC07_GRADE → Highest Grade Completed
    
    PUFC08_CURSCH → Currently Attending School
    
    PUFC09_GRADTECH → Graduate of Technical/Vocational Course


#### 4. Occupation and Work Characteristics
    
    PUFC14_PROCC → Primary Occupation
    
    PUFC16_PKB → Kind of Business (Primary Occupation)
    
    PUFC17_NATEM → Nature of Employment (Primary Occupation)
    
    PUFC18_PNWHRS → Normal Working Hours per Day
    
    PUFC19_PHOURS → Total Number of Hours Worked During the Past Week
    
    PUFC20_PWMORE → Want More Hours of Work
    
    PUFC21_PLADDW → Look for Additional Work
    
    PUFC23_PCLASS → Class of Worker (Primary Occupation)
    
    PUFC24_PBASIS → Basis of Payment (Primary Occupation)
    
    PUFC25_PBASIC → Basic Pay per Day (Primary Occupation)
    
    PUFC26_OJOB → Other Job Indicator
    
    PUFC27_NJOBS → Number of Jobs During the Past Week
    
    PUFC28_THOURS → Total Hours Worked for All Jobs
    
    PUFC29_WWM48H → Reasons for Working More than 48 Hours During the Past Week
    
    PUFC38_PREVJOB → Previous Job Indicator
    
    PUFC40_POCC → Previous Occupation
    
    PUFC41_WQTR → Did Work or Had a Job During the Past Quarter
    
    PUFC43_QKB → Kind of Business (Past Quarter)



#### 5. Unemployment and Underemployment
    
    PUFC30_LOOKW → Looked for Work or Tried to Establish Business During the Past Week
    
    PUFC31_FLWRK → First Time to Look for Work
    
    PUFC32_JOBSM → Job Search Method
    
    PUFC33_WEEKS → Number of Weeks Spent in Looking for Work
    
    PUFC34_WYNOT → Reason for Not Looking for Work
    
    PUFC35_LTLOOKW → When Last Looked for Work
    
    PUFC36_AVAIL → Available for Work
    
    PUFC37_WILLING → Willingness to Take Up Work During the Past Week or Within Two Weeks
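The grouping above can be kept in code so that features are selected by category. The following is a minimal sketch, not the notebook's actual pipeline: the helper `split_features_target`, the toy data, and the assumption that code 1 means "Yes" for PUFC11_WORK are all hypothetical and should be checked against the data dictionary. Only a few columns per group are shown.

```python
import pandas as pd

# A few column names per category, copied from the lists above.
FEATURE_GROUPS = {
    "demographic": ["PUFREG", "PUFC04_SEX", "PUFC05_AGE", "PUFC06_MSTAT"],
    "household": ["PUFHHNUM", "PUFURB2K10", "PUFHHSIZE"],
    "education": ["PUFC07_GRADE", "PUFC08_CURSCH", "PUFC09_GRADTECH"],
}
TARGET = "PUFC11_WORK"

def split_features_target(df, groups=FEATURE_GROUPS, target=TARGET, yes_code=1):
    """Return (X, y): selected feature columns and a 0/1 target.

    Assumption: `yes_code` marks "worked"; every other valid code is "did not work".
    Rows whose target cannot be read as a number are dropped.
    """
    wanted = [col for cols in groups.values() for col in cols]
    present = [col for col in wanted if col in df.columns]  # skip absent columns
    codes = pd.to_numeric(df[target], errors="coerce")
    answered = codes.notna()
    X = df.loc[answered, present]
    y = (codes[answered] == yes_code).astype(int)
    return X, y

# Toy frame with made-up values, only to show the shapes involved.
toy = pd.DataFrame({
    "PUFREG": [1, 13, 7],
    "PUFC04_SEX": [1, 2, 1],
    "PUFC05_AGE": [34, 19, 52],
    "PUFHHSIZE": [4, 6, 2],
    "PUFC11_WORK": [1, 2, 1],
})
X, y = split_features_target(toy)
print(X.columns.tolist())
print(y.tolist())
```

Columns listed in a group but missing from the frame are skipped silently here; a stricter version could raise an error instead.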


### Section 1.3 - Libraries needed

In [1]:
import numpy as np

## Section 2 - Data Cleaning

### Section 2.1 - Research Question & Exploratory Data Analysis

___

### Phase 2: statistical inference, data mining, key insights and conclusions

> Deliverables:
> 
> 1. A Jupyter Notebook containing all the data processing you did in the project. The Notebook should include Markdown
cells explaining each process, and highlighting the insights and conclusions. The Notebook should be structured in a
way that (1) is easy to understand, and (2) can be run sequentially to reproduce all outputs in your work.
> 
> 2. A poster that communicates all key findings and insights of your work. The poster should be intuitive to understand,
and intended for a general audience.

## Section 3 - Data Mining

## Section 4 - Statistical Inference

## Section 5 - Insights and Conclusions

You must ensure that your project covers all of these minimum requirements:
1. identify a general research question that you aim to answer in your data narrative
2. perform exploratory data analysis, covering at least 3 EDA questions, to get a good understanding of the data
3. conduct at least 3 statistical tests to establish three sound conclusions from the data
4. apply at least one of the following techniques: (1) rule mining, (2) clustering, or (3) collaborative filtering to discover
meaningful insights from the data (you may also choose to apply any of the variants of the above approaches)

In [1]:
import pandas as pd

# Load the March 2024 LFS public use file and summarize its columns and dtypes
df = pd.read_csv('src/data/LFS PUF March 2024.CSV')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44063 entries, 0 to 44062
Data columns (total 41 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PUFREG           44063 non-null  int64  
 1   PUFHHNUM         44063 non-null  int64  
 2   PUFPWGTPRV       44063 non-null  float64
 3   PUFSVYMO         44063 non-null  int64  
 4   PUFSVYYR         44063 non-null  int64  
 5   PUFPSU           44063 non-null  int64  
 6   PUFRPL           44063 non-null  int64  
 7   PUFHHSIZE        44063 non-null  int64  
 8   PUFC01_LNO       44063 non-null  int64  
 9   PUFC03_REL       44063 non-null  int64  
 10  PUFC04_SEX       44063 non-null  int64  
 11  PUFC05_AGE       44063 non-null  int64  
 12  PUFC06_MSTAT     44063 non-null  object 
 13  PUFC07_GRADE     44063 non-null  object 
 14  PUFC08_CONWR     44063 non-null  object 
 15  PUFC09_WORK      44063 non-null  object 
 16  PUFC09A_WORK     44063 non-null  object 
 17  PUFC10_JOB  

In [2]:
df.columns.tolist()

['PUFREG',
 'PUFHHNUM',
 'PUFPWGTPRV',
 'PUFSVYMO',
 'PUFSVYYR',
 'PUFPSU',
 'PUFRPL',
 'PUFHHSIZE',
 'PUFC01_LNO',
 'PUFC03_REL',
 'PUFC04_SEX',
 'PUFC05_AGE',
 'PUFC06_MSTAT',
 'PUFC07_GRADE',
 'PUFC08_CONWR',
 'PUFC09_WORK',
 'PUFC09A_WORK',
 'PUFC10_JOB',
 'PUFC11A_PROVMUN',
 'PUFC13_PROCC',
 'PUFC15_PKB',
 'PUFC16_NATEM',
 'PUFC17_PNWHRS',
 'PUFC18_PHOURS',
 'PUFC19_PWMORE',
 'PUFC20_PLADDW',
 'PUFC20B_FTWORK',
 'PUFC21_PCLASS',
 'PUFC22_OJOB',
 'PUFC23_THOURS',
 'PUFC24_WWM48H',
 'PUFC25_LOOKW',
 'PUFC25B_FTWORK',
 'PUFC26_WYNOT',
 'PUFC27_AVAIL',
 'PUFC28_PREVJOB',
 'PUFC29_YEAR',
 'PUFC29_MONTH',
 'PUFC31_POCC',
 'PUFC33_QKB',
 'PUFNEWEMPSTAT']
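
The feature lists in the problem statement follow the 2016 documentation, while the column listing above comes from the March 2024 file, and the names do not fully match. For example, PUFC11_WORK does not appear in the listing, whereas PUFC09_WORK does. A small check like the one below can flag such gaps before modeling. The helper name `missing_columns` is hypothetical, and whether a 2024 column corresponds to a 2016 one has to be confirmed from the survey documentation.

```python
def missing_columns(expected, actual):
    """Return the names in `expected` that are absent from `actual`, in order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]

# A few names from the 2016-based lists and a few from the 2024 listing above.
documented = ["PUFREG", "PUFC05_AGE", "PUFC11_WORK", "PUFURB2K10"]
loaded = ["PUFREG", "PUFHHNUM", "PUFC05_AGE", "PUFC09_WORK", "PUFNEWEMPSTAT"]

absent = missing_columns(documented, loaded)
print(absent)  # ['PUFC11_WORK', 'PUFURB2K10']
```

In the notebook, the second argument could be `df.columns`, so the check runs against the file that was actually loaded.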