# Filipino Family Income and Expenditure

### Members:
- Fernando Magallenes Jr.
- Heinze Kristian Moneda
- Azriel Matthew Ortega
- Caleb James Sonoy
- Darren Tan

## The Dataset

## Importing Libraries

To analyze the dataset, the following Python modules need to be imported.

- numpy is a library that provides multidimensional array objects and a collection of routines for processing those arrays.
- pandas is a Python library designed for data manipulation and data analysis.
- matplotlib is a library for data visualization that lets us easily render various types of graphs.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Importing the Dataset

The first step is to load the dataset using the pandas library. To load the dataset into a pandas `DataFrame` object, we call the `read_csv` function. The only argument the function needs here is the path to the CSV file.

In [2]:
income_df = pd.read_csv("Family Income and Expenditure.csv")

In [3]:
income_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41544 entries, 0 to 41543
Data columns (total 60 columns):
 #   Column                                         Non-Null Count  Dtype 
---  ------                                         --------------  ----- 
 0   Total Household Income                         41544 non-null  int64 
 1   Region                                         41544 non-null  object
 2   Total Food Expenditure                         41544 non-null  int64 
 3   Main Source of Income                          41544 non-null  object
 4   Agricultural Household indicator               41544 non-null  int64 
 5   Bread and Cereals Expenditure                  41544 non-null  int64 
 6   Total Rice Expenditure                         41544 non-null  int64 
 7   Meat Expenditure                               41544 non-null  int64 
 8   Total Fish and  marine products Expenditure    41544 non-null  int64 
 9   Fruit Expenditure                              41544 non-null

## Exploratory Data Analysis

## Data Cleaning
In this step, we check whether the dataset contains errors or issues such as null values, wrong encodings or spellings, duplicates, and inconsistencies. Cleaning the data is important to avoid problems when performing analyses.

### Checking for Nulls

In [None]:
income_df.isna().sum()

After checking for nulls, we can see that two variables contain null values: `Household Head Occupation` and `Household Head Class of Worker`.
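
With 60 columns, the full output of `isna().sum()` is long, so it can help to show only the columns that contain nulls. The following is a minimal sketch of that idea on a made-up stand-in for `income_df`; `toy_df` and its values are assumptions for illustration, not data from the actual file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for income_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Region": ["NCR", "CAR", "ARMM"],
    "Household Head Occupation": ["Farmer", np.nan, np.nan],
    "Household Head Class of Worker": ["Self-employed", np.nan, "Employer"],
})

# Count nulls per column, then keep only the columns with at least one null.
null_counts = toy_df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

Applied to `income_df`, the same filter should list only the two variables named above.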

### Replacing of Nulls
Both variables have a significant number of observations with null values. Instead of removing these observations, we will replace the nulls with a sentinel value of `Other` so that the observations remain identifiable and can still be used for analysis.

### `Household Head Occupation` variable

In [None]:
income_df['Household Head Occupation'].value_counts()

In [None]:
income_df['Household Head Occupation'].unique()

In [None]:
income_df['Household Head Occupation'] = income_df['Household Head Occupation'].replace(np.nan, 'Other')

### `Household Head Class of Worker` variable

In [None]:
income_df['Household Head Class of Worker'].value_counts()

In [None]:
income_df['Household Head Class of Worker'] = income_df['Household Head Class of Worker'].replace(np.nan, 'Other')
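
As an alternative to the two `replace(np.nan, 'Other')` calls, `fillna` could fill both columns in one step. This is only a sketch on invented data; `toy_df` and `sentinel_columns` are hypothetical names, and the result should be equivalent to the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for income_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Household Head Occupation": ["Farmer", np.nan],
    "Household Head Class of Worker": [np.nan, "Employer"],
    "Total Household Income": [100000, 250000],
})

# Fill the nulls in both categorical columns with the sentinel value "Other".
sentinel_columns = ["Household Head Occupation", "Household Head Class of Worker"]
toy_df[sentinel_columns] = toy_df[sentinel_columns].fillna("Other")
print(toy_df)
```

Restricting `fillna` to the two named columns avoids accidentally writing the string `Other` into a numeric column.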

In [None]:
income_df.isna().sum()

After checking for nulls again, we can see that no null values remain in any variable.

### Checking for Wrong Encodings, Duplicates, and Inconsistencies
Now that all the null values are taken care of, we will check how the values are represented in each categorical variable.
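
Before inspecting each variable by hand, a quick automated pass can flag one class of problem: values with leading or trailing whitespace. The sketch below assumes that every non-numeric column is categorical; `toy_df` and `suspect_values` are hypothetical names, and the data is invented.

```python
import pandas as pd

# Hypothetical stand-in for income_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Region": ["NCR", " ARMM", "ARMM"],
    "Household Head Sex": ["Male", "Female", "Male"],
    "Total Household Income": [100000, 250000, 80000],
})

# Treat every non-numeric column as categorical (an assumption of this sketch).
categorical_columns = toy_df.select_dtypes(exclude="number").columns

# For each categorical column, collect distinct values that change when stripped.
suspect_values = {}
for column in categorical_columns:
    values = pd.Series(toy_df[column].unique())
    padded = values[values != values.str.strip()]
    if not padded.empty:
        suspect_values[column] = padded.tolist()

print(suspect_values)
```

This only catches whitespace issues; misspellings and inconsistent labels still need the manual `value_counts` checks that follow.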

### `Region` variable

In [None]:
income_df['Region'].value_counts()

After checking the `Region` variable, we can see three dirty values: ` ARMM` has a leading whitespace, `IX - Zasmboanga Peninsula` is misspelled, and `Caraga` is inconsistent with the other values because it lacks the region number prefix. We will replace these values with the appropriate representations.

In [None]:
income_df['Region'] = income_df['Region'].replace(' ARMM', 'ARMM')
income_df['Region'] = income_df['Region'].replace('IX - Zasmboanga Peninsula', 'IX - Zamboanga Peninsula')
income_df['Region'] = income_df['Region'].replace('Caraga', 'XIII - Caraga')
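
The three corrections can also be collected in a single dictionary and applied with one `replace` call, which keeps all the fixes in one place. This is a sketch on an invented Series; `region_fixes` and `toy_regions` are hypothetical names.

```python
import pandas as pd

# Map each dirty value to its corrected representation.
region_fixes = {
    " ARMM": "ARMM",
    "IX - Zasmboanga Peninsula": "IX - Zamboanga Peninsula",
    "Caraga": "XIII - Caraga",
}

# Hypothetical stand-in for income_df['Region'].
toy_regions = pd.Series(
    [" ARMM", "IX - Zasmboanga Peninsula", "Caraga", "NCR"], name="Region"
)

# Values that are not keys of the dictionary are left unchanged.
cleaned_regions = toy_regions.replace(region_fixes)
print(cleaned_regions.tolist())
```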

In [None]:
income_df['Region'].value_counts()

### `Main Source of Income` variable

In [None]:
income_df['Main Source of Income'].value_counts()

There seem to be no misrepresentations in the `Main Source of Income` variable upon checking.

### `Agricultural Household indicator` variable

In [None]:
income_df['Agricultural Household indicator'].value_counts()

There seem to be no misrepresentations in the `Agricultural Household indicator` variable upon checking.

### `Household Head Sex` variable

In [None]:
income_df['Household Head Sex'].value_counts()

There seem to be no misrepresentations in the `Household Head Sex` variable upon checking.

### `Household Head Marital Status` variable

In [None]:
income_df['Household Head Marital Status'].value_counts()

Upon checking, we can see that `Unknown` is the least frequent value, so we decided to drop its rows from the dataset by replacing it with a null value and then calling the `dropna` function. Because the earlier nulls were already replaced with `Other`, `dropna` removes only these rows.

In [None]:
income_df['Household Head Marital Status'] = income_df['Household Head Marital Status'].replace('Unknown', np.nan) 

In [None]:
income_df = income_df.dropna()
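
An alternative that does not depend on the rest of the dataset being free of nulls is to filter with a boolean mask. The sketch below uses invented data; `toy_df`, `is_known`, and `filtered_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for income_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Household Head Marital Status": ["Married", "Unknown", "Single", "Widowed"],
    "Total Household Income": [100000, 50000, 80000, 60000],
})

# Keep only the rows whose marital status is not "Unknown".
is_known = toy_df["Household Head Marital Status"] != "Unknown"
filtered_df = toy_df[is_known]
print(filtered_df)
```

Unlike `dropna`, the mask touches only the column it names, so a null in an unrelated column would not cause a row to be dropped.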

### `Household Head Highest Grade Completed` variable

In [None]:
income_df['Household Head Highest Grade Completed'].value_counts()

Upon checking, we can see a duplicate value that differs only in letter case: `Engineering and Engineering Trades Programs` and `Engineering and Engineering trades Programs`. We decided to merge the two by replacing the variant with the lowercase 't' with the one with the uppercase 'T'.

In [None]:
income_df['Household Head Highest Grade Completed'] = income_df['Household Head Highest Grade Completed'].replace('Engineering and Engineering trades Programs', 'Engineering and Engineering Trades Programs')
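
Case-only duplicates like this one are easy to miss in a long `value_counts` listing. One way to surface them, sketched here on an invented Series, is to lower-case the distinct values and look for collisions; `toy_grades` and `case_variants` are hypothetical names, and the two shorter labels are made up.

```python
import pandas as pd

# Hypothetical stand-in for income_df['Household Head Highest Grade Completed'].
toy_grades = pd.Series([
    "Engineering and Engineering Trades Programs",
    "Engineering and Engineering trades Programs",
    "High School Graduate",   # made-up label
    "Elementary Graduate",    # made-up label
])

# Work on distinct values so that ordinary repeats are not reported.
distinct_values = pd.Series(toy_grades.unique())
lowered = distinct_values.str.lower()

# keep=False marks every member of a group that collides after lower-casing.
case_variants = distinct_values[lowered.duplicated(keep=False)]
print(case_variants.tolist())
```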

In [None]:
income_df['Household Head Highest Grade Completed'].value_counts()

### `Household Head Job or Business Indicator` variable

In [None]:
income_df['Household Head Job or Business Indicator'].value_counts()

There seem to be no misrepresentations in the `Household Head Job or Business Indicator` variable upon checking.

### `Household Head Occupation` variable

In [None]:
income_df['Household Head Occupation'].unique()

There seem to be no misrepresentations in the `Household Head Occupation` variable upon checking. This is one of the variables that earlier had a significant number of observations with null values.

### `Household Head Class of Worker` variable

In [None]:
income_df['Household Head Class of Worker'].value_counts()

There seem to be no misrepresentations in the `Household Head Class of Worker` variable upon checking. This is the other variable that earlier had a significant number of observations with null values.

### `Type of Household` variable

In [None]:
income_df['Type of Household'].value_counts()

There seem to be no misrepresentations in the `Type of Household` variable upon checking.

### `Type of Building/House` variable

In [None]:
income_df['Type of Building/House'].value_counts()

There seem to be no misrepresentations in the `Type of Building/House` variable upon checking.

### `Type of Roof` variable

In [None]:
income_df['Type of Roof'].value_counts()

There seem to be no misrepresentations in the `Type of Roof` variable upon checking.

### `Type of Walls` variable

In [None]:
income_df['Type of Walls'].value_counts()

We can see that the value `NOt applicable` is cased inconsistently, so we decided to fix it to match how the same value is represented in other variables.
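
The fix could look like the sketch below. The exact target spelling is an assumption: `canonical_label` is a hypothetical name, and its value should be confirmed against the `value_counts` output of the other variables before being applied to `income_df`. The other labels in `toy_df` are invented.

```python
import pandas as pd

# Hypothetical stand-in for income_df; the labels other than
# "NOt applicable" are invented for illustration.
toy_df = pd.DataFrame({
    "Type of Roof": ["Strong material", "Not Applicable"],
    "Type of Walls": ["Strong", "NOt applicable"],
})

# Assumed spelling used by the other variables; verify before use.
canonical_label = "Not Applicable"

toy_df["Type of Walls"] = toy_df["Type of Walls"].replace(
    "NOt applicable", canonical_label
)
print(toy_df["Type of Walls"].value_counts())
```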

### `Tenure Status` variable

In [None]:
income_df['Tenure Status'].value_counts()

There seem to be no misrepresentations in the `Tenure Status` variable upon checking.

### `Toilet Facilities` variable

In [None]:
income_df['Toilet Facilities'].value_counts()

There seem to be no misrepresentations in the `Toilet Facilities` variable upon checking.

### `Main Source of Water Supply` variable

In [None]:
income_df['Main Source of Water Supply'].value_counts()

There seem to be no misrepresentations in the `Main Source of Water Supply` variable upon checking.

## Feature Extraction

## Data Visualization and Analysis

## Summary of Findings

## Recommendations

## References