# Phase 1

The first phase of the case study has four sections: (1) dataset description, (2) data cleaning, (3) exploratory data analysis, and (4) research question.

## Dataset Description

Each group should select one real-world dataset from the list of datasets provided for the project. Each dataset has a description file, which also contains a detailed description of each variable.

In this section of the notebook, you must fulfill the following:

- State a brief description of the dataset.
- Provide a description of the collection process executed to build the dataset.
- Discuss the implications of the data collection method for the conclusions and insights that can be drawn from the data.
- Note that you may need to consult relevant sources about the dataset to find the information needed for this part of the project.
- Describe the structure of the dataset file.
  - What does each row and column represent?
  - How many observations are there in the dataset?
  - How many variables are there in the dataset?
  - If the dataset is composed of different files that you will combine in the succeeding steps, describe the structure and the contents of each file.
- Discuss the variables in each dataset file. What does each variable represent? Describe all variables, including those not used in the study. The purpose of each variable should be clear to the reader of the notebook without having to follow an external link.
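
The structural questions above (number of observations, number of variables, and what each column holds) can be answered directly from a loaded DataFrame. The following is a minimal sketch, assuming the data fits in memory as a pandas DataFrame. The `sample` frame is a toy stand-in with invented values; only the column names are taken from the project file.

```python
import pandas as pd

# Toy stand-in for the real file; the values are invented for illustration.
sample = pd.DataFrame({
    "RQ3_SEX": [1, 2, 2, 1],
    "RQ4_AGE": [34, 29, 41, 52],
    "RQ7_MSTAT": [2.0, 1.0, None, 2.0],
})

# Rows are observations and columns are variables.
n_observations, n_variables = sample.shape

# One row per variable: dtype, non-null count, and number of distinct values.
structure = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "non_null": sample.notna().sum(),
    "n_unique": sample.nunique(),
})

print(f"{n_observations} observations, {n_variables} variables")
print(structure)
```

A table like `structure` only supports the written description. The meaning of each variable still has to come from the dataset's description file.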

## Data Cleaning

For each variable used in the study, check for all of the following issues and clean the data where needed:

- There are multiple representations of the same categorical value.
- The datatype of the variable is incorrect.
- Some values are set to default values of the variable.
- There are missing data.
- There are duplicate data.
- The formatting of the values is inconsistent.

**Note**: You do not need to clean every variable. Clean only the variables used in the study.
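
The checks listed in this section can be sketched with pandas as follows. This is an illustration under stated assumptions, not a prescribed procedure: the `raw` frame, its column names, and the placeholder code `999` are all hypothetical, and the right handling for each issue depends on the variable.

```python
import pandas as pd

# Invented toy data; the column names and values are hypothetical.
raw = pd.DataFrame({
    "city": ["Manila", "manila ", "Cebu", "Cebu", None],
    "age": ["34", "34", "999", "41", "29"],
})

# Missing data: count nulls per column before changing anything.
missing_counts = raw.isna().sum()

cleaned = raw.copy()

# Multiple representations / inconsistent formatting: normalize the text.
cleaned["city"] = cleaned["city"].str.strip().str.title()

# Incorrect datatype: convert numeric strings to numbers.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")

# Default values: treat an assumed placeholder code (999) as missing.
cleaned["age"] = cleaned["age"].mask(cleaned["age"] == 999)

# Duplicate data: count exact duplicate rows, then drop them.
n_duplicates = int(cleaned.duplicated().sum())
deduplicated = cleaned.drop_duplicates()

print(missing_counts)
print(f"{n_duplicates} duplicate row(s) removed")
print(deduplicated)
```

Running the duplicate check after normalization matters here: the first two rows only become identical once `"manila "` is trimmed and title-cased.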

## Exploratory Data Analysis

Perform a comprehensive exploratory data analysis (EDA) to gain a good understanding of your dataset. This step should help in formulating the research question of the project.

In this section of the notebook, you must fulfill the following:

- Identify **at least 4 exploratory data analysis questions**. Properly state the questions in the notebook. Having more than 4 questions is acceptable, especially if this will help in understanding the data better.
- Answer the EDA questions using both:
  - **Numerical Summaries** – measures of central tendency, measures of dispersion, and correlation.
  - **Visualization** – Appropriate visualization should be used. Each visualization should be accompanied by a brief explanation.

**To emphasize, both numerical summary and visualization should be presented for each question.**  
The whole process should be supported with verbose textual descriptions of your procedures and findings.
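
As an illustration of pairing a numerical summary with a visualization, the sketch below addresses one hypothetical EDA question: how does the age distribution differ across the sex codes? The values in `sample` are invented, the helper `plot_age_by_sex` is a made-up name, and the use of matplotlib for plotting is an assumption. Only the column names come from the project file.

```python
import pandas as pd

# Invented toy values; only the column names come from the loaded file.
sample = pd.DataFrame({
    "RQ3_SEX": [1, 1, 1, 2, 2, 2],
    "RQ4_AGE": [30, 40, 50, 25, 35, 45],
})

# Numerical summary: central tendency and dispersion of age per group.
age_summary = sample.groupby("RQ3_SEX")["RQ4_AGE"].agg(["mean", "median", "std"])
print(age_summary)


def plot_age_by_sex(df):
    """Draw one box plot of RQ4_AGE per RQ3_SEX code and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    df.boxplot(column="RQ4_AGE", by="RQ3_SEX", ax=ax)
    ax.set_xlabel("RQ3_SEX (code)")
    ax.set_ylabel("RQ4_AGE")
    ax.set_title("Age distribution by sex code")
    fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
    return ax
```

Calling `plot_age_by_sex(sample)` in a notebook cell renders the figure. A box plot is used here because it shows the median and spread that the summary table reports, so the two can be read side by side.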

## Research Question

Come up with one (1) research question to answer using the dataset. Here are some requirements:

- **Important**: The research question should arise from the exploratory data analysis. Explain how the research question connects to the answers obtained from the exploratory data analysis.
- The research question should be within the scope of the dataset.
- The research question should be answerable using data mining techniques (i.e., rule mining, clustering, collaborative filtering). Techniques not covered in class may not be used.
- Make sure to indicate the importance and significance of the research question.


```python
import pandas as pd

# Load the CSV file
file_path = '../data/SOF PUF 2015.csv'
data = pd.read_csv(file_path)

# Display the info of the DataFrame
data.info()
```


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5440 entries, 0 to 5439
Data columns (total 48 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   RREG            5440 non-null   int64  
 1   HHNUM           5440 non-null   int64  
 2   RRPL            5440 non-null   int64  
 3   RSTR            5440 non-null   int64  
 4   RPSU            5440 non-null   int64  
 5   RROTATION       5440 non-null   int64  
 6   RQ1_LNO         5440 non-null   int64  
 7   RQ2_REL         5440 non-null   int64  
 8   RQ3_SEX         5440 non-null   int64  
 9   RQ4_AGE         5440 non-null   int64  
 10  RQ5_TMSLEFT     5440 non-null   int64  
 11  RQ6M_DTLEFT     5440 non-null   int64  
 12  RQ6Y_DTLEFT     5440 non-null   int64  
 13  RQ7_MSTAT       5438 non-null   float64
 14  RQ8_HGRADE      5440 non-null   int64  
 15  RQ9_USOCC       5440 non-null   int64  
 16  RQ10_REASON     5440 non-null   int64  
 17  RQ11_BASE       5012 non-null   f