# Data Preparation for Lung Capacity Dataset

## By: Fatima Usman Muhammad

## I. Introduction
**Lung Capacity Dataset:**<br>
Lung capacity refers to the total amount of air that the lungs can hold **(Delgado BJ, Bajaj T, 2023)**. It is measured as a volume and is typically assessed through pulmonary function tests. Lung capacity is a critical indicator of respiratory health, and understanding the factors that influence it is essential for promoting overall well-being **(Ph.H Quanjer, et al., 1993)**. This project investigates the relationships between lung capacity and several key variables: smoking habits, age, gender, mode of birth, and height. It explores whether these factors have a significant impact on lung capacity, with the aim of contributing insights to public health and clinical practice.


## II. Data Collection

The dataset was collected from Kaggle https://www.kaggle.com/datasets/radhakrishna4/lung-capacity (Radhakrishna, 2019).<br>

**About The Dataset**<br>
The dataset records the lung capacity of smokers and non-smokers by age, gender, and height.

**Meaning of the variables:**
- **LungCap:** The person's lung capacity (closing capacity)
- **Age:** The person's age in years
- **Height:** The person's height in inches
- **Smoke:** Whether the person smokes (yes/no)
- **Gender:** Whether the person is male or female
- **Caesarean:** Whether the person was born by Caesarean section (yes/no)

# Implementing Python Code for Data Preparation of Lung Capacity Dataset
**Data Preparation:**<br>
Data preparation is a critical phase in analyzing a lung capacity dataset because it determines the accuracy and reliability of every later analysis. It involves three key tasks: cleaning, transformation, and feature engineering. Cleaning addresses missing values, outliers, and inaccuracies to prevent bias; transformation standardizes units and normalizes variables; feature engineering creates or modifies variables to improve predictive power. For lung capacity data, particular attention goes to keeping respiratory measurements consistent and handling demographic information with privacy in mind. A well-prepared dataset is the foundation for meaningful analysis of patterns and correlations related to respiratory health.
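The three tasks described above can be sketched on a handful of rows. This is a minimal illustration, not the procedure used in this project: the rows are copied from the dataset preview shown further down, and the helper `prepare` together with the derived columns `Height(cm)` and `Smoker` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Five rows copied from the dataset preview shown later in this notebook.
sample = pd.DataFrame({
    "LungCap(cc)": [6.475, 10.125, 9.55, 11.125, 4.8],
    "Age( years)": [6, 18, 16, 14, 5],
    "Height(inches)": [62.1, 74.7, 69.7, 71.0, 56.9],
    "Smoke": ["no", "yes", "no", "no", "no"],
    "Gender": ["male", "female", "female", "male", "male"],
    "Caesarean": ["no", "no", "yes", "no", "no"],
})


def prepare(df):
    """Hypothetical helper: one small example of each preparation task."""
    out = df.copy()
    # Cleaning: drop exact duplicate rows and rows with missing values.
    out = out.drop_duplicates().dropna()
    # Transformation: convert height from inches to centimetres (1 in = 2.54 cm).
    out["Height(cm)"] = out["Height(inches)"] * 2.54
    # Feature engineering: encode the yes/no smoking flag as 1/0.
    out["Smoker"] = (out["Smoke"] == "yes").astype(int)
    return out


prepared = prepare(sample)
print(prepared[["Height(inches)", "Height(cm)", "Smoke", "Smoker"]])
```

Working on a copy keeps the raw data untouched, so each preparation step can be checked against the original values.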

In [2]:
# Importing NumPy library and aliasing it as np
import numpy as np
# Importing pandas library and aliasing it as pd
import pandas as pd
# Importing specific components (DataFrame, Series) from pandas
from pandas import DataFrame, Series
# Importing the pyplot module from the matplotlib library and aliasing it as plt
import matplotlib.pyplot as plt
# Importing the seaborn library and aliasing it as sns
import seaborn as sns
# Importing the stats module from the scipy library
from scipy import stats
# Importing the warnings module to control warning messages
import warnings
# Suppressing warning messages to keep the output uncluttered (optional)
warnings.filterwarnings('ignore')

**The code above sets up a Python environment for data analysis and visualization using NumPy, pandas, Matplotlib, Seaborn, and SciPy. The optional last line suppresses warning messages.**

In [6]:
# Reading a CSV file named 'LungCap.csv' and storing the data in a variable called 'data'
data = pd.read_csv('LungCap.csv')
# Displaying the first few rows of the dataset using the 'head()' method
data.head()


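| | LungCap(cc) | Age( years) | Height(inches) | Smoke | Gender | Caesarean |
|---|---|---|---|---|---|---|
| 0 | 6.475 | 6 | 62.1 | no | male | no |
| 1 | 10.125 | 18 | 74.7 | yes | female | no |
| 2 | 9.55 | 16 | 69.7 | no | female | yes |
| 3 | 11.125 | 14 | 71.0 | no | male | no |
| 4 | 4.8 | 5 | 56.9 | no | male | no |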


**The code above reads the CSV file 'LungCap.csv' with pandas, stores it in a DataFrame called 'data', and displays the first five rows to give an initial look at the structure and content of the data. This is a common and essential first step in exploratory data analysis.**
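The same read-and-preview step can be tried without the file on disk. The sketch below is only an illustration: it parses the five preview rows from an in-memory string, and the shorter column names in `tidy` are hypothetical, suggested because headers such as `Age( years)` contain brackets and a stray space that are awkward to type.

```python
import io

import pandas as pd

# The five preview rows, written out as CSV text (no file needed).
csv_text = """LungCap(cc),Age( years),Height(inches),Smoke,Gender,Caesarean
6.475,6,62.1,no,male,no
10.125,18,74.7,yes,female,no
9.55,16,69.7,no,female,yes
11.125,14,71.0,no,male,no
4.8,5,56.9,no,male,no
"""

# read_csv accepts any file-like object, so StringIO stands in for the file.
preview = pd.read_csv(io.StringIO(csv_text))
print(preview.shape)
print(preview.head(3))

# Hypothetical shorter names; rename returns a new DataFrame.
tidy = preview.rename(columns={
    "LungCap(cc)": "LungCap",
    "Age( years)": "Age",
    "Height(inches)": "Height",
})
print(tidy.columns.tolist())
```

Renaming is optional; the original headers work as long as they are typed exactly, including the space inside `Age( years)`.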

In [7]:
# Displaying the last few rows of the DataFrame 'data' using the 'tail()' method
data.tail()

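| | LungCap(cc) | Age( years) | Height(inches) | Smoke | Gender | Caesarean |
|---|---|---|---|---|---|---|
| 720 | 5.725 | 9 | 56.0 | no | female | no |
| 721 | 9.05 | 18 | 72.0 | yes | male | yes |
| 722 | 3.85 | 11 | 60.5 | yes | female | no |
| 723 | 9.825 | 15 | 64.9 | no | female | no |
| 724 | 7.1 | 10 | 67.7 | no | male | no |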


**The code above displays the last five rows of the DataFrame. These rows are useful for quickly checking the end of the dataset, for example to see the most recently added entries or to verify the structure of the data near the end of the file. It complements the head() method, which shows the first rows.**

In [8]:
# Displaying concise information about the DataFrame 'data' using the 'info()' method
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   LungCap(cc)     725 non-null    float64
 1   Age( years)     725 non-null    int64  
 2   Height(inches)  725 non-null    float64
 3   Smoke           725 non-null    object 
 4   Gender          725 non-null    object 
 5   Caesarean       725 non-null    object 
dtypes: float64(2), int64(1), object(3)
memory usage: 34.1+ KB


**The code above produces a concise summary of the DataFrame's structure and content. It is commonly used in exploratory data analysis (EDA) to understand the characteristics of a dataset and to identify data cleaning or preprocessing steps that may be needed.**<br>
**The summary shows that the dataset has no missing values: all 725 entries are non-null in each of the 6 columns. The columns comprise 1 integer, 2 float, and 3 object datatypes.**
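The two observations above (no missing values, a mix of numeric and text columns) can also be checked programmatically. The sketch below runs on the five preview rows rather than the full file, and converting the yes/no text columns to the `category` dtype is offered as one possible follow-up, not a step taken in this notebook.

```python
import pandas as pd

# Five rows copied from the data.head() preview above.
sample = pd.DataFrame({
    "LungCap(cc)": [6.475, 10.125, 9.55, 11.125, 4.8],
    "Age( years)": [6, 18, 16, 14, 5],
    "Height(inches)": [62.1, 74.7, 69.7, 71.0, 56.9],
    "Smoke": ["no", "yes", "no", "no", "no"],
    "Gender": ["male", "female", "female", "male", "male"],
    "Caesarean": ["no", "no", "yes", "no", "no"],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Split the columns into numeric and non-numeric groups.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_columns)
print(text_columns)

# Possible follow-up (an assumption, not done in this notebook):
# store the text columns as categoricals.
as_category = sample[text_columns].astype("category")
print(as_category.dtypes)
```

On the full dataset the same calls would be applied to `data` instead of `sample`.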