
# Lung Capacity Analysis Project

This Jupyter Notebook analyzes the Lung Capacity dataset in Python. Because the original dataset is unavailable, the code cells run on a synthetic stand-in with the same columns. It includes:
- Data preparation and cleaning
- Exploratory Data Analysis (EDA) with visualizations
- Regression modeling (Linear, Quadratic, Cubic)
- Key insights and conclusions


In [None]:

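import numpy as np
import pandas as pd

# Mock data loading (the original dataset is unavailable):
# create a synthetic dataset for demonstration purposes
np.random.seed(42)  # fixed seed so the cell is reproducible
height = np.random.normal(65, 5, 200)
age = np.random.randint(10, 70, 200)
# Gender and smoking status are drawn at random, independent of the other columns
gender = np.random.choice(['Male', 'Female'], 200)
smoking_status = np.random.choice(['Yes', 'No'], 200)
# Lung capacity is linear in height and age, plus Gaussian noise
lung_capacity = 0.2 * height + 0.1 * age + np.random.normal(0, 2, 200)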

# Create DataFrame
lungcapacity_df = pd.DataFrame({
    'Height': height,
    'Age': age,
    'Gender': gender,
    'SmokingStatus': smoking_status,
    'LungCap': lung_capacity
})

# Display the first few rows
lungcapacity_df.head()
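
The overview lists data preparation and cleaning, so a few basic checks can follow the data-loading cell. The block below is a minimal sketch rather than the notebook's original cleaning code: it rebuilds the same synthetic DataFrame so it runs on its own, and the names `missing_counts` and `duplicate_rows` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Rebuild the notebook's synthetic data so this sketch runs on its own
np.random.seed(42)
n = 200
height = np.random.normal(65, 5, n)
age = np.random.randint(10, 70, n)
gender = np.random.choice(['Male', 'Female'], n)
smoking_status = np.random.choice(['Yes', 'No'], n)
lung_capacity = 0.2 * height + 0.1 * age + np.random.normal(0, 2, n)
lungcapacity_df = pd.DataFrame({
    'Height': height,
    'Age': age,
    'Gender': gender,
    'SmokingStatus': smoking_status,
    'LungCap': lung_capacity,
})

# Count missing values per column and fully duplicated rows
missing_counts = lungcapacity_df.isna().sum()
duplicate_rows = int(lungcapacity_df.duplicated().sum())

# Store the two grouping variables as categoricals
for col in ['Gender', 'SmokingStatus']:
    lungcapacity_df[col] = lungcapacity_df[col].astype('category')

print(missing_counts)
print('Duplicated rows:', duplicate_rows)
print(lungcapacity_df.dtypes)
```

On a real dataset, these checks would normally run before any plotting or modeling, and rows with missing values would need an explicit decision (drop or impute).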


In [None]:

# Exploratory Data Analysis (EDA)
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of Lung Capacity
plt.figure(figsize=(8, 6))
sns.histplot(lungcapacity_df['LungCap'], kde=True, color='blue')
plt.title('Distribution of Lung Capacity')
plt.xlabel('Lung Capacity')
plt.ylabel('Frequency')
plt.show()

# Scatter plot of Lung Capacity vs Height
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Height', y='LungCap', hue='Gender', data=lungcapacity_df)
plt.title('Lung Capacity vs Height by Gender')
plt.xlabel('Height')
plt.ylabel('Lung Capacity')
plt.show()

# Box plot of Lung Capacity by Smoking Status
plt.figure(figsize=(8, 6))
sns.boxplot(x='SmokingStatus', y='LungCap', data=lungcapacity_df)
plt.title('Lung Capacity by Smoking Status')
plt.xlabel('Smoking Status')
plt.ylabel('Lung Capacity')
plt.show()
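
The plots above can be paired with numeric summaries. The following is a small sketch, assuming the same synthetic data as the notebook (rebuilt here so the block is self-contained); `corr_matrix` and `summary_by_smoking` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the notebook's synthetic data so this sketch runs on its own
np.random.seed(42)
n = 200
height = np.random.normal(65, 5, n)
age = np.random.randint(10, 70, n)
gender = np.random.choice(['Male', 'Female'], n)
smoking_status = np.random.choice(['Yes', 'No'], n)
lung_capacity = 0.2 * height + 0.1 * age + np.random.normal(0, 2, n)
lungcapacity_df = pd.DataFrame({
    'Height': height,
    'Age': age,
    'Gender': gender,
    'SmokingStatus': smoking_status,
    'LungCap': lung_capacity,
})

# Pearson correlations between the numeric columns
numeric_cols = ['Height', 'Age', 'LungCap']
corr_matrix = lungcapacity_df[numeric_cols].corr()

# Numeric counterpart of the box plot: lung capacity by smoking status
summary_by_smoking = (
    lungcapacity_df.groupby('SmokingStatus')['LungCap']
    .agg(['count', 'mean', 'median', 'std'])
)

print(corr_matrix.round(3))
print(summary_by_smoking.round(3))
```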


In [None]:

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Linear Regression (Model 1)
model1 = ols('LungCap ~ Height', data=lungcapacity_df).fit()
print("Model 1 Summary:")
print(model1.summary())

# Quadratic Regression (Model 2)
lungcapacity_df['Height2'] = lungcapacity_df['Height'] ** 2
model2 = ols('LungCap ~ Height + Height2', data=lungcapacity_df).fit()
print("Model 2 Summary:")
print(model2.summary())

# Cubic Regression (Model 3)
lungcapacity_df['Height3'] = lungcapacity_df['Height'] ** 3
model3 = ols('LungCap ~ Height + Height2 + Height3', data=lungcapacity_df).fit()
print("Model 3 Summary:")
print(model3.summary())
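
Reading three full summaries side by side is error-prone, so the three fits can also be compared directly. The sketch below is one possible approach, not necessarily the one used in the original analysis: it tabulates adjusted R², AIC, and BIC, then runs nested F-tests with statsmodels' `anova_lm`. The data is rebuilt so the block is self-contained, and the names `formulas`, `fits`, `comparison`, and `nested_f` are assumptions. Because the mock data is generated as a linear function of height, the higher-order terms are not expected to be significant here.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Rebuild the notebook's synthetic data so this sketch runs on its own
np.random.seed(42)
n = 200
height = np.random.normal(65, 5, n)
age = np.random.randint(10, 70, n)
gender = np.random.choice(['Male', 'Female'], n)
smoking_status = np.random.choice(['Yes', 'No'], n)
lung_capacity = 0.2 * height + 0.1 * age + np.random.normal(0, 2, n)
lungcapacity_df = pd.DataFrame({
    'Height': height,
    'Age': age,
    'Gender': gender,
    'SmokingStatus': smoking_status,
    'LungCap': lung_capacity,
})
lungcapacity_df['Height2'] = lungcapacity_df['Height'] ** 2
lungcapacity_df['Height3'] = lungcapacity_df['Height'] ** 3

# Same three formulas as Models 1 to 3 in the notebook
formulas = {
    'linear': 'LungCap ~ Height',
    'quadratic': 'LungCap ~ Height + Height2',
    'cubic': 'LungCap ~ Height + Height2 + Height3',
}
fits = {name: ols(formula, data=lungcapacity_df).fit()
        for name, formula in formulas.items()}

# Higher adjusted R-squared is better; lower AIC and BIC are better
comparison = pd.DataFrame({
    'adj_r2': {name: fit.rsquared_adj for name, fit in fits.items()},
    'aic': {name: fit.aic for name, fit in fits.items()},
    'bic': {name: fit.bic for name, fit in fits.items()},
})
print(comparison.round(3))

# Nested F-tests: each row tests the term added relative to the previous model
nested_f = anova_lm(fits['linear'], fits['quadratic'], fits['cubic'])
print(nested_f)
```

A small `Pr(>F)` in the second row would favor the quadratic over the linear model, and a large value in the third row would indicate that the cubic term adds little.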



## Key Insights and Outcomes

- The **quadratic model (Model 2)** provided the best fit, capturing the non-linear relationship between height and lung capacity.
- The **cubic model (Model 3)** did not significantly improve the fit over the quadratic model, indicating diminishing returns from increased model complexity.
- Analysis of gender and smoking status revealed that:
  - Males generally had higher lung capacity than females, possibly due to differences in height.
  - Smokers appeared to have higher mean lung capacity, but this may be confounded by other factors (e.g., age or overall health).
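
The confounding concern in the last bullet can be probed by comparing the raw smoking effect with the effect after adjusting for height, age, and gender. The block below is a hedged sketch of that idea; the adjusted formula and the names `raw`, `adjusted`, and `term` are assumptions, not part of the original notebook. In the mock data, gender and smoking status are assigned at random, so both estimates should be close to zero there.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Rebuild the notebook's synthetic data so this sketch runs on its own
np.random.seed(42)
n = 200
height = np.random.normal(65, 5, n)
age = np.random.randint(10, 70, n)
gender = np.random.choice(['Male', 'Female'], n)
smoking_status = np.random.choice(['Yes', 'No'], n)
lung_capacity = 0.2 * height + 0.1 * age + np.random.normal(0, 2, n)
lungcapacity_df = pd.DataFrame({
    'Height': height,
    'Age': age,
    'Gender': gender,
    'SmokingStatus': smoking_status,
    'LungCap': lung_capacity,
})

# Unadjusted: difference in mean lung capacity, smokers minus non-smokers
raw = ols('LungCap ~ C(SmokingStatus)', data=lungcapacity_df).fit()

# Adjusted: the same contrast, holding height, age, and gender fixed
adjusted = ols(
    'LungCap ~ Height + Age + C(Gender) + C(SmokingStatus)',
    data=lungcapacity_df,
).fit()

# 'No' is the reference level, so this term is the Yes-vs-No contrast
term = 'C(SmokingStatus)[T.Yes]'
print('Raw smoking effect:     ', round(raw.params[term], 3))
print('Adjusted smoking effect:', round(adjusted.params[term], 3))
print(adjusted.conf_int().loc[term])
```

If the raw effect is positive but the adjusted effect shrinks or changes sign, the apparent advantage for smokers is explained by the other covariates.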

## Conclusion

The quadratic model was selected as the final model because it balances fit and interpretability. The analysis shows why non-linear effects are worth testing when modeling real-world relationships, and the same comparison of linear and polynomial fits carries over to similar predictive tasks.
