# 1. Introduction

This notebook explores the Air Quality dataset from the UCI Machine Learning Repository to identify patterns and trends in air pollution data.

# 2. Problem Statement

The goal is to analyze air quality data: explore pollutant levels, identify patterns, handle missing values, and visualize relationships among environmental variables.

# 3. Installing & Importing Libraries

In [None]:
!pip install pandas numpy matplotlib seaborn plotly -q

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')

# 4. Data Acquisition & Description

In [None]:
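# The file uses ';' as the field separator and ',' as the decimal mark.
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00360/AirQualityUCI.csv', sep=';', decimal=',')
df = df.iloc[:, :-2]  # drop the two trailing columns (empty in the raw file)
df = df.dropna(how='all')  # drop fully empty rows, if any
df.columns = df.columns.str.strip()
# Times are assumed to be written with dots (e.g. 18.00.00), so the format is given explicitly.
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')
df.drop(columns=['Date', 'Time'], inplace=True)
df.head()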
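
The loading step depends on three details of the raw file: semicolon separators, comma decimals, and day-first dates. The sketch below replays the same steps on a small inline sample so the parsing can be checked without downloading anything. The two data rows are made up, and the dotted time format (`18.00.00`) is an assumption about the file's layout.

```python
import io
import pandas as pd

# Made-up rows in the assumed layout of the raw file: ';' separators,
# ',' decimals, two empty trailing columns, and one fully empty row.
raw = (
    "Date;Time;CO(GT);T;;\n"
    "10/03/2004;18.00.00;2,6;13,6;;\n"
    "10/03/2004;19.00.00;2;13,3;;\n"
    ";;;;;\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
sample = sample.iloc[:, :-2]        # drop the two unnamed empty columns
sample = sample.dropna(how="all")   # drop the fully empty row
sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H.%M.%S"
)
sample = sample.drop(columns=["Date", "Time"])
print(sample)
```

Giving `format` explicitly makes the parse fail loudly if the file's layout differs from this assumption, instead of silently swapping day and month.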

# 5. Data Pre-Profiling

In [None]:
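df.info()
display(df.describe())  # a cell only shows its last expression, so display this one explicitly
# Missing readings are still encoded as -200 at this point, so they do not count as nulls yet.
df.isnull().sum()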
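
Because the next section treats `-200` as the marker for a missing reading, `isnull()` alone will not reveal those gaps at this stage. A minimal sketch of counting the marker per column, using a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up readings; -200 stands in for "no measurement".
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200, 2.2, 1.6],
    "NMHC(GT)": [-200, -200, -200, 88],
    "T": [13.6, 13.3, 11.9, 11.0],
})

is_sentinel = toy == -200
sentinel_counts = is_sentinel.sum()    # number of -200 entries per column
sentinel_share = is_sentinel.mean()    # fraction of -200 entries per column
print(pd.DataFrame({"count": sentinel_counts, "share": sentinel_share}))
```

Running the same two lines on `df` before the replacement step would show which columns are mostly empty.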

# 6. Data Pre-Processing

In [None]:
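# -200 is the dataset's marker for a missing reading.
df.replace(-200, np.nan, inplace=True)
# NMHC(GT) has too many missing readings to impute reliably.
df.drop(columns=['NMHC(GT)'], inplace=True)
# Fill the remaining gaps with each column's median.
df.fillna(df.median(numeric_only=True), inplace=True)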
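
The cell above hard-codes which column to drop. One possible generalization, sketched under assumptions, drops any column whose share of missing values exceeds a threshold and median-fills the rest. The helper name `clean` and the 0.5 threshold are hypothetical choices, and the frame is made up.

```python
import numpy as np
import pandas as pd

def clean(frame, sentinel=-200, max_missing=0.5):
    """Replace the sentinel with NaN, drop sparse columns, median-fill the rest."""
    out = frame.replace(sentinel, np.nan)
    missing_share = out.isna().mean()
    sparse = missing_share[missing_share > max_missing].index
    out = out.drop(columns=sparse)
    return out.fillna(out.median(numeric_only=True))

# Made-up readings: NMHC(GT) is 75% missing, CO(GT) is 25% missing.
toy = pd.DataFrame({
    "CO(GT)": [2.0, -200, 4.0, 3.0],
    "NMHC(GT)": [-200, -200, -200, 88.0],
})
cleaned = clean(toy)
print(cleaned)
```

Unlike the in-place version, this returns a new frame, so the raw data stays available for comparison.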

# 7. Data Post-Profiling

In [None]:
display(df.describe())  # a cell only shows its last expression, so display this one explicitly
df.isnull().sum()

# 8. Exploratory Data Analysis

Each visualization below answers one question about the data.

### Q: What is the average CO concentration over time?

In [None]:
px.line(df, x='Datetime', y='CO(GT)', title='Average CO Concentration Over Time').show()
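
The line plot above draws every row, which can be noisy. If a smoother average is wanted, one option is to resample to daily means first. This is a sketch on made-up hourly values; the daily frequency is an illustrative choice.

```python
import pandas as pd

# Made-up hourly values spanning two days.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2004-03-10 22:00", "2004-03-10 23:00",
        "2004-03-11 00:00", "2004-03-11 01:00",
    ]),
    "CO(GT)": [2.0, 4.0, 1.0, 3.0],
})

# Resampling needs a datetime index; "D" groups rows by calendar day.
daily_co = toy.set_index("Datetime")["CO(GT)"].resample("D").mean()
print(daily_co)
```

The result can be plotted the same way, for example with `px.line(daily_co.reset_index(), x='Datetime', y='CO(GT)')`.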

### Q: How does temperature vary over the recorded period?

In [None]:
px.line(df, x='Datetime', y='T', title='Temperature Variation Over Time').show()

### Q: Which pollutants are most correlated?

In [None]:
plt.figure(figsize=(10, 8))  # enlarge the figure so the annotations stay readable
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
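
A heatmap shows the whole matrix, but answering "most correlated" is easier with a ranked list. A minimal sketch that lists each column pair once, sorted by absolute correlation; the frame and its column names are made up.

```python
from itertools import combinations

import pandas as pd

# Made-up columns: b is an exact multiple of a, c is unrelated noise.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})
corr = toy.corr()

# combinations() yields each unordered pair once and skips the diagonal.
pairs = pd.Series(
    {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}
).sort_values(key=abs, ascending=False)
print(pairs)
```

On the notebook's data, `corr` would be `df.corr(numeric_only=True)`.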

### Q: What is the distribution of NO2(GT)?

In [None]:
sns.histplot(df['NO2(GT)'], kde=True, color='skyblue')
plt.title('Distribution of NO2(GT)')
plt.show()
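
Since missing readings were filled with the column median in section 6, the histogram includes those filled values, all sitting at one point. The sketch below, on made-up numbers, compares the spread of measured values with the spread after filling, which is one way to judge how much the imputation reshapes the distribution.

```python
import numpy as np
import pandas as pd

# Made-up NO2 readings; -200 marks a missing measurement.
raw_no2 = pd.Series([100.0, -200, 120.0, -200, 90.0, 150.0], name="NO2(GT)")

measured = raw_no2.replace(-200, np.nan)       # gaps left as NaN
median = measured.median()
imputed = measured.fillna(median)              # gaps filled with the median

print("median:", median)
print("std of measured values:", round(measured.std(), 2))
print("std after median fill:", round(imputed.std(), 2))
```

Plotting `measured.dropna()` instead of the filled column would show the distribution of actual readings only.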

### Q: How does relative humidity relate to temperature?

In [None]:
sns.scatterplot(data=df, x='T', y='RH')
plt.title('Relative Humidity vs Temperature')
plt.show()

### Q: Are benzene (C6H6) levels related to other pollutants?

In [None]:
sns.scatterplot(data=df, x='C6H6(GT)', y='CO(GT)')
plt.title('Benzene vs CO Levels')
plt.show()

### Q: Which time of day has peak pollution?

In [None]:
df['Hour'] = df['Datetime'].dt.hour
sns.boxplot(x='Hour', y='CO(GT)', data=df)
plt.title('Hourly CO Levels')
plt.show()
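
The box plot shows the hourly spread visually. To name the peak hour directly, the hourly medians can be computed and the largest one picked. This is a sketch on made-up readings, so the peak hour it prints says nothing about the real dataset.

```python
import pandas as pd

# Made-up readings at three hours of the day over two days.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2004-03-10 08:00", "2004-03-10 14:00", "2004-03-10 19:00",
        "2004-03-11 08:00", "2004-03-11 14:00", "2004-03-11 19:00",
    ]),
    "CO(GT)": [3.0, 1.5, 4.0, 2.6, 1.1, 4.4],
})

# Group by hour of day and take the median, which is robust to outliers.
hourly_median = toy.groupby(toy["Datetime"].dt.hour)["CO(GT)"].median()
peak_hour = hourly_median.idxmax()
print(hourly_median)
print("Peak hour in the toy data:", peak_hour)
```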

### Q: How do gas sensor responses relate to pollutant levels?

In [None]:
sensor_cols = ['PT08.S1(CO)', 'PT08.S2(NMHC)', 'PT08.S3(NOx)', 'PT08.S4(NO2)', 'PT08.S5(O3)']
sns.pairplot(df[sensor_cols])
plt.suptitle('Sensor Response Patterns', y=1.02)
plt.show()
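
The pair plot compares the sensor columns with each other. To relate a sensor to a pollutant level, one option is to correlate each sensor column with its matching reference (`GT`) column. The sensor-to-reference pairing below is an assumption read off the column names, and the values are made up.

```python
import pandas as pd

# Assumed pairing of sensor column -> reference column, based on names only.
sensor_to_reference = {
    "PT08.S1(CO)": "CO(GT)",
    "PT08.S4(NO2)": "NO2(GT)",
}

# Made-up readings for illustration.
toy = pd.DataFrame({
    "PT08.S1(CO)": [1360.0, 1292.0, 1402.0, 1376.0],
    "CO(GT)": [2.6, 2.0, 2.2, 2.2],
    "PT08.S4(NO2)": [1692.0, 1559.0, 1555.0, 1584.0],
    "NO2(GT)": [113.0, 92.0, 114.0, 122.0],
})

# Pearson correlation between each sensor and its reference measurement.
sensor_vs_reference = pd.Series(
    {s: toy[s].corr(toy[ref]) for s, ref in sensor_to_reference.items()},
    name="pearson_r",
)
print(sensor_vs_reference)
```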

### Q: What is the seasonal trend of CO?

In [None]:
df['Month'] = df['Datetime'].dt.month
sns.boxplot(x='Month', y='CO(GT)', data=df)
plt.title('Monthly CO Levels')
plt.show()
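
Grouping by `dt.month` pools the same calendar month from different years. If the recording period covers more than one year, grouping by year and month keeps them apart, and the same table can carry several pollutants at once. This is a sketch on made-up values.

```python
import pandas as pd

# Made-up readings: March appears in both 2004 and 2005.
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2004-03-15 10:00", "2004-03-20 10:00",
        "2004-12-15 10:00", "2005-03-15 10:00",
    ]),
    "CO(GT)": [2.0, 3.0, 4.0, 1.0],
    "NO2(GT)": [100.0, 110.0, 140.0, 90.0],
})

# to_period("M") yields labels such as 2004-03, so years are not merged.
year_month = toy["Datetime"].dt.to_period("M")
monthly = toy.groupby(year_month)[["CO(GT)", "NO2(GT)"]].median()
print(monthly)
```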

### Q: What is the pattern of Absolute Humidity over time?

In [None]:
px.line(df, x='Datetime', y='AH', title='Absolute Humidity Over Time').show()

# 9. Summarization

## 9.1 Conclusion

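- Pollutant levels vary over time, both by hour of day and by month.
- `NMHC(GT)` had too many missing values (recorded as -200) and was removed; remaining gaps were filled with column medians.
- Sensor responses are strongly correlated with specific pollutants, as the correlation heatmap shows.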

## 9.2 Actionable Insights

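- CO peaks at specific hours in the hourly box plot, likely due to traffic.
- CO levels vary by month, which suggests seasonality; NO2 was not plotted by month in this notebook, so its seasonality remains to be checked.
- Focusing sensor monitoring on the peak hours can support real-time alerts.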