## Project : Exploratory Data Analysis (EDA) in Python
**Objective** : Select a dataset, clean and pre-process it, conduct exploratory data analysis (EDA), and present the findings. The aim is to understand the dataset, uncover underlying patterns, and generate insights that can guide further analysis or decision-making.

## 1. Select a Dataset

### 1.1 Choose a dataset that interests you. The dataset can be from a public source such as Kaggle, UCI Machine Learning Repository, or any other reliable source.

**Selected Dataset** : https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india

### 1.2 Ensure the dataset is sufficiently large and has a variety of features (columns) to analyze. Aim for at least 500 rows and 5 columns.

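**Answer** : `city_day.csv` has 29,531 rows and 16 columns (see the shape printed in section 3.3), which comfortably meets both thresholds.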

## 2. Project Setup

### 2.1 **Description of project and structure**

## 3. Data Import and Cleaning

### 3.1 Import the necessary libraries: pandas, numpy, matplotlib, seaborn, etc.

In [9]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Enable inline plotting for Jupyter Notebook
%matplotlib inline

### 3.2 Load the dataset into a pandas DataFrame

In [10]:
# Load the dataset from a CSV file into a pandas DataFrame
df = pd.read_csv('Dataset/city_day.csv')

# Display the first few rows
df.head()

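|   | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|------|------|-------|------|----|-----|-----|-----|----|-----|----|---------|---------|--------|-----|------------|
| 0 | Ahmedabad | 2015-01-01 | NaN | NaN | 0.92 | 18.22 | 17.15 | NaN | 0.92 | 27.64 | 133.36 | 0.0 | 0.02 | 0.0 | NaN | NaN |
| 1 | Ahmedabad | 2015-01-02 | NaN | NaN | 0.97 | 15.69 | 16.46 | NaN | 0.97 | 24.55 | 34.06 | 3.68 | 5.5 | 3.77 | NaN | NaN |
| 2 | Ahmedabad | 2015-01-03 | NaN | NaN | 17.4 | 19.3 | 29.7 | NaN | 17.4 | 29.07 | 30.7 | 6.8 | 16.4 | 2.25 | NaN | NaN |
| 3 | Ahmedabad | 2015-01-04 | NaN | NaN | 1.7 | 18.48 | 17.97 | NaN | 1.7 | 18.59 | 36.08 | 4.43 | 10.14 | 1.0 | NaN | NaN |
| 4 | Ahmedabad | 2015-01-05 | NaN | NaN | 22.1 | 21.42 | 37.76 | NaN | 22.1 | 39.33 | 39.31 | 7.01 | 18.89 | 2.78 | NaN | NaN |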


### 3.3 Perform initial data inspection: check the shape of the data, data types, and summary statistics.

In [11]:
# Print the shape of the DataFrame to see the number of rows and columns
print("Data Shape:", df.shape)

# Display detailed information about the DataFrame (column types and non-null counts)
df.info()

# Generate and print summary statistics for numerical columns
print(df.describe())

Data Shape: (29531, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB
              PM2.5          PM10            NO           NO2           NOx 
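As a complement to `df.info()` and `df.describe()`, a compact per-column overview can make the inspection easier to scan. The sketch below is one possible helper, not part of the original notebook: the function name `column_overview` and the toy frame are assumptions made for illustration, and the toy values are invented.

```python
import numpy as np
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, missing share, unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_unique": frame.nunique(),  # NaN is not counted as a value
    })

# Toy frame standing in for the real df (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["A", "A", "B", "B"],
    "PM2.5": [10.0, np.nan, 30.0, np.nan],
    "AQI": [50.0, 60.0, np.nan, 80.0],
})

overview = column_overview(toy)
print(overview)
```

On the real data, `column_overview(df)` would show at a glance which pollutant columns are sparse enough to need special handling in the next step.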

### 3.4 Identify and handle missing values. Decide whether to drop, fill, or interpolate missing data based on the context.

In [12]:
# Check and print the number of missing values in each column
print("Missing values in each column:")
print(df.isnull().sum())

# Identify numerical columns in the DataFrame
num_cols = df.select_dtypes(include=[np.number]).columns

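# Fill missing values in each numerical column with that column's mean,
# computed over the whole dataset (all cities and all dates).
# Note: AQI is imputed as well, while the text column AQI_Bucket is left as-is.
# Reassigning the column directly avoids chained-assignment warnings.
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())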

# Alternatively, if preferred, you could drop rows with missing values:
# df.dropna(inplace=True)

Missing values in each column:
City              0
Date              0
PM2.5          4598
PM10          11140
NO             3582
NO2            3585
NOx            4185
NH3           10328
CO             2059
SO2            3854
O3             4022
Benzene        5623
Toluene        8041
Xylene        18109
AQI            4681
AQI_Bucket     4681
dtype: int64
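The mean fill above uses a single value per column for every city and date, which flattens differences between cities and over time. Since the task also mentions interpolation, here is a hedged sketch of one alternative: linear interpolation within each city, in date order. The helper name `interpolate_by_city` and the toy data are assumptions for illustration only; whether this suits the analysis depends on how long the gaps in each city are.

```python
import numpy as np
import pandas as pd

def interpolate_by_city(frame, value_cols, group_col="City", date_col="Date"):
    """Linearly interpolate each value column within each city, in date order."""
    out = frame.sort_values([group_col, date_col]).copy()
    for col in value_cols:
        # limit_direction="both" also fills gaps at the start and end of a city's
        # series, using the nearest observed value
        out[col] = out.groupby(group_col)[col].transform(
            lambda s: s.interpolate(limit_direction="both")
        )
    return out

# Toy frame standing in for the real df (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"] * 2),
    "PM2.5": [10.0, np.nan, 30.0, np.nan, 40.0, np.nan],
})

filled = interpolate_by_city(toy, ["PM2.5"])
print(filled)
```

A city with no observations at all for a pollutant cannot be interpolated, so a fallback (for example a median fill, or dropping that column for that city) would still be needed in that case.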


### 3.5 Detect and remove duplicate rows if any.

In [13]:
# Remove duplicate rows from the DataFrame
df = df.drop_duplicates()

# Print the shape; an unchanged row count means no exact duplicates were found
print("After removing duplicates, data shape:", df.shape)

After removing duplicates, data shape: (29531, 16)
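Comparing shapes before and after only reveals exact duplicate rows. For daily readings per city it can also be useful to count rows that share the same `City` and `Date` but differ in their measurements. The sketch below illustrates both counts on an invented toy frame; treating `City` plus `Date` as the key is an assumption about this dataset, not something the notebook states.

```python
import pandas as pd

# Toy frame standing in for the real df (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B"],
    "Date": ["2015-01-01", "2015-01-01", "2015-01-02", "2015-01-01", "2015-01-01"],
    "AQI": [100.0, 100.0, 120.0, 90.0, 95.0],
})

# Rows that repeat an earlier row in every column
n_exact = int(toy.duplicated().sum())

# Rows that repeat an earlier (City, Date) pair, even if the readings differ
n_key = int(toy.duplicated(subset=["City", "Date"]).sum())

# Drop only the exact duplicates and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Exact duplicates:", n_exact)
print("Duplicate (City, Date) pairs:", n_key)
print("Shape after drop_duplicates:", deduped.shape)
```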


### 3.6 Convert data types if necessary (e.g., dates should be in datetime format).

In [14]:
# Convert the 'Date' column to datetime format if it exists.
# errors='coerce' turns unparseable values into NaT instead of raising an error.
if 'Date' in df.columns:
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

    # Verify the conversion by displaying the first few rows of the 'Date' column
    print(df[['Date']].head())

        Date
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-04
4 2015-01-05
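Because `errors='coerce'` replaces unparseable strings with `NaT` silently, it is worth counting how many values were lost in the conversion. The sketch below shows the check on an invented toy series; the explicit `format` argument is an assumption based on the ISO-style dates visible in the output above.

```python
import pandas as pd

# Toy series standing in for df['Date'] before conversion (values are made up)
raw = pd.Series(["2015-01-01", "2015-01-02", "not a date", None])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present before parsing but became NaT afterwards
n_new_nat = int(parsed.isna().sum() - raw.isna().sum())

print("Values coerced to NaT:", n_new_nat)
```

If this count is greater than zero on the real data, inspecting `raw[parsed.isna() & raw.notna()]` shows which strings failed to parse.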


### 3.7 Final Verification

In [15]:
# Display the updated DataFrame information to confirm all changes
df.info()

# Optionally, display the first few rows to inspect the cleaned data
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   City        29531 non-null  object        
 1   Date        29531 non-null  datetime64[ns]
 2   PM2.5       29531 non-null  float64       
 3   PM10        29531 non-null  float64       
 4   NO          29531 non-null  float64       
 5   NO2         29531 non-null  float64       
 6   NOx         29531 non-null  float64       
 7   NH3         29531 non-null  float64       
 8   CO          29531 non-null  float64       
 9   SO2         29531 non-null  float64       
 10  O3          29531 non-null  float64       
 11  Benzene     29531 non-null  float64       
 12  Toluene     29531 non-null  float64       
 13  Xylene      29531 non-null  float64       
 14  AQI         29531 non-null  float64       
 15  AQI_Bucket  24850 non-null  object        
dtypes: datetime64[ns](1), float64(13), object(2)

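|   | City | Date | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
|---|------|------|-------|------|----|-----|-----|-----|----|-----|----|---------|---------|--------|-----|------------|
| 0 | Ahmedabad | 2015-01-01 | 67.450578 | 118.127103 | 0.92 | 18.22 | 17.15 | 23.483476 | 0.92 | 27.64 | 133.36 | 0.0 | 0.02 | 0.0 | 166.463581 | NaN |
| 1 | Ahmedabad | 2015-01-02 | 67.450578 | 118.127103 | 0.97 | 15.69 | 16.46 | 23.483476 | 0.97 | 24.55 | 34.06 | 3.68 | 5.5 | 3.77 | 166.463581 | NaN |
| 2 | Ahmedabad | 2015-01-03 | 67.450578 | 118.127103 | 17.4 | 19.3 | 29.7 | 23.483476 | 17.4 | 29.07 | 30.7 | 6.8 | 16.4 | 2.25 | 166.463581 | NaN |
| 3 | Ahmedabad | 2015-01-04 | 67.450578 | 118.127103 | 1.7 | 18.48 | 17.97 | 23.483476 | 1.7 | 18.59 | 36.08 | 4.43 | 10.14 | 1.0 | 166.463581 | NaN |
| 4 | Ahmedabad | 2015-01-05 | 67.450578 | 118.127103 | 22.1 | 21.42 | 37.76 | 23.483476 | 22.1 | 39.33 | 39.31 | 7.01 | 18.89 | 2.78 | 166.463581 | NaN |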
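Reading `df.info()` by eye works, but the same checks can be expressed in code so they are repeatable. The sketch below is one way to do that; the helper name `cleaning_report` and the toy frame are assumptions for illustration, with invented values. It also surfaces columns that still contain missing values, as `AQI_Bucket` does in the output above.

```python
import numpy as np
import pandas as pd

def cleaning_report(frame, date_col="Date"):
    """Summarise the state of a cleaned frame as a small dictionary."""
    numeric = frame.select_dtypes(include=[np.number])
    return {
        # Total count of missing cells across all numeric columns
        "numeric_missing": int(numeric.isna().sum().sum()),
        # True if the date column has a datetime dtype
        "date_is_datetime": bool(pd.api.types.is_datetime64_any_dtype(frame[date_col])),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Names of columns that still contain at least one missing value
        "columns_with_missing": frame.columns[frame.isna().any()].tolist(),
    }

# Toy frame standing in for the cleaned df (values are made up for illustration)
toy = pd.DataFrame({
    "City": ["A", "A", "B"],
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-01"]),
    "AQI": [100.0, 120.0, 90.0],
    "AQI_Bucket": ["Moderate", None, "Satisfactory"],
})

report = cleaning_report(toy)
print(report)
```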


## 4. Exploratory Data Analysis (EDA)