# Air Quality EDA, India 2015-22

## What is EDA

Exploratory Data Analysis (EDA) is an approach to analyzing data with summary statistics and visual techniques. It is used to discover trends and patterns, and to check assumptions, with the help of statistical summaries and graphical representations.

Further reading on EDA: https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/

## Project Outline

## Environment Setup

In [1]:
!pip install pandas numpy opendatasets matplotlib plotly seaborn --upgrade --quiet

In [2]:
import pandas as pd
import numpy as np
import opendatasets as od
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

## Download the Data 

In [3]:
aq_data_url = 'https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india?select=station_hour.csv'

In [4]:
od.download(aq_data_url,'air_quality_data')

Downloading air-quality-data-in-india.zip to air_quality_data/air-quality-data-in-india


## Data Preparation and Cleaning

- Load the file into a DataFrame using pandas
- Inspect the columns, dtypes, and non-null counts
- Fix missing and incorrect values

In [24]:
station_hourly_data = 'air_quality_data/air-quality-data-in-india/station_hour.csv'

In [25]:
selected_dtypes = {
    'PM2.5': 'float32',
    'PM10': 'float32',
    'NO': 'float32',
    'NO2': 'float32',
    'NOx': 'float32',
    'NH3': 'float32',
    'CO': 'float32',
    'SO2': 'float32',
    'O3': 'float32',
    'Benzene': 'float32',
    'Toluene': 'float32',
    'Xylene': 'float32',
    'AQI': 'float32',
    'AQI_Bucket': 'object'
}

In [26]:
station_df = pd.read_csv(station_hourly_data, dtype=selected_dtypes, parse_dates=['Datetime'])
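
The dtype mapping above asks pandas to store the pollutant columns as `float32` instead of the default `float64`. The following is a minimal sketch of the trade-off on synthetic data (the `toy` frame and its values are made up for illustration, not taken from the dataset): memory per column is halved, at the cost of about 7 significant digits of precision, which is why values such as `30.799999` can show up in the output.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one pollutant column; the values are made up.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"PM2.5": rng.uniform(0, 500, size=100_000)})

as_float64 = toy["PM2.5"].astype("float64")
as_float32 = toy["PM2.5"].astype("float32")

# Bytes used by the values only (index excluded).
bytes64 = as_float64.memory_usage(index=False)
bytes32 = as_float32.memory_usage(index=False)
print(f"float64: {bytes64} bytes, float32: {bytes32} bytes")

# float32 cannot represent 30.8 exactly, so converting it back
# to a Python float exposes a small rounding error.
rounded = float(np.float32(30.8))
print(rounded)
```

For a file with millions of rows and a dozen numeric columns, this halving is usually worth the small loss of precision, but that is a judgment call rather than a rule.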

In [18]:
station_df.head()

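| | StationId | Datetime | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI | AQI_Bucket |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AP001 | 2017-11-24 17:00:00 | 60.5 | 98.0 | 2.35 | 30.799999 | 18.25 | 8.5 | 0.1 | 11.85 | 126.400002 | 0.1 | 6.1 | 0.1 | NaN | NaN |
| 1 | AP001 | 2017-11-24 18:00:00 | 65.5 | 111.25 | 2.7 | 24.200001 | 15.07 | 9.77 | 0.1 | 13.17 | 117.120003 | 0.1 | 6.25 | 0.15 | NaN | NaN |
| 2 | AP001 | 2017-11-24 19:00:00 | 80.0 | 132.0 | 2.1 | 25.18 | 15.15 | 12.02 | 0.1 | 12.08 | 98.980003 | 0.2 | 5.98 | 0.18 | NaN | NaN |
| 3 | AP001 | 2017-11-24 20:00:00 | 81.5 | 133.25 | 1.95 | 16.25 | 10.23 | 11.58 | 0.1 | 10.47 | 112.199997 | 0.2 | 6.72 | 0.1 | NaN | NaN |
| 4 | AP001 | 2017-11-24 21:00:00 | 75.25 | 116.0 | 1.43 | 17.48 | 10.43 | 12.03 | 0.1 | 9.12 | 106.349998 | 0.2 | 5.75 | 0.08 | NaN | NaN |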


In [19]:
station_df.describe()

| | PM2.5 | PM10 | NO | NO2 | NOx | NH3 | CO | SO2 | O3 | Benzene | Toluene | Xylene | AQI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1941394.0 | 1469831.0 | 2035372.0 | 2060110.0 | 2098275.0 | 1352465.0 | 2089781.0 | 1846346.0 | 1863110.0 | 1727504.0 | 1546717.0 | 513979.0 | 2018893.0 |
| mean | 80.86481 | 158.4839 | 22.78826 | 35.2369 | 40.55115 | 28.70857 | 1.502366 | 12.11603 | 38.0641 | 3.305494 | 14.90267 | 2.448881 | 180.1732 |
| std | 89.47618 | 139.7883 | 48.46146 | 34.97508 | 55.90894 | 27.53244 | 6.292445 | 14.67385 | 47.10653 | 12.14053 | 33.29729 | 8.97347 | 140.4095 |
| min | 0.01 | 0.01 | 0.01 | 0.01 | 0.0 | 0.01 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 | 5.0 |
| 25% | 28.16 | 64.0 | 3.05 | 13.1 | 11.35 | 11.23 | 0.41 | 4.25 | 11.02 | 0.08 | 0.34 | 0.0 | 84.0 |
| 50% | 52.59 | 116.25 | 7.15 | 24.79 | 22.86 | 22.35 | 0.8 | 8.25 | 24.75 | 0.96 | 3.4 | 0.2 | 131.0 |
| 75% | 97.74 | 204.0 | 18.58 | 45.48 | 45.7 | 37.78 | 1.38 | 14.53 | 49.53 | 3.23 | 15.1 | 1.83 | 259.0 |
| max | 1000.0 | 1000.0 | 500.0 | 499.99 | 500.0 | 499.97 | 498.57 | 199.96 | 997.0 | 498.07 | 499.99 | 499.98999 | 3133.0 |


In [20]:
station_df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2589083 entries, 0 to 2589082
Data columns (total 16 columns):
 #   Column      Non-Null Count    Dtype         
---  ------      --------------    -----         
 0   StationId   2589083 non-null  object        
 1   Datetime    2589083 non-null  datetime64[ns]
 2   PM2.5       1941394 non-null  float32       
 3   PM10        1469831 non-null  float32       
 4   NO          2035372 non-null  float32       
 5   NO2         2060110 non-null  float32       
 6   NOx         2098275 non-null  float32       
 7   NH3         1352465 non-null  float32       
 8   CO          2089781 non-null  float32       
 9   SO2         1846346 non-null  float32       
 10  O3          1863110 non-null  float32       
 11  Benzene     1727504 non-null  float32       
 12  Toluene     1546717 non-null  float32       
 13  Xylene      513979 non-null   float32       
 14  AQI         2018893 non-null  float32       
 15  AQI_Bucket  2018893 non-null  object        
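
The non-null counts above already hint at how uneven the coverage is: `Xylene` has 513,979 values out of 2,589,083 rows, so roughly 80% of it is missing. A minimal sketch of how the missing share per column could be computed, and of one possible way to handle it, is shown below. The `toy` frame is made up for illustration, and dropping rows without an `AQI` value is an assumed policy, not necessarily the right one for this analysis.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mirrors a few of the real column names.
toy = pd.DataFrame({
    "StationId": ["S1", "S1", "S2", "S2"],
    "PM2.5": [60.5, np.nan, 80.0, 81.5],
    "Xylene": [np.nan, np.nan, np.nan, 0.1],
    "AQI": [np.nan, 120.0, 131.0, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed policy: keep only rows where the AQI was actually reported.
has_aqi = toy.dropna(subset=["AQI"])
print(len(has_aqi), "rows with an AQI value")
```

On the real data the same two calls would apply to `station_df`. Whether to drop, interpolate, or keep the gaps depends on the question being asked.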



## Exploratory Analysis and Visualization

## Ask & Answer Questions

- Top polluted cities
- Least polluted cities
- Most common form of contamination
- Seasonal cycles
- Effect of the 2020 lockdown
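
Several of the questions above reduce to a group-by on the hourly readings. The sketch below uses synthetic data to show one way this might look for city rankings, the seasonal cycle, and the lockdown comparison. The `readings` and `stations` frames, the `City` column, and the April to May comparison window are assumptions made for illustration. The hourly file itself only carries `StationId`, so a station-to-city lookup of some kind would be needed.

```python
import pandas as pd

# Synthetic readings; station ids, dates and AQI values are made up.
readings = pd.DataFrame({
    "StationId": ["S1", "S1", "S2", "S2", "S3", "S3"],
    "Datetime": pd.to_datetime([
        "2019-04-10 09:00", "2020-04-10 09:00",
        "2019-04-10 09:00", "2020-04-10 09:00",
        "2019-12-10 09:00", "2020-04-10 09:00",
    ]),
    "AQI": [200.0, 120.0, 90.0, 60.0, 300.0, 150.0],
})

# Hypothetical station-to-city lookup table.
stations = pd.DataFrame({
    "StationId": ["S1", "S2", "S3"],
    "City": ["CityA", "CityB", "CityA"],
})

df = readings.merge(stations, on="StationId", how="left")

# Top and least polluted cities: rank cities by mean AQI.
city_mean = df.groupby("City")["AQI"].mean().sort_values(ascending=False)
most_polluted = city_mean.head(1)
least_polluted = city_mean.tail(1)

# Seasonal cycle: mean AQI per calendar month across all years.
monthly_mean = df.groupby(df["Datetime"].dt.month)["AQI"].mean()

# Lockdown effect: compare the same calendar window in each year.
# India's nationwide lockdown began on 25 March 2020; the April to May
# window used here is an assumption.
window = df[df["Datetime"].dt.month.isin([4, 5])]
by_year = window.groupby(window["Datetime"].dt.year)["AQI"].mean()

print(city_mean)
print(monthly_mean)
print(by_year)
```

Comparing the same months across years, rather than the weeks before and after the lockdown date, keeps the seasonal cycle from being mistaken for a lockdown effect.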

## Summary 

## Reference 