## **🧭 Task 1 – Fundamental Data Understanding (Indian Air Pollution Data)**

### **🎯 Objective:**

In this task you will explore the structure and quality of the Indian Air Pollution Dataset provided for your assessment.
You will use Pandas to:

* Combine multiple CSV files into a single dataset, and

* Perform fundamental data understanding (data inspection, summary statistics, and missing value analysis).

## **Mounting the drive**

In [41]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [42]:
%cd '/content/drive/MyDrive/Workshop'
# please change the path according to the location of your data

/content/drive/My Drive/Workshop


In [40]:
%ls # it shows all the content of your folder if done properly

Ahmedabad_data.csv       Chennai_data.csv     Kochi_data.csv
Aizawl_data.csv          Coimbatore_data.csv  Kolkata_data.csv
all_cities_combined.csv  Delhi_data.csv       Lucknow_data.csv
Amaravati_data.csv       Ernakulam_data.csv   Mumbai_data.csv
Amritsar_data.csv        Gurugram_data.csv    Patna_data.csv
Bengaluru_data.csv       Guwahati_data.csv    Shillong_data.csv
Bhopal_data.csv          Hyderabad_data.csv   Talcher_data.csv
Brajrajnagar_data.csv    Jaipur_data.csv      Thiruvananthapuram_data.csv
Chandigarh_data.csv      Jorapokhar_data.csv  Visakhapatnam_data.csv


# **🧩 What You Need to Do:**

**Step 1: Import Libraries:**

Start by importing the necessary Python libraries:

In [39]:
import pandas as pd
import os

### **🧩 Merging the CSV files:**



### **Combine all the CSV files in the Drive folder using either of the two options below:**  
---

**🧩 OPTION 1 :**

In [38]:
# Import the necessary libraries
import pandas as pd  # pandas is used for working with data tables
import glob         # glob is used to find files by name patterns

# STEP 1: The pattern "*_data.csv" means "find all files that end with '_data.csv'"
# This will find files like: Ahmedabad_data.csv, Delhi_data.csv, Mumbai_data.csv, etc.
city_files = glob.glob("*_data.csv")

# STEP 2: Create an empty list to store all our city data
# We'll put each city's data in this list before combining them
all_cities_data = []

# STEP 3: Read each city file one by one
for file_name in city_files:
    # Read the current city's CSV file into a DataFrame
    # A DataFrame is like a spreadsheet table in Python
    city_df = pd.read_csv(file_name)

    # Add this city's data to our list
    all_cities_data.append(city_df)

    # Optional: Print which file we just read
    print(f"Loaded: {file_name}")

# STEP 4: Combine all city data into one big table
# pd.concat() joins all the DataFrames in our list together
# ignore_index=True makes sure the row numbers are continuous (0, 1, 2, 3...)
combined_data = pd.concat(all_cities_data, ignore_index=True)

# STEP 5: Save the combined data to a new CSV file
# index=False means don't save the row numbers as a separate column
combined_data.to_csv("all_cities_combined.csv", index=False)

# STEP 6: Show us what we accomplished
# len(city_files) = count of how many city files we combined
# len(combined_data) = total number of rows in the final combined file
print(f"SUCCESS: Combined {len(city_files)} city files into one file with {len(combined_data)} total rows")
print("The combined file is saved as: all_cities_combined.csv")

Loaded: Patna_data.csv
Loaded: Guwahati_data.csv
Loaded: Chandigarh_data.csv
Loaded: Amaravati_data.csv
Loaded: Jorapokhar_data.csv
Loaded: Hyderabad_data.csv
Loaded: Ahmedabad_data.csv
Loaded: Visakhapatnam_data.csv
Loaded: Shillong_data.csv
Loaded: Brajrajnagar_data.csv
Loaded: Bhopal_data.csv
Loaded: Kochi_data.csv
Loaded: Kolkata_data.csv
Loaded: Thiruvananthapuram_data.csv
Loaded: Lucknow_data.csv
Loaded: Aizawl_data.csv
Loaded: Mumbai_data.csv
Loaded: Coimbatore_data.csv
Loaded: Gurugram_data.csv
Loaded: Bengaluru_data.csv
Loaded: Jaipur_data.csv
Loaded: Talcher_data.csv
Loaded: Ernakulam_data.csv
Loaded: Chennai_data.csv
Loaded: Delhi_data.csv
Loaded: Amritsar_data.csv
SUCCESS: Combined 26 city files into one file with 29531 total rows
The combined file is saved as: all_cities_combined.csv
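
Before moving on, it can help to confirm that the merge neither lost nor duplicated rows. The sketch below is a minimal, self-contained illustration of that check. It writes two tiny stand-in files with made-up values into a temporary folder instead of reading the real city files, and the names `tmp_dir`, `frames`, `rows_per_file` and `same_columns` are assumptions introduced here.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny stand-in city files in a temporary folder (made-up values).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"City": ["A", "A"], "PM2.5": [10.0, 12.5]}).to_csv(
    os.path.join(tmp_dir, "A_data.csv"), index=False
)
pd.DataFrame({"City": ["B"], "PM2.5": [30.0]}).to_csv(
    os.path.join(tmp_dir, "B_data.csv"), index=False
)

# sorted() makes the load order reproducible; glob.glob() on its own
# returns matches in an arbitrary, filesystem-dependent order.
city_files = sorted(glob.glob(os.path.join(tmp_dir, "*_data.csv")))

frames = [pd.read_csv(path) for path in city_files]
combined = pd.concat(frames, ignore_index=True)

# Sanity checks: the row counts add up and every file has the same columns.
rows_per_file = [len(frame) for frame in frames]
same_columns = all(
    list(frame.columns) == list(frames[0].columns) for frame in frames
)

print(f"Rows per file: {rows_per_file}, combined: {len(combined)}")
print(f"All files share the same columns: {same_columns}")
```

If the per-file row counts do not sum to the combined length, or the column check fails, inspect the individual files before continuing.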


In [37]:
df= pd.read_csv('all_cities_combined.csv')
df

,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Patna,01/06/2015,,,14.41,25.06,39.32,,1.56,1.80,8.89,0.00,0.29,0.00,,
1,Patna,02/06/2015,,,25.00,22.48,47.50,,2.35,9.69,9.90,0.08,0.83,0.09,,
2,Patna,03/06/2015,,,14.29,17.16,29.81,,1.69,20.61,12.63,0.00,0.33,0.00,,
3,Patna,04/06/2015,,,13.03,15.62,28.63,,1.20,4.35,9.77,0.01,0.28,0.00,,
4,Patna,05/06/2015,,,10.40,10.36,20.14,,1.29,7.22,11.90,0.00,0.15,0.00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29526,Amritsar,27/06/2020,51.10,,25.25,20.80,35.75,14.07,0.66,4.55,16.48,1.30,1.10,8.82,74.0,Satisfactory
29527,Amritsar,28/06/2020,45.24,40.00,23.11,17.90,27.47,13.25,0.63,5.22,16.48,1.16,0.98,7.85,85.0,Satisfactory
29528,Amritsar,29/06/2020,26.77,60.69,26.99,19.83,34.03,14.94,0.67,4.89,16.48,1.30,1.10,8.81,49.0,Good
29529,Amritsar,30/06/2020,41.64,76.49,22.03,15.97,30.60,13.29,0.69,4.67,16.48,1.30,1.10,8.72,66.0,Satisfactory


**🧩 OPTION 2 :**


This option loops over every file in a given folder (drive_path) and picks out those whose names end in .csv. Each matching file is read into a Pandas DataFrame and appended to a list (dataframes). Once all files have been read, pd.concat() joins the DataFrames in the list into a single DataFrame (all_data). Passing ignore_index=True gives the combined data a fresh, continuous row index. Note that if drive_path is the folder where Option 1 saved all_cities_combined.csv, that file also ends in .csv and its rows would be loaded a second time, so point drive_path at a folder that holds only the per-city files.

In [None]:
# Folder that holds the per-city CSV files; change this to match where your data lives
drive_path = '/content/drive/MyDrive/Semester1_25/Beijing air quality'

In [None]:
dataframes = []
for filename in os.listdir(drive_path):
    if filename.endswith('.csv'):  # Check if the file is a CSV file
        file_path = os.path.join(drive_path, filename)
        # Use a separate name here so the merged `df` from Option 1 is not overwritten
        city_df = pd.read_csv(file_path)  # Read the CSV file into a DataFrame
        dataframes.append(city_df)  # Add the DataFrame to the list

In [None]:
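all_data = pd.concat(dataframes, ignore_index=True)  # one DataFrame with a fresh 0..n-1 index
# The cells below use `df`; run `df = all_data` if you took this option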
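
A variation on Option 2 is sketched below using `pathlib`. It is only an illustration under stated assumptions: the folder is a temporary stand-in for `drive_path`, the values are made up, and the `source_file` column is a hypothetical helper added here to record where each row came from. The sketch also skips a previously saved combined file so that its rows are not loaded twice.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary stand-in for drive_path, filled with made-up data.
folder = Path(tempfile.mkdtemp())
pd.DataFrame({"City": ["A"], "NO2": [20.0]}).to_csv(folder / "A_data.csv", index=False)
pd.DataFrame({"City": ["B"], "NO2": [35.0]}).to_csv(folder / "B_data.csv", index=False)
# Pretend an earlier run already wrote a combined file into the same folder.
pd.DataFrame({"City": ["A", "B"], "NO2": [20.0, 35.0]}).to_csv(
    folder / "all_cities_combined.csv", index=False
)

dataframes = []
for csv_path in sorted(folder.glob("*.csv")):
    if csv_path.name == "all_cities_combined.csv":
        continue  # skip the output of a previous merge to avoid duplicate rows
    city_df = pd.read_csv(csv_path)
    city_df["source_file"] = csv_path.name  # hypothetical helper column
    dataframes.append(city_df)

all_data = pd.concat(dataframes, ignore_index=True)
print(all_data)
```

Keeping the file name alongside each row makes it easier to trace an odd value back to the file it came from.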

## **🧩 Perform Fundamental Data Understanding:**

**Once you have the merged dataset, explore and understand its structure.**

In [43]:
df.head()

,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
0,Patna,01/06/2015,,,14.41,25.06,39.32,,1.56,1.8,8.89,0.0,0.29,0.0,,
1,Patna,02/06/2015,,,25.0,22.48,47.5,,2.35,9.69,9.9,0.08,0.83,0.09,,
2,Patna,03/06/2015,,,14.29,17.16,29.81,,1.69,20.61,12.63,0.0,0.33,0.0,,
3,Patna,04/06/2015,,,13.03,15.62,28.63,,1.2,4.35,9.77,0.01,0.28,0.0,,
4,Patna,05/06/2015,,,10.4,10.36,20.14,,1.29,7.22,11.9,0.0,0.15,0.0,,


In [44]:
df.tail()

,City,Date,PM2.5,PM10,NO,NO2,NOx,NH3,CO,SO2,O3,Benzene,Toluene,Xylene,AQI,AQI_Bucket
29526,Amritsar,27/06/2020,51.1,,25.25,20.8,35.75,14.07,0.66,4.55,16.48,1.3,1.1,8.82,74.0,Satisfactory
29527,Amritsar,28/06/2020,45.24,40.0,23.11,17.9,27.47,13.25,0.63,5.22,16.48,1.16,0.98,7.85,85.0,Satisfactory
29528,Amritsar,29/06/2020,26.77,60.69,26.99,19.83,34.03,14.94,0.67,4.89,16.48,1.3,1.1,8.81,49.0,Good
29529,Amritsar,30/06/2020,41.64,76.49,22.03,15.97,30.6,13.29,0.69,4.67,16.48,1.3,1.1,8.72,66.0,Satisfactory
29530,Amritsar,01/07/2020,57.67,100.99,32.81,15.11,30.2,17.73,0.59,3.48,16.48,1.3,1.1,8.82,78.0,Satisfactory


In [47]:
# df.shape is a (rows, columns) tuple
print(f'No of Rows: {df.shape[0]}, No of Columns: {df.shape[1]}')

No of Rows: 29531, No of Columns: 16


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29531 entries, 0 to 29530
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   City        29531 non-null  object 
 1   Date        29531 non-null  object 
 2   PM2.5       24933 non-null  float64
 3   PM10        18391 non-null  float64
 4   NO          25949 non-null  float64
 5   NO2         25946 non-null  float64
 6   NOx         25346 non-null  float64
 7   NH3         19203 non-null  float64
 8   CO          27472 non-null  float64
 9   SO2         25677 non-null  float64
 10  O3          25509 non-null  float64
 11  Benzene     23908 non-null  float64
 12  Toluene     21490 non-null  float64
 13  Xylene      11422 non-null  float64
 14  AQI         24850 non-null  float64
 15  AQI_Bucket  24850 non-null  object 
dtypes: float64(13), object(3)
memory usage: 3.6+ MB
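
The `info()` output lists `Date` as `object`, meaning the dates are stored as plain strings. The rows shown earlier use a day/month/year layout (for example `27/06/2020`), so one possible next step is to convert the column with an explicit format. The sketch below uses a three-row stand-in frame named `sample`, an assumption made here so the example runs on its own.

```python
import pandas as pd

# Stand-in rows that mimic the day/month/year strings in the Date column.
sample = pd.DataFrame({
    "City": ["Patna", "Patna", "Amritsar"],
    "Date": ["01/06/2015", "02/06/2015", "27/06/2020"],
})

# An explicit format avoids ambiguity: "02/06/2015" is 2 June, not 6 February.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

print(sample.dtypes)
print(sample["Date"].min(), "to", sample["Date"].max())
```

Once the column is a datetime type, date ranges, sorting and resampling by month or year behave as expected.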


In [48]:
df.dtypes

City,object
Date,object
PM2.5,float64
PM10,float64
NO,float64
NO2,float64
NOx,float64
NH3,float64
CO,float64
SO2,float64


In [49]:
df.columns

Index(['City', 'Date', 'PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2',
       'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI', 'AQI_Bucket'],
      dtype='object')

### **Total number of cities in the dataset:**

In [51]:
cities = df['City'].value_counts().sort_index()
print(f'Total number of cities in the dataset : {len(cities)}')
cities

Total number of cities in the dataset : 26


City,count
Ahmedabad,2009
Aizawl,113
Amaravati,951
Amritsar,1221
Bengaluru,2009
Bhopal,289
Brajrajnagar,938
Chandigarh,304
Chennai,2009
Coimbatore,386


### **Displaying the percentage of missing values**

In [52]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    mis_val_table = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table = mis_val_table.sort_values('% of Total Values', ascending=False)

    return mis_val_table

missing_values = missing_values_table(df)
display(missing_values.style.background_gradient(cmap='Blues'))

,Missing Values,% of Total Values
Xylene,18109,61.322001
PM10,11140,37.723071
NH3,10328,34.973418
Toluene,8041,27.229014
Benzene,5623,19.041008
AQI,4681,15.851139
AQI_Bucket,4681,15.851139
PM2.5,4598,15.570079
NOx,4185,14.171549
O3,4022,13.619586
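
The percentage column above can also be computed in a single expression, because the mean of a boolean mask is the fraction of `True` values. The sketch below shows this shortcut on a small made-up frame named `toy`; it is an alternative illustration, not a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Small made-up frame; NaN marks a missing reading.
toy = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 40.0],
    "Xylene": [np.nan, np.nan, np.nan, 1.0],
    "CO": [0.5, 0.6, 0.7, 0.8],
})

# isna() gives True/False per cell; mean() turns that into a fraction per column.
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_pct[missing_pct > 0]
print(columns_with_gaps.round(2))
```

Filtering out the columns with no gaps keeps the report short when most columns are complete.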


# **📊 Analysis Questions to Answer:**

**Include short written answers (3–5 sentences) to these:**

* How many rows and columns are in your merged dataset?

* Which pollutants are included, and which have the most missing data?

* How many unique cities or stations are there?

* What are the average levels of key pollutants?

* Are there any immediate data quality issues (e.g., missing or inconsistent values)?
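
For the last question, a few quick checks beyond counting missing values can be useful. The sketch below is one possible set of checks, run on a made-up frame with deliberately planted problems; the frame `toy` and the result names are assumptions introduced for illustration, and whether the real dataset has any of these issues is something you would need to verify yourself.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberately planted problems.
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Delhi", "Mumbai"],
    "Date": ["01/01/2020", "01/01/2020", "02/01/2020", "01/01/2020"],
    "PM2.5": [80.0, 80.0, -5.0, np.nan],
    "AQI": [150.0, 150.0, 60.0, np.nan],
})

# 1. Repeated City/Date pairs (the same day recorded twice for a city).
duplicate_rows = toy.duplicated(subset=["City", "Date"]).sum()

# 2. Negative concentrations, which are physically impossible.
numeric_cols = toy.select_dtypes(include="number").columns
negative_counts = (toy[numeric_cols] < 0).sum()

# 3. Rows where every numeric reading is missing.
empty_rows = toy[numeric_cols].isna().all(axis=1).sum()

print(f"Duplicate City/Date rows: {duplicate_rows}")
print(f"Negative values per column:\n{negative_counts}")
print(f"Rows with no numeric readings: {empty_rows}")
```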

In [53]:
# How many rows and columns are in your merged dataset
print(f'No of Rows: {df.shape[0]}, No of Columns: {df.shape[1]}')

No of Rows: 29531, No of Columns: 16


In [56]:
# Which pollutants are included, and which have the most missing data
df.isna().sum().sort_values(ascending=False)

Xylene,18109
PM10,11140
NH3,10328
Toluene,8041
Benzene,5623
AQI,4681
AQI_Bucket,4681
PM2.5,4598
NOx,4185
O3,4022


In [57]:
# How many unique cities
df['City'].nunique()

26

In [59]:
# What are the average levels of key pollutants
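# df.describe() returns full summary statistics, but a notebook only displays
# the last expression in a cell, so wrap it in display() to see the table here.
df.describe()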

print(f"Average PM2.5: {df['PM2.5'].mean()}")
print(f"Average PM10: {df['PM10'].mean()}")
print(f"Average NO2: {df['NO2'].mean()}")

Average PM2.5: 67.45057794890307
Average PM10: 118.12710293078135
Average NO2: 28.560659061126955
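
Rather than printing one pollutant at a time, the averages can be computed for every numeric column at once, and broken down by city. The sketch below shows both on a small made-up frame named `toy`, so the numbers are illustrative only. Note that `mean()` skips missing values by default, so each average covers only the days that have a reading.

```python
import numpy as np
import pandas as pd

# Made-up readings; one PM2.5 value is missing on purpose.
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "PM2.5": [100.0, 140.0, 40.0, np.nan],
    "NO2": [50.0, 30.0, 20.0, 40.0],
})

# Average of every numeric column; text columns such as City are ignored.
overall_mean = toy.mean(numeric_only=True).round(2)

# The same averages split by city.
city_mean = toy.groupby("City")[["PM2.5", "NO2"]].mean().round(2)

print(overall_mean)
print(city_mean)
```

A per-city breakdown helps here because a single national average can hide large differences between cities.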
