## **🧭 Task 1 – Fundamental Data Understanding (Indian Air Pollution Data)**

### **🎯 Objective:**

This task helps you explore the structure and quality of the Indian Air Pollution Dataset provided for your assessment.
You will use pandas to:

* Combine multiple CSV files into a single dataset.

* Perform fundamental data understanding (data inspection, summary statistics, and missing-value analysis).

## **Mounting the drive**

In [1]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd '/content/drive/MyDrive/Workshop'
# Change the path above to match the location of your data

/content/drive/MyDrive/Workshop


In [3]:
# List the folder contents; the city CSV files should appear if the path is correct
%ls

Ahmedabad_data.csv       Chennai_data.csv     Kochi_data.csv
Aizawl_data.csv          Coimbatore_data.csv  Kolkata_data.csv
all_cities_combined.csv  Delhi_data.csv       Lucknow_data.csv
Amaravati_data.csv       Ernakulam_data.csv   Mumbai_data.csv
Amritsar_data.csv        Gurugram_data.csv    Patna_data.csv
Bengaluru_data.csv       Guwahati_data.csv    Shillong_data.csv
Bhopal_data.csv          Hyderabad_data.csv   Talcher_data.csv
Brajrajnagar_data.csv    Jaipur_data.csv      Thiruvananthapuram_data.csv
Chandigarh_data.csv      Jorapokhar_data.csv  Visakhapatnam_data.csv


# **📊 Analysis Questions to Answer:**

**Include short written answers (3–5 sentences) to these:**

* How many rows and columns are in your merged dataset?

* Which pollutants are included, and which have the most missing data?

* How many unique cities or stations are there?

* What are the average levels of key pollutants?

* Are there any immediate data quality issues (e.g., missing or inconsistent values)?

In [6]:
import pandas as pd
import glob

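# The pattern "*_data.csv" matches every file whose name ends with "_data.csv".
# The output file all_cities_combined.csv does not match it, so rerunning this
# cell does not read the combined file back in.
city_files = glob.glob("*_data.csv")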

# Create an empty list to store all our city data
all_cities_data = []

# Read each city file one by one
for file_name in city_files:
    # Read the current city's CSV file into a DataFrame
    # A DataFrame is like a spreadsheet table in Python
    city_df = pd.read_csv(file_name)

    # Add this city's data to our list
    all_cities_data.append(city_df)

    # Optional: Print which file we just read
    print(f"Loaded: {file_name}")

# Combine all city data into one big table
# pd.concat() joins all the DataFrames in our list together
# ignore_index=True makes sure the row numbers are continuous (0, 1, 2, 3...)
combined_data = pd.concat(all_cities_data, ignore_index=True)

# Save the combined data to a new CSV file
# index=False means don't save the row numbers as a separate column
combined_data.to_csv("all_cities_combined.csv", index=False)

# Show us what we accomplished
# len(city_files) = count of how many city files we combined
# len(combined_data) = total number of rows in the final combined file
print(f"SUCCESS: Combined {len(city_files)} city files into one file with {len(combined_data)} total rows")
print("The combined file is saved as: all_cities_combined.csv")

Loaded: Lucknow_data.csv
Loaded: Talcher_data.csv
Loaded: Kochi_data.csv
Loaded: Kolkata_data.csv
Loaded: Jaipur_data.csv
Loaded: Aizawl_data.csv
Loaded: Patna_data.csv
Loaded: Thiruvananthapuram_data.csv
Loaded: Amaravati_data.csv
Loaded: Gurugram_data.csv
Loaded: Visakhapatnam_data.csv
Loaded: Ahmedabad_data.csv
Loaded: Jorapokhar_data.csv
Loaded: Bhopal_data.csv
Loaded: Coimbatore_data.csv
Loaded: Mumbai_data.csv
Loaded: Hyderabad_data.csv
Loaded: Ernakulam_data.csv
Loaded: Chandigarh_data.csv
Loaded: Chennai_data.csv
Loaded: Brajrajnagar_data.csv
Loaded: Delhi_data.csv
Loaded: Guwahati_data.csv
Loaded: Bengaluru_data.csv
Loaded: Shillong_data.csv
Loaded: Amritsar_data.csv
SUCCESS: Combined 26 city files into one file with 29531 total rows
The combined file is saved as: all_cities_combined.csv
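
A quick way to sanity-check a merge like the one above is to confirm that the combined row count equals the sum of the per-file row counts and that every input shares the same column layout. The sketch below uses two small in-memory DataFrames in place of the real CSV files; the file names are taken from the folder listing, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for two per-city files (values are made up).
frames = {
    "Delhi_data.csv": pd.DataFrame({"City": ["Delhi", "Delhi"], "PM2.5": [153.0, None]}),
    "Kochi_data.csv": pd.DataFrame({"City": ["Kochi"], "PM2.5": [21.5]}),
}

combined = pd.concat(list(frames.values()), ignore_index=True)

# The merged table should have exactly as many rows as the inputs together.
expected_rows = sum(len(frame) for frame in frames.values())
assert len(combined) == expected_rows

# All inputs should share one column layout; a mismatch would show up as
# extra, mostly empty columns after concatenation.
column_sets = {tuple(frame.columns) for frame in frames.values()}
print(f"rows: {len(combined)}, distinct column layouts: {len(column_sets)}")
```

With the real data, the same two checks can be run on `all_cities_data` and `combined_data` before saving the CSV.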


In [7]:
# How many rows and columns are in your merged dataset?

df = pd.read_csv('all_cities_combined.csv')
n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(f"The merged dataset has {n_rows} rows and {n_cols} columns.")

The merged dataset has 29531 rows and 16 columns.


In [13]:
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    mis_val_table = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table = mis_val_table.sort_values('% of Total Values', ascending=False)

    return mis_val_table

missing_values = missing_values_table(df)
display(missing_values.style.background_gradient(cmap='Greens'))

# Which pollutants are included, and which have the most missing data?

# Pollutant columns: drop the identifiers (City, Date) and the AQI fields,
# which are an index derived from the pollutants rather than pollutants themselves
non_pollutant_columns = ['Date', 'City', 'AQI', 'AQI_Bucket']
pollutant_columns = [col for col in missing_values.index if col not in non_pollutant_columns]

print("\nPollutants included in the dataset:\n")
print(", ".join(pollutant_columns))
print("\n")

# Identify the top 3 pollutants with the most missing data
top_missing = missing_values.loc[pollutant_columns].head(3)
print("Pollutants with the most missing data:\n")
for index, row in top_missing.iterrows():
    print(f"{index}: {row['% of Total Values']:.2f}% missing")


Column,Missing Values,% of Total Values
Xylene,18109,61.322001
PM10,11140,37.723071
NH3,10328,34.973418
Toluene,8041,27.229014
Benzene,5623,19.041008
AQI,4681,15.851139
AQI_Bucket,4681,15.851139
PM2.5,4598,15.570079
NOx,4185,14.171549
O3,4022,13.619586



Pollutants included in the dataset:

Xylene, PM10, NH3, Toluene, Benzene, PM2.5, NOx, O3, SO2, NO2, NO, CO


Pollutants with the most missing data:

Xylene: 61.32% missing
PM10: 37.72% missing
NH3: 34.97% missing


In [15]:
# How many unique cities or stations are there?

num_cities = df['City'].nunique()
unique_cities = df['City'].unique()
print(f"Number of unique cities: {num_cities}")
print("Cities included in the dataset:\n", unique_cities)

Number of unique cities: 26
Cities included in the dataset:
 ['Lucknow' 'Talcher' 'Kochi' 'Kolkata' 'Jaipur' 'Aizawl' 'Patna'
 'Thiruvananthapuram' 'Amaravati' 'Gurugram' 'Visakhapatnam' 'Ahmedabad'
 'Jorapokhar' 'Bhopal' 'Coimbatore' 'Mumbai' 'Hyderabad' 'Ernakulam'
 'Chandigarh' 'Chennai' 'Brajrajnagar' 'Delhi' 'Guwahati' 'Bengaluru'
 'Shillong' 'Amritsar']
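
`nunique()` counts exact string matches, so stray whitespace or inconsistent capitalisation would inflate the city count. The list above looks clean, but a check along these lines can confirm it. The toy values below are made up to show the effect, and the normalisation rule (strip, then title case) is an assumption that suits single-word city names.

```python
import pandas as pd

# Toy City column (made up) containing one inconsistently typed entry.
toy = pd.DataFrame({"City": ["Delhi", "Delhi", "Kochi", " delhi"]})

raw_unique = toy["City"].nunique()

# Normalise whitespace and capitalisation before counting again.
clean_city = toy["City"].str.strip().str.title()
clean_unique = clean_city.nunique()

print(f"unique before cleaning: {raw_unique}, after cleaning: {clean_unique}")

# Rows per city shows whether some cities contribute far fewer records.
print(clean_city.value_counts())
```

If the two counts differ on the real data, the `City` column needs cleaning before any per-city analysis.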


In [16]:
# What are the average levels of key pollutants?

# describe() covers numeric columns only, so the result also includes the AQI index
pollutant_means = df.describe().loc['mean']
print("Average levels of key pollutants:")
print(pollutant_means)

Average levels of key pollutants:
PM2.5       67.450578
PM10       118.127103
NO          17.574730
NO2         28.560659
NOx         32.309123
NH3         23.483476
CO           2.248598
SO2         14.531977
O3          34.491430
Benzene      3.280840
Toluene      8.700972
Xylene       3.070128
AQI        166.463581
Name: mean, dtype: float64
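
These averages are computed over non-missing readings only, because pandas skips `NaN` when taking a mean. For columns with many gaps it helps to report how many readings back each average. The sketch below illustrates this on a toy DataFrame with made-up values; it is not a result from the real dataset.

```python
import pandas as pd

# Toy data (made up): one of Delhi's two readings is missing.
toy = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Kochi", "Kochi"],
    "PM2.5": [100.0, None, 20.0, 40.0],
})

# mean() ignores NaN by default, so this averages the 3 available readings.
overall_mean = toy["PM2.5"].mean()

# count() reports how many non-missing readings back each per-city average.
per_city = toy.groupby("City")["PM2.5"].agg(["mean", "count"])

print(f"overall mean: {overall_mean:.2f}")
print(per_city)
```

On the real data, a column such as Xylene (about 61% missing) has its mean based on well under half of the rows, which is worth stating alongside the figure.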


In [17]:
# Are there any immediate data quality issues (e.g., missing or inconsistent values)?

# Check missing values
missing_values = df.isnull().sum().sort_values(ascending=False)
missing_percent = (missing_values / len(df)) * 100

print("Missing Values Summary:\n")
print(pd.DataFrame({'Missing Values': missing_values, '% of Total Values': missing_percent}).head(10))


# Check data types
print("\nColumn Data Types:\n")
print(df.dtypes)


# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")


# Check for obvious outliers
print("\nSummary statistics for numeric columns:")
print(df.describe().T[['min', 'max', 'mean']])


Missing Values Summary:

            Missing Values  % of Total Values
Xylene               18109          61.322001
PM10                 11140          37.723071
NH3                  10328          34.973418
Toluene               8041          27.229014
Benzene               5623          19.041008
AQI                   4681          15.851139
AQI_Bucket            4681          15.851139
PM2.5                 4598          15.570079
NOx                   4185          14.171549
O3                    4022          13.619586

Column Data Types:

City           object
Date           object
PM2.5         float64
PM10          float64
NO            float64
NO2           float64
NOx           float64
NH3           float64
CO            float64
SO2           float64
O3            float64
Benzene       float64
Toluene       float64
Xylene        float64
AQI           float64
AQI_Bucket     object
dtype: object

Number of duplicate rows: 0

Summary statistics for numeric columns:
           m
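
The data types listing shows `Date` stored as `object` (plain strings), which is itself a data quality point: date-based filtering and resampling need a datetime column. A minimal sketch of two further checks follows, using a toy DataFrame with made-up values. The ISO date format and the rule that negative readings are invalid are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up): one unparseable date and one impossible AQI value.
toy = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "not a date"],
    "AQI": [120.0, -5.0, 300.0],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
toy["Date"] = pd.to_datetime(toy["Date"], errors="coerce")
bad_dates = int(toy["Date"].isna().sum())

# Concentrations and AQI cannot be negative, so values below zero are suspect.
negative_aqi = int((toy["AQI"] < 0).sum())

print(f"unparseable dates: {bad_dates}, negative AQI values: {negative_aqi}")
```

Running the same checks on `df` would show whether every `Date` string parses and whether any numeric column has a minimum below zero.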