## **🧭 Task 1 – Fundamental Data Understanding (Indian Air Pollution Data)**

### **🎯 Objective:**

This task helps you explore the structure and quality of the Indian Air Pollution Dataset provided for your assessment.
You will use Pandas to:

* Combine multiple CSV files into a single dataset, and

* Inspect the merged data: preview rows, check data types, compute summary statistics, and analyse missing values.

## **Mounting the drive**

In [6]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
%cd '/content/drive/MyDrive/Semester1_25/Beijing air quality'
# Change this path to the folder that contains your own data files

/content/drive/MyDrive/Semester1_25/Beijing air quality


In [8]:
# Lists the contents of the current folder; your CSV files should appear here
%ls

EDA__Beijing_data.ipynb
PRSA_Data_Aotizhongxin_20130301-20170228.csv
PRSA_Data_Changping_20130301-20170228.csv
PRSA_Data_Dingling_20130301-20170228.csv
PRSA_Data_Dongsi_20130301-20170228.csv
PRSA_Data_Guanyuan_20130301-20170228.csv
PRSA_Data_Gucheng_20130301-20170228.csv
PRSA_Data_Huairou_20130301-20170228.csv
PRSA_Data_Nongzhanguan_20130301-20170228.csv
PRSA_Data_Shunyi_20130301-20170228.csv
PRSA_Data_Tiantan_20130301-20170228.csv
PRSA_Data_Wanliu_20130301-20170228.csv
PRSA_Data_Wanshouxigong_20130301-20170228.csv


# **🧩 What You Need to Do:**

**Step 1: Import Libraries:**

Start by importing the necessary Python libraries:

In [2]:
import pandas as pd
import os

### **🧩 Merging the CSV files:**

### **Combine all the CSV files in the drive path using either of the two options below:**  
---

**🧩 OPTION 1:**

In [None]:
# Import the necessary libraries
import pandas as pd  # pandas is used for working with data tables
import glob         # glob is used to find files by name patterns

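# STEP 1: Find the input files
# The pattern "*_data.csv" means "find all files that end with '_data.csv'"
# This will find files like: Ahmedabad_data.csv, Delhi_data.csv, Mumbai_data.csv, etc.
# Matching is case-sensitive on Colab, so change the pattern if your files are named differently
# sorted() keeps the file order the same on every run
city_files = sorted(glob.glob("*_data.csv"))

# Stop early with a clear message if nothing matched (pd.concat fails on an empty list)
if not city_files:
    raise FileNotFoundError("No files matched '*_data.csv'; check the folder and the pattern")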

# STEP 2: Create an empty list to store all our city data
# We'll put each city's data in this list before combining them
all_cities_data = []

# STEP 3: Read each city file one by one
for file_name in city_files:
    # Read the current city's CSV file into a DataFrame
    # A DataFrame is like a spreadsheet table in Python
    city_df = pd.read_csv(file_name)

    # Add this city's data to our list
    all_cities_data.append(city_df)

    # Optional: Print which file we just read
    print(f"Loaded: {file_name}")

# STEP 4: Combine all city data into one big table
# pd.concat() joins all the DataFrames in our list together
# ignore_index=True makes sure the row numbers are continuous (0, 1, 2, 3...)
combined_data = pd.concat(all_cities_data, ignore_index=True)

# STEP 5: Save the combined data to a new CSV file
# index=False means don't save the row numbers as a separate column
combined_data.to_csv("all_cities_combined.csv", index=False)

# STEP 6: Show us what we accomplished
# len(city_files) = count of how many city files we combined
# len(combined_data) = total number of rows in the final combined file
print(f"SUCCESS: Combined {len(city_files)} city files into one file with {len(combined_data)} total rows")
print("The combined file is saved as: all_cities_combined.csv")

In [None]:
# Reload the combined file; later cells use both `df` and `df1` for the merged data
df = pd.read_csv('all_cities_combined.csv')
df1 = df
df

**🧩 OPTION 2:**

This option loops over every file in a directory (`drive_path`) and picks out those with a `.csv` extension. Each CSV file is read into a Pandas DataFrame and appended to a list (`dataframes`). Once all files are processed, `pd.concat()` combines the DataFrames in the list into a single DataFrame (`df1`). Passing `ignore_index=True` renumbers the rows from 0, so the combined dataset has one continuous index.

In [None]:
drive_path = '/content/drive/MyDrive/Semester1_25/Beijing air quality'

In [None]:
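dataframes = []
for filename in sorted(os.listdir(drive_path)):
    # Keep only CSV files, and skip the combined file saved by Option 1
    # so that its rows are not counted twice
    if filename.endswith('.csv') and filename != 'all_cities_combined.csv':
        file_path = os.path.join(drive_path, filename)
        station_df = pd.read_csv(file_path)  # Read the CSV file into a DataFrame
        dataframes.append(station_df)  # Add the DataFrame to the list

In [None]:
# Combine everything into one table with a continuous row index
df1 = pd.concat(dataframes, ignore_index=True)
df = df1  # later cells use both names for the merged data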
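
Whichever option you use, it is worth confirming that the merge neither lost nor duplicated rows. The sketch below is one possible check, run on two small made-up tables instead of the real CSV files; the `station` and `PM2.5` values are illustrative only. With real data you would use the list you built above (`all_cities_data` in Option 1 or `dataframes` in Option 2).

```python
import pandas as pd

# Two tiny stand-in tables (made-up values) playing the role of two CSV files
part_a = pd.DataFrame({"station": ["A", "A"], "PM2.5": [10.0, 12.0]})
part_b = pd.DataFrame({"station": ["B", "B", "B"], "PM2.5": [30.0, None, 28.0]})
parts = [part_a, part_b]

merged = pd.concat(parts, ignore_index=True)

# The merged table should have exactly as many rows as all parts together
expected_rows = sum(len(part) for part in parts)
rows_match = len(merged) == expected_rows

# Every part should have the same columns; otherwise concat fills the gaps with NaN
same_columns = all(list(part.columns) == list(parts[0].columns) for part in parts)

print(f"Rows match: {rows_match}, columns match: {same_columns}")
```

If the columns do not match, some files probably use different header names, and the merged table will contain extra columns that are mostly empty.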

## **🧩 Perform Fundamental Data Understanding:**

**Once you have the merged dataset, explore and understand its structure.**

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# shape is a (rows, columns) tuple
rows, columns = df1.shape
print(f'No of Rows: {rows}, No of Columns: {columns}')

No of Rows: 420768, No of Columns: 18


In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.columns
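
The objective also asks for summary statistics, which `describe()` reports for every numeric column. Here is a minimal sketch on a made-up table; the column names `PM2.5` and `NO2` are assumptions, so substitute the columns of your own merged data.

```python
import pandas as pd

# Small made-up sample standing in for the merged dataset
sample = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "PM2.5": [10.0, 20.0, 30.0, None],
    "NO2": [5.0, 7.0, 9.0, 11.0],
})

# count, mean, std, min, quartiles and max for each numeric column;
# text columns such as "station" are left out by default
summary = sample.describe()
print(summary)

# Number of distinct non-missing values in each column (works for text columns too)
distinct_values = sample.nunique()
print(distinct_values)
```

The `count` row only counts non-missing values, so comparing it with the total number of rows gives a first hint about missing data.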

### **Total number of stations in the dataset:**

In [None]:
stations = df1['station'].value_counts()
print(f'Total number of stations in the dataset : {len(stations)}')
stations

### **Displaying the percentage of missing values**

In [None]:
def missing_values_table(df1):
    # Total missing values
    mis_val = df1.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df1)

    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    mis_val_table = mis_val_table.rename(columns={0: 'Missing Values', 1: '% of Total Values'})

    # Sort the table by percentage of missing descending
    mis_val_table = mis_val_table.sort_values('% of Total Values', ascending=False)

    return mis_val_table

missing_values = missing_values_table(df1)
display(missing_values.style.background_gradient(cmap='Blues'))
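
Missing data is often concentrated in a few stations rather than spread evenly. As a possible follow-up to the table above, the share of missing values can also be computed per station. The sketch assumes a `station` column, as used earlier, and made-up `PM2.5` and `CO` columns.

```python
import pandas as pd

# Small made-up sample standing in for the merged dataset
sample = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "PM2.5": [10.0, None, None, None],
    "CO": [300.0, 400.0, None, 500.0],
})

# isnull() gives True/False per cell; the mean of those booleans within each
# station is the fraction missing, and multiplying by 100 gives a percentage
missing_by_station = (
    sample.drop(columns="station")
    .isnull()
    .groupby(sample["station"])
    .mean()
    * 100
)
print(missing_by_station)
```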

# **📊 Analysis Questions to Answer:**

**Include short written answers (3–5 sentences) to these:**

* How many rows and columns are in your merged dataset?

* Which pollutants are included, and which have the most missing data?

* How many unique cities or stations are there?

* What are the average levels of key pollutants?

* Are there any immediate data quality issues (e.g., missing or inconsistent values)?
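
The last two questions are not covered by the cells above, so here is one way they might be approached. This is a sketch on made-up data: the `pollutants` list and all values are assumptions, and the checks shown (duplicate rows and negative readings) are only examples of possible quality issues, not a complete list.

```python
import pandas as pd

# Small made-up sample standing in for the merged dataset
sample = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "PM2.5": [10.0, 20.0, -5.0, 30.0, 30.0],
    "NO2": [5.0, 7.0, 9.0, 11.0, 11.0],
})
pollutants = ["PM2.5", "NO2"]  # assumed names; list your own pollutant columns

# Average level of each pollutant (missing values are skipped by default)
overall_means = sample[pollutants].mean()
print(overall_means)

# Average level of each pollutant per station
station_means = sample.groupby("station")[pollutants].mean()
print(station_means)

# Rows that are exact copies of an earlier row
duplicate_rows = sample.duplicated().sum()

# Concentrations cannot be negative, so count such readings per pollutant
negative_counts = (sample[pollutants] < 0).sum()

print(f"Duplicate rows: {duplicate_rows}")
print(negative_counts)
```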