
**Project Title -- Air Pollution Index (API) in Malaysia**

Data from 2005 to 2017 were retrieved from data.gov.my under MOSTI, DOE and KeTSA. These data are licensed under Creative Commons Attribution and Open Definition 2.1.

Data from 2018 to the current date were retrieved directly from the APIMS website. Their copyright status is unclear, but we assume they will eventually be released under the same license as above.


Project Team:

| Name                         | Matric Number | Task                                  |
| ---------------------------- | ------------- | ------------------------------------- |
| MOHAMMED RAZA ASFAK CHIDIMAR | MCS231004     | Case Study 1: Step-by-step EDA        |
| AYAZ RAHMAN BHUIYAN          | MCS232001     | Case Study 2a: Pandas-Profiling       |
| MUSAB IBNE AHMAD             | MCS231017     | Case Study 2b, 2c: DataPrep, SweetViz |



In this case study, we examine Pandas Profiling, an automated Exploratory Data Analysis (EDA) tool. We demonstrate step by step how Pandas Profiling can be applied to a Malaysian dataset. We also summarize the pros and cons of the automated EDA tool used.


**Implementation of Automated EDA Tools**

**STEP 1: Install Pandas Profiling (the `pandas-profiling` package is deprecated; use `ydata-profiling` instead)**

In [None]:
# %pip installs into the environment of the running notebook kernel
%pip install notebook
%pip install -U ydata-profiling

**STEP 2: Import the Python libraries required for automated EDA with Pandas Profiling**

In [1]:
from pathlib import Path
import pandas as pd
from ydata_profiling import ProfileReport

**STEP 3: Retrieving the Cleaned Dataset**

In [2]:
df = pd.read_csv("apims-20kdata.csv")
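Before profiling, it can help to confirm that the file loaded as expected. The following is a minimal sketch and not part of the original notebook: `quick_overview` is a hypothetical helper, and the `station` and `api` columns in the demo frame are invented stand-ins, since the real column names of `apims-20kdata.csv` are not shown here.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count and unique count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # NaN values are not counted as a category
        }
    )


# Invented demo data; replace `demo` with the DataFrame loaded from the CSV.
demo = pd.DataFrame({"station": ["A", "B", "B"], "api": [52.0, None, 61.0]})
overview = quick_overview(demo)
print(overview)
```

With the real data, the same helper would be called as `quick_overview(df)` on the frame loaded above.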

**STEP 4: Generate the profile**

In [None]:
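profile = ProfileReport(
    df,
    title="Air Pollution Index (API) in Malaysia - EDA",
    html={"style": {"full_width": True}},  # let the report use the full page width
    sort=None,  # keep the columns in their original order
)
profile  # display the report inline in the notebook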

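The cell in STEP 4 renders the report inline. To keep a copy outside the notebook, the report can also be written to a standalone HTML file with `to_file`, a method that `ProfileReport` provides. The sketch below is one possible wrapper around that call; `save_report`, the `reports` directory and the file stem are hypothetical names chosen for illustration.

```python
from pathlib import Path


def save_report(profile, out_dir: Path, stem: str = "api_malaysia_eda") -> Path:
    """Write `profile` to <out_dir>/<stem>.html and return the path.

    `profile` is expected to be a ProfileReport (any object with a
    `to_file` method works); the directory and stem are illustrative.
    """
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    out_path = out_dir / f"{stem}.html"  # the .html suffix selects HTML output
    profile.to_file(out_path)
    return out_path


# In the notebook, after STEP 4:
# save_report(profile, Path("reports"))
```

Returning the path makes it easy to log or attach the file in a later cell.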

### **Conclusion and Inferences**

For a comprehensive analysis, researchers and environmental agencies often study trends over time, identifying sources of pollution and implementing measures to improve air quality. Public awareness campaigns and policy interventions may also be designed based on this analysis to address specific pollution challenges in different Malaysian states.

**Pros and Cons of Conducting Automated EDA Using Pandas Profiling**

### Pros

(1) Pandas Profiling is helpful for exploratory data analysis since it offers a quick and simple method of conducting an initial data analysis.

(2) Instead of requiring hand-written code to investigate the data, it creates a comprehensive report with summary statistics, data type information, data distributions, correlations, and more, which can save a significant amount of time.

(3) You can use the visualizations in the library to better comprehend your data graphically. Examples of these visualizations are correlation matrices, scatter plots, and histograms.

(4) Issues with data quality, such as missing values, outliers, and duplicate rows, can be swiftly identified with the aid of this report.

(5) To improve teamwork, the created report can be shared with stakeholders or other team members in an easy-to-use HTML format.

(6) By choosing which analyses to include and adjusting settings, you can tailor the report to some extent.
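Point (6) can be sketched as follows. `title` and `minimal` are existing `ProfileReport` keyword arguments; the helpers `report_options` and `build_report` and the 100,000-row threshold are assumptions made for illustration, not settings used in this project.

```python
def report_options(n_rows: int, large_threshold: int = 100_000) -> dict:
    """Build keyword arguments for ProfileReport.

    `minimal=True` switches off the more expensive computations.
    The threshold is an arbitrary illustrative value.
    """
    return {
        "title": "Air Pollution Index (API) in Malaysia - EDA",
        "minimal": n_rows > large_threshold,
    }


def build_report(df):
    """Create a ProfileReport for `df` using the options above."""
    # Imported lazily so this sketch loads even where ydata-profiling is missing.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, **report_options(len(df)))


print(report_options(20_000))
```

Keeping the options in a plain dictionary makes the chosen settings easy to inspect before any report is generated.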

### Cons

(1) Generating a report for a large dataset can be resource-intensive, resulting in lengthy processing times or even out-of-memory errors. In these situations, you may need to sample or subset your data.

(2) Although Pandas Profiling offers a wealth of helpful data, it is not capable of carrying out more complex statistical analysis or machine learning activities. It works well for initial data investigation.

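(3) The notebook workflow shown here relies on someone running the cells and viewing the results. Pandas Profiling can write a report to disk with `to_file`, but the output is designed for human reading, so it might not be the ideal option when an automated data processing pipeline needs results that can be checked programmatically.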

(4) Although you can alter certain parts of the report, if you require extremely personalized data analysis and visualization, you could find it to be restrictive.

(5) Because Pandas Profiling is a third-party library, maintenance problems can arise when it is not compatible with the most recent Pandas or Python versions.
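The sampling suggested in point (1) can be sketched with plain pandas. `sample_for_profiling`, the 5,000-row cap and the seed are hypothetical choices, and the demo frame is invented data. A random sample may miss rare values, so a report built this way describes the sample rather than the full dataset.

```python
import pandas as pd


def sample_for_profiling(
    df: pd.DataFrame, max_rows: int = 5_000, seed: int = 42
) -> pd.DataFrame:
    """Return `df` unchanged if it is small, otherwise a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # A fixed random_state makes repeated runs profile the same rows.
    return df.sample(n=max_rows, random_state=seed)


# Invented demo data standing in for a large dataset.
demo = pd.DataFrame({"api": range(20_000)})
subset = sample_for_profiling(demo)
print(len(subset))
```

The sampled frame can then be passed to `ProfileReport` in place of the full DataFrame.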

### **References**

ydataai. (n.d.). ydata-profiling: 1 Line of Code Data Quality Profiling & Exploratory Data Analysis for Pandas and Spark DataFrames. GitHub. https://github.com/ydataai/ydata-profiling