
# Population Table: Administrative Districts (pandas-profiling)
This project performs exploratory data analysis (EDA) on the population of Malaysia's administrative districts using the pandas-profiling tool.
The dataset *population_district.csv* comes from [OpenDOSM](https://open.dosm.gov.my/data-catalogue/population_population_district_0). It records the date, state, district, gender, ethnicity, age, and population of Malaysia from 2020 to 2023. This project uses pandas, matplotlib, and seaborn to process, clean, analyze, and visualize the dataset.
pandas-profiling is an open-source Python library that automates the exploration and analysis of pandas DataFrames. It generates detailed reports that include descriptive statistics, data quality checks, and interactive visualizations. The package is now published as ydata-profiling, which is why the code in this notebook imports from `ydata_profiling`.

## Getting Pandas Profiling Ready
First, we install pandas-profiling.

In [None]:
!pip install -U pandas-profiling

Collecting pandas-profiling
  Downloading pandas_profiling-3.6.6-py2.py3-none-any.whl (324 kB)
Collecting ydata-profiling (from pandas-profiling)
  Downloading ydata_profiling-4.6.1-py2.py3-none-any.whl (357 kB)
Collecting pydantic>=2 (from ydata-profiling->pandas-profiling)
  Downloading pydantic-2.4.2-py3-none-any.whl (395 kB)
Collecting visions[type_image_path]==0.7.5 (from ydata-profiling->pandas-profiling)
  Downloading visions-0.7.5-py3-none-any.whl (102 kB)
Collecting htmlmin==0.1.12 (from y
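The install log shows that pandas-profiling 3.6.6 pulls in ydata-profiling 4.6.1, so it can help to confirm which packages actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is an illustrative assumption, not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook relies on.
for name in ("pandas", "pandas-profiling", "ydata-profiling"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

The printed versions depend on the environment, so they will not necessarily match the log above.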

## Downloading the Dataset
Next, we import the libraries that we will need.

In [None]:
from pathlib import Path
import pandas as pd
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

Let's load the dataset straight from the GitHub repository into a pandas DataFrame.

In [None]:
url = 'https://raw.githubusercontent.com/drshahizan/Python_EDA/main/assignment/bdm/Truth_Archive/population_district.csv'
df = pd.read_csv(url)
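Before cleaning, it may be worth checking that the columns the next step depends on are present. This is a small sketch under the assumption that only `date` and `age` are required, since those are the two columns the cleaning code reads; the helper `missing_columns` and the example header are hypothetical.

```python
REQUIRED_COLUMNS = ("date", "age")  # assumption: the columns the cleaning step reads


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their original order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Stand-in for df.columns; the real header of the CSV may differ.
example_header = ["date", "state", "district", "population"]
print(missing_columns(example_header))  # ['age']
```

In the notebook, the same check could be run as `missing_columns(df.columns)`, raising an error if the returned list is not empty.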

## Data Preparation and Cleaning

In [None]:
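# Parse the date strings into datetime values
df['date'] = pd.to_datetime(df['date'])
# Collapse the five-year age bands into broader age groups
df['age_group'] = df['age'].apply(lambda x: 'Child' if x in ['0-4', '5-9', '10-14']
                                   else 'Young Adult' if x in ['15-19', '20-24', '25-29', '30-34']
                                   else 'Adult' if x in ['35-39', '40-44', '45-49', '50-54', '55-59', '60-64']
                                   else 'Senior' if x in ['65-69', '70-74', '75-79', '80-84', '85+']
                                   else 'overall_age' if x in ['overall_age']
                                   else 'Unknown')
# Drop aggregate rows, i.e. rows in which any column contains 'overall'.
# DataFrame.apply works column by column here, so the lambda receives a column.
df = df[~df.apply(lambda col: col.astype(str).str.contains('overall', case=False)).any(axis=1)]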
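The nested conditional that builds `age_group` can be hard to read and extend. One possible alternative, sketched here as an illustration rather than the notebook's own method, is a lookup table. The names `AGE_GROUPS`, `BAND_TO_GROUP`, and `to_age_group` are assumptions introduced for this example; the bands and group labels are copied from the code above.

```python
# Hypothetical lookup table: group labels and bands mirror the conditional above.
AGE_GROUPS = {
    "Child": ["0-4", "5-9", "10-14"],
    "Young Adult": ["15-19", "20-24", "25-29", "30-34"],
    "Adult": ["35-39", "40-44", "45-49", "50-54", "55-59", "60-64"],
    "Senior": ["65-69", "70-74", "75-79", "80-84", "85+"],
}

# Invert to band -> group so each lookup is a single dictionary access.
BAND_TO_GROUP = {band: group for group, bands in AGE_GROUPS.items() for band in bands}


def to_age_group(band):
    """Return the broad age group for a five-year band, or 'Unknown'."""
    return BAND_TO_GROUP.get(band, "Unknown")


print(to_age_group("20-24"))  # Young Adult
print(to_age_group("85+"))    # Senior
print(to_age_group("n/a"))    # Unknown
```

With pandas, this could be applied as `df['age_group'] = df['age'].map(to_age_group)`. Unlike the original conditional, this sketch labels `overall_age` as `Unknown`, which makes no difference to the final result because rows containing `overall` are dropped afterwards.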

## Generate the Profile

In [None]:
# Generate the Profiling Report
profile = ProfileReport(
    df, title="Population District Dataset", html={"style": {"full_width": True}}, sort=None
)

In [None]:
# The HTML report in an iframe
profile
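Displaying the report inline works well in a notebook, but the report can also be written to disk so that it can be shared. The sketch below assumes the `profile` object created above and relies on `ProfileReport.to_file`, which writes JSON when the path ends in `.json` and HTML otherwise. The helper name `export_report`, the `reports` directory, and the file stem are illustrative assumptions.

```python
from pathlib import Path


def export_report(profile, out_dir="reports", stem="population_district"):
    """Write the report as HTML and as JSON; return the two output paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = [out / f"{stem}.html", out / f"{stem}.json"]
    for path in paths:
        # to_file picks the output format from the file extension
        profile.to_file(path)
    return paths


# Usage in the notebook (hypothetical output location):
# export_report(profile)
```

The HTML file is the one to share with readers; the JSON file is easier to inspect from other code.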




## Pros and Cons

**Pros:**

1.  Pandas Profiling provides an easy and quick way to perform an initial data analysis, making it useful for exploratory data analysis (EDA).

2. It generates a comprehensive report that includes summary statistics, data type information, data distributions, correlations, and more, which can save a lot of time compared to manually writing code to explore the data.

3. The library includes visualizations like histograms, scatter plots, and correlation matrices, which can help you understand your data visually.

4. The report helps in quickly identifying data quality issues such as missing values, outliers, and duplicated rows.

5. The generated report is presented in a user-friendly HTML format and can be easily shared with team members or stakeholders for better collaboration.

6. You can customize the report to some extent by specifying which analysis you want to include and configuring settings.

**Cons:**

1. Generating reports for large datasets can be resource-intensive, leading to long processing times or even running out of memory. You might need to sample or subset your data in such cases.

2. While Pandas Profiling provides a lot of useful information, it doesn't perform more advanced statistical analysis or machine learning tasks. It's best suited for preliminary data exploration.

3. The reports are designed to be read by a person. A script can write them to HTML or JSON with `to_file`, but an automated data processing pipeline still needs manual review or extra parsing code to act on the findings, so Pandas Profiling may not be the best choice there.

4. While you can customize some aspects of the report, you may find it limiting if you need highly customized data analysis and visualization.

5. Pandas Profiling is a third-party library and may not always be compatible with the latest Pandas or Python versions, which could result in maintenance issues.
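The first limitation above suggests sampling or subsetting large datasets before profiling. A minimal sketch of that idea follows; the helper name `sample_for_profiling`, the 100,000-row threshold, and the seed are assumptions chosen for illustration, not recommendations from the library.

```python
def sample_for_profiling(df, max_rows=100_000, seed=42):
    """Return df itself when it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    # DataFrame.sample draws rows without replacement by default;
    # random_state makes the draw repeatable.
    return df.sample(n=max_rows, random_state=seed)


# Usage in the notebook (hypothetical title):
# profile = ProfileReport(sample_for_profiling(df), title="Sampled report", minimal=True)
```

Passing `minimal=True` to `ProfileReport` additionally switches off the most expensive computations, which can shorten the run on large inputs at the cost of a less detailed report.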

## Conclusion


To sum up, pandas-profiling is a capable automated exploratory data analysis (Auto EDA) tool that produces comprehensive reports and visual summaries of a dataset. Other Auto EDA tools are available in Python, each with its own advantages and drawbacks, but pandas-profiling stands out for its combination of flexibility and in-depth reporting. With it, analysts can quickly understand the structure and contents of their data, which shortens the data exploration process and saves time and effort.

## References and Future Work




> pandas-profiling. (n.d.). GitHub. Retrieved November 8, 2023, from https://github.com/pandas-profiling

