## EDA with Python workflow

| **Step**                      | **Objective**                     | **Key Actions / Code Examples**                                                                                                        | **Outputs**                        |
| ----------------------------- | --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
| **1. Environment Setup**      | Prepare Python environment        | Install/import libraries: `pandas`, `numpy`, `matplotlib`, `seaborn` (optional) <br> Set display options with `pd.set_option`          | Ready-to-use workspace             |
| **2. Load Data**              | Read dataset with correct parsing | `pd.read_csv("data.csv", sep=",", parse_dates=["DateCol"], index_col=None)`                                                            | DataFrame in memory                |
| **3. Basic Overview**         | Sanity-check dataset              | `df.shape`, `df.head()`, `df.tail()`, `df.sample(5)` <br> `df.dtypes.value_counts()`                                                   | Understand size & sample           |
| **4. Memory Usage**           | Check data size in MB             | `df.memory_usage(deep=True).sum() / 1024**2`                                                                                           | Size in MB                         |
| **5. Missing Values**         | Identify nulls                    | `df.isna().sum()` <br> Calculate %: `(df.isna().sum()/len(df))*100`                                                                    | Missing values table               |
| **6. Duplicates**             | Check data repetition             | `df.duplicated().sum()`                                                                                                                | Count of duplicate rows            |
| **7. Data Types Check**       | Ensure correct types              | Convert and assign back: <br>`df["Col"] = df["Col"].astype("category")`<br>`df["Date"] = pd.to_datetime(df["Date"])`<br>`df["Num"] = pd.to_numeric(df["Num"], errors="coerce")` | Correct dtypes                     |
| **8. Numeric Summary**        | Describe numeric columns          | `df.describe().T` <br> Add skew & kurtosis: `df.skew(numeric_only=True)`, `df.kurtosis(numeric_only=True)`                             | Stats table                        |
| **9. Categorical Summary**    | Analyze category frequency        | `df["Cat"].value_counts()` <br> Plot top N categories                                                                                  | Frequency tables & bar plots       |
| **10. Univariate Plots**      | Visualize each feature            | Numeric: Histograms & boxplots <br> Categorical: Bar charts                                                                            | Distribution plots                 |
| **11. Outlier Detection**     | Spot extreme values               | IQR method on a numeric Series, e.g. `s = df["Num"]`: <br>`q1, q3 = s.quantile([0.25, 0.75])` <br>`iqr = q3-q1` <br> Lower/Upper bounds = `q1-1.5*iqr` & `q3+1.5*iqr` | Outlier list/table                 |
| **12. Correlation Analysis**  | Check numeric relationships       | `df.corr(numeric_only=True)` <br> Plot heatmap                                                                                         | Correlation matrix & plot          |
| **13. Group Analysis**        | Compare across categories         | `df.groupby("Cat")["Target"].mean()`                                                                                                   | Grouped stats                      |
| **14. Crosstabs**             | Explore category vs category      | `pd.crosstab(df["Cat1"], df["Cat2"], normalize="index")`                                                                               | Crosstab table                     |
| **15. Datetime Analysis**     | Analyze trends                    | Extract: `df["Year"] = df["Date"].dt.year`, `df["Month"] = df["Date"].dt.month` <br> Group by time: `df.groupby("Month")["Sales"].sum()` | Trend/time series plots            |
| **16. Target Relationships**  | Link features to target           | Numeric target: correlations <br> Categorical target: group means or counts                                                            | Feature-Target relationship tables |
| **17. Missing Data Strategy** | Plan handling method              | Numerical: mean/median <br> Categorical: `"Unknown"` category or mode                                                                  | Cleaned dataset                    |
| **18. Data Quality Flags**    | Track issues                      | Create columns for missing/outlier flags                                                                                               | Flags in DataFrame                 |
| **19. Save Outputs**          | Preserve findings                 | Save CSV summaries: `df.describe().to_csv("eda_report/summary.csv")` <br> Save plots: `plt.savefig("eda_report/plot.png")`             | `eda_report/` folder               |
| **20. Documentation**         | Record EDA decisions              | Notes on missing value handling, outlier decisions, transformations                                                                    | Reproducible EDA log               |
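
Steps 4 to 6 of the table (memory usage, missing values, duplicates) can be combined into one quick health check. The sketch below is only an illustration: the toy DataFrame, its column names, and the helper `missing_report` are hypothetical and stand in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame(
    {
        "city": ["A", "B", "B", None, "C"],
        "value": [1.0, 2.0, 2.0, 3.0, np.nan],
    }
)


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame(
        {"n_missing": counts, "pct_missing": counts / len(frame) * 100}
    )
    return report.sort_values("pct_missing", ascending=False)


report = missing_report(df)                            # step 5
n_duplicates = int(df.duplicated().sum())              # step 6
size_mb = df.memory_usage(deep=True).sum() / 1024**2   # step 4

print(report)
print(f"duplicate rows: {n_duplicates}, memory: {size_mb:.4f} MB")
```

Sorting by percentage puts the columns that need a missing-data decision (step 17) at the top.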

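The IQR rule from step 11 and the flag column from step 18 fit together naturally. This is a minimal sketch under assumptions: `iqr_bounds` is a hypothetical helper, the column name `value` and the data are made up, and the multiplier 1.5 is the conventional default from the table rather than a tuned choice.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical numeric column with one obviously extreme value.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})

lower, upper = iqr_bounds(df["value"])

# Flag rather than drop, so the decision stays visible and reversible.
# NaN compares as False on both sides, so missing values are not flagged.
df["value_is_outlier"] = (df["value"] < lower) | (df["value"] > upper)

print(f"bounds: [{lower}, {upper}]")
print(df[df["value_is_outlier"]])
```

Keeping the flag as a column means later steps can filter on it or compare results with and without the flagged rows.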

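For steps 8 and 12, selecting the numeric columns first avoids errors when the DataFrame also holds text or categorical columns. A small sketch, assuming a made-up frame with hypothetical columns `region`, `x`, and `y`:

```python
import pandas as pd

# Hypothetical mixed-type frame; column names are illustrative only.
df = pd.DataFrame(
    {
        "region": ["N", "S", "N", "S", "N"],
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.0, 4.0, 6.0, 8.0, 10.0],
    }
)

# Keep only numeric columns so skew, kurtosis, and corr are well defined.
numeric = df.select_dtypes(include="number")

summary = numeric.describe().T           # one row per numeric column
summary["skew"] = numeric.skew()
summary["kurtosis"] = numeric.kurtosis()

corr = numeric.corr()                    # Pearson correlation by default

print(summary[["mean", "std", "skew", "kurtosis"]])
print(corr)
```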
In [None]:
# Import library
import pandas as pd

In [10]:
df = pd.read_csv("population_of_pakistan.csv")


FileNotFoundError: [Errno 2] No such file or directory: 'population_of_pakistan.csv'
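
A `FileNotFoundError` like the one above usually means the relative path is being resolved against a different working directory than the one that holds the file. The sketch below is one way to fail with a more informative message; `load_csv` is a hypothetical helper (not part of pandas), and it assumes the CSV is expected relative to the notebook's working directory.

```python
from pathlib import Path

import pandas as pd


def load_csv(path, **read_csv_kwargs):
    """Read a CSV, failing early with the resolved path and working directory."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found (working directory: {Path.cwd()})"
        )
    return pd.read_csv(csv_path, **read_csv_kwargs)


# The path is relative to the current working directory; adjust it as needed.
try:
    df = load_csv("population_of_pakistan.csv")
except FileNotFoundError as exc:
    df = None
    print(exc)
```

The printed message shows the absolute path that was tried, which makes it easier to see whether the file needs to be moved or the path changed.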