## NOTE: Import the module and call its functions with the module prefix
### To import a module from `src`, import the file by name and use that name as the prefix
import data_utils
data_utils.function_name()

## Phase 2: Exploratory Data Analysis (EDA)

The **Exploratory Data Analysis (EDA)** steps below are grouped into **Beginner**, **Moderate**, and **Advanced** levels, so you can work through them in order as you gain experience.

---

## 🔰 **Beginner Level: Foundation Building**

Basic steps to understand the data and ensure it's usable.

### 1. **Understand the Dataset**

* Load data (`.csv`, `.xlsx`, etc.)
* Check shape, columns, data types (`df.shape`, `df.info()`)
* Preview data (`df.head()`, `df.tail()`)
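
As a rough illustration of these first checks, the sketch below loads a small CSV and inspects it. The CSV text and its column names are made up for the example; with a real file you would pass its path to `pd.read_csv` (or use `pd.read_excel` for `.xlsx`).

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """order_id,product,quantity,price
1,Widget,2,9.99
2,Gadget,1,24.50
3,Widget,5,9.99
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
df.info()          # column names, non-null counts, and dtypes
print(df.head())   # first rows
print(df.tail(2))  # last two rows
```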

### 2. **Handle Missing Values**

* Detect nulls (`df.isnull().sum()`)
* Basic handling: drop rows/columns, fill with mean/median/mode
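
One possible way to apply both options is sketched below on a made-up frame (the `age` and `city` columns are illustrative). Whether to drop or fill depends on how much data is missing and why, so treat the choices here as examples rather than defaults.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value per column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
print(filled)
```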

### 3. **Descriptive Statistics**

* Use `.describe()` for numerical summaries
* Count unique values for categorical columns
* Frequency distribution using `value_counts()`
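
A short sketch of these three summaries, using an invented `category`/`amount` frame:

```python
import pandas as pd

# Illustrative data: one categorical and one numerical column.
df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "A"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Numerical summary (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)

# Number of distinct categories, and how often each one occurs.
n_unique = df["category"].nunique()
freq = df["category"].value_counts()
print(n_unique)
print(freq)
```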

### 4. **Basic Data Cleaning**

* Strip spaces, convert to proper formats (`datetime`, `numeric`)
* Detect duplicates (`df.duplicated()`) and drop them (`df.drop_duplicates()`)
* Rename columns for clarity
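
The cleaning steps above might look like the sketch below. The column names and values are invented, and `errors="coerce"` is one choice among several for handling unparseable numbers.

```python
import pandas as pd

# Illustrative raw data: stray spaces, text-typed dates and numbers, one repeated row.
df = pd.DataFrame({
    "Cust Name": [" Asha ", "Ravi", "Ravi", "Meera "],
    "Order Date": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15"],
    "Amt": ["100", "250", "250", "n/a"],
})

# Rename columns for clarity.
df = df.rename(columns={"Cust Name": "customer", "Order Date": "order_date", "Amt": "amount"})

# Strip surrounding spaces and convert to proper types.
df["customer"] = df["customer"].str.strip()
df["order_date"] = pd.to_datetime(df["order_date"])
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # "n/a" becomes NaN

# Count fully duplicated rows, then drop them.
n_dupes = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes)
print(df)
```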

### 5. **Basic Visualizations**

* Histograms (distribution)
* Bar plots (categorical)
* Box plots (spread and outliers)
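
A minimal plotting sketch with matplotlib follows. The library choice is an assumption (seaborn or `df.plot` would work as well), and the data is invented so that one value stands out in the box plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data with one unusually large amount.
df = pd.DataFrame({
    "category": ["A", "B", "A", "C", "A", "B"],
    "amount": [10, 12, 11, 13, 95, 12],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: distribution of a numerical column.
axes[0].hist(df["amount"].to_numpy(), bins=5)
axes[0].set_title("Histogram of amount")

# Bar plot: row count per category.
counts = df["category"].value_counts()
axes[1].bar(counts.index.tolist(), counts.to_numpy())
axes[1].set_title("Rows per category")

# Box plot: spread and outliers.
axes[2].boxplot(df["amount"].to_numpy())
axes[2].set_title("Box plot of amount")

fig.tight_layout()
```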

---

## 🚀 **Moderate Level: Pattern Discovery**

Begin asking analytical questions and looking for insights.

### 1. **Data Type and Format Refinement**

* Convert data types explicitly (`pd.to_datetime`, `astype()`)
* Create calculated fields (e.g., profit = sales - cost)
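
For example, explicit conversion followed by the `profit = sales - cost` field could be sketched as below (column names and values are illustrative):

```python
import pandas as pd

# Illustrative data where dates and unit counts arrived as text.
df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-10"],
    "units": ["3", "7"],
    "sales": [300.0, 700.0],
    "cost": [180.0, 455.0],
})

# Convert types explicitly rather than relying on inference.
df["order_date"] = pd.to_datetime(df["order_date"])
df["units"] = df["units"].astype(int)

# Calculated field.
df["profit"] = df["sales"] - df["cost"]
print(df.dtypes)
print(df)
```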

### 2. **Outlier Detection**

* Visual: boxplots, scatter plots
* Statistical: IQR method, Z-score
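
The two statistical rules can be sketched as follows on a toy series. The 1.5 × IQR fences are the usual convention; the Z-score cutoff of 2 is chosen only because this sample is tiny.

```python
import pandas as pd

# Illustrative series with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="amount")

# IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = s[(s < lower) | (s > upper)]

# Z-score method: flag values far from the mean in standard-deviation units.
# A cutoff of 3 is common on larger samples; with only 7 points no value
# can reach a z-score of 3, so 2 is used here.
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 2]

print(iqr_outliers.tolist())
print(z_outliers.tolist())
```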

### 3. **Univariate & Bivariate Analysis**

* Distribution of single variables
* Relationships between two variables:

  * Numerical vs Numerical: scatter plot, correlation heatmap
  * Categorical vs Numerical: groupby + aggregation
  * Categorical vs Categorical: crosstab, stacked bar plot
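
The three bivariate cases above map onto `corr`, `groupby`, and `crosstab`. A small sketch with invented columns:

```python
import pandas as pd

# Illustrative data: two categorical and two numerical columns.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North", "South"],
    "segment": ["Retail", "Retail", "Online", "Online", "Retail", "Online"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "units": [60, 50, 40, 30, 20, 10],
})

# Numerical vs numerical: Pearson correlation matrix (the input to a heatmap).
corr = df[["price", "units"]].corr()

# Categorical vs numerical: aggregate the number within each category.
mean_price = df.groupby("region")["price"].mean()

# Categorical vs categorical: contingency table of counts.
table = pd.crosstab(df["region"], df["segment"])

print(corr)
print(mean_price)
print(table)
```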

### 4. **Feature Engineering (Basic)**

* Derive new columns (e.g., `total_spent = quantity * price`)
* Extract features from datetime (month, day, hour)
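
A sketch of both ideas, assuming a `timestamp` column that is already a datetime (the data is invented):

```python
import pandas as pd

# Illustrative transactions.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-07-19 18:45"]),
    "quantity": [2, 5],
    "price": [9.5, 4.0],
})

# Derived column.
df["total_spent"] = df["quantity"] * df["price"]

# Datetime parts via the .dt accessor.
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["hour"] = df["timestamp"].dt.hour
print(df)
```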

### 5. **Handling Imbalanced Data (if applicable)**

* Detect class imbalance
* Consider techniques like stratified sampling (for modeling)
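
To make this concrete, the sketch below measures class shares and draws a sample that keeps them intact. The `label` column and the 90/10 split are made up; `groupby(...).sample` is used here as a plain-pandas stand-in for the stratified split utilities in modeling libraries.

```python
import pandas as pd

# Illustrative imbalanced data: 90 "ok" rows and 10 "fraud" rows.
df = pd.DataFrame({"label": ["ok"] * 90 + ["fraud"] * 10, "value": range(100)})

# Detect imbalance: share of each class.
shares = df["label"].value_counts(normalize=True)
print(shares)

# Stratified sample: take 20% from each class so the proportions are preserved.
sample = df.groupby("label").sample(frac=0.2, random_state=0)
sample_counts = sample["label"].value_counts()
print(sample_counts)
```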

---

## 🧠 **Advanced Level: Deep Insights & Preparation for Modeling**

### 1. **Multivariate Analysis**

* Pair plots
* Correlation matrix
* PCA for dimensionality reduction (for visualization)
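
The sketch below computes a correlation matrix and a two-component PCA projection on synthetic data. PCA is done by hand with NumPy's SVD to keep the example dependency-free; in practice a library implementation such as scikit-learn's `PCA` is the more common choice. Pair plots are omitted because they need a plotting library.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is nearly a multiple of "a"; "c" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Correlation matrix across all numerical columns.
corr = df.corr()
print(corr.round(2))

# PCA by hand: standardize, then take the SVD of the standardized matrix.
z = ((df - df.mean()) / df.std()).to_numpy()
u, s, vt = np.linalg.svd(z, full_matrices=False)

# Share of total variance captured by each principal component.
explained = s**2 / np.sum(s**2)
print(explained.round(3))

# Project onto the first two components for a 2-D scatter plot.
coords_2d = z @ vt[:2].T
```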

### 2. **Advanced Outlier Treatment**

* Winsorization
* Robust scaling
* Isolation Forest (optional)
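
Winsorization and robust scaling can both be approximated in plain pandas, as sketched below. The 5th/95th percentile limits are an arbitrary illustration. Isolation Forest is left out because it needs a separate library such as scikit-learn.

```python
import pandas as pd

# Illustrative series with one extreme value.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# Winsorization: cap values at chosen lower and upper percentiles.
low, high = s.quantile(0.05), s.quantile(0.95)
winsorized = s.clip(lower=low, upper=high)

# Robust scaling: center on the median and divide by the IQR,
# so the extreme value has little influence on the scale.
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr

print(winsorized.tolist())
print(robust.round(2).tolist())
```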

### 3. **Feature Engineering (Advanced)**

* Encoding: one-hot, label, target
* Transformation: log, sqrt, power transformations
* Interaction terms between variables
* Normalization/Standardization
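
A compact sketch of these transformations on invented columns follows. The target encoding here is computed on the full frame purely for illustration; for modeling it should be fitted on training data only to avoid leaking the target.

```python
import numpy as np
import pandas as pd

# Illustrative data with a categorical feature, a skewed feature, and a binary target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "income": [1000.0, 10000.0, 100000.0, 1000000.0],
    "age": [20.0, 30.0, 40.0, 50.0],
    "target": [1, 0, 0, 1],
})

# Encoding: one-hot, label (integer codes), and target (mean target per category).
one_hot = pd.get_dummies(df["color"], prefix="color")
df["color_code"] = df["color"].astype("category").cat.codes
df["color_target"] = df.groupby("color")["target"].transform("mean")

# Transformation: log to compress a skewed scale.
df["log_income"] = np.log10(df["income"])

# Interaction term between two variables.
df["age_x_log_income"] = df["age"] * df["log_income"]

# Standardization (zero mean, unit variance) and min-max normalization.
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()
df["age_minmax"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df)
```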

### 4. **Time Series Decomposition (for time data)**

* Trend, seasonality, residual
* Rolling statistics
* Lag features
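
Rolling statistics and lag features need only pandas, as in the sketch below (the daily `sales` values are invented). A full trend/seasonality/residual decomposition usually relies on a dedicated library such as statsmodels and is not shown here.

```python
import pandas as pd

# Illustrative daily series.
idx = pd.date_range("2024-01-01", periods=8, freq="D")
ts = pd.DataFrame({"sales": [10, 12, 11, 15, 14, 18, 17, 21]}, index=idx)

# Rolling statistics over a 3-day window (first two rows are NaN).
ts["rolling_mean_3"] = ts["sales"].rolling(window=3).mean()
ts["rolling_std_3"] = ts["sales"].rolling(window=3).std()

# Lag feature: yesterday's value, and the day-over-day change.
ts["lag_1"] = ts["sales"].shift(1)
ts["diff_1"] = ts["sales"] - ts["lag_1"]
print(ts)
```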

### 5. **EDA for Text Data**

* Tokenization, stopword removal
* Word frequency analysis
* WordCloud, n-gram analysis
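
A dependency-free sketch of tokenization, stopword removal, and frequency counts is given below. The documents and the tiny stopword set are made up; real projects typically use a library-provided stopword list and tokenizer.

```python
import re
from collections import Counter

# Illustrative documents and a tiny, hypothetical stopword list.
docs = [
    "The battery life is great and the screen is great",
    "The screen cracked after a week",
]
stopwords = {"the", "is", "and", "a", "after"}

def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]

tokens_per_doc = [tokenize(d) for d in docs]

# Word frequency across all documents.
word_freq = Counter(t for toks in tokens_per_doc for t in toks)

# Bigrams (2-grams) within each document. Because stopwords were removed first,
# paired words are not always adjacent in the original text.
bigrams = Counter(pair for toks in tokens_per_doc for pair in zip(toks, toks[1:]))

print(word_freq.most_common(3))
print(bigrams.most_common(3))
```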

### 6. **EDA for Date/Time Patterns**

* Time-based aggregation (monthly, weekly trends)
* Festival/seasonal patterns
* Time-to-event analysis
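
As an illustration, the sketch below aggregates orders by month and weekday and computes a simple time-to-event measure (days from order to shipment). All column names and dates are invented.

```python
import pandas as pd

# Illustrative orders with an order date and a shipped date.
orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2024-01-03", "2024-01-20", "2024-02-11", "2024-02-12", "2024-03-01"]
    ),
    "shipped_date": pd.to_datetime(
        ["2024-01-05", "2024-01-21", "2024-02-15", "2024-02-13", "2024-03-04"]
    ),
    "amount": [100, 150, 200, 50, 300],
})

# Monthly totals (use to_period("W") for weekly trends).
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["amount"].sum()

# Average order amount by day of week, a simple seasonal pattern check.
by_weekday = orders.groupby(orders["order_date"].dt.day_name())["amount"].mean()

# Time to event: days between order and shipment.
orders["days_to_ship"] = (orders["shipped_date"] - orders["order_date"]).dt.days

print(monthly)
print(by_weekday)
print(orders["days_to_ship"].tolist())
```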

---

## ✅ Summary Table

| Level    | Key Focus Areas                                                |
| -------- | -------------------------------------------------------------- |
| Beginner | Data loading, cleaning, missing values, basic visualizations   |
| Moderate | Outliers, feature engineering, bivariate plots, group analysis |
| Advanced | Multivariate EDA, text/time series, scaling, encodings         |

---



In [3]:
#%load_ext autoreload
#%autoreload 2

In [19]:
import sys
print(sys.executable)
# /home/Isha/anaconda3/envs/usd_env/bin/python

/home/Isha/anaconda3/envs/usd_env/bin/python


In [22]:
import sys
import os

sys.path.append(os.path.abspath('../src'))  # adjust relative path if needed
import data_utils


In [23]:
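import data_utils  # importable because ../src was added to sys.path in the cell above
import pandas as pd

# Sample product numbers; some contain non-alphanumeric characters
df = pd.DataFrame({
    'ProductNo': ['ABC123', 'XYZ@456', '789_ABC', 'DEF456!', 'GHI789']
})

# Call the helper through the module prefix, as described in the note at the top
result = data_utils.get_values_with_special_chars(df, 'ProductNo')

print("Values containing special characters:")
print(result)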


Values containing special characters:
['XYZ@456', '789_ABC', 'DEF456!']
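
The body of `get_values_with_special_chars` is not shown in this notebook. The following is a hypothetical sketch that reproduces the output above, assuming "special" means any character other than an ASCII letter or digit (which is why `789_ABC` is flagged). The actual implementation in `data_utils` may differ.

```python
import pandas as pd

def get_values_with_special_chars(df, column):
    """Hypothetical helper: return values in `column` that contain any
    character outside A-Z, a-z, and 0-9."""
    values = df[column].dropna().astype(str)
    mask = values.str.contains(r"[^A-Za-z0-9]", regex=True)
    return values[mask].tolist()

df = pd.DataFrame({
    "ProductNo": ["ABC123", "XYZ@456", "789_ABC", "DEF456!", "GHI789"]
})
result = get_values_with_special_chars(df, "ProductNo")
print(result)
```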
