# Skewness in Data
## **How to Address Skewness in Data**

Once you've identified skewness, several techniques can transform the data to reduce it and make the features better suited to regression or other modeling tasks. Here are some common methods:

### **1. Use Log Transformation (for Positive Skew)**
A **log transformation** compresses large values and spreads smaller values, making the distribution more symmetric.

#### Code Example:
```python
import numpy as np

# Apply log(1 + x) to positively skewed features; log1p is defined at 0, unlike np.log
log_transformed = np.log1p(industry_df[VALUE_METRICES])
```

- **When to use**: For **non-negative** values with `np.log1p` (which is defined at 0); a plain `np.log` requires **strictly positive** values.
- **Effect**: Reduces positive skew (right tail).
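To see the effect on a concrete case, the sketch below applies `np.log1p` to a synthetic right-skewed column and measures skewness before and after. The `revenue` series and its lognormal parameters are assumptions made up for illustration, not part of `industry_df`. It also shows that `np.expm1` undoes the transformation, which is useful when predictions have to be reported on the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (hypothetical stand-in for a real metric)
rng = np.random.default_rng(42)
revenue = pd.Series(rng.lognormal(mean=10, sigma=1.0, size=5000), name="revenue")

log_revenue = np.log1p(revenue)  # log(1 + x)

skew_before = revenue.skew()
skew_after = log_revenue.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# np.expm1 is the exact inverse of np.log1p
restored = np.expm1(log_revenue)
```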


### **2. Use Square Root Transformation**
The **square root transformation** can help reduce moderate skewness.

#### Code Example:
```python
sqrt_transformed = industry_df[VALUE_METRICES].apply(lambda x: np.sqrt(x))
```

- **When to use**: Works for **non-negative** data.
- **Effect**: Reduces right skew but less aggressive than log transformation.
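One way to choose between the square root and the log is to compute the skewness each one produces on the same column. The sketch below does this on synthetic gamma-distributed data; the `counts` series is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic, moderately right-skewed, non-negative data (hypothetical)
rng = np.random.default_rng(0)
counts = pd.Series(rng.gamma(shape=2.0, scale=3.0, size=5000))

# Skewness of the raw column and of each candidate transformation
comparison = pd.Series({
    "raw": counts.skew(),
    "sqrt": np.sqrt(counts).skew(),
    "log1p": np.log1p(counts).skew(),
})
print(comparison.round(2))
```

A transformation that overshoots turns right skew into left skew, so the candidate whose skewness is closest to zero is usually the better fit.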


### **3. Use Box-Cox Transformation (for Positive Values)**
The **Box-Cox transformation** applies a parameterized power transformation to make data more normal.

#### Code Example:
```python
from scipy.stats import boxcox

# Apply Box-Cox to each strictly positive column; other columns are returned unchanged
boxcox_transformed = industry_df[VALUE_METRICES].apply(lambda x: boxcox(x)[0] if (x > 0).all() else x)
```

- **When to use**: Only for **strictly positive** data.
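- **Effect**: Fits a power parameter λ by maximum likelihood, so it adapts to both mild and strong skew (λ = 0 corresponds to the log transformation).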
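`boxcox` returns the fitted λ alongside the transformed values, and keeping it is what makes the transformation reversible. The sketch below, on synthetic data assumed for illustration, stores λ and uses `scipy.special.inv_boxcox` to map values back to the original scale.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Synthetic strictly positive, right-skewed data (hypothetical)
rng = np.random.default_rng(1)
values = rng.lognormal(mean=2.0, sigma=0.8, size=3000)

# With no lmbda argument, boxcox estimates lambda by maximum likelihood
transformed, fitted_lambda = boxcox(values)

skew_before = skew(values)
skew_after = skew(transformed)
print(f"lambda: {fitted_lambda:.3f}, skew: {skew_before:.2f} -> {skew_after:.2f}")

# Invert with the same lambda to get back to the original scale
restored = inv_boxcox(transformed, fitted_lambda)
```

To transform new data consistently, pass the stored value back in with `boxcox(new_values, lmbda=fitted_lambda)` rather than refitting.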


### **4. Use PowerTransformer (Yeo-Johnson) for Both Positive and Negative Values**
The **Yeo-Johnson transformation** is similar to Box-Cox but works for both positive and negative values.

#### Code Example:
```python
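import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Fit one Yeo-Johnson lambda per column; keep the original row index
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pd.DataFrame(pt.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)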
```

- **When to use**: Works with **positive and negative** data.
- **Effect**: Reduces skewness and stabilizes variance.
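In a modeling workflow the transformer is usually fitted on the training split only and then reused on held-out data, so that no information leaks from the test set. The following is a minimal sketch of that pattern; `demo_df` and its `profit` and `growth` columns are synthetic assumptions, not the real `industry_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

# Synthetic data (hypothetical); "profit" is right-skewed and contains negatives
rng = np.random.default_rng(7)
demo_df = pd.DataFrame({
    "profit": rng.lognormal(size=2000) - 1.0,
    "growth": rng.normal(size=2000),
})
train_df, test_df = train_test_split(demo_df, test_size=0.25, random_state=0)

# Learn the lambdas on the training rows only, then reuse them on the test rows
pt = PowerTransformer(method="yeo-johnson")
train_t = pd.DataFrame(pt.fit_transform(train_df), columns=train_df.columns, index=train_df.index)
test_t = pd.DataFrame(pt.transform(test_df), columns=test_df.columns, index=test_df.index)

# One fitted lambda per column
fitted_lambdas = dict(zip(train_df.columns, pt.lambdas_))
print(fitted_lambdas)

# inverse_transform maps transformed values back to the original scale
restored = pt.inverse_transform(test_t)
```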


### **5. Use RobustScaler if Outliers are the Cause**
Instead of transforming the data, **RobustScaler** reduces the influence of outliers by scaling based on the median and IQR.

#### Code Example:
```python
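import pandas as pd
from sklearn.preprocessing import RobustScaler

# Center each column on its median and scale by its IQR; keep the original row index
scaler = RobustScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(industry_df[VALUE_METRICES]), columns=VALUE_METRICES, index=industry_df.index)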
```

- **When to use**: When skewness is due to **outliers**.
- **Effect**: Keeps the shape of the distribution (the skewness value itself does not change) but prevents outliers from distorting the centering and scaling.
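Because `RobustScaler` only shifts and rescales each column, the skewness statistic is the same before and after. The sketch below checks this on a synthetic `sales` column (an assumption for illustration): the median moves to 0 and the IQR becomes 1, while skewness stays put.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic right-skewed data (hypothetical)
rng = np.random.default_rng(3)
raw = pd.DataFrame({"sales": rng.lognormal(size=2000)})

scaled = pd.DataFrame(
    RobustScaler().fit_transform(raw), columns=raw.columns, index=raw.index
)

skew_raw = raw["sales"].skew()
skew_scaled = scaled["sales"].skew()
median_scaled = scaled["sales"].median()
iqr_scaled = scaled["sales"].quantile(0.75) - scaled["sales"].quantile(0.25)
print(f"skew raw: {skew_raw:.3f}, skew scaled: {skew_scaled:.3f}")
```

If the goal is a lower skewness value rather than a robust scale, combine the scaler with one of the transformations above.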


### **6. Handle Skewness by Clipping Outliers**
Clip extreme values to a specific percentile to reduce the impact of outliers.

#### Code Example:
```python
# Clip values to the 1st and 99th percentiles
clipped_df = industry_df[VALUE_METRICES].apply(lambda x: x.clip(lower=x.quantile(0.01), upper=x.quantile(0.99)))
```

- **When to use**: When skewness is caused by **extreme outliers**.
- **Effect**: Reduces skew while leaving every value inside the percentile bounds unchanged.
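It is worth checking how many rows clipping actually touches before adopting it. The sketch below clips a synthetic heavy-tailed series (assumed for illustration) at the 1st and 99th percentiles and counts the modified values.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed data (hypothetical)
rng = np.random.default_rng(5)
s = pd.Series(rng.lognormal(sigma=1.5, size=5000))

# Percentile bounds computed from the data itself
lower, upper = s.quantile([0.01, 0.99])
clipped = s.clip(lower=lower, upper=upper)

n_changed = int((clipped != s).sum())
skew_before = s.skew()
skew_after = clipped.skew()
print(f"changed {n_changed} of {len(s)} values; skew {skew_before:.2f} -> {skew_after:.2f}")
```

As with fitted transformers, bounds computed on training data should be reused when clipping new data.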


### **7. Evaluate Results**
After applying any transformation, recalculate skewness to check improvement.

#### Code Example:
```python
# X_transformed is the Yeo-Johnson output from step 4; substitute any transformed DataFrame
transformed_skewness = X_transformed.skew()
print("Skewness after transformation:")
print(transformed_skewness)
```
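A side-by-side report makes the improvement easier to judge than the post-transformation numbers alone. The sketch below builds one for a log transformation; the `original` DataFrame and its `metric_a` and `metric_b` columns are synthetic assumptions standing in for real features.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed features (hypothetical)
rng = np.random.default_rng(11)
original = pd.DataFrame({
    "metric_a": rng.lognormal(size=3000),
    "metric_b": rng.exponential(scale=2.0, size=3000),
})
transformed = np.log1p(original)

# One row per feature: skewness before, after, and whether |skew| shrank
report = pd.DataFrame({
    "skew_before": original.skew(),
    "skew_after": transformed.skew(),
})
report["improved"] = report["skew_after"].abs() < report["skew_before"].abs()
print(report.round(2))
```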

### **Summary Table of Methods**
| Method                 | Use Case                                      | Handles Negative? | Handles Zero?                          |
|------------------------|-----------------------------------------------|-------------------|----------------------------------------|
| Log Transformation     | Positive skew, large values                   | No                | Yes with `log1p`; no with plain `log`  |
| Square Root            | Moderate positive skew                        | No                | Yes                                    |
| Box-Cox                | Positive skew                                 | No                | No                                     |
| Yeo-Johnson            | Positive/negative skew                        | Yes               | Yes                                    |
| RobustScaler           | Outliers distorting scale (skew unchanged)    | Yes               | Yes                                    |
| Clipping               | Outliers causing skew                         | Yes               | Yes                                    |
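The sign constraints in the table can be turned into a simple first-pass rule. `suggest_transform` below is a hypothetical helper written for this illustration, not a library function, and its preferences are one reasonable reading of the table: Box-Cox when every value is strictly positive, `log1p` when zeros are present (since `log1p(0) = 0`), and Yeo-Johnson when there are negatives.

```python
import pandas as pd


def suggest_transform(series: pd.Series) -> str:
    """Hypothetical helper: pick a skew-reducing transform from the sign of the data."""
    values = series.dropna()
    if (values > 0).all():
        return "box-cox"      # strictly positive data
    if (values >= 0).all():
        return "log1p"        # non-negative data that includes zeros
    return "yeo-johnson"      # data with negative values


suggestions = {
    "positive": suggest_transform(pd.Series([1.0, 2.5, 40.0])),
    "with_zero": suggest_transform(pd.Series([0.0, 2.5, 40.0])),
    "with_negative": suggest_transform(pd.Series([-3.0, 2.5, 40.0])),
}
print(suggestions)
```

The rule only checks what each method can accept; whether the result is actually less skewed still needs to be verified as in step 7.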

---

Would you like a specific transformation applied to your data, or further customization based on your dataset?