# Handling Outliers in Financial Ratios

Handling outliers in financial ratios (or any financial data) is crucial because they can distort statistical models, impact visualizations, and lead to misleading conclusions. The best approach depends on the nature of the data and the specific financial ratios you're dealing with. Below are some recommended techniques, considering the type of financial data typically found in `industry_df`.

## **Common Financial Ratios:**
- **P/E Ratio (Price-to-Earnings)**: Can have extreme values, especially for companies with low or negative earnings.
- **P/B Ratio (Price-to-Book)**: Extreme values can be due to undervaluation or overvaluation.
- **ROE (Return on Equity)**: Outliers can occur due to non-recurring items or extreme financial performance.
- **Debt-to-Equity Ratio**: Can be highly skewed, especially for companies in capital-intensive industries.
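
To see why a ratio such as P/E produces extreme values, it helps to hold the share price fixed and let earnings per share (EPS) shrink toward zero. The sketch below uses made-up figures, not values from `industry_df`.

```python
# Hypothetical figures: a fixed share price and a range of earnings per share (EPS).
price = 50.0
eps_values = [5.0, 1.0, 0.10, 0.01, -0.50]

# P/E is price divided by EPS, so a small denominator produces a huge ratio.
pe_ratios = [price / eps for eps in eps_values]

for eps, pe in zip(eps_values, pe_ratios):
    print(f"EPS = {eps:>6.2f}  ->  P/E = {pe:>8.1f}")
```

The ratio grows by orders of magnitude as EPS approaches zero and flips sign once earnings turn negative, which is why P/E columns often contain both very large and negative values.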

## **Approaches to Handle Outliers in Financial Data:**

### **1. Identify and Visualize Outliers**

Before deciding on a method, it's important to **visualize** and **quantify** the outliers.

**Visualization:**
- **Box plots**: Box plots give a quick view of the distribution and potential outliers.
- **Histograms**: Histograms can reveal the shape of the distribution and highlight the presence of extreme values.

```python
import matplotlib.pyplot as plt

industry_df.boxplot(figsize=(12, 8))
plt.xticks(rotation=90)
plt.title("Boxplot for Financial Ratios")
plt.tight_layout()
plt.show()
```
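
The plot covers the visual side; for the "quantify" side, one possible sketch is to look at skewness and tail percentiles per column. The toy DataFrame and its column names below are assumptions standing in for `industry_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for industry_df; the column names are hypothetical.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "pe_ratio": np.append(rng.normal(15, 3, 99), 400.0),  # one planted extreme value
    "debt_to_equity": rng.normal(1.0, 0.2, 100),           # no planted outlier
})

# Skewness near 0 suggests symmetry; large positive values suggest a long right tail.
skewness = toy_df.skew()

# Compare the median with the tails to see how far the extremes sit from the bulk.
tail_summary = toy_df.quantile([0.01, 0.50, 0.99]).T

print(skewness)
print(tail_summary)
```

A column whose skewness is far from zero, or whose 99th percentile is far from its median, is a candidate for the treatments described in the following sections.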

### **2. Statistical Methods to Detect Outliers**

You can use statistical methods to **quantify** and **filter outliers**.

**a. Z-Score Method (Standardized Method)**:
- This method is suitable when data follows a **normal distribution**.
- **Z-score** measures how far away a value is from the mean in terms of standard deviations.
  
   **Code Example**:
   ```python
   from scipy.stats import zscore

   # Calculate Z-scores for the selected financial ratios
   z_scores = industry_df[VALUE_METRICS].apply(zscore)
   
   # Keep only rows whose absolute Z-score is below the threshold (e.g., 3) in every column
   industry_df_no_outliers = industry_df[(z_scores.abs() < 3).all(axis=1)]
   ```

   - **Threshold**: A Z-score greater than **3** or less than **-3** typically indicates an outlier.
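
As a rough illustration of what the Z-score filter computes, the calculation can be written out by hand on a toy array (the numbers are invented for the example).

```python
import numpy as np

# Toy data: 19 ordinary values plus one extreme value (illustrative numbers only).
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 11, 13,
                   12, 11, 10, 13, 12, 11, 12, 10, 11, 400], dtype=float)

# Same definition scipy.stats.zscore uses by default (population std, ddof=0).
z = (values - values.mean()) / values.std()

# Flag values more than 3 standard deviations from the mean, in either direction.
is_outlier = np.abs(z) > 3
print(values[is_outlier])
```

Note that the extreme value itself inflates both the mean and the standard deviation. In very small samples this can keep a genuine outlier below the threshold, which is one reason the IQR method is often preferred for skewed data.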

**b. IQR (Interquartile Range) Method**:
- The IQR is more robust and works better when data is **skewed** or **not normally distributed**.
- Values more than **1.5 × IQR** below Q1 or above Q3 (that is, below Q1 - 1.5*IQR or above Q3 + 1.5*IQR) are typically considered outliers.

   **Code Example**:
   ```python
   Q1 = industry_df[VALUE_METRICS].quantile(0.25)
   Q3 = industry_df[VALUE_METRICS].quantile(0.75)
   IQR = Q3 - Q1

   # Filter out outliers
   industry_df_no_outliers = industry_df[~((industry_df[VALUE_METRICS] < (Q1 - 1.5 * IQR)) | 
                                            (industry_df[VALUE_METRICS] > (Q3 + 1.5 * IQR))).any(axis=1)]
   ```

   - **Effect**: This approach is more robust for non-normal data and works well for financial ratios, which are often skewed.
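
A small worked example may make the fences concrete. The helper name `iqr_bounds` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy ratio values with one extreme observation (illustrative only).
ratios = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

lower, upper = iqr_bounds(ratios)
outliers = ratios[(ratios < lower) | (ratios > upper)]

print(lower, upper)
print(outliers.tolist())
```

With pandas' default linear interpolation, Q1 is 3.25 and Q3 is 7.75 here, so the fences land at -3.5 and 14.5 and only the value 100 is flagged. Unlike the Z-score, the fences barely move if the extreme value is made even larger.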

### **3. Treating Outliers (After Detection)**

Once outliers are identified, there are several ways to handle them:

**a. Remove Outliers:**
- If the number of outliers is small and they are likely errors or irrelevant, **removing** them is a reasonable approach.

   ```python
   # Remove rows containing outliers (using IQR method as example)
   industry_df_cleaned = industry_df_no_outliers
   ```

**b. Cap or Clip Outliers:**
- If the outliers are extreme but not errors (e.g., high P/E ratios), **capping** the values can be effective. Capping replaces values that exceed a certain threshold with the threshold value.

   ```python
   # Cap values at the 5th and 95th percentiles of each column
   upper_cap = industry_df[VALUE_METRICS].quantile(0.95)
   lower_cap = industry_df[VALUE_METRICS].quantile(0.05)

   # axis=1 aligns the per-column caps with the DataFrame's columns
   industry_df_clipped = industry_df[VALUE_METRICS].clip(lower=lower_cap, upper=upper_cap, axis=1)
   ```

   - **Effect**: Caps the extreme values at the given percentiles to prevent them from distorting analysis.
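
The following sketch shows percentile capping on synthetic data, so the effect can be checked without `industry_df`. The column names and the planted extremes are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for industry_df[VALUE_METRICS]; column names are hypothetical.
rng = np.random.default_rng(1)
toy_df = pd.DataFrame({
    "pe_ratio": np.append(rng.normal(15, 3, 198), [500.0, 900.0]),
    "pb_ratio": np.append(rng.normal(2, 0.5, 198), [40.0, 60.0]),
})

# Per-column caps: each is a Series indexed by column name.
lower_cap = toy_df.quantile(0.05)
upper_cap = toy_df.quantile(0.95)

# axis=1 aligns the caps with the columns, so each column gets its own limits.
clipped = toy_df.clip(lower=lower_cap, upper=upper_cap, axis=1)

print(toy_df.max())
print(clipped.max())
```

No rows are dropped: the planted extremes are pulled back to the 95th percentile of their own column, and every other value between the caps is left unchanged.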

**c. Impute Outliers:**
- If the outliers are deemed valid but extreme, **imputing** them with more representative values like the **median** or **mean** might be appropriate.

   ```python
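   # Replace values outside the IQR fences (Q1, Q3, IQR from step 2b) with each column's median
   metrics = industry_df[VALUE_METRICS]
   is_outlier = (metrics < Q1 - 1.5 * IQR) | (metrics > Q3 + 1.5 * IQR)
   industry_df_imputed = metrics.mask(is_outlier, metrics.median(), axis=1)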
   ```

   - **Effect**: Replaces extreme outliers with a more central value, making them less influential in model training.

### **4. Log Transformation (for Skewed Data)**
If your financial data is **positively skewed**, applying a **log transformation** can help reduce the impact of extreme values.

```python
import numpy as np

# Apply log transformation to reduce skew; log1p(x) = log(1 + x) is only defined for x > -1
industry_df_transformed = np.log1p(industry_df[VALUE_METRICS])
```

- **Effect**: Log transformation compresses large values, which is particularly useful for right-skewed ratios like P/B, or P/E among companies with positive earnings.
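
Ratios such as P/E and ROE can be negative, and `np.log1p` yields NaN for values below -1. One possible workaround, shown here as a swapped-in alternative rather than part of the method above, is a sign-preserving log. The helper name `signed_log1p` is hypothetical.

```python
import numpy as np

def signed_log1p(x):
    """Symmetric log transform: sign(x) * log(1 + |x|). Defined for all real x."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(np.abs(x))

# Illustrative ratio values, including negatives that plain log1p cannot handle.
ratios = np.array([-250.0, -2.0, 0.0, 0.5, 12.0, 900.0])

transformed = signed_log1p(ratios)
print(np.round(transformed, 3))
```

The transform keeps the ordering and sign of the values and maps zero to zero, while compressing large magnitudes on both sides. Whether that symmetry is appropriate depends on how negative ratios should be interpreted in the analysis.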

### **5. Robust Scaling (For Outliers and Skewness)**
If you have **outliers** and **skewness**, using **RobustScaler** is a good approach. It scales features using the **median and interquartile range** (IQR), making it less sensitive to outliers.

```python
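import pandas as pd
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
# Keep the original index so scaled rows still line up with industry_df
industry_df_scaled = pd.DataFrame(scaler.fit_transform(industry_df[VALUE_METRICS]),
                                  columns=VALUE_METRICS, index=industry_df.index)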
```

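- **Effect**: Scales data while minimizing the influence of outliers on the scaling parameters. The outliers themselves are not removed and remain extreme after scaling.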
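
To make the median/IQR scaling concrete, the same arithmetic can be done by hand in pandas. This is a sketch of what `RobustScaler` computes with its default settings (centering on the median, scaling by the 25th to 75th percentile range); the toy column is invented.

```python
import pandas as pd

# Toy column with one extreme value (hypothetical name and numbers).
toy_df = pd.DataFrame({"pe_ratio": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 500.0]})

median = toy_df.median()
iqr = toy_df.quantile(0.75) - toy_df.quantile(0.25)

# Center on the median and divide by the IQR.
scaled = (toy_df - median) / iqr

print(scaled["pe_ratio"].round(2).tolist())
```

The typical values end up within about one unit of zero, while the extreme value is still far away from the rest. Robust scaling keeps the outlier from distorting the scale of the other points, but it does not cap or remove it.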

---

## **Best Approach for Financial Ratios**:
- **For data with skewness** (like P/E, P/B ratios), **log transformation** or **Box-Cox transformation** can help reduce extreme skewness. Box-Cox requires strictly positive values, so negative or zero ratios must be handled first.
- **For non-normal data** with extreme outliers, using **IQR filtering** or **RobustScaler** is often the best approach.
- **Imputation** of outliers with the median is another viable approach if outliers are valid but need to be controlled.


## Handling Outliers in the Target Variable

Handling outliers in the **target variable** (dependent variable) requires careful consideration because removing or transforming outliers can impact the predictive model's performance and accuracy. The strategy depends on whether the outliers are genuine data points or errors and the nature of the analysis.

Here are some recommended approaches for handling outliers in the target variable:

---

### **1. Identify Outliers in the Target Variable**
Before handling outliers, it's essential to identify them. Visualization and statistical methods can help.

#### **Visualization Techniques**:
- **Boxplot**: Displays potential outliers visually.
- **Histogram**: Shows the distribution and any extreme values.
- **Scatter Plot**: Useful if you want to examine the target variable against predictors.

```python
import matplotlib.pyplot as plt

# Boxplot to visualize outliers
plt.figure(figsize=(8, 6))
industry_df['TARGET'].plot(kind='box')
plt.title("Boxplot of Target Variable")
plt.show()

# Histogram to visualize the distribution
plt.figure(figsize=(8, 6))
industry_df['TARGET'].hist(bins=30)
plt.title("Histogram of Target Variable")
plt.show()
```

---

### **2. Statistical Methods to Detect Outliers**

#### **a. Z-Score Method:**
- If the target variable is **normally distributed**, use the **Z-score** method.
- Values with an absolute Z-score greater than a threshold (e.g., 3) are considered outliers.

```python
from scipy.stats import zscore

# Calculate Z-scores for the target variable
z_scores_target = zscore(industry_df['TARGET'])

# Filter out rows where the absolute Z-score is greater than 3
industry_df_no_outliers = industry_df[abs(z_scores_target) < 3]
```

#### **b. IQR (Interquartile Range) Method:**
- If the target variable is **not normally distributed**, the **IQR** method is more robust.
- Outliers are defined as values outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

```python
Q1 = industry_df['TARGET'].quantile(0.25)
Q3 = industry_df['TARGET'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out rows where the target variable is outside the bounds
industry_df_no_outliers = industry_df[(industry_df['TARGET'] >= lower_bound) & 
                                      (industry_df['TARGET'] <= upper_bound)]
```

---

### **3. Handling Outliers (Once Identified)**

#### **a. Remove Outliers:**
- If the outliers are likely **errors** or **irrelevant**, removing them is reasonable.

```python
industry_df_cleaned = industry_df[(industry_df['TARGET'] >= lower_bound) & 
                                  (industry_df['TARGET'] <= upper_bound)]
```

#### **b. Cap or Winsorize Outliers:**
- If the outliers are **valid** but extreme, **capping** them at the upper and lower bounds can reduce their impact without losing data.

```python
industry_df['TARGET'] = industry_df['TARGET'].clip(lower=lower_bound, upper=upper_bound)
```

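- **Winsorization** is a form of capping in which a fixed fraction of the lowest and highest values (for example, the bottom and top 1%) is replaced with the nearest remaining value.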
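
SciPy provides a winsorizing helper in `scipy.stats.mstats`. The sketch below applies it to a toy array; the values and the 10% limits are chosen only to keep the example small.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Toy target values with one extreme observation (illustrative only).
target = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# limits=(0.1, 0.1): replace the lowest 10% and highest 10% of values
# with the nearest remaining value on each side.
winsorized = winsorize(target, limits=(0.1, 0.1))

print(np.asarray(winsorized))
```

With ten values and 10% limits, one value is replaced at each end. Unlike clipping at the IQR bounds, winsorizing always changes a fixed share of the data, whether or not those values would count as outliers under another rule.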

#### **c. Transform the Target Variable:**
- Apply **log transformation** or **Box-Cox transformation** to reduce the impact of outliers.

```python
import numpy as np

# Apply log transformation (log1p requires values greater than -1; typically used for non-negative targets)
industry_df['TARGET'] = np.log1p(industry_df['TARGET'])  # log(1 + x)
```

- For both positive and negative values, **Yeo-Johnson transformation** works well:

```python
from sklearn.preprocessing import PowerTransformer

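pt = PowerTransformer(method='yeo-johnson')
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to a column
industry_df['TARGET'] = pt.fit_transform(industry_df[['TARGET']]).ravel()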
```
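
Whichever transform is applied to the target, a model trained on it predicts on the transformed scale, so predictions need to be mapped back before they are reported. A minimal sketch for the `log1p` case, using invented values and a stand-in for model output:

```python
import numpy as np

# Hypothetical target values on the original scale (non-negative).
y = np.array([0.0, 0.5, 3.0, 20.0, 1500.0])

# Forward transform used for training.
y_log = np.log1p(y)

# Stand-in for model predictions on the log scale (here simply the transformed targets).
pred_log = y_log.copy()

# Map predictions back to the original scale: expm1 is the inverse of log1p.
pred = np.expm1(pred_log)

print(pred)
```

For a fitted `PowerTransformer`, its `inverse_transform` method plays the same role.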

#### **d. Impute Outliers:**
- Replace outliers with a **median** or **mean** to reduce their influence.

```python
is_outlier = (industry_df['TARGET'] < lower_bound) | (industry_df['TARGET'] > upper_bound)
industry_df.loc[is_outlier, 'TARGET'] = industry_df['TARGET'].median()
```

---

### **4. Model-Specific Handling of Outliers**

Some machine learning models are more robust to outliers than others:

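- **Robust Models**: Tree-based algorithms like **Random Forest**, **Gradient Boosting**, or **XGBoost** are generally less sensitive to outliers in the features. Extreme target values can still pull predictions when training with squared-error loss, so a robust loss (e.g., Huber) is worth considering in that case.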
- **Linear Models**: Outliers can heavily influence linear regression, so handling outliers is critical.

For robust linear regression:
```python
from sklearn.linear_model import HuberRegressor

# Assumes every remaining column is numeric and free of missing values
X = industry_df.drop('TARGET', axis=1)
y = industry_df['TARGET']

# Robust regression that is less sensitive to outliers
model = HuberRegressor()
model.fit(X, y)
```
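
To get a feel for the difference, the following sketch compares ordinary least squares with `HuberRegressor` on synthetic data where a few targets have been contaminated. The data-generating line and the size of the contamination are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic data (not from industry_df): y = 2x + 1 plus small noise.
rng = np.random.default_rng(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 50)

# Contaminate the last five targets with large positive errors.
y[-5:] += 50.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor(max_iter=1000).fit(X, y)

print(f"OLS slope:   {ols.coef_[0]:.2f}")
print(f"Huber slope: {huber.coef_[0]:.2f}")
```

The least-squares slope is pulled toward the contaminated points, while the Huber fit should stay much closer to the true slope of 2 because large residuals are penalized linearly rather than quadratically.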

---

### **Best Practices:**
1. **Understand the Cause of Outliers**:
   - Are they due to data entry errors, unique events, or valid but extreme observations?
2. **Avoid Blind Removal**:
   - Removing outliers without understanding their significance can lead to loss of important information.
3. **Document Your Process**:
   - Keep track of how you handled outliers and why, especially in financial data where outliers may have significant implications.
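
One way to follow these practices is to flag outliers instead of deleting them and to keep a small record of the rule that was used. The helper `flag_iqr_outliers` and the `audit` dictionary below are hypothetical names for illustration.

```python
import pandas as pd

def flag_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return a copy of df with a boolean '<column>_is_outlier' flag (IQR rule)."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = df.copy()
    out[f"{column}_is_outlier"] = (df[column] < q1 - k * iqr) | (df[column] > q3 + k * iqr)
    return out

# Toy data with a hypothetical TARGET column.
toy_df = pd.DataFrame({"TARGET": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]}, dtype=float)

flagged = flag_iqr_outliers(toy_df, "TARGET")

# A small record of what was flagged and under which rule, to keep with the analysis.
audit = {"column": "TARGET", "rule": "IQR, k=1.5",
         "n_rows": len(flagged), "n_flagged": int(flagged["TARGET_is_outlier"].sum())}
print(audit)
```

Keeping the flag column leaves the original values available for review, and the choice between removing, capping, or imputing can be made (and revisited) later.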