# Data Cleaning and Feature Engineering Summary

## Overview
This notebook walks through the data cleaning and feature engineering process for stock analyst data. The pipeline transforms raw analyst recommendations into structured numerical features suitable for machine learning analysis.

## Process Overview
1. **Data Loading & Exploration** - Load raw analyst data and perform initial quality assessment
2. **Data Cleaning** - Handle missing values, data type conversions, and quality issues
3. **Feature Engineering** - Create numerical scores from categorical ratings and calculate growth metrics
4. **Data Validation** - Ensure data quality and export cleaned dataset

## Key Transformations
- **Rating Conversion**: Categorical ratings (Buy/Hold/Sell) → Numerical scores (-3 to 3 scale)
- **Growth Calculations**: Target price deltas and percentage growth metrics
- **Data Quality**: Drop the sparse `brokerage` column, drop rows with missing ratings, and convert `time` to datetime


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")


Libraries imported successfully!


## 1. Load and Explore Original Data

### Process Description
Load the raw analyst data and perform comprehensive exploratory data analysis to understand data structure, quality issues, and potential feature engineering opportunities.

### Key Activities
- **Data Loading**: Import CSV file with analyst recommendations
- **Shape Analysis**: Examine dataset dimensions and basic statistics
- **Data Types**: Identify categorical vs numerical columns
- **Missing Values**: Assess data completeness and quality issues
- **Value Analysis**: Explore unique values in categorical columns

### Conclusion
Initial exploration shows that the raw dataset contains 2,797 analyst recommendations across 9 columns with mixed data types. The categorical rating columns (23 distinct labels in `rating_from`, 20 in `rating_to`) need numerical conversion, and the target price columns are suitable for growth calculations. Missing values are concentrated in `brokerage` (1,720 rows) and in the two rating columns (58 rows each).


In [2]:
# Load original dataset
df = pd.read_csv('extracted_stock_data.csv')
print(f"Original dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()


Original dataset shape: (2797, 9)

Columns: ['ticker', 'company', 'target_from', 'target_to', 'action', 'brokerage', 'rating_from', 'rating_to', 'time']

First few rows:


|   | ticker | company | target_from | target_to | action | brokerage | rating_from | rating_to | time |
| - | ------ | ------- | ----------- | --------- | ------ | --------- | ----------- | --------- | ---- |
| 0 | CECO | CECO Environmental | 44.0 | 52.0 | target raised by | Needham & Company LLC | Buy | Buy | 2025-08-22 00:30:05 |
| 1 | BLND | Blend Labs | 5.25 | 5.25 | reiterated by | Canaccord Genuity Group | Buy | Buy | 2025-08-25 00:30:04 |
| 2 | FLOC | Flowco | 28.0 | 26.0 | target lowered by | Evercore ISI | Outperform | Outperform | 2025-08-07 00:30:07 |
| 3 | VYGR | Voyager Therapeutics | 30.0 | 30.0 | reiterated by | NaN | Buy | Buy | 2025-09-16 00:30:09 |
| 4 | BCBP | BCB Bancorp, Inc. (NJ) | 9.0 | 9.5 | target raised by | Piper Sandler | Neutral | Neutral | 2025-07-31 00:30:08 |


In [3]:
# Analyze unique values in rating columns
print("=== RATING COLUMNS ANALYSIS ===")

# Get unique values for rating_from and rating_to
rating_from_unique = set(df['rating_from'].dropna().unique())
rating_to_unique = set(df['rating_to'].dropna().unique())

print(f"Unique values in 'rating_from': {len(rating_from_unique)}")
print(f"rating_from unique values: {sorted(rating_from_unique)}")

print(f"\nUnique values in 'rating_to': {len(rating_to_unique)}")
print(f"rating_to unique values: {sorted(rating_to_unique)}")

# Find common and different values
common_ratings = rating_from_unique.intersection(rating_to_unique)
only_in_from = rating_from_unique - rating_to_unique
only_in_to = rating_to_unique - rating_from_unique

print(f"\nCommon ratings in both columns: {len(common_ratings)}")
print(f"Common ratings: {sorted(common_ratings)}")

print(f"\nOnly in 'rating_from': {len(only_in_from)}")
print(f"Only in rating_from: {sorted(only_in_from)}")

print(f"\nOnly in 'rating_to': {len(only_in_to)}")
print(f"Only in rating_to: {sorted(only_in_to)}")

# Check for null values
print(f"\nNull values in rating_from: {df['rating_from'].isnull().sum()}")
print(f"Null values in rating_to: {df['rating_to'].isnull().sum()}")


=== RATING COLUMNS ANALYSIS ===
Unique values in 'rating_from': 23
rating_from unique values: ['Buy', 'Cautious', 'Equal Weight', 'Hold', 'In-Line', 'Inline', 'Market Outperform', 'Market Perform', 'Neutral', 'Outperform', 'Outperformer', 'Overweight', 'Peer Perform', 'Positive', 'Sector Outperform', 'Sector Perform', 'Sector Underperform', 'Sector Weight', 'Sell', 'Speculative Buy', 'Strong-Buy', 'Underperform', 'Underweight']

Unique values in 'rating_to': 20
rating_to unique values: ['Buy', 'Cautious', 'Equal Weight', 'Hold', 'In-Line', 'Inline', 'Market Outperform', 'Market Perform', 'Neutral', 'Outperform', 'Overweight', 'Positive', 'Reduce', 'Sector Outperform', 'Sector Perform', 'Sell', 'Speculative Buy', 'Strong-Buy', 'Underperform', 'Underweight']

Common ratings in both columns: 19
Common ratings: ['Buy', 'Cautious', 'Equal Weight', 'Hold', 'In-Line', 'Inline', 'Market Outperform', 'Market Perform', 'Neutral', 'Outperform', 'Overweight', 'Positive', 'Sector Outperform', 'Sect

In [4]:
# Check data types and missing values
print("Data Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())


Data Types:
ticker          object
company         object
target_from    float64
target_to      float64
action          object
brokerage       object
rating_from     object
rating_to       object
time            object
dtype: object

Missing Values:
ticker            0
company           0
target_from       0
target_to         0
action            0
brokerage      1720
rating_from      58
rating_to        58
time              0
dtype: int64
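
Raw counts are easier to act on when expressed as a share of rows: in the output above, `brokerage` is missing in 1,720 of 2,797 rows (about 61%), while each rating column is missing in 58 rows (about 2%). The sketch below shows one way to compute such shares. The toy frame and the 50% threshold are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data (values are illustrative only)
toy = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "brokerage": ["X", np.nan, np.nan, np.nan],
    "rating_to": ["Buy", "Hold", np.nan, "Sell"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: columns missing in more than half of the rows
# are candidates for dropping rather than imputing
drop_threshold = 0.5
drop_candidates = missing_share[missing_share > drop_threshold].index.tolist()
print(drop_candidates)
```

On the toy frame only `brokerage` exceeds the threshold, which mirrors the decision taken in the cleaning step.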


## 2. Data Cleaning Process

### Process Description
Systematic data cleaning to address quality issues, missing values, and data type inconsistencies that could impact downstream analysis and machine learning models.

### Key Activities
- **Missing Value Analysis**: Identify and quantify null values across all columns
- **Data Type Conversion**: Convert string dates to datetime objects for proper time-based analysis
- **Categorical Value Standardization**: Clean and standardize rating labels for consistent mapping
- **Data Validation**: Ensure data integrity after cleaning operations

### Conclusion
Data cleaning removed the `brokerage` column (1,720 missing values) and dropped the 58 rows with missing ratings, leaving 2,739 records with no missing values. The `time` column was converted from string to datetime, which enables time-based analysis. The dataset is now ready for numerical transformation.

### 2.1 Data Cleaning Steps
- **Action**: Drop 'brokerage' column (1,720 of 2,797 values missing, about 61%)
- **Reason**: Too many missing values to be useful for analysis
- **Action**: Drop rows where 'rating_from' or 'rating_to' is null
- **Reason**: Rating changes are core to our analysis
- **Action**: Convert 'time' column to datetime format
- **Reason**: Enable time-based feature extraction


In [5]:
# Apply data cleaning
print("=== DATA CLEANING ===")

# Step 1: Drop brokerage column
df_clean = df.drop(columns=['brokerage'])
print(f"After dropping 'brokerage': {df_clean.shape}")

# Step 2: Drop rows with missing ratings
df_clean = df_clean.dropna(subset=['rating_from', 'rating_to'])
print(f"After dropping missing ratings: {df_clean.shape}")

# Step 3: Convert time to datetime
df_clean['time'] = pd.to_datetime(df_clean['time'])
print(f"Time column converted to datetime")

print(f"\nFinal cleaned dataset shape: {df_clean.shape}")
print(f"Missing values remaining: {df_clean.isnull().sum().sum()}")


=== DATA CLEANING ===
After dropping 'brokerage': (2797, 8)
After dropping missing ratings: (2739, 8)
Time column converted to datetime

Final cleaned dataset shape: (2739, 8)
Missing values remaining: 0
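
The datetime conversion is motivated by time-based feature extraction, which this notebook does not perform itself. As a minimal sketch of what that could look like, the `.dt` accessor exposes calendar components once a column has a datetime dtype. The feature names `year`, `month`, and `day_of_week` are hypothetical and do not appear in the exported dataset.

```python
import pandas as pd

# Three timestamps copied from the sample rows shown earlier
times = pd.to_datetime(pd.Series([
    "2025-08-22 00:30:05",
    "2025-08-25 00:30:04",
    "2025-07-31 00:30:08",
]))

# Calendar components derived through the .dt accessor
time_features = pd.DataFrame({
    "year": times.dt.year,
    "month": times.dt.month,
    "day_of_week": times.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(time_features)
```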


## 3. Feature Engineering

### Process Description
Transform categorical analyst data into numerical features suitable for machine learning algorithms. Create meaningful metrics that capture analyst sentiment, target price dynamics, and growth potential.

### Key Activities
- **Rating Score Conversion**: Map categorical ratings to numerical scores (-3 to 3 scale)
- **Delta Calculations**: Compute rating changes and target price revisions
- **Growth Metrics**: Calculate percentage growth and relative performance measures
- **Magnitude Analysis**: Assess the significance of rating changes

### Conclusion
Feature engineering created 7 numerical features from the cleaned data, enabling quantitative analysis of analyst behavior. The new features capture both the direction of changes (deltas) and their magnitude, providing input for clustering and prediction models. The rating scale runs from -3 to 3 and is centered on 0 (neutral), so the sign of a score indicates bearish or bullish sentiment.

### 3.1 Rating Score Mapping
Convert categorical ratings to numerical scores for analysis:


| Rating Label               | Score | Sentiment                     |
| -------------------------- | ----- | ----------------------------- |
| 🔴 **Strong Sell**         | -3    | 🔻 Strongly Negative          |
| 🔴 **Sector Underperform** | -3    | 🔻 Strongly Negative          |
| 🔴 **Underperform**        | -3    | 🔻 Strongly Negative          |
| 🔴 **Reduce**              | -3    | 🔻 Strongly Negative          |
| 🟠 **Sell**                | -2    | 📉 Moderately Negative        |
| 🟠 **Underweight**         | -2    | 📉 Moderately Negative        |
| ⚪ **Hold**                | 0     | ⚖️ Neutral                    |
| ⚪ **Neutral**             | 0     | ⚖️ Neutral                    |
| ⚪ **Equal Weight**        | 0     | ⚖️ Neutral                    |
| ⚪ **In-Line**             | 0     | ⚖️ Neutral                    |
| ⚪ **Inline**              | 0     | ⚖️ Neutral                    |
| ⚪ **Market Perform**      | 0     | ⚖️ Neutral                    |
| ⚪ **Peer Perform**        | 0     | ⚖️ Neutral                    |
| ⚪ **Sector Perform**      | 0     | ⚖️ Neutral                    |
| ⚪ **Sector Weight**       | 0     | ⚖️ Neutral                    |
| ⚪ **Cautious**            | 0     | ⚖️ Neutral / Slightly Bearish |
| 🟢 **Buy**                 | 2     | 📈 Moderately Positive        |
| 🟢 **Overweight**          | 2     | 📈 Moderately Positive        |
| 🟢 **Positive**            | 2     | 📈 Moderately Positive        |
| 🟢 **Outperform**          | 2     | 📈 Moderately Positive        |
| 🟢 **Outperformer**        | 2     | 📈 Moderately Positive        |
| 🟢 **Market Outperform**   | 2     | 📈 Moderately Positive        |
| 🟢 **Sector Outperform**   | 2     | 📈 Moderately Positive        |
| 🟢 **Speculative Buy**     | 2     | 📈 Moderately Positive        |
| 🟢 **Strong-Buy**          | 3     | 🚀 Strongly Positive          |


In [6]:
# Rating mapping
rating_map = {
    # Strongly negative
    "Strong Sell": -3,
    "Sector Underperform": -3,
    "Underperform": -3,
    "Reduce": -3,

    # Moderately negative
    "Sell": -2,
    "Underweight": -2,

    # Neutral / cautious
    "Hold": 0,
    "Neutral": 0,
    "Equal Weight": 0,
    "In-Line": 0,
    "Inline": 0,
    "Market Perform": 0,
    "Peer Perform": 0,
    "Sector Perform": 0,
    "Sector Weight": 0,
    "Cautious": 0,

    # Moderately positive
    "Buy": 2,
    "Overweight": 2,
    "Positive": 2,
    "Outperform": 2,
    "Outperformer": 2,
    "Market Outperform": 2,
    "Sector Outperform": 2,
    "Speculative Buy": 2,

    # Strongly positive
    "Strong-Buy": 3
}

# Map ratings to scores
df_clean["rating_from_score"] = df_clean["rating_from"].map(rating_map)
df_clean["rating_to_score"] = df_clean["rating_to"].map(rating_map)

print("Rating mapping completed!")
print(f"Unique rating_from values: {df_clean['rating_from'].nunique()}")
print(f"Unique rating_to values: {df_clean['rating_to'].nunique()}")


Rating mapping completed!
Unique rating_from values: 23
Unique rating_to values: 20
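
`Series.map` with a dictionary returns `NaN` for any label that is not a key, so a label missing from `rating_map` would silently become a missing score. The cell above prints the number of distinct labels but does not check for this. The sketch below shows one possible check on toy data. The reduced map and the label `"Moderate Buy"` are hypothetical and serve only to trigger the unmapped case.

```python
import pandas as pd

# Reduced, illustrative version of the rating map
rating_map_demo = {"Buy": 2, "Hold": 0, "Sell": -2}

# "Moderate Buy" is a made-up label that the demo map does not cover
ratings = pd.Series(["Buy", "Hold", "Moderate Buy", "Sell"])
scores = ratings.map(rating_map_demo)

# Labels that produced NaN after mapping
unmapped = sorted(ratings[scores.isna()].unique())
print(f"Unmapped labels: {unmapped}")
```

Applied to `df_clean`, the same pattern would confirm that every label in `rating_from` and `rating_to` received a score.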


### 3.2 Rating Analysis Features
- **rating_delta**: Change in rating score
- **rating_magnitude**: Absolute change in rating


In [7]:
# Rating delta and magnitude
df_clean["rating_delta"] = df_clean["rating_to_score"] - df_clean["rating_from_score"]
df_clean["rating_magnitude"] = df_clean["rating_delta"].abs()

print("Rating analysis features created!")
print(f"Rating delta distribution:")
print(df_clean['rating_delta'].value_counts().sort_index())


Rating analysis features created!
Rating delta distribution:
rating_delta
-5       3
-4       3
-3       8
-2     129
-1       1
 0    2465
 1       3
 2     119
 3       6
 4       1
 5       1
Name: count, dtype: int64
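
In the distribution above, 2,465 of 2,739 records (about 90%) have a delta of 0, so most records keep the same rating score. For reporting, the signed delta can be collapsed into a direction label. The following is a minimal sketch on toy values. The labels `downgrade`, `unchanged`, and `upgrade` are assumptions and not features of the exported dataset.

```python
import numpy as np
import pandas as pd

# Toy rating deltas (illustrative only)
deltas = pd.Series([-5, -2, 0, 0, 2, 1])

# np.sign reduces each delta to -1, 0, or 1; map turns that into a label
direction = np.sign(deltas).map({-1: "downgrade", 0: "unchanged", 1: "upgrade"})
print(direction.value_counts())
```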


### 3.3 Target Price Analysis Features
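- **target_delta**: Signed change in target price (`target_to - target_from`)
- **target_growth**: Change in target price relative to the previous target (0 when `target_from` is 0)
- **relative_growth**: Deviation of `target_delta` from the dataset-wide mean delta, scaled by the absolute value of that mean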
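
Written out, the three features computed in the next cell take the following form. The symbols $\Delta_i$ (target delta of record $i$) and $\bar{\Delta}$ (mean target delta over all records) are notation introduced here for readability.

```latex
\begin{aligned}
\Delta_i &= \text{target\_to}_i - \text{target\_from}_i \\[4pt]
\text{target\_growth}_i &=
  \begin{cases}
    \Delta_i / \text{target\_from}_i & \text{if } \text{target\_from}_i \neq 0 \\
    0 & \text{otherwise}
  \end{cases} \\[4pt]
\text{relative\_growth}_i &=
  \begin{cases}
    (\Delta_i - \bar{\Delta}) / |\bar{\Delta}| & \text{if } \bar{\Delta} \neq 0 \\
    0 & \text{otherwise}
  \end{cases}
\end{aligned}
```

For the first sample row (target raised from 44.0 to 52.0), $\Delta = 8$ and the growth is $8 / 44 \approx 0.182$. With the mean delta of about 1.547 reported in the summary statistics, the relative growth is $(8 - 1.547) / 1.547 \approx 4.171$.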


In [8]:
# Target calculations
# Target delta: signed change in target price
df_clean["target_delta"] = df_clean["target_to"] - df_clean["target_from"]

# Target growth: change relative to the previous target (0 when target_from is 0)
df_clean["target_growth"] = np.where(
    df_clean["target_from"] == 0,
    0,
    df_clean["target_delta"] / df_clean["target_from"]
)

# Relative growth: deviation from the mean target delta, scaled by |mean|
mean_target_delta = df_clean["target_delta"].mean()

df_clean["relative_growth"] = np.where(
    mean_target_delta == 0,
    0,
    (df_clean["target_delta"] - mean_target_delta) / abs(mean_target_delta)
)

print(f"Target price features created!")
print(f"Mean target delta: ${mean_target_delta:.2f}")
print(f"Target growth range: {df_clean['target_growth'].min():.3f} to {df_clean['target_growth'].max():.3f}")


Target price features created!
Mean target delta: $1.55
Target growth range: -1.000 to 4.000
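
`np.where` evaluates both branches before selecting, so the division in the cell above is still computed for rows where `target_from` is 0, and the resulting `inf` or `NaN` is then discarded. An alternative that never divides by zero is sketched below on toy values. It is one possible variant, not the notebook's method, and produces the same result for the guarded rows.

```python
import numpy as np
import pandas as pd

# Toy targets; the second row has a zero starting target (illustrative only)
demo = pd.DataFrame({
    "target_from": [44.0, 0.0, 28.0],
    "target_to": [52.0, 5.0, 26.0],
})

delta = demo["target_to"] - demo["target_from"]

# Swap zero denominators for NaN, divide, then fill the undefined rows with 0
safe_denominator = demo["target_from"].replace(0, np.nan)
growth = (delta / safe_denominator).fillna(0.0)
print(growth.round(4).tolist())
```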


## 4. Data Quality Assessment

### Process Description
Comprehensive validation of the cleaned and engineered dataset to ensure data quality, consistency, and readiness for downstream machine learning applications.

### Key Activities
- **Final Quality Check**: Verify no missing values or data type issues remain
- **Feature Validation**: Ensure all engineered features have reasonable value ranges
- **Data Distribution Analysis**: Examine feature distributions for potential outliers
- **Export Preparation**: Prepare dataset for technical indicators integration

### Conclusion
Data quality assessment confirms that the dataset has no missing values. All 2,739 records contain complete information across 15 columns (8 retained from the original data + 7 engineered). Rating scores stay within the -3 to 3 range, but the target price features are heavy-tailed: `target_delta` reaches -920 against a standard deviation of about 37, so outlier handling may be needed before clustering. The cleaned dataset is the input for technical indicators integration and subsequent clustering analysis.
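
One common way to screen a feature for outliers is the interquartile-range (IQR) rule. The sketch below applies it to toy values that include the -920 minimum reported in the summary statistics. The helper name `iqr_outlier_mask` and the multiplier `k = 1.5` are assumptions chosen for illustration, and the notebook itself does not run this check.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy target deltas (illustrative only)
demo_delta = pd.Series([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 8.0, -920.0])

mask = iqr_outlier_mask(demo_delta)
print(demo_delta[mask].tolist())
```

Whether flagged rows should be clipped, transformed, or kept depends on the downstream model, so the mask only identifies candidates.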

### 4.1 Final Dataset Overview


In [9]:
print(f"=== FEATURE ENGINEERING ===")

# Engineered columns created in section 3
new_features = ['rating_from_score', 'rating_to_score', 'rating_delta', 'rating_magnitude',
                'target_delta', 'target_growth', 'relative_growth']

# Count the engineered columns directly: comparing shapes against df would
# undercount by one because 'brokerage' was dropped during cleaning
print(f"Features added: {len(new_features)} new columns")
print(f"Final dataset shape: {df_clean.shape}")

# Show sample of new features
print(f"\nSample of new features:")
print(df_clean[new_features].head())


=== FEATURE ENGINEERING ===
Features added: 7 new columns
Final dataset shape: (2739, 15)

Sample of new features:
   rating_from_score  rating_to_score  rating_delta  rating_magnitude  \
0                  2                2             0                 0   
1                  2                2             0                 0   
2                  2                2             0                 0   
3                  2                2             0                 0   
4                  0                0             0                 0   

   target_delta  target_growth  relative_growth  
0           8.0       0.181818         4.171169  
1           0.0       0.000000        -1.000000  
2          -2.0      -0.071429        -2.292792  
3           0.0       0.000000        -1.000000  
4           0.5       0.055556        -0.676802  


### 4.2 Feature Summary Statistics


In [10]:
print(f"\nFeature statistics:")
print(df_clean[new_features].describe())



Feature statistics:
       rating_from_score  rating_to_score  rating_delta  rating_magnitude  \
count        2739.000000      2739.000000   2739.000000       2739.000000   
mean            1.145674         1.130340     -0.015334          0.211026   
std             1.212137         1.233284      0.685165          0.652026   
min            -3.000000        -3.000000     -5.000000          0.000000   
25%             0.000000         0.000000      0.000000          0.000000   
50%             2.000000         2.000000      0.000000          0.000000   
75%             2.000000         2.000000      0.000000          0.000000   
max             3.000000         3.000000      5.000000          5.000000   

       target_delta  target_growth  relative_growth  
count   2739.000000    2739.000000     2.739000e+03  
mean       1.547039       0.041144    -2.697935e-16  
std       37.126171       0.232055     2.399821e+01  
min     -920.000000      -1.000000    -5.956844e+02  
25%       -0.50

## 5. Save Cleaned Dataset

### Process Description
Export the cleaned and feature-engineered dataset for use in subsequent analysis steps, ensuring proper formatting and data integrity.

### Key Activities
- **Dataset Export**: Save cleaned dataset to CSV format
- **File Validation**: Verify export integrity and completeness
- **Documentation**: Record dataset characteristics and feature descriptions

### Conclusion
The cleaned dataset has been successfully exported as `stock_data_cleaned_and_features.csv` with 2,739 records and 15 features. This dataset serves as the foundation for technical indicators integration and clustering analysis. The comprehensive feature engineering provides rich numerical representations of analyst behavior and market dynamics, enabling sophisticated machine learning analysis in subsequent notebooks.

The cleaned dataset with engineered features is saved for further analysis.


In [11]:
# Save cleaned dataset
df_clean.to_csv('stock_data_cleaned_and_features.csv', index=False)

print(f"Cleaned dataset saved successfully!")
print(f"Shape: {df_clean.shape}")
print(f"New features created: {len(new_features)}")
print(f"Features: {new_features}")
print(f"\nDataset saved as: stock_data_cleaned_and_features.csv")


Cleaned dataset saved successfully!
Shape: (2739, 15)
New features created: 7
Features: ['rating_from_score', 'rating_to_score', 'rating_delta', 'rating_magnitude', 'target_delta', 'target_growth', 'relative_growth']

Dataset saved as: stock_data_cleaned_and_features.csv
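
The "File Validation" activity can be approximated with a round trip: write the file, read it back, and compare basic properties. CSV does not store dtypes, so a datetime column comes back as strings unless it is parsed again with `parse_dates`. The sketch below uses a toy frame and a temporary file. The file name `demo_export.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with one datetime column (illustrative only)
demo = pd.DataFrame({
    "ticker": ["CECO", "BLND"],
    "time": pd.to_datetime(["2025-08-22 00:30:05", "2025-08-25 00:30:04"]),
    "target_delta": [8.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_export.csv")
    demo.to_csv(path, index=False)
    # Re-parse the datetime column while loading
    reloaded = pd.read_csv(path, parse_dates=["time"])

same_shape = reloaded.shape == demo.shape
same_columns = list(reloaded.columns) == list(demo.columns)
time_is_datetime = pd.api.types.is_datetime64_any_dtype(reloaded["time"])
print(same_shape, same_columns, time_is_datetime)
```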
