# **Preprocessing and Baseline - Edition 3**

# **Baseline Model - Logistic Regression on Monthly Avg Market Cap**

This section develops a baseline model that classifies companies by whether their **Monthly Avg Market Cap** is above or below the training-set median, using a **Logistic Regression** classifier. The dataset is split into training and testing sets by `Ticker`, a preprocessing pipeline handles numerical and categorical features, and cross-validation is used to assess the model's performance.

## **Data Preparation**

### **1. Dropping Irrelevant Columns**
The first step removes columns that do not contribute to the model. Here that is `Unnamed: 0`, which is typically a leftover index column written when a DataFrame is saved to CSV.

### **2. Splitting the Data by Ticker**
Since each company is uniquely identified by its `Ticker`, the data is split into training and testing sets while keeping data from the same `Ticker` together within each split. This method prevents data leakage between the train and test sets.
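
As an illustration of the same idea, the sketch below uses scikit-learn's `GroupShuffleSplit` on a small made-up frame and then checks that no ticker lands on both sides. This is an alternative to the custom helper defined later in this notebook, not the method used here; the toy tickers and the `Revenue` values are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame standing in for the real dataset (tickers and values are made up).
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC", "DDD", "DDD", "EEE", "EEE"],
    "Revenue": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 4.0, 4.1, 5.0, 5.1],
})

# One split, with roughly 20% of the tickers (not rows) held out for testing.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Ticker"]))
train_part, test_part = toy.iloc[train_idx], toy.iloc[test_idx]

# Leakage check: the two sides should share no ticker.
shared_tickers = set(train_part["Ticker"]) & set(test_part["Ticker"])
print("Tickers present in both splits:", sorted(shared_tickers))
```

The same intersection check can be run on `train_data` and `test_data` after the custom split to confirm that the grouping worked.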

### **3. Target Variable**
The target variable is the `Monthly Avg Market Cap`. It is removed from the feature set and used as the dependent variable (`y_train` and `y_test`).

### **4. Identifying Numerical and Categorical Features**
To apply the appropriate transformations, numerical and categorical features are identified automatically:
- **Numerical Features**: Columns with an `int64` or `float64` dtype (e.g., revenue, profit).
- **Categorical Features**: Columns with an `object` dtype (e.g., industry type). The `Ticker` column is removed from this list: it is used only to group rows for the train/test split and is not passed to the model.

## **Preprocessing**

### **1. Numerical Features**
For numerical features:
- **Imputation**: Any missing values are handled using the median of the feature's distribution.
- **Scaling**: A `RobustScaler` is applied to ensure that outliers do not overly influence the feature scaling.
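
To make the two steps concrete, here is a minimal sketch on a made-up column containing one missing value and one outlier. With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which the manual computation reproduces; the numbers are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

# Made-up feature column: one missing value and one outlier (100.0).
values = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

# Median imputation: NaN is replaced by the median of the observed values.
imputed = SimpleImputer(strategy="median").fit_transform(values)

# Robust scaling: centre on the median, divide by the interquartile range.
scaled = RobustScaler().fit_transform(imputed)

# The same scaling computed by hand for comparison.
median = np.median(imputed)
q1, q3 = np.percentile(imputed, [25, 75])
manual = (imputed - median) / (q3 - q1)

print(imputed.ravel())
print(scaled.ravel())
```

Because the median and the interquartile range are barely affected by the single extreme value, the ordinary points stay within a narrow range after scaling, while the outlier remains clearly separated.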

### **2. Categorical Features**
For categorical features:
- **Imputation**: Missing values are filled with the most frequent category.
- **One-Hot Encoding**: Categorical variables are transformed using `OneHotEncoder`, creating one binary column per category. With `handle_unknown='ignore'`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.
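
The behaviour for unseen categories can be checked on a small made-up example. The category names below are assumptions, and the sketch converts the encoder's default sparse output with `.toarray()` rather than passing `sparse_output=False` as the notebook does, so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categories; "Utilities" appears only at transform time.
train_cats = np.array([["Tech"], ["Energy"], ["Tech"]])
test_cats = np.array([["Energy"], ["Utilities"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)

# Columns follow the sorted categories learned during fit.
learned = encoder.categories_[0].tolist()
encoded = encoder.transform(test_cats).toarray()

print(learned)
print(encoded)
```

The row for the unseen category contains only zeros, so the model treats it as belonging to none of the known categories.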

### **3. ColumnTransformer**
A `ColumnTransformer` is used to apply the preprocessing steps:
- The numerical pipeline is applied to numerical features.
- The categorical pipeline is applied to categorical features.

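The transformers are fitted on the training set only and then applied unchanged to the test set, so no statistics from the test data enter the preprocessing. The result is a fully preprocessed dataset, ready for model training.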

## **Modeling**

### **1. Logistic Regression**
A **Logistic Regression** model is used to classify companies based on their **Monthly Avg Market Cap**. The continuous target is converted into a binary label, with the threshold defined as the median of `y_train`. Rows with a market cap above the median are labeled "high growth" (1), and rows at or below it are labeled "low growth" (0). These labels describe the level of market cap relative to the training median, not a growth rate. The same training-set threshold is applied to `y_test`.
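
A small worked example, using made-up market-cap values, shows how the labels fall out of the threshold. It assumes the strict comparison (`>`) used in this notebook, so a value exactly equal to the median receives label 0.

```python
import pandas as pd

# Made-up market-cap values for illustration.
y_train_toy = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
y_test_toy = pd.Series([5.0, 30.0, 45.0])

# The threshold comes from the training targets only.
threshold_toy = y_train_toy.median()

# Strictly above the threshold -> 1, otherwise -> 0.
train_labels = (y_train_toy > threshold_toy).astype(int)
test_labels = (y_test_toy > threshold_toy).astype(int)

print(threshold_toy)
print(train_labels.tolist())
print(test_labels.tolist())
```

Because the threshold is the training median, the training labels are close to balanced by construction, while the test labels can be less balanced.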

### **2. Model Training**
The model is trained on the preprocessed training data (`X_train_processed` and `y_train_class`).

### **3. Cross-Validation**
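To evaluate the model, 5-fold **cross-validation** is run on the training data and the mean accuracy across folds is reported. This gives an estimate of how well the model generalizes, but it does not guarantee the absence of overfitting. The folds are drawn row-wise, so rows from the same `Ticker` can appear in both the training and validation folds, and the preprocessing is fitted on the full training set before the folds are formed. The cross-validated score may therefore be somewhat optimistic compared with the ticker-grouped test set.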
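
One possible refinement, sketched below under stated assumptions, is to keep each ticker inside a single fold with `GroupKFold` and to place the preprocessing inside the cross-validated pipeline so it is refitted per fold. This is not what the notebook does; the synthetic data, the column names `Revenue` and `Profit`, and the numeric-only preprocessing are assumptions made to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the training data: 10 tickers, 12 rows each.
rng = np.random.default_rng(0)
tickers = np.repeat([f"T{i}" for i in range(10)], 12)
X_toy = pd.DataFrame({
    "Ticker": tickers,
    "Revenue": rng.normal(size=tickers.size),
    "Profit": rng.normal(size=tickers.size),
})
y_toy = (X_toy["Revenue"] + 0.5 * rng.normal(size=tickers.size) > 0).astype(int)

# Preprocessing and classifier in one pipeline, so each fold refits both.
# 'Ticker' is not listed in the ColumnTransformer and is dropped by default.
numeric_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])
model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([("num", numeric_pipeline, ["Revenue", "Profit"])])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# GroupKFold keeps all rows of a ticker in the same fold.
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(
    model, X_toy, y_toy,
    groups=X_toy["Ticker"], cv=group_cv, scoring="accuracy",
)
print(f"Grouped CV accuracy: {scores.mean():.4f}")
```

With this setup the validation folds contain only unseen tickers, which mirrors how the test set is constructed.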

## **Evaluation**

### **1. Testing the Model**
After cross-validation, the model is tested on the **test set** (`X_test_processed` and `y_test_class`). Predictions are compared to the actual labels to assess the model’s performance on unseen data.

### **2. Performance Metrics**
The following performance metrics are calculated:
- **Accuracy**: The percentage of correct predictions.
- **Classification Report**: A detailed report providing **precision**, **recall**, and **F1-score** metrics for the test set.
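
As a quick consistency check on how these metrics relate, the F1-score is the harmonic mean of precision and recall. The sketch below applies that formula to the rounded per-class precision and recall printed in the classification report at the end of this notebook; since the inputs are already rounded, the results are approximate. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values taken from the test-set classification report.
f1_class0 = f1_from(precision=0.92, recall=0.53)
f1_class1 = f1_from(precision=0.65, recall=0.95)

print(round(f1_class0, 2), round(f1_class1, 2))  # 0.67 0.77
```

Both values agree with the `f1-score` column of the report, and they show how low recall on class 0 pulls its F1 down despite high precision.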

Logistic Regression with this preprocessing reaches a cross-validated accuracy of 0.7703 on the training data and an accuracy of 0.73 on the held-out test set. On the test set the model leans toward class 1: recall is 0.95 for class 1 but only 0.53 for class 0. These figures serve as the baseline for future iterations and improvements using more complex algorithms.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train/Test split + dropping features

In [2]:
# Step 1: Load the dataset
file_path = r'merged_data.csv'
df = pd.read_csv(file_path)

In [3]:
# Step 2: Dropping irrelevant columns
df = df.drop(columns=['Unnamed: 0'])

In [4]:
# Custom train/test split by grouping data by 'Ticker'
def group_train_test_split(X, test_size=0.2, random_state=None):
    """Split rows so that every 'Ticker' falls entirely in train or in test."""
    unique_groups = X['Ticker'].unique()
    # Split the tickers themselves, then select the rows belonging to each side
    train_groups, test_groups = train_test_split(unique_groups, test_size=test_size, random_state=random_state)
    train_data = X[X['Ticker'].isin(train_groups)]
    test_data = X[X['Ticker'].isin(test_groups)]
    return train_data, test_data

train_data, test_data = group_train_test_split(df, test_size=0.2, random_state=42)

In [5]:
# Step 3: Separate the target variable
y_train = train_data['Monthly Avg Market Cap']
X_train = train_data.drop('Monthly Avg Market Cap', axis=1)

y_test = test_data['Monthly Avg Market Cap']
X_test = test_data.drop('Monthly Avg Market Cap', axis=1)

In [6]:
# Step 4: Identify numerical and categorical features
def identify_feature_types(df):
    numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = df.select_dtypes(include=['object']).columns.tolist()

    # Exclude 'Ticker' from the categorical features as it's not needed for transformation
    if 'Ticker' in categorical_features:
        categorical_features.remove('Ticker')

    return numerical_features, categorical_features

# Identify feature types after target removal
numerical_features, categorical_features = identify_feature_types(X_train)

In [7]:
# Step 5: Preprocessing pipelines for numerical and categorical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # Handle NaNs
    ('scaler', RobustScaler())  # Scale the data
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Handle missing categories
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # Encode categories
])

# Combine the transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

In [8]:
# Step 6: Create the final preprocessing pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

In [9]:
# Step 7: Transform the data
X_train_processed = pipeline.fit_transform(X_train)
X_test_processed = pipeline.transform(X_test)

In [10]:
# Step 8: Define the growth threshold (e.g., median of training data)
threshold = y_train.median()

In [11]:
# Step 9: Convert 'Monthly Avg Market Cap' into binary classification labels
y_train_class = (y_train > threshold).astype(int)  # 1 for high growth, 0 for low growth
y_test_class = (y_test > threshold).astype(int)

In [12]:
# Step 10: Create and fit the Logistic Regression model
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train_processed, y_train_class)

In [13]:
# Step 11: Evaluate using cross-validation
cv_scores = cross_val_score(logistic_model, X_train_processed, y_train_class, cv=5, scoring='accuracy')

In [14]:
# Step 12: Predict on the test set
y_pred_test = logistic_model.predict(X_test_processed)

In [15]:
# Step 13: Print performance metrics
print(f"Cross-validated Accuracy: {cv_scores.mean():.4f}")
print("\nClassification Report on Test Set:\n")
print(classification_report(y_test_class, y_pred_test))

Cross-validated Accuracy: 0.7703

Classification Report on Test Set:

              precision    recall  f1-score   support

           0       0.92      0.53      0.67      2548
           1       0.65      0.95      0.77      2308

    accuracy                           0.73      4856
   macro avg       0.78      0.74      0.72      4856
weighted avg       0.79      0.73      0.72      4856

