In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# For text processing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# For model building
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# For evaluation
from sklearn.metrics import classification_report, accuracy_score

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')


# Alien Species Classification with Stacking Classifier and Advanced Feature Engineering

In this notebook, we will build a stacking classifier to classify alien species based on intercepted messages and other features. We will perform advanced feature engineering to extract meaningful patterns from the data.

## Table of Contents

1. [Data Loading and Preprocessing](#data-loading)
2. [Exploratory Data Analysis (EDA)](#eda)
3. [Feature Engineering](#feature-engineering)
4. [Unsupervised Feature Extraction](#unsupervised-feature-extraction)
5. [Preparing Data for Modeling](#data-preparation)
6. [Building the Stacking Classifier](#stacking-classifier)
7. [Model Evaluation](#model-evaluation)
8. [Cross-Validation](#cross-validation)
9. [Making Predictions on Test Data](#predictions)
10. [Conclusion](#conclusion)


In [2]:
# Data Loading and Preprocessing
# Load the data
train_df = pd.read_csv('data.csv')
test_df = pd.read_csv('test.csv')


<a id='data-loading'></a>
## 1. Data Loading and Preprocessing

First, we load the training and test datasets using `pandas`. We then perform initial preprocessing steps, such as encoding categorical variables.


In [3]:
# Display the first few rows of the training data
print("Training Data Sample:")
train_df.head()


Training Data Sample:


Unnamed: 0,message,fingers,tail,species
0,pluvia arbor aquos,4,no,Aquari
1,cosmix xeno nebuz odbitaz,5,yes,Zorblax
2,solarix glixx novum galaxum quasar,5,yes,Zorblax
3,arbor insectus pesros ekos dootix nimbus,2,yes,Florian
4,mermax drakos lorix epikoz deftax,4,no,Faerix


In [4]:
# Display the first few rows of the test data
print("Test Data Sample:")
test_df.head()


Test Data Sample:


Unnamed: 0,message,fingers,tail
0,iephyr terram nimbus terram faunar foliar,2,no
1,joyzor uleex luvium caloox shockus blissae,4,yes
2,aquos arbor ventuc,4,yes
3,nympha nympha epikoz nympha mythox mythox mythox,3,no
4,diitax sibenix fabulon,4,yes


In [5]:
# Preprocessing
# Encode 'tail' column: map 'yes' to 1 and 'no' to 0
train_df['tail'] = train_df['tail'].str.lower().map({'yes': 1, 'no': 0})
test_df['tail'] = test_df['tail'].str.lower().map({'yes': 1, 'no': 0})

# Check for any unmapped values
if train_df['tail'].isnull().any() or test_df['tail'].isnull().any():
    raise ValueError("There are unmapped values in the 'tail' column. Please check the data.")


We encode the `tail` column by mapping 'yes' to `1` and 'no' to `0`. This transforms the categorical data into numerical format suitable for modeling.
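The null check after the mapping matters because `Series.map` does not raise on unknown values; it silently returns `NaN` for them. A minimal sketch on a hypothetical toy column (the value `'maybe'` is invented for illustration and does not come from the dataset):

```python
import pandas as pd

# Hypothetical toy column; 'maybe' stands in for any unexpected value
tail = pd.Series(['Yes', 'no', 'YES', 'maybe'])

# Same mapping as in the notebook: lowercase first, then map to 1/0
encoded = tail.str.lower().map({'yes': 1, 'no': 0})

# Values that are not keys of the mapping become NaN instead of raising
unmapped = tail[encoded.isnull()]

print(encoded.tolist())
print("Unmapped values:", unmapped.tolist())
```

Listing the offending values this way makes the `ValueError` above easier to act on.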


In [6]:
# Encode the target variable 'species'
label_encoder = LabelEncoder()
train_df['species_encoded'] = label_encoder.fit_transform(train_df['species'])


We encode the target variable `species` using `LabelEncoder` to convert the categorical labels into numerical values.
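As a small illustration of how `LabelEncoder` assigns codes, the sketch below uses three species names from the sample rows. The codes differ from the notebook's (where Zorblax is 9) because only three classes are present here:

```python
from sklearn.preprocessing import LabelEncoder

# Species names taken from the sample rows shown earlier
species = ['Zorblax', 'Aquari', 'Florian', 'Aquari']

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted unique labels; each code is an index into it
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because codes follow alphabetical order, they carry no meaning beyond identifying the class.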


In [7]:
# Check for missing values
print("Missing values in training data:")
print(train_df.isnull().sum())

print("\nMissing values in test data:")
print(test_df.isnull().sum())


We count the missing values in each column of both datasets. Missing values can cause errors during modeling and need to be addressed before feature engineering.


In [8]:
# Drop rows with missing values in training data
train_df.dropna(inplace=True)

# Fill missing values in test data (if any) using forward fill
# (fillna(method='ffill') is deprecated in recent pandas releases)
test_df.ffill(inplace=True)


We drop rows with missing values from the training data. In the test data we forward-fill instead, so that no test row is removed and every row still receives a prediction.
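Forward fill copies the most recent non-missing value downward, so a missing value in the very first row has nothing to copy from. A minimal sketch on a hypothetical toy frame (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the first and third rows
toy = pd.DataFrame({'fingers': [np.nan, 4.0, np.nan, 3.0]})

filled = toy.ffill()

# The third row takes the value above it; the first row stays NaN
print(filled['fingers'].tolist())
print("Remaining missing values:", int(filled['fingers'].isnull().sum()))
```

If the first test row could be incomplete, a second pass with a backward fill or a column default would be needed; whether that applies here depends on the data.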


In [12]:
# Reset indices after dropping rows
train_df.reset_index(drop=True, inplace=True)


After dropping rows with missing values, we reset the indices of the training data to maintain consistency.


In [13]:
# Display updated training data sample
train_df.head()


Unnamed: 0,message,fingers,tail,species,species_encoded
0,pluvia arbor aquos,4,0,Aquari,0
1,cosmix xeno nebuz odbitaz,5,1,Zorblax,9
2,solarix glixx novum galaxum quasar,5,1,Zorblax,9
3,arbor insectus pesros ekos dootix nimbus,2,1,Florian,4
4,mermax drakos lorix epikoz deftax,4,0,Faerix,3


<a id='eda'></a>
## 2. Exploratory Data Analysis (EDA)

At this point, you can perform EDA to understand the distributions and relationships in the data. For brevity, we will proceed to feature engineering.
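A minimal sketch of what such an EDA step could look like, rebuilt from the five sample rows shown earlier so that it runs on its own. With the real data, the same calls would be applied to `train_df`:

```python
import pandas as pd

# The five sample rows from the training data preview
sample = pd.DataFrame({
    'fingers': [4, 5, 5, 2, 4],
    'tail': ['no', 'yes', 'yes', 'yes', 'no'],
    'species': ['Aquari', 'Zorblax', 'Zorblax', 'Florian', 'Faerix'],
})

# How many rows per class (checks for class imbalance)
class_counts = sample['species'].value_counts()

# How the tail flag is distributed within each species
tail_by_species = pd.crosstab(sample['species'], sample['tail'])

# Average number of fingers per species
mean_fingers = sample.groupby('species')['fingers'].mean()

print(class_counts)
print(tail_by_species)
print(mean_fingers)
```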


In [14]:
# Import additional libraries for feature engineering
import string
from collections import Counter
from scipy.stats import entropy


<a id='feature-engineering'></a>
## 3. Feature Engineering

We create a function `extract_features` to perform advanced feature engineering on the data. This includes:

- Calculating message length
- Counting unique characters
- Counting special characters
- Computing character frequencies
- Calculating message entropy
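To make the entropy feature concrete, the sketch below computes the Shannon entropy of a message's character distribution in two ways: with `scipy.stats.entropy` (natural logarithm by default) and with the explicit formula. The helper names `char_entropy` and `char_entropy_manual` are illustrative and are not part of the notebook:

```python
import math
from collections import Counter

from scipy.stats import entropy


def char_entropy(text):
    """Shannon entropy (in nats) of the character distribution of text."""
    counts = Counter(text)
    probs = [n / len(text) for n in counts.values()]
    return entropy(probs)


def char_entropy_manual(text):
    """Same quantity from the definition: -sum(p * ln(p))."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log(n / len(text)) for n in counts.values())


uniform = char_entropy('abcd')   # four equally likely characters: ln(4)
repeated = char_entropy('aaaa')  # a single repeated character: 0
first_row = char_entropy('pluvia arbor aquos')

print(uniform, repeated, first_row)
```

A message that reuses few characters scores low, while one that spreads evenly over many characters scores high.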


In [15]:
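def extract_features(df, common_chars=None):
    # Message Length
    df['message_length'] = df['message'].apply(len)
    
    # Unique Characters
    df['unique_chars'] = df['message'].apply(lambda x: len(set(x)))
    
    # Special Characters Count
    special_chars = set(string.punctuation)
    df['special_char_count'] = df['message'].apply(lambda x: sum(1 for c in x if c in special_chars))
    
    # Character Frequencies
    # If no character list is supplied, use the 10 most common characters in df
    if common_chars is None:
        char_counts = Counter(''.join(df['message']))
        common_chars = [char for char, count in char_counts.most_common(10)]
    
    for char in common_chars:
        df[f'char_count_{char}'] = df['message'].apply(lambda x: x.count(char))
    
    # Entropy (Shannon entropy of the character distribution, in nats)
    def calculate_entropy(text):
        if not text:
            return 0.0  # avoid division by zero on empty messages
        prob = [float(text.count(c)) / len(text) for c in set(text)]
        return entropy(prob)
    df['message_entropy'] = df['message'].apply(calculate_entropy)
    
    return df

# Apply feature engineering to the training data
train_df = extract_features(train_df)

# Reuse the training set's character list so that the test data gets exactly
# the same char_count_* columns, even if its own top 10 characters differ
train_chars = [col[len('char_count_'):] for col in train_df.columns if col.startswith('char_count_')]
test_df = extract_features(test_df, common_chars=train_chars)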


We apply the `extract_features` function to both the training and test datasets. This adds new columns to the dataframes representing the engineered features.


In [16]:
# Display the updated training data with new features
train_df.head()


Unnamed: 0,message,fingers,tail,species,species_encoded,message_length,unique_chars,special_char_count,char_count_ ,char_count_a,char_count_o,char_count_x,char_count_r,char_count_e,char_count_i,char_count_n,char_count_s,char_count_u,message_entropy
0,pluvia arbor aquos,4,0,Aquari,0,18,12,0,2,3,2,0,2,0,1,0,1,2,2.399204
1,cosmix xeno nebuz odbitaz,5,1,Zorblax,9,25,15,0,3,1,3,2,0,2,2,2,1,1,2.622498
2,solarix glixx novum galaxum quasar,5,1,Zorblax,9,34,14,0,4,5,2,4,2,0,2,1,2,3,2.524979
3,arbor insectus pesros ekos dootix nimbus,2,1,Florian,4,40,17,0,5,1,5,1,3,3,3,2,6,2,2.631939
4,mermax drakos lorix epikoz deftax,4,0,Faerix,3,33,16,0,4,3,3,3,3,3,2,0,1,0,2.661067


The training data now includes additional features derived from the messages, which the models can use alongside `fingers` and `tail`.


In [17]:
# List of new feature columns added
new_features = ['message_length', 'unique_chars', 'special_char_count', 'message_entropy'] + \
               [col for col in train_df.columns if col.startswith('char_count_')]

print("New features added:")
print(new_features)


New features added:
['message_length', 'unique_chars', 'special_char_count', 'message_entropy', 'char_count_ ', 'char_count_a', 'char_count_o', 'char_count_x', 'char_count_r', 'char_count_e', 'char_count_i', 'char_count_n', 'char_count_s', 'char_count_u']


<a id='unsupervised-feature-extraction'></a>
## 4. Unsupervised Feature Extraction

We perform unsupervised feature extraction using KMeans clustering on the TF-IDF vectors of the messages. This adds a new feature `cluster` representing the cluster assignments.


In [18]:
from sklearn.cluster import KMeans

def add_cluster_features(df_train, df_test, n_clusters=5):
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
    X_train_text = vectorizer.fit_transform(df_train['message'])
    X_test_text = vectorizer.transform(df_test['message'])
    
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    df_train['cluster'] = kmeans.fit_predict(X_train_text)
    df_test['cluster'] = kmeans.predict(X_test_text)
    
    return df_train, df_test

# Apply clustering to training and test data
train_df, test_df = add_cluster_features(train_df, test_df, n_clusters=5)


The `add_cluster_features` function adds a `cluster` column to both datasets, which can capture latent patterns in the messages.
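The same idea can be sketched on a handful of toy messages. The messages below are invented from words that appear in the samples, and `n_init=10` is set explicitly as an assumption so that the result does not depend on the scikit-learn version's default:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages forming two obvious vocabulary groups
train_msgs = ['aquos aquos pluvia', 'pluvia aquos', 'xeno nebuz cosmix', 'cosmix xeno']
test_msgs = ['aquos pluvia pluvia']

# Character n-grams of length 2 to 4, fitted on the training messages only
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 4))
X_train_text = vectorizer.fit_transform(train_msgs)
X_test_text = vectorizer.transform(test_msgs)

# Cluster the training vectors, then assign the test message to the nearest centroid
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
train_clusters = kmeans.fit_predict(X_train_text)
test_cluster = kmeans.predict(X_test_text)

print(train_clusters.tolist(), test_cluster.tolist())
```

The cluster ids themselves are arbitrary labels; only the grouping carries information.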


In [19]:
# Display the updated training data with cluster feature
train_df.head()


Unnamed: 0,message,fingers,tail,species,species_encoded,message_length,unique_chars,special_char_count,char_count_ ,char_count_a,char_count_o,char_count_x,char_count_r,char_count_e,char_count_i,char_count_n,char_count_s,char_count_u,message_entropy,cluster
0,pluvia arbor aquos,4,0,Aquari,0,18,12,0,2,3,2,0,2,0,1,0,1,2,2.399204,3
1,cosmix xeno nebuz odbitaz,5,1,Zorblax,9,25,15,0,3,1,3,2,0,2,2,2,1,1,2.622498,2
2,solarix glixx novum galaxum quasar,5,1,Zorblax,9,34,14,0,4,5,2,4,2,0,2,1,2,3,2.524979,2
3,arbor insectus pesros ekos dootix nimbus,2,1,Florian,4,40,17,0,5,1,5,1,3,3,3,2,6,2,2.631939,3
4,mermax drakos lorix epikoz deftax,4,0,Faerix,3,33,16,0,4,3,3,3,3,3,2,0,1,0,2.661067,0


<a id='data-preparation'></a>
## 5. Preparing Data for Modeling

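We define the features and target variable for modeling. We also scale the features with `MinMaxScaler`, which maps each feature to the range 0 to 1 on the data it is fitted on. `MultinomialNB` does not need this exact range, but it does require non-negative inputs, which this scaling provides for the training data.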


In [20]:
# Define features and target
feature_columns = ['fingers', 'tail', 'message_length', 'unique_chars', 'special_char_count', 'message_entropy', 'cluster']
# Include character count features
char_features = [col for col in train_df.columns if col.startswith('char_count_')]
feature_columns.extend(char_features)

X = train_df[feature_columns]
y = train_df['species_encoded']
test_X = test_df[feature_columns]


In [21]:
# Scale features using MinMaxScaler
scaler = MinMaxScaler()
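# Build new DataFrames instead of assigning into slices of train_df/test_df,
# which avoids pandas' SettingWithCopyWarning
X = pd.DataFrame(scaler.fit_transform(X), columns=feature_columns, index=X.index)
test_X = pd.DataFrame(scaler.transform(test_X), columns=feature_columns, index=test_X.index)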


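Scaling brings the features to a common range, which mainly benefits scale-sensitive models such as `SVC`, and keeps the inputs to `MultinomialNB` non-negative. Note that the scaler is fitted on all training rows before they are split into training and validation sets, so the validation rows contribute to the minimum and maximum used for scaling.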
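One caveat is that `MinMaxScaler` learns its minimum and maximum from the data it is fitted on, so unseen values outside that range are mapped below 0 or above 1. A minimal sketch on toy numbers, assuming scikit-learn 0.24 or later for the `clip` option (the notebook itself does not use it):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: the "test" values lie outside the range seen during fitting
train_values = np.array([[10.0], [20.0], [30.0]])
test_values = np.array([[5.0], [40.0]])

plain = MinMaxScaler().fit(train_values)
clipped = MinMaxScaler(clip=True).fit(train_values)

# Without clipping, out-of-range inputs leave the [0, 1] interval
plain_out = plain.transform(test_values).ravel()

# With clip=True, transformed values are cut back to [0, 1]
clipped_out = clipped.transform(test_values).ravel()

print(plain_out, clipped_out)
```

A negative scaled value would make `MultinomialNB` fail at prediction time, so clipping is one possible safeguard.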


In [22]:
# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


We hold out 20% of the data as a validation set to evaluate the model on rows it was not trained on. The split is stratified by species, so each class keeps its share in both parts.
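The effect of `stratify=y` can be checked on a small imbalanced toy target (the 80/20 class balance below is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 samples of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The 20 validation rows keep the 80/20 class ratio: 16 and 4
print(np.bincount(y_va))
```

Without stratification, a rare class could end up under-represented or missing in the validation set.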


<a id='stacking-classifier'></a>
## 6. Building the Stacking Classifier

We build a stacking classifier using multiple base models and a meta-model. The base models include:

- `MultinomialNB`
- `RandomForestClassifier`
- `GradientBoostingClassifier`
- `SVC`

The meta-model is a `LogisticRegression` classifier.


In [23]:
# Base Models
base_models = [
    ('nb', MultinomialNB()),
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=200, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

# Meta Model
meta_model = LogisticRegression(max_iter=1000, random_state=42)

# Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=5,
    n_jobs=-1
)


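We define the stacking classifier, specifying the base models and the meta-model. With `cv=5`, each base model produces out-of-fold predictions through 5-fold cross-validation, and the meta-model is trained on those predictions; the base models are then refit on the full training data.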
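To show what stacking does with the base models' predictions, the sketch below builds the meta-model's training matrix by hand with `cross_val_predict`. It is a simplified illustration on synthetic data with two stand-in base models (`GaussianNB` and a shallow decision tree are assumptions chosen for speed), not a reproduction of `StackingClassifier`'s internals:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# Stand-in base models (not the ones used in the notebook)
demo_models = [GaussianNB(), DecisionTreeClassifier(max_depth=3, random_state=42)]

# Each base model contributes out-of-fold class probabilities, so the
# meta-model never sees predictions made on rows the base model trained on
meta_features = np.hstack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method='predict_proba')
    for model in demo_models
])

# The meta-model learns how to combine the base models' probabilities
demo_meta = LogisticRegression(max_iter=1000).fit(meta_features, y_demo)

print(meta_features.shape)  # 2 models x 3 class probabilities per row
```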


In [24]:
# Train the stacking classifier
stacking_clf.fit(X_train, y_train)


We train the stacking classifier on the training data.


<a id='model-evaluation'></a>
## 7. Model Evaluation

We evaluate the model's performance on the validation set by calculating the accuracy and displaying the classification report.


In [25]:
# Predict on validation set
y_pred = stacking_clf.predict(X_val)

# Evaluate the model
print("\nClassification Report on Validation Set:")
print(classification_report(y_val, y_pred, target_names=label_encoder.classes_))

# Calculate accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")



Classification Report on Validation Set:
              precision    recall  f1-score   support

      Aquari       0.89      0.89      0.89         9
       Cybex       0.90      0.82      0.86        11
    Emotivor       0.92      1.00      0.96        11
      Faerix       0.78      0.78      0.78         9
     Florian       0.90      0.90      0.90        10
     Mythron       0.80      0.80      0.80        10
      Nexoon       0.80      0.89      0.84         9
     Quixnar       0.69      0.92      0.79        12
     Sentire       1.00      0.89      0.94         9
     Zorblax       0.83      0.50      0.62        10

    accuracy                           0.84       100
   macro avg       0.85      0.84      0.84       100
weighted avg       0.85      0.84      0.84       100

Validation Accuracy: 84.00%


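The classification report lists precision, recall, and F1-score for each species, and the overall validation accuracy is 84.00%. The weakest class is Zorblax, with a recall of 0.50: half of its 10 validation samples are assigned to other species. Quixnar has the lowest precision (0.69), meaning it absorbs samples that belong to other classes. With only 9 to 12 validation samples per class, these per-class figures are noisy.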
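Per-class precision and recall can be read directly from a confusion matrix. A minimal sketch on invented toy labels, chosen so that class `A` has a recall of 0.50:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels: half of the true 'A' samples are predicted as 'B'
y_true = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
y_hat = ['A', 'A', 'B', 'B', 'B', 'B', 'B', 'B']
labels = ['A', 'B']

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)

# Recall: correct predictions divided by the row total (all true members)
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the column total (all predicted members)
precision = cm.diagonal() / cm.sum(axis=0)

print(cm)
print("recall:", recall, "precision:", precision)
```

Low recall for one class usually shows up as lowered precision for the class that receives its misclassified samples.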


<a id='cross-validation'></a>
## 8. Cross-Validation

We perform cross-validation to assess the model's generalization performance.


In [26]:
# Cross-Validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(stacking_clf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
cv_accuracy = cv_scores.mean()
print(f"\nCross-Validation Accuracy: {cv_accuracy * 100:.2f}%")



Cross-Validation Accuracy: 82.60%


Cross-validation evaluates the model on five different train/validation partitions, which reduces the dependence on a single split. The mean accuracy of 82.60% is slightly below the 84.00% measured on the single validation split.
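One way to keep preprocessing inside each fold is to wrap the scaler and the model in a `Pipeline`, so the scaler is refit on the training part of every fold. The sketch below uses synthetic data and a plain `LogisticRegression` as a stand-in for the stacking classifier, and also reports the spread across folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# The scaler is part of the model, so each fold fits it on its own training rows
model = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')

# Reporting the standard deviation shows how much the folds disagree
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```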


<a id='predictions'></a>
## 9. Making Predictions on Test Data

We use the stacking classifier, which was fitted on the 80% training split, to make predictions on the test data and save the results to a CSV file.


In [27]:
# Predict on test data
test_predictions = stacking_clf.predict(test_X)

# Decode the predicted labels
test_predictions_labels = label_encoder.inverse_transform(test_predictions)

# Create the result DataFrame
result_df = pd.DataFrame({'species': test_predictions_labels})

# Save to CSV
result_df.to_csv('result.csv', index=False)
print("\nPredictions saved to 'result.csv'.")



Predictions saved to 'result.csv'.


We transform the numerical predictions back to the original species labels and save the results as required.


<a id='conclusion'></a>
## 10. Conclusion

In this notebook, we built a stacking classifier with engineered features to classify alien species based on intercepted messages and other features. The main steps were:

- Performing feature engineering to extract character-level statistics from the messages.
- Using unsupervised clustering to capture latent patterns.
- Building an ensemble model to combine the strengths of multiple classifiers.
- Evaluating the model on a held-out validation set (84.00% accuracy) and with 5-fold cross-validation (82.60% mean accuracy).

This approach combines the structured columns (`fingers`, `tail`) with features derived from the unstructured message text.
