* Do feature scaling and feature selection to see whether they improve performance and help the model converge 
* Perform feature scaling before feature selection  
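
One way to make sure scaling runs before selection, and that both are fitted on the training split only, is to chain the steps in a scikit-learn `Pipeline`. This is a minimal sketch, not the setup used in the cells below: the toy corpus, the labels, the step names, and `k=5` are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

# Toy data (hypothetical), only to make the sketch runnable
docs = [
    "late delivery refund requested",
    "refund for late parcel",
    "account locked after password reset",
    "password reset link not working",
    "parcel arrived late again",
    "cannot log in account locked",
]
labels = ["delivery", "delivery", "account", "account", "delivery", "account"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("scale", MaxAbsScaler()),           # scaling first ...
    ("select", SelectKBest(chi2, k=5)),  # ... then selection (chi2 needs non-negative input)
    ("clf", LinearSVC()),
])
pipe.fit(docs, labels)

n_selected = int(pipe.named_steps["select"].get_support().sum())
preds = pipe.predict(["refund for late delivery"])
print(n_selected, preds)
```

Because every step lives in one estimator, the same object can be passed to `GridSearchCV` or `cross_val_score` without leaking validation data into the scaler or the selector.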

https://desirabletomorrows.org/assets/files/GARCIA-A.C.etal.FeatureSelectionMethodsforTextClassification.pdf  
https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/

Feature Selection Techniques: 
1. Filter Methods:
   - Chi-square test: It tests whether a feature is independent of a categorical target variable. It is suitable for categorical features, or non-negative features such as counts and TF-IDF weights, and a categorical target.
   - Information gain: It measures the reduction in entropy or disorder of the target variable based on each feature. It is suitable for categorical or numerical features and a categorical target.  
   - Fisher score

2. Wrapper Methods: Wrapper methods can be computationally expensive, especially if the feature space is large
   - Recursive Feature Elimination (RFE): It recursively eliminates features and builds models based on the remaining features. It assesses the model's performance at each step to determine feature relevance. It can be used with any type of features and any type of target.
   - Forward/Backward Stepwise Selection: These methods iteratively add or remove one feature at a time, based on how much each candidate changes the model's performance. They can be used with any type of features and any type of target.

3. Embedded Methods:
   - LASSO (Least Absolute Shrinkage and Selection Operator): It applies L1 regularization to linear regression models, promoting sparsity in the coefficient estimates. It is suitable for numerical features and a numerical target.
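   - Ridge Regression: It applies L2 regularization to linear regression models, which shrinks the coefficients of less important features towards zero but does not set them exactly to zero, so on its own it ranks features rather than removing them. It is suitable for numerical features and a numerical target.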
   - Elastic Net: It combines L1 and L2 regularization methods to balance between feature selection and regularization. It is suitable for numerical features and a numerical target.
   - Regularized Tree-Based Methods: These include techniques like Random Forests with feature importance ranking and tree-based regularization techniques like XGBoost and LightGBM. They can handle different types of features and targets.

4. Univariate Selection (a kind of filter method):
   - ANOVA F-test: It tests whether the mean of a numerical feature differs across the classes of a categorical target. It is suitable for numerical features and a categorical target.
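
As a small illustration of a filter method, the sketch below scores each column with the ANOVA F-test and keeps the two best. The data is synthetic and made up for this example: six noise columns, of which columns 0 and 1 are shifted by class so that only they carry signal.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6
y = np.repeat([0, 1], n_samples // 2)

# All columns start as pure noise ...
X = rng.normal(size=(n_samples, n_features))
# ... then columns 0 and 1 are shifted by class, so only they are informative
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = np.flatnonzero(selector.get_support())
print(selected)          # indices of the two highest-scoring columns
print(selector.scores_)  # one F statistic per column
```

The score of each column is computed independently of the others and of any downstream model, which is what makes filter methods cheap compared with wrapper methods.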

In [49]:
import itertools

import gensim
import numpy as np
import pandas as pd
from gensim.models import Word2Vec, KeyedVectors
from nltk import word_tokenize
from scipy.sparse import hstack
from scipy.stats import randint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE, RFECV, SelectFromModel, SelectKBest, chi2
from sklearn.metrics import accuracy_score, classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV, StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, StandardScaler
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

In [50]:
import numpy as np
np.random.seed(42)  # Set random seed for numpy

import random
random.seed(42)  # Set random seed for random module

In [51]:
df = pd.read_csv("D:\\MS DATA SCIENCE\\NLP TESE\\data\\data_processed_selected.csv")

In [52]:
def preprocess_entities(row):
    """Lowercase an entity string, drop commas, and de-duplicate its tokens."""
    if pd.isna(row):
        return row  # leave missing values as they are
    entities = row.lower().replace(',', '')
    # dict.fromkeys keeps first-occurrence order; joining a set() gives an order that can
    # change between Python sessions, which would change the n-grams built from this column
    return ' '.join(dict.fromkeys(entities.split()))
df['entities'] = df['entities'].apply(preprocess_entities)

In [53]:
features = ['narrative_tfidf', 'tfidf_title', 'tfidf_keywords', 'events_tfidf', 'entities']
feature_combinations = []

# Generate every non-empty subset of the features (2**5 - 1 = 31 combinations)
for r in range(1, len(features) + 1):
    combinations = itertools.combinations(features, r)
    feature_combinations.extend(combinations)

# Convert each feature combination to a list
feature_combinations = [list(combination) for combination in feature_combinations]
print(len(feature_combinations))

31


In [54]:
new_feature_combinations = []

for t in feature_combinations:
    if not df[t].isna().all(axis=1).any():
        new_feature_combinations.append(t)

feature_combinations = [feature_comb for feature_comb in new_feature_combinations if 'narrative_tfidf' in feature_comb and 'tfidf_title' in feature_comb]
feature_combinations.append(['narrative_tfidf'])

In [55]:
feature_combinations

[['narrative_tfidf', 'tfidf_title'],
 ['narrative_tfidf', 'tfidf_title', 'tfidf_keywords'],
 ['narrative_tfidf', 'tfidf_title', 'events_tfidf'],
 ['narrative_tfidf', 'tfidf_title', 'entities'],
 ['narrative_tfidf', 'tfidf_title', 'tfidf_keywords', 'events_tfidf'],
 ['narrative_tfidf', 'tfidf_title', 'tfidf_keywords', 'entities'],
 ['narrative_tfidf', 'tfidf_title', 'events_tfidf', 'entities'],
 ['narrative_tfidf',
  'tfidf_title',
  'tfidf_keywords',
  'events_tfidf',
  'entities'],
 ['narrative_tfidf']]

# TF-IDF

In [56]:
y=df['reason']
X = df[['narrative_tfidf', 'tfidf_title', 'tfidf_keywords', 'events_tfidf', 'entities']]

In [57]:
# Split the data into a training set (70%) and a held-out set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the held-out set in half: validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, stratify=y_test, random_state=42)

# Print the shape of each set
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (15701, 5)
y_train shape: (15701,)
X_val shape: (3364, 5)
y_val shape: (3364,)
X_test shape: (3365, 5)
y_test shape: (3365,)


### Does scaling help?

In [23]:
X_train_combined_transformed.todense()

MemoryError: Unable to allocate 33.2 GiB for an array with shape (15701, 284224) and data type float64
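
The size in the error message is plain arithmetic: a dense float64 array needs rows × columns × 8 bytes, whether or not the entries are zero. The sketch below reproduces that number and contrasts it with the storage of a sparse CSR matrix. The sparse matrix here is a hypothetical stand-in with an arbitrarily chosen shape and density, not the real TF-IDF matrix.

```python
import numpy as np
from scipy import sparse

n_rows, n_cols = 15701, 284224  # shape reported in the error
dense_bytes = n_rows * n_cols * np.dtype(np.float64).itemsize
dense_gib = dense_bytes / 2**30
print(f"dense: {dense_gib:.1f} GiB")

# Hypothetical stand-in for a TF-IDF matrix (shape and density chosen arbitrarily)
X = sparse.random(2000, 50000, density=0.001, format="csr", random_state=0)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"sparse: {sparse_bytes / 2**20:.1f} MiB for {X.nnz} stored values")

# Inspect a small slice instead of densifying the whole matrix
preview = X[:5, :10].toarray()
print(preview.shape)
```

To look at the real matrix, slicing a few rows and columns before calling `.toarray()` (or reading `.nnz` and `.shape`) avoids the allocation entirely.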

In [10]:
def scaling(feature_combinations, X_train, X_val, scaler,algorithm, y_train, y_val):
    # Counter for progress messages
    cont=0

    # Collect one single-row result DataFrame per feature combination
    result_dfs = []

    # Evaluate models for each feature combination and algorithm
    for feature_set in feature_combinations:
        cont += 1
        X_train_combined = X_train[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)
        X_val_combined = X_val[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)

        # Transform the features using TF-IDF
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.95, min_df=2)
        X_train_combined_transformed = vectorizer.fit_transform(X_train_combined)
        X_val_combined_transformed = vectorizer.transform(X_val_combined)

        # Nr features before
        feature_names = vectorizer.get_feature_names_out()
        num_features_before = len(feature_names)

        # Feature scaling
        # scaler = MaxAbsScaler()
        X_train_scaled = scaler.fit_transform(X_train_combined_transformed)
        X_val_scaled = scaler.transform(X_val_combined_transformed)

        # Train the model
        # algorithm= LinearSVC(max_iter=2000)
        algorithm.fit(X_train_scaled, y_train)

        # Predict labels
        y_pred = algorithm.predict(X_val_scaled)

        # Calculate performance metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred, average='macro')
        recall = recall_score(y_val, y_pred, average='macro')
        f1 = f1_score(y_val, y_pred, average='macro')

        # Create a DataFrame for the current combination and algorithm
        result_df = pd.DataFrame({'Features Combination': [', '.join(feature_set)],
                                  'Accuracy': [accuracy],
                                  'Precision': [precision],
                                  'Recall': [recall],
                                  'F1-Score': [f1],
                                  'Nr feature before':[num_features_before]})
        # Append the DataFrame to the list
        result_dfs.append(result_df)

        print("Tested combination {} of {}".format(cont, len(feature_combinations)))

    # Concatenate all the result DataFrames into a single DataFrame
    results_scaling = pd.concat(result_dfs, ignore_index=True)
    return results_scaling

In [21]:
results_scaling=scaling(feature_combinations, X_train, X_val, MaxAbsScaler(),LinearSVC(max_iter=3000), y_train, y_val)
results_scaling

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before
0,"narrative_tfidf, tfidf_title",0.646254,0.509206,0.456454,0.467845,211325
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.647741,0.515034,0.449445,0.461037,239686
2,"narrative_tfidf, tfidf_title, events_tfidf",0.64239,0.507407,0.448946,0.460072,256564
3,"narrative_tfidf, tfidf_title, entities",0.641201,0.511699,0.454708,0.467698,217451
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.646254,0.514554,0.446398,0.457357,284224
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.645065,0.507805,0.444778,0.455548,245458
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.642985,0.503758,0.448418,0.458734,262304
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.648335,0.514317,0.445482,0.455827,289922
8,narrative_tfidf,0.633175,0.498001,0.441432,0.45245,203837


In [22]:
results_scaling=scaling(feature_combinations, X_train, X_val, MaxAbsScaler(),XGBClassifier(), y_train, y_val)
results_scaling

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before
0,"narrative_tfidf, tfidf_title",0.659334,0.54655,0.4515,0.462223,211325
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.66409,0.559104,0.456817,0.467326,239686
2,"narrative_tfidf, tfidf_title, events_tfidf",0.662901,0.550561,0.45517,0.465398,256564
3,"narrative_tfidf, tfidf_title, entities",0.657253,0.545262,0.453359,0.463684,217451
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.659631,0.554039,0.453728,0.465004,284224
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.651605,0.512385,0.43459,0.440465,245458
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.658442,0.537404,0.449618,0.456914,262304
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.662307,0.558208,0.460919,0.474053,289922
8,narrative_tfidf,0.645957,0.51885,0.430046,0.435547,203837


In [11]:
results_scaling=scaling(feature_combinations, X_train, X_val, StandardScaler(with_mean=False),LinearSVC(max_iter=5500), y_train, y_val)
results_scaling

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before
0,"narrative_tfidf, tfidf_title",0.634661,0.525675,0.419825,0.428227,211325
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.631094,0.495797,0.408067,0.410422,239686
2,"narrative_tfidf, tfidf_title, events_tfidf",0.632878,0.509908,0.411542,0.416124,256564
3,"narrative_tfidf, tfidf_title, entities",0.63258,0.532252,0.417684,0.426176,217408
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.630797,0.482126,0.399745,0.396878,284224
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.632878,0.499109,0.407107,0.4085,245489
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.638228,0.523875,0.419148,0.426173,262484
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.634958,0.487328,0.403677,0.401779,290111
8,narrative_tfidf,0.627527,0.493628,0.406825,0.409671,203837


In [24]:
results_scaling=scaling(feature_combinations, X_train, X_val, StandardScaler(with_mean=False),XGBClassifier(), y_train, y_val)
results_scaling

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before
0,"narrative_tfidf, tfidf_title",0.659334,0.54655,0.4515,0.462223,211325
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.66409,0.559104,0.456817,0.467326,239686
2,"narrative_tfidf, tfidf_title, events_tfidf",0.662901,0.550561,0.45517,0.465398,256564
3,"narrative_tfidf, tfidf_title, entities",0.657253,0.545262,0.453359,0.463684,217451
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.659631,0.554039,0.453728,0.465004,284224
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.651605,0.512385,0.43459,0.440465,245458
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.658442,0.537404,0.449618,0.456914,262304
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.662307,0.558208,0.460919,0.474053,289922
8,narrative_tfidf,0.645957,0.51885,0.430046,0.435547,203837


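Scaling does not help. The XGBoost scores are identical under MaxAbsScaler and StandardScaler(with_mean=False), which is expected because tree splits depend only on the ordering of values within each feature, and the LinearSVC scores are lower with StandardScaler than with MaxAbsScaler. The likely reasons: TfidfVectorizer already L2-normalizes each document vector by default (`norm='l2'`); the matrices are sparse, with mostly zero entries; and rescaling each column separately distorts the relative importance of terms within each TF-IDF vector, which the classifier relies on.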
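
The normalization point can be checked directly. In the sketch below (toy corpus, made up for illustration), every row that `TfidfVectorizer` produces has Euclidean length 1, and `MaxAbsScaler` breaks that property because it rescales each column independently.

```python
import numpy as np
from scipy.sparse.linalg import norm as sparse_norm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Toy corpus (hypothetical)
docs = [
    "late delivery refund",
    "refund requested late",
    "account locked",
    "locked account password reset",
]

X = TfidfVectorizer().fit_transform(docs)      # default norm='l2'
norms_before = sparse_norm(X, axis=1)          # Euclidean length of each document vector

X_scaled = MaxAbsScaler().fit_transform(X)     # divides each column by its largest absolute value
norms_after = sparse_norm(X_scaled, axis=1)

print(norms_before)  # all (numerically) 1
print(norms_after)   # no longer constant across documents
```

After column-wise scaling, rare terms that appear in a single document are pushed up to weight 1, so documents containing several rare terms end up with longer vectors than the rest.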

### Chi-square

In [16]:
def chi_square(X_train, X_val, y_train, y_val, n, classifier):
    # fit_transform is equivalent to selector.fit(X_train, y_train) followed by selector.transform(X_train)
    selector = SelectKBest(score_func=chi2, k=n)
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_val_selected = selector.transform(X_val)
    
    # Train the model
    classifier.fit(X_train_selected, y_train)

    # Predict labels
    y_pred = classifier.predict(X_val_selected)

    # Calculate performance metrics
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred, average='macro')
    recall = recall_score(y_val, y_pred, average='macro')
    f1 = f1_score(y_val, y_pred, average='macro')
    return accuracy, precision, recall, f1

In [25]:
def chi2_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, classifier, low, high, step):
    # Counter for progress messages
    cont=0

    # Collect one single-row result DataFrame per feature combination
    result_dfs = []

    # Evaluate models for each feature combination and algorithm
    for feature_set in feature_combinations:
        cont += 1
        X_train_combined = X_train[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)
        X_val_combined = X_val[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)

        # Transform the features using TF-IDF
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.95, min_df=2)
        X_train_combined_transformed = vectorizer.fit_transform(X_train_combined)
        X_val_combined_transformed = vectorizer.transform(X_val_combined)

        # Nr features before
        feature_names = vectorizer.get_feature_names_out()
        num_features_before = len(feature_names)

        max_f1=0
        maxs=(0,0,0,0)
        best_num_features=0
        for i in range(low, high, step):
            accuracy, precision, recall, f1= chi_square(X_train_combined_transformed, X_val_combined_transformed,
                                                        y_train, y_val, i, classifier)
            if f1>max_f1:
                max_f1=f1
                maxs=(accuracy, precision, recall, f1)
                best_num_features=i

        # Create a DataFrame for the current combination and algorithm
        accuracy, precision, recall, f1= maxs
        result_df = pd.DataFrame({'Features Combination': [', '.join(feature_set)],
                                  'Accuracy': [accuracy],
                                  'Precision': [precision],
                                  'Recall': [recall],
                                  'F1-Score': [f1],
                                  'Nr feature before':[num_features_before], 
                                  'Nr features after':[best_num_features]})
        # Append the DataFrame to the list
        result_dfs.append(result_df)

        print("Tested combination {} of {}".format(cont, len(feature_combinations)))

    # Concatenate all the result DataFrames into a single DataFrame
    results = pd.concat(result_dfs, ignore_index=True)
    return results
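
The loop over `k` above refits `SelectKBest` for every value, although the chi-square scores themselves do not depend on `k`. A possible speed-up, sketched here on made-up random data, is to compute the scores once and slice the top `k` columns for each candidate value. `top_k_columns` is a hypothetical helper, not part of scikit-learn, and the result matches `SelectKBest` only when there are no tied scores at the cut-off.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((60, 20))           # toy non-negative features (chi2 requires them)
y = rng.integers(0, 3, size=60)    # toy labels with three classes

scores, _ = chi2(X, y)             # computed once, independent of k
order = np.argsort(scores)[::-1]   # feature indices from highest to lowest score

def top_k_columns(k):
    """Hypothetical helper: sorted indices of the k highest-scoring features."""
    return np.sort(order[:k])

for k in (5, 10, 15):
    X_k = X[:, top_k_columns(k)]   # column subset to train and evaluate on
    print(k, X_k.shape)
```

The same column indexing works on SciPy CSR matrices, so the idea carries over to the sparse TF-IDF matrices used in this notebook.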

In [26]:
results=chi2_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, LinearSVC(), low=1000, high=21001, step=1000)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",0.662901,0.546918,0.448868,0.457685,211325,17000
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.664982,0.564262,0.444714,0.454169,239686,9000
2,"narrative_tfidf, tfidf_title, events_tfidf",0.657551,0.533553,0.436077,0.442641,256564,18000
3,"narrative_tfidf, tfidf_title, entities",0.668847,0.553064,0.453194,0.461752,217451,10000
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.657848,0.549572,0.440608,0.448948,284224,18000
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.662307,0.554895,0.441991,0.449947,245458,8000
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.657551,0.525003,0.435993,0.441559,262304,19000
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.656361,0.549502,0.44002,0.448698,289922,18000
8,narrative_tfidf,0.66201,0.548735,0.436944,0.441998,203837,6000


In [28]:
results=chi2_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, LinearSVC(), low=1000, high=21001, step=100)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",0.664388,0.550169,0.450645,0.459736,211325,17200
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.663496,0.567776,0.444587,0.454427,239686,7700
2,"narrative_tfidf, tfidf_title, events_tfidf",0.658145,0.535086,0.436441,0.443019,256564,18200
3,"narrative_tfidf, tfidf_title, entities",0.667063,0.555337,0.454075,0.464349,217451,10600
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.659631,0.54846,0.441998,0.450269,284224,18700
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.664388,0.561472,0.445026,0.454349,245458,8200
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.657848,0.524237,0.436364,0.441871,262304,19800
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.657848,0.550281,0.440905,0.449526,289922,18200
8,narrative_tfidf,0.660226,0.555258,0.437688,0.44454,203837,5100


In [27]:
results=chi2_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, XGBClassifier(), low=1000, high=21001, step=1000)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",0.666468,0.556829,0.463578,0.475923,211325,5000
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.665874,0.560497,0.457585,0.467151,239686,11000
2,"narrative_tfidf, tfidf_title, events_tfidf",0.655767,0.546979,0.454261,0.467693,256564,2000
3,"narrative_tfidf, tfidf_title, entities",0.656956,0.544619,0.458066,0.468521,217451,5000
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.659631,0.548785,0.459652,0.472621,284224,5000
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.661712,0.557419,0.459358,0.471465,245458,13000
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.66409,0.542431,0.456581,0.467424,262304,1000
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.663496,0.549416,0.458426,0.469421,289922,15000
8,narrative_tfidf,0.654875,0.532995,0.436575,0.444419,203837,16000


In [29]:
results=chi2_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, XGBClassifier(), low=1000, high=21001, step=100)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",0.666468,0.556829,0.463578,0.475923,211325,5000
1,"narrative_tfidf, tfidf_title, tfidf_keywords",0.668847,0.582957,0.468546,0.481305,239686,18700
2,"narrative_tfidf, tfidf_title, events_tfidf",0.661415,0.563185,0.461023,0.475799,256564,2400
3,"narrative_tfidf, tfidf_title, entities",0.66201,0.548412,0.459578,0.472182,217451,1900
4,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.66409,0.569823,0.462847,0.476791,284224,5800
5,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.664685,0.564261,0.462426,0.475357,245458,9400
6,"narrative_tfidf, tfidf_title, events_tfidf, en...",0.659631,0.554976,0.462818,0.476953,262304,1700
7,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",0.667063,0.572599,0.46521,0.478508,289922,12900
8,narrative_tfidf,0.653389,0.526321,0.442609,0.451622,203837,5600


### Tree-based Feature Importance with xgboost  
XGBoost, being a tree-based model, provides a feature importance ranking that can help identify the most relevant features. The feature importance scores indicate the contribution of each feature in the XGBoost model. You can use these scores to select the top-ranked features.  

[SelectFromModel](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) Setting threshold='median' uses the median of the feature importance scores as the threshold: features whose importance is greater than or equal to the median are kept, and the rest are discarded. When the scores are mostly distinct, this retains roughly 50% of the features, which makes it a reasonable starting point. When most scores are zero, as can happen with tree importances over a large sparse vocabulary, the median itself is zero and no feature is discarded; threshold='mean', which the run below uses, is stricter in that case.
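
A detail worth checking is how the threshold behaves when most importances are zero. The sketch below uses a shallow `DecisionTreeClassifier` as a lightweight stand-in for XGBoost (an assumption made only to keep the example small) on synthetic data, and counts how many features `SelectFromModel` keeps under `'median'` and `'mean'`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): 20 features, only a few of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=3, random_state=0)

def n_kept(threshold):
    """Count the features SelectFromModel keeps for a given threshold setting."""
    # A depth-2 tree has at most 3 splits, so at least 17 of the 20 importances are exactly 0
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    selector = SelectFromModel(tree, threshold=threshold).fit(X, y)
    return int(selector.get_support().sum())

kept_median = n_kept("median")  # median importance is 0, so every feature passes ">= 0"
kept_mean = n_kept("mean")      # mean importance is 1/20, so only features the tree used can pass
print(kept_median, kept_mean)
```

With importances this sparse, `'median'` performs no selection at all, while `'mean'` keeps only features that the model actually split on.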


In [23]:
def xgb_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, t, algorithms):
    # Counter for progress messages
    cont=0
    # Collect one single-row result DataFrame per (feature combination, algorithm) pair
    result_dfs = []

    # Evaluate models for each feature combination and algorithm
    for feature_set in feature_combinations:
        cont += 1
        X_train_combined = X_train[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)
        X_val_combined = X_val[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)

        # Transform the features using TF-IDF
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.95, min_df=2)
        X_train_combined_transformed = vectorizer.fit_transform(X_train_combined)
        X_val_combined_transformed = vectorizer.transform(X_val_combined)

        # Nr features before
        feature_names = vectorizer.get_feature_names_out()
        num_features_before = len(feature_names)
        
        # Tree-Based Feature Importance for XGBoost
        xgb_model = XGBClassifier()      
        xgb_model.fit(X_train_combined_transformed, y_train)  

        # Get feature importance scores
        feature_importances = xgb_model.feature_importances_

        # Create a feature selector based on importance scores
        feature_selector = SelectFromModel(xgb_model, threshold=t, prefit=True)

        # Select features above the threshold
        X_train_selected = feature_selector.transform(X_train_combined_transformed)
        X_val_selected = feature_selector.transform(X_val_combined_transformed)
        
        # New nr of features
        best_num_features = X_train_selected.shape[1]
        
        # Train and evaluate models for each algorithm
        for algorithm_name, algorithm in algorithms.items():
            # Train the model
            algorithm.fit(X_train_selected, y_train)

            # Predict labels
            y_pred = algorithm.predict(X_val_selected)

            # Calculate performance metrics
            accuracy = accuracy_score(y_val, y_pred)
            precision = precision_score(y_val, y_pred, average='macro')
            recall = recall_score(y_val, y_pred, average='macro')
            f1 = f1_score(y_val, y_pred, average='macro')

            # Create a DataFrame for the current combination and algorithm
            result_df = pd.DataFrame({'Features Combination': [', '.join(feature_set)],
                                      'Algorithm': [algorithm_name],
                                      'Accuracy': [accuracy],
                                      'Precision': [precision],
                                      'Recall': [recall],
                                      'F1-Score': [f1],
                                      'Nr feature before':[num_features_before], 
                                      'Nr features after':[best_num_features]})
            # Append the DataFrame to the list
            result_dfs.append(result_df)

        print("Tested combination {} of {}".format(cont, len(feature_combinations)))

    # Concatenate all the result DataFrames into a single DataFrame
    results = pd.concat(result_dfs, ignore_index=True)
    return results

In [66]:
algorithms= {'Linear SVC': LinearSVC(), 'XGBoost': XGBClassifier()}
t = 'mean'
results = xgb_feature_selection(feature_combinations, X_train, X_val, y_train, y_val, t, algorithms)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Algorithm,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",Linear SVC,0.65874,0.53141,0.452127,0.459422,211325,1839
1,"narrative_tfidf, tfidf_title",XGBoost,0.659334,0.54655,0.4515,0.462223,211325,1839
2,"narrative_tfidf, tfidf_title, tfidf_keywords",Linear SVC,0.659631,0.536474,0.452782,0.460543,239686,1884
3,"narrative_tfidf, tfidf_title, tfidf_keywords",XGBoost,0.66409,0.559104,0.456817,0.467326,239686,1884
4,"narrative_tfidf, tfidf_title, events_tfidf",Linear SVC,0.659334,0.52446,0.451191,0.456996,256564,1884
5,"narrative_tfidf, tfidf_title, events_tfidf",XGBoost,0.660226,0.546389,0.453691,0.463109,256564,1884
6,"narrative_tfidf, tfidf_title, entities",Linear SVC,0.662604,0.535939,0.453713,0.46052,217518,1856
7,"narrative_tfidf, tfidf_title, entities",XGBoost,0.659631,0.553479,0.453596,0.464237,217518,1856
8,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",Linear SVC,0.656956,0.533347,0.447331,0.454297,284224,1937
9,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",XGBoost,0.659929,0.55419,0.454158,0.46573,284224,1937


### L1 Regularization (LASSO) with LinearSVC  
[L1 based feature selection sklearn](https://scikit-learn.org/stable/modules/feature_selection.html)  
Linear models such as LinearSVC can be regularized using L1 penalty (Lasso). This induces sparsity in the coefficients, allowing you to select important features based on their non-zero coefficients.  
[LinearSVC sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html): The `dual` parameter selects whether the solver optimizes the dual or the primal problem. The documentation says to prefer dual=False when n_samples > n_features; when n_features > n_samples, as is the case here, dual=True is generally recommended. However, penalty='l1' is only supported with dual=False, which is why the cell below sets dual=False.
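
To see the sparsity effect in isolation, the sketch below fits `LinearSVC` with an L1 and an L2 penalty on synthetic data and counts the non-zero coefficients of each. The dataset and the value `C=0.05` are arbitrary choices for illustration, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data (hypothetical): 50 features, 5 of them informative
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# penalty='l1' requires dual=False; the L2 model uses the same setting for a like-for-like comparison
l1_svc = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000).fit(X, y)
l2_svc = LinearSVC(penalty="l2", dual=False, C=0.05, max_iter=5000).fit(X, y)

nonzero_l1 = int(np.count_nonzero(l1_svc.coef_))
nonzero_l2 = int(np.count_nonzero(l2_svc.coef_))
print(nonzero_l1, "non-zero coefficients with L1;", nonzero_l2, "with L2")
```

Smaller values of `C` mean stronger regularization, so lowering `C` on the L1 model leaves fewer non-zero coefficients and therefore fewer selected features.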

In [44]:
import sklearn
print(sklearn.__version__)

1.2.0


In [50]:
# Counter for progress messages
cont=0

# Collect one single-row result DataFrame per (feature combination, algorithm) pair
result_dfs = []

# Evaluate models for each feature combination and algorithm
for feature_set in feature_combinations:
    cont += 1
    X_train_combined = X_train[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)
    X_val_combined = X_val[feature_set].apply(lambda x: ' '.join(x.fillna('').astype(str)), axis=1)

    # Transform the features using TF-IDF
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.95, min_df=2)
    X_train_combined_transformed = vectorizer.fit_transform(X_train_combined)
    X_val_combined_transformed = vectorizer.transform(X_val_combined)

    # Nr features before
    feature_names = vectorizer.get_feature_names_out()
    num_features_before = len(feature_names)
    
    # L1 Regularization (Lasso) for Linear SVC
    svc_model = LinearSVC(penalty='l1', dual=False)

    # SelectFromModel fits its own clone of svc_model, so no separate fit call is needed.
    # For an L1-penalized estimator the default threshold is 1e-5, which keeps the
    # features whose coefficients were not driven to (near) zero.
    feature_selector = SelectFromModel(svc_model)
    X_train_selected = feature_selector.fit_transform(X_train_combined_transformed, y_train)
    X_val_selected = feature_selector.transform(X_val_combined_transformed)

    # New number of features
    best_num_features= X_train_selected.shape[1]

    # Train and evaluate models for each algorithm
    algorithms= {'Linear SVC': LinearSVC(), 'XGBoost': XGBClassifier()}
    for algorithm_name, algorithm in algorithms.items():
        # Train the model
        algorithm.fit(X_train_selected, y_train)

        # Predict labels
        y_pred = algorithm.predict(X_val_selected)

        # Calculate performance metrics
        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred, average='macro')
        recall = recall_score(y_val, y_pred, average='macro')
        f1 = f1_score(y_val, y_pred, average='macro')

        # Create a DataFrame for the current combination and algorithm
        result_df = pd.DataFrame({'Features Combination': [', '.join(feature_set)],
                                      'Algorithm': [algorithm_name],
                                      'Accuracy': [accuracy],
                                      'Precision': [precision],
                                      'Recall': [recall],
                                      'F1-Score': [f1],
                                      'Nr feature before':[num_features_before], 
                                      'Nr features after':[best_num_features]})
        # Append the DataFrame to the list
        result_dfs.append(result_df)
        
    print("Tested combination {} of {}".format(cont, len(feature_combinations)))

# Concatenate all the result DataFrames into a single DataFrame
results = pd.concat(result_dfs, ignore_index=True)
results

Tested combination 1 of 9
Tested combination 2 of 9
Tested combination 3 of 9
Tested combination 4 of 9
Tested combination 5 of 9
Tested combination 6 of 9
Tested combination 7 of 9
Tested combination 8 of 9
Tested combination 9 of 9


Unnamed: 0,Features Combination,Algorithm,Accuracy,Precision,Recall,F1-Score,Nr feature before,Nr features after
0,"narrative_tfidf, tfidf_title",Linear SVC,0.656659,0.51359,0.454399,0.462152,211325,5835
1,"narrative_tfidf, tfidf_title",XGBoost,0.653092,0.548742,0.446319,0.457536,211325,5835
2,"narrative_tfidf, tfidf_title, tfidf_keywords",Linear SVC,0.646849,0.497881,0.43811,0.44477,239686,6141
3,"narrative_tfidf, tfidf_title, tfidf_keywords",XGBoost,0.66201,0.539544,0.449993,0.457471,239686,6141
4,"narrative_tfidf, tfidf_title, events_tfidf",Linear SVC,0.650713,0.507549,0.443724,0.451573,256564,5701
5,"narrative_tfidf, tfidf_title, events_tfidf",XGBoost,0.666171,0.565995,0.46261,0.474662,256564,5701
6,"narrative_tfidf, tfidf_title, entities",Linear SVC,0.656956,0.519443,0.456934,0.465948,217518,5765
7,"narrative_tfidf, tfidf_title, entities",XGBoost,0.66201,0.562623,0.455221,0.466934,217518,5765
8,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",Linear SVC,0.653389,0.505856,0.44158,0.448257,284224,5894
9,"narrative_tfidf, tfidf_title, tfidf_keywords, ...",XGBoost,0.659334,0.550635,0.456959,0.46905,284224,5894


# Embeddings

In [12]:
w2v= KeyedVectors.load_word2vec_format('D:\\MS DATA SCIENCE\\NLP TESE\\embeddings\\skip_s600_word2vec.txt')

In [13]:
glove= KeyedVectors.load_word2vec_format('D:\\MS DATA SCIENCE\\NLP TESE\\embeddings\\glove_s600.txt')

In [10]:
y=df['reason']
X = df[['narrative_embeddings', 'embeddings_title', 'embeddings_keywords', 'events_embeddings', 'entities']]

In [11]:
# Split the data into a training set (70%) and a held-out set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Split the held-out set in half: validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, stratify=y_test, random_state=42)

# Print the shape of each set
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_val shape:", X_val.shape)
print("y_val shape:", y_val.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (15701, 5)
y_train shape: (15701,)
X_val shape: (3364, 5)
y_val shape: (3364,)
X_test shape: (3365, 5)
y_test shape: (3365,)
