In this notebook we scale numerical features. 
- Numerical features have different ranges and magnitudes. Standardizing them puts every feature on a comparable scale, so that features with larger ranges do not dominate models that are sensitive to feature magnitude.
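- TF-IDF vectors are not rescaled. TF-IDF weights each term's frequency by its inverse document frequency, and when the vectors are L2-normalized (the default in scikit-learn's `TfidfVectorizer`) every value lies in [0, 1]. Scaling them again could distort their meaning. Judging by their file names, the TF-IDF files loaded below are PCA-reduced, so their values are not necessarily confined to [0, 1]; they are left unscaled all the same.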
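
To make the first point concrete, here is a minimal sketch of the standardization that `StandardScaler` applies: z = (x - mean) / std, with the population standard deviation. The two toy columns and their values are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0), which StandardScaler also uses
    return [(x - mean) / std for x in values]


# Hypothetical toy columns on very different scales (values are assumptions)
reading_time = [10.0, 20.0, 30.0]
ratio = [0.1, 0.2, 0.3]

z_reading_time = standardize(reading_time)
z_ratio = standardize(ratio)

print(z_reading_time)
print(z_ratio)
```

Although the raw columns differ by a factor of 100, their standardized values are essentially identical, which is why neither one dominates after scaling.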

### Get Data

In [30]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the dataset
df = pd.read_csv('numeric_features_added_exp_2.csv')
tfidf_features_df_1000 = pd.read_csv('tfidf_features_exp_2_pca_1000.csv')
tfidf_features_df_700 = pd.read_csv('tfidf_features_exp_2_pca_700.csv')
tfidf_features_df_500 = pd.read_csv('tfidf_features_exp_2_pca_500.csv')

### Scaling numerical features

In [32]:
# Select numerical features from df
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Create a DataFrame for numerical features
numerical_features_df = df[numerical_features]
numerical_features_df.head(1)

| | reading_time | mistakes_dist_ratio | polysyllabcount | sentence_count | difficult_words | comma_count | transitional_phrases_c | text_dist_words_ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.44 | 0.092 | 10 | 14 | 25 | 2 | 5 | 0.516 |


In [33]:
# Scale all eight numerical features, including the two ratio features
# ('mistakes_dist_ratio' and 'text_dist_words_ratio')
features_to_scale = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Scale the selected numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numerical_features_df[features_to_scale])

# Convert scaled numerical features back to DataFrame
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale)
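
A quick sanity check after scaling is to confirm that each column has a mean of roughly 0 and a standard deviation of roughly 1. The sketch below does this on a small stand-in DataFrame. `toy_df` and its values are assumptions made up for illustration, and the same check could be run on `scaled_features_df`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for numerical_features_df (values are made up)
toy_df = pd.DataFrame({
    'reading_time': [15.44, 8.10, 22.75, 12.30],
    'sentence_count': [14, 6, 21, 11],
})

toy_scaler = StandardScaler()
toy_scaled_df = pd.DataFrame(
    toy_scaler.fit_transform(toy_df),
    columns=toy_df.columns,
    index=toy_df.index,  # keep the original row index for later alignment
)

# StandardScaler divides by the population std, so compare with ddof=0
col_means = toy_scaled_df.mean()
col_stds = toy_scaled_df.std(ddof=0)

assert np.allclose(col_means, 0.0)
assert np.allclose(col_stds, 1.0)
print(col_means)
print(col_stds)
```

Note that pandas' `std()` defaults to `ddof=1`, so without `ddof=0` the reported standard deviations come out slightly above 1 on small samples.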

### Save Scaler

In [37]:
import joblib
# Save the scaler
joblib.dump(scaler, 'scaler_exp2.pkl')
print("Scaler saved")

Scaler saved
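
The point of saving the scaler is to reuse the training-set mean and standard deviation on new data. A minimal sketch of that round trip is shown below. The toy data, the two-feature subset, and the file name `toy_scaler.pkl` are assumptions for illustration; in this notebook the saved file is `scaler_exp2.pkl`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

feature_names = ['reading_time', 'sentence_count']  # subset, for brevity

# Hypothetical training rows (values are made up)
train_df = pd.DataFrame([[15.44, 14], [8.10, 6], [22.75, 21]], columns=feature_names)
fitted_scaler = StandardScaler().fit(train_df)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, 'toy_scaler.pkl')  # hypothetical file name
    joblib.dump(fitted_scaler, scaler_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows need the same columns, in the same order, as during fitting
new_df = pd.DataFrame([[12.0, 10]], columns=feature_names)
new_scaled = loaded_scaler.transform(new_df)  # transform only, no refitting

# The reloaded scaler behaves exactly like the original one
assert np.allclose(new_scaled, fitted_scaler.transform(new_df))
print(new_scaled)
```

Calling `transform` rather than `fit_transform` on new data matters: refitting would replace the training statistics with those of the new rows.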


### Combine scaled numerical features with TF-IDF vectors

In [24]:
# Combine the numerical features with the TF-IDF features
combined_features_df_1000 = pd.concat([scaled_features_df, tfidf_features_df_1000], axis=1)
combined_features_df_700 = pd.concat([scaled_features_df, tfidf_features_df_700], axis=1)
combined_features_df_500 = pd.concat([scaled_features_df, tfidf_features_df_500], axis=1)

# Print the count of the combined features
print(f"Total number of features: {combined_features_df_1000.shape[1]}")
print(f"Total number of features: {combined_features_df_700.shape[1]}")
print(f"Total number of features: {combined_features_df_500.shape[1]}")

# Export the combined features to a CSV file
combined_features_df_1000.to_csv('combined_features_exp_2_pca_1000.csv', index=False)
combined_features_df_700.to_csv('combined_features_exp_2_pca_700.csv', index=False)
combined_features_df_500.to_csv('combined_features_exp_2_pca_500.csv', index=False)
print('Files exported')

Total number of features: 1008
Total number of features: 708
Total number of features: 508
Files exported
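
One thing worth checking when combining frames this way: `pd.concat(..., axis=1)` aligns rows by index label, not by position. If the two frames do not share the same index, the result gains extra rows filled with NaN. The sketch below illustrates this with made-up data; the column name `pc_0` and all values are assumptions, not taken from the exported files.

```python
import pandas as pd

# Hypothetical frames: same number of rows, but different index labels
left = pd.DataFrame({'reading_time': [0.5, -1.2, 0.7]})           # index 0, 1, 2
right = pd.DataFrame({'pc_0': [0.1, 0.2, 0.3]}, index=[1, 2, 3])  # index 1, 2, 3

# Outer join on the index: labels 0..3, with NaN where one side has no row
misaligned = pd.concat([left, right], axis=1)
n_missing = int(misaligned.isna().sum().sum())

# Resetting both indexes makes the concatenation purely positional
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

assert len(aligned) == len(left) == len(right)
assert not aligned.isna().any().any()
print(misaligned.shape, n_missing, aligned.shape)
```

In this notebook both frames come straight from `pd.read_csv` and so carry the default 0..n-1 index, which means the concatenation lines up as long as the files have the same number of rows in the same order.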
