In this notebook we scale numerical features. 
- Numerical features should be scaled because they have different ranges and magnitudes, and scaling ensures that each feature contributes equally to the analysis, preventing features with larger ranges from dominating the results.
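- Raw TF-IDF weights are term frequencies weighted by inverse document frequency. They are non-negative and, with the usual L2 normalization, lie in [0, 1], so scaling them again could distort their meaning. Note that the TF-IDF features loaded in this notebook range from about -0.52 to 0.80 (see the range check below), so they are not raw TF-IDF weights and were presumably transformed in an earlier step. They are still left unscaled here.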
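
As a quick illustration of the claim about raw TF-IDF weights, here is a minimal sketch using scikit-learn's `TfidfVectorizer` with its default settings. The toy corpus is invented for this example and has nothing to do with the dataset used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "students write essays about the weather",
]

# Default settings use norm="l2", so every row is scaled to unit length
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus).toarray()

# Euclidean length of each document vector
row_norms = np.linalg.norm(tfidf, axis=1)

print(f"min={tfidf.min():.3f}, max={tfidf.max():.3f}")
print("row norms:", np.round(row_norms, 3))
```

Because every row has unit length and no weight is negative, each individual value is bounded by 0 and 1.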

### Get Data

In [15]:
from sklearn.preprocessing import StandardScaler
import pandas as pd
import joblib

# Load the dataset
df = pd.read_csv('numeric_features_added_exp_2.csv')
tfidf_features_df = pd.read_csv('tfidf_features_exp_2.csv')

### Check the range of the TF-IDF values in `tfidf_features_df`

In [9]:
# Check the minimum and maximum values in the DataFrame
min_tfidf_value = tfidf_features_df.min().min()
max_tfidf_value = tfidf_features_df.max().max()

print(f"Minimum TF-IDF value: {min_tfidf_value}")
print(f"Maximum TF-IDF value: {max_tfidf_value}")

Minimum TF-IDF value: -0.5237111733189144
Maximum TF-IDF value: 0.8012009055934447
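
Negative values cannot come from raw TF-IDF weights. One common way they arise is a dimensionality reduction step such as truncated SVD applied to the TF-IDF matrix. Whether that was done for `tfidf_features_exp_2.csv` is not stated in this notebook, so the sketch below is only an assumption about a possible cause, shown on an invented toy corpus with arbitrary parameters.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the students wrote long essays",
    "the essays were about the weather",
]

raw_tfidf = TfidfVectorizer().fit_transform(corpus)
raw_dense = raw_tfidf.toarray()

# Project onto 2 latent components; n_components and random_state are arbitrary
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(raw_tfidf)

print(f"raw TF-IDF range:      [{raw_dense.min():.3f}, {raw_dense.max():.3f}]")
print(f"reduced feature range: [{reduced.min():.3f}, {reduced.max():.3f}]")
```

The raw weights stay non-negative, while the projected components take both signs, which matches the kind of range seen above.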


### Scaling numerical features

In [11]:
# Select numerical features from df
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Create a DataFrame for numerical features
numerical_features_df = df[numerical_features]
numerical_features_df.head(1)

,reading_time,mistakes_dist_ratio,polysyllabcount,sentence_count,difficult_words,comma_count,transitional_phrases_c,text_dist_words_ratio
0,15.44,0.092,10,14,25,2,5,0.516


In [12]:
# Find the range of 'mistakes_dist_ratio' and 'text_dist_words_ratio' features
mistakes_dist_ratio_range = (df['mistakes_dist_ratio'].min(), df['mistakes_dist_ratio'].max())
text_dist_words_ratio_range = (df['text_dist_words_ratio'].min(), df['text_dist_words_ratio'].max())

print(f"Range of 'mistakes_dist_ratio': {mistakes_dist_ratio_range}")
print(f"Range of 'text_dist_words_ratio': {text_dist_words_ratio_range}")


Range of 'mistakes_dist_ratio': (0.0018399264029438, 0.1847133757961783)
Range of 'text_dist_words_ratio': (0.053743961352657, 0.7371428571428571)


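Even though the 'mistakes_dist_ratio' and 'text_dist_words_ratio' features already fall within relatively small ranges, we standardize them together with the other numerical features so that every numerical feature has zero mean and unit variance. This prevents any single numerical feature from disproportionately influencing the model because of its scale. Note that standardization does not bound the values to a fixed interval, so the scaled features can still span a wider range than the TF-IDF features.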
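
To make the effect of standardization concrete, the following sketch compares `StandardScaler` with the same computation done by hand, z = (x - mean) / std. The column names and values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values on very different scales (not taken from the dataset)
toy = pd.DataFrame({
    "ratio_feature": [0.05, 0.10, 0.20, 0.65],
    "count_feature": [2.0, 14.0, 25.0, 300.0],
})

scaled = StandardScaler().fit_transform(toy)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print("matches manual z-scores:", np.allclose(scaled, manual.to_numpy()))
print("column means:", np.round(scaled.mean(axis=0), 6))
print("column stds: ", np.round(scaled.std(axis=0), 6))
print("column max:  ", np.round(scaled.max(axis=0), 3))
```

Both columns end up with mean 0 and standard deviation 1, but an outlier such as 300 still maps to a value well above 1.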

In [16]:
# Scale all numerical features, including 'mistakes_dist_ratio' and 'text_dist_words_ratio'
features_to_scale = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Scale the selected numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numerical_features_df[features_to_scale])

# Convert scaled numerical features back to DataFrame
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale)


Scaler saved
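
The `Scaler saved` message and the `joblib` import suggest that the fitted scaler is persisted so the same mean and standard deviation can be reused on new data. The save call itself is not shown in the cell above, so the sketch below is an assumption about how that could look. The file name and the toy data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the numerical features
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaler = StandardScaler().fit(train)

# Hypothetical file name; the notebook does not show the real one
path = os.path.join(tempfile.mkdtemp(), "scaler_example.joblib")
joblib.dump(scaler, path)

# Later (for example at prediction time): reload and reuse the training statistics
restored = joblib.load(path)
new_rows = np.array([[2.5, 150.0]])
print(restored.transform(new_rows))
```

Reloading the scaler and calling `transform` (not `fit_transform`) keeps new data on the same scale as the training data.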


In [38]:
# Check the minimum and maximum values for each feature
min_values = scaled_features_df.min()
max_values = scaled_features_df.max()

# Create a DataFrame to display the range of each feature
range_df = pd.DataFrame({'min': min_values, 'max': max_values})

# Display the range
print(range_df)

                             min       max
reading_time           -1.372675  5.644600
mistakes_dist_ratio    -1.942541  3.896023
polysyllabcount        -1.443892  9.054322
sentence_count         -1.935769  7.261488
difficult_words        -1.663917  6.845090
comma_count            -1.151817  7.596885
transitional_phrases_c -2.172849  5.463704
text_dist_words_ratio  -4.812209  3.426264


### Combine scaled numerical features with TF-IDF vectors

In [18]:
# Combine the numerical features with the TF-IDF features
combined_features_df = pd.concat([scaled_features_df, tfidf_features_df], axis=1)

# Print the count of the combined features
print(f"Total number of features: {combined_features_df.shape[1]}")

# Export the combined features to a CSV file
combined_features_df.to_csv('combined_features_exp_2.csv', index=False)
print('File exported')

Total number of features: 1308
File exported
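
`pd.concat(..., axis=1)` joins rows by index label, not by position. In this notebook both frames carry a default index, so the rows line up. If either frame had been filtered or re-indexed beforehand, the result would silently contain extra rows with `NaN`. The sketch below shows the pitfall and one way to guard against it, using invented toy frames.

```python
import pandas as pd

# Two toy frames with the same number of rows but different index labels
left = pd.DataFrame({"a": [1.0, 2.0, 3.0]})                    # index 0, 1, 2
right = pd.DataFrame({"b": [10.0, 20.0, 30.0]}, index=[1, 2, 3])

# Misaligned indexes: concat takes the union of labels and fills gaps with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes makes the join purely positional
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

A cheap safeguard is to check that the combined frame has the same number of rows as each input and contains no missing values before exporting it.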


In [27]:
combined_features_df.head()

,reading_time,mistakes_dist_ratio,polysyllabcount,sentence_count,difficult_words,comma_count,transitional_phrases_c,text_dist_words_ratio,tfidf_feature_1,tfidf_feature_2,...,tfidf_feature_1291,tfidf_feature_1292,tfidf_feature_1293,tfidf_feature_1294,tfidf_feature_1295,tfidf_feature_1296,tfidf_feature_1297,tfidf_feature_1298,tfidf_feature_1299,tfidf_feature_1300
0,-0.832774,0.935982,-1.011867,-0.577084,-0.841618,-0.986748,-1.154642,0.760354,-0.040691,-0.050105,...,-0.002506,-0.011912,0.004882,0.00744,0.015452,-0.010794,0.020366,0.006662,-0.020205,0.01039
1,-0.060045,0.084387,-0.493436,0.259031,-0.448344,0.003672,-0.136435,0.173711,-0.055458,-0.061811,...,-0.006259,-0.003374,-0.001629,-0.014284,-0.007902,0.003364,0.003912,-0.000661,0.0139,-0.003761
2,-1.039652,2.352371,-0.839057,-0.890627,-0.734361,-0.986748,-1.663746,1.86888,0.528772,0.135876,...,-0.0131,0.019167,-0.016713,-0.002862,-0.010954,-0.005106,0.002185,-0.003971,-0.014638,-0.001101
3,-0.562463,0.621275,-0.277424,-1.099655,0.445459,-0.821678,-0.390987,1.60077,-0.277869,0.447509,...,0.006568,-0.017004,0.005746,-0.002621,-0.018733,0.001271,-0.014179,0.001328,0.000835,0.004552
4,-1.044698,1.098406,-1.40069,-0.786112,-1.342148,-0.739143,-1.409194,-0.076249,-0.050562,-0.063374,...,0.000263,-0.001662,-0.007233,0.003478,-0.002804,0.004634,0.007583,-0.006371,0.004315,0.000387
