In this notebook we scale numerical features. 
- Numerical features have different ranges and magnitudes. Scaling puts them on a comparable footing, so features with larger ranges do not dominate scale-sensitive models (for example distance-based or regularized linear models).
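- TF-IDF vectors are not rescaled here. TF-IDF weights are term frequencies weighted by inverse document frequency; with the usual L2 row normalization they are non-negative and lie in [0, 1], and scaling them again could distort their meaning. Note that the range check below reports negative values, so the vectors stored in `tfidf_features.csv` appear to have been transformed after vectorization (for example centered or projected). That step is not shown in this notebook.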
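
As a small illustration of the [0, 1] claim, the sketch below vectorizes a toy corpus with scikit-learn's `TfidfVectorizer` under its default settings (`norm='l2'`). The corpus and variable names are made up for this example and are not the data used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely illustrative (not the notebook's data)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Default settings use norm='l2': each row is scaled to unit Euclidean length
dense = TfidfVectorizer().fit_transform(docs).toarray()

# Every weight is non-negative and at most 1
print(f"min: {dense.min()}, max: {dense.max()}")

# Each document vector has length 1 after L2 normalization
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)
```

Raw L2-normalized TF-IDF output cannot contain negative values, so negative entries point to an extra transformation applied afterwards.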

### Get Data

In [3]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Load the dataset
df = pd.read_csv('numeric_features_added_v1.csv')
tfidf_features_df = pd.read_csv('tfidf_features.csv')

### Check the range of the TF-IDF values in `tfidf_features_df`

In [17]:
# Check the minimum and maximum values in the DataFrame
min_tfidf_value = tfidf_features_df.min().min()
max_tfidf_value = tfidf_features_df.max().max()

print(f"Minimum TF-IDF value: {min_tfidf_value}")
print(f"Maximum TF-IDF value: {max_tfidf_value}")

Minimum TF-IDF value: -0.5213811721886221
Maximum TF-IDF value: 0.6918156199801564


### Scaling numerical features

In [5]:
# Select numerical features from df
numerical_features = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Create a DataFrame for numerical features
numerical_features_df = df[numerical_features]
numerical_features_df.head(1)

   reading_time  mistakes_dist_ratio  polysyllabcount  sentence_count  difficult_words  comma_count  transitional_phrases_c  text_dist_words_ratio
0         45.92              0.03453               45              28               66            9                      13               0.343923


In [6]:
# Find the range of 'mistakes_dist_ratio' and 'text_dist_words_ratio' features
mistakes_dist_ratio_range = (df['mistakes_dist_ratio'].min(), df['mistakes_dist_ratio'].max())
text_dist_words_ratio_range = (df['text_dist_words_ratio'].min(), df['text_dist_words_ratio'].max())

print(f"Range of 'mistakes_dist_ratio': {mistakes_dist_ratio_range}")
print(f"Range of 'text_dist_words_ratio': {text_dist_words_ratio_range}")


Range of 'mistakes_dist_ratio': (0.0036697247706422, 0.1847133757961783)
Range of 'text_dist_words_ratio': (0.053743961352657, 0.7371428571428571)


Although 'mistakes_dist_ratio' and 'text_dist_words_ratio' already fall within small ranges, we standardize them together with the other numerical features so that every numerical column has zero mean and unit variance. This keeps any single numerical feature from disproportionately influencing the model because of its scale. Standardized values are not bounded, so they are only roughly comparable to the TF-IDF features and will not share their exact range.
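
To make the transformation concrete, here is a minimal sketch of what `StandardScaler` computes for a single column: subtract the column mean and divide by the population standard deviation (`ddof=0`). The values are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column with illustrative ratio-like values (not from the dataset)
x = np.array([[0.01], [0.05], [0.10], [0.18]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(x)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

Even a column with a tiny raw range ends up with unit variance, which is why the two ratio features change noticeably after scaling.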

In [8]:
# Scale all numerical features, including 'mistakes_dist_ratio' and 'text_dist_words_ratio'
features_to_scale = [
    'reading_time',
    'mistakes_dist_ratio',
    'polysyllabcount',
    'sentence_count',
    'difficult_words',
    'comma_count',
    'transitional_phrases_c',
    'text_dist_words_ratio'
]

# Scale the selected numerical features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(numerical_features_df[features_to_scale])

# Convert scaled numerical features back to DataFrame
scaled_features_df = pd.DataFrame(scaled_features, columns=features_to_scale)

# Disabled: the two ratio features are now scaled with the rest, so no unscaled subset is kept
#unscaled_features_df = numerical_features_df[['mistakes_dist_ratio', 'text_dist_words_ratio']]

In [23]:
# Check the minimum and maximum values for each feature
min_values = scaled_features_df.min()
max_values = scaled_features_df.max()

# Create a DataFrame to display the range of each feature
range_df = pd.DataFrame({'min': min_values, 'max': max_values})

# Display the range
print(range_df)

                             min        max
reading_time           -1.562640   7.108973
mistakes_dist_ratio    -2.495825   4.155722
polysyllabcount        -1.620069  11.390363
sentence_count         -2.164033   8.711115
difficult_words        -1.823100   7.380028
comma_count            -1.212686   9.098990
transitional_phrases_c -2.359454   5.011177
text_dist_words_ratio  -5.518142   3.555418
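
The maxima above reach past 11 standard deviations for `polysyllabcount`, which suggests skewed distributions or outliers rather than a scaling error. The sketch below shows one way to check this kind of thing. It uses synthetic data with made-up column names, so it illustrates the check rather than the notebook's actual result.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic data: one roughly symmetric column, one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric_feature": rng.normal(50, 10, size=500),
    "skewed_count": rng.exponential(5, size=500),
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Mean and std confirm the standardization; min, max and the count of values
# beyond 3 standard deviations hint at skew or outliers
summary = pd.DataFrame({
    "mean": scaled.mean(),
    "std": scaled.std(ddof=0),
    "min": scaled.min(),
    "max": scaled.max(),
    "n_beyond_3sd": (scaled.abs() > 3).sum(),
})
print(summary.round(3))
```

A mean near 0 and a standard deviation near 1 indicate the scaler worked as intended. Large extremes are a property of the data and may be worth inspecting before modeling.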


### Combine scaled numerical features with TF-IDF vectors

In [10]:
# Disabled: only needed if some numerical features were left unscaled
#combined_numerical_features_df = pd.concat([scaled_features_df, unscaled_features_df], axis=1)

# Combine the numerical features with the TF-IDF features
combined_features_df = pd.concat([scaled_features_df, tfidf_features_df], axis=1)

# Print the count of the combined features
print(f"Total number of features: {combined_features_df.shape[1]}")

# Export the combined features to a CSV file
combined_features_df.to_csv('combined_features.csv', index=False)
print('File exported')

Total number of features: 1308
File exported
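
`pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook both frames carry the default `RangeIndex`, so the concatenation lines up as intended. If either frame had been filtered or re-indexed, mismatched labels would silently produce extra rows filled with NaN. The sketch below shows the pitfall and one possible guard. The helper `combine_features` and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

def combine_features(numeric_df, text_df):
    """Concatenate column-wise by row position (hypothetical helper)."""
    if len(numeric_df) != len(text_df):
        raise ValueError(
            f"Row count mismatch: {len(numeric_df)} vs {len(text_df)}"
        )
    # Reset both indexes so rows pair up by position rather than by label
    return pd.concat(
        [numeric_df.reset_index(drop=True), text_df.reset_index(drop=True)],
        axis=1,
    )

# Toy frames with different index labels, purely illustrative
a = pd.DataFrame({"reading_time": [0.1, -0.2, 0.3]}, index=[10, 11, 12])
b = pd.DataFrame({"tfidf_0": [0.5, 0.0, 0.2]})

naive = pd.concat([a, b], axis=1)  # aligns on labels: rows do not match up
safe = combine_features(a, b)      # aligns on position

print(naive.shape, int(naive.isna().sum().sum()))
print(safe.shape, int(safe.isna().sum().sum()))
```

Checking `combined_features_df.isna().sum().sum()` before exporting is a cheap way to catch this kind of misalignment.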
