# Feature Engineering

### Getting Started

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

from collections import Counter
import math

nltk.download(['vader_lexicon', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/kendrick/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/kendrick/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

### Import Dataset

Source: https://www.kaggle.com/datasets/kkhandekar/word-difficulty/data

In [2]:
df = pd.read_csv("../data/WordDifficulty.csv")
df.head()

Unnamed: 0,Word,Length,Freq_HAL,Log_Freq_HAL,I_Mean_RT,I_Zscore,I_SD,Obs,I_Mean_Accuracy
0,a,1,10610626,16.18,798.92,-0.01,333.85,24.0,0.73
1,aah,3,222,5.4,816.43,0.21,186.03,21.0,0.62
2,Aaron,5,10806,9.29,736.06,-0.11,289.01,32.0,0.97
3,aback,5,387,5.96,796.27,0.11,171.61,15.0,0.45
4,abacus,6,513,6.24,964.4,0.65,489.0,15.0,0.47


### Feature Descriptions

`Length`: Number of characters

`Freq_HAL`: Hyperspace Analogue to Language (HAL) frequency norms, based on the HAL corpus of 131 million words. Higher values indicate words that occur more often in that corpus.

`Log_Freq_HAL`: Logarithmic transformation of `Freq_HAL`. The sample rows above are consistent with a natural logarithm, e.g. $\ln(222) \approx 5.40$.

`I_Mean_RT`: Individual mean reaction time, associated with lexical decision time

`I_Zscore`: Z-score of individual reaction times, associated with word difficulty

`I_SD`: Standard deviation of individual reaction times

`Obs`: Number of observations, i.e. individuals tested on the respective word

`I_Mean_Accuracy`: Individual mean accuracy score, average accuracy score in tasks related to word difficulty
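
As a quick sanity check on how `Log_Freq_HAL` relates to `Freq_HAL`, the sketch below recomputes the logarithm for the five rows shown in the `df.head()` output above. It assumes a natural logarithm, which is an inference from those sample values and not something the dataset source states.

```python
import math

# (Word, Freq_HAL, Log_Freq_HAL) copied from the df.head() output above
sample_rows = [
    ("a", 10610626, 16.18),
    ("aah", 222, 5.40),
    ("Aaron", 10806, 9.29),
    ("aback", 387, 5.96),
    ("abacus", 513, 6.24),
]

# Assumption: Log_Freq_HAL is the natural log of Freq_HAL, rounded to 2 decimals
recomputed = {word: round(math.log(freq), 2) for word, freq, _ in sample_rows}

# Compare each recomputed value with the published one (small tolerance for rounding)
matches = all(abs(recomputed[word] - logged) < 0.011 for word, _, logged in sample_rows)

print(recomputed)
print("All sample rows consistent with ln(Freq_HAL):", matches)
```

Five rows cannot rule out a closely related transform such as $\ln(x + 1)$, so this only supports the assumption; it does not prove it.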

This dataset appears to have been preprocessed before publication. Performing our own transformations, such as normalization, may conflict with that preprocessing. As noted above, `Freq_HAL` was computed from the HAL corpus of 131 million words; therefore, it should be safe to keep the data as is.

At a glance, we can perform a quick feature selection by removing the `Freq_HAL`, `I_SD`, and `Obs` columns, since other columns, such as `Log_Freq_HAL`, `I_Zscore`, and the mean scores, make them redundant.

In [3]:
# Drop the redundant Freq_HAL, I_SD, and Obs columns
df = df.drop(['Freq_HAL', 'I_SD', 'Obs'], axis=1)
# Drop rows with missing values and renumber the index
df = df.dropna().reset_index(drop=True)

# Lowercase all words
df['Word'] = df['Word'].str.lower()

df.head()

Unnamed: 0,Word,Length,Log_Freq_HAL,I_Mean_RT,I_Zscore,I_Mean_Accuracy
0,a,1,16.18,798.92,-0.01,0.73
1,aah,3,5.4,816.43,0.21,0.62
2,aaron,5,9.29,736.06,-0.11,0.97
3,aback,5,5.96,796.27,0.11,0.45
4,abacus,6,6.24,964.4,0.65,0.47


### Additional Quantitative Features 

We can extract additional quantitative features from a word, such as:

- Vowel count, which also correlates with the number of syllables
- Entropy, a measure of the unpredictability of the word's characters, computed as $H(X) = -\sum_{x} p(x)\log_2{p(x)}$, where $p(x)$ is the relative frequency of character $x$ in the word
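
As a worked example of the entropy feature, assuming the base-2 logarithm used in the `calculate_entropy` cell, consider the word `aah`. The character `a` appears twice and `h` once, so $p(a) = 2/3$ and $p(h) = 1/3$:

$$H(\text{aah}) = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.390 + 0.528 \approx 0.918$$

A word made of a single repeated character has entropy 0, and a word whose characters are all distinct reaches the maximum of $\log_2$ of its length.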

In [4]:
# Count vowels
vowels = ['a', 'e', 'i', 'o', 'u']
num_vowels = []
for word in df['Word']:
    vowel_count = sum(word.count(vowel) for vowel in vowels)
    num_vowels.append(vowel_count)
    
df['Vowels'] = num_vowels
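
The loop above can also be written as a single vectorized call. The sketch below uses pandas' `Series.str.count` with a regular-expression character class; the small `words` series is a hypothetical stand-in for `df['Word']`, and, like the loop, it does not treat `y` as a vowel.

```python
import pandas as pd

# Hypothetical stand-in for the lowercased `Word` column of the dataset
words = pd.Series(["a", "aah", "aaron", "aback", "abacus"], name="Word")

# Count every character matching the class [aeiou] in each word
vowel_counts = words.str.count(r"[aeiou]")

print(vowel_counts.tolist())  # [1, 2, 3, 2, 3]
```

On the real data this would correspond to `df['Vowels'] = df['Word'].str.count(r'[aeiou]')`.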

In [5]:
# Word entropy
def calculate_entropy(word):
    # Frequency of each character
    char_counts = Counter(word)

    # Calculate the probability of each character
    total_chars = len(word)
    char_probabilities = {char: count / total_chars for char, count in char_counts.items()}

    # Calculate the entropy
    entropy = -sum(prob * math.log2(prob) for prob in char_probabilities.values())

    return entropy

entropy_values = [calculate_entropy(word) for word in df['Word']]
df['Entropy'] = entropy_values
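
One edge case is worth noting: for a single-character word the sum is `0.0`, and negating it yields the floating-point value `-0.0`, which is what appears for the word `a` in the `df.head()` output further down. It compares equal to `0.0`, so the feature is still correct, but if a cleaner display is preferred, a minimal variant (the name `calculate_entropy_nonneg` is hypothetical) could look like this:

```python
import math
from collections import Counter

def calculate_entropy_nonneg(word):
    """Shannon entropy (in bits) of the characters in `word`, never returning -0.0."""
    total_chars = len(word)
    entropy = -sum(
        (count / total_chars) * math.log2(count / total_chars)
        for count in Counter(word).values()
    )
    # Adding 0.0 turns IEEE-754 negative zero into positive zero
    return entropy + 0.0

print(calculate_entropy_nonneg("a"))              # 0.0
print(round(calculate_entropy_nonneg("aah"), 6))  # 0.918296
```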

### Additional Features Using Natural Language Toolkit (NLTK) 

We will use `nltk` to get a part-of-speech tag for each word, based on the [Penn Treebank Project](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) tag set, and a sentiment score for each word.

In [6]:
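# Get part-of-speech tags (Penn Treebank tag set).
# Note: pos_tag treats the whole column as one token sequence,
# so a word's tag may be influenced by its neighbors.
word_tags = nltk.pos_tag(df['Word'].tolist())

# Keep only the tag from each (word, tag) pair
df['PoS'] = [tag for _, tag in word_tags]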
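
The Penn Treebank tag set is fairly fine-grained, so the `PoS` column can have many distinct values. If a lower-cardinality categorical feature is wanted, one optional approach is to collapse tags by their two-letter prefix. The mapping and the helper `coarse_pos` below are hypothetical and not part of the original notebook:

```python
# Hypothetical mapping from Penn Treebank tag prefixes to coarse categories
COARSE_PREFIXES = {
    "NN": "noun",       # NN, NNS, NNP, NNPS
    "VB": "verb",       # VB, VBD, VBG, VBN, VBP, VBZ
    "JJ": "adjective",  # JJ, JJR, JJS
    "RB": "adverb",     # RB, RBR, RBS
}

def coarse_pos(tag):
    """Collapse a Penn Treebank tag into a coarse category, defaulting to 'other'."""
    return COARSE_PREFIXES.get(tag[:2], "other")

print([coarse_pos(t) for t in ["DT", "JJ", "NN", "VBG", "RBR"]])
# ['other', 'adjective', 'noun', 'verb', 'adverb']
```

Applied to the data, this would be `df['PoS'].map(coarse_pos)`.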

In [7]:
# Get the sentiment score; VADER's compound score (range -1 to 1) serves as the overall polarity
sia = SentimentIntensityAnalyzer()
sentiment_scores = [sia.polarity_scores(word)['compound'] for word in df['Word']]

df['SentimentScore'] = sentiment_scores

In [8]:
df.head()

Unnamed: 0,Word,Length,Log_Freq_HAL,I_Mean_RT,I_Zscore,I_Mean_Accuracy,Vowels,Entropy,PoS,SentimentScore
0,a,1,16.18,798.92,-0.01,0.73,1,-0.0,DT,0.0
1,aah,3,5.4,816.43,0.21,0.62,2,0.918296,JJ,0.0
2,aaron,5,9.29,736.06,-0.11,0.97,3,1.921928,NN,0.0
3,aback,5,5.96,796.27,0.11,0.45,2,1.921928,NN,0.0
4,abacus,6,6.24,964.4,0.65,0.47,3,2.251629,NN,0.0


In [9]:
# Levenshtein Distance... for algorithm distance metric later
# word embeddings...
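
The Levenshtein distance named in the placeholder above is not implemented in this notebook. As a rough idea of what such a metric could look like, here is a minimal pure-Python sketch using the standard dynamic-programming formulation; the function name `levenshtein` is an assumption, and a dedicated library may be preferable for large word lists.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    # previous[j] is the distance between a[:i-1] and b[:j] at the start of iteration i
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

print(levenshtein("aback", "abacus"))  # 2
```

Here `aback` becomes `abacus` by substituting `k` with `u` and appending `s`, hence a distance of 2.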