# Feature Creation

This page focuses on creating features from cleaned text data by counting word frequencies, and on the limitations of that approach. We will also discuss how to handle the non-text columns in the data, especially those holding categorical variables.

First, we will load the data we cleaned on the previous page and split it into train and test datasets. We split first so that the test dataset cannot influence how the pipeline is built. Technically, we are not supposed to know anything about our test set until the pipeline is complete. **After we create the pipeline, we will apply the same pipeline we used for the train set to the test set to ensure both datasets end up with the same feature columns.**
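The split-then-fit idea can be sketched in plain Python. This is a minimal illustration, not the pipeline built on this page: the helper names (`train_test_split`, `fit_vocabulary`, `transform`), the sample descriptions, and the 25% test fraction are all assumptions made for the example.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_vocabulary(descriptions):
    """Learn a word -> column index mapping from the given descriptions."""
    words = sorted({w for d in descriptions for w in d.split()})
    return {w: i for i, w in enumerate(words)}


def transform(descriptions, vocabulary):
    """Count vocabulary words per description; unseen words are ignored."""
    matrix = []
    for d in descriptions:
        row = [0] * len(vocabulary)
        for w in d.split():
            if w in vocabulary:
                row[vocabulary[w]] += 1
        matrix.append(row)
    return matrix


# Hypothetical cleaned descriptions, standing in for the real dataset.
descriptions = [
    "cozy room near park",
    "large bright kitchen",
    "quiet street cozy garden",
    "bright room large window",
    "garden near river",
    "quiet room",
    "large garden quiet street",
    "cozy kitchen",
]

train, test = train_test_split(descriptions, test_fraction=0.25, seed=42)

# The vocabulary is learned from the train set only ...
vocabulary = fit_vocabulary(train)

# ... and the same mapping is applied to both sets.
X_train = transform(train, vocabulary)
X_test = transform(test, vocabulary)
```

Because the test set is transformed with the vocabulary learned from the train set, every row in `X_train` and `X_test` has exactly `len(vocabulary)` columns, and words that appear only in the test set are dropped rather than creating new features.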

As we mentioned earlier, we want each word to be a feature of the dataset. To do this, we will use the **Bag of Words** method, which represents each description in the dataset only by its word counts (e.g., if a word appears twice in a description, its column holds 2; if it does not appear, the column holds 0). However, using Bag of Words alone is somewhat problematic. Since raw counts depend on the length of the text, longer descriptions can produce high counts for unimportant words, even ones that rarely appear across the data as a whole. This can cause trouble when we come to feature selection.
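To make the length effect concrete, here is a small Bag of Words sketch using only the standard library. The `bag_of_words` helper and the two sample descriptions are hypothetical, and tokenizing by whitespace assumes the text has already been cleaned.

```python
from collections import Counter


def bag_of_words(descriptions):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    times vocabulary[j] appears in descriptions[i]."""
    tokenized = [d.split() for d in descriptions]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Counter returns 0 for words that are missing from a description.
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix


# Two hypothetical descriptions: one short, one long.
docs = [
    "cozy room",
    "cozy room near park with large cozy kitchen and cozy garden",
]

vocabulary, matrix = bag_of_words(docs)
cozy_column = vocabulary.index("cozy")
short_count = matrix[0][cozy_column]
long_count = matrix[1][cozy_column]
```

Here `long_count` is larger than `short_count` simply because the second description is longer, even though "cozy" makes up a smaller share of its words. Raw counts alone cannot tell these two situations apart.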

To remedy this problem, we will weight the word frequencies using **TF-IDF (Term Frequency-Inverse Document Frequency)**. It is a numerical statistic intended to reflect how important a word is to a document in a collection, and it is calculated as follows.

$$ \text{TF-IDF} = \text{TF} \times \text{IDF} $$

where $$ \text{TF} = \frac{\text{number of times the word appears in the description}}{\text{total number of words in the description}}$$

$$ \text{IDF} = \log \left( \frac{\text{number of descriptions in the dataset}}{\text{number of descriptions that contain the word}} \right)$$

Term Frequency (TF) measures how frequently a term occurs in a document. Inverse Document Frequency (IDF) is a factor that diminishes the weight of terms that occur in many documents across the dataset and increases the weight of terms that occur in only a few. As the formula shows, the fewer descriptions a word appears in, the larger its IDF, which increases its TF-IDF as a result. A word that appears in every description has an IDF of zero. We therefore give more weight to words that appear often within a description but rarely across the entire dataset. This way, we can keep very common, uninformative words from dominating the features in our dataset.
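The weighting can be checked on a toy example. The sketch below assumes TF is a word's count divided by the number of words in its description, and IDF is the natural logarithm of the number of descriptions divided by the number of descriptions containing the word. The log base, the `tf_idf` helper, and the sample descriptions are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(descriptions):
    """Return one {word: TF-IDF weight} dict per description."""
    tokenized = [d.split() for d in descriptions]
    n_docs = len(tokenized)

    # Document frequency: in how many descriptions does each word appear?
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights


# Hypothetical descriptions: "cozy" is in all three, "room" in two,
# "garden" in only one.
docs = ["cozy room", "cozy garden", "cozy room near park"]
weights = tf_idf(docs)

w_cozy = weights[0]["cozy"]      # appears in every description
w_room = weights[0]["room"]      # appears in two of three descriptions
w_garden = weights[1]["garden"]  # appears in one of three descriptions
```

All three words have the same TF of 0.5 in their descriptions, so the ordering `w_garden > w_room > w_cozy` comes entirely from IDF: the rarer the word across the dataset, the higher its weight, and "cozy" drops to zero because it appears everywhere. Library implementations may differ in detail; for example, scikit-learn's `TfidfVectorizer` smooths the IDF and normalizes each row by default, so its numbers will not match this sketch exactly.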