# Text Feature Selection

As we discussed [here](featureselection.ipynb), we must perform feature selection on the text features first because they cause a `MemoryError` due to their massive file size (10GB). Since we cannot use the `fraudulent` column, we will use column means to select the features, based on several assumptions.

In [1]:
import pandas as pd 
import joblib

In [2]:
text_features_train = joblib.load('./data/text_features_train_jlib')
text_features_train.head(5)

Unnamed: 0,aa_desc,aaa_desc,aaab_desc,aab_desc,aabc_desc,aabd_desc,aabf_desc,aac_desc,aaccd_desc,aachen_desc,...,zodat_benefits,zollman_benefits,zombi_benefits,zone_benefits,zoo_benefits,zowel_benefits,zu_benefits,zult_benefits,zutrifft_benefits,zweig_benefits
0,0.165596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let's look carefully at the dataframe above. Some of these features have unusual names, such as `zutrifft` and `aabd`, and it is hard to believe that they were stemmed from ordinary English words. There are two main reasons why such unusual names appear as features.

1. Although we removed URLs and HTML markup in the preprocessing step, some of it may not have been removed completely. The data can also include other non-English tokens, such as email addresses or file names.
2. In the original dataset, the text was saved with no space between lines. For example, "I love dog" (line 1) and "Cat ate fish" (line 2) become "I love dogCat ate fish", which creates the abnormal word "dogCat".
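The second effect is easy to reproduce. The sketch below uses made-up lines and a plain regular-expression tokenizer (an assumption; it is not the project's actual preprocessing) to show how joining lines without a separator fuses the last word of one line with the first word of the next.

```python
import re

lines = ["I love dog", "Cat ate fish"]  # made-up example lines


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep runs of letters only
    return re.findall(r"[a-z]+", text.lower())


joined_without_space = "".join(lines)   # "I love dogCat ate fish"
joined_with_space = " ".join(lines)     # "I love dog Cat ate fish"

tokens_bad = tokenize(joined_without_space)
tokens_good = tokenize(joined_with_space)

print(tokens_bad)    # includes the fused token "dogcat"
print(tokens_good)   # "dog" and "cat" stay separate
```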

The simplest way to remove these unusual words at low computational cost is to compute each column's mean and filter out the features with exceptionally low means. This rests on the assumption that unusual words appear less frequently than normal words. For instance, a word like "havecommunication" will not appear frequently across the dataset. A word that appears infrequently has a low column mean, because its TF-IDF value is zero in most rows.
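As a small illustration of the idea, the sketch below applies a column-mean filter to a toy matrix. The values, the column names and the cut-off are made up for the example and are not taken from the real data.

```python
import pandas as pd

# Toy TF-IDF-like matrix (made-up values): rows are documents, columns are terms
toy = pd.DataFrame(
    {
        "work_desc": [0.20, 0.00, 0.15, 0.10],
        "team_desc": [0.00, 0.12, 0.08, 0.00],
        "havecommun_desc": [0.00, 0.00, 0.00, 0.02],
    }
)

col_means = toy.mean()          # equivalent to toy.sum() / len(toy)
threshold = 0.01                # illustrative cut-off only
keep = col_means > threshold    # boolean mask indexed by column name
filtered = toy.loc[:, keep]

print(col_means.sort_values(ascending=False))
print(list(filtered.columns))
```

The rare term is nonzero in a single row, so its mean falls below the cut-off and the column is dropped.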

However, we must not apply this column-mean selection to the entire dataset at once, because `text_features_train` combines features from four different columns: `description`, `title`, `requirements` and `benefits`. Since TF-IDF values vary with the characteristics of each column's text, we compute the column means and select features separately for each one, then combine the results afterwards.

In [3]:
# Number of `_desc` columns (40607); this gives the end of the slice below
sum(text_features_train.columns.str.contains('_desc'))
text_features_desc = text_features_train.iloc[:, 0:40607]

In [4]:
# Number of `_req` columns (34256); the slice starts where `_desc` ends
sum(text_features_train.columns.str.contains('_req'))
text_features_req = text_features_train.iloc[:, 40607:74863]

In [7]:
# Number of `_title` columns (3417); the slice starts where `_req` ends
sum(text_features_train.columns.str.contains('_title'))
text_features_title = text_features_train.iloc[:, 74863:78280]

In [8]:
# Number of `_benefits` columns (11247); the slice starts where `_title` ends
sum(text_features_train.columns.str.contains('_benefits'))
text_features_benefits = text_features_train.iloc[:, 78280:89527]

```{note}
We are separating the dataframe like this to avoid the MemoryError.
```
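The positional slices above depend on the columns staying in exactly this order. As an alternative sketch, the split can be driven by the column-name suffix instead. `split_by_suffix` is a hypothetical helper and the toy frame is made up. Boolean column selection generally copies the selected data, so on the full matrix the positional slices may still be the more memory-friendly choice.

```python
import pandas as pd


def split_by_suffix(df, suffixes):
    """Return {suffix: sub-frame of the columns whose name ends with that suffix}."""
    return {s: df.loc[:, df.columns.str.endswith(s)] for s in suffixes}


# Toy frame with one row and the same naming scheme as the real features
toy = pd.DataFrame(
    [[0.1, 0.0, 0.3, 0.0, 0.2]],
    columns=["aa_desc", "work_desc", "skill_req", "manag_title", "job_benefits"],
)

parts = split_by_suffix(toy, ["_desc", "_req", "_title", "_benefits"])
for suffix, part in parts.items():
    print(suffix, part.shape[1])   # number of columns per field
```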

In [18]:
mean_desc = text_features_desc.sum() / 14304
mean_desc.sort_values(ascending = False)

work_desc       3.387162e-02
develop_desc    3.251693e-02
team_desc       3.203152e-02
manag_desc      3.167333e-02
custom_desc     3.149140e-02
                    ...     
peugeot_desc    9.292221e-07
bencki_desc     9.292221e-07
sanofi_desc     9.292221e-07
qmetric_desc    5.049230e-07
gra_desc        5.049230e-07
Length: 40607, dtype: float64

The column means for the features from `description` show that the assumption we made earlier is somewhat reasonable. As we see here, the higher the mean, the more normal the words look, such as "work" and "develop".

Since the highest mean is 0.03387162, let's keep all features with a mean higher than 0.005.

In [54]:
select_desc = mean_desc > 0.005
selected_features_desc = text_features_desc.loc[:, select_desc]
selected_features_desc.head()

Unnamed: 0,abil_desc,abl_desc,account_desc,achiev_desc,across_desc,activ_desc,adkin_desc,administr_desc,adverti_desc,agenc_desc,...,web_desc,websit_desc,week_desc,well_desc,within_desc,work_desc,world_desc,would_desc,write_desc,year_desc
0,0.043316,0.04441,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.021943,0.0,0.0,0.0,0.038188
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.045662,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.050161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.049229,0.0,0.0,0.043822,0.047252,0.025411,0.0,0.0,0.0,0.044223
3,0.0,0.0,0.052053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.045984,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.036083,0.0,0.0,0.0,0.0,0.0,...,0.068596,0.0,0.0,0.0,0.0,0.0,0.069462,0.0,0.0,0.092432


This looks much better. We will repeat the process for the other dataframes as well. 

In [32]:
mean_req = text_features_req.sum() / 14304
mean_req.sort_values(ascending = False)

experi_req       0.052954
work_req         0.034828
skill_req        0.033378
requir_req       0.031967
year_req         0.027195
                   ...   
cano_req         0.000002
mcnz_req         0.000002
orthopaed_req    0.000002
inhabit_req      0.000002
zeta_req         0.000001
Length: 34256, dtype: float64

Since the TF-IDF means are a bit higher for `text_features_req`, we will raise the threshold slightly to account for the difference.
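One way to check that the adjustment is proportionate is to express each threshold as a fraction of that field's largest column mean. The snippet below only reuses numbers reported in this notebook (0.007 is the `requirements` threshold used in the next cell). It is a sanity check and not part of the selection itself.

```python
# Largest column means and thresholds as reported in this notebook
fields = {
    "description": {"max_mean": 0.03387162, "threshold": 0.005},
    "requirements": {"max_mean": 0.052954, "threshold": 0.007},
}

# Threshold relative to the largest column mean of each field
ratios = {name: v["threshold"] / v["max_mean"] for name, v in fields.items()}

for name, ratio in ratios.items():
    print(f"{name}: threshold is {ratio:.1%} of the largest column mean")
```

Both ratios land between 13% and 15%, so the higher absolute threshold keeps the cut at a similar relative level.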

In [56]:
select_req = mean_req > 0.007
selected_features_req = text_features_req.loc[:, select_req]
selected_features_req.head()

Unnamed: 0,abil_req,abl_req,account_req,analyt_req,applic_req,attitud_req,avail_req,bachelor_req,background_req,build_req,...,us_req,use_req,verbal_req,web_req,well_req,within_req,work_req,write_req,written_req,year_req
0,0.08165,0.105596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.105239,0.0,0.0,0.0,0.064795,0.0,0.0,0.068928
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.060783,0.0,...,0.0,0.0,0.051181,0.0,0.0,0.061549,0.063024,0.0,0.0,0.033522
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.047567,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07056,...,0.0,0.0,0.061309,0.0,0.0,0.0,0.075495,0.0,0.054646,0.040155
4,0.0,0.0,0.0,0.141105,0.0,0.0,0.0,0.0,0.0,0.067876,...,0.0,0.0,0.0,0.0,0.0,0.0,0.072623,0.0,0.0,0.038628


In [34]:
mean_title = text_features_title.sum() / 14304
mean_title.sort_values(ascending = False)

manag_title           0.048922
develop_title         0.046175
engin_title           0.038562
sale_title            0.029780
servic_title          0.024249
                        ...   
maharashtra_title     0.000024
barri_title           0.000024
peterborough_title    0.000024
haliburton_title      0.000024
elgin_title           0.000021
Length: 3417, dtype: float64

In [57]:
select_title = mean_title > 0.01
selected_features_title = text_features_title.loc[:, select_title]
selected_features_title.head()

Unnamed: 0,abroad_title,account_title,administr_title,analyst_title,assist_title,associ_title,busi_title,consult_title,custom_title,data_title,...,product_title,project_title,repres_title,sale_title,senior_title,servic_title,softwar_title,specialist_title,teacher_title,web_title
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.441629,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.616325,0.0,0.0,0.561059,0.0,...,0.0,0.0,0.0,0.0,0.0,0.552591,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.440388,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.284294,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [37]:
mean_benefits = text_features_benefits.sum() / 14304
mean_benefits.sort_values(ascending = False)

job_benefits         0.026567
descript_benefits    0.026070
see_benefits         0.025900
benefit_benefits     0.025386
work_benefits        0.024180
                       ...   
ebe_benefits         0.000002
efd_benefits         0.000002
efff_benefits        0.000002
cdc_benefits         0.000002
charit_benefits      0.000001
Length: 11247, dtype: float64

In [58]:
select_benefits = mean_benefits > 0.007
selected_features_benefits = text_features_benefits.loc[:, select_benefits]
selected_features_benefits.head()

Unnamed: 0,base_benefits,benefit_benefits,bonu_benefits,career_benefits,compani_benefits,compens_benefits,competit_benefits,day_benefits,dental_benefits,descript_benefits,...,prospect_benefits,provid_benefits,salari_benefits,see_benefits,team_benefits,time_benefits,train_benefits,vacat_benefits,vision_benefits,work_benefits
0,0.0,0.07462,0.0,0.0,0.164345,0.0,0.078806,0.097701,0.090261,0.0,...,0.0,0.0,0.082821,0.0,0.0,0.088521,0.0,0.204903,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.109283,0.0,0.0,0.0,0.0,0.0,0.0,0.132191,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.150045,0.150367,0.0
4,0.0,0.042046,0.0,0.0,0.046302,0.0,0.044405,0.0,0.0,0.0,...,0.0,0.0,0.046667,0.0,0.148054,0.0,0.0,0.057729,0.0,0.0


We will combine all the dataframes and export the result with joblib.

In [62]:
text_feature = pd.concat([selected_features_desc, selected_features_req, selected_features_title, selected_features_benefits], axis=1)
text_feature

Unnamed: 0,abil_desc,abl_desc,account_desc,achiev_desc,across_desc,activ_desc,adkin_desc,administr_desc,adverti_desc,agenc_desc,...,prospect_benefits,provid_benefits,salari_benefits,see_benefits,team_benefits,time_benefits,train_benefits,vacat_benefits,vision_benefits,work_benefits
0,0.043316,0.04441,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.082821,0.0,0.000000,0.088521,0.0,0.204903,0.000000,0.000000
1,0.000000,0.00000,0.000000,0.0,0.000000,0.000000,0.0,0.045662,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
2,0.050161,0.00000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
3,0.000000,0.00000,0.052053,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.150045,0.150367,0.000000
4,0.000000,0.00000,0.000000,0.0,0.036083,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.046667,0.0,0.148054,0.000000,0.0,0.057729,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14299,0.000000,0.00000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.060816,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.230658
14300,0.000000,0.00000,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.073304,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
14301,0.000000,0.00000,0.000000,0.0,0.000000,0.119119,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
14302,0.000000,0.00000,0.000000,0.0,0.000000,0.057005,0.0,0.000000,0.000000,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000
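The export step itself is not shown above. A minimal sketch of it, using small made-up frames in place of the selected features, could look like the following. The output file name is hypothetical, and a temporary directory is used so the example leaves `./data` untouched.

```python
import os
import tempfile

import joblib
import pandas as pd

# Toy stand-ins for the selected per-field frames (made-up values)
desc = pd.DataFrame({"work_desc": [0.02, 0.00]})
req = pd.DataFrame({"experi_req": [0.00, 0.06]})

# Column-wise concatenation; rows are aligned on the shared index
combined = pd.concat([desc, req], axis=1)

# Hypothetical output path inside a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "text_feature_selected_jlib")
joblib.dump(combined, out_path)

# Reload to confirm the round trip preserves the frame
restored = joblib.load(out_path)
print(restored.equals(combined))
```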
