# Categorical Encodings and Embeddings

By Alberto Valdés 

**Mail 1:** anvaldes@uc.cl 

**Mail 2:** alberto.valdes.gonzalez.96@gmail.com

In [1]:
import warnings
warnings.filterwarnings("ignore")

When we work with categorical variables, we have to decide how to encode them before training a model. In this notebook we work with census data.

In [2]:
import time

### Measure the time

In [3]:
start = time.time()

### Execute the code

In [4]:
import pandas as pd

In [5]:
df = pd.read_csv('adult_census.csv')

In [6]:
df['income'].value_counts()

<=50K    24720
>50K      7841
Name: income, dtype: int64

In [7]:
df['income'] = (df['income'] == '>50K')*1

In [8]:
round(df['income'].mean()*100, 2)

24.08

# 1. One Hot Encoding

We use this encoding when we want one dummy (0/1) variable for every possible value of a column.

In [9]:
from sklearn.preprocessing import OneHotEncoder

In [10]:
oh_enc = OneHotEncoder(handle_unknown='ignore')

In [11]:
marital_values = list(df['marital.status'].unique())

In [12]:
marital_values

['Widowed',
 'Divorced',
 'Separated',
 'Never-married',
 'Married-civ-spouse',
 'Married-spouse-absent',
 'Married-AF-spouse']

In [13]:
oh_enc = oh_enc.fit(df[['marital.status']])

In [14]:
new_marital = pd.DataFrame(oh_enc.transform(df[['marital.status']]).toarray())

In [15]:
new_marital.columns = list(oh_enc.get_feature_names_out())

In [16]:
new_marital

Unnamed: 0,marital.status_Divorced,marital.status_Married-AF-spouse,marital.status_Married-civ-spouse,marital.status_Married-spouse-absent,marital.status_Never-married,marital.status_Separated,marital.status_Widowed
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...
32556,0.0,0.0,0.0,0.0,1.0,0.0,0.0
32557,0.0,0.0,1.0,0.0,0.0,0.0,0.0
32558,0.0,0.0,1.0,0.0,0.0,0.0,0.0
32559,0.0,0.0,0.0,0.0,0.0,0.0,1.0


**Check Columns**

In [17]:
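# Verify that every dummy column matches the original category indicator
for var_name in marital_values:

    new_var_name = 'marital.status_' + var_name

    # Fraction of rows where the manual indicator equals the encoded column
    per_match = ((df['marital.status'] == var_name)*1 == new_marital[new_var_name]).mean()

    # Print only the categories that do not match perfectly (no output means all match)
    if per_match < 1:

        print(var_name)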

# 2. Ordinal Encoder 

We use this encoding when the possible values of a column have a natural order, for example education levels.
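A small sketch of ordinal encoding on a made-up `size` column is shown below (the column and its levels are hypothetical). It assumes scikit-learn's `handle_unknown='use_encoded_value'` option, which maps values outside the declared order to a sentinel instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered levels, from lowest to highest
sizes = ["small", "medium", "large"]

# unknown_value=-1 marks values that are not listed in `categories`
size_encoder = OrdinalEncoder(
    categories=[sizes],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

toy = pd.DataFrame({"size": ["large", "small", "medium"]})
size_encoder.fit(toy[["size"]])

# "huge" is not one of the declared levels
codes = size_encoder.transform(pd.DataFrame({"size": ["medium", "huge"]}))
print(codes.ravel())
```

The position in the `categories` list defines the code, so `medium` becomes 1.0, while the undeclared value `huge` becomes -1.0.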

In [18]:
edu_values = ['HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate']

In [19]:
df = df[df['education'].isin(edu_values)]
df = df.reset_index(drop=True)

In [20]:
from sklearn.preprocessing import OrdinalEncoder

In [21]:
ord_enc = OrdinalEncoder(categories = [edu_values]) # Set the order of values

In [22]:
education_values = list(df['education'].unique())

In [23]:
education_values

['HS-grad', 'Some-college', 'Doctorate', 'Bachelors', 'Masters']

In [24]:
ord_enc = ord_enc.fit(df[['education']])

In [25]:
new_edu = pd.DataFrame(ord_enc.transform(df[['education']]))

In [26]:
new_edu

Unnamed: 0,0
0,0.0
1,0.0
2,1.0
3,1.0
4,0.0
...,...
25278,3.0
25279,1.0
25280,0.0
25281,0.0


In [27]:
df[new_edu[0] == 0]['education']

0        HS-grad
1        HS-grad
4        HS-grad
6        HS-grad
15       HS-grad
          ...   
25272    HS-grad
25274    HS-grad
25280    HS-grad
25281    HS-grad
25282    HS-grad
Name: education, Length: 10501, dtype: object

In [28]:
df[new_edu[0] == 1]['education']

2        Some-college
3        Some-college
7        Some-college
16       Some-college
18       Some-college
             ...     
25259    Some-college
25266    Some-college
25275    Some-college
25276    Some-college
25279    Some-college
Name: education, Length: 7291, dtype: object

In [29]:
df[new_edu[0] == 2]['education']

9        Bachelors
13       Bachelors
14       Bachelors
19       Bachelors
25       Bachelors
           ...    
25252    Bachelors
25262    Bachelors
25263    Bachelors
25265    Bachelors
25267    Bachelors
Name: education, Length: 5355, dtype: object

In [30]:
df[new_edu[0] == 3]['education']

10       Masters
12       Masters
24       Masters
26       Masters
28       Masters
          ...   
25248    Masters
25254    Masters
25273    Masters
25277    Masters
25278    Masters
Name: education, Length: 1723, dtype: object

In [31]:
df[new_edu[0] == 4]['education']

5        Doctorate
8        Doctorate
11       Doctorate
21       Doctorate
23       Doctorate
           ...    
25201    Doctorate
25225    Doctorate
25226    Doctorate
25264    Doctorate
25269    Doctorate
Name: education, Length: 413, dtype: object

**Check with one testing value**

In [32]:
n_df = pd.DataFrame({'education': ['Masters']})

In [33]:
n_df

Unnamed: 0,education
0,Masters


In [34]:
pd.DataFrame(ord_enc.transform(n_df[['education']]))

Unnamed: 0,0
0,3.0


# 3. James Stein

We can use this encoding for both **classification** and **regression** problems.

This encoding requires information from the target. The value assigned to each category of the column is built from $p_i$ (the mean of the target among the rows with that category) and $p_{all}$ (the overall mean of the target): $p_i$ is shrunk toward $p_{all}$.
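To make the role of $p_i$ and $p_{all}$ concrete, here is a simplified sketch of a shrinkage encoding. The weight `b` used below is an illustrative assumption and is not claimed to match the exact formula implemented in `category_encoders`; the function name and the toy data are hypothetical as well.

```python
import pandas as pd


def shrunken_target_means(categories: pd.Series, target: pd.Series) -> pd.Series:
    """Blend each category mean p_i with the global mean p_all."""
    p_all = target.mean()
    var_all = target.var()

    stats = target.groupby(categories).agg(["mean", "var"])
    p_i = stats["mean"]
    # A category with a single row has no sample variance; fall back to the global one
    var_i = stats["var"].fillna(var_all)

    # Assumed illustrative weight: noisier categories are pulled harder toward p_all
    b = (var_i / (var_i + var_all)).fillna(0.0)
    return (1 - b) * p_i + b * p_all


toy = pd.DataFrame({
    "status": ["a", "a", "a", "a", "b", "b"],
    "income": [1, 1, 0, 1, 0, 0],
})
encoded = shrunken_target_means(toy["status"], toy["income"])
print(encoded)
```

In this toy data, category `a` has a raw mean of 0.75 and is pulled toward the global mean of 0.5, while category `b` has zero variance under this simplified weight and therefore keeps its own mean.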

In [35]:
df['marital.status'].value_counts()

Married-civ-spouse       11720
Never-married             8294
Divorced                  3494
Separated                  750
Widowed                    716
Married-spouse-absent      289
Married-AF-spouse           20
Name: marital.status, dtype: int64

In [36]:
from category_encoders import JamesSteinEncoder as JS

In [37]:
js_enc = JS()

In [38]:
js_enc = js_enc.fit(df[['marital.status']], df['income'])

In [39]:
df['JS_marital.status'] = js_enc.transform(df['marital.status'])

In [40]:
df['JS_marital.status']

0        0.133036
1        0.133036
2        0.133036
3        0.104224
4        0.142646
           ...   
25278    0.395354
25279    0.075506
25280    0.395354
25281    0.133036
25282    0.075506
Name: JS_marital.status, Length: 25283, dtype: float64

In [41]:
df['JS_marital.status'].value_counts()

0.395354    11720
0.075506     8294
0.142646     3494
0.104224      750
0.133036      716
0.144319      289
0.376675       20
Name: JS_marital.status, dtype: int64

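### i. JS for unseen values

For example, the string "A", which never appears in the training data. With the default settings, the encoder falls back to the overall mean of the target for such values.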

In [42]:
df_test = pd.DataFrame({'marital.status': ['Married-civ-spouse', 'A']})

In [43]:
js_enc.transform(df_test['marital.status'])

Unnamed: 0,marital.status
0,0.395354
1,0.258988


# 4. Embeddings

In natural language processing (NLP), a word embedding is a representation of a word. The embedding is used in text analysis. Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.
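Closeness in the vector space can be measured in several ways, for example with the Euclidean distance or with the cosine similarity. The sketch below uses made-up 3-dimensional vectors (purely illustrative, real embeddings have many more dimensions) to show the cosine similarity.

```python
import numpy as np


def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Made-up vectors for illustration only; they do not come from a trained model
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.9, 0.15]),
    "ball": np.array([0.1, 0.2, 0.95]),
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_ball = cosine_similarity(toy_vectors["king"], toy_vectors["ball"])
print(sim_king_queen > sim_king_ball)
```

Unlike the Euclidean distance, a higher cosine similarity means that two vectors are more alike.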


We can download different pretrained embeddings, each with a different dimensionality:

* Google News (300 dimensions)
* Wikipedia (50 dimensions)
* Twitter (25 dimensions)

In [44]:
import numpy as np
import gensim.downloader

In [45]:
def euc_dist(x, y):
    # Euclidean (L2) distance between two vectors
    return np.linalg.norm(x - y)

### a. Google News

In [46]:
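# Note: this is a word2vec model; the variable name is kept for consistency with the following cells
glove_vectors = gensim.downloader.load('word2vec-google-news-300')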



In [47]:
king = glove_vectors['king']
queen = glove_vectors['queen']
ball = glove_vectors['ball']

In [48]:
euc_dist(king, queen)

2.4796925

In [49]:
euc_dist(king, ball)

4.1938615

In [50]:
euc_dist(queen, ball)

4.195428

### b. Wikipedia

In [51]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-50')



In [52]:
king = glove_vectors['king']
queen = glove_vectors['queen']
ball = glove_vectors['ball']

In [53]:
euc_dist(king, queen)

3.4777563

In [54]:
euc_dist(king, ball)

6.6218195

In [55]:
euc_dist(queen, ball)

6.849128

### c. Twitter

In [56]:
glove_vectors = gensim.downloader.load('glove-twitter-25')



In [57]:
king = glove_vectors['king']
queen = glove_vectors['queen']
ball = glove_vectors['ball']

In [58]:
euc_dist(king, queen)

1.7002395

In [59]:
euc_dist(king, ball)

3.1329837

In [60]:
euc_dist(queen, ball)

3.6142843

# 5. Other ways to apply Embeddings

### a. Feature Selection

i. We apply an embedding to a specific column, generating 300, 50 or 25 new columns depending on the model.

ii. We select the most important features among those columns.
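The two steps above can be sketched as follows. The 4-dimensional vectors in `toy_embeddings` are hypothetical stand-ins for a pretrained model, and `SelectKBest` with the ANOVA F-score is only one possible selection criterion, chosen here as an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical 4-dimensional embeddings; real ones would come from a pretrained model
toy_embeddings = {
    "Divorced": np.array([0.1, 0.9, 0.3, 0.5]),
    "Widowed": np.array([0.2, 0.8, 0.4, 0.7]),
    "Never-married": np.array([0.9, 0.1, 0.6, 0.2]),
}

data = pd.DataFrame({
    "status": ["Divorced", "Widowed", "Never-married",
               "Never-married", "Divorced", "Widowed"],
    "target": [1, 1, 0, 0, 1, 0],
})

# Step i: one column per embedding dimension
emb_cols = pd.DataFrame(
    np.vstack(data["status"].map(toy_embeddings).to_list()),
    columns=[f"emb_{i}" for i in range(4)],
)

# Step ii: keep the 2 columns with the highest F-score against the target
selector = SelectKBest(score_func=f_classif, k=2)
selected = selector.fit_transform(emb_cols, data["target"])
kept = emb_cols.columns[selector.get_support()].tolist()
print(kept)
```

Other criteria, such as the feature importances of a fitted model, could replace the F-score in step ii.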

### b. Dimensionality Reduction

i. We apply an embedding to a specific column, generating 300, 50 or 25 new columns depending on the model.

ii. We apply dimensionality reduction to keep fewer columns.
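A minimal sketch of these two steps is shown below, assuming PCA as the reduction technique. The category vectors are random stand-ins for real 50-dimensional embeddings, so every name and number here is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Random stand-ins for the 50-dimensional embeddings of 7 categories
category_vectors = rng.normal(size=(7, 50))

# Step i: each of the 200 rows gets the embedding of its category
category_ids = rng.integers(0, 7, size=200)
embedded = category_vectors[category_ids]

# Step ii: project the 50 embedding columns onto 3 principal components
pca = PCA(n_components=3)
reduced = pca.fit_transform(embedded)
print(reduced.shape)
```

Because an embedded categorical column only contains as many distinct rows as there are categories, a handful of components is usually enough to keep most of the variance.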

### End of execution

In [61]:
end = time.time()

In [62]:
delta = end - start

hours = int(delta/3600)
mins = int((delta - hours*3600)/60)
segs = int(delta - hours*3600 - mins*60)

print(f'Executing this notebook took {hours} hours, {mins} minutes and {segs} seconds.')

Executing this notebook took 0 hours, 2 minutes and 58 seconds.
