## Encoding categorical features
[Recording](https://ithogskolan.sharepoint.com/:v:/s/AI23/EVhZ58xq_A5Dn1kxBId4xfwBG0BRRvaH00Yc0bVjtggtrQ?e=GxfWaJ)  
- Categorical features are non-numeric features with a limited number of possible values.
- Before feeding them into most ML algorithms, we must convert them into numerical features using either:
    - **Label encoding**: each unique category is assigned a numerical value: 0, 1, 2, 3, ...
    - **One-hot encoding**: a new binary feature is created for each category. Some ML algorithms interpret label-encoded values such as 0, 1, 2, ... as having an inherent order; one-hot encoding avoids that problem.
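
To make the difference between the two encodings concrete, here is a minimal sketch on a toy column. The `colors` series is hypothetical and not part of the Titanic data; the pandas calls (`astype("category")`, `.cat.codes`, `get_dummies`) are used here only for illustration.

```python
import pandas as pd

# Hypothetical toy column, not part of the Titanic data
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category (categories are sorted alphabetically)
label_encoded = colors.astype("category").cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(colors, dtype=int)

print(label_encoded.tolist())  # [2, 1, 0, 1]
print(one_hot)
```

The label-encoded version implies blue < green < red, which has no meaning for colors. The one-hot version carries no such order, at the cost of three columns instead of one.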



In [41]:
import pandas as pd
import seaborn as sns
import warnings

warnings.filterwarnings("ignore", "is_categorical_dtype")
warnings.filterwarnings("ignore", "use_inf_as_na")
warnings.filterwarnings("ignore", "The figure layout has changed to tight")
warnings.filterwarnings("ignore", "The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.")
# The message argument is a regex matched against the start of the warning text,
# so a short literal prefix is enough (unescaped parentheses would not match literally).
warnings.filterwarnings("ignore", "is_sparse is deprecated")
warnings.filterwarnings(category=FutureWarning, action="ignore")


In [29]:
titanic = sns.load_dataset("titanic")
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [30]:
titanic.info()  # shows which columns are categorical

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


Collect all categorical columns in a list.

In [31]:
# titanic.columns holds the names of all columns
cat_columns = [col for col in titanic.columns if titanic[col].dtype in ['object','category']]
cat_columns

['sex', 'embarked', 'class', 'who', 'deck', 'embark_town', 'alive']
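
As an alternative to the list comprehension, pandas offers `DataFrame.select_dtypes`. The sketch below uses a small hypothetical DataFrame (`toy`) rather than the Titanic data. It excludes numeric and boolean columns instead of listing `object` and `category`, on the assumption that everything left over is categorical, which holds for this toy frame but not necessarily for every dataset.

```python
import pandas as pd

# Hypothetical stand-in with the same mix of dtypes as the Titanic frame
toy = pd.DataFrame({
    "sex": ["male", "female"],
    "age": [22.0, 38.0],
    "class": pd.Categorical(["Third", "First"]),
    "alone": [False, True],
})

# Keep every column that is neither numeric nor boolean
cat_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(cat_cols)  # ['sex', 'class']
```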

In [32]:
df_categories = titanic[cat_columns]
df_categories.head()

,sex,embarked,class,who,deck,embark_town,alive
0,male,S,Third,man,,Southampton,no
1,female,C,First,woman,C,Cherbourg,yes
2,female,S,Third,woman,,Southampton,yes
3,female,S,First,woman,C,Southampton,yes
4,male,S,Third,man,,Southampton,no


In [33]:
for cat in cat_columns:
    print(f'{cat} : {df_categories[cat].unique()}')
    print("====================================================")

sex : ['male' 'female']
embarked : ['S' 'C' 'Q' nan]
class : ['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third']
who : ['man' 'woman' 'child']
deck : [NaN, 'C', 'E', 'G', 'D', 'A', 'B', 'F']
Categories (7, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G']
embark_town : ['Southampton' 'Cherbourg' 'Queenstown' nan]
alive : ['no' 'yes']


### Label encoding
- Label encoding maps each category to a numerical value.
- Use label encoding if:
    - The categories have a natural order
    - There are only two categories
    - One-hot encoding would lead to a large number of features
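
When the categories have a natural order, the integer codes should follow that order rather than, say, alphabetical order. A minimal sketch, assuming a toy series (`ticket_class`) that mimics the `class` column, is to state the order explicitly with an ordered `pd.Categorical`:

```python
import pandas as pd

# Toy stand-in for the ordered "class" column (values chosen for illustration)
ticket_class = pd.Series(["Third", "First", "Second", "First"])

# State the order explicitly so the integer codes follow it
order = ["First", "Second", "Third"]
ordered = pd.Categorical(ticket_class, categories=order, ordered=True)

codes = pd.Series(ordered.codes, index=ticket_class.index)
print(codes.tolist())  # [2, 0, 1, 0]
```

Here alphabetical order happens to match the intended order, but for labels such as "low", "medium", "high" it would not, so spelling out `order` is the safer habit.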

In [34]:
# Manual mapping
# map() takes a dict as a translation table; values missing from the dict (including NaN) become NaN
titanic["embarked"] = titanic["embarked"].map({"S": 0, "C": 1, "Q": 2})
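
One consequence of manual mapping is that missing values stay missing, which also turns the column into floats. The sketch below shows this on a toy series (`embarked` here is a hypothetical four-row stand-in). Filling the gaps with `-1` is an assumption made purely for illustration; the right way to handle missing values depends on the model and the data.

```python
import pandas as pd

# Toy stand-in for the "embarked" column, including one missing value
embarked = pd.Series(["S", "C", None, "Q"])

mapped = embarked.map({"S": 0, "C": 1, "Q": 2})
print(mapped.tolist())      # [0.0, 1.0, nan, 2.0]
print(mapped.isna().sum())  # 1

# Assumption for illustration: mark missing values with a sentinel code
filled = mapped.fillna(-1).astype(int)
print(filled.tolist())      # [0, 1, -1, 2]
```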


### Automatic mapping using LabelEncoder from scikit-learn

In [35]:
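from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# fit_transform learns the categories and returns their integer codes.
# In this run, decks A-G get codes 0-6 and missing decks end up as code 7.
titanic["deck"] = le.fit_transform(titanic["deck"])
titanic.head()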

FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
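
A fitted `LabelEncoder` remembers its mapping, which makes it possible to inspect the codes and translate them back. The sketch below uses a short hypothetical list (`decks`) and a separate encoder (`le_demo`) so that it does not touch the notebook's `le`. Note that scikit-learn documents `LabelEncoder` as intended for target labels; for input features it provides `OrdinalEncoder`, which is not shown here.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration, not the full "deck" column
decks = ["C", "E", "A", "C"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(decks)

print(list(le_demo.classes_))                    # ['A', 'C', 'E']
print(encoded.tolist())                          # [1, 2, 0, 1]
print(list(le_demo.inverse_transform(encoded)))  # ['C', 'E', 'A', 'C']
```

The position of a category in `classes_` is its code, so the list doubles as the translation table.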

### One-hot encoding
- In one-hot encoding, a new binary feature is created for each category. Its value is 1 if the observation belongs to that category and 0 otherwise.
- Use one-hot encoding if:
    - The categories have no natural order
    - The number of categories is small (but more than two)
- Use the pandas function **get_dummies()**
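
The following sketch shows `get_dummies()` on a small hypothetical DataFrame (`people`) that mimics the `who` column. It also shows the optional `drop_first=True` argument, which drops one of the binary columns because it can be inferred from the others. Whether that is desirable depends on the model, so treat it as an option rather than a recommendation.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data
people = pd.DataFrame({
    "who": ["man", "woman", "child", "woman"],
    "age": [22, 38, 4, 26],
})

# One binary column per category, named <column>_<category>
dummies = pd.get_dummies(people, columns=["who"], dtype=int)
print(dummies.columns.tolist())  # ['age', 'who_child', 'who_man', 'who_woman']

# drop_first=True removes the first category ("child" here)
reduced = pd.get_dummies(people, columns=["who"], drop_first=True, dtype=int)
print(reduced.columns.tolist())  # ['age', 'who_man', 'who_woman']
```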

In [43]:
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.25,0.0,2,1,1,7,2,0,0
1,1,1,0,38.0,1,0,71.2833,1.0,0,2,0,2,0,1,0
2,1,3,0,26.0,0,0,7.925,0.0,2,2,0,7,2,1,1
3,1,1,0,35.0,1,0,53.1,0.0,0,2,0,2,2,1,0
4,0,3,1,35.0,0,0,8.05,0.0,2,1,1,7,2,0,1


In [44]:
titanic.to_json("../Data/titanic_encoded.json", orient="records")  # saves the encoded data as a .json file that is used in L10

In [None]:
pd.get_dummies(data=titanic, columns=['who'])

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,adult_male,deck,embark_town,alive,alone,who_child,who_man,who_woman
0,0,3,male,22.0,1,0,7.2500,0.0,Third,True,7,Southampton,no,False,False,True,False
1,1,1,female,38.0,1,0,71.2833,1.0,First,False,2,Cherbourg,yes,False,False,False,True
2,1,3,female,26.0,0,0,7.9250,0.0,Third,False,7,Southampton,yes,True,False,False,True
3,1,1,female,35.0,1,0,53.1000,0.0,First,False,2,Southampton,yes,False,False,False,True
4,0,3,male,35.0,0,0,8.0500,0.0,Third,True,7,Southampton,no,True,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,0.0,Second,True,7,Southampton,no,True,False,True,False
887,1,1,female,19.0,0,0,30.0000,0.0,First,False,1,Southampton,yes,True,False,False,True
888,0,3,female,,1,2,23.4500,0.0,Third,False,7,Southampton,no,False,False,False,True
889,1,1,male,26.0,0,0,30.0000,1.0,First,True,2,Cherbourg,yes,True,False,True,False


### Auto-map remaining non-numerical features


In [38]:
cat_columns = [col for col in titanic.columns if titanic[col].dtype in ['object','category','bool']]
cat_columns

['sex', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']

In [42]:
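# Re-fit the same encoder on each column in turn.
# Note: after the loop, le only remembers the mapping of the last column.
for column in cat_columns:
    titanic[column] = le.fit_transform(titanic[column])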

titanic

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,1,22.0,1,0,7.2500,0.0,2,1,1,7,2,0,0
1,1,1,0,38.0,1,0,71.2833,1.0,0,2,0,2,0,1,0
2,1,3,0,26.0,0,0,7.9250,0.0,2,2,0,7,2,1,1
3,1,1,0,35.0,1,0,53.1000,0.0,0,2,0,2,2,1,0
4,0,3,1,35.0,0,0,8.0500,0.0,2,1,1,7,2,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,1,27.0,0,0,13.0000,0.0,1,1,1,7,2,0,1
887,1,1,0,19.0,0,0,30.0000,0.0,0,2,0,1,2,1,1
888,0,3,0,,1,2,23.4500,0.0,2,2,0,7,2,0,0
889,1,1,1,26.0,0,0,30.0000,1.0,0,1,1,2,0,1,1
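
If the original labels need to be recovered later, one option is to keep a separate encoder per column instead of re-fitting a single one. This is a minimal sketch on a hypothetical two-column frame (`toy`); the `encoders` dict is an illustrative name, not something defined in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with one string column and one boolean column
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "alone": [False, True, True],
})

encoders = {}
for column in toy.columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder  # keep the fitted encoder for this column

print(toy["sex"].tolist())                               # [1, 0, 0]
print(list(encoders["sex"].inverse_transform([1, 0])))   # ['male', 'female']
```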
