#### FEATURE ENGINEERING

- Feature engineering is the process of transforming raw data into features that help machine learning algorithms work better.
    - it aims to improve the performance of ML models
    - it can improve the quality and relevance of features

- **Categorical Features**
  - Groups of data
    - include binary features (0 and 1, True and False), stored as the `bool` data type, e.g. Survived: Yes/No, i.e. True or False

    - **Category data type** - a pandas data type that is **more optimized** for storing categorical data

    - **Ordinal data** - categories with a natural order, so you can set up a scale, e.g. Small < Medium < Large, cold < warm < hot,
      small height < medium height < large height

- **Numerical Features**
  - values are numbers
  - include both `float` and `int` values
  - continuous values
  - grouping the values will not make much sense (unless doing clustering)
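
As a rough sketch of the split described above, pandas can separate numerical columns from everything else by dtype. The toy DataFrame and its column names are invented for illustration, and treating every non-numeric column as categorical is a simplifying assumption.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for illustration.
toy = pd.DataFrame({
    "survived": [True, False, True],        # binary feature
    "size": ["small", "large", "medium"],   # categorical (text) feature
    "age": [22.0, 38.5, 26.0],              # continuous numerical feature
    "siblings": [1, 0, 2],                  # integer numerical feature
})

# 'number' matches integer and float columns; bool columns are not included.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
# Assumption: everything that is not numeric is treated as categorical.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_cols)    # ['age', 'siblings']
print(categorical_cols)  # ['survived', 'size']
```

The dtype is only a first guess: an integer column can still be categorical in meaning, for example a class label stored as 0, 1, 2.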

- Different data types in pandas DataFrames:

- `object`: Text or mixed numeric and non-numeric values.

- `bool`: Boolean values (True or False).

- `int64`: Integer values.

- `float64`: Floating point values.

- `datetime64`: Date and time values.

- `category`: Categorical variables, i.e. columns with a limited set of distinct values.
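
The following is a small sketch of what the `category` dtype offers, assuming a made-up `sizes` Series: it compares memory use against the default text storage and shows an ordered categorical for ordinal data. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical column with a few repeated text values.
sizes = pd.Series(["small", "large", "medium", "small"] * 1000)

# Same values stored with the category dtype: each distinct value is stored
# once, and every row holds only a small integer code.
sizes_cat = sizes.astype("category")
print(sizes.memory_usage(deep=True), sizes_cat.memory_usage(deep=True))

# Ordinal data: declare the order of the categories explicitly.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
sizes_ord = sizes.astype(size_type)

# Codes follow the declared order: small=0, medium=1, large=2.
print(sizes_ord.cat.codes.head(3).tolist())  # [0, 2, 1]

# Ordered categoricals support comparisons against a category.
print(int((sizes_ord > "small").sum()))      # 2000
```

The exact byte counts depend on the pandas version, so only the relative difference matters here.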


```python
# Convert the 'int' column to float
df['int'] = df['int'].astype(float)
```

In [14]:
import pandas as pd
import seaborn as sns
import numpy as np

In [2]:
iris = sns.load_dataset('iris')

iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


**Can you check this for the Titanic dataset?**
- how many categorical features?
- how many numerical features?

In [4]:
penguins = sns.load_dataset('penguins')

penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [5]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


- **Encoding**

In [6]:
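# Label encoding: map each category to an integer code
# First inspect the values of the column you want to encode

penguins['species'].unique()

# Build a dictionary {category: code} for the mapping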

species_codes = {
    'Adelie': 0,
    'Chinstrap': 1,
    'Gentoo': 2
}

penguins['species_encoded'] = penguins['species'].map(species_codes) 

penguins.sample(10)


,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_encoded
87,Adelie,Dream,36.9,18.6,189.0,3500.0,Female,0
45,Adelie,Dream,39.6,18.8,190.0,4600.0,Male,0
335,Gentoo,Biscoe,55.1,16.0,230.0,5850.0,Male,2
307,Gentoo,Biscoe,51.3,14.2,218.0,5300.0,Male,2
18,Adelie,Torgersen,34.4,18.4,184.0,3325.0,Female,0
337,Gentoo,Biscoe,48.8,16.2,222.0,6000.0,Male,2
236,Gentoo,Biscoe,42.0,13.5,210.0,4150.0,Female,2
220,Gentoo,Biscoe,46.1,13.2,211.0,4500.0,Female,2
181,Chinstrap,Dream,52.8,20.0,205.0,4550.0,Male,1
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,Male,2


In [7]:
# pandas map function

s = pd.Series(['cat', 'cow', 'dog'])

s2 = s.map({'cat': 'kitten', 'cow': 'calf'})
print(s2)

0    kitten
1      calf
2       NaN
dtype: object
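
As the output above shows, `map` returns `NaN` for any value that is missing from the dictionary. A minimal sketch of checking for such unmapped values after a manual encoding follows; the label `'Macaroni'` is a hypothetical extra category added only to trigger the case.

```python
import pandas as pd

# 'Macaroni' is a hypothetical label that is not in the mapping.
species = pd.Series(["Adelie", "Gentoo", "Chinstrap", "Macaroni"])
species_codes = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}

encoded = species.map(species_codes)

# Values without a dictionary entry become NaN, so list the offending labels.
unmapped = species[encoded.isna()].unique().tolist()
print(unmapped)  # ['Macaroni']
```

Because of the `NaN`, the encoded column becomes `float64` instead of an integer type, which is often the first visible sign that a label was missed.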


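**copy and view**
df                    # the original DataFrame
df2 = df.copy()       # an independent copy; changes to df2 do not affect df
df2.dropna()          # returns a new DataFrame; df2 itself is unchanged unless you assign the result back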
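
A short sketch of the copy behaviour noted above, using a made-up one-column DataFrame (the names `df`, `df2` and `dropped` are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Work on an independent copy so the original stays untouched.
df2 = df.copy()
df2.loc[0, "a"] = 100.0
print(df.loc[0, "a"])          # 1.0, the original is unchanged

# dropna() returns a new DataFrame; df2 keeps its rows unless reassigned.
dropped = df2.dropna()
print(len(df2), len(dropped))  # 3 2
```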

In [8]:
# Python's built-in map function

words = ["apple", "banana", "cherry"]
lengths = map(len, words)
list(lengths)

[5, 6, 6]

In [9]:
# factorize
# The species column still contains the species names

# factorize takes the column and assigns an integer code to each distinct value,
# in order of first appearance
label_codes, label_names = pd.factorize(penguins['species'])

penguins['species_factorized'] = label_codes
penguins

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_encoded,species_factorized
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,0,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,0,0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,0,0
3,Adelie,Torgersen,,,,,,0,0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,0,0
...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,2,2
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,2,2
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,2,2
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,2,2


In [20]:
# use_na_sentinel=False gives missing values their own code instead of -1
sex_codes, sex_names = pd.factorize(penguins['sex'], use_na_sentinel=False)
penguins['sex_coded'] = sex_codes
penguins

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_encoded,species_factorized,sex_coded
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,0,0,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,0,0,1
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,0,0,1
3,Adelie,Torgersen,,,,,,0,0,2
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,0,0,1
...,...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,2,2,2
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,2,2,1
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,2,2,0
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,2,2,1
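
To make the handling of missing values easier to see, here is a small sketch on a made-up four-row Series rather than the full penguins data. It assumes a pandas version that supports the `use_na_sentinel` parameter (1.5 or later).

```python
import numpy as np
import pandas as pd

# Hypothetical small column with one missing value.
sex = pd.Series(["Male", "Female", np.nan, "Female"])

# Default: missing values get the sentinel code -1 and are left out of uniques.
codes, uniques = pd.factorize(sex)
print(codes.tolist())     # [0, 1, -1, 1]
print(list(uniques))      # ['Male', 'Female']

# use_na_sentinel=False: the missing value is treated as its own group.
codes_na, uniques_na = pd.factorize(sex, use_na_sentinel=False)
print(codes_na.tolist())  # [0, 1, 2, 1]
```

Whether a separate code for missing values is useful depends on the model; the alternative is to impute or drop those rows before encoding.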


In [10]:
factors = pd.factorize(penguins['species'])
penguins['species_factorized'] = factors[0]  # factors[0] holds the codes, factors[1] the unique values
penguins[['species', 'species_factorized']]

,species,species_factorized
0,Adelie,0
1,Adelie,0
2,Adelie,0
3,Adelie,0
4,Adelie,0
...,...,...
339,Gentoo,2
340,Gentoo,2
341,Gentoo,2
342,Gentoo,2


- Using the **category** data type

- convert a feature of any data type to categorical, then use
- **cat.codes** to get the integer code of each category

In [11]:
df = penguins.copy()

df['species'] = df['species'].astype('category')

df['species_cat_encoded'] = df['species'].cat.codes
df

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_encoded,species_factorized,species_cat_encoded
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,0,0,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,0,0,0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,0,0,0
3,Adelie,Torgersen,,,,,,0,0,0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,0,0,0
...,...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,2,2,2
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,2,2,2
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,2,2,2
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,2,2,2
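
One detail worth checking when comparing methods: `cat.codes` numbers the categories in sorted order, while `factorize` numbers them in order of first appearance, and both use -1 for missing values by default. A sketch on a made-up `size` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical column; the values are invented for illustration.
size = pd.Series(["medium", "large", np.nan, "small", "large"])

size_cat = size.astype("category")
print(list(size_cat.cat.categories))  # ['large', 'medium', 'small'] (sorted)
print(size_cat.cat.codes.tolist())    # [1, 0, -1, 2, 0]

# factorize numbers the values in order of first appearance instead.
factorized_codes, _ = pd.factorize(size)
print(factorized_codes.tolist())      # [0, 1, -1, 2, 1]
```

Neither ordering matches the real small < medium < large scale here, so for ordinal data an explicit mapping or an ordered categorical is the safer choice.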


**Task Exercise**

- apply each of the following methods to the island column

- manually applying label codes
- using the factorize method
- using the category data type
- using the label encoder method



### Label Encoding using sklearn

In [12]:
from sklearn.preprocessing import LabelEncoder

# Create LabelEncoder instance
encoder = LabelEncoder()

df['species_sklearn_le'] = encoder.fit_transform(df['species'])
df

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_encoded,species_factorized,species_cat_encoded,species_sklearn_le
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,0,0,0,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,0,0,0,0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,0,0,0,0
3,Adelie,Torgersen,,,,,,0,0,0,0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,2,2,2,2
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female,2,2,2,2
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male,2,2,2,2
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female,2,2,2,2
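
A fitted `LabelEncoder` also remembers the mapping, so codes can be turned back into labels. The sketch below uses a short made-up list of species instead of the full DataFrame.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical short list of labels for illustration.
species = ["Gentoo", "Adelie", "Chinstrap", "Adelie"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# Classes are stored in sorted order; the code is the position in classes_.
print(encoder.classes_.tolist())                   # ['Adelie', 'Chinstrap', 'Gentoo']
print(codes.tolist())                              # [2, 0, 1, 0]

# inverse_transform recovers the original labels from the codes.
print(encoder.inverse_transform([2, 0]).tolist())  # ['Gentoo', 'Adelie']
```

The scikit-learn documentation describes `LabelEncoder` as intended for target labels; for input features it points to `OrdinalEncoder`, which works on 2D data.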


Multiple methods:


- manual - .map() - Ordinal 