### One-Hot Encoding, Label Encoding
- What are categorical variables?
    - Binary categorical features: gender, yes/no
    - Multi-class categorical features: country
- One-hot encoding
    - creates a binary column for each category in a categorical feature
    - each row is marked with a 1 for its respective category and 0 elsewhere
    - example: feature `color`

      | red | blue | green |
      |-----|------|-------|
      | 1   | 0    | 0     |
      | 0   | 1    | 0     |
      | 0   | 0    | 1     |

    - Application
        - categorical features with a small number of unique categories
        - tree-based models, logistic regression, and neural networks

- Label encoding
    - assigns a unique integer to each category
    - example: red = 0, blue = 1, green = 2
    - Application
        - ordinal features, where the order of the categories matters
    - Limitations
        - can mislead algorithms into treating categories as ordered when the variable is actually nominal
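
The two encodings above can be sketched on a toy `color` column. This is a minimal illustration, not part of the original notebook: the DataFrame and the `label_map` dictionary are assumptions made for the example, with the mapping copied from the notes (red = 0, blue = 1, green = 2).

```python
import pandas as pd

# Toy data; the column name and values are illustrative only.
colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One-hot encoding: one indicator column per category.
# dtype=int gives 0/1 values instead of booleans.
one_hot = pd.get_dummies(colors["color"], dtype=int)

# Label encoding: map each category to an integer.
# This mapping mirrors the example in the notes.
label_map = {"red": 0, "blue": 1, "green": 2}
labels = colors["color"].map(label_map)

print(one_hot)
print(labels.tolist())
```

Note that `pd.get_dummies` orders the new columns alphabetically (blue, green, red), so the column order differs from the table in the notes even though the encoding is the same.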

### Dealing with high-cardinality categorical features
- High-cardinality categorical features contain a large number of unique categories
- Challenges
    - Dimensionality: one-hot encoding creates too many columns, increasing computational cost
    - Sparse representation: most entries in those columns are zero, leading to sparsity in the dataset
- Solutions
    - Frequency encoding
        - replace each category with its occurrence count in the dataset
        - example: city = ['NY', 'LA', 'SF', 'LA'], encoding: NY = 1, LA = 2, SF = 1
    - Target encoding
        - replace each category with the mean of the target variable for that category
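
Both solutions can be sketched in a few lines of pandas. This is a minimal example under stated assumptions: the `city` values come from the frequency-encoding example above, while the `target` column is made up purely for illustration.

```python
import pandas as pd

# Toy data; the 'target' values are hypothetical.
data = pd.DataFrame({
    "city": ["NY", "LA", "SF", "LA"],
    "target": [1, 0, 1, 1],
})

# Frequency encoding: replace each category with its occurrence count.
counts = data["city"].value_counts()
data["city_freq"] = data["city"].map(counts)

# Target encoding: replace each category with the mean target of that category.
means = data.groupby("city")["target"].mean()
data["city_target"] = data["city"].map(means)

print(data)
```

In a real project the target means should be computed from the training split only (or with cross-validation), since computing them on the full dataset leaks target information into the features.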

---
### When to use different encoding techniques
| **Encoding technique** | **Use case** |
|------------------------|--------------|
| One-hot encoding | nominal features with a small number of unique categories |
| Label encoding | ordinal features, or features fed to tree-based models |
| Frequency encoding | high-cardinality features in both regression and classification tasks |
| Target encoding | high-cardinality features in supervised learning tasks |


In [8]:
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

In [4]:
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

df = pd.read_csv(url)
# Print a summary of the columns, dtypes, and non-null counts
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Apply one hot encoding 
d_one_hot = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print("\n One hot encoding:\n", d_one_hot.head())


 One hot encoding:
    PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name   Age  SibSp  Parch  \
0                            Braund, Mr. Owen Harris  22.0      1      0   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  38.0      1      0   
2                             Heikkinen, Miss. Laina  26.0      0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  35.0      1      0   
4                           Allen, Mr. William Henry  35.0      0      0   

             Ticket     Fare Cabin  Sex_male  Embarked_Q  Embarked_S  
0         A/5 21171   7.2500   NaN      True       False        True  
1          PC 17599  71.2833   C85     False       False       False  
2  STON/O2. 3101282   7.9250   NaN     False       False        True  
3            113803  

In [9]:
label_encoder = LabelEncoder()
df['Pclass_encoded'] = label_encoder.fit_transform(df['Pclass'])

In [10]:
print("\nLabel-encoded Pclass:\n", df[['Pclass', 'Pclass_encoded']].head())


Label-encoded Pclass:
    Pclass  Pclass_encoded
0       3               2
1       1               0
2       3               2
3       1               0
4       3               2
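
The scikit-learn documentation describes `LabelEncoder` as intended for target values (`y`) rather than input features. For an ordinal feature such as `Pclass`, one alternative is `OrdinalEncoder`, which accepts an explicit category order. The sketch below uses a small stand-in DataFrame (an assumption for illustration) rather than the Titanic data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the Pclass column (values 1, 2, 3).
toy = pd.DataFrame({"Pclass": [3, 1, 3, 1, 2]})

# Passing the categories explicitly fixes the order: 1 -> 0, 2 -> 1, 3 -> 2.
encoder = OrdinalEncoder(categories=[[1, 2, 3]])

# fit_transform expects a 2-D input and returns a 2-D float array,
# so flatten it and cast to int before storing it as a column.
toy["Pclass_encoded"] = encoder.fit_transform(toy[["Pclass"]]).ravel().astype(int)

print(toy)
```

For this column the result matches what `LabelEncoder` produced above, but the order is now stated explicitly instead of being inferred from sorting.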


In [11]:
# Applying Frequency encoding 
df['Ticket_frequency'] = df['Ticket'].map(df['Ticket'].value_counts())

In [12]:
print("frequency encoded feature \n",df[['Ticket','Ticket_frequency']].head() )

frequency encoded feature 
              Ticket  Ticket_frequency
0         A/5 21171                 1
1          PC 17599                 1
2  STON/O2. 3101282                 1
3            113803                 2
4            373450                 1


In [13]:
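# Features: drop the target and the free-text / high-cardinality columns
x = d_one_hot.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])
# Target: survival indicator
y = df['Survived']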

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
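# No rows are dropped here: x and y are already extracted, and the
# remaining missing values are imputed after the train/test split.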


In [15]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [20]:
from sklearn.impute import SimpleImputer

# Impute missing values in x_train and x_test
imputer = SimpleImputer(strategy='mean')
x_train_imputed = imputer.fit_transform(x_train)
x_test_imputed = imputer.transform(x_test)

model = LogisticRegression(max_iter = 200)
model.fit(x_train_imputed, y_train)
y_pred = model.predict(x_test_imputed)
print("Accuracy score : ", accuracy_score(y_test, y_pred))

Accuracy score :  0.8044692737430168


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=200).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
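
The warning above suggests scaling the data. One way to do that, sketched here under stated assumptions, is to chain imputation, standardization, and the model in a single pipeline. The example uses synthetic data from `make_classification` as a stand-in for the Titanic features, with one column inflated to mimic an unscaled feature; the `max_iter=1000` value is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X[:, 0] *= 1000  # mimic a large-valued, unscaled column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Impute, then standardize, then fit the classifier.
# Each step is fitted on the training data only.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

print("Accuracy score:", pipeline.score(X_test, y_test))
```

Standardized features usually let the solver converge in far fewer iterations, and keeping the steps in one pipeline avoids fitting the imputer or scaler on test data.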
