## Exercise 1:

#### Using the LabelEncoder class from the scikit-learn package, encode the target variable bought as shown below and assign the result back to the df DataFrame.

#### In response, print the df DataFrame to the console.

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder


data = {
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500, 450, 300, 380, 410],
    'bought': ['yes', 'no', 'yes', 'no', 'yes']
}

df = pd.DataFrame(data=data)
for col in ['size', 'color', 'gender', 'bought']:
    df[col] = df[col].astype('category')
df['weight'] = df['weight'].astype('float')
      
le = LabelEncoder()

# Fit on the target column and replace it with integer codes (no -> 0, yes -> 1).
df['bought'] = le.fit_transform(df['bought'])
      
print(df)

  size  color  gender  price  weight  bought
0   XL    red  female  199.0   500.0       1
1    L  green    male   89.0   450.0       0
2    M   blue    male   99.0   300.0       1
3    L  green  female  129.0   380.0       0
4    M    red  female   79.0   410.0       1


#### Notes:

- LabelEncoder(): A scikit-learn class that converts labels into integers. Each distinct label gets one integer from 0 to n_classes - 1, assigned in sorted order of the labels (here no -> 0 and yes -> 1).

- le.fit_transform(): This method first fits the LabelEncoder to the data (i.e., learns the mapping from categories to integers) and then transforms the categorical labels into their corresponding integer values in one step.

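- Use Case: LabelEncoder is designed for encoding the target variable (y) so that estimators receive numeric class labels. For categorical input features, OneHotEncoder or OrdinalEncoder is usually the better fit, because arbitrary integer codes can suggest an ordering that does not exist.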
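
The mapping learned by the encoder can be inspected and reversed. The following is a minimal sketch that reuses the bought values from the exercise; the standalone Series and the variable names encoded and decoded are illustrative and not part of the original solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same target values as in the exercise, kept in a standalone Series.
bought = pd.Series(['yes', 'no', 'yes', 'no', 'yes'])

le = LabelEncoder()
encoded = le.fit_transform(bought)

# classes_ holds the sorted unique labels; a label's index is its integer code.
print(le.classes_)   # ['no' 'yes']
print(encoded)       # [1 0 1 0 1]

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(encoded)
print(decoded)       # ['yes' 'no' 'yes' 'no' 'yes']
```

Keeping the fitted le object around is what makes it possible to translate model predictions back into the original labels later.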

## Exercise 2:
#### Using the OneHotEncoder from the scikit-learn package, encode the size column as a one-hot numeric array. Request a dense array by setting sparse_output=False (this parameter was named sparse before scikit-learn 1.2).

#### In response, print the encoded size column to the console (don't assign changes to the df DataFrame). Also print the encoding categories to the console as shown below.

In [5]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


data = {
    'size': ['XL', 'L', 'M', 'L', 'M'],
    'color': ['red', 'green', 'blue', 'green', 'red'],
    'gender': ['female', 'male', 'male', 'female', 'female'],
    'price': [199.0, 89.0, 99.0, 129.0, 79.0],
    'weight': [500, 450, 300, 380, 410],
    'bought': ['yes', 'no', 'yes', 'no', 'yes']
}

df = pd.DataFrame(data=data)
for col in ['size', 'color', 'gender', 'bought']:
    df[col] = df[col].astype('category')
df['weight'] = df['weight'].astype('float')

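# sparse_output replaced the sparse parameter in scikit-learn 1.2 (sparse was removed in 1.4).
enc = OneHotEncoder(sparse_output=False)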
      
enc.fit(df[['size']])

print(enc.transform(df[['size']]))

print(enc.categories_)

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
[array(['L', 'M', 'XL'], dtype=object)]


#### Notes:

- OneHotEncoder with sparse_output=False (sparse=False before scikit-learn 1.2): Converts a categorical column into a dense binary (0 or 1) NumPy array with one column per unique category. Without this setting, the encoder returns a SciPy sparse matrix.

- enc.fit(df[['size']]): It identifies all unique categories within the 'size' column and prepares the encoder for transformation.

- enc.transform(df[['size']]): This transforms the 'size' column into a one-hot encoded matrix, where each row corresponds to a record in the DataFrame, and each column represents a unique category from the 'size' column. The resulting matrix has a 1 in the column corresponding to the category for that record and 0s elsewhere.

- enc.categories_: A list with one array per encoded feature. Each array holds that feature's categories in the same order as the corresponding columns of the transformed matrix.
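
Because categories_ gives the column order, it can be used to label the encoded matrix. The sketch below is one way to do that, not the exercise's required solution: it keeps the encoder's default sparse output and calls .toarray() so that it does not depend on the sparse / sparse_output parameter name, and the size_ prefix and the names columns and encoded_df are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'size': ['XL', 'L', 'M', 'L', 'M']})

# The default output is sparse; .toarray() converts it to a dense NumPy array.
enc = OneHotEncoder()
encoded = enc.fit_transform(df[['size']]).toarray()

# categories_[0] lists the labels of the first (and only) feature in column order.
columns = [f'size_{cat}' for cat in enc.categories_[0]]

# Wrap the array in a DataFrame so every column carries its category name.
encoded_df = pd.DataFrame(encoded, columns=columns, index=df.index)
print(encoded_df)
```

The original df is left unchanged here, which matches the exercise's request not to assign the encoded values back to it.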

## Exercise 3:

#### The breast cancer dataset is loaded into the raw_data variable.

#### Assign the value for the 'data' key (numpy array) to the data variable. Then assign the value for the 'target' key (numpy array) to the target variable.

#### In response, print the first three elements of the data array to the console.

In [12]:
import numpy as np
from sklearn.datasets import load_breast_cancer


np.set_printoptions(precision=2, suppress=True, linewidth=100)
raw_data = load_breast_cancer()

data = raw_data['data']

target = raw_data['target']

print(f'First 3 elements of data array: {data[:3]}')

First 3 elements of data array: [[  17.99   10.38  122.8  1001.      0.12    0.28    0.3     0.15    0.24    0.08    1.09    0.91
     8.59  153.4     0.01    0.05    0.05    0.02    0.03    0.01   25.38   17.33  184.6  2019.
     0.16    0.67    0.71    0.27    0.46    0.12]
 [  20.57   17.77  132.9  1326.      0.08    0.08    0.09    0.07    0.18    0.06    0.54    0.73
     3.4    74.08    0.01    0.01    0.02    0.01    0.01    0.     24.99   23.41  158.8  1956.
     0.12    0.19    0.24    0.19    0.28    0.09]
 [  19.69   21.25  130.   1203.      0.11    0.16    0.2     0.13    0.21    0.06    0.75    0.79
     4.58   94.03    0.01    0.04    0.04    0.02    0.02    0.     23.57   25.53  152.5  1709.
     0.14    0.42    0.45    0.24    0.36    0.09]]
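
Before working with the arrays, it can help to check their shapes and the names stored alongside them. This is a small sketch that uses only keys of the object returned by load_breast_cancer; the particular keys inspected here are a suggestion rather than part of the exercise.

```python
from sklearn.datasets import load_breast_cancer

raw_data = load_breast_cancer()
data = raw_data['data']
target = raw_data['target']

# One row per sample and one column per feature.
print(data.shape)     # (569, 30)
print(target.shape)   # (569,)

# Feature and class names are stored next to the arrays.
print(raw_data['feature_names'][:3])   # ['mean radius' 'mean texture' 'mean perimeter']
print(raw_data['target_names'])        # ['malignant' 'benign']
```

The position of each name in target_names is the integer used in target, so 0 stands for malignant and 1 for benign.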
