## Dealing with categorical features
+ scikit-learn will not accept categorical features by default
+ Need to convert categorical features into numeric values
+ Convert to binary features called dummy variables
  - **0:** means that the observation was not in that category
  - **1:** means that the observation was in that category

![dummy-variable](supervised-learning/images/dummy-variable-1.png)

## Dealing with categorical features in Python
+ scikit-learn: OneHotEncoder()
+ pandas: get_dummies()

### Encoding dummy variables

```python
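import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import Ridge

music_df = pd.read_csv('music.csv')

# Create dummy variables for genre, dropping the first category
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)

# Join the dummies to the original DataFrame, then drop the original column
# (pd.concat expects a list of objects)
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop('genre', axis=1)

# Create X and y
X = music_dummies.drop("popularity", axis=1)
y = music_dummies["popularity"]

# Set up the cross-validation folds
# (assumed settings; kf was not defined in the original notes)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate RMSE (scores are negative MSE, so flip the sign first)
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))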

```

## Handling missing data

+ Dropping missing data

```python

print(music_df.isna().sum().sort_values())

# Drop rows with missing values in columns where less than 5% of values are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
```
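
The column list passed to `dropna` above was chosen by reading the missing-value counts. As a sketch, the same 5% rule could also be applied programmatically. The DataFrame `toy_df` and its columns below are hypothetical stand-ins, not the real music data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 100 rows
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "tempo": rng.normal(120, 10, 100),
    "loudness": rng.normal(-7, 2, 100),
    "energy": rng.uniform(0, 1, 100),
})
toy_df.loc[[3, 50], "tempo"] = np.nan            # 2% missing
toy_df.loc[list(range(20)), "energy"] = np.nan   # 20% missing

# Fraction of missing values per column
missing_fraction = toy_df.isna().mean()

# Columns with some, but less than 5%, missing values
sparse_missing_cols = missing_fraction[
    (missing_fraction > 0) & (missing_fraction < 0.05)
].index.tolist()

# Drop only the rows that are missing a value in those columns
cleaned_df = toy_df.dropna(subset=sparse_missing_cols)

print(sparse_missing_cols)
print(cleaned_df.shape)
```

Columns with a larger share of missing values (here `energy`) are left alone, so they can be imputed instead of losing many rows.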

### Imputing values
+ Imputation - use subject-matter expertise to replace missing data with educated guesses
+ Common to use the mean
+ Can also use the median, or another value
+ For categorical values, we typically use the most frequent value - the mode
+ Must split the data before imputing, so that statistics from the test set do not leak into training
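
A minimal sketch of the points above, assuming a made-up `toy_df` with one numeric and one categorical column (neither the data nor the variable names come from the music dataset). The imputers are fitted on the training split only and then applied to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with missing values in both columns
toy_df = pd.DataFrame({
    "loudness": [-5.0, np.nan, -7.0, -6.0, -4.0, np.nan, -8.0, -3.0],
    "genre": pd.Series(
        ["Rock", "Jazz", np.nan, "Rock", "Rock", "Jazz", "Rock", np.nan],
        dtype="object",
    ),
})
X_num = toy_df[["loudness"]]
X_cat = toy_df[["genre"]]

# Split first, so the imputers never see the test rows
X_num_train, X_num_test, X_cat_train, X_cat_test = train_test_split(
    X_num, X_cat, test_size=0.25, random_state=42
)

# Numeric feature: replace missing values with the training mean
num_imputer = SimpleImputer(strategy="mean")
X_num_train_imp = num_imputer.fit_transform(X_num_train)
X_num_test_imp = num_imputer.transform(X_num_test)

# Categorical feature: replace missing values with the training mode
cat_imputer = SimpleImputer(strategy="most_frequent")
X_cat_train_imp = cat_imputer.fit_transform(X_cat_train)
X_cat_test_imp = cat_imputer.transform(X_cat_test)

print(num_imputer.statistics_, cat_imputer.statistics_)
```

Calling `fit_transform` on the training split and only `transform` on the test split is what keeps test-set information out of the imputed values.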

```python
import numpy as np

# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Drop rows with missing values in columns where less than 5% of values are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))
```

## Pipeline for song genre prediction

Now it's time to build a pipeline. It will contain steps to impute missing values using the mean for each feature and build a KNN model for the classification of song genre.

```python
# Import modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Instantiate an imputer (the default strategy is the mean)
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors=3)

# Build steps for the pipeline
steps = [("imputer", imputer), ("knn", knn)]

# Create the pipeline
pipeline = Pipeline(steps)

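# Create X and y, then split into training and test sets
# (assumed step and split settings; the original notes did not show it)
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)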

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
```