# Machine learning with scikit-learn

<br>

## Supervised Learning with scikit-learn
* Requirements to Proceed
    * No Missing Values
    * Data in numeric format
    * Data stored in DataFrame or NumPy array 


* `Supervised learning` is a type of machine learning where the values to be predicted are already known, and a model is built with the aim of accurately predicting values of previously unseen data. 
* `Supervised learning` uses **features** to predict the value of a `target` variable, such as predicting a basketball player's position based on their points per game.


## Classification Challenge
* Classifying labels of unseen data
1. Build a Model
2. Model learns from the labeled data we pass to it
3. Pass unlabeled data to the model as input
4. Model predicts the labels of the unseen data

* `Labeled Data` = training data

### k-Nearest Neighbors
* The idea of k-Nearest Neighbors, or `KNN`, is to **predict the label of any data point by looking at the k**, for example, three, closest labeled data points and getting them to vote on what label the unlabeled observation should have. 
* `KNN` uses `majority voting`, which makes predictions based on what label the majority of nearest neighbors have.
* `KNN` stands for k-nearest neighbors. It is a supervised learning algorithm, mostly used for classification, that labels a data point according to how its neighbors are classified.
    * KNN stores all available cases and classifies new cases based on a similarity measure. 
    * **K** in KNN is a parameter that refers to the `number of the nearest neighbors to include in the majority voting process`.
    * In scikit-learn, k is set with the `n_neighbors` argument of `KNeighborsClassifier`.
* https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
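
As a rough illustration of the voting idea above, here is a minimal pure-Python sketch of KNN. It is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict`, the toy data, and the choice of Euclidean distance are assumptions made for this example.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every labeled point
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    # Keep the labels of the k closest points
    nearest_labels = [label for _, label in sorted(distances, key=lambda d: d[0])[:k]]
    # The most common label among the neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data (made up for illustration): two features per observation
X_toy = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, (2.0, 2.0), k=3))  # 0: closest to the first cluster
print(knn_predict(X_toy, y_toy, (6.0, 6.5), k=3))  # 1: closest to the second cluster
```

Using an odd `k` for a two-class problem avoids tied votes in a sketch like this.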

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Can't find this publicly so simply using the frame from the exercise
churn_df = pd.read_csv('datasets/churn.csv')

# Create arrays for the features and target variables (use values property on pandas series for numpy array return)
y = churn_df['churn'].values
X = churn_df[['account_length', 'customer_service_calls']].values

print(y[:5], '\n\n', X[:5])

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

# Output of knn.fit (X is a 2D NumPy array, see X.shape below)
KNeighborsClassifier(n_neighbors=6)

<script.py> output:
    [1 0 0 0 0] 
    
     [[101   3]
     [ 73   2]
     [ 86   4]
     [ 59   1]
     [129   1]]

# X 2D numpy array quick peek with the features
      
X.shape
Out[3]:
(3333, 2)

In [2]:
churn_df[['account_length', 'customer_service_calls']].head()
Out[2]:

   account_length  customer_service_calls
0             101                       3
1              73                       2
2              86                       4
3              59                       1
4             129                       1
```

### Predict
Now that you have fit a KNN classifier, you can use it to predict the label of new data points. All available data was used for training; fortunately, new observations are available as `X_new`.

The model `knn`, which was created and fit in the previous section, is used to predict the labels of a set of new data points:

```python
# X_new are new observations for the features above in the same shape
X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])

# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred)) 

# Output
Predictions: [0 1 0]
```

### Measuring model performance 

**Computing Accuracy**
1. Split Data into Training & Testing Set
2. Fit/train classifier on training set
3. Calculate accuracy using test set

### Train/test split
```python
from sklearn.model_selection import train_test_split
# Pass the features and target to train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
# 0.88 rounded score
```

#### Flow
To do this, we import `train_test_split` from `sklearn.model_selection`. 
* We call `train_test_split`, passing our features and targets. We commonly use 20-30% of our data as the test set. By setting the `test_size` argument to `0.3` we use 30% here. 
* The `random_state` argument sets a seed for the random number generator that splits the data. 
    * Using the same number when repeating this step allows us to reproduce the exact split and our downstream results. 
    * It is best practice to ensure our split reflects the proportion of labels in our data. So if churn occurs in 10% of observations, we want 10% of labels in our training and test sets to represent churn. 
* We achieve this by setting `stratify` equal to `y`. 
* `train_test_split` returns four arrays: 
    * the training data, 
    * the test data, 
    * the training labels, 
    * and the test labels. 
    * We unpack these into `X_train, X_test, y_train, and y_test`, respectively. 
* We then instantiate a KNN model and fit it to the training data using the `.fit()` method.
* To check the accuracy, we use the `.score()` method, passing `X_test` and `y_test`. 
* The accuracy of our model is 88%, which is low given our labels have a 9 to 1 ratio: always predicting the majority class would already score about 90%.
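
To see why 88% is unimpressive with a 9 to 1 label ratio, the sketch below computes the accuracy of a "model" that always predicts the most frequent label. The labels in `y_example` are hypothetical and only mirror the ratio mentioned above; `majority_baseline_accuracy` is a helper written for this example.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most frequent label."""
    most_common = max(set(labels), key=labels.count)
    return labels.count(most_common) / len(labels)

# Hypothetical labels with a 9-to-1 class ratio (0 = no churn, 1 = churn)
y_example = [0] * 900 + [1] * 100

baseline = majority_baseline_accuracy(y_example)
print(baseline)  # 0.9
```

Comparing a classifier's accuracy against this kind of baseline is a quick sanity check on imbalanced data.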

### Exercise : Compute Accuracy
* Split X and y into training and test sets, setting test_size equal to 20%, random_state to 42, and ensuring the target label proportions reflect that of the original dataset.

```python
# Import the module
from sklearn.model_selection import train_test_split

# Features X is from the churn_df with the target column dropped
X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

0.8740629685157422
```

### Overfitting and underfitting
Interpreting model complexity is a great way to evaluate performance when utilizing supervised learning. Your aim is to produce a model that can interpret the relationship between features and the target variable, as well as generalize well when exposed to new observations.

You will generate accuracy scores for the training and test sets using a KNN classifier with different `n_neighbors` values, which you will plot in the next exercise.

```python
# Create neighbors as a numpy array of values from 1 up to and including 12
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
  
	# Set up a KNN Classifier
	knn = KNeighborsClassifier(n_neighbors=neighbor)
  
	# Fit the model
	knn.fit(X_train, y_train)
  
	# Compute accuracy
	train_accuracies[neighbor] = knn.score(X_train, y_train)
	test_accuracies[neighbor] = knn.score(X_test, y_test)

print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

[ 1  2  3  4  5  6  7  8  9 10 11 12]
# Train Accuracies 
 {1: 1.0, 2: 0.887943971985993, 3: 0.9069534767383692, 4: 0.8734367183591796, 5: 0.8829414707353677, 6: 0.8689344672336168, 7: 0.8754377188594297, 8: 0.8659329664832416, 9: 0.8679339669834918, 10: 0.8629314657328664, 11: 0.864432216108054, 12: 0.8604302151075538} 

# Test Accuracies
 {1: 0.7871064467766117, 2: 0.8500749625187406, 3: 0.8425787106446777, 4: 0.856071964017991, 5: 0.8553223388305847, 6: 0.861319340329835, 7: 0.863568215892054, 8: 0.8605697151424287, 9: 0.8620689655172413, 10: 0.8598200899550225, 11: 0.8598200899550225, 12: 0.8590704647676162}
```
* Notice how training accuracy decreases as the number of neighbors initially gets larger, and vice versa for the testing accuracy


### Visualizing model complexity
Now you have calculated the accuracy of the KNN model on the training and test sets using various values of n_neighbors, you can create a model complexity curve to visualize how performance changes as the model becomes less complex!

The variables neighbors, train_accuracies, and test_accuracies, which you generated in the previous exercise, have all been preloaded for you. You will plot the results to aid in finding the optimal number of neighbors for your model.


* Use the dictionaries above to plot a line visualization of model accuracy for each set
```python
import matplotlib.pyplot as plt

# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies (dict views are converted to lists for plotting)
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")

# Plot test accuracies
plt.plot(list(test_accuracies.keys()), list(test_accuracies.values()), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()
```
![Screen Shot 2023-03-15 at 9.24.06 AM](Screen%20Shot%202023-03-15%20at%209.24.06%20AM.png)
* See how training accuracy decreases and test accuracy increases as the number of neighbors gets larger. For the test set, accuracy peaks with 7 neighbors, suggesting it is the optimal value for our model.

### Sklearn w/Regression
* Now we're going to check out the other type of supervised learning: regression. In regression tasks, the target variable typically has continuous values, such as a country's GDP, or the price of a house.

#### Sklearn Features Consideration
* A one-dimensional array is fine for `y`, but our **features** must be formatted as a two-dimensional array to be accepted by scikit-learn. To convert the shape of `X_bmi` we apply NumPy's `.reshape()` method, passing `-1` followed by `1`. Printing the shape again shows `X_bmi` is now the correct shape for our model.
```python
# Set bmi as single predictor feature value
# slice out the BMI column of X, which is the fourth column, storing as the variable X_bmi
X_bmi = X[:, 3]
print(X_bmi.shape) # (752,) 1Dimensional
# Reshape for sklearn acceptance
X_bmi = X_bmi.reshape(-1, 1)
print(X_bmi.shape) # (752,1) 2Dimensional feature for single variable 
```

#### Exercise
* In this chapter, you will work with a dataset called sales_df, which contains information on advertising campaign expenditure across different media types, and the number of dollars generated in sales for the respective campaign

* You will use the advertising expenditure as features to predict sales values, initially working with the "radio" column.
```python
In [2]:
sales_df.head()
Out[2]:

        tv     radio  social_media      sales
0  16000.0   6566.23       2907.98   54732.76
1  13000.0   9237.76       2409.57   46677.90
2  41000.0  15886.45       2913.41  150177.83
3  83000.0  30020.03       6922.30  298246.34
4  15000.0   8437.41       1406.00   56594.18

In [3]:
sales_df.isna().sum()
Out[3]:

tv              0
radio           0
social_media    0
sales           0
dtype: int64
# No null values to consider

```
```python
import numpy as np

# Create X from the radio column's values (radio advertising expenditure)
X = sales_df['radio'].values

# Create y from the sales column's values (the target to predict)
y = sales_df['sales'].values

# Reshape X
X = X.reshape(-1, 1)

# Check the shape of the features and targets
print(X.shape, '\n', y.shape)

<script.py> output:
    (4546, 1) 
     (4546,)
```

* Building a linear regression model
Now you have created your feature and target arrays, you will train a linear regression model on all feature and target values.
```python
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X, y)

# Make predictions
predictions = reg.predict(X)

print(predictions[:5])

[ 95491.17119147 117829.51038393 173423.38071499 291603.11444202
 111137.28167129]
```

* Visualizing Linear Regression Model
    * The variables X, an array of radio values, y, an array of sales values, and predictions, an array of the model's predicted values for y given X, have all been preloaded for you from the previous exercise.   
```python
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create scatter plot
plt.scatter(X, y, color="blue")

# Create line plot
plt.plot(X, predictions, color="red")
plt.xlabel("Radio Expenditure ($)")
plt.ylabel("Sales ($)")

# Display the plot
plt.show()
```
![Screen Shot 2023-03-15 at 9.57.58 AM](Screen%20Shot%202023-03-15%20at%209.57.58%20AM.png)


### Fit and predict for regression
Now you have seen how linear regression works, your task is to create a multiple linear regression model using all of the features in the sales_df dataset, which has been preloaded for you
```python
# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

<script.py> output:
    Predictions: [53176.66154234 70996.19873235], Actual Values: [55261.28 67574.9 ]
    
# The first two predictions appear to be within around 5% of the actual values from the test set!
```

### Regression performance
Now you have fit a model, reg, using all features from sales_df, and made predictions of sales values, you can evaluate performance using some common regression metrics.

* Your task is to find out how well the features can explain the variance in the target values, along with assessing the model's ability to make predictions on unseen data.
```python
# Import mean_squared_error
from sklearn.metrics import mean_squared_error

# Compute R-squared: reg.score returns the coefficient of determination of the predictions
r_squared = reg.score(X_test, y_test)

# Compute RMSE: the square root of the average squared difference between predicted and actual values
# (np.sqrt of the MSE is used because the squared=False argument is deprecated in newer scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the metrics
print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))

R^2: 0.9990165886162027
RMSE: 2942.372219812037
```
* The features explain 99.9% of the variance in sales values! Looks like this company's advertising strategy is working well!
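
To make the two metrics concrete, here is a small from-scratch sketch of R-squared and RMSE on made-up numbers. The function names and toy arrays are assumptions for illustration; in practice `reg.score` and `mean_squared_error` do this work.

```python
import math

def r_squared_from_scratch(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse_from_scratch(y_true, y_pred):
    """Square root of the mean squared error, in the same units as the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Toy values (made up for illustration)
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.0, 5.0, 8.0, 9.0]

print(r_squared_from_scratch(y_true_toy, y_pred_toy))
print(rmse_from_scratch(y_true_toy, y_pred_toy))
```

Here the residual sum of squares is 2 and the total sum of squares is 20, so R-squared is about 0.9, while RMSE is the square root of 0.5.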

## Cross-validation

If we're computing `R-squared` on our test set, the R-squared returned is dependent on the way that we split up the data! The data points in the test set may have some peculiarities that mean the R-squared computed on it is not representative of the model's ability to generalize to unseen data. To combat this dependence on what is essentially a random split, we use a technique called `cross-validation`.

In cross-validation, the data is split into folds; each fold is used once as the test set while the model is fit on the remaining folds. If we split the dataset into five folds, we call this process 5-fold cross-validation. If we use 10 folds, it is called 10-fold cross-validation. More generally, if we use k folds, it is called k-fold cross-validation or k-fold CV. There is, however, a trade-off. Using more folds is more computationally expensive. This is because we are fitting and predicting more times.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

kf = KFold(n_splits=6, shuffle=True, random_state=42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv=kf)
```
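
The fold mechanics can be sketched without scikit-learn. The generator below is a simplified stand-in for what a k-fold splitter does (no shuffling); the name `k_fold_indices` is made up for this example and is not part of any library.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Every observation lands in the test set exactly once across the folds
for train_idx, test_idx in k_fold_indices(n_samples=10, n_splits=5):
    print("train:", train_idx, "test:", test_idx)
```

Each of the k models is fit on k-1 folds and scored on the remaining one, which is why more folds means more fitting and predicting.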

### Cross-validation for R-squared : Ex
Cross-validation is a vital approach to evaluating a model. It maximizes the amount of data that is available to the model, as the model is not only trained but also tested on all of the available data.

In this exercise, you will build a linear regression model, then use 6-fold cross-validation to assess its accuracy for predicting sales using social media advertising expenditure. You will display the individual score for each of the six-folds.

```python
# Import the necessary modules
from sklearn.model_selection import KFold, cross_val_score

# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)

reg = LinearRegression()

# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)

# Print scores
print(cv_scores)

[0.74451678 0.77241887 0.76842114 0.7410406  0.75170022 0.74406484]
```
* Notice how R-squared for each fold ranged between 0.74 and 0.77? By using cross-validation, we can see how performance varies depending on how the data is split


### Analyzing cross-validation metrics
Now you have performed cross-validation, it's time to analyze the results.

You will display the mean, standard deviation, and a 95% interval (the 2.5th and 97.5th percentiles of the fold scores) for `cv_results`, the six fold scores computed in the previous exercise.

```python
# Print the mean
print(np.mean(cv_results))

# Print the standard deviation
print(np.std(cv_results))

# Print the 95% interval (2.5th and 97.5th percentiles of the fold scores)
print(np.quantile(cv_results, [.025, .975]))

0.7536937416666666
0.012305386274436092
[0.74141863 0.77191915]
```

### Regularized Regression
#### Why Regularize
* Linear regression minimizes a loss function
* The model chooses a coefficient, a, for each feature, plus an intercept, b
* Large coefficients can lead to `overfitting`
* Regularization: penalize large coefficients

#### Ridge regression
* Ridge penalizes large positive or negative coefficients
* Choose alpha carefully: it controls model complexity
    * low alpha can lead to overfitting
    * high alpha can lead to underfitting

To highlight the impact of different alpha values, we create an empty list for our scores, then loop through a list of different alpha values. Inside the for loop we instantiate `Ridge`, setting the `alpha` keyword argument equal to the loop variable, also called `alpha`. We fit on the training data, and predict on the test data. We save the model's R-squared value to the scores list. Finally, outside of the loop, we print the scores for the models with five different alpha values. We see performance gets worse as alpha increases.



#### Exercises
**Regularized regression: Ridge**

Ridge regression performs regularization by computing the squared values of the model parameters multiplied by alpha and adding them to the loss function.
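
As a sketch of the penalty described above, the function below adds `alpha` times the sum of squared coefficients to the sum of squared residuals. The function name and toy numbers are assumptions for illustration, and the intercept is assumed to be left out of the penalty.

```python
def ridge_loss(y_true, y_pred, coefficients, alpha):
    """Sum of squared residuals plus alpha times the sum of squared coefficients."""
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    penalty = alpha * sum(c ** 2 for c in coefficients)
    return squared_error + penalty

# Toy values (made up for illustration)
y_true_toy = [3.0, 5.0, 7.0]
y_pred_toy = [2.5, 5.0, 7.5]
coefs_toy = [2.0, -1.0]

for alpha in [0.0, 1.0, 10.0]:
    print(alpha, ridge_loss(y_true_toy, y_pred_toy, coefs_toy, alpha))
```

With `alpha=0` the loss is just the squared error (0.5 here); as alpha grows, the same coefficients cost more, which pushes the fitted model toward smaller coefficients.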

In this exercise, you will fit ridge regression models over a range of different alpha values, and print their R-squared scores. You will use all of the features in the sales_df dataset to predict "sales". The data has been split into X_train, X_test, y_train, y_test for you.
 
```python
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
  
  # Create a Ridge regression model with the current alpha
  ridge = Ridge(alpha=alpha)
  
  # Fit the data
  ridge.fit(X_train, y_train)
  
  # Obtain R-squared
  score = ridge.score(X_test, y_test)
  ridge_scores.append(score)
  
print(ridge_scores)
[0.9990152104759369, 0.9990152104759373, 0.9990152104759419, 0.9990152104759871, 0.9990152104764387, 0.9990152104809561]
```
* The scores don't appear to change much as alpha increases, which is indicative of how well the features explain the variance in the target—even by heavily penalizing large coefficients, underfitting does not occur!

#### Lasso
* Can be used to identify important features in a dataset
```python
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regression model
lasso = Lasso(alpha=0.3)

# Fit the model to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
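# Output: [ 3.56256962 -0.00397035  0.00496385]

# sales_columns holds the feature names; assumed here to be the sales_df columns without the target
sales_columns = sales_df.drop("sales", axis=1).columns
plt.bar(sales_columns, lasso_coef)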
plt.xticks(rotation=45)
plt.show()
```

<br>

## Fine-Tuning Your Model
Having trained models, now you will learn how to evaluate them. In this chapter, you will be introduced to several metrics along with a visualization technique for analyzing classification model performance using scikit-learn. You will also learn how to optimize classification and regression models through the use of hyperparameter tuning.

### Deciding on a primary metric
As you have seen, several metrics can be useful to evaluate the performance of classification models, including accuracy, precision, recall, and F1-score.


#### Assessing a diabetes prediction classifier
In this chapter you'll work with the diabetes_df dataset introduced previously.

The goal is to predict whether or not each individual is likely to have diabetes based on the features body mass index (BMI) and age (in years). Therefore, it is a binary classification problem. A target value of 0 indicates that the individual does not have diabetes, while a value of 1 indicates that the individual does have diabetes.

diabetes_df has been preloaded for you as a pandas DataFrame and split into X_train, X_test, y_train, and y_test. In addition, a KNeighborsClassifier() has been instantiated and assigned to knn.

You will fit the model, make predictions on the test set, then produce a confusion matrix and classification report.


```python
# Import confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[116  35]
 [ 46  34]]
              precision    recall  f1-score   support

           0       0.72      0.77      0.74       151
           1       0.49      0.42      0.46        80

    accuracy                           0.65       231
   macro avg       0.60      0.60      0.60       231
weighted avg       0.64      0.65      0.64       231
```
* The model produced 34 true positives and 35 false positives, meaning precision was less than 50%, which is confirmed in the classification report. The output also shows a better F1-score for the zero class, which represents individuals who do not have diabetes.
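
The figures in the classification report can be recomputed by hand from the confusion matrix above. This sketch assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes, with class 1 (has diabetes) treated as the positive class.

```python
# Confusion matrix from the output above: [[TN, FP], [FN, TP]]
tn, fp = 116, 35
fn, tp = 46, 34

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

These values line up with the class 1 row of the report (0.49, 0.42, 0.46) and the overall accuracy of 0.65.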


### Logistic Regression & ROC

#### Building a logistic regression model
In this exercise, you will build a logistic regression model using all features in the diabetes_df dataset. The model will be used to predict the probability of individuals in the test set having a diabetes diagnosis.

The diabetes_df dataset has been split into X_train, X_test, y_train, and y_test, and preloaded for you.

```python
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# probability estimates
print(logreg.predict_proba(X_test)[:5])

# Predict probabilities : each individual in the test set having a diabetes diagnosis
# Recall the second column of the results which contains all positive probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])

<script.py> output:
    [[0.73448969 0.26551031]
     [0.81663458 0.18336542]
     [0.87880404 0.12119596]
     [0.84386435 0.15613565]
     [0.50388715 0.49611285]]
    [0.26551031 0.18336542 0.12119596 0.15613565 0.49611285 0.44582236
     0.01359235 0.61646125 0.55640546 0.7931187 ]
```

### The ROC curve
Now you have built a logistic regression model for predicting diabetes status, you can plot the ROC curve to visualize how the true positive rate and false positive rate vary as the decision threshold changes.

```python
In [2]:
y_pred_probs[:5]
Out[2]:
array([0.26551031, 0.18336542, 0.12119596, 0.15613565, 0.49611285])
# The values above are the positive-class probabilities
# Import roc_curve
from sklearn.metrics import roc_curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--')

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()
```
* The ROC curve is above the dotted line, so the model performs better than randomly guessing the class of each observation.
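
Each point on a ROC curve comes from one decision threshold. The sketch below computes the false positive rate and true positive rate at a few thresholds for made-up labels and probabilities; `roc_point` is a helper written for this example, not a scikit-learn function.

```python
def roc_point(y_true, y_scores, threshold):
    """Return (false positive rate, true positive rate) at one decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Toy labels and predicted probabilities (made up for illustration)
y_true_toy = [0, 0, 1, 0, 1, 1]
y_scores_toy = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

for threshold in [0.2, 0.5, 0.8]:
    fpr, tpr = roc_point(y_true_toy, y_scores_toy, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold labels more observations as positive, which raises both rates; `roc_curve` sweeps the threshold across the predicted probabilities to trace the whole curve.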


### ROC AUC
The ROC curve you plotted in the last exercise looked promising.

Now you will compute the area under the ROC curve, along with the other classification metrics you have used previously.

The confusion_matrix and classification_report functions have been preloaded for you, along with the logreg model you previously built, plus X_train, X_test, y_train, y_test. Also, the model's predicted test set labels are stored as y_pred, and probabilities of test set observations belonging to the positive class stored as y_pred_probs.

A knn model has also been created and the performance metrics printed in the console, so you can compare the roc_auc_score, confusion_matrix, and classification_report between the two models.

```python
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the classification report
print(classification_report(y_test, y_pred))

<script.py> output:
    0.8002483443708608
    [[121  30]
     [ 30  50]]
                  precision    recall  f1-score   support
    
               0       0.80      0.80      0.80       151
               1       0.62      0.62      0.62        80
    
        accuracy                           0.74       231
       macro avg       0.71      0.71      0.71       231
    weighted avg       0.74      0.74      0.74       231
    
```
* Did you notice that logistic regression performs better than the KNN model across all the metrics you calculated? A ROC AUC score of 0.8002 is 60% higher than the 0.5 that a chance model would score. scikit-learn makes it easy to produce several classification metrics with only a few lines of code.

## Hyperparameters

### Hyperparameter tuning with GridSearchCV
Now you have seen how to perform grid search hyperparameter tuning, you are going to build a lasso regression model with optimal hyperparameters to predict blood glucose levels using the features in the diabetes_df dataset.

```python
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(.00001, 1, 20)}

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
# Print alpha in fixed-point rather than scientific notation
print("Tuned lasso parameters: {:.5f}".format(list(lasso_cv.best_params_.values())[0]))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

Tuned lasso parameters: {'alpha': 1e-05}
Tuned lasso parameters: 0.00001
Tuned lasso score: 0.33078807238121977
```
*  Unfortunately, the best model only has an R-squared score of 0.33, highlighting that using the optimal hyperparameters does not guarantee a high performing model!

### Hyperparameter tuning with RandomizedSearchCV
As you saw, GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space. In this case, you can use `RandomizedSearchCV`, which tests a fixed number of hyperparameter settings from specified probability distributions.
```python
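# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter space
# (the "l1" penalty needs a solver that supports it, such as liblinear or saga)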
params = {"penalty": ["l1", "l2"],
         "tol": np.linspace(0.0001, 1.0, 50),
         "C": np.linspace(.1, 1.0, 50),
         "class_weight": ["balanced", {0:0.8, 1:0.2}]}

# Instantiate the RandomizedSearchCV object
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))

<script.py> output:
    Tuned Logistic Regression Parameters: {'tol': 0.14294285714285712, 'penalty': 'l2', 'class_weight': 'balanced', 'C': 0.6326530612244898}
    Tuned Logistic Regression Best Accuracy Score: 0.7460082633613221
```
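
To give a feel for the cost difference, this sketch counts the combinations an exhaustive grid over a space shaped like `params` would need, then draws a fixed number of random settings instead. It only mimics the sampling idea: `search_space` is a hypothetical stand-in, and unlike this sketch a real search also fits and scores a model for each setting.

```python
import random

# Hypothetical search space mirroring the shape of `params` above
search_space = {
    "penalty": ["l1", "l2"],
    "tol": [0.0001 + i * (1.0 - 0.0001) / 49 for i in range(50)],
    "C": [0.1 + i * (1.0 - 0.1) / 49 for i in range(50)],
    "class_weight": ["balanced", {0: 0.8, 1: 0.2}],
}

# A full grid search would evaluate every combination
grid_size = 1
for values in search_space.values():
    grid_size *= len(values)

# Random search draws a fixed number of settings instead
# (this sketch samples each value independently, so repeats are possible)
rng = random.Random(42)
n_iter = 10
sampled_settings = [
    {name: rng.choice(values) for name, values in search_space.items()}
    for _ in range(n_iter)
]

print(grid_size)              # 10000
print(len(sampled_settings))  # 10
```

With cross-validation on top, the grid would mean 10,000 settings times the number of folds, which is why sampling a handful of settings is attractive for large spaces.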


## Preprocessing and Pipelines
Learn how to impute missing values, convert categorical data to numeric values, scale data, evaluate multiple supervised learning models simultaneously, and build pipelines to streamline your workflow!

### Creating Dummy Variables
Being able to include categorical features in the model building process can enhance performance as they may add information that contributes to prediction accuracy.

The music_df dataset has been preloaded for you, and its shape is printed. Also, pandas has been imported as pd
```python
music_df.head()
Out[1]:

   popularity  acousticness  danceability  duration_ms  energy  ...  loudness  speechiness    tempo  valence       genre
0        41.0         0.644         0.823     236533.0   0.814  ...    -5.611        0.177  102.619    0.649        Jazz
1        62.0         0.086         0.686     154373.0   0.670  ...    -7.626        0.225  173.915    0.636         Rap
2        42.0         0.239         0.669     217778.0   0.736  ...    -3.223        0.060  145.061    0.494  Electronic
3        64.0         0.013         0.522     245960.0   0.923  ...    -4.560        0.054  120.406    0.595        Rock
4        60.0         0.121         0.780     229400.0   0.467  ...    -6.645        0.253   96.056    0.312         Rap
```
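* `genre` is the only string (non-numeric) column in `music_df`, so `pd.get_dummies` converts only that column into binary columns; `drop_first=True` drops the first genre category to avoid a redundant column.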
```python
# Create music_dummies
music_dummies = pd.get_dummies(music_df, drop_first=True)

# Print the new DataFrame's shape
print("Shape of music_dummies: {}".format(music_dummies.shape))
Shape of music_dummies: (1000, 20)
```
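
What dummy encoding does to a single categorical column can be sketched in plain Python. This is an illustration of the idea only, not the pandas implementation; `get_dummies_sketch` and its `prefix` argument are made up for this example, and the genres are the five rows shown in `music_df.head()` above.

```python
def get_dummies_sketch(values, prefix, drop_first=True):
    """One-hot encode a list of category labels into rows of 0/1 columns."""
    categories = sorted(set(values))
    if drop_first:
        # One category is implied when all remaining dummy columns are 0,
        # so its column would be redundant
        categories = categories[1:]
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows

genres = ["Jazz", "Rap", "Electronic", "Rock", "Rap"]
columns, rows = get_dummies_sketch(genres, prefix="genre")

print(columns)
for genre, row in zip(genres, rows):
    print(genre, row)
```

In this toy case "Electronic" is the dropped category, so an Electronic song is represented by all zeros.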

### Regression with categorical features
Now you have created music_dummies, containing binary features for each song's genre, it's time to build a ridge regression model to predict song popularity.

music_dummies has been preloaded for you, along with Ridge, cross_val_score, numpy as np, and a KFold object stored as kf.

The model will be evaluated by calculating the average RMSE, but first, you will need to convert the scores for each fold to positive values and take their square root. This metric shows the average error of our model's predictions, so it can be compared against the standard deviation of the target value—"popularity".

```python
# Create X and y
X = music_dummies.drop('popularity', axis=1).values
y = music_dummies['popularity'].values

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate RMSE
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))

<script.py> output:
    Average RMSE: 8.236853840202299
    Standard Deviation of the target array: 14.02156909907019
```

### Handling Missing Data
```python
# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Drop rows with missing values in the columns where less than 5% of values are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))

<script.py> output:
    genre                 8
    popularity           31
    loudness             44
    liveness             46
    tempo                46
    speechiness          59
    duration_ms          91
    instrumentalness     91
    danceability        143
    valence             143
    acousticness        200
    energy              200
    dtype: int64
    popularity            0
    liveness              0
    loudness              0
    tempo                 0
    genre                 0
    duration_ms          29
    instrumentalness     29
    speechiness          53
    danceability        127
    valence             127
    acousticness        178
    energy              178
    dtype: int64
    Shape of the `music_df`: (892, 12)
```
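
The column list passed to `dropna` above was written by hand. As a rough sketch, the same "fewer than 5% missing" rule could be derived programmatically; the tiny DataFrame and the names `df`, `missing_frac`, and `cols_to_drop_rows` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "a" has 1 missing value out of 25 (4%), "b" has 5 (20%)
df = pd.DataFrame({"a": np.arange(25, dtype=float), "b": np.arange(25, dtype=float)})
df.loc[0, "a"] = np.nan
df.loc[1:5, "b"] = np.nan  # label slicing is inclusive, so this sets 5 rows

# Fraction of missing values per column
missing_frac = df.isna().mean()

# Columns with some, but fewer than 5%, missing values
cols_to_drop_rows = missing_frac[(missing_frac > 0) & (missing_frac < 0.05)].index.tolist()
print(cols_to_drop_rows)

# Drop only the rows that are missing a value in those columns
df_clean = df.dropna(subset=cols_to_drop_rows)
print(df_clean.shape)
```

Columns above the threshold (here `"b"`) keep their missing values and are left for imputation.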

### Pipeline for song genre prediction: I (Impute Pipeline)
Now it's time to build a pipeline. It will contain steps to impute missing values using the mean for each feature and build a KNN model for the classification of song genre.

The modified music_df dataset that you created in the previous exercise has been preloaded for you, along with KNeighborsClassifier and train_test_split.

```python
# Import modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors=3)

# Build steps for the pipeline
steps = [("imputer", imputer), 
         ("knn", knn)]
```
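
`SimpleImputer()` uses `strategy="mean"` by default, which is why no arguments are needed above. A small sketch on a made-up array (the values and the `*_demo` names are assumptions for illustration) shows what the imputer learns and how it fills gaps:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value per column
X_small = np.array([[1.0, 2.0],
                    [np.nan, 4.0],
                    [3.0, np.nan]])

imputer_demo = SimpleImputer()  # default strategy="mean"
X_filled = imputer_demo.fit_transform(X_small)

print(imputer_demo.statistics_)  # per-column means learned during fit
print(X_filled)                  # NaNs replaced by the column means
```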

### Pipeline for song genre prediction: II
Having set up the steps of the pipeline in the previous exercise, you will now use it on the music_df dataset to classify the genre of songs. What makes pipelines so incredibly useful is the simple interface that they provide.

X_train, X_test, y_train, and y_test have been preloaded for you, and confusion_matrix has been imported from sklearn.metrics.

```python
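# imp_mean is assumed to be the preloaded mean imputer (a SimpleImputer, like `imputer` in the previous exercise)
steps = [("imputer", imp_mean),
         ("knn", knn)]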

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

<script.py> output:
    [[79  9]
     [ 4 82]]
```
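* With genre encoded as 1 for Rock, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, so the model produced 79 true negatives and 82 true positives, along with 9 false positives and 4 false negatives.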
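
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so for binary labels 0 and 1 the cells read `[[TN, FP], [FN, TP]]` when 1 is treated as the positive class. The sketch below rebuilds label arrays that reproduce the matrix above (an illustrative reconstruction, not the actual test data) and unpacks the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Reconstructed labels: 88 true zeros followed by 86 true ones
y_true = np.array([0] * 88 + [1] * 86)
# Predictions arranged to give 79 TN, 9 FP, 4 FN, 82 TP
y_hat = np.array([0] * 79 + [1] * 9 + [0] * 4 + [1] * 82)

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Metrics derived from the four cells
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```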


## Centering and scaling
* Many machine learning models use some form of distance to inform them, so if we have features on far larger scales, they can disproportionately influence our model. For example, KNN uses distance explicitly when making predictions. For this reason, we actually want features to be on a similar scale. To achieve this, we can normalize or standardize our data, often referred to as scaling and centering.
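
As a small illustration of standardization, the sketch below applies `StandardScaler` to two made-up features on very different scales; the array values and the names `X_raw` and `scaler_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features: a duration in milliseconds and a 0-1 score
X_raw = np.array([[180_000.0, 0.2],
                  [240_000.0, 0.5],
                  [300_000.0, 0.8]])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_raw)

# After standardization each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Without scaling, a distance computed on `X_raw` would be dominated almost entirely by the first column.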

### Centering and scaling for regression
Now that you have seen the benefits of scaling your data, you will use a pipeline to preprocess the music_df features and build a lasso regression model to predict a song's loudness.

```python
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)

# Calculate and print R-squared
print(pipeline.score(X_test, y_test))

<script.py> output:
    0.6193523316282489
```

### Centering and scaling for classification
Now you will bring together scaling and model building into a pipeline for cross-validation.

Your task is to build a pipeline to scale features in the music_df dataset and perform grid search cross-validation using a logistic regression model with different values for the hyperparameter C. The target variable here is "genre", which contains binary values for rock as 1 and any other genre as 0.

```python
# Build the steps
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)
```
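
The key `"logreg__C"` follows the pipeline naming convention `<step name>__<parameter name>`, which is how the grid search knows to set `C` on the `"logreg"` step. Below is a self-contained sketch on synthetic data; the dataset from `make_classification` and the names `pipe`, `grid`, and `X_tr` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption, not the music dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=21)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=21)

pipe = Pipeline([("scaler", StandardScaler()), ("logreg", LogisticRegression())])

# get_params() lists every tunable name; "logreg__C" is step "logreg", parameter "C"
print("logreg__C" in pipe.get_params())

grid = GridSearchCV(pipe, param_grid={"logreg__C": np.linspace(0.001, 1.0, 5)}, cv=3)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)
print(grid.score(X_te, y_te))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every cross-validation fold, so no information from the held-out fold leaks into the scaling.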

### Visualizing Regression Model Performance
```python
models = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1), "Lasso": Lasso(alpha=0.1)}
results = []

# Loop through the models' values
for model in models.values():
  kf = KFold(n_splits=6, random_state=42, shuffle=True)
  
  # Perform cross-validation (the default score for regressors is R^2, the coefficient of determination)
  cv_scores = cross_val_score(model, X_train, y_train, cv=kf)
  
  # Append the results
  results.append(cv_scores)

print(len(results))  # 3: one array of per-fold R^2 scores for each model

In [1]:
results
Out[1]:

[array([0.71078444, 0.75604101, 0.80460283, 0.72571513, 0.72716622,
        0.61911259]),
 array([0.71082272, 0.75603243, 0.8045894 , 0.72571954, 0.72715631,
        0.61912813]),
 array([0.46884783, 0.4084498 , 0.41445316, 0.41556667, 0.40095602,
        0.39870331])]

# Create a box plot of the results
# Note: in Matplotlib 3.9+ the `labels` parameter is named `tick_labels`
plt.boxplot(results, labels=models.keys())
plt.show()
```
![Screen Shot 2023-03-15 at 4.48.06 PM](Screen%20Shot%202023-03-15%20at%204.48.06%20PM.png)

### Predicting on the test set
In the last exercise, `linear regression` and `ridge` appeared to produce similar results. It would be appropriate to select either of those models; however, you can check predictive performance on the test set to see if either one can outperform the other.

```python
# Import mean_squared_error
from sklearn.metrics import mean_squared_error

for name, model in models.items():
  
  # Fit the model to the training data
  model.fit(X_train_scaled, y_train)
  
  # Make predictions on the test set
  y_pred = model.predict(X_test_scaled)
  
  # Calculate the test_rmse (np.sqrt of the MSE works across versions; `squared=False` is deprecated since scikit-learn 1.4)
  test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print("{} Test Set RMSE: {}".format(name, test_rmse))

<script.py> output:
    Linear Regression Test Set RMSE: 0.11988851505947569
    Ridge Test Set RMSE: 0.11987066103299668
```
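
The loop above relies on `X_train_scaled` and `X_test_scaled`, which the exercise preloads. One common way to produce them, sketched here on synthetic data as an assumption about the setup rather than the course's actual code, is to fit the scaler on the training split only and reuse it on the test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

test_rmse_by_model = {}
for name, model in {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1)}.items():
    model.fit(X_tr_scaled, y_tr)
    y_hat = model.predict(X_te_scaled)
    # RMSE as the square root of the mean squared error
    test_rmse_by_model[name] = np.sqrt(mean_squared_error(y_te, y_hat))
    print("{} Test Set RMSE: {}".format(name, test_rmse_by_model[name]))
```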

### Visualizing classification model performance
In this exercise, you will be solving a classification problem where the "popularity" column in the music_df dataset has been converted to binary values, with 1 representing popularity more than or equal to the median for the "popularity" column, and 0 indicating popularity below the median.

Your task is to build and visualize the results of three different models to classify whether a song is popular or not.

```python
# Create models dictionary
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), "Decision Tree Classifier": DecisionTreeClassifier()}
results = []

# Loop through the models' values
for model in models.values():
  
  # Instantiate a KFold object
  kf = KFold(n_splits=6, random_state=12, shuffle=True)
  
  # Perform cross-validation
  cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
  results.append(cv_results)
# Note: in Matplotlib 3.9+ the `labels` parameter is named `tick_labels`
plt.boxplot(results, labels=models.keys())
plt.show()

In [1]:
results
Out[1]:

[array([0.8  , 0.752, 0.736, 0.8  , 0.736, 0.736]),
 array([0.696, 0.728, 0.728, 0.768, 0.72 , 0.72 ]),
 array([0.632, 0.696, 0.752, 0.728, 0.656, 0.648])]
```
![Screen Shot 2023-03-15 at 5.00.21 PM](Screen%20Shot%202023-03-15%20at%205.00.21%20PM.png)
* Looks like logistic regression is the best candidate based on the cross-validation results!
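
To back the visual impression with numbers, the fold scores printed above can be summarized directly. This sketch copies the three arrays from the `results` output; the dictionary name `cv_results_by_model` is an assumption for illustration.

```python
import numpy as np

# Fold accuracies copied from the `results` output above
cv_results_by_model = {
    "Logistic Regression": np.array([0.8, 0.752, 0.736, 0.8, 0.736, 0.736]),
    "KNN": np.array([0.696, 0.728, 0.728, 0.768, 0.72, 0.72]),
    "Decision Tree Classifier": np.array([0.632, 0.696, 0.752, 0.728, 0.656, 0.648]),
}

# Mean and spread of accuracy across the six folds
for name, scores in cv_results_by_model.items():
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")

best = max(cv_results_by_model, key=lambda name: cv_results_by_model[name].mean())
print("Highest mean accuracy:", best)
```

The mean accuracies come out at roughly 0.760, 0.727, and 0.685, which matches the ordering seen in the box plot.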

### Pipeline for predicting song popularity
For the final exercise, you will build a pipeline to impute missing values, scale features, and perform hyperparameter tuning of a logistic regression model. The aim is to find the best parameters and accuracy when predicting song genre!

```python
# Create steps
steps = [("imp_mean", SimpleImputer()), 
         ("scaler", StandardScaler()), 
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)
params = {"logreg__solver": ["newton-cg", "saga", "lbfgs"],
         "logreg__C": np.linspace(0.001, 1.0, 10)}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params)
tuning.fit(X_train, y_train)
y_pred = tuning.predict(X_test)

# Print the best parameters and the test set accuracy of the grid search object
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))

<script.py> output:
    Tuned Logistic Regression Parameters: {'logreg__C': 0.112, 'logreg__solver': 'newton-cg'}, Accuracy: 0.82
```
* You've selected a model, built a preprocessing pipeline, and performed hyperparameter tuning to create a model that is 82% accurate in predicting song genres!
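
After tuning, the fitted grid search object holds the whole winning pipeline, not just the classifier. The sketch below, on synthetic data with values removed at random (an assumption for illustration, as are the names `pipe`, `search`, and `best_pipe`), shows one way to reach the refit pipeline and its individual steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of values knocked out so the imputer has work to do
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

pipe = Pipeline([("imp_mean", SimpleImputer()),
                 ("scaler", StandardScaler()),
                 ("logreg", LogisticRegression())])
search = GridSearchCV(pipe, param_grid={"logreg__C": np.linspace(0.001, 1.0, 5)}, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is the full pipeline refit on the training data with the best parameters
best_pipe = search.best_estimator_
print(best_pipe.named_steps["logreg"].C, search.best_params_["logreg__C"])

# Predictions run through imputation and scaling automatically
print(best_pipe.predict(X_te[:5]))
print(search.score(X_te, y_te))
```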

![Screen Shot 2023-03-15 at 5.08.36 PM](Screen%20Shot%202023-03-15%20at%205.08.36%20PM.png)
