#### Classification
Two types of supervised learning: classification and regression <br></br>
K-Nearest Neighbors (KNN): a classification algorithm that predicts the label held by the majority of an observation's nearest neighbors <br></br>
<b>from sklearn.neighbors import KNeighborsClassifier</b> Used for KNN <br></br>
<b>X = df[["feature_1", "feature_2"]].values</b> X is a 2D array of the features. The .values attribute converts a DataFrame or Series into a NumPy array <br></br>
<b>y = df["target"].values</b> y, a 1D array of target values <br></br>
<b>knn = KNeighborsClassifier(n_neighbors=15)</b> Instantiate the KNN <br></br> 
<b>knn.fit(X, y)</b> fit the classifier <br></br>
<b>predictions = knn.predict(X_new)</b> Predict new data stored in X_new variable <br></br>
<b>print("Predictions: {}".format(predictions))</b> Print out predictions <br></br>
<br></br>
Measuring model performance/accuracy <br></br>
<b>from sklearn.model_selection import train_test_split</b> Used below <br></br>
<b>X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)</b> 30% of the data is held out for testing, random_state sets the seed for the random split so it is reproducible, and stratify=y ensures both splits reflect the proportion of labels in the data <br></br>
<b>knn.score(X_test, y_test)</b> Check accuracy on the test set (after fitting with knn.fit(X_train, y_train)) <br></br>
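The notes above can be tried end to end without a CSV file. This is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for a real DataFrame; the sample size and seeds are arbitrary choices, not values from the course.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real DataFrame: 500 rows, 2 informative features
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=21)

# Hold out 30% of the rows for testing, keeping label proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y)

# Fit on the training rows only
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)

# Accuracy = fraction of test labels predicted correctly
accuracy = knn.score(X_test, y_test)
print("Test rows: {}, accuracy: {:.3f}".format(len(X_test), accuracy))
```

Scoring on rows the model never saw during fitting is what makes the accuracy a fair estimate.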


The features to use will be "account_length" and "customer_service_calls". The target, "churn", needs to be a single column with the same number of observations as the feature data.
<br></br>
You will convert the features and the target variable into NumPy arrays, create an instance of a KNN classifier, and then fit it to the data.

In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np

churn_df = pd.read_csv("telecom_churn_clean.csv")

# Create arrays for the features and the target variable
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)

In [None]:
X_new = np.array([[30.0, 17.5],
                  [107.0, 24.1],
                  [213.0, 10.9]])

# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions for X_new
print("Predictions: {}".format(y_pred)) 

In [None]:
# Import the module
from sklearn.model_selection import train_test_split

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

In [None]:
# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
  
	# Set up a KNN Classifier
	knn = KNeighborsClassifier(n_neighbors=neighbor)
  
	# Fit the model
	knn.fit(X_train, y_train)
  
	# Compute accuracy
	train_accuracies[neighbor] = knn.score(X_train, y_train)
	test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

In [None]:
import matplotlib.pyplot as plt

# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies
plt.plot(neighbors, train_accuracies.values(), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, test_accuracies.values(), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()

#### Intro To Regression 
<b>from sklearn.linear_model import LinearRegression</b> Used below <br></br>
<b>reg = LinearRegression()</b> Instantiate the model <br></br>
<b>reg.fit(X, y)</b> Fit the model <br></br>
<b>predictions = reg.predict(X)</b> Gives line of best fit <br></br>
<b>plt.scatter(X, y)</b> Plot values <br></br>
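<b>plt.plot(X, predictions)</b> Plot line of best fit <br></br>
<b>reg.score(X_test, y_test)</b> R-squared, the share of the target's variance explained by the features. Values typically range from 0 to 1, with 1 meaning the features completely explain the target's variance; it can be negative on held-out data when the model does worse than predicting the mean <br></br>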
<b>from sklearn.metrics import mean_squared_error</b> Used to get MSE <br></br>
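<b>mean_squared_error(y_test, y_pred, squared=False)</b> Gives RMSE, the square root of MSE, because squared=False. Newer scikit-learn releases deprecate the squared argument; np.sqrt(mean_squared_error(y_test, y_pred)) gives the same value in any version <br></br>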
Cross-validation: <br></br>
<b>from sklearn.model_selection import cross_val_score, KFold</b> Used below <br></br>
<b>kf = KFold(n_splits=6, shuffle=True, random_state=42)</b> Shuffle data before we split the data into folds <br></br>
<b>cv_result = cross_val_score(reg, X, y, cv=kf)</b> Arguments: model, feature data, target data, and the KFold object that defines the folds. Returns one R-squared score per fold for a regressor <br></br>
Regularized regression: <br></br>
Large coefficients on line of best fit can lead to overfitting, so regularization penalizes large coefficients <br></br>
<b>from sklearn.linear_model import Ridge</b> Used below <br></br>
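Loop through a list of alpha values, fitting one model per value <br></br>
<b>ridge = Ridge(alpha=alpha)</b> alpha is a single number that sets the penalty strength; inside the loop it takes each value from the list in turn <br></br>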
<b>from sklearn.linear_model import Lasso</b> Another type of regularized regression. Can be used to assess feature importance <br></br>
<b>lasso = Lasso(alpha=0.1)</b> Used below <br></br>
<b>lasso_coef = lasso.fit(X, y).coef_</b> Can plot this to see what feature affects target variable the most <br></br>
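To see what "penalizes large coefficients" means in practice, here is a rough sketch that compares the coefficients of plain linear regression, ridge, and lasso. It assumes synthetic data from `make_regression`; the alpha values are arbitrary and chosen only to make the shrinkage visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 5 features, only 2 of which actually drive the target
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=10.0, random_state=42)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=100.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Ridge shrinks coefficients toward zero; lasso can set some exactly to zero
print("OLS:  ", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))

# Overall coefficient size, measured by the Euclidean norm
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("Coefficient norm, OLS vs ridge:", ols_norm, ridge_norm)
```

Coefficients that lasso drives to zero mark features it treats as unimportant, which is why it is handy for assessing feature importance.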



In [None]:
import numpy as np
import pandas as pd

sales_df = pd.read_csv("advertising_and_sales_clean.csv")

# Create X from the radio column's values
X = sales_df["radio"].values

# Create y from the sales column's values
y = sales_df["sales"].values

# Reshape X
X = X.reshape(-1, 1)

# Check the shape of the features and targets
print(X.shape, y.shape)

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X, y)

# Make predictions
predictions = reg.predict(X)

print(predictions[:5])

In [None]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt


# Create scatter plot
plt.scatter(X, y, color="blue")

# Create line plot
plt.plot(X, predictions, color="red")
plt.xlabel("Radio Expenditure ($)")
plt.ylabel("Sales ($)")

# Display the plot
plt.show()

In [None]:
# Create X, an array containing values of all features in sales_df, and y, containing all values from the "sales" column.
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

In [None]:
# Import mean_squared_error
from sklearn.metrics import mean_squared_error

# Compute R-squared
r_squared = reg.score(X_test, y_test)

# Compute RMSE as the square root of MSE (the squared=False argument is
# deprecated in newer scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the metrics
print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))

In [None]:
# Import the necessary modules
from sklearn.model_selection import cross_val_score, KFold

# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)

reg = LinearRegression()

# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)

# Print scores
print(cv_scores)

In [None]:
# Print the mean of the cross-validation scores computed above
print(np.mean(cv_scores))

# Print the standard deviation
print(np.std(cv_scores))

# Print the 2.5% and 97.5% quantiles (an approximate 95% interval of the scores)
print(np.quantile(cv_scores, [0.025, 0.975]))

In [None]:
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
  
  # Create a Ridge regression model
  ridge = Ridge(alpha=alpha)
  
  # Fit the data
  ridge.fit(X_train, y_train)
  
  # Obtain R-squared
  score = ridge.score(X_test, y_test)
  ridge_scores.append(score)
print(ridge_scores)

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regression model
lasso = Lasso(alpha=0.3)

# Fit the model to the data
lasso.fit(X, y)

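# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)

# Feature names in the same order as the columns of X
sales_columns = sales_df.drop("sales", axis=1).columns
plt.bar(sales_columns, lasso_coef)
plt.xticks(rotation=45)
plt.show()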

#### Fine-Tuning Your Model
<b>from sklearn.metrics import classification_report, confusion_matrix</b> Used below <br></br>
<b>confusion_matrix(y_test, y_pred)</b> Returns confusion matrix <br></br>
<b>classification_report(y_test, y_pred)</b> Returns metrics like precision, recall (sensitivity), and f1-score <br></br>
<b>Recall</b> The most useful metric when the model needs to reduce false negatives <br></br>
<b>from sklearn.linear_model import LogisticRegression</b> Logistic regression is used for classification. The model calculates the probability, p, that an observation belongs to the positive class of a binary target. The default probability threshold for logistic regression in scikit-learn is 0.5 <br></br>
<b>The ROC curve</b> Used to visualize how different thresholds affect the true positive and false positive rates. If the ROC curve is above the dotted diagonal line, the model performs better than randomly guessing the class of each observation. <br></br>
<b>from sklearn.metrics import roc_curve</b> For ROC curve <br></br>
<b>y_pred_probs = logreg.predict_proba(X_test)[:, 1]</b> Used below <br></br>
<b>fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)</b> unpack the results into three variables: false positive rate, FPR; true positive rate, TPR; and the thresholds. <br></br>
<b>from sklearn.metrics import roc_auc_score </b> To calculate area under curve <br></br>
<b>roc_auc_score(y_test, y_pred_probs)</b> Calculate area under curve <br></br>
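The role of the probability threshold can be checked directly. This is a small sketch, assuming synthetic data from `make_classification`; the 0.3 threshold is an arbitrary illustration, not a recommended value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

logreg = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class (column 1) for each test row
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

# Applying the default 0.5 threshold by hand reproduces .predict()
manual_labels = (y_pred_probs > 0.5).astype(int)
same_as_predict = np.array_equal(manual_labels, logreg.predict(X_test))
print("Matches predict():", same_as_predict)

# A lower threshold labels at least as many rows positive, trading
# fewer false negatives for more false positives
lenient_labels = (y_pred_probs > 0.3).astype(int)
print("Positives at 0.5 vs 0.3:", manual_labels.sum(), lenient_labels.sum())

# AUC summarizes performance across all thresholds in one number
auc = roc_auc_score(y_test, y_pred_probs)
print("ROC AUC: {:.3f}".format(auc))
```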
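Hyperparameters: parameters we specify before fitting the model, like alpha and n_neighbors <br></br>
For hyperparameter tuning it is essential to use cross-validation so that the chosen values are not overfit to the test set <br></br>
GridSearchCV: <br></br>
<b>from sklearn.model_selection import GridSearchCV</b> Used below <br></br>
<b>param_grid = {"alpha": np.linspace(0.0001, 1, 10), "solver": ["sag", "lsqr"]}</b> Dictionary of hyperparameter names and the values to try. Used below <br></br>
<b>ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)</b> Searches every combination in param_grid for the model named ridge; call ridge_cv.fit(X_train, y_train) to run the search <br></br>
<b>ridge_cv.best_params_, ridge_cv.best_score_</b> Best hyperparameters and their mean cross-validation score <br></br>
<b>from sklearn.model_selection import RandomizedSearchCV</b> RandomizedSearchCV tests a fixed number of hyperparameter settings sampled from the specified values or probability distributions. Used below <br></br>
<b>ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)</b> n_iter sets how many settings are sampled, so it scales better than GridSearchCV <br></br>
<b>ridge_cv.score(X_test, y_test)</b> Score the best model on the test set <br></br>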
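The difference between an exhaustive and a randomized search shows up in how many settings get evaluated. This is a minimal sketch, assuming synthetic regression data; `n_iter=4` and the seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split

X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"alpha": np.linspace(0.0001, 1, 10),
              "solver": ["sag", "lsqr"]}

# The full grid has 10 * 2 = 20 combinations; n_iter=4 samples only 4 of them
ridge_cv = RandomizedSearchCV(Ridge(), param_grid, cv=kf, n_iter=4,
                              random_state=42)
ridge_cv.fit(X_train, y_train)

# Number of hyperparameter settings that were actually evaluated
n_tried = len(ridge_cv.cv_results_["params"])
print("Settings tried:", n_tried)
print(ridge_cv.best_params_, ridge_cv.best_score_)
print("Test R-squared:", ridge_cv.score(X_test, y_test))
```

A GridSearchCV over the same `param_grid` would fit all 20 combinations on every fold, which is why the randomized version scales better as the grid grows.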


In [None]:
# Import confusion matrix
from sklearn.metrics import confusion_matrix, classification_report

knn = KNeighborsClassifier(n_neighbors=6)

# Fit the model to the training data
knn.fit(X_train, y_train)

# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)

# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate the model
logreg = LogisticRegression()

# Fit the model
logreg.fit(X_train, y_train)

# Predict probabilities
y_pred_probs = logreg.predict_proba(X_test)[:, 1]

print(y_pred_probs[:10])

In [None]:
# Import roc_curve
from sklearn.metrics import roc_curve

# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

plt.plot([0, 1], [0, 1], 'k--')

# Plot tpr against fpr
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve for Diabetes Prediction')
plt.show()


In [None]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the classification report
print(classification_report(y_test, y_pred))

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

In [None]:
# Create the parameter space
params = {"penalty": ["l1", "l2"],
         "tol": np.linspace(0.0001, 1.0, 50),
         "C": np.linspace(0.1, 1.0, 50),
         "class_weight": ["balanced", {0:0.8, 1:0.2}]}

# Import and instantiate the RandomizedSearchCV object
from sklearn.model_selection import RandomizedSearchCV
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))

#### Preprocessing and Pipelines
Dummy variables or OneHotEncoder: <br></br>
<b>df_dummies = pd.get_dummies(df["col1"], drop_first=True)</b> drop_first is needed for some ML algorithms because the full set of dummy columns is redundant. With 10 categories, the first is dropped and 9 columns remain; an observation with 0 in all 9 columns belongs to the dropped category <br></br>
<b>df_dummies = pd.concat([df, df_dummies], axis=1)</b> Join the dummy columns back onto the original DataFrame (then drop the original categorical column) <br></br>
<b>df = pd.get_dummies(df, drop_first=True)</b> If the DataFrame has only one categorical feature you can pass the whole DataFrame <br></br>
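A tiny worked example makes the drop_first behavior concrete. The DataFrame below is hypothetical toy data, not one of the course datasets.

```python
import pandas as pd

# Hypothetical toy data with one categorical column of three categories
df = pd.DataFrame({"genre": ["Rock", "Jazz", "Pop", "Jazz"],
                   "popularity": [41, 62, 55, 70]})

# drop_first=True drops the first category in sorted order ("Jazz" here)
genre_dummies = pd.get_dummies(df["genre"], drop_first=True)
print(genre_dummies.columns.tolist())

# Join the dummy columns to the original data and drop the text column
df_encoded = pd.concat([df, genre_dummies], axis=1).drop("genre", axis=1)
print(df_encoded)

# Rows where every dummy is 0 belong to the dropped category
is_jazz = genre_dummies.sum(axis=1) == 0
print(is_jazz.tolist())
```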
Handle missing data: <br></br>
<b>df.isna().sum().sort_values()</b> Count the missing values in each column, sorted ascending <br></br>
Common practice is to drop rows with missing values when they make up 5% or less of the data <br></br>
<b>df.dropna(subset=["col_3%", "col_4%", "col_1%"])</b> Drop the rows that have missing values in columns where 5% or less of the values are missing <br></br>
Impute missing data (make an educated guess at the missing values) <br></br>
<b>from sklearn.impute import SimpleImputer</b> Used below <br></br>
<b>imp_cat = SimpleImputer(strategy="most_frequent")</b> Handle missing data by most frequent <br></br>
<b>imp_cat.fit_transform(X_train)</b> Fit and transform the data <br></br>
<b>from sklearn.pipeline import Pipeline</b> Used below <br></br>
<b>steps = [("imputation", SimpleImputer()), ("logistic_regression", LogisticRegression())]</b> Used below <br></br>
<b>pipeline = Pipeline(steps)</b> Create pipeline <br></br> 
<b>pipeline.fit(), pipeline.score()</b> Get accuracy <br></br>
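The imputation and pipeline steps above can be sketched on synthetic data. This assumes `make_classification` output with roughly 10% of entries blanked out at random to simulate missing values; the proportions and seeds are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=5, random_state=3)

# Knock out roughly 10% of the entries to simulate missing data
rng = np.random.default_rng(3)
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3, stratify=y)

# SimpleImputer() fills missing entries with the column mean by default
steps = [("imputation", SimpleImputer()),
         ("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps)

# fit() learns the column means on the training rows only, then fits the model
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print("Accuracy: {:.3f}".format(accuracy))
```

Putting the imputer inside the pipeline keeps test-set statistics out of the training step, which is harder to guarantee when imputing by hand.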

There are several ways to scale our data: given any column, we can subtract the mean and divide by the standard deviation so that all features are centered around zero and have a variance of one. This is called standardization. We can also subtract the minimum and divide by the range of the data so the normalized dataset has minimum zero and maximum one. Or, we can center our data so that it ranges from -1 to 1 instead. <br></br>
<b>from sklearn.preprocessing import StandardScaler</b> Used below <br></br>
<b>scaler = StandardScaler()</b> Instantiate the scaler <br></br>
<b>X_train_scaled = scaler.fit_transform(X_train)</b> Learn the mean and standard deviation from the training set and scale it <br></br>
<b>X_test_scaled = scaler.transform(X_test)</b> Scale the test set with the training statistics <br></br>
<b>cv = GridSearchCV(pipeline, param_grid=parameters)</b> CV and scaling in a pipeline <br></br>
Model metrics: <br></br>
Regression model performance: RMSE, R-squared <br></br>
Classification: Accuracy, confusion matrix, precision, recall, F1-score, ROC AUC <br></br>
<b>from sklearn.tree import DecisionTreeClassifier</b> Used below <br></br>
Workflow:
As usual, we create our feature and target arrays, then split our data. We then scale our features using the scaler's .fit_transform method on the training set, and the .transform method on the test set. <br></br>
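The scaling workflow can be verified by hand on a tiny array. This is a sketch with hypothetical numbers, showing that StandardScaler matches "subtract the mean, divide by the standard deviation" and that the test row is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X_train = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 5000.0]])
X_test = np.array([[2.0, 2000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean and std here
X_test_scaled = scaler.transform(X_test)         # reuses the training statistics

# Standardization by hand: subtract the column mean, divide by the column std
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_scaled, manual))  # the two approaches agree
print(X_train_scaled.mean(axis=0))          # approximately 0 for each column
print(X_train_scaled.std(axis=0))           # approximately 1 for each column
print(X_test_scaled)
```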

Evaluating classification models <br></br>
We create a dictionary with our model names as strings for the keys, and instantiate models as the dictionary's values. We also create an empty list to store the results. Now we loop through the models in our models dictionary, using its .values() method. Inside the loop, we instantiate a KFold object. Next we perform cross-validation, using the model being iterated, along with our scaled training features and target training array. We set cv equal to our kf variable. By default, the scoring here will be accuracy. We then append the cross-validation results to our results list. Lastly, outside of the loop, we create a boxplot of our results, and set the labels argument equal to a call of models.keys() to retrieve each model's name. <br></br>


In [None]:
# Create music_dummies
music_dummies = pd.get_dummies(music_df, drop_first=True)

# Print the new DataFrame's shape
print("Shape of music_dummies: {}".format(music_dummies.shape))


The model will be evaluated by calculating the average RMSE. Because the cross-validation scores here are negative MSE values (scoring="neg_mean_squared_error"), you first need to convert the scores for each fold to positive values and take their square root. RMSE shows the typical size of the error in the model's predictions, so it can be compared against the standard deviation of the target values.
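The sign flip and square root described above might look like the following. This is a sketch under assumptions: synthetic data from `make_regression` and a Ridge model stand in for the actual dataset and pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=7)

kf = KFold(n_splits=6, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(Ridge(alpha=1.0), X, y, cv=kf,
                          scoring="neg_mean_squared_error")

# Flip the sign to get MSE per fold, then take the square root for RMSE
rmse_per_fold = np.sqrt(-neg_mse)

print("Average RMSE: {:.2f}".format(np.mean(rmse_per_fold)))
print("Standard deviation of the target: {:.2f}".format(np.std(y)))
```

An average RMSE well below the target's standard deviation suggests the model is doing better than simply predicting the mean.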

In [None]:
# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Drop rows with missing values in columns where less than 5% are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))

In [None]:
# Import modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors=3)

# Build steps for the pipeline
steps = [("imputer", imputer), 
         ("knn", knn)]

In [None]:
# Reuse the imputer and knn objects instantiated above
steps = [("imputer", imputer),
         ("knn", knn)]

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)

# Calculate and print R-squared
print(pipeline.score(X_test, y_test))

In [None]:
# Build the steps
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)

In [None]:
models = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1), "Lasso": Lasso(alpha=0.1)}
results = []

# Loop through the models' values
for model in models.values():
  kf = KFold(n_splits=6, random_state=42, shuffle=True)
  
  # Perform cross-validation
  cv_scores = cross_val_score(model, X_train, y_train, cv=kf)
  
  # Append the results
  results.append(cv_scores)

# Create a box plot of the results
plt.boxplot(results, labels=models.keys())
plt.show()

In [None]:
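# Import mean_squared_error
from sklearn.metrics import mean_squared_error

# Scale the features: fit on the training set, reuse its statistics on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)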

for name, model in models.items():
  
  # Fit the model to the training data
  model.fit(X_train_scaled, y_train)
  
  # Make predictions on the test set
  y_pred = model.predict(X_test_scaled)
  
  # Calculate the test_rmse as the square root of MSE
  test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print("{} Test Set RMSE: {}".format(name, test_rmse))

In [None]:
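# Import the decision tree classifier used in the dictionary below
from sklearn.tree import DecisionTreeClassifier

# Create models dictionary
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), "Decision Tree Classifier": DecisionTreeClassifier()}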
results = []

# Loop through the models' values
for model in models.values():
  
  # Instantiate a KFold object
  kf = KFold(n_splits=6, random_state=12, shuffle=True)
  
  # Perform cross-validation
  cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
  results.append(cv_results)
plt.boxplot(results, labels=models.keys())
plt.show()

In [None]:
# Create steps
steps = [("imp_mean", SimpleImputer()), 
         ("scaler", StandardScaler()), 
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)
params = {"logreg__solver": ["newton-cg", "saga", "lbfgs"],
         "logreg__C": np.linspace(0.001, 1.0, 10)}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params)
tuning.fit(X_train, y_train)
y_pred = tuning.predict(X_test)

# Compute and print performance
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))