<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Threshold Adjustment

The data you will be working with in this exercise consists of measurements of the ionosphere using radar. The radar data was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere.

Received signals were processed using an autocorrelation function whose arguments are the time of a pulse and the pulse number. There were 17 pulse numbers for the Goose Bay system. Instances in this database are described by 2 attributes per pulse number, corresponding to the complex values returned by the function resulting from the complex electromagnetic signal.
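
To make the feature count concrete: 17 complex autocorrelation values, each split into a real and an imaginary part, give 17 × 2 = 34 real-valued attributes. The sketch below uses made-up values and assumes that the two parts of each pulse number are stored side by side; the actual column order in the file is not confirmed here.

```python
import numpy as np

# Hypothetical autocorrelation output: one complex value per pulse number.
rng = np.random.default_rng(0)
n_pulses = 17
acf = rng.uniform(-1, 1, n_pulses) + 1j * rng.uniform(-1, 1, n_pulses)

# Split each complex value into (real, imaginary) and flatten into one row.
# Assumption: the two parts of a pulse number sit next to each other.
features = np.column_stack([acf.real, acf.imag]).ravel()

print(features.shape)
```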

If you are curious, more details about this dataset <a href='https://archive.ics.uci.edu/ml/datasets/Ionosphere'>can be found on the UCI machine learning website</a>.

👇 Load the `ionosphere.data` dataset located within the data folder to see what you will be working with. Note that the dataset does **NOT** have headers for the columns, so you need to use <code>header=None</code> as an argument in your <code>read_csv</code> method. <br>
Call your new dataframe <code>data</code>.

In [None]:
from nbta.utils import download_data
download_data(id='1aT-kqzDPaNZCG82NMGmbjwUtUFbtGLZP')

In [None]:
import pandas as pd
import numpy as np

data = pd.read_csv('raw_data/ionosphere.data', header=None)
data

ℹ️ This dataset has 35 columns:
* The first 34 are continuous features
* The 35th is the label, either "good" (g) or "bad" (b) according to the definition summarized above.

Hence, this is a binary classification task. 

# Preparing the target
Your first task is to encode the target using <code>sklearn</code>'s <code>LabelEncoder</code>. Store the encoded labels in a new column of your dataset named <code>y</code> and remove the original label column (<code>34</code>).

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

data['y'] = label_encoder.fit_transform(data[34])
data.drop(34, inplace=True, axis=1)
data
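
It helps to know which integer each label received, because <code>precision_score</code> treats class 1 as the positive class by default. `LabelEncoder` stores its classes in sorted order, so "b" comes before "g". A minimal sketch on toy labels (the `toy_` names are illustrative and not part of the exercise):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for column 34 of the dataset.
toy_labels = pd.Series(['g', 'b', 'g', 'g', 'b'])

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, and the position of a class is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(toy_encoder.classes_)}

print(encoded)
print(mapping)
```

With this mapping, "good" readings are the positive class (1) in the rest of the notebook.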

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('target_encoding',
                         dataset = data
)

result.write()
print(result.check())

# Preprocessing

👇 This dataset has no missing values (you're welcome). First, create a target dataset (<code>y</code>) and a feature dataset (<code>X</code>). Then train/test split them using <code>random_state=42</code> and <code>test_size=0.3</code> so that your results are comparable (name them <code>X_train</code>, <code>X_test</code>, <code>y_train</code>, <code>y_test</code>).

In [None]:
from sklearn.model_selection import train_test_split

y = data.y.copy()
X = data.drop('y', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

👇 To avoid spending too much time on preprocessing, robust-scale the entire feature set. This practice is not optimal, but it can be used for preliminary preprocessing and/or to get models up and running quickly. Remember to fit your <code>RobustScaler()</code> on <code>X_train</code> only, and then use the <code>.transform</code> method on both <code>X_train</code> and <code>X_test</code> (replace the originals with the scaled versions).

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler().fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
X_test.min()
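
As a reminder of what the scaler does, here is a minimal sketch on toy data (the `toy_` names are illustrative). With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range, both computed on the data passed to `.fit`. That is why the test set is transformed with statistics learned from the train set only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy data: the "train" column contains an outlier (100).
toy_train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
toy_test = np.array([[2.5], [50.0]])

# The statistics (median, IQR) come from the train data only.
toy_scaler = RobustScaler().fit(toy_train)

# Reproduce the default transformation by hand.
median = np.median(toy_train, axis=0)
q1, q3 = np.percentile(toy_train, [25, 75], axis=0)
manual_test = (toy_test - median) / (q3 - q1)

print(toy_scaler.transform(toy_test))
print(manual_test)
```

Both printed arrays should match, and the outlier in the toy train set barely affects the median and the interquartile range.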

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('scaled_features',
                         scaled_features = X_train
)

result.write()
print(result.check())

# Base modelling

🎯 The task is to detect good radar readings with at least 90% precision.

First, let's create a dummy model to see what our precision would be if we made a naive guess and always predicted the same class. Check out the <code>scikit-learn</code> documentation for <code>DummyClassifier</code> and <code>precision_score</code>, train your classifier on your <code>X_train</code> and save the <code>precision_score</code> of your predictions under a variable named '<code>dummy_baseline</code>'.👇 

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score

dummy = DummyClassifier()

dummy.fit(X_train, y_train)

dummy_predictions = dummy.predict(X_test)

dummy_baseline = precision_score(y_test, dummy_predictions)
dummy_baseline

Well, that is interesting, isn't it? Are you surprised by this baseline score? Explain in a sentence or two below what you think this result means.

In [None]:
'The score is >50% which means that the dataset must be slightly unbalanced towards class 1 (good measurements)'

Can you test your theory by looking at the dataset? 👇 

In [None]:
## Your code here

In [None]:
y_test.value_counts()
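
A quick way to see why the baseline reflects the class balance: if a classifier always predicts class 1, every positive is a true positive and every negative is a false positive, so precision equals the share of positives in the evaluated labels. The sketch below uses hypothetical labels and assumes the dummy model predicts the most frequent class, which is the default in recent scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels: 6 positives (1) out of 10 samples.
toy_y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# A constant classifier that always predicts the positive class.
always_positive = np.ones_like(toy_y)

toy_precision = precision_score(toy_y, always_positive)
toy_prevalence = toy_y.mean()

print(toy_precision, toy_prevalence)
```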

👇 Now let's check if a default Logistic Regression model is going to satisfy our requirement of over 90% precision. Use cross validation on your <code>X_train</code> and save the score that supports your answer under variable name `base_score`.

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression

base_model = LogisticRegression(max_iter=2000)

metrics = ['precision', 'recall']

cv = cross_validate(base_model, X_train, y_train, scoring=metrics)

In [None]:
base_score = np.mean(cv['test_precision'])
base_score

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('base_precision',
                         score = base_score,
                         dummy = dummy_baseline
)

result.write()
print(result.check())

# Threshold adjustment

So our logistic regression does much better than a DummyClassifier. 🥳 But this is still not quite where we need it (>90%). Luckily, because we are dealing with binary classification, we can adjust our decision threshold to increase precision at the cost of recall.<br>
👇 Find the decision threshold that guarantees a 90% precision for a positive identification as belonging to the 'good measurement' class. Save the threshold under variable name `new_threshold`.

<details>
<summary>💡 Hint</summary>

- Make cross validated probability predictions with [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)
    
- Plug the probabilities into [`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) to generate precision scores at different thresholds

- Find out which threshold guarantees a precision of 0.9
      
</details>



In [None]:
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

proba_neg, proba_pos = cross_val_predict(
    LogisticRegression(),
    X_train,
    y_train,
    method='predict_proba',
    cv=5,
).T

In [None]:
precision, recall, threshold = precision_recall_curve(y_train, proba_pos, pos_label=1)

In [None]:
scores = pd.DataFrame({'Threshold': threshold,
                       'Precision': precision[:-1],
                       'Recall': recall[:-1]})
scores
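
The `[:-1]` slices are needed because `precision_recall_curve` returns one more precision and recall value than thresholds: the final point (precision 1, recall 0) has no matching threshold. A minimal sketch on hypothetical labels and probabilities (the `pr_` names are illustrative):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and positive-class probabilities.
pr_y = np.array([0, 0, 1, 1, 0, 1])
pr_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

pr_precision, pr_recall, pr_thresholds = precision_recall_curve(pr_y, pr_proba)

# precision and recall carry one extra trailing point without a threshold.
print(len(pr_precision), len(pr_recall), len(pr_thresholds))
print(pr_precision[-1], pr_recall[-1])
```

Dropping the last element of both arrays lines each remaining precision and recall value up with the threshold that produced it.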

In [None]:
scores[scores.Precision > 0.9]

In [None]:
idx = scores[scores.Precision > 0.9].index[0]
new_threshold = scores.loc[idx,'Threshold']
new_threshold

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('decision_threshold',
                         threshold = new_threshold
)

result.write()
print(result.check())

# Using the new threshold

🎯 Now let's properly train our <code>LogisticRegression()</code> model on the train set and evaluate it on the test set with <code>precision_score</code>, using your new threshold. Remember that you will need to use the <code>.predict_proba</code> method on your logistic classifier and apply the threshold manually. Save the precision on the test set as a variable named <code>test_precision_score</code>.

In [None]:
model = LogisticRegression().fit(X_train, y_train)

# Probability of the positive class (column 1) for each test sample
proba_pos_test = model.predict_proba(X_test)[:, 1]

# Apply the custom threshold manually instead of the default 0.5
y_predict = (proba_pos_test >= new_threshold).astype(int)

test_precision_score = precision_score(y_test, y_predict)
test_precision_score

🤔 So this is not quite 90%, is it? This is because we tuned the threshold on a small training set. As you can see, on unseen data the result does not match the target exactly, but it is close and better than before!

❓ Now let's open a new, unseen sample without a label: open the <code>ionosphere_sample.csv</code> file located in your <code>data</code> folder (remember: <code>header=None</code>). Do you think that this is a good or a bad reading? Save your answer as a string under variable name `recommendation` as "good" or "bad". <br>
🚨 Remember to scale this data with your scaler before you predict (you will need to transpose the data, as the single sample is stored as a column vector)!

In [None]:
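sample = pd.read_csv('raw_data/ionosphere_sample.csv', header=None)

# The file stores the sample as one column, so transpose it into a single row
sample_scaled = scaler.transform(sample.T)

# Second value is P(class 1, "good"); compare it with new_threshold
model.predict_proba(sample_scaled)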
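
Note that the precision target only carries over to a new sample if the tuned threshold is applied to it as well, rather than the default 0.5 used by `.predict`. The sketch below shows one way to wrap this up. It runs on synthetic data, and `predict_with_threshold` is a hypothetical helper, not part of scikit-learn or of this exercise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; in the notebook you would use the fitted `model`.
toy_X, toy_target = make_classification(n_samples=200, n_features=5, random_state=0)
toy_model = LogisticRegression().fit(toy_X, toy_target)

def predict_with_threshold(fitted_model, X, threshold):
    """Return 1 where P(class 1) >= threshold, else 0 (hypothetical helper)."""
    positive_proba = fitted_model.predict_proba(X)[:, 1]
    return (positive_proba >= threshold).astype(int)

toy_sample = toy_X[:1]  # a single row, shape (1, n_features)
strict = predict_with_threshold(toy_model, toy_sample, threshold=0.9)
default = predict_with_threshold(toy_model, toy_sample, threshold=0.5)

print(strict, default)
```

A stricter threshold can only turn predicted positives into negatives, never the other way around, which is what trades recall for precision.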

In [None]:
recommendation = 'good'

### ☑️ Check your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('recommendation',
                         recommendation = recommendation
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.