# Tutorial 2: Baseline Calculation

Before we try any feature engineering methods, we need a basic statistical model to serve as a baseline for the more complicated techniques and pipelines. In this notebook, we fit logistic regression (LR) on the whole wavelength spectrum, without any feature selection or reduction; this model is our baseline.

LR is a simple linear statistical method for classification. In this notebook, we use LR to classify the apples as Sound (`S`) or Bruised (`B`). Follow this [link](https://developers.google.com/machine-learning/crash-course/logistic-regression/video-lecture) to learn more about LR.
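As a rough illustration of what "linear classification" means here, the sketch below computes a weighted sum of the features, passes it through the sigmoid function, and thresholds the resulting probability at 0.5. The weights, bias, and two-feature inputs are made up for illustration; they are not the coefficients learned in this notebook, and `predict_label` is a hypothetical helper, not part of scikit-learn.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Hypothetical helper: linear score -> probability -> class label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return (1 if p >= threshold else 0), p

# Made-up parameters for a two-feature example (assumption, not fitted values)
weights = [2.0, -1.0]
bias = 0.5

label_a, prob_a = predict_label([1.0, 0.5], weights, bias)   # score z = 2.0
label_b, prob_b = predict_label([-1.0, 1.0], weights, bias)  # score z = -2.5

print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

A positive score gives a probability above 0.5 and therefore class `1`; a negative score gives class `0`. The decision boundary is the set of points where the linear score is exactly zero, which is why LR is called a linear classifier.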

If you are unfamiliar with classification metrics, see this [link](https://developers.google.com/machine-learning/crash-course/classification/video-lecture).

---

In [13]:
# ___Cell no. 1___

# Python packages 
import pandas as pd # for importing data into data frame format
import seaborn as sns # For drawing useful graphs, such as bar graphs
import numpy as np
import matplotlib.pyplot as plt

import sys
sys.path.append("..")
from source.utils import split #  a pre-defined function to split the data into training and testing


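First, let us load the data by restoring the variables `X`, `Y`, and `df` that were saved in an earlier notebook with `%store`.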

In [14]:
# ___Cell no. 2___

%store -r X
%store -r Y
%store -r df
print(X.shape) # print the shape of the DataFrame X

(503, 2074)


Convert the labels from `S` and `B` to `1` and `0`, so that Sound is the positive class:

In [15]:
# ___Cell no. 3___

Y = Y.map({'S': 1, 'B': 0})
Y

0      0
1      0
2      1
3      1
4      0
      ..
498    1
499    1
500    0
501    1
502    1
Name: Condition, Length: 503, dtype: int64

---

### Data splitting

In [16]:
# ___Cell no. 4___

Xtrain, Xtest, Ytrain, Ytest  = split(X, Y)

In [17]:
# ___Cell no. 5___

print(Xtrain.shape)
print(Ytrain.shape)

(352, 2074)
(352,)
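The implementation of `split` is not shown in this notebook. The shapes above (352 of 503 rows in the training set) are consistent with a roughly 70/30 train/test split, so the sketch below shows one way such a helper could work on row indices. The function name `split_sketch`, the 0.3 test fraction, and the fixed seed are all assumptions made for illustration; the real helper may shuffle, stratify, or seed differently.

```python
import random

def split_sketch(n_samples, test_fraction=0.3, seed=0):
    """Hypothetical stand-in for source.utils.split (assumption, not the real helper).

    Shuffles the row indices and holds out `test_fraction` of them for testing.
    Returns (train_indices, test_indices).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = round(n_samples * test_fraction)  # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_sketch(503)
print(len(train_idx), len(test_idx))  # 352 151
```

With 503 samples, holding out 30% leaves 352 rows for training and 151 for testing, which matches the shapes printed above.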


---

### Baseline (LR) training

In [18]:
# ___Cell no. 6___

from sklearn.linear_model import LogisticRegression


In [19]:
# ___Cell no. 7___

LR = LogisticRegression(random_state=0, solver="newton-cg") #defining the model
LR.fit(Xtrain.values, Ytrain) # training the machine learning model

LogisticRegression(random_state=0, solver='newton-cg')

### Testing the machine learning model

In [20]:
# ___Cell no. 8___

from sklearn.metrics import accuracy_score, precision_score

In [21]:
# ___Cell no. 9___
y_pred = LR.predict(Xtest.values) # predict on the raw array, matching how the model was fitted
accuracy_score(Ytest.values, y_pred)



0.7947019867549668

The classification accuracy is about 79.5%, which is not bad for a baseline. However, we are more interested in how well the model limits false positives, that is, bruised apples that the model classified as sound. Since Sound (`S`) is encoded as the positive class `1`, precision (the fraction of apples predicted to be sound that really are sound) captures this, so we calculate the precision score next.

In [22]:
# ___Cell no. 10___

precision_score(Ytest.values, y_pred)

0.7647058823529411

The model's precision is about 76%.
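To make the two metrics concrete, the minimal sketch below counts true and false positives by hand on a small made-up label set. These eight labels are illustrative only; they are not the apple data and do not reproduce the scores above.

```python
# Toy labels (assumption, not the apple data): 1 = sound (positive), 0 = bruised
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_hat  = [1, 1, 0, 1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_hat))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sound, predicted sound
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # bruised, predicted sound
correct = sum(1 for t, p in pairs if t == p)        # all correct predictions

precision = tp / (tp + fp)     # of the apples predicted sound, how many are sound
accuracy = correct / len(pairs)

print(f"TP={tp}, FP={fp}")                                   # TP=4, FP=1
print(f"precision={precision:.2f}, accuracy={accuracy:.2f}")  # precision=0.80, accuracy=0.75
```

Every false positive lowers precision directly, whereas accuracy treats all errors alike. That is why precision is the more informative score when the costly mistake is passing a bruised apple as sound.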

We will use the above precision score to compare against other methods in the upcoming tutorials.

---

**Exercise 1:** Perform LR on the other two data sets.