## Basic Label Classification and Prediction Using scikit-learn

Scikit-learn is a simple and efficient library for predictive data analysis. The typical workflow has three steps: import an estimator, fit it to training data, and predict on new data. Let's take a look at how each of these works!

In [36]:
# if you haven't installed scikit-learn yet, run the following command in a notebook cell
# (the package name on PyPI is "scikit-learn"; "sklearn" is only the import name)
! pip install scikit-learn

In the following code blocks, we will use the Iris demo dataset that ships with sklearn. Python's "import" statement loads any module you have installed.

In [37]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
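Before going further, it can help to look at what was just loaded. The sketch below prints the shapes and label values; the "iris" variable and the extra "load_iris()" call without arguments are additions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# X holds one row per flower and one column per measurement
print(X.shape)       # (150, 4)
# y holds one integer class label per flower
print(np.unique(y))  # [0 1 2]

# Calling load_iris() without return_X_y returns an object that also carries metadata
iris = load_iris()
print(iris.feature_names)
print(iris.target_names)
```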

Let's split the dataset into a training set and a test set. Holding out a test set lets us see how our model performs on data it has not seen before we apply it to real cases. If you want the dataset to be split the same way every time, set "random_state" to an integer:

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
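A quick way to check the split is to print the resulting shapes and to repeat the call with the same "random_state". This is a minimal sketch; the names ending in "2" are introduced here only for the comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
# 25% of the 150 samples are held out (37.5 is rounded up to 38)
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)

# Repeating the call with the same random_state reproduces the split exactly
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(np.array_equal(X_test, X_test2))  # True
```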

### Logistic Regression
One common classifier is Logistic Regression. It tries to find linear boundaries between different labels. Scikit-learn provides it as the "LogisticRegression" class in "sklearn.linear_model". In the following code blocks, we will demonstrate each method and apply it to demo data.

In [39]:
# import logistic regression
from sklearn.linear_model import LogisticRegression

In [40]:
# initialize a logistic regression classifier by calling the class constructor
clf = LogisticRegression()

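If you want reproducible results, you can set "random_state" to an integer. In recent scikit-learn versions this only matters for the solvers that shuffle the data ("sag", "saga" and "liblinear"); the default "lbfgs" solver is deterministic, so with default settings the option is harmless but changes nothing: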

In [41]:
clf = LogisticRegression(random_state=0)
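To see what "random_state" does in practice, one can fit the classifier with and without it and compare the learned coefficients. The sketch below assumes the "lbfgs" solver (set explicitly) and raises "max_iter" to 1000 so both fits can converge; both choices are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same solver and iteration budget; only random_state differs
clf_a = LogisticRegression(solver="lbfgs", max_iter=1000)
clf_b = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=0)
clf_a.fit(X_train, y_train)
clf_b.fit(X_train, y_train)

# lbfgs does not use randomness, so the coefficients coincide
same_coefficients = np.allclose(clf_a.coef_, clf_b.coef_)
print(same_coefficients)  # True
```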

Now you have an initialized logistic regression classifier. You can fit it to the training data:

In [42]:
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
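The warning suggests two remedies: raising "max_iter" or scaling the data. Both can be sketched as follows. The value 1000 and the names "clf_more_iter" and "clf_scaled" are illustrative assumptions, and the scaling variant uses "StandardScaler" inside a pipeline as one possible way to preprocess.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Remedy 1: allow more iterations (1000 is an arbitrary illustrative value)
clf_more_iter = LogisticRegression(max_iter=1000, random_state=0)
clf_more_iter.fit(X_train, y_train)

# Remedy 2: standardize the features first, then fit with default settings
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
clf_scaled.fit(X_train, y_train)

acc_more_iter = clf_more_iter.score(X_test, y_test)
acc_scaled = clf_scaled.score(X_test, y_test)
print(acc_more_iter, acc_scaled)
```

Putting the scaler in a pipeline means it is fitted on the training data only and then reused unchanged on the test data.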


Despite the convergence warning, the fitted classifier is usable and ready to predict on test data. Call the "predict" method to do it:

In [43]:
prediction_lr = clf.predict(X_test)
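Besides hard labels, the classifier can also report a probability for each class through "predict_proba". The sketch below shows that the label returned by "predict" is the class with the highest probability. Setting "max_iter=1000" is an assumption made here to avoid the convergence warning, and "proba" and "most_likely" are names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000, random_state=0).fit(X_train, y_train)
prediction_lr = clf.predict(X_test)

# One row per test sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
print(proba[:3].round(3))

# The predicted label is the class with the highest probability
most_likely = clf.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(most_likely, prediction_lr))  # True
```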

We can measure how well our classifier does by comparing its predictions to the actual labels. One metric is accuracy: the fraction of data entries that are classified correctly. We can use the classifier's "score" method from sklearn:

In [44]:
clf.score(X_test, y_test)

0.9736842105263158

It turns out that our classifier is doing pretty well: approximately 97% of the test data entries (37 of 38) are predicted correctly!
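Accuracy is simple enough to compute by hand, which makes it easy to check what "score" returns. The sketch below also prints a confusion matrix to show which classes get mixed up. Using "accuracy_score" and "confusion_matrix" from "sklearn.metrics", and "max_iter=1000", are additions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000, random_state=0).fit(X_train, y_train)
prediction_lr = clf.predict(X_test)

# Accuracy is the fraction of predictions that equal the true label
manual_accuracy = np.mean(prediction_lr == y_test)
print(manual_accuracy, accuracy_score(y_test, prediction_lr), clf.score(X_test, y_test))

# Rows are true classes, columns are predicted classes;
# correct predictions sit on the diagonal
cm = confusion_matrix(y_test, prediction_lr)
print(cm)
```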

### Naive Bayes
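Another common classifier is Naive Bayes. Unlike Logistic Regression, which learns the boundaries between labels directly, Naive Bayes models how the features are distributed within each class, assumes the features are independent given the class, and uses Bayes' theorem to predict the label with the highest probability. The procedure for building a Naive Bayes classifier is very similar to Logistic Regression. Here, we will demonstrate one kind of Naive Bayes model, called "Gaussian Naive Bayes", which assumes that each feature is normally distributed within a class.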

In [47]:
# import GaussianNB
from sklearn.naive_bayes import GaussianNB

Initialize the classifier. Note that "GaussianNB" has no "random_state" option, because fitting it involves no randomness.

In [48]:
gnb = GaussianNB()

In [49]:
gnb.fit(X_train, y_train)

In [55]:
prediction_nb = gnb.predict(X_test)
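To make the idea behind Gaussian Naive Bayes concrete, the prediction can be reproduced by hand with NumPy: estimate a prior, a mean and a variance per class, then pick the class with the highest log posterior. This is a simplified sketch, not sklearn's implementation; in particular it omits the tiny variance smoothing term ("var_smoothing") that "GaussianNB" adds, and all variable names are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
gnb = GaussianNB().fit(X_train, y_train)

classes = np.unique(y_train)
# Per-class prior, feature means and feature variances from the training data
priors = np.array([np.mean(y_train == c) for c in classes])
means = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
variances = np.array([X_train[y_train == c].var(axis=0) for c in classes])

# Log posterior up to a constant: log prior + sum of per-feature Gaussian log densities
diff = X_test[:, None, :] - means[None, :, :]
log_density = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
log_posterior = np.log(priors)[None, :] + log_density.sum(axis=2)
manual_prediction = classes[np.argmax(log_posterior, axis=1)]

# Compare the hand-computed quantities with the fitted classifier
print(np.allclose(gnb.theta_, means))
print(np.allclose(gnb.class_prior_, priors))
agreement = np.mean(manual_prediction == gnb.predict(X_test))
print(agreement)
```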

For simplicity, you can also chain the "fit" and "predict" calls into one, because "fit" returns the fitted classifier itself:

In [54]:
prediction_nb = gnb.fit(X_train, y_train).predict(X_test)
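Chaining works because "fit" returns the classifier object it was called on. A small check, with "returned", "two_step" and "chained" as names introduced only for this illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

gnb = GaussianNB()
returned = gnb.fit(X_train, y_train)
# fit hands back the same object, now fitted
print(returned is gnb)  # True

# Two separate calls and one chained call give the same predictions
two_step = gnb.predict(X_test)
chained = GaussianNB().fit(X_train, y_train).predict(X_test)
print(np.array_equal(two_step, chained))  # True
```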

As before, we can use the "score" method to measure accuracy:

In [56]:
gnb.score(X_test, y_test)

1.0
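With only 38 test samples, a single misclassified entry changes accuracy by about 2.6 percentage points, so the gap between the two classifiers is small. The sketch below fits both on the same split and lists the test samples where their predictions differ. The "models" dictionary, the "max_iter=1000" setting and the other names are assumptions made for this comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=0),
    "GaussianNB": GaussianNB(),
}

# Fit each model on the same training data and score it on the same test data
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, model.score(X_test, y_test))

# Positions in the test set where the two models predict different labels
disagree = np.flatnonzero(
    predictions["LogisticRegression"] != predictions["GaussianNB"]
)
print("test samples where the models disagree:", disagree)
print("true labels at those positions:", y_test[disagree])
```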