### COMP3359 Artificial Intelligence Applications
Department of Computer Science, HKU
<br><br>

# <u>Checkpoint 1a – Scikit-Learn</u>  

## Estimated Time to Finish:   
1~2 hours   

## Main Learning Objectives:   
-	Practise using a common ML framework to build a simple application.

## Overview   
1.	[Introduction](#s1)  
2.	[Before You Start](#s2)
3.	[Preparation](#s3)
4.	[Task - Reuse Model and Make Predictions](#s4)
5.	[Submission](#s5)

-----


<a id="s1"></a>
# 1 Introduction

A good way to kick-start our study of building AI applications is to learn the basic usage of an ML framework. This checkpoint extends our other material, "ML Framework Learning Roadmap – Scikit-Learn". The main task in this checkpoint is to reuse the model trained in the example from that material and make predictions on the data we provide.  

-----   


<a id="s2"></a>
# 2 Before You Start

## Referenced Tutorial

This checkpoint is mainly built on the following tutorial. It is <b>strongly recommended</b> that you study it first to understand the context of this tutorial.

Referenced tutorial:
- [Introduction to Python Scikit-learn](https://intellipaat.com/blog/tutorial/python-tutorial/scikit-learn-tutorial/)

## Auxiliary Tools

In this checkpoint, you may need Python packages to help you tackle the problems. <b>If you have no experience</b> using the following packages, it is <b>recommended</b> that you go through the following short tutorials and complete the simple exercises in them.

- NumPy
    - [NumPy UltraQuick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/numpy_ultraquick_tutorial.ipynb)
- Pandas
    - [Pandas DataFrame UltraQuick Tutorial](https://colab.research.google.com/github/google/eng-edu/blob/master/ml/cc/exercises/pandas_dataframe_ultraquick_tutorial.ipynb)

Since in the context of AI we often handle a large number of numerical values arranged as <b>(multi-dimensional) arrays</b> (e.g. vectors, matrices, tensors), pay attention to how such (multi-dimensional) arrays are manipulated in these tutorials.
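
As a small illustration of such arrays, the sketch below builds a vector and a matrix with NumPy and inspects their shapes. The numbers are the first three rows of the iris data shown later in this notebook; the variable names are chosen only for this example.

```python
import numpy as np

# A 1-D array (vector): four measurements of one flower
vector = np.array([5.1, 3.5, 1.4, 0.2])

# A 2-D array (matrix): 3 samples x 4 features
matrix = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

print(vector.shape)  # (4,)
print(matrix.shape)  # (3, 4)

# Slicing: keep every row, but only the last two columns
last_two = matrix[:, 2:]
print(last_two.shape)  # (3, 2)

# Reduction along axis 0: one mean value per column (feature)
col_means = matrix.mean(axis=0)
print(col_means.shape)  # (4,)
```

The `(# samples, # features)` layout used here is the same one that scikit-learn estimators expect for their input features.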

-----

<a id="s3"></a>
# 3 Preparation

"ML Framework Learning Roadmap – Scikit-Learn" suggests a tutorial about training a classifier of iris species. To prepare for this checkpoint, we train the same model here:



In [None]:
# Referenced tutorial:
# https://intellipaat.com/blog/tutorial/python-tutorial/scikit-learn-tutorial/
import sklearn
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="ticks", color_codes=True)

### Preparing Dataset ###
print("===== Training Data =====")
# Load dataset
dfiris = sns.load_dataset("iris")
print("Example input data:")
print(dfiris.head(3))

# Get data labels
labels = np.asarray(dfiris.species)
print("Example labels: ")
print(labels[:3])

### Data Preprocessing ###
print("===== Data Processing =====")
# Convert class strings to integer labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)
print("Example label-encoded labels: ")
print(labels[:3])
# Drop unnecessary features
df_selected1 = dfiris.drop(['sepal_length', 'sepal_width', "species"], axis=1)
print("Example input data (with only selected features):")
print(df_selected1.head(3))
# Convert features to NumPy array
df_features = df_selected1.to_dict(orient='records')
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()
print("Example input data (vectorized):")
print(features[:3])

### Train/Test Split ###
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0)
print("===== Train/Test Split =====")
print("# input features: ", features_train.shape[1])
print("Shape of features_train (# samples, # features): ", features_train.shape)
print("Shape of labels_train (# samples,): ", labels_train.shape)
print("Shape of features_test (# samples, # features): ", features_test.shape)
print("Shape of labels_test (# samples,): ", labels_test.shape)

### Training Model ###
from sklearn.svm import SVC
svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)

### Model Evaluation ###
svm_predictions = svm_model_linear.predict(features_test)
accuracy = svm_model_linear.score(features_test, labels_test)
print("===== Model Evaluation =====")
print("Test accuracy:", accuracy)

===== Training Data =====
Example input data:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
Example labels: 
['setosa' 'setosa' 'setosa']
===== Data Processing =====
Example label-encoded labels: 
[0 0 0]
Example input data (with only selected features):
   petal_length  petal_width
0           1.4          0.2
1           1.4          0.2
2           1.3          0.2
Example input data (vectorized):
[[1.4 0.2]
 [1.4 0.2]
 [1.3 0.2]]
===== Train/Test Split =====
# input features:  2
Shape of features_train (# samples, # features):  (120, 2)
Shape of labels_train (# samples,):  (120,)
Shape of features_test (# samples, # features):  (30, 2)
Shape of labels_test (# samples,):  (30,)
===== Model Evaluation =====
Test accuracy: 1.0
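
A trained model lives only in the memory of the current session. One possible way to reuse it later without retraining is to save it to disk with `joblib`, which is installed alongside scikit-learn. This is not part of the referenced tutorial; the sketch below is a minimal illustration that trains on scikit-learn's bundled copy of the iris data (an assumption, so that it runs on its own), and the file name and sample values are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Stand-in training data: scikit-learn's bundled iris dataset,
# keeping only petal length and petal width (columns 2 and 3)
iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

model = SVC(kernel="linear", C=1).fit(X, y)

# Save the fitted model to disk (hypothetical file name in a temp directory)
model_path = os.path.join(tempfile.mkdtemp(), "svm_model_linear.joblib")
joblib.dump(model, model_path)

# Later, e.g. in a new session, load it back and predict without retraining
reloaded = joblib.load(model_path)
sample = np.array([[1.4, 0.2], [5.1, 1.8]])  # made-up petal measurements
same = np.array_equal(model.predict(sample), reloaded.predict(sample))
print("Reloaded model gives identical predictions:", same)
```

If the model is persisted this way, any fitted preprocessing objects it depends on (here, the label encoder and the vectorizer) would need to be saved and reloaded in the same manner.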


-----

<a id="s4"></a>
# 4 Task - Reuse Model and Make Predictions

In the previous section, we trained a model that classifies iris species. Next, we will reuse the trained model to make predictions on the data we provide for this checkpoint.

We have prepared a data file, `iris_test.csv`, with a few test samples to be classified. The following sample code assumes the data file is located next to this notebook, e.g.:
```
..
|-- Checkpoint1_Scikit-Learn.ipynb
|-- iris_test.csv
```
If you put the data file somewhere else, you will need to modify the path to the data file accordingly below.

Our task is to <u><b> make predictions on the provided data using the trained model</b></u>. More specifically, we want to print out predictions as in:
```
# … all preceding steps.
# Print out predicted class names
pred_class_names = le.inverse_transform(preds_labels)
print("Predictions: ", pred_class_names)
```

and your task here is to <u>complete the steps before printing out predictions</u>, which are briefly:
1.	Load data.
2.	Preprocess data.
3.	Make predictions using trained model.
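
The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the model is trained on scikit-learn's bundled iris data as a stand-in for the objects from Section 3, and an in-memory CSV with made-up rows stands in for `iris_test.csv`. Note that the test data goes through `transform` of the already fitted vectorizer, not `fit_transform`.

```python
import io

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# --- Stand-in for the trained objects from Section 3 ---
iris = load_iris()
train_df = pd.DataFrame(iris.data[:, 2:], columns=["petal_length", "petal_width"])
class_names = iris.target_names[iris.target]  # string label for each sample

le = LabelEncoder().fit(class_names)
vec = DictVectorizer()
X_train = vec.fit_transform(train_df.to_dict(orient="records")).toarray()
model = SVC(kernel="linear", C=1).fit(X_train, le.transform(class_names))

# 1. Load data (made-up rows in an in-memory CSV replace the real file)
csv_text = """sepal_length,sepal_width,petal_length,petal_width
5.0,3.4,1.5,0.2
6.0,2.9,4.5,1.5
6.7,3.1,5.6,2.4
"""
df_test = pd.read_csv(io.StringIO(csv_text))

# 2. Preprocess data with the same steps, reusing the fitted vectorizer
df_test_selected = df_test.drop(["sepal_length", "sepal_width"], axis=1)
test_features = vec.transform(df_test_selected.to_dict(orient="records")).toarray()

# 3. Make predictions and map integer labels back to class names
preds_labels = model.predict(test_features)
pred_class_names = le.inverse_transform(preds_labels)
print("Predictions: ", pred_class_names)
```

Reusing the fitted `vec` and `le` keeps the feature order and label mapping identical to what the model saw during training.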

There is more than one way to carry out these steps, and <u>you are welcome to prepare predictions in your own fashion</u>. In case you are uncertain about where to start, an example procedure is provided in the next code cell, and <u>you could complete the task by filling in the missing parts according to the instructions given</u>. 


In [1]:
# Mount Google Drive for loading data if running in Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
""" Reuse Trained Model and Make Prediction """
########################################################
# This is only a suggested method to make predictions  #
# on the provided data.                                #
#                                                      #
# You may modify or replace the following code,        #
# as long as you can provide predictions from          #
# the trained model.                                   #
########################################################

import sklearn
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(style="ticks", color_codes=True)

# Specify data file path
data_path = 'drive/My Drive/Module 1 - Introduction/Checkpoint/Scikit-Learn/iris_test.csv'


# Read data into Pandas DataFrame
# (you may try: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
#test_df = ...
df_test = pd.read_csv(data_path)


# Rebuild the trained model by repeating the training steps from Section 3
# (this block can be skipped if `le`, `vec` and `svm_model_linear` are still in memory)
dfiris = sns.load_dataset("iris")
labels = np.asarray(dfiris.species)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(labels)
labels = le.transform(labels)

df_selected1 = dfiris.drop(['sepal_length', 'sepal_width', "species"], axis=1)
df_features = df_selected1.to_dict(orient='records')

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()

from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=0)

from sklearn.svm import SVC
svm_model_linear = SVC(kernel='linear', C=1).fit(features_train, labels_train)


# Preprocess data using the same procedure as in the tutorial
#features = ...
df_test_selected1 = df_test.drop(['sepal_length', 'sepal_width'], axis=1)
df_test_features = df_test_selected1.to_dict(orient='records')
# Reuse the vectorizer fitted on the training data (transform, not fit_transform)
test_features = vec.transform(df_test_features).toarray()


# Make predictions using trained model
#preds_labels = ...
preds_labels = svm_model_linear.predict(test_features)


############################################################
# Printing out predictions here is the goal of this task.  #
############################################################
# Print out predicted class names
pred_class_names = le.inverse_transform(preds_labels)
print("Predictions: ", pred_class_names)

Predictions:  ['setosa' 'setosa' 'versicolor' 'versicolor' 'virginica' 'virginica']


-----

<a id="s5"></a>
# 5 Submission

To complete your work, please submit the completed version of this notebook to Moodle. Please make sure that it can be executed without errors and that the predictions from the trained model are presented in a clear, comprehensible fashion.

-----