# Scikit-learn 101


* Be a user of AI first, then you can be an AI engineer or researcher.
* First do it, then do it right, then do it better.

**We're gonna "do ML" today.<br> After this lab, you will be an ML user.**
* Step 1: Identify the problem type: classification or regression.
* Step 2: Google it!
    * [sklearn] [name of the algorithm] [regression or classification]
    * Example: sklearn randomforest regression
* Step 3: Do ML!
    1. Import
    2. Declare
    3. Fit
    4. Predict
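For step 1, a quick look at the target values often settles the question. The sketch below uses scikit-learn's `type_of_target` helper on two small hand-written lists (the values are illustrative, not a full dataset). Note that the helper only inspects the values, so a real-valued target that happens to contain only whole numbers would be reported as discrete.

```python
from sklearn.utils.multiclass import type_of_target

# Illustrative targets: real-valued prices vs. two discrete labels.
house_prices = [24.0, 21.6, 34.7, 33.4, 36.2]
tumor_labels = [0, 1, 1, 0, 1]

price_kind = type_of_target(house_prices)  # continuous values suggest regression
label_kind = type_of_target(tumor_labels)  # a few discrete labels suggest classification

print(price_kind)
print(label_kind)
```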


In [1]:
###################
## Run this cell ##
###################
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version.
from sklearn.datasets import load_boston

boston = load_boston()

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

print("Classification or Regression?")
print("-----------------------------")
print(boston.DESCR)

Classification or Regression?
-----------------------------
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      fu
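If `load_boston` is not available in your installation, one possible stand-in for practicing the same regression workflow is the bundled `load_diabetes` dataset. This is a substitute chosen for illustration, not the dataset used in this lab; the names `diabetes` and `df_alt` and the `target` column are assumptions of this sketch.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# A different, bundled regression dataset (no download needed).
diabetes = load_diabetes()

# Same DataFrame layout as in the lab: feature columns plus one target column.
df_alt = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df_alt["target"] = diabetes.target

print(df_alt.shape)
print(df_alt["target"].head())
```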

In [2]:
boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [3]:
df['MEDV']

0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: MEDV, Length: 506, dtype: float64

In [5]:
"""
The training set is used to train the model.
The test set is used for the final evaluation (to estimate the generalization error).
"""
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.drop(['MEDV'], axis=1), df['MEDV'],
                                                    test_size=0.2, random_state=2021)

print(y_train.head())
x_train.head()

28     18.4
498    21.2
284    32.2
414     7.0
123    17.3
Name: MEDV, dtype: float64


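|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | 0.77299 | 0.0 | 8.14 | 0.0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4.0 | 307.0 | 21.0 | 387.94 | 12.8 |
| 498 | 0.23912 | 0.0 | 9.69 | 0.0 | 0.585 | 6.019 | 65.3 | 2.4091 | 6.0 | 391.0 | 19.2 | 396.9 | 12.92 |
| 284 | 0.00906 | 90.0 | 2.97 | 0.0 | 0.4 | 7.088 | 20.8 | 7.3073 | 1.0 | 285.0 | 15.3 | 394.72 | 7.85 |
| 414 | 45.7461 | 0.0 | 18.1 | 0.0 | 0.693 | 4.519 | 100.0 | 1.6582 | 24.0 | 666.0 | 20.2 | 88.27 | 36.98 |
| 123 | 0.15038 | 0.0 | 25.65 | 0.0 | 0.581 | 5.856 | 97.0 | 1.9444 | 2.0 | 188.0 | 19.1 | 370.31 | 25.41 |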
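To see what the split above does, here is a small sketch on placeholder arrays with the same shape as the Boston features (506 rows, 13 columns). The array contents are made up. It shows that `test_size=0.2` holds out about a fifth of the rows and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the lab's feature table.
X_fake = np.arange(506 * 13).reshape(506, 13)
y_fake = np.arange(506)

x_tr, x_te, y_tr, y_te = train_test_split(X_fake, y_fake, test_size=0.2, random_state=2021)
# Same arguments, same seed: the split comes out identical.
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(X_fake, y_fake, test_size=0.2, random_state=2021)

same_split = np.array_equal(x_te, x_te2)
print(x_tr.shape, x_te.shape)
print("reproducible:", same_split)
```

No row appears in both sets, which is what lets the test set estimate the generalization error.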


# Example exercise

In [7]:
# 1. Import the model you want.
from sklearn.linear_model import LinearRegression  # after writing this line, press Ctrl+Enter to run the cell


# 2. Declare your model.
lr = LinearRegression()

# 3. Fit your model.
lr.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = lr.predict(x_test)


In [9]:
lr.score(x_train,y_train)

0.7561240805539942

In [8]:
lr.score(x_test,y_test)

0.6352336167833779
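The two numbers above are R² values: for scikit-learn regressors, `score` returns the coefficient of determination. A minimal sketch of the formula, in plain Python on a few illustrative values (the helper name `r_squared` is made up for this example):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # error of the model
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

y_true_demo = [24.0, 21.6, 34.7, 33.4, 36.2]
mean_demo = sum(y_true_demo) / len(y_true_demo)

r2_perfect = r_squared(y_true_demo, y_true_demo)                     # perfect predictions
r2_baseline = r_squared(y_true_demo, [mean_demo] * len(y_true_demo))  # constant mean prediction

print(r2_perfect, r2_baseline)
```

A score of 1 means perfect predictions and 0 means no better than predicting the mean, so a test score lower than the training score is the usual sign that the model fits the training data better than unseen data.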

# Exercise 1: Train a KNN model for this problem and make predictions on the test set.

In [12]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.neighbors import KNeighborsRegressor

# 2. Declare your model.
knr = KNeighborsRegressor()

# 3. Fit your model.
knr.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = knr.predict(x_test)



# Exercise 2: Train a decision tree for this problem and make predictions on the test set.

In [13]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.tree import DecisionTreeRegressor

# 2. Declare your model.
dtr = DecisionTreeRegressor()

# 3. Fit your model.
dtr.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = dtr.predict(x_test)


# Exercise 3: Train a support vector machine for this problem and make predictions on the test set.

* **There could be various answers**
* 'SVR' is recommended
* What is the name of the support vector machine class for classification in sklearn?

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.svm import SVR

# 2. Declare your model.
svr = SVR()

# 3. Fit your model.
svr.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = svr.predict(x_test)
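Support vector machines are sensitive to the scale of the features, so one of the "various answers" is to standardize the inputs first. The sketch below is one option, not the expected solution: it chains `StandardScaler` and `SVR` with `make_pipeline` on synthetic data whose columns are deliberately put on very different scales (the data and all `*_demo` names are made up for the example).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data; each column is rescaled to a different magnitude.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 0.1])

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2021)

# The pipeline learns the scaling on the training set only, then applies it at predict time.
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_tr_demo, y_tr_demo)
svr_pred_demo = scaled_svr.predict(X_te_demo)

print(svr_pred_demo[:3])
```

The pipeline still follows the same import, declare, fit, predict steps; it just bundles the preprocessing with the model.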


# Exercise 4: Train a gradient boosting model for this problem and make predictions on the test set.

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.ensemble import GradientBoostingRegressor

# 2. Declare your model.
gbr = GradientBoostingRegressor()

# 3. Fit your model.
gbr.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = gbr.predict(x_test)
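Because every regressor shares the same `fit` / `predict` / `score` interface, the models from the exercises can be compared in one loop. The sketch below runs on the bundled diabetes dataset so that it works on its own; that dataset and the `*_r` variable names are assumptions of this example, not part of the lab.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X_r, y_r = load_diabetes(return_X_y=True)
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(X_r, y_r, test_size=0.2, random_state=2021)

regressors = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "SVR": SVR(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in regressors.items():
    model.fit(X_tr_r, y_tr_r)
    # score() returns R^2; compare train vs. test to spot overfitting.
    results[name] = (model.score(X_tr_r, y_tr_r), model.score(X_te_r, y_te_r))
    print(f"{name}: train R^2 = {results[name][0]:.3f}, test R^2 = {results[name][1]:.3f}")
```

A large gap between the training and test scores suggests the model has memorized the training set rather than learned something that generalizes.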




# This time, we'll use another dataset

In [None]:
###################
## Run this cell ##
###################
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['class'] = cancer.target

print("Classification or Regression?")
print("-----------------------------")
print(cancer.target_names)
print("-----------------------------")
print(cancer.DESCR)

Classification or Regression?
-----------------------------
['malignant' 'benign']
-----------------------------
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these 

In [None]:
df['class'].unique()

array([0, 1])
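The integers 0 and 1 index into `target_names`, so 0 means malignant and 1 means benign. A small sketch that maps the codes back to names and counts each class (`cancer_demo` and the other names are local to this example):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_demo = load_breast_cancer()

# Index target_names with the integer labels to recover readable class names.
label_names = cancer_demo.target_names[cancer_demo.target]
class_counts = pd.Series(label_names).value_counts().to_dict()

print(class_counts)
```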

In [None]:
"""
The training set is used to train the model.
The test set is used for the final evaluation (to estimate the generalization error).
"""
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.drop(['class'], axis=1), df['class'],
                                                    test_size=0.2, random_state=2021)

print(y_train.head())
x_train.head()

269    1
51     1
187    1
28     0
199    0
Name: class, dtype: int64


|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 269 | 10.71 | 20.39 | 69.5 | 344.9 | 0.1082 | 0.1289 | 0.08448 | 0.02867 | 0.1668 | 0.06862 | 0.3198 | 1.489 | 2.23 | 20.74 | 0.008902 | 0.04785 | 0.07339 | 0.01745 | 0.02728 | 0.00761 | 11.69 | 25.21 | 76.51 | 410.4 | 0.1335 | 0.255 | 0.2534 | 0.086 | 0.2605 | 0.08701 |
| 51 | 13.64 | 16.34 | 87.21 | 571.8 | 0.07685 | 0.06059 | 0.01857 | 0.01723 | 0.1353 | 0.05953 | 0.1872 | 0.9234 | 1.449 | 14.55 | 0.004477 | 0.01177 | 0.01079 | 0.007956 | 0.01325 | 0.002551 | 14.67 | 23.19 | 96.08 | 656.7 | 0.1089 | 0.1582 | 0.105 | 0.08586 | 0.2346 | 0.08025 |
| 187 | 11.71 | 17.19 | 74.68 | 420.3 | 0.09774 | 0.06141 | 0.03809 | 0.03239 | 0.1516 | 0.06095 | 0.2451 | 0.7655 | 1.742 | 17.86 | 0.006905 | 0.008704 | 0.01978 | 0.01185 | 0.01897 | 0.001671 | 13.01 | 21.39 | 84.42 | 521.5 | 0.1323 | 0.104 | 0.1521 | 0.1099 | 0.2572 | 0.07097 |
| 28 | 15.3 | 25.27 | 102.4 | 732.4 | 0.1082 | 0.1697 | 0.1683 | 0.08751 | 0.1926 | 0.0654 | 0.439 | 1.012 | 3.498 | 43.5 | 0.005233 | 0.03057 | 0.03576 | 0.01083 | 0.01768 | 0.002967 | 20.27 | 36.71 | 149.3 | 1269.0 | 0.1641 | 0.611 | 0.6335 | 0.2024 | 0.4027 | 0.09876 |
| 199 | 14.45 | 20.22 | 94.49 | 642.7 | 0.09872 | 0.1206 | 0.118 | 0.0598 | 0.195 | 0.06466 | 0.2092 | 0.6509 | 1.446 | 19.42 | 0.004044 | 0.01597 | 0.02 | 0.007303 | 0.01522 | 0.001976 | 18.33 | 30.12 | 117.9 | 1044.0 | 0.1552 | 0.4056 | 0.4967 | 0.1838 | 0.4753 | 0.1013 |
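For classification, it can help to keep the class proportions the same in both sets. The split above does not do this; the sketch below shows the optional `stratify` argument of `train_test_split` as one way to get it (the `*_s` names are local to this example).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_s, y_s = load_breast_cancer(return_X_y=True)

# stratify=y_s asks for the same class ratio in the train and test parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_s, y_s, test_size=0.2, random_state=2021, stratify=y_s)

full_ratio = y_s.mean()      # share of class 1 (benign) in the whole dataset
train_ratio = y_tr_s.mean()
test_ratio = y_te_s.mean()

print(f"all: {full_ratio:.3f}  train: {train_ratio:.3f}  test: {test_ratio:.3f}")
```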


# Exercise 5: Train logistic regression for this problem and make predictions on the test set.

**Note**
* Strictly speaking, logistic regression estimates the probability of each class.
* This time, we simply use logistic regression as a classifier.
    - For a binary problem, sklearn's `predict` does the following automatically:
        - If the predicted probability of class 1 is >= 0.5: 1 (class 1)
        - else: 0 (class 0)
* You can ignore the convergence warning for this exercise.
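The thresholding described in the note can be checked by hand: take the class-1 probability from `predict_proba`, cut it at 0.5, and compare with `predict`. In this sketch `max_iter` is raised only to give the solver more iterations; the `*_p` names are local to the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_p, y_p = load_breast_cancer(return_X_y=True)
X_tr_p, X_te_p, y_tr_p, y_te_p = train_test_split(X_p, y_p, test_size=0.2, random_state=2021)

clf_p = LogisticRegression(max_iter=10000)
clf_p.fit(X_tr_p, y_tr_p)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_class1 = clf_p.predict_proba(X_te_p)[:, 1]
manual_labels = (proba_class1 >= 0.5).astype(int)
auto_labels = clf_p.predict(X_te_p)

labels_agree = np.array_equal(manual_labels, auto_labels)
print("manual threshold matches predict():", labels_agree)
```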

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.linear_model import LogisticRegression

# 2. Declare your model.
model = LogisticRegression()

# 3. Fit your model.
model.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = model.predict(x_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
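The warning above suggests two remedies: raise `max_iter` or scale the data. The sketch below tries the scaling route with a `StandardScaler` pipeline and reports how many iterations the solver used. It is one possible fix, not a required part of the exercise, and the `*_w` names are local to this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_breast_cancer(return_X_y=True)
X_tr_w, X_te_w, y_tr_w, y_te_w = train_test_split(X_w, y_w, test_size=0.2, random_state=2021)

# Standardize the features, then fit logistic regression with a generous iteration budget.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr_w, y_tr_w)

# n_iter_ records how many iterations the solver actually ran.
iterations_used = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
test_accuracy = scaled_lr.score(X_te_w, y_te_w)

print("iterations used:", iterations_used)
print("test accuracy:", round(test_accuracy, 3))
```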


# Exercise 6: Train AdaBoost for this problem and make predictions on the test set.

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.ensemble import AdaBoostClassifier

# 2. Declare your model.
model = AdaBoostClassifier()

# 3. Fit your model.
model.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = model.predict(x_test)


# Exercise 7: Train a random forest for this problem and make predictions on the test set.

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want (this is a classification problem, so use the classifier).
from sklearn.ensemble import RandomForestClassifier

# 2. Declare your model.
model = RandomForestClassifier()

# 3. Fit your model.
model.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = model.predict(x_test)


# Exercise 8: Train an artificial neural network for this problem and make predictions on the test set.

**Note**
* This is just for practice. For actual deep learning, we will use Keras (the high-level API of TensorFlow 2.x).

In [None]:
####################
## Your code here ##
####################

# 1. Import the model you want.
from sklearn.neural_network import MLPClassifier

# 2. Declare your model.
model = MLPClassifier()

# 3. Fit your model.
model.fit(x_train, y_train)

# 4. Predict on the test set using your fitted model.
y_pred = model.predict(x_test)
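To wrap up the classification exercises, the four classifiers can be compared on the held-out test set with `accuracy_score`. This is a sketch: it reloads the data under its own `*_c` names so that it runs independently of the cells above, and `random_state` and `max_iter` are set only for repeatability and to give the solver more iterations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_c, y_c = load_breast_cancer(return_X_y=True)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X_c, y_c, test_size=0.2, random_state=2021)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "MLPClassifier": MLPClassifier(random_state=0),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_tr_c, y_tr_c)
    # Accuracy = fraction of test samples whose predicted class matches the true class.
    accuracies[name] = accuracy_score(y_te_c, clf.predict(X_te_c))
    print(f"{name}: test accuracy = {accuracies[name]:.3f}")
```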
