# Introduction to Sprint Machine Learning Scratch

# 1. About the Sprint
## The purpose of this Sprint
<li> Prepare for machine learning scratch </li>

## How to learn
<li> First, use scikit-learn to implement the machine learning programs that later Sprints will reimplement from scratch.</li>

# 2. What is scratch?

By combining basic libraries such as NumPy, you can write your own classes and functions equivalent to those implemented in higher-level libraries such as scikit-learn. This is called scratch.


Scratching aims at a deep understanding of the algorithms, which is hard to gain just by running a library such as scikit-learn. It also improves your coding skills, but that is not the main purpose.


We are aiming for the following effects.


<li>Make it easier to understand theory and mathematical formulas when encountering new methods</li>
<li>Reduce ambiguity in using libraries</li>
<li>Make existing implementations easier to read</li>

This time, we first implement the machine learning programs with scikit-learn rather than from scratch. From the next Sprint on, we will gradually replace the scikit-learn implementations with scratch ones.
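
As a small illustration of what scratching means, the sketch below reimplements accuracy with plain NumPy and compares it with scikit-learn's `accuracy_score`. The function name `scratch_accuracy` and the sample labels are made up for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def scratch_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

scratch_value = scratch_accuracy(y_true, y_pred)
library_value = accuracy_score(y_true, y_pred)
print(scratch_value, library_value)
```

Both calls should report the same value, which is the kind of check that each scratch implementation can be given against its library counterpart.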

## Library

In [41]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score

We will implement the code using scikit-learn.


Use the function you create in Problem 1 to split off the validation data. The holdout method may be used instead of cross-validation.
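
The difference between the holdout method and cross-validation can be sketched as follows. This is only an illustration: the choice of the iris data, a decision tree, and 5 folds are assumptions made for the example, not part of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Holdout: one fixed split into training and validation data, one score.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_iris, y_iris, train_size=0.8, random_state=0, stratify=y_iris
)
holdout_score = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_val, y_val)

# Cross-validation: five different splits, five scores.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_iris, y_iris, cv=5)

print("holdout accuracy:", holdout_score)
print("5-fold accuracies:", cv_scores)
```

The holdout method is cheaper because the model is trained once, while cross-validation gives a score per fold and therefore a sense of how much the estimate varies.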

<b> Classification problem </b>

For classification, three methods will be scratched.


<li>Logistic regression</li>
<li>SVM</li>
<li>Decision tree</li>

scikit-learn provides two classes for logistic regression: LogisticRegression and SGDClassifier. Here, use SGDClassifier, which fits the model by stochastic gradient descent. Logistic regression is selected by passing loss="log" as an argument (recent scikit-learn releases name this option loss="log_loss").

## Problem 1: Scratch of train_test_split
First, try scratching train_test_split of scikit-learn. Please implement the function based on the following template.


sklearn.model_selection.train_test_split - scikit-learn stable version documentation


Be sure to check that the function you create behaves the same way as train_test_split.
```
def scratch_train_test_split(X, y, train_size=0.8):
    """Divide the validation data.
    Parameters
    ----------
    X : ndarray
      Training data (n_samples, n_features)
    y : ndarray
      Correct answer value (n_samples,)
    train_size : float
      Specify what percentage to use as a train (0 < train_size < 1)
    Returns
    -------
    X_train : ndarray
      Training data (n_samples, n_features)
    X_test : ndarray
      Validation data (n_samples, n_features)
    y_train : ndarray
      Correct answer value of training data (n_samples,)
    y_test : ndarray
      Correct value of verification data (n_samples,)
    """
    # Write code here
    pass
    return X_train, X_test, y_train, y_test
```

In [32]:
X = pd.DataFrame(load_iris().data)

In [33]:
X.columns =['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

In [34]:
X.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [35]:
X.shape

(150, 4)

In [36]:
Y = pd.DataFrame(load_iris().target)
Y.columns = ['Species']
Y

| | Species |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 145 | 2 |
| 146 | 2 |
| 147 | 2 |
| 148 | 2 |


In [37]:
def scratch_train_test_split(X, y, train_size=0.8):
    """Divide the validation data.
    Parameters
    ----------
    X : ndarray
      Training data (n_samples, n_features)
    y : ndarray
      Correct answer value (n_samples,)
    train_size : float
      Specify what percentage to use as a train (0 < train_size < 1)
    Returns
    -------
    X_train : ndarray
      Training data (n_samples, n_features)
    X_test : ndarray
      Validation data (n_samples, n_features)
    y_train : ndarray
      Correct answer value of training data (n_samples,)
    y_test : ndarray
      Correct value of verification data (n_samples,)
    """
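    # This version delegates to scikit-learn's train_test_split with a fixed
    # test_size of 0.33, so the train_size argument is not used.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)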
    return X_train, X_test, y_train, y_test
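
The cell above calls scikit-learn's train_test_split and ignores train_size. A from-scratch variant could look like the sketch below. It is one possible approach, not the reference solution: the name `numpy_train_test_split` and the `random_state` parameter are additions made for this example.

```python
import numpy as np


def numpy_train_test_split(X, y, train_size=0.8, random_state=None):
    """Shuffle the samples and split them into training and validation parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    if not 0 < train_size < 1:
        raise ValueError("train_size must satisfy 0 < train_size < 1")
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")

    n_samples = len(X)
    # Shuffle row indices so that X and y stay aligned after the split.
    rng = np.random.default_rng(random_state)
    indices = rng.permutation(n_samples)

    n_train = int(np.floor(n_samples * train_size))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Small demonstration: row i of X_demo is [2*i, 2*i + 1] and its label is i.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)
X_tr, X_te, y_tr, y_te = numpy_train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Splitting on a shuffled index array rather than on the data itself keeps each row of X paired with its label, which is the property worth checking against scikit-learn's version.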

Since we want binary classification, only the following two classes of the objective variable are used. All four features are used.


versicolor and virginica
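
The selection of these two species can be sketched as follows. The variable names `X_binary` and `y_binary` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
wanted = ["versicolor", "virginica"]

# Map the species names to their integer labels via target_names.
wanted_labels = [list(iris.target_names).index(name) for name in wanted]

# Keep only the rows whose label is one of the two wanted species.
mask = np.isin(iris.target, wanted_labels)
X_binary = iris.data[mask]    # all four features are kept
y_binary = iris.target[mask]

print(X_binary.shape, np.unique(y_binary))
```

Looking the labels up by name avoids hard-coding the integers 1 and 2 and makes the intent of the filter visible.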

The other two are artificial datasets with two features each. The following code creates the explanatory variable X and the objective variable y. Let's call them "Simple Data Set 1" and "Simple Data Set 2". Because there are only two features, visualization is easy.

<b> Simple data set 1 creation code </b>

```
import numpy as np
np.random.seed(seed=0)
n_samples = 500
f0 = [-1, 2]
f1 = [2, -1]
cov = [[1.0,0.8], [0.8, 1.0]]
f0 = np.random.multivariate_normal(f0, cov, n_samples // 2)
f1 = np.random.multivariate_normal(f1, cov, n_samples // 2)
X = np.concatenate([f0, f1])
y = np.concatenate([
    np.full(n_samples // 2, 1),
    np.full(n_samples // 2, -1)
])
```

<b> Simple data set 2 creation code </b>

X = np.array([
    [-0.44699 , -2.8073  ],[-1.4621  , -2.4586  ],
    [ 0.10645 ,  1.9242  ],[-3.5944  , -4.0112  ],
    [-0.9888  ,  4.5718  ],[-3.1625  , -3.9606  ],
    [ 0.56421 ,  0.72888 ],[-0.60216 ,  8.4636  ],
    [-0.61251 , -0.75345 ],[-0.73535 , -2.2718  ],
    [-0.80647 , -2.2135  ],[ 0.86291 ,  2.3946  ],
    [-3.1108  ,  0.15394 ],[-2.9362  ,  2.5462  ],
    [-0.57242 , -2.9915  ],[ 1.4771  ,  3.4896  ],
    [ 0.58619 ,  0.37158 ],[ 0.6017  ,  4.3439  ],
    [-2.1086  ,  8.3428  ],[-4.1013  , -4.353   ],
    [-1.9948  , -1.3927  ],[ 0.35084 , -0.031994],
    [ 0.96765 ,  7.8929  ],[-1.281   , 15.6824  ],
    [ 0.96765 , 10.083   ],[ 1.3763  ,  1.3347  ],
    [-2.234   , -2.5323  ],[-2.9452  , -1.8219  ],
    [ 0.14654 , -0.28733 ],[ 0.5461  ,  5.8245  ],
    [-0.65259 ,  9.3444  ],[ 0.59912 ,  5.3524  ],
    [ 0.50214 , -0.31818 ],[-3.0603  , -3.6461  ],
    [-6.6797  ,  0.67661 ],[-2.353   , -0.72261 ],
    [ 1.1319  ,  2.4023  ],[-0.12243 ,  9.0162  ],
    [-2.5677  , 13.1779  ],[ 0.057313,  5.4681  ],
])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


## Problem 2: Creating code to solve the classification problem


### Dataset 1

In [38]:
# ravel() turns the single-column label frame into the 1-D array scikit-learn expects
X_train, X_test, y_train, y_test = scratch_train_test_split(np.array(X), np.array(Y).ravel())

### Dataset 2 

In [17]:
import numpy as np
np.random.seed(seed=0)
n_samples = 500
f0 = [-1, 2]
f1 = [2, -1]
cov = [[1.0,0.8], [0.8, 1.0]]
f0 = np.random.multivariate_normal(f0, cov, n_samples // 2)
f1 = np.random.multivariate_normal(f1, cov, n_samples // 2)
X_2 = np.concatenate([f0, f1])
y_2 = np.concatenate([
    np.full(n_samples // 2, 1),
    np.full(n_samples // 2, -1)
])

X2_train, X2_test, y2_train, y2_test = scratch_train_test_split(X_2,y_2)


### Dataset 3

In [31]:
X3 = np.array([
    [-0.44699 , -2.8073  ],[-1.4621  , -2.4586  ],
    [ 0.10645 ,  1.9242  ],[-3.5944  , -4.0112  ],
    [-0.9888  ,  4.5718  ],[-3.1625  , -3.9606  ],
    [ 0.56421 ,  0.72888 ],[-0.60216 ,  8.4636  ],
    [-0.61251 , -0.75345 ],[-0.73535 , -2.2718  ],
    [-0.80647 , -2.2135  ],[ 0.86291 ,  2.3946  ],
    [-3.1108  ,  0.15394 ],[-2.9362  ,  2.5462  ],
    [-0.57242 , -2.9915  ],[ 1.4771  ,  3.4896  ],
    [ 0.58619 ,  0.37158 ],[ 0.6017  ,  4.3439  ],
    [-2.1086  ,  8.3428  ],[-4.1013  , -4.353   ],
    [-1.9948  , -1.3927  ],[ 0.35084 , -0.031994],
    [ 0.96765 ,  7.8929  ],[-1.281   , 15.6824  ],
    [ 0.96765 , 10.083   ],[ 1.3763  ,  1.3347  ],
    [-2.234   , -2.5323  ],[-2.9452  , -1.8219  ],
    [ 0.14654 , -0.28733 ],[ 0.5461  ,  5.8245  ],
    [-0.65259 ,  9.3444  ],[ 0.59912 ,  5.3524  ],
    [ 0.50214 , -0.31818 ],[-3.0603  , -3.6461  ],
    [-6.6797  ,  0.67661 ],[-2.353   , -0.72261 ],
    [ 1.1319  ,  2.4023  ],[-0.12243 ,  9.0162  ],
    [-2.5677  , 13.1779  ],[ 0.057313,  5.4681  ],
])
y3 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])


X3_train, X3_test, y3_train, y3_test = scratch_train_test_split(X3,y3)

### Method 1

In [46]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
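# Note: loss is left at its default ("hinge"), which fits a linear SVM;
# logistic regression needs loss="log" (loss="log_loss" in recent scikit-learn).
clf = make_pipeline(StandardScaler(),SGDClassifier(max_iter=1000,tol=1e-3))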

print("SGD Classifier")
# Predict for first dataset
clf.fit(X_train,y_train)
y_pred_data1 = clf.predict(X_test)

print("Dataset 1 accuracy:",accuracy_score(y_test, y_pred_data1))

#Predict for second dataset
clf.fit(X2_train,y2_train)
y_pred_data2 = clf.predict(X2_test)
print("Dataset 2 accuracy:",accuracy_score(y2_test, y_pred_data2))

#Predict for third dataset
clf.fit(X3_train,y3_train)
y_pred_data3 = clf.predict(X3_test)
print("Dataset 3 accuracy:",accuracy_score(y3_test, y_pred_data3))


SGD Classifier
Dataset 1 accuracy: 0.94
Dataset 2 accuracy: 1.0
Dataset 3 accuracy: 0.5714285714285714




### Method 2

In [47]:
from sklearn.svm import SVC
clf = make_pipeline(StandardScaler(),SVC(gamma='auto'))

print("SVM Classifier")
# Predict for first dataset
clf.fit(X_train,y_train)
y_pred_data1 = clf.predict(X_test)

print("Dataset 1 accuracy:",accuracy_score(y_test, y_pred_data1))

#Predict for second dataset
clf.fit(X2_train,y2_train)
y_pred_data2 = clf.predict(X2_test)
print("Dataset 2 accuracy:",accuracy_score(y2_test, y_pred_data2))

#Predict for third dataset
clf.fit(X3_train,y3_train)
y_pred_data3 = clf.predict(X3_test)
print("Dataset 3 accuracy:",accuracy_score(y3_test, y_pred_data3))

SVM Classifier
Dataset 1 accuracy: 0.98
Dataset 2 accuracy: 1.0
Dataset 3 accuracy: 0.5714285714285714




### Method 3 

In [48]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state = 0)

print("Decision Tree Classifier")
# Predict for first dataset
clf.fit(X_train,y_train)
y_pred_data1 = clf.predict(X_test)

print("Dataset 1 accuracy:",accuracy_score(y_test, y_pred_data1))

#Predict for second dataset
clf.fit(X2_train,y2_train)
y_pred_data2 = clf.predict(X2_test)
print("Dataset 2 accuracy:",accuracy_score(y2_test, y_pred_data2))

#Predict for third dataset
clf.fit(X3_train,y3_train)
y_pred_data3 = clf.predict(X3_test)
print("Dataset 3 accuracy:",accuracy_score(y3_test, y_pred_data3))

Decision Tree Classifier
Dataset 1 accuracy: 0.96
Dataset 2 accuracy: 1.0
Dataset 3 accuracy: 0.5714285714285714
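
The three method cells above repeat the same fit, predict, and score steps. As a hedged sketch, the same comparison can be written as one loop over a dictionary of models. For brevity it uses only Simple Data Set 1 and scikit-learn's own train_test_split; the names `models` and `accuracies` and the fixed `random_state` for SGDClassifier are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Simple Data Set 1, generated as in the creation code above.
np.random.seed(seed=0)
n_samples = 500
cov = [[1.0, 0.8], [0.8, 1.0]]
f0 = np.random.multivariate_normal([-1, 2], cov, n_samples // 2)
f1 = np.random.multivariate_normal([2, -1], cov, n_samples // 2)
X_demo = np.concatenate([f0, f1])
y_demo = np.concatenate([np.full(n_samples // 2, 1), np.full(n_samples // 2, -1)])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

models = {
    "SGD Classifier": make_pipeline(
        StandardScaler(), SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)
    ),
    "SVM Classifier": make_pipeline(StandardScaler(), SVC(gamma="auto")),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=0),
}

# Fit every model on the same split and collect its validation accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    accuracies[name] = accuracy_score(y_te, model.predict(X_te))
    print(name, "accuracy:", accuracies[name])
```

Keeping the models in a dictionary means a fourth method can be added with one line, and every model is guaranteed to see the same split.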


## Regression problem
For regression, one method will be scratched.


<li> Linear regression </li>

For linear regression, use the SGDRegressor class, which fits the model by stochastic gradient descent.


[sklearn.linear_model.SGDRegressor - scikit-learn stable version documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)


The data set is from the House Prices competition as in the pre-study period.


[House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)


Download train.csv and use SalePrice as the objective variable and GrLivArea and YearBuilt as the explanatory variables.

## Problem 3: Creating code to solve the regression problem
Create code to train and estimate the House Prices data set with linear regression.

In [109]:
from sklearn.metrics import mean_squared_error

In [54]:
data = pd.read_csv("train.csv")

In [55]:
my_X = data.loc[:,["GrLivArea","YearBuilt"]]
my_Y = data.loc[:,"SalePrice"]

In [56]:
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split( np.array(my_X), np.array(my_Y), test_size=0.25, random_state=42)

In [81]:
X_train_reg[:,:].shape

(1095, 2)

In [82]:
y_train_reg.shape

(1095,)

In [102]:
X_train_reg.shape

(1095, 2)

In [106]:
X_test_reg.shape

(365, 2)

In [108]:
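# NOTE: SGDClassifier is a classifier, so each distinct SalePrice is treated as a
# class label; the text above calls for SGDRegressor. The value printed below
# comes from this classifier run.
clf = make_pipeline(StandardScaler(),SGDClassifier(max_iter=1000,tol=1e-3))
# Learn (y is passed as a 1-D array, the shape scikit-learn expects)
clf.fit(X_train_reg, y_train_reg)
# Predict
y_predict = clf.predict(X_test_reg)
# Evaluation
mean_squared_error(y_test_reg,y_predict)

5732564762.679452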
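
The text above calls for SGDRegressor, whereas the last cell fits SGDClassifier, which treats every distinct price as a separate class. A minimal sketch of the regression variant follows. Because train.csv is not bundled here, it runs on synthetic stand-in data: the feature distributions, coefficients, and noise level are assumptions made for the example and say nothing about the real House Prices data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for train.csv (an assumption, not real data): two features
# playing the roles of GrLivArea and YearBuilt, and a noisy linear price.
rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500.0, 500.0, size=n)
year = rng.uniform(1900.0, 2010.0, size=n)
price = 100.0 * area + 1000.0 * (year - 1900.0) + rng.normal(0.0, 20000.0, size=n)

X_syn = np.column_stack([area, year])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, price, test_size=0.25, random_state=42
)

# Same pipeline shape as above, with the regressor in place of the classifier.
reg = make_pipeline(
    StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3, random_state=0)
)
reg.fit(X_tr, y_tr)  # y is 1-D, so no reshape is needed
y_hat = reg.predict(X_te)

print("MSE:", mean_squared_error(y_te, y_hat))
print("R^2:", r2_score(y_te, y_hat))
```

On the real data, the same pipeline can be fitted to `X_train_reg` and `y_train_reg`. Standardizing the features matters here because GrLivArea and YearBuilt are on very different scales, and gradient descent is sensitive to that.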