# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Breast-Cancer-Wisconsin-(Diagnostic)-Data-Set" data-toc-modified-id="Breast-Cancer-Wisconsin-(Diagnostic)-Data-Set-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Breast Cancer Wisconsin (Diagnostic) Data Set</a></div><div class="lev2 toc-item"><a href="#Attribute-Information:" data-toc-modified-id="Attribute-Information:-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Attribute Information:</a></div><div class="lev2 toc-item"><a href="#分類器" data-toc-modified-id="分類器-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Classifier</a></div><div class="lev2 toc-item"><a href="#仮説クラス" data-toc-modified-id="仮説クラス-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Hypothesis class</a></div><div class="lev1 toc-item"><a href="#最急降下法" data-toc-modified-id="最急降下法-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Steepest descent</a></div><div class="lev1 toc-item"><a href="#code(ruby)" data-toc-modified-id="code(ruby)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>code(ruby)</a></div><div class="lev3 toc-item"><a href="#直交ベクトル" data-toc-modified-id="直交ベクトル-301"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>Orthogonal vector</a></div>

# Breast Cancer Wisconsin (Diagnostic) Data Set

<https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)>

## Attribute Information:

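1. ID number
1. Diagnosis (M = malignant, B = benign)
1. Columns 3-32: thirty real-valued features

Ten real-valued features are computed for each cell nucleus. The data set records the mean, the standard error, and the "worst" value (mean of the three largest values) of each, which gives the 30 feature columns: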

* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension ("coastline approximation" - 1)
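The column order matters when weights are later read back by feature name. In the UCI file, field 3 is the mean radius, field 13 the radius standard error, and field 23 the worst radius, so the ten means come first, then the ten standard errors, then the ten worst values. The sketch below maps a 0-based feature column to a label under that layout. The helper name `feature_label` is hypothetical, and whether the data files used later in this notebook keep the UCI order is an assumption that has not been checked here.

```ruby
# Label for 0-based feature column k, ASSUMING the UCI block layout:
# columns 0-9 are means, 10-19 standard errors, 20-29 worst values.
def feature_label(k)
  params = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal dimension"]
  stats = ["mean", "stderr", "worst"]
  "#{params[k % params.size]} (#{stats[k / params.size]})"
end

p feature_label(0)    # first feature column
p feature_label(13)   # fourth column of the standard-error block
p feature_label(29)   # last feature column
```

If a file instead interleaves the three statistics per feature, the index arithmetic would be `k / 3` for the feature and `k % 3` for the statistic.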

http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html

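## Classifier
Given feature vectors $\boldsymbol{a}$ with known diagnoses,
let us write a program that selects a function $C(\boldsymbol{y})$ classifying a tissue sample with feature vector $\boldsymbol{y}$ as malignant or benign.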

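## Hypothesis class
The classifier is chosen from a set of possible classifiers, the **hypothesis class**. Here the hypothesis class consists of the linear functions $h(\cdot)$ from the feature space $\mathbb{R}^D$ to $\mathbb{R}$. The classifier is then defined as the following function.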

$$
C(\boldsymbol{y}) = 
\left\{ \begin{array}{ccc}
+1 &  {\rm when} & h(\boldsymbol{y})\geq 0\\
-1 &  {\rm when} & h(\boldsymbol{y})<0
\end{array} \right.
$$

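For each linear function $h:\mathbb{R}^D \rightarrow \mathbb{R}$
there exists a $D$-vector $\boldsymbol{w}$ such that
$$
h(\boldsymbol{y}) = \boldsymbol{w}\cdot \boldsymbol{y}
$$
Choosing such a linear function therefore amounts to choosing a $D$-vector $\boldsymbol{w}$.
Because choosing $\boldsymbol{w}$ is equivalent to choosing the hypothesis $h$ from the hypothesis class,
$\boldsymbol{w}$ is called the **hypothesis vector**.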

So a classifier needs nothing more than a dot product of two vectors. The real question is how to determine this hypothesis vector.
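To make the definition concrete, here is a minimal plain-Ruby sketch of $C(\boldsymbol{y})$ as the sign of a dot product. The helper names `dot` and `classify` and the three-dimensional vectors are illustrative assumptions, not part of the notebook's code.

```ruby
# Dot product of two equal-length arrays of numbers.
def dot(u, v)
  u.zip(v).sum { |ui, vi| ui * vi }
end

# C(y) = +1 when h(y) = w . y >= 0, otherwise -1.
def classify(w, y)
  dot(w, y) >= 0 ? 1 : -1
end

# Hypothetical hypothesis vector and feature vectors.
w = [0.5, -1.0, 0.25]
p classify(w, [2.0, 0.5, 0.0])   # h(y) =  0.5, so the class is +1
p classify(w, [0.0, 1.0, 2.0])   # h(y) = -0.5, so the class is -1
```

All the work of learning therefore goes into choosing `w`; applying the classifier costs one dot product per sample.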


# Steepest descent

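Choosing the loss function
$$
L(w)=\sum_{i=1}^m (a_i \cdot w - b_i)^2
$$
gives
$$
\begin{aligned}
\frac{\partial L}{\partial w_j} &= 
\sum_{i=1}^m \frac{\partial}{\partial w_j}(a_i \cdot w -b_i)^2 \\
&= \sum_{i=1}^m 2(a_i \cdot w -b_i) a_{ij}
\end{aligned}
$$
Here $a_i$ is the feature vector of the $i$-th sample, $b_i$ is its label, and $a_{ij}$ is the $j$-th element of $a_i$.
We use this as the gradient and step against it to look for a local minimum.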

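The sum is written as running over $i$ from 1 to $m$; in the text, $m$ is the number of data samples. The index $j$ runs over the elements of the vector $w$. Note that the code below names these the other way round: `n` is the number of samples and `m` is the number of features.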
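The formula above can be checked on a toy problem. The following plain-Ruby sketch evaluates the loss and its gradient exactly as written and then takes a fixed number of descent steps. The helper names (`dot`, `loss`, `gradient`), the toy data, and the step size are assumptions chosen for illustration; the notebook itself works with NArray matrices instead.

```ruby
# Dot product of two equal-length arrays.
def dot(u, v)
  u.zip(v).sum { |ui, vi| ui * vi }
end

# L(w) = sum_i (a_i . w - b_i)^2 over the rows a_i of `rows`.
def loss(rows, b, w)
  rows.each_with_index.sum { |a, i| (dot(a, w) - b[i])**2 }
end

# dL/dw_j = sum_i 2 (a_i . w - b_i) a_ij, one entry per element of w.
def gradient(rows, b, w)
  (0...w.size).map do |j|
    rows.each_with_index.sum { |a, i| 2 * (dot(a, w) - b[i]) * a[j] }
  end
end

# Hypothetical toy data: 3 samples, 2 features, labels +1 / -1.
rows = [[1.0, 2.0], [2.0, 0.5], [0.5, 3.0]]
b    = [1.0, -1.0, 1.0]
w    = [0.0, 0.0]
step = 0.01   # hypothetical step size, small enough for this toy problem

before = loss(rows, b, w)
200.times do
  g = gradient(rows, b, w)
  # move each weight against its partial derivative
  w = w.each_with_index.map { |wj, j| wj - step * g[j] }
end
after = loss(rows, b, w)
puts "loss before: #{before}, loss after: #{after}"
```

In matrix form the same gradient is $2A^{T}(Aw-b)$, where the rows of $A$ are the $a_i$. A step size that is too large makes the loss grow instead of shrink, which is why the step has to be tuned to the scale of the features.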

# code(ruby)

* file:/Users/bob/Github/TeamNishitani/coding_the_matrix/codes/my_cancer_detector.rb


In [7]:
require 'narray'

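# initial setup: read the training features (A) and labels (b)
lines_A = File.readlines('./codes/train_A.data')
lines_b = File.readlines('./codes/train_b.data')

p n = lines_A.size                  # number of samples (rows)
p m = lines_A[0].split("\t").size   # number of features (tab-separated columns)
# NArray takes the column count first: n rows (samples) by m columns (features)
matrix_A = NMatrix.sfloat(m,n)
vector_b = NVector.sfloat(n)
vector_w = NVector.sfloat(m)
vector_dLw = NVector.sfloat(m)      # placeholder, reassigned in the descent loop

n.times do |i|
  lines_A[i].split("\t").each_with_index do |v,j|
    matrix_A[j,i] = v.to_f          # column j (feature), row i (sample)
  end
  vector_b[i] = lines_b[i].to_f
end

# start from the same small weight for every feature
m.times{|i| vector_w[i]=0.0001}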

300
30


30

In [2]:
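# Print the weights as a table: one row per feature, one column per statistic.
def print_w(vector_w)
  params = ["radius", "texture","perimeter","area",
    "smoothness","compactness","concavity","concave points",
    "symmetry","fractal dimension"]
  print("    (params)      :")
  print("    (mean)    (stderr)     (worst)")
  params.each_with_index do |param, i|
    printf("\n%17s :",param)
    # NOTE: i*3+j assumes the three statistics of a feature sit in adjacent
    # columns. If the data files keep the UCI column order (10 means, then
    # 10 standard errors, then 10 worst values), the index should be j*10+i.
    3.times{|j| printf("%12.8f", vector_w[i*3+j])}
  end
end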

:print_w

In [3]:
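# 300 fixed-size descent steps; sigma is the step size
loop, sigma = 300, 3.0*10**(-9)
loop.times do |l|
  # residual A*w - b, one entry per sample (the gradient is 2 * residual * A)
  vector_dLw = matrix_A*vector_w - vector_b
  # step against the gradient; the constant factor 2 is absorbed into sigma
  vector_w = vector_w - vector_dLw*matrix_A*sigma
end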

print_w(vector_w)

    (params)      :    (mean)    (stderr)     (worst)
           radius :  0.00052012  0.00082878  0.00260480
          texture :  0.00165763  0.00010463  0.00010006
        perimeter :  0.00009600  0.00009791  0.00010882
             area :  0.00010355  0.00010234  0.00016925
       smoothness :  0.00009954 -0.00079295  0.00010042
      compactness :  0.00010042  0.00010070  0.00010026
        concavity :  0.00010118  0.00010019  0.00050416
   concave points :  0.00100466  0.00244736 -0.00193280
         symmetry :  0.00010583  0.00009609  0.00009172
fractal dimension :  0.00009761  0.00011135  0.00010346

["radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", "concave points", "symmetry", "fractal dimension"]

In [4]:
def show_correct_error(mA, vb, vw)
  # Labels: M = malignant (-1), B = benign (+1)
  correct,safe_error,critical_error=0,0,0
  predict = mA*vw                   # h(a_i) = a_i . w for every sample
  p n = vb.size
  n.times do |i|
    if predict[i]*vb[i]>0           # same sign: correct
      correct += 1
    elsif predict[i]<0 && vb[i]>0   # benign predicted as malignant (false alarm)
      safe_error += 1
    elsif predict[i]>0 && vb[i]<0   # malignant predicted as benign (missed case)
      critical_error += 1
    end
  end
  printf("       correct: %4d/%4d\n",correct,n)
  printf("    safe error: %4d\n",safe_error)
  printf("critical error: %4d",critical_error)
end

:show_correct_error

In [5]:
show_correct_error(matrix_A, vector_b, vector_w)

300
       correct:  274/ 300
    safe error:    5
critical error:   21

In [6]:
require 'narray'

# load the validation set (same file layout as the training set)
lines_A = File.readlines('./codes/validate_A.data')
lines_b = File.readlines('./codes/validate_b.data')

p n = lines_A.size
p m = lines_A[0].split("\t").size

matrix_A = NMatrix.sfloat(m,n)
vector_b = NVector.sfloat(n)

n.times do |i|
  lines_A[i].split("\t").each_with_index do |v,j|
    matrix_A[j,i] = v.to_f
  end
  vector_b[i] = lines_b[i].to_f
end

show_correct_error(matrix_A, vector_b, vector_w)

260
30
260
       correct:  240/ 260
    safe error:    9
critical error:   11
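
The two runs are easier to compare as rates than as raw counts. The sketch below only does arithmetic on the counts printed above; the hash layout and the variable names are illustrative assumptions.

```ruby
# Counts copied from the training and validation outputs above.
runs = {
  "train"    => { total: 300, correct: 274, safe: 5, critical: 21 },
  "validate" => { total: 260, correct: 240, safe: 9, critical: 11 },
}

runs.each do |name, c|
  accuracy      = 100.0 * c[:correct]  / c[:total]
  critical_rate = 100.0 * c[:critical] / c[:total]
  printf("%-8s accuracy: %5.1f%%  critical errors: %4.1f%%\n",
         name, accuracy, critical_rate)
end
```

This works out to roughly 91.3% accuracy on the training set and 92.3% on the validation set, so the hypothesis vector does not appear to be overfitted to the training data.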

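### Orthogonal vector
Still, it may be better to start from values close to 0.
In other words, start from a vector that is nearly orthogonal to the feature vectors, so that $w \cdot a_i \approx 0$. That seems reasonable.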


In [8]:
require 'lapacke'

LoadError: cannot load such file -- lapacke