## scikit-learn Datasets

In [4]:
import pandas as pd
import numpy as np

In [5]:
pip install joblib

Note: you may need to restart the kernel to use updated packages.


In [6]:
from sklearn.datasets import load_iris
iris = load_iris()

In [2]:
import joblib

In [7]:
print('Dataset description:')
print(iris.DESCR)
print('Feature values:')
print(iris.data)
print('Target values:')
print(iris.target)
print('Feature names:')
print(iris.feature_names)
print('Target names:')
print(iris.target_names)

Dataset description:
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)

In [8]:
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(data_home='.', subset='all')
print('Target values:')
print(news.target)
print('Target names:')
print(news.target_names)
print('First news post:')
print(news.data[0])  # print the first news post

Target values:
[10  3 17 ...  3  1  7]
Target names:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
First news post:
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why


In [9]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
print('Original features: {}, original targets: {}'.
      format(iris.data.shape, iris.target.shape))
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2)
print('Training features: {}, training targets: {}'.
      format(x_train.shape, y_train.shape))
print('Test features: {}, test targets: {}'.
      format(x_test.shape, y_test.shape))

Original features: (150, 4), original targets: (150,)
Training features: (120, 4), training targets: (120,)
Test features: (30, 4), test targets: (30,)
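
The split above is random, so the rows that land in each subset differ from run to run. A minimal sketch, assuming a reproducible and class-balanced split is wanted: `random_state` fixes the shuffle and `stratify` keeps the class proportions the same in both subsets. The seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.2,
    random_state=42,        # arbitrary seed (assumption), fixes the shuffle
    stratify=iris.target)   # keep the 1:1:1 class ratio in both subsets

# count how many samples of each class ended up in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

With stratification, each of the three iris classes contributes the same number of samples to the test set, so an accuracy figure is less sensitive to an unlucky draw.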


## K-Nearest Neighbors Algorithm

In [10]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2)
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
print('Target values: {}'.format(y_test))
print('Predictions:   {}'.format(y_predict))
print('Accuracy:      {}'.format(knn.score(x_test, y_test)))

Target values: [0 2 0 1 1 0 0 2 1 2 0 0 2 2 2 1 0 1 1 1 1 2 0 0 0 2 2 0 0 2]
Predictions:   [0 2 0 1 1 0 0 2 2 2 0 0 2 2 2 1 0 1 1 1 1 2 0 0 0 1 2 0 0 2]
Accuracy:      0.9333333333333333
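
To make the idea behind `KNeighborsClassifier` concrete, here is a simplified sketch of k-nearest-neighbor classification in plain Python: compute the Euclidean distance from the query to every training point, keep the `k` closest, and return the majority label. The function name `knn_predict` and the toy points are made up for illustration; scikit-learn's implementation is more elaborate (for example, it can use tree-based indexes).

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample
    distances = [(math.dist(p, query), label)
                 for p, label in zip(train_x, train_y)]
    # keep the k closest samples
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # majority vote over their labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy 2-D data (hypothetical): class 0 near the origin, class 1 near (5, 5)
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))
```

Because the vote depends on raw distances, features on larger scales dominate the result, which is why the cell above standardizes the data with `StandardScaler` before fitting.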


## K-Nearest Neighbors Application: Handwritten Digits

In [4]:
from sklearn.neighbors import \
    KNeighborsClassifier
from sklearn.model_selection import \
    train_test_split
import numpy as np
import matplotlib.pyplot as plt

In [5]:
data = []
for i in range(10):
    for j in range(1,501):
        data.append(plt.imread('mnist500/%d/%d_%d.bmp'%(i,i,j)))
x = np.array(data)
print(x.shape)

(5000, 28, 28)


In [6]:
# labels follow the load order: 500 images of digit 0, then 500 of digit 1, ...
y = np.repeat(np.arange(10), 500)
print(y)

[0 0 0 ... 9 9 9]


In [7]:
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5)
# flatten each 28x28 image into a 784-element feature vector
knn.fit(x_train.reshape(len(x_train), -1), y_train)
score = knn.score(x_test.reshape(len(x_test), -1), y_test)
print(score)

0.928


In [8]:
# Handwritten digit recognition using <mnist1000.zip>
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
data = []
for i in range(10):
    for j in range(1,1001):
        data.append(plt.imread('mnist1000/%d/%d_%d.bmp'%(i,i,j)))
x = np.array(data)
# labels follow the load order: 1000 images per digit, digits in ascending order
y = np.repeat(np.arange(10), 1000)

In [9]:
x_train , x_test , y_train , y_test = train_test_split(x,y,test_size=0.2)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train.reshape(8000,-1),y_train)
score = knn.score(x_test.reshape(2000,-1),y_test)
print(score)

0.934


In [17]:
knnmodel = joblib.load('mnist500.pkl')
score = knnmodel.score(x_test.reshape(2000,-1),y_test)
print(score)

0.014
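
The cell above loads `mnist500.pkl`, but the step that created that file is not shown. A minimal sketch of a save/load round trip with `joblib.dump` and `joblib.load`, assuming a fitted `KNeighborsClassifier`. It uses scikit-learn's built-in `load_digits` data (8x8 images) and a temporary file named `knn_digits.pkl`; both are chosen only for illustration and are not the notebook's own data or file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()  # stand-in data set (assumption), not mnist500
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)
original_score = knn.score(x_test, y_test)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'knn_digits.pkl')  # hypothetical file name
    joblib.dump(knn, path)          # serialize the fitted model to disk
    restored = joblib.load(path)    # read it back as a new object
    restored_score = restored.score(x_test, y_test)

print(original_score, restored_score)
```

A restored model scores identically on the same test data. It expects inputs with the same feature layout and preprocessing as the data it was trained on.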


## Naive Bayes Algorithm

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
data = cv.fit_transform(['code is easy, i like python', 
                         'code is too hard, i dislike python'])
#print(data)
print(data.toarray())
print(cv.get_feature_names_out())

[[1 0 1 0 1 1 1 0]
 [1 1 0 1 1 0 1 1]]
['code' 'dislike' 'easy' 'hard' 'is' 'like' 'python' 'too']


In [25]:
pip install jieba

Note: you may need to restart the kernel to use updated packages.


In [26]:
import jieba
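# "Taipei is sunny today, and the scenic areas are packed with crowds."
t1 = list(jieba.cut('今天台北天氣晴朗，風景區擠滿了人潮。'))
# "It often rains in Taipei."
t2 = list(jieba.cut('台北的天氣常常下雨。'))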
c1 = ' '.join(t1)
c2 = ' '.join(t2)
print(c1)
print(c2)

今天 台北 天氣 晴朗 ， 風景區 擠 滿 了 人潮 。
台北 的 天氣 常常 下雨 。


In [27]:
cv = CountVectorizer()
data = cv.fit_transform([c1, c2])
print(data.toarray())
print(cv.get_feature_names_out())

[[0 1 1 1 1 0 1 1]
 [1 0 0 1 1 1 0 0]]
['下雨' '人潮' '今天' '台北' '天氣' '常常' '晴朗' '風景區']
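
The vocabulary above omits the single-character tokens (`擠`, `滿`, `了`, `的`) and the punctuation. That follows from `CountVectorizer`'s default `token_pattern`, `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters. A small sketch that applies the same regular expression with Python's `re` module to the two segmented sentences:

```python
import re

# default token_pattern of CountVectorizer and TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

c1 = '今天 台北 天氣 晴朗 ， 風景區 擠 滿 了 人潮 。'
c2 = '台北 的 天氣 常常 下雨 。'

# tokens of length 1 and punctuation do not match the pattern
tokens1 = re.findall(TOKEN_PATTERN, c1)
tokens2 = re.findall(TOKEN_PATTERN, c2)
print(tokens1)
print(tokens2)

# the sorted union of the tokens gives the feature names
vocabulary = sorted(set(tokens1) | set(tokens2))
print(vocabulary)
```

If single-character words matter for a task, a different `token_pattern` can be passed to the vectorizer.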


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
data = tf.fit_transform([c1, c2])
print(data.toarray())
print(tf.get_feature_names_out())

[[0.         0.44665616 0.44665616 0.31779954 0.31779954 0.
  0.44665616 0.44665616]
 [0.57615236 0.         0.         0.40993715 0.40993715 0.57615236
  0.         0.        ]]
['下雨' '人潮' '今天' '台北' '天氣' '常常' '晴朗' '風景區']
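
The weights above can be reproduced by hand. A sketch under the assumption of `TfidfVectorizer`'s default settings (`smooth_idf=True`, `norm='l2'`, raw term counts): each term gets `idf = ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the term, and each row is then scaled to unit Euclidean length. The helper name `tfidf_rows` is made up for this example.

```python
import math

def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2-normalized rows; docs are token lists."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        # term count times idf, then scale the row to unit length
        raw = [doc.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

# tokens kept by the vectorizer for the two sentences
doc1 = ['今天', '台北', '天氣', '晴朗', '風景區', '人潮']
doc2 = ['台北', '天氣', '常常', '下雨']

vocab, rows = tfidf_rows([doc1, doc2])
print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

Terms that appear in both sentences (`台北`, `天氣`) get the smaller idf of 1, which is why their weights are lower than those of terms unique to one sentence.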


In [29]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
news = fetch_20newsgroups(subset='all')
x_train, x_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.20)
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)  
x_test = tf.transform(x_test)
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
score = mlt.score(x_test, y_test)
print(score)

0.8538461538461538
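
As a rough illustration of what `MultinomialNB(alpha=1.0)` computes, the sketch below implements multinomial naive Bayes with Laplace (add-`alpha`) smoothing in plain Python: the score of a class is its log prior plus the sum of the smoothed log likelihoods of the words in the document. The function names and the four-document toy corpus are invented for this example, and it works on raw word counts rather than TF-IDF weights.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for doc in docs for w in doc})
    word_counts = defaultdict(Counter)   # class -> word -> count
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # add alpha to every word count so unseen words keep a nonzero probability
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Pick the class with the highest log posterior score."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # words outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c])
    return max(scores, key=scores.get)

# hypothetical toy corpus
docs = [['goal', 'team', 'hockey'], ['team', 'win', 'game'],
        ['orbit', 'rocket', 'launch'], ['rocket', 'space', 'orbit']]
labels = ['sport', 'sport', 'space', 'space']

model = train_nb(docs, labels, alpha=1.0)
print(predict_nb(model, ['team', 'game', 'rocket']))
print(predict_nb(model, ['launch', 'orbit']))
```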


In [30]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import jieba
data = []
target = []
# fields are separated by '_!_': field 1 is used as the label, field 3 as the text
with open('toutiao_cat_data.txt', encoding='utf-8') as f:
    for line in f:
        linelist = line.split('_!_')
        target.append(linelist[1])
        tem = list(jieba.cut(linelist[3]))
        data.append(' '.join(tem))
x_train, x_test, y_train, y_test = train_test_split(
    data, target, test_size=0.20)
tf = TfidfVectorizer()
x_train = tf.fit_transform(x_train)  
x_test = tf.transform(x_test)
mlt = MultinomialNB(alpha=1.0)
mlt.fit(x_train, y_train)
score = mlt.score(x_test, y_test)
print(score)

0.8281768533277588


## Decision Tree

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

df = pd.read_csv('titanic.csv')
x = df[['pclass', 'age', 'sex']].copy()  # work on a copy, not a view of df
y = df['survived']
x['age'] = x['age'].fillna(x['age'].mean())  # fill missing ages with the mean
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2)
vec = DictVectorizer(sparse=False)
x_train = x_train.to_dict(orient='records')
x_train = vec.fit_transform(x_train)
x_test = x_test.to_dict(orient='records')
x_test = vec.transform(x_test)
print('Training data:')
print(x_train)
print('One-hot feature names:')
print(vec.get_feature_names_out())

Training data:
[[27.          0.          0.          1.          0.          1.        ]
 [31.19418104  0.          0.          1.          0.          1.        ]
 [20.          1.          0.          0.          1.          0.        ]
 ...
 [26.          0.          0.          1.          0.          1.        ]
 [19.          0.          1.          0.          1.          0.        ]
 [34.          0.          1.          0.          1.          0.        ]]
One-hot feature names:
['age' 'pclass=1st' 'pclass=2nd' 'pclass=3rd' 'sex=female' 'sex=male']


In [21]:
dec = DecisionTreeClassifier()
dec.fit(x_train, y_train)
score = dec.score(x_test, y_test)
print(score)

0.7642585551330798
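
`DecisionTreeClassifier()` uses the Gini impurity criterion by default. A small sketch of how the impurity of a node and of a candidate split can be computed; the label lists are made-up counts for illustration, not values from `titanic.csv`, and the helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: size-weighted average of the two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# hypothetical node: 1 = survived, 0 = did not survive
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # e.g. rows with sex=female
right = [1, 0, 0, 0]   # e.g. rows with sex=male

print(gini(parent))
print(split_gini(left, right))
```

A lower weighted impurity after the split means the split separates the classes better; at each node the tree looks for the split with the largest reduction.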


## Random Forest

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

df = pd.read_csv('wine.csv')
x = df.iloc[:, 1:14]
y = df.iloc[:, 0]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2)
rf = RandomForestClassifier(n_estimators=10, 
                            min_samples_split=30)
rf.fit(x_train, y_train)
score = rf.score(x_test, y_test)
print(score)

0.9375
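
`n_estimators=10` asks the forest for 10 trees, each fitted on a bootstrap sample of the training data. The sketch below illustrates only that bagging idea in plain Python, with one-feature threshold "stumps" standing in for full decision trees and a hard majority vote. It is a simplification under stated assumptions: the data and helper names are invented, there is no per-split feature subsampling, and scikit-learn combines trees by averaging predicted class probabilities rather than by counting votes.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold that misclassifies the fewest samples.
    The stump predicts class 1 when x > threshold, else class 0."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def fit_forest(xs, ys, n_estimators=10, seed=0):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        # bootstrap sample: draw len(xs) indices with replacement
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_forest(stumps, x):
    """Majority vote over the stumps' predictions."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]

# hypothetical one-feature data: small values are class 0, large values class 1
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 6.0, 6.5, 7.0, 7.5, 8.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

stumps = fit_forest(xs, ys, n_estimators=10, seed=0)
print(predict_forest(stumps, 0.5), predict_forest(stumps, 9.0))
```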
