# Pumpkin Seeds Dataset

## Introduction

This dataset is in `.xlsx` format, so I will use the `read_excel` function instead of `read_csv`. I will use the data to implement a support vector machine (SVM) model. The dataset provides a set of features of pumpkin seeds, which we use to classify each seed as one of two types: "Çerçevelik" or "Ürgüp Sivrisi".

## Initial Data

In [1]:
from pandas import read_excel

df = read_excel('Pumpkin_Seeds_Dataset.xlsx')

df.head()

Unnamed: 0,Area,Perimeter,Major_Axis_Length,Minor_Axis_Length,Convex_Area,Equiv_Diameter,Eccentricity,Solidity,Extent,Roundness,Aspect_Ration,Compactness,Class
0,56276,888.242,326.1485,220.2388,56831,267.6805,0.7376,0.9902,0.7453,0.8963,1.4809,0.8207,Çerçevelik
1,76631,1068.146,417.1932,234.2289,77280,312.3614,0.8275,0.9916,0.7151,0.844,1.7811,0.7487,Çerçevelik
2,71623,1082.987,435.8328,211.0457,72663,301.9822,0.8749,0.9857,0.74,0.7674,2.0651,0.6929,Çerçevelik
3,66458,992.051,381.5638,222.5322,67118,290.8899,0.8123,0.9902,0.7396,0.8486,1.7146,0.7624,Çerçevelik
4,66107,998.146,383.8883,220.4545,67117,290.1207,0.8187,0.985,0.6752,0.8338,1.7413,0.7557,Çerçevelik


## Data Cleaning

Let us check whether there are duplicated rows or empty cells.

In [2]:
df[ df.duplicated() ]

Unnamed: 0,Area,Perimeter,Major_Axis_Length,Minor_Axis_Length,Convex_Area,Equiv_Diameter,Eccentricity,Solidity,Extent,Roundness,Aspect_Ration,Compactness,Class


In [7]:
for i in df.columns:
    valor = int(df[i].isnull().sum())
    if valor > 0:
        print(i)

As we can see, there are no null cells and no duplicated rows. Next, we convert the last column (Class) to numbers so that every element of the table is numeric.
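The two checks above can also be written without an explicit loop. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the variable names are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy frame standing in for the pumpkin seed data.
toy = pd.DataFrame({
    "Area": [56276, 76631, 76631, None],
    "Class": ["Çerçevelik", "Ürgüp Sivrisi", "Ürgüp Sivrisi", "Çerçevelik"],
})

# Number of fully repeated rows (the first occurrence is not counted).
n_duplicated = int(toy.duplicated().sum())

# Null count per column, keeping only the columns with at least one null.
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0].index.tolist()

print(n_duplicated, columns_with_nulls)
```

On the real `df`, both results would be expected to be empty or zero, matching the outputs shown above.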

## Feature-Class Split

In [4]:
x = df.iloc[:,0:12].values

y = df.iloc[:,12].values

y

array(['Çerçevelik', 'Çerçevelik', 'Çerçevelik', ..., 'Ürgüp Sivrisi',
       'Ürgüp Sivrisi', 'Ürgüp Sivrisi'], dtype=object)

## Category Encoding

Although there is no ordering between the classes "Çerçevelik" and "Ürgüp Sivrisi", we will use the Label Encoder instead of One Hot Encoding because there are only two classes.

In [5]:
from sklearn.preprocessing import LabelEncoder

codificador = LabelEncoder()

y = codificador.fit_transform(y)

y

array([0, 0, 0, ..., 1, 1, 1])
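To confirm which integer corresponds to which class, the fitted encoder can be inspected. This is a small sketch on a hypothetical three-element list (`labels`, `encoder`, and `mapping` are illustrative names); `LabelEncoder` stores the sorted class names in `classes_`, and `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Class column.
labels = ["Çerçevelik", "Ürgüp Sivrisi", "Çerçevelik"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted; position i holds the class encoded as integer i.
mapping = {name: code for code, name in enumerate(encoder.classes_)}

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)

print(mapping)
print(list(decoded))
```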

In [16]:
x

array([[5.627600e+04, 8.882420e+02, 3.261485e+02, ..., 8.963000e-01,
        1.480900e+00, 8.207000e-01],
       [7.663100e+04, 1.068146e+03, 4.171932e+02, ..., 8.440000e-01,
        1.781100e+00, 7.487000e-01],
       [7.162300e+04, 1.082987e+03, 4.358328e+02, ..., 7.674000e-01,
        2.065100e+00, 6.929000e-01],
       ...,
       [8.799400e+04, 1.210314e+03, 5.072200e+02, ..., 7.549000e-01,
        2.282800e+00, 6.599000e-01],
       [8.001100e+04, 1.182947e+03, 5.019065e+02, ..., 7.185000e-01,
        2.451300e+00, 6.359000e-01],
       [8.493400e+04, 1.159933e+03, 4.628951e+02, ..., 7.933000e-01,
        1.973500e+00, 7.104000e-01]])

## Train-Test Split

In [9]:
from sklearn.model_selection import train_test_split

xTreino, xTeste, yTreino, yTeste = train_test_split(x,y,
                                                    test_size = 0.3,
                                                    random_state = 0)

yTreino

array([1, 1, 1, ..., 0, 0, 1])

In [10]:
xTreino

array([[5.805500e+04, 9.383600e+02, 3.735688e+02, ..., 8.285000e-01,
        1.879900e+00, 7.278000e-01],
       [8.227200e+04, 1.121769e+03, 4.594482e+02, ..., 8.216000e-01,
        2.012700e+00, 7.044000e-01],
       [9.174400e+04, 1.186482e+03, 4.650748e+02, ..., 8.190000e-01,
        1.843700e+00, 7.349000e-01],
       ...,
       [8.828300e+04, 1.147212e+03, 4.541398e+02, ..., 8.429000e-01,
        1.827500e+00, 7.383000e-01],
       [7.171100e+04, 1.066107e+03, 4.342415e+02, ..., 7.929000e-01,
        2.057300e+00, 6.959000e-01],
       [1.064530e+05, 1.331894e+03, 5.555923e+02, ..., 7.541000e-01,
        2.263000e+00, 6.626000e-01]])
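`train_test_split` shuffles the rows by default, which matters here because the class column appears to be grouped by class. As an optional variation that the original notebook does not use, the `stratify` parameter keeps the class proportions equal in both partitions. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 10 per class, sorted by class.
X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=0,
    stratify=y,  # keep the 50/50 class ratio in both partitions
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```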

## SVM Implementation

NOTE: it was **not** necessary to encode the class, since the SVM only requires the **features** to be numeric.
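To illustrate the note, here is a minimal sketch in which `SVC` is trained directly on string labels. The four-point dataset and the names `X_toy`, `y_toy`, and `clf` are hypothetical:

```python
from sklearn.svm import SVC

# Hypothetical, well-separated toy features with string class labels.
X_toy = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y_toy = ["Çerçevelik", "Çerçevelik", "Ürgüp Sivrisi", "Ürgüp Sivrisi"]

clf = SVC()
clf.fit(X_toy, y_toy)  # the target is passed as strings, with no encoding

# Predictions come back as the original string labels.
pred = clf.predict([[0.05, 0.1], [0.95, 1.0]])

print(list(clf.classes_))
print(list(pred))
```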

In [11]:
from sklearn.svm import SVC

svm = SVC()

svm.fit(xTreino, yTreino)

previsao = svm.predict(xTeste)

In [12]:
previsao

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,

## Model Accuracy

In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score

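# Note: confusion_matrix expects (y_true, y_pred). With the arguments in
# this order, rows are predicted classes and columns are true classes.
matriz = confusion_matrix(previsao,yTeste)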

matriz

array([[342, 269],
       [ 47,  92]])
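The accuracy can be read directly from this matrix: the diagonal holds the correctly classified samples, and the sum of all cells is the size of the test set. A short worked check using the values printed above (the variable names are illustrative):

```python
# Confusion matrix values copied from the output above.
matrix = [[342, 269],
          [47, 92]]

total = sum(sum(row) for row in matrix)  # all test samples
correct = matrix[0][0] + matrix[1][1]    # diagonal = correct predictions
accuracy = correct / total

print(total, correct, round(accuracy, 4))  # prints: 750 434 0.5787
```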

In [15]:
taxa = accuracy_score(yTeste,previsao)

taxa

0.5786666666666667
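One plausible contributor to this low accuracy, not verified on this dataset, is feature scale. `Area` is in the tens of thousands while `Solidity` is below 1, and an RBF-kernel `SVC` is sensitive to such differences. The sketch below uses synthetic data (an assumption for illustration only) to compare an unscaled `SVC` with one preceded by `StandardScaler` in a pipeline:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400

# Synthetic data: one large-scale uninformative feature and one
# small-scale informative feature (hypothetical, not the pumpkin data).
y_syn = rng.integers(0, 2, size=n)
big = rng.normal(80000, 10000, size=n)
small = 0.7 + 0.1 * y_syn + rng.normal(0, 0.02, size=n)
X_syn = np.column_stack([big, small])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

# SVC on raw features versus SVC on standardized features.
raw_model = SVC().fit(X_tr, y_tr)
scaled_model = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

raw_acc = raw_model.score(X_te, y_te)
scaled_acc = scaled_model.score(X_te, y_te)

print(raw_acc, scaled_acc)
```

Applying the same pipeline to `xTreino` and `xTeste` would show whether scaling helps on the real data. The pipeline fits the scaler on the training partition only, which avoids leaking test statistics into the model.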