## **Scikit-learn**
*  Open-source Python library for machine learning (ML)

*  Built on NumPy, SciPy, Matplotlib, and other standard libraries

*  Simple and efficient tools for predictive data analysis


### **Usage**
 
1.  Fitting and predicting with estimators
2.  Transformers and preprocessors
3.  Pipelines
4.  Model evaluation
5.  Automated parameter search
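
As a rough illustration of how these five pieces fit together, the sketch below uses scikit-learn's bundled iris data rather than the zoo data used later in this notebook. The step names (`scale`, `knn`) and the parameter grid are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Items 2 and 3: a transformer and an estimator chained in a pipeline
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Item 1: fit on training data, then predict on unseen data
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)

# Item 4: model evaluation with 5-fold cross-validation
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Item 5: automated search over the number of neighbors
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X_train, y_train)

print("CV accuracy per fold:", cv_scores)
print("Best n_neighbors:", search.best_params_["knn__n_neighbors"])
```

The `knn__n_neighbors` key follows the pipeline convention of `<step name>__<parameter name>`.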

In [1]:
import sklearn
print("version:: ", sklearn.__version__)

version::  1.2.2


In [2]:
import pandas as pd

class_ = pd.read_csv('/kaggle/input/zoo-animal-classification/class.csv')
class_.head()


Unnamed: 0,Class_Number,Number_Of_Animal_Species_In_Class,Class_Type,Animal_Names
0,1,41,Mammal,"aardvark, antelope, bear, boar, buffalo, calf,..."
1,2,20,Bird,"chicken, crow, dove, duck, flamingo, gull, haw..."
2,3,5,Reptile,"pitviper, seasnake, slowworm, tortoise, tuatara"
3,4,13,Fish,"bass, carp, catfish, chub, dogfish, haddock, h..."
4,5,4,Amphibian,"frog, frog, newt, toad"


In [15]:
zoo = pd.read_csv('/kaggle/input/zoo-animal-classification/zoo.csv')
zoo.head()
# print(len(zoo))# 101

Unnamed: 0,animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,class_type
0,aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
1,antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
2,bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
3,bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
4,boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1


In [4]:
zoo.columns

Index(['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne',
       'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous',
       'fins', 'legs', 'tail', 'domestic', 'catsize', 'class_type'],
      dtype='object')

In [5]:
filtered_df = zoo[zoo['animal_name'].str.startswith('k')]
print(filtered_df)

   animal_name  hair  feathers  eggs  milk  airborne  aquatic  predator  \
41        kiwi     0         1     1     0         0        0         1   

    toothed  backbone  breathes  venomous  fins  legs  tail  domestic  \
41        0         1         1         0     0     2     1         0   

    catsize  class_type  
41        0           2  


In [6]:
filtered_df = zoo.query('class_type==1')
print(len(filtered_df))

41


Next, we train a model on the zoo dataset and check whether it predicts the proper class for sample data.
At the end, we check whether it returns the correct class type for a kangaroo, an animal that does not appear in the dataset.

### **Load a Dataset**

A dataset is a collection of data. It generally has two main components:

*   **Features** (also known as predictors, inputs, or attributes): the variables of our data. There can be more than one, so they are represented by a **feature matrix** (‘X’ is the common notation). The list of their column names is called the **feature names**.
    
*   **Response** (also known as the target, label, or output): the output variable, which depends on the feature variables. There is generally a single response column, represented by a **response vector** (‘y’ is the common notation). The set of possible values the response vector can take is called the **target names**.
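
To make these terms concrete, here is a small sketch that uses scikit-learn's bundled iris dataset (not the zoo data), since that object exposes `feature_names` and `target_names` directly.

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix, shape (n_samples, n_features)
y = iris.target  # response vector, shape (n_samples,)

print("Feature names:", iris.feature_names)
print("Target names:", list(iris.target_names))
print("X shape:", X.shape)
print("y shape:", y.shape)
```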

In [8]:
X = zoo[['hair', 'feathers', 'eggs', 'milk', 'airborne',
         'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous',
         'fins', 'legs', 'tail', 'domestic', 'catsize']]  # Feature matrix
# animal_name is dropped: it is a text identifier, not a numeric feature.

# One-hot encoding is not applied here, since every remaining column is already numeric.
# from sklearn.preprocessing import OneHotEncoder
# encoder = OneHotEncoder(sparse_output=False)  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
# X_encoded = encoder.fit_transform(X)
y = zoo['class_type']  # Response vector

print("X features", X.head(), len(X))
print("Y features", y.head(), len(y))

X features    hair  feathers  eggs  milk  airborne  aquatic  predator  toothed  backbone  \
0     1         0     0     1         0        0         1        1         1   
1     1         0     0     1         0        0         0        1         1   
2     0         0     1     0         0        1         1        1         1   
3     1         0     0     1         0        0         1        1         1   
4     1         0     0     1         0        0         1        1         1   

   breathes  venomous  fins  legs  tail  domestic  catsize  
0         1         0     0     4     0         0        1  
1         1         0     0     4     1         0        1  
2         0         0     1     0     1         0        0  
3         1         0     0     4     0         0        1  
4         1         0     0     4     1         0        1   101
Y features 0    1
1    1
2    4
3    1
4    1
Name: class_type, dtype: int64 101


### **Splitting the Dataset**

*   Split the dataset into two pieces: a training set and a testing set.
    
*   Train the model on the training set.
    
*   Test the model on the testing set and evaluate how well our model did.
    

**Advantages of train/test split**

*   The model is tested on data that differs from the data used for training.
    
*   Response values are known for the test dataset; hence predictions can be evaluated.
    
*   Testing accuracy is a better estimate of out-of-sample performance than training accuracy.
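
The last point can be seen in a minimal sketch. It assumes synthetic data from `make_classification` (not the zoo data) and a 1-nearest-neighbor model, chosen because it memorizes its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# With k=1 every training point is its own nearest neighbor,
# so the model reproduces the training labels exactly.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```

The training accuracy here is perfect by construction, so it says nothing about how the model handles new data; the testing accuracy is the more honest number.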

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# printing the shapes of the new X objects
print("X_train Shape:",  X_train.shape)
print("X_test Shape:", X_test.shape)

# printing the shapes of the new y objects
print("Y_train Shape:", y_train.shape)
print("Y_test Shape: ",y_test.shape)

X_train Shape: (60, 16)
X_test Shape: (41, 16)
Y_train Shape: (60,)
Y_test Shape:  (41,)


The **train\_test\_split** function takes several arguments which are explained below:  

*   **X, y**: These are the feature matrix and response vector which need to be split.
    
*   **test\_size**: The fraction of the data to hold out for testing. For example, setting test\_size = 0.4 for 100 rows of X produces test data of 100 x 0.4 = 40 rows. The test count is rounded up, which is why the 101-row zoo dataset above yields 41 test rows (101 x 0.4 = 40.4).
    
*   **random\_state**: If you set random\_state to a fixed integer, the split will be the same on every run. This is useful for reproducible results, for example so that everyone following the documentation sees the same numbers.
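
The effect of these arguments can be checked with a short sketch. It uses stand-in NumPy arrays with 101 rows (the same row count as the zoo dataset) instead of the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with 101 rows, matching the zoo dataset's row count
X = np.arange(101 * 2).reshape(101, 2)
y = np.arange(101)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)
print(len(X_train), len(X_test))  # 60 41: the test count is rounded up from 40.4

# Same random_state: the split is identical
_, _, _, y_test_again = train_test_split(X, y, test_size=0.4, random_state=1)
print("Same seed, same split:", np.array_equal(y_test, y_test_again))

# Different random_state: the rows are shuffled differently
_, _, _, y_test_other = train_test_split(X, y, test_size=0.4, random_state=2)
print("Different seed, same split:", np.array_equal(y_test, y_test_other))
```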

### **Training the Model**

Now it’s time to train a prediction model on our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a consistent interface for fitting, predicting, and scoring. The example below uses the KNN (k-nearest neighbors) classifier.

In [11]:
# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

In [16]:
y_pred = knn.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("KNN model accuracy", metrics.accuracy_score(y_test, y_pred))

KNN model accuracy 0.8536585365853658
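
The accuracy above corresponds to 35 of the 41 test animals being classified correctly (35 / 41 ≈ 0.8537). The sketch below shows how `accuracy_score` arrives at such a number, and how a confusion matrix breaks it down by class. The labels are made up for illustration and are not the zoo results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only, not the zoo results
y_true = [1, 1, 2, 2, 4, 4, 4, 7]
y_pred = [1, 1, 2, 4, 4, 4, 2, 7]

# Accuracy is the fraction of predictions that match the true labels
acc = accuracy_score(y_true, y_pred)
print("Accuracy:", acc)  # 6 of 8 correct -> 0.75

# Rows are true classes, columns are predicted classes, in sorted label order
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Off-diagonal entries of the matrix show which classes get confused with each other, which a single accuracy figure hides.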


In [17]:
# making prediction for out of sample data
# Kangaroo
sample_test_input = pd.DataFrame({
    'hair': [1],
    'feathers': [0],
    'eggs': [0],
    'milk': [1],
    'airborne': [0],
    'aquatic': [0],
    'predator': [0],
    'toothed': [1],
    'backbone': [1],
    'breathes': [1],
    'venomous': [0],
    'fins': [0],
    'legs': [2],
    'tail': [1],
    'domestic': [0],
    'catsize': [0]
})

# Make a prediction using the sample input
sample_prediction = knn.predict(sample_test_input)

print("Sample prediction:", sample_prediction)

Sample prediction: [1]
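
To read the predicted number as a class name, it can be looked up against the class table loaded earlier. The sketch below hardcodes only the five rows visible in the `class.csv` preview, so the dictionary is an assumption about the mapping rather than the full file.

```python
# Class numbers and names as shown in the class.csv preview
# (only the five rows visible in that preview are included here)
class_names = {1: "Mammal", 2: "Bird", 3: "Reptile", 4: "Fish", 5: "Amphibian"}

sample_prediction = [1]  # value printed by the cell above

# Fall back to "unknown" for class numbers missing from the partial mapping
predicted_label = class_names.get(sample_prediction[0], "unknown")
print("Predicted class:", predicted_label)
```

With the real DataFrame, the same lookup could presumably be built from `class_.set_index('Class_Number')['Class_Type']`. The kangaroo is predicted as class 1, which the preview lists as Mammal.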
