<h2> Predictive Analytics </h2>

We use predictive analytics to predict a plant's family from its Symbol, National Common Name and Author.

In [1]:
import numpy as np 
import pandas as pd

from sklearn import metrics
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.naive_bayes import GaussianNB

We read the cleaned and pre-processed plant dataset.

In [2]:
data = pd.read_csv('dataset_plants.csv')
data.head()

|   | Symbol | National Common Name | Family | Author |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 2 | 0 | 1 |
| 2 | 2 | 1 | 1 | 2 |
| 3 | 2 | 2 | 1 | 3 |
| 4 | 3 | 3 | 0 | 4 |


Next, we split the dataset into a training set and a test set.

### Train-Test Split
We split the dataset so that 80% of the rows go to the training set and the remaining 20% go to the test set.

In [3]:
train,test = train_test_split(data, test_size=0.2)

In [4]:
# Separate the input features from the target column 'Family'
train_x = train.drop('Family', axis=1)
train_y = train['Family']
test_x = test.drop('Family', axis=1)
test_y = test['Family']
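
Because `train_test_split` shuffles the rows randomly, each run of the split above produces a different partition and therefore slightly different accuracy values. The sketch below shows one way to make the split reproducible. It assumes a small hypothetical table (`toy`) in place of the real dataset and an arbitrary seed (`random_state=0`); neither comes from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the label-encoded plant table (not the real data).
toy = pd.DataFrame({
    "Symbol": range(10),
    "National Common Name": [0, 2, 1, 2, 3, 4, 5, 6, 7, 8],
    "Family": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "Author": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
})

# Separate features and target before splitting.
X = toy.drop("Family", axis=1)
y = toy["Family"]

# random_state fixes the shuffle; the value 0 is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

Reusing the same seed returns the same rows in each subset, so accuracy figures can be compared across runs.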

### Random Forest Classifier

In [5]:
# Fit a 100-tree random forest and score it on the held-out test set
RF = RandomForestClassifier(n_estimators=100)
RF.fit(train_x, train_y)
y_pred = RF.predict(test_x)
RF_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", RF_accuracy)

Accuracy: 0.6169041450777202


### Decision Tree Classifier

In [6]:
# Fit a decision tree with default settings and score it on the test set
DT = DecisionTreeClassifier()
DT.fit(train_x, train_y)
y_pred = DT.predict(test_x)
DT_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", DT_accuracy)

Accuracy: 0.7684585492227979


### K-Nearest Neighbors Classification

In [7]:
# First attempt: k = 7 neighbors
KNN = KNeighborsClassifier(n_neighbors=7)
KNN.fit(train_x, train_y)
y_pred = KNN.predict(test_x)
KNN_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", KNN_accuracy)

Accuracy: 0.47668393782383417


In [8]:
# Second attempt: k = 5 neighbors
KNN = KNeighborsClassifier(n_neighbors=5)
KNN.fit(train_x, train_y)
y_pred = KNN.predict(test_x)
KNN_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", KNN_accuracy)

Accuracy: 0.4948186528497409


In [9]:
# Third attempt: k = 3 neighbors; KNN_accuracy keeps this value for the final comparison
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(train_x, train_y)
y_pred = KNN.predict(test_x)
KNN_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", KNN_accuracy)

Accuracy: 0.5103626943005182
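
The three runs above differ only in `n_neighbors`, and `KNN_accuracy` ends up holding whichever cell ran last. A minimal sketch of the same search written as a loop, which keeps the best score explicitly. It uses synthetic data from `make_classification` as a stand-in for the plant dataset, and the names `knn_scores` and `best_k` are illustrative, not from the original notebook.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; the notebook uses dataset_plants.csv).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one model per candidate k and record its test accuracy.
knn_scores = {}
for k in (3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_x, train_y)
    knn_scores[k] = metrics.accuracy_score(test_y, knn.predict(test_x))

# Keep the best-scoring k explicitly rather than whichever cell ran last.
best_k = max(knn_scores, key=knn_scores.get)
KNN_accuracy = knn_scores[best_k]
print(best_k, KNN_accuracy)
```

Choosing `k` on the test set makes the reported accuracy somewhat optimistic; a separate validation split or cross-validation would give a cleaner estimate.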


### Support Vector Machines

In [10]:
# Fit a linear support vector classifier (primal formulation) and score it on the test set
SVM = LinearSVC(dual=False)
SVM.fit(train_x, train_y)
y_pred = SVM.predict(test_x)
SVM_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", SVM_accuracy)

Accuracy: 0.17260362694300518
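
The features in this dataset are integer category codes, and a linear model treats such codes as ordered magnitudes. That may contribute to the low score above, although the notebook does not test this. The sketch below uses hypothetical toy data in which the class depends on the category but not on its numeric order. It compares `LinearSVC` on the raw codes with `LinearSVC` behind a `OneHotEncoder`. The data and the names `raw_svm` and `onehot_svm` are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Hypothetical toy data: one categorical code 0..5, ten rows per code.
# The class alternates with the code, so it is not a linear function of it.
codes = np.repeat(np.arange(6), 10).reshape(-1, 1)
labels = codes.ravel() % 2

# Linear SVM on the raw integer codes (training accuracy, for illustration only).
raw_svm = LinearSVC(dual=False).fit(codes, labels)
raw_acc = raw_svm.score(codes, labels)

# Same model with each code expanded into its own indicator column.
onehot_svm = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LinearSVC(dual=False),
)
onehot_svm.fit(codes, labels)
onehot_acc = onehot_svm.score(codes, labels)

print(raw_acc, onehot_acc)
```

Whether one-hot encoding helps on the plant data is an open question: columns with many distinct values produce very wide, sparse inputs, so this is a direction to try, not a guaranteed fix.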


### Gaussian Naive Bayes

In [11]:
# Fit a Gaussian naive Bayes model and score it on the test set
GNB = GaussianNB()
GNB.fit(train_x, train_y)
y_pred = GNB.predict(test_x)
GNB_accuracy = metrics.accuracy_score(test_y, y_pred)
print("Accuracy:", GNB_accuracy)

Accuracy: 0.18879533678756477


<h4> Finding the best model </h4>

In [12]:
# Collect each model's test accuracy and rank the models from best to worst
best_model = pd.DataFrame({
    'Model': ['Random Forest Classifier', 'Decision Tree', 'K-Nearest Neighbors',
              'Support Vector Machines', 'Gaussian Naive Bayes'],
    'Score': [RF_accuracy, DT_accuracy, KNN_accuracy, SVM_accuracy, GNB_accuracy]})
best_model = best_model.sort_values(by='Score', ascending=False).reset_index(drop=True)
best_model

|   | Model | Score |
|---|---|---|
| 0 | Decision Tree | 0.768459 |
| 1 | Random Forest Classifier | 0.616904 |
| 2 | K-Nearest Neighbors | 0.510363 |
| 3 | Gaussian Naive Bayes | 0.188795 |
| 4 | Support Vector Machines | 0.172604 |
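
The ranking above rests on a single random 80/20 split, so it can shift from run to run. One way to get a steadier comparison is k-fold cross-validation, sketched below with `cross_val_score`. The sketch uses synthetic data from `make_classification` in place of the plant dataset; the fold count, the seeds and the name `cv_ranking` are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; substitute the plant features and 'Family').
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

models = {
    'Random Forest Classifier': RandomForestClassifier(n_estimators=100, random_state=0),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=3),
    'Support Vector Machines': LinearSVC(dual=False),
    'Gaussian Naive Bayes': GaussianNB(),
}

# Mean accuracy over 5 folds for each model.
rows = [{'Model': name, 'Score': cross_val_score(model, X, y, cv=5).mean()}
        for name, model in models.items()]
cv_ranking = (pd.DataFrame(rows)
              .sort_values(by='Score', ascending=False)
              .reset_index(drop=True))
print(cv_ranking)
```

Stratified folds need several examples of each class, so on the real data any rare families may need special handling before this approach applies.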


<h4> On this train-test split, the Decision Tree Classifier achieved the highest test accuracy (about 0.77) for predicting the family of the plant.</h4>