## Importing Packages

In [25]:
import numpy as np

In [26]:
import pandas as pd

## Read Data from CSV to Dataframe

In [27]:
iris_original = pd.read_csv("iris.csv")
iris_original2 = iris_original

## Check the attributes whether there are any null values or not

In [28]:
iris_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 5.9 KB


In [29]:
iris_original.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [30]:
iris_original.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


Count - There are 150 elements in each attribute.
mean - Mean of all values in each attribute
std - Standard deviation per attribute
min,max - Minimum and Maximum value(out of 150) for each
25% - % of values which are below a certain value in the data set
50% - % of values which are below a certain value in the data set
75% - % of values which are below a certain value in the data set

In [31]:
iris_original.shape

(150, 5)

This shows that all the rows from the CSV file have been successfully imported into our dataframe 'iris_original'.
There are no null values in any of the attributes.

## Splitting the parent Data set into Train and Test

In [32]:
from sklearn.model_selection import train_test_split

The function 'train_test_split' returns 2 data frames based on the inputs given.


In [33]:
train_set, test_set = train_test_split(iris_original,test_size=0.2,random_state=42)

The above statement illustrates that we shall get 'train_set' and 'test_set' in response from the function 'train_test_split'. The parameters passed to the function are as follows: original dataframe, size of the test dataframe post splitting, here 0.2 specifies 20% meaning 20% values of the original dataset will be moved to the test_set and 80% to the train_set.

In [34]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 120 entries, 22 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  120 non-null    float64
 1   sepal_width   120 non-null    float64
 2   petal_length  120 non-null    float64
 3   petal_width   120 non-null    float64
 4   class         120 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 5.6 KB


test_set.info()

In [35]:
print(train_set.shape)
print(test_set.shape)

(120, 5)
(30, 5)


This output indicates that we have successfully split the original data into 2 parts - Test with 30 values which will be used to test our model post development and 120 values in Train , which will be used to train our model.

In [36]:
iris_original = train_set.drop('class',axis=1)
iris_labels = train_set['class'].copy()

## Analyze the dataset for any correlations

In [37]:
iris_corr = iris_original2.corr()

In [38]:
iris_corr

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
sepal_length,1.0,-0.11757,0.871754,0.817941,0.782561
sepal_width,-0.11757,1.0,-0.42844,-0.366126,-0.426658
petal_length,0.871754,-0.42844,1.0,0.962865,0.949035
petal_width,0.817941,-0.366126,0.962865,1.0,0.956547
class,0.782561,-0.426658,0.949035,0.956547,1.0


In [39]:
iris_corr['class'].sort_values(ascending=False)

class           1.000000
petal_width     0.956547
petal_length    0.949035
sepal_length    0.782561
sepal_width    -0.426658
Name: class, dtype: float64

We have the correlation coefficients as above. A typical correlation coefficient will have values that revolve from '-1' to '1'.
petal_width has value 0.956 which is considered very high. This indicates petal_width is directly related to the type of flower.
petal_length has value 0.949 which is considered very high. This indicates petal_width is directly related to the type of flower.
sepal_length has value 0.783 which is considered very high. This indicates petal_width is directly related to the type of flower.
sepal_width has negative value -0.426 which is considered very low. This indicates sepal_width is not directly related to the type of flower.

## Creating a modelling pipeline


In [40]:
from sklearn.pipeline import Pipeline

In [41]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

Primarily 2 types of feature scaling methods:

    Min-max scaling (Normalization)

Formula = value-min/(max-min) Sklearn provides a class for this called as MinMaxScaler

    Standardisation

Formula = value - mean / std dev Sklearn provides a class called StandardScaler

In [42]:
my_pipeline = Pipeline([
    
    ('imputer',SimpleImputer(strategy="median")),
    ('stand_scale',StandardScaler())
])

What is imputer ?
Imputer is used to substitute any blank/missing value in our data with the value specified in function SimpleImputer. In our case we have mentioned the strategy as 'median', which indicates if there are any missing value in any attribute, that value will be replaced by the median of that attribute.

In [43]:
iris_transformed = my_pipeline.fit_transform(iris_original)

1.Estimators - 

It estimates some paramter based on data set. Eg. Imputer

It has a fit method and transform method. Fit - Fits the dataset and calculates internal parameters.

2.Transformers -

Transform method takes input and gives output based on learnings from fit().It also has a convinience function called fit_transform() which fits and then transforms.

3.Predictors - LinearRegression model is example. fit and predict() are 2 common functions. It also gives score() which will evaluate the predictions.


In [44]:
iris_transformed

array([[-1.47393679,  1.20365799, -1.56253475, -1.31260282],
       [-0.13307079,  2.99237573, -1.27600637, -1.04563275],
       [ 1.08589829,  0.08570939,  0.38585821,  0.28921757],
       [-1.23014297,  0.75647855, -1.2187007 , -1.31260282],
       [-1.7177306 ,  0.30929911, -1.39061772, -1.31260282],
       [ 0.59831066, -1.25582892,  0.72969227,  0.95664273],
       [ 0.72020757,  0.30929911,  0.44316389,  0.4227026 ],
       [-0.74255534,  0.98006827, -1.27600637, -1.31260282],
       [-0.98634915,  1.20365799, -1.33331205, -1.31260282],
       [-0.74255534,  2.32160658, -1.27600637, -1.44608785],
       [-0.01117388, -0.80864948,  0.78699794,  0.95664273],
       [ 0.23261993,  0.75647855,  0.44316389,  0.55618763],
       [ 1.08589829,  0.08570939,  0.55777524,  0.4227026 ],
       [-0.49876152,  1.87442714, -1.39061772, -1.04563275],
       [-0.49876152,  1.4272477 , -1.27600637, -1.31260282],
       [-0.37686461, -1.47941864, -0.01528151, -0.24472256],
       [ 0.59831066, -0.

The output is a numpy array

The new array is scaled and tranformed with the above conditions. In order to convert the array into a dataframe , we use the below functions:

In [45]:
iris_transformed_df = pd.DataFrame(iris_transformed)

In [46]:
iris_transformed_df

Unnamed: 0,0,1,2,3
0,-1.473937,1.203658,-1.562535,-1.312603
1,-0.133071,2.992376,-1.276006,-1.045633
2,1.085898,0.085709,0.385858,0.289218
3,-1.230143,0.756479,-1.218701,-1.312603
4,-1.717731,0.309299,-1.390618,-1.312603
...,...,...,...,...
115,0.354517,-0.585060,0.156636,0.155733
116,-1.108246,-1.255829,0.443164,0.689673
117,-0.011174,2.098017,-1.447923,-1.312603
118,-0.011174,-1.032239,0.156636,0.022248


## Selecting a desired model for Training

In [114]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier

In [115]:
#model = LinearRegression()
#model= DecisionTreeRegressor()
model = RandomForestClassifier()

In [117]:
model.fit(iris_transformed,iris_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Training the model with 2 sets of data: iris_transformed - input data sets & iris_labels - output data related to the input data

## Using samples from original data to test the model

In [118]:
some_data=iris_original.iloc[:5]
some_labels =iris_labels.iloc[:5]

passing 'some_data' to the pipeline

In [119]:
prep_data = my_pipeline.fit_transform(some_data)

In [120]:
model.predict(prep_data)

array([0, 1, 2, 0, 0], dtype=int64)

In the above case, we have extracted the predicted output from the data model

In [121]:
list(some_labels)

[0, 0, 1, 0, 0]

The output list(some_labels) is the expected output for the input some_data based on original dataset.
The ouput model.predict(prep_data) is the output received from model.

## Evaluating the model

In [122]:
from sklearn.metrics import mean_squared_error

Passing the entire training data to generate output and saving it in 'iris_preictions'

In [123]:
iris_predictions = model.predict(iris_transformed)

In [124]:
mse = mean_squared_error(iris_predictions,iris_labels)

In [125]:
rmse = np.sqrt(mse)

In [126]:
rmse

0.0

RMSE - root mean squared error which is the root of mean squared error between actual iris labels and predicted value generated from model.

The RMSE value is 0.2215 which indicates the model is behaving correctly and giving accurate response.

## Using cross validation - to evaluate our model

Cross-validation uses subset of our testing data to evaluate the model

In [127]:
from sklearn.model_selection import cross_val_score

In [128]:
scores = cross_val_score(model,iris_transformed,iris_labels,scoring="neg_mean_squared_error",cv=10)

In [129]:
def print_scores(scores):
    print("Score are:",scores)
    print("Mean is:",np.mean(scores))
    print("Standard Deviation is:",np.std(scores))

In [130]:
print_scores(scores)

Score are: [-0.08333333 -0.         -0.08333333 -0.         -0.33333333 -0.16666667
 -0.         -0.         -0.         -0.08333333]
Mean is: -0.075
Standard Deviation is: 0.10172129679778084


## Testing the model with Test data

In [131]:
X_test = test_set.drop('class',axis=1)

In [132]:
Y_test = test_set['class'].copy()

In [133]:
X_test_pred = my_pipeline.fit_transform(X_test)

In [134]:
final_pred_iris = model.predict(X_test_pred)

In [135]:
final_mse = mean_squared_error(final_pred_iris,Y_test)

In [136]:
final_rmse = np.sqrt(final_mse)

In [137]:
final_rmse

0.18257418583505536

## Saving the model

In [138]:
from joblib import dump,load
dump(model,"Iris_akash.joblib")

['Iris_akash.joblib']