In a multiclassification problem the limitations like, unbalanced data and more number of classes are present, which can be addressed using different approaches like SMOTE algorithm, ONE VS One/ One Vs REST binary classifiers, and Linear discriminant analysis.
First the issue of unbalanced data is analysed by building models on the unbalanced data and then balancing the data using SMOTE algorithm and then the performance of the models on the balanced data is verified.

In [1]:
#import sklearn as sk
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,precision_score, recall_score, f1_score


In [2]:
data= pd.read_csv("train_CloudCondition.csv",low_memory=False)

In [3]:
#The temperature column is imported as datatype object so if any invalid values are present in the data need to be checked.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71428 entries, 0 to 71427
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Day                         71428 non-null  int64  
 1   Cloud_Condition             71428 non-null  object 
 2   Rain_OR_SNOW                71313 non-null  object 
 3   Temperature (C)             71176 non-null  object 
 4   Apparent Temperature (C)    71425 non-null  float64
 5   Humidity                    71427 non-null  float64
 6   Wind Speed (km/h)           71426 non-null  float64
 7   Wind Bearing (degrees)      71391 non-null  float64
 8   Visibility (km)             71408 non-null  float64
 9   Pressure (millibars)        71363 non-null  float64
 10  Condensation                71428 non-null  object 
 11  Solar irradiance intensity  71428 non-null  int64  
dtypes: float64(6), int64(2), object(4)
memory usage: 6.5+ MB


In [4]:
data.columns

Index(['Day', 'Cloud_Condition', 'Rain_OR_SNOW', 'Temperature (C)',
       'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
       'Wind Bearing (degrees)', 'Visibility (km)', 'Pressure (millibars)',
       'Condensation', 'Solar irradiance intensity'],
      dtype='object')

In [5]:
#To check if any special characters or invalid values are present in the data
#There were some invalid values like '-' in the "Temperature" column and no invalid values in other columns.
print(data.Day.unique(),data.Cloud_Condition.unique(),data.Rain_OR_SNOW.unique(),data["Temperature (C)"].unique(), 
     data["Apparent Temperature (C)"].unique(),data.Humidity.unique(),data["Wind Speed (km/h)"].unique(),
      data["Wind Bearing (degrees)"].unique(),data["Visibility (km)"].unique(),data["Pressure (millibars)"].unique(),
      data.Condensation.unique(),data["Solar irradiance intensity"].unique())

[    1     2     3 ... 79998 79999 80000] ['Partly Cloudy' 'Light Rain' 'Breezy and Dry' 'Overcast' 'Foggy'
 'Breezy and Mostly Cloudy' 'Clear' 'Breezy and Partly Cloudy'
 'Breezy and Overcast' 'Humid and Mostly Cloudy' 'Mostly Cloudy'
 'Humid and Partly Cloudy' 'Windy and Foggy' 'Windy and Overcast'
 'Breezy and Foggy' 'Windy and Partly Cloudy' 'Breezy'
 'Dry and Partly Cloudy' 'Windy and Mostly Cloudy'
 'Dangerously Windy and Partly Cloudy' 'Dry' 'Windy' 'Humid and Overcast'
 'Drizzle' 'Windy and Dry' 'Dry and Mostly Cloudy'] ['rain' 'snow' nan] ['-13' '15' '33' '30' '27' '-17' '-5' '-14' '10' '7' '9' '20' '3' '29'
 '-8' '-15' '-20' '36' '32' '6' '17' '28' '-21' '23' '-4' '25' '-7' '16'
 '39' '-1' '13' '35' '22' '12' '14' '1' '8' '-3' '38' '5' '37' '-10' '19'
 '34' '26' '0' '24' '11' '21' '-9' '-2' '4' '-19' '-6' '2' '-16' '-11'
 '18' '31' '-18' '-12' nan '-'] [-19.   5. -12.  36.  30.  33.  21.  -1. -15.  35. -24.  22.  28. -27.
   7.   2. -14.  13. -25. -26.   0.  27.   1.  29.  17

In [6]:
#Renaming the columns to replace space with underscore.
data.rename(columns={'Temperature (C)' : "Temperature_(C)",'Apparent Temperature (C)' : 'Apparent_Temperature_(C)',
                     'Wind Speed (km/h)' : 'Wind_Speed_(km/h)','Wind Bearing (degrees)' : 'Wind_Bearing_(degrees)', 
                     'Visibility (km)' : 'Visibility_(km)', 'Pressure (millibars)' : 'Pressure_(millibars)',
       'Solar irradiance intensity' : 'Solar_irradiance_intensity'}, inplace = True)
data

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
0,1,Partly Cloudy,rain,-13,-19.0,0.134364,17.0,68.0,4.0,1008.0,Frost,1068
1,2,Partly Cloudy,rain,15,5.0,0.847434,8.0,291.0,2.0,1036.0,Frost,1291
2,3,Partly Cloudy,rain,33,-12.0,0.763775,32.0,32.0,8.0,1004.0,Dry,1433
3,4,Partly Cloudy,snow,30,36.0,0.255069,15.0,130.0,3.0,1016.0,Dry,1410
4,5,Partly Cloudy,snow,27,30.0,0.495435,63.0,60.0,15.0,1007.0,Fog,1391
...,...,...,...,...,...,...,...,...,...,...,...,...
71423,79996,Foggy,rain,39,31.0,0.243553,19.0,347.0,14.0,1013.0,Frost,1269
71424,79997,Foggy,rain,8,4.0,0.913108,1.0,101.0,8.0,1031.0,Dry,1224
71425,79998,Mostly Cloudy,rain,28,-22.0,0.496076,2.0,149.0,7.0,1032.0,Frost,1463
71426,79999,Mostly Cloudy,rain,-16,-3.0,0.783161,44.0,266.0,11.0,1019.0,Fog,1251


In [7]:
#Checking the null values in the temperature column before replacing the invalid values
data["Temperature_(C)"].isnull().sum()

252

In [8]:
#Replacing the invalid values as null values
data["Temperature_(C)"].replace("-", np.nan, inplace=True)

In [9]:
#To check if the invalid value is replaced
data["Temperature_(C)"].unique()

array(['-13', '15', '33', '30', '27', '-17', '-5', '-14', '10', '7', '9',
       '20', '3', '29', '-8', '-15', '-20', '36', '32', '6', '17', '28',
       '-21', '23', '-4', '25', '-7', '16', '39', '-1', '13', '35', '22',
       '12', '14', '1', '8', '-3', '38', '5', '37', '-10', '19', '34',
       '26', '0', '24', '11', '21', '-9', '-2', '4', '-19', '-6', '2',
       '-16', '-11', '18', '31', '-18', '-12', nan], dtype=object)

In [10]:
#to check if the number of null values is increased.
data["Temperature_(C)"].isnull().sum()

253

In [11]:
#Now the datatype of temparature column can be changed to float.
data['Temperature_(C)'] = data['Temperature_(C)'].astype('float64')

In [12]:
#The data is about weather conditions with 11 input variables which are the different conditions that affect the weather. 
#target variable is cloud_condition which is classified into 26 different classes.
data.head()

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
0,1,Partly Cloudy,rain,-13.0,-19.0,0.134364,17.0,68.0,4.0,1008.0,Frost,1068
1,2,Partly Cloudy,rain,15.0,5.0,0.847434,8.0,291.0,2.0,1036.0,Frost,1291
2,3,Partly Cloudy,rain,33.0,-12.0,0.763775,32.0,32.0,8.0,1004.0,Dry,1433
3,4,Partly Cloudy,snow,30.0,36.0,0.255069,15.0,130.0,3.0,1016.0,Dry,1410
4,5,Partly Cloudy,snow,27.0,30.0,0.495435,63.0,60.0,15.0,1007.0,Fog,1391


In [13]:
#From the below record counts we can see that the data is unbalanced with unequal number of records for each class.
data['Cloud_Condition'].value_counts()

Mostly Cloudy                          22017
Partly Cloudy                          17613
Overcast                               13612
Clear                                   9719
Foggy                                   5900
Breezy and Dry                           656
Breezy and Mostly Cloudy                 473
Breezy and Overcast                      454
Breezy and Partly Cloudy                 350
Light Rain                               214
Dry and Partly Cloudy                     86
Windy and Partly Cloudy                   63
Breezy                                    45
Windy and Overcast                        43
Dry                                       34
Breezy and Foggy                          34
Humid and Mostly Cloudy                   32
Windy and Mostly Cloudy                   32
Humid and Partly Cloudy                   17
Dry and Mostly Cloudy                     14
Windy                                      5
Humid and Overcast                         5
Drizzle   

In [14]:
#From the info of the data there are null records in some coulmns.
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71428 entries, 0 to 71427
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Day                         71428 non-null  int64  
 1   Cloud_Condition             71428 non-null  object 
 2   Rain_OR_SNOW                71313 non-null  object 
 3   Temperature_(C)             71175 non-null  float64
 4   Apparent_Temperature_(C)    71425 non-null  float64
 5   Humidity                    71427 non-null  float64
 6   Wind_Speed_(km/h)           71426 non-null  float64
 7   Wind_Bearing_(degrees)      71391 non-null  float64
 8   Visibility_(km)             71408 non-null  float64
 9   Pressure_(millibars)        71363 non-null  float64
 10  Condensation                71428 non-null  object 
 11  Solar_irradiance_intensity  71428 non-null  int64  
dtypes: float64(7), int64(2), object(3)
memory usage: 6.5+ MB


In [15]:
#To check the number of null records in each column
data.isnull().sum()

Day                             0
Cloud_Condition                 0
Rain_OR_SNOW                  115
Temperature_(C)               253
Apparent_Temperature_(C)        3
Humidity                        1
Wind_Speed_(km/h)               2
Wind_Bearing_(degrees)         37
Visibility_(km)                20
Pressure_(millibars)           65
Condensation                    0
Solar_irradiance_intensity      0
dtype: int64

In [16]:
#Missing values could be imputed with mean/median values in numerical features and mode values in categorical features.
#But it should be done based on the preexisting knowledge on the subject if its done without any preunderstanding 
#it might affect the quality of the data, so to avoid that, the null value records are removed here which is about 490 records,
#though it might result in loss of data but since it is less compared to the total records here it might not affect much.
data=data.dropna(axis=0)

In [17]:
#To check if the removal of null value records has changed the percentage distribution of the class, as the class with less 
#records are not changed only the class with more records have been changed as a result of removing the null values.
data['Cloud_Condition'].value_counts()

Mostly Cloudy                          21860
Partly Cloudy                          17526
Overcast                               13506
Clear                                   9648
Foggy                                   5871
Breezy and Dry                           630
Breezy and Mostly Cloudy                 467
Breezy and Overcast                      452
Breezy and Partly Cloudy                 350
Light Rain                               209
Dry and Partly Cloudy                     86
Windy and Partly Cloudy                   63
Breezy                                    45
Windy and Overcast                        42
Dry                                       34
Breezy and Foggy                          34
Humid and Mostly Cloudy                   32
Windy and Mostly Cloudy                   32
Humid and Partly Cloudy                   17
Dry and Mostly Cloudy                     14
Windy                                      5
Humid and Overcast                         5
Drizzle   

The data with the unbalalnced distribution of classes is used for model building.
Using algorithms like KNN, Decision tree, SVM, Random Forest and logistic regression.


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70938 entries, 0 to 71427
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Day                         70938 non-null  int64  
 1   Cloud_Condition             70938 non-null  object 
 2   Rain_OR_SNOW                70938 non-null  object 
 3   Temperature_(C)             70938 non-null  float64
 4   Apparent_Temperature_(C)    70938 non-null  float64
 5   Humidity                    70938 non-null  float64
 6   Wind_Speed_(km/h)           70938 non-null  float64
 7   Wind_Bearing_(degrees)      70938 non-null  float64
 8   Visibility_(km)             70938 non-null  float64
 9   Pressure_(millibars)        70938 non-null  float64
 10  Condensation                70938 non-null  object 
 11  Solar_irradiance_intensity  70938 non-null  int64  
dtypes: float64(7), int64(2), object(3)
memory usage: 7.0+ MB


In [19]:
data.describe(include = "all")

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
count,70938.0,70938,70938,70938.0,70938.0,70938.0,70938.0,70938.0,70938.0,70938.0,70938,70938.0
unique,,26,2,,,,,,,,4,
top,,Mostly Cloudy,rain,,,,,,,,Mist,
freq,,21860,61992,,,,,,,,21531,
mean,37239.108997,,,8.958499,5.553469,0.501019,31.526939,179.205658,8.015027,1022.979249,,1249.685162
std,22185.777966,,,17.631349,19.09273,0.289413,18.509038,103.733352,4.908782,13.559313,,144.827495
min,1.0,,,-21.0,-27.0,1.9e-05,0.0,0.0,0.0,1000.0,,1000.0
25%,18443.25,,,-6.0,-11.0,0.249236,16.0,89.0,4.0,1011.0,,1124.0
50%,36177.5,,,9.0,5.0,0.501566,31.0,180.0,8.0,1023.0,,1249.0
75%,53989.75,,,24.0,22.0,0.751925,48.0,269.0,12.0,1035.0,,1375.0


In [20]:
#To label encode the categorical varaibles in the target column to nominal variable.
from sklearn.preprocessing import LabelEncoder

In [21]:
    Train_data = data.copy()
    label_ec = LabelEncoder()
    label_ec.fit(list(Train_data['Cloud_Condition'].values))
    Train_data['Cloud_Condition'] = label_ec.transform(list(Train_data['Cloud_Condition'].values))

In [22]:
Train_data.head()

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
0,1,19,rain,-13.0,-19.0,0.134364,17.0,68.0,4.0,1008.0,Frost,1068
1,2,19,rain,15.0,5.0,0.847434,8.0,291.0,2.0,1036.0,Frost,1291
2,3,19,rain,33.0,-12.0,0.763775,32.0,32.0,8.0,1004.0,Dry,1433
3,4,19,snow,30.0,36.0,0.255069,15.0,130.0,3.0,1016.0,Dry,1410
4,5,19,snow,27.0,30.0,0.495435,63.0,60.0,15.0,1007.0,Fog,1391


In [23]:
#To do one hot coding of the categorical input varaibles.
Train_data=pd.get_dummies(Train_data)

In [24]:
Train_data.head()

Unnamed: 0,Day,Cloud_Condition,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Solar_irradiance_intensity,Rain_OR_SNOW_rain,Rain_OR_SNOW_snow,Condensation_Dry,Condensation_Fog,Condensation_Frost,Condensation_Mist
0,1,19,-13.0,-19.0,0.134364,17.0,68.0,4.0,1008.0,1068,1,0,0,0,1,0
1,2,19,15.0,5.0,0.847434,8.0,291.0,2.0,1036.0,1291,1,0,0,0,1,0
2,3,19,33.0,-12.0,0.763775,32.0,32.0,8.0,1004.0,1433,1,0,1,0,0,0
3,4,19,30.0,36.0,0.255069,15.0,130.0,3.0,1016.0,1410,0,1,1,0,0,0
4,5,19,27.0,30.0,0.495435,63.0,60.0,15.0,1007.0,1391,0,1,0,1,0,0


In [25]:
#Not splitting into train and test as the records for some clasess are very less, then splitting might not result in correct
#distribution of the clasess.
Train_X = Train_data.drop(['Cloud_Condition'], axis = 1).copy()
Train_Y = Train_data['Cloud_Condition'].copy()

print(Train_X.shape)
print(Train_Y.shape)

(70938, 15)
(70938,)


In [26]:
###################
# Standardization
###################

from sklearn.preprocessing import StandardScaler

Scaling = StandardScaler().fit(Train_X)
Train_X_Std = Scaling.transform(Train_X) # This step standardizes the train input data


# Add the column names to Train_X_Std
Train_X_Std = pd.DataFrame(Train_X_Std, columns = Train_X.columns)

In [27]:
###################
#KNN - As the KNN algorithm works well and gives good accuracy with most of the classification problems, it is used in this case.
###################

KNN_Model_Def = KNeighborsClassifier(n_neighbors=3) # no of k is taken 3 as above this the accuracy is decreased.
KNN_Model_Fit = KNN_Model_Def.fit(Train_X_Std, Train_Y)

###################
# Model prediction
###################

# Class Prediction
Pred_KNN = KNN_Model_Fit.predict(Train_X_Std)


In [28]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_KNN)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_KNN,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_KNN,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_KNN,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 


Accuracy: 56.95395979587809
Precision: 0.569539597958781
Recall: 0.569539597958781
F1-Score: 0.569539597958781


In [29]:
###################
# DT- A model with decision tree algorithm is built to check if there is any improvement in the accuracy.
###################

DT_Model_Def = DecisionTreeClassifier(random_state=123)
DT_Model_Fit = DT_Model_Def.fit(Train_X_Std, Train_Y)

###################
# Model prediction
###################

# Class Prediction
Pred_DT = DT_Model_Fit.predict(Train_X_Std)

In [30]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_DT)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_DT,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_DT,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_DT,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 100.0
Precision: 1.0
Recall: 1.0
F1-Score: 1.0


In [31]:
###################
# Random Forest -Now trying with an ensemble method to see if it increases the accuracy
###################

RF_Model_Def = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
RF_Model_Fit = RF_Model_Def.fit(Train_X_Std, Train_Y)

###################
# Model prediction
###################

# Class Prediction
Pred_RF = RF_Model_Fit.predict(Train_X_Std)

In [32]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_RF)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_RF,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_RF,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_RF,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 


Accuracy: 98.89058050692154
Precision: 0.9889058050692153
Recall: 0.9889058050692153
F1-Score: 0.9889058050692153


Though Decision tree and Random forest algorithms have given a better accuracy, we cannot decide the models are good as there may be overfitting since the model is trained and predicted on the same data.
Other algorithm KNN has given a poor accuracy.

So to increase the performance of the model the data need to be balanced, which can be achieved using the algorithm SMOTE to augment the dataset with artificial data by adding more data to the classes with lesser samples so that all the classes will have equal number of samples.

To apply smote for oversampling, the minimum number of records needed for each class is two, but in the data there are two classes having only one record each. So before applying smote, manually one record for each of the minimum class is added similar to the existing data.


In [33]:
data[data["Cloud_Condition"] == "Dangerously Windy and Partly Cloudy"]

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
11763,12182,Dangerously Windy and Partly Cloudy,rain,16.0,14.0,0.829716,59.0,269.0,11.0,1041.0,Frost,1043


In [34]:
Newdata1 = {'Day' : 80001 , 'Cloud_Condition' : 'Dangerously Windy and Partly Cloudy', 'Rain_OR_SNOW' : 'rain' , 
          'Temperature_(C)' : 15.0, 'Apparent_Temperature_(C)' : 13.0, 'Humidity' : 0.79921, 'Wind_Speed_(km/h)' : 60,
       'Wind_Bearing_(degrees)' : 270 , 'Visibility_(km)' : 12.0, 'Pressure_(millibars)' : 1040.0,
       'Condensation' : 'Frost', 'Solar_irradiance_intensity' : 1044}

In [35]:
New_data = data.append(Newdata1, ignore_index = True) 

In [36]:
New_data.tail()

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
70934,79997,Foggy,rain,8.0,4.0,0.913108,1.0,101.0,8.0,1031.0,Dry,1224
70935,79998,Mostly Cloudy,rain,28.0,-22.0,0.496076,2.0,149.0,7.0,1032.0,Frost,1463
70936,79999,Mostly Cloudy,rain,-16.0,-3.0,0.783161,44.0,266.0,11.0,1019.0,Fog,1251
70937,80000,Mostly Cloudy,rain,-15.0,8.0,0.191555,38.0,154.0,6.0,1023.0,Fog,1258
70938,80001,Dangerously Windy and Partly Cloudy,rain,15.0,13.0,0.79921,60.0,270.0,12.0,1040.0,Frost,1044


In [37]:
data[data["Cloud_Condition"] == "Windy and Dry"] 

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
52706,53125,Windy and Dry,rain,6.0,-7.0,0.923588,25.0,45.0,4.0,1028.0,Dry,1485


In [38]:
Newdata2 = {'Day' : 80002 , 'Cloud_Condition' : 'Windy and Dry', 'Rain_OR_SNOW' : 'rain' , 
          'Temperature_(C)' : 7.0, 'Apparent_Temperature_(C)' : -8.0, 'Humidity' : 0.89921, 'Wind_Speed_(km/h)' : 26.0,
       'Wind_Bearing_(degrees)' : 46.0 , 'Visibility_(km)' : 5.0, 'Pressure_(millibars)' : 1030.0,
       'Condensation' : 'Dry', 'Solar_irradiance_intensity' : 1484}

In [39]:
New_data = New_data.append(Newdata2, ignore_index = True) 

In [40]:
New_data.tail()

Unnamed: 0,Day,Cloud_Condition,Rain_OR_SNOW,Temperature_(C),Apparent_Temperature_(C),Humidity,Wind_Speed_(km/h),Wind_Bearing_(degrees),Visibility_(km),Pressure_(millibars),Condensation,Solar_irradiance_intensity
70935,79998,Mostly Cloudy,rain,28.0,-22.0,0.496076,2.0,149.0,7.0,1032.0,Frost,1463
70936,79999,Mostly Cloudy,rain,-16.0,-3.0,0.783161,44.0,266.0,11.0,1019.0,Fog,1251
70937,80000,Mostly Cloudy,rain,-15.0,8.0,0.191555,38.0,154.0,6.0,1023.0,Fog,1258
70938,80001,Dangerously Windy and Partly Cloudy,rain,15.0,13.0,0.79921,60.0,270.0,12.0,1040.0,Frost,1044
70939,80002,Windy and Dry,rain,7.0,-8.0,0.89921,26.0,46.0,5.0,1030.0,Dry,1484


In [41]:
    New_Train_data = New_data.copy()
    label_ec = LabelEncoder()
    label_ec.fit(list(New_Train_data['Cloud_Condition'].values))
    New_Train_data['Cloud_Condition'] = label_ec.transform(list(New_Train_data['Cloud_Condition'].values))

In [42]:
#To do one hot coding of the categorical input varaibles.
New_Train_data=pd.get_dummies(New_Train_data)

In [43]:
New_Train_X = New_Train_data.drop(['Cloud_Condition'], axis = 1).copy()
New_Train_Y = New_Train_data['Cloud_Condition'].copy()

print(New_Train_X.shape)
print(New_Train_Y.shape)

(70940, 15)
(70940,)


In [44]:
###################
# Standardization
###################

from sklearn.preprocessing import StandardScaler

Scaling = StandardScaler().fit(New_Train_X)
New_Train_X_Std = Scaling.transform(New_Train_X) # This step standardizes the train input data

# Add the column names to Train_X_Std
New_Train_X_Std = pd.DataFrame(New_Train_X_Std, columns = New_Train_X.columns)

In [45]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(k_neighbors=1)
SM_TrainInput, SM_TrainOutput = sm.fit_resample(New_Train_X_Std, New_Train_Y)

In [46]:
print(SM_TrainInput.shape, SM_TrainOutput.shape)

(568360, 15) (568360,)


In [47]:
#Now all the classes have the same number of samples.
SM_TrainOutput.value_counts()

25    21860
24    21860
1     21860
2     21860
3     21860
4     21860
5     21860
6     21860
7     21860
8     21860
9     21860
10    21860
11    21860
12    21860
13    21860
14    21860
15    21860
16    21860
17    21860
18    21860
19    21860
20    21860
21    21860
22    21860
23    21860
0     21860
Name: Cloud_Condition, dtype: int64

In [48]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train, y_test = train_test_split(SM_TrainInput, SM_TrainOutput, train_size = 0.8, random_state = 123)
print("x_train ",x_train.shape)
print("x_test ",x_test.shape)
print("y_train ",y_train.shape)
print("y_test ",y_test.shape)

x_train  (454688, 15)
x_test  (113672, 15)
y_train  (454688,)
y_test  (113672,)


In [49]:
###################
# DT
###################

DT_Model_Def = DecisionTreeClassifier(random_state=123)
DT_Model_Fit = DT_Model_Def.fit(x_train,y_train)

###################
# Model prediction
###################

# Class Prediction
Pred_DT = DT_Model_Fit.predict(x_test)

In [50]:
Confusion_Mat = confusion_matrix(y_test, Pred_DT)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_DT,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_DT,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_DT,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 89.15564079104793
Precision: 0.8915564079104793
Recall: 0.8915564079104793
F1-Score: 0.8915564079104793


In [51]:
###################
# Random Forest -Now trying with an ensemble method to see if it increases the accuracy
###################

RF_Model_Def = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
RF_Model_Fit = RF_Model_Def.fit(x_train,y_train)

###################
# Model prediction
###################

# Class Prediction
Pred_RF = RF_Model_Fit.predict(x_test)

In [52]:
Confusion_Mat = confusion_matrix(y_test, Pred_RF)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_RF,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_RF,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_RF,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 91.63206418467169
Precision: 0.9163206418467169
Recall: 0.9163206418467169
F1-Score: 0.9163206418467169


In [53]:
###################
#KNN - As the KNN algorithm works well and gives good accuracy with most of the classification problems, it is used in this case.
###################

KNN_Model_Def = KNeighborsClassifier(n_neighbors=3) # no of k is taken 3 as above this the accuracy is decreased.
KNN_Model_Fit = KNN_Model_Def.fit(x_train,y_train)

###################
# Model prediction
###################

# Class Prediction
Pred_KNN = KNN_Model_Fit.predict(x_test)


In [54]:
Confusion_Mat = confusion_matrix(y_test, Pred_KNN)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_KNN,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_KNN,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_KNN,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 


Accuracy: 91.81328735308607
Precision: 0.9181328735308607
Recall: 0.9181328735308607
F1-Score: 0.9181328735308607


Though the accuracy score of Decision Tree and Random forest models seem to have decreased but previously it was predicted on the training data but now the prediction is done on a seperate test data, so it might seem the score is decreased but the model performance might have improved.
The performance of the model using KNN algorithm has improved a lot.
So the performance of the models have improved by using a balanced data.

Now the analysis is done on the One vs rest and One vs one classifier using SVM and logistic regression algorithms, on the unbalanced data.
As the OVR and OVO classifier works like a binary classifier the imbalance in the data should not affect the performance of the model. 
And by analysing the same model with the balanced data we can find if there is any improvement in the performance of the model because of the data balance.

In [55]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier

In [56]:
# Since SVM takes longer time to compute with a larger data, have done undersampling to reduce the data.
from imblearn.under_sampling import RandomUnderSampler
strategy = {0:200, 1:200, 2:200, 3:200, 4:200, 5:200, 6:200,7:200,8:200,9:200,10:200,11:200,12:200,13:200,14:200,
            15:200,16:200,17:200,18:200,19:200,20:200,21:200,22:200,23:200,24:200,25:200}

under = RandomUnderSampler(sampling_strategy =strategy, random_state = 123, replacement = False)
US_TrainInput, US_TrainOutput = under.fit_resample(SM_TrainInput, SM_TrainOutput)

In [57]:
print(US_TrainInput.shape, US_TrainOutput.shape)

(5200, 15) (5200,)


In [58]:
x_train,x_test,y_train, y_test = train_test_split(US_TrainInput, US_TrainOutput, train_size = 0.8, random_state = 123)
print("x_train ",x_train.shape)
print("x_test ",x_test.shape)
print("y_train ",y_train.shape)
print("y_test ",y_test.shape)

x_train  (4160, 15)
x_test  (1040, 15)
y_train  (4160,)
y_test  (1040,)


In [59]:
###################
# SVM
###################

SVM_Model_Def = SVC(decision_function_shape='ovo',kernel='linear')
SVM_UBModel_Fit = SVM_Model_Def.fit(Train_X_Std, Train_Y) #Model with unbalanced data
SVM_BModel_Fit = SVM_Model_Def.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBSVM = SVM_UBModel_Fit.predict(Train_X_Std)
Pred_BSVM = SVM_BModel_Fit.predict(x_test)


In [60]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 5.346922664862274
Precision: 0.05346922664862274
Recall: 0.05346922664862274
F1-Score: 0.05346922664862274


In [61]:
Confusion_Mat = confusion_matrix(y_test, Pred_BSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 47.30769230769231
Precision: 0.47307692307692306
Recall: 0.47307692307692306
F1-Score: 0.47307692307692306


In [59]:
###################
# SVM
###################

SVM_Model_Def = SVC(kernel='linear')
SVM_OVR_Model = OneVsRestClassifier(SVM_Model_Def)
SVM_UBModel_Fit = SVM_OVR_Model.fit(Train_X_Std, Train_Y) #Model with unbalanced data
SVM_BModel_Fit = SVM_OVR_Model.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBSVM = SVM_UBModel_Fit.predict(Train_X_Std)
Pred_BSVM = SVM_BModel_Fit.predict(x_test)


In [61]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 3.211254898643886
Precision: 0.03211254898643886
Recall: 0.03211254898643886
F1-Score: 0.03211254898643886


In [62]:
Confusion_Mat = confusion_matrix(y_test, Pred_BSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 29.71153846153846
Precision: 0.2971153846153846
Recall: 0.2971153846153846
F1-Score: 0.2971153846153846


The performance of One Vs one and One Vs rest classfier on the unbalanced data is very poor.
But comparatively the performance of the model with balanced data is more but still it is not good.
And also One Vs one is giving a better classification than One Vs rest classifier.
But the poor performance might be due to the wrong selection of the kernel option.
For the reason of less computation and faster convergence, have used the kernel as linear, but the other kernels like "polynomial or RBF" might give a better multi classification performance. 
Due to hardware limitations could not try them as it is taking longer time to compute and converge.

In [73]:
###################
# Logistic Regression
###################

LR_Model_Def = LogisticRegression(solver='lbfgs', max_iter=500)
LR_OVO_Model = OneVsOneClassifier(LR_Model_Def)
LR_UBModel_Fit = LR_OVO_Model.fit(Train_X_Std, Train_Y) #Model with unbalanced data
LR_BModel_Fit = LR_OVO_Model.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBLR = LR_UBModel_Fit.predict(Train_X_Std)
Pred_BLR = LR_BModel_Fit.predict(x_test)

In [74]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 4.611068820660295
Precision: 0.04611068820660295
Recall: 0.04611068820660295
F1-Score: 0.04611068820660295


In [75]:
Confusion_Mat = confusion_matrix(y_test, Pred_BLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 43.07692307692308
Precision: 0.4307692307692308
Recall: 0.4307692307692308
F1-Score: 0.43076923076923074


In [76]:
###################
# Logistic Regression
###################

LR_Model_Def = LogisticRegression(multi_class='ovr',solver='lbfgs',max_iter=500)
LR_UBModel_Fit = LR_Model_Def.fit(Train_X_Std, Train_Y) #Model with unbalanced data
LR_BModel_Fit = LR_Model_Def.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBLR = LR_UBModel_Fit.predict(Train_X_Std)
Pred_BLR = LR_BModel_Fit.predict(x_test)

In [77]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/Train_X.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 2.6628887197270856
Precision: 0.026628887197270856
Recall: 0.026628887197270856
F1-Score: 0.026628887197270856


In [78]:
Confusion_Mat = confusion_matrix(y_test, Pred_BLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 38.17307692307692
Precision: 0.3817307692307692
Recall: 0.3817307692307692
F1-Score: 0.38173076923076915


The performance of One Vs one and One Vs rest classfier on the unbalanced data is very poor.
But comparatively the performance of the model with balanced data is more but still it is not good.
And also One Vs one is giving a better classification than One Vs rest classifier.
But the poor performance might be due to the reason that the classes are not linearly seperable or there are more number of classes.
As logistic regression does not work well on a non-linear data and more number of classes.

Now will apply linear discriminant analysis on both the unbalanced and balanced data and verify if it affects the performance 
of the model

In [80]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(Train_X_Std, Train_Y) #Applying LDA on the unbalanced data
New_X_lda = lda.fit_transform(SM_TrainInput, SM_TrainOutput) # Applying LDA on the balanced data
Small_X_lda = lda.fit_transform(US_TrainInput, US_TrainOutput) # Applying LDA on the under sampled data

In [82]:
print(X_lda.shape,New_X_lda.shape,Small_X_lda.shape)

(70938, 13) (568360, 13) (5200, 13)


In [83]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train, y_test = train_test_split(New_X_lda, SM_TrainOutput, train_size = 0.8, random_state = 123)
print("x_train ",x_train.shape)
print("x_test ",x_test.shape)
print("y_train ",y_train.shape)
print("y_test ",y_test.shape)

x_train  (454688, 13)
x_test  (113672, 13)
y_train  (454688,)
y_test  (113672,)


In [85]:
###################
#KNN - As the KNN algorithm works well and gives good accuracy with most of the classification problems, it is used in this case.
###################

KNN_Model_Def = KNeighborsClassifier(n_neighbors=3) # no of k is taken 3 as above this the accuracy is decreased.
KNN_UBModel_Fit = KNN_Model_Def.fit(X_lda, Train_Y)
KNN_BModel_Fit = KNN_Model_Def.fit(x_train,y_train)

###################
# Model prediction
###################

# Class Prediction
Pred_UBKNN = KNN_UBModel_Fit.predict(X_lda)
Pred_BKNN = KNN_BModel_Fit.predict(x_test)

In [86]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBKNN)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/X_lda.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBKNN,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBKNN,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBKNN,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 17.18401984831825
Precision: 0.1718401984831825
Recall: 0.1718401984831825
F1-Score: 0.1718401984831825


In [87]:
Confusion_Mat = confusion_matrix(y_test, Pred_BKNN)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BKNN,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BKNN,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BKNN,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 91.82824266310085
Precision: 0.9182824266310086
Recall: 0.9182824266310086
F1-Score: 0.9182824266310086


Applying LDA over the unbalanced data has reduced the performance further.
But LDA on a balanced data has not impacted the performance much but only a very minimal increase is observed in the performance.

In [88]:
# LDA applied undersampled data for SVM and logistic regression
from sklearn.model_selection import train_test_split
x_train,x_test,y_train, y_test = train_test_split(Small_X_lda, US_TrainOutput, train_size = 0.8, random_state = 123)
print("x_train ",x_train.shape)
print("x_test ",x_test.shape)
print("y_train ",y_train.shape)
print("y_test ",y_test.shape)

x_train  (4160, 13)
x_test  (1040, 13)
y_train  (4160,)
y_test  (1040,)


In [89]:
###################
# SVM
###################

SVM_Model_Def = SVC(decision_function_shape='ovo',kernel='linear')
SVM_UBModel_Fit = SVM_Model_Def.fit(X_lda, Train_Y) #Model with unbalanced data
SVM_BModel_Fit = SVM_Model_Def.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBSVM = SVM_UBModel_Fit.predict(X_lda)
Pred_BSVM = SVM_BModel_Fit.predict(x_test)

In [90]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/X_lda.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 3.06464800248104
Precision: 0.030646480024810397
Recall: 0.030646480024810397
F1-Score: 0.030646480024810397


In [91]:
Confusion_Mat = confusion_matrix(y_test, Pred_BSVM)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BSVM,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BSVM,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BSVM,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 48.36538461538461
Precision: 0.48365384615384616
Recall: 0.48365384615384616
F1-Score: 0.48365384615384616


In [92]:
###################
# Logistic Regression
###################

LR_Model_Def = LogisticRegression(solver='lbfgs', max_iter=500)
LR_OVO_Model = OneVsOneClassifier(LR_Model_Def)
LR_UBModel_Fit = LR_OVO_Model.fit(X_lda, Train_Y) #Model with unbalanced data
LR_BModel_Fit = LR_OVO_Model.fit(x_train,y_train) #Model with balanced data

###################
# Model prediction
###################

# Class Prediction
Pred_UBLR = LR_UBModel_Fit.predict(X_lda)
Pred_BLR = LR_BModel_Fit.predict(x_test)

In [93]:
Confusion_Mat = confusion_matrix(Train_Y, Pred_UBLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/X_lda.shape[0]*100}")
print(f"Precision: {precision_score(Train_Y, Pred_UBLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(Train_Y, Pred_UBLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(Train_Y, Pred_UBLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 3.452310468296259
Precision: 0.03452310468296259
Recall: 0.03452310468296259
F1-Score: 0.03452310468296259


In [94]:
Confusion_Mat = confusion_matrix(y_test, Pred_BLR)

print(f"Accuracy: {sum(np.diagonal(Confusion_Mat))/x_test.shape[0]*100}")
print(f"Precision: {precision_score(y_test, Pred_BLR,average = 'micro')}") # Precision [Total Positives/(Total Predicted Positives)]
print(f"Recall: {recall_score(y_test, Pred_BLR,average = 'micro')}") # Recall (Also called TPR) [Total Positives/(Total Actual Positives)]
print(f"F1-Score: {f1_score(y_test, Pred_BLR,average ='micro')}") # F1-Score [2*Precision*Recall/(Precision + Recall)] 

Accuracy: 43.269230769230774
Precision: 0.4326923076923077
Recall: 0.4326923076923077
F1-Score: 0.4326923076923077


Performance model was built on LDA applied unbalanced and balanced data using One Vs One classifier of SVM and logistic regression.
Applying LDA over the unbalanced data has reduced the performance across all the three algorithms KNN, SVM and logistic regression.
Similarly applying LDA on a balanced data has improved the performance in all the three algorithms though it is very minimal that might be due to the limitations in the algorithms.
If the data quality is good and a correct algorithm is used LDA can improve the classification performance of the model.

Three approaches tried to improve the performance of a model when the data is unbalanced were,
1. Oversampling the data using Smote algorithm
2. One Vs One or One Vs Rest classifiers
3. Applying linear discriminant algorithm. 

Among them, the approach to balance the data seemed to improve the performance more than the other two methods.
If the data is balanced then the other techniques with correct tuning parameters can improve the performance further.
But if the data is imblanced applying any complex techniques might not give any improvement in the performance.