# Multiclass Classification using Logistic Regression and SVMs

In this assignment, you need to perform multiclass classification using Logistic Regression and SVMs. The dataset is provided to you. Please note the following:
1. Use the provided dataset, which is split into training, validation, and test sets (`Dataset-MultiClass-Train.csv`, `Dataset-MultiClass-Validate.csv`, `Dataset-MultiClass-Test.csv`).
2. You can use **LogisticRegression** and **SVC** from the `sklearn` package:
    - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
    
3. You might need to look into data balancing and the handling of categorical values.
4. (Cancelled in class) Show/plot the training and validation loss for all the experiments.
5. For both the validation and test sets:
    -  Show the confusion matrix
    -  Report accuracy, precision, recall, and F1 score
    
    The most important metrics in this problem are F1-Score and accuracy.

In [38]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Test data - One Hot Encoding

In [39]:
data = np.genfromtxt('Dataset-MultiClass-Test.csv', delimiter=',', dtype=None)
print(np.shape(data))

(401, 23)
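The shape above includes the header row, which is why the later cells slice with `data[1:, ...]`. As a hedged alternative, the sketch below reads a CSV with `pandas.read_csv`, which treats the first row as column names. The CSV text and column names are hypothetical stand-ins, not the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the real CSV; the column names are
# illustrative and are not taken from the dataset.
csv_text = "Gender,Age,Label\nFemale,18-30,1\nMale,31-40,2\n"

# dtype=str keeps every cell as a string, matching how the notebook treats the data
frame = pd.read_csv(io.StringIO(csv_text), dtype=str)

features = frame.iloc[:, :-1]   # every column except the last
labels = frame.iloc[:, -1]      # the last column holds the class label

print(frame.shape)              # the header row is not counted as data
print(list(features.columns))
```

With the real files, the `io.StringIO(...)` argument would be replaced by the file name, for example `'Dataset-MultiClass-Test.csv'`.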



In [40]:
gender = data[1:,0]
df = pd.DataFrame(list(zip(gender)), columns=['Gender'])
gend = pd.get_dummies(df.Gender, prefix='Gender')
print(gend.head(10))
gend = LabelBinarizer().fit_transform(df.Gender)
print(np.shape(gend))

   Gender_Female  Gender_Male
0              1            0
1              1            0
2              1            0
3              1            0
4              0            1
5              0            1
6              0            1
7              0            1
8              1            0
9              1            0
(400, 1)
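The shape `(400, 1)` differs from the two-column `get_dummies` table because `LabelBinarizer` collapses a two-class feature into a single 0/1 column, and only produces one column per class when there are three or more classes. A minimal sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up values that mimic a two-class and a three-class column
two_classes = np.array(["Female", "Male", "Female"])
three_classes = np.array(["20", "30", "40", "20"])

binary_encoded = LabelBinarizer().fit_transform(two_classes)
multi_encoded = LabelBinarizer().fit_transform(three_classes)

print(binary_encoded.shape)  # a single column for two classes
print(multi_encoded.shape)   # one column per class for three classes
```

The same behavior explains the single-column output printed for the driving-license feature later in the notebook.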


In [41]:
age = data[1:,1]
dt = pd.DataFrame(list(zip(age)), columns=['Age'])
a = pd.get_dummies(dt.Age, prefix='Age')
print(a.head(10))
a = LabelBinarizer().fit_transform(dt.Age)
print(a)

   Age_18-30  Age_31-40  Age_41-50  Age_51-65
0          1          0          0          0
1          1          0          0          0
2          1          0          0          0
3          1          0          0          0
4          1          0          0          0
5          1          0          0          0
6          1          0          0          0
7          1          0          0          0
8          0          0          1          0
9          0          0          1          0
[[1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 ...
 [0 0 0 1]
 [0 0 0 1]
 [0 0 0 1]]


In [42]:
job = data[1:,2]
dj = pd.DataFrame(list(zip(job)), columns=['Job'])
j = pd.get_dummies(dj.Job, prefix='Job')
print(j.head(10))
j = LabelBinarizer().fit_transform(dj.Job)
print(j)

   Job_Entrepreneur  Job_Private Company Employee  \
0                 0                             0   
1                 0                             0   
2                 0                             0   
3                 0                             0   
4                 0                             0   
5                 0                             0   
6                 0                             0   
7                 0                             0   
8                 0                             0   
9                 0                             0   

   Job_Public Company Employee  Job_School/ University Student  Job_Unemployed  
0                            0                               1               0  
1                            0                               1               0  
2                            0                               1               0  
3                            0                               1               0  
4          

In [43]:
edu = data[1:,3]
dj = pd.DataFrame(list(zip(edu)), columns=['Education'])
ed = pd.get_dummies(dj.Education, prefix='Education')
print(ed.head(10))
ed = LabelBinarizer().fit_transform(dj.Education)
print(ed)

   Education_Graduate  Education_High School  Education_Under High School  \
0                   0                      0                            0   
1                   0                      0                            0   
2                   0                      0                            0   
3                   0                      0                            0   
4                   1                      0                            0   
5                   1                      0                            0   
6                   1                      0                            0   
7                   1                      0                            0   
8                   0                      0                            0   
9                   0                      0                            0   

   Education_Undergraduate  
0                        1  
1                        1  
2                        1  
3                        1  
4      

In [44]:
income = data[1:,4]
dj = pd.DataFrame(list(zip(income)), columns=['Income'])
inc = pd.get_dummies(dj.Income, prefix='Income')
print(inc.head(10))
inc = LabelBinarizer().fit_transform(dj.Income)
print(inc)

   Income_0 – 30000  Income_31000–60000  Income_61000 – 90000  \
0                 0                   1                     0   
1                 0                   1                     0   
2                 0                   1                     0   
3                 0                   1                     0   
4                 0                   0                     1   
5                 0                   0                     1   
6                 0                   0                     1   
7                 0                   0                     1   
8                 0                   0                     1   
9                 0                   0                     1   

   Income_91000 – 120000  Income_> 120000  
0                      0                0  
1                      0                0  
2                      0                0  
3                      0                0  
4                      0                0  
5                   

In [45]:
car = data[1:,5]
dj = pd.DataFrame(list(zip(car)), columns=['CarOwnership'])
carown = pd.get_dummies(dj.CarOwnership, prefix='CarOwnership')
print(carown.head(10))
carown = LabelBinarizer().fit_transform(dj.CarOwnership)
print(carown)

   CarOwnership_Do not have a personal car  but other family member has a car  \
0                                                  1                            
1                                                  1                            
2                                                  1                            
3                                                  1                            
4                                                  0                            
5                                                  0                            
6                                                  0                            
7                                                  0                            
8                                                  1                            
9                                                  1                            

   CarOwnership_Have a personal car  CarOwnership_More than one car  \
0                                 0  

In [46]:
driving = data[1:,6]
dj = pd.DataFrame(list(zip(driving)), columns=['DrivingLicense'])
dL = pd.get_dummies(dj.DrivingLicense, prefix='DrivingLicense')
print(dL.head(10))
dL = LabelBinarizer().fit_transform(dj.DrivingLicense)
print(dL)

   DrivingLicense_No  DrivingLicense_Yes
0                  1                   0
1                  1                   0
2                  1                   0
3                  1                   0
4                  0                   1
5                  0                   1
6                  0                   1
7                  0                   1
8                  0                   1
9                  0                   1
[[0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]

In [47]:
travel = data[1:,7]
dj = pd.DataFrame(list(zip(travel)), columns=['Travel'])
tra = pd.get_dummies(dj.Travel, prefix='Travel')
print(tra.head(10))
tra = LabelBinarizer().fit_transform(dj.Travel)
print(tra)

   Travel_2 Days  Travel_3 Days  Travel_4 Days  Travel_5 Days  Travel_None  \
0              0              0              0              1            0   
1              0              0              0              1            0   
2              0              0              0              1            0   
3              0              0              0              1            0   
4              0              0              0              1            0   
5              0              0              0              1            0   
6              0              0              0              1            0   
7              0              0              0              1            0   
8              0              0              1              0            0   
9              0              0              1              0            0   

   Travel_One Day  
0               0  
1               0  
2               0  
3               0  
4               0  
5               0  
6

In [48]:
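distance = data[1:,8]
# NOTE: `travel` is passed here instead of `distance`, so this block re-encodes the
# Travel column (see the output below). `distance` is presumably what was intended.
dj = pd.DataFrame(list(zip(travel)), columns=['Distance'])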
dist = pd.get_dummies(dj.Distance, prefix='Distance')
print(dist.head(10))
dist = LabelBinarizer().fit_transform(dj.Distance)
print(dist)

   Distance_2 Days  Distance_3 Days  Distance_4 Days  Distance_5 Days  \
0                0                0                0                1   
1                0                0                0                1   
2                0                0                0                1   
3                0                0                0                1   
4                0                0                0                1   
5                0                0                0                1   
6                0                0                0                1   
7                0                0                0                1   
8                0                0                1                0   
9                0                0                1                0   

   Distance_None  Distance_One Day  
0              0                 0  
1              0                 0  
2              0                 0  
3              0                 0  
4          

In [49]:
purpose = data[1:,9]
dj = pd.DataFrame(list(zip(purpose)), columns=['Purpose'])
purp = pd.get_dummies(dj.Purpose, prefix='Purpose')
print(purp.head(10))
purp = LabelBinarizer().fit_transform(dj.Purpose)
print(purp)

   Purpose_amusement  Purpose_event  Purpose_shoping  Purpose_work 
0                  0              1                0              0
1                  1              0                0              0
2                  0              0                0              1
3                  0              0                1              0
4                  0              1                0              0
5                  1              0                0              0
6                  0              0                0              1
7                  0              0                1              0
8                  0              1                0              0
9                  1              0                0              0
[[0 1 0 0]
 [1 0 0 0]
 [0 0 0 1]
 ...
 [1 0 0 0]
 [0 0 0 1]
 [0 0 1 0]]


In [50]:
at = data[1:,10]
dj = pd.DataFrame(list(zip(at)), columns=['AT'])
att = pd.get_dummies(dj.AT, prefix='AT')
print(att.head(10))
att = LabelBinarizer().fit_transform(dj.AT)
print(att)

   AT_alone   AT_family   AT_friends 
0          1           0            0
1          0           0            1
2          1           0            0
3          0           1            0
4          1           0            0
5          0           0            1
6          1           0            0
7          0           1            0
8          1           0            0
9          0           0            1
[[1 0 0]
 [0 0 1]
 [1 0 0]
 ...
 [0 0 1]
 [1 0 0]
 [0 1 0]]


In [51]:
tt1 = data[1:,11]
dj = pd.DataFrame(list(zip(tt1)), columns=['AT'])
tt1 = pd.get_dummies(dj.AT, prefix='AT')
print(tt1.head(10))
tt1 = LabelBinarizer().fit_transform(dj.AT)
print(tt1)

   AT_20  AT_30  AT_40
0      0      0      1
1      0      1      0
2      0      0      1
3      1      0      0
4      0      0      1
5      0      1      0
6      0      0      1
7      1      0      0
8      0      0      1
9      0      1      0
[[0 0 1]
 [0 1 0]
 [0 0 1]
 ...
 [0 1 0]
 [0 0 1]
 [1 0 0]]


In [52]:
tc1 = data[1:,12]
dj = pd.DataFrame(list(zip(tc1)), columns=['AT'])
tc1 = pd.get_dummies(dj.AT, prefix='AT')
print(tc1.head(10))
tc1 = LabelBinarizer().fit_transform(dj.AT)
print(tc1)

   AT_350  AT_400  AT_450
0       0       1       0
1       1       0       0
2       0       0       1
3       0       1       0
4       0       1       0
5       1       0       0
6       0       0       1
7       0       1       0
8       0       1       0
9       1       0       0
[[0 1 0]
 [1 0 0]
 [0 0 1]
 ...
 [1 0 0]
 [0 0 1]
 [0 1 0]]


In [53]:
ref1 = data[1:,13]
dj = pd.DataFrame(list(zip(ref1)), columns=['AT'])
ref1 = pd.get_dummies(dj.AT, prefix='AT')
print(ref1.head(10))
ref1 = LabelBinarizer().fit_transform(dj.AT)
print(ref1)

   AT_20000  AT_30000  AT_40000
0         0         1         0
1         1         0         0
2         0         0         1
3         0         1         0
4         0         1         0
5         1         0         0
6         0         0         1
7         0         1         0
8         0         1         0
9         1         0         0
[[0 1 0]
 [1 0 0]
 [0 0 1]
 ...
 [1 0 0]
 [0 0 1]
 [0 1 0]]


In [54]:
tt2 = data[1:,14]
print (tt2)
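# Append placeholder values 15 and 30 (apparently categories this split lacks) so the
# binarizer emits a column for each; the two placeholder rows are dropped again below.
temp = np.array([15,30])
tt2 = np.concatenate((tt2, temp))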
dj = pd.DataFrame(list(zip(tt2)), columns=['AT'])
tt2 = pd.get_dummies(dj.AT, prefix='AT')
print(tt2.head(10))
tt2 = LabelBinarizer().fit_transform(dj.AT)
print(tt2)
n = 2 
tt2 = tt2[:-n, :]
print(tt2)
print(np.shape(tt2))

['35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35'
 '25' '25' '35' '35' '25' '25' '35' '35' '25' '25' '35' '35' '25' '25'
 '35' 

In [55]:
tc2 = data[1:,15]
dj = pd.DataFrame(list(zip(tc2)), columns=['AT'])
tc2 = pd.get_dummies(dj.AT, prefix='AT')
print(tc2.head(10))
tc2 = LabelBinarizer().fit_transform(dj.AT)
print(tc2)

   AT_300  AT_350  AT_400
0       0       0       1
1       1       0       0
2       0       0       1
3       0       1       0
4       0       0       1
5       1       0       0
6       0       0       1
7       0       1       0
8       0       0       1
9       1       0       0
[[0 0 1]
 [1 0 0]
 [0 0 1]
 ...
 [1 0 0]
 [0 0 1]
 [0 1 0]]


In [56]:
mc2 = data[1:,16]
dj = pd.DataFrame(list(zip(mc2)), columns=['AT'])
mc2 = pd.get_dummies(dj.AT, prefix='AT')
print(mc2.head(10))
mc2 = LabelBinarizer().fit_transform(dj.AT)
print(mc2)

   AT_30000  AT_40000  AT_50000
0         1         0         0
1         0         0         1
2         1         0         0
3         0         1         0
4         1         0         0
5         0         0         1
6         1         0         0
7         0         1         0
8         1         0         0
9         0         0         1
[[1 0 0]
 [0 0 1]
 [1 0 0]
 ...
 [0 0 1]
 [1 0 0]
 [0 1 0]]


In [57]:
cc2 = data[1:,17]
dj = pd.DataFrame(list(zip(cc2)), columns=['AT'])
cc2 = pd.get_dummies(dj.AT, prefix='AT')
print(cc2.head(10))
cc2 = LabelBinarizer().fit_transform(dj.AT)
print(cc2)

   AT_0  AT_1000000  AT_1300000  AT_1600000
0     1           0           0           0
1     1           0           0           0
2     1           0           0           0
3     1           0           0           0
4     1           0           0           0
5     1           0           0           0
6     1           0           0           0
7     1           0           0           0
8     1           0           0           0
9     1           0           0           0
[[1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 ...
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]]


In [58]:
tt3 = data[1:,18]
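# Append a placeholder 0 (apparently a category this split lacks) so the binarizer
# emits a column for it; the placeholder row is dropped again below.
temp = np.array([0])
tt3 = np.concatenate((tt3, temp))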
dj = pd.DataFrame(list(zip(tt3)), columns=['AT'])
tt3 = pd.get_dummies(dj.AT, prefix='AT')
print(tt3.head(10))
tt3 = LabelBinarizer().fit_transform(dj.AT)
print(tt3)
n = 1 
tt3 = tt3[:-n, :]
print(tt3)
print(np.shape(tt3))

   AT_0  AT_40  AT_50  AT_60
0     0      0      1      0
1     0      1      0      0
2     0      0      0      1
3     0      0      1      0
4     0      0      1      0
5     0      1      0      0
6     0      0      0      1
7     0      0      1      0
8     0      0      1      0
9     0      1      0      0
[[0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]
 ...
 [0 0 0 1]
 [0 0 1 0]
 [1 0 0 0]]
[[0 0 1 0]
 [0 1 0 0]
 [0 0 0 1]
 ...
 [0 1 0 0]
 [0 0 0 1]
 [0 0 1 0]]
(400, 4)


In [59]:
tc3 = data[1:,19]
dj = pd.DataFrame(list(zip(tc3)), columns=['AT'])
tc3 = pd.get_dummies(dj.AT, prefix='AT')
print(tc3.head(10))
tc3 = LabelBinarizer().fit_transform(dj.AT)
print(tc3)

   AT_500  AT_550  AT_600
0       1       0       0
1       0       1       0
2       1       0       0
3       0       0       1
4       1       0       0
5       0       1       0
6       1       0       0
7       0       0       1
8       1       0       0
9       0       1       0
[[1 0 0]
 [0 1 0]
 [1 0 0]
 ...
 [0 1 0]
 [1 0 0]
 [0 0 1]]


In [60]:
tt4 = data[1:,20]
dj = pd.DataFrame(list(zip(tt4)), columns=['AT'])
tt4 = pd.get_dummies(dj.AT, prefix='AT')
print(tt4.head(10))
tt4 = LabelBinarizer().fit_transform(dj.AT)
print(tt4)

   AT_60  AT_75  AT_90
0      0      0      1
1      0      1      0
2      0      0      1
3      1      0      0
4      0      0      1
5      0      1      0
6      0      0      1
7      1      0      0
8      0      0      1
9      0      1      0
[[0 0 1]
 [0 1 0]
 [0 0 1]
 ...
 [0 1 0]
 [0 0 1]
 [1 0 0]]


In [61]:
tc4 = data[1:,21]
dj = pd.DataFrame(list(zip(tc4)), columns=['AT'])
tc4 = pd.get_dummies(dj.AT, prefix='AT')
print(tc4.head(10))
tc4 = LabelBinarizer().fit_transform(dj.AT)
print(tc4)

   AT_20  AT_30  AT_40
0      0      0      1
1      0      1      0
2      1      0      0
3      1      0      0
4      0      0      1
5      0      1      0
6      1      0      0
7      1      0      0
8      0      0      1
9      0      1      0
[[0 0 1]
 [0 1 0]
 [1 0 0]
 ...
 [0 1 0]
 [1 0 0]
 [1 0 0]]


In [62]:
# Test labels: the last column, kept as raw class strings
y_test = data[1:,22]
#dj = pd.DataFrame(list(zip(y)), columns=['AT'])
#y = pd.get_dummies(dj.AT, prefix='AT')
#print(y.head(10))
#y_test = LabelBinarizer().fit_transform(dj.AT)
#print(y_test)
#print(np.shape(y_test))

In [63]:
args = (gend, a, j, ed, inc, carown, dL, tra, dist, purp, att, tt1, tc1, ref1, tt2, tc2, mc2, cc2, tt3, tc3, tt4, tc4)
x_test = np.concatenate(args, axis=1)
print (x_test)
print (np.shape(x_test))

[[0 1 0 ... 0 0 1]
 [0 1 0 ... 0 1 0]
 [0 1 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]]
(400, 79)


# Validation data - One Hot Encoding

In [64]:
data = np.genfromtxt('Dataset-MultiClass-Validate.csv', delimiter=',', dtype=None)
print(np.shape(data))
gender = data[1:,0]
df = pd.DataFrame(list(zip(gender)), columns=['Gender'])
gend = pd.get_dummies(df.Gender, prefix='Gender')
print(gend.head(10))
gend = LabelBinarizer().fit_transform(df.Gender)
print(np.shape(gend))
age = data[1:,1]
dt = pd.DataFrame(list(zip(age)), columns=['Age'])
a = pd.get_dummies(dt.Age, prefix='Age')
print(a.head(10))
a = LabelBinarizer().fit_transform(dt.Age)
print(a)
job = data[1:,2]
dj = pd.DataFrame(list(zip(job)), columns=['Job'])
j = pd.get_dummies(dj.Job, prefix='Job')
print(j.head(10))
j = LabelBinarizer().fit_transform(dj.Job)
print(j)
edu = data[1:,3]
dj = pd.DataFrame(list(zip(edu)), columns=['Education'])
ed = pd.get_dummies(dj.Education, prefix='Education')
print(ed.head(10))
ed = LabelBinarizer().fit_transform(dj.Education)

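# The validation split appears to lack one education category (index 2,
# "Under High School" in the test split), so an all-zero column is inserted
# there to keep the column layout aligned with the other splits.
r, c = np.shape(gend)
temp = np.zeros((r,1),dtype=int)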
ed1 = ed[:,:2]
ed2 = ed[:,2:]
args = (ed1, temp, ed2)
ed = np.concatenate(args, axis=1)
print(ed)

income = data[1:,4]
dj = pd.DataFrame(list(zip(income)), columns=['Income'])
inc = pd.get_dummies(dj.Income, prefix='Income')
print(inc.head(10))
inc = LabelBinarizer().fit_transform(dj.Income)
print(inc)
car = data[1:,5]
dj = pd.DataFrame(list(zip(car)), columns=['CarOwnership'])
carown = pd.get_dummies(dj.CarOwnership, prefix='CarOwnership')
print(carown.head(10))
carown = LabelBinarizer().fit_transform(dj.CarOwnership)
print(carown)
driving = data[1:,6]
dj = pd.DataFrame(list(zip(driving)), columns=['DrivingLicense'])
dL = pd.get_dummies(dj.DrivingLicense, prefix='DrivingLicense')
print(dL.head(10))
dL = LabelBinarizer().fit_transform(dj.DrivingLicense)
print(dL)
travel = data[1:,7]
dj = pd.DataFrame(list(zip(travel)), columns=['Travel'])
tra = pd.get_dummies(dj.Travel, prefix='Travel')
print(tra.head(10))
tra = LabelBinarizer().fit_transform(dj.Travel)
print(tra)
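distance = data[1:,8]
# NOTE: `travel` is passed here instead of `distance`, so this block re-encodes the
# Travel column. `distance` is presumably what was intended.
dj = pd.DataFrame(list(zip(travel)), columns=['Distance'])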
dist = pd.get_dummies(dj.Distance, prefix='Distance')
print(dist.head(10))
dist = LabelBinarizer().fit_transform(dj.Distance)
print(dist)
purpose = data[1:,9]
dj = pd.DataFrame(list(zip(purpose)), columns=['Purpose'])
purp = pd.get_dummies(dj.Purpose, prefix='Purpose')
print(purp.head(10))
purp = LabelBinarizer().fit_transform(dj.Purpose)
print(purp)
at = data[1:,10]
dj = pd.DataFrame(list(zip(at)), columns=['AT'])
att = pd.get_dummies(dj.AT, prefix='AT')
print(att.head(10))
att = LabelBinarizer().fit_transform(dj.AT)
print(att)
tt1 = data[1:,11]
dj = pd.DataFrame(list(zip(tt1)), columns=['AT'])
tt1 = pd.get_dummies(dj.AT, prefix='AT')
print(tt1.head(10))
tt1 = LabelBinarizer().fit_transform(dj.AT)
print(tt1)
tc1 = data[1:,12]
dj = pd.DataFrame(list(zip(tc1)), columns=['AT'])
tc1 = pd.get_dummies(dj.AT, prefix='AT')
print(tc1.head(10))
tc1 = LabelBinarizer().fit_transform(dj.AT)
print(tc1)
ref1 = data[1:,13]
dj = pd.DataFrame(list(zip(ref1)), columns=['AT'])
ref1 = pd.get_dummies(dj.AT, prefix='AT')
print(ref1.head(10))
ref1 = LabelBinarizer().fit_transform(dj.AT)
print(ref1)
#===========================#
tt2 = data[1:,14]
print (tt2)
r = np.shape(tt2)
temp = np.array([15,30])
tt2 = np.concatenate((tt2, temp))
#temp = np.zeros((r,1),dtype=int)
dj = pd.DataFrame(list(zip(tt2)), columns=['AT'])
tt2 = pd.get_dummies(dj.AT, prefix='AT')
print(tt2.head(10))
tt2 = LabelBinarizer().fit_transform(dj.AT)
print(tt2)
n = 2 
tt2 = tt2[:-n, :]
print(tt2)
print(np.shape(tt2))
#===========================#
tc2 = data[1:,15]
dj = pd.DataFrame(list(zip(tc2)), columns=['AT'])
tc2 = pd.get_dummies(dj.AT, prefix='AT')
print(tc2.head(10))
tc2 = LabelBinarizer().fit_transform(dj.AT)
print(tc2)
mc2 = data[1:,16]
dj = pd.DataFrame(list(zip(mc2)), columns=['AT'])
mc2 = pd.get_dummies(dj.AT, prefix='AT')
print(mc2.head(10))
mc2 = LabelBinarizer().fit_transform(dj.AT)
print(mc2)
cc2 = data[1:,17]
dj = pd.DataFrame(list(zip(cc2)), columns=['AT'])
cc2 = pd.get_dummies(dj.AT, prefix='AT')
print(cc2.head(10))
cc2 = LabelBinarizer().fit_transform(dj.AT)
print(cc2)
#===========================#
tt3 = data[1:,18]
temp = np.array([0])
tt3 = np.concatenate((tt3, temp))
dj = pd.DataFrame(list(zip(tt3)), columns=['AT'])
tt3 = pd.get_dummies(dj.AT, prefix='AT')
print(tt3.head(10))
tt3 = LabelBinarizer().fit_transform(dj.AT)
print(tt3)
n = 1 
tt3 = tt3[:-n, :]
print(tt3)
print(np.shape(tt3))
#===========================#
tc3 = data[1:,19]
dj = pd.DataFrame(list(zip(tc3)), columns=['AT'])
tc3 = pd.get_dummies(dj.AT, prefix='AT')
print(tc3.head(10))
tc3 = LabelBinarizer().fit_transform(dj.AT)
print(tc3)
tt4 = data[1:,20]
dj = pd.DataFrame(list(zip(tt4)), columns=['AT'])
tt4 = pd.get_dummies(dj.AT, prefix='AT')
print(tt4.head(10))
tt4 = LabelBinarizer().fit_transform(dj.AT)
print(tt4)
tc4 = data[1:,21]
dj = pd.DataFrame(list(zip(tc4)), columns=['AT'])
tc4 = pd.get_dummies(dj.AT, prefix='AT')
print(tc4.head(10))
tc4 = LabelBinarizer().fit_transform(dj.AT)
print(tc4)



(201, 23)
   Gender_Female  Gender_Male
0              1            0
1              1            0
2              1            0
3              1            0
4              0            1
5              0            1
6              0            1
7              0            1
8              0            1
9              0            1
(200, 1)
   Age_18-30  Age_31-40  Age_41-50  Age_51-65
0          0          1          0          0
1          0          1          0          0
2          0          1          0          0
3          0          1          0          0
4          1          0          0          0
5          1          0          0          0
6          1          0          0          0
7          1          0          0          0
8          0          0          1          0
9          0          0          1          0
[[0 1 0 0]
 [0 1 0 0]
 [0 1 0 0]
 [0 1 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [0 0 1 0]
 [1 0 0 0]
 [


In [65]:
args = (gend, a, j, ed, inc, carown, dL, tra, dist, purp, att, tt1, tc1, ref1, tt2, tc2, mc2, cc2, tt3, tc3, tt4, tc4)
x_val = np.concatenate(args, axis=1)
print (x_val)
print (np.shape(x_val))

[[0 0 1 ... 0 0 1]
 [0 0 1 ... 0 1 0]
 [0 0 1 ... 1 0 0]
 ...
 [1 1 0 ... 0 1 0]
 [1 1 0 ... 1 0 0]
 [1 1 0 ... 1 0 0]]
(200, 79)


In [66]:
# label validation data
y_val = data[1:,22]
#dj = pd.DataFrame(list(zip(y)), columns=['AT'])
#y = pd.get_dummies(dj.AT, prefix='AT')
#print(y.head(10))
#y_test = LabelBinarizer().fit_transform(dj.AT)
#print(y_test)
#print(np.shape(y_test))

# Training data - One Hot Encoding

In [67]:
data = np.genfromtxt('Dataset-MultiClass-Train.csv', delimiter=',', dtype=None)
print(np.shape(data))
gender = data[1:,0]
df = pd.DataFrame(list(zip(gender)), columns=['Gender'])
gend = pd.get_dummies(df.Gender, prefix='Gender')
print(gend.head(10))
gend = LabelBinarizer().fit_transform(df.Gender)
print(np.shape(gend))
age = data[1:,1]
dt = pd.DataFrame(list(zip(age)), columns=['Age'])
a = pd.get_dummies(dt.Age, prefix='Age')
print(a.head(10))
a = LabelBinarizer().fit_transform(dt.Age)
print(a)
job = data[1:,2]
dj = pd.DataFrame(list(zip(job)), columns=['Job'])
j = pd.get_dummies(dj.Job, prefix='Job')
print(j.head(10))
j = LabelBinarizer().fit_transform(dj.Job)
print(j)
edu = data[1:,3]
dj = pd.DataFrame(list(zip(edu)), columns=['Education'])
ed = pd.get_dummies(dj.Education, prefix='Education')
print(ed.head(10))
ed = LabelBinarizer().fit_transform(dj.Education)
print(ed)
income = data[1:,4]
dj = pd.DataFrame(list(zip(income)), columns=['Income'])
inc = pd.get_dummies(dj.Income, prefix='Income')
print(inc.head(10))
inc = LabelBinarizer().fit_transform(dj.Income)
print(inc)
car = data[1:,5]
dj = pd.DataFrame(list(zip(car)), columns=['CarOwnership'])
carown = pd.get_dummies(dj.CarOwnership, prefix='CarOwnership')
print(carown.head(10))
carown = LabelBinarizer().fit_transform(dj.CarOwnership)
print(carown)
driving = data[1:,6]
dj = pd.DataFrame(list(zip(driving)), columns=['DrivingLicense'])
dL = pd.get_dummies(dj.DrivingLicense, prefix='DrivingLicense')
print(dL.head(10))
dL = LabelBinarizer().fit_transform(dj.DrivingLicense)
print(dL)
travel = data[1:,7]
dj = pd.DataFrame(list(zip(travel)), columns=['Travel'])
tra = pd.get_dummies(dj.Travel, prefix='Travel')
print(tra.head(10))
tra = LabelBinarizer().fit_transform(dj.Travel)
print(tra)
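distance = data[1:,8]
# NOTE: `travel` is passed here instead of `distance`, so this block re-encodes the
# Travel column. `distance` is presumably what was intended.
dj = pd.DataFrame(list(zip(travel)), columns=['Distance'])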
dist = pd.get_dummies(dj.Distance, prefix='Distance')
print(dist.head(10))
dist = LabelBinarizer().fit_transform(dj.Distance)
print(dist)
purpose = data[1:,9]
dj = pd.DataFrame(list(zip(purpose)), columns=['Purpose'])
purp = pd.get_dummies(dj.Purpose, prefix='Purpose')
print(purp.head(10))
purp = LabelBinarizer().fit_transform(dj.Purpose)
print(purp)
at = data[1:,10]
dj = pd.DataFrame(list(zip(at)), columns=['AT'])
att = pd.get_dummies(dj.AT, prefix='AT')
print(att.head(10))
att = LabelBinarizer().fit_transform(dj.AT)
print(att)
tt1 = data[1:,11]
dj = pd.DataFrame(list(zip(tt1)), columns=['AT'])
tt1 = pd.get_dummies(dj.AT, prefix='AT')
print(tt1.head(10))
tt1 = LabelBinarizer().fit_transform(dj.AT)
print(tt1)
tc1 = data[1:,12]
dj = pd.DataFrame(list(zip(tc1)), columns=['AT'])
tc1 = pd.get_dummies(dj.AT, prefix='AT')
print(tc1.head(10))
tc1 = LabelBinarizer().fit_transform(dj.AT)
print(tc1)
ref1 = data[1:,13]
dj = pd.DataFrame(list(zip(ref1)), columns=['AT'])
ref1 = pd.get_dummies(dj.AT, prefix='AT')
print(ref1.head(10))
ref1 = LabelBinarizer().fit_transform(dj.AT)
print(ref1)
tt2 = data[1:,14]
dj = pd.DataFrame(list(zip(tt2)), columns=['AT'])
tt2 = pd.get_dummies(dj.AT, prefix='AT')
print(tt2.head(10))
tt2 = LabelBinarizer().fit_transform(dj.AT)
print(tt2)
tc2 = data[1:,15]
dj = pd.DataFrame(list(zip(tc2)), columns=['AT'])
tc2 = pd.get_dummies(dj.AT, prefix='AT')
print(tc2.head(10))
tc2 = LabelBinarizer().fit_transform(dj.AT)
print(tc2)
mc2 = data[1:,16]
dj = pd.DataFrame(list(zip(mc2)), columns=['AT'])
mc2 = pd.get_dummies(dj.AT, prefix='AT')
print(mc2.head(10))
mc2 = LabelBinarizer().fit_transform(dj.AT)
print(mc2)
cc2 = data[1:,17]
dj = pd.DataFrame(list(zip(cc2)), columns=['AT'])
cc2 = pd.get_dummies(dj.AT, prefix='AT')
print(cc2.head(10))
cc2 = LabelBinarizer().fit_transform(dj.AT)
print(cc2)
tt3 = data[1:,18]
dj = pd.DataFrame(list(zip(tt3)), columns=['AT'])
tt3 = pd.get_dummies(dj.AT, prefix='AT')
print(tt3.head(10))
tt3 = LabelBinarizer().fit_transform(dj.AT)
print(tt3)
tc3 = data[1:,19]
dj = pd.DataFrame(list(zip(tc3)), columns=['AT'])
tc3 = pd.get_dummies(dj.AT, prefix='AT')
print(tc3.head(10))
tc3 = LabelBinarizer().fit_transform(dj.AT)
print(tc3)
tt4 = data[1:,20]
dj = pd.DataFrame(list(zip(tt4)), columns=['AT'])
tt4 = pd.get_dummies(dj.AT, prefix='AT')
print(tt4.head(10))
tt4 = LabelBinarizer().fit_transform(dj.AT)
print(tt4)
tc4 = data[1:,21]
dj = pd.DataFrame(list(zip(tc4)), columns=['AT'])
tc4 = pd.get_dummies(dj.AT, prefix='AT')
print(tc4.head(10))
tc4 = LabelBinarizer().fit_transform(dj.AT)
print(tc4)



(1201, 23)
   Gender_FEMALE  Gender_MALE
0              0            1
1              0            1
2              0            1
3              0            1
4              1            0
5              1            0
6              1            0
7              1            0
8              1            0
9              1            0
(1200, 1)
   Age_18-30  Age_31-40  Age_41-50  Age_51-65
0          1          0          0          0
1          1          0          0          0
2          1          0          0          0
3          1          0          0          0
4          1          0          0          0
5          1          0          0          0
6          1          0          0          0
7          1          0          0          0
8          1          0          0          0
9          1          0          0          0
[[1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]
 ...
 [0 1 0 0]
 [0 1 0 0]
 [0 1 0 0]]
   Job_Entrepreneur  Job_Private Company Employee  \
0                 

  data = np.genfromtxt('Dataset-MultiClass-Train.csv', delimiter=',', dtype=None)


In [68]:
args = (gend, a, j, ed, inc, carown, dL, tra, dist, purp, att, tt1, tc1, ref1, tt2, tc2, mc2, cc2, tt3, tc3, tt4, tc4)
x_train = np.concatenate(args, axis=1)
print(x_train)
print(np.shape(x_train))

[[1 1 0 ... 0 0 1]
 [1 1 0 ... 0 1 0]
 [1 1 0 ... 0 1 0]
 ...
 [0 0 1 ... 0 0 1]
 [0 0 1 ... 1 0 0]
 [0 0 1 ... 0 1 0]]
(1200, 79)
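
The per-column `LabelBinarizer` calls above are fitted separately on each CSV, so the train, validation, and test matrices only line up if every category happens to appear in all three files. A minimal sketch of one alternative, assuming the categorical columns are loaded as strings: fit a single `OneHotEncoder` on the training features and reuse it for the other splits. The toy arrays below are hypothetical stand-ins, not the assignment data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the categorical feature columns (not the real data).
train_raw = np.array([
    ["FEMALE", "18-30", "Student"],
    ["MALE", "31-40", "Entrepreneur"],
    ["MALE", "41-50", "Student"],
    ["FEMALE", "51-65", "Entrepreneur"],
])
val_raw = np.array([
    ["MALE", "18-30", "Student"],
    ["FEMALE", "31-40", "Retired"],  # "Retired" never appears in train_raw
])

# Fit once on the training split; unseen categories encode as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
x_train_demo = encoder.fit_transform(train_raw).toarray()
x_val_demo = encoder.transform(val_raw).toarray()

# Both splits now share the same column layout.
print(x_train_demo.shape, x_val_demo.shape)
```

Note that `OneHotEncoder` emits two columns for a binary feature, whereas `LabelBinarizer` emits one, so the total column count would differ from the 79 shown above.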


In [76]:
# labels for the training data (last column of the training CSV)
y_train = data[1:,22]
#dj = pd.DataFrame(list(zip(y)), columns=['AT'])
#y = pd.get_dummies(dj.AT, prefix='AT')
#print(y.head(10))
#y_train = LabelBinarizer().fit_transform(dj.AT)
#print(y_train)
#print(np.shape(y_train))

# Logistic Regression

# VALIDATION SET


In [79]:

regr = LogisticRegression(multi_class='multinomial', solver='lbfgs')
print(np.shape(x_train),np.shape(y_train))
regr.fit(x_train, y_train)
y_pred = regr.predict(x_val)
y_pred_test = regr.predict(x_test)
score = regr.score(x_val, y_val)
print("Validation set score: ",score)
print("Test set score: ", regr.score(x_test, y_test))



(1200, 79) (1200,)
Validation set score:  0.34
Test set score:  0.25


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
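
The warning above reports that `lbfgs` stopped at its iteration limit before converging. One response, which the warning itself suggests, is to raise `max_iter`. A minimal sketch on synthetic data (the random features and labels are placeholders, so the result says nothing about the assignment dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: 300 samples, 20 binary features, 4 classes.
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 2, size=(300, 20))
y_demo = rng.integers(0, 4, size=300)

# max_iter defaults to 100; a larger cap gives lbfgs room to converge.
clf = LogisticRegression(solver="lbfgs", max_iter=1000)
clf.fit(x_demo, y_demo)

# n_iter_ holds the iterations actually used; staying below the cap means it converged.
print(clf.n_iter_, clf.n_iter_.max() < clf.max_iter)
```

Applying the same `max_iter` argument to the `LogisticRegression` call in this notebook should silence the warning if the cap is high enough, though it may not change the scores much.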


In [88]:
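from sklearn.metrics import confusion_matrix

# Sanity check: true and predicted label vectors must have the same length
print(np.shape(y_val))
print(np.shape(y_pred))
# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_val, y_pred))
confusion = confusion_matrix(y_val, y_pred).ravel()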


(200,)
(200,)
[[ 6 12 14  5]
 [ 3 18 34  5]
 [ 3  3 33 25]
 [ 4 18  6 11]]


In [96]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Macro averaging: compute each metric per class, then take the unweighted mean.
accuracy = accuracy_score(y_val, y_pred)
print("Accuracy: %5.2f" % (accuracy*100), "%")
precision = precision_score(y_val, y_pred, average='macro')
print("Precision: %5.2f" % (precision*100), "%")
recall = recall_score(y_val, y_pred, average='macro')
print("Recall: %5.2f" % (recall*100), "%")
# Harmonic mean of macro precision and macro recall, reported as a fraction (not a percentage).
# Note: this can differ from sklearn's f1_score(..., average='macro'), which averages per-class F1.
f1 = 2 * ((precision*recall)/(precision+recall))
print("F1-Score: %5.2f" % (f1))


Accuracy: 34.00 %
Precision: 33.66 %
Recall: 31.50 %
F1-Score:  0.33
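
The F1 value above is the harmonic mean of the macro-averaged precision and recall. The macro F1 as computed by `f1_score(..., average='macro')` is instead the mean of the per-class F1 scores, and the two generally differ. The sketch below recomputes both from the validation confusion matrix printed earlier, using only NumPy; the variable names are illustrative.

```python
import numpy as np

# Validation confusion matrix printed above (rows: true class, columns: predicted).
cm = np.array([
    [6, 12, 14, 5],
    [3, 18, 34, 5],
    [3, 3, 33, 25],
    [4, 18, 6, 11],
])

tp = np.diag(cm)
precision_per_class = tp / cm.sum(axis=0)  # column sums = predicted counts
recall_per_class = tp / cm.sum(axis=1)     # row sums = true counts

# Macro F1: average the per-class F1 scores.
f1_per_class = (2 * precision_per_class * recall_per_class
                / (precision_per_class + recall_per_class))
macro_f1 = f1_per_class.mean()

# Notebook formula: harmonic mean of macro precision and macro recall.
p, r = precision_per_class.mean(), recall_per_class.mean()
f1_of_macros = 2 * p * r / (p + r)

print(f"macro F1: {macro_f1:.4f}, harmonic mean of macro P and R: {f1_of_macros:.4f}")
```

On this matrix the per-class average comes out lower than the harmonic-mean figure, so it is worth stating which definition a reported F1 uses.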


# TEST SET

In [97]:
print(confusion_matrix(y_test, y_pred_test))


[[ 8 52 23  4]
 [19 30 51  7]
 [ 9 10 51 45]
 [22 29 29 11]]


In [98]:
accuracy = accuracy_score(y_test, y_pred_test)
print("Accuracy: %5.2f" % (accuracy*100), "%")
precision = precision_score(y_test, y_pred_test, average='macro')
print("Precision: %5.2f" % (precision*100), "%")
recall = recall_score(y_test, y_pred_test, average='macro')
print("Recall: %5.2f" % (recall*100), "%")
f1 = 2 * ((precision*recall)/(precision+recall))
print("F1-Score: %5.2f" % (f1))


Accuracy: 25.00 %
Precision: 22.03 %
Recall: 23.42 %
F1-Score:  0.23


# SVM


In [100]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

regr = make_pipeline(StandardScaler(), SVC(gamma='auto'))
regr.fit(x_train, y_train)
y_pred = regr.predict(x_val)
y_pred_test = regr.predict(x_test)
score = regr.score(x_val, y_val)
print("Validation set score: ",score)
print("Test set score: ", regr.score(x_test, y_test))

Validation set score:  0.255
Test set score:  0.28
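
The assignment notes mention data balancing. One option that needs no resampling is `class_weight='balanced'` on `SVC`, which weights each class inversely to its frequency. A minimal sketch on synthetic, deliberately imbalanced data (placeholder values, not the assignment dataset); whether it improves the scores here is untested:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data: class 0 is far more common than the rest.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 10))
y_demo = rng.choice(4, size=200, p=[0.7, 0.1, 0.1, 0.1])

# class_weight="balanced" scales C per class by n_samples / (n_classes * class_count).
svm_demo = make_pipeline(StandardScaler(), SVC(gamma="auto", class_weight="balanced"))
svm_demo.fit(x_demo, y_demo)

# Count how often each class is predicted on the training data.
pred = svm_demo.predict(x_demo)
print(np.bincount(pred, minlength=4))
```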


# VALIDATION SET


In [105]:
print(confusion_matrix(y_val, y_pred))
confusion = confusion_matrix(y_val, y_pred).ravel()

[[ 3  1 23 10]
 [ 6  7 34 13]
 [ 3  2 28 31]
 [ 6  9 11 13]]


In [102]:
accuracy = accuracy_score(y_val, y_pred)
print("Accuracy: %5.2f" % (accuracy*100), "%")
precision = precision_score(y_val, y_pred, average='macro')
print("Precision: %5.2f" % (precision*100), "%")
recall = recall_score(y_val, y_pred, average='macro')
print("Recall: %5.2f" % (recall*100), "%")
f1 = 2 * ((precision*recall)/(precision+recall))
print("F1-Score: %5.2f" % (f1))


Accuracy: 25.50 %
Precision: 25.52 %
Recall: 24.21 %
F1-Score:  0.25


# TEST SET

In [106]:
print(confusion_matrix(y_test, y_pred_test))

[[15 34 32  6]
 [38 15 36 18]
 [ 6  2 73 34]
 [32 16 34  9]]


In [108]:
accuracy = accuracy_score(y_test, y_pred_test)
print("Accuracy: %5.2f" % (accuracy*100), "%")
precision = precision_score(y_test, y_pred_test, average='macro')
print("Precision: %5.2f" % (precision*100), "%")
recall = recall_score(y_test, y_pred_test, average='macro')
print("Recall: %5.2f" % (recall*100), "%")
f1 = 2 * ((precision*recall)/(precision+recall))
print("F1-Score: %5.2f" % (f1))

Accuracy: 28.00 %
Precision: 23.50 %
Recall: 26.16 %
F1-Score:  0.25


# End