# Ebola Virus Disease and Marburg Virus Disease Classification

## MVD Symptoms
Nausea, vomiting, chest pain, a sore throat, abdominal pain, and diarrhea may appear. Symptoms become increasingly severe and can include jaundice, inflammation of the pancreas, severe weight loss, delirium, shock, liver failure, massive hemorrhaging, and multi-organ dysfunction.

## EVD Symptoms
Symptoms show up 2 to 21 days after infection and usually include:

- High fever
- Headache
- Joint and muscle aches
- Sore throat
- Weakness
- Stomach pain
- Lack of appetite

Implementation outline:

![image.png](attachment:image.png)


- Getting Data: Import data from file
- Preprocess Data: Clean data with pandas
- Split Data: 80% training and 20% testing
- Setting up Environment: Setup an experiment for building multiclass models
- Create Model: Create a model, perform stratified cross validation and evaluate classification metrics
- Train Model: Train the multiclass models with the train dataset
- Test Model: Test best three multiclass models with the test dataset
- Tune Model: Automatically tune the hyper-parameters of a multiclass model
- Plot Model: Analyze model performance using various plots
- Finalize Model: Finalize and select the best model at the end of the experiment
- Predict Model: Make predictions on new / unseen data
- Save / Load Model: Save / load a model for future use
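
The "stratified cross validation" step in the outline can be illustrated with a small pure-Python sketch. It is not PyCaret's implementation; the helper name `stratified_folds` and the toy class counts are assumptions made for illustration only. The idea is that each fold receives roughly the same share of every class.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=123):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)

    # Group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy labels (hypothetical counts, not taken from the dataset)
labels = ["EVD"] * 50 + ["MVD"] * 30 + ["NIL"] * 20
folds = stratified_folds(labels, k=5)

for fold in folds:
    counts = {c: sum(labels[i] == c for i in fold) for c in ("EVD", "MVD", "NIL")}
    print(counts)
```

With these toy counts every fold holds 10 EVD, 6 MVD and 4 NIL samples, so each validation fold mirrors the overall class mix.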

![image.png](attachment:image.png)

### Terminologies:
- AUC: Area under the curve
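
To make the AUC definition concrete, here is a minimal sketch for the binary case. The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The function name `binary_auc` and the toy scores are illustrative assumptions; for a multiclass problem such as EVD / MVD / NIL, a common approach is to compute this one class against the rest and average the results.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = binary_auc(y_true, scores)
print(auc)  # 0.75
```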

## Pictures

![image.png](attachment:image.png)

Ext Data
![image.png](attachment:image.png)

Ext2 Data
![image.png](attachment:image.png)

## Get Data

In [1]:
!pip install pycaret pandas openpyxl explainerdashboard shap





In [2]:
#!pip install openpyxl

In [3]:
#!pip install pycaret

In [2]:
import pandas as pd
from pycaret.classification import *

In [5]:
# reads the unclean data set from excel
raw_data = pd.read_excel('EVD data LAGOS26SEPT2014_080944.xlsx',sheet_name = '10-AUG2014 - LGA CONTACT TRACER')


In [6]:
raw_data.head()

Unnamed: 0,S/NO,Gender,Age,Address,State,SourceCase,DateLastContact,LGA,Phone,HCW,HCFacility,Unnamed: 11,FinalOutcome,Temperature Chart,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39
0,,,,,,,,,,,,,,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,,,,,,
1,1.0,Male,51.0,"NO 2 OLAGOKE STREET, SUPER BUS STOP, ABULE EGBA",LAGOS,DR ADEJORO IGE,2014-08-04 00:00:00,AGEGE,8055514314.0,No,,,,,,36.1,36.9,36.7,36.8,37.1,36.7,37.0,36.1,36.7,36.4,36.6,36.8,36.8,36.8,37.0,36.8,37.0,36.1,37.0,,,,,,
2,2.0,Female,40.0,"NO 2 OLAGOKE STREET, SUPER BUS STOP, ABULE EGBA",LAGOS,DR ADEJORO IGE,2014-08-04 00:00:00,AGEGE,7058588437.0,No,,,,,,35.9,36.4,36.7,37.0,36.9,36.6,37.4,36.7,36.3,36.2,36.6,35.7,36.8,37.0,36.7,36.7,36.6,36.8,36.7,,,,,,
3,3.0,Female,16.0,"NO 2 OLAGOKE STREET, SUPER BUS STOP, ABULE EGBA",LAGOS,DR ADEJORO IGE,2014-08-04 00:00:00,AGEGE,,No,,,,,,36.4,36.5,36.9,36.0,35.8,36.1,36.8,36.6,36.6,36.5,36.4,36.7,37.2,37.1,37.4,37.2,36.7,36.9,37.2,,,,,,
4,4.0,Male,10.0,"NO 2 OLAGOKE STREET, SUPER BUS STOP, ABULE EGBA",LAGOS,DR ADEJORO IGE,2014-08-04 00:00:00,AGEGE,,No,,,,,,36.1,36.4,36.6,37.2,36.3,36.3,35.5,35.7,35.1,36.7,36.1,36.2,36.6,36.6,35.5,35.4,36.6,35.6,36.0,,,,,,


In [7]:
# reads formatted data for analysis
formatted_data = pd.read_excel('EVD data LAGOS26SEPT2014_080944.xlsx',sheet_name = 'Formatted')

ValueError: Worksheet named 'Formatted' not found

In [None]:
formatted_data.head()

## Clean Data with C#

 public static void EbolaDataCleaning()
 {
            var data = File.ReadAllLines("formatted_data.txt");
            var list = new List<int>();
            var target = new List<string>();

            for (int i = 1; i < data.Length; i++)
            {
                var cols = ToIntarray(data[i].Split('\t'));

                var temps = cols.Take(21).ToArray();
                var evds = cols.Skip(21).Take(8).ToArray();
                var mvds = cols.Skip(29).Take(7).ToArray(); // start after the 8 EVD flags (columns 21-28) so the two groups do not overlap

                var tevdL = temps.Where(x => x >= 38);
                var tmvdL = temps.Where(x => IsRange(x, 35, 37.9));
                var tnilL = temps.Where(x => x < 35);

                var eevdL = evds.Where(x => x == 1);
                var enilL = evds.Where(x => x == 0);

                var mnilL = mvds.Where(x => x == 0);
                var mmvdL = mvds.Where(x => x == 1);

                var evdCount = tevdL.Concat(eevdL).Count();
                var mvdCount = tmvdL.Concat(mmvdL).Count();
                var nilCount = tnilL.Concat(enilL.Concat(mnilL)).Count();

                if (evdCount > mvdCount)
                {
                    if (evdCount > nilCount)
                    {
                        // Console.Write("Sample is EVD positive");
                        target.Add("EVD");
                    }
                    else
                    {
                        // Console.Write("Sample is negative");
                        target.Add("NIL");
                    }
                }
                else if (mvdCount > nilCount)
                    //Console.Write("Sample is MVD positive");
                    target.Add("MVD");
                else
                    // Console.Write("Sample is negative");
                    target.Add("NIL");

            }
            var evd = target.Where(x => x == "EVD").Count();
            var mvd = target.Where(x => x == "MVD").Count();
            var nil = target.Where(x => x == "NIL").Count();

            File.WriteAllLines(@"output.txt", target.ToArray());
}
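
For readers following the notebook in Python, the labelling rule above can be sketched as a small function. This is an illustrative port, not the author's code. It assumes each row holds 21 temperature readings, then 8 EVD symptom flags, then 7 MVD symptom flags, and that the `IsRange` helper (not shown in the source) is inclusive at both ends. The name `label_row` is hypothetical.

```python
def label_row(cols):
    """Label one row as 'EVD', 'MVD' or 'NIL' by counting votes, as in the C# routine."""
    temps = cols[:21]        # assumed: 21 temperature readings
    evd_flags = cols[21:29]  # assumed: 8 EVD symptom flags (0/1)
    mvd_flags = cols[29:36]  # assumed: 7 MVD symptom flags (0/1)

    # Each reading or flag casts one vote for EVD, MVD or NIL
    evd_count = sum(t >= 38 for t in temps) + sum(f == 1 for f in evd_flags)
    mvd_count = sum(35 <= t <= 37.9 for t in temps) + sum(f == 1 for f in mvd_flags)
    nil_count = (
        sum(t < 35 for t in temps)
        + sum(f == 0 for f in evd_flags)
        + sum(f == 0 for f in mvd_flags)
    )

    if evd_count > mvd_count:
        return "EVD" if evd_count > nil_count else "NIL"
    return "MVD" if mvd_count > nil_count else "NIL"


# Toy rows (made-up values, not taken from the dataset)
high_fever_row = [39] * 21 + [1] * 8 + [0] * 7
normal_temp_row = [36] * 21 + [0] * 8 + [1] * 7
low_temp_row = [30] * 21 + [0] * 8 + [0] * 7

print(label_row(high_fever_row))   # EVD
print(label_row(normal_temp_row))  # MVD
print(label_row(low_temp_row))     # NIL
```

Note that the rule is a simple vote count, so the resulting `Target` column is a derived label rather than a clinical diagnosis.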

In [10]:
# reads the cleaned, labelled data from the 'Ext2' sheet for analysis
clean_data = pd.read_excel('EVD data LAGOS26SEPT2014_080944.xlsx',sheet_name = 'Ext2')

In [11]:
clean_data.head()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,...,Deep-set eyes,Expressionless faces,Extreme lethargy,Non-itchy rash,"Multiple bleeding (nose, gums, vagina)",Confusion,Irritability and aggression,Orchitis (inflammation of testicles,Severe blood loss and shock,Target
0,23,21,33,37,30,34,33,37,40,24,...,0,1,0,1,0,1,1,0,0,MVD
1,31,26,21,23,21,20,36,27,29,20,...,1,1,0,0,1,0,1,1,0,MVD
2,20,25,37,40,29,22,37,32,37,20,...,1,1,1,1,0,0,1,1,1,EVD
3,22,22,40,27,30,40,26,25,23,23,...,0,1,1,1,1,1,1,0,0,EVD
4,34,37,22,36,32,40,29,21,23,36,...,1,1,0,0,1,1,1,0,0,EVD


In [12]:
clean_data.tail()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,...,Deep-set eyes,Expressionless faces,Extreme lethargy,Non-itchy rash,"Multiple bleeding (nose, gums, vagina)",Confusion,Irritability and aggression,Orchitis (inflammation of testicles,Severe blood loss and shock,Target
2495,28,21,40,27,22,29,24,33,26,21,...,1,0,1,0,0,0,1,1,0,MVD
2496,23,36,28,27,36,31,23,28,22,29,...,1,0,1,1,0,0,1,0,1,NIL
2497,40,40,27,20,23,21,33,30,21,37,...,1,1,1,1,1,1,0,0,1,EVD
2498,40,38,33,36,23,25,27,25,32,35,...,1,0,1,1,0,0,1,0,1,EVD
2499,23,21,36,21,22,38,38,21,40,28,...,1,0,1,0,1,1,0,1,1,EVD


In [13]:
clean_data.dtypes

T1                                         int64
T2                                         int64
T3                                         int64
T4                                         int64
T5                                         int64
T6                                         int64
T7                                         int64
T8                                         int64
T9                                         int64
T10                                        int64
T11                                        int64
T12                                        int64
T13                                        int64
T14                                        int64
T15                                        int64
T16                                        int64
T17                                        int64
T18                                        int64
T19                                        int64
T20                                        int64
T21                 

In [14]:
train_data = clean_data.sample(frac=0.8, random_state=786)
test_data = clean_data.drop(train_data.index)

train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

print('Train Data for Modeling: ' + str(train_data.shape))
print('Test Data For Predictions: ' + str(test_data.shape))

Train Data for Modeling: (2000, 53)
Test Data For Predictions: (500, 53)


In [None]:
#engineering_features = ['T1','T2','T3','T4','T5','T6','T7','T8','T9','T10','T12','T13','T14','T15','T16','T17','T18','T19','T20','T21','Fever','Headache', 'Stomach Pain', 'Vomiting', 'Diarrhea','Sore Throat', 'Loss of Appetite', 'Weakness and Fatigue', 'Nausea', 'Vomiting.1', 'Chest Pain', 'Sore Throat.1', 'Abdominal Pain', 'Diarrhea.1', 'Severe Weight Loss' ]

In [None]:
setup??

In [15]:
#experiment = setup(data =train_data, target ='Target', categorical_features =engineering_features)
#from pycaret.classification import *
experiment = setup(data =train_data, target ='Target', session_id=123)

Unnamed: 0,Description,Value
0,session_id,123
1,Target,Target
2,Target Type,Multiclass
3,Label Encoded,"EVD: 0, MVD: 1, NIL: 2"
4,Original Data,"(2000, 53)"
5,Missing Values,False
6,Numeric Features,21
7,Categorical Features,31
8,Ordinal Features,False
9,High Cardinality Features,False


In [16]:
best = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.9428,0.9692,0.7007,0.9259,0.9312,0.8922,0.8939,1.497
ridge,Ridge Classifier,0.9192,0.0,0.6426,0.877,0.8974,0.8456,0.8495,0.011
lda,Linear Discriminant Analysis,0.9157,0.9684,0.6867,0.8996,0.9052,0.8412,0.8431,0.018
nb,Naive Bayes,0.8842,0.9622,0.6662,0.8664,0.8723,0.7811,0.7837,0.013
gbc,Gradient Boosting Classifier,0.8471,0.9456,0.6429,0.8431,0.8354,0.7095,0.7122,0.779
qda,Quadratic Discriminant Analysis,0.8342,0.9155,0.5832,0.7974,0.8142,0.6829,0.6875,0.017
lightgbm,Light Gradient Boosting Machine,0.8306,0.9358,0.7016,0.8353,0.8261,0.6812,0.6835,0.176
et,Extra Trees Classifier,0.8084,0.8944,0.565,0.7721,0.789,0.6335,0.6374,0.27
rf,Random Forest Classifier,0.7891,0.8686,0.5512,0.7546,0.7701,0.5962,0.6009,0.289
svm,SVM - Linear Kernel,0.6884,0.0,0.4899,0.7216,0.6296,0.4069,0.468,0.039
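
The Accuracy and Kappa columns in the comparison table can both be derived from a confusion matrix. The sketch below shows the arithmetic on a made-up 3x3 matrix (rows are true classes, columns are predicted classes); the function name and the numbers are assumptions for illustration and are not results from this experiment. Cohen's kappa corrects accuracy for the agreement expected by chance: kappa = (observed - expected) / (1 - expected).

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix."""
    total = sum(sum(row) for row in matrix)

    # Observed agreement: share of samples on the diagonal
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total

    # Chance agreement: product of row and column marginals, summed over classes
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)

    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix for EVD, MVD, NIL (100 samples)
confusion = [
    [45, 5, 0],
    [5, 25, 0],
    [0, 0, 20],
]
acc, kappa = accuracy_and_kappa(confusion)
print(round(acc, 4), round(kappa, 4))
```

Here accuracy is 0.9 while kappa is lower (about 0.84), because some of the agreement would occur by chance given the class proportions. In the table above, the AUC of 0.0 for the Ridge and linear SVM rows most likely reflects that those estimators do not produce class probabilities, rather than a genuinely zero score.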


In [None]:
rf = create_model('rf')

In [None]:
print(rf)

In [None]:
knn = create_model('knn')

In [None]:
nb = create_model('nb')

In [None]:
tuned_rf = tune_model(rf)

In [None]:
print(tuned_rf)

In [None]:
import numpy as np
# n_neighbors must be at least 1, so the grid starts at 1 rather than 0
tuned_knn = tune_model(knn, custom_grid = {'n_neighbors' : np.arange(1,50,1)})

In [None]:
#tuned_nb = tune_model(nb)

In [None]:
#plot_model(tuned_knn, plot = 'confusion_matrix')

In [None]:
#plot_model(tuned_nb, plot = 'class_report')

In [None]:
#plot_model(tuned_knn, plot='boundary')

In [None]:
#plot_model(tuned_knn, plot = 'error')
#plot_model(tuned_rf, plot = 'pr')
#plot_model(tuned_rf, plot = 'feature')
#plot_model(tuned_rf, plot = 'Accuracy')

#plot_model(tuned_nb, plot = 'class_report')
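
The outline lists Finalize, Predict and Save / Load steps that the cells here do not yet cover. A minimal sketch of those steps with PyCaret's classification API might look like the following. It assumes `setup()` has already been run in the session and that a trained model (for example `best`) and the held-out `test_data` exist. The wrapper names and the file name `evd_mvd_model` are hypothetical.

```python
def finalize_predict_save(model, unseen_data, model_name="evd_mvd_model"):
    """Refit the model on the full setup data, score unseen rows, and save the pipeline."""
    # Imported here so the sketch can be defined without PyCaret loaded
    from pycaret.classification import finalize_model, predict_model, save_model

    final_model = finalize_model(model)                        # retrain including the hold-out split
    predictions = predict_model(final_model, data=unseen_data)  # DataFrame with prediction columns appended
    save_model(final_model, model_name)                         # writes '<model_name>.pkl'
    return final_model, predictions


def load_saved(model_name="evd_mvd_model"):
    """Load a pipeline previously written by save_model."""
    from pycaret.classification import load_model

    return load_model(model_name)


# Example usage inside the notebook (requires the earlier setup() call):
# final_model, predictions = finalize_predict_save(best, test_data)
# reloaded = load_saved()
```

Keeping `test_data` out of `setup()` until this point means the predictions give an estimate on rows the experiment has never seen.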

In [None]:
#from pycaret.classification import *

#dashboard(nb, display_format='inline')

In [None]:
import pycaret  # the star import above does not bind the name `pycaret`
pycaret.__version__