# **BIG BIO-DATA ANALYSIS - PRACTICE : 1**
### **SABAKAKI PETER ZIRIBAGWA** (MSC. BIOINFORMATICS)

## 1. Importing the necessary libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


## 2. Loading the datasets

In [None]:
train = pd.read_csv("../input/amp-data-set/AMP_TrainSet.csv") # this is the training dataset
test = pd.read_csv("../input/amp-data-set/Test.csv") # this is the testing dataset

## 3. Checking dataset content
- This gives a quick insight into the content of the datasets, i.e. it answers the question, "What is within the datasets?"

In [None]:
train.head() # to see what is in the first 5 rows of the train dataset

In [None]:
test.head() # to see what is in the first 5 rows of the test data set

## **4. Checking the dimensions of the dataset**
- It is important to know the number of rows and columns in the dataset, because a very large dataset may take a long time to train on, whereas a very small dataset may reduce the performance of the model.

- The dimensions also help to decide which tool to use when loading the data. For example, pandas loads the whole file into memory, so a dataset larger than the available RAM may need chunked reading or an out-of-core tool.

In [None]:
train.shape # to check the dimensions, i.e. the number of rows and columns

`The above results show that there are 3038 rows and 12 columns.`
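
As a rough check on whether a file is small enough for pandas, the memory footprint can be measured directly, and a larger file can be read in chunks. The sketch below is only illustrative: the CSV is a small synthetic one held in memory, and the column names `f1`, `f2` and the chunk size of 4 are assumptions, not the AMP data.

```python
import io
import pandas as pd

# Synthetic stand-in for a CSV file (assumption: this is not the real AMP data)
csv_text = "f1,f2,CLASS\n" + "\n".join(f"{i},{i * 0.5},{i % 2}" for i in range(10))

# Small file: load everything at once and inspect its in-memory size
df = pd.read_csv(io.StringIO(csv_text))
n_rows, n_cols = df.shape
memory_bytes = int(df.memory_usage(deep=True).sum())
print(n_rows, n_cols, memory_bytes)

# Large file: read a few rows at a time and aggregate while iterating
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)
print(total_rows)
```

With a real file, the path would replace `io.StringIO(csv_text)` and the chunk size would be chosen to fit comfortably in memory.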

## **5. Checking column names of the train dataset**

In [None]:
train.columns  # checking the column names within the train dataset
# This gives an insight into the features available in the dataset
 

## **6. Checking the data type of each feature / column**
- This is important because some algorithms require, or work better with, certain data types. So it is good for me to first check that the data types of my features are appropriate for the algorithm I intend to use and, if not, to convert them to suitable ones.

In [None]:
train.dtypes # checking data types in the train dataset
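
If a column turns out to have an unsuitable type, it can be converted before modelling. A minimal sketch on a hypothetical frame (the column `length` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical frame: a numeric column stored as text and a 0/1 label stored as float
df = pd.DataFrame({"length": ["10", "25", "31"], "CLASS": [0.0, 1.0, 1.0]})
print(df.dtypes)

converted = df.copy()
converted["length"] = pd.to_numeric(converted["length"])  # text -> integer
converted["CLASS"] = converted["CLASS"].astype("int64")   # float -> integer
print(converted.dtypes)
```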

## 7. Descriptive Statistics 
- This is important for understanding and drawing conclusions about the data in each column of the dataset.
- It enables one to describe the size, center and spread of the data in each column, i.e. the count, mean, standard deviation, min, max and the quartiles (25th, 50th and 75th percentiles).

In [None]:
train.describe() # to output the statistics of each attribute in the entire dataset

## 8. Checking the distribution of the class
- It is important to know how balanced the frequencies of the outcome categories are, because imbalanced class frequencies may bias the final model towards the majority class.

In [None]:
train.groupby('CLASS').size().plot(kind='bar') # a bar graph is used to visualize the distribution of the values in the class attribute

`From the above bar graph, I deduce that the "CLASS" attribute is evenly distributed;`
`hence, there is no need to use methods such as SMOTE to balance the class attribute.`
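
The balance can also be expressed as numbers rather than read off the bar graph. A small sketch, assuming a hypothetical label column (these ten labels are invented, not taken from the train dataset):

```python
import pandas as pd

# Hypothetical class labels (assumption: invented for illustration)
labels = pd.Series([0, 1, 0, 1, 1, 0, 1, 0, 1, 1], name="CLASS")

counts = labels.value_counts().sort_index()                      # rows per class
proportions = labels.value_counts(normalize=True).sort_index()   # share per class
imbalance_ratio = counts.max() / counts.min()                    # 1.0 means perfectly balanced

print(counts)
print(proportions)
print(imbalance_ratio)
```

A ratio close to 1 supports the reading that no resampling is needed.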

## **9. Determining the correlation between all the attributes across the entire dataset**

In [None]:
train.corr(method='pearson') # compute the pairwise Pearson correlation coefficients

In [None]:
plt.figure(figsize=(6,6))
sns.heatmap(train.corr(method='pearson')) # construct a heatmap to visually display the Pearson correlation between attributes across the entire train dataset
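
To complement the heatmap, the strongest pairwise correlations can be listed in order. This is a sketch on synthetic columns `a`, `b` and `c` (assumed names, with `b` deliberately built from `a`), not on the AMP features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # built to correlate strongly with a
    "c": rng.normal(size=200),                 # independent noise
})

corr = df.corr(method="pearson")
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
pairs = pairs.sort_values(key=abs, ascending=False)  # strongest first, ignoring sign

top_pair = pairs.index[0]
print(pairs)
```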

## **10. Checking the skewness of the data**
- It is important to know the skewness of the data, because some algorithms perform better when the attributes are approximately Gaussian.

In [None]:
train.skew() # compute the level of skewness of each attribute in the train dataset

In [None]:
train.skew().plot(kind='bar') # visualise the level of skewness of each attribute in the train dataset

`The graph above shows that the "NT_EFC195" attribute is highly positively skewed. I may consider transforming it in future.`
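
One common option for a positively skewed attribute is a log transform. The sketch below uses synthetic log-normal values, not the real "NT_EFC195" column, and `np.log1p` assumes the values are greater than -1, which would need checking on the actual data first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed values (assumption: stands in for a skewed attribute)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

skew_before = values.skew()
skew_after = np.log1p(values).skew()  # log(1 + x) compresses the long right tail

print(skew_before, skew_after)
```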

## 11. Univariate data visualisation 
- It is important to observe the distribution of each attribute independently.
- It also gives a picture of the data skewness, i.e. how the values are spread across each attribute and how many lie around the center.

In [None]:
# Generate histograms
train.hist(figsize=(15,15)) # histogram of each attribute; figsize sets the dimensions of the plots
plt.show() # it outputs the plots

In [None]:
# Density plots
train.plot(kind='density', subplots=True, layout=(4,3), sharex=False)
plt.show()

In [None]:
# Box and whisker plots
# This gives a visual of the median, the range between 25% and 75% quartiles, and the outliers on the whiskers 
train.plot(kind='box', subplots=True, layout=(4,3), sharex=False, sharey=False)
plt.show()
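
The points drawn beyond the whiskers follow the 1.5 x IQR rule, which is matplotlib's default whisker setting. A minimal sketch of the same rule computed by hand, on a made-up column:

```python
import numpy as np

# Hypothetical column with one extreme value (assumption: invented for illustration)
values = np.array([1, 2, 3, 4, 5, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])   # lower and upper quartiles (box edges)
iqr = q3 - q1                              # interquartile range (box height)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, outliers)
```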

## **12. Multivariate data visualization**
- Unlike univariate plots, these show how each attribute is related to the other attributes in the dataset.

### Correlation matrix 

In [None]:
# plot a correlation matrix across the train dataset.
# this displays a heatmap that shows how each attribute is correlated with the others in the dataset
correlations = train.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(len(correlations.columns)) # one tick per attribute in the matrix
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns, rotation=90) # rotate so the names do not overlap
ax.set_yticklabels(correlations.columns)
plt.show()

### Scatter plot matrix
- This shows the pairwise relationship between every two attributes in the dataset.

In [None]:
# sns.pairplot(train) # draw the scatter plot matrix using a pair plot


## **13. Data preparation**

### Standardizing the data

In [None]:
# Standardizing the data.
from sklearn.preprocessing import StandardScaler
array = train.values
# separate the array into input and output components
x = array[:,0:11]
y = array[:,11]
scaler = StandardScaler().fit(x)
rescalledx = scaler.transform(x)
# summarize the transformed data
np.set_printoptions(precision=3)
print(rescalledx[0:5,:])
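
Standardizing subtracts each column's mean and divides by its standard deviation. A small sketch on a made-up array, checking that `StandardScaler` matches the manual formula (it uses the population standard deviation, i.e. `ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 3 x 2 input array (assumption: invented for illustration)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # np.std defaults to ddof=0

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))  # each column: mean ~0, std ~1
```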

### Rescaling the data

In [None]:
#Rescaling the data.
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

array = train.values
# separate array into input and output components
X = array[:,0:11]
Y = array[:,11]
scaler = MinMaxScaler(feature_range=(0, 1))
rescalledX = scaler.fit_transform(X)
# summarize transformed data
set_printoptions(precision=3)
print(rescalledX[0:5,:])

## Feature selection

In [None]:
# Recursive feature elimination

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

array = train.values
X = rescalledx[:,0:11] # standardized inputs
Y = array[:,11]
# feature selection: keep the 5 best-ranked features
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5) # must be passed by keyword in recent scikit-learn
fit = rfe.fit(X, Y)
print("Num Features: ",  fit.n_features_)
print("Selected Features:",  fit.support_)
print("Feature Ranking: ",  fit.ranking_)
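
The boolean mask and ranking are easier to read when mapped back to column names. A sketch on synthetic data from `make_classification`; the feature names `feat_0` to `feat_7` and `max_iter=1000` are assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption: stands in for the scaled train features)
X_arr, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

selected = X.columns[rfe.support_].tolist()                    # names of the kept features
ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()  # 1 = selected

print(selected)
print(ranking)
```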

## Comparing multiple classification algorithms

In [None]:
# Compare algorithms
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# split the min-max scaled training data into inputs (X) and class labels (Y)
X = rescalledX[:,0:11]
Y = array[:,11]

# prepare models and add them to a list
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'

for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
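    # report the mean accuracy and its standard deviation across the 10 folds
    print(f"{name}: mean accuracy = {cv_results.mean():.3f} (std = {cv_results.std():.3f})")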

# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

`The above comparison shows that Naive Bayes gives the best performance; therefore, I am going to continue with Naive Bayes.`
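
A possible refinement of this comparison, sketched on synthetic data rather than the AMP set: putting the scaler and the classifier in one pipeline refits the scaler inside each fold, and a shuffled stratified split keeps the class ratio similar across folds. The data, the `random_state` values and the choice of `StratifiedKFold` are assumptions, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with 11 features (assumption: stands in for the train inputs)
X, y = make_classification(n_samples=300, n_features=11, random_state=1)

# Scaling happens inside each fold, so validation rows never influence the scaler
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy = {scores.mean():.3f} (std = {scores.std():.3f})")
```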

## Rescaling the test dataset

In [None]:
# Rescaling the test data.
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler

# fit the scaler on the training inputs only, then apply the same scaling to the test inputs
# (a scaler fitted separately on the test set would put train and test on different scales)
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train.values[:,0:11])
array = test.values
test_inputs = array[:,0:11]
rescalled_test = scaler.transform(test_inputs)
# summarize transformed data
set_printoptions(precision=3)
print(rescalled_test[0:5,:])
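
A common convention is to fit a scaler on the training inputs and reuse it on new data, so that both share one scale. The toy arrays below are invented to show how the result differs from fitting a separate scaler on the new data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature arrays (assumption: invented for illustration)
X_train = np.array([[0.0], [5.0], [10.0]])
X_new = np.array([[2.5], [12.0]])

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
new_scaled = scaler.transform(X_new)                 # uses the train min (0) and max (10)
refit_scaled = MinMaxScaler().fit_transform(X_new)   # separate fit: a different scale

print(new_scaled.ravel())    # values may fall outside [0, 1]
print(refit_scaled.ravel())
```

With the shared scaler, a new value above the training maximum maps above 1, which is expected and keeps the features comparable to those the model was trained on.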

## Generating the model using Naive Bayes

In [None]:
# using Naive Bayes

from sklearn.naive_bayes import GaussianNB
array = train.values
X = rescalledX[:,0:11]
Y = array[:,11]
model = GaussianNB()  # Gaussian Naive Bayes classifier

# fitting the model on the full training set
model.fit(X,Y)

# predicting the classes of the test dataset using the model
Class= model.predict(rescalled_test)

# returning the Naive Bayes output in a dataframe
report = pd.DataFrame(Class)

report.columns=['CLASS'] # naming the prediction column
report.index.name='Index' # naming the index column

# map function to change the 0.0 and 1.0 into False and True respectively
report['CLASS']= report['CLASS'].map({0.0:False, 1.0:True})
print(report.head()) # preview the first predictions

# storing the dataframe output in a csv file
report.to_csv("ziribagwa.csv")
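
Before trusting predictions on a test set, the model can be checked on a held-out part of the training data. A minimal sketch on synthetic data; the 80/20 split, the `random_state` and the data itself are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic labelled data (assumption: stands in for the train dataset)
X, y = make_classification(n_samples=300, n_features=11, random_state=7)

# Hold out 20% of the labelled rows, keeping the class ratio the same in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=7
)

model = GaussianNB().fit(X_tr, y_tr)
pred = model.predict(X_val)

cm = confusion_matrix(y_val, pred)   # rows: true class, columns: predicted class
acc = accuracy_score(y_val, pred)
print(cm)
print(acc)
```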

