## Diabetes prediction problem
(taken from https://www.kaggle.com/kumargh/pimaindiansdiabetescsv):

The goal is to predict whether a patient will develop diabetes within the next 5 years.
The following are the input variables:

 $x_0$: Number of times pregnant.
 
 $x_1$: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
 
 $x_2$: Diastolic blood pressure (mm Hg).
 
 $x_3$: Triceps skinfold thickness (mm).
 
 $x_4$: 2-hour serum insulin (mu U/ml).
 
 $x_5$: Body mass index (weight in kg / (height in m)$^2$).
 
 $x_6$: Diabetes pedigree function.
 
 $x_7$: Age (years).

Column 8 holds the target $t$.
Thus $\vec{x} \in \mathbb{R}^8$, and the problem is one of binary classification.

The dataset contains $m=768$ samples or observations.

A simple linear baseline model reaches around 65% accuracy, while the top results reach around 77%.
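To put the 65% figure in context, it can help to compare it with a trivial classifier that always predicts the majority class. The sketch below derives that rate from the target mean (0.348958) and the sample count (768) reported by `describe()` later in this notebook; the variable names are illustrative only.

```python
# Values read from the describe() output shown later in this notebook.
m = 768                    # number of observations
positive_rate = 0.348958   # mean of the 0/1 target (column 8)

# Recover the class counts from the mean of a binary target.
n_positive = round(m * positive_rate)
n_negative = m - n_positive

# Accuracy of a classifier that always predicts the majority class.
majority_accuracy = max(n_positive, n_negative) / m
print(n_positive, n_negative, round(majority_accuracy, 4))
```

The majority class covers about 65.1% of the samples, so a model scoring near 65% is doing little better than always predicting "no diabetes".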

# Loading Google Drive
Mount Google Drive and set the dataset path.

In [1]:
# see this example:
#https://medium.com/predict/using-pytorch-for-kaggles-famous-dogs-vs-cats-challenge-part-1-preprocessing-and-training-407017e1a10c

# Load the Drive helper and mount
from google.colab import drive
#pandas library for reading csvs
from pandas import read_csv
import numpy as np
#scikit binary classification algorithms
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Perceptron


# This will prompt for authorization.
drive.mount('/content/drive')
!ls "/content/drive/My Drive/Parma/LibroVersionMasReciente/2_Preprocesamiento/dataImputing/"

diabetesDatasetPath = '/content/drive/My Drive/Parma/LibroVersionMasReciente/2_Preprocesamiento/dataImputing/pima-indians-diabetes.csv'

ModuleNotFoundError: No module named 'google.colab'

# Descriptive statistics for the data
We can use the pandas function `describe` to see the minimum and maximum values of each column, its quartiles, and other basic descriptive statistics.



In [0]:

#read the dataset from csv
dataset = read_csv(diabetesDatasetPath, header=None)
#print descriptive stats
print(dataset.describe())


# print the first 20 rows of data
print(dataset.head(20))

print("number of zeros per column")
print((dataset == 0).astype(int).sum())


                0           1           2  ...           6           7           8
count  768.000000  768.000000  768.000000  ...  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469  ...    0.471876   33.240885    0.348958
std      3.369578   31.972618   19.355807  ...    0.331329   11.760232    0.476951
min      0.000000    0.000000    0.000000  ...    0.078000   21.000000    0.000000
25%      1.000000   99.000000   62.000000  ...    0.243750   24.000000    0.000000
50%      3.000000  117.000000   72.000000  ...    0.372500   29.000000    0.000000
75%      6.000000  140.250000   80.000000  ...    0.626250   41.000000    1.000000
max     17.000000  199.000000  122.000000  ...    2.420000   81.000000    1.000000

[8 rows x 9 columns]
     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0

# Data marking
On some columns, such as triceps skinfold thickness or diastolic blood pressure, a value of 0 is physically meaningless and actually stands for a missing measurement. A machine learning algorithm should ignore these entries, but most algorithms treat a literal zero as a valid value, which is misleading.
It is therefore important to mark them with the special value NaN.
In this dataset, the zeros in columns 1 to 5 must be replaced by NaN.

For this purpose we use the pandas function `replace`; the resulting NaN marks can later serve as flags for manipulating observations (deleting them, for instance).
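As a minimal sketch of the marking step, the example below applies the same `replace` call to a small made-up frame (the `toy` data is hypothetical, not taken from the dataset). Column 0 may legitimately contain zeros, while a zero in column 1 is treated as a missing measurement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a zero is valid in column 0 but not in column 1.
toy = pd.DataFrame({0: [0, 2, 5], 1: [72, 0, 64]})

# Mark zeros as missing only in the column where 0 is not a valid measurement.
toy[[1]] = toy[[1]].replace(0, np.nan)

# Count the marked entries per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# pandas statistics skip NaN, so the mean uses only the two valid rows.
col1_mean = toy[1].mean()
print(col1_mean)
```

Because pandas skips NaN when computing statistics, the marked zeros no longer drag the column mean down.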


In [0]:
# mark zeros in columns 1 to 5 as missing values
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, np.nan)
print("Number of null observations per column: ")
print(dataset.isnull().sum())
# we could eliminate the observations (rows) that contain NaNs;
# the line is left commented out, so the counts below do not change
#dataset = dataset.dropna()
print("Number of null observations per column after NaN entry deletion: ")
print(dataset.isnull().sum())
# print descriptive stats
print(dataset.describe())


Number of null observations per column: 
0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
Number of null observations per column after NaN entry deletion: 
0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
                0           1           2  ...           6           7           8
count  768.000000  763.000000  733.000000  ...  768.000000  768.000000  768.000000
mean     3.845052  121.686763   72.405184  ...    0.471876   33.240885    0.348958
std      3.369578   30.535641   12.382158  ...    0.331329   11.760232    0.476951
min      0.000000   44.000000   24.000000  ...    0.078000   21.000000    0.000000
25%      1.000000   99.000000   64.000000  ...    0.243750   24.000000    0.000000
50%      3.000000  117.000000   72.000000  ...    0.372500   29.000000    0.000000
75%      6.000000  141.000000   80.000000  ...    0.626250   41.000000    1.000000
max     17.000000  199.000000  122.000000

# Consequences of missing values
Running a simple linear classifier shows the consequence of leaving NaN values in the data: unless the observations with NaNs are eliminated, the perceptron algorithm fails during execution.

If we eliminate the entries with NaNs, performance is likely to decrease, because far fewer observations remain for training (column 4 alone has 374 missing values out of 768).
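Both effects can be reproduced on a small made-up frame. The sketch below uses hypothetical toy data and column names (`a`, `b`, `t`): it shows that fitting a `Perceptron` on inputs containing NaN raises a `ValueError`, and that dropping incomplete rows makes the fit possible at the cost of samples.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Perceptron

# Hypothetical toy data: two features and a binary target "t".
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [0.5, np.nan, 1.5, 2.0, 2.5, 3.0],
    "t": [0, 0, 0, 1, 1, 1],
})

# Fitting on data that contains NaN raises a ValueError.
try:
    Perceptron().fit(toy[["a", "b"]], toy["t"])
    fit_failed = False
except ValueError:
    fit_failed = True

# Dropping incomplete rows makes the fit possible, but loses samples.
complete = toy.dropna()
model = Perceptron().fit(complete[["a", "b"]], complete["t"])
rows_before, rows_after = len(toy), len(complete)
print(fit_failed, rows_before, rows_after)
```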

In [0]:
def testModel(dataset):
  # split the dataset into inputs (columns 0-7) and target (column 8)
  values = dataset.values
  X = values[:, 0:8]
  t = values[:, 8]
  # use a Perceptron model
  model = Perceptron()
  # perform a 3-fold cross-validation; random_state is omitted because
  # it only has an effect when shuffle=True
  kfold = KFold(n_splits=3)
  result = cross_val_score(model, X, t, cv=kfold, scoring='accuracy')
  print(result.mean())

# the dataset still contains NaNs at this point, so this call fails
testModel(dataset)



ValueError: ignored

# Data imputation for missing values
Instead of simply eliminating records or attributes, we can use different techniques to fill in the missing values:


1.  Use a constant value that is meaningful for the attribute domain.
2.  Pick a random value from any other valid observation.
3.  Use a statistic from the dataset, such as the mean, median, or the mode.
4.  Use a value predicted by using a regression or any other model.
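
The four strategies listed above can be sketched on a single made-up column. All data and names here (`col`, `x`, the constant `0.0`, the random seed) are hypothetical, and the regression step uses a plain least-squares line from `np.polyfit` as one possible stand-in for "a regression or any other model".

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries, plus a complete feature x.
col = pd.Series([10.0, np.nan, 30.0, 50.0, np.nan])
x = pd.Series([1.0, 2.0, 3.0, 5.0, 4.0])
missing = col.isnull()

# 1. Constant value chosen from domain knowledge (0.0 is an arbitrary choice).
by_constant = col.fillna(0.0)

# 2. Random draw from the observed (non-missing) values.
rng = np.random.default_rng(0)
observed = col.dropna().to_numpy()
by_random = col.copy()
by_random[missing] = rng.choice(observed, size=int(missing.sum()))

# 3. A statistic of the observed values, such as the mean or the median.
by_mean = col.fillna(col.mean())
by_median = col.fillna(col.median())

# 4. A value predicted from another feature, here with a fitted line.
slope, intercept = np.polyfit(x[~missing].to_numpy(), col[~missing].to_numpy(), deg=1)
by_regression = col.fillna(slope * x + intercept)

print(by_constant.tolist(), by_mean.tolist(), by_regression.round(6).tolist())
```

Each variant leaves the observed entries untouched and only fills the positions that were NaN.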

Try both the median and the mean for imputation, and find out which one performs better:



In [0]:
# per-column median (NaNs are skipped)
print(dataset.median())

# dataset.median() computes the per-column median; use dataset.mean() for the mean
# fillna replaces the NaNs of each column with the matching value
dataset.fillna(dataset.median(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())
# test the model with the imputed values
testModel(dataset)

0      3.0000
1    117.0000
2     72.0000
3     29.0000
4    125.0000
5     32.3000
6      0.3725
7     29.0000
8      0.0000
dtype: float64
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64
0.6796875
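
One possible way to compare mean and median imputation side by side is sketched below. It is not the approach used in the cell above: it swaps in scikit-learn's `SimpleImputer` inside a pipeline, so the fill value is computed on each training fold only. The helper name `compare_imputers` and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def compare_imputers(X, t, strategies=("mean", "median")):
    """Return the mean cross-validated accuracy for each imputation strategy."""
    kfold = KFold(n_splits=3)
    scores = {}
    for strategy in strategies:
        # The imputer is fitted inside each training fold by the pipeline.
        model = make_pipeline(SimpleImputer(strategy=strategy), Perceptron())
        result = cross_val_score(model, X, t, cv=kfold, scoring="accuracy")
        scores[strategy] = result.mean()
    return scores

# Synthetic stand-in data (not the Pima dataset), with about 10% of entries missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
t = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

scores = compare_imputers(X, t)
print(scores)
```

With the notebook's data, `X` and `t` would be the first eight columns and column 8 of `dataset`, taken after the NaN marking step and before the `fillna` call.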
