# Data & Information quality assignment

TODO:
- perform data quality assessment using the following metrics:
    - completeness
- perform data imputation
- perform data quality assessment again:
    - completeness
    - accuracy
    
- perform ML quality analysis

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
# Bind the module name directly: later cells call dirty_completeness.injection(...)
from utility import dirty_completeness

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

SEED = 122
plt.style.use('fivethirtyeight')

In [3]:
soybean_df = pd.read_csv('../datasets/soybean.csv')
soybean_df.head()

Unnamed: 0,date,plant-stand,precip,temp,hail,crop-hist,area-damaged,severity,seed-tmt,germination,...,sclerotia,fruit-pods,fruit-spots,seed,mold-growth,seed-discolor,seed-size,shriveling,roots,class
0,october,normal,gt-norm,norm,yes,same-lst-yr,low-areas,pot-severe,none,90-100,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
1,august,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,severe,fungicide,80-89,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
2,july,normal,gt-norm,norm,yes,same-lst-yr,scattered,severe,fungicide,lt-80,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
3,july,normal,gt-norm,norm,yes,same-lst-yr,scattered,severe,none,80-89,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker
4,october,normal,gt-norm,norm,yes,same-lst-two-yrs,scattered,pot-severe,none,lt-80,...,absent,norm,dna,norm,absent,absent,norm,absent,norm,diaporthe-stem-canker


Exploratory analysis.

In [37]:
soybean_df.nunique()

date                7
plant-stand         2
precip              3
temp                3
hail                2
crop-hist           4
area-damaged        4
severity            3
seed-tmt            3
germination         3
plant-growth        2
leaves              2
leafspots-halo      3
leafspots-marg      3
leafspot-size       3
leaf-spread         2
leaf-malf           2
leaf-mild           3
stem                2
lodging             2
stem-cankers        4
canker-lesion       4
fruiting-bodies     2
external-decay      2
mycelium            2
int-discolor        3
sclerotia           2
fruit-pods          3
fruit-spots         4
seed                2
mold-growth         2
seed-discolor       2
seed-size           2
shriveling          2
roots               3
class              15
dtype: int64

The [dataset documentation](https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)) lists the values each variable can take. From it we can deduce that almost all the variables are nominal categorical, with no natural order among their values.
The features that do carry an order are:

cyclic features:
- date

ordinal features:
- plant-stand
- precip
- temp
- germination

For the classification task, we can encode these features as numeric values so that the order of their values is preserved.
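
A minimal sketch of such an encoding follows. The category orders and month names are assumptions read off the value lists in the dataset documentation, and `encode_ordered_features`, `date_sin` and `date_cos` are hypothetical names introduced only for illustration. Ordinal features are mapped to integer ranks; the month is projected onto a circle so that its cyclic nature is kept.

```python
import numpy as np
import pandas as pd

# Assumed category orders (lowest to highest), taken from the dataset documentation.
ORDINAL_ORDERS = {
    "plant-stand": ["lt-normal", "normal"],
    "precip": ["lt-norm", "norm", "gt-norm"],
    "temp": ["lt-norm", "norm", "gt-norm"],
    "germination": ["lt-80", "80-89", "90-100"],
}
# Assumed month values of the `date` feature, in calendar order.
MONTHS = ["april", "may", "june", "july", "august", "september", "october"]


def encode_ordered_features(df):
    """Return a copy with ordinal columns as integer ranks and `date` as sin/cos."""
    out = df.copy()
    for col, order in ORDINAL_ORDERS.items():
        # Values not listed in `order` (including NaN) stay NaN.
        out[col] = out[col].map({value: rank for rank, value in enumerate(order)})
    # Calendar month number (april = 4, ..., october = 10) on a 12-month cycle.
    month_number = out["date"].map({m: i + 4 for i, m in enumerate(MONTHS)})
    angle = 2 * np.pi * month_number / 12
    out["date_sin"] = np.sin(angle)
    out["date_cos"] = np.cos(angle)
    return out.drop(columns="date")


toy = pd.DataFrame({
    "date": ["october", "july"],
    "plant-stand": ["normal", "lt-normal"],
    "precip": ["gt-norm", "norm"],
    "temp": ["norm", "lt-norm"],
    "germination": ["90-100", "lt-80"],
})
encoded = encode_ordered_features(toy)
print(encoded)
```

Because `map` leaves unknown values as NaN, the same function can be applied before or after null injection without raising.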


## Injection of null values

In [49]:
df_s = dirty_completeness.injection(soybean_df, SEED, name='', name_class='class')

saved -completeness50%
saved -completeness60%
saved -completeness70%
saved -completeness80%
saved -completeness90%
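
The file suffixes above suggest target completeness levels from 50% to 90%. A quick way to check the level actually reached is to measure completeness as the fraction of non-null cells. The sketch below uses a toy dataframe; `completeness` and `column_completeness` are hypothetical helper names, not part of the `dirty_completeness` utility.

```python
import pandas as pd


def completeness(df):
    """Fraction of non-null cells over the whole dataframe."""
    return float(df.notna().to_numpy().mean())


def column_completeness(df):
    """Fraction of non-null cells per column, as a Series."""
    return df.notna().mean()


# Toy example: one missing cell out of four.
toy = pd.DataFrame({"a": ["x", None], "b": ["y", "z"]})
overall = completeness(toy)
per_column = column_completeness(toy)
print(overall)
print(per_column)
```

Assuming `df_s` is a list of dataframes, as the later loop treats it, `[completeness(df) for df in df_s]` would give one value per injected dataset.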


## Simple imputation 

In this section we replace every NaN value in a column with the mode (the most frequent value) of that column.

In [55]:
imputed_df_s = []

for n, df in enumerate(df_s):
    
    print(f"imputing {n}-th dataframe")
    # Sanity check: be sure that there are null values in the dataframe
    assert df.isnull().sum().sum() != 0
    # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
    simple_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    imputed_df = simple_imputer.fit_transform(df)
    imputed_df = pd.DataFrame(imputed_df, columns=df.columns)
    imputed_df_s.append(imputed_df)
    
    # Sanity check: be sure that there are no null values remaining
    assert imputed_df.isnull().sum().sum() == 0

imputing 0-th dataframe
imputing 1-th dataframe
imputing 2-th dataframe
imputing 3-th dataframe
imputing 4-th dataframe
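
Once the nulls are filled, the accuracy of the imputation can be estimated by comparing the imputed cells with the original, clean values. The sketch below is one possible definition (the share of injected nulls that were restored to their original value). `imputation_accuracy` is a hypothetical helper, and it assumes the three dataframes have the same shape, row order and column order.

```python
import numpy as np
import pandas as pd


def imputation_accuracy(original, dirty, imputed):
    """Share of cells that were null in `dirty` and match `original` after imputation."""
    mask = dirty.isna().to_numpy()
    if not mask.any():
        return float("nan")  # nothing was imputed, so accuracy is undefined
    # Compare raw values positionally to avoid index or dtype alignment surprises.
    hits = (imputed.to_numpy() == original.to_numpy()) & mask
    return float(hits.sum() / mask.sum())


# Toy example: two nulls are filled with the mode "a"; only one of them was "a" originally.
original = pd.DataFrame({"f": ["a", "a", "b", "a"]})
dirty = pd.DataFrame({"f": ["a", np.nan, np.nan, "a"]})
imputed = pd.DataFrame({"f": ["a", "a", "a", "a"]})
accuracy = imputation_accuracy(original, dirty, imputed)
print(accuracy)
```

With the notebook's variables this would be called as `imputation_accuracy(soybean_df, df, imputed_df)` for each pair in `zip(df_s, imputed_df_s)`, provided `soybean_df` has the same columns as the injected dataframes.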
