# Missing values

Missing values should be handled carefully so that they do not distort analyses. Whatever marker is chosen to represent a missing value, it should be used consistently across all data files and documented in the metadata and/or data description files. When no such documentation exists, as in our case, the missing values have to be identified by analysing each column.


## 1) Identifying missing values
Missing values need to be handled because they degrade every performance metric we rely on. They can also lead to wrong predictions or classifications and can introduce a high bias into any model being used.

Depending on the data source, missing data are marked differently. Pandas represents missing values as NaN, but an analyst will only encounter them as NaN if the data has already been pre-processed. In raw data, missing values can appear as a question mark (?), a zero (0), minus one (-1) or a blank. It is therefore important that a data scientist always performs exploratory data analysis (EDA).
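The check described above can be sketched with plain pandas. The table, its column names and the choice of markers are hypothetical and only serve as an illustration; which values count as "missing" has to be decided per column after looking at the data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw table in which missing entries are disguised as "?", "" or -1
raw = pd.DataFrame({
    "age": [34, -1, 51, 29],
    "city": ["Paris", "?", "", "Lyon"],
})

# Step 1 (EDA): list the distinct values of each column to spot suspicious markers
unique_values = {col: raw[col].unique().tolist() for col in raw.columns}

# Step 2: map the markers found in each column to NaN.
# -1 is treated as missing only for "age"; in another column it could be a valid value.
clean = raw.replace({"age": {-1: np.nan}, "city": {"?": np.nan, "": np.nan}})

# Step 3: count the missing values that pandas now recognises
missing_per_column = clean.isna().sum()
print(unique_values)
print(missing_per_column)
```

Replacing markers column by column, rather than globally, avoids turning a legitimate 0 or -1 in another column into a missing value.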


## 2) Handling missing values

There are several options for handling missing values, each with its own pros and cons. The choice largely depends on the nature of the data and of the missing values.
We have 2 options:
 ### 1) Imputation
   #### 1.1) Fill missing values with statistical methods
   Computing the overall mean, median or mode is a very basic imputation method. It is the only tested function that takes no advantage of time series characteristics or of relationships between the variables. It is very fast, but it has clear disadvantages: it only works at the column level and gives poor results on encoded categorical features.
   
   Most Frequent is another statistical strategy to impute missing values. It works with categorical features.
   
   Hot-deck imputation works by randomly choosing the replacement for a missing value from a set of related and similar observations.
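A minimal sketch of mean and most-frequent imputation, assuming scikit-learn's `SimpleImputer` is available. The DataFrame and its column names are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one numeric and one categorical column, each with a gap
df = pd.DataFrame({
    "height": [170.0, np.nan, 180.0, 175.0],
    "color": ["red", "blue", np.nan, "red"],
})

# Numeric column: replace NaN with the mean of the observed values
num_imputer = SimpleImputer(strategy="mean")
df["height"] = num_imputer.fit_transform(df[["height"]]).ravel()

# Categorical column: replace NaN with the most frequent category
cat_imputer = SimpleImputer(strategy="most_frequent")
df["color"] = cat_imputer.fit_transform(df[["color"]]).ravel()

print(df)
```

Both strategies look at one column at a time, which is why they cannot exploit relationships between variables.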
   
   #### Note 1: 
   Linear interpolation works well for a time series with some trend but is not suitable for seasonal data.
   #### Note 2: 
   For longitudinal data, such as patients' weights over a series of visits, it might make sense to use the last valid
   observation to fill the NAs. This is known as last observation carried forward (LOCF).
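LOCF can be sketched with a grouped forward fill in pandas. The visit table below is hypothetical; the grouping by patient is an assumption made so that a value is never carried over from one patient to the next.

```python
import numpy as np
import pandas as pd

# Hypothetical longitudinal data: weight per patient and visit
visits = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2],
    "visit": [1, 2, 3, 1, 2, 3],
    "weight": [70.0, np.nan, 72.0, np.nan, 85.0, np.nan],
})

# Sort chronologically, then carry the last valid observation forward within each patient
visits = visits.sort_values(["patient", "visit"])
visits["weight_locf"] = visits.groupby("patient")["weight"].ffill()

print(visits)
```

The first visit of patient 2 stays missing because there is no earlier observation to carry forward.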
   
   #### 1.2) Predict missing values with a machine learning algorithm
   This is one of the most effective methods for handling missing data. Depending on the type of the missing data, one can use either a regression model or a classification model to predict it. The features with missing values become the labels, and the columns without missing values are used to predict them.
   
   The main drawback of this approach is that if there is no correlation between the attributes with missing data and the other attributes in the data set, the model's predictions of the missing values will be biased.
   
   ##### 1.2.1) Linear Regression
   Mean, median or mode imputation only looks at the distribution of the variable with missing entries. If we know that this variable is correlated with other variables, we can often get better guesses by regressing it on those variables.
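A minimal sketch of regression imputation with scikit-learn's `LinearRegression`, using two small columns named `A` (fully observed) and `D` (with gaps). It is an illustration under these assumptions, not the only way to set it up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                   "D": [14, 3, np.nan, np.nan, 6]})

# Train on the rows where the target column is observed
observed = df["D"].notna()
model = LinearRegression().fit(df.loc[observed, ["A"]], df.loc[observed, "D"])

# Predict the target for the rows where it is missing and write the predictions back
df.loc[~observed, "D"] = model.predict(df.loc[~observed, ["A"]])

print(df)
```

With only three training rows the fit is of course fragile; the point is the mechanics of turning the incomplete column into a prediction target.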
   
   ##### 1.2.2) K Nearest Neighbors
   The distance metric varies according to the type of data:
   - Continuous data: the commonly used distance metrics are Euclidean, Manhattan and Cosine.
   - Categorical data: Hamming distance is generally used. It goes through all the categorical attributes and, for each one, counts one if the value differs between the two points.
   
   Despite its good performance, K-NN is quite sensitive to outliers in the data (unlike SVM).
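For numeric data, nearest-neighbour imputation can be sketched with scikit-learn's `KNNImputer`, which uses a Euclidean distance that skips missing coordinates. The toy data and the choice of `n_neighbors=2` are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data: the second row is missing its "y" value
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 10.0],
    "y": [1.0, np.nan, 3.0, 10.0],
})

# Each missing value is replaced by the mean of that feature over the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

print(imputed)
```

Here the two nearest rows (x = 1 and x = 3) supply the value, while the outlying row (x = 10) is ignored; with a larger `n_neighbors` it would pull the estimate upward, which illustrates the sensitivity to outliers.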
   ##### 1.2.3) XGBoost
   ##### 1.2.4) Random forest
   ##### 1.2.5) Deep Learning (Datawig)
   Datawig is a library that trains deep neural networks to impute missing values in a dataframe. It works very well with categorical and non-numerical features, and it supports both CPU and GPU training.
   
   This method is quite accurate compared to others, but you have to specify the columns that contain information about the target column to be imputed, and it imputes a single column at a time.
   
   
   #### For more information ==> https://datawig.readthedocs.io/en/latest/source/userguide.html#overview-of-datawig
   
   ##### 1.2.6) logistic regression & ANOVA for prediction
   
   #### Note 3: 
   Multiple imputation (MI) is much better than a single imputation because it captures the uncertainty about the missing values. When the missing data pattern is arbitrary, it relies on Markov Chain Monte Carlo (MCMC) simulation; otherwise a parametric regression method, for instance, can be used.
#### Article ==> Multiple Imputation for Missing Data: Concepts and New Development

#### Autoimpute is a Python package for analysis and implementation of Imputation Methods!

  #### Note 4:
  scikit-learn's IterativeImputer is a multivariate imputer that estimates each feature from all the others.
It imputes missing values by modeling each feature with missing values as a function of the other features in a round-robin fashion. This approach is still at the experimental stage in the scikit-learn library.
 IterativeImputer lets you choose any scikit-learn estimator to iterate through the data and impute the missing values.
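A minimal sketch of the round-robin approach, assuming a scikit-learn version in which `IterativeImputer` is still behind the experimental import. The toy data and the choice of `BayesianRidge` as the estimator are assumptions for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Hypothetical data in which b and c are roughly multiples of a
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, np.nan, 8.0, 10.0, 12.0],
    "c": [3.0, np.nan, 9.0, 12.0, 15.0, 18.0],
})

# Each feature with gaps is regressed on the others, cycling for up to max_iter rounds
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

print(imputed)
```

Any other scikit-learn regressor (for example a tree ensemble) can be passed as `estimator`; observed values are left untouched and only the gaps are filled.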
 
 #### Best practices ==> https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
 
 #### Articles ==> Multivariate Imputation by Chained Equations in R && Statistical Analysis with Missing Data
 
 #### Note 5: 
 Linear interpolation on its own ignores seasonality. Combining seasonal adjustment with linear interpolation may give better results.
 
 #### Article ==> Comparison of Linear Interpolation Method and Mean Method to Replace the Missing Values in Environmental Data Set
#### 1.3) Interpolation
Interpolation fits a function to the observed data and uses this function to estimate the missing values. The simplest type is linear interpolation, which draws a straight line between the value before the gap and the value after it; for a single missing point between equally spaced observations, this is the mean of its two neighbours.
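As a small illustration, pandas exposes this through `Series.interpolate`. The series below is made up and assumes equally spaced observations.

```python
import numpy as np
import pandas as pd

# Hypothetical equally spaced series with one single gap and one double gap
s = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# Linear interpolation places the missing points on the straight line between the neighbours
filled = s.interpolate(method="linear")

print(filled.tolist())
```

The single gap is filled with the mean of its neighbours (12), and the double gap with evenly spaced values (16 and 18).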

 ### 2) Deletion
 This is the fastest and easiest way to handle missing values. However, it is not generally advised, because reducing the sample size tends to reduce the quality of the model.
 #### 2.1) Listwise
Listwise deletion (complete-case analysis) removes all data for an observation that has one or more missing values. If the missing data is limited to a small number of observations, you may simply opt to eliminate those cases from the analysis. In most cases, however, listwise deletion is disadvantageous, because the MCAR (Missing Completely at Random) assumption it relies on rarely holds. When it does not hold, listwise deletion produces biased parameters and estimates.
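In pandas, listwise deletion corresponds to `DataFrame.dropna()` with its default arguments. The sketch below uses a small made-up table and also reports how much of the sample is lost.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                   "B": [70, 2, 54, 3, 2],
                   "C": [20, 16, np.nan, 3, 8],
                   "D": [14, 3, np.nan, np.nan, 6]})

# Keep only the rows without any missing value
complete = df.dropna()

# Share of observations lost by the deletion
lost_share = 1 - len(complete) / len(df)

print(complete)
print(f"{lost_share:.0%} of the rows were dropped")
```

Checking the lost share before committing to deletion is a cheap way to see whether "a small number of observations" really applies.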
#### 2.2) Pairwise
Pairwise deletion is an alternative to listwise deletion that mitigates the loss of data.
It analyses all cases in which the variables of interest are present and thus uses as much of the available data as possible for each analysis.
The strength of this technique is that it increases the power of your analysis, but it has several disadvantages. It also assumes that the data are missing completely at random. If you delete pairwise, different numbers of observations contribute to different parts of your model, which can make interpretation difficult.
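One place where pairwise deletion shows up in practice is a correlation matrix: pandas' `DataFrame.corr()` computes each coefficient from the rows where both columns are observed. The sketch below, on a made-up table, also counts how many rows back each coefficient.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                   "B": [70, 2, 54, 3, 2],
                   "C": [20, 16, np.nan, 3, 8],
                   "D": [14, 3, np.nan, np.nan, 6]})

# Pairwise deletion: each coefficient uses every row where both columns are present
pairwise_corr = df.corr()

# Number of rows that contribute to each pair of columns
observed = df.notna().astype(int)
n_pairs = observed.T @ observed

# Listwise deletion for comparison: every coefficient uses the same complete rows
listwise_corr = df.dropna().corr()

print(n_pairs)
```

The `n_pairs` table makes the interpretation problem visible: the A/B coefficient rests on five rows, A/C on four and C/D on three.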

This approach is not implemented yet in the code below.

#### Article ==> Pairwise deletion for missing data in structural equation models
#### 2.3) Dropping Variables
There are situations where a variable has a lot of missing values. In that case, if the variable is not a very important predictor of the target variable, it can be dropped completely. As a rule of thumb, dropping a variable should be considered when 60–70 percent of its values are missing.
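The rule of thumb can be sketched as a threshold on the share of missing values per column. The table and the 0.6 cut-off are assumptions for the illustration; whether a variable is important enough to keep remains a judgment call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 7, 11, 8, 4],
                   "C": [20, 16, np.nan, 3, 8],
                   "E": [np.nan, 6, np.nan, np.nan, np.nan]})

# Share of missing values per column (between 0 and 1)
missing_share = df.isna().mean()

# Columns at or above the threshold are candidates for removal
threshold = 0.6
to_drop = missing_share[missing_share >= threshold].index.tolist()
reduced = df.drop(columns=to_drop)

print(missing_share)
print(reduced.columns.tolist())
```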

### 3) Dealing with categorical data 
#### 3.1) Identifying Categorical Data: Nominal, Ordinal and Continuous
Categorical features can only take on a limited, and usually fixed, number of possible values. For example, if a dataset is about information related to users, then you will typically find features like country, gender, age group, etc. Alternatively, if the data you're working with is related to products, you will find features like product type, manufacturer, seller and so on.

These are all categorical features in your dataset. These features are typically stored as text values which represent various traits of the observations. For example, gender is described as Male (M) or Female (F), product type could be described as electronics, apparels, food etc.

- Features of this type, where the categories are only labeled without any order of precedence, are called nominal features.

- Features which have some order associated with them are called ordinal features. An example is a feature like economic status, whose three categories (low, medium and high) have an order associated with them.

- There are also continuous features. These are numeric variables that have an infinite number of values between any two values. A continuous variable can be numeric or a date/time.

#### 3.2) Encoding Categorical Data
You will now learn different techniques to encode the categorical features to numeric quantities. To keep it simple, you will apply these encoding methods only on the carrier column. However, the same approach can be extended to all columns.

The techniques that you'll cover are the following:

- Replacing values
- Encoding labels
- One-Hot encoding
- Binary encoding
- Backward difference encoding
- Miscellaneous features
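The first three techniques in the list can be sketched with pandas alone. The columns `score` and `color` and the ordering of the score categories are assumptions for the illustration; binary and backward difference encoding are not sketched here.

```python
import pandas as pd

df = pd.DataFrame({
    "score": ["strong", "weak", "normal", "weak", "strong"],
    "color": ["green", "blue", "yellow", "blue", "green"],
})

# Replacing values: an explicit mapping keeps the order of an ordinal feature
score_order = {"weak": 1, "normal": 2, "strong": 3}
df["score_ordinal"] = df["score"].map(score_order)

# Encoding labels: integer codes assigned in alphabetical order of the categories
df["color_label"] = df["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category of a nominal feature
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

Label codes imply an order (blue < green < yellow) that a nominal feature does not have, which is the usual reason to prefer one-hot encoding for such features.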

# Code

- First, we declare the missing values by replacing the specified values of the selected attributes with NaN (np.nan), so that they are treated as missing. To identify meaningless values, we look at the unique values of each selected column.

- Then we compute the percentage of missing values for each column in order to decide which columns need to be dropped; alternatively, we can automatically drop every column whose share of missing values reaches a specified threshold.

- Afterwards, we can remove rows depending on a threshold on how many of their values are NaN.
- Finally, we can impute missing values either by statistical methods or by using machine learning algorithms to predict missing data.

In [2]:
import pandas as pd
import numpy as np
import re
from collections.abc import Iterable
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from modules.encoding import Encoding

class MissingValues:
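    def __init__(self, df,meaninglessVals = None):
        self.df = df    
        #Build a fresh list per instance: a mutable default argument would be shared between instances
        if meaninglessVals is None:
            meaninglessVals = ["?","Na","na",""," ",'None',"nan","nane","none"]
        self.meaninglessVals=list(meaninglessVals)
    def collectMisVal(self,misVals):
        #Register an additional marker and return the updated list (list.append itself returns None)
        self.meaninglessVals.append(misVals)
        return self.meaninglessVals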
    def getUniqueValues(self,column):
        return list(self.df[column].unique())
    def replace(self,column,val,reVal):
        #Assign the result back: in-place replace on a selected column is not guaranteed to update the DataFrame
        self.df[column] = self.df[column].replace(val,reVal)
    def replaceMisCategorical(self,column):
        self.df[column] = self.df[column].fillna("unknown")
    def replaceMeaninglessVals(self):
        self.df.replace(self.meaninglessVals,np.nan,inplace=True) 
    def replaceStr(self,column,stri):
        self.df[column] = self.df[column].replace(stri,np.nan)
    def replaceByRegex(self,exp): 
        self.df.replace(to_replace=exp, value = np.nan, regex = True,inplace=True)
    def getInfCols(self):
        percent_missing = self.df.isnull().sum() * 100 / len(self.df)
        missing_value_df = pd.DataFrame({'column_name': self.df.columns,
                                 'percent_missing': percent_missing})
        missing_value_df.sort_values('percent_missing', inplace=True)
        return missing_value_df   
    
    def getColumnsType(self):
        dictTypes = {'object':[],'int':[],'float':[],'bool':[],'time':[],'category':[]}
        for col in self.df.columns:
            dtype = str(self.df[col].dtype)
            if "object" in dtype:
                dictTypes['object'].insert(0,col)
            elif "int" in dtype:
                dictTypes['int'].insert(0,col)
            elif "float" in dtype:
                dictTypes['float'].insert(0,col)
            elif "bool" in dtype:
                dictTypes['bool'].insert(0,col)  
            elif "category" in dtype:
                dictTypes['category'].insert(0,col)  

            else:
                dictTypes['time'].insert(0,col)
        return dictTypes        
    
    def fillna(self,column,replacedVal):
        return self.df[column].fillna(replacedVal)

    def dropColumns(self,columns):
        self.df.drop(columns, axis=1, inplace=True)
        
    #Drop every column whose share of missing values reaches the percentage passed as a parameter
    def dropAutomColumns(self,nan_percent = 0.7):
        threshold = len(self.df.index) * nan_percent
        columns = [c for c in self.df.columns if sum(self.df[c].isnull()) >= threshold]
        for column in columns:
            self.df.drop(column,axis = 1,inplace=True)
            
    #Drop the columns that have fewer than `threshold` non-null values
    def dropRowsCo(self,threshold = 1):
        self.df.dropna(thresh=threshold,axis=1,inplace=True)
        
    #Drop the rows that have fewer than `threshold` non-null values
    def dropRowsRo(self,threshold=1):
        self.df.dropna(thresh=threshold,axis=0,inplace=True)

    def dropRows(self,columns):
        self.df.dropna(subset=columns,inplace=True)
            
    #method ==> {'linear', 'time', 'index', 'values', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'spline', 'piecewise_polynomial', 'pchip'}
    #limit_direction ==> {'forward', 'backward', 'both'}
    def interpolate(self,column,method,limDirec):
        self.df[column] = self.df[column].interpolate(method=method,limit_direction=limDirec)
    def imputeRandomly(self, feature):
        number_missing = self.df[feature].isnull().sum()
        observed_values = self.df.loc[self.df[feature].notnull(), feature]
        self.df.loc[self.df[feature].isnull(), feature] = np.random.choice(observed_values, number_missing, replace = True)
        
    def imputeByStatiMeth(self,strategy,column):
        if strategy=="mean":
            self.df[column] = self.df[column].fillna(self.df[column].mean())
        elif strategy=="median":
            self.df[column] = self.df[column].fillna(self.df[column].median())
        elif strategy =="random":
            #Fill with one value drawn at random from the observed values of this column
            self.df[column] = self.df[column].fillna(self.df[column].dropna().sample().iloc[0])
        return self.df
        
    #KNN imputation is applied to all columns, which must all be numeric
    def imputeByKNN(self,knn=3):
        imputer = KNNImputer(n_neighbors=knn)
        #fit_transform returns a NumPy array, so rebuild the DataFrame to keep the column names and index
        self.df = pd.DataFrame(imputer.fit_transform(self.df),columns=self.df.columns,index=self.df.index)
            
    #inputCols = independent variables & outputCol = dependent variable
    def imputeByPred(self,algorithm,inputCols,outputCol,categoricalVal=None,knn = 3):
        #To predict the missing values, we train only on the rows that have no null values

        if categoricalVal is None:
            #Copy the list so that the caller's inputCols is not modified
            inCols = list(inputCols)
            train = self.df[inCols + [outputCol]].dropna()
            X_train = train[inCols].values.reshape(-1,len(inCols))       
            y_train = train[outputCol].values.reshape(-1,1)
            nans = self.df[outputCol].isnull()
            nansY = self.df[nans][inCols]
            #Predict only for the rows whose input columns are all observed
            dfPred = nansY[(nansY.notnull().all(axis=1))]
            indexP = dfPred.index
        else:
            X_train = self.df.loc[self.df[outputCol]!=categoricalVal,inputCols].values.reshape(-1,len(inputCols)) 
            y_train = self.df.loc[self.df[outputCol]!=categoricalVal,outputCol].values.reshape(-1,1) 
            dfPred = self.df.loc[self.df[outputCol]==categoricalVal,inputCols]
            indexP = dfPred.index
            
        # KNN imputation is applied to the columns passed as parameters
        # round ==> for example we have 2 classes and we got (0.60==>1) or (0.40==>0) threshold = 0.50
        if algorithm == "knn":
            model = KNeighborsClassifier(knn, weights='distance')
            model.fit(X_train, y_train)
            predictedValues = model.predict(dfPred)
            if categoricalVal == None:
                self.df.loc[indexP,outputCol] = predictedValues
            else:
                self.df.loc[indexP,outputCol] = [int(round(num)) for num in list(self.flat(predictedValues))]


        elif algorithm == "regression":            
            model = LinearRegression()
            model.fit(X_train, y_train)
            predictedValues = model.predict(dfPred)
            if categoricalVal == None:
                self.df.loc[indexP,outputCol] = predictedValues
            else:
                self.df.loc[indexP,outputCol] = [int(round(num)) for num in list(self.flat(predictedValues))]

        elif algorithm == "decisionTree":
            model = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3)
            model.fit(X_train, y_train)
            predictedValues = model.predict(dfPred)
            if categoricalVal == None:
                self.df.loc[indexP,outputCol] = predictedValues
            else:
                self.df.loc[indexP,outputCol] = [int(round(num)) for num in list(self.flat(predictedValues))]

        elif algorithm == "randomForest":
            model = MultiOutputRegressor(RandomForestRegressor(max_depth=30,random_state=0))
            model.fit(X_train, y_train)
            predictedValues = model.predict(dfPred)
            if categoricalVal == None:
                self.df.loc[indexP,outputCol] = predictedValues
            else:
                self.df.loc[indexP,outputCol] = [int(round(num)) for num in list(self.flat(predictedValues))]

    #reference ==> https://towardsdatascience.com/preprocessing-encode-and-knn-impute-all-categorical-features-fast-b05f50b4dfaa        
    def encodeImpute(self,columns):
        imputeData = self.df.copy()
        dictFe = []
        for col in columns:
            if 'category' in str(self.df[col].dtype):
                #Series.astype takes no axis argument
                imputeData[col] = imputeData[col].astype(object)
        imputer = IterativeImputer(ExtraTreesRegressor())
        for col in columns:
            dictF = Encoding.encode(imputeData[col])
            dictFe.append((col,dictF))
        self.df.drop(labels=columns, axis="columns", inplace=True)
        self.df = pd.merge(pd.DataFrame(np.round(imputer.fit_transform(imputeData)),columns = imputeData.columns),self.df)
        return dictFe

    ''''   
    def imputeInterative(self,outputCol):        
        #imp = IterativeImputer(missing_values=np.nan,n_nearest_features=2, initial_strategy=strategy)
        #imp.fit([self.df[inputCols]])
        #return imp.transform([self.df[outputCol]])
        imp = IterativeImputer(max_iter=10, verbose=0)
        imp.fit([self.df[outputCol]])
        imputed_df = imp.transform([self.df[outputCol]])
        imputed_df = pd.DataFrame(imputed_df, columns=self.df.columns)
        return imputed_df
    '''
        
        
   
    #### Convert from string to int

    def convertToInt(self,col):
        self.df[col] = pd.to_numeric(self.df[col],errors='coerce')

    #flatten list
    def flat(self,lst):
        for parent in lst:
            if not isinstance(parent, Iterable):
                yield parent
            else:
                for child in self.flat(parent):
                    yield child

    #find id
    def findAndRemoveId(self):
        idsList = ["id","ID","Id"]
        for iD in idsList:
            try:
                self.df.drop(iD,axis = 1,inplace = True)
            except KeyError:
                print(iD," doesn't exist in dataFrame")
                continue

    #return the imputed DataFrame
    def read(self):
        return self.df

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
import numpy as np

class Encoding:
 #### Encoding categorical data ###   
    def encode(data):
        #function to encode non-null data and replace it in the original data
        encoder = LabelEncoder()
        #retains only non-null values
        nonulls = np.array(data.dropna())
        #reshapes the data for encoding
        impute_reshape = nonulls.reshape(-1,1)
        #encode data (LabelEncoder expects a 1-D array, so flatten the column vector)
        impute_ordinal = encoder.fit_transform(impute_reshape.ravel())
        #Assign back encoded values to non-null values
        data.loc[data.notnull()] = np.squeeze(impute_ordinal)
        return {l: i for i, l in enumerate(encoder.classes_)}
    
    #encode an ordinal feature using the predefined LabelEncoder class
    def enOrdFeature(df,column):
        label_encoder = LabelEncoder()
        df[column] = label_encoder.fit_transform(df[column])
        return {l: i for i, l in enumerate(label_encoder.classes_)}
    
    '''      
    #encode a binary class of object (str) type as a new column added to the DataFrame 
    def enBiFeature(self,column,newName,value):
        self.df[newName] = np.where(self.df[column].str.contains(value), 1, 0)   
    '''
    
    #encode a nominal feature using One-Hot encoding
    def enNomFeature(df,column):
        one_hot = pd.get_dummies(df[column])
        return one_hot
       
    def enOrFeatureByDict(df,column,dictFeature):
        df[column] = df[column].replace(to_replace = list(dictFeature.keys()), value =list(dictFeature.values()))
        return dictFeature

    #### Encoding categorical data ###
    
    
    #### Decoding categorical data ###
    
    def decode(df,dictFeatures,column):
        df[column] = df[column].replace(to_replace = list(dictFeatures.values()), value =list(dictFeatures.keys()))
        return df

    

In [3]:
df = pd.DataFrame({"A":[12, 7, 11, 8, "-"], 
                   "B":[70, 2, 54, 3, 2], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6],
                   "E":[np.nan, 6, np.nan, None, 8]}) 
clas=MissingValues(df)
#print(clas.meaninglessVals)
clas.collectMisVal('none')
#print(clas.meaninglessVals)


#print(clas.getUniqueValues("A"))


#print(clas.replaceMeaninglessVals())
#print(clas.read())


#print(clas.replaceByRegex('^-$'))


#print(clas.getInfCols())


#print(clas.getColumnsType())


#print(clas.fillna("A","-"))


#print(clas.dropColumns(["E","D"]))
#print(clas.read())


#print(clas.dropAutomColumns(0.6))
#print(clas.read())


#print(clas.dropRowsRo(3))
#print(clas.read())


#print(clas.dropRowsCo(3))
#print(clas.read())


#print(clas.interpolate("C","linear","forward"))
#print(clas.read())


#print(clas.imputeRandomly('D'))
#print(clas.read())


#print(clas.replaceByRegex('^-$'))
#print(clas.imputeByStatiMeth("random","C"))


#print(clas.imputeByPred("knn",["B"],"C"))
#print(clas.read())


#print(clas.imputeByPred("regression",["B"],"C"))
#print(clas.read())


#print(clas.imputeByPred("decisionTree",["B"],"C"))
#print(clas.read())


#print(clas.imputeByPred("randomForest",["B"],"C"))
#print(clas.read())



In [4]:
dff = pd.DataFrame({"A":[12, 7, 11, 8, 4], 
                   "B":[70, 2, 54, 3, 2], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6],
                   "E":[None, 6, None, None, 8]}) 
m=MissingValues(dff)
#m.imputeByPred('regression',["A",'B'],"D")
m.imputeByPred('regression',["A"],"D")
print(m.read())
#print(m.read())



    A   B     C          D    E
0  12  70  20.0  14.000000  NaN
1   7   2  16.0   3.000000  6.0
2  11  54   NaN  11.408163  NaN
3   8   3   3.0   8.040816  NaN
4   4   2   8.0   6.000000  8.0


In [12]:
raw_data = {'patient': [1, 1, 1, 2, 2],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
        'color':['green','blue','yellow','blue','green']}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','color'])
m = MissingValues(df)
l = ['score','color']
print(m.encodeImpute(l))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


   patient  obs  treatment  score  color
0      1.0  1.0        0.0    1.0    1.0
1      1.0  2.0        1.0    2.0    0.0
2      1.0  3.0        0.0    0.0    2.0
3      2.0  1.0        1.0    2.0    0.0
4      2.0  2.0        0.0    1.0    1.0


In [6]:
raw_data = {'patient': [1, 1, 1, 2, 2],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
        'color':['green','blue','yellow','blue','green'],
           'gender':["male","female","male","male","female"]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','color','gender'])
m = MissingValues(df)
decoder = Encoding.enOrdFeature(df,'gender')
print("\n")
print(decoder)
print(m.read())
Encoding.decode(df,decoder,'gender')




{'female': 0, 'male': 1}
   patient  obs  treatment   score   color  gender
0        1    1          0  strong   green       1
1        1    2          1    weak    blue       0
2        1    3          0  normal  yellow       1
3        2    1          1    weak    blue       1
4        2    2          0  strong   green       0


   patient  obs  treatment   score   color  gender
0        1    1          0  strong   green    male
1        1    2          1    weak    blue  female
2        1    3          0  normal  yellow    male
3        2    1          1    weak    blue    male
4        2    2          0  strong   green  female


In [7]:
raw_data = {'patient': [1, 1, 1, 2, 2],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
        'color':['green','blue','yellow','blue','green']}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','color'])
m = MissingValues(df)
Encoding.enNomFeature(df,"score")

In [8]:
raw_data = {'patient': [1, 1, 1, 2, 2],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
        'color':['green','blue','yellow','blue','green'],
           'gender':["male","female","male","male","female"]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','color','gender'])
m = MissingValues(df)
decoder = Encoding.enOrFeatureByDict(df,'score',{"strong":3,"normal":2,"weak":1})
print("\n")
print(decoder)
print(m.read())
Encoding.decode(df,decoder,'score')



{'strong': 3, 'normal': 2, 'weak': 1}
   patient  obs  treatment  score   color  gender
0        1    1          0      3   green    male
1        1    2          1      1    blue  female
2        1    3          0      2  yellow    male
3        2    1          1      1    blue    male
4        2    2          0      3   green  female


   patient  obs  treatment   score   color  gender
0        1    1          0  strong   green    male
1        1    2          1    weak    blue  female
2        1    3          0  normal  yellow    male
3        2    1          1    weak    blue    male
4        2    2          0  strong   green  female


In [8]:
import pandas as pd
import numpy as np
raw_data = {'patient': [1, 1, 1, 2, 2],
        'obs': [1, 2, 3, 1, 2],
        'treatment': [0, 1, 0, 1, 0],
        'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
        'color':['green','blue','yellow','blue','green'],
           'gender':["male","female","male","male","female"]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score','color','gender'])
df.drop(columns=["obs","ok"],axis=1)

KeyError: "['ok'] not found in axis"
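The KeyError above is raised because one of the requested labels does not exist. As a sketch of two ways around it, pandas' `drop` accepts `errors="ignore"`, or the list can be filtered against the existing columns first. The small table is made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"patient": [1, 1, 2],
                   "obs": [1, 2, 1],
                   "treatment": [0, 1, 0]})

# Option 1: silently skip labels that are not present
reduced = df.drop(columns=["obs", "ok"], errors="ignore")

# Option 2: keep only the labels that exist, which also tells us which ones were missing
requested = ["obs", "ok"]
present = [c for c in requested if c in df.columns]
absent = [c for c in requested if c not in df.columns]
reduced_filtered = df.drop(columns=present)

print(reduced.columns.tolist())
print("not found:", absent)
```

The second option is the more explicit one when a missing column should be reported rather than ignored.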