In [2]:
import pandas as pd
import numpy as np
from io import StringIO



## Dealing with missing data

In [3]:
data = \
'''
A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0,, 6.0, 7.0
8.0, 9.0, 10.0
'''

In [4]:
df = pd.read_csv(StringIO(data))
df

     A    B     C    D
0  1.0  2.0   3.0  4.0
1  5.0  NaN   6.0  7.0
2  8.0  9.0  10.0  NaN


The StringIO class in the preceding code example is used purely for illustration. It lets us read the string assigned to data into a pandas DataFrame as if it were a regular CSV file on our hard drive.
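To make this equivalence concrete, here is a minimal sketch that reads the same CSV text once from memory and once from a file on disk. The temporary file and the names csv_text, df_from_string, and df_from_file are assumptions made only for this illustration.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Same values as the example data, written as clean CSV text.
csv_text = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,,6.0,7.0
8.0,9.0,10.0,
"""

# Route 1: read directly from the in-memory string.
df_from_string = pd.read_csv(StringIO(csv_text))

# Route 2: write the text to a temporary file and read it from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")
    with open(csv_path, "w") as f:
        f.write(csv_text)
    df_from_file = pd.read_csv(csv_path)

# DataFrame.equals treats NaNs in the same position as equal.
print(df_from_string.equals(df_from_file))
```

Either route should yield the same DataFrame, which is why StringIO is convenient for small, self-contained examples.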


In [5]:
df.isnull().sum()

A    0
B    1
C    0
D    1
dtype: int64

### 1 - Eliminating training examples or features with missing values

In [6]:
df.dropna(axis=0)  # drop rows with at least one NaN

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [7]:
df.dropna(axis=1)  # drop columns with at least one NaN

     A     C
0  1.0   3.0
1  5.0   6.0
2  8.0  10.0


In [8]:
df.dropna(how='all')  # drop only rows where all values are NaN

     A    B     C    D
0  1.0  2.0   3.0  4.0
1  5.0  NaN   6.0  7.0
2  8.0  9.0  10.0  NaN


In [9]:
df.dropna(thresh=4)  # keep only rows with at least 4 non-NaN values

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [10]:
df.dropna(subset=['C'])  # only drop rows where NaN appears in specific columns (here: 'C')

     A    B     C    D
0  1.0  2.0   3.0  4.0
1  5.0  NaN   6.0  7.0
2  8.0  9.0  10.0  NaN


Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages; for example, we may end up removing too many samples, which will make a reliable analysis impossible. Or, if we remove too many feature columns, we will run the risk of losing valuable information that our classifier needs to discriminate between classes. 
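Before dropping anything, it can help to measure how much data would be lost. The following is a minimal sketch on the same small dataset; the variable names (csv_text, rows_kept, cols_kept, missing_fraction) are chosen for this illustration only.

```python
from io import StringIO

import pandas as pd

# Same values as the example data, written as clean CSV text.
csv_text = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,,6.0,7.0
8.0,9.0,10.0,
"""
df = pd.read_csv(StringIO(csv_text))

# How many rows and columns survive a full dropna?
rows_kept = len(df.dropna(axis=0))
cols_kept = df.dropna(axis=1).shape[1]
print(f"rows kept: {rows_kept} of {len(df)}")
print(f"columns kept: {cols_kept} of {df.shape[1]}")

# Fraction of missing values per column (mean of the boolean mask).
missing_fraction = df.isnull().mean()
print(missing_fraction)
```

On this toy dataset, dropping rows keeps only one of three rows and dropping columns keeps only two of four columns, which shows how quickly removal can discard information.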

### 2 - Using mean imputation

Instead of removing data, we can use different interpolation techniques to estimate the missing values from the other training examples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace each missing value with the mean of the entire feature column. A convenient way to achieve this is the SimpleImputer class from scikit-learn, as shown in the following code:

In [11]:
from sklearn.impute import SimpleImputer
si = SimpleImputer(missing_values=np.nan, strategy="mean")  # strategy can also be "median" or "most_frequent"
si.fit(df.values)
imputed_data = si.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  5.5,  6. ,  7. ],
       [ 8. ,  9. , 10. ,  5.5]])

In [13]:
# alternative: fill each NaN with its column mean using pandas
df.fillna(df.mean())

     A    B     C    D
0  1.0  2.0   3.0  4.0
1  5.0  5.5   6.0  7.0
2  8.0  9.0  10.0  5.5
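As a sanity check, the two approaches can be compared directly. This sketch rebuilds the same small dataset and relies on the statistics_ attribute of a fitted SimpleImputer, which stores the per-column fill values; the names imputed_sklearn and imputed_pandas are assumptions for this illustration.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as the example data, written as clean CSV text.
csv_text = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,,6.0,7.0
8.0,9.0,10.0,
"""
df = pd.read_csv(StringIO(csv_text))

# scikit-learn route: learn the column means, then fill the NaNs.
si = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed_sklearn = si.fit_transform(df.values)

# pandas route: df.mean() skips NaN by default.
imputed_pandas = df.fillna(df.mean())

# The learned fill values should match the pandas column means,
# and both imputed arrays should agree element by element.
print(si.statistics_)
print(np.allclose(si.statistics_, df.mean().to_numpy()))
print(np.allclose(imputed_sklearn, imputed_pandas.to_numpy()))
```

One practical difference is that SimpleImputer returns a NumPy array and keeps the learned means, so the same fitted object can later transform new data, while fillna returns a DataFrame with the column labels intact.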


scikit-learn also offers additional imputation techniques, such as KNNImputer, which uses a k-nearest neighbors approach to impute missing features from the values of the nearest neighbors.
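A minimal sketch of KNNImputer on the same small dataset follows. The choice of n_neighbors=2 is an assumption made for this illustration, not a recommendation; suitable values depend on the data.

```python
from io import StringIO

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same values as the example data, written as clean CSV text.
csv_text = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,,6.0,7.0
8.0,9.0,10.0,
"""
df = pd.read_csv(StringIO(csv_text))

# Each missing entry is filled with the average of that feature
# over the n_neighbors closest rows that have the feature observed.
knn = KNNImputer(n_neighbors=2)
imputed_knn = knn.fit_transform(df.values)

# Wrap the result in a DataFrame to keep the column labels.
df_knn = pd.DataFrame(imputed_knn, columns=df.columns)
print(df_knn)
```

With only three rows, each missing value has just two possible donor rows, so this setting averages both and behaves much like mean imputation here. On larger datasets the neighbors, and therefore the filled values, vary from row to row.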