Q1. A missing value is a data point that is absent from a dataset. Missing values arise for reasons such as data collection errors, data corruption, or incomplete entries. Handling them matters because, left untreated, they can bias statistical analyses, and many machine learning models either fail on them or produce inaccurate results.

There are two common ways to handle missing values in a dataset. The first is imputation: filling in each missing value with an estimate based on the available data, such as the column mean or the values of similar rows. The second is deletion: removing the rows (or columns) that contain missing values, which shrinks the dataset and can discard useful information.
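
Before choosing between the two approaches, it usually helps to measure how much data is missing and how much deletion would cost. The following is a minimal sketch on the same toy dataset used in the cells below; the variable names are illustrative, and the median fill is shown only as one simple pandas-only form of imputation, not as a recommended default.

```python
import numpy as np
import pandas as pd

# Same toy dataset as in the notebook cells below
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, 8],
                   'C': [9, 10, 11, np.nan]})

# Count and fraction of missing values in each column
missing_per_column = df.isna().sum()
missing_fraction = df.isna().mean()

# Number of rows that deletion (dropna) would discard
rows_lost = len(df) - len(df.dropna())

# One simple imputation: fill each column with its own median
df_filled = df.fillna(df.median())

print(missing_per_column)
print(missing_fraction)
print("Rows lost by dropping:", rows_lost)
print(df_filled)
```

Here each column is missing one value out of four, yet dropping incomplete rows discards three of the four rows, which shows how quickly deletion can shrink a small dataset.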

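Some algorithms can handle missing values natively, although this depends on the implementation. Tree-based methods are the main example: a decision tree can use surrogate splits or learn which branch missing values should follow, and gradient-boosting libraries such as XGBoost and LightGBM, as well as scikit-learn's histogram-based gradient boosting estimators, accept missing values without imputation. Whether a plain decision tree or random forest accepts missing values depends on the library and its version. Other algorithms, such as linear regression, k-nearest neighbors (KNN), and support vector machines (SVMs), need complete numeric input and therefore require some form of imputation or removal first.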

In [8]:
import pandas as pd
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)

# Drop every row that contains at least one missing value
df_dropna = df.dropna()

print(df_dropna)


     A    B    C
0  1.0  5.0  9.0


In [6]:
import pandas as pd
from sklearn.impute import SimpleImputer
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)


          A         B     C
0  1.000000  5.000000   9.0
1  2.000000  6.666667  10.0
2  2.333333  7.000000  11.0
3  4.000000  8.000000  10.0


In [7]:
import pandas as pd
from sklearn.impute import KNNImputer
import numpy as np

# Create a sample dataset with missing values
data = {'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8], 'C': [9, 10, 11, np.nan]}
df = pd.DataFrame(data)

# Fill each missing value with the average of that feature over the
# 2 nearest rows (distance is computed on the non-missing features)
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)


     A    B     C
0  1.0  5.0   9.0
1  2.0  6.0  10.0
2  3.0  7.0  11.0
3  4.0  8.0  10.5
