# Dropping missing data

The voting dataset from Chapter 1 contained a bunch of missing values that we dealt with for you behind the scenes. Now, it's time for you to take care of these yourself!

The unprocessed dataset has been loaded into a DataFrame df. Explore it in the IPython Shell with the .head() method. You will see that certain data points are labeled with a '?'. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a '9999', other times a 0 - real-world data can be very messy! If you're lucky, the missing values will already be encoded as NaN. We use NaN because it is an efficient and simple way of internally representing missing data, and it lets us take advantage of pandas methods such as .dropna() and .fillna(), as well as scikit-learn's imputation transformer SimpleImputer() (called Imputer() in older scikit-learn releases).
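
As a minimal sketch of how a sentinel such as '?' can be turned into NaN, the snippet below builds a tiny, made-up CSV (not the voting data) and uses the na_values argument of pd.read_csv to treat '?' as missing at load time. The column names and values are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical toy data: '?' marks a missing vote
csv_text = "party,infants,water\nrepublican,0,1\ndemocrat,?,1\ndemocrat,1,?\n"

# na_values tells pandas which extra strings to parse as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Count missing values per column, then drop the incomplete rows
nan_counts = toy.isnull().sum()
complete = toy.dropna()
print(nan_counts)
print(complete.shape)
```

Once the placeholders are real NaNs, .isnull(), .dropna() and .fillna() all behave as expected.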

In this exercise, your job is to convert the '?'s to NaNs, and then drop the rows that contain them from the DataFrame.

INSTRUCTIONS


Explore the DataFrame df in the IPython Shell. Notice how the missing value is represented.
Convert all '?' data points to np.nan.
Count the total number of NaNs using the .isnull() and .sum() methods. This has been done for you.
Drop the rows with missing values from df using .dropna().

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('house-votes-84_missing_data.csv')

In [2]:
df.head()

Unnamed: 0,index,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,0,republican,0,1,0,1,1,1,0,0,0,1,?,1,1,1,0,1
1,1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,?
2,2,democrat,?,1,1,?,1,1,0,0,0,0,1,0,1,1,0,0
3,3,democrat,0,1,1,0,?,1,0,0,0,0,1,0,1,0,0,1
4,4,democrat,1,1,1,0,1,1,0,0,0,0,1,?,1,1,1,1


In [3]:
df1 = df.copy()

In [4]:
import numpy as np
# Convert '?' to NaN
df1[df1 == '?'] = np.nan

In [5]:
df1.head()

Unnamed: 0,index,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,0,republican,0.0,1,0,1.0,1.0,1,0,0,0,1,,1.0,1,1,0,1.0
1,1,republican,0.0,1,0,1.0,1.0,1,0,0,0,0,0.0,1.0,1,1,0,
2,2,democrat,,1,1,,1.0,1,0,0,0,0,1.0,0.0,1,1,0,0.0
3,3,democrat,0.0,1,1,0.0,,1,0,0,0,0,1.0,0.0,1,0,0,1.0
4,4,democrat,1.0,1,1,0.0,1.0,1,0,0,0,0,1.0,,1,1,1,1.0


In [6]:
# Print the number of NaNs
print(df1.isnull().sum())

index                  0
party                  0
infants               12
water                 48
budget                11
physician             11
salvador              15
religious             11
satellite             14
aid                   15
missile               22
immigration            7
synfuels              21
education             31
superfund             25
crime                 17
duty_free_exports     28
eaa_rsa              104
dtype: int64


In [7]:
# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(df.shape))

Shape of Original DataFrame: (435, 18)


In [8]:
# Drop every row that contains at least one missing value
df1 = df1.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df1.shape))

Shape of DataFrame After Dropping All Rows with Missing Values: (232, 18)
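
To put the two shapes above in perspective, this short calculation uses only the row counts printed above to show how much data is lost by dropping rows:

```python
rows_before = 435  # rows in the original DataFrame
rows_after = 232   # rows left after .dropna()

rows_lost = rows_before - rows_after
fraction_lost = rows_lost / rows_before
print("Rows lost: {} ({:.1%} of the data)".format(rows_lost, fraction_lost))
```

Nearly half of the rows are discarded, which is one reason to consider imputing missing values instead of dropping them.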


# Imputing missing data in a ML Pipeline I

As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same .fit() and .predict() methods as before.

INSTRUCTIONS


Import SimpleImputer from sklearn.impute and SVC from sklearn.svm. SVC stands for Support Vector Classification, which is a type of SVM.
Set up the imputation transformer to impute missing data (represented as np.nan) with the 'most_frequent' value in each column.
Instantiate an SVC classifier. Store the result in clf.
Create the steps of the pipeline by creating a list of tuples:
The first tuple should consist of the imputation step, using imp.
The second should consist of the classifier.
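
To see what the 'most_frequent' strategy does on its own, here is a minimal sketch on a made-up array (the values are illustrative assumptions, not the voting data). Each NaN is replaced by the most common value in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy votes; np.nan marks a missing value
X_toy = np.array([[0, 1],
                  [np.nan, 1],
                  [0, np.nan],
                  [1, 1]])

imp_toy = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# fit_transform learns the per-column mode and fills the gaps with it
X_filled = imp_toy.fit_transform(X_toy)
print(X_filled)
```

Here the first column is filled with 0 and the second with 1, their respective most frequent values.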

In [9]:
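# Import NumPy (for np.nan), SimpleImputer, and SVC
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp
# missing_values must be np.nan (not the string 'NaN') to match NaNs in the data
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')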

# Instantiate the SVC classifier: clf
clf = SVC()

# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
        ('SVM', clf)]

# Imputing missing data in a ML Pipeline II

Having set up the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so useful is the simple interface that they provide. You can use the .fit() and .predict() methods on pipelines just as you did with your classifiers and regressors!

Practice this for yourself now and generate a classification report of your predictions. The steps of the pipeline have been set up for you, and the feature array X and target variable array y have been pre-loaded. Additionally, train_test_split and classification_report have been imported from sklearn.model_selection and sklearn.metrics respectively.

INSTRUCTIONS


Import the following modules:
SimpleImputer from sklearn.impute and Pipeline from sklearn.pipeline.
SVC from sklearn.svm.
Create the pipeline using Pipeline() and steps.
Create training and test sets. Use 30% of the data for testing and a random state of 42.
Fit the pipeline to the training set and predict the labels of the test set.
Compute the classification report.
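
The steps above can be sketched end to end on a small, made-up dataset. The arrays X_toy and y_toy and the labels 'a' and 'b' are illustrative assumptions; the point is only that the pipeline accepts data containing NaNs and exposes the usual .fit() and .predict() methods.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: two binary features with some missing entries
X_toy = np.array([[0, 0], [0, np.nan], [np.nan, 0], [0, 0],
                  [1, 1], [1, np.nan], [np.nan, 1], [1, 1]])
y_toy = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])

# Imputation runs first, then the classifier sees the filled-in data
toy_pipeline = Pipeline([
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('SVM', SVC()),
])

toy_pipeline.fit(X_toy, y_toy)

# New samples may contain NaNs too; the fitted imputer handles them
toy_pred = toy_pipeline.predict(np.array([[0, np.nan], [1, 1]]))
print(toy_pred)
```

Because the imputer is fitted inside the pipeline, the values used to fill gaps in new samples are learned from the training data only.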

In [10]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv('house-votes-84-2.csv')

In [11]:
df.head()

Unnamed: 0,Index,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
2,2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,3,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,1
4,4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1


In [12]:
# Import necessary modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

In [13]:
import numpy as np
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

In [14]:
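# Feature array: every column except the target.
# Note: this keeps the 'Index' column, a row identifier rather than a vote, as a feature.
X = df.drop('party', axis=1).values
# Target array: party affiliation
y = df['party'].values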

In [15]:
from sklearn.model_selection import train_test_split
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

In [16]:
# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

Pipeline(steps=[('imputation', SimpleImputer(strategy='most_frequent')),
                ('SVM', SVC())])

In [17]:
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

In [18]:
from sklearn.metrics import classification_report

# Compute metrics as a dictionary and display them as a DataFrame
report_dict = classification_report(y_test, y_pred, output_dict=True)
pd.DataFrame(report_dict)

Unnamed: 0,democrat,republican,accuracy,macro avg,weighted avg
precision,0.648855,0.0,0.648855,0.324427,0.421013
recall,1.0,0.0,0.648855,0.5,0.648855
f1-score,0.787037,0.0,0.648855,0.393519,0.510673
support,85.0,46.0,0.648855,131.0,131.0
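
The report above is consistent with a classifier that predicts 'democrat' for every test sample: recall is 1.0 for democrats and 0.0 for republicans. As a check on that reading (an inference from the numbers, not something the notebook states), the sketch below rebuilds labels with the same supports, 85 and 46, and scores an all-'democrat' prediction.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Labels with the same class counts as the report's support row
y_true = np.array(['democrat'] * 85 + ['republican'] * 46)

# Assumed behaviour: every sample is predicted as 'democrat'
y_all_dem = np.array(['democrat'] * 131)

# zero_division=0 reports 0.0 for a class that is never predicted
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_all_dem, labels=['democrat', 'republican'], zero_division=0)

print(precision)  # democrat precision is 85 / 131
print(recall)
print(f1)
```

The democrat precision of 85 / 131 (about 0.648855) and f1-score of about 0.787037 match the report, so checking the distinct values in y_pred would be a sensible next diagnostic.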
