# Replacing Data: Missing Data And Map

### Introduction

In our last lab, we gathered data from our CSV file and coerced much of it into numbers so that we could better make sense of it.  There were a couple of places where we got stuck.  In this lesson, we'll learn how to finish cleaning our data by handling missing values and working with the map method.

### Our SAT Data - Not as Clean as We Thought :(

Let's take another look at our SAT data from the last lab.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/nyc_hs_sat.csv"
sat_df = pd.read_csv(url, index_col = 0)

In [2]:
sat_df.dtypes

dbn                     object
name                    object
num_test_takers        float64
reading_avg            float64
math_avg               float64
writing_score          float64
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
dtype: object

Looking at the above data, it appears that we have a good number of features to help explain the `math_avg` of a school.

But one problem that we may have is that some of the data is missing.  It's good to know whether a lot of our data is missing, as we may wish to either not use a column (if too many of its entries are missing) or replace the missing entries with the average value in the column.

> When calculating summary statistics, plotting data, or training a machine learning model, it's important to remove missing values first.
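
The replace-with-the-average option mentioned above can be sketched as follows. This is a minimal illustration on a small made-up dataframe (`toy_df` and its values are hypothetical, not taken from the SAT file), using the pandas `fillna` method.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not from the SAT file).
toy_df = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "graduation_rate": [0.80, np.nan, 0.60, 0.70],
})

# Mean of the non-missing entries; pandas skips NaN by default.
col_mean = toy_df["graduation_rate"].mean()

# fillna returns a new Series, so assign it back to the column.
filled_df = toy_df.copy()
filled_df["graduation_rate"] = filled_df["graduation_rate"].fillna(col_mean)

print(filled_df)
```

Working on a copy keeps the original `toy_df` intact, which makes it easy to compare the filled and unfilled versions.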

### Working with Missing Values

So how do we see if our dataset has missing values?

Missing values (if we're lucky) are generally marked with a special `na` value, which stands for not available and which pandas displays as `NaN`.  We can count the missing values in each column with the following line of code.

In [13]:
sat_df.isna().sum()

dbn                     0
name                    0
num_test_takers        29
reading_avg            29
math_avg               29
writing_score          29
boro                    0
total_students          0
graduation_rate         5
attendance_rate         0
college_career_rate     5
dtype: int64

Now we can see that a number of columns have missing values.
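
Raw counts are easier to judge as a share of the rows. As a small sketch on made-up data (the `toy_df` values are hypothetical), taking the mean of the boolean `isna()` result gives the fraction of entries missing in each column:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not from the SAT file).
toy_df = pd.DataFrame({
    "math_avg": [400.0, np.nan, 450.0, np.nan],
    "graduation_rate": [0.8, 0.9, np.nan, 0.7],
    "boro": ["M", "X", "K", "Q"],
})

# isna() gives True/False per cell; sum() counts the True values per column.
missing_counts = toy_df.isna().sum()

# mean() of True/False values is the fraction of missing entries per column.
missing_share = toy_df.isna().mean()

print(missing_share.sort_values(ascending=False))
```

A column with a large share of missing entries is a candidate for being left out entirely, while a column with only a few may be worth keeping.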

What to do with missing values warrants a longer discussion, but for now, we can simply drop the rows that contain missing values.  Here's how.

In [14]:
dropped_sat_df = sat_df.dropna()

> The method `dropna` returns a new dataframe and leaves the original unchanged, so be sure to store the new dataframe in a variable.

And now we can see that none of the columns have `na` values.

In [15]:
dropped_sat_df.isna().sum()

dbn                    0
name                   0
num_test_takers        0
reading_avg            0
math_avg               0
writing_score          0
boro                   0
total_students         0
graduation_rate        0
attendance_rate        0
college_career_rate    0
dtype: int64
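
To make the behavior of `dropna` concrete, here is a small sketch on made-up data (the `toy_df` values are hypothetical). It also shows the `subset` parameter, which is not used in this lesson but limits the check to the columns we name.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not from the SAT file).
toy_df = pd.DataFrame({
    "math_avg": [400.0, np.nan, 450.0],
    "graduation_rate": [0.8, 0.9, np.nan],
})

# Drop every row that has a missing value in any column.
dropped_all = toy_df.dropna()

# Drop only the rows where math_avg is missing.
dropped_math = toy_df.dropna(subset=["math_avg"])

# The original dataframe still has all of its rows.
print(len(toy_df), len(dropped_all), len(dropped_math))
```

Dropping on a subset of columns can preserve more rows when only some columns matter for the analysis.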

And now we can work with our reduced dataset.

In [16]:
X = dropped_sat_df.select_dtypes(exclude = ['object']).drop(columns = ['math_avg'])
X.columns

y = dropped_sat_df.math_avg
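
The feature and target split can be tried on a small made-up dataframe (the `toy_df` values are hypothetical). Note that this sketch uses a variant of the cell above: it keeps the numeric columns with `include="number"` rather than excluding the `object` columns.

```python
import pandas as pd

# Toy data with hypothetical values (not from the SAT file).
toy_df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "math_avg": [400.0, 450.0, 420.0],
    "reading_avg": [390.0, 440.0, 410.0],
    "total_students": [300, 500, 400],
})

# Keep only numeric columns, then remove the target from the features.
X = toy_df.select_dtypes(include="number").drop(columns=["math_avg"])

# The target is the column we want to explain.
y = toy_df["math_avg"]

print(list(X.columns))
```

Removing the target from `X` matters because a model given `math_avg` as a feature would simply read off the answer.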

### Summary

In this lesson, we saw that because a machine learning model needs all of its data to be numeric, our training data must not contain any `na` values.  We can find out how many `na` values are in each column with the line:

```python
df.isna().sum()
```

And we can drop the rows that have missing data in any column with the line:

`dropped_df = df.dropna()`