# Data Cleaning in Pandas

Data cleaning is a key part of data science, but it can be deeply frustrating. Why are some of your text fields garbled? \ 
What should you do about those missing values? Why aren’t your dates formatted correctly? How can you quickly clean up inconsistent data entry? \
In this course, you'll learn why you've run into these problems and, more importantly, how to fix them! 

In this course, you’ll learn how to tackle some of the most common data cleaning problems so you can get to actually analyzing your data faster. 
You’ll work through five hands-on exercises with real, messy data and answer some of your most commonly-asked data cleaning questions.

In this notebook, we'll look at how to deal with missing values.

## Take a first look at the data
The first thing we'll need to do is load in the libraries and dataset we'll be using.

For demonstration, we'll use a dataset of events that occurred in American Football games.  \
In the following exercise, you'll apply your new skills to a dataset of building permits issued in San Francisco.

In [4]:
# modules we'll use
# Here you must import the libraries numpy and pandas
# TODO: Your code here
import numpy as np
import pandas as pd


# read in all our data
# Here you must read the csv and load it into a pandas DataFrame
# File name is: "NFL Play by Play 2009-2016 (v4).csv"
# Read it into a variable called nfl_data
# TODO: Your code here
nfl_data = pd.read_csv("NFL Play by Play 2009-2016 (v4).csv")

# set seed for reproducibility
np.random.seed(0) 

The first thing to do when you get a new dataset is take a look at some of it. \
This lets you see that it all read in correctly and gives an idea of what's going on with the data. In this case, \
let's see if there are any missing values, which will be represented with NaN or None.

In [7]:
# look at the first five rows of the nfl_data file. 
# I can see a handful of missing data already!
# TODO: Your code here
print(nfl_data.head())


         Date      GameID  Drive  qtr  down   time  TimeUnder  TimeSecs  \
0  2016-09-08  2016090800      1    1   NaN  15:00         15    3600.0   
1  2016-09-08  2016090800      1    1   1.0  15:00         15    3600.0   
2  2016-09-08  2016090800      1    1   1.0  14:17         15    3557.0   
3  2016-09-08  2016090800      1    1   2.0  14:13         15    3553.0   
4  2016-09-08  2016090800      1    1   3.0  14:08         15    3548.0   

   PlayTimeDiff SideofField  ...    yacEPA  Home_WP_pre  Away_WP_pre  \
0           0.0         CAR  ...       NaN     0.500007     0.499993   
1           0.0         DEN  ...  1.801934     0.500007     0.499993   
2          43.0         DEN  ... -0.000482     0.533353     0.466647   
3           4.0         DEN  ... -0.506220     0.512108     0.487892   
4           5.0         DEN  ...  0.149739     0.482752     0.517248   

   Home_WP_post  Away_WP_post  Win_Prob       WPA    airWPA    yacWPA  Season  
0      0.500007      0.499993  0.500

Yep, it looks like there are some missing values.
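
As a small aside, pandas treats both `np.nan` and `None` as missing. The sketch below uses a made-up toy DataFrame (`toy` is hypothetical, not part of the NFL data) to show how `isnull()` flags both kinds of value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the NFL dataset
toy = pd.DataFrame({
    "down": [np.nan, 1.0, 2.0],    # numeric column with a NaN
    "team": ["CAR", None, "DEN"],  # text column with a None
})

# isnull() returns a boolean DataFrame: True where a value is missing
missing_mask = toy.isnull()
print(missing_mask)
```

The same mask is what the later cells sum up to count missing values.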

**How many seasons do we have data for?**\
Use column Season

In [11]:
# TODO: Your code here
seasons = nfl_data['Season'].unique()
num_seasons = len(seasons)
print(f'The number of seasons is: {num_seasons}')

The number of seasons is: 3
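
A slightly shorter route to the same count is `Series.nunique()`. The sketch below runs on a hypothetical stand-in Series rather than the real `Season` column. One difference worth knowing: `nunique()` ignores missing values by default, whereas `len(series.unique())` counts NaN as one of the values.

```python
import pandas as pd

# Hypothetical stand-in for nfl_data['Season']
season = pd.Series([2016, 2016, 2015, 2014, 2014])

# nunique() counts the distinct non-null values directly
num_seasons_toy = season.nunique()
print(num_seasons_toy)
```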


**Select only those from the 2016 season**

In [17]:
# TODO: Your code here
seasons_2016 = nfl_data[nfl_data['Season'] == 2016]
seasons_2016

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2016-09-08,2016090800,1,1,,15:00,15,3600.0,0.0,CAR,...,,0.500007,0.499993,0.500007,0.499993,0.500007,0.000000,,,2016
1,2016-09-08,2016090800,1,1,1.0,15:00,15,3600.0,0.0,DEN,...,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.055450,2016
2,2016-09-08,2016090800,1,1,1.0,14:17,15,3557.0,43.0,DEN,...,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114,2016
3,2016-09-08,2016090800,1,1,2.0,14:13,15,3553.0,4.0,DEN,...,-0.506220,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779,2016
4,2016-09-08,2016090800,1,1,3.0,14:08,15,3548.0,5.0,DEN,...,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.082780,0.076812,0.005968,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,2016-09-11,2016091108,18,4,4.0,07:22,8,442.0,6.0,TEN,...,,0.061045,0.938955,0.070987,0.929013,0.061045,0.009943,,,2016
496,2016-09-11,2016091108,19,4,1.0,07:09,8,429.0,13.0,MIN,...,,0.070987,0.929013,0.069443,0.930557,0.929013,0.001544,,,2016
497,2016-09-11,2016091108,19,4,2.0,06:26,7,386.0,43.0,MIN,...,,0.069443,0.930557,0.059481,0.940519,0.930557,0.009962,,,2016
498,2016-09-11,2016091108,19,4,3.0,05:41,6,341.0,45.0,TEN,...,0.080864,0.059481,0.940519,0.034085,0.965915,0.940519,0.025396,0.019335,0.006060,2016


**What is the description that appears most often?**\
Use column desc

In [23]:
# TODO: Your code here
description_most_often = nfl_data['desc'].mode().iloc[0]
print(f'The most common description is "{description_most_often}"')



**Select only the rows with desc equals to 'END QUARTER 3' and Season 2016** 

In [26]:
# TODO: Your code here
rows_selected = nfl_data[(nfl_data['desc'] == 'END QUARTER 3') & (nfl_data['Season'] == 2016)]
rows_selected

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
117,2016-09-08,2016090800,15,3,,00:00,0,900.0,21.0,CAR,...,,,,,,,0.0,,,2016
303,2016-09-11,2016091100,18,3,,00:00,0,900.0,13.0,TB,...,,,,,,,0.0,,,2016
473,2016-09-11,2016091108,14,3,,00:00,0,900.0,16.0,MIN,...,,,,,,,0.0,,,2016
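
The filter above combines two conditions with `&`. Each comparison needs its own parentheses because `&` binds more tightly than `==` in Python. A minimal sketch on hypothetical toy data, naming each condition first, which some people find easier to read:

```python
import pandas as pd

# Hypothetical toy data that mimics the two columns used above
toy = pd.DataFrame({
    "desc": ["END QUARTER 3", "kickoff", "END QUARTER 3"],
    "Season": [2016, 2016, 2015],
})

# Build each condition separately, then combine with & (element-wise AND)
is_end_q3 = toy["desc"] == "END QUARTER 3"
is_2016 = toy["Season"] == 2016
selected = toy[is_end_q3 & is_2016]
print(selected)
```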


**Delete all rows from Season 2014**

In [30]:
# TODO: Your code here
nfl_data_without_2014 = nfl_data.drop(nfl_data[nfl_data['Season'] == 2014].index)
nfl_data_without_2014

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2016-09-08,2016090800,1,1,,15:00,15,3600.0,0.0,CAR,...,,0.500007,0.499993,0.500007,0.499993,0.500007,0.000000,,,2016
1,2016-09-08,2016090800,1,1,1.0,15:00,15,3600.0,0.0,DEN,...,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.055450,2016
2,2016-09-08,2016090800,1,1,1.0,14:17,15,3557.0,43.0,DEN,...,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114,2016
3,2016-09-08,2016090800,1,1,2.0,14:13,15,3553.0,4.0,DEN,...,-0.506220,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779,2016
4,2016-09-08,2016090800,1,1,3.0,14:08,15,3548.0,5.0,DEN,...,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.082780,0.076812,0.005968,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2015-09-13,2015091306,22,3,,00:00,0,900.0,29.0,HOU,...,,,,,,,0.000000,,,2015
996,2015-09-13,2015091306,22,4,4.0,15:00,15,900.0,0.0,HOU,...,,0.029945,0.970055,0.036203,0.963797,0.029945,0.006258,,,2015
997,2015-09-13,2015091306,23,4,1.0,14:45,15,885.0,15.0,KC,...,,0.036203,0.963797,0.040507,0.959493,0.963797,-0.004304,,,2015
998,2015-09-13,2015091306,23,4,2.0,14:09,15,849.0,36.0,KC,...,,0.040507,0.959493,0.026349,0.973651,0.959493,0.014158,,,2015
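
Dropping by index labels works, but the same result can usually be had by keeping the rows you want with a boolean mask. The sketch below compares both routes on a hypothetical toy frame; it assumes the index labels are unique, since `drop` removes every row that shares a label.

```python
import pandas as pd

# Hypothetical toy data, not the NFL dataset
toy = pd.DataFrame({
    "Season": [2016, 2014, 2015, 2014],
    "play": ["a", "b", "c", "d"],
})

# Option 1: drop by index labels, as in the cell above
dropped = toy.drop(toy[toy["Season"] == 2014].index)

# Option 2: keep only the rows whose Season is not 2014
kept = toy[toy["Season"] != 2014]

print(dropped.equals(kept))
```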


## How many missing data points do we have?
Ok, now we know that we do have some missing values. \
**Let's see how many we have in each column**

In [36]:
# get the number of missing data points per column
# TODO: Your code here
num_missing_per_column = nfl_data.isnull().sum()


# Put the result in a variable  and
# look at the # of missing points in the first ten columns
# TODO: Your code here
first_ten_columns_missing = num_missing_per_column.head(10)
first_ten_columns_missing


Date              0
GameID            0
Drive             0
qtr               0
down            247
time              0
TimeUnder         0
TimeSecs          0
PlayTimeDiff      0
SideofField       5
dtype: int64

**Let's see how many we have in each row**\
Use isnull() and sum()

In [39]:
# TODO: Your code here
num_missing_per_row = nfl_data.isnull().sum(axis=1)
num_missing_per_row

0       35
1       22
2       23
3       23
4       23
        ..
1495    22
1496    22
1497    31
1498    29
1499    22
Length: 1500, dtype: int64
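
The only difference between the per-column and per-row counts is the `axis` argument to `sum()`. A minimal sketch on a hypothetical toy frame makes the two directions easier to see:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with three missing cells
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

per_column = toy.isnull().sum()     # axis=0 (default): one count per column
per_row = toy.isnull().sum(axis=1)  # axis=1: one count per row

print(per_column)
print(per_row)
```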

**Create a subset of the dataset with only the rows and columns that have any nulls**\
Use loc[], isnull() and sum() 

In [44]:
# TODO: Your code here
subset_with_null = nfl_data.loc[nfl_data.isnull().sum(axis=1) > 0, nfl_data.isnull().sum(axis=0) > 0]
subset_with_null

Unnamed: 0,down,SideofField,yrdln,yrdline100,GoalToGo,FirstDown,posteam,DefensiveTeam,ExPointResult,TwoPointConv,...,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA
0,,CAR,35.0,35.0,0.0,,DEN,CAR,,,...,,,0.500007,0.499993,0.500007,0.499993,0.500007,0.000000,,
1,1.0,DEN,25.0,75.0,0.0,1.0,DEN,CAR,,,...,-0.770860,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.055450
2,1.0,DEN,36.0,64.0,0.0,0.0,DEN,CAR,,,...,-0.714457,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114
3,2.0,DEN,36.0,64.0,0.0,0.0,DEN,CAR,,,...,-0.507284,-0.506220,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779
4,3.0,DEN,36.0,64.0,0.0,1.0,DEN,CAR,,,...,2.540002,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.082780,0.076812,0.005968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,2.0,STL,7.0,93.0,0.0,0.0,STL,MIN,,,...,-0.243345,0.000000,0.035756,0.964244,0.032837,0.967163,0.035756,-0.002919,-0.002919,0.000000
1496,3.0,STL,13.0,87.0,0.0,0.0,STL,MIN,,,...,-0.736817,0.463238,0.032837,0.967163,0.030020,0.969980,0.032837,-0.002817,-0.005277,0.002460
1497,4.0,STL,21.0,79.0,0.0,1.0,STL,MIN,,,...,,,0.030020,0.969980,0.039436,0.960564,0.030020,0.009416,,
1498,1.0,MIN,30.0,70.0,0.0,0.0,MIN,STL,,,...,,,0.039436,0.960564,0.042766,0.957234,0.960564,-0.003330,,
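
`loc[]` accepts one boolean mask for rows and another for columns. The sketch below spells out the two masks on a hypothetical toy frame, using `any()` instead of `sum() > 0` (the two are equivalent here):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'b' and row 2 are complete
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, 8.0, 9.0],
})

rows_with_nulls = toy.isnull().any(axis=1)  # indexed by row label
cols_with_nulls = toy.isnull().any(axis=0)  # indexed by column name

# Row mask first, column mask second
subset = toy.loc[rows_with_nulls, cols_with_nulls]
print(subset)
```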


That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing \
to give us a better sense of the scale of this problem:\
**Let's look at the percentage of total nulls in the entire dataset**\
Idea: null_cells/total number of cells

In [52]:
# how many total missing values do we have?
# TODO: Your code here
total_missing_values = nfl_data.isnull().sum().sum()
print(f'The total number of null values is: {total_missing_values}')

# percent of data that is missing
# TODO: Your code here
percent_missing_data = round((total_missing_values / nfl_data.size) * 100)  # round to the nearest whole percent
print(f'The percentage of missing data is: {percent_missing_data} %')

The total number of null values is: 42707
The percentage of missing data is: 28 %


Wow, more than a quarter of the cells in this dataset are empty! \
In the next step, we're going to take a closer look at some of the columns with missing values and try to figure out what might be going on with them.
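
To make the arithmetic concrete, here is a minimal sketch of the null_cells / total_cells idea on a hypothetical 2x2 toy frame. It also shows an equivalent shortcut: the mean of the boolean mask is the fraction of missing cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 of the 4 cells are missing
toy = pd.DataFrame({
    "a": [1.0, np.nan],
    "b": [np.nan, np.nan],
})

total_cells = toy.size                         # rows * columns
total_missing = int(toy.isnull().sum().sum())  # sum over columns, then overall
percent_missing = total_missing / total_cells * 100

# Equivalent shortcut: average the boolean mask over every cell
percent_missing_alt = toy.isnull().to_numpy().mean() * 100

print(percent_missing, percent_missing_alt)
```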

## Figure out why the data is missing
This is the point at which we get into the part of data science that I like to call "data intuition", by which I mean \
"really looking at your data and trying to figure out why it is the way it is and how that will affect your analysis". \
It can be a frustrating part of data science, especially if you're newer to the field and don't have a lot of experience. \
For dealing with missing values, you'll need to use your intuition to figure out why the value is missing. \
One of the most important questions you can ask yourself to help figure this out is this:

Is this value missing because it wasn't recorded or because it doesn't exist?

* If a value is missing because **it doesn't exist** (like the height of the oldest child of someone who doesn't have any children) \
then it doesn't make sense to try and guess what it might be. These values you probably do want to keep as NaN. 
* On the other hand, if a value is missing because **it wasn't recorded**, then you can try to guess what it might have been based on the other values \
in that column and row. This is called imputation, and we'll learn how to do it next! :)

Let's work through an example. Looking at the number of missing values in the nfl_data dataframe, \
I notice that the column **"TimeSecs"** has a lot of missing values in it:

In [58]:
# look at the # of missing points in the first ten columns
# TODO: Your code here
missing_points_first_ten_columns = nfl_data.isnull().sum().head(10)
missing_points_first_ten_columns

Date              0
GameID            0
Drive             0
qtr               0
down            247
time              0
TimeUnder         0
TimeSecs          0
PlayTimeDiff      0
SideofField       5
dtype: int64

I can see that this column has information on the number of seconds left in the game when the play was made. This means that these values are probably missing because they were not recorded, rather than because they don't exist. So, it would make sense for us to try and guess what they should be rather than just leaving them as NA's.

On the other hand, there are other fields, like **"PenalizedTeam"**, that also have a lot of missing values. In this case, though, the value is missing because if there was no penalty then it doesn't make sense to say which team was penalized. For this column, it would make more sense to either leave it empty or to add a third value like "neither" and use that to replace the NA's.
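
A minimal sketch of the "neither" idea, assuming a hypothetical toy column with the same name. The filled column name `PenalizedTeam_filled` is made up for illustration:

```python
import pandas as pd

# Hypothetical toy data: None means no penalty was called on that play
toy = pd.DataFrame({"PenalizedTeam": ["DEN", None, "CAR", None]})

# Replace the missing entries with an explicit third value
toy["PenalizedTeam_filled"] = toy["PenalizedTeam"].fillna("neither")
print(toy)
```

Keeping the original column alongside the filled one makes it easy to tell afterwards which values were really recorded.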

If you're doing very careful data analysis, this is the point at which you'd look at each column individually to figure out the best strategy for filling those missing values. For the rest of this notebook, we'll cover some "quick and dirty" techniques that can help you with missing values but will probably also end up removing some useful information or adding some noise to your data.

## Drop missing values
If you're in a hurry or don't have a reason to figure out why your values are missing, \
one option you have is to just remove any rows or columns that contain missing values. \
(Note: I don't generally recommend this approach for important projects! It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.)

If you're sure you want to drop rows with missing values, pandas does have a handy function, \
**dropna()** to help you do this. Let's try it out on our NFL dataset!

In [59]:
# remove all the rows that contain a missing value
# TODO: Your code here
nfl_data_no_missing_rows = nfl_data.dropna()
nfl_data_no_missing_rows

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


Oh dear, it looks like that's removed all our data! 😱 This is because every row in our dataset had at least one missing value. We might have better luck removing all the columns that have at least one missing value instead.

In [63]:
# How many columns have at least a single null value?
# TODO: Your code here
columns_with_nulls = nfl_data.columns[nfl_data.isnull().any()]
number_of_columns_with_nulls = len(columns_with_nulls)
number_of_columns_with_nulls

52

In [66]:
# remove all columns with at least one missing value
# TODO: Your code here
nfl_data_no_missing_columns = nfl_data.dropna(axis=1)
nfl_data_no_missing_columns


Unnamed: 0,Date,GameID,Drive,qtr,time,TimeUnder,TimeSecs,PlayTimeDiff,ydstogo,ydsnet,...,Opp_Safety_Prob,Opp_Touchdown_Prob,Field_Goal_Prob,Safety_Prob,Touchdown_Prob,ExPoint_Prob,TwoPoint_Prob,ExpPts,EPA,Season
0,2016-09-08,2016090800,1,1,15:00,15,3600.0,0.0,0,0,...,0.004441,0.254179,0.233081,0.003656,0.340639,0.0,0.0,0.814998,0.000000,2016
1,2016-09-08,2016090800,1,1,15:00,15,3600.0,0.0,10,11,...,0.004441,0.254179,0.233081,0.003656,0.340639,0.0,0.0,0.814998,1.031075,2016
2,2016-09-08,2016090800,1,1,14:17,15,3557.0,43.0,10,11,...,0.001772,0.196669,0.273691,0.003675,0.396658,0.0,0.0,1.846073,-0.714939,2016
3,2016-09-08,2016090800,1,1,14:13,15,3553.0,4.0,10,11,...,0.002647,0.233342,0.265016,0.004037,0.344743,0.0,0.0,1.131134,-1.013504,2016
4,2016-09-08,2016090800,1,1,14:08,15,3548.0,5.0,10,23,...,0.003375,0.287852,0.237856,0.004885,0.280973,0.0,0.0,0.117630,2.689740,2016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,2014-09-07,2014090709,19,4,13:08,14,788.0,21.0,26,-10,...,0.019359,0.357295,0.127355,0.003061,0.167341,0.0,0.0,-1.679311,-0.243345,2014
1496,2014-09-07,2014090709,19,4,12:30,13,750.0,38.0,20,-2,...,0.012135,0.371084,0.108245,0.003892,0.159390,0.0,0.0,-1.922656,-0.273580,2014
1497,2014-09-07,2014090709,19,4,12:00,12,720.0,30.0,12,-2,...,0.007469,0.384180,0.069872,0.004241,0.158924,0.0,0.0,-2.196236,0.734740,2014
1498,2014-09-07,2014090709,20,4,11:50,12,710.0,10.0,10,1,...,0.002519,0.190025,0.234816,0.003188,0.352326,0.0,0.0,1.461496,-0.542992,2014


We've lost quite a bit of data, but at this point we have successfully removed all the NaN's from our data.
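
If dropping everything with a single NaN is too aggressive, `dropna()` has two parameters that soften it: `subset` restricts the check to chosen columns, and `thresh` sets the minimum number of non-null values needed to keep a row or column. A sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'c' is entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

rows_dropped = toy.dropna()                 # every row has a NaN in 'c', so nothing is left
cols_dropped = toy.dropna(axis=1)           # only the complete column 'b' survives
rows_subset = toy.dropna(subset=["a"])      # only require 'a' to be present
cols_thresh = toy.dropna(axis=1, thresh=2)  # keep columns with at least 2 non-null values

print(len(rows_dropped), list(cols_dropped.columns))
print(list(rows_subset.index), list(cols_thresh.columns))
```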

## Filling in missing values automatically
Another option is to try and fill in the missing values. \
For this next bit, I'm getting a small sub-section of the NFL data so that it will print well. \
**We get only the first 6 rows and the columns between 'EPA' and 'Season'**\
Use loc[] and slicing on both axes

In [68]:
# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:5, 'EPA':'Season']
subset_nfl_data

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,0.0,,,0.500007,0.499993,0.500007,0.499993,0.500007,0.0,,,2016
1,1.031075,-0.77086,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.05545,2016
2,-0.714939,-0.714457,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114,2016
3,-1.013504,-0.507284,-0.50622,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779,2016
4,2.68974,2.540002,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.08278,0.076812,0.005968,2016
5,0.061182,-0.093682,0.154865,0.565532,0.434468,0.569686,0.430314,0.565532,0.004154,-0.002694,0.006848,2016


We can use the pandas fillna() function to fill in missing values in a dataframe for us. One option we have is to specify what we want the NaN values to be replaced with. Here, I'm saying that I would like to replace all the NaN values with 0.

In [70]:
# replace all NA's with 0
# TODO: Your code here

subset_nfl_data_filled = subset_nfl_data.fillna(0)
subset_nfl_data_filled

Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,0.0,0.0,0.0,0.500007,0.499993,0.500007,0.499993,0.500007,0.0,0.0,0.0,2016
1,1.031075,-0.77086,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.05545,2016
2,-0.714939,-0.714457,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114,2016
3,-1.013504,-0.507284,-0.50622,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779,2016
4,2.68974,2.540002,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.08278,0.076812,0.005968,2016
5,0.061182,-0.093682,0.154865,0.565532,0.434468,0.569686,0.430314,0.565532,0.004154,-0.002694,0.006848,2016
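
One further option in the same spirit, offered as a sketch rather than as part of the exercise: fill each NaN with the value that comes directly after it in the same column (`bfill()`), then replace whatever is still missing with 0. The toy column `x` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, including one at the very end
toy = pd.DataFrame({"x": [np.nan, 2.0, np.nan, 4.0, np.nan]})

# bfill() copies the next valid value backwards into each gap;
# the trailing NaN has nothing after it, so fillna(0) catches it
filled = toy.bfill().fillna(0)
print(filled)
```

Whether a neighbouring value is a sensible guess depends on the column, so this only makes sense where rows have a logical order.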


Ok, now we know that we do have some missing values. \
**Let's see how many we have in each column in the subset**

In [78]:
# TODO: Your code here
missing_values_subset = subset_nfl_data.isnull().sum()
missing_values_subset

EPA             0
airEPA          1
yacEPA          1
Home_WP_pre     0
Away_WP_pre     0
Home_WP_post    0
Away_WP_post    0
Win_Prob        0
WPA             0
airWPA          1
yacWPA          1
Season          0
dtype: int64

**Let's see how many we have in each row in the subset**

In [2]:
# TODO: Your code here
missing_values_rows_subset = subset_nfl_data.isnull().sum(axis=1)
missing_values_rows_subset

0    4
1    0
2    0
3    0
4    0
5    0
dtype: int64

**Create a subset of the dataset with only the rows and columns that have any nulls**

In [83]:
# TODO: Your code here
# row mask goes first, column mask second, so each is applied to its own axis
subset_nulls = subset_nfl_data.loc[subset_nfl_data.isnull().any(axis=1), subset_nfl_data.isnull().any(axis=0)]
subset_nulls

Unnamed: 0,airEPA,yacEPA,airWPA,yacWPA
0,,,,


**Fill the nulls in airEPA and airWPA with 0**

In [81]:
# TODO: Your code here
subset_nfl_data_filled_specific = subset_nulls.fillna({'airEPA':0, 'airWPA':0})
subset_nfl_data_filled_specific

Unnamed: 0,airEPA,yacEPA,airWPA,yacWPA
0,0.0,,0.0,


**Fill the null values in yacEPA and yacWPA with the mean of each column**

In [86]:
# TODO: Your code here
mean_yacEPA = subset_nfl_data['yacEPA'].mean()
mean_yacWPA = subset_nfl_data['yacWPA'].mean()
subset_nfl_data_filled_mean = subset_nfl_data.fillna({'yacEPA': mean_yacEPA, 'yacWPA': mean_yacWPA})
subset_nfl_data_filled_mean


Unnamed: 0,EPA,airEPA,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,0.0,,0.319967,0.500007,0.499993,0.500007,0.499993,0.500007,0.0,,0.010675,2016
1,1.031075,-0.77086,1.801934,0.500007,0.499993,0.533353,0.466647,0.500007,0.033347,-0.022104,0.05545,2016
2,-0.714939,-0.714457,-0.000482,0.533353,0.466647,0.512108,0.487892,0.533353,-0.021245,-0.021131,-0.000114,2016
3,-1.013504,-0.507284,-0.50622,0.512108,0.487892,0.482752,0.517248,0.512108,-0.029356,-0.014578,-0.014779,2016
4,2.68974,2.540002,0.149739,0.482752,0.517248,0.565532,0.434468,0.482752,0.08278,0.076812,0.005968,2016
5,0.061182,-0.093682,0.154865,0.565532,0.434468,0.569686,0.430314,0.565532,0.004154,-0.002694,0.006848,2016
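
When several numeric columns should each get their own mean, `fillna()` can take the Series returned by `DataFrame.mean()` directly, which saves building the dictionary by hand. A sketch on hypothetical toy data (the values are made up, only the column names echo the cell above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "yacEPA": [np.nan, 1.0, 3.0],
    "yacWPA": [0.5, np.nan, 1.5],
})

# mean() skips NaN by default and returns one value per column
column_means = toy.mean()

# fillna() matches the Series index against the column names
filled = toy.fillna(column_means)
print(filled)
```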


**Check whether any null values remain in the subset**

In [88]:
# TODO: Your code here
no_missing_values_subset = subset_nfl_data_filled_mean.isnull().sum().sum()
print(f'Number of null values left in the subset after the fills requested above: {no_missing_values_subset}')

Number of null values left in the subset after the fills requested above: 2
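
The two remaining nulls are the airEPA and airWPA cells in row 0, because the mean fill was applied to the original subset and only touched the yac columns. A minimal sketch, on hypothetical toy data, of doing both kinds of fill in one `fillna()` call and then checking that nothing is left:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap in each column
toy = pd.DataFrame({
    "airEPA": [np.nan, -0.5],
    "yacEPA": [np.nan, 1.5],
})

# One dictionary can mix a constant fill and a mean fill
filled = toy.fillna({"airEPA": 0, "yacEPA": toy["yacEPA"].mean()})

remaining = int(filled.isnull().sum().sum())
print(remaining)
```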
