# Data Cleaning: Missing Values


> "Garbage in, garbage out."  

Let's clean up the garbage!

In this Notebook we will:

1. Review dropping unnecessary columns and duplicates, correcting datatypes, and fixing poorly formed categories
2. Identify columns with missing values
3. Choose and implement strategies for dealing with missing values.



### Initial Inspection
Phase 1: What data have we been provided?
The stakeholders have provided us with two links:

* [Share-URL to a .csv file](https://drive.google.com/file/d/1Jach7HsZVywhJnUJmkyqje52ho_0VJgo/view?usp=drive_link)  

A spreadsheet of various features of homes in their town, as well as the price of each house at the time of sale. Add this file to your Google Drive or upload it to the Colab session.

* [Data Dictionary](https://docs.google.com/document/d/1nmnel7g35aMOl0mKiSsTHXT8wRzbJ1EktKNqYFEmpWE/edit?usp=sharing)
A data dictionary is a document that lists the name and explanation for every feature in a dataset.

**Data Dictionary**  
As data scientists, we often work with data that is outside of our domain of expertise. To help us get acclimated to the domain of the project and learn the business value of each feature, we may be provided with a data dictionary. Ideally, the data dictionary will clarify any abbreviations or codes that are used in the data. Where it does not, consult the stakeholders for additional clarification. It is a good idea to preview the data dictionary at the outset of a project and to keep it readily available as a reference.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
# Import Pandas
import pandas as pd

In [2]:
# Load the Data
df = pd.read_csv('../files/ames-housing-dojo.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
0,0,907227090,RL,60,7200,Pave,,AllPub,CollgCr,1Fam,...,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006,119900.0
1,1,527108010,RL,134,19378,Pave,,AllPub,Gilbert,1Fam,...,Attchd,2006.0,2.0,576.0,TA,TA,Y,,03-2006,320000.0
2,2,534275170,RL,-1,12772,Pave,,AllPub,NAmes,1Fam,...,Attchd,1960.0,1.0,301.0,TA,TA,Y,,04-2007,151500.0
3,3,528104050,RL,114,14803,Pave,,AllPub,NridgHt,1Fam,...,Attchd,2007.0,3.0,1220.0,TA,TA,Y,,06-2008,385000.0
4,4,533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,...,Attchd,2006.0,2.0,476.0,TA,TA,Y,,02-2007,193800.0


In [None]:
# How many rows and columns?
# (the preview above reports only its own size: 5 rows × 38 columns)
df.shape

In [3]:
# What are the datatypes?
df.dtypes

Unnamed: 0          int64
PID                 int64
MS Zoning          object
Lot Frontage        int64
Lot Area            int64
Street             object
Alley              object
Utilities          object
Neighborhood       object
Bldg Type          object
House Style        object
Overall Qual        int64
Overall Cond        int64
Year Built          int64
Year Remod/Add      int64
Exter Qual         object
Exter Cond         object
Bsmt Unf SF       float64
Total Bsmt SF     float64
Central Air        object
Gr Liv Area        object
Bsmt Full Bath    float64
Bsmt Half Bath    float64
Full Bath           int64
Half Bath          object
Bedroom             int64
Kitchen             int64
TotRms AbvGrd       int64
Garage Type        object
Garage Yr Blt     float64
Garage Cars       float64
Garage Area       float64
Garage Qual        object
Garage Cond        object
Paved Drive        object
Fence              object
Date Sold          object
SalePrice         float64
dtype: object

In [4]:
# What are the column names?
df.columns

Index(['Unnamed: 0', 'PID', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street',
       'Alley', 'Utilities', 'Neighborhood', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add',
       'Exter Qual', 'Exter Cond', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Central Air', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath',
       'Full Bath', 'Half Bath', 'Bedroom', 'Kitchen', 'TotRms AbvGrd',
       'Garage Type', 'Garage Yr Blt', 'Garage Cars', 'Garage Area',
       'Garage Qual', 'Garage Cond', 'Paved Drive', 'Fence', 'Date Sold',
       'SalePrice'],
      dtype='object')

In [5]:
# General information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2959 entries, 0 to 2958
Data columns (total 38 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      2959 non-null   int64  
 1   PID             2959 non-null   int64  
 2   MS Zoning       2959 non-null   object 
 3   Lot Frontage    2959 non-null   int64  
 4   Lot Area        2959 non-null   int64  
 5   Street          2959 non-null   object 
 6   Alley           201 non-null    object 
 7   Utilities       2959 non-null   object 
 8   Neighborhood    2959 non-null   object 
 9   Bldg Type       2959 non-null   object 
 10  House Style     2959 non-null   object 
 11  Overall Qual    2959 non-null   int64  
 12  Overall Cond    2959 non-null   int64  
 13  Year Built      2959 non-null   int64  
 14  Year Remod/Add  2959 non-null   int64  
 15  Exter Qual      2959 non-null   object 
 16  Exter Cond      2959 non-null   object 
 17  Bsmt Unf SF     2958 non-null   float64
 ...

**Changing a data type**

Upon closer inspection, we find that one of our data types is incorrect. The "Gr Liv Area" is defined in the data dictionary as "Above grade (ground) living area square feet." Square footage should be a numeric value, but the data type is "object" (string).

In [6]:
# Check the datatype for a particular column
df['Gr Liv Area'].dtypes

dtype('O')

In [None]:
# We can convert using the astype method
df['Gr Liv Area'].astype(float)


ValueError: ignored

We get an error pointing to an incompatible datatype conversion. Specifically, some strings cannot be converted to floats.  
Let us inspect the column further.

In [None]:
# Check the value_counts for Living Area Sqft
df['Gr Liv Area'].value_counts()



864sqft     41
1092sqft    27
1040sqft    25
1456sqft    20
1200sqft    18
            ..
2799sqft     1
1778sqft     1
1386sqft     1
2004sqft     1
1789sqft     1
Name: Gr Liv Area, Length: 1292, dtype: int64

We can see that each value ends with the string **sqft**. We want numbers, so let us strip that suffix using the string **replace()** method.

In [None]:
test_string = "hello world"

In [None]:
test_string.replace("hello", "hi")

'hi world'

In [None]:
"864sqft".replace("sqft",'')

'864'

In [None]:
# Replace Sqft with nothing
df['Gr Liv Area'] = df['Gr Liv Area'].str.replace("sqft",'')
# Preview the first 5 values for the column to verify the change
df['Gr Liv Area'].head()


0     864
1    2462
2     958
3    2084
4    1565
Name: Gr Liv Area, dtype: object

In [None]:
# let us try to change the column datatype again
# Convert the column to a float
df['Gr Liv Area'] = df['Gr Liv Area'].astype(float)
# Confirm the data type of the column
df['Gr Liv Area'].dtype


dtype('float64')
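When a conversion fails like this, it can help to find exactly which values are the problem before cleaning them. The sketch below is one possible approach, not part of the original notebook: it uses `pd.to_numeric` with `errors="coerce"` on a small made-up Series (the values and the name `living_area` are assumptions for illustration) to list the entries that will not parse, then strips the suffix and converts.

```python
import pandas as pd

# Toy stand-in for the "Gr Liv Area" column (values are made up for illustration)
living_area = pd.Series(["864sqft", "1200", "958sqft"], name="Gr Liv Area")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
as_numbers = pd.to_numeric(living_area, errors="coerce")

# Entries that became NaN but were not missing to begin with are the bad strings
failed = living_area[as_numbers.isna() & living_area.notna()]
print(failed.tolist())

# Strip the literal suffix (regex=False treats "sqft" as plain text), then convert
cleaned = living_area.str.replace("sqft", "", regex=False).astype(float)
print(cleaned.tolist())
```

Passing `regex=False` makes it explicit that the pattern is a literal string, which avoids surprises if the suffix ever contains regex characters.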

### Deleting columns  
It is also common to remove unnecessary columns from your data. Of course, this decision may require domain knowledge to validate your assumptions. For example, the column **Unnamed: 0** appears to simply repeat the row index, so it is not useful for our analysis.

In [None]:
# Dropping unnamed 0 (permanently)
df = df.drop(columns=['Unnamed: 0'])
df.head()

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
0,907227090,RL,60,7200,Pave,,AllPub,CollgCr,1Fam,1Story,...,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006,119900.0
1,527108010,RL,134,19378,Pave,,AllPub,Gilbert,1Fam,2Story,...,Attchd,2006.0,2.0,576.0,TA,TA,Y,,03-2006,320000.0
2,534275170,RL,-1,12772,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1960.0,1.0,301.0,TA,TA,Y,,04-2007,151500.0
3,528104050,RL,114,14803,Pave,,AllPub,NridgHt,1Fam,1Story,...,Attchd,2007.0,3.0,1220.0,TA,TA,Y,,06-2008,385000.0
4,533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,...,Attchd,2006.0,2.0,476.0,TA,TA,Y,,02-2007,193800.0


### Renaming Columns
Renaming columns can make your life easier. Deciding which columns to rename and how to rename them is subjective, but we have selected some very abbreviated column names or ambiguous columns to rename for easier interpretation:

* "Year Remod/Add" is really "Year Remodeled"
* Several features abbreviate square feet as SF, which we will replace with Sqft:
  * "Bsmt Unf SF" becomes "Bsmt Unf Sqft"
  * "Total Bsmt SF" becomes "Total Bsmnt Sqft"
* "TotRms AbvGrd" is really "Total Rooms"

There are also some misleading/confusing column names:

* "Gr Liv Area" is really the sqft of living area (above ground), which we will call "Living Area Sqft"  

While there are other techniques for renaming, we recommend starting by creating a dictionary with the old name matched to the new name.

In [None]:
# Create a dictionary using old column name : new column name format
rename_dict = {"Year Remod/Add":"Year Remodeled",
               "Bsmt Unf SF": "Bsmt Unf Sqft",
               "Total Bsmt SF": "Total Bsmnt Sqft",
               "TotRms AbvGrd": "Total Rooms",
               "Gr Liv Area":"Living Area Sqft"}
rename_dict

{'Year Remod/Add': 'Year Remodeled',
 'Bsmt Unf SF': 'Bsmt Unf Sqft',
 'Total Bsmt SF': 'Total Bsmnt Sqft',
 'TotRms AbvGrd': 'Total Rooms',
 'Gr Liv Area': 'Living Area Sqft'}

In [None]:
# dictionary substitution using rename method
df = df.rename(rename_dict,axis=1)
df.head()

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
0,907227090,RL,60,7200,Pave,,AllPub,CollgCr,1Fam,1Story,...,Detchd,1977.0,1.0,297.0,TA,TA,Y,MnPrv,03-2006,119900.0
1,527108010,RL,134,19378,Pave,,AllPub,Gilbert,1Fam,2Story,...,Attchd,2006.0,2.0,576.0,TA,TA,Y,,03-2006,320000.0
2,534275170,RL,-1,12772,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1960.0,1.0,301.0,TA,TA,Y,,04-2007,151500.0
3,528104050,RL,114,14803,Pave,,AllPub,NridgHt,1Fam,1Story,...,Attchd,2007.0,3.0,1220.0,TA,TA,Y,,06-2008,385000.0
4,533206070,FV,32,3784,Pave,Pave,AllPub,Somerst,TwnhsE,1Story,...,Attchd,2006.0,2.0,476.0,TA,TA,Y,,02-2007,193800.0


In [None]:
# check if columns are renamed
df.columns
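Checking the column list by eye works for a handful of names. As a hedged alternative, the sketch below confirms a rename programmatically on a small made-up dataframe. The frame `toy` and its values are assumptions for illustration; only the column names come from this lesson.

```python
import pandas as pd

# Two of the old -> new pairs from the lesson's rename dictionary
rename_dict = {"Gr Liv Area": "Living Area Sqft",
               "TotRms AbvGrd": "Total Rooms"}

# Toy dataframe (values are made up)
toy = pd.DataFrame({"PID": [1, 2],
                    "Gr Liv Area": [864.0, 958.0],
                    "TotRms AbvGrd": [5, 6]})
toy = toy.rename(columns=rename_dict)

# Any old name still present, or new name absent, signals a typo in the dictionary
old_names_left = [old for old in rename_dict if old in toy.columns]
new_names_missing = [new for new in rename_dict.values() if new not in toy.columns]
print(old_names_left, new_names_missing)
```

This kind of check matters because `rename` silently ignores dictionary keys that do not match any column.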

# Remove Duplicates

We don't want any duplicate entries in our data.  This will skew our analysis and confuse our predictive models.  Duplicate entries are quite common in data that has not been cleaned.

We can use `df.duplicated()` to show whether rows are duplicates.

In [None]:
# Check for duplicates
duplicated_rows  = df.duplicated()
duplicated_rows

0       False
1       False
2       False
3       False
4       False
        ...  
2954    False
2955    False
2956    False
2957    False
2958    False
Length: 2959, dtype: bool

Well, we don't want to have to look row by row, so let's use `.sum()` to add up all of the `True` values.  When using `.sum()`, a `True` value will evaluate to a 1.
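To see why summing works, here is a minimal sketch on a made-up boolean Series (the name `flags` is an assumption for illustration). Each `True` counts as 1 and each `False` as 0, so `.sum()` gives a count and `.mean()` gives a proportion.

```python
import pandas as pd

# Toy boolean Series standing in for the output of df.duplicated()
flags = pd.Series([False, True, False, True, True])

n_true = flags.sum()       # number of True values
share_true = flags.mean()  # fraction of values that are True
print(n_true, share_true)
```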

In [None]:
# Count the duplicates
df.duplicated().sum()
# OR
# duplicated_rows.sum()

7

We have 7 duplicated rows.  We can simply remove those with df.drop_duplicates().

In [None]:
# Visually checking the duplicate rows
df[duplicated_rows]

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
869,535153150,RL,76,9120,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,2.0,433.0,TA,TA,Y,,11-2008,163000.0
1019,921205030,RL,88,11443,Pave,,AllPub,Timber,1Fam,1Story,...,Attchd,2005.0,3.0,880.0,TA,TA,Y,,03-2006,369900.0
1867,908103280,RL,65,6500,Pave,,AllPub,Edwards,1Fam,1Story,...,Detchd,1991.0,2.0,480.0,TA,TA,Y,,05-2008,135000.0
2029,526351010,RL,81,14267,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,1.0,312.0,TA,TA,Y,,06-2010,172000.0
2203,923230040,RL,63,9297,Pave,,AllPub,Mitchel,Duplex,1Story,...,Detchd,1976.0,2.0,560.0,TA,TA,Y,,07-2006,188000.0
2306,907262070,RL,72,7226,Pave,,AllPub,CollgCr,1Fam,2Story,...,Attchd,2003.0,2.0,595.0,TA,TA,Y,,06-2008,183000.0
2552,528174020,RL,34,3901,Pave,,AllPub,NridgHt,Twnhs,1Story,...,Attchd,2005.0,2.0,631.0,TA,TA,Y,,08-2007,204000.0


The rows shown above are all different from one another, so we don't see any repeats. This is because `.duplicated()` defaults to `keep='first'`, which treats the first occurrence as the original and marks only the later occurrences as `True`. If we wanted to treat the last occurrence as the original and mark all earlier ones as duplicates, we could use `keep='last'`. If we do not want to designate an original row and want every identical row flagged as a duplicate, we can pass `keep=False` to `.duplicated()`.
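The effect of the `keep` argument is easiest to see side by side. The sketch below uses a small made-up dataframe (the values are assumptions for illustration) in which rows 0, 2, and 4 are identical.

```python
import pandas as pd

# Toy data: rows 0, 2 and 4 are exact copies of each other
toy = pd.DataFrame({"PID": [101, 102, 101, 103, 101],
                    "SalePrice": [1.0, 2.0, 1.0, 3.0, 1.0]})

first = toy.duplicated()                  # default keep="first": first copy is the original
last = toy.duplicated(keep="last")        # last copy is the original
all_copies = toy.duplicated(keep=False)   # every copy is flagged

print(first.tolist())       # [False, False, True, False, True]
print(last.tolist())        # [True, False, True, False, False]
print(all_copies.tolist())  # [True, False, True, False, True]
```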

In [None]:
# Include the first row along with all duplicates
duplicated_rows = df.duplicated(keep=False)
duplicated_rows.sum()

14

In [None]:
# We can sort by a column so that matching rows appear next to each other
df[duplicated_rows].sort_values("PID")



Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
290,526351010,RL,81,14267,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,1.0,312.0,TA,TA,Y,,06-2010,172000.0
2029,526351010,RL,81,14267,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,1.0,312.0,TA,TA,Y,,06-2010,172000.0
1100,528174020,RL,34,3901,Pave,,AllPub,NridgHt,Twnhs,1Story,...,Attchd,2005.0,2.0,631.0,TA,TA,Y,,08-2007,204000.0
2552,528174020,RL,34,3901,Pave,,AllPub,NridgHt,Twnhs,1Story,...,Attchd,2005.0,2.0,631.0,TA,TA,Y,,08-2007,204000.0
524,535153150,RL,76,9120,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,2.0,433.0,TA,TA,Y,,11-2008,163000.0
869,535153150,RL,76,9120,Pave,,AllPub,NAmes,1Fam,1Story,...,Attchd,1958.0,2.0,433.0,TA,TA,Y,,11-2008,163000.0
1943,907262070,RL,72,7226,Pave,,AllPub,CollgCr,1Fam,2Story,...,Attchd,2003.0,2.0,595.0,TA,TA,Y,,06-2008,183000.0
2306,907262070,RL,72,7226,Pave,,AllPub,CollgCr,1Fam,2Story,...,Attchd,2003.0,2.0,595.0,TA,TA,Y,,06-2008,183000.0
1577,908103280,RL,65,6500,Pave,,AllPub,Edwards,1Fam,1Story,...,Detchd,1991.0,2.0,480.0,TA,TA,Y,,05-2008,135000.0
1867,908103280,RL,65,6500,Pave,,AllPub,Edwards,1Fam,1Story,...,Detchd,1991.0,2.0,480.0,TA,TA,Y,,05-2008,135000.0


**Dropping Duplicate Rows**  

Once we identify duplicate rows, we should remove them. After dropping, we again check for duplicates to verify that we now have 0 duplicates.

In [None]:
# Remove duplicates
df = df.drop_duplicates()

# We can also alter the dataframe directly
# df.drop_duplicates(inplace=True)

df.duplicated().sum()

0

#### Unique Values
Sometimes there may be rows that are essentially duplicate entries but with some small discrepancy. A single differing value in one column is enough for `.duplicated()` to overlook the duplication.

One consideration for identifying such duplicates is if there is an identifying feature in the dataset that should be unique for each entry. In our case, the PID should not be repeated as each house should have its own value.

We can check how many unique values each column contains using `.nunique()`.
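As an illustration of a near-duplicate slipping past a full-row check, here is a hedged sketch on made-up data (the frame `toy` and its values are assumptions). Two rows share a PID but differ in `SalePrice`, so `.duplicated()` reports nothing while the identifier column reveals the repeat.

```python
import pandas as pd

# Toy data: PID 102 appears twice, once with a price and once without
toy = pd.DataFrame({"PID": [101, 102, 102, 103],
                    "SalePrice": [150000.0, 200000.0, None, 175000.0]})

full_row_dupes = toy.duplicated().sum()          # compares every column
pid_is_unique = toy["PID"].is_unique             # True only if no PID repeats
extra_pids = len(toy) - toy["PID"].nunique()     # rows beyond one per PID

print(full_row_dupes, pid_is_unique, extra_pids)
```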

In [None]:
# Check for how many unique values are in each column
df.nunique()

PID                 2930
MS Zoning              7
Lot Frontage         129
Lot Area            1960
Street                 2
Alley                  2
Utilities              3
Neighborhood          28
Bldg Type              5
House Style            8
Overall Qual          10
Overall Cond           9
Year Built           118
Year Remodeled        61
Exter Qual             4
Exter Cond             5
Bsmt Unf Sqft       1137
Total Bsmnt Sqft    1058
Central Air            4
Living Area Sqft    1292
Bsmt Full Bath         4
Bsmt Half Bath         3
Full Bath              5
Half Bath              4
Bedroom                8
Kitchen                4
Total Rooms           14
Garage Type            6
Garage Yr Blt        103
Garage Cars            6
Garage Area          603
Garage Qual            5
Garage Cond            5
Paved Drive            3
Fence                  4
Date Sold             55
SalePrice           1033
dtype: int64

In [None]:
# We can also see the fraction of the data that is unique
df.nunique() / len(df)

# OR Percentage
# df.nunique() / len(df) * 100

PID                 0.992547
MS Zoning           0.002371
Lot Frontage        0.043699
Lot Area            0.663957
Street              0.000678
Alley               0.000678
Utilities           0.001016
Neighborhood        0.009485
Bldg Type           0.001694
House Style         0.002710
Overall Qual        0.003388
Overall Cond        0.003049
Year Built          0.039973
Year Remodeled      0.020664
Exter Qual          0.001355
Exter Cond          0.001694
Bsmt Unf Sqft       0.385163
Total Bsmnt Sqft    0.358401
Central Air         0.001355
Living Area Sqft    0.437669
Bsmt Full Bath      0.001355
Bsmt Half Bath      0.001016
Full Bath           0.001694
Half Bath           0.001355
Bedroom             0.002710
Kitchen             0.001355
Total Rooms         0.004743
Garage Type         0.002033
Garage Yr Blt       0.034892
Garage Cars         0.002033
Garage Area         0.204268
Garage Qual         0.001694
Garage Cond         0.001694
Paved Drive         0.001016
Fence         

**Duplicates in a subset**  

We can apply `.duplicated()` to a subset of the columns to check for duplicates in particular columns only. In our case, we want to find rows that have duplicate PID values.
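The `subset` argument accepts a list, so the check can cover one column or several. The sketch below, on made-up values, shows that adding a column to the subset can only keep or reduce the number of matches, because more fields have to agree.

```python
import pandas as pd

# Toy data (values are made up): two PIDs, each appearing twice
toy = pd.DataFrame({"PID": [11, 11, 22, 22],
                    "MS Zoning": ["RL", "RL", "RL", "FV"]})

by_pid = toy.duplicated(subset=["PID"]).sum()                # PID alone must match
by_both = toy.duplicated(subset=["PID", "MS Zoning"]).sum()  # both fields must match

print(by_pid, by_both)
```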

In [None]:
# Count rows that repeat an earlier row's PID and MS Zoning
df.duplicated(subset=['PID','MS Zoning']).sum()

22

In [None]:
# How many rows share a PID with another row (including the first occurrence)?
duplicated_pids = df.duplicated(subset=['PID'], keep=False)
duplicated_pids.sum()

44

In [None]:
# Sorting for better display
# Visually checking the duplicate rows
df[duplicated_pids].sort_values("PID")

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
2654,526355080,RL,75,13860,Pave,,AllPub,NAmes,1Fam,SLvl,...,Attchd,1972.0,2.0,538.0,TA,TA,Y,MnPrv,07-2009,
1650,526355080,RL,75,13860,Pave,,AllPub,NAmes,1Fam,SLvl,...,Attchd,1972.0,2.0,538.0,TA,TA,Y,MnPrv,07-2009,345000.0
135,527110020,RL,-1,8530,Pave,,AllPub,Gilbert,1Fam,SLvl,...,BuiltIn,1995.0,2.0,400.0,TA,TA,Y,,05-2009,168500.0
2469,527110020,RL,-1,8530,Pave,,AllPub,Gilbert,1Fam,SLvl,...,BuiltIn,1995.0,2.0,400.0,TA,TA,Y,,05-2009,
626,527326040,RL,85,11900,Pave,,AllPub,NWAmes,1Fam,1Story,...,Attchd,1977.0,2.0,544.0,TA,TA,Y,,04-2009,82500.0
625,527326040,RL,85,11900,Pave,,AllPub,NWAmes,1Fam,1Story,...,Attchd,1977.0,2.0,544.0,TA,TA,Y,,04-2009,
2341,528178070,RL,130,16900,Pave,,AllPub,NridgHt,1Fam,2Story,...,Attchd,2001.0,3.0,746.0,TA,TA,Y,,01-2008,421250.0
929,528178070,RL,130,16900,Pave,,AllPub,NridgHt,1Fam,2Story,...,Attchd,2001.0,3.0,746.0,TA,TA,Y,,01-2008,
2599,528429100,RL,49,15218,Pave,,AllPub,Somerst,1Fam,1Story,...,Attchd,2006.0,3.0,928.0,TA,TA,Y,,09-2006,336820.0
324,528429100,RL,49,15218,Pave,,AllPub,Somerst,1Fam,1Story,...,Attchd,2006.0,3.0,928.0,TA,TA,Y,,09-2006,


# Identify Missing Values  
In a spreadsheet, a null value is just an empty cell. In NumPy and Pandas, null values show up as "NaN," which stands for "Not a Number"; despite how it is displayed, it is not a string. It simply means no data is included. NaN is a special floating-point value, so any integer column that contains NaNs will be recast as float.  
The image below shows a null value (NaN) in a dataframe.  

 <img src="https://assets.codingdojo.com/boomyeah2015/codingdojo/curriculum/content/chapter/1680019526__Capture.PNG" width="220px" height="300px">
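The behavior described above can be checked directly. This is a minimal sketch on made-up Series (the names `ints` and `with_gap` are assumptions for illustration), using pandas' default dtypes.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])           # no missing values: stays an integer dtype
with_gap = pd.Series([1, np.nan, 3])  # one NaN: the whole Series becomes float

print(ints.dtype, with_gap.dtype)
print(type(np.nan))                   # NaN is a float value, not a string
print(np.nan == np.nan)               # NaN never equals anything, even itself
print(with_gap.isna().tolist())       # so use isna() rather than == to find it
```

The fact that NaN is not equal to itself is the reason the lesson relies on `.isna()` instead of an equality comparison.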

In [None]:
# Finding if there are null values with isna() method
df.isna()

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
0,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2954,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2955,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2956,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2957,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
# isnull() is an alias of isna(); sum the True values to count nulls per column
df.isnull().sum()

PID                    0
MS Zoning              0
Lot Frontage           0
Lot Area               0
Street                 0
Alley               2751
Utilities              0
Neighborhood           0
Bldg Type              0
House Style            0
Overall Qual           0
Overall Cond           0
Year Built             0
Year Remodeled         0
Exter Qual             0
Exter Cond             0
Bsmt Unf Sqft          1
Total Bsmnt Sqft       1
Central Air            0
Living Area Sqft       0
Bsmt Full Bath         2
Bsmt Half Bath         2
Full Bath              0
Half Bath              0
Bedroom                0
Kitchen                0
Total Rooms            0
Garage Type          157
Garage Yr Blt        159
Garage Cars            1
Garage Area            1
Garage Qual          159
Garage Cond          159
Paved Drive            0
Fence               2378
Date Sold              0
SalePrice             22
dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2952 entries, 0 to 2958
Data columns (total 37 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   PID               2952 non-null   int64  
 1   MS Zoning         2952 non-null   object 
 2   Lot Frontage      2952 non-null   int64  
 3   Lot Area          2952 non-null   int64  
 4   Street            2952 non-null   object 
 5   Alley             201 non-null    object 
 6   Utilities         2952 non-null   object 
 7   Neighborhood      2952 non-null   object 
 8   Bldg Type         2952 non-null   object 
 9   House Style       2952 non-null   object 
 10  Overall Qual      2952 non-null   int64  
 11  Overall Cond      2952 non-null   int64  
 12  Year Built        2952 non-null   int64  
 13  Year Remodeled    2952 non-null   int64  
 14  Exter Qual        2952 non-null   object 
 15  Exter Cond        2952 non-null   object 
 16  Bsmt Unf Sqft     2951 non-null   float64


In [None]:
# We can aggregate these to find how many values are missing
null_sums = df.isna().sum()
null_sums

PID                    0
MS Zoning              0
Lot Frontage           0
Lot Area               0
Street                 0
Alley               2751
Utilities              0
Neighborhood           0
Bldg Type              0
House Style            0
Overall Qual           0
Overall Cond           0
Year Built             0
Year Remodeled         0
Exter Qual             0
Exter Cond             0
Bsmt Unf Sqft          1
Total Bsmnt Sqft       1
Central Air            0
Living Area Sqft       0
Bsmt Full Bath         2
Bsmt Half Bath         2
Full Bath              0
Half Bath              0
Bedroom                0
Kitchen                0
Total Rooms            0
Garage Type          157
Garage Yr Blt        159
Garage Cars            1
Garage Area            1
Garage Qual          159
Garage Cond          159
Paved Drive            0
Fence               2378
Date Sold              0
SalePrice             22
dtype: int64

In [None]:
# Get the percentage of null values per column
null_percent = null_sums/len(df) * 100
null_percent

PID                  0.000000
MS Zoning            0.000000
Lot Frontage         0.000000
Lot Area             0.000000
Street               0.000000
Alley               93.191057
Utilities            0.000000
Neighborhood         0.000000
Bldg Type            0.000000
House Style          0.000000
Overall Qual         0.000000
Overall Cond         0.000000
Year Built           0.000000
Year Remodeled       0.000000
Exter Qual           0.000000
Exter Cond           0.000000
Bsmt Unf Sqft        0.033875
Total Bsmnt Sqft     0.033875
Central Air          0.000000
Living Area Sqft     0.000000
Bsmt Full Bath       0.067751
Bsmt Half Bath       0.067751
Full Bath            0.000000
Half Bath            0.000000
Bedroom              0.000000
Kitchen              0.000000
Total Rooms          0.000000
Garage Type          5.318428
Garage Yr Blt        5.386179
Garage Cars          0.033875
Garage Area          0.033875
Garage Qual          5.386179
Garage Cond          5.386179
Paved Driv

In [None]:
# Filter to see only those with null values
null_percent[null_percent>0]

Alley               93.191057
Bsmt Unf Sqft        0.033875
Total Bsmnt Sqft     0.033875
Bsmt Full Bath       0.067751
Bsmt Half Bath       0.067751
Garage Type          5.318428
Garage Yr Blt        5.386179
Garage Cars          0.033875
Garage Area          0.033875
Garage Qual          5.386179
Garage Cond          5.386179
Fence               80.555556
SalePrice            0.745257
dtype: float64
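A shorter route to the same percentages is to take the mean of the boolean mask, since the mean of `True`/`False` values is the fraction that are `True`. The sketch below is an alternative under stated assumptions, shown on a made-up dataframe rather than the housing data.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): Alley is 3/4 missing, Fence is 2/4 missing
toy = pd.DataFrame({"Alley": ["Pave", np.nan, np.nan, np.nan],
                    "Fence": [np.nan, "MnPrv", np.nan, "GdWo"],
                    "Lot Area": [7200, 19378, 12772, 14803]})

# Mean of the boolean mask = fraction missing; multiply by 100 for a percentage
null_percent = toy.isna().mean() * 100

# Keep only columns with missing values, worst first
only_missing = null_percent[null_percent > 0].sort_values(ascending=False)
print(only_missing)
```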

### Handling Null Values

We apply different tactics for addressing null values, depending on whether the column is categorical or numeric.

We also use different tools/tactics when dealing with null values for Data Understanding/inspection vs. when we prepare data for a machine learning model.

The approaches demonstrated in this lesson should be used for data exploration only. We will return to this point at the beginning of the Intro to Machine Learning course.

Pandas has convenient methods for addressing null values. Throughout this lesson, we will demonstrate each of these methods:

1. Drop rows with null values.
2. Fill in with a placeholder value.
3. Impute with a central value.
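The three methods listed above can be sketched on a small made-up dataframe. The values, the placeholder string `"MISSING"`, and the choice of the median as the central value are all assumptions for illustration. As noted earlier, this is for exploration only.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up)
toy = pd.DataFrame({
    "Fence": ["MnPrv", np.nan, "GdWo", np.nan],         # categorical
    "Garage Area": [297.0, np.nan, 301.0, 576.0],       # numeric
    "SalePrice": [119900.0, 320000.0, np.nan, 385000.0] # target
})

# 1. Drop rows with a null in a specific column
dropped = toy.dropna(subset=["SalePrice"])

# 2. Fill a categorical column with a placeholder value
placeholder = toy["Fence"].fillna("MISSING")

# 3. Impute a numeric column with a central value (here, the median)
median_area = toy["Garage Area"].median()
imputed = toy["Garage Area"].fillna(median_area)

print(len(dropped), placeholder.tolist(), imputed.tolist())
```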


## Dropping Rows

Ideally, we want to keep as much of our data as possible, but that isn't always feasible. One major exception is rows that have null values in the target column we are trying to explain/predict.

When we previously checked for duplicates, we found rows with duplicate PIDs, and one of each pair of duplicate rows was missing the `SalePrice`. The code below is repeated from the end of the section where we first addressed duplicate rows.

In [None]:
# Revisiting the duplicate rows with null values from duplicates lesson
df[duplicated_pids].sort_values("PID")


The entries at index 2654 and 1650 have the same PID number. Scrolling all the way to the far right of the dataframe, we see that the only difference is that one entry has a value for `SalePrice` while the other has a NaN.  
If you continue to inspect the rows with duplicate PIDs, you will find a similar result for all 22 cases.
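Rather than inspecting every pair by eye, the same claim can be checked in code. The sketch below is one possible way to do it, shown on made-up data (the frame `toy` and its values are assumptions). It counts null prices within each repeated PID and then confirms that dropping null prices leaves no repeated PIDs.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): PIDs 11 and 22 each appear twice, one copy missing its price
toy = pd.DataFrame({"PID": [11, 11, 22, 22, 33],
                    "SalePrice": [345000.0, np.nan, np.nan, 168500.0, 82500.0]})

# Flag every row whose PID appears more than once
dup_mask = toy.duplicated(subset=["PID"], keep=False)

# Count missing prices within each repeated PID
nulls_per_pid = toy[dup_mask].groupby("PID")["SalePrice"].apply(lambda s: s.isna().sum())
one_null_each = (nulls_per_pid == 1).all()

# Dropping rows with a null price should then remove every repeated PID
cleaned = toy.dropna(subset=["SalePrice"])
remaining_dupes = cleaned.duplicated(subset=["PID"], keep=False).sum()

print(one_null_each, remaining_dupes)
```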

### Dropping NaN Values

Now we will drop any of the rows that have NaN in the SalePrice column.

In [None]:
# Dropping Null values from SalePrice
df = df.dropna(subset=["SalePrice"])

In [None]:
# Confirming no more null sale prices
df['SalePrice'].isna().sum()

In [None]:
# Confirming no more duplicated PIDs
df.duplicated(subset=['PID'], keep=False).sum()

## Visualizing Null Values with missingno
Sometimes missing values are just random omissions, but in some cases, there is a pattern related to missing values. To help us identify any patterns, we can create a visual of missing data.

missingno is a visualization package designed to show null values (missing numbers).

We can create a visual representation of our dataframe, where null values are represented graphically. While there are several plotting functions available, we will use the matrix plot.

We must first import the new library, and then we can run the matrix function on the dataframe:

In [None]:
import missingno as msno
msno.matrix(df);

In [None]:
# Save a boolean filter that is True where Garage Type is null
null_garage_type = df['Garage Type'].isna()
null_garage_type



0       False
1       False
2       False
3       False
4       False
        ...  
2954    False
2955    False
2956    False
2957    False
2958    False
Name: Garage Type, Length: 2952, dtype: bool

In [None]:
# Use the filter
df[null_garage_type]
# df[~null_garage_type] # Inverse with ~

Unnamed: 0,PID,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Utilities,Neighborhood,Bldg Type,House Style,...,Garage Type,Garage Yr Blt,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Fence,Date Sold,SalePrice
20,903458110,RM,-1,7920,Pave,Grvl,AllPub,IDOTRR,1Fam,1.5Fin,...,,,0.0,0.0,,,N,MnPrv,03-2008,89500.0
21,908225310,RL,52,8741,Pave,,AllPub,Edwards,Duplex,1Story,...,,,0.0,0.0,,,N,GdWo,07-2007,98500.0
28,904301060,RL,60,10800,Pave,,AllPub,Edwards,Duplex,1Story,...,,,0.0,0.0,,,Y,,03-2009,179000.0
79,527401160,RL,60,9000,Pave,,AllPub,NAmes,Duplex,2Story,...,,,0.0,0.0,,,Y,,09-2009,136000.0
85,904351040,C (all),-1,6449,Pave,,AllPub,SWISU,1Fam,2Story,...,,,0.0,0.0,,,N,,03-2010,93369.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2917,903225050,RM,50,6130,Pave,,AllPub,BrkSide,1Fam,1.5Unf,...,,,0.0,0.0,,,Y,,05-2008,109500.0
2919,923275140,RL,-1,8780,Pave,,AllPub,Mitchel,1Fam,1Story,...,,,0.0,0.0,,,Y,MnPrv,03-2009,112000.0
2936,904301375,RL,-1,10020,Pave,,AllPub,Edwards,1Fam,1Story,...,,,0.0,0.0,,,Y,,03-2009,61000.0
2941,534276360,RL,77,9320,Pave,,AllPub,NAmes,1Fam,1Story,...,,,0.0,0.0,,,Y,,01-2010,128950.0
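In the rows above, the other garage columns appear to be missing wherever `Garage Type` is missing, and `Garage Area` is 0. A boolean filter makes it possible to test a pattern like this instead of eyeballing it. The sketch below uses a made-up dataframe (values are assumptions for illustration) and the same filter-and-invert technique.

```python
import numpy as np
import pandas as pd

# Toy data (values are made up): two homes without a garage
toy = pd.DataFrame({
    "Garage Type": ["Detchd", np.nan, "Attchd", np.nan],
    "Garage Qual": ["TA", np.nan, "TA", np.nan],
    "Garage Area": [297.0, 0.0, 576.0, 0.0],
})

no_type = toy["Garage Type"].isna()   # True where Garage Type is null
has_type = ~no_type                   # ~ inverts the filter

# Is Garage Qual always missing when Garage Type is missing?
qual_missing_too = toy.loc[no_type, "Garage Qual"].isna().all()

# Which Garage Area values occur when Garage Type is missing?
areas_without_type = toy.loc[no_type, "Garage Area"].unique().tolist()

print(qual_missing_too, areas_without_type, has_type.sum())
```

If a check like this holds on the real data, the nulls are likely not random omissions: they would simply mean the home has no garage.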


In [None]:
# Summary statistics for numeric columns (25th percentile plus the median, which is always shown)
df.describe(percentiles=[0.25])
# df["Overall Qual"].describe()

Unnamed: 0,PID,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remodeled,Bsmt Unf Sqft,Total Bsmnt Sqft,Living Area Sqft,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Bedroom,Kitchen,Total Rooms,Garage Yr Blt,Garage Cars,Garage Area,SalePrice
count,2952.0,2952.0,2952.0,2952.0,2952.0,2952.0,2952.0,2951.0,2951.0,2952.0,2950.0,2950.0,2952.0,2952.0,2952.0,2952.0,2793.0,2951.0,2951.0,2930.0
mean,714496700.0,57.525407,10140.436653,6.098238,5.564024,1971.350949,1984.256098,559.605896,1051.338529,1500.175136,0.431186,0.061017,1.566734,2.854336,1.044038,6.444783,1978.128178,1.767875,473.148424,181439.4
std,188728500.0,33.760263,7856.963967,1.410075,1.110911,30.273566,20.883391,438.745315,440.167785,505.528427,0.524584,0.245002,0.552513,0.826814,0.213311,1.574427,25.550894,0.760017,214.968334,86659.68
min,526301100.0,-1.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,2.0,1895.0,0.0,0.0,12789.0
25%,528477000.0,43.0,7433.25,5.0,5.0,1954.0,1965.0,220.0,793.0,1126.0,0.0,0.0,1.0,2.0,1.0,5.0,1960.0,1.0,321.0,129500.0
50%,535453600.0,63.0,9429.0,6.0,5.0,1973.0,1993.0,466.0,990.0,1442.0,0.0,0.0,2.0,3.0,1.0,6.0,1979.0,2.0,480.0,160000.0
max,1007100000.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,2336.0,6110.0,5642.0,3.0,2.0,4.0,8.0,3.0,15.0,2207.0,5.0,1488.0,2000000.0
