# How to deal with missing data 

In [1]:
%autosave 0

Autosave disabled


In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Load the data

In [3]:
import pandas as pd
df = pd.read_csv("winemag-data-130k.csv", index_col=0)

In [4]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 20)

In [5]:
df.shape

(129971, 13)

In [6]:
df.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


## Missing data

Entries with missing values are given the value **NaN**, short for "Not a Number". For technical reasons, NaN is itself a floating-point value, so by default a numeric column that contains NaN is stored with the float64 dtype.

*pandas* provides some methods specific to missing data. To select NaN entries, use **isna()** (or its alias **isnull()**); to select non-missing entries, use the companion **notna()** (or **notnull()**).

**isna()**: returns a boolean object of the same size, indicating which values are NA.

Use **.any()** to check whether any element is True along the requested axis.

Use **.sum()** to count the NaN values along the requested axis.

Use **.sum().sum()** to get the total number of NaN values in the DataFrame.
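The methods above can be tried on a small frame. This is a minimal sketch: the `sample` DataFrame and its values are made up for illustration and only mimic a few columns of the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few columns of the wine-review data
sample = pd.DataFrame({
    "country": ["Italy", None, "US", "US"],
    "price": [np.nan, 15.0, 14.0, np.nan],
    "points": [87, 87, 87, 86],
})

mask = sample.isna()                          # boolean frame, same shape as sample
has_nan = sample.isna().any()                 # per column: is there any NaN?
nan_per_column = sample.isna().sum()          # per column: how many NaN?
nan_total = int(sample.isna().sum().sum())    # whole frame: how many NaN?

# A boolean mask can also be used to select the affected rows
rows_missing_price = sample[sample["price"].isna()]
```

Passing `axis=1` to `.any()` or `.sum()` gives the same information per row instead of per column.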

## How to include NaN in .groupby()?

Including the NaN values, find all the *taster_twitter_handle* values and sort them in ascending order.

GroupBy automatically excludes NaN groups. If you need to keep NaN as a group, convert the column with .astype(str) first.
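As an alternative to the `.astype(str)` approach, pandas 1.1 and later accept `dropna=False` in `groupby`, which keeps NaN as a group key without changing the column. The sketch below uses that option on a made-up `sample` frame; it is a swapped-in technique, not the conversion the text describes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two reviews have no Twitter handle
sample = pd.DataFrame({
    "taster_twitter_handle": ["@vossroger", np.nan, "@paulgwine", np.nan, "@vossroger"],
    "points": [87, 85, 90, 88, 91],
})

# Default behaviour: rows whose key is NaN are left out of the result
default_counts = sample.groupby("taster_twitter_handle").size()

# dropna=False (pandas >= 1.1) keeps NaN as its own group;
# sort_index() orders the handles ascending and places NaN last
with_nan_counts = (
    sample.groupby("taster_twitter_handle", dropna=False)
    .size()
    .sort_index()
)
```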

## How to deal with NaN?

## fillna()

Replacing missing values is a common operation. *pandas* provides a handy method for this: **fillna()**, which supports several strategies, such as filling with a constant, with a statistic of the column, or with a neighbouring value.

### Example 1: replace each NaN in region_1 with "Unknown"

Replace every NaN in region_1 with "Unknown".
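A minimal sketch of this replacement, using a small made-up `sample` frame in place of the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical region_1 column with two missing entries
sample = pd.DataFrame({
    "region_1": ["Etna", np.nan, "Willamette Valley", np.nan],
})

# fillna returns a new Series; assign it back to keep the change
sample["region_1"] = sample["region_1"].fillna("Unknown")
```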

### Example 2: replace each NaN in 'price' with the average price
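One way to do this is to compute the column mean (which skips NaN by default) and pass it to `fillna()`. The prices below are illustrative values, not the full dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical price column with one missing value
sample = pd.DataFrame({"price": [np.nan, 15.0, 14.0, 13.0, 65.0]})

# mean() ignores NaN by default, so this is the mean of the four known prices
mean_price = sample["price"].mean()

sample["price"] = sample["price"].fillna(mean_price)
```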

## dropna()

Dropping every row that contains at least one NaN (the default behaviour of dropna()) removes 83% of the data here, which is not a good idea.
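The effect is easy to reproduce on a small frame. In this sketch the `sample` data is made up so that only one of four rows is complete; the 83% figure belongs to the real dataset, not to this example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only the third row has no missing value
sample = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", np.nan],
    "price": [np.nan, 15.0, 14.0, 13.0],
    "region_2": [np.nan, np.nan, "Willamette Valley", np.nan],
})

# With no arguments, dropna() removes every row containing at least one NaN
complete_rows = sample.dropna()

fraction_dropped = 1 - len(complete_rows) / len(sample)
```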

### Example 3: drop the rows where country is NaN
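Restricting `dropna()` to one column with the `subset` argument keeps rows whose other columns are incomplete. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row lacks a country, another lacks a price
sample = pd.DataFrame({
    "country": ["Italy", np.nan, "US"],
    "price": [np.nan, 15.0, 14.0],
})

# Only NaN in the listed column(s) causes a row to be dropped
with_country = sample.dropna(subset=["country"])
```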

### Example 4: drop based on a threshold (number of non-NaN values)

Drop any column with more than 50% NaN values (*requires at least 65,000 non-NaN values*).

As a result, region_2 is dropped.
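The `thresh` argument of `dropna()` sets the minimum number of non-NaN values a row or column needs in order to be kept. The sketch below applies it column-wise to a made-up four-row frame, so the threshold is 2 rather than 65,000.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: region_2 is mostly missing
sample = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US"],
    "region_1": ["Etna", np.nan, "Willamette Valley", "Lake Michigan Shore"],
    "region_2": [np.nan, np.nan, "Willamette Valley", np.nan],
})

# Keep a column only if at least half of its values are present
min_non_nan = len(sample) // 2

# axis=1 drops columns instead of rows
trimmed = sample.dropna(axis=1, thresh=min_non_nan)
```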

## backfill/ffill 

Alternatively, we can fill each NaN with the first non-NaN value that appears after (backfill) or before (ffill) the given record in the dataset. This is known as the backfill/ffill strategy:

Fill the NaN in 'taster_name' with the first non-null value that appears after the given record.

### method='backfill'
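A minimal sketch of backfilling on made-up data. It uses the `.bfill()` method, which does the same thing as `fillna(method='backfill')`; recent pandas releases deprecate the `method` argument of `fillna()` in favour of this form.

```python
import numpy as np
import pandas as pd

# Hypothetical taster_name column with gaps
sample = pd.DataFrame({
    "taster_name": [np.nan, "Roger Voss", np.nan, "Paul Gregutt", np.nan],
})

# Each NaN takes the next non-NaN value below it
backfilled = sample["taster_name"].bfill()
```

A trailing NaN stays NaN, because there is no later value to copy from.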

### method='ffill'
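A forward fill copies values downwards instead. This sketch uses `.ffill()`, the method form of `fillna(method='ffill')`, on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical taster_name column with gaps
sample = pd.DataFrame({
    "taster_name": [np.nan, "Roger Voss", np.nan, "Paul Gregutt", np.nan],
})

# Each NaN takes the most recent non-NaN value above it
forward_filled = sample["taster_name"].ffill()
```

Here the leading NaN stays NaN, because there is no earlier value to copy from.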

## Problems:

We want to clean up the rest of this data set based on the following guidelines:

1. Change every NaN in 'taster_twitter_handle' to "@anonymous".

2. Change every NaN in 'designation' to 'Unknown'.

3. Drop the rows where 'variety' is NaN.

4. Since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. Update the handle accordingly.
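One possible way to apply these guidelines is sketched below on a small made-up `sample` frame; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case in the guidelines
sample = pd.DataFrame({
    "designation": ["Vulkà Bianco", np.nan, "Reserve", np.nan],
    "taster_twitter_handle": ["@kerinokeefe", "@vossroger", np.nan, "@paulgwine"],
    "variety": ["White Blend", "Portuguese Red", "Riesling", np.nan],
})

# Guidelines 1 and 2: a dict maps each column to its own fill value
cleaned = sample.fillna({
    "taster_twitter_handle": "@anonymous",
    "designation": "Unknown",
})

# Guideline 3: drop rows where variety is missing
cleaned = cleaned.dropna(subset=["variety"])

# Guideline 4: swap the old handle for the new one
cleaned = cleaned.assign(
    taster_twitter_handle=cleaned["taster_twitter_handle"].replace(
        "@kerinokeefe", "@kerino"
    )
)
```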

Verify that the 'Unknown' count equals the previous NaN count in 'designation'.

Verify that the '@anonymous' count equals the previous NaN count in 'taster_twitter_handle'.

Verify that there is no NaN left in 'variety'.

Check the final DataFrame shape.
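These checks can be written as assertions. The sketch below assumes the NaN counts are recorded before filling and compared before any rows are dropped, and that neither placeholder string already occurs in the data; the `sample` frame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in all three columns
sample = pd.DataFrame({
    "designation": [np.nan, "Avidagos", np.nan, "Reserve"],
    "taster_twitter_handle": ["@vossroger", np.nan, "@paulgwine", "@paulgwine"],
    "variety": ["Riesling", "Pinot Gris", np.nan, "Pinot Noir"],
})

# Record the NaN counts before changing anything
nan_designation = int(sample["designation"].isna().sum())
nan_handle = int(sample["taster_twitter_handle"].isna().sum())

filled = sample.fillna({
    "designation": "Unknown",
    "taster_twitter_handle": "@anonymous",
})

# Compare before dropping rows, otherwise the counts may no longer match
assert int((filled["designation"] == "Unknown").sum()) == nan_designation
assert int((filled["taster_twitter_handle"] == "@anonymous").sum()) == nan_handle

cleaned = filled.dropna(subset=["variety"])
assert cleaned["variety"].notna().all()

final_shape = cleaned.shape
```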

## Note: How to read a Microsoft Excel file

Find the top 3 correlations based on all the data in this Excel file.

In [7]:
# Open the workbook; individual sheets can then be parsed from xl
xl = pd.ExcelFile('Cancer_Cardio.xlsx')
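The sketch below shows one way to rank pairwise correlations once a sheet has been loaded. The contents of the workbook are not shown here, so the `data` frame and its column names are made-up stand-ins; with the real file it could come from something like `xl.parse(xl.sheet_names[0])`. Ranking by absolute value is an assumption about what "top" means.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data standing in for a parsed sheet
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 6.2, 7.9, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
    "d": [1.0, 0.0, 1.0, 0.0, 1.0],
})

# Pairwise Pearson correlation between the numeric columns
corr = data.select_dtypes("number").corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is excluded
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack().dropna()

# Rank pairs by the strength of the correlation, ignoring its sign
top3 = pairs.abs().sort_values(ascending=False).head(3)
```

`top3` is a Series indexed by column pairs; look the same pairs up in `pairs` to recover the sign of each correlation.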