# How to deal with missing data 

In [None]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import warnings
warnings.filterwarnings('ignore')

Load the data

In [None]:
import pandas as pd
df = pd.read_csv("winemag-data-130k.csv", index_col=0)

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
pd.set_option('display.max_rows', 20)

In [None]:
# check the length of the dataset
???(df)

In [None]:
# check the shape of the dataset
df.???

In [None]:
df.head()

## Missing data

Entries with missing values are given the value **NaN**, short for "Not a Number". For technical reasons NaN is itself a floating-point value, so a numeric column that contains NaN is typically stored with the float64 dtype.

*pandas* provides some methods specific to missing data. To select NaN entries you can use **isna()** (or its alias **isnull()**), or the complementary **notna()** (alias **notnull()**) to select the entries that are present.

<b>isna()</b>: Return a boolean same-sized object indicating if the values are NA

In [None]:
# check whether the value is missing
df.???()
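As a small illustration of the points above, here is a sketch on a toy Series (the values are made up and are not from the wine dataset). It shows the float64 dtype that results from a NaN, and why `isna()` is needed instead of an equality test.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): integer-looking values with one gap.
s = pd.Series([1, 2, np.nan])

# The NaN forces the whole column to a floating-point dtype.
dtype_name = str(s.dtype)

# NaN never compares equal to anything, not even itself,
# so an equality test cannot find missing entries.
equal_to_nan = (s == np.nan).tolist()

# isna() and notna() are the reliable checks.
missing = s.isna().tolist()
present = s.notna().tolist()

print(dtype_name, equal_to_nan, missing, present)
```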

Use **.any()** to return whether any element is True over the requested axis (column-wise by default).

In [None]:
df.isna().???()

In [None]:
# apply any() row-wise by changing axis to 1
df.isna().any(axis=???)

Use **.sum()** to count the NaN values along the requested axis (each True counts as 1).

In [None]:
# check the number of missing values in each column
df.isna().???()

Use **.sum().sum()** to get the total number of NaN values in the DataFrame.

In [None]:
# check the total number of missing values
df.isna().sum().???()
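The chain of `isna()`, `any()` and `sum()` can be sketched on a tiny frame where the answers are easy to check by eye. The column names mirror the wine dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 1 NaN in country, 2 in price, 0 in points.
toy = pd.DataFrame({
    "country": ["Italy", np.nan, "US"],
    "price": [15.0, np.nan, np.nan],
    "points": [87, 90, 88],
})

mask = toy.isna()                   # same shape as toy, True where a value is missing

has_nan_by_column = mask.any()      # one boolean per column
has_nan_by_row = mask.any(axis=1)   # one boolean per row
nan_per_column = mask.sum()         # True counts as 1, so this counts the NaN values
total_nan = int(mask.sum().sum())   # sum of the per-column counts

print(nan_per_column)
print(total_nan)
```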

## How to include NaN in .groupby?

Count the reviews for each *taster_twitter_handle*, including the NaN values, and sort the counts in ascending order.

In [None]:
# check the number of NaN values in the taster_twitter_handle column
df.taster_twitter_handle.???().???()

In [None]:
# check the unique values in the taster_twitter_handle column
df.taster_twitter_handle.???()

In [None]:
# check the number of unique values in the taster_twitter_handle column using nunique()
df.taster_twitter_handle.???()

In [None]:
# we can set dropna to False in nunique() so that NaN is counted as a unique value too
df.taster_twitter_handle.nunique(dropna=???)

In [None]:
# by default, groupby does not include a NaN group
df.???('taster_twitter_handle')['taster_twitter_handle'].count().sort_values()

NaN groups are automatically excluded by GroupBy. If you need to keep NaN as a group, one option is to convert the values with .astype(str) first, so that the missing entries become ordinary strings.

In [None]:
# use astype to turn np.nan into the string 'nan', so that 'nan' can form a group in groupby
df.astype(???).groupby('taster_twitter_handle')['taster_twitter_handle'].count().sort_values()
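As an alternative to the string conversion, pandas 1.1 and later accept a `dropna=False` argument in `groupby`. The sketch below uses a made-up column of handles. It calls `size()` rather than `count()`, because `count()` on the grouped column itself skips NaN and would report 0 for the NaN group.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical handles): two rows have no handle.
toy = pd.DataFrame({"handle": ["@a", np.nan, "@b", np.nan, "@a"]})

# Default behaviour: rows whose key is NaN are left out of every group.
default_sizes = toy.groupby("handle").size()

# dropna=False keeps NaN as its own group (requires pandas >= 1.1).
with_nan = toy.groupby("handle", dropna=False).size().sort_values()

# Size of the NaN group, selected through the index.
nan_group_size = int(with_nan[with_nan.index.isna()].iloc[0])

print(default_sizes)
print(with_nan)
```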

## How to deal with NaN?

## fillna()

Replacing missing values is a common operation. *pandas* provides a really handy method for this problem: **fillna()**, which supports a few different strategies for filling the gaps.

### Example 1, replace each NaN in region_1 with "Unknown":

Replace the NaN values in region_1 with "Unknown".

In [None]:
# replace np.nan with "Unknown" and make the change permanent
df.region_1.fillna(value=???,inplace=???)

In [None]:
df.region_1

### Example 2, replace the NaN in 'price' with price's average:

In [None]:
# get the average price
price_avg = df.price.???()
price_avg

In [None]:
# replace nan with price_avg
df.price.fillna(value=???, inplace=???)

In [None]:
# confirm that no NaN remains in 'price'
df.price.???().???()
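Both fill strategies can also be written by assigning the result back to the column, which avoids `inplace=True`. This is a sketch on invented values. The assumption behind preferring assignment is that newer pandas versions with Copy-on-Write may not propagate an in-place change made on a column selected as `df.column` back to the parent DataFrame.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap in each column.
toy = pd.DataFrame({
    "region_1": ["Etna", np.nan, "Napa Valley"],
    "price": [10.0, np.nan, 20.0],
})

# Constant fill: every NaN becomes the same label.
toy["region_1"] = toy["region_1"].fillna("Unknown")

# Statistic fill: mean() skips NaN, so the average here is (10 + 20) / 2.
price_avg = toy["price"].mean()
toy["price"] = toy["price"].fillna(price_avg)

print(toy)
```

Filling with the mean leaves the column average unchanged but reduces its variance, so whether it is appropriate depends on the analysis that follows.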

## dropna()

In [None]:
# make a copy so that the original df is not affected
df2=df.???()

In [None]:
# check the shape of the table before we drop anything
df.shape

In [None]:
# drop every row of df2 that contains at least one NaN
df2 = df2.???()

In [None]:
# check the shape of the table after we drop all NaN
df2.shape

In [None]:
# show that there is no NaN left in any column
df2.isna().???()

The operation above dropped about 83% of the rows, which is not a good idea.

### Example 3, drop the rows where country is NaN:

In [None]:
df.shape

In [None]:
# only a very small proportion of the values in the country column is missing
df.country.isna().sum()

Let's find the first row where country is NaN.

In [None]:
# get the rows where the country value is missing
df[df.country.isna()].head(1)

In [None]:
# use subset to specify which columns are checked for NaN
df.dropna(how='any',???=['country'])[912:915]

In [None]:
# set the inplace parameter to make the result permanent
df.dropna(how='any',subset=['country'],inplace=???)

In [None]:
# the number of rows decreases from 129971 to 129908, which is acceptable.
df.shape

In [None]:
# check the number of missing values in the country column
df.country.isna().???()
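The difference between dropping on all columns and dropping on a subset can be sketched on a toy frame (hypothetical values). It also shows that `dropna` keeps the original index labels, which is why slicing around a removed row reveals the gap.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): region_2 is mostly empty, country has one gap.
toy = pd.DataFrame({
    "country": ["Italy", np.nan, "US", "France"],
    "region_2": [np.nan, np.nan, "Napa", np.nan],
})

# No subset: a row is dropped if ANY column is NaN, so only the US row survives.
all_columns = toy.dropna(how="any")

# With subset: only the country column is checked, so three rows survive.
country_only = toy.dropna(how="any", subset=["country"])

# The index labels are not renumbered; label 1 is simply gone.
print(all_columns.index.tolist())
print(country_only.index.tolist())
```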

### Example 4, drop based on a threshold (number of non-NaN values)

Drop the **column(s)** with more than 50% NaN, that is, *require at least len(df)/2 ≈ 65,000 non-NaN values*.

In [None]:
df.shape

In [None]:
df.isna().sum()

In [None]:
# how many rows make up 50% of the table?
thresh_50Percent = ???(df)/2

In [None]:
# Since we want to drop columns, we need to overwrite axis to 1
df.dropna(thresh=thresh_50Percent, axis=1, inplace=True)

In [None]:
df.shape

In [None]:
df.isna().sum()

The operation above dropped the region_2 column.

In [None]:
df.head()
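The threshold rule can be checked on a small frame. This is a sketch with invented values. It uses integer division for the threshold on the assumption that `thresh` is documented as an integer, and it shows that `thresh` counts the non-NaN values a column must have in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): variety has 3 of 4 values, region_2 only 1 of 4.
toy = pd.DataFrame({
    "variety": ["Riesling", "Pinot Noir", np.nan, "Malbec"],
    "region_2": [np.nan, np.nan, "Sonoma", np.nan],
})

# Require at least half of the rows to be non-NaN.
min_non_nan = len(toy) // 2

# axis=1 applies the rule to columns instead of rows.
kept = toy.dropna(thresh=min_non_nan, axis=1)

print(toy.notna().sum())      # non-NaN count per column
print(kept.columns.tolist())  # columns that met the threshold
```

Because the rule is "at least `thresh` non-NaN values", a column that is exactly half empty is kept.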

## backfill/ffill

Alternatively, we can fill each NaN with the nearest non-NaN value in the same column: backfill uses the first valid value that appears after the given record, and ffill (forward fill) uses the last valid value that appears before it.

Fill the NaN in 'taster_name' with the first non-null value that appears after the given record.

In [None]:
df.taster_name.isna().sum()

In [None]:
df[df.taster_name.isna()].head()

In [None]:
df.taster_name.iloc[30:36]

### method='backfill'

In [None]:
# bfill() is the current spelling of fillna(method='backfill'), which newer pandas deprecates
df.taster_name.bfill().iloc[30:36]

### method='ffill'

In [None]:
# ffill() is the current spelling of fillna(method='ffill'), which newer pandas deprecates
df.taster_name.ffill().iloc[30:36]

In [None]:
# assign the backfilled column back to df so the change is permanent
df['taster_name'] = df.taster_name.bfill()

In [None]:
df.taster_name.isna().any()

In [None]:
df.isna().sum()
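The two directions are easy to compare on a short Series. This is a sketch with made-up names. It also shows an edge case worth checking after a backfill: a NaN at the very end has no later value to copy, so it stays NaN.

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical names) with gaps in the middle and at the end.
s = pd.Series(["Anna", np.nan, np.nan, "Ben", np.nan])

# Forward fill: each gap takes the last valid value before it.
forward = s.ffill()

# Backward fill: each gap takes the first valid value after it.
# The trailing NaN has nothing after it and is left unfilled.
backward = s.bfill()

print(forward.tolist())
print(backward.tolist())
```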

## Problems:

We want to clean up the rest of this dataset based on the following guidelines:

1, change all the NaN in 'taster_twitter_handle' to "@anonymous".

2, change all the NaN in 'designation' to 'Unknown'.

3, drop the rows where 'variety' is NaN.

4, since this dataset was published, reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. Replace the old handle with the new one.

In [None]:
df.isna().sum()

In [None]:
# change all the NaN in 'designation' to 'Unknown'.
df.???.fillna(value=???,inplace=True)

Verify that the 'Unknown' count equals the previous NaN count for 'designation'.

In [None]:
(df.designation=='Unknown').sum()

In [None]:
# change all the NaN in 'taster_twitter_handle' to "@anonymous". 
df.???.fillna(value=???,inplace=True)

Verify that the '@anonymous' count equals the previous NaN count for 'taster_twitter_handle'.

In [None]:
(df.taster_twitter_handle=='@anonymous').sum()

In [None]:
# 3, drop the rows where 'variety' is NaN
df.dropna(how='any',subset=[???],inplace=True)

Verify that no NaN remains in 'variety'.

In [None]:
df.variety.isna().any()

In [None]:
# 4, since this dataset was published, 
# reviewer Kerin O'Keefe has changed her Twitter handle from @kerinokeefe to @kerino. 
df.taster_twitter_handle.replace(to_replace=???,value=???,inplace=True)
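For reference, the general shape of `replace` on a Series can be sketched with placeholder values. The handles below are hypothetical and deliberately not the ones from the exercise. Only exact matches of `to_replace` are changed, and the result is a new Series unless it is assigned back.

```python
import pandas as pd

# Toy Series with hypothetical handles.
handles = pd.Series(["@old_handle", "@someone", "@old_handle"])

# Every value equal to "@old_handle" is swapped; other values are untouched.
updated = handles.replace(to_replace="@old_handle", value="@new_handle")

print(updated.tolist())
```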

In [None]:
df.isna().sum()

Final DataFrame shape

In [None]:
df.shape

## Note: How to read a Microsoft Excel file

Find the top 3 correlations based on all the data in this Excel file.

In [None]:
xl = pd.ExcelFile('Cancer_Cardio.xlsx')

In [None]:
type(xl)
xl

In [None]:
xl.sheet_names

In [None]:
df1 = xl.parse("Cancer")
df1

In [None]:
df2 = xl.parse("Cardio")
df2

In [None]:
df3 = xl.parse("Smoking")
df3

In [None]:
df = df1.merge(df2, on='Geocode')

In [None]:
df

In [None]:
df = df.merge(df3, on='Geocode')

In [None]:
df

Now we can do some analysis.

In [None]:
df.groupby('city')['cancer'].sum().sort_values(ascending=False)

In [None]:
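# numeric_only=True skips text columns such as city, which pandas 2.0+ no longer drops automatically
df.corr(numeric_only=True)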

In [None]:
cor = df.corr(numeric_only=True)

In [None]:
# mask the diagonal (self-correlation of 1) and flatten to a Series indexed by column pairs
cor[cor<1].stack()

In [None]:
# each pair appears twice, as (a, b) and (b, a), so take the top 6 and keep every other entry
cor[cor<1].stack().nlargest(6)[::2]
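The `[::2]` step above relies on the two copies of each pair landing next to each other. A variation that does not depend on that ordering is to keep only the upper triangle of the correlation matrix, so each pair appears once. The sketch below uses a made-up numeric frame, not the Excel data.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (hypothetical values): b rises with a, c mostly falls.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

cor = toy.corr()

# Boolean mask that is True strictly above the diagonal.
upper_mask = np.triu(np.ones(cor.shape, dtype=bool), k=1)

# Everything outside the mask becomes NaN; dropna() removes it after stacking.
pairs = cor.where(upper_mask).stack().dropna()

# Each pair now appears exactly once, so nlargest needs no further thinning.
top_pairs = pairs.nlargest(2)

print(top_pairs)
```

This ranks by signed correlation, as the original cell does. To rank by strength regardless of sign, one could apply `abs()` to `pairs` before `nlargest`.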