## missingno (requires Python 3)

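Without high-quality data there can be no high-quality data mining results. When working with supervised learning algorithms, we inevitably run into messy datasets with missing values. When the proportion of missing values is small, the affected records can simply be dropped or handled by hand. missingno provides a small, flexible, easy-to-use set of data visualizations and utilities that lets you assess missing data quickly through plots instead of trudging through the data table. You can sort or filter the data by completeness, or use the heatmap or dendrogram to decide how to repair it.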

missingno is a module built on top of matplotlib, so it renders plots quickly and works flexibly with pandas data.
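
Sorting and filtering by completeness, as mentioned above, can be approximated in plain pandas. The sketch below uses a small hypothetical DataFrame (`df`, with made-up columns `age`, `income` and `is_employed`). It is not how missingno implements these operations, only an illustration of the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 40, np.nan, 31],
    "income": [50000, np.nan, np.nan, 42000],
    "is_employed": [True, np.nan, np.nan, np.nan],
})

# Fraction of non-missing values per column (1.0 means fully complete).
completeness = df.notna().mean()

# Keep only the columns that are at least 50% complete.
filtered = df.loc[:, completeness >= 0.5]

# Reorder rows so that the most complete records come first.
row_counts = df.notna().sum(axis=1)
sorted_rows = df.loc[row_counts.sort_values(ascending=False).index]

print(completeness)
print(filtered.columns.tolist())
print(sorted_rows)
```

The 50% threshold is arbitrary; in practice it would depend on how much missingness the downstream model can tolerate.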

### - Installation

* Method 1:

    pip install missingno 
    

* Method 2:

    Download the package from PyPI and install it (https://pypi.python.org/pypi/missingno/)

### - Usage

In [None]:
import missingno as msno
import pandas as pd
from ggplot import * 

In [None]:
custdata = pd.read_csv('data/custdata.tsv',sep='\t')

In [None]:
custdata.head()

In [None]:
msno.matrix(custdata)

In [None]:
custdata['is.employed'].value_counts(dropna=False)

In [None]:
custdata['housing.type'].value_counts(dropna=False)

In [None]:
custdata['recent.move'].value_counts(dropna=False)

In [None]:
custdata['num.vehicles'].value_counts(dropna=False)
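
The `dropna=False` argument in the cells above matters because `value_counts` leaves missing values out of the tally by default. A small illustration on a hypothetical Series (the values are made up and do not come from `custdata`):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
s = pd.Series(["Rented", "Owned", np.nan, "Rented", np.nan])

# Default behaviour: NaN is silently excluded from the counts.
without_nan = s.value_counts()

# With dropna=False, NaN gets its own row, so the missing count is visible.
with_nan = s.value_counts(dropna=False)

print(without_nan)
print(with_nan)
```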

In [None]:
msno.bar(custdata)

In [None]:
custdata.age.describe()

In [None]:
custdata.income.describe()

In [None]:
msno.heatmap(custdata)
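
The heatmap is usually read as a nullity correlation plot: it indicates how strongly the absence of one column goes together with the absence of another. A rough pandas equivalent is sketched below on a hypothetical toy DataFrame, under the assumption that correlating the boolean missingness mask is close to what the plot displays.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: `a` and `b` are always missing together,
# while `c` is missing exactly where `a` is present.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [10.0, np.nan, 30.0, np.nan],
    "c": [np.nan, 2.0, np.nan, 4.0],
})

# 1 where a cell is missing, 0 where it is present.
missing_mask = df.isna().astype(int)

# Pairwise Pearson correlation of the missingness indicators.
nullity_corr = missing_mask.corr()

print(nullity_corr)
```

A value near 1 suggests two columns tend to be missing in the same rows, and a value near -1 suggests that one tends to be present exactly when the other is missing.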

In [None]:
msno.dendrogram(custdata)

In [None]:
ggplot(aes(x = 'age'), custdata) + geom_histogram(binwidth=5, fill="gray")

In [None]:
ggplot(aes(x='age'), data=custdata) + geom_density()

In [None]:
ggplot(aes(x='income'), custdata) + geom_density()

In [None]:
ggplot(aes(x='marital.stat'), custdata) + geom_bar(fill="gray")

In [None]:
p = ggplot(aes(x='state.of.res'), custdata)\
    + geom_bar(fill="gray")\
    + theme(axis_text_y=element_text(size=0.001))
p

In [None]:
custdata2 = custdata.copy()
custdata2 = custdata2[(custdata2.age > 0) & (custdata2.age < 100) & (custdata2.income > 0)]

In [None]:
custdata2[['age', 'income']].corr()

In [None]:
ggplot(aes(x='age', y='income'), custdata2) + geom_point() + ylim(0, 200000)

In [None]:
ggplot(aes(x='age', y='income'), custdata2) + geom_point() + stat_smooth() + ylim(0, 200000)

In [None]:
ggplot(aes(x='age', y='income'), custdata2) + geom_point() + stat_smooth(method='loess') + ylim(0, 200000)

In [None]:
ggplot(aes(x = 'marital.stat', fill = 'health.ins'), custdata) + geom_bar(position = "stack") 

In [None]:
ggplot(aes(x='marital.stat', fill='health.ins'), custdata) + geom_bar()

In [None]:
ggplot(aes(x = 'marital.stat', y = '1.05', fill = 'health.ins'), custdata)\
    + geom_bar(position="fill")\
    + geom_point(size = 0.75, alpha = 0.3)


In [None]:
custdata2['housing_type'] = pd.Categorical(custdata2['housing.type'])
custdata2['housing_type'] = custdata2['housing_type'].cat.codes

custdata3 = custdata2.copy()
custdata3 = custdata3.dropna(how='any')
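
One detail of the encoding step above is worth spelling out: categorical codes represent a missing value as `-1` rather than `NaN`, so the encoded column on its own never looks missing to `dropna`. Rows are still dropped whenever the original `housing.type` column is `NaN`, since that column remains in the frame. A small sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical housing values with one missing entry.
toy = pd.DataFrame({"housing.type": ["Rented", np.nan, "Owned", "Rented"]})

# Categories are inferred in sorted order: Owned -> 0, Rented -> 1.
# The missing entry is encoded as -1.
toy["housing_type"] = pd.Categorical(toy["housing.type"]).codes

# Row 1 is removed, but only because `housing.type` itself is NaN there.
complete = toy.dropna(how="any")

print(toy)
print(complete)
```

If the original column were removed before calling `dropna`, the `-1` rows would survive and show up as an extra category in later plots.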

In [None]:
ggplot(aes(x='housing_type', fill='marital.stat'), custdata3)\
    + geom_bar()\
    + theme(axis_text_x = element_text(angle = 45, hjust = 1))

In [None]:
ggplot(aes(x='housing_type', fill='marital.stat'), custdata3)\
    + geom_bar()\
    + facet_wrap('housing.type', scales="free_y")\
    + theme(axis_text_x = element_text(angle = 45, hjust = 1))