# Main steps of data visualization
* Data preparation
* Choosing a chart
* Iterative analysis
* Reporting conclusions

## Data preparation
* Data size: grouping, sampling
* Data types: numerical data, categorical data
* Data anomalies: outlier values, missing data

In [1]:
import numpy as np
import pandas as pd
import scipy as sp

### Grouping data
* groupby

In [3]:
df = pd.DataFrame({'key1' : ['a','a','b','b','a'],
                'key2' : ['one','two','one','two','one'],
                'data1' : np.random.normal(size=5),
                'data2' : np.random.normal(size=5)})
df

|   | data1 | data2 | key1 | key2 |
|---|---|---|---|---|
| 0 | -0.260895 | 0.733148 | a | one |
| 1 | -0.550839 | 0.279917 | a | two |
| 2 | -1.196736 | -0.191847 | b | one |
| 3 | 0.229591 | 1.872546 | b | two |
| 4 | -2.432248 | 0.925433 | a | one |


In [4]:
df['data1'].groupby(df['key1']).mean()

key1
a   -1.081327
b   -0.483573
Name: data1, dtype: float64
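
The same `groupby` pattern extends to several keys and several aggregates at once. The following is a minimal sketch; it assumes fixed illustrative values in place of the random draws above so that the result is reproducible, and the names `means` and `stats` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for illustration) instead of random draws.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'data2': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Group by both keys and take the mean of each numeric column.
means = df.groupby(['key1', 'key2'])[['data1', 'data2']].mean()

# Several aggregates of one column for each key1 group.
stats = df.groupby('key1')['data1'].agg(['mean', 'count'])

print(means)
print(stats)
```

Selecting the numeric columns explicitly before calling `mean()` keeps the string columns out of the aggregation.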

### Sampling data
* sample

In [5]:
import random
x = np.arange(0,100)
y = random.sample(x.tolist(), 10)  # random.sample needs a sequence, not an ndarray
y

[54, 22, 11, 43, 89, 1, 20, 34, 13, 67]
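
Sampling can also be done directly in NumPy or pandas, which avoids converting the array to a list. This is a sketch rather than the method used above; the seed `0`, the column name `value`, and the variable names are assumptions made for reproducibility and illustration.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 100)

# NumPy: draw 10 distinct values without replacement (seed 0 is arbitrary).
rng = np.random.default_rng(0)
picked = rng.choice(x, size=10, replace=False)

# pandas: draw 10% of the rows of a DataFrame without replacement.
frame = pd.DataFrame({'value': x})
subset = frame.sample(frac=0.1, random_state=0)

print(picked)
print(subset)
```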

### Outlier values
#### Removing outliers

In [6]:
x = np.arange(0,100)
y = x[(x>90)|(x<10)]
y

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 91, 92, 93, 94, 95, 96, 97,
       98, 99])

#### Replacing outliers

In [7]:
y = x.copy()                  # copy first so that x itself is not modified
y[(y >= 10) & (y <= 90)] = 0  # replace every value in [10, 90] with 0
y

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, 91, 92, 93, 94, 95, 96, 97, 98, 99])

#### Handling NA values
* dropna
* fillna

In [9]:
from numpy import nan
data = pd.Series([1,nan,2,nan,3,nan])
data

0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
5    NaN
dtype: float64

In [10]:
data.dropna()

0    1.0
2    2.0
4    3.0
dtype: float64

In [11]:
data.fillna(0)

0    1.0
1    0.0
2    2.0
3    0.0
4    3.0
5    0.0
dtype: float64
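
Filling with a constant is only one option. The sketch below counts the missing entries and shows two other common fills, the mean of the observed values and a forward fill. Which one is appropriate depends on the data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, np.nan, 3, np.nan])

# Count the missing entries before deciding how to treat them.
n_missing = data.isna().sum()

# Fill with the mean of the observed values (NaN is skipped by mean()).
mean_filled = data.fillna(data.mean())

# Propagate the last observed value forward.
forward_filled = data.ffill()

print(n_missing)
print(mean_filled)
print(forward_filled)
```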

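## Choosing a chart
* Association analysis: scatter plot, line plot (scatter, plot)
* Distribution analysis: histogram, density plot (hist, gaussian_kde, plot)
* Categorical analysis: bar chart, box plot (bar, boxplot)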
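
For the distribution case, the numbers behind a histogram and a density plot can be computed before any drawing takes place. This is a minimal sketch that assumes synthetic, normally distributed samples with an arbitrary seed; the plotting calls themselves are left out.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic data (an assumption for illustration): 1000 standard-normal draws.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Histogram: counts per bin and the bin edges (10 bins give 11 edges).
counts, edges = np.histogram(samples, bins=10)

# Kernel density estimate evaluated on an evenly spaced grid.
kde = gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 200)
density = kde(grid)

print(counts)
print(density[:5])
```

The pairs `(edges, counts)` and `(grid, density)` are what a bar-style histogram and a line-style density plot would display.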

## Iterative analysis
* Choose a fitting model: OLS, fit
* Evaluate the fit: summary_table
* Determine the data distribution: hist
* Identify the key intervals: quartile
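
The fitting and quartile steps can be sketched with the libraries already imported in this notebook. Note that this example uses `scipy.stats.linregress` as a simple stand-in for a full OLS model with a summary table, and that the data are synthetic (a line with slope 2 and intercept 1 plus noise), both of which are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares line through the points; a stand-in for an OLS model fit.
fit = stats.linregress(x, y)
r_squared = fit.rvalue ** 2  # share of variance explained by the line

# Quartiles of y mark the interval that holds the middle half of the data.
q1, median, q3 = np.percentile(y, [25, 50, 75])
iqr = q3 - q1

print(fit.slope, fit.intercept, r_squared)
print(q1, median, q3, iqr)
```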

## Reporting conclusions