# Pandas import conventions

In [2]:
from pandas import Series,DataFrame
import numpy as np
import pandas as pd

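# Pandas data structures

## Introduction to Pandas data structures

- Series: a one-dimensional, array-like object made up of a sequence of values (of any NumPy data type) and an associated sequence of data labels (its index). A simple Series can be created from the values alone.
- DataFrame: a tabular data structure holding an ordered collection of columns, each of which may have a different value type. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series.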
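
The "dictionary of Series" view can be sketched as follows. The data and the labels `a` and `b` are made up for illustration only.

```python
import pandas as pd

# Two Series that share the same labels (hypothetical example data).
names = pd.Series(['Tom', 'Merry'], index=['a', 'b'])
scores = pd.Series([76, 98], index=['a', 'b'])

# A dict of Series becomes a DataFrame:
# the dict keys become column names, the shared index becomes the row labels.
frame = pd.DataFrame({'name': names, 'score': scores})

# Selecting one column gives back a Series that carries the same index.
col = frame['score']
print(type(col).__name__)   # Series
print(list(col.index))      # ['a', 'b']
```

Each column keeps its own dtype, which is why a DataFrame can mix strings and numbers.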

## Series

### Creating a Series from a one-dimensional array

In [3]:
arr = np.array([1,2,3,4])
series01 = Series(arr)
series01

0    1
1    2
2    3
3    4
dtype: int64

In [6]:
series01.index

RangeIndex(start=0, stop=4, step=1)

If no index is specified, a Series automatically gets an integer index running from 0 to length-1.

In [9]:
series02 = Series([23.1,22,25.3])
series02.index = (['product1','product2','product3'])
series02

product1    23.1
product2    22.0
product3    25.3
dtype: float64

In [10]:
series03 = Series([100,60,70],index=['Chinese','Math','English'])
series03

Chinese    100
Math        60
English     70
dtype: int64

In [7]:
series01.values

array([1, 2, 3, 4])

In [8]:
series01.dtype

dtype('int64')

### Creating a Series from a dictionary

In [4]:
series04 = Series({'2001':100.2,'2002':100.3,'2003':200.2})
series04.index

Index(['2001', '2002', '2003'], dtype='object')

In [12]:
series04

2001    100.2
2002    100.3
2003    200.2
dtype: float64

### Applying NumPy array operations to a Series

Getting a value by its index:

In [5]:
series04['2001']

100.2

In [6]:
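series04.iloc[0]  # positional access; plain series04[0] is deprecated on a label-indexed Series in recent pandas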

100.2

NumPy array operations remain available on a Series, and the mapping between index labels and values is preserved when they are applied.

In [8]:
series04[series04>100.2]

2002    100.3
2003    200.2
dtype: float64

In [9]:
series04 /100

2001    1.002
2002    1.003
2003    2.002
dtype: float64

In [10]:
np.exp(series04)

2001    3.283274e+43
2002    3.628579e+43
2003    8.825824e+86
dtype: float64

### Detecting missing values in a Series

In [11]:
scores = Series({'Tom':89,'John':88,'Merry':96,'Max':65})
scores

Tom      89
John     88
Merry    96
Max      65
dtype: int64

In [13]:
new_index = ['Tom','Max','Joe','John','Merry']
scores = Series(scores,index=new_index)
scores

Tom      89.0
Max      65.0
Joe       NaN
John     88.0
Merry    96.0
dtype: float64

In [14]:
pd.isnull(scores)

Tom      False
Max      False
Joe       True
John     False
Merry    False
dtype: bool

In [15]:
pd.notnull(scores)

Tom       True
Max       True
Joe      False
John      True
Merry     True
dtype: bool

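### Automatic alignment of Series

Arithmetic between different Series automatically aligns the data by index label. A label that appears in only one of the two Series produces `NaN` in the result.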

In [16]:
product_num = Series([23,45,67,89],index=['p3','p1','p2','p5'])
product_price_table = Series([9.98,2.34,4.56,5.67,8.9],index=['p1','p2','p3','p4','p5'])
product_sum = product_num * product_price_table
product_sum

p1    449.10
p2    156.78
p3    104.88
p4       NaN
p5    792.10
dtype: float64
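
If the `NaN` for `p4` is not wanted, two possible ways of handling it are sketched below. Which one is appropriate depends on what a missing quantity should mean; the names `sold_only` and `with_zero` are made up for this example.

```python
import pandas as pd

product_num = pd.Series([23, 45, 67, 89], index=['p3', 'p1', 'p2', 'p5'])
product_price_table = pd.Series([9.98, 2.34, 4.56, 5.67, 8.9],
                                index=['p1', 'p2', 'p3', 'p4', 'p5'])

# Option 1: keep only the products that appear in both Series.
sold_only = (product_num * product_price_table).dropna()

# Option 2: treat a quantity that is missing on one side as 0,
# so 'p4' yields 0.0 instead of NaN.
with_zero = product_num.mul(product_price_table, fill_value=0)

print(list(sold_only.index))
print(with_zero['p4'])
```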

## DataFrame

### Creating a DataFrame from a two-dimensional array

In [17]:
df01 = DataFrame([['Tom','Merry','John'],[76,98,100]])
df01

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Tom | Merry | John |
| 1 | 76 | 98 | 100 |


In [18]:
df02 = DataFrame([['Tom',76],['Merry',98],['John',100]])
df02

|   | 0 | 1 |
|---|---|---|
| 0 | Tom | 76 |
| 1 | Merry | 98 |
| 2 | John | 100 |


In [19]:
arr = np.array([['Tom',76],['Merry',98],['John',100]])
df03 = DataFrame(arr,columns = ['name','score'])
df03

|   | name | score |
|---|---|---|
| 0 | Tom | 76 |
| 1 | Merry | 98 |
| 2 | John | 100 |


When creating a DataFrame, use the `index` parameter to set custom row labels and the `columns` parameter to set custom column labels.

In [20]:
df04 = DataFrame(arr,index = ['one','two','three'],columns = ['name','score'])
df04

|   | name | score |
|---|---|---|
| one | Tom | 76 |
| two | Merry | 98 |
| three | John | 100 |


### Creating a DataFrame from a dictionary

In [21]:
data = {'apart':['1001','1002','1003','1001'],'profits':[567.87,987.87,873,498.87],'year':[2001,2001,2001,2000]}
df = DataFrame(data)
df

|   | apart | profits | year |
|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 |
| 1 | 1002 | 987.87 | 2001 |
| 2 | 1003 | 873.0 | 2001 |
| 3 | 1001 | 498.87 | 2000 |


In [22]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [23]:
df.columns

Index(['apart', 'profits', 'year'], dtype='object')

In [24]:
df.values

array([['1001', 567.87, 2001],
       ['1002', 987.87, 2001],
       ['1003', 873.0, 2001],
       ['1001', 498.87, 2000]], dtype=object)

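# Index objects

Both Series and DataFrame have index objects.

An index object holds the axis labels and other metadata (such as the axis name).

Through the index you can read values from a Series or DataFrame, or assign a new value at a given position.

The automatic alignment of Series and DataFrame is implemented through the index.

## Selecting values from a Series by index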

In [25]:
series04 = Series({'2001':100.2,'2002':100.3,'2003':200.2})
series04['2001']

100.2

In [26]:
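series04.iloc[0]  # positional access; plain series04[0] is deprecated on a label-indexed Series in recent pandas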

100.2

In [27]:
series04['2001':'2003']

2001    100.2
2002    100.3
2003    200.2
dtype: float64

Note that the right endpoint of the label slice above is included in the result, unlike slicing a plain Python list.
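
A short sketch of the difference, using explicit `.loc` (labels) and `.iloc` (positions) so that the two kinds of slice are not confused:

```python
import pandas as pd

series04 = pd.Series({'2001': 100.2, '2002': 100.3, '2003': 200.2})

# Label slice: both endpoints are included.
by_label = series04.loc['2001':'2002']

# Positional slice: the stop position is excluded, as with Python lists.
by_position = series04.iloc[0:1]
plain_list = [100.2, 100.3, 200.2][0:1]

print(len(by_label), len(by_position), len(plain_list))   # 2 1 1
```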

## Selecting values from a DataFrame by index

In [28]:
df

|   | apart | profits | year |
|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 |
| 1 | 1002 | 987.87 | 2001 |
| 2 | 1003 | 873.0 | 2001 |
| 3 | 1001 | 498.87 | 2000 |


In [29]:
df['year']

0    2001
1    2001
2    2001
3    2000
Name: year, dtype: int64

In [30]:
df[0]

KeyError: 0

In [35]:
df.iloc[0]

apart        1001
profits    567.87
year         2001
Name: 0, dtype: object

In [37]:
df.iloc[0,0]

'1001'

`iloc` selects data by integer position in the DataFrame.
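
The label-based counterpart of `iloc` is `loc`. The sketch below rebuilds the same data with hypothetical row labels `r1` to `r4` (they are not in the original table) so that positions and labels are easy to tell apart.

```python
import pandas as pd

data = {'apart': ['1001', '1002', '1003', '1001'],
        'profits': [567.87, 987.87, 873, 498.87],
        'year': [2001, 2001, 2001, 2000]}
# 'r1'..'r4' are made-up row labels for illustration.
df = pd.DataFrame(data, index=['r1', 'r2', 'r3', 'r4'])

cell_by_position = df.iloc[0, 1]          # first row, second column
cell_by_label = df.loc['r1', 'profits']   # the same cell, addressed by labels

rows_by_position = df.iloc[1:3]           # positions 1 and 2 (stop excluded)
rows_by_label = df.loc['r2':'r3']         # labels r2 through r3 (stop included)

print(cell_by_position, cell_by_label)
```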

In [36]:
df['pdn'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
df

|   | apart | profits | year | pdn |
|---|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 | NaN |
| 1 | 1002 | 987.87 | 2001 | NaN |
| 2 | 1003 | 873.0 | 2001 | NaN |
| 3 | 1001 | 498.87 | 2000 | NaN |


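# Basic pandas functionality

## Reindexing

Reindexing is often needed when the original index is no longer complete (for example, after rows with missing values have been dropped during data cleaning). `reset_index()` is commonly used for this.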

In [41]:
df01 = pd.DataFrame(np.arange(20).reshape(5,4),index=[1,3,4,6,8])
df01

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 3 |
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |
| 8 | 16 | 17 | 18 | 19 |


In [39]:
df01.reset_index()

|   | index | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 2 | 3 |
| 1 | 3 | 4 | 5 | 6 | 7 |
| 2 | 4 | 8 | 9 | 10 | 11 |
| 3 | 6 | 12 | 13 | 14 | 15 |
| 4 | 8 | 16 | 17 | 18 | 19 |


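`reset_index` takes a `drop` parameter that controls whether the old index is kept as a column after the reset: by default it is kept, and `drop=True` discards it.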

In [42]:
df01.reset_index(drop=True)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
| 4 | 16 | 17 | 18 | 19 |


In [43]:
df01

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 3 |
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |
| 8 | 16 | 17 | 18 | 19 |


This method returns a new DataFrame and does not modify the original one in place, so keep this in mind when using it.
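
A minimal sketch of the consequence: the result has to be assigned to a name to be kept. The helper variables `unchanged_index` and `new_index` exist only to show the state before and after.

```python
import numpy as np
import pandas as pd

df01 = pd.DataFrame(np.arange(20).reshape(5, 4), index=[1, 3, 4, 6, 8])

df01.reset_index(drop=True)           # the result is discarded; df01 is unchanged
unchanged_index = list(df01.index)    # still [1, 3, 4, 6, 8]

df01 = df01.reset_index(drop=True)    # rebind the name to keep the new index
new_index = list(df01.index)          # now [0, 1, 2, 3, 4]

print(unchanged_index, new_index)
```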

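## Dropping a row or a column

`drop` removes rows or columns from a DataFrame. It can be called in two ways:
- `df.drop(labels=[labels], axis=axis)`, where `labels` are the labels to remove; `axis=0` drops rows and `axis=1` drops columns;
- `df.drop(index=[index])` for rows, or `df.drop(columns=[columns])` for columns.

Note that this method does not modify the DataFrame itself either.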

In [51]:
df01.drop(labels=1,axis=0)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |
| 8 | 16 | 17 | 18 | 19 |


In [53]:
df01.drop(labels=[1,8],axis=0)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |


In [54]:
df01.drop(labels=[0,3],axis=1)

|   | 1 | 2 |
|---|---|---|
| 1 | 1 | 2 |
| 3 | 5 | 6 |
| 4 | 9 | 10 |
| 6 | 13 | 14 |
| 8 | 17 | 18 |


In [58]:
df01.drop(index=1)

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |
| 8 | 16 | 17 | 18 | 19 |


When dropping with `index` or `columns`, you must pass labels that actually exist in the table; integer positions are not accepted as a substitute.

In [59]:
df01.drop(index=2)

KeyError: '[2] not found in axis'

In [56]:
df01.drop(columns=0)

|   | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
| 3 | 5 | 6 | 7 |
| 4 | 9 | 10 | 11 |
| 6 | 13 | 14 | 15 |
| 8 | 17 | 18 | 19 |
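
If a row really has to be dropped by position, one possible approach is to look up the label at that position first. The sketch below also shows the `errors='ignore'` option of `drop`, which skips labels that do not exist instead of raising `KeyError`. The names `by_position` and `tolerant` are made up for this example.

```python
import numpy as np
import pandas as pd

df01 = pd.DataFrame(np.arange(20).reshape(5, 4), index=[1, 3, 4, 6, 8])

# df01.index[2] is the label stored at position 2 (here the label 4).
by_position = df01.drop(index=df01.index[2])

# Label 2 does not exist and is silently skipped; label 8 is dropped.
tolerant = df01.drop(index=[2, 8], errors='ignore')

print(list(by_position.index))
print(list(tolerant.index))
```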


## Arithmetic and data alignment

In [61]:
x=DataFrame(np.arange(9).reshape(3,3),columns=['A','B','C'],index=['a','b','c'])
x

|   | A | B | C |
|---|---|---|---|
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |
| c | 6 | 7 | 8 |


In [62]:
y=DataFrame(np.arange(12).reshape((4,3)),columns=['A','B','C'],index=['a','b','c','d'])
y

|   | A | B | C |
|---|---|---|---|
| a | 0 | 1 | 2 |
| b | 3 | 4 | 5 |
| c | 6 | 7 | 8 |
| d | 9 | 10 | 11 |


In [63]:
x + y

|   | A | B | C |
|---|---|---|---|
| a | 0.0 | 2.0 | 4.0 |
| b | 6.0 | 8.0 | 10.0 |
| c | 12.0 | 14.0 | 16.0 |
| d | NaN | NaN | NaN |


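When DataFrames are added, positions that do not overlap are set to `NaN`.

The `add` method with `fill_value` substitutes a default for a value that is missing on one side before the addition: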

In [64]:
x.add(y,fill_value=0)

|   | A | B | C |
|---|---|---|---|
| a | 0.0 | 2.0 | 4.0 |
| b | 6.0 | 8.0 | 10.0 |
| c | 12.0 | 14.0 | 16.0 |
| d | 9.0 | 10.0 | 11.0 |


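In arithmetic between a DataFrame and a Series, the index of the Series is by default matched against the columns of the DataFrame, and the operation is then applied to every row. Labels that do not match on both sides produce `NaN`, so the Series index should carry the same labels as the DataFrame's columns.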

In [66]:
x_series = Series([1,2,3],index=['A','B','C'])
x_series

A    1
B    2
C    3
dtype: int64

In [69]:
x - x_series

|   | A | B | C |
|---|---|---|---|
| a | -1 | -1 | -1 |
| b | 2 | 2 | 2 |
| c | 5 | 5 | 5 |


In [70]:
y_series = Series([1,2,3])
y_series

0    1
1    2
2    3
dtype: int64

In [71]:
x - y_series

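|   | A | B | C | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| a | NaN | NaN | NaN | NaN | NaN | NaN |
| b | NaN | NaN | NaN | NaN | NaN | NaN |
| c | NaN | NaN | NaN | NaN | NaN | NaN |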


Arithmetic between a DataFrame and a Series can be carried out along either axis:

In [73]:
x.sub(x_series,axis=1)

|   | A | B | C |
|---|---|---|---|
| a | -1 | -1 | -1 |
| b | 2 | 2 | 2 |
| c | 5 | 5 | 5 |


In [77]:
z_series = Series([1,1,1],index=['a','b','c'])
z_series

a    1
b    1
c    1
dtype: int64

In [79]:
x.sub(z_series,axis=0)

|   | A | B | C |
|---|---|---|---|
| a | -1 | 0 | 1 |
| b | 2 | 3 | 4 |
| c | 5 | 6 | 7 |


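I am still a bit confused about how the choice of `axis` relates to the direction of the computation; I will look at it more closely and summarize it later.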
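
One way to read `axis` in the examples above, offered as a sketch rather than a full summary: in `sub` and similar methods, `axis` says which set of labels the Series index is matched against, while in reductions such as `sum` it says which axis is collapsed. The names `col_series`, `row_series`, `per_row` and `per_col` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(9).reshape(3, 3),
                 columns=['A', 'B', 'C'], index=['a', 'b', 'c'])
col_series = pd.Series([1, 2, 3], index=['A', 'B', 'C'])   # labels match x.columns
row_series = pd.Series([1, 1, 1], index=['a', 'b', 'c'])   # labels match x.index

# axis=1 (or 'columns'): match the Series labels against the column labels,
# then subtract the Series from every row. `x - col_series` does the same.
per_row = x.sub(col_series, axis=1)

# axis=0 (or 'index'): match the Series labels against the row labels,
# then subtract the Series from every column.
per_col = x.sub(row_series, axis=0)

# For reductions, axis names the axis that is collapsed.
col_totals = x.sum(axis=0)   # one value per column: index A, B, C
row_totals = x.sum(axis=1)   # one value per row: index a, b, c

print(col_totals.tolist(), row_totals.tolist())   # [9, 12, 15] [3, 12, 21]
```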

## Common mathematical and statistical methods

In [80]:
df01

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 3 |
| 3 | 4 | 5 | 6 | 7 |
| 4 | 8 | 9 | 10 | 11 |
| 6 | 12 | 13 | 14 | 15 |
| 8 | 16 | 17 | 18 | 19 |


In [81]:
df01.describe()

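|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 8.0 | 9.0 | 10.0 | 11.0 |
| std | 6.324555 | 6.324555 | 6.324555 | 6.324555 |
| min | 0.0 | 1.0 | 2.0 | 3.0 |
| 25% | 4.0 | 5.0 | 6.0 | 7.0 |
| 50% | 8.0 | 9.0 | 10.0 | 11.0 |
| 75% | 12.0 | 13.0 | 14.0 | 15.0 |
| max | 16.0 | 17.0 | 18.0 | 19.0 |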


In [82]:
df01.count()

0    5
1    5
2    5
3    5
dtype: int64

In [83]:
df01.count(axis=1)

1    4
3    4
4    4
6    4
8    4
dtype: int64

In [84]:
df01.min()

0    0
1    1
2    2
3    3
dtype: int64

In [85]:
df01.sum()

0    40
1    45
2    50
3    55
dtype: int64

In [86]:
df01.mean()

0     8.0
1     9.0
2    10.0
3    11.0
dtype: float64

In [88]:
df01.median(axis=1)

1     1.5
3     5.5
4     9.5
6    13.5
8    17.5
dtype: float64

In [90]:
df01.var()  # variance

0    40.0
1    40.0
2    40.0
3    40.0
dtype: float64

In [91]:
df01.std()  # standard deviation

0    6.324555
1    6.324555
2    6.324555
3    6.324555
dtype: float64
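
The value 40.0 above is the sample variance: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` divides by n by default. A small sketch of the difference, with variable names chosen for this example:

```python
import numpy as np
import pandas as pd

df01 = pd.DataFrame(np.arange(20).reshape(5, 4), index=[1, 3, 4, 6, 8])

sample_var = df01.var()              # pandas default ddof=1: divide by n - 1
population_var = df01.var(ddof=0)    # divide by n instead
numpy_var = np.var(df01.to_numpy(), axis=0)   # NumPy default ddof=0

print(sample_var.tolist())       # [40.0, 40.0, 40.0, 40.0]
print(population_var.tolist())   # [32.0, 32.0, 32.0, 32.0]
```

The same `ddof` parameter applies to `std`.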

Correlation coefficients:

In [92]:
df01.corr()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 1.0 | 1.0 |


Covariance:

In [93]:
df01.cov()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 40.0 | 40.0 | 40.0 | 40.0 |
| 1 | 40.0 | 40.0 | 40.0 | 40.0 |
| 2 | 40.0 | 40.0 | 40.0 | 40.0 |
| 3 | 40.0 | 40.0 | 40.0 | 40.0 |


## Handling missing data

### Detecting missing values

In [97]:
df02 = DataFrame([['Tom',np.nan,456.7,'M'],['Merry',34,4567.8,np.nan],
                  ['John',23,np.nan,'M']],columns=['name','age','salary','gender'])
df02

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | Tom | NaN | 456.7 | M |
| 1 | Merry | 34.0 | 4567.8 | NaN |
| 2 | John | 23.0 | NaN | M |


In [99]:
df02.isnull()

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | False | True | False | False |
| 1 | False | False | False | True |
| 2 | False | False | True | False |


In [101]:
df02.notnull()

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | True | False | True | True |
| 1 | True | True | True | False |
| 2 | True | True | False | True |


### Filtering out missing data

In [102]:
df02.dropna()

|   | name | age | salary | gender |
|---|---|---|---|---|


In [103]:
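df02.dropna(how='all')  # drop only the rows in which every value is missing

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | Tom | NaN | 456.7 | M |
| 1 | Merry | 34.0 | 4567.8 | NaN |
| 2 | John | 23.0 | NaN | M |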
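
Plain `dropna()` returns an empty table here because every row contains at least one missing value, and `how='all'` keeps everything because no row is entirely missing. A few other options of `dropna` sit between these two extremes; the sketch below uses the same data, with result names made up for illustration.

```python
import numpy as np
import pandas as pd

df02 = pd.DataFrame([['Tom', np.nan, 456.7, 'M'],
                     ['Merry', 34, 4567.8, np.nan],
                     ['John', 23, np.nan, 'M']],
                    columns=['name', 'age', 'salary', 'gender'])

any_missing = df02.dropna()                # default how='any': no row survives
need_age = df02.dropna(subset=['age'])     # only consider the 'age' column
at_least_3 = df02.dropna(thresh=3)         # keep rows with >= 3 non-missing values
full_columns = df02.dropna(axis=1)         # drop columns instead of rows

print(len(any_missing), list(need_age.index), len(at_least_3),
      list(full_columns.columns))
```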


### Filling in missing values

In [104]:
df02.fillna(0)

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | Tom | 0.0 | 456.7 | M |
| 1 | Merry | 34.0 | 4567.8 | 0 |
| 2 | John | 23.0 | 0.0 | M |


In [106]:
df02.fillna({'age':25,'salary':500,'gender':'M'})

|   | name | age | salary | gender |
|---|---|---|---|---|
| 0 | Tom | 25.0 | 456.7 | M |
| 1 | Merry | 34.0 | 4567.8 | M |
| 2 | John | 23.0 | 500.0 | M |
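
Instead of hard-coded constants, the fill values in the dictionary can be computed from the data. The sketch below fills `age` with the column mean and `salary` with the column median; this choice of statistics is only an assumption for illustration, not a recommendation from the original material.

```python
import numpy as np
import pandas as pd

df02 = pd.DataFrame([['Tom', np.nan, 456.7, 'M'],
                     ['Merry', 34, 4567.8, np.nan],
                     ['John', 23, np.nan, 'M']],
                    columns=['name', 'age', 'salary', 'gender'])

# mean() and median() skip NaN by default, so they use only the known values.
fill_values = {'age': df02['age'].mean(), 'salary': df02['salary'].median()}

# Columns that are not listed in the dict ('gender' here) are left untouched.
filled = df02.fillna(fill_values)

print(filled.loc[0, 'age'], filled.loc[2, 'salary'])
```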


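# Afterword

The material I have recently used to study NumPy and Pandas is a set of slides from a training course that I found online. Later, when looking up points I did not understand, I found many other related resources, and I plan to go through the better ones afterwards.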