# Pandas import conventions

In [2]:
from pandas import Series,DataFrame
import numpy as np
import pandas as pd

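# Pandas data structures

## Introduction to pandas data structures

- Series: a one-dimensional, array-like object made up of a set of data (of any NumPy dtype) and an associated set of data labels (the index). A simple Series can be created from the data alone.
- DataFrame: a tabular data structure containing an ordered set of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series.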
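
As a minimal sketch of the "dictionary of Series" view, the example below builds a DataFrame from two Series and pulls one column back out. The names `names`, `scores`, and `frame` and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
names = pd.Series(['Tom', 'Merry', 'John'])
scores = pd.Series([76, 98, 100])

# A DataFrame built from a dict of Series: the dict keys become column labels.
frame = pd.DataFrame({'name': names, 'score': scores})

# Selecting one column returns a Series that shares the row index.
score_column = frame['score']
print(type(score_column).__name__)   # Series
print(frame.dtypes)                  # each column keeps its own dtype
```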

## Series

### Creating a Series from a one-dimensional array:

In [3]:
arr = np.array([1,2,3,4])
series01 = Series(arr)
series01

0    1
1    2
2    3
3    4
dtype: int64

In [6]:
series01.index

RangeIndex(start=0, stop=4, step=1)

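If no index is specified, a Series automatically creates an integer index running from 0 to length-1. Labels can also be assigned afterwards or passed in through the `index` argument.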

In [9]:
series02 = Series([23.1,22,25.3])
series02.index = (['product1','product2','product3'])
series02

product1    23.1
product2    22.0
product3    25.3
dtype: float64

In [10]:
series03 = Series([100,60,70],index=['语文','数学','英语'])
series03

语文    100
数学     60
英语     70
dtype: int64

In [7]:
series01.values

array([1, 2, 3, 4])

In [8]:
series01.dtype

dtype('int64')

### Creating a Series from a dictionary

In [4]:
series04 = Series({'2001':100.2,'2002':100.3,'2003':200.2})
series04.index

Index(['2001', '2002', '2003'], dtype='object')

In [12]:
series04

2001    100.2
2002    100.3
2003    200.2
dtype: float64

### Applying NumPy array operations to a Series

Getting a value by index:

In [5]:
series04['2001']

100.2

In [6]:
series04.iloc[0]

100.2

NumPy array operations remain available on a Series, and the mapping between index labels and values is preserved when they are applied.

In [8]:
series04[series04>100.2]

2002    100.3
2003    200.2
dtype: float64

In [9]:
series04 /100

2001    1.002
2002    1.003
2003    2.002
dtype: float64

In [10]:
np.exp(series04)

2001    3.283274e+43
2002    3.628579e+43
2003    8.825824e+86
dtype: float64

### Detecting missing values in a Series

In [11]:
scores = Series({'Tom':89,'John':88,'Merry':96,'Max':65})
scores

Tom      89
John     88
Merry    96
Max      65
dtype: int64

In [13]:
new_index = ['Tom','Max','Joe','John','Merry']
scores = Series(scores,index=new_index)
scores

Tom      89.0
Max      65.0
Joe       NaN
John     88.0
Merry    96.0
dtype: float64

In [14]:
pd.isnull(scores)

Tom      False
Max      False
Joe       True
John     False
Merry    False
dtype: bool

In [15]:
pd.notnull(scores)

Tom       True
Max       True
Joe      False
John      True
Merry     True
dtype: bool
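
The checks above can also be written as Series methods and combined with boolean indexing. The sketch below assumes `reindex` as an alternative way to introduce the missing `Joe` label; `missing_mask`, `missing_labels`, and `present` are illustrative names.

```python
import pandas as pd

scores = pd.Series({'Tom': 89, 'John': 88, 'Merry': 96, 'Max': 65})
# reindex is used here in place of Series(scores, index=new_index).
scores = scores.reindex(['Tom', 'Max', 'Joe', 'John', 'Merry'])

missing_mask = scores.isnull()                     # same result as pd.isnull(scores)
missing_labels = list(scores.index[missing_mask])  # labels that have no value
present = scores[scores.notnull()]                 # keep only labels that have a value

print(missing_labels)   # ['Joe']
print(len(present))     # 4
```

Because NaN is a floating-point value, the Series is upcast from int64 to float64 once a missing entry appears.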

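### Automatic alignment of Series

Arithmetic between different Series automatically aligns the data by index label. A label that appears in only one of the Series produces NaN in the result.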

In [16]:
product_num = Series([23,45,67,89],index=['p3','p1','p2','p5'])
product_price_table = Series([9.98,2.34,4.56,5.67,8.9],index=['p1','p2','p3','p4','p5'])
product_sum = product_num * product_price_table
product_sum

p1    449.10
p2    156.78
p3    104.88
p4       NaN
p5    792.10
dtype: float64
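
A short sketch of how the unmatched label might be located and handled after alignment. Dropping it before summing is only one possible choice and is an assumption about what the reader wants; `unmatched` and `matched_total` are illustrative names.

```python
import pandas as pd

product_num = pd.Series([23, 45, 67, 89], index=['p3', 'p1', 'p2', 'p5'])
product_price_table = pd.Series([9.98, 2.34, 4.56, 5.67, 8.9],
                                index=['p1', 'p2', 'p3', 'p4', 'p5'])

product_sum = product_num * product_price_table

# The result index is the union of both indexes. 'p4' has a price but no
# quantity, so its product is NaN.
unmatched = list(product_sum.index[product_sum.isnull()])
print(unmatched)   # ['p4']

# Assumed follow-up: ignore unmatched labels when computing the total.
matched_total = product_sum.dropna().sum()
```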

## DataFrame

### Creating a DataFrame from a two-dimensional array

In [17]:
df01 = DataFrame([['Tom','Merry','John'],[76,98,100]])
df01

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Tom | Merry | John |
| 1 | 76 | 98 | 100 |


In [18]:
df02 = DataFrame([['Tom',76],['Merry',98],['John',100]])
df02

|   | 0 | 1 |
|---|---|---|
| 0 | Tom | 76 |
| 1 | Merry | 98 |
| 2 | John | 100 |


In [19]:
arr = np.array([['Tom',76],['Merry',98],['John',100]])
df03 = DataFrame(arr,columns = ['name','score'])
df03

|   | name | score |
|---|---|---|
| 0 | Tom | 76 |
| 1 | Merry | 98 |
| 2 | John | 100 |


In [20]:
df04 = DataFrame(arr,index = ['one','two','three'],columns = ['name','score'])
df04

|   | name | score |
|---|---|---|
| one | Tom | 76 |
| two | Merry | 98 |
| three | John | 100 |


### Creating a DataFrame from a dictionary

In [21]:
data = {'apart':['1001','1002','1003','1001'],'profits':[567.87,987.87,873,498.87],'year':[2001,2001,2001,2000]}
df = DataFrame(data)
df

|   | apart | profits | year |
|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 |
| 1 | 1002 | 987.87 | 2001 |
| 2 | 1003 | 873.0 | 2001 |
| 3 | 1001 | 498.87 | 2000 |


In [22]:
df.index

RangeIndex(start=0, stop=4, step=1)

In [23]:
df.columns

Index(['apart', 'profits', 'year'], dtype='object')

In [24]:
df.values

array([['1001', 567.87, 2001],
       ['1002', 987.87, 2001],
       ['1003', 873.0, 2001],
       ['1001', 498.87, 2000]], dtype=object)

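# Index objects

Both Series and DataFrame have index objects.

An index object manages the axis labels and other metadata (such as the axis name).

Through the index you can read a value from a Series or DataFrame, or assign a new value at a given position.

The automatic alignment of Series and DataFrame is implemented through the index.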
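
The following sketch illustrates two of the points above on a small Series: storing an axis name on the index, and reassigning a value through its label. The axis name `'year'` and the replacement value `150.0` are assumptions chosen for illustration.

```python
import pandas as pd

s = pd.Series({'2001': 100.2, '2002': 100.3, '2003': 200.2})

s.index.name = 'year'   # axis name, stored on the index object
s['2002'] = 150.0       # reassign the value held under an existing label

print(s.index.name)     # year
print(s['2002'])        # 150.0
```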

## Getting values from a Series by index

In [25]:
series04 = Series({'2001':100.2,'2002':100.3,'2003':200.2})
series04['2001']

100.2

In [26]:
series04.iloc[0]

100.2

In [27]:
series04['2001':'2003']

2001    100.2
2002    100.3
2003    200.2
dtype: float64

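Note that the end label of the slice above is included in the result, unlike slicing a plain Python list, where the stop position is excluded.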
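
To make the difference concrete, here is a small sketch comparing a label slice, a positional slice, and a list slice. The variable names are illustrative, and `.loc` / `.iloc` are used to make the two lookup modes explicit.

```python
import pandas as pd

s = pd.Series({'2001': 100.2, '2002': 100.3, '2003': 200.2})
values = [100.2, 100.3, 200.2]

label_slice = s.loc['2001':'2002']   # label slice: the end label '2002' is included
position_slice = s.iloc[0:1]         # positional slice: stop position 1 is excluded
list_slice = values[0:1]             # plain list: stop position 1 is excluded

print(len(label_slice))      # 2
print(len(position_slice))   # 1
print(len(list_slice))       # 1
```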

## Getting values from a DataFrame by index

In [28]:
df

|   | apart | profits | year |
|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 |
| 1 | 1002 | 987.87 | 2001 |
| 2 | 1003 | 873.0 | 2001 |
| 3 | 1001 | 498.87 | 2000 |


In [29]:
df['year']

0    2001
1    2001
2    2001
3    2000
Name: year, dtype: int64

In [30]:
df[0]

KeyError: 0

In [35]:
df.iloc[0]

apart        1001
profits    567.87
year         2001
Name: 0, dtype: object
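
The cells above suggest a rule of thumb: on a DataFrame, `[]` looks up column labels, while rows are reached through `.loc` (by label) or `.iloc` (by position). The sketch below rebuilds the same `df` and exercises each lookup; `raised` is an illustrative flag, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'apart': ['1001', '1002', '1003', '1001'],
    'profits': [567.87, 987.87, 873, 498.87],
    'year': [2001, 2001, 2001, 2000],
})

year_column = df['year']      # [] looks up a column label
first_by_label = df.loc[0]    # .loc looks up a row label (default RangeIndex label 0)
first_by_pos = df.iloc[0]     # .iloc looks up a row position

raised = False
try:
    df[0]                     # no column is labelled 0, so this fails
except KeyError:
    raised = True
print(raised)                 # True
```

With the default RangeIndex the row label and the row position coincide, so `df.loc[0]` and `df.iloc[0]` return the same row. They can differ once the index carries custom labels.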

In [36]:
df['pdn'] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
df

|   | apart | profits | year | pdn |
|---|---|---|---|---|
| 0 | 1001 | 567.87 | 2001 | NaN |
| 1 | 1002 | 987.87 | 2001 | NaN |
| 2 | 1003 | 873.0 | 2001 | NaN |
| 3 | 1001 | 498.87 | 2000 | NaN |


# Basic pandas functionality