### Introduction to Pandas

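pandas is a Python package and one of the most commonly used foundational libraries for machine learning work in Python. This article is an introductory tutorial.

pandas provides fast, flexible, and expressive data structures.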

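#### Core data structures in Pandas

##### The two core data structures in pandas are Series and DataFrame.
A DataFrame can be thought of as a container of Series: one DataFrame can hold any number of Series, one per column.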
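
To make the container relationship concrete, here is a minimal sketch. The names `height`, `weight`, and `people` and their values are made up for illustration; the point is only that a dict of Series becomes a DataFrame, and that selecting one column gives a Series back.

```python
import pandas as pd

# Two Series that share the same index (hypothetical example data)
height = pd.Series([1.70, 1.82], index=["x", "y"])
weight = pd.Series([65, 80], index=["x", "y"])

# A dict of Series becomes a DataFrame with one column per Series
people = pd.DataFrame({"height": height, "weight": weight})

# Selecting a single column returns a Series again
col = people["height"]
print(type(col))
```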

In [1]:
import numpy as np
import pandas as pd

### Introduction to the Series structure

In [2]:
a = pd.Series([1,2,3])

In [3]:
print(a,'\n\n', f'the first element of a is {a[0]}')

0    1
1    2
2    3
dtype: int64 

 the first element of a is 1


#### Series automatically aligns data by index label in arithmetic operations

In [4]:
d={"Shenzhen":5,"Shanghai":6,"Beijing":7}
d2={"Shenzhen":1,"Shanghai":2,"Beijing":3}
Series_d1=pd.Series(d)
Series_d2=pd.Series(d2)
print(Series_d1,'\n')
print(Series_d1+Series_d2)

Shenzhen    5
Shanghai    6
Beijing     7
dtype: int64 

Shenzhen     6
Shanghai     8
Beijing     10
dtype: int64
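
The two Series above happen to share exactly the same labels, so the alignment is not visible. The sketch below uses a second Series with a different label set (the `Guangzhou` entry is hypothetical data added for illustration) to show what happens when the indexes only partly overlap.

```python
import pandas as pd

s1 = pd.Series({"Shenzhen": 5, "Shanghai": 6, "Beijing": 7})
# Hypothetical second Series: no "Shenzhen", but an extra "Guangzhou"
s2 = pd.Series({"Shanghai": 2, "Beijing": 3, "Guangzhou": 4})

# Values are matched by label, not by position.
# Labels present in only one operand produce NaN.
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing label as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```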


In [5]:
b = np.array([1,2,3])

In [6]:
print(b,type(b))

[1 2 3] <class 'numpy.ndarray'>


### Introduction to the DataFrame structure

Creating a table

In [7]:
c = pd.DataFrame(np.arange(12).reshape(3,4)) # a 3x4 table

In [8]:
c

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


Extracting rows and columns

In [9]:
c[1] # get column of index 1

0    1
1    5
2    9
Name: 1, dtype: int64

In [10]:
c.iloc[1] # get row of index 1

0    4
1    5
2    6
3    7
Name: 1, dtype: int64

In [11]:
df = pd.DataFrame(np.arange(16).reshape(4,4),
                 columns=['col1','col2','col3','col4'],
                 index = ['a','b','c','d']) 
# name the columns and rows

In [12]:
df

Unnamed: 0,col1,col2,col3,col4
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [13]:
print(df.keys())
df['col1']    # the integers 0, 1, 2, ... can no longer be used as column keys here, e.g. df[0] raises a KeyError

Index(['col1', 'col2', 'col3', 'col4'], dtype='object')


a     0
b     4
c     8
d    12
Name: col1, dtype: int64
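
As a small sketch of the point made in the comment above: once the columns have string names, `df[0]` looks for a column labelled `0` and fails, while `iloc` still reaches the first column by position. The DataFrame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd'])

# df[0] searches the column labels for 0, which does not exist
raised = False
try:
    df[0]
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# Positional access to the first column still works through iloc
first_col = df.iloc[:, 0]
print(first_col)
```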

In [14]:
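print(df.iloc[0],"\n") # iloc selects by integer position: this is the first row
print(df.iloc[a],"\n")   # note: this is the variable a, not the label 'a'. a is the Series [1, 2, 3] from In [2], so this selects the rows at positions 1, 2, 3; if a had never been defined it would raise a NameError. Use df.loc['a'] to select by label
print(a,type(a))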

col1    0
col2    1
col3    2
col4    3
Name: a, dtype: int64 

   col1  col2  col3  col4
b     4     5     6     7
c     8     9    10    11
d    12    13    14    15 

0    1
1    2
2    3
dtype: int64 <class 'pandas.core.series.Series'>


In [15]:
df.head(2) # show the first two rows of the DataFrame; during data cleaning, head(5) is commonly used to look at the first five rows

Unnamed: 0,col1,col2,col3,col4
a,0,1,2,3
b,4,5,6,7


In [16]:
# summary statistics for the numeric features
df.describe()

Unnamed: 0,col1,col2,col3,col4
count,4.0,4.0,4.0,4.0
mean,6.0,7.0,8.0,9.0
std,5.163978,5.163978,5.163978,5.163978
min,0.0,1.0,2.0,3.0
25%,3.0,4.0,5.0,6.0
50%,6.0,7.0,8.0,9.0
75%,9.0,10.0,11.0,12.0
max,12.0,13.0,14.0,15.0
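
One detail worth checking in the table above is the `std` row. The sketch below recomputes it for `col1` and suggests that `describe()` reports the sample standard deviation (`ddof=1`), which differs from NumPy's default (`ddof=0`). The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

col1 = pd.Series([0, 4, 8, 12])  # same values as df['col1']

# pandas defaults to the sample standard deviation (ddof=1)
sample_std = col1.std()

# The population standard deviation needs ddof=0 explicitly in pandas
population_std = col1.std(ddof=0)

# NumPy's np.std uses ddof=0 by default
numpy_default = np.std(col1.to_numpy())

print(sample_std, population_std, numpy_default)
```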


In [17]:
df.tail(2) # the last two rows

Unnamed: 0,col1,col2,col3,col4
c,8,9,10,11
d,12,13,14,15


### Building a DataFrame from Series

In [18]:
note = pd.Series(['A','B','C','D'])
ID = pd.Series(['01','02','03','04'])
df2 = pd.DataFrame([ID,note])

In [19]:
df2.T # transpose

Unnamed: 0,0,1
0,01,A
1,02,B
2,03,C
3,04,D
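
Passing a list of Series makes each Series a row, which is why the transpose is needed above. As an alternative sketch (not the approach used in this notebook), a dict of Series or `pd.concat` with `axis=1` places each Series in its own named column directly. The column names `ID` and `note` are chosen here for illustration.

```python
import pandas as pd

note = pd.Series(['A', 'B', 'C', 'D'])
ID = pd.Series(['01', '02', '03', '04'])

# Dict keys become column names; no transpose required
df_from_dict = pd.DataFrame({"ID": ID, "note": note})

# Equivalent: concatenate the Series side by side as columns
df_from_concat = pd.concat([ID.rename("ID"), note.rename("note")], axis=1)

print(df_from_dict)
```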


### Adding and deleting a column

In [20]:
df

Unnamed: 0,col1,col2,col3,col4
a,0,1,2,3
b,4,5,6,7
c,8,9,10,11
d,12,13,14,15


In [21]:
df['col6']=pd.Series(['01','02','03','04'],index = ['b','a','c','d'])

In [22]:
df

Unnamed: 0,col1,col2,col3,col4,col6
a,0,1,2,3,02
b,4,5,6,7,01
c,8,9,10,11,03
d,12,13,14,15,04
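
Note that row `a` received the second value of the assigned Series: a Series assigned as a column is matched to the rows by index label, not by order. The sketch below contrasts that with assigning a plain list, and shows that labels missing from the Series become NaN. The column names `tag` and `by_position` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  columns=['col1', 'col2'],
                  index=['a', 'b', 'c', 'd'])

# The Series only covers labels 'b' and 'a', in that order
df['tag'] = pd.Series(['01', '02'], index=['b', 'a'])
# Rows are matched by label: a gets '02', b gets '01', c and d get NaN

# A plain list has no index, so it is assigned purely by position
df['by_position'] = ['01', '02', '03', '04']

print(df)
```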


In [23]:
del df['col4']

In [24]:
df

Unnamed: 0,col1,col2,col3,col6
a,0,1,2,02
b,4,5,6,01
c,8,9,10,03
d,12,13,14,04
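
`del` modifies the DataFrame in place. As a sketch of two alternatives that may be convenient: `drop(columns=...)` returns a new DataFrame and leaves the original untouched, and `pop` removes a column in place while returning it as a Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd'])

# drop returns a copy without the column; df itself still has col4
trimmed = df.drop(columns=['col4'])
still_has_col4 = 'col4' in df.columns

# pop removes the column from df and hands it back as a Series
popped = df.pop('col3')

print(trimmed.columns.tolist(), df.columns.tolist())
```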


### Accessing data

In [25]:
df.columns   # column names(indexes)

Index(['col1', 'col2', 'col3', 'col6'], dtype='object')

In [26]:
df.index     # row names(indexes)

Index(['a', 'b', 'c', 'd'], dtype='object')

In [27]:
df.values # the underlying data as a NumPy array

array([[0, 1, 2, '02'],
       [4, 5, 6, '01'],
       [8, 9, 10, '03'],
       [12, 13, 14, '04']], dtype=object)

In [28]:
# loc, iloc
# loc selects by label; iloc selects by integer position
df

Unnamed: 0,col1,col2,col3,col6
a,0,1,2,02
b,4,5,6,01
c,8,9,10,03
d,12,13,14,04


In [29]:
df.loc['a']

col1     0
col2     1
col3     2
col6    02
Name: a, dtype: object

In [30]:
df.loc[['a','b'],['col1','col3']]

Unnamed: 0,col1,col3
a,0,2
b,4,6


In [31]:
df.iloc[0]

col1     0
col2     1
col3     2
col6    02
Name: a, dtype: object

In [32]:
df.iloc[0:2,1:3]

Unnamed: 0,col2,col3
a,1,2
b,5,6
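
One difference between the two accessors that is easy to miss: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. A minimal sketch, rebuilding a 4x4 DataFrame so it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4, 4),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd'])

# Label slices include both endpoints: rows a..b, columns col2..col3
by_label = df.loc['a':'b', 'col2':'col3']

# Position slices exclude the end: rows 0..1, columns 1..2
by_position = df.iloc[0:2, 1:3]

print(by_label)
print(by_position)
```

Both expressions select the same 2x2 block here, even though the `iloc` slice ends are one past the last position selected.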


### Reading data files with Pandas

In [41]:
df3 = pd.read_csv('./data.csv')
# df3.to_csv('./data_cleaned.csv',index=False,header=False) # after cleaning the data, to_csv can be used to save the cleaned result (header=False omits the column names)
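
The `index=False` argument in the commented-out `to_csv` call matters when the file is read back. The sketch below uses in-memory `io.StringIO` buffers in place of real files, and a small made-up DataFrame, to show that writing the index produces an extra `Unnamed: 0` column on reload.

```python
import io
import pandas as pd

# Hypothetical two-row table, used only for this round trip
df = pd.DataFrame({"ID": [1001, 1002], "price": [22.0, 10.0]})

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
no_index = io.StringIO()
df.to_csv(no_index, index=False)
no_index.seek(0)
reloaded = pd.read_csv(no_index)

print(reloaded_with_index.columns.tolist())
print(reloaded.columns.tolist())
```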

In [42]:
df3

Unnamed: 0,date,ID,class,price
0,2018/1/1,1001,A,22.0
1,2018/1/2,1002,B,10.0
2,2018/1/3,1003,A,31.0
3,2018/1/4,1004,C,
4,2018/1/5,1005,D,21.0
5,2018/1/6,1002,B,12.0


In [43]:
df3.dropna()

Unnamed: 0,date,ID,class,price
0,2018/1/1,1001,A,22.0
1,2018/1/2,1002,B,10.0
2,2018/1/3,1003,A,31.0
4,2018/1/5,1005,D,21.0
5,2018/1/6,1002,B,12.0
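
`dropna()` with no arguments drops every row that has a missing value in any column. The sketch below recreates the table shown above from an inline string (a stand-in for `data.csv`, which is an assumption about the file's contents), counts the missing values per column first, and then restricts the check to the `price` column with `subset`.

```python
import io
import pandas as pd

# In-memory stand-in for ./data.csv, copied from the table displayed above
csv_text = """date,ID,class,price
2018/1/1,1001,A,22.0
2018/1/2,1002,B,10.0
2018/1/3,1003,A,31.0
2018/1/4,1004,C,
2018/1/5,1005,D,21.0
2018/1/6,1002,B,12.0
"""
df3 = pd.read_csv(io.StringIO(csv_text))

# Count missing values in each column before deciding what to drop
missing_per_column = df3.isna().sum()
print(missing_per_column)

# Drop only the rows whose price is missing; df3 itself is not modified
cleaned = df3.dropna(subset=["price"])
print(cleaned)
```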


In [49]:
print(f'Using the average price to fill the na value; avg is {df3["price"].mean()}')
df3.fillna({'price': df3['price'].mean()})  # fill only the price column; a bare scalar would fill NaN in every column

Using the average price to fill the na value; avg is 19.2


Unnamed: 0,date,ID,class,price
0,2018/1/1,1001,A,22.0
1,2018/1/2,1002,B,10.0
2,2018/1/3,1003,A,31.0
3,2018/1/4,1004,C,19.2
4,2018/1/5,1005,D,21.0
5,2018/1/6,1002,B,12.0
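
Like `dropna`, `fillna` returns a new object and leaves `df3` unchanged, so the result has to be assigned to keep it. A minimal sketch, again using an inline stand-in for `data.csv` (an assumption about the file's contents, copied from the table above); the name `filled` is chosen for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for ./data.csv, copied from the table displayed above
csv_text = """date,ID,class,price
2018/1/1,1001,A,22.0
2018/1/2,1002,B,10.0
2018/1/3,1003,A,31.0
2018/1/4,1004,C,
2018/1/5,1005,D,21.0
2018/1/6,1002,B,12.0
"""
df3 = pd.read_csv(io.StringIO(csv_text))

# mean() skips NaN by default, so this averages the five known prices
mean_price = df3["price"].mean()

# Keep the filled result in a new DataFrame; df3 still contains the NaN
filled = df3.assign(price=df3["price"].fillna(mean_price))

print(filled)
print(df3["price"].isna().sum())
```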
