# Pandas Usage

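## Series and DataFrame are its two main data structures
### A Series is a one-dimensional indexed array
### A DataFrame is a tabular data structure with both column-level and row-level indexes
### Pandas is an excellent tool for preprocessing datasets and offers highly optimized performance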

In [3]:
import pandas as pd
series_1 = pd.Series([2, 9, 0, 1]) # Creating a series object
print(series_1.values) # Print values of the series object

[2 9 0 1]


In [5]:
series_1.index # Default index of the series object

RangeIndex(start=0, stop=4, step=1)

In [6]:
series_1.index = ['a', 'b', 'c', 'd'] # Setting index of the series object
series_1['d'] # Fetching element using new index

1
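
The lookup above uses plain square brackets with a label. As a minimal sketch, the same element can be fetched explicitly by label with `.loc` or by position with `.iloc`; the variable names `by_label`, `by_position`, and `subset` are illustrative only.

```python
import pandas as pd

# Same data as above, with the labels supplied at construction time.
series_1 = pd.Series([2, 9, 0, 1], index=['a', 'b', 'c', 'd'])

by_label = series_1.loc['d']        # label-based lookup
by_position = series_1.iloc[3]      # position-based lookup (fourth element)
subset = series_1.loc[['a', 'c']]   # a list of labels returns a Series

print(by_label, by_position)
print(subset.tolist())
```

Using `.loc` and `.iloc` makes it explicit whether a key is meant as a label or as a position, which matters once the index itself contains integers.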

### Creating dataframe using pandas

In [22]:
class_data = {'Names':['John', 'Ryan', 'Emily'], 
             'Standard':[7, 5, 8],
             'Subject':['English', 'Mathematics', 'Science']}
class_df = pd.DataFrame(class_data, index=['Student1', 'Student2', 'Student3'],
                       columns=['Names', 'Standard', 'Subject'])
print(class_df)

          Names  Standard      Subject
Student1   John         7      English
Student2   Ryan         5  Mathematics
Student3  Emily         8      Science


In [10]:
class_df.Names

Student1     John
Student2     Ryan
Student3    Emily
Name: Names, dtype: object
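
Attribute access such as `class_df.Names` only works when the column name is a valid Python identifier. A small sketch of the bracket-based alternatives follows; the frame is rebuilt here so the snippet runs on its own, and the names `names`, `subset`, and `ryan_subject` are illustrative.

```python
import pandas as pd

class_df = pd.DataFrame(
    {'Names': ['John', 'Ryan', 'Emily'],
     'Standard': [7, 5, 8],
     'Subject': ['English', 'Mathematics', 'Science']},
    index=['Student1', 'Student2', 'Student3'])

names = class_df['Names']                            # one column -> Series
subset = class_df[['Names', 'Subject']]              # list of columns -> DataFrame
ryan_subject = class_df.loc['Student2', 'Subject']   # single cell by row and column label

print(type(names).__name__, type(subset).__name__, ryan_subject)
```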

### Add new entry to the dataframe

In [24]:
import numpy as np
class_df.loc['Student4'] = ['Robin', np.nan, 'History'] # Add a row by label with .loc
class_df.T # Take transpose of the dataframe

         Student1     Student2 Student3 Student4
Names        John         Ryan    Emily    Robin
Standard        7            5        8      NaN
Subject   English  Mathematics  Science  History
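
Adding a row that contains `np.nan` has a side effect that shows up in the later outputs: an integer column cannot hold NaN, so it is upcast to float. The sketch below illustrates this on a smaller, hypothetical two-column frame named `df`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Names': ['John', 'Ryan'], 'Standard': [7, 5]},
                  index=['Student1', 'Student2'])
dtype_before = df['Standard'].dtype    # integer column

# Assigning to a label that does not exist yet appends a new row.
df.loc['Student3'] = ['Robin', np.nan]
dtype_after = df['Standard'].dtype     # upcast so the column can hold NaN

print(dtype_before, '->', dtype_after)
```

This is why `Standard` is printed as `7.0` rather than `7` once the row with the missing value has been added.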


In [25]:
class_df.sort_values(by='Standard') # Sort rows by one column

          Names  Standard      Subject
Student2   Ryan       5.0  Mathematics
Student1   John       7.0      English
Student3  Emily       8.0      Science
Student4  Robin       NaN      History


### Adding one more column to the dataframe as Series object

In [27]:
col_entry = pd.Series(['A', 'B', 'A+', 'C'],
                     index=['Student1', 'Student2', 'Student3', 'Student4'])
class_df['Grade'] = col_entry
print(class_df)

          Names  Standard      Subject Grade
Student1   John       7.0      English     A
Student2   Ryan       5.0  Mathematics     B
Student3  Emily       8.0      Science    A+
Student4  Robin       NaN      History     C


### Filling the missing entries in the dataframe, inplace

In [41]:
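class_df.fillna(10, inplace=True) # inplace=True modifies class_df directly; the default inplace=False returns a filled copy and leaves class_df unchanged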
print(class_df)

          Names  Standard      Subject Grade
Student1   John       7.0      English     A
Student2   Ryan       5.0  Mathematics     B
Student3  Emily       8.0      Science    A+
Student4  Robin      10.0      History     C
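
The effect of the `inplace` flag can be checked on a small, hypothetical frame `df`. This is only a sketch; the variable names `filled` and `missing_before` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Standard': [7.0, np.nan]}, index=['Student1', 'Student4'])

# Default (inplace=False): a new DataFrame is returned, df keeps its NaN.
filled = df.fillna(10)
missing_before = int(df['Standard'].isna().sum())

# inplace=True: df itself is modified.
df.fillna(10, inplace=True)
missing_after = int(df['Standard'].isna().sum())

print(missing_before, missing_after)
```

Assigning the result back (`df = df.fillna(10)`) is an equivalent and often more readable alternative to `inplace=True`.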


### Concatenation of 2 dataframes

In [42]:
student_age = pd.DataFrame(data={'Age':[13, 10, 15, 18]},
                          index=['Student1', 'Student2', 'Student3', 'Student4'])
print(student_age)

          Age
Student1   13
Student2   10
Student3   15
Student4   18


In [43]:
class_data = pd.concat([class_df, student_age], axis = 1)
print(class_data)

          Names  Standard      Subject Grade  Age
Student1   John       7.0      English     A   13
Student2   Ryan       5.0  Mathematics     B   10
Student3  Emily       8.0      Science    A+   15
Student4  Robin      10.0      History     C   18
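
With `axis=1`, `pd.concat` aligns rows by index label rather than by position. The following sketch uses small hypothetical frames (`left`, `right`, `extra`) to show the alignment and the difference between the default outer join and `join='inner'`.

```python
import pandas as pd

left = pd.DataFrame({'Names': ['John', 'Ryan']}, index=['Student1', 'Student2'])
# Same labels as left, but listed in a different order.
right = pd.DataFrame({'Age': [10, 13]}, index=['Student2', 'Student1'])

aligned = pd.concat([left, right], axis=1)   # rows are matched by label

# A frame that is missing Student2.
extra = pd.DataFrame({'Age': [13]}, index=['Student1'])
outer = pd.concat([left, extra], axis=1)                # Student2 gets NaN for Age
inner = pd.concat([left, extra], axis=1, join='inner')  # only labels present in both

print(aligned)
print(outer)
print(inner)
```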


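### The map function applies an arbitrary function to each element of a column or row individually
### The apply function applies an arbitrary function to all elements of a column or row at once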

### MAP Function

In [44]:
class_data['Subject'] = class_data['Subject'].map(lambda x: x+'Sub')
class_data['Subject']

Student1        EnglishSub
Student2    MathematicsSub
Student3        ScienceSub
Student4        HistorySub
Name: Subject, dtype: object
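
Besides a function, `Series.map` also accepts a dictionary, in which case values without a matching key become NaN. A brief sketch follows; the abbreviation codes in the dictionary are made up for illustration.

```python
import pandas as pd

subjects = pd.Series(['English', 'Mathematics', 'Science'],
                     index=['Student1', 'Student2', 'Student3'])

# Element-wise function, as in the cell above.
suffixed = subjects.map(lambda x: x + 'Sub')

# Dictionary lookup; 'Science' has no entry, so it maps to NaN.
codes = subjects.map({'English': 'ENG', 'Mathematics': 'MAT'})

print(suffixed.tolist())
print(codes.isna().tolist())
```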

### APPLY Function

In [45]:
def age_add(x): # Define a function that increments the age by 1
    return x + 1
print('-----Old values-----')
print(class_data['Age'])
print('-----New values-----')
print(class_data['Age'].apply(age_add)) # Apply age_add to every value in the Age column

-----Old values-----
Student1    13
Student2    10
Student3    15
Student4    18
Name: Age, dtype: int64
-----New values-----
Student1    14
Student2    11
Student3    16
Student4    19
Name: Age, dtype: int64
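
On a whole DataFrame, `apply` passes an entire column (default `axis=0`) or an entire row (`axis=1`) to the function. The sketch below uses a hypothetical numeric frame `df`; the computed quantities are chosen only to illustrate the two axes.

```python
import pandas as pd

df = pd.DataFrame({'Standard': [7, 5, 8], 'Age': [13, 10, 15]},
                  index=['Student1', 'Student2', 'Student3'])

# axis=0 (default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_gap = df.apply(lambda row: row['Age'] - row['Standard'], axis=1)

# For simple arithmetic, a vectorized expression gives the same result as apply.
age_plus_one = df['Age'] + 1

print(col_range.to_dict())
print(row_gap.tolist())
print(age_plus_one.tolist())
```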


### Changing datatype of the column

In [46]:
class_data['Grade'] = class_data['Grade'].astype('category')
class_data.Grade.dtypes

CategoricalDtype(categories=['A', 'A+', 'B', 'C'], ordered=False)
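
The categories above are unordered and listed alphabetically. If the grades should compare as a ranking, one option is an ordered categorical. In this sketch the ranking `C < B < A < A+` is an assumption made for the example.

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A+', 'C'])

# Assumed ranking from lowest to highest.
grade_type = pd.CategoricalDtype(categories=['C', 'B', 'A', 'A+'], ordered=True)
ordered = grades.astype(grade_type)

best = ordered.max()                       # highest category present
at_least_a = (ordered >= 'A').tolist()     # comparisons follow the category order
ranked = ordered.sort_values().tolist()    # sorts by rank, not alphabetically

print(best, at_least_a, ranked)
```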

### Storing the results

In [47]:
class_data.to_csv('class_dataset.csv', index=False)
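
Note that `index=False` drops the row labels (`Student1` and so on) from the file. The sketch below writes to a string instead of a file so it has no side effects; the frame `df` and the column label `Student` are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({'Names': ['John', 'Ryan'], 'Age': [13, 10]},
                  index=['Student1', 'Student2'])

# With no path given, to_csv returns the CSV text.
without_index = df.to_csv(index=False)        # row labels are not written
with_index = df.to_csv(index_label='Student') # row labels kept under a named column

# Reading back, using the saved label column as the index again.
restored = pd.read_csv(io.StringIO(with_index), index_col='Student')

print(without_index)
print(restored)
```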

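### The combining functions (concat, merge, append), groupby, and pivot_table are used heavily in data-processing tasks; note that DataFrame.append was removed in pandas 2.0 in favor of concat
### Reference: http://pandas.pydata.org/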
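
As a rough sketch of how `merge`, `groupby`, and `pivot_table` fit together, the example below joins two small frames on a key column and then aggregates. The names and ages follow the data used above, while the grade assignments are hypothetical.

```python
import pandas as pd

students = pd.DataFrame({'Names': ['John', 'Ryan', 'Emily', 'Robin'],
                         'Grade': ['A', 'B', 'B', 'A']})   # hypothetical grades
ages = pd.DataFrame({'Names': ['John', 'Ryan', 'Emily', 'Robin'],
                     'Age': [13, 10, 15, 18]})

# merge: SQL-style join on a shared key column.
merged = students.merge(ages, on='Names')

# groupby: split by Grade, then aggregate Age within each group.
mean_age = merged.groupby('Grade')['Age'].mean()

# pivot_table: the same kind of aggregation, returned as a table.
oldest = merged.pivot_table(values='Age', index='Grade', aggfunc='max')

print(merged)
print(mean_age)
print(oldest)
```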