# Learning Pandas from Scratch (Part II, DataFrame Basics)

In [99]:
import pandas as pd
import numpy as np

## Create DataFrame

A pandas DataFrame can be created from a dictionary of columns or from a list of rows.

In [100]:
dataset1 = {'fruit name': ['apple', 'orange', 'banana', 'grape', 'mango'],
            'quantity (kg)': [100, 200, 400, 300, 600],
            'price': [2.4, 1.3, 1.5, 3.4, 5.5] }

df = pd.DataFrame(dataset1)
type(df)

pandas.core.frame.DataFrame

In [101]:
df

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3
2,banana,400,1.5
3,grape,300,3.4
4,mango,600,5.5


In [102]:
l1 = [('apple',100,2.4), ('orange',200,1.3), ('banana',400,1.5), ('grape',300,3.4), ('mango',600,5.5)]

df = pd.DataFrame(l1,columns=('fruit name','quantity (kg)','price'))
df

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3
2,banana,400,1.5
3,grape,300,3.4
4,mango,600,5.5


A DataFrame object has two parts: the index and the data columns. Most commonly the index holds integer values starting from 0, but a multi-level index (MultiIndex) can also be used. The data columns are generally several columns of different data types, each with its own name.  
The overall data structure is very similar to a relational database table.
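
As a small sketch of the two parts, the snippet below builds a frame with a custom string index and inspects the index and the column names separately. The variable `demo` and the row labels `'a'`, `'b'`, `'c'` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; the row labels 'a', 'b', 'c' are illustrative only.
demo = pd.DataFrame(
    {'fruit name': ['apple', 'orange', 'banana'],
     'price': [2.4, 1.3, 1.5]},
    index=['a', 'b', 'c'],
)

row_labels = list(demo.index)       # the index part
column_labels = list(demo.columns)  # the names of the data columns
print(row_labels)
print(column_labels)
```

When no `index` argument is given, pandas falls back to the integer index 0, 1, 2, ... seen in the outputs above.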

In [103]:
df.dtypes

fruit name        object
quantity (kg)      int64
price            float64
dtype: object

In [104]:
dict(df.dtypes)

{'fruit name': dtype('O'),
 'quantity (kg)': dtype('int64'),
 'price': dtype('float64')}

In [105]:
df.shape

(5, 3)

## Select columns

Use a column label to get one specific column, or a list of labels to get several columns.

In [106]:
names = df['fruit name']
type(names)

pandas.core.series.Series

In [107]:
names

0     apple
1    orange
2    banana
3     grape
4     mango
Name: fruit name, dtype: object

With two pairs of square brackets, the inner pair is a list of labels, so the result is a DataFrame rather than a Series.

In [108]:
df_names = df[['fruit name']]
type(df_names)

pandas.core.frame.DataFrame

In [109]:
df_names

Unnamed: 0,fruit name
0,apple
1,orange
2,banana
3,grape
4,mango


In [110]:
df_names_weight = df[['fruit name', 'quantity (kg)']]
type(df_names_weight)

pandas.core.frame.DataFrame

In [111]:
df_names_weight

Unnamed: 0,fruit name,quantity (kg)
0,apple,100
1,orange,200
2,banana,400
3,grape,300
4,mango,600


## Using loc[] and iloc[]

Rows can be selected with loc[] (label based: it uses the index labels) or iloc[] (integer position based).

***
**⚠️NOTE**  
<span style="color:red;">loc[]</span> and <span style="color:red;">iloc[]</span> use <span style="color:red;">square brackets []</span>, not parentheses ().

***
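
The difference between label-based and position-based selection is easier to see when the index labels are not 0, 1, 2. The sketch below assumes a hypothetical frame `demo` with index labels 10, 20, 30 (not from the notebook).

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30) differ from the positions (0, 1, 2).
demo = pd.DataFrame({'price': [2.4, 1.3, 1.5]}, index=[10, 20, 30])

by_label = demo.loc[10, 'price']    # label 10 -> first row
by_position = demo.iloc[0, 0]       # position 0 -> also the first row

label_slice = demo.loc[10:20]       # label slice: both ends are included -> 2 rows
position_slice = demo.iloc[0:1]     # position slice: the stop is excluded -> 1 row
```

With the default integer index the labels and positions coincide, which is why `df.loc[0]` and `df.iloc[0]` return the same row in this notebook.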

In [112]:
rows = df.loc[0]
type(rows)

pandas.core.series.Series

In [113]:
rows

fruit name       apple
quantity (kg)      100
price              2.4
Name: 0, dtype: object

In [114]:
rows = df.iloc[:2]  # iloc uses integer positions; the stop position is not included.
type(rows)

pandas.core.frame.DataFrame

In [115]:
rows

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3


In [116]:
rows.shape

(2, 3)

### More selection capability with loc[] / iloc[]

When the selection is one specific cell, the result is a scalar (a Python or NumPy value).  
When the selection is a single row or a single column, the result is a pandas Series.  
When the selection spans more than one row and more than one column, the result is a pandas DataFrame.

In [117]:
df.loc[1,'fruit name']

'orange'

In [118]:
type(df.loc[1,'fruit name'])

str

In [119]:
df.loc[3:4,'price'] # a label slice includes both the start and the end.

3    3.4
4    5.5
Name: price, dtype: float64

In [120]:
type(df.loc[3:4,'price']) 

pandas.core.series.Series

In [121]:
df.loc[3:4,'fruit name':'price'] # a label slice includes both the start and the end.

Unnamed: 0,fruit name,quantity (kg),price
3,grape,300,3.4
4,mango,600,5.5


In [122]:
type(df.loc[3:6,'fruit name':'price'])

pandas.core.frame.DataFrame

In [123]:
df.iloc[[1,3], [0,2]]  # non-contiguous rows and columns.

Unnamed: 0,fruit name,price
1,orange,1.3
3,grape,3.4


In [124]:
df.loc[[1,3],['fruit name','price']]

Unnamed: 0,fruit name,price
1,orange,1.3
3,grape,3.4


### Selection according to conditions

In [125]:
df['price']>4

0    False
1    False
2    False
3    False
4     True
Name: price, dtype: bool

In [126]:
df[df['price']>4]  # a boolean Series with the same index as the DataFrame can be used to filter rows.

Unnamed: 0,fruit name,quantity (kg),price
4,mango,600,5.5


In [127]:
df['fruit name'].isin(['orange', 'apple'])  # test whether each value is one of the listed values

0     True
1     True
2    False
3    False
4    False
Name: fruit name, dtype: bool

In [128]:
df[df['fruit name'].isin(['orange', 'apple'])]

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3


In [129]:
(df['fruit name']=='orange') | (df['fruit name']=='apple')   # '|' combines two boolean Series with an element-wise OR

0     True
1     True
2    False
3    False
4    False
Name: fruit name, dtype: bool

In [130]:
df[ (df['fruit name']=='orange') | (df['fruit name']=='apple') ]

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3
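
Besides `|`, boolean Series can be combined with `&` (element-wise AND) and inverted with `~` (element-wise NOT). The sketch below rebuilds the fruit data under the hypothetical name `fruits` so that it runs on its own.

```python
import pandas as pd

fruits = pd.DataFrame({'fruit name': ['apple', 'orange', 'banana', 'grape', 'mango'],
                       'quantity (kg)': [100, 200, 400, 300, 600],
                       'price': [2.4, 1.3, 1.5, 3.4, 5.5]})

# AND: both conditions must hold; each comparison needs its own parentheses.
cheap_and_plentiful = fruits[(fruits['price'] < 2) & (fruits['quantity (kg)'] > 150)]

# NOT: invert a boolean Series with ~.
not_orange = fruits[~fruits['fruit name'].isin(['orange'])]
```

The parentheses matter because `&`, `|` and `~` bind more tightly than comparison operators in Python.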


In [131]:
df['fruit name'].map(lambda x: 'o' in x)  # filter with a custom rule; this one finds all names that contain the letter 'o'

0    False
1     True
2    False
3    False
4     True
Name: fruit name, dtype: bool

In [132]:
df[df['fruit name'].map(lambda x: 'o' in x)]

Unnamed: 0,fruit name,quantity (kg),price
1,orange,200,1.3
4,mango,600,5.5


In [133]:
df.isin(['apple',100,1.3])

Unnamed: 0,fruit name,quantity (kg),price
0,True,True,False
1,False,False,True
2,False,False,False
3,False,False,False
4,False,False,False


In [134]:
df[df.isin(['apple',100,1.3])]  # the result keeps the same shape; cells that do not match are replaced with NaN.

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100.0,
1,,,1.3
2,,,
3,,,
4,,,


## Add column data

In [135]:
dataset2 = ['U.S.A.', 'Spain', 'Philippines', 'Australia', 'Malaysia']
df['origin']=dataset2 # 'origin' is a new label that is not among the current columns, so a new column is created.
df

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,U.S.A.
1,orange,200,1.3,Spain
2,banana,400,1.5,Philippines
3,grape,300,3.4,Australia
4,mango,600,5.5,Malaysia


In [136]:
df['origin']=['Malaysia' for _ in range(df.shape[0])]    # 'origin' is already among the current columns, so the existing column is updated.
df

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,Malaysia
1,orange,200,1.3,Malaysia
2,banana,400,1.5,Malaysia
3,grape,300,3.4,Malaysia
4,mango,600,5.5,Malaysia


Assigning to 'origin' the first time creates a new column; assigning to it again updates the existing values.

## Add row data

In [137]:
dataset3 = {'fruit name':'pear','quantity (kg)':450, 'price': 2.0, 'origin':'Thailand'}
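# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result.
df_new = pd.concat([df, pd.DataFrame([dataset3])], ignore_index=True)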
df_new

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,Malaysia
1,orange,200,1.3,Malaysia
2,banana,400,1.5,Malaysia
3,grape,300,3.4,Malaysia
4,mango,600,5.5,Malaysia
5,pear,450,2.0,Thailand


## Delete data (row or column)

In [138]:
df_new = df_new.drop(5, axis = 'index')
df_new

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,Malaysia
1,orange,200,1.3,Malaysia
2,banana,400,1.5,Malaysia
3,grape,300,3.4,Malaysia
4,mango,600,5.5,Malaysia


In [139]:
df_new = df_new.drop('origin', axis = 'columns')
df_new

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3
2,banana,400,1.5
3,grape,300,3.4
4,mango,600,5.5


## Change dtype

In [140]:
df_new.dtypes

fruit name        object
quantity (kg)      int64
price            float64
dtype: object

In [141]:
df_new['fruit name'] = df_new['fruit name'].astype('string')
df_new.dtypes

fruit name        string
quantity (kg)      int64
price            float64
dtype: object

In [142]:
df_new

Unnamed: 0,fruit name,quantity (kg),price
0,apple,100,2.4
1,orange,200,1.3
2,banana,400,1.5
3,grape,300,3.4
4,mango,600,5.5


## MultiIndex

In [143]:
idx = pd.MultiIndex.from_product([['Month 1', 'Month 2'], [0, 1, 2]])
idx

MultiIndex([('Month 1', 0),
            ('Month 1', 1),
            ('Month 1', 2),
            ('Month 2', 0),
            ('Month 2', 1),
            ('Month 2', 2)],
           )

In [144]:
arrays = [np.array(['Month 1','Month 1','Month 2','Month 2','Month 2']),
         np.array([0,1,0,1,2])]
idx = pd.MultiIndex.from_arrays(arrays)
idx

MultiIndex([('Month 1', 0),
            ('Month 1', 1),
            ('Month 2', 0),
            ('Month 2', 1),
            ('Month 2', 2)],
           )

In [145]:
tuples = [('Month 1',0),('Month 1', 1),('Month 2', 0),('Month 2', 1),('Month 2', 2)]
idx = pd.MultiIndex.from_tuples(tuples)
idx

MultiIndex([('Month 1', 0),
            ('Month 1', 1),
            ('Month 2', 0),
            ('Month 2', 1),
            ('Month 2', 2)],
           )

In [146]:
df

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,Malaysia
1,orange,200,1.3,Malaysia
2,banana,400,1.5,Malaysia
3,grape,300,3.4,Malaysia
4,mango,600,5.5,Malaysia


In [147]:
df_new=df.set_index(idx,drop=True)
df_new

Unnamed: 0,Unnamed: 1,fruit name,quantity (kg),price,origin
Month 1,0,apple,100,2.4,Malaysia
Month 1,1,orange,200,1.3,Malaysia
Month 2,0,banana,400,1.5,Malaysia
Month 2,1,grape,300,3.4,Malaysia
Month 2,2,mango,600,5.5,Malaysia


In [148]:
df_new.index

MultiIndex([('Month 1', 0),
            ('Month 1', 1),
            ('Month 2', 0),
            ('Month 2', 1),
            ('Month 2', 2)],
           )

In [149]:
df_new.index.get_level_values(0)

Index(['Month 1', 'Month 1', 'Month 2', 'Month 2', 'Month 2'], dtype='object')

In [150]:
df_new.index.get_level_values(1)

Int64Index([0, 1, 0, 1, 2], dtype='int64')

In [151]:
df_new.loc[('Month 1',1)]

fruit name         orange
quantity (kg)         200
price                 1.3
origin           Malaysia
Name: (Month 1, 1), dtype: object

In [152]:
df_new.loc[('Month 1',1):('Month 2',1)]  # note the slicing here, end index is included.

Unnamed: 0,Unnamed: 1,fruit name,quantity (kg),price,origin
Month 1,1,orange,200,1.3,Malaysia
Month 2,0,banana,400,1.5,Malaysia
Month 2,1,grape,300,3.4,Malaysia


In [153]:
df_new.loc['Month 1']

Unnamed: 0,fruit name,quantity (kg),price,origin
0,apple,100,2.4,Malaysia
1,orange,200,1.3,Malaysia


In [154]:
df_new.loc[(slice(None),slice(1,2)),:]

Unnamed: 0,Unnamed: 1,fruit name,quantity (kg),price,origin
Month 1,1,orange,200,1.3,Malaysia
Month 2,1,grape,300,3.4,Malaysia
Month 2,2,mango,600,5.5,Malaysia


In [155]:
df_new.loc[pd.IndexSlice[:,1:2],:]

Unnamed: 0,Unnamed: 1,fruit name,quantity (kg),price,origin
Month 1,1,orange,200,1.3,Malaysia
Month 2,1,grape,300,3.4,Malaysia
Month 2,2,mango,600,5.5,Malaysia


## DataFrame with arithmetic operators

In [156]:
df1 = pd.DataFrame({'name':[i for i in 'abcdefg'],
                    'value': [i for i in range(1,8)]})
df1

Unnamed: 0,name,value
0,a,1
1,b,2
2,c,3
3,d,4
4,e,5
5,f,6
6,g,7


In [157]:
df1.dtypes

name     object
value     int64
dtype: object

In [158]:
df2 = pd.DataFrame({'class':[i for i in 'abcdefghij'],
                    'value':[i for i in range(10,20)]})
df2

Unnamed: 0,class,value
0,a,10
1,b,11
2,c,12
3,d,13
4,e,14
5,f,15
6,g,16
7,h,17
8,i,18
9,j,19


In [159]:
df2.dtypes

class    object
value     int64
dtype: object

In [160]:
df3=df1+df2
df3

Unnamed: 0,class,name,value
0,,,11.0
1,,,13.0
2,,,15.0
3,,,17.0
4,,,19.0
5,,,21.0
6,,,23.0
7,,,
8,,,
9,,,


In [161]:
df3.dtypes

class    float64
name     float64
value    float64
dtype: object

In [162]:
df3=df2-df1
df3

Unnamed: 0,class,name,value
0,,,9.0
1,,,9.0
2,,,9.0
3,,,9.0
4,,,9.0
5,,,9.0
6,,,9.0
7,,,
8,,,
9,,,


When two DataFrames are combined arithmetically, cells whose index label and column label both match are calculated element by element. Where a label exists in only one of the frames, the missing side is treated as NaN, so the result at that row and column position is NaN.  
Because NaN is a floating-point value, the integer column in the result changes dtype to float64.
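
If NaN is not wanted for unmatched labels, the method form `DataFrame.add()` accepts a `fill_value`. The sketch below uses two small hypothetical frames, `left` and `right`, to compare the two behaviours.

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2, 3]})          # rows 0..2
right = pd.DataFrame({'value': [10, 11, 12, 13]})  # rows 0..3

plain = left + right                     # row 3 exists only in `right` -> NaN
filled = left.add(right, fill_value=0)   # the missing side is treated as 0 instead
```

In both results the column dtype is float64, because the alignment step introduces missing values before the addition happens.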

In [163]:
df2 = pd.DataFrame({'name':[i for i in 'abcdefghij'],
                    'value':[i for i in range(10,20)]})
df2

Unnamed: 0,name,value
0,a,10
1,b,11
2,c,12
3,d,13
4,e,14
5,f,15
6,g,16
7,h,17
8,i,18
9,j,19


In [164]:
df3=df1+df2
df3

Unnamed: 0,name,value
0,aa,11.0
1,bb,13.0
2,cc,15.0
3,dd,17.0
4,ee,19.0
5,ff,21.0
6,gg,23.0
7,,
8,,
9,,


Note that the + operator also works on strings: the strings in matching cells are concatenated.

In [165]:
sa = pd.Series(['s',100],index=['name','value'])
sa

name       s
value    100
dtype: object

In [166]:
df3 = df3+sa
df3

Unnamed: 0,name,value
0,aas,111.0
1,bbs,113.0
2,ccs,115.0
3,dds,117.0
4,ees,119.0
5,ffs,121.0
6,ggs,123.0
7,,
8,,
9,,


The Series is added to every row of the DataFrame, matching the Series index labels to the DataFrame column labels. If a Series label does not match any DataFrame column label, the value is treated as NaN, so that part of the result is also NaN.
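
To align a Series with the row index instead of the column labels, `DataFrame.add()` takes an `axis` argument. A minimal sketch with hypothetical names (`frame`, `per_column`, `per_row`):

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

per_column = pd.Series({'a': 100, 'b': 1000})  # labels match the column names
per_row = pd.Series([1, 2, 3])                 # labels match the row index 0, 1, 2

added_to_rows = frame + per_column                 # each row gets a+100, b+1000
added_to_cols = frame.add(per_row, axis='index')   # each column gets +1, +2, +3
```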

In [167]:
sb = pd.Series([200*i for i in range(10)])
sb

0       0
1     200
2     400
3     600
4     800
5    1000
6    1200
7    1400
8    1600
9    1800
dtype: int64

In [168]:
df3['value']=df3['value']+sb  # update one column of the DataFrame; the Series is aligned on the row index.
df3

Unnamed: 0,name,value
0,aas,111.0
1,bbs,313.0
2,ccs,515.0
3,dds,717.0
4,ees,919.0
5,ffs,1121.0
6,ggs,1323.0
7,,
8,,
9,,


### Transposition

In [169]:
df3.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
name,aas,bbs,ccs,dds,ees,ffs,ggs,,,
value,111.0,313.0,515.0,717.0,919.0,1121.0,1323.0,,,


The two dimensions, index and column labels, are swapped.

## Other data manipulation functions

In [170]:
df = pd.DataFrame(np.arange(1,21).reshape(4,5), index=['red','blue','green','purple'],columns=['apple','banana','mango','orange','grape'])
df

Unnamed: 0,apple,banana,mango,orange,grape
red,1,2,3,4,5
blue,6,7,8,9,10
green,11,12,13,14,15
purple,16,17,18,19,20


In [171]:
df.agg('mean')

apple      8.5
banana     9.5
mango     10.5
orange    11.5
grape     12.5
dtype: float64

In [173]:
df.agg('mean',axis=1)

red        3.0
blue       8.0
green     13.0
purple    18.0
dtype: float64

In [74]:
df.agg('sum')

apple     34
banana    38
mango     42
orange    46
grape     50
dtype: int64

In [75]:
df.agg(['std','median'])

Unnamed: 0,apple,banana,mango,orange,grape
std,6.454972,6.454972,6.454972,6.454972,6.454972
median,8.5,9.5,10.5,11.5,12.5


In [76]:
df.describe()

Unnamed: 0,apple,banana,mango,orange,grape
count,4.0,4.0,4.0,4.0,4.0
mean,8.5,9.5,10.5,11.5,12.5
std,6.454972,6.454972,6.454972,6.454972,6.454972
min,1.0,2.0,3.0,4.0,5.0
25%,4.75,5.75,6.75,7.75,8.75
50%,8.5,9.5,10.5,11.5,12.5
75%,12.25,13.25,14.25,15.25,16.25
max,16.0,17.0,18.0,19.0,20.0


In [77]:
df_new

Unnamed: 0,Unnamed: 1,fruit name,quantity (kg),price,origin
Month 1,0,apple,100,2.4,Malaysia
Month 1,1,orange,200,1.3,Malaysia
Month 2,0,banana,400,1.5,Malaysia
Month 2,1,grape,300,3.4,Malaysia
Month 2,2,mango,600,5.5,Malaysia


In [96]:
df_new.groupby(level=1).sum(numeric_only=True)  # aggregate only the numeric columns

Unnamed: 0,quantity (kg),price
0,500,3.9
1,500,4.7
2,600,5.5


In [97]:
df_new.groupby(level=1).mean(numeric_only=True)  # aggregate only the numeric columns

Unnamed: 0,quantity (kg),price
0,250.0,1.95
1,250.0,2.35
2,600.0,5.5


## Sorting and Ranking

In [80]:
df.sort_index()

Unnamed: 0,apple,banana,mango,orange,grape
blue,6,7,8,9,10
green,11,12,13,14,15
purple,16,17,18,19,20
red,1,2,3,4,5


In [81]:
df.sort_index(axis='columns')

Unnamed: 0,apple,banana,grape,mango,orange
red,1,2,5,3,4
blue,6,7,10,8,9
green,11,12,15,13,14
purple,16,17,20,18,19


In [82]:
df.sort_values('banana', axis='rows', ascending=False)

Unnamed: 0,apple,banana,mango,orange,grape
purple,16,17,18,19,20
green,11,12,13,14,15
blue,6,7,8,9,10
red,1,2,3,4,5


In [83]:
df.rank()

Unnamed: 0,apple,banana,mango,orange,grape
red,1.0,1.0,1.0,1.0,1.0
blue,2.0,2.0,2.0,2.0,2.0
green,3.0,3.0,3.0,3.0,3.0
purple,4.0,4.0,4.0,4.0,4.0


In [84]:
df['rank']=df['apple'].rank(ascending=False)
df

Unnamed: 0,apple,banana,mango,orange,grape,rank
red,1,2,3,4,5,4.0
blue,6,7,8,9,10,3.0
green,11,12,13,14,15,2.0
purple,16,17,18,19,20,1.0


In [85]:
df = pd.DataFrame({'Auction_ID':[123,123,123,123,124,124,124,125],
                   'Bid_Price':[9,7,6,2,3,2,1,1]})
dfx = df.groupby('Auction_ID').max()
dfx

Unnamed: 0_level_0,Bid_Price
Auction_ID,Unnamed: 1_level_1
123,9
124,3
125,1


In [86]:
df['Auction_Rank'] = df.groupby('Auction_ID')['Bid_Price'].rank(ascending=False)
df

Unnamed: 0,Auction_ID,Bid_Price,Auction_Rank
0,123,9,1.0
1,123,7,2.0
2,123,6,3.0
3,123,2,4.0
4,124,3,1.0
5,124,2,2.0
6,124,1,3.0
7,125,1,1.0


This ranks each bid by Bid_Price within its own auction group.
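
Ranks are returned as floats because tied values share the average of their ranks by default. The sketch below uses a hypothetical Series `bids` containing a tie to compare three values of the `method` parameter of `rank()`.

```python
import pandas as pd

bids = pd.Series([9, 7, 7, 2])   # hypothetical bids with a tie

average_rank = bids.rank(ascending=False)                 # ties share the mean rank
min_rank = bids.rank(ascending=False, method='min')       # ties share the lowest rank
dense_rank = bids.rank(ascending=False, method='dense')   # like 'min', but without gaps
```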

## Data cleaning

Data read into a DataFrame usually contains some errors, which need to be cleaned before analysis starts.  
Data cleaning includes:  
- For invalid data, decide whether to remove it or replace it with a valid value.  
- For data in different formats, convert it to one common type so that it can be processed together with the other data.
- For duplicated data, remove the duplicates and keep only one copy.

Functions that can be used for data cleaning:  
- dropna()  
- fillna()
- replace()
- duplicated()
- drop_duplicates()
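
A minimal sketch of how these functions fit together, using a hypothetical frame `raw` with one duplicated row, one missing age, and the placeholder string `'n/a'` (all made up for illustration):

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({'name': ['Alice', 'Bob', 'Bob', 'Carl'],
                    'age': [6, 8, 8, np.nan],
                    'grade': ['A', 'n/a', 'n/a', 'B']})

dup_mask = raw.duplicated()                 # True for rows that repeat an earlier row
deduped = raw.drop_duplicates()             # keep only the first copy of each row
replaced = deduped.replace('n/a', np.nan)   # turn the placeholder into a real missing value
filled = replaced.fillna({'age': 0})        # fill missing values in the 'age' column only
complete = replaced.dropna()                # or drop every row that has a missing value
```

Each call returns a new DataFrame and leaves its input unchanged, so the steps can be chained or inspected one at a time.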

## Data persistence

Python has a 'pickle' format that saves an object to a file so that the original data can be restored from it later. pandas exposes it through the DataFrame method to_pickle() and the function read_pickle().

In [87]:
df = pd.DataFrame({'class':['year 1','year 1','year 1','year 2','year 2','year 3'],
                   'name':['Alice','Jason','John','Bob','Carl','Simon'],
                   'age':[6,7,6,8,8,9],
                   'stuff':[[0],['a'],['item x'],[2,3],['a',2],[True]],
                   'stuff1':[(1,2),(2,3),(4,),(5,),(6,),(3,4,5)],
                   'stuff2':[{'a'},{1,2},{'x':3},{4},{True},{'x'}]})
df

Unnamed: 0,class,name,age,stuff,stuff1,stuff2
0,year 1,Alice,6,[0],"(1, 2)",{a}
1,year 1,Jason,7,[a],"(2, 3)","{1, 2}"
2,year 1,John,6,[item x],"(4,)",{'x': 3}
3,year 2,Bob,8,"[2, 3]","(5,)",{4}
4,year 2,Carl,8,"[a, 2]","(6,)",{True}
5,year 3,Simon,9,[True],"(3, 4, 5)",{x}


In [88]:
df.to_pickle('temp.pkl')

In [89]:
df1=pd.read_pickle('temp.pkl')
df1

Unnamed: 0,class,name,age,stuff,stuff1,stuff2
0,year 1,Alice,6,[0],"(1, 2)",{a}
1,year 1,Jason,7,[a],"(2, 3)","{1, 2}"
2,year 1,John,6,[item x],"(4,)",{'x': 3}
3,year 2,Bob,8,"[2, 3]","(5,)",{4}
4,year 2,Carl,8,"[a, 2]","(6,)",{True}
5,year 3,Simon,9,[True],"(3, 4, 5)",{x}


If you want a human-readable format that can be exchanged with other people, the data can be saved as a JSON file. By default, to_json() writes a JSON object keyed by column name and then by index label.

In [90]:
df.to_json('test.json')

In [91]:
df2 = pd.read_json('test.json')
df2

Unnamed: 0,class,name,age,stuff,stuff1,stuff2
0,year 1,Alice,6,[0],"[1, 2]",[a]
1,year 1,Jason,7,[a],"[2, 3]","[1, 2]"
2,year 1,John,6,[item x],[4],{'x': 3}
3,year 2,Bob,8,"[2, 3]",[5],[4]
4,year 2,Carl,8,"[a, 2]",[6],[True]
5,year 3,Simon,9,[True],"[3, 4, 5]",[x]


With the JSON format, the Python **objects are all converted to lists or dicts** (in the output above, the tuples and sets come back as lists); other object types are not supported.
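
The type change can be checked without writing a file, since `to_json()` returns a string when no path is given. The sketch below assumes a hypothetical one-column frame `original` that holds tuples.

```python
import io
import pandas as pd

# Hypothetical one-column frame holding tuples.
original = pd.DataFrame({'stuff': [(1, 2), (3,)]})

json_text = original.to_json()                    # no path given -> returns a JSON string
restored = pd.read_json(io.StringIO(json_text))   # parse the string back into a frame

first = restored.loc[0, 'stuff']    # was the tuple (1, 2)
second = restored.loc[1, 'stuff']   # was the tuple (3,)
```

JSON has no tuple type, so both values are expected to come back as plain lists.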

A database system can also store the data in a DataFrame so that it can be read back later. Keep in mind that **SQL databases only accept basic data such as numbers and strings.** Objects such as lists, dicts, or instances of custom classes cannot be stored directly.  
So the three object columns have to be removed, or their data converted to another format.  

In [92]:
df2 = df.drop(['stuff','stuff1','stuff2'],axis=1)
df2

Unnamed: 0,class,name,age
0,year 1,Alice,6
1,year 1,Jason,7
2,year 1,John,6
3,year 2,Bob,8
4,year 2,Carl,8
5,year 3,Simon,9


In [93]:
# No complicated SQL is needed here, so a simple SQLAlchemy engine is enough to create the table and save to it.
from sqlalchemy import create_engine
engine = create_engine(r'sqlite:///test.db')
df2.to_sql('frame', engine, if_exists='replace')

In [94]:
df3 = pd.read_sql('frame',engine)
df3

Unnamed: 0,index,class,name,age
0,0,year 1,Alice,6
1,1,year 1,Jason,7
2,2,year 1,John,6
3,3,year 2,Bob,8
4,4,year 2,Carl,8
5,5,year 3,Simon,9


One additional column, 'index', appears after reading. It is the index of the original DataFrame, which was saved as an ordinary column.
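
The extra column can also be avoided at write time or restored at read time. The sketch below is an alternative under stated assumptions: it uses the standard-library `sqlite3` module with an in-memory database instead of the SQLAlchemy engine used above, and the frame `people` and the table names are hypothetical.

```python
import sqlite3
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [6, 8]})

conn = sqlite3.connect(':memory:')   # throwaway in-memory database

# Option 1: do not write the index at all.
people.to_sql('no_index', conn, index=False)
back1 = pd.read_sql('SELECT * FROM no_index', conn)

# Option 2: write the index, then restore it on read with index_col.
people.to_sql('with_index', conn)
back2 = pd.read_sql('SELECT * FROM with_index', conn, index_col='index')

conn.close()
```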

In [95]:
df3 = df3.set_index('index',drop=True)
df3

Unnamed: 0_level_0,class,name,age
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,year 1,Alice,6
1,year 1,Jason,7
2,year 1,John,6
3,year 2,Bob,8
4,year 2,Carl,8
5,year 3,Simon,9


A DataFrame can also be saved as a .csv, Excel, or HTML file. These formats are also important ways of exchanging data.
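
As a short sketch of the CSV route, the snippet below round-trips a hypothetical frame `people` through CSV text in memory, so no file is created. Passing a file path to `to_csv()` and `read_csv()` works the same way.

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [6, 8]})

csv_text = people.to_csv(index=False)           # no path given -> returns the CSV as a string
restored = pd.read_csv(io.StringIO(csv_text))   # parse the text back into a DataFrame
```

Writing with `index=False` keeps the row index out of the file, so no extra column appears when the data is read back.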