The two most important data structures in pandas are:
1. Series
2. DataFrame

A Series contains two arrays: one holds the data, which can be of any data type, and the other holds the labels for the data, called the <b>Index</b>. A Series is created with the <b>Series()</b> constructor.

In [1]:
import pandas as pd
import numpy as np

In [2]:
s = pd.Series([2, 4, 6, 8])
s

0    2
1    4
2    6
3    8
dtype: int64

In [3]:
# Creating a Series with a custom index
s = pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])
s

a    2
b    4
c    6
d    8
dtype: int64

In [4]:
s.values

array([2, 4, 6, 8])

In [5]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

We can select elements of a Series the same way we select elements of an array.

In [6]:
# Using the positional index (plain s[0] on a labelled Series is deprecated in recent pandas)
s.iloc[0]

2

In [7]:
# Using the index label
s['a']

2

In [8]:
s[0:3]

a    2
b    4
c    6
dtype: int64

In [9]:
s[['a','b', 'c']]

a    2
b    4
c    6
dtype: int64
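
One detail worth noting about the slices above: a positional slice excludes its end point, while a label slice includes it. A minimal sketch, rebuilding the same Series so that it runs on its own (the names `by_position` and `by_label` are only illustrative):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])

# Positional slice: the end position 3 is excluded, as with Python lists
by_position = s.iloc[0:3]

# Label slice: the end label 'c' is included
by_label = s.loc['a':'c']

# Both selections contain the labels a, b and c
print(by_position.equals(by_label))
```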

Assigning values to a Series

In [10]:
s.iloc[0] = 1  # set by position; plain s[0] = 1 on a labelled Series is deprecated in recent pandas
s

a    1
b    4
c    6
d    8
dtype: int64

In [13]:
s['b'] = 1
s

a    1
b    1
c    6
d    8
dtype: int64

Defining a Series from a NumPy array

In [15]:
arr = np.array([1,2,3,4])
ser1 = pd.Series(arr)
ser1

0    1
1    2
2    3
3    4
dtype: int64

In [16]:
ser2 = pd.Series(s)
ser2

a    1
b    1
c    6
d    8
dtype: int64

In [17]:
# Building a Series from an array does not copy the data here: changing the third element of arr also changes ser1
arr[2] = -2
ser1

0    1
1    2
2   -2
3    4
dtype: int64
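
If this sharing is not wanted, one option is to copy the array before building the Series. The sketch below assumes nothing beyond NumPy's `copy()` method; whether `pd.Series(arr)` shares memory by default may depend on the pandas version, so the explicit copy is the cautious choice. The name `ser_copy` is illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# Copy the array explicitly so the Series owns its own data
ser_copy = pd.Series(arr.copy())

# Modify the original array afterwards
arr[2] = -2

# The Series still holds the original values
print(ser_copy.tolist())
```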

Filtering values

In [18]:
s[s<8]

a    1
b    1
c    6
dtype: int64
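
Several conditions can be combined in the same way. A small sketch, assuming the Series values at this point in the notebook (1, 1, 6, 8); `between` and `extremes` are illustrative names:

```python
import pandas as pd

s = pd.Series([1, 1, 6, 8], index=['a', 'b', 'c', 'd'])

# Each condition needs its own parentheses; use & for "and", | for "or"
between = s[(s > 1) & (s < 8)]
extremes = s[(s == 1) | (s == 8)]

print(between)
print(extremes)
```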

Operations and Mathematical Functions

In [20]:
s / 2

a    0.5
b    0.5
c    3.0
d    4.0
dtype: float64

In [21]:
np.log(s)

a    0.000000
b    0.000000
c    1.791759
d    2.079442
dtype: float64

In [22]:
serd = pd.Series([1, 0, 2, 1, 2, 3], index = ['white', 'white', 'blue', 'green', 'green', 'yellow'])

In [23]:
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

Evaluating Values

In [24]:
serd.unique()

array([1, 0, 2, 3])

In [26]:
serd.value_counts()

2    2
1    2
3    1
0    1
dtype: int64

In [27]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [28]:
serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64

NaN Values:
NaN (Not a Number) is used within pandas data structures to indicate an empty field or a value that cannot be defined numerically.

In [29]:
s2 = pd.Series([5, -3, np.nan, 14])  # np.nan: the np.NaN alias was removed in NumPy 2.0
s2

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [30]:
s2.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [32]:
s2.notnull()

0     True
1     True
2    False
3     True
dtype: bool
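
Once missing values have been located, they are usually either dropped or replaced. A short sketch of both options, using the same data as `s2`; the replacement value 0 is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

s2 = pd.Series([5, -3, np.nan, 14])

# Remove the missing entries; the remaining rows keep their original labels
dropped = s2.dropna()

# Replace the missing entries with a chosen value instead
filled = s2.fillna(0)

print(dropped)
print(filled)
```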

Series as Dictionaries

In [34]:
mydict = {'red' : 2000, 'blue' : 1000, 'yellow' : 500, 'orange' : 1000}
myseries = pd.Series(mydict)
myseries

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

In [36]:
colors = ['red', 'yellow', 'orange', 'blue', 'green']
myseries = pd.Series(mydict, index = colors)
myseries

red       2000.0
yellow     500.0
orange    1000.0
blue      1000.0
green        NaN
dtype: float64

Operations between Series

In [38]:
mydict2 = {'red' : 400, 'yellow' : 1000, 'black' : 700}
myseries2 = pd.Series(mydict2)
myseries2

red        400
yellow    1000
black      700
dtype: int64

In [40]:
# Only values whose labels appear in both Series are added; any other label yields NaN
myseries + myseries2

black        NaN
blue         NaN
green        NaN
orange       NaN
red       2400.0
yellow    1500.0
dtype: float64
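
When a label missing from one side should count as zero rather than produce NaN, the `add` method with `fill_value` can be used. This is a sketch with slightly simplified data (the 'green' and 'orange' entries are left out); `total` is an illustrative name:

```python
import pandas as pd

myseries = pd.Series({'red': 2000, 'blue': 1000, 'yellow': 500})
myseries2 = pd.Series({'red': 400, 'yellow': 1000, 'black': 700})

# A label present in only one Series is treated as 0 in the other
total = myseries.add(myseries2, fill_value=0)

print(total)
```

A label whose value is missing on both sides would still come out as NaN.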

A DataFrame has two index arrays. The first, like the index of a Series, labels the rows. The second labels the columns.

In [81]:
data = {
    'color' : ['yellow', 'green', 'black', 'pink', 'blue'],
    'object' : ['ball', 'blanket', 'pen', 'pillow', 'handphone'],
    'price' : [20, 100, 15, 75, 1250]
}

In [82]:
frame = pd.DataFrame(data)
frame

    color     object  price
0  yellow       ball     20
1   green    blanket    100
2   black        pen     15
3    pink     pillow     75
4    blue  handphone   1250


In [83]:
frame2 = pd.DataFrame(data, columns = ['object', 'price'])
frame2

      object  price
0       ball     20
1    blanket    100
2        pen     15
3     pillow     75
4  handphone   1250


In [84]:
frame3 = pd.DataFrame(data, columns = ['object', 'price'], index = ['one', 'two', 'three', 'four', 'five'])
frame3

          object  price
one         ball     20
two      blanket    100
three        pen     15
four      pillow     75
five   handphone   1250


Selecting Elements

In [85]:
frame.columns

Index(['color', 'object', 'price'], dtype='object')

In [86]:
frame3.index

Index(['one', 'two', 'three', 'four', 'five'], dtype='object')

In [87]:
frame.values

array([['yellow', 'ball', 20],
       ['green', 'blanket', 100],
       ['black', 'pen', 15],
       ['pink', 'pillow', 75],
       ['blue', 'handphone', 1250]], dtype=object)

In [88]:
# View the contents of one column
frame['price']

0      20
1     100
2      15
3      75
4    1250
Name: price, dtype: int64

In [89]:
frame.price

0      20
1     100
2      15
3      75
4    1250
Name: price, dtype: int64

In [90]:
frame.iloc[0]

color     yellow
object      ball
price         20
Name: 0, dtype: object

In [91]:
frame.iloc[[1, 4]]

   color     object  price
1  green    blanket    100
4   blue  handphone   1250


In [92]:
frame3.loc['one']

object    ball
price       20
Name: one, dtype: object

In [93]:
frame[0:2]

    color   object  price
0  yellow     ball     20
1   green  blanket    100


In [94]:
frame

    color     object  price
0  yellow       ball     20
1   green    blanket    100
2   black        pen     15
3    pink     pillow     75
4    blue  handphone   1250


In [95]:
frame['object'].iloc[4]

'handphone'
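
A single cell can also be read in one indexing step instead of chaining a column selection with `iloc`. The sketch below rebuilds the same frame and shows three equivalent ways; the variable names are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['yellow', 'green', 'black', 'pink', 'blue'],
    'object': ['ball', 'blanket', 'pen', 'pillow', 'handphone'],
    'price': [20, 100, 15, 75, 1250],
})

by_label = frame.loc[4, 'object']    # row label 4, column 'object'
by_position = frame.iloc[4, 1]       # fifth row, second column
scalar_only = frame.at[4, 'object']  # label-based accessor for single values

print(by_label, by_position, scalar_only)
```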

In [96]:
frame.index.name = 'id'
frame.columns.name = 'feature'
frame

feature   color     object  price
id
0        yellow       ball     20
1         green    blanket    100
2         black        pen     15
3          pink     pillow     75
4          blue  handphone   1250


Changing the internal data of a DataFrame

In [97]:
# Adding a new column
frame['supplier'] = 'Jakarta'
frame

feature   color     object  price  supplier
id
0        yellow       ball     20   Jakarta
1         green    blanket    100   Jakarta
2         black        pen     15   Jakarta
3          pink     pillow     75   Jakarta
4          blue  handphone   1250   Jakarta


In [98]:
frame['supplier'] = ['Jakarta', 'Bandung', 'Surabaya', 'Malang', 'Jakarta']
frame

feature   color     object  price  supplier
id
0        yellow       ball     20   Jakarta
1         green    blanket    100   Bandung
2         black        pen     15  Surabaya
3          pink     pillow     75    Malang
4          blue  handphone   1250   Jakarta


In [99]:
# Change a single value with one .loc call. Chained indexing such as
# frame['color'].iloc[2] = 'white' raises a SettingWithCopyWarning and may not update frame.
frame.loc[2, 'color'] = 'white'
frame


feature   color     object  price  supplier
id
0        yellow       ball     20   Jakarta
1         green    blanket    100   Bandung
2         white        pen     15  Surabaya
3          pink     pillow     75    Malang
4          blue  handphone   1250   Jakarta


Membership of a Value

In [100]:
# '100' is a string, so it does not match the integer price 100; only 'pen' matches
frame.isin(['100', 'pen'])

feature  color  object  price  supplier
id
0        False   False  False     False
1        False   False  False     False
2        False    True  False     False
3        False   False  False     False
4        False   False  False     False


In [101]:
frame[frame.isin(['100', 'pen'])]

feature  color  object  price  supplier
id
0          NaN     NaN    NaN       NaN
1          NaN     NaN    NaN       NaN
2          NaN     pen    NaN       NaN
3          NaN     NaN    NaN       NaN
4          NaN     NaN    NaN       NaN
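
To keep whole rows that contain a match, rather than a frame masked with NaN, the boolean result of `isin` can be reduced per row. This sketch passes a dictionary so that each column is checked against its own list of values; the frame is rebuilt here and the names `mask` and `matching_rows` are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['yellow', 'green', 'white', 'pink', 'blue'],
    'object': ['ball', 'blanket', 'pen', 'pillow', 'handphone'],
    'price': [20, 100, 15, 75, 1250],
})

# Keys are column names; columns not listed are never marked as matching
mask = frame.isin({'object': ['pen'], 'price': [100]})

# Keep every row in which at least one column matched
matching_rows = frame[mask.any(axis=1)]

print(matching_rows)
```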


In [102]:
frame['description'] = 'Good Quality'
frame

feature   color     object  price  supplier   description
id
0        yellow       ball     20   Jakarta  Good Quality
1         green    blanket    100   Bandung  Good Quality
2         white        pen     15  Surabaya  Good Quality
3          pink     pillow     75    Malang  Good Quality
4          blue  handphone   1250   Jakarta  Good Quality


In [103]:
# deleting a column

del frame['description']
frame

feature   color     object  price  supplier
id
0        yellow       ball     20   Jakarta
1         green    blanket    100   Bandung
2         white        pen     15  Surabaya
3          pink     pillow     75    Malang
4          blue  handphone   1250   Jakarta
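
`del` changes the frame in place. When the original should stay untouched, `drop` returns a new frame without the column. A minimal sketch on a reduced frame; `without_price` is an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({
    'object': ['ball', 'blanket', 'pen'],
    'price': [20, 100, 15],
    'supplier': ['Jakarta', 'Bandung', 'Surabaya'],
})

# drop returns a new DataFrame and leaves frame unchanged
without_price = frame.drop(columns=['price'])

print(without_price.columns.tolist())
print(frame.columns.tolist())
```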


Filtering

In [104]:
frame[(frame['price'] < 80)]

feature   color  object  price  supplier
id
0        yellow    ball     20   Jakarta
2         white     pen     15  Surabaya
3          pink  pillow     75    Malang
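
Filtering rows and choosing columns can be done in one step with `loc`. The sketch below rebuilds the frame and keeps only two columns of the cheaper items; `cheap_items` is an illustrative name:

```python
import pandas as pd

frame = pd.DataFrame({
    'color': ['yellow', 'green', 'white', 'pink', 'blue'],
    'object': ['ball', 'blanket', 'pen', 'pillow', 'handphone'],
    'price': [20, 100, 15, 75, 1250],
    'supplier': ['Jakarta', 'Bandung', 'Surabaya', 'Malang', 'Jakarta'],
})

# First argument: boolean row filter; second argument: columns to keep
cheap_items = frame.loc[frame['price'] < 80, ['object', 'price']]

print(cheap_items)
```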


### Creating a DataFrame from a Nested Dictionary
The outer keys become the column names.
The inner keys become the index labels.

In [106]:
nestdict = {'income' : {2010 : 8000, 2011 : 8500, 2012 : 9000},
           'outcome' :{2010 : 7000, 2011 : 7250 }}
frame4 = pd.DataFrame(nestdict)
frame4

      income  outcome
2010    8000   7000.0
2011    8500   7250.0
2012    9000      NaN
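
The mapping from keys to labels can be checked directly, as in the sketch below. It also shows `DataFrame.from_dict` with `orient='index'`, which is one way to have the outer keys label the rows instead; `by_rows` is an illustrative name:

```python
import pandas as pd

nestdict = {'income': {2010: 8000, 2011: 8500, 2012: 9000},
            'outcome': {2010: 7000, 2011: 7250}}

frame4 = pd.DataFrame(nestdict)
print(frame4.columns.tolist())  # outer keys
print(frame4.index.tolist())    # union of the inner keys

# An inner key missing from one column is filled with NaN
print(pd.isna(frame4.loc[2012, 'outcome']))

# Outer keys as row labels instead of column names
by_rows = pd.DataFrame.from_dict(nestdict, orient='index')
print(by_rows.index.tolist())
```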


Transposing a DataFrame

In [107]:
frame4.T

           2010    2011    2012
income   8000.0  8500.0  9000.0
outcome  7000.0  7250.0     NaN


In [110]:
ser = pd.Series(np.arange(5), index = ['red', 'green', 'black', 'white', 'purple'])
ser

red       0
green     1
black     2
white     3
purple    4
dtype: int64

Index-related methods

In [111]:
ser.idxmin()

'red'

In [112]:
ser.idxmax()

'purple'

In [113]:
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

In [114]:
serd['white']

white    1
white    0
dtype: int64

In [115]:
serd.index.is_unique

False

In [116]:
frame.index.is_unique

True
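
When an index is not unique, it can help to see which labels repeat and, if that suits the data, to aggregate them. The sketch below uses the same values as `serd`; summing the duplicates is only one possible choice, and `dup_mask` and `totals` are illustrative names:

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

# True for every label that has already appeared earlier in the index
dup_mask = serd.index.duplicated()

# Collapse duplicate labels by summing their values
totals = serd.groupby(level=0).sum()

print(dup_mask)
print(totals)
print(totals.index.is_unique)
```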