<b>Reference</b>


1. <i>Python Data Science Handbook</i>


2. <a>https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/</a>


3. <a>https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html</a>

In [2]:
import numpy as np
import pandas as pd

<b>Introducing Pandas Objects</b>

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. Pandas provides a host of useful tools, methods, and functionality on top of basic data structures.

We mainly focus on three fundamental Pandas data structures: <b>the Series, DataFrame, and Index</b>.

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

In [21]:
# create a Series from a list (or array) of values
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.

In [22]:
print(data.values)
print(type(data.values))

[0.25 0.5  0.75 1.  ]
<class 'numpy.ndarray'>


In [23]:
print(data.index)
print(type(data.index))

RangeIndex(start=0, stop=4, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>


The index is an array-like object of type pd.Index (here a RangeIndex, which is a subclass of pd.Index).

In [24]:
# data can be accessed by the associated index
data[1]

0.5

In [25]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as generalized NumPy array

The essential difference between a Series and a one-dimensional NumPy array is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

A Series object has additional capabilities. For example, the index need not be an integer; it can consist of values of any desired type.

In [26]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = ['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

We can even use non-sequential indices.

In [153]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index = [2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [155]:
data[8] = 10 
data

2     0.25
5     0.50
3     0.75
7     1.00
8    10.00
dtype: float64
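When the labels are integers that do not match the positions, it helps to state which of the two is meant. The following is a minimal sketch that reuses the Series above together with the loc and iloc indexers, which are covered later in this notebook; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Same data as above: integer labels that are neither sorted nor contiguous
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]       # look up the label 5
by_position = data.iloc[1]   # look up the second element

# Assigning to a label that does not exist yet appends a new entry
data.loc[8] = 10

print(by_label, by_position)
print(data.index.tolist())
```

Both lookups return the same element here because the label 5 happens to sit at position 1.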

### Series as specialized dictionary

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This type information makes a Pandas Series much more efficient than a Python dictionary for certain operations.

In [5]:
# Series can be constructed directly from a Python dictionary
population_dict = {'California': 3025485,
                   'Texas': 235464346,
                   'New York': 934923425,
                   'Florida': 32572935023,
                   'Illinois': 1288212039,
                  }
population = pd.Series(population_dict)
population

California        3025485
Texas           235464346
New York        934923425
Florida       32572935023
Illinois       1288212039
dtype: int64

In [30]:
population['California']

3025485

In [31]:
population['California':'Illinois']

California        3025485
Texas           235464346
New York        934923425
Florida       32572935023
Illinois       1288212039
dtype: int64
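The dictionary analogy can be checked with a few operations. This is a small sketch with made-up values (the stock Series is hypothetical and not part of the notebook's data); it shows membership tests, a label slice, and one thing a plain dict cannot do.

```python
import pandas as pd

# Hypothetical values, used only for illustration
stock = pd.Series({'apples': 3, 'bananas': 5, 'cherries': 7, 'dates': 2})

has_bananas = 'bananas' in stock       # membership tests the index, like dict keys
subset = stock['bananas':'cherries']   # label slices include the end label
doubled = stock * 2                    # vectorized arithmetic, unlike a plain dict

print(has_bananas)
print(subset)
print(doubled)
```

Note that the label slice keeps 'cherries', which is why the slice from 'California' to 'Illinois' above returns all five entries.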

### Constructing Series object summary

In [16]:
# pd.Series(data, index = index)
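The general form pd.Series(data, index=index) accepts several kinds of data. The sketch below shows three common cases with small illustrative values; the variable names are assumptions made for this example.

```python
import pandas as pd

# data as a list, with an explicit index
from_list = pd.Series([2, 4, 6], index=['x', 'y', 'z'])

# data as a scalar: the value is repeated to fill the index
from_scalar = pd.Series(5, index=[100, 200, 300])

# data as a dict: the index argument selects and orders the keys to keep
from_dict = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

print(from_list)
print(from_scalar)
print(from_dict)
```

In the dict case, only the keys listed in index are kept, in the order given.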

## The Pandas DataFrame object

### DataFrame as a generalized NumPy array

A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [3]:
area_dict = {'California':38974283, 'New York':387420,'Texas':39084923, 'Florida':327900893, 'Illinois': 38724829}
area = pd.Series(area_dict)
area

California     38974283
New York         387420
Texas          39084923
Florida       327900893
Illinois       38724829
dtype: int64

In [11]:
states = pd.DataFrame({'Population': population, 'area': area})
states # sorted by row indices 

Unnamed: 0,Population,area
California,3025485,38974283
Florida,32572935023,327900893
Illinois,1288212039,38724829
New York,934923425,387420
Texas,235464346,39084923


In [39]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [42]:
states.columns

Index(['Population', 'area'], dtype='object')

We can also think of a DataFrame as a specialization of a dictionary. A DataFrame maps a column name to a Series of column data.

In [43]:
states['area']

California     38974283
Florida       327900893
Illinois       38724829
New York         387420
Texas          39084923
Name: area, dtype: int64

In [44]:
states['area']['California']

38974283
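Because each column is a Series, building a DataFrame from several Series aligns them on their index labels. The sketch below uses made-up figures and two Series whose labels only partly overlap, to show what happens to labels that are missing from one of them.

```python
import pandas as pd

# Hypothetical figures, for illustration only
population = pd.Series({'Texas': 29, 'Ohio': 12})
area = pd.Series({'Ohio': 45, 'Utah': 85})

# Rows are matched by label, not by position
states = pd.DataFrame({'population': population, 'area': area})
print(states)

# Column access returns a Series; a second lookup returns one value
ohio_area = states['area']['Ohio']
```

The row index of the result is the union of the two input indices, and cells with no matching label are filled with NaN.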

### Constructing DataFrame objects

In [6]:
# From a single Series object
pd.DataFrame(population, columns = ['population'])

Unnamed: 0,population
California,3025485
Texas,235464346
New York,934923425
Florida,32572935023
Illinois,1288212039


In [7]:
# From a list of dicts
data = [{'a': i, 'b': 2*i}
       for i in range(3)]
pd.DataFrame(data)

Unnamed: 0,a,b
0,0,0
1,1,2
2,2,4


In [9]:
pd.DataFrame([{'a': 1, 'b': 2},{'b': 3, 'c': 4}]) # missing keys are filled with NaN

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0


In [10]:
# from a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
            columns = ['foo', 'bar'],
            index = ['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.773187,0.308272
b,0.692775,0.439862
c,0.466604,0.021744


## The Pandas Index object

In [65]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

In [66]:
ind[1]

3

In [67]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [68]:
# some attributes of Index objects
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that Index objects are immutable.

In [69]:
# ind[1] = 0 raises a TypeError because an Index cannot be modified in place
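The immutability can be observed directly. This is a minimal sketch that catches the error and then builds a modified copy instead; the names raised and new_ind are only illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0              # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print("TypeError:", err)

# To "change" an Index, build a new one; the original is left untouched
new_ind = ind.delete(1).insert(1, 0)
print(new_ind.tolist())
print(ind.tolist())
```

Immutability makes it safer to share one Index between several Series and DataFrames, since no object can change it underneath the others.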

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way.

In [71]:
indA = pd.Index([1,2,3,4,7,8])
indB = pd.Index([3,4,7,8,9])
print(indA.intersection(indB))          # intersection
print(indA.union(indB))                 # union
print(indA.symmetric_difference(indB))  # symmetric difference

Int64Index([3, 4, 7, 8], dtype='int64')
Int64Index([1, 2, 3, 4, 7, 8, 9], dtype='int64')
Int64Index([1, 2, 9], dtype='int64')


Older pandas versions also accepted the operators &, | and ^ as shorthand for these set operations. That usage was later deprecated, so the explicit methods are the safer choice.

# Data Indexing and Selection

## Data Selection in Series

In [74]:
data = pd.Series([0.2, 0.4, 0.5, 0.7, 0.9], index = ['a','b','c','d','e'])
data

a    0.2
b    0.4
c    0.5
d    0.7
e    0.9
dtype: float64

In [75]:
data['b']

0.4

In [77]:
data[1:4]

b    0.4
c    0.5
d    0.7
dtype: float64

In [78]:
'a' in data

True

In [79]:
data.keys()

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [80]:
list(data.items())

[('a', 0.2), ('b', 0.4), ('c', 0.5), ('d', 0.7), ('e', 0.9)]

In [81]:
data['f'] = 1
data

a    0.2
b    0.4
c    0.5
d    0.7
e    0.9
f    1.0
dtype: float64

In [82]:
# slicing by explicit index
data['a':'c']

a    0.2
b    0.4
c    0.5
dtype: float64

In [83]:
# slicing by implicit index
data[0:2]

a    0.2
b    0.4
dtype: float64

In [85]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.4
c    0.5
d    0.7
dtype: float64

In [87]:
# fancy indexing
data[['a', 'e']]

a    0.2
e    0.9
dtype: float64

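From the examples above, we can see that slicing can be a source of confusion because two schemes are in play: explicit (label-based) indexing, where the end of the slice is included, and implicit (position-based) indexing, where it is excluded.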

In [89]:
data1 = pd.Series([1,2,3,4])
data1

0    1
1    2
2    3
3    4
dtype: int64

In [91]:
data1[0:2]
# this slices by position (implicit index), although we may have intended to slice by label (explicit index)

0    1
1    2
dtype: int64

To resolve the confusion between explicit and implicit indexing, we can use the indexer attributes loc (always label-based) and iloc (always position-based).

In [92]:
data1.loc[0:2] # explicit indexing

0    1
1    2
2    3
dtype: int64

In [93]:
data1.iloc[0:2] # implicit indexing

0    1
1    2
dtype: int64
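The difference between the two indexers is easier to see when the integer labels do not coincide with the positions. The sketch below uses a small illustrative Series (not one of the notebook's variables) with labels 1, 3 and 5.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = s.loc[1]          # label 1    -> 'a'
position_item = s.iloc[1]      # position 1 -> 'b'

label_slice = s.loc[1:3]       # labels 1 through 3, end included
position_slice = s.iloc[1:3]   # positions 1 and 2, end excluded

print(label_item, position_item)
print(label_slice.tolist(), position_slice.tolist())
```

Using loc and iloc explicitly keeps the code readable regardless of what kind of index the Series carries.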

## Data Selection in DataFrame

In [14]:
area = pd.DataFrame([[1,2],[3,4],[5,6],[7,8]], 
                    columns = ['a','b'], 
                    index = ['boston', 'new york', 'worcester','california'])
area

Unnamed: 0,a,b
boston,1,2
new york,3,4
worcester,5,6
california,7,8


In [15]:
area['a']

boston        1
new york      3
worcester     5
california    7
Name: a, dtype: int64

In [16]:
area.a

boston        1
new york      3
worcester     5
california    7
Name: a, dtype: int64

Note: if a column name conflicts with a method of the DataFrame, attribute-style access is not possible. Dictionary-style access such as area['a'] is therefore preferred over area.a.
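A short sketch of such a conflict follows. The frame df and its column named count are hypothetical, chosen because DataFrame already has a method called count.

```python
import pandas as pd

# Hypothetical frame with a column named like a DataFrame method
df = pd.DataFrame({'count': [10, 20], 'a': [1, 2]})

col_a = df.a              # works: 'a' does not clash with any attribute
count_attr = df.count     # this is the DataFrame.count method, not the column
count_col = df['count']   # dictionary-style access always returns the column

print(callable(count_attr))
print(count_col.tolist())
```

Attribute-style access silently returns the method here, which is why the bracket form is the more dependable habit.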

In [17]:
area['density'] = area['a'] / area['b']
area

Unnamed: 0,a,b,density
boston,1,2,0.5
new york,3,4,0.75
worcester,5,6,0.833333
california,7,8,0.875


We can view DataFrame as an enhanced two-dimensional array.

In [18]:
area.values

array([[1.        , 2.        , 0.5       ],
       [3.        , 4.        , 0.75      ],
       [5.        , 6.        , 0.83333333],
       [7.        , 8.        , 0.875     ]])

In [19]:
area.T

Unnamed: 0,boston,new york,worcester,california
a,1.0,3.0,5.0,7.0
b,2.0,4.0,6.0,8.0
density,0.5,0.75,0.833333,0.875


In [20]:
area

Unnamed: 0,a,b,density
boston,1,2,0.5
new york,3,4,0.75
worcester,5,6,0.833333
california,7,8,0.875


In [21]:
area.values[0]

array([1. , 2. , 0.5])

In [22]:
area['a']

boston        1
new york      3
worcester     5
california    7
Name: a, dtype: int64

In [23]:
area['a'][1:3]

new york     3
worcester    5
Name: a, dtype: int64

In [24]:
area['a'].iloc[1:3]

new york     3
worcester    5
Name: a, dtype: int64

In [25]:
area['a'].loc[['boston', 'new york']]

boston      1
new york    3
Name: a, dtype: int64

In [27]:
area.iloc[:3,:2]

Unnamed: 0,a,b
boston,1,2
new york,3,4
worcester,5,6


In [28]:
area.loc[:'california',:'a']

Unnamed: 0,a
boston,1
new york,3
worcester,5
california,7


The <b>ix</b> indexer allowed a hybrid of these two approaches, but it is deprecated (as the warning in the output shows) and was removed in pandas 1.0.

In [31]:
area.ix[:3,:'b'] # deprecated: use loc or iloc instead

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


Unnamed: 0,a,b
boston,1,2
new york,3,4
worcester,5,6


In [132]:
area.loc[area['b'] > 3, 'a']

new york      3
worcester     5
california    7
Name: a, dtype: int64

In [135]:
area

Unnamed: 0,a,b,density
boston,1,2,0.5
new york,3,4,0.75
worcester,5,6,0.833333
california,7,8,0.875


In [137]:
area.iloc[0, 2] = 10
area

Unnamed: 0,a,b,density
boston,1,2,10.0
new york,3,4,0.75
worcester,5,6,0.833333
california,7,8,0.875
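Assignment works with the label-based indexer as well, including with a boolean mask. This is a small sketch on a fresh copy of the frame; the values 20 and 0 are arbitrary.

```python
import pandas as pd

area = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                    columns=['a', 'b'],
                    index=['boston', 'new york', 'worcester', 'california'])

# Set a single cell by label (iloc does the same by position)
area.loc['boston', 'b'] = 20

# Set every row that matches a condition, in one column
area.loc[area['a'] > 3, 'b'] = 0

print(area)
```

Selecting the rows and the column in one loc call modifies the original frame directly.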


In [140]:
area['boston': 'worcester']

Unnamed: 0,a,b,density
boston,1,2,10.0
new york,3,4,0.75
worcester,5,6,0.833333


In [141]:
area[1:3]

Unnamed: 0,a,b,density
new york,3,4,0.75
worcester,5,6,0.833333


In [142]:
area[area['a'] > 3]

Unnamed: 0,a,b,density
worcester,5,6,0.833333
california,7,8,0.875


In [148]:
area.iloc[2:4, :3]

Unnamed: 0,a,b,density
worcester,5,6,0.833333
california,7,8,0.875


In [149]:
area.loc['worcester':'california','a':'density']

Unnamed: 0,a,b,density
worcester,5,6,0.833333
california,7,8,0.875


In [151]:
area[2:4]

Unnamed: 0,a,b,density
worcester,5,6,0.833333
california,7,8,0.875


In [152]:
area[2:4].iloc[:2,:3]

Unnamed: 0,a,b,density
worcester,5,6,0.833333
california,7,8,0.875
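The examples above rely on a convention that is easy to miss: with plain square brackets, a single label selects a column, while a slice or a boolean mask selects rows. The sketch below gathers the cases side by side on a fresh copy of the frame and checks them against the explicit loc and iloc forms; the variable names are illustrative.

```python
import pandas as pd

area = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]],
                    columns=['a', 'b'],
                    index=['boston', 'new york', 'worcester', 'california'])

column = area['a']                          # a single label selects a column
rows_by_position = area[1:3]                # an integer slice selects rows
rows_by_label = area['boston':'new york']   # a label slice selects rows, end included
rows_by_mask = area[area['a'] > 3]          # a boolean mask selects rows

# The same row selections written with the explicit indexers
print(rows_by_position.equals(area.iloc[1:3]))
print(rows_by_label.equals(area.loc['boston':'new york']))
print(rows_by_mask.index.tolist())
```

Writing the loc or iloc form makes the intent visible to the reader and avoids depending on these shortcuts.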
