In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


FROM:
https://pandas.pydata.org/pandas-docs/stable/10min.html
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro

In [3]:
 pd.__version__

u'0.18.1'

In [4]:
data = np.random.randn(5)  # alternatively: list(range(5))
index=['a', 'b', 'c', 'd', 'e']

# Series

In [5]:
# Create a Series: a one-dimensional labeled array that can hold any data type.
# The axis labels are collectively referred to as the index.
s = pd.Series(data, index=index)
# If no index is passed, a default integer index (0, 1, 2, ...) is created.
t = pd.Series(data)

In [6]:
s

a   -1.226261
b    0.789539
c    0.744442
d    2.409003
e    0.883068
dtype: float64

In [7]:
t

0   -1.226261
1    0.789539
2    0.744442
3    2.409003
4    0.883068
dtype: float64

If data is a dict and an index is passed, the values in data corresponding to the labels in the index are pulled out. If no index is passed, an index is constructed from the sorted keys of the dict, if possible (pandas 0.23 and later keep the dict's insertion order instead).

In [20]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}


In [21]:
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

If an index is passed, the values in data corresponding to the labels in the index are pulled out. Labels that are missing from the dict (here 'd') get NaN:

In [22]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.

In [23]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [24]:
s[0]

0.51496804724857004

In [26]:
s[:3]

a    0.514968
b   -0.818135
c   -0.460008
dtype: float64

In [27]:
s[s > s.median()]

a    0.514968
d    1.629933
dtype: float64

In [28]:
s

a    0.514968
b   -0.818135
c   -0.460008
d    1.629933
e   -0.051556
dtype: float64

In [29]:
sum(s) / float(len(s))

0.16304036704465358

In [30]:
np.exp(s)

a    1.673585
b    0.441254
c    0.631279
d    5.103534
e    0.949750
dtype: float64

A Series is like a fixed-size dict in that you can get and set values by index label:

In [32]:
s['d']

1.6299331662158092

In [33]:
'e' in s

True

In [34]:
'f' in s

False
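Setting a value by label works the same way as getting one. A minimal sketch, using a small hand-built Series so the result is reproducible (the name prices and its values are illustrative and not part of the notebook):

```python
import pandas as pd

# Illustrative Series; the labels and values are made up for this sketch.
prices = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Overwrite the value stored under an existing label, dict-style.
prices['b'] = 20.0

print(prices['b'])
print(list(prices.index))
```

The index is unchanged by the assignment; only the value under 'b' is replaced.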

If a label is not contained in the index, an exception is raised: s['f'] raises a KeyError. With the get method, a missing label returns None or a specified default:

In [41]:
s.get('f')

In [42]:
s.get('f', np.nan)

nan
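The difference between bracket access and get can be sketched as follows. This is only an illustration on a hand-built Series (s_demo is a hypothetical name, chosen to avoid overwriting the notebook's s):

```python
import numpy as np
import pandas as pd

# Illustrative Series with two labels.
s_demo = pd.Series([1.0, 2.0], index=['a', 'b'])

# Bracket access with an unknown label raises KeyError.
try:
    s_demo['f']
except KeyError as err:
    print('KeyError for label', err)

# get() returns None, or the supplied default, instead of raising.
missing = s_demo.get('f')
fallback = s_demo.get('f', np.nan)
print(missing, fallback)
```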

When doing data analysis, looping through a Series value by value is usually not necessary, just as with raw NumPy arrays. A Series can also be passed into most NumPy functions that expect an ndarray.

In [43]:
 s + s

a    1.029936
b   -1.636271
c   -0.920016
d    3.259866
e   -0.103112
dtype: float64

In [44]:
s

a    0.514968
b   -0.818135
c   -0.460008
d    1.629933
e   -0.051556
dtype: float64

The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found in one Series or the other, the result for that label is marked as missing (NaN). Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

Note: In general, the default result of operations between differently indexed objects is the union of the indexes, in order to avoid loss of information. Having an index label, even though the data is missing, is typically important information as part of a computation. You can of course drop labels with missing data via the dropna function.
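The alignment behaviour described above can be sketched with two small Series whose indexes only partly overlap. The names left and right and their values are illustrative assumptions, not data from the notebook:

```python
import pandas as pd

# Two Series that share the labels 'b' and 'c' only.
left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# Addition aligns on labels; the result index is the union a, b, c, d.
total = left + right
print(total)

# Labels present in only one operand are NaN and can be dropped afterwards.
cleaned = total.dropna()
print(cleaned)
```

Here 'a' and 'd' exist in only one operand, so they come out as NaN in total and disappear from cleaned.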

Series can also have a 'name' attribute:

In [4]:
s = pd.Series(np.random.randn(5), name='something')

In [5]:
s

0    0.425571
1    0.779531
2    0.067159
3    1.513270
4   -2.371322
Name: something, dtype: float64

The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as you will see below.
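As a small illustration of automatic naming, the sketch below builds a throwaway DataFrame (frame is a hypothetical name, and the values are made up) and inspects the name of a column slice and of a row slice:

```python
import pandas as pd

# Illustrative two-column frame.
frame = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'b'])

# A column slice is a Series named after the column label.
column = frame['one']
print(column.name)

# A row slice is a Series named after the row label.
row = frame.loc['a']
print(row.name)
```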

You can rename a Series with the pandas.Series.rename() method (new in version 0.18.0). It returns a new Series and leaves the original unchanged:

In [8]:
s2=s.rename("different")

In [9]:
s2

0    0.425571
1    0.779531
2    0.067159
3    1.513270
4   -2.371322
Name: different, dtype: float64

In [10]:
s

0    0.425571
1    0.779531
2    0.067159
3    1.513270
4   -2.371322
Name: something, dtype: float64

In [11]:
s2.name

'different'

# DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame


Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.

## From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.

In [12]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [15]:
d

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [13]:
df=pd.DataFrame(d)

In [14]:
df

| | one | two |
|---|-----|-----|
| a | 1.0 | 1.0 |
| b | 2.0 | 2.0 |
| c | 3.0 | 3.0 |
| d | NaN | 4.0 |


In [16]:
pd.DataFrame(d, index=['d','b','a'])

| | one | two |
|---|-----|-----|
| d | NaN | 4.0 |
| b | 2.0 | 2.0 |
| a | 1.0 | 1.0 |


In [18]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'one','three'])

| | two | one | three |
|---|-----|-----|-------|
| d | 4.0 | NaN | NaN |
| b | 2.0 | 2.0 | NaN |
| a | 1.0 | 1.0 | NaN |


The row and column labels can be accessed through the index and columns attributes, respectively.
Note: when a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.


In [19]:
pd.DataFrame(d, index=['d'])

| | one | two |
|---|-----|-----|
| d | NaN | 4.0 |


In [20]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [21]:
df.columns

Index(['one', 'two'], dtype='object')
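This section states that nested dicts are first converted to Series, but the cells above only use a dict of Series. A minimal sketch of the nested-dict case, assuming the same labels and values as d (the name nested is introduced here for illustration):

```python
import pandas as pd

# Same data as the dict of Series above, written as a dict of dicts.
# The inner keys play the role of the row labels.
nested = {
    'one': {'a': 1.0, 'b': 2.0, 'c': 3.0},
    'two': {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 4.0},
}

frame = pd.DataFrame(nested)
print(frame)

# 'd' is missing from the inner dict of 'one', so that cell is NaN.
print(frame.loc['d', 'one'], frame.loc['d', 'two'])
```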

## From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the resulting index will be range(n), where n is the array length.

In [29]:
d={'one':[1.,2.,3.,4.],'two':[4.,3.,2.,1.]}

In [30]:
pd.DataFrame(d)

| | one | two |
|---|-----|-----|
| 0 | 1.0 | 4.0 |
| 1 | 2.0 | 3.0 |
| 2 | 3.0 | 2.0 |
| 3 | 4.0 | 1.0 |


In [31]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

| | one | two |
|---|-----|-----|
| a | 1.0 | 4.0 |
| b | 2.0 | 3.0 |
| c | 3.0 | 2.0 |
| d | 4.0 | 1.0 |
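The length requirement stated at the start of this section can be checked directly. The sketch below is illustrative only: builds_ok is a hypothetical helper written for this example, and it simply reports whether the DataFrame constructor raises ValueError for the given inputs:

```python
import pandas as pd


def builds_ok(data, index=None):
    """Return True if pandas accepts the inputs, False if it raises ValueError."""
    try:
        pd.DataFrame(data, index=index)
        return True
    except ValueError:
        return False


same_length = {'one': [1.0, 2.0, 3.0, 4.0], 'two': [4.0, 3.0, 2.0, 1.0]}
ragged = {'one': [1.0, 2.0, 3.0], 'two': [4.0, 3.0]}

# Equal-length lists work, with or without a matching index.
print(builds_ok(same_length))
print(builds_ok(same_length, index=['a', 'b', 'c', 'd']))

# Lists of different lengths, or an index of the wrong length, are rejected.
print(builds_ok(ragged))
print(builds_ok(same_length, index=['a', 'b', 'c']))
```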


## From structured or record array

This case is handled identically to a dict of arrays.

In [8]:
# 'S10' is a 10-byte string field; the 'a10' spelling is a deprecated alias for it
data = np.zeros((2,), dtype=[('A', 'i4'), ('B', 'f4'), ('C', 'S10')])

In [9]:
data

array([(0, 0.0, ''), (0, 0.0, '')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [10]:
data[:] = [(1,2.,'Hello'), (2,3.,"World")]

In [11]:
data

array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [12]:
pd.DataFrame(data)

| | A | B | C |
|---|---|-----|-------|
| 0 | 1 | 2.0 | Hello |
| 1 | 2 | 3.0 | World |


In [37]:
pd.DataFrame(data, index=['first', 'second'])

| | A | B | C |
|--------|---|-----|----------|
| first | 1 | 2.0 | b'Hello' |
| second | 2 | 3.0 | b'World' |


In [38]:
pd.DataFrame(data, columns=['C', 'A', 'B'])

| | C | A | B |
|---|----------|---|-----|
| 0 | b'Hello' | 1 | 2.0 |
| 1 | b'World' | 2 | 3.0 |


DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.

## From a dict of tuples

You can automatically create a multi-indexed frame by passing a dict whose keys (and nested keys) are tuples:

In [40]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

| | | a | a | a | b | b |
|---|---|-----|-----|-----|-----|------|
| | | a | b | c | a | b |
| A | B | 4.0 | 1.0 | 5.0 | 8.0 | 10.0 |
| A | C | 3.0 | 2.0 | 6.0 | 7.0 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9.0 |


# Summarising

In [121]:
import string
indeces=list(string.ascii_lowercase)

In [122]:
data=np.random.randn(5)
index=indeces[0:5]

In [123]:
len(data)

5

## Series

In [124]:
Series=pd.Series(data,index, name='Example of Series')

In [125]:
Series

a   -0.028489
b    0.963577
c    1.964645
d   -0.447764
e    0.420159
Name: Example of Series, dtype: float64

Appending series

In [126]:
s2 = pd.Series(np.random.randn(5), index)
# Series.append was removed in pandas 2.0; pd.concat works in old and new versions
pd.concat([Series, s2])

a   -0.028489
b    0.963577
c    1.964645
d   -0.447764
e    0.420159
a   -1.123656
b    1.388948
c    1.039195
d    1.209024
e   -0.342967
dtype: float64

Summing two Series element by element (values are matched by index label):

In [127]:
Series.add(s2)

a   -1.152145
b    2.352525
c    3.003841
d    0.761260
e    0.077192
dtype: float64

## DataFrame 

In [128]:
Gdata=pd.DataFrame(data, index=index, columns=['Column One'])

In [129]:
Gdata

| | Column One |
|---|------------|
| a | -0.028489 |
| b | 0.963577 |
| c | 1.964645 |
| d | -0.447764 |
| e | 0.420159 |


Add a column to a DataFrame

In [130]:
Gdata['Column two'] = pd.Series(np.random.randn(len(data)), index=Gdata.index)

In [131]:
Gdata

| | Column One | Column two |
|---|------------|------------|
| a | -0.028489 | -0.064099 |
| b | 0.963577 | -0.347377 |
| c | 1.964645 | 1.437631 |
| d | -0.447764 | -0.372191 |
| e | 0.420159 | -0.169463 |


In [132]:
Gdata['Column three'] = pd.Series(np.random.randn(len(data)), index=Gdata.index)

In [133]:
Gdata

| | Column One | Column two | Column three |
|---|------------|------------|--------------|
| a | -0.028489 | -0.064099 | -1.5624 |
| b | 0.963577 | -0.347377 | 1.210186 |
| c | 1.964645 | 1.437631 | 0.410536 |
| d | -0.447764 | -0.372191 | 0.039054 |
| e | 0.420159 | -0.169463 | -0.980397 |


In [134]:
Gdata.columns

Index([u'Column One', u'Column two', u'Column three'], dtype='object')

Add rows

In [135]:
indeces[5:7]

['f', 'g']

In [136]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
GdataNew = pd.concat([Gdata, pd.DataFrame([[5, 6, 4], [7, 8, 6]], index=indeces[5:7], columns=Gdata.columns)])

In [137]:
GdataNew

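| | Column One | Column two | Column three |
|---|------------|------------|--------------|
| a | -0.028489 | -0.064099 | -1.5624 |
| b | 0.963577 | -0.347377 | 1.210186 |
| c | 1.964645 | 1.437631 | 0.410536 |
| d | -0.447764 | -0.372191 | 0.039054 |
| e | 0.420159 | -0.169463 | -0.980397 |
| f | 5.0 | 6.0 | 4.0 |
| g | 7.0 | 8.0 | 6.0 |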


## Indexing / Selection

The basics of indexing are as follows for a data frame named df:

|Operation | Syntax | Result|
|----------|--------|-------|
|Select column | df[col] | Series|
|Select row by label | df.loc[label] | Series|
|Select row by integer location| df.iloc[loc]|Series|
|Slice rows | df[5:10] | DataFrame|
|Select rows by boolean vector | df[bool_vec]| DataFrame|



Row selection, for example, returns a Series whose index is the columns of the DataFrame:

Select a particular column:

In [138]:
GdataNew.columns

Index([u'Column One', u'Column two', u'Column three'], dtype='object')

In [139]:
GdataNew['Column One']

a   -0.028489
b    0.963577
c    1.964645
d   -0.447764
e    0.420159
f    5.000000
g    7.000000
Name: Column One, dtype: float64

In [140]:
GdataNew.loc[GdataNew.index=='a']

| | Column One | Column two | Column three |
|---|------------|------------|--------------|
| a | -0.028489 | -0.064099 | -1.5624 |


In [143]:
GdataNew.iloc[0]

Column One     -0.028489
Column two     -0.064099
Column three   -1.562400
Name: a, dtype: float64

One element of one column (returned here as a one-row Series, not a scalar):

In [141]:
GdataNew.loc[GdataNew.index=='a']['Column One']

a   -0.028489
Name: Column One, dtype: float64
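To get the single value itself rather than a one-row Series, a row label and a column label can be passed together. A minimal sketch on a small stand-in frame (frame and its values are illustrative, not the notebook's GdataNew):

```python
import pandas as pd

# Illustrative frame with the same kind of labels as above.
frame = pd.DataFrame(
    {'Column One': [0.5, 1.5], 'Column two': [2.5, 3.5]},
    index=['a', 'b'],
)

# Row label and column label in one .loc call return a scalar.
by_label = frame.loc['a', 'Column One']

# .at is the scalar-only variant of label-based access.
by_at = frame.at['a', 'Column One']

# .iloc does the same by integer position (row 0, column 0).
by_position = frame.iloc[0, 0]

print(by_label, by_at, by_position)
```

Using one indexer call also avoids the chained lookup of selecting rows first and then a column.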

In [146]:
GdataNew[0:2]

| | Column One | Column two | Column three |
|---|------------|------------|--------------|
| a | -0.028489 | -0.064099 | -1.5624 |
| b | 0.963577 | -0.347377 | 1.210186 |


In [149]:
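# Select the first two columns by position; GdataNew[range(0, 2)] relied on a
# positional fallback that newer pandas versions no longer provide
GdataNew.iloc[:, 0:2]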

| | Column One | Column two |
|---|------------|------------|
| a | -0.028489 | -0.064099 |
| b | 0.963577 | -0.347377 |
| c | 1.964645 | 1.437631 |
| d | -0.447764 | -0.372191 |
| e | 0.420159 | -0.169463 |
| f | 5.0 | 6.0 |
| g | 7.0 | 8.0 |
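The last row of the indexing table, selecting rows by a boolean vector, is not demonstrated above. A minimal sketch, assuming a small stand-in frame with made-up values (frame and mask are illustrative names):

```python
import pandas as pd

# Illustrative frame; the values are chosen so the filter result is easy to read.
frame = pd.DataFrame(
    {'Column One': [-0.5, 1.0, 5.0], 'Column two': [0.1, -0.3, 6.0]},
    index=['a', 'b', 'c'],
)

# A comparison on a column yields a boolean Series aligned with the rows.
mask = frame['Column One'] > 0

# Indexing with the mask keeps only the rows where it is True.
positive = frame[mask]
print(positive)
```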
