In [2]:
import pandas as pd
import numpy as np

## Introduction to pandas Data Structures

- __Series and DataFrame__

### Series
- A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its _index_.
- Another way to think about a Series is as a fixed-length, ordered dict, since it maps index values to data values.
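
The dict analogy can be made concrete with a small sketch. The labels and values below are hypothetical, chosen only for illustration; the point is that membership tests and iteration work on labels the way they do on dict keys.

```python
import pandas as pd

# Hypothetical labels and values, used only to illustrate the dict analogy.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Like a dict, membership tests look at the labels, not the values.
has_label_y = 'y' in s      # a label, so True
has_value_20 = 20 in s      # a value, not a label, so False

# Like a dict, items() yields (label, value) pairs in index order.
pairs = list(s.items())

# A Series can be converted back to a plain dict.
as_dict = s.to_dict()
```

Unlike a plain dict, the Series keeps a fixed order and supports vectorized arithmetic on its values.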

In [3]:
obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The left column is the index. We did not specify one for the data, so a default index consisting of the integers 0 through N - 1 is created.

In [4]:
obj.values

array([ 4,  7, -5,  3])

In [5]:
obj.index   # like range(4)

RangeIndex(start=0, stop=4, step=1)

__Assigning an index to identify each data point with a label__

In [6]:
obj2 = pd.Series([4,7,-5,3], index=['d','b','a','c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [8]:
#selecting single values or a set of values using labels in the index
obj2['a']

-5

In [9]:
obj2['d']=6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [10]:
obj2[['c','a','d']]   #here the list of labels is interpreted as a list of indices, even though it contains strings instead of integers

c    3
a   -5
d    6
dtype: int64

In [11]:
#using NumPy-like operations preserves the index-value link
obj2[obj2>0]

d    6
b    7
c    3
dtype: int64

In [12]:
obj2*2

d    12
b    14
a   -10
c     6
dtype: int64

In [13]:
np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [14]:
'b' in obj2

True

In [15]:
# A Series can be created from a Python dict by passing the dict to the constructor
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)    # the index of the resulting Series follows the dict's key order (insertion order here; older pandas versions sorted the keys)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [16]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)   #overriding by passing dict keys in the order we want
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In the above example, there was no value for 'California' in sdata, so it appears as NaN. Conversely, 'Utah' is in the dict but not in the states list, so it is excluded from the resulting object.

The ___isnull and notnull___ functions can be used to detect missing data.

In [17]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [18]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [19]:
#these are also available as instance methods
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
#it automatically aligns by index label in arithmetic operations
print(obj3)
obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [21]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
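
When a label exists in only one operand, plain `+` yields NaN, as with 'Utah' above. A minimal sketch of one alternative, assuming the `fill_value` argument of `Series.add` is an acceptable substitute for the missing side; the variable names `plain` and `filled` are introduced here for illustration only.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# Plain addition: any label missing on either side becomes NaN.
plain = obj3 + obj4

# add() with fill_value treats a value missing on ONE side as 0.
# A label missing on both sides (California) still ends up as NaN.
filled = obj3.add(obj4, fill_value=0)
```

Here 'Utah' keeps its value from `obj3`, while 'California' stays missing because neither operand has data for it.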

In [22]:
obj4.name = 'population'   #give name to the Series object
obj4.index.name = 'state'  #give name to the object's index
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [23]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [24]:
obj.index = ['a','h','e','i']   #series's index can be altered in-place by assignment
obj

a    4
h    7
e   -5
i    3
dtype: int64

### DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type. The DataFrame has both a row and a column index; it can be thought of as a dict of Series all sharing the same index.

- Data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of 1D arrays
- Can be used to represent higher-dimensional data in a tabular format using __hierarchical indexing__
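
The "dict of Series sharing one index" picture can be sketched directly. The two small Series below are hypothetical stand-ins, not data from this notebook.

```python
import pandas as pd

# Two hypothetical Series that share the same row labels.
year = pd.Series([2000, 2001], index=['one', 'two'])
state = pd.Series(['Ohio', 'Nevada'], index=['one', 'two'])

# A dict of Series becomes a DataFrame: keys are columns, the shared index labels the rows.
df = pd.DataFrame({'year': year, 'state': state})

# Each column pulled back out is a Series carrying the frame's row index.
shares_index = df['year'].index.equals(df.index) and df['state'].index.equals(df.index)
```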

In [25]:
#construction of dataframe
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)   #index is assigned automatically; the columns follow the dict's key order here
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [26]:
frame.head()  #select only the first five rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [27]:
frame.head(10)  #displays first 10 rows

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [28]:
pd.DataFrame(data, columns=['year','state','pop'])   #the df's columns are arranged in the specified order

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [29]:
#if a passed column name is not in the dict, it appears in the result filled with NaN
frame2 = pd.DataFrame(data, columns=['year','state','pop', 'debt'], index=['one', 'two', 'three', 'four','five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [30]:
#column in df can be retrieved as a Series either by dict-like notation or by attribute
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [31]:
frame2.year   #works only when the column name is a valid Python variable name

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [32]:
#rows can be retrieved by label with loc (or by position with iloc)
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
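
The cell above retrieves a row by name. For comparison, a small sketch of retrieving the same row by position with `iloc`, using a hypothetical three-row frame rather than `frame2`.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']    # select the row whose label is 'three'
by_position = df.iloc[2]      # select the third row (positions start at 0)
```

Both calls return the row as a Series whose index is the frame's column labels and whose name is the row label.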

In [33]:
#columns can be modified by assignment
#frame2['debt'] = 16.5
frame2.debt = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [34]:
#while assigning lists or arrays to a column, the value's length must match the length of the df
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val    #a Series's labels are realigned exactly to the df's index, inserting missing values (NaN) in any holes
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
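
The two assignment rules above (lists must match the frame's length, Series are aligned by label) can be contrasted in a short sketch. The frame and values are hypothetical, and the exact error message is not relied on, only the exception type.

```python
import pandas as pd

# Hypothetical three-row frame.
df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A plain list of the wrong length is rejected.
try:
    df['debt'] = [1.0, 2.0]
    error = None
except ValueError as exc:
    error = exc

# A Series of any length is accepted: values are matched by label,
# and rows without a matching label get NaN.
df['debt'] = pd.Series([1.0, 2.0], index=['three', 'one'])
```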


In [35]:
#new columns are created with dict-like notation; the frame2.eastern attribute syntax cannot create them
#(the del keyword, used in the next cell, deletes columns)
frame2['eastern'] = frame2.state == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [36]:
del frame2['eastern']
frame2.columns    #note: a column obtained by indexing the DataFrame (e.g. frame2['state']) can be a view on the underlying data rather than a copy

Index(['year', 'state', 'pop', 'debt'], dtype='object')
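
Whether a selected column behaves as a view or as a copy can depend on the pandas version and its settings, so that is treated as an open assumption here. A minimal sketch of the safe route: taking an explicit copy with the Series `copy` method, which is independent of the frame either way. The frame below is hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({'year': [2000, 2001], 'pop': [1.5, 1.7]})

# An explicit copy owns its data, so modifying it never touches df.
pop_copy = df['pop'].copy()
pop_copy.iloc[0] = 99.0
```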

In [37]:
#nested dict of dicts
pop = {'Nevada': {2001: 2.4, 2002: 2.9},'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [38]:
frame3 = pd.DataFrame(pop)   #if this kind of data is passed then pd will interpret the outer dict keys as columns and the inner keys as the row indices
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [39]:
#transpose of DataFrame
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


In [40]:
pd.DataFrame(pop, index=[2001,2002,2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [41]:
pdata = {'Ohio': frame3['Ohio'][:-1],'Nevada': frame3['Nevada'][:2]}   #dicts of series are treated in much the same way
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


In [42]:
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [43]:
frame3.index.name = 'year'; frame3.columns.name='state'
frame3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [44]:
#values attribute returns the data contained in the DataFrame as a 2D ndarray
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [45]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
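
The two outputs above differ in dtype: when all columns share a type the array keeps it, and when they do not, the array falls back to a dtype that can hold everything. A small sketch with hypothetical frames, using `to_numpy()`, which returns the same kind of 2D array as the `values` attribute.

```python
import numpy as np
import pandas as pd

# All columns are floats: the resulting array is float64.
homogeneous = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Integers mixed with strings: the array falls back to dtype=object.
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

homogeneous_dtype = homogeneous.to_numpy().dtype
mixed_dtype = mixed.to_numpy().dtype
```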

#### Possible data inputs to DataFrame constructor

| Type | Notes |
| :--- | :---- |
|2D ndarray | A matrix of data, passing optional row and column labels |
| dict of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length |
| NumPy structured/record array | Treated as the “dict of arrays” case |
| dict of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed |
| dict of dicts | Each inner dict becomes a column; keys are unioned to form the row index as in the “dict of Series” case |
| List of dicts or Series | Each item becomes a row in the DataFrame; union of dict keys or Series indexes become the DataFrame’s column labels |
| List of lists or tuples | Treated as the “2D ndarray” case |
| Another DataFrame | The DataFrame’s indexes are used unless different ones are passed |
| NumPy MaskedArray | Like the “2D ndarray” case except masked values become NA/missing in the DataFrame result |
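
A few rows of the table above, sketched with hypothetical data: a 2D ndarray with explicit labels, a list of dicts (each dict becomes a row), and a list of tuples.

```python
import numpy as np
import pandas as pd

# 2D ndarray with optional row and column labels.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# List of dicts: each dict is a row; the union of keys forms the columns,
# and keys absent from a row show up as NaN.
from_dicts = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# List of tuples: treated like the 2D ndarray case.
from_tuples = pd.DataFrame([(1, 'x'), (2, 'y')], columns=['num', 'letter'])
```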

### Index Objects

- pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).
- Any labels used while constructing a Series or DataFrame are internally converted to an Index
- Index objects are immutable

In [46]:
obj = pd.Series(range(3), index=['a','b','c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [47]:
index[1:]

Index(['b', 'c'], dtype='object')

In [48]:
#index[1] = 'd'   # TypeError - it's not mutable
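
The commented-out assignment can be run safely inside a try/except. This sketch also shows one way to get a "modified" index, assuming the goal is simply to build a new Index with `delete` and `insert` rather than changing the original.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'd'
    raised = False
except TypeError:
    raised = True

# Methods such as delete and insert return a NEW Index; the original is untouched.
new_index = index.delete(1).insert(1, 'd')
```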

In [49]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [50]:
obj2 = pd.Series([1.5,-2.5,0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [51]:
obj2.index is labels

True

In [52]:
#an Index also behaves like a fixed-size set
frame3

state,Nevada,Ohio
year,,
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [53]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [54]:
'Ohio' in frame3.columns

True

In [55]:
2003 in frame3.index

False

In [56]:
# Unlike Python sets, pandas Index can contain duplicate labels
dup_labels = pd.Index(['foo','foo','bar','bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [57]:
lab = pd.Index(['foo','too','hee','shee'])
dup_labels.union(lab)

Index(['bar', 'bar', 'foo', 'foo', 'hee', 'shee', 'too'], dtype='object')

In [58]:
dup_labels.delete(0)

Index(['foo', 'bar', 'bar'], dtype='object')

#### Some Index methods and properties
| Method | Description |
| :----- | :---------- |
| append | Concatenate with additional Index objects, producing a new Index |
| difference | Compute set difference as an Index |
| intersection | Compute set intersection |
| union | Compute set union |
| isin | Compute boolean array indicating whether each value is contained in the passed collection |
| delete | Compute new Index with element at index i deleted |
| drop | Compute new Index by deleting passed values |
| insert | Compute new Index by inserting element at index i |
| is_monotonic | Returns True if each element is greater than or equal to the previous element (spelled is_monotonic_increasing in recent pandas versions) |
| is_unique | Returns True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |
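
A quick tour of the methods in the table, on two hypothetical Index objects. The monotonicity check uses `is_monotonic_increasing`, on the assumption that a recent pandas version is installed.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

appended = left.append(right)        # concatenation, duplicates kept
diff = left.difference(right)        # labels in left but not in right
inter = left.intersection(right)     # labels present in both
union = left.union(right)            # labels present in either
mask = left.isin(['a', 'c'])         # boolean array, one entry per label
dropped = left.drop('b')             # remove by value
deleted = left.delete(0)             # remove by position
inserted = left.insert(1, 'z')       # insert before position 1

unique_flag = appended.is_unique     # False: 'b' and 'c' appear twice
uniques = appended.unique()          # duplicates removed, first-seen order
increasing = left.is_monotonic_increasing
```

Every call returns a new object; none of them modifies `left` or `right`.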

## Essential Functionality

### Reindexing
- The __reindex__ method creates a new object with the data _conformed_ to a new index
- calling __reindex__ on the Series rearranges the data according to the new index
- With DataFrame, __reindex__ can alter either the (row) index, columns, or both

In [59]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [60]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The __method__ option allows us to do this, using a method such as __ffill__, which forward-fills the values.

In [61]:
obj3 = pd.Series(['blue','purple','yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [62]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [73]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [74]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [65]:
# columns can be reindexed with the columns keyword
states = ['Texas','Utah','California']
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [77]:
#label-indexing with loc
#frame.loc[['a','b','c','d'], states]   #raises KeyError in recent pandas: passing labels that are missing from the index to loc is no longer supported; use reindex instead
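
A sketch of the behaviour described in the comment above, assuming a pandas version (1.0 or later) in which `loc` raises KeyError for list-likes containing missing labels. `reindex` on both axes is shown as the replacement.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# 'b' is not a row label and 'Utah' is not a column label, so loc refuses.
try:
    frame.loc[['a', 'b', 'c', 'd'], states]
    raised = False
except KeyError:
    raised = True

# reindex accepts missing labels and fills the new rows/columns with NaN.
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
```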

#### reindex function arguments
| Argument | Description |
| :------- | :---------- |
| index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying. |
| method | Interpolation (fill) method; __'ffill'__ fills forward, while __'bfill'__ fills backward. |
| fill_value | Substitute value to use when introducing missing data by reindexing. |
| limit | When forward- or backfilling, maximum size gap (in number of elements) to fill. |
| tolerance | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches. |
| level | Match simple Index on level of MultiIndex; otherwise select subset of. |
| copy | If True, always copy underlying data even if new index is equivalent to old index; if False, do not copy the data when the indexes are equivalent. |
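
Three of the arguments in the table, tried on a hypothetical two-element Series with a gap in its integer index. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical Series with labels 0 and 4, leaving a gap at 1, 2, 3.
s = pd.Series([1.0, 2.0], index=[0, 4])

# fill_value: use a constant instead of NaN for newly introduced labels.
constant = s.reindex(range(6), fill_value=0.0)

# method='ffill' with limit=1: carry each value forward by at most one step.
limited = s.reindex(range(6), method='ffill', limit=1)

# method='bfill': fill each gap from the next available value.
backward = s.reindex(range(6), method='bfill')
```

With `limit=1`, labels 2 and 3 stay NaN because they are more than one step past label 0; with `bfill`, label 5 stays NaN because nothing follows it.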

### Dropping Entries from an Axis

The ___drop___ method returns a new object with the indicated value or values deleted from an axis.


In [79]:
obj = pd.Series(np.arange(5.), index=['a','b','c','d','e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [81]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
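
The same method accepts a list of labels and, on a DataFrame, can target either axis. A short sketch with a hypothetical frame; the `columns=` keyword is used here as one way to drop from the column axis.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Several labels can be dropped at once; obj itself is left unchanged.
dropped_many = obj.drop(['d', 'c'])

# Hypothetical 2x3 frame to show both axes.
frame = pd.DataFrame(np.arange(6).reshape((2, 3)),
                     index=['r1', 'r2'], columns=['x', 'y', 'z'])

without_row = frame.drop('r1')            # drops from the row index by default
without_col = frame.drop(columns='y')     # drops from the columns
```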