
#    pandas



Notebook text is taken from the book Python for Data Analysis.

 While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data. 
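The contrast can be sketched with a few made-up values (the column names and numbers below are illustrative only): a NumPy array coerces its elements to a single dtype, while a DataFrame keeps a separate dtype for each column.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so mixed numeric input is coerced to a common type.
arr = np.array([1, 2.5, 3])
print(arr.dtype)  # float64

# A DataFrame stores each column with its own dtype (illustrative data).
df = pd.DataFrame({"city": ["Dallas", "Chicago"],
                   "population": [1.2, 2.4],
                   "coastal": [False, False]})
print(df.dtypes)
```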

In [0]:
import numpy as np  # needed later for np.exp and np.arange
import pandas as pd

Whenever you see pd in code, it’s referring to pandas. You may also find it easier to import Series and DataFrame into the local namespace since they are so frequently used:

In [0]:
from pandas import Series, DataFrame

#  Introduction to pandas Data Structures



> **Series**



A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.
If you do not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.

In [0]:
x  = pd.Series([1,2,3,4])

In [6]:
x

0    1
1    2
2    3
3    4
dtype: int64

Displaying x shows the index labels on the left, the values on the right, and the data type of the Series at the bottom.

In [11]:
x.values

array([1, 2, 3, 4])

In [12]:
x.index

RangeIndex(start=0, stop=4, step=1)

In [0]:
x = pd.Series([1,2,3,4],index=['d','a','b','c'])

In [14]:
x

d    1
a    2
b    3
c    4
dtype: int64

In [15]:
x.dtype

dtype('int64')

In [16]:
x['d']

1

In [17]:
x['c']

4

In [19]:
x[['a','b','d']]

a    2
b    3
d    1
dtype: int64

In [20]:
x[x>2]

b    3
c    4
dtype: int64

In [21]:
x.iloc[2] * 2

6

In [22]:
x ** 2

d     1
a     4
b     9
c    16
dtype: int64

In [23]:
np.exp(x)

d     2.718282
a     7.389056
b    20.085537
c    54.598150
dtype: float64

In [25]:
'b' in x

True

In [26]:
'f' in x

False

In [0]:
data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [28]:
data

{'Ohio': 35000, 'Oregon': 16000, 'Texas': 71000, 'Utah': 5000}

In [0]:
y = pd.Series(data)

In [30]:
y

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series takes the dict's keys in insertion order, as the output above shows (older pandas versions sorted the keys). You can override this by passing the keys as an index in the order you want them to appear in the resulting Series:

In [0]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [0]:
x = pd.Series(data,index=states)

In [34]:
x

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Use the isnull and notnull functions in pandas to detect missing (NA) values:

In [36]:
pd.isnull(x)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [37]:
pd.notna(x)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [38]:
pd.notnull(x)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
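The same checks are also available as Series methods. The sketch below uses a small stand-in Series (the name s and its values are chosen for illustration) and shows how the boolean result can be counted or used as a filter.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 35000.0, 16000.0],
              index=["California", "Ohio", "Oregon"])

# Method forms of the top-level pd.isnull / pd.notnull functions.
print(s.isnull())
print(s.notnull())

# The result is boolean, so it can be summed or used as a mask.
print(s.isnull().sum())  # 1
print(s[s.notnull()])    # only Ohio and Oregon remain
```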

In [0]:
x.name = 'population'

In [0]:
x.index.name = 'state'

In [41]:
x

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [0]:
x.index = ['Utah','Virginia','Georgia','Tennessee']
x.index.name = 'state'

In [45]:
x

state
Utah             NaN
Virginia     35000.0
Georgia      16000.0
Tennessee    71000.0
Name: population, dtype: float64

# DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
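The "dict of Series" view can be sketched directly. The Series names and numbers below are made up for illustration; the point is that each dict key becomes a column and all columns end up sharing one row index.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values).
population = pd.Series({"Dallas": 1.2, "Chicago": 2.4})
area = pd.Series({"Chicago": 234.0, "Atlanta": 136.0})

# Each key becomes a column; rows are aligned on the union of the labels,
# and positions with no data are filled with NaN.
frame = pd.DataFrame({"population": population, "area": area})
print(frame)

# A selected column is a Series that carries the frame's shared index.
print(frame["population"].index.equals(frame.index))  # True
```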

In [0]:
data1 = {'cities':['Dallas','New York','Atlanta','Chicago','Los Angeles'],'Population in million':[1.2,3.4,1,2.4,2.7]}

In [88]:
data1

{'Population in million': [1.2, 3.4, 1, 2.4, 2.7],
 'cities': ['Dallas', 'New York', 'Atlanta', 'Chicago', 'Los Angeles']}

In [0]:
y = pd.DataFrame(data1)

In [90]:
y

        cities  Population in million
0       Dallas                    1.2
1     New York                    3.4
2      Atlanta                    1.0
3      Chicago                    2.4
4  Los Angeles                    2.7


In [0]:
y = pd.DataFrame(data1, columns=['cities', 'Population in million', 'debt'],
                 index=['one', 'two', 'three', 'four', 'five'])

In [92]:
y

            cities  Population in million  debt
one         Dallas                    1.2   NaN
two       New York                    3.4   NaN
three      Atlanta                    1.0   NaN
four       Chicago                    2.4   NaN
five   Los Angeles                    2.7   NaN


If you request a column that is not present in the dict, it appears in the result filled with missing values (NaN).

In [93]:
y['cities']

one           Dallas
two         New York
three        Atlanta
four         Chicago
five     Los Angeles
Name: cities, dtype: object

In [94]:
y.loc['three']

cities                   Atlanta
Population in million          1
debt                         NaN
Name: three, dtype: object

In [0]:
y['debt'] = 13

In [96]:
y

            cities  Population in million  debt
one         Dallas                    1.2    13
two       New York                    3.4    13
three      Atlanta                    1.0    13
four       Chicago                    2.4    13
five   Los Angeles                    2.7    13


In [0]:
y['debt'] = np.arange(5.)

In [98]:
y

            cities  Population in million  debt
one         Dallas                    1.2   0.0
two       New York                    3.4   1.0
three      Atlanta                    1.0   2.0
four       Chicago                    2.4   3.0
five   Los Angeles                    2.7   4.0


Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.

In [0]:
y['south'] = y.cities == 'Chicago'

In [104]:
y

            cities  Population in million  debt  south
one         Dallas                    1.2   0.0  False
two       New York                    3.4   1.0  False
three      Atlanta                    1.0   2.0  False
four       Chicago                    2.4   3.0   True
five   Los Angeles                    2.7   4.0  False


**Caution:** New columns cannot be created with attribute syntax such as y.south; use bracket syntax, y['south'], instead.
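A minimal sketch of this pitfall, using a small stand-in frame (the names frame and is_chicago are illustrative): attribute assignment only sets a Python attribute on the object, while bracket assignment adds a real column.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({"cities": ["Dallas", "Chicago"]})

# Attribute assignment does not create a column. Recent pandas versions
# emit a UserWarning here, which is silenced for this demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.is_chicago = frame.cities == "Chicago"
print("is_chicago" in frame.columns)  # False

# Bracket assignment creates the column.
frame["is_chicago"] = frame.cities == "Chicago"
print("is_chicago" in frame.columns)  # True
```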

The del keyword removes a column:

In [0]:
del y['south']

In [106]:
y

            cities  Population in million  debt
one         Dallas                    1.2   0.0
two       New York                    3.4   1.0
three      Atlanta                    1.0   2.0
four       Chicago                    2.4   3.0
five   Los Angeles                    2.7   4.0


You can get the column names with df.columns, where df is the name of your DataFrame:

In [108]:
y.columns

Index(['cities', 'Population in million', 'debt'], dtype='object')

In [0]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [0]:
df1 = pd.DataFrame(pop)

In [113]:
df1.head()

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


You can transpose the DataFrame (swap rows and columns) with the T attribute.

In [115]:
df1.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [116]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [0]:
df1.index.name = 'year'
df1.columns.name = 'state'

In [118]:
df1

state  Nevada  Ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [119]:
df1.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [0]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [121]:
obj

a    0
b    1
c    2
dtype: int64

In [0]:
index = obj.index

In [123]:
index

Index(['a', 'b', 'c'], dtype='object')

In [124]:
index[1:]

Index(['b', 'c'], dtype='object')

# Index objects are immutable and thus can’t be modified by the user:

In [125]:
index[1] = 'd'

TypeError: Index does not support mutable operations
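Since an Index cannot be changed in place, labels are changed by building a new object. The sketch below catches the error and then uses Series.rename as one possible way to get a relabeled copy; the variable renamed is illustrative.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# In-place modification of an Index raises TypeError.
try:
    index[1] = "d"
except TypeError as exc:
    print("TypeError:", exc)

# rename returns a new Series with new labels; the original is untouched.
renamed = obj.rename({"b": "d"})
print(renamed.index.tolist())  # ['a', 'd', 'c']
print(obj.index.tolist())      # ['a', 'b', 'c']
```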