In [6]:
import pandas as pd
pd.__version__

'1.0.1'

# PANDAS

At the most basic level, Pandas objects can be thought of as enhanced versions of
NumPy structured arrays in which the rows and columns are identified with labels
rather than simple integer indices.

Before we go any further, let’s introduce the three
fundamental Pandas data structures: the Series, DataFrame, and Index.

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data.

In [4]:
data = pd.Series([0.25, 0.5, 0.75, 1.0]) # The pandas series is actually an indexed array
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [5]:
data.values  # returns the underlying NumPy array, without the index

array([0.25, 0.5 , 0.75, 1.  ])

In [26]:
data.values.astype(int)  # .astype(int) converts the floats to ints, truncating toward zero

array([0, 0, 0, 1])

In [27]:
data.index  # the index is itself an array-like object (here a RangeIndex)

RangeIndex(start=0, stop=4, step=1)

## Series as generalized NumPy array

It may look as if the Series object is basically interchangeable with a
one-dimensional NumPy array. The essential difference is the presence of the
index: while the NumPy array has an implicitly defined integer index used
to access the values, the Pandas Series has an explicitly defined index associated with
the values.

This explicit index definition gives the Series an extra capability: the index need not
consist of integers, but can contain values of any desired type.

In [28]:
# Here the strings a, b, c, d are used as the index of the Series;
# noncontiguous or nonsequential index values work the same way.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=["a", "b", "c", "d"])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [29]:
data['a']  # items are accessed by label, just as they were by position before

0.25

In [30]:
# We can even use noncontiguous or nonsequential index values
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 4, 6, 8])
data[4]

0.5

## Series as specialized dictionary

You can think of a Pandas Series a bit like a specialization of a Python
dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary
values, and a Series is a structure that maps typed keys to a set of typed values. This
typing is important: just as the type-specific compiled code behind a NumPy array
makes it more efficient than a Python list for certain operations, the type information
of a Pandas Series makes it much more efficient than Python dictionaries for certain
operations.
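
To make the typing point concrete, here is a minimal sketch contrasting a plain dictionary with a Series built from it. The fruit names and prices are made up for illustration. The Series carries a single dtype for all of its values, so an arithmetic operation applies to the whole collection at once, while the dictionary needs an explicit loop.

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices_dict = {'apple': 1.5, 'pear': 2.0, 'plum': 0.5}
prices = pd.Series(prices_dict)

# Dictionary: each value has to be handled individually
doubled_dict = {key: value * 2 for key, value in prices_dict.items()}

# Series: one typed, vectorized operation over all values
doubled = prices * 2

print(prices.dtype)                       # the single dtype shared by all values
print(doubled.to_dict() == doubled_dict)  # both approaches give the same numbers
```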

In [31]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [32]:
population_dict  # for comparison, evaluating the plain dictionary displays it as a dictionary

{'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}

In [33]:
# Unlike a dictionary, the Series also supports array-style operations such as slicing.
# Here the state names themselves are the index.
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [34]:
# A scalar value is repeated to fill every entry of the specified index
pd.Series('India', index=[100, 200, 300])

100    India
200    India
300    India
dtype: object

In [35]:
# Data can be a dictionary. The book says the index defaults to the sorted dictionary keys,
# but this pandas version keeps the keys in insertion order, so the index below is not sorted.
pd.Series({2: 'Sri Lanka', 1: 'India', 3: 'Australia'})

2    Sri Lanka
1        India
3    Australia
dtype: object
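
If the sorted order described in the book is wanted, one option is to sort the index explicitly after construction. This is a minimal sketch, assuming a pandas version that keeps dictionary keys in insertion order; the variable names are illustrative.

```python
import pandas as pd

countries = pd.Series({2: 'Sri Lanka', 1: 'India', 3: 'Australia'})
print(list(countries.index))         # keys in the order they were written

# sort_index returns a new Series ordered by its index labels
sorted_countries = countries.sort_index()
print(list(sorted_countries.index))
```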

In [36]:
# In each case, the index can be set explicitly if a different result is preferred.
# With a dictionary, the Series is then populated only with the explicitly requested keys.
pd.Series({2: 'Sri Lanka', 1: 'India', 3: 'Australia'}, index=[1, 3])

1        India
3    Australia
dtype: object

## The Pandas DataFrame Object


If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible
column names. Just as you might think of a two-dimensional array as an ordered
sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. Here, by “aligned” we mean that they share the
same index.

In [37]:
# area_dict is a plain dictionary; pd.Series converts it into a Pandas Series
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

Now that we have this along with the population Series from before, we can use a
dictionary to construct a single two-dimensional object containing this information:


In [38]:
# pd.DataFrame combines the area and population Series into a single DataFrame
states = pd.DataFrame({'area': area,
                       'state_population': population})
states

              area  state_population
California  423967          38332521
Texas       695662          26448193
New York    141297          19651127
Florida     170312          19552860
Illinois    149995          12882135


In [82]:
states.columns # This is used to access the name of the columns in the dataframe
#states.keys() can also be used for the same purpose

Index(['area', 'state_population'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional
NumPy array, where both the rows and columns have a generalized index for accessing
the data.


In [105]:
states["area"]

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array,
data[0] will return the first row. For a DataFrame, data['col0'] will return the first
column. Because of this, it is probably better to think about DataFrames as generalized
dictionaries rather than generalized arrays, though both ways of looking at the situation
can be useful.
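
The row-versus-column difference can be sketched side by side. This is a small illustration with made-up numbers; the column names `col0` and `col1` are assumptions chosen to match the wording above.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4],
                [5, 6]])
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]          # NumPy: a single integer index selects a row
first_col = frame['col0']   # DataFrame: a key selects a column

print(first_row.tolist())
print(first_col.tolist())
```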

## Constructing DataFrame objects

### From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.

In [119]:
pd.DataFrame(population, columns=['population'])
# This can also be written as
# pd.DataFrame({'population': population})

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


### From a list of dicts

In [125]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
# pd.DataFrame(data) would turn this list of dicts into a DataFrame
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
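
The cell above only displays the list itself. As a minimal sketch of the conversion step, the same list can be passed to `pd.DataFrame`, where each dictionary becomes one row and the keys become the column names (the variable names here are illustrative):

```python
import pandas as pd

records = [{'a': i, 'b': 2 * i} for i in range(3)]

# Each dict is one row; the dict keys become the columns
frame = pd.DataFrame(records)
print(frame)
```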

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e.,
“not a number”) values.

In [124]:
# Each {} supplies one row; any column a row has no value for is filled with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'c': 3, 'd': 4}])

     a    b    c    d
0  1.0  2.0  NaN  NaN
1  NaN  NaN  3.0  4.0


### From a dictionary of Series objects

In [126]:
pd.DataFrame({'population': population,
              'area': area})


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


### From a two-dimensional NumPy array

In [5]:
import numpy as np
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.187064  0.041915
b  0.289169  0.129848
c  0.990357  0.155631


### From a NumPy structured array

In [6]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)  # A was created with np.zeros, so every field starts at zero

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## The Pandas Index Object

Let’s construct an Index from a list of integers:

In [7]:
ind = pd.Index([2, 3, 5, 7, 11])
ind


Int64Index([2, 3, 5, 7, 11], dtype='int64')

The Index object in many ways operates like an array. For example, we can use standard
Python indexing notation to retrieve values or slices.

In [12]:
ind[::2]  # every second element, from the first position through to the end

Int64Index([2, 5, 11], dtype='int64')

In [13]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable:
that is, they cannot be modified via the normal means.
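
A short sketch of what immutability means in practice: item assignment on an Index raises a `TypeError`, so a changed index has to be built as a new object. The `delete`/`insert` combination below is just one possible way to do that, not the only one.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

raised = False
try:
    ind[1] = 0  # item assignment is not supported on an Index
except TypeError as err:
    raised = True
    print("TypeError:", err)

# Build a new Index instead of changing the existing one in place
updated = ind.delete(1).insert(1, 0)
print(list(updated))
```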

### Index as ordered set


The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections,
differences, and other combinations can be computed in a familiar way.

In [15]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])


In [16]:
indA & indB  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [17]:
indA | indB  # union

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [18]:
indA^indB # symmetric difference

Int64Index([1, 2, 9, 11], dtype='int64')
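
The same set operations are also available as Index methods. The sketch below repeats the three cells above in method form; this form may be the safer choice, since later pandas releases changed how the `&`, `|`, and `^` operators behave on Index objects.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels present in both
either = indA.union(indB)                   # labels present in at least one
only_one = indA.symmetric_difference(indB)  # labels present in exactly one

print(list(common), list(either), list(only_one))
```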

### Series as dictionary

Series objects can even be modified with a dictionary-like syntax. Just as you can
extend a dictionary by assigning to a new key, you can extend a Series by assigning
to a new index value.

In [7]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data


a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
data['e'] = 1.25  # assigning to a new label extends the Series
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [10]:
data['c'] = 8  # assigning to an existing label overwrites its value
data

a    0.25
b    0.50
c    8.00
d    1.00
e    1.25
dtype: float64

In [11]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 8.0), ('d', 1.0), ('e', 1.25)]

### Series as one-dimensional array

In [12]:
# masking: keep only the values for which the Boolean condition is True
data[(data > 0.3) & (data < 0.8)]

b    0.5
dtype: float64

In [13]:
# fancy indexing: select only the entries whose labels are listed explicitly
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [15]:
# slicing by explicit index
data['a':'c']


a    0.25
b    0.50
c    8.00
dtype: float64

In [16]:
# slicing by implicit integer index
data[0:2]


a    0.25
b    0.50
dtype: float64

Among these, slicing may be the source of the most confusion. Notice that when you
are slicing with an explicit index (i.e., data['a':'c']), the final index is included in
the slice, while when you’re slicing with an implicit index (i.e., data[0:2]), the final
index is excluded from the slice.


### Indexers: loc, iloc, and ix

These slicing and indexing conventions can be a source of confusion. For example, if
your Series has an explicit integer index, an indexing operation such as data[1] will
use the explicit indices, while a slicing operation like data[1:3] will use the implicit
Python-style index.

In [17]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data


1    a
3    b
5    c
dtype: object

In [20]:
data[3]  # indexing uses the explicit index, i.e. the label 3 that we assigned

'b'

In [21]:
data[1:3]  # whereas slicing uses the implicit (positional) index

3    b
5    c
dtype: object

To remove this confusion, Pandas provides the .loc[] and .iloc[] indexer attributes.
.loc[] always uses the explicit index, and .iloc[] always uses the implicit
Python-style index.

In [22]:
data.loc[1]
#explicit indexing

'a'

In [27]:
data.loc[1:3]
#explicit slicing

1    a
3    b
dtype: object

In [28]:
data.iloc[1] #implicit indexing

'b'

In [29]:
data.iloc[1:3] #implicit slicing

3    b
5    c
dtype: object

A DataFrame acts in many ways like a two-dimensional or structured array,
and in other ways like a dictionary of Series structures sharing the same index.

### DataFrame as a dictionary

In [65]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [66]:
data['area']  # dictionary-style access to the area column; attribute-style access also works:
# data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [67]:
data['pop']  # returns the pop column; data.pop would not, because it refers to the DataFrame.pop() method rather than the column

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

In [68]:
# data.pop('pop') removes the 'pop' column and returns it, so after running it data['pop'] raises a KeyError
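
The name clash with `pop` can be checked on a small copy of the data. This is a minimal sketch using only two of the states; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662],
                      'pop': [38332521, 26448193]},
                     index=['California', 'Texas'])

# Attribute access finds the DataFrame.pop method, not the 'pop' column
print(callable(frame.pop))

# Calling the method removes the column from the frame and returns it
removed = frame.pop('pop')
print(list(frame.columns))
print(removed.tolist())
```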

Like with the Series objects discussed earlier, this dictionary-style syntax can also be
used to modify the object, in this case to add a new column:

In [86]:
data['density']=data['pop']/data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193    38.01874
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [162]:
density=pop/area
density

California     90.413926
Texas          38.018740
New York      139.076746
Florida       114.806121
Illinois       85.883763
dtype: float64

### DataFrame as two-dimensional array

Treating the DataFrame as a two-dimensional array, we can swap the rows and columns by transposing it:

In [75]:
swap = data.T  # .T transposes the DataFrame, swapping rows and columns
swap

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [76]:
swap.values[0:3]

array([[4.23967000e+05, 6.95662000e+05, 1.41297000e+05, 1.70312000e+05,
        1.49995000e+05],
       [3.83325210e+07, 2.64481930e+07, 1.96511270e+07, 1.95528600e+07,
        1.28821350e+07],
       [9.04139261e+01, 3.80187404e+01, 1.39076746e+02, 1.14806121e+02,
        8.58837628e+01]])

In [74]:
data.values[0]

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [80]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

In [82]:
swap['California']

area       4.239670e+05
pop        3.833252e+07
density    9.041393e+01
Name: California, dtype: float64

In [83]:
data.iloc[1:2, :]  # select a row, or a range of rows, by implicit position

         area       pop   density
Texas  695662  26448193  38.01874


In [94]:
swap.loc['area':'pop', 'Texas':'Florida']  # .loc selects by explicit row and column labels, not by implicit position

           Texas    New York     Florida
area    695662.0    141297.0    170312.0
pop   26448193.0  19651127.0  19552860.0


In [100]:
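# data.ix[:, :]

# This does not work here: the .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0.
# Expected result, as recorded in these notes:

#             area     pop
# California  423967   38332521
# Florida     170312   19552860
# Illinois    149995   12882135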
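
Since `.ix` is not available in the pandas version used in these notes, the same kind of selection can be written with `.iloc` (positions) or `.loc` (labels). The sketch below is an illustration on a cut-down copy of the data, not a reconstruction of the commented-out cell above; the variable names are assumptions.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662, 141297],
                      'pop': [38332521, 26448193, 19651127]},
                     index=['California', 'Texas', 'New York'])

# Purely positional: first two rows, first column
by_position = frame.iloc[:2, :1]

# Purely label-based: slices include the end label
by_label = frame.loc[:'Texas', :'area']

# Mixed, as .ix used to allow: convert positions to labels first
mixed = frame.loc[frame.index[:2], ['area']]

print(by_position)
```

All three expressions select the same two rows and one column.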


Keep in mind that for integer indices, the ix indexer (in pandas versions that still provide it)
is subject to the same potential sources of confusion as discussed for integer-indexed Series objects.

In the .loc indexer we can combine masking and fancy indexing, as
in the following:

In [115]:
data.loc[data.density > 100, 'area':'density']  # columns area through density, for states with density above 100

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
Florida,170312,19552860,114.806121


In [116]:
data.loc[data.density > 100, ['area', 'density']]  # only the area and density columns for those states

            area     density
New York  141297  139.076746
Florida   170312  114.806121


In [126]:
data.loc[['New York']]  # select a row by its label; passing a list keeps the result a DataFrame

            area       pop     density
New York  141297  19651127  139.076746


In [128]:
data.iloc[0, 2] = 90  # indexing can also modify values: set the entry at row 0, column 2
data

              area       pop     density
California  423967  38332521        90.0
Texas       695662  26448193    38.01874
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


Note that with the plain data[...] syntax on a DataFrame, indexing refers to columns while slicing refers to rows.
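
A minimal sketch of this convention on a three-state copy of the data (the variable names and the 200000 threshold are illustrative): a single key selects a column, while a slice or a Boolean mask selects rows.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662, 141297],
                      'pop': [38332521, 26448193, 19651127]},
                     index=['California', 'Texas', 'New York'])

column = frame['area']                     # indexing: a column
label_rows = frame['Texas':'New York']     # slicing by label: rows
position_rows = frame[1:3]                 # slicing by position: rows
mask_rows = frame[frame['area'] > 200000]  # Boolean mask: rows

print(column.name)
print(list(label_rows.index), list(position_rows.index), list(mask_rows.index))
```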

## Ufuncs: Index Preservation

In [143]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))  # 4 random integers from 0 up to, but not including, 10
ser

0    6
1    3
2    7
3    4
dtype: int32

In [149]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas
object with the indices preserved:

In [150]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [153]:
np.sin(df * np.pi / 4)  # computes sin(nπ/4) for each value n in df

           A             B          C             D
0       -1.0     0.7071068        1.0          -1.0
1  -0.707107  1.224647e-16   0.707107    -0.7071068
2  -0.707107           1.0  -0.707107  1.224647e-16


## Index Alignment

In [164]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [165]:
population/area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [166]:
area.index | population.index  # the result index is the union of both indices, even though each Series lacks a state found in the other

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [167]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

Here the entries missing from either Series produce NaN. To add the two Series while treating a missing entry as a chosen value instead, use the method form of the operator, A.add(B, fill_value=0):

In [169]:
A.add(B, fill_value=0)  # an entry missing from one Series is treated as 0

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64