# Pandas: An Introduction 

- Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
- It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. 
- It has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. 

![](../all_images/pandas.png)

## Background on Pandas

- According to Wikipedia, pandas is “a software library written for the Python programming language for data manipulation and analysis.” 
- It was initially developed by Wes McKinney in 2008 while working at AQR Capital Management. 
- He convinced AQR to let him open source the library. As a result, data scientists around the world can use it for free and are encouraged to contribute to the official repository with bug reports and fixes, documentation improvements, enhancements, and ideas for improving the software.

### Features of Pandas

Here are just a few of the things that pandas does well:

- Easy handling of missing data (represented as NaN).
- Flexible reshaping and pivoting of data sets.
- Merging and joining of data sets.
- Robust IO tools for loading data from flat files (CSV), Excel files, databases, and the HDF5 format.
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.

Source: https://github.com/pandas-dev/pandas#installation-from-sources

## Pandas Installation

`pip install pandas`

Alternatively, Pandas is installed by default with any of the following Python distributions:

- Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy stack. It is available for Windows, Linux and Mac.

- Canopy (https://www.enthought.com/products/canopy/) is available as free as well as commercial distribution with full SciPy stack for Windows, Linux and Mac.

- Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows OS. (Downloadable from http://python-xy.github.io/)

### Pandas Dependencies

- NumPy
- python-dateutil
- pytz

### Import Pandas

In [1]:
import pandas as pd
import numpy as np

## Create: Pandas Objects

Pandas mainly works with the following objects:
- Series
- DataFrame

## Working with Series

Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating point numbers, Python objects, etc.).

In [2]:
# Syntax
# s = pd.Series(data, index=index)

Here, data can be,

- a Python dict

- an ndarray

- a scalar value (like 5)

The passed index is a list of __axis labels__.

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [2]:
s = pd.Series(np.random.randn(6), index=['u', 'v', 'w', 'x', 'y','z'])

In [4]:
s

u    0.504271
v    1.944209
w   -0.099103
x    0.155506
y   -0.032992
z    0.628450
dtype: float64

In [3]:
s.index

Index(['u', 'v', 'w', 'x', 'y', 'z'], dtype='object')

In [4]:
pd.Series(np.random.randn(5))

0   -0.710941
1   -0.272843
2    1.116225
3    1.730865
4   -1.750229
dtype: float64

### From dictionary

A Series can be instantiated from a dictionary.

In [5]:
d = {'abc': 1, 'pqw': 0, 'xyz': 2}

In [6]:
pd.Series(d)

abc    1
pqw    0
xyz    2
dtype: int64

### Note

- When the data is a dict and an index is not passed, the Series index follows the dict’s insertion order, provided you are using Python >= 3.6 and Pandas >= 0.23.

- With Python < 3.6 or Pandas < 0.23, if an index is not passed, the Series index is the lexically ordered list of dict keys.

If an index is passed, the values in data corresponding to the labels in the index will be pulled out.

In [2]:
d = {'a': 0., 'b': 1., 'c': 2.}

In [5]:
pd.Series(d)    # Create object without index

a    0.0
b    1.0
c    2.0
dtype: float64

In [6]:
pd.Series(d, index=['b', 'c', 'd', 'a'])    # Create object with index

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

### Note

NaN (not a number) is the standard missing data marker used in pandas.

### From Scalar Value

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [7]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

## Working with DataFrame

The most commonly used pandas object, DataFrame, is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, DataFrame works with many types of input,

- Dict of 1D ndarrays, lists, dicts, or Series

- 2-D numpy.ndarray

- Structured or record ndarray

- A Series

- Another DataFrame

Along with the data, we can optionally pass index (row labels) and columns (column labels) arguments. 

If axis labels are not passed, they will be constructed from the input data.
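Two of the input types listed above, a 2-D ndarray and a Series, are not demonstrated elsewhere in this section. The following is a minimal sketch of both; the variable names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# 2-D ndarray with explicit row and column labels
arr = np.arange(6).reshape(3, 2)
df_arr = pd.DataFrame(arr, index=['a', 'b', 'c'], columns=['x', 'y'])

# Without labels, both axes default to 0, 1, 2, ...
df_default = pd.DataFrame(arr)

# A named Series becomes a single-column DataFrame
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='values')
df_s = pd.DataFrame(s)

print(df_arr)
print(df_default.columns.tolist())
print(df_s.columns.tolist())
```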

### Note

- When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s insertion order, if you are using Python version >= 3.6 and Pandas >= 0.23.

- With Python < 3.6 or Pandas < 0.23, if columns is not specified, the DataFrame columns are the lexically ordered list of dict keys.

### From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.

In [8]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [9]:
df = pd.DataFrame(d)

In [10]:
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [16]:
df = pd.DataFrame(d, index=['d', 'b', 'a']) # define index

In [18]:
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])  # define index and column name

**Note**
When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

In [19]:
df.index

Index(['d', 'b', 'a'], dtype='object')

In [20]:
df.columns

Index(['two', 'three'], dtype='object')
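The cells above do not print the resulting frame, so here is a small sketch of what the `columns` argument does to the data. It reuses the same dict `d`; the expectation is that `'one'` is dropped because it was not requested and `'three'` is filled with NaN because the dict has no such key.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Only the requested columns are kept; columns without data are all NaN
df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(df)
```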

### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.

In [7]:
d = {'one': [1., 2., 3., 4.],'two': [4., 3., 2., 1.]}

In [8]:
pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


In [10]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


### From a list of dicts

In [31]:
data2 = [{'a': 1.0, 'b': 2.0}, {'a': 5.0, 'b': 10.0, 'c': 20.0}]

In [32]:
pd.DataFrame(data2)

     a     b     c
0  1.0   2.0   NaN
1  5.0  10.0  20.0


In [33]:
pd.DataFrame(data2, index=['first', 'second'])

          a     b     c
first   1.0   2.0   NaN
second  5.0  10.0  20.0


In [34]:
pd.DataFrame(data2, columns=['a', 'b'])

     a     b
0  1.0   2.0
1  5.0  10.0


### From a dict of tuples

We can automatically create a MultiIndexed DataFrame by passing a dictionary whose keys are tuples.

In [35]:
pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}, ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
              ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0


## From CSV and XLS Files

### Pandas - IO Tools

The Pandas I/O API is a set of top level reader functions accessed like pd.read_csv() that generally return a Pandas object.

The two workhorse functions for reading text files (or the flat files) are read_csv() and read_table(). They both use the same parsing code to intelligently convert tabular data into a DataFrame object −

In [39]:
# Syntax
# pd.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None)

In [None]:
# Syntax
# pd.read_table(filepath_or_buffer, sep='\t', delimiter=None, header='infer', names=None, index_col=None, usecols=None)
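The two readers differ mainly in their default separator: a comma for `read_csv()` and a tab for `read_table()`. As a minimal sketch, the following parses the same small table from in-memory text with both functions. The sample data and variable names are made up for illustration.

```python
import io
import pandas as pd

csv_text = "name,age\nTom,28\nLee,32\n"
tsv_text = "name\tage\nTom\t28\nLee\t32\n"

df_csv = pd.read_csv(io.StringIO(csv_text))              # default sep=','
df_tsv = pd.read_table(io.StringIO(tsv_text))            # default sep='\t'
df_tsv2 = pd.read_csv(io.StringIO(tsv_text), sep='\t')   # same result via read_csv

print(df_csv.equals(df_tsv))
```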

Here is what the data in the CSV file looks like:

S.No,Name,Age,City,Salary

1,Tom,28,Toronto,20000

2,Lee,32,HongKong,3000

3,Steven,43,Bay,8300

4,Ram,38,Hyderabad,39000

Save this data as temp.csv and conduct operations on it.
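One way to create the file from Python is sketched below. The rows are written without spaces after the commas, and the City and Salary values follow the DataFrame output printed in the examples of this section; treat the exact file contents as an assumption.

```python
from pathlib import Path
import pandas as pd

csv_text = """S.No,Name,Age,City,Salary
1,Tom,28,Toronto,20000
2,Lee,32,HongKong,3000
3,Steven,43,Bay,8300
4,Ram,38,Hyderabad,39000
"""

# Write the sample file into the current working directory
Path("temp.csv").write_text(csv_text)

# Quick check that the file parses into 4 rows and 5 columns
df = pd.read_csv("temp.csv")
print(df.shape)
```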

### `read_csv()`

read_csv() reads data from a CSV file and creates a DataFrame object.



In [14]:
import pandas as pd

df = pd.read_csv("temp.csv")
print(df)

   S.No    Name  Age      City   Salary
0     1     Tom   28    Toronto   20000
1     2     Lee   32   HongKong    3000
2     3  Steven   43        Bay    8300
3     4     Ram   38  Hyderabad   39000


### Custom Index

The index_col argument specifies which column of the CSV file to use as the index.

In [15]:
import pandas as pd

df = pd.read_csv("temp.csv", index_col=['S.No'])
print (df)

        Name  Age      City   Salary
S.No                                
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43        Bay    8300
4        Ram   38  Hyderabad   39000


### Datatype Conversion

dtype of the columns can be passed as a dict.

In [16]:
import numpy as np
import pandas as pd

df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print(df.dtypes)

S.No        int64
Name       object
Age         int64
City       object
Salary    float64
dtype: object


By default, the dtype of the Salary column is int, but the result shows it as float because we have explicitly cast the type.

### Custom Header Names

Specify the names of the header using the names argument.

In [17]:
import pandas as pd
 
df = pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
print(df)

      a       b    c          d       e
0  S.No    Name  Age      City   Salary
1     1     Tom   28    Toronto   20000
2     2     Lee   32   HongKong    3000
3     3  Steven   43        Bay    8300
4     4     Ram   38  Hyderabad   39000


Observe that the custom names are used as the header, but the original header row of the file has not been removed and now appears as the first data row. Use the header argument to remove it.

Pass the row number of the file's header to header; that row and any rows before it are skipped. Here header=0 marks the first line as the header, and names replaces it.

In [18]:
import pandas as pd 

df = pd.read_csv("temp.csv", names=['a','b','c','d','e'], header=0)
print(df)

   a       b   c          d      e
0  1     Tom  28    Toronto  20000
1  2     Lee  32   HongKong   3000
2  3  Steven  43        Bay   8300
3  4     Ram  38  Hyderabad  39000


### Skipping Rows

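skiprows=n skips the first n lines of the file. Here the header line and the first data row are skipped, so the next line is used as the header.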

In [19]:
import pandas as pd

df = pd.read_csv("temp.csv", skiprows=2)
print(df)

   2     Lee  32   HongKong   3000
0  3  Steven  43        Bay   8300
1  4     Ram  38  Hyderabad  39000
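To skip one specific row while keeping the header, `skiprows` also accepts a list of 0-based line numbers. A minimal sketch using in-memory text rather than temp.csv (the sample rows are illustrative):

```python
import io
import pandas as pd

csv_text = """S.No,Name,Age
1,Tom,28
2,Lee,32
3,Steven,43
"""

# Line 0 is the header; line 2 is the row for Lee, which is skipped
df = pd.read_csv(io.StringIO(csv_text), skiprows=[2])
print(df)
```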


## Indexing and Selecting Data

The Python and NumPy indexing operator "[]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.

Pandas now supports three types of Multi-axes indexing; the three types are mentioned in the following table −

![](../all_images/pandas_index.PNG)

## Indexing with .loc()

.loc provides purely label-based indexing. When slicing, both the start and the stop bound are included. Integers are valid labels, but they refer to the label and not the position.

.loc() has multiple access methods like −

- A single scalar label
- A list of labels
- A slice object
- A Boolean array

loc takes two selectors (a single label, a list, or a range) separated by ','. The first one selects rows and the second one selects columns.

### Example 1


In [20]:
df = pd.DataFrame(np.random.randn(8, 4), index = ['p','q','r','s','t','u','v','w'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print (df.loc[:,'A'])

p   -0.265433
q   -1.025960
r    0.336007
s   -3.185059
t   -0.303443
u   -1.026256
v   -0.109916
w    1.355430
Name: A, dtype: float64


### Example 2

In [21]:
df = pd.DataFrame(np.random.randn(8, 4), index = ['p','q','r','s','t','u','v','w'], columns = ['A', 'B', 'C', 'D'])

# Select all rows for multiple columns, say list[]
print (df.loc[:,['A','C']])

          A         C
p  1.652936 -0.238378
q  0.716496 -0.604624
r -0.867570  0.282570
s  1.196433  0.427658
t  0.506004  0.195948
u -0.798844 -0.113927
v  0.267385 -0.582168
w  0.668178  1.426415


### Example 3

In [22]:
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

# Select few rows for multiple columns
print (df.loc[['a','b','f','h'],['A','C']])

          A         C
a -1.364030 -0.495333
b  0.088712 -0.621808
f  2.432601  1.068432
h  0.408565  0.484206


### Example 4

In [23]:
df = pd.DataFrame(np.random.randn(8, 4),
index = ['p','q','r','s','t','u','v','w'], columns = ['A', 'B', 'C', 'D'])

# Select range of rows for all columns
print (df.loc['p':'t'])

          A         B         C         D
p -1.155335  1.385747 -1.926556  0.586541
q -0.036768 -0.311639 -0.016075 -0.431020
r  0.580010 -0.606948  1.630403 -0.572697
s  0.167874 -0.744030 -1.088734 -0.521950
t -0.423888 -0.554509 -0.853873  1.489969


### Example 5: Conditional Access

In [26]:
df = pd.DataFrame(np.random.randn(8, 4),
index = ['p','q','r','s','t','u','v','w'], columns = ['A', 'B', 'C', 'D'])

# Compare the values in row 'p' with 0; the result is a boolean Series
print (df.loc['p']>0)

A    False
B     True
C    False
D    False
Name: p, dtype: bool
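A boolean Series like the one above can itself be passed to `.loc` to filter rows or columns. The sketch below uses a small fixed DataFrame (values chosen for illustration) so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, -2.0, 3.0], 'B': [-1.0, 5.0, 6.0]},
                  index=['p', 'q', 'r'])

# Boolean Series over the rows: keep rows where column A is positive
rows = df.loc[df['A'] > 0]

# Boolean Series over the columns: keep columns that are positive in row 'p'
cols = df.loc[:, df.loc['p'] > 0]

print(rows)
print(cols)
```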


## Indexing with .iloc()

.iloc provides purely integer-based (positional) indexing. As in Python and NumPy, positions are 0-based.

The various access methods are as follows −

- An Integer
- A list of integers
- A range of values

### Example 1

In [53]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['a', 'b', 'c', 'd'])

# Select the first four rows (all columns)
print (df.iloc[:4])

          a         b         c         d
0  0.904754  0.337680  0.508223 -1.058878
1 -0.532695 -0.431545 -0.162719  1.928955
2 -0.991537  1.038848  0.689747  0.637071
3 -0.626152  1.014243  1.054004 -1.508686


### Example 2

In [54]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['a', 'b', 'c', 'd'])

# Integer slicing
print (df.iloc[:4])
print (df.iloc[1:5, 2:4])

          a         b         c         d
0 -0.388369  1.319347  0.851811 -1.217552
1 -0.164599  0.702187  0.534779 -0.118588
2 -0.512155 -0.162196  1.318415 -0.974556
3  1.943318 -0.898667  0.619465 -0.044965
          c         d
1  0.534779 -0.118588
2  1.318415 -0.974556
3  0.619465 -0.044965
4 -0.716947 -0.773281


### Example 3

In [56]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['a', 'b', 'c', 'd'])
# Slicing through list of values
print (df.iloc[[1, 3, 5], [1, 3]])
print (df.iloc[1:3, :])
print (df.iloc[:,1:3])

          b         d
1 -0.333712  2.202506
3 -0.694579 -0.567488
5  0.484190 -0.422380
          a         b         c         d
1 -0.665368 -0.333712  1.011703  2.202506
2  0.565993 -0.131646  0.757702 -1.628908
          b         c
0 -0.040862 -0.346029
1 -0.333712  1.011703
2 -0.131646  0.757702
3 -0.694579  1.196766
4 -0.533025  1.123446
5  0.484190  0.591972
6 -0.320016 -0.211021
7  0.422171  1.914503
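The difference between label-based and position-based slicing is easiest to see on an index whose integer labels do not match the positions. A minimal sketch with illustrative values:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd'], index=[10, 20, 30, 40])

by_label = s.loc[10:30]      # labels 10 through 30, stop label included
by_position = s.iloc[0:2]    # positions 0 and 1, stop position excluded

print(by_label.tolist())
print(by_position.tolist())
```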


## Indexing with .ix()

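Pandas used to provide a hybrid method for selecting and subsetting objects, the .ix indexer. It was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples below run only on older versions. Use .loc or .iloc instead.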
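For readers on a current pandas version, the two `.ix` calls used in this section can be written with `.loc` and `.iloc`. This is a sketch with a deterministic DataFrame instead of random data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(32).reshape(8, 4), columns=['a', 'b', 'c', 'd'])

# df.ix[:4] on a default integer index was label-based, so label 4 is included
first_rows = df.loc[:4]

# Purely positional equivalent: the first five rows
first_rows_pos = df.iloc[:5]

# df.ix[:, 'b'] selected a column by label
col_b = df.loc[:, 'b']

print(first_rows)
print(col_b.tolist())
```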

### Example 1

In [27]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['a', 'b', 'c', 'd'])

# Integer slicing
print (df.ix[:4])

          a         b         c         d
0 -0.198267 -1.502448 -0.545179  0.301060
1 -1.107565 -0.542683  0.684250 -2.378254
2 -0.778596 -1.655079  0.137603 -0.141611
3  0.068435 -1.053913 -0.788315  1.048681
4 -0.145025  0.030205  1.282661  0.048547


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated


### Example 2

In [58]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['a', 'b', 'c', 'd'])
# Index slicing
print (df.ix[:,'b'])

0    0.011401
1   -0.654068
2    0.274163
3   -0.968567
4    0.224667
5   -1.618608
6   -0.421572
7    1.249111
Name: b, dtype: float64


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


## Use of Notations

Getting values from the Pandas object with Multi-axes indexing uses the following notation −

![](../all_images/pandas_noti.PNG)

### Indexing with []

Let us now see how each operation can be performed on the DataFrame object. We will use the basic indexing operator '[ ]'

### Example 1

In [59]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['c_0', 'c_1', 'c_2', 'c_3'])
print (df['c_2'])

0    0.815728
1   -2.018651
2   -0.712686
3   -0.250534
4    1.284177
5    0.208350
6   -0.449534
7    1.140519
Name: c_2, dtype: float64


Note − We can pass a list of values to [ ] to select those columns.

### Example 2

In [60]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['c_0', 'c_1', 'c_2', 'c_3'])

print (df[['c_1','c_2']])

        c_1       c_2
0 -0.903945 -0.456470
1  0.679146 -0.756001
2 -0.494437  0.584851
3 -0.824518 -0.110430
4 -1.088866  0.658546
5  0.805630 -1.244811
6  0.645045 -1.244987
7  0.626973  1.364706


### Example 3

In [63]:
df = pd.DataFrame(np.random.randn(8, 4), columns = ['c_0', 'c_1', 'c_2', 'c_3'])
print(df[0:2])

        c_0       c_1       c_2       c_3
0 -2.006115 -0.271351  0.333551 -1.192172
1 -0.525778 -1.049774 -0.290127  0.089914


### Attribute Access
Columns can be selected using the attribute operator '.'.

### Example

In [28]:
df = pd.DataFrame(np.random.randn(9, 5), columns = ['v_0', 'v_1', 'v_2', 'v_3','v_4'])
print (df.v_1)

0   -0.158471
1    1.419055
2    0.955843
3   -0.875995
4    0.812115
5   -1.314532
6   -0.320590
7    0.486258
8   -0.843228
Name: v_1, dtype: float64


## Vectorized Operations

When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.

In [33]:
import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5))

print("Original series")
print(s)
print("Series after Addition")
print(s + s)
print("Series after Exponentiation")
print(np.exp(s))


Original series
0    0.883022
1   -0.775152
2   -0.164110
3    1.076081
4   -1.268015
dtype: float64
Series after Addition
0    1.766044
1   -1.550305
2   -0.328221
3    2.152162
4   -2.536030
dtype: float64
Series after Exponentiation
0    2.418197
1    0.460634
2    0.848648
3    2.933163
4    0.281390
dtype: float64
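One difference from raw NumPy arrays is worth a short sketch: arithmetic between two Series aligns on index labels rather than on position, and labels present in only one operand yield NaN. The values below are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

# The operands cover labels b..d and a..c; the result covers the union a..d
total = s.iloc[1:] + s.iloc[:-1]

print(total)
```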


## Pandas - Missing Data

Missing data is always a problem in real life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of poor quality of data caused by missing values. In these areas, missing value treatment is a major point of focus to make their models more accurate and valid.

### When and Why Is Data Missing?

Consider an online survey for a product. People often do not share all the information related to them. Some share their experience but not how long they have been using the product; others share how long they have been using the product and their experience, but not their contact information. In one way or another, part of the data is always missing, and this is very common in real-world data.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

In [34]:
# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 3), index=['a', 'c', 'e', 'f', 'h','b','d','g'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df)

        one       two     three
a  0.964042  1.997078  0.589641
b  0.659743  0.667158  0.017106
c  0.679618 -0.106987  0.274920
d -1.892245 -0.300482  0.507583
e -0.194126  0.337466  0.089299
f  0.026672 -0.120068 -0.717947
g  0.782839  0.439512  1.699541
h  0.052338  0.744852 -1.925176


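Reindexing can introduce missing values: any label requested in reindex() that is absent from the original index produces a row of NaN (Not a Number). In this particular example the original index already contains every label from 'a' to 'h', so reindexing only reorders the rows and the output contains no NaN.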

### Checking for Missing Values

Pandas provides the following functions:

- isnull()
- notnull() 

which are also methods on Series and DataFrame objects −

### Example 1

In [35]:
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(8, 3), index=['a', 'c', 'e', 'f', 'h','b','d','g'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df['one'].isnull())

a    False
b    False
c    False
d    False
e    False
f    False
g    False
h    False
Name: one, dtype: bool


### Example 2

In [39]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])    # reindexing introduces missing values

print(df)
print("-"*40)
print (df['one'].notnull())
print("-"*40)
print (df['one'].isnull())

        one       two     three
a -1.780543 -0.455291 -0.478708
b       NaN       NaN       NaN
c -1.225284 -0.338467  2.158692
d       NaN       NaN       NaN
e -1.291847  0.103529  2.295500
f -0.481004  0.503503  0.221543
g       NaN       NaN       NaN
h  0.947167  1.068833 -0.743487
----------------------------------------
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
----------------------------------------
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
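Because the boolean masks behave like 0/1 values, they are handy for counting missing entries. A minimal sketch on a small fixed DataFrame (illustrative values), also showing the top-level function form:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [np.nan, np.nan, 6.0]})

mask = pd.isnull(df)                     # top-level function, same as df.isnull()
missing_per_column = df.isnull().sum()   # each True counts as 1
any_missing = df.isnull().any().any()    # True if any cell is missing

print(missing_per_column)
print(any_missing)
```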


### Calculations with Missing Data

- When summing data, NA values are treated as zero.
- If the data are all NA, the sum is 0, as Example 2 shows (pandas versions before 0.22 returned NaN).

### Example 1

In [41]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'])
print("-"*40)
print(df['one'].sum())
print("-"*40)
print(df['one'].count())   # Number of non-missing values

a   -0.870422
b         NaN
c   -0.264225
d         NaN
e   -0.569504
f   -0.072260
g         NaN
h   -0.356483
Name: one, dtype: float64
----------------------------------------
-2.1328934911059467
----------------------------------------
5


### Example 2

In [44]:
df = pd.DataFrame(index=[0,1,2,3,4,5],columns=['one','two'])
print(df)
print("-"*40)
print (df['one'].sum())

   one  two
0  NaN  NaN
1  NaN  NaN
2  NaN  NaN
3  NaN  NaN
4  NaN  NaN
5  NaN  NaN
----------------------------------------
0
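The treatment of NA in reductions can be adjusted with the `skipna` and `min_count` arguments. The following sketch uses small fixed Series (illustrative values) to show the behaviors side by side.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
all_na = pd.Series([np.nan, np.nan])

total = s.sum()                       # NaN is skipped
strict_total = s.sum(skipna=False)    # NaN propagates into the result
average = s.mean()                    # mean of the non-missing values only
empty_sum = all_na.sum()              # all-NA sum is 0.0
guarded_sum = all_na.sum(min_count=1) # NaN unless at least one valid value exists

print(total, strict_total, average, empty_sum, guarded_sum)
```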


## Cleaning / Filling Missing Data

Pandas provides various methods for cleaning the missing values. The fillna function can “fill in” NA values with non-null data in a couple of ways, which we have illustrated in the following sections.

## Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".

In [45]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print (df)
print("-"*40)
print("NaN replaced with '0':")
print (df.fillna(0))

        one       two     three
a -1.066508  0.699219  1.276876
b       NaN       NaN       NaN
c  1.658481  0.222874 -1.277126
----------------------------------------
NaN replaced with '0':
        one       two     three
a -1.066508  0.699219  1.276876
b  0.000000  0.000000  0.000000
c  1.658481  0.222874 -1.277126


Here we fill with zero, but any other value can be used instead.
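The fill value does not have to be a single scalar. As a sketch, `fillna` also accepts a dict of per-column values or a Series such as the column means; the data here is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [np.nan, 4.0, 8.0]})

# A different fill value for each column
by_column = df.fillna({'one': 0.0, 'two': -1.0})

# Fill each column with its own mean (NaN is ignored when computing the mean)
by_mean = df.fillna(df.mean())

print(by_column)
print(by_mean)
```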

## Fill NA Forward and Backward

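fillna can also propagate neighboring values into the gaps: method='pad' (alias 'ffill') fills forward from the last valid value, and method='backfill' (alias 'bfill') fills backward from the next valid value.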

![](../all_images/pandas_missing.PNG)

### Example 1

In [47]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'e', 'f', 'h'],columns=['m-0', 'm-1', 'm-2'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print("-"*40)

print (df.fillna(method='pad'))

        m-0       m-1       m-2
a  1.321987 -1.426011  0.425190
b       NaN       NaN       NaN
c       NaN       NaN       NaN
d       NaN       NaN       NaN
e  2.332227  0.382206  1.780067
f -0.270088 -1.280743 -0.974436
g       NaN       NaN       NaN
h  0.129827  0.543173  0.846100
----------------------------------------
        m-0       m-1       m-2
a  1.321987 -1.426011  0.425190
b  1.321987 -1.426011  0.425190
c  1.321987 -1.426011  0.425190
d  1.321987 -1.426011  0.425190
e  2.332227  0.382206  1.780067
f -0.270088 -1.280743 -0.974436
g -0.270088 -1.280743 -0.974436
h  0.129827  0.543173  0.846100


### Example 2

In [49]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'e', 'f', 'h'],columns=['m-0', 'm-1', 'm-2'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print("-"*40)

print (df.fillna(method='backfill'))

        m-0       m-1       m-2
a  0.496717 -0.815986 -1.358696
b       NaN       NaN       NaN
c       NaN       NaN       NaN
d       NaN       NaN       NaN
e -0.430516  0.012630 -0.076249
f -1.270131  0.713687  1.141775
g       NaN       NaN       NaN
h  0.880315 -2.325500  0.576816
----------------------------------------
        m-0       m-1       m-2
a  0.496717 -0.815986 -1.358696
b -0.430516  0.012630 -0.076249
c -0.430516  0.012630 -0.076249
d -0.430516  0.012630 -0.076249
e -0.430516  0.012630 -0.076249
f -1.270131  0.713687  1.141775
g  0.880315 -2.325500  0.576816
h  0.880315 -2.325500  0.576816
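The same fills are available as the dedicated methods `ffill()` and `bfill()`. To my knowledge, recent pandas releases (2.1 and later) warn when `method=` is passed to `fillna`, so the sketch below uses the dedicated methods; the `limit` argument caps how many consecutive gaps are filled. Values are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # carry the last valid value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```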


## Drop Missing Values

To simply exclude the missing values, use the dropna function along with the axis argument. By default, axis=0 (along rows), which means that if any value within a row is NA, the whole row is excluded.

### Example 1


In [52]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print("-"*40)

print (df.dropna()) ## dropping rows with missing values

        one       two     three
a  0.291153 -0.068817  0.524530
b       NaN       NaN       NaN
c  1.213827  1.196916  0.798348
d       NaN       NaN       NaN
e -1.045626  0.133332  2.470245
f -0.234102  0.702337 -1.911084
g       NaN       NaN       NaN
h -2.134151 -0.558050  0.040926
----------------------------------------
        one       two     three
a  0.291153 -0.068817  0.524530
c  1.213827  1.196916  0.798348
e -1.045626  0.133332  2.470245
f -0.234102  0.702337 -1.911084
h -2.134151 -0.558050  0.040926


### Example 2

In [55]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)
print("-"*40)

print (df.dropna(axis=1)) ## dropping columns with missing values

        one       two     three
a -0.132769 -0.435357 -0.290375
b       NaN       NaN       NaN
c -0.591893 -0.779379 -0.355633
d       NaN       NaN       NaN
e  0.839953  0.596616  0.937402
f  0.894088 -0.757391 -0.240640
g       NaN       NaN       NaN
h  1.200537 -0.052903 -1.642685
----------------------------------------
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
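Dropping every row or column that contains any NA can be too aggressive, as the empty result above shows. A sketch of the finer-grained `how`, `thresh`, and `subset` arguments follows, using a small fixed DataFrame with illustrative values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, np.nan, 6.0],
                   'three': [7.0, np.nan, 9.0]})

only_empty_rows = df.dropna(how='all')       # drop rows where every value is NA
at_least_two = df.dropna(thresh=2)           # keep rows with >= 2 non-NA values
required_two = df.dropna(subset=['two'])     # only look at column 'two'

print(only_empty_rows)
print(required_two)
```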


## Replace Missing (or) Generic Values

Many times, we have to replace a generic value with some specific value. We can achieve this by applying the replace method.

Replacing NA with a scalar value using replace() is equivalent to calling fillna().

### Example 1

In [56]:
df = pd.DataFrame({'one':[10.1,20.2,30.3,40,50,2000], 'two':[1000.00,0,30,40,50,60]})
print(df)
print("-"*40)
print (df.replace({1000.00:10,2000:60.66}))

      one     two
0    10.1  1000.0
1    20.2     0.0
2    30.3    30.0
3    40.0    40.0
4    50.0    50.0
5  2000.0    60.0
----------------------------------------
     one   two
0  10.10  10.0
1  20.20   0.0
2  30.30  30.0
3  40.00  40.0
4  50.00  50.0
5  60.66  60.0
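As a small check of the equivalence with `fillna()`, and of the list form of `replace()`, consider the following sketch. The data is illustrative and smaller than in the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [10.0, np.nan, 2000.0], 'two': [1000.0, 0.0, np.nan]})

# Replacing NaN with a scalar gives the same frame as fillna
via_replace = df.replace(np.nan, 0)
via_fillna = df.fillna(0)

# Lists are matched pairwise: 1000.0 -> 10.0 and 2000.0 -> 60.0
swapped = df.replace([1000.0, 2000.0], [10.0, 60.0])

print(via_replace.equals(via_fillna))
print(swapped)
```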



## Hierarchical indexing (MultiIndex)

Hierarchical (multi-level) indexing enables some quite sophisticated data analysis and manipulation, especially when working with higher dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

### Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. It can be created from:
- a list of arrays (using MultiIndex.from_arrays())
- an array of tuples (using MultiIndex.from_tuples())
- a crossed set of iterables (using MultiIndex.from_product())
- a DataFrame (using MultiIndex.from_frame()).

The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [77]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))

In [78]:
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [79]:
index = pd.MultiIndex.from_tuples(tuples, names=['one', 'two'])


In [80]:
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['one', 'two'])

In [81]:
s = pd.Series(np.random.randn(8), index=index)

In [82]:
s

one  two
bar  one    0.689673
     two    0.453947
baz  one    0.102386
     two    0.911272
foo  one    0.242827
     two   -1.468812
qux  one    0.608270
     two    2.076161
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [83]:
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [84]:
pd.MultiIndex.from_product(iterables, names=['first', 'second'])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
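The remaining constructors listed earlier, `from_arrays()` and `from_frame()`, as well as the plain `Index` constructor with a list of tuples, can be sketched as follows. The labels are a shortened, illustrative version of the ones used above.

```python
import pandas as pd

arrays = [['bar', 'bar', 'baz'], ['one', 'two', 'one']]
from_arrays = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

# Column names of the frame become the level names
frame = pd.DataFrame({'first': ['bar', 'bar', 'baz'],
                      'second': ['one', 'two', 'one']})
from_frame = pd.MultiIndex.from_frame(frame)

# The Index constructor returns a MultiIndex for a list of tuples
from_index = pd.Index([('bar', 'one'), ('bar', 'two'), ('baz', 'one')])

print(from_arrays.equals(from_frame))
print(type(from_index).__name__)
```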

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [85]:
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

In [86]:
s = pd.Series(np.random.randn(8), index=arrays)


In [87]:
s

bar  one   -0.342713
     two   -0.178773
baz  one    1.426398
     two    0.773785
foo  one    0.621397
     two   -0.964799
qux  one    0.480430
     two   -0.408228
dtype: float64

In [88]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [89]:
df

                0         1         2         3
bar one -0.574036 -0.049006  1.954441 -0.193692
    two  0.854612 -0.350718  0.154998  0.283176
baz one  0.485028  0.552050  1.433819 -0.399775
    two -0.529789 -0.515510 -1.030298 -1.086188
foo one -0.916279 -0.088082  2.525949 -1.261116
    two -0.689671  0.292921 -0.423463  0.422791
qux one  0.069551  1.457015  0.332795 -1.613163
    two  0.176659 -1.265058  0.180960  0.464527


All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [90]:
df.index.names

FrozenList([None, None])
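
Names can also be attached after construction. The sketch below assumes a smaller frame than the one above and uses the illustrative level names `first` and `second`.

```python
import numpy as np
import pandas as pd

arrays = [np.array(["bar", "bar", "baz", "baz"]),
          np.array(["one", "two", "one", "two"])]
df = pd.DataFrame(np.random.randn(4, 2), index=arrays)
print(list(df.index.names))  # no names were given, so both are None

# set_names returns a new index, so assign it back to the frame
df.index = df.index.set_names(["first", "second"])
print(list(df.index.names))
```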

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [91]:
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [92]:
df

one       bar                 baz                 foo                 qux
two       one       two       one       two       one       two       one       two
A    0.973758  0.006658 -0.383914  0.045786 -1.620563  0.195389 -0.531078  0.590996
B    0.263032 -1.372064 -0.460883 -0.780552  1.396432  0.619219 -1.275708 -0.807469
C   -1.256723 -0.356370  0.147443  0.478860  0.967350 -0.459532 -0.315192 -2.178029


In [93]:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

one           bar                 baz                 foo
two           one       two       one       two       one       two
one two
bar one -0.050312  0.942670  0.539197  0.976262  0.267907 -0.628794
    two -0.773005 -0.169579 -0.876828 -0.497828 -0.087466 -0.459555
baz one -0.141164 -0.593003  1.886547  0.487645 -1.088602 -0.232656
    two -0.271612  0.348401  1.064819 -0.180327  0.166782 -0.222782
foo one  0.365652  0.398211 -0.619410 -0.333158 -0.721233 -0.787124
    two -0.795196  1.162408  0.991722  1.025523  0.792536 -0.051210


## Groupby

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

In [76]:
# Let's compute mean speed of animals in the below example
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'], 'Max Speed': [380., 370., 24., 26.]})
print(df)
df.groupby(['Animal']).mean()

   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0


        Max Speed
Animal
Falcon      375.0
Parrot       25.0
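
The split, apply, and combine steps described at the start of this section can be spelled out by hand. This is only an illustrative sketch of what groupby does conceptually, not how pandas implements it internally; the names `means`, `manual`, and `built_in` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380.0, 370.0, 24.0, 26.0]})

# Split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
# Apply: compute the mean of each piece
means = {}
for animal, group in df.groupby("Animal"):
    means[animal] = group["Max Speed"].mean()

# Combine: gather the per-group results into one object
manual = pd.Series(means, name="Max Speed")
manual.index.name = "Animal"

# The built-in one-liner gives the same numbers
built_in = df.groupby("Animal")["Max Speed"].mean()

print(manual)
print(built_in)
```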


In [96]:
# Group by based on hierarchical index
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
print(df)

print("-"*40)
print(df.groupby(level=0).mean())
print("-"*40)
print(df.groupby(level="Type").mean())

                Max Speed
Animal Type              
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
----------------------------------------
        Max Speed
Animal           
Falcon      370.0
Parrot       25.0
----------------------------------------
         Max Speed
Type              
Captive      210.0
Wild         185.0
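
Grouping by an index level should give the same numbers as first moving the levels into ordinary columns and then grouping by column name. The sketch below illustrates that equivalence on the same data; `flat`, `by_level`, and `by_column` are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["Falcon", "Falcon", "Parrot", "Parrot"],
     ["Captive", "Wild", "Captive", "Wild"]],
    names=("Animal", "Type"))
df = pd.DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)

# Group directly on the index level called "Type"
by_level = df.groupby(level="Type").mean()

# Move the index levels into ordinary columns, then group by column name
flat = df.reset_index()
by_column = flat.groupby("Type")[["Max Speed"]].mean()

print(by_level)
print(by_column)
```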


## Pivot Table

Spreadsheet-style pivot tables can be created as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

In [97]:
# Let's create a DataFrame with categorical columns (A, B, C) and numeric columns (D, E)
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df)

     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9


In [98]:
# Aggregate values and compute the sum
# (the string 'sum' avoids the FutureWarning that recent pandas versions emit for np.sum)
import numpy as np
table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
table

C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
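
To see the hierarchical index mentioned above, the following sketch inspects the row labels of a pivot table and rebuilds roughly the same table with groupby plus unstack. The groupby route is shown only as an illustration of what the pivot computes; it is not claimed to be how pivot_table is implemented.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small",
          "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

table = pd.pivot_table(df, values="D", index=["A", "B"],
                       columns=["C"], aggfunc="sum")

# The row labels of the result form a two-level MultiIndex
print(type(table.index).__name__)
print(list(table.index.names))

# Roughly the same table via groupby + unstack
via_groupby = df.groupby(["A", "B", "C"])["D"].sum().unstack("C")
print(via_groupby)
```

Combinations that never occur in the data, such as foo/two/large, show up as NaN in both versions.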


In [99]:
# Aggregate by taking the mean across multiple columns
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], aggfunc={'D': 'mean', 'E': 'mean'})
table

                  D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333


In [100]:
# Calculate multiple types of aggregations for any given value column
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'], aggfunc={'D': 'mean', 'E': ['min', 'max', 'mean']})
table

                  D    E
               mean  max      mean  min
A   C
bar large  5.500000  9.0  7.500000  6.0
    small  5.500000  9.0  8.500000  8.0
foo large  2.000000  5.0  4.500000  4.0
    small  2.333333  6.0  4.333333  2.0


## Export Pandas DataFrame to a CSV File



In [102]:
# Syntax

# df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv')

In [103]:
# Example

from pandas import DataFrame

cars = {'Brand': ['Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4'],
        'Price': [22000, 25000, 27000, 35000]}

df = DataFrame(cars, columns=['Brand', 'Price'])

print(df)

            Brand  Price
0     Honda Civic  22000
1  Toyota Corolla  25000
2      Ford Focus  27000
3         Audi A4  35000


To export the DataFrame you just created to a CSV file, call to_csv with the target path:

In [105]:
# to_csv returns None when a path is given, so there is nothing useful to assign
df.to_csv(r'export_dataframe.csv', index=False, header=True)  # Don't forget to add '.csv' at the end of the path

You can then find export_dataframe.csv in the current working directory.
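
One way to confirm the export worked is to read the file back and compare it with the original. The sketch below writes to a temporary directory (an assumption made here so the example leaves no files behind) and uses pd.read_csv for the round trip.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Brand": ["Honda Civic", "Toyota Corolla", "Ford Focus", "Audi A4"],
                   "Price": [22000, 25000, 27000, 35000]})

# Write to a temporary directory so nothing is left behind afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "export_dataframe.csv")
    df.to_csv(path, index=False)  # index=False leaves out the row labels
    restored = pd.read_csv(path)

print(restored)
```

Because the row labels were not written, the restored frame gets a fresh default integer index.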

## Practice Work

- Write a Pandas program to convert a Pandas Series to a Python list and display its type.
- Write a Pandas program to convert a dictionary to a Pandas series.

  Sample dictionary: d1 = {'a': 100, 'b': 200, 'c':300, 'd':400, 'e':800}
  
  
- Write a Pandas program to create and display a DataFrame from a specified dictionary data which has the index labels.

Sample DataFrame:

exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],

'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],

'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

- Write a Pandas program to select the specified columns and rows from a given DataFrame.
Select 'name' and 'score' columns in rows 1, 3, 5, 6 from the above data frame.

## Source(s)
- https://pandas.pydata.org/pandas-docs/stable/overview.html
- https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
- https://www.tutorialspoint.com/python_pandas/python_pandas_missing_data.htm