<h1>Getting Started with pandas</h1>
pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and easy in Python. While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
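
The contrast can be sketched with a few made-up values (the column names and data below are illustrative only): a NumPy array carries a single dtype for all of its elements, while each DataFrame column keeps its own.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype shared by every element.
arr = np.array([1.5, 2.0, 3.25])
print(arr.dtype)  # float64

# A DataFrame can hold a different type in each column (illustrative data).
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.0, 3.25],
    "passed": [True, False, True],
})
print(df.dtypes)
```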

In [1]:
import pandas as pd
from pandas import Series, DataFrame

<h2>Introduction to pandas Data Structures</h2>
To get started with pandas, you will need to get comfortable with its two workhorse
data structures: Series and DataFrame.

<h3>Series</h3>
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [4]:
obj = Series([1,2,3,4])
obj

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
obj.values  # returns the Series data as a NumPy array

array([1, 2, 3, 4])

In [12]:
obj.index  # returns the index object of the Series

RangeIndex(start=0, stop=4, step=1)

In [16]:
obj1 = Series([1,2,3,4],index=["d","c","b","a"]) # define index yourself
obj1["a"] # Returns 4
obj1.index

Index(['d', 'c', 'b', 'a'], dtype='object')

In [17]:
obj1["c"] = 5
obj1

d    1
c    5
b    3
a    4
dtype: int64

In [18]:
obj1[["a","b","c"]]  # select several values by passing a list of index labels

a    4
b    3
c    5
dtype: int64

Standard NumPy-style operations, such as comparisons, filtering with a boolean array, and element-wise arithmetic, can be applied to a Series.

In [20]:
obj1 == 4  # returns a boolean Series, much as a NumPy comparison returns a boolean array

d    False
c    False
b    False
a     True
dtype: bool
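
As a further sketch of NumPy-style operations, the example below rebuilds a Series with the same values as obj1 (under the illustrative name s) and shows that filtering, arithmetic, and NumPy functions all preserve the link between index labels and values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 3, 4], index=["d", "c", "b", "a"])

# Boolean filtering keeps only the entries where the mask is True.
print(s[s > 3])

# Arithmetic is applied element-wise; the index is unchanged.
print(s * 2)

# NumPy universal functions accept a Series and return a Series.
print(np.sqrt(s))
```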

In [31]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj2 = Series(sdata)
obj2  # the dict keys become the index and the dict values become the data

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Two methods, isnull() and notnull(), return a boolean Series indicating which values are missing and which are present, respectively.
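
A minimal sketch of both methods, assuming a hypothetical list of labels (states) that includes one key absent from sdata, so that the resulting Series contains a missing value:

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]  # illustrative label list

# "California" is not a key in sdata, so its value is NaN (missing).
# "Utah" is not in states, so it is left out of the result.
obj3 = pd.Series(sdata, index=states)

print(obj3.isnull())   # True only where the value is missing
print(obj3.notnull())  # True only where the value is present
```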

In [32]:
obj2.name = "Population"
obj2.index.name = "State"
obj2

State
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
Name: Population, dtype: int64

In [34]:
obj.index = ["kuch bhi","kuch bhi not","everything","everything not"] # A Series’s index can be altered in-place by assignment
obj

kuch bhi          1
kuch bhi not      2
everything        3
everything not    4
dtype: int64

<h3>DataFrame</h3>
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.
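
The "dict of Series sharing the same index" view can be sketched as follows; the two small Series and their labels are made up for illustration.

```python
import pandas as pd

# Two Series with the same index labels (illustrative data).
year = pd.Series([2000, 2001], index=["a", "b"])
state = pd.Series(["Ohio", "Nevada"], index=["a", "b"])

# Each dict key becomes a column; the shared labels become the row index.
df = pd.DataFrame({"year": year, "state": state})
print(df.index)    # row index
print(df.columns)  # column index

# Selecting a column gives back a Series that carries the row index.
col = df["state"]
print(col)
```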

While a DataFrame is physically two-dimensional, you can use it to
represent higher-dimensional data in a tabular format using hierarchical indexing.
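
As a rough sketch of hierarchical indexing, the example below stores three dimensions (state, year, and a measurement) in a two-dimensional table by giving the rows a two-level index. The level names and the choice of pd.MultiIndex.from_product are illustrative; other constructors would work as well.

```python
import pandas as pd

# Two-level row index: every combination of state and year.
index = pd.MultiIndex.from_product(
    [["Nevada", "Ohio"], [2001, 2002]], names=["state", "year"]
)
df = pd.DataFrame({"pop": [2.4, 2.9, 1.7, 3.6]}, index=index)
print(df)

# Selecting on the outer level returns all rows for that state.
print(df.loc["Ohio"])

# A tuple of labels addresses a single row.
print(df.loc[("Nevada", 2002), "pop"])
```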

In [51]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [41]:
# For large DataFrames, head() selects only the first five rows by default and tail() selects the last five.
print("First five rows of frame", frame.head(), "Last five rows of frame", frame.tail(), "Frame", frame, sep="\n\n\n")

First five rows of frame


    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


Last five rows of frame


    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


Frame


    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
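
Both head and tail also accept the number of rows to return. A short sketch on a made-up ten-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})  # illustrative data

print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows
print(len(df.head()))  # 5, the default when no count is given
```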


In [61]:
frame.columns = ["State", "Year", "Population"]
frame

    State  Year  Population
0    Ohio  2000         1.5
1    Ohio  2001         1.7
2    Ohio  2002         3.6
3  Nevada  2001         2.4
4  Nevada  2002         2.9
5  Nevada  2003         3.2


In [48]:
frame[["State", "Population"]][0:3]

   State  Population
0   Ohio         1.5
1   Ohio         1.7
2   Ohio         3.6


In [65]:
# frame[column] works for any column name, but frame.column only works when the column name is a valid Python variable name and does not clash with an existing DataFrame attribute or method.
frame.Population

0    1.5
1    1.7
2    3.6
3    2.4
4    2.9
5    3.2
Name: Population, dtype: float64
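
The limits of attribute-style access can be sketched with two awkward column names (both chosen for illustration): one that is not a valid identifier, and one that collides with a DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({"pop": [1.5, 1.7], "total debt": [10, 20]})

# Bracket access works for any column name.
print(df["total debt"])
print(df["pop"])

# "total debt" contains a space, so it cannot be written as an attribute at all.
print("total debt".isidentifier())  # False

# "pop" is a valid identifier, but df.pop refers to the DataFrame.pop method,
# not to the column.
print(callable(df.pop))  # True
```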

In [68]:
frame1 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['one', 'two', 'three', 'four', 'five', 'six'])
frame1  # a requested column that is not a key in data ('debt') is filled with NaN

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7   NaN
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9   NaN
six    2003  Nevada  3.2   NaN


In [75]:
frame1["debt"] = "10 Million $"  # a scalar is assigned to every row
print(frame1)
frame1["debt"] = [f"{10*i} Million $" for i in range(6)]  # a list or array must match the DataFrame's length
frame1

       year   state  pop          debt
one    2000    Ohio  1.5  10 Million $
two    2001    Ohio  1.7  10 Million $
three  2002    Ohio  3.6  10 Million $
four   2001  Nevada  2.4  10 Million $
five   2002  Nevada  2.9  10 Million $
six    2003  Nevada  3.2  10 Million $


       year   state  pop          debt
one    2000    Ohio  1.5   0 Million $
two    2001    Ohio  1.7  10 Million $
three  2002    Ohio  3.6  20 Million $
four   2001  Nevada  2.4  30 Million $
five   2002  Nevada  2.9  40 Million $
six    2003  Nevada  3.2  50 Million $
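
Besides a scalar or a list, a Series can also be assigned to a column. In that case the values are matched to rows by index label rather than by position, and rows without a matching label are left as NaN. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# The Series covers only two of the three row labels (illustrative values).
val = pd.Series([-1.2, -1.5], index=["two", "three"])

# Assignment aligns on the index; row "one" has no match and becomes NaN.
df["debt"] = val
print(df)
```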


Rows can also be retrieved by label with the loc attribute.

In [76]:
frame1.loc["three"]

year             2002
state            Ohio
pop               3.6
debt     20 Million $
Name: three, dtype: object
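
For comparison, the sketch below places loc (selection by label) next to iloc (selection by integer position) on a small made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"year": [2000, 2001, 2002], "state": ["Ohio", "Ohio", "Nevada"]},
    index=["one", "two", "three"],
)

row_by_label = df.loc["three"]  # label-based selection
row_by_position = df.iloc[2]    # position-based selection (third row)
print(row_by_label.equals(row_by_position))  # True

# loc also accepts a list of row labels and a column label.
print(df.loc[["one", "three"], "year"])
```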