# Pandas Basics

Basics of Pandas covering topics like:<br>
1. <a href="#intro">Introduction</a>
2. <a href="#series">Series </a>  
3. <a href="#DF">Dataframes</a>   


## <a id="intro">Introduction</a> :

In simple words, it is the <b><i>Excel of Python</i></b>.  

In more detail, it is an <b>open source library</b> built on top of <i>NumPy</i>, used for <b>data loading, processing, and cleaning</b>, with some visualization options as well. It can work with a <b>variety of data sources</b> as both input and output.  
Example: Excel (.xlsx or .xls) files or text (.csv) files. 
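
As a rough illustration of the input and output side, the sketch below reads CSV text into a DataFrame and writes it back out. The column names and values are made up for the example, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a file path or any file-like object
scores = pd.read_csv(io.StringIO(csv_text))

# to_csv returns the CSV text as a string when no path is given
round_trip = scores.to_csv(index=False)
print(round_trip)
```

With a real file, the same calls would take a path instead, for example `pd.read_csv("scores.csv")` (a hypothetical file name).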

In [1]:
#importing Pandas and giving it abbreviation pd to use easily
import numpy as np
import pandas as pd

# <a id="series">Series</a>
Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). <br>The axis labels are collectively called index. Pandas Series is nothing but a column in an excel sheet.<br>
<i>~Source - geeksforgeeks.com</i>


In [11]:
labels=['row1','row2','row3']
my_data=[10,20,30]
arr=np.array(my_data)
my_data

[10, 20, 30]

In [13]:
#making a Series with my_data as the values and labels as the index
ser1=pd.Series(data=my_data,index=labels)
ser1

row1    10
row2    20
row3    30
dtype: int64

In [21]:
arr2=np.arange(0,9).reshape(3,3)
ser2=pd.Series(data=arr2,index=labels)
#This example shows that a Pandas Series only accepts 1D data and raises an error if a 2D array is passed


Exception: Data must be 1-dimensional
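
Two possible ways around this error are sketched below: flatten the array before building the Series, or keep the 2D shape and use a DataFrame instead. The names `flat_ser` and `frame` are only illustrative.

```python
import numpy as np
import pandas as pd

arr2 = np.arange(0, 9).reshape(3, 3)

# Option 1: flatten the 3x3 array into 9 values, then build a Series
flat_ser = pd.Series(arr2.ravel())

# Option 2: keep the 2D shape by using a DataFrame instead of a Series
labels = ['row1', 'row2', 'row3']
frame = pd.DataFrame(arr2, index=labels)

print(flat_ser.shape)
print(frame.shape)
```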

<b>Pandas Series can also store functions as data</b>

In [23]:
#storing functions as data in Series
#if no index is passed, a default index (labels) of 0,1,2..... is used
pd.Series(data=[sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

### Operations on Series

In [24]:
ser1

row1    10
row2    20
row3    30
dtype: int64

In [28]:
ser2=pd.Series(data=[5,3],index=['row2','row1'])   
ser2

row2    5
row1    3
dtype: int64

In [29]:
#adding both the series
ser1+ser2

row1    13.0
row2    25.0
row3     NaN
dtype: float64

In [30]:
ser2+ser1

row1    13.0
row2    25.0
row3     NaN
dtype: float64

As we can see from above, adding two Series gives the sum of the elements matched by their labels, irrespective of the order.<br>
<i>Note: We get <code>NaN</code> for labels which are not present in both Series.</i>

In [31]:
ser1*ser2

row1     30.0
row2    100.0
row3      NaN
dtype: float64
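
If the `NaN` for a label missing from one Series is not wanted, one option is the method form of the operator, which takes a `fill_value`. The sketch below rebuilds `ser1` and `ser2` with the same values as above and treats a missing label as 0.

```python
import pandas as pd

ser1 = pd.Series([10, 20, 30], index=['row1', 'row2', 'row3'])
ser2 = pd.Series([5, 3], index=['row2', 'row1'])

# A label missing from one Series is treated as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
print(total)
```

Here `row3` keeps its value of 30 because the missing entry in `ser2` is filled with 0 before the addition.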

<hr>

# <a id="DF">Dataframes</a>

Pandas DataFrame is two-dimensional structure , potentially understood as an extension to pandas series and is tabular in nature.

In [37]:
import numpy as np
import pandas as pd
from numpy.random import randn

In [40]:
#pd.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

#in this first arg was data, followed by row names then column names.
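
The same construction can be written with keyword arguments, which makes the role of each argument explicit. This is only a sketch: it uses a seeded NumPy generator (the seed is arbitrary) instead of `randn` so that the numbers are reproducible, and the name `df_kw` is made up.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)

df_kw = pd.DataFrame(
    data=rng.standard_normal((5, 4)),      # 5 rows x 4 columns of random values
    index=['A', 'B', 'C', 'D', 'E'],       # row labels
    columns=['W', 'X', 'Y', 'Z'],          # column labels
)
print(df_kw.shape)
```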

In [41]:
df

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,0.122627,0.402753,1.648286,-0.227295
C,0.061513,0.420233,0.413903,0.454575
D,-0.762936,0.219159,-0.693938,-0.136191
E,-0.671568,-0.795715,-1.083965,-0.765652


In [42]:
df['W']

A    0.407916
B    0.122627
C    0.061513
D   -0.762936
E   -0.671568
Name: W, dtype: float64

In [52]:
#creating a new column 'sum' which is the sum of the other columns
df['sum']=df['W']+df['X']+df['Y']+df['Z']
df

Unnamed: 0,W,X,Y,Z,sum
A,0.407916,1.359632,0.410242,0.554724,2.732514
B,0.122627,0.402753,1.648286,-0.227295,1.94637
C,0.061513,0.420233,0.413903,0.454575,1.350223
D,-0.762936,0.219159,-0.693938,-0.136191,-1.373906
E,-0.671568,-0.795715,-1.083965,-0.765652,-3.3169


In [58]:
#raises a KeyError: without axis, drop looks for a row labelled 'sum'
df.drop('sum')

KeyError: "['sum'] not found in axis"

### <b><i>NOTE:</i></b><br>
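This <code>KeyError: "['sum'] not found in axis"</code> occurs because the deletion axis is not specified: <code>drop</code> defaults to <code>axis=0</code> (rows), and there is no row labelled <code>sum</code>.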

In [66]:
#axis=0 implies rows, axis=1 implies columns
#dropping again, this time specifying the axis
df.drop('sum',axis=1)

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,0.122627,0.402753,1.648286,-0.227295
C,0.061513,0.420233,0.413903,0.454575
D,-0.762936,0.219159,-0.693938,-0.136191
E,-0.671568,-0.795715,-1.083965,-0.765652


We can see that the sum column is no longer in the returned DataFrame.

In [67]:
#running df again
df

Unnamed: 0,W,X,Y,Z,sum
A,0.407916,1.359632,0.410242,0.554724,2.732514
B,0.122627,0.402753,1.648286,-0.227295,1.94637
C,0.061513,0.420233,0.413903,0.454575,1.350223
D,-0.762936,0.219159,-0.693938,-0.136191,-1.373906
E,-0.671568,-0.795715,-1.083965,-0.765652,-3.3169


We can see that even after dropping the sum column, running df again shows it was not permanently deleted.
To permanently delete a column use <br><code>df.drop("sum", axis=1, <b>inplace=True</b>)</code>

Calling drop without inplace is useful when you want to assign the result to a new variable and keep it as a copy, leaving the original table untouched.
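
A minimal sketch of that non-inplace pattern, on a small made-up DataFrame (`df_demo` and `without_sum` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['A', 'B'])
df_demo['sum'] = df_demo['W'] + df_demo['X']

# drop returns a new DataFrame; df_demo itself is left unchanged
without_sum = df_demo.drop(columns='sum')   # equivalent to drop('sum', axis=1)

print(list(without_sum.columns))
print(list(df_demo.columns))
```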

In [69]:
df.drop('sum',axis=1,inplace=True)

In [70]:
df

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,0.122627,0.402753,1.648286,-0.227295
C,0.061513,0.420233,0.413903,0.454575
D,-0.762936,0.219159,-0.693938,-0.136191
E,-0.671568,-0.795715,-1.083965,-0.765652


<hr>Printing only certain columns 

In [73]:
df[['W','Z']]

Unnamed: 0,W,Z
A,0.407916,0.554724
B,0.122627,-0.227295
C,0.061513,0.454575
D,-0.762936,-0.136191
E,-0.671568,-0.765652


<hr>Printing only certain rows<br>
We will use <code><b> df.loc[label] </b></code> <br>
where loc stands for location, and label is the label of the row we want.   <br> 
The result will be a Pandas Series.

In [75]:
df.loc['A']

W    0.407916
X    1.359632
Y    0.410242
Z    0.554724
Name: A, dtype: float64

<hr> 
Getting the value at a particular row and column
using <br><code><b>df.loc[row label, column label]</b></code>

In [78]:
df.loc['A','W']

0.40791570157811313

Getting value of particular rows and columns.

In [80]:
#passing a list of row labels and then a list of column labels (hence the double brackets)
df.loc[['A','B'],['W','Z']]

Unnamed: 0,W,Z
A,0.407916,0.554724
B,0.122627,-0.227295
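
To summarise the three forms of `df.loc` used above, the sketch below shows what type each one returns. It uses a small made-up DataFrame rather than the random one from this notebook.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [0.4, 0.1], 'Z': [0.5, -0.2]},
    index=['A', 'B'],
)

one_value = df_demo.loc['A', 'W']            # single label pair -> scalar
one_row = df_demo.loc['A']                   # single row label  -> Series
sub_table = df_demo.loc[['A', 'B'], ['W']]   # lists of labels   -> DataFrame

print(type(one_row).__name__, type(sub_table).__name__)
```

Note that passing a list always gives back a DataFrame, even when the list holds a single label.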


<hr>
Getting back dataframe where condition is true :

In [89]:
cond_df=df > 0.3
cond_df

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,False,True,True,False
C,False,True,True,True
D,False,False,False,False
E,False,False,False,False


In [90]:
df[cond_df]

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,,0.402753,1.648286,
C,,0.420233,0.413903,0.454575
D,,,,
E,,,,


In [97]:
#returns a boolean Series: True for each row where the condition holds in column W
df['W']>0.1

A     True
B     True
C    False
D    False
E    False
Name: W, dtype: bool

In [98]:
df[df['W']>0.1]

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,0.122627,0.402753,1.648286,-0.227295


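As we can see, when the condition is placed on a single column, the rows where it is False are dropped entirely, so no NaN values appear. The filter only looks at that one column, though: values in the other columns are returned as they are. In the example above, <code>df.loc['B','Z']</code> is False in <code>cond_df</code>, yet it is still shown.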

In [120]:
# to combine 2 conditions, wrap each one in parentheses and put & (and) or | (or) between them
df[(df['W']>0.1) & (df['Z']<-0.1)]

Unnamed: 0,W,X,Y,Z
B,0.122627,0.402753,1.648286,-0.227295
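
The sketch below shows the three element-wise operators side by side on a small made-up DataFrame: `&` for and, `|` for or, and `~` for not. The parentheses around each comparison are needed because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [0.4, 0.1, -0.7], 'Z': [0.5, -0.2, -0.7]},
    index=['A', 'B', 'C'],
)

# and: both conditions must hold for the row
both = df_demo[(df_demo['W'] > 0) & (df_demo['Z'] < 0)]

# or: at least one condition must hold for the row
either = df_demo[(df_demo['W'] > 0.3) | (df_demo['Z'] < -0.5)]

# not: rows where the condition does not hold
negated = df_demo[~(df_demo['W'] > 0)]

print(list(both.index), list(either.index), list(negated.index))
```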


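<b><i>Note: Most of these operations return a new object and leave <code>df</code> unchanged. To keep the result, assign it to a variable or pass <code>inplace=True</code> to methods that accept it, such as <code>drop</code>, <code>reset_index</code> and <code>set_index</code>.</i></b>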

In [125]:
#moves the current index into a new column called 'index' and adds a default integer index in front
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,0.407916,1.359632,0.410242,0.554724
1,B,0.122627,0.402753,1.648286,-0.227295
2,C,0.061513,0.420233,0.413903,0.454575
3,D,-0.762936,0.219159,-0.693938,-0.136191
4,E,-0.671568,-0.795715,-1.083965,-0.765652


In [126]:
df

Unnamed: 0,W,X,Y,Z
A,0.407916,1.359632,0.410242,0.554724
B,0.122627,0.402753,1.648286,-0.227295
C,0.061513,0.420233,0.413903,0.454575
D,-0.762936,0.219159,-0.693938,-0.136191
E,-0.671568,-0.795715,-1.083965,-0.765652


In [129]:
country='IN CA US AU EU'.split()
df['Country']=country
df

Unnamed: 0,W,X,Y,Z,Country
A,0.407916,1.359632,0.410242,0.554724,IN
B,0.122627,0.402753,1.648286,-0.227295,CA
C,0.061513,0.420233,0.413903,0.454575,US
D,-0.762936,0.219159,-0.693938,-0.136191,AU
E,-0.671568,-0.795715,-1.083965,-0.765652,EU


In [131]:
df.set_index('Country')
#we set Country as Index

Country,W,X,Y,Z
IN,0.407916,1.359632,0.410242,0.554724
CA,0.122627,0.402753,1.648286,-0.227295
US,0.061513,0.420233,0.413903,0.454575
AU,-0.762936,0.219159,-0.693938,-0.136191
EU,-0.671568,-0.795715,-1.083965,-0.765652


In [132]:
df

Unnamed: 0,W,X,Y,Z,Country
A,0.407916,1.359632,0.410242,0.554724,IN
B,0.122627,0.402753,1.648286,-0.227295,CA
C,0.061513,0.420233,0.413903,0.454575,US
D,-0.762936,0.219159,-0.693938,-0.136191,AU
E,-0.671568,-0.795715,-1.083965,-0.765652,EU


To make the index change permanent, pass <code>inplace=True</code>.

In [133]:
df.set_index('Country',inplace=True)

In [134]:
df

Country,W,X,Y,Z
IN,0.407916,1.359632,0.410242,0.554724
CA,0.122627,0.402753,1.648286,-0.227295
US,0.061513,0.420233,0.413903,0.454575
AU,-0.762936,0.219159,-0.693938,-0.136191
EU,-0.671568,-0.795715,-1.083965,-0.765652
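
An alternative to `inplace=True` is to assign the returned DataFrame to a name. The sketch below, on a small made-up DataFrame, also shows that `reset_index` turns the index back into an ordinary column.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'W': [0.4, 0.1], 'Country': ['IN', 'CA']},
    index=['A', 'B'],
)

# set_index returns a new DataFrame; the old labels A and B are discarded
by_country = df_demo.set_index('Country')

# reset_index moves Country back into a column and restores a 0..n-1 index
restored = by_country.reset_index()

print(list(by_country.index))
print(list(restored.columns))
```

Since `set_index` replaces the old index, the original labels `A` and `B` are not recoverable from `by_country`.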


## Multi-Level Index
In this kind of index we have one primary index and one or more secondary indexes. <br>
This is useful when the data has to be broken down hierarchically, such as Country and States or Course and Sections.

In [149]:
level1_index=['India','India','India','USA','USA','Canada']
level2_index=['Chandigarh','Delhi','Noida','Florida','Boston','Waterloo']
hier_index=list(zip(level1_index,level2_index))

#The zip() function returns a zip object, which is an iterator of tuples where the first items of each passed 
#iterable are paired together, then the second items of each passed iterable are paired together, and so on

hier_index=pd.MultiIndex.from_tuples(hier_index)

#pandas MultiIndex builds a two-level index from the list of (level 1, level 2) tuples
hier_index

MultiIndex(levels=[['Canada', 'India', 'USA'], ['Boston', 'Chandigarh', 'Delhi', 'Florida', 'Noida', 'Waterloo']],
           codes=[[1, 1, 1, 2, 2, 0], [1, 2, 4, 3, 0, 5]])
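
As a possible shortcut, `pd.MultiIndex.from_arrays` builds the same kind of index directly from the two lists, without the explicit `zip` step, and can name the levels at the same time. The level names `Country` and `City` are chosen here for illustration.

```python
import pandas as pd

countries = ['India', 'India', 'India', 'USA', 'USA', 'Canada']
cities = ['Chandigarh', 'Delhi', 'Noida', 'Florida', 'Boston', 'Waterloo']

# Build the index from parallel lists and name both levels in one call
alt_index = pd.MultiIndex.from_arrays([countries, cities], names=['Country', 'City'])

# The zip + from_tuples route gives an equal index
same = pd.MultiIndex.from_tuples(list(zip(countries, cities)), names=['Country', 'City'])

print(alt_index.equals(same))
```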

In [150]:
df_country=pd.DataFrame(randn(6,2),hier_index,['Employment Index','Gender Ratio'])
#passing hier_index as the row labels 

In [151]:
df_country

Unnamed: 0,Unnamed: 1,Employment Index,Gender Ratio
India,Chandigarh,-0.532998,-1.800171
India,Delhi,0.726451,-0.325613
India,Noida,-0.621033,-0.359268
USA,Florida,1.366847,0.056813
USA,Boston,0.88104,-1.149734
Canada,Waterloo,1.869794,-1.648532


In [155]:
#giving names to index
df_country.index.names=['Country','City']
df_country

Country,City,Employment Index,Gender Ratio
India,Chandigarh,-0.532998,-1.800171
India,Delhi,0.726451,-0.325613
India,Noida,-0.621033,-0.359268
USA,Florida,1.366847,0.056813
USA,Boston,0.88104,-1.149734
Canada,Waterloo,1.869794,-1.648532


We can get required data by the following:


In [158]:
df_country.loc['India']

City,Employment Index,Gender Ratio
Chandigarh,-0.532998,-1.800171
Delhi,0.726451,-0.325613
Noida,-0.621033,-0.359268


In [160]:
df_country.loc['India'].loc['Chandigarh']['Gender Ratio']

-1.8001714366273804

In [162]:
df_country.xs('Chandigarh',level="City")

Country,Employment Index,Gender Ratio
India,-0.532998,-1.800171


<code><b>df_country.xs(match_value,level)</b></code> stands for cross section: it returns the rows of the dataframe whose label in the given level equals <code>match_value</code>.<br>
<i>Note: the matched level (here the City level of the row index) is not included in the result.</i>
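
If the matched level should stay in the result, `xs` accepts `drop_level=False`. The sketch below uses a small made-up DataFrame with the same two index levels, and also shows a single `loc` call with a tuple as an alternative to chaining several `loc` lookups.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('India', 'Chandigarh'), ('India', 'Delhi'), ('USA', 'Boston')],
    names=['Country', 'City'],
)
df_demo = pd.DataFrame({'Gender Ratio': [-1.8, -0.3, -1.1]}, index=idx)

# Default: the matched City level is dropped from the result's index
dropped = df_demo.xs('Chandigarh', level='City')

# drop_level=False keeps both index levels in the result
kept = df_demo.xs('Chandigarh', level='City', drop_level=False)

# A (Country, City) tuple selects one row; the second argument picks the column
value = df_demo.loc[('India', 'Chandigarh'), 'Gender Ratio']

print(list(dropped.index.names), list(kept.index.names), value)
```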