[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rbg-research/AI-Training/blob/main/python/Pandas/Pandas_DataFrame.ipynb)

# Python | Pandas DataFrame

A DataFrame is a two-dimensional data structure: data is arranged in a table of rows and columns. A pandas DataFrame has three principal components: the data, the rows, and the columns.

This notebook gives a brief overview of the basic operations that can be performed on a pandas DataFrame:

* Creating a DataFrame
* Dealing with Rows and Columns
* Indexing and Selecting Data
* Working with Missing Data
* Iterating over rows and columns

# Creating a Pandas DataFrame
In practice, a pandas DataFrame is usually created by loading a dataset from existing storage, such as a SQL database, a CSV file, or an Excel file. A DataFrame can also be created from a list, a dictionary, a list of dictionaries, and so on. Here are some of the ways to create one:

1. Creating a DataFrame from a list: A DataFrame can be created from a single list or a list of lists.

In [2]:
# example program to create a pandas DataFrame using list
# import pandas as pd
import pandas as pd
 
# list of strings
lst = ['Hello', 'Viewers', 'Welcome', 'to', 
            'RBG.AI', 'Never', 'Unprepared']
 
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
print(df)

            0
0       Hello
1     Viewers
2     Welcome
3          to
4      RBG.AI
5       Never
6  Unprepared
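
The example above passes a single list. A list of lists can be used in much the same way, with each inner list becoming one row. The sketch below is only an illustration: the variable names `rows` and `df_rows` and the column labels passed through `columns` are chosen here, not taken from the notebook.

```python
import pandas as pd

# each inner list is one row: [name, age]
rows = [['jai', 20], ['barathi', 21], ['mano', 27]]

# columns= assigns a label to each position in the inner lists
df_rows = pd.DataFrame(rows, columns=['Name', 'Age'])
print(df_rows)
```

Without `columns`, pandas would label the columns 0 and 1, just as the single-list example labels its only column 0.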


2. Creating a DataFrame from a dict of ndarrays/lists: To create a DataFrame from a dict of ndarrays or lists, all the arrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

In [3]:
# example program for creating a DataFrame from a dict of ndarrays / lists

import pandas as pd
 
# initialise data as a dict of lists
data = {'Name':['jai', 'barathi', 'mano', 'kamal'],
        'Age':[20, 21, 27, 25]}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

      Name  Age
0      jai   20
1  barathi   21
2     mano   27
3    kamal   25
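
The rule about index length can be tried out with a small sketch. It assumes the same `data` dictionary as above; the index labels 'a' to 'd' and the name `df_indexed` are made up for illustration.

```python
import pandas as pd

data = {'Name': ['jai', 'barathi', 'mano', 'kamal'],
        'Age': [20, 21, 27, 25]}

# the index needs exactly as many labels as there are rows (4 here)
df_indexed = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df_indexed)

# an index of the wrong length is rejected with a ValueError
try:
    pd.DataFrame(data, index=['a', 'b'])
except ValueError as err:
    print('ValueError:', err)
```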


# Dealing with Rows and Columns

Because a DataFrame arranges data in rows and columns, we can perform basic operations on those rows and columns, such as selecting, deleting, adding, and renaming.

1. Column Selection:
To select columns in a pandas DataFrame, we access them by their column names.

In [10]:
# Define a dictionary containing employee data
data = {'Name':['Jai', 'Barathi', 'Mano', 'Kamal'],
        'Age':[20, 24, 30, 21],
        'Address':['Tamilnadu', 'Delhi', 'Karanataka', 'Kerala'],
        'Qualification':['Msc', 'Phd', 'MBA', 'M.Tech']}
 
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
 
# select two columns
print(df[['Name', 'Qualification']])
print() 
print(df[['Name','Qualification','Address']])

      Name Qualification
0      Jai           Msc
1  Barathi           Phd
2     Mano           MBA
3    Kamal        M.Tech

      Name Qualification     Address
0      Jai           Msc   Tamilnadu
1  Barathi           Phd       Delhi
2     Mano           MBA  Karanataka
3    Kamal        M.Tech      Kerala
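
Adding, renaming, and deleting columns are mentioned above but not shown. The following is a minimal sketch of one way to do each; the frame `df_emp`, the `City` values, and the new name `Location` are invented for this example.

```python
import pandas as pd

df_emp = pd.DataFrame({'Name': ['Jai', 'Barathi', 'Mano', 'Kamal'],
                       'Age': [20, 24, 30, 21]})

# a single column name returns a Series; a list of names returns a DataFrame
ages = df_emp['Age']

# adding: assign a list with one value per row to a new column name
df_emp['City'] = ['Chennai', 'Delhi', 'Bengaluru', 'Kochi']  # made-up values

# renaming: rename() returns a new DataFrame by default
df_emp = df_emp.rename(columns={'City': 'Location'})

# deleting: drop() with columns= also returns a new DataFrame
df_emp = df_emp.drop(columns=['Age'])
print(df_emp)
```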


2. Row Selection: pandas provides the DataFrame.loc[ ] indexer to retrieve rows by label. Rows can also be selected by integer position using the iloc[ ] indexer.



In [8]:
# import pandas
import pandas as pd
# making data frame from csv file (use your own csv file here)
data = pd.read_csv("rbg.csv", index_col="name")
# retrieving rows by loc method
first = data.loc["jairam"]
second = data.loc["barathi"]

print(first, "\n\n\n", second)

qualification          Msc
age                   21.0
address          Tamilnadu
designation         Intern
Name: jairam, dtype: object 


 qualification          Phd
age                   22.0
address          Tamilnadu
designation            CEO
Name: barathi, dtype: object
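
Since the cell above depends on a local CSV file, here is a self-contained sketch of the same idea. The frame `people` is a stand-in built only from the two rows printed above; it is not the real file.

```python
import pandas as pd

# stand-in for rbg.csv, built from the two rows printed above
people = pd.DataFrame(
    {'qualification': ['Msc', 'Phd'],
     'age': [21.0, 22.0],
     'address': ['Tamilnadu', 'Tamilnadu'],
     'designation': ['Intern', 'CEO']},
    index=pd.Index(['jairam', 'barathi'], name='name'))

# one label returns a Series holding that row
row = people.loc['jairam']

# a list of labels returns a DataFrame with those rows, in the order given
both = people.loc[['barathi', 'jairam']]

# a row label plus a column label returns a single value
role = people.loc['barathi', 'designation']
print(role)
```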


# Indexing and Selecting Data

Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.

Indexing a DataFrame using the indexing operator [ ]:
The term indexing operator refers to the square brackets following an object. The .loc and .iloc indexers also use square brackets to make selections, but in this section the indexing operator means df[ ] itself.

Column Selection: To select a single column, we simply put the name of the column between the brackets.

In [13]:
# making data frame from csv file
data = pd.read_csv("rbg.csv", index_col ="name")
 
# retrieving columns by indexing operator
first = data["age"]
print(first)

name
jairam       21.0
barathi      22.0
manogaran    23.0
kamal         NaN
Kishore       NaN
Name: age, dtype: float64
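
What df[ ] selects depends on what goes between the brackets. The sketch below shows three common cases. The frame `ages` is a stand-in that holds only the age column printed above, so that the example runs without the CSV file.

```python
import pandas as pd
import numpy as np

# stand-in for rbg.csv with only the age column printed above
ages = pd.DataFrame(
    {'age': [21.0, 22.0, 23.0, np.nan, np.nan]},
    index=pd.Index(['jairam', 'barathi', 'manogaran', 'kamal', 'Kishore'],
                   name='name'))

# a column name selects a column (returned as a Series)
age_col = ages['age']

# a boolean Series selects the rows where it is True; NaN > 21 is False
older = ages[ages['age'] > 21]

# a slice selects rows by position, like slicing a list
first_two = ages[0:2]
print(older)
```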


# Indexing a DataFrame using .iloc[ ] :

The .iloc[ ] indexer retrieves rows and columns by position. To use it, we specify the positions of the rows we want and, optionally, the positions of the columns as well. df.iloc is very similar to df.loc but uses only integer positions to make its selections.

Selecting a single row:
To select a single row, we pass a single integer to .iloc[ ].

In [16]:
# making data frame from csv file
data = pd.read_csv("rbg.csv", index_col ="name")
 
# retrieving the first row (position 0) by iloc
first_row = data.iloc[0]
print(first_row)

qualification          Msc
age                   21.0
address          Tamilnadu
designation         Intern
Name: jairam, dtype: object
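
Beyond a single integer, .iloc[ ] also accepts slices, lists of positions, and a row/column pair. The sketch below uses a small frame, `df_pos`, defined here so that the example does not need the CSV file.

```python
import pandas as pd

df_pos = pd.DataFrame({'Name': ['Jai', 'Barathi', 'Mano', 'Kamal'],
                       'Age': [20, 24, 30, 21],
                       'Qualification': ['Msc', 'Phd', 'MBA', 'M.Tech']})

# rows at positions 1 and 2 (the stop position 3 is excluded, as in Python slicing)
middle = df_pos.iloc[1:3]

# a list of positions picks specific rows, here the first and the last
ends = df_pos.iloc[[0, -1]]

# row position first, column position second: row 2, column 1
age = df_pos.iloc[2, 1]
print(age)
```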


# Handling Missing Data

Missing data occurs when no information is provided for one or more items or for a whole unit. It is a very common problem in real-life datasets. In pandas, missing values are also referred to as NA (Not Available) values.

# Checking for missing values
To check for missing values in a pandas DataFrame, we use the isnull() and notnull() functions. Both check whether each value is NaN. They can also be used on a pandas Series to find null values.

In [17]:
## example program for checking missing data in the dataset
# importing pandas as pd
import pandas as pd
 
# importing numpy as np
import numpy as np
 
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
 
# creating a dataframe from dictionary of lists
df = pd.DataFrame(scores)
 
# using isnull() function  
df.isnull()

   First Score  Second Score  Third Score
0        False         False         True
1        False         False        False
2         True         False        False
3        False          True        False
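
The text also mentions notnull(), which the cell above does not use. A short sketch of it follows, together with a common way to count missing values per column. The names `present`, `missing_per_column`, and `has_first` are chosen here for illustration.

```python
import pandas as pd
import numpy as np

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# notnull() is the element-wise opposite of isnull()
present = scores.notnull()

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = scores.isnull().sum()
print(missing_per_column)

# keep only the rows where 'First Score' is present
has_first = scores[scores['First Score'].notnull()]
```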


# Filling missing values 
To fill null values in a dataset, we use the fillna(), replace(), and interpolate() functions, which replace NaN values with other values. interpolate() also fills NA values in the DataFrame, but it uses an interpolation technique to estimate the missing values rather than a hard-coded value.

In [2]:
## example program for filling missing values
# importing pandas as pd
import pandas as pd
 
# importing numpy as np
import numpy as np
 
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, 45, 56, np.nan],
          'Third Score':[np.nan, 40, 80, 98]}
 
# creating a dataframe from dictionary
df = pd.DataFrame(scores)
 
# filling missing values using fillna()
df.fillna(0) # replaces each NaN value with 0

# Note: You can use any value for replacing NaN values.

   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
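
The cell above covers fillna() only. As a rough sketch of the other two functions named in the text, the example below uses replace() to swap NaN for a placeholder and interpolate() with its default linear method. The placeholder value -1 is an arbitrary choice made for this example.

```python
import pandas as pd
import numpy as np

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, 45, 56, np.nan],
                       'Third Score': [np.nan, 40, 80, 98]})

# replace() swaps one value for another; here NaN becomes -1
replaced = scores.replace(np.nan, -1)

# interpolate() with the default linear method fills a gap from its neighbours:
# the NaN between 90 and 95 in 'First Score' becomes 92.5
interpolated = scores.interpolate()
print(interpolated)
```

With the default settings, a NaN at the very start of a column (as in 'Third Score') has no earlier value to interpolate from and stays NaN.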


# Dropping missing values 
To drop null values from a DataFrame, we use the dropna() function, which can drop rows or columns containing null values in different ways.

In [28]:
# importing pandas as pd
import pandas as pd
 
# importing numpy as np
import numpy as np
 
# dictionary of lists (named scores to avoid shadowing the built-in dict)
scores = {'First Score':[100, 90, np.nan, 95],
          'Second Score': [30, np.nan, 45, 56],
          'Third Score':[52, 40, np.nan, 98],
          'Fourth Score':[np.nan, np.nan, np.nan, 65]}
 
# creating a dataframe from dictionary
df = pd.DataFrame(scores)
print(df) # printing the original dataframe

# dropping rows that contain at least one missing value
df.dropna()

   First Score  Second Score  Third Score  Fourth Score
0        100.0          30.0         52.0           NaN
1         90.0           NaN         40.0           NaN
2          NaN          45.0          NaN           NaN
3         95.0          56.0         98.0          65.0


   First Score  Second Score  Third Score  Fourth Score
3         95.0          56.0         98.0          65.0
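
The "different ways" of dropping are controlled by dropna()'s parameters. The sketch below tries four of them on the same data; the result names are chosen here for illustration.

```python
import pandas as pd
import numpy as np

scores = pd.DataFrame({'First Score': [100, 90, np.nan, 95],
                       'Second Score': [30, np.nan, 45, 56],
                       'Third Score': [52, 40, np.nan, 98],
                       'Fourth Score': [np.nan, np.nan, np.nan, 65]})

# axis=1 drops columns instead of rows; every column here contains a NaN
no_nan_columns = scores.dropna(axis=1)

# how='all' drops a row only if every value in it is NaN (no row qualifies)
not_all_missing = scores.dropna(how='all')

# thresh=3 keeps rows with at least 3 non-missing values (rows 0 and 3)
mostly_complete = scores.dropna(thresh=3)

# subset= limits the check to the listed columns
has_first = scores.dropna(subset=['First Score'])
print(mostly_complete)
```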


# Iterating over rows

To iterate over rows, we can use iterrows() or itertuples(). The related iteritems() function iterates over columns rather than rows, and was removed in pandas 2.0 in favour of items().

In [30]:
# importing pandas as pd
import pandas as pd

# Define a dictionary containing employee data
data = {'Name':['Jai', 'Barathi', 'Mano', 'Kamal'],
        'Age':[20, 24, 30, 21],
        'Address':['Tamilnadu', 'Delhi', 'Karanataka', 'Kerala'],
        'Qualification':['Msc', 'Phd', 'MBA', 'M.Tech']}
 
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
df

# iterating over rows using iterrows() function 
for i, j in df.iterrows():
    print(i, j)
    print()

0 Name                   Jai
Age                     20
Address          Tamilnadu
Qualification          Msc
Name: 0, dtype: object

1 Name             Barathi
Age                   24
Address            Delhi
Qualification        Phd
Name: 1, dtype: object

2 Name                   Mano
Age                      30
Address          Karanataka
Qualification           MBA
Name: 2, dtype: object

3 Name              Kamal
Age                  21
Address          Kerala
Qualification    M.Tech
Name: 3, dtype: object
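
For comparison, here is a small sketch of itertuples(), which yields each row as a namedtuple instead of a Series. The frame `df_emp` and the age filter are only illustrative.

```python
import pandas as pd

df_emp = pd.DataFrame({'Name': ['Jai', 'Barathi', 'Mano', 'Kamal'],
                       'Age': [20, 24, 30, 21]})

# each row arrives as a namedtuple: Index first, then one field per column
names_over_20 = []
for row in df_emp.itertuples():
    if row.Age > 20:
        names_over_20.append(row.Name)

print(names_over_20)
```

Attribute access such as `row.Age` works here because the column names are valid Python identifiers.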



# Iterating over Columns :
To iterate over columns, we create a list of the DataFrame's column names and then loop over that list to pull out each column.

In [31]:
# importing pandas as pd
import pandas as pd

# Define a dictionary containing employee data
data = {'Name':['Jai', 'Barathi', 'Mano', 'Kamal'],
        'Age':[20, 24, 30, 21],
        'Address':['Tamilnadu', 'Delhi', 'Karanataka', 'Kerala'],
        'Qualification':['Msc', 'Phd', 'MBA', 'M.Tech']}
 
# Convert the dictionary into DataFrame 
df = pd.DataFrame(data)
df

      Name  Age     Address Qualification
0      Jai   20   Tamilnadu           Msc
1  Barathi   24       Delhi           Phd
2     Mano   30  Karanataka           MBA
3    Kamal   21      Kerala        M.Tech


Now we iterate through the columns: we first create a list of the DataFrame's column names and then loop over that list.

In [34]:
# creating a list of dataframe columns
columns = list(df)
 
for i in columns:
 
    # printing the third element (position 2) of the column
    print(df[i].iloc[2])

Mano
30
Karanataka
MBA
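
An alternative to building a list of column names is DataFrame.items(), which yields each column name together with the column itself. The sketch below is one possible way to collect the third value of every column; the names `df_emp` and `third_values` are chosen here.

```python
import pandas as pd

df_emp = pd.DataFrame({'Name': ['Jai', 'Barathi', 'Mano', 'Kamal'],
                       'Age': [20, 24, 30, 21]})

# items() yields (column name, column as a Series) pairs
third_values = {}
for column_name, column in df_emp.items():
    third_values[column_name] = column.iloc[2]

print(third_values)
```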
