# Important Pandas Functions for Data Science and Data Analysis

Pandas is a widely used Python data analysis library. It provides many functions and methods that speed up the data analysis process. Its popularity comes from its functionality, flexibility, and simple syntax.

The examples below use the USA housing dataset.

Here’s how to import the pandas library and load the dataset:

In [1]:
import pandas as pd
import numpy as np

In [3]:
df = pd.read_csv('USA_Housing.csv')
df

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386
...,...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316


### 1. Head and Tail

Once we read a dataset into a pandas data frame, we can take a look at it to get an overview. The simplest way is to display some rows. Head and tail display rows from the top and the bottom of the data frame, respectively.

In [4]:
df.head()

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


In [5]:
df.tail()

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
4995,60567.94414,7.830362,6.137356,3.46,22837.361035,1060194.0,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1482618.0,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.14549,1030730.0,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1198657.0,USS Wallace\nFPO AE 73316
4999,65510.581804,5.992305,6.792336,4.07,46501.283803,1298950.0,"37778 George Ridges Apt. 509\nEast Holly, NV 2..."


The top and bottom 5 rows are displayed by default, but we can adjust this just by passing the number of rows we’d like to display.
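As a minimal sketch of passing a row count, the snippet below uses a small made-up frame (the name toy and its values are hypothetical, not taken from the housing data) so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical toy frame standing in for the housing data.
toy = pd.DataFrame({"Price": [100, 200, 300, 400, 500, 600, 700]})

first_three = toy.head(3)  # first 3 rows instead of the default 5
last_two = toy.tail(2)     # last 2 rows instead of the default 5

print(first_three)
print(last_two)
```

The same calls work on the housing frame, for example df.head(10).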

### 2. DataFrame.info( )

The pandas DataFrame.info() function gives a concise summary of the data frame: the index range, the column names, the non-null count and dtype of each column, and the memory usage. It comes in really handy during exploratory analysis.

To get a quick overview of the dataset, we call df.info():

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Avg. Area Income              5000 non-null   float64
 1   Avg. Area House Age           5000 non-null   float64
 2   Avg. Area Number of Rooms     5000 non-null   float64
 3   Avg. Area Number of Bedrooms  5000 non-null   float64
 4   Area Population               5000 non-null   float64
 5   Price                         5000 non-null   float64
 6   Address                       5000 non-null   object 
dtypes: float64(6), object(1)
memory usage: 273.6+ KB
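When the summary is needed as text (for a log file, say) rather than printed to the screen, info() accepts a buf argument. The sketch below assumes a small hypothetical frame named toy with one missing price, so the non-null count differs from the row count:

```python
import io

import pandas as pd

# Hypothetical toy frame; None becomes NaN in the float column.
toy = pd.DataFrame({"Price": [100.0, None, 300.0], "Address": ["a", "b", "c"]})

# info() prints by default; passing buf writes the summary to a text buffer instead.
buffer = io.StringIO()
toy.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```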


### 3. Shape

The shape attribute is available on NumPy arrays, pandas series, and data frames. It returns a tuple with one entry per dimension, giving the size along that dimension.

Since data frames are two-dimensional, shape returns the number of rows and columns. It tells us how much data we have, which is a key input to the data analysis process.

In [7]:
df.shape

(5000, 7)
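To illustrate how shape behaves on each of these objects, here is a short sketch with made-up data (the names arr, ser, and frame are hypothetical):

```python
import numpy as np
import pandas as pd

arr = np.zeros((2, 3, 4))                                   # 3-dimensional array
ser = pd.Series([10, 20, 30])                               # 1-dimensional series
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})  # 2-dimensional frame

print(arr.shape)    # (2, 3, 4)
print(ser.shape)    # (3,)
print(frame.shape)  # (2, 3)

# shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = frame.shape
```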

### 4. describe( )

If there’s one thing you do over and over again in exploratory data analysis, it’s computing a statistical summary for every (or almost every) attribute.
The describe() method produces a quick statistical summary of every numerical column, as shown below:

In [8]:
df.describe()

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,68583.108984,5.977222,6.987792,3.98133,36163.516039,1232073.0
std,10657.991214,0.991456,1.005833,1.234137,9925.650114,353117.6
min,17796.63119,2.644304,3.236194,2.0,172.610686,15938.66
25%,61480.562388,5.322283,6.29925,3.14,29403.928702,997577.1
50%,68804.286404,5.970429,7.002902,4.05,36199.406689,1232669.0
75%,75783.338666,6.650808,7.665871,4.49,42861.290769,1471210.0
max,107701.748378,9.519088,10.759588,6.5,69621.713378,2469066.0
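Note that the Address column is absent from the output above because describe() skips non-numeric columns by default. A minimal sketch of including them with include="all", using a hypothetical toy frame (the City column and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame({
    "Price": [100.0, 200.0, 300.0],
    "City": ["Laurabury", "Danieltown", "Danieltown"],
})

numeric_summary = toy.describe()             # numeric columns only
full_summary = toy.describe(include="all")   # adds unique/top/freq for text columns

print(numeric_summary)
print(full_summary)
```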


### 5. Identifying Missing Values

Handling missing values is a critical step in building a robust data analysis process. Missing values should be dealt with early, since they can significantly affect the accuracy of any analysis.

### 6. Isna

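The isna function returns a data frame of boolean values in which True marks a missing value. Chaining any() reduces that to one boolean per column, showing whether the column contains any missing values.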

In [11]:
df.isna().any()

Avg. Area Income                False
Avg. Area House Age             False
Avg. Area Number of Rooms       False
Avg. Area Number of Bedrooms    False
Area Population                 False
Price                           False
Address                         False
dtype: bool
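Since the housing data has no missing values, the sketch below uses a hypothetical toy frame with one NaN to show both the full boolean mask and the per-column reduction:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing price.
toy = pd.DataFrame({"Price": [100.0, np.nan, 300.0], "Rooms": [5, 6, 7]})

mask = toy.isna()         # same shape as toy, True where a value is missing
has_missing = mask.any()  # one boolean per column

print(mask)
print(has_missing)
```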

### 7. Sum of Missing Values in a DataFrame

Let’s count the missing values in the data frame. The isnull function is an alias of isna, and chaining sum() onto it gives the number of missing values in each column:

In [13]:
df.isnull().sum()

Avg. Area Income                0
Avg. Area House Age             0
Avg. Area Number of Rooms       0
Avg. Area Number of Bedrooms    0
Area Population                 0
Price                           0
Address                         0
dtype: int64
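The per-column counts can be summed once more to get a single total for the whole frame. A short sketch on a hypothetical toy frame (names and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three missing values in total.
toy = pd.DataFrame({
    "Price": [100.0, np.nan, 300.0],
    "Rooms": [np.nan, 6.0, np.nan],
})

per_column = toy.isna().sum()        # missing values in each column
total_missing = int(per_column.sum())  # missing values in the whole frame
per_row = toy.isna().sum(axis=1)     # missing values in each row

print(per_column)
print(total_missing)
```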

### 8. Nunique

Nunique counts the number of unique entries over columns or rows. It is very useful for categorical features, especially when we do not know the number of categories beforehand. Let’s take a look at our initial data frame:

In [14]:
df.nunique()

Avg. Area Income                5000
Avg. Area House Age             5000
Avg. Area Number of Rooms       5000
Avg. Area Number of Bedrooms     255
Area Population                 5000
Price                           5000
Address                         5000
dtype: int64
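The sketch below shows the two options that matter most in practice, axis and dropna, on a hypothetical toy frame (the values are made up). By default nunique works per column and ignores NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a repeated value and one NaN.
toy = pd.DataFrame({
    "Rooms": [5, 6, 6, 7],
    "Bedrooms": [3.0, 6.0, 4.0, np.nan],
})

per_column = toy.nunique()               # unique values per column, NaN ignored
with_nan = toy.nunique(dropna=False)     # NaN counted as its own value
per_row = toy.nunique(axis=1)            # unique values within each row

print(per_column)
print(with_nan)
print(per_row)
```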

### 9. Index and Columns

The index attribute holds the row labels of the data frame. Since we did not set an index when loading the data, pandas created a default RangeIndex that runs from 0 to 4999.

In [16]:
df.index

RangeIndex(start=0, stop=5000, step=1)

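### Columns

The columns attribute holds the column labels of the data frame. Like index, it is an attribute rather than a method, so it is accessed without parentheses.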

In [18]:
df.columns

Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population', 'Price', 'Address'],
      dtype='object')
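Both attributes can be inspected the same way on any frame. A minimal sketch with hypothetical data, comparing a custom index with the default one:

```python
import pandas as pd

# Hypothetical frame with explicit row labels.
labelled = pd.DataFrame(
    {"Price": [1, 2, 3], "Rooms": [4, 5, 6]},
    index=["a", "b", "c"],
)

# Hypothetical frame that falls back to the default integer index.
default = pd.DataFrame({"Price": [1, 2]})

print(list(labelled.index))    # row labels
print(list(labelled.columns))  # column labels
print(default.index)           # RangeIndex(start=0, stop=2, step=1)
```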

### 10. Loc and iloc

Loc and iloc are used to select rows and columns.

- loc: select by labels

- iloc: select by positions

loc selects data by label. Column labels are the column names. Row labels need more care: if we do not assign a specific index, pandas creates an integer index by default, so the row labels are integers starting from 0. The row positions used with iloc are also integers starting from 0. The two differ at slice endpoints: a loc slice includes its end label, while an iloc slice excludes its end position, which is why df.loc[:5] below returns six rows and df.iloc[:4] returns four.

In [19]:
df.loc[:5,['Avg. Area Income', 'Price']]

,Avg. Area Income,Price
0,79545.458574,1059034.0
1,79248.642455,1505891.0
2,61287.067179,1058988.0
3,63345.240046,1260617.0
4,59982.197226,630943.5
5,80175.754159,1068138.0


In [20]:
df.iloc[:4,:6]

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0
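The difference between labels and positions is easiest to see when the row labels are not 0, 1, 2, and so on. The sketch below uses a hypothetical toy frame whose index labels are 10, 20, 30, and 40:

```python
import pandas as pd

# Hypothetical toy frame with non-default integer row labels.
toy = pd.DataFrame(
    {"Price": [100, 200, 300, 400], "Rooms": [5, 6, 7, 8]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[:20, ["Price"]]  # labels up to and including 20
by_position = toy.iloc[:2, :1]      # first 2 rows, first column (end excluded)
single_value = toy.loc[30, "Rooms"] # row label 30, column "Rooms"

print(by_label)
print(by_position)
print(single_value)
```

Here both selections return the same two rows, but for different reasons: loc stops at the label 20, while iloc stops before position 2.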


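### 11. Slicing

Slicing a data frame with df[start:stop] selects rows by position, and the stop position is excluded.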

In [21]:
df[0:5]

,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386
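A short sketch of row slicing on a hypothetical toy frame with the default integer index, including a step and a negative start:

```python
import pandas as pd

# Hypothetical toy frame with the default integer index.
toy = pd.DataFrame({"Price": [100, 200, 300, 400, 500]})

middle = toy[1:3]        # rows at positions 1 and 2 (position 3 excluded)
every_other = toy[::2]   # rows at positions 0, 2, 4
last_two = toy[-2:]      # last two rows

print(middle)
print(every_other)
print(last_two)
```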


### 12. Groupby

The pandas groupby function is a great tool for exploring the data. It makes it easier to unveil the underlying relationships among variables. The figure below shows an overview of what the groupby function does.

In [23]:
df[['Avg. Area Number of Rooms','Avg. Area Number of Bedrooms','Price']].groupby(['Avg. Area Number of Rooms','Avg. Area Number of Bedrooms']).mean()

Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Price
3.236194,3.42,1.365081e+06
3.950225,2.31,1.128720e+06
3.950973,4.06,8.859177e+04
3.969632,4.38,6.237216e+05
4.027931,3.13,2.662989e+05
...,...,...
10.024375,3.06,1.311432e+06
10.144988,3.14,1.810158e+06
10.219902,3.43,2.050594e+06
10.280022,6.23,2.065710e+06
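In the output above, the grouping keys are continuous averages, so almost every row ends up in its own group. Groupby is usually more informative when the key takes only a few distinct values. The sketch below assumes a hypothetical toy frame with a discrete Bedrooms column (the values are made up):

```python
import pandas as pd

# Hypothetical toy frame with a discrete grouping key.
toy = pd.DataFrame({
    "Bedrooms": [2, 2, 3, 3, 3],
    "Price": [100.0, 200.0, 300.0, 400.0, 500.0],
})

# Split rows by Bedrooms, then average Price within each group.
mean_price = toy.groupby("Bedrooms")["Price"].mean()

# Several aggregations at once on the same groups.
summary = toy.groupby("Bedrooms")["Price"].agg(["mean", "count"])

print(mean_price)
print(summary)
```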


### 13. Dropna

The dropna() function removes rows or columns that contain NaN (missing) values. The axis argument controls which: axis=0 drops rows and axis=1 drops columns. Passing inplace=True modifies the data frame directly instead of returning a new one.

In [24]:
df.dropna(axis=0, inplace=True)
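Because the housing data has no missing values, the call above leaves it unchanged. The sketch below uses a hypothetical toy frame with one NaN to show the effect of each axis, and returns new frames rather than using inplace=True:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing price.
toy = pd.DataFrame({"Price": [100.0, np.nan, 300.0], "Rooms": [5, 6, 7]})

rows_clean = toy.dropna(axis=0)  # drops the row that contains NaN
cols_clean = toy.dropna(axis=1)  # drops the column that contains NaN

print(rows_clean)
print(cols_clean)
print(len(toy))  # the original frame is untouched: still 3 rows
```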

# End Thank you