### Grouping a pandas DataFrame with .groupby()

In [1]:
import pandas as pd
# US_congress = pd.read_csv("https://theunitedstates.io/congress-legislators/legislators-historical.csv")
air_quality = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/airquality.csv", usecols=range(1,7))
air_quality

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5
...,...,...,...,...,...,...
148,30.0,193.0,6.9,70,9,26
149,,145.0,13.2,77,9,27
150,14.0,191.0,14.3,75,9,28
151,18.0,131.0,8.0,76,9,29


In [2]:
air_quality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Ozone    116 non-null    float64
 1   Solar.R  146 non-null    float64
 2   Wind     153 non-null    float64
 3   Temp     153 non-null    int64  
 4   Month    153 non-null    int64  
 5   Day      153 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 7.3 KB


In [3]:
air_quality.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
0,41.0,190.0,7.4,67,5,1
1,36.0,118.0,8.0,72,5,2
2,12.0,149.0,12.6,74,5,3
3,18.0,313.0,11.5,62,5,4
4,,,14.3,56,5,5


### .groupby() follows the split-apply-combine pattern, a chain of three steps:

#### 1. Split the table into groups.
#### 2. Apply an operation to each of those smaller tables.
#### 3. Combine the per-group results into a single table (the aggregate).

![image.png](attachment:dc3b9fd9-fb0d-43b7-8e9f-c5cf90829387.png)
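
The three steps can be spelled out by hand to see what `.groupby()` does in one call. This is only a minimal sketch: `toy` is a hypothetical miniature table with made-up values standing in for `air_quality`, and the names `pieces`, `means`, `manual` and `grouped` are illustrative.

```python
import pandas as pd

# Hypothetical miniature table standing in for air_quality (values are made up).
toy = pd.DataFrame({
    "Month": [5, 5, 6, 6, 6],
    "Temp": [60, 70, 80, 82, 84],
})

# 1. Split: one sub-table per distinct Month value.
pieces = {month: sub for month, sub in toy.groupby("Month")}

# 2. Apply: compute the mean temperature of each sub-table.
means = {month: sub["Temp"].mean() for month, sub in pieces.items()}

# 3. Combine: collect the per-group results into a single Series.
manual = pd.Series(means, name="Temp").rename_axis("Month")

# The same three steps done in one call.
grouped = toy.groupby("Month")["Temp"].mean()

print(manual)
print(grouped)
```

Both printed Series should hold the same per-month means, which is the point of the pattern: the one-liner hides the split and combine steps.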

#### Groupby on a single column

In [4]:
air_quality.groupby('Month').agg(avg_temp=("Temp", "mean"))

Month,avg_temp
5,65.548387
6,79.1
7,83.903226
8,83.967742
9,76.9


#### Groupby on multiple columns

In [5]:
air_quality.groupby(['Month', 'Day']).agg(avg_temp=("Temp", "mean"))

Month,Day,avg_temp
5,1,67
5,2,72
5,3,74
5,4,62
5,5,56
...,...,...
9,26,70
9,27,77
9,28,75
9,29,76


In [6]:
air_quality.groupby(['Month', 'Day']).agg(avg_temp=("Temp", "mean")).head(10)

Month,Day,avg_temp
5,1,67
5,2,72
5,3,74
5,4,62
5,5,56
5,6,66
5,7,65
5,8,59
5,9,61
5,10,69


#### Groupby and aggregate on multiple columns

In [7]:
air_quality.groupby(['Month', 'Day']).agg(avg_temp=("Temp", "mean"), max_wind=("Wind", "max"))

Month,Day,avg_temp,max_wind
5,1,67,7.4
5,2,72,8.0
5,3,74,12.6
5,4,62,11.5
5,5,56,14.3
...,...,...,...
9,26,70,6.9
9,27,77,13.2
9,28,75,14.3
9,29,76,8.0


#### The result above is not meaningful: grouping by both Month and Day leaves exactly one row per group, so "mean" and "max" simply return that day's own Temp and Wind values.

#### To get the average temperature and the maximum wind speed per month, group by Month only:

In [8]:
air_quality.groupby(['Month']).agg(avg_temp=("Temp", "mean"), max_wind=("Wind", "max"))

Month,avg_temp,max_wind
5,65.548387,20.1
6,79.1,20.7
7,83.903226,14.9
8,83.967742,15.5
9,76.9,16.6
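
One way to spot this kind of mistake is to look at the group sizes before aggregating. The sketch below assumes a hypothetical miniature table `toy` (made-up values) in which, as in `air_quality`, every (Month, Day) pair occurs exactly once.

```python
import pandas as pd

# Hypothetical miniature table: each (Month, Day) pair occurs exactly once.
toy = pd.DataFrame({
    "Month": [5, 5, 6, 6],
    "Day": [1, 2, 1, 2],
    "Wind": [7.4, 8.0, 20.7, 9.2],
})

# Number of rows in each (Month, Day) group versus each Month group.
rows_per_day = toy.groupby(["Month", "Day"]).size()
rows_per_month = toy.groupby("Month").size()

# With one row per group, "max" can only return that row's own value.
daily_max = toy.groupby(["Month", "Day"]).agg(max_wind=("Wind", "max"))
monthly_max = toy.groupby("Month").agg(max_wind=("Wind", "max"))

print(rows_per_day)
print(rows_per_month)
print(monthly_max)
```

If `.size()` reports 1 for every group, any aggregation over those groups just echoes the input rows.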


#### Grouping and sorting the values

In [9]:
air_quality.groupby(['Month']).agg(avg_temp=("Temp", "mean"), 
                                   max_wind=("Wind", "max")).sort_values(by="max_wind", ascending=False)

Month,avg_temp,max_wind
6,79.1,20.7
5,65.548387,20.1
9,76.9,16.6
8,83.967742,15.5
7,83.903226,14.9


### Dropping missing values from a DataFrame with .dropna()

#### **Syntax**: 
#### dataframe.dropna(axis, how, thresh, subset, inplace)


![image.png](attachment:8ffecd11-a007-450e-970e-fffced278cf3.png)

In [10]:
import numpy as np
import pandas as pd
df = pd.DataFrame({"name": ['Superman', 'Batman', 'Spiderman'],
                   "toy": [np.nan, 'Batmobile', 'Spiderman toy'],
                   "born": [np.nan, pd.Timestamp("1956-06-26"),np.nan]})
df

Unnamed: 0,name,toy,born
0,Superman,,NaT
1,Batman,Batmobile,1956-06-26
2,Spiderman,Spiderman toy,NaT


![image.png](attachment:f0073c86-c088-4776-aa6f-c9bb8c6782d1.png)

In [11]:
df.dropna()

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


![image.png](attachment:58e10af8-8f31-4480-be9c-5fb0bcde79f1.png)

Drop the columns where at least one element is missing:



In [12]:
df.dropna(axis='columns')

Unnamed: 0,name
0,Superman
1,Batman
2,Spiderman


![image.png](attachment:98cae7b7-9bda-4f56-ae6d-d0075d97b9b4.png)

Drop the rows where all elements are missing. No row in df is entirely empty, so all three rows are kept:

In [13]:
df.dropna(how='all')

Unnamed: 0,name,toy,born
0,Superman,,NaT
1,Batman,Batmobile,1956-06-26
2,Spiderman,Spiderman toy,NaT


![image.png](attachment:fbf4d783-11f1-4d3c-a361-88e839ce9e22.png)

Keep only the rows with at least 3 non-NA values:

In [14]:
df.dropna(thresh=3)

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26
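
To see why only one row survives `thresh=3`, it can help to count the non-missing values per row. The sketch below rebuilds the same data under a separate, illustrative name (`heroes`) so that `df` is left untouched, and compares two thresholds.

```python
import numpy as np
import pandas as pd

# Same data as the df built in In [10], under a separate name.
heroes = pd.DataFrame({
    "name": ["Superman", "Batman", "Spiderman"],
    "toy": [np.nan, "Batmobile", "Spiderman toy"],
    "born": [np.nan, pd.Timestamp("1956-06-26"), np.nan],
})

# Count the non-missing values in each row.
non_na = heroes.notna().sum(axis=1)

# thresh=N keeps the rows that have at least N non-missing values.
kept_2 = heroes.dropna(thresh=2)
kept_3 = heroes.dropna(thresh=3)

print(non_na)
print(kept_2)
print(kept_3)
```

Superman has one non-missing value, Batman three and Spiderman two, so `thresh=2` keeps Batman and Spiderman while `thresh=3` keeps only Batman.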


![image.png](attachment:0ccdf45b-64c0-409a-8364-88e891c61d1d.png)

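Look for missing values only in the columns listed in subset. With inplace=True the rows are dropped from df itself and the call returns None, so nothing is displayed: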

In [15]:
df.dropna(subset=['name', 'born'],inplace=True)

![image.png](attachment:93e19a16-c978-4de0-91ee-ac1d2c7cb215.png)

In [16]:
df

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


Use inplace=True to keep only the complete rows in the same variable instead of getting a new DataFrame back:

In [17]:
df.dropna(inplace=True)
df

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


![image.png](attachment:83c64e1b-4d88-4bc9-8ec9-d4ee6719aae6.png)

In [18]:
df

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


In [19]:
df.replace(np.nan,0)

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


In [20]:
df.replace(np.nan,df.name.max())

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26


In [21]:
df.fillna(df.name.max())

Unnamed: 0,name,toy,born
1,Batman,Batmobile,1956-06-26
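
By this point `df` no longer contains any missing values, so the `replace` and `fillna` calls above return it unchanged. The sketch below shows the effect on a frame that still has gaps. The name `heroes` and the fill value `"no toy"` are illustrative assumptions, and only the `toy` column is filled so that the datetime column keeps its dtype.

```python
import numpy as np
import pandas as pd

# Same data as the original df from In [10], before any rows were dropped.
heroes = pd.DataFrame({
    "name": ["Superman", "Batman", "Spiderman"],
    "toy": [np.nan, "Batmobile", "Spiderman toy"],
    "born": [np.nan, pd.Timestamp("1956-06-26"), np.nan],
})

# Fill missing values in the "toy" column only; "born" is left as it is.
filled = heroes.fillna({"toy": "no toy"})

# On a frame without missing values, fillna has nothing to change.
cleaned = heroes.dropna()
unchanged = cleaned.fillna({"toy": "no toy"}).equals(cleaned)

print(filled)
print(unchanged)
```

Passing a dict to `fillna` sets a separate fill value per column, which avoids mixing strings or numbers into the datetime column.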
