### SESSION 19 - GROUPBY OBJECT IN PANDAS

In [2]:
import numpy as np
import pandas as pd

**WHAT IS PANDAS GROUPBY?**
- Pandas groupby splits the records of a dataset into groups so that you can analyze the data group by group.
- When you call .groupby() on a DataFrame, typically with a categorical column as the key, it returns a GroupBy object. You then call aggregation and other methods on that object to summarize each group.
- **Generally, datasets have two types of columns: numerical and categorical**
- **Numerical Columns:**
    - Numerical columns contain data that consists of numbers. These numbers can be integers or floating-point numbers (decimals).
    - Examples of numerical columns include "Age," "Salary," "Temperature," "Number of Items Sold," and "Height."

- **Categorical Columns:**
    - Categorical columns contain data that represents categories or discrete values. These values are often labels or strings.
    - Categorical columns include columns like "Gender" (with values like "Male" and "Female"), "Product Category" (with values like "Electronics," "Clothing," and "Furniture"), and "Country" (with values like "USA," "Canada," and "UK").

- **Note: groupby() is usually applied to categorical columns**



- **Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)**
    - **by :** mapping, function, str, or iterable
    - **axis :** int, default 0
    - **level :** If the axis is a MultiIndex (hierarchical), group by a particular level or levels
    - **as_index :** For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output
    - **sort :** Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. groupby preserves the order of rows within each group.
    - **group_keys :** When calling apply, add group keys to index to identify pieces
    - **squeeze :** Reduce the dimensionality of the return type if possible, otherwise return a consistent type. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0
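
To make the `as_index` and `sort` parameters concrete, here is a minimal sketch. The DataFrame `toy` and its values are hypothetical and are not taken from the IMDB dataset used in this session.

```python
import pandas as pd

# Hypothetical toy data (assumption, not from the IMDB dataset)
toy = pd.DataFrame({
    "Genre": ["Drama", "Action", "Drama", "Crime", "Action"],
    "Gross": [10.0, 40.0, 20.0, 5.0, 60.0],
})

# Default: group labels become the index and are sorted
by_index = toy.groupby("Genre")["Gross"].sum()

# as_index=False: group labels stay a regular column ("SQL-style" output)
flat = toy.groupby("Genre", as_index=False)["Gross"].sum()

# sort=False: groups appear in order of first occurrence
unsorted = toy.groupby("Genre", sort=False)["Gross"].sum()

print(by_index)
print(flat)
print(unsorted)
```

With `as_index=False` the result is a plain DataFrame with `Genre` and `Gross` columns, which is often easier to merge or export.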

In [3]:
movies = pd.read_csv('DATASETS/S19/imdb-top-1000.csv')

In [5]:
movies.head(20)

Unnamed: 0,Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Director,Star1,No_of_Votes,Gross,Metascore
0,The Shawshank Redemption,1994,142,Drama,9.3,Frank Darabont,Tim Robbins,2343110,28341469.0,80.0
1,The Godfather,1972,175,Crime,9.2,Francis Ford Coppola,Marlon Brando,1620367,134966411.0,100.0
2,The Dark Knight,2008,152,Action,9.0,Christopher Nolan,Christian Bale,2303232,534858444.0,84.0
3,The Godfather: Part II,1974,202,Crime,9.0,Francis Ford Coppola,Al Pacino,1129952,57300000.0,90.0
4,12 Angry Men,1957,96,Crime,9.0,Sidney Lumet,Henry Fonda,689845,4360000.0,96.0
5,The Lord of the Rings: The Return of the King,2003,201,Action,8.9,Peter Jackson,Elijah Wood,1642758,377845905.0,94.0
6,Pulp Fiction,1994,154,Crime,8.9,Quentin Tarantino,John Travolta,1826188,107928762.0,94.0
7,Schindler's List,1993,195,Biography,8.9,Steven Spielberg,Liam Neeson,1213505,96898818.0,94.0
8,Inception,2010,148,Action,8.8,Christopher Nolan,Leonardo DiCaprio,2067042,292576195.0,74.0
9,Fight Club,1999,139,Drama,8.8,David Fincher,Brad Pitt,1854740,37030102.0,66.0



In [15]:
# creating group by object
genres = movies.groupby('Genre')

In [18]:
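# Applying built-in aggregation functions on GroupBy objects
# numeric_only=True restricts the mean to numeric columns (recent pandas versions raise an error on text columns otherwise)
genres.mean(numeric_only=True) # also sum(), min(), max(), median(), std(), etc.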



Genre,Runtime,IMDB_Rating,No_of_Votes,Gross,Metascore
Action,129.046512,7.949419,420246.581395,189722400.0,73.41958
Adventure,134.111111,7.9375,313557.819444,131901700.0,78.4375
Animation,99.585366,7.930488,268032.073171,178432600.0,81.093333
Biography,136.022727,7.938636,272805.045455,94049520.0,76.240506
Comedy,112.129032,7.90129,178195.658065,101057200.0,78.72
Crime,126.392523,8.016822,313398.271028,78996560.0,77.08046
Drama,124.737024,7.957439,212343.612457,122525900.0,79.701245
Family,107.5,7.8,275610.5,219555300.0,79.0
Fantasy,85.0,8.0,73111.0,391363300.0,
Film-Noir,104.0,7.966667,122405.0,41970180.0,95.666667


In [11]:
# find the top 3 genres by total earnings
# (sums every numeric column first, then picks 'Gross')
movies.groupby('Genre').sum(numeric_only=True)['Gross'].sort_values(ascending=False).head(3)



Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

In [12]:
# more efficient way: select 'Gross' before aggregating, so only that column is summed
movies.groupby('Genre')['Gross'].sum().sort_values(ascending=False).head(3)

Genre
Drama     3.540997e+10
Action    3.263226e+10
Comedy    1.566387e+10
Name: Gross, dtype: float64

In [21]:
# find the genre with the highest average IMDB rating
movies.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending=False).head(1)

Genre
Western    8.35
Name: IMDB_Rating, dtype: float64

In [24]:
# find the most popular director (by total number of votes)
movies.groupby('Director')['No_of_Votes'].sum().sort_values(ascending=False).head(1)

Director
Christopher Nolan    11578345
Name: No_of_Votes, dtype: int64

In [45]:
# find the number of movies for each lead actor (Star1)
# movies['Star1'].value_counts() gives the same counts
movies.groupby('Star1')['Series_Title'].count().sort_values(ascending=False).head(3)

Star1
Tom Hanks         12
Robert De Niro    11
Clint Eastwood    10
Name: Series_Title, dtype: int64

In [None]:
# GroupBy Attributes and Methods


# first()/last() -> nth item
# get_group -> vs filtering
# groups
# describe
# sample
# nunique

**len():**
- To find the total number of groups created by the groupby operation.

In [51]:
# find total number of groups -> len
len(movies.groupby('Genre'))
# movies['Genre'].nunique()

14

**size():**
- To find the number of items in each group.

In [57]:
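# find the number of items in each group -> size
movies.groupby('Genre').size()   # result is ordered by group label
movies['Genre'].value_counts()   # same counts, ordered by frequency (output shown below)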

Drama        289
Action       172
Comedy       155
Crime        107
Biography     88
Animation     82
Adventure     72
Mystery       12
Horror        11
Western        4
Film-Noir      3
Fantasy        2
Family         2
Thriller       1
Name: Genre, dtype: int64
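
`size()` is easy to confuse with `count()`. The sketch below, on hypothetical toy data with missing `Metascore` values, illustrates the difference: `size()` counts rows, while `count()` counts non-null values in the selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values (assumption, not the IMDB dataset)
scores = pd.DataFrame({
    "Genre": ["Drama", "Drama", "Fantasy", "Fantasy"],
    "Metascore": [80.0, np.nan, np.nan, np.nan],
})

# size(): number of rows per group, NaN rows included
sizes = scores.groupby("Genre").size()

# count(): number of non-null values per group in the selected column
counts = scores.groupby("Genre")["Metascore"].count()

print(sizes)
print(counts)
```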

**first()** and **last():**
- To retrieve the first or last row within each group. For each column, pandas returns the first or last non-null value in the group.
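
A minimal sketch of `first()`, `last()`, and `nth()` on hypothetical toy data (the titles and ratings below are placeholders, not rows from the IMDB file):

```python
import pandas as pd

# Hypothetical toy data (assumption, not the IMDB dataset)
films = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Crime", "Drama"],
    "Series_Title": ["A", "B", "C", "D", "E"],
    "IMDB_Rating": [9.3, 9.2, 8.8, 9.0, 8.5],
})
film_groups = films.groupby("Genre")

first_rows = film_groups.first()   # first row of each group
last_rows = film_groups.last()     # last row of each group
second_rows = film_groups.nth(1)   # second row of each group (0-based position)

print(first_rows)
print(last_rows)
print(second_rows)
```

Note that the index layout of the `nth()` result differs between pandas versions, so it is safer to rely on its column values than on its index.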

**get_group():**
- To retrieve a single group by its label. This is an alternative to filtering the DataFrame with a boolean mask.
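
The comparison with filtering can be sketched as follows, again on hypothetical toy data. Both approaches select the same rows here.

```python
import pandas as pd

# Hypothetical toy data (assumption, not the IMDB dataset)
catalog = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Crime"],
    "Series_Title": ["A", "B", "C", "D"],
})

# Option 1: pull one group out of an existing GroupBy object
crime_group = catalog.groupby("Genre").get_group("Crime")

# Option 2: filter the DataFrame with a boolean mask
crime_mask = catalog[catalog["Genre"] == "Crime"]

print(crime_group)
print(crime_mask)
```

`get_group()` is convenient when the GroupBy object already exists; for a one-off selection, the boolean mask avoids building the groups at all.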

**groups:**
- To access the groups as a dictionary where the keys are the unique group labels and the values are the index labels of the rows in each group.
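
A short sketch of the `groups` attribute on hypothetical toy data, showing the mapping from group label to row index labels:

```python
import pandas as pd

# Hypothetical toy data (assumption, not the IMDB dataset)
listing = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Crime"],
    "Series_Title": ["A", "B", "C", "D"],
})

group_map = listing.groupby("Genre").groups

# Keys are the group labels, values are the index labels of the member rows
group_labels = sorted(group_map.keys())
crime_rows = list(group_map["Crime"])

print(group_labels)
print(crime_rows)
```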

**describe():**
- To generate descriptive statistics for each group, such as count, mean, standard deviation, min, max, and quartiles.
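
As an illustration, `describe()` on a single numeric column returns one row per group and one column per statistic. The data below is a hypothetical toy example.

```python
import pandas as pd

# Hypothetical toy data (assumption, not the IMDB dataset)
ratings = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Crime", "Drama"],
    "IMDB_Rating": [9.3, 9.2, 8.8, 9.0, 8.5],
})

# One row per genre; columns: count, mean, std, min, 25%, 50%, 75%, max
rating_stats = ratings.groupby("Genre")["IMDB_Rating"].describe()

print(rating_stats)
```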

**nunique():**
- To count the number of unique values within each group.
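
For example, `nunique()` can answer "how many distinct directors appear in each genre?". The sketch below uses hypothetical toy data with placeholder director names.

```python
import pandas as pd

# Hypothetical toy data (assumption, not the IMDB dataset)
credits = pd.DataFrame({
    "Genre": ["Drama", "Crime", "Drama", "Crime", "Drama"],
    "Director": ["X", "Y", "X", "Y", "W"],
})

# Number of distinct directors within each genre
directors_per_genre = credits.groupby("Genre")["Director"].nunique()

print(directors_per_genre)
```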