# Module 3: Data Analytics With Python - Applied Statistics 

## Lab 4: Data Manipulation With Pandas
<br><br><br><br><br><br>

## Objective
***

<ul type='disc'>
  <li>Creating New Categorical Features from Continuous Variable
  </li>
  <li>Groupby Operation</li>
  <ul type='circle'>
   <li>Aggregation</li>
   <li>Transformation</li>
   <li>Filtering</li>
  </ul>
  <li>Groupby Statistical Analysis</li>

</ul>

<br><br><br><br><br><br>
## Creating new categorical features from continuous variable
*** 
Binning, also known as quantization, transforms continuous numeric features into discrete ones (categories). Each discrete value represents a bin, or category, into which the raw continuous values are grouped.

Why do we need to create new categorical features from continuous variables? 
A wide range of numerical data is easier to read when it is grouped, and statistical analysis of those groups can provide better insight.

Let's say we want to group age by:
- 0 to 2 = ‘Toddler/Baby’
- 3 to 17 = ‘Child’
- 18 to 65 = ‘Adult’
- 66 to 99 = ‘Elderly’

Then we can use `pd.cut()`:

In [1]:
import pandas as pd
import numpy as np

In [2]:
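# Create 20 random ages between 1 and 99 (fixed seed for reproducibility)
rng = np.random.RandomState(42)
age = pd.DataFrame({'AGE': rng.randint(1, 100, 20)})
# Bin edges are right-inclusive: (0, 2], (2, 17], (17, 65], (65, 99]
age['bins'] = pd.cut(x=age['AGE'], bins=[0, 2, 17, 65, 99],
                     labels=['Toddler/Baby', 'Child', 'Adult', 'Elderly'])

age.head(10)  # show the first 10 rows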

|   | AGE | bins |
|---|-----|------|
| 0 | 52 | Adult |
| 1 | 93 | Elderly |
| 2 | 15 | Child |
| 3 | 72 | Elderly |
| 4 | 61 | Adult |
| 5 | 21 | Adult |
| 6 | 83 | Elderly |
| 7 | 87 | Elderly |
| 8 | 75 | Elderly |
| 9 | 75 | Elderly |
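
One detail worth checking is how `pd.cut()` treats the bin edges. The sketch below uses a small hand-picked series (`edge_ages` is an illustrative name, not part of the lab) to show that, by default, each bin excludes its left edge. An age of exactly 0 therefore falls outside the first bin unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical ages chosen to sit exactly on the bin edges
edge_ages = pd.Series([0, 2, 3, 17, 18, 65, 66, 99], name="AGE")
edges = [0, 2, 17, 65, 99]
labels = ["Toddler/Baby", "Child", "Adult", "Elderly"]

# Default behaviour: intervals are (0, 2], (2, 17], (17, 65], (65, 99]
default_bins = pd.cut(edge_ages, bins=edges, labels=labels)

# include_lowest=True also puts the lowest edge (0) into the first bin
inclusive_bins = pd.cut(edge_ages, bins=edges, labels=labels,
                        include_lowest=True)

comparison = pd.DataFrame({
    "AGE": edge_ages,
    "default": default_bins,
    "include_lowest": inclusive_bins,
})
print(comparison)
```

In the lab's data the ages start at 1, so the default setting is sufficient there.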


<br><br><br><br><br><br>
## Groupby Operation
***


To split the data, we use the `groupby()` function, which splits the data into groups based on some criteria. Pandas objects can be split on any of their axes.

By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria.

- Applying a function to each group independently.

- Combining the results into a data structure.

- **Aggregation: compute a summary statistic (or statistics) for each group. Some examples:**

     - Compute group sums or means.

     - Compute group sizes / counts.

- **Transformation: perform some group-specific computations and return a like-indexed object. Some examples:**

    - Standardize data (zscore) within a group.

    - Filling NAs within groups with a value derived from each group.

- **Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:**

    - Discard data that belongs to groups with only a few members.

    - Filter out data based on the group sum or mean.
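
The three steps above can be sketched by hand to see what `groupby()` does in a single call. The data here (`sales`, with `store` and `revenue` columns) is hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data, not part of the lab
sales = pd.DataFrame({
    "store": ["A", "B", "A", "B", "A"],
    "revenue": [10, 20, 30, 40, 50],
})

# 1. Split: one sub-DataFrame per store
pieces = {key: part for key, part in sales.groupby("store")}

# 2. Apply: compute a statistic on each piece independently
totals = {key: part["revenue"].sum() for key, part in pieces.items()}

# 3. Combine: collect the per-group results into one Series
manual = pd.Series(totals, name="revenue")
manual.index.name = "store"

# The same result in one step
builtin = sales.groupby("store")["revenue"].sum()

print(manual)
print(builtin)
```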

In [3]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
df

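|         | class  | order          | max_speed |
|---------|--------|----------------|-----------|
| falcon  | bird   | Falconiformes  | 389.0 |
| parrot  | bird   | Psittaciformes | 24.0 |
| lion    | mammal | Carnivora      | 80.2 |
| monkey  | mammal | Primates       | NaN |
| leopard | mammal | Carnivora      | 58.0 |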


Of the three steps (split, apply, combine), the split step is the most straightforward. In many situations we want to split the data set into groups and then do something with those groups. In the apply step, we might wish to do one of the following: 
<br><br><br><br><br><br>
### Aggregation
***
*   Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
1.   Compute group sums or means.
2.   Compute group sizes / counts

In [4]:
# Aggregation
df.groupby(['class'])['max_speed'].sum()

class
bird      413.0
mammal    138.2
Name: max_speed, dtype: float64
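
The list above also mentions group sizes and counts. These differ when a group contains missing values: `size()` counts every row, while `count()` skips `NaN`. The following is a minimal sketch that rebuilds the relevant columns of the animals table under the name `animals` (chosen here for illustration) and computes several statistics at once with `agg()`.

```python
import numpy as np
import pandas as pd

# Same values as the lab's animal table, rebuilt so the example is standalone
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

by_class = animals.groupby("class")["max_speed"]

# Several summary statistics per group in one call
summary = by_class.agg(["mean", "count"])

# size() includes the row with the missing speed; count() does not
sizes = by_class.size()

print(summary)
print(sizes)
```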

### Transformation
* Transformation: perform some group-specific computations and return a like-indexed object. Some examples: 
1. Standardize data (zscore) within a group. 
2. Filling NAs within groups with a value derived from each group. 


In [5]:
# Transformation
df.groupby(['class'])['max_speed'].transform(lambda x:x.fillna(x.mean()))

falcon     389.0
parrot      24.0
lion        80.2
monkey      69.1
leopard     58.0
Name: max_speed, dtype: float64
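
The other transformation mentioned above, standardizing within a group, follows the same pattern. The sketch below assumes the sample standard deviation (the pandas default, `ddof=1`) and rebuilds the animal data as `animals` so it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the lab's animal table, rebuilt so the example is standalone
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

# z-score of each speed relative to the mean and std of its own class
zscores = animals.groupby("class")["max_speed"].transform(
    lambda s: (s - s.mean()) / s.std()
)

print(zscores)
```

The result has the same index as the input, which is what distinguishes a transformation from an aggregation. The missing speed for `monkey` stays `NaN`.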

<br><br><br><br><br><br>
### Filtering
***
* Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples: 
1. Discard data that belongs to groups with only a few members. 
2. Filter out data based on the group sum or mean. 


In [6]:
# filtering
df.groupby(['class'])['max_speed'].filter(lambda x: x.mean() > 200)

falcon    389.0
parrot     24.0
Name: max_speed, dtype: float64
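
The first example in the list, discarding groups with only a few members, can be sketched the same way. Here the threshold of two members and the choice to group by `order` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the lab's animal table, rebuilt so the example is standalone
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "order": ["Falconiformes", "Psittaciformes", "Carnivora",
                  "Primates", "Carnivora"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

# Keep only rows whose order has at least two members
large_orders = animals.groupby("order").filter(lambda g: len(g) >= 2)

print(large_orders)
```

Note that `filter()` keeps or drops whole groups: the function is evaluated once per group, not once per row.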

In [7]:
# Getting Groups
groups = df.groupby('class')  # a single string key keeps group names as plain labels
groups.get_group('bird')

|        | class | order          | max_speed |
|--------|-------|----------------|-----------|
| falcon | bird  | Falconiformes  | 389.0 |
| parrot | bird  | Psittaciformes | 24.0 |


In [8]:
# Iterating through groups
for name,data in groups:
    print("\nName:",name)
    print("\n",data)


Name: bird

        class           order  max_speed
falcon  bird   Falconiformes      389.0
parrot  bird  Psittaciformes       24.0

Name: mammal

           class      order  max_speed
lion     mammal  Carnivora       80.2
monkey   mammal   Primates        NaN
leopard  mammal  Carnivora       58.0
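
When grouping by more than one column, each group name is a tuple with one entry per key. The sketch below rebuilds the animal data as `animals` and unpacks that tuple while iterating.

```python
import numpy as np
import pandas as pd

# Same values as the lab's animal table, rebuilt so the example is standalone
animals = pd.DataFrame(
    {
        "class": ["bird", "bird", "mammal", "mammal", "mammal"],
        "order": ["Falconiformes", "Psittaciformes", "Carnivora",
                  "Primates", "Carnivora"],
        "max_speed": [389.0, 24.0, 80.2, np.nan, 58.0],
    },
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
)

group_sizes = {}
# Each name is a (class, order) tuple; groups come back sorted by key
for (animal_class, order), part in animals.groupby(["class", "order"]):
    group_sizes[(animal_class, order)] = len(part)
    print(animal_class, order, list(part.index))
```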


<br><br><br><br><br><br>
## Groupby Statistical Analysis
***

Pandas `groupby` can also be used to compute group statistics. Using seaborn's `tips` example dataset, we can do the following:

In [9]:
# Getting the data
import seaborn as sns
df1=sns.load_dataset('tips')
df1.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [10]:
df1.groupby(['sex', 'smoker']).agg({
        'total_bill': ['median'], 
        'tip': ['median', 'min', 'max','count']
    })

| sex    | smoker | total_bill (median) | tip (median) | tip (min) | tip (max) | tip (count) |
|--------|--------|---------------------|--------------|-----------|-----------|-------------|
| Male   | Yes    | 20.39 | 3.0  | 1.0  | 10.0 | 60 |
| Male   | No     | 18.24 | 2.74 | 1.25 | 9.0  | 97 |
| Female | Yes    | 16.27 | 2.88 | 1.0  | 6.5  | 33 |
| Female | No     | 16.69 | 2.68 | 1.0  | 5.2  | 54 |
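
The dictionary form of `agg()` produces two-level column headers. If flat column names are preferred, named aggregation is one alternative. The sketch below uses a small made-up table (`tips_like`) with the same column names as the tips data so that it runs without downloading anything. The output column names are likewise chosen only for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the tips dataset; values are made up
tips_like = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "smoker": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "total_bill": [20.0, 18.0, 16.0, 17.0, 22.0, 15.0],
    "tip": [3.0, 2.5, 3.0, 2.0, 4.0, 3.0],
})

# Named aggregation: new_column=(source_column, function)
summary = (
    tips_like.groupby(["sex", "smoker"])
    .agg(
        median_bill=("total_bill", "median"),
        median_tip=("tip", "median"),
        min_tip=("tip", "min"),
        max_tip=("tip", "max"),
        n_tips=("tip", "count"),
    )
    .reset_index()  # turn the group keys back into ordinary columns
)

print(summary)
```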


### Thank You !