# Data manipulation using .groupby()

In [3]:
import pandas as pd
restaurant_data = pd.read_csv("data/restaurant_data.csv")

## Data transformation using .groupby().transform


### The min-max normalization using .transform()
A very common operation is min-max normalization. It rescales a value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weights spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).
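
The weight example above can be worked through in a few lines. This is a minimal sketch: the four weights are hypothetical values chosen only so that the minimum is 160 and the maximum is 200.

```python
import pandas as pd

# Hypothetical student weights in pounds, spanning 160 to 200
weights = pd.Series([160, 170, 180, 200], name="weight")

# Subtract the minimum, then divide by the range (200 - 160 = 40)
weights_scaled = (weights - weights.min()) / (weights.max() - weights.min())

print(weights_scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is what makes columns with different units comparable after rescaling.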

You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.

In [11]:
# Define the min-max transformation
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

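# Apply the transformation to the numeric columns only
# (string columns such as sex or day cannot be rescaled)
restaurant_min_max_group = restaurant_grouped[['total_bill', 'tip', 'size']].transform(min_max_tr)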
print(restaurant_min_max_group.head())

   total_bill       tip  size
0    0.291579  0.001111   0.2
1    0.152283  0.073333   0.4
2    0.375786  0.277778   0.4
3    0.431713  0.256667   0.2
4    0.450775  0.290000   0.6


### Transforming values to probabilities
In this exercise, we will apply a probability distribution function with group-related parameters to a pandas DataFrame, transforming the tip variable into probabilities.

The transformation will be an exponential transformation. The exponential distribution is defined as

$$\lambda e^{-\lambda x}$$
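where λ (lambda) is the mean of the group that the observation x belongs to. Note that in the usual parameterization of the exponential distribution, λ is the rate (the reciprocal of the mean); this exercise plugs in the group mean directly.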
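
To see how the group-specific parameter enters the formula, here is a small sketch on made-up data. The `toy` DataFrame and its values are assumptions for illustration only; λ is taken to be the group mean, following the definition above.

```python
import numpy as np
import pandas as pd

# Hypothetical tips for two groups; the values are illustrative only
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "tip": [1.0, 3.0, 2.0, 4.0],
})

# lambda is the mean of the group each observation belongs to
exp_density = lambda x: x.mean() * np.exp(-x.mean() * x)

toy["tip_exp"] = toy.groupby("time")["tip"].transform(exp_density)

# The Lunch mean is 2.0, so row 0 equals 2 * exp(-2 * 1);
# the Dinner mean is 3.0, so row 2 equals 3 * exp(-3 * 2)
print(toy)
```

Because `transform` calls the function once per group, `x.mean()` inside the lambda is the group mean rather than the mean of the whole column.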

In [14]:
import numpy as np


# Define the exponential transformation
exp_tr = lambda x: np.exp(-x.mean()*x) * x.mean()

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)
print(restaurant_exp_group.head())

0    0.135141
1    0.017986
2    0.000060
3    0.000108
4    0.000042
Name: tip, dtype: float64


### Validation of normalization
For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of normalized values is that they have a mean equal to zero and standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.

You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.
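
The check described above (transform, group again, inspect mean and standard deviation) can be sketched on a tiny made-up dataset before running it on the poker data. The `toy` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups with different locations and scales
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

# z-score within each group (pandas .std() uses the sample standard deviation)
zscore = lambda x: (x - x.mean()) / x.std()
toy["value_z"] = toy.groupby("group")["value"].transform(zscore)

# Group again on the same key and inspect the per-group mean and std
check = toy.groupby("group")["value_z"].agg(["mean", "std"])
print(check.round(3))
```

Each group should report a mean of (numerically) zero and a standard deviation of one, even though the raw groups had very different scales.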

In [19]:
poker_hands = pd.read_csv("data/poker_hand.csv")
poker_grouped = poker_hands.groupby('Class')


In [20]:
zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker_grouped.transform(zscore)

# Re-group the grouped object and print each group's means and standard deviation
poker_regrouped = poker_trans.groupby(poker_hands['Class'])

print(np.round(poker_regrouped.mean(), 3))
print(poker_regrouped.std())

        S1   R1   S2   R2   S3   R3   S4   R4   S5   R5
Class                                                  
0     -0.0 -0.0  0.0 -0.0  0.0  0.0  0.0  0.0 -0.0  0.0
1      0.0  0.0 -0.0  0.0 -0.0  0.0  0.0  0.0 -0.0  0.0
2     -0.0 -0.0  0.0 -0.0 -0.0  0.0  0.0 -0.0 -0.0  0.0
3      0.0  0.0  0.0 -0.0 -0.0 -0.0 -0.0 -0.0  0.0 -0.0
4     -0.0 -0.0 -0.0 -0.0  0.0 -0.0 -0.0  0.0  0.0  0.0
5     -0.0 -0.0 -0.0  0.0 -0.0  0.0 -0.0 -0.0 -0.0  0.0
6     -0.0 -0.0 -0.0  0.0  0.0 -0.0  0.0  0.0 -0.0  0.0
7      0.0 -0.0 -0.0  0.0 -0.0  0.0  0.0 -0.0 -0.0 -0.0
8     -0.0  0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0 -0.0 -0.0
9      0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0  0.0  0.0 -0.0
        S1   R1   S2   R2   S3   R3   S4   R4   S5   R5
Class                                                  
0      1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1      1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
2      1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
3      1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1

## Missing value imputation using transform()
- can count missing values
- replace them with fillna

### Identifying missing values
The first step before missing value imputation is to identify if there are missing values in our data, and if so, from which group they arise.

For the same restaurant_data you encountered in the lesson, an employee erased by mistake the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present versus tables with no smokers present.

Your task is to group the data according to the smoker variable, count the number of present values, and then calculate the difference.
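
As a rough illustration of the counting approach, the sketch below uses a small hypothetical frame in which some tips have been blanked out. It relies on the fact that `count()` skips missing values, so the gap between a complete column and an incomplete one gives the number of missing entries per group.

```python
import numpy as np
import pandas as pd

# Hypothetical data: total_bill is complete, tip has missing entries
toy = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "No"],
    "total_bill": [10.0, 20.0, 15.0, 30.0, 25.0],
    "tip": [1.0, np.nan, 2.0, np.nan, np.nan],
})

grouped = toy.groupby("smoker")

# count() ignores NaN, so the difference is the number of missing tips
missing_per_group = grouped["total_bill"].count() - grouped["tip"].count()
print(missing_per_group)
```

Here the "No" group is missing two tips and the "Yes" group is missing one.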

We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not impute financial data in real life, as it could be considered fraud.

In [23]:
restaurant_data.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [25]:
# Group the data according to smoking status
restaurant_nan_grouped = restaurant_data.groupby('smoker')

# Store the number of present values
restaurant_nan_nval = restaurant_nan_grouped['tip'].count()

# Print the group-wise missing entries
print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)


smoker
No     0
Yes    0
dtype: int64


### Missing value imputation
As most real-world data contain missing entries, replacing these entries with sensible values can increase the insight you get from your data.

In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded at lunch or dinner (the time variable).
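
A minimal sketch of group-wise median imputation on made-up data follows. The `toy` frame and the `total_bill_imputed` column name are hypothetical and used only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing bill in each group
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, np.nan, 14.0, 20.0, 30.0, np.nan],
})

# Fill each group's missing values with that group's own median
fill_median = lambda x: x.fillna(x.median())
toy["total_bill_imputed"] = toy.groupby("time")["total_bill"].transform(fill_median)

# The Lunch gap becomes 12.0 (median of 10 and 14),
# the Dinner gap becomes 25.0 (median of 20 and 30)
print(toy)
```

Filling within groups keeps a missing lunch bill from being replaced by a value that is dominated by the typically different dinner bills.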

In [26]:
# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

# Group the data according to time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation to the numeric columns only
restaurant_impute = restaurant_grouped[['total_bill', 'tip', 'size']].transform(missing_trans)
print(restaurant_impute.head())

   total_bill   tip  size
0       16.99  1.01     2
1       10.34  1.66     3
2       21.01  3.50     3
3       23.68  3.31     2
4       24.59  3.61     4


## Data filtration using the filter() function


### Data filtration
As you noticed in the video lesson, you may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of the DataFrame:

- by the number of entries recorded on each day of the week
- by the mean amount of money the customers paid to the restaurant on each day of the week
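
The two filtering criteria listed above can be sketched on a small hypothetical frame. The thresholds used here (at least 2 entries, mean above 20) are scaled down to fit the toy data and are not the ones used in the exercise.

```python
import pandas as pd

# Hypothetical data: Sat has 3 rows, Sun has 2 rows, Fri has 1 row
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Fri"],
    "total_bill": [25.0, 30.0, 20.0, 10.0, 12.0, 40.0],
})

# Keep only the days with at least 2 recorded entries (drops Fri)
enough_rows = toy.groupby("day").filter(lambda g: g["total_bill"].count() >= 2)

# Of those, keep only the days whose mean bill exceeds 20 (drops Sun)
high_mean = enough_rows.groupby("day").filter(lambda g: g["total_bill"].mean() > 20)

print(high_mean["day"].unique())  # ['Sat']
```

Unlike `transform`, the function passed to `filter` returns a single True or False per group, and every row of a failing group is dropped.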

In [28]:
# Filter the days where the count of total_bill entries is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Select only the entries that have a mean total_bill greater than $20
total_bill_20 = total_bill_40.groupby('day').filter(lambda x : x['total_bill'].mean() > 20)

# Print days of the week that have a mean total_bill greater than $20
print('Days of the week that have a mean total_bill greater than $20:', total_bill_20.day.unique())

Days of the week that have a mean total_bill greater than $20: ['Sun' 'Sat']


In [30]:
total_bill_20.count()

total_bill    163
tip           163
sex           163
smoker        163
day           163
time          163
size          163
dtype: int64