# WRITING EFFICIENT CODE IN PANDAS


In [3]:
# Import any packages you want to use here
import time
import numpy as np
import pandas as pd


# Dataset

In [31]:
# Load the three datasets used in this notebook
poker_hand = pd.read_csv("poker_hand.csv")
names = pd.read_csv("Popular_Baby_Names.csv")
restaurant_data = pd.read_csv("restaurant_data.csv")

In [5]:
poker_hand.head()

   S1  R1  S2  R2  S3  R3  S4  R4  S5  R5  Class
0   1  10   1  11   1  13   1  12   1   1      9
1   2  11   2  13   2  10   2  12   2   1      9
2   3  12   3  11   3  13   3  10   3   1      9
3   4  10   4  11   4   1   4  13   4  12      9
4   4   1   4  13   4  12   4  11   4  10      9


In [18]:
names.head()

   Year of Birth  Gender                   Ethnicity  Child's First Name  Count  Rank
0           2011  FEMALE  ASIAN AND PACIFIC ISLANDER              SOPHIA    119     1
1           2011  FEMALE  ASIAN AND PACIFIC ISLANDER               CHLOE    106     2
2           2011  FEMALE  ASIAN AND PACIFIC ISLANDER               EMILY     93     3
3           2011  FEMALE  ASIAN AND PACIFIC ISLANDER              OLIVIA     89     4
4           2011  FEMALE  ASIAN AND PACIFIC ISLANDER                EMMA     75     5


In [32]:
restaurant_data.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


# 1. SELECTING ROWS AND COLUMNS

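### Note
1. Label-based selection (`.loc[]` or column names) is noted as better for columns, although the single timing recorded below favors `.iloc[]`; such sub-millisecond timings are noisy.
2. `.iloc[]` is better for rows.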

### Row selection: `.loc[]` vs `.iloc[]`
A big part of working with DataFrames is locating specific entries in the dataset. You can locate rows in two ways:

1. By a specific value of a column (feature).
2. By the index of the rows (index).

In this exercise, we will focus on the second way. If you have previous experience with pandas, you should be familiar with the `.loc[]` and `.iloc[]` indexers, which select by label ('location') and by integer position ('integer location') respectively. In most cases, the index labels will be the same as the position of each row in the DataFrame (e.g. the row with index 13 will be the 14th entry).

While we can use both indexers to perform the same task, we are interested in which one is faster.

In [9]:
# Define the range of rows to select: row_nums
row_nums = range(0, 1000)

# Select the rows using .loc[] and row_nums and record the time before and after
loc_start_time = time.time()
rows = poker_hand.loc[row_nums]
loc_end_time = time.time()

# Print the time it took to select the rows using .loc
print("Time using .loc[]: {} sec".format(loc_end_time - loc_start_time))

# Select the rows using .iloc[] and row_nums and record the time before and after
iloc_start_time = time.time()
rows = poker_hand.iloc[row_nums]
iloc_end_time = time.time()

# Print the time it took to select the rows using .iloc
print("Time using .iloc[]: {} sec".format(iloc_end_time - iloc_start_time))

Time using .loc[]: 0.0017616748809814453 sec
Time using .iloc[]: 0.0005860328674316406 sec
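
A single `time.time()` measurement of a sub-millisecond operation can swing a lot between runs. As a rough sketch of a steadier comparison, the standard-library `timeit` module can repeat each selection many times. The frame `df` below is a synthetic stand-in for `poker_hand` (an assumption, not the real data), and the repeat counts are arbitrary.

```python
import timeit

import numpy as np
import pandas as pd

# Toy stand-in for poker_hand (assumption: default RangeIndex, integer columns)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 14, size=(25_000, 11)))
row_nums = range(0, 1000)

# Both indexers return the same rows when the labels equal the positions
loc_rows = df.loc[row_nums]
iloc_rows = df.iloc[row_nums]

# Repeat each selection many times and keep the best of several runs
loc_time = min(timeit.repeat(lambda: df.loc[row_nums], number=100, repeat=3))
iloc_time = min(timeit.repeat(lambda: df.iloc[row_nums], number=100, repeat=3))

print("Best of 3 x 100 calls, .loc[]:  {:.4f} sec".format(loc_time))
print("Best of 3 x 100 calls, .iloc[]: {:.4f} sec".format(iloc_time))
```

The absolute numbers depend on the machine and the pandas version, so only the relative ordering is worth reading from the output.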


### Column selection: `.iloc[]` vs by name
In the previous exercise, you saw how the `.loc[]` and `.iloc[]` indexers can be used to locate specific rows of a DataFrame (based on the index). In the run above, `.iloc[]` was about 3 times faster for this task.

Another important task is to find the faster way to select targeted features (columns) of a DataFrame. In this exercise, we will compare the following:

1. using the index locator `.iloc[]`
2. using the names of the columns

While we can use both approaches to perform the same task, we are interested in which one is faster. In this exercise, you will continue working with the poker data, which is stored in `poker_hand`. Take a second to examine the structure of this DataFrame by calling `poker_hand.head()`.

In [10]:
# Use .iloc to select the first, fourth, fifth, seventh and eighth column and record the times before and after
iloc_start_time = time.time()
cols = poker_hand.iloc[:,[0,3,4,6,7]]
iloc_end_time = time.time()

# Print the time it took
print("Time using .iloc[] : {} sec".format(iloc_end_time - iloc_start_time))

# Use simple column selection to select the first, fourth, fifth, seventh and eighth column and record the times before and after
names_start_time = time.time()
cols = poker_hand[['S1', 'R2', 'S3', 'S4', 'R4']]
names_end_time = time.time()

# Print the time it took
print("Time using selection by name : {} sec".format(names_end_time - names_start_time))

Time using .iloc[] : 0.0008645057678222656 sec
Time using selection by name : 0.0012655258178710938 sec
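
For the two timings to be comparable, both selections need to return the same columns. A small sketch of how to check that, assuming a toy frame that only mimics the column order of `poker_hand` (the values are made up):

```python
import pandas as pd

# Toy frame with the same column order as poker_hand (values are made up)
cols = ['S1', 'R1', 'S2', 'R2', 'S3', 'R3', 'S4', 'R4', 'S5', 'R5', 'Class']
df = pd.DataFrame([list(range(11)), list(range(11, 22))], columns=cols)

positions = [0, 3, 4, 6, 7]

# Look up the column names that sit at those positions
names_at_positions = list(df.columns[positions])
print(names_at_positions)

# Selecting by position and by the matching names gives the same result
by_position = df.iloc[:, positions]
by_name = df[names_at_positions]
```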


### Random row selection
In this exercise, you will compare two methods for selecting random rows (entries) with replacement in a pandas DataFrame:

1. The built-in pandas function `.sample()`
2. The NumPy random integer number generator `np.random.randint()`

Generally, in statistics and machine learning, when we need to train an algorithm, we commonly train it on 75% of the available data and then test its performance on the remaining 25%.

For this exercise, we will randomly sample 75% of all the played poker hands available, using each of the above methods, and check which method is faster.

**The built-in function ran faster in the run below.**

In [12]:
# Extract number of rows in dataset
N=poker_hand.shape[0]

# Select and time the selection of the 75% of the dataset's rows
rand_start_time = time.time()
poker_hand.iloc[np.random.randint(low=0, high=N, size=int(0.75 * N))]
print("Time using Numpy: {} sec".format(time.time() - rand_start_time))

# Select and time the selection of the 75% of the dataset's rows using sample()
samp_start_time = time.time()
poker_hand.sample(int(0.75 * N), axis=0, replace = True)
print("Time using .sample: {} sec".format(time.time() - samp_start_time))

Time using Numpy: 0.002725839614868164 sec
Time using .sample: 0.0014526844024658203 sec
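
The two sampling routes above can also be made reproducible by fixing a seed. The sketch below uses a toy frame (`hand_id` is a hypothetical column) and NumPy's `default_rng` generator in place of `np.random.randint()`; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hand_id': range(100)})  # toy data, not the poker hands
N = df.shape[0]

# NumPy route: draw row positions with replacement, then index by position
rng = np.random.default_rng(42)
idx = rng.integers(low=0, high=N, size=int(0.75 * N))
np_sample = df.iloc[idx]

# pandas route: frac=0.75 asks for 75% of the rows directly
pd_sample = df.sample(frac=0.75, replace=True, random_state=42)

print(len(np_sample), len(pd_sample))
```

The two routes use different random streams, so they return different rows even with the same seed; only the sample size is the same.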


### Random column selection
In the previous exercise, we examined two ways to select random rows from a pandas DataFrame. We can use the same functions to randomly select columns in a pandas DataFrame.

To randomly select 4 columns out of the poker dataset, you will use the following two functions:

1. The built-in pandas function `.sample()`
2. The NumPy random integer number generator `np.random.randint()`

In [13]:
# Extract number of columns in dataset
D=poker_hand.shape[1]

# Select and time the selection of 4 of the dataset's columns using NumPy
# (np.random.randint draws with replacement, so a column can repeat)
np_start_time = time.time()
poker_hand.iloc[:,np.random.randint(low=0, high=D, size=4)]
print("Time using NumPy's random.randint(): {} sec".format(time.time() - np_start_time))

# Select and time the selection of 4 of the dataset's columns using pandas
# (.sample() draws without replacement by default)
pd_start_time = time.time()
poker_hand.sample(4, axis=1)
print("Time using pandas' .sample(): {} sec".format(time.time() - pd_start_time))

Time using NumPy's random.randint(): 0.0008451938629150391 sec
Time using pandas' .sample(): 0.0005755424499511719 sec


# 2. REPLACING VALUES

### Replacing scalar values I
In this exercise, we will use the `.replace()` method to replace a list of values in our dataset with another list of desired values.

We will apply the method to the poker_hands DataFrame. Remember that in poker_hands, columns R1 to R5 of each row hold the rank of each card in a player's poker hand, from 1 (Ace) to 13 (King). The Class feature classifies each hand as a category, and the Explanation feature briefly explains each hand.


In [16]:
# Replace Class 1 with -2 (assign the result back instead of a chained inplace call)
poker_hands = poker_hand.copy()
poker_hands['Class'] = poker_hands['Class'].replace(1, -2)
# Replace Class 2 with -3
poker_hands['Class'] = poker_hands['Class'].replace(2, -3)

print(poker_hands[['Class', 'S1']])

       Class  S1
0          9   1
1          9   2
2          9   3
3          9   4
4          9   4
...      ...  ..
25005      0   3
25006     -2   4
25007     -2   2
25008     -2   2
25009     -2   1

[25010 rows x 2 columns]


### Replace scalar values II
As discussed in the video, it is possible to replace values in a pandas DataFrame in a very intuitive way: locate the position (row and column) in the DataFrame and assign the new value. In a more pandas-ian way, the `.replace()` method performs the same task.

You will be using the names DataFrame, which includes, among others, the most popular names in the US by year, gender and ethnicity.

Your task is to replace all the babies that are classified as FEMALE with GIRL using the following methods:

1. intuitive scalar replacement
2. using the `.replace()` method
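
A minimal sketch of the two methods, run on a small made-up frame (`names_toy` and its rows are illustrative, not the real names data):

```python
import pandas as pd

# Toy stand-in for the names DataFrame (values are illustrative)
names_toy = pd.DataFrame({
    'Gender': ['FEMALE', 'MALE', 'FEMALE'],
    "Child's First Name": ['SOPHIA', 'ETHAN', 'CHLOE'],
})

# Method 1: locate the matching rows with a boolean mask and assign
by_loc = names_toy.copy()
by_loc.loc[by_loc['Gender'] == 'FEMALE', 'Gender'] = 'GIRL'

# Method 2: .replace() on the column, assigning the result back
by_replace = names_toy.copy()
by_replace['Gender'] = by_replace['Gender'].replace('FEMALE', 'GIRL')

print(by_replace['Gender'].tolist())
```

Both approaches end with the same frame; `.replace()` tends to read more clearly once several values need to change at once.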

### Replace multiple values I
In this exercise, you will apply the `.replace()` method to replace multiple values with one or more values. You will again use the names dataset, which contains, among others, the most popular names in the US by year, gender and ethnicity.

You want to replace all ethnicities classified as black or white non-Hispanic with a single non-Hispanic label. Remember, the ethnicities are stated in the dataset as follows:\
`['BLACK NON HISP', 'BLACK NON HISPANIC', 'WHITE NON HISP' , 'WHITE NON HISPANIC']` \
and should be replaced with 'NON HISPANIC'.

In [19]:
names1 = names.copy()
start_time = time.time()

# Replace all non-Hispanic ethnicities with 'NON HISPANIC'
names1['Ethnicity'] = names1['Ethnicity'].replace(['BLACK NON HISP', 'BLACK NON HISPANIC', 'WHITE NON HISP' , 'WHITE NON HISPANIC'], 'NON HISPANIC')

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0028188228607177734 sec


### Replace multiple values II
As discussed in the video, instead of calling `.replace()` multiple times to replace multiple values, you can use lists to map the elements you want to replace one to one with their replacements.

As you have seen in our popular names dataset, some ethnicities appear under two different names. We want to standardize the naming of each ethnicity by replacing

'ASIAN AND PACI' with 'ASIAN AND PACIFIC ISLANDER'\
'BLACK NON HISP' with 'BLACK NON HISPANIC'\
'WHITE NON HISP' with 'WHITE NON HISPANIC'

In the DataFrame names, you are going to replace all the values on the left with the values on the right.

In [20]:
names2 = names.copy()
start_time = time.time()

# Replace ethnicities as instructed
names2['Ethnicity'] = names2['Ethnicity'].replace(['ASIAN AND PACI','BLACK NON HISP', 'WHITE NON HISP'], ['ASIAN AND PACIFIC ISLANDER','BLACK NON HISPANIC','WHITE NON HISPANIC'])

print("Time using .replace(): {} sec".format(time.time() - start_time))

Time using .replace(): 0.0024690628051757812 sec


### Replace single values I
In this exercise, we will apply the technique of replacing values using dictionaries on a different dataset.

We will apply it to the poker_hands DataFrame. Each row represents the rank of 5 cards from a playing card deck, spanning from 1 (Ace) to 13 (King) (features R1, R2, R3, R4, R5). The feature 'Class' classifies each row to a category (from 0 to 9) and the feature 'Explanation' gives a brief explanation of what each class represents.

The purpose of this exercise is to categorize the two types of flush in the game ('Royal flush' and 'Straight flush') under the 'Flush' name.

**Replace Royal flush or Straight flush with Flush**\
`poker_hands.replace({'Explanation': {'Royal flush': 'Flush', 'Straight flush': 'Flush'}}, inplace=True)`\
`print(poker_hands['Explanation'].head())`
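
The `poker_hand.csv` file loaded in this notebook does not appear to carry an 'Explanation' column, so the nested-dictionary form is sketched here on a toy frame. The frame `hands` and the label 'Nothing in hand' are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with a hypothetical Explanation column
hands = pd.DataFrame({
    'Class': [9, 8, 5, 0],
    'Explanation': ['Royal flush', 'Straight flush', 'Flush', 'Nothing in hand'],
})

# Outer key = column name, inner dict = old value -> new value
hands = hands.replace({'Explanation': {'Royal flush': 'Flush',
                                       'Straight flush': 'Flush'}})

print(hands['Explanation'].tolist())
```

Only the column named in the outer key is touched, so 'Class' keeps its original values.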

### Replace single values II
For this exercise, we will be using the names DataFrame. In this dataset, the column 'Rank' shows the ranking of each name by year. You will use dictionaries to replace the rank of the first ranked name of every year with 'FIRST', the second with 'SECOND' and the third with 'THIRD'.

You will use dictionaries to replace one single value per key.

You can already see the first 5 names of the data, which correspond to the 5 most popular names for all the females belonging to the 'ASIAN AND PACIFIC ISLANDER' ethnicity in 2011.

In [22]:
names3 = names.copy()

# Replace the number rank by a string
names3['Rank'] = names3['Rank'].replace({1: 'FIRST', 2:'SECOND', 3:'THIRD'})
print(names3.head())

   Year of Birth  Gender  ... Count    Rank
0           2011  FEMALE  ...   119   FIRST
1           2011  FEMALE  ...   106  SECOND
2           2011  FEMALE  ...    93   THIRD
3           2011  FEMALE  ...    89       4
4           2011  FEMALE  ...    75       5

[5 rows x 6 columns]


### Replace multiple values III
As you saw in the video, you can use dictionaries to replace multiple values with just one value, even from multiple columns. To show the usefulness of replacing with dictionaries, you will use the names dataset one more time.

In this dataset, the column 'Rank' shows which rank each name reached every year. You will change the rank of the first three ranked names of every year to 'MEDAL' and those from 4th and 5th place to 'ALMOST MEDAL'.

You can already see the first 5 names of the data, which correspond to the 5 most popular names for all the females belonging to the 'ASIAN AND PACIFIC ISLANDER' ethnicity in 2011.

In [23]:
# Replace the rank of the first three ranked names to 'MEDAL'
names2.replace({'Rank': {1:'MEDAL', 2:'MEDAL', 3:'MEDAL'}}, inplace=True)

# Replace the rank of the 4th and 5th ranked names to 'ALMOST MEDAL'
names2.replace({'Rank': {4:'ALMOST MEDAL', 5:'ALMOST MEDAL'}}, inplace=True)
print(names2.head())

   Year of Birth  Gender  ... Count          Rank
0           2011  FEMALE  ...   119         MEDAL
1           2011  FEMALE  ...   106         MEDAL
2           2011  FEMALE  ...    93         MEDAL
3           2011  FEMALE  ...    89  ALMOST MEDAL
4           2011  FEMALE  ...    75  ALMOST MEDAL

[5 rows x 6 columns]


# 3. ITERATING VIA `.iterrows()`

### Create a generator for a pandas DataFrame
As you've seen in the video, you can easily create a generator out of a pandas DataFrame. Each time you iterate through it, it will yield two elements:

1. the index of the respective row
2. a pandas Series with all the elements of that row

You are going to create a generator over the poker dataset, stored as poker_hands. Then, you will print all the elements of the first two rows, using the generator.


In [24]:
# Create a generator over the rows
generator = poker_hands.iterrows()

# Access the elements of the 2nd row
first_element = next(generator)
second_element = next(generator)
print(first_element, second_element)

(0, S1        1
R1       10
S2        1
R2       11
S3        1
R3       13
S4        1
R4       12
S5        1
R5        1
Class     9
Name: 0, dtype: int64) (1, S1        2
R1       11
S2        2
R2       13
S3        2
R3       10
S4        2
R4       12
S5        2
R5        1
Class     9
Name: 1, dtype: int64)


### The iterrows() function for looping
You just saw how to create a generator out of a pandas DataFrame. You will now use this generator and see how to take advantage of that method of looping through a pandas DataFrame, still using the poker_hands dataset.

Specifically, we want the sum of the ranks of all the cards, if the index of the hand is an odd number. The ranks of the cards are located at the odd column positions of the DataFrame (1, 3, 5, 7 and 9, counting from 0).

In [25]:
data_generator = poker_hands.iterrows()

for index, values in data_generator:
    # Check if index is odd
    if index % 2 != 0:
        # Sum the ranks of all the cards (positions 1, 3, 5, 7, 9 hold R1..R5)
        hand_sum = values.iloc[[1, 3, 5, 7, 9]].sum()
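
The loop above overwrites `hand_sum` on every pass, so nothing is kept. As a sketch of one way to keep the results, and of the vectorized equivalent, the example below uses the first four hands from the `poker_hand.head()` output; `loop_sums` and `vec_sums` are names introduced here for illustration.

```python
import pandas as pd

cols = ['S1', 'R1', 'S2', 'R2', 'S3', 'R3', 'S4', 'R4', 'S5', 'R5', 'Class']
hands = pd.DataFrame(
    [[1, 10, 1, 11, 1, 13, 1, 12, 1, 1, 9],
     [2, 11, 2, 13, 2, 10, 2, 12, 2, 1, 9],
     [3, 12, 3, 11, 3, 13, 3, 10, 3, 1, 9],
     [4, 10, 4, 11, 4, 1, 4, 13, 4, 12, 9]],
    columns=cols,
)

# Loop version: collect one sum per odd-indexed row
loop_sums = {}
for index, values in hands.iterrows():
    if index % 2 != 0:
        loop_sums[index] = values.iloc[[1, 3, 5, 7, 9]].sum()

# Vectorized version: slice the odd rows and the rank columns, then sum across columns
vec_sums = hands.iloc[1::2, [1, 3, 5, 7, 9]].sum(axis=1)

print(vec_sums)
```

On a frame the size of the poker data, the sliced version avoids building one Series per row, which is usually where `.iterrows()` spends its time.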

### `.apply()` function in every cell
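As you saw in the lesson, you can use `.apply()` to map a function over a DataFrame. By default `.apply()` passes one column at a time to the function, so an element-wise function such as `np.square()` ends up transforming every cell, regardless of the column or the row.

You're going to try it out on the poker_hands dataset. You will use `.apply()` to square every cell of the DataFrame. The native Python way to square a number n is n**2; the code below uses `np.square()`, which does the same element-wise.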

In [26]:
# Define the lambda transformation
get_square = lambda x: np.square(x)

# Apply the transformation
data_sum = poker_hands.apply(get_square)
print(data_sum.head())

   S1   R1  S2   R2  S3   R3  S4   R4  S5   R5  Class
0   1  100   1  121   1  169   1  144   1    1     81
1   4  121   4  169   4  100   4  144   4    1     81
2   9  144   9  121   9  169   9  100   9    1     81
3  16  100  16  121  16    1  16  169  16  144     81
4  16    1  16  169  16  144  16  121  16  100     81
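
For an element-wise operation like squaring, `.apply()` is not strictly needed. A short sketch on a toy frame (the values are made up) comparing it with two direct alternatives:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'S1': [1, 2, 3], 'R1': [10, 11, 12]})  # toy data

via_apply = df.apply(lambda col: np.square(col))  # function called once per column
via_ufunc = np.square(df)                         # NumPy ufunc on the whole frame
via_operator = df ** 2                            # plain arithmetic operator

print(via_operator)
```

All three give the same values; the last two skip the per-column Python call.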


### `.apply()` for rows iteration
`.apply()` is very useful for iterating through the rows or columns of a DataFrame and applying a specific function.

You will work on a subset of the poker_hands dataset, which includes only the rank of all the five cards of each hand in each row. You're going to get the variance of every hand across its ranks (`axis=1`) and of every rank across all hands (`axis=0`); the cell below shows the `axis=0` case.

In [27]:
poker = poker_hand.copy()
get_variance = lambda x: np.var(x)

# Apply the transformation
data_tr = poker[['R1', 'R2', 'R3', 'R4', 'R5']].apply(get_variance, axis=0)
print(data_tr.head())

R1    14.060473
R2    14.189523
R3    14.024270
R4    14.040552
R5    13.998851
dtype: float64
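
To make the `axis` argument concrete, here is a small sketch using the rank columns of the first two hands from the `poker_hand.head()` output. It calls the pandas `Series.var()` method inside the lambda (sample variance, `ddof=1`) rather than `np.var`, which is a substitution made for this illustration.

```python
import pandas as pd

# Rank columns of the first two hands shown in poker_hand.head()
ranks = pd.DataFrame({'R1': [10, 11], 'R2': [11, 13], 'R3': [13, 10],
                      'R4': [12, 12], 'R5': [1, 1]})

# axis=0: the function receives each column, giving one value per rank
per_rank = ranks.apply(lambda col: col.var(), axis=0)

# axis=1: the function receives each row, giving one value per hand
per_hand = ranks.apply(lambda row: row.var(), axis=1)

print(per_rank)
print(per_hand)
```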


### pandas vectorization in action
pandas performs more efficiently when an operation is performed on a whole array than on each value separately or sequentially. Vectorization is the process of executing operations on entire arrays.

In this exercise, you will apply vectorization over pandas Series to:

1. calculate the mean rank of all the cards in each hand (row)
2. calculate the mean rank of each of the 5 cards across all hands (column)

You will use the poker_hands dataset once again to compare both methods' efficiency.

In [29]:
# Calculate the mean rank in each hand
row_start_time = time.time()
mean_r = poker[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=1)
print("Time using pandas vectorization for rows: {} sec".format(time.time() - row_start_time))
print(mean_r.head())

# Calculate the mean rank of each of the 5 cards in all hands
col_start_time = time.time()
mean_c = poker[['R1', 'R2', 'R3', 'R4', 'R5']].mean(axis=0)
print("Time using pandas vectorization for columns: {} sec".format(time.time() - col_start_time))
print(mean_c.head())

Time using pandas vectorization for rows: 0.0020020008087158203 sec
0    9.4
1    9.4
2    9.4
3    9.4
4    9.4
dtype: float64
Time using pandas vectorization for columns: 0.0014743804931640625 sec
R1    6.995242
R2    7.014194
R3    7.014154
R4    6.942463
R5    6.962735
dtype: float64


Just as pandas works with Series, NumPy works with arrays called `ndarrays`. The major difference is that `ndarrays` leave out much of the overhead that Series carry, such as index handling and data type checking. As a result, operations on NumPy arrays can be significantly faster than operations on pandas Series. NumPy arrays can be used in place of pandas Series when the additional functionality offered by pandas Series isn't critical.

### Vectorization methods for looping a DataFrame
Now that you're familiar with vectorization in pandas and NumPy, you're going to compare their respective performances yourself.

Your task is to calculate the variance of the ranks in each hand using vectorization over pandas Series, and then modify your code to use vectorization over NumPy ndarrays. The cell below shows the NumPy version.

In [30]:
# Calculate the variance in each hand
start_time = time.time()
poker_var = poker[['R1', 'R2', 'R3', 'R4', 'R5']].values.var(axis =1, ddof=1)
print("Time using NumPy vectorization: {} sec".format(time.time() - start_time))
print(poker_var[0:5])

Time using NumPy vectorization: 0.003055572509765625 sec
[23.3 23.3 23.3 23.3 23.3]
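
The `ddof=1` argument in the NumPy call matters: pandas computes the sample variance (`ddof=1`) by default, while NumPy defaults to the population variance (`ddof=0`). A sketch on two toy hands, using `.to_numpy()` as the equivalent of `.values`:

```python
import numpy as np
import pandas as pd

# Two toy hands holding the same five ranks
ranks = pd.DataFrame([[10, 11, 13, 12, 1], [1, 13, 12, 11, 10]],
                     columns=['R1', 'R2', 'R3', 'R4', 'R5'])

pandas_var = ranks.var(axis=1)                     # pandas default: ddof=1
numpy_var = ranks.to_numpy().var(axis=1, ddof=1)   # match pandas explicitly
numpy_default = ranks.to_numpy().var(axis=1)       # NumPy default: ddof=0

print(pandas_var.to_numpy(), numpy_var, numpy_default)
```

With `ddof=1` both libraries agree on 23.3 for these hands; leaving `ddof` at NumPy's default gives a smaller value because it divides by 5 instead of 4.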


# 4. TRANSFORM AND GROUP BY

### The min-max normalization using .transform()
A very common operation is the min-max normalization. It consists of rescaling the value of interest by subtracting the minimum value and dividing the result by the difference between the maximum and the minimum value. For example, to rescale students' weight data spanning from 160 pounds to 200 pounds, you subtract 160 from each student's weight and divide the result by 40 (200 - 160).

You're going to define and apply the min-max normalization to all the numerical variables in the restaurant data. You will first group the entries by the time the meal took place (Lunch or Dinner) and then apply the normalization to each group separately.


In [33]:
# Define the min-max transformation
min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# Group the data according to the time, keeping only the numeric columns
restaurant_grouped = restaurant_data.groupby('time')[['total_bill', 'tip', 'size']]

# Apply the transformation
restaurant_min_max_group = restaurant_grouped.transform(min_max_tr)
print(restaurant_min_max_group.head())

   total_bill       tip  size
0    0.291579  0.001111   0.2
1    0.152283  0.073333   0.4
2    0.375786  0.277778   0.4
3    0.431713  0.256667   0.2
4    0.450775  0.290000   0.6
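
A small worked example of the group-wise min-max normalization, on made-up bills, shows that each group is rescaled against its own minimum and maximum:

```python
import pandas as pd

# Toy data: two Lunch bills and three Dinner bills
df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'total_bill': [10.0, 20.0, 10.0, 30.0, 50.0],
})

min_max_tr = lambda x: (x - x.min()) / (x.max() - x.min())

# transform() returns a result aligned with the original rows
scaled = df.groupby('time')['total_bill'].transform(min_max_tr)

print(scaled.tolist())
```

The 10.0 bill maps to 0 in both groups, while 20.0 is the Lunch maximum (1.0) and 30.0 sits halfway through the Dinner range (0.5).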


### Transforming values to probabilities
In this exercise, we will apply a probability distribution function to a pandas DataFrame with group-related parameters by transforming the tip variable to probabilities.

The transformation will be an exponential transformation. The exponential distribution is defined as

$$\lambda e^{-\lambda x}$$

where λ (lambda) is the mean of the group that the observation x belongs to.

You're going to apply the exponential distribution transformation to the tip left at each table in the dataset, after grouping the data according to the time of the day the meal took place. Remember to use each group's mean for the value of λ.

In Python, you can use the exponential as np.exp() from the NumPy library and the mean value as .mean().

In [34]:
# Define the exponential transformation
exp_tr = lambda x: np.exp((-1*x.mean())*x) * x.mean()

# Group the data according to the time
restaurant_grouped = restaurant_data.groupby('time')

# Apply the transformation
restaurant_exp_group = restaurant_grouped['tip'].transform(exp_tr)
print(restaurant_exp_group.head())

0    0.135141
1    0.017986
2    0.000060
3    0.000108
4    0.000042
Name: tip, dtype: float64
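
To see what `.transform()` does with the group mean, the sketch below repeats the calculation on toy tips (the values are made up) and compares it with a version that first broadcasts each group's mean back to its rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'tip': [1.0, 3.0, 2.0, 4.0],
})

# Same formula as in the exercise: lambda * exp(-lambda * x), lambda = group mean
exp_tr = lambda x: x.mean() * np.exp(-x.mean() * x)
by_transform = df.groupby('time')['tip'].transform(exp_tr)

# Spelled out: broadcast each group's mean to its rows, then apply the formula
lam = df.groupby('time')['tip'].transform('mean')
by_hand = lam * np.exp(-lam * df['tip'])

print(by_transform.tolist())
```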


### Validation of normalization
For this exercise, we will perform a z-score normalization and verify that it was performed correctly.

A distinct characteristic of normalized values is that they have a mean equal to zero and standard deviation equal to one.

After you apply the normalization transformation, you can group again on the same variable, and then check the mean and the standard deviation of each group.

You will apply the normalization transformation to every numeric variable in the poker_grouped dataset, which is the poker_hands dataset grouped by Class.

In [36]:
zscore = lambda x: (x - x.mean()) / x.std()

# Apply the transformation
poker_trans = poker.groupby('Class').transform(zscore)
print(poker_trans.head())

         S1        R1        S2  ...        R4        S5        R5
0 -1.380537  0.270364 -1.380537  ...  0.350823 -1.380537 -0.724286
1 -0.613572  0.495666 -0.613572  ...  0.350823 -0.613572 -0.724286
2  0.153393  0.720969  0.153393  ... -1.403293  0.153393 -0.724286
3  0.920358  0.270364  0.920358  ...  1.227881  0.920358  1.267500
4  0.920358 -1.757363  0.920358  ... -0.526235  0.920358  0.905357

[5 rows x 10 columns]


In [37]:
# Re-group the grouped object and print each group's means and standard deviation
poker_regrouped = poker_trans.groupby(poker_hands['Class'])

print(np.round(poker_regrouped.mean(), 3))
print(poker_regrouped.std())

        S1   R1   S2   R2   S3   R3   S4   R4   S5   R5
Class                                                  
-3    -0.0  0.0  0.0 -0.0 -0.0  0.0  0.0 -0.0  0.0  0.0
-2     0.0 -0.0  0.0 -0.0  0.0  0.0  0.0 -0.0 -0.0  0.0
 0    -0.0  0.0 -0.0 -0.0  0.0 -0.0 -0.0 -0.0 -0.0 -0.0
 3     0.0  0.0  0.0 -0.0 -0.0 -0.0  0.0 -0.0  0.0  0.0
 4    -0.0 -0.0 -0.0 -0.0  0.0 -0.0 -0.0  0.0  0.0  0.0
 5    -0.0 -0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0
 6    -0.0 -0.0 -0.0  0.0  0.0 -0.0  0.0  0.0 -0.0  0.0
 7     0.0 -0.0 -0.0  0.0 -0.0  0.0  0.0 -0.0 -0.0 -0.0
 8    -0.0  0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0 -0.0 -0.0
 9     0.0 -0.0  0.0 -0.0  0.0 -0.0  0.0  0.0  0.0 -0.0
        S1   R1   S2   R2   S3   R3   S4   R4   S5   R5
Class                                                  
-3     1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
-2     1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 0     1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 3     1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1
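
The same validation can be sketched on a tiny made-up frame, which makes it easy to see why the re-grouped means come out as zero and the standard deviations as one:

```python
import numpy as np
import pandas as pd

# Toy data: two classes with three values each
df = pd.DataFrame({
    'Class': [0, 0, 0, 1, 1, 1],
    'R1': [1.0, 5.0, 9.0, 2.0, 4.0, 12.0],
})

zscore = lambda x: (x - x.mean()) / x.std()

# Normalize within each class, then re-group on the original labels to check
z = df.groupby('Class').transform(zscore)
check = z.groupby(df['Class'])['R1'].agg(['mean', 'std'])

print(check.round(3))
```

Means may print as `-0.0` because of floating-point rounding, as in the output above.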

### Identifying missing values
The first step before missing value imputation is to identify whether there are missing values in our data, and if so, from which group they arise.

For the same restaurant_data data you encountered in the lesson, an employee erased by mistake the tips left at 65 tables. The question at stake is how many missing entries came from tables where smokers were present vs tables with no smokers present.

Your task is to group both datasets according to the smoker variable, count the number of present values and then calculate the difference.

We're imputing tips to get you to practice the concepts taught in the lesson. From an ethical standpoint, you should not i

#### Group both objects according to smoke condition
`restaurant_nan_grouped = restaurant_nan.groupby('smoker')`

#### Store the number of present values
`restaurant_nan_nval = restaurant_nan_grouped['tip'].count()`

#### Print the group-wise missing entries
`print(restaurant_nan_grouped['total_bill'].count() - restaurant_nan_nval)`
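
Since `restaurant_nan` is not loaded in this notebook, here is a sketch of the same count on a made-up frame (`restaurant_nan_toy` is a hypothetical stand-in), together with a more direct alternative based on `.isna()`:

```python
import numpy as np
import pandas as pd

# Toy stand-in: three of the five tips are missing
restaurant_nan_toy = pd.DataFrame({
    'smoker': ['Yes', 'Yes', 'No', 'No', 'No'],
    'total_bill': [10.0, 12.0, 20.0, 22.0, 24.0],
    'tip': [1.0, np.nan, np.nan, np.nan, 3.0],
})

grouped = restaurant_nan_toy.groupby('smoker')

# count() skips NaN, so the difference is the number of missing tips per group
missing_by_diff = grouped['total_bill'].count() - grouped['tip'].count()

# Alternative: flag missing tips and sum the flags within each group
missing_direct = (restaurant_nan_toy['tip'].isna()
                  .groupby(restaurant_nan_toy['smoker']).sum())

print(missing_by_diff)
```

The difference-of-counts approach assumes `total_bill` itself has no missing entries; the `.isna()` version does not rely on a second column.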

### Missing value imputation
As the majority of real-world data contains missing entries, replacing these entries with sensible values can increase the insight you get from your data.

In the restaurant dataset, the "total_bill" column has some missing entries, meaning that you have not recorded how much some tables have paid. Your task in this exercise is to replace the missing entries with the median value of the amount paid, according to whether the entry was recorded on lunch or dinner (time variable).

In [40]:
# Define the lambda function
missing_trans = lambda x: x.fillna(x.median())

# Group the data according to time, keeping only the numeric columns
restaurant_grouped = restaurant_data.groupby('time')[['total_bill', 'tip', 'size']]

# Apply the transformation
restaurant_impute = restaurant_grouped.transform(missing_trans)
print(restaurant_impute.head())

   total_bill   tip  size
0       16.99  1.01     2
1       10.34  1.66     3
2       21.01  3.50     3
3       23.68  3.31     2
4       24.59  3.61     4
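
The `restaurant_data` loaded here shows no gaps in its first rows, so the effect of the imputation is easier to see on a toy frame with deliberately missing bills (the values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'total_bill': [10.0, np.nan, 14.0, 20.0, 30.0, np.nan],
})

missing_trans = lambda x: x.fillna(x.median())

# Each NaN is filled with the median of its own time group
imputed = df.groupby('time')['total_bill'].transform(missing_trans)

print(imputed.tolist())
```

The Lunch gap is filled with 12.0 (median of 10 and 14) and the Dinner gap with 25.0 (median of 20 and 30).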


### Data filtration
As you noticed in the video lesson, you may need to filter your data for various reasons.

In this exercise, you will use filtering to select a specific part of our DataFrame:

1. by the number of entries recorded in each day of the week
2. by the mean amount of money the customers paid to the restaurant each day of the week

In [41]:
# Filter the days where the count of total_bill is greater than 40
total_bill_40 = restaurant_data.groupby('day').filter(lambda x: x['total_bill'].count() > 40)

# Select only the entries that have a mean total_bill greater than $20
total_bill_20 = total_bill_40.groupby('day').filter(lambda x : x['total_bill'].mean() > 20)

# Print days of the week that have a mean total_bill greater than $20
print('Days of the week that have a mean total_bill greater than $20:', total_bill_20.day.unique())

Days of the week that have a mean total_bill greater than $20: ['Sun' 'Sat']
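
A toy version of the two-step filter makes the behaviour of `.filter()` visible: the lambda receives each group as a DataFrame and returns a single True/False that keeps or drops the whole group. The data and the threshold of 2 entries are assumptions chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Sat', 'Sat', 'Sat', 'Sun'],
    'total_bill': [10.0, 12.0, 25.0, 30.0, 20.0, 40.0],
})

# Step 1: keep only days with at least 2 recorded bills (drops Sun)
busy = df.groupby('day').filter(lambda g: g['total_bill'].count() >= 2)

# Step 2: of those, keep days whose mean bill exceeds 20 (drops Thur)
busy_and_pricey = busy.groupby('day').filter(lambda g: g['total_bill'].mean() > 20)

print(busy_and_pricey['day'].unique())
```

Note that Sun has the highest bill in the toy data but is removed in step 1, so the order of the two filters matters.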
