# Useful Methods

Let's cover some useful methods and functions built into pandas. This is only a small sample of what pandas offers, but these are some of the most commonly used.
The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) is a great resource for exploring more methods and functions.

<hr>

Here is a list of the functions and methods we'll cover (click on one to jump to that section in this notebook):

* [apply() method](#apply_method)
* [apply() with a function](#apply_function)
* [apply() with a lambda expression](#apply_lambda)
* [apply() on multiple columns](#apply_multiple)
* [describe()](#describe)
* [sort_values()](#sort)
* [corr()](#corr)
* [idxmin and idxmax](#idx)
* [value_counts](#v_c)
* [replace](#replace)
* [unique and nunique](#uni)
* [map](#map)
* [duplicated and drop_duplicates](#dup)
* [between](#bet)
* [sample](#sample)
* [nlargest](#n)

<a id='apply_method'></a>

## The .apply() method

Here we will learn about a very useful method known as **apply**. It lets us apply a custom function to every value in a DataFrame column.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('assets/tips.csv')

In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


<a id='apply_function'></a>
### apply with a function

In [4]:
# view structure of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   total_bill        244 non-null    float64
 1   tip               244 non-null    float64
 2   sex               244 non-null    object 
 3   smoker            244 non-null    object 
 4   day               244 non-null    object 
 5   time              244 non-null    object 
 6   size              244 non-null    int64  
 7   price_per_person  244 non-null    float64
 8   Payer Name        244 non-null    object 
 9   CC Number         244 non-null    int64  
 10  Payment ID        244 non-null    object 
dtypes: float64(3), int64(2), object(6)
memory usage: 21.1+ KB


In [5]:
def last_four(num):
    return str(num)[-4:]

In [6]:
last_four(df['CC Number'][0])

'3410'

In [7]:
df['last_four'] = df['CC Number'].apply(last_four)

In [8]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221


### Using .apply() with more complex functions

In [9]:
df['total_bill'].mean()

19.785942622950824

In [10]:
def yelp(price):
    if price < 10:
        return '$'
    elif price >= 10 and price < 30:
        return '$$'
    else:
        return '$$$'

In [11]:
df['Expensive'] = df['total_bill'].apply(yelp)

<a id='apply_lambda'></a>
### apply with lambda

In [12]:
lambda num: num*2

<function __main__.<lambda>(num)>

In [13]:
df['total_bill'].apply(lambda bill:bill*0.18)

0      3.0582
1      1.8612
2      3.7818
3      4.2624
4      4.4262
        ...  
239    5.2254
240    4.8924
241    4.0806
242    3.2076
243    3.3804
Name: total_bill, Length: 244, dtype: float64
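For simple arithmetic like this, the lambda is not strictly needed, because a pandas Series supports element-wise arithmetic directly. Here is a minimal sketch on a small hand-made Series; the values and the name `bills` are stand-ins for illustration, not the tips dataset itself:

```python
import pandas as pd

# Hypothetical stand-in for df['total_bill']
bills = pd.Series([16.99, 10.34, 21.01], name='total_bill')

# Element-wise via apply and a lambda (one Python function call per value)
tips_apply = bills.apply(lambda bill: bill * 0.18)

# Same calculation via direct Series arithmetic
tips_direct = bills * 0.18

# Largest absolute difference between the two results
print((tips_apply - tips_direct).abs().max())
```

apply becomes useful when the logic cannot be written as plain arithmetic, as with the `yelp` function above.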

<a id='apply_multiple'></a>
## apply that uses multiple columns

Note, there are several ways to do this:

https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column

In [14]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,Expensive
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$


In [48]:
i=0
def quality(total_bill,tip):
    # Debug aid: print the type and value of `tip` on the first call only
    global i
    if i==0:
        print(type(tip))
        print(tip)
        i=i+1
    # ----------------------------------------------
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"

In [54]:
print(type(df[['total_bill','tip']].apply(lambda row: quality(row['total_bill'],row['tip']),axis=1)))
print(df[['total_bill','tip']].apply(lambda row: quality(row['total_bill'],row['tip']),axis=1))

<class 'pandas.core.series.Series'>
0      Other
1      Other
2      Other
3      Other
4      Other
       ...  
239    Other
240    Other
241    Other
242    Other
243    Other
Length: 244, dtype: object


In [50]:
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda row: quality(row['total_bill'],row['tip']),axis=1)

In [26]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,Tip Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,Other


### Let's take a deep dive into how the 'apply' method works with multiple columns.

#### Why a lambda function?

In [40]:
def quality2(total_bill,tip):
    print("total : \n","type -> ", type(total_bill),"\n", total_bill)
   
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"

In [41]:
df['Tip Quality'] = df[['total_bill','tip']].apply(quality2(df['total_bill'],df['tip']),axis=1)

total : 
 type ->  <class 'pandas.core.series.Series'> 
 0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The error "ValueError: The truth value of a Series is ambiguous" is caused by the way quality2 is passed to apply. **Let's break it down in detail.**
When using apply with axis=1, the argument must be a function that takes a single row of the DataFrame. Writing quality2(df['total_bill'], df['tip']) does not pass a function: Python evaluates that call immediately, before apply runs, and would hand apply whatever the call returns.

Here is how the error occurs:

- Python evaluates quality2(df['total_bill'], df['tip']) first, passing two entire columns as Series (the printed output above shows that total_bill is a Series of length 244).
- Inside the function, tip/total_bill > 0.25 produces a boolean Series with one True/False value per row.
- The if statement needs a single True or False, so pandas raises the ValueError. apply is never reached.

Wrapping the call in a lambda fixes this: the lambda is itself a function, so apply can call it once per row, and quality2 then receives single values instead of entire Series.
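The failure can be reproduced without apply at all. Below is a minimal sketch on a two-row stand-in DataFrame; `df_small` and its values are assumptions for illustration, and `quality` is a trimmed copy of the function above:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tips data
df_small = pd.DataFrame({'total_bill': [16.99, 10.00], 'tip': [1.01, 3.00]})

def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    return "Other"

# Calling the function directly with two whole columns fails on its own,
# with no apply involved: the `if` receives a boolean Series.
error_message = None
try:
    quality(df_small['total_bill'], df_small['tip'])
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

# A lambda defers the call, so apply can invoke it once per row with scalars.
labels = df_small.apply(lambda row: quality(row['total_bill'], row['tip']), axis=1)
print(labels.tolist())
```

The first call raises the same ValueError as the notebook cell, while the lambda version returns one label per row.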


***Another version without using a lambda function***

In [None]:
def quality3(row):
    total_bill = row['total_bill']
    tip = row['tip']
    
    if tip / total_bill > 0.25:
        return "Generous"
    else:
        return "Other"

df['Tip Quality'] = df.apply(quality3, axis=1)

#### Why axis=1?

In [44]:
def quality4(total_bill,tip):
    print("total : \n","type -> ", type(total_bill),"\n", total_bill)
   
    if tip/total_bill  > 0.25:
        return "Generous"
    else:
        return "Other"

In [45]:
df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality4(df['total_bill'],df['tip']))

KeyError: 'total_bill'

The KeyError above occurs because apply defaults to axis=0. The lambda then receives each column in turn, and a column Series is indexed by the row labels (0 to 243), so looking up 'total_bill' in it fails.

When using the **apply** method on a DataFrame, **the axis** parameter specifies whether the function is applied to each column (axis=0, the default) or to each row (axis=1). In our code, axis=1 is used because we want to apply the function to each row of the DataFrame.

Here's a breakdown:

- **axis=0:** Apply the function to each column. The function receives a Series representing a column.
- **axis=1:** Apply the function to each row. The function receives a Series representing a row.

In our case, we want to apply the function to each row because 'Tip Quality' is calculated from both the 'total_bill' and 'tip' values of that row. Therefore, axis=1 is the correct choice to ensure the function is applied along the rows of the DataFrame.
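The difference between the two axis values can be seen by inspecting what the function actually receives. This is a minimal sketch on a two-row stand-in DataFrame; the values are hypothetical and `describe_input` is an illustrative helper, not part of pandas:

```python
import pandas as pd

# Hypothetical two-row stand-in for the tips data
df_small = pd.DataFrame({'total_bill': [16.99, 10.00], 'tip': [1.01, 3.00]})

def describe_input(s):
    # Report the labels available inside the Series that apply passes in
    return ', '.join(map(str, s.index))

# axis=0 (default): called once per column; each Series is indexed by row labels
per_column = df_small.apply(describe_input)

# axis=1: called once per row; each Series is indexed by column names
per_row = df_small.apply(describe_input, axis=1)

print(per_column)
print(per_row)
```

Only under axis=1 do the labels 'total_bill' and 'tip' exist inside the Series, which is why a lookup such as `row['total_bill']` works there and raises a KeyError under the default axis=0.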

### vectorize

***vectorize*** in NumPy wraps a function written for scalars so that it accepts arrays: the wrapper applies NumPy's broadcasting rules to the inputs and calls the original function once per element. Broadcasting, in general, is the mechanism that lets NumPy perform operations on arrays of different shapes and sizes. Note that the NumPy documentation describes vectorize as provided primarily for convenience, not for performance; its implementation is essentially a for loop.

In [58]:
print(type(np.vectorize(quality)))

<class 'numpy.vectorize'>


In [59]:
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])

In [60]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,last_four,Expensive,Tip Quality
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,3410,$$,Other
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,9230,$$,Other
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,1322,$$,Other
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,5994,$$,Other
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,7221,$$,Other


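So, which one is faster?

For this kind of task, np.vectorize() is typically much faster than apply with axis=1, largely because row-wise apply builds a Series object for every row. Keep **np.vectorize()** in mind for the future.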
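To check that the approaches agree, and to show a fully column-wise alternative, here is a minimal sketch on a three-row stand-in DataFrame with hypothetical values. The `np.where` variant and the `otypes=[object]` argument are additions for illustration and are not part of the cells above:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the tips data
df_small = pd.DataFrame({'total_bill': [16.99, 10.00, 20.00],
                         'tip': [1.01, 3.00, 5.00]})

def quality(total_bill, tip):
    if tip / total_bill > 0.25:
        return "Generous"
    return "Other"

# Row-wise apply: one Python call per row, each row passed as a Series
via_apply = df_small.apply(lambda row: quality(row['total_bill'], row['tip']), axis=1)

# np.vectorize: still one Python call per element, but on plain scalars.
# otypes=[object] fixes the output dtype instead of inferring it from the first result.
via_vectorize = np.vectorize(quality, otypes=[object])(df_small['total_bill'], df_small['tip'])

# np.where: the comparison runs on whole columns, with no per-row Python call
via_where = np.where(df_small['tip'] / df_small['total_bill'] > 0.25, "Generous", "Other")

print(list(via_apply), list(via_vectorize), list(via_where))
```

In a notebook, `%timeit` on each of the three lines against the full dataset is one way to compare their speed. The `np.where` form only works when the rule can be expressed as array operations, whereas apply and np.vectorize accept any Python function.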

Full Details:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html