[Pandas Performance Optimization](https://www.w3resource.com/python-exercises/pandas/python-pandas-performance-optimization.php)

In [1]:
import pandas as pd
import numpy as np
import time

pd.set_option('display.max_columns', None)

<div class="alert alert-warning">

**1. Large DataFrame Sum Performance**

**Write a Pandas program to create a large DataFrame and measure the time taken to sum a column using a for loop vs. using the sum method.**

</div>

In [2]:
# Create a large DataFrame with random integers
np.random.seed(0)  # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1))  # Generate random data
df = pd.DataFrame(data, columns=['Values'])  # Create a DataFrame
df

| | Values |
|---|---|
| 0 | 45 |
| 1 | 48 |
| 2 | 65 |
| 3 | 68 |
| 4 | 68 |
| ... | ... |
| 999995 | 70 |
| 999996 | 41 |
| 999997 | 93 |
| 999998 | 62 |


<div class="alert alert-success">

**Solution 01:**
</div>

In [3]:
# 1. Large DataFrame Sum Performance
# Write a Pandas program to create a large DataFrame and measure the time taken to sum a column using a for loop vs. using the sum method.

# Measure the time taken to sum the column using a for loop
start_time = time.time()  # Record the start time
sum_for_loop = 0  # Initialize the sum variable
for value in df['Values']:  # Iterate through each value in the column
    sum_for_loop += value  # Add the value to the sum variable
time_for_loop = time.time() - start_time  # Calculate the time taken

# Measure the time taken to sum the column using the sum method
start_time = time.time()  # Record the start time
sum_method = df['Values'].sum()  # Use the sum method to calculate the sum
time_sum_method = time.time() - start_time  # Calculate the time taken

# Print the results
print("Sum using for loop:", sum_for_loop)
print("Time taken using for loop:", time_for_loop, "seconds")
print("Sum using sum method:", sum_method)
print("Time taken using sum method:", time_sum_method, "seconds")

Sum using for loop: 49988718
Time taken using for loop: 0.061686038970947266 seconds
Sum using sum method: 49988718
Time taken using sum method: 0.0005431175231933594 seconds
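A single `time.time()` measurement can be noisy, especially for operations that finish in under a millisecond. As a hedged alternative, the sketch below repeats each measurement with the standard-library `timeit` module and keeps the fastest run. The names `df_small`, `sum_with_loop`, and `sum_with_method` are illustrative and not part of the original exercise, and the DataFrame is smaller than the one above so the sketch runs quickly.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: same shape as the exercise, but only 100,000 rows
rng = np.random.default_rng(0)
df_small = pd.DataFrame({"Values": rng.integers(1, 100, size=100_000)})


def sum_with_loop(frame):
    """Sum the 'Values' column one element at a time."""
    total = 0
    for value in frame["Values"]:
        total += value
    return total


def sum_with_method(frame):
    """Sum the 'Values' column with the vectorized sum method."""
    return frame["Values"].sum()


# timeit.repeat returns one total time per run; the minimum is the least noisy
loop_time = min(timeit.repeat(lambda: sum_with_loop(df_small), number=1, repeat=5))
method_time = min(timeit.repeat(lambda: sum_with_method(df_small), number=1, repeat=5))

print(f"for loop:   {loop_time:.6f} seconds")
print(f"sum method: {method_time:.6f} seconds")
```

In a notebook, the `%timeit` magic gives a similar repeated measurement with less code.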


<div class="alert alert-warning">

**2. Custom Function: Apply vs. Vectorized Operations**

**Write a Pandas program to compare the performance of applying a custom function to a column using apply vs. using vectorized operations.**

</div>

In [4]:
# Create a large DataFrame with random integers
np.random.seed(0)  # Set seed for reproducibility
data = np.random.randint(1, 100, size=(1000000, 1))  # Generate random data
df = pd.DataFrame(data, columns=['Values'])  # Create a DataFrame
df

| | Values |
|---|---|
| 0 | 45 |
| 1 | 48 |
| 2 | 65 |
| 3 | 68 |
| 4 | 68 |
| ... | ... |
| 999995 | 70 |
| 999996 | 41 |
| 999997 | 93 |
| 999998 | 62 |


<div class="alert alert-success">

**Solution 02:**
</div>

In [5]:
# 2. Custom Function: Apply vs. Vectorized Operations
# Write a Pandas program to compare the performance of applying a custom function to a column using apply vs. using vectorized operations.

# Define a custom function to apply
def custom_function(x):
    return x * 2 + 3

# Measure the time taken to apply the custom function using apply
start_time = time.time()  # Record the start time
df['Apply_Result'] = df['Values'].apply(custom_function)  # Apply the custom function using apply
time_apply = time.time() - start_time  # Calculate the time taken

# Measure the time taken to apply the custom function using vectorized operations
start_time = time.time()  # Record the start time
df['Vectorized_Result'] = custom_function(df['Values'])  # Apply the custom function using vectorized operations
time_vectorized = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using apply:", time_apply, "seconds")
print("Time taken using vectorized operations:", time_vectorized, "seconds")

Time taken using apply: 0.11264896392822266 seconds
Time taken using vectorized operations: 0.002717256546020508 seconds
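The timing comparison only makes sense if both approaches produce the same values. A minimal sketch of that check on a tiny, illustrative Series follows; `values`, `applied`, and `vectorized` are names chosen for this example.

```python
import numpy as np
import pandas as pd


def custom_function(x):
    return x * 2 + 3


# Small illustrative Series; int64 is set explicitly so the dtype is the same on every platform
values = pd.Series(np.arange(1, 6, dtype="int64"), name="Values")

# apply calls custom_function once per element
applied = values.apply(custom_function)

# Passing the whole Series runs the arithmetic on all elements in one call
vectorized = custom_function(values)

# Element-wise comparison of the two results
print((applied == vectorized).all())
```

This works because `custom_function` only uses arithmetic operators, which a Series supports directly. A function containing an `if` on its argument would need to be rewritten (for example with `np.where`) before it could be called on a whole Series.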


<div class="alert alert-warning">

**3. Optimize Memory Usage When Loading CSV**

**Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.**

</div>

In [6]:
path = '../data/'
file_name = 'titanic.csv'

<div class="alert alert-success">

**Solution 03:**
</div>

In [7]:
# 3. Optimize Memory Usage When Loading CSV
# Write a Pandas program that loads a large CSV file into a DataFrame and optimizes memory usage by specifying appropriate data types.


df = pd.read_csv(path+file_name)
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null    int64  
 2   sex          891 non-null    object 
 3   age          714 non-null    float64
 4   sibsp        891 non-null    int64  
 5   parch        891 non-null    int64  
 6   fare         891 non-null    float64
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
 15  Unnamed: 15  0 non-null      float64
dtypes: bool(2), float64(3), int64(4), object(7)
memory usage: 361.8 KB


In [8]:
dtype_dict = {
    'survived': 'int32', 
    'pclass': 'int32', 
    'sex': 'object', 
    'age': 'float32', 
    'sibsp': 'int32', 
    'parch': 'int32', 
    'fare': 'float32', 
    'embarked': 'object', 
    'class': 'object', 
    'who': 'object', 
    'adult_male': 'bool', 
    'deck': 'object', 
    'embark_town': 'object', 
    'alive': 'object', 
    'alone': 'bool', 
    'Unnamed: 15': 'float32'
}

df = pd.read_csv(path+file_name, dtype=dtype_dict)
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int32  
 1   pclass       891 non-null    int32  
 2   sex          891 non-null    object 
 3   age          714 non-null    float32
 4   sibsp        891 non-null    int32  
 5   parch        891 non-null    int32  
 6   fare         891 non-null    float32
 7   embarked     889 non-null    object 
 8   class        891 non-null    object 
 9   who          891 non-null    object 
 10  adult_male   891 non-null    bool   
 11  deck         203 non-null    object 
 12  embark_town  889 non-null    object 
 13  alive        891 non-null    object 
 14  alone        891 non-null    bool   
 15  Unnamed: 15  0 non-null      float32
dtypes: bool(2), float32(3), int32(4), object(7)
memory usage: 337.4 KB
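Most of the remaining memory is held by the `object` columns, which `dtype_dict` leaves unchanged. One option is to request the `category` dtype directly in `read_csv`. The sketch below assumes a small in-memory CSV built with `io.StringIO` as a stand-in for a real file; the column names mirror the Titanic data, but the rows are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file with repetitive string columns (1,000 rows)
csv_text = "sex,embarked,fare\n" + "male,S,7.25\nfemale,C,71.28\n" * 500

# Default parsing: strings stay as generic string/object columns, floats as float64
as_default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: repeated strings become categories, floats become float32
as_category = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"sex": "category", "embarked": "category", "fare": "float32"},
)

mem_default = as_default.memory_usage(deep=True).sum()
mem_category = as_category.memory_usage(deep=True).sum()

print(f"default dtypes:  {mem_default} bytes")
print(f"explicit dtypes: {mem_category} bytes")
```

Specifying the dtypes at load time avoids building the larger intermediate DataFrame first, which matters more as the file grows.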


In [9]:
def optimize_dataframe(df, convert_obj_to_category=True, verbose=True):
    """
    Downcast numeric columns and convert objects to category where possible.

    Columns are reassigned on the DataFrame that is passed in, so the
    caller's object is modified; the same DataFrame is also returned.

    Parameters
    ----------
    df : pd.DataFrame
        The dataframe to optimize.
    convert_obj_to_category : bool, optional (default=True)
        Whether to convert object/string columns to category if beneficial.
    verbose : bool, optional (default=True)
        Whether to print memory usage before and after optimization.

    Returns
    -------
    df : pd.DataFrame
        Optimized dataframe.
    """

    start_mem = df.memory_usage(deep=True).sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtype

        # Use the pandas type checks: they also accept extension dtypes such as
        # 'category', for which np.issubdtype can raise a TypeError
        if pd.api.types.is_integer_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        elif pd.api.types.is_float_dtype(col_type):
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif convert_obj_to_category and col_type == 'object' and len(df) > 0:
            num_unique_values = df[col].nunique()
            num_total_values = len(df[col])

            # Convert to category only if fewer than half of the values are unique
            if num_unique_values / num_total_values < 0.5:
                df[col] = df[col].astype('category')

    end_mem = df.memory_usage(deep=True).sum() / 1024**2

    if verbose:
        print(f"Memory usage before optimization: {start_mem:.2f} MB")
        print(f"Memory usage after optimization:  {end_mem:.2f} MB")
        print(f"Reduced by {100 * (start_mem - end_mem) / start_mem:.1f}%")

    return df

In [10]:
# Load normally
df = pd.read_csv(path+file_name)

# Optimize in place
df = optimize_dataframe(df)

Memory usage before optimization: 0.35 MB
Memory usage after optimization:  0.02 MB
Reduced by 93.4%
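The numeric part of `optimize_dataframe` relies on the `downcast` argument of `pd.to_numeric`, which picks the smallest dtype that can hold the values. A minimal sketch on two illustrative Series (the names `ints` and `floats` are chosen for this example):

```python
import pandas as pd

# Illustrative inputs using the default 64-bit dtypes
ints = pd.Series([0, 1, 2, 3], dtype="int64")
floats = pd.Series([0.5, 1.5, 2.5], dtype="float64")

# downcast='integer' chooses the smallest signed integer dtype that fits the values
small_ints = pd.to_numeric(ints, downcast="integer")

# downcast='float' chooses float32 when the values can be represented in it
small_floats = pd.to_numeric(floats, downcast="float")

print(ints.dtype, "->", small_ints.dtype)      # int64 -> int8
print(floats.dtype, "->", small_floats.dtype)  # float64 -> float32
```

Note that `float32` carries roughly seven significant decimal digits, so downcasting floats trades precision for memory; whether that is acceptable depends on the data.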


<div class="alert alert-warning">

**4. Data Type Conversion with astype**

**Write a Pandas program that uses the "astype" method to convert the data types of a DataFrame and measures the reduction in memory usage.**

</div>

In [11]:
# Create a sample DataFrame with mixed data types
np.random.seed(0)  # Set seed for reproducibility
data = {
    'int_col': np.random.randint(0, 100, size=100000),
    'float_col': np.random.random(size=100000) * 100,
    'category_col': np.random.choice(['A', 'B', 'C'], size=100000),
    'object_col': np.random.choice(['foo', 'bar', 'baz'], size=100000)
}
df = pd.DataFrame(data)
df

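| | int_col | float_col | category_col | object_col |
|---|---|---|---|---|
| 0 | 44 | 58.934278 | A | foo |
| 1 | 47 | 41.351199 | C | bar |
| 2 | 64 | 45.193166 | B | baz |
| 3 | 67 | 80.839972 | A | bar |
| 4 | 67 | 21.803010 | B | baz |
| ... | ... | ... | ... | ... |
| 99995 | 43 | 54.791847 | A | bar |
| 99996 | 73 | 21.779164 | A | foo |
| 99997 | 20 | 54.611310 | B | foo |
| 99998 | 71 | 93.732822 | A | foo |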


<div class="alert alert-success">

**Solution 04:**
</div>

In [12]:
# 4. Data Type Conversion with astype
# Write a Pandas program that uses the "astype" method to convert the data types of a DataFrame and measures the reduction in memory usage.


# Print memory usage before optimization
# (df.info() prints its report directly and returns None, so it is not wrapped in print())
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert data types using astype method
df['int_col'] = df['int_col'].astype('int16')
df['float_col'] = df['float_col'].astype('float32')
df['category_col'] = df['category_col'].astype('category')
df['object_col'] = df['object_col'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   int_col       100000 non-null  int64  
 1   float_col     100000 non-null  float64
 2   category_col  100000 non-null  object 
 3   object_col    100000 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 11.3 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   int_col       100000 non-null  int16   
 1   float_col     100000 non-null  float32 
 2   category_col  100000 non-null  category
 3   object_col    100000 non-null  category
dtypes: category(2), float32(1), int16(1)
memory usage: 781.9 KB
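`df.info()` reports only the total, so it does not show which column contributed the savings. One way to measure the reduction per column is `memory_usage(deep=True)`, sketched below on a smaller illustrative DataFrame; the names `frame`, `converted`, and `report` are assumptions for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same kinds of columns as the exercise, but fewer rows
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "int_col": rng.integers(0, 100, size=10_000),
    "object_col": rng.choice(["foo", "bar", "baz"], size=10_000),
})

# Bytes used per column (plus the index) before conversion
before = frame.memory_usage(deep=True)

# astype accepts a dict, so several columns can be converted in one call
converted = frame.astype({"int_col": "int16", "object_col": "category"})
after = converted.memory_usage(deep=True)

# Side-by-side report with the percentage saved per column
report = pd.DataFrame({"before": before, "after": after})
report["saved_pct"] = 100 * (before - after) / before
print(report)
```

Unlike the column-by-column assignments in the solution, `frame.astype({...})` returns a new DataFrame and leaves the original untouched, which makes the before/after comparison straightforward.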


<div class="alert alert-warning">

**5. Row Filtering: For Loop vs. Boolean Indexing**

**Write a Pandas program to filter rows of a DataFrame based on a condition using a for loop vs. using boolean indexing. Compare performance.**

</div>

In [13]:
# Create a sample DataFrame
np.random.seed(0)  # Set seed for reproducibility
data = {
    'A': np.random.randint(1, 100, size=100000),
    'B': np.random.randint(1, 100, size=100000)
}
df = pd.DataFrame(data)

df

| | A | B |
|---|---|---|
| 0 | 45 | 64 |
| 1 | 48 | 95 |
| 2 | 65 | 85 |
| 3 | 68 | 15 |
| 4 | 68 | 53 |
| ... | ... | ... |
| 99995 | 33 | 8 |
| 99996 | 53 | 12 |
| 99997 | 32 | 47 |
| 99998 | 79 | 56 |


<div class="alert alert-success">

**Solution 05:**
</div>

In [14]:
# 5. Row Filtering: For Loop vs. Boolean Indexing
# Write a Pandas program to filter rows of a DataFrame based on a condition using a for loop vs. using boolean indexing. Compare performance.

# Define the condition
condition = 50

# Filter rows using a for loop
start_time = time.time()  # Record the start time
filtered_rows_loop = []
for index, row in df.iterrows():
    if row['A'] > condition:
        filtered_rows_loop.append(row)
filtered_df_loop = pd.DataFrame(filtered_rows_loop)
time_for_loop = time.time() - start_time  # Calculate the time taken

# Filter rows using boolean indexing
start_time = time.time()  # Record the start time
filtered_df_bool = df[df['A'] > condition]
time_boolean_indexing = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using for loop:", time_for_loop, "seconds")
print("Time taken using boolean indexing:", time_boolean_indexing, "seconds")

Time taken using for loop: 1.0889291763305664 seconds
Time taken using boolean indexing: 0.0004570484161376953 seconds
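Much of the loop's cost comes from `iterrows`, which builds a full Series for every row. When a loop is unavoidable, `itertuples` is usually a lighter option. The sketch below, on a small illustrative DataFrame, filters with `itertuples` and checks that the result matches boolean indexing; `frame`, `threshold`, and `kept_index` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data: same columns as the exercise, but only 1,000 rows
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "A": rng.integers(1, 100, size=1_000),
    "B": rng.integers(1, 100, size=1_000),
})
threshold = 50

# Loop version: itertuples yields lightweight namedtuples instead of a Series per row
kept_index = [row.Index for row in frame.itertuples() if row.A > threshold]
filtered_loop = frame.loc[kept_index]

# Vectorized version: one comparison over the whole column
filtered_bool = frame[frame["A"] > threshold]

# Both approaches should select exactly the same rows
print(filtered_loop.equals(filtered_bool))
```

Boolean indexing remains the preferred approach; the loop is shown only to make the comparison fairer than one based on `iterrows`.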


<div class="alert alert-warning">

**6. GroupBy Aggregation vs. Manual Iteration**

**Write a Pandas program that uses the groupby method to aggregate data and compares performance with manually iterating through the DataFrame.**

</div>

In [15]:
# Create a sample DataFrame
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

df

| | Category | Values |
|---|---|---|
| 0 | A | 74 |
| 1 | D | 54 |
| 2 | B | 45 |
| 3 | A | 96 |
| 4 | D | 94 |
| ... | ... | ... |
| 999995 | D | 10 |
| 999996 | B | 64 |
| 999997 | A | 89 |
| 999998 | A | 58 |


<div class="alert alert-success">

**Solution 06:**
</div>

In [16]:
# 6. GroupBy Aggregation vs. Manual Iteration
# Write a Pandas program that uses the groupby method to aggregate data and compares performance with manually iterating through the DataFrame.

# Define a custom aggregation function
def custom_aggregation(data):
    result = {}
    for category in data['Category'].unique():
        result[category] = data[data['Category'] == category]['Values'].sum()
    return result

# Aggregate data using the groupby method
start_time = time.time()  # Record the start time
groupby_result = df.groupby('Category')['Values'].sum()
time_groupby = time.time() - start_time  # Calculate the time taken

# Aggregate data using manual iteration
start_time = time.time()  # Record the start time
manual_result = custom_aggregation(df)
time_manual = time.time() - start_time  # Calculate the time taken

# Print the results
print("Aggregation result using groupby:")
print(groupby_result)
print("\nTime taken using groupby:", time_groupby, "seconds")

print("\nAggregation result using manual iteration:")
print(manual_result)
print("\nTime taken using manual iteration:", time_manual, "seconds")


Aggregation result using groupby:
Category
A    12541392
B    12440541
C    12477135
D    12502875
Name: Values, dtype: int64

Time taken using groupby: 0.022678136825561523 seconds

Aggregation result using manual iteration:
{'A': np.int64(12541392), 'D': np.int64(12502875), 'B': np.int64(12440541), 'C': np.int64(12477135)}

Time taken using manual iteration: 0.12448310852050781 seconds
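Note that `custom_aggregation` loops over the four categories but still uses a vectorized mask and `sum` inside each iteration, which is why the gap to `groupby` is modest. For contrast, a fully row-by-row accumulation could look like the sketch below. It uses a smaller illustrative DataFrame, and `frame`, `totals`, and `groupby_totals` are names chosen for this example.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Illustrative data: same columns as the exercise, but only 10,000 rows
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Category": rng.choice(["A", "B", "C", "D"], size=10_000),
    "Values": rng.integers(1, 100, size=10_000),
})

# Row-by-row accumulation: one dictionary update per row
totals = defaultdict(int)
for category, value in zip(frame["Category"], frame["Values"]):
    totals[category] += int(value)

# Vectorized equivalent
groupby_totals = frame.groupby("Category")["Values"].sum()

# Both approaches should give the same per-category sums
print(dict(totals) == groupby_totals.to_dict())
```

On the full one-million-row DataFrame, a loop like this would be expected to be considerably slower than either approach timed above, since it performs one Python-level operation per row.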


<div class="alert alert-warning">

**7. Merge Operation: merge() vs. Nested For Loop**

**Write a Pandas program that performs a merge operation on two large DataFrames using the "merge" method. It compares the performance with a nested for loop.**

</div>

In [17]:
# Create two large DataFrames
np.random.seed(0)  # Set seed for reproducibility
data1 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value1': np.random.randint(1, 100, size=1000)
}
data2 = {
    'Key': np.random.randint(1, 1000, size=1000),
    'Value2': np.random.randint(1, 100, size=1000)
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

display(df1)
display(df2)

| | Key | Value1 |
|---|---|---|
| 0 | 685 | 1 |
| 1 | 560 | 67 |
| 2 | 630 | 98 |
| 3 | 193 | 13 |
| 4 | 836 | 16 |
| ... | ... | ... |
| 995 | 51 | 56 |
| 996 | 243 | 11 |
| 997 | 871 | 75 |
| 998 | 21 | 25 |


| | Key | Value2 |
|---|---|---|
| 0 | 44 | 70 |
| 1 | 988 | 50 |
| 2 | 654 | 26 |
| 3 | 975 | 13 |
| 4 | 56 | 53 |
| ... | ... | ... |
| 995 | 575 | 47 |
| 996 | 58 | 46 |
| 997 | 294 | 68 |
| 998 | 942 | 59 |


<div class="alert alert-success">

**Solution 07:**
</div>

In [18]:
# 7. Merge Operation: merge() vs. Nested For Loop
# Write a Pandas program that performs a merge operation on two large DataFrames using the "merge" method. It compares the performance with a nested for loop.


# Perform merge using the merge method
start_time = time.time()  # Record the start time
merged_df = pd.merge(df1, df2, on='Key')
time_merge = time.time() - start_time  # Calculate the time taken

# Perform merge using a nested for loop
start_time = time.time()  # Record the start time
merged_data = []
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if row1['Key'] == row2['Key']:
            merged_data.append({**row1, **row2})
merged_df_loop = pd.DataFrame(merged_data)
time_nested_loop = time.time() - start_time  # Calculate the time taken

# Print the time taken for both methods
print("Time taken using merge method:", time_merge, "seconds")
print("Time taken using nested for loop:", time_nested_loop, "seconds") 


Time taken using merge method: 0.0013489723205566406 seconds
Time taken using nested for loop: 7.091788053512573 seconds
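The nested loop compares every row of `df1` with every row of `df2`, so its cost grows with the product of the two lengths. A common way to avoid that in plain Python is to index one table by key first and then look up matches. The sketch below illustrates the idea on two tiny made-up DataFrames and checks the result against `pd.merge`; it is an illustration of the general technique, not a description of how pandas implements `merge` internally. The names `left`, `right`, and `right_by_key` are chosen for this example.

```python
from collections import defaultdict

import pandas as pd

# Tiny illustrative tables with duplicate and non-matching keys
left = pd.DataFrame({"Key": [1, 2, 2, 3], "Value1": [10, 20, 21, 30]})
right = pd.DataFrame({"Key": [2, 3, 3, 4], "Value2": [200, 300, 301, 400]})

# Pass 1: index the right table by key (one pass over `right`)
right_by_key = defaultdict(list)
for row in right.itertuples(index=False):
    right_by_key[row.Key].append(row.Value2)

# Pass 2: for each left row, look up its matches instead of scanning `right`
rows = [
    {"Key": row.Key, "Value1": row.Value1, "Value2": value2}
    for row in left.itertuples(index=False)
    for value2 in right_by_key.get(row.Key, [])
]
joined_loop = pd.DataFrame(rows)

# Reference result from pandas (inner join by default)
joined_merge = pd.merge(left, right, on="Key")

print(len(joined_loop) == len(joined_merge))
```

Even with this improvement, `pd.merge` is the better choice in practice, since it also handles join types, suffixes, and multiple key columns.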


<div class="alert alert-warning">

**8. Optimize Memory with Categorical Data**

**Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the performance difference.**

</div>

In [19]:
# Create a sample DataFrame with categorical data
np.random.seed(0)  # Set seed for reproducibility
data = {
    'Category': np.random.choice(['A', 'B', 'C', 'D'], size=1000000),
    'Values': np.random.randint(1, 100, size=1000000)
}
df = pd.DataFrame(data)

df

| | Category | Values |
|---|---|---|
| 0 | A | 74 |
| 1 | D | 54 |
| 2 | B | 45 |
| 3 | A | 96 |
| 4 | D | 94 |
| ... | ... | ... |
| 999995 | D | 10 |
| 999996 | B | 64 |
| 999997 | A | 89 |
| 999998 | A | 58 |


<div class="alert alert-success">

**Solution 08:**
</div>

In [20]:
# 8. Optimize Memory with Categorical Data
# Write a Pandas program to create a DataFrame with categorical data and use the category data type to optimize memory usage. Measure the performance difference.

# Print memory usage before optimization
# (df.info() prints its report directly and returns None, so it is not wrapped in print())
print("Memory usage before optimization:")
df.info(memory_usage='deep')

# Convert the 'Category' column to the category data type
df['Category'] = df['Category'].astype('category')

# Print memory usage after optimization
print("\nMemory usage after optimization:")
df.info(memory_usage='deep')

Memory usage before optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype 
---  ------    --------------    ----- 
 0   Category  1000000 non-null  object
 1   Values    1000000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 55.3 MB

Memory usage after optimization:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column    Non-Null Count    Dtype   
---  ------    --------------    -----   
 0   Category  1000000 non-null  category
 1   Values    1000000 non-null  int64   
dtypes: category(1), int64(1)
memory usage: 8.6 MB
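The savings come from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code that points to it. A minimal sketch on an illustrative five-element Series (`labels` and `as_category` are names chosen for this example):

```python
import pandas as pd

# Illustrative labels with repeats
labels = pd.Series(["A", "D", "B", "A", "D"])
as_category = labels.astype("category")

# The distinct labels are stored once, in sorted order
print(as_category.cat.categories.tolist())  # ['A', 'B', 'D']

# Each row stores only an integer code pointing into the categories
print(as_category.cat.codes.tolist())       # [0, 2, 1, 0, 2]

# With few categories, the codes fit in a single byte per row
print(as_category.cat.codes.dtype)          # int8
```

This is why the conversion pays off for columns with few distinct values and many rows, and why it helps little when almost every value is unique.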
