### Handling Large Datasets

61. **Use memory-mapped arrays to handle large datasets that don’t fit into RAM.**

In [1]:
import numpy as np

# Create a memory-mapped array
filename = 'large_dataset.dat'
shape = (10000, 10000)
dtype = np.float64

# Creating a new memory-mapped array
large_array = np.memmap(filename, dtype=dtype, mode='w+', shape=shape)

# Write the data in row blocks so the full array is never built in RAM
for start in range(0, shape[0], 1000):
    large_array[start:start + 1000] = np.random.rand(1000, shape[1])

# Flush changes to disk
large_array.flush()

# Reading data from the memory-mapped array
loaded_array = np.memmap(filename, dtype=dtype, mode='r', shape=shape)
print("Memory-mapped array loaded:")
print(loaded_array)

Memory-mapped array loaded:
[[0.46904986 0.50610463 0.19230354 ... 0.05345926 0.86922024 0.32299666]
 [0.03306956 0.48121221 0.13335201 ... 0.59556979 0.77666386 0.60753406]
 [0.27178225 0.22073499 0.86542861 ... 0.94954508 0.09123997 0.26137299]
 ...
 [0.42758915 0.44524973 0.72509501 ... 0.5805393  0.68880136 0.3873468 ]
 [0.0245044  0.41796409 0.44553429 ... 0.46232153 0.60653977 0.63094818]
 [0.74529689 0.32045345 0.11521009 ... 0.82690464 0.59307137 0.37760109]]
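
Because a memory-mapped array is read from disk on demand, a reduction can also be run block by block so that only one slice is resident at a time. The following is a minimal sketch of that idea, not a definitive implementation. The small temporary file, the block sizes, and the helper name `column_means_in_blocks` are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def column_means_in_blocks(mm, block_rows):
    """Per-column means of a 2D (memory-mapped) array, block_rows rows at a time."""
    col_sums = np.zeros(mm.shape[1], dtype=np.float64)
    for start in range(0, mm.shape[0], block_rows):
        # Only this slice of rows is pulled into memory.
        block = mm[start:start + block_rows]
        col_sums += block.sum(axis=0)
    return col_sums / mm.shape[0]


# Small shape so the sketch runs quickly; a real dataset would be far larger.
shape = (1000, 50)
path = os.path.join(tempfile.mkdtemp(), "demo.dat")

# Write the file in row blocks instead of materializing the full array first.
writer = np.memmap(path, dtype=np.float64, mode="w+", shape=shape)
rng = np.random.default_rng(0)
for start in range(0, shape[0], 100):
    stop = min(start + 100, shape[0])
    writer[start:stop] = rng.random((stop - start, shape[1]))
writer.flush()
del writer  # release the writable mapping

# Reopen read-only and reduce it block by block.
reader = np.memmap(path, dtype=np.float64, mode="r", shape=shape)
block_means = column_means_in_blocks(reader, block_rows=128)
print(block_means[:5])
```

The block size of 128 rows does not divide the row count evenly, which shows that the shorter final block is handled by ordinary slicing.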


62. **Perform batch processing on large datasets with NumPy.**

In [2]:
import numpy as np

# Example large dataset
large_array = np.random.rand(1000000)

# Function to process data in batches
def batch_process(data, batch_size, func):
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        func(batch)

# Example processing function
def process_batch(batch):
    # Example: computing the mean of the batch
    batch_mean = np.mean(batch)
    print(f"Batch mean: {batch_mean}")

# Perform batch processing
batch_process(large_array, batch_size=100000, func=process_batch)

Batch mean: 0.4980195094430185
Batch mean: 0.5007727840866029
Batch mean: 0.4999730070543666
Batch mean: 0.5010107921057971
Batch mean: 0.5003541615068549
Batch mean: 0.5010029231235742
Batch mean: 0.4998494021272701
Batch mean: 0.4997786940416838
Batch mean: 0.4996790521204079
Batch mean: 0.4994175368022695
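
The example above prints one mean per batch. To get a single result for the whole dataset, the per-batch results have to be combined, and they should be weighted by batch size because the last batch can be shorter. A minimal sketch, assuming a hypothetical generator `iter_batches` and an array length chosen so that the final batch is partial:

```python
import numpy as np


def iter_batches(data, batch_size):
    """Yield consecutive slices of data; the final slice may be shorter."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


rng = np.random.default_rng(42)
values = rng.random(1_050_000)  # deliberately not a multiple of the batch size

batch_means = []
batch_sizes = []
for batch in iter_batches(values, batch_size=100_000):
    batch_means.append(batch.mean())
    batch_sizes.append(batch.size)

# Weight each batch mean by its size so the short final batch is not over-counted.
overall_mean = np.average(batch_means, weights=batch_sizes)
print(f"Batches: {len(batch_means)}, overall mean: {overall_mean:.6f}")
```

Slicing a NumPy array yields views, so the generator does not copy the data for each batch.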


63. **Use NumPy to preprocess data for machine learning models.**

In [3]:
import numpy as np

# Example dataset
data = np.random.rand(100, 5)

# Preprocessing: standardize each column (feature) to zero mean and unit variance
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
normalized_data = (data - mean) / std

print("Normalized data:")
print(normalized_data)

Normalized data:
[[-0.41612542  1.4400142  -1.66851217 -0.59933053  0.08714549]
 [-1.40449066  0.52050864  1.05966696  1.45289307 -1.51849841]
 [ 0.59950148 -1.19098327 -1.06305219 -0.08228433 -1.06742622]
 [-0.89513602  1.57089599  0.33587551  0.8639603  -0.41539926]
 [ 0.51026759 -1.48181343 -0.78499648  0.05010248  0.40979018]
 [ 0.22059262 -0.84572851 -0.9404906  -0.74435517 -0.16991596]
 [ 0.52696524 -0.34609592 -0.18984308 -0.79974085  0.03845424]
 [ 1.44369785 -1.21641811  0.76776327 -0.33417778  1.62002651]
 [-1.82672289 -0.24575242  0.35947619  0.41017368 -1.47223408]
 [ 1.3400738   1.43050949  0.37956034  1.2487955  -1.21920602]
 [ 1.47175658 -1.2113485  -0.14246784 -1.05346358 -0.87914944]
 [-0.15540641  1.11790771  0.14061834 -0.5877479   0.29436513]
 [-1.37123046  1.36679279 -0.92499427 -1.444702    1.25435644]
 [ 0.0119007   1.68914069 -1.23979719 -0.60441208  1.09375004]
 [-1.20970608  1.61742425  1.74479865  0.77665634 -0.55626253]
 [ 1.67384968 -0.43228472 -0.65252396 ...
 ...
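
Two practical details are worth sketching for this kind of standardization: a constant column has a standard deviation of zero, which would produce a division by zero, and statistics computed on training data are usually reused on new data. The helper name `fit_standardizer`, the train/test split, and the constant column are assumptions made for illustration.

```python
import numpy as np


def fit_standardizer(train):
    """Return per-column mean and standard deviation, guarding against zero variance."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # A constant column has std == 0; use 1.0 so it maps to zeros instead of NaN.
    safe_std = np.where(std == 0, 1.0, std)
    return mean, safe_std


rng = np.random.default_rng(0)
train = rng.random((100, 5))
train[:, 2] = 3.0  # constant feature, to exercise the guard
test = rng.random((20, 5))

mean, safe_std = fit_standardizer(train)
train_scaled = (train - mean) / safe_std
# Reuse the training statistics on new data rather than recomputing them.
test_scaled = (test - mean) / safe_std

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```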

64. **Load a large dataset incrementally and compute summary statistics.**

In [4]:
import numpy as np

# Simulate loading data in chunks
def load_data_in_chunks(file_path, chunk_size):
    # file_path is unused here: random data stands in for the file contents
    total_size = 1000000
    for i in range(0, total_size, chunk_size):
        yield np.random.rand(min(chunk_size, total_size - i))

# Initialize summary statistics
sum_total = 0
count_total = 0

# Process data in chunks
for chunk in load_data_in_chunks('large_dataset.csv', chunk_size=100000):
    sum_total += np.sum(chunk)
    count_total += chunk.size

# Compute mean
mean_total = sum_total / count_total

print("Summary statistics:")
print(f"Mean: {mean_total}")

Summary statistics:
Mean: 0.4998949487601498
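
The same incremental pattern extends beyond the mean. The sketch below keeps a running count, mean, sum of squared deviations, minimum, and maximum, and merges each chunk's statistics into the totals. It assumes a hypothetical `chunk_stream` generator that produces random data in place of a real file reader.

```python
import numpy as np


def chunk_stream(total_size, chunk_size, seed=0):
    """Stand-in for a file reader: yields random chunks (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    for start in range(0, total_size, chunk_size):
        yield rng.random(min(chunk_size, total_size - start))


count = 0
mean = 0.0
m2 = 0.0  # running sum of squared deviations from the mean
minimum = np.inf
maximum = -np.inf

for chunk in chunk_stream(total_size=1_000_000, chunk_size=100_000):
    n = chunk.size
    chunk_mean = chunk.mean()
    chunk_m2 = ((chunk - chunk_mean) ** 2).sum()

    # Merge the chunk's statistics into the running totals.
    delta = chunk_mean - mean
    total = count + n
    mean += delta * n / total
    m2 += chunk_m2 + delta ** 2 * count * n / total
    count = total

    minimum = min(minimum, chunk.min())
    maximum = max(maximum, chunk.max())

variance = m2 / count  # population variance, matching np.var's default
print(f"n={count}, mean={mean:.6f}, std={np.sqrt(variance):.6f}")
print(f"min={minimum:.6f}, max={maximum:.6f}")
```

Merging per-chunk means and squared deviations tends to be more numerically stable than accumulating a raw sum of squares, which is why it is used here.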


### Miscellaneous

65. **Create a custom NumPy dtype.**

In [5]:
import numpy as np

# Define a custom dtype for a structured array
custom_dtype = np.dtype([('name', 'S10'), ('age', 'i4'), ('height', 'f4')])

# Create an array with the custom dtype
data = np.array([('Alice', 25, 5.5), ('Bob', 30, 6.0)], dtype=custom_dtype)

print("Custom dtype array:")
print(data)

Custom dtype array:
[(b'Alice', 25, 5.5) (b'Bob', 30, 6. )]
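
Once a structured dtype is defined, each field can be read, filtered, and sorted by name. The sketch below reuses the layout above with two illustrative changes that are not in the original: the name field uses `'U10'` (Unicode) instead of `'S10'` (bytes), which avoids the `b'...'` prefix in the output, and a third made-up record is added so the sort has something to reorder.

```python
import numpy as np

# Same fields as above, but with a Unicode name field ('U10') instead of bytes ('S10').
person_dtype = np.dtype([("name", "U10"), ("age", "i4"), ("height", "f4")])
people = np.array(
    [("Alice", 25, 5.5), ("Bob", 30, 6.0), ("Carol", 22, 5.8)],
    dtype=person_dtype,
)

# Each field behaves like an ordinary array.
ages = people["age"]
older_than_23 = people[people["age"] > 23]

# np.sort accepts an `order` argument for structured arrays.
by_age = np.sort(people, order="age")

print(ages)
print(older_than_23["name"])
print(by_age["name"])
```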


66. **Implement a NumPy ufunc (universal function).**

In [6]:
import numpy as np

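# Define a simple scalar function
def add_elements(x, y):
    return x + y

# Wrap it with np.vectorize so it broadcasts over arrays like a ufunc.
# Note: np.vectorize is a convenience wrapper that loops in Python; it is not a compiled ufunc.
vectorized_add = np.vectorize(add_elements)

# Use the vectorized function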
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = vectorized_add(a, b)

print("Custom ufunc result:")
print(result)

Custom ufunc result:
[5 7 9]
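
If an actual `np.ufunc` object is needed, `np.frompyfunc` is one way to get it from a Python function. The sketch below is illustrative only. The resulting ufunc still calls the Python function once per element and returns object-dtype arrays, so it does not bring the speed of NumPy's built-in ufuncs.

```python
import numpy as np


def add_elements(x, y):
    return x + y


# np.frompyfunc(func, nin, nout) returns an actual np.ufunc object.
add_ufunc = np.frompyfunc(add_elements, 2, 1)

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

result = add_ufunc(a, b)               # object-dtype array
result_int = result.astype(np.int64)   # convert back to a numeric dtype

# Being a real ufunc, it also supports methods such as reduce.
# The input is cast to object dtype to match the ufunc's only loop.
total = add_ufunc.reduce(a.astype(object))

print(type(add_ufunc).__name__, result.dtype, result_int, total)
```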


67. **Use NumPy with Pandas for data manipulation.**

In [7]:
import numpy as np
import pandas as pd

# Create a Pandas DataFrame
df = pd.DataFrame({
    'A': np.random.rand(5),
    'B': np.random.rand(5)
})

# Convert the DataFrame to a NumPy array (to_numpy() is the recommended accessor)
array = df.to_numpy()

# Perform row-wise mean calculation
row_mean_array = np.mean(array, axis=1)

# Update the DataFrame with the result
df['mean_A_B'] = row_mean_array

print("Updated DataFrame:")
print(df)

Updated DataFrame:
          A         B  mean_A_B
0  0.596119  0.123851  0.359985
1  0.369512  0.849971  0.609741
2  0.136907  0.691515  0.414211
3  0.991506  0.673133  0.832319
4  0.721371  0.225993  0.473682


68. **Combine multiple conditions to filter an array (logical operations).**

In [8]:
import numpy as np

# Example array
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Combine multiple conditions
filtered_data = data[(data > 3) & (data < 8)]

print("Filtered data:")
print(filtered_data)

Filtered data:
[4 5 6 7]
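
The `&` operator covers AND. The other combinations follow the same pattern, as this short sketch shows: `|` for OR, `~` for NOT, the function form `np.logical_and`, and `np.isin` for membership tests. Each comparison needs its own parentheses because `&` and `|` bind more tightly than comparison operators.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# OR: keep values at either end of the range.
outside = data[(data <= 3) | (data >= 8)]

# NOT: invert a mask with ~.
odd = data[~(data % 2 == 0)]

# The function form is equivalent to & and avoids precedence mistakes.
inside = data[np.logical_and(data > 3, data < 8)]

# np.isin tests membership against a set of values.
picked = data[np.isin(data, [2, 4, 11])]

print(outside, odd, inside, picked)
```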



69. **Sort an array based on a specific column or row.**

In [9]:
import numpy as np

# Example 2D array (rows deliberately out of order in the second column)
array = np.array([[3, 8, 1], [6, 2, 4], [9, 5, 7]])

# Sort the rows by the values in the second column
sorted_array = array[array[:, 1].argsort()]

print("Array sorted by the second column:")
print(sorted_array)

Array sorted by the second column:
[[6 2 4]
 [9 5 7]
 [3 8 1]]
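
The same `argsort` indexing idea covers a few related cases: descending order, sorting by a row (which reorders the columns), and sorting by two columns with `np.lexsort`. The array `table` below is made up for illustration.

```python
import numpy as np

table = np.array([[3, 7, 1],
                  [1, 7, 9],
                  [2, 4, 5]])

# Descending order by the first column: reverse the argsort indices.
desc_by_col0 = table[table[:, 0].argsort()[::-1]]

# Sort by a row instead: reorder the columns using the first row's values.
cols_by_row0 = table[:, table[0].argsort()]

# Two keys: second column first, ties broken by the first column.
# np.lexsort treats the LAST key in the tuple as the primary key.
order = np.lexsort((table[:, 0], table[:, 1]))
by_col1_then_col0 = table[order]

print(desc_by_col0)
print(cols_by_row0)
print(by_col1_then_col0)
```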


70. **Use `np.where` to replace elements based on a condition.**

In [10]:
import numpy as np

# Example array
data = np.array([1, 2, 3, 4, 5])

# Replace elements greater than 3 with 0
replaced_data = np.where(data > 3, 0, data)

print("Array with elements > 3 replaced with 0:")
print(replaced_data)

Array with elements > 3 replaced with 0:
[1 2 3 0 0]
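
`np.where` has a second form worth knowing, and two close relatives handle cases it does not. The sketch below shows the one-argument form, which returns indices, then `np.select` for more than one condition, and finally in-place assignment through a boolean mask. The replacement values `-1`, `1`, and `0` are arbitrary choices for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# With only a condition, np.where returns a tuple of index arrays.
indices = np.where(data > 3)[0]

# np.select handles several conditions; the first matching condition wins.
signs = np.select([data < 2, data > 3], [-1, 1], default=0)

# In-place alternative using a boolean mask (applied to a copy, so `data` is unchanged).
capped = data.copy()
capped[capped > 3] = 0

print(indices, signs, capped)
```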
