# A Comprehensive Guide to NumPy for Data Science Applications

NumPy is a fundamental package for numerical computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays. Given its efficiency and versatility, NumPy is widely used in scientific computing, data analysis, and machine learning.

Creating Arrays

In [2]:
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

print("1D Array:", array_1d)
print("2D Array:\n", array_2d)


1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]


PERFORMING BASIC OPERATIONS WITH NUMPY

NumPy supports a wide range of basic operations on arrays, including arithmetic, mathematical functions, and aggregation. Arithmetic and mathematical functions are applied element-wise in compiled code rather than in a Python loop, which makes them highly efficient.

In [5]:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

add_result = a + b
mul_result = a * b
scalar_add = a + 10
scalar_mul = b * 2

print("Addition result:", add_result)
print("Multiplication result:", mul_result)
print("Scalar addition result:", scalar_add)
print("Scalar multiplication result:", scalar_mul)


Addition result: [5 7 9]
Multiplication result: [ 4 10 18]
Scalar addition result: [11 12 13]
Scalar multiplication result: [ 8 10 12]
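
One point worth making explicit: `*` on arrays multiplies matching elements and is not a dot or matrix product. A minimal sketch, reusing the values of `a` and `b` from the cell above; the names `elementwise` and `dot` are introduced here only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

elementwise = a * b   # multiplies matching positions
dot = a @ b           # sums the element-wise products (dot product)

print("Element-wise product:", elementwise)
print("Dot product:", dot)
```

For two 1D arrays, `@` returns a single number, while `*` always returns an array with the same shape as its inputs.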


UNDERSTANDING ARRAY PROPERTIES USING NUMPY

Understanding the properties of NumPy arrays is essential for efficient data manipulation and analysis. NumPy arrays come with various attributes that provide valuable information about their structure, including shape, dimensions, size, data type, and more.

1. SHAPE AND DIMENSIONS

In [6]:
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5])
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

shape_1d = array_1d.shape
shape_2d = array_2d.shape
dim_1d = array_1d.ndim
dim_2d = array_2d.ndim

print("Shape of 1D array:", shape_1d)
print("Shape of 2D array:", shape_2d)
print("Dimensions of 1D array:", dim_1d)
print("Dimensions of 2D array:", dim_2d)


Shape of 1D array: (5,)
Shape of 2D array: (2, 3)
Dimensions of 1D array: 1
Dimensions of 2D array: 2


2. SIZE

In [7]:
size_1d = array_1d.size
size_2d = array_2d.size

print("Size of 1D array:", size_1d)
print("Size of 2D array:", size_2d)


Size of 1D array: 5
Size of 2D array: 6


3. DATA TYPE (dtype)

In [8]:
dtype_1d = array_1d.dtype
float_array = np.array([1.2, 2.3, 3.4], dtype='float32')
int_array = float_array.astype('int32')

print("Data type of 1D array:", dtype_1d)
print("Float array:", float_array)
print("Array after type casting:", int_array)


Data type of 1D array: int32
Float array: [1.2 2.3 3.4]
Array after type casting: [1 2 3]
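
Two details in the output above are easy to miss. The default integer dtype depends on the platform and NumPy version (the `int32` shown here is typical of Windows with older NumPy releases, while many other setups report `int64`), and `astype` to an integer type discards the fractional part instead of rounding. A small sketch with illustrative variable names that are not part of the original cells:

```python
import numpy as np

# Request the dtype explicitly so the result does not depend on the platform.
explicit_int = np.array([1, 2, 3, 4, 5], dtype=np.int64)

values = np.array([1.7, -1.7, 2.2])
truncated = values.astype(np.int32)          # fractional part is dropped (toward zero)
rounded = np.rint(values).astype(np.int32)   # round to nearest first, then cast

print("Explicit dtype:", explicit_int.dtype)
print("Truncated:", truncated)
print("Rounded:", rounded)
```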


DATA MANIPULATION USING NUMPY

NumPy provides powerful capabilities for data manipulation, enabling efficient handling of arrays through creation, indexing, slicing, reshaping, and mathematical operations.

INDEXING

In [14]:
element_1d = array_1d[2]
element_2d = array_2d[1, 2]
print("Element from 1D array:", element_1d)
print("Element from 2D array:", element_2d)

Element from 1D array: 3
Element from 2D array: 6
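
Besides single positions, an index can also be a boolean mask or a list of positions. The sketch below assumes the same `array_1d` values as earlier; `mask`, `selected`, and `picked` are illustrative names.

```python
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5])

mask = array_1d > 2          # boolean array, True where the condition holds
selected = array_1d[mask]    # keeps only the elements where mask is True
picked = array_1d[[0, -1]]   # list of positions: first and last element

print("Mask:", mask)
print("Selected by mask:", selected)
print("Picked by position:", picked)
```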


SLICING

In [15]:
slice_1d = array_1d[1:4]
slice_2d = array_2d[:, 1:3]

print("Slice of 1D array:", slice_1d)
print("Slice of 2D array:\n", slice_2d)

Slice of 1D array: [2 3 4]
Slice of 2D array:
 [[2 3]
 [5 6]]
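
A basic slice is a view on the original data, not a copy, so writing to the slice also changes the source array. A minimal sketch of that behavior, using illustrative names (`view`, `independent`):

```python
import numpy as np

array_1d = np.array([1, 2, 3, 4, 5])

view = array_1d[1:4]                  # basic slice: shares memory with array_1d
view[0] = 99                          # also changes array_1d[1]

independent = array_1d[1:4].copy()    # explicit copy: safe to modify
independent[1] = -1                   # array_1d is not affected

print("Original after edits:", array_1d)
print("Slice shares memory:", np.shares_memory(array_1d, view))
print("Copy shares memory:", np.shares_memory(array_1d, independent))
```

Calling `.copy()` is the usual way to get a slice that can be changed without side effects.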


RESHAPING

In [16]:
reshaped_array = np.arange(12).reshape((3, 4))

print("Reshaped array:\n", reshaped_array)

Reshaped array:
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
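
Reshaping only works when the total number of elements stays the same. As a brief sketch (variable names are illustrative), one dimension can be given as `-1` so that NumPy infers it, and an incompatible shape raises `ValueError`:

```python
import numpy as np

values = np.arange(12)

inferred = values.reshape(3, -1)   # -1 lets NumPy infer the second dimension (4)
flat = inferred.ravel()            # back to one dimension

try:
    values.reshape(5, -1)          # 12 elements cannot be split into 5 equal rows
    reshape_failed = False
except ValueError:
    reshape_failed = True

print("Inferred shape:", inferred.shape)
print("Flattened shape:", flat.shape)
print("Reshape to 5 rows failed:", reshape_failed)
```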


Applying Mathematical Operations

In [17]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

add_result = a + b
mul_result = a * b
scalar_add = a + 10
sqrt_result = np.sqrt(a)

print("Addition result:", add_result)
print("Multiplication result:", mul_result)
print("Scalar addition result:", scalar_add)
print("Square root result:", sqrt_result)

Addition result: [5 7 9]
Multiplication result: [ 4 10 18]
Scalar addition result: [11 12 13]
Square root result: [1.         1.41421356 1.73205081]


DATA AGGREGATION USING NUMPY

Data aggregation involves computing summary statistics and performing operations that summarize data. NumPy provides efficient functions for calculating summary statistics such as mean, median, standard deviation, and sum. Grouping data and aggregating the results per group are essential steps in data analysis.

In [18]:
import numpy as np

data = np.array([10, 20, 30, 40, 50])

mean_result = np.mean(data)
median_result = np.median(data)
std_dev_result = np.std(data)
sum_result = np.sum(data)

print("Mean:", mean_result)
print("Median:", median_result)
print("Standard Deviation:", std_dev_result)
print("Sum:", sum_result)


Mean: 30.0
Median: 30.0
Standard Deviation: 14.142135623730951
Sum: 150
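
The standard deviation above is the population value: `np.std` divides by N unless told otherwise. When the data is a sample, passing `ddof=1` divides by N - 1 instead. A short sketch on the same data, with illustrative variable names:

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50])

population_std = np.std(data)        # default ddof=0: divide by N
sample_std = np.std(data, ddof=1)    # ddof=1: divide by N - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```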


GROUPING DATA

In [19]:
data_2d = np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

mean_per_column = np.mean(data_2d, axis=0)
mean_per_row = np.mean(data_2d, axis=1)

print("Mean per column:", mean_per_column)
print("Mean per row:", mean_per_row)

Mean per column: [40. 50. 60.]
Mean per row: [20. 50. 80.]


APPLYING AGGREGATIONS TO GROUPS

In [22]:
data_structured = np.array([(1, 'A', 10), (2, 'B', 20), (1, 'A', 30), (2, 'B', 40)],
                           dtype=[('group', 'i4'), ('category', 'U1'), ('value', 'i4')])

grouped_data = np.unique(data_structured['group'])
mean_per_group = [np.mean(data_structured[data_structured['group'] == g]['value']) for g in grouped_data]

print("Mean value per group:", dict(zip(grouped_data, mean_per_group)))


Mean value per group: {1: 20.0, 2: 30.0}
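
The list comprehension above scans the array once per group. As an alternative sketch (not the approach used in the cell above), `np.unique` with `return_inverse=True` combined with `np.bincount` computes all group means in a single vectorized pass. The plain `groups` and `values` arrays are assumptions that mirror the `group` and `value` fields of the structured array.

```python
import numpy as np

groups = np.array([1, 2, 1, 2])
values = np.array([10, 20, 30, 40])

# inverse[i] is the position of groups[i] within the sorted unique labels
labels, inverse = np.unique(groups, return_inverse=True)

sums = np.bincount(inverse, weights=values)   # sum of values per group
counts = np.bincount(inverse)                 # number of rows per group
means = sums / counts

print("Mean value per group:", dict(zip(labels.tolist(), means.tolist())))
```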


DATA ANALYSIS USING NUMPY

Data analysis with NumPy involves various techniques to extract insights from data. This includes finding correlations, identifying outliers, and calculating percentiles. NumPy is highly efficient for handling large datasets and performing these analyses quickly.

FINDING CORRELATIONS

In [23]:
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

correlation = np.corrcoef(x, y)[0, 1]

print("Correlation coefficient:", correlation)


Correlation coefficient: -0.9999999999999999
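
To make clear what `np.corrcoef` returns, the Pearson coefficient can also be computed directly from its definition: the sum of products of the centered values, divided by the square root of the product of their sums of squares. A minimal sketch with illustrative names:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

x_centered = x - x.mean()
y_centered = y - y.mean()

# Pearson r from its definition
manual_r = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

print("Manual correlation:", manual_r)
print("Matches np.corrcoef:", np.isclose(manual_r, np.corrcoef(x, y)[0, 1]))
```

The value printed by `np.corrcoef` differs from exactly -1 only through floating-point rounding.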


IDENTIFYING OUTLIERS

In [24]:
from scipy import stats

data = np.array([10, 12, 13, 12, 15, 16, 19, 100])
z_scores = np.abs(stats.zscore(data))
outliers = np.where(z_scores > 2)

print("Outliers:", data[outliers])


Outliers: [100]
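
The same check can be done without SciPy, because `stats.zscore` by default uses the population standard deviation, which is also what `ndarray.std()` returns. The sketch below reproduces the z-score test in plain NumPy and adds the interquartile-range (IQR) rule as a second, different technique; the 1.5 multiplier is a common convention and an assumption here, not something taken from the cell above.

```python
import numpy as np

data = np.array([10, 12, 13, 12, 15, 16, 19, 100])

# z-score rule: distance from the mean in units of standard deviation
z_scores = np.abs((data - data.mean()) / data.std())
z_outliers = data[z_scores > 2]

# IQR rule: points far outside the middle 50% of the data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("Outliers (z-score):", z_outliers)
print("Outliers (IQR):", iqr_outliers)
```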


CALCULATING PERCENTILES

In [25]:
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

percentile_25 = np.percentile(data, 25)
percentile_50 = np.percentile(data, 50)  # Median
percentile_75 = np.percentile(data, 75)

print("25th percentile:", percentile_25)
print("50th percentile (Median):", percentile_50)
print("75th percentile:", percentile_75)


25th percentile: 32.5
50th percentile (Median): 55.0
75th percentile: 77.5
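
The 25th percentile is 32.5 rather than 30 because `np.percentile` interpolates linearly between neighboring values by default. The sketch below spells out that calculation for already sorted data; `linear_percentile` is a hypothetical helper written for this illustration, not a NumPy function.

```python
import numpy as np

data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

def linear_percentile(sorted_values, q):
    """Percentile q (0-100) of sorted data using linear interpolation."""
    position = q / 100 * (len(sorted_values) - 1)   # fractional index
    lower = int(np.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

manual = [linear_percentile(data, q) for q in (25, 50, 75)]
builtin = np.percentile(data, [25, 50, 75])   # several percentiles in one call

print("Manual:", manual)
print("np.percentile:", builtin)
```

For q = 25 the fractional index is 0.25 * 9 = 2.25, so the result lies a quarter of the way from 30 to 40.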


HIGHLIGHTING NUMPY’S EFFICIENCY IN HANDLING LARGE DATASETS

1. Optimized Data Structures

In [26]:
import numpy as np

large_array = np.arange(1_000_000)
print("Memory size of NumPy array:", large_array.nbytes, "bytes")


Memory size of NumPy array: 4000000 bytes
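
The reported size is simply the number of elements multiplied by the bytes per element, so it depends on the dtype. The 4,000,000 bytes above correspond to a 4-byte integer default; on setups where the default integer is 8 bytes, the same call reports twice as much. A short sketch that fixes the dtype explicitly (variable names are illustrative):

```python
import numpy as np

as_int32 = np.arange(1_000_000, dtype=np.int32)
as_int64 = np.arange(1_000_000, dtype=np.int64)

# nbytes == size * itemsize
print("int32:", as_int32.itemsize, "bytes per element,", as_int32.nbytes, "bytes total")
print("int64:", as_int64.itemsize, "bytes per element,", as_int64.nbytes, "bytes total")
```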


2. Vectorized Operations

In [27]:
import numpy as np

large_array = np.random.rand(1_000_000)
result = large_array * 2  # Vectorized operation

print("First 5 elements of the result:", result[:5])


First 5 elements of the result: [0.39110793 0.83721924 1.81622003 0.27136852 1.1296074 ]
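
To see what vectorization saves, the same doubling can be written as a Python loop and timed against the array expression. This is a rough sketch rather than a benchmark: the array is smaller than the one above, the seeded generator is an assumption added for repeatability, and the measured times will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatable values)
values = rng.random(100_000)

start = time.perf_counter()
looped = np.array([v * 2 for v in values])   # one Python-level multiplication per element
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized = values * 2                      # a single call into compiled code
vector_seconds = time.perf_counter() - start

print(f"Loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
print("Same result:", np.array_equal(looped, vectorized))
```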


3. Broadcasting

In [28]:
import numpy as np

array_2d = np.random.rand(1_000, 1_000)
result = array_2d + 10  # Broadcasting operation

print("Shape of result array:", result.shape)


Shape of result array: (1000, 1000)
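
Adding a scalar is the simplest case of broadcasting. The same rule also combines arrays of different shapes, as long as each pair of dimensions is either equal or one of them is 1. A small sketch with illustrative arrays:

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])           # shape (4,)

grid = column + row                    # shapes broadcast to (3, 4)

table = np.array([[1.0, 2.0], [3.0, 6.0]])
centered = table - table.mean(axis=0)  # (2, 2) minus (2,): subtract each column's mean

print("Grid:\n", grid)
print("Centered table:\n", centered)
```

No intermediate copies of `row` or of the column means are built by hand; NumPy handles the stretching internally.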


APPLICATION IN DATA SCIENCE

NumPy plays a pivotal role in data science by offering advanced capabilities for numerical computations, data manipulation, and analysis. Its efficient handling of large datasets and powerful array operations make it an essential tool for data science professionals.



In [29]:
import numpy as np
import time

data = np.random.rand(10_000_000)

start_time = time.time()
mean_value = np.mean(data)
end_time = time.time()

print("Mean value:", mean_value)
print("Time taken:", end_time - start_time, "seconds")


Mean value: 0.4999057326433459
Time taken: 0.012458324432373047 seconds


Convenient Mathematical Functions


In [30]:
import numpy as np

data = np.random.rand(1000)
mean = np.mean(data)
std_dev = np.std(data)
cov_matrix = np.cov(data, rowvar=False)  # 1D input is a single variable, so this is its sample variance (one number), not a matrix

print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Covariance Matrix:\n", cov_matrix)


Mean: 0.49803054386623374
Standard Deviation: 0.28563959887250967
Covariance Matrix:
 0.08167165209614437
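
With a single 1D input, `np.cov` has only one variable to work with, so the result above is a single variance value. A covariance matrix proper needs at least two variables. The sketch below uses two small illustrative arrays, which are assumptions rather than the random data from the cell above:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 4, 3, 2, 1])

cov_matrix = np.cov(x, y)   # each argument is treated as one variable

print("Covariance matrix:\n", cov_matrix)
```

The diagonal holds the sample variances of `x` and `y`, and the off-diagonal entries hold their covariance.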


Support for Multi-Dimensional Arrays

In [31]:
import numpy as np

array_3d = np.random.rand(10, 10, 10)
mean_per_slice = np.mean(array_3d, axis=(1, 2))

print("Mean per slice:", mean_per_slice)


Mean per slice: [0.49470633 0.51832025 0.53029841 0.48975521 0.49055357 0.46053951
 0.48746854 0.52182145 0.50125793 0.54923857]


Integration with Other Libraries

In [32]:
import numpy as np
import pandas as pd

data = np.random.rand(100, 3)
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print("DataFrame:\n", df.head())


DataFrame:
           A         B         C
0  0.893037  0.291425  0.542391
1  0.899151  0.247227  0.173199
2  0.687049  0.057271  0.288107
3  0.755465  0.463895  0.471268
4  0.462153  0.211424  0.598398


REAL-WORLD EXAMPLES

In [33]:
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

print("Predictions:", predictions)


Predictions: [ 2.  4.  6.  8. 10.]
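
For a straight-line fit like this one, the same result can be obtained with NumPy alone by solving the least-squares problem directly. This is a sketch of an equivalent calculation, not a description of how scikit-learn works internally; `design`, `slope`, and `intercept` are illustrative names.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 6, 8, 10], dtype=float)

# Design matrix with a column of ones so the fit includes an intercept
design = np.column_stack([x, np.ones_like(x)])

(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
predictions = slope * x + intercept

print("Slope:", slope)
print("Intercept:", intercept)
print("Predictions:", predictions)
```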


CONCLUSION

NumPy is a powerful and essential Python library for numerical computation and data analysis. Its core features significantly improve the efficiency and performance of working with numerical data.

NumPy’s role in data science and numerical computing cannot be overstated. It addresses the challenges of handling large datasets, performing complex calculations, and conducting in-depth data analysis. Its optimized array structures, vectorized operations, and advanced mathematical functions make it a go-to tool for data scientists, researchers, and analysts. By leveraging NumPy, professionals can achieve faster computations, clearer data manipulation, and more insightful analyses, ultimately driving better decision-making and innovation across various domains.









