# <span style="color:darkblue;">[LDATS2350] - DATA MINING</span>

### <span style="color:darkred;">Python01 - NumPy library</span>

**Prof. Robin Van Oirbeek**  

---

**<span style="color:darkgreen;">Guillaume Deside</span>** (<span style="color:gray;">guillaume.deside@uclouvain.be</span>)  

### **NumPy: The Core Library for Scientific Computing in Python**

**NumPy** (short for *Numerical Python*) is a foundational library in Python that provides support for efficient numerical computations and is widely used in data science, machine learning, and scientific research. At its core, it offers a **high-performance multidimensional array object** along with a suite of tools for manipulating these arrays.

---

### **Key Features of NumPy**
1. **Efficient Multidimensional Arrays**:  
   - NumPy provides the `ndarray` object, which allows storage and manipulation of data in multiple dimensions (1D, 2D, 3D, etc.) with superior performance compared to Python lists.  

2. **Vectorized Operations**:  
   - Operations are performed **over entire arrays** without requiring loops, enabling fast, clean, and efficient computations.

3. **Numerical Computation**:  
   - NumPy supports advanced mathematical operations like linear algebra, Fourier transforms, statistical functions, and more.

4. **Performance**:  
   - NumPy's core is implemented in C and is **highly optimized** for speed, so array operations are typically much faster than equivalent loops over Python's built-in data structures.
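
To make the idea of vectorized operations concrete, here is a minimal sketch comparing an explicit loop with the equivalent array expression. The variable names (`values`, `squared_loop`, `squared_vec`) are illustrative and not part of the course material.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop-based version: square each element one at a time
squared_loop = np.array([v ** 2 for v in values])

# Vectorized version: the operation is applied to the whole array at once
squared_vec = values ** 2

print(squared_vec)                                # [ 1.  4.  9. 16.]
print(np.array_equal(squared_loop, squared_vec))  # True

# Built-in reductions also work on the whole array without a loop
print(values.sum(), values.mean())                # 10.0 2.5
```

Both versions give the same result; the vectorized form is shorter and, on large arrays, usually much faster because the loop runs in compiled code.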

---

### **Why Use NumPy?**

- **High Performance**:  
  NumPy arrays are more memory-efficient and computationally faster than Python lists due to their fixed data type and compact structure.

- **Easy Integration**:  
  NumPy integrates seamlessly with libraries like Pandas, Scikit-learn, TensorFlow, and Matplotlib.

- **Foundation for Scientific Libraries**:  
  Many popular libraries in Python (like SciPy and Pandas) are built on top of NumPy.
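
The memory claim above can be checked directly on an array. The following is a small sketch, assuming a hypothetical array of 1000 integers: because every element has the same fixed size, the size of the data buffer is simply the number of elements times the element size.

```python
import numpy as np

# 1000 integers stored with an explicit 64-bit integer type
arr = np.arange(1000, dtype=np.int64)

print(arr.dtype)     # int64
print(arr.itemsize)  # 8  (bytes per element)
print(arr.nbytes)    # 8000  (bytes used by the data buffer)

# The values 0..999 fit in 16 bits, so a smaller type uses less memory
small = arr.astype(np.int16)
print(small.nbytes)  # 2000
```

Choosing a smaller data type is only safe when all values fit in it; otherwise the conversion silently overflows.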

---

### **Important Notes**
- **Fixed Data Types**:  
  NumPy arrays can only contain **values of the same data type**. This fixed type ensures efficiency but differs from Python lists, which can contain mixed data types.

- **Array Operations vs Lists**:  
  Operators like `+`, `*`, or `/` work differently in NumPy arrays compared to Python lists. For example:
  - **With Lists**: `list1 + list2` concatenates two lists.
  - **With NumPy Arrays**: `array1 + array2` performs element-wise addition.
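
Both notes can be illustrated with a short sketch. The names `mixed`, `list1`, `list2`, `array1`, and `array2` are chosen here for illustration only.

```python
import numpy as np

# Mixed integer/float input is converted to a single common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

list1, list2 = [1, 2, 3], [10, 20, 30]
array1, array2 = np.array(list1), np.array(list2)

# Lists: + concatenates
print(list1 + list2)    # [1, 2, 3, 10, 20, 30]

# Arrays: +, * and / operate element by element
print(array1 + array2)  # [11 22 33]
print(array1 * array2)  # [10 40 90]
print(array1 / array2)  # [0.1 0.1 0.1]
```

Note that `list1 * list2` raises a `TypeError`, since lists only support multiplication by an integer (repetition).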


### **When to Use NumPy**
- Large-scale numerical datasets that need efficient storage and operations.
- Mathematical operations on multidimensional data (e.g., matrices and tensors).
- Scenarios requiring integration with other scientific computing libraries or machine learning workflows.


## Example of a NumPy array



In NumPy, an **array** is a central data structure that allows efficient storage and computation on multidimensional data. Below, we demonstrate how to create and manipulate NumPy arrays.



In [16]:
import numpy as np

scores = [24, 23, 30, 29, 17, 16, 15]

np_scores = np.array(scores)

print(np_scores)

print(scores*2)     # list: * repeats the list
print(np_scores*2)  # array: * multiplies every element

print(np_scores+1)  # array: + adds 1 to every element

print(type(scores))
print(type(np_scores))

[24 23 30 29 17 16 15]
[24, 23, 30, 29, 17, 16, 15, 24, 23, 30, 29, 17, 16, 15]
[48 46 60 58 34 32 30]
[25 24 31 30 18 17 16]
<class 'list'>
<class 'numpy.ndarray'>


In [17]:
np_scores[3:-2]

array([29, 17])

In [18]:
np_scores >= 18

array([ True,  True,  True,  True, False, False, False])

In [19]:
np_scores[np_scores>=18]

array([24, 23, 30, 29])
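
Boolean masks like the one above can also be combined. The sketch below reuses the same scores; the thresholds and the "pass"/"fail" labels are illustrative assumptions. Conditions are combined with `&` (and) and `|` (or), and each condition needs its own parentheses.

```python
import numpy as np

np_scores = np.array([24, 23, 30, 29, 17, 16, 15])

# Combine two conditions element-wise
in_range = (np_scores >= 18) & (np_scores < 30)
print(np_scores[in_range])      # [24 23 29]

# True counts as 1, so summing a mask counts the matching elements
print(np.sum(np_scores >= 18))  # 4

# np.where picks a value per element depending on the condition
labels = np.where(np_scores >= 18, "pass", "fail")
print(labels)
```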

## 2D NumPy arrays


In addition to one-dimensional arrays, NumPy supports **two-dimensional arrays**, which are commonly used to represent **matrices** or **tabular data**. A 2D array consists of rows and columns, making it ideal for applications such as:

- Mathematical operations like matrix multiplication.
- Representing datasets with multiple features.
- Image processing, where images are stored as 2D arrays of pixel values.

This section will cover how to create, manipulate, and perform operations on 2D NumPy arrays, helping you work with structured data efficiently.


In [21]:
scores_1 = [24, 23, 30, 29, 17, 16, 15]
scores_2 = [15, 26, 24, 25, 18, 30, 23]

np_2d_scores = np.array([scores_1,scores_2])

np_2d_scores.shape  

(2, 7)

In [22]:
np_2d_scores[0][2]  # chained indexing: row 0, then element 2
np_2d_scores[0,2]   # equivalent single index: row 0, column 2

30

In [23]:
np_2d_scores[:,1:3] # all rows, columns 1 and 2

array([[23, 30],
       [26, 24]])

In [24]:
np_2d_scores[1,:] # all columns, second row (index 1)

array([15, 26, 24, 25, 18, 30, 23])
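
The applications listed at the start of this section (per-row or per-column statistics, matrix multiplication) can be sketched on the same data. This is a minimal illustration; `row_means`, `col_max`, and `gram` are names introduced here.

```python
import numpy as np

np_2d_scores = np.array([[24, 23, 30, 29, 17, 16, 15],
                         [15, 26, 24, 25, 18, 30, 23]])

print(np_2d_scores.ndim)   # 2
print(np_2d_scores.shape)  # (2, 7)

# axis=1 aggregates along the columns, giving one value per row
row_means = np_2d_scores.mean(axis=1)
print(row_means)           # [22. 23.]

# axis=0 aggregates along the rows, giving one value per column
col_max = np_2d_scores.max(axis=0)
print(col_max)             # [24 26 30 29 18 30 23]

# Matrix multiplication with @ : (2, 7) @ (7, 2) -> (2, 2)
gram = np_2d_scores @ np_2d_scores.T
print(gram.shape)          # (2, 2)
```

The `axis` argument names the dimension that is collapsed, which is why `axis=1` yields one value per row.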

## Looping over a NumPy array


When working with NumPy arrays, **looping** over elements or rows is sometimes necessary, even though NumPy's vectorized operations are generally more efficient. Here's how looping behaves with 1D and 2D arrays, and how to handle nested structures like 2D arrays.

### **1. Looping Over a 1D Array**

In [27]:
# Example: 1D NumPy array
scores = np.array([10, 20, 30, 40])

for score in scores:
    print(score)  # Prints each element in the 1D array

10
20
30
40


### **2. Looping Over a 2D Array**

Direct iteration over a 2D NumPy array (`np_2d_scores`) behaves differently from iteration over a 1D array.

**Row-by-Row Iteration**

If you loop directly over a 2D array, the loop iterates over the rows, not the individual elements:

In [28]:
for row in np_2d_scores:  # on a 2D array, each iteration yields a whole row
    print(row)

[24 23 30 29 17 16 15]
[15 26 24 25 18 30 23]


To loop through individual elements in a 2D array, you need to use a nested loop or leverage `np.nditer()`:

In [29]:
for val in np.nditer(np_2d_scores):
    print(val) 

24
23
30
29
17
16
15
15
26
24
25
18
30
23


In [30]:
for row in np_2d_scores:
    for x in row:
        print(x)

24
23
30
29
17
16
15
15
26
24
25
18
30
23
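
Two further options are worth knowing, shown here as a hedged sketch on a smaller, hypothetical array: `np.ndenumerate` yields the index together with each value, and the `.flat` attribute iterates over all elements in row-major order. When only an aggregate is needed, a vectorized call avoids the loop entirely.

```python
import numpy as np

np_2d_scores = np.array([[24, 23, 30],
                         [15, 26, 24]])

# Index and value together: (row, column) and the element
for (i, j), value in np.ndenumerate(np_2d_scores):
    print(i, j, value)

# .flat iterates over every element as if the array were 1D
flat_values = [int(v) for v in np_2d_scores.flat]
print(flat_values)          # [24, 23, 30, 15, 26, 24]

# No loop needed for an aggregate
print(np_2d_scores.sum())   # 142
```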


## Exercise

<div style="border: 2px solid darkblue; padding: 10px; background-color: #89D9F5;">

### **Exercise: Working with NumPy Arrays**

**Task:**

- **Create** a random array of length 100.  
  *(Hint: use `np.random.rand(100)`)*

- **Sort** your array.

- **Compute** the **mean**, **median**, and **sample variance** of the array.



</div>

### Solution
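
The solution relies on `ddof=1` to obtain the sample variance. As a quick sanity check, the sketch below compares `np.var` with the two formulas written out by hand on a small, made-up dataset (the values in `data` are illustrative only): the population variance divides by `n`, the sample variance by `n - 1`.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = data.size
mean = data.mean()

# Sum of squared deviations from the mean
ss = np.sum((data - mean) ** 2)

pop_var = ss / n           # population variance (np.var default, ddof=0)
sample_var = ss / (n - 1)  # sample variance (ddof=1)

print(pop_var, np.var(data))
print(sample_var, np.var(data, ddof=1))
```

Each printed pair should agree up to floating-point rounding.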

In [34]:
# Create a random array of length 100
random_array = np.random.rand(100)
print("Original Array:")
print(random_array)

# Sort the array
sorted_array = np.sort(random_array)
print("\nSorted Array:")
print(sorted_array)

# Compute the mean of the array
mean_value = np.mean(random_array)
print("\nMean:", mean_value)

# Compute the median of the array
median_value = np.median(random_array)
print("Median:", median_value)

# Compute the sample variance of the array
# Note: np.var() by default computes the population variance.
# To compute the sample variance, use ddof=1.
sample_variance = np.var(random_array, ddof=1)
print("Sample Variance:", sample_variance)


Original Array:
[0.43095639 0.32396466 0.48406012 0.44689359 0.83092119 0.84612499
 0.71644489 0.81441378 0.40412578 0.50047544 0.23045233 0.49821645
 0.47539404 0.25195161 0.51202798 0.07977045 0.15041631 0.04718655
 0.89593196 0.90918937 0.84215439 0.30274484 0.16675647 0.16265006
 0.51513919 0.33377953 0.93562502 0.30791154 0.37002742 0.19374875
 0.92629693 0.81204471 0.36979264 0.48752335 0.86235152 0.44597167
 0.05156159 0.64929207 0.15964168 0.73119284 0.63079934 0.13260202
 0.2659712  0.9122396  0.88975573 0.76195988 0.81829801 0.96647694
 0.96323577 0.74668841 0.55534658 0.95533698 0.82930465 0.57850232
 0.01943188 0.48503132 0.83651132 0.79043093 0.58549271 0.0028428
 0.18773581 0.96532707 0.08242148 0.61022036 0.84370054 0.44304735
 0.20495544 0.33763894 0.94175568 0.31483616 0.59585744 0.8332324
 0.70346269 0.17249132 0.38379778 0.5431434  0.57842513 0.74015237
 0.65561938 0.22127278 0.92280371 0.83992566 0.25403811 0.85297394
 0.74845129 0.73914421 0.78972429 0.14349095 0.4