# Lesson 9. NumPy and Pandas 

NumPy (Numerical Python) is a Python library used for scientific computing. It can also be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined, which lets NumPy integrate seamlessly and quickly with a wide variety of databases.

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Let's dive into each of these libraries.

## Theory

### NumPy in Python

NumPy (short for 'Numerical Python') is a Python package that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.

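In a notebook cell, we can install NumPy using pip (the leading `!` runs the line as a shell command; drop it when working in a terminal):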

```python
!pip install numpy
```

Once NumPy is installed, you can import and use it. Here is a simple example:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

print(arr)  # Output: [1 2 3 4 5]
print(type(arr))  # Output: <class 'numpy.ndarray'>
```

Key features of NumPy:

1. `ndarray`: An efficient multi-dimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.
2. Mathematical functions for fast operations on entire arrays of data without having to write loops.
3. Tools for reading/writing array data to disk and working with memory-mapped files.
4. Linear algebra, random number generation, and Fourier transform capabilities.
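
To make the first two features more concrete, here is a minimal sketch of broadcasting and of a whole-array mathematical function. The variable names (`matrix`, `offsets`, `roots`) are made up for this illustration.

```python
import numpy as np

# A 2x3 array plus a length-3 array: broadcasting applies the
# 1-D array to every row of the 2-D array.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
offsets = np.array([10, 20, 30])
shifted = matrix + offsets
print(shifted)
# [[11 22 33]
#  [14 25 36]]

# A mathematical function applied to the whole array, no loop needed.
roots = np.sqrt(np.array([1, 4, 9]))
print(roots)  # [1. 2. 3.]
```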



**Creating Arrays**: You can create arrays using the `numpy.array()` function.

```python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr)  # Output: [1 2 3 4 5]
```

**Arithmetic Operations**: You can perform element-wise addition, subtraction, multiplication, and division on arrays.

```python
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)  # Output: [5 7 9]
```
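
The other element-wise operations follow the same pattern. The sketch below reuses `arr1` and `arr2` from the example above; the result names are chosen only for illustration.

```python
import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

difference = arr2 - arr1   # element-wise subtraction
product = arr1 * arr2      # element-wise multiplication (not a matrix product)
quotient = arr2 / arr1     # element-wise true division, the result is float
scaled = arr1 * 10         # a scalar is applied to every element

print(difference)          # [3 3 3]
print(product)             # [ 4 10 18]
print(quotient.tolist())   # [4.0, 2.5, 2.0]
print(scaled)              # [10 20 30]
```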

**Reshaping Arrays**: The `reshape()` function changes the shape of an array, for example its number of rows and columns, as long as the total number of elements stays the same.

```python
arr = np.array([1, 2, 3, 4, 5, 6])
new_arr = arr.reshape(2, 3)
print(new_arr)
'''
Output:
[[1 2 3]
 [4 5 6]]
'''
```
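
Two details of `reshape()` are worth trying out: passing `-1` lets NumPy work out one dimension for you, and a shape that does not match the number of elements raises a `ValueError`. A small sketch (the name `tall` is arbitrary):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to infer that dimension from the array's size.
tall = arr.reshape(3, -1)
print(tall.shape)  # (3, 2)

# The new shape must hold exactly the same number of elements.
try:
    arr.reshape(4, 2)  # 8 slots for 6 elements
except ValueError as err:
    print("reshape failed:", err)
```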

**Indexing and Slicing**: You can access a single element by its index (starting at 0) or a range of elements with a slice, whose end index is excluded.


```python
arr = np.array([1, 2, 3, 4, 5])
print(arr[0])  # Output: 1
print(arr[1:3])  # Output: [2 3]
```
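
Indexing also works with negative indices, Boolean masks, and more than one dimension. The following sketch shows each of these; the 3x3 array `grid` is invented for the example.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr[-1])       # 5, negative indices count from the end
print(arr[arr > 2])  # [3 4 5], a Boolean mask keeps the matching elements

# Numbers 1..9 arranged in three rows of three.
grid = np.arange(1, 10).reshape(3, 3)
print(grid[1, 2])    # 6, row 1, column 2
print(grid[:, 0])    # [1 4 7], first column of every row
```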

**Statistical Functions**: NumPy provides functions like `mean()`, `median()`, `std()`, etc.


```python
arr = np.array([1, 2, 3, 4, 5])
print(np.mean(arr))  # Output: 3.0
```
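
The other functions mentioned above work the same way. Here is a short sketch that also shows the `axis` argument for 2-D arrays; note that `np.std()` computes the population standard deviation unless you pass `ddof=1`.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(np.median(arr))                        # 3.0
print(round(float(np.std(arr)), 4))          # 1.4142, population standard deviation
print(round(float(np.std(arr, ddof=1)), 4))  # 1.5811, sample standard deviation

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
col_means = matrix.mean(axis=0)  # one mean per column
row_means = matrix.mean(axis=1)  # one mean per row
print(col_means.tolist())        # [2.5, 3.5, 4.5]
print(row_means.tolist())        # [2.0, 5.0]
```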





Why use a NumPy array instead of a built-in list? The table below compares the two:

| **Feature**          | **Python List**                                             | **NumPy Array**                                                                                  |
|----------------------|-------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| **Memory Usage**     |                     Higher memory usage.                    |                            Lower memory usage due to type uniformity.                            |
| **Performance**      |     Slower due to dynamic type checking during runtime.     |                   Faster due to static typing and contiguous memory allocation.                  |
| **Functionality**    | Basic operations like appending, inserting, removing items. |     Advanced operations like vector addition, matrix multiplication, broadcasting, and more.     |
| **Storage**          |             Can store heterogeneous data types.             |  Stores homogeneous data types, which is efficient for mathematical and scientific computations. |
| **Size Flexibility** |        Size is dynamic, can grow or shrink as needed.       |                           Size is static, cannot change after creation.                          |
| **Operations**       |       Arithmetic operations require explicit looping.       | Supports element-wise operations and operations between differently sized arrays (broadcasting). |
| **Integration**      | Usually converted to arrays before scientific libraries process them. |       Directly compatible with many scientific libraries (SciPy, Matplotlib, Pandas, etc.).      |
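
A few rows of the table can be checked directly. The sketch below is only an illustration with made-up variable names: it compares what `+` does for lists and arrays, shows the single common data type of an array, and shows that `np.append()` returns a new array instead of growing the original.

```python
import numpy as np

py_list = [1, 2, 3]
np_arr = np.array([1, 2, 3])

# "+" concatenates lists but adds arrays element by element.
print(py_list + py_list)  # [1, 2, 3, 1, 2, 3]
print(np_arr + np_arr)    # [2 4 6]

# Mixed input is converted to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# np.append returns a new array; the original keeps its size.
longer = np.append(np_arr, 4)
print(np_arr.size, longer.size)  # 3 4
```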

Check out the [NumPy official documentation](https://numpy.org/doc/stable/index.html) or search other sites to learn about more functions and methods. There are far too many to cover in one notebook.

### Pandas in Python

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

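In a notebook cell, we can install Pandas using pip (the leading `!` runs the line as a shell command; drop it when working in a terminal):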

```python
!pip install pandas
```

Once Pandas is installed, you can import and use it. Here is a simple example:

```python
import pandas as pd

data = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)

print(purchases)
```

This will output:

```
   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2
```

**Creating a DataFrame**: You can create a DataFrame using the `pandas.DataFrame()` function.



```python
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [25, 34, 57]}
df = pd.DataFrame(data)
print(df)
```
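
Once a DataFrame exists, a few attributes and methods help you check what it contains. A minimal sketch using the same data:

```python
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [25, 34, 57]}
df = pd.DataFrame(data)

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['Name', 'Age']
print(df.head(2))        # the first two rows
print(df.dtypes)         # column types (exact names depend on the pandas version)
```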


**Reading Data**: You can read data from a CSV file using `pandas.read_csv()`.


```python
df = pd.read_csv('file.csv')
```
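
If you do not have a CSV file at hand, you can try `read_csv()` on text held in memory. In the sketch below, `io.StringIO` stands in for a file on disk, and the CSV content is hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO behaves like an open text file.
csv_text = """Name,Age
John,25
Anna,34
Peter,57
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)         # (3, 2)
print(df['Age'].sum())  # 116
```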

**Data Selection**: You can select data using column names, or using conditions.


```python
ages = df['Age']  # select the 'Age' column
old_people = df[df['Age'] > 30]  # select people older than 30
```
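
Selections can also combine several conditions or pick rows and columns in one step. A short sketch, assuming the `Name`/`Age` data from the earlier example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [25, 34, 57]})

# Several columns at once: pass a list of column names.
subset = df[['Name', 'Age']]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
middle = df[(df['Age'] > 30) & (df['Age'] < 50)]
print(middle['Name'].tolist())  # ['Anna']

# .loc selects rows by condition and columns by label in one step.
names_over_30 = df.loc[df['Age'] > 30, 'Name']
print(names_over_30.tolist())   # ['Anna', 'Peter']
```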

**Data Manipulation**: You can manipulate data using functions like `groupby()`, `merge()`, etc.


```python
average_age = df['Age'].mean()  # calculate the average age
grouped = df.groupby('Age').count()  # group by age and count
```
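
Grouping is usually more informative on a column with repeated values, and `merge()` combines two tables on a shared column. The sketch below uses two hypothetical tables (`people` and `countries`) invented for this illustration.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration.
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Maria'],
    'City': ['Oslo', 'Rome', 'Oslo', 'Rome'],
    'Age': [25, 34, 57, 40],
})
countries = pd.DataFrame({
    'City': ['Oslo', 'Rome'],
    'Country': ['Norway', 'Italy'],
})

# Average age per city.
mean_age = people.groupby('City')['Age'].mean()
print(mean_age.to_dict())  # {'Oslo': 41.0, 'Rome': 37.0}

# Add each person's country by matching rows on the shared 'City' column.
merged = people.merge(countries, on='City', how='left')
print(merged['Country'].tolist())  # ['Norway', 'Italy', 'Norway', 'Italy']
```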

**Handling Missing Data**: Pandas provides functions like `isnull()`, `notnull()`, `dropna()`, `fillna()`, etc.


```python
df.isnull()   # returns a Boolean DataFrame that marks the null values
df.dropna()   # returns a new DataFrame without the rows that contain null values
df.fillna(x)  # returns a new DataFrame with null values replaced by x (a value you choose)
```
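
Here is a small runnable sketch of these functions on hypothetical data with two missing values. It assumes you want to fill missing ages with the column mean, which is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one missing name.
df = pd.DataFrame({
    'Name': ['John', 'Anna', None],
    'Age': [25, np.nan, 57],
})

# Count the missing values in each column.
print(df.isnull().sum().to_dict())  # {'Name': 1, 'Age': 1}

# dropna() and fillna() return new DataFrames; df itself is unchanged.
complete_rows = df.dropna()
print(len(complete_rows))           # 1

# A dict gives a separate fill value for each column.
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled['Age'].tolist())       # [25.0, 41.0, 57.0]
```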




Check out the [Pandas official documentation](https://pandas.pydata.org/docs/) or search other sites to learn about more functions and methods. There are far too many to cover in one notebook.

## Practice

### 1. NumPy Array Operations

Given a NumPy array `arr = np.array([1, 2, 3, 4, 5])`, write a function that squares each element in the array and returns a new array with the squared values. Do not use any form of looping (`for`, `while`, etc.).

```python
import numpy as np

def square_elements(arr):
    # your code here
    pass

arr = np.array([1, 2, 3, 4, 5])
squared_arr = square_elements(arr)

assert np.array_equal(squared_arr, np.array([1, 4, 9, 16, 25]))
```


### 2. Pandas DataFrame Creation

Given two lists `names = ['Alice', 'Bob', 'Charlie', 'David']` and `ages = [25, 32, 18, 47]`, create a Pandas DataFrame that has a `'Name'` column with the names and an `'Age'` column with the ages.

```python
import pandas as pd

def create_dataframe(names, ages):
    # your code here
    pass

names = ['Alice', 'Bob', 'Charlie', 'David']
ages = [25, 32, 18, 47]
df = create_dataframe(names, ages)

assert df.equals(pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 32, 18, 47]}))
```


### 3.  Data Filtering with Pandas

Given a Pandas DataFrame `df` that has a `'Salary'` column, write a function that returns a new DataFrame containing only the rows in which the salary is above a given threshold `x`.

```python
import pandas as pd

def filter_by_salary(df, x):
    # your code here
    pass

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Salary': [50000, 60000, 70000, 40000]}
df = pd.DataFrame(data)
filtered_df = filter_by_salary(df, 50000)

assert filtered_df.equals(pd.DataFrame({'Name': ['Bob', 'Charlie'], 'Salary': [60000, 70000]}, index=[1, 2]))
```


### 4. Matrix Multiplication with NumPy

Given two NumPy arrays:



```python
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
```


Write a function that performs matrix multiplication (*dot product*) of A and B. The function should return a new NumPy array with the result.



```python
import numpy as np

def matrix_multiply(A, B):
    # your code here
    pass

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
result = matrix_multiply(A, B)

assert np.array_equal(result, np.array([[19, 22], [43, 50]]))
```


