# 🔢 NumPy

## 📝 Notes

**NumPy**, which stands for **Numerical Python**, is foundational for numerical computing in Python.

- 🎯 Designed for scientific computation and used extensively in data analysis because it handles large, multi-dimensional arrays and matrices efficiently
- 🏗️ Serves as the basis for many other Python data science libraries, including Pandas, thanks to its fast numerical computations
- 🚀 Common uses of NumPy in data analysis:
  - Array operations
  - Linear algebra
  - Statistical functions
  - Random number generation

## 💡 Importance

- 📊 **Important for data analysis**
- 🗂️ Stores data in a structured format, making it easier to organize, access, and manipulate
- 🔄 The Pandas library is built on top of NumPy
- 📚 For more information on NumPy, see the [official documentation](https://numpy.org/doc/)

---

# 🧮 Array Operations

## 📝 Notes

- **Creation**: `import numpy as np; np.array([1, 2, 3])`
- **Basic Operations**: `+`, `-`, `/`, `*` performed element-wise
- **Slicing**: `array[1:3]`
- **Boolean Indexing**: `array[array > 0]`
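
The element-wise behaviour noted above also applies when both operands are arrays of the same shape, not only when one of them is a single number. The following is a minimal sketch; `min_years` and `max_years` are hypothetical arrays invented for illustration.

```python
import numpy as np

# Hypothetical data: minimum and maximum years of experience for three listings
min_years = np.array([1, 2, 3])
max_years = np.array([3, 5, 8])

# Arithmetic between two same-shaped arrays pairs up elements by position
range_years = max_years - min_years            # array([2, 3, 5])
midpoint_years = (min_years + max_years) / 2   # array([2. , 3.5, 5.5])

print(range_years)
print(midpoint_years)
```

Note that `/` always produces a float array, even when both inputs hold integers.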

## 💡 Examples

Let's start with creating a fictional array with the number of years of experience required for five different data science job listings.

### Create an Array


In [51]:
# Import Package
import numpy as np

In [52]:
# Example: An array representing the number of years of experience required for five different data science job listings.
years_of_experience = np.array([1, 2, 3, 4, 5])

#### Basic Operations
Applying mathematical operations to the entire array.

**Addition**

Adding 1 year to the experience requirements for each job listing.

In [53]:
years_of_experience_plus_one = years_of_experience + 1
years_of_experience_plus_one

array([2, 3, 4, 5, 6])

**Subtraction**

Subtracting 1 year from the experience requirements for each job listing.

In [54]:
years_of_experience_minus_one = years_of_experience - 1
years_of_experience_minus_one

array([0, 1, 2, 3, 4])

**Division**

Dividing the experience requirements by 2 (maybe for a junior role).

In [55]:
years_of_experience_half = years_of_experience / 2
years_of_experience_half

array([0.5, 1. , 1.5, 2. , 2.5])

**Multiplication**

Doubling the experience requirements for each job listing.

In [56]:
years_of_experience_mul_2 = years_of_experience * 2
years_of_experience_mul_2

array([ 2,  4,  6,  8, 10])

**Slicing**

Get a subset of the array. Let's get the experience requirements for the second and third job postings.

In [57]:
# Example: Selecting the experience requirement for the second and third job listings.
second_and_third_jobs_experience = years_of_experience[1:3]
second_and_third_jobs_experience

array([2, 3])

**Boolean Indexing**

Get the items from the array that meet a specific condition. Here, keep only the job postings that require more than 2 years of experience.

In [58]:
# Example: Selecting only those job listings that require more than 2 years of experience.
jobs_with_more_than_two_years_exp = years_of_experience[years_of_experience > 2]
jobs_with_more_than_two_years_exp

array([3, 4, 5])
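
Boolean indexing can also combine several conditions. A short sketch, assuming the same five-element array as above: each comparison is wrapped in parentheses and joined with `&` (and) or `|` (or), because Python's `and`/`or` keywords do not work element-wise on arrays. The variable names `mid_level` and `entry_or_senior` are illustrative only.

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

# Each comparison needs its own parentheses; & means "and", | means "or"
mid_level = years_of_experience[
    (years_of_experience >= 2) & (years_of_experience <= 4)
]
entry_or_senior = years_of_experience[
    (years_of_experience < 2) | (years_of_experience > 4)
]

print(mid_level)        # [2 3 4]
print(entry_or_senior)  # [1 5]
```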

# 🧮 Math Operations with NumPy

## 📝 Notes

### 📊 Aggregate Functions:
- **`sum`**: Sum of all elements
- **`prod`**: Product of all elements  
- **`cumsum`**: Cumulative sum
- **`cumprod`**: Cumulative product

### 🔢 Mathematical Operations (we won't be going into these during the course):
- **`sqrt`**: Square root
- **`exp`**: Exponential
- **`log`**: Logarithm
- **`sin`**: Sine
- **`cos`**: Cosine
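
Although these functions are outside the scope of the course, a brief sketch may help show that they follow the same element-wise pattern as the arithmetic operators. The input values here are arbitrary and chosen only so the results are easy to check.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(values)            # square root of each element: [1., 2., 3.]
round_trip = np.log(np.exp(values))  # natural log undoes exp, giving back values

angles = np.array([0.0, np.pi / 2])  # angles are in radians
sines = np.sin(angles)               # approximately [0, 1]
cosines = np.cos(angles)             # approximately [1, 0]

print(roots)
print(round_trip)
print(sines, cosines)
```

Results such as `cos(pi / 2)` come out as a tiny non-zero float rather than exactly 0, so comparisons are best done with `np.allclose` or `np.isclose`.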

## 💡 Examples

First, let's create a list of 10 yearly salaries for a Senior Data Analyst job. We use the `random` library to draw random integers between 100000 and 150000 (inclusive), and a list comprehension to repeat the draw 10 times.

In [59]:
import random

salary = [random.randint(100000,150000) for num in range(10)]
salary

[147963,
 130915,
 107286,
 122220,
 134806,
 121132,
 124062,
 145990,
 140014,
 122352]

Now we'll convert this list into a NumPy array.

In [60]:
salary_array = np.array(salary)
salary_array

array([147963, 130915, 107286, 122220, 134806, 121132, 124062, 145990,
       140014, 122352])

**Sum**

Calculate the total sum of the elements in `salary_array`.

In [61]:
total_sum_salaries = np.sum(salary_array)
total_sum_salaries

np.int64(1296740)

**Prod**

Calculate the product of the elements in `salary_array`.

In [62]:
# Conceptual example: a product of salaries has no practical meaning, and the result overflows int64
product_salaries = np.prod(salary_array)
print(product_salaries)

-474709517974069248


**Cumsum (Cumulative Sum)**

Calculates the cumulative sum of elements of the `salary_array`. It calculates the cumulative sum at each index, meaning each element in the output array is the sum of all preceding elements including the current one from the original array.

For the `salary_array`:

- First element of cumsum is 147963 (just the first element)
- Second element is 147963 + 130915 = 278878
- Third element is 278878 + 107286 = 386164
- And so on...

In [63]:
cumulative_sum_salaries = np.cumsum(salary_array)
cumulative_sum_salaries

array([ 147963,  278878,  386164,  508384,  643190,  764322,  888384,
       1034374, 1174388, 1296740])

**Cumprod (Cumulative Product)**

Calculates the cumulative product of elements of the `salary_array`. It calculates the cumulative product at each index, meaning each element in the output array is the product of all preceding elements including the current one from the original array.

For the `salary_array`:

- First element of cumprod is 147963
- Second element is 147963 * 130915 = 19370576145
- Third element is (147963 * 130915) * 107286 = 2078191632292470
- And so on...

In [64]:
# Cumulative product of salary_array (conceptual example)
cumulative_prod_salaries = np.cumprod(salary_array)
cumulative_prod_salaries

array([              147963,          19370576145,     2078191632292470,
       -4257835733148039224,  7084754791832452912, -7357096940783694784,
        8736099641871667072, -6251795360013337344, -3975751241771600384,
        -474709517974069248])

**Note**: The product and cumulative product quickly exceed the largest value a 64-bit integer (`int64`) can hold, 9,223,372,036,854,775,807. Unlike plain Python integers, NumPy integers have a fixed width, so the result overflows and wraps around. This is why some numbers appear as negative.
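
One way to sidestep the overflow is sketched below, assuming the same ten salaries. It is not the only option: `math.prod` works on the plain Python list and stays exact, while asking `np.prod` to accumulate in `float64` avoids the wrap-around at the cost of exactness.

```python
import math
import numpy as np

salary = [147963, 130915, 107286, 122220, 134806,
          121132, 124062, 145990, 140014, 122352]
salary_array = np.array(salary, dtype=np.int64)

# Largest value an int64 can represent: 9223372036854775807
largest_int64 = np.iinfo(np.int64).max

# Python integers have arbitrary precision, so this product is exact
exact_product = math.prod(salary)

# float64 does not wrap around; it trades exactness for range
approx_product = np.prod(salary_array, dtype=np.float64)

print(exact_product > largest_int64)  # True: the true product does not fit in int64
print(approx_product)
```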

# 📊 Statistics Operations with NumPy

## 📝 Notes

Many of these are also available in Pandas, because Pandas is built on top of NumPy. Here are a few examples, but we won't dive deep into them.

### 📈 Statistical Functions:
- **`mean`**: Average value
- **`median`**: Middle value
- **`var`**: Variance
- **`std`**: Standard deviation
- **`min`**: Minimum value
- **`max`**: Maximum value

## 🎯 Mean

Calculate the average salary in the salary_array.


In [65]:
average_salary = np.mean(salary_array)
average_salary

np.float64(129674.0)

## 🎯 Median

Find the median salary in the salary_array.

In [66]:
median_salary = np.median(salary_array)
median_salary

np.float64(127488.5)

## 📏 Var
Calculate the sample variance of the salary_array.

In [67]:
salary_variance = np.var(salary_array, ddof=1)  # ddof=1 for sample variance
salary_variance

np.float64(161149903.7777778)

## 📏 Standard Deviation
Calculate the sample standard deviation of the salary_array.

In [68]:
# Standard deviation of salary_array
salary_std_dev = np.std(salary_array, ddof=1)  # ddof=1 for sample standard deviation
salary_std_dev

np.float64(12694.4832024694)
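
The `ddof` argument used in the two cells above controls the divisor. The sketch below, using the same ten salaries, illustrates the relationship: with the default `ddof=0` NumPy divides by `n` (population variance), with `ddof=1` it divides by `n - 1` (sample variance), and the standard deviation is the square root of the matching variance.

```python
import numpy as np

salary_array = np.array([147963, 130915, 107286, 122220, 134806,
                         121132, 124062, 145990, 140014, 122352])
n = salary_array.size

population_var = np.var(salary_array)       # ddof=0 (default): divide by n
sample_var = np.var(salary_array, ddof=1)   # divide by n - 1
sample_std = np.std(salary_array, ddof=1)

# The two variances differ only by the factor n / (n - 1)
print(np.isclose(sample_var, population_var * n / (n - 1)))  # True
# The standard deviation is the square root of the variance
print(np.isclose(sample_std, np.sqrt(sample_var)))           # True
```

Pandas defaults to `ddof=1` for `var()` and `std()`, so the explicit `ddof=1` here is what makes the NumPy results line up with Pandas.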

## 📉 Minimum
Find the minimum element in the salary_array.

In [69]:
# Minimum of salary_array
min_salary = np.min(salary_array)
min_salary

np.int64(107286)

## 📈 Maximum
Find the maximum element in the salary_array.

In [70]:
# Maximum of salary_array
max_salary = np.max(salary_array)
max_salary

np.int64(147963)

# 🔢 NaN Values in NumPy

## 📝 Notes

- Generate NaN values using `np.nan`
- `np.nan` is used in NumPy (and, by extension, Pandas) to represent missing or undefined data
- Helpful because it:
  - Handles missing data
  - Keeps computations running: operations involving `np.nan` return `np.nan` instead of raising an error
  - Lets you filter out or fill in missing data with methods we'll use often in the pandas library, like `dropna()`, `fillna()`, `isna()`, or `notna()`
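
To make the propagation behaviour concrete, here is a small sketch with a made-up three-element array. It assumes you want to compare a regular aggregate with one of NumPy's NaN-aware variants such as `np.nanmean`.

```python
import numpy as np

# Hypothetical salaries with one missing value
salaries = np.array([120000.0, np.nan, 140000.0])

plain_mean = np.mean(salaries)         # nan: the missing value propagates
nan_aware_mean = np.nanmean(salaries)  # 130000.0: NaN entries are skipped

# NaN never compares equal to anything, including itself,
# which is why np.isnan (not ==) is used to detect it
nan_equals_itself = (np.nan == np.nan)  # False

print(plain_mean, nan_aware_mean, nan_equals_itself)
```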

## 💡 Examples

Below are a few examples of what you can do.

### 🎯 Insert Missing Values

To insert missing values into your array intentionally, perhaps to indicate that data is expected but not yet available, use `np.nan`.


In [71]:
salary_with_nan = np.array([123124, np.nan, 145000, 128000, 110000, 149999, np.nan, 135000, 115000, 140000], dtype=float)
salary_with_nan

array([123124.,     nan, 145000., 128000., 110000., 149999.,     nan,
       135000., 115000., 140000.])

#### Replace Values with NaN
If you want to replace existing values with `np.nan`, for example because certain values are considered invalid or outliers, assign `np.nan` through a boolean mask:

In [72]:
salary_with_nan[salary_with_nan < 130000] = np.nan
salary_with_nan

array([    nan,     nan, 145000.,     nan,     nan, 149999.,     nan,
       135000.,     nan, 140000.])
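
Once values have been replaced with NaN, a common follow-up is to count or drop them. The sketch below assumes an array in the same state as the output above and uses `np.isnan` to build a boolean mask; `~` inverts the mask.

```python
import numpy as np

salary_with_nan = np.array([np.nan, np.nan, 145000.0, np.nan, np.nan,
                            149999.0, np.nan, 135000.0, np.nan, 140000.0])

# True where the value is missing
missing_mask = np.isnan(salary_with_nan)

missing_count = missing_mask.sum()              # 6 missing entries
valid_salaries = salary_with_nan[~missing_mask]  # keep only the non-NaN values

print(missing_count)
print(valid_salaries)
```

This mirrors what `isna()` and `dropna()` do in Pandas.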

# 🎯 NumPy Where Function

## 📝 Notes

- `np.where` checks elements of an array against a condition and assigns a value for `True` and another for `False`
- **Syntax**: `np.where(condition, x, y)`
- It's commonly used to conditionally replace array elements

## 💡 Example

We're going to replace all values in `salary_array` that are less than 120,000 with 120,000 (to apply a minimum salary threshold), using the syntax `np.where(condition, x, y)`: where the condition is true the result takes `x`, and otherwise it takes `y`.


In [73]:
salary_array = np.where(salary_array < 120000, 120000, salary_array)
salary_array

array([147963, 130915, 120000, 122220, 134806, 121132, 124062, 145990,
       140014, 122352])
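
`np.where` returns a new array rather than modifying its input, and `x` and `y` can be any values, not just numbers. The sketch below uses a shortened, hypothetical salary array and an arbitrary 130,000 cutoff to label each salary.

```python
import numpy as np

salary_array = np.array([147963, 130915, 107286, 122220, 134806])

# Apply a floor of 120,000; the result is a new array
floored = np.where(salary_array < 120000, 120000, salary_array)

# Hypothetical cutoff of 130,000, chosen only for illustration
salary_band = np.where(salary_array >= 130000, "high", "standard")

print(floored)          # [147963 130915 120000 122220 134806]
print(salary_band)      # ['high' 'high' 'standard' 'standard' 'high']
print(salary_array[2])  # 107286: the original array is unchanged
```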

# 🎲 Random Sampling with NumPy

## 📝 Notes

- Generate random numbers or samples
- `np.random.normal` - draws random samples from a normal (Gaussian) distribution

### Specify the Arguments:
- **`loc`**: This is the mean (μ) of the normal distribution
- **`scale`**: This is the standard deviation (σ) of the normal distribution, representing the dispersion from the mean
- **`size`**: This defines the number of random samples to draw; in the example below it is set to match the number of salaries

**Syntax**: `np.random.normal(loc=0.0, scale=1.0, size=None)`

### A few other random sampling functions:
- `np.random.rand`
- `np.random.randn` 
- `np.random.randint`
- `np.random.random`
- `np.random.uniform`
- `np.random.binomial`
- `np.random.poisson`

## 💡 Example

Let's add some random noise to the `salary_array` to simulate salary variations. We can generate random values from a normal distribution with a mean of 0 and a standard deviation of 5000, then add these values to the salaries.

**Why?** This can be used to simulate salary data for job postings if actual salary data isn't available, for instance, in modeling or simulation scenarios.

In [74]:
# Generate numbers based on normal distribution 
noise = np.random.normal(0, 5000, salary_array.size)
# Add these numbers to the salary_array
salary_array_with_noise = salary_array + noise
salary_array_with_noise

array([145606.01170513, 124894.81412902, 128349.48812357, 115198.57938196,
       132295.92634846, 122281.51377284, 119943.73920553, 143760.57105855,
       145276.8850143 , 126267.08310242])
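
Because the noise is random, the output above changes on every run. If reproducible results are wanted, one option is NumPy's `Generator` interface, created with `np.random.default_rng` and a fixed seed. This is a sketch of an alternative to the `np.random.normal` call above, not a replacement the course prescribes; the seed value 42 is arbitrary.

```python
import numpy as np

salary_array = np.array([147963, 130915, 120000, 122220, 134806,
                         121132, 124062, 145990, 140014, 122352])

# A seeded generator produces the same sequence of draws every time
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0, scale=5000, size=salary_array.size)
salary_array_with_noise = salary_array + noise

# Re-creating the generator with the same seed reproduces the same noise
rng_again = np.random.default_rng(seed=42)
noise_again = rng_again.normal(loc=0, scale=5000, size=salary_array.size)

print(np.array_equal(noise, noise_again))  # True
```

Adding float noise to an integer array yields a float array, which is why the salaries above gain decimal places.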