# 22 NumPy

NumPy is one of the most powerful and useful libraries for data analytics.

In [44]:
# create a list of 1,000,000 salaries ranging from 50,000 to 150,000

import random # importing the random module to generate random values within specified parameters

salary_list = [random.randint(50000, 150000) for _ in range(1_000_000)] # creating the list of salaries using list comprehension!

In [45]:
import statistics

In [46]:
%%timeit
# use the statistics library to compute the mean of the list of salaries
# (%%timeit is a cell magic, so it must be the first line of the cell)

statistics.mean(salary_list)


In [11]:
# now let's do a similar thing with numpy to see which one is faster

In [12]:
import numpy as np

In [13]:
%%timeit

np.mean(salary_list)

# NumPy takes roughly 1/6 of the time of the standard statistics module. The more rows we have, the more time we save!

31.7 ms ± 719 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
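
Outside a notebook, a similar comparison can be sketched with the standard `timeit` module. This is only a rough sketch, not the course's code: the names `salary_array`, `t_stats`, `t_np_list` and `t_np_array` are made up for illustration, the list is shortened to 100,000 items so it runs quickly, and the timings will differ from machine to machine. It also times `np.mean` on an array built ahead of time, because passing a plain Python list makes NumPy convert it on every call.

```python
import random
import statistics
import timeit

import numpy as np

# shorter list than in the cell above so the sketch runs quickly
salary_list = [random.randint(50000, 150000) for _ in range(100_000)]
salary_array = np.array(salary_list)  # convert to an array once, up front

# time 5 calls of each approach (total seconds for the 5 calls)
t_stats = timeit.timeit(lambda: statistics.mean(salary_list), number=5)
t_np_list = timeit.timeit(lambda: np.mean(salary_list), number=5)
t_np_array = timeit.timeit(lambda: np.mean(salary_array), number=5)

print(f"statistics.mean on a list: {t_stats:.4f} s")
print(f"np.mean on a list:         {t_np_list:.4f} s")
print(f"np.mean on an array:       {t_np_array:.4f} s")
```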


---
NumPy is very useful for arithmetic and statistical operations on large datasets, and it integrates well with other libraries such as Matplotlib and Pandas.
---

Let's see a couple of use cases for numpy.

In [19]:
# arrays are a useful case for NumPy

my_array = np.array([1, 2, 3, 4])

In [21]:
# can then do calculations on that array

my_array.mean()

np.float64(2.5)

In [31]:
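# let's say we have data across multiple arrays
# we can do operations across arrays
# note: the fifth entry (with np.nan) was added later for the missing-data example further down;
# the outputs of the next two cells were produced with the first four entries only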

import numpy as np

# Job Titles
job_titles = np.array(["Data Analyst", "Data Scientist", "Data Engineer", "Machine Learning Engineer", "AI Engineer"])

# Base salaries
base_salaries = np.array([60000, 80000, 75000, 90000, np.nan])

# Bonus rates
bonus_rates = np.array([.05, .1, .08, .12, np.nan])

In [28]:
total_salaries = base_salaries * (1 + bonus_rates)

print(total_salaries)

[ 63000.  88000.  81000. 100800.]


In [30]:
print(np.mean(total_salaries))

83200.0


In [33]:
# what about when we have missing data?

# we added an extra job to the arrays above, with missing data for its salary and bonus rate

# NumPy has a specific value for missing numerical data: np.nan. It is a float.

# however, regular NumPy functions such as np.mean return nan if any element is nan

np.mean(total_salaries)  # nan

# np.nanmean ignores the nan values

np.nanmean(total_salaries)

np.float64(83200.0)
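
np.nanmean is one of several NaN-aware functions. The sketch below rebuilds the arrays from the cells above and compares the regular mean with a few of the `nan*` variants; it is meant as an illustration, not as part of the course code.

```python
import numpy as np

# same data as in the cells above, including the missing fifth entry
base_salaries = np.array([60000, 80000, 75000, 90000, np.nan])
bonus_rates = np.array([.05, .1, .08, .12, np.nan])
total_salaries = base_salaries * (1 + bonus_rates)

print(np.mean(total_salaries))     # nan: a single missing value propagates
print(np.nanmean(total_salaries))  # mean of the four known values
print(np.nansum(total_salaries))   # sum of the known values
print(np.nanmax(total_salaries))   # largest known value
```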

## 22.1 Basic Operations - a copy from the course code (not part of the video)

### Import package

In [41]:
import numpy as np

### Create an array

In [48]:
# create an array representing the number of years of experience for a job
years_of_experience = np.array([1, 2, 3, 4, 5])

### Addition

In [49]:
# adding 1 year to the array
years_of_experience_plus_one = years_of_experience + 1
years_of_experience_plus_one

array([2, 3, 4, 5, 6])

### Subtraction

In [50]:
# subtract 1 year from each element of the array
years_of_experience_minus_one = years_of_experience - 1
print(years_of_experience_minus_one)

[0 1 2 3 4]


### Division

In [51]:
# divide by 2
years_of_experience_half = years_of_experience / 2
print(years_of_experience_half)

[0.5 1.  1.5 2.  2.5]


### Multiplication

In [52]:
# multiply by 2
years_of_experience_double = years_of_experience * 2
print(years_of_experience_double)

[ 2  4  6  8 10]
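
It helps to compare this with a plain Python list, where `*` means something different. A small sketch (the names `years_list`, `list_times_two` and `array_times_two` are made up for this illustration):

```python
import numpy as np

years_list = [1, 2, 3, 4, 5]
years_array = np.array(years_list)

list_times_two = years_list * 2    # a list is repeated, not multiplied
array_times_two = years_array * 2  # an array is multiplied element by element

print(list_times_two)   # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
print(array_times_two)  # [ 2  4  6  8 10]
```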


### Slicing

In [53]:
# select elements of an array by index. Here we get the second and third elements (indices 1 and 2; the end index 3 is excluded)
second_and_third_job_experience = years_of_experience[1:3]
print(second_and_third_job_experience) 

[2 3]
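
A few more slicing patterns, as a sketch that goes slightly beyond the course cell (the names `every_other`, `last_two` and `view` are made up here). Note that a NumPy slice is a view of the original array, so writing to it changes the original.

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

every_other = years_of_experience[::2]  # from start to end in steps of 2
last_two = years_of_experience[-2:]     # negative indices count from the end

print(every_other)  # [1 3 5]
print(last_two)     # [4 5]

# a slice is a view: writing to it also changes the original array
view = years_of_experience[1:3]
view[0] = 20
print(years_of_experience)  # [ 1 20  3  4  5]

# use .copy() when an independent array is needed
independent = years_of_experience[1:3].copy()
```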


### Boolean indexing

In [54]:
# only get items that meet a certain condition (here: more than 2 years of experience)
above_two_years_exp = years_of_experience[years_of_experience > 2]
print(above_two_years_exp)

[3 4 5]
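
Conditions can also be combined. In the sketch below (variable names are made up for illustration), each condition goes in parentheses and is joined with `&` (and) or `|` (or), since the Python keywords `and` / `or` do not work element by element on arrays.

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

# between 2 and 4 years, inclusive
mid_level = years_of_experience[(years_of_experience >= 2) & (years_of_experience <= 4)]

# fewer than 2 years or more than 4 years
junior_or_senior = years_of_experience[(years_of_experience < 2) | (years_of_experience > 4)]

print(mid_level)         # [2 3 4]
print(junior_or_senior)  # [1 5]
```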


## 22.2 Math Operations

Aggregate functions:
- sum: sum
- prod: product
- cumsum: cumulative sum
- cumprod: cumulative product

Mathematical operations (we won't be going into this during the course):
- sqrt
- exp
- log
- sin
- cos

In [58]:
# first let's create a random list with 10 yearly salaries.

import random

salary = [random.randint(100_000, 150_000) for num in range(10)]

print(salary)

[119351, 112286, 105593, 111658, 145267, 148677, 111432, 140406, 112898, 102538]


In [60]:
# now let's convert into an array

salary_array = np.array(salary)

salary_array

array([119351, 112286, 105593, 111658, 145267, 148677, 111432, 140406,
       112898, 102538])

### Sum

In [61]:
# sum all items within the array

salary_array_sum = np.sum(salary_array)

salary_array_sum

np.int64(1210106)

### Prod

In [62]:
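# multiply all items within the array
# the true product is too large for int64, so the result overflows (hence the negative value below)
product_salary_array = np.prod(salary_array)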

product_salary_array

np.int64(-6104619559599422208)

### Cumsum (cumulative sum)

Calculates the cumulative sum of elements of the salary_array. It calculates the cumulative sum at each index, meaning each element in the output array is the sum of all preceding elements including the current one from the original array.

In [63]:
cumsum_salary_array = np.cumsum(salary_array)

cumsum_salary_array

array([ 119351,  231637,  337230,  448888,  594155,  742832,  854264,
        994670, 1107568, 1210106])

### Cumprod (cumulative product)

Calculates the cumulative product of elements of the salary_array. It calculates the cumulative product at each index, meaning each element in the output array is the product of all preceding elements including the current one from the original array.

**Note**: The cumulative product grows so quickly that it exceeds the largest value NumPy's fixed-width int64 type can hold (about 9.2 × 10^18), so the result silently wraps around (integer overflow). This is why some numbers appear as negative. Plain Python integers do not overflow; the limit comes from the NumPy data type.

In [64]:
cumprod_salary_array = np.cumprod(salary_array)

cumprod_salary_array

array([              119351,          13401446386,     1415098928236898,
       -8013580534310407660, -8571962155025265924, -4962758245959736340,
        4863721953308057184, -2721032556523847616, -5504506944185089920,
       -6104619559599422208])
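
One way to see the overflow and work around it, as a sketch rather than course code: compare the int64 product with Python's arbitrary-precision `math.prod` and with a float64 product. The names `wrapped`, `exact` and `approx` are made up here, and the salary values are copied from the output above.

```python
import math

import numpy as np

salary = [119351, 112286, 105593, 111658, 145267, 148677, 111432, 140406, 112898, 102538]
salary_array = np.array(salary, dtype=np.int64)

wrapped = np.prod(salary_array)                    # int64: silently wraps around
exact = math.prod(salary)                          # Python int: exact, no overflow
approx = np.prod(salary_array.astype(np.float64))  # float64: no wraparound, but only about 16 significant digits

print(wrapped)
print(exact)
print(approx)
```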

## 22.3 Statistical operations

Many of these are also available in Pandas, because Pandas is built on top of NumPy. Here are a few examples, but we won't be diving deep into them.

- mean
- median
- var: variance
- std: standard deviation
- min
- max

### Mean

In [65]:
average_salary = np.mean(salary_array)

average_salary

np.float64(121010.6)

### Median

In [66]:
median_salary = np.median(salary_array)

median_salary

np.float64(112592.0)

### Var

In [67]:
var_salary = np.var(salary_array)

var_salary

np.float64(263283141.24)

### Std

In [68]:
std_salary = np.std(salary_array)

std_salary

np.float64(16226.00201035363)
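
By default, np.var and np.std compute the population variance and standard deviation (dividing by n). The sketch below shows the `ddof=1` argument, which gives the sample versions (dividing by n - 1). The variable names are made up for illustration, and the salary values are copied from the output above.

```python
import numpy as np

salary_array = np.array([119351, 112286, 105593, 111658, 145267,
                         148677, 111432, 140406, 112898, 102538])

population_var = np.var(salary_array)      # ddof=0 (default): divide by n
sample_var = np.var(salary_array, ddof=1)  # divide by n - 1
sample_std = np.std(salary_array, ddof=1)  # square root of the sample variance

print(population_var)
print(sample_var)
print(sample_std)
```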

### Min

In [69]:
min_salary = np.min(salary_array)

min_salary

np.int64(102538)

### Max

In [70]:
max_salary = np.max(salary_array)

max_salary

np.int64(148677)

## 22.4 NaN

- Generate NaN values using np.nan
- np.nan value is used in NumPy (and by extension, Pandas) to represent missing or undefined data
- Helpful because it:
    - Handles missing data.
    - Helps with computations, since operations involving it return np.nan instead of raising errors.
    - Helps filter out or fill in missing data using methods that we'll use often in the pandas library, like dropna(), fillna(), isna(), or notna().

### Insert missing values

In [73]:
# when values are missing from an array, we insert np.nan rather than 0, None, or NULL

salary_nan = np.array([np.nan, 112286, np.nan, 111658, 145267, 148677, 111432, 140406, 112898, 102538])

salary_nan

array([    nan, 112286.,     nan, 111658., 145267., 148677., 111432.,
       140406., 112898., 102538.])

### Replace values with NaN

In [74]:
# replace all values under a certain threshold (here 120,000) with NaN

salary_nan[salary_nan < 120_000] = np.nan
salary_nan

array([    nan,     nan,     nan,     nan, 145267., 148677.,     nan,
       140406.,     nan,     nan])
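
Once an array contains NaN values, np.isnan can be used to find, drop, or fill them. This is a sketch using NumPy only (the pandas methods mentioned earlier come later); the names `missing_mask`, `n_missing`, `known_salaries` and `filled` are made up for illustration.

```python
import numpy as np

salary_nan = np.array([np.nan, 112286, np.nan, 111658, 145267,
                       148677, 111432, 140406, 112898, 102538])

# NaN never compares equal to anything, not even itself, so use np.isnan
print(np.nan == np.nan)  # False

missing_mask = np.isnan(salary_nan)         # True where a value is missing
n_missing = missing_mask.sum()              # number of missing values
known_salaries = salary_nan[~missing_mask]  # drop the missing values

# fill the missing values with the mean of the known ones
filled = np.where(missing_mask, np.nanmean(salary_nan), salary_nan)

print(n_missing)  # 2
print(known_salaries)
print(filled)
```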

## 22.5 Where

- np.where checks the elements of an array against a condition and assigns one value where the condition is True and another where it is False.
- Syntax: np.where(condition, x, y).
- It's commonly used to conditionally replace array elements.

In [75]:
# replace all values under 120_000 with 120_000, otherwise leave the same

salary_where = np.where(salary_array < 120_000, 120_000, salary_array)

salary_where

array([120000, 120000, 120000, 120000, 145267, 148677, 120000, 140406,
       120000, 120000])
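
Two more ways to use np.where, as a sketch (the names `high_idx` and `labels` and the "high" / "low" labels are made up for illustration): called with only a condition it returns the positions where the condition is True, and the two replacement values do not have to be numbers.

```python
import numpy as np

salary_array = np.array([119351, 112286, 105593, 111658, 145267,
                         148677, 111432, 140406, 112898, 102538])

# with only a condition, np.where returns a tuple of index arrays
high_idx = np.where(salary_array >= 120_000)[0]
print(high_idx)  # [4 5 7]

# label each salary instead of replacing it with a number
labels = np.where(salary_array >= 120_000, "high", "low")
print(labels)
```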

## 22.6 Random

- Generate random numbers or samples.
- np.random.normal - draws random samples from a normal (Gaussian) distribution.
    - Arguments:
        - loc: This is the mean (μ) of the normal distribution.
        - scale: This is the standard deviation (σ) of the normal distribution, representing the dispersion from the mean.
        - size: This defines the number of random samples to draw. In the example below it is set to match the number of job postings (the number of elements in salary_array).
    - Syntax: np.random.normal(loc=0.0, scale=1.0, size=None)
- A few other random sampling functions:
    - np.random.rand
    - np.random.randn
    - np.random.randint
    - np.random.random
    - np.random.uniform
    - np.random.binomial
    - np.random.poisson

In [77]:
# let's add some noise to the salary_array distribution using a random function.
# this is a use case to simulate data for job postings in cases where actual data is not available.

# generate random variables / noise

noise = np.random.normal(0, 5000, salary_array.size)

# add these numbers to the salary_array

salary_with_noise = salary_array + noise
salary_with_noise

array([119469.24947264, 106559.51882167, 112566.48763514, 114878.24528747,
       149695.81032986, 155359.38678684, 110513.45435895, 141677.79729437,
       111694.11674924, 105976.13338643])
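
The noise above changes on every run. If reproducible results are needed, one option is NumPy's Generator API with a fixed seed. This is a sketch, not the course's approach: the seed value 42 is arbitrary, and `rng`, `rng_again` and `noise_again` are names made up for illustration.

```python
import numpy as np

salary_array = np.array([119351, 112286, 105593, 111658, 145267,
                         148677, 111432, 140406, 112898, 102538])

# a seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0, scale=5000, size=salary_array.size)
salary_with_noise = salary_array + noise

# a second generator with the same seed reproduces the same noise
rng_again = np.random.default_rng(seed=42)
noise_again = rng_again.normal(loc=0, scale=5000, size=salary_array.size)

print(np.array_equal(noise, noise_again))  # True
print(salary_with_noise)
```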

# 22 Problems

### 1.22.1

Create a NumPy array representing the number of job applications received each day for a week (7 days). The array should be called applications and contain the following numbers: [10, 15, 7, 20, 25, 30, 5]. Print the array.

In [79]:
# applications data
applications_list = [10, 15, 7, 20, 25, 30, 5]

# create an array

applications_array = np.array(applications_list)

# print the array

print(applications_array)

[10 15  7 20 25 30  5]


### 1.22.2

Create a NumPy array called postings representing the number of job postings for different job titles. The array contains [10, 15, 7, 20, 25, 30, 5]. Use slicing to get the number of postings for the first three job titles.

In [81]:
# add postings data

postings_list = [10, 15, 7, 20, 25, 30, 5]

# create an array

postings_array = np.array(postings_list)

# sliced array

sliced_array = postings_array[0:3]

# print

print(sliced_array)

[10 15  7]


### 1.22.3

Create a NumPy array called salaries representing the salaries offered for five different job positions. Calculate the highest and lowest salary using NumPy functions. The array salaries contains [70000, 85000, 60000, 95000, 80000].

In [82]:
# add salaries data

salaries_data = [70000, 85000, 60000, 95000, 80000]

# convert to array

salaries_array = np.array(salaries_data)

# calculate max

salaries_max = np.max(salaries_array)

# calculate min

salaries_min = np.min(salaries_array)

# print

print(salaries_max, salaries_min)

95000 60000
