# Week 4 Assessment

Questions 1-3 are conceptual questions that you can find in the Week 4 quiz. The exercises here start at Question 4.

In [1]:
import numpy as np

x = np.array([
    [3,6,-2],
    [-9,7,4],
    [8,2,0]
])

print(np.min(x))

-9


In [2]:
x = np.array([
    [3,6,-2],
    [-9,7,4],
    [8,2,0]
])

y=np.min(x, axis=1)
print(y[2])

0


## Practical Exercises

### Questions 4 and 5: Random number generation

You are given the function below, `gen_random_number`, which is a random number generator from a very specific distribution (the nature of that distribution is not relevant here). You don't know what the data look like, but you want to summarize them in some way, in this case by **finding the mean and standard deviation**. You'll need a lot of samples to get a good estimate, so make sure you use at least one million.

In [3]:
import numpy as np
def gen_random_number(n):
    return np.random.gamma(shape=0.5,scale=1.3,size=n)

In [4]:
# Generate 1 million samples
n_samples = 1_000_000
samples = gen_random_number(n_samples)

# Compute mean and standard deviation
mean_estimate = np.mean(samples)
std_dev_estimate = np.std(samples, ddof=0)  # Population standard deviation

print(f"Estimated Mean: {mean_estimate}")
print(f"Estimated Standard Deviation: {std_dev_estimate}")

Estimated Mean: 0.649817513188198
Estimated Standard Deviation: 0.9190944969506496
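
As a sanity check on these estimates, the sketch below compares them with closed-form values. It assumes we look inside `gen_random_number` and apply the standard gamma-distribution formulas (mean = shape × scale, standard deviation = √shape × scale). The seed and the names `theoretical_mean`, `theoretical_std`, and `standard_error` are illustrative choices, not part of the original exercise.

```python
import numpy as np

def gen_random_number(n):
    # Same generator as in the exercise above
    return np.random.gamma(shape=0.5, scale=1.3, size=n)

np.random.seed(0)  # assumption: seeded only so this sketch is reproducible
samples = gen_random_number(1_000_000)

# Closed-form values for a gamma distribution with these parameters
shape, scale = 0.5, 1.3
theoretical_mean = shape * scale          # 0.65
theoretical_std = np.sqrt(shape) * scale  # roughly 0.9192

mean_error = abs(np.mean(samples) - theoretical_mean)
std_error = abs(np.std(samples, ddof=0) - theoretical_std)

# Standard error of the mean: the typical size of the sampling error
standard_error = np.std(samples, ddof=0) / np.sqrt(samples.size)

print(f"Absolute error in mean: {mean_error:.5f}")
print(f"Absolute error in std:  {std_error:.5f}")
print(f"Standard error of mean: {standard_error:.5f}")
```

The standard error shrinks in proportion to 1/√n, which is why the exercise asks for at least one million samples: with far fewer, the estimates may not be stable to the precision the quiz expects.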


### Questions 6 and 7: Filtering and querying a dataset

In data science we frequently need to filter data as we've previously discussed: remove missing or anomalous values, remove predictors/features from a dataset, remove redundant values, etc. Additionally, we often want to query the data, exploring subsets of the larger dataset that meet certain criteria. We'll see later in this specialization that Pandas offers many excellent tools for doing that, but they're based on principles we've discussed here around matrix and vector operations. We've also discussed summarization strategies in this course. Let's bring all of these pieces together and create some tools for filtering and querying our data.

The goal of this exercise is to create a set of functions that can:
1. Remove data from a dataset that are greater than a certain value
2. Remove data from a dataset that are less than a certain value
3. Remove specific values from a dataset
4. Remove duplicate values in a dataset

Once we have the functions to accomplish this, we'll apply them to a dataset.

The first step is to create the functions. To help get you started, some skeleton code is provided below (replace "pass" with your code to construct the functions):

In [5]:
def remove_greater_than(array, threshold):
    '''Remove entries in `array` greater than `threshold` (equal values are kept).'''
    return array[array <= threshold]

def remove_less_than(array, threshold):
    '''Remove entries in `array` less than `threshold` (equal values are kept).'''
    return array[array >= threshold]

def remove_if_equal(array, value_list):
    '''Remove entries in `array` that equal any value in `value_list`.'''
    return array[~np.isin(array, value_list)]

def remove_duplicates(array):
    '''Remove duplicate entries in `array`, leaving one of each (result is sorted).'''
    return np.unique(array)

Once you have built your functions to filter the data, generate tests to verify that each function is working properly.

Now it's time to apply your functions. The dataset that we will use is a set of 500 random integers between 1 and 999 (`np.random.randint` excludes the upper bound of 1000). The code is provided below; do NOT change the random seed.

In [6]:
# Example usage:
data = np.array([1, 2, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10])

# Remove values greater than 7
filtered_data1 = remove_greater_than(data, 7)

# Remove values less than 3
filtered_data2 = remove_less_than(data, 3)

# Remove specific values (e.g., 2, 4, and 6)
filtered_data3 = remove_if_equal(data, [2, 4, 6])

# Remove duplicates
filtered_data4 = remove_duplicates(data)

print("Original Data:", data)
print("After removing values > 7:", filtered_data1)
print("After removing values < 3:", filtered_data2)
print("After removing values in [2, 4, 6]:", filtered_data3)
print("After removing duplicates:", filtered_data4)


Original Data: [ 1  2  2  3  4  5  6  7  7  8  9 10]
After removing values > 7: [1 2 2 3 4 5 6 7 7]
After removing values < 3: [ 3  4  5  6  7  7  8  9 10]
After removing values in [2, 4, 6]: [ 1  3  5  7  7  8  9 10]
After removing duplicates: [ 1  2  3  4  5  6  7  8  9 10]


In [7]:
# Generate the 500 random numbers [DO NOT MODIFY THIS CODE]
import numpy as np
np.random.seed(14) # This guarantees the code will generate the same set of random numbers whenever executed
random_integers = np.random.randint(1,high=1000, size=(500))

print(random_integers[:100]) # Prints the first 100 numbers to get a sense of the data

[620 345 269 407 328 762 359 747 669 651 209 574 489 878 667  26 472 492
 545 261 974 139 651 513 471  85 134 550 633 102  42 745 232 580 720 325
 974 174 380 249 786 764 897 160 778 188  28 219 812 759 258 893 328 105
 704 907  46 752  31 916 578  58 588 206 338 358 782 953 336 682 640 278
 526 324  62 331 751 442 463 605 239 957 522 335 389 179 657 486 565 258
 140 997 882 388 866  13 925 854 489 402]


We strongly encourage you to test your code for each of the above four functions on a simple example. For example, when testing the `remove_greater_than()` function, you could input an array `[1,2,20,21,20000]` with a threshold of 20 and verify that the resulting output is `[1,2,20]`.
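
The checks suggested above could be written as a handful of assertions. This is a minimal sketch: the four functions are restated so the block runs on its own, and every input other than `[1,2,20,21,20000]` is an illustrative choice rather than part of the exercise.

```python
import numpy as np

# Restated from the solution cell so this block is self-contained
def remove_greater_than(array, threshold):
    return array[array <= threshold]

def remove_less_than(array, threshold):
    return array[array >= threshold]

def remove_if_equal(array, value_list):
    return array[~np.isin(array, value_list)]

def remove_duplicates(array):
    return np.unique(array)

test_array = np.array([1, 2, 20, 21, 20000])

# Values equal to the threshold survive both threshold filters
assert np.array_equal(remove_greater_than(test_array, 20), [1, 2, 20])
assert np.array_equal(remove_less_than(test_array, 20), [20, 21, 20000])

# Only the listed values are dropped
assert np.array_equal(remove_if_equal(test_array, [2, 21]), [1, 20, 20000])

# np.unique returns the distinct values in sorted order
assert np.array_equal(remove_duplicates(np.array([3, 1, 3, 2, 1])), [1, 2, 3])

print("All filter checks passed.")
```

Note the boundary behavior these checks pin down: a value exactly equal to the threshold is kept, and `remove_duplicates` reorders its input because `np.unique` sorts.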

Once you are confident in your function, your goal is to filter the data in the following ways:
1. Remove values greater than 800
2. Remove values less than 25
3. Remove values equal to even integers
4. Remove all duplicates

Lastly, summarize the remaining data after your filtering is complete by computing the **mean and median** of the remaining data.

In [9]:
# Apply the filters
filtered_data = remove_greater_than(random_integers, 800)
filtered_data = remove_less_than(filtered_data, 25)
filtered_data = remove_if_equal(filtered_data, np.arange(2, 801, 2))  # Remove even numbers
filtered_data = remove_duplicates(filtered_data)

# Compute summary statistics
mean_value = np.mean(filtered_data)
median_value = np.median(filtered_data)

print("Filtered Data:", filtered_data)
print(f"Mean of remaining data: {mean_value:.4g}")
print(f"Median of remaining data: {median_value:.4g}")

Filtered Data: [ 27  29  31  37  41  49  57  77  79  85  87  97 105 111 125 127 129 139
 145 153 155 159 163 165 175 179 185 205 207 209 215 219 223 229 233 239
 241 249 251 261 267 269 273 287 289 299 303 305 311 313 315 321 325 329
 331 333 335 343 345 347 351 357 359 361 371 375 377 385 389 393 395 401
 405 407 411 419 423 425 427 431 437 441 447 451 453 457 463 465 471 481
 485 489 493 495 503 505 509 513 517 531 537 539 545 547 559 561 563 565
 567 577 587 599 605 621 631 633 635 639 641 647 649 651 657 665 667 669
 675 677 681 683 693 695 697 699 703 705 707 715 717 723 737 739 745 747
 751 753 757 759 769 773 785 797 799]
Mean of remaining data: 425.1
Median of remaining data: 423
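
Listing every even number with `np.arange(2, 801, 2)` works here because the data were already capped at 800. As an alternative sketch (not the approach used in the cell above), a modulo mask removes even values without needing to know the data range. The names `in_range`, `odd_via_isin`, and `odd_via_modulo` are illustrative.

```python
import numpy as np

# Same seeded data as in the exercise
np.random.seed(14)
random_integers = np.random.randint(1, high=1000, size=(500))

# Keep values in [25, 800], matching the first two filters
in_range = random_integers[(random_integers >= 25) & (random_integers <= 800)]

# Approach from the cell above: drop anything found in an explicit list of evens
odd_via_isin = in_range[~np.isin(in_range, np.arange(2, 801, 2))]

# Alternative: keep values whose remainder after division by 2 is nonzero
odd_via_modulo = in_range[in_range % 2 != 0]

# Both approaches should select exactly the same entries
assert np.array_equal(odd_via_isin, odd_via_modulo)

result = np.unique(odd_via_modulo)
print(f"Mean: {np.mean(result):.4g}, Median: {np.median(result):.4g}")
```

The modulo version does not fit the `remove_if_equal(array, value_list)` signature, so it is best read as a cross-check on the filtered result rather than a replacement for the required function.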


Once you're confident in your solution, head back to the Week 4 Quiz to enter your solutions for the exercises above and answer some conceptual questions about this chapter.