# SOLUTION to Advanced Problems (Day2)

# List

*Standard Deviation*: Write some code that calculates the standard deviation of a list of numbers. The formula for the standard deviation is  
  
$$ \sigma = \sqrt{\frac{1}{N-1} \sum_{i = 1}^N(x_i - \bar{x})^2}$$  
  
where $N$ is the number of elements in the list, $x_i$ is a particular element of the list, and $\bar{x}$ is the average of the list of numbers. (Hint: you will need to calculate the average first).

Test your code on the given list. The answer is approximately 31.23.

In [1]:
test_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]

N = len(test_list)  # number of data in list

total = 0           # running total (avoids shadowing the built-in sum)
for num in test_list:
    total = total + num   # add up all data
avg = total/N       # compute average

dsum = 0            # initialize sum of squared differences from the mean
for num in test_list:
    dsum = dsum + (num-avg)**2  # add the squared difference between each value and the mean

sigma = (dsum/(N-1))**0.5  # compute standard deviation
print(sigma)

31.234050875061563
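As a cross-check, the same formula can be wrapped in a function and compared with `statistics.stdev` from the standard library, which also uses the $N-1$ denominator. This is only a sketch: the helper name `sample_std` is made up for illustration, and `math.fsum` is used for the summation so the example does not depend on the built-in `sum`.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation with the N-1 denominator (hypothetical helper)."""
    n = len(values)
    mean = math.fsum(values) / n
    squared_diffs = math.fsum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - 1))

test_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]
print(sample_std(test_list))        # approximately 31.23
print(statistics.stdev(test_list))  # should agree with the line above
```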


*Median*: Write some code that calculates the median of a list of numbers. The answer for test_list is 47.5.

In [2]:
test_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]

N = len(test_list)  # number of data in list

new_list = sorted(test_list)   # sort list and save it as a new list

if N % 2 == 0:                 # if the number of data points is even
    median = (new_list[N//2] + new_list[N//2-1])/2   # median is the mean of the two middle values
else:                          # if the number of data points is odd
    median = new_list[N//2]    # median is the middle value
    
print(median)

47.5
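The test list has an even number of elements, so only the first branch runs. The sketch below exercises both branches by also dropping the last element to get an odd-length list, and compares the results with `statistics.median`. The helper name `median_of` and the variable names are assumptions made for this example.

```python
import statistics

def median_of(values):
    """Median of a list of numbers (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # even length: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # odd length: take the single middle value
    return ordered[mid]

even_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]
odd_list = even_list[:-1]      # 13 elements, to exercise the odd branch

print(median_of(even_list))    # 47.5
print(median_of(odd_list))     # 56
```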


*Mode*: Write some code that calculates the mode of a list of numbers. The answer for test_list is 56.

In [7]:
test_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]

unique_num = sorted(list(set(test_list)))   # get the unique numbers in the list

Mcount = 0                    # initialize maximum frequency as zero
for num in unique_num:        # for each unique number
    count = 0                 # initialize count as zero
    for data in test_list:
        if data == num:
            count +=1         # count how often the unique number appears in the list
    
    if count > Mcount:        # if current frequency is larger
        Mcount = count        # update the maximum frequency
        mode = num            # save the corresponding data
        
print(mode)

56
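The nested loops above rescan the whole list once per unique value. One possible alternative, sketched here, counts every value in a single pass with a dictionary and then picks the largest count. The variable names `counts` and `max_count` are illustrative. If several values tie for the highest count, this version keeps the one that appears first in the list, whereas the solution above keeps the smallest.

```python
test_list = [7, 39, 2, 56, 98, 74, 34, 17, 56, 88, 66, 0, 56, 34]

# Single pass: map each value to the number of times it appears
counts = {}
for num in test_list:
    counts[num] = counts.get(num, 0) + 1

# Find the value with the largest count
mode = None
max_count = 0
for num, count in counts.items():
    if count > max_count:
        max_count = count
        mode = num

print(mode, max_count)   # 56 3
```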


# For Loop

*Minimum*: Write some code that takes a list of numbers as input and returns the index of the smallest (most negative) number in the list. For example, given the list [4,7,-5,9,1,-2,6,4], the code would print 2. (*Hint*: If you're having trouble, try first writing some code that finds the minimum *number* itself. For the given example, the answer would be -5. Then modify your solution to instead find the index.)

In [9]:
numlist = [4,7,-5,9,1,-2,6,4]

#for i in sorted(enumerate(numlist),key=lambda x: x[1]):     # This code is for the testing
#    print(i)
    
newlist = sorted(enumerate(numlist),key=lambda x: x[1])      # (index, value) pairs sorted by value
print(newlist[0][0])                                         # index stored in the first pair, i.e. the index of the smallest value

2
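Since this section is about for loops, the same result can also be obtained without sorting, by tracking the index of the smallest value seen so far. This is a minimal sketch; `min_index` is a name chosen for illustration.

```python
numlist = [4, 7, -5, 9, 1, -2, 6, 4]

min_index = 0                             # assume the first element is the smallest
for i in range(1, len(numlist)):
    if numlist[i] < numlist[min_index]:   # found a smaller value
        min_index = i                     # remember where it is

print(min_index)   # 2
```

Using a strict `<` means that, if the minimum appears more than once, the first occurrence is reported.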


*Sorting*: Write some code that takes a list of numbers as input and returns a new list that contains the same numbers, but in increasing order (from least to greatest). If you did the problem above, then you can use that function in your solution to this problem. Otherwise, you can use the `.index()` and `min()` functions, as demonstrated in the example below:

In [10]:
mylist = [4,7,-5,9,1,-2,6,4]
mylist.index(min(mylist))       # min finds the lowest number, and mylist.index() finds 
                                # the first index of that number

2

In [11]:
# You can write your solution to the Sorting problem here:

newlist = []
for i in sorted(enumerate(numlist),key=lambda x: x[1]):
    newlist.append(i[1])
    
print(newlist)    

[-5, -2, 1, 4, 4, 6, 7, 9]
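The problem statement suggests building the sort from `min()` and `.index()`. A sketch of that approach follows: repeatedly find the smallest remaining number, move it to the output list, and stop when nothing is left. The names `remaining` and `sorted_list` are assumptions for this example; the input list is copied so that it is not modified.

```python
numlist = [4, 7, -5, 9, 1, -2, 6, 4]

remaining = list(numlist)   # work on a copy so numlist stays unchanged
sorted_list = []
while remaining:
    smallest_index = remaining.index(min(remaining))   # position of the smallest remaining value
    sorted_list.append(remaining.pop(smallest_index))  # move it to the output list

print(sorted_list)   # [-5, -2, 1, 4, 4, 6, 7, 9]
```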


# Practice

### Staircase

For a given $n$, print the corresponding staircase made out of hash symbols. For example, if $n=6$, we should get:

```
#
##
###
####
#####
######
```

In [27]:
n = 7     # any integer greater than 0
for row in range(n):              # for each row
    print('#' * (row + 1))        # print row+1 hash symbols, then move to the next line

#
##
###
####
#####
######
#######


## Data cleaning

When scientists perform experiments, the data that they collect needs to be processed before it can be analyzed. Sometimes the detector may not be working, it might be cloudy so the telescope can't see anything, or the scientist may go to get a coffee and miss a reading. Sometimes things interfere with the readings, leading to strange, *anomalous* readings that are far outside the range of expected values. Before any analysis is performed, the data must therefore be *cleaned* to remove any missing or anomalous values.

In this exercise, we're going to pretend that we've collected some readings of the brightness of a star that we observed using Mount Stony Brook every evening for four weeks. Unfortunately, on a few of those nights it was cloudy, so no reading was taken. In these cases, the brightness is recorded as `0`. On a few other nights, the floodlights at the nearby football stadium were turned on, leading to light contamination which produced anomalously large readings. On one night, an anomalously small reading was recorded (maybe the person taking the reading put the decimal point in the wrong place?).

Before we can analyze our data to work out how bright the star is, we will therefore need to clean it. 

To do this, create a new list which does not contain any values which are 0, or much smaller/larger than would reasonably be expected. Decide for yourself what to consider to be anomalously small/large values. You should create the new list by using a for loop to iterate over the elements of the list, adding any values which satisfy the criteria to your new list. 

In [28]:
data = [105.77696802, 110.406054  , 106.36737707,  95.02908826,
        84.13182033,  0, 0, 101.47121241,
       106.07343453,  90.65935074,  93.66283734, 102.19944747,
        82.82894661, 102.20360106, 102.29047846, 596.23884439,
       104.03586589,  99.09490557,  76.09848805, 114.83901321,
        86.5806938 , 497.74438934,  9.891387187, 506.57861168,
       101.61619984,  92.62959516,  0,  90.04324646]

In [37]:
# create your clean data here
# Here we use the interquartile range (IQR) as the outlier criterion.
# The IQR is the difference between the 25th percentile (Q1) and the
# 75th percentile (Q3) of a dataset; it measures the spread of the
# middle 50% of values. A common rule declares an observation an outlier
# if it lies more than 1.5*IQR below Q1 or more than 1.5*IQR above Q3.

# First step: Remove zeros
nonzero_data = []
for num in data:
    if num != 0:
        nonzero_data.append(num)
        
print(nonzero_data)

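# Second step: find Q1, Q3 and IQR = Q3 - Q1
# Note: the positions below index nonzero_data in its original (unsorted) order,
# so Q1 and Q3 here are only rough stand-ins for the true quartiles, which are
# defined on sorted values.
N = len(nonzero_data)  # number of nonzero readings
index_q1 = int(N*0.25)
index_q3 = int(N*0.75)
IQR = nonzero_data[index_q3-1] - nonzero_data[index_q1-1]

# Third step: find the boundaries Q1 - 1.5*IQR and Q3 + 1.5*IQR
low_bound = nonzero_data[index_q1-1] - 1.5*IQR
up_bound = nonzero_data[index_q3-1] + 1.5*IQR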

# Fourth step: clean data
clean_data = []
for num in nonzero_data:
    if (num<=up_bound) and (num>=low_bound):
        clean_data.append(num)
print(clean_data)                
        



[105.77696802, 110.406054, 106.36737707, 95.02908826, 84.13182033, 101.47121241, 106.07343453, 90.65935074, 93.66283734, 102.19944747, 82.82894661, 102.20360106, 102.29047846, 596.23884439, 104.03586589, 99.09490557, 76.09848805, 114.83901321, 86.5806938, 497.74438934, 9.891387187, 506.57861168, 101.61619984, 92.62959516, 90.04324646]
[105.77696802, 110.406054, 106.36737707, 95.02908826, 84.13182033, 101.47121241, 106.07343453, 90.65935074, 93.66283734, 102.19944747, 82.82894661, 102.20360106, 102.29047846, 104.03586589, 99.09490557, 114.83901321, 86.5806938, 101.61619984, 92.62959516, 90.04324646]
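The cell above takes the Q1 and Q3 positions from the readings in their original order, while quartiles are defined on sorted values. The sketch below sorts the nonzero readings first and otherwise keeps the same position rule, which is itself only one of several quartile conventions. The names `nonzero_sorted` and `clean_sorted_rule` are made up for this example. With sorted quartiles the bounds shift, the reading 76.09848805 is kept, and the mean and standard deviation would therefore differ slightly from the values printed in this notebook.

```python
data = [105.77696802, 110.406054, 106.36737707, 95.02908826,
        84.13182033, 0, 0, 101.47121241,
        106.07343453, 90.65935074, 93.66283734, 102.19944747,
        82.82894661, 102.20360106, 102.29047846, 596.23884439,
        104.03586589, 99.09490557, 76.09848805, 114.83901321,
        86.5806938, 497.74438934, 9.891387187, 506.57861168,
        101.61619984, 92.62959516, 0, 90.04324646]

# Remove zeros, then sort so that positions correspond to percentiles
nonzero_sorted = sorted(x for x in data if x != 0)

n = len(nonzero_sorted)
q1 = nonzero_sorted[int(n * 0.25) - 1]   # same position rule as above, on sorted data
q3 = nonzero_sorted[int(n * 0.75) - 1]
iqr = q3 - q1
low_bound = q1 - 1.5 * iqr
up_bound = q3 + 1.5 * iqr

# Keep the readings inside the bounds, preserving their original order
clean_sorted_rule = []
for x in data:
    if x != 0 and low_bound <= x <= up_bound:
        clean_sorted_rule.append(x)

print(low_bound, up_bound)
print(len(clean_sorted_rule))   # 21
```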


Now you have your nice clean data, let's analyze it to deduce how bright this star is. 
- Calculate the mean of your clean data (this gives an estimate of the true value of the brightness)
- *Advanced*: calculate the standard deviation of the clean data (see the advanced problem in the lists - if you did that problem you can reuse your code here!). The standard deviation gives us an estimate of the error in our measurement.

In [51]:
N = len(clean_data)  # number of data in list

total = 0           # running total (avoids shadowing the built-in sum)
for num in clean_data:
    total = total + num   # add up all data
avg = total/N       # compute average

print(avg)

98.59700681150002


In [52]:
dsum = 0            # initialize sum of squared differences from the mean
for num in clean_data:
    dsum = dsum + (num-avg)**2  # add the squared difference between each value and the mean

sigma = (dsum/(N-1))**0.5  # compute standard deviation
print(sigma)

8.793061675007301


Alternatively, use the `statistics` module from the Python standard library:

In [49]:
import statistics
print(statistics.mean(clean_data))
print(statistics.stdev(clean_data))

98.5970068115
8.7930616750073

