### Plotting a histogram of iris data

##### Import matplotlib.pyplot and seaborn as their usual aliases (plt and sns).
##### Use seaborn to set the plotting defaults.
##### Plot a histogram of the Iris versicolor petal lengths using plt.hist() and the provided NumPy array versicolor_petal_length.
##### Show the histogram using plt.show().

In [None]:
# Import plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Set default Seaborn style
sns.set()

# Plot histogram of versicolor petal lengths
plt.hist(versicolor_petal_length)

# Show histogram
plt.show()

### Axis labels!

##### Label the axes. Don't forget that you should always include units in your axis labels. Your y-axis label is just 'count'. Your x-axis label is 'petal length (cm)'. The units are essential!
##### Display the plot constructed in the above steps using plt.show().

In [None]:
# Plot histogram of versicolor petal lengths
_ = plt.hist(versicolor_petal_length)

# Label axes
plt.xlabel('petal length (cm)')
plt.ylabel('count')

# Show histogram
plt.show()

### Adjusting the number of bins in a histogram

##### Import numpy as np. This gives access to the square root function, np.sqrt().
##### Determine how many data points you have using len().
##### Compute the number of bins using the square root rule.
##### Convert the number of bins to an integer using the built-in int() function.
##### Generate the histogram and make sure to use the bins keyword argument.
##### Hit submit to plot the figure and see the fruit of your labors!

In [None]:
# Import numpy
import numpy as np

# Compute number of data points: n_data
n_data = len(versicolor_petal_length)

# Number of bins is the square root of number of data points: n_bins
n_bins = np.sqrt(n_data)


# Convert number of bins to integer: n_bins
n_bins = int(n_bins)

# Plot the histogram
_ = plt.hist(versicolor_petal_length, bins=n_bins)

# Label axes
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('count')

# Show histogram
plt.show()
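
The square root rule above can be checked by hand on a small case. The sketch below assumes a sample of 50 measurements; that number is a placeholder for illustration, and in the exercise it comes from `len(versicolor_petal_length)`.

```python
import numpy as np

# Hypothetical sample size; in the exercise it comes from len(versicolor_petal_length)
n_data = 50

# Square root rule, before conversion to an integer
n_bins_float = np.sqrt(n_data)

# int() truncates toward zero, so roughly 7.07 becomes 7
n_bins = int(n_bins_float)

print(n_bins_float, n_bins)
```

Because `int()` truncates rather than rounds, the rule never produces more bins than the square root of the sample size.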

### Bee swarm plot

##### In the IPython Shell, inspect the DataFrame df using df.head(). This will let you identify which column names you need to pass as the x and y keyword arguments in your call to sns.swarmplot().
##### Use sns.swarmplot() to make a bee swarm plot from the DataFrame containing the Fisher iris data set, df. The x-axis should contain each of the three species, and the y-axis should contain the petal lengths.
##### Label the axes.
##### Show your plot.

In [None]:
# Create bee swarm plot with Seaborn's default settings
sns.swarmplot(x='species', y='petal length (cm)', data=df)

# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

### Computing the ECDF

##### Define a function with the signature ecdf(data). Within the function definition,
##### Compute the number of data points, n, using the len() function.
##### The x-values are the sorted data. Use the np.sort() function to perform the sorting.
##### The y data of the ECDF go from 1/n to 1 in equally spaced increments. You can construct this using np.arange(). Remember, however, that the end value in np.arange() is not inclusive. Therefore, np.arange() will need to go from 1 to n+1. Be sure to divide this by n.
##### The function returns the values x and y.

In [None]:
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y
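
To see what `ecdf()` returns, it may help to run it on a handful of values. The four measurements below are made up for illustration and are not taken from the iris data.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Made-up measurements, deliberately unsorted
data = np.array([4.7, 4.5, 4.9, 4.0])

x, y = ecdf(data)

# x holds the sorted values 4.0, 4.5, 4.7, 4.9
# y holds 0.25, 0.5, 0.75, 1.0, that is, steps of 1/n ending at 1
print(x)
print(y)
```

Each y value is the fraction of data points less than or equal to the matching x value, which is why the last entry is always 1.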

### Plotting the ECDF

##### Use ecdf() to compute the ECDF of versicolor_petal_length. Unpack the output into x_vers and y_vers.
##### Plot the ECDF as dots. Remember to include marker = '.' and linestyle = 'none' in addition to x_vers and y_vers as arguments inside plt.plot().
##### Label the axes. You can label the y-axis 'ECDF'.
##### Show your plot.

In [None]:
# Compute ECDF for versicolor data: x_vers, y_vers
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
plt.plot(x_vers, y_vers, marker = '.', linestyle = 'none')

# Label the axes
plt.xlabel('petal length (cm)')
plt.ylabel('ECDF')


# Display the plot
plt.show()

### Comparison of ECDFs

##### Compute ECDFs for each of the three species using your ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively.
##### Plot all three ECDFs on the same plot as dots. To do this, you will need three plt.plot() commands. Assign the result of each to _.
##### A legend and axis labels have been added for you, so hit submit to see all the ECDFs!

In [None]:
# Compute ECDFs
x_set, y_set = ecdf(setosa_petal_length)
x_vers, y_vers = ecdf(versicolor_petal_length)
x_virg, y_virg = ecdf(virginica_petal_length)

# Plot all ECDFs on the same plot
plt.plot(x_set, y_set, marker='.', linestyle='none')
plt.plot(x_vers, y_vers, marker='.', linestyle='none')
plt.plot(x_virg, y_virg, marker='.', linestyle='none')

# Annotate the plot
plt.legend(('setosa', 'versicolor', 'virginica'), loc='lower right')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

### Computing means

##### Compute the mean petal length of Iris versicolor from Anderson's classic data set. The variable versicolor_petal_length is provided in your namespace. Assign the mean to mean_length_vers.
##### Hit submit to print the result.

In [None]:
# Compute the mean: mean_length_vers
mean_length_vers = np.mean(versicolor_petal_length)

# Print the result with some nice formatting
print('I. versicolor:', mean_length_vers, 'cm')

### Computing percentiles

##### Create percentiles, a NumPy array of percentiles you want to compute. These are the 2.5th, 25th, 50th, 75th, and 97.5th. You can do so by creating a list containing these ints/floats and convert the list to a NumPy array using np.array(). For example, np.array([30, 50]) would create an array consisting of the 30th and 50th percentiles.
##### Use np.percentile() to compute the percentiles of the petal lengths from the Iris versicolor samples. The variable versicolor_petal_length is in your namespace.
##### Print the percentiles.

In [None]:
# Specify array of percentiles: percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])

# Compute percentiles: ptiles_vers
ptiles_vers = np.percentile(versicolor_petal_length, percentiles)

# Print the result
print(ptiles_vers)
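
A small sketch of how `np.percentile()` behaves on data where the answer is easy to verify. The five values are made up for illustration; with NumPy's default linear interpolation, the 0th, 25th, 50th, 75th, and 100th percentiles of five evenly spaced points fall exactly on the points themselves.

```python
import numpy as np

# Made-up, evenly spaced data
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Percentiles to compute
percentiles = np.array([0, 25, 50, 75, 100])

# One result per requested percentile
ptiles = np.percentile(data, percentiles)

# The 50th percentile is the median
median = np.median(data)

print(ptiles, median)
```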

### Comparing percentiles to ECDF

##### Plot the percentiles as red diamonds on the ECDF. Pass the x and y co-ordinates - ptiles_vers and percentiles/100 - as positional arguments and specify the marker='D', color='red' and linestyle='none' keyword arguments. The argument for the y-axis - percentiles/100 has been specified for you.
##### Display the plot.

In [None]:
# Plot the ECDF
_ = plt.plot(x_vers, y_vers, '.')
_ = plt.xlabel('petal length (cm)')
_ = plt.ylabel('ECDF')

# Overlay percentiles as red diamonds.
_ = plt.plot(ptiles_vers, percentiles/100, marker='D', color='red',
         linestyle='none')

# Show the plot
plt.show()

### Box-and-whisker plot

##### The set-up is exactly the same as for the bee swarm plot; you just call sns.boxplot() with the same keyword arguments as you would sns.swarmplot(). The x-axis is 'species' and y-axis is 'petal length (cm)'.
##### Don't forget to label your axes!
##### Display the figure using the normal call.

In [None]:
# Create box plot with Seaborn's default settings
sns.boxplot(x='species', y='petal length (cm)', data=df)

# Label the axes
plt.xlabel('species')
plt.ylabel('petal length (cm)')

# Show the plot
plt.show()

### Computing the variance

##### Create an array called differences that is the difference between the petal lengths (versicolor_petal_length) and the mean petal length. The variable versicolor_petal_length is already in your namespace as a NumPy array so you can take advantage of NumPy's vectorized operations.
##### Square each element in this array. For example, x**2 squares each element in the array x. Store the result as diff_sq.
##### Compute the mean of the elements in diff_sq using np.mean(). Store the result as variance_explicit.
##### Compute the variance of versicolor_petal_length using np.var(). Store the result as variance_np.
##### Print both variance_explicit and variance_np in one print call to make sure they are consistent.

In [None]:
# Array of differences to mean: differences
differences = versicolor_petal_length - np.mean(versicolor_petal_length)

# Square the differences: diff_sq
diff_sq = differences**2

# Compute the mean square difference: variance_explicit
variance_explicit = np.mean(diff_sq)

# Compute the variance using NumPy: variance_np
variance_np = np.var(versicolor_petal_length)

# Print the results
print(variance_explicit, variance_np)
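
The same comparison can be run on a tiny data set where the variance is easy to work out by hand. The eight values below are made up for illustration; their mean is 5 and their mean squared deviation is 4.

```python
import numpy as np

# Made-up data with mean 5
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Explicit calculation: mean of the squared differences from the mean
differences = data - np.mean(data)
diff_sq = differences**2
variance_explicit = np.mean(diff_sq)

# NumPy's calculation; by default np.var() also divides by n
variance_np = np.var(data)

print(variance_explicit, variance_np)
```

Both numbers agree because `np.var()` divides by the number of data points by default, exactly as the explicit calculation does.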


### The standard deviation and the variance

##### Compute the variance of the data in the versicolor_petal_length array using np.var() and store it in a variable called variance.
##### Print the square root of this value.
##### Print the standard deviation of the data in the versicolor_petal_length array using np.std().

In [None]:
# Compute the variance: variance
variance = np.var(versicolor_petal_length)

# Print the square root of the variance
print(np.sqrt(variance))

# Print the standard deviation
print(np.std(versicolor_petal_length))

### Scatter plots

##### Use plt.plot() with the appropriate keyword arguments to make a scatter plot of versicolor petal length (x-axis) versus petal width (y-axis). The variables versicolor_petal_length and versicolor_petal_width are already in your namespace. Do not forget to use the marker='.' and linestyle='none' keyword arguments.
##### Label the axes.
##### Display the plot.

In [None]:
# Make a scatter plot
plt.plot(versicolor_petal_length, versicolor_petal_width, marker='.', linestyle='none')

# Label the axes
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')

# Show the result
plt.show()

### Computing the covariance

##### Use np.cov() to compute the covariance matrix for the petal length (versicolor_petal_length) and width (versicolor_petal_width) of I. versicolor.
##### Print the covariance matrix.
##### Extract the covariance from entry [0,1] of the covariance matrix. Note that by symmetry, entry [1,0] is the same as entry [0,1].
##### Print the covariance.

In [None]:
# Compute the covariance matrix: covariance_matrix
covariance_matrix = np.cov(versicolor_petal_length, versicolor_petal_width)

# Print covariance matrix
print(covariance_matrix)

# Extract covariance of length and width of petals: petal_cov
petal_cov = covariance_matrix[0,1]

# Print the length/width covariance
print(petal_cov)
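
One detail worth checking on a small example: by default `np.cov()` divides by n - 1, whereas `np.var()` divides by n. The arrays below are made up for illustration, with y exactly twice x.

```python
import numpy as np

# Made-up data: y is exactly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

# 2x2 covariance matrix: variances on the diagonal, covariance off the diagonal
covariance_matrix = np.cov(x, y)

# Explicit covariance with the same n - 1 normalization that np.cov() uses by default
n = len(x)
cov_explicit = np.sum((x - np.mean(x)) * (y - np.mean(y))) / (n - 1)

print(covariance_matrix)
print(covariance_matrix[0, 1], cov_explicit)
```

Entry `[0, 1]` matches the explicit value of 10/3, and the matrix is symmetric, so entry `[1, 0]` is the same number.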


### Computing the Pearson correlation coefficient

##### Define a function with signature pearson_r(x, y).
##### Use np.corrcoef() to compute the correlation matrix of x and y (pass them to np.corrcoef() in that order).
##### The function returns entry [0,1] of the correlation matrix.
##### Compute the Pearson correlation between the data in the arrays versicolor_petal_length and versicolor_petal_width. Assign the result to r.
##### Print the result.

In [None]:
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x,y)

    # Return entry [0,1]
    return corr_mat[0,1]

# Compute Pearson correlation coefficient for I. versicolor: r
r = pearson_r(versicolor_petal_length, versicolor_petal_width)

# Print the result
print(r)
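
As a sanity check, the Pearson correlation coefficient from `np.corrcoef()` can be compared with its definition: the covariance divided by the product of the standard deviations. The two arrays are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]

# Made-up data with an imperfect positive relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Correlation from the correlation matrix
r = pearson_r(x, y)

# Correlation from the definition, using the divide-by-n convention throughout
covariance = np.mean((x - np.mean(x)) * (y - np.mean(y)))
r_explicit = covariance / (np.std(x) * np.std(y))

print(r, r_explicit)
```

The normalization convention cancels in the ratio, so both approaches give 0.8 for these values.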

### Generating random numbers using the np.random module

##### Instantiate and seed a random number generator, rng, using the seed 42.
##### Initialize an empty array, random_numbers, of 100,000 entries to store the random numbers. Make sure you use np.empty(100000) to do this.
##### Write a for loop to draw 100,000 random numbers using rng.random(), storing them in the random_numbers array. To do so, loop over range(100000).
##### Plot a histogram of random_numbers. It is not necessary to label the axes in this case because we are just checking the random number generator. Hit submit to show your plot.

In [None]:
# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt

# Instantiate and seed the random number generator
rng = np.random.default_rng(42)

# Initialize random numbers: random_numbers
random_numbers = np.empty(100000)

# Generate random numbers by looping over range(100000)
for i in range(100000):
    random_numbers[i] = rng.random()

# Plot a histogram
_ = plt.hist(random_numbers)

# Show the plot
plt.show()
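
The explicit loop is useful for seeing what happens draw by draw. As a possible alternative, `rng.random()` also accepts a `size` argument and fills the whole array in one call. This is a sketch of that vectorized form, not part of the exercise.

```python
import numpy as np

# Instantiate and seed the random number generator
rng = np.random.default_rng(42)

# Draw all 100,000 numbers in a single call
random_numbers = rng.random(size=100000)

# Draws are uniform on the half-open interval [0, 1)
print(random_numbers.shape, random_numbers.min(), random_numbers.max())
```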


### The np.random module and Bernoulli trials

##### Define a function with signature perform_bernoulli_trials(n, p).
##### Initialize a variable n_success to zero. It counts the Trues, which are Bernoulli trial successes.
##### Write a for loop where you perform a Bernoulli trial in each iteration and increment the number of successes if the result is True. Perform n iterations by looping over range(n).
##### To perform a Bernoulli trial, choose a random number between zero and one using rng.random(). If the number you chose is less than p, increment n_success (use the += 1 operator to achieve this). An RNG has already been instantiated as the variable rng and seeded.
##### The function returns the number of successes n_success.

In [None]:
def perform_bernoulli_trials(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes."""
    # Initialize number of successes: n_success
    n_success = 0

    # Perform trials
    for i in range(n):
        # Choose random number between zero and one: random_number
        random_number = rng.random()

        # If less than p, it's a success so add one to n_success
        if random_number < p:
            n_success += 1

    return n_success
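
The same logic can be written without a loop by drawing all n random numbers at once and counting how many fall below p. The function name `perform_bernoulli_trials_vectorized` is hypothetical and used here only to keep it distinct from the exercise's function.

```python
import numpy as np

# Seeded generator, as in the exercise
rng = np.random.default_rng(42)

def perform_bernoulli_trials_vectorized(n, p):
    """Perform n Bernoulli trials with success probability p
    and return number of successes (vectorized sketch)."""
    # Boolean array: True wherever the random number is less than p
    successes = rng.random(n) < p

    # Summing booleans counts the Trues
    return int(np.sum(successes))

n_success = perform_bernoulli_trials_vectorized(100, 0.05)
print(n_success)
```

Since `rng.random()` never returns 1.0, a success probability of 1 always yields n successes and a probability of 0 always yields none.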

### How many defaults might we expect?

##### Seed the random number generator to 42.
##### Initialize n_defaults, an empty array, using np.empty(). It should contain 1000 entries, since we are doing 1000 simulations.
##### Write a for loop with 1000 iterations to compute the number of defaults per 100 loans using the perform_bernoulli_trials() function. It accepts two arguments: the number of trials n - in this case 100 - and the probability of success p - in this case the probability of a default, which is 0.05. On each iteration of the loop store the result in an entry of n_defaults.
##### Plot a histogram of n_defaults. Include the density=True keyword argument so that the height of the bars of the histogram indicate the probability.
##### Show your plot.

In [None]:
# Instantiate and seed random number generator
rng = np.random.default_rng(42)

# Initialize the number of defaults: n_defaults
n_defaults = np.empty(1000)

# Compute the number of defaults
for i in range(1000):
    n_defaults[i] = perform_bernoulli_trials(100, 0.05)


# Plot the histogram with default number of bins; label your axes
_ = plt.hist(n_defaults, density=True)
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('probability')

# Show the plot
plt.show()

### Will the bank fail?

##### Compute the x and y values for the ECDF of n_defaults.
##### Plot the ECDF, making sure to label the axes. Remember to include marker = '.' and linestyle = 'none' in addition to x and y in your call plt.plot().
##### Show the plot.
##### Compute the total number of entries in your n_defaults array that were greater than or equal to 10. To do so, compute a boolean array that tells you whether a given entry of n_defaults is >= 10. Then sum all the entries in this array using np.sum(). For example, np.sum(n_defaults <= 5) would compute the number of simulations with 5 or fewer defaults.
##### The probability that the bank loses money is the fraction of n_defaults that are greater than or equal to 10. Print this result by hitting submit!

In [None]:
# Compute ECDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with labeled axes
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of defaults out of 100')
_ = plt.ylabel('CDF')

# Show the plot
plt.show()

# Compute the number of 100-loan simulations with 10 or more defaults: n_lose_money
n_lose_money = np.sum(n_defaults >= 10)

# Compute and print probability of losing money
print('Probability of losing money =', n_lose_money / len(n_defaults))
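
The simulated probability can be compared against the exact value, since the number of defaults out of 100 independent loans follows a Binomial distribution. The sketch below uses only the standard library; the helper `binom_pmf` is a hypothetical name introduced for this illustration.

```python
from math import comb

# Same parameters as the simulation: 100 loans, default probability 0.05
n, p = 100, 0.05

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(10 or more defaults) = 1 - P(9 or fewer defaults)
prob_lose_money = 1 - sum(binom_pmf(k, n, p) for k in range(10))

# The PMF should sum to 1 over all possible outcomes
total = sum(binom_pmf(k, n, p) for k in range(n + 1))

print('Exact probability of losing money =', prob_lose_money)
```

With only 1000 simulations, the simulated fraction can differ noticeably from this exact value from run to run.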


### Sampling out of the Binomial distribution

##### Draw samples out of the Binomial distribution using rng.binomial(). You should use parameters n = 100 and p = 0.05, and set the size keyword argument to 10000.
##### Compute the CDF using your previously-written ecdf() function.
##### Plot the CDF with axis labels. The x-axis here is the number of defaults out of 100 loans, while the y-axis is the CDF.
##### Show the plot.

In [None]:
# Take 10,000 samples out of the binomial distribution: n_defaults
n_defaults = rng.binomial(n=100, p=0.05, size=10000)

# Compute CDF: x, y
x, y = ecdf(n_defaults)

# Plot the CDF with axis labels
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('number of defaults out of 100 loans')
_ = plt.ylabel('CDF')

# Show the plot
plt.show()


### Plotting the Binomial PMF

##### Using np.arange(), compute the bin edges such that the bins are centered on the integers. Store the resulting array in the variable bins.
##### Use plt.hist() to plot the histogram of n_defaults with the density=True and bins=bins keyword arguments.
##### Show the plot.

In [None]:
# Compute bin edges: bins
bins = np.arange(0, max(n_defaults) + 1.5) - 0.5

# Generate histogram
plt.hist(n_defaults, density=True, bins=bins)

# Label axes
plt.xlabel('number of defaults out of 100 loans')
plt.ylabel('probability')

# Show the plot
plt.show()
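
To see why this bin-edge expression centers the bins on the integers, it helps to evaluate it on a tiny example. The data below are made up for illustration, and `np.histogram()` is used in place of `plt.hist()` so the counts can be inspected directly.

```python
import numpy as np

# Made-up integer data with a maximum of 3
data = np.array([0, 1, 1, 3])

# Edges at -0.5, 0.5, 1.5, 2.5, 3.5, so each integer sits mid-bin
bins = np.arange(0, data.max() + 1.5) - 0.5

# Count the data in each bin
counts, edges = np.histogram(data, bins=bins)

print(bins)
print(counts)
```

Each bin contains exactly one integer, so the counts are 1, 2, 0, 1 for the values 0, 1, 2, 3.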

### Relationship between Binomial and Poisson distributions

##### Using the rng.poisson() function, draw 10000 samples from a Poisson distribution with a mean of 10.
##### Make a list of the n and p values to consider for the Binomial distribution. Choose n = [20, 100, 1000] and p = [0.5, 0.1, 0.01] so that np is always 10.
##### Using rng.binomial() inside the provided for loop, draw 10000 samples from a Binomial distribution with each n, p pair and print the mean and standard deviation of the samples. There are 3 n, p pairs: (20, 0.5), (100, 0.1), and (1000, 0.01). These can be accessed inside the loop as n[i], p[i].

In [None]:
# Draw 10,000 samples out of Poisson distribution: samples_poisson
samples_poisson = rng.poisson(10, size=10000)

# Print the mean and standard deviation
print('Poisson:     ', np.mean(samples_poisson),
                       np.std(samples_poisson))

# Specify values of n and p to consider for Binomial: n, p
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]


# Draw 10,000 samples for each n,p pair: samples_binomial
for i in range(3):
    samples_binomial = rng.binomial( n[i], p[i], size=10000)
  
    # Print results
    print('n =', n[i], 'Binom:', np.mean(samples_binomial),
                                 np.std(samples_binomial))
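
The sampled means and standard deviations can be compared with the theoretical values. A Binomial distribution has mean np and standard deviation sqrt(np(1 - p)); a Poisson distribution with mean 10 has standard deviation sqrt(10). The sketch below computes these with the standard library only.

```python
from math import isclose, sqrt

# Same n, p pairs as in the exercise
n = [20, 100, 1000]
p = [0.5, 0.1, 0.01]

# Poisson: the variance equals the mean
poisson_mean = 10
poisson_std = sqrt(poisson_mean)
print('Poisson:     ', poisson_mean, poisson_std)

# Binomial: mean n*p and standard deviation sqrt(n*p*(1-p))
binom_stats = []
for n_i, p_i in zip(n, p):
    mean = n_i * p_i
    std = sqrt(n_i * p_i * (1 - p_i))
    binom_stats.append((mean, std))
    print('n =', n_i, 'Binom:', mean, std)
```

The means are all 10, while the Binomial standard deviation rises toward the Poisson value as n grows and p shrinks.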

### Was 2015 anomalous?

##### Draw 10000 samples from a Poisson distribution with a mean of 251/115 and assign to n_nohitters.
##### Determine how many of your samples had a result greater than or equal to 7 and assign to n_large.
##### Compute the probability, p_large, of having 7 or more no-hitters by dividing n_large by the total number of samples (10000).
##### Hit submit to print the probability that you calculated.

In [None]:
# Draw 10,000 samples out of Poisson distribution: n_nohitters
n_nohitters = rng.poisson(251/115, size=10000)

# Compute number of samples that are seven or greater: n_large
n_large = np.sum(n_nohitters >= 7)

# Compute probability of getting seven or more: p_large
p_large = n_large/10000

# Print the result
print('Probability of seven or more no-hitters:', p_large)
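
For comparison, the same tail probability can be computed exactly from the Poisson probability mass function instead of by sampling. The helper name `poisson_pmf` is hypothetical and introduced only for this sketch.

```python
from math import exp, factorial

# Mean number of no-hitters per season, as in the exercise
lam = 251 / 115

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

# P(7 or more) = 1 - P(6 or fewer)
p_large_exact = 1 - sum(poisson_pmf(k, lam) for k in range(7))

print('Exact probability of seven or more no-hitters:', p_large_exact)
```

The sampled estimate should land close to this value, with some scatter because only 10,000 samples are drawn.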


### The Normal PDF

##### Draw 100,000 samples from a Normal distribution that has a mean of 20 and a standard deviation of 1. Do the same for Normal distributions with standard deviations of 3 and 10, each still with a mean of 20. Assign the results to samples_std1, samples_std3 and samples_std10, respectively.
##### Plot a histogram of each of the samples; for each, use 100 bins, also using the keyword arguments density=True and histtype='step'. The latter keyword argument makes the plot look much like the smooth theoretical PDF. You will need to make 3 plt.hist() calls.
##### Hit submit to make a legend, showing which standard deviations you used, and show your plot! There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of PDFs.

In [None]:
# Draw 100000 samples from Normal distribution with stds of interest: samples_std1, samples_std3, samples_std10
samples_std1 = rng.normal(20, 1, size=100000)
samples_std3 = rng.normal(20, 3, size=100000)
samples_std10 = rng.normal(20, 10, size=100000)


# Make histograms
plt.hist(samples_std1, density=True, histtype='step', bins=100)
plt.hist(samples_std3, density=True, histtype='step', bins=100)
plt.hist(samples_std10, density=True, histtype='step', bins=100)

# Make a legend, set limits and show plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'))
plt.ylim(-0.01, 0.42)
plt.show()
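
The y-axis limit of 0.42 makes sense once the peak heights are known. The Normal PDF reaches its maximum at the mean, where its value is 1 / (sigma * sqrt(2 * pi)). The sketch below evaluates that expression for the three standard deviations; `normal_pdf_peak` is a hypothetical helper name.

```python
from math import pi, sqrt

def normal_pdf_peak(sigma):
    """Height of the Normal PDF at its mean for standard deviation sigma."""
    return 1 / (sigma * sqrt(2 * pi))

# Peak height for each standard deviation used in the exercise
peaks = {sigma: normal_pdf_peak(sigma) for sigma in (1, 3, 10)}

for sigma, peak in peaks.items():
    print('std =', sigma, 'peak height =', peak)
```

The tallest curve, for a standard deviation of 1, peaks just under 0.4, so the upper limit of 0.42 leaves a little headroom for sampling noise in the histogram.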


### The Normal CDF

##### Use your ecdf() function to generate x and y values for CDFs: x_std1, y_std1, x_std3, y_std3 and x_std10, y_std10, respectively.
##### Plot all three CDFs as dots (do not forget the marker and linestyle keyword arguments!).
##### Hit submit to make a legend, showing which standard deviations you used, and to show your plot. There is no need to label the axes because we have not defined what is being described by the Normal distribution; we are just looking at shapes of CDFs.

In [None]:
# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)
x_std3, y_std3 = ecdf(samples_std3)
x_std10, y_std10 = ecdf(samples_std10)

# Plot CDFs
plt.plot(x_std1, y_std1, marker='.', linestyle='none')
plt.plot(x_std3, y_std3, marker='.', linestyle='none')
plt.plot(x_std10, y_std10, marker='.', linestyle='none')

# Make a legend and show the plot
_ = plt.legend(('std = 1', 'std = 3', 'std = 10'), loc='lower right')
plt.show()

### Are the Belmont Stakes results Normally distributed?

##### Compute mean and standard deviation of Belmont winners' times with the two outliers removed. The NumPy array belmont_no_outliers has these data.
##### Take 10,000 samples out of a normal distribution with this mean and standard deviation using rng.normal().
##### Compute the CDF of the theoretical samples and the ECDF of the Belmont winners' data, assigning the results to x_theor, y_theor and x, y, respectively.
##### Hit submit to plot the CDF of your samples with the ECDF, label your axes and show the plot.

In [None]:
# Compute mean and standard deviation: mu, sigma
mu = np.mean(belmont_no_outliers)
sigma = np.std(belmont_no_outliers)

# Sample out of a normal distribution with this mu and sigma: samples
samples = rng.normal(mu, sigma, size=10000)

# Get the CDF of the samples and of the data
x_theor, y_theor = ecdf(samples)
x, y = ecdf(belmont_no_outliers)

# Plot the CDFs and show the plot
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.xlabel('Belmont winning time (sec.)')
_ = plt.ylabel('CDF')
plt.show()


### What are the chances of a horse matching or beating Secretariat's record?

##### Take 1,000,000 samples from the normal distribution using the rng.normal() function. The mean mu and standard deviation sigma are already loaded into the namespace of your IPython instance.
##### Compute the fraction of samples that have a time less than or equal to Secretariat's time of 144 seconds.
##### Print the result.

In [None]:
# Take a million samples out of the Normal distribution: samples
samples = rng.normal(mu, sigma, size=1000000)

# Compute the fraction that are at or below 144 seconds: prob
prob = np.sum(samples <= 144) / len(samples)

# Print the result
print('Probability of besting Secretariat:', prob)

### If you have a story, you can simulate it!

##### Define a function with call signature successive_poisson(tau1, tau2, size=1) that samples the total waiting time for a no-hitter followed by hitting the cycle.
##### Draw waiting times (size number of samples) for the no-hitter out of an exponential distribution parametrized by tau1 and assign to t1.
##### Draw waiting times (size number of samples) for hitting the cycle out of an exponential distribution parametrized by tau2 and assign to t2.
##### The function returns the sum of the waiting times for the two events.

In [None]:
def successive_poisson(tau1, tau2, size=1):
    """Compute time for arrival of 2 successive Poisson processes."""
    # Draw samples out of first exponential distribution: t1
    t1 = rng.exponential(tau1, size=size)

    # Draw samples out of second exponential distribution: t2
    t2 = rng.exponential(tau2, size=size)

    return t1 + t2
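
One way to check the function is to compare the sample mean with the expected value: the mean of a sum of two waiting times is tau1 + tau2. The sketch below assumes a seeded generator named `rng` and uses the mean waiting times 764 and 715 that appear in the next exercise.

```python
import numpy as np

# Seeded generator; the seed is an arbitrary choice for reproducibility
rng = np.random.default_rng(42)

def successive_poisson(tau1, tau2, size=1):
    """Compute time for arrival of 2 successive Poisson processes."""
    # rng.exponential() takes the mean waiting time as its first argument
    t1 = rng.exponential(tau1, size=size)
    t2 = rng.exponential(tau2, size=size)
    return t1 + t2

# Draw many samples so the sample mean is close to its expected value
waiting_times = successive_poisson(764, 715, size=100000)

print('Sample mean:  ', np.mean(waiting_times))
print('Expected mean:', 764 + 715)
```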

### Distribution of no-hitters and cycles

##### Use your successive_poisson() function to draw 100,000 samples out of the distribution of waiting times for observing a no-hitter and a hitting of the cycle.
##### Plot the PDF of the waiting times using the step histogram technique of a previous exercise. Don't forget the necessary keyword arguments. You should use bins=100, density=True, and histtype='step'.
##### Label the axes.
##### Show your plot.

In [None]:
# Draw samples of waiting times: waiting_times
waiting_times = successive_poisson(764, 715, size=100000)

# Make the histogram
plt.hist(waiting_times, bins=100, density=True, histtype='step')


# Label axes
plt.xlabel('total waiting time')
plt.ylabel('PDF')

# Show the plot
plt.show()