1\. From sample mean to population mean
---------------------------------------

00:00 - 00:13

Now we're going to study some patterns that we can observe in the sample mean when the sample size becomes larger. These patterns form the basis of the law of large numbers.

2\. Sample mean review
----------------------

00:13 - 00:27

Jakob Bernoulli developed the law of large numbers in his book Ars Conjectandi (1713). The law states that the sample mean tends to the expected value as the sample grows larger.

3\. Sample mean review (Cont.)
------------------------------

00:27 - 00:33

For example, we calculate the sample mean of two values by adding the values and dividing by two.

4\. Sample mean review (Cont.)
------------------------------

00:33 - 00:39

For three values, we add up the values and divide by three.

5\. Sample mean review (Cont.)
------------------------------

00:39 - 00:45

If we have n samples, we add the n values and divide by n.

6\. Sample mean review (Cont.)
------------------------------

00:45 - 00:54

As the sample becomes larger, the sample mean gets nearer to the population mean. Let's code a bit.

7\. Generating the sample
-------------------------

00:54 - 01:34

To generate a sample of coin flips, we will use the binomial distribution. First we import the binom object and the describe method from scipy dot stats, then we generate the sample using binom dot rvs. We specify n as 1 coin flip and p as the probability of success (0.5 for a fair coin), then we specify the sample size as 250 and set random_state so we can reproduce our results. After that, we print the first 100 values from our samples.

8\. Calculating the sample mean
-------------------------------

01:34 - 01:53

To calculate the sample mean we pass the sample to describe and read the mean attribute of the result. We slice the samples from 0 to 10, and we see that for the first 10 values the sample mean is 0.6. Now let's see what this process looks like with an animation.
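The two steps narrated above might look like the following sketch. The seed value 42 is an assumption (the narration only says that random_state is set), so the printed mean will not necessarily match the 0.6 quoted above.

```python
from scipy.stats import binom, describe

# One fair coin flip per trial (n=1, p=0.5), repeated 250 times.
# random_state=42 is an assumed seed; the narration does not give the value.
samples = binom.rvs(n=1, p=0.5, size=250, random_state=42)

# Show the first 100 flips (1 = success, 0 = failure)
print(samples[0:100])

# describe() returns a result object; its .mean attribute is the sample mean
first_ten_mean = describe(samples[0:10]).mean
print(first_ten_mean)
```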

9\. Sample mean of coin flips (Cont.)
-------------------------------------

01:53 - 02:31

In this animation you see how we take the sample mean for values from 2 to 250 using the describe method. The red line represents the population mean, in this case 0.5, and the blue line is the sample mean. As you'll notice, due to the randomness of the data, the sample mean fluctuates around the population mean -- but as more data becomes available, the sample mean approaches the population mean. Let's see another example with the normal distribution.

10\. Sample mean of normal distribution
---------------------------------------

02:31 - 03:19

Now we have three animated plots. At the top left we have our sample data from a normal distribution. We use one dot for each sample. At the top right we've plotted a histogram of the sample data, and at the bottom we've plotted the sample mean. In all the plots the population mean is represented with a black line and the sample mean is drawn using a red line. You can see how the red line moves and gets nearer to the population mean as more data becomes available. Enjoy the animations for a bit, and get some perspective. Now let's move on and learn how to plot the sample mean with Python.

11\. Plotting the sample mean
-----------------------------

03:19 - 03:45

First we import the binom object and describe from scipy dot stats, along with matplotlib dot pyplot as plt. Then we initialize the variables, setting coin_flips to 1, p to 0.5, sample_size to 1000, and averages to an empty list.

12\. Plotting the sample mean (Cont.)
-------------------------------------

03:45 - 04:06

Finally, in a loop where the index i runs over range(2, sample_size + 1), we use describe to calculate the sample mean of the values from position 0 up to i. We store each result in the averages list using append, then we print the first 10 values.
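Here is a sketch of the loop described above. It assumes the sample is drawn once with binom.rvs and a seed of 42, neither of which the narration shows at this step. The np.cumsum line at the end is not from the lesson; it is an equivalent vectorized way to get the same running means.

```python
import numpy as np
from scipy.stats import binom, describe

# Variables named as in the narration
coin_flips, p, sample_size = 1, 0.5, 1000
averages = []

# Assumed step: draw the sample once, with an assumed seed
sample = binom.rvs(n=coin_flips, p=p, size=sample_size, random_state=42)

# Sample mean of the first i values, for i = 2 .. sample_size
for i in range(2, sample_size + 1):
    averages.append(describe(sample[0:i]).mean)

print(averages[0:10])

# Alternative (not from the lesson): running means in one vectorized step
running = np.cumsum(sample) / np.arange(1, sample_size + 1)
```

The loop mirrors the lesson; the vectorized version avoids calling describe a thousand times and may be preferable for larger samples.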

13\. Plotting the sample mean (Cont.)
-------------------------------------

04:06 - 04:20

We add a red line with plt dot axhline at the population mean and plot the averages. Then we add a legend in the upper-right corner and show our plot.

14\. Sample mean plot
---------------------

04:20 - 04:25

The result is this beautiful plot that shows the law of large numbers in action.

15\. Let's practice!
--------------------

04:25 - 04:32

Let's get some hands-on practice with the law of large numbers.

Generating a sample
===================

A hospital's planning department is investigating different treatments for newborns. As a data scientist you are hired to simulate the sex of 250 newborn children, and you are told that on average 50.50% are males.

Instructions
------------

-   Import the `binom` object from `scipy.stats`.
-   Generate a sample of 250 newborns with 50.50% probability of being male.
-   Print the sample.

In [None]:
# Import the binom object
from scipy.stats import binom

# Generate a sample of 250 newborn children
sample = binom.rvs(n=1, p=0.505, size=250, random_state=42)

# Show the sample values
print(sample)

Calculating the sample mean
===========================

Now you can calculate the sample mean for this generated sample by taking some elements from the sample.

Using the `sample` variable you just created, you'll calculate the sample means of the first 10, 50, and 250 samples.

The `binom` object and `describe()` function from `scipy.stats` have been imported for your convenience.

Instructions 1/3
----------------

Print the sample mean of the first 10 samples.

In [None]:
# Print the sample mean of the first 10 samples
print(describe(sample[0:10]).mean)

Instructions 2/3
----------------

-   Print the sample mean of the first 50 samples.

In [None]:
# Print the sample mean of the first 50 samples
print(describe(sample[0:50]).mean)

Instructions 3/3
----------------

-   Print the sample mean of the first 250 samples.

In [None]:
# Print the sample mean of the first 250 samples
print(describe(sample[0:250]).mean)

Plotting the sample mean
========================

Now let's plot the sample mean, so you can see more clearly how it evolves as more data becomes available.

For this exercise we'll again use the sample you generated earlier, which is available in the `sample` variable. The `binom` object and `describe()` function have already been imported for you from `scipy.stats`, and `matplotlib.pyplot` is available as `plt`.

Instructions 1/3
----------------

In a `for` statement for `i` in a range that goes from `2` to `251`, do the following:

-   Calculate the sample mean for the first `i` values.
-   Use `append` to add the value to the `averages` array.

In [None]:
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

Instructions 2/3
----------------

Add a horizontal line at the mean value of the binomial distribution with `n=1` and `p=0.505`.

In [None]:
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')

Instructions 3/3
----------------

Add a legend with labels `Population mean` and `Sample mean` and show the plot.

In [None]:
# Calculate sample mean and store it on averages array
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')

# Add legend
plt.legend(("Population mean","Sample mean"), loc='upper right')
plt.show()

1\. Adding random variables
---------------------------

00:00 - 00:11

The most important result in probability and statistics is the central limit theorem. Let's take a look at what happens when you add random variables.

2\. The central limit theorem (CLT)
-----------------------------------

00:11 - 00:54

The CLT states that the sum of random variables tends to a normal distribution as the number of them grows to infinity. This theorem works under certain conditions: the variables must have the same distribution, and the variables must be independent. You can start adding binomial, geometric, or even Poisson random variables, and as you add more, you get a normal distribution. Recall that random variables are independent when the outcome on one variable does not affect the outcome on the others. Let's see an example.
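Stated a little more formally, using a standard textbook formulation rather than anything taken from the lesson, and with the extra assumption that the variables have a finite variance:

```latex
Let $X_1, \dots, X_n$ be independent, identically distributed random variables
with mean $\mu$ and finite variance $\sigma^2 > 0$, and let
$S_n = X_1 + \dots + X_n$. Then
\[
  \frac{S_n - n\mu}{\sigma\sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
  \quad \text{as } n \to \infty .
\]
```

Dividing $S_n$ by $n$ gives the sample mean, which is why the same bell shape shows up for sample means.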

3\. Poisson sample generation
-----------------------------

00:54 - 01:22

In an example we saw previously about a busy highway with two accidents per day on average, we modeled the number of accidents per day with a Poisson random variable. Now imagine we have the data from 1,000 days. In the following animation you can see on the left the values of our population, and on the right you can see the histogram of the population values. This is our data.

4\. Selection from population
-----------------------------

01:22 - 01:45

Now we are going to take 10 values from our population many times, so we can calculate the sample mean of those values. Notice the red dots. Recall that when calculating the sample mean we are adding the values and dividing by their count, and the central limit theorem applies to the sum of independent random variables that are identically distributed.

5\. Selection from population (Cont.)
-------------------------------------

01:45 - 01:57

Notice the histogram of the population -- it's skewed! Now we are going to repeat this process 350 times and see the outcome.

6\. Poisson sample mean plot
----------------------------

01:57 - 02:39

Take a look at these animations. At the top we have the population. We're highlighting in red the 10 randomly selected values used to calculate the sample means, and plotting those values. At the bottom left we're plotting the sample means, and at the bottom right is a histogram of the sample means. Notice that as we calculate more sample means from our population the histogram is centered at 2, which is the mean of our population, and the histogram takes on a bell shape. That is the magic of the central limit theorem. Now let's code this important result.

7\. Poisson population plot
---------------------------

02:39 - 03:11

First we import poisson and describe from scipy dot stats. Then, from matplotlib we import pyplot as plt, and we import numpy as np. We generate our population with poisson dot rvs with mu equals 2, size equals 1000, and the random_state seed set to reproduce our results. Now we can plot a histogram of our population.

8\. Poisson population plot (Cont.)
-----------------------------------

03:11 - 03:17

This is the plot. It's the skewed histogram of our Poisson data. Next, let's plot the sample means.

9\. Sample means plot
---------------------

03:17 - 03:50

We first fix our random seed to make the results reproducible. We define an empty list called sample_means to store the sample mean values. Then we write a for statement that loops an arbitrarily chosen large number of times, in this case 350. On each pass we select 10 values from our population using np dot random dot choice, and then we append the sample mean of the 10 values to the sample_means list.

10\. Sample means plot (Cont.)
------------------------------

03:50 - 04:08

Outside the for statement, we add labels and a title to the plot. Finally, we plot and show the histogram. We get a plot centered at 2, which is the mean of the population, with a bell shape as we expected.
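The steps narrated in the last few slides can be sketched as follows. The seed values are assumptions, since the narration does not state them, and the plotting calls are left out so the sketch stays focused on the sampling.

```python
import numpy as np
from scipy.stats import poisson, describe

# Population: 1,000 days of accident counts, 2 per day on average.
# Both seeds (42) are assumptions; the narration does not give them.
population = poisson.rvs(mu=2, size=1000, random_state=42)

np.random.seed(42)
sample_means = []
for _ in range(350):
    # Draw 10 values from the population
    # (np.random.choice samples with replacement by default)
    sample = np.random.choice(population, 10)
    sample_means.append(describe(sample).mean)

# The sample means should cluster around the population mean (about 2)
print(np.mean(population), np.mean(sample_means))
```

Passing sample_means to plt.hist, as the narration describes, would then show the bell shape.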

11\. Let's add random variables
-------------------------------

04:08 - 04:29

We've finished with the most important results in probability and statistics. After exercising a bit with the central limit theorem, we will work on two applications of probability in data science, linear regression and logistic regression. Let's add random variables!

Sample means
============

An important result in probability and statistics is that the distribution of the mean of random variables tends to a normal distribution as more variables are added. This holds for variables with **any** distribution, as long as they are independent and share the same expected value and variance.

For your convenience, we've loaded `binom` and `describe()` from the `scipy.stats` library and imported `matplotlib.pyplot` as `plt` and `numpy` as `np`. We generated a simulated population with size 1,000 that follows a binomial distribution for 10 fair coin flips and is available in the `population` variable.

Instructions 1/4
----------------

Select 20 random values from the `population` using `np.random.choice()`.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)

Instructions 2/4
----------------

Calculate the sample mean of `sample` and add the calculated sample mean to the `sample_means` list.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

Instructions 3/4
----------------

Plot a histogram of the `sample_means` list.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.xlabel("Sample mean values")
plt.ylabel("Frequency")
plt.show()

Instructions 4/4
----------------

Question
--------

Inspecting the plot, what is the distribution of the sample mean?

### Possible answers

Same as the generated sample

Binomial

[x] Normal

Sample means follow a normal distribution
=========================================

In the previous exercise, we generated a population that followed a binomial distribution, chose 20 random samples from the population, and calculated the sample mean. Now we're going to test some other probability distributions to see the shape of the sample means.

From the `scipy.stats` library, we've loaded the `poisson` and `geom` objects and the `describe()` function. We've also imported `matplotlib.pyplot` as `plt` and `numpy` as `np`.

As you'll see, the shape of the distribution of the means is the same even though the samples are generated from different distributions.

Instructions 1/2
----------------

Select 20 values from the population, add the sample mean to the `sample_means` list, and plot a histogram.

In [None]:
# Generate the population
population = geom.rvs(p=0.5, size=1000)

# Create list for sample means
sample_means = []
for _ in range(3000):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.show()

Instructions 2/2
----------------

-   Repeat the process with a population that follows a Poisson distribution: select 20 values from the population, add the sample mean to the `sample_means` list, and plot a histogram.

In [None]:
# Generate the population
population = poisson.rvs(mu=2, size=1000)

# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.show()

Adding dice rolls
=================

To illustrate the central limit theorem, we are going to work with dice rolls. We'll generate the samples and then add them to plot the outcome.

You're provided with a function named `roll_dice()` that will generate the sample dice rolls. `numpy` is already imported as `np` for your convenience: you have to use `np.add(sample1, sample2)` to add samples. Also, `matplotlib.pyplot` is imported as `plt` so you can plot the histograms.

Instructions 1/3
----------------

Generate a sample of `2000` dice rolls using `roll_dice()` and plot a histogram of the sample.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate the sample
sample1 = roll_dice(num_rolls=2000)

# Plot the sample
plt.hist(sample1, bins=range(1, 8), width=0.9)
plt.show()   

Instructions 2/3
----------------

-   Generate a second sample of `2000` dice rolls, add `sample1` and `sample2` using `np.add()`, store the result in the variable `sum_of_1_and_2`, then plot `sum_of_1_and_2`.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate two samples of 2000 dice rolls
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)

# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)

# Plot the sum
plt.hist(sum_of_1_and_2, bins=range(2, 14), width=0.9)
plt.show()

Instructions 3/3
----------------

-   Generate a third sample of `2000` dice rolls, add it to `sum_of_1_and_2` using `np.add()`, store the result in the variable `sum_of_3_samples`, then plot `sum_of_3_samples`.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate the samples
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)
sample3 = roll_dice(2000)  # Generate the third sample

# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)

# Add the first two with the third sample
sum_of_3_samples = np.add(sum_of_1_and_2, sample3)

# Plot the result
plt.hist(sum_of_3_samples, bins=range(3, 20), width=0.9)
plt.show()

1\. Linear regression
---------------------

00:00 - 00:12

We've already finished our core content for this course, but before closing, we will take a quick look at linear regression and logistic models as applications of probability and statistics in data science.

2\. Linear functions
--------------------

00:12 - 00:25

Let's start with a linear function. A linear function describes a relationship between an independent variable x and a dependent variable y in which y changes at a constant rate as x changes, so it is represented by a line.

3\. Linear function parameters
------------------------------

00:25 - 00:51

The relationship is expressed with two parameters, the slope and the intercept value. When x equals 0, if we apply the line formula we get the intercept value. In our example, the slope is 1.5 and the intercept is 10. Now consider what would happen if we were to add a random number to the value of the function.

4\. Linear function with random perturbations
---------------------------------------------

00:51 - 01:01

Our values are not on the line anymore. Some real data has similar behavior. Now let's go backwards.
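To make this concrete, here is a small sketch of the line from the example (slope 1.5, intercept 10) with a random number added to each value. The noise scale, the seed, and the range of x values are illustrative assumptions.

```python
import numpy as np

slope, intercept = 1.5, 10  # values from the example above

def line(x):
    """Linear function y = slope * x + intercept."""
    return slope * x + intercept

x = np.arange(0, 10)
y_exact = line(x)

# Add a random perturbation to each value; the noise scale (2.0) and
# the seed are illustrative assumptions
rng = np.random.default_rng(0)
y_noisy = y_exact + rng.normal(loc=0.0, scale=2.0, size=x.size)

print(line(0))             # the intercept: value of the function at x = 0
print(y_noisy - y_exact)   # how far each perturbed value is from the line
```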

5\. Start from the data and find a model that fits
--------------------------------------------------

01:01 - 01:21

Imagine we have data that shows the relationship between hours of study and students' scores on a test. You can see in the plot that when the hours of study increase, the scores also increase. The idea is to determine if a linear model fits.

6\. What model will fit the data?
---------------------------------

01:21 - 01:31

We might ask ourselves a few questions, like: What model will fit the data? What criteria can we use to determine which is the best model?

7\. What model will fit the data? (Cont.)
-----------------------------------------

01:31 - 01:40

What are the parameters of such a model? Let's assume that the model is linear and try to answer the other two questions.

8\. Residuals of the model
--------------------------

01:40 - 02:00

In the plot, the data are the blue dots and the green line is a linear model. The red lines are the distances between the data and the model: they represent the difference between each data point and the model's prediction. All the red lines together are the residuals of the linear model.

9\. Minimizing residuals
------------------------

02:00 - 02:20

If we calculate the residuals and add them up (in practice the squared residuals are summed, so that positive and negative differences do not cancel out), we can start looking for the slope and intercept that minimize the total. That is the foundation for many models in data science: we look for the model parameters that minimize the distance between the model and the data.
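As a sketch of that idea, the snippet below uses ordinary least squares, which minimizes the sum of squared residuals and has a closed-form solution for a single predictor. The data are hypothetical, not the course dataset, and sum_squared_residuals is a helper introduced only for this illustration.

```python
import numpy as np

# Hypothetical data: hours of study and test scores (not the course dataset)
hours = np.array([4.0, 7.0, 9.0, 11.0, 14.0, 16.0])
scores = np.array([58.0, 63.0, 66.0, 68.0, 74.0, 77.0])

def sum_squared_residuals(slope, intercept):
    """Total squared distance between the data and the line."""
    residuals = scores - (slope * hours + intercept)
    return np.sum(residuals ** 2)

# Closed-form least-squares estimates for one predictor
hours_centered = hours - hours.mean()
slope = np.sum(hours_centered * (scores - scores.mean())) / np.sum(hours_centered ** 2)
intercept = scores.mean() - slope * hours.mean()

print(slope, intercept, sum_squared_residuals(slope, intercept))
```

Nudging either parameter away from these values should only increase the sum of squared residuals.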

10\. Probability and statistics in action
-----------------------------------------

02:20 - 02:42

An interesting link between probability and the linear model is that to apply this model to data you must study the distribution of the residuals and its variance. The distribution of the residuals should be normal with constant variance. Otherwise, the linear model is not a good fit. Let's code a bit.

11\. Calculating linear model parameters
----------------------------------------

02:42 - 03:13

To get the parameters from a model we will use the LinearRegression class from sklearn dot linear_model. We use the provided data for hours of study and scores. Then we get the slope and intercept in model dot coef_ and model dot intercept_. In our case the slope is 1.5 and the intercept is 52.45. Now let's predict with our model.

12\. Predicting scores based on hours of study
----------------------------------------------

03:13 - 03:40

After fitting the model, we can predict the scores based on hours of study. If we want to predict the score for someone who studies a certain number of hours, we call model dot predict and pass an array with the values we want to evaluate. For 15 hours we get 74.90 as the predicted score. Now let's plot our model.
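The fitting and prediction steps narrated above might look like this sketch. The data are hypothetical stand-ins for the provided hours_of_study and scores, so the printed numbers will differ from the ones quoted in the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (not the course dataset);
# scikit-learn expects the features as a 2-D array
hours_of_study = np.array([4, 7, 9, 11, 14, 16]).reshape(-1, 1)
scores = np.array([58.0, 63.0, 66.0, 68.0, 74.0, 77.0])

model = LinearRegression()
model.fit(hours_of_study, scores)

# coef_ holds one slope per feature; intercept_ is a single number here
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# Predict the score for 15 hours of study
predicted = model.predict(np.array([[15]]))
print(predicted)
```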

13\. Plotting the linear model
------------------------------

03:40 - 04:05

We first import matplotlib dot pyplot as plt. We use plt dot scatter to plot the data in hours_of_study and scores, and we use plt dot plot to plot the provided values and model dot predict to generate the predicted scores. Then we show our plot.

14\. Plot the linear model (Cont.)
----------------------------------

04:05 - 04:13

The result is this plot with a linear relation between hours of study and scores, with minimal error.

15\. Let's practice with linear models
--------------------------------------

04:13 - 04:19

Now, let's move on and practice with linear models.

Fitting a model
===============

A university has provided you with data that shows a relationship between the hours of study and the scores that the students get on a given test.

You have access to the data through the variables `hours_of_study` and `scores`. Use a linear model to learn from the data.

Instructions
------------

-   Import the `linregress()` function from `scipy.stats`.
-   Fit a linear model using the provided data in the `hours_of_study` and `scores` variables.
-   Print the parameters.

In [None]:
# Import the linregress() function
from scipy.stats import linregress

# Get the model parameters
slope, intercept, r_value, p_value, std_err = linregress(hours_of_study, scores)


# Print the linear model parameters
print('slope:', slope)
print('intercept:', intercept)

Predicting test scores
======================

With the relationship between the hours of study and the scores that students got on a given test, you already got the parameters of a linear model, `slope` and `intercept`. With those parameters, let's predict the test score for a student who studies for 10 hours.

For this exercise, the `linregress()` function has been imported for you from `scipy.stats`.

Instructions 1/3
----------------

Predict the test score for `10` hours of study using the provided parameters in `slope` and `intercept`, then print the score.

In [None]:
# Get the predicted test score for given hours of study
score = slope*10 + intercept
print('score:', score)

Instructions 2/3
----------------

-   Now predict the score for `9` hours of study using the parameters in `slope` and `intercept`, then print the score.

In [None]:
# Get the predicted test score for given hours of study
score = slope*9 + intercept
print('score:', score)
score: 63.70809994642248

Instructions 3/3
----------------

-   Finally, predict the score for `12` hours of study using the parameters in `slope` and `intercept`, then print the score.

In [None]:
# Get the predicted test score for given hours of study
score = slope*12 + intercept
print('score:', score)

Studying residuals
==================

To implement a linear model you must study the **residuals**, which are the distances between the predicted outcomes and the data.

Three conditions must be met:

1.  The mean should be 0.
2.  The variance must be constant.
3.  The distribution must be normal.

We will work with data of test scores for two schools, A and B, on the same subject. `model_A` and `model_B` were fitted with `hours_of_study_A` and `test_scores_A` and `hours_of_study_B` and `test_scores_B`, respectively.

`matplotlib.pyplot` has been imported as `plt`, `numpy` as `np` and `LinearRegression` from `sklearn.linear_model`.

Instructions 1/4
----------------

Make a scatter of `hours_of_study_A` and `test_scores_A` and plot `hours_of_study_values_A` and the outcomes from `model_A`.

In [None]:
# Scatterplot of hours of study and test scores
plt.scatter(hours_of_study_A, test_scores_A)

# Plot of hours_of_study_values_A and predicted values
plt.plot(hours_of_study_values_A, model_A.predict(hours_of_study_values_A))
plt.title("Model A", fontsize=25)
plt.show()

Instructions 2/4
----------------

-   Subtract the predicted values and `test_scores_A`, then make a scatterplot with `hours_of_study_A` and `residuals_A`.

In [None]:
# Calculate the residuals
residuals_A = model_A.predict(hours_of_study_A) - test_scores_A

# Make a scatterplot of residuals of model_A
plt.scatter(hours_of_study_A, residuals_A)

# Add reference line and title and show plot
plt.hlines(0, 0, 30, colors='r', linestyles='--')
plt.title("Residuals plot of Model A", fontsize=25)
plt.show()

Instructions 3/4
----------------

-   Make a scatter of `hours_of_study_B` and `test_scores_B` and plot `hours_of_study_values_B` and the outcomes from `model_B`.

In [None]:
# Scatterplot of hours of study and test scores
plt.scatter(hours_of_study_B, test_scores_B)

# Plot of hours_of_study_values_B and predicted values
plt.plot(hours_of_study_values_B, model_B.predict(hours_of_study_values_B))
plt.title("Model B", fontsize=25)
plt.show()

Instructions 4/4
----------------

-   Subtract the predicted values and `test_scores_B`, then make a scatterplot with `hours_of_study_B` and `residuals_B`.

In [None]:
# Calculate the residuals
residuals_B = model_B.predict(hours_of_study_B) - test_scores_B

# Make a scatterplot of residuals of model_B
plt.scatter(hours_of_study_B, residuals_B)

# Add reference line and title and show plot
plt.hlines(0, 0, 30, colors='r', linestyles='--')
plt.title("Residuals plot of Model B", fontsize=25)
plt.show()

1\. Logistic regression
-----------------------

00:00 - 00:27

This is our final lesson of the course -- we're going to work with the logistic regression model. Suppose a university has provided you with data on students' test scores and the hours of study they put in before the test. The logistic model will allow you to classify and predict based on the hours of study how likely it is that a student will pass the test. Let's get started!

2\. Original data
-----------------

00:27 - 00:45

A plot of a sample of the original data looks like this. Now, for this first exercise, let's say that the data provided is not the actual scores but, based on the hours of study, whether a student passed or failed the test.

3\. New data
------------

00:45 - 00:53

So, for each student you have the hours of study and only two possible values, pass or fail.

4\. Where would you draw the line?
----------------------------------

00:53 - 01:13

In this case, where would you draw the line to classify between pass or fail based on the hours of study? You could put it at 10 or 11 or even 12, but depending on where you draw the line you will have some misclassified values.

5\. Solution based on probability
---------------------------------

01:13 - 01:32

To draw the line, we need a function that will provide probabilities based on the hours of study. So, the model will provide the probability, which is a value between 0 and 1 for each value of hours of study. This means we have to change our scale.

6\. The logistic function
-------------------------

01:32 - 01:54

The function we need is the logistic function, also called the sigmoid. This function:

-   Returns values between 0 and 1.
-   Takes its input from a linear model defined by the slope and intercept.

So, we pass it a linear model and the logistic function returns probabilities.

7\. Changing the slope
----------------------

01:54 - 02:15

From now on we will call the slope of the linear model beta1 and the intercept beta0. If we study the effect of the parameters we can see that increasing beta1 (the slope) will make the logistic function steeper, or more aggressive to classify at a certain value of x.

8\. Changing the intercept
--------------------------

02:15 - 02:27

On the other hand, adjusting the parameter beta0 (intercept), will translate the function left or right on the x-axis. This is how you draw the line.
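The effect of both parameters can be checked with a small sketch of the logistic function. The values of beta0 and beta1 here are illustrative assumptions, not fitted parameters.

```python
import math

def logistic(x, beta0, beta1):
    """Logistic (sigmoid) function applied to the linear model beta0 + beta1 * x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

# Illustrative parameters (assumptions, not fitted values)
beta0, beta1 = -11.0, 1.0

# The curve crosses 0.5 where beta0 + beta1 * x = 0, that is at x = -beta0 / beta1,
# so changing beta0 moves this crossing point along the x-axis
midpoint = -beta0 / beta1
print(midpoint, logistic(midpoint, beta0, beta1))

# A larger beta1 (with the midpoint held fixed) gives a steeper curve:
# one hour past the midpoint, the probability is already much closer to 1
print(logistic(midpoint + 1, beta0, beta1))
print(logistic(midpoint + 1, 3 * beta0, 3 * beta1))
```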

9\. From data to probability
----------------------------

02:27 - 02:36

So, for each value we will get the probability of passing the test based on the hours of study applying the logistic model parameters.

10\. Outcomes
-------------

02:36 - 02:50

We can say that if the probability is higher than 0.5 we will consider that the outcome is a pass, and otherwise it's a fail. You can see the predicted outcomes of the model in red.

11\. Misclassifications
-----------------------

02:50 - 03:19

But if we compare the model's predictions with the actual outcomes, we can see that there are some misclassifications between 11 and 12 hours of study. Based on the model, we can say that if a student studies less than 10 hours the probability of them passing the test is very low, and if they study 13 hours or more they have a high probability of passing the test. Now let's code a bit.

12\. Logistic regression
------------------------

03:19 - 04:03

To run a logistic regression model we will use scikit-learn -- in particular, the LogisticRegression class. We create our model with LogisticRegression and pass the C parameter as 1e9. C is the inverse of the regularization strength, so a value this large makes the regularization negligible and the fitted parameters are essentially the unregularized estimates. Then we call model.fit with our data. We create variables to get the parameters from model dot coef_ and model dot intercept_. The parameters from the model are arrays, so we extract the values from the arrays. Finally, we print the values.

13\. Predicting outcomes based on hours of study
------------------------------------------------

04:03 - 04:25

If we want to predict the outcome based on a provided number of hours of study, we pass the hours of study to model dot predict and we get the predicted outcome. Notice that the outcomes are an array, so we can pass many values to test and we'll get an array with the outcome for each value provided.

14\. Probability calculation
----------------------------

04:25 - 04:47

If you instead are curious about the probability of passing with a particular number of hours of study, you can use model dot predict_proba. You pass it an array with the values you want to calculate. For 9 hours, we have approximately 0.05 probability of passing.
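Putting the last three slides together, a sketch of fitting, predicting, and getting probabilities could look like this. The data are hypothetical (1 means pass, 0 means fail), so the probability printed for 9 hours will not match the 0.05 quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data (not the course dataset): 1 = pass, 0 = fail
hours_of_study = np.array([6, 7, 8, 9, 10, 11, 11, 12, 12, 13, 14, 15]).reshape(-1, 1)
outcomes = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# A very large C makes the regularization penalty negligible
model = LogisticRegression(C=1e9)
model.fit(hours_of_study, outcomes)

# coef_ is a 2-D array and intercept_ a 1-D array, so index into them
beta1 = model.coef_[0][0]
beta0 = model.intercept_[0]
print(beta1, beta0)

# Predicted classes for several values at once
print(model.predict(np.array([[9], [11], [13]])))

# predict_proba returns one row per input: [P(fail), P(pass)]
prob_pass = model.predict_proba(np.array([[9]]))[:, 1]
print(prob_pass)
```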

15\. Let's practice!
--------------------

04:47 - 04:53

It's been great working with the logistic model -- now let's practice some more.

Fitting a logistic model
========================

The university studying the relationship between hours of study and outcomes on a given test has provided you with a dataset containing the number of hours the students studied and whether they failed or passed the test, and asked you to fit a model to predict future performance.

The data is provided in the variables `hours_of_study` and `outcomes`. Use this data to fit a `LogisticRegression` model. `numpy` has been imported as `np` for your convenience.

Instructions
------------

-   Import `LogisticRegression` from `sklearn.linear_model`.
-   Create the model using `LogisticRegression(C=1e9)`.
-   Pass the data to the `model.fit()` method.
-   Create variables for each parameter, assign the values from the model, and print the parameters `beta1` and `beta0`.

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# sklearn logistic model
model = LogisticRegression(C=1e9)
model.fit(hours_of_study, outcomes)

# Get parameters
beta1 = model.coef_[0][0]
beta0 = model.intercept_[0]

# Print parameters
print(beta1, beta0)

Predicting if students will pass
================================

In the previous exercise you calculated the parameters of the logistic regression model that fits the data of hours of study and test outcomes.

With those parameters you can predict the performance of students based on their hours of study. Use `model.predict()` to get the outcomes based on the logistic regression.

For your convenience, `LogisticRegression` has been imported from `sklearn.linear_model` and `numpy` has been imported as `np`.

Instructions
------------

-   Create an array with the values 10, 11, 12, 13, and 14 to predict the outcomes for a test based on those numbers of hours of study.
-   Use `model.predict()` to get the outcomes from the model, and print the outcomes.
-   Use `model.predict_proba()` to get the probability of passing the test with 11 hours of study.

In [None]:
# Specify values to predict
hours_of_study_test = [[10], [11], [12], [13], [14]]

# Pass values to predict
predicted_outcomes = model.predict(hours_of_study_test)
print(predicted_outcomes)

# Set value in array
value = np.asarray(11).reshape(-1,1)
# Probability of passing the test with 11 hours of study
print("Probability of passing test ", model.predict_proba(value)[:,1])

Passing two tests
=================

Put yourself in the shoes of one of the university students. You have two tests coming up in different subjects, and you're running out of time to study. You want to know how much time you have to study each subject to maximize the probability of passing both tests. Fortunately, there's data that you can use.

For subject A, you already fitted a logistic model in `model_A`, and for subject B you fitted a model in `model_B`. `LogisticRegression` from `sklearn.linear_model` and `numpy` as `np` are preloaded, and `expit()`, the logistic function (the inverse of the logit), has been imported for you from `scipy.special`.

Instructions 1/4
----------------

Use `model_A` to predict if you'll pass the test with 6, 7, 8, 9, or 10 hours of study and `model_B` with 3, 4, 5, or 6.

In [None]:
# Specify values to predict
hours_of_study_test_A = [[6], [7], [8], [9], [10]]

# Pass values to predict
predicted_outcomes_A = model_A.predict(hours_of_study_test_A)
print(predicted_outcomes_A)

# Specify values to predict
hours_of_study_test_B = [[3], [4], [5], [6]]

# Pass values to predict
predicted_outcomes_B = model_B.predict(hours_of_study_test_B)
print(predicted_outcomes_B)

Instructions 2/4
----------------

-   Get the probability of passing for test A with 8.6 hours of study and test B with 4.7 hours of study.

In [None]:
# Set value in array
value_A = np.asarray([8.6]).reshape(-1,1)
# Probability of passing test A with 8.6 hours of study
print("The probability of passing test A with 8.6 hours of study is ", model_A.predict_proba(value_A)[:,1])


# Set value in array
value_B = np.asarray([4.7]).reshape(-1,1)
# Probability of passing test B with 4.7 hours of study
print("The probability of passing test B with 4.7 hours of study is ", model_B.predict_proba(value_B)[:,1])

Instructions 3/4
----------------

-   Calculate the hours you need to study to have 0.5 probability of passing each test using the formula `-intercept/slope`.

In [None]:
# Print the hours required to have 0.5 probability on model_A
print("Minimum hours of study for test A are ", -model_A.intercept_/model_A.coef_)

# Print the hours required to have 0.5 probability on model_B
print("Minimum hours of study for test B are ", -model_B.intercept_/model_B.coef_)
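
The `-intercept/slope` expression used above follows from setting the logistic model's probability equal to 0.5. Writing $\beta_0$ for the intercept and $\beta_1$ for the slope (the names used for the parameters in the first exercise), a short derivation is:

$$
p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} = 0.5
\;\iff\; e^{-(\beta_0 + \beta_1 x)} = 1
\;\iff\; \beta_0 + \beta_1 x = 0
\;\iff\; x = -\frac{\beta_0}{\beta_1}
$$

When $\beta_1$ is positive, any number of hours above this value gives a probability of passing greater than 0.5, which is why the printed messages call it the minimum hours of study.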

Instructions 4/4
----------------

-   Using the `study_hours_A` and `study_hours_B` arrays, calculate the probability of passing each test and the joint probability of passing both tests, then find the study hours that give the maximum joint probability.

In [None]:
# Probability calculation for each value of study_hours
prob_passing_A = model_A.predict_proba(study_hours_A.reshape(-1,1))[:,1]
prob_passing_B = model_B.predict_proba(study_hours_B.reshape(-1,1))[:,1]

# Calculate the probability of passing both tests
prob_passing_A_and_B = prob_passing_A * prob_passing_B

# Maximum probability value
max_prob = max(prob_passing_A_and_B)

# Position where we get the maximum value
max_position = np.where(prob_passing_A_and_B == max_prob)[0][0]

# Study hours for each test
print("Study {:1.0f} hours for the first and {:1.0f} hours for the second test and you will pass both tests with {:01.2f} probability.".format(study_hours_A[max_position], study_hours_B[max_position], max_prob))
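
Because `model_A`, `model_B`, `study_hours_A`, and `study_hours_B` are preloaded in the exercise, the cell above does not run on its own. The following is a self-contained sketch of the same idea under stated assumptions: the intercepts and slopes are invented for illustration, a total budget of 14 hours is assumed to be split between the two subjects, and, as in the exercise code, the two tests are treated as independent so that the joint probability is the product.

```python
import numpy as np
from scipy.special import expit

# Hypothetical parameters (intercept, slope) for each subject; not fitted from course data
beta0_A, beta1_A = -12.0, 1.5
beta0_B, beta1_B = -6.0, 1.4

# Assumption: 14 hours in total, every whole-hour split between subjects A and B
total_hours = 14
study_hours_A = np.arange(0, total_hours + 1)
study_hours_B = total_hours - study_hours_A

# Probability of passing each test, using the logistic function directly
prob_passing_A = expit(beta0_A + beta1_A * study_hours_A)
prob_passing_B = expit(beta0_B + beta1_B * study_hours_B)

# Treating the tests as independent, the joint probability is the product
prob_passing_A_and_B = prob_passing_A * prob_passing_B

# np.argmax gives the position of the first maximum value
max_position = np.argmax(prob_passing_A_and_B)
max_prob = prob_passing_A_and_B[max_position]

print("Study {:1.0f} hours for A and {:1.0f} hours for B: joint probability {:01.2f}".format(
    study_hours_A[max_position], study_hours_B[max_position], max_prob))
```

Using `np.argmax` is a small design choice: it returns the position directly, in place of the `max` plus `np.where` pair used in the exercise.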

1\. Wrapping up
---------------

00:00 - 00:07

Congratulations, you made it to the end of the course! We've covered a lot of ground.

2\. Fundamental concepts
------------------------

00:07 - 00:26

We learned to simulate random variables in order to review the fundamental concepts of probability, such as density, distributions, expected values, sample means, variance, joint probability of dependent and independent events, and conditional probability using Bayes' rule.

3\. Important probability distributions
---------------------------------------

00:26 - 00:35

We reviewed some of the most important probability distributions, such as binomial, geometric, Poisson, and normal distributions.

4\. The most important results
------------------------------

00:35 - 00:44

We also studied the most important laws of probability: the law of large numbers and the central limit theorem.

5\. Linear and logistic regression
----------------------------------

00:44 - 01:05

Finally, we applied those concepts in data science and reviewed linear regression and logistic regression, fitting models, and predicting. There's a lot more to study, and other DataCamp courses go into much more depth on many of the topics we've covered.

6\. Keep learning at DataCamp!
------------------------------

01:05 - 01:40

To really understand and get to grips with all the methods you've seen in this course, it's important to apply them to real data. Only with practice will you truly master the content. By relating these methods to real-life situations you can gain a deeper understanding of how they apply in your own context. There are many more courses at DataCamp you can work through to deepen your understanding of probability and statistics. Keep going and good luck!