# Example: Normal Houseflies

This dataset lists data collected on the length of housefly wings.

1. Run the code and consider the following questions.

2. The console displays the values for the mean, the median, and the standard deviation. What relationship between these three values tells us that the data may fit a normal distribution curve?

3. Using the likelihood ranges of a normal distribution curve, list at least three “stories” that you can tell using the graph shown.
Example: 99.7% of houseflies have wings with a length between 3.3 and 5.8 mm.
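
The likelihood ranges come from the empirical rule: about 68%, 95%, and 99.7% of normally distributed values lie within one, two, and three standard deviations of the mean. The sketch below computes those ranges with `norm.cdf`. The `mean` and `std` values and the `empirical_ranges` helper are placeholders for illustration, not values from `flies.csv`; substitute the numbers printed by the program.

```python
from scipy.stats import norm

# Hypothetical placeholder values, not taken from flies.csv
mean = 45.0
std = 4.0

def empirical_ranges(mean, std):
    """Return (k, low, high, coverage) for k = 1, 2, 3 standard deviations."""
    ranges = []
    for k in (1, 2, 3):
        low = mean - k * std
        high = mean + k * std
        # Share of a normal distribution that lies between low and high
        coverage = norm.cdf(high, mean, std) - norm.cdf(low, mean, std)
        ranges.append((k, low, high, coverage))
    return ranges

for k, low, high, coverage in empirical_ranges(mean, std):
    print(f"{coverage:.1%} of values fall between {low:.1f} and {high:.1f}")
```

Each printed line can be turned into a "story" by adding the subject and the units from the graph.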

Data Source: https://seattlecentral.edu/qelp/sets/057/057.html

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

"""Plot the data"""

# Import the data
df = pd.read_csv(r"flies.csv")
pd.set_option("display.max_columns", None)

# Set data to be the values in the wing_length column
data = df.wing_length

# Add labels to the graph
plt.grid(True)
plt.title("Housefly Wing Lengths", fontsize=18)
plt.xlabel("Length (0.1mm)")
plt.ylabel("Distribution")

# Plot the histogram (w/density)
plt.hist(data, bins=10, density=True)

"""Plot the Normal Distribution Curve"""

# Determine the mean, median and std
mean = data.mean()
median = data.median()
std = data.std()

print("Mean: " + str(mean))
print("Median: " + str(median))
print("STD: " + str(std))

# Set up min and max of the x-axis using the mean and standard deviation
xmin = mean - 3 * std
xmax = mean + 3 * std

# Define the x-axis values
x = np.linspace(xmin, xmax) 

# "Norm" the y-axis values based on the x-axis values, the mean and the std
y = norm.pdf(x, mean, std)

# Plot the graph using the x and the y values
plt.plot(x, y, color="orange", linewidth=2) 

plt.show()

# Problem 1 - SAT Scores

This dataset represents students’ scores on the verbal section of the SAT in 2012.

1. Plot a histogram that represents the data.

2. Does the shape of the histogram seem to be symmetrical?

3. Check the mean and the median of the data. What do these values tell you about the distribution of the data?

4. Plot the normal distribution curve on top of the histogram.

5. How well (or not well) does it fit the data?

6. List at least 3 stories that can be told using the visualization and your interpretation.
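
As a rough illustration of why the mean and median are compared: in a symmetric dataset the two are close together, while a long tail pulls the mean toward it. The scores below are made up for this sketch and are not taken from `sat.csv`; `mean_median_gap` is a hypothetical helper.

```python
import pandas as pd

# Made-up scores for illustration only
symmetric = pd.Series([400, 450, 500, 550, 600])
right_skewed = pd.Series([400, 410, 420, 430, 800])

def mean_median_gap(values):
    """Return mean minus median; a gap near zero suggests symmetry."""
    return values.mean() - values.median()

print("Symmetric gap:", mean_median_gap(symmetric))
print("Right-skewed gap:", mean_median_gap(right_skewed))
```

A small gap does not prove the data is normal, but a large gap is a quick sign that a normal curve may fit poorly.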

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

"""Plot the data"""

# Import the data
df = pd.read_csv(r"sat.csv")
pd.set_option("display.max_columns", None)

# Set data to be the values in the verbal_scores column


# Add labels to the graph


# Plot the histogram (w/density)


"""Plot the Normal Distribution Curve"""

# Determine the mean, median and std


# Set up min and max of the x-axis using the mean and standard deviation


# Define the x-axis values


# "Norm" the y-axis values based on the x-axis values, the mean and the std


# Plot the graph using the x and the y values
 

plt.show()

# Example: Likelihood of a Fly
This example calculates the probability density function (PDF) as well as the cumulative distribution function (CDF) at a specific value.

The visualization is included to help with understanding the PDF and CDF. It will not change as you explore the example.

1. Find the x_value variable. What should the PDF be for a wing length of 50? Run the code to check your answer.

2. What is the likelihood of finding a fly with a wing length less than 50? How about more than 50?

3. Change the x_value to 54. What should the PDF be for this value? Run the code to check your answer.

4. What is the likelihood of finding a fly with a wing length less than 54? How about more than 54?

5. Try a few other values to see their corresponding likelihoods.
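
The sketch below shows how the three quantities relate, assuming a normal model with placeholder `mean` and `std` values (not the ones from `flies.csv`). `norm.pdf` gives the height of the curve, which is a density rather than a probability; `norm.cdf` gives the probability of a value below `x_value`; and `norm.sf` gives the probability above it, which equals `1 - cdf`. The probability of landing near a value is the area under the curve over a small interval.

```python
from scipy.stats import norm

# Hypothetical placeholder values, not taken from flies.csv
mean, std = 45.0, 4.0
x_value = 50

density = norm.pdf(x_value, mean, std)    # height of the curve at x_value
less_than = norm.cdf(x_value, mean, std)  # area to the left of x_value
more_than = norm.sf(x_value, mean, std)   # area to the right, same as 1 - cdf

# Probability of a value within half a unit of x_value (area over a width of 1)
near_value = norm.cdf(x_value + 0.5, mean, std) - norm.cdf(x_value - 0.5, mean, std)

print("Density at x_value:", density)
print("Less than x_value:", less_than)
print("More than x_value:", more_than)
print("Within 0.5 of x_value:", near_value)
```

Because the interval is one unit wide, the last value comes out close to the density, which is one way to read the PDF as a likelihood.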

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

"""Plot the data"""

# Import the data
df = pd.read_csv(r"flies.csv")
pd.set_option("display.max_columns", None)

# Set data to be the values in the wing_length column
data = df.wing_length

# Add labels to the graph
plt.grid(True)
plt.title("Housefly Wing Lengths", fontsize=18)
plt.xlabel("Length (0.1mm)")
plt.ylabel("Distribution")

# Plot the histogram (w/density)
plt.hist(data, bins=10, density=True)

"""Plot the Normal Distribution Curve"""

# Determine the mean, median and std
mean = data.mean()
median = data.median()
std = data.std()

print("Mean: " + str(mean))
print("STD: " + str(std))

# Set up min and max of the x-axis using the mean and standard deviation
xmin = mean - 3 * std
xmax = mean + 3 * std

# Define the x-axis values
x = np.linspace(xmin, xmax) 

# "Norm" the y-axis values based on the x-axis values, the mean and the std
y = norm.pdf(x, mean, std)

# Plot the graph using the x and the y values
plt.plot(x, y, color="orange", linewidth=2) 

"""Determine Likelihoods"""

x_value = 50

# Calculate the likelihood according to the normal distribution
# (norm.pdf returns a density, the height of the curve at x_value, not a probability)
pdf = norm.pdf(x_value, mean, std)
print()
print("The likelihood of the value equaling " + str(x_value) + " is " + str(pdf))

cdf = norm.cdf(x_value, mean, std) 
print()
print("The likelihood of a value being less than " + str(x_value) + " is " + str(cdf))

more_than_cdf = 1 - norm.cdf(x_value, mean, std) 
print()
print("The likelihood of a value being more than " + str(x_value) + " is " + str(more_than_cdf))

plt.show()

# Problem 2 - SAT Predictions

Copy over your `main.py` code from the last SAT exercise.

Use the `x_value` variable included in the program to determine the following.

1. Determine the likelihood of earning exactly the score listed as the `x_value`.

2. Determine the likelihood of earning a score less than the `x_value`.

3. Determine the likelihood of earning a score more than the `x_value`.

**Advanced:**
Prompt the user for a value, then print the corresponding probabilities.
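
One possible shape for the advanced task is sketched below. It is not the only approach: `probabilities_for` is a hypothetical helper, and the mean of 500 and standard deviation of 100 are placeholders rather than values from `sat.csv`. The helper takes the typed text as a string so that invalid entries can be handled before any calculation.

```python
from scipy.stats import norm

def probabilities_for(raw_text, mean, std):
    """Convert typed text to a number and return its normal-curve values.

    Returns None when the text is not a valid number.
    """
    try:
        x_value = float(raw_text)
    except ValueError:
        return None
    return {
        "pdf": norm.pdf(x_value, mean, std),
        "less_than": norm.cdf(x_value, mean, std),
        "more_than": norm.sf(x_value, mean, std),  # same as 1 - cdf
    }

# Hypothetical mean and std; in the exercise these come from the data
result = probabilities_for("500", 500.0, 100.0)
print(result)
print(probabilities_for("five hundred", 500.0, 100.0))
```

In the program, the first argument could be `input("Enter a score: ")`, with `mean` and `std` calculated from the data.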

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

"""Copy your code from the last SAT exercise."""


"""Determine Likelihoods"""

x_value = 400

# Calculate the likelihood of earning exactly the x_value

# Calculate the likelihood of earning a score less than the x_value

# Calculate the likelihood of earning a score greater than the x_value


plt.show()

# Problem 3 - Professor Salaries Reflection

Dataset: Professor Salaries
Use the Professor Salaries program (Problem 4) to answer the following questions.

1. Does the dataset (not filtered) follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a professor earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.

2. Does the female filtered dataset follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a female professor earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.

3. Does the years of service filtered dataset follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a professor who has worked more than 10 years earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.
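
One way to put a number on "skewed" is the sample skewness, available in pandas as `Series.skew()`. Values near zero suggest a symmetric shape, positive values a longer right tail, and negative values a longer left tail. The salaries below are made up for this sketch and are not taken from the dataset.

```python
import pandas as pd

# Made-up salaries for illustration only
symmetric = pd.Series([80000, 90000, 100000, 110000, 120000])
right_tailed = pd.Series([80000, 85000, 90000, 95000, 200000])

symmetric_skew = symmetric.skew()
right_tailed_skew = right_tailed.skew()

print("Symmetric skewness:", symmetric_skew)
print("Right-tailed skewness:", right_tailed_skew)
```

The number supports what the histogram shows; it does not replace looking at the plot.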

In [None]:
#Answer here

# Problem 4 - Professor Salaries

This dataset lists professor salaries collected in a survey. Other fields collected were the type of degree, years since their Ph.D. was earned, years of service, and sex (gender).

The data has been imported and a normal distribution curve has been plotted over the histogram to show how well (or not well) the data fits a normal distribution.

1. Read through the code to understand how the program works. Add in your own comments to help you understand the program.

2. Print out the mean, the median, and the standard deviation of the data. Hint: These values are already calculated in the program. You just need to print them!

3. Create two filtered tables according to the criteria below.

    - Only female professors

    - Only professors with over 10 years of service
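
Filtered tables of this kind are usually built with boolean indexing in pandas. The sketch below uses a small made-up table; the column names `sex` and `years_of_service` and the label `"Female"` are assumptions, so check the actual column names and values in `data.csv` (for example with `print(df.columns)`) before adapting it.

```python
import pandas as pd

# Made-up rows; column names and labels are assumptions, not read from data.csv
df = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "years_of_service": [12, 3, 8, 25],
    "salary": [98000, 76000, 91000, 130000],
})

# Keep only the rows where the condition inside the brackets is True
female_df = df[df["sex"] == "Female"]
experienced_df = df[df["years_of_service"] > 10]

print(female_df)
print(experienced_df)
```

To analyze a filtered table, set `data` to the salary column of that table instead of the full one.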


Use the program to explore the following questions. You will record answers for these in the following free-response item.

4. Does the dataset (not filtered) follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a professor earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.

5. Does the female filtered dataset follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a female professor earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.

6. Does the years of service filtered dataset follow a normal distribution? Is it jumbled around, or skewed?

    - Use the `x_value` variable to determine the probability of a professor who has worked more than 10 years earning $100,000. What’s the probability of earning less than $100,000? More than $100,000?

    - List at least two “stories” that can be read from this visualization.

**Advanced:**
Prompt the user for a value, then print the corresponding probabilities.

Data Source: https://bigml.com/user/totyb/gallery/dataset/50f303103b56354d2a000405

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

"""Plot the Histogram"""

# Import dataset for the histogram
df = pd.read_csv(r"data.csv")
pd.set_option("display.max_columns", None)

# Filtered Tables





# Set data equal to one column of a dataset
data = df.salary

# Add labels and a title
plt.grid(True)
plt.xlabel("Yearly Salary")
plt.ylabel("Distribution")
plt.title("Professor Salaries", fontsize=22)

# Plot histogram using the data, 15 bins and show the density
plt.hist(data, bins=15, density=True)

"""Plot the Normal Distribution Curve"""

# Determine the mean, median and std
mean = data.mean()
median = data.median()
std = data.std()

# Print the mean, median and standard deviation of the data




# Set up min and max of the x-axis using the mean and standard deviation
xmin = mean - 3 * std
xmax = mean + 3 * std

# Define the x-axis values
x = np.linspace(xmin, xmax)

# "Norm" the y-axis values based on the x values
y = norm.pdf(x, mean, std)

# Plot the graph using the x and the y values
plt.plot(x, y, linewidth=2)

"""Determine Likelihoods"""

x_value = 100000

# Calculate the likelihood according to the normal distribution
# (norm.pdf returns a density, the height of the curve at x_value, not a probability)
pdf = norm.pdf(x_value, mean, std)
print()
print("The likelihood of earning a salary of " + str(x_value) + " is " + str(pdf))

cdf = norm.cdf(x_value, mean, std) 
print()
print("The likelihood of earning a salary less than " + str(x_value) + " is " + str(cdf))

more_than_cdf = 1 - norm.cdf(x_value, mean, std) 
print()
print("The likelihood of earning a salary more than " + str(x_value) + " is " + str(more_than_cdf))

plt.show()


In [None]:
#answers here