# Data Science - Unit 1 Sprint 1 Module 4

## Make Explanatory Visualizations

### Module Learning Objectives

- Identify the appropriate visualization type for a particular variable type and research question 
- Use Matplotlib to visualize distributions and relationships with continuous and discrete variables
- Add emphasis and annotations to transform visualizations from exploratory to explanatory
- Remove clutter from visualizations
- Identify misleading visualizations and how to fix them

### Notebook points: 5

### Introduction

The first few modules in this Sprint have focused on exploring datasets, manipulating DataFrames, and creating new features. Now we're going to create some visualizations!

### Dataset Description

Researchers recorded data on sleep duration as well as a set of ecological and constitutional variables for a selection of mammal species.

The data dictionary is in the README file here: https://github.com/LambdaSchool/data-science-practice-datasets/tree/main/unit_1/Sleep

*Source: Allison, T. and Cicchetti, D. (1976), "Sleep in Mammals: Ecological and Constitutional Correlates", Science, November 12, vol. 194, pp. 732-734.*


**Task 1** -  Import the `sleep.csv` file and load it as a DataFrame named `Sleep`.
* Don't forget to include any import statements you need to create a DataFrame from a .csv file
* The `sleep.csv` file can be accessed using `sleep_url` which is provided for you
* Load the `sleep.csv` file as a DataFrame named `Sleep`
* Print the first 5 rows of `Sleep`
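
As a minimal sketch of the loading step, the snippet below reads CSV text into a DataFrame with `pd.read_csv`. The inline `toy_csv` text, its column names, and the name `toy` are made up for illustration so the example runs offline; `pd.read_csv` also accepts a URL string such as `sleep_url`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; this is not the real sleep data
toy_csv = io.StringIO(
    "Animal,Hours,Group\n"
    "Animal_A,2.0,1\n"
    "Animal_B,0.5,5\n"
    "Animal_C,1.8,3\n"
)

# read_csv accepts a file path, a URL string, or a file-like object
toy = pd.read_csv(toy_csv)

# head() returns the first 5 rows by default (here, all 3 rows)
print(toy.head())
```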

In [None]:
# Task 1 

# Access sleep.csv with this url
sleep_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Sleep/Sleep.csv'

# Read in the sleep.csv file included above as a DataFrame named Sleep and print the first 5 rows.
# Don't forget any necessary import statements.

# YOUR CODE HERE
raise NotImplementedError()

# View the DataFrame
Sleep.head()

**Task 1 Test**

In [None]:
# Task 1 - Test

assert isinstance(Sleep, pd.DataFrame), 'Have you created a DataFrame named `Sleep` (check your capitalization)?'
assert len(Sleep) == 42


**Task 2** - Plot a histogram of `Parasleep`, which is the number of hours of dreaming sleep each mammal slept during a 24-hour period.
* The statements `import matplotlib.pyplot as plt` and `import seaborn as sns` are included for you.
* Use the template below and **replace the #### as specified in the instructions**
* Plot a histogram of `Parasleep` from the `Sleep` DataFrame
* The x-axis label should read `'Total hours of dreaming sleep'`
* The y-axis label should read `'Frequency'`
* The title should read `'Daily dreaming sleep in mammal species'`

**Note:** UNCOMMENT code lines to complete the plotting task
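
For reference, here is a small sketch of the same histogram pattern on made-up numbers. The DataFrame `toy`, its `Hours` column, and the choice of `bins=4` are assumptions for illustration, not part of the assignment.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"Hours": [0.5, 1.0, 1.2, 1.9, 2.0, 2.4, 3.1, 3.9]})

fig, ax = plt.subplots()

# hist() bins the values and returns the counts, the bin edges, and the bars
counts, bin_edges, patches = ax.hist(toy["Hours"], bins=4)

# Axis labels and title are set on the Axes object
ax.set_xlabel("Hours")
ax.set_ylabel("Frequency")
ax.set_title("Toy histogram")
```

In a notebook, calling `plt.show()` at the end of the cell renders the figure.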

In [None]:
# Task 2

# UNCOMMENT the code lines to complete the Task

# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()

# Plot a histogram of Parasleep from the Sleep DataFrame
#ax.hist(####)

# Specify the axis labels and plot title
#ax.set_xlabel(####) 
#ax.set_####('Frequency') 
####_title('Daily dreaming sleep in mammal species') 

plt.show()

In [None]:
# Task 2 SOLUTION 

# IGNORE the YOUR CODE HERE - your code is completed above
# You will see this solution when you submit the notebook

# YOUR CODE HERE
raise NotImplementedError()

**Task 3** - Prepare `Sleep` data for plotting

* Create a subset of `Sleep` that includes only records of mammals in `Danger` category 1.  Name this DataFrame `Danger_1`
* Create a subset of `Sleep` that includes only records of mammals in `Danger` category 5.  Name this DataFrame `Danger_5`
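
One common way to build subsets like these is boolean indexing. The sketch below uses a made-up DataFrame `toy` with a hypothetical `Group` column; the same pattern should carry over to other column names.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Animal": ["Animal_A", "Animal_B", "Animal_C", "Animal_D"],
    "Group": [1, 5, 1, 3],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows
group_1 = toy[toy["Group"] == 1]
group_5 = toy[toy["Group"] == 5]

print(group_1)
print(group_5)
```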

In [None]:
# Task 3

# Set up danger category 1 and danger category 5

# YOUR CODE HERE
raise NotImplementedError()

# View your DataFrames
print(Danger_1.head())
print(Danger_5.head())

**Task 3 Test**

In [None]:
# Task 3 - Test

assert isinstance(Danger_1, pd.DataFrame), 'Have you created a DataFrame named Danger_1 (check your capitalization)?'
assert isinstance(Danger_5, pd.DataFrame), 'Have you created a DataFrame named Danger_5 (check your capitalization)?'


**Task 4** - Plot side-by-side box plots of `Parasleep` to compare the distribution of `Parasleep` for mammals in `Danger` category 1 and `Danger` category 5.

Use the template below and **replace the #### as specified in the instructions**

* Plot side-by-side boxplots of `Parasleep` that show its distribution separately for mammals in `Danger` categories 1 and 5.
* Label the boxplots `'Least danger'` and `'Most danger'`
* The x-axis label should read `'Hours of dreaming sleep'`
* The y-axis label should read `'Danger category'`
* The title should read `'Daily dreaming sleep in mammal species'`
* The `vert` parameter in the boxplot should be set to `False`.

**The plotting code will not be autograded** but it is still required for completing the project.
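
The sketch below shows horizontal side-by-side boxplots on made-up values. The Series `low` and `high` are hypothetical, and the group names are set with `set_yticklabels` rather than through a boxplot argument, which differs from the template. The `dropna()` calls are a precaution: the assumption here is that a column may contain missing values, which would otherwise make the box statistics NaN.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; one value is missing on purpose
low = pd.Series([1.0, 1.5, 2.0, 2.5, None])
high = pd.Series([0.3, 0.6, 0.9, 1.2])

fig, ax = plt.subplots()

# A list of sequences draws one box per sequence;
# dropna() removes missing values before the quartiles are computed
result = ax.boxplot([low.dropna(), high.dropna()], vert=False)

# With horizontal boxes, the group names belong on the y-axis ticks
ax.set_yticks([1, 2])
ax.set_yticklabels(["Group low", "Group high"])
ax.set_xlabel("Value")
```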

In [None]:
# Task 4 - Plotting

# UNCOMMENT the code lines to complete the Task

fig, ax = plt.subplots()

# Plot the side-by-side boxplots           
#ax.boxplot([ #### , Danger_5['Parasleep']], labels=['Least danger','####'], ####=False)

# Label the figure
#ax.set_xlabel('Hours of dreaming sleep')
#ax.set_ylabel('Danger category')
#ax.set_title('Daily dreaming sleep in mammal species')

plt.show()

In [None]:
# Task 4 SOLUTION

# IGNORE the YOUR CODE HERE - your code is completed above
# You will see this solution when you submit the notebook

# YOUR CODE HERE
raise NotImplementedError()

**Task 5** - Sort the `Sleep` DataFrame

* Sort the `Sleep` DataFrame by `Gest`.  Name the sorted DataFrame `Sleep_sorted`
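
As a quick illustration of sorting, the sketch below orders a made-up DataFrame by one column with `sort_values`. The names `toy`, `toy_sorted`, and `Days` are hypothetical.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Animal": ["Animal_A", "Animal_B", "Animal_C"],
    "Days": [120, 12, 45],
})

# sort_values returns a new DataFrame, ascending by default; toy is unchanged
toy_sorted = toy.sort_values(by="Days")

print(toy_sorted)
```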

In [None]:
# Task 5

# Sort Sleep by Gest

# YOUR CODE HERE
raise NotImplementedError()

# View the results
Sleep_sorted.head()

**Task 5 - Test**

In [None]:
# Task 5 - Test

assert Sleep_sorted.iloc[0, 0] == 'N_American_opossum', 'Double-check your DataFrame sorting'

### No hidden tests

**Task 6** - Plot a line plot of `Parasleep` by `Gest`, the gestational time for each mammal.

Use the template below and **replace the #### as specified in the instructions**

* Plot a line plot using the `Sleep_sorted` DataFrame with each mammal's value of `Gest` on the x-axis and each mammal's value of `Parasleep` on the y-axis.
* In the `plot` statement, specify `'o'` for the marker, `'dashdot'` for the linestyle and `'b'` for the color.
* Label and title the graph using the statements provided for you.
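
A line plot connects points in row order, which is why the data is sorted by the x variable first. The sketch below shows the pattern on made-up values; `toy`, `Days`, and `Hours` are hypothetical names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({"Days": [120, 12, 45, 300], "Hours": [1.1, 3.0, 2.2, 0.6]})

# Sort by the x variable so the line runs left to right without doubling back
toy_sorted = toy.sort_values(by="Days")

fig, ax = plt.subplots()

# plot() returns a list of Line2D objects, one per line drawn
lines = ax.plot(
    toy_sorted["Days"],
    toy_sorted["Hours"],
    marker="o",
    linestyle="dashdot",
    color="b",
)
```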

**The plotting code will not be autograded** but it is still required for completing the project.

In [None]:
# Task 6 - Plotting

# UNCOMMENT the code lines to complete the Task

fig, ax = plt.subplots()

#ax.plot(####, 
#    ####, 
#    marker=####,
#    linestyle=####,
#    color=####) 

#ax.set_xlabel('Gestational time (days)') 
#ax.set_ylabel('Dreaming sleep (hours)') 
#ax.set_title('The Relationship of Gestational Time to Dreaming Sleep in Mammals') 

plt.show()

In [None]:
# Task 6 SOLUTION

# IGNORE the YOUR CODE HERE - your code is completed above
# You will see this solution when you submit the notebook

# YOUR CODE HERE
raise NotImplementedError()

**Task 7** - Plot the percent of animals in each `Danger` category using a pie chart.

* Plot a pie chart using the `Danger` variable from the `Danger_pct` DataFrame.  
* Set the labels of the plot using `Danger_pct.index`
* Some other graphical parameters as well as the plot title have been set for you.
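
The notebook does not show how `Danger_pct` is built. One plausible construction, sketched here on made-up data, is to take normalized value counts and convert them to a one-column DataFrame. The names `toy`, `Group`, and `group_pct` are hypothetical, and this is an assumption about the approach rather than the notebook's own code.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up category codes for illustration only
toy = pd.DataFrame({"Group": [1, 1, 2, 3, 3, 3, 1, 2]})

# normalize=True gives proportions; multiply by 100 for percents.
# sort_index() orders the rows by category label.
group_pct = (
    toy["Group"]
    .value_counts(normalize=True)
    .sort_index()
    .mul(100)
    .to_frame(name="Group")
)

fig, ax = plt.subplots()

# One wedge per row; the index supplies the wedge labels
ax.pie(group_pct["Group"], labels=group_pct.index, autopct="%1.1f%%", startangle=90)
ax.set_title("Percent of rows in each group")
```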

**The plotting code will not be autograded** but it is still required for completing the project.

In [None]:
# Task 7 - Plotting

# UNCOMMENT the code lines to complete the Task

# Create the pie chart

fig, ax = plt.subplots()

#ax.pie(####, labels=####, autopct='%1.1f%%', startangle=90)
#ax.set_title('Percent of mammals in each danger category')

plt.show()

In [None]:
# Task 7 SOLUTION

# IGNORE the YOUR CODE HERE - your code is completed above
# You will see this solution when you submit the notebook

# YOUR CODE HERE
raise NotImplementedError()

**Task 8** - Create a new feature

* Use `.loc` to create a new feature in `Sleep` called `Short life` that takes on the values: 
    * 1 if the mammal's lifespan is less than 30 years long
    * 0 if the mammal's lifespan is 30 years or longer
* Use `.value_counts()` to calculate the frequency of `Short life`.  **Save your results to a DataFrame named `Life_counts`**
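
The sketch below illustrates the `.loc` pattern on made-up data. The column name `Lifespan_years` is hypothetical, so check the data dictionary for the actual lifespan column. Note that `.value_counts()` returns a Series, so `.to_frame()` is used here to get a DataFrame.

```python
import pandas as pd

# Made-up lifespans for illustration only; one value is missing on purpose
toy = pd.DataFrame({"Lifespan_years": [4.5, 38.6, 14.0, None, 69.0, 27.0]})

# Start every row at 0, then use .loc to set 1 where the condition is True.
# A missing lifespan compares as False, so that row keeps the default 0.
toy["Short life"] = 0
toy.loc[toy["Lifespan_years"] < 30, "Short life"] = 1

# value_counts() returns a Series; to_frame() converts it to a DataFrame
life_counts = toy["Short life"].value_counts().to_frame()

print(life_counts)
```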

In [None]:
# Task 8

# YOUR CODE HERE
raise NotImplementedError()

# View the results
Life_counts.head()


**Task 8 - Test**

In [None]:
# Task 8 - Test

assert 'Short life' in Sleep.columns, 'Have you created the new feature column?'
assert isinstance(Life_counts, pd.DataFrame), 'Have you created a DataFrame named Life_counts?'


**Task 9** - Plot the number of mammals in the `Sleep` dataset that had short (< 30 years) and long (>= 30 years) lifespans.

* Use `.catplot()` with `kind='count'` to plot the frequency of `Short life`.
* Other graphical parameters, axis labels, and the title have been set for you.
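
For orientation, here is a minimal count plot on made-up data using `sns.catplot` with `kind='count'`. The DataFrame `toy` and its `Flag` column are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up 0/1 flags for illustration only
toy = pd.DataFrame({"Flag": [0, 1, 1, 0, 1, 1]})

# kind="count" draws one bar per distinct value of x;
# catplot creates its own figure and returns a FacetGrid
grid = sns.catplot(x="Flag", data=toy, kind="count")

# With a single facet, the underlying Axes is available as grid.ax
grid.ax.set_ylabel("Frequency")
```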

**The plotting code will not be autograded** but it is still required for completing the project.

In [None]:
# Task 9 - Plotting

# UNCOMMENT the code lines to complete the Task

#sns.catplot(x='########',data=#######,kind='######')

plt.ylabel('Frequency') 
plt.xlabel('Mammal lifespan')
plt.title('Number of Mammals with Long and Short Life Expectancies') 
plt.xticks(ticks=[0, 1], labels=['Lifespan >= 30 years', 'Lifespan < 30 years'])

plt.show()

In [None]:
# Task 9 SOLUTION

# IGNORE the YOUR CODE HERE - your code is completed above
# You will see this solution when you submit the notebook

        
# YOUR CODE HERE
raise NotImplementedError()