## Introduction to Data Science

#### University of Redlands - DATA 101
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data101.joannabieri.com](https://joannabieri.com/data101.html)

---------------------------------------
# Homework Day 4
---------------------------------------

GOALS:

1. Answer all the questions from the lecture.
2. Practice making graphs with different types of data.

----------------------------------------------------------

For this homework you will load the loans data and practice making different kinds of plots!

This homework has **7 questions** from the lecture and **2 Problems** and a **Challenge**.

NOTE:
* Questions tend to be short answer or things where you change very minor parts of some given code.
* Problems tend to be more involved, like trying out our methods on a new column.
* Challenge problems are optional! They are intended to challenge you to reach beyond the basics of the class. I hope you will try the challenge problems!

In [64]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'

from itables import show

### Load the Data

In [47]:
file_location = 'https://joannabieri.com/introdatascience/data/loans_full_schema.csv'
DF = pd.read_csv(file_location)

### Check Observations and Variables

**Q1** How many **observations** are there?

**Q2** How many **variables** are there?

In [41]:
# Enter your code here to find the data shape
print(DF.shape)

(10000, 55)


There are 10,000 observations (rows) and 55 variables (columns).
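
As a small sketch of how the shape maps onto observations and variables, the snippet below unpacks `.shape` into two named values. The table `toy` and its values are made up for illustration; they are not the loans data.

```python
import pandas as pd

# Hypothetical toy table standing in for the loans data (values are made up).
toy = pd.DataFrame({
    'loan_amount': [5000, 12000, 30000],
    'term': [36, 60, 36],
})

# .shape is a (rows, columns) tuple:
# rows are observations, columns are variables.
n_observations, n_variables = toy.shape
print(f"{n_observations} observations, {n_variables} variables")
```

The same unpacking should work on `DF` from the cell above.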

### Reduce the number of variables

In [51]:
my_variables = ['loan_amount',
                'interest_rate',
                'term','grade',
                'state',
                'annual_income',
                'homeownership',
                'debt_to_income']

DF = DF[my_variables]

show(DF)

loan_amount,interest_rate,term,grade,state,annual_income,homeownership,debt_to_income


**Q3** Check out each of the variables (columns):

1. What does each column tell you? What are the units?
2. Is the data numerical? If so is it continuous or discrete?
3. Is the data categorical? If so is it ordinal or nominal?

<a href="https://www.openintro.org/data/index.php?data=loans_full_schema"> Here is a link to the full data description if you need to look up some of the column names.</a>

-----------------------------------------

1. The columns are loan amount ($), interest rate (%), term (length of the loan in months), grade (loan grade), state, annual income of the loanee ($), homeownership status, and debt-to-income ratio.
2. Loan amount, interest rate, annual income, and debt-to-income ratio are continuous; term is discrete.
3. Grade is ordinal; homeownership and state are nominal.
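
One way to double-check these classifications in code is sketched below. It assumes a made-up table `toy` whose column names mirror the homework; the values and the three-letter grade scale are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the loans data.
toy = pd.DataFrame({
    'loan_amount': [5000.0, 12000.0, 30000.0, 8000.0],
    'term': [36, 60, 36, 36],
    'grade': ['B', 'A', 'C', 'A'],
    'state': ['CA', 'NY', 'CA', 'TX'],
})

# dtypes gives a first hint: numeric columns vs. text columns.
print(toy.dtypes)

# A numeric column with only a few distinct values is probably discrete.
n_terms = toy['term'].nunique()
print(n_terms)

# An ordinal variable can be stored as an ordered categorical,
# so that min, max, and sorting respect the order A, B, C.
grade = pd.Categorical(toy['grade'], categories=['A', 'B', 'C'], ordered=True)
print(grade.min(), grade.max())
```

A nominal column such as `state` has no meaningful order, so it would stay an unordered category or plain text.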

### Here is Example Code for a simple Histogram

In [78]:
fig = px.histogram(DF,
                   nbins=10,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
fig.show()

**Q4** Try changing the number of bins, **nbins**. What do you notice? Are there good choices? Bad choices?


In [80]:
# Your code here (experiment here)
sha = px.histogram(DF,
                   nbins=15,
                   x='loan_amount',
                   color_discrete_sequence=['lightseagreen'])

sha.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5)
sha.show()

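More bins reveal finer detail in the distribution, but too many bins make the histogram noisy and too few hide its shape. Plotly also treats **nbins** as a maximum and picks its own bin sizes automatically.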
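
To see the effect of the bin count in numbers rather than pictures, here is a minimal sketch using `np.histogram` on made-up amounts. Note the assumption: NumPy uses exactly the requested number of bins, while Plotly may choose fewer, so this only approximates what the plots show.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical loan amounts (made-up values, not the real data).
amounts = rng.uniform(1000, 40000, size=500)

for bins in (3, 10, 100):
    counts, edges = np.histogram(amounts, bins=bins)
    width = edges[1] - edges[0]
    # Few bins: wide bars with large counts. Many bins: narrow bars, small counts.
    print(f"{bins} bins: width about {width:.0f}, largest count {counts.max()}")
```

With very few bins almost everything lands in the same bar, and with very many bins each bar holds only a handful of loans.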

**Problem 1** Create a histogram of your own! Try making a histogram of one of the other pieces of numerical data. Make it as fancy as you want. Include some categorical information. Do you learn anything from your graph? If so what?


In [102]:
# Your code here
fig = px.histogram(DF,
                   nbins=25,
                   x='homeownership',
                   color_discrete_sequence=['green'])

fig.update_layout(bargap=0.1,
                  title='Histogram of Homeownership Status of Loanees',
                  title_x=0.5)
fig.show()

### Here is example code for a histogram with a box plot included.

In [104]:
fig = px.histogram(DF,
                   nbins=9,
                   x='loan_amount',
                   color='homeownership',
                   opacity=0.5,
                   color_discrete_map={'MORTGAGE': 'darkviolet', 
                                       'RENT': 'deeppink', 
                                       'OWN': 'darkturquoise'},
                   marginal="violin"
                   )

fig.update_layout(bargap=0.1,
                  title='Histogram of Loan Amounts.',
                  title_x=0.5,
                  xaxis={'range':[-1000, 46000]},
                  xaxis_title="Loan Amount ($)",
                  yaxis_title="Frequency",
                  legend_title="Homeownership",
                  autosize=False,
                  width=800,
                  height=600)
fig.show()

**Q5** Change the above plot to marginal="violin" and see what changes. Make some observations about the graph.

Setting marginal="violin" adds a violin plot for each homeownership group above the histogram, drawn from the same loan_amount data as the main plot.
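
The marginal plots summarize each group's distribution. A rough numeric counterpart is sketched below: quartiles per homeownership group, computed with pandas on made-up values. Plotly may use a slightly different quartile rule than pandas, so the numbers need not match a box plot exactly.

```python
import pandas as pd

# Made-up loan amounts for two groups (not the real data).
toy = pd.DataFrame({
    'loan_amount': [5000, 10000, 15000, 20000, 8000, 12000, 16000, 30000],
    'homeownership': ['RENT'] * 4 + ['OWN'] * 4,
})

# One row per group, one column per quantile (0.25, 0.5, 0.75).
quartiles = (toy.groupby('homeownership')['loan_amount']
                .quantile([0.25, 0.5, 0.75])
                .unstack())
print(quartiles)
```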

In [106]:
counts = DF['homeownership'].value_counts()
show(counts)



**Q6** Copy and paste the code above, but change it to get value counts for one of the other categorical columns.

In [108]:
## Your code here
counts = DF['state'].value_counts()
show(counts)
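
A small variation that can be useful here: `value_counts(normalize=True)` returns shares instead of raw counts. The sketch below uses a made-up `state` column, not the real data.

```python
import pandas as pd

# Made-up states for illustration.
toy = pd.DataFrame({'state': ['CA', 'CA', 'NY', 'TX', 'CA', 'NY']})

counts = toy['state'].value_counts()                 # raw counts, largest first
shares = toy['state'].value_counts(normalize=True)   # fractions that sum to 1
print(counts)
print(shares.round(2))
```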



### Here is example code for a bar plot

In [110]:
fig = px.bar(DF,
            x='homeownership',
            color_discrete_sequence=['lightseagreen'])
fig.update_traces(dict(marker_line_width=0))
fig.show()

**Q7** Can you figure out how to add x labels, y labels, and a title to this graph?

In [128]:
# Your code here
fig = px.bar(DF,
             x='homeownership',
             color_discrete_sequence=['lightseagreen'])
fig.update_layout(title='Homeownership of People Receiving Loans',
                  xaxis_title='Homeownership Status',
                  yaxis_title='# of Loanees')
fig.show()

**Problem 2** Try to make your own bar plot of one of the other categorical columns. Add some categorical fill or facets. See how fancy you can make your graph. Make sure it is also still really informative.

In [180]:
# Your code here
fig = px.bar(DF,
             x='grade',
             color='homeownership',
             color_discrete_map={'MORTGAGE': 'yellow',
                                 'RENT': 'orange',
                                 'OWN': 'lightgreen'},
             opacity=1)
fig.update_layout(title='Grade of Loans',
                  title_x=0.5,
                  xaxis_title='Grades',
                  yaxis_title='Number of Loans',
                  legend_title='Homeownership of Loanee')
fig.update_traces(dict(marker_line_width=0))
fig.show()
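
The stacked bars above encode a two-way table of counts. As a sketch, `pd.crosstab` builds that table directly, which makes it easy to read off exact numbers. The rows in `toy` are invented for illustration.

```python
import pandas as pd

# Made-up grades and homeownership values (not the real data).
toy = pd.DataFrame({
    'grade': ['A', 'A', 'B', 'B', 'B', 'C'],
    'homeownership': ['RENT', 'OWN', 'RENT', 'RENT', 'MORTGAGE', 'OWN'],
})

# Rows are grades, columns are homeownership categories, cells are counts.
table = pd.crosstab(toy['grade'], toy['homeownership'])
print(table)
```

Each row of the table corresponds to one bar, and each cell to one colored segment of that bar.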

### Challenge:

Here is a data set that contains demographic data from the Behavioral Risk Factors Surveillance System from the CDC. It is a small subset of 60 observations.

Here is a link to the variable information:

<a href="https://www.openintro.org/data/index.php?data=cdc.samp" target="_blank">https://www.openintro.org/data/index.php?data=cdc.samp</a>

Your goal is to look at the columns and then make a graph from the data using what we learned in today's class.

In [174]:
file_location = 'https://joannabieri.com/introdatascience/data/cdc.samp.csv'
DF_new = pd.read_csv(file_location)
show(DF_new)

genhlth,exerany,hlthplan,smoke100,height,weight,wtdesire,age,gender


In [209]:
idea = px.scatter(DF_new,
                  x='weight',
                  y='height',
                  color='gender',
                  color_discrete_map={'m': 'lightblue',
                                      'f': 'pink'})
idea.show()

In [241]:
# Box plots of weight, one facet row per exercise group.
box_fig = px.box(DF_new,
                 x='weight',
                 facet_col='exerany',
                 facet_col_wrap=1,
                 color='exerany')
box_fig.show()