# DSC 80 - Discussion 01

### Due Date: Saturday October 1, 11:59 PM

**Discussions will be due by the end of the day on Saturday**

* Lecture Review: models and the data science lifecycle.
* Overview: How to work on homework.
* Tutorial: `numpy` review and an example HW problem.

---

## Lecture Review

### The data science lifecycle

<center><img src="imgs/DSLC.png" width="40%"></center>

The stages of the data science lifecycle:
* Researching domain
* Questions and hypotheses
* Finding and cleaning data
* Data modeling
* Predictions and Inference
* Decisions

Some terminology of modeling:
* A **data generating process (DGP)** is the real-world phenomenon under consideration.
* The **true (probability) model** is a mathematical representation of the random phenomenon that generates any representative observations.
* The **observations** are data representing the data generating process.
* A **(fit) statistical model** of the data is the best approximation of the data generating process under the probability model.
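
The four terms above can be made concrete with a small simulation. This is only a minimal sketch under assumed values: the voter setting, the probability `true_p = 0.52`, and the sample size of 1000 are illustrative choices, not taken from lecture. In a real problem the true model is unknown; simulating it simply lets us compare the fit model against it.

```python
import numpy as np

# Assumed true probability model: each voter independently supports
# candidate A with probability true_p (a hypothetical value).
true_p = 0.52
rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Observations: a sample of 1000 simulated voters (True = supports A).
observations = rng.random(1000) < true_p

# Fit statistical model: the Bernoulli model whose parameter best matches
# the observations, i.e. the sample proportion.
p_hat = observations.mean()

print(f"true p = {true_p}, fitted p = {p_hat:.3f}")
```

The fitted value will generally be close to, but not equal to, the true parameter; that gap is what questions about model quality try to assess.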

**Example:** Suppose you want to predict the outcome of the next presidential election.

1. What are the questions and hypotheses to answer and test?
2. What observations might you collect?
3. What measurements do you care about? (i.e. what do your observations look like?)
4. What statistical model might you use?
5. How might you assess the quality of your fit model?

**Example:** Suppose we want to understand the pay disparity between men and women among City of San Diego employees.

1. Is the dataset from lecture adequate, or will you need to capture other measurements/features?
2. How applicable is the above process to other years and cities?

---

## Overview: working on assignments

The class assignments are available on the class git repository; each consists of a notebook with the problem statements, starter code in a `.py` file, and any required supplementary files (e.g. data). After pulling the HW material, you will develop your solutions using a combination of Jupyter notebooks and your favorite IDE or editor (e.g. Sublime Text, or the JupyterHub server). Once finished, you will submit your assignment to Gradescope.


### Obtaining course materials (assignments)

Git is a version control system that is used in the development of the course materials. For an introduction to using git in the course, see this [tutorial](https://drive.google.com/open?id=1m6mXfhjFInHPeJyaHdAwfiakcFYh73HC8TeAB9E9Xeo) and this hands-on [tutorial](https://docs.google.com/document/d/1E2Zg0pC8S3cyT564jug6rqAhSNraR_7Yy_4AnvHaGu4/edit?usp=sharing). To use git on a Mac, you will need to open the terminal; on Windows, you should download [git-bash](https://gitforwindows.org/).

The course materials are stored in a git repository on *GitHub* (a git hosting service); you can view it in a browser [here](https://github.com/dsc-courses/dsc80-2022-sp). To obtain the course files, follow the directions in the tutorials above.

### The notebook / IDE balance

Now that your assignment is on your computer, you are ready to work. You will be using two different tools to develop the code and create the analyses that the assignments require of you. Generally, these are:
1. Jupyter notebooks, which contain the problem statements themselves; they also provide a place to test out code, understand data, and produce reports/summaries of conclusions.
2. An IDE for developing reusable and testable Python code. Abstracting your notebook code into Python library code avoids common mistakes in notoriously error-prone notebook environments. Luckily, once a function is in your `.py` file, you can still import/use it in a notebook!

Both of these environments are essential in the data scientist's toolkit.

### Checking your work

An effective environment for testing and understanding your work is essential to success in the class. The notebook and the IDE play different roles in checking your work.

* The notebook provides a place to understand the output of your function and test it against your intuition and understanding of what the correct output should be. When working with data, you should always check the correctness of your work using your understanding of that data (i.e. is my conclusion reasonable given what I know about the data?). This is typically the ultimate goal for a problem, so you should *always* interpret your answer on the data in a notebook.

* Abstracting your code to library functions/classes in a `.py` file encourages using software development best-practices in your data processing and analyses. While expressive, notebooks are error-prone, manual, and hard to debug. Moving useful code to a `.py` file makes your code more clear, encourages code reuse, and makes debugging easier. Once you have moved any work from your notebook to a `.py` file, you should check the correctness of your work in two ways:
    - Run the doctests. The doctests ensure your code *meets the contract* specified in the question (or by you, in your own projects). That is, does your code accept the expected inputs and return the expected outputs? **Doctests check only that your code acts on the correct types; they do not check that the values are correct**.
    - Import your function into the notebook and test it on data as above. Use your understanding of the data to assess the correctness of your code!
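
To illustrate why both checks are needed, here is a minimal sketch of a function whose doctest passes even though its answer is wrong. The function `total_tips` and its deliberate bug are hypothetical, made up for this illustration; the doctest is run programmatically with the standard `doctest` module, which is similar in spirit to running `python -m doctest` on a file.

```python
import doctest


def total_tips(bills):
    """
    Return the total tip if every bill receives an 18% tip.

    >>> out = total_tips([10.0, 20.0])
    >>> isinstance(out, float)
    True
    """
    return sum(bills) * 0.81  # deliberate bug: should be 0.18


# Collect and run the doctest for this one function.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(total_tips, name="total_tips", module=False,
                        globs={"total_tips": total_tips}):
    runner.run(test)
results = runner.summarize(verbose=False)
print(results.failed)  # the type check passes, so no failures are reported

# A check against a value worked out by hand exposes the bug:
# 18% of (10 + 20) is 5.40.
value_is_correct = abs(total_tips([10.0, 20.0]) - 5.40) < 1e-9
print(value_is_correct)
```

The doctest only confirms that a float comes back; it takes a comparison against a known answer to reveal that the value itself is wrong.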
    
### Types of Tests at a Glance

<center><img src="imgs/testing_summary.png" width="80%"></center>


### HW submission

Once you have finished the assignment, log into Gradescope and submit the `.py` file to the appropriate assignment. 
* Upon submission, the autograder will run the doctests and show whether the tests passed or not. These are worth *a few* points; the purpose is to check that the autograder environment is consistent with the environment in which you developed your HW.
* The results of the "correctness tests" that you will be ultimately graded on will not be visible until after the due date.
* The autograder will tell you if your code failed to run, though generally will not tell you why. The most common reasons are listed below.
    - **Timeout**: the autograder *will* tell you if your code failed to finish within 20 minutes. If this occurs, try to isolate which problem is causing the timeout and either fix it or comment it out!
    - **Syntax errors**: Any syntax error (e.g. bad code indentation) will cause the autograder to fail, giving a 0 on the assignment. Always double-check that your code passes the doctests *on the command line* (just as the autograder runs it). Pulling your code (from GitHub) onto DataHub and running the tests there is also a good debugging technique, as the environment is very similar to Gradescope's.
    - **OOM (out of memory)**: The autograder runs on a server with 1 GB of memory, which is less than your computer has. Assignments should never require more memory than this; if you run out, think about how to simplify your code!
    - **Changed file names**: Do not change file names when submitting to Gradescope. Each file should be named one of `lab.py`, `discussion.py`, or `project.py`.

If the problem persists, ask course staff why the autograder is failing.
    

### A remark on DataHub

UCSD Educational Technology Services has made servers available for use at [DataHub](https://datahub.ucsd.edu). Once logged in, you have not only a Jupyter notebook server running, but an entire Unix environment. To make the best use of this environment, once logged in, replace the `/tree` in the URL with `/lab` to use the JupyterLab IDE/notebook environment. Here, you can use (1) Jupyter notebooks, (2) terminals, and (3) a simple text editor for editing Python files.

## Tutorial: `numpy` review as a HW problem

Work on this tutorial like an assignment. **Complete Questions 1 and 8, and turn them in to Gradescope by 11:59 PM on Saturday**.

In [1]:
# What is this?
# autoreload re-imports .py files so that changes made to them are reflected immediately.
%load_ext autoreload
%autoreload 2

In [2]:
# from discussion import *

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os

For a review of working with Numpy arrays, see the [arrays chapter](https://www.inferentialthinking.com/chapters/05/1/Arrays.html) of Inferential Thinking (DSC10). The most relevant concepts are:
* element-wise array operations that avoid loops ('vectorization');
* the functions and methods for performing array arithmetic (see the tables on the page referenced above).
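
As a quick refresher, the sketch below contrasts a loop with the equivalent element-wise operation. The bill amounts and the 8% tax rate are hypothetical values chosen for illustration; they are not taken from `restaurant.csv`.

```python
import numpy as np

# Hypothetical bill amounts, used only for illustration.
bills = np.array([12.50, 8.00, 23.75, 4.25])

# Loop version: compute the taxed amount for each bill one at a time.
taxed_loop = np.array([b * 1.08 for b in bills])

# Vectorized version: the multiplication applies to every element at once.
taxed = bills * 1.08

# Comparisons are element-wise too and produce a boolean array;
# summing it counts the True values, and its mean is a proportion.
over_ten = bills > 10
num_over_ten = over_ten.sum()
prop_over_ten = over_ten.mean()

print(num_over_ten, prop_over_ten)
```

Both versions give the same result, but the vectorized form is shorter and avoids an explicit Python loop.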

**Question 1:** Write a function that takes in a file path that points to a data file like `restaurant.csv` and returns an array of the restaurant bill values.

*Notes*: Where is the file? What values? Look at the starter code documentation in `discussion.py`.

In [5]:
def data2array(filepath):
    """
    data2array takes in the filepath of a 
    data file like `restaurant.csv` in 
    data directory, and returns a 1d array
    of data.

    :Example:
    >>> fp = os.path.join('data', 'restaurant.csv')
    >>> arr = data2array(fp)
    >>> isinstance(arr, np.ndarray)
    True
    >>> arr.dtype == np.dtype('float64')
    True
    >>> arr.shape[0]
    100000
    """
    # BEGIN SOLUTION
    with open(filepath) as fh:  # the file is closed automatically
        fh.readline()  # skip the header row
        return np.array([float(x) for x in fh])
    # END SOLUTION

In [6]:
result = data2array(os.path.join('data', 'restaurant.csv'))

In [7]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'check the returned datatype'
""" # END TEST CONFIG
isinstance(result, np.ndarray)

True

In [8]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'check the rows'
""" # END TEST CONFIG
result.shape[0] == 100000

True

In [9]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'check the data type of values'
""" # END TEST CONFIG
result.dtype == np.dtype('float64')

True

In [10]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'is the first element 16.87?'
""" # END TEST CONFIG
np.isclose(result[0], 16.87)

True

In [11]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'test on the mean of all elements'
""" # END TEST CONFIG
np.isclose(result.mean(), 14.9644172)

True

**Question 2:** How many restaurant bills are there?

**Question 3:** Suppose everyone leaves an 18% tip. Create an array of tip amounts. What is the total amount of tips in the array?

**Question 4:** What are the average, median, minimum, and maximum restaurant bills? Give your answer in an array, in the order listed.

**Question 5:** How many restaurant bills are greater than $15?

**Question 6:** What is the total amount of money across all of the restaurant bills? What proportion of that total comes from bills less than $5?

**Question 7:** What proportion of bills are within a $0.05 tolerance of $20?

**Question 8:** What proportion of restaurant bills end in 9 in the hundredths place?

Create a function `ends_in_9` that takes in an array of dollar amounts (like the output of Question 1) and returns the proportion of values that end in 9 in the hundredths place. 

*Hints:* Use the remainder function `%`. Be careful of floating point operations (use the rounding/integer conversion appropriately).
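
The floating-point warning can be seen on a toy array. This is a minimal sketch with made-up amounts, meant only to show why truncating to an integer before taking the remainder can give the wrong last digit.

```python
import numpy as np

# Toy amounts chosen for illustration; 0.29 * 100 evaluates to
# 28.999999999999996 in double precision rather than exactly 29.
amounts = np.array([0.29, 0.09, 1.50])
cents = amounts * 100

# Truncating to int drops the fractional part, so 0.29 becomes 28 cents.
truncated = cents.astype(int)

# Rounding first recovers the intended whole number of cents.
rounded = np.round(cents).astype(int)

print(truncated % 10)
print(rounded % 10)
```

With truncation, the 0.29 entry appears to end in 8, so a proportion computed from it would undercount values ending in 9.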

In [12]:
def ends_in_9(arr):
    """
    ends_in_9 takes in an array of dollar amounts 
    and returns the proportion of values that end 
    in 9 in the hundredths place.

    :Example:
    >>> arr = np.array([23.04, 45.00, 0.50, 0.09])
    >>> out = ends_in_9(arr)
    >>> 0 <= out <= 1
    True
    """
    # BEGIN SOLUTION
    return np.mean(np.round(arr * 100) % 10 == 9)
    # END SOLUTION

In [13]:
result = ends_in_9(data2array('data/restaurant.csv'))
arr = np.array([23.04, 45.00, 0.50, 0.09])
doc_result = ends_in_9(arr)

In [14]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'check doctest'
""" # END TEST CONFIG
0 <= doc_result <= 1

True

In [15]:
""" # BEGIN TEST CONFIG
points: 2
failure_message: 'correctness of answer on doctest array'
""" # END TEST CONFIG
np.isclose(doc_result, 0.25)

True

In [16]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'Tested on Restaurant data; approximately correct on *all* the data'
""" # END TEST CONFIG
np.isclose(result, 0.09945, atol=0.05)

True

In [17]:
""" # BEGIN TEST CONFIG
points: 1
failure_message: 'Tested on Restaurant data; did you round to
        int before taking the remainder? This is likely caused by
        a floating point rounding error.'
""" # END TEST CONFIG
np.isclose(result, 0.09945)

True

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file and not here by running the doctests: `python -m doctest discussion.py`.

In [20]:
# !python -m doctest discussion.py