In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# DSC 80 - Lab 01

### Due Date: Monday October 4, 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Note**: Labs will have public tests and private tests. The public "smoke tests" that you will run below and which appear on Gradescope are generally worth no points. After the due date, we will replace these tests with private tests that will determine your grade. This is different from DSC 10, where labs only had public tests!

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the Notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file is).
- Notebooks for *projects* will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file. You can write code here, but make sure that all of your real work is in the .py file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
    - Python development usually involves larger files made up of many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
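    - `autoreload` is necessary because Python caches every imported module in `sys.modules`. Subsequent imports of `lab` merely return that already-loaded module instead of re-reading `lab.py`. (The compiled bytecode in `__pycache__` only speeds up loading; it is refreshed whenever the source changes.)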
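
A minimal sketch of this caching behavior, using a throwaway module called `scratch_mod` (a hypothetical name, written to a temporary directory). A second `import` returns the module already stored in `sys.modules`, while `importlib.reload` re-executes the edited source; `autoreload` automates roughly the latter step.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # scratch_mod is a hypothetical module created only for this demo.
    module_path = Path(tmp) / "scratch_mod.py"
    module_path.write_text("VALUE = 1\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import scratch_mod
    first = scratch_mod.VALUE

    # Edit the source on disk (a different file size, so any cached
    # bytecode is recognized as stale).
    module_path.write_text("VALUE = 22222\n")

    import scratch_mod  # no-op: the module is already in sys.modules
    after_reimport = scratch_mod.VALUE

    importlib.reload(scratch_mod)  # re-executes the edited source
    after_reload = scratch_mod.VALUE

    # Clean up the demo module.
    sys.path.remove(tmp)
    del sys.modules["scratch_mod"]

print(first, after_reimport, after_reload)  # 1 1 22222
```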

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from lab import *

In [3]:
import os
import io
import pandas as pd
import numpy as np

## Part 1: Python Basics

**Question 0:**

Write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two adjacent list elements that are consecutive integers.
* Otherwise, returns `False`.

For example, because `9` is next to `8`:
```
>>> consecutive_ints([5,3,6,4,9,8])
True
```
Whereas:
```
>>> consecutive_ints([1,3,5,7,9])
False
```

*Note*: If you look at `lab.py`, you'll notice that the solution to this problem is already there. This question is done for you to show you what a completed homework problem looks like.
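
The version in `lab.py` is the reference. For comparison, here is one possible way to express the same check, written under the hypothetical name `consecutive_ints_sketch` so that it does not shadow the imported function:

```python
def consecutive_ints_sketch(ints):
    """Return True if two adjacent elements differ by exactly 1."""
    # zip pairs each element with its right-hand neighbor; an empty or
    # one-element list produces no pairs, so any() returns False.
    return any(abs(a - b) == 1 for a, b in zip(ints, ints[1:]))


print(consecutive_ints_sketch([5, 3, 6, 4, 9, 8]))  # True
print(consecutive_ints_sketch([1, 3, 5, 7, 9]))     # False
print(consecutive_ints_sketch([]))                  # False
```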

In [6]:
# these cells are here for you to write scratch work in; you should write the code
# for your answer in the .py file


There are two ways to test your code:

1. Run the cell below to test your code. You should also try writing some of your own tests by calling your functions on different inputs. Does it work for corner cases? Real-world data is **very messy** and you should expect your data processing code to break without thorough testing!
2. Run doctests on `lab.py` by running the following command on the commandline:
```
python -m doctest lab.py
```
If the doctests pass, then there should be *no* output.

The tests below *include* the doctests, so you do not need to run both. You may find it more convenient to run the doctests on the command line or from your IDE.
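
To see what a doctest is, here is a small sketch with a hypothetical function `double` (not part of the lab). The examples in the docstring are the tests; the `doctest` module runs them and compares the printed results against the expected lines.

```python
import doctest


def double(x):
    """
    Return twice the input.

    >>> double(2)
    4
    >>> double('ab')
    'abab'
    """
    return 2 * x


# Collect and run the docstring examples of `double` only.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(double, name="double", globs={"double": double}):
    runner.run(test)

results = runner.summarize(verbose=False)
print(results.failed, results.attempted)  # 0 2
```

As on the command line, nothing is reported for the individual examples when they all pass.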

In [None]:
grader.check("q0")

**Question 1 (median and average):**

Write a function called `median_vs_average` that takes a non-empty list of numbers and returns `True` if the median is greater than or equal to the average, and `False` otherwise.

To find the median of a list with even length, take the mean of the two middle elements of the sorted list. Do not use any imported libraries for this question; you may use any built-in function.


In [None]:
grader.check("q1")

**Question 2 (List Distances):**

Similar to Question 0, write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two list elements $i$ places apart, whose distance as integers is also $i$.
* Otherwise, returns `False`.

Assume your inputs tend to satisfy the condition, and the pair(s) satisfying the condition tend to be close together; design your function to run faster for this case. (Optimizing your code for an assumed distribution of incoming data is very common in data science.)

For example, because `3` and (the second) `5` are two places apart, and $|3-5| = 2$:
```
>>> same_diff_ints([5,3,1,5,9,8])
True
```
Whereas:
```
>>> same_diff_ints([1,3,5,7,9])
False
```

*Note*: Make sure to define some extreme test cases. Use the `%time` line magic (or the `%%time` cell magic) to time your function!
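
Outside a notebook, the standard-library `timeit` module serves a similar purpose. The sketch below times a hypothetical early-exit function, `has_negative` (unrelated to the lab), on two extreme inputs: one where the answer is found immediately and one where the whole list must be scanned.

```python
import timeit


def has_negative(values):
    """Hypothetical example: stops at the first negative value it sees."""
    for v in values:
        if v < 0:
            return True
    return False


early = [-1] + [1] * 100_000   # hit on the first element
late = [1] * 100_000 + [-1]    # hit on the last element

t_early = timeit.timeit(lambda: has_negative(early), number=20)
t_late = timeit.timeit(lambda: has_negative(late), number=20)
print(f"early hit: {t_early:.5f}s, late hit: {t_late:.5f}s")
```

Exact timings vary by machine; the point is the gap between the two cases.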

In [25]:
%%time
# time your function
same_diff_ints([5,3,1,5,9,8])

In [None]:
grader.check("q2")

## Part 2: Strings and Files

The following questions will help you (re)learn the basics of working with strings and reading data from files (which are read in as strings, by default).

**Question 3 (N Prefixes):**

Write a function `n_prefixes` that takes a string and a positive integer `n`. It returns a string consisting of the first `n` consecutive prefixes of the input string, in reverse order (longest first). For example, `n_prefixes('Data!', 3)` should return `'DatDaD'`. (See the doctests for more examples.)

Recall that [strings may be sliced](https://docs.python.org/3/tutorial/introduction.html#strings), like lists.
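
As a quick refresher on slicing (an illustration only, not a solution):

```python
word = "Data!"

print(word[:1])   # D
print(word[:3])   # Dat
print(word[1:4])  # ata
print(word[-1])   # !
print(word[:99])  # Data!  (out-of-range slice bounds are clipped)
```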


In [None]:
grader.check("q3")

**Question 4 (Exploded numbers):**  

Write a function `exploded_numbers` that takes in a list of integers and a non-negative integer $N$ and returns a list of strings, each containing a number from the list expanded by $N$ integers in both directions, separated by spaces.

Additionally, [zero pad](https://www.tutorialspoint.com/python/string_zfill.htm) each integer, so that each has the same length.
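
A small illustration of `str.zfill`, using made-up values and padding to the width of the largest number:

```python
numbers = [3, 8, 15]

# Width of the longest number when written as a string.
width = len(str(max(numbers)))

padded = [str(n).zfill(width) for n in numbers]
print(padded)  # ['03', '08', '15']
```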

For example:

```
>>> exploded_numbers([3, 4], 2)
['1 2 3 4 5', '2 3 4 5 6']
>>> exploded_numbers([3, 8, 15], 2)
['01 02 03 04 05', '06 07 08 09 10', '13 14 15 16 17']
```

**Note**: you can assume that negative numbers will never be encountered. That is, when testing your code, we will never explode a number so much that it goes negative.

In [None]:
grader.check("q4")


[Recall](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) that the built-in function `open` takes in a file path and returns *a file object* (sometimes called a *file handle*). Below are a few properties of file objects:

* `open(path)` opens the file at location `path` for reading.
* `open(path)` is an *iterable*, which contains successive lines of the file.
* Once a file object is opened, it should be closed after use to avoid resource leaks (such as open file handles). To ensure a file is closed once you are done, use a *context manager* as follows:
```
with open(path) as fh:
    for line in fh:
        process_line(line)
```
* To read the entire file into a string, use the read method:
```
with open(path) as fh:
    s = fh.read()
```
However, you should be careful when reading an entire file into memory that the file isn't too big! *You should avoid this whenever possible!*
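
The sketch below puts these properties together on a small temporary file (created here only for illustration). Note that each line yielded by the file object still ends with its newline character, and that the file is closed as soon as the `with` block exits.

```python
import os
import tempfile

# Write a two-line scratch file to iterate over.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path) as fh:
    lines = [line for line in fh]

print(lines)      # ['first line\n', 'second line\n']
print(fh.closed)  # True

# Drop the trailing newline from each line.
stripped = [line.rstrip("\n") for line in lines]
print(stripped)   # ['first line', 'second line']

os.remove(path)
```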

**Question 5 (Reading Files):**

Create a function `last_chars` that takes a file object and returns a string consisting of the last character of each line.

*Remark:* A newline is the "delimiter" of the lines of a file, and doesn't count as part of the line (as the tests imply). Every other character is part of the line. For more info on this, see [how newlines are interpreted](https://en.wikipedia.org/wiki/Newline#Interpretation) as line separators or terminators.

In [None]:
grader.check("q5")

## Part 3: `numpy` exercises

For an introduction to arrays and `numpy` recall the relevant section of [DSC 10](https://www.inferentialthinking.com/chapters/05/1/Arrays.html).

**Question 6 (Basic Arrays):**

Create the following functions using `numpy` methods satisfying the requirements given in each part. Your solutions should **not** contain any loops or list comprehensions.

* A function `arr_1` that takes in a `numpy` array and adds to each element the square-root of the index of each element.

* A function `arr_2` that takes in a `numpy` array of integers and returns a boolean array (i.e. an array of booleans) whose `ith` element is `True` if and only if the `ith` element of the input array is a perfect square.

* A function `arr_3` that takes in a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share on successive days in USD and returns an array of growth rates. That is, the `ith` number of the output array should contain the rate of growth in stock price from the $i^{th}$ day to the $(i+1)^{th}$ day. The growth rate should be a proportion, rounded to the nearest hundredth.

* Suppose:
    - `A` is a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share for a company on successive days in USD 
    - You start each day with \\$20 to buy as much stock as possible on that day. 
    - Any money left-over after a given day is saved for possibly buying stock on a future day. 
    - Create a function `arr_4` that takes in `A` and returns the day on which you can buy at least one share from 'left-over' money. If this never happens, return `-1`. The first stock purchase occurs on day 0. *Note: you cannot buy fractions of a share of stock*.
    
    - *Example:* If the stock price is \\$3 every day, then the answer is 'day 1':
        - day 0: buy six shares with \\$20; \\$2 is left over, so your total leftover is \\$2 and you can't buy an extra share.
        - day 1: buy six shares with \\$20; another \\$2 is left over, so your total leftover is now \\$4 and you can buy one extra share. The answer is therefore day 1.
    - Hint: `np.cumsum` may be helpful for this question
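
As a reminder of what loop-free `numpy` code looks like, here is a short sketch on a made-up array. It shows element-wise arithmetic, index arrays via `np.arange`, boolean arrays from comparisons, and running totals with `np.cumsum`; it is not a solution to any of the parts above.

```python
import numpy as np

a = np.array([2, 4, 6, 7])

idx = np.arange(len(a))   # positions: [0, 1, 2, 3]
scaled = a * 10 + idx     # element-wise: [20, 41, 62, 73]
is_even = a % 2 == 0      # boolean array: [True, True, True, False]
running = np.cumsum(a)    # running totals: [2, 6, 12, 19]

print(scaled, is_even, running)
```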

In [61]:
# don't change this cell -- it is needed for the tests to work
x = 42
A_1 = np.array([2, 4, 6, 7])
out_1 = arr_1(A_1)

fp = os.path.join('data', 'stocks.csv')
stocks = np.array([float(x) for x in open(fp)])
out_3_stocks = arr_3(stocks)

A_4 = np.array([3, 3, 3, 3])
out_4 = arr_4(A_4)

In [None]:
grader.check("q6")

## Part 4: Getting Started with Pandas

The following questions will help you get comfortable with Pandas. These questions are similar to questions on tables in DSC 10; review the [textbook](https://www.inferentialthinking.com) as necessary. As always for Pandas questions:
1. Avoid writing loops through the rows of the dataset to do the problem, and
2. Test the output/correctness of your code with the help of the dataset given, but be sure your code will also run on data "like" the dataset given (sampling rows using the `.sample` method is useful for this!).
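
For instance, `.sample` can draw a random subset of rows to try your function on. The DataFrame below is toy data invented for this illustration, not the lab's dataset.

```python
import pandas as pd

# Toy data, made up for this example.
df = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "salary": [5, 3, 9, 1, 7],
})

# Draw 3 rows at random; random_state makes the draw repeatable.
subset = df.sample(n=3, random_state=0)
print(subset)
```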

**Question 7 (Pandas basics):**   

Read in the file `salary.csv` in the `data` directory which contains the salary information for the 2017-2018 NBA season and understand the dataset by answering the following questions. To do this, create a function `salary_stats` that takes in a dataframe like `salary` and returns a series containing the following statistics:
* The number of players (`num_players`).
* The number of teams (`num_teams`).
* The total salary amount over the season (`total_salary`).
* The name of the player with the highest salary; there are no ties (`highest_salary`).
* The average salary of the Boston Celtics ('BOS'), rounded to the nearest hundredth (`avg_bos`).
* The name and team of the player whose salary is the third-lowest, separated by a comma and a space (e.g. `John Doe, MIA`); if there are ties, return the first based on alphabetical order (`third_lowest`).
* Whether there are any duplicate last names (True: yes, False: no), as a boolean (`duplicates`).
* The total salary of the team that has the highest paid player (`total_highest`).

The index labels of the output series are given in parentheses above.

*Note*: Your function should work on a dataset of the same format that contains information from other years. You may assume that none of the answers involving ranking returns a tie.

*Note*: To make sure your function still runs in the event that one of the 8 parts throws an exception (e.g. due to a very incorrect answer), wrap each part in a `try`/`except` block. Here's a useful link: https://www.w3schools.com/python/python_try_except.asp
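
A minimal sketch of that pattern, using a hypothetical helper `safe_stats` on a plain list rather than the salary data. Each statistic gets its own `try`/`except`, so a failure in one does not prevent the others from being computed.

```python
def safe_stats(values):
    """Hypothetical helper: compute each statistic independently."""
    stats = {}

    try:
        stats["total"] = sum(values)
    except Exception:
        stats["total"] = None

    try:
        stats["mean"] = sum(values) / len(values)
    except Exception:
        # e.g. ZeroDivisionError for an empty list
        stats["mean"] = None

    return stats


print(safe_stats([1, 2, 3]))  # {'total': 6, 'mean': 2.0}
print(safe_stats([]))         # {'total': 0, 'mean': None}
```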

In [89]:
# do not edit this cell -- it is needed for the tests
salary_fp = os.path.join('data', 'salary.csv')
salary = pd.read_csv(salary_fp)
stats = salary_stats(salary)

salary_sample = pd.read_csv('data/salary_sample.csv')
sample_stats = salary_stats(salary_sample)

In [None]:
grader.check("q7")

## Part 5: CSV Files

**Question 8 (Reading malformed csv files):**

`malformed.csv` is a file of comma-separated values with the following fields:


|column name|description|type|
|---|---|---|
|first|first name of person|str|
|last|last name of person|str|
|weight|weight of person (lbs)|float|
|height|height of person (in)|float|
|geo|location of person; comma-separated latitude/longitude|str|

Unfortunately, the entries contain errors that cause the Pandas `read_csv` function to fail to parse the file with the default settings. Instead, you must read in the file manually using Python's built-in `open` function.

Clean the csv file by creating a function called `parse_malformed` that takes in a file path and returns a parsed, properly typed DataFrame. The DataFrame should contain the columns described in the table above (with the specified types); it should agree with `pd.read_csv` when the lines are not malformed.


*Note:* Assume that the given csv file is a sample of a larger file; you will be graded against a **different** sample of the larger file that has the same type of parsing errors. That is, you should **not** hard-code your cleaning of the data to specific errors on specific lines in the data.

In [105]:
# do not edit -- needed for tests
fp = os.path.join('data', 'malformed.csv')
cols = ['first', 'last', 'weight', 'height', 'geo']
df = parse_malformed(fp)
dg = pd.read_csv(fp, nrows=4, skiprows=10, names=cols)

In [None]:
grader.check("q8")

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. To confirm that all of your work is in the `.py` file and not here, run the doctests: `python -m doctest lab.py`.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()