from cs103 import *
from typing import List, Optional

# Lesson 7 - Writing Analysis Programs

Now that you understand the basics, the rubber is going to hit the road! You are going to connect all of the dots and start writing full-on programs using the "functional" programming style.

You may have noticed in the past two workbooks that I have been asking you to use the output of one of your functions as the input of another of your functions. 

Often, a program is written, or composed, as a _chain of functions_: the output of one function becomes the input of the next function, whose output becomes the input of the one after that.

Functional programming refers to the act of writing _pure functions_ and then nesting, or _composing_, them to create a complete program.
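
As a minimal sketch of composition, consider two small pure functions. The names `parse_numbers` and `total` are hypothetical and chosen only for illustration; they are not part of this lesson's program.

```python
from typing import List


def parse_numbers(text: str) -> List[float]:
    """Returns the comma-separated numbers in 'text' as a list of floats."""
    return [float(item) for item in text.split(",")]


def total(numbers: List[float]) -> float:
    """Returns the sum of 'numbers'."""
    return sum(numbers)


# Composition: the output of parse_numbers becomes the input of total
result = total(parse_numbers("1.5,2.5,4"))
print(result)
```

Neither function changes anything outside of itself, so each one can be tested on its own before the two are chained together.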

## How to Write Analysis Programs (template)

The template below demonstrates the steps often needed to create an "analysis program". Some characteristics of an analysis program are as follows:

1. Read in data from some data source
2. Interpret the data into a meaningful data definition
3. Filter the data and/or correlate the data with another data source
4. Report or visualize the filtered data
5. Accept user input to customize how data is filtered

Each of these **five steps** will be represented by functions in our program. Knowing this, a general template for the program can be written:

```python
from typing import NamedTuple, List
from dataclasses import dataclass
import plotly.express as px


# Data Definitions
DataType = ...

# interp. A datatype

# Examples
DT1 = ...

def read_data(csv_filepath: str) -> List[list]:   # STEP 1
    """
    Returns a list representing the lines of data in the 'csv_filepath'
    """
    ...
    
    
def records_to_datatype_list(data_records: list) -> List[DataType]:    # STEP 2B
    """
    Returns a list of DataType created from each line of
    data in 'data_records'.
    """
    ...
    
    
def record_to_datatype(record: list) -> DataType:    # STEP 2A
    """
    Returns a DataType object representing the pertinent data
    that needs to be retrieved from 'record'. A helper function
    to records_to_datatype_list(...).
    """
    ...
    
    
def filter_datatype_by_param(lodt: List[DataType], param: ...) -> List[DataType]:    # STEP 3
    """
    Returns a list of DataType where each DataType.param is equal to 'param'.
    """
    ...
    
    
def plot_datatypes(lodt: List[DataType]) -> None:    # STEP 4
    """
    Returns None. Displays a plot of number of artworks on the y-axis and the 
    year the artwork was created on the x-axis.
    """
    ...
    
    
def analyze_datatype(csv_filepath: str, param: ...) -> None:     # STEP 5
    """
    Plots the data in csv_filepath after it has been filtered by `param`
    """
    csv_data = read_data(csv_filepath)
    list_of_datatype = records_to_datatype_list(csv_data)
    filtered_data = filter_datatype_by_param(list_of_datatype, param)
    plot_datatypes(filtered_data)
    return # Plot just shows, nothing to return
```

The "main" function can also be written like this:

```python
def analyze_datatype(csv_filepath: str, param: ...) -> None:     # STEP 5
    """
    Plots the data in csv_filepath after it has been filtered by `param`
    """
    plot_datatypes( # This is function "composition" here...
        filter_datatype_by_param(
            records_to_datatype_list(
                read_data(csv_filepath)
            ), param
        )
    )
```

### One Task Per Function ("OTPF")

The functional programming style generally sticks to the central precept of **one task per function**. This is not just some arbitrary "law of programming". It is profoundly good advice intended to make your life as easy as possible in the following ways:

1. OTPF focuses your mind on a single task. It is much easier to design and write a function that does one thing. It prevents you from getting "lost in the weeds" as you try to do too much at once.
2. Code reuse becomes more likely if your functions focus on a single task. OTPF tends to make your function more "general". The more "specialized" your function gets by bundling more tasks into the function, the less useful the function becomes to a different purpose. **In some ways, code reuse is the "holy grail" for programmers:** imagine sitting down to write a program only to realize you have already written 80% of the components in other programs! You just need to hook them together!
3. Debugging and maintenance become MUCH easier in a function that only does one thing.
4. In the programming world, it is widely recognized that "code is more often read than it is written". Code is only written ONCE! But, you (or someone else) may need to go back and review what you have written several times if you are debugging it or adding more features. If your function called `plot_artworks()` plots the artworks AND also returns the results to the screen AND also saves the results in a file, you may have a hard time finding the "function responsible" for saving faulty files because all you see in front of you is `plot_artworks()`.
5. Testing. Testing. Testing. It is MUCH easier and faster to write tests and anticipate "edge cases" in functions that only do one thing.
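
To make the idea concrete, here is a small sketch of three single-task functions that are then hooked together. The function names and the sample `readings` are hypothetical, invented only for this illustration.

```python
from typing import List


def keep_positive(values: List[float]) -> List[float]:
    """Returns only the values greater than zero."""
    return [v for v in values if v > 0]


def average(values: List[float]) -> float:
    """Returns the arithmetic mean of 'values'."""
    return sum(values) / len(values)


def format_report(label: str, value: float) -> str:
    """Returns a one-line report string."""
    return f"{label}: {value:.2f}"


readings = [4.0, -1.0, 6.0, 0.0, 2.0]
report = format_report("Average positive reading", average(keep_positive(readings)))
print(report)
```

Because each function does one thing, an edge case such as "what should `average` do with an empty list?" is easy to spot and to test in isolation.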

# New Python Skills Needed For This Workbook

1. Using `pathlib.Path` to represent file paths instead of using `str`
2. Reading and writing files with `open(...)`
3. Using a "context manager", `with ... as ...:`
4. Using `csv` to read CSV data
5. Plotting data with `plotly.express`

## Access the filesystem with `pathlib`

```python
import pathlib
```

pathlib is a newer addition to the Python standard library that allows you to create `pathlib.Path` objects that interact with locations on your machine.

```python
path_as_str = "C:\\Users\\cferster\\Desktop" # This is just a str. Nothing special.

path_obj = pathlib.Path("C:\\Users\\cferster\\Desktop") # This is a Path object that can be manipulated
```
**By example:**

```python
here = pathlib.Path() # Represents a relative path in the current working directory
data_sets_dir = here / "Data Sets" # You can use the division operator to navigate into a directory
artworks_file = here / "Data Sets" / "Artworks.csv"

here_absolute = pathlib.Path.cwd() # Creates an absolute path to the current working directory

artworks_file.exists() # Returns True if the path exists, False otherwise
```

By working with `pathlib.Path` objects, you have access to methods, such as `.exists()`, for inspecting and navigating the filesystem.
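
A short sketch of some other commonly used `Path` attributes follows. It only manipulates the path itself, so it runs whether or not the file exists; `summary_file` is a hypothetical name used for illustration.

```python
import pathlib

artworks_file = pathlib.Path("Data Sets") / "Artworks.csv"

print(artworks_file.name)    # File name with extension
print(artworks_file.stem)    # File name without extension
print(artworks_file.suffix)  # File extension, including the dot
print(artworks_file.parent)  # The directory containing the file

# with_suffix returns a new Path; the original is unchanged
summary_file = artworks_file.with_suffix(".txt")
print(summary_file)
```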



## Using `open()` to read and write files

The `open()` function allows you to access data within a file on your computer. You can open a file in "read" mode or "write" mode. 

**Reading a file, by example**

```python


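import pathlib

here = pathlib.Path()  # The current working directory
artworks_file = here / "Data Sets" / "Artworks.csv"

with open(artworks_file, mode="r") as file:
    for line in file.readlines():
        print(line, end="")  # Each line already ends with "\n"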
```

**Writing a file, by example**

```python
new_file = pathlib.Path("new_file.txt")

data_to_write = ["Line 1\n", "Line 2\n", "Line 3\n"]  # writelines() does not add line endings for you

with open(new_file, mode="w") as file:
    file.writelines(data_to_write)
```
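
As a sketch of a full round trip, the example below writes a few lines and reads them back. It uses the standard library's `tempfile` module (not otherwise covered in this lesson) so that no files are left behind; `demo.txt` is a hypothetical file name. Note that `writelines()` writes the strings exactly as given, so each string carries its own `\n`.

```python
import pathlib
import tempfile

lines_to_write = ["Line 1\n", "Line 2\n", "Line 3\n"]

# A temporary directory keeps this sketch from leaving files behind
with tempfile.TemporaryDirectory() as temp_dir:
    demo_file = pathlib.Path(temp_dir) / "demo.txt"

    with open(demo_file, mode="w") as file:
        file.writelines(lines_to_write)

    with open(demo_file, mode="r") as file:
        lines_read = file.readlines()

print(lines_read)
```

The `with` block closes the file automatically when its indented body ends, even if an error occurs inside it.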

## Using the `csv` module to parse CSV files

CSV data looks like this:

```
some data 1,some data 2,some data 3,some data 4
some data 5,some data 6,some data 7,some data 8
...
```

When it gets read in Python as a single string with `file.read()`, it looks like this:

```python
"some data 1,some data 2,some data 3,some data 4\nsome data 5,some data 6,some data 7,some data 8\n..."
```

You have the skills to manually convert this into a list of lists by using `.split()`, etc. but there is an easier way of doing that.

Using the `csv` module, you can automatically "parse" (interpret and make meaning of) CSV data.

**By example:**

```python
import csv
import pathlib

here = pathlib.Path()
artworks_file = here / "Data Sets" / "Artworks.csv"

with open(artworks_file, mode="r", newline="") as file:  # newline="" is recommended by the csv docs
    acc = []
    for line in csv.reader(file, delimiter=","):
        acc.append(line)
```
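
To see why the `csv` module is preferable to a manual `.split(",")`, consider a field that itself contains a comma. The sketch below uses invented sample data and `io.StringIO` as a stand-in for an open file, so it runs without any data file on disk.

```python
import csv
import io

# Invented sample data: the title in the second line contains a comma
raw_text = 'title,artist,year\n"Sunset, Study No. 2",A. Painter,1950\n'

# Naive approach: splitting on every comma breaks the quoted title in two
naive_rows = [line.split(",") for line in raw_text.splitlines()]

# csv.reader understands quoting; io.StringIO stands in for an open file
parsed_rows = list(csv.reader(io.StringIO(raw_text)))

print(naive_rows[1])
print(parsed_rows[1])
```

The naive split yields four items for the second line, while `csv.reader` yields the expected three.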

## Bringing it all together in a function

To read the data from a CSV data file, we bring all of these steps together into a function that takes only a file path as an argument.

**Here's an example:**

```python
my_data_file = pathlib.Path("Data Sets") / "Artworks.csv"

def read_artworks_file(csv_path: pathlib.Path) -> List[list]:
    """
    Returns a list of list representing the lines in the file at 'csv_path'
    with each line as a sublist and each item in the csv file as a list item.
    """
    acc = []
    with open(csv_path, mode="r", newline="") as file:  # newline="" is recommended by the csv docs
        for line in csv.reader(file):
            acc.append(line)
    return acc

read_artworks_file(my_data_file)
```

## Plotting data using `plotly.express`

In Python, there are MANY excellent graphing and plotting libraries available. Each takes a bit of time to learn. Here are the names of some of the most well-known and most used (in alphabetical order):

* [altair](https://altair-viz.github.io/)
* [bokeh](https://bokeh.org/)
* [matplotlib](https://matplotlib.org/stable/index.html)
* [plotly](https://plotly.com/python/)

Personally, I often use **plotly** because of how flexible it is and because I often find myself using 3D plots (altair and bokeh are 2D only). 

Altair is very easy to use and has a nice "API" (application programming interface) but plotly has also developed a simple-to-use interface called **plotly express**, which is what we will be using.


## Basic Usage

```python
import plotly.express as px

x_values = [1, 2, 3, 4, 5]
y_values = [1, 4, 9, 16, 25]

px.<chart_type>(x=x_values, y=y_values)
```

Where `<chart_type>` could be any of the following:

* `scatter`, e.g. `px.scatter(x=x_values, y=y_values)`
* `line`
* `area`
* `pie` - takes `names` and `values` instead of `x` and `y`
* `bar`
* `histogram` - typically given only `x` values (the counts are computed for you)
* ...[and many more](https://plotly.com/python/plotly-express/)
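
Data usually needs to be arranged into `x` and `y` lists before it can be plotted. The sketch below counts invented sample years with `collections.Counter` from the standard library; the `px.bar` call is left as a comment so the sketch runs even where plotly is not installed.

```python
from collections import Counter

# Hypothetical sample data: the year each artwork was created
years_created = [1950, 1951, 1950, 1953, 1951, 1950]

counts_by_year = Counter(years_created)

x_values = sorted(counts_by_year)                       # Distinct years, ascending
y_values = [counts_by_year[year] for year in x_values]  # Number of artworks per year

print(x_values)
print(y_values)

# With plotly installed, the chart could then be drawn with:
# import plotly.express as px
# px.bar(x=x_values, y=y_values).show()
```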

# Writing an Analysis Program, by example



## Analysis program template

```python
from typing import NamedTuple, List
from dataclasses import dataclass
import plotly.express as px


# Data Definitions
DataType = ...

# interp. A datatype

# Examples
DT1 = ...

def read_data(csv_filepath: str) -> List[list]:   # STEP 1
    """
    Returns a list representing the lines of data in the 'csv_filepath'
    """
    ...
    
start_testing()
expect(read_data(...), ...)
summary()
    

def records_to_datatype_list(data_records: list) -> List[DataType]:    # STEP 2B
    """
    Returns a list of DataType created from each line of
    data in 'data_records'.
    """
    ...
    
start_testing()
expect(records_to_datatype_list(...), ...)
summary()

    
def record_to_datatype(record: list) -> DataType:    # STEP 2A
    """
    Returns a DataType object representing the pertinent data
    that needs to be retrieved from 'record'. A helper function
    to records_to_datatype_list(...).
    """
    ...
    
start_testing()
expect(record_to_datatype(...), ...)
summary()

    
def filter_datatype_by_param(lodt: List[DataType], param: ...) -> List[DataType]:    # STEP 3
    """
    Returns a list of DataType where each DataType.param is equal to 'param'.
    """
    ...
    

start_testing()
expect(filter_datatype_by_param(...), ...)
summary()

def plot_data(lodt: List[DataType]) -> None:    # STEP 4
    """
    Returns None. Displays a plot of number of artworks on the y-axis and the 
    year the artwork was created on the x-axis.
    """
    ...
    
# Visual test on plot_data with short sample data

# The last function is the "main" function where you "orchestrate" the calling of all of your helper functions
def analyze_datatype(csv_filepath: str, param: ...) -> None:     # STEP 5
    """
    Plots the data in csv_filepath after it has been filtered by `param`
    """
    csv_data = read_data(csv_filepath)
    list_of_datatype = records_to_datatype_list(csv_data)
    filtered_data = filter_datatype_by_param(list_of_datatype, param)
    plot_data(filtered_data)
    return # Plot just shows, nothing to return
```
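
As one possible way of filling in steps 2 and 3 of the template, here is a minimal sketch. The `Artwork` data definition, its fields, the function names, and the sample records are all assumptions made for illustration; the real data file may have different columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Artwork:
    """interp. An artwork with a title, an artist and the year it was created."""
    title: str
    artist: str
    year: int


def record_to_artwork(record: list) -> Artwork:    # STEP 2A
    """Returns an Artwork built from a [title, artist, year] record."""
    return Artwork(title=record[0], artist=record[1], year=int(record[2]))


def records_to_artworks(records: List[list]) -> List[Artwork]:    # STEP 2B
    """Returns a list of Artwork, one for each record in 'records'."""
    return [record_to_artwork(record) for record in records]


def filter_artworks_by_artist(loa: List[Artwork], artist: str) -> List[Artwork]:    # STEP 3
    """Returns the artworks in 'loa' whose artist is equal to 'artist'."""
    return [artwork for artwork in loa if artwork.artist == artist]


# Stands in for the output of STEP 1 (header row already removed)
sample_records = [
    ["Sketch A", "A. Painter", "1950"],
    ["Sketch B", "B. Sculptor", "1951"],
    ["Sketch C", "A. Painter", "1953"],
]

filtered = filter_artworks_by_artist(records_to_artworks(sample_records), "A. Painter")
print(filtered)
```

Working from a short list of sample records like this makes each step easy to test before pointing the program at a full data file.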

### A Note on different kinds of text-based file formats

The following is a list of text-based (human readable) file formats you may encounter. Python has built-in ways for handling most of them.

* **CSV/TSV file**: Comma-separated values or Tab-separated values. These are essentially the same format: tabular-type data with each _record_ of data on a new line and the _fields_ for each record separated by some character (commonly a comma `,` or a tab `\t`). **Python has the `csv` module in the standard library to read CSV files**. Many software programs will export a CSV file.
* **JSON file**: JavaScript Object Notation is a common type of file for data transmitted on the internet and is also a convenient file type for storing data that may not fit neatly into a table, such as nested data. JSON looks very similar to a Python dictionary with data in it (small differences include `true`/`false`/`null` instead of `True`/`False`/`None`). **Python has the `json` module in the standard library to read JSON files**. Jupyter Notebook .ipynb files are JSON files.
* **XML file**: eXtensible Markup Language is a file format that looks like HTML but instead of the tags representing document formatting, they represent data fields. Like JSON, it is a convenient format for storing data that may not fit neatly into a table. **Python has the `xml` module in the standard library to read XML files**. Decon files are XML files.
* **Mixed formats**: Some programs create text-based files in a mix of formats. For example, spColumn .cti files are kind of like chunks of CSV data mixed with custom headings and metadata (data about the data). SConcrete .sco files are most similar to JSON files but with terrible formatting that makes it difficult for a computer to read without a customized program to read it.
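
As a brief sketch of the `json` and `xml` modules mentioned above, the example below parses two small, invented text snippets directly from strings. The field names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Invented sample data describing the same artwork in two formats
json_text = '{"title": "Sketch A", "year": 1950, "tags": ["ink", "paper"]}'
xml_text = "<artwork><title>Sketch A</title><year>1950</year></artwork>"

# json.loads turns JSON text into Python dicts, lists, strings and numbers
artwork_from_json = json.loads(json_text)

# ET.fromstring returns the root element; XML text values are always str
root = ET.fromstring(xml_text)
artwork_from_xml = {
    "title": root.findtext("title"),
    "year": int(root.findtext("year")),
}

print(artwork_from_json)
print(artwork_from_xml)
```

Note that JSON keeps the year as a number, while the XML value arrives as text and has to be converted with `int(...)`.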