from cs103 import *
from typing import List, Optional

# Lesson 7 - Writing Analysis Programs

Now that you understand the basics, the rubber is going to start hitting the road! You are going to start connecting all of the dots and writing full-on programs using the "functional" programming style.

You may have noticed in the past two workbooks that I have been asking you to use the output of one of your functions as the input of another of your functions. 

Often, a program is written, or composed, as a _chain of functions_. The output of one function becomes the input of another function, whose output becomes the input of yet another function.

Functional programming refers to the act of writing _pure functions_ and then nesting, or _composing_, them to create a complete program.
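
As a minimal sketch of this idea, here are three small pure functions composed so that each output feeds the next input. The function names and the sample text are made up for illustration; they are not part of the course workbooks.

```python
from typing import List


def parse_numbers(text: str) -> List[float]:
    """Returns the comma-separated numbers in 'text' as a list of floats."""
    return [float(item) for item in text.split(",")]


def keep_positive(numbers: List[float]) -> List[float]:
    """Returns only the values in 'numbers' that are greater than zero."""
    return [n for n in numbers if n > 0]


def total(numbers: List[float]) -> float:
    """Returns the sum of 'numbers'."""
    return sum(numbers)


# Composition: the output of each function is the input of the next
result = total(keep_positive(parse_numbers("3.5, -1, 2, -4.5")))
print(result)
```

Each function is "pure" in the sense that it only depends on its inputs and does not change anything outside itself, which is what makes it safe to nest them like this.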

## How to Write Analysis Programs (template)

The template below demonstrates the steps often needed to create an "analysis program". Some characteristics of an analysis program are as follows:

1. Read in data from some data source
2. Interpret the data into a meaningful data definition
3. Filter the data and/or correlate the data with another data source
4. Report or visualize the filtered data
5. Accept user input to customize how data is filtered

Each of these **five steps** will be represented by functions in our program. Knowing this, a general template for the program can be written:

```python
from typing import NamedTuple, List
from dataclasses import dataclass
import plotly.express as px


# Data Definitions
DataType = ...

# interp. A datatype

# Examples
DT1 = ...

def read_data(csv_filepath: str) -> List[list]:   # STEP 1
    """
    Returns a list representing the lines of data in the 'csv_filepath'
    """
    ...
    
    
def records_to_datatype_list(data_records: list) -> List[DataType]:    # STEP 2B
    """
    Returns a list of DataType created from each line of
    data in 'data_records'.
    """
    ...
    
    
def record_to_datatype(record: list) -> DataType:    # STEP 2A
    """
    Returns a DataType object representing the pertinent data
    that needs to be retrieved from 'record'. A helper function
    to records_to_datatype_list(...).
    """
    ...
    
    
def filter_datatype_by_param(lodt: List[DataType], param: ...) -> List[DataType]:    # STEP 3
    """
    Returns a list of DataType where each DataType.param is equal to 'param'.
    """
    ...
    
    
def plot_datatypes(lodt: List[DataType]) -> None:    # STEP 4
    """
    Returns None. Displays a plot of number of artworks on the y-axis and the 
    year the artwork was created on the x-axis.
    """
    ...
    
    
def analyze_datatype(csv_filepath: str, param: ...) -> None:     # STEP 5
    """
    Returns None. Plots the DataType records in 'csv_filepath' whose
    DataType.param is equal to 'param'.
    """
    return (
        plot_datatypes( # This is function "composition" here...
            filter_datatype_by_param(
                records_to_datatype_list(
                    read_data(csv_filepath)
                ), param
            )
        )
    )
```
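
To make the template less abstract, here is one possible way steps 1 to 3 could be filled in. This is only a sketch under stated assumptions: the `Artwork` data definition, the two-column CSV layout (title, year, no header row), and the sample rows are all made up for illustration. The plotting step is left out so the sketch runs without any extra installs.

```python
import csv
import tempfile
from pathlib import Path
from typing import List, NamedTuple


class Artwork(NamedTuple):  # hypothetical data definition
    title: str
    year: int


def read_data(csv_filepath: str) -> List[list]:  # STEP 1
    """Returns a list representing the lines of data in 'csv_filepath'."""
    with open(csv_filepath, mode="r", newline="") as file:
        return [line for line in csv.reader(file)]


def record_to_artwork(record: list) -> Artwork:  # STEP 2A
    """Returns an Artwork built from the title and year fields of 'record'."""
    return Artwork(title=record[0], year=int(record[1]))


def records_to_artwork_list(data_records: List[list]) -> List[Artwork]:  # STEP 2B
    """Returns a list of Artwork, one for each record in 'data_records'."""
    return [record_to_artwork(record) for record in data_records]


def filter_artworks_by_year(loa: List[Artwork], year: int) -> List[Artwork]:  # STEP 3
    """Returns the Artwork in 'loa' whose year is equal to 'year'."""
    return [artwork for artwork in loa if artwork.year == year]


def artworks_from_year(csv_filepath: str, year: int) -> List[Artwork]:
    """Returns the Artwork in 'csv_filepath' that were created in 'year'."""
    return filter_artworks_by_year(
        records_to_artwork_list(read_data(csv_filepath)), year
    )


# Made-up sample data written to a temporary file so the sketch can run anywhere
with tempfile.TemporaryDirectory() as temp_dir:
    sample_file = Path(temp_dir) / "artworks.csv"
    sample_file.write_text("Sunrise,1950\nHarbour,1962\nNight Bus,1950\n")
    found = artworks_from_year(str(sample_file), 1950)

print(found)
```

Notice that `artworks_from_year` contains no logic of its own. It only composes the other functions, which is the same role `analyze_datatype` plays in the template.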

### One Task Per Function ("OTPF")

The functional programming style generally sticks to the central precept of **one task per function**. This is not just some arbitrary "law of programming". It is profoundly good advice intended to make your life as easy as possible in the following ways:

1. OTPF focuses your mind on a single task. It is much easier to design and write a function that does one thing. It prevents you from getting "lost in the weeds" as you try to do too much at once.
2. Code reuse becomes more likely if your functions focus on a single task. OTPF tends to make your function more "general". The more "specialized" your function gets by bundling more tasks into the function, the less useful the function becomes for a different purpose. **In some ways, code reuse is the "holy grail" for programmers:** imagine sitting down to write a program only to realize you have already written 80% of the components in other programs! You just need to hook them together!
3. Debugging and maintenance become MUCH easier in a function that only does one thing.
4. In the programming world, it is widely recognized that "code is more often read than it is written". Code is only written ONCE! But, you (or someone else) may need to go back and review what you have written several times if you are debugging it or adding more features. If your function called `plot_artworks()` plots the artworks AND also prints the results to the screen AND also saves the results in a file, you may have a hard time finding the "function responsible" for saving faulty files because all you see in front of you is `plot_artworks()`.
5. Testing. Testing. Testing. It is MUCH easier and faster to write tests and anticipate "edge cases" in functions that only do one thing.
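
As a rough illustration of the list above, here is a sketch of how the "do everything" scenario from item 4 could be split into single-task functions. All of the function names and the sample years are hypothetical. The point is only that counting, formatting, and saving each get their own function.

```python
from typing import Dict, List


def count_by_year(years: List[int]) -> Dict[int, int]:
    """Returns a dict mapping each year in 'years' to how many times it occurs."""
    counts: Dict[int, int] = {}
    for year in years:
        counts[year] = counts.get(year, 0) + 1
    return counts


def format_counts(counts: Dict[int, int]) -> List[str]:
    """Returns one 'year: count' line of text per entry in 'counts', sorted by year."""
    return [f"{year}: {count}" for year, count in sorted(counts.items())]


def save_lines(lines: List[str], filepath: str) -> None:
    """Returns None. Writes each item in 'lines' to 'filepath' on its own line."""
    with open(filepath, mode="w") as file:
        file.write("\n".join(lines))


# The counting and formatting tasks can be checked without touching the file system
report_lines = format_counts(count_by_year([1950, 1962, 1950]))
print(report_lines)
```

If a saved file ever comes out wrong, there is now exactly one place to look: `save_lines`.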

## New Python Skills Needed For This Lesson

1. Using `pathlib.Path` to represent file paths instead of using `str`
2. Reading and writing files with `open(...)`
3. Using a "context manager", `with ... as ...:`
4. Using `csv` to read CSV data
5. Plotting data with `plotly.express`

**Items 1-4** on this list are contained in the following code snippet demonstrating the process of accessing files on your system:

```python
import pathlib
import csv

home_dir = pathlib.Path(r"C:\Users\cferster") # 1. Using pathlib to represent file paths
csv_file = home_dir / "Notebooks" / "RJC Python Course" / "test-factored.csv"

file_data = []
with open(csv_file, mode="r") as file: # 2. Using `open()`; # 3. Using a context manager
    for line in csv.reader(file, delimiter=","):   # 4. Using csv.reader
        file_data.append(line)
```
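
The snippet above only reads a file. Writing follows the same pattern. Here is a minimal sketch, assuming a made-up file name and made-up rows, that writes a CSV file into a temporary directory (so no real files on your system are touched) and then reads it back:

```python
import csv
import pathlib
import tempfile

rows = [["name", "length"], ["B1", "4.5"], ["B2", "6.0"]]  # made-up sample data

with tempfile.TemporaryDirectory() as temp_dir:
    csv_file = pathlib.Path(temp_dir) / "sample.csv"  # hypothetical file name

    # mode="w" creates (or overwrites) the file; newline="" lets csv manage line endings
    with open(csv_file, mode="w", newline="") as file:
        csv.writer(file).writerows(rows)

    # Read the file back to confirm the round trip
    file_data = []
    with open(csv_file, mode="r", newline="") as file:
        for line in csv.reader(file, delimiter=","):
            file_data.append(line)

print(file_data)
```

Note that `csv.reader` always gives you strings, so numeric fields such as `"4.5"` still need to be converted with `float(...)` before you can do math with them.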

While I started writing this lesson with detailed explanations of each of these items, they quickly bogged down the lesson. So, for a more detailed explanation of what each of these steps is doing, please see [this supplementary notebook](./reading_csv_files.ipynb).


#### 5. Using `plotly.express` to plot data

In Python, there are MANY excellent graphing and plotting libraries available. Each takes a bit of time to learn. Here are the names of some of the most well-known and most used (in alphabetical order):
* [altair](https://altair-viz.github.io/)
* [bokeh](https://bokeh.org/)
* [matplotlib](https://matplotlib.org/stable/index.html)
* [plotly](https://plotly.com/python/)

Personally, I often use **plotly** because of how flexible it is and because I often find myself using 3D plots (altair and bokeh are 2D only). Altair is very easy to use and has a nice "API" (application programming interface) but plotly has also developed a simple-to-use interface called **plotly express**, which is what we will be using.

First, install plotly:
```python
pip install plotly
```

Now, use the CSV data that we read in earlier

Types of plots you can make with plotly express:

* **Basics:** scatter, line, area, bar, funnel
* **Part-of-Whole:** pie, sunburst, treemap, funnel_area
* **1D Distributions:** histogram, box, violin, strip
* **2D Distributions:** density_heatmap, density_contour
* **Matrix Input:** imshow
* **3-Dimensional:** scatter_3d, line_3d
* **Multidimensional:** scatter_matrix, parallel_coordinates, parallel_categories
* **Tile Maps:** scatter_mapbox, line_mapbox, choropleth_mapbox, density_mapbox
* **Outline Maps:** scatter_geo, line_geo, choropleth
* **Polar Charts:** scatter_polar, line_polar, bar_polar
* **Ternary Charts:** scatter_ternary, line_ternary

#### Basic usage: scatter, line, bar, histogram

**`px.scatter`**
```python
import plotly.express as px

x_values = [1, 2, 3, 4, 5]
y_values = [1, 4, 9, 16, 25]

px.scatter(x=x_values, y=y_values)
```

**`px.line`**

```python
import plotly.express as px

x_values = [1, 2, 3, 4, 5]
y_values = [1, 4, 9, 16, 25]

px.line(x=x_values, y=y_values)
```


**`px.bar`**
```python
import plotly.express as px

x_values = [1, 2, 3, 4, 5]
y_values = [1, 4, 9, 16, 25]

px.bar(x=x_values, y=y_values)
```


**`px.histogram`**
```python
import plotly.express as px

x_values = ["cat", "bat", "cat", "bat", "bat", "hat"]

px.histogram(x=x_values) # Histogram will put the count of each item in the y-axis automatically
```
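
To see what the histogram is counting, the same tally can be computed by hand with `collections.Counter` from the standard library. This is only a sketch for checking your understanding of the counts; it does not reproduce how plotly works internally.

```python
from collections import Counter

x_values = ["cat", "bat", "cat", "bat", "bat", "hat"]

counts = Counter(x_values)        # tallies how many times each item occurs
categories = list(counts.keys())  # items in the order they first appear
totals = list(counts.values())    # the count for each item

print(categories)
print(totals)

# These two lists could then be passed to px.bar(x=categories, y=totals)
```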

# Writing an Analysis Program, by example



## Analysis program template

```python
from typing import NamedTuple, List
from dataclasses import dataclass
import plotly.express as px


# Data Definitions
DataType = ...

# interp. A datatype

# Examples
DT1 = ...

def read_data(csv_filepath: str) -> List[list]:   # STEP 1
    """
    Returns a list representing the lines of data in the 'csv_filepath'
    """
    ...
    
start_testing()
expect(read_data(...), ...)
summary()
    

def records_to_datatype_list(data_records: list) -> List[DataType]:    # STEP 2B
    """
    Returns a list of DataType created from each line of
    data in 'data_records'.
    """
    ...
    
start_testing()
expect(records_to_datatype_list(...), ...)
summary()

    
def record_to_datatype(record: list) -> DataType:    # STEP 2A
    """
    Returns a DataType object representing the pertinent data
    that needs to be retrieved from 'record'. A helper function
    to records_to_datatype_list(...).
    """
    ...
    
start_testing()
expect(record_to_datatype(...), ...)
summary()

    
def filter_datatype_by_param(lodt: List[DataType], param: ...) -> List[DataType]:    # STEP 3
    """
    Returns a list of DataType where each DataType.param is equal to 'param'.
    """
    ...
    

start_testing()
expect(filter_datatype_by_param(...), ...)
summary()

def plot_datatypes(lodt: List[DataType]) -> None:    # STEP 4
    """
    Returns None. Displays a plot of number of artworks on the y-axis and the 
    year the artwork was created on the x-axis.
    """
    ...
    
# Visual test on plot_datatypes with short sample data    


def analyze_datatype(csv_filepath: str, param: ...) -> None:     # STEP 5
    """
    Returns None. Plots the DataType records in 'csv_filepath' whose
    DataType.param is equal to 'param'.
    """
    return (
        plot_datatypes( # This is function "composition" here...
            filter_datatype_by_param(
                records_to_datatype_list(
                    read_data(csv_filepath)
                ), param
            )
        )
    )
```

## Complete examples!

1. [UFO Sightings Example](./L07_example1.ipynb), from NUFORC scraped and cleaned by Sigmund Axel (https://github.com/planetsig/ufo-reports)
2. [Artworks in the New York Museum of Modern Art](./L07_example2.ipynb), from the MoMA Github page (https://github.com/MuseumofModernArt/collection)

### A Note on different kinds of text-based file formats

The following is a list of text-based (human readable) file formats you may encounter. Python has built-in ways for handling most of them.

* **CSV/TSV file**: Comma-separated values or Tab-separated values. These are essentially the same format: tabular-type data with each _record_ of data on a new line and the _fields_ for each record are separated by some character (commonly a ',' or a tab '    '). **Python has the `csv` module in the standard library to read CSV files**. Many software programs will export a CSV file.
* **JSON file**: JavaScript Object Notation is a common type of file for data transmitted on the internet and is also a convenient file type for storing data that may not fit neatly into a table, such as nested data. JSON looks very much like a Python dictionary with data in it, with a few small differences (for example, JSON writes `true`, `false`, and `null` where Python writes `True`, `False`, and `None`). **Python has the `json` module in the standard library to read JSON files**. Jupyter Notebook .ipynb files are JSON files.
* **XML file**: eXtensible Markup Language is a file format that looks like HTML but instead of the tags representing document formatting, they represent data fields. Like JSON, it is a convenient format for storing data that may not fit neatly into a table. **Python has the `xml` module in the standard library to read XML files**. Decon files are XML files.
* **Mixed formats**: Some programs create text-based files in a mix of formats. For example, spColumn .cti files are kind of like chunks of CSV data mixed with custom headings and metadata (data about the data). SConcrete .sco files are most similar to JSON files but with terrible formatting that makes it difficult for a computer to read without a customized program to read it.
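
As a quick sketch of the `json` and `xml` modules mentioned above, the following reads the same made-up piece of nested data from a JSON string and from an XML string. The "beam" data and its field names are invented for illustration. Real files would be opened with `open(...)` and a context manager in the same way as the CSV example earlier.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample data describing the same beam in two formats
json_text = '{"beam": {"name": "B1", "spans": [4.5, 6.0]}}'
xml_text = "<beam name='B1'><span>4.5</span><span>6.0</span></beam>"

# json.loads turns JSON text into Python dicts, lists, strings, and numbers
beam = json.loads(json_text)["beam"]
json_spans = beam["spans"]

# ET.fromstring parses XML text into a tree of Element objects
root = ET.fromstring(xml_text)
xml_spans = [float(span.text) for span in root.findall("span")]

print(beam["name"], json_spans)
print(root.get("name"), xml_spans)
```

JSON numbers arrive as Python numbers, while XML text always arrives as strings, which is why the XML version needs the `float(...)` conversion.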