# Lesson 7 - Writing Analysis Programs

## Example 2: Artworks held in the MoMA

### 1. Create the `read_data(...)` function

Put the above lines of code into a function and write some tests. Test data can be copy-pasted out of the .csv file using Excel or a similar tool (this part is a slight pain in the butt because you will probably need to put quotation marks around the strings).
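One way to avoid the hand-quoting is to let Python print the row for you: the `repr` of a list of strings is already a valid list literal that can be pasted into a test. This is only a sketch. The helper name `row_as_literal` is made up for illustration, and it is demonstrated on a tiny in-memory CSV with invented data rather than on the real file.

```python
import csv
import io
from typing import List

def row_as_literal(rows: List[List[str]], index: int) -> str:
    """
    Returns the row at 'index' formatted as a Python list literal,
    ready to paste into a test.
    """
    return repr(rows[index])

# A tiny in-memory stand-in for a CSV file (invented data, not from Artworks.csv)
sample_file = io.StringIO('Title,Artist,Date\n"Untitled, No. 1",Jane Doe,1950\n')
sample_rows = list(csv.reader(sample_file))[1:]  # skip the header row

print(row_as_literal(sample_rows, 0))  # prints ['Untitled, No. 1', 'Jane Doe', '1950']
```

Note that the quoted comma inside the title survives as part of one field, which is exactly the kind of detail that is easy to get wrong when copying by hand.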

```python
from typing import List
import csv
from cs103 import *

def read_data(csv_file_path: str) -> List[List[str]]:
    """
    Returns a list of lists, one for each line in the file at 'csv_file_path',
    with the header row left out.
    """
    with open(csv_file_path, encoding="utf-8") as file:
        csv_data = list(csv.reader(file, delimiter=","))
    csv_data_without_header_row = csv_data[1:]
    return csv_data_without_header_row

record_45500 = ['SELF-PORTRAIT','Ralph Gibson','2146','(American, born 1939)','(American)','(1939)','(0)','(Male)','1974','Gelatin silver print','12 1/2 × 8 3/8" (31.9 × 21.4 cm)','Gift of the artist','196.1975','Photograph','Photography','1975-05-14','N','48249','','','','','','31.9','','','21.4','','']
record_28850 = ['Dva stikhotvoreniia','Russian Book Collection','23323','','()','(0)','(0)','()','1924','','Page (irreg.): 6 1/4 x 4 1/2" (15.8 x 11.4 cm)','Gift of The Judith Rothschild Foundation (Boris Kerdimun Archive)','1031.2001','Illustrated Book','Drawings & Prints','2001-01-24','N','30140','','','','','','15.8','','','11.4','','']

start_testing()
expect(read_data("Artworks.csv")[45500], record_45500)
expect(read_data("Artworks.csv")[28850], record_28850)
summary()
```

2 of 2 tests passed


### 2. Create a data definition to represent a record of data

In this case, I will create a data definition, `ArtWork`, to represent a single piece of artwork.

```python
from typing import NamedTuple, Optional

class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: Optional[int]
    medium: str


# Interp. Represents a single piece of artwork
# title: Title of the artwork
# creator: Name of the artist
# year_created: an int representing the year the artwork was created, None if 'unknown' or missing
# medium: what kind of artwork it is (sculpture, painting, drawing, etc.)

# Examples
AWeg_1 = ArtWork("Guernica", "Pablo Picasso", 1937, "Painting")
AWeg_2 = ArtWork("Drunk With God", "Gilbert & George", 1983, "Mixed Media")
```
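As a quick illustration of what the data definition buys us, the sketch below repeats the `ArtWork` definition so that it runs on its own, then reads fields by name. `AWeg_3` is a hypothetical example (not from the data set) showing how an unknown year would be represented.

```python
from typing import NamedTuple, Optional

class ArtWork(NamedTuple):
    title: str
    creator: str
    year_created: Optional[int]
    medium: str

AWeg_1 = ArtWork("Guernica", "Pablo Picasso", 1937, "Painting")
# Hypothetical example (not from the data set) with an unknown year
AWeg_3 = ArtWork("Untitled", "Unknown", None, "Drawing")

print(AWeg_1.creator)               # fields are read by name...
print(AWeg_1[1])                    # ...but positional access still works
print(AWeg_3.year_created is None)  # a missing year is represented by None
```

Reading `aw.creator` is much easier to follow than remembering that the artist lives at index 1 of a raw record.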

Next, let's write a function to convert one line of the csv file into an `ArtWork`, so that we end up with a list of `ArtWork` objects instead of a list of lists.

### 3. Create the `records_to_datatype_list(...)` function

Now that we have loaded the data from the .csv file, we have a list of lists: each sub-list represents one line in the file, and each item in a sub-list represents one field of one record. Let's convert each record (each line) into our meaningful data type.

#### Helper functions

Often, the data that you need to process will have small inconsistencies, such as missing data in some fields or data that is in a format different from the rest of the data. 

For example, in the `Artworks.csv` data, the "Date" field is not always just a year like `"1926"`. Sometimes it is a range like `"1934-1935"`, and other times it looks like `"June-August 1945"`.
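To see why this matters, here is a small sketch that checks which of those sample strings a plain `int(...)` call can handle. The helper name `parses_as_year` is made up for this illustration.

```python
from typing import List

def parses_as_year(text: str) -> bool:
    """
    Returns True if int() can convert 'text' directly, False otherwise.
    """
    try:
        int(text)
        return True
    except ValueError:
        return False

# The date formats mentioned above, plus an empty field
samples: List[str] = ["1926", "1934-1935", "June-August 1945", ""]

for sample in samples:
    print(repr(sample), parses_as_year(sample))
```

Only `"1926"` converts directly; every other format needs extra handling before it can become an `int`.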

The function used to convert raw records into your data definition should be able to handle these inconsistencies so that, after your function has run, you have _clean data_ in your data definition to work on in the next steps.

For this, you may need more than one helper function to convert a single record of data into an instance of your data type.

```python
def csv_record_to_artwork(record: List[str]) -> ArtWork:
    """
    Returns an ArtWork object representing the data in 'record'.
    """
    title = record[0]
    artist = record[1]
    created = convert_date_field_to_int(record[8]) # Will need to write a helper function to make this conversion work
    medium = record[13]
    return ArtWork(title=title, creator=artist, year_created=created, medium=medium)


# Helper functions come after the function that they are helping in
def convert_date_field_to_int(date_string: str) -> Optional[int]:
    """
    Returns an int representing the year in YYYY format that may be contained
    within 'date_string', or None if 'date_string' is empty or no year is found.
    
    If 'date_string' is not the year in YYYY format already, then the function assumes
    that the relevant year will be located at the end of 'date_string'. This means that
    the year the artwork was created is assumed to be the year it was finished.
    """
    date_as_int = None
    if not date_string: # Test for '', empty string evaluates to False
        return None # A missing date is represented by None, not by ''
    try:
        date_as_int = int(date_string) # Assumes "1927" format
    except ValueError:
        try:
            date_as_int = int(date_string.split()[-1]) # Assumes "June-July 1945" format
        except ValueError:
            try:
                date_as_int = int(date_string.split("-")[-1]) # Assumes "1987-1988" format
            except ValueError:
                return date_as_int
    return date_as_int

## Testing section

# From looking at the data a little bit, I saw that there were these date formats possible...
date_0 = "1927"
date_1 = "June-July 1945"
date_2 = "1987-1988"

start_testing()
# Test convert_date_field...
expect(convert_date_field_to_int(date_0), 1927)
expect(convert_date_field_to_int(date_1), 1945)
expect(convert_date_field_to_int(date_2), 1988)

AW_1 = ArtWork('SELF-PORTRAIT','Ralph Gibson', 1974, 'Photograph')
AW_2 = ArtWork('Dva stikhotvoreniia', 'Russian Book Collection', 1924, 'Illustrated Book')

# Test csv_record_to_artwork...
expect(csv_record_to_artwork(record_45500), AW_1)
expect(csv_record_to_artwork(record_28850), AW_2)
summary()
```

5 of 5 tests passed


### Process the list of records into a list of `ArtWork`

Now that I have a function to convert a single record into an `ArtWork`, I am going to write a function that will process all of the csv data into a list of `ArtWork`.

```python
def csv_data_to_artworks(csv_data: List[List[str]]) -> List[ArtWork]:
    """
    Returns a list of ArtWork objects representing the cleaned data from each 
    record contained within 'csv_data'.
    """
    acc = []
    for record in csv_data:
        acc.append(csv_record_to_artwork(record))
    return acc

# Using the data from my previous tests, above, I will write a quick test for this function

LOR0 = []
LOR1 = [record_45500]
LOR2 = [record_45500, record_28850]

AW1 = csv_record_to_artwork(record_45500)
AW2 = csv_record_to_artwork(record_28850)

LOAW0 = []
LOAW1 = [AW1]
LOAW2 = [AW1, AW2]
LOAW3 = [AW1, AW2, AW1]

start_testing()
expect(csv_data_to_artworks(LOR0), LOAW0)
expect(csv_data_to_artworks(LOR1), LOAW1)
expect(csv_data_to_artworks(LOR2), LOAW2)
summary()
```

3 of 3 tests passed


### 4. Create the `filter_by_datatype_by_param(...)` function

Because we have created a data definition, filtering the data we have to answer our research question becomes almost trivial.

Let's say we want to filter our data by `.medium`:

```python
def filter_artworks_by_medium(loaw: List[ArtWork], medium: str) -> List[ArtWork]:
    """
    Returns a list of ArtWork filtered by medium. If the .medium attribute
    matches 'medium', then those ArtWork objects will be returned in the list.
    """
    acc = []
    for aw in loaw:
        if aw.medium.lower() == medium.lower(): # Compare without regard to case
            acc.append(aw)
    return acc

# Tests

start_testing()
expect(filter_artworks_by_medium(LOAW0, "Photograph"), [])
expect(filter_artworks_by_medium(LOAW1, "Photograph"), LOAW1)
expect(filter_artworks_by_medium(LOAW2, "Photograph"), LOAW1)
expect(filter_artworks_by_medium(LOAW3, "photograph"), [AW1, AW1])
expect(filter_artworks_by_medium(LOAW3, "photography"), [])
expect(filter_artworks_by_medium(LOAW3, "illustrated book"), [AW2])
summary()
```

And then we want to filter it by `.year_created`:

```python
def filter_artworks_by_year_range(loaw: List[ArtWork], start_year: int, end_year: int) -> List[ArtWork]:
    """
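    Returns a list of ArtWork filtered by .year_created. If the .year_created attribute
    falls between 'start_year' and 'end_year' (inclusive), then those ArtWork objects
    will be returned in the list. ArtWork with an unknown year (None) are left out.
    """
    acc = []
    for aw in loaw:
        # Comparing None with an int raises a TypeError, so check for None first
        if aw.year_created is not None and start_year <= aw.year_created <= end_year: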
            acc.append(aw)
    return acc

# Tests
# A reminder of what is in AW1 and AW2:
# AW1 = ArtWork('SELF-PORTRAIT', 'Ralph Gibson', 1974, 'Photograph')
# AW2 = ArtWork('Dva stikhotvoreniia', 'Russian Book Collection', 1924, 'Illustrated Book')
# LOAW0 = []
# LOAW1 = [AW1]
# LOAW2 = [AW1, AW2]

start_testing()
expect(filter_artworks_by_year_range(LOAW0, 1900, 2000), [])
expect(filter_artworks_by_year_range(LOAW1, 1960, 1980), LOAW1)
expect(filter_artworks_by_year_range(LOAW2, 1924, 1924), [AW2])
expect(filter_artworks_by_year_range(LOAW3, 1970, 1980), [AW1, AW1])
expect(filter_artworks_by_year_range(LOAW3, 1980, 1990), [])
expect(filter_artworks_by_year_range(LOAW3, 1920, 1930), [AW2])
summary()
```

6 of 6 tests passed
6 of 6 tests passed


### 5. Create the `plot_datatype(...)` function

To use Plotly Express, you will need to collect the x-data and y-data into separate lists. For a histogram, only the x-data is needed, because the counts on the y-axis are computed for us.

```python
import plotly.express as px

def plot_artworks_by_year(loaw: List[ArtWork]) -> None:
    """
    Returns None. Plots the list of ArtWorks with the following axes:
    x = ArtWork.year_created
    y = A count of the number of ArtWorks
    """
    list_of_years = []
    for aw in loaw:
        list_of_years.append(aw.year_created)
        
    # px.histogram automatically counts the frequency of years
    # This saves us from having to create separate y-axis data
    # by manually counting up the number of occurrences of the years in the list
    plot = px.histogram(x=list_of_years) 
    
    # display() is a special function that only works in Jupyter
    # It's just like print() except it allows for rendering of rich media
    display(plot) 
```
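To see what the histogram is counting on our behalf, here is a minimal sketch of the manual tally it replaces, using `collections.Counter` from the standard library. The function name `count_artworks_per_year` and the sample years are made up for illustration, and skipping unknown years is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, List, Optional

def count_artworks_per_year(years: List[Optional[int]]) -> Dict[int, int]:
    """
    Returns a dict mapping each year in 'years' to how often it occurs.
    Unknown years (None) are skipped.
    """
    known_years = [year for year in years if year is not None]
    return dict(Counter(known_years))

# Invented sample: two artworks from 1974, one from 1924, one with no year
sample_years = [1974, 1924, 1974, None]
print(count_artworks_per_year(sample_years))
```

The keys of this dict would be the x-data and the values the y-data if we were building a bar chart by hand.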
    


### 6. Create the `analyze_data(...)` function

Because our `analyze_artworks()` function is going to be _displaying_ a plot instead of returning a value, we cannot implement a test suite for this function as we did our previous functions. To test it, we will do a visual test only.

Create a test CSV file that is a much shortened version of your data file, with only three rows or so: just enough that you can quickly see whether what is being plotted matches your expectations.
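Rather than trimming the file by hand, a short script can copy the header and the first few records into a new file. This is a sketch only: the function name `make_sample_csv` and the output file name used below are hypothetical.

```python
import csv
from itertools import islice

def make_sample_csv(source_path: str, sample_path: str, data_rows: int) -> None:
    """
    Copies the header row and the first 'data_rows' records from the file at
    'source_path' into a new file at 'sample_path'.
    """
    with open(source_path, encoding="utf-8", newline="") as source:
        with open(sample_path, "w", encoding="utf-8", newline="") as sample:
            writer = csv.writer(sample)
            # islice stops after the header row plus 'data_rows' records
            for row in islice(csv.reader(source), data_rows + 1):
                writer.writerow(row)
```

For example, `make_sample_csv("Artworks.csv", "Artworks_sample.csv", 3)` would produce a four-line file (header plus three records) that can be passed to the analysis function for a visual check.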

```python
def analyze_artworks(csv_filepath: str, medium: str, start_year: int, end_year: int) -> None:
    """
    Plots a histogram, by year created, of the ArtWork records in 'csv_filepath'
    that match 'medium' and were created between 'start_year' and 'end_year'
    (inclusive). Returns None.
    """
    return plot_artworks_by_year(filter_artworks_by_year_range(filter_artworks_by_medium(csv_data_to_artworks(read_data(csv_filepath)), medium), start_year, end_year))
```
Trying to fit this all onto one line is really hard to read. Try breaking it up. 

In Python, there are a couple of ways of breaking a long expression across lines. By far, the preferred way is to put the expression in parentheses, `()`, like a math statement. In Python, anything inside parentheses, square brackets, or braces can be broken up across lines however you want.
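As a tiny stand-alone illustration (the values are arbitrary and chosen only for this sketch), the two expressions below are identical as far as Python is concerned:

```python
# One long line...
total_one_line = sum([1924, 1974]) + len("Photograph") + len("Illustrated Book")

# ...and the same expression wrapped in parentheses and split across lines
total_wrapped = (
    sum([1924, 1974])
    + len("Photograph")
    + len("Illustrated Book")
)

print(total_one_line == total_wrapped)  # prints True
```

The same idea applies to nested function calls, since the arguments of a call are already inside parentheses.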

```python
def analyze_artworks(csv_filepath: str, medium: str, start_year: int, end_year: int) -> None:
    """
    Plots a histogram, by year created, of the ArtWork records in 'csv_filepath'
    that match 'medium' and were created between 'start_year' and 'end_year'
    (inclusive). Returns None.
    """
    plot_artworks_by_year(
        filter_artworks_by_year_range(
            filter_artworks_by_medium(
                csv_data_to_artworks(
                    read_data(csv_filepath)
                ), medium
            ), start_year, end_year
        )
    )
```