# `tcollections` for Data Analysis

This example document demonstrates how `tcollections` can be used in data analysis projects. My motivations and the design philosophy behind the approach are covered in two blog posts, one on [patterns for dataclasses](https://devinjcornell.com/post/dsp0_patterns_for_dataclasses.html) and one on [data collection types](https://devinjcornell.com/post/dsp1_data_collection_types.html).

The paradigm presented here involves the following steps.

1. Define a dataclass to represent a "row" in your dataset. Define attributes and validation as needed.
2. Define a collection type to hold those rows and to carry your data transformations.
3. Build transformation methods on that collection type.
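
Step 1 mentions validation, which the example below does not include. As a minimal sketch of one way to do it, the standard library's `__post_init__` hook can reject bad rows at construction time. The class name `ValidatedPerson` and the specific checks are hypothetical and are not part of `tcollections`.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ValidatedPerson:
    """Hypothetical row type; the name and the checks are illustrative only."""
    name: str
    age: int
    city: str

    def __post_init__(self) -> None:
        # __post_init__ runs right after the generated __init__, so an
        # invalid row raises before it can enter any collection.
        if not self.name:
            raise ValueError("name must be non-empty")
        if self.age < 0:
            raise ValueError(f"age must be non-negative, got {self.age}")


# A valid row is constructed as usual.
alice = ValidatedPerson("Alice", 30, "New York")

# An invalid row raises ValueError instead of silently entering the data.
try:
    ValidatedPerson("Bob", -1, "Chicago")
except ValueError as error:
    print(f"rejected: {error}")
```

Because the checks only read attributes, they work unchanged with `frozen=True`.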

In [15]:
import dataclasses
import sys

# Add the local source directory to the import path so tcollections can be imported.
sys.path.append('../src')
import tcollections
from tcollections import tlist, group

The first step is to define the atomic unit of analysis (here, a person) and a collection type that inherits from `tlist`. Inheriting from `tlist` gives the collection class access to the grouping, aggregating, and filtering methods that are common in data science projects.

In [16]:
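@dataclasses.dataclass(frozen=True)
class Person:
    """One row of the dataset."""
    name: str
    age: int
    city: str

class PersonList(tcollections.tlist[Person]):
    """Collection of Person rows with analysis-specific transformations."""

    def filter_by_city(self, city: str) -> 'PersonList':
        # Keep only the people who live in the given city.
        return self.filter(lambda person: person.city == city)

    def average_age(self) -> float:
        # Mean age of everyone in the collection (divides by zero if it is empty).
        return sum(self.map(lambda person: person.age)) / len(self)

    def group_by_city(self) -> dict[str, 'PersonList']:
        # Map each city to the people who live there.
        return self.group.by(lambda person: person.city)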

In [17]:
plist = PersonList([
    Person("Alice", 30, "New York"),
    Person("Bob", 25, "Los Angeles"),
    Person("Charlie", 35, "New York"),
    Person("David", 40, "Chicago"),
    Person("Eve", 28, "Los Angeles"),
])
plist

[Person(name='Alice', age=30, city='New York'),
 Person(name='Bob', age=25, city='Los Angeles'),
 Person(name='Charlie', age=35, city='New York'),
 Person(name='David', age=40, city='Chicago'),
 Person(name='Eve', age=28, city='Los Angeles')]

In [24]:
plist.map(lambda person: person.city).value_counts()

Counter({'New York': 2, 'Los Angeles': 2, 'Chicago': 1})
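
The result printed above is a `Counter`, so `value_counts` behaves much like a tally with the standard library's `collections.Counter`. As a rough point of comparison only, and not a claim about how `tcollections` implements it, the same tally in plain Python could look like this:

```python
from collections import Counter

# The city of each example row, in the same order as the rows above.
cities = ["New York", "Los Angeles", "New York", "Chicago", "Los Angeles"]

# Counter tallies how many times each distinct value appears.
city_counts = Counter(cities)
print(city_counts)
# Counter({'New York': 2, 'Los Angeles': 2, 'Chicago': 1})

# most_common(n) returns the n most frequent values with their counts.
print(city_counts.most_common(1))
# [('New York', 2)]
```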

In [18]:
plist.filter(lambda person: person.city == "New York")

[Person(name='Alice', age=30, city='New York'),
 Person(name='Charlie', age=35, city='New York')]

The custom `filter_by_city` method defined on `PersonList` returns the same rows.

In [19]:
plist.filter_by_city("New York")

[Person(name='Alice', age=30, city='New York'),
 Person(name='Charlie', age=35, city='New York')]

The average age can be computed in three equivalent ways: with `map` followed by `agg`, with `reduce`, or with the custom `average_age` method.

In [26]:
plist.map(lambda person: person.age).agg(lambda ages: sum(ages) / len(ages))

31.6

In [27]:
plist.reduce(lambda acc, person: acc + person.age, 0) / len(plist)

31.6

In [21]:
plist.average_age()

31.6
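
`PersonList` also defines `group_by_city`, but its output is not shown in this document. To illustrate what grouping followed by aggregation amounts to, here is a plain-Python sketch that computes the average age per city for the same rows. It uses only the standard library and does not call `tcollections`, so it makes no assumption about what `group.by` returns.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Person:
    name: str
    age: int
    city: str


people = [
    Person("Alice", 30, "New York"),
    Person("Bob", 25, "Los Angeles"),
    Person("Charlie", 35, "New York"),
    Person("David", 40, "Chicago"),
    Person("Eve", 28, "Los Angeles"),
]

# Group: collect the ages of everyone who shares a city.
ages_by_city: dict[str, list[int]] = defaultdict(list)
for person in people:
    ages_by_city[person.city].append(person.age)

# Aggregate: reduce each group to its mean age.
average_age_by_city = {city: fmean(ages) for city, ages in ages_by_city.items()}
print(average_age_by_city)
# {'New York': 32.5, 'Los Angeles': 26.5, 'Chicago': 40.0}
```

Wrapping this kind of logic in a method on the collection type, as `PersonList` does, keeps the grouping key and the aggregation next to the data they describe.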