# Task A: DataFrames in plain Python

At Marshall Wace, we make decisions based on data, so we spend a lot of time with tools for analysing it. One of the building blocks of data engineering is the DataFrame, a tabular structure that organises data into rows and columns. Many datasets fit naturally into this 2d structure, which makes DataFrames incredibly useful for manipulation, analysis, and visualisation.

Many libraries implement this functionality, and Pandas is one of the most commonly used. For today's task, we will ask you to reimplement DataFrames in plain Python code, thinking about correctness, elegant design, and performance. We'll aim for a minimal implementation that supports the basic operations required for data manipulation, but there is plenty of scope for extensions and optimisations!

Don't worry if you don't know Pandas or feel unsure about programming in general!

## Requirements
A `DataFrame` is a 2d tabular data structure. It can be thought of as a collection of named columns, which can be naturally represented by a dictionary of names to `Series`.

A `Series` is a 1d collection which can be likened to a list or a vector of elements. Comparison and logical operators on `Series` are what make `DataFrame`s so powerful. One thing that frustrates many data engineers using Pandas is how poorly it can handle type safety and `None` values. Thus, we encourage you to build your solution with those in mind.

There is nothing more frustrating than wanting to quickly analyse some data and iterate on an approach, but having to wait a couple of minutes every time you run your script. Hence, efficiency is another aspect you'd ideally consider, either through explicit optimisations or through comments marking 'hot spots', the parts of the code where the program spends most of its time.
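One way to find such hot spots is to time candidate implementations against each other. The snippet below is a minimal sketch using the standard library's `timeit` module; the two helper functions (`elementwise_eq_loop` and `elementwise_eq_zip`) are hypothetical names chosen for illustration and are not part of the task. Timings vary by machine, so treat the numbers as relative rather than absolute.

```python
import timeit


def elementwise_eq_loop(left, right):
    """Compare two equal-length lists element by element with an explicit index loop."""
    result = []
    for index in range(len(left)):
        result.append(left[index] == right[index])
    return result


def elementwise_eq_zip(left, right):
    """Same comparison, written with zip and a list comprehension."""
    return [a == b for a, b in zip(left, right)]


left = list(range(10_000))
right = list(range(10_000))

# Run each candidate 20 times and report the total wall-clock time.
loop_seconds = timeit.timeit(lambda: elementwise_eq_loop(left, right), number=20)
zip_seconds = timeit.timeit(lambda: elementwise_eq_zip(left, right), number=20)
print(f"loop: {loop_seconds:.4f}s, zip: {zip_seconds:.4f}s")
```

For a whole script rather than a single function, the standard library's `cProfile` module gives a per-function breakdown.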

## Spec
### Series
We want you to implement `Series` for `str`s, `bool`s, `int`s, and `float`s. This should give us a good range of functionality while keeping the implementation reasonably simple. Feel free to do everything in one class, create a separate class for each, use inheritance, or whatever you think is best. In terms of functionality of each `Series`:
- you'll want to be able to initialise it with a list of elements, each of which can be of the given type or `None`
- each `Series` should be immutable and operations should return a new `Series` object
- you should be able to read the element at a given index, as well as the length of a `Series`
- you should be able to use equality operators (`==`, `!=`) which return a boolean `Series` with elements equal to the element-wise operator results
- for the `Series` pairs where it makes sense, you should implement element-wise boolean operators (`|`, `&`, `~`, `^`) and comparison operators (`<`, `>`, `<=`, `>=`) which also result in a boolean `Series`. Think carefully about how you want to handle `None`. Where appropriate, also add operators which work between a `Series` and a single scalar value
    - for instance `[1, 2, 3, 4] < 3` should return something like `[True, True, False, False]`
- for convenience, it would be nice to be able to print the `Series` nicely formatted
- (bonus) you can implement some aggregation methods which are commonly found in data analysis like `sum()`, `count()`, `mean()` or filtering capabilities
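
To make the comparison requirement concrete, here is a minimal sketch of a less-than operator on an integer series. Everything in it is an assumption made for illustration: the class name `IntSeries`, the choice to return a plain list instead of a boolean `Series` (to keep the snippet self-contained), and the policy that any comparison involving `None` yields `None`. Other `None` policies are equally defensible, and choosing one is part of the task.

```python
from typing import Optional


class IntSeries:
    """Hypothetical minimal integer series; None marks a missing value."""

    def __init__(self, items: list[Optional[int]]):
        for item in items:
            # bool is a subclass of int, so reject it explicitly.
            if item is not None and (isinstance(item, bool) or not isinstance(item, int)):
                raise TypeError(f"Expected int or None, got {type(item).__name__}")
        # Store a tuple so the series cannot be mutated through the input list.
        self._items = tuple(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Optional[int]:
        return self._items[index]

    def __lt__(self, other):
        # Assumed policy: a comparison involving None yields None ("unknown").
        if isinstance(other, IntSeries):
            if len(other) != len(self):
                raise ValueError("Series must have the same length")
            pairs = zip(self._items, other._items)
        elif isinstance(other, int) and not isinstance(other, bool):
            # Scalar case: compare every element against the same value.
            pairs = ((item, other) for item in self._items)
        else:
            return NotImplemented
        return [None if a is None or b is None else a < b for a, b in pairs]


print(IntSeries([1, 2, 3, 4]) < 3)                        # [True, True, False, False]
print(IntSeries([1, None, 3]) < IntSeries([2, 2, None]))  # [True, None, None]
```

Returning `NotImplemented` for unsupported operand types lets Python raise its usual `TypeError` instead of silently producing a result.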

### DataFrame
With a solid `Series` implementation, we can start working on the `DataFrame`s:
- you should be able to initialise a `DataFrame` with a dictionary of names and `Series`
- `DataFrame`s should be immutable and operators should return a new `DataFrame` as appropriate
- they should be indexable by a boolean `Series`, which allows you to write code such as `df[(df["name"] != "Joe") & (df["age"] > 21)]`
- for convenience, it would be nice to be able to pretty print the `DataFrame`s in a 2d form with middle rows and columns redacted for readability
- (bonus) you can implement some `DataFrame`-wide aggregation, filtering, pivoting, or any of the common operations which make sense for a 2d table. Get creative!
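
The boolean indexing described above boils down to keeping the rows whose mask entry is true. The following is a minimal sketch on plain dictionaries and lists, not on the `DataFrame` class itself; the function name `filter_rows` and the decision to drop rows whose mask value is `None` are assumptions made for illustration.

```python
def filter_rows(columns: dict, mask: list) -> dict:
    """Return new columns keeping only the rows where mask is True.

    Assumed policy: rows whose mask value is False or None are dropped.
    """
    for name, values in columns.items():
        if len(values) != len(mask):
            raise ValueError(f"Column '{name}' has {len(values)} rows, mask has {len(mask)}")
    # Build new lists so the input columns are left untouched.
    return {
        name: [value for value, keep in zip(values, mask) if keep is True]
        for name, values in columns.items()
    }


columns = {"name": ["Joe", "Ann", "Bob", None], "age": [30, 25, 19, 40]}

# Element-wise (name != "Joe") & (age > 21), with None treated as "unknown".
mask = [
    None if name is None or age is None else (name != "Joe" and age > 21)
    for name, age in zip(columns["name"], columns["age"])
]

print(filter_rows(columns, mask))  # {'name': ['Ann'], 'age': [25]}
```

In a full solution, the mask would come from the `Series` operators and the filtering would live in `DataFrame.__getitem__`, returning a new `DataFrame`.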


# A basic implementation

Below you can find a partial implementation of two series classes, `BooleanSeries` and `StringSeries`, which represent series of boolean and string values, respectively, along with a basic `DataFrame`. These classes offer basic functionality for creating and comparing series of data, just to give you a taste of what we're looking for. What we have now:

- `BooleanSeries`: Initialises a series of boolean values, with input validation
- `StringSeries`: Initialises a series of string values, with input validation
- Equality comparison (`__eq__`) between two `StringSeries` objects, returning a `BooleanSeries`
- Basic indexing for `StringSeries` using `__getitem__`
- String representation for both classes
- `DataFrame`: Initialises a dataframe of series, with input validation
- `from_csv`: Loads a CSV file into a `DataFrame`
- Basic column manipulation, such as indexing with `__getitem__` and adding columns with `add_column`
- Pretty printing of column-wise data with `__str__`

In [None]:
from typing import Union
import base64
import csv
import os

In [None]:
class BooleanSeries:
    _items: list[bool]

    def __init__(self, items: list[bool]):
        for item in items:
            if not isinstance(item, Union[bool, None]):
                raise ValueError(f"Item in Series is not of type Boolean or None, and instead is `{type(item)}`.")

        self._items = items

    def __str__(self):
        return self._items.__str__()

In [None]:
class StringSeries:
    def __init__(self, items: list[str]):
        for item in items:
            if not isinstance(item, Union[str, None]):
                raise ValueError(f"Item in Series is not of type String or None, and instead is `{type(item)}`.")

        self._items = items

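    def __eq__(self, other) -> BooleanSeries:
        if not isinstance(other, StringSeries):
            raise ValueError(f"Comparison of series on a different type is not allowed. Expected a StringSeries, got `{type(other)}` instead.")
        # Guard against an IndexError or a silently truncated comparison.
        if len(other._items) != len(self._items):
            raise ValueError("Comparison of series of different lengths is not allowed.")

        # Element-wise comparison; two None values compare as equal here.
        new_series = [item == other[index] for index, item in enumerate(self._items)]
        return BooleanSeries(items=new_series)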

    def __getitem__(self, item):
        return self._items[item]

    def __str__(self):
        return self._items.__str__()


In [None]:
string1 = StringSeries(items=["a", "b", "c", None])
string2 = StringSeries(items=["a", "b", "d", None])
result = (string1 == string2)
print(result)


In [None]:
class DataFrame:
    def __init__(self, data: dict):
        self._columns = {}
        for key, value in data.items():
            if isinstance(value, (BooleanSeries, StringSeries)):
                self._columns[key] = value
            else:
                raise ValueError(f"Column '{key}' must be a BooleanSeries or StringSeries")

    def __getitem__(self, key):
        return self._columns[key]

    def __setitem__(self, key, value):
        if isinstance(value, (BooleanSeries, StringSeries)):
            self._columns[key] = value
        else:
            raise ValueError(f"Column '{key}' must be a BooleanSeries or StringSeries")


    def __str__(self):
        if not self._columns:
            return "Empty DataFrame"
        col_names = list(self._columns.keys())
        # default=0 keeps max() from raising on a column with no rows
        col_widths = {col: max(len(col), max((len(str(item)) for item in self._columns[col]._items), default=0)) for col in col_names}
        header = "  ".join(col.ljust(col_widths[col]) for col in col_names)
        separator = "-" * len(header)
        rows = []
        for i in range(len(next(iter(self._columns.values()))._items)):
            row = "  ".join(str(self._columns[col]._items[i]).ljust(col_widths[col]) for col in col_names)
            rows.append(row)
        return "\n".join([header, separator] + rows)

    def add_column(self, name: str, series):
        if isinstance(series, (BooleanSeries, StringSeries)):
            self._columns[name] = series
        else:
            raise ValueError(f"Column '{name}' must be a BooleanSeries or StringSeries")

    def remove_column(self, name: str):
        if name in self._columns:
            del self._columns[name]
        else:
            raise KeyError(f"Column '{name}' not found in DataFrame")

    def get_column_names(self):
        return list(self._columns.keys())

    def get_column(self, name: str):
        return self._columns.get(name)

    @classmethod
    def from_csv(cls, file_path: str, delimiter: str = ',') -> 'DataFrame':
        # The sample data contains non-ASCII city names, so read it as UTF-8 explicitly.
        with open(file_path, 'r', newline='', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile, delimiter=delimiter)
            header = next(reader)
            columns = {col: [] for col in header}
            for row in reader:
                for i, value in enumerate(row):
                    columns[header[i]].append(value)
            data = {}
            for col, values in columns.items():
                if all(val.lower() in ('true', 'false', '', 'none') for val in values):
                    bool_values = [None if val.lower() in ('', 'none') else val.lower() == 'true' for val in values]
                    data[col] = BooleanSeries(bool_values)
                else:
                    str_values = [None if val == '' else val for val in values]
                    data[col] = StringSeries(str_values)

            return cls(data)

Creating example data

In [None]:
CSV_ENCODED_DATA = "Q291bnRyeSxDaXR5LElzIENhcGl0YWwKSXRhbHksVHVyaW4sRmFsc2UKSmFwYW4sS3lvdG8sVHJ1ZQpDYW5hZGEsVG9yb250byxUcnVlCkNhbmFkYSxNb250cmVhbCxUcnVlClNwYWluLFNldmlsbGUsVHJ1ZQpGcmFuY2UsUGFyaXMsVHJ1ZQpKYXBhbixLeW90byxGYWxzZQpTcGFpbixaYXJhZ296YSxGYWxzZQpJdGFseSxSb21lLFRydWUKQXVzdHJhbGlhLFBlcnRoLEZhbHNlCkF1c3RyYWxpYSxCcmlzYmFuZSxUcnVlCkZyYW5jZSxOaWNlLFRydWUKQ2FuYWRhLFZhbmNvdXZlcixGYWxzZQpKYXBhbixOYWdveWEsRmFsc2UKSmFwYW4sS3lvdG8sRmFsc2UKQ2FuYWRhLE1vbnRyZWFsLFRydWUKQ2FuYWRhLE90dGF3YSxGYWxzZQpCcmF6aWwsU8OjbyBQYXVsbyxUcnVlCkF1c3RyYWxpYSxNZWxib3VybmUsVHJ1ZQpBdXN0cmFsaWEsU3lkbmV5LFRydWUK"
CSV_FILE_NAME = "countries_and_cities.csv"
if not os.path.isfile(CSV_FILE_NAME):
    csv_data = base64.b64decode(CSV_ENCODED_DATA)
    with open(CSV_FILE_NAME, 'wb') as f:
        f.write(csv_data)

In [None]:
df = DataFrame.from_csv(CSV_FILE_NAME)
print("Initial DataFrame:")
print(df)
print("\n")

print("Column names:")
print(df.get_column_names())
print("\n")

print("Removing 'Country' column:")
df.remove_column('Country')
print(df)
print("\n")

print("Accessing 'City' column:")
city_column = df['City']
print(city_column)
print("\n")

print("Cities equal to 'Paris':")
cities_equal_paris = df['City'] == StringSeries(['Paris'] * len(df['City']._items))
print(cities_equal_paris)
print("\n")


In [None]:
# your code goes here