# Sorting and Filtering Data

There are some operations we will frequently need when working with data sets.
*Sorting* data makes it easier to read and search, and can highlight certain aspects of the data.
*Filtering* data is useful for removing irrelevant data and is especially important with large data sets.

## Sorting Data

When presenting data to the user, sorting data by the right key/property can be important.
For example, if you're looking for a court decision by date, a list ordered by title isn't helpful.

We can sort lists in two ways.
Firstly, we can make a new, sorted list, while also keeping the old one.

In [None]:
judges = ['Síofra O’Leary, President', 'Mārtiņš Mits', 'Stéphanie Mourou-Vikström', 'Lətif Hüseynov',
          'Jovan Ilievski', 'Ivana Jelić', 'Anne Grøstad, ad hoc judge', 'Victor Soloveytchik, Section Registrar']

In [None]:
judges_sorted = sorted(judges)
print(judges_sorted)

Secondly, if we don’t need the original list, it’s more efficient to sort the existing list in place:

In [None]:
print(judges)
judges.sort()
print(judges)
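
The difference between the two approaches can be sketched as follows. The short `names` list is made-up sample data, used only to keep the example self-contained.

```python
# Hypothetical sample data, not taken from the case files.
names = ['Mits', 'Ilievski', 'Jelić']

# sorted() returns a new list and leaves the original untouched.
names_sorted = sorted(names)
print(names)
print(names_sorted)

# list.sort() reorders the list itself and returns None,
# so assigning its result to a variable is a common mistake.
result = names.sort()
print(result)
print(names)

# Both sorted() and list.sort() accept reverse=True for descending order.
names_descending = sorted(names, reverse=True)
print(names_descending)
```

Because `sort()` returns `None`, writing `names = names.sort()` would lose the list.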

## Filtering Data

Let's say we want to make a list of only the ad hoc judges. We can do that by iterating over the list:

In [None]:
adhoc_judges = []

for judge in judges:
    if 'ad hoc' in judge:
        adhoc_judges.append(judge)

print(adhoc_judges)

This works, but it can be written more concisely with a *list comprehension*.

In [None]:
adhoc_judges = [judge for judge in judges if 'ad hoc' in judge]
print(adhoc_judges)
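
A list comprehension has the general shape `[expression for item in iterable if condition]`. The sketch below varies the condition; the `people` list repeats a few entries from the list of judges so that the example runs on its own.

```python
# A few entries copied from the list of judges, so the example is self-contained.
people = ['Ivana Jelić', 'Anne Grøstad, ad hoc judge',
          'Victor Soloveytchik, Section Registrar']

# Keep only the entries without a title, i.e. without a comma.
untitled = [person for person in people if ',' not in person]
print(untitled)

# Conditions can be combined with `and` / `or`.
# lower() makes the match independent of capitalisation.
neither = [person for person in people
           if 'ad hoc' not in person.lower()
           and 'registrar' not in person.lower()]
print(neither)
```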

## Modifying Data

We can also use list comprehensions to modify each item in a list.
For example, we can remove the titles from the list of judges.
We split on the comma and use element 0 from the result.

In [None]:
judges = [judge.split(',')[0] for judge in judges]
print(judges)
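
Entries without a comma are handled too, because `split(',')` then returns a list with a single element. As a small sketch, the example below also extracts the title with `str.partition()`. The `entries` list repeats two names from the list of judges, and the variable names are chosen for illustration only.

```python
# Two entries copied from the list of judges: one with a title, one without.
entries = ['Síofra O’Leary, President', 'Mārtiņš Mits']

# split(',') returns a one-element list when there is no comma,
# so element 0 is always safe to access.
names = [entry.split(',')[0].strip() for entry in entries]
print(names)

# partition(',') splits at the first comma only and always returns three
# parts: the text before, the separator, and the text after.
# The title is an empty string when an entry has no comma.
titles = [entry.partition(',')[2].strip() for entry in entries]
print(titles)
```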

## Sets — Avoiding Duplicates

Let's say we want to collect a list of all the judges who appear in a collection of cases.
We could do it like this:

In [None]:
import json

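def read_json_file(filename):
    # JSON files are normally UTF-8; stating it explicitly means that names
    # with accented characters load the same way on every platform.
    with open(filename, 'r', encoding='utf-8') as file:
        # json.load() parses directly from the open file.
        return json.load(file)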

cases = read_json_file('cases-5.json')

In [None]:
all_judges = []

for case in cases:
    for judge in case['decision_body']:
        all_judges.append(judge['name'])
print(all_judges)

Unfortunately, we get duplicates because the same judges appear in different cases.
We need to remove these duplicates.
The simplest way is to use `set()` instead of a list, because sets only store each item once.

In [None]:
all_judges = set()

for case in cases:
    for judge in case['decision_body']:
        all_judges.add(judge['name'])
print(all_judges)
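
Sets have no defined order, so the names may print in a different order than they were added. A minimal sketch of how to get a sorted list without duplicates is shown below. `sample_cases` is hypothetical data that only mirrors the `decision_body` and `name` keys used above.

```python
# Hypothetical data mirroring the structure of the cases used above.
sample_cases = [
    {'decision_body': [{'name': 'Ivana Jelić'}, {'name': 'Jovan Ilievski'}]},
    {'decision_body': [{'name': 'Jovan Ilievski'}, {'name': 'Lətif Hüseynov'}]},
]

# Set comprehension: the same two nested loops, written as one expression.
unique_judges = {judge['name']
                 for case in sample_cases
                 for judge in case['decision_body']}

# sorted() accepts any iterable and always returns a list.
unique_sorted = sorted(unique_judges)
print(unique_sorted)
```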

## Sorting by a Key

Let's say we want to order our cases by a certain property, for example their title or date.
First, we make a list that contains the titles (docnames) and dates as pairs.
We can do this with a list comprehension.

In [None]:
cases_by_date = [(case['docname'], case['judgementdate']) for case in cases]

Now, we can give a *key* as argument to `sorted()`.
We can use the function `itemgetter()` from the `operator` module to build a key function that returns the element at a given index.

In [None]:
from operator import itemgetter
cases_by_date = sorted(cases_by_date, key=itemgetter(0))
print(cases_by_date)

Alternatively, we can use a *lambda expression*.
A lambda expression is a compact way of defining a new function.

In [None]:
cases_by_date = sorted(cases_by_date, key=lambda pair: pair[0])
print(cases_by_date)
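
Both forms produce the same result. The sketch below checks this on made-up `(title, year)` pairs and also shows that `itemgetter()` accepts several indexes, which sorts by the first and uses the second as a tie-breaker.

```python
from operator import itemgetter

# Hypothetical (title, year) pairs, for illustration only.
pairs = [('B v. State', 2019), ('A v. State', 2021), ('C v. State', 2019)]

# itemgetter(0) and the lambda both return element 0 of each pair.
by_title_itemgetter = sorted(pairs, key=itemgetter(0))
by_title_lambda = sorted(pairs, key=lambda pair: pair[0])
print(by_title_itemgetter == by_title_lambda)

# With several indexes, itemgetter returns a tuple: here (year, title).
# Tuples compare element by element, so equal years are ordered by title.
by_year_then_title = sorted(pairs, key=itemgetter(1, 0))
print(by_year_then_title)
```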

If we want to sort by date, we must use the element with index 1.

In [None]:
cases_by_date = sorted(cases_by_date, key=lambda pair: pair[1])
print(cases_by_date)

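Unfortunately, this does not give a chronological order: the dates are stored as strings, so Python compares them character by character (lexicographic order), starting with the day.
We will need to convert the textual date to an object that Python understands.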

We can do this with the function `datetime.strptime()` from Python's standard library.
This function takes a second argument that specifies the date format, which looks a bit messy.

In [None]:
from datetime import datetime
cases_by_date = sorted(cases_by_date,
                       key=lambda pair: datetime.strptime(pair[1], '%d/%m/%Y %H:%M:%S'))
print(cases_by_date)
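
To see the difference between the two orderings in isolation, here is a small sketch with made-up dates in the same `day/month/year hour:minute:second` format. The names `DATE_FORMAT` and `dates` are chosen for illustration.

```python
from datetime import datetime

# %d = day, %m = month, %Y = four-digit year, %H:%M:%S = time of day.
DATE_FORMAT = '%d/%m/%Y %H:%M:%S'

# Hypothetical dates, not taken from the case files.
dates = ['15/03/2019 00:00:00', '02/01/2020 00:00:00', '28/11/2018 00:00:00']

# As plain strings, the dates are ordered by their first characters (the day).
as_text = sorted(dates)
print(as_text)

# Parsed into datetime objects, they are ordered chronologically.
chronological = sorted(dates, key=lambda text: datetime.strptime(text, DATE_FORMAT))
print(chronological)
```

If a date string does not match the format, `strptime()` raises a `ValueError`, so inconsistent date formats in the data surface as errors rather than as a silently wrong order.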