# Task #2: Join two datasets using the implemented map-reduce framework

There are two datasets:

- data/users dataset with columns "id" and "country"
- data/clicks dataset with columns "date", "user_id" and "click_target"

We'd like to produce a new dataset called data/filtered_clicks that includes only the clicks made by users from Lithuania (country=LT).
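To make the goal concrete, here is a minimal sketch of the expected result on a few made-up rows. It uses a plain set lookup rather than MapReduce, so it only serves as a reference for what the join should return. The sample rows and the helper name `filter_clicks_by_country` are hypothetical; only the column names come from the datasets described above.

```python
# Hypothetical sample rows; only the column names match the real datasets.
users = [
    {"id": "1", "country": "LT"},
    {"id": "2", "country": "US"},
]
clicks = [
    {"date": "2023-01-01", "user_id": "1", "click_target": "home"},
    {"date": "2023-01-01", "user_id": "2", "click_target": "cart"},
    {"date": "2023-01-02", "user_id": "1", "click_target": "search"},
]


def filter_clicks_by_country(users, clicks, country="LT"):
    """Keep only the clicks whose user_id belongs to a user from `country`."""
    wanted_ids = {user["id"] for user in users if user["country"] == country}
    return [click for click in clicks if click["user_id"] in wanted_ids]


# Only the two clicks made by user 1 (the LT user) should remain.
expected = filter_clicks_by_country(users, clicks)
for row in expected:
    print(row)
```

The MapReduce version below should produce the same rows, although not necessarily in the same order.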

Let's import some functions and libraries from our previous task (Task1):

- read_csv: to read the content of CSV files
- group_by_key: to group mapped data by key (date or user_id)
- write_output: to write the results to an output CSV file
- glob: to search for files matching a specified pattern
- os: to interact with the operating system, e.g., create directories and handle file paths
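Task1 is not reproduced here, so the exact implementations of these helpers are unknown. The sketch below is an assumption inferred from how they are called later in this notebook: `read_csv` takes a list of file paths, `group_by_key` takes records of the form `{"key": ..., "value": ...}`, and `write_output` takes a path, a list of row dicts and a `fieldnames` list. The function bodies are illustrative, not the original Task1 code.

```python
import csv
from collections import defaultdict


def read_csv(paths):
    """Read every file in `paths` and return all rows as a list of dicts."""
    rows = []
    for path in paths:
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


def group_by_key(mapped_data):
    """Collect the "value" of each {"key", "value"} record under its key."""
    grouped = defaultdict(list)
    for record in mapped_data:
        grouped[record["key"]].append(record["value"])
    return dict(grouped)


def write_output(path, rows, fieldnames):
    """Write `rows` (a list of dicts) to `path` as CSV with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Under these assumptions every CSV value is read as a string, which is why the join below compares user ids as strings.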

In [13]:
import nbimporter # Import jupyter notebook functions from previous code.
from Task1 import read_csv, group_by_key, write_output, glob, os

In [14]:
os.makedirs('data/filtered_clicks', exist_ok=True)

In [15]:
def map_reduce(mappers, reducer, output):
    mapped_data = []
 
    # For each dataset directory, read all of its CSV files, apply the
    # corresponding mapper, and collect the mapped records in mapped_data.
    for dataset, mapper in mappers.items():
        data = read_csv(glob.glob(os.path.join(dataset, '*.csv')))
        mapped_data.extend(mapper(data))
    
    # Group the mapped records by key, so that each key maps to the list of its values.
    grouped_data = group_by_key(mapped_data)

    reduced_data = []

    # Call the reducer once per key and collect the rows it returns.
    for key, values in grouped_data.items():
        reduced_data.extend(reducer(key, values))

    # The final output is written to the specified output file.
    write_output(output, reduced_data, fieldnames=['date', 'user_id', 'click_target'])


# map_users and map_clicks key each record by user id and tag it with its source table.
# map_users keeps only users from Lithuania (country == "LT"); map_clicks keeps every click.
def map_users(users):
    return [
        {"key": user["id"], "value": {**user, "table": "users"}}
        for user in users if user["country"] == "LT"
    ]


def map_clicks(clicks):
    return [
        {"key": click["user_id"], "value": {**click, "table": "clicks"}}
        for click in clicks
    ]

# reduce_join joins the user and click records that share a user_id.
# A group produces output only if it contains a users record. Since map_users
# keeps only LT users, clicks made by everyone else are dropped.
def reduce_join(key, values):
    has_user = any(value["table"] == "users" for value in values)
    if not has_user:
        return []

    return [
        {"date": click["date"], "user_id": click["user_id"], "click_target": click["click_target"]}
        for click in values if click["table"] == "clicks"
    ]


# Call map_reduce with one mapper per dataset, the join reducer, and the output file path.
map_reduce(
    mappers={
        "data/users": map_users,
        "data/clicks": map_clicks,
    },
    reducer=reduce_join,
    output="data/filtered_clicks/filtered_output.csv",
)

This code performs a reduce-side join with the MapReduce framework. The map_reduce function reads each dataset, applies its mapper, and groups the mapped records by key, which here is the user id. The reduce_join function is then applied to each group and returns that group's clicks only when the group also contains a user from Lithuania. The results are written to data/filtered_clicks/filtered_output.csv.
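The flow described above can be traced on a few in-memory records without touching any files. This is a minimal sketch under stated assumptions: the sample rows are made up, and `group_values` is a hypothetical stand-in for `group_by_key` from Task1, whose implementation is not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names match the real datasets.
users = [{"id": "1", "country": "LT"}, {"id": "2", "country": "US"}]
clicks = [
    {"date": "2023-01-01", "user_id": "1", "click_target": "home"},
    {"date": "2023-01-01", "user_id": "2", "click_target": "cart"},
]


def group_values(mapped):
    """Hypothetical stand-in for group_by_key: key -> list of values."""
    grouped = defaultdict(list)
    for record in mapped:
        grouped[record["key"]].append(record["value"])
    return dict(grouped)


# Map: tag each record with its source table and key it by user id.
mapped = [
    {"key": u["id"], "value": {**u, "table": "users"}}
    for u in users if u["country"] == "LT"
] + [
    {"key": c["user_id"], "value": {**c, "table": "clicks"}}
    for c in clicks
]

# Group: records that share a user id end up in the same list.
grouped = group_values(mapped)

# Reduce: a group yields its clicks only if it also holds a users record.
joined = []
for key, values in grouped.items():
    if any(v["table"] == "users" for v in values):
        joined.extend(
            {k: v[k] for k in ("date", "user_id", "click_target")}
            for v in values if v["table"] == "clicks"
        )

print(grouped["2"])  # holds only a clicks record, so user 2 is dropped
print(joined)
```

The group for user 2 contains no users record because map_users already filtered that user out, which is what turns the join into a filter.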