# Introduction 
This tutorial walks through how to set up a featurization pipeline in `ralf`. We'll set up a pipeline that computes user features from a data stream of user ratings, then query those features to predict the rating a user will give a movie.

To do so, we'll do the following: 
1. Create feature tables from the MovieLens dataset, which are incrementally maintained by `ralf`
2. Create a ralf client which queries the feature tables 
3. Implement load shedding policies to reduce feature computation cost

# Creating a featurization pipeline 
We create an instance of ralf so that we can start creating tables.

In [None]:
ralf_server = Ralf()

### Creating Source Tables
Source tables define the raw data sources that are run through ralf to become features. `ralf` lets you create both static batch (e.g. from a CSV) and dynamic streaming sources (e.g. from Kafka). 

To define a source, we implement a `SourceOperator`. 

In [None]:
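class RatingsSource(SourceOperator):
    """Streams user ratings from a Kafka topic into ralf."""

    def __init__(self, schema, kafka_topic):
        self.schema = schema
        self.topic = kafka_topic

    def next(self):
        # Placeholder: read the next rating from `self.topic` and return it as a record.
        pass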
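
As a rough illustration of what `next()` could do, the sketch below replaces Kafka with an in-memory list so that it runs on its own. `InMemoryRatingsSource` and the `Rating` tuple are hypothetical stand-ins, not part of ralf's API; a real source would poll `kafka_topic` instead.

```python
from typing import Iterable, NamedTuple, Optional


class Rating(NamedTuple):
    """Hypothetical stand-in for a record with the source schema's columns."""
    user: int
    movie: int
    rating: float


class InMemoryRatingsSource:
    """Hypothetical source that replays a fixed list of ratings, one per call."""

    def __init__(self, ratings: Iterable[Rating]):
        self._ratings = iter(ratings)

    def next(self) -> Optional[Rating]:
        # Return the next rating, or None once the stream is exhausted.
        return next(self._ratings, None)


ratings_source = InMemoryRatingsSource([Rating(1, 10, 4.0), Rating(2, 10, 3.5)])
first_rating = ratings_source.next()
second_rating = ratings_source.next()
end_of_stream = ratings_source.next()
```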

We specify a schema using ralf's `Schema` object. 

In [None]:
from ralf import Schema 

# The first argument is the primary key; the second maps column names to types.
source_schema = Schema("user", {"user": int, "movie": int, "rating": float})

We can now add the source to our ralf instance. 

In [None]:
source = ralf_server.create_source(RatingsSource, args=(source_schema, "ratings_topic"))

### Creating Feature Tables 
Now that we have data streaming into ralf through the source table, we can define derived feature tables from the source table. 

Feature tables follow an API similar to pandas DataFrames. We define a feature table in terms of one or two parent tables and an operator that specifies how to transform the parent data.


For example, we can calculate the average rating for each user with an `AverageRating` operator: 

In [None]:
from collections import defaultdict

import numpy as np

class AverageRating(Operator):

    def __init__(self, schema):
        # Every rating seen so far, keyed by user ID.
        self.user_ratings = defaultdict(list)

    def on_record(self, record: Record):
        self.user_ratings[record.user].append(record.rating)
        ratings = np.array(self.user_ratings[record.user])
        return Record(user=record.user, average=ratings.mean())
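
Storing every rating makes each update grow more expensive as a user's history gets longer. A possible alternative, sketched below without any ralf dependency, keeps only a count and a running sum per user. `RunningAverage` and its `update` method are hypothetical names used for illustration.

```python
from collections import defaultdict


class RunningAverage:
    """Hypothetical helper: per-user mean from a running count and sum."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, user: int, rating: float) -> float:
        # Fold the new rating into the user's state and return the new mean.
        self.count[user] += 1
        self.total[user] += rating
        return self.total[user] / self.count[user]


averages = RunningAverage()
averages.update(user=1, rating=4.0)
user_1_average = averages.update(user=1, rating=2.0)  # (4.0 + 2.0) / 2
user_2_average = averages.update(user=2, rating=5.0)
```

The result matches the list-based version, but each update takes constant time and memory per user.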

The `AverageRating` operator can be used to define a feature table containing the average rating for each user. 

In [None]:
average_rating_schema = Schema("user", {"user": int, "average": float})
# `args` must be a tuple, hence the trailing comma.
average_rating = source.map(AverageRating, args=(average_rating_schema,))

### Adding Processing Policies
In many cases, a sample of the incoming data is enough to compute the features we need. We can add a simple load shedding policy to the `average_rating` table that samples roughly half of the records.

In [None]:
import random

class SampleHalf(LoadSheddingPolicy):

    def process(self, record: Record) -> bool:
        # True for roughly half of the records, chosen uniformly at random.
        return random.random() < 0.5

average_rating.add_load_shedding(SampleHalf)
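
To get a feel for what this policy does to a stream, the standalone sketch below applies the same coin flip to 10,000 synthetic records. `sample_half` is a hypothetical helper that mirrors `SampleHalf.process`; the fixed seed only makes the run repeatable.

```python
import random


def sample_half(rng: random.Random) -> bool:
    # Same test as SampleHalf.process, with an explicit random generator.
    return rng.random() < 0.5


rng = random.Random(0)
records = range(10_000)
kept = [record for record in records if sample_half(rng)]
kept_fraction = len(kept) / len(records)
```

`kept_fraction` should land close to 0.5, which is the share of feature updates the pipeline would still compute.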

## Creating a `ralf` Client 
Now that we have a simple pipeline, we can query the ralf server for features. 

In [None]:
ralf_client = RalfClient()

In [None]:
ralf_client.point_query(table="average", key=40)

In [None]:
ralf_client.bulk_query(table="average")

# Advanced: Maintaining user vectors 
Now that we've set up a simple feature table and run some queries, we can create a more realistic feature table: a vector for each user representing their movie tastes.

In this example, we'll assume we already have pre-computed movie vectors, which are held constant. User vectors are updated over time as new rating information is received.

In [None]:
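import pandas as pd

class UserVectors(Operator):

    def __init__(self, schema, movie_vectors_file):
        self.user_ratings = {}
        # Pre-computed movie vectors, held constant while user vectors are updated.
        self.movie_vectors = pd.read_csv(movie_vectors_file)

    def on_record(self, rating: Record, movie_vector: Record):
        # Placeholder: update the user's vector from the new rating.
        pass

# `user_schema` is the schema of the user vector table, defined like the schemas above.
user_vectors = source.map(UserVectors, args=(user_schema, "movie_vectors.csv"))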
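
The update rule inside `on_record` is left open above. One simple possibility, shown here purely as an assumption, is to represent a user as the rating-weighted average of the vectors of the movies they have rated. `UserVectorState` and the two-dimensional example vectors are hypothetical and do not come from ralf or the MovieLens data.

```python
from typing import Dict, List


class UserVectorState:
    """Hypothetical per-user state: rating-weighted sum of movie vectors."""

    def __init__(self, dim: int):
        self.weighted_sum = [0.0] * dim
        self.total_rating = 0.0

    def update(self, rating: float, movie_vector: List[float]) -> List[float]:
        # Add the movie vector, scaled by the rating, to the running sum.
        for i, value in enumerate(movie_vector):
            self.weighted_sum[i] += rating * value
        self.total_rating += rating
        if self.total_rating == 0:
            return list(self.weighted_sum)
        # Normalize so the user vector is a weighted average of movie vectors.
        return [s / self.total_rating for s in self.weighted_sum]


# Made-up movie vectors: movie ID -> embedding.
movie_vectors: Dict[int, List[float]] = {10: [1.0, 0.0], 20: [0.0, 1.0]}
state = UserVectorState(dim=2)
state.update(rating=4.0, movie_vector=movie_vectors[10])
user_vector = state.update(rating=1.0, movie_vector=movie_vectors[20])
```

Because the state is a running sum, each new rating updates the user vector without revisiting earlier ratings.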

## Prioritizing Active Users 
Ralf allows for key-level prioritization policies. Say that we want to prioritize computing updates to user vectors for users who are especially active. We can use activity data to implement a prioritized lottery scheduling policy.

In [None]:
user_activity = pd.read_csv("user_active_time.csv")

For example, we can sample each record with a probability that depends on how active its user is, so that updates for less active users are shed more often.

In [None]:
class SampleActiveUsers(LoadSheddingPolicy):

    def __init__(self, user_activity_csv):
        user_activity = pd.read_csv(user_activity_csv)
        # Placeholder: fill in from `user_activity`, mapping each user to a probability in [0, 1].
        self.weights = {}

    def process(self, record: Record) -> bool:
        return random.random() < self.weights[record.user]
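
One way to fill in `weights` is to scale each user's activity by the largest value, so the most active user gets probability 1 and everyone else gets proportionally less. The sketch below uses a plain dictionary with made-up numbers in place of `user_active_time.csv`, whose columns are not shown here; `activity_to_weights` is a hypothetical helper.

```python
from typing import Dict


def activity_to_weights(active_time: Dict[int, float]) -> Dict[int, float]:
    """Map each user to a probability in [0, 1], proportional to activity."""
    max_time = max(active_time.values(), default=0.0)
    if max_time <= 0:
        # No activity information: give every user probability 1.
        return {user: 1.0 for user in active_time}
    return {user: hours / max_time for user, hours in active_time.items()}


# Made-up activity data: user ID -> hours active.
weights = activity_to_weights({1: 10.0, 2: 5.0, 3: 0.0})
```

Here user 1 gets probability 1.0, user 2 gets 0.5, and user 3, who has no recorded activity, gets 0.0.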

Alternatively, we can create a key prioritization policy which prioritizes keys using lottery scheduling.

In [None]:
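from typing import List

class LotteryScheduling(PrioritizationPolicy):

    def __init__(self, user_activity_csv):
        user_activity = pd.read_csv(user_activity_csv)
        self.weights = user_activity.to_dict()

    def choose(self, keys: List[KeyType]):
        # Placeholder: pick a key at random, weighted by `self.weights`.
        pass

lottery_scheduling = LotteryScheduling("user_active_time.csv")
user_vectors.add_prioritization_policy(lottery_scheduling)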
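
In lottery scheduling, each key holds tickets in proportion to its weight and the winner is drawn at random, so heavily weighted keys are chosen more often without starving the rest. Below is a minimal sketch of what `choose` could do, using `random.choices` from the standard library. `choose_key`, the default weight of 1.0 for unknown keys, and the example weights are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def choose_key(keys: List[int], weights: Dict[int, float], rng: random.Random) -> int:
    # Each key's ticket count is its weight; keys without a weight get 1.0.
    tickets = [weights.get(key, 1.0) for key in keys]
    return rng.choices(keys, weights=tickets, k=1)[0]


rng = random.Random(0)
weights = {1: 9.0, 2: 1.0}  # made-up weights: user 1 is far more active
draws = Counter(choose_key([1, 2], weights, rng) for _ in range(10_000))
```

Over many draws, user 1 should win roughly nine times as often as user 2, while user 2 still gets scheduled occasionally.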

## Updating features lazily
We need append tables for this.... 