# Getting Started with Bodo

## Pandas Feature Engineering

Let's run a simple Pandas feature engineering example on a dataset stored in Parquet format in a public S3 bucket hosted by Bodo. The NYC taxi dataset contains information about yellow and green taxi trips originating in New York City. The original example can be found [here](https://github.com/toddwschneider/nyc-taxi-data).

The `@bodo.jit` decorator tells the Bodo engine to compile and parallelize the decorated function.

In [3]:
import pandas as pd
import bodo

@bodo.jit
def feat_eng():
    """
    Generate features from a raw taxi dataframe.
    """
    taxi_df = pd.read_parquet(
        "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2023-01.parquet",
    )
    # Keep only positive fares to avoid divide-by-zero in tip_fraction
    df = taxi_df[taxi_df.fare_amount > 0][
        ["tpep_pickup_datetime", "passenger_count", "tip_amount", "fare_amount"]
    ].copy()
    df["tip_fraction"] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = (
        df[
            [
                "pickup_weekday",
                "pickup_hour",
                "pickup_week_hour",
                "pickup_minute",
                "passenger_count",
                "tip_fraction",
            ]
        ]
        .astype(float)
        .fillna(-1)
    )
    return df


taxi_feat = feat_eng()
display(taxi_feat.head())



| | pickup_weekday | pickup_hour | pickup_week_hour | pickup_minute | passenger_count | tip_fraction |
|---|---|---|---|---|---|---|
| 0 | 6.0 | 0.0 | 144.0 | 32.0 | 1.0 | 0.0 |
| 1 | 6.0 | 0.0 | 144.0 | 55.0 | 1.0 | 0.506329 |
| 2 | 6.0 | 0.0 | 144.0 | 25.0 | 1.0 | 1.006711 |
| 3 | 6.0 | 0.0 | 144.0 | 3.0 | 0.0 | 0.0 |
| 4 | 6.0 | 0.0 | 144.0 | 10.0 | 1.0 | 0.287719 |
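
To see how the derived columns relate to a single trip, the sketch below recomputes them using only the Python standard library. It is an illustration, not part of the Bodo example: `trip_features` is a hypothetical helper, and the timestamp and fare values are made up. It mirrors the `fare_amount > 0` filter, and applies the `-1` fill value only to a missing passenger count.

```python
from datetime import datetime


def trip_features(pickup, passenger_count, tip_amount, fare_amount):
    """Compute feat_eng-style features for one trip (hypothetical helper).

    Returns None when the fare is not positive, mirroring the
    `fare_amount > 0` filter in feat_eng.
    """
    if fare_amount <= 0:
        return None
    weekday = pickup.weekday()  # Monday=0 ... Sunday=6, same as pandas .dt.weekday
    return {
        "pickup_weekday": float(weekday),
        "pickup_hour": float(pickup.hour),
        # Hour of the week: 0 (Monday 00:xx) through 167 (Sunday 23:xx)
        "pickup_week_hour": float(weekday * 24 + pickup.hour),
        "pickup_minute": float(pickup.minute),
        # A missing passenger count becomes -1, like fillna(-1)
        "passenger_count": -1.0 if passenger_count is None else float(passenger_count),
        "tip_fraction": tip_amount / fare_amount,
    }


# Made-up trip: picked up on Sunday 2023-01-01 at 00:32, no tip
example = trip_features(datetime(2023, 1, 1, 0, 32), 1, 0.0, 9.3)
print(example)
```

Since 1 January 2023 fell on a Sunday (weekday 6), a pickup at 00:32 gives `pickup_week_hour = 6 * 24 + 0 = 144`, the same value that appears in the first rows of the output above.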




---


If you've made it this far, you have now run your first Python program with Bodo! Please consider joining our [community Slack](https://bodocommunity.slack.com/join/shared_invite/zt-qwdc8fad-6rZ8a1RmkkJ6eOX1X__knA#/shared-invite/email) to get in touch directly with our engineers and other Bodo users like yourself. For more information and to learn how Bodo works, visit our [docs](https://docs.bodo.ai).





## Run A SQL Query

Let's run a simple SQL query to generate a quick summary of the dataset.

Run the next code cell to generate a summary table, grouped by passenger count, showing the total and average fares rounded to the nearest whole number.

In [1]:
import bodosql

# File stored in public S3 bucket hosted by Bodo
s3_file_path = "s3://bodo-example-data/nyc-taxi/yellow_tripdata_2019_half.pq" 

# Read the file directly from S3
bc = bodosql.BodoSQLContext({"NYCTAXI": bodosql.TablePath(s3_file_path, "parquet")})

# Execute the SQL query
df1 = bc.sql('''
            SELECT DISTINCT "passenger_count"
            , ROUND (SUM ("fare_amount"),0) as TotalFares
            , ROUND (AVG ("fare_amount"),0) as AvgFares
            FROM NYCTAXI
            GROUP BY "passenger_count"
            ''')
display(df1)

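| | passenger_count | TOTALFARES | AVGFARES |
|---|---|---|---|
| 0 | 9 | 2250.0 | 49.0 |
| 1 | 8 | 3854.0 | 54.0 |
| 2 | 4 | 5030361.0 | 13.0 |
| 3 | 1 | 190001767.0 | 13.0 |
| 4 | 3 | 11024521.0 | 13.0 |
| 5 | 2 | 40135971.0 | 13.0 |
| 6 | 0 | 4862469.0 | 14.0 |
| 7 | 7 | 4017.0 | 51.0 |
| 8 | 5 | 11343053.0 | 13.0 |
| 9 | 6 | 6764732.0 | 12.0 |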
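
The aggregation itself can be tried on a tiny in-memory table. The sketch below swaps in Python's built-in `sqlite3` module for BodoSQL, so it only illustrates the `GROUP BY` logic and not Bodo's parallel execution. The rows are made up, and the `ORDER BY` clause is an addition here to make the result order deterministic.

```python
import sqlite3

# Made-up (passenger_count, fare_amount) rows, purely for illustration
rows = [(1, 10.0), (1, 16.0), (2, 7.5), (2, 8.5), (2, 20.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NYCTAXI (passenger_count INTEGER, fare_amount REAL)")
conn.executemany("INSERT INTO NYCTAXI VALUES (?, ?)", rows)

# One output row per passenger count, with rounded total and average fares
summary = conn.execute(
    """
    SELECT passenger_count,
           ROUND(SUM(fare_amount), 0) AS TotalFares,
           ROUND(AVG(fare_amount), 0) AS AvgFares
    FROM NYCTAXI
    GROUP BY passenger_count
    ORDER BY passenger_count
    """
).fetchall()
conn.close()

for passenger_count, total_fares, avg_fares in summary:
    print(passenger_count, total_fares, avg_fares)
```

Because `GROUP BY` already yields exactly one row per `passenger_count`, this sketch omits `DISTINCT` without changing the result.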
