## The Bible

- https://pandas.pydata.org/pandas-docs/stable/reference/index.html
- https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
- Stack Overflow

## Important Magic Cells

In [9]:
%autosave 60

Autosaving every 60 seconds


Cool, the above lets me shorten the autosave interval from 180 seconds down to 60 seconds.

Prefixing a line with `!` runs it as a shell command. If you ever need (or want) to run `bash` commands and have WSL installed, this is how to do it. If you don't have bash, the command will default to `cmd`.

In [10]:
!pwd

/mnt/c/users/akira/documents/github/pandas_crashcourse


In [15]:
!apt-get update

Reading package lists... Done
E: Could not open lock file /var/lib/apt/lists/lock - open (13: Permission denied)
E: Unable to lock directory /var/lib/apt/lists/
W: Problem unlinking the file /var/cache/apt/pkgcache.bin - RemoveCaches (13: Permission denied)
W: Problem unlinking the file /var/cache/apt/srcpkgcache.bin - RemoveCaches (13: Permission denied)

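The `!` prefix is IPython syntax, so it only works inside a notebook or IPython shell. In a plain Python script, roughly the same thing can be sketched with the standard library's `subprocess` module. The command below is only an illustrative, platform-independent stand-in for `!pwd`, not something from the original notebook.

```python
import subprocess
import sys

# Run a child process and capture its standard output as text.
# sys.executable is used only so the sketch runs on any platform;
# in a notebook cell, `!pwd` is the shorter equivalent.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getcwd())"],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError if the command fails
)
working_dir = result.stdout.strip()
print(working_dir)
```

With `check=True`, a failing command (such as the permission error from `apt-get update` above) raises an exception instead of passing silently.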

The `%run` magic runs a Python script and imports its namespace directly into the notebook. Inside my `run.py`, I have the following commands:
```python
import pandas as pd
import numpy as np
```

In [1]:
%run run.py

Now, I don't need to do the imports here.  
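Outside a notebook, something comparable can be sketched with `runpy.run_path` from the standard library. The script name and contents here are hypothetical stand-ins for `run.py`; `%run` additionally copies the resulting names into the interactive namespace, which this sketch does not do.

```python
import runpy
import tempfile
from pathlib import Path

# Write a throwaway script (a stand-in for run.py) to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "run_demo.py"  # hypothetical file name
    script.write_text("import math\nANSWER = math.sqrt(16)\n")

    # run_path executes the file and returns its global namespace as a dict.
    namespace = runpy.run_path(str(script))

print(namespace["ANSWER"])    # 4.0
print("math" in namespace)    # True: the script's imports are in the namespace too
```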

I'll also use the `%%time` cell magic to report how long it takes to read this file.

In [8]:
%%time
df = pd.read_feather('./sample.feather')

CPU times: user 31.2 ms, sys: 141 ms, total: 172 ms
Wall time: 49.3 ms

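`%%time` only exists inside IPython. As a rough sketch, similar CPU and wall-clock numbers can be collected in a plain script with the `time` module. The summation is just a placeholder for whatever work is being timed.

```python
import time

start_wall = time.perf_counter()  # wall-clock timer
start_cpu = time.process_time()   # CPU time used by this process

total = sum(i * i for i in range(1_000_000))  # stand-in for the work being timed

wall = time.perf_counter() - start_wall
cpu = time.process_time() - start_cpu
print(f"CPU time: {cpu * 1000:.1f} ms, wall time: {wall * 1000:.1f} ms")
```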

## Working with dataframes

You can easily view a sample of the data with `df.tail()` (the last rows) or `df.head()` (the first rows); `df.shape` gives the dimensions as `(rows, columns)`.

The default is `n=5`, but you can specify something like `10`, as I have below.

In [17]:
df.tail(10)

Unnamed: 0,index,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell,pickupX,pickupY,dropoffX,dropoffY
98185,99990,2,4/12/15 22:55,4/12/15 23:07,2,0.76,-73.997871,40.718189,1,N,...,0.0,0.0,0.3,9.3,27:74,28:72,-8237405.0,4970863.0,-8236730.0,4971485.0
98186,99991,1,4/12/15 22:55,4/12/15 23:16,4,1.7,-73.978119,40.76474,1,N,...,0.0,0.0,0.3,15.3,25:64,24:68,-8235207.0,4977703.0,-8236571.0,4975802.0
98187,99992,1,4/12/15 22:55,4/12/15 23:10,1,2.5,-73.967201,40.752754,1,N,...,0.0,0.0,0.3,13.8,28:65,33:62,-8233991.0,4975941.0,-8230910.0,4975790.0
98188,99993,2,4/12/15 22:55,5/12/15 0:00,4,9.88,-73.869781,40.772308,1,N,...,1.84,5.54,0.3,38.68,42:50:00,28:67,-8223146.0,4978815.0,-8234896.0,4975062.0
98189,99994,2,4/12/15 22:55,4/12/15 23:28,2,12.75,-73.790367,40.644009,1,N,...,6.0,0.0,0.3,45.8,69:61,35:76,-8214306.0,4959974.0,-8234629.0,4966510.0
98190,99995,2,4/12/15 22:55,4/12/15 23:03,1,0.75,-73.99437,40.746239,1,N,...,0.0,0.0,0.3,7.8,25:69,27:68,-8237016.0,4974984.0,-8235502.0,4974382.0
98191,99996,1,4/12/15 22:55,4/12/15 23:08,1,2.4,-73.968346,40.759735,1,N,...,0.0,0.0,0.3,12.3,27:64,24:60,-8234119.0,4976967.0,-8234289.0,4980647.0
98192,99997,1,4/12/15 22:55,4/12/15 23:01,1,0.8,-73.993484,40.742168,1,N,...,1.45,0.0,0.3,8.75,25:69,26:67,-8236917.0,4974386.0,-8235905.0,4975537.0
98193,99998,2,4/12/15 22:55,4/12/15 23:17,1,4.73,-73.984993,40.747929,1,N,...,3.96,0.0,0.3,23.76,26:68,33:76,-8235972.0,4975232.0,-8235589.0,4966693.0
98194,99999,2,4/12/15 22:55,4/12/15 22:59,2,0.8,-73.975731,40.751968,1,N,...,1.16,0.0,0.3,6.96,27:66,27:68,-8234941.0,4975826.0,-8235555.0,4974377.0

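A small sketch of the same calls on a toy dataframe (a handful of values borrowed from the outputs in this notebook), in case the taxi file isn't at hand:

```python
import pandas as pd

# Toy frame standing in for the taxi data.
toy = pd.DataFrame({
    "trip_distance": [0.96, 2.69, 2.62, 1.20, 3.00, 0.75, 2.40],
    "passenger_count": [5, 2, 1, 1, 2, 1, 1],
})

print(toy.shape)    # (7, 2), i.e. (rows, columns)
print(toy.head(3))  # first three rows
print(toy.tail())   # last five rows, since n defaults to 5
```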

To access certain columns, the easiest way is to index the dataframe with:
1. a column name, to retrieve a Series (single column)
2. a list of column names, to get a dataframe containing only those columns

In [23]:
df['trip_distance']

0        0.96
1        2.69
2        2.62
3        1.20
4        3.00
         ... 
98190    0.75
98191    2.40
98192    0.80
98193    4.73
98194    0.80
Name: trip_distance, Length: 98195, dtype: float64

In [24]:
df[['pickup_longitude','pickup_latitude']]

Unnamed: 0,pickup_longitude,pickup_latitude
0,-73.979942,40.765381
1,-73.972336,40.762379
2,-73.968849,40.764530
3,-73.993935,40.741684
4,-73.988922,40.726990
...,...,...
98190,-73.994370,40.746239
98191,-73.968346,40.759735
98192,-73.993484,40.742168
98193,-73.984993,40.747929


You can also pass a list that has been declared beforehand (recommended).

In [25]:
PU_COORDS = ['pickup_longitude','pickup_latitude']
df[PU_COORDS]

Unnamed: 0,pickup_longitude,pickup_latitude
0,-73.979942,40.765381
1,-73.972336,40.762379
2,-73.968849,40.764530
3,-73.993935,40.741684
4,-73.988922,40.726990
...,...,...
98190,-73.994370,40.746239
98191,-73.968346,40.759735
98192,-73.993484,40.742168
98193,-73.984993,40.747929

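The difference between the two forms is easy to check on a toy dataframe. This is only a sketch with a few sample values; the point is the return type.

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup_longitude": [-73.979942, -73.972336],
    "pickup_latitude": [40.765381, 40.762379],
    "trip_distance": [0.96, 2.69],
})

one_col = toy["trip_distance"]  # single name -> Series
PU_COORDS = ["pickup_longitude", "pickup_latitude"]
two_cols = toy[PU_COORDS]       # list of names -> DataFrame

print(type(one_col).__name__)                 # Series
print(type(two_cols).__name__)                # DataFrame
print(type(toy[["trip_distance"]]).__name__)  # DataFrame: a one-item list still gives a frame
```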

Looking at data types and basic statistics can be done with some simple commands

In [20]:
df.columns # note that this is an attribute, not a method

Index(['index', 'VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'PickupCell', 'DropoffCell',
       'pickupX', 'pickupY', 'dropoffX', 'dropoffY'],
      dtype='object')

In [18]:
df.dtypes 

index                      int64
VendorID                   int64
tpep_pickup_datetime      object
tpep_dropoff_datetime     object
passenger_count            int64
trip_distance            float64
pickup_longitude         float64
pickup_latitude          float64
RatecodeID                 int64
store_and_fwd_flag        object
dropoff_longitude        float64
dropoff_latitude         float64
payment_type               int64
fare_amount              float64
extra                    float64
mta_tax                  float64
tip_amount               float64
tolls_amount             float64
improvement_surcharge    float64
total_amount             float64
PickupCell                object
DropoffCell               object
pickupX                  float64
pickupY                  float64
dropoffX                 float64
dropoffY                 float64
dtype: object

Note that in `dtypes`, string columns are reported as `object`.

In [19]:
df.describe() # this is a method

Unnamed: 0,index,VendorID,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,dropoff_longitude,dropoff_latitude,payment_type,...,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,pickupX,pickupY,dropoffX,dropoffY
count,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,...,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0,98195.0
mean,50003.497184,1.532654,1.709079,2.821277,-73.975085,40.749306,1.020958,-73.976504,40.750403,1.33853,...,0.605795,0.499119,1.790711,0.230021,0.299771,16.085542,-8234869.0,4975435.0,-8235027.0,4975597.0
std,28860.003033,0.498935,1.308925,3.330308,0.037463,0.026843,0.188657,0.028106,0.031279,0.486602,...,0.226626,0.024444,2.218994,1.131172,0.011366,11.41277,4170.389,3943.435,3128.708,4595.988
min,0.0,1.0,0.0,0.0,-74.084488,40.583759,1.0,-74.215378,40.523975,1.0,...,-1.0,-0.5,0.0,0.0,-0.3,-80.3,-8247047.0,4951139.0,-8261618.0,4942380.0
25%,25011.5,1.0,1.0,1.04,-73.992615,40.735245,1.0,-73.991871,40.732176,1.0,...,0.5,0.5,0.0,0.0,0.3,9.3,-8236820.0,4973368.0,-8236737.0,4972918.0
50%,50000.0,2.0,1.0,1.71,-73.982658,40.751526,1.0,-73.981621,40.750919,1.0,...,0.5,0.5,1.46,0.0,0.3,12.8,-8235712.0,4975761.0,-8235596.0,4975671.0
75%,74993.5,2.0,2.0,3.1,-73.970203,40.766073,1.0,-73.965614,40.769094,2.0,...,0.5,0.5,2.55,0.0,0.3,18.36,-8234325.0,4977899.0,-8233815.0,4978343.0
max,99999.0,2.0,6.0,91.2,-73.674927,40.87952,5.0,-73.606102,41.007378,4.0,...,1.5,0.5,115.0,24.0,0.3,550.3,-8201455.0,4994587.0,-8193794.0,5013430.0


The two date columns are still `object` (string) columns. `pd.to_datetime` converts them to a proper datetime dtype, and since there are several columns, a loop keeps this tidy.

In [22]:
DATE_COLS = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']
for col in DATE_COLS:
    df[col] = pd.to_datetime(df[col]) # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
df[DATE_COLS]

Unnamed: 0,tpep_pickup_datetime,tpep_dropoff_datetime
0,2015-01-12 00:00:00,2015-01-12 00:05:00
1,2015-01-12 00:00:00,2015-01-12 00:00:00
2,2015-01-12 00:00:00,2015-01-12 00:00:00
3,2015-01-12 00:00:00,2015-01-12 00:05:00
4,2015-01-12 00:00:00,2015-01-12 00:09:00
...,...,...
98190,2015-04-12 22:55:00,2015-04-12 23:03:00
98191,2015-04-12 22:55:00,2015-04-12 23:08:00
98192,2015-04-12 22:55:00,2015-04-12 23:01:00
98193,2015-04-12 22:55:00,2015-04-12 23:17:00

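One caveat worth checking: with no `format`, strings like `4/12/15` are read month-first, which is why the output above shows `2015-04-12`. In the `df.tail(10)` output, though, one trip is picked up at `4/12/15 22:55` and dropped off at `5/12/15 0:00`, which suggests the raw strings may be day-first. That is an inference from the sample rows, not something confirmed here. Assuming it holds, a minimal sketch of an explicit format would be:

```python
import pandas as pd

# Two raw strings in the same style as the taxi timestamps above.
raw = pd.Series(["4/12/15 22:55", "5/12/15 0:00"])

# Spell out the layout: day/month/two-digit year, then hour:minute.
# (Assumption: the source data is day-first.)
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")
print(parsed)

# The trip now lasts about an hour instead of a month.
print(parsed.iloc[1] - parsed.iloc[0])
```

Passing `format` also avoids per-element guessing, which tends to be faster on large columns.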

For the other datatypes (datetime is special), you can use the `.astype()` method

In [26]:
df.columns

Index(['index', 'VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RatecodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount', 'PickupCell', 'DropoffCell',
       'pickupX', 'pickupY', 'dropoffX', 'dropoffY'],
      dtype='object')

In [27]:
NOMINAL_COLS = ['VendorID', 'RatecodeID','store_and_fwd_flag','payment_type',
                'PickupCell', 'DropoffCell','pickupX', 'pickupY', 
                'dropoffX', 'dropoffY']
for col in NOMINAL_COLS:
    df[col] = df[col].astype(str) # or astype int, float, etc...

In [28]:
df.dtypes

index                             int64
VendorID                         object
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                   int64
trip_distance                   float64
pickup_longitude                float64
pickup_latitude                 float64
RatecodeID                       object
store_and_fwd_flag               object
dropoff_longitude               float64
dropoff_latitude                float64
payment_type                     object
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
PickupCell                       object
DropoffCell                      object
pickupX                          object
pickupY                          object
dropoffX                         object

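A quick sketch of what `.astype()` does to the values, on a toy dataframe (the `fare_amount` strings are made up for illustration). It also shows why comparisons on a converted ID column need a string such as `'2'` rather than the number `2`.

```python
import pandas as pd

toy = pd.DataFrame({"VendorID": [2, 2, 1], "fare_amount": ["6.5", "21.0", "8.0"]})

toy["VendorID"] = toy["VendorID"].astype(str)          # 2 -> "2"
toy["fare_amount"] = toy["fare_amount"].astype(float)  # "6.5" -> 6.5

print((toy["VendorID"] == "2").tolist())  # [True, True, False]
print(toy["fare_amount"].sum())           # 35.5
```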

Indexing by position (`.iloc`, integer location) and by label (`.loc`).
- Positions work as in `C`: they start from `0` and end at the length of the object minus one.
- Positional slicing follows Python conventions: inclusive of the start, exclusive of the end.
- Label slicing with `.loc` is inclusive of both the start and the end.
- *Either the start or the end can be omitted when slicing.*

Quite intuitive methods:  
- `.iloc[POSITIONS]`  
- `.loc[MATCHING INDICES, COLUMNS]`

In [29]:
df.iloc[:500] # from row 0 inclusive to 500 exclusive

Unnamed: 0,index,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell,pickupX,pickupY,dropoffX,dropoffY
0,0,2,2015-01-12 00:00:00,2015-01-12 00:05:00,5,0.96,-73.979942,40.765381,1,N,...,1.00,0.00,0.3,7.80,25:64,27:63,-8235409.507978152,4977796.744813881,-8233891.808096937,4977459.784866379
1,1,2,2015-01-12 00:00:00,2015-01-12 00:00:00,2,2.69,-73.972336,40.762379,1,N,...,3.34,0.00,0.3,25.64,26:64,25:69,-8234562.756271431,4977355.502377977,-8236933.153433366,4974948.36544226
2,2,2,2015-01-12 00:00:00,2015-01-12 00:00:00,1,2.62,-73.968849,40.764530,1,N,...,3.56,0.00,0.3,21.36,26:63,22:59,-8234174.625282053,4977671.714522172,-8234809.052871201,4981657.2011079965
3,3,1,2015-01-12 00:00:00,2015-01-12 00:05:00,1,1.20,-73.993935,40.741684,1,N,...,0.20,0.00,0.3,8.00,25:70,24:69,-8236967.124802372,4974314.446807342,-8237382.433332234,4975164.165346087
4,4,1,2015-01-12 00:00:00,2015-01-12 00:09:00,2,3.00,-73.988922,40.726990,1,N,...,0.00,0.00,0.3,12.30,28:71,33:75,-8236409.134741575,4972155.731892895,-8234925.407342563,4967732.1907335855
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,506,1,2015-04-12 19:09:00,2015-04-12 19:38:00,1,4.60,-73.997955,40.730202,1,N,...,0.00,0.00,0.3,22.30,26:72,28:59:00,-8237414.70596581,4972627.559181629,-8232170.276662107,4979200.777858604
496,507,1,2015-04-12 19:09:00,2015-04-12 19:45:00,1,16.50,-73.984383,40.745979,2,N,...,11.65,5.54,0.3,69.99,26:68,71:60,-8235903.8010263145,4974945.563392678,-8213480.573506878,4959968.93883872
497,508,2,2015-04-12 19:09:00,2015-04-12 19:21:00,3,1.53,-73.965912,40.754532,1,N,...,2.70,0.00,0.3,13.50,28:64,27:68,-8233847.64542855,4976202.312572292,-8235599.751874722,4974027.486527674
498,509,2,2015-04-12 19:09:00,2015-04-12 19:35:00,1,4.29,-73.994476,40.741772,1,N,...,4.26,0.00,0.3,25.56,25:70,27:77,-8237027.425457341,4974327.338054316,-8238490.7702575885,4968933.116332912


In [30]:
df.loc[:500] # this is different: it selects by index label, so label 500 is included

Unnamed: 0,index,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RatecodeID,store_and_fwd_flag,...,tip_amount,tolls_amount,improvement_surcharge,total_amount,PickupCell,DropoffCell,pickupX,pickupY,dropoffX,dropoffY
0,0,2,2015-01-12 00:00:00,2015-01-12 00:05:00,5,0.96,-73.979942,40.765381,1,N,...,1.00,0.00,0.3,7.80,25:64,27:63,-8235409.507978152,4977796.744813881,-8233891.808096937,4977459.784866379
1,1,2,2015-01-12 00:00:00,2015-01-12 00:00:00,2,2.69,-73.972336,40.762379,1,N,...,3.34,0.00,0.3,25.64,26:64,25:69,-8234562.756271431,4977355.502377977,-8236933.153433366,4974948.36544226
2,2,2,2015-01-12 00:00:00,2015-01-12 00:00:00,1,2.62,-73.968849,40.764530,1,N,...,3.56,0.00,0.3,21.36,26:63,22:59,-8234174.625282053,4977671.714522172,-8234809.052871201,4981657.2011079965
3,3,1,2015-01-12 00:00:00,2015-01-12 00:05:00,1,1.20,-73.993935,40.741684,1,N,...,0.20,0.00,0.3,8.00,25:70,24:69,-8236967.124802372,4974314.446807342,-8237382.433332234,4975164.165346087
4,4,1,2015-01-12 00:00:00,2015-01-12 00:09:00,2,3.00,-73.988922,40.726990,1,N,...,0.00,0.00,0.3,12.30,28:71,33:75,-8236409.134741575,4972155.731892895,-8234925.407342563,4967732.1907335855
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
496,507,1,2015-04-12 19:09:00,2015-04-12 19:45:00,1,16.50,-73.984383,40.745979,2,N,...,11.65,5.54,0.3,69.99,26:68,71:60,-8235903.8010263145,4974945.563392678,-8213480.573506878,4959968.93883872
497,508,2,2015-04-12 19:09:00,2015-04-12 19:21:00,3,1.53,-73.965912,40.754532,1,N,...,2.70,0.00,0.3,13.50,28:64,27:68,-8233847.64542855,4976202.312572292,-8235599.751874722,4974027.486527674
498,509,2,2015-04-12 19:09:00,2015-04-12 19:35:00,1,4.29,-73.994476,40.741772,1,N,...,4.26,0.00,0.3,25.56,25:70,27:77,-8237027.425457341,4974327.338054316,-8238490.7702575885,4968933.116332912
499,510,2,2015-04-12 19:09:00,2015-04-12 19:20:00,1,2.58,-73.954742,40.805370,1,N,...,1.00,0.00,0.3,12.80,24:55:00,1.001388889,-8232604.269055303,4983676.079439237,-8235358.550368045,4979727.342292647

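The positional versus label distinction is easiest to see on a tiny dataframe. This sketch also shows a case where the two disagree on more than the end point, because filtering leaves gaps in the index labels.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})  # default index labels 0..9

print(len(toy.iloc[:3]))  # 3 rows: positions 0, 1, 2
print(len(toy.loc[:3]))   # 4 rows: labels 0, 1, 2, 3

# After filtering, labels and positions no longer line up.
evens = toy[toy["value"] % 2 == 0]    # labels 0, 2, 4, 6, 8
print(evens.iloc[:3].index.tolist())  # [0, 2, 4]
print(evens.loc[:3].index.tolist())   # [0, 2]
```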

In [34]:
df.loc[df['VendorID'] == '2', PU_COORDS] 
# get all instances where the VendorID is 2, but only give me the pickup coordinates back

Unnamed: 0,pickup_longitude,pickup_latitude
0,-73.979942,40.765381
1,-73.972336,40.762379
2,-73.968849,40.764530
6,-73.968315,40.755329
7,-73.994209,40.746101
...,...,...
98188,-73.869781,40.772308
98189,-73.790367,40.644009
98190,-73.994370,40.746239
98193,-73.984993,40.747929


`df['VendorID'] == '2'` returns a boolean Series (`True` where the condition holds) that tells `.loc` which rows to keep, and `PU_COORDS` specifies that we only want those columns.

In [36]:
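# assign the output to a new variable; .copy() makes it an independent dataframe,
# so columns can be added to it later without a SettingWithCopyWarning
filtered = df.loc[df['VendorID'] == '2', PU_COORDS].copy()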

In [37]:
filtered.tail()

Unnamed: 0,pickup_longitude,pickup_latitude
98188,-73.869781,40.772308
98189,-73.790367,40.644009
98190,-73.99437,40.746239
98193,-73.984993,40.747929
98194,-73.975731,40.751968

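To make the row filter less magical, here is a sketch that builds the boolean mask as its own variable first. The toy dataframe reuses a few coordinates from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    "VendorID": ["2", "2", "1", "2"],
    "pickup_longitude": [-73.979942, -73.972336, -73.993935, -73.968315],
    "pickup_latitude": [40.765381, 40.762379, 40.741684, 40.755329],
})
PU_COORDS = ["pickup_longitude", "pickup_latitude"]

mask = toy["VendorID"] == "2"  # boolean Series, one entry per row
print(mask.tolist())           # [True, True, False, True]

subset = toy.loc[mask, PU_COORDS]  # rows where mask is True, two columns
print(subset.index.tolist())       # [0, 1, 3]  (original labels are kept)
```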

## Functions on DataFrames

Example: *I want to convert my lat/lon coordinates into the Mercator coordinate system*

In [35]:
def lat2mercer(lat):
    """
    Convert latitude (degrees) to its Web Mercator y coordinate (metres)
    """
    k = 6378137  # Earth's equatorial radius in metres
    return np.log(np.tan((90 + lat) * np.pi/360.0)) * k

def lon2mercer(lon):
    """
    Convert longitude (degrees) to its Web Mercator x coordinate (metres)
    """
    k = 6378137  # Earth's equatorial radius in metres
    return lon * (k * np.pi/180.0)

In [38]:
# i want to create a new col with the new coordinate system
filtered['mercer_X'] = filtered['pickup_longitude'].apply(lon2mercer)
filtered['mercer_Y'] = filtered['pickup_latitude'].apply(lat2mercer)

In [39]:
filtered.head()

Unnamed: 0,pickup_longitude,pickup_latitude,mercer_X,mercer_Y
0,-73.979942,40.765381,-8235410.0,4977797.0
1,-73.972336,40.762379,-8234563.0,4977356.0
2,-73.968849,40.76453,-8234175.0,4977672.0
6,-73.968315,40.755329,-8234115.0,4976319.0
7,-73.994209,40.746101,-8236998.0,4974963.0

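Because both conversion functions are built from NumPy operations, they can also be called on a whole column at once instead of going through `.apply`. The sketch below re-declares the two functions so it runs on its own; `EARTH_RADIUS_M` is just a renamed `k`.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6378137  # same constant k as in the functions above

def lat2mercer(lat):
    return np.log(np.tan((90 + lat) * np.pi / 360.0)) * EARTH_RADIUS_M

def lon2mercer(lon):
    return lon * (EARTH_RADIUS_M * np.pi / 180.0)

coords = pd.DataFrame({
    "pickup_longitude": [-73.979942, -73.972336],
    "pickup_latitude": [40.765381, 40.762379],
})

# Element by element, as in the cell above.
x_apply = coords["pickup_longitude"].apply(lon2mercer)
# Whole column at once: NumPy functions accept a Series directly.
x_vec = lon2mercer(coords["pickup_longitude"])
y_vec = lat2mercer(coords["pickup_latitude"])

print(np.allclose(x_apply, x_vec))  # True
print(x_vec.round(0).tolist())
print(y_vec.round(0).tolist())
```

On a column with tens of thousands of rows, the vectorized call is usually noticeably faster, since it avoids one Python function call per row.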

You can also write quick one-line functions inline; these are called lambda functions.

In [42]:
# convert vendor ID into a boolean given a condition
df['VendorBool'] = df['VendorID'].apply(lambda x: x == "1") # True if x is "1", else False

In [43]:
df[['VendorID','VendorBool']]

Unnamed: 0,VendorID,VendorBool
0,2,False
1,2,False
2,2,False
3,1,True
4,1,True
...,...,...
98190,2,False
98191,1,True
98192,1,True
98193,2,False

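For a simple condition like this one, comparing the column directly gives the same booleans without a lambda. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"VendorID": ["2", "2", "1"]})

via_apply = toy["VendorID"].apply(lambda x: x == "1")  # one Python call per row
via_compare = toy["VendorID"] == "1"                   # single vectorized comparison

print(via_apply.tolist())                          # [False, False, True]
print(via_apply.tolist() == via_compare.tolist())  # True
```

Lambdas remain handy when the per-row logic has no vectorized equivalent.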

Groupby's are the magic of `pandas`.

- `df.groupby(COL).AGGREGATE()`

Some examples of aggregations are:
- `.size()`
- `.count()`
- `.mean()`
- `.sum()`

For example, summing every numeric column per `RatecodeID` gives the table below (the `passenger_count` totals match the next cell):
Unnamed: 0_level_0,index,passenger_count,trip_distance,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,VendorBool
RatecodeID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1,4827871380,165000,248961.9,-7141066.0,3933664.0,-7140995.0,3933647.0,1157575.5,59458.5,48227.0,165169.19,15900.29,28939.5,1475270.0,45111
2,74771313,2636,27153.71,-111831.0,61606.56,-112044.7,61725.09,78620.5,1.52,755.5,9822.95,6562.15,453.9,96216.52,688
3,551044,14,29.75,-813.1408,447.8727,-813.3747,447.962,243.0,5.5,1.0,15.28,0.0,3.3,268.08,10
4,1783852,43,508.69,-2585.226,1424.509,-2582.732,1428.77,1665.5,20.5,17.5,311.21,61.77,10.5,2086.98,19
5,5115817,130,381.26,-7687.778,4235.566,-7687.048,4237.312,5056.5,0.0,10.0,520.22,62.69,28.8,5678.21,63


In [49]:
df.groupby('RatecodeID')['passenger_count'].sum()

RatecodeID
1    165000
2      2636
3        14
4        43
5       130
Name: passenger_count, dtype: int64

In [50]:
df.groupby('RatecodeID')[['passenger_count','trip_distance']].sum()

Unnamed: 0_level_0,passenger_count,trip_distance
RatecodeID,Unnamed: 1_level_1,Unnamed: 2_level_1
1,165000,248961.9
2,2636,27153.71
3,14,29.75
4,43,508.69
5,130,381.26

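The aggregations listed earlier differ mostly in how they treat missing values. A small sketch with one deliberately missing tip (toy data, not from the taxi file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RatecodeID": ["1", "1", "2", "2"],
    "tip_amount": [1.0, np.nan, 3.0, 5.0],  # one missing tip on purpose
})

grouped = toy.groupby("RatecodeID")["tip_amount"]

print(grouped.size().tolist())   # [2, 2]      rows per group, NaN included
print(grouped.count().tolist())  # [1, 2]      non-missing values only
print(grouped.sum().tolist())    # [1.0, 8.0]
print(grouped.mean().tolist())   # [1.0, 4.0]  NaN is skipped
```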

You can do multi-level groupbys as well

In [51]:
df.groupby(['VendorID','RatecodeID'])[['passenger_count','trip_distance']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,passenger_count,trip_distance
VendorID,RatecodeID,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,60939,114083.0
1,2,1000,12121.4
1,3,11,28.7
1,4,24,202.4
1,5,77,134.1
2,1,104061,134878.9
2,2,1636,15032.31
2,3,3,1.05
2,4,19,306.29
2,5,53,247.16


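Some issues...
- The result is a dataframe whose index is a multi-level index built from the group keys (akin to a composite key in SQL), so `VendorID` and `RatecodeID` are no longer ordinary columns
- If we want to bring them back as regular columns...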

In [52]:
aggDf = df.groupby(['VendorID','RatecodeID'])[['passenger_count','trip_distance']].sum()
aggDf.reset_index()

Unnamed: 0,VendorID,RatecodeID,passenger_count,trip_distance
0,1,1,60939,114083.0
1,1,2,1000,12121.4
2,1,3,11,28.7
3,1,4,24,202.4
4,1,5,77,134.1
5,2,1,104061,134878.9
6,2,2,1636,15032.31
7,2,3,3,1.05
8,2,4,19,306.29
9,2,5,53,247.16


`.reset_index()` replaces the multi-level index with the simple incremental one, like a database row number, and moves the old index levels into the dataframe as ordinary columns. It returns a new dataframe rather than modifying `aggDf`.
- You can specify `.reset_index(drop=True)` to discard the old index levels entirely

In [53]:
aggDf.reset_index(drop=True)

Unnamed: 0,passenger_count,trip_distance
0,60939,114083.0
1,1000,12121.4
2,11,28.7
3,24,202.4
4,77,134.1
5,104061,134878.9
6,1636,15032.31
7,3,1.05
8,19,306.29
9,53,247.16

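As a sketch on toy data, the snippet below checks that the grouped result really carries a multi-level index, flattens it with `.reset_index()`, and compares that with passing `as_index=False` to `groupby`, which skips building the index in the first place.

```python
import pandas as pd

toy = pd.DataFrame({
    "VendorID": ["1", "1", "2"],
    "RatecodeID": ["1", "2", "1"],
    "passenger_count": [3, 1, 2],
})

agg = toy.groupby(["VendorID", "RatecodeID"])[["passenger_count"]].sum()
print(list(agg.index.names))  # ['VendorID', 'RatecodeID'], the levels of a MultiIndex

flat = agg.reset_index()      # returns a new frame; agg itself is unchanged
print(flat.columns.tolist())  # ['VendorID', 'RatecodeID', 'passenger_count']

# Alternative: ask groupby to keep the keys as columns from the start.
direct = toy.groupby(["VendorID", "RatecodeID"], as_index=False)[["passenger_count"]].sum()
print(direct.columns.tolist())
```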

## Some harder conditions...

In [None]:
# sum of bools
# todo rip no time to prep