# 🚕 New York Taxi Dataset 🚕
***
   #### owner: Golovin Alexey

In [3]:
import pandas as pd

In [2]:
file = './taxi_nyc.csv'

In [5]:
df = pd.read_csv(file)

In [10]:
df.head(3)

,pickup_dt,pickup_month,borough,pickups,hday,spd,vsb,temp,dewp,slp,pcp 01,pcp 06,pcp 24,sd
0,2015-01-01 01:00:00,Jan,Bronx,152,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
1,2015-01-01 01:00:00,Jan,Brooklyn,1519,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
2,2015-01-01 01:00:00,Jan,EWR,0,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0


In [12]:
df.columns  # .columns lists the column names

Index(['pickup_dt', 'pickup_month', 'borough', 'pickups', 'hday', 'spd', 'vsb',
       'temp', 'dewp', 'slp', 'pcp 01', 'pcp 06', 'pcp 24', 'sd'],
      dtype='object')

Fields of the dataset:

    pickup_dt – pickup time period, with one-hour resolution
    pickup_month – month
    borough – district of NY from which the order was made (5 districts + airport)
    pickups – number of trips in that hour
    hday – holiday or not; Y/N
    spd – wind speed
    vsb – visibility
    temp – temperature, °F
    dewp – dew point
    slp – pressure
    pcp_01 – precipitation amount over 1 hour
    pcp_06 – precipitation amount over 6 hours
    pcp_24 – precipitation amount over 24 hours
    sd – snow depth, inches

Check how many rows and columns the dataset has:

In [8]:
df.shape  # the .shape attribute returns (number of rows, number of columns)

(29101, 14)

Check the data type of each column:

In [9]:
df.dtypes

pickup_dt        object
pickup_month     object
borough          object
pickups           int64
hday             object
spd             float64
vsb             float64
temp            float64
dewp            float64
slp             float64
pcp 01          float64
pcp 06          float64
pcp 24          float64
sd              float64
dtype: object
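The `dtypes` output lists `pickup_dt` as `object`, meaning the timestamps are still plain strings. A minimal sketch of converting them with `pd.to_datetime` follows. The small `sample` frame and the `pickup_hour` column are illustrative stand-ins, not part of the original notebook, and the format string is assumed from the values shown in `df.head(3)`.

```python
import pandas as pd

# Tiny stand-in for the real data; values copied from the df.head(3) output.
sample = pd.DataFrame({
    "pickup_dt": ["2015-01-01 01:00:00"] * 3,
    "borough": ["Bronx", "Brooklyn", "EWR"],
    "pickups": [152, 1519, 0],
})

# Convert the string column to datetime64 so date parts become accessible.
# The format string is an assumption based on the values shown above.
sample["pickup_dt"] = pd.to_datetime(sample["pickup_dt"], format="%Y-%m-%d %H:%M:%S")

# The .dt accessor exposes components such as the hour of the pickup period.
sample["pickup_hour"] = sample["pickup_dt"].dt.hour  # hypothetical helper column

print(sample.dtypes)
```

With a datetime column, grouping by hour or weekday no longer needs string slicing.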

Replace spaces with underscores in the column names:

In [17]:
df.columns = df.columns.str.replace(" ", "_")  # 1st method: .str.replace() on the column index

In [18]:
df = df.rename(columns=lambda col: col.replace(" ", "_"))  # 2nd method: rename() with a lambda
df.head(3)

,pickup_dt,pickup_month,borough,pickups,hday,spd,vsb,temp,dewp,slp,pcp_01,pcp_06,pcp_24,sd
0,2015-01-01 01:00:00,Jan,Bronx,152,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
1,2015-01-01 01:00:00,Jan,Brooklyn,1519,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
2,2015-01-01 01:00:00,Jan,EWR,0,Y,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0
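Either renaming method is enough on its own; running both is harmless because the second finds no spaces left to replace. The sketch below checks that they yield the same names on an empty frame whose column list is a shortened, illustrative subset of the real one.

```python
import pandas as pd

# Empty frame with a few of the original column names (subset chosen for brevity).
sample = pd.DataFrame(columns=["pickup_dt", "pcp 01", "pcp 06", "pcp 24", "sd"])

# Method 1: vectorised string replacement on the column index.
by_str = sample.columns.str.replace(" ", "_")

# Method 2: rename() applies the function to every column label.
by_rename = sample.rename(columns=lambda col: col.replace(" ", "_")).columns

print(list(by_str))
print(list(by_str) == list(by_rename))
```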


How many times 'Brooklyn' appears in the dataset:

In [24]:
df.query('borough == "Brooklyn"').shape[0]

4343

How to count how many times each district appears:

In [25]:
df.groupby('borough').agg({'pickups': 'count'})

borough,pickups
Bronx,4343
Brooklyn,4343
EWR,4343
Manhattan,4343
Queens,4343
Staten Island,4343
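The same per-district row counts can be obtained with `value_counts()`, which also makes it easy to see rows with no district at all. Note that 6 × 4343 = 26058 is less than the 29101 rows reported by `df.shape`, which would be consistent with some rows having a missing `borough`; that is an inference, not something the notebook shows directly. The `sample` frame below is a hypothetical toy example.

```python
import pandas as pd

# Hypothetical toy data, including one row with a missing borough.
sample = pd.DataFrame({"borough": ["Bronx", "Brooklyn", "Brooklyn", "EWR", None]})

# By default value_counts() skips missing values, like groupby() does.
counts = sample["borough"].value_counts()

# dropna=False adds a separate entry for the missing values.
counts_with_nan = sample["borough"].value_counts(dropna=False)

print(counts)
print(counts_with_nan)
```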


Calculate the total number of pickups:

In [26]:
df.pickups.sum()

14265773

Which district has the largest number of pickups:

In [31]:
df.groupby('borough').agg({'pickups': 'sum'}).sort_values('pickups', ascending=False)  

# .sort_values() sorts the rows by the given column; ascending=False puts the largest value first

borough,pickups
Manhattan,10367841
Brooklyn,2321035
Queens,1343528
Bronx,220047
Staten Island,6957
EWR,105


Calculate the total for a single district:

In [29]:
df.query('borough == "Manhattan"')['pickups'].sum()

10367841

Find the district with the fewest trips:

In [34]:
min_pickups = df.groupby('borough').agg({'pickups': 'sum'}).idxmin()
min_pickups

pickups    EWR
dtype: object
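Calling `idxmin()` on a DataFrame returns a Series with one entry per column, which is why the output above is labelled `pickups`. Selecting the column first gives a plain label instead. The sketch below uses the per-district totals copied from the sorted table earlier in the notebook; the names `totals`, `least` and `most` are illustrative.

```python
import pandas as pd

# Totals copied from the per-borough sums shown earlier in the notebook.
totals = pd.Series(
    {
        "Manhattan": 10367841,
        "Brooklyn": 2321035,
        "Queens": 1343528,
        "Bronx": 220047,
        "Staten Island": 6957,
        "EWR": 105,
    },
    name="pickups",
)

# On a Series, idxmin()/idxmax() return the index label itself.
least = totals.idxmin()
most = totals.idxmax()

print(least, most)
```

On the real data the equivalent would be `df.groupby('borough')['pickups'].sum().idxmin()`.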

Group the data by district and holiday flag, then compare the mean number of trips. Find the districts that have more trips on holidays than on workdays:

In [38]:
distr = df.groupby(['borough', 'hday'], as_index = False).agg({'pickups' : 'mean'})
distr.pivot(columns='hday', values='pickups', index='borough').query('Y > N')

borough,N,Y
EWR,0.023467,0.041916
Queens,308.899904,320.730539


Count the number of trips by district and month. Sort the result in descending order:

In [43]:
trips_by_month = df.groupby(['borough', 'pickup_month'], as_index=False) \
                  .agg({'pickups': 'sum'}) \
                  .sort_values('pickups', ascending=False)
trips_by_month

,borough,pickup_month,pickups
21,Manhattan,Jun,1995388
23,Manhattan,May,1888800
19,Manhattan,Feb,1718571
22,Manhattan,Mar,1661261
18,Manhattan,Apr,1648278
20,Manhattan,Jan,1455543
9,Brooklyn,Jun,482466
11,Brooklyn,May,476087
6,Brooklyn,Apr,378095
10,Brooklyn,Mar,346726
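Because `pickup_month` holds month abbreviations as strings, sorting on it directly would be alphabetical. One way to get calendar order is an ordered categorical, sketched here with the six Manhattan rows from the output above; the `month_order` list assumes the data covers January through June, as those rows suggest.

```python
import pandas as pd

# Assumed calendar order for the months that appear in the output above.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

# Manhattan rows copied from the trips_by_month output.
manhattan = pd.DataFrame({
    "pickup_month": ["Jun", "May", "Feb", "Mar", "Apr", "Jan"],
    "pickups": [1995388, 1888800, 1718571, 1661261, 1648278, 1455543],
})

# An ordered categorical makes sort_values() follow month_order, not the alphabet.
manhattan["pickup_month"] = pd.Categorical(
    manhattan["pickup_month"], categories=month_order, ordered=True
)
by_calendar = manhattan.sort_values("pickup_month")

print(by_calendar)
```

The six monthly values add up to 10367841, the Manhattan total reported earlier.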


Write a function that converts Fahrenheit to Celsius:

In [44]:
def temp_to_celcius(temp_f):
    return (temp_f - 32) * 5/9

In [45]:
temp_to_celcius(50)

10.0
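Since the function uses only arithmetic operators, it also works element-wise on a whole pandas Series, so no loop or `.apply()` is needed. A short sketch follows: the value 30.0 comes from the `temp` column in `df.head(3)`, while 50.0 and 212.0 are extra illustrative inputs.

```python
import pandas as pd

def temp_to_celcius(temp_f):
    # Same formula as above: subtract 32, then scale by 5/9.
    return (temp_f - 32) * 5 / 9

# 30.0 appears in the temp column; 50.0 and 212.0 are illustrative values.
temps_f = pd.Series([30.0, 50.0, 212.0], name="temp")

# Arithmetic on a Series is applied to every element at once.
temps_c = temp_to_celcius(temps_f).round(1)

print(temps_c.tolist())
```

On the real data this could become a new column, for example `df['temp_c'] = temp_to_celcius(df['temp'])`, where `temp_c` is a hypothetical column name.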