# Pandas Essentials: Data Wrangling

This Pandas notebook illustrates the essentials of "wrangling" your data. We focus on three months of data (April through June 2016) from the [Bay Area Bike Share](http://www.bayareabikeshare.com/open-data) program.

Topics include:

* Transforming Data Types
* Dropping Columns
* Renaming Columns
* Setting Indexes
* Dealing with NA Values

# Loading the Bike Share Data Sets

In [16]:
import pandas as pd

# Read in the Station Data
stations_df = pd.read_csv("data/babs_station_data.csv")

# Read in the Bike Share Weather Data
weather_df = pd.read_csv("data/babs_weather_april_thru_june_2016.csv")

# Read in the trips data
trips_df = pd.read_csv("data/babs_trips_april_thru_june_2016.csv")

# Set max display columns and rows (for more compact view)
pd.options.display.max_columns = 10
pd.options.display.max_rows = 15
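Some of the type conversions shown in the next section can also be requested while the file is being read. The following is a minimal sketch, assuming a tiny in-memory CSV stands in for the real weather file (the two sample rows are made up for illustration). `parse_dates` and `dtype` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the weather file
sample_csv = io.StringIO(
    "PDT,Max TemperatureF,ZIP\n"
    "4/1/2016,70,94107\n"
    "4/2/2016,68,94107\n"
)

# Parse PDT as datetimes and keep ZIP as a string while reading
sample_df = pd.read_csv(sample_csv, parse_dates=["PDT"], dtype={"ZIP": str})
print(sample_df.dtypes)
```

Converting after loading, as this notebook does, makes each step visible; converting at load time may be more convenient once the target types are known.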

# Transforming Data Types

In [17]:
# Before any transformation, these are our weather data types
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 24 columns):
PDT                           455 non-null object
Max TemperatureF              455 non-null int64
Mean TemperatureF             455 non-null int64
Min TemperatureF              455 non-null int64
Max Dew PointF                455 non-null int64
MeanDew PointF                455 non-null int64
Min DewpointF                 455 non-null int64
Max Humidity                  455 non-null int64
 Mean Humidity                455 non-null int64
 Min Humidity                 455 non-null int64
 Max Sea Level PressureIn     455 non-null float64
 Mean Sea Level PressureIn    455 non-null float64
 Min Sea Level PressureIn     455 non-null float64
 Max VisibilityMiles          454 non-null float64
 Mean VisibilityMiles         454 non-null float64
 Min VisibilityMiles          454 non-null float64
 Max Wind SpeedMPH            455 non-null int64
 Mean Wind SpeedMPH           455 non-null int64


In [18]:
# Convert PDT from object to datetime, and ZIP to a string label
weather_df["PDT"] = pd.to_datetime(weather_df["PDT"])
weather_df["ZIP"] = weather_df["ZIP"].astype(int).astype(str)
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 455 entries, 0 to 454
Data columns (total 24 columns):
PDT                           455 non-null datetime64[ns]
Max TemperatureF              455 non-null int64
Mean TemperatureF             455 non-null int64
Min TemperatureF              455 non-null int64
Max Dew PointF                455 non-null int64
MeanDew PointF                455 non-null int64
Min DewpointF                 455 non-null int64
Max Humidity                  455 non-null int64
 Mean Humidity                455 non-null int64
 Min Humidity                 455 non-null int64
 Max Sea Level PressureIn     455 non-null float64
 Mean Sea Level PressureIn    455 non-null float64
 Min Sea Level PressureIn     455 non-null float64
 Max VisibilityMiles          454 non-null float64
 Mean VisibilityMiles         454 non-null float64
 Min VisibilityMiles          454 non-null float64
 Max Wind SpeedMPH            455 non-null int64
 Mean Wind SpeedMPH           455 non-null int64
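The two-step cast `astype(int).astype(str)` used for `ZIP` matters mainly when a numeric code has been read as a float. The sketch below uses a made-up sample to show the difference; whether the real `ZIP` column is a float or an integer is not visible in the truncated output above, so treat the float case as an assumption.

```python
import pandas as pd

# Hypothetical sample: ZIP stored as float, dates stored as text
sample = pd.DataFrame({
    "PDT": ["4/1/2016", "4/2/2016"],
    "ZIP": [94107.0, 94301.0],
})

# Casting a float straight to str keeps the trailing ".0"
direct = sample["ZIP"].astype(str)

# Casting to int first yields a clean label
clean = sample["ZIP"].astype(int).astype(str)

# Text dates become proper datetimes
sample["PDT"] = pd.to_datetime(sample["PDT"])

print(direct.tolist())
print(clean.tolist())
```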

In [19]:
# Before any transformation, these are our trips data types
trips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83537 entries, 0 to 83536
Data columns (total 11 columns):
Trip ID            83537 non-null int64
Duration           83537 non-null int64
Start Date         83537 non-null object
Start Station      83537 non-null object
Start Terminal     83537 non-null int64
End Date           83537 non-null object
End Station        83537 non-null object
End Terminal       83537 non-null int64
Bike #             83537 non-null int64
Subscriber Type    83537 non-null object
Zip Code           83509 non-null object
dtypes: int64(5), object(6)
memory usage: 7.0+ MB


In [20]:
# Convert the start and end dates from object to datetime
trips_df["Start Date"] = pd.to_datetime(trips_df["Start Date"])
trips_df["End Date"] = pd.to_datetime(trips_df["End Date"])

In [21]:
# We can use the Series dt accessor to extract specific date/time elements
# http://pandas.pydata.org/pandas-docs/version/0.15.2/api.html#series
type(trips_df["Start Date"].dt)

pandas.tseries.common.DatetimeProperties

In [22]:
# For example, we can extract just the date, just the month or just the day of week
trips_df["Start Date Only"] = trips_df["Start Date"].dt.date
trips_df["Start Date Month"] = trips_df["Start Date"].dt.month
trips_df["Start Date Day of Week"] = trips_df["Start Date"].dt.dayofweek
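To make the `dt` accessor's conventions concrete, here is a small sketch on two hand-picked timestamps (the first matches the earliest trip shown later in this notebook; the second is made up). Note that `dayofweek` counts from Monday as 0, so a Friday comes out as 4.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2016-04-01 00:30:00", "2016-04-03 17:05:00"]))

# Monday is 0 and Sunday is 6
print(stamps.dt.dayofweek.tolist())

# day_name() gives a readable label instead of a number
print(stamps.dt.day_name().tolist())

# Other components, such as the hour, are available the same way
print(stamps.dt.hour.tolist())
```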

# Dropping Columns

In [23]:
# Original set of columns
stations_df.columns

Index([u'station_id', u'name', u'lat', u'long', u'dockcount', u'landmark',
       u'installation'],
      dtype='object')

In [24]:
# Drop a few columns which are not essential to our analysis
stations_df.drop(labels=["installation"], axis="columns", inplace=True)
stations_df.drop(labels=["lat", "long"], axis="columns", inplace=True)
stations_df.columns

Index([u'station_id', u'name', u'dockcount', u'landmark'], dtype='object')
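The cell above modifies `stations_df` in place. As an alternative sketch, `drop` can also return a new DataFrame and leave the original untouched, which makes it easier to re-run a cell. The sample rows and station names below are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the station data
sample = pd.DataFrame({
    "station_id": [2, 3],
    "name": ["Station A", "Station B"],
    "lat": [37.33, 37.33],
    "long": [-121.90, -121.89],
})

# Returns a new frame; `sample` itself is left unchanged
trimmed = sample.drop(columns=["lat", "long"])

# errors="ignore" skips labels that are no longer present,
# so dropping the same column twice does not raise
trimmed_again = trimmed.drop(columns=["lat"], errors="ignore")

print(list(trimmed.columns))
```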

# Renaming Columns

In [25]:
# Original set of columns
trips_df.columns

Index([u'Trip ID', u'Duration', u'Start Date', u'Start Station',
       u'Start Terminal', u'End Date', u'End Station', u'End Terminal',
       u'Bike #', u'Subscriber Type', u'Zip Code', u'Start Date Only',
       u'Start Date Month', u'Start Date Day of Week'],
      dtype='object')

In [26]:
# Rename a few columns to make them more explicit
trips_df.rename(columns={
    "Duration": "Duration Seconds",
    "Start Date": "Start Datetime",
    "End Date": "End Datetime",
}, inplace=True)
trips_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83537 entries, 0 to 83536
Data columns (total 14 columns):
Trip ID                   83537 non-null int64
Duration Seconds          83537 non-null int64
Start Datetime            83537 non-null datetime64[ns]
Start Station             83537 non-null object
Start Terminal            83537 non-null int64
End Datetime              83537 non-null datetime64[ns]
End Station               83537 non-null object
End Terminal              83537 non-null int64
Bike #                    83537 non-null int64
Subscriber Type           83537 non-null object
Zip Code                  83509 non-null object
Start Date Only           83537 non-null object
Start Date Month          83537 non-null int64
Start Date Day of Week    83537 non-null int64
dtypes: datetime64[ns](2), int64(7), object(5)
memory usage: 8.9+ MB


In [27]:
# You can also normalize all column names at once
# In this case, replace all spaces with underscores and convert to upper case
trips_df.columns = trips_df.columns.str.replace(" ", "_")
trips_df.columns = trips_df.columns.str.upper()
trips_df.columns

Index([u'TRIP_ID', u'DURATION_SECONDS', u'START_DATETIME', u'START_STATION',
       u'START_TERMINAL', u'END_DATETIME', u'END_STATION', u'END_TERMINAL',
       u'BIKE_#', u'SUBSCRIBER_TYPE', u'ZIP_CODE', u'START_DATE_ONLY',
       u'START_DATE_MONTH', u'START_DATE_DAY_OF_WEEK'],
      dtype='object')
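Normalization can go further than spaces and case. Several weather columns above carry a leading space (for example ` Mean Humidity`), and `BIKE_#` still contains a symbol. The sketch below passes a function to `rename`; the `normalize` helper and the choice to spell `#` as `NUM` are illustrative assumptions, not something this notebook prescribes.

```python
import pandas as pd

# Empty frame with a few column names in the style of the source data
sample = pd.DataFrame(columns=["Trip ID", "Bike #", " Mean Humidity"])

def normalize(name):
    """Hypothetical helper: trim, replace '#', underscore, upper-case."""
    return name.strip().replace("#", "NUM").replace(" ", "_").upper()

# rename accepts a function that is applied to every column label
renamed = sample.rename(columns=normalize)
print(list(renamed.columns))
```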

# Setting Indexes

In [28]:
# You can set a specific column as an index
# For example, we can set Trip ID as the index
trips_df.set_index(keys = "TRIP_ID", drop=True, inplace=True)
trips_df.head()

| TRIP_ID | DURATION_SECONDS | START_DATETIME | START_STATION | START_TERMINAL | END_DATETIME | ... | SUBSCRIBER_TYPE | ZIP_CODE | START_DATE_ONLY | START_DATE_MONTH | START_DATE_DAY_OF_WEEK |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1145294 | 991 | 2016-04-01 00:30:00 | Embarcadero at Sansome | 60 | 2016-04-01 00:47:00 | ... | Subscriber | 94109 | 2016-04-01 | 4 | 4 |
| 1145295 | 1164 | 2016-04-01 04:49:00 | Temporary Transbay Terminal (Howard at Beale) | 55 | 2016-04-01 05:08:00 | ... | Subscriber | 95113 | 2016-04-01 | 4 | 4 |
| 1145296 | 729 | 2016-04-01 05:00:00 | Market at 10th | 67 | 2016-04-01 05:13:00 | ... | Subscriber | 94102 | 2016-04-01 | 4 | 4 |
| 1145297 | 367 | 2016-04-01 05:15:00 | Steuart at Market | 74 | 2016-04-01 05:21:00 | ... | Subscriber | 94015 | 2016-04-01 | 4 | 4 |
| 1145298 | 366 | 2016-04-01 05:17:00 | Market at 10th | 67 | 2016-04-01 05:23:00 | ... | Subscriber | 94102 | 2016-04-01 | 4 | 4 |
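Once a column is the index, rows can be looked up by label with `.loc`, and `reset_index` undoes the change. A minimal sketch on a two-row sample (values taken from the first two trips above):

```python
import pandas as pd

sample = pd.DataFrame({
    "TRIP_ID": [1145294, 1145295],
    "DURATION_SECONDS": [991, 1164],
}).set_index("TRIP_ID")

# Label-based lookup: row by trip ID, column by name
duration = sample.loc[1145295, "DURATION_SECONDS"]
print(duration)

# reset_index moves TRIP_ID back into a regular column
restored = sample.reset_index()
print(list(restored.columns))
```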


# Dealing with NA Values

In [32]:
# A handy trick to count the NA values in each column
trips_df.isnull().sum()

DURATION_SECONDS           0
START_DATETIME             0
START_STATION              0
START_TERMINAL             0
END_DATETIME               0
END_STATION                0
END_TERMINAL               0
BIKE_#                     0
SUBSCRIBER_TYPE            0
ZIP_CODE                  28
START_DATE_ONLY            0
START_DATE_MONTH           0
START_DATE_DAY_OF_WEEK     0
dtype: int64

In [34]:
# Drop Rows with Any NA values
trips_df_subset = trips_df.dropna(axis='index', how='any')

In [38]:
len(trips_df) - len(trips_df_subset)

28

In [40]:
# Replace NA values with a specified string
trips_df_fixed = trips_df.fillna("Not Specified", axis="index")
trips_df_fixed.isnull().sum()

DURATION_SECONDS          0
START_DATETIME            0
START_STATION             0
START_TERMINAL            0
END_DATETIME              0
END_STATION               0
END_TERMINAL              0
BIKE_#                    0
SUBSCRIBER_TYPE           0
ZIP_CODE                  0
START_DATE_ONLY           0
START_DATE_MONTH          0
START_DATE_DAY_OF_WEEK    0
dtype: int64
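Both `dropna` and `fillna` can be limited to particular columns, which may be safer than treating the whole frame at once. The sketch below uses a made-up three-row sample; `subset` and the dictionary form of `fillna` are standard pandas options.

```python
import pandas as pd

# Hypothetical sample with one missing zip code
sample = pd.DataFrame({
    "DURATION_SECONDS": [991, 1164, 729],
    "ZIP_CODE": ["94109", None, "94102"],
})

# Drop rows only when ZIP_CODE is missing, ignoring NAs elsewhere
only_zip_known = sample.dropna(subset=["ZIP_CODE"])

# Fill per column: the dict maps column name to replacement value
filled = sample.fillna({"ZIP_CODE": "Not Specified"})

print(len(only_zip_known))
print(filled["ZIP_CODE"].tolist())
```

Filling per column avoids writing a text placeholder into numeric or datetime columns that happen to contain NA values.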