# 1 Common data problems

This section covers some of the most common dirty data problems: converting data types, applying range constraints to remove future data points, and removing duplicated data points to avoid double-counting.

## Data types constraints

### Numeric data or ... ?

We explore bicycle ride sharing data from San Francisco, stored in a DataFrame called `ride_sharing`. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

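The `user_type` column contains information on whether a user is taking a free ride. It is described as taking the following values, although the CSV loaded below stores the text labels `Subscriber` and `Customer` instead: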

* `1` for free riders.
* `2` for pay per ride.
* `3` for monthly subscribers.

In [1]:
import pandas as pd
import datetime as dt

In [11]:
ride_sharing = pd.read_csv("../data/ride_sharing.csv")
ride_sharing.head()

| | ride_id | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender | tire_sizes | ride_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11 | 16 | Steuart St at Market St | 93 | 4th St at Mission Bay Blvd S | 5504 | Subscriber | 1988 | Male | 27 | 2018-03-04 |
| 1 | 1 | 8 | 3 | Powell St BART Station (Market St at 4th St) | 93 | 4th St at Mission Bay Blvd S | 2915 | Subscriber | 1988 | Male | 27 | 2017-03-27 |
| 2 | 2 | 11 | 15 | San Francisco Ferry Building (Harry Bridges Pl... | 67 | San Francisco Caltrain Station 2 (Townsend St... | 5340 | Customer | 1988 | Male | 26 | 2019-06-30 |
| 3 | 3 | 7 | 21 | Montgomery St BART Station (Market St at 2nd St) | 50 | 2nd St at Townsend St | 746 | Subscriber | 1969 | Male | 27 | 2018-11-16 |
| 4 | 4 | 11 | 81 | Berry St at 4th St | 21 | Montgomery St BART Station (Market St at 2nd St) | 5477 | Subscriber | 1986 | Male | 26 | 2017-11-01 |


In [12]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ride_id          78 non-null     int64 
 1   duration         78 non-null     int64 
 2   station_A_id     78 non-null     int64 
 3   station_A_name   78 non-null     object
 4   station_B_id     78 non-null     int64 
 5   station_B_name   78 non-null     object
 6   bike_id          78 non-null     int64 
 7   user_type        78 non-null     object
 8   user_birth_year  78 non-null     int64 
 9   user_gender      78 non-null     object
 10  tire_sizes       78 non-null     int64 
 11  ride_date        78 non-null     object
dtypes: int64(7), object(5)
memory usage: 7.4+ KB


In [13]:
ride_sharing.user_type.describe()

count             78
unique             2
top       Subscriber
freq              71
Name: user_type, dtype: object

These summary statistics don't say much about how users are distributed across purchase types. Because the `user_type` column has a finite set of possible values that represent groupings of data, it should be converted to `category`.

In [14]:
ride_sharing["user_type_cat"] = ride_sharing.user_type.astype("category")
assert ride_sharing.user_type_cat.dtype == "category"

In [15]:
ride_sharing.user_type_cat.describe()

count             78
unique             2
top       Subscriber
freq              71
Name: user_type_cat, dtype: object

It seems that most rides were taken by subscribers: `Subscriber` is the top value, with 71 of the 78 rows.
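To see how a categorical column is distributed across its labels, `value_counts()` is often more direct than `describe()`. The following is a minimal sketch on a small made-up Series (not the real `ride_sharing` data), showing the conversion to `category` and the per-label counts.

```python
import pandas as pd

# Hypothetical toy data standing in for ride_sharing["user_type"]
user_type = pd.Series(["Subscriber", "Customer", "Subscriber", "Subscriber"])

# Convert to category: a finite set of labels backed by integer codes
user_type_cat = user_type.astype("category")

# The distinct labels, and how many rows carry each label
categories = list(user_type_cat.cat.categories)
counts = user_type_cat.value_counts()

print(categories)
print(counts)
```

On the real data, the same call would be `ride_sharing["user_type_cat"].value_counts()`.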

### Summing strings and concatenating numbers

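Convert the `duration` column to the type `int`. The conversion assumes `duration` is stored as strings with a `minutes` suffix; in the CSV loaded here, `dtypes` shows that the column is already `int64`.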

In [16]:
ride_sharing.dtypes

ride_id               int64
duration              int64
station_A_id          int64
station_A_name       object
station_B_id          int64
station_B_name       object
bike_id               int64
user_type            object
user_birth_year       int64
user_gender          object
tire_sizes            int64
ride_date            object
user_type_cat      category
dtype: object

In [17]:
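# duration is already int64 in this CSV, so calling .str on it raises
# "AttributeError: Can only use .str accessor with string values!".
# Strip the "minutes" suffix and cast only when the column is not numeric yet.
if not pd.api.types.is_numeric_dtype(ride_sharing["duration"]):
    ride_sharing["duration"] = ride_sharing["duration"].str.strip("minutes").astype("int")
ride_sharing.duration.mean()

An average of about 11 minutes is really not bad for a ride in a city like San Francisco.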
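The section title hints at why the data type matters: summing a string column concatenates its values instead of adding them. Here is a small sketch on made-up data; the `"<number> minutes"` format is an assumption about how such a column might be stored, not something taken from the CSV above.

```python
import pandas as pd

# Hypothetical toy column; the " minutes" suffix is an assumed raw format
duration = pd.Series(["12 minutes", "8 minutes", "11 minutes"], dtype="object")

# Summing strings concatenates them rather than adding numbers
concatenated = duration.sum()

# Remove the suffix, trim leftover whitespace, then cast to integers
duration_int = (
    duration.str.replace("minutes", "", regex=False).str.strip().astype("int")
)

# Verify the conversion before computing statistics
assert pd.api.types.is_integer_dtype(duration_int)
total = duration_int.sum()

print(repr(concatenated))
print(total)
```

Using `str.replace` with `regex=False` removes the literal suffix, whereas `str.strip("minutes")` strips any of those characters from both ends of the string.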

## Data Range Constraints

### Tire size constraints

Next, we work with the `tire_sizes` column, which contains data on each bike's tire size. Bicycle tire sizes can be `26″`, `27″` or `29″` and are best stored as categorical values (in the CSV loaded here, `info()` reports the column as `int64`). In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to `27″`. Let's make sure the `tire_sizes` column has the correct range by first converting it to an integer, then setting and testing the new upper limit of `27″` for tire sizes.

In [18]:
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
ride_sharing['tire_sizes'].describe()

count     78
unique     2
top       27
freq      45
Name: tire_sizes, dtype: int64
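As an alternative to the boolean-mask assignment, an upper limit on a numeric column can be applied with `Series.clip`. The sketch below uses made-up tire sizes and is only one possible way to enforce the constraint.

```python
import pandas as pd

# Hypothetical tire sizes, including values above the new limit
tire_sizes = pd.Series([26, 27, 29, 27, 29])

# Cap every value above 27 at 27; values at or below 27 are unchanged
capped = tire_sizes.clip(upper=27)

# Test the range constraint before converting back to category
assert capped.max() <= 27
capped_cat = capped.astype("category")

print(capped.tolist())
print(list(capped_cat.cat.categories))
```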

### Back to the future

A bug was discovered that recorded rides taken today as taken next year. To fix this, we will find all values in the `ride_date` column that lie in the future and set them to today's date, so that today becomes the maximum possible value of this column. Before doing so, we need to convert `ride_date` to a datetime object.

In [19]:
# Convert ride_date to datetime
ride_sharing['ride_date'] = pd.to_datetime(ride_sharing['ride_date'])

# Save today's date
today = pd.to_datetime('today')

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_date'] > today, 'ride_date'] = today

# Print maximum of ride_date column
ride_sharing['ride_date'].max()

Timestamp('2020-01-17 00:00:00')
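The same capping logic can be checked on a tiny made-up Series. This sketch assumes that "today" means midnight of the current day, hence the call to `normalize()`; the dates are hypothetical.

```python
import pandas as pd

# Hypothetical ride dates; the last one lies in the future
ride_date = pd.to_datetime(pd.Series(["2018-03-04", "2019-06-30", "2100-01-01"]))

# Today's date at midnight (assumption: the time of day is not needed)
today = pd.Timestamp.today().normalize()

# Replace every future date with today's date
capped = ride_date.copy()
capped[capped > today] = today

# The range constraint now holds
assert capped.max() <= today
print(capped.max())
```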

## Uniqueness Constraints

### Finding duplicates

The number of rides taken has increased by `20%` overnight, leading us to think there might be both complete and incomplete duplicates in the `ride_sharing` DataFrame. Let's confirm this suspicion by finding those duplicates. 

In [20]:
# Find duplicates
duplicates = ride_sharing.duplicated('ride_id', keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])

    ride_id  duration  user_birth_year
22       33        10             1979
39       33         2             1979
53       55         9             1985
65       55         9             1985
74       71        11             1997
75       71        11             1997
76       89         9             1986
77       89         9             2060


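Notice that rides 33 and 89 are incomplete duplicates: their rows share a `ride_id` but differ in `duration` (ride 33) or `user_birth_year` (ride 89). The remaining rides, 55 and 71, are complete duplicates in the columns shown.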
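The distinction between complete and incomplete duplicates can also be computed rather than read off by eye. A minimal sketch on a toy DataFrame follows; rides 33 and 55 reuse the values printed above, and ride 60 is a made-up unique ride.

```python
import pandas as pd

rides = pd.DataFrame({
    "ride_id": [33, 33, 55, 55, 60],
    "duration": [10, 2, 9, 9, 7],
    "user_birth_year": [1979, 1979, 1985, 1985, 1990],
})

# Rows that share a ride_id with at least one other row
same_id = rides.duplicated(subset="ride_id", keep=False)

# Rows that are identical to another row in every column
complete = rides.duplicated(keep=False)

# Shared ride_id, but at least one other column differs
incomplete = same_id & ~complete

complete_ids = sorted(rides.loc[complete, "ride_id"].unique().tolist())
incomplete_ids = sorted(rides.loc[incomplete, "ride_id"].unique().tolist())

print(complete_ids, incomplete_ids)
```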

### Treating duplicates

Let's treat those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average `duration`, and the minimum `user_birth_year` for each set of incomplete duplicate rows.

In [21]:
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
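To make the effect of the two steps concrete, here is a sketch on a toy DataFrame built from three of the duplicated rides printed earlier (33, 55 and 89). It only includes the three columns shown there, so it is an illustration rather than a rerun of the full dataset.

```python
import pandas as pd

rides = pd.DataFrame({
    "ride_id": [33, 33, 55, 55, 89, 89],
    "duration": [10, 2, 9, 9, 9, 9],
    "user_birth_year": [1979, 1979, 1985, 1985, 1986, 2060],
})

# Step 1: drop complete duplicates (removes one of the two ride 55 rows)
deduped = rides.drop_duplicates()

# Step 2: merge incomplete duplicates, keeping the earliest birth year
# and the average duration per ride_id
merged = (
    deduped.groupby("ride_id")
    .agg({"user_birth_year": "min", "duration": "mean"})
    .reset_index()
)

# Each ride_id now appears exactly once
assert not merged["ride_id"].duplicated().any()
print(merged)
```

Ride 33 ends up with an average duration of 6 minutes, and ride 89 keeps 1986 as its birth year instead of 2060.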

# 2 Text and categorical data problems

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this section, we will fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

## Membership constraints

We will be working with the `airlines` DataFrame, which contains survey responses about San Francisco Airport from airline customers.