# 1 Common data problems

This section covers some of the most common dirty data problems: converting data types, applying range constraints to remove future data points, and removing duplicated data points to avoid double-counting.

## Data types constraints

### Numeric data or ... ?

This section explores bicycle ride sharing data from San Francisco, stored in a DataFrame called `ride_sharing`. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.

The `user_type` column contains information on whether a user is taking a free ride and takes on the following values:

* `1` for free riders.
* `2` for pay per ride.
* `3` for monthly subscribers.

In [1]:
import pandas as pd

In [2]:
ride_sharing = pd.read_csv("https://assets.datacamp.com/production/repositories/5737/datasets/023d88638863562a427a87539e371d9f2a7190f3/ride_sharing_new.csv")
ride_sharing.head()

| Unnamed: 0.1 | Unnamed: 0 | duration | station_A_id | station_A_name | station_B_id | station_B_name | bike_id | user_type | user_birth_year | user_gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 12 minutes | 81 | Berry St at 4th St | 323 | Broadway at Kearny | 5480 | 2 | 1959 | Male |
| 1 | 1 | 24 minutes | 3 | Powell St BART Station (Market St at 4th St) | 118 | Eureka Valley Recreation Center | 5193 | 2 | 1965 | Male |
| 2 | 2 | 8 minutes | 67 | San Francisco Caltrain Station 2 (Townsend St... | 23 | The Embarcadero at Steuart St | 3652 | 3 | 1993 | Male |
| 3 | 3 | 4 minutes | 16 | Steuart St at Market St | 28 | The Embarcadero at Bryant St | 1883 | 1 | 1979 | Male |
| 4 | 4 | 11 minutes | 22 | Howard St at Beale St | 350 | 8th St at Brannan St | 4626 | 2 | 1994 | Male |


In [3]:
ride_sharing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB


In [4]:
ride_sharing.user_type.describe()

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64

The summary statistics don't say much about how users are distributed across purchase types, because they treat `user_type` as a number. The column has a finite set of possible values that represent groupings of data, so it should be converted to `category`.

In [5]:
ride_sharing["user_type_cat"] = ride_sharing.user_type.astype("category")
assert ride_sharing.user_type_cat.dtype == "category"

In [6]:
ride_sharing.user_type_cat.describe()

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64

It seems that most users are pay-per-ride users: type `2` accounts for 12,972 of the 25,760 rows, just over half.
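To make the categories easier to read, the integer codes can be mapped to labels before converting to `category`. The following is a minimal sketch on a toy Series, not the real dataset: the sample values and the label names (`free`, `pay_per_ride`, `monthly_subscriber`) are assumptions derived from the value list above.

```python
import pandas as pd

# Toy stand-in for the user_type column (values invented for illustration)
user_type = pd.Series([1, 2, 2, 3, 2, 3], name="user_type")

# Hypothetical label names, based on the meaning of codes 1, 2 and 3 in the text
labels = {1: "free", 2: "pay_per_ride", 3: "monthly_subscriber"}

# Map codes to labels, then store the result as a categorical column
user_type_cat = user_type.map(labels).astype("category")

# value_counts shows how the rows are distributed across the groups
counts = user_type_cat.value_counts()
print(counts)
```

Compared with `describe()`, `value_counts()` reports the frequency of every category, not only the most common one.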

### Summing strings and concatenating numbers

Next, convert the string column `duration` to the type `int`.

In [7]:
ride_sharing.dtypes

Unnamed: 0            int64
duration             object
station_A_id          int64
station_A_name       object
station_B_id          int64
station_B_name       object
bike_id               int64
user_type             int64
user_birth_year       int64
user_gender          object
user_type_cat      category
dtype: object

In [8]:
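# str.strip("minutes") removes any of the characters m, i, n, u, t, e, s from both ends;
# the trailing space that is left over is tolerated by the int conversion
ride_sharing["duration"] = ride_sharing.duration.str.strip("minutes").astype('int')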
ride_sharing.duration.mean()

11.389052795031056

An average of about 11 minutes per ride is not bad for a city like San Francisco.
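One detail worth keeping in mind: `str.strip` treats its argument as a set of characters to remove from both ends, not as a literal suffix. The sketch below illustrates this on a toy Series (the sample values are taken from the first rows shown earlier) and shows one possible alternative that removes the exact text `" minutes"`. It is an illustration under these assumptions, not a replacement for the approach used above.

```python
import pandas as pd

# Toy stand-in for the duration column
duration = pd.Series(["12 minutes", "24 minutes", "8 minutes"])

# strip("minutes") removes the characters m, i, n, u, t, e, s from both ends,
# so the space before "minutes" is left behind
stripped = duration.str.strip("minutes")
print(stripped.tolist())  # ['12 ', '24 ', '8 ']

# Alternative: remove the literal text " minutes", then convert to integers
minutes = duration.str.replace(" minutes", "", regex=False).astype("int")
print(minutes.tolist())  # [12, 24, 8]
```

Both approaches give the same integers here; the literal replacement is simply less likely to remove characters by accident if the text around the number ever changes.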