# Data Cleaning in Python: Working with MTA Turnstile Data
### Why clean your data?
- It makes everything easier
- It makes your analyses and models more accurate
- It's kind of required for most projects
- A lot of datasets are SUPER MESSY, like the one we're using today from the MTA

### What dataset are we using? Why?

This dataset is released every week and contains observations for every single turnstile in the MTA's subway network, one every 4 hours, from the previous Sunday to Saturday. Each observation records the cumulative entries and exits for that turnstile. For users who want to look at subway usage data, this is one of the only sources available, and it is also incredibly difficult to comprehend at first. See: http://web.mta.info/developers/turnstile.html

## Importing Libraries

import pandas as pd
import numpy as np
import datetime as dt
from datetime import date
import matplotlib.pyplot as plt
%matplotlib inline

## Setting a date range
Note the format of the csv URLs: http://web.mta.info/developers/data/nyct/turnstile/turnstile_181124.txt

The last part of the file name is the thru date in YYMMDD format: 181124 is Nov 24, 2018.
We want more than just one week's worth of data, so we'll create an array of dates in our range using a pandas function called date_range.

Then, we'll create an array of URLs using the dates in the format the MTA uses, so that we can pull multiple CSVs at once. 
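
The two steps above can be sketched as follows. This is a minimal sketch rather than the notebook's original code: the start date and the names `thru_dates`, `BASE_URL`, and `urls` are assumptions chosen for illustration. It relies on the thru dates falling on Saturdays, as described earlier.

```python
import pandas as pd

# Weekly thru dates. "W-SAT" asks pandas for one date per week, on Saturdays.
# The start and end dates are illustrative assumptions.
thru_dates = pd.date_range(start="2018-09-01", end="2018-11-24", freq="W-SAT")

# URL pattern taken from the example URL above; the date slot is YYMMDD.
BASE_URL = "http://web.mta.info/developers/data/nyct/turnstile/turnstile_{}.txt"

# Format each date the way the MTA file names expect, e.g. 181124.
urls = [BASE_URL.format(d.strftime("%y%m%d")) for d in thru_dates]

print(len(urls), "weekly files")
print(urls[-1])
```

Changing `start` and `end` is enough to pull a different range of weeks.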

## Creating a dataframe
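
One way to build the dataframe is to read each weekly file with `pd.read_csv` and stack the results with `pd.concat`. The sketch below is only an assumption about how this step could look. The helper name `load_turnstile_files` is hypothetical, and the two in-memory files hold made-up rows with a simplified set of columns so that the example runs without a network connection.

```python
import io
import pandas as pd

def load_turnstile_files(sources):
    """Read each weekly file and stack them into a single dataframe."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Made-up stand-ins for two weekly files (not real MTA data).
week_1 = io.StringIO(
    "STATION,LINENAME,DATE,TIME,ENTRIES,EXITS\n"
    "116 ST,1,09/29/2018,04:00:00,100,50\n"
)
week_2 = io.StringIO(
    "STATION,LINENAME,DATE,TIME,ENTRIES,EXITS\n"
    "116 ST,1,10/06/2018,04:00:00,900,410\n"
)

df = load_turnstile_files([week_1, week_2])
print(df)
```

Passing a list of weekly file URLs instead of the in-memory files should work the same way, since `pd.read_csv` accepts URLs.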

## Cleaning the data
First, we'll clean up the column names. Then we'll work through some common issues in this dataset.
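
A common way to tidy column names is to strip stray whitespace, lowercase everything, and replace awkward characters. The sketch below assumes the raw header looks like the list shown, including trailing spaces after EXITS. Check it against your own download before relying on it.

```python
import pandas as pd

# Assumed raw header; note the trailing spaces on the last column name.
raw_columns = ["C/A", "UNIT", "SCP", "STATION", "LINENAME", "DIVISION",
               "DATE", "TIME", "DESC", "ENTRIES", "EXITS      "]
df = pd.DataFrame(columns=raw_columns)

# Strip whitespace, lowercase, and swap "/" for "_" so names are easy to type.
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace("/", "_", regex=False)
)

print(list(df.columns))
```

After this step, a column such as exits can be accessed as `df["exits"]` without worrying about invisible trailing characters.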

### Aggregating by station
Right now, every observation represents one turnstile at one station at a particular date and time, with the cumulative entries and exits. However, we don't really care about individual turnstiles, and they actually make our data harder to read. For example, there will be more than 10 observations at the 116th Street 1 station on 9/29/18 at 4 am alone. Instead, we'll group the data by station, line name, date, and time, and sum the cumulative entries and exits.
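
The grouping described above can be sketched with `groupby` and `sum`. The rows below are made up for illustration, and the lowercase column names are an assumption about how the columns were renamed earlier.

```python
import pandas as pd

# Made-up turnstile-level rows: two turnstiles report at 116 ST at 4 am.
df = pd.DataFrame({
    "station":  ["116 ST", "116 ST", "116 ST", "125 ST"],
    "linename": ["1", "1", "1", "1"],
    "date":     ["09/29/2018"] * 4,
    "time":     ["04:00:00", "04:00:00", "08:00:00", "04:00:00"],
    "entries":  [100, 200, 350, 1000],
    "exits":    [50, 60, 130, 400],
})

# One row per station, line, date, and time, summing across turnstiles.
by_station = (
    df.groupby(["station", "linename", "date", "time"], as_index=False)
      [["entries", "exits"]]
      .sum()
)

print(by_station)
```

The two 4 am rows for 116 ST collapse into a single row, so the result has one observation per station and time slot.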

### Removing entries at odd times

This particular section has a couple of problems. Not only is the entry/exit counter resetting, but if you look closely at the datetimes, you'll notice that some of the observations are not 4 hours apart. First, we'll create a variable for the time difference between consecutive observations. Then, we'll remove an observation if its time difference is not 4 hours AND the time difference of the observation after it is also not 4 hours. This is because we wouldn't want to drop the 10/15 observation at 12 pm, but we would want to drop the 10/15 observations at 9:24 and 10:47.
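
Here is one way the rule could be written. This sketch makes two assumptions that should be checked against the real data: each row stores the hours elapsed since the previous observation at the same station, and missing gaps at the start or end of a station's series are treated as regular. The timestamps are made up to mirror the 10/15 example, and the names `hours_since_prev`, `next_gap`, and `cleaned` are illustrative.

```python
import pandas as pd

# Made-up observations: 9:24 and 10:47 fall between the regular 4-hour slots.
obs = pd.DataFrame({
    "station": ["116 ST"] * 7,
    "datetime": pd.to_datetime([
        "2018-10-15 00:00", "2018-10-15 04:00", "2018-10-15 08:00",
        "2018-10-15 09:24", "2018-10-15 10:47",
        "2018-10-15 12:00", "2018-10-15 16:00",
    ]),
})
obs = obs.sort_values(["station", "datetime"]).reset_index(drop=True)

# Hours since the previous observation at the same station (NaN for the first row).
obs["hours_since_prev"] = (
    obs.groupby("station")["datetime"].diff().dt.total_seconds() / 3600
)

# The same gap, but for the row that follows within the same station.
next_gap = obs.groupby("station")["hours_since_prev"].shift(-1)

# Missing gaps at the edges are treated as regular 4-hour gaps (an assumption).
off_now = obs["hours_since_prev"].fillna(4).ne(4)
off_next = next_gap.fillna(4).ne(4)

# Drop a row only when both its own gap and the next row's gap are irregular.
cleaned = obs[~(off_now & off_next)].reset_index(drop=True)

print(cleaned)
```

The 12 pm row survives because the row after it (4 pm) has a regular 4-hour gap, while 9:24 and 10:47 are each followed by another irregular gap and are dropped.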

### Calculating entries and exits, non-cumulative
Cumulative entries and exits do very little for us. We would much rather know the number of entries and exits in each 4-hour period, which we can get by subtracting each cumulative count from the one that follows it. Also, sometimes the cumulative count will reset, which throws these calculations off.
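
Per-period counts can be derived by differencing the cumulative columns within each station. The sketch below uses made-up numbers, including one counter reset, and assumes the data has already been aggregated by station and line. The column names ent and ext follow the names used later in this walkthrough.

```python
import pandas as pd

# Made-up cumulative counts; the last 116 ST row simulates a counter reset.
df = pd.DataFrame({
    "station":  ["116 ST"] * 4 + ["125 ST"] * 3,
    "linename": ["1"] * 7,
    "datetime": pd.to_datetime([
        "2018-10-15 00:00", "2018-10-15 04:00", "2018-10-15 08:00", "2018-10-15 12:00",
        "2018-10-15 00:00", "2018-10-15 04:00", "2018-10-15 08:00",
    ]),
    "entries":  [1000, 1120, 1300, 20, 5000, 5400, 5900],
    "exits":    [800, 880, 975, 10, 4000, 4210, 4500],
})
df = df.sort_values(["station", "linename", "datetime"]).reset_index(drop=True)

# Difference from the previous row within the same station and line.
# The first row of each group has nothing to subtract from, so it is NaN.
groups = df.groupby(["station", "linename"])
df["ent"] = groups["entries"].diff()
df["ext"] = groups["exits"].diff()

print(df[["station", "datetime", "ent", "ext"]])
```

Notice that the simulated reset shows up as a large negative value in ent and ext, which is the kind of row the outlier step has to deal with.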

### Locating and removing outliers
For data of this nature, there will be a lot of variability, because some stations get a lot more traffic than others, and some times are much busier than others. First, we'll 'describe' our ent and ext variables, to better understand their distribution. We'll find that there are some entries less than zero, which is impossible, so we'll remove observations with entries or exits less than zero.
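
A minimal sketch of this step, using made-up values for ent and ext: `describe` summarizes the distribution, and a boolean filter keeps only rows where both counts are non-negative.

```python
import pandas as pd

# Made-up per-period counts with two impossible (negative) values.
df = pd.DataFrame({
    "station": ["116 ST", "116 ST", "116 ST", "125 ST", "125 ST"],
    "ent": [120, 180, -1280, 400, 500],
    "ext": [80, 95, 60, -3, 210],
})

# Summary statistics; the "min" row reveals the negative values.
summary = df[["ent", "ext"]].describe()
print(summary)

# Keep only rows where both entries and exits are zero or greater.
# Rows with missing values would also be dropped by this comparison.
cleaned = df[(df["ent"] >= 0) & (df["ext"] >= 0)].reset_index(drop=True)
print(cleaned)
```

Comparing the min row of `describe` before and after filtering is a quick way to confirm that the negative values are gone.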