# Unit Testing Exercise

In [1]:
import pandas as pd

## Starter Code

Here is some starter code to load a particular week of MTA data.

In [2]:
# Get the Saturday of the previous Monday-to-Sunday week (files are dated by Saturday, as per http://web.mta.info/)
pulldate = pd.Timestamp.now()
pulldate = pulldate - pd.Timedelta(days=pulldate.weekday() + 2)
print(f"Retrieving MTA data for the week before Saturday, {pulldate.date()}")

mta_url = f"http://web.mta.info/developers/data/nyct/turnstile/turnstile_{pulldate.strftime('%y%m%d')}.txt"
df_mta = pd.read_csv(mta_url)

Retrieving MTA data for the week before Saturday, 2020-08-29


## Exercise

We're going to revisit the MTA data and start building some unit tests together. I'm providing the tests in the TestDataLoader class, and some starter code above. You need to write a function that

* takes in a list of week IDs as input
* loads the dataframe corresponding to each week ID (check out the data folder) and combines them
* returns the single dataframe

Your function should pass all of the tests. Note that some of them require minimal cleaning before the dataframe is returned!
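The week IDs are the file dates in yymmdd form, one file per Saturday, as in the starter code's URL. As a rough sketch, a list of consecutive IDs could be generated rather than typed by hand. `recent_week_ids` is a hypothetical helper, not part of the exercise, and returning strings (so a leading zero is never lost) is an assumption of this example.

```python
import pandas as pd


def recent_week_ids(end_date, n_weeks):
    """Return yymmdd IDs for the last `n_weeks` Saturdays up to `end_date`.

    Hypothetical helper for illustration only.
    """
    # "W-SAT" is the pandas weekly frequency anchored on Saturdays.
    saturdays = pd.date_range(end=pd.Timestamp(end_date), periods=n_weeks, freq="W-SAT")
    return [day.strftime("%y%m%d") for day in saturdays]


week_ids = recent_week_ids("2020-08-29", 2)
print(week_ids)  # ['200822', '200829']
```

The resulting list can be passed to the loader in place of a hand-written list, since the URL is built with an f-string and accepts strings as well as integers.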

In [3]:
df_mta

|  | C/A | UNIT | SCP | STATION | LINENAME | DIVISION | DATE | TIME | DESC | ENTRIES | EXITS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/22/2020 | 00:00:00 | REGULAR | 7447810 | 2532191 |
| 1 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/22/2020 | 04:00:00 | REGULAR | 7447812 | 2532197 |
| 2 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/22/2020 | 08:00:00 | REGULAR | 7447824 | 2532208 |
| 3 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/22/2020 | 12:00:00 | REGULAR | 7447852 | 2532248 |
| 4 | A002 | R051 | 02-00-00 | 59 ST | NQR456W | BMT | 08/22/2020 | 16:00:00 | REGULAR | 7447937 | 2532276 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 217827 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 08/28/2020 | 05:00:00 | REGULAR | 5554 | 540 |
| 217828 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 08/28/2020 | 09:00:00 | REGULAR | 5554 | 540 |
| 217829 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 08/28/2020 | 13:00:00 | REGULAR | 5554 | 540 |
| 217830 | TRAM2 | R469 | 00-05-01 | RIT-ROOSEVELT | R | RIT | 08/28/2020 | 17:00:00 | REGULAR | 5554 | 540 |


In [6]:
def load_data_into_dataframe(week_ids):
    """Load one MTA turnstile file per week ID and return them as a single dataframe."""
    data_list = []
    # Iterating over a non-iterable argument (e.g. a bare int) raises TypeError,
    # which is what test_fails_without_file_list expects.
    for week_id in week_ids:
        mta_url = f"http://web.mta.info/developers/data/nyct/turnstile/turnstile_{week_id}.txt"
        data_list.append(pd.read_csv(mta_url))
    # ignore_index avoids repeating the same row labels for every week.
    data = pd.concat(data_list, ignore_index=True)
    # The raw header pads EXITS with trailing spaces, so strip every column name.
    data.columns = data.columns.str.strip()
    return data

In [7]:
data = load_data_into_dataframe([200829, 200822])

In [8]:
data.columns

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')
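The cleaning step in `load_data_into_dataframe` exists because the raw files pad the `EXITS` header with trailing spaces. A small sketch on made-up data shows the problem and how `str.strip()` on the column index handles it without hard-coding the number of spaces. The sample rows and the shortened column list are assumptions for illustration.

```python
import io

import pandas as pd

# Tiny made-up sample that mimics the padded EXITS header of the raw files.
raw = (
    "C/A,UNIT,ENTRIES,EXITS     \n"
    "A002,R051,7447810,2532191\n"
)
df = pd.read_csv(io.StringIO(raw))
padded_columns = list(df.columns)  # the last name still carries the padding
print(padded_columns)

# Strip whitespace from every column name in one step.
df.columns = df.columns.str.strip()
print(list(df.columns))
```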

In [9]:
import unittest

class TestDataLoader(unittest.TestCase):
    
    def test_fails_without_file_list(self):
        with self.assertRaises(TypeError):
            load_data_into_dataframe()
        with self.assertRaises(TypeError):
            load_data_into_dataframe(200829)
    
    def test_output_type(self):
        self.assertIsInstance(load_data_into_dataframe([200829]), pd.DataFrame)
        
    def test_column_names(self):
        df = load_data_into_dataframe([200829])
        expected_cols = ['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION',
                         'DATE', 'TIME', 'DESC', 'ENTRIES', 'EXITS']
        # Comparing lists gives a readable failure even if the column count differs.
        self.assertListEqual(list(df.columns), expected_cols)
        
    def test_multiple_files_of_data(self):
        df = load_data_into_dataframe([200829, 200822])
        self.assertIsInstance(df, pd.DataFrame)

In [10]:
unittest.main(defaultTest='TestDataLoader', argv=['first-arg-is-ignored'], exit=False)

# Note that this time I passed the name of the testing class (via defaultTest) so it only runs
# that tester instead of all the possible testers currently defined!

....
----------------------------------------------------------------------
Ran 4 tests in 22.705s

OK


<unittest.main.TestProgram at 0x7fe467323e90>
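Most of the 22 seconds in the run above is presumably download time, since every test fetches data from the MTA site. One way to keep such tests fast and offline is to patch `pd.read_csv` with `unittest.mock`. The sketch below assumes a simplified stand-in loader (`load_weeks`, hypothetical) that uses the same URL pattern, and a made-up two-row dataframe in place of a real week of data.

```python
import unittest
from unittest import mock

import pandas as pd


def load_weeks(week_ids):
    """Simplified stand-in for the loader (hypothetical name, same URL pattern)."""
    frames = [
        pd.read_csv(f"http://web.mta.info/developers/data/nyct/turnstile/turnstile_{week_id}.txt")
        for week_id in week_ids
    ]
    data = pd.concat(frames, ignore_index=True)
    data.columns = data.columns.str.strip()
    return data


class TestLoaderOffline(unittest.TestCase):
    def setUp(self):
        # Made-up frame with a padded EXITS header, standing in for one week of data.
        self.fake_week = pd.DataFrame({"ENTRIES": [1, 2], "EXITS   ": [3, 4]})

    def test_one_read_per_week_and_clean_columns(self):
        # While the patch is active, pd.read_csv returns the fake frame and never hits the network.
        with mock.patch.object(pd, "read_csv", return_value=self.fake_week) as fake_read:
            df = load_weeks([200829, 200822])
        self.assertEqual(fake_read.call_count, 2)
        self.assertEqual(len(df), 4)
        self.assertListEqual(list(df.columns), ["ENTRIES", "EXITS"])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestLoaderOffline)
result = unittest.TextTestRunner().run(suite)
```

The trade-off is that a patched test no longer checks that the real files exist or still have the expected layout, so it complements the network-backed tests rather than replacing them.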

## Exercise 2: Writing the Function and the Tests

Now your goal is to write both the function and the tests. We're going to write a function to clean and prepare our data. The function should:

* Take in a dataframe that already contains DATE and TIME columns
* Create a DATE_TIME column using the DATE and TIME columns
* Make sure that each grouping of ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"] is unique

For tests, check the output types of the columns, check that duplicate groupings are handled properly, and add any other tests you can think of.
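As a starting point for discussion, here is one possible shape for the function and its tests, run against a few made-up rows rather than a real week of data. The name `clean_mta_data`, the choice to keep the first row when a key repeats, and the sample rows are all assumptions of this sketch, not the expected answer.

```python
import unittest

import pandas as pd

KEY_COLS = ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"]


def clean_mta_data(df):
    """Hypothetical cleaner: add DATE_TIME and keep one row per key."""
    out = df.copy()  # leave the caller's dataframe untouched
    out["DATE_TIME"] = pd.to_datetime(
        out["DATE"] + " " + out["TIME"], format="%m/%d/%Y %H:%M:%S"
    )
    # Assumption: when a key repeats, keeping the first row is acceptable.
    return out.drop_duplicates(subset=KEY_COLS, keep="first").reset_index(drop=True)


class TestCleanerSketch(unittest.TestCase):
    def setUp(self):
        # Made-up rows; the last one repeats the key of the first.
        self.raw = pd.DataFrame(
            {
                "C/A": ["A002"] * 3,
                "UNIT": ["R051"] * 3,
                "SCP": ["02-00-00"] * 3,
                "STATION": ["59 ST"] * 3,
                "DATE": ["08/22/2020"] * 3,
                "TIME": ["00:00:00", "04:00:00", "00:00:00"],
                "ENTRIES": [7447810, 7447812, 7447810],
            }
        )

    def test_date_time_is_datetime(self):
        cleaned = clean_mta_data(self.raw)
        self.assertTrue(pd.api.types.is_datetime64_any_dtype(cleaned["DATE_TIME"]))

    def test_keys_are_unique(self):
        cleaned = clean_mta_data(self.raw)
        self.assertFalse(cleaned.duplicated(subset=KEY_COLS).any())
        self.assertEqual(len(cleaned), 2)

    def test_input_is_not_modified(self):
        clean_mta_data(self.raw)
        self.assertNotIn("DATE_TIME", self.raw.columns)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCleanerSketch)
result = unittest.TextTestRunner().run(suite)
```

Building the fixture by hand keeps the tests fast and makes the duplicate case explicit, which is harder to guarantee when testing against a downloaded week.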

In ~15 minutes, we'll have someone come up and present both their code and their tests and other folks can chime in about the types of tests they've written as well.

In [11]:
df = load_data_into_dataframe([160917])

In [None]:
# YOU NEED TO CODE THIS!
class TestDataCleaner(unittest.TestCase):
    
    pass

unittest.main(defaultTest='TestDataCleaner', argv=['first-arg-is-ignored'], exit=False)