# HOMEWORK 1

* Work on New York Taxi Data (https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data)
* https://docs.google.com/document/d/1T6YSc5Iy5Bg7giua8UjWgj222nLAwpKFg1ZLVtrFDVg/edit

#### FROM THE SPEC ABOVE (Copied here for easy reference)
1. Take a look at the training data. There may be anomalies in the data that you may need to factor in before you start on the other tasks. Clean the data first to handle these issues. Explain what you did to clean the data (in bulleted form). (10 pt)

2. Compute the Pearson correlation between the following: (9 pt)
    * Euclidean distance of the ride and the taxi fare
    * time of day and distance traveled
    * time of day and the taxi fare
    * Which has the highest correlation?

3. For each subtask of (2), create a plot visualizing the relation between the variables. Comment on whether you see non-linear or any other interesting relations. (9 pt)

4. Create an exciting plot of your own using the dataset that you think reveals something very interesting. Explain what it is, and anything else you learned. (15 pt)

5. Generate additional features like those from (2) from the given data set. What additional features can you create? (10 pt)

6. Set up a simple linear regression model to predict taxi fare. Use your generated features from the previous task if applicable. How well/badly does it work? What are the coefficients for your features? Which variable(s) are the most important? (12 pt)

7. Consider external datasets that may be helpful to expand your feature set. Give bullet points explaining all the datasets you could identify that would help improve your predictions. If possible, try finding such datasets online to incorporate into your training. List any that you were able to use in your analysis. (10 pt)

8. Now, try to build a better prediction model that works harder to solve the task. Perhaps it will still use linear regression but with new features. Perhaps it will preprocess features better (e.g. normalize or scale the input vector, convert non-numerical values into floats, or apply special treatment to missing values). Perhaps it will use a different machine learning approach (e.g. nearest neighbors, random forests, etc.). Briefly explain what you did differently here versus the simple model. Which of your models minimizes the squared error? (10 pt)

9. Predict the taxi fares for all instances in the file “sample_submission.csv”. Write the results to a CSV file and submit it to the website. You should do this for every model you develop. Report the rank, score, and number of entries for your highest rank. Include a snapshot of your best score on the leaderboard as confirmation. (15 pt)


In [5]:
import pandas as pd

In [7]:
# Memory optimizations adapted from https://www.kaggle.com/szelee/how-to-import-a-csv-file-of-55-million-rows
# Another approach would be to use dask: http://docs.dask.org/en/latest/dataframe.html
# Explicit, compact dtypes keep memory usage down when loading the CSV.
column_types = {'fare_amount': 'float32',
                'pickup_datetime': 'str',
                'pickup_longitude': 'float32',
                'pickup_latitude': 'float32',
                'dropoff_longitude': 'float32',
                'dropoff_latitude': 'float32',
                'passenger_count': 'uint8'}

# Read only the columns listed above.
columns = list(column_types.keys())

FILE_PATH = './homework1_data/train.csv'
CHUNK_SIZE = 10000  # rows per chunk

# Understanding/looking through the data

Go through the data one chunk at a time and look for anomalies.
Examples: rows where `passenger_count <= 0` or `fare_amount <= 0`.
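
The checks described above can be gathered into one helper that is applied to each chunk. The following is only a minimal sketch: `count_anomalies` is a hypothetical name, the coordinate bounds (latitude in [-90, 90], longitude in [-180, 180]) and the missing-value count are extra checks assumed here rather than taken from the notebook, and the `toy` frame is made-up illustration data, not real taxi records.

```python
import pandas as pd


def count_anomalies(df):
    """Count rows in one chunk that fail basic sanity checks (hypothetical helper)."""
    # Assumed bounds: any valid latitude lies in [-90, 90] and any valid
    # longitude in [-180, 180]. Missing coordinates also count as out of
    # range, because between() evaluates to False for NaN.
    bad_lat = (~df['pickup_latitude'].between(-90, 90)
               | ~df['dropoff_latitude'].between(-90, 90))
    bad_lon = (~df['pickup_longitude'].between(-180, 180)
               | ~df['dropoff_longitude'].between(-180, 180))
    return {
        'fare_amount <= 0': int((df['fare_amount'] <= 0).sum()),
        'passenger_count <= 0': int((df['passenger_count'] <= 0).sum()),
        'coordinate out of range': int((bad_lat | bad_lon).sum()),
        'missing values': int(df.isna().any(axis=1).sum()),
    }


# Tiny made-up frame, only to show the call; not real taxi data.
toy = pd.DataFrame({
    'fare_amount': [8.5, -2.9, 12.0],
    'pickup_longitude': [-73.98, -73.99, -73.97],
    'pickup_latitude': [40.75, 40.73, 401.08],
    'dropoff_longitude': [-73.96, -73.98, -73.95],
    'dropoff_latitude': [40.77, 40.74, 40.76],
    'passenger_count': [1, 0, 2],
})
print(count_anomalies(toy))
```

Applied to the real data, the same helper could be called on every `df` produced by the chunked reader and the per-chunk counts summed to get totals for the whole file.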

In [26]:
def print_block(name, content):
    """Print content between labelled BEGIN/END marker lines."""
    print()
    print(f'{"-" * 10} BEGIN: {name} {"-" * 10}')
    print()
    print(content)
    print()
    print(f'{"-" * 10} END: {name} {"-" * 10}')
    print()

In [28]:
chunks = pd.read_csv(FILE_PATH, usecols=columns, dtype=column_types, chunksize=CHUNK_SIZE)

In [29]:
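# Pull the next chunk; next() returns None once the reader is exhausted.
df = next(chunks, None)
if df is None:
    print('No more data')
else:
    print('chunk_size', len(df))
    # Count the rows that show the anomalies noted above.
    print('fare_amount <= 0:', len(df.query('fare_amount <= 0')))
    print('passenger_count <= 0:', len(df.query('passenger_count <= 0')))
    # Summary statistics help spot out-of-range values.
    print_block('df.describe', df.describe())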

chunk_size 10000
fare_amount <= 0: 2
passenger_count <= 0: 38

---------- BEGIN: df.describe ----------

        fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  \
count  10000.000000      10000.000000     10000.000000       10000.000000   
mean      11.235464        -72.466660        39.920448         -72.474098   
std        9.584258         10.609729         7.318932          10.579732   
min       -2.900000        -74.438232       -74.006889         -74.429329   
25%        6.000000        -73.992060        40.734546         -73.991112   
50%        8.500000        -73.981758        40.752693         -73.980083   
75%       12.500000        -73.966925        40.767694         -73.963505   
max      180.000000         40.766125       401.083344          40.802437   

       dropoff_latitude  passenger_count  
count      10000.000000     10000.000000  
mean          39.893280         1.644700  
std            6.339919         1.271229  
min          -73.994392      