# UCL AI Society Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas and Matplotlib Libraries

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Perform a simple EDA (Exploratory Data Analysis) using the libraries above.

## 4. EDA (Exploratory Data Analysis)
To build and train a machine learning model more efficiently, Exploratory Data Analysis, or EDA for short, should come before building the model. This statistical approach was popularized by Professor John Tukey, who is also widely known for co-developing the fast Fourier transform (FFT) algorithm.
The main goal of EDA is to analyze a data set to understand its main characteristics, often with visualizations, summary tables and statistics. Note that EDA is not the same as visualization in general: visualization mainly serves the final stage of an analysis, the communication of results, whereas EDA is conducted at the beginning of a task.

Remember the quote, "Garbage In, Garbage Out!".

In theory, the **basic order of EDA** is:
   1. Formulate questions about the data set.
   2. Look for answers in the data set using visualization, transformation and modelling.
   3. Use the answers to dig deeper into your questions and to find new ones.
   4. Repeat steps 1-3 iteratively; you are not expected to get everything right in one pass.

### 4.1 Import Libraries and Data
The dataset can be downloaded here: https://www.kaggle.com/c/nyc-taxi-trip-duration/data
The dataset is provided by Google Cloud Platform and is based on the 2016 NYC Yellow Cab trip records. Kaggle datasets are normally well preprocessed, so practicing EDA with a Kaggle dataset can be a good start for beginners.


The libraries that we have covered so far, Numpy, Pandas, and Matplotlib, are the main tools for EDA. In addition, `seaborn` is another popular library, used mainly for visualization.

### Because the data is too large to upload to GitHub, please download it from the link above and place it in your 'data' directory.

In [None]:
!pip install seaborn

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Load the training set (this may take a while because the original dataset is large)
train = pd.read_csv("./data/nyc-taxi-train.csv")
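
If loading the full file is too slow while you are still exploring, one option is to read only part of it. The sketch below uses the `nrows` and `usecols` parameters of `pd.read_csv` on a tiny in-memory CSV. The sample rows and the `preview` name are made up for illustration; the same arguments could be passed when reading `./data/nyc-taxi-train.csv`.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for the real file (illustrative values only).
csv_text = """id,passenger_count,trip_duration
a,1,455
b,2,663
c,1,2124
d,3,429
"""

# nrows limits how many data rows are parsed; usecols keeps only the listed columns.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2, usecols=["id", "trip_duration"])
print(preview.shape)
```

Keep in mind that statistics computed on a partial read describe only the rows that were loaded, so load the full data before drawing conclusions.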

In [None]:
# Let's look at the first few rows to see what the data contains.
train.head()

In [None]:
test = pd.read_csv("./data/nyc-taxi-test.csv")

In [None]:
test.head()

### 4.2 Check the Characteristics
- `info()` : Prints a brief summary of the DataFrame
- `shape` : Returns the data shape as (rows, columns)
- `dtypes` : Returns the data type of each column
- `describe()` : Returns summary statistics of the data
- `keys()` : Returns the column labels
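
To see what each of these returns before trying them on the taxi data, here is a minimal sketch on a small made-up DataFrame. The `toy` name and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# A small made-up DataFrame (hypothetical values) used in place of the taxi data.
toy = pd.DataFrame({
    "passenger_count": [1, 2, 1, 4],
    "trip_duration": [455, 663, 2124, 429],
    "store_and_fwd_flag": ["N", "N", "Y", "N"],
})

toy.info()               # prints column names, non-null counts and dtypes
print(toy.shape)         # (rows, columns)
print(toy.dtypes)        # data type of each column
print(toy.describe())    # summary statistics for the numeric columns
print(list(toy.keys()))  # column labels
```

By default `describe()` summarizes only the numeric columns, so the text column `store_and_fwd_flag` does not appear in its output.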

In [None]:
# To Do: Brief Information of the dataset
pass

In [None]:
# To Do: Return shape of our data
pass

In [None]:
# To Do: Return data types of each field (column name)
pass

In [None]:
# To Do: Return the data statistics
pass

### 4.3 Check the Null Values
Checking for null values and anomalies (or outliers) in the dataset is an essential step in EDA before you actually apply ML to your dataset.

Because this dataset is well preprocessed, you won't see any null values. However, do find out and learn how to deal with missing or null values, as we discussed in our pandas session. It is very common to have corrupt data in the real world.
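
As a reminder of what handling missing values can look like, here is a minimal sketch on made-up data with deliberately missing entries. Dropping rows and filling with the median are only two of several possible strategies, and the `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data with deliberately missing entries (hypothetical values).
toy = pd.DataFrame({
    "passenger_count": [1, np.nan, 2, 1],
    "trip_duration": [455.0, 663.0, np.nan, 429.0],
})

null_counts = toy.isnull().sum()   # number of missing values per column
dropped = toy.dropna()             # remove every row that has a missing value
filled = toy.fillna(toy.median())  # replace missing values with the column median

print(null_counts)
print(len(dropped))
print(filled)
```

Which strategy is appropriate depends on how much data is missing and why, so check the null counts first before deciding.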

In [None]:
# To Do: Return the keys of the columns
pass

In [None]:
train.isnull().sum()   # Remember this trick? - from pandas session

In [None]:
# minimum and maximum longitude in the training set
min(train.pickup_longitude.min(), train.dropoff_longitude.min()), \
max(train.pickup_longitude.max(), train.dropoff_longitude.max())

In [None]:
# minimum and maximum latitude in the training set
min(train.pickup_latitude.min(), train.dropoff_latitude.min()), \
max(train.pickup_latitude.max(), train.dropoff_latitude.max())

In [None]:
# minimum and maximum longitude in the test set
min(test.pickup_longitude.min(), test.dropoff_longitude.min()), \
max(test.pickup_longitude.max(), test.dropoff_longitude.max())

In [None]:
# minimum and maximum latitude in the test set
min(test.pickup_latitude.min(), test.dropoff_latitude.min()), \
max(test.pickup_latitude.max(), test.dropoff_latitude.max())

The coordinate ranges of the training and test sets match, as we expected.
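
When coordinate ranges do not match, or when a few points lie far away from the rest, one possible way to handle it is to keep only the rows inside a bounding box. The sketch below assumes a hypothetical helper `within_bounds` and made-up trips. The box limits are illustrative only and are not an official boundary.

```python
import pandas as pd

def within_bounds(df, lon_range, lat_range):
    """Return a boolean mask: True where pickup and dropoff both lie inside the box."""
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    return (
        df["pickup_longitude"].between(lon_min, lon_max)
        & df["dropoff_longitude"].between(lon_min, lon_max)
        & df["pickup_latitude"].between(lat_min, lat_max)
        & df["dropoff_latitude"].between(lat_min, lat_max)
    )

# Made-up trips; the last one has a pickup far away from the others.
toy = pd.DataFrame({
    "pickup_longitude": [-73.98, -73.95, -121.93],
    "pickup_latitude": [40.75, 40.78, 37.39],
    "dropoff_longitude": [-73.96, -73.99, -73.97],
    "dropoff_latitude": [40.77, 40.73, 40.76],
})

# Illustrative box only, not an official boundary.
mask = within_bounds(toy, lon_range=(-74.3, -73.6), lat_range=(40.4, 41.0))
cleaned = toy[mask]
print(len(toy), "->", len(cleaned))
```

Returning a mask rather than a filtered DataFrame lets you inspect the rejected rows with `toy[~mask]` before deciding whether to drop them.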

### 4.4 EDA Exercise (optional)
To Do: Do EDA on the test dataset as we did on the training dataset.
1. Read the test dataset (`root = ./data/nyc-taxi-test.csv`)
2. Check for and remove null values and outliers.

In [None]:
# YOUR CODE

### 4.5 Making Derivative Attributes

ML models digest only what we feed them. You cannot expect an ML model to know the distance from the pickup spot to the dropoff spot on its own. You have to create a new attribute for the distance and feed it into the model.

This applies not only to the distance: if you come up with a new informative attribute that could boost model performance, do not hesitate to create it.
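
As one more illustration of a derived attribute, time-based features can be extracted from a timestamp. The sketch below assumes a text column named `pickup_datetime` and uses made-up rows. The new column names `pickup_hour`, `pickup_weekday` and `is_weekend` are hypothetical choices.

```python
import pandas as pd

# Made-up rows; `pickup_datetime` is assumed to be a timestamp column stored as text.
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
})

# Parse the strings into datetime values, then derive new attributes from them.
pickup = pd.to_datetime(toy["pickup_datetime"])
toy["pickup_hour"] = pickup.dt.hour            # 0-23
toy["pickup_weekday"] = pickup.dt.dayofweek    # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["pickup_weekday"] >= 5

print(toy)
```

Whether such attributes actually help is something to check against the data, for example by comparing trip durations across hours of the day.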

In [None]:
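# Calculate the trip distance in miles with the haversine formula.
# Based on https://stackoverflow.com/questions/27928/
def distance(lat1, lon1, lat2, lon2):
    """Return the great-circle distance in miles between points given in degrees."""
    p = np.pi / 180  # degrees -> radians
    # Haversine term; (1 - cos(x)) / 2 is equal to sin^2(x / 2)
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter (2 * R); 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))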

In [None]:
train['distance'] = distance(train.pickup_latitude, 
                             train.pickup_longitude, 
                             train.dropoff_latitude, 
                             train.dropoff_longitude)
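
A quick sanity check can help confirm that a helper like `distance` behaves as intended. The sketch below repeats the haversine helper so that it runs on its own, then tries two cases whose answers are known approximately: two identical points, and two points one degree of latitude apart (roughly 69 miles). The test coordinates are arbitrary.

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    # Same haversine formula as above, repeated so this cell runs on its own.
    p = np.pi / 180  # degrees -> radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    # 12742 km is the Earth's diameter; 0.6213712 converts km to miles
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

same_point = distance(40.75, -73.98, 40.75, -73.98)
one_degree_north = distance(40.0, -73.98, 41.0, -73.98)

print(same_point)
print(round(one_degree_north, 1))
```

Because the function uses only NumPy operations, it accepts whole pandas columns as well as single numbers, which is why it can be applied to the DataFrame without a loop.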

In [None]:
train.info()

This is not the end of EDA. As mentioned above, to build a stronger machine learning model, you should first gain full insight into the data. Find more materials and carry out EDA yourself.

### What do I do next?

## MAKE YOUR OWN WONDERFUL EDA!

### Pro tip: Try to make a map using the longitude and latitude data. Treating longitude as the X data and latitude as the Y data and drawing a scatter plot will give you a good-looking map.
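
A minimal sketch of such a map is shown below. It uses synthetic random coordinates instead of real trips so that it runs on its own; with the real data you could pass columns such as `train.pickup_longitude` and `train.pickup_latitude` instead. The marker size and transparency are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic coordinates (random, not real trips) clustered around one point.
rng = np.random.default_rng(0)
longitude = rng.normal(loc=-73.97, scale=0.03, size=2000)
latitude = rng.normal(loc=40.75, scale=0.03, size=2000)

fig, ax = plt.subplots(figsize=(6, 6))
# Small, semi-transparent markers keep dense areas readable.
ax.scatter(longitude, latitude, s=1, alpha=0.3)
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
ax.set_title("Pickup locations (synthetic data)")
```

On real data, a few far-away points can stretch the axes, so you may want to restrict the view with `ax.set_xlim` and `ax.set_ylim`.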

The following websites may be helpful for further study of EDA:
- [Exploratory data analysis on Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
- [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)
- [Introduction to Exploratory Data Analysis in Python](https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190)
- [Kaggle: New York City Taxi trip duration notebooks](https://www.kaggle.com/c/nyc-taxi-trip-duration/notebooks)
