# UCL AI Society Machine Learning Tutorials
### Session 01. Introduction to Numpy, Pandas and Matplotlib Libraries

### Contents
1. Numpy
2. Pandas
3. Matplotlib
4. EDA

### Aim
At the end of this session, you will be able to:
- Understand the basics of numpy.
- Understand the basics of pandas.
- Understand the basics of matplotlib.
- Perform a simple EDA (Exploratory Data Analysis) using the libraries above.

## 4. EDA (Exploratory Data Analysis)
To build and train machine learning models efficiently, Exploratory Data Analysis, or EDA for short, should come before model building. This statistical approach was promoted by Professor John Tukey, who is also widely known for co-developing the Cooley-Tukey fast Fourier transform (FFT) algorithm.
The main goal of EDA is to analyze data sets in order to understand their main characteristics, often with visualizations, summary tables and statistics. Note that visualization in general should be distinguished from EDA: presentation-oriented visualization belongs mainly to the final stages of analysis, when results are communicated, while EDA is conducted at the beginning of the task.

Remember the quote, "Garbage In, Garbage Out!".

The **basic order of EDA** theoretically is:
   1. Find questions about the data set.
   2. Look for the answers in the data set using visualization, transformation and modelling.
   3. Go deeper into the questions through the answers and find new questions.
   4. Repeat tasks 1-3 iteratively until satisfied.
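
As a rough illustration of one pass through this loop, here is a minimal sketch. The `trips` DataFrame and its values are made up for the example and are not taken from the taxi dataset.

```python
import pandas as pd

# Hypothetical toy data, not the taxi dataset.
trips = pd.DataFrame({
    "passenger_count": [1, 1, 2, 2, 4, 4],
    "trip_duration": [300, 500, 400, 800, 900, 1500],  # seconds
})

# Step 1: ask a question. "Do trips with more passengers take longer?"
# Step 2: answer it with a transformation (group, then summarise).
mean_duration = trips.groupby("passenger_count")["trip_duration"].mean()
print(mean_duration)

# Step 3: the answer suggests a follow-up question, for example
# "How much do durations vary within each group?"
spread = trips.groupby("passenger_count")["trip_duration"].std()
print(spread)
```

Each answer here is only a summary table; in practice you would often follow it with a plot before moving on to the next question.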

### 4.1 Import Data and Libraries
The dataset we will use can be downloaded here: https://www.kaggle.com/c/nyc-taxi-trip-duration/data . Once you download your files, make sure to rename them to `nyc-taxi-train.csv` and `nyc-taxi-test.csv`.

The dataset is provided by Google Cloud Platform and is based on the 2016 NYC Yellow Cab trip record. Kaggle datasets are normally well-preprocessed, so practicing EDA with a Kaggle dataset can be a good start for beginners.

The libraries that we covered so far (Numpy, Pandas, and Matplotlib) are your main tools for EDA. Additionally, `seaborn` is another popular library, used mainly for visualization.

#### Because the dataset is too large to upload to GitHub, please download it from the link above and place it in your 'data' directory.

In [None]:
!pip install seaborn

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Loading the full training set can take a while
train = pd.read_csv("./data/nyc-taxi-train.csv")

In [None]:
# Let's see what the data is comprised of.
train.head()

In [None]:
test = pd.read_csv("./data/nyc-taxi-test.csv")

In [None]:
test.head()

### 4.2 Check the Characteristics
- `info()` : Gives brief information
- `shape` : Returns data shape, (rows, columns)
- `dtypes` : Returns the data type of each column
- `describe()` : Returns the data statistics
- `keys()` : Returns the keys (names) of the columns
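
To see what these calls return before applying them to the taxi data, here is a minimal sketch on a tiny DataFrame. The `df` frame and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only to show what each call returns.
df = pd.DataFrame({
    "passenger_count": [1, 2, 1],
    "trip_duration": [455.0, 663.0, 2124.0],
    "vendor": ["a", "b", "a"],
})

df.info()                # prints column names, non-null counts and dtypes
print(df.shape)          # a (rows, columns) tuple
print(df.dtypes)         # one dtype per column
print(df.describe())     # summary statistics; numeric columns only by default
print(list(df.keys()))   # the column names
```

Note that `shape` and `dtypes` are attributes, so they are written without parentheses, while `info()`, `describe()` and `keys()` are methods.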

In [None]:
# To Do: Brief Information of the dataset
pass

In [None]:
# To Do: Return shape of our data
pass

In [None]:
# To Do: Return data types of each field (column name)
pass

In [None]:
# To Do: Return the data statistics
pass

### 4.3 Check the Values
Checking for null values, anomalies and outliers in a dataset is an essential step in EDA before you actually apply ML to it.

As this dataset is well-preprocessed, you won't see any null values. However, do find out and learn how to deal with missing or null values, as discussed in our pandas session. Corrupt data is very common in the real world.
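
As a brief reminder of the usual options, the sketch below counts missing values and then either drops or fills them. The toy `df` is hypothetical, and filling with the column median is just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; the taxi data is not used here.
df = pd.DataFrame({
    "passenger_count": [1, np.nan, 2, 1],
    "trip_duration": [455.0, 663.0, np.nan, 429.0],
})

missing_per_column = df.isnull().sum()   # number of missing values per column
dropped = df.dropna()                    # keep only rows without any gaps
filled = df.fillna(df.median())          # replace gaps with each column's median

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping rows is simple but discards data, while filling keeps every row at the cost of inventing values, so the right choice depends on how much is missing and why.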

In [None]:
# To Do: Return the keys of the columns
pass

In [None]:
train.isnull().sum()   # Remember this trick? - from the pandas session

In [None]:
# minimum and maximum longitude in trainset
min(train.pickup_longitude.min(), train.dropoff_longitude.min()), \
max(train.pickup_longitude.max(), train.dropoff_longitude.max())

In [None]:
# minimum and maximum latitude in trainset
min(train.pickup_latitude.min(), train.dropoff_latitude.min()), \
max(train.pickup_latitude.max(), train.dropoff_latitude.max())

In [None]:
# minimum and maximum longitude in test set
min(test.pickup_longitude.min(), test.dropoff_longitude.min()), \
max(test.pickup_longitude.max(), test.dropoff_longitude.max())

In [None]:
# minimum and maximum latitude in test set
min(test.pickup_latitude.min(), test.dropoff_latitude.min()), \
max(test.pickup_latitude.max(), test.dropoff_latitude.max())

The coordinate ranges of the training and test sets are very similar, as we expected. Next, run the code below to get a histogram of all the pickup latitudes in the training set. By default, the plot's bounds are the smallest and largest values in the data. Are you surprised by what you see? What can you do to get a more informative representation of this data?

In [None]:
plt.hist(train['pickup_latitude'])
plt.show()
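
If a handful of extreme coordinates stretch the axis, most trips collapse into one or two bars. One possible remedy, sketched below, is to restrict the histogram to a central range of the data and use more bins. The `lat` series is synthetic and only stands in for `train['pickup_latitude']`; the 1st and 99th percentile cut-offs are an arbitrary assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for train['pickup_latitude']: most points lie near 40.75,
# plus two far-away outliers. These numbers are made up for illustration.
rng = np.random.default_rng(0)
lat = pd.Series(np.concatenate([rng.normal(40.75, 0.03, 1000), [34.4, 51.9]]))

# Use the 1st and 99th percentiles as plot bounds (an arbitrary choice).
low, high = lat.quantile([0.01, 0.99])

plt.hist(lat, bins=50, range=(low, high))
plt.xlabel("Pickup latitude")
plt.ylabel("Number of trips")
plt.show()
```

The `range` argument only changes what is drawn. The outliers are still in the data, so you should decide separately whether to keep or remove them.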

### 4.4 EDA Exercise (optional)

Below, assign to X the passenger counts of the first 100 trips that have `trip_duration<10000`, and to Y the passenger counts of the first 100 trips that have `trip_duration>10000`. Run your code and observe the histogram. What conjecture can you make based on this data?

In [None]:
# Put your code here
X = #TODO
Y = #TODO

# This will visualise the data you selected
bins = [1, 2, 3, 4, 5, 6]
plt.hist(X, bins, alpha=0.5)
plt.hist(Y, bins, alpha=0.5)
plt.xlabel("Number of passengers")
plt.ylabel("Number of trips")
plt.show()
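
If you are unsure how to select rows by a condition, the sketch below shows the general pattern on a toy DataFrame. The data, the variable names `short_trips` and `long_trips`, and the sample size of 2 are all hypothetical. Adapting the pattern to the exercise is left to you.

```python
import pandas as pd

# Hypothetical toy data; sizes are scaled down for readability.
trips = pd.DataFrame({
    "passenger_count": [1, 2, 1, 5, 3, 1],
    "trip_duration": [300, 12000, 450, 15000, 800, 20000],
})

# Boolean mask -> matching rows -> one column -> first n values.
short_trips = trips.loc[trips["trip_duration"] < 10000, "passenger_count"].head(2)
long_trips = trips.loc[trips["trip_duration"] > 10000, "passenger_count"].head(2)

print(short_trips.tolist())
print(long_trips.tolist())
```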

### 4.5 Making Derivative Attributes

ML models digest only what we feed them. You cannot expect an ML model to know that the distance from the pickup spot to the dropoff spot is important. You have to make a new attribute for the distance and feed that into your model if you want it to take this into consideration.

In general, if you come up with a new informative attribute that can possibly boost the model performance, consider adding it to your model.

In [None]:
# Calculate the great-circle trip distance in miles (haversine formula),
# based on https://stackoverflow.com/questions/27928/
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180, converts degrees to radians
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    # 0.6213712 converts km to miles; 12742 km is the Earth's diameter (2*R)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a)) # 2*R*asin(sqrt(a))

In [None]:
train['distance'] = distance(train.pickup_latitude, 
                             train.pickup_longitude, 
                             train.dropoff_latitude, 
                             train.dropoff_longitude)
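
Before trusting a derived attribute, it is worth checking it against values you can reason about by hand. The sketch below restates the same formula so that it runs on its own, then checks two cases: identical points should be 0 miles apart, and one degree of latitude should be roughly 69 miles. The test coordinates are arbitrary.

```python
import numpy as np

def distance(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in miles; same formula as above."""
    p = np.pi / 180  # converts degrees to radians
    a = (0.5 - np.cos((lat2 - lat1) * p) / 2
         + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2)
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))

# Arbitrary test coordinates chosen for the check.
same_point = distance(40.75, -73.99, 40.75, -73.99)
one_degree_north = distance(40.0, -73.99, 41.0, -73.99)
print(same_point, one_degree_north)
```

Because the function is built from NumPy operations, it works on single numbers as well as on whole pandas columns, which is why it can be applied to the DataFrame in one call.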

In [None]:
train.info()
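
Distance is not the only candidate for a derived attribute. As another sketch, time-related attributes can be derived from a timestamp column. The column name `pickup_datetime` and the two sample rows are assumptions made for this example, so check them against your own data.

```python
import pandas as pd

# Hypothetical toy data; `pickup_datetime` is an assumed column name.
df = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35"],
})

# Parse the strings into timestamps, then derive new attributes from them.
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
df["pickup_hour"] = df["pickup_datetime"].dt.hour
df["pickup_weekday"] = df["pickup_datetime"].dt.dayofweek  # Monday = 0, Sunday = 6

print(df)
```

Attributes like these let a model pick up on rush-hour or weekend effects that it could not infer from a raw timestamp string.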

This is just the start of EDA. As mentioned above, gaining real insight into the data will help you build a stronger machine learning model. Look for more materials and practice EDA yourself.

### What do I do next?

## MAKE YOUR OWN WONDERFUL EDA!

### Challenge: Try to make a map using the longitude and latitude data from the taxicab data set above. Treating the longitude as X and the latitude as Y, draw a scatter plot; it should give you a good-looking map.

The websites below should be helpful for your further study of EDA:
- [Exploratory data analysis on Wikipedia](https://en.wikipedia.org/wiki/Exploratory_data_analysis)
- [What is Exploratory Data Analysis?](https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)
- [Introduction to Exploratory Data Analysis in Python](https://medium.com/python-pandemonium/introduction-to-exploratory-data-analysis-in-python-8b6bcb55c190)
- [Kaggle: New York City Taxi trip duration notebooks](https://www.kaggle.com/c/nyc-taxi-trip-duration/notebooks)
