## Data Set 1: Flight Delay Prediction

This section uses the [Flights Dataset](http://stat-computing.org/dataexpo/2009/the-data.html) to analyze and predict flight delays at airports based on past flight records. 

For this dataset, we will only look at flights in 2007, which is still about 7 million flights! 

In this notebook, we will build **classification models to predict airline delay from historical flight data.**  
We define a flight as delayed when its DepDelay is greater than 15 minutes.  
The question is: how can we classify whether a flight is delayed using the attributes below?

**Data Description**

|Name |	Description|
| :-----: | :-----: |
|Year |	2007|
|Month |	1-12|
|DayofMonth |	1-31|
|DayOfWeek |	1 (Monday) - 7 (Sunday)|
|DepTime |	actual departure time (local, hhmm)|
|CRSDepTime |	scheduled departure time (local, hhmm)|
|ArrTime |	actual arrival time (local, hhmm)|
|CRSArrTime |	scheduled arrival time (local, hhmm)|
|UniqueCarrier |	unique carrier code|
|FlightNum |	flight number|
|TailNum |	plane tail number|
|ActualElapsedTime |	in minutes|
|CRSElapsedTime |	in minutes|
|AirTime |	in minutes|
|ArrDelay |	arrival delay, in minutes|
|DepDelay |	departure delay, in minutes|
|Origin |	origin IATA airport code|
|Dest |	destination IATA airport code|
|Distance |	in miles|
|TaxiIn |	taxi in time, in minutes|
|TaxiOut |	taxi out time, in minutes|
|Cancelled |	was the flight cancelled?|
|CancellationCode |	reason for cancellation (A = carrier, B = weather, C = NAS, D = security)|
|Diverted |	1 = yes, 0 = no|
|CarrierDelay |	in minutes|
|WeatherDelay |	in minutes|
|NASDelay |	in minutes|
|SecurityDelay |	in minutes|
|LateAircraftDelay |	in minutes|   

In [1]:
### Install the Basemap package with the following command if required.
!conda install -c conda-forge basemap
from mpl_toolkits.basemap import Basemap

Fetching package metadata ....
.........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    basemap: 1.0.7-np113py35_0
    geos:    3.5.0-0          

geos-3.5.0-0.t 100% |################################| Time: 0:00:00  93.89 MB/s
basemap-1.0.7- 100% |################################| Time: 0:00:02  48.81 MB/s


### Import Data Set 1

Other data, such as weather data, are listed on the [Airport delay dataset](http://stat-computing.org/dataexpo/2009/the-data.html) page;
you can try adding them to the classification model.

In [1]:
import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_4c3d0bbe98f64d64949e57243722be60 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='cUXAP-2d5lByEhQdpZVMvA6d29wt1Zg3t92oCuXTGDct',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_4c3d0bbe98f64d64949e57243722be60.get_object(Bucket='workshopteam7-donotdelete-pr-eriuhxku3y6swk',Key='2007.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

airline_df = pd.read_csv(body)
airline_df.head()

Unnamed: 0,Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,...,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
0,2007,1,1,1,1232.0,1225,1341.0,1340,WN,2891,...,4,11,0,,0,0,0,0,0,0
1,2007,1,1,1,1918.0,1905,2043.0,2035,WN,462,...,5,6,0,,0,0,0,0,0,0
2,2007,1,1,1,2206.0,2130,2334.0,2300,WN,1229,...,6,9,0,,0,3,0,0,0,31
3,2007,1,1,1,1230.0,1200,1356.0,1330,WN,1355,...,3,8,0,,0,23,0,0,0,3
4,2007,1,1,1,831.0,830,957.0,1000,WN,2278,...,3,9,0,,0,0,0,0,0,0
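
With `airline_df` loaded, the delay label defined earlier (DepDelay > 15 minutes) can be derived in one line. The sketch below uses a tiny hand-made frame rather than the real file, so the `DepDelay` values are made up for illustration. `Delayed` is a hypothetical column name, and dropping cancelled flights before labeling is an assumption about preprocessing, not something this notebook prescribes.

```python
import pandas as pd

# Toy stand-in for airline_df; the DepDelay values are made up for illustration.
toy_df = pd.DataFrame({
    "UniqueCarrier": ["WN", "WN", "WN", "WN", "WN"],
    "DepDelay": [7.0, 13.0, 36.0, 30.0, None],
    "Cancelled": [0, 0, 0, 0, 1],
})

# Assumption: cancelled flights carry no departure delay, so drop them first.
flown = toy_df[toy_df["Cancelled"] == 0].copy()

# Binary target: 1 if the departure delay exceeds 15 minutes, else 0.
flown["Delayed"] = (flown["DepDelay"] > 15).astype(int)

print(flown[["DepDelay", "Delayed"]])
print("Share of delayed flights:", flown["Delayed"].mean())
```

The same two steps should carry over to `airline_df` unchanged, since it has both the `DepDelay` and `Cancelled` columns.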


## Data Set 2: Human Activity Recognition

The [Human Activity Recognition](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones) database from the **UCI Machine Learning Repository** is built from the recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.

The experiments were carried out with a group of 30 volunteers aged 19 to 48 years. 

-  **Human Activities**:  
Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone (Samsung Galaxy S II) on the waist.  
This gives the dataset a natural goal:  
**How can we classify the six human activities using hundreds of sensor-generated attributes?**


-  **Relevant Data Attributes**:  
Using the smartphone's embedded accelerometer and gyroscope, the dataset authors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz. The experiments were video-recorded so that the data could be labeled manually. 


-  **Training VS Test Dataset**:  
The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.


-  **More Details about the data background**:
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.
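
The window arithmetic described above (2.56 sec at 50 Hz gives 128 readings, and 50% overlap gives a step of 64 readings) can be sketched as follows. This is only an illustration of fixed-width sliding windows; the function name `sliding_windows` and the 10-second example signal are hypothetical, and the noise filtering and Butterworth separation are not reproduced here.

```python
SAMPLING_RATE_HZ = 50      # from the dataset description
WINDOW_SECONDS = 2.56      # from the dataset description
OVERLAP = 0.5              # 50% overlap between consecutive windows

WINDOW_SIZE = round(SAMPLING_RATE_HZ * WINDOW_SECONDS)  # 128 readings per window
STEP = int(WINDOW_SIZE * (1 - OVERLAP))                 # 64 readings between window starts


def sliding_windows(signal, size=WINDOW_SIZE, step=STEP):
    """Return consecutive fixed-width windows, dropping a trailing partial window."""
    return [signal[start:start + size]
            for start in range(0, len(signal) - size + 1, step)]


# Hypothetical signal: 10 seconds of readings, represented by their sample index.
signal = list(range(10 * SAMPLING_RATE_HZ))
windows = sliding_windows(signal)

print(WINDOW_SIZE, STEP, len(windows))
print(windows[0][0], windows[1][0])  # start index of the first two windows
```

Each window would then be reduced to one feature vector (means, standard deviations, and so on), which is what the rows of the provided CSV files contain.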



### Import Data Set 2

In [4]:
import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_4c3d0bbe98f64d64949e57243722be60 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='cUXAP-2d5lByEhQdpZVMvA6d29wt1Zg3t92oCuXTGDct',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_4c3d0bbe98f64d64949e57243722be60.get_object(Bucket='workshopteam7-donotdelete-pr-eriuhxku3y6swk',Key='Human_activity_train.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",subject,Activity
0,0.288585,-0.020294,-0.132905,-0.995279,-0.983111,-0.913526,-0.995112,-0.983185,-0.923527,-0.934724,...,-0.710304,-0.112754,0.0304,-0.464761,-0.018446,-0.841247,0.179941,-0.058627,1,STANDING
1,0.278419,-0.016411,-0.12352,-0.998245,-0.9753,-0.960322,-0.998807,-0.974914,-0.957686,-0.943068,...,-0.861499,0.053477,-0.007435,-0.732626,0.703511,-0.844788,0.180289,-0.054317,1,STANDING
2,0.279653,-0.019467,-0.113462,-0.99538,-0.967187,-0.978944,-0.99652,-0.963668,-0.977469,-0.938692,...,-0.760104,-0.118559,0.177899,0.100699,0.808529,-0.848933,0.180637,-0.049118,1,STANDING
3,0.279174,-0.026201,-0.123283,-0.996091,-0.983403,-0.990675,-0.997099,-0.98275,-0.989302,-0.938692,...,-0.482845,-0.036788,-0.012892,0.640011,-0.485366,-0.848649,0.181935,-0.047663,1,STANDING
4,0.276629,-0.01657,-0.115362,-0.998139,-0.980817,-0.990482,-0.998321,-0.979672,-0.990441,-0.942469,...,-0.699205,0.12332,0.122542,0.693578,-0.615971,-0.847865,0.185151,-0.043892,1,STANDING
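
The `subject` column in the output above matters because the training and test sets were partitioned by volunteer, not by record. The dataset already ships partitioned, so the following is only a sketch of the idea: whole subjects are assigned to one side, which keeps one person's records from appearing in both sets. The function name `split_subjects`, the seed, and the example records are hypothetical.

```python
import random


def split_subjects(subject_ids, train_fraction=0.7, seed=0):
    """Randomly assign whole subjects to the training or the test partition."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(subjects)
    n_train = round(train_fraction * len(subjects))
    return set(subjects[:n_train]), set(subjects[n_train:])


# The dataset has 30 volunteers, identified here as 1..30.
train_subjects, test_subjects = split_subjects(range(1, 31))
print(len(train_subjects), len(test_subjects))

# Hypothetical records as (subject, activity) pairs, routed by their subject.
records = [(1, "STANDING"), (1, "SITTING"), (2, "WALKING"), (30, "LAYING")]
train_records = [r for r in records if r[0] in train_subjects]
test_records = [r for r in records if r[0] in test_subjects]
```

The same consideration applies when carving a validation set out of the training data: splitting by subject gives a more honest estimate of how a model handles people it has not seen.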


## Data Set 3: Poker Hands Classification

The [Poker Hands Classification](https://archive.ics.uci.edu/ml/datasets/Poker+Hand) database comes from the **UCI Machine Learning Repository**.

-  **Poker Hand Dataset**:   
     Each record is an example of a hand consisting of five playing
     cards drawn from a standard deck of 52. Each card is described
     using two attributes (suit and rank), for a total of 10 predictive
     attributes. There is one Class attribute that describes the
     Poker Hand. The order of cards is important, which is why there
     are 480 possible Royal Flush hands as compared to 4 (one for each
     suit, explained in more detail below).
     

-  **Training VS Test Dataset**:  
    Number of Instances: 25010 training, 1,000,000 testing


-  **Attribute Information**:

    1) S1 Suit of card #1
          Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}

    2) C1 Rank of card #1
          Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)

    3) S2 Suit of card #2
          Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}

    4) C2 Rank of card #2
          Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)

    5) S3 Suit of card #3
          Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}

    6) C3 Rank of card #3
          Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)

    7) S4 Suit of card #4
          Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}

    8) C4 Rank of card #4
          Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)

    9) S5 Suit of card #5
          Ordinal (1-4) representing {Hearts, Spades, Diamonds, Clubs}

    10) C5 Rank of card #5
          Numerical (1-13) representing (Ace, 2, 3, ... , Queen, King)

    11) CLASS Poker Hand
          Ordinal (0-9)

        0: Nothing in hand; not a recognized poker hand  
        1: One pair; one pair of equal ranks within five cards  
        2: Two pairs; two pairs of equal ranks within five cards  
        3: Three of a kind; three equal ranks within five cards  
        4: Straight; five cards, sequentially ranked with no gaps  
        5: Flush; five cards with the same suit  
        6: Full house; pair + different rank three of a kind  
        7: Four of a kind; four equal ranks within five cards  
        8: Straight flush; straight + flush  
        9: Royal flush; {Ace, King, Queen, Jack, Ten} + flush  


-  **Class Distribution**:

    The first percentage in parentheses is the representation
    within the training set. The second is the probability in the full domain.

    Training set:

        0: Nothing in hand, 12493 instances (49.95202% / 50.117739%)  
        1: One pair, 10599 instances, (42.37905% / 42.256903%)  
        2: Two pairs, 1206 instances, (4.82207% / 4.753902%)  
        3: Three of a kind, 513 instances, (2.05118% / 2.112845%)  
        4: Straight, 93 instances, (0.37185% / 0.392465%)  
        5: Flush, 54 instances, (0.21591% / 0.19654%)  
        6: Full house, 36 instances, (0.14394% / 0.144058%)  
        7: Four of a kind, 6 instances, (0.02399% / 0.02401%)  
        8: Straight flush, 5 instances, (0.01999% / 0.001385%)  
        9: Royal flush, 5 instances, (0.01999% / 0.000154%)  

    The Straight flush and Royal flush hands are not as representative of  
    the true domain because they have been over-sampled. The Straight flush  
    is 14.43 times more likely to occur in the training set, while the  
    Royal flush is 129.82 times more likely.

    Total of 25010 instances in a domain of 311,875,200.

    Testing set:  

        The value in parentheses indicates the representation within the test  
        set as compared to the entire domain. 1.0 would be perfect representation,  
        while <1.0 are under-represented and >1.0 are over-represented.

        0: Nothing in hand, 501209 instances,(1.000063)  
        1: One pair, 422498 instances,(0.999832)  
        2: Two pairs, 47622 instances, (1.001746)  
        3: Three of a kind, 21121 instances, (0.999647)  
        4: Straight, 3885 instances, (0.989897)  
        5: Flush, 1996 instances, (1.015569)  
        6: Full house, 1424 instances, (0.988491)  
        7: Four of a kind, 230 instances, (0.957934)  
        8: Straight flush, 12 instances, (0.866426)  
        9: Royal flush, 3 instances, (1.948052)  

    Total of one million instances in a domain of 311,875,200.
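
The domain size and the Royal flush count quoted above follow from treating hands as ordered draws. As a quick check, the arithmetic can be reproduced with the standard library; the variable names are illustrative only.

```python
import math

# Ordered five-card hands drawn without replacement from 52 cards.
domain_size = 52 * 51 * 50 * 49 * 48
# One royal flush per suit, counted in every possible card order (5! orderings).
royal_flush_hands = 4 * math.factorial(5)

print(domain_size)         # 311875200
print(royal_flush_hands)   # 480

# Share of royal flushes in the training set versus the full domain.
train_share = 5 / 25010
domain_share = royal_flush_hands / domain_size
print(f"training: {train_share:.5%}, domain: {domain_share:.6%}")
print(f"over-sampling factor: {train_share / domain_share:.1f}")
```

The over-sampling factor comes out at roughly 130, in line with the figure quoted for the Royal flush in the training set.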

### Import Data Set 3

In [5]:
import sys
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_4c3d0bbe98f64d64949e57243722be60 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='cUXAP-2d5lByEhQdpZVMvA6d29wt1Zg3t92oCuXTGDct',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_4c3d0bbe98f64d64949e57243722be60.get_object(Bucket='workshopteam7-donotdelete-pr-eriuhxku3y6swk',Key='poker-hand-training-true.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.head()


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,C
0,1,10,1,11,1,13,1,12,1,1,9
1,2,11,2,13,2,10,2,12,2,1,9
2,3,12,3,11,3,13,3,10,3,1,9
3,4,10,4,11,4,1,4,13,4,12,9
4,4,1,4,13,4,12,4,11,4,10,9
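
The class definitions listed earlier can be written down as a small rule-based labeler, which is handy as a reference when checking a learned model (the challenge itself is about learning these rules, not hand-coding them). The sketch assumes that columns `A1`..`A10` correspond to S1, C1, ..., S5, C5 in that order and that Ten-Jack-Queen-King-Ace counts as a straight; both are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def classify_hand(cards):
    """Classify five (suit, rank) pairs into the dataset's classes 0-9.

    Ranks run from 1 (Ace) to 13 (King); the Ace may also rank high.
    """
    suits = [suit for suit, _ in cards]
    ranks = sorted(rank for _, rank in cards)
    counts = sorted(Counter(ranks).values(), reverse=True)

    is_flush = len(set(suits)) == 1
    is_royal = ranks == [1, 10, 11, 12, 13]
    # Five distinct ranks spanning four steps form a straight; A-10-J-Q-K too.
    is_straight = (len(set(ranks)) == 5 and ranks[4] - ranks[0] == 4) or is_royal

    if is_flush and is_royal:
        return 9
    if is_flush and is_straight:
        return 8
    if counts == [4, 1]:
        return 7
    if counts == [3, 2]:
        return 6
    if is_flush:
        return 5
    if is_straight:
        return 4
    if counts == [3, 1, 1]:
        return 3
    if counts == [2, 2, 1]:
        return 2
    if counts == [2, 1, 1, 1]:
        return 1
    return 0


def row_to_cards(row):
    """Turn a row A1..A10 (suit, rank, suit, rank, ...) into (suit, rank) pairs."""
    return list(zip(row[0::2], row[1::2]))


# The five rows shown above, without the index column and the class label.
head_rows = [
    [1, 10, 1, 11, 1, 13, 1, 12, 1, 1],
    [2, 11, 2, 13, 2, 10, 2, 12, 2, 1],
    [3, 12, 3, 11, 3, 13, 3, 10, 3, 1],
    [4, 10, 4, 11, 4, 1, 4, 13, 4, 12],
    [4, 1, 4, 13, 4, 12, 4, 11, 4, 10],
]
print([classify_hand(row_to_cards(row)) for row in head_rows])
```

All five rows come out as class 9, matching the `C` column in the output above.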


**=================================================================================================================================================**

### Exploratory Data Analysis

Before fitting a model to the data, we should first perform basic **Exploratory Data Analysis** on the dataset.

**Exploratory Data Analysis (EDA)** is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- Maximize insight into a data set;
- Uncover underlying structure;
- Extract important variables;
- Detect outliers and anomalies;
- Test underlying assumptions;
- Develop parsimonious models; 
- Determine optimal factor settings.

Most EDA techniques are **graphical** in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that the main role of EDA is to explore the data open-mindedly, and graphics give analysts unparalleled power to do so: they reveal the data's structural secrets and keep the analyst ready to gain new insight into the data. 

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

- Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).
- Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.
- Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.

*Write your code below to perform Exploratory Data Analysis*

#### Hints: For exploration of this dataset, you may consider the following questions:
* What are the characteristics, summary statistics, and distributions of the attributes and the target?
* Should we visualize the distributions of the attributes and the target?
* Is the training data biased towards certain subjects? Does each subject contribute a similar number of records to the dataset?
* ...

For data visualization, we may need the following code:
```python
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
# Set the default figure size (width, height) in inches
plt.rcParams['figure.figsize'] = (12, 8)
```
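
The hint questions above map onto a handful of pandas calls. The following is a minimal sketch on a made-up frame that mimics the Human Activity data; `toy_df` and its values are hypothetical, and the same calls should apply to whichever DataFrame was loaded.

```python
import pandas as pd

# Toy stand-in for a loaded frame; the values are made up for illustration.
toy_df = pd.DataFrame({
    "tBodyAcc-mean()-X": [0.29, 0.28, 0.28, 0.27, 0.30, 0.26],
    "subject": [1, 1, 1, 2, 2, 3],
    "Activity": ["STANDING", "STANDING", "SITTING", "WALKING", "WALKING", "LAYING"],
})

# Summary statistics of the numeric attributes.
print(toy_df.describe())

# Distribution of the target, as a share of all records.
activity_share = toy_df["Activity"].value_counts(normalize=True)
print(activity_share)

# Number of records contributed by each subject.
records_per_subject = toy_df.groupby("subject").size()
print(records_per_subject)

# Check for missing values per column.
print(toy_df.isna().sum())
```

With matplotlib set up, calling `.plot(kind="bar")` on `activity_share` or `records_per_subject` gives a quick graphical view of the same counts.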

In [9]:
## Summary Statistics


In [10]:
## Graphical Perspective on the Dataset


In [11]:
## ...


### Consider the Goal
After performing **Exploratory Data Analysis** on the dataset, we have a basic feel for the data.  
Now, it is time to consider the goal of our project.  

- **For dataset 1:**  
    **We define a flight as delayed when its DepDelay is greater than 15 minutes. How can we classify whether a flight is delayed using the attributes?**  

- **For dataset 2:**  
    **How can we classify the six human activities using hundreds of sensor-generated attributes? What are the common patterns in human activities? Do clusters exist among the human activities?**

- **For dataset 3:**  
    **How can we classify the poker hand classes using the ten suit and rank attributes? The intent of this challenge is automatic rule induction, i.e. to learn the rules using machine learning, without hand-coding heuristics.**

*Write your markdown below to tell us what your project goal is.*

Our team chooses the following project goal: 

We plan to use XXXX (write the model you intend to use) models to tackle the problem!


### Modeling Period

Now that we have clarified our goal and studied the dataset in detail, we move on to the modeling period.

For classification problems, we may need the following (**for reference only; you are not limited to these**):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
```
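
As a starting point, the two classifiers imported above can be fitted and compared on a held-out split. The sketch below runs on synthetic data from `make_classification` so that it is self-contained; the 70/30 split, the hyperparameters, and the random seeds are illustrative assumptions, not tuned choices for any of the three datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared feature matrix X and binary target y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 30% of the records to estimate performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)                        # learn from the training split
    test_accuracy[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(name, round(test_accuracy[name], 3))
```

For the real datasets, `X` and `y` would be replaced by the numeric feature columns and the target column of the chosen DataFrame; categorical columns such as `UniqueCarrier` would need encoding first.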

*Write your code below to perform Modeling*

In [20]:
## write your code here

In [21]:
## ...

### Model Evaluation

Let us consider the obvious question, "How do we estimate the performance of a machine learning model?"  


A typical answer to this question might be as follows: 
- First, we feed the training data to our learning algorithm to learn a model. 
- Second, we predict the labels of our training/test set. 
- Third, we count the number of wrong predictions on the training/test dataset to compute the model’s prediction accuracy.  

For **Classification** problems, we should consider:
- Classification Accuracy.
- Loss.
- Area Under ROC Curve.
- Confusion Matrix.
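
To make the four measures above concrete, here is a minimal sketch that computes each of them for a binary problem using only the standard library. It interprets "Loss" as log loss, which is an assumption; the labels and probabilities are made up. In practice `sklearn.metrics` offers `accuracy_score`, `log_loss`, `roc_auc_score`, and `confusion_matrix` for the same purpose.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]] for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def log_loss(y_true, y_prob):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p if t == 1 else 1 - p)
                for t, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((pos > neg) + 0.5 * (pos == neg)
               for pos in positives for neg in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(p >= 0.5) for p in y_prob]   # threshold at 0.5

print(accuracy(y_true, y_pred))            # 0.75
print(confusion_matrix(y_true, y_pred))    # [[2, 0], [1, 1]]
print(round(log_loss(y_true, y_prob), 4))
print(roc_auc(y_true, y_prob))             # 0.75
```

Accuracy and the confusion matrix depend on the chosen threshold, while log loss and ROC AUC are computed from the probabilities directly. This matters for imbalanced targets such as flight delays or rare poker hands, where accuracy alone can look good for a model that never predicts the minority class.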

*Write your code below to perform Model Evaluation*

In [None]:
## write your code here

In [None]:
## ...

### Further Looking into the Modeling Result ...

This part is open-ended: we can do relevant analysis on the dataset with our own sparkling ideas.  
- We may want to see what kinds of records are misclassified.
- We may do whatever other analysis : ) Just try it!!
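
One simple way to look at misclassified records is to put the true and predicted labels side by side and filter on disagreement. The sketch below uses a made-up results frame; the column names `y_true` and `y_pred` and all values are hypothetical.

```python
import pandas as pd

# Made-up test records with true and predicted labels, for illustration only.
results = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3],
    "y_true": ["SITTING", "STANDING", "SITTING", "WALKING", "STANDING", "LAYING"],
    "y_pred": ["STANDING", "STANDING", "SITTING", "WALKING", "SITTING", "LAYING"],
})

# Keep only the records whose prediction differs from the true label.
misclassified = results[results["y_true"] != results["y_pred"]]
print(misclassified)

# Count which (true, predicted) pairs are confused most often.
confused_pairs = misclassified.groupby(["y_true", "y_pred"]).size()
print(confused_pairs)
```

Grouping the misclassified rows by other columns, such as `subject` here or `UniqueCarrier` in the flights data, can show whether errors concentrate in particular groups.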

In [None]:
## your performance is beyond your imagination!

In [7]:
## ...

### Reference
1. [Flights Dataset](http://stat-computing.org/dataexpo/2009/the-data.html)
2. [Human Activity Recognition](https://archive.ics.uci.edu/ml/datasets/human+activity+recognition+using+smartphones)
3. [Poker Hands Classification](https://archive.ics.uci.edu/ml/datasets/Poker+Hand)
4. [Model Evaluation](https://sebastianraschka.com/pdf/manuscripts/model-eval.pdf)