# Business Intelligence - Group 53 - Assignment 2

## Packages 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import glob

# ...

## Load data

In [4]:
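def load_ufc_data():
    """Load the UFC fight data from ../data/data.csv into a DataFrame."""
    path = Path("../data/data.csv")
    # header=0: the first row holds the column names; fields are comma-separated
    ufc_data = pd.read_csv(path, header=0, delimiter=",")

    return ufc_data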

data_ufc = load_ufc_data()


## Business understanding

To understand the data, one should first understand the business. The glossary below explains the terms used in MMA (UFC).
* **Knockdown**: A strike that sends the opponent to the ground. A knockdown does not necessarily end the fight. If a fighter loses consciousness ("goes limp") as a result of legal strikes, the fight is declared a knockout (KO).
* **(Guard) pass**: A way for the fighter on top to get past the legs of the fighter on the bottom in order to reach a dominant position on the ground.
* **Reversal**: A transition from a neutral or inferior position to a dominant position.
* **Submission**: Yielding to the opponent, which results in an immediate defeat. The submission, also referred to as a "tap out" or "tapping out", is usually signalled to the opponent and/or the referee by visibly tapping the floor or the opponent with the hand (in some cases with the foot), or by saying the word "tap".
* **Takedown**: A takedown is a technique that involves off-balancing an opponent and bringing him or her to the ground with the attacker landing on top.
* **Strikes**: Strikes are grouped in two different ways:
  * based on body sections: **HEAD**, **BODY**, **LEG**
  * based on positions/ranges
    * **Clinch**: A position in which two standing fighters have grabbed hold of one another. Clinch strikes are strikes given and taken in this position.
    * **Ground**: Strikes given and taken when the fighters are on the ground.
    * **Distance**: All strikes that are neither clinch strikes nor ground strikes. This is the most common type of striking.


## Data understanding

Analyze the characteristics of the dataset (size, attribute types and semantics as discussed in class, value ranges, sparsity, min/max values, outliers, missing values, ...), and describe this in the report

### Data description

The following clustering of attributes is based on logical grouping, not on the attribute type.

Nominal attributes | Description 
--- | --- 
r_fighter | Name of red fighter 
b_fighter | Name of blue fighter
Referee | Name of Referee
location | Fight location
weight_class | Weight class the fighters of this bout belong to
b/r_Stance | Fighting stance (which foot is placed closer to the opponent, e.g. left in orthodox stance and right in southpaw)

Ordinal attributes | Description
--- | --- 
Date | Date of fight

Binary attributes | Description
--- | ---
title_bout | Stating whether the bout was a title bout
Winner | Winner of the fight

Interval attributes | Description
--- | --- 
no_of_rounds | Number of rounds the bout had.
b/r_total_rounds_fought | Number of rounds the fighter fought in total
b/r_draw | How many draws did the fighter have
b/r_current_loss_streak | How many fights did the fighter lose in a row since last win
b/r_current_win_streak | How many fights did the fighter win in a row since the last loss
b/r_longest_win_streak | How many fights did the fighter win in a row at his/her longest
b/r_losses | Total number of losses for the fighter
b/r_total_time_fought(seconds) | Total fighting time in seconds
b/r_total_title_bouts | How many title bouts did the fighter have so far
b/r_win_by_Decision_Majority | ...
b/r_win_by_Decision_Split | ...
b/r_win_by_Decision_Unanimous | ...
b/r_win_by_KO/TKO | ... 
b/r_win_by_Submission | ...
b/r_win_by_TKO_Doctor_Stoppage | ...
b/r_wins | Total number of wins for the fighter
b/r_Height_cms | The fighter's height
b/r_Reach_cms | The fighter's reach
b/r_Weight_lbs | The fighter's weight in lbs
b/r_age | Age of fighter

Ratio attributes | Description
--- | ---
b/r_avg_BODY_att | Average body strikes attempted (fighter level)
b/r_avg_BODY_landed | Average body strikes landed (fighter level)
b/r_avg_HEAD_att | Average head strikes attempted (fighter level)
b/r_avg_HEAD_landed | Average head strikes landed (fighter level)
b/r_avg_LEG_att | Average leg strikes attempted (fighter level)
b/r_avg_LEG_landed | Average leg strikes landed (fighter level)
b/r_avg_CLINCH_att | Average clinch strikes attempted (fighter level)
b/r_avg_CLINCH_landed | Average clinch strikes landed (fighter level)
b/r_avg_DISTANCE_att | ...
b/r_avg_DISTANCE_landed | ...
b/r_avg_GROUND_att | ...
b/r_avg_GROUND_landed | ...
b/r_avg_SIG_STR_att | Average significant strikes attempted (fighter level)
b/r_avg_SIG_STR_landed | Average significant strikes landed (fighter level)
b/r_avg_SIG_STR_pct | ...
b/r_avg_TD_att | Average takedowns attempted (fighter level)
b/r_avg_TD_landed | Average takedowns landed (fighter level)
b/r_avg_TD_pct | ...
b/r_avg_TOTAL_STR_att | Average total strikes attempted (fighter level)
b/r_avg_TOTAL_STR_landed | Average total strikes landed (fighter level)
b/r_avg_KD | Average knockdowns (fighter level)
b/r_avg_PASS | Average guard passes (fighter level)
b/r_avg_REV | Average reversals (fighter level)
b/r_avg_SUB_ATT | Average submission attempts (fighter level)





In [8]:
# Average, Min/max values, Variance, standard deviation, mode, skewness, Correlation between attributes
# Cross-check semantics and attribute values!
# Check data volumes 

def basic_statistics():
    
    return None

def semantics_check():
    # Relations expected to hold between attributes ([asserted] = checked below):
    # R_avg_SIG_STR_att = R_avg_CLINCH_att + R_avg_DISTANCE_att + R_avg_GROUND_att  [asserted]
    # R_avg_SIG_STR_att = R_avg_BODY_att + R_avg_HEAD_att + R_avg_LEG_att  [asserted]
    # Not yet checked:
    # R_avg_TOTAL_STR_att = R_avg_BODY_att + R_avg_HEAD_att + R_avg_LEG_att + R_avg_CLINCH_att + R_avg_DISTANCE_att + R_avg_GROUND_att
    # R_avg_BODY_att >= R_avg_BODY_landed
    # b/r_current_win_streak <= b/r_wins
    # b/r_current_loss_streak <= b/r_losses
    # b/r_wins = b/r_win_by_Decision_Majority + b/r_win_by_Decision_Split + b/r_win_by_Decision_Unanimous
    # + b/r_win_by_KO/TKO + b/r_win_by_Submission + b/r_win_by_TKO_Doctor_Stoppage
    data_ufc_check = data_ufc.copy()
    data_ufc_check.dropna(inplace=True)
       
    assert np.allclose(data_ufc_check.loc[:,'R_avg_SIG_STR_att'],
                data_ufc_check.loc[:,['R_avg_CLINCH_att','R_avg_DISTANCE_att',
                                      'R_avg_GROUND_att']].sum(axis=1), 
                rtol=0.00001)
    
    assert np.allclose(data_ufc_check.loc[:,'R_avg_SIG_STR_att'],
                data_ufc_check.loc[:,['R_avg_BODY_att','R_avg_HEAD_att','R_avg_LEG_att']].sum(axis=1), 
                rtol=0.00001)
                         
    return "Assertions passed. Check the assertions to learn more about the semantics of this data set."

semantics_check()


'Assertions passed. Check the assertions to learn more about the semantics of this data set.'
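
The statistics listed in the cell comment above could be computed along the following lines. This is only a sketch: the helper name `summarize_numeric` and the toy frame are hypothetical stand-ins, and it assumes that the statistics are wanted for the numeric columns only. In the notebook the helper would be called with `data_ufc`.

```python
import pandas as pd

def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary statistics for the numeric columns of df."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({
        "mean": numeric.mean(),
        "min": numeric.min(),
        "max": numeric.max(),
        "var": numeric.var(),
        "std": numeric.std(),
        "mode": numeric.mode().iloc[0],   # first mode if there are several
        "skew": numeric.skew(),
        "missing": numeric.isna().sum(),  # number of missing values
    })

# Toy data standing in for data_ufc (hypothetical values).
toy = pd.DataFrame({
    "R_wins": [1, 2, 3, 4],
    "R_age": [25.0, 30.0, None, 35.0],
    "weight_class": ["Lightweight", "Welterweight", "Lightweight", "Heavyweight"],
})

stats = summarize_numeric(toy)
# Pairwise Pearson correlation between the numeric attributes.
corr = toy.select_dtypes(include="number").corr()
print(stats)
print(corr)
```

Non-numeric columns such as `weight_class` are skipped here; their value counts would have to be inspected separately.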

### Data exploration

In [None]:
# Visual exploration
# Plot basic statistics
# Identify interesting subpopulations
## E.g. only male fighters and only 4-5 most popular weight classes
# Form hypotheses and identify actions
## E.g. which attributes contribute significantly to the weight class
# Transform the hypothesis into a data mining goal, if possible



### Data quality

In [8]:
# Identify special values and catalog their meaning
# Find missing values and outliers.
# Check for deviations, decide whether it is “noise” or may indicate an interesting phenomenon.
# Check for plausibility of values
# Verify that the meanings of attributes and contained values fit
# Establish the meaning of missing data: why is it missing?
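
A possible starting point for the checks listed above is sketched here. It assumes that the 1.5 × IQR rule is an acceptable first outlier criterion, which is a choice made for this example and not prescribed by the assignment. The helper names and toy values are hypothetical.

```python
import pandas as pd

def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

def iqr_outliers(series, factor=1.5):
    """Boolean mask of values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)

# Toy data standing in for data_ufc (hypothetical values).
toy = pd.DataFrame({
    "R_Reach_cms": [180.0, 182.0, None, 185.0, 178.0, 181.0],
    "R_age": [25, 27, 29, 31, 30, 65],
})

shares = missing_share(toy)
mask = iqr_outliers(toy["R_age"])
print(shares)
print(toy.loc[mask, "R_age"])
```

Flagged values still need a manual plausibility check: an unusually old fighter may be a data entry error or a genuine, interesting case.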


## Data preparation 

### Select data

Subsampling: If the entire dataset is too large to be processed in its entirety, choose a subsampling strategy to get the dataset to a manageable size. Describe in your report why and how you did it. Make sure your experiment is repeatable. (No manual selection of instances, everything must be in code.)
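
If subsampling turns out to be necessary, a repeatable, code-only strategy could look like the sketch below. It assumes that stratifying by the class attribute is desired so that class proportions are preserved. The helper name, the seed value and the `fight_id` column are hypothetical.

```python
import pandas as pd

def stratified_subsample(df, column, frac, seed=42):
    """Draw the same fraction from every group of `column`, reproducibly."""
    # A fixed random_state makes repeated runs return the same rows.
    return df.groupby(column).sample(frac=frac, random_state=seed)

# Toy data standing in for data_ufc (hypothetical values).
toy = pd.DataFrame({
    "weight_class": ["Lightweight"] * 10 + ["Heavyweight"] * 6,
    "fight_id": range(16),
})

sample_a = stratified_subsample(toy, "weight_class", frac=0.5)
sample_b = stratified_subsample(toy, "weight_class", frac=0.5)
print(sample_a["weight_class"].value_counts())
```

Because the seed is fixed, `sample_a` and `sample_b` contain exactly the same rows.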

### Clean data

In [None]:
# Handle missing values and outliers
# e.g. by deletion or imputation
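
The two options named above, deletion and imputation, can be sketched as follows. Median imputation is used purely as an example strategy, and the toy values are hypothetical.

```python
import pandas as pd

# Toy data standing in for data_ufc (hypothetical values).
toy = pd.DataFrame({
    "R_Reach_cms": [180.0, None, 184.0],
    "R_age": [25, 30, 35],
})

# Option 1: deletion. Drops every row with a missing reach.
deleted = toy.dropna(subset=["R_Reach_cms"])

# Option 2: imputation. Replaces missing reach values with the column median.
imputed = toy.copy()
imputed["R_Reach_cms"] = imputed["R_Reach_cms"].fillna(imputed["R_Reach_cms"].median())

print(deleted)
print(imputed)
```

When the imputed data is later used for classification, the median should be computed on the training portion only, so that no information from the test set leaks into training.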

### Construct data

Preprocessing: Get the data into the form needed for training your two algorithms. Describe your preprocessing steps (e.g. transcoding, scaling), why you did it and how you did it

In [None]:
# Transform to different attribute types (Binning, 1-to-n coding, …)
# Add new attributes to the accessed data
# Decide if any attribute should be normalized
# 
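
The transformations listed above (binning, 1-to-n coding, normalization) can be sketched with plain pandas. The bin edges, bin labels and toy values are hypothetical, and the helpers `min_max` and `z_score` are written out only to make the formulas explicit.

```python
import pandas as pd

def min_max(series):
    """Rescale to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())

def z_score(series):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    return (series - series.mean()) / series.std(ddof=0)

# Toy data standing in for data_ufc (hypothetical values).
toy = pd.DataFrame({
    "R_Stance": ["Orthodox", "Southpaw", "Orthodox"],
    "R_age": [22, 29, 37],
    "R_Reach_cms": [170.0, 180.0, 190.0],
})

# 1-to-n coding: one indicator column per stance.
encoded = pd.get_dummies(toy, columns=["R_Stance"])

# Binning: hypothetical age groups.
encoded["R_age_group"] = pd.cut(toy["R_age"], bins=[0, 25, 32, 100],
                                labels=["young", "prime", "veteran"])

# Normalization: two common scalings of the same attribute.
encoded["R_Reach_minmax"] = min_max(toy["R_Reach_cms"])
encoded["R_Reach_z"] = z_score(toy["R_Reach_cms"])
print(encoded)
```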

In [None]:
MY ANSWER HERE

## Modeling 

Pick two significantly different classification algorithms, i.e. NO two variations of the same algorithm.
* SVM and Random Forest
* Class attribute: weight_class

### Modeling technique

Describe why you chose the respective algorithms and briefly summarize their characteristics and the semantics underlying their parameters.

MY ANSWER HERE

### Test design

The model will be tested with incrementally varying train/test splits. We start with a 5%/95% (train/test) split and increase the training share in steps of 10 percentage points until 95%/5% is reached.
For each training set size we perform multiple runs to observe how sensitive the results are to the particular subset used for training.

In [None]:
# Scaling
# Parameters
## Explore parameters with Grid Search and cross validation
# Training and test
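
The split schedule described in this section can be sketched with NumPy alone. The function names, the number of runs and the seed are assumptions made for this example. The splits are plain random permutations without stratification.

```python
import numpy as np

def split_schedule(start=0.05, stop=0.95, step=0.10):
    """Training fractions from `start` to `stop` in increments of `step`."""
    n_steps = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n_steps)]

def repeated_splits(n_samples, train_fraction, n_runs, seed=0):
    """Yield n_runs random (train_idx, test_idx) pairs for one training fraction."""
    rng = np.random.default_rng(seed)
    n_train = max(1, int(round(train_fraction * n_samples)))
    for _ in range(n_runs):
        perm = rng.permutation(n_samples)
        yield perm[:n_train], perm[n_train:]

schedule = split_schedule()
for fraction in schedule:
    for train_idx, test_idx in repeated_splits(100, fraction, n_runs=3):
        # A model would be trained on train_idx and evaluated on test_idx here.
        pass
print(schedule)
```

Collecting the score of every run per fraction then allows both the mean and the variance of the performance to be plotted against the training set size.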

### Build Model

Train your two algorithms in 3 separate experiment tracks as detailed below and evaluate your results with a reasonable quality measure for your algorithms (e.g.: (micro/macro) Precision/Recall, Mean Absolute Error,…). Interpret your results using both graphs and summaries (e.g. confusion matrices). For each of the 3 experiment tracks you should separately vary and document:
* Parameters: If the classifier has specific parameters, explore their effect with different settings using 10-fold cross-validation and document the parameters and the results and analyze the sensitivity of classification outcomes against these parameters. Specifically, test extreme/obviously wrong settings and analyze the results
* Scaling: where possible, try different scaling approaches (min/max, zero mean/unit variance, length) using the best parameters identified above and observe the difference in classification performance using 10-fold cross-validation. Analyze the reasons for the effects observed, test useful and also non-useful (!) scalings and summarize your findings as well as analyze reasons why specific scalings make sense in a given setting.
* Training / test set splits: Use the best parameter setting and scaling identified above and evaluate the effect of different training and test set splits. Start with a small training set and increase it in small increments (e.g. 10 sets from 5% / 95% (train/test) in 10%-increments to 95%/5% (train/test)) and observe performance changes. Perform multiple runs with each training set size to observe the sensitivity to the actual subset used for training a specific run. Analyze the variance in performance obtained
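
The first two tracks above could be set up with scikit-learn roughly as follows. This is a sketch under several assumptions: synthetic data from `make_classification` stands in for the prepared UFC features, only the SVM and only its `C` parameter are explored, macro recall is chosen as the quality measure, and the parameter search is run without scaling.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the prepared UFC features.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Track 1: parameters, including deliberately extreme values of C.
candidate_C = [0.001, 1, 1000]
param_search = GridSearchCV(
    Pipeline([("scale", "passthrough"), ("clf", SVC())]),
    {"clf__C": candidate_C}, cv=cv, scoring="recall_macro")
param_search.fit(X, y)
best_C = param_search.best_params_["clf__C"]

# Track 2: scaling, using the best parameter found above.
scalers = {
    "none": "passthrough",
    "min-max": MinMaxScaler(),
    "z-score": StandardScaler(),
    "unit length": Normalizer(),
}
scaling_scores = {}
for name, scaler in scalers.items():
    model = Pipeline([("scale", scaler), ("clf", SVC(C=best_C))])
    scores = cross_val_score(model, X, y, cv=cv, scoring="recall_macro")
    scaling_scores[name] = scores.mean()

print(best_C, scaling_scores)
```

Placing the scaler inside the pipeline means it is refitted on the training folds of every cross-validation split, so the held-out fold never influences the scaling.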

### Assess Model

## Summary

* What trends do you observe in each set of experiments?
* How easy was it to interpret the algorithm and its performance?
* Which classes are most frequently mixed-up? (and why?)
* What parameter settings cause performance changes?
* Do both algorithms show the same behavior in performance, performance degradation / robustness against
  * smaller and larger training set sizes?
  * variations in parameter settings?
* Did you observe or can you force and document characteristics such as over-learning?
* How does the performance change with different amounts of training data being available? What are the best scalings (per attribute / per vector) and why?



MY ANSWER HERE