# Project 3 - Data Science

# UFC Project

Name: Victor Vergara Arcoverde de Albuquerque Cavalcanti

Name: Edgard Ortiz Neto

Name: Gabriel Yamashita

Name: Henrique Mualem Marti



  ___
## Objective:

### The goal of this project is to build a machine learning model that predicts which fighter will win a UFC fight based on the fighters' records. To do this, we use data from all UFC fights (more than 5 thousand) to find out which fighter attributes have the greatest impact on fight outcomes.
### The project could therefore be useful to people betting on UFC fights and to the athletes themselves, who could compare their own statistics with those of their opponents, see where they stand relative to them, and decide which attributes to train or maintain in order to keep an advantage.

[Dataset used](https://www.kaggle.com/rajeevw/ufcdata#data.csv)

____
## Chosen methods:






### Random Forest:
#### This method trains many decision trees, each on a random sample of the data, and combines their votes to predict the outcome we want, in this case which fighter wins. Combining many trees usually yields fewer errors than relying on a single tree.

![randomforest.png](randomforest.png)
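The voting idea behind a Random Forest can be sketched in a few lines. The three "trees" below are hypothetical hand-written rules, not trees learned from the data, and the fight values are invented for illustration; the sketch only shows how individual votes are combined into one prediction.

```python
from collections import Counter

# Three hypothetical "trees", each a simple hand-written rule on a fight's features.
def tree_age(fight):
    # Votes for the younger fighter.
    return "Red" if fight["R_age"] < fight["B_age"] else "Blue"

def tree_reach(fight):
    # Votes for the fighter with the longer reach.
    return "Red" if fight["R_Reach_cms"] > fight["B_Reach_cms"] else "Blue"

def tree_streak(fight):
    # Votes for the fighter with the longer current win streak (ties go to Red).
    return "Red" if fight["R_current_win_streak"] >= fight["B_current_win_streak"] else "Blue"

def forest_predict(trees, fight):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(fight) for tree in trees)
    return votes.most_common(1)[0][0]

# Invented example fight (values are assumptions, not taken from the dataset).
fight = {
    "R_age": 32.0, "B_age": 31.0,
    "R_Reach_cms": 170.0, "B_Reach_cms": 165.0,
    "R_current_win_streak": 4, "B_current_win_streak": 2,
}
prediction = forest_predict([tree_age, tree_reach, tree_streak], fight)
print(prediction)  # Red (two votes against one)
```

In a real Random Forest the rules are learned from random samples of the training data, which is what `RandomForestClassifier` does internally.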



### Logistic Regression:
#### This method uses the function below, whose output always lies between 0 and 1, which makes it suitable as a binary classifier. It assigns a coefficient (β) to each factor considered, which gives an indication of how much each factor affects the final result.

$$Prob(y = 1 | X = x) = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2\right)}}$$
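As a minimal sketch, the formula above can be evaluated directly. The coefficient values below are hypothetical and chosen only to show how the output stays between 0 and 1; they are not fitted to the UFC data.

```python
import math

def win_probability(x1, x2, beta0, beta1, beta2):
    """P(y = 1 | x) for the two-factor logistic model in the formula above."""
    z = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
p_neutral = win_probability(0.0, 0.0, beta0=0.0, beta1=0.5, beta2=-0.25)
p_high = win_probability(10.0, 0.0, beta0=0.0, beta1=0.5, beta2=-0.25)

print(p_neutral)  # 0.5, because the linear term is zero
print(p_high)     # close to 1, because the linear term is large and positive
```

A positive coefficient pushes the probability toward 1 as its factor grows, and a negative one pushes it toward 0.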

 ___
## Setting up the Jupyter environment:


### Imports:

In [2]:
import math
import os.path
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import json
import random
import statsmodels.api as sm
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


### Working with the Excel files:

In [3]:
data = pd.read_excel("data.xlsx")
data.head(2)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019,"Chicago, Illinois, USA",Red,True,Bantamweight,5,0,...,2,0,0,8,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019,"Chicago, Illinois, USA",Red,True,Women's Flyweight,5,0,...,0,2,0,5,Southpaw,165.1,167.64,125.0,32.0,31.0


In [4]:
data.weight_class = data.weight_class.astype('category')
data.Winner = data.Winner.astype('category')

In [5]:
data.Winner.value_counts()

Red     3470
Blue    1591
Draw      83
Name: Winner, dtype: int64
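The counts above already imply a useful reference point: a model that always predicts the most frequent class. The sketch below computes that majority-class baseline from the three counts shown; it applies to the full table, before any rows are dropped.

```python
# Class counts copied from the value_counts() output above.
counts = {"Red": 3470, "Blue": 1591, "Draw": 83}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(total)                        # 5144
print(majority_class)               # Red
print(round(baseline_accuracy, 4))  # 0.6746
```

Any classifier trained on this data should be judged against this baseline rather than against 0.5.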

In [6]:
# Selecting only the fights between fighters in the 'Heavyweight' class, since the characteristic
#data_heavy = data.loc[(data.weight_class=='Heavyweight'),:]
bool_to_number = {False: 0, True: 1}
string_to_number = {'Blue': 0, 'Red': 1, 'Draw': 2}
data['title_bout'] = data['title_bout'].map(bool_to_number)
data['Winner'] = data['Winner'].map(string_to_number)
data.head(2)

Unnamed: 0,R_fighter,B_fighter,Referee,date,location,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,Marc Goddard,2019,"Chicago, Illinois, USA",1,1,Bantamweight,5,0,...,2,0,0,8,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,Robert Madrigal,2019,"Chicago, Illinois, USA",1,1,Women's Flyweight,5,0,...,0,2,0,5,Southpaw,165.1,167.64,125.0,32.0,31.0


#### Blue = 0
#### Red = 1
#### Draw = 2

In [7]:
data_util = data.drop(['Referee','date','location'], axis=1)
# columns that are unrelated to the fighters or their results

In [8]:
data_util.head(2)

Unnamed: 0,R_fighter,B_fighter,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,B_current_win_streak,B_draw,B_avg_BODY_att,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,1,1,Bantamweight,5,0,4,0,9.2,...,2,0,0,8,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,1,1,Women's Flyweight,5,0,3,0,14.6,...,0,2,0,5,Southpaw,165.1,167.64,125.0,32.0,31.0


In [9]:
data_util.dropna(inplace=True)
data_util.head(2)

Unnamed: 0,R_fighter,B_fighter,Winner,title_bout,weight_class,no_of_rounds,B_current_lose_streak,B_current_win_streak,B_draw,B_avg_BODY_att,...,R_win_by_KO/TKO,R_win_by_Submission,R_win_by_TKO_Doctor_Stoppage,R_wins,R_Stance,R_Height_cms,R_Reach_cms,R_Weight_lbs,B_age,R_age
0,Henry Cejudo,Marlon Moraes,1,1,Bantamweight,5,0,4,0,9.2,...,2,0,0,8,Orthodox,162.56,162.56,135.0,31.0,32.0
1,Valentina Shevchenko,Jessica Eye,1,1,Women's Flyweight,5,0,3,0,14.6,...,0,2,0,5,Southpaw,165.1,167.64,125.0,32.0,31.0



### Removing the categorical data:

In [12]:
categoricas = [
    'R_fighter', 
    'B_fighter', 
    'weight_class', 
    'R_Stance', 
    'B_Stance', 
]

data_cat = data_util[categoricas].astype('category')
data_num = data_util.drop(categoricas, axis=1).astype('float')

___
# Test 1 - Random Forest
___

In [13]:
X = data_num.drop('Winner', axis=1)
y = data_num['Winner']

## Splitting the data into training and test sets

In [14]:
X_train_random, X_test_random, y_train_random, y_test_random = train_test_split(X, y, test_size=0.25)
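Because the split above is random and unseeded, the class proportions in the test set change from run to run. scikit-learn's `train_test_split` accepts `stratify` and `random_state` arguments for this purpose; the sketch below shows the underlying idea in plain Python on toy labels. The function name `stratified_split` and the toy label list are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx), keeping each class's share roughly equal in both parts."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with the same coding as the notebook: 1 = Red, 0 = Blue, 2 = Draw.
labels = [1] * 60 + [0] * 36 + [2] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=42)
test_labels = [labels[i] for i in test_idx]

print(len(train_idx), len(test_idx))  # 75 25
print(test_labels.count(1), test_labels.count(0), test_labels.count(2))  # 15 9 1
```

With stratification, the small Draw class is represented in the test set in the same proportion as in the full data, so accuracy figures from different runs are easier to compare.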


## Building the Random Forest model

In [15]:
model_random = RandomForestClassifier(n_estimators=10000)

model_random.fit(X_train_random, y_train_random)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10000, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


## Checking the performance


In [16]:
X_train_random.columns

Index(['title_bout', 'no_of_rounds', 'B_current_lose_streak',
       'B_current_win_streak', 'B_draw', 'B_avg_BODY_att', 'B_avg_BODY_landed',
       'B_avg_CLINCH_att', 'B_avg_CLINCH_landed', 'B_avg_DISTANCE_att',
       ...
       'R_win_by_Decision_Unanimous', 'R_win_by_KO/TKO', 'R_win_by_Submission',
       'R_win_by_TKO_Doctor_Stoppage', 'R_wins', 'R_Height_cms', 'R_Reach_cms',
       'R_Weight_lbs', 'B_age', 'R_age'],
      dtype='object', length=136)

In [17]:
model_random.feature_importances_

array([0.00042941, 0.00070715, 0.00263048, 0.00369951, 0.        ,
       0.00959468, 0.00901436, 0.00806852, 0.00830459, 0.01084518,
       0.00987314, 0.00871424, 0.00848533, 0.01113945, 0.01033177,
       0.00539201, 0.00871601, 0.00848231, 0.0070901 , 0.00424956,
       0.01127369, 0.0096992 , 0.00944605, 0.00614024, 0.01021887,
       0.00696124, 0.00732543, 0.00902843, 0.0089554 , 0.00392703,
       0.00434952, 0.00868978, 0.00881997, 0.00917484, 0.00875685,
       0.00931551, 0.0088895 , 0.00944962, 0.00929288, 0.00806436,
       0.00866593, 0.00489179, 0.00832525, 0.00819839, 0.0065268 ,
       0.00419545, 0.00839198, 0.00811255, 0.01019198, 0.00590009,
       0.00813453, 0.00658759, 0.00833031, 0.00883079, 0.01051353,
       0.0074249 , 0.00774888, 0.00161927, 0.00028301, 0.00154974,
       0.0031517 , 0.0032807 , 0.0032997 , 0.00064713, 0.00510914,
       0.00633976, 0.00766217, 0.00545352, 0.0023549 , 0.00312269,
       0.        , 0.00957126, 0.00858764, 0.00942573, 0.00917

In [18]:
y_pred_random = model_random.predict(X_test_random)

In [19]:
print(accuracy_score(y_test_random, y_pred_random))

0.6281094527363185


In [20]:
y_pred_random

array([1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       1., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 0.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0., 1.,
       1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.,
       1., 1., 1., 1., 1.

In [21]:
y_test_random.value_counts(True)

1.0    0.611940
0.0    0.368159
2.0    0.019900
Name: Winner, dtype: float64

In [22]:
data_num.Winner.value_counts()

1.0    2023
0.0    1141
2.0      51
Name: Winner, dtype: int64

   ___
## Conclusion for the initial Random Forest model:

An accuracy of 0.6281 is not a good result: the model almost always predicts red as the winner, and since red wins 0.6119 of the fights in the test set, it essentially gets right only the fights won by red and misses those won by blue.

This suggests that some variables need to be dropped to improve accuracy.
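The point made above, that a model which nearly always predicts red can still reach a seemingly reasonable accuracy, becomes visible when the hit rate is computed per class. The labels below are hypothetical and exist only to illustrate the effect; `recall_per_class` is a helper written for this sketch.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """Fraction of each true class that the model predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical labels (1 = Red, 0 = Blue): the model predicts Red for 9 of 10 fights.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = recall_per_class(y_true, y_pred)

print(accuracy)  # 0.7
print(recalls)   # {1: 1.0, 0: 0.25}
```

Here the overall accuracy is 0.7, yet only a quarter of the fights won by blue are identified, which is the pattern described in the conclusion.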

   ___
## Building a dataframe of the factors with the greatest impact on the result according to the test above:

In [23]:
# Factors with the greatest weight in deciding the winner
# The loop below builds the rank labels ('1°', '2°', ...) for the sorted factors
j=1
lista_j=list()
for i, f in sorted(list(zip(model_random.feature_importances_, X_train_random.columns)), reverse=True):
    a=str(j)+'°'
    lista_j.append(a)
    j+=1

In [24]:
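# A new variable name is used so the original `data` DataFrame is not overwritten.
# Note: the 'Correlação' column holds the Random Forest feature importances, not correlations.
dados_fatores={'Fator':X_train_random.columns ,'Correlação':model_random.feature_importances_,}
Fator_por_corr=pd.DataFrame(dados_fatores)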
Fator_por_corr=Fator_por_corr.sort_values(by='Correlação', ascending=False)
Fator_por_corr['Grau de Importância']=lista_j
Fator_por_corr = Fator_por_corr.set_index('Grau de Importância')
Fator_por_corr.head(2)

Grau de Importância,Fator,Correlação
1°,R_age,0.014742
2°,R_avg_opp_HEAD_landed,0.013201


   ___
## Choosing which data should be used in the prediction models:

In [25]:
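def relevancia(df,coluna_nome,coluna_correlacao,acuracia):
    # Returns 'Winner' plus every factor whose value lies outside [-acuracia, acuracia].
    # Here 'acuracia' acts as an importance threshold, not as a model accuracy.
    inuteis = []  # factors at or below the threshold (collected but not returned)
    uteis = ['Winner']
    for index,row in df.iterrows():
        if row[coluna_correlacao] >= -acuracia and row[coluna_correlacao] <= acuracia:
            inuteis.append(row[coluna_nome])
        else:
            uteis.append(row[coluna_nome])
    return uteis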

In [26]:
uteis = relevancia(Fator_por_corr,'Fator','Correlação',0.01)
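The threshold rule applied by `relevancia` can be sketched on a handful of values. The four importances below are copied from the outputs shown earlier in this notebook; `select_relevant` is a hypothetical helper written for this illustration, not part of the notebook.

```python
# A few importances copied from the outputs above (factor name -> feature importance).
importances = {
    "R_age": 0.014742,
    "R_avg_opp_HEAD_landed": 0.013201,
    "title_bout": 0.00042941,
    "no_of_rounds": 0.00070715,
}

def select_relevant(importances, threshold):
    """Keep 'Winner' plus every factor whose importance exceeds the threshold."""
    kept = ["Winner"]
    kept += [name for name, value in importances.items() if abs(value) > threshold]
    return kept

selected = select_relevant(importances, threshold=0.01)
print(selected)  # ['Winner', 'R_age', 'R_avg_opp_HEAD_landed']
```

With a threshold of 0.01, factors such as `title_bout` and `no_of_rounds` are discarded, while `R_age` and `R_avg_opp_HEAD_landed` are kept.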

In [28]:
data_util_relevante = data_util.loc[:,uteis]
data_util_relevante.head()

Unnamed: 0,Winner,R_age,R_avg_opp_HEAD_landed,R_avg_opp_SIG_STR_pct,R_avg_opp_TOTAL_STR_landed,R_avg_opp_SIG_STR_landed,B_avg_SIG_STR_att,B_avg_HEAD_att,R_avg_opp_TOTAL_STR_att,B_avg_DISTANCE_att,R_avg_GROUND_att,B_avg_opp_TOTAL_STR_landed,B_avg_HEAD_landed,R_avg_opp_DISTANCE_landed,B_avg_TD_att,B_avg_opp_SIG_STR_pct,R_avg_opp_BODY_att
0,1,32.0,17.3,0.336,43.3,32.2,65.4,48.6,110.5,62.6,9.4,19.2,11.2,26.8,0.8,0.236,13.3
1,1,31.0,12.428571,0.437143,82.285714,44.714286,138.9,112.0,158.142857,124.7,18.428571,75.4,32.0,32.571429,1.0,0.408,24.571429
2,1,35.0,23.2,0.34,38.6,35.733333,97.0,67.645161,102.133333,84.741935,5.333333,49.774194,23.258065,32.2,2.16129,0.453226,14.466667
3,0,29.0,20.375,0.44625,48.875,44.875,136.25,116.25,115.125,109.5,1.0,34.25,53.75,38.5,2.5,0.3375,20.25
4,0,26.0,14.0,0.3975,27.75,22.5,203.5,184.5,60.5,201.0,0.5,90.0,45.0,16.25,0.0,0.43,6.25


___
# Test 2 - Logistic Regression
____


## Building the Logistic Regression


In [29]:
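def preparo(X,Y):
    # Note: sm.OLS fits an ordinary least squares (linear) regression, not a logistic one.
    # Its summary reports a coefficient and a p-value for each factor.
    X_cp = sm.add_constant(X)  # adds the intercept column ('const')
    model = sm.OLS(Y,X_cp,missing='drop')
    results = model.fit()
    return results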

In [30]:
Y_log = data_util_relevante["Winner"]
data_sem_Winner=data_util_relevante.drop('Winner',axis=1) 
X_log=data_sem_Winner
#np.asarray(X)

In [31]:
X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, Y_log, test_size=0.25)

In [32]:
model = LogisticRegression(max_iter=200000,solver='lbfgs', multi_class='auto')

model.fit(X_train_log, y_train_log)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200000, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [33]:
y_pred_log = model.predict(X_test_log)

In [34]:
print(accuracy_score(y_test_log, y_pred_log))

0.6592039800995025


In [35]:
y_pred_log

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,

In [36]:
y_test_log.value_counts(True)

1    0.645522
0    0.335821
2    0.018657
Name: Winner, dtype: float64

In [37]:
data_util_relevante.Winner.value_counts()

1    2023
0    1141
2      51
Name: Winner, dtype: int64

In [38]:
result = preparo(X_log,Y_log)
result.summary()

Dep. Variable:,Winner,R-squared:,0.058
Model:,OLS,Adj. R-squared:,0.053
Method:,Least Squares,F-statistic:,12.3
Date:,"Tue, 19 Nov 2019",Prob (F-statistic):,4.46e-32
Time:,16:39:19,Log-Likelihood:,-2274.4
No. Observations:,3215,AIC:,4583.0
Df Residuals:,3198,BIC:,4686.0
Df Model:,16,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.2972,0.092,14.156,0.000,1.118,1.477
R_age,-0.0145,0.002,-6.819,0.000,-0.019,-0.010
R_avg_opp_HEAD_landed,-0.0021,0.003,-0.730,0.465,-0.008,0.004
R_avg_opp_SIG_STR_pct,-0.2732,0.113,-2.419,0.016,-0.495,-0.052
R_avg_opp_TOTAL_STR_landed,0.0003,0.001,0.248,0.804,-0.002,0.002
R_avg_opp_SIG_STR_landed,-0.0024,0.004,-0.661,0.509,-0.009,0.005
B_avg_SIG_STR_att,-0.0004,0.001,-0.294,0.768,-0.003,0.002
B_avg_HEAD_att,0.0015,0.001,1.040,0.298,-0.001,0.004
R_avg_opp_TOTAL_STR_att,-0.0005,0.001,-0.683,0.494,-0.002,0.001

Omnibus:,250.511,Durbin-Watson:,1.9
Prob(Omnibus):,0.0,Jarque-Bera (JB):,101.547
Skew:,-0.208,Prob(JB):,8.90e-23
Kurtosis:,2.236,Cond. No.,2610.0


___
## Conclusion

   ___
## References

[How to use the scikit-learn library](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

[How Random Forest works](https://towardsdatascience.com/understanding-random-forest-58381e0602d2)

[Random Forest Classifier reference](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

[How logistic regression works](https://www.saedsayad.com/logistic_regression.htm)

[How to use Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)



In [None]:
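# creating a new dataframe with one row per fighter
# NOTE: data2 is not defined in the cells above; it is expected to hold the fights table with the columns listed below.
lista_locB = ["B_fighter", "weight_class","B_Height_cms","B_Reach_cms","B_Weight_lbs","B_age"]
lista_locR = ["R_fighter", "weight_class","R_Height_cms","R_Reach_cms","R_Weight_lbs","R_age"]
# .copy() avoids pandas warnings when the columns are renamed below
data_red = data2.loc[: , lista_locR].copy()
data_blue = data2.loc[: , lista_locB].copy()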

data_blue.columns = ["Fighter","Weight_class","Height_cms","Reach_cms","Weight_lbs","Age"]
data_red.columns = ["Fighter","Weight_class","Height_cms","Reach_cms","Weight_lbs","Age"]

data_red.head(2)

In [None]:
# Keep the first row for each fighter in the red corner.
# drop_duplicates replaces the row-by-row loop and also covers the last row,
# which range(0,5143) would skip if the table has 5144 rows.
newdataR = data_red.drop_duplicates(subset="Fighter", keep="first")

newdataR.head(2)

In [None]:
# Keep the first row for each fighter in the blue corner.
# drop_duplicates replaces the row-by-row loop and also covers the last row,
# which range(0,5143) would skip if the table has 5144 rows.
newdataB = data_blue.drop_duplicates(subset="Fighter", keep="first")

newdataB.head(2)

In [None]:
newdata = pd.concat([newdataB, newdataR], axis=0, join='outer')
newdata = newdata.set_index("Fighter")

newdata.head(2)