# Machine Learning 101

### 1. What is machine learning and how can it be useful?

Machine learning is the process of building software that can learn __without explicit instructions__. This is accomplished by fitting mathematical models to data. The models are designed to exploit differences and patterns in that data to find generalizable "trends" that also hold for new, similar data.

_Image below from [TensorFlow](https://www.tensorflow.org/about)_

![title](../images/tensorflow_what_is_ml.png)

Machine learning is a broad field that sits at the intersection of math, statistics, and computer science.

Creating a machine learning model (or any mathematical model) requires three ingredients:
1. An __algorithm__
   - What are the steps by which your model makes a prediction?
   - For example, how does a decision tree work, and how do you go about creating one?
   
   
2. An __evaluation criterion__
   - How do you define how "good" your machine learning model is, or how one model compares to another?
   - For example, in binary classification (deciding if an observation is A or B, such as whether an email is spam or not), we often look at accuracy and sensitivity (the fraction of actual positives that the model identifies).
   
   
3. An __optimization method__
   - How do you create the best model, or in other words, find the parameters for your model that yield the best value of the evaluation criterion?
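
The three ingredients can be sketched on a toy problem. The example below is a minimal illustration, not a real model: the data is made up, and `predict`, `accuracy`, `sensitivity`, and `fit` are hypothetical helper names. The algorithm is a single threshold, the evaluation criteria are accuracy and sensitivity, and the optimization method is a brute-force search over candidate thresholds.

```python
# Toy data (made up for illustration): one feature and a binary label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]


def predict(threshold, x):
    """Algorithm: predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0


def accuracy(y_true, y_pred):
    """Evaluation criterion: fraction of predictions that are correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def sensitivity(y_true, y_pred):
    """Evaluation criterion: fraction of actual positives predicted positive."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(positives) / len(positives)


def fit(xs, ys, candidates):
    """Optimization method: try every candidate threshold, keep the most accurate.

    Ties go to the first candidate, because max() returns the first maximum.
    """
    return max(candidates,
               key=lambda th: accuracy(ys, [predict(th, x) for x in xs]))


candidates = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
best_threshold = fit(xs, ys, candidates)
preds = [predict(best_threshold, x) for x in xs]
print(best_threshold, accuracy(ys, preds), sensitivity(ys, preds))
```

Thresholds 3.5 and 5.5 are equally accurate on this data, but they differ in sensitivity, which is one reason to look at more than one evaluation criterion.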

What is an example of a problem that can be handled with machine learning?

In [1]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

import warnings

In [2]:
%matplotlib inline
warnings.filterwarnings("ignore")

In [3]:
id_ = "1nGlA2MZd4PGKJk3y9wEuIEwMTyFR3LoK"
url_ = f"https://drive.google.com/uc?export=download&id={id_}"

players = pd.read_csv(url_)
players.head()

Unnamed: 0.1,Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,Value,Wage,Special,Preferred Foot,International Reputation,Weak Foot,Skill Moves,Work Rate,Body Type,Real Face,Position,Jersey Number,Joined,Loaned From,Contract Valid Until,Height,Weight,LS,ST,RS,LW,LF,CF,RF,RW,LAM,CAM,RAM,LM,...,LB,LCB,CB,RCB,RB,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Release Clause
0,0,158023,L. Messi,31,https://cdn.sofifa.org/players/4/19/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,94,94,FC Barcelona,https://cdn.sofifa.org/teams/2/light/241.png,€110.5M,€565K,2202,Left,5.0,4.0,4.0,Medium/ Medium,Messi,Yes,RF,10.0,"Jul 1, 2004",,2021,5'7,159lbs,88+2,88+2,88+2,92+2,93+2,93+2,93+2,92+2,93+2,93+2,93+2,91+2,...,59+2,47+2,47+2,47+2,59+2,84.0,95.0,70.0,90.0,86.0,97.0,93.0,94.0,87.0,96.0,91.0,86.0,91.0,95.0,95.0,85.0,68.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0,€226.5M
1,1,20801,Cristiano Ronaldo,33,https://cdn.sofifa.org/players/4/19/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Juventus,https://cdn.sofifa.org/teams/2/light/45.png,€77M,€405K,2228,Right,5.0,4.0,5.0,High/ Low,C. Ronaldo,Yes,ST,7.0,"Jul 10, 2018",,2022,6'2,183lbs,91+3,91+3,91+3,89+3,90+3,90+3,90+3,89+3,88+3,88+3,88+3,88+3,...,61+3,53+3,53+3,53+3,61+3,84.0,94.0,89.0,81.0,87.0,88.0,81.0,76.0,77.0,94.0,89.0,91.0,87.0,96.0,70.0,95.0,95.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0,€127.1M
2,2,190871,Neymar Jr,26,https://cdn.sofifa.org/players/4/19/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,93,Paris Saint-Germain,https://cdn.sofifa.org/teams/2/light/73.png,€118.5M,€290K,2143,Right,5.0,5.0,5.0,High/ Medium,Neymar,Yes,LW,10.0,"Aug 3, 2017",,2022,5'9,150lbs,84+3,84+3,84+3,89+3,89+3,89+3,89+3,89+3,89+3,89+3,89+3,88+3,...,60+3,47+3,47+3,47+3,60+3,79.0,87.0,62.0,84.0,84.0,96.0,88.0,87.0,78.0,95.0,94.0,90.0,96.0,94.0,84.0,80.0,61.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0,€228.1M
3,3,193080,De Gea,27,https://cdn.sofifa.org/players/4/19/193080.png,Spain,https://cdn.sofifa.org/flags/45.png,91,93,Manchester United,https://cdn.sofifa.org/teams/2/light/11.png,€72M,€260K,1471,Right,4.0,3.0,1.0,Medium/ Medium,Lean,Yes,GK,1.0,"Jul 1, 2011",,2020,6'4,168lbs,,,,,,,,,,,,,...,,,,,,17.0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,51.0,42.0,57.0,58.0,60.0,90.0,43.0,31.0,67.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0,€138.6M
4,4,192985,K. De Bruyne,27,https://cdn.sofifa.org/players/4/19/192985.png,Belgium,https://cdn.sofifa.org/flags/7.png,91,92,Manchester City,https://cdn.sofifa.org/teams/2/light/10.png,€102M,€355K,2281,Right,4.0,5.0,4.0,High/ High,Normal,Yes,RCM,7.0,"Aug 30, 2015",,2023,5'11,154lbs,82+3,82+3,82+3,87+3,87+3,87+3,87+3,87+3,88+3,88+3,88+3,88+3,...,73+3,66+3,66+3,66+3,73+3,93.0,82.0,55.0,92.0,82.0,86.0,85.0,83.0,91.0,91.0,78.0,76.0,79.0,91.0,77.0,91.0,63.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0,€196.4M


In [None]:
# Sample 20 players, then drop any rows with missing values
players_sample = players[['Composure', 'Overall']].sample(20, random_state=42).dropna()
players_sample.plot(x='Composure', y='Overall', kind='scatter')

In [None]:
X = np.array(players_sample['Composure'])
y = players_sample['Overall']
degrees = [1,2,12]

plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)

    X_test = np.linspace(25, 90, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("Composure")
    plt.ylabel("Overall")
    plt.xlim((25, 90))
    plt.ylim((60, 80))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
    
plt.show()

### 2. Model validation: how do you know if your machine learning model works?

Model validation is the process of assessing the fidelity of a model. In other words, you validate a machine learning model to confirm that its predictions generalize beyond the data it was trained on. This is done by testing the model on data it has not yet seen.
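
To see why unseen data matters, consider a model that simply memorizes its training set. The sketch below uses a hand-written 1-nearest-neighbor predictor on made-up data; the helper names are hypothetical, and the test points are deliberately placed near the two noisy training labels to make the effect visible.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of the training point closest to x (1-nearest-neighbor)."""
    distances = [abs(x - tx) for tx in train_x]
    return train_y[distances.index(min(distances))]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): the true rule is "label 1 when x > 5",
# but two training labels (at x=3 and x=7) are noisy.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]

# Unseen points, labeled by the true rule.
test_x = [1.5, 2.6, 3.4, 6.6, 7.4, 8.5]
test_y = [0, 0, 0, 1, 1, 1]

train_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in train_x]
test_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in test_x]

train_acc = accuracy(train_y, train_pred)
test_acc = accuracy(test_y, test_pred)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The model scores perfectly on the data it memorized, noise included, and much worse on points it has not seen. Only the second number says anything about how the model would behave in practice.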

What are some ways we could incorrectly validate our model?

In [None]:
players.columns
# Binary target: 1 if the player's overall rating is above 62, else 0
players['not_awful'] = (players['Overall'] > 62).astype(int)

In [None]:
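# Keep the target, the position (used for splitting), and four features.
# .dropna() returns a new frame, which avoids pandas' SettingWithCopyWarning.
X = players[['not_awful','Position','Strength','Composure',
             'LongPassing','GKReflexes']].dropna()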

In [None]:
# Train on every player who is NOT a goalkeeper
X_train_1 = X[X.Position != 'GK'].drop(columns='Position')
Y_train_1 = X_train_1.pop('not_awful')

# Test only on goalkeepers
X_test_1 = X[X.Position == 'GK'].drop(columns='Position')
Y_test_1 = X_test_1.pop('not_awful')

In [None]:
svm = SVC(gamma ='auto')
svm.fit(X_train_1, Y_train_1)

In [None]:
y_pred_train = svm.predict(X_train_1)
print("train accuracy:", accuracy_score(Y_train_1, y_pred_train))

y_pred_test = svm.predict(X_test_1)
print("test accuracy:", accuracy_score(Y_test_1, y_pred_test))
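
The split above trains only on non-goalkeepers and tests only on goalkeepers, so the test set comes from a different distribution than the training set. The sketch below illustrates the problem with a handful of hypothetical rows. The values are made up, but they follow the pattern visible in the `players.head()` output, where the goalkeeper has a high `GKReflexes` score and the other players have low ones.

```python
from statistics import mean

# Hypothetical toy rows of (Position, GKReflexes); values are made up.
rows = [("GK", 94), ("GK", 88), ("GK", 85), ("ST", 11), ("RF", 8),
        ("LW", 11), ("RCM", 13), ("CB", 10), ("ST", 9), ("CB", 12)]

# Split by position, mirroring the cells above.
train = [reflexes for position, reflexes in rows if position != "GK"]
test = [reflexes for position, reflexes in rows if position == "GK"]

print(f"train GKReflexes: mean {mean(train):.1f}, range {min(train)}-{max(train)}")
print(f"test  GKReflexes: mean {mean(test):.1f}, range {min(test)}-{max(test)}")

# Count test values that fall inside the range the model was trained on.
overlap = [r for r in test if min(train) <= r <= max(train)]
print("test values inside the training range:", len(overlap))
```

When no test value falls inside the range seen during training, the model is being asked to extrapolate, and its test score says little about how it performs on players like the ones it learned from.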

Let's try a more reasonable split: a random 75/25 split that mixes players from all positions across the training and test sets.

In [None]:
X = players[['not_awful','Position','Strength','Composure',
             'LongPassing','GKReflexes']].dropna()

Y = X.pop('not_awful')
X = X.drop(columns='Position')

# Hold out a random 25% of the players for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=42)

In [None]:
svm_tt = SVC(gamma ='auto')
svm_tt.fit(X_train,y_train)

In [None]:
y_pred_train = svm_tt.predict(X_train)
print("train accuracy:", accuracy_score(y_train, y_pred_train))

y_pred_test = svm_tt.predict(X_test)
print("test accuracy:", accuracy_score(y_test, y_pred_test))

In [None]:
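# 3-fold cross-validation on the training set only; the test set stays untouched
svm = SVC(gamma='auto')
cv_results = cross_validate(svm, X_train, y_train, cv=3, n_jobs=-1)
cv_results['test_score']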
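
Cross-validation splits the training data into k folds and, in turn, holds each fold out for validation while fitting on the rest. The sketch below shows only the index bookkeeping for plain, unshuffled k-fold; `k_fold_indices` is a hypothetical helper, not a scikit-learn function. For classifiers, scikit-learn uses a stratified variant that also balances class proportions across folds, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, and never in the same round in which it is used for fitting.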

In [None]:
# this cell can take a couple of minutes to run

param_grid = {'C': [10, 100, 1000],  
              'gamma': [0.01, 0.001, 0.0001], 
              'kernel': ['rbf']} 

grid = GridSearchCV(SVC(), param_grid, refit=True, verbose=3, n_jobs=-1) 
grid.fit(X_train, y_train) 

In [None]:
grid.cv_results_['mean_test_score']

In [None]:
grid.best_score_
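
A grid search evaluates every combination of the listed parameter values. The sketch below enumerates the same `param_grid` with `itertools.product` to show how many candidates that is. The fold count of 5 is an assumption: it is the default `cv` in recent scikit-learn versions and applies here because no `cv` argument was passed.

```python
from itertools import product

param_grid = {'C': [10, 100, 1000],
              'gamma': [0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

# Build one dict per combination of one value per parameter.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_folds = 5  # assumption: GridSearchCV's default cv in recent scikit-learn
print(len(combinations), "candidates,", len(combinations) * n_folds, "fits")
print(combinations[0])
```

The number of fits grows multiplicatively with each added parameter or value, which is why the grid search cell can take a couple of minutes to run.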

In [None]:
# this cell can also take a couple of minutes to run

def hyperopt_train_test(params):
    # Mean cross-validated accuracy for one hyperparameter setting
    clf = SVC(**params)
    return cross_val_score(clf, X_train, y_train).mean()

space4svm = {
    'C': hp.uniform('C', 7.5, 12.5),
    'kernel': hp.choice('kernel', ['rbf']),
    'gamma': hp.uniform('gamma', 0.00001, .005),
}

def f(params):
    # fmin minimizes the loss, so negate the accuracy
    acc = hyperopt_train_test(params)
    return {'loss': -acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4svm, algo=tpe.suggest, max_evals=10, trials=trials)
print('best:')
print(best)
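
Unlike a grid, the search above samples `C` and `gamma` from continuous ranges and keeps the setting with the lowest loss. The sketch below shows that loop using plain random sampling, which is simpler than the TPE algorithm hyperopt uses (TPE uses earlier trials to decide where to sample next). `toy_score` is a hypothetical stand-in for cross-validated accuracy, so the numbers it produces mean nothing for the real data.

```python
import random


def toy_score(C, gamma):
    """Hypothetical stand-in for cross-validated accuracy; peaks at C=10, gamma=0.002."""
    return 1.0 - ((C - 10.0) / 10.0) ** 2 - ((gamma - 0.002) / 0.005) ** 2


def random_search(n_evals, seed=0):
    """Sample n_evals settings at random and return the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_evals):
        # Sample from the same ranges as space4svm above.
        params = {"C": rng.uniform(7.5, 12.5),
                  "gamma": rng.uniform(0.00001, 0.005)}
        loss = -toy_score(**params)  # minimize the negative score, like f() above
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_evals=10)
print(best_params, -best_loss)
```

One detail to be aware of when reading the hyperopt output: for `hp.choice` parameters such as `kernel`, the dictionary returned by `fmin` holds the index of the chosen option rather than its value, and hyperopt's `space_eval` can map it back to the actual setting.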
