# Ryan Pacheco: 114015621
## Semester Project
* The goal of this project was to gather data from the Major League Baseball API (https://appac.github.io/mlb-data-api-docs/) and then train several machine learning classifiers (logistic regression, SVM, random forest, and a decision tree). Once the classifiers have been run, metrics are collected to assess how accurate each classifier was.

### Initial Imports

In [2]:
import requests
import json
from tqdm import tqdm  # progress bars for the long-running download loops

# DO NOT RUN IF DATA.JSON & STATS.JSON IS PRESENT (TAKES > 24 HOURS)

* This pulls in the data from Major League Baseball's (MLB) API
* It was necessary to iterate over a range of `player_id` values, since the API has no formal "Get all" feature
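The response handling in the download loop can be isolated into a small helper, which makes it testable without calling the API. This is a minimal sketch: `extract_player_name` is a hypothetical name, and the payload shape is assumed from the keys the notebook already reads (`player_info`, `queryResults`, `totalSize`, `row`, `name_display_first_last`).

```python
def extract_player_name(payload):
    """Return the display name from a player_info payload, or None if absent.

    Hypothetical helper; the key names are the ones the notebook reads.
    """
    results = payload.get('player_info', {}).get('queryResults', {})
    # The notebook converts totalSize with int(), so do the same here
    if int(results.get('totalSize', 0)) < 1:
        return None
    return results['row']['name_display_first_last']


# Hand-written payloads for illustration, not real API responses
found = {'player_info': {'queryResults': {
    'totalSize': '1',
    'row': {'name_display_first_last': 'Aaron Rowand'}}}}
missing = {'player_info': {'queryResults': {'totalSize': '0'}}}

print(extract_player_name(found))
print(extract_player_name(missing))
```

Returning `None` for an unused ID keeps the "skip this player" decision in the calling loop rather than relying on a caught exception.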

In [3]:
player_dict = {}
for player_id in tqdm(range(400000, 650000)):
    player_data = {}
    try:
        result = requests.get("http://lookup-service-prod.mlb.com/json/named.player_info.bam?sport_code='mlb'&player_id='{}'".format(player_id))
        player_info = json.loads(result.text)
        if int(player_info['player_info']['queryResults']['totalSize']) >= 1:
            player_data['name'] = player_info['player_info']['queryResults']['row']['name_display_first_last']
            player_dict[player_id] = player_data
    except Exception as e:
        continue

100%|██████████| 250000/250000 [17:00:31<00:00,  4.08it/s]    


In [4]:
with open('data.json', 'w') as outfile:
    json.dump(player_dict, outfile)
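One detail worth knowing before reloading this file: JSON object keys are always strings, so the integer `player_id` keys used above come back as strings after a round trip. A small sketch with a one-entry dictionary shaped like `player_dict`:

```python
import json

# One entry shaped like player_dict (integer key, nested dict value)
player_dict_example = {400023: {'name': 'Aaron Rowand'}}

# json.dumps writes the integer key as the string "400023"
round_tripped = json.loads(json.dumps(player_dict_example))

print(list(round_tripped.keys()))
print(400023 in round_tripped, '400023' in round_tripped)
```

Any later lookup into the reloaded data therefore has to use the string form of the ID.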

# START HERE IF ONLY DATA.JSON IS PRESENT (TAKES ~12 HOURS)

* Once the players are selected, we loop through each player and collect the relevant stats:
  * Batting average
  * On-base percentage
  * Slugging percentage
  * OPS
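OPS is the sum of on-base percentage and slugging percentage, so the fourth statistic carries little information beyond the second and third. The sketch below checks that relationship on a few rows copied from the `stats.json` output printed further down in this notebook; the tolerance of 0.002 is an assumption to absorb rounding of the three published figures.

```python
# (name, avg, obp, slg, ops) rows copied from the notebook's printed output
rows = [
    ('Jon Rauch', '.095', '.095', '.238', '.333'),
    ('Miguel Olivo', '.240', '.275', '.417', '.691'),
    ('Aaron Rowand', '.273', '.330', '.435', '.765'),
    ('Brad Lidge', '.286', '.286', '.429', '.714'),
]

TOLERANCE = 0.002  # assumed allowance for rounding

for name, _avg, obp, slg, ops in rows:
    gap = abs(float(obp) + float(slg) - float(ops))
    print(name, round(gap, 3), gap < TOLERANCE)
```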

In [2]:
with open('data.json') as json_file:
    data = json.load(json_file)

In [10]:
player_stats = {}
for player in tqdm(data):
    try:
        result = requests.get("http://lookup-service-prod.mlb.com/json/named.sport_career_hitting.bam?league_list_id='mlb'&game_type='R'&player_id='{}'".format(player))
        career_stats = json.loads(result.text)
        if int(career_stats['sport_career_hitting']['queryResults']['totalSize']) >= 1:
            player_stats[player] = [data[player]['name'], 
                                   career_stats['sport_career_hitting']['queryResults']['row']['avg'],
                                   career_stats['sport_career_hitting']['queryResults']['row']['obp'],
                                   career_stats['sport_career_hitting']['queryResults']['row']['slg'],
                                   career_stats['sport_career_hitting']['queryResults']['row']['ops']]
    except Exception as e:
        continue

100%|██████████| 94185/94185 [4:54:11<00:00,  6.46it/s]    


In [11]:
with open('stats.json', 'w') as outfile:
    json.dump(player_stats, outfile)

# RUN HERE IF STATS.JSON IS PRESENT

* Now that we have our dataset we need to prepare it for the classifiers

In [3]:
import json

In [4]:
with open('stats.json') as json_file:
    player_career_stats = json.load(json_file)

In [5]:
player_career_stats

{'400002': ['Matt Childers', '.000', '.000', '.000', '.000'],
 '400008': ['Gary Majewski', '.000', '.000', '.000', '.000'],
 '400010': ['Jon Rauch', '.095', '.095', '.238', '.333'],
 '400012': ['Ken Vining', '', '', '', '.000'],
 '400018': ['Miguel Olivo', '.240', '.275', '.417', '.691'],
 '400019': ['Tim Hummel', '.222', '.285', '.314', '.599'],
 '400021': ['Joe Borchard', '.205', '.284', '.352', '.636'],
 '400023': ['Aaron Rowand', '.273', '.330', '.435', '.765'],
 '400050': ['Chris Spurling', '', '', '', '.000'],
 '400051': ['J.J. Davis', '.179', '.248', '.217', '.465'],
 '400056': ['Carlos Hernandez', '.154', '.185', '.154', '.339'],
 '400058': ['Brad Lidge', '.286', '.286', '.429', '.714'],
 '400061': ['Roy Oswalt', '.154', '.193', '.169', '.362'],
 '400062': ['Tim Redding', '.138', '.164', '.160', '.324'],
 '400063': ['Dave Williams', '.122', '.148', '.171', '.319'],
 '400066': ['Franklin Nunez', '', '', '', '.000'],
 '400067': ['Carlos Silva', '.086', '.145', '.121', '.266'],
 '

In [9]:
%pip install pandas

Collecting pandas
  Using cached pandas-1.1.2-cp38-cp38-macosx_10_9_x86_64.whl (10.6 MB)
Collecting numpy>=1.15.4
  Using cached numpy-1.19.2-cp38-cp38-macosx_10_9_x86_64.whl (15.3 MB)
Collecting pytz>=2017.2
  Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Installing collected packages: numpy, pytz, pandas
Successfully installed numpy-1.19.2 pandas-1.1.2 pytz-2020.1
Note: you may need to restart the kernel to use updated packages.


In [10]:
import pandas as pd

### Use pandas to organize the data, fill in empty values, and remove the `Name` column

In [11]:
player_df = pd.DataFrame.from_dict(player_career_stats, orient='index',
                       columns=['Name', 'Avg', 'OBP', 'Slg', 'OPS'])

In [12]:
player_df.head()

| | Name | Avg | OBP | Slg | OPS |
|---|---|---|---|---|---|
| 400002 | Matt Childers | 0.0 | 0.0 | 0.0 | 0.0 |
| 400008 | Gary Majewski | 0.0 | 0.0 | 0.0 | 0.0 |
| 400010 | Jon Rauch | 0.095 | 0.095 | 0.238 | 0.333 |
| 400012 | Ken Vining | | | | 0.0 |
| 400018 | Miguel Olivo | 0.24 | 0.275 | 0.417 | 0.691 |


In [13]:
player_df.replace('', '.000', inplace=True)

In [14]:
player_df.head()

| | Name | Avg | OBP | Slg | OPS |
|---|---|---|---|---|---|
| 400002 | Matt Childers | 0.0 | 0.0 | 0.0 | 0.0 |
| 400008 | Gary Majewski | 0.0 | 0.0 | 0.0 | 0.0 |
| 400010 | Jon Rauch | 0.095 | 0.095 | 0.238 | 0.333 |
| 400012 | Ken Vining | 0.0 | 0.0 | 0.0 | 0.0 |
| 400018 | Miguel Olivo | 0.24 | 0.275 | 0.417 | 0.691 |


In [15]:
player_df.drop(columns=['Name'], inplace=True)

In [16]:
player_df.head()

| | Avg | OBP | Slg | OPS |
|---|---|---|---|---|
| 400002 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400008 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400010 | 0.095 | 0.095 | 0.238 | 0.333 |
| 400012 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400018 | 0.24 | 0.275 | 0.417 | 0.691 |


### Convert columns to floats instead of objects

In [17]:
player_df.dtypes

Avg    object
OBP    object
Slg    object
OPS    object
dtype: object

In [18]:
player_df['Avg'] = pd.to_numeric(player_df['Avg'])
player_df['OBP'] = pd.to_numeric(player_df['OBP'])
player_df['Slg'] = pd.to_numeric(player_df['Slg'])
player_df['OPS'] = pd.to_numeric(player_df['OPS'])

In [19]:
player_df.dtypes

Avg    float64
OBP    float64
Slg    float64
OPS    float64
dtype: object
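The same cleaning can be expressed without pandas, which makes the two steps explicit: blank strings become a default, then each string is parsed as a float. This is a minimal sketch (`to_float` is a hypothetical helper). Note that mapping blanks to 0.0, as the notebook does, makes a blank statistic indistinguishable from a true .000.

```python
def to_float(text, default=0.0):
    """Parse a stat string such as '.095'; blank strings map to `default`."""
    return float(text) if text.strip() else default


# Rows copied from the notebook output: one complete, one with blanks
jon_rauch = ['.095', '.095', '.238', '.333']
ken_vining = ['', '', '', '.000']

print([to_float(value) for value in jon_rauch])
print([to_float(value) for value in ken_vining])
```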

* This function goes through the stats and assigns each player a label indicating whether they were good, based on any one of their stats being viewed as "Good" for the majors

In [20]:
def is_good(row):
    if (row[0] > .280) or (row[1] > .350) or (row[2] > .400) or (row[3] > .800):
        return 'Yes'
    else: 
        return 'No'
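To see what this rule does in practice, the sketch below applies the same thresholds to a few rows copied from the output above. The `THRESHOLDS` tuple and `label` function are illustrative names, not part of the notebook.

```python
# Thresholds from is_good, in column order: Avg, OBP, Slg, OPS
THRESHOLDS = (0.280, 0.350, 0.400, 0.800)


def label(row):
    """Return 'Yes' if any stat exceeds its threshold, else 'No'."""
    return 'Yes' if any(stat > limit for stat, limit in zip(row, THRESHOLDS)) else 'No'


samples = {
    'Jon Rauch': (0.095, 0.095, 0.238, 0.333),
    'Miguel Olivo': (0.240, 0.275, 0.417, 0.691),
    'Aaron Rowand': (0.273, 0.330, 0.435, 0.765),
    'Brad Lidge': (0.286, 0.286, 0.429, 0.714),
}

for name, stats in samples.items():
    print(name, label(stats))
```

Because a single stat above its threshold is enough, a row such as Miguel Olivo's is labeled 'Yes' on slugging alone, even though the other three stats fall below their thresholds.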

* Convert dataframe to numpy array for training

In [21]:
player_data = player_df.to_numpy()

In [22]:
player_data

array([[0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   ],
       [0.095, 0.095, 0.238, 0.333],
       ...,
       [0.125, 0.263, 0.125, 0.388],
       [0.   , 0.   , 0.   , 0.   ],
       [0.239, 0.271, 0.339, 0.61 ]])

In [23]:
player_good = []
for index in range(len(player_data)):
    player_good.append(is_good(player_data[index]))

In [24]:
len(player_good)

3583

In [25]:
len(player_data)

3583

* This function calculates the various metrics for each classifier

In [26]:
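def classifier_eval(pred, actual):
    """Return accuracy, precision, recall, TPR, and FPR for 'Yes'/'No' labels."""
    TP = TN = FP = FN = 0
    for p, a in zip(pred, actual):
        if p == 'Yes' and a == 'Yes':
            TP += 1  # predicted good, actually good
        elif p == 'No' and a == 'No':
            TN += 1  # predicted not good, actually not good
        elif p == 'Yes' and a == 'No':
            FP += 1  # predicted good, actually not good
        elif p == 'No' and a == 'Yes':
            FN += 1  # predicted not good, actually good
    total = TP + TN + FP + FN
    # Return 0.0 instead of raising ZeroDivisionError when a denominator is empty
    accuracy = (TP + TN) / total if total else 0.0
    precision = TP / (TP + FP) if (TP + FP) else 0.0
    recall = TP / (TP + FN) if (TP + FN) else 0.0
    true_positive_rate = recall  # TPR is the same quantity as recall
    false_positive_rate = FP / (TN + FP) if (TN + FP) else 0.0
    return accuracy, precision, recall, true_positive_rate, false_positive_rate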
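A tiny hand-checkable example helps confirm what each metric measures. The sketch below tallies the four outcome counts with `collections.Counter` on made-up labels; it illustrates the definitions and is not a replacement for `classifier_eval`.

```python
from collections import Counter

# Made-up predictions and ground truth, chosen to be easy to count by hand
pred = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No']
actual = ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'No']

counts = Counter(zip(pred, actual))
TP = counts[('Yes', 'Yes')]  # predicted good, actually good
FP = counts[('Yes', 'No')]   # predicted good, actually not
FN = counts[('No', 'Yes')]   # predicted not, actually good
TN = counts[('No', 'No')]    # predicted not, actually not

accuracy = (TP + TN) / len(pred)
precision = TP / (TP + FP)            # share of 'Yes' predictions that were right
recall = TP / (TP + FN)               # identical to the true positive rate
false_positive_rate = FP / (FP + TN)  # share of actual 'No' rows predicted 'Yes'

print(TP, FP, FN, TN)
print(accuracy, precision, recall, false_positive_rate)
```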

### Prepare train and test data

In [28]:
%pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.23.2-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Collecting joblib>=0.11
  Downloading joblib-0.16.0-py3-none-any.whl (300 kB)
Collecting scipy>=0.19.1
  Downloading scipy-1.5.2-cp38-cp38-macosx_10_9_x86_64.whl (28.9 MB)
Building wheels for collected packages: sklearn
  Building wheel for sklearn (setup.py) ... done
  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=1316 sha256=e9f93690f8b36e797029b9d303b86cd69033a84d5e2aec5ab8c196953660ffed
  Stored in directory: /Users/rgp/Library/Caches/pip/wheels/22/0b/40/fd3f795caaa1fb4c6

In [29]:
from sklearn.model_selection import train_test_split

In [30]:
player_x_train, player_x_test, player_y_train, player_y_test = train_test_split(player_data, player_good, test_size=0.25, random_state=51)

In [31]:
len(player_x_train)

2687

In [32]:
len(player_y_train)

2687
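The train and test sizes follow from the 3583 labeled rows and `test_size=0.25`. As far as I know, scikit-learn rounds a fractional test size up and gives the remainder to training; the arithmetic below is a sketch under that assumption, and it reproduces the 2687 training rows shown above.

```python
import math

n_samples = 3583  # len(player_data) reported above
test_size = 0.25

# Assumed rounding rule: test set rounded up, training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```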

### Train and test logistic regression

In [33]:
from sklearn.linear_model import LogisticRegression

In [34]:
clf = LogisticRegression(random_state=0).fit(player_x_train, player_y_train)

In [35]:
player_pred = clf.predict(player_x_test)

In [36]:
total_metrics = {}
classifier_metrics = {}

In [37]:
accuracy, precision, recall, true_positive_rate, false_positive_rate = classifier_eval(player_pred, player_y_test)
print("Accuracy: {}\nPrecision: {}\nRecall: {}\nTPR: {}\nFPR: {}".format(accuracy, precision, recall, true_positive_rate, false_positive_rate))

Accuracy: 0.953125
Precision: 0.9090909090909091
Recall: 0.8717948717948718
TPR: 0.8717948717948718
FPR: 0.024251069900142655


* The logistic regression had high accuracy, a low false positive rate, and a high true positive rate
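The printed ratios are consistent with one particular set of confusion counts on the 896-row test set (3583 rows minus 2687 training rows). The counts below are inferred from the reported metrics rather than printed by the notebook, so treat them as a reconstruction; the sketch checks that they reproduce the logistic regression figures.

```python
import math

# Inferred, not printed by the notebook: counts that match the reported ratios
TP, FN, FP, TN = 170, 25, 17, 684

test_rows = 3583 - 2687  # total rows minus training rows
print(TP + FN + FP + TN == test_rows)

metrics = {
    'Accuracy': (TP + TN) / (TP + TN + FP + FN),
    'Precision': TP / (TP + FP),
    'Recall': TP / (TP + FN),
    'FPR': FP / (FP + TN),
}

# Values printed by the notebook for logistic regression
reported = {
    'Accuracy': 0.953125,
    'Precision': 0.9090909090909091,
    'Recall': 0.8717948717948718,
    'FPR': 0.024251069900142655,
}

for name, value in metrics.items():
    print(name, value, math.isclose(value, reported[name], rel_tol=1e-9))
```

Under this reconstruction, the model produced 17 false positives and 25 false negatives among the 896 test players.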

In [38]:
classifier_metrics['Accuracy'] = accuracy
classifier_metrics['Precision'] = precision
classifier_metrics['Recall'] = recall
classifier_metrics['TPR'] = true_positive_rate
classifier_metrics['FPR'] = false_positive_rate
total_metrics['Logistic Regression'] = classifier_metrics
classifier_metrics = {}

### Train and test decision tree

In [39]:
from sklearn.tree import DecisionTreeClassifier

In [40]:
decision_tree = DecisionTreeClassifier(random_state=0)

In [41]:
decision_tree.fit(player_x_train, player_y_train)

DecisionTreeClassifier(random_state=0)

In [42]:
decision_tree_predict = decision_tree.predict(player_x_test)

In [43]:
accuracy, precision, recall, true_positive_rate, false_positive_rate = classifier_eval(decision_tree_predict, player_y_test)
print("Accuracy: {}\nPrecision: {}\nRecall: {}\nTPR: {}\nFPR: {}".format(accuracy, precision, recall, true_positive_rate, false_positive_rate))

Accuracy: 0.9988839285714286
Precision: 0.9948979591836735
Recall: 1.0
TPR: 1.0
FPR: 0.0014265335235378032


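* The decision tree has extremely high accuracy, precision, recall, and true positive rate. Overfitting may have taken place here; it is also worth noting that the labels come from threshold rules on these same four features, which is exactly the kind of boundary a decision tree can reproduce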
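One way to see why a tree can score this well: the `is_good` rule is an OR of four single-feature thresholds, and a chain of four single-feature splits expresses exactly that. The sketch below hand-writes such a chain (it is not the tree scikit-learn actually fitted) and checks that it agrees with the labeling rule on a grid of stat values.

```python
from itertools import product


def is_good(row):
    """Labeling rule from the notebook."""
    if (row[0] > .280) or (row[1] > .350) or (row[2] > .400) or (row[3] > .800):
        return 'Yes'
    return 'No'


def hand_written_tree(row):
    """Chain of four single-feature splits, one per threshold (illustrative)."""
    if row[0] > .280:      # split on Avg
        return 'Yes'
    if row[1] > .350:      # split on OBP
        return 'Yes'
    if row[2] > .400:      # split on Slg
        return 'Yes'
    if row[3] > .800:      # split on OPS
        return 'Yes'
    return 'No'


# Grid of values on both sides of every threshold
grid = [0.0, 0.25, 0.28, 0.3, 0.35, 0.4, 0.45, 0.8, 0.9]
disagreements = sum(
    is_good(row) != hand_written_tree(row) for row in product(grid, repeat=4)
)
print(disagreements)
```

A fitted tree only has to approximate these four cut points from the training data, which would explain near-perfect test scores without implying that the model memorized noise.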

In [44]:
classifier_metrics['Accuracy'] = accuracy
classifier_metrics['Precision'] = precision
classifier_metrics['Recall'] = recall
classifier_metrics['TPR'] = true_positive_rate
classifier_metrics['FPR'] = false_positive_rate
total_metrics['Decision Tree'] = classifier_metrics
classifier_metrics = {}

### Train and test random forest

In [45]:
from sklearn.ensemble import RandomForestClassifier

In [46]:
random_forest = RandomForestClassifier(max_depth=2, random_state=0)

In [47]:
random_forest.fit(player_x_train, player_y_train)

RandomForestClassifier(max_depth=2, random_state=0)

In [48]:
random_forest_predict = random_forest.predict(player_x_test)

In [49]:
accuracy, precision, recall, true_positive_rate, false_positive_rate = classifier_eval(random_forest_predict, player_y_test)
print("Accuracy: {}\nPrecision: {}\nRecall: {}\nTPR: {}\nFPR: {}".format(accuracy, precision, recall, true_positive_rate, false_positive_rate))

Accuracy: 0.9765625
Precision: 0.9943181818181818
Recall: 0.8974358974358975
TPR: 0.8974358974358975
FPR: 0.0014265335235378032


* The random forest has high accuracy, high precision, and a good true positive rate

In [50]:
classifier_metrics['Accuracy'] = accuracy
classifier_metrics['Precision'] = precision
classifier_metrics['Recall'] = recall
classifier_metrics['TPR'] = true_positive_rate
classifier_metrics['FPR'] = false_positive_rate
total_metrics['Random Forest'] = classifier_metrics
classifier_metrics = {}

### Train and test Support Vector Machine (SVM)

In [51]:
from sklearn import svm

In [52]:
svc = svm.SVC()

In [53]:
svc.fit(player_x_train, player_y_train)

SVC()

In [54]:
svc_predict = svc.predict(player_x_test)

In [55]:
accuracy, precision, recall, true_positive_rate, false_positive_rate = classifier_eval(svc_predict, player_y_test)
print("Accuracy: {}\nPrecision: {}\nRecall: {}\nTPR: {}\nFPR: {}".format(accuracy, precision, recall, true_positive_rate, false_positive_rate))

Accuracy: 0.9620535714285714
Precision: 0.93048128342246
Recall: 0.8923076923076924
TPR: 0.8923076923076924
FPR: 0.018544935805991442


* The support vector machine has high accuracy, but its precision (0.930) is lower than that of the decision tree and the random forest

In [56]:
classifier_metrics['Accuracy'] = accuracy
classifier_metrics['Precision'] = precision
classifier_metrics['Recall'] = recall
classifier_metrics['TPR'] = true_positive_rate
classifier_metrics['FPR'] = false_positive_rate
total_metrics['SVM'] = classifier_metrics
classifier_metrics = {}

## Print the metrics for each classifier to compare results

In [57]:
for classifier in total_metrics:
    print("{}:".format(classifier))
    for metric in total_metrics[classifier]:
        print("\t{}: {}".format(metric, total_metrics[classifier][metric]))

Logistic Regression:
	Accuracy: 0.953125
	Precision: 0.9090909090909091
	Recall: 0.8717948717948718
	TPR: 0.8717948717948718
	FPR: 0.024251069900142655
Decision Tree:
	Accuracy: 0.9988839285714286
	Precision: 0.9948979591836735
	Recall: 1.0
	TPR: 1.0
	FPR: 0.0014265335235378032
Random Forest:
	Accuracy: 0.9765625
	Precision: 0.9943181818181818
	Recall: 0.8974358974358975
	TPR: 0.8974358974358975
	FPR: 0.0014265335235378032
SVM:
	Accuracy: 0.9620535714285714
	Precision: 0.93048128342246
	Recall: 0.8923076923076924
	TPR: 0.8923076923076924
	FPR: 0.018544935805991442


* Looking at the results above, the `Random Forest` classifier appears to provide the best usable results: its accuracy is high, its precision (0.994) is essentially tied with the decision tree's, and its recall (0.897) is higher than that of logistic regression and the SVM. High precision and a low false positive rate mean the random forest rarely labels a weak player as good. That matters in this scenario, where an MLB team is trying to predict whether a player will be good: a false positive means the organization spends money on a bad player. Maximizing recall (keeping false negatives to a minimum) is of less concern, because missing a few good players is much less financially harmful to an organization than signing bad ones. I did not choose the decision tree, which had the best overall results, because those results seemed a little too good to be true. That led me to believe the training/test set was not well suited to the decision tree classifier, which could mean overfitting and diminishes the validity of the decision tree results.
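Since the argument above rests on precision and the false positive rate, it can help to sort the classifiers by those two figures. The sketch below uses the values printed earlier, copied by hand into a dictionary.

```python
# Precision and FPR copied from the printed results above
results = {
    'Logistic Regression': {'Precision': 0.9090909090909091, 'FPR': 0.024251069900142655},
    'Decision Tree': {'Precision': 0.9948979591836735, 'FPR': 0.0014265335235378032},
    'Random Forest': {'Precision': 0.9943181818181818, 'FPR': 0.0014265335235378032},
    'SVM': {'Precision': 0.93048128342246, 'FPR': 0.018544935805991442},
}

# Highest precision first; ties broken by lowest false positive rate
ranking = sorted(
    results,
    key=lambda name: (-results[name]['Precision'], results[name]['FPR']),
)

for name in ranking:
    print(name, results[name])
```

The random forest and decision tree share the same false positive rate and differ in precision by less than 0.001, so preferring the random forest costs very little on this criterion.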