**Multithreading with pybaseball**

While working on the education project and the baseball project, I wrote for loops that downloaded data sets one at a time, in series. This worked, and it made errors easier to track down and correct, but it took a very long time. To finish my projects sooner, I looked for ways to speed up the downloads and came across multiprocessing and multithreading in Python. [This](https://vsoch.github.io/2017/multiprocess/) was a nice post with some details, but I ultimately ended up following something closer to [this post](https://stackoverflow.com/questions/16982569/making-multiple-api-calls-in-parallel-using-python-ipython), which was sufficient for my needs. Along the way, I learned more about multiprocessing and multithreading in general, not just "How do I get data faster?"

This notebook was created within my "insight" virtual environment.

In [1]:
import os
import numpy as np
import pandas as pd
from termcolor import colored

# Web/database stuff
import urllib.request
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# Multiprocessing/threading
import multiprocess
import threading
from threading import Thread

# Evaluate pybaseball scraping serially

In [2]:
from pybaseball import pitching_stats
from pybaseball import statcast_pitcher
from pybaseball import playerid_lookup
from pybaseball import statcast

In [None]:
# Input

In [3]:
# Pitch result designation dictionary
# A similar question and approach here https://www.reddit.com/r/Sabermetrics/comments/e130el/how_to_calculate_whiff_rate/
# Note that bunts are included
contact_desc = (['pitchout_hit_into_play_score', 'hit_into_play_score', 'hit_into_play',
                 'hit_by_pitch', 'hit_into_play_no_out', 'pitchout_hit_into_play_no_out',
                 'pitchout_hit_into_play'])
foul_desc = ['foul_pitchout', 'foul_bunt', 'bunt_foul_tip', 'foul', 'foul_tip']
takeball_desc = ['intent_ball', 'blocked_ball', 'pitchout', 'ball']
takestrike_desc = ['called_strike']
unknownstrike_desc = ['unknown_strike']
whiff_desc = ['swinging_strike_blocked', 'swinging_strike', 'missed_bunt']

desc_dict = {'contact':contact_desc, 'foul':foul_desc, 'take_ball':takeball_desc,
             'take_strike':takestrike_desc, 'unknown_strike':unknownstrike_desc,
             'whiff':whiff_desc}

bat_stand_list = ['L', 'R']
zone_list = list(range(1, 13))
sw_types = list(desc_dict.keys())
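
Since each description belongs to a single category, the dictionary can also be inverted into a flat description-to-category lookup. Below is a minimal sketch using a shortened, illustrative copy of the lists; `small_desc_dict`, `desc_to_type`, and `swing_types` are names introduced here and are not used elsewhere in the notebook.

```python
small_desc_dict = {
    'contact': ['hit_into_play', 'hit_into_play_score'],
    'foul': ['foul', 'foul_tip'],
    'take_ball': ['ball', 'blocked_ball'],
    'take_strike': ['called_strike'],
    'whiff': ['swinging_strike', 'swinging_strike_blocked'],
}

# Invert {category: [descriptions]} into {description: category}
desc_to_type = {desc: sw_type
                for sw_type, descs in small_desc_dict.items()
                for desc in descs}

# A pitch counts as a swing if it ended in a whiff, contact, or a foul
swing_types = {'whiff', 'contact', 'foul'}

for desc in ['foul_tip', 'called_strike', 'swinging_strike']:
    sw_type = desc_to_type[desc]
    print(desc, sw_type, int(sw_type in swing_types))
```

With pandas, a lookup like this could be applied to a whole column with `Series.map`, as an alternative to looping over the dictionary.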

In [4]:
def get_pitcher_pb_sc_data(pitcher):
    print(colored(pitcher, 'blue'))
    pitcher_for_lookup = pitcher.split()
    
    # Account for two-word first names such as Chi Chi Gonzalez, since I'm splitting on spaces
    if len(pitcher_for_lookup) > 2:
        p_id = playerid_lookup(pitcher_for_lookup[2], pitcher_for_lookup[0] + ' ' + pitcher_for_lookup[1])
    else:
        p_id = playerid_lookup(pitcher_for_lookup[1], pitcher_for_lookup[0])
    
    # Account for common names
    if p_id.shape[0] > 1:
        p_id = p_id[p_id['mlb_played_last'] > 2008]
    # Take the first matching row explicitly; newer pandas versions warn about int() on a Series
    df_pitcher_sc = statcast_pitcher('2019-03-28', '2019-09-29', player_id=int(p_id['key_mlbam'].iloc[0]))
    
    # Swing designation info
    df_pitcher_sc['sw_type'] = None
    for key, value in desc_dict.items():
        df_pitcher_sc.loc[df_pitcher_sc['description'].isin(value), 'sw_type'] = key
    df_pitcher_sc['sw_true'] = 0
    df_pitcher_sc.loc[df_pitcher_sc['sw_type'].isin(['whiff', 'contact', 'foul']), 'sw_true'] = 1
    
    return p_id, df_pitcher_sc
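
Splitting on the last space only is one way to handle first names that contain a space without a special case. This is a small sketch, assuming the last word is always the surname (which would not hold for two-word surnames); `split_name` is a hypothetical helper, not part of pybaseball.

```python
def split_name(full_name):
    """Split 'First [Second] Last' into (last, first) on the final space."""
    first, last = full_name.rsplit(maxsplit=1)
    return last, first


print(split_name('Gerrit Cole'))        # ('Cole', 'Gerrit')
print(split_name('Chi Chi Gonzalez'))   # ('Gonzalez', 'Chi Chi')
```

The result is in the (last, first) order that `playerid_lookup` is called with above. A single-word name would raise a `ValueError` here, so it would still need its own handling.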

In [5]:
# Example
pitcher = 'Gerrit Cole'
# The function returns a (player id table, Statcast DataFrame) pair
p_id_cole, df_pitcher_sc_cole = get_pitcher_pb_sc_data(pitcher)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Gathering Player Data


In [18]:
%%timeit
# This only times building the list literal below, not any download

pitcher_list = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Reynaldo Lopez',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

92.4 ns ± 6.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


In [6]:

# Make a list of 10 pitchers as a test set
pitcher_list = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Reynaldo Lopez',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

# Make a list of 10 pitchers that includes a name not in the database as a test set that can throw an error
pitcher_list_werror = (['Gerrit Cole',
                 'Justin Verlander',
                 'Caleb Smith',
                 'Chris Paddack',
                 'Ben Lacar',
                 'Robbie Ray',
                 'Zach Eflin',
                 'Dylan Bundy',
                 'Zach Plesac',
                 'Merrill Kelly'])

In [7]:
%%time
# Done serially (two pitchers), takes 21.4 s
for pitcher in pitcher_list[0:2]:
    print(pitcher)
    p_id, df_pitcher_sc = get_pitcher_pb_sc_data(pitcher)

Gerrit Cole
Gerrit Cole
Gathering player lookup table. This may take a moment.
Gathering Player Data
Justin Verlander
Justin Verlander
Gathering player lookup table. This may take a moment.
Gathering Player Data


# Evaluate pybaseball scraping with multithreading

I mainly modified functions based on [this StackOverflow post](https://stackoverflow.com/questions/16982569/making-multiple-api-calls-in-parallel-using-python-ipython).

- What happens if # of threads > # items to download?
- What happens if # items to download > # of threads?
- What happens if you max out the threads? (How do you know the max?)
- What if there is an error in an item you download?
- How do you do a loop with each pass through the loop doing a batch download? (e.g., scrape 10 items 2 at a time, so 5 passes through the loop)
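
The first two questions, and the last one, mostly come down to how a list gets sliced up. Here is a minimal sketch with placeholder items and no downloads. It shows that the `items[i::nthreads]` slice hands extra threads an empty list, and that batching can be done by slicing in fixed-size steps. `split_round_robin` and `batches` are hypothetical helper names introduced for this illustration.

```python
def split_round_robin(items, nthreads):
    """Deal items out to nthreads lists, the same way items[i::nthreads] does."""
    return [items[i::nthreads] for i in range(nthreads)]


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


items = ['a', 'b', 'c', 'd', 'e']

# More items than threads: each thread gets several items
more_items = split_round_robin(items, 2)        # [['a', 'c', 'e'], ['b', 'd']]

# More threads than items: the extra threads get empty lists and do nothing
more_threads = split_round_robin(items[:2], 4)  # [['a'], ['b'], [], []]

# Batch pattern: 5 items, 2 at a time, so 3 passes through the loop
batch_list = list(batches(items, 2))            # [['a', 'b'], ['c', 'd'], ['e']]
```

In a batch loop, each slice from `batches` could then be handed to a threaded download function, one pass at a time.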


In [8]:
def get_pitcher_pb_sc_data_range(pitcher_range, store=None):
    """Process a list of pitcher names, storing the results in a dict keyed by name."""
    if store is None:
        store = {}
    for pitcher in pitcher_range:
        # Easy way to skip if there's an error
        try:
            store[pitcher] = get_pitcher_pb_sc_data(pitcher)
        except Exception:
            # Catch ordinary errors only, so Ctrl-C can still interrupt the loop
            continue
    return store
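
Skipping silently makes it hard to tell afterwards which names failed. One possible variation keeps a second dict of errors. In this sketch, `fetch` is a hypothetical stand-in for `get_pitcher_pb_sc_data` so that it runs without network access, and `fetch_range` is likewise a name introduced here.

```python
def fetch(name):
    """Hypothetical stand-in for get_pitcher_pb_sc_data; fails for one name."""
    if name == 'Ben Lacar':
        raise ValueError('no player id found for ' + name)
    return name.upper()


def fetch_range(names, store=None, errors=None):
    """Fetch each name, storing results and failures in separate dicts."""
    if store is None:
        store = {}
    if errors is None:
        errors = {}
    for name in names:
        try:
            store[name] = fetch(name)
        except Exception as exc:
            # Keep the message so failed names can be reviewed or retried later
            errors[name] = repr(exc)
    return store, errors


store, errors = fetch_range(['Gerrit Cole', 'Ben Lacar', 'Robbie Ray'])
print(sorted(store))   # names that succeeded
print(errors)          # names that failed, with the reason
```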

In [9]:
from threading import Thread

In [10]:
# def threaded_process_range(nthreads, id_range):
#     """process the id range in a specified number of threads"""
#     store = {}
#     threads = []
#     # create the threads
#     for i in range(nthreads):
#         ids = id_range[i::nthreads]
#         t = Thread(target=process_range, args=(ids,store))
#         threads.append(t)

#     # start the threads
#     [ t.start() for t in threads ]
#     # wait for the threads to finish
#     [ t.join() for t in threads ]
#     return store

In [11]:
def threaded_process_range(nthreads, pitcher_list):
    """process the pitcher list in a specified number of threads"""
    store = {}
    threads = []
    # create the threads
    for i in range(nthreads):
        ids = pitcher_list[i::nthreads]
        t = Thread(target=get_pitcher_pb_sc_data_range, args=(ids,store))
        threads.append(t)

    # start the threads
    for t in threads:
        t.start()
    # wait for the threads to finish
    for t in threads:
        t.join()
    return store
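
The standard library's `concurrent.futures.ThreadPoolExecutor` offers a similar pattern with less bookkeeping. The sketch below is an alternative, not the approach used in this notebook. `slow_fetch` is a hypothetical stand-in that sleeps instead of downloading, and `threaded_fetch` is a name introduced here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_fetch(name):
    """Hypothetical stand-in for a download: wait briefly, then return a value."""
    time.sleep(0.2)
    return len(name)


def threaded_fetch(names, nthreads):
    """Run slow_fetch over names using a pool of nthreads worker threads."""
    store = {}
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Map each future back to the name it was submitted for
        futures = {pool.submit(slow_fetch, name): name for name in names}
        for future in as_completed(futures):
            store[futures[future]] = future.result()
    return store


names = ['Gerrit Cole', 'Justin Verlander', 'Caleb Smith', 'Chris Paddack']

start = time.perf_counter()
serial = {name: slow_fetch(name) for name in names}
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
threaded = threaded_fetch(names, nthreads=4)
threaded_seconds = time.perf_counter() - start

print(f'serial: {serial_seconds:.2f} s, threaded: {threaded_seconds:.2f} s')
```

Because the stand-in spends its time waiting, much like a network request, the four threaded calls overlap and finish in roughly the time of one. The pool also queues any items beyond `max_workers`, so the list does not need to be split by hand.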

In [12]:
my_store = threaded_process_range(4, pitcher_list)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Justin Verlander
Gathering player lookup table. This may take a moment.
Caleb Smith
Gathering player lookup table. This may take a moment.
Chris Paddack
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Dylan Bundy
Gathering player lookup table. This may take a moment.
Reynaldo Lopez
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Zach Eflin
Gathering player lookup table. This may take a moment.
Robbie Ray
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Zach Plesac
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Gathering Player Data
Merrill Kelly
Gathering player lookup table. This may take a moment.
Gathering Player Data


In [63]:
my_store = threaded_process_range(4, pitcher_list_werror)

Gerrit Cole
Gathering player lookup table. This may take a moment.
Justin Verlander
Gathering player lookup table. This may take a moment.Caleb Smith

Gathering player lookup table. This may take a moment.
Ben Lacar
Gathering player lookup table. This may take a moment.
Gathering Player Data
Gathering Player Data
Gathering Player Data


In [13]:
my_store.keys()

dict_keys(['Chris Paddack', 'Gerrit Cole', 'Caleb Smith', 'Justin Verlander', 'Dylan Bundy', 'Reynaldo Lopez', 'Zach Eflin', 'Robbie Ray', 'Zach Plesac', 'Merrill Kelly'])

In [14]:
my_store['Gerrit Cole'][1]

index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,...,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,sw_type,sw_true
0,FF,2019-09-29,98.4,-1.9530,5.7396,Gerrit Cole,592230,543037,strikeout,swinging_strike,...,1,8,8,1,1,8,Infield shift,Standard,whiff,1
1,FF,2019-09-29,99.0,-1.8961,5.7786,Gerrit Cole,592230,543037,,ball,...,1,8,8,1,1,8,Infield shift,Standard,take_ball,0
2,KC,2019-09-29,86.2,-1.9253,5.7561,Gerrit Cole,592230,543037,,foul,...,1,8,8,1,1,8,Infield shift,Standard,foul,1
3,FF,2019-09-29,99.4,-1.9853,5.7987,Gerrit Cole,592230,543037,,foul,...,1,8,8,1,1,8,Infield shift,Standard,foul,1
4,FF,2019-09-29,98.5,-1.9848,5.7508,Gerrit Cole,592230,543037,,swinging_strike,...,1,8,8,1,1,8,Infield shift,Standard,whiff,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3357,SL,2019-03-29,89.8,-2.3465,5.9531,Gerrit Cole,640457,543037,,foul,...,0,0,0,0,0,0,Infield shift,Standard,foul,1
3358,FF,2019-03-29,96.3,-2.1579,5.8887,Gerrit Cole,640457,543037,,foul,...,0,0,0,0,0,0,Infield shift,Standard,foul,1
3359,SL,2019-03-29,88.5,-2.3414,5.9628,Gerrit Cole,640457,543037,,ball,...,0,0,0,0,0,0,Infield shift,Standard,take_ball,0
3360,FT,2019-03-29,96.7,-2.3392,5.7665,Gerrit Cole,640457,543037,,ball,...,0,0,0,0,0,0,Infield shift,Standard,take_ball,0


# What are threads? What are processes and what is going on?

In [None]:
# 