# I/O Bound Programs

## I/O Bounds

In the last mission, we covered CPU bounds, and showed how they can limit the performance of your program. We analyzed numerous algorithms, and figured out their performance characteristics. **One good way to model a program is as a set of inputs, actions, and outputs.** Here's a simple program that reads in a database file, a CSV file, and a text file, processes them, and creates a report:

![io-bounds-1](https://s3.amazonaws.com/dq-content/168/report_assembly.png)

This only involves reading and writing data at the very beginning and very end of the program. But imagine that we're working with a very large dataset that doesn't fit into memory. As we did in an earlier mission, we might need to use the database to do some processing, and to read in data periodically.<br>

Let's say we have a text file of teams from [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball). We want to calculate statistics on each team, but the raw statistics are too large to load into memory. We instead want to use [SQLite](https://www.sqlite.org/) to compute the statistics, then get the results. So we'd loop over the team names, then run a SQL query for each team to get the statistics we want. It would look something like this:

![io-bounds-2](https://s3.amazonaws.com/dq-content/168/report_assembly_bidir.png)

In the above example, some of the tasks are CPU bound, and some are I/O bound. CPU bound tasks are tasks where our Python program is executing something.<br>

**CPU bound tasks** will:
* Execute faster if you **optimize the algorithm**.
* Execute faster if your **processor has a higher clock speed** (can execute more operations per second).

**I/O bound tasks** are tasks where:
* Our program is **reading from an input** (like a `CSV` file).
* Our program is **writing to an output** (like a `text` file).
* Our program is **waiting for another program to execute** something (like a `SQL query`).
* Our program is **waiting for another server to execute** something (like an `API request`).

In general, during I/O bound tasks, your program **isn't using the CPU at all**, because it isn't executing any operations -- it's just waiting around for something else to finish. I/O bound tasks can be optimized, but the process is different from CPU bound tasks. In this mission, we'll learn how to improve the performance of I/O bound tasks.<br>
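
One rough way to see the difference is to compare wall-clock time with CPU time for each kind of task. The sketch below is only an illustration: `cpu_bound_task`, `io_bound_task`, and `measure` are hypothetical helpers, and `time.sleep` stands in for waiting on a disk, database, or server.

```python
import time

def cpu_bound_task(n):
    # Keeps the CPU busy: sums the squares of the first n integers.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound_task(seconds):
    # Stand-in for waiting on a disk, database, or network response.
    time.sleep(seconds)

def measure(func, *args):
    # Returns (wall-clock seconds, CPU seconds) spent running func.
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return wall_used, cpu_used

cpu_wall, cpu_cpu = measure(cpu_bound_task, 200_000)
io_wall, io_cpu = measure(io_bound_task, 0.2)

print("CPU bound: wall={:.3f}s cpu={:.3f}s".format(cpu_wall, cpu_cpu))
print("I/O bound: wall={:.3f}s cpu={:.3f}s".format(io_wall, io_cpu))
```

For the I/O bound stand-in, the CPU time should be close to zero even though the wall-clock time is about `0.2` seconds, because `time.process_time()` does not count time spent sleeping.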

As we work through this mission, we'll use a SQLite database of baseball statistics from [Major League Baseball](https://en.wikipedia.org/wiki/Major_League_Baseball), or MLB, an American baseball league. If you're unfamiliar with baseball, you can read the rules [here](https://en.wikipedia.org/wiki/Baseball_rules). The data originally came from [Sean Lahman's site](http://www.seanlahman.com/baseball-archive/statistics/), and was transformed to a SQLite database [here](https://github.com/jknecht/baseball-archive-sqlite).<br>

Each table in the dataset contains statistics on baseball players from `1871` to `2015`. Some tables contain statistics on pitching, others on fielding, others on awards players earned, and so on.<br>

The important tables are:
* `Master` -- contains information about players (including names)
* `Batting` -- contains information on batting by year.
* `Pitching` -- information on pitching by year.
* `Fielding` -- information on fielding by year.
* `AwardsManagers` -- which managers of teams won awards by year.
* `AwardsPlayers` -- which players won awards by year.
* `Teams` -- contains information on teams by year.
* `Salaries` -- salary data for each player by year, starting in `1985`.
* `HallOfFame` -- information about when players were inducted into the Hall of Fame, which honors the best baseball players of all time.
* `Managers` -- information on managers of teams.

Here's a look at the `Master` table:

In [2]:
import sqlite3
import pandas as pd

In [2]:
conn = sqlite3.connect('../data/lahman2015.sqlite')
query_master_table = 'SELECT * FROM Master LIMIT 5'
pd.read_sql(query_master_table, conn)

| | playerID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | deathYear | deathMonth | deathDay | ... | nameLast | nameGiven | weight | height | bats | throws | debut | finalGame | retroID | bbrefID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 1981 | 12 | 27 | USA | CO | Denver | | | | ... | Aardsma | David Allan | 220 | 75 | R | R | 2004-04-06 | 2015-08-23 | aardd001 | aardsda01 |
| 1 | aaronha01 | 1934 | 2 | 5 | USA | AL | Mobile | | | | ... | Aaron | Henry Louis | 180 | 72 | R | R | 1954-04-13 | 1976-10-03 | aaroh101 | aaronha01 |
| 2 | aaronto01 | 1939 | 8 | 5 | USA | AL | Mobile | 1984.0 | 8.0 | 16.0 | ... | Aaron | Tommie Lee | 190 | 75 | R | R | 1962-04-10 | 1971-09-26 | aarot101 | aaronto01 |
| 3 | aasedo01 | 1954 | 9 | 8 | USA | CA | Orange | | | | ... | Aase | Donald William | 190 | 75 | R | R | 1977-07-26 | 1990-10-03 | aased001 | aasedo01 |
| 4 | abadan01 | 1972 | 8 | 25 | USA | FL | Palm Beach | | | | ... | Abad | Fausto Andres | 184 | 73 | L | L | 2001-09-10 | 2006-04-13 | abada001 | abadan01 |


Here's a look at the `Batting` table:

In [3]:
query_batting_table = 'SELECT * FROM Batting LIMIT 5'
pd.read_sql(query_batting_table, conn)

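| | playerID | yearID | stint | teamID | lgID | G | G_batting | AB | R | H | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 2004 | 1 | SFN | NL | 11 | | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1 | aardsda01 | 2006 | 1 | CHN | NL | 45 | | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | |
| 2 | aardsda01 | 2007 | 1 | CHA | AL | 25 | | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 3 | aardsda01 | 2008 | 1 | BOS | AL | 47 | | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
| 4 | aardsda01 | 2009 | 1 | SEA | AL | 73 | | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |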


Here's a look at the `Pitching` table:

In [4]:
query_pitching_table = 'SELECT * FROM Pitching LIMIT 5'
pd.read_sql(query_pitching_table, conn)

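| | playerID | yearID | stint | teamID | lgID | W | L | G | GS | CG | ... | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 2004 | 1 | SFN | NL | 1 | 0 | 11 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 61 | 5 | 8 | 0 | 1 | |
| 1 | aardsda01 | 2006 | 1 | CHN | NL | 3 | 0 | 45 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 225 | 9 | 25 | 1 | 3 | |
| 2 | aardsda01 | 2007 | 1 | CHA | AL | 2 | 1 | 25 | 0 | 0 | ... | 3 | 2 | 1 | 0 | 151 | 7 | 24 | 2 | 1 | |
| 3 | aardsda01 | 2008 | 1 | BOS | AL | 4 | 2 | 47 | 0 | 0 | ... | 2 | 3 | 5 | 0 | 228 | 7 | 32 | 3 | 2 | |
| 4 | aardsda01 | 2009 | 1 | SEA | AL | 3 | 6 | 73 | 0 | 0 | ... | 3 | 2 | 0 | 0 | 296 | 53 | 23 | 2 | 1 | |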


And here's a look at the `Fielding` table:

In [5]:
query_fielding_table = 'SELECT * FROM Fielding LIMIT 5'
pd.read_sql(query_fielding_table, conn)

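| | playerID | yearID | stint | teamID | lgID | POS | G | GS | InnOuts | PO | A | E | DP | PB | WP | SB | CS | ZR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | aardsda01 | 2004 | 1 | SFN | NL | P | 11 | | 33 | 0 | 0 | 0 | 0 | | | | | |
| 1 | aardsda01 | 2006 | 1 | CHN | NL | P | 45 | | 159 | 1 | 5 | 0 | 1 | | | | | |
| 2 | aardsda01 | 2007 | 1 | CHA | AL | P | 25 | | 96 | 2 | 4 | 1 | 0 | | | | | |
| 3 | aardsda01 | 2008 | 1 | BOS | AL | P | 47 | | 147 | 3 | 6 | 0 | 0 | | | | | |
| 4 | aardsda01 | 2009 | 1 | SEA | AL | P | 73 | | 213 | 2 | 5 | 0 | 1 | | | | | |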


The data is stored in the `lahman2015.sqlite` file.

## Profiling an I/O bound task

To illustrate what an I/O bound task looks like, let's write a program similar to the one we described in the last screen. It will read from a list of baseball team ids, then run a query to get the aggregate statistics for each team in the list. We could do this in a single SQL query, but in order to simplify the query and make it easier to scale up this algorithm later, we're going to do one SQL query per team.<br>
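
For comparison, here is a minimal sketch of the single-query alternative mentioned above. It uses `GROUP BY` on a toy in-memory table with made-up rows (not the Lahman data), so the table contents and the name `totals_by_team` are assumptions for illustration only.

```python
import sqlite3

# Toy stand-in for the Batting table (made-up rows, not Lahman data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (playerID TEXT, teamID TEXT, HR INTEGER)")
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?)",
    [("p1", "BOS", 10), ("p2", "BOS", 5), ("p3", "NYA", 7)],
)

# One query returns a (teamID, total) row for every team.
rows = conn.execute(
    "SELECT teamID, SUM(HR) FROM Batting GROUP BY teamID ORDER BY teamID"
).fetchall()
totals_by_team = dict(rows)
print(totals_by_team)
conn.close()
```

The rest of this mission sticks with one query per team, since that version is easier to split across threads later.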

The `Teams` table in the database contains the team ID and team name. Here are the relevant column names:
* `name` -- the name of the team.
* `teamID` -- the ID of the team, needed to join with player data.
* `franchID` -- the ID of the franchise, needed to join with the `TeamsFranchises` table.

The `TeamsFranchises` table in the database records whether or not each team is still active. Because MLB has a history going back to `1871`, many older teams have ceased operations. We only want to pull data for active teams.<br>

We can get a list of active team IDs with the query:

```sql
SELECT DISTINCT teamID FROM Teams INNER JOIN TeamsFranchises ON Teams.franchID == TeamsFranchises.franchID WHERE TeamsFranchises.active = 'Y';
```

This gives us the list of active teams, `teams`, that's shown in the code.<br>

For each team, we'll want to query the total number of home runs the team has hit all-time. We can get this by querying the `Batting` table. Each row is the statistics for a given batter in a given year. The relevant columns are:
* `teamID` -- the ID of the team the batter played for.
* `HR` -- the number of home runs the batter hit.

We'll want to wrap all of the code into a function that returns the counts on a per-team basis. We'll then want to use [cProfile](https://docs.python.org/3/library/profile.html) to profile the code, so we can tell what we need to optimize.


* Create a function that takes in a list of teams, `teams`, and returns a list of the number of home runs each team has hit all-time:
  * Create an empty list, `home_runs`.
  * Loop over each item in `teams`.
  * Query `lahman2015.sqlite`, and get the total number of home runs:
    * Write a query that uses the SQLite [sum](https://www.sqlite.org/lang_aggfunc.html#sumunc) function.
  * Append the result to `home_runs`.
  * Return `home_runs`.
* Execute the function using `cProfile`. Create a string to pass into the `cProfile` [run()](https://docs.python.org/3/library/profile.html#profile.run) method that:
  * Executes the function you created earlier, passing in `teams`.
  * Assigns the result to `home_runs`.
* Take a look at the cProfile output. Can you tell where most of the time is going?

In [2]:
import cProfile

query = '''SELECT DISTINCT teamID 
            FROM Teams
            INNER JOIN TeamsFranchises
            ON Teams.franchID == TeamsFranchises.franchID 
            WHERE TeamsFranchises.active = 'Y';
'''
conn = sqlite3.connect("../data/lahman2015.sqlite")
cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

In [3]:
print(teams)

['BSN', 'CHN', 'CN2', 'PT1', 'SL4', 'NY1', 'PHI', 'BR3', 'PIT', 'BRO', 'CIN', 'SLN', 'BLA', 'BOS', 'CHA', 'CLE', 'DET', 'MLA', 'PHA', 'WS1', 'SLA', 'NYA', 'ML1', 'BAL', 'KC1', 'LAN', 'SFN', 'LAA', 'MIN', 'WS2', 'HOU', 'NYN', 'CAL', 'ATL', 'OAK', 'KCA', 'SE1', 'MON', 'SDN', 'ML4', 'TEX', 'SEA', 'TOR', 'COL', 'FLO', 'ANA', 'TBA', 'ARI', 'MIL', 'WAS', 'MIA']


In [4]:
def get_total_homeruns_by_team(teams):
    home_runs = []
    
    for team in teams:
        
        query = '''
                SELECT SUM(HR) 
                FROM Batting
                WHERE teamID=?
        '''
        runs = cur.execute(query, [team]).fetchall()
        home_runs.append(runs[0][0])
    
    return home_runs

In [5]:
home_runs = get_total_homeruns_by_team(teams)
cProfile.run('home_runs = get_total_homeruns_by_team(teams)')

         157 function calls in 0.166 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.166    0.166 <ipython-input-4-3faf3b549227>:1(get_total_homeruns_by_team)
        1    0.000    0.000    0.166    0.166 <string>:1(<module>)
        1    0.000    0.000    0.166    0.166 {built-in method builtins.exec}
       51    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       51    0.165    0.003    0.165    0.003 {method 'execute' of 'sqlite3.Cursor' objects}
       51    0.001    0.000    0.001    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}




## Blocking Tasks

As you could see in the previous screen, most of the total time of the program was spent waiting for SQLite to execute the query:


```python
51    0.165    0.003    0.165    0.003 {method 'execute' of 'sqlite3.Cursor' objects}
```

Of the total `.166` seconds to run the program, `.165` seconds were spent waiting for SQLite to execute the queries and return the results (your exact numbers may be different). When a task is just waiting for something to happen, we say that the task is blocked. In this case, our line `cur.execute` is waiting on the SQLite engine to run the query and return a result, so it's blocked by SQLite.<br>

When a thread is blocked, it isn't running any operations on the CPU -- it's just waiting around. SQLite stores its data on disk, and may have to read from the disk whenever a query is executed. Reading from a hard drive is far slower than reading from memory, both because the drive itself transfers data more slowly and because it sits much farther from the CPU than memory does.<br>

It's possible to significantly speed up tasks that are I/O bound by reading the data into memory first. This relies on the data being able to fit into memory in the first place, of course. We'll illustrate this principle by copying our SQLite database into memory, then trying to run the same queries. We can initialize an in-memory SQLite database using the string `:memory:`:

```python
conn = sqlite3.connect(':memory:')
```

We can copy our database into memory by connecting to both databases, then using the [sqlite3.Connection.iterdump()](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.iterdump) method, and the [sqlite3.Cursor.executescript()](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.executescript) method:

```python
import sqlite3

# Create an in memory database.
memory = sqlite3.connect(':memory:')

# Connect to our disk database.
disk = sqlite3.connect('lahman2015.sqlite')

# Create a query that will read the contents of the disk database into another database.
dump = "".join(line for line in disk.iterdump())

# Run the query to copy the database from disk into memory.
memory.executescript(dump)

```

The above code first creates an in-memory database using the special string `:memory:`, then creates a SQL script that reproduces the contents of the disk database. Finally, we run the script against the memory database to perform the copying. We can then execute queries like normal on the `memory` database. Let's try porting our algorithm from the last screen to using the in-memory database.
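
As an aside, Python 3.7 and later also provide `sqlite3.Connection.backup()`, which copies a database without building one large SQL string. This is a different technique from the `iterdump()` approach used in this mission. The sketch below builds a small temporary database with made-up rows so it can run on its own; the file name `toy.sqlite` is hypothetical.

```python
import os
import sqlite3
import tempfile

# Build a small on-disk database to copy (hypothetical stand-in data).
db_dir = tempfile.mkdtemp()
db_path = os.path.join(db_dir, "toy.sqlite")
disk = sqlite3.connect(db_path)
disk.execute("CREATE TABLE Batting (teamID TEXT, HR INTEGER)")
disk.executemany("INSERT INTO Batting VALUES (?, ?)", [("BOS", 10), ("NYA", 7)])
disk.commit()

# Copy every table from the disk database into a fresh in-memory database.
memory = sqlite3.connect(":memory:")
disk.backup(memory)
disk.close()

# The in-memory copy can now be queried like any other connection.
row_count = memory.execute("SELECT COUNT(*) FROM Batting").fetchone()[0]
print(row_count)

# Clean up the temporary file and directory.
os.remove(db_path)
os.rmdir(db_dir)
```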



* Create a function that takes in a list of teams, `teams`, and returns a list of the number of home runs each team has hit all-time:
  * Create an empty list, `home_runs`.
  * Loop over each item in teams.
  * Query `memory`, and get the total number of home runs:
    * Write a query that uses the SQLite sum function.
  * Append the result to `home_runs`.
  * Return `home_runs`.
* Execute the function using `cProfile`. Create a string to pass into the [cProfile run()](https://docs.python.org/3/library/profile.html#profile.run) method that:
  * Executes the function you created earlier, passing in `teams`.
  * Assigns the result to `home_runs`.
* Take a look at the cProfile output. Can you tell where most of the time is going?

In [6]:
memory = sqlite3.connect(':memory:') # create a memory database
disk = sqlite3.connect('../data/lahman2015.sqlite')

dump = "".join([line for line in disk.iterdump() if "Batting" in line])
memory.executescript(dump)

cur = memory.cursor()

In [7]:
def get_total_homeruns_by_team(teams):
    
    home_runs = []
    query = '''
            SELECT SUM(HR)
            FROM Batting
            WHERE TeamID=?
    '''
    
    for team in teams:
        
        homeruns = cur.execute(query,[team]).fetchall()
        home_runs.append(homeruns[0][0])
        
    return home_runs

In [8]:
home_runs = get_total_homeruns_by_team(teams)
cProfile.run('home_runs = get_total_homeruns_by_team(teams)')

         157 function calls in 0.043 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.043    0.043 <ipython-input-7-9cff9b9ac4b8>:1(get_total_homeruns_by_team)
        1    0.000    0.000    0.043    0.043 <string>:1(<module>)
        1    0.000    0.000    0.043    0.043 {built-in method builtins.exec}
       51    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       51    0.043    0.001    0.043    0.001 {method 'execute' of 'sqlite3.Cursor' objects}
       51    0.000    0.000    0.000    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}




## Parallel Execution

As you saw in the previous screen, loading the same data into memory saved us significant time relative to having the database on disk (`.043` seconds versus `.166`, although your exact numbers may vary). Note that the technique we used in the last screen is mostly for illustrative purposes -- you're almost always better off loading data using Pandas if you want it to be in-memory.<br>

However, our program still had to wait around for `.043` seconds while queries were executed. During this time, our program wasn't doing anything, so it was essentially wasted time from our perspective. Luckily, there's a way to improve performance even more -- we simply query the database multiple times at once.<br>

So far, we've worked with *single-threaded* programs. Single-threaded programs execute one instruction at a time:

![single-thread](https://s3.amazonaws.com/dq-content/168/single_threaded.png)

We execute the query, then wait for a bit, then get a result. What if, instead, we could run several queries at once? It might look like this:


![multi-thread](https://s3.amazonaws.com/dq-content/168/multi_threaded.png)

We'd execute the query for each team separately, in its own "thread". We'd then execute all of the "threads" at once. This is called threading, and it allows us to execute tasks that are I/O bound more quickly. While one thread is waiting around for a query to finish, another thread can process its result, meaning that the CPU is used more efficiently.<br>

We can use the Python 3 [threading](https://docs.python.org/3/library/threading.html) library to implement threading in our programs.<br>

In order to start a thread, we first define the task we want to run in the thread as a function, like this:

```python
def task(team):
    print(team)
```

We can then create an instance of the [threading.Thread](https://docs.python.org/3/library/threading.html#threading.Thread) class, passing in the function as `target` and any arguments we want as `args`, and start it:
  
```python
thread = threading.Thread(target=task, args=(team,))
thread.start()
```

The above code will create a new thread that executes the `task` function, passing in `team` as an argument. The result will be that you'll see the team name as output.<br>

Let's try starting one thread to run the `task` function for each team.


* Loop through each team name in the `teams` list using [enumerate](https://docs.python.org/3/library/functions.html#enumerate), and:
  * Create a new thread that calls task and passes in a team name.
  * Start the thread.
  * Print "Started task X", where X is the iteration of the loop that we're on.
* Look at the output of your program, and compare it to the order of the teams in the `teams` list. What do you notice?

In [9]:
import threading

def task(team):
    print(team)

In [10]:
for i, team in enumerate(teams):
    
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task {}'.format(i))

BSNStarted task 0

CHN
Started task 1
CN2Started task 2

PT1
Started task 3
SL4
Started task 4
NY1
Started task 5
Started task 6PHI

BR3
Started task 7
PITStarted task 8

BROStarted task 9

CIN
Started task 10
SLN
Started task 11
BLA
Started task 12
BOSStarted task 13

CHA
Started task 14
CLE
Started task 15
DETStarted task 16

MLA
Started task 17
PHAStarted task 18

WS1Started task 19

SLAStarted task 20

NYA
Started task 21
ML1
Started task 22
BAL
Started task 23
KC1
Started task 24
LAN
Started task 25
SFN
Started task 26
LAA
Started task 27
MIN
Started task 28
WS2
Started task 29
HOU
Started task 30
NYN
CALStarted task 31

Started task 32
ATLOAKStarted task 33

KCA
SE1
Started task 34

Started task 35
Started task 36MONSDN


ML4TEXStarted task 37


SEATORStarted task 38


COLStarted task 39
FLO

Started task 40ANA

Started task 41TBA

Started task 42ARI

Started task 43MIL

WASStarted task 44MIA


Started task 45
Started task 46
Started task 47
Started task 48
Started task 49
Starte

## Thread Blocking

You should have noticed some strange output from the last screen, which looked like this:

```
BSNStarted task 0

CHNStarted task 1

CN2Started task 2

PT1
Started task 3
SL4Started task 4

NY1Started task 5

PHI
Started task 6
BR3Started task 7

PITStarted task 8

BROStarted task 9

CINStarted task 10

SLNStarted task 11

BLAStarted task 12

Started task 13BOS

CHAStarted task 14
```

There are a few reasons why this output looks "off", which we'll cover one by one in the next few screens:
* The "main" thread not blocking when new threads are started.
* Multiple threads accessing a shared resource that isn't thread-safe.
* Not locking when accessing shared resources.
* The [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock), which prevents multiple threads from running Python code at once.

In this screen, we'll cover the idea of a "main" thread, and blocking. In the last screen, we started `51` threads, one for each current or historical MLB team ID that is still active. However, there was one thread already running -- the "main" thread, which is our program. Here's one way to think about it:

![three-threads](https://s3.amazonaws.com/dq-content/168/three_threads.svg)

If you look above, you see that the main program spawns multiple new threads (we only show two above), then starts them. **But the main thread doesn't wait around to see what happens with the threads -- it immediately moves on to making the next thread**. When there aren't any threads left to make, it finishes.<br>

Let's say the main program creates the thread "BOS". Depending on how long the "BOS" thread takes to execute, it's possible for the main program to spawn the next thread before the "BOS" thread finishes.<br>
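
A small sketch can make this visible. It assumes a hypothetical `slow_task` that sleeps for a moment, and it uses `threading.Thread.is_alive()` to check on the thread and `join()` (covered in the Joining Threads screen) to wait for it.

```python
import threading
import time

def slow_task():
    # Stand-in for a blocking operation such as a slow query.
    time.sleep(0.2)

thread = threading.Thread(target=slow_task)
thread.start()

# start() returns right away, so the worker is still sleeping here.
alive_after_start = thread.is_alive()

thread.join()  # Block the main thread until the worker finishes.
alive_after_join = thread.is_alive()

print(alive_after_start, alive_after_join)
```

The first check happens while the worker is still running, which is the same situation the main program is in when it moves on to spawn the next team's thread.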

This will become more **obvious if we add a delay before printing `team` in the `task` function**.

* Modify the `task` function so that it sleeps for `3` seconds before printing `team`.
* Loop through each team name in the `teams` list using [enumerate](https://docs.python.org/3/library/functions.html#enumerate), and:
  * Create a new thread that calls `task` and passes in a team name.
  * Start the thread.
  * Print "Started task X", where X is the iteration of the loop that we're on.
* Look at the output of your program, and compare it to both the order of the teams in the `teams` list and the output from the last screen. What do you notice?

In [11]:
import time

def task(team):
    time.sleep(3)
    print(team)

In [12]:
for i, team in enumerate(teams):
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task {}'.format(i))

Started task 0
Started task 1
Started task 2
Started task 3
Started task 4
Started task 5
Started task 6
Started task 7
Started task 8
Started task 9
Started task 10
Started task 11
Started task 12
Started task 13
Started task 14
Started task 15
Started task 16
Started task 17
Started task 18
Started task 19
Started task 20
Started task 21
Started task 22
Started task 23
Started task 24
Started task 25
Started task 26
Started task 27
Started task 28
Started task 29
Started task 30
Started task 31
Started task 32
Started task 33
Started task 34
Started task 35
Started task 36
Started task 37
Started task 38
Started task 39
Started task 40
Started task 41
Started task 42
Started task 43
Started task 44
Started task 45
Started task 46
Started task 47
Started task 48
Started task 49
Started task 50


## Joining Threads

As you should have seen in the previous screen, the main thread spawned all of the team threads before a single team thread finished executing. This means that you saw all of the "Started task" messages before seeing the team names printed out. Depending on the timing, you may also not have seen any of the team names. This is because the main thread didn't wait for the threads to finish before exiting.

There are times, particularly when we've saturated the capacity of a shared resource, or when we want to wait for all the threads to finish before exiting, that we want the main program to wait for the threads to finish.

A common example of this is:
* Reading data in via several threads.
* Waiting until all the data is read in.
* Processing the data.
* Spawning several threads to export the data.
* Waiting until the data is finished exporting to exit.


Another example is "batching up" threads, so only a few run at a time:
* Start `5` threads for the first `5` teams.
* Wait until they finish, then spawn `5` more.
* Keep going until all of the teams are processed.

Here's a diagram:

![joining-threads](https://s3.amazonaws.com/dq-content/168/single_threaded.png)

As you can see, we complete each "batch" of threads before moving on to the next one. We can force the main program to wait for a thread to complete before moving on using the [threading.Thread.join()](https://docs.python.org/3/library/threading.html#threading.Thread.join) method.<br>

We can call the join method multiple times to wait for multiple threads:

```python
t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))
t3 = threading.Thread(target=task, args=(team,))

# Start the first three threads
t1.start()
t2.start()
t3.start()

t1.join() # Wait until t1 finishes.
t2.join() # Wait until t2 finishes.  If it already finished, then keep going.
t3.join() # Wait until t3 finishes.  If it already finished, then keep going.
```

Note that the "main" thread execution will pause at each call to `join`. If we want to wait for multiple threads, we just call `join` multiple times. Each `join` call will wait until the associated thread finishes, or keep going if it already finished. Let's look at an example:

![join-thread](../img/1.png)

Let's say `t1` takes `1.2` seconds to run, `t2` takes `.6` seconds, and `t3` takes `.9` seconds. If we join in the order on the left, we wait `1.2` seconds for `t1`, then don't wait for `t2` or `t3` since they already finished. On the other hand, if we join using the order on the right, we wait `.9` seconds for `t3`, don't wait for `t2` since it finished, then wait `.3` seconds for `t1`.<br>

Note that we wait `1.2` seconds total in both cases, since the slowest thread (`t1`) takes that long to finish.<br>
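
The timing argument above can be checked with a short sketch. It assumes a hypothetical `wait_for` function that just sleeps, and it scales the three durations down to a quarter of their size so it runs quickly.

```python
import threading
import time

def wait_for(seconds):
    # Stand-in for a thread that takes `seconds` to finish its work.
    time.sleep(seconds)

# One quarter of the 1.2, 0.6, and 0.9 second durations in the example.
durations = [0.3, 0.15, 0.225]
threads = [threading.Thread(target=wait_for, args=(d,)) for d in durations]

start = time.perf_counter()
for thread in threads:
    thread.start()

# Join in reverse order (t3, t2, t1), like the right-hand ordering.
for thread in reversed(threads):
    thread.join()
elapsed = time.perf_counter() - start

print("waited {:.2f}s; slowest thread needs {:.2f}s".format(elapsed, max(durations)))
```

Whichever order the `join` calls are made in, the total wait should land close to the duration of the slowest thread rather than the sum of all three.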

Let's try spawning threads in batches to see how that affects our output.

* Loop through each number in `range(11)`, using the loop variable `i`.
  * Generate the indices for our batch, which we can find with `range(i*5, (i+1) * 5)`.
  * Get the team names in our batch using the indices.
  * Create a list called `threads` to store our threads.
  * For each team name:
    * Create a new thread that calls `task` and passes in a team name.
    * Start the thread.
    * Append the thread to `threads`.
  * Loop through each thread in `threads`, and:
    * Use `threading.Thread.join()` to wait for the thread to finish.
  * Print `"Finished batch X"`, replacing X with the loop variable, `i`.
* Look at the output of your program, and compare it to both the order of the teams in the `teams` list and the output from the last screen. What do you notice?

In [13]:
def task(team):
    print(str(team)+'/')

In [14]:
for i in range(11):
    
    teams_batch = []
    
    for i2 in range(i*5, (i+1)*5):
        try:
            teams_batch.append(teams[i2])
        except IndexError:
            # The last batch runs past the end of `teams`; skip missing indices.
            pass
        
    threads = []
    
    for team in teams_batch:
        
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
        
    for thread in threads:
        thread.join()
        
    print('\nFinished batch {}'.format(i))
    

BSN/
CHN/
CN2/
PT1/
SL4/

Finished batch 0NY1/PHI/PIT/


BR3/BRO/



Finished batch 1
CIN/SLN/BLA/


BOS/CHA/

Finished batch 2CLE/DET/MLA/PHA/WS1/





SLA/NYA/ML1/
BAL/KC1/
Finished batch 3LAN/
LAA/

SFN/MIN/WS2/






HOU/
NYN/
CAL/
Finished batch 4ATL/OAK/




SE1/KCA/MON/SDN/ML4/TEX/
SEA/
Finished batch 5TOR/



COL/FLO/ANA/TBA/ARI/MIL/WAS/




MIA/







Finished batch 6

Finished batch 7

Finished batch 8

Finished batch 9

Finished batch 10


## Locking

It's important to be aware of accessing shared resources when you're working with threads. Some examples of shared resources are:
* The system stdout.
* SQL databases.
* APIs.
* Objects in memory.

In a previous screen, you may recall that our output looked strange because we used the `print` function in multiple threads.<br>

Think of the system standard output like a person taking orders at a deli. If people line up, and only one person orders at a time, the order they got into your deli will determine the order in which you make their sandwiches:

![deli-order-example](../img/2.png)

The person who got in fifth will get their sandwich after the person who got in fourth, and so on. But what happens when there's no line, and everyone is shouting their order at you? You can try to guess how the orders should be arranged based on when you heard their orders:

![deli-order-example2](../img/3.png)

But doing it like this might result in a very different order from when they arrived in your deli. You might make a sandwich for the person who came into your deli `10th` before you make a sandwich for a person who came into your deli `5th`.<br>

This is exactly what happens with the `print` function and the system standard output. `print` writes the text, like `BAL`, and the trailing newline character to the standard output as two separate steps, so another thread can write its own output in between them. This is why you see inconsistent spacing and team names running together, like this:

```
SLA
NYA
ML1
BALKC1

Finished batch 4
```
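
To see why team names and newlines can get separated, here is a small sketch of how CPython's `print` talks to its output stream. `RecordingStream` is a hypothetical helper that only records each `write()` call instead of displaying anything.

```python
class RecordingStream:
    """Minimal file-like object that records every write() call."""

    def __init__(self):
        self.writes = []

    def write(self, text):
        self.writes.append(text)
        return len(text)

stream = RecordingStream()
print("BAL", file=stream)

# Shows the individual pieces that print() handed to the stream.
print(stream.writes)
```

The text and the newline arrive as separate `write()` calls, so when several threads print at once, one thread's text can land between another thread's text and its newline.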

We can fix the problem of accessing shared resources by using thread locking. Locking allows us to ensure that only one thread is accessing a shared resource at the same time.<br>

You can create a lock with the [threading.Lock](https://docs.python.org/3/library/threading.html#threading.Lock) class. A lock has two methods:
* [threading.Lock.acquire()](https://docs.python.org/3/library/threading.html#threading.Lock.acquire) -- acquires the lock. If another thread already holds the lock, the call blocks until the lock is released.
* [threading.Lock.release()](https://docs.python.org/3/library/threading.html#threading.Lock.release) -- releases the lock, so other threads can acquire it.

A lock can only be acquired by a single thread at a time. Other threads have to wait until the lock is available, so they can also acquire it.<br>

The idea is for a thread to acquire a lock, access a shared resource, then release the lock. This prevents any other thread from passing the lock acquisition line until the lock is available.<br>

Here's an example:

```python
lock = threading.Lock()

def task(team):
    lock.acquire()
    # This code cannot be executed until a thread acquires the lock.
    print(team)
    lock.release()

t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))

t1.start()
t2.start()
```

![locking-process](../img/4.png)
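
A lock can also be used as a context manager, which is a common way to make sure it gets released. The sketch below is a variation on the example above rather than the mission's own solution; `sample_teams` and the `printed` list are hypothetical names added for illustration.

```python
import threading

lock = threading.Lock()
printed = []

def task(team):
    # "with lock" acquires the lock on entry and releases it on exit,
    # even if the body raises an exception.
    with lock:
        print(team)
        printed.append(team)

sample_teams = ["BOS", "NYA", "CHN"]  # hypothetical sample of team IDs
threads = [threading.Thread(target=task, args=(team,)) for team in sample_teams]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With explicit `acquire()` and `release()` calls, an exception between the two would leave the lock held forever, which is the main reason to prefer the `with` form.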

Let's try printing our teams list in batches, with locking. We'll also use the [sys.stdout.flush()](https://docs.python.org/3/library/sys.html#sys.stdout) method to flush the standard output after the print statement, so everything gets output immediately inside each thread. If we don't do this, data can persist in the `stdout` buffer, and only show up later on, after another thread prints.


* Create a lock.
* Modify the `task` function to:
  * Acquire the lock.
  * Print the team name.
  * Flush the standard output.
  * Release the lock.
* Loop through each number in `range(11)`, using the loop variable `i`.
  * Generate the indices for our batch, which we can find with `range(i*5, (i+1) * 5)`.
  * Get the team names in our batch using the indices.
  * Create a list called `threads` to store our threads.
  * For each team name:
    * Create a new thread that calls `task` and passes in a team name.
    * Start the thread.
    * Append the thread to `threads`.
  * Loop through each thread in `threads`, and:
    * Use `threading.Thread.join()` to wait for the thread to finish.
  * Print "Finished batch X", replacing X with the loop variable, `i`.
* Look at the output of your program, and compare it to both the order of the teams in the `teams` list and the output from the last screen. What do you notice?

In [16]:
import sys

lock = threading.Lock()

def task(team):
    lock.acquire()
    
    print(team)
    sys.stdout.flush()
    lock.release()

In [17]:
for i in range(11):
    
    teams_batch = []
    for i2 in range(i*5, (i+1)*5):
        try:
            teams_batch.append(teams[i2])
        except IndexError:
            # The last batch runs past the end of the list.
            pass
        
    threads = []
    for team in teams_batch:
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
        
    for thread in threads:
        thread.join()
        
    print('Finished batch {}'.format(i))
        

BSN
CHN
CN2
PT1
SL4
Finished batch 0
NY1
PHI
BR3
PIT
BRO
Finished batch 1
CIN
SLN
BLA
BOS
CHA
Finished batch 2CLE

DET
MLA
PHA
WS1
Finished batch 3
SLA
NYA
ML1
BAL
KC1
LANFinished batch 4

SFN
LAA
MIN
WS2
HOUFinished batch 5

NYN
CAL
ATL
OAK
KCAFinished batch 6

SE1
MON
SDN
ML4
TEXFinished batch 7

SEA
TOR
COL
FLO
Finished batch 8
ANA
TBA
ARI
MIL
WAS
Finished batch 9
MIA
Finished batch 10


## Thread Safety

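You should have noticed that our output was much more legible in the last screen, because only one thread was writing to the standard output at once. The few lines that still run together, such as `Finished batch 2CLE`, all involve the batch message, which the main thread prints without acquiring the lock or flushing the standard output.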

### Thread safety means that an operation can safely be performed by multiple threads at once.

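As we discovered, the standard output is not thread safe -- having multiple threads all outputting data at the same time causes the output to become garbled. An example of a thread safe operation is reading a shared value that no thread modifies -- every thread sees the same data, and nothing can conflict. Incrementing a shared counter, by contrast, is not thread safe: each increment reads the value, adds one, and writes it back, and two threads can interleave those steps and lose an update.<br>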

It's very important to be aware of which operations are thread safe and which aren't. In general, these operations are thread safe:
* Reading from a file.
* Querying a database.
* Querying an API.
* Accessing data in memory.

In general, these operations are not thread safe:
* Modifying data in memory.
* Writing to a file.
* Adding data to a database.
* Modifying data via API.

As you can see, reading or querying is generally thread safe, since you aren't changing anything. Changing data usually isn't thread safe, since your changes could conflict.<br>
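
To make the conflict concrete, consider several threads adding to a shared total. An update like `state["total"] += value` is a read, an addition, and a write, so two threads can read the same old value and one update is lost. The sketch below guards the update with a lock. The names `state`, `total_lock`, and `add_values`, and the numbers being added, are chosen purely for illustration.

```python
import threading

state = {"total": 0}  # shared data that every thread modifies
total_lock = threading.Lock()

def add_values(values):
    for value in values:
        # The update reads, adds, and writes back. Holding the lock keeps
        # another thread from changing the total between those steps.
        with total_lock:
            state["total"] += value

threads = [threading.Thread(target=add_values, args=(range(1, 101),)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(state["total"])
```

With the lock in place, the four threads each contribute the sum of 1 through 100, so the final total is 20200 regardless of how the threads are scheduled.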

In the case of the SQLite database we accessed earlier in this mission, querying is thread safe, but writing data is not. Since we're only querying the database, we can access it from each thread. There are a few caveats to this:
* We need to pass in the `check_same_thread=False` keyword argument when initializing the connection.
* We need to initialize a cursor in each thread, since multiple threads cannot use the same cursor.
* We have to be sure not to write to the database, or our changes could conflict and cause data integrity issues.


* Create a function that takes in a team, `team`, and prints the team's total number of home runs:
  * Create a new cursor using the `conn` object.
  * Query `lahman2015.sqlite`, and get the total number of home runs for the team:
    * Write a query that uses the SQLite [sum](https://www.sqlite.org/lang_aggfunc.html#sumunc) function.
  * Acquire a lock.
  * Print the team name.
  * Print the number of home runs.
  * Flush the standard output.
  * Release the lock.
* Create an empty list of threads called `threads`.
* Loop over `teams`, and for each team:
  * Create a new thread that calls the function and passes in the team.
  * Start the thread.
  * Append the thread to `threads`.
* Loop over the list of threads, and run the `join` method on each thread.

In [9]:
import sqlite3
import threading
import sys

In [10]:
query = """
    SELECT DISTINCT teamID
    FROM Teams
    INNER JOIN TeamsFranchises ON Teams.franchID = TeamsFranchises.franchID
    WHERE TeamsFranchises.active = 'Y';
"""
conn = sqlite3.connect("../data/lahman2015.sqlite", check_same_thread=False)
cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"
lock = threading.Lock()

In [13]:
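def calculate_runs(team):
    # Each thread creates its own cursor; cursors can't be shared between threads.
    cur = conn.cursor()
    # SUM(HR) returns a single row holding the team's total home runs.
    runs = cur.execute(query, [team]).fetchall()
    runs = runs[0][0]
    # Hold the lock only while printing, so the queries can overlap.
    lock.acquire()
    print(team, ':', runs)
    sys.stdout.flush()
    lock.release()
    return runs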

In [14]:
threads = []
for team in teams:
    thread = threading.Thread(target=calculate_runs, args=(team,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

BSN : 3424
PHA : 3502
BR3 : 143
PHI : 12503
DET : 13160
SLN : 11157
SL4 : 305
BRO : 4336
CLE : 12333
NYA : 15218
CN2 : 267
CHA : 10792
CHN : 13530
CIN : 12383
MLA : 26
SLA : 3014
PIT : 10878
NY1 : 5777
BLA : 57
ML1 : 2230
BOS : 12883
WS1 : 2786
PT1 : 54
KC1 : 1480
LAN : 7601
BAL : 9592
MIN : 7393
LAA : 2276
SFN : 8348
WS2 : 1387
HOU : 6536
NYN : 6817
CAL : 3912
ATL : 7535
OAK : 7438
MON : 4381
SEA : 5976
KCA : 5613
SE1 : 125
TEX : 7055
TOR : 6415
ANA : 1324
COL : 4120
SDN : 5648
ML4 : 3664
FLO : 2816
ARI : 2987
TBA : 2823
WAS : 2002
MIA : 474
MIL : 3160


## Returning Values From Threads

So far, we've learned quite a bit about threads and locking. Let's bring everything together, and do some data processing using threads. Before we can process data, we'll need to be able to return a value from a thread, which will allow us to use threads to transform data. A `threading.Thread` discards whatever its target function returns, so a good way to collect results is to:
* Create a dictionary.
* Give each thread a unique key, and have it store its result in the dictionary under that key.
* Call `join` on all the threads.

By the time the above process finishes, the dictionary will have all of the results stored under unique, per-thread keys. Let's use what we know about threads to figure out the greatest hitters, pitchers, and fielders of all time. To find the best pitchers, we'll use the [Fielding-Independent Pitching](http://www.baseballprospectus.com/glossary/index.php?search=FIP), or FIP, statistic, which is calculated like this:

$$((13*HR + 3*BB - 2*SO) / IPOuts) + 3.2$$

In the above formula:
* `HR` -- home runs given up.
* `BB` -- walks given up.
* `SO` -- strikeouts.
* `IPOuts` -- outs pitched (innings pitched * 3).

The lower the FIP for a pitcher is, the better. We can calculate the statistic using the `Pitching` table.<br>

To find the best batter, we'll use the [On-Base Plus Slugging](http://www.fangraphs.com/library/offense/ops/), or OPS, statistic, which is calculated like this:

$$((H + BB + HBP) / (AB + BB + HBP + SF)) + ((H + 2B + 2*3B + 3*HR) / AB)$$

In the above formula:

* `H` -- hits
* `BB` -- walks
* `HBP` -- hit by pitch
* `AB` -- at bats
* `SF` -- sacrifice flies
* `2B` -- doubles
* `3B` -- triples
* `HR` -- home runs

The higher the OPS statistic is, the better. We can use the `Batting` table to calculate this. Note that you'll need to use double quotes around any column names that start with a number in your query, for example `SELECT "2B" FROM Batting;`.<br>

To find the best fielder, we'll use the [Range Factor](https://en.wikipedia.org/wiki/Range_factor), or RF, statistic, which is calculated like this:

$$(A+PO)/G$$

In the above formula:
* `A` -- assists
* `PO` -- putouts
* `G` -- games played

The higher the RF statistic, the better. We can use the `Fielding` table to calculate this.<br>
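
As a quick check on the arithmetic, the three formulas above can be written as plain Python functions. This is only a sketch: the function and parameter names are chosen here (`doubles` and `triples` stand in for `2B` and `3B`, which aren't valid Python names), and the input numbers are made up rather than taken from the database.

```python
def fip(hr, bb, so, ipouts):
    # Fielding-Independent Pitching as defined above; lower is better.
    return (13 * hr + 3 * bb - 2 * so) / ipouts + 3.2

def ops(h, bb, hbp, ab, sf, doubles, triples, hr):
    # On-base percentage plus slugging percentage; higher is better.
    on_base = (h + bb + hbp) / (ab + bb + hbp + sf)
    slugging = (h + doubles + 2 * triples + 3 * hr) / ab
    return on_base + slugging

def range_factor(a, po, g):
    # Assists plus putouts per game played; higher is better.
    return (a + po) / g

# Made-up career totals, for illustration only.
print(fip(hr=10, bb=40, so=200, ipouts=600))
print(ops(h=150, bb=50, hbp=5, ab=500, sf=5, doubles=30, triples=5, hr=20))
print(range_factor(a=300, po=200, g=150))
```

Python's `/` always performs floating point division, so no cast is needed here. That differs from SQLite, where dividing one integer by another yields an integer.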

Note that all of the abbreviations used in the formulas are column names in their respective tables, making it easy to translate the formulas to SQL. To compute any of these statistics, we'll need to add up all the statistics for a player over their career, then compute the formula and get a single number. We can use a SQL query for this.<br>
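
Adding up a player's statistics over their career means wrapping each column in `SUM()` and grouping by `playerID`. The sketch below shows the idea for Range Factor on a small in-memory table rather than the real database. The `demo` connection, the `player_a`-style IDs, and all of the numbers are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the Fielding table: one row per player per season.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Fielding (playerID TEXT, A INTEGER, PO INTEGER, G INTEGER)")
demo.executemany(
    "INSERT INTO Fielding VALUES (?, ?, ?, ?)",
    [
        ("player_a", 100, 200, 60),
        ("player_a", 120, 180, 60),
        ("player_b", 50, 100, 80),
        ("player_b", 60, 90, 70),
        ("player_c", 10, 20, 30),  # only 30 career games, filtered out below
    ],
)

career_rf_query = """
    SELECT playerID,
           (CAST(SUM(A) AS FLOAT) + SUM(PO)) / SUM(G) AS RF
    FROM Fielding
    GROUP BY playerID
    HAVING SUM(G) > 100
    ORDER BY RF DESC
    LIMIT 20;
"""
rows = demo.execute(career_rf_query).fetchall()
print(rows)
```

Because the filter is written as `HAVING SUM(G) > 100`, it applies to career games rather than to a single season's row.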

In order to speed up our database access, we'll create three threads, one each to calculate the best batter, pitcher, and fielder. Each thread should return the player ids of the top 20 players by assigning the list to either the `batter`, `pitcher`, or `fielder` key in the dictionary `best`.

* Write a function to find the best batter.
  * Ensure that `AB` is greater than `100` using a `HAVING` clause.
  * Cast at least one column to float using `CAST(H AS FLOAT)`.
* Write a function to find the best pitcher.
  * Ensure that `IPOuts` is greater than `100` using a `HAVING` clause.
  * Cast at least one column to float using `CAST(HR AS FLOAT)`.
* Write a function to find the best fielder.
  * Ensure that `G` is greater than `100` using a `HAVING` clause.
  * Cast at least one column to float using `CAST(A AS FLOAT)`.
* In each function, print a message when it finishes executing. Acquire a lock before printing, and release after.
* Execute each function in a separate thread.
* Run the `join` method on each thread to wait for it to finish.
* Print out the dictionary `best`.
  * It should have the keys `batter`, `pitcher`, and `fielder`.
  * Each key should be associated with a value that contains the ids of the top `20` players.

In [15]:
conn = sqlite3.connect("../data/lahman2015.sqlite", 
                       check_same_thread=False)
best = {}

In [69]:

def best_batter():
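    cur = conn.cursor()
    # Rank players by on-base plus slugging. The columns are not wrapped in
    # SUM(), so SQLite evaluates them on one arbitrary row per player.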
    query = """
            SELECT 
                ((CAST(H AS FLOAT) + BB + HBP) / (AB + BB + HBP + SF)) + ((H + "2B" + 2*"3B" + 3*HR) / AB) as OBP,  
                playerID
            FROM Batting
            GROUP BY Batting.playerID
            HAVING AB > 100
            ORDER BY OBP desc
            LIMIT 20;
    """
    
    top20_batter_ids = [row[1] for row in cur.execute(query).fetchall()]
    best['batter'] = top20_batter_ids
    
    lock.acquire()
    print('Execution finished - best batters')
    print('')
    lock.release()

    
def best_pitcher():
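    cur = conn.cursor()
    # Rank players by FIP, lowest first. The columns are not wrapped in
    # SUM(), so SQLite evaluates them on one arbitrary row per player.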
    query = """
            SELECT
                ((13*CAST(HR AS FLOAT) + 3*BB - 2*SO) / IPOuts) + 3.2 as FIP,
                playerID
            FROM Pitching
            GROUP BY Pitching.playerID
            HAVING IPOuts > 100
            ORDER BY FIP asc
            LIMIT 20;
    """
    
    top20_pitcher_ids = [row[1] for row in cur.execute(query).fetchall()]
    best['pitcher'] = top20_pitcher_ids
    
    lock.acquire()
    print('Execution finished - best pitchers')
    print('')
    lock.release()
    
def best_fielder():
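    cur = conn.cursor()
    # Rank players by Range Factor. The columns are not wrapped in
    # SUM(), so SQLite evaluates them on one arbitrary row per player.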
    query = """
            SELECT
                (CAST(A AS FLOAT) + PO) / G as RF,
                playerID
            FROM Fielding
            GROUP BY Fielding.playerID
            HAVING G > 100
            ORDER BY RF desc
            LIMIT 20;
    """
    
    top20_fielder_ids = [row[1] for row in cur.execute(query).fetchall()]
    best['fielder'] = top20_fielder_ids
    
    lock.acquire()
    print('Execution finished - best fielders')
    print('')
    lock.release()

In [71]:
funcs = [best_pitcher, best_batter, best_fielder]
lock = threading.Lock()

threads = []
for func in funcs:
    
    thread = threading.Thread(target=func)
    thread.start()
    threads.append(thread)
    
for thread in threads:
    thread.join()
    
print('#'*30)
print(best)

Execution finished - best fielders

Execution finished - best batters
Execution finished - best pitchers

##############################
{'fielder': ['chaseha01', 'phillbi01', 'daubeja01', 'stovage01', 'lesliro01', 'sweenbi03', 'werdepe01', 'jonesto01', 'ganzejo01', 'sharpbu01', 'siebedi01', 'unglabo01', 'dillopo01', 'lehanmi01', 'bossha01', 'nealoji01', 'gandich01', 'farrasi01', 'tucketo01', 'sheelea01'], 'batter': ['bondsba01', 'dietzdi01', 'harpebr03', 'vottojo01', 'willite01', 'fainfe01', 'cabremi01', 'goldspa01', 'clarkwi02', 'ashburi01', 'longmto01', 'morriha02', 'yosted01', 'downibr01', 'troutmi01', 'mccutan01', 'nilssda01', 'krukjo01', 'leede02', 'youngdm01'], 'pitcher': ['crainje01', 'chapmar01', 'allenco01', 'smithca02', 'romose01', 'millean01', 'kershcl01', 'brittza01', 'wagnebi02', 'janseke01', 'blantjo01', 'hendrli01', 'gileske01', 'nenro01', 'grillja01', 'fieldjo03', 'blackbo01', 'fernajo02', 'dorrbe01', 'gagusch01']}



## Next Steps

You should now know how to use threads and memory to speed up I/O bound tasks. If you want to read more, here are some good resources:
* [threading Documentation](https://docs.python.org/3/library/threading.html)
* [Threading On Wikipedia](https://en.wikipedia.org/wiki/Thread_(computing))

You may have noticed in this mission that threads didn't quite speed things up as much as we'd expect. Intuitively, you might expect two threads to run twice as fast as one thread, but that wasn't always the case. In the next mission, we'll cover some of the downsides of threads, and alternatives that can speed up your code more.