# Problem 2: "But her emails..."

In this problem, you'll show your SQL and Pandas chops on a dataset of Hillary Rodham Clinton's emails!

This problem has four (4) exercises (0-3) and is worth a total of ten (10) points.

## Setup

Start by downloading an SQLite database containing about 7,900 of HRC's messages:

https://t-square.gatech.edu/access/content/group/gtc-3bd6-e221-5b9f-b047-31c7564358b7/hrc.db

> **Do not share this file outside of this class!** We downloaded this database from Kaggle and have posted it on T-Square for your convenience. If anyone outside this class is interested in getting a copy of this database, please point them directly to the Kaggle site: https://www.kaggle.com/kaggle/hillary-clinton-emails

Next, let's run some setup code, which will load the modules you'll need for this problem:

In [1]:
import sqlite3 as db

from IPython.display import display
import pandas as pd
import numpy as np

def peek_table (conn, name, num=5):
    """
    Given a database connection (`conn`), prints both the number of
    records in the table as well as its first few entries.
    """
    count = '''select count (*) FROM {table}'''.format (table=name)
    peek = '''select * from {table} limit {limit}'''.format (table=name, limit=num)

    # The count query returns one row and one column, so read it by position.
    print ("Total number of records:", pd.read_sql_query (count, conn).iloc[0, 0], "\n")

    print ("First {} entries:".format (num))
    display (pd.read_sql_query (peek, conn))

def list_tables (conn):
    """Return the names of all visible tables, given a database connection."""
    query = """select name from sqlite_master where type = 'table';"""
    c = conn.cursor ()
    c.execute (query)
    table_names = [t[0] for t in c.fetchall ()]
    return table_names

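def tbc (X):
    """Return a canonical copy of `X`: columns sorted, rows sorted, index reset."""
    var_names = sorted (X.columns)
    Y = X[var_names].copy ()
    Y.sort_values (by=var_names, inplace=True)
    Y.reset_index (drop=True, inplace=True)
    return Y

def tbeq(A, B):
    """Return True if `A` and `B` are equal up to row and column order."""
    A_c = tbc(A)
    B_c = tbc(B)
    return A_c.eq(B_c).all().all()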

In [7]:
conn = db.connect ('hrc.db')

print ("List of tables in the database:", list_tables (conn))

List of tables in the database: []


In [8]:
peek_table (conn, 'Emails')
peek_table (conn, 'EmailReceivers', num=3)
peek_table (conn, 'Persons')

DatabaseError: Execution failed on sql 'select count (*) FROM Emails': no such table: Emails
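
Note that `sqlite3.connect` silently creates a new, empty database when the path it is given does not exist. An empty table list followed by a `no such table` error, as in the output above, is therefore consistent with `hrc.db` not being in the notebook's working directory. A minimal sketch of a guard follows; the helper name `connect_existing` and the demo file name are hypothetical and not part of the original setup code.

```python
import os
import sqlite3 as db

def connect_existing (path):
    """Connect to an SQLite file only if it already exists (hypothetical helper)."""
    if not os.path.isfile (path):
        raise FileNotFoundError ("Database file not found: {}".format (path))
    return db.connect (path)

# Demonstration with a file name that is assumed not to exist:
try:
    connect_existing ('no_such_file_for_demo.db')
    found = True
except FileNotFoundError as err:
    found = False
    print (err)
```

If the guard raises, download `hrc.db` into the same directory as the notebook and rerun the cells above.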

**Exercise 0** (1 point). Extract the `Persons` table from the database and store it as a Pandas data frame with two columns: `Id` and `Name`.
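
As a reminder of the general pattern, `pd.read_sql_query` runs a query against an open connection and returns the result as a data frame. The sketch below uses a throwaway in-memory database; the table `Pets` and its contents are made up for illustration and are unrelated to `hrc.db`.

```python
import sqlite3 as db
import pandas as pd

# Build a tiny in-memory database (hypothetical toy data).
toy = db.connect (':memory:')
toy.execute ("create table Pets (Id integer, Name text)")
toy.executemany ("insert into Pets values (?, ?)", [(1, 'Rex'), (2, 'Tom')])

# Pull the selected columns of the table into a data frame.
Pets = pd.read_sql_query ("select Id, Name from Pets", toy)
toy.close ()
print (Pets)
```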

In [None]:
#
# YOUR CODE HERE
#


In [None]:
assert 'Persons' in globals ()
assert type (Persons) is type (pd.DataFrame ())
assert len (Persons) == 513

print ("Five random people from the `Persons` table:")
display (Persons.iloc[np.random.choice (len (Persons), 5, replace=False)])

print ("\n(Passed!)")

**Exercise 1** (3 points). Query the database to determine how frequently particular pairs of people communicate. Store the results in a Pandas data frame named `CommEdges` having the following three columns:

- `Sender`: The ID of the sender (taken from the `Emails` table).
- `Receiver`: The ID of the receiver (taken from the `EmailReceivers` table).
- `Frequency`: The number of times this particular (`Sender`, `Receiver`) pair occurs.

Order the results in _descending_ order of `Frequency`.

There is one corner case that you should also handle: sometimes the `Sender` field is empty (unknown). You can filter these cases by checking that the sender ID is not the empty string.
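
The general shape of such a query (join two tables, drop rows with an empty sender, count each pair, sort by the count) can be sketched on toy data. The tables `Messages` and `Recipients` and their column names below are hypothetical stand-ins chosen for illustration; inspect the real tables with `peek_table` to find the actual column names.

```python
import sqlite3 as db
import pandas as pd

toy = db.connect (':memory:')
# Columns are declared without types so that '' is stored as-is (toy schema).
toy.execute ("create table Messages (Id, SenderId)")
toy.execute ("create table Recipients (MessageId, PersonId)")
toy.executemany ("insert into Messages values (?, ?)",
                 [(1, 10), (2, 10), (3, ''), (4, 20)])
toy.executemany ("insert into Recipients values (?, ?)",
                 [(1, 20), (2, 20), (2, 30), (3, 20), (4, 10)])

query = '''
    select m.SenderId as Sender, r.PersonId as Receiver, count (*) as Frequency
      from Messages m, Recipients r
     where m.Id = r.MessageId and m.SenderId <> ''
     group by m.SenderId, r.PersonId
     order by Frequency desc
'''
Edges = pd.read_sql_query (query, toy)
toy.close ()
print (Edges)
```

Message 3 has an empty sender, so it contributes nothing to the counts; the pair (10, 20) occurs twice and therefore sorts first.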

In [None]:
#
# YOUR CODE HERE
#


In [None]:
# Read what we believe is the exact result (up to permutations)
CommEdges_soln = pd.read_csv ('CommEdges_soln.csv')

# Check that we got a data frame of the expected shape:
assert 'CommEdges' in globals ()
assert type (CommEdges) is type (pd.DataFrame ())
assert len (CommEdges) == len (CommEdges_soln)
assert set (CommEdges.columns) == set (['Sender', 'Receiver', 'Frequency'])

# Check that the results are sorted (compare raw values to avoid index alignment):
freq_values = CommEdges['Frequency'].values
non_increasing = (freq_values[:-1] >= freq_values[1:])
assert non_increasing.all ()

print ("Top 5 communicating pairs:")
display (CommEdges.head ())

assert tbeq (CommEdges, CommEdges_soln)
print ("\n(Passed!)")

**Exercise 2** (3 points). Consider any pair of people, $a$ and $b$. Suppose we don't care whether person $a$ sends and person $b$ receives or whether person $b$ sends and person $a$ receives. Rather, we only care that $\{a, b\}$ have exchanged messages.

That is, the previous exercise computed a _directed_ graph, $G = \left(g_{a,b}\right)$, where $g_{a,b}$ is the number of times (or "frequency") that person $a$ was the sender and person $b$ was the receiver. Instead, suppose we wish to compute its _symmetrized_ or _undirected_ version, $H = G + G^T$, whose entries are $h_{a,b} = g_{a,b} + g_{b,a}$.

Write some code that computes $H$ and stores it in a Pandas data frame named `CommPairs` with the columns `A`, `B`, and `Frequency`. Per the definition of $H$, the `Frequency` column should combine frequencies from $G$ and $G^T$ accordingly.
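
One way to read $H = G + G^T$ on an edge list is sketched below with made-up data: stack the edges together with a copy whose endpoints are swapped, then sum the frequencies of each (`A`, `B`) pair. This is an illustration of the definition under that reading, not necessarily the approach behind the reference solution.

```python
import pandas as pd

# Toy directed edge list G (hypothetical): g[1,2] = 3, g[2,1] = 1, g[1,3] = 5
G = pd.DataFrame ({'Sender': [1, 2, 1],
                   'Receiver': [2, 1, 3],
                   'Frequency': [3, 1, 5]})

# G keeps its orientation; G^T swaps the two endpoints.
G_fwd = pd.DataFrame ({'A': G['Sender'], 'B': G['Receiver'], 'Frequency': G['Frequency']})
G_rev = pd.DataFrame ({'A': G['Receiver'], 'B': G['Sender'], 'Frequency': G['Frequency']})

# H = G + G^T: stack both edge lists and add up the frequencies per (A, B).
H = pd.concat ([G_fwd, G_rev], ignore_index=True)
H = H.groupby (['A', 'B'], as_index=False)['Frequency'].sum ()
print (H)
```

Here $h_{1,2} = g_{1,2} + g_{2,1} = 3 + 1 = 4$, and because $H$ is symmetric the same value also appears for the pair $(2, 1)$.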

In [None]:
#
# YOUR CODE HERE
#


In [None]:
CommPairs_soln = pd.read_csv ('CommPairs_soln.csv')

assert 'CommPairs' in globals ()
assert type (CommPairs) is type (pd.DataFrame ())
assert len (CommPairs) == len (CommPairs_soln)

print ("Most frequently communicating pairs:")
display (CommPairs.sort_values (by='Frequency', ascending=False).head (10))

assert tbeq (CommPairs, CommPairs_soln)
print ("\n(Passed!)")

**Exercise 3** (3 points). Starting with a copy of `CommPairs`, named `CommPairsNamed`, add two additional columns that contain the names of the communicators. Place these values in columns named `A_name` and `B_name` in `CommPairsNamed`.
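
Attaching names to two ID columns amounts to two lookups against the same table, one per endpoint. A minimal sketch using `DataFrame.merge` on made-up data follows; the frames `People` and `Pairs` and their contents are hypothetical.

```python
import pandas as pd

People = pd.DataFrame ({'Id': [1, 2, 3], 'Name': ['Ann', 'Bob', 'Cy']})
Pairs = pd.DataFrame ({'A': [1, 2], 'B': [2, 3], 'Frequency': [7, 4]})

Named = Pairs.copy ()
# One left merge per endpoint; renaming the lookup columns first yields
# the desired output column names directly.
Named = Named.merge (People.rename (columns={'Id': 'A', 'Name': 'A_name'}), on='A', how='left')
Named = Named.merge (People.rename (columns={'Id': 'B', 'Name': 'B_name'}), on='B', how='left')
print (Named)
```

A left merge keeps every row of `Pairs`, even if an ID has no match in the lookup table, in which case the name would come back as missing.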

In [None]:
CommPairsNamed = CommPairs.copy ()

#
# YOUR CODE HERE
#


In [None]:
CommPairsNamed_soln = pd.read_csv ('CommPairsNamed_soln.csv')

assert 'CommPairsNamed' in globals ()
assert type (CommPairsNamed) is type (pd.DataFrame ())
assert set (CommPairsNamed.columns) == set (['A', 'A_name', 'B', 'B_name', 'Frequency'])

print ("Top few entries:")
CommPairsNamed.sort_values (by=['Frequency', 'A', 'B'], ascending=False, inplace=True)
display (CommPairsNamed.head (10))

assert tbeq (CommPairsNamed, CommPairsNamed_soln)
print ("\n(Passed!)")

When you are all done, it's good practice to close the database. The following will do that for you.

In [None]:
conn.close ()

**Fin!** If you've reached this point and all tests above pass, you are ready to submit your solution to this problem. Don't forget to save your work prior to submitting.