# Project Notebook

The purpose of this notebook is to act as an index to the other notebooks used in this project. Brief descriptions and examples are presented here, but please view the following links for complete code and outputs.

* <a href="IS620FinalProjectProposal.pdf" target="_blank">Project Proposal</a>: Initial outline and planning
* <a href="wiki_scrape.ipynb" target="_blank">Wiki Scrape</a>: Scrape voting history from http://survivor.wikia.com/wiki/Main_Page. Output saved in wiki_scrape.p
* <a href="process_votes.ipynb" target="_blank">Process Votes</a>: Extract features from voting history. Output saved in process_votes.p
* <a href="make_graphs.ipynb" target="_blank">Make Graphs</a>: Create graph objects from seasons. Output saved in make_graphs.p
* <a href="network.ipynb" target="_blank">Network</a>: Explore network relationships  
* <a href="episode_scores.ipynb" target="_blank">Episode Scores</a>: Centrality scores by episode
* <a href="naive_bayes.ipynb" target="_blank">Naive Bayes</a>: Predictions at given intervals of the season

## Work Flow

Re-run notebooks as needed

In [2]:
# Need to force reloading of modules before execution ...
%load_ext autoreload
%autoreload 2

# ... except pickle, which is imported normally and excluded from autoreload:
import pickle
%aimport -pickle

In [3]:
# Force everything to be run ...
run_all_override = True
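
The cells in the sections that follow all repeat one pattern: load a pickled result from disk unless `run_all_override` is set or the file is missing, and otherwise recompute it. A minimal sketch of that pattern as a reusable helper is shown here. The names `load_or_compute`, `path`, `compute`, and `force` are hypothetical and not part of the project modules, and unlike the project functions (which take `save_to_disk=True`), this sketch writes the pickle itself.

```python
import os
import pickle


def load_or_compute(path, compute, force=False):
    """Return the object pickled at `path`, or call `compute()` and cache it.

    Hypothetical helper: `force` plays the role of `run_all_override`.
    """
    if not force and os.path.exists(path):
        # Cached result is available and a re-run was not requested.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Recompute, then cache the result for the next run.
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Under these assumptions, a call such as `load_or_compute("wiki_scrape.p", lambda: wiki_scrape.scrape(url), force=run_all_override)` would stand in for one load-or-recompute cell.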

## Wiki Scrape
<a href="wiki_scrape.ipynb" target="_blank">Wiki Scrape</a>: Scrape voting history from http://survivor.wikia.com/wiki/Main_Page.  
  
This notebook provides the raw data needed to build subsequent graphs and network analyses.

In [4]:
use_wiki_scrape_from_disk = True and not run_all_override

if use_wiki_scrape_from_disk:
    try:
        print "Loading seasons from disk."
        seasons = pickle.load( open( "wiki_scrape.p", "rb" ) )
    except IOError:
        print "Error loading from disk."
        use_wiki_scrape_from_disk = False

if not use_wiki_scrape_from_disk:
    import wiki_scrape
    url = "http://survivor.wikia.com/wiki/Main_Page"
    print "Scraping " + url
    seasons = wiki_scrape.scrape(url, save_to_disk=True)

Loading seasons from disk.


In [6]:
# Example
seasons['Tocantins']['votes']

,Carolina,Candace,Jerry,Sandy,Spencer,Sydney,Joe,Brendan,Tyson,Sierra,Debbie,Coach,Taj,Erinn,Stephen,J.T.
J.T.,Carolina,—,—,Sandy,Spencer,Sydney,—,Sierra,Tyson,Sierra,Debbie,Erinn,Taj,Erinn,Jury Vote,Jury Vote
Stephen,Carolina,—,—,Sandy,Spencer,Sydney,—,Brendan,Tyson,Sierra,Debbie,Coach,Taj,—,Jury Vote,Jury Vote
Erinn,—,Candace,Jerry,—,—,—,—,Sierra,Tyson,Stephen,Debbie,Coach,Taj,—,,J.T.
Taj,Carolina,—,—,Joe,Spencer,Sydney,—,Brendan,Tyson,Debbie,Debbie,Coach,Erinn,,,J.T.
Coach,—,Candace,Jerry,—,—,—,—,Brendan,Sierra,Sierra,Taj,Erinn,,,,J.T.
Debbie,—,Candace,Jerry,—,—,—,—,Brendan,Sierra,Sierra,Coach,,,,,J.T.
Sierra,—,Candace,Jerry,—,—,—,—,Coach,Tyson,Debbie,,,,,,J.T.
Tyson,—,Candace,Jerry,—,—,—,—,Sierra,Sierra,,,,,,,J.T.
Brendan,—,Candace,Jerry,—,—,—,—,Coach,,,,,,,,J.T.
Joe,Carolina,—,—,Sandy,Spencer,Taj,Evacuated,,,,,,,,,


## Process Votes
<a href="process_votes.ipynb" target="_blank">Process Votes</a>: Extract features from voting history. Output saved in process_votes.p  
  
This notebook converts the voting history into a weighted adjacency matrix based on the number of similar votes. For example, if two players voted for the same person in the same episode, they will have a weight of 1. If they voted similarly in 5 episodes, their weight will be 5. 
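
The counting rule can be sketched as follows. This is an illustration only, not the code in `process_votes`: the function name `count_shared_votes` and the per-episode dictionary layout are assumptions, and the toy data is a small excerpt modeled on the Tocantins table above.

```python
from collections import Counter
from itertools import combinations

# Toy voting history: one dict per episode, mapping voter -> player voted for.
episodes = [
    {"J.T.": "Carolina", "Stephen": "Carolina", "Taj": "Carolina", "Joe": "Carolina"},
    {"J.T.": "Sandy", "Stephen": "Sandy", "Taj": "Joe", "Joe": "Sandy"},
]


def count_shared_votes(episodes):
    """Count, for each pair of voters, the episodes in which both cast the same vote."""
    weights = Counter()
    for votes in episodes:
        # Sorting the names gives every pair one canonical (a, b) key.
        for a, b in combinations(sorted(votes), 2):
            if votes[a] == votes[b]:
                weights[(a, b)] += 1
    return weights


weights = count_shared_votes(episodes)
```

Here `weights[("J.T.", "Stephen")]` is 2 because the pair agreed in both episodes, while `weights[("J.T.", "Taj")]` is 1 because they agreed only in the first.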

In [7]:
use_process_votes_from_disk = True and not run_all_override

if use_process_votes_from_disk:
    try:
        print "Loading voteweights from disk."
        voteweights = pickle.load( open( "process_votes.p", "rb" ) )
    except IOError:
        print "Error loading from disk."
        use_process_votes_from_disk = False
        
if not use_process_votes_from_disk:
    import process_votes
    print "Processing votes ..."
    voteweights = process_votes.get_voteweights(seasons, save_to_disk=True)

Loading voteweights from disk.


In [8]:
# Example
voteweights['Tocantins']

,J.T.,Stephen,Erinn,Taj,Coach,Debbie,Sierra,Tyson,Brendan,Joe,Sydney,Spencer,Sandy,Jerry,Candace,Carolina
J.T.,,8.0,4.0,5.0,2.0,1.0,1.0,1.0,0.0,3.0,3.0,2.0,1.0,0.0,0.0,0.0
Stephen,,,4.0,7.0,2.0,2.0,1.0,0.0,0.0,3.0,3.0,2.0,1.0,0.0,0.0,0.0
Erinn,,,,3.0,2.0,2.0,3.0,3.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Taj,,,,,1.0,1.0,2.0,0.0,0.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0
Coach,,,,,,5.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Debbie,,,,,,,2.0,3.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Sierra,,,,,,,,2.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Tyson,,,,,,,,,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Brendan,,,,,,,,,,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Joe,,,,,,,,,,,4.0,2.0,1.0,0.0,0.0,0.0


## Make Graphs
<a href="make_graphs.ipynb" target="_blank">Make Graphs</a>: Create graph objects from seasons. Output saved in make_graphs.p  
  
This notebook converts the adjacency matrix into a NetworkX graph object with weighted edges.
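
A dependency-free sketch of the conversion step is shown here. It turns an upper-triangular weight matrix into `(u, v, {'weight': w})` tuples, the form that NetworkX's `Graph.add_edges_from` accepts. The function name `matrix_to_edges` and the list-of-lists layout are assumptions, as is the choice to drop zero-weight pairs; the real logic lives in `make_graphs`.

```python
def matrix_to_edges(names, matrix):
    """Convert an upper-triangular weight matrix into weighted edge tuples.

    `matrix[i][j]` (for i < j) holds the number of shared votes between
    names[i] and names[j]; None marks an empty cell.
    """
    edges = []
    for i, u in enumerate(names):
        for j in range(i + 1, len(names)):
            w = matrix[i][j]
            if w:  # skip missing (None) and zero-weight pairs
                edges.append((u, names[j], {"weight": int(w)}))
    return edges


# Values copied from the top-left corner of the Tocantins matrix above.
names = ["J.T.", "Stephen", "Erinn", "Taj"]
matrix = [
    [None, 8.0, 4.0, 5.0],
    [None, None, 4.0, 7.0],
    [None, None, None, 3.0],
    [None, None, None, None],
]
edges = matrix_to_edges(names, matrix)
# With NetworkX installed: g = networkx.Graph(); g.add_edges_from(edges)
```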

In [9]:
use_make_graphs_from_disk = True and not run_all_override

if use_make_graphs_from_disk:
    try:
        print "Loading graphs from disk"
        graphs = pickle.load( open( "make_graphs.p", "rb" ) )
    except IOError:
        print "Error loading from disk"
        use_make_graphs_from_disk = False

if not use_make_graphs_from_disk:
    import make_graphs
    print "Making graphs"
    graphs = make_graphs.make_all_graphs(voteweights)

Loading graphs from disk


In [10]:
# Example
g = graphs['Tocantins'].edges(data=True)
g[:12]

[(u'Coach', u'J.T.', {'weight': 2}),
 (u'Coach', u'Brendan', {'weight': 2}),
 (u'Coach', u'Jerry', {'weight': 1}),
 (u'Coach', u'Stephen', {'weight': 2}),
 (u'Coach', u'Debbie', {'weight': 5}),
 (u'Coach', u'Erinn', {'weight': 2}),
 (u'Coach', u'Sierra', {'weight': 2}),
 (u'Coach', u'Tyson', {'weight': 3}),
 (u'Coach', u'Taj', {'weight': 1}),
 (u'Taj', u'J.T.', {'weight': 5}),
 (u'Taj', u'Stephen', {'weight': 7}),
 (u'Taj', u'Debbie', {'weight': 1})]

## Analyze Network
<a href="network.ipynb" target="_blank">Network</a>: Explore network relationships  
  
This notebook looks at network distance and centrality trends within and across seasons. An analysis of winners' centrality scores is also performed.
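
To make two of the centrality measures concrete, here is a minimal pure-Python sketch of degree and closeness on an unweighted, undirected graph. It is not the code in `network`, which presumably relies on NetworkX; the function name `degree_and_closeness` is hypothetical, and the closeness definition used (reachable nodes divided by the sum of hop distances) is one common variant that may differ from the notebook's.

```python
from collections import deque


def degree_and_closeness(edges):
    """Return {node: (degree, closeness)} for an undirected, unweighted graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for source in neighbors:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total = sum(dist.values())
        closeness = (len(dist) - 1) / float(total) if total else 0.0
        scores[source] = (len(neighbors[source]), closeness)
    return scores


# Hypothetical toy alliance graph: a path A - B - C.
scores = degree_and_closeness([("A", "B"), ("B", "C")])
```

In this toy graph the middle node B has the highest degree (2) and closeness (1.0), which is the kind of structural advantage the analysis looks for in winners.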

In [11]:
use_network_from_disk = True and not run_all_override

if use_network_from_disk:
    try:
        print "Loading graph stats from disk"
        central = pickle.load( open( "network.p", "rb") )
    except IOError:
        print "Error loading from disk"
        use_network_from_disk = False
    
if not use_network_from_disk:
    import network
    print "Calculating graph statistics"
    central = network.get_all_centrality_scores(
        voteweights, graphs, save_to_disk=True
    )

Loading graph stats from disk


In [12]:
# Example of centrality scores
central['Tocantins'].sort_values('place')

,name,deg,close,btw,eig,page,place
6,J.T.,11,0.751,0.122,0.449,0.113,1
1,Stephen,10,0.704,0.074,0.478,0.12,2
8,Erinn,9,0.663,0.035,0.327,0.091,3
14,Taj,10,0.704,0.074,0.39,0.093,4
12,Coach,9,0.663,0.035,0.236,0.078,5
2,Debbie,9,0.663,0.035,0.219,0.075,6
4,Sierra,9,0.663,0.035,0.197,0.068,7
9,Tyson,7,0.593,0.01,0.159,0.062,8
10,Brendan,6,0.469,0.0,0.116,0.052,9
7,Joe,6,0.512,0.0,0.221,0.061,10


## Episode Scores
<a href="episode_scores.ipynb" target="_blank">Episode scores</a>: Centrality scores by episode.
  
This notebook assesses centrality scores as an individual season progresses. This allows us to measure centrality after each episode and analyze how the network develops over the course of the season.  
   
Note: because each season has a different number of episodes, we normalize this disparity by comparing 'percent of episodes completed' and converting that back to a relative episode number.
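
A small sketch of this normalization: each checkpoint is a fraction of the season, mapped back to an episode number for a season of a given length. The function name `relative_episode` is hypothetical, rounding up is an assumption (the notebook may round differently), and the season lengths below are made up for illustration.

```python
import math


def relative_episode(fraction, n_episodes):
    """Map a fraction of the season completed to a 1-based episode number."""
    # Rounding up is an assumption about how the mapping is done.
    return max(1, int(math.ceil(fraction * n_episodes)))


n = 8
time_line_prct = [(i + 1) / float(n) for i in range(n)]  # 0.125, 0.25, ..., 1.0

# Hypothetical season lengths, for illustration only.
checkpoints_13 = [relative_episode(p, 13) for p in time_line_prct]
checkpoints_16 = [relative_episode(p, 16) for p in time_line_prct]
```

Both lists have eight entries, so seasons of different lengths can be compared checkpoint by checkpoint.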
  
We also add two new parameters to each contestant - the number of times they voted for the eliminated player (votes correct) and the number of times they were voted against (votes against). These parameters and functions combined serve as the starting point for our predictive analysis.  
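
The two vote-based parameters can be tallied as in the sketch below. The function name `vote_tallies`, the `(votes, eliminated)` input layout, and the toy episodes are all assumptions made for illustration; the actual computation is in `episode_scores`.

```python
from collections import Counter


def vote_tallies(episodes):
    """Tally votes_correct and votes_against per player.

    `episodes` is a list of (votes, eliminated) pairs, where `votes` maps
    voter -> target and `eliminated` is the player voted out.
    """
    votes_correct = Counter()
    votes_against = Counter()
    for votes, eliminated in episodes:
        for voter, target in votes.items():
            votes_against[target] += 1
            if target == eliminated:
                votes_correct[voter] += 1
    return votes_correct, votes_against


# Hypothetical toy episodes, for illustration only.
episodes = [
    ({"A": "D", "B": "D", "C": "D", "D": "A"}, "D"),
    ({"A": "C", "B": "C", "C": "A"}, "C"),
]
votes_correct, votes_against = vote_tallies(episodes)
```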

In [13]:
import numpy as np

use_episode_scores_from_disk = True and not run_all_override

if use_episode_scores_from_disk:
    try:
        print "Loading episode scores from disk"
        seasons = pickle.load( open( "episode_scores.p", "rb") )
    except IOError:
        print "Error loading from disk"
        use_episode_scores_from_disk = False

if not use_episode_scores_from_disk:
    import episode_scores
    print "Calculating episode scores"
    n = 8
    time_line_prct = [i/n for i in np.arange(n) + 1.]
    episode_scores.process_all_seasons(seasons, time_line_prct)

Loading episode scores from disk


In [14]:
# Example after 50% of the season
seasons['Tocantins']['features']['scores'][0.5]

,name,deg,close,btw,eig,page,votes_correct,votes_against,place
1,Stephen,6,0.4,0,0.447,0.085,4,0,0
6,J.T.,6,0.4,0,0.447,0.085,4,0,1
2,Debbie,6,0.4,0,-0.0,0.074,2,0,0
4,Sierra,6,0.4,0,-0.0,0.074,2,1,0
8,Erinn,6,0.4,0,-0.0,0.074,2,1,0
9,Tyson,6,0.4,0,-0.0,0.074,2,0,0
10,Brendan,6,0.4,0,-0.0,0.074,2,0,0
12,Coach,6,0.4,0,-0.0,0.074,2,0,0
14,Taj,6,0.4,0,0.352,0.066,3,3,0


## Naive Bayes Model
<a href="naive_bayes.ipynb" target="_blank">Naive Bayes</a>: Predictions at given intervals of the season  
  
This notebook trains a Naive Bayes model based on the cumulative episode scores at 8 equal intervals of the season. This model is then used to predict the most likely winner at each of these intervals.
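
For readers unfamiliar with the technique, here is a minimal Gaussian Naive Bayes sketch in pure Python. It is not the implementation in `naive_bayes`, whose internals are not shown here. The function names, the choice of a Gaussian likelihood, the two features, and the training rows are all assumptions made for illustration.

```python
import math


def fit_gaussian_nb(rows, labels):
    """Fit per-class priors and per-feature Gaussian parameters."""
    model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = float(len(members))
        means = [sum(col) / n for col in zip(*members)]
        # A small floor on the variance avoids division by zero.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-9)
            for col, m in zip(zip(*members), means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model


def log_posterior(model, row, cls):
    """Unnormalized log P(cls | row) under the feature-independence assumption."""
    prior, means, variances = model[cls]
    total = math.log(prior)
    for x, m, v in zip(row, means, variances):
        total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return total


def predict_winner(model, contestants):
    """Pick the contestant whose features most favor the winner class (1)."""
    def score(name):
        row = contestants[name]
        return log_posterior(model, row, 1) - log_posterior(model, row, 0)
    return max(contestants, key=score)


# Hypothetical training rows: (votes_correct, votes_against); label 1 = winner.
rows = [(4, 0), (5, 1), (4, 1), (2, 3), (1, 4), (2, 2), (3, 3)]
labels = [1, 1, 1, 0, 0, 0, 0]
model = fit_gaussian_nb(rows, labels)
predicted = predict_winner(model, {"X": (4, 0), "Y": (2, 3), "Z": (1, 2)})
```

Ranking contestants within a season by the log-odds of the winner class, rather than classifying each one independently, guarantees exactly one predicted winner per season.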

In [29]:
# Example - predictions at 50% through season

from naive_bayes import *
models = predict_season_winners(seasons)

t = 0.5
print "###############"
print "#    {:5}    #".format(t)
print "###############"
for s in models[t]['accuracy']:
    d = models[t]['accuracy'][s]
    print "Season: {:22} :: actual winner    ||  {}".format(s, d['actual_winner'].strip())
    print "        {:22}    predicted winner ||  {}".format('', d['predicted_winner'].strip())
    print "------------------------------------------------------------------"

###############
#      0.5    #
###############
Season: Palau                  :: actual winner    ||  Tom
                                  predicted winner ||  Stephenie
------------------------------------------------------------------
Season: Tocantins              :: actual winner    ||  J.T.
                                  predicted winner ||  J.T.
------------------------------------------------------------------
Season: Borneo                 :: actual winner    ||  Richard
                                  predicted winner ||  Richard
------------------------------------------------------------------
Season: Panama                 :: actual winner    ||  Aras
                                  predicted winner ||  Terry
------------------------------------------------------------------
Season: Cambodia               :: actual winner    ||  Jeremy
                                  predicted winner ||  Joe
------------------------------------------------------------------
Seaso

## Appendix

In [13]:
!ls *.ipynb
print
!ls *.p

episode_scores.ipynb    network.ipynb           wiki_scrape.ipynb
episode_scores_bc.ipynb process_votes.ipynb
make_graphs.ipynb       project_notebook.ipynb

episode_scores.p make_graphs.p    network.p        process_votes.p  wiki_scrape.p


In [14]:
# %load https://gist.github.com/ajp619/7dd388315fc824208654/raw/81be07b0e793208641182032e074dbe39bbfa08e/pyprint
def pyprint(myfile):
    from pygments import highlight
    from pygments.lexers import PythonLexer
    from pygments.formatters import HtmlFormatter
    import IPython

    with open(myfile) as f:
        code = f.read()

    formatter = HtmlFormatter()
    return IPython.display.HTML('<style type="text/css">{}</style>{}'.format(
        formatter.get_style_defs('.highlight'),
        highlight(code, PythonLexer(), formatter)))

In [16]:
# %load https://gist.github.com/ajp619/ddaa0f35627b066ef528/raw/cbbd6c6c1cad286ba5a358b93fd94eddede7c4ba/qtutil.py
# silly utility to launch a qtconsole if one doesn't exist

consoleFlag = True
consoleFlag = False  # Turn on/off by commenting/uncommenting this line

import psutil

def returnPyIDs():
    pyids = set()
    for pid in psutil.pids():
        try:
            if "python" in psutil.Process(pid).name():
                pyids.add(pid)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return pyids

def launchConsole():
    before_pyids = returnPyIDs()
    %qtconsole
    after_pyids = returnPyIDs()
    newid = after_pyids.difference(before_pyids)
    assert len(newid) == 1
    return list(newid)[0]

try:
    print qtid
except NameError:
    if consoleFlag:
        qtid = launchConsole()
        print qtid
    
if consoleFlag and (qtid not in returnPyIDs()):
    qtid = launchConsole()
    print qtid