### Instructions: Follow the instructions provided in each step, or in the output from a cell 
Step 1
* Make sure the python notebook and the pivot table Excel file are in the **same** folder on your computer.
* Enter the name of the pivot table in the next cell, then press the 'run' button above.
* The python notebook will give you a preview of what will be used in the analysis beneath the cell after you press the run button.  
  * Scan the row names to verify that they are the same as in the Excel sheet that you want to use.  If they are, skip the following cells and proceed to Step 2.
* Otherwise, follow the instructions in the output.

In [None]:
'''
# Developer Comments JRC
The number one goal is to have a python notebook that seamlessly runs the analysis in such a way that the inner workings are completely invisible to researchers,
and the output is simply but effectively presented to researchers for their use to make analytical decisions, whatever they may be.
It is important to note that this mindset is not designed to obfuscate or hide the methodology that is employed, but rather 
that repeated experience has shown researchers who are presented with an unfamiliar or confusing tool will opt to ignore it, irrespective of its accuracy, utility, or possible effectiveness.

This tool has been set up to do exactly that.  On the surface, it mimics the python notebooks that have been utilized by the O'Connor Lab and Genetic Services,
and under the hood, it employs a Machine Learning Classifier to predict MHC-A and MHC-B haplotypes for samples. I would strongly encourage any
graduate students who are interested in starting to learn Machine Learning, or who have experience with Machine Learning and would like to know more, to 
begin here. The only difference between the two versions is that the "_dev" version includes all of my code comments, notes, post-build notes, and future comments 
on what direction this build will take next.  The Classifier is set up to use Biological concepts, which hopefully will make it familiar.
If you wish to use it as a cookbook for parsing tricks with python, that is perfectly acceptable too; at this point, the workflow is set up
to parse input data, format it for the Classifier, run the Classifier, and then output the updated file. However, see my comments regarding the future build 1_2.

Those that simply wish to run the notebook and do not want to tinker under the hood may use either the "_dev" version or the non-dev version; 
unless specifically noted at the very beginning of the dev file, both versions will mirror each other and be fully functional. The non-dev file will always be a stable, runnable version. 

'''

In [None]:
'''
# Developer Comments JRC
Terminology:
There are two main types of comments. Code comments usually are very specific and brief, 
and are preceded by a '#' within a code cell. Usually these are written by John.
Developer Comments are code cells, but surrounded by three single quotes.  They are always attributed to an author, 
denoted by their initials, example JRC = John R Caskey
These comments are also more verbose, and can be on topics ranging from a general overview to a discussion of theory

Naming Conventions:
Alpha builds are effectively developer builds.  They are any python notebook or script with the name "alpha" in them,
example: ML_MHC_haplotyping_alpha_1_0
ML MHC haplotyping python notebook, alpha build 1.0

Beta builds are builds that are designed for widespread release to the O'Connor Lab and/or Genetics Services.  The versioning is controlled by github (see below)
and by the 2 numbers at the end, example:
ML_MHC_hapotyper_beta_1_0_dev
ML MHC haplotyper beta build, version 1.0, developer notes version

There are two caveats with beta builds, beyond the one already mentioned re: "_dev" naming.  First, python notebooks with "prebuild" in the name
are a working version of the upcoming beta version, usually for the version that is denoted in the numbers in the name, and usually for a specific part of that version.
Second, in rare cases, the "_dev" version--and only the "_dev" version--of the current beta build may be out of sync, and may not run as expected, 
but this will always be noted at the top of that python notebook in these rare cases.

GitHub:
GitHub will be used for version control of this project.  The current Beta version 1.0 will start as the effective beginning 
and as the Master branch, with all notes and comments included.  The "beta" branch will function as the latest stable branch for download, or LabKey can be used.
The "alpha" branch is a development branch that should be considered unstable, which only John or someone actively interested in development should ever use.

'''

In [None]:
pivot_table = '20411_Felber1-2_MHC-I_Haplotypes_23Mar18.xlsx' # enter name here

In [None]:
'''
# Developer Comments JRC
Due to multiple comments, there is a slight difference between this and the non-dev python notebook: 
the cell with most/all of the functions has been broken up into multiple cells that need to be run, separated by comments.  
In the non-dev version, it is simply one cell to run.

'''

In [1]:
import pandas as pd
from pandas import DataFrame
import numpy as np  # needed by the Step 2 cell, which runs before the Step 3 imports
import re
import time
import math

def parseExcelWithPandas(fName, excelFileP):
    # Returns the matching sheet as a DataFrame, or an empty DataFrame if no sheet matches,
    # so that callers can safely check .empty on the result.
    eSheetData = pd.DataFrame()
    # Sheets are parsed by 0-indexed position rather than by name (see dev comments below).
    for sheetCount, s in enumerate(excelFileP.sheet_names):
        if not fName:
            # No sheet name given: look for the default 'pivot' or 'MiSeq' sheets.
            m = re.search('pivot', s) or re.search('MiSeq', s)
        else:
            m = re.search(fName, s) # case insensitive?
        if m:
            # Reuse the ExcelFile that was passed in instead of reopening the global pivot_table.
            eSheetData = excelFileP.parse(sheetCount)
    return eSheetData

def findColumnIdxStartStop(pdDF):
    xCt = -1
    xStart = -1
    xStop = -1
    foundInitialMatch = False
    for x in pdDF.columns.values:
        xCt += 1
        if xCt == 0:
            continue
        else:
            if xCt < 10 and foundInitialMatch == False: # column idx will not be greater than 9
                m = re.search('named', str(x))
                if m:
                    continue
                else:
                    foundInitialMatch = True
                    xStart = xCt
            else: 
                m = re.search('named', str(x))
                if m:
                    xStop = xCt
                    break
    return (xStart, xStop)
def parsePandasDfRows(col1ListFromPdDf):
    headers = True
    mamuA_indices = []
    mamuB_indices = []
    skipIndices = []
    genotypeList = []
    for idx,val in enumerate(col1ListFromPdDf):
        if headers:
            m = re.search('Comment', str(val))
            mA = re.search('Mamu-A', str(val))
            mB = re.search('Mamu-B', str(val))
            if m:
                headers = False
                skipIndices.append(idx)
                continue
            elif mA:
                mamuA_indices.append(idx)
            elif mB:
                mamuB_indices.append(idx)
            else:
                skipIndices.append(idx)
                continue
        else:
            m = re.search('Alleles', str(val))
            if m:
                skipIndices.append(idx)
            else:
                genotypeList.append(val)
                continue
    return (skipIndices, mamuA_indices, mamuB_indices, genotypeList)

In [None]:
'''
# Developer Comments JRC
using pd.read_excel(pivot_table, sheet_name=excelSheetName)
is vulnerable to a bug whereby pandas will intermittently ignore parameters and grab the first sheet only, 
UNLESS the 0-indexed integer for the sheet is specified with the function x = pd.ExcelFile() followed by x.parse(int)
Therefore, the roundabout lookup of parsing the name, parsing the sheet ID integer, then parsing the file from the 
parsed sheet ID integer, was used instead, to guarantee workability with minimal future debugging being required.

'''
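The lookup described in the comment above can be sketched as follows. This is a minimal illustration, not the notebook's own function: `find_sheet_index` is a hypothetical helper name, and the sheet names are made up. The final step needs a real workbook on disk, so it is shown only in comments.

```python
import re

def find_sheet_index(sheet_names, pattern):
    """Return the 0-indexed position of the first sheet whose name matches pattern, or None."""
    for idx, name in enumerate(sheet_names):
        if re.search(pattern, name):
            return idx
    return None

# Hypothetical sheet names, for illustration only.
names = ['Summary', 'MiSeq pivot table', 'Notes']
sheet_idx = find_sheet_index(names, 'pivot')
print(sheet_idx)

# With a real workbook, the integer would then be passed to ExcelFile.parse:
#   xl = pd.ExcelFile(pivot_table)
#   idx = find_sheet_index(xl.sheet_names, 'pivot')
#   df = xl.parse(idx)
```

Resolving the name to an integer first keeps the sheet selection explicit, which is the point of the roundabout approach.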

In [None]:
def dataToPandasOneHot(d, pdDf, idxList):
    if pdDf is None:
        pdDf = pd.DataFrame.from_dict(d, orient='index').transpose()
        pdDf.index = idxList
        pdDf.index.name = 'genotype'
    else:
        alleleDFnew = pd.DataFrame.from_dict(d, orient='index').transpose()
        alleleDFnew.index = idxList
        alleleDFnew.index.name = 'genotype'
        # join_axes was removed in pandas 1.0; reindexing on the original index is the documented replacement
        alleleDFres = pd.concat([pdDf, alleleDFnew], axis=1).reindex(pdDf.index)
        pdDf = alleleDFres
    return pdDf
def parseIdxForMamuAMamuB(gList):
    mamuA_nameIdxListCol1 = []
    mamuB_nameIdxListCol1 = []
    for idx, n in enumerate(gList):
        m_MamuA = re.search('Mamu_A', str(n))
        m_MamuB = re.search('Mamu_B', str(n))
        if m_MamuA:
            filterMamu = re.search('Mamu_AG', str(n))
            if filterMamu:
                continue
            else:
                mamuA_nameIdxListCol1.append(idx)
        elif m_MamuB:
            mamuB_nameIdxListCol1.append(idx)
        else:
            continue
    return (mamuA_nameIdxListCol1, mamuB_nameIdxListCol1)

def parseGenotypeList(gList):
    r = []
    for g in gList:
        if type(g) is float:
            r.append(g)
        else:
            gString = g.split('_')
            gStringAsList = gString[1:]
            gStringAsString = '_'.join(gStringAsList)
            r.append(gStringAsString)
    return r



In [None]:
'''
# Developer Comments JRC
parseGenotypeList() was updated because some rows may have nan values after being added to a pandas dataframe, 
which is considered a float.  This value is parsed out later in the workflow, but at this point, filtering it out would require 
redoing a lot more than simply skipping them.
'''
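A small sketch of the behavior described above, assuming row labels carry a prefix before the first underscore. The function name `strip_prefix_keep_nan` and the row labels are hypothetical; the logic mirrors `parseGenotypeList()` in passing float values through untouched.

```python
import math

def strip_prefix_keep_nan(values):
    """Drop the text before the first underscore; pass NaN (a float) through unchanged."""
    result = []
    for v in values:
        if isinstance(v, float):
            result.append(v)  # NaN from an empty cell; filtered out later in the workflow
        else:
            result.append('_'.join(v.split('_')[1:]))
    return result

# Hypothetical row labels, for illustration only.
rows = ['01_Mamu_A1_001', float('nan'), '02_Mamu_B_017']
parsed = strip_prefix_keep_nan(rows)
print(parsed)
print([isinstance(p, float) and math.isnan(p) for p in parsed])
```

Keeping the NaN in place preserves the row positions, so indices computed earlier still line up.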

In [None]:
'''
# Developer Comments JRC
Here I ran into a problem.  
The data is being formatted to match what is required by the Classifier, basically a one-hot vector,
but what happens if there's a genotype in the input data that does not exist in the training data?

The immediate effect is accuracy tanks: the classifier does not know how to quantify the genotype, 
any more than a new biology student would know how to classify M. tuberculosis with a Gram stain.

There are several solutions here.

You could expand the training set: in the example above, show the grad student a microbiology textbook 
and instruct him or her to look up acid-fast staining. In the current Classifier, this would mean painstakingly 
iterating through all of the past experiments to ensure as many of the genotypes as possible are listed.
Sharp-eyed readers will notice a "gotcha" here, however: what happens when a novel allele is discovered and/or registered in the IRD database?

One possibility is to update the training model, but this can get cumbersome, and somewhat defeats 
the purpose of using a Classifier if the model must be continuously updated.

If you're thinking 'Can't the grad student (or the Classifier, for a genotype name) draw a logical conclusion from the Mycobacterium genus about both bacteria?',
the answer is yes.  Definitely, hopefully, yes for the grad student, and yes for the Classifier.  
But, the Classifier will need to be more complex.  At the moment, it looks at the results, and draws correlations based on only the results.
Neural networks such as Convolutional Neural Networks (CNN) are designed to handle more complex data and draw more complex conclusions, and the main goal for build 1.2, 
as in 2 builds from this one, is to create a CNN that will be able to handle unknown genotype names effectively.

Another option is to set up a parser of some kind that will scan and catch instances where a genotype name doesn't exist, and then do something.
However, a critical constraint is that the workflow MUST run efficiently without errors, or requiring recoding intervention of any kind.

What I did for the first build was split the difference, and as noted, I delayed implementing a more complex neural network until later.
When building the training set, I assembled all possible genotype names in an attempt to minimize the chance of a name not appearing.
Then, I built a parser that functions as follows:
1) Entry names in the input dataset are scanned, and compared to the names in the training dataset.
2) for each entry name in the input dataset:
  if the entry name exists one time in the training dataset:
    continue, no action is necessary
  else:
    grep for that entry name in the training dataset, minus a 'g' character.
    if the grep result is successful AND only one grep result occurs:
      print a Warning that the entry name in the input dataset will be changed to the entry name in the training dataset that matched to it
    else OR if no grep matches occurred:
      print a Warning that the entry name in the input dataset will be deleted

A few notes on this parser are:
* It will stop at the first match and not attempt to resolve conflicts.  Setting up logic in a parser for such a conflict resolution would be prohibitively difficult, 
and training a ML model is both planned and would be more effective anyway
* If the parser finds more than one grep result, this will trigger the same effect as not finding any.  The rationale is similar to above: this is most likely caused by similar but insufficiently unique entry names when comparing the training dataset and the input dataset, but resolving this naming conflict has been "punted" on until the next build.
* In the event of a naming deletion, this entry will be deleted from the input data. The training data is included but is not modified in any way.

'''
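The numbered steps above can be sketched compactly. This follows the written description rather than reproducing `scanAndUpdateGenotypeList()` line for line: `resolve_name` is a hypothetical name, the genotype names are made up, and reading "minus a 'g' character" as "the text before the first 'g'" is an assumption.

```python
def resolve_name(name, training_names):
    """Return (code, name): 1 = kept, 2 = replaced, 0 = marked for deletion."""
    if training_names.count(name) == 1:
        return (1, name)
    # Assumption: "minus a 'g' character" means the text before the first 'g'.
    stem = name.split('g')[0]
    hits = [t for t in training_names if stem and stem in t]
    if len(hits) == 1:
        print('WARNING: replacing ' + name + ' with ' + hits[0])
        return (2, hits[0])
    print('WARNING: no unique match for ' + name + '; it will be deleted')
    return (0, name)

# Hypothetical genotype names, for illustration only.
training = ['Mamu_A1_001', 'Mamu_A1_002', 'Mamu_B_017']
print(resolve_name('Mamu_A1_001', training))   # exact match
print(resolve_name('Mamu_B_017g1', training))  # unique match after trimming
print(resolve_name('Mamu_A1_00g', training))   # two candidates, so deleted
print(resolve_name('Mamu_E_05', training))     # no candidates, so deleted
```

The integer code in each tuple is what lets a later step filter out the entries marked for deletion.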

In [None]:
def scanAndUpdateGenotypeList(l1_parsed,l2_training):
    l1_parsed_res = []
    for i in l1_parsed:
        didFindMatch = False
        if (l2_training.count(i) == 1):
            l1_parsed_res.append((1, i))
            continue
        else:
            if type(i) is float:
                l1_parsed_res.append((0, i))
                # see dev comments
                continue
            iString = i.split('g')
            iStringRgx = iString[0]
            for itm in l2_training:
                m1 = re.search(re.escape(iStringRgx), itm)  # escape so the name is matched literally
                if m1:
                    mString = m1.group(0)
                    if (l2_training.count(mString) == 1):
                        l1_parsed_res.append((2, mString))
                        didFindMatch = True
                        print('WARNING: replacing original item ' + str(i) + ' with modified match from training set: ' + str(mString))
                        break # this is important: it stops at the first match
                else:
                    continue
            if not didFindMatch:
                print('WARNING! Unable to match item ' + str(i))
                print('Item ' + str(i) + ' was deleted from dataset.\nThis behavior will be modified in a future release.')
                l1_parsed_res.append((0, i))
            else:
                continue
    return l1_parsed_res

In [None]:
testFile = pd.ExcelFile(pivot_table)
readyToProceed = False
fName_none = ''
dataAsPandas = ''
excelSheetName = parseExcelWithPandas(fName_none, testFile)
processedPivotTable = False
if pivot_table:
    if excelSheetName.empty:
        print('The Excel sheet was found, but there was an error reading the Excel file.')
        print('Please do one of the following:\n\nProceed to the next cell and attempt to enter the sheet_name value,\n\nor\n\nExport the file as a csv, start from the beginning of the python notebook,\nchange the file name, and rerun all cells.\nBe sure to specify the file type as csv in the cell below when you run it.')
    else:
        print('Please click the next cell, and press Run.')
        print('Here is a preview of the data: \n\n##StartPreview:\n')
        pdHeadersRowsCol1 = list(excelSheetName.iloc[1:10, 0])
        for h in pdHeadersRowsCol1:
            # str() guards against non-string cells such as NaN
            m = re.search('Animal ID', str(h))
            if m:
                processedPivotTable = True
            print(str(h) + '\n')
        print('##EndPreview\n\nData Type is ')
        if processedPivotTable:
            print('Processed Pivot Table')
        else:
            print('raw pivot table')
        readyToProceed = True
else:
    print('No pivot table was found. Please do the following: \n1) check the filename and rerun the previous cell\n2) if you have already rerun the previous cell and are seeing this message again, \nproceed to the next cell and enter information for at least one of the following: ')
    print('\tfile_type: enter \'csv\' or \'excel\', depending on the file.')
    print('\tsheet_name: enter the sheet name for the excel spreadsheet or csv file')
    print('Then run the next two cells.')

genotypeList_training = openFileAsList('alleles_parsed.txt')

In [None]:
'''
# Developer Comments JRC
genotypeList_training is the genotype names list that is used later, and is commented on as well in a later comment
The file was created with a bash script that extracted the genotype names, sorted them by unique values, then concatenated the list for all MiSeq output files, 
and then repeated the sort/unique step one more time
See the file 'workflow_comments_filterParse.md' in the alpha branch for detailed notes

See the "ToDo" task list at the end as well.
'''
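`openFileAsList` is called earlier in Step 1, but its definition does not appear in this notebook. A minimal sketch of what such a helper might look like, assuming `alleles_parsed.txt` holds one genotype name per line. The file format is an assumption, and the names written to the temporary file below are made up.

```python
import os
import tempfile

def openFileAsList(path):
    """Read a text file and return its non-empty lines with surrounding whitespace stripped."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

# Demonstration on a temporary file with made-up genotype names.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Mamu_A1_001\nMamu_A1_002\n\nMamu_B_017\n')
    tmp_path = tmp.name

demo_names = openFileAsList(tmp_path)
os.remove(tmp_path)
print(demo_names)
```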

In [None]:
file_type = '' # must be left blank, 'csv' for a csv file, or 'excel' for an excel file type
sheet_name = '' # must have the excel sheet name with the pivot table, or be blank

Step 2
* This step will reformat the Excel data into a format that can be used by the Machine Learning Classifier.
* Click the next cell, and then click the Run button at the top.
* If you do not see any error messages, and you see the output 'Everything looks good!', then proceed to Step 3.  Otherwise, contact John for assistance with Step 2.
  * Note from John: In the next Beta build (1.2) this step will be modified to not require intervention from me if something goes wrong.
* If you see a warning message, it is still usually ok to proceed, but you should make a note of the warning.

In [None]:
def parseLabeledValues(l):
    r = []
    for i in l:
        iSplit = i.split('-')
        iSplitSorted = sorted(iSplit)
        iSplitSortdString = '-'.join(iSplitSorted)
        if iSplit[0] == iSplit[1]:
            r.append(iSplit[0])
        else:
            r.append(iSplitSortdString)
            # this is better approach than checking if the reverse string is present
            # sort all pairs and append, since duplicates do not matter, but order does
            # IMPORTANT: this does NOT remove any entries, only modifies labeling
    return r

In [None]:
'''
# Developer Comment: JRC
The above function will be incorporated into the tool described in the Dev Comments.
'''
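For reference, the relabeling that `parseLabeledValues()` performs can be shown on a few made-up labels. `normalize_label` is a hypothetical single-label version, and it assumes each label contains exactly one hyphen separating two haplotype names.

```python
def normalize_label(label):
    """Sort the two haplotypes in 'X-Y' so that 'B-A' and 'A-B' share one label."""
    first, second = sorted(label.split('-'))
    return first if first == second else first + '-' + second

# Hypothetical haplotype labels, for illustration only.
labels = ['A004-A001', 'A001-A004', 'A001-A001']
normalized = [normalize_label(l) for l in labels]
print(normalized)
```

No entries are removed; only the labeling changes, so reversed pairs end up in the same class.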

In [None]:
v = findColumnIdxStartStop(excelSheetName)
if v[0] == -1:
    print('Error!  check dataframe!')
col1List = list(excelSheetName['Sample Sheet #'])
skipRows, mamuA_rows, mamuB_rows, genotypeList_unparsed = parsePandasDfRows(col1List)
genotypeList_parsed = parseGenotypeList(genotypeList_unparsed)
genotypeListTuple = scanAndUpdateGenotypeList(genotypeList_parsed, genotypeList_training)
# final step: parse through genotypeListTuple, and remove any rows that correspond to (0, name)
genotypeListFiltered = list(filter(lambda x: x[0] != 0, genotypeListTuple))
genotypeListTupleX, genotypeListTupleY = zip(*genotypeListTuple) # zip iterable, unpacks tuple
genotypeListSkip = [x for (x,y) in enumerate(genotypeListTupleX) if y == 0]
genotypeIntsTuple, genotypeListFromTuple = zip(*genotypeListFiltered) 
genotypeList = list(genotypeListFromTuple)
skipMamuRows = mamuA_rows + mamuB_rows + skipRows + genotypeListSkip
skipMamuRows.sort()
parsedMamuIndices = parseIdxForMamuAMamuB(genotypeList)
parsedMamuAIndices = parsedMamuIndices[0]
parsedMamuBIndices = parsedMamuIndices[1]

# print(genotypeList)
rStart = v[0]
rStop = v[1] + 1 
rRange = rStop - rStart
mamuA_alleleList = []
mamuB_alleleList = []
alleleDF_MamuA = None
alleleDF_MamuB = None
for x in range(rStart, rStop):
    mamu_genotypes_oneHot = []
    dfValue = excelSheetName.iloc[:,x] # verify syntax for rows and columns, and index not lookup
    dfValue_MamuA_1 = dfValue.iloc[int(mamuA_rows[0])]
    dfValue_MamuA_2 = dfValue.iloc[int(mamuA_rows[1])]
    dfValue_MamuB_1 = dfValue.iloc[int(mamuB_rows[0])]
    dfValue_MamuB_2 = dfValue.iloc[int(mamuB_rows[1])]
    pdDict_MamuA = dict()
    pdDict_MamuB = dict()
    for idx, row in dfValue.items():  # Series.iteritems() was removed in pandas 2.0
        if idx not in skipMamuRows:
            try:
                if math.isnan(row):
                    mamu_genotypes_oneHot.append(0)
                    continue
                else:
                    mamu_genotypes_oneHot.append(1)
            except TypeError:
                mamu_genotypes_oneHot.append(1)
    pdDictKey_MamuA = str(dfValue_MamuA_1) + '-' + str(dfValue_MamuA_2)
    pdDictKey_MamuB = str(dfValue_MamuB_1) + '-' + str(dfValue_MamuB_2)
    pdDict_MamuA[pdDictKey_MamuA] = mamu_genotypes_oneHot
    pdDict_MamuB[pdDictKey_MamuB] = mamu_genotypes_oneHot
    alleleDF_MamuA = dataToPandasOneHot(pdDict_MamuA, alleleDF_MamuA, genotypeList)
    alleleDF_MamuB = dataToPandasOneHot(pdDict_MamuB, alleleDF_MamuB, genotypeList)
alleleDF_MamuA_parsed = alleleDF_MamuA.iloc[parsedMamuAIndices,:]
alleleDF_MamuB_parsed = alleleDF_MamuB.iloc[parsedMamuBIndices,:]
alleleDF_MamuA_parsedList = []
alleleDF_MamuA_listLabels_Y = []
for i in range(0, rRange):
    dfColAsSeries = alleleDF_MamuA_parsed.iloc[:,i]
    if dfColAsSeries.name == 'nan-nan':
        continue
    dfColAsList = dfColAsSeries.tolist()
    alleleDF_MamuA_parsedList.append(dfColAsList)
    alleleDF_MamuA_listLabels_Y.append(dfColAsSeries.name)
alleleDF_MamuA_listedLabels_Y = parseLabeledValues(alleleDF_MamuA_listLabels_Y)
alleleDF_MamuA_parsedNP = np.array(alleleDF_MamuA_parsedList)
if not alleleDF_MamuA_parsed.empty and not alleleDF_MamuB_parsed.empty:
    print('Everything looks good!')

In [None]:
# remaining steps todo before release of beta 1_0:
# write parser for alleleDF_MamuA_listLabels_Y to merge duplicated and reversed-duplicated values
# write function to clean up code to handle MamuA and MamuB above
# write logic to handle 2 conditions: labels for formatted pivot table and matching for raw pivot table



Step 3
* If you did not encounter any errors previously, or the output did not direct you to stop, then proceed with the analysis.
* You can either click this cell, select the Cell menu above, and then select 'Run All below', or click the cells after this one and individually click 'Run' for each one
* The analysis within this python notebook will do the following:
  * parse out genotype IDs for the MHC-A and MHC-B data,
  * use a pre-trained model for a Machine Learning Classifier to predict the Haplotype for each sample for MHC-A and MHC-B
  * add the predicted values back into the pivot table (or create a new pivot table)
  * output the resulting pivot table
* Note that for beta 1.x builds, only MHC-A and MHC-B haplotypes will be predicted.  All other haplotypes will be passed to the researcher for analysis.  A researcher should also verify the output pivot table from the Classifier.

In [None]:
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os
import time

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.base import clone



In [None]:

def lambdaFunc(v):
    return int(v)

def openAndParse(f):
    listValue = []
    headerLine = ''
    start = True
    with open(f) as fOpen:
        for i in fOpen:
            if start:
                iLine = i.rstrip('\n').split(',')
                iLine.pop(0)  # drop the label column; pop() returns the removed item, so do not reassign
                headerLine = ','.join(iLine)
                start = False
            else:
                i = i.rstrip('\n')
                iSplit = i.split(',')
                iSplitInt = list(map(lambdaFunc, iSplit[1:]))
                # listValue.append((iSplit[0], iSplit[1:]))
                listValue.append((iSplit[0], iSplitInt))
    return (headerLine, listValue)

In [None]:
trainingValues = openAndParse('allele_df-trainingSet-HapA.csv')
testingValues = openAndParse('allele_df-testingSet-HapA.csv')

In [None]:
'''
# Developer Comment JRC
The testingValues variable and resulting functions will be kept for the time being. Once the training model is set up, the primary use of these will be to run
and then 
if float_testing < 0.92:
  print('Warning!  You should not continue, something is wrong!')

Or something like that.  However, as I've stressed, the ML model will only be released for testing once it's been tested already rigorously.
'''
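The check described above might look something like the following. This is a sketch under stated assumptions: `accuracy` and `check_accuracy` are hypothetical helpers, the labels are made up, and the 0.92 cutoff is taken from the comment rather than from any tested release.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the known label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def check_accuracy(y_true, y_pred, threshold=0.92):
    """Return True if accuracy meets the threshold; print a warning otherwise."""
    float_testing = accuracy(y_true, y_pred)
    if float_testing < threshold:
        print('Warning!  You should not continue, something is wrong!')
        return False
    return True

# Made-up labels: 9 of 10 predictions agree, which is below the 0.92 cutoff.
known = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
predicted = [0, 0, 1, 1, 2, 2, 3, 3, 4, 0]
print(check_accuracy(known, predicted))
```

Returning a boolean lets a later cell decide whether to continue without the researcher having to read the accuracy figure.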

In [None]:
X_train = np.empty([len(trainingValues[1]), len(trainingValues[1][1][1])], dtype=int)
Y_train = np.empty([len(trainingValues[1])], dtype=int)
X_test = np.empty([len(testingValues[1]), len(testingValues[1][1][1])], dtype=int)
Y_test = np.empty([len(testingValues[1])], dtype=int)
print(X_test)
print(X_test.shape)
print('\n\n')
print(Y_test)
print(Y_test.shape)
trainingValueFromTuple = trainingValues[1]
testingValueFromTuple = testingValues[1]
print(testingValueFromTuple)
print(len(testingValueFromTuple))
lookupTableBecauseNumpy = []
lookupTableBecauseNumpyTesting = []
stopRange = len(trainingValueFromTuple)
for x in range(0, stopRange):
    aValue_x = trainingValueFromTuple[x][1]
    aValue_y = trainingValueFromTuple[x][0]
    X_train[x] = aValue_x
    # X_train[x] = aValue_xList
    # np.insert(X_train[x], aValue_xList)
    lookupTableBecauseNumpy.append(aValue_y)
    Y_train[x] = np.array(x, dtype=int)
stopRange = len(testingValueFromTuple)
for x in range(0, stopRange):
    aValue_x = testingValueFromTuple[x][1]
    aValue_y = testingValueFromTuple[x][0]
    X_test[x] = aValue_x
    lookupTableBecauseNumpyTesting.append(aValue_y)
    Y_test[x] = np.array(x, dtype=int)

In [None]:
'''
# Developer Comments JRC
# PRAGMA MARK: Critical Dev comment

A few comments here.
At a basic level, the training set creates a correlation between the one-hot vector of values 
that are fed to it (and was created in this case from the genotypes), with corresponding labeling.  
The labeling is important, otherwise the output would be binary, or a series of integers.  Additionally, labeling can be used in downstream
tasks and more complex machine learning tasks.

There are several algorithms to use for training and prediction, which are too complex to explain here, 
but will be explained and discussed in later developer comments.
The testing set then takes a set of data, which must be formatted as a one-hot vector with corresponding labeling, 
and then predicts how the test dataset will match to the training dataset.  

Initially, testing and training data were from the same pooled data, so creating testing and training data followed similar methods.

To use input data, the steps above--commented out now in the dev version, removed in the beta version--needed to be
modified such that the input data was predicted to have set values, and then those values were matched to a labeling table.

If input data was from a formatted pivot table, then a list of labels is created from the column names.  This becomes Y_test, and accuracy testing can be run.

If the input data was from a raw pivot table, then the classifier makes predictions, and these predictions are mapped back to labeling created from the training data set.  
In this case, no accuracy checks can be done, since these labels are unknown; however, as has been repeatedly stated, the output should be verified by a researcher.

'''
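To make the two ideas in the comment above concrete, here is a minimal sketch of building a one-hot vector and of mapping integer predictions back to labels. The helper name `one_hot`, the reference list, the sample, and the lookup table are all made up for illustration.

```python
def one_hot(present, all_genotypes):
    """1 where a genotype from the fixed reference list is present in the sample, else 0."""
    present = set(present)
    return [1 if g in present else 0 for g in all_genotypes]

# Made-up reference list and sample, for illustration only.
reference = ['A1_001', 'A1_002', 'A1_004', 'B_017']
sample = ['A1_004', 'B_017']
vector = one_hot(sample, reference)
print(vector)

# Predictions come back as integers; a lookup table maps them to haplotype labels.
label_lookup = {0: 'A001', 1: 'A004'}
predicted_ints = [1, 0, 1]
predicted_labels = [label_lookup[i] for i in predicted_ints]
print(predicted_labels)
```

The order of the reference list must stay fixed between training and prediction, otherwise the vector positions no longer mean the same thing.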

In [None]:
# Developer Comments JRC
'''
One thing I kept scratching my head over--and then subsequently face-palming over--was why
the linear model for the Classifier simply refused to work.

As a cautionary tale to anyone else, what was happening was I had correctly set up the X values as a one-hot vector, 
but then when I set the labels for Y, I assigned each one to a unique integer and then mapped them back to the corresponding string values.
This meant the linear model was being set up as 
X          Y 
[vector]   0
[vector]   1
[vector]   2
[vector]   3

Where each vector was something like:
[0,0,0,1,0,0]

And each Y integer was unique, and mapped back to a string, like "A001".  
 
The vectors were created from the allele names, where a mapping table was created that defined an allele that was present or not for a given haplotype (order was maintained).
Each vector represents an instance in an experiment where alleles for a sample mapped back to a haplotype as allele frequencies.  
Initially, replicated instances of Haplotypes were flattened and merged; however, I'll leave it as an exercise to the reader to see just how poorly this implementation fit any linear model. 

What I should have been doing (and subsequently did) was:
X          Y 
[vector]   0
[vector]   0
[vector]   0
[vector]   1
[vector]   1

Where again, each vector was something like:
[0,0,0,1,0,0]

Each Y integer was each instance of the corresponding haplotype, like "A001", that mapped to it
Subsequently, all allele frequencies were included, regardless of whether the same allele pairing was repeated.
This created a usable linear model, and multiple n values that the library could use for calculations.

Another big facepalm was when the Classifier worked flawlessly, but the lookup table referenced above got corrupted, thereby causing
0.0% accuracy.  Beyond my checking and verifying the code, and checking the one-hot vectors manually to verify that 0.0% accuracy 
was a near impossibility, I'll leave it as an exercise to the reader to research to determine why having 0.0% accuracy usually 
does not occur with Machine Learning, and also why the standard for accuracy in Machine Learning is 99.95% (this number is arbitrarily set, 
but there is a specific reason why it's set this high).

'''
# For these purposes, the labels were manually reformatted (see above) by visually scanning the list.  This will be automated.

def formatTestingValues(tList, tupleList):
    r = []
    validationInt = len(tList)
    for v in tList:
        for vInTuple in tupleList:
            if v == vInTuple[0]:
                r.append(vInTuple[1])
                break
    if len(r) != len(tList):
        print('WARNING!')
        print(r)
        print(tList)
        return None
    else:
        return r
def valuesToTestList(valList, npArrayTupleListTraining):
    r = []
    matched = False
    for v in valList:
        for vTuple in npArrayTupleListTraining:
            if v == vTuple[1]:
                matched = True
                r.append(vTuple[0])
                break
            # if v == vTuple[0]:
                # matched = True
                # r.append(vTuple[1])
                # break
        if matched:
            matched = False
            continue
        else:
            print('WARNING!')
            print(v)
            print(npArrayTupleListTraining)
            return None
    return r
        
    
def valuesToIntList(valList):
    listedValues = []
    ct = 0
    l_strings = []
    l_ints = []
    l_strings_r = []
    l_ints_r = []
    npArrayList = []
    for i in valList:
        if i not in l_strings:
            l_strings_r.append(i)
            l_ints_r.append(ct)
            l_strings.append(i)
            l_ints.append(ct)
            npArrayList.append(ct)
            ct += 1
        else:
            v_i = l_strings.index(i)
            v_i_string = l_strings[v_i]
            v_ct = l_ints[v_i]
            l_strings_r.append(v_i_string)
            l_ints_r.append(v_ct)
            npArrayList.append(v_ct)
    npArrayTupleList = list(zip(l_strings_r,l_ints_r))
    return (npArrayList, npArrayTupleList)
'''
listedValues = []
ct = 0
l_strings = []
l_ints = []
l_strings_r = []
l_ints_r = []
print('[')
npArrayList = []
for i in lookupTableBecauseNumpy:
    if i not in l_strings:
        print(str(i) + ',' + str(ct))
        l_strings_r.append(i)
        l_ints_r.append(ct)
        # print(str(ct) + ',')
        l_strings.append(i)
        l_ints.append(ct)
        npArrayList.append(ct)
        ct += 1
    else:
        v_i = l_strings.index(i)
        v_i_string = l_strings[v_i]
        v_ct = l_ints[v_i]
        print(str(v_i_string) + ',' + str(v_ct))
        l_strings_r.append(v_i_string)
        l_ints_r.append(v_ct)
        npArrayList.append(v_ct)
        # print(str(v_ct) + ',')
# print(']')
# print(npArrayList)
'''

parsedTrainingResults = valuesToIntList(lookupTableBecauseNumpy)
Y_train = np.array(parsedTrainingResults[0])
parsedTestingResults = valuesToIntList(lookupTableBecauseNumpyTesting)
Y_testList = valuesToTestList(lookupTableBecauseNumpyTesting, parsedTrainingResults[1])
# print(Y_train)
Y_test = np.array(Y_testList)
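The encoding step above can also be written with a dictionary, which avoids repeated list scans. The following is a minimal sketch and not the notebook's own method: `encode_labels`, `apply_encoding`, and the haplotype names are hypothetical. It also illustrates the lookup-table problem described in the developer comments: if every code in the table shifts by one position, even a perfect classifier scores 0.0.

```python
def encode_labels(labels):
    """Assign each distinct label an integer code in order of first appearance."""
    lookup = {}
    codes = []
    for label in labels:
        if label not in lookup:
            lookup[label] = len(lookup)
        codes.append(lookup[label])
    return codes, lookup


def apply_encoding(labels, lookup):
    """Encode labels with an existing lookup; raise if a label was never seen."""
    missing = [label for label in labels if label not in lookup]
    if missing:
        raise KeyError(f"labels absent from the training lookup: {missing}")
    return [lookup[label] for label in labels]


def accuracy(y_true, y_pred):
    """Fraction of positions where the two code lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical haplotype labels, for illustration only.
train_labels = ["A001", "A004", "A001", "B012", "A004"]
test_labels = ["B012", "A001", "A004"]

train_codes, lookup = encode_labels(train_labels)
test_codes = apply_encoding(test_labels, lookup)

# A "corrupted" lookup: every code shifted by one position (wrapping around).
n_classes = len(lookup)
shifted = {label: (code + 1) % n_classes for label, code in lookup.items()}
test_codes_corrupt = apply_encoding(test_labels, shifted)

# Pretend the classifier is perfect: its predictions equal the true codes.
perfect_pred = list(test_codes)
acc_clean = accuracy(test_codes, perfect_pred)
acc_corrupt = accuracy(test_codes_corrupt, perfect_pred)
print(acc_clean, acc_corrupt)
```

Raising on an unseen label, rather than returning `None`, is a design choice of this sketch; it makes a test label that is missing from the training data fail loudly at the point of encoding.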


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = [{'weights': ["uniform", "distance"]}]

# K-Nearest Neighbors (KNN) was used, but a different algorithm or library may be applied later
knn_clf = KNeighborsClassifier(n_jobs=-1, weights='distance', n_neighbors=4)
knn_clf.fit(X_train, Y_train)

y_knn_pred = knn_clf.predict(X_test)

# forest_clf_pred = forest_clf.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, y_knn_pred)
# This accuracy score comes from a single train/test split, so it may be optimistic if the model is overfitting
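The cell above imports `GridSearchCV` and defines `param_grid` but does not run a search. One way the two could be combined is sketched below. The synthetic data from `make_classification`, the extra `n_neighbors` values, and the choice of `cv=3` are assumptions made for illustration; they stand in for `X_train`, `Y_train`, and whatever settings the real analysis would use.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real training data (hypothetical, not the lab's).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# The original grid, extended with a few neighbor counts (an assumption).
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

# Each parameter combination is scored with 3-fold cross-validation.
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid,
                           cv=3, scoring="accuracy")
grid_search.fit(X_tr, y_tr)

print(grid_search.best_params_)       # combination with the best mean CV accuracy
print(grid_search.best_score_)        # that mean CV accuracy
print(grid_search.score(X_te, y_te))  # accuracy of the refit model on held-out data
```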

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, Y_train)
forest_clf_pred = forest_clf.predict(X_test)

# Print the score explicitly: only the last expression in a cell is displayed automatically
print(accuracy_score(Y_test, forest_clf_pred))
print(forest_clf_pred)
print(Y_test)
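Printing the two arrays works for a handful of samples, but mismatches are easy to miss by eye. The sketch below lists only the samples where the prediction disagrees with the expected label. The helper `find_mismatches`, the codes, and the haplotype names are hypothetical; with NumPy arrays such as `Y_test` and `forest_clf_pred`, calling `.tolist()` on each first should give plain integers to pass in.

```python
def find_mismatches(y_true, y_pred, code_to_label):
    """Return (index, expected label, predicted label) for each wrong prediction."""
    mismatches = []
    for i, (true_code, pred_code) in enumerate(zip(y_true, y_pred)):
        if true_code != pred_code:
            mismatches.append(
                (i, code_to_label[true_code], code_to_label[pred_code]))
    return mismatches


# Hypothetical codes and haplotype names, for illustration only.
code_to_label = {0: "A001", 1: "A004", 2: "B012"}
y_true = [0, 1, 2, 1, 0]
y_pred = [0, 2, 2, 1, 1]

wrong = find_mismatches(y_true, y_pred, code_to_label)
accuracy = 1 - len(wrong) / len(y_true)

for index, expected, predicted in wrong:
    print(f"sample {index}: expected {expected}, predicted {predicted}")
print(f"accuracy: {accuracy:.2f}")
```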



In [None]:
# Developer Comments JRC
'''
# 1) Cross-validation does exactly what it sounds like: it repeatedly trains on part of the training data and scores on the held-out remainder, giving an accuracy estimate that is less inflated by overfitting
# 2) This is a bit of a "hacked" approach, because it switches from KNN to Stochastic Gradient Descent (SGD) for the cross-validation, but I plan on moving in that direction anyway
# 3) For the suspicious/worried, you can replace "sgd_clf" with "knn_clf", or with "forest_clf" for a random forest classifier, to review that cross-validation result as well.
# 4) Ignore the warning.  I'm using an outdated module from SciKit-Learn; if I ultimately use TensorFlow or Caffe it won't appear, and if I stay with SciKit-Learn I'll use an updated module.

'''

In [None]:
from sklearn.model_selection import cross_val_score

sgd_clf = SGDClassifier(random_state=42)
cross_val_score(sgd_clf, X_train, Y_train, cv=2, scoring="accuracy")
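The developer comments suggest swapping `sgd_clf` for `knn_clf` or `forest_clf` to compare cross-validation results. A minimal sketch of running all three in one loop follows. The synthetic data, the five stratified folds, and the `results` dictionary are assumptions for illustration; the cell above uses `cv=2` on the real `X_train` and `Y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real training data (hypothetical, not the lab's).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

classifiers = {
    "sgd_clf": SGDClassifier(random_state=42),
    "knn_clf": KNeighborsClassifier(weights='distance', n_neighbors=4),
    "forest_clf": RandomForestClassifier(random_state=42),
}

# Stratified folds keep the class proportions similar in every split.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reusing the same `folds` object for every classifier means each one is scored on identical splits, so differences in the means reflect the algorithms rather than the partitioning.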

In [None]:
# Developer Comments JRC
'''
This is an ongoing task list for upcoming beta builds.  
Note that in the near future, inline developer comments for dev builds will be removed completely and replaced with versioned GitHub README files.

ToDo list for beta 1_1:
* replace openFileAsList() completely with a better parser
* additionally, consult with Dave Baker to create a more cohesive data storage unit for training data

ToDo list for beta 1_2:
* test Stochastic Gradient Descent (SGD), K-Nearest Neighbors (KNN), and Random Forest (RF) classifiers, and decide which algorithm to use for the Classifier
* test the algorithms listed above, plus a workflow for a modified Natural Language Processing (NLP) approach based on preliminary proof-of-concept work with fasta headers,
to create an ML model that parses genotype names

'''