# Supervised Machine Learning with Regression

_This notebook is an introductory guide to machine learning and walks you through the machine learning process with a sample dataset. You can use your own dataset with some minor adjustments to the code. This notebook provides guidance on how to build models from both text-based and Boolean-based input data._

_This guide only touches on a few machine learning techniques and should not be used as your one-stop-shop for all machine learning problems._

In [164]:
#Some handy imports for you:

#pandas dataframe: data structure 
import pandas as pd

#For text input: bag of words vectorizer - take inverse frequency of words to assign weights
from sklearn.feature_extraction.text import TfidfVectorizer

#Training/Test data: split data into training and test sets
from sklearn.model_selection import train_test_split
#Validate our models to check performance by scoring them on each cross-validation fold
from sklearn.model_selection import cross_val_score
#Specify number of folds we want our data broken up into (in each round, one fold is held out for testing and the rest are used for training)
from sklearn.model_selection import StratifiedKFold 

#5 different models - LogisticRegression is a classifier, the other four are regression models - you don't need to use these but they are a great start
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet

#accuracy metrics
from sklearn.metrics import accuracy_score, explained_variance_score, max_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error, median_absolute_error, r2_score

## Import Data and Set X & y

The first step is to connect to a dataset. This can be a local file or a URL. To import your dataset, replace the URL string (shown in red in the notebook) with the location of the file you want to use.

df is the conventional variable name for a pandas dataframe. You can rename it to anything you want - you just have to change every occurrence to the new variable name.

In [165]:
#connect to CSV file that contains our data
path_to_file = "https://raw.githubusercontent.com/mawebster9/ThesisCode/master/appeals_query.csv"

#open, read, and store our data into a pandas dataframe
df = pd.read_csv(path_to_file, encoding='latin-1')
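If you want to see how `pd.read_csv` decides on column types before pointing it at a real file, here is a minimal sketch that reads a tiny in-memory CSV instead. The rows and the `toy_df` name are made up for illustration; only the column names 'Judgment' and 'Denied' come from this notebook.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real file
csv_text = """Judgment,Female,Drugs,Denied
"Applicant used marijuana recently.",False,True,True
"Debts were resolved before the hearing.",True,False,False
"Applicant falsified a questionnaire.",False,False,True
"""

# read_csv accepts a local path, a URL, or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# Columns that contain only True/False are parsed as bool;
# the free-text column is kept as strings
print(toy_df.dtypes)
print(toy_df.shape)
```

Checking `dtypes` right after loading is a quick way to confirm that your true/false columns really came in as booleans, which the next section relies on.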

__***For this template, we will use the terms X_bool, y_bool for the boolean section and X_text, y_text for our text section. It is standard to use just X and y, but in order to show both in one section we need to differentiate them.***__

### Dealing with Boolean Input Data

For this section, we will focus on all of the boolean fields. The boolean fields are the true/false columns, and they can capture a lot of useful attributes in a dataset. Our goal is to see if the true/false fields can be used to determine the associated label with high accuracy. For this, our X is all of the true/false fields and the associated label, or y, is the decision field.

In this section, we will assign ALL boolean fields to X, set the y, then remove y from X. This is helpful if you have a long list of attributes in your X field: it saves a lot of time and keeps working when columns are added or removed.

To use your dataset, replace the red text in y ('Denied') with the name of your column header that is your decision variable.

In [166]:
#assign our attributes to X (all boolean fields) and y
X_bool = df.select_dtypes('bool')
y_bool = X_bool['Denied']

Next, we remove the label from X so that it is not used as an attribute. Then, to ensure that our data was read in correctly, we check the first value of each variable.

In [167]:
#remove the 'Denied' field from the X data frame
X_bool = X_bool.drop('Denied',axis=1)
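The select-then-drop pattern used above can be tried out on a small frame first. This is only a sketch: the four columns and their values are hypothetical, and the `X_toy` / `y_toy` names are placeholders.

```python
import pandas as pd

# Hypothetical frame: one text field, two boolean attributes, one boolean label
toy_df = pd.DataFrame({
    "Judgment": ["text a", "text b", "text c"],
    "Female": [False, True, False],
    "Drugs": [True, False, False],
    "Denied": [True, False, True],
})

# Keep only the boolean columns (this still includes the label)
X_toy = toy_df.select_dtypes("bool")

# Copy the label out, then drop it from the attributes
y_toy = X_toy["Denied"]
X_toy = X_toy.drop("Denied", axis=1)

print(list(X_toy.columns))  # attribute columns only
print(y_toy.name)           # the label column
```

The text column is left out automatically by `select_dtypes`, so the only column you have to name by hand is the label.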

In [168]:
#print the first record in X to verify the previous step
X_bool.iloc[0]

Smith                     False
Female                    False
Position_Eligibility      False
No_Falsification          False
Rebut_Falsification       False
Falsification(s)           True
Domestic_Violence         False
Previous_Clearance        False
Traumatic_Life_Event      False
Caused_Death              False
Child_Sexual_Abuse        False
Child_Pornography         False
Prostitutes               False
Fmr_Military_LawE         False
Adverse_Affirmed          False
Favorable_Affirmed        False
Granted                   False
Failed_to_Mitigate        False
Success_to_Mitigate       False
Adverse_Reversed          False
Revoked_Fav_Reversed      False
Adverse_Remanded          False
Favorable_Remanded        False
Remanded_wInstructions    False
Recommend_Waiver          False
Decision_Other            False
Decision_Unknown          False
Security_Violations       False
Foreign_Influence         False
Foreign_Preference        False
Sexual_Behavior           False
Personal

In [169]:
y_bool.iloc[0]

True

Now that we have assigned our columns to the X and y variables, we need to transform our X values into a form that a machine learning algorithm can understand. To do this, we change our true/false values into a numeric format: all true values become 1.0 and all false values become 0.0.

In [170]:
#convert every column in X_bool: True = 1.0, False = 0.0
X_bool = X_bool.astype(float)
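If you would rather verify the conversion with code than by eye, here is a small sketch on a made-up frame. It assumes the conversion is done with `astype(float)`; the column names and the `numeric_frame` variable are placeholders.

```python
import pandas as pd

# Hypothetical boolean attributes
bool_frame = pd.DataFrame({"Female": [False, True], "Drugs": [True, False]})

# Cast every column at once: True becomes 1.0 and False becomes 0.0
numeric_frame = bool_frame.astype(float)

# Every column should now be float64 and hold only 0.0 or 1.0
print(numeric_frame.dtypes)
print(numeric_frame.isin([0.0, 1.0]).all().all())
```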

To ensure that our replacement was successful, we need to print out at least one record in X. We can do so in either of the following 2 ways. The first shows the contents of the top 10 rows (you can change this number to show any amount) and the second just prints out the values for the first record.

In [171]:
#print out top 10 records in X to ensure True was changed to 1.0 and False was changed to 0.0
X_bool.head(10)

Unnamed: 0,Smith,Female,Position_Eligibility,No_Falsification,Rebut_Falsification,Falsification(s),Domestic_Violence,Previous_Clearance,Traumatic_Life_Event,Caused_Death,...,Alcohol,Drugs,Emotional_Mental,Criminal_Conduct,Handling_PI,Outside_Activities,Use_InfoSys,Deception,CAC,Unknown_Guideline
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [172]:
#Alternatively, you could print the first record in X
X_bool.iloc[0]

Smith                     0.0
Female                    0.0
Position_Eligibility      0.0
No_Falsification          0.0
Rebut_Falsification       0.0
Falsification(s)          1.0
Domestic_Violence         0.0
Previous_Clearance        0.0
Traumatic_Life_Event      0.0
Caused_Death              0.0
Child_Sexual_Abuse        0.0
Child_Pornography         0.0
Prostitutes               0.0
Fmr_Military_LawE         0.0
Adverse_Affirmed          0.0
Favorable_Affirmed        0.0
Granted                   0.0
Failed_to_Mitigate        0.0
Success_to_Mitigate       0.0
Adverse_Reversed          0.0
Revoked_Fav_Reversed      0.0
Adverse_Remanded          0.0
Favorable_Remanded        0.0
Remanded_wInstructions    0.0
Recommend_Waiver          0.0
Decision_Other            0.0
Decision_Unknown          0.0
Security_Violations       0.0
Foreign_Influence         0.0
Foreign_Preference        0.0
Sexual_Behavior           0.0
Personal_Conduct          1.0
Financial                 0.0
Alcohol   

Our X values are now in a form that a machine learning algorithm can understand, but we still need to fix our y values. As with X, we change our true/false values into a numeric format: all true values become 1.0 and all false values become 0.0.

In [173]:
#convert: True = 1.0, False = 0.0
y_bool = y_bool.astype(float)

In [174]:
#print out counts for all y records - ensure that our replace statement worked
y_bool.value_counts()

1.0    10862
0.0     9652
Name: Denied, dtype: int64

In [175]:
#Alternatively, you could print the first record in y
y_bool.iloc[0]

1.0

### Dealing with Text Input Data

The next step is assigning our data to our variables. X is all of the attributes we want to feed through our ML algorithm and y is our label (the value we are trying to predict). To use your dataset, replace the red text with the column headers you want. For this I am only using one field for the X variable ('Judgment') and my y is the associated label for that column ('Denied').

For this section, we will focus solely on a text input field. The 'Judgment' field acts like a brief summary field that contains a lot of useful information about each record. Our goal is to see if the 'Judgment' field can be used to determine the outcome (y) with high accuracy. For this, our X is the 'Judgment' field and its associated label, or y, is the 'Denied' field.

Also note, if you have a large dataset you need to reduce the number of entries since processing text data is memory-intensive. If you use all of the data in a large dataset, you might get a memory error. For this example, 500 records provide enough data to make a reasonably accurate model without running out of memory.

**This depends on the dataset being used. You can play around with this number to make it as high as possible without getting a memory error - I typically use 500.**

Also note, our y data is in Boolean format so we do not need a bag of words for it.

Make sure that when you truncate the dataset, you truncate __BOTH__ the X and y variables.

In [176]:
#assign our attributes to X and y - only taking the first 500 records
X_text = df['Judgment'].head(500)
y_text = df['Denied'].head(500)
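One way to make it hard to truncate X and y differently is to cut the dataframe once and take both columns from the result. The sketch below uses a made-up ten-row frame and a hypothetical `n_records` cap of 4 in place of 500.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the real dataset
toy_df = pd.DataFrame({
    "Judgment": [f"summary {i}" for i in range(10)],
    "Denied": [i % 2 == 0 for i in range(10)],
})

n_records = 4  # placeholder cap; this notebook uses 500

# Truncate once, then pull X and y from the same subset
subset = toy_df.head(n_records)
X_small = subset["Judgment"]
y_small = subset["Denied"]

# Both variables cover exactly the same rows
print(len(X_small), len(y_small))
print(X_small.index.equals(y_small.index))
```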

In order to ensure that our data was read in correctly, we need to check the first value of each variable.

In [177]:
#print the first record in X to verify the previous step
X_text.iloc[0]

"Applicant's drug abuse was not mitigated where marijuana use was recent, and had continued after Applicant stated an intent to refrain from drug use in the future. He falsified his drug abuse history on security questionnaires in March and October 1995 an"

In [178]:
y_text.iloc[0]

True

Now that we have the data assigned to our X and y variables, it is time to prime the data for the machine learning algorithms. For text data, we need to break up the words in a way that a machine can understand the characteristics of speech. One of the ways we can do this is by using a bag of words. A bag of words essentially takes a large amount of text data, splits it into separate words, and counts the number of occurrences of each word.

For this example, we will be using a vectorizer to split the words and assign a weight to each word. The vectorizer we will use in this example, TfidfVectorizer, multiplies how often a word appears in a record (term frequency) by the inverse of how many records contain it (inverse document frequency). This ensures that words found in almost every record are weighted less than words that are distinctive for a record, such as "foreign", "alcohol", "drugs", etc. Setting stop_words='english' goes one step further and removes common words in the English language, also known as "stop words", like "the", "a", "an", etc., from the bag of words entirely.

In [179]:
#set up bag of words for judgment field, use english stop words
vectorizer = TfidfVectorizer(stop_words='english')
X_text = vectorizer.fit_transform(X_text.tolist())
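To get a feel for what the vectorizer produces, here is a minimal sketch on three made-up sentences. The sentences and variable names are hypothetical; the `TfidfVectorizer` settings are the same ones used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the 'Judgment' column
toy_docs = [
    "the applicant abused marijuana",
    "the applicant owes delinquent debts",
    "the applicant mitigated the alcohol concerns",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_weights = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index
vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print(toy_weights.shape)  # (number of records, number of kept words)

# 'applicant' appears in every record, 'marijuana' in only one,
# so 'marijuana' gets the larger weight in the first record
first_row = toy_weights[0].toarray().ravel()
print(first_row[vocab["applicant"]], first_row[vocab["marijuana"]])
```

Notice that "the" never makes it into the vocabulary at all, while a word shared by every record is kept but given less weight than a word unique to one record.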

In [180]:
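#print bag of words to ensure it is set up correctly
#(get_feature_names() was removed in scikit-learn 1.2; on newer versions call get_feature_names_out() instead)
vectorizer.get_feature_names()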

['00',
 '000',
 '10',
 '1001',
 '11',
 '12',
 '13',
 '14',
 '15',
 '154',
 '16',
 '17',
 '18',
 '19',
 '1959',
 '1965',
 '1966',
 '1967',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '199',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '20',
 '2001',
 '2003',
 '2005',
 '2006',
 '2007',
 '203',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '33',
 '35',
 '36',
 '38',
 '40',
 '401k',
 '45',
 '50',
 '548',
 '59',
 '60',
 '700',
 '80',
 '81',
 '83',
 '86',
 '88',
 '91',
 '94',
 '96',
 'a10',
 'aa',
 'ab',
 'abandoning',
 'abandonment',
 'abilities',
 'ability',
 'able',
 'absence',
 'absent',
 'absolutely',
 'absolve',
 'absolving',
 'absorb',
 'abstain',
 'abstained',
 'abstention',
 'abstin',
 'abstinence',
 'abstinent',
 'abuse',
 'abused',

Now that we have our list created and number of occurrences counted and weighted appropriately, we need to ensure that the data is all accounted for.

In [181]:
#ensure that data is in the right format - TfidfVectorizer returns sparse matrix of type <class numpy.float64>
X_text[:1]

<1x2117 sparse matrix of type '<class 'numpy.float64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [182]:
#make sure we have all records that were passed into the vectorizer
X_text.shape

(500, 2117)

You can see that all 500 records were passed in and that there were 2,117 unique words in our data that were pulled out, counted, and assigned weights.

Our X values are now in a form that a machine learning algorithm can understand, but we still need to fix our y values. To do this, we change our true/false values into a numeric format: all true values become 1.0 and all false values become 0.0.

In [183]:
#convert: True = 1.0, False = 0.0
y_text = y_text.astype(float)

In [184]:
#print out counts for all y records - ensure that our replace statement worked
y_text.value_counts()

0.0    260
1.0    240
Name: Denied, dtype: int64

In [185]:
#Alternatively, you could print the first record in y
y_text.iloc[0]

1.0

## Machine Learning Step-by-Step

Now that we are sure that our data is good to go, we can begin the process of training our model and predicting outcomes.

#### 1. Run train_test_split() on X & y

This step is not needed for the model comparison at the end of this notebook, but it shows you how the train_test_split function works. Our X and y are randomly split up into training and testing groups. In this case, our test group will be comprised of 33% of the data (test_size=0.33), and the split will be the same each time we run this line (random_state=42) to ensure consistency between runs.

Note that 0.33 for test_size and 42 for random_state are common choices rather than requirements (scikit-learn's own default for test_size is 0.25). You do not need to use these numbers and can change them.

For more guidance on test_size and random_state, visit https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
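Here is a small sketch of what test_size and random_state do, using stand-in arrays instead of the real data. The arrays are placeholders that only match the real dataset in row count (20,514, the sum of the two label counts printed earlier); the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as this notebook's dataset
X_demo = np.arange(20514).reshape(-1, 1)
y_demo = np.arange(20514) % 2

# Hold out 33% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(X_tr.shape, X_te.shape)

# Re-running with the same random_state picks exactly the same test rows
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

Changing random_state to a different number gives a different (but still repeatable) split, and leaving it out gives a new split on every run.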

##### For Boolean Values

In [186]:
#break 33% of X and y into X_test and y_test, break other remaining 67% into X_train and y_train
X_bool_train, X_bool_test, y_bool_train, y_bool_test = train_test_split(X_bool, y_bool, test_size=0.33, random_state=42)

In [187]:
#print size of training data (67% of initial data size)
X_bool_train.shape

(13744, 43)

In [188]:
#print size of test data (33% of initial data size)
X_bool_test.shape

(6770, 43)

Make sure that the first number in the parentheses (the row count) is about 67% of your dataset for the training data or about 33% for the test data. To check, divide that number by the total number of records in the dataset you are using.
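The check described above is simple arithmetic. As a sketch, using the row counts from the two shape outputs in this section (the variable names are placeholders):

```python
# Row counts reported by X_bool_train.shape and X_bool_test.shape
n_train, n_test = 13744, 6770
n_total = n_train + n_test

# Share of the dataset that went to each group
train_share = n_train / n_total
test_share = n_test / n_total

print(n_total)
print(round(train_share, 4), round(test_share, 4))
```

The shares will rarely be exactly 0.67 and 0.33 because the number of test rows has to be a whole number.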

##### For Text Values

In [189]:
#break 33% of X and y into X_test and y_test, break other remaining 67% into X_train and y_train
X_text_train, X_text_test, y_text_train, y_text_test = train_test_split(X_text, y_text, test_size=0.33, random_state=42)

In [190]:
#print size of training data (67% of initial data size)
X_text_train.shape

(335, 2117)

In [191]:
#print size of test data (33% of initial data size)
X_text_test.shape

(165, 2117)

#### 2. Run fit() on X_train & y_train

##### For Boolean Values

The next step in our machine learning process is taking our training data and feeding it into an algorithm to build a model. This is essentially the step where the algorithm learns how the attributes in X relate to the label y for each record. To do this, there are a number of models to choose from. For this example we will focus on the LogisticRegression classifier.

The final step in this section prints out the classifier used as well as all of the parameter values used.

For more information on the LogisticRegression classifier, visit https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [257]:
#specify which classifier to use and set parameters
clf = LogisticRegression(random_state=0, solver='liblinear')

In [258]:
#send X_train and y_train into our classifier to build a model
logreg_model_bool = clf.fit(X_bool_train, y_bool_train)

In [259]:
print(logreg_model_bool)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


##### For Text Values

In [260]:
#specify which classifier to use and set parameters
clf = LogisticRegression(random_state=0, solver='liblinear')

In [261]:
#send X_train and y_train into our classifier to build a model
logreg_model_text = clf.fit(X_text_train, y_text_train)

In [262]:
print(logreg_model_text)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)


#### 3. Run predict() on X_test

The next step is to take the model we just built from the training data and feed the test data into it. This will output an array of predicted values for the decision variable (y).
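As a sketch of what predict() returns, the example below trains on eight made-up records and then predicts three records the model has not seen. The data and variable names are hypothetical. It also calls predict_proba(), which is not used elsewhere in this notebook, to show the probabilities behind each predicted label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: column 0 equals the label
X_known = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.],
                    [0., 0.], [0., 1.], [0., 0.], [0., 1.]])
y_known = np.array([1., 1., 1., 1., 0., 0., 0., 0.])
toy_model = LogisticRegression(random_state=0, solver="liblinear").fit(X_known, y_known)

# Records the model was not trained on
X_unseen = np.array([[1., 1.], [0., 0.], [1., 0.]])

# predict() returns one label per record
labels = toy_model.predict(X_unseen)
print(labels)

# predict_proba() returns the probability of each class per record;
# the predicted label is the class with the higher probability
probabilities = toy_model.predict_proba(X_unseen)
print(probabilities)
```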

##### For Boolean Values

In [263]:
#send our test data into the model we just created
y_bool_pred = logreg_model_bool.predict(X_bool_test)

In [264]:
#print our results for the predictions
print("Here are the Boolean model's predictions: ")
print(y_bool_pred)

Here are the Boolean model's predictions: 
[0. 1. 0. ... 0. 0. 1.]


##### For Text Values

In [265]:
#send our test data into the model we just created
y_text_pred = logreg_model_text.predict(X_text_test)

In [266]:
#print our results for the predictions
print("Here are the Text model's predictions: ")
print(y_text_pred)

Here are the Text model's predictions: 
[0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.
 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1.
 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 1.
 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1.
 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1.
 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0.]


#### 4. Verify Accuracy of Model

Now that we have split our data into training and testing groups, created a model using a machine learning algorithm, and used the model to predict outcomes for our test data, it is time to verify how well our model did compared to the actual outcomes. To do this, there are a number of accuracy metrics. For this example we will focus on the accuracy score.

You can use any of the scoring metrics to determine the validity of the model. For more metrics visit https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
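The accuracy score and the r2 score measure different things, so they can disagree quite a bit on the same predictions. Here is a small sketch with eight made-up labels and predictions (both arrays are hypothetical) that computes each score by hand and with scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, r2_score

# Hypothetical labels and predictions: 2 of the 8 predictions are wrong
y_true = np.array([1., 0., 1., 1., 0., 0., 1., 0.])
y_pred = np.array([1., 0., 1., 0., 0., 0., 1., 1.])

# Accuracy: the share of predictions that match the label
manual_accuracy = np.mean(y_true == y_pred)

# r2: 1 minus (squared error of the model / squared error of always guessing the mean)
squared_error = np.sum((y_true - y_pred) ** 2)
baseline_error = np.sum((y_true - y_true.mean()) ** 2)
manual_r2 = 1 - squared_error / baseline_error

print(manual_accuracy, accuracy_score(y_true, y_pred))
print(manual_r2, r2_score(y_true, y_pred))
```

In this made-up case the accuracy is 0.75 while the r2 score is 0, because two wrong answers out of eight produce as much squared error as always guessing the average label. That is one reason to look at more than one metric.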

In [267]:
#import our score functions
from sklearn.metrics import accuracy_score

##### For Boolean Values

In [268]:
#compare y_test values with the predicted y values - formatted to be percent
#(each score function already returns a single number, so no .mean() is needed)
score_bool = accuracy_score(y_bool_test, y_bool_pred)
r2_bool = r2_score(y_bool_test, y_bool_pred)
print("Accuracy score for LogisticRegression classifier on Boolean data:  ", score_bool*100,"%")
print("r2 score for LogisticRegression classifier on Boolean data:  ", r2_bool*100,"%")

Accuracy score for LogisticRegression classifier on Boolean data:   99.54209748892171 %
r2 score for LogisticRegression classifier on Boolean data:   98.1632538037196 %


##### For Text Values

The same metrics can be calculated for the text model by passing in the text test labels and predictions. Here we print the r2 score; other metrics can be seen in the last section.

In [269]:
#compare y_test values with the predicted y values - formatted to be percent
r2_text = r2_score(y_text_test, y_text_pred)
print("r2 score for LogisticRegression classifier on Text data:  ", r2_text*100,"%")


## Test for Best Classifier to Use

Now that we understand how machine learning is done, we can determine which model is the best choice for our data. In this example we will use 5 different models and evaluate each against 6 scoring metrics.

Note: the cross_val_score() function handles the whole process for us. For each fold, it splits X and y into training and test groups (using the folds we pass in through cv instead of train_test_split(X,y,test_size=0.33,random_state=42)), runs fit(X_train,y_train) and predict(X_test), and calculates whichever scoring metric we ask for. This function is essentially an all-in-one function.
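Here is a minimal sketch of cross_val_score() on made-up data, so you can see what it returns before running it on the full dataset. The 60 random records and the variable names are hypothetical; the classifier and the 3 folds match what this notebook uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical data: 60 records, 3 true/false attributes; the label copies attribute 0
rng = np.random.RandomState(0)
X_cv = rng.randint(0, 2, size=(60, 3)).astype(float)
y_cv = X_cv[:, 0].copy()

folds = StratifiedKFold(n_splits=3)

# Every record lands in the test group of exactly one fold
test_sizes = [len(test_idx) for _, test_idx in folds.split(X_cv, y_cv)]
print(test_sizes)

# One score per fold; .mean() collapses them into a single number
fold_scores = cross_val_score(
    LogisticRegression(solver="liblinear"), X_cv, y_cv, scoring="accuracy", cv=folds
)
print(fold_scores)
print(fold_scores.mean())
```

Because the function returns one score per fold, the code in this section calls .mean() on the result to get a single number for each model and metric.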

We have chosen 5 models to compare with one another on their ability to accurately predict the outcomes. We have also used 6 different scoring metrics to determine each model's ability to correctly predict the outcome. For more information on Supervised Machine Learning algorithms visit https://scikit-learn.org/0.16/supervised_learning.html

For metric scoring information, visit https://scikit-learn.org/stable/modules/classes.html

For classifier information visit
* LinearRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
* LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* Ridge: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html
* Lasso: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
* ElasticNet: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

##### For Boolean Values

In [219]:
classifiers = [LinearRegression(), LogisticRegression(solver='liblinear'), Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet()]
clf_names = ['LinearRegression', 'LogisticRegression', 'Ridge', 'Lasso', 'ElasticNet']
metric_names = ['explained_variance','max_error','neg_mean_absolute_error','neg_mean_squared_error','neg_median_absolute_error','r2']

scv = StratifiedKFold(n_splits=3)

scores_df = pd.DataFrame(index=metric_names,columns=clf_names)
clf_scores = []
for clf, name in zip(classifiers, clf_names):
    print('-----------------------------------------------------------------------------------------------------------')
    print('Classifier: ',clf)
    print('')
    print("Scoring Metrics: ")
    for metric in metric_names:
        score = cross_val_score(clf,X_bool,y_bool,scoring=metric, cv=scv).mean()
        clf_scores.append(score)
        print('\t*',metric,'score: ', score)
    scores_df[name] = clf_scores
    clf_scores = []

-----------------------------------------------------------------------------------------------------------
Classifier:  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Scoring Metrics: 
	* explained_variance score:  0.977941531909747
	* max_error score:  -1.0017213308203534
	* neg_mean_absolute_error score:  -0.014387746769166385
	* neg_mean_squared_error score:  -0.005495523250579351
	* neg_median_absolute_error score:  -0.0024670360377915626
	* r2 score:  0.9779411758438702
-----------------------------------------------------------------------------------------------------------
Classifier:  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Scoring Metrics: 
	* explained_va

##### For Text Values

In [205]:
classifiers = [LinearRegression(), LogisticRegression(solver='liblinear'), Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet()]
clf_names = ['LinearRegression', 'LogisticRegression', 'Ridge', 'Lasso', 'ElasticNet']
metric_names = ['explained_variance','max_error','neg_mean_absolute_error','neg_mean_squared_error','neg_median_absolute_error','r2']

scv = StratifiedKFold(n_splits=3)

scores_df = pd.DataFrame(index=metric_names,columns=clf_names)
clf_scores = []
for clf, name in zip(classifiers, clf_names):
    print('-----------------------------------------------------------------------------------------------------------')
    print('Classifier: ',clf)
    print('')
    print("Scoring Metrics: ")
    for metric in metric_names:
        score = cross_val_score(clf,X_text.toarray(),y_text,scoring=metric, cv=scv).mean()
        clf_scores.append(score)
        print('\t*',metric,'score: ', score)
    scores_df[name] = clf_scores
    clf_scores = []

-----------------------------------------------------------------------------------------------------------
Classifier:  LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Scoring Metrics: 
	* explained_variance score:  0.47439153183285887
	* max_error score:  -0.9703183279520088
	* neg_mean_absolute_error score:  -0.2971181339819032
	* neg_mean_squared_error score:  -0.13352400094515843
	* neg_median_absolute_error score:  -0.2674931862811877
	* r2 score:  0.46503160646305863
-----------------------------------------------------------------------------------------------------------
Classifier:  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Scoring Metrics: 
	* explained_varia

## Final Results and Analysis

It is important that you analyze the model outputs and determine which model performed the best and why. It is also important to mention any bias that may be present in your data and whether it could play a role in the outcomes. Finally, discuss why you would choose one model over another and which scoring metrics you used to make your decision.

For the metrics we calculated, we really want to focus on explained_variance, max_error, and r2 to determine which model is the best. For explained_variance, the best possible value is 1 and lower values are worse. For max_error, the best possible value is 0; the scorer reports the error as a negative number, so values closer to 0 are better. For r2, the best possible value is 1 and lower values are worse.
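The negative numbers in the outputs above come from how scikit-learn's scorers work: error metrics are negated so that a higher score always means a better model. Here is a small sketch of that convention on six made-up records (the data and variable names are hypothetical), using the 'neg_mean_squared_error' scorer from our list of metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Hypothetical data: one true/false attribute and a 0/1 label
X_s = np.array([[0.], [1.], [0.], [1.], [1.], [0.]])
y_s = np.array([0., 1., 0., 1., 0., 0.])

toy_reg = LinearRegression().fit(X_s, y_s)

# The plain metric: an error, so lower is better
mse = mean_squared_error(y_s, toy_reg.predict(X_s))

# The scorer used by cross_val_score: the same error with its sign flipped
scorer_value = get_scorer("neg_mean_squared_error")(toy_reg, X_s, y_s)

print(mse)
print(scorer_value)
```

When you compare models on any of the error-based metrics, the model whose score is closest to 0 has the smallest error.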

In this example, we see that for Boolean-based input data, the LinearRegression model gave us the best metric scores (explained_variance and r2). For the text-based input data, no single model gave us the best score on every metric: LinearRegression had the best explained_variance score, Lasso and ElasticNet tied for the best max_error score, and Ridge had the best r2 score.