# Machine Learning

"*[Machine learning](https://github.com/georgetown-analytics/machine-learning/blob/master/notebook/Tour%20de%20SciKit-Learn.ipynb) can classically be summarized with two methodologies: supervised and unsupervised learning. In supervised learning, the “correct answers” are annotated ahead of time and the algorithm tries to fit a decision space based on those answers. In unsupervised learning, algorithms try to group like examples together, inferring similarities via distance metrics. Machine learning allows us to handle new data in a meaningful way, predicting where new data will fit into our models.*"

We will be conducting supervised learning. 

#### Potential Issues (List any issues as they occur)

**Null Values**
   * Most Sklearn estimators do not handle *Null* values.
   * While the data is fairly clean, *Null* values occur in the lagged publication-origin variables. This is due to date ranges in which there were no publications from a given country/region. Solutions:
      * Recode missing values to the mean or median.
      * Drop the values with missing data.  
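
The two options above can be sketched on a small toy frame. The column name `lag_pub` and its values are made up for illustration; they are not taken from the project data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), with NaN where no publication exists
toy = pd.DataFrame({"lag_pub": [0.1, np.nan, 0.3, np.nan, 2.0]})

# Option 1: recode missing values to the mean or the median
filled_mean = toy["lag_pub"].fillna(toy["lag_pub"].mean())
filled_median = toy["lag_pub"].fillna(toy["lag_pub"].median())

# Option 2: drop the rows that contain missing data
dropped = toy.dropna(subset=["lag_pub"])

print(filled_mean.tolist())
print(filled_median.tolist())
print(len(dropped))
```

In this toy column the single large value pulls the mean (about 0.8) well above the median (0.3), which is the kind of difference to weigh when choosing between the two fills.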

In [1]:
%matplotlib inline

import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score
from sklearn.metrics import classification_report
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn import model_selection as cv



## Clean Training Data of Null values

Scikit-learn will expect numeric values and no blanks, so first we need to do a bit more wrangling.
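
As a rough sketch of what "numeric values and no blanks" means in practice, the check below lists the non-numeric columns and the columns that still hold nulls, then one-hot encodes a text column. The toy frame mimics a few of the project's column names with invented values, and using `pd.get_dummies` for `Day of Week` is an assumption, not necessarily the encoding used later.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values
toy = pd.DataFrame({
    "Day of Week": ["Sunday", "Monday", "Friday"],
    "Weekend Flag": [1, 0, 0],
    "lag_7": [np.nan, 0.033, 0.025],
})

# Columns that are not numeric and would need encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Columns that still contain blanks (NaN)
has_nulls = toy.columns[toy.isnull().any()].tolist()

# One-hot encode the text column as 0/1 integer columns
encoded = pd.get_dummies(toy, columns=["Day of Week"], dtype=int)

print(non_numeric)
print(has_nulls)
print(encoded.columns.tolist())
```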

In [2]:
# Import csv as a dataframe
path = "/Users/laurieottehenning/Documents/Georgetown Data Science /Capstone/Final Data Set"
os.chdir(path)
train = pd.read_csv('Final_Clean_Data.csv')

In [3]:
# Check that all of the data is there
train.columns

Index(['Unnamed: 0', 'Date', 'EventID', 'FinalRating', 'Day of Week',
       'Weekend Flag', 'year', 'month', 'Value_Average_Past_30_days', 'lag_30',
       'Value_Average_Past_14_days', 'lag_14', 'Value_Average_Past_7_days',
       'lag_7', 'Middle East lagpub_7', 'Other lagpub_7', 'UK lagpub_7',
       'US lagpub_7', 'Middle East lagpub_14', 'Other lagpub_14',
       'UK lagpub_14', 'US lagpub_14', 'Middle East lagpub_30',
       'Other lagpub_30', 'UK lagpub_30', 'US lagpub_30'],
      dtype='object')

In [4]:
train.head(3)

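|   | Unnamed: 0 | Date | EventID | FinalRating | Day of Week | Weekend Flag | year | month | Value_Average_Past_30_days | lag_30 | ... | UK lagpub_7 | US lagpub_7 | Middle East lagpub_14 | Other lagpub_14 | UK lagpub_14 | US lagpub_14 | Middle East lagpub_30 | Other lagpub_30 | UK lagpub_30 | US lagpub_30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2010-01-03 | NaN | 5.0 | Sunday | 1 | 2010.0 | 1.0 | NaN | -0.02008 | ... | NaN | -0.02008 | NaN | NaN | NaN | -0.02008 | NaN | NaN | NaN | -0.02008 |
| 1 | 1 | 2010-01-04 | NaN | 5.0 | Monday | 0 | 2010.0 | 1.0 | -0.02008 | 0.006581 | ... | 0.033242 | NaN | NaN | NaN | 0.033242 | NaN | NaN | NaN | 0.033242 | NaN |
| 2 | 2 | 2010-01-08 | NaN | 5.0 | Friday | 0 | 2010.0 | 1.0 | 0.006581 | 0.00973 | ... | 0.024635 | NaN | NaN | NaN | 0.024635 | NaN | NaN | NaN | 0.024635 | NaN |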


In [5]:
# Determine the shape of the data
print("{} instances with {} features\n".format(*train.shape))

# Determine the frequency of each rating
print(train.groupby('FinalRating')['FinalRating'].count())

2227 instances with 26 features

FinalRating
-1.0      93
 0.0      11
 1.0      63
 2.0     118
 3.0     118
 5.0    1824
Name: FinalRating, dtype: int64


In [6]:
# Check number of missing values
train.isnull().sum()

Unnamed: 0                       0
Date                             0
EventID                       1824
FinalRating                      0
Day of Week                      0
Weekend Flag                     0
year                             0
month                            0
Value_Average_Past_30_days       8
lag_30                           8
Value_Average_Past_14_days       8
lag_14                           8
Value_Average_Past_7_days       11
lag_7                            8
Middle East lagpub_7           828
Other lagpub_7                 996
UK lagpub_7                    541
US lagpub_7                    757
Middle East lagpub_14          828
Other lagpub_14                996
UK lagpub_14                   541
US lagpub_14                   757
Middle East lagpub_30          828
Other lagpub_30                996
UK lagpub_30                   541
US lagpub_30                   757
dtype: int64

#### Fill null values

We fill null values with the column mean. The median is an alternative that is less sensitive to outliers, and it is not yet clear which is the better choice for this data.

In [7]:
# Fill the null values with mean
cols = ['Value_Average_Past_30_days', 'lag_30',
       'Value_Average_Past_14_days', 'lag_14', 'Value_Average_Past_7_days',
       'lag_7', 'Middle East lagpub_7', 'Other lagpub_7', 'UK lagpub_7',
       'US lagpub_7', 'Middle East lagpub_14', 'Other lagpub_14',
       'UK lagpub_14', 'US lagpub_14', 'Middle East lagpub_30',
       'Other lagpub_30', 'UK lagpub_30', 'US lagpub_30']

for i in cols:
    train[i] = train[i].fillna(train[i].mean())

train.isnull().sum()

Unnamed: 0                       0
Date                             0
EventID                       1824
FinalRating                      0
Day of Week                      0
Weekend Flag                     0
year                             0
month                            0
Value_Average_Past_30_days       0
lag_30                           0
Value_Average_Past_14_days       0
lag_14                           0
Value_Average_Past_7_days        0
lag_7                            0
Middle East lagpub_7             0
Other lagpub_7                   0
UK lagpub_7                      0
US lagpub_7                      0
Middle East lagpub_14            0
Other lagpub_14                  0
UK lagpub_14                     0
US lagpub_14                     0
Middle East lagpub_30            0
Other lagpub_30                  0
UK lagpub_30                     0
US lagpub_30                     0
dtype: int64

In [8]:
# Drop these columns since they aren't useful to train on.
cols = ['Unnamed: 0', 'EventID', 'Value_Average_Past_30_days', 'Value_Average_Past_14_days',
        'Value_Average_Past_7_days']
train = train.drop(columns=cols)
train.columns

Index(['Date', 'FinalRating', 'Day of Week', 'Weekend Flag', 'year', 'month',
       'lag_30', 'lag_14', 'lag_7', 'Middle East lagpub_7', 'Other lagpub_7',
       'UK lagpub_7', 'US lagpub_7', 'Middle East lagpub_14',
       'Other lagpub_14', 'UK lagpub_14', 'US lagpub_14',
       'Middle East lagpub_30', 'Other lagpub_30', 'UK lagpub_30',
       'US lagpub_30'],
      dtype='object')

In [9]:
# Keep 2010-2014 for training so we can test on the later years (2015-2017)
train['year'].value_counts().sort_values()
yr = [2017,2016,2015]
# Filter rows by year value; Series.drop would remove index labels instead
train = train[~train['year'].isin(yr)]


In [10]:
train['year'].value_counts().sort_values()

2010.0    159
2011.0    298
2013.0    346
2012.0    360
2014.0    361
Name: year, dtype: int64
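
One way to make the hold-out explicit is to split on the `year` column and keep both halves. This is a minimal sketch on invented rows; the names `holdout_years`, `train_part`, and `test_part` are hypothetical, and the cut-off is an assumption based on the years listed in `yr` above.

```python
import pandas as pd

# Toy rows with invented ratings
toy = pd.DataFrame({
    "year": [2013.0, 2014.0, 2015.0, 2016.0, 2017.0],
    "FinalRating": [5.0, 3.0, 5.0, -1.0, 2.0],
})

holdout_years = [2015, 2016, 2017]

# Boolean mask: True for rows whose year falls in the hold-out period
is_holdout = toy["year"].isin(holdout_years)

train_part = toy[~is_holdout]  # earlier years, used for fitting
test_part = toy[is_holdout]    # later years, used for evaluation

print(train_part["year"].tolist())
print(test_part["year"].tolist())
```

Splitting by time rather than at random keeps later observations out of the training set, which matters here because the lag features are built from past values.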

## LOGISTIC REGRESSION

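Logistic regression models the probability of each class as a logistic (sigmoid) function of a weighted sum of the features. The resulting decision boundary is linear: a straight line in two dimensions, or a hyperplane in general, placed at the cutoff that best separates the classes in the training data.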