# Data-X Spring 2018: Homework 02

### Regression, Classification, Webscraping

**Authors:** Sana Iqbal (Part 1, 2, 3), Alexander Fred-Ojala (Extra Credit)


In this homework, you will work through exercises on prediction (classification and regression) and web scraping.


## Part 1


### Data:
__Data Source__:
The data file is uploaded to bCourses and is named __Energy.csv__.

The dataset was created by Angeliki Xifara (Civil/Structural Engineer) and was processed by Athanasios Tsanas (Oxford Centre for Industrial and Applied Mathematics, University of Oxford, UK).

__Data Description__:

The dataset contains eight attributes of a building (features, denoted X1...X8) and one response, the heating load on the building, y1.

* X1: Relative Compactness
* X2: Surface Area
* X3: Wall Area
* X4: Roof Area
* X5: Overall Height
* X6: Orientation
* X7: Glazing Area
* X8: Glazing Area Distribution
* y1: Heating Load


#### Q1: Read the data file in Python. Describe the data features in terms of type, distribution range and mean values. Plot the feature distributions. This step should give you clues about data sufficiency.

In [None]:
# Import Package
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost as xgb


%matplotlib inline

In [None]:
# Read the data file.
df = pd.read_csv('Energy.csv')

In [None]:
# Describing data (General)
print("Describing Data in a general view...")
df.describe()

In [None]:
# Describe data features in terms of type, distribution range and mean values.

def nice_display_basic_statistics(maxi, mini, mean):
    """
    Print in a nice way the data features distribution range and mean values

    Arguments:
        maxi -- python float containing the Max of the feature
        mini -- python float containing the Min of the feature
        mean -- python float containing the Mean of the feature

    """
    
    print("Max: ", maxi)
    print("Min: ", mini)
    print("Mean: ", mean)
        
        
def nice_display(column_name, dtype, maxi, mini, mean):
    """
    Print in a nice way the data features in terms of type, distribution range and mean values

    Arguments:
        column_name -- python string containing the name of the feature
        dtype -- python string containing the dtype of the feature
        maxi -- python float containing the Max of the feature
        mini -- python float containing the Min of the feature
        mean -- python float containing the Mean of the feature
        
    """
    
    if dtype == "float64":        
        print("The feature " + column_name + ": ")
        print("Type: Float, so it is continuous!")
        nice_display_basic_statistics(maxi, mini, mean)
              
    else:
        print("The feature " + column_name + ": ")
        print("Type: " + str(dtype) + " (not float), so it takes discrete values!")
        nice_display_basic_statistics(maxi, mini, mean)
        
    print("-" * 30)
    

In [None]:
# Describe data features in terms of type, distribution range and mean values.
for i in df.columns:
    nice_display(i, df[i].dtype, df[i].max(), df[i].min(), df[i].mean())

In [None]:
# Distribution of each variable.

f = plt.figure()

# Specify the grid
ax1 = plt.subplot2grid((3,3), (0,0))
ax2 = plt.subplot2grid((3,3), (0,1)) 
ax3 = plt.subplot2grid((3,3), (0,2))
ax4 = plt.subplot2grid((3,3), (1,0))
ax5 = plt.subplot2grid((3,3), (1,1))
ax6 = plt.subplot2grid((3,3), (1,2))
ax7 = plt.subplot2grid((3,3), (2,0)) 
ax8 = plt.subplot2grid((3,3), (2,1))
ax9 = plt.subplot2grid((3,3), (2,2))

ax1.hist(df["X1"], color="Green")
ax2.hist(df["X2"], color="Blue")
ax3.hist(df["X3"], color="Orange")
ax4.hist(df["X4"], color="Yellow")
ax5.hist(df["X5"], color="Purple")
ax6.hist(df["X6"], color="Black")
ax7.hist(df["X7"], color="Red")
ax8.hist(df["X8"], color="Magenta")
ax9.hist(df["Y1"], color="Cyan")

# Add titles
ax1.set_title('X1')
ax2.set_title('X2')
ax3.set_title('X3')
ax4.set_title('X4')
ax5.set_title('X5')
ax6.set_title('X6')
ax7.set_title('X7')
ax8.set_title('X8')
ax9.set_title('Y1')


f.suptitle('Feature Distributions!',fontsize=20, y=1.1) # y location

f.tight_layout()   

In [None]:
# Additional info
df.info()

In [None]:
# More graphs (with another method)
df.hist(figsize=(13,10))
plt.show()

In [None]:
# Relations between features
# scatter_matrix lives in pd.plotting; the old pd.tools.plotting path was removed from pandas.
pd.plotting.scatter_matrix(df,figsize=(13,10));

__REGRESSION__:
LABELS ARE CONTINUOUS VALUES.
Here the model is trained to predict a continuous value for each instance.
Given a feature vector as input, the trained model predicts a continuous value for that instance.
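
As a minimal sketch of what "predicting a continuous value" means, the snippet below fits a linear regression on synthetic, noise-free data. The data and the names `x_demo`, `y_demo` and `demo_model` are assumptions made for illustration and are not part of the homework dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3*x + 2 with no noise.
x_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3.0 * x_demo.ravel() + 2.0

demo_model = LinearRegression().fit(x_demo, y_demo)

# The prediction for an unseen feature vector is a real number, not a class.
new_point = np.array([[12.5]])
predicted_value = demo_model.predict(new_point)[0]
print("Predicted value:", predicted_value)
print("Slope:", demo_model.coef_[0], "Intercept:", demo_model.intercept_)
```

Because the toy data lie exactly on a line, the fitted slope and intercept should recover the values used to generate them.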

__Q2.1: Train a linear regression model on 85 percent of the given dataset. What are the intercept and coefficient values?__



In [None]:
# SHUFFLE data.
data = shuffle(df).reset_index(drop=True)

In [None]:
# Get NaNs
print('Number of NaNs in the dataframe:\n',data.isnull().sum())
data.head()

In [None]:
# Separate X from the Data Set.
X=data.iloc[:,:-1]
X.head()

In [None]:
# Get Labels from the Data Set.
Y=data['Y1']
Y.head()

In [None]:
# Which are my shapes?
print("Feature vector shape=", X.shape)
print("Class shape=", Y.shape)

####  Split Data

In [None]:
# Split data into Training and Validation set  using sklearn function.
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.15, random_state=100)
print ('Number of samples in training data:',len(x_train))
print ('Number of samples in validation data:',len(x_test))

#### Train Model

In [None]:
# Name our Linear Regression object.
LinearRegressionModel= LinearRegression()

LinearRegressionModel.fit(x_train, y_train)
Z_train=LinearRegressionModel.predict(x_train)
Z_test = LinearRegressionModel.predict(x_test)

# The Coefficients.
print('Coefficients:', LinearRegressionModel.coef_)

# The Intercept.
print('Intercept:', LinearRegressionModel.intercept_)


#### Q2.2: Report model performance using the 'ROOT MEAN SQUARE' error metric on:  
__1. Data that was used for training (training error)__   
__2. The 15 percent of unseen data (test error)__ 
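
The root mean squared error is the square root of the mean of the squared residuals. A minimal sketch follows; the helper name `rmse` and the toy values are assumptions for illustration, not values from Energy.csv.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values (hypothetical): residuals are -2, 2 and -3.
y_true_demo = [10.0, 20.0, 30.0]
y_pred_demo = [12.0, 18.0, 33.0]
print("RMSE:", rmse(y_true_demo, y_pred_demo))
```

Unlike the mean squared error, the RMSE is expressed in the same units as the label, which makes it easier to compare against typical heating load values.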



#### Root Mean Squared Error and R^2 Score for Training and Test.

In [None]:
# The Root Mean Squared Error (RMSE), as asked in the question.
print("Root mean squared error of training:",np.sqrt(np.mean((Z_train - y_train) ** 2)))
print("Root mean squared error of test:",np.sqrt(np.mean((Z_test - y_test) ** 2)))

# LinearRegression.score returns the coefficient of determination (R^2), not a classification accuracy.
print("R^2 for Training: ", LinearRegressionModel.score(x_train, y_train)* 100, "%")
print("R^2 for Test: ", LinearRegressionModel.score(x_test, y_test) * 100, "%")


__Q2.3: Let us see the effect of the amount of data on the performance of the prediction model. Use varying amounts of training data (100, 200, 300, 400, 500, all) to train regression models and report the training error and validation error in each case. The validation/test data is the same as above for all these cases.__  

Plot error rates vs. number of training examples. Comment on the relationship you observe in the plot between the amount of data used to train the model and the validation accuracy of the model.

__Hint:__ Use array indexing to choose varying data amounts
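
One way to read the hint is sketched below: hold out the validation set once, then slice the first `n` rows of the training portion with array indexing. The data here are synthetic (768 rows and 8 features are assumed as a stand-in), and the names `X_demo`, `y_demo`, `sizes` and `rmse` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-in for the homework data (assumption): 768 rows, 8 features.
X_demo = rng.rand(768, 8)
true_w = np.arange(1, 9, dtype=float)
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=768)

# Fix the validation set once so that every experiment is scored on the same rows.
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.15, random_state=100)

def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return np.sqrt(np.mean((model.predict(X) - y) ** 2))

sizes = [100, 200, 300, 400, 500, len(X_tr)]
train_errors, val_errors = [], []
for n in sizes:
    # Array indexing picks the first n training rows.
    model = LinearRegression().fit(X_tr[:n], y_tr[:n])
    train_errors.append(rmse(model, X_tr[:n], y_tr[:n]))
    val_errors.append(rmse(model, X_val, y_val))

for n, tr, va in zip(sizes, train_errors, val_errors):
    print(n, "train RMSE:", round(tr, 3), "validation RMSE:", round(va, 3))
```

Plotting `train_errors` and `val_errors` against `sizes` then gives the requested error-vs-data curve.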

#### Different Experiments, ASSUMING THAT THE AMOUNT OF DATA IN THE QUESTION REFERS TO THE WHOLE DATA SET! :)

In [None]:
def different_experiments(X, Y, amount_training_data, costs_train, costs_test, variances, biases):
    """
    Run one experiment: train on `amount_training_data` rows, then print and collect the results.

    Arguments:
        X -- Pandas DataFrame. Features of the whole data set.
        Y -- Pandas Series. Labels of the data set.
        amount_training_data -- Integer. Number of rows in the training set.
        costs_train -- List. Mean squared errors on the training set, one per experiment.
        costs_test -- List. Mean squared errors on the test set, one per experiment.
        variances -- List. Gaps between training and test R^2 scores (in percent),
                     used here as a rough proxy for variance.
        biases -- List. Training R^2 scores (in percent), used here as a rough
                  proxy for bias.

    Return:
        costs_train, costs_test, variances, biases -- the same lists, each extended
        with the result of this experiment.
        
    """
    
    error_test = 0
    
    # For beautiful display.
    print("-"*20, "AMOUNT OF TRAINING DATA: ", amount_training_data, " -"*20)
    
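    # Splitting with the amount of training data.
    # Recent scikit-learn versions reject an empty test set, so the border case
    # "train on everything" is handled without calling train_test_split.
    if amount_training_data == X.shape[0]:
        x_train, y_train = X, Y
        x_test, y_test = X.iloc[0:0], Y.iloc[0:0]
    else:
        x_train, x_test, y_train, y_test = train_test_split(
            X, Y, train_size=amount_training_data, random_state=40)
    print ('Number of samples in training data:',len(x_train))
    print ('Number of samples in validation data:',len(x_test))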
    
    # Name our linear regression object.
    LinearRegressionModel= LinearRegression()

    # Fit Model. 
    LinearRegressionModel.fit(x_train, y_train)
    
    # Calculate Predict Vector for Train.
    Z_train=LinearRegressionModel.predict(x_train)

    # The Coefficients.
    print('Coefficients:', LinearRegressionModel.coef_)
    
    # The Intercept.
    print('Intercept:', LinearRegressionModel.intercept_)
    
    # The mean squared error for Train.
    error_train = np.mean((Z_train - y_train) ** 2)
    print("Mean squared error of training:",error_train)
    
    # Costs, Bias or Accuracy for Training Set.
    costs_train.append(error_train)
    bias = LinearRegressionModel.score(x_train, y_train)* 100
    biases.append(bias)
    
    # For the border case that we don't have Test Set, the Training Set have the whole Data Set.
    if amount_training_data != X.shape[0]:
        # Calculate Predict Vector for Test.
        Z_test = LinearRegressionModel.predict(x_test)
        
        # The mean squared error for Test.
        error_test = np.mean((Z_test - y_test) ** 2)
        costs_test.append(error_test)
        print("Mean squared error of test:",error_test)
        
        # Costs, Bias or Accuracy for Test Set.
        test_accuracy = LinearRegressionModel.score(x_test, y_test) * 100
        variance =  bias - test_accuracy
        variances.append(variance)
        print("Accuracy for Test", test_accuracy, "%")
        print("Variance: ", variance, "\n")
    
    # Printing Accuracy for Training.
    print("Accuracy for Training: ", bias, "%")

    return costs_train, costs_test, variances, biases

In [None]:
# I AM ASSUMING THAT THE AMOUNT OF DATA IN THE QUESTION REFERS TO THE WHOLE DATA SET! :)

costs_train = []
costs_tests = []
variances = []
biases = []

costs_train, costs_tests, variances, biases = different_experiments(X,Y,100, costs_train, costs_tests, variances, biases)
costs_train, costs_tests, variances, biases = different_experiments(X,Y,200, costs_train, costs_tests, variances, biases)
costs_train, costs_tests, variances, biases = different_experiments(X,Y,300, costs_train, costs_tests, variances, biases)
costs_train, costs_tests, variances, biases = different_experiments(X,Y,400, costs_train, costs_tests, variances, biases)
costs_train, costs_tests, variances, biases = different_experiments(X,Y,500, costs_train, costs_tests, variances, biases)
costs_train, costs_tests, variances, biases = different_experiments(X,Y,X.shape[0], costs_train, costs_tests, variances, biases)

plt.plot(costs_train, label='training')
plt.plot(costs_tests, label='test')
plt.ylabel('cost')
plt.xlabel('experiment index (0 = 100 training rows, ..., 5 = all rows)')
plt.title("MEAN SQUARED ERROR BY EXPERIMENT")
plt.legend()
plt.show()

plt.plot(variances)
plt.ylabel('variance')
plt.xlabel('amount of training data (per hundreds)')
plt.title("VARIANCE BY EXPERIMENT")
plt.show()

plt.plot(biases)
plt.ylabel('bias')
plt.xlabel('amount of training data (per hundreds)')
plt.title("BIAS BY EXPERIMENT")
plt.show()

#### __CLASSIFICATION__:
LABELS ARE DISCRETE VALUES.
Here the model is trained to classify each instance into one of a set of predefined discrete classes.
Given a feature vector as input, the trained model predicts the class of that instance. You can also output the probabilities of an instance belonging to each class.  
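
A minimal sketch of class prediction versus class probabilities, using scikit-learn's `predict` and `predict_proba` on toy data. The one-feature dataset and the names `x_demo`, `y_demo` and `clf` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data with three classes (hypothetical values).
x_demo = np.array([[1.0], [2.0], [3.0], [11.0], [12.0], [13.0], [21.0], [22.0], [23.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

clf = LogisticRegression(max_iter=1000).fit(x_demo, y_demo)

sample = np.array([[12.0]])
predicted_class = clf.predict(sample)[0]            # a single discrete label
class_probabilities = clf.predict_proba(sample)[0]  # one probability per class

print("Predicted class:", predicted_class)
print("Class probabilities:", class_probabilities)
```

The predicted class is the one with the highest probability, and the probabilities over all classes sum to one.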

__Q3.1: Bucket the values of 'y1', i.e. 'Heating Load', from the original dataset into 3 classes:__ 

0: 'Low' ( < 15),   
1: 'Medium'  (15-30),   
2: 'High'  (>30)

This converts the given dataset into a classification problem, the classes being a heating load that is *low, medium or high*. Use this dataset with the transformed 'heating load' to create a logistic regression classification model that predicts the heating load type of a building. Use a test-train split ratio of 0.15.  

*Report training and test accuracies and  confusion matrices.*


**HINT:** Use pandas.cut
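
As a sketch of the hint, `pandas.cut` accepts explicit bin edges, which matches the thresholds in the question more closely than `bins=3` (that option creates three equal-width bins over the observed range). The values in `heating_load_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical heating loads around the two thresholds.
heating_load_demo = pd.Series([6.0, 14.9, 15.0, 22.5, 30.0, 30.1, 43.1])

# Bins are right-closed by default: (-inf, 15], (15, 30], (30, inf).
classes = pd.cut(heating_load_demo, bins=[-np.inf, 15, 30, np.inf], labels=[0, 1, 2])
print(classes.tolist())
```

Note that with right-closed bins a value of exactly 15 lands in class 0; passing `right=False` would move it to class 1, at the cost of moving exactly 30 to class 2.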

#### Prepare Data

In [None]:
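# Cut the data frame into the three classes defined in the question.
# Explicit bin edges are used because bins=3 would create three equal-width bins instead.
# With the default right-closed bins, exactly 15 falls into "Low" and exactly 30 into "Medium".
df["Y1"] = pd.cut(df["Y1"], bins=[-np.inf, 15, 30, np.inf], labels=["Low", "Medium", "High"])
df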

In [None]:
# Shuffle the data and store it in a new variable "data2", since we are still in Part 1.
data2= shuffle(df).reset_index(drop=True)
data2.head()

#### Data Analysis

In [None]:
# Get NaNs
print('Number of NaNs in the dataframe2:\n',data2.isnull().sum())

In [None]:
# Information about data.
data2.describe()

In [None]:
# Get feature distribution of each continuous valued feature
data2.hist(figsize=(15,5))
plt.show()

In [None]:
# Separate X from the Data Set.
X2=data2.iloc[:,:-1]
X2.head()

In [None]:
# Get Labels from the Data Set.
Y2=data2['Y1']
Y2.head()

In [None]:
# How many values I have in each Label?
Y2.value_counts()

In [None]:
# Map labels to integers, following the encoding given in the question:
    # - Low: 0
    # - Medium: 1
    # - High: 2
Y2=Y2.map({'Low': 0, 'Medium': 1, 'High': 2}).astype(int)
print (Y2.value_counts()) 

# Show how is my Label array.
Y2.head()

In [None]:
# Which are my shapes?
print("Feature vector shape=", X2.shape)
print("Class shape=", Y2.shape)

#### Data Split

In [None]:
# Split data into Training Set and Validation Set  using sklearn function. 
# Store the different data with the suffix "2", since we are still in Part 1.
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, Y2, test_size=0.15, random_state=100)
print ('Number of samples in training data:',len(x_train2))
print ('Number of samples in validation data:',len(x_test2))


#### Train Model

In [None]:
# Name our Logistic Regression object
LogisticRegressionModel = LogisticRegression()

# Train the Model
LogisticRegressionModel.fit(x_train2, y_train2)

#### Training Accuracy

In [None]:
# Training Accuracy: Float in the variable: training_accuracy2 
training_accuracy2 = LogisticRegressionModel.score(x_train2,y_train2)
print ('Training Accuracy:',training_accuracy2)

#### Prediction of Train

In [None]:
# Prediction2 of Train.
prediction_train2 = LogisticRegressionModel.predict(x_train2) 

#### Find Error

In [None]:
def find_error(real_label,predicted_label):
    """
    Find the error between Prediction and "Real Label".

    Arguments:
        real_label -- label in data
        predicted_label -- label predicted by the model

    """
    
    # Empty array of Zeros, with the length of "real_label"
    Loss_Array = np.zeros(len(real_label))
    
    for i,value in enumerate(real_label):
        
        if value == predicted_label[i]: 
            Loss_Array[i] = 0
        else:
            Loss_Array[i] = 1

    print ("Y-realLabel   Z-predictedLabel   Error \n")
    for i,value in enumerate(real_label):
        print (value,"\t\t" ,predicted_label[i],"\t\t",Loss_Array[i])
        
    error_rate = np.average(Loss_Array)
    print ("\nThe error rate is ", error_rate)
    print ('\nThe accuracy of the model is ',1-error_rate )

#### Error of Experiment

In [None]:
find_error(y_train2, prediction_train2)

####  Validation Accuracy and Variance: 

In [None]:
# Validation Accuracy: Float in the variable: validation_accuracy2 
validation_accuracy2 = LogisticRegressionModel.score(x_test2,y_test2)
print('Accuracy of the model on unseen validation data: ',validation_accuracy2)

# Variance: Float. Gap between training and validation accuracy, used here as a rough proxy for variance.
variance2 = training_accuracy2 - validation_accuracy2
print("Variance: ", variance2)

#### Prediction of Test.

In [None]:
# Prediction of Test.
y_pred2 = LogisticRegressionModel.predict(x_test2)

#### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Confusion Matrix. We're looking for the Diagonal.
ConfusionMatrix = pd.DataFrame(confusion_matrix(y_test2, y_pred2),columns=['Predicted 0','Predicted 1','Predicted 2'],index=['Actual 0','Actual 1','Actual 2'])
print ('Confusion matrix of test data is: \n',ConfusionMatrix)


__Q3.2: One of the preprocessing steps in data science is feature scaling, i.e. getting all our data on the same scale by setting the same min and max for all feature values. This makes training less sensitive to the scale of the features. Scaling is important in algorithms that use distance-based classification (SVM or k-means) or that involve gradient descent optimization. If we scale features to the range [0,1], it is called unity-based normalization.__

__Perform unity-based normalization on the above dataset and train the model again. Compare the model performance in training and validation with your previous model.__  

Refer to: http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler  
More at: https://en.wikipedia.org/wiki/Feature_scaling
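
Unity-based normalization maps each feature to `(x - min) / (max - min)`. The sketch below computes it by hand and with scikit-learn's `MinMaxScaler` on a small hypothetical matrix `X_demo`, so the two results can be compared.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# Manual min-max scaling, column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
X_sklearn = MinMaxScaler().fit_transform(X_demo)

print(X_manual)
print(X_sklearn)
```

This differs from `preprocessing.scale`, which standardizes each feature to zero mean and unit variance rather than to the range [0, 1].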

#### Pre-Processing with the Same Scale

In [None]:
from sklearn import preprocessing
print(x_train2.head())

# Unity-based (min-max) normalization: rescale every feature to the range [0, 1].
# preprocessing.scale would standardize instead (zero mean, unit variance).
x_scaled = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(X2), columns=X2.columns)
print(x_scaled)

In [None]:
# Which are my shapes?
print("Feature vector shape=", x_scaled.shape)
print("Class shape=", Y2.shape)

#### Split Data

In [None]:
# Split data into Training Set and Validation Set  using sklearn function. 
# Store the different data with the suffix "3", since we are still in Part 1.
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_scaled, Y2, test_size=0.15, random_state=100)
print ('Number of samples in training data:',len(x_train3))
print ('Number of samples in validation data:',len(x_test3))

#### Train Scaled Model

In [None]:
# Name our Logistic Regression object for the scaled data.
LogisticRegressionModel2 = LogisticRegression()

# Train the Model
LogisticRegressionModel2.fit(x_train3, y_train3)

#### Training Accuracy

In [None]:
# Training Accuracy: Float in the variable: training_accuracy3
training_accuracy3 = LogisticRegressionModel2.score(x_train3,y_train3)
print ('Training Accuracy:',training_accuracy3)

#### Prediction in Train.

In [None]:
# Prediction3 in Train.
prediction_train3 = LogisticRegressionModel2.predict(x_train3) 

#### Error of Experiment

In [None]:
# Find error:
find_error(y_train3, prediction_train3)

####  Validation Accuracy and Variance: 

In [None]:
# Validation Accuracy: Float in the variable: validation_accuracy3
validation_accuracy3 = LogisticRegressionModel2.score(x_test3,y_test3)
print('Accuracy of the model on unseen validation data: ',validation_accuracy3)

# Variance: Float. Difference between Training and Test.
variance3 = training_accuracy3 - validation_accuracy3
print("Variance: ", variance3)

#### Prediction in Test.

In [None]:
# Prediction3 in Test.
y_pred3 = LogisticRegressionModel2.predict(x_test3)

#### Confusion Matrix

In [None]:

# Confusion Matrix. We're looking for the Diagonal.
ConfusionMatrix2 = pd.DataFrame(confusion_matrix(y_test3, y_pred3),columns=['Predicted 0','Predicted 1','Predicted 2'],index=['Actual 0','Actual 1','Actual 2'])
print ('Confusion matrix of test data is: \n',ConfusionMatrix2)


#### Comparison of the Two Models

#### Training Accuracy Comparison

In [None]:
# Training Accuracy Comparison
experiments_t = ("Original Model", "Scaled Model")
accuracies_t = [training_accuracy2, training_accuracy3]
 
plt.bar(np.arange(len(experiments_t)), accuracies_t, align='center', alpha=1)
plt.xticks(np.arange(len(experiments_t)), experiments_t)
plt.ylabel('Training Accuracy')
plt.title('Training Accuracy Comparison')
 
plt.show()

#### Validation Accuracy Comparison

In [None]:
# Validation Accuracy Comparison
experiments_v = ("Original Model", "Scaled Model")
accuracies_v = [validation_accuracy2, validation_accuracy3]
 
plt.bar(np.arange(len(experiments_v)), accuracies_v, align='center', alpha=1)
plt.xticks(np.arange(len(experiments_v)), experiments_v)
plt.ylabel('Validation Accuracy')
plt.title('Validation Accuracy Comparison')
 
plt.show()

#### Variances Comparison

In [None]:
# Variances Comparison
experiments = ("Original Model", "Scaled Model")
variances = [variance2, variance3]
 
plt.bar(np.arange(len(experiments)), variances, align='center', alpha=1)
plt.xticks(np.arange(len(experiments)), experiments)
plt.ylabel('Variances')
plt.title('Variances Comparison')
 
plt.show()

## Part 2



__1. Read the__ `diabetesdata.csv` __file into a pandas dataframe. Analyze the data features and check for NaN values. 
About the data:__

1. __TimesPregnant__: Number of times pregnant 
2. __glucoseLevel__: Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. __BP__: Diastolic blood pressure (mm Hg)  
5. __insulin__: 2-Hour serum insulin (mu U/ml) 
6. __BMI__: Body mass index (weight in kg/(height in m)^2) 
7. __pedigree__: Diabetes pedigree function 
8. __Age__: Age (years) 
9. __IsDiabetic__: 0 if not diabetic or 1 if diabetic 

#### Read Data

In [None]:
# Read the data file.
df = pd.read_csv('diabetesdata.csv')

#### Analysis of the Data Features

In [None]:
# Describing data (General)
print("Describing Data in a general view...")
df.describe()

In [None]:
# Describe data features in terms of type, distribution range and mean values.
for i in df.columns:
    nice_display(i, df[i].dtype, df[i].max(), df[i].min(), df[i].mean())

In [None]:
# Additional info
df.info()

In [None]:
# Distributions for each feature
df.hist(figsize=(13,10))
plt.show()

In [None]:
# Relations between features
# scatter_matrix lives in pd.plotting; the old pd.tools.plotting path was removed from pandas.
pd.plotting.scatter_matrix(df,figsize=(13,10));

In [None]:
# SHUFFLE data.
data = shuffle(df).reset_index(drop=True)

In [None]:
# Balanced data set?
data['IsDiabetic'].value_counts()

In [None]:
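# Baseline accuracy: fraction of the non-diabetic class (label 0),
# i.e. the accuracy of a model that always predicts 0.
class_counts = data['IsDiabetic'].value_counts()
class_counts[0] / class_counts.sum()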

#### Check NaN Values

In [None]:
# Get NaNs
print('Number of NaNs in the dataframe:\n',data.isnull().sum())
data.head()

__2. Preprocess the data to replace NaN values in a feature (if any) using the mean of the feature.  
Train logistic regression, SVM, perceptron, kNN, xgboost and random forest models using this preprocessed data with a 20% test split. Report training and test accuracies.__
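
Mean imputation can be sketched with `DataFrame.fillna`, which accepts a Series of per-column fill values. The tiny frame `demo` below is hypothetical and only reuses two column names from the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
demo = pd.DataFrame({
    "glucoseLevel": [100.0, np.nan, 140.0],
    "Age": [30.0, 50.0, np.nan],
})

# demo.mean() skips NaNs by default and returns one mean per column;
# fillna then fills each column with its own mean.
demo_filled = demo.fillna(demo.mean())
print(demo_filled)
```

Filling all columns at once avoids having to name each feature that happens to contain NaNs.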


#### Preprocess data to replace NaN values in a feature(if any) using mean of the feature.

In [None]:
data["glucoseLevel"]= data["glucoseLevel"].fillna(data["glucoseLevel"].mean())
data["Age"]= data["Age"].fillna(data["Age"].mean())

In [None]:
# Get NaNs again.
print('Number of NaNs in the dataframe:\n',data.isnull().sum())

##### Split Data

In [None]:
# Separate X from the Data Set.
X=data.iloc[:,:-1]
X.head()

In [None]:
# Get Labels from the Data Set.
Y=data['IsDiabetic']
Y.head()

In [None]:
# Which are my shapes?
print("Feature vector shape=", X.shape)
print("Class shape=", Y.shape)

In [None]:
# Split Data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=100)
print ('Number of samples in training data:',len(x_train))
print ('Number of samples in validation data:',len(x_test))

#### Train Different Models

In [None]:
def model_flow(Model, x_train, x_test, y_train, y_test):
    """
    Run the whole flow of a model: Fit, Training Accuracy, Error by Example, Validation Accuracy and Variance.

    Arguments:
        Model -- scikit-learn style estimator exposing fit, predict and score.
        x_train -- Pandas Dataframe. Features Training Set.
        x_test -- Pandas Dataframe. Features Test Set.
        y_train -- Pandas Series. Label Train Set.
        y_test -- Pandas Series. Label Test Set.
        
    """
    
    # Train the Model
    Model.fit(x_train, y_train)
    
    # Training Accuracy
    training_accuracy = Model.score(x_train, y_train)
    print ('Training Accuracy:',training_accuracy)

    # Prediction of Train.
    prediction_train = Model.predict(x_train) 

    # Find Error
    find_error(y_train, prediction_train)

    # Validation Accuracy
    validation_accuracy = Model.score(x_test,y_test)
    print('Accuracy of the model on unseen validation data: ',validation_accuracy)

    # Prediction of Test.
    y_pred = Model.predict(x_test)

    # Variance: Float. Difference between Training and Test.
    variance = training_accuracy - validation_accuracy
    print("Variance: ", variance)
    
    
def different_models(model_name, x_train, x_test, y_train, y_test):
    """
    Differentiate models and call model_flow function

    Arguments:
        model_name -- Python String. Corresponding to the model name.
        x_train -- Pandas Dataframe. Features Training Set.
        x_test -- Pandas Dataframe. Features Test Set.
        y_train -- Pandas Dataframe. Label Train Set.
        y_test -- Pandas Dataframe. Label Test Set.
        
    """
    if model_name == "LogisticRegression":
        # Model Object
        model = LogisticRegression()
        model_flow(model, x_train, x_test, y_train, y_test)
       
    elif model_name == "SVM":
        # Model Object
        model = SVC()
        model_flow(model, x_train, x_test, y_train, y_test)
        
    elif model_name == "Perceptron":
        # Model Object
        model = Perceptron()
        model_flow(model, x_train, x_test, y_train, y_test)
        
    elif model_name == "kNN":
        # Model Object
        model = KNeighborsClassifier(n_neighbors = 2)
        model_flow(model, x_train, x_test, y_train, y_test)
        
    elif model_name == "xgboost":
        # Model Object
        model = xgb.XGBClassifier(n_estimators=1000)
        model_flow(model, x_train, x_test, y_train, y_test)
        
    elif model_name == "random forest":
        # Model Object
        model = RandomForestClassifier(n_estimators=1000)
        model_flow(model, x_train, x_test, y_train, y_test)

In [None]:
different_models("LogisticRegression", x_train, x_test, y_train, y_test)

In [None]:
different_models("SVM", x_train, x_test, y_train, y_test)

In [None]:
different_models("Perceptron", x_train, x_test, y_train, y_test)

In [None]:
different_models("kNN", x_train, x_test, y_train, y_test)

In [None]:
different_models("xgboost", x_train, x_test, y_train, y_test)

In [None]:
different_models("random forest", x_train, x_test, y_train, y_test)



__3. What is the ratio of diabetic persons in 3 equirange bands of 'BMI' and 'Pedigree' in the provided dataset?__

__Convert these features ('BP', 'insulin', 'BMI' and 'Pedigree') into categorical values by mapping different bands of values of these features to the integers 0, 1, 2.__  
 
HINT: Use pd.cut with bins=3 to create 3 bins
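
A small sketch of the idea: since `IsDiabetic` is 0 or 1, its mean within each band equals the ratio of diabetic persons in that band. The frame `demo` and the column name `BMI_band` are hypothetical.

```python
import pandas as pd

# Hypothetical data: BMI values spread over three equal-width bands.
demo = pd.DataFrame({
    "BMI": [18.0, 20.0, 30.0, 32.0, 44.0, 48.0],
    "IsDiabetic": [0, 0, 0, 1, 1, 1],
})

# bins=3 splits the observed range of BMI into three bands of equal width.
demo["BMI_band"] = pd.cut(demo["BMI"], bins=3, labels=[0, 1, 2])

# Mean of a 0/1 column = fraction of ones = ratio of diabetic persons per band.
ratio_by_band = demo.groupby("BMI_band", observed=False)["IsDiabetic"].mean()
print(ratio_by_band)
```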






#### Cut features: BP, insulin, BMI & Pedigree

In [None]:
data["BP"] = pd.cut(data["BP"], bins=3, labels=[0, 1, 2])
data["insulin"] = pd.cut(data["insulin"], bins=3, labels=[0, 1, 2])
data["BMI"] = pd.cut(data["BMI"], bins=3, labels=[0, 1, 2])
data["Pedigree"] = pd.cut(data["Pedigree"], bins=3, labels=[0, 1, 2])
data.head()

In [None]:
# BP Ratio
data[["BP", "IsDiabetic"]].groupby(["BP"]).mean()

In [None]:
# Insulin Ratio
data[["insulin", "IsDiabetic"]].groupby(["insulin"]).mean()

In [None]:
# BMI Ratio
data[["BMI", "IsDiabetic"]].groupby(["BMI"]).mean()

In [None]:
# Pedigree Ratio
data[["Pedigree", "IsDiabetic"]].groupby(["Pedigree"]).mean()


__4. Now consider the original dataset again. Instead of filling the NaN values with the mean of the feature, we will try assigning values to the NaNs based on some hypothesis. For example, for age we assume that the relation between the BMI and BP of people is a reflection of their age group. We can have 9 types of BMI and BP relations, and our aim is to find the median age of each of those groups:__

Your Age guess matrix will look like this:  

| BMI | 0       | 1      | 2  |
|-----|-------------|------------- |----- |
| BP  |             |              |      |
| 0   | a00         | a01          | a02  |
| 1   | a10         | a11          | a12  |
| 2   | a20         | a21          |  a22 |


__Create a guess_matrix  for NaN values of *'Age'* ( using 'BMI' and 'BP')  and  *'glucoseLevel'*  (using 'BP' and 'Pedigree') for the given dataset and assign values accordingly to the NaNs in 'Age' or *'glucoseLevel'* .__


Refer to how we guessed age in the titanic notebook in the class.
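
The guess-matrix idea can be sketched as follows: for each (BMI band, BP band) pair, take the median age of the rows with a known age and use it to fill the missing ages in that group. The frame `demo` is hypothetical; groups with no known age keep a NaN entry in the matrix instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical data: bands are already encoded as 0, 1, 2.
demo = pd.DataFrame({
    "BMI": [0, 0, 0, 1, 1, 1],
    "BP":  [0, 0, 0, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 50.0, 60.0, np.nan],
})

guess_demo = np.full((3, 3), np.nan)  # indexed as [BMI band, BP band]
for bmi_band in range(3):
    for bp_band in range(3):
        in_group = (demo["BMI"] == bmi_band) & (demo["BP"] == bp_band)
        known_ages = demo.loc[in_group, "Age"].dropna()
        if len(known_ages) > 0:
            guess_demo[bmi_band, bp_band] = known_ages.median()
            # Fill only the missing ages of this group.
            demo.loc[in_group & demo["Age"].isnull(), "Age"] = guess_demo[bmi_band, bp_band]

print(guess_demo)
print(demo)
```

The same loop applies to 'glucoseLevel' by swapping in the 'BP' and 'Pedigree' bands and a second matrix.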



#### Create Guess Matrix for Age and glucoseLevel

In [None]:
# Get NaNs.
print('Number of NaNs in the dataframe:\n',df.isnull().sum())

In [None]:
df["BP"] = pd.cut(df["BP"], bins=3, labels=[0, 1, 2])
df["insulin"] = pd.cut(df["insulin"], bins=3, labels=[0, 1, 2])
df["BMI"] = pd.cut(df["BMI"], bins=3, labels=[0, 1, 2])
df["Pedigree"] = pd.cut(df["Pedigree"], bins=3, labels=[0, 1, 2])
df.head()

In [None]:
guess_age = np.zeros((3,3),dtype=int) # Initialize Matrix
guess_age

In [None]:
guess_glucose_level = np.zeros((3,3),dtype=int) # Initialize Matrix
guess_glucose_level

In [None]:
for i in range(0, 3):
    for j in range(0, 3):
        aux_age = df[(df['BMI'] == i) & (df['BP'] == j)]['Age'].dropna().median()
        guess_age[i,j] = int(aux_age)
        
        aux_glucose_level = df[(df['BP'] == i) & (df['Pedigree'] == j)]['glucoseLevel'].dropna().median()
        guess_glucose_level[i,j] = int(aux_glucose_level)

In [None]:
# Guess Age Matrix
guess_age

In [None]:
# Guess Glucose Level Matrix
guess_glucose_level

In [None]:
# Replace NaN Values of Age with the Guess Age Matrix
for i in range(0, 3):
        for j in range(0, 3):
            df.loc[ (df["Age"].isnull()) & (df['BMI'] == i)& (df['BP'] == j),'Age'] = guess_age[i,j]
                    

df['Age'] = df['Age'].astype(int)

In [None]:
# Replace NaN Values of glucoseLevel with the Guess Glucose Level Matrix
for i in range(0, 3):
        for j in range(0, 3):
            df.loc[ (df["glucoseLevel"].isnull()) & (df['BP'] == i) & \
                   (df['Pedigree'] == j),'glucoseLevel'] = guess_glucose_level[i,j]
                    

df['glucoseLevel'] = df['glucoseLevel'].astype(int)

In [None]:
# Get NaNs, to check my procedure above
print('Number of NaNs in the dataframe:\n',df.isnull().sum())

In [None]:
df.head()



__5. Now, convert 'glucoseLevel' and 'Age' features also to categorical variables of 5 categories each.__

__Use this dataset (with all features in categorical form) to train perceptron, logistic regression and random forest models using 20% test split. Report training and test accuracies.__







#### Pre-Processing: Make glucoseLevel and Age categorical Variables

In [None]:
df["Age"] = pd.cut(df["Age"], bins=5, labels=[0, 1, 2, 3, 4])
df["glucoseLevel"] = pd.cut(df["glucoseLevel"], bins=5, labels=[0, 1, 2, 3, 4])
df.head()
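
To see what `pd.cut` with `bins=5` does, the sketch below bins a small hypothetical series of ages (not taken from the homework data) and prints the bin edges. The bins have equal width over the observed range, so they do not necessarily hold equal numbers of samples.

```python
import pandas as pd

# Hypothetical ages, used only to illustrate equal-width binning.
ages = pd.Series([21, 25, 33, 47, 52, 68, 81])

# retbins=True also returns the bin edges that pd.cut computed.
binned, edges = pd.cut(ages, bins=5, labels=[0, 1, 2, 3, 4], retbins=True)

print(edges)                            # 6 edges define 5 equal-width bins
print(binned.value_counts(sort=False))  # how many samples fall into each bin
```

Here bin 1 stays empty because no age falls between 33 and 45. If equal-sized groups are preferred, `pd.qcut` bins by quantiles instead.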

#### Split Data

In [None]:
# Separate X from the Data Set.
X=df.iloc[:,:-1]
X.head()

In [None]:
# Get Labels from the Data Set.
Y=df['IsDiabetic']
Y.head()

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=100)
print('Number of samples in training data:', len(x_train))
print('Number of samples in test data:', len(x_test))

#### Perceptron

In [None]:
different_models("Perceptron", x_train, x_test, y_train, y_test)

#### Logistic Regression

In [None]:
different_models("LogisticRegression", x_train, x_test, y_train, y_test)

#### Random Forest

In [None]:
different_models("random forest", x_train, x_test, y_train, y_test)
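
The helper `different_models` is defined elsewhere in this notebook and its body is not shown in this section. As a rough sketch of what such a comparison involves, the example below fits the three model types on synthetic data and reports training and test accuracy. The helper `fit_and_score`, the synthetic data, and the `max_iter` and `random_state` settings are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data (assumption: 8 features, binary label).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)


def fit_and_score(model, X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: fit a model and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


models = {
    "Perceptron": Perceptron(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
}

results = {
    name: fit_and_score(model, X_tr, X_te, y_tr, y_te)
    for name, model in models.items()
}

for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

A large gap between training and test accuracy, which is common for an untuned random forest, is a sign of overfitting.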

### Part 3

1. __Derive the expression for the optimal parameters in the linear regression equation, i.e. solve the normal equation for Ordinary Least Squares for the case of Simple Linear Regression, when we only have one input and one output__

Given a set of _n_ points $(X_i,Y_i)$, where $Y_i$ depends on $X_i$ through a linear relation, find the best-fit line $$Z_i = aX_i + b$$ that minimizes the __sum of squared errors in Y__, i.e. $$\text{minimize} \sum_{i}{(Y_i- Z_i)^2}$$
__i.__ Show that $$ \text{intercept} \quad b = \overline{Y} - a\,\overline{X} \quad \text{and} \quad \text{slope} \quad a= \frac{\sum_{i}(X_i- \overline{X})(Y_i- \overline{Y})}{ \sum_{i}(X_i- \overline{X})^2}$$


 where $\overline{X}$ and  $\overline{Y}$ are the averages of the X values and the Y values, respectively.
 
__ii.__ Show that the slope _a_ can be written as $a = r\,(S_y/S_x)$, where $S_y$ is the standard deviation of the Y values, $S_x$ is the standard deviation of the X values, and _r_ is the correlation coefficient.

##### Please try to write a nicely LaTeXed version of your answer, and present the derivations as clearly as possible


_____

i.
$$ (1) \ \ Z_i = {aX_i + b} \\
   (2) \ \ \sum_{i}{(Y_i- Z_i)^2}
$$

Substitute (1) into (2):

$$ (3) \ \ \sum_{i}{(Y_i- {(aX_i + b)})^2} \\
   (4) \ \ \sum_{i}{(Y_i- aX_i - b)^2}
$$

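Take the partial derivative of (4) with respect to b:
$$
   (5) \ \ \sum_{i}{2 \cdot (Y_i- aX_i - b) \cdot (-1)}\\
   (6) \ \ \sum_{i}{( -2Y_i + 2aX_i + 2b)}
$$

Set (6) equal to zero and solve for b. The sum has n terms, so $\sum_{i} 2b = 2nb$:
$$
   (7) \ \ \sum_{i}{( -2Y_i + 2aX_i + 2b)} = 0 \\
   (8) \ \ b = \frac{1}{n}\sum_{i}{( Y_i - aX_i)}
$$

Then, since $\overline{Y} = \frac{1}{n}\sum_{i} Y_i$ and $\overline{X} = \frac{1}{n}\sum_{i} X_i$:
$$
    b = \overline{Y} - a\overline{X} \\
    \blacksquare
$$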

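To find the slope a, take the partial derivative of (4) with respect to a, set it to zero, and substitute $b = \overline{Y} - a\overline{X}$:
$$
    (9) \ \ \sum_{i}{2 \cdot (Y_i- aX_i - b) \cdot (-X_i)} = 0\\
    (10) \ \ \sum_{i}{(-Y_iX_i + aX_i^2 + X_i\overline{Y} - aX_i\overline{X})} = 0
$$
Then solve for a:

$$
    (11) \ \ a = \frac{\sum_{i}X_i(Y_i- \overline{Y})}{\sum_{i}X_i(X_i- \overline{X})}
$$

Because $\sum_{i}(Y_i- \overline{Y}) = 0$ and $\sum_{i}(X_i- \overline{X}) = 0$, subtracting $\overline{X}\sum_{i}(Y_i- \overline{Y})$ from the numerator and $\overline{X}\sum_{i}(X_i- \overline{X})$ from the denominator changes neither, so

$$
    \quad a= \frac{\sum_{i}(X_i- \overline{X})(Y_i- \overline{Y})}{ \sum_{i}(X_i- \overline{X})^2} \\
    \blacksquare
$$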
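
A quick numerical sanity check can complement the derivation. The sketch below uses synthetic data (the true slope 3.0, intercept 1.5 and noise level are arbitrary assumptions). It computes the slope as the sum of cross-deviations divided by the sum of squared X deviations, computes the intercept as $\overline{Y} - a\overline{X}$, and compares both with `np.polyfit`. It also checks the identity $a = r\,(S_y/S_x)$ from part ii.

```python
import numpy as np

# Synthetic data: a noisy line with arbitrary slope and intercept.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + 1.5 + rng.normal(scale=0.5, size=200)

# Closed-form OLS estimates for simple linear regression.
x_bar, y_bar = x.mean(), y.mean()
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - a * x_bar

# Part ii: slope expressed through the correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
a_from_r = r * y.std() / x.std()

# Reference fit from NumPy (degree-1 polynomial: slope first, then intercept).
a_ref, b_ref = np.polyfit(x, y, deg=1)

print("closed form:", a, b)
print("np.polyfit :", a_ref, b_ref)
print("r * Sy / Sx:", a_from_r)
```

The ratio $S_y/S_x$ is the same whether both standard deviations use the population or the sample convention, as long as they use the same one.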

# Two Extra Credit Points: Fun with Webscraping & Text manipulation
### (Mandatory for Grad students!)

<div class='alert alert-info'> `NOTE:` **If you are a Graduate Section student (enrolled in 290), the Extra Credit Questions are mandatory.**</div>

## 1. Statistics in Presidential Debates

Your first task is to scrape presidential debate transcripts from the Commission on Presidential Debates website: http://www.debates.org/index.php?page=debate-transcripts.

You are not allowed to look up the URLs you need manually; you have to scrape them. The root URL to be scraped is the one listed above, namely: http://www.debates.org/index.php?page=debate-transcripts


1. Using `requests` and `BeautifulSoup`, find all the links / URLs on the website that link to transcripts of **First Presidential Debates** from the years [2012, 2008, 2004, 2000, 1996, 1988, 1984, 1976, 1960]. In total you should find 9 links / URLs that fulfill these criteria.
2. When you have a list of the URLs your task is to create a Data Frame with some statistics (see example of output below):
    1. Scrape the title of each link and use that as the column name in your Data Frame. 
    2. Count how long the transcript of the debate is (as in the number of characters in the transcript string). Feel free to include `\` characters in your count, but remove any line-break characters, i.e. `\n`. You will get credit if your count is +/- 10% from our result.
    3. Count how many times the word **war** was used in the different debates. Note that you have to process the text in a smart way (so that you do not count the word **warranty**, for example, but do count **war.**, **war!**, **war,** or **War**, etc.).
    4. Also find the most commonly used word in each debate, and report how many times it was used. Note that you have to use the same strategy as in 3 to do this.
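
As a minimal sketch of step 1, the example below filters anchor tags by their link text with `BeautifulSoup`. To stay runnable offline it parses a small hypothetical HTML snippet. The link titles, the `href` values and the `years` subset are assumptions, and the markup of the real page may differ. In the homework, the HTML would come from `requests.get(root_url).text` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the transcripts index page.
sample_html = """
<a href="/page-a">October 3, 2012: The First Obama-Romney Presidential Debate</a>
<a href="/page-b">October 16, 2012: The Second Obama-Romney Presidential Debate</a>
<a href="/page-c">October 11, 2012: The Biden-Ryan Vice Presidential Debate</a>
<a href="/page-d">September 26, 2008: The First McCain-Obama Presidential Debate</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")
years = {"2012", "2008"}  # subset of the requested years, for illustration

first_debates = {}  # maps link title -> href
for link in soup.find_all("a", href=True):
    title = link.get_text(strip=True)
    lowered = title.lower()
    is_first = "first" in lowered and "presidential debate" in lowered
    if is_first and any(year in title for year in years):
        first_debates[title] = link["href"]

for title, href in first_debates.items():
    print(href, "->", title)
```

Relative `href` values like the ones above would still need to be joined with the root URL, for example with `urllib.parse.urljoin`, before they can be requested.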
    
**Tips:**

___

To solve questions 3 and 4 above, it can be useful to work with Regular Expressions and to explore string methods like `.strip(), .replace(), .find(), .count(), .lower()` etc. Both are very powerful tools for string processing in Python. To count common words, for example, I used a `Counter` object and a Regular Expression pattern that matches only words, see example:

```python
    from collections import Counter
    import re

    counts = Counter(re.findall(r"[\w']+", text.lower()))
```

Read more about Regular Expressions here: https://docs.python.org/3/howto/regex.html
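
The sketch below applies the `Counter` pattern from the tip to a short made-up string, which stands in for a scraped transcript. It shows the character count without `\n`, the count of the standalone word **war**, and the most common word. Because the text is split into whole words first, **warranty** and **postwar** are not counted as **war**.

```python
import re
from collections import Counter

# Made-up text standing in for a scraped transcript.
text = ("War is costly.\nThe warranty expired, but the war, "
        "the WAR! and the postwar talks went on.")

# Length of the transcript without line-break characters.
clean = text.replace("\n", "")
n_chars = len(clean)

# Split into lowercase words; punctuation such as '.', ',' and '!' is dropped.
counts = Counter(re.findall(r"[\w']+", clean.lower()))

war_count = counts["war"]                        # whole-word matches only
top_word, top_count = counts.most_common(1)[0]   # most frequent word

print(n_chars, war_count, top_word, top_count)
```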
    
    
**Example output of all of the answers to EC Question 1:**


![pres_stats](https://github.com/ikhlaqsidhu/data-x/raw/master/x-archive/misc/hw2_imgs_spring2018/president_stats.png)




----





    
## 2. Download and read in specific line from many data sets

Scrape the first 27 data sets from this URL: http://people.sc.fsu.edu/~jburkardt/datasets/regression/ (i.e. `x01.txt` - `x27.txt`). Then save the 5th line of each data set, which should be the name of the data set author (get rid of the `#` symbol, the white spaces and the comma at the end).

Count (with a Python function) how many times each author is the reference for one of the 27 data sets. Showcase your results sorted, with the most common author name first, together with how many data sets that author appears in. Use a Pandas DataFrame to show your results, see example.
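
One possible shape for the processing step is sketched below, assuming the file contents have already been downloaded into a dictionary. The file texts, the author names and the helper `fifth_line_author` are hypothetical. "White spaces" is read here as leading and trailing whitespace, which is an interpretation of the task.

```python
from collections import Counter

import pandas as pd

# Hypothetical file contents standing in for the downloaded data sets.
files = {
    "x01.txt": "#  x01.txt\n#\n#  Reference:\n#\n#    A. Author,\n#    Some Title\n",
    "x02.txt": "#  x02.txt\n#\n#  Reference:\n#\n#    B. Writer,\n#    Other Title\n",
    "x03.txt": "#  x03.txt\n#\n#  Reference:\n#\n#    A. Author,\n#    Third Title\n",
}


def fifth_line_author(text):
    """Hypothetical helper: return the cleaned 5th line of a data set file."""
    line = text.splitlines()[4]  # 5th line, zero-based index 4
    return line.replace("#", "").strip().rstrip(",").strip()


authors = [fifth_line_author(text) for text in files.values()]

# most_common() returns (author, count) pairs sorted by descending count.
author_counts = Counter(authors).most_common()
df_authors = pd.DataFrame(author_counts, columns=["Author", "Count"])
print(df_authors)
```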

**Example output of the answer EC Question 2:**

![author_stats](https://github.com/ikhlaqsidhu/data-x/raw/master/x-archive/misc/hw2_imgs_spring2018/data_authors.png)
