## Machine Learning Introduction

### What is Machine Learning?

Machine learning is an umbrella term for a set of algorithms that analyze known historical information (“training data”), find patterns in it, and use those patterns to make predictions on new, unseen information. A definition commonly attributed to Arthur Samuel describes it as “the field of study that gives computers the ability to learn without being explicitly programmed.”

What goes into these algorithms?

- **Mathematics**: Mathematical functions form the basis of the modeling process. Specifically, the fields of Probability, Linear Algebra and Calculus are crucial to building mathematical systems capable of modeling real-world data.
- **Computer Science**: Programming languages implement these mathematical models by translating them into a series of executable tasks that a computing machine can carry out.
- **Statistics**: Statistical inference and evaluation techniques are at the heart of making sure the model reflects the data as much as possible. They help us answer: How do we know our model works?

The parameters of the mathematical models (mathematics) are “learned” by implementing the models in code (computer science) and evaluating their performance (statistics). If the performance is unsatisfactory, the parameters are altered, and the model is implemented and evaluated again to see whether it performs better, and so on.

The machine learning process is thus not a one-and-done process. Rather, it’s iterative, not unlike the way we humans learn! :) This capacity of machine learning to simulate human cognition makes it a subfield of Artificial Intelligence (AI).
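The implement, evaluate, alter loop described above can be sketched with gradient descent on a one-variable linear model. This is only a minimal illustration: the synthetic data, the learning rate, and the names `w`, `b`, and `learning_rate` are assumptions made for the example, not something the text prescribes.

```python
import numpy as np

# Synthetic "training data" (an assumption for illustration): y is roughly 3*x + 2
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=200)

w, b = 0.0, 0.0        # model parameters, starting from a poor guess
learning_rate = 0.01   # assumed step size

for step in range(5000):
    predictions = w * x + b          # implement the model
    errors = predictions - y
    loss = np.mean(errors ** 2)      # evaluate it (mean squared error)
    # Alter the parameters in the direction that reduces the loss
    w -= learning_rate * 2 * np.mean(errors * x)
    b -= learning_rate * 2 * np.mean(errors)

print(f"learned w={w:.2f}, b={b:.2f}, last loss={loss:.3f}")
```

Each pass through the loop is one round of "implement, evaluate, alter"; the learned `w` and `b` should end up close to the values used to generate the data.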
  


### Machine Learning Engineering  

Storing and accessing vast amounts of data requires robust hardware and pipelines. And running a machine learning algorithm requires lots of computing power and an infrastructure that allows for a seamless flow of data between the different steps of the algorithm. Here’s where the “engineering” in Machine Learning Engineering comes in!

Machine Learning Engineering concerns itself with designing, building, maintaining and fine-tuning computational systems that can execute sophisticated machine learning algorithms on large amounts of data. 


### Supervised Learning

Machine learning can be divided into the following categories:

- **Supervised Learning**
- **Unsupervised Learning**

Supervised Learning is where the data is labeled and the program learns to predict the output from the input data. For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions, each labeled as fraudulent or legitimate. For each new transaction, the program would then predict whether it is fraudulent or not.
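As a rough sketch of the fraud example, the snippet below fits a classifier on a handful of labeled transactions and predicts labels for two new ones. The feature choice (amount and hour of day), the tiny hand-made dataset, and the use of logistic regression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical labeled transactions: [amount in dollars, hour of day]
X = np.array([[12.0, 14], [2500.0, 3], [40.0, 10], [3100.0, 2],
              [25.0, 18], [2800.0, 4], [60.0, 12], [2950.0, 1]])
# Labels that come with the training data: 1 = fraudulent, 0 = legitimate
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Scale the features, then fit a classifier on the labeled examples
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predict labels for transactions the model has not seen
new_transactions = np.array([[30.0, 15], [2900.0, 3]])
predictions = model.predict(new_transactions)
print(predictions)
```

The labels in `y` are what make this supervised: the model is told the right answer for every training example.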

Supervised learning problems can be further grouped into regression and classification problems.

Regression:

In regression problems, we are trying to predict a continuous-valued output. Examples are:

- What is the housing price in New York?
- What is the value of cryptocurrencies?

Classification:

In classification problems, we are trying to predict one of a discrete set of values (classes). Examples are:

- Is this a picture of a human or a picture of a cyborg?
- Is this email spam?

For a quick preview, here’s an example of a regression problem.

A real estate company wants to analyze housing costs in Neo York. They built a linear regression model to predict rent prices from two variables: the square footage of each apartment and the number of burglaries in the apartment’s neighborhood during the past year.
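A model like the one described could be sketched as follows. The data here is synthetic, and the relationship used to generate it (rent rising with square footage and falling with burglaries) is an assumption for the sake of the example, not a real finding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic apartments (assumed data, for illustration only)
rng = np.random.default_rng(42)
n = 200
square_feet = rng.uniform(300, 1500, size=n)
burglaries = rng.integers(0, 30, size=n)
rent = 500 + 3.0 * square_feet - 20.0 * burglaries + rng.normal(0, 100, size=n)

# Two input variables, one continuous output
X = np.column_stack([square_feet, burglaries])
model = LinearRegression().fit(X, rent)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)

# Predicted rent for an 800 sq ft apartment with 5 burglaries nearby
predicted = model.predict([[800, 5]])
print("predicted rent:", predicted[0])
```

The fitted coefficients estimate how much the predicted rent changes per extra square foot and per extra burglary, holding the other variable fixed.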


### Unsupervised Learning

Unsupervised Learning is a type of machine learning where the program learns the inherent structure of the data based on unlabeled examples.

Clustering is a common unsupervised machine learning approach that finds patterns and structures in unlabeled data by grouping them into clusters.

Some examples:

- Social networks clustering topics in their news feed
- Consumer sites clustering users for recommendations
- Search engines grouping similar objects into one cluster

For a quick preview, here’s an example of unsupervised learning.

A social media platform wants to separate their users into categories based on what kind of content they engage with. They have collected three pieces of data from a sample of users:

- Number of hours per week spent reading posts
- Number of hours per week spent watching videos
- Number of hours per week spent in virtual reality

The company is using an algorithm called k-means clustering to sort users into three different groups.  
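A minimal sketch of that setup with scikit-learn's `KMeans` is shown below. The user data is synthetic, generated around three assumed usage profiles, so the clusters are far cleaner than real data would be.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic weekly hours per user: [reading posts, watching videos, virtual reality]
rng = np.random.default_rng(1)
readers = rng.normal([15, 2, 1], 1.5, size=(50, 3))
watchers = rng.normal([2, 15, 1], 1.5, size=(50, 3))
vr_users = rng.normal([1, 2, 15], 1.5, size=(50, 3))
usage = np.clip(np.vstack([readers, watchers, vr_users]), 0, None)  # no negative hours

# No labels are passed in: k-means groups users by similarity alone
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(usage)

print("cluster sizes:", np.bincount(labels))
print("cluster centers:\n", kmeans.cluster_centers_.round(1))
```

The cluster numbers themselves are arbitrary; what matters is which users end up grouped together.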
  
  
Supervised Learning: data is labeled and the program learns to predict the output from the input data  
Unsupervised Learning: data is unlabeled and the program learns to recognize the inherent structure in the input data


## Machine Learning Pipelines

### Column Transformer

Oftentimes, you may not want to apply every transformation to all columns. If the columns are of different types, we may want to apply certain parts of the pipeline only to a subset of them, as in the two previous exercises, where one set of transformations is applied to the numeric columns and another to the categorical ones. scikit-learn’s ColumnTransformer is one way of combining these processes.

ColumnTransformer takes in a list of tuples of the form (name, transformer, columns):

```python
example_column_transformer = ColumnTransformer(
    transformers=[("name_1", pipeline_1, columns_1),
                  ("name_2", pipeline_2, columns_2)])
```

The transformer can be anything with a .fit and .transform method like we used previously (like SimpleImputer or StandardScaler), but can also itself be a pipeline.
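To make the tuple form concrete, here is a small self-contained sketch. The DataFrame and its column names (`age`, `income`, `city`) are made up for illustration; the numeric columns go through a two-step pipeline, while the categorical column uses a single transformer directly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with mixed column types and a few missing values
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "income": [40000.0, 52000.0, 61000.0, np.nan],
    "city": ["A", "B", "A", "C"],
})
num_cols = ["age", "income"]
cat_cols = ["city"]

# A pipeline used as a transformer: impute with the mean, then standardize
num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer(transformers=[
    ("num", num_pipeline, num_cols),                          # pipeline on numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols) # single transformer on categorical
])

transformed = preprocess.fit_transform(df)
print(transformed.shape)
```

The output has one column per numeric feature plus one column per distinct `city` value.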

### Score Method of the Pipeline Object

Because the final step of the pipeline is a model, the entire pipeline also has a .predict method. Additionally, the .score() method, which returns the default metric of the final estimator (mean accuracy for classifiers, R² for regressors), can be used to evaluate the performance of the pipeline.

```python
pipeline_score = pipeline.score(x_test, y_test)
```
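For a version that runs end to end, the sketch below builds a small pipeline on synthetic regression data and calls both `.predict` and `.score` on it. The dataset from `make_regression` and the step names `scale` and `regr` are assumptions for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumed, for illustration)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])
pipeline.fit(x_train, y_train)

# Both calls run the preprocessing steps before reaching the model
predictions = pipeline.predict(x_test)
pipeline_score = pipeline.score(x_test, y_test)  # R^2, since the final step is a regressor
print(pipeline_score)
```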

### Hyperparameter Tuning

Great, we have a very condensed bit of code that does all our data cleaning, preprocessing, and modeling in a reusable fashion! What now? Well, we can tune some of the parameters of the model by applying a grid search over a range of hyperparameter values.

A linear regression model has very few hyperparameters and here we’ll be using the hyperparameter that pertains to whether we include an intercept or not. 
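A grid search over the intercept hyperparameter might look like the following sketch. The synthetic data, the step name `regr`, and the choice of `neg_mean_squared_error` as the scoring metric are assumptions made here. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a non-zero offset (bias), so the intercept matters
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, bias=50.0, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])

# Try the model with and without an intercept
param_grid = {"regr__fit_intercept": [True, False]}
gs = GridSearchCV(pipeline, param_grid, scoring="neg_mean_squared_error", cv=5)
gs.fit(X, y)

print(gs.best_params_)
print(gs.best_score_)
```

Because the target has a constant offset and the features are centered by the scaler, the search should favor fitting an intercept on this data.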

```python
import numpy as np
import pandas as pd
from scipy.io import arff
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
```

```python
# Load the dataset from bone-marrow.arff into a DataFrame
data = arff.loadarff('bone-marrow.arff')
df = pd.DataFrame(data[0])
df.drop(columns=['Disease'], inplace=True)

# Convert all columns to numeric, coercing errors to null values
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')

# Make sure binary columns are encoded as 0 and 1
for c in df.columns[df.nunique() == 2]:
    df[c] = (df[c] == 1) * 1.0
```

```python
# 1. Calculate the number of unique values for each column
print('Count of unique values in each column:', df.nunique())

# 2. Set the target, survival_status, as y; features (dropping survival status and time) as X
y = df["survival_status"]
X = df.loc[:, ~df.columns.isin(["survival_time", "survival_status"])]

# 3. Define lists of numeric and categorical columns based on number of unique values
num_cols = X.columns[X.nunique() > 7].tolist()
cat_cols = X.columns[X.nunique() <= 7].tolist()

# 4. Print only the columns that have missing values
print(X.columns[X.isnull().sum() > 0])

# 5. Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=66)
```

```python
# 6. Create categorical preprocessing pipeline
# Using the mode to fill in missing values, then one-hot encoding (OHE)
# Note: sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
cat_vals = Pipeline([
    ("imp_cat", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'))
])

# 7. Create numerical preprocessing pipeline
# Using the mean to fill in missing values, then standard scaling of features
num_vals = Pipeline([
    ("imp_num", SimpleImputer(strategy='mean')),
    ("scaler", StandardScaler())
])

# 8. Create column transformer that will preprocess the numerical and categorical features separately
preprocess = ColumnTransformer([
    ("cat_vals", cat_vals, cat_cols),
    ("num_vals", num_vals, num_cols)
])
```


```python
# 9. Create a pipeline with preprocess, PCA, and a logistic regression model
pipeline = Pipeline([("preprocess", preprocess),
                     ("pca", PCA()),
                     ("clf", LogisticRegression())])

# 10. Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Score the pipeline (mean accuracy) on the test data
print(pipeline.score(X_test, y_test))

# 11. Define the search space of hyperparameters
# C must be strictly positive. The "saga" solver is used because it supports
# the l1, l2, and elasticnet penalties.
logreg = LogisticRegression(solver="saga", max_iter=5000)
search_space = [
    {"pca": [PCA()], "pca__n_components": np.linspace(30, 37, 3).astype(int)},
    {"logreg": [logreg], "logreg__penalty": ["l1", "l2"], "logreg__C": [0.01, 0.1, 1, 10]},
    # elasticnet also needs an l1_ratio; 0.5 is an arbitrary illustrative choice
    {"logreg": [logreg], "logreg__penalty": ["elasticnet"], "logreg__l1_ratio": [0.5],
     "logreg__C": [0.01, 0.1, 1, 10]},
]

pipeline_cv = Pipeline([("preprocess", preprocess),
                        ("pca", PCA()),
                        ("logreg", LogisticRegression())])

# 12. Search over the hyperparameters above to optimize the pipeline, and fit
gs = GridSearchCV(estimator=pipeline_cv, param_grid=search_space, scoring="roc_auc", cv=4)
gs.fit(X_train, y_train)

# 13. Save the best estimator from the grid search
best_model = gs.best_estimator_

# 14. Print attributes of best_model
print("The values of the hyperparameters for the pca are: ", best_model.named_steps['pca'].get_params())
print("The values of the hyperparameters for the logistic regression are: ", best_model.named_steps['logreg'].get_params())

# 15. Print final accuracy score on the test set
print("The score on the test set:", best_model.score(X_test, y_test))
```