# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

### What are the benefits of using a pipeline?

In [None]:
# call_on_students(1)

- simplifies the workflow
    - streamlines model validation and makes model-to-model comparison easier
    - compartmentalizes our steps, which makes debugging and adjusting easier
- ensures consistency between train and test data
    - makes data processing efficient
    - helps to prevent data leakage
- makes model iteration more efficient
    - a pipeline can be passed directly to a grid search
- prevents data leakage during cross-validation
    - the data is split first and each split is then fed through the pipeline, so transformers are fit on the training folds only
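The leakage point can be sketched in a few lines, assuming scikit-learn's built-in breast cancer data. Because the scaler is a step inside the pipeline, `cross_val_score` refits it on the training portion of every fold. The names `leak_free_pipe` and `fold_scores` are illustrative and not part of the original notebook, and `max_iter=1000` is an assumption added to keep the solver from stopping early.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is a pipeline step, so it is re-fit inside every CV fold
# using only that fold's training rows (no statistics leak in from the
# held-out rows).
leak_free_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000, random_state=42)),
])

fold_scores = cross_val_score(leak_free_pipe, X_demo, y_demo, cv=5)
print(fold_scores.mean())
```

Scaling the full dataset before calling `cross_val_score` would instead let each validation fold influence the scaler's mean and variance.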

### What does a gridsearch achieve?

In [None]:
# call_on_students(1)

- performs a comprehensive/exhaustive search across hyperparameter values
    - evaluates every combination in the grid using the same cross-validated evaluation process
    - tunes hyperparameters in tandem
- exposes a `best_estimator_` attribute, so it is easy to pull out the 'best' model
    - can use many different metrics, including custom ones
- all of this makes model optimization easier and more efficient
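To make "exhaustive" concrete, here is a minimal standard-library sketch of how a grid expands into every combination of values. The grid itself (`demo_grid`) is hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical grid: two hyperparameters with 3 and 2 candidate values.
demo_grid = {"logreg__C": [0.1, 1, 10], "logreg__penalty": ["l1", "l2"]}

names = list(demo_grid)
# The Cartesian product yields one candidate setting per combination.
combos = [dict(zip(names, values)) for values in product(*demo_grid.values())]

for combo in combos:
    print(combo)

print(len(combos), "candidate settings")
```

With 5-fold cross-validation, each of these 6 settings would be fit 5 times, so the number of fits grows quickly as values are added to the grid.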

### Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant or benign (in scikit-learn's encoding, target = 0 is malignant and target = 1 is benign). Don't worry for now about a train-test split.

In [None]:
# call_on_students(1)

**Answer**:

In [2]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [3]:
# Your code here
ss = StandardScaler()
log_pipe = Pipeline(steps=[('scaler', ss),
                          ('logreg', LogisticRegression(random_state=42))])

In [4]:
log_pipe

Pipeline(steps=[('scaler', StandardScaler()),
                ('logreg', LogisticRegression(random_state=42))])

### Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse of regularization strength) values of 10, 1, and 0.1. Try out the best estimator on the test set.

In [None]:
# call_on_students(1)

**Answer**:

In [5]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [6]:
X_train.shape, y_train.shape

((426, 30), (426,))

In [13]:
log_pipe.named_steps['logreg'].get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [14]:
# Your code here
param_grid = {'logreg__C': [0.1, 1, 10]}
log_gs = GridSearchCV(log_pipe, param_grid=param_grid, cv=5)
log_gs.fit(X=X_train, y=y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('logreg',
                                        LogisticRegression(random_state=42))]),
             param_grid={'logreg__C': [0.1, 1, 10]})

In [15]:
log_gs.best_params_

{'logreg__C': 10}

In [16]:
log_gs.best_score_

0.9764705882352942

In [17]:
final_model = log_gs.best_estimator_
final_model

Pipeline(steps=[('scaler', StandardScaler()),
                ('logreg', LogisticRegression(C=10, random_state=42))])

In [18]:
final_model.score(X_train, y_train)

0.9906103286384976

In [19]:
final_model.score(X_test, y_test)

0.972027972027972

# 2) Ensemble Methods

### What sorts of ensembling methods have we looked at?

In [None]:
# call_on_students(1)

- bagging (bootstrap aggregation)
- random forest
- voting
- stacking
- extra trees
- AdaBoost
- gradient boosting
    - XGBoost

### What is random about a random forest?

In [None]:
# call_on_students(1)

- Random sampling with replacement (bootstrap)
- Random subset of features for each decision node --> still find best split from the random subset

- Extra Trees - 3rd level of randomness
    - for each feature in the random subset, a split threshold is drawn at random, and the best of those random splits is then chosen based on the criterion
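The first two sources of randomness can be sketched with the standard library alone. This is an illustration of the idea, not how scikit-learn implements it; the shape (426 rows, 30 features) matches the breast cancer training split used earlier, and the square-root subset size is an assumed, commonly used choice for classification.

```python
import math
import random

rng = random.Random(42)

n_rows, n_features = 426, 30  # shape of the breast cancer training split

# 1) Bootstrap: draw n_rows row indices WITH replacement for one tree.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_fraction = len(set(bootstrap_rows)) / n_rows

# 2) Feature subsetting: at one split, consider only a random subset of
#    the features (sampled WITHOUT replacement).
subset_size = int(math.sqrt(n_features))
candidate_features = rng.sample(range(n_features), subset_size)

print(f"unique rows in bootstrap sample: {unique_fraction:.2f}")
print("features considered at this split:", sorted(candidate_features))
```

Because sampling is done with replacement, only roughly two thirds of the rows end up in any one tree's sample; the rest are "out of bag" for that tree.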

### What hyperparameters of a random forest might it be useful to tune? How so?

In [None]:
# call_on_students(1)

- n_estimators: number of trees in the forest
- max_features: number of features considered at each split (controls the feature subsetting)
- max_samples: number of samples drawn for each tree when bootstrapping (controls the random sampling)
- hyperparameters carried over from base decision trees
    - max_depth: maximum depth a tree is allowed to grow to
    - min_samples_leaf: minimum number of samples required at a leaf node
    - min_samples_split: minimum number of samples a node needs before it can be split
    - criterion: the function used to measure the quality of a split
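As a rough sketch of how these hyperparameters might be tuned together, the block below grid-searches a small random forest on the breast cancer data. The grid values, `n_estimators=50`, and `cv=3` are assumptions picked to keep the run short, not recommendations, and the names `rf_grid` and `rf_gs` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

# Illustrative grid: 2 x 2 x 2 = 8 candidate settings.
rf_grid = {
    "max_depth": [3, 7],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
}

rf_gs = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid=rf_grid,
    cv=3,
)
rf_gs.fit(X_tr, y_tr)

print(rf_gs.best_params_)
print(round(rf_gs.best_score_, 3))
```

Shallower trees and larger leaves both regularize the forest, so comparing `best_score_` against the training score is one way to see whether the tuning reduced overfitting.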

### Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant or benign (target = 0 is malignant, target = 1 is benign).

In [None]:
# call_on_students(1)

**Answer**:

In [33]:
# Your code here
rf = RandomForestClassifier(random_state=42, max_depth=7)
rf.fit(X_train, y_train)
rf.score(X_train, y_train)

0.9976525821596244

In [34]:
from sklearn.model_selection import cross_val_score

In [35]:
cross_val_score(rf, X_train, y_train).mean()

0.9600820793433652

In [36]:
rf.score(X_test, y_test)

0.965034965034965

# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [37]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

In [39]:
corpus

['Harry Potter is the best young adult book about wizards',
 'Um, EXCUSE ME! Ever heard of Earth Sea?',
 'I only like to read non-fiction.  It makes me a better person.']

### NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [None]:
# call_on_students(1)

- standardize
    - lowercase
    - remove punctuation
- remove stopwords that carry little semantic meaning
- normalize (helps deal with different tenses, plurals, etc.)
    - lemmatization (uses POS tagging)
    - stemming (reduces words to a root form)
- keep or remove numbers
- tokenize (create separate tokens for each word or sequence of words)
    - n-grams
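A few of these steps can be sketched with the standard library only. The stopword list here is a hypothetical, deliberately tiny one, and stemming/lemmatization are left out because they need an external library such as NLTK.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"is", "the", "about", "of", "i", "to", "it", "me", "a"}

def preprocess(document):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = document.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

def bigrams(tokens):
    """Pair each token with its successor to form 2-grams."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

tokens = preprocess("Um, EXCUSE ME! Ever heard of Earth Sea?")
print(tokens)
print(bigrams(tokens))
```

Note that stripping punctuation this way would also fuse "non-fiction" into "nonfiction", which is one reason the order and details of these steps are worth deciding deliberately.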

### Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document-term matrix)?

In [None]:
# call_on_students(1)

- matrix/dataframe
- each row is a single document from our corpus
- columns (features) are the unique tokens (n-grams)
- values depend on the kind of vectorization (count, tf-idf, hashing)
- generally returned as a sparse matrix (to save on memory)
    - mostly zeros, since most tokens do not appear in most documents
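The structure described above can be built by hand for the three-sentence corpus. This is a plain-Python sketch with a rough tokenizer (`simple_tokens` is an illustrative helper); `CountVectorizer` would give a similar layout, although its default token pattern drops one-character tokens such as "i" and "a".

```python
from collections import Counter

docs = [
    "Harry Potter is the best young adult book about wizards",
    "Um, EXCUSE ME! Ever heard of Earth Sea?",
    "I only like to read non-fiction.  It makes me a better person.",
]

def simple_tokens(text):
    """Very rough tokenizer: lowercase and keep alphabetic runs only."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()

counts = [Counter(simple_tokens(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))  # one column per unique token

# One row per document, one column per vocabulary token.
doc_term_matrix = [[c[token] for token in vocabulary] for c in counts]

print(len(doc_term_matrix), "rows x", len(vocabulary), "columns")
print(vocabulary[:5])
```

Even in this tiny corpus most cells are zero, which is why vectorizers return sparse matrices.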

### What does TF-IDF do?

Also, what does TF-IDF stand for?

In [None]:
# call_on_students(1)

- weights tokens by their importance to the document in question relative to the full corpus (an importance score)
- tf: term frequency: how often the token appears in the document
- idf: inverse document frequency: down-weights tokens that appear in many documents across the corpus
- high score = more important for that specific document
- low score = less important
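A worked example may help. The sketch below uses the textbook formula, term frequency times `log(N / document frequency)`, on a hypothetical pre-tokenized corpus. scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy, already-tokenized corpus (hypothetical data for illustration).
toy_corpus = [
    ["wizards", "read", "books"],
    ["wizards", "cast", "spells"],
    ["wizards", "read", "spells", "spells"],
]

n_docs = len(toy_corpus)
# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in toy_corpus for token in set(doc))

def tf_idf(doc):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(doc)
    return {
        token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
        for token, count in counts.items()
    }

for doc in toy_corpus:
    print({tok: round(score, 3) for tok, score in tf_idf(doc).items()})
```

"wizards" appears in every document, so its idf is `log(1) = 0` and it scores zero everywhere, while "books" appears in only one document and gets the highest score there.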

## NLP in Code

### Set Up

In [40]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [41]:
policies.head()

| Unnamed: 0.1 | Unnamed: 0 | name | policy | candidate |
|---|---|---|---|---|
| 0 | 0 | 100% Clean Energy for America | As published on Medium on September 3rd, 2019:... | 1 |
| 1 | 1 | A Comprehensive Agenda to Boost America’s Smal... | Small businesses are the heart of our economy.... | 1 |
| 2 | 2 | A Fair and Welcoming Immigration System | As published on Medium on July 11th, 2019:\nIm... | 1 |
| 3 | 3 | A Fair Workweek for America’s Part-Time Workers | Working families all across the country are ge... | 1 |
| 4 | 4 | A Great Public School Education for Every Student | I attended public school growing up in Oklahom... | 1 |


The documents for this activity are in the `policy` column, and the target is the `candidate` column.

### Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [None]:
# call_on_students(1)

In [42]:
# First! Train-test split the dataset
X = policies['policy']
y = policies['candidate']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [43]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [79]:
# Instantiate it
cv = CountVectorizer(stop_words='english', strip_accents='unicode')

In [80]:
# Fit it
cv.fit(X_train)

CountVectorizer(stop_words='english', strip_accents='unicode')

In [81]:
cv.vocabulary_

{'1987': 108,
 'united': 9977,
 'church': 1844,
 'christ': 1837,
 'commission': 2030,
 'racial': 7560,
 'justice': 5356,
 'commissioned': 2031,
 'studies': 9210,
 'hazardous': 4543,
 'waste': 10322,
 'communities': 2052,
 'color': 1996,
 'years': 10554,
 'later': 5477,
 '28': 187,
 'ago': 649,
 'month': 6170,
 'delegates': 2732,
 'national': 6270,
 'people': 6889,
 'environmental': 3539,
 'leadership': 5513,
 'summit': 9301,
 'adopted': 556,
 '17': 61,
 'principles': 7289,
 'federal': 3919,
 'government': 4348,
 'largely': 5471,
 'failed': 3846,
 'live': 5660,
 'vision': 10227,
 'trailblazing': 9712,
 'leaders': 5512,
 'outlined': 6632,
 'responsibilities': 8095,
 'represent': 8010,
 'predominantly': 7202,
 'black': 1313,
 'neighborhoods': 6332,
 'detroit': 2889,
 'navajo': 6287,
 'southwest': 8900,
 'louisiana': 5735,
 'cancer': 1595,
 'alley': 716,
 'industrial': 4963,
 'pollution': 7108,
 'concentrated': 2127,
 'low': 5739,
 'income': 4908,
 'decades': 2625,
 'tacitly': 9420,
 'writ

### Vectorize Your Text, Then Model

In [82]:
X_train.shape

(141,)

In [83]:
# call_on_students(1)

In [84]:
# Code here to transform train and test sets with the vectorizer
X_train_vec = cv.transform(X_train)
X_test_vec = cv.transform(X_test)
# The sparse matrix exposes .shape directly; no need to densify it
X_train_vec.shape

(141, 10585)

In [97]:
# Code here to instantiate and fit a Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_vec, y_train)

RandomForestClassifier(random_state=42)

In [98]:
# Code here to evaluate your model: training score first, then cross-validation, then the test set
rf_model.score(X_train_vec, y_train)

1.0

In [99]:
cross_val_score(rf_model, X_train_vec, y_train).mean()

0.9002463054187192

In [100]:
rf_model.score(X_test_vec, y_test)

0.9166666666666666

# 4) Clustering

## Clustering Concepts

### Describe how the K-Means algorithm updates its cluster centers (centroids) after initialization.

In [None]:
# call_on_students(1)

- assigns each datapoint to the closest starting centroid
- moves each centroid to the mean of all the points assigned to it
- remeasures distances from datapoints to the updated centroids
- reassigns points to the closest centroid (if it changed)
- repeats this process until a maximum number of iterations is reached or the centroids stop moving
- minimizes intra-cluster distance (tightly packed clusters)
- maximizes inter-cluster distance (well-separated clusters)
- a new initialization repeats this whole process from new starting centroids
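The assign/update loop described above can be written out in plain Python. This is a minimal sketch on hypothetical 2D points with hand-picked starting centroids, not scikit-learn's implementation (which also handles smarter initialization and multiple restarts).

```python
import math

def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid in this sketch).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else old)
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

toy_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(toy_points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

On this data the loop stops after the second pass, because the centroids computed in the first pass no longer move.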

### What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# call_on_students(1)

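- measure of intra-cluster distance: how spread out or tightly packed the clusters are
- SSE (within-cluster sum of squared distances from each point to its cluster center)
- KMeans runs several initializations (`n_init`) and keeps the run with the lowest inertia as the final estimator
- Look at different numbers of clusters (k) and graph their respective inertias
    - Elbow plot: the best k is at the "elbow" of the plot, the point after which adding more clusters stops reducing inertia substantially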
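Inertia itself is simple to compute by hand. The sketch below uses four hypothetical points and hand-assigned clusters to compare k = 1 with k = 2; the `inertia` helper is illustrative and mirrors what `KMeans.inertia_` reports after fitting.

```python
import math

def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single centroid at the overall mean.
inertia_k1 = inertia(pts, [0, 0, 0, 0], [(5.0, 1.0)])
# k = 2: one centroid per visible group.
inertia_k2 = inertia(pts, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(inertia_k1, inertia_k2)
```

Inertia keeps shrinking as k grows (it reaches 0 when every point is its own cluster), which is why the elbow, rather than the minimum, is used to pick k.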

### What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [None]:
# call_on_students(1)

- Silhouette score: takes into account both intra-cluster and inter-cluster distance, whereas inertia only measures intra-cluster distance
- Ranges between -1 and 1; higher is better
- The number of clusters with the highest silhouette score is the best k
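For intuition, the silhouette coefficient of a point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the nearest other cluster. The following plain-Python sketch applies that to four hypothetical points; it assumes every cluster has at least two points and is not a replacement for `sklearn.metrics.silhouette_score`.

```python
import math

def mean_distance(i, target, points, labels):
    """Mean distance from points[i] to the other points labelled `target`."""
    dists = [math.dist(points[i], q)
             for j, (q, lab) in enumerate(zip(points, labels))
             if lab == target and j != i]
    return sum(dists) / len(dists)

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = set(labels)
    scores = []
    for i, own in enumerate(labels):
        a = mean_distance(i, own, points, labels)            # cohesion
        b = min(mean_distance(i, c, points, labels)          # separation
                for c in clusters if c != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

square = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = silhouette(square, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette(square, [0, 1, 0, 1])   # pairs up distant points
print(round(good, 3), round(bad, 3))
```

The sensible labeling scores close to 1, while the labeling that pairs distant points scores below 0, something inertia alone would not flag as "wrong" in the same bounded way.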

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [101]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

In [104]:
X

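| Unnamed: 0 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |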


### Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?


In [None]:
# call_on_students(1)

In [103]:
# Code to preprocess the data
scaler = StandardScaler()
# The processed (scaled) data is named X_sc
X_sc = scaler.fit_transform(X)
X_sc

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
      

### Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [None]:
# call_on_students(1)

In [128]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate and fit
agg = AgglomerativeClustering(n_clusters=2)
agg.fit(X_sc)

AgglomerativeClustering()

In [129]:
# Import the metric used to calculate a silhouette score
from sklearn.metrics import silhouette_score

In [130]:
agg.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [131]:
agg.fit_predict(X_sc)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [132]:
silhouette_score(X_sc, agg.labels_)

0.5770346019475989

In [133]:
grid = {'n_clusters': [2, 3, 4, 5, 6]}
gs_cluster = GridSearchCV(agg, param_grid=grid)
gs_cluster.fit(X=X_sc)

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator AgglomerativeClustering() does not.

### Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [None]:
# call_on_students(1)

In [140]:
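def best_cluster(n, data):
    """Fit agglomerative clustering with n clusters, print the
    silhouette score, and return the cluster labels."""
    cluster = AgglomerativeClustering(n_clusters=n)
    cluster.fit(data)
    labels = cluster.labels_
    sil = silhouette_score(data, labels)
    print(n, sil)
    return labels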

In [141]:
all_labels = []
for k in range(2, 10):
    all_labels.append((k, best_cluster(k, X_sc)))

2 0.5770346019475989
3 0.446689041028591
4 0.4006363159855973
5 0.33058726295230545
6 0.3148548010051283
7 0.316969830299128
8 0.310946529007258
9 0.31143422475471655
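Since `best_cluster` only prints the scores, one way to pick the winner programmatically is to collect them in a dictionary and take the key with the highest value. The sketch below hard-codes the scores printed above; `sil_by_k` and `best_k` are illustrative names.

```python
# Silhouette scores printed by the loop above, keyed by n_clusters.
sil_by_k = {
    2: 0.5770346019475989,
    3: 0.446689041028591,
    4: 0.4006363159855973,
    5: 0.33058726295230545,
    6: 0.3148548010051283,
    7: 0.316969830299128,
    8: 0.310946529007258,
    9: 0.31143422475471655,
}

# The best k is the one with the highest silhouette score.
best_k = max(sil_by_k, key=sil_by_k.get)
print(best_k, sil_by_k[best_k])
```

Here the silhouette score favors 2 clusters even though the iris data contains three species, a reminder that the metric reflects geometric separation rather than ground-truth labels.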


In [139]:
all_labels

[(2,
  array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
         1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])),
 (3,
  array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
         1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
         2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
         2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 