# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [2]:
from src.call import call_on_students

# 1) Pipelines and Gridsearching

### What are the benefits of using a pipeline?

In [6]:
call_on_students(1)

['Yiyi']


- simplify workflow
    - makes model validation and evaluation/comparison easier
    - allows for compartmentalized debugging
- ensures consistency between train, test, and new data
- makes model iteration more efficient
- prevents data leakage between cross-validation folds (preprocessing is refit on each training fold only)
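
To illustrate the leakage point above, here is a minimal sketch (not part of the original review) that cross-validates a whole pipeline. The names `X_demo`, `y_demo`, `demo_pipe`, and `fold_scores` are made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical names, kept separate from the notebook's own X and y
X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline(steps=[('scaler', StandardScaler()),
                            ('logreg', LogisticRegression(random_state=42))])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler's mean and std never come from the held-out fold
fold_scores = cross_val_score(demo_pipe, X_demo, y_demo, cv=5)
print(fold_scores.mean())
```

Scaling the full dataset once before cross-validating would let information from each validation fold reach the scaler, which is the leak the pipeline avoids.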

### What does a gridsearch achieve?

In [7]:
call_on_students(1)

['Whitlee']


- allows comprehensive and exhaustive hyperparameter search to find optimal values (looks at all possible combinations based on the param_grid)
- tunes hyperparameters in tandem (at the same time)
- scores every combination with cross-validation, then can return the best_estimator_ object (refit on the full training set by default)
- model optimization
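
As a rough illustration of "all possible combinations", scikit-learn's `ParameterGrid` can list what a grid search would try. The `logreg__penalty` values below are an assumption added only to show how combinations multiply; they are not part of the exercise.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 3 values of C x 2 penalties = 6 candidate settings
demo_grid = {'logreg__C': [0.1, 1, 10],
             'logreg__penalty': ['l1', 'l2']}

combos = list(ParameterGrid(demo_grid))
print(len(combos))
for combo in combos:
    print(combo)
```

With `cv=5`, a grid search over this grid would fit 6 x 5 = 30 models, plus one final refit of the best setting.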

### Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts the tumor class (note: in scikit-learn's `load_breast_cancer`, target = 0 is malignant and target = 1 is benign). Don't worry for now about a train-test split.

In [8]:
call_on_students(1)

['Nathan']


**Answer**:

In [11]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [9]:
# Your code here
pipeline = Pipeline(steps=[('scaler', StandardScaler()), 
                           ('logreg', LogisticRegression(random_state=42))])


In [10]:
pipeline

Pipeline(steps=[('scaler', StandardScaler()),
                ('logreg', LogisticRegression(random_state=42))])

In [None]:
# call_on_students(1)

### Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

In [12]:
call_on_students(1)

['Pat']


**Answer**:

In [13]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [14]:
X_train.shape, y_train.shape

((426, 30), (426,))

In [15]:
X_test.shape, y_test.shape

((143, 30), (143,))

In [24]:
# Your code here
param_dict = {'logreg__C': [0.1, 1, 10]}
log_gs = GridSearchCV(pipeline, param_dict, cv=5)
log_gs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('logreg',
                                        LogisticRegression(random_state=42))]),
             param_grid={'logreg__C': [0.1, 1, 10]})

In [25]:
log_gs.best_params_

{'logreg__C': 10}

In [26]:
final_model = log_gs.best_estimator_

In [27]:
final_model.score(X_train, y_train)

0.9906103286384976

In [28]:
log_gs.best_score_

0.9764705882352942

In [29]:
log_gs.cv_results_

{'mean_fit_time': array([0.01049519, 0.00852456, 0.01434069]),
 'std_fit_time': array([0.00179942, 0.0011278 , 0.00310185]),
 'mean_score_time': array([0.00054522, 0.00027637, 0.00027556]),
 'std_score_time': array([1.62316825e-04, 2.62266506e-05, 3.63262296e-05]),
 'param_logreg__C': masked_array(data=[0.1, 1, 10],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'logreg__C': 0.1}, {'logreg__C': 1}, {'logreg__C': 10}],
 'split0_test_score': array([0.98837209, 0.98837209, 1.        ]),
 'split1_test_score': array([0.96470588, 0.96470588, 0.96470588]),
 'split2_test_score': array([0.98823529, 1.        , 1.        ]),
 'split3_test_score': array([0.96470588, 0.96470588, 0.96470588]),
 'split4_test_score': array([0.96470588, 0.95294118, 0.95294118]),
 'mean_test_score': array([0.97414501, 0.97414501, 0.97647059]),
 'std_test_score': array([0.0115606 , 0.01731293, 0.01968612]),
 'rank_test_score': array([2, 2, 1], dtype=int32)}
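
The prompt also asks to try the best estimator on the test set, which the cells above stop short of. A minimal sketch of that step follows, rebuilding the same split and grid search under hypothetical names (`X_bc`, `gs`, `test_accuracy`, `results_df`) so it runs on its own. No score is quoted here because it was not part of the original run.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('logreg', LogisticRegression(random_state=42))])
gs = GridSearchCV(pipe, {'logreg__C': [0.1, 1, 10]}, cv=5)
gs.fit(X_tr, y_tr)

# best_estimator_ was refit on all of X_tr, so score it on the untouched test set
test_accuracy = gs.best_estimator_.score(X_te, y_te)
print(test_accuracy)

# cv_results_ is easier to read as a DataFrame
results_df = pd.DataFrame(gs.cv_results_)[
    ['param_logreg__C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(results_df)
```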

# 2) Ensemble Methods

### What sorts of ensembling methods have we looked at?

In [30]:
call_on_students(1)

['Juvenson']


- randomforest
- gradientboosting
    - xgboost
- adaboost
- bagging
- stacking
- voting
- extratrees

### What is random about a random forest?

In [32]:
call_on_students(1)

['Jay']


- Random subset of features at every decision node
- Random sampling with replacement (bootstrapping)

- for extra trees - 3rd level of randomness
    - from the subsetted features, a random value is chosen for each feature and the gini or entropy is calculated for those values and the 'best' split is chosen
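
The first two sources of randomness can be sketched with plain NumPy. This is an illustration of the idea, not how scikit-learn implements it internally. The sizes match the `X_train` shape from the breast cancer split above; `rng`, `boot_idx`, and `node_features` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 426, 30  # shape of X_train in the breast cancer split

# Bootstrapping: each tree draws n_rows row indices WITH replacement,
# so some rows repeat and others are left out of that tree entirely
boot_idx = rng.choice(n_rows, size=n_rows, replace=True)
unique_share = np.unique(boot_idx).size / n_rows
print(unique_share)

# Feature subsetting: each split only considers a random subset of features
# (sqrt of the feature count is the usual choice for classification)
n_candidates = int(np.sqrt(n_features))
node_features = rng.choice(n_features, size=n_candidates, replace=False)
print(node_features)
```

On average a bootstrap sample contains roughly 63% of the distinct rows, which is why each tree ends up trained on a different view of the data.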

### What hyperparameters of a random forest might it be useful to tune? How so?

In [33]:
call_on_students(1)

['Shelley']


- n_estimators: number of trees/sub models in our forest
- max_features: controls the feature subsetting
- max_samples: controls the random sampling
- same parameters from decision trees
    - min_samples_leaf: minimum samples required for a node to be a leaf node
    - min_samples_split: minimum samples required for a node to think about splitting
    - max_depth: stops the model from continuing to split after a certain set depth, can help prevent overfitting
    - criterion: what metric is used to determine best splits
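
One possible way to tune a few of these is a small grid search. The grid values and the reduced `n_estimators=50` below are arbitrary choices made to keep the example fast, not recommendations, and the variable names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

# Illustrative grid: shallower trees and larger leaves both limit complexity
rf_grid = {'max_depth': [3, None],
           'min_samples_leaf': [1, 5]}

rf_gs = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                     rf_grid, cv=3)
rf_gs.fit(X_tr, y_tr)

print(rf_gs.best_params_)
rf_test_accuracy = rf_gs.best_estimator_.score(X_te, y_te)
print(rf_test_accuracy)
```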

### Build a random forest model on the breast cancer dataset that predicts the tumor class (note: in scikit-learn's `load_breast_cancer`, target = 0 is malignant and target = 1 is benign).

In [34]:
call_on_students(1)

['Nate']


**Answer**:

In [35]:
# Your code here

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

In [36]:
rf.score(X_train, y_train)

1.0

In [37]:
rf.score(X_test, y_test)

0.965034965034965

# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [38]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

In [39]:
corpus

['Harry Potter is the best young adult book about wizards',
 'Um, EXCUSE ME! Ever heard of Earth Sea?',
 'I only like to read non-fiction.  It makes me a better person.']

### NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [40]:
call_on_students(1)

['Meg']


- standardize case (lowercase)
- remove stopwords
- remove punctuation
- need to tokenize
- can stem or lemmatize (normalizing words across tenses etc...)
- decide whether to keep or remove numbers
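
A few of these steps can be sketched in plain Python. The stopword set here is a tiny made-up list for illustration (real lists are much longer), and stemming/lemmatizing is left out because it needs an extra library.

```python
import string

# Tiny illustrative stopword list (an assumption, not a real library list)
DEMO_STOPWORDS = {'is', 'the', 'about', 'of', 'i', 'to', 'it', 'me', 'a'}

def simple_preprocess(doc):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    doc = doc.lower()
    doc = doc.translate(str.maketrans('', '', string.punctuation))
    tokens = doc.split()
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]

demo_tokens = simple_preprocess("Um, EXCUSE ME! Ever heard of Earth Sea?")
print(demo_tokens)
# ['um', 'excuse', 'ever', 'heard', 'earth', 'sea']
```

Note that stripping punctuation this way would turn "non-fiction" into "nonfiction", so the order and exact rules of each step are worth checking against your corpus.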

### Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

In [41]:
call_on_students(1)

['Rigat ']


- matrix/dataframe
- each row is a single document in our corpus
- columns are the unique tokens (n-grams) aka the features for a model
- values can be count, presence, or some form of score (tf-idf)
- typically returned as a sparse matrix given lots of 0 values

### What does TF-IDF do?

Also, what does TF-IDF stand for?

In [42]:
call_on_students(1)

['Shelley']


- weights words by their specific importance to documents (score)
- high tf-idf = more importance for that document 
- low tf-idf = less importance for that document
- tf = term frequency = how often the word appears in that document
- idf = inverse document frequency = down-weights words that appear in many documents of the corpus; the rarer a word is across documents, the higher its idf
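
A hand calculation may help. The sketch below uses the plain textbook form, tf x log(N / df), on a made-up tokenized corpus. scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical, already-tokenized documents
demo_docs = [['wizards', 'book', 'wizards'],
             ['book', 'earth'],
             ['book', 'person']]

def tf_idf(term, doc, docs):
    """Textbook tf-idf: (share of doc that is term) * log(N / docs containing term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

wizards_score = tf_idf('wizards', demo_docs[0], demo_docs)
book_score = tf_idf('book', demo_docs[0], demo_docs)
print(wizards_score, book_score)
```

"book" appears in every document, so its idf is log(3/3) = 0 and it scores 0 everywhere, while "wizards" is frequent in the first document and absent elsewhere, so it scores highest there.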

## NLP in Code

### Set Up

In [43]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [44]:
policies.head()

| Unnamed: 0.1 | Unnamed: 0 | name | policy | candidate |
|---|---|---|---|---|
| 0 | 0 | 100% Clean Energy for America | As published on Medium on September 3rd, 2019:... | 1 |
| 1 | 1 | A Comprehensive Agenda to Boost America’s Smal... | Small businesses are the heart of our economy.... | 1 |
| 2 | 2 | A Fair and Welcoming Immigration System | As published on Medium on July 11th, 2019:\nIm... | 1 |
| 3 | 3 | A Fair Workweek for America’s Part-Time Workers | Working families all across the country are ge... | 1 |
| 4 | 4 | A Great Public School Education for Every Student | I attended public school growing up in Oklahom... | 1 |


The documents for this activity are in the `policy` column, and the target is the `candidate` column.

### Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [45]:
call_on_students(1)

['Nate']


In [61]:
# First! Train-test split the dataset
X = policies['policy'] 
y = policies['candidate']
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)

In [62]:
X_train

31     In 1987, the United Church of Christ's Commiss...
12     As published in Essence on April 30, 2019:\nI ...
41     As published on Medium on August 7th, 2019:\nA...
100    Key Points Details\nFor too long, the people o...
126    Senator Klobuchar’s Plan for Seniors\nTackling...
                             ...                        
106    Key Points Details\nIn the wealthiest nation i...
14     Just two years after leaving his role as Georg...
92     Key Points Details\nWe have seen what happens ...
179    A New Rising Tide: Empowering Workers in a Cha...
102    Key Points Details\nWhen we talk about crimina...
Name: policy, Length: 141, dtype: object

In [63]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [64]:
# Instantiate it
cv = CountVectorizer(stop_words='english', strip_accents='unicode')

In [65]:
# Fit it
cv.fit(X_train)

CountVectorizer(stop_words='english', strip_accents='unicode')

In [66]:
cv.vocabulary_

{'1987': 108,
 'united': 9977,
 'church': 1844,
 'christ': 1837,
 'commission': 2030,
 'racial': 7560,
 'justice': 5356,
 'commissioned': 2031,
 'studies': 9210,
 'hazardous': 4543,
 'waste': 10322,
 'communities': 2052,
 'color': 1996,
 'years': 10554,
 'later': 5477,
 '28': 187,
 'ago': 649,
 'month': 6170,
 'delegates': 2732,
 'national': 6270,
 'people': 6889,
 'environmental': 3539,
 'leadership': 5513,
 'summit': 9301,
 'adopted': 556,
 '17': 61,
 'principles': 7289,
 'federal': 3919,
 'government': 4348,
 'largely': 5471,
 'failed': 3846,
 'live': 5660,
 'vision': 10227,
 'trailblazing': 9712,
 'leaders': 5512,
 'outlined': 6632,
 'responsibilities': 8095,
 'represent': 8010,
 'predominantly': 7202,
 'black': 1313,
 'neighborhoods': 6332,
 'detroit': 2889,
 'navajo': 6287,
 'southwest': 8900,
 'louisiana': 5735,
 'cancer': 1595,
 'alley': 716,
 'industrial': 4963,
 'pollution': 7108,
 'concentrated': 2127,
 'low': 5739,
 'income': 4908,
 'decades': 2625,
 'tacitly': 9420,
 'writ

### Vectorize Your Text, Then Model

In [50]:
call_on_students(1)

['Rigat ']


In [67]:
# Code here to transform train and test sets with the vectorizer
X_train_vec = cv.transform(X_train)
X_test_vec = cv.transform(X_test)

In [68]:
X_train_vec.shape

(141, 10585)

In [69]:
X_test_vec.shape

(48, 10585)

In [70]:
# Code here to instantiate and fit a Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_vec, y_train)

RandomForestClassifier(random_state=42)

In [71]:
rf_model.score(X_train_vec, y_train)

1.0

In [73]:
# Code here to evaluate your model on the test set
rf_model.score(X_test_vec, y_test)

0.9166666666666666

### If we can't tune the randomforest (for some reason) what could we tune to try and address the overfitting?

- we can tune the count vectorizer to reduce the number of tokens, and thus the number of features; a less complex feature space should help reduce overfitting

In [None]:
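# Values shown are the defaults; tune these to shrink the vocabulary
cv = CountVectorizer(stop_words='english', strip_accents='unicode',
                     max_features=None,  # cap on vocabulary size; None keeps every token
                     min_df=1,           # drop tokens found in fewer than this many documents
                     max_df=1.0)         # drop tokens found in more than this share of documents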
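
To see how these arguments shrink the feature space, here is a sketch on a made-up four-document corpus (`toy_docs` is hypothetical, as are the chosen values of `min_df` and `max_features`).

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["clean energy plan for america",
            "energy jobs plan",
            "immigration plan for families",
            "public school plan"]

full_cv = CountVectorizer().fit(toy_docs)
trimmed_cv = CountVectorizer(min_df=2).fit(toy_docs)       # token must appear in >= 2 docs
capped_cv = CountVectorizer(max_features=3).fit(toy_docs)  # keep the 3 most frequent tokens

print(len(full_cv.vocabulary_),
      len(trimmed_cv.vocabulary_),
      len(capped_cv.vocabulary_))
print(sorted(trimmed_cv.vocabulary_))
```

On the real policy data the same knobs would cut the 10,585 columns down, and they can be searched like any other hyperparameter if the vectorizer sits inside a pipeline.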

# 4) Clustering

## Clustering Concepts

### Describe how the K-Means algorithm updates its cluster centers (centroids) after initialization.

In [None]:
call_on_students(1)

- assigns each datapoint to its closest starting centroid
- the centroids are moved so that they are the true 'center' (mean) of all the datapoints assigned to each one
- remeasure and reassign each datapoint to its new closest centroid
- then move each centroid again to the true center
- rinse and repeat until the assignments stop changing (or max_iter is reached); each pass drives down intra-cluster distance, which for a fixed dataset also pushes up inter-cluster distance

- repeat full process with new starting centroids (n_init times)
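
The assign-then-update loop above can be sketched in NumPy as follows. This is a bare-bones illustration rather than scikit-learn's implementation: the starting centroids are hand-picked and hypothetical, and the sketch assumes no cluster ever ends up empty.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One pass of K-Means: assign points to the nearest centroid, then recompute means."""
    # Distance from every point to every centroid, shape (n_points, k)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it
    new_centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # hypothetical starting centroids

for _ in range(10):  # cap on iterations, similar in spirit to max_iter
    toy_labels, updated = kmeans_step(toy_points, toy_centroids)
    if np.allclose(updated, toy_centroids):  # centroids stopped moving
        break
    toy_centroids = updated

print(toy_labels)
print(toy_centroids)
```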

### What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [77]:
call_on_students(1)

['Meg']


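- measure of intra-cluster distance: how spread out/tightly packed each cluster is
- SSE (sum of squared distances from each point to its cluster's centroid)
- KMeans runs n_init times with different starting centroids and keeps the run with the lowest inertia as the final model
- can look at multiple values of k (number of clusters) and graph the inertia
    - elbow method/elbow plot: inertia keeps dropping as k grows, so look for the point where the decrease levels off (the elbow) as the best number of clusters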
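
A sketch of collecting the numbers for an elbow plot, using scaled iris data as a stand-in dataset. The range of k and the variable names are arbitrary choices for this example; plotting `ks` against `inertias` with matplotlib would give the elbow curve.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris()['data'])

ks = list(range(1, 8))
inertias = []
for k in ks:
    # n_init=10 reruns K-Means from 10 starts and keeps the lowest-inertia run
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_iris)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```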

### What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [78]:
call_on_students(1)

['Whitlee']


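- Silhouette Score: takes into account both intra- and inter-cluster distance (inertia only measures intra-cluster distance)
- Ranges between -1 and 1: best clustering is the highest score
    - 1 would be perfect clusters
    - values near 0 indicate overlapping clusters
    - -1 would be essentially completely wrong clusters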
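
For comparison, both metrics can be read off the same fitted model. The sketch below again uses scaled iris data and an arbitrary `n_clusters=3`; the names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris()['data'])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_iris)

# Inertia: unbounded, in squared-distance units, only comparable on the same data
print(km.inertia_)

# Silhouette: bounded between -1 and 1, so it is easier to interpret on its own
sil_km = silhouette_score(X_iris, km.labels_)
print(sil_km)
```

Unlike inertia, the silhouette score does not automatically improve as k grows, so the k with the highest score can be picked directly.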

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [79]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?


In [None]:
call_on_students(1)

Because agglomerative clustering (like KMeans) relies on distance to cluster, we need to scale the data!

In [81]:
# Code to preprocess the data
scaler = StandardScaler()
# Name the processed data X_sc
X_sc = scaler.fit_transform(X)

In [82]:
X_sc

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
      

### Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [83]:
call_on_students(1)

['Juvenson']


In [85]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate and fit
agg = AgglomerativeClustering(n_clusters=2)
agg.fit(X_sc)

AgglomerativeClustering()

In [86]:
from sklearn.metrics import silhouette_score

In [87]:
agg.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [89]:
agg.fit_predict(X_sc)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [91]:
# Calculate a silhouette score
sil_agg = silhouette_score(X_sc, agg.labels_)
sil_agg

0.5770346019475989

### Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [92]:
call_on_students(1)

['JD']


In [93]:
def best_cluster(n, data):
    """ 
    Tests different numbers for the hyperparameter n_clusters
    Prints the silhouette score for that clustering model
    Returns the labels that are output from the clustering model

    Parameters: 
    -----------
    n: int
        number of clusters to use in the agglomerative clustering model
    data: Pandas DataFrame or array-like object
        Data to cluster

    Returns: 
    --------
    labels: array-like object
        Labels attribute from the clustering model
    """
    cluster = AgglomerativeClustering(n_clusters=n)
    cluster.fit(data)
    labels = cluster.labels_
    sil = silhouette_score(data, labels)
    print(sil)
    return labels

In [100]:
all_labels = []
for k in range(2,10):
    all_labels.append(best_cluster(k, X_sc))

0.5770346019475989
0.446689041028591
0.4006363159855973
0.33058726295230545
0.3148548010051283
0.316969830299128
0.310946529007258
0.31143422475471655


In [101]:
all_labels

[array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
        1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,
        1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,
        2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,
        2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0