# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [None]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

1. What are the benefits of using a pipeline?

In [None]:
# call_on_students(1)

2. What does a gridsearch achieve?

In [None]:
# call_on_students(1)

3. Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant. (In scikit-learn's copy of this dataset, malignant is coded as target = 0 and benign as target = 1.) Don't worry for now about a train-test split.

**Answer**:

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
# Your code here
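
One possible approach, as a minimal sketch rather than the only valid answer: the step names `"scaler"` and `"logreg"`, the variable names `X_bc` and `y_bc`, and `max_iter=1000` are choices made for this illustration, not part of the exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the dataset; in scikit-learn's copy, target_names is
# ['malignant', 'benign'], so 0 = malignant and 1 = benign
bc_data = load_breast_cancer()
X_bc, y_bc = bc_data.data, bc_data.target
print(bc_data.target_names)

# Step names are arbitrary labels chosen for this sketch;
# max_iter=1000 is an assumption that simply leaves room to converge
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline fits the scaler, transforms X, then fits the model
pipe.fit(X_bc, y_bc)

# Accuracy on the same data used for fitting (no train-test split yet)
print(pipe.score(X_bc, y_bc))
```

Because the scaler and the model live in one object, a single `fit` or `score` call runs every step in order.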



4. Split the data into train and test sets, then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization strength) values of 10, 1, and 0.1. Then evaluate the best estimator on the test set.

**Answer**:

In [None]:
# Your code here
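
A hedged sketch of one way to do this. The `random_state=42`, `cv=5`, and `max_iter=1000` settings and all variable names are assumptions made here for illustration; the exercise does not prescribe them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# random_state=42 is an arbitrary choice, used only for reproducibility
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, random_state=42
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {"logreg__C": [10, 1, 0.1]}

# cv=5 is an assumption; the scaler is refit inside each fold, so the
# validation folds never leak into the scaling statistics
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_bc_train, y_bc_train)

print(grid.best_params_)

# best_estimator_ is the winning pipeline, refit on the full training set
print(grid.best_estimator_.score(X_bc_test, y_bc_test))
```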



# 2) Ensemble Methods

1. What sorts of ensembling methods have we looked at?

In [None]:
# call_on_students(1)

2. What is random about a random forest?

In [None]:
# call_on_students(1)

3. What hyperparameters of a random forest might it be useful to tune? How so?

In [None]:
# call_on_students(1)

4. Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant (coded as target = 0 in scikit-learn's copy of this dataset). Make sure you do a train-test split!

**Answer**:

In [None]:
# Your code here
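
A minimal sketch of one possible answer. The hyperparameter values (`n_estimators=100`, `max_depth=5`) and `random_state=42` are illustrative assumptions, not tuned or recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Hold out a test set before fitting anything
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(
    X_bc, y_bc, random_state=42
)

# Tree-based models do not need scaled features, so no scaler here.
# The hyperparameter values below are arbitrary choices for this sketch.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_bc_train, y_bc_train)

# Accuracy on the held-out test set
print(forest.score(X_bc_test, y_bc_test))
```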



# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [None]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### 1: NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [None]:
# call_on_students(1)

#### Answer:

- Lowercase (standardize case)
- Remove stopwords (really common words that likely have no semantic value)
- Stem or lemmatize to remove prefixes/suffixes/grammar bits
- Remove punctuation
- Tokenize
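
Several of the steps above can be sketched in plain Python. The `preprocess` helper and the tiny `STOPWORDS` set are hypothetical, written only for this illustration; stemming and lemmatizing are left out because they usually rely on a library such as NLTK.

```python
import string

# A tiny hand-picked stopword set for illustration only (an assumption);
# real stopword lists are much longer
STOPWORDS = {"is", "the", "about", "of", "i", "to", "it", "me", "a"}

def preprocess(document):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    lowered = document.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOPWORDS]

sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
print(preprocess(sentence_two))
# → ['um', 'excuse', 'ever', 'heard', 'earth', 'sea']
```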

### 2: Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

In [None]:
# call_on_students(1)

#### Answer:

- Columns: every unique word/token in the dataset/corpus
- Rows: the documents you're vectorizing


### 3: What does TF-IDF do?

Also, what does TF-IDF stand for?

In [None]:
# call_on_students(1)

#### Answer:

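- TF-IDF: term frequency-inverse document frequency
- TF-IDF is a vectorizer that takes into account the rarity of the words: a term's weight goes up the more often it appears in a document (term frequency) and goes down the more documents in the corpus contain it (inverse document frequency)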
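
A small sketch of that effect on the example corpus, using scikit-learn's `TfidfVectorizer` with its default settings. The variable names are chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Harry Potter is the best young adult book about wizards",
    "Um, EXCUSE ME! Ever heard of Earth Sea?",
    "I only like to read non-fiction.  It makes me a better person.",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)
vocab = tfidf.vocabulary_  # maps each token to its column index

# Both "earth" and "me" appear once in the second document, but "me" also
# appears in the third document, so it is less rare across the corpus
second_doc = weights[1].toarray().ravel()
earth_weight = second_doc[vocab["earth"]]
me_weight = second_doc[vocab["me"]]

print(earth_weight > me_weight)
# → True
```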


## NLP in Code

### Set Up

In [None]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    policies and policies from all other candidates'''
    
    if label == 'warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [None]:
policies.head()

The documents for this activity are in the `policy` column, and the target is the `candidate` column.

### 4: Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [None]:
# call_on_students(1)

In [None]:
# First! Train-test split the dataset
from sklearn.model_selection import train_test_split

# Code here to train test split
X_train, X_test, y_train, y_test = train_test_split(policies['policy'], policies['candidate'])

In [None]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Instantiate it
vectorizer = CountVectorizer()

In [None]:
# Fit it
vectorizer.fit(X_train)

### 5: Vectorize Your Text, Then Model

In [None]:
# call_on_students(1)

In [None]:
# Code here to transform train and test sets with the vectorizer
X_tr_vec = vectorizer.transform(X_train)
X_te_vec = vectorizer.transform(X_test)

In [None]:
# Importing the classifier...
from sklearn.ensemble import RandomForestClassifier

# Code here to instantiate and fit a Random Forest model
rfc = RandomForestClassifier()
rfc.fit(X_tr_vec, y_train)

In [None]:
# Code here to evaluate your model on the test set
rfc.score(X_te_vec, y_test)

# 4) Clustering

## Clustering Concepts

### 1: Describe how the K-Means algorithm updates its cluster centers after initialization.

In [None]:
# call_on_students(1)

#### Answer:

- You set the number of cluster centers (K), and the algorithm starts with that many centers at randomly chosen initial positions
- The algorithm calculates the distance between the centers and each observation and assigns each observation to its closest cluster center to create the first iteration of clusters
- The algorithm then takes all the observations assigned to each cluster and moves that cluster center to the actual center (mean) of the newly created cluster
- Repeat! Keep going until the cluster centers stop moving (or move less than a set tolerance, or a maximum number of iterations is reached; both are parameters in the implementation)
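
The steps above can be sketched in NumPy. This is a teaching sketch, not scikit-learn's implementation: the `kmeans_sketch` function is hypothetical, and it initializes the centers at randomly chosen observations, which is one of several possible initialization schemes.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Bare-bones K-Means loop: assign points, move centers, repeat."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct observations as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points stays where it is)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Toy data: two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(1)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

centers, labels = kmeans_sketch(toy, k=2)
print(np.sort(centers[:, 0]))
```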

### 2: What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# call_on_students(1)

#### Answer:

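- Inertia is the sum of squared distances between each point and its assigned cluster center - the idea is that better clusters are more tightly concentrated
- KMeans tries to minimize inertia when choosing cluster centers; scikit-learn's implementation runs the algorithm `n_init` times from different starting centers and keeps the run with the lowest inertia
- Method to evaluate: an elbow plot! Plot inertia against K and look for the "elbow" where adding more clusters stops reducing inertia by much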
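
A hedged sketch of how the numbers behind an elbow plot could be gathered, using the scaled iris data as an example. The range of K values, `n_init=10`, and `random_state=42` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris()["data"])

# Fit one model per candidate K and record its inertia
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_iris)
    inertias.append(km.inertia_)

# Inertia by hand for the last model: the sum of squared distances
# from each point to the center of its assigned cluster
manual_inertia = ((X_iris - km.cluster_centers_[km.labels_]) ** 2).sum()

print(inertias)
print(np.isclose(manual_inertia, km.inertia_))
```

Plotting `inertias` against `k_values` (for example with `plt.plot`) gives the elbow plot.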

### 3: What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [None]:
# call_on_students(1)

#### Answer:

- Silhouette score
- Difference between silhouette score and inertia: silhouette score compares how close each point is to the other points in its own cluster with how far it is from the nearest other cluster (it ranges from -1 to 1, higher is better), while inertia just looks within each cluster
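
To make this concrete, here is a small sketch that computes the silhouette score by hand and checks it against scikit-learn. The `silhouette_manual` helper and the four toy points are hypothetical, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_manual(X, labels):
    """Mean of (b - a) / max(a, b) over all points.

    a: mean distance to the other points in the same cluster
    b: mean distance to the points in the nearest other cluster
    """
    scores = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        a = dists[same].mean()
        b = min(
            dists[labels == other].mean()
            for other in set(labels.tolist())
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight pairs of points, far apart from each other
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

print(silhouette_manual(points, toy_labels))
print(silhouette_score(points, toy_labels))
```

Unlike inertia, this calculation uses the distances to other clusters (`b`), not only the spread within a cluster (`a`).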

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [None]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### 4: Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?

- scale

In [None]:
# call_on_students(1)

In [None]:
# Code to preprocess the data
k_scaler = StandardScaler()

# Name the processed data X_processed
X_processed = k_scaler.fit_transform(X)

### 5: Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [None]:
# call_on_students(1)

In [None]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate
cluster = AgglomerativeClustering(n_clusters=2)
# Fit the object
cluster.fit(X_processed)

# Calculate a silhouette score
from sklearn.metrics import silhouette_score
silhouette_score(X_processed, cluster.labels_)

### 6: Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [None]:
# call_on_students(1)

In [None]:
def test_n_for_clustering(n, data):
    """ 
    Tests different numbers for the hyperparameter n_clusters
    Prints the silhouette score for that clustering model
    Returns the labels that are output from the clustering model

    Parameters: 
    -----------
    n: int
        number of clusters to use in the agglomerative clustering model
    data: Pandas DataFrame or array-like object
        Data to cluster

    Returns: 
    --------
    labels: array-like object
        Labels attribute from the clustering model
    """
    # Create the new clustering model
    cluster = AgglomerativeClustering(n_clusters=n)
    
    # Fit the new clustering model
    cluster.fit(data)

    # Print the silhouette score
    print(silhouette_score(data, cluster.labels_))
    
    # Return the labels attribute from the fit clustering model
    return cluster.labels_

# Testing your function

for n in range(2, 9):
    test_n_for_clustering(n, X_processed)