# Welcome to the Machine Learning in Python Tutorial

## What's in scope for this morning?
The machine learning pipeline for predictive modeling
1. Exploratory data analysis
2. Preprocessing data for ML
3. Training a model
    1. Logistic Regression
    2. MultiLayer Perceptron (NN)
4. Using a model for predictions

## What's _not_ in scope this morning?
1. Different machine learning packages (TensorFlow, Keras, PyTorch, MXNet)
2. Image Recognition
3. Unsupervised learning (word embeddings, clustering, text generation)
4. That one blog you read about sheep and self-driving cars

### Python Libraries

Python libraries are collections of functions and methods that allow you to avoid "reinventing the wheel". The more you use Python, the more familiar you become with the libraries that are available. Throughout today, we will be accessing numerous libraries by first importing them and then referencing their methods.

In [89]:
# get the current working directory

In [90]:
# get a vector of zeros of length 5

In [91]:
# roll the dice

Note that if I keep rerunning the cell above, Jupyter knows that `randint()` is already imported and does not keep re-importing it.

When you are not working in a notebook, it is best practice to put all imports at the beginning of the file.
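As a minimal sketch of what the three cells above might contain, assuming the standard-library `os` and `random` modules plus NumPy (the variable names are illustrative, not the notebook's own):

```python
import os                   # standard library: operating-system helpers
from random import randint  # standard library: random integers

import numpy as np          # third-party: numerical arrays

# get the current working directory
cwd = os.getcwd()
print(cwd)

# get a vector of zeros of length 5
zeros = np.zeros(5)
print(zeros)

# roll the dice: randint includes both endpoints
roll = randint(1, 6)
print(roll)
```

All imports sit at the top of the block, which is the layout you would use in a regular `.py` file.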

# No compiling allowed: Python as a scripting language

One of the cool things about Python is that it supports many (if not all) styles of programming. Although Python is known as a "scripting language", it is also object-oriented. When doing complex machine learning, we rely heavily on objects to store and manage the multitude of data, model parts, and results. This morning, we will be doing a lot of scripting and procedural programming, but we will also be using objects from Python libraries (e.g., dataframes). In the afternoon (and for those joining us for Day 2), we will focus more on creating our own objects. While working today, keep in mind that all of this code could be written in a more object-oriented fashion based on the needs of the program.

# What is Machine Learning?
Machine learning is the collection of algorithms that train models to make predictions, build a vector space, or produce some other useful structure. We do not tell the models what to do; they learn through trial and error.


# What is AI?
At CallMiner: How to market machine learning ;)

In the real world: Machine Learning is a subset of Artificial Intelligence. According to DeepAI, AI is 
>"intelligence of machines and computer programs, versus natural intelligence, which is intelligence of humans and animals."

AI is the process of using algorithms to solve tasks. This includes rule-based approaches as well as ML algorithms.


![https://interestingengineering.com/whats-the-difference-between-machine-learning-and-ai](figures/ai_vs_ml.jpg)

image from [here](https://interestingengineering.com/whats-the-difference-between-machine-learning-and-ai)

Learn more about [ai vs machine learning](https://towardsdatascience.com/clearing-the-confusion-ai-vs-machine-learning-vs-deep-learning-differences-fce69b21d5eb)


# The Pipeline

To get a better understanding of where we are going, let's look at the machine learning pipeline.
The pipeline for classification tasks in machine learning is:

1. Define the problem
2. Gather Data
3. Exploratory Data Analysis
4. Clean the Data
5. Engineer Additional Features
6. Prepare Features for ML
7. Train a Model
8. Predict
9. Refine the model


# Define the Problem

Given information about a particular Game of Thrones episode and a line of dialogue, predict who said it.



For example: In Season 6, Episode 5, who said "Hold the Door"?

# Gather Your Data

### "Transcripts"

Normally, when we make models here at CallMiner, we focus heavily on the transcripts of calls. These transcripts are broken into turns that contain turn-specific information, like who said it and when it was said. Unfortunately, we don't have quality call data to use as examples today. However, we do have Game of Thrones dialogue.

In [92]:
# create a dataframe from a csv of data

### Metadata

Although we are often looking at calls on a turn basis, it helps to add information about the overall call to the models. For example, if we are predicting whether or not a turn contains PII, it could be helpful to know which department took the call. Therefore, we also need call-level metadata. For our Game of Thrones example, this would be episode-level metadata.

_Task_ : Create a dataframe named `metadata` from the episode metadata located at `data\got_csv.csv`. Leave `index_col` at its default. Display the first season of metadata.


In [93]:
# add code here

Now we want to join the two dataframes so that each dialogue turn contains the episode metadata for which it was spoken.
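A minimal sketch of such a join on two toy tables. The column names (`Season`, `Episode`, `Director`) are hypothetical stand-ins and may not match the real CSV files.

```python
import pandas as pd

# Toy stand-ins for the real tables; the column names are hypothetical
toy_dialogue = pd.DataFrame({
    "Season": [1, 1, 2],
    "Episode": [1, 2, 1],
    "Name": ["jon snow", "arya stark", "tyrion lannister"],
    "Sentence": ["First line.", "Second line.", "Third line."],
})
toy_metadata = pd.DataFrame({
    "Season": [1, 1, 2],
    "Episode": [1, 2, 1],
    "Director": ["Director A", "Director B", "Director C"],
})

# A left join keeps every dialogue turn and attaches the matching episode row
toy_data = toy_dialogue.merge(toy_metadata, on=["Season", "Episode"], how="left")
print(toy_data)
```

Naming the key columns explicitly with `on=` keeps pandas from guessing how the two tables line up.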

In [94]:
# join the two dataframes to create one dataframe, `data`

*Discussion*: What do you notice about `data`? Why did it do that? How could it have been avoided?

# Features and Instances

From this point forward, we will be thinking of our dataframe in terms of *instances* and *features*. 


### Features
Each column represents a feature. [Wikipedia](https://en.wikipedia.org/wiki/Feature_(machine_learning)) defines it best:
> In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression. 

We will work with *numerical*, *categorical*, and *text* features:


1. Numerical features: features that are numbers
    1. Discrete: numerical values that are integers (age in years)
    2. Continuous: numerical values that can be any real number (velocity)
    
2. Categorical features: the feature value is chosen from a small group of possible options (college major)

3. Text features: strings that are largely unique to a given instance (summary)

The features tell us everything we know about the data we are predicting on. 
    

### Instances

Each row represents an instance. An instance is a singular unit of data. Our goal is that the model will be able to predict the correct label given the features from the instance. In our example, each line of spoken dialogue is a unique instance and each instance has a value for each feature.

# Exploratory Data Analysis

Exploratory Data Analysis is the process of analyzing your data in order to understand its main characteristics. It's like:
1. reading the ingredients before taking a bite of a mystery nugget
2. running a background check before a blind date


It goes back to knowing what you are getting yourself into before you dive in.

### Survey the Data

In [95]:
# What even are the features?

In [96]:
# What are run times?

In [97]:
# not per instance, per episode!

In [98]:
# What are the unique run times?

In [99]:
# how many times does each of these runtimes occur?

In [100]:
# How many times does someone say "you know nothing"?
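The survey questions above map onto a handful of pandas calls. This sketch runs them on toy data; the column names `Runtime`, `Episode`, and `Sentence` are assumptions and may differ from the real dataframe.

```python
import pandas as pd

# Toy data; the real column names may differ
toy_lines = pd.DataFrame({
    "Runtime": [58, 58, 62, 62, 62],
    "Episode": [1, 1, 2, 2, 2],
    "Sentence": ["You know nothing, Jon Snow.", "Winter is coming.",
                 "you know nothing", "Hold the door!",
                 "A Lannister always pays his debts."],
})

print(list(toy_lines.columns))        # what are the features?
print(toy_lines["Runtime"].unique())  # what are the unique run times?

# keep one row per episode so run times are counted per episode, not per line
per_episode = toy_lines.drop_duplicates(subset="Episode")
print(per_episode["Runtime"].value_counts())

# case-insensitive count of the lines that contain a phrase
n_know_nothing = toy_lines["Sentence"].str.contains("you know nothing", case=False).sum()
print(n_know_nothing)
```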

*Task*: Explore other fields in the data. What are the values? How often do they occur? Is anything weird happening?

In [101]:
# add code here!

### Visualize distributions

Although counts and lists are useful, it is often helpful to look at how features are distributed. Python has a variety of plotting packages to help with this task.

In [102]:
import matplotlib.pyplot as plt
import seaborn as sns


In [103]:
# plot a histogram of runtimes

In [104]:
# plot a count plot of episodes per director

In [105]:
# plot a line graph of IMDB ratings for each episode
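A small sketch of two of these plot types using matplotlib only, on toy data with hypothetical `Runtime` and `Director` columns. The `Agg` backend line is there so the block runs without a display; drop it inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy_episodes = pd.DataFrame({
    "Runtime": [58, 56, 62, 62, 60, 81],
    "Director": ["A", "A", "B", "C", "B", "A"],
})

fig, (ax_hist, ax_count) = plt.subplots(1, 2, figsize=(8, 3))

# histogram of a numerical feature
ax_hist.hist(toy_episodes["Runtime"], bins=5)
ax_hist.set_xlabel("Runtime")
ax_hist.set_ylabel("Episodes")

# bar chart of counts for a categorical feature
director_counts = toy_episodes["Director"].value_counts()
ax_count.bar(director_counts.index, director_counts.values)
ax_count.set_xlabel("Director")

fig.tight_layout()
```

seaborn offers shorter one-liners for the same charts; plain matplotlib is used here only to keep the dependencies minimal.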

Task: Pick some interesting features and visualize their distribution! Anything strange? Anything cool? Find any correlations?
(Pro Tip: pyplot and seaborn have gremlins. Given our time constraints, try to stick with one of the sample graphs we've provided! However, here is the documentation for [matplotlib](https://matplotlib.org/gallery/index.html) and [seaborn](https://seaborn.pydata.org/examples/index.html).)

In [106]:
# add graph here

# Clean the Data
If you've asked us how much data is enough, we almost always respond "as much as you can get!". However, the data needs to be as "clean" as possible. What does that even mean?
Clean data:
1. Not dirty

So what is dirty data: 
1. Missing values
2. Inaccurate values
3. Irrelevant information
4. It is in a format that machine learning algorithms won't accept

Although some machine learning algorithms are great at sifting through the garbage to find important features, it's nice if we help along the way. A good thought to keep in mind: "Would knowing this information help _me_ predict the answer?"

### Marie Kondo the features --> What brings you joy?

There are certain types of features that we like as data scientists. These features are easy for algorithms to use OR it is easy for us to transform them into formats algorithms can use.



#### Features that bring me joy?
1. RELEVANT continuous and discrete data 
    1. heights in cm
    2. distances
    3. words per minute
    4. NPS score
2. Categorical data with a reasonable number of categories
    1. state
    2. agent
    3. reason for call
        
#### Features that do not bring me joy?
1. Unprocessed text
    1. Transcripts
    2. Comments
2. Redundant features
    1. DOB and Birth_date
    2. Product Name and Product ID
3. Categorical features with too many unique values
    1. ANI
    2. Last Name

[What makes a good feature-video](https://www.youtube.com/watch?v=N9fDIAflCMY)

Task: Look at your columns and create a list of column names for the columns that do not bring you joy. Assign this list to the variable `to_be_removed`. Do NOT remove `Sentence` nor `Name`. We need them later.

Ex: `to_be_removed = ["ANI", "Last Name", "DOB"...]`

In [107]:
# add list here

In [108]:

# if you like keeping a back_up before deleting
# Note: if something does get messed up, you can always rerun all the cells from the beginning
back_up_data_1 = data.copy()


In [109]:
# remove columns that contain bad features

### Fix Inconsistencies
Until the robots finally take our jobs, much of data collection and data entry is still done by hand. And WE'RE ONLY HUMAN! (I can't imagine how many typos and syntax errors have already occurred by now if you're experiencing this live). Therefore, data can be inconsistent -- whether it be spelling errors, missing values, changes in format, etc. Before creating a model, it is important that we find any inconsistencies in our data.

In [110]:
# Who were the unique writers on Game of Thrones?

We want our model to recognize a unique list of writers (Cogman, Hill, Weiss, Benioff, Martin, Espenson, Taylor). Instead of seeing 'David Benioff & D.B Weiss' as one new author, we want it to recognize that it is author 'David Benioff' AND author 'D.B. Weiss'. We aren't going to fix this right now BUT we will fix it very soon. Don't forget about it!
 

_Task_ : What other important column is full of typos? Find it!

In [111]:
# add code here

There are at least 4 ways that `alliser thorne` is referenced, but they all refer to the same person! That's not good if our model is trying to predict who said a particular line. We don't want the model deciding between `alliser thorn` and `alliser thorne`! Therefore, we need to fix these so there is one unique name for each character.
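One way to collapse spelling variants is `Series.replace` with a dictionary that maps each variant to the spelling you want to keep. The sketch below uses a toy dataframe; only the two spellings mentioned above are taken from the text.

```python
import pandas as pd

toy_names = pd.DataFrame(
    {"Name": ["alliser thorn", "alliser thorne", "jon snow", "alliser thorn"]}
)

# map each misspelling to the single spelling we want to keep;
# replace() with a dict matches whole values, not substrings
toy_names["Name"] = toy_names["Name"].replace({"alliser thorn": "alliser thorne"})
print(toy_names["Name"].nunique())
```

Because the match is on whole values, `alliser thorne` is left alone rather than turned into `alliser thornee`.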


In [112]:
# replace `alliser thorn` with `alliser thorne` 

What we would have to do: find all the instances where a character was referenced in different ways and change them to one reference.

What I did for you: that.

Earlier, we mentioned that you can import Python libraries, which are packages of code. It turns out you can also import functions and objects from your own files. This makes data cleaning much easier because once you have written the code, you can reuse it.

In [113]:
# import NamesToReplace object, get ntR.names()

_Task_ : Write code that loops through the names in `nTR` and replaces them in `data` as we did with `alliser thorn`. How many characters do we have left?

In [114]:
# add code here

We still have a TON of characters, many of whom don't say much.

In [115]:
# graph distribution of counts of characters

We have a ton of characters who hardly say anything, and then the counts are pretty even. We want to count those rarely speaking characters as "other".
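A sketch of the "lump rare characters into other" idea on toy data. The threshold of 3 is only for the toy example; the notebook uses 30.

```python
import pandas as pd

toy_speakers = pd.DataFrame(
    {"Name": ["jon snow"] * 4 + ["arya stark"] * 3 + ["guard", "maester"]}
)
min_lines = 3  # toy threshold; the notebook uses 30

# count lines per character, then find the names below the threshold
counts = toy_speakers["Name"].value_counts()
rare_names = counts[counts < min_lines].index

# overwrite every rare name with the single label "other"
toy_speakers.loc[toy_speakers["Name"].isin(rare_names), "Name"] = "other"
print(toy_speakers["Name"].value_counts())
```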

In [116]:
# Get a list of all characters who spoke fewer than 30 times

_Task_ : Replace all those character names with "other"

In [117]:
# add code here

In [118]:
# Who's left?

# Preparing data for ML 

Features must be represented in a way that a computer can understand and _learn_ from. We do this through one-hot encoding and normalizing.


### One-hot Encodings of Categorical Data

One-hot encoding is how we turn categorical data into zeros and ones.
![one-hot-encoding](figures/one_hot_encoding.png)

One-hot encoding is necessary because the math used in machine learning algorithms requires all features to be numbers. It provides a way to express membership in one group and lack of membership in the others.

As you see above, categorical features should be one-hot encoded. But when should discrete data be one-hot encoded? Let's think about the column "N_Season".... Does the fact that an episode is in season 4 mean that it is more _something_ than an episode in season 2? Does the fact that an episode had 4.4 million views tell you that it is more _something_ than an episode with 2.2 million views? If there is a relationship with how the discrete value increases or decreases, it should be treated like a number. If there is not, it should be one-hot-encoded.
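A minimal sketch of one-hot encoding a single categorical column with `pd.get_dummies`, then attaching the result and dropping the original. The toy columns `Director` and `Views` are hypothetical.

```python
import pandas as pd

toy_features = pd.DataFrame(
    {"Director": ["A", "B", "A", "C"], "Views": [2.2, 4.4, 3.1, 5.0]}
)

# one column of 0/1 per category; dtype=int keeps the output numeric
one_hot = pd.get_dummies(toy_features["Director"], prefix="Director", dtype=int)

# attach the new columns and drop the original categorical column
toy_features = pd.concat([toy_features, one_hot], axis=1).drop(columns=["Director"])
print(toy_features)
```

`Views` is left untouched because doubling it means "twice as many views", so it behaves like a number rather than a label.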

_Task_ : Create a list of column names for the columns that should be one-hot encoded. Assign this list to the variable `one_hots`. Ignore `Name` and `Sentence`

In [119]:
# add code here

In [120]:
# get the one hot encoding for a column

In [121]:
# concatenate it to the current dataframe

In [122]:
# remove the original column

In [123]:
# since we already did encoding for "Number in Season", remove it from your list

In [124]:
# write a loop to do it for the remaining columns

In [125]:
# use the function from tutorial_utils to one-hot encode the writers
from tutorial_utils import get_writers_one_hot

_Discussion_: What is different about this one hot?

_Task_: Concatenate to `data` and drop the original column

In [126]:
#add code here

## Normalizing your data


Normalizing your data means transforming numerical features so that they fall between 0 and 1, inclusive. 

![https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range](figures/normalize.png)

We do this because many machine learning algorithms multiply the features by a matrix of weights to get a value. If the number of views ranges from 5,000,000 to 20,000,000 while the rating ranges from 7.0 to 9.0, the much larger view counts will have a bigger impact on the model, even though all we really want is how one episode's views compare to another's.

You can still have a good model without normalizing data. You will just have a harder (read: longer) time training.
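To make the arithmetic concrete, here is min-max scaling done by hand: subtract the minimum, then divide by the range. The numbers reuse the views and ratings ranges from the paragraph above; the helper name `min_max` is just for illustration.

```python
import numpy as np

views = np.array([5_000_000, 12_500_000, 20_000_000], dtype=float)
ratings = np.array([7.0, 8.0, 9.0])

def min_max(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    return (values - values.min()) / (values.max() - values.min())

# after scaling, both features cover the same 0-to-1 range
scaled_views = min_max(views)
scaled_ratings = min_max(ratings)
print(scaled_views)
print(scaled_ratings)
```

Both features end up as 0, 0.5, and 1, so neither dominates purely because of its units.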

In [127]:
# normalize the data
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
for column in list(data):
    try:
        # scale this column to the range [0, 1] and write it back in place
        data[column] = min_max_scaler.fit_transform(data[[column]]).ravel()
    except ValueError:  # can't normalize categorical or text features
        pass

data.tail(10)




# Choose Your Target Characters

Today, we want to focus on a binary classification, meaning that we want our model to only have to choose between two options. Therefore, instead of predicting who said a given line from a script, we will choose a target character and predict whether or not that character said a particular line.

In [128]:
# get the names and counts for the 10 most common characters (the ones with the most lines)

_Task_ : Choose your target character. The more lines the better. Set `target_character= <name string>`

In [129]:
# your code goes here

Now we must create the feature `<target_column>` that is `1` if the character is the target and `0` if it is not. This is the feature we will be trying to predict.
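A sketch of building such a 0/1 column on toy data. The column name `is_target` is a hypothetical placeholder for whatever you call your target column.

```python
import pandas as pd

toy_lines = pd.DataFrame({"Name": ["jon snow", "arya stark", "jon snow", "other"]})
target_character = "jon snow"
target_column = "is_target"  # hypothetical column name

# the comparison gives True/False; astype(int) turns that into 1/0
toy_lines[target_column] = (toy_lines["Name"] == target_character).astype(int)
print(toy_lines)
```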

In [130]:
# create the target_column

## Feature Engineering
An important part of machine learning is deriving additional features about the data for the model to learn. This is called [feature engineering](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114). You do this by looking at your data and thinking, "What can I figure out that the machine may not?"
We will come back to parts of this section in order to improve our model. For now, we'll skip some.

### Deriving discrete or continuous features

In [131]:
# Create a discrete feature that holds the number of words in the sentence for each instance. 

In [132]:
# Normalize the column

### Deriving categorical features

In [133]:
# create the features "previous speaker" and "next speaker" that tell the model 
# who spoke before this instance and who spoke after this instance
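A sketch of both kinds of derived features on toy data: a word count per line, and previous/next speaker columns built with `shift`. The new column names are illustrative, and the sketch ignores episode boundaries, which you may want to respect on the real data.

```python
import pandas as pd

toy_turns = pd.DataFrame({
    "Name": ["jon snow", "arya stark", "tyrion lannister"],
    "Sentence": ["I know nothing", "Not today", "I drink and I know things"],
})

# discrete feature: number of whitespace-separated words in each line
toy_turns["n_words"] = toy_turns["Sentence"].str.split().str.len()

# categorical features: shift the speaker column down or up by one row;
# the first and last rows have no neighbour, so fill them with "none"
toy_turns["previous_speaker"] = toy_turns["Name"].shift(1).fillna("none")
toy_turns["next_speaker"] = toy_turns["Name"].shift(-1).fillna("none")
print(toy_turns)
```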

In [134]:
# We no longer need the name column
# errors="ignore" lets this cell be rerun after the column is already gone
data.drop(["Name"], axis=1, inplace=True, errors="ignore")

_Task_ : Which of those new features should be one-hot encoded? Do it.

In [None]:
# add code here

### Adding Features using Annotation

_Annotation_ is the process of labeling data. When we talk about annotation at CallMiner, we are referring to human annotation -- where a person manually labels the data. This label can either be used as a feature or (more often), it is used as the target to predict.


In [None]:
# add a column named sentiment on a random sample of rows, default to zero
annotation_sample = data.sample(10, random_state = 666)
annotation_sample["sentiment"]= 0

In [None]:
# annotate the sample
for i, row in annotation_sample.iterrows():
    sentence = row["Sentence"]
    sentiment = input(f"Is this negative sentiment (1 yes, 0 no): {sentence} ")
    # input() returns a string, so convert it before storing it in the integer column
    annotation_sample.at[i, "sentiment"] = int(sentiment)
annotation_sample

_Discussion_ : What are the challenges with human annotation? 

### Deriving features from text

Models cannot process blocks of text the way humans can -- text must be transformed into numbers. One way to do this is word embeddings (more on this in the afternoon). Another way is to create features like `says_dragon`, where `says_dragon = 1` if the word "dragon" is said in a turn and `says_dragon = 0` if it is not.
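A sketch of word-presence features on toy sentences. The word list is only an example, and the check is a simple case-insensitive substring match, so "dragons" also counts as "dragon".

```python
import pandas as pd

toy_text = pd.DataFrame(
    {"Sentence": ["Dracarys!", "Where are my dragons?", "Winter is coming."]}
)
target_words = ["dragon", "winter"]  # example words only

for target_word in target_words:
    # 1 if the word appears anywhere in the line (case-insensitive), else 0
    toy_text[f"says_{target_word}"] = (
        toy_text["Sentence"].str.contains(target_word, case=False, regex=False).astype(int)
    )
print(toy_text)
```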

_Task_ : What words might be said in the Game of Thrones scripts that would distinguish one character from another? (Hint: "dracarys" is spelled "d-r-a-c-a-r-y-s"). Make a list of these word strings. Assign the list to the variable `target_words`. Create a column in `data` for each word in `target_words` called `f"says_{target_word}"`. The default value should be `0`.

In [None]:
# add code here

In [None]:
# Now we'll create features for each of these words

## Balance your data

Balancing data is important because sometimes we have so few examples of one class that the model scores great if we just pretend it never happens. Consider the model predicting if it is Christmas yet:
![Christmas](figures/Christmas.jpg)
Since it is rarely Christmas, the model scores well if it guesses that it is never Christmas. There are two main ways to [handle unbalanced data](https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758):
1. Weight balancing: When we are training our model, we tell it that it is waaaaay more important that it gets the minority class right than the majority class

2. Oversampling/Undersampling: Replicating the data in your minority class or throwing out data in your majority class so that both are equally represented. The code below undersamples the majority class.
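For the weight-balancing option, scikit-learn can compute the class weights for you. This sketch uses `compute_class_weight` on a toy label vector; the "balanced" rule it applies is `n_samples / (n_classes * count_of_each_class)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 9 examples of class 0 and 1 example of class 1
y_toy = np.array([0] * 9 + [1])

# the rare class receives a much larger weight than the common class
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))
```

Here the minority class weight works out to 10 / (2 * 1) = 5, while the majority class gets 10 / (2 * 9), roughly 0.56.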

In [None]:
# shuffle your data
data = data.sample(frac = 1).reset_index(drop = True)
# have an even number of your target's dialogue and other dialogue:
# sample every class down to the size of the smallest class
grouped = data.groupby(target_column)
data = grouped.sample(n = int(grouped.size().min())).reset_index(drop = True)
len(data)

## Train-Test Split
When training a machine learning model, you must split your data into at least 2 sets -- the training set and the testing set [(link)](https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7).
The training set is used by the model to learn. The model has access to the labels in the training set. For example, if you are predicting whether a call leads to a sale, the model is aware of which calls did and did not lead to a sale in the training set only. Therefore, it adjusts the parameter weights based on which calls it gets right/wrong (more on this later). 


The test set is used to evaluate a trained model. After using the training labels to adjust the model weights, we want to ensure that the model can still perform on other data points. Test sets can be used across models to compare performance.

Depending on how you chose to train your model, you may want to use a validation set in addition to train and test. These are like practice tests. They provide unbiased evaluation of the model while it is still training.  The models we are using today use [cross validation](https://machinelearningmastery.com/k-fold-cross-validation/) so we only need a train set and a test set.

In general, you want most of your data in the training set, with small percentages for test (and validation). The exact percentages depend on the model, but we usually do a .8-.1-.1 or a .9-.1 split.
![pic](figures/train-test.png)
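A sketch of a .9-.1 split with scikit-learn's `train_test_split` on toy arrays. The `stratify` argument is an optional extra that keeps the class ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 toy instances with 2 features each
y = np.array([0, 1] * 10)         # toy binary labels

# hold out 10% for testing; stratify keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=666
)
print(X_train.shape, X_test.shape)
```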

In [None]:
# separate your target column from your other features
# Naming your features X and your target y is standard naming protocol

In [None]:
# split into train and test data

# What is a predictive model anyway...?
I'm about to throw a lot of definitions at you, inspired or directly quoted from [here](https://machinelearningmastery.com/gentle-introduction-to-predictive-modeling/).

A model without a training algorithm is nothing. So let's start with the algorithm.
The goal of a supervised learning algorithm is to take some data with a known relationship and encode those relationships in a way that a model can interpret for predictions....Notice that I never said "rules"...don't think of them as "rules"! The model contains the learned relationships. ![pic](figures/what_is_a_model.png)

Once trained, the model is nothing but "a handful of numbers and a way of using those numbers to relate input...to an output". ![pic](figures/Make-Predictions.png)

Models are not _smart_. Models do not _know things_. Models are really, really, good at linear algebra. 


There are a LOT of different models out there. These aren't even all of them ![pic](figures/machine-learning-algorithms.png)

These models are different because they use different algorithms to learn the relationships in the data. I wish we had time to go into all of them but we will primarily focus on logistic regression.

## Logistic Regression
Logistic regression is a classification algorithm that predicts the probability of discrete outcomes (is jon/isn't jon, yes/no, sale/no sale) given the features of a data instance. The output for each instance is therefore a value between zero and one, where one means a 100% probability of the positive class. 

It is called logistic "regression" because it fits a linear function of the features to the log-odds (the logit) of that probability ![pic](figures/logit.png)


### So what is it learning?
The logistic regression algorithm is learning the relationship between all of the features and the probability for any given instance.
The probability of an instance is calculated as ![pic](figures/LR_formula.png)

i = an instance

$P_{i}$ = the probability of the positive class given the features

$1-P_{i}$ = the probability that it is not the positive class given the features

$\alpha$ = the [bias coefficient](https://www.quora.com/What-does-the-bias-term-represent-in-logistic-regression) (don't worry about it)

$\beta_{k}$ = the weight of feature k

$x_{k}$ = the value of feature k for instance i


It will calculate $Z_{i}$ for all instances $i$ (aka all the data points) and then see how well it is doing using a loss function.
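For reference, here is the standard textbook form of the logistic regression model, written with the symbols defined above. Treat it as an assumption about what the figure shows; the figure's exact notation may differ.

$$\ln\left(\frac{P_{i}}{1-P_{i}}\right) = \alpha + \sum_{k} \beta_{k} x_{k}$$

Solving for the probability gives

$$P_{i} = \frac{1}{1 + e^{-\left(\alpha + \sum_{k} \beta_{k} x_{k}\right)}}$$

A large positive weighted sum pushes the probability toward 1, a large negative one pushes it toward 0, and a sum of exactly 0 gives a probability of 0.5.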


### Loss Function

When a logistic regression model is initialized, all the weights ($\beta$s) are randomly set. It does its best to calculate $Z_{i}$ for all $i$ and then needs to see how badly it did. A cost function (also called a loss function) is used to mathematically measure how far off the model is. 
Let's think of a sale/no sale model. Intuitively, we want to change the weights of our model a lot if it is confident that a call is a sale but it is actually not a sale. Similarly, it should change the weights a lot if it is confident a call is not a sale but it is. On the other hand, if the model isn't confident either way and just makes a best guess, we don't want to change it too much. To accomplish this, we use the following cost function (in this picture $Z_i$ is $h_\theta$): ![pic](figures/cost_function.png)

Notice how mathematically, it accomplishes just that. If the model is way off, the cost is much more. 
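To see this with numbers, the sketch below assumes the figure shows the usual cross-entropy (log loss) cost and evaluates it for three predictions where the true label is "no sale". The function name is illustrative.

```python
import math

def log_loss_single(predicted_probability, true_label):
    """Cross-entropy cost for one instance with a 0/1 label."""
    if true_label == 1:
        return -math.log(predicted_probability)
    return -math.log(1 - predicted_probability)

# the true label is "no sale" (0) in all three cases
confident_and_wrong = log_loss_single(0.99, 0)  # model was 99% sure of a sale
unsure = log_loss_single(0.50, 0)               # model could not decide
confident_and_right = log_loss_single(0.01, 0)  # model was 99% sure of no sale
print(confident_and_wrong, unsure, confident_and_right)
```

The confident wrong answer costs several times more than the unsure guess, and the confident right answer costs almost nothing.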

The algorithm sums the loss over all instances and sends this information back to the weights ($\beta$s) through a process called [backpropagation](https://brilliant.org/wiki/backpropagation/) (aka a lot of calculus <3 ). This process slowly adjusts the weights to _minimize the cost_ over time. Essentially, by minimizing the loss, you're making your model more correct. 
This process is similar to a game of hot-cold. "Warmer...colder...warmer...burning up....". The difference between a game of hot-cold and training a model is that when playing a game, there is an exact location that we are in search of -- when you find it, you win! We do not have an exact destination in mind when training a model. So how do we know when to stop?

### When is training complete?
Training is complete when one of three conditions is met:
1. The weights are barely changing with each iteration


2. The model has completed a maximum number of iterations (epochs) through the process
    
    
3. Using your validation set, you see that your model is beginning to [overfit](https://towardsdatascience.com/what-are-overfitting-and-underfitting-in-machine-learning-a96b30864690)
    
    

# Training a model
Let's train our own logistic regression model.

## Tune Hyperparameters

Hyperparameters are settings in a model that can be adjusted to maximize the model's performance. The different hyperparameters to play with can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
# set the hyperparameters
tol = .0005 # tolerance threshold
class_weight = 'balanced' # automatically weight-balances the classes instead of undersampling for balance
max_iter = 500 # how many times to repeat the cost-minimizing process before giving up
random_state = 666 # a seed so that your weights are randomly initialized the same every time while refining

## Train

In [None]:
# make an instance of a model and train it

## Evaluate your model

There are many ways that a model can be scored ([precision, recall, f1, accuracy](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9), [AUC-ROC](https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5), etc). We will be using average accuracy.

This is calculated as $\frac{NumberCorrectSamples}{ TotalSamples}$
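A sketch of training and scoring with the hyperparameters above. It uses synthetic data from `make_classification` as a stand-in for the prepared Game of Thrones features, so the accuracy it prints says nothing about the real task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=200, n_features=5, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=666
)

model = LogisticRegression(tol=0.0005, class_weight="balanced",
                           max_iter=500, random_state=666)
model.fit(X_train, y_train)

# accuracy by hand: number of correct predictions divided by total predictions
predictions = model.predict(X_test)
accuracy = (predictions == y_test).sum() / len(y_test)
print(accuracy, model.score(X_test, y_test))
```

The hand-computed value and `model.score` agree, since `score` reports mean accuracy for classifiers.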

In [None]:
# evaluate the model

### Is this too good to be true? Check the Confusion Matrix

A confusion matrix is a matrix (duh..) that shows how the model is labeling instances on a class by class basis: ![pic](figures/confusion_matrix.png)

The diagonal from top left to bottom right shows the number of instances guessed correctly, while the diagonal from the bottom left to top right shows the instances guessed incorrectly.
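A small sketch with hand-written labels so the counts are easy to check. In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# rows are the true classes, columns are the predicted classes
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```

Six of the eight predictions are correct (3 + 3 on the main diagonal), and there is one mistake of each kind off the diagonal.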

In [None]:
# make confusion matrix

_Task_ : Under "Tune Hyperparameters", set `max_iter = 4` and `class_weight = None` (the Python value `None`, not the string `'None'`), and (if we got to it) comment out the code under "Balance your data". Rerun all cells above. What happened to your accuracy? What about the confusion matrix? Undo the changes and rerun it.

# Using the model for predictions


## Probabilities
Logistic Regression models use probabilities to predict if an instance is in the positive class. Once a logistic regression model is trained, we can pass instances to the model and it will return the probabilities.
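A sketch of reading probabilities from a fitted model with `predict_proba`, again on synthetic stand-in data. The columns of the result follow `model.classes_`, so column 1 is the positive class here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data with labels 0 and 1
X, y = make_classification(n_samples=200, n_features=5, random_state=666)
model = LogisticRegression(max_iter=500).fit(X, y)

# one row per instance; columns follow model.classes_, here [0, 1]
probabilities = model.predict_proba(X[:3])
positive_probability = probabilities[:, 1]
print(positive_probability)
```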

In [None]:
# get probabilities of the test set

In a sale/no sale model, this information would be extremely helpful. Instead of predicting "this agent will make a sale", you can predict "this agent is 70.1% likely to make the sale"....


But what is the next thing that will be asked.....

## Feature Weights

The next thing that most supervisors or C-levels or agents at risk of missing their bonus will ask is ..... "WHY?". Since logistic regression probabilities are calculated using a very clear relationship between features and their weights, it is fairly easy to understand what features play an important role in the decision by the model. It also helps with refining the model when you can see what is informative.

Note that the feature weights only correspond to feature importance if each feature is *independent* of all other features.
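A sketch of pairing each weight with its feature name and sorting by absolute size. The data is synthetic and the feature names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=666)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # hypothetical names
model = LogisticRegression(max_iter=500).fit(X, y)

# coef_ has one row per decision function; a binary model has a single row
weights = pd.Series(model.coef_[0], index=feature_names)

# sort by absolute size so the most influential features come first
weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(weights)
```

Sorting by absolute value matters because a strongly negative weight is just as influential as a strongly positive one.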

In [None]:
# get weights 

_Discussion_ : What does a negative weight mean? Which features would have the least impact on the model?

## Refine

We've built a model..... now what? We make it better! Once a model is made, we continue the process of creating more features, tuning hyperparameters, and evaluating. Eventually, we get the best model possible. Let's go back and implement any features in the "Feature Engineering" section we might have skipped.

Task: Make your model better! Some suggestions:
1. Add words to, or remove words from, your text features
2. Derive more categorical or continuous features
    1. The IMDB rating / number of views
    2. The number of words in the previous/next dialogue
    3. The month the show aired
3. Tune the hyperparameters -- check out the documentation if you want more control
4. Try a different model [algorithm](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model)
5. Try a non-binary classifier (e.g., predicting jon, arya, or someone else)
6. Adjust your train-test percentages

# Neural Networks
![pic](figures/trained_a_neural_net.png)

### What's the difference between logistic regression and a basic neural network?
Neural nets are made up of nodes, and each node has a weight associated with it.
In logistic regression, there is simply one node for each feature, and these weights feed directly into the decision. In neural nets, there are hidden layers of nodes between the feature nodes and the decision. These hidden nodes also have weights and can represent a variety of aspects (combinations of features, combinations of combinations of features, parts of multiple features....). The architect does not get to decide what these hidden nodes represent -- that is something the algorithm figures out ![pic](figures/lr_vs_nn.jpg)

### So what does the architect get to decide?
Like with logistic regression, the architect gets to decide the hyperparameters. Most of the sort of hyperparameters set for logistic regression would also apply to a neural net. However, the architect must also decide:
1. How many hidden layers
2. How many nodes are in each hidden layer

In more complex architectures, things like attention layers, convolution layers, and memory gates can also be added! Feel free to research more on your own! ![pic](figures/hidden_layers.png)

### Common misconception about neural nets
There are a lot of misconceptions about neural nets. The term is often thrown around like "AI". We want to take a moment to address a few of them.


#### Neural nets will take over the world
![pic](figures/cat_as_dog.png)

No, the end is not near. Like any other machine learning algorithm, neural nets can get very, very, good at specific tasks. However, they cannot "reason". They cannot branch out beyond what they are trained to do. They are also only as good as the data they are trained on. Therefore, there is little risk of a robot revolution any time soon.

#### Neural nets are a black box
In the words of the Black-Eyed Peas, "that's so 2000 and late". In recent years, researchers have developed a variety of techniques to expose what each hidden node is focusing on. For example, researchers can determine what a convolutional neural network is focusing on when classifying images (hint: neural nets love edges and boundaries) ![pic](figures/inside_cnn.png)

#### Neural nets are the answer to everything
![pic](figures/hammer.jpg)
Not all problems require a neural net (or machine learning in general). If you simply want to know how certain features are correlated, there is no need to build a neural net and then dissect the weights -- statistical methods can work just as well (if not better). If you are predicting home values, a linear regression might be enough! Lastly, there are still cases where rule-based approaches are the answer (we'll talk about that more with cluster labeling!). Before diving into building a neural net, make sure the problem statement requires it.

#### A neural net is a specific thing
A neural net is a type of architecture. Researchers are constantly developing new types of neural nets. Different types of neural nets thrive at different tasks (like CNNs for image classification). Picking the right type of neural net is a huge part of the job. ![pic](figures/NeuralNetworkZoo.png)

#### If you want a neural net, you better build it from scratch
Nowadays, there are many Python packages that help you build neural nets. Some are simple (like the scikit-learn ones we use today), while other packages are more complex but allow for more complex neural net architectures. There are also platforms like DataRobot that choose and make machine learning models for their users. So before you get nose-deep in calculus, find what already exists and utilize those tools. ![pic](figures/ml_packages.png)

## Training our own neural net: multilayer perceptron

We will use scikit-learn to create a [multilayer perceptron](http://deeplearning.net/tutorial/mlp.html).

The main hyperparameters we must decide are the number of hidden layers and the size of each one. In scikit-learn, you pass a tuple whose length is the number of hidden layers and whose entry at tuple\[i\] is the size of the ith hidden layer. For example, if `hidden_layer_sizes = (32, 16, 8, 2)`, the model has 4 hidden layers of size 32, 16, 8, and 2. scikit-learn adds the output layer automatically based on the labels, so the last hidden layer does not have to match the number of classes; the final size of 2 used here is simply one choice to experiment with.
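A sketch of fitting scikit-learn's `MLPClassifier` with this tuple on synthetic stand-in data, then inspecting the weight matrices it created. The smaller `max_iter` and larger `batch_size` are only there to keep the toy run quick.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# synthetic stand-in data: 200 instances, 5 features, 2 classes
X, y = make_classification(n_samples=200, n_features=5, random_state=666)

mlp = MLPClassifier(hidden_layer_sizes=(32, 16, 8, 2), solver="adam",
                    batch_size=20, max_iter=200, random_state=666)
mlp.fit(X, y)

# coefs_ holds one weight matrix per layer-to-layer connection
layer_shapes = [w.shape for w in mlp.coefs_]
print(layer_shapes)
print(mlp.score(X, y))
```

The list of shapes has one more entry than the tuple has hidden layers: the last matrix connects the final hidden layer to an output layer that scikit-learn creates on its own from the labels.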

In [None]:
# Tune hyperparameters
solver = 'adam'
hidden_layer_sizes = (32, 16, 8, 2)
random_state = 666
batch_size = 5 # instead of training on all your instances at once, the MLP trains in batches
max_iter = 1000

In [None]:
# make, train, and score your model

In [None]:
# check out confusion matrix

_Task_ : Refine this model like you did for the logistic regression. How accurate can it get? Which model performed better?

## Homework
A random sample of dialogue has been held out by us. Using only Python code and packages, refine/create your best possible model to predict if a piece of dialogue is spoken by Tyrion Lannister. The person with the best performing model wins a prize to be determined (probably lunch). There will also be a prize for best feature engineer. 

# Survey !!!

Please complete the [course survey](https://forms.office.com/Pages/ResponsePage.aspx?id=gwv7BWBlfUGFbTjusOst_QYpnoW2nrtJmgVZLQ3gu25UMURGMDdaUTA0QUhJQTM3NlMxNE9GVVkyRC4u)