# Logistic Regression and Intro to scikit-learn

scikit-learn offers a consistent, user-friendly API and highly optimized implementations of many classification algorithms. It also provides a wide range of preprocessing and model-tuning utilities (which we'll visit later).

### Benefits and drawbacks of scikit-learn
#### Benefits
* Consistent interface to machine learning models
* Provides many tuning parameters with sensible defaults
* Great documentation
* Rich set of functionality for companion tasks 
* Active community for development and support


#### Drawbacks
* Less emphasis on model interpretability (than R)
* May be harder to get started with machine learning (than R)

# Logistic Regression

Run a logistic regression on the diabetes dataset. 

#### Import libraries and read in data.

### Machine Learning Terminology
* Each row is an **observation** (AKA sample, example, instance, record)
* Each column is a **feature** (AKA predictor, attribute, independent variable, regressor, input, covariate)
* Each value we are predicting is the **response** (AKA target, outcome, dependent variable, label)

### Types of Supervised Learning
* **Classification** means the response is categorical
* **Regression** means the response is continuous

#### Let's take a moment and find each of these vocabulary terms for the dataset and determine which type of supervised learning we are dealing with

### Requirements for working with data in scikit-learn
1.  Features and response must be separate objects
2.  Features and response must be numeric
3.  Features and response should be NumPy arrays (pandas DataFrames and Series are built on these, so they'll work)
4.  Features and response should have specific shapes

#### 1. Select features and response

Divide the given columns into two types of variables: the dependent variable (the target) and the independent variables (the features).

We capitalize the X because it represents a matrix and lowercase the y because it represents a vector.
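As a minimal sketch of this step, the cell below builds a tiny made-up table and splits it into `X` and `y`. The column names (`glucose`, `bmi`, `age`, `label`) and the values are hypothetical stand-ins, not the actual diabetes file.

```python
import pandas as pd

# Tiny made-up stand-in for the diabetes data.
# Column names and values are hypothetical, for illustration only.
df = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137, 116],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "age": [50, 31, 32, 21, 33, 30],
    "label": [1, 0, 1, 0, 1, 0],
})

feature_cols = ["glucose", "bmi", "age"]
X = df[feature_cols]  # features: a DataFrame (2-dimensional, like a matrix)
y = df["label"]       # response: a Series (1-dimensional, like a vector)

print(type(X).__name__, type(y).__name__)
```

Selecting with a list of column names returns a DataFrame, while selecting a single column name returns a Series.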

#### 2. Features and response must be numeric

#### 3. Features and response should be NumPy arrays
pandas DataFrames and Series count because they are built on NumPy arrays, so X can be a pandas DataFrame and y can be a pandas Series.

#### 4. Features and response shape requirements
* Response should have only one dimension
* Response should have the same number of rows (first dimension) as the features
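One way to check requirements 2 through 4 is sketched below. It again uses a small made-up table with hypothetical column names, so treat it as an illustration rather than the notebook's own cell.

```python
import pandas as pd

# Hypothetical stand-in data (made-up values).
df = pd.DataFrame({
    "glucose": [148, 85, 183, 89],
    "bmi": [33.6, 26.6, 23.3, 28.1],
    "label": [1, 0, 1, 0],
})
X = df[["glucose", "bmi"]]
y = df["label"]

# Requirement 2: every feature column and the response are numeric.
features_numeric = all(pd.api.types.is_numeric_dtype(t) for t in X.dtypes)
response_numeric = pd.api.types.is_numeric_dtype(y)

# Requirement 3: the underlying data are NumPy arrays.
X_array = X.to_numpy()
y_array = y.to_numpy()

# Requirement 4: y is one-dimensional and has one entry per row of X.
print(X.shape, y.shape)
shapes_ok = (y.ndim == 1) and (X.shape[0] == y.shape[0])
print(features_numeric, response_numeric, shapes_ok)
```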

### 4 Step Modeling Pattern for scikit-learn
1. Import the class you intend to use
2. Instantiate the estimator
3. Fit the model
4. Predict the response for a new observation

**This is the pattern that we will use with every model because scikit-learn has a uniform interface for every model**

#### 1. Import the class you intend to use

First, we import the `LogisticRegression` class from the `sklearn.linear_model` module. In the next step we will create a logistic regression classifier object by calling `LogisticRegression()`.

#### 2. Instantiate the estimator
* Estimator is what scikit-learn calls models
* Instantiating means creating an instance (an object) of the class

* What we name this does not matter
* This is where we can specify tuning parameters (hyperparameters)--more on this later
* Any parameters that we don't specify are set to their defaults (let's check out the documentation)

Luckily scikit-learn provides sensible defaults for us. But we'll learn more about best practices for these as the week goes on.
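Steps 1 and 2 might look like the sketch below. The variable names `logreg` and `logreg_tuned` are arbitrary, and the non-default values passed to the second estimator are illustrative choices, not recommendations.

```python
# Step 1: import the class.
from sklearn.linear_model import LogisticRegression

# Step 2: instantiate the estimator with all defaults.
logreg = LogisticRegression()

# get_params() returns a dict of every hyperparameter and its current value.
params = logreg.get_params()
print(params["C"], params["max_iter"])

# Hyperparameters can be overridden at instantiation time.
logreg_tuned = LogisticRegression(C=0.5, max_iter=1000)
print(logreg_tuned.get_params()["C"])
```

Nothing has been learned yet at this point; the estimator only stores its settings until `fit` is called.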

#### 3. Fit the model with data (AKA "training")
* The model is learning the relationship between X and y (features and response)
* Fitting happens in place (we don't have to assign the result to a new variable)

#### 4. Predict the response for a new observation 
* New observations are called "out of sample" data
* The model applies what it learned during training to data it hasn't seen before

*Note that we aren't using "new" observations here, but we will in a bit*

* Returns a NumPy array
* Can predict for multiple observations at once
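Steps 3 and 4 can be sketched as follows. Because the diabetes file is not shown here, the example assumes a synthetic stand-in dataset generated with `make_classification`; the sample count, feature count, and `max_iter=1000` are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000)

# Step 3: fit in place; no need to reassign the result.
logreg.fit(X, y)

# Step 4: predict for several observations at once.
# predict expects a 2-D input, so even a single observation is passed as X[:1].
y_pred = logreg.predict(X[:3])
single_pred = logreg.predict(X[:1])

print(type(y_pred).__name__, y_pred.shape, single_pred.shape)
```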

### Model evaluation

### Training and testing on same data

### Classification accuracy
* Proportion of correct predictions
* Common evaluation metric for classification problems 

*We'll talk about other metrics later on*
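A minimal sketch of computing accuracy when the model is scored on the same data it was trained on. It assumes a synthetic stand-in dataset from `make_classification` rather than the diabetes data, so the number it prints says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# Predict on the very same rows the model was trained on.
y_pred = logreg.predict(X)

# Accuracy is the proportion of predictions that match the true labels.
train_acc = accuracy_score(y, y_pred)
manual_acc = (y_pred == y).mean()

print(train_acc)
```

For classifiers, `logreg.score(X, y)` returns the same accuracy in one call.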

#### The score above is called our **training accuracy** because we trained and tested the model on the same data

### Problems with training and testing on the same data
* Our goal is to maximize our performance on out-of-sample data
* If we maximize training accuracy, we'll be rewarding overly complex models that memorize the training data and won't generalize to new data well
* Such models learn the noise, not the signal
* Overly complex models **overfit** the data

### So we use train/test/split
1. Split the data into two pieces: a training set and a testing set
2. Train the model on the training set
3. Test the model on the testing set

Divide dataset into testing and training datasets so we can evaluate the model's performance. 

We'll use the function `train_test_split()`. Pass it the features, the target, and a `test_size`. Optionally, pass `random_state` to make the split reproducible.

Here, `test_size=0.25` means the dataset is split in a 75:25 ratio: 75% of the data is used for training the model and 25% for testing it.

This is a random process. To get the same split every time you run the code, pass an integer to the `random_state` parameter.
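The split described above could look like this sketch. It assumes a synthetic stand-in dataset of 200 rows and an arbitrary `random_state=42`; neither comes from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 75% of the rows go to training, 25% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 200 rows split 75:25 gives 150 training rows and 50 testing rows.
print(X_train.shape, X_test.shape)  # (150, 5) (50, 5)
print(y_train.shape, y_test.shape)  # (150,) (50,)

# Reusing the same random_state reproduces exactly the same split.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42
)
same_split = (X_train == X_train_again).all()
```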

#### How does this help us?
* Model can be trained and tested on different datasets
* We know the response values for the testing set, so we can evaluate our predictions
* **Testing accuracy** is a better estimate of out-of-sample model performance (generalizability)

#### Let's check the shapes of our new split datasets

### Fit the model on the training set

### Make predictions using the test set

### Evaluate the testing accuracy
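The three steps above (fit on the training set, predict on the test set, evaluate) can be sketched in one cell. The data are again a synthetic stand-in, and the seed values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit the model on the training set only.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Make predictions for rows the model has not seen.
y_pred = logreg.predict(X_test)

# Compare the predictions with the known test labels.
test_acc = accuracy_score(y_test, y_pred)
print(test_acc)
```

Note that the true labels come first in `accuracy_score(y_test, y_pred)`; for plain accuracy the order does not change the result, but it does for other metrics.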

### What are the downsides of train/test/split?
* Provides a high-**variance** estimate of out-of-sample performance (the result depends on which samples happen to land in the training and testing sets)
* **K-fold cross validation** overcomes this limitation (we'll talk about it soon)
* But train/test/split is useful because of its flexibility and speed
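To see why the estimate is called high variance, one rough experiment is to repeat the split with different `random_state` values and compare the testing accuracies. The sketch below assumes a synthetic stand-in dataset and five arbitrary seeds; how much the scores spread will depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the diabetes dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

scores = []
for seed in range(5):
    # A different seed puts different rows in the training and testing sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X_train, y_train)
    scores.append(logreg.score(X_test, y_test))

# Any gap between the lowest and highest score comes only from the split.
print(min(scores), max(scores))
```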

### Independent Work
Fit a logistic regression model to the titanic dataset.