<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Acquiring" data-toc-modified-id="Acquiring-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Acquiring</a></span></li><li><span><a href="#Exploring" data-toc-modified-id="Exploring-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploring</a></span></li></ul></div>

In [24]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from pydataset import data

## Classification

**<font color=red>What is Classification?</font>**

**Classification is a Supervised Machine Learning technique.** Like regression, it uses labeled data from a training dataset to learn rules that map one or more input variables to an output variable, and then applies those rules to make predictions on future, unseen data. However, **classification is used to predict category membership** of observations in a dataset.

**Simply put, Regression predicts a continuous variable while classification predicts a categorical variable.**
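
To make "learning from labeled data to predict a category" concrete, here is a minimal sketch using a 1-nearest-neighbor rule written in plain Python. This is only a stand-in for illustration, not the model this notebook builds; the training values and the `predict_label` helper are hypothetical.

```python
import math

# Toy labeled training data (hypothetical values, for illustration only):
# each row is (petal_length_cm, petal_width_cm) paired with a species label.
train_X = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.9, 2.1)]
train_y = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]


def predict_label(observation, X, y):
    """Return the label of the training row closest to `observation`."""
    distances = [math.dist(observation, row) for row in X]
    nearest = distances.index(min(distances))
    return y[nearest]


# An unseen observation gets a category, not a number.
prediction = predict_label((4.6, 1.3), train_X, train_y)
print(prediction)
```

The output is one of the known labels, which is what separates this from regression, where the prediction would be a continuous value.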

**Types of Classification**

>**Binary Classification -** This type of classification predicts an observation to be a member of one of only two groups: churn/not churn, pass/fail, male/female, smoker/non-smoker, data science student/webdev student.

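>**Multiclass Classification -** This type of classification predicts an observation to be a member of one of three or more groups, such as the three iris species (setosa, versicolor, virginica).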
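
One quick way to tell the two cases apart is to count the distinct labels in the target. The helper below is a hypothetical sketch, and the label lists are small made-up samples built from the group names mentioned above.

```python
def classification_type(labels):
    """Describe a target as binary or multiclass based on its distinct labels."""
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("a classification target needs at least two classes")
    return "binary" if n_classes == 2 else "multiclass"


# Made-up sample targets for illustration.
churn_target = ["churn", "not churn", "churn", "not churn"]
species_target = ["setosa", "versicolor", "virginica", "setosa"]

print(classification_type(churn_target))
print(classification_type(species_target))
```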

**<font color=orange>So What?</font>**


**<font color=green>Now What?</font>**

We will work through the data science pipeline.

## Acquiring

**<font color=orange>A Few Example Methods for Creating Pandas DataFrames</font>**

>**Create your DataFrame using a SQL query to access a database**

**<font color=purple>Use your env file and a handy function to get your connection_url argument</font>**

`from env import host, password, user`

`def get_connection(db, user=user, host=host, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{db}'`
    
`sql_query = 'write your sql query here; test it in Pancakes first'`

`connection_url = get_connection('titanic_db')`

`pd.read_sql(sql_query, connection_url)`
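
If your password contains characters such as `@` or `/`, the plain f-string above can produce a URL that is parsed incorrectly. A minimal sketch of one way around that, assuming you percent-encode the credentials with the standard library's `quote_plus`; the function name `build_connection_url` and the credentials shown are hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_url(db, user, host, password):
    """Build a mysql+pymysql URL, percent-encoding the user and password."""
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{db}"


# Hypothetical credentials; in practice these would come from env.py.
url = build_connection_url(
    "titanic_db", user="student", host="db.example.com", password="p@ss/word"
)
print(url)
```

The returned string can be passed to `pd.read_sql` in the same position as the connection URL above.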

**<font color=purple>Put it all together in a single function and throw it into a .py file.</font>**

`def get_titanic_data():
    return pd.read_sql('SELECT * FROM passengers', get_connection('titanic_db'))`

>**Create your DataFrame using a csv file, a Google sheet, or a file on AWS S3**

`pd.read_csv()`

**<font color=purple>If you are going to read in a Google sheet using its Share url, you can format it correctly using the following:</font>**

`sheet_url = 'https://docs.google.com/spreadsheets/d/1Uhtml8KY19LILuZsrDtlsHHDC9wuDGUSe8LTEwvdI5g/edit#gid=341089357'`  

`csv_export_url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')`

`df_googlesheet = pd.read_csv(csv_export_url)`
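
Note that `str.replace` returns the original string unchanged when the `/edit#gid=` marker is missing, so a Share url in a different shape would slip through silently. Here is a small sketch that wraps the conversion and fails loudly instead; the function name `to_csv_export_url` and the `SHEET_ID` placeholder are hypothetical.

```python
def to_csv_export_url(sheet_url):
    """Convert a Google Sheets Share url into its CSV export url."""
    marker = "/edit#gid="
    if marker not in sheet_url:
        raise ValueError(f"expected {marker!r} in the sheet url")
    return sheet_url.replace(marker, "/export?format=csv&gid=")


# SHEET_ID is a placeholder, not a real sheet.
example_url = "https://docs.google.com/spreadsheets/d/SHEET_ID/edit#gid=0"
csv_url = to_csv_export_url(example_url)
print(csv_url)
```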

>**Create your DataFrame using copy-pasted tabular data**

`pd.read_clipboard(header=None, names=columns)`

>**Create your DataFrame using an Excel sheet**

`pd.read_excel('Excel_Exercises.xlsx', sheet_name='Table1_CustDetails', usecols=['this_one', 'this_one'])`

>**Create your DataFrame using pydataset**

`from pydataset import data`

`df_iris = data('iris')`

In [23]:
data('iris', show_doc=True)

iris

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Edgar Anderson's Iris Data

### Description

This famous (Fisher's or Anderson's) iris data set gives the measurements in
centimeters of the variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3 species of iris. The
species are _Iris setosa_, _versicolor_, and _virginica_.

### Usage

    iris
    iris3

### Format

`iris` is a data frame with 150 cases (rows) and 5 variables (columns) named
`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.

`iris3` gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with names `Sepal
L.`, `Sepal W.`, `Petal L.`, and `Petal W.`, and the third the species.

### Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomi

In [11]:
df_iris = data('iris')
df_iris.head(3)

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa


- Print the shape of your df

In [13]:
df_iris.shape

(150, 5)

- Print the columns of your df

In [20]:
# .tolist() returns the column names as a plain list; df_iris.columns alone works, too.
# The names below are already lowercase with underscores because the rename cell (In [12]) ran before this one.

df_iris.columns.tolist()

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

In [12]:
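# regex=False makes sure '.' is treated as a literal period, not as the regex wildcard.
df_iris.columns = df_iris.columns.str.replace('.', '_', regex=False).str.lower()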
df_iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
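
The rename in the cell above only works if the `.` is matched as a literal period. In pandas, that depends on the `regex` argument of `.str.replace`, whose default has changed between versions, so it is worth knowing what goes wrong otherwise. The sketch below shows the difference in plain Python, using the original column names from the iris documentation.

```python
import re

original = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]

# Literal replacement: only the period is swapped for an underscore.
literal = [name.replace(".", "_").lower() for name in original]

# Regex replacement: '.' matches every character, so the whole name is wiped out.
as_regex = [re.sub(".", "_", name).lower() for name in original]

print(literal)
print(as_regex[0])
```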


- Print data type of each column

In [14]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB


In [18]:
df_iris.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

- Print the summary statistics of each numeric column. Should we rescale the data?


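- The documentation tells us that all of the numeric columns are measured in centimeters, and the summary statistics below show that their ranges are similar (minimums from 0.1 to 4.3, maximums from 2.5 to 7.9), so I don't see a reason to rescale the df.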

In [21]:
df_iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5
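
To back up the "no need to rescale" call, here is a small sketch that compares the spread (max minus min) of each column using the values from the `describe()` output above. It also includes a hypothetical `min_max_scale` helper showing what one common rescaling would look like if the ranges had been very different; this is an illustration, not a step this notebook requires.

```python
# Min and max of each numeric column, copied from the describe() output above.
column_ranges = {
    "sepal_length": (4.3, 7.9),
    "sepal_width": (2.0, 4.4),
    "petal_length": (1.0, 6.9),
    "petal_width": (0.1, 2.5),
}

# Spread (max - min) per column, all in cm.
spreads = {name: round(high - low, 1) for name, (low, high) in column_ranges.items()}
print(spreads)


def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] interval."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot rescale a constant column")
    return [(v - low) / (high - low) for v in values]


# First three sepal_length values from the head() output earlier.
scaled = min_max_scale([5.1, 4.9, 4.7])
print(scaled)
```

Every spread is within a few centimeters of the others, which supports leaving the columns on their original scale.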


In [26]:
# Using Seaborn Datasets

iris = sns.load_dataset('iris')
iris.head(3)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


**Other Seaborn Datasets you can use are linked [here](https://github.com/mwaskom/seaborn-data)**

## Preparation - Imputing and Encoding Focus



## Exploring

**There's some pretty cool EDA code and explanation in [this article.](https://towardsdatascience.com/exploratory-data-analysis-for-linear-regression-classification-8a27da23debc) Check it out!**