# Machine Learning Workflow





## 1) Define business goal 

- We are trying to predict the species of a penguin from its body measurements.


In [4]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split


## 2) Get data

In [5]:
df = pd.read_csv('penguins_simple.csv', sep=";")

In [6]:
df[df['Species'] == 'Chinstrap']

| | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|
| 146 | Chinstrap | 46.5 | 17.9 | 192.0 | 3500.0 | FEMALE |
| 147 | Chinstrap | 50.0 | 19.5 | 196.0 | 3900.0 | MALE |
| 148 | Chinstrap | 51.3 | 19.2 | 193.0 | 3650.0 | MALE |
| 149 | Chinstrap | 45.4 | 18.7 | 188.0 | 3525.0 | FEMALE |
| 150 | Chinstrap | 52.7 | 19.8 | 197.0 | 3725.0 | MALE |
| ... | ... | ... | ... | ... | ... | ... |
| 209 | Chinstrap | 55.8 | 19.8 | 207.0 | 4000.0 | MALE |
| 210 | Chinstrap | 43.5 | 18.1 | 202.0 | 3400.0 | FEMALE |
| 211 | Chinstrap | 49.6 | 18.2 | 193.0 | 3775.0 | MALE |
| 212 | Chinstrap | 50.8 | 19.0 | 210.0 | 4100.0 | MALE |


In [7]:
# Drop the Chinstrap rows so that only two species remain (Adelie and Gentoo)
df = df[df['Species'] != 'Chinstrap']

In [8]:
# Check: no Chinstrap rows should be left
df[df['Species'] == 'Chinstrap']

| | Species | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex |
|---|---|---|---|---|---|---|


### Select columns for y and X

In [11]:
# Define the features (X) and the target variable (y)
X = df[['Culmen Length (mm)', 'Body Mass (g)']]
y = df['Species']


In [12]:
X

| | Culmen Length (mm) | Body Mass (g) |
|---|---|---|
| 0 | 39.1 | 3750.0 |
| 1 | 39.5 | 3800.0 |
| 2 | 40.3 | 3250.0 |
| 3 | 36.7 | 3450.0 |
| 4 | 39.3 | 3650.0 |
| ... | ... | ... |
| 328 | 47.2 | 4925.0 |
| 329 | 46.8 | 4850.0 |
| 330 | 50.4 | 5750.0 |
| 331 | 45.2 | 5200.0 |


In [13]:
y

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
328    Gentoo
329    Gentoo
330    Gentoo
331    Gentoo
332    Gentoo
Name: Species, Length: 265, dtype: object

In [14]:
y.shape, X.shape

((265,), (265, 2))

## 3) Train-test-split


In [15]:
# Split X and y into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)


In [16]:
X_train.shape, X_test.shape

((198, 2), (67, 2))
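
The call above uses the defaults, so 25% of the rows go to the test set (67 of 265 here) and the shuffle differs on every run. Below is a minimal sketch of a reproducible, stratified split. The toy DataFrame and the names `toy`, `X_toy`, `y_toy` are made up for illustration and are not the penguin data, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the penguin data (NOT the real measurements)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Culmen Length (mm)": rng.normal(44, 5, size=100),
    "Body Mass (g)": rng.normal(4200, 800, size=100),
    "Species": ["Adelie"] * 60 + ["Gentoo"] * 40,
})
X_toy = toy[["Culmen Length (mm)", "Body Mass (g)"]]
y_toy = toy["Species"]

# test_size=0.25 is the default, written out here for clarity.
# random_state fixes the shuffle so the split is the same on every run.
# stratify keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Stratifying matters most when one class is much rarer than the other. With a plain random split, the class shares in train and test can drift apart by chance.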

## 4) EDA
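
One possible first look at the training data, sketched with pandas only. The six rows below are copied from the `X` and `y` outputs shown earlier and merely stand in for `X_train` and `y_train`; the names `X_tr`, `y_tr` and `train` are assumptions for this example.

```python
import pandas as pd

# Six rows copied from the outputs above, standing in for X_train / y_train
X_tr = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 36.7, 40.3, 47.2, 50.4, 45.2],
    "Body Mass (g)": [3750.0, 3450.0, 3250.0, 4925.0, 5750.0, 5200.0],
})
y_tr = pd.Series(
    ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    name="Species",
)

# Put features and target side by side for exploration
train = X_tr.assign(Species=y_tr)

class_balance = train["Species"].value_counts(normalize=True)  # share of each class
group_means = train.groupby("Species").mean()                  # feature means per species
missing = train.isna().sum()                                   # missing values per column

print(train.describe())
print(class_balance)
print(group_means)
print(missing)
```

Exploring only the training part keeps information from the test set out of your modelling decisions.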




## 5) Feature Engineering

Feature engineering will be based on what you find in your EDA. Typical steps include:




- **Fill in missing values**
    - back fill
    - forward fill
    - linear interpolation
    - custom filling
- **Remove data (drop NAs, drop columns, etc.)**
    - dropna()
- **Create new columns (linear combination of other columns, or by using .apply())**
    - adding / subtracting / multiplying columns together
    - creating new columns using custom functions and .apply()
- **Bin our data**
    - qcut()
    - cut()
    - custom binning (using .apply())
- **Scale our data**
    - MinMaxScaler()
    - StandardScaler()
- **Convert labels into numbers**
    - LabelEncoder (clunky)
    - pd.factorize()
    - pd.get_dummies()
    - .map()
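
The options listed above can be sketched on a tiny example. The DataFrame `toy`, its values, the threshold of 4000 g and the bin labels are all hypothetical and only there to show what each call does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy frame with one gap (illustrative values only)
toy = pd.DataFrame({
    "Body Mass (g)": [3500.0, np.nan, 3900.0, 5200.0],
    "Species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

# --- Fill in missing values ---
mass = toy["Body Mass (g)"]
back_filled = mass.bfill()                    # use the next valid value
front_filled = mass.ffill()                   # use the previous valid value
interpolated = mass.interpolate()             # linear: midpoint of the neighbours
custom_filled = mass.fillna(mass.median())    # custom: here the column median

# --- Remove data ---
dropped = toy.dropna()                        # drops the row with the gap

# Continue with the interpolated version
toy["Body Mass (g)"] = interpolated

# --- Create new columns ---
toy["Body Mass (kg)"] = toy["Body Mass (g)"] / 1000
toy["is_heavy"] = toy["Body Mass (g)"].apply(lambda g: g > 4000)

# --- Bin our data ---
# cut: bins of equal width; qcut: bins with (roughly) equal row counts
toy["mass_cut"] = pd.cut(toy["Body Mass (g)"], bins=2, labels=["light", "heavy"])
toy["mass_qcut"] = pd.qcut(toy["Body Mass (g)"], q=2, labels=["lower", "upper"])

# --- Scale our data ---
mass_2d = toy[["Body Mass (g)"]]                     # scalers expect 2D input
min_max = MinMaxScaler().fit_transform(mass_2d)      # rescaled to [0, 1]
standard = StandardScaler().fit_transform(mass_2d)   # mean 0, standard deviation 1

# --- Convert labels into numbers ---
codes, uniques = pd.factorize(toy["Species"])        # integer code per label
dummies = pd.get_dummies(toy["Species"])             # one column per label
mapped = toy["Species"].map({"Adelie": 0, "Gentoo": 1})

print(toy)
print(codes, list(uniques))
```

In a real workflow, scalers and encoders should be fitted on the training data only and then applied to the test data with `.transform()`.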
       

## 6) Fit a model

In [10]:
from sklearn.dummy import DummyClassifier

In [11]:
model = DummyClassifier(strategy='most_frequent')  # initialize the baseline model
model.fit(X_train, y_train)    # train the model
model.score(X_train, y_train)  # calculate accuracy on the training data

0.5656565656565656

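**Accuracy:** The ratio of correct predictions to all predictions. With `strategy='most_frequent'` the model always predicts the most common class in the training data, so its training accuracy is simply the share of that class. This is the baseline a real model has to beat.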
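
To make the definition concrete, here is a small sketch that computes accuracy by hand and compares it with scikit-learn. The labels are made up (6 of 10 belong to the majority class), and `X_dummy`, `y_true` and `baseline` are names assumed for this example.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of 10 belong to the majority class
y_true = np.array(["Adelie"] * 6 + ["Gentoo"] * 4)
X_dummy = np.zeros((10, 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)  # always the majority class

# Accuracy by hand: correct predictions divided by all predictions
manual_accuracy = (y_pred == y_true).mean()

print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
print(baseline.score(X_dummy, y_true))
```

All three numbers agree and equal the share of the majority class.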

## 7) Cross-validate / GridSearch


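**GridSearch:** Tries every combination of hyperparameter values from a grid you define and scores each one with cross-validation, so you can pick the best-performing settings.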
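
A minimal sketch of a grid search with 5-fold cross-validation. The notebook itself only fits a `DummyClassifier`, so the `DecisionTreeClassifier`, the parameter grid and the synthetic data below are assumptions chosen for illustration, not the author's setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Made-up two-class data, loosely penguin-shaped (NOT the real measurements)
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "Culmen Length (mm)": np.concatenate([rng.normal(39, 2, n), rng.normal(47, 2, n)]),
    "Body Mass (g)": np.concatenate([rng.normal(3700, 300, n), rng.normal(5100, 300, n)]),
})
y_toy = np.array(["Adelie"] * n + ["Gentoo"] * n)

# Hypothetical grid: 3 x 2 = 6 hyperparameter combinations
param_grid = {"max_depth": [1, 2, 3], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X_toy, y_toy)  # in the workflow above this would be X_train, y_train

print(search.best_params_)  # best combination found
print(search.best_score_)   # its mean cross-validated accuracy
```

Run the search on the training data only. The test set stays untouched until the final evaluation in the next step.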
     



## 8) Testing your model on the test data

In [12]:
model.score(X_test, y_test)

0.5074626865671642

## 9) Predict

In [13]:
model.predict(X_test)

array(['Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie', 'Adelie',
       'Adelie'], dtype='<U6')