## Series 3: Logistic Regression with pipelines: How to automate workflow.

In this third and final Jupyter notebook, I demonstrate how to use pipelines to make my code reusable. Since I already completed an exploratory analysis in the previous two notebooks of this series, I exclude it here. This final notebook focuses only on using pipelines from the sklearn library to automate the workflow.

The power of a pipeline is that it lets you transform (i.e., clean) the data and make predictions on it in a single code block.
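
To make that idea concrete before touching the Titanic data, here is a minimal sketch on made-up numbers. The toy arrays, the step names, and the use of `StandardScaler` as the transform step are my own assumptions for illustration; the notebook itself uses `FunctionTransformer` and `FeatureUnion` further down.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, class 1 when the feature is large.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# One object holds both the transform step and the model.
toy_pipe = Pipeline([
    ('scale', StandardScaler()),    # transform step (stand-in for cleaning)
    ('lr', LogisticRegression()),   # prediction step
])

# fit() fits the scaler, transforms X_train, then fits the model.
toy_pipe.fit(X_train, y_train)

# predict() reapplies the fitted scaler to the new rows before predicting.
X_new = np.array([[0.0], [13.0]])
toy_predictions = toy_pipe.predict(X_new)
```

The point is that `toy_pipe.predict` applies the same transformation to new data that was learned during `fit`, so no cleaning code has to be repeated by hand.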

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('train.csv', index_col= 'PassengerId')
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [14]:

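# One-hot encode the categorical columns.
# Note: pd.get_dummies builds its columns from the data it is given,
# so train and test must contain the same categories.
def dummies(df):
    return pd.get_dummies(df[['Embarked', 'Sex']])
dummies_tf = FunctionTransformer(dummies, validate=False)

# Select the numeric columns that are used as-is.
def clean_data(df):
    return df[['Pclass', 'SibSp', 'Parch']]
ft = FunctionTransformer(clean_data, validate=False)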



In [4]:
features = FeatureUnion([('clean_data',ft), ('dummies',dummies_tf)])

In [5]:
pipe = Pipeline([('features', features),('lr',LogisticRegression())])

In [6]:
from sklearn.model_selection import cross_val_score

In [7]:
cross_val_score(pipe,df,df['Survived']).mean()

0.7845117845117846

In [8]:
pipe.fit(df,df['Survived'])

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('clean_data', FunctionTransformer(accept_sparse=False,
          func=<function clean_data at 0x1124d9950>, inv_kw_args=None,
          inverse_func=None, kw_args=None, pass_y=False, validate=False)), ('dummies', FunctionTransforme...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [9]:
test= pd.read_csv('test.csv',index_col='PassengerId')

In [10]:
predict = pipe.predict(test)

#### Below, I simply:
1.) Assign my predictions to a new Survived column.

2.) Reset the index so that PassengerId becomes a regular column.

3.) Write the PassengerId and Survived columns to a CSV file.

4.) Upload my results!


In [11]:
test['Survived'] = predict 

In [12]:
test.reset_index('PassengerId',inplace = True)

In [13]:
test[['PassengerId','Survived']].to_csv('my_results.csv',index = False)
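
As an aside, the first three steps listed above can also be written as one chained expression that leaves the original frame unmodified. This is only a sketch on a hypothetical two-row frame; `toy_test` and `toy_predict` are stand-ins I made up for `test` and `predict`.

```python
import pandas as pd

# Hypothetical stand-ins for `test` and `predict` from the cells above.
toy_test = pd.DataFrame(
    {'Pclass': [3, 1]},
    index=pd.Index([892, 893], name='PassengerId'),
)
toy_predict = [0, 1]

submission = (
    toy_test.assign(Survived=toy_predict)   # step 1: add the predictions
            .reset_index()                  # step 2: PassengerId becomes a column
            [['PassengerId', 'Survived']]   # keep only the submission columns
)

# Step 3: with no path argument, to_csv returns the CSV text as a string.
# Pass a filename such as 'my_results.csv' to write it to disk instead.
csv_text = submission.to_csv(index=False)
```

Because `assign` and `reset_index` return new frames, `toy_test` keeps its original index and columns, which makes the cell safe to rerun.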

## That's it! Thanks so much for checking out my code! I hope this helps in understanding how to progress from Data Analyst to Data Scientist!