**Learning Objectives**
- Recognize the benefits of using an IDE or text editor
- Understand the concept of Python modules
- Know how to set up projects using Cookiecutter Data Science structure

# Tricky Jupyter
Let's see some examples of Jupyter being tricky, using code snippets from the cross validation notebook.

In [17]:
#Data loading: cars data set (using car characteristics to predict the price)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge 
from sklearn.preprocessing import StandardScaler


df=pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=None)

columns= ['symboling','normalized-losses','make','fuel-type',
          'aspiration','num-of-doors','body-style','drive-wheels',
          'engine-location','wheel-base','length','width','height',
          'curb-weight','engine-type','num-of-cylinders','engine-size',
          'fuel-system','bore','stroke','compression-ratio','horsepower',
          'peak-rpm','city-mpg','highway-mpg','price']
df.columns=columns

**Simple cleaning**

In [18]:
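# Note: reset_index() keeps the old row labels as an 'index' column, which is why it appears in cars below
df_cleaned = df.replace('?', np.nan).dropna().reset_index()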
df_cleaned['price'] = df_cleaned['price'].astype(float)

cars = df_cleaned.select_dtypes(exclude=['object']).copy()

In [19]:
cars.head()

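| | index | symboling | wheel-base | length | width | height | curb-weight | engine-size | compression-ratio | city-mpg | highway-mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 2 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 10.0 | 24 | 30 | 13950.0 |
| 1 | 4 | 2 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 8.0 | 18 | 22 | 17450.0 |
| 2 | 6 | 1 | 105.8 | 192.7 | 71.4 | 55.7 | 2844 | 136 | 8.5 | 19 | 25 | 17710.0 |
| 3 | 8 | 1 | 105.8 | 192.7 | 71.4 | 55.9 | 3086 | 131 | 8.3 | 17 | 20 | 23875.0 |
| 4 | 10 | 2 | 101.2 | 176.8 | 64.8 | 54.3 | 2395 | 108 | 8.8 | 23 | 29 | 16430.0 |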


**Prepare X and y and split into train and test**

In [25]:
X, y = cars.drop('price',axis=1), df_cleaned['price']

In [26]:
X.shape

(159, 11)

What happens to X if I run this cell below twice?

In [27]:
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10)

In [28]:
X.shape

(127, 11)
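
To see the effect of rerunning the split cell, here is a minimal sketch that simulates it outside the notebook. It assumes a synthetic array with the same shape as `X` (the values are placeholders, not the cars data) and reuses the names `X` and `y` on the left-hand side, as the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as X above (159 rows, 11 features).
X = np.arange(159 * 11, dtype=float).reshape(159, 11)
y = np.arange(159, dtype=float)

# First "run" of the cell: X is overwritten with its own training portion.
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10)
shape_after_first_run = X.shape

# Second "run": the already-reduced X is split again, so it shrinks further.
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=10)
shape_after_second_run = X.shape

print(shape_after_first_run, shape_after_second_run)
```

Assigning the result to new names instead of overwriting `X` and `y` would make the cell safe to rerun.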

**Do cross validation**

In [29]:
kf = KFold(n_splits=5, shuffle=True, random_state = 71)
cv_lm_r2s, cv_lm_reg_r2s = [], [] #collect the validation results for both models

What happens to cv_lm_r2s if I run this cell below twice?

In [30]:
#this helps with the way kf will generate indices below
X, y = np.array(X), np.array(y)
for train_ind, val_ind in kf.split(X,y):
    
    X_train, y_train = X[train_ind], y[train_ind]
    X_val, y_val = X[val_ind], y[val_ind] 
    
    #simple linear regression
    lm = LinearRegression()
    lm_reg = Ridge(alpha=1)

    lm.fit(X_train, y_train)
    cv_lm_r2s.append(lm.score(X_val, y_val))
    
    #ridge with feature scaling
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    
    lm_reg.fit(X_train_scaled, y_train)
    cv_lm_reg_r2s.append(lm_reg.score(X_val_scaled, y_val))

In [31]:
cv_lm_r2s

[0.821236288914905,
 0.7034914527402574,
 0.8844247203259455,
 0.7616388470757506,
 0.7300561885585927]
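
The rerun question about `cv_lm_r2s` can be checked with a small simulation. The sketch below assumes synthetic regression data in place of the cars data and wraps the loop in a hypothetical `run_cv_cell` function that stands in for executing the loop cell. The list is created once, outside the function, just as it is created in a separate cell above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (placeholder for the cars features and prices).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=71)
cv_lm_r2s = []  # created once, like the separate initialization cell


def run_cv_cell():
    """Stand-in for the loop cell: appends one R^2 score per fold."""
    for train_ind, val_ind in kf.split(X, y):
        lm = LinearRegression()
        lm.fit(X[train_ind], y[train_ind])
        cv_lm_r2s.append(lm.score(X[val_ind], y[val_ind]))


run_cv_cell()
length_after_first_run = len(cv_lm_r2s)
run_cv_cell()  # rerunning without resetting the list keeps appending
length_after_second_run = len(cv_lm_r2s)

print(length_after_first_run, length_after_second_run)
```

Because the list is never reset, a second run doubles its length, and any average taken over it silently counts every fold twice.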

What does lm refer to at this point?

In [10]:
lm.coef_

array([-17.48436162, 315.14686238,  15.54203019,  -7.11751391,
       503.76892701,   4.87336583,   6.79210618,  42.12944193,
        18.18710716,  46.50504417, -73.19890995])
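
One way to reason about this question: in a loop like the cross-validation cell, the name `lm` is rebound on every iteration, so afterwards it points at the model fit on the final fold only (assuming no other cell has reassigned it since). A minimal sketch with synthetic data and a hypothetical `fold_models` list that keeps every fold's model for comparison:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data, used only to have something to fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = X @ np.array([3.0, -1.0]) + rng.normal(scale=0.5, size=60)

kf = KFold(n_splits=5, shuffle=True, random_state=71)
fold_models = []  # hypothetical list: one fitted model per fold
for train_ind, val_ind in kf.split(X, y):
    lm = LinearRegression()
    lm.fit(X[train_ind], y[train_ind])
    fold_models.append(lm)

# After the loop, lm is the same object as the last fold's model.
print(lm is fold_models[-1])
print(lm is fold_models[0])
```

The coefficients in `lm.coef_` therefore describe a fit on one particular training fold, not on the full data set.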

**Jupyter is not always best**  

Jupyter is great for some things, such as:
- Exploration
- Visualizations
- Integration of code, commentary, and visualizations

But it's not always the best choice because:
- Out of order execution can cause confusion, as shown above
- Hard to version control and collaborate with others
- Limited help within code cells compared with an IDE (function parameter hints, linting, refactoring tools)
- Hard to reuse code across notebooks without copying and pasting
- No [modularity](https://en.wikipedia.org/wiki/Modular_programming) -- important for larger scale projects

# Try an IDE

Instead of doing all your work in Jupyter, try an IDE like [PyCharm](https://www.jetbrains.com/pycharm/).  

**Demo Time**

# Work with Modules

Python [modules](https://docs.python.org/3/tutorial/modules.html) allow you to package your code into reusable functions and reference them from other modules or from Jupyter notebooks. A module is simply a .py file containing Python definitions and statements.
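
As a minimal sketch of the idea, the snippet below writes a tiny module to a temporary directory and imports a function from it. The module name `stats_helpers` and its `mean` function are hypothetical, and the temporary directory only stands in for your project folder. Normally you would create the .py file in your editor and skip the `sys.path` step when the file sits next to your notebook or script.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Write a tiny module to a temporary directory (stand-in for a project folder).
module_dir = Path(tempfile.mkdtemp())
(module_dir / "stats_helpers.py").write_text(
    "def mean(values):\n"
    "    # Arithmetic mean of a non-empty sequence\n"
    "    return sum(values) / len(values)\n"
)

# Python looks for modules in the directories listed on sys.path.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

from stats_helpers import mean  # the file name minus .py is the module name

average = mean([0.82, 0.70, 0.88])
print(average)
```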

### Exercise: Write a Module
1) In a module called model_selection.py, write a function that: 
- Takes X and y as inputs, both of which are numpy arrays
- Runs a manual k-fold cross validation loop to fit LinearRegression, Ridge, and Lasso
- Returns 3 lists of CV R2 scores for LinearRegression, Ridge, and Lasso (in that order)

If time permits, you can also accept other input parameters that the user might want to specify, such as n_splits (for number of k-fold splits), random_state, alpha (regularization parameter for Ridge and Lasso), etc.

Hint: Most of the code is in the for loop in section 1 of this notebook. 

2) Place the module in the same directory as this notebook.  
3) Import the function and run the function on X and y (defined in section 1)
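
If you get stuck, here is one possible shape for the module, offered as a sketch and not as the reference solution. The function name `cross_validate_linear_models` and the default values are assumptions. The choice to fit Ridge and Lasso on scaled features follows the loop shown earlier, which scales before fitting Ridge. The demo at the bottom uses synthetic data only to show the call.

```python
# model_selection.py (one possible sketch)
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def cross_validate_linear_models(X, y, n_splits=5, random_state=71, alpha=1.0):
    """Return lists of validation R^2 scores for LinearRegression, Ridge, Lasso."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    lm_r2s, ridge_r2s, lasso_r2s = [], [], []  # fresh lists on every call

    for train_ind, val_ind in kf.split(X, y):
        X_train, y_train = X[train_ind], y[train_ind]
        X_val, y_val = X[val_ind], y[val_ind]

        # Plain linear regression on unscaled features.
        lm = LinearRegression().fit(X_train, y_train)
        lm_r2s.append(lm.score(X_val, y_val))

        # Scale using statistics from the training fold only.
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled = scaler.transform(X_val)

        ridge = Ridge(alpha=alpha).fit(X_train_scaled, y_train)
        ridge_r2s.append(ridge.score(X_val_scaled, y_val))

        lasso = Lasso(alpha=alpha).fit(X_train_scaled, y_train)
        lasso_r2s.append(lasso.score(X_val_scaled, y_val))

    return lm_r2s, ridge_r2s, lasso_r2s


if __name__ == "__main__":
    # Synthetic placeholder data, used only to demonstrate the call.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(100, 4))
    y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
    scores = cross_validate_linear_models(X_demo, y_demo)
    print([len(s) for s in scores])
```

Because the score lists are created inside the function, calling it twice returns two independent sets of results instead of appending to shared state.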

In [11]:
# The autoreload IPython extension reloads your modules before executing code, so any
# updates to your Python files are picked up in Jupyter the next time you call a function.
%load_ext autoreload
%autoreload 2
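
To see why reloading matters, the sketch below reproduces the problem in plain Python using `importlib.reload` from the standard library. It illustrates the stale-module behavior that autoreload is meant to spare you from and does not describe how the extension is implemented. The module name `greeting_demo` is hypothetical, and the temporary directory stands in for your project folder.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # avoid stale bytecode caches in this quick demo

module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "greeting_demo.py"  # hypothetical module name
module_path.write_text("def greet():\n    return 'v1'\n")

sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
import greeting_demo

before_edit = greeting_demo.greet()

# Simulate editing the file in an IDE while the session is still running.
module_path.write_text("def greet():\n    return 'version 2'\n")
stale_result = greeting_demo.greet()  # the old code is still in memory

importlib.reload(greeting_demo)  # re-execute the file's current contents
after_reload = greeting_demo.greet()

print(before_edit, stale_result, after_reload)
```

Without a reload, the running session keeps using the version of the function it imported first, even though the file on disk has changed.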

In [None]:
# IMPORT AND RUN YOUR FUNCTION HERE

# Cookiecutter Data Science

[Cookiecutter Data Science](https://drivendata.github.io/cookiecutter-data-science/) is a project structure that will help to organize and present your projects.

Run the following to install Cookiecutter and start up a new project using its structure:
```
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
```

You will be prompted to enter some information about your project name and details. A new project directory will be created containing all the Cookiecutter subfolders and files.

**Walkthrough time**

# Takeaways
- Jupyter is not always the best tool
- Practice coding outside Jupyter, in an IDE or text editor, to prepare for life outside Metis
- Write functions in modules to develop reusable, organized code 
- Try Cookiecutter for a flexible yet standardized structure for future data science projects