## Efficient Data Science Workflows Use Functions in .py Files

To avoid the clutter of Jupyter notebooks and to aid collaboration, an efficient data science workflow puts most of its work into **functions**.  

These functions are then put inside **.py files** and called to run through whole chunks of processing at a time.

We'll run through an example below.

### Imports

In [1]:
#run this cell w/o changes

#data manip
import pandas as pd
import numpy as np

#tests
from test_background import pkl_dump, test_obj_dict, run_test_dict, run_test

**Load in** fight_songs.csv from the data folder as a dataframe

In [2]:
fight_songs = pd.read_csv('data/fight_songs.csv')

Notice that the `year` column has **some weird values** in it, and is an object dtype (specifically, each value is stored as a string)

In [3]:
print(fight_songs.year.value_counts().head())

type(fight_songs['year'][0])

Unknown    5
1912       4
1915       4
1919       3
1950       3
Name: year, dtype: int64


str
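One way to hunt down the weird values without scrolling through `value_counts` is to try a numeric conversion and see what fails. This is a minimal sketch on a made-up Series (the values below are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the `year` column; these values are made up
years = pd.Series(['1912', 'Unknown', '1915', 'Unknown', '1950'], name='year')

# errors='coerce' turns anything that cannot be parsed as a number into NaN
as_numbers = pd.to_numeric(years, errors='coerce')

# The entries that failed to parse are the "weird" values
weird_values = sorted(set(years[as_numbers.isna()]))
print(weird_values)
```

On the real dataframe the same idea would be applied to `fight_songs['year']`.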

Write a quick function to **turn the value `"Unknown"` into `np.nan`**, wherever it appears in the dataframe.  

**Include two parameters** (objects inside the parens of the function that are inputs used inside the function): 
- the dataframe 
- the value being replaced as `np.nan`

(but it's ok to hardcode `np.nan` as what's replacing the value)

*Don't forget the docstring!*

Run it with the correct arguments as inputs and assign it to `fight_songs`

In [5]:
def turn_value_null(fight_df, to_replace_value='Unknown'):
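    '''
    Replace a placeholder value in the `year` column with np.nan.
    
    parameters:
    fight_df: dataframe loaded from fight_songs.csv
    to_replace_value: value to replace with np.nan (default 'Unknown')
    
    returns:
    a copy of fight_df with to_replace_value in `year` replaced by np.nan
    '''
    # work on a copy so the dataframe passed in is left untouched
    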
    frame = fight_df.copy()
    frame['year'] = frame['year'].replace(to_replace_value, np.nan)
    
    return frame
    

In [None]:
dir()

In [7]:
fight_songs = turn_value_null(fight_songs, 'Unknown')
fight_songs['year'].value_counts()

1912    4
1915    4
1919    3
1950    3
1909    3
1947    2
1908    2
1936    2
1914    2
1923    2
1959    1
1955    1
1922    1
1948    1
1925    1
1917    1
1918    1
1916    1
1924    1
1898    1
1937    1
1934    1
1933    1
1907    1
1930    1
1962    1
1952    1
1893    1
1911    1
1946    1
1910    1
1967    1
1941    1
1939    1
1928    1
1904    1
1961    1
1968    1
1926    1
1913    1
1905    1
1927    1
1931    1
Name: year, dtype: int64
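The output above no longer lists `Unknown`, but it also does not show how many values became null, because `value_counts` skips `NaN` by default. A small sketch on a made-up frame (hypothetical values) shows two ways to check:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataframe
toy = pd.DataFrame({'year': ['1912', 'Unknown', '1915', 'Unknown']})

# Same replacement the function above performs, done on a copy
cleaned = toy.copy()
cleaned['year'] = cleaned['year'].replace('Unknown', np.nan)

# Default behaviour: NaN is left out of the counts
counts_default = cleaned['year'].value_counts()

# dropna=False adds a row for the missing values
counts_with_nan = cleaned['year'].value_counts(dropna=False)

# Or count the nulls directly
n_missing = int(cleaned['year'].isna().sum())
print(n_missing)
```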

Now, write a function that **removes all the nulls**.

Again, use the dataframe as a parameter to the function 

Run it with the correct arguments as inputs and assign it to `fight_songs`

In [8]:
def drop_nulls(fight_df):
    '''
    Drop every row of fight_df that contains at least one null value
    and return the result.
    '''
    frame = fight_df.dropna()
    
    return frame

fight_songs = drop_nulls(fight_songs)
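Keep in mind that `dropna()` with no arguments drops a row if *any* column is null, not just `year`. The sketch below uses a made-up frame (hypothetical columns and values) to compare that with restricting the check to one column through the `subset` argument:

```python
import numpy as np
import pandas as pd

# Made-up rows: one null in `year`, one null in `writers`
toy = pd.DataFrame({
    'school': ['A', 'B', 'C'],
    'year': ['1912', np.nan, '1950'],
    'writers': ['x', 'y', np.nan],
})

# Drops rows B and C, since each has a null somewhere
all_columns = toy.dropna()

# Drops only row B, since only `year` is checked
year_only = toy.dropna(subset=['year'])

print(len(all_columns), len(year_only))
```

Which behavior you want depends on the analysis; the function above uses the first one.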

Finally, write a function to **turn the `type` of the `year` column into an `int`**

This time, have the column be a parameter

Call the function and assign it to `fight_songs['year']` (written out for you)

In [20]:
def turn_column_int(year_series):
    '''
    Take a column (pandas Series) from the fight songs dataframe
    and return it converted to int.
    '''
    
    column = year_series.astype(int)
    return column

fight_songs['year'] = turn_column_int(fight_songs['year'])
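The order of the steps matters here: a plain `int` column cannot hold a missing value, so the nulls have to be gone before the conversion. A minimal sketch on a made-up Series (hypothetical values), with pandas' nullable `Int64` dtype shown as an alternative that the notebook itself does not use:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the `year` column after "Unknown" became NaN
years = pd.Series(['1912', np.nan, '1950'], name='year')

# Drop the nulls first, then the string-to-int conversion goes through
converted = years.dropna().astype(int)

# Alternative: keep the missing row by using the nullable Int64 dtype
nullable = pd.to_numeric(years, errors='coerce').astype('Int64')

print(converted.tolist())
```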

In [14]:
#run this to check your work

run_test(fight_songs, 'fight_songs')

'Try again'

## Now the fun part:

**Write a function** with the file path as the parameter, that:
- **calls**  `turn_value_null`, `drop_nulls`, and `turn_column_int` **sequentially**
    - (make sure to include all the specific parameters of those functions called above which are necessary to make them run)
    
    
- **returns** a dataframe at the end

It should be ***the same columns, rows and data*** as the dataframe we ended up with above

In [18]:
def load_clean_fight_songs(file_path):
    '''
    prepare cleaned version of fight_song csv loaded from file_path
    call turn_value_null, drop_nulls, and turn_column_int
    '''
    
    df = pd.read_csv(file_path)
    df = turn_value_null(df, 'Unknown')
    df = drop_nulls(df)
    df['year'] = turn_column_int(df['year'])
    
    return df
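The same sequence can also be written as a method chain with `DataFrame.pipe`. This is only a sketch of an alternative style, not the expected answer: the three helpers are repeated so the block runs on its own, and the CSV is a made-up in-memory file rather than `data/fight_songs.csv`.

```python
import io

import numpy as np
import pandas as pd


def turn_value_null(fight_df, to_replace_value='Unknown'):
    frame = fight_df.copy()
    frame['year'] = frame['year'].replace(to_replace_value, np.nan)
    return frame


def drop_nulls(fight_df):
    return fight_df.dropna()


def turn_column_int(year_series):
    return year_series.astype(int)


# Made-up rows standing in for the real file
fake_csv = io.StringIO(
    "school,year\n"
    "School A,1912\n"
    "School B,Unknown\n"
    "School C,1950\n"
)

# Each step receives the frame returned by the previous one
cleaned = (
    pd.read_csv(fake_csv)
    .pipe(turn_value_null, 'Unknown')
    .pipe(drop_nulls)
    .assign(year=lambda df: turn_column_int(df['year']))
)

print(cleaned)
```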


In [21]:
load_clean_fight_songs('data/fight_songs.csv').head()

| Unnamed: 0 | school | conference | song_name | writers | year | student_writer | official_song | contest | bpm | sec_duration | ... | win_won | victory_win_won | rah | nonsense | colors | men | opponents | spelling | trope_count | spotify_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Notre Dame | Independent | Victory March | Michael J. Shea and John F. Shea | 1908 | No | Yes | No | 152 | 64 | ... | Yes | Yes | Yes | No | Yes | Yes | No | No | 6 | 15a3ShKX3XWKzq0lSS48yr |
| 1 | Baylor | Big 12 | Old Fight | Dick Baker and Frank Boggs | 1947 | Yes | Yes | No | 76 | 99 | ... | Yes | Yes | No | No | Yes | No | No | Yes | 5 | 2ZsaI0Cu4nz8DHfBkPt0Dl |
| 2 | Iowa State | Big 12 | Iowa State Fights | Jack Barker, Manly Rice, Paul Gnam, Rosalind K... | 1930 | Yes | Yes | No | 155 | 55 | ... | No | No | Yes | No | No | Yes | No | Yes | 4 | 3yyfoOXZQCtR6pfRJqu9pl |
| 3 | Kansas | Big 12 | I'm a Jayhawk | George "Dumpy" Bowles | 1912 | Yes | Yes | No | 137 | 62 | ... | No | No | No | Yes | No | Yes | Yes | No | 3 | 0JzbjZgcjugS0dmPjF9R89 |
| 4 | Kansas State | Big 12 | Wildcat Victory | Harry E. Erickson | 1927 | Yes | Yes | No | 80 | 67 | ... | No | Yes | No | No | Yes | No | No | No | 3 | 4xxDK4g1OHhZ44sTFy8Ktm |


In [22]:
#run this cell to test your code!

fight_songs_function_test = load_clean_fight_songs('data/fight_songs.csv')

run_test(fight_songs_function_test, 'fight_songs')

'Try again'

## Now the *really* fun part:


Open a new **text file**, and **save it** as `data_cleaning.py`

**Write out import statements for pandas and numpy**, using the same aliases we always do, in the same manner we always do

**Write out** (in order to get your fingers some muscle memory time) **all the functions** you made above, in the order you made them

At the top of `data_cleaning.py`, **write** (again, don't copy) in triple-quotes (like a docstring) the following:

'''
These functions are used to clean the fight_songs.csv dataset

load_clean_fight_songs can be used with a path to the file to load the csv into a dataframe, run cleaning functions, and return a clean frame

Individually, they are used to:


- turn_value_null: change values of "Unknown" into np.nan
- drop_nulls: drop the rows with np.nan values
- turn_column_int: change the 'year' column into an int type
- load_clean_fight_songs: calls the above functions sequentially and returns the frame
'''

# A Note on the Path Variable

"The variable sys.path is a list of strings that determines the interpreter’s search path for modules."  
[python_docs](https://docs.python.org/3/tutorial/modules.html)

In [24]:
import sys
sys.path

['C:\\Users\\Erin Vu\\documents\\flatiron\\phase2\\py_files_vp',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\python38.zip',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\DLLs',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env',
 '',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib\\site-packages',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib\\site-packages\\win32',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib\\site-packages\\win32\\lib',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib\\site-packages\\Pythonwin',
 'C:\\Users\\Erin Vu\\anaconda3\\envs\\learn-env\\lib\\site-packages\\IPython\\extensions',
 'C:\\Users\\Erin Vu\\.ipython']

Make a copy of `data_cleaning.py` and rename it `dc.py`.  Move it into the empty `src` folder.  

In [25]:
# This will not work
from dc import load_clean_fight_songs

ModuleNotFoundError: No module named 'dc'

> We have to specify the path to the `src` folder.

In [26]:
from src.dc import load_clean_fight_songs

ModuleNotFoundError: No module named 'src.dc'

Depending on how you structure your projects, you may have to add to your path.

In [28]:
sys.path.append('..')
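To see why appending to `sys.path` helps, here is a self-contained sketch that does not touch the project files. It writes a throwaway module into a temporary folder (the module name `demo_cleaning_helpers` is made up for this illustration), shows that the import fails while the folder is not on the path, and succeeds once it is:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module name, chosen so it does not clash with anything installed
module_name = 'demo_cleaning_helpers'

with tempfile.TemporaryDirectory() as tmp:
    # A tiny module living in a folder Python does not know about yet
    Path(tmp, f'{module_name}.py').write_text('def greet():\n    return "hello"\n')

    # The folder is not on sys.path, so the import fails
    try:
        importlib.import_module(module_name)
        found_before = True
    except ModuleNotFoundError:
        found_before = False

    # Add the folder to the search path and try again
    sys.path.append(tmp)
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    found_after = module.greet() == 'hello'

    # Clean up so the demo leaves no trace in this session
    sys.path.remove(tmp)
    sys.modules.pop(module_name, None)

print(found_before, found_after)
```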


In [29]:
import os

os.getcwd()

'C:\\Users\\Erin Vu\\documents\\flatiron\\phase2\\py_files_vp'

In [None]:
# You can add an absolute path by splitting on the repo name

# fill in your repository's folder name; an empty string makes .split() raise a ValueError
repo_name = ''

root = os.getcwd().split(repo_name)[0] + repo_name
sys.path.append(root)

In [None]:
sys.path

> This way, no matter which folder in the repo you import `src` from, you will not encounter an error.
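Here is what the split-on-repo-name trick produces, sketched on a made-up path so it runs anywhere. Both the working directory and the repo name `my_repo` are hypothetical; in the notebook they would come from `os.getcwd()` and your own repository folder.

```python
from pathlib import PurePosixPath

# Hypothetical working directory and repo name, for illustration only
cwd = '/home/user/projects/my_repo/notebooks/exploration'
repo_name = 'my_repo'

# Everything before the repo name, plus the repo name itself
root = cwd.split(repo_name)[0] + repo_name

# The same result using path components instead of string splitting
parts = PurePosixPath(cwd).parts
root_via_pathlib = str(PurePosixPath(*parts[: parts.index(repo_name) + 1]))

print(root)
print(root_via_pathlib)
```

The string split assumes the repo name appears only once in the path; working with path components avoids matching a folder that merely contains the name.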

# Why This Matters

The workflow that will make you an efficient data scientist goes something like this:

- **Write preliminary code** in Jupyter Notebooks
- Complete a **small** section of code that you know completes a necessary task
- **Write that code into a function** in a .py file
- In another notebook, **import that function** and run it

#### There are -several- advantages to doing this

- **Jupyter Notebooks are MeSsY**
    - Easy to jump around cells and **lose track** of what you're doing
    - Easy to **change the value of a variable** and not remember it later
    - Not that easy to **combine work**
    
    
- Importing functions through **.py files** into another notebook **helps mitigate** those problems
    - Your important work is all in **one spot without the clutter** of producing that work
    - Everything's in a tidy package, and so it's **harder for variables to get re-named**
    - **Combining work becomes easier**. Instead of sharing code through Jupyter Notebooks, and having to figure out which cells to run in what order, we can share .py files where we've already put in the work of figuring out what to run in what order as we've been working
    
    
- **Saves time in the long run**
    - Might not seem worth the time investment at first, but as your projects become bigger and more sprawling the problems it helps mitigate will become laRG**ER**
    - Doing this forces a **marathon mentality over a sprint mentality**, and helps keep one focused on small, necessary tasks


![](viz/siren.gif)     ![](viz/siren.gif)
# Is This Required for the Project?
![](viz/siren.gif)     ![](viz/siren.gif)

No


### Should we try it?

Sure!  But if it seems like it's becoming a hindrance to getting stuff done, go ahead and skip it.

In [23]:
%load_ext autoreload
%autoreload 2