# Quick .py File Example Warmup 

Feel free to do what you need to do for the code challenge (i.e., rest), but go through this at some point before you start the project in earnest.

![](viz/rest.gif)

## Efficient Data Science Workflows Use Functions in .py Files

In order to avoid the clutter of Jupyter notebooks and to aid collaboration, an efficient data science workflow puts most of its work into **functions**.

These functions are then put inside **.py files** and called to run through whole chunks of processing at a time.

We'll run through an example below.

### Imports

In [3]:
#run this cell w/o changes

#data manip
import pandas as pd
import numpy as np

#tests
from test_background import pkl_dump, test_obj_dict, run_test_dict, run_test

**Load in** fight_songs.csv from the data folder as a dataframe

In [4]:

fight_songs = pd.read_csv('data/fight_songs.csv')

fight_songs.head()

| | school | conference | song_name | writers | year | student_writer | official_song | contest | bpm | sec_duration | ... | win_won | victory_win_won | rah | nonsense | colors | men | opponents | spelling | trope_count | spotify_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Notre Dame | Independent | Victory March | Michael J. Shea and John F. Shea | 1908 | No | Yes | No | 152 | 64 | ... | Yes | Yes | Yes | No | Yes | Yes | No | No | 6 | 15a3ShKX3XWKzq0lSS48yr |
| 1 | Baylor | Big 12 | Old Fight | Dick Baker and Frank Boggs | 1947 | Yes | Yes | No | 76 | 99 | ... | Yes | Yes | No | No | Yes | No | No | Yes | 5 | 2ZsaI0Cu4nz8DHfBkPt0Dl |
| 2 | Iowa State | Big 12 | Iowa State Fights | Jack Barker, Manly Rice, Paul Gnam, Rosalind K... | 1930 | Yes | Yes | No | 155 | 55 | ... | No | No | Yes | No | No | Yes | No | Yes | 4 | 3yyfoOXZQCtR6pfRJqu9pl |
| 3 | Kansas | Big 12 | I'm a Jayhawk | George "Dumpy" Bowles | 1912 | Yes | Yes | No | 137 | 62 | ... | No | No | No | Yes | No | Yes | Yes | No | 3 | 0JzbjZgcjugS0dmPjF9R89 |
| 4 | Kansas State | Big 12 | Wildcat Victory | Harry E. Erickson | 1927 | Yes | Yes | No | 80 | 67 | ... | No | Yes | No | No | Yes | No | No | No | 3 | 4xxDK4g1OHhZ44sTFy8Ktm |


Notice that the `year` column has **some weird values** in it, and has an object dtype (specifically, its values are strings).
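One way to spot values like that is to try converting the column to numbers and see what fails. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real `fight_songs` data) and `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into `NaN`.

```python
import pandas as pd

# Hypothetical stand-in for the 'year' column; the real values come from fight_songs.csv
toy = pd.DataFrame({"year": ["1908", "1947", "Unknown", "1930"]})

# A non-numeric dtype on a column of years hints that not every value is a number
print(toy["year"].dtype)

# Coerce to numbers; anything that cannot be parsed becomes NaN
as_numbers = pd.to_numeric(toy["year"], errors="coerce")

# Keep only the original values that failed to parse
weird_values = toy.loc[as_numbers.isna(), "year"].unique().tolist()
print(weird_values)
```

On the real frame, the same idea would be applied to `fight_songs['year']`.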

Write a quick function to **turn the value `"Unknown"` into `np.nan`**, wherever it appears in the dataframe.  

**Include two parameters** (objects inside the parens of the function that are inputs used inside the function): 
- the dataframe 
- the value to replace with `np.nan`

(but it's ok to hardcode `np.nan` as what's replacing the value)

*Don't forget the docstring!*

Run it with the correct arguments as inputs and assign it to `fight_songs`

In [3]:

def turn_value_null(frame, value):
    '''
    data cleaning: turn argument value to null
    
    input: 
        frame: dataframe
        value: specific value to turn to np.nan
        
    output: frame w/ all occurrences of value replaced w/ np.nan
    '''
    frame = frame.replace(value, np.nan)
    return frame


fight_songs = turn_value_null(fight_songs, 'Unknown')

print(f'fight_songs now has {fight_songs.year.isnull().sum()} nulls')

fight_songs now has 5 nulls


Now, write a function that **removes all the nulls**.

Again, use the dataframe as a parameter to the function 

Run it with the correct arguments as inputs and assign it to `fight_songs`

In [4]:

def drop_nulls(frame):
    '''
    data cleaning: drop rows w/ np.nan anywhere in frame
    
    input: dataframe 
    output: dataframe w/ rows w/ np.nan dropped
    '''
    
    frame = frame.dropna(axis=0, how="any")
    
    return frame

fight_songs = drop_nulls(fight_songs)

fight_songs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 64
Data columns (total 23 columns):
school             60 non-null object
conference         60 non-null object
song_name          60 non-null object
writers            60 non-null object
year               60 non-null object
student_writer     60 non-null object
official_song      60 non-null object
contest            60 non-null object
bpm                60 non-null int64
sec_duration       60 non-null int64
fight              60 non-null object
number_fights      60 non-null int64
victory            60 non-null object
win_won            60 non-null object
victory_win_won    60 non-null object
rah                60 non-null object
nonsense           60 non-null object
colors             60 non-null object
men                60 non-null object
opponents          60 non-null object
spelling           60 non-null object
trope_count        60 non-null int64
spotify_id         60 non-null object
dtypes: int64(4), object(19)
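Note that the index above runs "0 to 64" even though only 60 rows remain: `dropna` keeps the original row labels. Here is a minimal sketch on a made-up frame (all names and values are hypothetical) showing that, along with the `subset` argument for when you only care about nulls in certain columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one null in 'year', one null in 'writers'
toy = pd.DataFrame({
    "school": ["A", "B", "C", "D"],
    "year": ["1908", np.nan, "1930", "1912"],
    "writers": ["X", "Y", np.nan, "Z"],
})

# how="any" drops a row if any column in it is null
dropped_any = toy.dropna(axis=0, how="any")

# subset limits the null check to the named columns
dropped_year_only = toy.dropna(axis=0, subset=["year"])

# The surviving rows keep their original labels, so the index has gaps
print(dropped_any.index.tolist())

# reset_index(drop=True) renumbers from 0 if a gap-free index is wanted
renumbered = dropped_any.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to reset the index is a judgment call; the exercise here leaves it alone.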

Finally, write a function to **turn the `type` of the `year` column into an `int`**

This time, have the column be a parameter

Call the function and assign it to `fight_songs['year']` (written out for you)

In [5]:

def turn_column_int(column):
    '''
    data cleaning: turn column to int type
    
    input: column from dataframe
    output: column as int type
    '''
    column = column.astype(int)
    return column

fight_songs['year'] = turn_column_int(fight_songs['year'])

#used for tests:
# pkl_dump([
#     (
#         fight_songs,
#         'fight_songs'
        
#     )
# ])
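The order of the steps matters here: a column that still holds `np.nan` can't be cast to `int`. The sketch below shows this on a made-up column (`years` is hypothetical), catching the error the cast raises and then retrying after the null is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: years stored as strings, one already turned into NaN
years = pd.Series(["1908", np.nan, "1930"], dtype=object)

try:
    years.astype(int)
    conversion_failed = False
except (ValueError, TypeError):
    # NaN has no integer representation, so the cast is rejected
    conversion_failed = True

# Once the null row is gone, the same cast goes through
clean_years = years.dropna().astype(int)

print(conversion_failed)
print(clean_years.tolist())
```

That's why the nulls get dropped before `turn_column_int` is called.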

Now, write a function that **loads fight_songs.csv** into a dataframe and returns it. *(It doesn't need any parameters!)*

In [10]:

def load_fight_songs():
    
    '''
    loads in fight_songs.csv from the data folder using pd.read_csv
    
    outputs: dataframe of fight_songs.csv
    '''
    
    df = pd.read_csv('data/fight_songs.csv')
    
    return df

## Now the fun part:

**Write a function** (which doesn't take in any parameters) that:
- **calls** `load_fight_songs`, `turn_value_null`, `drop_nulls`, and `turn_column_int` **sequentially**
    - (make sure to include all the specific parameters of those functions called above which are necessary to make them run)
    
    
- **returns** a dataframe at the end

It should be ***the same columns, rows and data*** as the dataframe we ended up with above

In [15]:

def load_clean_fight_songs():
    '''
    runs sequentially:
        load_fight_songs() 
            - loads fight_songs.csv
        
        df = turn_value_null(df, 'Unknown') 
            - turns values "Unknown" to np.nan
        
        df = drop_nulls(df)
            - drops null rows from df
            
        df['year'] = turn_column_int(df['year'])
            - turns 'year' column to int type
            
    result:
        fight_songs.csv loaded and cleaned
    '''
    
    
    df = load_fight_songs()
    df = turn_value_null(df, 'Unknown')
    df = drop_nulls(df)
    df['year'] = turn_column_int(df['year'])
    
    return df
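As an aside, the same "run these functions one after another" idea can also be written as a chain with `DataFrame.pipe`. This is only an alternative sketch on made-up data, not something the exercise requires; `year_to_int` is a hypothetical frame-level wrapper added so the column cast fits into the chain.

```python
import numpy as np
import pandas as pd


def turn_value_null(frame, value):
    """Replace every occurrence of value with np.nan."""
    return frame.replace(value, np.nan)


def drop_nulls(frame):
    """Drop rows containing any null."""
    return frame.dropna(axis=0, how="any")


def year_to_int(frame):
    """Hypothetical wrapper: return the frame with 'year' cast to int."""
    return frame.assign(year=frame["year"].astype(int))


# Hypothetical toy data standing in for fight_songs.csv
toy = pd.DataFrame({"school": ["A", "B", "C"], "year": ["1908", "Unknown", "1930"]})

# Step-by-step version, in the same order as load_clean_fight_songs
step = turn_value_null(toy, "Unknown")
step = drop_nulls(step)
step = year_to_int(step)

# Chained version: each .pipe passes the frame along as the first argument
chained = toy.pipe(turn_value_null, "Unknown").pipe(drop_nulls).pipe(year_to_int)

print(chained.equals(step))
```

Both versions do the same work; the step-by-step form is usually easier to debug while you're still figuring things out.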

## Now the *really* fun part:


Open a new **text file**, and **save it** as `data_cleaning.py`

**Write out import statements for pandas and numpy**, using the same aliases we always do, in the same manner we always do

**Write out** (in order to get your fingers some muscle memory time) **all five functions** you made above, in the order you made them

At the top of `data_cleaning.py`, **write** (again, don't copy) in triple-quotes (like a docstring) the following:

'''
These functions are used to clean the fight_songs.csv dataset

load_clean_fight_songs can be used without parameters to load the csv into a dataframe, run cleaning functions, and return a clean frame

Individually, they are used to:

\- load_fight_songs: load the csv into a dataframe

\- turn_value_null: change values of "Unknown" into np.nan

\- drop_nulls: drop the rows with np.nan values

\- turn_column_int: change the 'year' column into an int type


\- load_clean_fight_songs calls the above functions sequentially and returns the frame
'''

### Now the ***REALLY*** fun part

Switch .py files with someone from the cohort

Save it in this repo as `testing_data_cleaning.py`

***Restart your kernel***

Run the cell below to test your fellow student's work!

In [None]:

from testing_data_cleaning import load_clean_fight_songs
from test_background import pkl_dump, test_obj_dict, run_test_dict, run_test

test_frame = load_clean_fight_songs()

run_test(test_frame, 'fight_songs')
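The internals of `run_test` aren't shown here, so the following is only a guess at the kind of check involved: comparing a returned frame against a saved reference. The sketch uses `pd.testing.assert_frame_equal` on made-up frames (`expected` and `candidate` are hypothetical stand-ins).

```python
import pandas as pd

# Hypothetical frames: 'expected' stands in for a saved reference frame,
# 'candidate' for what a classmate's load_clean_fight_songs() returns
expected = pd.DataFrame({"school": ["A", "C"], "year": [1908, 1930]}, index=[0, 2])
candidate = pd.DataFrame({"school": ["A", "C"], "year": [1908, 1930]}, index=[0, 2])

# Raises AssertionError describing the first mismatch; silent on success
pd.testing.assert_frame_equal(candidate, expected)

# A frame whose 'year' column was left as strings should not pass
wrong_dtype = candidate.assign(year=candidate["year"].astype(str))
try:
    pd.testing.assert_frame_equal(wrong_dtype, expected)
    mismatch_caught = False
except AssertionError:
    mismatch_caught = True

print(mismatch_caught)
```

A check like this is strict about dtypes and index labels, not just the visible values, which is why the cleaning steps need to match exactly.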

# Why This Matters

The workflow that will make you an efficient data scientist goes something like this:

- **Write preliminary code** in Jupyter Notebooks
- Complete a **small** section of code that you know completes a necessary task
- **Write that code into a function** in a .py file
- In another notebook, **import that function** and run it
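The last two steps can be sketched as follows. The module name `toy_cleaning` and the function `double_values` are hypothetical, and the file is written to a temporary folder only so the sketch runs anywhere; in a real project the .py file would sit next to your notebook and a plain `from toy_cleaning import double_values` would do.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module contents: one small function that does one task
module_source = '''
def double_values(values):
    """Return a new list with every value doubled."""
    return [v * 2 for v in values]
'''

with tempfile.TemporaryDirectory() as folder:
    # Write the function into a .py file
    Path(folder, "toy_cleaning.py").write_text(module_source)

    # Make the folder importable, then import the module by name
    sys.path.insert(0, folder)
    try:
        importlib.invalidate_caches()
        toy_cleaning = importlib.import_module("toy_cleaning")
        result = toy_cleaning.double_values([1, 2, 3])
    finally:
        sys.path.remove(folder)

print(result)
```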

#### There are -several- advantages to doing this

- **Jupyter Notebooks are MeSsY**
    - Easy to jump around cells and **lose track** of what you're doing
    - Easy to **change the value of a variable** and not remember it later
    - Not that easy to **combine work**
    
    
- Importing functions through **.py files** into another notebook **helps mitigate** those problems
    - Your important work is all in **one spot without the clutter** of producing that work
    - Everything's in a tidy package, and so it's **harder for variables to get reassigned** by accident
    - **Combining work becomes easier**. Instead of sharing code through Jupyter Notebooks, and having to figure out which cells to run in what order, we can share .py files where we've already put in the work of figuring out what to run in what order as we've been working
    
    
- **Saves time in the long run**
    - It might not seem worth the time investment at first, but as your projects become bigger and more sprawling, the problems it helps mitigate will become laRG**ER**
    - Doing this forces a **marathon mentality over a sprint mentality**, and helps keep one focused on small, necessary tasks


![](viz/siren.gif)     ![](viz/siren.gif)
# Is This Required for the Project? 
![](viz/siren.gif)     ![](viz/siren.gif)

No


### Should we try it?

Sure! But if it seems like it's becoming a hindrance to getting stuff done, go ahead and skip it.