## Clean Up From Day 1

1. How to view global variables
2. External help pages
3. Keyword arguments

In [2]:
import pandas as pd

### Global Variables

In [2]:
# define a bunch of different variables
a = 4
b = 'otter'
c = pd.DataFrame({'colA': [1,2,3], 'colB': ['a','b','c']})
d = ['my', 'list', 'of', 'stuff']

In [3]:
# ... suppose we have forgotten what we defined above, or deleted something. Which variables exist right now?

We can use a Jupyter (IPython) **magic command** to get a list of everything that exists in the notebook right now.

Magic commands are called by prefixing them with `%` inside a code cell.

The `%whos` magic lists all variables, their types, and some info about each variable!

In [3]:
%whos

Variable   Type         Data/Info
---------------------------------
a          int          4
b          str          otter
c          DataFrame       colA colB\n0     1    <...>     2    b\n2     3    c
d          list         n=4
pd         module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
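
The `%whos` magic only exists inside Jupyter/IPython. In a plain Python script, a rough equivalent is to loop over the dictionary returned by the built-in `globals()`. The sketch below assumes a simple name/type listing is enough; `list_variables` and `example_namespace` are hypothetical names used only for illustration.

```python
def list_variables(namespace):
    """Return (name, type name) pairs for the public names in a namespace dict."""
    rows = []
    for name, value in sorted(namespace.items()):
        # skip internal names such as __name__ or IPython's _ history variables
        if name.startswith('_'):
            continue
        rows.append((name, type(value).__name__))
    return rows


# a hand-made dictionary standing in for the result of globals()
example_namespace = {
    'a': 4,
    'b': 'otter',
    'd': ['my', 'list', 'of', 'stuff'],
    '__name__': '__main__',
}

for name, type_name in list_variables(example_namespace):
    print(f'{name:<10} {type_name}')
```

In a real script, `list_variables(globals())` would list that script's own variables, along with any imported modules and defined functions.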


---

If a DataFrame or other variable is using a lot of memory, we can delete it once we are done with it to free up space.

Generally this is not needed, but it's good to know about!
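
Before deleting anything, it can help to check how much memory an object actually uses. This is a minimal sketch, assuming a small stand-in DataFrame like the one defined earlier. It uses the pandas `memory_usage` method and the standard library's `sys.getsizeof`.

```python
import sys
import pandas as pd

# small stand-in DataFrame; a real one would usually be much larger
c = pd.DataFrame({'colA': [1, 2, 3], 'colB': ['a', 'b', 'c']})

# bytes used by the index and each column; deep=True also counts the contents of string columns
bytes_per_column = c.memory_usage(deep=True)
total_bytes = int(bytes_per_column.sum())
print(bytes_per_column)
print('total bytes:', total_bytes)

# for ordinary Python objects, sys.getsizeof reports the size of the object itself
# (for a list, this does not include the items it contains)
d = ['my', 'list', 'of', 'stuff']
print('list bytes:', sys.getsizeof(d))
```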

In [4]:
# delete a variable (del is a statement, so no parentheses are needed)
del d

In [5]:
%whos

Variable   Type         Data/Info
---------------------------------
a          int          4
b          str          otter
c          DataFrame       colA colB\n0     1    <...>     2    b\n2     3    c
pd         module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


In [6]:
d

NameError: name 'd' is not defined

### External Help Pages

We talked a lot about the help functions built directly into Python and Jupyter, but a library's official online documentation will often have the clearest, best laid out information for a function.

For example, if we go straight to the [Pandas read_csv documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html), we get:

* A fairly easy-to-read function signature with all the default values displayed
* A list of what each parameter is specifically used for
* Examples of the function in action

### Keyword Arguments

We skimmed over this yesterday, but it deserves a mention here.

Python functions have two types of arguments:

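* **Positional** arguments

    * These are matched to the function's parameters by position, so they must be given in the correct order. Parameters without a default value are required.
    
    
* **Keyword** arguments

    * These are matched by name, so they can be given in any order as long as the keyword is included in the function call. They are typically used for optional parameters that have a default value.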
    
    
Things to note when supplying arguments to functions:

* Positional arguments always come first. Any keyword argument you supply must come after all the unnamed positional arguments.

* Keywords do not always have to be given. If no keyword is given, Python assigns the values to the parameters in the order they are written in the function definition.

In [8]:
# SOME FUNCTIONS!
def my_positional_function(x, y, z):
    print(x, y, z)
    
def my_keyword_function(x = 'we', y = 'love', z = 'otters'):
    print(x, y, z)
    
def my_mixed_function(x, y, z = 42):
    print(x, y, z)

In [9]:
# a function with only positional arguments
my_positional_function(1, 2, 3)

1 2 3


In [10]:
# if positional, they must all be supplied!
my_positional_function(1, 2)

TypeError: my_positional_function() missing 1 required positional argument: 'z'

In [11]:
# a function with keyword arguments can be treated the same way as positional
my_keyword_function(1, 2, 3)

1 2 3


In [12]:
# or we can supply fewer, knowing that the ones we do not supply (IN ORDER) will be given a default value
my_keyword_function(1, 2)

1 2 otters


In [13]:
# or we can supply our own values, in whichever order we like
my_keyword_function(x = 6, z = 99, y = 'countdown')

6 countdown 99


In [14]:
# or supply nothing at all if it is completely built with keywords
my_keyword_function()

we love otters


In [15]:
# or treat them as a mix of positional and keyword, as long as the positional comes first
my_keyword_function(x = 7, 1, 2)

SyntaxError: positional argument follows keyword argument (1864917937.py, line 2)

In [16]:
# we said first!
my_keyword_function(1, 2, z='would I lie to you')

1 2 would I lie to you
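
A related pitfall when mixing the two styles is giving the same parameter a value twice, once by position and once by keyword. The sketch below redefines the same `my_keyword_function` so it runs on its own, and catches the resulting `TypeError` to show the message.

```python
def my_keyword_function(x='we', y='love', z='otters'):
    print(x, y, z)


try:
    # 1 is bound to x by position, then x=7 tries to bind x a second time
    my_keyword_function(1, x=7)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```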


In [17]:
# functions can also mix required parameters with default ones; the required ones must still be supplied
my_mixed_function()

TypeError: my_mixed_function() missing 2 required positional arguments: 'x' and 'y'

In [18]:
# x and y are required, so we must define at least those
my_mixed_function(1, 2)

1 2 42


In [19]:
# and we can add in a value for z either as a positional, or as a keyword
my_mixed_function(1, 2, 42)
my_mixed_function(1, 2, z=42)

1 2 42
1 2 42
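
If it is unclear which parameters of a function are required and which have defaults, the standard library's `inspect.signature` can report this. This is a small sketch using the same `my_mixed_function`. The variable names `required` and `optional` are just for illustration.

```python
import inspect


def my_mixed_function(x, y, z=42):
    print(x, y, z)


required = []
optional = {}
for name, param in inspect.signature(my_mixed_function).parameters.items():
    # parameters without a default are marked with the special value Parameter.empty
    if param.default is inspect.Parameter.empty:
        required.append(name)
    else:
        optional[name] = param.default

print('required:', required)
print('optional:', optional)
```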


In [18]:
# for fun, we can run the %whos magic again now that we've done all this extra work
# to see how much more there is in the notebook

In [20]:
%whos

Variable                 Type         Data/Info
-----------------------------------------------
a                        int          4
b                        str          otter
c                        DataFrame       colA colB\n0     1    <...>     2    b\n2     3    c
my_keyword_function      function     <function my_keyword_func<...>on at 0x000001E774432790>
my_mixed_function        function     <function my_mixed_functi<...>on at 0x000001E77733F0D0>
my_positional_function   function     <function my_positional_f<...>on at 0x000001E774432430>
pd                       module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


## Clean Up From Day 2

There were a few questions about how to do certain things in Pandas during our second day. 

Here are some example solutions to those questions.

As a reminder, we were using this dataset:

https://raw.githubusercontent.com/bcgov/ds-intro-to-python/main/data/movieratings.csv

---

#### Question 1

How can we return only the rows that have missing data?

In [3]:
# first import the data
url = 'https://raw.githubusercontent.com/bcgov/ds-intro-to-python/main/data/movieratings.csv'
movie_ratings = pd.read_csv(url)
movie_ratings

Unnamed: 0,Rater,Star Wars,Finding Nemo,Forrest Gump,Parasite,Citizen Kane
0,Floriana,,5.0,5.0,3.0,
1,Raymundo,4.0,,,,5.0
2,Jung,5.0,,,5.0,
3,Kumar,5.0,,4.0,,4.0
4,Maria,5.0,4.0,5.0,,
5,Arthur,2.0,2.0,3.0,3.0,3.0
6,Marcellus,,,4.0,5.0,4.0
7,Martina,5.0,5.0,5.0,5.0,5.0
8,Orson,1.0,1.0,1.0,2.0,5.0
9,Luke,5.0,,,,


In [4]:
# use the isna() function to determine where the missing data is
where_na = movie_ratings.isna()
where_na

Unnamed: 0,Rater,Star Wars,Finding Nemo,Forrest Gump,Parasite,Citizen Kane
0,False,True,False,False,False,True
1,False,False,True,True,True,False
2,False,False,True,True,False,True
3,False,False,True,False,True,False
4,False,False,False,False,True,True
5,False,False,False,False,False,False
6,False,True,True,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,True,True,True,True


In [5]:
# use the any() function to determine if a row has at least 1 missing value 
where_na_by_row = where_na.any(axis=1)
where_na_by_row

0     True
1     True
2     True
3     True
4     True
5    False
6     True
7    False
8    False
9     True
dtype: bool

In [6]:
# supply this series of booleans to filter our original dataset
missing_data = movie_ratings[where_na_by_row]
missing_data

Unnamed: 0,Rater,Star Wars,Finding Nemo,Forrest Gump,Parasite,Citizen Kane
0,Floriana,,5.0,5.0,3.0,
1,Raymundo,4.0,,,,5.0
2,Jung,5.0,,,5.0,
3,Kumar,5.0,,4.0,,4.0
4,Maria,5.0,4.0,5.0,,
6,Marcellus,,,4.0,5.0,4.0
9,Luke,5.0,,,,
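
Once the individual steps are clear, they can be chained into a single expression. The sketch below uses a hand-typed three-row, two-movie excerpt of the ratings table (the name `ratings` is an assumption) so that it runs without downloading the file.

```python
import pandas as pd

nan = float('nan')

# hand-typed excerpt of the ratings data: three raters, two movies
ratings = pd.DataFrame({
    'Rater': ['Floriana', 'Arthur', 'Luke'],
    'Star Wars': [nan, 2.0, 5.0],
    'Finding Nemo': [5.0, 2.0, nan],
})

# isna() -> any(axis=1) -> boolean filter, all in one line
rows_with_missing = ratings[ratings.isna().any(axis=1)]
print(rows_with_missing)
```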


---

#### Question 2

What if we only want to drop the rows where BOTH Finding Nemo and Parasite are NA?

In [7]:
# start the same as before, find all the NAs
where_na = movie_ratings.isna()
where_na

Unnamed: 0,Rater,Star Wars,Finding Nemo,Forrest Gump,Parasite,Citizen Kane
0,False,True,False,False,False,True
1,False,False,True,True,True,False
2,False,False,True,True,False,True
3,False,False,True,False,True,False
4,False,False,False,False,True,True
5,False,False,False,False,False,False
6,False,True,True,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,True,True,True,True


In [8]:
# reduce to only the two columns of interest
where_na_fn_p = where_na[['Finding Nemo', 'Parasite']]
where_na_fn_p

Unnamed: 0,Finding Nemo,Parasite
0,False,False
1,True,True
2,True,False
3,True,True
4,False,True
5,False,False
6,True,False
7,False,False
8,False,False
9,True,True


In [9]:
# find the rows where both finding nemo AND parasite are missing (both are true)
# do this using the all() function
where_both_na = where_na_fn_p.all(axis=1)
where_both_na

0    False
1     True
2    False
3     True
4    False
5    False
6    False
7    False
8    False
9     True
dtype: bool

In [10]:
# negate this as we want to return all the other rows
where_not_both_na = ~where_both_na
where_not_both_na

0     True
1    False
2     True
3    False
4     True
5     True
6     True
7     True
8     True
9    False
dtype: bool

In [11]:
# supply this to the original dataset to filter to what we wanted
filtered_movies = movie_ratings[where_not_both_na]
filtered_movies

Unnamed: 0,Rater,Star Wars,Finding Nemo,Forrest Gump,Parasite,Citizen Kane
0,Floriana,,5.0,5.0,3.0,
2,Jung,5.0,,,5.0,
4,Maria,5.0,4.0,5.0,,
5,Arthur,2.0,2.0,3.0,3.0,3.0
6,Marcellus,,,4.0,5.0,4.0
7,Martina,5.0,5.0,5.0,5.0,5.0
8,Orson,1.0,1.0,1.0,2.0,5.0
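
The same result can likely be reached more directly with `dropna`, using its `subset` and `how` parameters. This is a sketch on a hand-typed four-row excerpt of the ratings table (the name `ratings` is an assumption), not the full dataset.

```python
import pandas as pd

nan = float('nan')

# hand-typed excerpt of the ratings data, limited to the two movies of interest
ratings = pd.DataFrame({
    'Rater': ['Floriana', 'Raymundo', 'Jung', 'Luke'],
    'Finding Nemo': [5.0, nan, nan, nan],
    'Parasite': [3.0, nan, 5.0, nan],
})

# how='all' drops a row only when every column listed in subset is missing
kept = ratings.dropna(subset=['Finding Nemo', 'Parasite'], how='all')
print(kept)
```

The step-by-step boolean approach is still worth knowing, since it extends to conditions that `dropna` cannot express.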


---

#### Question 3

How can we sort the value counts output by the first column (our categories) instead of by the counts?

Reminder, here we were using this dataset:

https://raw.githubusercontent.com/bcgov/ds-intro-to-python/main/data/techhealth.csv

In [19]:
url = 'https://raw.githubusercontent.com/bcgov/ds-intro-to-python/main/data/techhealth.csv'
m_health = pd.read_csv(url)
m_health.columns = m_health.columns.str.lower().str.replace(' ','_')
m_health.head()

Unnamed: 0,timestamp,age,gender,country,self_employed,family_history,treatment,work_interfere,remote_work,tech_company,benefits,leave,mental_health_consequence
0,27/08/2014 11:35,46,Male,United States,No,No,Yes,Often,Yes,Yes,Yes,Don't know,Maybe
1,27/08/2014 11:36,41,Male,United States,No,No,Yes,Never,No,No,Don't know,Don't know,Maybe
2,27/08/2014 11:36,33,male,United States,No,Yes,Yes,Rarely,No,Yes,Yes,Don't know,No
3,27/08/2014 11:37,35,male,United States,No,Yes,Yes,Sometimes,No,No,Yes,Very easy,Yes
4,27/08/2014 11:42,35,M,United States,No,No,Yes,Rarely,Yes,Yes,Yes,Very easy,No


In [21]:
# a different way to map the Never/Rarely/Sometimes/Often to 0/1/2/3
my_map = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Often': 3}
m_health['work_interfere'] = m_health['work_interfere'].map(my_map)
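
One thing to watch with `map` and a dictionary: any value that is not a key in the dictionary becomes `NaN`. That also means running the mapping cell a second time turns the whole column into `NaN`, because the numbers are not keys either. This is a small sketch with made-up answers, where `'Not sure'` is a hypothetical value not taken from the dataset.

```python
import pandas as pd

my_map = {'Never': 0, 'Rarely': 1, 'Sometimes': 2, 'Often': 3}

# made-up answers; 'Not sure' has no entry in my_map
answers = pd.Series(['Often', 'Never', 'Not sure', 'Rarely'])

mapped = answers.map(my_map)
print(mapped)

# mapping a second time finds no matching keys, so every value becomes NaN
mapped_twice = mapped.map(my_map)
print(mapped_twice)
```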

In [23]:
# get the value counts series
work_interfere_counts = m_health['work_interfere'].value_counts()
work_interfere_counts

2    12
0     5
1     4
3     3
Name: work_interfere, dtype: int64

In [24]:
# sort by the index instead of counts
work_interfere_counts.sort_index()

0     5
1     4
2    12
3     3
Name: work_interfere, dtype: int64
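
Another option, sketched here as an alternative to mapping the labels to numbers, is to keep the text labels and store the column as an ordered categorical. `sort_index` then follows the category order rather than alphabetical order. The answers below are made up for illustration, and `levels` and `as_category` are hypothetical names.

```python
import pandas as pd

levels = ['Never', 'Rarely', 'Sometimes', 'Often']

# made-up answers for illustration
answers = pd.Series(['Often', 'Never', 'Sometimes', 'Sometimes'])

# declare the order of the categories explicitly
as_category = answers.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# categories that never occur still appear, with a count of 0
counts = as_category.value_counts().sort_index()
print(counts)
```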