# Programming in Python for Data Science 

# Assignment 8: A Slice of NumPy and Advanced Data Wrangling

You can't learn technical subjects without hands-on practice. The assignments are an important part of the course. To submit this assignment you will need to make sure that you save your Jupyter notebook. 

Below are links to two videos that explain:

1. [How to save your Jupyter notebook](https://youtu.be/0aoLgBoAUSA) and,       
2. [How to answer a question in a Jupyter notebook assignment](https://youtu.be/7j0WKhI3W4s).       

### Assignment Learning Goals:

By the end of the module, students are expected to:

- Use [NumPy](https://numpy.org/) to create ndarrays with `np.array()` and from functions such as `np.arange()`, `np.linspace()` and `np.ones()`.
- Describe the shape, dimension and size of an array.
- Identify null values in a dataframe and manage them by removing them using [`.dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) or replacing them using [`.fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html).
- Manipulate non-standard date/time formats into standard Pandas datetime using [`pd.to_datetime()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html).
- Find and replace text in a dataframe using verbs such as [`.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) and [`.str.contains()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html).  
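
As a quick refresher on the first two goals, here is a minimal sketch of the array-creation functions and the `shape`, `ndim` and `size` attributes. The variable names and values are made up for illustration and are not part of any question below.

```python
import numpy as np

# Four ways to build an ndarray (all values here are illustrative)
from_list = np.array([[1, 2, 3], [4, 5, 6]])  # from a nested list: 2 rows, 3 columns
stepped = np.arange(0, 10, 2)                 # start, stop (exclusive), step
spaced = np.linspace(0, 1, 5)                 # 5 evenly spaced values from 0 to 1 inclusive
ones = np.ones((2, 3))                        # 2x3 array filled with 1.0

# shape: length along each axis, ndim: number of axes, size: total element count
print(from_list.shape, from_list.ndim, from_list.size)  # (2, 3) 2 6
```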


This assignment covers [Module 8](https://prog-learn.mds.ubc.ca/en/module8) of the online course. You should complete this module before attempting this assignment.

Any place you see `...`, you must fill in the function, variable, or data to complete the code. Replace the `None` and the `raise NotImplementedError # No Answer - remove if you provide an answer` with your completed code and answers, then run the cell!

Note that some of the questions in this assignment will have hidden tests. This means that no feedback will be given as to the correctness of your solution. It will be left up to you to decide if your answer is sufficiently correct. These questions are worth 2 points.

In [None]:
# Import libraries needed for this lab
import pandas as pd
import numpy as np
import test_assignment8 as t
from hashlib import sha1
import altair as alt
import inspect

## 1.  Using NumPy 

**Question 1(a)** <br> {points: 1}  

Create a slice from `arr` named `answer_1a` of the values `[1,5,9]`.

In [None]:
arr = np.arange(1, 11)
answer_1a = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1a(answer_1a)

**Question 1(b)** <br> {points: 1}  

Create a 2d array named `answer_1b` of shape (2,2) filled with value 3.4 using `np.full()`.

In [None]:
answer_1b = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1b(answer_1b)

**Question 1(c)** <br> {points: 1}  

Create a 3d array named `answer_1c` of shape (2, 3, 4) using `np.ones()`.

In [None]:
answer_1c = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_1c(answer_1c)

**Question 1(d)** <br> {points: 2}  

Which of the following arrays are two dimensional? 

 `array_1 = np.array([1, 4, 5, 6])`

 `array_2 = np.array([[1, 4, 5, 6]])`

 `array_3 = np.array([[1], [4], [5], [6]])`

 `array_4 = np.array([[[1, 4]], [[5, 6]]])`

Save all correct answers as strings within a list, in the format shown in the example below.  

***Example:***    

`answer1_d = ['array_1', 'array_2']`


In [None]:
answer1_d = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
# check that the variable exists
assert 'answer1_d' in globals(
), "Please make sure that your solution is named 'answer1_d'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

## 2. DateTime Wrangling 

<a href="https://en.wikipedia.org/wiki/Chopped_(TV_series)" target="_blank">Chopped</a> is a cooking show aired in North America where 4 contestants must prepare a dish that incorporates unusual basket ingredients unknown to the contestants beforehand. The dishes are then presented to a panel of three celebrity chef judges where the contestant of the least liked dish is "chopped" from the competition. There are 3 rounds in the contest ("Appetizer", "Entrée", and "Dessert") and the winner of the final round is deemed the "Chopped Champion". 

[This Chopped open-source dataset](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data) allows us to uncover some insights into this popular TV series. 


**Question 2(a)** <br> {points: 1}  

Load in the data, parsing the `air_date` column as a `datetime64` dtype.     
Save the dataframe as an object named `chopped`.
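
One common way to get a datetime column at load time is the `parse_dates` argument of `pd.read_csv()`. The sketch below uses a small in-memory CSV with made-up column names (`title`, `released`) rather than the Chopped file, so treat it as an illustration of the technique only.

```python
import io

import pandas as pd

# A tiny made-up CSV held in memory (stands in for a file on disk)
csv_text = "title,released\nPilot,2020-03-01\nFinale,2020-05-10\n"

# parse_dates lists the columns that pandas should convert to datetimes
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["released"])

# Confirm that the column was parsed as a datetime dtype
print(pd.api.types.is_datetime64_any_dtype(toy["released"]))
```

An alternative is to load the file normally and convert the column afterwards with `pd.to_datetime()`.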

In [None]:
chopped = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2a(chopped)

**Question 2(b)** <br> {points: 2}  

Determine how long the show has been airing (in years) by looking at the earliest and latest air dates.

Save the result as an object named `air_length_yrs`. 

In [None]:
air_length_yrs = None 
days_per_year = 365.25 # Average number of days per year; the extra 0.25 accounts for leap years.



# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


air_length_yrs = round(air_length_yrs, 2) # This will round your answer to 2 decimal places. Do not delete! 
air_length_yrs

In [None]:
# check that the variable exists
assert 'air_length_yrs' in globals(
), "Please make sure that your solution is named 'air_length_yrs'"

# This test has been intentionally hidden. It will be up to you to decide if your solution
# is sufficiently good.

**Question 2(c)** <br> {points: 1}  

How many days are between each of the 569 episodes?  
Save this as an object named `days_apart`. 

*Hints:* 
- You may need to use `.diff()` and `days_apart` should have 568 rows.        
- Here you are measuring time between episodes. `.diff()` produces a result that has a `NaT` value in the first row, since there is no episode before it to calculate an interval from. We need to remove this row. Although there are 569 episodes, the number of intervals *between* episodes is 568.  
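
To see the behaviour these hints describe, here is a minimal sketch on three made-up dates (not the Chopped data). The names `toy_dates`, `gaps` and `gaps_clean` are hypothetical.

```python
import pandas as pd

# Three illustrative dates: 7 days apart, then 14 days apart
toy_dates = pd.Series(pd.to_datetime(["2021-01-05", "2021-01-12", "2021-01-26"]))

gaps = toy_dates.diff()     # first entry is NaT: nothing comes before it
gaps_clean = gaps.dropna()  # 3 dates leave 2 intervals

print(gaps_clean.dt.days.tolist())  # [7, 14]
```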


In [None]:
days_apart = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_2c(days_apart)

**Question 2(d)** <br> {points: 1}  

Of these inter-episode intervals, what fraction of them were not aired on a weekly basis? 

Save the result in an object named `irregular_aired_fraction`.


In [None]:
irregular_aired_fraction = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2d(irregular_aired_fraction)

**Question 2(e)** <br> {points: 1}  

Make a new dataframe named `chopped2` that contains an additional column named `weekday_aired` that specifies the day of the week that it was aired.

*Hint: you'll need to use `.dt.day_name()`* 
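
As a small illustration of the hint, the sketch below adds a day-name column to a made-up dataframe. The column names `released` and `weekday` are hypothetical, and the output assumes the default English locale.

```python
import pandas as pd

# Two illustrative dates: 2021-01-05 was a Tuesday, 2021-01-07 a Thursday
toy = pd.DataFrame({"released": pd.to_datetime(["2021-01-05", "2021-01-07"])})

# .assign() returns a new dataframe and leaves `toy` unchanged
toy_with_day = toy.assign(weekday=toy["released"].dt.day_name())

print(toy_with_day["weekday"].tolist())  # ['Tuesday', 'Thursday']
```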


In [None]:
chopped2 = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer

In [None]:
t.test_2e(chopped2)

**Question 2(f)** <br> {points: 1}  

Most Chopped episodes are aired on a `Tuesday`. How many were not? 
Save this value in an object named `irregular_airdays`.


In [None]:
irregular_airdays = None

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_2f(irregular_airdays)

**Question 2(g)** <br> {points: 2}  

How many of the 45 Chopped seasons had a perfectly consistent schedule, with each episode released exactly one week after the previous one?
Save this value in an object named `num_perfect_season`.

*Hint:*

* You may find some of the skills you used in 2(c) and 2(d) helpful here. 
* To loop over all the groups in a groupby object you can use the syntax `for name, group in data.groupby(['grouping_column']):`.
* For a season to have a consistent airing schedule, both the max and min days between episodes must equal 7.
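
The looping pattern from the hints can be sketched on a made-up dataframe as follows. The columns `team` and `score` are hypothetical, and the min-equals-max check simply mirrors the idea in the last hint on unrelated data.

```python
import pandas as pd

# Illustrative data: team "a" has varying scores, team "b" has constant scores
toy = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [3, 5, 2, 2, 2],
})

constant_teams = 0
for name, group in toy.groupby("team"):
    # `name` is the group label and `group` is the sub-dataframe for that label
    if group["score"].min() == group["score"].max():
        constant_teams += 1

print(constant_teams)  # 1
```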

In [None]:
num_perfect_season = 0

# your code here
raise NotImplementedError # No Answer - remove if you provide an answer


In [None]:
t.test_2g(num_perfect_season)

## 3. Cleaning a dataframe with Strings and Handling missing values

**Question 3** <br> {points: 8}  

Now that you have learned about string operations and the basics of regular expressions, let's see you apply your skills to a real dataset. 

In this exercise, you will start with the dirty version of the `Gapminder` dataset that we've seen before. By "dirty" we mean there are some inconsistencies and irregularities in the dataset, as one would more typically find with real-world data.  Your task is to write a function named `cleaned_gapminder` that takes in this dataset as an argument and returns a cleaned-up dataframe. The goal of this exercise is to use Python code to clean up `dirty_gapminder` to the point that it's identical to `clean_gapminder`. 

Note: in the real world you wouldn't have a `clean_gapminder` reference to compare to!

Things you might want to do to clean up `dirty_gapminder`:

1. We recommend first writing code that cleans this dataset and then moving it all into a function after. 
1. If there is missing data (NaNs or empty strings) fill it in with sensible values.
1. Check that all values match those in `clean_gapminder` (e.g., check capitalization, spelling, grammar, etc).
1. There may be entries that appear to have the exact same spelling and capitalization in both the dirty and clean gapminder datasets, but still don't match... Extra whitespace is often a frustrating (and invisible) problem when wrangling text data. You can use `print('**' + x + '**')` to identify any strings with whitespace and `Series.str.strip()` to trim unwanted whitespace around a string. 
1. When you are ready, test that your dirty dataframe matches the clean gapminder data using `df.equals()`.
1. Since you are writing a function named `cleaned_gapminder`, our autograding tests will check that your function contains certain code and returns the expected output.
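
A minimal sketch of two of the techniques above (spotting stray whitespace and filling missing values) on made-up data. The column contents and the `"Unknown"` fill value are placeholders only; the sensible fill values for Gapminder are for you to work out.

```python
import numpy as np
import pandas as pd

# Illustrative data with padded strings, a NaN and an empty string
toy = pd.DataFrame({
    "country": ["Canada", "Canada ", " china"],
    "continent": ["Americas", np.nan, ""],
})

# Wrapping each value in ** makes leading/trailing whitespace visible
for x in toy["country"]:
    print("**" + x + "**")

# Trim whitespace around each string
toy["country"] = toy["country"].str.strip()

# Treat empty strings as missing, then fill every missing value
toy["continent"] = toy["continent"].replace("", np.nan).fillna("Unknown")
```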

Hint: We've provided a unit test for you to compare the two dataframes after wrangling. However, during your wrangling you can check the equality of individual elements in two dataframes using `df.eq()`. If your dataframes are `df1` and `df2`, you can check which rows are not equal using `df1[(~df2.eq(df1)).any(axis=1)]` (You've seen something of this nature in Module 3).
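
Here is how that expression behaves on two small made-up dataframes that differ in a single cell. The contents of `df1` and `df2` are illustrative only.

```python
import pandas as pd

# Two illustrative dataframes that differ only in row 1 ("china" vs "China")
df1 = pd.DataFrame({"country": ["Canada", "china", "Chile"], "pop": [38, 1400, 19]})
df2 = pd.DataFrame({"country": ["Canada", "China", "Chile"], "pop": [38, 1400, 19]})

# .eq() compares element-wise; ~ flips True/False; .any(axis=1) flags rows
# that contain at least one mismatched cell
mismatched = df1[(~df2.eq(df1)).any(axis=1)]

print(mismatched.index.tolist())  # [1]
```

Keep in mind that `.eq()` treats a `NaN` as unequal to another `NaN`, whereas `df.equals()` treats `NaN`s in the same position as equal.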



In [None]:
dirty = pd.read_csv('data/dirty_gapminder.csv')
dirty.head()

In [None]:
clean = pd.read_csv('data/clean_gapminder.csv')
clean.head()

The code below shows that there are 28 rows in total that are not equal between the two dataframes.

In [None]:
dirty[(~clean.eq(dirty)).any(axis=1)].shape

In [None]:
def cleaned_gapminder(dirty_df):

    # your code here
    raise NotImplementedError # No Answer - remove if you provide an answer

    return dirty_df


cleaned_data = cleaned_gapminder(dirty)
assert cleaned_data.equals(clean), "Dataframes are not the same!"

In [None]:
t.test_3(cleaned_gapminder,dirty,clean)

Before submitting your assignment please do the following:

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Verify that the tests from the questions you answered have obtained the output "Success"

This is a simple way to make sure that you are submitting all the variables needed to mark the assignment. This method should help avoid losing marks due to changes in your environment.  

You did it! You got to the end of all 8 assignments in Programming in Python for Data Science. We are all very proud of you here and are excited to see you translate everything you've learned into a final project! 

## Attributions
- Gapminder Dataset - [Gapminder](https://www.gapminder.org/data/)
- UBC's original STAT545 - [Stat545 by Jenny Bryan](https://stat545.com/)
- MDS DSCI 523 - Data Wrangling course - [MDS's GitHub website](https://ubc-mds.github.io/) 
- Chopped Dataset - [Kaggle](https://www.kaggle.com/jeffreybraun/chopped-10-years-of-episode-data)

## Module Debriefing

If this video is not showing up below, click on the cell and click the ▶ button in the toolbar above.

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('PCBPzCFQwHs', width=854, height=480)