# CPSC 330 hw1

**About this assignment:** The main purpose of this assignment is to check whether your programming knowledge is adequate to take CPSC 330. This assignment covers two Python packages, NumPy and pandas, that we'll be using throughout the course. For some of you, Python/NumPy/pandas will be familiar; for others, it will be new. Either way, if you find this assignment very difficult, that could be a sign that you will struggle later in the course. While CPSC 330 is a machine learning course rather than a programming course, programming will be an essential part of it.

You will likely need to consult the documentation for the Python packages we're using as you work through this assignment. This is totally OK and in fact strongly encouraged: reading and interpreting documentation is an important skill, and it is one of the skills this assignment is meant to assess.

**Due date: Tuesday Sep 15, 2020 at 11:59pm.**

## Set-up

To do this assignment, you will need to set up the CPSC 330 software stack, which is Python and Jupyter. For software install help, see [here](https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md). Once you have the software stack installed, you should be able to run the next cell, which imports some packages needed for the assignment.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Points

Each question or sub-question will have a number of points allocated to it. For example, the "rubric={points:5}" below means that following the instructions is worth 5 points/marks.

## Instructions
rubric={points:5}

Follow the [CPSC 330 homework instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md), which include information on how to do your assignment and how to submit it.



## Exercise 1: Loading files with Pandas
rubric={points:6}

When working with tabular data, you will typically be creating Pandas dataframes by reading data from .csv files using `pd.read_csv()`. The documentation for this function is available [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
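As a minimal sketch of how `pd.read_csv()` arguments change what gets loaded, the example below parses a small in-memory table rather than a file. The table contents and the names `raw_text` and `toy_df` are made up for illustration and are unrelated to the wine files; `sep`, `skiprows`, and `index_col` are real `read_csv` parameters, but which ones a given file needs depends on how that file is laid out.

```python
import io

import pandas as pd

# Hypothetical file contents: one junk line, then a semicolon-delimited table.
raw_text = """exported by some tool
id;city;temp_c
1;Vancouver;12.5
2;Toronto;9.0
"""

# skiprows drops the leading junk line, sep sets the delimiter,
# and index_col uses the "id" column as the row index.
toy_df = pd.read_csv(io.StringIO(raw_text), sep=";", skiprows=1, index_col="id")
print(toy_df)
```

Opening a file in a text editor first is usually the quickest way to see which of these arguments it might need.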

In the "data" folder in this homework repository there are six files named `wine_#.csv` or `wine_#.txt`. Look at each of these files and use `pd.read_csv()` to load each one so that the resulting dataframe matches the following:

| Bottle | Grape | Origin | Alcohol | pH | Colour | Aroma |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 1 |  Chardonnay | Australia | 14.23 | 3.51 | White | Floral |
| 2 |  Pinot Grigio | Italy | 13.20 | 3.30 | White | Fruity |
| 3 |  Pinot Blanc | France | 13.16 | 3.16 | White | Citrus |
| 4 |  Shiraz | Chile | 14.91 | 3.39 | Red | Berry |
| 5 |  Malbec | Argentina | 13.83 | 3.28 | Red | Fruity |

You are provided with tests that use `df.equals()` to check that all the dataframes are identical. If two dataframes look identical but `df.equals()` returns `False`, the cause may be a type mismatch; try checking `df.index`, `df.columns`, or `df.info()`.
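The sketch below illustrates one way such a mismatch can arise, using two made-up dataframes (`left` and `right` are hypothetical names) that print the same but store their values with different types. It is only an illustration of the debugging idea, not a description of what is wrong with any particular wine file.

```python
import pandas as pd

# Two hypothetical dataframes that look the same when printed.
left = pd.DataFrame({"score": [1, 2, 3]})         # integers
right = pd.DataFrame({"score": ["1", "2", "3"]})  # strings that look like integers

print(left.equals(right))  # the values differ in type, so this is not True

# Comparing dtypes side by side reveals the difference.
print(left.dtypes)
print(right.dtypes)

# Converting the column to a matching dtype makes the comparison succeed.
fixed = right.assign(score=right["score"].astype("int64"))
print(left.equals(fixed))
```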

In [None]:
df1 = None
df2 = None
df3 = None
df4 = None
df5 = None
df6 = None

In [None]:
for i, df in enumerate([df2, df3, df4, df5, df6]):
    assert df1.equals(df), f"df1 not equal to df{i + 2}"
print('All tests passed.')

In [None]:
df

## Exercise 2: The Titanic dataset

The file *titanic.csv* contains data on 1309 passengers who were on the Titanic's unfortunate voyage. For each passenger, the following data are recorded:

* survived - Survival (0 = No; 1 = Yes)
* pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name - Name
* sex - Sex
* age - Age
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* ticket - Ticket Number
* fare - Passenger Fare
* cabin - Cabin
* embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
* boat - Lifeboat (if survived)
* body - Body number (if did not survive and body was recovered)
* home.dest - Home/Destination

In this exercise you will perform a number of wrangling operations to manipulate and extract subsets of the data.

#### 2(a)
rubric={points:1}

Load the `titanic.csv` dataset into a pandas dataframe named `titanic_df`.

In [None]:
titanic_df = None

 

In [None]:
assert set(titanic_df.columns) == set(['pclass',
 'survived',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'boat',
 'body',
 'home.dest']), "Not all required columns are present"
assert len(titanic_df.index) == 1309, "Wrong number of rows in dataframe"
print("Success")

#### 2(b)
rubric={points:1}

The column names `sibsp` and `parch` are not very descriptive. Use `df.rename()` to rename these columns to `siblings_spouses` and `parents_children` respectively.
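For reference, here is a minimal sketch of `df.rename()` on a made-up dataframe (the names `toy_df`, `nm`, and `yr` are hypothetical). It assumes the default behaviour, where `rename` returns a new dataframe rather than modifying the original.

```python
import pandas as pd

# Hypothetical toy dataframe with terse column names.
toy_df = pd.DataFrame({"nm": ["Ada", "Grace"], "yr": [1815, 1906]})

# rename returns a new dataframe by default, so assign the result back.
toy_df = toy_df.rename(columns={"nm": "name", "yr": "birth_year"})
print(list(toy_df.columns))
```

Columns that are not mentioned in the mapping keep their existing names.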

In [None]:
assert set(['siblings_spouses', 'parents_children']).issubset(titanic_df.columns), "Column names were not changed properly"
print("Success")

#### 2(c)
rubric={points:2}

We will practice indexing different subsets of the dataframe in the following questions.

Select the column `age` using single bracket notation `[]`. What type of object is returned?

#### 2(d)
rubric={points:2}

Now select the `age` column using double bracket notation `[[]]`. What type of object is returned?
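One way to compare the two notations is to probe the results with `type()` and `.shape`, sketched here on a made-up dataframe (`toy_df`, `single`, and `double` are hypothetical names). Running the same probes on `titanic_df` should let you answer 2(c) and 2(d).

```python
import pandas as pd

toy_df = pd.DataFrame({"item": ["pen", "book"], "price": [1.5, 12.0]})  # made-up data

single = toy_df["price"]    # single brackets with one column name
double = toy_df[["price"]]  # double brackets: a list holding one column name

# Probe what each expression returned.
for label, obj in [("single", single), ("double", double)]:
    print(label, type(obj).__name__, obj.shape)
```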

#### 2(e)
rubric={points:1}

Select the columns `pclass`, `survived`, and `age` using a single line of code.

#### 2(f)
rubric={points:2}

Use the `iloc` indexer to obtain the first 5 rows of the columns `name`, `sex` and `age` using a single line of code.

#### 2(g)
rubric={points:2}

Now use the `loc` indexer to obtain the first 5 rows of the columns `name`, `sex` and `age` using a single line of code.
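The difference between the two indexers can be sketched on a made-up dataframe (all names and values below are hypothetical). The example assumes the default integer row labels `0, 1, 2, ...`; with a different index, the `loc` row labels would need to change accordingly.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "item": ["pen", "book", "lamp", "desk"],
        "colour": ["blue", "red", "white", "brown"],
        "price": [1.5, 12.0, 30.25, 120.0],
    }
)

# iloc is position-based: rows 0-1 (end excluded), columns at positions 0 and 2.
by_position = toy_df.iloc[0:2, [0, 2]]

# loc is label-based: row labels 0 through 1 (end INCLUDED), columns by name.
by_label = toy_df.loc[0:1, ["item", "price"]]

print(by_position.equals(by_label))
```

Note that the slice end is exclusive for `iloc` but inclusive for `loc`, which is a common source of off-by-one mistakes.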

#### 2(h)
rubric={points:1}

How many passengers survived (`survived = 1`) the disaster? Hint: try using `df.query()` or `[]` notation to subset the dataframe and then `df.shape` to check its size.
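Both approaches from the hint are sketched below on a made-up dataframe (`toy_df` and its columns are hypothetical). The last lines also show how two conditions can be combined in a boolean mask, which may be useful for later questions.

```python
import pandas as pd

toy_df = pd.DataFrame({"passed": [1, 0, 1, 1], "score": [88, 41, 67, 93]})  # made-up data

via_query = toy_df.query("passed == 1")   # filter with a query string
via_mask = toy_df[toy_df["passed"] == 1]  # filter with a boolean mask

# shape is (n_rows, n_columns), so shape[0] is the number of matching rows.
print(via_query.shape[0], via_mask.shape[0])

# Conditions combine with &, and each condition needs its own parentheses.
high_passers = toy_df[(toy_df["passed"] == 1) & (toy_df["score"] > 80)]
print(len(high_passers))
```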

#### 2(i)
rubric={points:1}

How many passengers who survived the disaster (`survived = 1`) were over 60 years of age?

#### 2(j)
rubric={points:2}

What were the lowest and highest fares paid to board the Titanic? Store your answers as floats in the variables `lowest` and `highest`.

In [None]:
 
lowest = None
highest = None
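As a rough sketch of extracting extremes from a column, the example below uses a made-up series (`prices`, `cheapest`, and `priciest` are hypothetical names). The explicit `float()` call is one possible way to make sure the result is a built-in Python float.

```python
import pandas as pd

prices = pd.Series([1.5, 12.0, 30.25], name="price")  # made-up values

# min() and max() skip missing values by default; float() converts the
# NumPy scalar they return into a built-in Python float.
cheapest = float(prices.min())
priciest = float(prices.max())
print(cheapest, priciest)
```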

#### 2(k)
rubric={points:1}

Sort the dataframe by fare paid (most to least).

#### 2(l)
rubric={points:1}

Save the sorted dataframe to a .csv file called 'titanic_fares.csv' using `to_csv()`.
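Sorting and saving can be sketched together on a made-up dataframe (all names and values below are hypothetical). To avoid writing to disk, this sketch saves to an in-memory text buffer; passing a file name string to `to_csv()` instead would write a file. Whether to pass `index=False` is a judgment call that depends on whether the row labels are worth keeping.

```python
import io

import pandas as pd

toy_df = pd.DataFrame({"item": ["pen", "book", "lamp"], "price": [1.5, 12.0, 30.25]})

# ascending=False sorts from the largest value to the smallest.
sorted_df = toy_df.sort_values(by="price", ascending=False)

# to_csv accepts a file path or any text buffer; index=False omits the row labels.
buffer = io.StringIO()
sorted_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```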

#### 2(m)
rubric={points:3}

Create a scatter plot of fare (y-axis) vs. age (x-axis). Make sure to follow the [guidelines on figures](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md#figures). You are welcome to use pandas built-in plotting or matplotlib. 
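A minimal matplotlib sketch of a labelled scatter plot follows, using made-up data (`toy_df` and its columns are hypothetical). The linked figure guidelines are not reproduced here, so the axis labels and title reflect an assumption about what they ask for. The `matplotlib.use("Agg")` line only matters when running outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

toy_df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 60, 71, 75, 88]})

fig, ax = plt.subplots()
ax.scatter(toy_df["hours"], toy_df["score"])  # first argument is x, second is y
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.set_title("Test score vs. hours studied (made-up data)")
```

The pandas equivalent would be `toy_df.plot.scatter(x="hours", y="score")`, which returns a matplotlib axes object that can be labelled the same way.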

#### 2(n)
rubric={points:3}

Create a histogram of `embarked` values.
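For a column of categories rather than numbers, one possible approach is to count each category and draw the counts as bars. The sketch below does this for a made-up series (`colours` and `counts` are hypothetical names); it is one option among several and not necessarily the intended solution.

```python
import matplotlib

matplotlib.use("Agg")  # only needed when running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

colours = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="colour")

# value_counts tallies each distinct value, most frequent first.
counts = colours.value_counts()

fig, ax = plt.subplots()
counts.plot.bar(ax=ax)
ax.set_xlabel("Colour")
ax.set_ylabel("Number of items")
ax.set_title("Item counts by colour (made-up data)")
```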

## Exercise 3: Scavenger Hunt

For each of the following cases, index into the Python object to obtain the "T" (for Treasure). Please do not modify the original line of code that generates `x` (though you are welcome to copy it). You are welcome to answer this question "manually" or by writing code, whichever works for you. However, your submission should always end with a line of code that prints out `'T'` (because you've found it).

In [2]:
import string
letters = string.ascii_uppercase

The first one is done for you as an example.

#### Example question

In [3]:
x = ("nothing",{-i : l for i, l in enumerate(letters)})

**Example answer**:

In [4]:
x[1][-19]

'T'

**Explanation:** In these questions, the goal is not to understand the code itself, which may be confusing. Instead, try to probe the types of the various objects. For example `type(x)` reveals that `x` is a tuple, and `len(x)` reveals that it has two elements. Element 0 just contains "nothing", but element 1 contains more stuff, hence `x[1]`. Then we can again probe `type(x[1])` and see that it's a dictionary. If you `print(x[1])` you'll see that the letter "T" corresponds to the key -19, hence `x[1][-19]`.
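The same probing strategy is sketched below on a different, made-up object (`nested` is a hypothetical name and is not one of the puzzles in this exercise). Each line peels off one layer of the structure.

```python
# Hypothetical nested object, not one of the assignment's puzzles.
nested = ["skip me", {"letters": ("R", "S", "T")}]

print(type(nested), len(nested))  # the outer container and its size
print(type(nested[1]))            # element 1 is another container
print(list(nested[1].keys()))     # a dictionary exposes its keys
print(nested[1]["letters"][-1])   # negative indices count from the end
```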

#### 3(a)
rubric={points:2}

In [None]:
# Do not modify this cell
x = [[letters[i] for i in range(26) if i % 2 == 0], [letters[i] for i in range(26) if i % 2 == 1]]

#### 3(b)
rubric={points:2}

In [None]:
# Do not modify this cell
np.random.seed(1)
x = np.random.choice(list(set(letters)-set("T")),size=(100, 26),replace=True)
x[np.random.randint(100),np.random.randint(26)] = "T"

#### 3(c)
rubric={points:2}

In [None]:
# Do not modify this cell
n = 26
x = dict()
for i in range(n):
    x[string.ascii_lowercase[i]] = {string.ascii_lowercase[(j+1)%n] : [[letters[j]] if j-2==i else None] for j in range(n)}

## Submission to Canvas

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`.
2. Convert your notebook to `.html` format using the `convert_notebook()` function below **or** by `File -> Export Notebook As... -> Export Notebook to HTML`.
3. Run the code `submit()` below to go through an interactive submission process to Canvas.
>For this step, you will need a Canvas *Access Token*. If you don't already have one, log in to Canvas, click `Account` (top-left of the screen), then `Settings`, then scroll down until you see the `+ New Access Token` button. Click that button, give your token any name you like and set the expiry date to Dec 31, 2020. Then click `Generate token`. Save this token in a safe place on your computer as you'll need it for all assignments. Treat the token with as much care as you would an important password.

In [None]:
from canvasutils.submit import submit, convert_notebook

# Note: the canvasutils package should have been installed as part of your environment setup - 
# see https://github.com/UBC-CS/cpsc330/blob/master/docs/setup.md

In [None]:
# convert_notebook("hw1.ipynb", "html")  # uncomment and run when you want to convert your notebook to HTML (or you can convert manually from the File menu)

In [None]:
# submit(course_code=53561, token=False)  # uncomment and run when ready to submit 