# HW 1

* Due **Thursday, Sep 9, at 11:59 pm**.

The purpose of this lab is to make sure that you can run Python notebooks and successfully submit an assignment.
Keep the following in mind for **all** notebooks you develop:

1. Structure your notebook. Use headings with meaningful levels in Markdown cells, and explain the questions each piece of code is to answer or the reason it is there.
2. Make sure your notebook can always be rerun from top to bottom.

The data analysis commands are straightforward. The objective is for you to become familiar with basic functions in the Python packages we will use going forward, build a successful Python notebook, and submit it with git.

Follow [README.md](README.md) for homework submission instructions.


## Setup

This section loads the relevant Python modules and does any configuration needed for the notebook to work. 
* Almost all projects will need basic data analysis packages - Pandas and Seaborn.  
* For other projects, you will need some more imports:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
```
These are not necessary for this lab.

```python
import pandas as pd
import seaborn as sns
```

## Data Acquisition

1.  **Download the Capital Bike Share data set** from <https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset>.  Click 'Data Folder', download the zip file, and extract the `day.csv` file.
  * Note that I did not provide the data file for you.  I want you to get used to downloading data files, so you learn where they come from.

2. Read the data file: use the Pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function to read the file into a data frame named `bikes`.
   * `day.csv` begins with a header row, so `read_csv` picks up column names such as `instant`, `dteday`, `weekday`, and `cnt` automatically. You would need to pass `names=` only for a file without headers.

3. Use `head` to show the first few rows of the table:
   * A brief preview is a quick check that you are exploring the correct data frame.

4. Use `info` to show a description of the columns, along with the shape and memory use of the data frame:
   * This is a quick way to check each column's type and non-null count.

**Note:** `.info()` or `.head()` can be called in the same cell as the data load once you get the hang of it. I separate them out in this notebook so that we can discuss them in the markdown cells, but we can combine them in the future.
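As a rough sketch of steps 2 to 4, the cell below reads a tiny in-memory CSV instead of the real file so that it runs anywhere. The three rows and the four-column subset are made up for illustration; with the real data you would pass the path to `day.csv` (assumed here to sit next to the notebook). If your file had no header row, you would add `header=None` and `names=[...]`.

```python
import io
import pandas as pd

# Made-up three-row sample shaped like a subset of day.csv (illustrative values only)
sample_csv = io.StringIO(
    "instant,dteday,weekday,cnt\n"
    "1,2011-01-01,6,985\n"
    "2,2011-01-02,0,801\n"
    "3,2011-01-03,1,1349\n"
)

# With the real file this would be: bikes = pd.read_csv("day.csv")
bikes = pd.read_csv(sample_csv)

# Preview the first rows, then list column types, non-null counts, and memory use
print(bikes.head())
bikes.info()
```

Note that `info()` prints its report directly, so it does not need to be wrapped in `print`.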

## Plotting the Data

Make a bar plot showing the mean number of riders (y-axis) per weekday (x-axis) using seaborn's [`catplot`](https://seaborn.pydata.org/generated/seaborn.catplot.html) function with `kind='bar'`.

Now, the x-axis labels aren't very helpful. Which day is 0?

This is a question about how the data is _coded_. We'll talk more about data encoding next week. Unfortunately, the data documentation doesn't actually say how weekdays are coded! But we can infer it from the data in this case: the first data point is January 1, 2011, which was a Saturday, coded as weekday 6; the code then resets to 0 for the next day and starts counting up.
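The calendar fact behind that inference can be double-checked with pandas. This is only a small sketch; it also shows that pandas' own `dayofweek` numbering (Monday = 0) differs from the coding inferred for this data set (Sunday = 0), so the two should not be mixed up.

```python
import pandas as pd

first_day = pd.Timestamp("2011-01-01")

# Name of the weekday for the first date in the data set
print(first_day.day_name())

# pandas numbers weekdays Monday=0 ... Sunday=6, which is NOT the data set's coding
print(first_day.dayofweek)
```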

**Always *look* at your data.**

Often, we will not be able to infer the data encoding from the data itself - we need to consult the codebook or data set description. We got lucky this time.  But looking at the data can help us make sense of the codebook.

Let's turn these weekday numbers into a _categorical_ variable so Pandas knows how to label them. 
Hint: use [pandas.Categorical.from_codes()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.from_codes.html).
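A minimal sketch of how `from_codes` behaves, using a short hand-written list of codes rather than the real column. The Sunday-first ordering of `names` is the coding inferred above, and the variable names are only placeholders.

```python
import pandas as pd

# Weekday codes as they might appear in the data: 0 = Sunday ... 6 = Saturday (inferred coding)
codes = [6, 0, 1, 2]
names = ["Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday"]

# Each integer code is looked up by position in the categories list
day_names = pd.Categorical.from_codes(codes, categories=names)
print(list(day_names))
```

Applied to a data frame, the result could be stored as a new column, for example one named `day_names`, built from the `weekday` column.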

Let's plot again using seaborn [`catplot`](https://seaborn.pydata.org/generated/seaborn.catplot.html), where `data=bikes`, the x-axis is `day_names`, and the y-axis is `cnt`.

You have now plotted the average rides per day.  
**Note:** When we do not tell a bar-style [`catplot`](https://seaborn.pydata.org/generated/seaborn.catplot.html) what to do with multiple points for the same value (in this case the weekday name), it computes the mean and a bootstrapped 95% confidence interval.
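To make the phrase "bootstrapped 95% confidence interval" concrete, here is a rough sketch of a percentile bootstrap using NumPy. It illustrates the general idea only and is not seaborn's actual implementation; the ride counts, the number of resamples, and the random seed are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical ride counts observed on one weekday
counts = np.array([985, 1321, 1263, 959, 1746, 1204, 1421, 1162])

# Resample with replacement many times, recording the mean of each resample
boot_means = np.array([
    rng.choice(counts, size=counts.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means gives the interval
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {counts.mean():.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

The bar height corresponds to the mean, and the error bar corresponds to the interval from `low` to `high`.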

## Viewing the Data over Time

Let's explore how rides per day changed over the course of the data set.
* This kind of data - a sequence of data points associated with times - is called a *time series*.
* This data set gives us an `instant` column that records the day number since the start of the data set.

Use [seaborn.lineplot()](https://seaborn.pydata.org/generated/seaborn.lineplot.html) where data=bikes, x-axis is `instant` and y-axis is `cnt` value. 

Let's view this graph with actual dates on the x-axis. The `dteday` column records the date, but `read_csv` loads it as plain strings by default. We can convert it to an actual date column by calling the [pandas.to_datetime()](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) function on the column:

```python
bikes['dt'] = pd.to_datetime(bikes['dteday'])
```

Now create a plot using [seaborn.lineplot()](https://seaborn.pydata.org/generated/seaborn.lineplot.html) where data=bikes, x-axis is `dt` and y-axis is `cnt` value. 


Next, plot the _weekly_ rides by resampling.  Right now, our `bikes` data is indexed by row number in the CSV file.  We can change its index to another column, such as our `dt` column with the date, which then lets us do things like resample by week:

```python
bikes.set_index('dt')['cnt'].resample('1W').sum().plot()
```

What that code did, in one line, is:

1. Set the data frame's index to `dt` (`bikes.set_index('dt')`), returning a new DF
2. Select the count column (`['cnt']`), returning a series
3. Resample the series by week (`.resample('1W')`)
4. Combine measurements within each sample by summing them (`.sum()`)
5. Plot the results using Pandas' defaults (`.plot()`)
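The same chain can be unpacked into named steps, which makes each intermediate result easy to inspect. The sketch below uses made-up data (14 days at 10 rides each) rather than the real file, and the intermediate variable names are placeholders.

```python
import pandas as pd

# Made-up data: 14 consecutive days starting Saturday 2011-01-01, 10 rides each day
bikes = pd.DataFrame({
    "dt": pd.date_range("2011-01-01", periods=14, freq="D"),
    "cnt": [10] * 14,
})

indexed = bikes.set_index("dt")      # new data frame indexed by date
counts = indexed["cnt"]              # a Series of daily counts
weekly_bins = counts.resample("1W")  # group days into weeks ending on Sunday
weekly = weekly_bins.sum()           # total rides per week

print(weekly)
```

Calling `weekly.plot()` would then draw the line chart. The first and last bins here cover partial weeks, so their totals are lower than the full week in between.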

Pandas' default plotting functions are useful for quick plots to see what's in a data frame or series. They are often difficult to turn into publication-ready charts.

## Submission Instructions

1. Select the 'Kernel' menu and choose 'Restart and Run All'.  This will also help you test requirement (2) above: that the notebook can be rerun from top to bottom.
2. Save the notebook. Your submitted notebook **must include results**.  
3. Copy it to your local ML/NetID.git repo, then commit and push:
   ```
   git add HW1.ipynb day.csv
   git commit -m "HW1 submission"
   git push origin
   ```
   OR upload the `HW1.ipynb` and `day.csv` files to https://git.txstate.edu/ML/<NetID>.git
