<img src="https://gist.githubusercontent.com/jakubczakon/10e5eb3d5024cc30cdb056d5acd3d92f/raw/5c464c16ccbc7150b4025e0a2a05b84ab99a7bc3/logo_DS_AI.png" alt="Drawing" width="600"/>

# deepsense.ai's workshop

# 1.1. Data exploration (bikes)

* Loading data
* Calculating summary statistics
* Creating plots

### Preamble

Before writing actual code, we need to import libraries. Typically all libraries sit at the top of a file.

In [None]:
# library for loading data and tables
# we will use it a lot, so we use its standard abbreviation - pd
import pandas as pd

# interactive HTML profile reports from a pandas DataFrame
from ydata_profiling import ProfileReport

import matplotlib.pyplot as plt
# IPython Notebook option to show plots in the notebook (not in a separate window)
%matplotlib inline

# plotting library
# loading it changes default styles to much nicer ones
import seaborn as sns
sns.set()

## Bike Sharing Dataset

We will be working with a two-year record of bike rentals in Washington, D.C.
https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

The official description of columns goes as follows:

- `instant`: record index
- `dteday`: date
- `season`: season (1:spring, 2:summer, 3:fall, 4:winter)
- `yr`: year (0: 2011, 1:2012)
- `mnth`: month ( 1 to 12)
- `holiday`: whether day is holiday or not (extracted from [Web Link](http://dchr.dc.gov/page/holiday-schedule))
- `weekday`: day of the week
- `workingday`: 1 if the day is neither a weekend nor a holiday, otherwise 0.
- `weathersit`: 
    - 1: Clear, Few clouds, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- `temp`: Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- `atemp`: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- `hum`: Normalized humidity. The values are divided by 100 (max)
- `windspeed`: Normalized wind speed. The values are divided by 67 (max)
- `casual`: count of casual users
- `registered`: count of registered users
- `cnt`: count of total rental bikes including both casual and registered
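
The normalization formulas above can be inverted to get back physical units. The snippet below is a minimal sketch on made-up values, not on the dataset itself. The helper `denormalize` and the `toy` frame are hypothetical names introduced for illustration. Note that the description states the temperature constants for the hourly scale only, so applying them to the daily file is an assumption worth checking.

```python
import pandas as pd

# constants quoted in the column description above
# (stated there for the hourly scale; using them elsewhere is an assumption)
T_MIN, T_MAX = -8.0, 39.0


def denormalize(values, lo, hi):
    """Invert (t - lo) / (hi - lo) to recover the original scale."""
    return values * (hi - lo) + lo


# toy values, not taken from the dataset
toy = pd.DataFrame({"temp": [0.0, 0.5, 1.0], "hum": [0.25, 0.5, 1.0]})
toy["temp_celsius"] = denormalize(toy["temp"], T_MIN, T_MAX)
toy["hum_percent"] = toy["hum"] * 100  # humidity was divided by 100
print(toy)
```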

In [None]:
# pd.read_csv reads comma-separated values as DataFrame
# it infers format of integers, floats and strings,
# but date columns need to be specified
df = pd.read_csv("data/Bike-Sharing-Dataset/day.csv", parse_dates=["dteday"])
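
To see what `parse_dates` changes, here is a small self-contained sketch that reads the same CSV text twice. The CSV content is made up for illustration and only borrows the column names `dteday` and `cnt`.

```python
import io

import pandas as pd

# made-up CSV content with the same column names as the dataset
csv_text = "dteday,cnt\n2011-01-01,10\n2011-02-15,20\n"

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["dteday"])

print(plain.dtypes)   # dteday is read as text
print(parsed.dtypes)  # dteday is a datetime column

# datetime columns expose the .dt accessor, e.g. to extract the month
print(parsed["dteday"].dt.month)
```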

## Simple exploration

Before we do **any** machine learning it's essential to explore data. Reading its description is crucial, but not enough!

Possible issues we can catch:

* missing values (are there any? how often? are they correlated?),
* inconsistent schema (e.g. zip codes sometimes as strings, sometimes as numbers),
* inconsistent values (e.g. "USA" and "United States" for the same country),
* abundance of certain data, biases (e.g. in plants and animals data, 90% are plants),
* censoring (e.g. people with high income capped at a given number),
* absurd values (especially for self-reported fields), 
* units (e.g. time in seconds, or money in local currency).
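
A few of the checks listed above can be sketched in pandas. The `toy` frame below is invented for illustration and has nothing to do with the bike dataset. Which checks matter depends on the data, so treat this as a starting point rather than a complete audit.

```python
import numpy as np
import pandas as pd

# invented data showing several of the issues listed above
toy = pd.DataFrame({
    "country": ["USA", "United States", "Poland", "USA"],
    "zip_code": ["20001", 20002, "00-001", None],
    "income": [50_000, 250_000, 250_000, np.nan],
})

# missing values: count per column
missing_per_column = toy.isna().sum()

# inconsistent schema: which Python types appear in one column?
zip_types = toy["zip_code"].dropna().map(lambda v: type(v).__name__).value_counts()

# inconsistent values: eyeball the distinct labels and their counts
country_counts = toy["country"].value_counts()

# possible censoring: share of rows sitting exactly at the maximum
capped_share = (toy["income"] == toy["income"].max()).mean()

print(missing_per_column, zip_types, country_counts, capped_share, sep="\n\n")
```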

Data is as it is. Data scientists claim that they spend 80-90% of their time cleaning and preparing data.

Investigating data quality is not a nuisance - it's a big part of being a data scientist! And often it's a fascinating exploration.

Do you know anything about biking patterns? I don't. But I would love to learn (from the data)!

In [None]:
# df.head() shows the first 5 rows of a dataframe 
df.head()

In [None]:
# a python starts with a `head` and ends with a `tail`
df.tail(3)

### Exercise

* Decipher weekdays. Is `0`: `Sunday`, `Monday` or something different?

In [None]:
# list column names
df.columns

In [None]:
# number of rows and number of columns
df.shape

In [None]:
# types for each column
df.dtypes

In [None]:
# show a given column
df["dteday"]

In [None]:
# we can also use a dot
df.dteday

In [None]:
# plot a column
df["windspeed"].plot(figsize=(12,6))

In [None]:
# for more readable plots we set 'dteday' as the index
df = df.set_index("dteday")
# WARNING: run this only once; afterwards `dteday` is the index and no longer a column

In [None]:
df.head(3)

In [None]:
df["windspeed"].plot(figsize=(12,6))

### Exercise

* Try plotting other variables (e.g. `hum` and `temp`). 

In [None]:
# df.query("condition") returns a new DataFrame with only the rows that satisfy the condition
df.query("cnt < 500")

In [None]:
# we can obtain the same result in another way
df[df.cnt < 500]
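
Both styles also work with combined conditions. The sketch below uses an invented `toy` frame and a hypothetical `threshold` variable. In `query`, a local Python variable is referenced with `@`, while boolean masks need `&` and parentheses around each condition.

```python
import pandas as pd

# invented values, reusing two column names from the dataset
toy = pd.DataFrame({"cnt": [120, 4500, 300, 800], "holiday": [0, 0, 1, 0]})
threshold = 500

# query: a string expression; @name refers to a local variable
by_query = toy.query("cnt < @threshold and holiday == 0")

# boolean mask: combine conditions with & and parenthesize each one
by_mask = toy[(toy["cnt"] < threshold) & (toy["holiday"] == 0)]

print(by_query)
print(by_query.equals(by_mask))
```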

### Exercise

* Look at the total number of rentals on the 29th of October 2012. Is that a data error? Google the answer.

In [None]:
df[["temp", "cnt"]].plot(figsize=(12,6))

In [None]:
# to make the plot readable, rescale the unnormalized variable so that its maximum is 1
df["cnt_scaled"] = df["cnt"]/df["cnt"].max()

In [None]:
df[["temp", "cnt_scaled"]].plot(figsize=(12,6))

In [None]:
df.drop(["cnt_scaled"], axis=1, inplace=True)
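
The rescaling used above divides by the maximum, which keeps zero at zero. A common alternative is min-max scaling, the same formula the column description gives for `temp`. The sketch below compares the two on invented numbers; the column names `cnt_by_max` and `cnt_min_max` are made up for this example.

```python
import pandas as pd

# invented values
toy = pd.DataFrame({"cnt": [100, 300, 500]})

# divide by the maximum: values land in (0, 1], zero stays zero
toy["cnt_by_max"] = toy["cnt"] / toy["cnt"].max()

# min-max scaling: the smallest value maps to 0, the largest to 1
cnt_min, cnt_max = toy["cnt"].min(), toy["cnt"].max()
toy["cnt_min_max"] = (toy["cnt"] - cnt_min) / (cnt_max - cnt_min)

print(toy)
```

Which one to use depends on whether the distance from zero is meaningful for the variable in question.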

### Exercise

* Show registered and casual users on the same plot.
* Show humidity, temperature and wind speed on the same plot. 

In [None]:
# what is computed in this table?
df.pivot_table(index="mnth", columns='season', values="instant", aggfunc=len).fillna(0)

### Exercise

* What is wrong with the above? (Hint: look at column descriptions at the beginning of the notebook.)

In [None]:
# mean of each column
df.mean()

In [None]:
# series.value_counts() shows all values along with their counts
df["season"].value_counts().sort_index()

In [None]:
# and now as plots
df["season"].value_counts().sort_index().plot(kind="bar")
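
When categories differ a lot in size, relative frequencies are often easier to compare than raw counts. A minimal sketch on an invented series, using the `normalize` argument of `value_counts`:

```python
import pandas as pd

# invented category codes
toy = pd.Series([1, 3, 1, 2, 3, 1], name="weathersit")

# raw counts, ordered by category code rather than by frequency
counts = toy.value_counts().sort_index()

# normalize=True returns shares that sum to 1
shares = toy.value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```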

### Exercise 

* Make similar plots for `weathersit` and `holiday`.

In [None]:
# Can holiday be a working day?
df[df.holiday == 1].workingday.unique()

### Exercise 

* Is every non-workingday a holiday?

In [None]:
# statistics
df["temp"].describe()

In [None]:
# series.hist() plots a histogram
# you can tweak the number of bins by using keyword argument bins (by default it is 10)
df["workingday"].hist(bins=25)

### Exercise

* Plot histogram of registered users. Tweak bins to your taste. 

In [None]:
# a scatter plot
df.plot(kind='scatter', x='temp', y='hum')

In [None]:
# describe all numerical columns
df.describe()

In [None]:
# table of Pearson correlation coefficients
df.corr()

In [None]:
# too hard to read? a heatmap from seaborn makes it much cleaner
sns.heatmap(df.corr())

In [None]:
# or maybe even a sorted correlation plot?
sns.clustermap(df.corr())
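
The numbers behind the correlation table and heatmaps above are Pearson coefficients. As a sanity check, the sketch below computes one by hand on invented values and compares it with `DataFrame.corr`. The names `r_manual` and `r_pandas` are made up for this example.

```python
import numpy as np
import pandas as pd

# invented values, reusing two column names from the dataset
toy = pd.DataFrame({"temp": [0.2, 0.4, 0.6, 0.8],
                    "cnt": [1000, 2500, 3000, 5000]})

# Pearson r: covariance of the centered variables
# divided by the product of their spreads
x = toy["temp"] - toy["temp"].mean()
y = toy["cnt"] - toy["cnt"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# the same coefficient taken from the pandas correlation table
r_pandas = toy.corr().loc["temp", "cnt"]

print(r_manual, r_pandas)
```

Pearson's coefficient only measures linear association, so a value near zero does not rule out a strong nonlinear relationship.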

In [None]:
# sns.catplot allows us to analyze multifactor relations
# other parameters: hue and row
sns.catplot(data=df,                                   # dataframe
            x="weekday", y="cnt",                      # mandatory parameters
            col="season",                              # optional parameters
            col_wrap=4,
            order=[0, 1, 2, 3, 4, 5, 6],               # by default entries are not sorted
            kind="bar")                                # plot type, optional

### Exercises

* Using `sns.catplot`, show the dependency of the `casual` user count on `weekday` and `mnth`.

In [None]:
# all in one - profiling report
ProfileReport(df, title="Profiling Report")
# Note: Jupyter Lab 0.34.8 displays it incorrectly. Try Jupyter Notebook instead.

In [None]:
# or maybe you prefer to view the report as a separate HTML file
profile = ProfileReport(df, title="Profiling Report")
profile.to_file("your_report.html")