#### Key questions
The dataset : total number of coffees made vs time.

1. Who are the main contributors to this dataset, and when are contributions generally made ?
2. What are the department's weekday coffee habits ?
3. How much coffee are people drinking ?

In [1]:
import pandas as pd
%matplotlib inline
from IPython.display import Image

**Note :**
The second line tells matplotlib to render plots directly beneath the cell that runs the plotting code. pandas uses matplotlib to generate graphs; without this line, the graphs would open outside the Jupyter notebook when you called plt.show() - but we just want them to appear inline without that extra call.

http://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1

#### Importing the data
Let's import the coffee data from CSV.

In [2]:
data = pd.read_csv(r'D:\Dataset\coffees.csv') # raw string, so the backslashes in the Windows path are not read as escapes

**Note :**
Pandas can read from many data formats : CSV, JSON, Excel, HDF5, SQL, and more.

http://pandas.pydata.org/pandas-docs/version/0.20/io.html
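As a minimal sketch of the same call without needing the file on disk, read_csv can also read from a file-like object. The three rows below are a hypothetical sample laid out like coffees.csv, not the real data.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as coffees.csv.
csv_text = """timestamp,coffees,contributor
2011-10-03 08:22:00,397.0,Quentin
2011-10-05 07:02:00,testing,Anthony
2011-10-05 08:25:00,,Quentin
"""

# read_csv accepts a path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # ['timestamp', 'coffees', 'contributor']
print(len(sample))              # 3
```

The empty coffees field in the last row is read as NaN, while the stray word keeps the whole column from being parsed as numbers.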

###### What does the data look like?

In [3]:
data

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397.0,Quentin
1,2011-10-04 11:48:00,410.0,Quentin
2,2011-10-05 07:02:00,testing,Anthony
3,2011-10-05 08:25:00,,Quentin
4,2011-10-05 10:47:00,464.0,Quentin
5,2011-10-05 13:15:00,481.0,Quentin
6,2011-10-06 07:21:00,503.0,Anthony
7,2011-10-06 10:04:00,513.0,Quentin
8,2011-10-06 12:14:00,539.0,Mike
9,2011-10-06 12:49:00,540.0,Quentin


###### Show me just the first few rows of the dataframe

In [4]:
data.head()

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397.0,Quentin
1,2011-10-04 11:48:00,410.0,Quentin
2,2011-10-05 07:02:00,testing,Anthony
3,2011-10-05 08:25:00,,Quentin
4,2011-10-05 10:47:00,464.0,Quentin


We have an index, and three columns : timestamp, coffees, and contributor.

###### Problems with our dataset
1. Why is there a string of text, testing, in our coffee numbers? 
2. What's going on in the coffees column in the row after that?

**Note :**
df.head(n=10) would show the first ten rows. The default is n=5.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

###### Let's look at the string in the third row

In [5]:
# .loc or .iloc
data.loc[2]

timestamp      2011-10-05 07:02:00
coffees                    testing
contributor                Anthony
Name: 2, dtype: object

Definitely a string. We'll note this as something to fix after we finish looking around.

**Note :**

**.loc** uses a label-based lookup, which means that the value you pass into the square brackets must be in the index. 

**.iloc** is integer-location-based, so .iloc[2] would return the third row. In this case, they're the same, but had we changed our index, as we'll see later, things would work differently.

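Indexing a dataframe with [] directly returns a pd.Series or pd.DataFrame by searching over columns, not rows ( a slice such as data[:5] is the exception and selects rows ). Slicing a pd.Series with [] ( for example [:5] ) selects by position, like .iloc, whereas a single key such as [0] is looked up as an index label; the two coincide only while the index is the default 0, 1, 2, ...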

https://pandas.pydata.org/pandas-docs/stable/indexing.html
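A small sketch of where the two lookups diverge, assuming a hypothetical toy frame whose index has gaps (as ours will once rows are dropped):

```python
import pandas as pd

# Toy frame whose index labels are 0, 1, 4 (as if rows 2 and 3 had been dropped).
toy = pd.DataFrame({"coffees": [397, 410, 464]}, index=[0, 1, 4])

by_label = toy.loc[4]      # label 4 -> the row holding 464
by_position = toy.iloc[2]  # third row by position -> the same row here

print(by_label["coffees"], by_position["coffees"])  # 464 464

# No row carries the label 2, so a label-based lookup fails.
try:
    toy.loc[2]
    label_missing = False
except KeyError:
    label_missing = True

print(label_missing)  # True
```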

###### We should also take a look at that NaN. In fact, let's look at the first five values in coffees.

In [6]:
# attribute access returns the coffees column as a pd.Series
data.coffees

0        397.0
1        410.0
2      testing
3          NaN
4        464.0
5        481.0
6        503.0
7        513.0
8        539.0
9        540.0
10       563.0
11       581.0
12       587.0
13       605.0
14       616.0
15         NaN
16       626.0
17       635.0
18       650.0
19       656.0
20       673.0
21       694.0
22       699.0
23       713.0
24       770.0
25       790.0
26       799.0
27       805.0
28       818.0
29       819.0
        ...   
641        NaN
642        NaN
643    16195.0
644    16237.0
645    16257.0
646    16513.0
647    16659.0
648    16714.0
649    16891.0
650    16909.0
651    16977.0
652    17104.0
653        NaN
654    17165.0
655    17345.0
656    17354.0
657    17468.0
658    17489.0
659    17564.0
660    17789.0
661    17793.0
662    17824.0
663    17852.0
664    17868.0
665    18062.0
666    18235.0
667    18942.0
668    19698.0
669    24450.0
670    24463.0
Name: coffees, Length: 671, dtype: object

**Note :** here, we're working with a series ( a pd.Series object ). When you access a single column of a pd.DataFrame ( here, data ) with data.coffees or data["coffees"], the object returned is a pd.Series. Slicing that series with [], as with numpy arrays, is positional, so [:5] returns the first five elements.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

In [7]:
data.coffees[:5] # To give us the first five elements in that series

0      397.0
1      410.0
2    testing
3        NaN
4      464.0
Name: coffees, dtype: object

**Difference between a DataFrame and a Series** - A dataframe holds one or more columns, and each of those columns can be pulled out and viewed as a series.
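To make the distinction concrete, here is a short sketch on a hypothetical toy frame: a single column label gives back a series, a list of labels gives back a dataframe.

```python
import pandas as pd

toy = pd.DataFrame({
    "coffees": [397, 410, 464],
    "contributor": ["Quentin", "Quentin", "Anthony"],
})

column = toy["coffees"]       # single label -> pd.Series
sub_frame = toy[["coffees"]]  # list of labels -> pd.DataFrame

print(type(column).__name__)     # Series
print(type(sub_frame).__name__)  # DataFrame
print(column[:2].tolist())       # [397, 410]
```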

###### How long is the dataset?


In [8]:
print("Dataset length :")

# len()
print(len(data))

Dataset length :
671


In [9]:
# We could have used shape to get the same result but much quicker and easier
data.shape

(671, 3)

**What else can we find out?**

In [10]:
# .describe() - summarises each column; called as is, it uses the default settings
data.describe()

Unnamed: 0,timestamp,coffees,contributor
count,671,658.0,671
unique,671,654.0,9
top,2012-10-01 13:08:00,9134.0,Quentin
freq,1,2.0,367


Looks like we also have some missing data - we have 671 rows, but the coffees column only has 658 entries.

**Note :** .describe() returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns ( strings ), it will return the number of unique entries and the most frequent entry with its frequency; datetime columns also get their first and last items.

For all columns, .describe() will return the count of objects in that column ( not counting NaNs, which is why coffees only shows 658 entries whereas timestamp and contributor show 671 entries ). For the object columns we have here, it also returns the number of unique entries. Theoretically, all of the timestamp data should be unique, hence 671 in the unique row, and we can see that we have 9 different contributors.

You can determine what's returned using .describe()'s keyword arguments.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html
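A rough sketch of how the summary changes with the column type, using a hypothetical toy frame rather than the coffee data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "coffees": [397.0, 410.0, np.nan],
    "contributor": ["Quentin", "Quentin", "Anthony"],
})

numeric_summary = toy["coffees"].describe()     # count, mean, std, min, percentiles, max
string_summary = toy["contributor"].describe()  # count, unique, top, freq
all_summary = toy.describe(include="all")       # both sets of rows, NaN where not applicable

print(numeric_summary["count"])  # 2.0 (the NaN is not counted)
print(string_summary["top"])     # Quentin
print(string_summary["freq"])    # 2
```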

###### What to do with the missing data?
We could drop those rows, as they are not contributing anything, or we could try to interpolate over the NaNs. We'll do both today.

**Let's look at the dataframe where coffees is null.**

In [11]:
# .isnull() and boolean indexing with []
data.isnull() # We are looking for True(s) as this tells us that we are missing data

Unnamed: 0,timestamp,coffees,contributor
0,False,False,False
1,False,False,False
2,False,False,False
3,False,True,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False
9,False,False,False


In [12]:
data['coffees'].isnull() # He may not like the brackets but I do

0      False
1      False
2      False
3       True
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15      True
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
641     True
642     True
643    False
644    False
645    False
646    False
647    False
648    False
649    False
650    False
651    False
652    False
653     True
654    False
655    False
656    False
657    False
658    False
659    False
660    False
661    False
662    False
663    False
664    False
665    False
666    False
667    False
668    False
669    False
670    False
Name: coffees, Length: 671, dtype: bool

**Note :** .isnull() returns a boolean array ( an array of Trues and Falses ), that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the coffees column are null, and we use that to index against the full dataframe - so we get back every column in the dataframe, but only those rows where coffees is null.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html

In [13]:
data[data['coffees'].isnull()] # Return to me the data where data.coffees is null

Unnamed: 0,timestamp,coffees,contributor
3,2011-10-05 08:25:00,,Quentin
15,2011-10-07 14:10:00,,Ben
72,2011-10-28 10:53:00,,Mike M
95,2011-11-11 11:13:00,,Quentin
323,2012-06-10 16:10:00,,Sergio
370,2012-07-13 13:59:00,,Mike
394,2012-08-03 14:35:00,,Sergio
479,2012-09-21 10:15:00,,Sergio
562,2012-11-01 09:45:00,,Quentin
606,2012-11-30 13:11:00,,Quentin


i.e. return to me every row of the dataframe in which the coffees value is NaN ( missing ).
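The same boolean mask can also be summed to count the missing entries. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "coffees": [397.0, np.nan, 464.0, np.nan],
    "contributor": ["Quentin", "Ben", "Quentin", "Sergio"],
})

mask = toy["coffees"].isnull()  # boolean Series, True where coffees is NaN
missing_rows = toy[mask]        # every column, but only the rows where mask is True
n_missing = mask.sum()          # True counts as 1, so this is the number of NaNs

print(n_missing)                             # 2
print(missing_rows["contributor"].tolist())  # ['Ben', 'Sergio']
```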

###### What type of Python objects are the columns?
Returning to the problem of the string in the coffees column

In [14]:
# .dtypes
data.dtypes # Returns the data types of all of my columns

timestamp      object
coffees        object
contributor    object
dtype: object

The contributor column makes sense as object, because we expect strings there; but surely the timestamp should be a timestamp-type, and coffees should be numerical? We need the data to be in the correct format so that we can inspect it properly.

**Let's inspect what is in the timestamp column.**

In [15]:
# print the first element of the series with [] indexing
print(data.timestamp[0])

# print its type()
print(type(data.timestamp[0]))

2011-10-03 08:22:00
<class 'str'>


It looks like the timestamp field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have pandas interpret these strings as datetimes for us automatically.

**Note :** here's an example of using direct [] indexing on a pd.Series. We're accessing the first entry, just to see what type of object we have there.

On our first pass, what problems did we find?
1. The timestamp column contains strings; these need to be datetime objects
2. The coffees column contains some null values and at least one string


### Cleaning the data
**Problem One - The coffees column should only contain numerical data.**

In [17]:
# cast the coffees column using pd.to_numeric, and coerce errors
data.coffees = pd.to_numeric(data.coffees, errors = 'coerce')

data.head()

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397.0,Quentin
1,2011-10-04 11:48:00,410.0,Quentin
2,2011-10-05 07:02:00,,Anthony
3,2011-10-05 08:25:00,,Quentin
4,2011-10-05 10:47:00,464.0,Quentin


If we tell Pandas to convert our column to numeric values only, it will have a good go. (We use to_numeric first as it handles errors quite well.) By default, though, the call throws an error because we have a string, testing, in there and it cannot parse that. NaN values are left alone as they are already valid numeric elements.<br>
With *<font color=blue>errors = 'coerce'</font>* added to the command, we tell Pandas to turn anything it cannot parse into a NaN. The testing string is the only such value that we have, and it is safe to discard as it doesn't provide any useful information...
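A short sketch of both behaviours on a hypothetical three-element series, assuming the default errors="raise":

```python
import numpy as np
import pandas as pd

raw = pd.Series(["397.0", "testing", np.nan], dtype=object)

# Default behaviour: the unparseable string makes the whole call fail.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError as exc:
    raised = True
    print("to_numeric failed:", exc)

# With errors="coerce", anything unparseable becomes NaN instead.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned.isnull().tolist())  # [False, True, True]
```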

In [18]:
data.dtypes

timestamp       object
coffees        float64
contributor     object
dtype: object

We can see that to_numeric ran fine: the dataframe displays, and when we check the data types of the different columns, we see that coffees is now float64. We have some NaNs, but we were expecting those.<br><br>
**The coffees column contains NaNs.**<br>
There are several ways to deal with these...<br>
1. We could pretend that they don't exist and just drop the rows<br>
2. We could fill in the rows, maybe by interpolating<br>
    a. If there is a value above and a value below, we could take their mean<br>
3. We could interpolate in time: draw a line between the readings on either side, and take the value where the missing timestamp falls on that line<br>
    a. This interpolates between dates rather than between two points assumed to be equidistant<br><br>
For now, we are going to use dropna()... 
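To illustrate the difference between options 2 and 3, here is a sketch using Series.interpolate on hypothetical readings. It assumes the series is indexed by timestamps, which method="time" requires.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one day between the first two, three days between the last two.
stamps = pd.to_datetime(["2011-10-01", "2011-10-02", "2011-10-05"])
counts = pd.Series([100.0, np.nan, 500.0], index=stamps)

by_position = counts.interpolate(method="linear")  # treats rows as equally spaced
by_time = counts.interpolate(method="time")        # weights by elapsed time

print(round(by_position.iloc[1], 6))  # 300.0
print(round(by_time.iloc[1], 6))      # 200.0
```

The missing reading sits a quarter of the way along the four-day gap, so the time-based estimate is a quarter of the way from 100 to 500, while the positional estimate is simply the midpoint.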

In [19]:
# Use .dropna() and pass inplace
data.dropna(inplace = True) # Without inplace, dropna() returns a cleaned copy and leaves data itself unchanged
#data = data.dropna() - Would work just as well
data.head()

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397.0,Quentin
1,2011-10-04 11:48:00,410.0,Quentin
4,2011-10-05 10:47:00,464.0,Quentin
5,2011-10-05 13:15:00,481.0,Quentin
6,2011-10-06 07:21:00,503.0,Anthony
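A minimal sketch of how dropna() behaves with and without inplace, on a hypothetical toy frame. The subset argument, not used in the cell above, restricts the NaN check to the named columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "coffees": [397.0, np.nan, 464.0, 481.0],
    "contributor": ["Quentin", "Ben", None, "Mike"],
})

dropped_any = toy.dropna()                        # new frame: rows with any NaN removed
dropped_coffees = toy.dropna(subset=["coffees"])  # new frame: only coffees is checked

print(len(toy), len(dropped_any), len(dropped_coffees))  # 4 2 3

result = toy.dropna(subset=["coffees"], inplace=True)  # modifies toy itself

print(result)    # None
print(len(toy))  # 3
```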


Using *<font color = blue>data = data.dropna()</font>* gives the same result: dropna() returns a new dataframe without the NaN rows, and assigning it back to data replaces the original.<br><br>
**The coffees column is of type float.**<br>
We want this column to be integers

In [23]:
# Cast to int using .astype()
data.coffees = data.coffees.astype('int64')
data.head()

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397,Quentin
1,2011-10-04 11:48:00,410,Quentin
4,2011-10-05 10:47:00,464,Quentin
5,2011-10-05 13:15:00,481,Quentin
6,2011-10-06 07:21:00,503,Anthony


In [24]:
data.coffees.dtype

dtype('int64')

In [25]:
data.dtypes

timestamp      object
coffees         int64
contributor    object
dtype: object
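The order of the cleaning steps matters here. As a sketch on a hypothetical series, casting to int64 fails while NaNs are still present, which is why the rows were dropped first:

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([397.0, np.nan, 464.0])

# NaN has no integer representation, so the cast is refused.
try:
    with_gap.astype("int64")
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print("cast failed:", exc)

as_int = with_gap.dropna().astype("int64")

print(as_int.tolist())  # [397, 464]
```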

**Let's have pandas parse the timestamp strings to datetime objects.**<br>
We want to convert the timestamp column from strings ( dtype object ) to datetime objects

In [26]:
# pd.to_datetime()
data.timestamp = pd.to_datetime(data.timestamp)

# Confirm dtypes
data.dtypes

timestamp      datetime64[ns]
coffees                 int64
contributor            object
dtype: object
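A brief sketch of what the conversion buys us, using two timestamps from the head of the table: once parsed, the values expose their date parts and support arithmetic.

```python
import pandas as pd

stamps = pd.Series(["2011-10-03 08:22:00", "2011-10-04 11:48:00"])
parsed = pd.to_datetime(stamps)

first = parsed.iloc[0]
gap = parsed.iloc[1] - first  # subtracting timestamps gives a Timedelta

print(type(first).__name__)    # Timestamp
print(first.year, first.hour)  # 2011 8
print(gap)                     # 1 days 03:26:00
```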

**So where do we stand?**

In [27]:
# .describe() with its default arguments
data.describe()

Unnamed: 0,coffees
count,657.0
mean,8568.471842
std,4600.215049
min,397.0
25%,4986.0
50%,9172.0
75%,11562.0
max,24463.0


We are only seeing the coffees column now because it is the only numeric column that we have left. (The others are datetime and object.) By default, describe shows us just the numeric columns when the df has mixed data types.<br>
It is possible to get around this default and see everything, though...

In [28]:
data.describe(include = 'all') # Now we see the whole df

Unnamed: 0,timestamp,coffees,contributor
count,657,657.0,657
unique,657,,9
top,2011-10-24 14:32:00,,Quentin
freq,1,,361
first,2011-10-03 08:22:00,,
last,2013-09-13 10:28:00,,
mean,,8568.471842,
std,,4600.215049,
min,,397.0,
25%,,4986.0,


**Note :** *<font color = blue>.describe(include="all")</font>* describes all attributes of all columns, but some don't make sense for a given column's dtype. For example, the contributor column has no first and last attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp column, where sorting has an intuitive meaning, but not so much for strings ( alphabetical order doesn't really matter when they're arbitrary strings ). Similarly, the timestamp column has no mean or other numerical traits. What would it mean to calculate the mean timestamp ?<br><br>
**The first five rows**

In [29]:
data.iloc[:5] # To show us the first five rows

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397,Quentin
1,2011-10-04 11:48:00,410,Quentin
4,2011-10-05 10:47:00,464,Quentin
5,2011-10-05 13:15:00,481,Quentin
6,2011-10-06 07:21:00,503,Anthony


In [30]:
data[:5] # This works too

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397,Quentin
1,2011-10-04 11:48:00,410,Quentin
4,2011-10-05 10:47:00,464,Quentin
5,2011-10-05 13:15:00,481,Quentin
6,2011-10-06 07:21:00,503,Anthony


In [31]:
data.head() # Is the one that we know and love

Unnamed: 0,timestamp,coffees,contributor
0,2011-10-03 08:22:00,397,Quentin
1,2011-10-04 11:48:00,410,Quentin
4,2011-10-05 10:47:00,464,Quentin
5,2011-10-05 13:15:00,481,Quentin
6,2011-10-06 07:21:00,503,Anthony


### The time-series at a glance
**Let's begin by visualising the coffee counts.**