# Tips dataset analysis

Description: Fundamentals of Data Analysis - assignment project, GMIT 2019. See README.md for more background info.

>Author: **Andrzej Kocielski**  
>Github: [andkoc001](https://github.com/andkoc001/)  
>Email: G00376291@gmit.ie, and.koc001@gmail.com

Date of creation: 23-09-2019

This Notebook should be read in conjunction with the corresponding README.md file at the assignment repository at GitHub: <https://github.com/andkoc001/Tips_dataset_analysis/>.

___

## Introduction

### The data

It is a common custom to offer the staff of a restaurant a small amount of extra money, a tip, on top of the bill after satisfactory service. Although tips are voluntary and the amount is (usually) not stated, convention suggests leaving several percent of the total bill for the meal and service.

The _tips dataset_ is a record of tips given in a restaurant. It was collated (allegedly from real data) over several weeks by a waiter working in the restaurant, and lists each tip along with the total bill and some other particulars of the restaurant's customers.

The data is organised as a table: the dataset attributes (aka features) are in columns, and the observations (aka instances) are in rows. The attributes describe the tip received and its circumstances, such as the day of the week, the total bill, etc. The dataset includes 244 observations.

The _tips dataset_ is also available through the [Seaborn](https://seaborn.pydata.org/) package, an external Python package for data visualisation that is used in this analysis as well. It is this copy of the dataset that is analysed below. The dataset can also be obtained as a .csv file from, for example, [here](http://vincentarelbundock.github.io/Rdatasets/datasets.html).

### Hypothesis

The dataset prompts the following question: **Is there a linear relationship between _total bill_ and _amount of tip_?**  
Below we will look for evidence either supporting or contradicting this hypothesis.

### The Analysis
This Notebook presents my analysis and interpretation of the _tips dataset_. This is done through the following:
1. descriptive statistics of the raw data,
2. graphical representation of the data - plots,
3. inference of the implicit information by application of selected tools, methods and algorithms used in data analytics.


___
## Setting up the environment

### Importing additional packages  

In [1]:
# numerical calculations libraries
import numpy as np
import pandas as pd

# plotting and data visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

# display the plots inside the Notebook, rather than in a separate window
%matplotlib inline

### Loading the data set

Assigning the data from _Seaborn_ package to variable `tips`.

In [3]:
# Loading the data set
tips = sns.load_dataset("tips")

### The dataset integrity check and insight into the database 

Let's first see the shape of the dataset, that is, how many rows and columns it has.

In [4]:
tips.shape # number of rows and columns respectively

(244, 7)

Let's check the data type of each column, using the `DataFrame.dtypes` attribute.

In [6]:
tips.dtypes

total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object

Now, let's see the unique values in each column of the `category` type.  
Adapted from [List Unique Values In A pandas Column](https://chrisalbon.com/python/data_wrangling/pandas_list_unique_values_in_column/).

In [7]:
# List the unique values in each categorical column
print(tips.sex.unique())
print(tips.smoker.unique())
print(tips.day.unique())
print(tips.time.unique())

[Female, Male]
Categories (2, object): [Female, Male]
[No, Yes]
Categories (2, object): [No, Yes]
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]
[Dinner, Lunch]
Categories (2, object): [Dinner, Lunch]


As it is good practice to check data integrity, let's see whether there are any empty cells. For this purpose we will use `df.isnull().any()`, which checks, column by column, whether _any_ value is `null`. If the output is _False_, the column does not contain any `null` value.

In [5]:
tips.isnull().any()

total_bill    False
tip           False
sex           False
smoker        False
day           False
time          False
size          False
dtype: bool

The initial checks found no missing values, so the dataset is fit for the subsequent analysis.
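
A minimal sketch of the same check on a tiny, made-up frame, so that the effect of a missing value is visible. The frame `sample` and its deliberately injected `NaN` are assumptions for illustration only and are not part of the _tips dataset_; `isnull().sum()` additionally counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the NaN is injected on purpose to trigger the check.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, np.nan],
    "tip": [1.01, 1.66, 3.50],
})

has_null = sample.isnull().any()    # True for a column with at least one null
null_count = sample.isnull().sum()  # number of nulls in each column

print(has_null)
print(null_count)
```

A column that reports `True` here would need cleaning (or its rows dropping) before any further analysis.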

___

## Data analysis

### 1.1 Descriptive statistical analysis

#### Raw data analysis

Listed below are the first few rows of the data. This listing gives an initial impression of the dataset structure, as well as its attributes (columns) and the data types of the variables.

In [8]:
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


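The last few rows of the dataset (to check that the end of the data is not corrupted). Note that the output below also shows the derived columns (`sum`, `tip_ratio`, `bpp`, `tpp`), because this cell was last run after they had been created in the next subsection: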

In [22]:
tips.tail(2)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sum,tip_ratio,bpp,tpp
242,17.82,1.75,Male,No,Sat,Dinner,2,19.57,0.089423,8.91,0.875
243,18.78,3.0,Female,No,Thur,Dinner,2,21.78,0.137741,9.39,1.5


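A basic statistical description of the numerical attributes of the dataset, produced with `tips.describe()`, is given at the end of the next subsection, once the derived attributes have been created. The information includes, inter alia, the mean, standard deviation, minimum and maximum of each column.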
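
As a rough illustration of what these statistics mean, the sketch below computes them with the Python standard library for the five `total_bill` values visible in the listings above (rows 0-2, 242 and 243). The list `bills` is only that small excerpt, so the numbers differ from those of the full dataset. The sample standard deviation (`statistics.stdev`, dividing by n - 1) is used, which is also the pandas default.

```python
import statistics

# total_bill values copied from the head and tail listings (an excerpt only)
bills = [16.99, 10.34, 21.01, 17.82, 18.78]

summary = {
    "count": len(bills),
    "mean": statistics.mean(bills),
    "std": statistics.stdev(bills),  # sample standard deviation (n - 1)
    "min": min(bills),
    "max": max(bills),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```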

#### Raw data modeling

It would be interesting to evaluate the relationships within the data. For this purpose, let's extend the dataset with new, derived attributes. I am going to create the following:
- the sum of total bill and tip, `sum`, and ratio of tip to sum of bill and tip, `tip_ratio`,
- amount of total bill divided by number of people in the group, `bpp`,
- amount of tip per person, `tpp`.

In [13]:
# new column created - sum of total bill and tip
tips["sum"] = tips["total_bill"]+tips["tip"] # appended at the end of the array

# new column created - ratio of tip to sum
tips["tip_ratio"] = tips["tip"]/tips["sum"] 

# add column: bpp - bill per person
tips["bpp"] = tips["total_bill"]/tips["size"]

# add column: tpp - tip per person
tips["tpp"] = tips["tip"]/tips["size"]


Now, let's see the dataset array and the characteristics of the additional columns.

In [21]:
tips.head(2)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sum,tip_ratio,bpp,tpp
0,16.99,1.01,Female,No,Sun,Dinner,2,18.0,0.056111,8.495,0.505
1,10.34,1.66,Male,No,Sun,Dinner,3,12.0,0.138333,3.446667,0.553333
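
The derived values in the listing above can be checked by hand. The sketch below repeats the four formulas for rows 0 and 1 in plain Python; the helper name `derive` is made up for this illustration. Note that `tip_ratio` is taken relative to the bill plus the tip, so it is slightly lower than the tip expressed as a fraction of the bill alone.

```python
def derive(total_bill, tip, size):
    """Return the four derived attributes for one observation."""
    total = total_bill + tip
    return {
        "sum": total,                 # bill and tip together
        "tip_ratio": tip / total,     # tip as a fraction of bill + tip
        "bpp": total_bill / size,     # bill per person
        "tpp": tip / size,            # tip per person
    }

row0 = derive(16.99, 1.01, 2)  # inputs taken from row 0 of the listing
row1 = derive(10.34, 1.66, 3)  # inputs taken from row 1 of the listing

for row in (row0, row1):
    print({name: round(value, 6) for name, value in row.items()})
```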


In [20]:
tips.describe()

Unnamed: 0,total_bill,tip,size,sum,tip_ratio,bpp,tpp
count,244.0,244.0,244.0,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,22.784221,0.136432,7.88823,1.212762
std,8.902412,1.383638,0.9511,9.890116,0.040585,2.91435,0.491705
min,3.07,1.0,1.0,4.07,0.034412,2.875,0.4
25%,13.3475,2.0,2.0,15.475,0.11436,5.8025,0.8625
50%,17.795,2.9,2.0,20.6,0.134026,7.255,1.1075
75%,24.1275,3.5625,3.0,27.7225,0.160704,9.39,1.5
max,50.81,10.0,6.0,60.81,0.415323,20.275,3.333333


### 1.2 Graphical interpretation 

For better readability, let's change the global Seaborn plot style as follows (see the [Seaborn aesthetics tutorial](https://seaborn.pydata.org/tutorial/aesthetics.html) and the [DataCamp Seaborn tutorial](https://www.datacamp.com/community/tutorials/seaborn-python-tutorial)):

In [23]:
sns.set(style="darkgrid") # apply the Seaborn defaults
sns.set_style("ticks") # then switch the axes style to "ticks" (this overrides "darkgrid")
# sns.set_context("paper") # small print, detailed scale

#### Histograms

To show how many observations fall into each category, we will use the `seaborn.countplot()` function ([Seaborn documentation](https://seaborn.pydata.org/generated/seaborn.countplot.html)).

Let us see on which days the restaurant is busiest, counted by the number of bills. The plot is further broken down by the size of the party.

In [None]:
sns.countplot(x="day", hue='size', data=tips)
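
The bar heights in such a plot are simple counts of observations per category. A minimal sketch of that counting, using only the five rows visible in the head and tail listings above (so the counts are not those of the full dataset):

```python
from collections import Counter

# (day, size) pairs copied from the head and tail listings (an excerpt only)
excerpt = [("Sun", 2), ("Sun", 3), ("Sun", 3), ("Sat", 2), ("Thur", 2)]

per_day = Counter(day for day, _ in excerpt)  # one bar per day
per_day_and_size = Counter(excerpt)           # one bar per day and party size

print(per_day)
print(per_day_and_size)
```

The second counter corresponds to adding `hue='size'`: each day is split into one bar per party size.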

Now, let us see the distribution of smokers vs non-smokers. To make it even more interesting, I will differentiate the customers by their gender.

In [None]:
sns.countplot(x="sex", hue='smoker', data=tips)

In [None]:
sns.countplot(x="smoker", hue='sex', data=tips)

#### Generating scatter plots  
This section is based on official Seaborn [tutorial](https://seaborn.pydata.org/tutorial/relational.html).

The default kind of plot (scatter) is used.  
The x-axis shows the total bill; the y-axis shows the amount of tip received.

In [None]:
sns.relplot(x="total_bill", y="tip", data=tips);

Although the plot is 2D, a third variable (3rd dimension) can be added.
It can be represented by colour (as shown below) or by the shape of the marker.

In [None]:
# Adding 3rd dimension, represented by colours.
sns.relplot(x="total_bill", y="tip", hue="smoker", data=tips);

In [None]:
# 3-dimensional plot, where the 3rd axis is represented by shape of marker
sns.relplot(x="total_bill", y="tip", style="smoker", data=tips);

It is also possible to show 4 different variables (4 dimensions), by differentiating the data points by colour and by the shape of the markers at the same time.

In [None]:
# 4-dimensional plot
sns.relplot(x="total_bill", y="tip", hue="smoker", style="time", data=tips);

For numeric variables, a sequential palette (shades of the same colour) applies. The default colour palette can be modified, as below.

In [None]:
sns.relplot(x="total_bill", y="tip", hue="size", palette="ch:r=-.5,l=.75", data=tips);

Yet another way of adding an extra dimension is by means of the marker size, as below.

In [None]:
sns.relplot(x="total_bill", y="tip", size="size", data=tips);

The default marker sizes can be altered (as well as combined with previously discussed means of representing variables).

In [None]:
sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), hue="time", palette="ch:r=-.5,l=.75", data=tips);

___
### 2. Regression

In this section I will look at the initial hypothesis, that is: **What is the relationship (if any) between total bill and amount of tips**.
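
One common way to put a number on the strength of a linear relationship is Pearson's correlation coefficient r, which ranges from -1 to +1 (values near 0 indicate little linear relationship). The notebook itself does not compute it; the sketch below is an illustration in plain Python on the five rows visible in the head and tail listings, so the result says nothing definitive about the full dataset. The function name `pearson_r` is made up for this example.

```python
import math

# Values copied from the head and tail listings (an excerpt only)
total_bill = [16.99, 10.34, 21.01, 17.82, 18.78]
tip = [1.01, 1.66, 3.50, 1.75, 3.00]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(total_bill, tip)
print(f"r = {r:.2f}")
```

With pandas, the equivalent for the whole dataset would be `tips["total_bill"].corr(tips["tip"])`.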

#### Total bill vs tips

Note: the `time` attribute (lunch or dinner) does not seem to affect the relationship between total bill and tip. In the scatter plots above that distinguish `time`, the _lunch_ and _dinner_ points appear to be evenly mixed. Therefore, **this category will not be considered** in further analysis. The column exclusion is adapted from a [Stack Overflow answer](https://stackoverflow.com/a/29319200).

In [None]:
tips = tips.drop(columns=["time"]) # returns a new DataFrame without the `time` column

Once again, let's have a look at a plot showing _total bill_ (x-axis) against _tip_ (y-axis). This time, however, I am going to use the Seaborn `jointplot()` function, which also shows the distribution of each variable.

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips)

#### Linear regression
This part is based on the [Seaborn tutorial](https://seaborn.pydata.org/tutorial/regression.html).

Let's add a regression line to the plot.

In [None]:
sns.regplot(x="total_bill", y="tip", data=tips);
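
To make the regression line less of a black box, the sketch below fits a straight line by ordinary least squares in plain Python. It assumes that the plotted line is such a least-squares fit, and it uses only the five rows visible in the head and tail listings, so the slope and intercept are not those of the full dataset. The function name `fit_line` and the bill of 20.0 are made up for this example.

```python
# Values copied from the head and tail listings (an excerpt only)
total_bill = [16.99, 10.34, 21.01, 17.82, 18.78]
tip = [1.01, 1.66, 3.50, 1.75, 3.00]

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(total_bill, tip)
predicted = intercept + slope * 20.0  # fitted tip for a hypothetical bill of 20.0

print(f"tip = {intercept:.3f} + {slope:.3f} * total_bill")
print(f"fitted tip for a bill of 20.0: {predicted:.2f}")
```

The slope reads as the additional tip per extra unit of bill, which is the quantity of interest for the hypothesis.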

More interesting plots can be produced by splitting the data further, for example by whether the client was a smoker or not.

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);

Below we see the relationship plotted separately for each day.

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="day", col="day", data=tips, height=4, aspect=.5)

___
### 3. Analysis

#### k-nearest neighbours algorithm
Based on the Programming for Data Analysis, GMIT, lecture videos and [Notebook](https://github.com/ianmcloughlin/jupyter-teaching-notebooks/raw/master/pandas-with-iris.ipynb).  
Other consulted references:  
[Machine Learning Basics with the K-Nearest Neighbors Algorithm](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)  
[K-Nearest Neighbors Algorithm Using Python](https://www.edureka.co/blog/k-nearest-neighbors-algorithm/)  
[K-Nearest Neighbors Algorithm in Python and Scikit-Learn](https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/)

_k-nearest neighbours_ is a supervised machine learning algorithm. It is used here to solve a classification problem: it produces a discrete output (that is, one class or another, possibly out of several, but nothing in between).
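
The idea can be sketched in a few lines of plain Python: measure the distance from a new point to every labelled point, take the k closest, and let them vote. This is an illustration only, not the scikit-learn implementation used below; it assumes Euclidean distance and an unweighted majority vote. The training points are the five rows visible in the head and tail listings, while the function name `knn_predict` and the query point are made up.

```python
import math
from collections import Counter

# ((total_bill, tip), sex) copied from the head and tail listings (an excerpt only)
training = [
    ((16.99, 1.01), "Female"),
    ((10.34, 1.66), "Male"),
    ((21.01, 3.50), "Male"),
    ((17.82, 1.75), "Male"),
    ((18.78, 3.00), "Female"),
]

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

query = (20.00, 3.20)  # hypothetical new customer: bill 20.00, tip 3.20
prediction = knn_predict(training, query, k=3)
print(prediction)
```

An odd k is a convenient choice for two classes, because the vote cannot end in a tie.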

#### Importing SciKit Learn Library

In [None]:
import sklearn.neighbors as nei
import sklearn.model_selection as mod

In [None]:
tips.head(2) # reminder of the columns; the shape is now (244, 10): 7 original + 4 derived - 1 dropped (`time`)

#### A glimpse into the plots  
The pair plot below shows the relationships between the numerical variables, coloured by sex. The relationship between tip, total bill and sex is deemed the most suitable for applying the algorithm. The other variables produce fuzzier plots (many overlapping data points).

In [None]:
sns.pairplot(tips, hue="sex") 

#### Inputs and Outputs

In [None]:
inputs = tips[['total_bill', 'tip']]
outputs = tips['sex']

#### Classifier

In [None]:
knn = nei.KNeighborsClassifier(n_neighbors=5) # will consider 5 nearest neighbours

#### Fit function

In [None]:
knn.fit(inputs, outputs)

#### Evaluate

In [None]:
(knn.predict(inputs) == outputs).sum() # Returns number of correctly recognised samples; total number of samples is 244

#### Training and testing data sub-sets
Splitting the dataset randomly into:  
1) training (75% of entire dataset size, i.e. 183), and  
2) testing (25%, i.e. 61)

In [None]:
inputs_train, inputs_test, outputs_train, outputs_test = mod.train_test_split(inputs, outputs, test_size=0.25)
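
The sizes quoted above follow directly from the arithmetic: 0.25 × 244 is exactly 61, leaving 183 observations for training. The sketch below reproduces a random split of row indices with the standard library only. It is an illustration of the idea rather than scikit-learn's own procedure, and the fixed seed is an assumption added to make the result repeatable.

```python
import math
import random

n_samples = 244
test_size = 0.25

n_test = math.ceil(test_size * n_samples)  # 0.25 * 244 is exactly 61
n_train = n_samples - n_test

indices = list(range(n_samples))
random.Random(0).shuffle(indices)  # fixed seed (an assumption) for repeatability

test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(n_train, n_test)
```

Because the split is random, the accuracy computed below changes from run to run; passing the `random_state` parameter to `train_test_split` fixes the shuffle and makes the result reproducible.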

In [None]:
knn = nei.KNeighborsClassifier(n_neighbors=5)
knn.fit(inputs_train, outputs_train)

In [None]:
# knn.predict(inputs_test) == outputs_test

In [None]:
answer = (knn.predict(inputs_test) == outputs_test).sum()
answer

Hence, the accuracy of the algorithm is the ratio of the number of correctly classified test samples to the total number of test samples.

In [None]:
print("%.02f" % ((answer / len(outputs_test)) * 100), "%") # accuracy on the test set, in percent