# Tips data set sandbox 

This is my sandbox for experimenting with the data set and for testing ideas. 
> Author: Andrzej Kocielski, 2019

---

## Importing packages

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

As the Tips data set comes with the Seaborn package, I am going to load it and assign it to the variable `tips`.

In [None]:
# Loading the data set
tips = sns.load_dataset("tips")

## A quick insight into the data set

In [None]:
# General shape of the data set (number of rows, i.e. observations, and columns, i.e. categories):
tips.shape

In [None]:
# First several rows of the data set:
tips.head()

In [None]:
# Last several rows of the data set:
tips.tail()

A description of the data in each column will go here: the data type (int, float, string, etc.), the number of distinct values and, where practical, a list of them.


...
...
...
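Until that description is written, here is a minimal sketch of how such a per-column summary could be produced. The miniature frame `sample` and the helper `summarise_columns` are my own hypothetical names with made-up values; only the column names follow the Tips data set.

```python
import pandas as pd

# Hypothetical miniature frame that mimics part of the Tips column layout;
# the values are made up for illustration only.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "sex": pd.Categorical(["Female", "Male", "Male", "Male"]),
    "smoker": pd.Categorical(["No", "No", "Yes", "No"]),
    "size": [2, 3, 3, 2],
})

def summarise_columns(frame):
    """Return one row per column: dtype, number of distinct values and,
    for categorical columns only, the list of categories."""
    rows = []
    for name in frame.columns:
        column = frame[name]
        is_categorical = column.dtype.name == "category"
        rows.append({
            "column": name,
            "dtype": str(column.dtype),
            "n_unique": column.nunique(),
            "values": list(column.cat.categories) if is_categorical else None,
        })
    return pd.DataFrame(rows).set_index("column")

summary = summarise_columns(sample)
print(summary)
```

Calling `summarise_columns(tips)` on the real data set should work the same way, assuming the categorical columns are stored with the `category` dtype.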

Let's look at the essential statistics of the entire data set, using the `.describe()` method.

In [None]:
tips.describe()

A narrative description of the above will go here. For example, note that only the numerical columns were considered.
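To illustrate the point about numerical columns, the sketch below runs `.describe()` on a tiny made-up frame (the name `sample` and its values are assumptions, not taken from the data set) and shows how the `include` parameter brings the categorical columns into the summary.

```python
import pandas as pd

# Hypothetical sample rows (made-up values) with one numeric and one categorical column.
sample = pd.DataFrame({
    "tip": [1.0, 2.0, 3.0, 4.0],
    "day": pd.Categorical(["Sat", "Sun", "Sun", "Sun"]),
})

numeric_summary = sample.describe()                        # default: numeric columns only
category_summary = sample.describe(include=["category"])  # count, unique, top, freq
full_summary = sample.describe(include="all")              # both kinds, NaN where not applicable

print(numeric_summary)
print(category_summary)
```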

In [None]:
type(tips)

In [None]:
tips.dtypes

In [None]:
# Convert the DataFrame to a NumPy array (despite its name, `df` is an ndarray, not a DataFrame)
df = np.array(tips)

In [None]:
type(df)

## Data set modeling

In [None]:
# add column: tph - tip per head
tph = tips["tip"]/tips["size"]
tips["tph"] = tph
print(tips["tph"].describe())
# check what is the mean 'tph' value for smokers and non-smokers
# example (from https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/):
# gapminder_2002 = gapminder[gapminder['year']==2002]
print("Tip per head: ", tips["tph"].mean())
# mean tip per head for smokers
#print((tips[tips["smoker"] == "Yes"]).head())
print("Tip per head (smokers): ", tips[tips["smoker"] == "Yes"]["tph"].mean())
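The comparison between smokers and non-smokers can also be made in a single call with `groupby`. This is only a sketch on made-up rows; `sample` and `tph_by_smoker` are hypothetical names, while the column names follow the notebook.

```python
import pandas as pd

# Hypothetical rows (made-up values) using the same column names as the notebook.
sample = pd.DataFrame({
    "tip": [2.0, 3.0, 4.0, 6.0],
    "size": [2, 3, 2, 2],
    "smoker": ["Yes", "Yes", "No", "No"],
})
sample["tph"] = sample["tip"] / sample["size"]  # tip per head

# Mean tip per head for each smoker category.
tph_by_smoker = sample.groupby("smoker")["tph"].mean()
print(tph_by_smoker)
```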

In [None]:
# add columns: sum - total bill plus tip; ratio - tip as a fraction of the sum
tips["sum"] = tips["total_bill"] + tips["tip"]
tips["ratio"] = tips["tip"] / tips["sum"]
print(tips["ratio"].describe())
sns.regplot(x="total_bill", y="ratio", data=tips, marker='.')
# source: https://python-graph-gallery.com/41-control-marker-features/

## k-nearest neighbors algorithm
Based on the Programming for Data Analysis, GMIT, lecture videos and [Notebook](https://github.com/ianmcloughlin/jupyter-teaching-notebooks/raw/master/pandas-with-iris.ipynb).  
Other references:  
[Machine Learning Basics with the K-Nearest Neighbors Algorithm](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)  
[K-Nearest Neighbors Algorithm Using Python](https://www.edureka.co/blog/k-nearest-neighbors-algorithm/)

### Importing SciKit Learn Library

In [None]:
import sklearn.neighbors as nei
import sklearn.model_selection as mod

In [None]:
tips.head(2) # reminder: the original data set has shape (244, 7), before any columns were added

#### A glimpse into the plot
The plot below shows the relationship between tip size, total bill and sex; I consider this combination the most suitable for applying the algorithm. The other variables produce fuzzier plots, with many overlapping data points.

In [None]:
sns.pairplot(tips, hue="sex") 

#### Inputs and Outputs

In [None]:
inputs = tips[['total_bill', 'tip']]
outputs = tips['sex']

#### Classifier

In [None]:
knn = nei.KNeighborsClassifier(n_neighbors=5) # will consider 5 nearest neighbours
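To make the idea of "nearest neighbours" concrete, here is a minimal from-scratch sketch of the voting step. It is not how scikit-learn implements the classifier internally; `knn_predict` is a hypothetical helper of my own, and the points and labels are made up. It assumes Euclidean distance and an unweighted majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points
    (Euclidean distance). Hypothetical helper, not part of scikit-learn."""
    train_points = np.asarray(train_points, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(train_points - query, axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up (total_bill, tip) pairs with made-up labels, for illustration only.
points = [(10, 1.5), (12, 2.0), (11, 1.8), (30, 5.0), (32, 5.5), (35, 6.0)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(13, 2.1), k=3))
```

The query point sits next to the three "A" points, so all three votes go to that label.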

#### Fit function

In [None]:
knn.fit(inputs, outputs)

#### Evaluate

In [None]:
(knn.predict(inputs) == outputs).sum() # Returns number of correctly recognised samples; total number of samples is 244

#### Training and testing data sub-sets
Splitting the dataset randomly into:  
1) training (75% of entire dataset size, i.e. 183), and  
2) testing (25%, i.e. 61)

In [None]:
inputs_train, inputs_test, outputs_train, outputs_test = mod.train_test_split(inputs, outputs, test_size=0.25)
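The split sizes quoted above can be checked on a synthetic stand-in with 244 rows. The sketch also passes `random_state` and `stratify`, two optional `train_test_split` parameters the notebook itself does not use; the `*_demo` names and the random values are assumptions for illustration.

```python
import numpy as np
import sklearn.model_selection as mod

# Synthetic stand-in with the same number of rows as the Tips data set (244);
# the values are random and only serve to show the split sizes.
rng = np.random.default_rng(0)
inputs_demo = rng.random((244, 2))
outputs_demo = rng.choice(["Male", "Female"], size=244)

in_train, in_test, out_train, out_test = mod.train_test_split(
    inputs_demo, outputs_demo,
    test_size=0.25,
    random_state=42,        # fixes the shuffle so the split is repeatable
    stratify=outputs_demo,  # keeps the class proportions similar in both parts
)
print(len(in_train), len(in_test))
```

Without `random_state`, every run draws a different split, so the accuracy figures further down will vary from run to run.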

In [None]:
knn = nei.KNeighborsClassifier(n_neighbors=5)
knn.fit(inputs_train, outputs_train)

In [None]:
# knn.predict(inputs_test) == outputs_test

In [None]:
answer = (knn.predict(inputs_test) == outputs_test).sum()
answer

#### Accuracy

Ratio of correctly recognised samples to the total number of test samples, expressed as a percentage.

In [None]:
(answer / len(outputs_test)) * 100  # divide by the actual test-set size rather than a hard-coded 61
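As a small worked example of this ratio, the sketch below computes the percentage by hand on made-up labels and compares it with `accuracy_score` from `sklearn.metrics`. The arrays `actual` and `predicted` are assumptions and do not come from the classifier above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
actual = np.array(["Male", "Female", "Male", "Male"])
predicted = np.array(["Male", "Male", "Male", "Female"])

correct = (predicted == actual).sum()                # number of matching labels
accuracy_percent = correct / len(actual) * 100       # ratio computed by hand
accuracy_sklearn = accuracy_score(actual, predicted) * 100

print(accuracy_percent, accuracy_sklearn)
```

Both routes should give the same figure; a fitted classifier also offers `knn.score(inputs_test, outputs_test)`, which returns the same ratio as a fraction.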

## Histograms

In [None]:
#d = tips.loc[:, 'day']
#sns.distplot(tips['day'])
sns.countplot(x="day", hue='smoker', data=tips)

In [None]:
sns.countplot(x="sex", hue='smoker', data=tips)

### 2. Regression

In this section I will look at the initial hypothesis, that is: **What is the relationship (if any) between total bill and amount of tips**.
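One simple way to put a number on such a relationship is a least-squares straight line together with the Pearson correlation coefficient. The sketch below uses made-up amounts, not the Tips data, and the variable names are my own.

```python
import numpy as np

# Made-up bill and tip amounts, for illustration only.
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = np.array([1.5, 3.1, 4.4, 6.2, 7.3])

slope, intercept = np.polyfit(total_bill, tip, deg=1)  # least-squares straight line
r = np.corrcoef(total_bill, tip)[0, 1]                 # Pearson correlation coefficient

print(f"tip ~ {slope:.3f} * total_bill + {intercept:.3f}, r = {r:.3f}")
```

A slope well above zero with `r` close to 1 would indicate that the tip grows roughly in proportion to the bill.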

#### Total bill vs tips

Note: the `time` category (lunch or dinner) does not appear to affect the relationship between total bill and tip. In the plot below, both _lunch_ and _dinner_ observations seem to be uniformly distributed. Therefore, **this category will not be considered** in further analysis.
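This impression could also be checked numerically by fitting a separate straight line for each time of day and comparing the slopes. The sketch below does so on made-up rows; `sample` and `slopes` are hypothetical names, and the result says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names follow the Tips data set.
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0],
    "tip": [1.5, 3.0, 4.5, 1.6, 3.1, 4.6],
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
})

# Fit a separate least-squares line per time of day and keep only the slope.
slopes = {
    time: np.polyfit(group["total_bill"], group["tip"], deg=1)[0]
    for time, group in sample.groupby("time")
}
print(slopes)
```

Similar slopes for both groups would be consistent with leaving `time` out of the analysis.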

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips, kind="kde");

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="day", col="day", data=tips, height=4, aspect=.5)

In [None]:
sns.lmplot(x="size", y="tip", hue="day", col="day", data=tips, height=4, aspect=.5, x_jitter=0)

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="size", col="size", data=tips, height=4, aspect=.5, x_jitter=0)
